# Vector casting for noise reduction

## Abstract

We report a new method for the reduction of noise in spectra. The method is based on casting vectors from one data point to the following data points of the noisy spectrum. The noise-reduced spectrum is computed from the cast vectors within a margin that is identified by an envelope-finder algorithm. We compare the presented method with the Savitzky–Golay and the wavelet transform approaches for noise reduction using simulated Raman spectra with signal-to-noise ratios between 1 and 25 dB and experimentally acquired Raman spectra. The presented method performs well compared with the Savitzky–Golay and the wavelet-based denoising methods, especially at small signal-to-noise ratios, and furthermore requires a minimum of human input.

## 1 INTRODUCTION

Spectral analysis involves processing of spectroscopic data or patterns for quantification and/or identification of samples or processes.1-7 The spectroscopic raw data usually contain contributions originating from the desired signal itself, from the noise, and from the background or interferences from undesired signals.8 One of the first processing steps (often the first one) of spectroscopic raw data is the elimination of the noise or the reduction of the noise level. This is especially challenging when the signal-to-noise ratio (SNR) is small, meaning that noise and signal cannot easily be differentiated based solely on intensity or peak height. Existing noise reduction algorithms can reduce the noise level but, especially at small SNRs, can also manipulate and thereby falsify the desired signal contribution.

The origin of small SNRs can be manifold. Small available or realizable excitation powers9 in combination with small interaction probabilities (cross sections) between the excitation and the matter under investigation often result in small SNRs.10-12 Also, a less efficient signal detection or a short acceptable signal integration time can result in small SNRs.13, 14 Conversely, even a long integration of low signal levels can lead to small SNRs, as the thermal background and its thermal noise are accumulated together with the signal. The ineluctable contamination of spectroscopic data with noise therefore limits the performance of spectroscopic techniques.15-17

Many postprocessing techniques have been used to denoise spectroscopic data, such as the Savitzky–Golay (SG) filter,18, 19 smoothing based on the wavelet transform method,20-22 the “perfect smoother” method,23 the finite impulse response (FIR) smoother,24 and smoothing based on the “Wiener estimation.”25 The SG smoother is the most popular and frequently used method to denoise spectroscopic data.8, 25, 26 It is based on the least-squares fitting of polynomials of specified order to connected data points contained in a moving window of specified size. The larger the window, meaning the more data points are considered for the polynomial fit, the more the raw spectral data are smoothed.24 As is the case for all smoothing algorithms, not only the noise but also sharp signal features can potentially be smoothed out. Thus, a compromise between smoothing out the noise and losing spectral information needs to be made by carefully adjusting the window size and polynomial order of the filter.

Smoothing based on wavelets is simple to use and adapts well to the form of the signal being smoothed.22, 27 Here, the noisy raw spectral data are transformed into a wavelet domain by decomposing them into a set of orthonormal wavelet basis functions. The major signal trends of the spectrum are assignable to large wavelet coefficients, whereas the noise is assignable to only small coefficients.20, 28 The noise is then suppressed by thresholding the wavelet coefficients, and the coefficients that were not suppressed are transformed back to obtain the noise-reduced spectrum. However, the selection of the wavelet basis functions and the threshold value have a great impact on the performance of the method and are strongly problem dependent.25 Moreover, applying this approach to spectra with small SNR can also reduce, remove, or manipulate signal contributions.29

Člupek et al.24 tested the FIR smoother to suppress noise in spectra. They reported that this technique offers better preservation of the real signal contribution compared with the SG smoother. However, it is demanding in computation.24, 25 Using the “Wiener estimation,” Chen et al.25 developed a method on the basis of spectral reconstruction to recover spectra with small SNR. In comparison with other denoising methods such as the SG method, the FIR smoother, and the wavelet transform method, their method showed excellent performance. However, a calibration data set that relies on input spectra with large SNRs is required for the successful denoising of spectra with small SNR.

We here introduce a vector casting method for noise reduction. We compared its performance with the frequently used SG and wavelets denoising methods. The performance comparison considers the extractability of the real signal contribution. To the best of our knowledge, vector casting has never been applied to denoise spectra.

## 2 MATERIAL AND METHODS

### 2.1 Samples

We used two sets of samples to validate the vector casting method: simulated Raman spectra and experimentally acquired Raman spectra. It should be underlined that the vector casting method is not limited to the treatment of Raman spectra. Therefore, the descriptions in the following sections are given in a general context and can be transferred to any kind of spectral data. We only consider contributions to the spectroscopic data coming from the real signal and from the noise. We neglect potential contributions of a background, as the background is usually subtracted from the spectroscopic data using baseline correction methods.11, 30, 31 These baseline correction methods can still be applied after the noise reduction.

The simulated spectra *R*(*x*_{i}) are composed of a signal spectrum *S*_{sig}(*x*_{i}) and a noise spectrum *N*_{sim}(*x*_{i}), where *x*_{i} is the variable (Raman shift in the case of Raman spectra, wavelength in the case of fluorescence spectra, wavenumber in the case of absorption spectra, temperature in the case of differential scanning calorimetry spectra, theta in the case of X-ray diffraction spectra, etc.). Due to Doppler broadening, collisional broadening, and optical effects, spectral lines, peaks, or bands are never strictly monochromatic but feature a distribution around their centre.32 Spectral signal profiles can be fitted by Lorentzian, Gaussian, or Voigt profiles.33

The signal spectrum was simulated as the sum of seven Lorentzian peaks with amplitudes *A*_{n}, widths *σ*_{n}, and centres at different variables *x*_{n}. We chose Lorentzian peaks as they best reflect theoretical Raman signal lines.34 The usage of other peak shapes or a different number of peaks would not influence the vector casting method.

Figure 1 shows the simulated spectrum *S*_{sig}(*x*_{i}). The number of Lorentzian peaks (seven) and their parameters were chosen to imitate overlapping peaks (*x*_{n} = 840), narrow peaks (*x*_{n} = 848, *x*_{n} = 900), small peaks (*x*_{n} = 830), and broad peaks (*x*_{n} = 820, *x*_{n} = 860).
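As an illustration, a signal spectrum of this kind could be generated along the following lines. This is a minimal sketch and not the implementation used in this work. The function names, the amplitudes and widths, the sampling grid, and the parametrization of the Lorentzian (peak height and half width at half maximum) are assumptions. Only the centres 820, 830, 848, 860, and 900 follow the text, and splitting the overlapping feature near 840 into two peaks at 838 and 842 is likewise assumed.

```python
def lorentzian(x, amplitude, center, width):
    """Lorentzian line with peak height `amplitude` and half width at half maximum `width`."""
    return amplitude * width**2 / ((x - center) ** 2 + width**2)


def simulate_signal(xs, peaks):
    """Sum of Lorentzian peaks evaluated at every variable x_i in `xs`."""
    return [sum(lorentzian(x, a, c, w) for a, c, w in peaks) for x in xs]


# Hypothetical (amplitude, centre, width) triples; see the assumptions above.
PEAKS = [
    (40.0, 820.0, 6.0),  # broad peak
    (8.0, 830.0, 1.5),   # small peak
    (60.0, 838.0, 2.0),  # overlapping pair around 840 (assumed split)
    (55.0, 842.0, 2.0),
    (70.0, 848.0, 0.8),  # narrow peak
    (45.0, 860.0, 6.0),  # broad peak
    (80.0, 900.0, 0.8),  # narrow peak
]

xs = [800.0 + 0.1 * i for i in range(1401)]  # assumed grid: 800 to 940, step 0.1
s_sig = simulate_signal(xs, PEAKS)
```

Other peak shapes (Gaussian or Voigt) could be substituted for `lorentzian` without changing the rest of the sketch.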

The simulated noise spectrum comprises shot noise *n*_{ph} (also referred to as photon noise), thermal noise *n*_{th}, and readout noise *n*_{rd}.35, 37

Here, *e*(*x*_{i}) is Gaussian noise having a standard deviation of one and a mean of zero.36, 38 The shot noise depends on the signal *S*_{sig}(*x*_{i}) and is therefore a function of the variable *x*_{i}. Variables *x*_{i} with a large signal feature a large shot noise, whereas variables without signal feature no shot noise.

The thermal noise depends on the thermal background *B*. The thermal background is assumed to be constant over *x*_{i}. The readout noise is likewise described by a constant *c* over *x*_{i}.

The SNR of the simulated spectra can be adjusted by varying the thermal background *B* in Equation 5, the constant *c* in Equation 6, or by scaling the signal *S*_{sig}(*x*_{i}) in Equation 4.
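The noise model can be sketched as follows. Because Equations 4 to 6 are not reproduced here, the functional forms below are assumptions: shot noise taken as the square root of the signal times *e*(*x*_{i}), thermal noise as the square root of *B* times *e*(*x*_{i}), and readout noise as *c* times *e*(*x*_{i}), each with an independent Gaussian draw. The function names, the toy signal, and the SNR definition (ratio of mean powers in dB) are also illustrative assumptions.

```python
import math
import random


def simulate_noise(s_sig, background, readout, rng):
    """Return an assumed N_sim(x_i) = n_ph + n_th + n_rd for every x_i."""
    noise = []
    for s in s_sig:
        n_ph = math.sqrt(max(s, 0.0)) * rng.gauss(0.0, 1.0)  # shot noise grows with the signal
        n_th = math.sqrt(background) * rng.gauss(0.0, 1.0)   # thermal noise, constant level B
        n_rd = readout * rng.gauss(0.0, 1.0)                 # readout noise, constant level c
        noise.append(n_ph + n_th + n_rd)
    return noise


def snr_db(signal, noise):
    """SNR in dB from the mean powers of signal and noise (one common definition)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
s_sig = [100.0 if 40 <= i < 60 else 0.0 for i in range(100)]  # toy signal
n_sim = simulate_noise(s_sig, background=25.0, readout=2.0, rng=rng)
r_raw = [s + n for s, n in zip(s_sig, n_sim)]  # noisy spectrum R(x_i)
```

Under these assumptions, raising `background` or `readout`, or scaling `s_sig` down, lowers the SNR of `r_raw`.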

For the acquisition of the experimental Raman spectra,11 we used as excitation source a diode laser (Toptica DLpro) emitting 785-nm radiation and a spectrometer (Ventana from Ocean Optics) for signal detection between 800 and 940 nm, which corresponds to Raman shifts between 200 and 2,000 cm^{−1}. With an excitation laser power of 10 mW, we collected Raman spectra of liquid ethanol at various integration times between 20 and 1,000 ms. From the different integration times, experimental spectra *R*(*x*_{i}) with various SNRs resulted. Also, the experimentally acquired spectra are composed of a signal and a noise contribution. Additionally, a quasi-noise-free (low-noise) Raman spectrum of ethanol was acquired with an excitation power of 300 mW and 1,000 ms of signal integration time. This quasi-noise-free spectrum can be considered as a reference spectrum or as a quasi-pure signal spectrum *S*_{sig}(*x*_{i}). We chose ethanol for the acquisition of the experimental spectra, as the Raman spectrum of ethanol also shows narrow, broad, and overlapping peaks.

The noise-reduced spectrum *r*(*x*_{i}) results after noise reduction from either the simulated or the experimental spectra *R*(*x*_{i}) with either the vector casting method, the envelope-finder method, the SG method, or the wavelet transform method. In order to compare the performance of the different noise reduction methods, we quantified the deviation between the real signal spectrum *S*_{sig}(*x*_{i}) and the noise-reduced spectra *r*(*x*_{i}) derived with the different methods according to Equations 8a to 8d as proposed by Barton et al.35 These equations (a) quantify the mean improvement of the signal quality across the entire spectral range (Equation 8a), (b) monitor whether or not the algorithm interacted negatively with the spectral peaks (Equation 8c), and (c) quantify the signal-to-noise improvement (Equation 8d) relative to the SNR of the original noisy spectrum *R*(*x*_{i}).

## 3 RESULTS AND DISCUSSION

The vector casting method requires preprocessing of the raw spectra. In the first step, the top and bottom envelopes of the noisy spectra have to be identified using an envelope-finder algorithm that is described in detail below. Afterwards, the vectors are cast within the margin of the previously identified envelopes to derive the noise-reduced spectrum *r*(*x*_{i}). We want to emphasize that the envelope-finder algorithm alone already provides a significant noise reduction.

### 3.1 Envelope-finder algorithm

The envelope-finder algorithm identifies a top *E*_{top}(*x*_{i}) and a bottom *E*_{bottom}(*x*_{i}) envelope of the noisy spectrum *R*(*x*_{i}). In the first level (Level 1), all data points of *R*(*x*_{i}) are tested for being a peak or a valley, irrespective of whether the peak is due to noise or due to a real signal. For this purpose, a forward and a backward differentiation is made.

A data point *R*(*x*_{i}) is classified as a peak *p*(*x*_{i}) or a valley *v*(*x*_{i}) if its forward and backward differentiation are both positive or both negative, respectively.
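The Level 1 classification can be sketched as follows. This is an illustrative sketch rather than the original implementation. The function name is hypothetical, and two conventions are assumed: both differences are taken as the value of the data point minus the value of its neighbour, so that two positive differences indicate a peak, and the first and last data points are left unclassified.

```python
def find_peaks_and_valleys(values):
    """Return (peak_indices, valley_indices) from forward and backward differences."""
    peaks, valleys = [], []
    for i in range(1, len(values) - 1):
        backward = values[i] - values[i - 1]  # difference to the left neighbour
        forward = values[i] - values[i + 1]   # difference to the right neighbour
        if backward > 0 and forward > 0:
            peaks.append(i)                   # higher than both neighbours
        elif backward < 0 and forward < 0:
            valleys.append(i)                 # lower than both neighbours
    return peaks, valleys


r_raw = [1.0, 3.0, 2.0, 2.5, 0.5, 1.5, 1.0]
p_idx, v_idx = find_peaks_and_valleys(r_raw)
print(p_idx, v_idx)  # [1, 3, 5] [2, 4]
```

Data points on a monotonic slope or on a plateau are assigned to neither list, which is why the envelopes later have to be interpolated at those variables.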

In the second level (Level 2), the data points of *p*(*x*_{i}) and *v*(*x*_{i}) are themselves searched for peaks and valleys by forward and backward differentiation. Figure 2 middle shows these computed peak and valley data points of the peaks and valleys obtained in Level 1. The notations *pp*(*x*_{i}) and *vp*(*x*_{i}) indicate the peaks of *p*(*x*_{i}) and the valleys of *p*(*x*_{i}), respectively. Similarly, *pv*(*x*_{i}) and *vv*(*x*_{i}) mean, respectively, the peaks of *v*(*x*_{i}) and the valleys of *v*(*x*_{i}). Computing the peaks and valleys recursively, in a third level, the peaks and valleys of *pp*(*x*_{i}), *vp*(*x*_{i}), *pv*(*x*_{i}), and *vv*(*x*_{i}) can be computed by forward and backward differentiation as well. Figure 2 bottom shows the Level 3 valleys of the Level 2 valleys of the Level 1 peaks, which are referred to as *vvp*(*x*_{i}). It can also be seen that the first two red diamonds (*vvp*(*x*_{i})) in Figure 2 bottom can be considered as the left and right border of a signal peak.
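The recursive levels can be sketched by applying the same forward and backward differentiation to a shrinking set of indices. The helper names are hypothetical, and the sign and end-point conventions are assumptions. A name such as `"vvp"` is read from right to left, as in the text: Level 1 peaks, then their valleys, then the valleys of those.

```python
def local_extrema(indices, values):
    """Split `indices` into (peaks, valleys) of the sub-sequence values[indices]."""
    peaks, valleys = [], []
    for k in range(1, len(indices) - 1):
        here = values[indices[k]]
        backward = here - values[indices[k - 1]]
        forward = here - values[indices[k + 1]]
        if backward > 0 and forward > 0:
            peaks.append(indices[k])
        elif backward < 0 and forward < 0:
            valleys.append(indices[k])
    return peaks, valleys


def multilevel_extrema(values, name):
    """Indices of the extrema called `name`, e.g. "p", "vp", or "vvp"."""
    indices = list(range(len(values)))
    for step in reversed(name):  # "vvp" means p first, then v, then v
        peaks, valleys = local_extrema(indices, values)
        indices = peaks if step == "p" else valleys
    return indices


values = [0, 5, 1, 3, 0, 6, 0.5, 1, 0.2, 7, 0, 2, 1, 8, 0]
print(multilevel_extrema(values, "p"))    # [1, 3, 5, 7, 9, 11, 13]
print(multilevel_extrema(values, "vp"))   # [3, 7, 11]
print(multilevel_extrema(values, "vvp"))  # [7]
```

Each level returns indices into the original spectrum, so the *vvp*(*x*_{i}) data points can be used directly as window borders.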

The *vvp*(*x*_{i}) data points (red diamonds) are suitable as indicators for the left and the right border of potential peaks. Figure 3b,c shows the criteria used to decide whether or not a signal peak is situated between two *vvp*(*x*_{i}) data points, here the left border (lb) and the right border (rb) diamond. The procedure for the classification of peak and peak-free regions is as follows:

- The maximum difference *ΔI*_{max} within the window *w* bounded by two adjacent *vvp*(*x*_{i}) data points (lb and rb) is computed. The window size *w* is defined automatically by the *vvp*(*x*_{i}) data points and does not have to be chosen by the user. In Figure 3b, the maximum difference *ΔI*_{max} is shown as a green line; it is defined as the maximum deviation between consecutive data points within the window.

- The height *P*_{h} of a potential signal peak, shown as a blue line in Figure 3b, is determined with respect to the two *vvp*(*x*_{i}) data points.

- The window is classified as a peak region *w*^{p} if the slopes of *p*(*x*_{i}) and *v*(*x*_{i}) left of the maximum of the potential peak are both positive and right of the potential peak are both negative.

- Otherwise, the window is classified as a peak-free region *w*^{pf}.
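The two quantities used in this procedure can be sketched as follows. Only the definition of *ΔI*_{max} (maximum deviation between consecutive data points within the window) is taken from the text. The reference level used for *P*_{h} (mean of the two border values) and the decision rule (*P*_{h} larger than *ΔI*_{max}) are assumptions made for illustration, and the slope test on *p*(*x*_{i}) and *v*(*x*_{i}) is omitted. All function names are hypothetical.

```python
def max_consecutive_difference(window):
    """Delta I_max: largest absolute difference between consecutive data points."""
    return max(abs(b - a) for a, b in zip(window, window[1:]))


def peak_height(window):
    """Assumed P_h: maximum of the window above the mean of its two border values."""
    baseline = 0.5 * (window[0] + window[-1])
    return max(window) - baseline


def is_peak_region(window):
    """Assumed rule: a peak must rise higher than the largest point-to-point jump."""
    return peak_height(window) > max_consecutive_difference(window)


noise_only = [1.0, 3.0, 0.5, 2.5, 1.0]                    # values between lb and rb
with_peak = [1.0, 2.0, 4.0, 7.0, 9.0, 7.0, 4.0, 2.0, 1.0]
print(is_peak_region(noise_only), is_peak_region(with_peak))  # False True
```

The idea behind such a rule is that pure noise produces large point-to-point jumps but no sustained rise, whereas a signal peak rises over several consecutive data points.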

The peak *p*(*x*_{i}) and valley *v*(*x*_{i}) data points (Figure 2 top) are a good first estimate of the top and bottom envelopes of the noisy signal *R*(*x*_{i}). For smoothing these envelopes, we considered a moving window *w*^{m} consisting of 2*n*+1 peak (*P*_{i}) or valley (*V*_{i}) data points. The window size has to be set manually; its influence on the noise reduction performance is discussed later. We denote by *P*_{i} the data points of *p*(*x*_{i}) confined within *w*^{m} and by *C*_{i}^{top} the one of these data points that is located at the centre of *w*^{m}. Similarly, *V*_{i} denotes the data points of *v*(*x*_{i}) included in *w*^{m}, and *C*_{i}^{bottom} is the central data point of *V*_{i}. The central peak or valley data point within the moving window *w*^{m} is updated by a new value. The updating procedure depends on whether *C*_{i}^{top} or *C*_{i}^{bottom} lies within a peak region *w*^{p} or within a peak-free region *w*^{pf}. If they are part of a peak-free region, they are substituted by the average over the moving window *w*^{m}.

If they are part of a peak region *w*^{p}, they are substituted by the average of only those data points within *w*^{m} that deviate from the central data point by no more than *ΔI*_{max}, where *ΔI*_{max} is computed according to Equation 13.

Here, *j* is the number of peak (*P*_{i}) or valley (*V*_{i}) data points contained within the moving window *w*^{m} for which |*P*_{i} − *C*_{i}^{top}| ≤ *ΔI*_{max} or |*V*_{i} − *C*_{i}^{bottom}| ≤ *ΔI*_{max}.

In Equation 17, the central peak *C*_{i}^{top} or valley *C*_{i}^{bottom} data points are updated by averaging all the peak *P*_{i} or valley *V*_{i} data points within *w*^{m}, respectively. In contrast, Equation 18 updates the central peak *C*_{i}^{top} or valley *C*_{i}^{bottom} data points by averaging only those peak *P*_{i} or valley *V*_{i} data points that fulfill the condition |*P*_{i} − *C*_{i}^{top}| ≤ *ΔI*_{max} or |*V*_{i} − *C*_{i}^{bottom}| ≤ *ΔI*_{max}. This condition ensures that only peak *P*_{i} or valley *V*_{i} data points that are not far from *C*_{i}^{top} or *C*_{i}^{bottom} are considered to update *C*_{i}^{top} or *C*_{i}^{bottom}, respectively.
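The two update rules can be sketched in a single function. This is a hedged illustration of the averaging described for Equations 17 and 18, not the original code. The function name and the use of `None` to select the peak-free rule are assumptions.

```python
def update_central_point(window, delta_i_max=None):
    """Return the updated central value of a window of 2n+1 peak (or valley) values.

    delta_i_max=None: average all points (peak-free region, as for Equation 17).
    Otherwise: average only the j points that lie within delta_i_max of the
    central point (peak region, as for Equation 18).
    """
    center = window[len(window) // 2]
    if delta_i_max is None:
        selected = window
    else:
        selected = [p for p in window if abs(p - center) <= delta_i_max]
    return sum(selected) / len(selected)  # the centre always qualifies, so j >= 1


peak_free = [1.0, 2.0, 9.0, 2.0, 1.0]
print(update_central_point(peak_free))                       # 3.0
print(update_central_point(peak_free, delta_i_max=1.0))      # 9.0
```

In the second call, the central value is kept because no neighbour lies within *ΔI*_{max} of it, which illustrates how the selective average avoids flattening a peak apex.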

The noise-reduced top envelope *E*_{top}(*x*_{i}) and bottom envelope *E*_{bottom}(*x*_{i}) are generated by linear interpolation between all updated peak *C*_{i}^{top} and valley *C*_{i}^{bottom} data points for all variables *x*_{i} that, according to Equations 11 and 12, have been assigned neither to a valley nor to a peak point. Figure 4 shows both of them computed for a moving window with the size *n* = 9. Scheme 1 presents the flow chart of the envelope-finder algorithm, where *m* is the total number of peak/valley data points of the noisy spectrum.

The noisy spectral data *R*(*x*_{i}) are shown as a grey line. The real signal contribution *S*_{sig}(*x*_{i}) behind the spectral data is shown as a dashed line. The solid black line shows the mean of the top and bottom envelopes, *E*_{mean}(*x*_{i}). Apparently, *E*_{mean}(*x*_{i}) is already close to *S*_{sig}(*x*_{i}). This indicates that the envelope-finder algorithm alone has great potential for noise reduction. Nevertheless, the noise level can be reduced even further if, in a next step, vectors are cast within the margins of the top and the bottom envelopes.
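The interpolation of the envelopes and the computation of their mean can be sketched as follows. The function names are hypothetical, the spectrum is assumed to be sampled on an evenly spaced index grid, and holding the first and last updated value constant towards the edges of the spectrum is an assumption, since the treatment of the edges is not specified here.

```python
def interpolate_envelope(n_points, knot_indices, knot_values):
    """Linearly interpolate an envelope at indices 0..n_points-1 from updated knots."""
    envelope = []
    j = 0
    for i in range(n_points):
        if i <= knot_indices[0]:
            envelope.append(knot_values[0])   # assumed: hold value left of first knot
        elif i >= knot_indices[-1]:
            envelope.append(knot_values[-1])  # assumed: hold value right of last knot
        else:
            while knot_indices[j + 1] < i:    # advance to the segment containing i
                j += 1
            i0, i1 = knot_indices[j], knot_indices[j + 1]
            t = (i - i0) / (i1 - i0)
            envelope.append(knot_values[j] + t * (knot_values[j + 1] - knot_values[j]))
    return envelope


def mean_envelope(e_top, e_bottom):
    """E_mean(x_i) as the point-wise mean of the top and bottom envelopes."""
    return [0.5 * (t + b) for t, b in zip(e_top, e_bottom)]


e_top = interpolate_envelope(7, [1, 3, 5], [2.0, 4.0, 0.0])
print(e_top)  # [2.0, 2.0, 3.0, 4.0, 2.0, 0.0, 0.0]
```

The same function would be called once with the updated peak data points and once with the updated valley data points.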

### 3.2 Vector casting based smoothing

First, vectors are cast from an already noise-reduced data point *r*(*x*_{k}) to subsequent not yet noise-reduced data points *R*(*x*_{i > k}), as illustrated in Figure 5. The starting noise-reduced point *r*(*x*_{k}) is located at *x*_{i = 0}. From this already noise-reduced point *r*(*x*_{k}), vectors are cast to the data points *R*(*x*_{i}) with *i* > *k*.

Second, all vectors that cross either the top or the bottom envelope are deleted from the set of vectors. Deleted vectors are highlighted in red in Figure 5, whereas remaining vectors are highlighted in green.

Third, the remaining vectors are averaged, where *l* is the number of remaining vectors. At the variable *x*_{i = k+1}, which is situated one increment to the right of the already noise-reduced data point at *x*_{k}, the new noise-reduced value *r*(*x*_{k+1}) is taken from the averaged vector between *x*_{i = k} and *x*_{i = k+1}. This procedure is repeated until all raw data points *R*(*x*_{i}) have been replaced by noise-reduced data points *r*(*x*_{i}).

Figure 5a,b shows the noise reduction due to the vector casting method in a spectral region that does not contain a signal peak and in a spectral region that does contain a signal peak respectively. Figure 5a (zoomed plot) shows the details of the computation of the next noise-reduced data point starting from the previous one and Figure 5c shows as solid black line the computed noise-reduced spectrum *r*(*x*_{i}).

In Figure 5, vectors are not cast from the previously noise-reduced data point to all of the subsequent data points but only to subsequent data points contained in a certain window *w*^{vector}, which reduces the computational demand significantly. In Figure 5, the size of the window *w*^{vector} is *M* = 150, meaning that vectors are cast to the subsequent 150 data points. Scheme 2 shows the flow chart of the vector casting method.
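The casting procedure can be sketched as follows. This is a simplified illustration under several assumptions and not the original implementation: the spectrum is sampled on an evenly spaced index grid, the starting point is taken as the mean of the two envelopes at the first data point, the remaining vectors are combined by averaging their slopes (averaging the vector components instead would weight them slightly differently), and the mean of the envelopes is used as a fallback if no vector remains. The function name is hypothetical.

```python
def vector_casting(raw, e_top, e_bottom, m=150):
    """Noise-reduce `raw` by casting vectors within the envelopes (illustrative sketch)."""
    n = len(raw)
    reduced = [0.5 * (e_top[0] + e_bottom[0])]  # assumed starting point r(x_0)
    for k in range(n - 1):
        start = reduced[k]
        slopes = []
        # Cast vectors to at most m subsequent raw data points.
        for i in range(k + 1, min(k + m, n - 1) + 1):
            slope = (raw[i] - start) / (i - k)
            # Keep the vector only if it stays between the envelopes on its way.
            inside = all(
                e_bottom[j] <= start + slope * (j - k) <= e_top[j]
                for j in range(k + 1, i + 1)
            )
            if inside:
                slopes.append(slope)
        if slopes:
            # Step one increment along the mean of the l remaining vectors.
            reduced.append(start + sum(slopes) / len(slopes))
        else:
            # Assumed fallback when every vector crosses an envelope.
            reduced.append(0.5 * (e_top[k + 1] + e_bottom[k + 1]))
    return reduced


raw = [0.0, 2.0, -1.0, 1.5, 0.2, 0.9]
r = vector_casting(raw, e_top=[1.0] * 6, e_bottom=[-1.0] * 6, m=3)
```

Because every remaining vector lies between the envelopes at *x*_{k+1}, so does their mean, which keeps each new noise-reduced data point inside the margin. The cost of this naive version grows roughly with the square of `m` per data point, which is consistent with the motivation for limiting the casting window.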

### 3.3 Parameter tuning effect

The algorithms outlined in the previous sections require two input parameters: the size *n* of the moving window *w*^{m} and the size *M* of the window *w*^{vector} in which vectors are cast. To investigate the effect of these parameters, we applied the vector casting method at different values of *n* and *M*. Figure 6 shows the results for *n* = 1, 5, 9, and 11 with *M* = 150.

Increasing *n* from *n* = 1 to *n* = 5 improves the smoothing of the noisy signal, especially for small SNR (peak-free region in Figure 6). However, the vector casting method is rather insensitive to a further increase of the size of the smoothing window from *n* = 9 to *n* = 11. The peak regions are also less sensitive to changes in *n* than the peak-free regions because the data points for averaging are determined automatically (Equation 18), so that only a small number of nearby data points are involved.

The effect of varying the number of vectors to be cast is shown in Figure 7 and was tested by setting *M* = 50, 100, and 150 with *n* = 9. Compared with the mean of the envelopes (black line in Figure 7), casting vectors shows a significant improvement. However, increasing *M* beyond *M* = 50 did not yield a significant further improvement, as the noise-reduced spectra look rather similar. This can be explained by the fact that the larger the distance between *x*_{i} and *x*_{k}, the lower the probability that the corresponding vector is included in the computation of the new noise-reduced data point *r*(*x*_{k+1}) in Equation 24.

### 3.4 Comparison with Savitzky–Golay and wavelet transform smoothing techniques

Figure 8 shows the simulated signal spectrum as a solid black line and the simulated raw spectra with SNRs between 1 and 25 dB as grey lines. At each SNR, 10 samples were simulated. The raw spectra were noise reduced using the presented vector casting method, the presented envelope-finder algorithm, the SG method, and the wavelet transform method. For the SG and the wavelet transform method, the input parameters were optimized with respect to a maximum overall SNR performance between the obtained noise-reduced spectrum and the pure signal spectrum according to Equation 8d. Figure 9a,b shows the parameters selected to give optimal denoised spectra for the wavelet and SG methods, respectively.

With respect to the SG method, the window size was varied from three to the maximum odd number that was smaller than or equal to the number of data points of the spectrum, and the polynomial order was varied between one and nine. As can be seen in Figure 9b, a polynomial order of three and a window size of nine were selected most frequently when denoising the simulated noisy signals.

With respect to the wavelet transform method, the wavelet denoising function *wdenoise* of the software package “Wavelet Toolbox” in MATLAB (by MathWorks Inc.) was used. Improved implementations of the wavelet denoising technique40, 41 may exist; however, the relevant codes are not available and thus could not be applied. Therefore, using the wavelet denoising built into MATLAB, we varied the level of decomposition between 1 and 10. Four different threshold selection rules42 were tested. For the selection of the suppression coefficients, mean, median, soft, and hard thresholding43 approaches were evaluated. Moreover, two different wavelet families (symlets and Daubechies) were tested. Figure 9a shows how frequently these parameters were used when optimally denoising the simulated noisy signals with the wavelet method. With respect to the envelope-finder approach, we used a moving window size of *n* = 9, and for the vector casting method, we cast the vectors in a window containing 150 data points.
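For orientation, the difference between hard and soft thresholding of wavelet coefficients can be sketched as follows. The sketch shows only the two standard thresholding rules applied to a list of coefficients. It does not reproduce the MATLAB *wdenoise* function, its threshold selection rules, or the wavelet transform itself, and the function names are hypothetical.

```python
import math


def hard_threshold(coeffs, t):
    """Keep coefficients whose magnitude exceeds t unchanged; set the others to zero."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Set small coefficients to zero and shrink the remaining ones towards zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


coeffs = [3.0, -0.5, 1.0, -2.0]
print(hard_threshold(coeffs, 1.0))  # [3.0, 0.0, 0.0, -2.0]
print(soft_threshold(coeffs, 1.0))  # [2.0, -0.0, 0.0, -1.0]
```

Hard thresholding preserves the magnitude of the retained coefficients, whereas soft thresholding reduces it, which is one reason why the choice of rule affects how strongly peak heights are altered.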

Figure 10 shows the SNR achievement of the four denoising methods computed using Equations 8a, 8c, and 8d. As can be seen in Figure 10a,b, all the denoising methods improved the original SNR across the entire spectral region as well as at sharp peaks. The vector casting and wavelet methods perform better than the other two methods. The vector casting method performs better than the wavelet method at lower SNRs, whereas the wavelet method exceeds the performance of the vector casting method at higher SNRs. Figure 10c depicts the overall performance of the denoising methods in smoothing the noisy signal while at the same time keeping the spectral peaks undistorted. For noisy signals with SNRs up to 15 dB, the vector casting method performs best, followed by the mean of the envelopes. For higher SNRs, the wavelet method exceeds the performance of the methods proposed here.

Figure 11 shows the simulated raw spectrum *R*(*x*_{i}) with an SNR of 10 dB as a grey line, the pure signal spectrum *S*_{sig}(*x*_{i}) as a blue line, and the denoised spectra *r*(*x*_{i}) of the vector casting, mean envelope, Savitzky–Golay, and wavelet methods as green, black, magenta, and red lines, respectively. Comparing the pure signal spectrum with the denoised spectra reveals the performance of the different noise reduction methods with respect to the level of smoothing in peak-free regions and the preservation of spectral shapes in peak regions (zoomed figures in Figure 10). With the vector casting method (green line), the noise-reduced spectrum matches the peak locations of the signal spectrum very well and preserves the spectral shape information rather well. In addition, the standard deviation of the denoised spectrum is very small compared with the other methods, especially in peak-free regions. Denoising with the envelope-finder algorithm alone (black line) provides an overall noisier noise-reduced spectrum than the vector casting method. Still, the peak heights and spectral shape are preserved rather well. The noise-reduced spectrum obtained using the optimized Savitzky–Golay method (magenta line) is noisier, and its spectral peak shape information is manipulated compared with the pure signal spectrum. The wavelet transform method (red line) performs better than the Savitzky–Golay method. Nonetheless, the peak positions and the peak shapes are manipulated slightly.

Finally, the performance of the four denoising approaches is compared based on experimentally acquired Raman spectra. Figure 12 shows 14 experimental Raman spectra of ethanol (grey lines) featuring different noise levels. A quasi-pure ethanol signal spectrum (black line) with large SNR is also shown as quasi-pure signal spectrum *S*_{sig}(*x*_{i}).

Figure 13a–d each show 14 Raman spectra (grey lines) of ethanol, normalized to the highest peak at around 845 nm, recovered using the vector casting, mean of envelopes, wavelet, and Savitzky–Golay smoothing methods, respectively. The spectrum of ethanol with high SNR is also shown as a blue line in the figures for reference. Moreover, to assess the reproducibility of the recovered spectra, the standard deviation of the 14 recovered noise-reduced spectra at each variable (here Raman shift) is computed and depicted alongside the recovered spectra as a red line. The standard deviation is quantified on the right ordinate. As can be seen from Figure 13, the standard deviation is higher around the peak regions than in peak-free regions. Thus, all techniques affect the peaks to some extent. However, the vector casting method yielded a better reproducibility of the spectra. At every peak, the vector casting method achieved the minimum standard deviation. The mean of the envelopes shows a result comparable with that of the wavelet method. The standard deviation of the Raman peak at around 812 nm is 0.013, 0.021, and 0.052 for the vector casting, wavelet, and Savitzky–Golay methods, respectively. The peak broadening effect of the Savitzky–Golay technique is strongly reflected by the standard deviation of the peak at around 842 nm. Moreover, the standard deviation of the double Raman peaks at 855 nm, the shoulder peak at 870 nm, and the Raman peak at 884 nm is decreased from 0.07 and 0.025 to 0.021, from 0.04 and 0.019 to 0.016, and from 0.07 and 0.024 to 0.02 with respect to the Savitzky–Golay and wavelet methods, respectively.
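The reproducibility measure can be sketched as follows: each recovered spectrum is normalized to its highest peak, and the standard deviation across the spectra is computed at every variable. The function names are hypothetical, and the use of the population standard deviation is an assumption, since it is not stated here whether the population or the sample standard deviation was used.

```python
import statistics


def normalize_to_max(spectrum):
    """Scale a spectrum so that its highest value equals one."""
    top = max(spectrum)
    return [v / top for v in spectrum]


def pointwise_std(spectra):
    """Standard deviation across spectra at each variable x_i (population form assumed)."""
    return [statistics.pstdev(column) for column in zip(*spectra)]


recovered = [
    [1.0, 2.0, 4.0],
    [1.0, 4.0, 4.0],
]
print(pointwise_std(recovered))  # [0.0, 1.0, 0.0]
```

Replacing `statistics.pstdev` with `statistics.stdev` would give the sample standard deviation instead.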

Besides outperforming the two most frequently used methods at small SNRs, the newly proposed method for the denoising of raw spectra also involves a minimum of human interaction. On the other hand, it requires envelope detection, which involves peak detection. We also assessed the computational efficiency of the proposed algorithm. The implementation language was Python.44 On a Dell Latitude E7450 with an Intel Core i7 processor, the average execution time of the envelope-finder algorithm was comparable with that of the SG method. However, the vector casting method took longer to execute, and its average execution time depends on the number of vectors to be cast.

## 4 CONCLUSION

In this study, we developed a new method for extracting the spectral signal from spectra with small SNR. Of course, this technique cannot extract signal peaks that are smaller than the noise level, but it can remove noise while manipulating the characteristics of the pure signal less than the wavelet transform method or the SG method. Furthermore, the proposed method relies only to a minimum extent on input parameters that have to be chosen by humans. In summary, the proposed method should be considered reliable, robust, and accurate.

## ACKNOWLEDGEMENTS

The project leading to this result has received funding from the Wilhelm Sander-Stiftung, Munich, Germany (Grant 2017.111.1). It also has received funding from the European Union's Horizon 2020 research and innovation programme under ERC Starting Grant agreement 637654 (Inhomogeneities).