Volume 55, Issue 7 p. 761-773
RESEARCH ARTICLE
Open Access

Quantifying the impact of sample, instrument, and data processing on biological signatures in modern and fossil tissues detected with Raman spectroscopy

Jasmina Wiemann

Corresponding Author


Robert A. Pritzker Center for Meteoritics and Polar Studies, Earth Science Section, Negaunee Integrative Research Center, Field Museum of Natural History, Chicago, Illinois, USA

Department of the Geophysical Sciences, University of Chicago, Chicago, Illinois, USA

Department of Earth and Planetary Sciences, Johns Hopkins University, Baltimore, Maryland, USA

Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, California, USA

Natural History Museum of Los Angeles County, Los Angeles, California, USA

Correspondence

Jasmina Wiemann, Robert A. Pritzker Center for Meteoritics and Polar Studies, Earth Science Section, Negaunee Integrative Research Center, Field Museum of Natural History, Chicago, IL, USA.

Email: [email protected]

Philipp R. Heck


Robert A. Pritzker Center for Meteoritics and Polar Studies, Earth Science Section, Negaunee Integrative Research Center, Field Museum of Natural History, Chicago, Illinois, USA

Department of the Geophysical Sciences, University of Chicago, Chicago, Illinois, USA

First published: 20 March 2024

Funding information: We acknowledge funding from the Agouron Institute (JW) and the TAWANI Foundation (PRH).

Abstract

Raman spectroscopy is a popular tool for characterizing complex biological materials and their geological remains. Ordination methods, such as principal component analysis (PCA), use spectral variance to create a compositional space, the ChemoSpace, grouping samples based on spectroscopic manifestations reflecting different biological properties or geological processes. PCA reduces the dimensionality of complex spectroscopic data and facilitates the extraction of informative features into formats suitable for downstream statistical analyses, thus representing a first step in the development of diagnostic biosignatures from complex modern and fossil tissues. For such samples, however, there is presently no systematic and accessible survey of the impact of sample, instrument, and spectral processing on the occupation of the ChemoSpace. Here, the influence of sample count, unwanted signals and different signal-to-noise ratios, spectrometer decalibration, baseline subtraction, and spectral normalization on ChemoSpace grouping is investigated and exemplified using synthetic spectra. Increasing the sample size improves the separation of groups in the ChemoSpace, and our sample yields a representative and mostly stable pattern in occupation with fewer than 10 samples per group. PCA reduces the impact of systemic interference of different amplitude and frequency (periodic or random features that can be introduced by instrument or sample) on compositional biological signatures, so that biological information can be extracted even when spectra of differing signal-to-noise ratios are compared. Routine offsets (±1 cm−1) in spectrometer calibration contribute less than 0.1% of the total spectral variance captured in the ChemoSpace in our sample and do not obscure biological information. Standard adaptive baselining, together with normalization, increases spectral comparability and facilitates the extraction of informative features.
The ChemoSpace approach to biosignatures represents a powerful tool for exploring, denoising, and integrating molecular information from modern and ancient organismal samples.

1 INTRODUCTION

Raman spectroscopy allows non-destructive compositional fingerprinting of complex biological and geological materials.1-10 Rapidly generated in situ spectra yield information on covalent, ionic, and non-covalent chemical interactions enabling a comparative search for informative heterogeneities across a diversity of samples,1 such as modern organismal tissues and their fossilization products. Spectroscopic biosignatures, such as phylogenetic and metabolic signals, represent diagnostic tools in cancer research,3-7 and a number of signatures present in fresh tissues preserve, occasionally altered but not unrecognizable, in fossilized carbonaceous tissues: in integrative data sets, spectroscopic signatures reflecting the relative abundance of different organic functional groups1 and organo-mineral interactions2 encode molecular manifestations of phylogenetic affinity,2-7, 11-13 physiology,2-7, 11-17 and degree and mode of environmental or diagenetic alteration.1, 2, 18 These signals are relative and can only be analyzed in a comparative framework.1-7, 11-18

Spectra collected across a diversity of tissue samples may contain additional unwanted signals that reduce the signal-to-noise ratio1, 2 (“noise” representing here signal that is not of direct chemical nature). Examples include a nonlinear background based on sample fluorescence induced by the excitation source,1, 19 lower intensity counts due to diffusive scattering at rough sample surfaces,1, 19 and (quasi-)sinusoidal signals resulting from reflective scattering at layers with different optical properties within a tissue sample or introduced by certain instrument components (laser filters in combination with specific line gratings).19, 20 Most of these unwanted signals can be described as wave functions of different periodicity, amplitude, and frequency: fluorescence often times behaves like n = 1–1.5 sine wave half cycles, diffusive scattering at tissue layers or filter materials is accurately represented by low-frequency periodical sine waves, and shot noise tends to behave like a random high-frequency, low-amplitude interference.1, 19, 20 Noisy spectra containing a diversity of unwanted signals are a well-known challenge in biological tissue spectroscopy,1, 7 and processing routines beyond despiking and standard-based luminescence correction,21-23 including adaptive baselining (background correction sensitive to the total spectral curve) and normalization (intensity scaling based on individual peaks or integrated spectral areas), are employed to minimize the impact of unwanted signals on data interpretation.1, 2, 7, 19 Similarly, spectral phase shift introduced by temperature-based instrument decalibration can be traced and corrected across a series of analytical sessions,24 but nonlinear decalibration rates render correction (up to ±1 cm−1 wavenumber) during a single analytical session challenging.

In the last 30 years, spectroscopy has shifted from the exclusively qualitative interpretation of Raman spectra8-10 toward a comparative approach1, 2, 5-7 that relies, as an essential first step in the data analysis, on ordination methods (dimensionality reduction), such as principal component analysis (PCA). PCA makes it possible to explore, denoise (Figure 1), identify, and extract informative heterogeneities (effectively “latent variables”) from sets of inherently complex spectra, each characterized by a very large number of data points collected over the wavenumber range.1, 2, 7, 11-18 PCA captures the covariance of spectral features in an n-dimensional compositional space, the ChemoSpace, where n equals the number of features considered.25, 26 The ChemoSpace is based on a variance–covariance matrix ([number of spectra] × [number of features]).25, 26

FIGURE 1
Schematic drawing showcasing the denoising potential of principal component analysis (PCA). (A) n = 1 Raman spectrum of the sample type 1 plotted over the organic fingerprint region. A set (n = 5) of synthetic technical replicates based on the spectrum plotted in (A) were summed with a medium-frequency synthetic interference function, in order to generate the n = 5 spectra shown in (B). (B) The artificially noised n = 5 varieties of the source spectrum shown in (A). The source spectrum in (A) is plotted under the summed spectral curves and is shaded in gray. PCA is applied to the n = 5 summed spectra. (C) Resulting PC 1 axis loadings plotted over the organic fingerprint region match the original source spectrum shown in (A). Detectable synthetic interference has been efficiently removed by PCA. PCA allows for the robust denoising of spectroscopic data collected for biological or paleontological tissue samples. Source spectra and interference functions can be found in Table S1.
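The denoising effect summarized in this caption can be reproduced with a minimal Python sketch. Everything in it is an illustrative assumption rather than the data or software used in this study: the wavenumber grid, the three Gaussian bands standing in for a source spectrum, and the scaling factors are hypothetical, and a NumPy singular value decomposition stands in for the PCA routine. Only the form of the medium-frequency interference function follows Section 2.

```python
import numpy as np

# Hypothetical wavenumber axis over the organic fingerprint region (cm^-1).
x = np.linspace(500.0, 1800.0, 1301)

def gaussian(x, center, width):
    """Unit-height Gaussian band; used only to build a toy spectrum."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Stand-in source spectrum: three made-up bands, not real tissue data.
source = (gaussian(x, 1005, 8)
          + 0.6 * gaussian(x, 1445, 15)
          + 0.8 * gaussian(x, 1676, 20))

# Medium-frequency interference of the form given in Section 2.
interference = 0.5 + 0.5 * np.sin(0.08 * x)

# Five synthetic technical replicates: scaled source plus the same interference.
scales = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
spectra = scales[:, None] * source[None, :] + interference[None, :]

# PCA via SVD of the mean-centered data matrix.
centered = spectra - spectra.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# The sign of a principal axis is arbitrary; orient it to match the source.
if pc1 @ source < 0:
    pc1 = -pc1

# Cosine similarity between the PC 1 axis and the source spectrum.
similarity = pc1 @ source / np.linalg.norm(source)
explained = singular_values**2 / np.sum(singular_values**2)
```

Because the interference is identical in every replicate in this toy setup, mean-centering removes it, and the only remaining variance lies along the source spectrum. Interference that varies between replicates would not cancel this cleanly.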

The general order of magnitude of the minimum number of spectra required to achieve a stable Raman ChemoSpace occupation remains to be determined: the distribution of data points across the ChemoSpace changes with the number and type of spectra or selected peaks included in the analysis, and increasing the number of considered spectra increases the statistical power of sample group separation. Once the number of included Raman spectra allows for an accurate representation of the compositional diversity in a sample set, a stable pattern in ChemoSpace occupation is reached.

The number of considered features and thus dimensions of the ChemoSpace (n) can encompass all the data points that contribute to a spectrum or, more commonly, selected peaks of interest.1, 2 PCA benefits from the subsampling of peaks in spectra, an approach that prevents overweighting broad signals and spectral regions uninformative for a given question.27 Because the normalized intensities of Raman peaks represent the relative abundancies of molecular features in the sampled area, the ChemoSpace can be thought of as a multivariate application of Lambert–Beer's law, which defines that the spectroscopic signal of a compound is proportional to its abundance in the sample. When spectra of different biological sample types are analyzed by means of PCA, variance corresponding to different biosignatures is commonly expressed by the first two or three principal components (PCs)—the axes of the ChemoSpace displaying different aspects of variance in the data, sorted by descending contributions to the total variance.1, 2, 26 Based on the information represented along the PCs and the distribution of eigenvectors that illustrate the impact of individual peaks on the placing of a sample in the ChemoSpace, PCA allows for the exploration and identification of features that are particularly informative for sample grouping and thus represents an essential tool for the subsampling of spectral data points required toward downstream classification or cluster analyses. 
Co-dependence of individual or overlapping Raman peaks based on molecular connectivity, as well as (spatial) covariance of certain compounds in biological systems, has previously posed an additional challenge (also coined the “cage of covariance”) to the stand-alone interpretation of modern biological ChemoSpaces28, 29—however, cross-interrogation of complementary spectroscopic data (i.e., Raman and Fourier-Transform Infrared Spectroscopy [FT-IR]) and experimental chemical alteration of individual reference samples offer suitable controls when interpreting compositional spaces.
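The relationship between the variance–covariance matrix, the eigenvectors, and the ordering of the PCs described above can be sketched in a few lines of Python. This is a generic covariance-based PCA written with NumPy, not the implementation used in this study; the function name `pca_from_covariance` and the four-feature toy data are hypothetical.

```python
import numpy as np

def pca_from_covariance(intensities):
    """PCA of a (spectra x features) matrix via its variance-covariance matrix."""
    centered = intensities - intensities.mean(axis=0)
    cov = np.cov(centered, rowvar=False)              # (features x features)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]             # sort by descending variance
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centered @ eigenvectors                  # sample coordinates in the space
    explained = eigenvalues / eigenvalues.sum()       # fraction of variance per PC
    return scores, eigenvectors, explained

# Toy data: two groups of six "spectra", each reduced to four relative intensities.
rng = np.random.default_rng(0)
group1 = rng.normal([1.0, 0.2, 0.5, 0.1], 0.02, size=(6, 4))
group2 = rng.normal([0.3, 0.9, 0.5, 0.6], 0.02, size=(6, 4))
scores, eigenvectors, explained = pca_from_covariance(np.vstack([group1, group2]))
```

In this toy case the between-group difference dominates the variance, so the two groups fall on opposite sides of PC 1; the columns of `eigenvectors` indicate how strongly each feature contributes to each axis.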

The impact of analytical variables and different types of unwanted spectral features on classification approaches to spectroscopic biosignatures in modern and fossil tissues, such as linear discriminant analysis (LDA)30 and its corresponding machine-learning tools (i.e., support vector machines, SVM),31, 32 is known and has led to a number of end user recommendations,30-32 but it is only incompletely characterized for the PCA ChemoSpace. Given the potential of the ChemoSpace to address questions in modern biology1, 3, 4 and clinical diagnostics,5-7 and the recent peak in interest by the paleontological,2, 11-18 geological,33 and astrobiological33 research communities, a systematic survey of the impact of sample size (Figure 2), spectral signal-to-noise ratios (Figures 3, 4A,C, 5, and 6), spectrometer decalibration (Figure 7), baseline subtraction routines (Figure 8), and normalization procedures (Figure 4B,D) on informative ChemoSpace grouping, accessible to non-specialists from different disciplines, is overdue. In this study, we utilize simplified models of representative tissue spectra to quantify and explain trends in the impact of sample, instrument, and data processing on ChemoSpace occupation and the detectability of compositional biosignatures.

FIGURE 2
The impact of sample size on the ChemoSpace occupation. The selection of plots aims to showcase the initial ChemoSpace occupation with only n = 3 samples per group (A), the key steps in cluster rotation resulting from an increase in sample number (B, C), the stable ChemoSpace occupation based on n = 30 samples per group (D), and the relationship between the number of samples and the amount of variance represented in the ChemoSpace for this example. Arrows and the shaded area in between them represent eigenvectors in the biplot. (A) ChemoSpace plot resulting from n = 3 varieties (synthetic technical replicates) of two sample types (1: teal; 2: orange). (B) ChemoSpace plot resulting from n = 5 varieties (synthetic technical replicates) of the two sample types (1: teal; 2: orange). (C) ChemoSpace plot resulting from n = 6 varieties (synthetic technical replicates) of the two sample types (1: teal; 2: orange). (D) ChemoSpace plot resulting from n = 30 varieties (synthetic technical replicates) of the two sample types (1: teal; 2: orange). All PC loadings are listed in the ChemoSpace plots. All source data can be found in spreadsheet Table S2. (E) Graph showing the relationship between the number of samples and the amount of variance explained on principal component axis (PC) 1 (teal), PC 2 (orange), and both combined (gray). All source data can be found in Table S2.
FIGURE 3
The impact of systemic low-frequency sinusoidal interference and different signal-to-noise ratios on ChemoSpace occupation. (A) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 differently scaled sets of spectra corresponding to the sample type 1 over the organic fingerprint region. (B) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 differently scaled sets of spectra corresponding to the sample type 2 over the organic fingerprint region. (C) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 differently scaled sets of spectra corresponding to the two sample types (1: teal hues; 2: orange hues) added to the normalized synthetic interference function (for details, see figure or Section 2) over the organic fingerprint region; the interference function represents unwanted spectral features introduced by edge filter ripples, refraction at optical layers within a stratified biological tissue sample, or Mie-ripples. Four Raman band positions are indicated (x1 – x4), and the colored data points label the mean intensity of the individual sets of spectra, in order to visually explain how the variance–covariance matrix is built. Signal-to-noise ratios range from ~1% to 50%. Sets of spectra matching in their signal-to-noise ratio are extracted in (D)–(H). (D) Set of spectra extracted from (C) with a signal-to-noise ratio of 1:1. (E) Set of spectra extracted from (C) with a signal-to-noise ratio of 1:1.5. (F) Set of spectra extracted from (C) with a signal-to-noise ratio of 1:2. (G) Set of spectra extracted from (C) with a signal-to-noise ratio of 1:10. (H) Set of spectra extracted from (C) with a signal-to-noise ratio of 1:100. (I) ChemoSpace across principal components (PCs) 1 and 2 based on a variance–covariance matrix including select relative intensities (see Section 2) extracted from the plot in (C). Data point fill colors correspond to the sample type (1: teal hues; 2: orange hues; compare C). The different signal-to-noise ratios are shown for groups of data points, and the labeled arrows indicate trends in the data distribution across the ChemoSpace. All source data can be found in Tables S1 and S3.
FIGURE 4
The impact of medium-frequency sinusoidal interference and different signal-to-noise ratios on ChemoSpace occupation. (A) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 differently scaled sets of spectra corresponding to the two sample types (1: teal hues; 2: orange hues), added to the normalized synthetic interference function (for details, see figure) over the organic fingerprint region. (B) The same spectra as in (A), normalized (standard normalization) to the highest peak in each spectrum. (C) ChemoSpace plot across principal components (PCs) 1 and 2 based on a variance–covariance matrix including select relative intensities (see Section 2) extracted from the plot in (A). Data point fill colors correspond to the sample type (1: teal hues; 2: orange hues; compare A). The different signal-to-noise ratios are shown for groups of data points, and the labeled arrows indicate general patterns in the data distribution across the ChemoSpace. (D) ChemoSpace plot across PCs 1 and 2 based on a variance–covariance matrix including select relative intensities (see Section 2) extracted from the plot in (B). Data point fill colors correspond to the sample type (1: teal hues; 2: orange hues; compare B). The different signal-to-noise ratios are highlighted for groups of data points, and the labeled arrows indicate trends in the data distribution across the ChemoSpace. All source data can be found in Tables S4 and S5. The spectral denoising process is showcased in Figure 1.
FIGURE 5
The impact of high-frequency random shot noise and different signal-to-noise ratios on ChemoSpace occupation. (A) Plot of n = 5 varieties (synthetic technical replicates) of n = 4 differently scaled sets of spectra corresponding to the two sample types (1: teal hues; 2: orange hues), added to the measured high-frequency random shot noise over the organic fingerprint region. (B) ChemoSpace across principal components (PCs) 1 and 2 based on a variance–covariance matrix including select relative intensities (see Section 2) extracted from the plot in (A). Data point fill colors correspond to the sample type (1;2), and the values in the parentheses correspond to the signal-to-noise ratio (compare A). Semi-transparent data points of spectra without added random shot noise have been plotted in the background (and were projected upwards or downwards to reveal the arrangement of data points in the compositional space) for direct comparison with the opaque data points (which are plotted in the foreground) containing random shot noise. All source data can be found in Table S6.
FIGURE 6
Trends in the relative abundance of informative versus unwanted signals (compare with the signal-to-noise ratio) in spectroscopic data published in the molecular medical and biological literature (n = 8 data sets, n = 3–5 replicates were analyzed; see Table S7) and the molecular paleobiological literature (n = 6 data sets, n = 3–5 replicates were analyzed; Table S6); categories are separated along the x-axis of the plot. The percentage of true compositional signal relative to the total amount of spectroscopic signal, which includes both compositional and unwanted signals, in the published sets of spectra is shown on the y-axis of the plot. The bars associated with the percentage of informative signal in spectra from medical and biological publications represent the standard deviation based on the analyzed spectral sample (±1σ). For molecular paleobiological studies with sufficient spectral data published alongside the article,2, 11, 12, 16, 18 one outlier study was identified.20 Signal-to-noise ratios (S/N) corresponding to the listed percentage of informative spectral signals in Figure 3D–F are plotted in the form of gray, dashed lines (labeled in the figure). The color gradient in the data bars corresponds to trends in the spectral quality.
FIGURE 7
The impact of spectrometer decalibration on the occupation of the ChemoSpace. (A) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 different x-axis offsets applied to the two sample types (1: teal; 2: orange) over the organic fingerprint region. (B) ChemoSpace across principal components (PCs) 3 and 4 based on a variance–covariance matrix including select relative intensities (see Section 2) extracted from the plot in (A). Data point outline colors correspond to the sample type (1;2), and fill colors correspond to the x-axis offset (compare A). The scree plot of PC loadings indicates the placement of the decalibration signal. All source data can be found in Table S8.
FIGURE 8
The impact of baseline subtraction on the occupation of the ChemoSpace. (A) Plot of n = 5 varieties (synthetic technical replicates) of n = 5 different baseline subtraction approaches (in SpectraGryph26: linear, 50%, 30%, 20%, 10%) applied to the two sample types (1: teal hues; 2: orange hues) over the organic fingerprint region. (B) ChemoSpace plot across principal components (PCs) 1 and 2 based on variance–covariance matrix including relative intensities (see Section 2) extracted from the spectra in (A). Data point outline colors correspond to the sample type (1: black; 2: red), and the fill colors correspond to the different baseline subtraction routines (labeled in the figure). Red arrows point toward increasingly adaptive baselines. All source data can be found in Table S9.

2 METHODS

In order to illustrate the effects of sample size, instrument features, and spectral processing on ChemoSpace occupation, we have selected spectra of two different biological tissues with a simple composition (avian eggshell membrane and avian eggshell [Gallus domesticus]), labeled as sample types 1 and 2. For the purpose of experimentation without major signal distortion, these spectra were modified in a number of ways: (1) scaling whole spectra to generate varieties (5–30, depending on the specific analysis) of the same source signal, (2) superimposing synthetic sinusoidal wave functions (as simplified representatives of effects related to reflective scattering at tissue layers and features introduced by edge filters) with low and medium frequencies (the latter equal the average Raman band width in the source spectra), (3) superimposing measurements of high-frequency random shot noise (which is common in spectra of biological materials), (4) shifting spectra along the x-axis (+1, +0.5, 0, −0.5, −1 cm−1 offsets), (5) baselining spectra with linear and adaptive approaches (as performed with the SpectraGryph freeware34: linear [no offset]; 50%, 30%, 20%, 10% baseline adaptivity options), and (6) normalizing spectra relative to the highest peak. All source data are available in Tables S1–S9.

2.1 Impact of the sample: Sample size

Adding samples to a small initial data set is expected to result in rotation of the axis separating the two sample groups as the amount of variation within the groups increases.25, 26 Stable ChemoSpace occupation requires a representative sample, the sample size varying depending on the amount of spectral variance captured in the data set. Various experimental and computational tools can be employed to aid the determination of ideal sample sizes and the critical evaluation of PCA model stability35-37; however, here, we aim to showcase the impact of an increasing number of spectra per sample group on ChemoSpace occupation as it is representative for complex biological tissues: 30 scaled varieties of the two source spectra (sample types 1 and 2; Table S1), that is, a total of 60, were generated. The ChemoSpaces resulting from the individual sets of 3, 5, and 6 samples per sample type were plotted to showcase key steps in the axis rotation of groups, as well as the terminal ChemoSpace occupation (set of 30 samples). To do so, all spectra were plotted in SpectraGryph.34 Relative intensities were extracted from all spectra at 39 Raman band positions: 510, 536, 577, 644, 667, 698, 711, 725, 739, 753, 761, 778, 811, 839, 856, 880, 931, 959, 993, 1005, 1031, 1124, 1165, 1186, 1229, 1249, 1330, 1344, 1356, 1363, 1418, 1445, 1478, 1535, 1550, 1586, 1609, 1676, 1751 cm−1. These data resulted in [3 to 30]  × [39] variance–covariance matrices (Table S2). 2D-ChemoSpaces (Figure 2A–D) and variance captured along the PC axes (PC loadings, Figure 2E) were graphed in PAST 3.038 and are shown in Figure 2.
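The step of extracting relative intensities at the 39 band positions can be illustrated with a short Python sketch. The band positions are those listed above; the function name `extract_band_intensities`, the use of linear interpolation, and the single-band toy spectrum are assumptions made for illustration and do not describe how SpectraGryph reads out intensities.

```python
import numpy as np

# The 39 Raman band positions (cm^-1) listed in Section 2.1.
BAND_POSITIONS = np.array([
    510, 536, 577, 644, 667, 698, 711, 725, 739, 753,
    761, 778, 811, 839, 856, 880, 931, 959, 993, 1005,
    1031, 1124, 1165, 1186, 1229, 1249, 1330, 1344, 1356, 1363,
    1418, 1445, 1478, 1535, 1550, 1586, 1609, 1676, 1751,
], dtype=float)

def extract_band_intensities(wavenumbers, spectra, positions=BAND_POSITIONS):
    """Return a (spectra x bands) matrix of intensities read at `positions`.

    Linear interpolation is an assumption of this sketch.
    """
    return np.vstack([np.interp(positions, wavenumbers, s) for s in spectra])

# Toy input: three scaled varieties of one made-up spectrum on a 0.5 cm^-1 grid.
wavenumbers = np.linspace(500.0, 1800.0, 2601)
source = np.exp(-0.5 * ((wavenumbers - 1005.0) / 8.0) ** 2) + 0.1
scales = np.array([1.0, 0.5, 0.25])
spectra = scales[:, None] * source[None, :]

matrix = extract_band_intensities(wavenumbers, spectra)
```

Each row of `matrix` then corresponds to one spectrum and each column to one band position, which is the layout of the data tables passed on to PCA.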

2.2 Impact of the sample and instrument: Systemic unwanted signals

To determine the impact of simplified systemic unwanted signals,19, 20 such as sinusoidal features resulting from reflective scattering at tissue layers or instrument optics (laser-cancelling filters) on ChemoSpace occupation, 5 individual varieties of the two source spectra (sample types 1 and 2) were scaled to 100%, ~66%, 50%, 10%, 0.1% of their normalized intensity (the highest peak scaled to the value 1) and the results plotted in SpectraGryph (Figure 3A,B). Two sinusoidal interference functions (unwanted signals) were computed, one with a low frequency (f(x) = 0.5 + 0.5 × sin(0.015x)) (Table S3) and the other one with a medium frequency (f(x) = 0.5 + 0.5 × sin(0.08x)) matching the average Raman band width in the source spectra (Tables S4 and S5). In addition, high-frequency random shot noise was collected from spectroscopic measurements (40 random noise signals collected over the organic fingerprint region; Table S6). The sets of scaled spectra for sample types 1 and 2 were added to these interference functions, resulting in different signal-to-noise ratios: 1:1 (informative signal content: 50%), 1:1.5 (informative signal content: 40%), 1:2 (informative signal content: ~33%), 1:10 (informative signal content: ~9%), 1:100 (informative signal content: ~1%; not included in the analysis of random shot noise for visualization purposes). The resulting combined signals for low (all spectral varieties in Figure 3C, individually scaled subsamples in Figure 3D–H), medium (Figure 4A), and high (Figure 5A) frequency interference were plotted separately in SpectraGryph,34 and intensities at the selected 39 band positions (listed above) were extracted. The resulting variance–covariance matrices containing low (Figure 3I), medium (Figure 4C), and high (Figure 5B) frequency interference were subjected to PCAs in PAST 3.0,38 and the resulting PC loadings and 2D-ChemoSpaces were graphed.
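The two interference functions and the arithmetic linking each signal-to-noise ratio to its informative signal content can be written out as a short Python sketch. The function names and the wavenumber grid are hypothetical; the coefficients 0.015 and 0.08 and the ratios are taken from the text, and the informative content is computed as signal / (signal + noise).

```python
import numpy as np

def low_frequency_interference(x):
    """f(x) = 0.5 + 0.5 * sin(0.015 x), oscillating between 0 and 1."""
    return 0.5 + 0.5 * np.sin(0.015 * x)

def medium_frequency_interference(x):
    """f(x) = 0.5 + 0.5 * sin(0.08 x), oscillating between 0 and 1."""
    return 0.5 + 0.5 * np.sin(0.08 * x)

def informative_fraction(signal, noise):
    """Share of informative signal in the summed signal for a ratio signal:noise."""
    return signal / (signal + noise)

# Hypothetical grid over the organic fingerprint region (cm^-1).
x = np.linspace(500.0, 1800.0, 1301)
low = low_frequency_interference(x)
medium = medium_frequency_interference(x)

# Period of each sine wave along the wavenumber axis (cm^-1).
low_period = 2 * np.pi / 0.015
medium_period = 2 * np.pi / 0.08

# Signal-to-noise ratios used in Section 2.2, as (signal, noise) pairs.
ratios = [(1, 1), (1, 1.5), (1, 2), (1, 10), (1, 100)]
percentages = [round(100 * informative_fraction(s, n), 1) for s, n in ratios]
```

The resulting percentages (50, 40, about 33, about 9, and about 1) reproduce the informative signal contents quoted for the five ratios.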

To contextualize and constrain the signal-to-noise ratios in Raman spectra of modern and fossil biological samples, between 3 and 5 (as available in the individual studies) technical replicates of organic Raman spectra published in the fields of medicine, biology, and paleobiology were compiled from the literature (Table S7). Technical replicates (Figure 1A) were plotted in SpectraGryph34 and whole-spectral data were exported (resolution varies across published data sets) to create corresponding variance–covariance matrices. PC 1 axis loading functions were extracted (Figure 1C), plotted, and normalized together with one of the 3–5 source spectra in SpectraGryph.34 Integrals of each spectrum and the corresponding PC 1 axis loading function, which represents the true compositional signal, were calculated over the whole spectral range (resolution differs across published data sets). The area under the PC 1 axis loading function was compared with that under the source spectrum containing potential unwanted signals and expressed as the percentage of coverage (Figure 1B). Percentage ranges capturing the relationship between the total spectral signal and the true compositional signal were plotted in PAST 3.038 (Figure 6). Figure 1 illustrates the process of denoising the biological tissue spectra through PCA.
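The area comparison described above amounts to a ratio of two integrals, which can be sketched in Python as follows. The function names are hypothetical, the trapezoidal rule is an assumed integration scheme, and both curves are assumed to have been normalized to a common scale beforehand; the toy curves are chosen only so that the result is easy to check by hand.

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under y(x) by the trapezoidal rule (an assumption of this sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def informative_percentage(x, spectrum, pc1_loading):
    """Area under the PC 1 loading function as a percentage of the spectrum area."""
    return 100.0 * trapezoid_area(x, pc1_loading) / trapezoid_area(x, spectrum)

# Toy curves: a flat "spectrum" of height 1 and a linear ramp as the "loading".
x = np.linspace(0.0, 1.0, 101)
spectrum = np.ones_like(x)
loading = x.copy()

coverage = informative_percentage(x, spectrum, loading)
```

Here the ramp covers half of the area under the flat curve, so `coverage` evaluates to 50%, which corresponds to a signal-to-noise ratio of 1:1 in the terminology used above.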

2.3 Impact of the instrument: Spectrometer decalibration

To characterize how ChemoSpace occupation is impacted by minute spectrometer decalibration that occurs routinely during longer analytical sessions in response to changes in room temperature,24 the 5 scaled varieties of the two source spectra (sample types 1 and 2) were shifted along the x-axis as follows: +1, +0.5, 0, −0.5, −1 cm−1, resulting in a total of n = 5 × 5 + 5 × 5 = 50 spectral varieties. All resulting spectra were plotted in SpectraGryph34 (Figure 7A). A variance–covariance matrix (Table S8) was built based on the extracted intensities of major peaks at the previously introduced 39 band positions. The resulting variance–covariance matrix (50 × 39) was subjected to PCA in PAST 3.038 and the (1) variance explained by the calibration signal, (2) sample separation based on calibration differences captured along PCs 3 and 4 in the ChemoSpace, and (3) corresponding scree plot are illustrated in Figure 7B.
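A wavenumber offset of this kind can be simulated by resampling a spectrum on a shifted axis, as in the following Python sketch. The function name `shift_spectrum`, the use of linear interpolation, and the single Gaussian band are assumptions made for illustration; only the five offsets and the count of 50 spectral varieties come from the text.

```python
import numpy as np

def shift_spectrum(wavenumbers, intensities, offset):
    """Resample a spectrum so that its features appear `offset` cm^-1 higher.

    Values outside the original range are held at the edge intensities.
    """
    return np.interp(wavenumbers, wavenumbers + offset, intensities)

# Toy spectrum: one made-up band at 1000 cm^-1 on a 0.5 cm^-1 grid.
wavenumbers = np.arange(900.0, 1100.5, 0.5)
source = np.exp(-0.5 * ((wavenumbers - 1000.0) / 5.0) ** 2)

# The five x-axis offsets applied in Section 2.3 (cm^-1).
offsets = [1.0, 0.5, 0.0, -0.5, -1.0]
shifted = np.vstack([shift_spectrum(wavenumbers, source, o) for o in offsets])
peak_positions = wavenumbers[shifted.argmax(axis=1)]

# Two sample types x five scaled varieties x five offsets.
n_total = 2 * 5 * len(offsets)
```

Reading intensities at fixed band positions from such shifted spectra changes each value only slightly for narrow offsets, which is consistent with the small share of variance attributed to decalibration in this study.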

2.4 Impact of spectral processing: Spectral baselining

Baseline subtraction is an established approach1 employed to increase the comparability of spectra when background signals differ across samples. Background shapes differ substantially in sets of spectra collected from, e.g., modern and fossil biological tissues. To capture the influence of baselining on ChemoSpace occupation, 5 varieties of the two source spectra (sample types 1 and 2) were subjected to the linear option and the 50%, 30%, 20%, and 10% adaptive baselining options (no y-axis offset in either case) in SpectraGryph. All n = 5 × 5 + 5 × 5 = 50 resulting spectra were plotted in SpectraGryph (Figure 8A). Excessive ( 20% in SpectraGryph) baseline adaptivity leads to partial subtraction of signal associated with the highest peaks in the spectra and alters the ratio of normalized signal intensities that encode biosignatures. Relative intensities at the same 39 band positions (introduced above) were extracted from all spectra and incorporated into a [50] × [39] variance–covariance matrix (Table S9). PCA was performed in PAST 3.038 to capture the impact of different baselines on ChemoSpace occupation reflected in PC loadings and sample position in the ChemoSpace plot based on PCs 1 and 2 (Figure 8B).
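The principle of baseline subtraction can be illustrated with the simplest possible case, a straight line drawn between the two ends of a spectrum. This Python sketch is a generic stand-in and is not the linear or adaptive algorithm implemented in SpectraGryph, whose internals are not described here; the function name and the toy spectrum are hypothetical.

```python
import numpy as np

def subtract_linear_baseline(x, y):
    """Subtract the straight line connecting the first and last data points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = (y[-1] - y[0]) / (x[-1] - x[0])
    baseline = y[0] + slope * (x - x[0])
    return y - baseline

# Toy spectrum: one made-up band on top of a sloping background.
x = np.linspace(500.0, 1800.0, 1301)
peak = np.exp(-0.5 * ((x - 1000.0) / 10.0) ** 2)
background = 0.2 + 0.0005 * (x - 500.0)
measured = peak + background

corrected = subtract_linear_baseline(x, measured)
```

For a purely linear background the band is recovered unchanged. A more adaptive baseline follows the spectral curve more closely and can start to remove part of the band itself, which is the effect of excessive adaptivity noted above.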

2.5 Impact of spectral processing: Spectral normalization

Normalization scales a spectrum based on the highest peak, a particular selected peak, or the area under the spectral curve.1, 34 It is commonly applied prior to any quantitative analysis1 to increase comparability across spectra given the variability of absolute Raman intensities among diverse samples. The combined set of 50 varieties of spectra containing the synthetic, medium-frequency interference (introduced above) was plotted (Figure 4A) to capture the impact of normalization on ChemoSpace occupation. The highest peak of each spectrum was scaled to a value of 1 (a common approach) using the SpectraGryph34 normalization option (Figure 4B). Relative intensities were extracted from all spectra at the 39 wavenumber positions generating a [50] × [39] variance–covariance matrix. Figure 4D shows the resulting PC loadings and ChemoSpace plot based on PCs 1 and 2.
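Normalization to the highest peak is a one-line operation, sketched here in Python with NumPy. The function name `normalize_to_highest_peak` and the five-point toy spectrum are hypothetical, and the sketch assumes that every spectrum has a positive maximum.

```python
import numpy as np

def normalize_to_highest_peak(spectra):
    """Scale each spectrum (row) so that its highest value equals 1."""
    spectra = np.asarray(spectra, dtype=float)
    return spectra / spectra.max(axis=-1, keepdims=True)

# Toy input: three differently scaled varieties of the same made-up spectrum.
source = np.array([0.2, 1.5, 0.7, 3.0, 0.4])
spectra = np.array([1.0, 0.5, 0.1])[:, None] * source

normalized = normalize_to_highest_peak(spectra)
```

Pure intensity scaling is removed entirely by this step, so the three rows of `normalized` are identical. When an additive interference is present, as in the spectra analyzed here, normalization rescales signal and interference together and therefore does not cancel the interference.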

3 RESULTS AND DISCUSSION

The effects of sample size, instrument decalibration, and spectral processing on ChemoSpace occupation were simulated and showcased in six distinct experiments. Minute changes in spectrometer calibration, the systemic presence of unwanted signal, differences in the spectral signal-to-noise ratios, linear and standard adaptive baseline subtraction, and spectral normalization do not overprint the biologically informative grouping of tissue samples in the ChemoSpace. Spectral processing, including baseline correction and normalization prior to PCA, improved data comparability and biosignature separation. A near-stable pattern in ChemoSpace occupation is, in this example, reached with as few as 6 spectra per sample type.

3.1 Stable ChemoSpace occupation can be reached with fewer than 10 samples

The number of samples required to achieve a stable ChemoSpace occupation is as few as 6 per sample type in this data set representing biological tissues (Figure 2E). With 12 samples, the two clusters are separated across PC 2, which accounts for 42.7% of the variance in the data set, whereas intra-group variance accounts for 57.3% of the total and is captured on PC 1. In contrast to PC loadings, eigenvectors in the ChemoSpace biplot (teal and orange arrows in Figure 2A–D) allow the sources of variance in the data, including biological signals within and across tissues, to be differentiated even when cluster separation occurs diagonally in the ChemoSpace. Such eigenvector trajectories allowed us to infer that rotation of the axis separating sample clusters in this ChemoSpace results from an increase in the contribution of intra-group variance to the total variance as spectra are added: intra-group variance becomes the primary source of variance and is displayed along PC 1. This experiment suggests that the sampling strategy should reflect the scientific question of interest: in integrative data sets including modern and fossil tissues, ChemoSpace grouping will account more accurately for variation in different modes of (diagenetic) alteration of a biological tissue, for example, when an increasing number of fossil samples from different depositional settings is considered. The sample set analyzed here is not intended to provide a generalizable model that can be directly transferred to other data sets, but rather aims to showcase and explain trends in the relationship between sampling strategy and ChemoSpace occupation.
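How the share of variance per PC arises can be illustrated with a short Python sketch of PCA by eigendecomposition of the variance–covariance matrix. It is a generic textbook formulation and does not reproduce the PAST implementation; the function name, the random band profiles, the noise level, and the choice of 6 spectra per type are hypothetical values chosen for illustration.

```python
import numpy as np


def pca_covariance(data):
    """PCA via eigendecomposition of the variance-covariance matrix.

    Returns the fraction of variance per PC, the loadings (one column
    per PC), and the scores (sample coordinates in the PC space).
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigenvalues)[::-1]             # re-sort descending
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()
    scores = centered @ eigenvectors
    return explained, eigenvectors, scores


rng = np.random.default_rng(1)
n_per_type, n_bands = 6, 39

# Hypothetical mean band profiles for two sample types.
profile_1 = rng.uniform(0.1, 1.0, n_bands)
profile_2 = rng.uniform(0.1, 1.0, n_bands)

# Small random intra-group variation around each profile (assumed level).
type_1 = profile_1 + rng.normal(scale=0.02, size=(n_per_type, n_bands))
type_2 = profile_2 + rng.normal(scale=0.02, size=(n_per_type, n_bands))
data = np.vstack([type_1, type_2])                    # 12 spectra x 39 bands

explained, loadings, scores = pca_covariance(data)
```

In this toy case the inter-group difference dominates, so the two types fall on opposite sides of PC 1. When intra-group variance grows relative to inter-group variance, the leading eigenvector is instead aligned with intra-group variation, which is the situation described above for 12 samples.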

3.2 PCA allows biosignatures to be detected in a ChemoSpace even if systemic unwanted signals are present in spectra

PCA as employed here is based on a variance–covariance matrix. The focus on variance rather than qualitative comparisons of absolute spectral differences (Figure 3C) facilitates the detection of biologically meaningful sample grouping, even when prominent unwanted signals, such as sample- or instrument-related spectral features,19, 20 are present. In addition, the extraction of relative intensities at informative wavenumber positions allows features relevant to a given question to be emphasized.27 A mostly stable, omnipresent interference signal is unlikely to become the primary source of variance in a diverse data set, such as a sample of different tissues. PCA reliably separates the two clusters corresponding to signals 1 and 2, regardless of the frequency of a periodic, systemically present, unwanted signal, or the total spectral signal-to-noise ratio (Figures 3I and 4C). An omnipresent and invariant unwanted signal will not overprint compositional biosignatures in a ChemoSpace PCA, even if it includes more complex spectral features, such as signals associated with quartz glass slides. Random high-frequency shot noise at realistic intensity, as modeled in Figure 5A, is shown to lead to minor displacements of individual data points in the compositional space; however, it does not overprint compositional biosignatures separating the two sample groups (Figure 5B).
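The behavior of an omnipresent, invariant interference follows from a basic property of the variance–covariance matrix: adding the same value at each band position to every spectrum shifts the column means but leaves variances and covariances unchanged. The sketch below checks this numerically in Python. The random data matrix and the sinusoidal interference are hypothetical and serve only as an idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix: 10 spectra (rows) x 5 band positions (columns).
data = rng.normal(size=(10, 5))

# Invariant interference: one fixed value per band, identical in all spectra.
interference = 3.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 5))
contaminated = data + interference

cov_clean = np.cov(data, rowvar=False)
cov_contaminated = np.cov(contaminated, rowvar=False)

# The covariance matrices agree, so the PCA eigenvectors and the share of
# variance per PC are the same with and without the interference.
unchanged = np.allclose(cov_clean, cov_contaminated)
```

The equality is exact only in this idealized case. An interference that varies among spectra, or normalization applied after the interference has been added, contributes some variance of its own, which is why the separation of clusters is assessed empirically in the experiments above.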

Although a decrease in the signal-to-noise ratio of spectral data results in increased convergence of the two sample clusters in the ChemoSpace (regardless of the nature of unwanted signal present), clusters remain separated even for spectra with a signal-to-noise ratio of 1:100 (Figures 3H–I and 4C). The spectra modeled here in Figures 4 and 5 (with signal-to-noise ratios ranging from 1:100 to 1:1) contain more unwanted signal than most published Raman spectra. Informative spectral content ranges from ~42% to 90% (±1σ) in the biological and medical literature (Figure 6; based on 8 spectral data sets, with 3–5 replicates analyzed) and ~69% to 98% in the molecular paleobiological literature2, 11-14, 16-19 excluding one statistical outlier20 (Figure 6; based on 6 spectral data sets, with 3–5 replicates analyzed). Field-specific ranges of compositional signal content in published spectra mostly overlap. Comparatively high signal-to-noise ratios in carbonaceous fossilization products of biological tissues are the result of smoother textures following dehydration and compaction, as well as reduced fluorescence (Figure 6). Regardless of the type of unwanted signal present in biological or geological organic Raman spectra, PCA reliably extracts informative features (see Figure 1 for denoising).

3.3 Minor in-session spectrometer decalibration does not overprint ChemoSpace biosignatures

Spectrometer decalibration of ±1 cm−1 is only evident across PCs 3 and 4 (Figure 7) and explains less than 0.1% of the variance in this data set. Thus, any type of biosignature accounting for more than 0.06% of the variance (loading PC 3) in the data set will outweigh the decalibration signal in the ChemoSpace. All previously published spectroscopic biosignatures5-7, 11-18 exceed the amount of variance resulting from decalibration by at least two orders of magnitude.
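A decalibration experiment of this kind can be sketched in Python as follows. The sketch uses a toy model and does not reproduce the data or the percentages reported here: the band positions, band heights, band width (σ = 15 cm−1), and function names are all hypothetical.

```python
import numpy as np


def gaussian(x, center, sigma, height):
    return height * np.exp(-0.5 * ((x - center) / sigma) ** 2)


def explained_variance(data):
    """Fraction of total variance per PC (variance-covariance PCA)."""
    cov = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]       # descending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)     # drop round-off negatives
    return eigenvalues / eigenvalues.sum()


axis = np.arange(400.0, 1801.0)                       # cm^-1, 1 cm^-1 steps
bands = np.array([600.0, 1000.0, 1200.0, 1450.0, 1600.0])  # hypothetical


def spectrum(heights, offset):
    """Synthetic spectrum whose bands are displaced by `offset` cm^-1."""
    return sum(gaussian(axis, c + offset, 15.0, h) for c, h in zip(bands, heights))


type_1 = [1.0, 0.6, 0.3, 0.8, 0.5]                    # hypothetical band heights
type_2 = [0.4, 1.0, 0.7, 0.3, 0.9]

rows = []
for heights in (type_1, type_2):
    for offset in (-1.0, 0.0, 1.0):                   # decalibration in cm^-1
        # Intensities are always read at the nominal band positions.
        rows.append(np.interp(bands, axis, spectrum(heights, offset)))
data = np.array(rows)                                 # 6 spectra x 5 bands

explained = explained_variance(data)
```

In this toy model nearly all variance falls on the first PC, which separates the two sample types, and the offsets contribute only to the remaining PCs. Intensities read at band maxima respond weakly to a small offset because the band is flat at its maximum; positions on band flanks respond more strongly to the same offset, so the variance attributable to decalibration depends on where intensities are extracted.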

3.4 Standard adaptive baselining increases comparability and ChemoSpace signal extraction

It is essential to subtract spectral backgrounds without affecting informative bands, in order to prevent differences in background shape from appearing as a major source of variance which could potentially overprint biosignatures.1 Linear baseline subtraction (Figure 8A) does not completely remove nonlinear background signals, which are common in spectra of heterogeneous and stratified biological tissues, and may introduce or amplify spectral incomparability (see linear baseline subtraction applied to sample spectrum 1 in Figure 8A). Adaptive baselining, in contrast, eliminates all types of background signal (Figure 8A) regardless of shape. Baselining may result in minor spatial convergence of informative clusters (sample types 1 and 2) in the ChemoSpace (Figure 8B), if adaptivity exceeds the standard (less of the original spectral signal remains; threshold determined here: <30% in SpectraGryph34): as baseline adaptivity increases beyond the standard, broad Raman bands of high intensity lose comparatively more signal than narrow bands with relatively low intensity (Figure 8A). This loss of the biological signal encoded in informative band ratios decreases the separation of groups in the ChemoSpace (indicated by the red arrows in Figure 8B). Standard adaptive baselines (threshold: 30% in SpectraGryph34) increase intra-group comparability without cluster convergence, resulting in the collapse of individual spectral data points within a sample type in the ChemoSpace (Figure 7B).
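The preferential loss of signal from broad bands under an overly adaptive baseline can be illustrated with a generic stand-in for adaptive baselining. The Python sketch below uses a morphological opening (rolling minimum followed by rolling maximum), which is not the SpectraGryph algorithm; the window half-width is only loosely analogous to the adaptivity setting, and all function names, band parameters, and the background shape are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling(values, half_width, reducer):
    """Apply np.min or np.max over a centered window with edge padding."""
    padded = np.pad(values, half_width, mode="edge")
    windows = sliding_window_view(padded, 2 * half_width + 1)
    return reducer(windows, axis=1)


def opening_baseline(intensities, half_width):
    """Baseline estimate: rolling minimum followed by rolling maximum."""
    eroded = rolling(intensities, half_width, np.min)
    return rolling(eroded, half_width, np.max)


def gaussian(x, center, sigma, height):
    return height * np.exp(-0.5 * ((x - center) / sigma) ** 2)


x = np.arange(0.0, 2001.0)                            # index equals position
background = 300.0 + 0.00005 * (x - 1000.0) ** 2      # smooth, nonlinear
broad = gaussian(x, 700.0, 30.0, 100.0)               # broad band
narrow = gaussian(x, 1400.0, 5.0, 100.0)              # narrow band, same height
spectrum = background + broad + narrow


def band_ratio(half_width):
    """Broad-to-narrow band height ratio after baseline subtraction."""
    corrected = spectrum - opening_baseline(spectrum, half_width)
    return corrected[700] / corrected[1400]


ratio_coarse = band_ratio(150)   # window much wider than both bands
ratio_fine = band_ratio(20)      # window narrower than the broad band
```

With the wide window both bands keep most of their height and their ratio stays close to the true value of 1. With the narrow window the baseline follows the broad band and removes most of it while the narrow band is almost untouched, so the band ratio is distorted in the way described above.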

3.5 Normalization increases comparability and improves signal

Spectral normalization emphasizes key differences within a sample set by amplifying differences in relative spectral intensities (Figure 4B). The ChemoSpace PCA shows how normalization increases direct comparability (all normalized spectra share a highest peak scaled to 1; they do not range in intensity counts over orders of magnitude as the raw spectra do) across synthetic replicates, as demonstrated by the closer grouping of data points within a subsample (Figure 4D). Normalization also homogenizes the distribution of data within clusters associated with signals 1 and 2, an inference based on the resulting uniform data point spacing (Figure 4D) compared to the non-uniform data point spacing observed among non-normalized spectra (Figure 4C). Most biosignatures are encoded in the relative abundance of different functional groups,1, 2 so spectral normalization (based on the highest peak in the spectrum, a different informative peak, or a spectral area) facilitates the extraction of meaningful signal. The suitability of different modes of normalization for individual data sets depends on the specific question and the nature and comparability of spectral intensities in the sample set.

4 CONCLUSIONS

Quantification of the impact of sample size, instrument features, and spectral processing on the occupation of the ChemoSpace provides an analytical framework for the extraction of molecular biosignatures from spectroscopic fingerprints of tissues from extant and extinct organisms: Minor instrument decalibration during an analytical session does not overprint major biological signatures in a ChemoSpace PCA. Spectral processing routines, such as standard adaptive baseline subtraction, as well as normalization prior to statistical analysis of spectra, increase data comparability and facilitate the extraction of informative features. Stable ChemoSpace occupation can be achieved with fewer than 10 spectra per sample group when analyzing biosignatures. PCA facilitates the distinction of informative compositional and systemic unwanted signals, regardless of the waveform, periodicity, frequency, and amplitude of a spectral interference, even at relatively low signal-to-noise ratios. The ChemoSpace approach to biosignatures represents a powerful tool for exploring, denoising, and integrating information from modern and ancient organismal samples.

ACKNOWLEDGMENTS

The authors thank D. Briggs for helpful comments and edits and J. Eiler, M. Brown, and G. Rossman for helpful conversations.

CONFLICT OF INTEREST STATEMENT

We declare no competing interests.

DATA AVAILABILITY STATEMENT

All source data, synthetic interference functions, and corresponding variance–covariance matrices are available in the Supporting Information spreadsheet (Tables S1–S9) and are intended to be published alongside our article.