Abstract
Rapid quantification of sulfur-containing amino acids, particularly cysteine, in legumes is critical for assessing nutritional quality, supporting breeding program screening, and ensuring consistency in quality control processes. However, conventional methods, such as high-performance liquid chromatography (HPLC), are time-consuming and resource-intensive for high-throughput applications. This study evaluated artificial intelligence models for predicting cysteine concentration from surface-enhanced Raman spectroscopy (SERS) spectra of pea extracts. SERS spectra were acquired from 20 cultivars grown at three geographically distinct locations, with HPLC-measured cysteine concentrations as the ground-truth reference. Linear regression, partial least squares regression, support vector regression, random forest regression, and a one-dimensional convolutional neural network (1D-CNN) were compared using within-cultivar splits and leave-one-cultivar-out (LOCO) evaluation. The 1D-CNN achieved an RMSE of 0.008 g/100 g within cultivars and maintained performance under LOCO, while the other models showed limited generalization. Shapley Additive Explanations highlighted informative bands in the 630–760 cm⁻¹ range, and noise modeling was used to optimize scan-count selection.
Keywords
SERS, Legumes, Amino-acids, Deep-learning, Generalization, SHAP, Noise-modeling
1. Introduction
Legumes are an important source of plant-based protein in human diets (Lisciani et al., 2024; Samal et al., 2023). Their seeds contain approximately 20–45% protein on a dry weight basis, depending on species and cultivar (Maphosa et al., 2017). This is higher than the protein content of most widely consumed plant-based foods, including cereals (7–15%) and vegetables (1–5%) (Boye et al., 2010). Despite this advantage, legume protein quality is constrained by low levels of the sulfur-containing amino acids (SCAAs), cysteine and methionine (Iqbal et al., 2006). Together, these two amino acids can at times constitute the primary limiting essential and semi-essential amino acids determining protein quality. Peas, beans, and lentils typically contain only 12–18 mg/g protein of cysteine plus methionine, which is below the 22–25 mg/g protein recommended in the Food and Agriculture Organization/World Health Organization (FAO/WHO) amino acid reference pattern for high-quality dietary protein (WHO/FAO/UNU, 2025). SCAA levels are influenced by cultivar genetics and environmental conditions, such as soil type, climate, and agronomic practices, as well as by genotype-by-environment (G×E) interactions (Gerrano et al., 2022). High-throughput, cultivar-robust quantification methods are therefore important for the reliable, routine identification of high-SCAA genotypes and for quality control in commercial legume protein ingredients. Conventional analytical methods, such as high-performance liquid chromatography (HPLC) (Snyder et al., 2010) and gas chromatography–mass spectrometry (GC–MS) (Sparkman et al., 2011), provide accurate SCAA measurements but require multi-step sample preparation, including protein hydrolysis and complex derivatization. They depend on specialized equipment, costly reagents, and long analysis times, which limit their applicability for large-scale screening and rapid quality control in the food industry.
bioRxiv preprint doi: https://doi.org/10.64898/2026.03.20.713189; this version posted March 24, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
These limitations have motivated a shift toward spectroscopic methods that enable faster, more direct analysis. Vibrational spectroscopy methods, including infrared (IR), near-infrared (NIR), and Raman spectroscopy, provide detailed information on molecular structure, bonding, and composition (Bokobza, 1998; Ng & Simmons, 1999). IR-based techniques can be limited by water absorption and sample preparation requirements (Chon et al., 2021), whereas Raman spectroscopy is less affected by water and can be applied to aqueous extracts with minimal sample handling (Park et al., 2023). A practical limitation is that conventional Raman scattering is a low-probability phenomenon with weak intensity, reducing sensitivity for low-concentration analytes (Das & Agrawal, 2011). Surface-enhanced Raman spectroscopy (SERS) addresses this by using plasmonic nanostructures to amplify Raman signals and improve the sensitivity of detection for low-abundance analytes (Moskovits, 1985; Pilot et al., 2019). A defining property of quantitative SERS, in contrast to SERS for detection alone, is that under controlled experimental conditions, the scattered intensity is proportional to the number of molecules contributing to the enhancement. This means that spectral responses exhibit approximately proportional, monotonic behavior with analyte concentration, roughly analogous to Beer–Lambert-type relationships amenable to linear regression, and stable, repeatable patterns well suited to machine learning (ML) analysis.
However, in complex food and biological matrices, SERS measurements are compromised by substrate heterogeneity, adsorption effects, fluorescence background, and non-linear baseline drift (Grys et al., 2021; Pilot et al., 2019). Conventional univariate or linear chemometric techniques often fail to decouple the target analyte signal from these complex, stochastic interferences. This limitation necessitates the use of artificial intelligence (AI) methods, including ML and deep learning (DL), to learn a quantitative mapping from SERS spectra to target analyte concentration. Recent label-free SERS studies show that ML/DL approaches can extract chemical and structural information from SERS spectra for discrimination and recognition tasks. Barucci et al. developed a hybrid strategy combining peak fitting with principal component analysis to discriminate proteins with closely similar spectral profiles, providing a reproducible approach to capture structure-dependent spectral variation in human and animal proteins (Barucci et al., 2021). Peng et al. implemented a DL-based, label-free SERS framework for screening and recognizing small-molecule binding sites in human drug-target proteins (Peng et al., 2022). While these studies focus on animal and human proteins and primarily address qualitative SERS analysis, related work in legume proteins supports extending this approach with ML/DL to quantitative analysis in complex food matrices (Findlay et al., 2025).
Accordingly, the present study develops and evaluates AI models to predict cysteine concentration from SERS spectra of pea (Pisum sativum L.) cultivars. Pea was selected as a representative legume matrix because pea protein is a widely used ingredient of growing importance in protein isolate production and processing, sustainable plant-based meat analogs, and human nutrition (Shanthakumar et al., 2022). Peas are self-pollinating; thus, their pedigrees and lineages are well characterized, and cultivars exhibit genetic stability. Peas are also the subject of well-developed breeding programs and cultivar collections, making them a good candidate for amino acid panel analysis by HPLC to establish reference values. Cysteine was selected as the analytical target because it contributes directly to the total SCAA pool and exhibits thiol-based surface-binding chemistry compatible with quantitative SERS (Findlay et al., 2025). A dataset of SERS spectra collected from 20 pea cultivars was used to investigate whether AI models can learn chemically meaningful relationships between spectral patterns and cysteine concentration.
To evaluate the complexity required to model these data, five algorithms were selected, ranging from linear regression (LR) to convolutional neural networks (CNN). LR and partial least squares regression (PLSR) were included to establish a baseline and to represent standard chemometric approaches that assume linear spectral–concentration relationships. To assess whether non-linear modeling alone improves performance, we evaluated support vector regression (SVR) and random forest regression (RFR), which capture complex boundaries and variable interactions but rely on fixed input features. Finally, a DL model, a one-dimensional CNN (1D-CNN), was assessed. Unlike standard regression models that treat spectral points as independent features, CNNs are designed to learn hierarchical, local patterns such as peak shapes, widths, and relative shifts. This capability is hypothesized to make DL models more robust to the absolute intensity fluctuations and baseline shifts common in SERS, enabling better generalization across cultivars.
To assess this generalization capability, an evaluation framework was designed to distinguish between two forms of spectral variability that shape model behavior. The first, referred to as intra-cultivar spectral variability, arises from instrumental and substrate-related effects, including fluorescence background, stochastic noise, and local variations in electromagnetic field enhancement across the SERS substrate. These sources of variation occur within cultivars and reflect the physics of the measurement process. Therefore, they are assessed using a within-cultivar evaluation strategy, in which the training and test sets consist of spectra from the same cultivar (Section 2.3). The second form, termed inter-cultivar spectral variability, arises from cultivar-dependent biochemical variability, driven by genotype × environment (G×E) interactions that modify the molecular composition of pea extracts. This biochemical variation changes SERS peak intensities, peak positions, and baseline curvature across cultivars. To evaluate model performance under these conditions, we applied a leave-one-cultivar-out (LOCO) cross-validation strategy, which tests the model on an unseen cultivar excluded from the training process (Section 2.3). Evaluating AI models under both intra- and inter-cultivar spectral variability is critical for assessing their suitability for practical applications in legume breeding and quality assessment. For practical deployment in large-scale breeding programs or industrial quality control, analytical models must be able to predict analyte concentrations in new, unseen cultivars without requiring retraining. Consequently, the ability to generalize across genotypes, despite significant G×E biochemical variability, is a prerequisite for the operational utility of SERS-based screening.
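
The LOCO protocol described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function name `loco_splits` and the toy cultivar labels are hypothetical, and the study's actual splitting code is not shown in this section.

```python
from collections import defaultdict


def loco_splits(cultivar_labels):
    """Yield (held_out, train_idx, test_idx), holding out one cultivar per fold."""
    by_cultivar = defaultdict(list)
    for idx, name in enumerate(cultivar_labels):
        by_cultivar[name].append(idx)
    for held_out in sorted(by_cultivar):
        # All spectra of the held-out cultivar form the test set.
        test_idx = by_cultivar[held_out]
        # Spectra of every other cultivar form the training set.
        train_idx = [
            i
            for name, idxs in by_cultivar.items()
            if name != held_out
            for i in idxs
        ]
        yield held_out, train_idx, test_idx


# Toy example: three hypothetical cultivars with four spectra each.
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
folds = list(loco_splits(labels))
```

Because the split is made on the cultivar label rather than on individual spectra, no spectrum from the test cultivar can leak into training. scikit-learn's `LeaveOneGroupOut` offers the same behavior when the cultivar label is passed as the group.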
Finally, we extended the evaluation to address two practical aspects of deployment: model interpretability and data acquisition efficiency. Shapley Additive Explanations (SHAP) were applied to identify the specific spectral features driving the predictions, ensuring that the model relies on chemically relevant vibrational modes. Separately, to optimize operational efficiency, a noise-modeling study was conducted to determine the minimum number of scans required for accurate prediction, providing practical guidelines for reducing acquisition time in high-throughput settings.
2. Materials and Methods
This section is divided into two main parts. Section 2.1 describes the experimental workflow and dataset generation, including the biological materials, sample preparation, SERS measurements, and the structure of the resulting spectral dataset. Section 2.2 details the AI analysis framework, including preprocessing, ML baselines, the DL architecture used to predict cysteine concentration from SERS spectra, and the model training and evaluation procedures. The overall workflow is summarized in Figure 1, which includes the post hoc model interpretability using SHAP and the noise-modeling study to guide scan count optimization.
2.1. Overview of Experimental Data
This subsection describes the experimental workflow used to generate the SERS dataset for predicting cysteine concentration. The workflow includes cultivar selection, reference cysteine quantification, preparation of sample extracts, fabrication/preparation of SERS substrates, and SERS spectral acquisition. These steps produced a structured spectral dataset that serves as the input to the AI analyses described in Section 2.2.
2.1.1. Pea Cultivars and Reference Compound
Flours from twenty pea cultivars from the CDC breeding program at the University of Saskatchewan were selected based on their contrasting protein profiles to represent a diverse range: AAC Chrome, AAC Lacombe, AAC Liscard, CDC Amarillo, CDC Athabasca, CDC Canary, CDC Dakota, CDC Golden, CDC Greenwater, CDC Inca, CDC Jasper, CDC Striker, CDC Lewochko, CDC Meadow, CDC Patrick, CDC Saffron, CDC Spectrum, CDC Spruce, CDC Tetris, and Redbat 88. The prefixes in these names indicate their breeding origin: 'AAC' denotes varieties from Agriculture and Agri-Food Canada, and 'CDC' denotes those from the Crop Development Centre. Samples from each cultivar were ground into flour and analyzed for cysteine using the oxidative hydrolysis HPLC method described below. The corresponding HPLC reference cysteine concentrations for each cultivar across the three locations are provided in Table S1 (Supplementary Material).
Figure 1. Overall workflow for SERS data acquisition and AI-based prediction of cysteine concentration. Left: SERS dataset generation from pea cultivars across three locations (60 samples). Sample extracts were prepared and measured on P-SERS substrates using a 785 nm excitation source and fiber-optic probe–based backscattering collection, with spectra acquired at multiple spots per substrate (3 spots/substrate, 36 spectra/spot; 108 spectra/sample; 6,480 total spectra). Right: AI-based modeling pipeline, including spectral preprocessing (smoothing, baseline correction, normalization) and dataset assembly by pairing preprocessed spectra with HPLC-derived cysteine concentrations (ground truth). Models included machine-learning baselines (LR, PLSR, SVR, RFR) and a 1D-CNN. Performance was evaluated using intra-cultivar splits and inter-cultivar (LOCO) testing, with RMSE, MAE, and R² as evaluation metrics. The best-performing 1D-CNN was further analyzed using SHAP for interpretability and noise modeling to optimize scan count (acquisition time).
Reference cysteine concentrations were determined using the performic acid oxidation–acid hydrolysis HPLC method described in Findlay et al. (2025). In this procedure, proteins in pea flour extracts were oxidized with performic acid to convert cysteine (including disulfide-linked forms) to stable cysteic acid. Oxidized samples were then subjected to acid hydrolysis, derivatized using the AccQ-Tag Ultra reagent system, and separated on an AccQ-Tag Ultra C18 (1.7 μm) reversed-phase column using a Shimadzu UPLC system equipped with an SIL-30AC autosampler. Quantification was performed using calibration standards and hydrated amino acid molecular weights to obtain accurate cysteine-equivalent concentrations. L-cysteine (≥97%, Sigma-Aldrich) was used as the reference amino acid standard for calibration during HPLC analysis.
2.1.2. Sample and Substrate Preparation
Alkaline extracts were prepared from pea flour samples (Section 2.1.1), and SERS spectral acquisition was performed following the method described by Findlay et al. (2025), with the modifications to reagent ratios and parameters described below. Pea flour was dispersed in Milli-Q water (0.2 g per 1 mL). The suspension was kept on an ice bath and homogenized at a speed setting of 5 (~20,000 rpm) for 30 seconds with an IKA Ultra-Turrax homogenizer (IKA-Werke GmbH & Co. KG, Staufen, Germany). Alkaline extraction was performed by adding 25 µL of a pre-prepared 1 M NaOH solution to 1.0 mL of pea homogenate, yielding a final pH of approximately 9. The mixture was vortexed for 10 seconds and incubated at room temperature (25 °C) for 2 hours. Samples were then centrifuged at 8000 rpm (~5000 × g, rotor radius 7 cm) for 15 minutes at 4 °C. The supernatant was collected as the alkaline extract and stored at −80 °C.
Just prior to spectral acquisition, frozen extracts were thawed at room temperature and vortexed to ensure homogeneity. For each sample, 200 µL of extract was transferred into an individual silicone well and mixed with 100 µL of 20 mM tris(2-carboxyethyl)phosphine (TCEP) solution at pH 7, resulting in a final working volume of 300 µL. TCEP was added to reduce disulfide bonds and liberate free thiols for chemisorption to the SERS substrate. Prefabricated paper-based SERS (P-SERS) substrates (Metrohm) were used. Each substrate was positioned over a silicone well, and the handle was removed to allow the plasmonic surface tip to fall into the extract–TCEP mixture with the active side facing upward. The substrate was fully immersed and incubated at room temperature for 45 minutes to ensure consistent analyte–surface interaction. Following incubation, the substrates were transferred directly to the Raman system stage for spectral acquisition without drying or additional processing.
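
The quantities quoted in this subsection can be checked with a short calculation. The sketch below is illustrative only: it uses the standard relative centrifugal force relation, RCF = 1.118 × 10⁻⁵ · r(cm) · rpm², and assumes that volumes are simply additive on mixing. The function names are hypothetical.

```python
def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (multiples of g) from rotor speed and radius."""
    return 1.118e-5 * radius_cm * rpm ** 2


def final_concentration(stock_conc, stock_vol, total_vol):
    """Concentration after dilution, assuming additive volumes (C1*V1 = C2*V2)."""
    return stock_conc * stock_vol / total_vol


# 8000 rpm with a 7 cm rotor radius, as stated for the centrifugation step.
rcf = rcf_from_rpm(8000, 7.0)

# 100 uL of 20 mM TCEP in a 300 uL working volume.
tcep_mM = final_concentration(20.0, 100.0, 300.0)

# 25 uL of 1 M (1000 mM) NaOH added to 1.0 mL of homogenate (1025 uL total).
naoh_mM = final_concentration(1000.0, 25.0, 1025.0)

print(f"RCF ~ {rcf:.0f} x g, TCEP ~ {tcep_mM:.2f} mM, NaOH ~ {naoh_mM:.1f} mM")
```

Under these assumptions the rotor speed corresponds to roughly 5000 × g, consistent with the value given in the text, and the extract is exposed to about 6.7 mM TCEP during incubation.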
2.1.3. SERS Spectral Acquisition
SERS measurements were acquired using a Raman system equipped with a 785 nm excitation laser delivering 100 mW at the sample surface. The system consisted of a Raman spectrometer coupled to a microscope-mounted sampling stage, enabling reproducible positioning of the P-SERS substrates beneath the Raman probe. Spectra were collected with a 1000 ms integration time and two co-additions, which were automatically combined by the instrument control software into a single stored spectrum, thereby balancing signal quality and measurement speed. Following the 45-minute incubation described in Section 2.1.2, each silicone well with its P-SERS substrate was transferred directly to the Raman sampling stage while still immersed. The laser spot was focused on the submerged surface of the SERS substrate with the plasmonic sensing region oriented upward to maintain consistent optical alignment between the Raman probe and the active substrate surface.
To characterize nanoscale surface heterogeneity, spectra were acquired from three distinct spots on each P-SERS substrate. At each spot, 36 sequential spectra were collected without adjusting the optical focus or repositioning the substrate, yielding 108 spectra per sample. The same measurement protocol was repeated independently for samples obtained from each of the three Saskatchewan growing locations (Limerick, Rosthern, and Sutherland), yielding a total of 324 spectra per cultivar.
2.1.4. Dataset
The complete dataset consisted of 6,480 raw SERS spectra (20 cultivars × 3 locations × 108 spectra). Each spectrum was a fixed-length vector of SERS intensities, indexed by Raman shift (cm⁻¹), and was associated with a reference cysteine concentration determined by HPLC. All spectra from a given cultivar and location shared the same reference value. This yields a diverse dataset that captures both intra-cultivar spectral variability associated with instrumental and substrate-related effects (using P-SERS substrates from two separate manufacturing batches) and inter-cultivar spectral variability arising from cultivar- and location-dependent biochemical differences. It therefore supports a robust evaluation of model performance and generalizability for predicting HPLC cysteine concentration from SERS spectra.
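
The dataset layout described above can be expressed as index arrays. The sketch below is a hypothetical arrangement, not the authors' data structure: the array names are illustrative, the spectrum length of 1,496 bins is taken from Section 2.2.2.1, and the label values are dummy placeholders rather than measured concentrations.

```python
import numpy as np

N_CULTIVARS, N_LOCATIONS = 20, 3
SPOTS_PER_SUBSTRATE, SPECTRA_PER_SPOT = 3, 36
N_BINS = 1496  # Raman shift bins per spectrum (Section 2.2.2.1)

spectra_per_sample = SPOTS_PER_SUBSTRATE * SPECTRA_PER_SPOT  # 108
n_samples = N_CULTIVARS * N_LOCATIONS                        # 60
n_spectra = n_samples * spectra_per_sample                   # 6480

# One row per spectrum; each sample is one (cultivar, location) pair.
sample_id = np.repeat(np.arange(n_samples), spectra_per_sample)
cultivar_id = sample_id // N_LOCATIONS
location_id = sample_id % N_LOCATIONS

# Placeholder arrays only: zeros for intensities and the sample index as a
# dummy label. These are NOT measured spectra or HPLC values.
X = np.zeros((n_spectra, N_BINS), dtype=np.float32)
label_per_sample = np.arange(n_samples, dtype=float)
y = label_per_sample[sample_id]  # every spectrum inherits its sample's label
```

Keeping `cultivar_id` alongside `X` and `y` is what makes cultivar-level splitting possible later, since all 108 spectra of a sample share a single reference value.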
2.2. AI-Based Modeling and Data Analysis Framework
This section describes the computational workflow used to develop models to predict HPLC cysteine concentration from SERS spectra. All computational steps were applied after the spectral dataset described in Section 2.1 was generated. The workflow includes spectral preprocessing, implementation of ML and DL models, and model evaluation.
2.2.1. Spectral Preprocessing of SERS Data
Preprocessing is essential for preparing SERS spectra for AI-based modeling. Raw spectra often contain distortions, including baseline drift, random noise, fluorescence background, and intensity fluctuations, which can obscure chemically meaningful features and introduce non-chemical variability. Although many preprocessing algorithms exist, their suitability depends on the underlying physics of the spectroscopic technique. Accordingly, preprocessing should be tailored to the dominant sources of variability rather than applied in a generic manner.
In SERS and Raman spectra, one of the most prominent artifacts is the fluorescence background. It is a high-intensity, smoothly varying signal that can overwhelm Raman peaks and originates from both sample constituents and detector effects such as charge-coupled device (CCD) baseline drift (Bocklitz et al., 2011; Liland et al., 2016). SERS and Raman spectra are also affected by cosmic-ray artifacts, which appear as sharp, non-physical peaks resulting from high-energy particle impacts on the detector and can bias spectral analysis (Bocklitz et al., 2011; Wu & Chen, 2017). Additional distortions arise from Gaussian noise and other stochastic fluctuations inherent to Raman scattering, which reduce signal-to-noise ratios (S/N) in low-concentration measurements (Wahl et al., 2020). Fluctuations in laser power and changes in optical focusing further contribute to inconsistent peak heights across measurements. Another source of variation results from batch-to-batch differences in commercial SERS substrates. In this case, differences in nanostructure morphology and surface chemistry modify the local electromagnetic field distribution and change signal intensities across identical samples (Jeon et al., 2025).
Numerous preprocessing methods have been proposed to address these issues, including baseline correction (Li et al., 2013; Lieber & Mahadevan-Jansen, 2003; Morháč & Matoušek, 2008; Peng et al., 2010), smoothing (Chen et al., 2013; Gorry, 1990; Wand & Jones, n.d.), spike removal (Justusson, 1981; Li & Dai, 2011; Whitaker & Hayes, 2018), normalization, and derivative-based approaches (Brereton, n.d.; Fearn et al., 2009).
In this study, SERS spectra were preprocessed using a workflow consisting of Savitzky–Golay (SG) smoothing (Savitzky & Golay, 1964), modified polynomial baseline correction (ModPoly) (Xia et al., 2018), and min–max normalization. First, SG smoothing was applied to reduce high-frequency noise while preserving peak shape. It applies a low-order polynomial filter within a moving window, where the window length and polynomial order control the degree of smoothing. Second, the ModPoly baseline correction was used to remove the fluorescence background and the slowly varying baseline curvature. This method estimates a polynomial baseline that captures the background trend of the spectrum, with the polynomial degree controlling its curvature. Cosmic-ray artifacts were addressed through the acquisition strategy rather than through post-processing. By collecting spectra with minimal co-addition, cosmic-ray events were limited to single replicates rather than being averaged into the final spectra. This helps prevent attenuation of small but chemically relevant peaks. Finally, for the ML baseline models (LR, PLSR, RFR, and SVR), min–max normalization scaled each spectrum to the 0–1 range, thereby minimizing sensitivity to absolute intensity differences. In contrast, the 1D-CNN used unscaled spectral inputs, with internal scaling handled via batch-normalization layers (Section 2.2.2.5) rather than external normalization.
For each AI model described in Section 2.2.2, the preprocessing hyperparameters (SG window length, SG polynomial order, and ModPoly degree) were tuned using a grid search on the training set, with performance evaluated on a held-out validation set. The final preprocessing configuration used for each model is summarized in Table 1.
Table 1. Preprocessing configurations used for each model. All models use Savitzky–Golay (SG) smoothing and modified-polynomial baseline correction (ModPoly). LR, PLSR, RFR, and SVR apply min–max normalization, whereas the 1D-CNN relies on internal normalization (batch-normalization layers).
Model     SG Window Length   SG Polynomial Order   ModPoly Degree   Normalization
LR        11                 3                     3                Min-Max
PLSR      5                  3                     9                Min-Max
RFR       17                 4                     11               Min-Max
SVR       9                  3                     7                Min-Max
1D-CNN    11                 3                     2                -
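
The preprocessing chain described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `modpoly_baseline` is a simplified iterative polynomial fit in the spirit of ModPoly, the stopping rule and iteration limit are illustrative choices, and the synthetic spectrum (Raman shift axis from 400 to 1800 cm⁻¹, one Gaussian peak, quadratic background) is invented for the demonstration. The default parameters follow the 1D-CNN row of Table 1.

```python
import numpy as np
from scipy.signal import savgol_filter


def modpoly_baseline(y, degree, n_iter=100, tol=1e-3):
    """Estimate a polynomial baseline by repeatedly clipping peaks to the fit."""
    x = np.linspace(-1.0, 1.0, y.size)  # scaled axis keeps the fit well conditioned
    work = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        coeffs = np.polynomial.polynomial.polyfit(x, work, degree)
        fit = np.polynomial.polynomial.polyval(x, coeffs)
        clipped = np.minimum(work, fit)  # points above the fit are pulled down to it
        converged = np.linalg.norm(clipped - work) <= tol * np.linalg.norm(work)
        work = clipped
        if converged:
            break
    return fit


def preprocess(spectrum, sg_window=11, sg_order=3, poly_degree=2, minmax=False):
    """SG smoothing, polynomial baseline removal, optional min-max scaling."""
    smoothed = savgol_filter(spectrum, window_length=sg_window, polyorder=sg_order)
    corrected = smoothed - modpoly_baseline(smoothed, poly_degree)
    if minmax:
        span = corrected.max() - corrected.min()
        if span > 0:
            corrected = (corrected - corrected.min()) / span
        else:
            corrected = np.zeros_like(corrected)
    return corrected


# Synthetic demonstration spectrum (not measured data).
shift = np.linspace(400.0, 1800.0, 1496)
rng = np.random.default_rng(0)
peak = 50.0 * np.exp(-0.5 * ((shift - 680.0) / 8.0) ** 2)
background = 1e-4 * (shift - 400.0) ** 2 + 200.0
raw = peak + background + rng.normal(0.0, 1.0, shift.size)
clean = preprocess(raw, minmax=True)
```

Setting `minmax=True` mirrors the configuration used for LR, PLSR, RFR, and SVR, while `minmax=False` corresponds to the unscaled inputs passed to the 1D-CNN.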
2.2.2. Machine Learning and Deep Learning Models
To analyze the preprocessed SERS spectra, four ML algorithms were evaluated: LR, PLSR, RFR, and SVR. In addition, a 1D-CNN was used as the DL model. Each model was trained independently using the preprocessing workflow described in Section 2.2.1. The model descriptions, training procedures, and hyperparameter settings are outlined below. To ensure reproducibility, the complete source code and a sample dataset are openly available at https://github.com/Elhamm1/SERS-Data-Analysis/tree/main.
2.2.2.1. Linear Regression
LR was used as a simple baseline model to describe the relationship between the preprocessed SERS spectra and cysteine concentration. For each spectrum $\mathbf{x}$, the predicted cysteine value $\hat{y}$ was expressed as a linear combination of spectral intensities plus an intercept, $\hat{y} = \mathbf{w}^{T}\mathbf{x} + b$, where $\mathbf{x} \in \mathbb{R}^{p}$ denotes the full preprocessed spectrum vector (with $p = 1496$ Raman shift bins), $\mathbf{w}$ denotes the regression coefficients, and $b$ is the intercept term. The parameters $\mathbf{w}$ and $b$ were estimated by ordinary least squares, minimizing the sum of squared differences between predicted and HPLC-measured cysteine values in the training set, i.e., $\sum_{i=1}^{n} \left( y_i - (\mathbf{w}^{T}\mathbf{x}_i + b) \right)^{2}$, where $n$ is the number of training spectra. Because LR assumes a linear relationship between spectral features and cysteine concentration, it provides a baseline for comparison with more flexible ML and DL models.
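
The least-squares fit described above can be illustrated with NumPy. This is a toy sketch on synthetic data, not the study's implementation: the problem size (200 samples, 20 features) and all variable names are assumptions chosen so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20  # toy sizes; the study uses p = 1496 spectral bins
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
b_true = 0.5
y = X @ w_true + b_true + rng.normal(scale=0.01, size=n)

# Append a column of ones so the intercept b is estimated with the weights.
X_aug = np.hstack([X, np.ones((n, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = theta[:-1], theta[-1]

# Objective value at the solution: sum of squared residuals.
y_hat = X @ w_hat + b_hat
sse = float(np.sum((y - y_hat) ** 2))
```

When the number of features exceeds the number of training spectra, the least-squares problem is underdetermined; `np.linalg.lstsq` then returns the minimum-norm solution, which is one reason plain LR serves only as a baseline for high-dimensional spectra.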
2.2.2.2. Partial Least Squares Regression
PLSR was included as a standard chemometric approach capable of handling strong collinearity across spectral variables. Unlike ordinary linear regression, which operates directly on the original intensities $\mathbf{x}$, PLSR projects each spectrum $\mathbf{x} \in \mathbb{R}^{p}$ into a lower-dimensional set of latent variables (components) that are constructed to maximize the covariance between the spectral features and cysteine concentration. The predicted cysteine concentration $\hat{y}$ is then expressed as a linear combination of these latent variables plus an intercept, $\hat{y} = c_1 t_1 + c_2 t_2 + \cdots + c_A t_A + b$, where $t_1, \ldots, t_A$ are the latent components extracted from the spectra $\mathbf{x}$, $c_1, \ldots, c_A$ are the corresponding regression coefficients, $A$ is the number of components, and $b$ is an intercept term.
In this study, the optimal number of latent components was determined using five-fold cross-validation on a predefined search grid, yielding a final model with $A = 25$ components. The model was then refit using the selected number of components and used to generate predictions for the validation spectra. This procedure yields an LR model in a latent space aligned with cysteine variation and serves as a strong chemometric benchmark for comparison with the nonlinear ML and DL models.
2.2.2.3. Support Vector Regression
SVR was used as a kernel-based nonlinear model to capture more complex relationships between the preprocessed SERS spectra and cysteine concentration. In SVR, the prediction for a new spectrum $\mathbf{x}$ is written as $\hat{y}(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b$, where $\mathbf{x} \in \mathbb{R}^{p}$, $\mathbf{x}_i$ are training spectra, $\alpha_i$ are learned weights, $b$ is an intercept term, and $K(\cdot,\cdot)$ is a kernel function that defines the similarity between pairs of spectra and maps the data into a high-dimensional feature space.
In this study, a radial basis function (RBF) kernel was used to enable flexible, smooth nonlinear fits. The RBF kernel was defined as $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}\right)$, where $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$ is the Euclidean distance between two spectra and $\gamma$ controls the rate of decay of similarity with spectral distance. Model training was formulated as an optimization problem that keeps the regression function flat while constraining prediction errors within an $\varepsilon$-insensitive tube around the observed cysteine values. Deviations larger than $\varepsilon$ are penalized through the regularization parameter $C$, which controls the trade-off between model complexity and error tolerance. Before SVR, input spectra were standardized to zero mean and unit variance. The hyperparameters $C$, $\varepsilon$, and $\gamma$ were selected by five-fold cross-validation over a predefined grid using a Smooth L1 (Huber) loss, and the final SVR model used $C = 5.0$, $\varepsilon = 0.01$, and $\gamma = 0.01$. With this configuration, SVR provides a flexible nonlinear baseline that can model smooth spectral–concentration relationships while controlling model complexity through regularization and the kernel parameters.
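
The kernel expansion above can be checked numerically against a fitted model. The sketch below uses scikit-learn's `SVR` with the reported hyperparameters on synthetic data; it is an illustration under stated assumptions, not the study's pipeline, and the helper `rbf_kernel` and the toy target function are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic inputs and a smooth nonlinear target (not measured data).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]

# Standardize to zero mean and unit variance before fitting, as in the text.
Z = StandardScaler().fit_transform(X)
model = SVR(kernel="rbf", C=5.0, epsilon=0.01, gamma=0.01).fit(Z, y)


def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


# Reproduce predictions from the dual form: sum_i alpha_i K(x_i, x) + b.
K = rbf_kernel(model.support_vectors_, Z, gamma=0.01)
manual = (model.dual_coef_ @ K + model.intercept_).ravel()
```

In the fitted model the sum runs only over the support vectors, because training spectra that fall inside the $\varepsilon$-tube receive a zero weight.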
2.2.2.4. Random Forest Regression
Random Forest Regression was used to model nonlinear relationships between the preprocessed SERS spectra and cysteine concentration by combining many decision trees. The training data consist of pairs $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the spectrum of sample $i$ (intensities at 1,496 Raman shift variables, denoted as $p$) and $y_i$ is its HPLC-measured cysteine value. In this approach, a decision tree learns a sequence of binary split rules based on individual spectral variables. At each split, the algorithm selects a Raman shift variable and a threshold to reduce the variation in cysteine values across the resulting child nodes. This splitting process is repeated recursively and stops when further splits no longer reduce variation or when the remaining node contains few samples. The final nodes (leaf nodes) contain training spectra with similar cysteine values, and the prediction from a single tree for a new spectrum is the mean cysteine value of the training samples that fall into the same leaf. A random forest combines the predictions of many such trees to improve generalization and reduce variance.
In this study, $T = 300$ trees were trained. Each tree was fitted on a sample of the training data, and at each split, only a subset of spectral variables was considered as candidates, with the number of candidate features set to $\sqrt{p}$. For a given spectrum $\mathbf{x}$, the forest prediction was obtained by averaging the outputs of all trees, $\hat{y}_{\mathrm{RF}}(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{x})$, where $f_t(\mathbf{x})$ is the prediction from the $t$-th tree. This ensemble structure enables RFR to capture nonlinear dependencies and interaction effects among spectral features while reducing variance by averaging across multiple diverse trees.
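The averaging step can be sketched with scikit-learn as follows. This is an illustrative example on synthetic data, not the authors' implementation; the bootstrap sampling used by scikit-learn's default settings, the random seed, and the toy data are assumptions. Only the number of trees (300) and the $\sqrt{p}$ candidate features per split come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 80 "spectra" with 64 variables each.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 64))
y = 0.2 + 0.05 * (X[:, 3] > 0) + 0.01 * rng.normal(size=80)

forest = RandomForestRegressor(
    n_estimators=300,      # T = 300 trees
    max_features="sqrt",   # sqrt(p) candidate variables at each split
    random_state=0,
)
forest.fit(X, y)

# The forest prediction is the mean of the individual tree predictions.
x_new = X[:5]
per_tree = np.stack([tree.predict(x_new) for tree in forest.estimators_])
y_forest = per_tree.mean(axis=0)
```

Here `y_forest` reproduces `forest.predict(x_new)`, which makes the $\frac{1}{T}\sum_t f_t(\mathbf{x})$ structure explicit.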
2.2.2.5. One-dimensional Convolutional Neural Network
A 1D-CNN was used as a DL model to capture nonlinear relationships between the SERS spectra and cysteine concentration. For an input spectrum $\mathbf{x} \in \mathbb{R}^{1496}$, the network treats $\mathbf{x}$ as a one-dimensional sequence with a single input channel and outputs a scalar prediction $\hat{y} = f_\theta(\mathbf{x})$. For the $i$-th spectrum $\mathbf{x}_i$, this is $\hat{y}_i = f_\theta(\mathbf{x}_i)$, where $f_\theta$ denotes the network parameterized by $\theta$.
The architecture comprised four consecutive convolutional blocks, followed by two fully connected layers. The convolutional blocks used 1D convolutions with a kernel size of 5 and increasing numbers of filters (16, 32, 64, and 128). Each block applied convolution, batch normalization, and a ReLU activation, followed by max-pooling with a pool size (and stride) of 2 to downsample the spectral axis. These pooling operations reduced the spectral length by an overall factor of 16. The resulting feature maps were flattened and passed to a fully connected layer with 128 units and ReLU activation, followed by dropout (rate = 0.3) to reduce overfitting. A final linear output layer with a single neuron produced the predicted cysteine value.
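A PyTorch sketch of an architecture matching this description is given below. It is a reconstruction under stated assumptions rather than the authors' model: the class name `SpectraCNN` is hypothetical, and the convolution padding (here 2, which preserves length before each pooling step) is not specified in the text. With that padding, the spectral length goes 1496 → 748 → 374 → 187 → 93, so the flattened feature vector has 128 × 93 entries.

```python
import torch
from torch import nn


class SpectraCNN(nn.Module):
    """Four conv blocks (16, 32, 64, 128 filters) followed by two dense layers."""

    def __init__(self, n_points=1496, dropout=0.3):
        super().__init__()
        channels = [1, 16, 32, 64, 128]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                # padding=2 is an assumption; it keeps the length unchanged.
                nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                nn.BatchNorm1d(c_out),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2, stride=2),
            ]
        self.features = nn.Sequential(*blocks)

        # Each pooling step halves the length (rounding down).
        pooled_len = n_points
        for _ in range(4):
            pooled_len //= 2

        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * pooled_len, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        # x has shape (batch, n_points); add the single input channel.
        return self.head(self.features(x.unsqueeze(1))).squeeze(-1)


model = SpectraCNN().eval()
with torch.no_grad():
    out = model(torch.randn(4, 1496))  # one scalar prediction per spectrum
```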
The network was trained using mini-batch gradient descent with a batch size of 32. The training objective was a Smooth L1 (Huber) loss with $\beta = 0.02$. Optimization was performed using AdamW with an initial learning rate of $1 \times 10^{-4}$ and decoupled weight decay, together with a OneCycle learning rate schedule over 100 epochs. Mixed-precision training and gradient clipping were used to stabilize optimization. Model performance was monitored on a held-out validation set at the end of each epoch, and the parameter set $\theta$ corresponding to the lowest validation loss was retained as the final 1D-CNN model. This architecture enables the 1D-CNN to learn hierarchical spectral features and capture nonlinear relationships between SERS patterns and cysteine concentration, while regularization helps limit overfitting.
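To make the role of $\beta$ concrete, the sketch below evaluates the Smooth L1 loss in NumPy. It assumes the standard Smooth L1 definition (quadratic for absolute errors below $\beta$, linear above), which differs from the classical Huber loss by a constant factor of $\beta$. The function name and the example error values are illustrative assumptions.

```python
import numpy as np


def smooth_l1(y_true, y_pred, beta=0.02):
    """Mean Smooth L1 loss: quadratic below beta, linear at or above it."""
    diff = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * diff ** 2 / beta
    linear = diff - 0.5 * beta
    return float(np.mean(np.where(diff < beta, quadratic, linear)))


# Loss for single predictions with absolute errors of 0, 0.01, 0.02, and 0.05 g/100 g.
per_sample = [smooth_l1([0.0], [d]) for d in (0.0, 0.01, 0.02, 0.05)]
```

With $\beta = 0.02$, errors on the order of the reported RMSE (about 0.01 g/100 g) fall in the quadratic regime, while larger errors are penalized linearly and therefore have less leverage on the gradient.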
2.3. Model Evaluation Strategy
To compare the ML and DL models, we used two evaluation strategies. First, we performed a within-cultivar split: approximately 80% of the spectra from each cultivar were used for training, and the remaining 20% were reserved for testing. Because the training and test spectra were from the same cultivar, the model was evaluated under conditions in which the spectral distribution was familiar. This setting provides a controlled baseline for assessing performance in the presence of intra-cultivar spectral variability and for tuning preprocessing and model hyperparameters. Second, we used a leave-one-cultivar-out (LOCO) cross-validation protocol to evaluate generalization across cultivars. In each LOCO fold, one cultivar was withheld as an independent test set, and the models were trained on spectra from the remaining 19 cultivars. This procedure was repeated until each cultivar had served once as the held-out test set. LOCO therefore measures how well a model generalizes to spectra from an unseen cultivar, assessing robustness against inter-cultivar spectral variability.
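The two splitting schemes can be sketched in NumPy as follows. The function names are hypothetical, and the random per-spectrum assignment within each cultivar is an assumption, since the text does not state how the 80/20 partition was drawn. The example uses 20 cultivars with 324 spectra each, matching the 6,480 spectra reported for the dataset.

```python
import numpy as np


def within_cultivar_split(cultivar_ids, test_fraction=0.2, seed=0):
    """Hold out roughly test_fraction of the spectra from every cultivar."""
    rng = np.random.default_rng(seed)
    cultivar_ids = np.asarray(cultivar_ids)
    test_mask = np.zeros(cultivar_ids.shape[0], dtype=bool)
    for cultivar in np.unique(cultivar_ids):
        idx = np.flatnonzero(cultivar_ids == cultivar)
        n_test = int(round(test_fraction * idx.size))
        test_mask[rng.choice(idx, size=n_test, replace=False)] = True
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def loco_folds(cultivar_ids):
    """Yield (held-out cultivar, train indices, test indices) for each cultivar."""
    cultivar_ids = np.asarray(cultivar_ids)
    for cultivar in np.unique(cultivar_ids):
        held_out = cultivar_ids == cultivar
        yield cultivar, np.flatnonzero(~held_out), np.flatnonzero(held_out)


cultivar_ids = np.repeat(np.arange(20), 324)  # 20 cultivars x 324 spectra
train_idx, test_idx = within_cultivar_split(cultivar_ids)
folds = list(loco_folds(cultivar_ids))
```

In the LOCO folds, no spectrum from the held-out cultivar appears in the training indices, which is what separates cross-cultivar generalization from within-cultivar fit.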
Model performance under both evaluation strategies was quantified using standard regression metrics. The root mean squared error (RMSE) was defined as $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, which places greater weight on larger errors. The mean absolute error (MAE) was calculated as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$, providing an average error measure in the same units as cysteine concentration, g/100 g. The coefficient of determination was computed as $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $y_i$ denotes the true cysteine concentration for sample $i$, $\hat{y}_i$ is the corresponding model prediction, $\bar{y}$ is the mean of the observed $y_i$ values, and $n$ is the number of spectra in the test set for a given split or LOCO fold. Together, these metrics summarize prediction performance under both within-cultivar and LOCO evaluations and enable a direct comparison of the ML models and the DL approach.
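The three metrics translate directly into NumPy. The sketch below uses a hypothetical helper name and made-up example values; the guard for a zero denominator in $R^2$ is an added assumption, covering the case where all reference values in a test set are identical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    # R^2 is undefined when the observed values have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Illustrative values only (g/100 g).
metrics = regression_metrics([0.20, 0.22, 0.24, 0.26], [0.21, 0.22, 0.23, 0.27])
```

For these four illustrative points, the MAE is 0.0075 g/100 g and $R^2$ is 0.85.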
3. Results and Discussion
This section presents the results of the AI-based cysteine quantification and discusses their implications for model generalizability and practical deployment. As detailed in Section 2.1.4, the dataset comprised 6,480 SERS spectra collected across 20 pea cultivars. For the analysis, these spectra were stored in NumPy binary format, with each file containing a fixed-length vector of 1,496 Raman intensity values. All ML models were implemented using the scikit-learn library, while the 1D-CNN was implemented in PyTorch. Each model was trained on the same vectorized spectral inputs under the two evaluation strategies described in Section 2.3. These strategies were selected to distinguish between measurement-related artifacts (intra-cultivar) and biological diversity (inter-cultivar). Accordingly, Section 3.1 evaluates the predictive performance of the models under these different sources of spectral variability, while Section 3.2 extends beyond predictive performance to examine the interpretability and operational robustness of the 1D-CNN framework.
3.1. Impact of Spectral Variability on Model Performance
3.1.1. Intra-Cultivar Spectral Variability
To assess model robustness against measurement noise, we compared the performance of the five models on raw versus preprocessed spectra (Table 2). This comparison highlights how each algorithm responds to measurement-related artifacts, such as fluorescence background, baseline drift, and stochastic noise. For the four ML models (LR, PLSR, SVR, RFR), preprocessing improved performance, reducing RMSE by 0.001–0.002 g/100 g and increasing R² by 0.06–0.09. This suggests that these models are sensitive to baseline drift and noise, which can obscure the underlying spectral patterns. LR and PLSR benefited from baseline correction and scaling, which strengthened the linear relationship between spectral features and concentration. RFR also benefited from preprocessing, as reduced background and noise can yield more stable split decisions and more consistent averaging across trees. Finally, the improvement observed for SVR suggests that kernel-based similarity calculations are sensitive to noise in peak shape.
In contrast, the 1D-CNN achieved an RMSE of 0.008 g/100 g on both raw and preprocessed inputs (R² = 0.858 and 0.862, respectively), showing little dependence on external preprocessing. This robustness likely arises from the convolutional layers, which process local spectral windows to capture peak shape and structure rather than relying on absolute intensity. Additionally, internal mechanisms such as batch normalization and pooling help handle global intensity scaling and high-frequency fluctuations. Consequently, despite the use of substrates from two different manufacturing batches, the 1D-CNN decoupled the target signal from intra-cultivar measurement variability more effectively than the other models on raw spectra; only SVR reached comparable accuracy, and only after preprocessing (R² = 0.861).
Table 2. Performance of the five predictive models on raw and preprocessed SERS spectra. Results are reported as RMSE, MAE, and R² for each model before and after applying Savitzky–Golay (SG) smoothing, modified polynomial baseline correction (ModPoly), and min–max normalization.
          RMSE (g/100 g)           MAE (g/100 g)            R²
Model     Raw      Preprocessed    Raw      Preprocessed    Raw      Preprocessed
LR        0.013    0.012           0.010    0.008           0.561    0.650
PLSR      0.014    0.013           0.012    0.011           0.569    0.626
RFR       0.014    0.013           0.011    0.010           0.585    0.662
SVR       0.010    0.008           0.008    0.007           0.794    0.861
1D-CNN    0.008    0.008           0.007    0.007           0.858    0.862
3.1.2. Inter-Cultivar Spectral Variability
To assess generalization to unseen genotypes, we compared model performance under within-cultivar and LOCO evaluation strategies (Table 3). Under within-cultivar testing, where the test set contains familiar spectral distributions, all models performed reasonably well (RMSE 0.008–0.013 g/100 g). In this setting, variability is dominated by instrumental noise rather than by biochemical differences, allowing the models to fit the cultivar's identity rather than its biochemical signature. However, under the LOCO evaluation, the ML models exhibited a marked performance decline when applied to unseen cultivars. R² values dropped to 0.037–0.124, and RMSE increased to 0.021–0.097 g/100 g (roughly 1.6- to 2.8-fold for PLSR, RFR, and SVR, and about 8-fold for LR). This suggests that these models rely on absolute peak intensities, which vary due to G×E interactions and substrate effects, rather than the intrinsic molecular signature of cysteine.
Conversely, the 1D-CNN demonstrated robust generalization, maintaining a low RMSE of 0.011 g/100 g and an R² of 0.795 under LOCO conditions. This suggests that the convolutional architecture learns spectral features that are stable across cultivars. The network captures the local structure around each Raman peak and learns how intensities vary within small neighbourhoods, rather than relying on absolute peak heights, which vary across cultivars and substrates. It can learn detailed peak-shape characteristics, including curvature, width, and asymmetry, which are linked to molecular structure. These results indicate that while the traditional ML models are sufficient for characterizing known samples, among the models tested only the 1D-CNN supported generalizable prediction in breeding and quality-control applications, where new cultivars are encountered.
Table 3. Comparison of model performance under within-cultivar and LOCO evaluation schemes using preprocessed spectra.
          RMSE (g/100 g)              MAE (g/100 g)               R²
Model     Within-cultivar    LOCO     Within-cultivar    LOCO     Within-cultivar    LOCO
LR        0.012              0.097    0.008              0.045    0.650              0.037
PLSR      0.013              0.021    0.011              0.017    0.626              0.103
RFR       0.013              0.022    0.010              0.017    0.662              0.124
SVR       0.008              0.022    0.007              0.018    0.861              0.118
1D-CNN    0.008              0.011    0.007              0.008    0.862              0.795
The validation of quantitative prediction and robustness to measurement variability addresses a key limitation highlighted in the Introduction. Prior work has used ML with SERS for qualitative objectives, such as discriminating protein types (Barucci et al., 2021) or identifying binding-related spectral changes (Peng et al., 2022). Here, we benchmark AI-based SERS analysis for quantitative regression in a complex food matrix using a deployment-focused evaluation. We show that the 1D-CNN maintains strong performance when applied to unseen cultivars. To the best of our knowledge, this is the first application of deep learning to quantify a specific amino acid (cysteine) in legume extracts using SERS. This advancement provides a scalable framework for high-throughput phenotyping, enabling breeders to rapidly screen for nutritional quality.
3.2. Practical Applications of the 1D-CNN for Cysteine Quantification in Pea Cultivars
This section extends beyond predictive performance to examine how the 1D-CNN can be used to support SERS-based quantification of cysteine in this study. First, we use SHAP to identify the Raman regions that most contribute to model predictions under both evaluation schemes. Second, we evaluate the model’s sensitivity to spectral noise using a controlled augmentation framework that simulates varying scan counts and quantifies their effects on predictive performance.
3.2.1. Interpreting Raman Vibrational Features Across Cultivars
To examine which Raman regions the 1D-CNN uses for cysteine prediction, we applied SHAP analysis to the trained 1D-CNN models under both evaluation schemes. For the within-cultivar split, SHAP values were computed for the corresponding within-cultivar model. For LOCO (20-fold), SHAP values were computed using the best-performing fold. Figure 2 shows SHAP summary plots for the within-cultivar model (left) and the LOCO model (right). Each point corresponds to a spectrum in the SHAP evaluation set. Features are ranked from top to bottom by mean absolute SHAP value, which reflects the average magnitude of the contribution of each feature to the predicted cysteine concentration across spectra. The horizontal axis shows SHAP values, indicating whether each feature increases or decreases the predicted cysteine concentration. Point color indicates the Raman intensity at that Raman shift, from low to high.
In the within-cultivar setting (Figure 2, left), the most impactful features are distributed across multiple Raman-shift regions rather than concentrated in a single band. Dominant contributions appear both near ~880–930 cm⁻¹ (e.g., 890, 918, 921, 931 cm⁻¹) and within the ~630–650 cm⁻¹ region (e.g., 637, 640, 643, 648 cm⁻¹), with additional contributions at intermediate bands (e.g., ~669–791 cm⁻¹). This pattern suggests that when cultivar identity is shared between training and testing, the model can rely on a broader set of spectral features rather than on a single feature associated with cysteine.
In the LOCO setting (Figure 2, right), the SHAP ranking is more structured across Raman regions. Although a low-Raman-shift feature near ~200 cm⁻¹ appears as a top contributor, most of the highly ranked features are concentrated in the ~630–760 cm⁻¹ range (e.g., ~632–648 cm⁻¹ and ~702–725 cm⁻¹). Low-Raman-shift SERS features near ~200 cm⁻¹ are attributed to substrate-related contributions, such as metal–adsorbate interactions, metal lattice/phonon modes, or electronic scattering, rather than to internal molecular vibrations (Inagaki et al., 2019). In addition, Ag P-SERS spectra show a dominant low-shift band near ~235 cm⁻¹ that has been assigned to Ag–Ag stretching (Findlay et al., 2025), consistent with substrate-related contributions at low Raman shifts. In contrast, biochemical interpretation is supported by the highly ranked bands in the 630–760 cm⁻¹ region (Adar et al., 2022). Features near ~643–648 cm⁻¹ and ~712–725 cm⁻¹ are consistent with reported protein carbon–sulfur (C–S)-related vibrations in this band range, supporting their relevance for cysteine prediction under LOCO. The LOCO ranking provides the most appropriate basis for interpreting 1D-CNN behavior in cross-cultivar prediction, because it reflects features that remain informative when the test cultivar is not represented in the training data.
Figure 2. SHAP summary plots showing Raman-shift regions that contribute to 1D-CNN predictions of cysteine concentration under two evaluation schemes: within-cultivar split (left) and leave-one-cultivar-out (LOCO) evaluation (right). Features (Raman shift, cm⁻¹) are ordered from top to bottom by mean absolute SHAP value, representing the average contribution magnitude to the model output. Each point corresponds to one spectrum in the SHAP evaluation set. The x-axis shows SHAP values (in the units of the model output), where positive values increase the predicted cysteine concentration and negative values decrease it. Point color indicates the feature value at that Raman shift, from low to high.
3.2.2. Data Acquisition Optimization and Noise Modeling
The controlled-noise study examines how a 1D-CNN can inform practical choices in SERS data acquisition. As described in Section 2.1.3, each stored spectrum was acquired with two co-additions, meaning that two consecutive acquisitions were combined into a single spectrum. In the noise-modeling analysis, the scan count $N$ refers to the effective number of averaged acquisitions per spectrum (co-additions). Because random noise decreases with increasing $N$, spectra acquired with fewer effective scans have lower signal-to-noise ratios. Quantifying how prediction performance changes as the number of scans decreases is therefore important for balancing acquisition time against analytical performance. To assess this trade-off, we used a noise-modeling and augmentation framework to generate synthetic spectra that mimic measurements at different effective scan counts while preserving the underlying spectral structure of the original data.
The augmentation strategy was based on signal-averaging theory, where the standard deviation of random noise scales as $1/\sqrt{N}$. We defined a high-SNR reference level $N_{\mathrm{ref}} = 512$, and scaled the additive noise by $\sqrt{N_{\mathrm{ref}}/N}$ to simulate effective scan counts from 64 down to 1. The reference level $N_{\mathrm{ref}} = 512$ was chosen to provide a wide signal-to-noise range to resolve performance trends across scan counts. It is used only as a reference for scaling the added noise and does not imply that spectra were experimentally acquired with 512 co-additions. This approach generated datasets with increasing noise while preserving the underlying spectral structure.
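The scaling rule can be illustrated with a short NumPy sketch. The Gaussian noise model, the reference noise level `sigma_ref`, the function name, and the synthetic single-peak spectrum are all assumptions made for this example; only $N_{\mathrm{ref}} = 512$, the $\sqrt{N_{\mathrm{ref}}/N}$ factor, and the scan counts come from the text.

```python
import numpy as np

N_REF = 512  # reference level used only to scale the added noise


def simulate_scan_count(spectrum, n_scans, sigma_ref, rng):
    """Add zero-mean noise with standard deviation sigma_ref * sqrt(N_REF / n_scans)."""
    scale = np.sqrt(N_REF / n_scans)
    noise = rng.normal(0.0, sigma_ref * scale, size=spectrum.shape)
    return spectrum + noise


rng = np.random.default_rng(0)
# Synthetic stand-in spectrum (assumption): one Gaussian peak on 1,496 points.
clean = np.exp(-0.5 * ((np.arange(1496) - 700) / 15.0) ** 2)

scan_counts = [64, 32, 16, 8, 4, 2, 1]
noise_std = {
    n: float(np.std(simulate_scan_count(clean, n, 1e-3, rng) - clean))
    for n in scan_counts
}
```

Under this rule, the added noise at 1 scan is $\sqrt{64} = 8$ times larger than at 64 scans, and each halving of the scan count increases the noise standard deviation by a factor of $\sqrt{2}$.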
To perform the noise augmentation, we randomly selected 10 spectra from the 324 spectra available for each cultivar and generated augmented versions of these spectra for each simulated scan count. The results, summarized in Table 4, show that predictive performance declines as the scan count decreases. As the number of scans decreases from 64 to 1, RMSE increases from 0.009 to 0.016 g/100 g, MAE increases from 0.008 to 0.014 g/100 g, and R² decreases from 0.843 to 0.446. Performance is highest at 64 scans and nearly flat from 32 down to 8 scans (RMSE 0.010–0.011 g/100 g, R² 0.770–0.777). At 4 and 2 scans, RMSE increases to 0.014 g/100 g and R² falls to 0.607 and 0.583, respectively. Performance is lowest at 1 scan. These results indicate that the 1D-CNN tolerates the added noise down to 8 scans, below which accuracy degrades noticeably. Based on these results, 8 scans provide a good balance between acquisition time and predictive performance. Using 2 co-additions, consistent with the experimental protocol (Section 2.1.3), is also possible but yields lower accuracy than 8 scans.
Table 4. Performance of the 1D-CNN model as a function of simulated scan count in the noise-modeling experiment. Spectra at each scan level were generated by scaling additive noise relative to a 512-scan reference. Performance is reported as RMSE, MAE, and R² for cysteine concentration.
Number of scans    RMSE (g/100 g)    MAE (g/100 g)    R²
64                 0.009             0.008            0.843
32                 0.010             0.009            0.777
16                 0.010             0.009            0.774
8                  0.011             0.009            0.770
4                  0.014             0.012            0.607
2                  0.014             0.013            0.583
1                  0.016             0.014            0.446
Beyond evaluating noise effects, this analysis highlights the application of data augmentation when experimental data are limited. By generating realistic spectral variants that expand the training set, the augmentation procedure allows the 1D-CNN to learn a broader range of instrumental noise, baseline variation, and spectral fluctuations. This is particularly useful in SERS studies, where data collection is often constrained by sample availability, instrument time, or substrate variability.
4. Conclusion
This study demonstrates that AI-based modeling of SERS spectra enables the quantitative prediction of cysteine in pea extracts. Model performance depended on the dominant source of variability represented in the evaluation. Under within-cultivar testing, where intra-cultivar spectral variability dominates, the ML models benefited from preprocessing and achieved moderate-to-high performance. In contrast, when the evaluation introduced inter-cultivar spectral variability through LOCO testing, the performance of traditional regression models declined sharply, indicating weak generalization to unseen cultivars. The 1D-CNN showed better cross-cultivar generalization, with only a small increase in RMSE from within-cultivar to LOCO testing, supporting its suitability for applications where new cultivars are expected at deployment.
SHAP analysis provided insight into how the 1D-CNN behaves under intra- and inter-cultivar spectral variability. Under within-cultivar testing, feature importance was distributed across multiple regions. Under LOCO conditions, feature importance became more structured and concentrated in the ~630–760 cm⁻¹ region, with an additional contribution from a low-Raman-shift feature near ~200 cm⁻¹. The concentration of influential features in the 630–760 cm⁻¹ range, which is consistent with reported C–S-related vibrational contributions in proteins, supports a chemical basis for cross-cultivar prediction and suggests that the model identifies spectral patterns that remain stable across cultivars. From a practical perspective, the noise study indicates that with 8 scans, the 1D-CNN maintained performance comparable to that obtained with 16 or 32 scans, thereby reducing acquisition time. In addition, the model demonstrated consistency across batches with respect to substrate variability, thereby addressing a barrier to SERS reproducibility. This supports its suitability for routine operations where consumable properties vary.
Overall, the findings support the use of SERS combined with DL as a practical and scalable approach for rapid, cross-cultivar prediction of cysteine concentration, thereby supporting food-quality control and cultivar selection. Future work should extend the approach to full-panel amino acid profiling and compare the current 1D-CNN with alternative DL architectures to evaluate improvements in generalization and robustness across diverse plant protein matrices.
Acknowledgements
This work was supported by the National Research Council of Canada (NRC) through the Sustainable Protein Production (SPP) Program grant number SPP-142–1. The authors also acknowledge Obasi Ukpai Ukoji and Sristi Mundhada for their contributions to the SERS data acquisition.
References
Adar et al. (2022). Interpretation of Raman spectrum of proteins. Spectroscopy, 37(2), 9–13, 25. https://doi.org/10.56530/spectroscopy.lo2270l5
Barucci, A., D’Andrea, C., Farnesi, E., Banchelli, M., Amicucci, C., de Angelis, M., Hwang, B., & Matteini, P. (2021). Label-free SERS detection of proteins based on machine learning classification of chemo-structural determinants. Analyst, 146(2), 674–682. https://doi.org/10.1039/D0AN02137G
Bocklitz, T., Walter, A., Hartmann, K., Rösch, P., & Popp, J. (2011). How to pre-process Raman spectra for reliable and stable models? Analytica Chimica Acta, 704(1–2), 47–56. https://doi.org/10.1016/j.aca.2011.06.043
Bokobza, L. (1998). Near infrared spectroscopy. Journal of Near Infrared Spectroscopy, 6(1), 3–17. https://doi.org/10.1255/jnirs.116
Boye, J., Zare, F., & Pletch, A. (2010). Pulse proteins: Processing, characterization, functional properties and applications in food and feed. Food Research International, 43(2), 414–431. https://doi.org/10.1016/j.foodres.2009.09.003
Brereton, R. G. (2003). Chemometrics: Data analysis for the laboratory and chemical plant. John Wiley & Sons. https://doi.org/10.1002/0470863242
Chen, G., Xie, W., & Zhao, Y. (2013, June 9–11). Wavelet-based denoising: A brief review. In Proceedings of the 2013 4th International Conference on Intelligent Control and Information Processing (ICICIP) (pp. 570–574). IEEE. https://doi.org/10.1109/ICICIP.2013.6568140
Chon, B., Xu, S., & Lee, Y. J. (2021). Compensation of strong water absorption in infrared spectroscopy reveals the secondary structure of proteins in dilute solutions. Analytical Chemistry, 93(4), 2215–2225. https://doi.org/10.1021/acs.analchem.0c04091
Das, R. S., & Agrawal, Y. K. (2011). Raman spectroscopy: Recent advancements, techniques and applications. Vibrational Spectroscopy, 57(2), 163–176. https://doi.org/10.1016/j.vibspec.2011.08.003
Fearn, T., Riccioli, C., Garrido-Varo, A., & Guerrero-Ginel, J. E. (2009). On the geometry of SNV and MSC. Chemometrics and Intelligent Laboratory Systems, 96(1), 22–26. https://doi.org/10.1016/j.chemolab.2008.11.006
Findlay, C. R. J., Ukoji, O. U., Mundhada, S., Polley, B., Ko, A. C.-T., Bhowmik, P., & Paliwal, J. (2025). Quantitative paper-based SERS method for the rapid determination of sulfur amino acid residues in Pisum sativum. Measurement: Food, 19, Article 100240. https://doi.org/10.1016/j.meafoo.2025.100240
Gerrano, A. S., Mbuma, N. W., & Mumm, R. H. (2022). Expression of nutritional traits in vegetable cowpea grown under various South African agro-ecological conditions. Plants, 11(11), Article 1422. https://doi.org/10.3390/plants11111422
Gorry, P. A. (1990). General least-squares smoothing and differentiation by the convolution (Savitzky–Golay) method. Analytical Chemistry, 62(6), 570–573. https://doi.org/10.1021/ac00205a007
Grys, D. B., Chikkaraddy, R., Kamp, M., Scherman, O. A., Baumberg, J. J., & de Nijs, B. (2021). 547
Eliminating irreproducibility in SERS substrates. Journal of Raman Spectroscopy, 52(2), 412–548
419. https://doi.org/10.1002/jrs.6008 549
Inagaki, M., Motobayashi, K., & Ikeda, K. (2019). Low-frequency surface-enhanced Raman scattering 550
spectroscopy at metal electrode surfaces. Current Opinion in Electrochemistry, 17, 143–551
148. https://doi.org/10.1016/j.coelec.2019.06.001 552
Iqbal, A., Khalil, I. A., Ateeq, N., & Sayyar Khan, M. (2006). Nutritional quality of important food 553
legumes. Food Chemistry, 97(2), 331–335. https://doi.org/10.1016/j.foodchem.2005.05.011 554
Jeon, Y., Lee, S., Jeon, Y. J., Kim, D., Ham, J. H., Jung, D. H., Kim, H. Y., & You, J. (2025). Rapid 555
identification of pathogenic bacteria using data preprocessing and machine learning-augmented 556
label-free surface-enhanced Raman scattering. Sensors and Actuators B: Chemical, 425, Article 557
136963. https://doi.org/10.1016/j.snb.2024.136963 558
Justusson, B. I. (1981). Median filtering: Statistical properties. In T. S. Huang (Ed.), Two-dimensional 559
digital signal processing II: Transforms and median filters (pp. 161–196). Springer-560
Verlag. https://doi.org/10.1007/BFb0057597 561
Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. CRC Press. https://doi.org/10.1201/b14876 562
Li, S., & Dai, L. (2011). An improved algorithm to remove cosmic spikes in Raman spectra for online 563
monitoring. Applied Spectroscopy, 65(11), 1300–1306. https://doi.org/10.1366/10-06169 564
Li, Z., Zhan, D. J., Wang, J. J., Huang, J., Xu, Q. S., Zhang, Z. M., Zheng, Y. B., Liang, Y. Z., & Wang, 565
H. (2013). Morphological weighted penalized least squares for background correction. Analyst, 566
138(16), 4483–4492. https://doi.org/10.1039/c3an00743j 567
Lieber, C. A., & Mahadevan-Jansen, A. (2003). Automated method for subtraction of fluorescence from 568
biological Raman spectra. Applied Spectroscopy, 57(11), 1363–569
1367. https://doi.org/10.1366/000370203322554518 570
Liland, K. H., Kohler, A., & Afseth, N. K. (2016). Model-based pre-processing in Raman spectroscopy 571
of biological samples. Journal of Raman Spectroscopy, 47(6), 643–572
650. https://doi.org/10.1002/jrs.4886 573
Lisciani, S., Marconi, S., Le Donne, C., Camilli, E., Aguzzi, A., Gabrielli, P., Gambelli, L., Kunert, K.,
Marais, D., Vorster, B. J., Alvarado-Ramos, K., Reboul, E., Cominelli, E., Preite, C., Sparvoli, F.,
Losa, A., Sala, T., Botha, A. M., & Ferrari, M. (2024). Legumes and common beans in sustainable
diets: Nutritional quality, environmental benefits, spread and use in food preparations. Frontiers in
Nutrition, 11, Article 1385232. https://doi.org/10.3389/fnut.2024.1385232
Maphosa, Y., & Jideani, V. A. (2017). The role of legumes in human nutrition. In M. Chávarri Hueda
(Ed.), Functional food: Improve health through adequate food (pp. 103–121).
InTechOpen. https://doi.org/10.5772/intechopen.69127
Morháč, M., & Matoušek, V. (2008). Peak clipping algorithms for background estimation in
spectroscopic data. Applied Spectroscopy, 62(1),
91–106. https://doi.org/10.1366/000370208783412762
Moskovits, M. (1985). Surface-enhanced spectroscopy. Reviews of Modern Physics, 57(3),
783–826. https://doi.org/10.1103/RevModPhys.57.783
Ng, L. M., & Simmons, R. (1999). Infrared spectroscopy. Analytical Chemistry, 71(12),
343R–350R. https://doi.org/10.1021/a1999908r
Park, M., Somborn, A., Schlehuber, D., Keuter, V., & Deerberg, G. (2023). Raman spectroscopy in crop
quality assessment: Focusing on sensing secondary metabolites: A review. Horticulture Research,
10(5), uhad074. https://doi.org/10.1093/hr/uhad074
Peng, J., Peng, S., Jiang, A., Wei, J., Li, C., & Tan, J. (2010). Asymmetric least squares for multiple
spectra baseline correction. Analytica Chimica Acta, 683(1),
63–68. https://doi.org/10.1016/j.aca.2010.08.033
Peng, M., Wang, Z., Sun, X., Guo, X., Wang, H., Li, R., Liu, Q., Chen, M., & Chen, X. (2022). Deep
learning-based label-free surface-enhanced Raman scattering screening and recognition of
small-molecule binding sites in proteins. Analytical Chemistry, 94(33),
11483–11491. https://doi.org/10.1021/acs.analchem.2c01158
Pilot, R., Signorini, R., Durante, C., Orian, L., Bhamidipati, M., & Fabris, L. (2019). A review on
surface-enhanced Raman scattering. Biosensors, 9(2), Article
57. https://doi.org/10.3390/bios9020057
Samal, I., Bhoi, T. K., Raj, M. N., Majhi, P. K., Murmu, S., Pradhan, A. K., Kumar, D., Paschapur, A.
U., Joshi, D. C., & Guru, P. N. (2023). Underutilized legumes: Nutrient status and advanced
breeding approaches for qualitative and quantitative enhancement. Frontiers in Nutrition, 10, Article
1110750. https://doi.org/10.3389/fnut.2023.1110750
Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares
procedures. Analytical Chemistry, 36(8), 1627–1639. https://doi.org/10.1021/ac60214a047
EFSA NDA Panel (EFSA Panel on Dietetic Products, Nutrition and Allergies). (2012). Scientific opinion
on dietary reference values for protein. EFSA Journal, 10(2), Article
2557. https://doi.org/10.2903/j.efsa.2012.2557
Shanthakumar, P., Klepacka, J., Bains, A., Chawla, P., Dhull, S. B., & Najda, A. (2022). The current
situation of pea protein and its application in the food industry. Molecules, 27(16), Article
5354. https://doi.org/10.3390/molecules27165354
Snyder, L. R., Kirkland, J. J., & Dolan, J. W. (2010). Introduction to modern liquid chromatography (3rd
ed.). John Wiley & Sons. https://doi.org/10.1002/9780470508183
Sparkman, O. D., Penton, Z. E., & Kitson, F. G. (2011). Gas chromatography and mass spectrometry: A
practical guide (2nd ed.). Academic Press. https://doi.org/10.1016/c2009-0-17039-3
Wahl, J., Sjödahl, M., & Ramser, K. (2020). Single-step preprocessing of Raman spectra using
convolutional neural networks. Applied Spectroscopy, 74(4),
427–438. https://doi.org/10.1177/0003702819888949
Whitaker, D. A., & Hayes, K. (2018). A simple algorithm for despiking Raman spectra. Chemometrics
and Intelligent Laboratory Systems, 179, 82–84. https://doi.org/10.1016/j.chemolab.2018.06.009
World Health Organization, Food and Agriculture Organization of the United Nations, & United
Nations University. (2007). Protein and amino acid requirements in human nutrition: Report of a
joint WHO/FAO/UNU expert consultation (WHO Technical Report Series No.
935). https://iris.who.int/handle/10665/43411
Wu, Y., & Chen, L. (2017, July 24–26). Comparison of spectra processing methods for SERS based
quantitative analysis. In Proceedings of the 2017 4th International Conference on Information,
Cybernetics and Computational Social Systems (ICCSS) (pp. 130–136).
IEEE. https://doi.org/10.1109/ICCSS.2017.8091399
Hu, H. B., Bai, J., Xia, G., Zhang, W. D., & Ma, Y. (2018). Baseline correction method for Raman
spectra based on piecewise polynomial fitting. In J. Chu (Ed.), Fifth conference on frontiers in
optical imaging technology and applications (FOI 2018) (Proceedings of SPIE, Vol. 10832, Paper
108321D). SPIE. https://doi.org/10.1117/12.2511445