Generalizable Cysteine Quantification in Pea Cultivars from SERS Spectra Using AI

doi:10.64898/2026.03.20.713189

2026 · doi:10.64898/2026.03.20.713189

preprint OA: closed

Abstract

Rapid quantification of sulfur-containing amino acids, particularly cysteine, in legumes is critical for assessing nutritional quality, supporting breeding program screening, and ensuring consistency in quality control processes. However, conventional methods, such as high-performance liquid chromatography (HPLC), are time-consuming and resource-intensive for high-throughput applications. This study evaluated artificial intelligence models for predicting cysteine concentration from surface-enhanced Raman spectroscopy (SERS) spectra of pea extracts. SERS spectra were acquired from 20 cultivars grown at three geographically distinct locations, with HPLC-measured cysteine concentrations as a ground truth reference. Linear regression, partial least squares regression, support vector regression, random forest regression, and a one-dimensional convolutional neural network (1D-CNN) were compared using within-cultivar splits and leave-one-cultivar-out (LOCO) evaluation. The 1D-CNN achieved an RMSE of 0.008 g/100 g within cultivars and maintained performance under LOCO, while other models showed limited generalization. Shapley Additive Explanations highlighted informative bands in the 630–760 cm⁻¹ range, and noise modeling optimized scan-count selection.

Keywords

SERS, Legumes, Amino-acids, Deep-learning, Generalization, SHAP, Noise-modeling

bioRxiv preprint, doi: https://doi.org/10.64898/2026.03.20.713189; this version posted March 24, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1. Introduction

Legumes are an important source of plant-based protein in human diets (Lisciani et al., 2024; Samal et al., 2023). Their seeds contain approximately 20–45% protein on a dry weight basis, depending on species and cultivar (Maphosa et al., 2017). This is higher than the protein content of most widely consumed plant-based foods, including cereals (7–15%) and vegetables (1–5%) (Boye et al., 2010). Despite this advantage, legume protein quality is constrained by low levels of the sulfur-containing amino acids (SCAAs), cysteine and methionine (Iqbal et al., 2006). Together, these two amino acids can at times constitute the primary limiting essential and semi-essential amino acids determining protein quality. Peas, beans, and lentils typically contain only 12–18 mg/g protein of cysteine plus methionine, which is below the 22–25 mg/g protein recommended in the Food and Agriculture Organization/World Health Organization (FAO/WHO) amino acid reference pattern for high-quality dietary protein (WHO/FAO/UNU, 2025). SCAA levels are influenced by cultivar genetics and environmental conditions, such as soil type, climate, and agronomic practices, as well as by genotype-by-environment (G×E) interactions (Gerrano et al., 2022). High-throughput, cultivar-robust quantification methods are important for the development of reliable, routine identification of high-SCAA genotypes and for quality control in commercial legume protein ingredients.
Conventional analytical methods, such as high-performance liquid chromatography (HPLC) (Snyder et al., 2010) and gas chromatography–mass spectrometry (GC–MS) (Sparkman et al., 2011), provide accurate SCAA measurements but require multi-step sample preparation, including protein hydrolysis and complex derivatization. They depend on specialized equipment, costly reagents, and long analysis times, which limit applicability for large-scale screening and rapid quality control in the food industry.

These limitations have motivated a shift toward spectroscopic methods that enable faster, more direct analysis. Vibrational spectroscopy methods, including infrared (IR), near-infrared (NIR), and Raman spectroscopy, provide detailed information on molecular structure, bonding, and composition (Bokobza, 1998; Ng & Simmons, 1999). IR-based techniques can be limited by water absorption and sample preparation requirements (Chon et al., 2021), whereas Raman spectroscopy is less affected by water and can be applied to aqueous extracts with minimal sample handling (Park et al., 2023). A practical limitation is that conventional Raman scattering is a low-probability phenomenon with weak intensity, reducing sensitivity for low-concentration analytes (Das & Agrawal, 2011). Surface-enhanced Raman spectroscopy (SERS) addresses this by using plasmonic nanostructures to amplify Raman signals and improve the sensitivity of detection for low-abundance analytes (Moskovits, 1985; Pilot et al., 2019). A defining property of quantitative SERS, in contrast to SERS for detection alone, is that under controlled experimental conditions, the scattered intensity is proportional to the number of molecules contributing to the enhancement.
This means that spectral responses exhibit approximately proportional, monotonic behavior with analyte concentration, roughly analogous to Beer–Lambert-type relationships amenable to linear regression, and stable, repeatable patterns well-suited to machine learning (ML) analysis.

However, in complex food and biological matrices, SERS measurements are compromised by substrate heterogeneity, adsorption effects, fluorescence background, and non-linear baseline drift (Grys et al., 2021; Pilot et al., 2019). Conventional univariate or linear chemometric techniques often fail to decouple the target analyte signal from these complex, stochastic interferences. This limitation necessitates the use of artificial intelligence (AI) methods, including ML and deep learning (DL), to learn a quantitative mapping from SERS spectra to target analyte concentration. Recent label-free SERS studies show that ML/DL approaches can extract chemical and structural information from SERS spectra for discrimination and recognition tasks. Barucci et al. developed a hybrid strategy combining peak fitting with principal component analysis to discriminate proteins with closely similar spectral profiles, providing a reproducible approach to capture structure-dependent spectral variation in human and animal proteins (Barucci et al., 2021). Peng et al. implemented a DL-based, label-free SERS framework for screening and recognizing small-molecule binding sites in human drug-target proteins (Peng et al., 2022).
While these studies focus on animal and human proteins and primarily address qualitative SERS analysis, related work in legume proteins supports extending this approach with ML/DL to quantitative analysis in complex food matrices (Findlay et al., 2025).

Accordingly, the present study develops and evaluates AI models to predict cysteine concentration from SERS spectra of pea (Pisum sativum L.) cultivars. Pea was selected as a representative legume matrix because pea protein is a widely used ingredient of growing importance in protein isolate production and processing, sustainable plant-based meat analogs, and human nutrition (Shanthakumar et al., 2022). Peas are self-pollinating; thus, the pedigrees and lineages are well characterized, and cultivars exhibit genetic stability. Peas are the subject of well-developed breeding programs and cultivar collections, making them a good candidate for amino acid panel analysis by HPLC to establish reference values. Cysteine was selected as the analytical target because it contributes directly to the total SCAA pool and exhibits thiol-based surface-binding chemistry compatible with quantitative SERS (Findlay et al., 2025). A dataset of SERS spectra collected from 20 pea cultivars was used to investigate whether AI models can learn chemically meaningful relationships between spectral patterns and cysteine concentration.

To evaluate the complexity required to model these data, five algorithms were selected, ranging from linear regression (LR) to convolutional neural networks (CNN). LR and partial least squares regression (PLSR) were included to establish a baseline and to represent standard chemometric approaches that assume linear spectral–concentration relationships.
To assess whether non-linear modeling alone improves performance, we evaluated support vector regression (SVR) and random forest regression (RFR), which capture complex boundaries and variable interactions but rely on fixed input features. Finally, a DL model, a one-dimensional CNN (1D-CNN), was assessed. Unlike standard regression models that treat spectral points as independent features, CNNs are designed to learn hierarchical, local patterns such as peak shapes, widths, and relative shifts. This capability is hypothesized to make DL models more robust to the absolute intensity fluctuations and baseline shifts common in SERS, enabling better generalization across cultivars.

To assess this generalization capability, an evaluation framework was designed to distinguish between two forms of spectral variability that shape model behavior. The first, referred to as intra-cultivar spectral variability, arises from instrumental and substrate-related effects, including fluorescence background, stochastic noise, and local variations in electromagnetic field enhancement across the SERS substrate. These sources of variation occur within cultivars and reflect the physics of the measurement process. Therefore, they are assessed using a within-cultivar evaluation strategy, in which the training and test sets consist of spectra from the same cultivar (Section 2.3). The second form, termed inter-cultivar spectral variability, arises from cultivar-dependent biochemical variability, driven by genotype × environment (G×E) interactions that modify the molecular composition of pea extracts. This biochemical variation changes SERS peak intensities, peak positions, and baseline curvature across cultivars. To evaluate model performance under these conditions, we applied a leave-one-cultivar-out (LOCO) cross-validation strategy, which tests the model on an unseen cultivar excluded from the training process (Section 2.3). Evaluating AI models under both intra- and inter-cultivar spectral variability is critical for assessing their suitability for practical applications in legume breeding and quality assessment. For practical deployment in large-scale breeding programs or industrial quality control, analytical models must be able to predict analyte concentrations in new, unseen cultivars without requiring retraining. Consequently, the ability to generalize across genotypes, despite significant G×E biochemical variability, is a prerequisite for the operational utility of SERS-based screening.

Finally, we extended the evaluation to address two practical aspects of deployment: model interpretability and data acquisition efficiency. Shapley Additive Explanations (SHAP) were applied to identify the specific spectral features driving the predictions and to check whether the model relies on chemically relevant vibrational modes. Separately, to optimize operational efficiency, a noise-modeling study was conducted to determine the minimum number of scans required for accurate prediction, providing practical guidelines for reducing acquisition time in high-throughput settings.

2. Materials and Methods

This section is divided into two main parts. Section 2.1 describes the experimental workflow and dataset generation, including the biological materials, sample preparation, SERS measurements, and the structure of the resulting spectral dataset.
Section 2.2 details the AI analysis framework, including preprocessing, ML baselines, the DL architecture used to predict cysteine concentration from SERS spectra, and the model training and evaluation procedures. The overall workflow is summarized in Figure 1, which includes the post hoc model interpretability using SHAP and the noise-modeling study to guide scan count optimization.

2.1. Overview of Experimental Data

This subsection describes the experimental workflow used to generate the SERS dataset for predicting cysteine concentration. The workflow includes cultivar selection, reference cysteine quantification, preparation of sample extracts, fabrication/preparation of SERS substrates, and SERS spectral acquisition. These steps produced a structured spectral dataset that serves as the input to the AI analyses described in Section 2.2.

2.1.1. Pea Cultivars and Reference Compound

Flours from twenty pea cultivars from the CDC breeding program at the University of Saskatchewan were selected based on their contrasting protein profiles to represent a diverse range: AAC Chrome, AAC Lacombe, AAC Liscard, CDC Amarillo, CDC Athabasca, CDC Canary, CDC Dakota, CDC Golden, CDC Greenwater, CDC Inca, CDC Jasper, CDC Striker, CDC Lewochko, CDC Meadow, CDC Patrick, CDC Saffron, CDC Spectrum, CDC Spruce, CDC Tetris, and Redbat 88. The prefixes in these names indicate their breeding origin: 'AAC' denotes varieties from Agriculture and Agri-Food Canada, and 'CDC' denotes those from the Crop Development Centre. Samples from each cultivar were ground into flour and analyzed for cysteine using the oxidative hydrolysis HPLC method described below.
The corresponding HPLC reference cysteine concentrations for each cultivar across the three locations are provided in Table S1 (Supplementary Material).

Figure 1. Overall workflow for SERS data acquisition and AI-based prediction of cysteine concentration. Left: SERS dataset generation from pea cultivars across three locations (60 samples). Sample extracts were prepared and measured on P-SERS substrates using a 785-nm excitation source and fiber-optic probe–based backscattering collection, with spectra acquired at multiple spots per substrate (3 spots/substrate, 36 spectra/spot; 108 spectra/sample; 6,480 total spectra). Right: AI-based modeling pipeline, including spectral preprocessing (smoothing, baseline correction, normalization) and dataset assembly by pairing preprocessed spectra with HPLC-derived cysteine concentrations (ground truth). Models included machine-learning baselines (LR, PLSR, SVR, RFR) and a 1D-CNN. Performance was evaluated using intra-cultivar splits and inter-cultivar (LOCO) testing, with RMSE, MAE, and R² as evaluation metrics. The best-performing 1D-CNN was further analyzed using SHAP for interpretability and noise modeling to optimize scan count (acquisition time).
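The spectrum counts quoted in the Figure 1 caption follow directly from the sampling design. The short sketch below reproduces that arithmetic; the variable names are illustrative and are not taken from the authors' code.

```python
# Sampling design as stated in the text (variable names are illustrative).
N_CULTIVARS = 20
N_LOCATIONS = 3
SPOTS_PER_SUBSTRATE = 3
SPECTRA_PER_SPOT = 36

# One sample corresponds to one cultivar grown at one location.
samples = N_CULTIVARS * N_LOCATIONS
spectra_per_sample = SPOTS_PER_SUBSTRATE * SPECTRA_PER_SPOT
spectra_per_cultivar = N_LOCATIONS * spectra_per_sample
total_spectra = samples * spectra_per_sample

print(samples, spectra_per_sample, spectra_per_cultivar, total_spectra)
# prints: 60 108 324 6480
```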

Reference cysteine concentrations were determined using the performic acid oxidation–acid hydrolysis HPLC method described by Findlay et al. (2025). In this procedure, proteins in pea flour extracts were oxidized with performic acid to convert cysteine (including disulfide-linked forms) to stable cysteic acid. Oxidized samples were then subjected to acid hydrolysis, derivatized using the AccQ-Tag Ultra reagent system, and separated on an AccQ-Tag Ultra C18 (1.7 μm) reversed-phase column using a Shimadzu UPLC system equipped with an SIL-30AC autosampler. Quantification was performed using calibration standards and hydrated amino acid molecular weights to obtain accurate cysteine-equivalent concentrations. L-cysteine (≥97%, Sigma-Aldrich) was used as the reference amino acid standard for calibration during HPLC analysis.

2.1.2. Sample and Substrate Preparation

Alkaline extracts were prepared from pea flour samples (Section 2.1.1), and SERS spectral acquisition was performed following the method described by Findlay et al. (2025), with modifications to reagent ratios and parameters described below. Pea flour was dispersed in Milli-Q water (0.2 g/1 mL). The suspension was kept on an ice bath and homogenized at a speed setting of 5 (~20,000 rpm) for 30 seconds with an IKA Ultra-Turrax homogenizer (IKA-Werke GmbH & Co. KG, Staufen, Germany). Alkaline extraction was performed by adding 25 µL of a pre-prepared 1 M NaOH solution to 1.0 mL of pea homogenate, yielding a final pH of approximately 9. The mixture was vortexed for 10 seconds and incubated at room temperature (25 °C) for 2 hours. Samples were then centrifuged at 8000 rpm (~5000 × g, rotor radius 7 cm) for 15 minutes at 4 °C. The supernatant was collected as the alkaline extract and stored at −80 °C.

Just prior to spectral acquisition, frozen extracts were thawed at room temperature and vortexed to ensure homogeneity. For each sample, 200 µL of extract was transferred into an individual silicone well and mixed with 100 µL of 20 mM tris(2-carboxyethyl)phosphine (TCEP) solution at pH 7, resulting in a final working volume of 300 µL. TCEP was added to reduce disulfide bonds and liberate free thiol for chemisorption to the SERS substrate. Prefabricated paper-based SERS (P-SERS) substrates (Metrohm) were used. Each substrate was positioned over a silicone well, and the handle was removed to allow the plasmonic surface tip to fall into the extract–TCEP mixture with the active side facing upward. The substrate was fully immersed and incubated at room temperature for 45 minutes to ensure consistent analyte–surface interaction.
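Two of the quantities stated in this subsection can be cross-checked with standard conversions: the relative centrifugal force implied by the rotor speed and radius, and the TCEP concentration after mixing. The sketch below uses the usual approximation RCF = 1.118 × 10⁻⁵ × radius (cm) × rpm². The function names are illustrative, and the final TCEP concentration is a derived value that the text does not state explicitly.

```python
def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (in multiples of g) for a rotor speed and radius."""
    return 1.118e-5 * radius_cm * rpm ** 2


def diluted_concentration(stock_mM: float, stock_uL: float, total_uL: float) -> float:
    """Concentration after diluting a stock aliquot into a larger final volume."""
    return stock_mM * stock_uL / total_uL


# 8000 rpm with a 7 cm rotor radius, as given in the text.
rcf = rcf_from_rpm(rpm=8000, radius_cm=7.0)

# 100 uL of 20 mM TCEP mixed with 200 uL extract (300 uL final volume).
tcep_mM = diluted_concentration(stock_mM=20.0, stock_uL=100.0, total_uL=300.0)

print(f"RCF = {rcf:.0f} x g, final TCEP = {tcep_mM:.2f} mM")
```

Under these assumptions the computed force is consistent with the "~5000 × g" quoted above, and the working TCEP concentration in the well comes out at roughly one third of the stock.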
Following incubation, the substrates were transferred directly to the Raman system stage for spectral acquisition without drying or additional processing.

2.1.3. SERS Spectral Acquisition

SERS measurements were acquired using a Raman system equipped with a 785 nm excitation laser delivering 100 mW at the sample surface. The system consisted of a Raman spectrometer coupled to a microscope-mounted sampling stage, enabling reproducible positioning of the P-SERS substrates beneath the Raman probe. Spectra were collected with a 1000 ms integration time and two co-additions, which were automatically combined by the instrument control software into a single stored spectrum, thereby balancing signal quality and measurement speed. Following the 45-minute incubation described in Section 2.1.2, each silicone well with P-SERS substrate was transferred directly to the Raman sampling stage while still immersed. The laser spot was focused on the submerged surface of the SERS substrate with the plasmonic sensing region oriented upward to maintain consistent optical alignment between the Raman probe and the active substrate surface.

To characterize nanoscale surface heterogeneity, spectra were acquired from three distinct spots on each P-SERS substrate. At each spot, 36 sequential spectra were collected without adjusting the optical focus or repositioning the substrate, yielding 108 spectra per sample. The same measurement protocol was repeated independently for samples obtained from each of the three Saskatchewan growing locations (Limerick, Rosthern, and Sutherland), yielding a total of 324 spectra per cultivar.

2.1.4. Dataset

The complete dataset consisted of 6,480 raw SERS spectra (20 cultivars × 3 locations × 108 spectra). Each spectrum was a fixed-length vector of SERS intensities, indexed by Raman shift (cm⁻¹), and was associated with a reference cysteine concentration determined by HPLC. All spectra from a given cultivar and location shared the same reference value. This yields a diverse dataset that captures both intra-cultivar spectral variability associated with instrumental and substrate-related effects (using P-SERS substrates from two separate manufacturing batches) and inter-cultivar spectral variability arising from cultivar- and location-dependent biochemical differences. It therefore supports a robust evaluation of model performance and generalizability for predicting HPLC cysteine concentration from SERS spectra.

2.2. AI-Based Modeling and Data Analysis Framework

This section describes the computational workflow used to develop models to predict HPLC cysteine concentration from SERS spectra. All computational steps were applied after the spectral dataset described in Section 2.1 was generated. The workflow includes spectral preprocessing, implementation of ML and DL models, and model evaluation.

2.2.1. Spectral Preprocessing of SERS Data

Preprocessing is essential for preparing SERS spectra for AI-based modeling. Raw spectra often contain distortions, including baseline drift, random noise, fluorescence background, and intensity fluctuations, which can obscure chemically meaningful features and introduce non-chemical variability. Although many preprocessing algorithms exist, their suitability depends on the underlying physics of the spectroscopic technique. Accordingly, preprocessing should be tailored to the dominant sources of variability rather than applied in a generic manner.
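As a rough illustration of the distortions listed above, the sketch below builds a synthetic spectrum (one band on a curved background with added noise) and passes it through a smoothing, baseline-removal, and scaling chain of the kind this section goes on to describe. Everything here is an assumption made for illustration: the Raman-shift axis, band position, and noise level are invented, the Savitzky–Golay filter is a plain NumPy least-squares version, and the baseline routine is a simplified iterative polynomial fit in the spirit of ModPoly rather than the authors' implementation.

```python
import numpy as np


def savgol_smooth(y, window, order):
    """Savitzky-Golay smoothing: local least-squares polynomial fit in a moving window."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, order + 1, increasing=True)
    coeffs = np.linalg.pinv(design)[0]        # weights that give the fitted value at the window centre
    padded = np.pad(y, half, mode="edge")     # simple edge handling (an assumption)
    return np.convolve(padded, coeffs[::-1], mode="valid")


def modpoly_baseline(y, degree, n_iter=100, tol=1e-3):
    """Iterative polynomial baseline: refit after clipping the signal to the previous fit."""
    x = np.linspace(-1.0, 1.0, y.size)
    work = y.copy()
    for _ in range(n_iter):
        fit = np.polyval(np.polyfit(x, work, degree), x)
        clipped = np.minimum(work, fit)       # peaks above the fit are cut down to it
        if np.linalg.norm(work - clipped) < tol * np.linalg.norm(work):
            break
        work = clipped
    return fit


# Synthetic spectrum: one band + curved background + Gaussian noise (all values invented).
rng = np.random.default_rng(0)
shift = np.linspace(400.0, 1800.0, 1496)      # hypothetical Raman-shift axis (cm^-1)
peak = 50.0 * np.exp(-0.5 * ((shift - 680.0) / 8.0) ** 2)
background = 200.0 + 0.1 * (shift - 400.0) - 4e-5 * (shift - 400.0) ** 2
raw = peak + background + rng.normal(0.0, 2.0, shift.size)

smoothed = savgol_smooth(raw, window=11, order=3)
corrected = smoothed - modpoly_baseline(smoothed, degree=2)
scaled = (corrected - corrected.min()) / (corrected.max() - corrected.min())  # min-max to [0, 1]

print(f"strongest band near {shift[np.argmax(scaled)]:.0f} cm^-1")
```

The window length, polynomial order, and baseline degree used here are example values only; in practice they would be tuned per model, since an overly flexible baseline polynomial can absorb genuine broad bands.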
In SERS and Raman spectra, one of the most prominent artifacts is the fluorescence background. It is a high-intensity, smoothly varying signal that can overwhelm Raman peaks and originates from both sample constituents and detector effects such as charge-coupled device (CCD) baseline drift (Bocklitz et al., 2011; Liland et al., 2016). SERS and Raman spectra are also affected by cosmic-ray artifacts, which appear as sharp, non-physical peaks resulting from high-energy particle impacts on the detector and can bias spectral analysis (Bocklitz et al., 2011; Wu & Chen, 2017). Additional distortions arise from Gaussian noise and other stochastic fluctuations inherent to Raman scattering, which reduce signal-to-noise ratios (S/N) in low-concentration measurements (Wahl et al., 2020). Fluctuations in laser power and changes in optical focusing further contribute to inconsistent peak heights across measurements. Another source of variation results from batch-to-batch differences in commercial SERS substrates. In this case, differences in nanostructure morphology and surface chemistry modify the local electromagnetic field distribution and change signal intensities across identical samples (Jeon et al., 2025).

Numerous preprocessing methods have been proposed to address these issues, including baseline correction (Li et al., 2013; Lieber & Mahadevan-Jansen, 2003; Morháč & Matoušek, 2008; Peng et al., 2010), smoothing (Chen et al., 2013; Gorry, 1990; Wand & Jones, n.d.), spike removal (Justusson, 1981; Li & Dai, 2011; Whitaker & Hayes, 2018), normalization, and derivative-based approaches (Brereton, n.d.; Fearn et al., 2009).

In this study, SERS spectra were preprocessed using a workflow consisting of Savitzky–Golay (SG) smoothing (Savitzky & Golay, 1964), modified polynomial baseline correction (ModPoly) (Xia et al., 2018), and min–max normalization. First, SG smoothing was applied to reduce high-frequency noise while preserving peak shape. It applies a low-order polynomial filter within a moving window, where the window length and polynomial order control the degree of smoothing. Second, the ModPoly baseline correction was used to remove the fluorescence background and the slowly varying baseline curvature. This method estimates a polynomial baseline that captures the background trend of the spectrum, with the polynomial degree controlling its curvature. Cosmic-ray artifacts were addressed through the acquisition strategy rather than through post-processing. By collecting spectra with minimal co-addition, cosmic ray events were limited to single replicates rather than being averaged into the final spectra. This helps prevent attenuation of small but chemically relevant peaks. Finally, for the four ML baselines (LR, PLSR, RFR, SVR), min–max normalization scaled each spectrum to the 0–1 range, thereby minimizing sensitivity to absolute intensity differences. In contrast, the 1D-CNN used unscaled spectral inputs, with internal scaling handled via batch-normalization layers (Section 2.2.2.5) rather than external normalization.

For each AI model described in Section 2.2.2, the preprocessing hyperparameters (SG window length, SG polynomial order, and ModPoly degree) were tuned using a grid search on the training set, with performance evaluated on a held-out validation set. The final preprocessing configuration used for each model is summarized in Table 1.

Table 1. Preprocessing configurations used for each model. All models use Savitzky–Golay (SG) smoothing and modified-polynomial baseline correction (ModPoly). LR, PLSR, RFR and SVR apply min–max normalization, whereas the 1D-CNN relies on internal normalization (batch-normalization layers).
Model     SG Window Length   SG Polynomial Order   ModPoly Degree   Normalization
LR        11                 3                     3                Min-Max
PLSR      5                  3                     9                Min-Max
RFR       17                 4                     11               Min-Max
SVR       9                  3                     7                Min-Max
1D-CNN    11                 3                     2                -

2.2.2. Machine Learning and Deep Learning Models

To analyze the preprocessed SERS spectra, four ML algorithms were evaluated: LR, PLSR, RFR, and SVR. In addition, a 1D-CNN was used as the DL model. Each model was trained independently using the preprocessing workflow described in Section 2.2.1. The model descriptions, training procedures, and hyperparameter settings are outlined below. To ensure reproducibility, the complete source code and a sample dataset are openly available at https://github.com/Elhamm1/SERS-Data-Analysis/tree/main.

2.2.2.1. Linear Regression

LR was used as a simple baseline model to describe the relationship between the preprocessed SERS spectra and cysteine concentration. For each spectrum $\mathbf{x}$, the predicted cysteine value $\hat{y}$ was expressed as a linear combination of spectral intensities plus an intercept ($\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b$), where $\mathbf{x} \in \mathbb{R}^{p}$ denotes the full preprocessed spectrum vector (with $p = 1496$ Raman shift bins), $\mathbf{w}$ denotes the regression coefficients, and $b$ is the intercept term. The parameters $\mathbf{w}$ and $b$ were estimated by ordinary least squares, minimizing the sum of squared differences between predicted and HPLC-measured cysteine values in the training set, i.e., $\sum_{i=1}^{N} \left(y_i - (\mathbf{w}^{\top}\mathbf{x}_i + b)\right)^2$ over the $N$ training spectra. Because LR assumes a linear relationship between spectral features and cysteine concentration, it provides a baseline for comparison with more flexible ML and DL models.

2.2.2.2. Partial Least Squares Regression

PLSR was included as a standard chemometric approach capable of handling strong collinearity across spectral variables.
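For reference, the ordinary least squares fit of Section 2.2.2.1 can be sketched in a few lines. The example below uses small synthetic data in place of spectra and NumPy's least-squares solver; the array sizes, noise level, and variable names are assumptions made for illustration and do not reproduce the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_bins = 200, 10                 # toy sizes; the study uses p = 1496 bins
X = rng.normal(size=(n_spectra, n_bins))    # stand-in for preprocessed spectra
w_true = rng.normal(size=n_bins)            # invented "true" coefficients
b_true = 0.2
y = X @ w_true + b_true + rng.normal(scale=0.01, size=n_spectra)

# Append a column of ones so the intercept b is estimated together with w.
X_aug = np.hstack([X, np.ones((n_spectra, 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

# Training-set error of the fitted linear model.
y_pred = X @ w_hat + b_hat
rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))
print(f"intercept = {b_hat:.3f}, training RMSE = {rmse:.4f}")
```

With strongly collinear neighbouring bins, as in real spectra, the augmented design matrix becomes ill-conditioned and the coefficient estimates unstable, which is the situation PLSR is designed to handle.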
Unlike ordinary linear regression, which operates directly on the original intensities $\mathbf{x}$, PLSR projects each spectrum $\mathbf{x} \in \mathbb{R}^{p}$ into a lower-dimensional set of latent variables (components) that are constructed to maximize the covariance between the spectral features and cysteine concentration. The predicted cysteine concentration $\hat{y}$ is then expressed as a linear combination of these latent variables plus an intercept, $\hat{y} = c_1 t_1 + c_2 t_2 + \cdots + c_A t_A + b$, where $t_1, \ldots, t_A$ are the latent components extracted from the spectra $\mathbf{x}$, $c_1, \ldots, c_A$ are the corresponding regression coefficients, $A$ is the number of components, and $b$ is an intercept term.

In this study, the optimal number of latent components was determined using five-fold cross-validation on a predefined search grid, yielding a final model with $A = 25$ components. The model was then refit using the selected number of components and used to generate predictions for the validation spectra. This procedure yields an LR model in a latent space aligned with cysteine variation and serves as a strong chemometric benchmark for comparison with the nonlinear ML and DL models.

2.2.2.3. Support Vector Regression

SVR was used as a kernel-based nonlinear model to capture more complex relationships between the preprocessed SERS spectra and cysteine concentration. In SVR, the prediction for a new spectrum $\mathbf{x}$ is written as $\hat{y}(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b$, where $\mathbf{x} \in \mathbb{R}^{p}$, $\mathbf{x}_i$ ($i = 1, \ldots, N$) are training spectra, $\alpha_i$ are learned weights, $b$ is an intercept term, and $K(\cdot,\cdot)$ is a kernel function that defines the similarity between pairs of spectra and implicitly maps the data into a high-dimensional feature space.
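The kernel expansion above can be made concrete with a small sketch. It evaluates the weighted sum of kernel similarities plus intercept using a radial basis function kernel (the kernel the study goes on to adopt). The three-bin "spectra", the weights, and the intercept are made-up numbers for illustration, not fitted values, and the function names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, alphas, intercept, gamma):
    """Weighted sum of kernel similarities between x and the training spectra, plus b."""
    return intercept + sum(
        alpha * rbf_kernel(sv, x, gamma)
        for sv, alpha in zip(support_vectors, alphas)
    )


# Hypothetical three-bin spectra with invented weights and intercept.
support_vectors = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.7], [0.9, 0.1, 0.3]]
alphas = [0.05, -0.02, 0.01]
intercept = 0.15

y_hat = svr_predict([0.1, 0.5, 0.9], support_vectors, alphas, intercept, gamma=0.01)
print(round(y_hat, 4))
```

In a fitted model most of the weights are typically zero, so only the training spectra with non-zero weights (the support vectors) contribute to the sum.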
In this study, a radial basis function (RBF) kernel was used to enable flexible, smooth nonlinear fits. The RBF kernel was defined as $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$, where $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$ is the Euclidean distance between two spectra and $\gamma$ controls the rate of decay of similarity with spectral distance. Model training was formulated as an optimization problem that keeps the regression function flat while constraining prediction errors within an $\varepsilon$-insensitive tube around the observed cysteine values. Deviations larger than $\varepsilon$ are penalized through the regularization parameter $C$, which controls the trade-off between model complexity and error tolerance. Before SVR, input spectra were standardized to zero mean and unit variance. The hyperparameters $C$, $\varepsilon$, and $\gamma$ were selected by five-fold cross-validation over a predefined grid using a Smooth L1 (Huber) loss, and the final SVR model used $C = 5.0$, $\varepsilon = 0.01$, and $\gamma = 0.01$. With this configuration, SVR provides a flexible nonlinear baseline that can model smooth spectral–concentration relationships while controlling model complexity through regularization and the kernel parameters.

2.2.2.4. Random Forest Regression

Random Forest Regression was used to model nonlinear relationships between the preprocessed SERS spectra and cysteine concentration by combining many decision trees. The training data consist of pairs $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the spectrum of sample $i$ (intensities at 1,496 Raman shift variables, denoted as $p$) and $y_i$ is its HPLC-measured cysteine value. In this approach, a decision tree learns a sequence of binary split rules based on individual spectral variables. At each split, the algorithm selects a Raman shift variable and a threshold to reduce the variation in cysteine values across the resulting child nodes. This splitting process is repeated recursively and stops when further splits do not reduce variation or when the remaining node contains few samples. The final nodes (leaf nodes) contain training spectra with similar cysteine values, and the prediction from a single tree for a new spectrum is the mean cysteine value of the training samples that fall into the same leaf. A random forest combines the predictions of many such trees to improve generalization and reduce variance.

In this study, $T = 300$ trees were trained. Each tree was fitted on a sample of the training data, and at each split, only a subset of spectral variables was considered as candidates, with the number of candidate features set to $\sqrt{p}$. For a given spectrum $\mathbf{x}$, the forest prediction was obtained by averaging the outputs of all trees, $\hat{y}_{\mathrm{RF}}(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{x})$, where $f_t(\mathbf{x})$ is the prediction from the $t$-th tree. This ensemble structure enables RFR to capture nonlinear dependencies and interaction effects among spectral features while reducing variance by averaging across multiple diverse trees.

2.2.2.5. One-dimensional Convolutional Neural Network

A 1D-CNN was used as a DL model to capture nonlinear relationships between the SERS spectra and cysteine concentration. For an input spectrum $\mathbf{x} \in \mathbb{R}^{1496}$, the network treats $\mathbf{x}$ as a one-dimensional sequence with a single input channel and outputs a scalar prediction $\hat{y} = f_{\theta}(\mathbf{x})$. For the $i$-th spectrum $\mathbf{x}_i$, this is $\hat{y}_i = f_{\theta}(\mathbf{x}_i)$, where $f_{\theta}$ denotes the network parameterized by $\theta$.

The architecture comprised four consecutive convolutional blocks, followed by two fully connected layers. The convolutional blocks used 1D convolutions with a kernel size of 5 and increasing numbers of filters (16, 32, 64, and 128).
Each block applied convolution, batch normalization, and a ReLU activation, followed by max-pooling with a pool size (and stride) of 2 to downsample the spectral axis. These pooling operations reduced the spectral length by an overall factor of 16. The resulting feature maps were flattened and passed to a fully connected layer with 128 units and ReLU activation, followed by dropout (rate = 0.3) to reduce overfitting. A final linear output layer with a single neuron produced the predicted cysteine value.

The network was trained using mini-batch gradient descent with a batch size of 32. The training objective was a Smooth L1 (Huber) loss with $\beta = 0.02$. Optimization was performed using AdamW with an initial learning rate of $1 \times 10^{-4}$ and decoupled weight decay, together with a OneCycle learning rate schedule over 100 epochs. Mixed-precision training and gradient clipping were used to stabilize optimization. Model performance was monitored on a held-out validation set at the end of each epoch, and the parameter set $\theta$ corresponding to the lowest validation loss was retained as the final 1D-CNN model. This architecture enables the 1D-CNN to learn hierarchical spectral features and capture nonlinear relationships between SERS patterns and cysteine concentration, while maintaining strong generalization via regularization.

2.3. Model Evaluation Strategy

To compare the ML and DL models, we used two evaluation strategies. First, we performed a within-cultivar split: approximately 80% of the spectra from each cultivar were used for training, and the remaining 20% were reserved for testing.
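
The within-cultivar split can be sketched as follows. This is an illustrative NumPy version that holds out roughly 20% of the spectra of every cultivar; the random shuffling, the seed, and the helper name `within_cultivar_split` are assumptions, since the text does not specify how spectra were assigned to the two subsets. The array of cultivar labels uses the dataset dimensions reported in the paper (20 cultivars, 324 spectra each).

```python
import numpy as np

def within_cultivar_split(cultivar_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out ~test_fraction of each cultivar."""
    rng = np.random.default_rng(seed)
    cultivar_ids = np.asarray(cultivar_ids)
    train_idx, test_idx = [], []
    for cultivar in np.unique(cultivar_ids):
        # Shuffle the indices belonging to this cultivar only.
        idx = rng.permutation(np.flatnonzero(cultivar_ids == cultivar))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

# 20 cultivars x 324 spectra each = 6,480 spectra in total.
cultivar_ids = np.repeat(np.arange(20), 324)
train_idx, test_idx = within_cultivar_split(cultivar_ids)
```

Because every cultivar contributes to both subsets, this split measures performance on familiar spectral distributions rather than on unseen cultivars.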
Because the training and test spectra were from the same cultivar, the model was evaluated under conditions in which the spectral distribution was familiar. This setting provides a controlled baseline for assessing performance in the presence of intra-cultivar spectral variability and for tuning preprocessing and model hyperparameters. Second, we used a leave-one-cultivar-out (LOCO) cross-validation protocol to evaluate generalization across cultivars. In each LOCO fold, one cultivar was withheld as an independent test set, and the models were trained on spectra from the remaining 19 cultivars. This procedure was repeated until each cultivar had served once as the held-out test set. LOCO therefore measures how well a model generalizes to spectra from an unseen cultivar, assessing robustness against inter-cultivar spectral variability.

Model performance under both evaluation strategies was quantified using standard regression metrics. The root mean squared error (RMSE) was defined as $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, which places greater weight on larger errors. The mean absolute error (MAE) was calculated as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$, providing an average error measure in the same units as cysteine concentration, g/100 g. The coefficient of determination was computed as $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $y_i$ denotes the true cysteine concentration for sample $i$, $\hat{y}_i$ is the corresponding model prediction, $\bar{y}$ is the mean of the observed $y_i$ values, and $n$ is the number of spectra in the test set for a given split or LOCO fold. Together, these metrics summarize prediction performance under both within-cultivar and LOCO evaluations and enable a direct comparison of the ML models and the DL approach.

3. Results and Discussion

This section presents the results of the AI-based cysteine quantification and discusses their implications for model generalizability and practical deployment. As detailed in Section 2.1.4, the dataset comprised 6,480 SERS spectra collected across 20 pea cultivars. For the analysis, these spectra were stored in NumPy binary format, with each file containing a fixed-length vector of 1,496 Raman intensity values. All ML models were implemented using the scikit-learn library, while the 1D-CNN was implemented in PyTorch. Each model was trained on the same vectorized spectral inputs under the two evaluation strategies described in Section 2.3. These strategies were selected to distinguish between measurement-related artifacts (intra-cultivar) and biological diversity (inter-cultivar). Accordingly, Section 3.1 evaluates the predictive performance of the models under these different sources of spectral variability, while Section 3.2 extends beyond predictive performance to examine the interpretability and operational robustness of the 1D-CNN framework.

3.1. Impact of Spectral Variability on Model Performance

3.1.1. Intra-Cultivar Spectral Variability

To assess model robustness against measurement noise, Table 2 summarizes the performance of the five models on raw versus preprocessed spectra. This comparison highlights how each algorithm responds to measurement-related artifacts, such as fluorescence background, baseline drift, and stochastic noise. For the four ML models (LR, PLSR, SVR, RFR), preprocessing yielded clear improvements, reducing RMSE and increasing R².
This suggests that these models are sensitive to baseline drift and noise, which can obscure the underlying spectral patterns. LR and PLSR benefited from baseline correction and scaling, which strengthened the linear relationship between spectral features and concentration. RFR also benefited from preprocessing, as reduced background and noise can yield more stable split decisions and more consistent averaging across trees. Finally, the improvement observed for SVR suggests that kernel-based similarity calculations are easily distorted by peak-shape noise.

In contrast, the 1D-CNN achieved nearly identical performance on raw and preprocessed inputs (RMSE = 0.008 g/100 g in both cases; R² = 0.858 and 0.862, respectively), showing little dependence on external preprocessing. This robustness likely arises from the convolutional layers, which process local spectral windows to capture peak shape and structure rather than relying on absolute intensity. Additionally, internal mechanisms such as batch normalization and pooling help handle global intensity scaling and high-frequency fluctuations. Consequently, despite the use of substrates from two different manufacturing batches, the 1D-CNN decoupled the target signal from intra-cultivar measurement variability more effectively than the other models.

Table 2. Performance of the five predictive models on raw and preprocessed SERS spectra. Results are reported as RMSE, MAE, and R² for each model before and after applying Savitzky–Golay (SG) smoothing, modified polynomial baseline correction (ModPoly), and min–max normalization.

| Model | RMSE (g/100 g), raw | RMSE (g/100 g), preprocessed | MAE (g/100 g), raw | MAE (g/100 g), preprocessed | R², raw | R², preprocessed |
|---|---|---|---|---|---|---|
| LR | 0.013 | 0.012 | 0.010 | 0.008 | 0.561 | 0.650 |
| PLSR | 0.014 | 0.013 | 0.012 | 0.011 | 0.569 | 0.626 |
| RFR | 0.014 | 0.013 | 0.011 | 0.010 | 0.585 | 0.662 |
| SVR | 0.010 | 0.008 | 0.008 | 0.007 | 0.794 | 0.861 |
| 1D-CNN | 0.008 | 0.008 | 0.007 | 0.007 | 0.858 | 0.862 |
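
For reference, the three metrics reported in Table 2 follow the definitions given in Section 2.3 and can be computed as in the short NumPy sketch below. The function name `regression_metrics` and the four example concentrations are hypothetical; they are not values from the study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for one test split or LOCO fold."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # penalizes large errors more
    mae = np.mean(np.abs(residuals))                 # same units as y (g/100 g)
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical cysteine values in g/100 g, for illustration only.
y_true = [0.10, 0.12, 0.14, 0.16]
y_pred = [0.11, 0.12, 0.13, 0.17]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
```

Note that R² depends on the spread of the true values within the test set, so a fold whose cultivar spans a narrow concentration range can show a low R² even when RMSE is small.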

3.1.2. Inter-Cultivar Spectral Variability

To assess generalization to unseen genotypes, Table 3 compares model performance under within-cultivar and LOCO evaluation strategies. Under within-cultivar testing, where the test set contains familiar spectral distributions, all models performed reasonably well (RMSE 0.008–0.013 g/100 g). In this setting, variability is dominated by instrumental noise rather than by biochemical differences, allowing the models to fit the cultivar's identity rather than its biochemical signature. However, under the LOCO evaluation, the ML models exhibited a marked performance decline when applied to unseen cultivars. R² values dropped to 0.037–0.124, and RMSE rose from 0.008–0.013 to 0.021–0.097 g/100 g, an increase ranging from roughly 1.6-fold (PLSR) to about 8-fold (LR). This suggests that these models rely on absolute peak intensities, which vary due to G×E interactions and substrate effects, rather than on the intrinsic molecular signature of cysteine.

Conversely, the 1D-CNN demonstrated robust generalization, maintaining a low RMSE of 0.011 g/100 g and an R² of 0.795 under LOCO conditions. This suggests that the convolutional architecture learns spectral features that are stable across cultivars. The network captures the local structure around each Raman peak and learns how intensities vary within small neighbourhoods, rather than relying on absolute peak heights, which vary across cultivars and substrates. It can learn detailed peak-shape characteristics, including curvature, width, and asymmetry, which are linked to molecular structure. These results indicate that while the conventional ML methods are sufficient for characterizing known cultivars, the 1D-CNN is better suited to generalizable prediction in breeding and quality-control applications, where new cultivars are encountered.
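
The LOCO protocol behind these results can be sketched as a simple fold generator. The version below is a minimal NumPy illustration in which the helper name `loco_folds` and the integer cultivar labels are assumptions; the array sizes follow the dataset dimensions reported in the paper (20 cultivars, 324 spectra each).

```python
import numpy as np

def loco_folds(cultivar_ids):
    """Yield (held_out_cultivar, train_idx, test_idx), one fold per cultivar."""
    cultivar_ids = np.asarray(cultivar_ids)
    for cultivar in np.unique(cultivar_ids):
        test_mask = cultivar_ids == cultivar
        # All spectra of the held-out cultivar go to the test set;
        # spectra of every other cultivar form the training set.
        yield cultivar, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# 20 cultivars x 324 spectra each = 6,480 spectra in total.
cultivar_ids = np.repeat(np.arange(20), 324)
folds = list(loco_folds(cultivar_ids))
```

scikit-learn's `LeaveOneGroupOut` splitter provides the same grouping behaviour when the cultivar labels are passed as `groups`. In either form, any scaling or hyperparameter selection would need to be fitted on the training indices of each fold only, so that no information from the held-out cultivar leaks into training.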

Table 3. Comparison of model performance under within-cultivar and LOCO evaluation schemes using preprocessed spectra.

| Model | RMSE (g/100 g), within-cultivar | RMSE (g/100 g), LOCO | MAE (g/100 g), within-cultivar | MAE (g/100 g), LOCO | R², within-cultivar | R², LOCO |
|---|---|---|---|---|---|---|
| LR | 0.012 | 0.097 | 0.008 | 0.045 | 0.650 | 0.037 |
| PLSR | 0.013 | 0.021 | 0.011 | 0.017 | 0.626 | 0.103 |
| RFR | 0.013 | 0.022 | 0.010 | 0.017 | 0.662 | 0.124 |
| SVR | 0.008 | 0.022 | 0.007 | 0.018 | 0.861 | 0.118 |
| 1D-CNN | 0.008 | 0.011 | 0.007 | 0.008 | 0.862 | 0.795 |

The validation of quantitative prediction and robustness to measurement variability addresses a key limitation highlighted in the Introduction. Prior work has used ML with SERS for qualitative objectives, such as discriminating protein types (Barucci et al., 2021) or identifying binding-related spectral changes (Peng et al., 2022). Here, we benchmark AI-based SERS analysis for quantitative regression in a complex food matrix using a deployment-focused evaluation, and we show that the 1D-CNN maintains strong performance when applied to unseen cultivars. To the best of our knowledge, this is the first application of deep learning to quantify a specific amino acid (cysteine) in legume extracts using SERS. This advance provides a scalable framework for high-throughput phenotyping, enabling breeders to rapidly screen for nutritional quality.

3.2. Practical Applications of the 1D-CNN for Cysteine Quantification in Pea Cultivars

This section extends beyond predictive performance to examine how the 1D-CNN can be used to support SERS-based quantification of cysteine in this study. First, we use SHAP to identify the Raman regions that most contribute to model predictions under both evaluation schemes.
Second, we evaluate the model's sensitivity to spectral noise using a controlled augmentation framework that simulates varying scan counts and quantifies their effects on predictive performance.

3.2.1. Interpreting Raman Vibrational Features Across Cultivars

To examine which Raman regions the 1D-CNN uses for cysteine prediction, we applied SHAP analysis to the trained 1D-CNN models under both evaluation schemes. For the within-cultivar split, SHAP values were computed for the corresponding within-cultivar model. For LOCO (20-fold), SHAP values were computed using the best-performing fold. Figure 2 shows SHAP summary plots for the within-cultivar model (left) and the LOCO model (right). Each point corresponds to a spectrum in the SHAP evaluation set. Features are ranked from top to bottom by mean absolute SHAP value, which reflects the average magnitude of the contribution of each feature to the predicted cysteine concentration across spectra. The horizontal axis shows SHAP values, indicating whether each feature increases or decreases the predicted cysteine concentration. Point color indicates the Raman intensity at that Raman shift, from low to high.

In the within-cultivar setting (Figure 2, left), the most impactful features are distributed across multiple Raman-shift regions rather than concentrated in a single band. Dominant contributions appear both near ~880–930 cm⁻¹ (e.g., 890, 918, 921, 931 cm⁻¹) and within the ~630–650 cm⁻¹ region (e.g., 637, 640, 643, 648 cm⁻¹), with additional contributions at intermediate bands (e.g., ~669–791 cm⁻¹). This pattern suggests that when cultivar identity is shared between training and testing, the model can rely on a broader set of spectral features rather than on a single feature associated with cysteine.

In the LOCO setting (Figure 2, right), the SHAP ranking is more structured across Raman regions.
Although a low-Raman-shift feature near ~200 cm⁻¹ appears as a top contributor, most of the highly ranked features are concentrated in the ~630–760 cm⁻¹ range (e.g., ~632–648 cm⁻¹ and ~702–725 cm⁻¹). Low-Raman-shift SERS features near ~200 cm⁻¹ are attributed to substrate-related contributions, such as metal–adsorbate interactions, metal lattice/phonon modes, or electronic scattering, rather than to internal molecular vibrations (Inagaki et al., 2019). In addition, Ag P-SERS spectra show a dominant low-shift band near ~235 cm⁻¹ that has been assigned to Ag–Ag stretching (Findlay et al., 2025), consistent with substrate-related contributions at low Raman shifts. In contrast, biochemical interpretation is supported by the highly ranked bands in the 630–760 cm⁻¹ region (Adar et al., 2022). Features near ~643–648 cm⁻¹ and ~712–725 cm⁻¹ are consistent with reported protein carbon–sulfur (C–S)–related vibrations in this band range, supporting their relevance for cysteine prediction under LOCO. The LOCO ranking provides the most appropriate basis for interpreting 1D-CNN behavior in cross-cultivar prediction, because it reflects features that remain informative when the test cultivar is not represented in the training data.

Figure 2. SHAP summary plots showing Raman-shift regions that contribute to 1D-CNN predictions of cysteine concentration under two evaluation schemes: within-cultivar split (left) and leave-one-cultivar-out (LOCO) evaluation (right). Features (Raman shift, cm⁻¹) are ordered from top to bottom by mean absolute SHAP value, representing the average contribution magnitude to the model output.
Each point corresponds to one spectrum in the SHAP evaluation set. The x-axis shows SHAP values (in the units of the model output), where positive values increase the predicted cysteine concentration and negative values decrease it. Point color indicates the feature value at that Raman shift, from low to high.

3.2.2. Data Acquisition Optimization and Noise Modeling

The controlled-noise study examines how a 1D-CNN can inform practical choices in SERS data acquisition. As described in Section 2.1.3, each stored spectrum was acquired with two co-additions, meaning that two consecutive acquisitions were combined into a single spectrum. In the noise-modeling analysis, the scan count $N$ refers to the effective number of averaged acquisitions per spectrum (co-additions). Because random noise decreases with increasing $N$, spectra acquired with fewer effective scans have lower signal-to-noise ratios. Quantifying how prediction performance changes as the number of scans decreases is therefore important for balancing acquisition time against analytical performance. To assess this trade-off, we used a noise-modeling and augmentation framework to generate synthetic spectra that mimic measurements at different effective scan counts while preserving the underlying spectral structure of the original data.

The augmentation strategy was based on signal-averaging theory, where the standard deviation of random noise scales as $1/\sqrt{N}$. We defined a high-SNR reference level $N_{\mathrm{ref}} = 512$ and scaled the additive noise by $\sqrt{N_{\mathrm{ref}}/N}$ to simulate effective scan counts from 64 down to 1.
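
The scaling rule can be illustrated with a short NumPy sketch. It assumes zero-mean Gaussian noise and a hypothetical reference noise level `sigma_ref` (the noise standard deviation at the 512-scan reference); neither the noise distribution nor this value is specified in the text, and the linear ramp stands in for a real spectrum.

```python
import numpy as np

N_REF = 512  # high-SNR reference level used for scaling

def noise_scale(n_scans, n_ref=N_REF):
    """Factor sqrt(n_ref / n_scans) applied to the reference-level noise."""
    return np.sqrt(n_ref / n_scans)

def simulate_scan_count(spectrum, n_scans, sigma_ref, rng, n_ref=N_REF):
    """Add zero-mean Gaussian noise with std sigma_ref * sqrt(n_ref / n_scans)."""
    sigma = sigma_ref * noise_scale(n_scans, n_ref)
    return spectrum + rng.normal(0.0, sigma, size=spectrum.shape)

rng = np.random.default_rng(0)
spectrum = np.linspace(0.0, 1.0, 1496)  # placeholder for a measured spectrum
sigma_ref = 1e-3                        # assumed noise std at the reference level

# One augmented copy per simulated scan count, from 64 down to 1.
augmented = {n: simulate_scan_count(spectrum, n, sigma_ref, rng)
             for n in (64, 32, 16, 8, 4, 2, 1)}
```

Under this rule, each halving of the scan count multiplies the noise standard deviation by $\sqrt{2}$, so a single scan carries eight times the noise of 64 scans.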
The reference level $N_{\mathrm{ref}} = 512$ was chosen to provide a wide signal-to-noise range to resolve performance trends across scan counts. It is used only as a reference for scaling the added noise and does not imply that spectra were experimentally acquired with 512 co-additions. This approach generated datasets with increasing noise while preserving the underlying spectral structure.

To perform the noise augmentation, we randomly selected 10 spectra from the 324 spectra available for each cultivar and generated augmented versions of these spectra for each simulated scan count. The results, summarized in Table 4, show a clear relationship between scan count and predictive performance. As the number of scans decreases from 64 to 1, RMSE increases from 0.009 to 0.016, MAE increases from 0.008 to 0.014, and R² decreases from 0.843 to 0.446. The model maintains low RMSE (0.009–0.011) and R² of at least 0.770 at 64, 32, 16, and 8 scans, with nearly flat performance between 32 and 8 scans (R² 0.777 to 0.770). At 4 and 2 scans, RMSE increases to 0.014 and R² falls to 0.607 and 0.583, respectively. Performance is lowest at 1 scan. These results indicate that the 1D-CNN tolerates the added noise down to 8 scans, with a marked loss of accuracy below that level. Based on these results, 8 scans provide a good balance between acquisition time and predictive performance. Using 2 co-additions, consistent with the experimental protocol (Section 2.1.3), is also possible but yields lower accuracy than 8 scans.

Table 4. Performance of the 1D-CNN model as a function of simulated scan count in the noise-modeling experiment. Spectra at each scan level were generated by scaling additive noise relative to a 512-scan reference. Performance is reported as RMSE, MAE, and R² for cysteine concentration.

| Number of scans | RMSE (g/100 g) | MAE (g/100 g) | R² |
|---|---|---|---|
| 64 | 0.009 | 0.008 | 0.843 |
| 32 | 0.010 | 0.009 | 0.777 |
| 16 | 0.010 | 0.009 | 0.774 |
| 8 | 0.011 | 0.009 | 0.770 |
| 4 | 0.014 | 0.012 | 0.607 |
| 2 | 0.014 | 0.013 | 0.583 |
| 1 | 0.016 | 0.014 | 0.446 |

Beyond evaluating noise effects, this analysis highlights the application of data augmentation when experimental data are limited. By generating realistic spectral variants that expand the training set, the augmentation procedure allows the 1D-CNN to learn a broader range of instrumental noise, baseline variation, and spectral fluctuations. This is particularly useful in SERS studies, where data collection is often constrained by sample availability, instrument time, or substrate variability.

4. Conclusion

This study demonstrates that AI-based modeling of SERS spectra enables the quantitative prediction of cysteine in pea extracts. Model performance depended on the dominant source of variability represented in the evaluation. Under within-cultivar testing, where intra-cultivar spectral variability dominates, the ML models benefited from preprocessing and achieved moderate-to-high performance. In contrast, when the evaluation introduced inter-cultivar spectral variability through LOCO testing, the performance of traditional regression models declined sharply, indicating weak generalization to unseen cultivars. The 1D-CNN showed better cross-cultivar generalization, with only a small increase in RMSE from within-cultivar to LOCO testing, supporting its suitability for applications where new cultivars are expected at deployment.

SHAP analysis provided insight into how the 1D-CNN behaves under intra- and inter-cultivar spectral variability. Under within-cultivar testing, feature importance was distributed across multiple regions.
Under LOCO conditions, feature importance became more structured and concentrated in the ~630–760 cm⁻¹ region, with an additional contribution from a low-Raman-shift feature near ~200 cm⁻¹. The concentration of influential features in the 630–760 cm⁻¹ range, which is consistent with reported C–S–related vibrational contributions in proteins, supports a chemical basis for cross-cultivar prediction and points to spectral patterns that remain stable across cultivars. From a practical perspective, the noise study indicates that with 8 scans, the 1D-CNN maintained performance comparable to that obtained with 16 or 32 scans, thereby reducing acquisition time. In addition, the model performed consistently across substrate batches, thereby addressing a barrier to SERS reproducibility. This supports its suitability for routine operations where consumable properties vary.

Overall, the findings support the use of SERS combined with DL as a practical and scalable approach for rapid, cross-cultivar prediction of cysteine concentration, thereby supporting food-quality control and cultivar selection. Future work should extend the approach to full-panel amino acid profiling and compare the current 1D-CNN with alternative DL architectures to evaluate improvements in generalization and robustness across diverse plant protein matrices.

Acknowledgements

This work was supported by the National Research Council of Canada (NRC) through the Sustainable Protein Production (SPP) Program grant number SPP-142–1. The authors also acknowledge Obasi Ukpai Ukoji and Sristi Mundhada for their contributions to the SERS data acquisition.

References

Adar et al. (2022). Interpretation of Raman spectrum of proteins. Spectroscopy, 37(2), 9–13, 25. https://doi.org/10.56530/spectroscopy.lo2270l5

Barucci, A., D’Andrea, C., Farnesi, E., Banchelli, M., Amicucci, C., de Angelis, M., Hwang, B., & Matteini, P. (2021). Label-free SERS detection of proteins based on machine learning classification of chemo-structural determinants. Analyst, 146(2), 674–682. https://doi.org/10.1039/D0AN02137G

Bocklitz, T., Walter, A., Hartmann, K., Rösch, P., & Popp, J. (2011). How to pre-process Raman spectra for reliable and stable models? Analytica Chimica Acta, 704(1–2), 47–56. https://doi.org/10.1016/j.aca.2011.06.043

Bokobza, L. (1998). Near infrared spectroscopy. Journal of Near Infrared Spectroscopy, 6(1), 3–17. https://doi.org/10.1255/jnirs.116

Boye, J., Zare, F., & Pletch, A. (2010). Pulse proteins: Processing, characterization, functional properties and applications in food and feed. Food Research International, 43(2), 414–431. https://doi.org/10.1016/j.foodres.2009.09.003

Brereton, R. G. (2003). Chemometrics: Data analysis for the laboratory and chemical plant. John Wiley & Sons. https://doi.org/10.1002/0470863242

Chen, G., Xie, W., & Zhao, Y. (2013, June 9–11). Wavelet-based denoising: A brief review. In Proceedings of the 2013 4th International Conference on Intelligent Control and Information Processing (ICICIP) (pp. 570–574). IEEE. https://doi.org/10.1109/ICICIP.2013.6568140

Chon, B., Xu, S., & Lee, Y. J. (2021). Compensation of strong water absorption in infrared spectroscopy reveals the secondary structure of proteins in dilute solutions. Analytical Chemistry, 93(4), 2215–2225. https://doi.org/10.1021/acs.analchem.0c04091

Das, R. S., & Agrawal, Y. K. (2011). Raman spectroscopy: Recent advancements, techniques and applications. Vibrational Spectroscopy, 57(2), 163–176. https://doi.org/10.1016/j.vibspec.2011.08.003

Fearn, T., Riccioli, C., Garrido-Varo, A., & Guerrero-Ginel, J. E. (2009). On the geometry of SNV and MSC. Chemometrics and Intelligent Laboratory Systems, 96(1), 22–26. https://doi.org/10.1016/j.chemolab.2008.11.006

Findlay, C. R. J., Ukoji, O. U., Mundhada, S., Polley, B., Ko, A. C.-T., Bhowmik, P., & Paliwal, J. (2025). Quantitative paper-based SERS method for the rapid determination of sulfur amino acid residues in Pisum sativum. Measurement: Food, 19, Article 100240. https://doi.org/10.1016/j.meafoo.2025.100240

Gerrano, A. S., Mbuma, N. W., & Mumm, R. H. (2022). Expression of nutritional traits in vegetable cowpea grown under various South African agro-ecological conditions. Plants, 11(11), Article 1422. https://doi.org/10.3390/plants11111422

Gorry, P. A. (1990). General least-squares smoothing and differentiation by the convolution (Savitzky–Golay) method. Analytical Chemistry, 62(6), 570–573. https://doi.org/10.1021/ac00205a007

Grys, D. B., Chikkaraddy, R., Kamp, M., Scherman, O. A., Baumberg, J. J., & de Nijs, B. (2021). Eliminating irreproducibility in SERS substrates. Journal of Raman Spectroscopy, 52(2), 412–419. https://doi.org/10.1002/jrs.6008

Inagaki, M., Motobayashi, K., & Ikeda, K. (2019). Low-frequency surface-enhanced Raman scattering spectroscopy at metal electrode surfaces. Current Opinion in Electrochemistry, 17, 143–148. https://doi.org/10.1016/j.coelec.2019.06.001

Iqbal, A., Khalil, I. A., Ateeq, N., & Sayyar Khan, M. (2006). Nutritional quality of important food legumes. Food Chemistry, 97(2), 331–335. https://doi.org/10.1016/j.foodchem.2005.05.011

Jeon, Y., Lee, S., Jeon, Y. J., Kim, D., Ham, J. H., Jung, D. H., Kim, H. Y., & You, J. (2025). Rapid identification of pathogenic bacteria using data preprocessing and machine learning-augmented label-free surface-enhanced Raman scattering. Sensors and Actuators B: Chemical, 425, Article 136963. https://doi.org/10.1016/j.snb.2024.136963

Justusson, B. I. (1981). Median filtering: Statistical properties. In T. S. Huang (Ed.), Two-dimensional digital signal processing II: Transforms and median filters (pp. 161–196). Springer-Verlag. https://doi.org/10.1007/BFb0057597

Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. CRC Press. https://doi.org/10.1201/b14876

Li, S., & Dai, L. (2011). An improved algorithm to remove cosmic spikes in Raman spectra for online monitoring. Applied Spectroscopy, 65(11), 1300–1306. https://doi.org/10.1366/10-06169

Li, Z., Zhan, D. J., Wang, J. J., Huang, J., Xu, Q. S., Zhang, Z. M., Zheng, Y. B., Liang, Y. Z., & Wang, H. (2013). Morphological weighted penalized least squares for background correction. Analyst, 138(16), 4483–4492. https://doi.org/10.1039/c3an00743j

Lieber, C. A., & Mahadevan-Jansen, A. (2003). Automated method for subtraction of fluorescence from biological Raman spectra. Applied Spectroscopy, 57(11), 1363–1367. https://doi.org/10.1366/000370203322554518

Liland, K. H., Kohler, A., & Afseth, N. K. (2016). Model-based pre-processing in Raman spectroscopy of biological samples. Journal of Raman Spectroscopy, 47(6), 643–650. https://doi.org/10.1002/jrs.4886

Lisciani, S., Marconi, S., Le Donne, C., Camilli, E., Aguzzi, A., Gabrielli, P., Gambelli, L., Kunert, K., Marais, D., Vorster, B. J., Alvarado-Ramos, K., Reboul, E., Cominelli, E., Preite, C., Sparvoli, F., Losa, A., Sala, T., Botha, A. M., & Ferrari, M. (2024). Legumes and common beans in sustainable diets: Nutritional quality, environmental benefits, spread and use in food preparations. Frontiers in Nutrition, 11, Article 1385232. https://doi.org/10.3389/fnut.2024.1385232

Maphosa, Y., & Jideani, V. A. (2017). The role of legumes in human nutrition. In M. Chávarri Hueda (Ed.), Functional food: Improve health through adequate food (pp. 103–121). InTechOpen. https://doi.org/10.5772/intechopen.69127

Morháč, M., & Matoušek, V. (2008). Peak clipping algorithms for background estimation in spectroscopic data. Applied Spectroscopy, 62(1), 91–106. https://doi.org/10.1366/000370208783412762

Moskovits, M. (1985). Surface-enhanced spectroscopy. Reviews of Modern Physics, 57(3), 783–826. https://doi.org/10.1103/RevModPhys.57.783

Ng, L. M., & Simmons, R. (1999). Infrared spectroscopy. Analytical Chemistry, 71(12), 343R–350R. https://doi.org/10.1021/a1999908r

Park, M., Somborn, A., Schlehuber, D., Keuter, V., & Deerberg, G. (2023). Raman spectroscopy in crop quality assessment: Focusing on sensing secondary metabolites: A review. Horticulture Research, 10(5), uhad074. https://doi.org/10.1093/hr/uhad074

Peng, J., Peng, S., Jiang, A., Wei, J., Li, C., & Tan, J. (2010). Asymmetric least squares for multiple spectra baseline correction. Analytica Chimica Acta, 683(1), 63–68. https://doi.org/10.1016/j.aca.2010.08.033

Peng, M., Wang, Z., Sun, X., Guo, X., Wang, H., Li, R., Liu, Q., Chen, M., & Chen, X. (2022). Deep learning-based label-free surface-enhanced Raman scattering screening and recognition of small-molecule binding sites in proteins. Analytical Chemistry, 94(33), 11483–11491. https://doi.org/10.1021/acs.analchem.2c01158

Pilot, R., Signorini, R., Durante, C., Orian, L., Bhamidipati, M., & Fabris, L. (2019). A review on surface-enhanced Raman scattering. Biosensors, 9(2), Article 57. https://doi.org/10.3390/bios9020057

Samal, I., Bhoi, T. K., Raj, M. N., Majhi, P. K., Murmu, S., Pradhan, A. K., Kumar, D., Paschapur, A. U., Joshi, D. C., & Guru, P. N. (2023). Underutilized legumes: Nutrient status and advanced breeding approaches for qualitative and quantitative enhancement. Frontiers in Nutrition, 10, Article 1110750. https://doi.org/10.3389/fnut.2023.1110750

Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639. https://doi.org/10.1021/ac60214a047

EFSA NDA Panel (EFSA Panel on Dietetic Products, Nutrition and Allergies). (2012). Scientific opinion on dietary reference values for protein. EFSA Journal, 10(2), Article 2557. https://doi.org/10.2903/j.efsa.2012.2557

Shanthakumar, P., Klepacka, J., Bains, A., Chawla, P., Dhull, S. B., & Najda, A. (2022). The current situation of pea protein and its application in the food industry. Molecules, 27(16), Article 5354. https://doi.org/10.3390/molecules27165354

Snyder, L. R., Kirkland, J. J., & Dolan, J. W. (2010). Introduction to modern liquid chromatography (3rd ed.). John Wiley & Sons. https://doi.org/10.1002/9780470508183

Sparkman, O. D., Penton, Z. E., & Kitson, F. G. (2011). Gas chromatography and mass spectrometry: A practical guide (2nd ed.). Academic Press. https://doi.org/10.1016/c2009-0-17039-3

Wahl, J., Sjödahl, M., & Ramser, K. (2020). Single-step preprocessing of Raman spectra using convolutional neural networks. Applied Spectroscopy, 74(4), 427–438. https://doi.org/10.1177/0003702819888949

Whitaker, D. A., & Hayes, K. (2018). A simple algorithm for despiking Raman spectra. Chemometrics and Intelligent Laboratory Systems, 179, 82–84. https://doi.org/10.1016/j.chemolab.2018.06.009

World Health Organization, Food and Agriculture Organization of the United Nations, & United Nations University. (2007). Protein and amino acid requirements in human nutrition: Report of a joint WHO/FAO/UNU expert consultation (WHO Technical Report Series No. 935). https://iris.who.int/handle/10665/43411

Wu, Y., & Chen, L. (2017, July 24–26). Comparison of spectra processing methods for SERS based quantitative analysis. In Proceedings of the 2017 4th International Conference on Information, Cybernetics and Computational Social Systems (ICCSS) (pp. 130–136). IEEE. https://doi.org/10.1109/ICCSS.2017.8091399

Hu, H. B., Bai, J., Xia, G., Zhang, W. D., & Ma, Y. (2018). Baseline correction method for Raman spectra based on piecewise polynomial fitting. In J. Chu (Ed.), Fifth conference on frontiers in optical imaging technology and applications (FOI 2018) (Proceedings of SPIE, Vol. 10832, Paper 108321D). SPIE. https://doi.org/10.1117/12.2511445
The copyright holder for this preprintthis version posted March 24, 2026. ; https://doi.org/10.64898/2026.03.20.713189doi: bioRxiv preprint

