AbiOmics: An End-to-End Pipeline to Train Machine Learning Models for Discrimination of Plant Abiotic Stresses Using Transcriptomic Profiling Data

Abstract

Abiotic stresses are primary constraints on global crop productivity, reducing yields by up to 80%. While traditional phenotypic sensing detects stress only after physiological symptoms emerge and often fails to discriminate specific stressor types, transcriptomic profiling offers a high-dimensional solution, capturing rapid and sensitive molecular shifts. In this study, we developed AbiOmics, the first end-to-end machine learning pipeline specifically designed to identify and discriminate among multiple stressors. This approach represents a previously undocumented method for stress specification using large-scale transcriptomic big data. We identified 320 stress-specific marker genes using a curated collection of 1,243 transcriptomes of Arabidopsis samples treated with four major abiotic stresses: salt, cold, heat, and drought. A single-layer perceptron model trained on these features achieved 91% accuracy during five-fold cross-validation and 93% accuracy on an independent test set. The model demonstrated an unprecedented capacity to generalize to multi-stress conditions, identifying concurrent signatures in combinatorial salt-and-heat treatments. By integrating marker identification with SHAP-based biological interpretation, AbiOmics provides a rigorously validated diagnostic tool superior to conventional sensing. This framework establishes a high-confidence labeling strategy for AI-driven crop management and precision breeding to mitigate climate change impacts.

Graphical Abstract

Keywords

Abiotic stress, Transcriptomic profiling, Machine learning, Stress discrimination, Arabidopsis

bioRxiv preprint doi: https://doi.org/10.64898/2026.02.25.707868; this version posted February 27, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Introduction

As sessile organisms, plants are inextricably linked to their environment, and adverse conditions during growth and development can cause severe tissue damage or mortality. Even under moderate suboptimal conditions, plants initiate sophisticated stress-signaling cascades that prioritize survival over productivity (1). This metabolic shift triggers fundamental changes in growth (2), organ development (3), senescence (4), and reproduction (5). For instance, plants employ reciprocal regulation through antagonistic interactions between intracellular regulators to balance stress responses with growth (6,7). This inherent trade-off significantly constrains biomass accumulation in unfavorable environments (8), with research indicating that adverse conditions reduce average crop yields by 50%, and up to 80% in extreme cases (9-11). Consequently, the precise identification of specific abiotic stressors is critical for elucidating adaptation mechanisms and developing tailored cultivation strategies to mitigate productivity losses.

Traditional stress diagnosis relying on visible phenotypic assessments is often inadequate, as discernible damage typically appears only after physiological decline is advanced. Moreover, distinct stressors often converge on similar phenotypes, complicating the identification of specific causal factors (12). To address these limitations, advanced imaging and sensing technologies have been deployed for early-stage diagnosis. Established methods leverage interpretable biological mechanisms, such as chlorophyll fluorescence imaging (CFI) for monitoring photosynthetic efficiency, thermal infrared (TIR) imaging for detecting stomatal closure under drought or heat, red-edge shift analysis for quantifying chlorophyll content, and LIDAR for evaluating structural architectural changes (13-17).
While these techniques enable the detection of subtle physiological shifts, they are generally optimized for single stressors and lack the capacity to discriminate among multiple stressors simultaneously.

In contrast, hyperspectral imaging (HSI) captures reflectance across hundreds of narrow wavelength bands, facilitating the development of multi-stress diagnostic models through machine learning (18). However, while HSI is highly sensitive, its reproducibility is often compromised by environmental interference during measurement and the inherent complexity of data interpretation (12,18). Similarly, wearable electrochemical sensors have emerged as tools for real-time monitoring of tissue impedance or volatile organic compounds (VOCs) (19), yet they remain constrained by environmental noise and limited long-term stability. Despite these technological advancements, a significant bottleneck persists: most current methods can detect the presence of stress but fail to pinpoint the specific stressor type, making precise agricultural management difficult (20).

Transcriptomic profiling offers a promising solution to this challenge, as the transcriptome is intimately linked to stress-response mechanisms and undergoes rapid, sensitive shifts upon stress onset (12,21). Unlike phenotypic or spectral data, transcriptomes provide high-dimensional insights into the expression levels of every gene in the genome, potentially allowing for the discrimination of specific stressor types (22).
While gene expression analysis has been used extensively to characterize stress-response pathways and identify tolerance factors (23,24), its application in stressor-specific diagnosis remains underdeveloped. Early efforts, such as a mini-scale microarray of 12 expressed sequence tags (ESTs) for identifying drought, salinity, and temperature stress, lacked rigorous validation and broad applicability (25).

Recently, the integration of machine learning with large-scale transcriptomic metadata has opened new avenues for stress diagnosis. Studies have analyzed hundreds of transcriptomes across various species, including Arabidopsis and barley, to identify core regulators of abiotic and biotic stress responses (26). Furthermore, machine learning models have successfully predicted disease severity in plant-pathogen interactions across diverse datasets (27). Despite these advancements, the use of transcriptomic big data to specify and distinguish between multiple abiotic stressors has not yet been reported.

In this study, we aimed to develop a robust machine learning model to identify specific abiotic stressors using transcriptomic metadata. To ensure broad applicability, we developed an end-to-end pipeline to train machine learning models for stress discrimination. We curated a comprehensive dataset of Arabidopsis leaf transcriptomes from public databases, covering key stressors: cold, heat, salt, and drought. Using these expression profiles, we trained and validated a diagnostic model. Subsequently, we evaluated its accuracy and generalizability using independent subsets of samples exposed to both single and combinatorial stress treatments.

Materials and methods

Collection and processing of RNA-seq data

Transcriptomic datasets of Arabidopsis species subjected to abiotic stress were retrieved from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Searches combined Arabidopsis with four abiotic stress terms: cold, heat, salt, and drought. Search results were filtered to include only datasets generated on Illumina sequencing platforms. To ensure data integrity, SRA accession numbers were manually curated by cross-referencing them with original published studies (see Results for detailed selection criteria). Raw SRA files were downloaded and converted to FASTQ format using the SRA toolkit. Because the retrieved datasets contained a mixture of single-end and paired-end libraries, all data were standardized to single-end format to ensure downstream compatibility; for paired-end libraries, only forward reads (R1) were retained.

Raw RNA-seq reads were quality-trimmed using AdapterRemoval (v2.3.4) with default parameters (28). Transcript quantification was performed using the Arabidopsis thaliana TAIR10 coding sequence (CDS) annotation as the reference (29). Read alignment and transcript abundance estimation (transcripts per million; TPM) were performed using the RSEM pipeline (30), with Bowtie2 for read mapping (31).

Differential expression analysis, GO term enrichment, and marker gene selection

Differential expression analysis was performed on TPM values using the PyDESeq2 Python package. For each stress condition, 120 stress-treated samples were compared against 120 matched controls. Differentially expressed genes (DEGs) were identified using a threshold of |log2 fold change| (log2FC) ≥ 1 and an adjusted P-value ≤ 0.001. To minimize noise, transcripts with a maximum TPM < 20 across all samples were excluded.
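The DEG thresholds described above can be illustrated with a minimal sketch. It assumes that per-gene log2 fold changes, adjusted P-values, and maximum TPM values have already been computed upstream (for example by PyDESeq2); the `GeneResult` record, the `call_degs` helper, and the toy gene names are hypothetical and are not part of the AbiOmics code.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """One row of a differential expression table (hypothetical layout)."""
    gene_id: str
    log2fc: float   # log2 fold change, stress vs. control
    padj: float     # adjusted P-value
    max_tpm: float  # maximum TPM across all samples


def call_degs(results, lfc_min=1.0, padj_max=0.001, tpm_min=20.0):
    """Split genes into up- and down-regulated DEGs using the stated cutoffs."""
    up, down = [], []
    for r in results:
        if r.max_tpm < tpm_min:
            continue  # low-expression transcript, excluded as noise
        if r.padj > padj_max or abs(r.log2fc) < lfc_min:
            continue  # not significant or effect too small
        (up if r.log2fc > 0 else down).append(r.gene_id)
    return up, down


# Toy values invented for illustration only.
toy = [
    GeneResult("geneA", 2.3, 1e-6, 150.0),   # passes, up-regulated
    GeneResult("geneB", -1.4, 5e-4, 40.0),   # passes, down-regulated
    GeneResult("geneC", 3.0, 1e-8, 5.0),     # fails the TPM filter
    GeneResult("geneD", 0.6, 1e-9, 300.0),   # fails the fold-change cutoff
    GeneResult("geneE", -2.0, 0.01, 80.0),   # fails the adjusted P-value cutoff
]
up_genes, down_genes = call_degs(toy)
print(up_genes, down_genes)
```

Keeping the three cutoffs as keyword arguments makes it easy to check how sensitive the DEG lists are to each threshold.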
Gene Ontology (GO) enrichment analysis was conducted using ShinyGO (v0.85.1) (32) with the Biological Process database. Stress-specific up- and down-regulated DEGs were analyzed, and significant GO terms were identified using a False Discovery Rate (FDR) < 0.05. DEGs were categorized into up- and down-regulated groups, and Venn diagram analysis was employed to identify stress-specific DEGs for each of the four abiotic conditions.

To construct the diagnostic model, we established a consolidated set of 320 marker genes. This set was generated by randomly selecting 40 DEGs from each of the four up-regulated and four down-regulated stress-specific groups. Random selection was intentionally chosen over ranking-based approaches (e.g., by fold change or variance) to avoid overfitting to any single metric and to promote generalizability. The validity and robustness of this approach were subsequently confirmed by repeated sampling across 300 iterations (see below). The expression patterns of these 320 marker genes were visualized via heatmaps generated with the seaborn Python library.

To evaluate the robustness of marker selection, we assessed model performance stability across marker gene set sizes of 40, 80, 160, and 320. For each size, 300 random gene sets were generated, and performance was evaluated via 5-fold cross-validation and an independent test dataset. The variability in accuracy resulting from random selection is presented as distributions in violin plots. Furthermore, the concordance between cross-validation accuracy and independent test performance was analyzed using the Pearson correlation coefficient.
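The marker selection step can be sketched as set arithmetic followed by seeded random sampling. This is an illustration under stated assumptions, not the authors' implementation: it assumes that "stress-specific" means a DEG appears in exactly one stress within the same direction (up or down), and the helper names, the seed, and the toy gene identifiers are all hypothetical.

```python
import random

STRESSES = ["salt", "cold", "heat", "drought"]


def stress_specific(deg_sets):
    """Keep, for each stress, the DEGs found in no other stress (Venn 'unique' region)."""
    specific = {}
    for stress, genes in deg_sets.items():
        others = set()
        for other, other_genes in deg_sets.items():
            if other != stress:
                others |= set(other_genes)
        specific[stress] = sorted(set(genes) - others)
    return specific


def sample_markers(specific_up, specific_down, per_group=40, seed=0):
    """Randomly draw per_group genes from each of the eight stress-specific groups."""
    rng = random.Random(seed)  # fixed seed so one draw can be reproduced
    markers = []
    for groups in (specific_up, specific_down):
        for stress in STRESSES:
            markers.extend(rng.sample(groups[stress], per_group))
    return markers


def toy_degs(direction):
    """Invented DEG lists: 50 unique genes per stress plus one shared gene."""
    return {
        s: [f"{s}_{direction}_{i}" for i in range(50)] + [f"shared_{direction}"]
        for s in STRESSES
    }


specific_up = stress_specific(toy_degs("up"))
specific_down = stress_specific(toy_degs("down"))
markers = sample_markers(specific_up, specific_down)
print(len(markers))
```

Calling `sample_markers` with `per_group` set to 5, 10, 20, or 40 and with 300 different seeds would reproduce the shape of the robustness experiment (set sizes of 40, 80, 160, and 320), although the model training for each draw is outside this sketch.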
Dimensionality reduction analyses

Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) were performed using the scikit-learn Python library. Of the 27,416 genes detected across all samples, genes with TPM = 0 in all samples were removed, leaving 25,576 genes for dimensionality reduction analyses. For DEG-focused analyses, 6,670 non-redundant DEGs and the 320 selected marker genes were used. TPM values were log2-transformed and scaled using min–max normalization. A total of 1,243 curated RNA-seq samples (512 control, 148 cold, 133 salt, 266 heat, and 184 drought) were used for the analyses. PCA and t-SNE results were visualized using the seaborn Python library.

Model training and evaluation

The model was trained using PyTorch. To prevent data leakage, the 65 independent test samples (13 each from the control group and the four stress-treated groups) were fully excluded prior to all upstream processing steps, including DEG analysis and marker gene selection. The remaining 600 training samples (120 each from the control group and the four stress-treated groups) were then split into five cross-validation folds using scikit-learn. Log2-transformed and min–max-scaled TPM values of the marker genes were used as model inputs. Training was performed with the following hyperparameters: learning rate = 0.005, batch size = 84, and optimizer = Nesterov-accelerated Adaptive Moment Estimation (NAdam). Optimal training epochs were determined using early stopping based on the minimum validation loss. Five models were trained, each corresponding to one cross-validation fold.
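The input transformation (log2 followed by min–max scaling) can be sketched in a few lines. Two details are assumptions, because the text does not state them: a pseudocount of 1 is added before the logarithm so that TPM = 0 is defined, and scaling is applied per gene across samples. The function name and the toy values are hypothetical.

```python
import math


def log2_minmax(tpm_by_gene, pseudocount=1.0):
    """Log2-transform TPM values, then rescale each gene to the range [0, 1]."""
    scaled = {}
    for gene, values in tpm_by_gene.items():
        logged = [math.log2(v + pseudocount) for v in values]
        lo, hi = min(logged), max(logged)
        span = hi - lo
        # A gene with identical values in every sample carries no signal; map it to 0.
        scaled[gene] = [(x - lo) / span if span else 0.0 for x in logged]
    return scaled


# Toy TPM values invented for illustration (three samples per gene).
toy_tpm = {"geneA": [0.0, 1.0, 3.0], "geneB": [7.0, 7.0, 7.0]}
features = log2_minmax(toy_tpm)
print(features)
```

In a cross-validation setting, a common precaution is to compute the per-gene minimum and maximum on the training fold only and reuse them for validation and test samples; whether the pipeline does this is not stated in the text.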
The importance of input marker genes was analyzed with SHapley Additive exPlanations (SHAP) (33). For each sample in the dataset, local SHAP values were calculated to quantify the impact of every marker gene on the individual prediction. Genes were ranked by their global importance scores, and the gene symbols and functions of the top 20 genes were retrieved using the mygene Python package (v3.2.2).

Model performance was evaluated using 5-fold cross-validation on the training data and an independent test set of 65 samples. For single-stress samples, class predictions were based on the highest sigmoid output across the five classes (four stresses and one control). Performance was assessed using precision, recall, F1 score, and accuracy. Cross-validation performance metrics were averaged across folds, and standard deviations were calculated. For double-stress samples, sigmoid outputs for each stress class were visualized, with circle size representing the magnitude of the predicted probability.
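The single-stress evaluation rule (take the class with the highest sigmoid output, then compute precision, recall, F1 score, and accuracy) can be sketched as follows. The class order, the helper names, and the toy outputs are assumptions made for illustration; they are not taken from the AbiOmics code or results.

```python
CLASSES = ["control", "salt", "cold", "heat", "drought"]  # assumed output order


def predict_single(sigmoid_outputs):
    """Return the class whose sigmoid output is highest."""
    best = max(range(len(CLASSES)), key=lambda i: sigmoid_outputs[i])
    return CLASSES[best]


def per_class_metrics(y_true, y_pred):
    """Compute (precision, recall, F1) for every class."""
    metrics = {}
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


# Toy sigmoid outputs invented for illustration; the second control is misread as salt.
y_true = ["control", "control", "salt", "cold", "heat", "drought"]
outputs = [
    [0.90, 0.10, 0.05, 0.02, 0.10],
    [0.40, 0.60, 0.10, 0.10, 0.20],
    [0.10, 0.80, 0.05, 0.10, 0.20],
    [0.05, 0.10, 0.95, 0.01, 0.05],
    [0.10, 0.20, 0.02, 0.85, 0.30],
    [0.20, 0.30, 0.05, 0.10, 0.70],
]
y_pred = [predict_single(o) for o in outputs]
metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(m[2] for m in metrics.values()) / len(CLASSES)
print(accuracy, macro_f1)
```

Because sigmoid outputs are independent per class and need not sum to one, taking the maximum is only a ranking rule; the raw values remain available for inspecting samples that may carry more than one stress signature.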

Results

Collection and curation of stress-treated RNA-seq samples

To develop a machine learning model for discrimination of plant stress types, we curated RNA-seq datasets from Arabidopsis species subjected to four abiotic stresses: salt, cold, heat, and drought. An initial search identified 773 samples for salt stress, 640 for cold, 1,103 for heat, and 942 for drought. Given the heterogeneity of the experimental conditions, we applied stringent filtering criteria to ensure consistency and relevance for stress classification: 1) for drought stress, only samples induced by water deprivation were retained, and those involving chemical inducers (e.g., salt) were excluded; 2) samples involving additional treatments (e.g., hormones, pathogens, herbicides) were excluded; 3) samples subjected to multiple simultaneous stresses were removed; 4) only samples derived from leaf-related tissues (leaf, seedling, and rosette) were included. We did not filter based on stress duration, intensity, genotype, replication, or mutation background. Filtering was performed manually through a literature review. This process yielded 133 salt, 148 cold, 266 heat, and 184 drought stress samples. Control samples corresponding to each stress condition were also identified, with 82–183 control samples per condition.

To balance the dataset and prevent overfitting, we standardized the class counts. For each stress type, 120 samples were randomly selected for training and 13 for independent testing. From the pool of control samples, 30 per stress type (totaling 120) were selected for training, and 13 (3 each for salt, cold, and heat; 4 for drought) for testing. In total, 600 samples (120 per class) were used for five-fold cross-validation, and 65 samples were reserved for independent testing.

Model training strategy

We employed a five-fold cross-validation strategy to ensure robust model training and evaluation (34).
The dataset was split into training (80%), validation (10%), and test (10%) subsets. Early stopping was implemented based on validation performance (35). Test data were used to evaluate the trained models. Normalized gene expression values (TPM) were used as input data (Figure 1). Given the limited sample size (480 training samples), we performed differential expression analysis to reduce the input feature size. Models were trained on the selected genes and evaluated using cross-validation and independent test data. At the final stage, we evaluated the contribution of marker genes to model performance using SHAP. All of these steps were assembled into an end-to-end pipeline so that the method can be applied to other plant species.

Figure 1. Schematic representation of the machine learning pipeline. The workflow includes data acquisition from public databases, quality trimming, transcript quantification, differential expression analysis for feature selection, and a five-fold cross-validation strategy for model training and evaluation.

Identification of stress-specific marker genes

To identify marker genes capable of distinguishing abiotic stress responses in plants, we performed differential expression analysis using DESeq2 (as implemented in PyDESeq2). By comparing 120 stress-treated samples against 120 common control samples, we identified a robust set of differentially expressed genes (DEGs) with high statistical significance. Specifically, we detected 1,017 (salt), 517 (cold), 917 (heat), and 1,703 (drought) upregulated DEGs, alongside 445, 954, 1,006, and 2,587 downregulated DEGs, respectively (Figure 2A).
To isolate stress-specific marker genes, we utilized Venn diagram analysis (Figure 2B), which revealed 581 (salt), 287 (cold), 632 (heat), and 1,076 (drought) unique upregulated DEGs. Similarly, unique downregulated DEGs were identified as 81 (salt), 319 (cold), 320 (heat), and 1,626 (drought). These unique DEGs were subsequently prioritized as candidate markers for stress classification.

Figure 2. Identification of stress-responsive differentially expressed genes (DEGs). (A) Volcano plots of up-regulated and down-regulated DEGs identified for each abiotic stress condition (salt, cold, heat, and drought) compared to controls. (B) Venn diagrams showing the overlap of DEGs across the four stress conditions, highlighting the unique stress-specific DEGs used for marker gene selection.

Gene Ontology (GO) enrichment analysis of stress-specific DEGs for biological processes showed that 'Response to stress' was the most abundant category across all four stress conditions (Figure 3). Similarly, 'Response to chemical' and 'Cellular response to stimulus' were consistently highly ranked. When analyzing specific stress types, unique enrichment patterns were identified. 'Cellular response to chemical stimulus' showed the highest fold enrichment for salt stress, while 'Response to abiotic stimulus' was dominant for cold stress. Notably, 'Heat acclimation' and 'Translation' exhibited the greatest fold enrichment for heat and drought stress, respectively. These results confirm that identifying marker genes from large-scale datasets using DEGs effectively captures biologically relevant stress responses, supporting their utility as diagnostic markers.

Figure 3. Gene Ontology (GO) enrichment analysis of stress-responsive genes. Top enriched Biological Process GO terms for the DEGs identified under salt, cold, heat, and drought stress. Significance was determined using a False Discovery Rate (FDR) threshold of 0.05.

Dimensionality reduction and feature selection

Although initial DEG filtering reduced the feature space, the remaining 4,922 stress-specific DEGs represented an impractically large set for a diagnostic panel. To construct a concise and balanced marker set, we randomly down-sampled the pool to 40 up-regulated and 40 down-regulated genes per stress type, yielding a final set of 320 markers (Figure 4A). To evaluate the discriminatory power of this subset, we compared clustering performance using PCA and t-SNE across three input levels: all genes, total DEGs, and the 320 selected markers (Figure 4B). PCA failed to distinctly separate stress conditions. While t-SNE improved clustering, significant overlap persisted. These results indicated that unsupervised dimensionality reduction was insufficient for precise classification, requiring a supervised machine learning approach.

Figure 4. Feature selection and dimensionality reduction of transcriptomic data. (A) Heatmap visualization of the 320 selected marker genes (40 upregulated and 40 downregulated genes per stressor) across all curated samples. (B) Comparison of sample clustering using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) based on all detected genes, all non-redundant DEGs, and the final 320 marker genes.

Model architecture and performance evaluation

To develop a machine learning model, we tested a multilayer perceptron (MLP) with varying numbers of hidden layers. We tested various hyperparameter combinations of the MLP and found that a single fully connected hidden layer (i.e., a one-hidden-layer MLP, hereafter referred to as a single-layer perceptron) yielded optimal performance (Figure 5A). To accommodate potential multi-stress conditions, the architecture used sigmoid activation at the output layer to enable multi-label classification. However, as model development was restricted to single-stress-treated samples, the class with the highest sigmoid probability was assigned as the predicted label.

Figure 5. Model architecture and classification performance. (A) Architecture of the single-layer perceptron model used for stress classification. (B) Confusion matrix and performance metrics (precision, recall, and F1-score) obtained from 5-fold cross-validation on the training set. (C) Performance evaluation on the independent test set of 65 samples, demonstrating high accuracy and model generalizability.

Five-fold cross-validation demonstrated robust performance (Figure 5B), with a macro-average F1-score of 0.90 ± 0.04 and an overall accuracy of 0.91 ± 0.03. Among the classes, Cold stress achieved the highest F1-score (0.98 ± 0.02), reflecting distinct transcriptomic signatures. Conversely, Control samples exhibited the lowest F1-score (0.80 ± 0.16). The discrepancy between high precision (0.85 ± 0.12) and lower recall (0.75 ± 0.13) suggests occasional misclassification of control samples as stressed. This may reflect the inherent heterogeneity of control transcriptomes, which were pooled from experiments conducted under different baseline conditions across multiple studies. To rule out data leakage from DEG selection, we evaluated the model on 65 independent test samples that were excluded from the feature selection process (Figure 5C). If data leakage had occurred, performance on the independent test samples would be expected to be significantly lower than in five-fold cross-validation. However, the resulting accuracy and F1-score were both 0.93, confirming the model's generalizability.

Validation of the marker gene selection method and analysis of key marker genes

To determine the optimal number of features and the validity of random selection, we evaluated model performance across varying feature set sizes using 300 iterations of randomly selected marker genes. We tested total set sizes of 40, 80, 160, and 320 genes, composed of 5, 10, 20, and 40 up- and down-regulated DEGs per stress condition, respectively (Figure 6A). Average accuracy in five-fold cross-validation improved from 0.83 (40 genes) to 0.87 (80 genes) and 0.90 (160 genes), saturating at 0.91 with 320 genes.
This confirms that the 320-gene set used in our final model is sufficient for optimal performance.

We further assessed whether selecting the specific gene subsets that yielded the highest cross-validation accuracy would outperform random selection. While the "best" subsets achieved accuracies of up to 0.94, a Pearson correlation test between cross-validation scores and performance on 65 independent test samples revealed a weak correlation (r < 0.279) across all set sizes. This lack of correlation indicates that optimizing for specific gene subsets in cross-validation does not guarantee generalizability to independent data, thereby supporting the robustness of our random selection approach.

To evaluate the contribution of individual marker genes to the classification of the four plant abiotic stresses and one control, we performed a global feature importance analysis using SHAP. Most marker genes had SHAP values that were less than half that of the top-performing gene (Figure 6B). The top 20 genes with the highest mean absolute SHAP values were identified, and their functions were investigated (Figure 6C and 6D, Supplementary Figures S1–S4). The most important gene for discriminating salt stress was the RPM1-interacting protein 4 (RIN4) family protein (Figure 6D). Similarly, UDP-Glycosyltransferase superfamily protein and xyloglucan endotransglucosylase/hydrolase 13 were the most important genes for discriminating cold and heat stress, respectively (Supplementary Figures S1 and S2).
In drought and control conditions, the most important gene was lipid transfer protein 4, indicating that the gene's up- and down-regulation can distinguish the two conditions (Supplementary Figures S3 and S4).

Figure 6. Assessment of model performance stability across varying marker gene set sizes and marker gene importance with SHAP analysis. (A) Accuracy distributions for 300 iterations of random feature selection are shown for 40, 80, 160, and 320 marker genes. Dots represent individual 5-fold cross-validation results. Summary statistics include maximum (green), mean (red), and standard deviation. Pearson correlation coefficients (P-coeff) and P-values indicating the concordance between cross-validation and independent test-set accuracy are provided for each condition. (B) Distribution of mean absolute SHAP values. (C) SHAP values of the top 20 marker genes. (D) Mean absolute SHAP values and function of the top 20 marker genes.

Evaluation of multi-stress samples

To assess model performance under complex conditions, we evaluated samples subjected to combined stress treatments. We analyzed eight samples treated with combined Salt and Heat stress (36-38) and three treated with Heat and Drought stress (39). The model identified both stress signatures in the Salt+Heat samples. However, in the Heat+Drought samples, only the drought signature was detected (Figure 7).
As detailed in the Discussion section, this partial detection is likely due to the intensity of the heat treatment used in that dataset. Although formal statistical validation was precluded by the very small sample sizes (n = 8 and n = 3, respectively), the dual classification observed in the Salt+Heat group provides preliminary evidence that models trained on single-stress data may retain the ability to generalize to multi-stress conditions when stress intensities are sufficient.

Figure 7. Model performance on multi-stress combinatorial samples. Visualization of sigmoid output probabilities for samples subjected to simultaneous stressors (e.g., salt + heat and heat + drought). Circle size represents the magnitude of the predicted probability for each stress class. The model successfully identified both stressors in salt-heat combinations but prioritized drought signatures in heat-drought samples.
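One way to turn per-class sigmoid outputs into a list of detected stresses is to apply a cutoff to each class independently. The sketch below is illustrative only: the text reports that sigmoid outputs were visualized and does not state a decision threshold, so the cutoff of 0.5 is an assumption, and the probabilities are invented values that merely mimic the qualitative pattern described for Figure 7.

```python
STRESSES = ["salt", "cold", "heat", "drought"]


def detected_stresses(probabilities, threshold=0.5):
    """List every stress whose sigmoid output reaches the (assumed) threshold."""
    return [s for s in STRESSES if probabilities[s] >= threshold]


# Invented probabilities for illustration; these are not values from the study.
salt_heat_like = {"salt": 0.91, "cold": 0.03, "heat": 0.78, "drought": 0.10}
heat_drought_like = {"salt": 0.05, "cold": 0.02, "heat": 0.20, "drought": 0.88}

print(detected_stresses(salt_heat_like))     # both signatures pass the cutoff
print(detected_stresses(heat_drought_like))  # only drought passes the cutoff
```

Because each class is scored independently, a mild stress can fall below the cutoff while a stronger co-occurring stress is still reported, which is consistent with the partial detection described for the Heat+Drought samples.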

Discussion

Our results demonstrate that plant abiotic stress can be accurately classified using a machine learning model trained on transcriptomic profiling data. To our knowledge, this is the first study to report a method capable of distinguishing among multiple types of abiotic stress in plants. Transcriptomic data provide a powerful resource for early detection, as changes in gene expression occur well before visible physiological symptoms appear (40,41). For example, Kawasaki et al. (2001) reported detectable transcriptomic alterations in rice as early as 15 min after salt exposure (21). These findings support the utility of transcriptome-based approaches for the sensitive and early identification of plant stress responses.

While abiotic stress is a major constraint on crop productivity (42), existing detection methods, such as thermal infrared and visible-spectrum imaging, have limitations. Visible imaging detects stress only after phenotypic symptoms emerge, and while thermal imaging can identify early physiological changes, neither approach can definitively identify the specific underlying cause (43,44). Our method addresses this limitation by enabling early detection while simultaneously pinpointing the specific stressor, offering a more actionable strategy to mitigate yield loss.

We used Arabidopsis because extensive RNA-seq datasets representing diverse stress treatments are publicly available. These datasets, however, were generated for heterogeneous purposes and were not optimized for stress classification. To curate clear training examples, we excluded samples exposed to confounding stressors, such as pathogens or herbicides, and limited the dataset to leaf-containing samples. This allowed us to assemble a set of single-stress, leaf-derived transcriptomes for model training.
We did not filter samples by species, ecotype, or mutant background, assuming that stress-responsive transcriptional patterns would be broadly conserved. Under this assumption, gene expression differences attributable to genotype variation are expected to receive low weights during model training and thus to have minimal influence on classification performance. One notable limitation of the current dataset is the heterogeneity of control samples, which were pooled from multiple independent experiments conducted under varying baseline conditions. This likely contributes to the comparatively lower F1-score observed for the control class, as the model must generalize across a wide range of “unstressed” transcriptional states. Future datasets with standardized control conditions would be expected to improve classification performance for this class.

A key challenge in realistic agricultural settings is the occurrence of combined stresses. Our model detected both stressors in samples treated with Salt and Heat. However, in samples treated with combined Heat and Drought, the model detected only the drought signal. We attribute this discrepancy to inconsistencies in the definition of "stress" across public datasets. The experimental metadata show that the Salt+Heat samples were treated at clearly stress-inducing temperatures: 35°C (SRR11214537–38) (36), 33°C (SRR11468708–10) (37), or 43°C (SRR2302917–19) (38). In contrast, the Heat+Drought samples (SRR23615393–95) were exposed to only 27°C (39).
As the model was trained on samples typically treated at ≥ 33°C, it likely classified the 27°C condition as non-stress (normal) with respect to the heat signature. This observation highlights that multi-stress classification requires training data whose stress thresholds align with the definitions of stress in the target environment.

Although transcriptomic profiling provides high-resolution information about plant stress responses, it remains costly and time-consuming, limiting its use in cases requiring immediate decision-making. Nevertheless, we propose two practical applications. First, transcriptome-based stress classification can serve as a high-confidence labeling strategy for training other machine learning models that rely on cultivation metadata or imaging data. As machine learning approaches for plant stress detection continue to advance (19,45,46), the accuracy of training labels remains a critical determinant of model performance. Our method provides evidence-based, biologically grounded labels that can enhance downstream model reliability. Second, transcriptome-based classification can support precision breeding by distinguishing stress-resistant from stress-tolerant genotypes. As climate change intensifies, stress-tolerant cultivars may maintain survival but suffer yield penalties, making it essential to identify truly stress-resistant lines. Our approach provides a quantitative framework for phenotyping breeding populations using molecular stress signatures.

In summary, this study presents the first machine learning model capable of classifying multiple abiotic stress types in plants, achieving 91% accuracy in five-fold cross-validation and 93% on an independent test set, and demonstrating the potential to identify combined stress conditions. To support broader application of the method, we also developed an end-to-end pipeline to train models in various plant species.
Future work should focus on generating high-quality training datasets from crop species and developing stress-induction protocols that reflect realistic agricultural environments. Defining crop-specific stress thresholds and designing experiments optimized for stress classification will further enhance model performance. Ultimately, this approach provides a foundation for AI-driven decision-support systems in crop management and precision breeding, offering a timely tool for addressing the challenges posed by global climate change.
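The threshold mismatch discussed above for the Heat+Drought samples can be made concrete with a simple metadata check run before interpreting multi-stress predictions. The sketch below uses the run accession ranges and heat-treatment temperatures quoted in this Discussion. The 33°C lower bound reflects the statement that training samples were typically treated at ≥ 33°C; treating it as a hard cutoff is an assumption, and the function and variable names are hypothetical rather than part of AbiOmics.

```python
# Heat-treatment temperatures (°C) for the combined-stress runs, as quoted
# in the Discussion. Accession ranges are kept as written in the text.
COMBINED_STRESS_RUNS = {
    "SRR11214537-38": ("salt+heat", 35.0),
    "SRR11468708-10": ("salt+heat", 33.0),
    "SRR2302917-19": ("salt+heat", 43.0),
    "SRR23615393-95": ("heat+drought", 27.0),
}

# Assumed lower bound of heat treatments seen in training ("typically >= 33 °C").
TRAINING_HEAT_MIN_C = 33.0


def heat_within_training_range(temp_c, minimum_c=TRAINING_HEAT_MIN_C):
    """True if a heat treatment is at least as severe as the training minimum."""
    return temp_c >= minimum_c


def flag_out_of_range(runs, minimum_c=TRAINING_HEAT_MIN_C):
    """List runs whose heat treatment falls below the training range.

    For such runs a missing heat call is hard to interpret: the model may be
    behaving consistently with the heat definition it was trained on.
    """
    return [
        run
        for run, (_treatment, temp_c) in runs.items()
        if not heat_within_training_range(temp_c, minimum_c)
    ]


print(flag_out_of_range(COMBINED_STRESS_RUNS))  # ['SRR23615393-95']
```

A check of this kind does not validate the model; it only separates samples on which a negative call is informative from those whose treatment lies outside the conditions represented in training.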

Acknowledgements

We thank all researchers who have deposited transcriptome data of abiotic stress-treated Arabidopsis plants in the public database.

Supplementary data

Supplementary Figure S1. SHAP analysis of marker genes for cold stress.
Supplementary Figure S2. SHAP analysis of marker genes for heat stress.
Supplementary Figure S3. SHAP analysis of marker genes for drought stress.
Supplementary Figure S4. SHAP analysis of marker genes for control.

Conflict of interest

The authors declare no competing interests.

References

1. Zhang H, Zhao Y, Zhu JK. Thriving under Stress: How Plants Balance Growth and the Stress Response. Dev Cell. 2020;55:529–43. 10.1016/j.devcel.2020.10.012
2. Bechtold U, Field B. Molecular mechanisms controlling plant growth during abiotic stress. J Exp Bot. 2018;69:2753–58. 10.1093/jxb/ery157
3. Meng Y, Zhu P, Gou C et al. Auxin and Ethylene Play Important Roles in Parthenocarpy Under Low-Temperature Stress Revealed by Transcriptome Analysis in Cucumber (Cucumis sativus L.). J Plant Growth Regul. 2024;43:1137–52. 10.1007/s00344-023-11172-z
4. Asad MAU, Yan Z, Zhou L et al. How abiotic stresses trigger sugar signaling to modulate leaf senescence? Plant Physiol Biochem. 2024;210:108650. 10.1016/j.plaphy.2024.108650
5. Zinta G, Khan A, AbdElgawad H et al. Unveiling the Redox Control of Plant Reproductive Development during Abiotic Stress. Front Plant Sci. 2016;7. 10.3389/fpls.2016.00700
6. Kasuga M, Liu Q, Miura S et al. Improving plant drought, salt, and freezing tolerance by gene transfer of a single stress-inducible transcription factor. Nat Biotechnol. 1999;17:287–91. 10.1038/7036
7. Wang P, Zhao Y, Li Z et al. Reciprocal Regulation of the TOR Kinase and ABA Receptor Balances Plant Growth and Stress Response. Mol Cell. 2018;69:100–12.e106. 10.1016/j.molcel.2017.12.002
8. Skirycz A, Vandenbroucke K, Clauw P et al. Survival and growth of Arabidopsis plants given limited water are not equal. Nat Biotechnol. 2011;29:212–14. 10.1038/nbt.1800
9. Kopecká R, Kameniarová M, Černý M et al. Abiotic Stress in Crop Production. Int J Mol Sci. 2023. 10.3390/ijms24076603
10. Vij S, Tyagi AK. Emerging trends in the functional genomics of the abiotic stress response in crop plants. Plant Biotechnol J. 2007;5:361–80. 10.1111/j.1467-7652.2007.00239.x
11. Zurbriggen MD, Hajirezaei MR, Carrillo N. Engineering the future. Development of transgenic plants with enhanced tolerance to adverse environments. Biotechnol Genet Eng Rev. 2010;27:33–56. 10.1080/02648725.2010.10648144
12. Zandi A, Hosseinirad S, Kashani Zadeh H et al. A systematic review of multi-mode analytics for enhanced plant stress evaluation. Front Plant Sci. 2025;16. 10.3389/fpls.2025.1545025
13. Gitelson AA, Merzlyak MN. Signature Analysis of Leaf Reflectance Spectra: Algorithm Development for Remote Sensing of Chlorophyll. J Plant Physiol. 1996;148:494–500. 10.1016/S0176-1617(96)80284-7
14. Mulugeta Aneley G, Haas M, Köhl K. LIDAR-Based Phenotyping for Drought Response and Drought Tolerance in Potato. Potato Res. 2023;66:1225–56. 10.1007/s11540-022-09567-8
15. Jones HG, Serraj R, Loveys BR et al. Thermal infrared imaging of crop canopies for the remote diagnosis and quantification of plant responses to water stress in the field. Funct Plant Biol. 2009;36:978–9. 10.1071/FP09123
16. Kalaji HM, Jajoo A, Oukarroum A et al. Chlorophyll a fluorescence as a tool to monitor physiological status of plants under abiotic stress conditions. Acta Physiol Plant. 2016;38:102. 10.1007/s11738-016-2113-y
17. Zhou R, Hyldgaard B, Yu X et al. Phenotyping of faba beans (Vicia faba L.) under cold and heat stresses using chlorophyll fluorescence. Euphytica. 2018;214:68. 10.1007/s10681-018-2154-y
18. Zhang K, Yan F, Liu P. The application of hyperspectral imaging for wheat biotic and abiotic stress analysis: A review. Comput Electron Agric. 2024;221:109008. 10.1016/j.compag.2024.109008
19. Kim D, Zarei M, Lee S et al. Wearable Standalone Sensing Systems for Smart Agriculture. Adv Sci. 2025;12:2414748. 10.1002/advs.202414748
20. Houetohossou SCA, Houndji VR, Hounmenou CG et al. Deep learning methods for biotic and abiotic stresses detection and classification in fruits and vegetables: State of the art and perspectives. Artif Intell Agric. 2023;9:46–60. 10.1016/j.aiia.2023.08.001
21. Kawasaki S, Borchert C, Deyholos M et al. Gene Expression Profiles during the Initial Phase of Salt Stress in Rice. Plant Cell. 2001;13:889–905. 10.1105/tpc.13.4.889
22. Sanchez-Munoz R, Depaepe T, Samalova M et al. Machine-learning meta-analysis reveals ethylene as a central component of the molecular core in abiotic stress responses in Arabidopsis. Nat Commun. 2025;16:4778. 10.1038/s41467-025-59542-3
23. Kamali S, Singh A. Genomic and Transcriptomic Approaches to Developing Abiotic Stress-Resilient Crops. Agronomy. 2023. 10.3390/agronomy13122903
24. Rurek M, Smolibowski M. Variability of plant transcriptomic responses under stress acclimation: a review from high throughput studies. Acta Biochim Pol. 2024;71. 10.3389/abp.2024.13585
25. Tamaoki M, Matsuyama T, Nakajima N et al. A method for diagnosis of plant environmental stresses by gene expression profiling using a cDNA macroarray. Environ Pollut. 2004;131:137–45. 10.1016/j.envpol.2004.01.008
26. Panahi B. Transcriptome signature for multiple biotic and abiotic stress in barley (Hordeum vulgare L.) identifies using machine learning approach. Curr Plant Biol. 2024;40:100416. 10.1016/j.cpb.2024.100416
27. Sia J, Zhang W, Chen M et al. Machine learning-based identification of general transcriptional predictors for plant disease. New Phytol. 2025;245:785–806. 10.1111/nph.20264
28. Lindgreen S. AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes. 2012;5:337. 10.1186/1756-0500-5-337
29. Lamesch P, Berardini TZ, Li D et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–D1210. 10.1093/nar/gkr1090
30. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. 10.1186/1471-2105-12-323
31. Langdon WB. Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks. BioData Min. 2015;8:1. 10.1186/s13040-014-0034-0
32. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–9. 10.1093/bioinformatics/btz931
33. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.
34. Marcot BG, Hanea AM. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput Stat. 2021;36:2009–31. 10.1007/s00180-020-00999-9
35. Prechelt L. Early Stopping - But When? In: Montavon, G., Orr, G.B., Müller, KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. 2012;7700. 10.1007/978-3-642-35289-8_5
36. Sewelam N, Brilhaus D, Bräutigam A et al. Molecular plant responses to combined abiotic stresses put a spotlight on unknown and abundant genes. J Exp Bot. 2020;71:5098–112. 10.1093/jxb/eraa250
37. Zandalinas SI, Sengupta S, Fritschi FB et al. The impact of multifactorial stress combination on plant growth and survival. New Phytol. 2021;230:1034–48. 10.1111/nph.17232
38. Suzuki N, Bassil E, Hamilton JS et al. ABA Is Required for Plant Acclimation to a Combination of Salt and Heat Stress. PLoS One. 2016;11:e0147625. 10.1371/journal.pone.0147625
39. Garcia-Molina A, Pastor V. Systemic analysis of metabolome reconfiguration in Arabidopsis after abiotic stressors uncovers metabolites that modulate defense against pathogens. Plant Commun. 2024;5. 10.1016/j.xplc.2023.100645
40. Puig CP, Dagar A, Marti Ibanez C et al. Pre-symptomatic transcriptome changes during cold storage of chilling sensitive and resistant peach cultivars to elucidate chilling injury mechanisms. BMC Genomics. 2015;16:245. 10.1186/s12864-015-1395-6
41. Ueda A, Kathiresan A, Bennett J et al. Comparative transcriptome analyses of barley and rice under salt stress. Theor Appl Genet. 2006;112:1286–94. 10.1007/s00122-006-0231-4
42. Iqbal MS, Singh AK, Ansari MI. Effect of Drought Stress on Crop Production. In: Rakshit, A., Singh, H., Singh, A., Singh, U., Fraceto, L. (eds) New Frontiers in Stress Management for Durable Agriculture. Springer, Singapore. 2020. 10.1007/978-981-15-1322-0_3
43. Fevgas G, Lagkas T, Argyriou V et al. Detection of Biotic or Abiotic Stress in Vineyards Using Thermal and RGB Images Captured via IoT Sensors. IEEE Access. 2023;11:105902–15. 10.1109/ACCESS.2023.3320048
44. Carter GA, Miller RL. Early detection of plant stress by digital imaging within narrow stress-sensitive wavebands. Remote Sens Environ. 1994;50:295–302. 10.1016/0034-4257(94)90079-5
45. Gou C, Zafar S, Hasnain Z et al. Machine and Deep Learning: Artificial Intelligence Application in Biotic and Abiotic Stress Management in Plants. Front Biosci (Landmark Ed). 2024;29. 10.31083/j.fbl2901020
46. Chandel NS, Chakraborty SK, Rajwade YA et al. Identifying crop water stress using deep learning models. Neural Comput & Applic. 2021;33:5353–67. 10.1007/s00521-020-05325-4
