Abstract

Physiologically relevant drought stress is difficult to apply consistently, and the heterogeneity in
experimental design, growth conditions, and sampling schemes makes it challenging to compare
water deficit studies in plants. Here, we re-analyzed hundreds of drought gene expression
experiments across diverse model and crop species and quantified the variability across
studies. We found that drought studies are surprisingly incomparable, even when accounting
for differences in genotype, environment, drought severity, and method of drying. Many studies,
including most Arabidopsis work, lack high-quality phenotypic and physiological datasets to
accompany gene expression, making it impossible to assess the severity or in some cases the
occurrence of water deficit stress events. From these datasets, we developed supervised
learning classifiers that can accurately predict if RNA-seq samples have experienced a
physiologically relevant drought stress, and we suggest that these classifiers can be used as a
quality control step for future studies. Together, our analyses highlight the need for more
community standardization, and the importance of paired physiology data to quantify stress
severity for reproducibility and future data analyses.

Introduction

Drought, increasingly prevalent in both natural and agricultural landscapes, is escalating in
frequency and severity due to the changing climate. This trend has spurred the development of
an extensive and increasingly interdisciplinary research community focused on understanding
plant adaptation to water-limited environments (Osmolovskaya et al., 2018; Ekundayo et al.,
2022). Meteorologically, drought manifests as drier than normal conditions, but its physiological
impact on plants varies based on the duration, severity, and timing of the stress events,
alongside local soil and habitat conditions (Gupta et al., 2020; Tardieu, 2012). Mild, infrequent
drought events may result in only slight reductions in photosynthesis and growth, often without
significant impacts on biomass or yield. In contrast, recurrent or severe bouts of drought may
cause unrecoverable damage or even plant death (Farooq et al., 2009). Central to
understanding and engineering drought resilience is the ability to apply consistent,
physiologically relevant, and reproducible stress events across scales (Großkinsky et al., 2015).
Such standardization is necessary to develop a community framework that allows for
comparison and expansion of previous experiments (Juenger and Verslues, 2022).

bioRxiv preprint doi: https://doi.org/10.1101/2024.02.04.578814; this version posted February 6, 2024.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder.
All rights reserved. No reuse allowed without permission.

Water deficit responses likely evolved during terrestrialization, and they have been
continually refined, repurposed, and diversified to enable plants to colonize virtually every biome
(Bowles et al., 2021). Resistance to drought is an emergent phenotype involving the
synchronization of numerous physiological and genetic processes, and diverse lineages of
plants have evolved numerous adaptations to avoid, escape, and tolerate water deficits (Artur
and Kajala, 2021; Chaves and Oliveira, 2004; Turner, 1986). Different plant lineages,
populations, or even individual genotypes use combinations of these strategies to tolerate water
limitations (Verslues and Juenger, 2011; Basu et al., 2016; Farooq et al., 2009). The genetic
mechanisms underlying responses or tolerance to drought stress are highly complex and
involve the activation of hundreds to thousands of pathways that collectively enable resilience to
water deficit. Most drought-related pathways were discovered and characterized in the model
plant Arabidopsis, but core regulatory, biochemical, and physiological responses are broadly
conserved across green plants (Shinozaki and Yamaguchi-Shinozaki, 2007). The genetic basis
of adaptations to water deficit is an active and exciting area of plant science research, and
numerous important research gaps still remain (Verslues et al., 2023; Eckardt et al., 2023).

One promising approach to closing the knowledge gaps in understanding the genetic
basis of drought adaptation is using large-scale omics technologies. Numerous large-scale
datasets have been collected across diverse plant lineages to study the effects of drought
stress. Some studies have measured physiological responses in naturally water-limited
environments (Pardo et al., 2022; Danilevskaya et al., 2019; Groen et al., 2022), but most use
simulated drought events under controlled or semi-controlled conditions to induce water deficit
responses (Gonzalez et al., 2022). Simulated drought studies range in scale and severity from
large rainout shelters withholding water from thousands of plants in an ecological or agricultural
setting to agar plates containing solutes to lower water potential. Each of these approaches has
benefits and drawbacks related to cost, consistency, and accuracy of applying drought. For
example, using polyethylene glycol, mannitol, and salt to lower water potential may not actually
induce true drought-responsive pathways (Gonzalez et al., 2022), and restricting plants to small
pots in growth chambers or greenhouses can impact root growth and lead to physiologically
irrelevant and irreproducible drying (Granier et al., 2006). Individual labs utilize radically different
experimental approaches, growth conditions, and sampling schemes for drought assays, and
these added variables mask emergent properties of an already complex phenotype. A major
challenge for cross-species analysis is finding comparable biological datasets with similar
design, implementation, and sampling.

Here, to evaluate the comparability and reusability of drought gene expression data, we
compared public datasets across labs and experiments, and searched for patterns that
delineate drought and control conditions. We first focused on data from the model plant
Arabidopsis thaliana and then expanded our analyses to include five additional model and crop
species with the most published drought data. We found that drought gene expression data are
more variable than data from other abiotic stresses, and many studies lack basic physiological
data to assess the magnitude or even presence of water deficit stress. Our analyses highlight
the need for community standardization to enable the reuse and integration of datasets across
scales from different laboratories.

Results

Variability of gene expression datasets on drought in Arabidopsis

Plant responses to drought are a well-studied topic. There are ~34,000 articles in PubMed related
to drought stress in plants, and this wealth of knowledge has uncovered numerous phenotypes,
pathways, and genes underlying responses to water deficit (Gupta et al., 2020). Most of our
understanding of drought responses at the molecular genetic level is based on work in
Arabidopsis, including over 100 studies surveying genome-wide gene expression (RNAseq)
changes under water deficit across different accessions, environmental conditions, and mutant
backgrounds (Supplemental Table 1). Collectively, these datasets have been incorporated into
public gene expression atlases, co-expression networks, and other tools that are broadly used
by the plant science community to understand which genes and pathways underlie drought
responses (Lamesch et al., 2012; Klepikova et al., 2016). However, drought experiments vary
wildly in the degree, severity, and implementation of water deficit. Many experiments are
analyzed in isolation and arrive at independent conclusions. The choice of which drought
experiments to reference for future studies can drastically alter hypothesis generation and
inference of biological function. This raises a fundamental question: how comparable are
drought studies across different experiments?

To survey the variability in public drought data, we re-analyzed 109 water deficit RNAseq
experiments in Arabidopsis obtained from the Sequence Read Archive (SRA). We manually
curated metadata from 1,301 RNAseq samples across these 109 BioProjects. These datasets
include a range of genotypes and mutant backgrounds, developmental time points and tissues,
and differences in stress severity and duration across a range of natural or controlled drought
conditions (Figure 1; Supplemental Table 1). Based on the available metadata, 81% of studies
were conducted in growth chambers in standard potting media, 13% on agar plates, and 5% in
greenhouses. Half of the studies (51%) applied a natural dry down by stopping irrigation, 27%
had controlled drying to a set soil moisture content, 8% removed plants from media and let them
air dry, and 14% used PEG to lower water potential or ABA to simulate water deficit responses.
Surprisingly, 39% of Arabidopsis studies did not report paired physiology data measuring plant
stress, such as gas exchange, photosynthesis, leaf water potential, or leaf relative water
content. Next, we processed the raw reads through a common pipeline to remove variation
arising from the different algorithmic and statistical frameworks used in each individual study.
Raw Illumina RNAseq reads were quality trimmed and aligned to the TAIR10 gene models, and
raw or batch corrected expression values in transcripts per million (TPM) were used as a basis
for downstream analysis.
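
As a reminder of what the TPM values used above represent, the conversion from raw counts can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the function name, gene names, counts, and transcript lengths are all hypothetical, and in practice the effective transcript lengths reported by the quantification software would typically be used.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts for one sample to transcripts per million (TPM).

    counts:     dict mapping gene ID -> raw read count
    lengths_bp: dict mapping gene ID -> transcript length in base pairs
    """
    # Step 1: reads per kilobase (RPK) corrects for transcript length.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    scale = sum(rpk.values())
    if scale == 0:
        return {g: 0.0 for g in counts}
    # Step 2: rescale so that the values of one sample sum to one million.
    return {g: value / scale * 1e6 for g, value in rpk.items()}


# Hypothetical toy sample (not data from this study).
counts = {"geneA": 100, "geneB": 200, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}
tpm = counts_to_tpm(counts, lengths)
```

Because every sample sums to the same total, TPM values are comparable within a sample, but they do not by themselves remove differences between experiments.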

To identify any factors that clearly delineate samples within or across experiments, we
used dimensionality reduction with an expectation that samples should cluster by water stress
status. Principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding
(t-SNE) show no clear separation between drought-treated and control samples across
Arabidopsis drought experiments (Figure 2a; Supplemental Figure 1, 2). Within experiments,
some BioProjects show clear separation of drought and control samples/replicates, but a
surprising number have interspersed samples (Supplemental Figure 3). Across experiments,
samples were broadly separated by tissue type and BioProject. Root, seedlings, inflorescence,
and siliques form groups in a similar dimensional space, whereas clusters of leaf and whole
plant samples were more dispersed across PC1 and PC2 or dimensions 1 and 2 for PCA and
t-SNE, respectively (Figure 2a, Supplemental Figure 2). Samples within the same experiment or
BioProject tended to cluster together, suggesting that the experimental effects are responsible
for much of the separation we observed (Supplemental Figure 3). Individual accessions of
Arabidopsis show remarkable differences in expression dynamics under drought (Des Marais et
al., 2012), and we tested if genotype differences may explain the lack of correlation between
stressed samples across experiments. Similar to other experimental factors, there is no clear
separation by accession (Figure 2a). There is also no clear separation of samples by the
reported duration, severity, or type of water deficit (e.g., drying vs solute based), or technical
variables such as sequencing read length, technology, read chemistry, or publication year.

To test if the observed separation by experiment (BioProject) rather than stress-related
factors is caused by batch effects, we applied ComBat (Behdenna et al., 2023) to the
expression matrix. This method leverages an empirical Bayes framework to estimate and
subsequently adjust for batch effects. Post-ComBat adjustment, the first two principal
components of the expression values accounted for only 19% of the total variance
(Supplemental Figure 4). Even when batch effects associated with BioProject were addressed,
the samples did not show clear differentiation between stress and non-stress conditions, and
much of the variance that was removed relates to true biological differences rather than
technical artifacts. Together, this suggests drought experiments in Arabidopsis have extreme
variability and individual datasets are largely incomparable using traditional approaches.
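
The basic idea behind a batch adjustment can be illustrated with a deliberately simplified sketch. The code below is not ComBat: it only removes a per-batch mean shift for a single gene, whereas ComBat also adjusts the scale and shrinks the batch estimates with empirical Bayes. The function name, expression values, and BioProject labels are hypothetical.

```python
from collections import defaultdict


def center_by_batch(values, batches):
    """Remove per-batch mean shifts for one gene (simplified, not ComBat).

    values:  list of log-expression values, one per sample
    batches: list of batch labels (e.g. BioProject IDs), in the same order
    Returns values shifted so that every batch mean equals the grand mean.
    """
    grand_mean = sum(values) / len(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical log-expression values for one gene in two BioProjects.
expr = [5.0, 7.0, 9.0, 11.0]
proj = ["P1", "P1", "P2", "P2"]
adjusted = center_by_batch(expr, proj)
```

The sketch also shows why such corrections can remove biology: if a BioProject differs from the others in tissue, genotype, or stress severity, that difference is part of the batch mean and is subtracted along with any technical shift.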

We next sought to understand why the Arabidopsis drought RNAseq data appeared so
variable across experiments. We first hypothesized that since dimensionality reduction provides
a summary across all genes, individual factors such as drought stress may be confounded by
other experimental factors. Therefore, to see if confounding factors are masking clustering of
drought-stressed samples, we surveyed the expression pattern of nine drought marker genes
across all the samples. If confounding factors indeed masked the drought-response programs of
relevant genes, we would expect the drought marker genes to be consistently induced in
drought across all experiments. The nine drought marker genes included RD29A (Nordin et al.,
1991), RD22 (Yamaguchi-Shinozaki and Shinozaki, 1993), RAB18, DREB2A (Liu et al., 1998),
COR15A (Hincha et al., 2021), LEA4-5 (Bray, 2004), P5CS1 (Yoshiba et al., 1995), RD20, and
KIN2. Marker genes were generally expressed at higher levels in drought-stressed samples
compared to well-watered, but this is highly variable across our datasets (Figure 2b,
Supplemental Figure 5). For instance, the dehydrin RAB18 and the ABA-induced transcription
factor DREB2A were highly expressed under drought but generally not expressed under well-watered
conditions. However, roughly a third of samples labeled as ‘drought’ had no detectable
expression of these two genes (Figure 2b, Supplemental Figure 5). This pattern is consistent
across all nine drought marker genes, suggesting the hypothesis that confounding factors are
masking the drought response program is not supported. Instead, the data show that many
drought-treated samples lack a molecular signature of water deficit, and may not have
experienced a physiologically relevant drought stress.
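
A marker-gene check of this kind is simple to express in code. The sketch below computes the fraction of samples labeled as drought in which a marker gene is not detectably expressed. It is an illustration only: the function name, the sample records, and the 1 TPM detection threshold are assumptions and are not taken from this study.

```python
def fraction_undetected(samples, marker, min_tpm=1.0):
    """Fraction of drought-labeled samples with the marker below min_tpm.

    samples: list of dicts with keys 'condition' and 'tpm' (gene -> TPM)
    marker:  gene name to check, e.g. 'RAB18'
    min_tpm: detection threshold (hypothetical default of 1 TPM)
    """
    drought = [s for s in samples if s["condition"] == "drought"]
    if not drought:
        raise ValueError("no drought-labeled samples")
    # A gene missing from the table is treated as not expressed.
    missing = sum(1 for s in drought if s["tpm"].get(marker, 0.0) < min_tpm)
    return missing / len(drought)


# Hypothetical toy samples (not data from this study).
samples = [
    {"condition": "drought", "tpm": {"RAB18": 250.0}},
    {"condition": "drought", "tpm": {"RAB18": 0.2}},
    {"condition": "drought", "tpm": {"RAB18": 80.0}},
    {"condition": "control", "tpm": {"RAB18": 0.0}},
]
frac = fraction_undetected(samples, "RAB18")
```

Run before sequencing additional samples or before merging public datasets, a check like this could flag ‘drought’ samples that lack the expected molecular signature.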

We then wondered whether the observed inconsistencies in drought expression data
might simply reflect a wide range of responses to various drought regimes applied, given the
varying drought intensities and diverse growing conditions of different experiments. To
understand if a similar variability exists in response to other stresses, we re-analyzed publicly
available expression datasets related to heat stress in Arabidopsis. There are fewer published
heat expression datasets, and we curated 21 experiments where plants were subjected to
physiologically relevant heat stress in Arabidopsis between temperature ranges of 35-42 °C.
Metadata indicated that growth conditions were similarly variable as drought studies, and the
raw reads were processed as described above. Strikingly, when we performed dimensionality
reduction on log-transformed expression data, the samples were distinctly categorized into
either 'heat' or 'control' groups based on principal component 2, which explained 14% of the
variation (Figure 3a). Similar to the drought dataset, principal component 1, which explained
21% of the variation, differentiated the samples based on tissue type, grouping them as whole
seedlings or leaves. Unlike in the drought dataset, there was no distinct separation by
experiment (BioProject). We surveyed the expression patterns of four heat marker genes
(HSP70, HSP90, MBF1c, and DREB2A) to see if individual genes have a similarly clear pattern.
Heat marker genes have notably higher expression in all heat stressed samples compared to
control, and there is little overlap in the distribution of expression levels of these marker genes
between the two conditions (Figure 3b, 3c). Thus, public heat stress expression datasets
demonstrate a clear molecular signature of heat stress. This signature persists through any
experimental variation across studies, indicating that the inconsistencies noted in the drought
stress data are not simply artifacts of work conducted in different laboratories.

Developing a predictive model for classifying drought gene expression

Dimensionality reduction and clustering approaches were unable to delineate drought-stressed
from control samples, and we hypothesize this was driven by quality issues of the underlying
datasets. We previously developed a cross-species predictive model to accurately classify
drought stressed RNAseq data in maize and sorghum (Pardo et al., 2022), and sought to test if
this approach could differentiate among drought and control samples in Arabidopsis. We
developed Random Forest (RF) based predictive models to classify the Arabidopsis samples as
“drought” or “control” based on normalized gene expression values alone. We divided the
RNAseq samples into a training set with 75% of the experiments (BioProjects) and a testing set
with the remaining 25%. The overall accuracy of our predictive model was 66% (Supplemental
Table 2), which is substantially lower than the model developed with high-quality maize and
sorghum data (Pardo et al., 2022). The precision and recall for ‘drought’ samples were 0.79 and
0.51, respectively, and 0.61 and 0.84 for ‘control’. We tested four other classifiers including Linear
Support Vector Classifier (SVC), Simple Neural Network (MLP), Histogram-based Gradient
Boosting Classifier (HGB), and K-Nearest Neighbor Classifier (KNN) to see if this improved
predictive accuracy. HGB had similar performance to RF (overall accuracy 65%) but SVC, KNN,
and MLP performed significantly worse with 52%, 52%, and 56% overall accuracy, respectively
(Supplemental Table 2). The relatively low predictive accuracy in Arabidopsis was initially
surprising, as our model was trained with significantly more data and tested within a single
species with less genetic diversity than either maize or sorghum. Further, Arabidopsis
experiments are generally conducted within a narrower set of conditions compared to the
diverse growth chamber, greenhouse and field environments for maize and sorghum. We
suspect that the reduced predictive accuracy in Arabidopsis might be due to the inclusion of
datasets with plants that are not physiologically stressed, thereby diminishing the effectiveness
of our models.
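
Two steps in this evaluation are easy to get wrong and can be sketched briefly: splitting by experiment rather than by sample, so that no BioProject contributes to both training and testing, and computing per-class precision and recall. The code below is a minimal sketch under stated assumptions and not the implementation used in this study; the function names, the random seed, and the toy labels are hypothetical, and the classifier itself is omitted.

```python
import random


def split_by_bioproject(sample_groups, test_fraction=0.25, seed=0):
    """Assign whole BioProjects to train or test so no project spans both.

    sample_groups: list of BioProject IDs, one per sample
    Returns (train_indices, test_indices) into the sample list.
    """
    projects = sorted(set(sample_groups))
    rng = random.Random(seed)
    rng.shuffle(projects)
    # Hold out at least one project, otherwise the test set would be empty.
    n_test = max(1, round(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train_idx = [i for i, g in enumerate(sample_groups) if g not in test_projects]
    test_idx = [i for i, g in enumerate(sample_groups) if g in test_projects]
    return train_idx, test_idx


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: four BioProjects with two samples each.
groups = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
train_idx, test_idx = split_by_bioproject(groups)

# Hypothetical true and predicted labels for five test samples.
y_true = ["drought", "drought", "drought", "control", "control"]
y_pred = ["drought", "control", "control", "control", "drought"]
precision, recall = precision_recall(y_true, y_pred, positive="drought")
```

Low recall for the drought class, as reported above, means that many samples labeled as drought were predicted to be controls, which is the pattern expected if some labeled samples were not physiologically stressed.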

To assess the efficiency of the RF models in predicting drought and control samples for
each Arabidopsis experiment, we employed the Leave-One-Group-Out cross-validation method.
This method involved iterative training of the model using all datasets except one, and then
gauging both its overall and individual sample performances. The aggregate performance was
0.71, with a precision of 0.77 and a recall of 0.75. Performance metrics for individual datasets
varied widely, from a purely random prediction (approximately 0.5) to absolute accuracy (1.0)
(Figure 4d). Random Forest had perfect or near perfect prediction in six experiments, but
notably, 14 Arabidopsis experiments had performances nearing random results, with scores
less than 0.65. Three of these experiments reported ‘mild’ drought stress events where the
authors collected limited physiology data to support the degree of plant stress (Dubois et al.,
2017; Clauw et al., 2016, 2015). This includes two large-scale experiments of 6 and 98
Arabidopsis accessions subjected to ‘mild’ drought stress in an automated plant phenotyping
platform (Clauw et al., 2016, 2015). These experiments collected limited physiology data aside
from growth rate and soil moisture content, making it difficult to evaluate if the plants were
experiencing a physiologically relevant drought stress event, or if they were simply growing with
less but sufficient soil water content.
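
The Leave-One-Group-Out procedure can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names and toy labels are hypothetical, and a trivial majority-class predictor stands in for the Random Forest so that the example has no external dependencies.

```python
from collections import Counter


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def per_group_accuracy(groups, y_true, fit_predict):
    """Accuracy on each held-out group.

    fit_predict(train_idx, test_idx) must return one predicted label per
    test index, using only the training indices to fit the model.
    """
    scores = {}
    for held_out, train, test in leave_one_group_out(groups):
        preds = fit_predict(train, test)
        correct = sum(p == y_true[i] for p, i in zip(preds, test))
        scores[held_out] = correct / len(test)
    return scores


# Hypothetical toy data: three BioProjects of different sizes.
groups = ["P1", "P1", "P1", "P2", "P2", "P3", "P3"]
labels = ["drought", "drought", "control", "drought", "control",
          "drought", "drought"]


def majority_class(train_idx, test_idx):
    # Placeholder model: predict the most common training label.
    most_common = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [most_common] * len(test_idx)


scores = per_group_accuracy(groups, labels, majority_class)
```

Reporting one score per held-out BioProject, rather than a single pooled score, is what makes it possible to identify the individual experiments for which prediction is close to random.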

To identify the most important underlying features in the model, we developed a second
Random Forest classifier using training data from each Arabidopsis BioProject. The overall
accuracy improved significantly to 86% with a 0.77 precision and 0.87 recall for control and 0.92
and 0.85 for drought, respectively (Figure 4a, b). Random Forest classifiers rank and quantify
the importance of each feature in the underlying testing dataset, and we surveyed which
features (genes) were most important for our drought predictive model. The top 100 genes with
the most predictive power are enriched in Gene Ontology (GO) terms exclusively related to
drought processes (Figure 4c). This includes abscisic acid-activated signaling, and responses to
osmotic stress, water stress, salt stress, oxygen-containing compounds, and ABA among
others. This is perhaps not surprising, but supports that our model is using genes with well
supported roles in drought responses to make its classification. Among the top predictors are
regulators of the ABA signaling pathway (HAI1, HAI2, and ABI2), ABA responsive genes
(RD29B, DIG2, RAB1), LEA proteins (LEA4-5, ABR, and RAB18), and a lipid transfer protein
involved in cuticle formation (LTP3) (Supplemental Table 3). The top predictors also include
genes with unknown function (Supplemental Table 3), and predictive modeling may be used to
identify new genes with uncharacterized roles in drought stress responses.
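
The statistical test used for the GO enrichment is not described in this section. A common choice for this kind of analysis is a one-sided hypergeometric test, sketched below purely as an assumption; the function name and all gene counts are hypothetical, and a real analysis would also correct for testing many GO terms.

```python
from math import comb


def hypergeom_enrichment_p(n_genome, n_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_genome:   number of annotated genes in the background
    n_term:     background genes annotated with the GO term
    n_selected: size of the selected gene list (e.g. the top 100 predictors)
    n_overlap:  selected genes annotated with the GO term
    """
    upper = min(n_selected, n_term)
    total = comb(n_genome, n_selected)
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(
        comb(n_term, i) * comb(n_genome - n_term, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 27,000 background genes, 300 carrying the term,
# and 10 of the top 100 predictive genes carrying it.
p_value = hypergeom_enrichment_p(27000, 300, 100, 10)
```

With these hypothetical counts only about one overlapping gene would be expected by chance (100 × 300 / 27,000 ≈ 1.1), so an overlap of 10 yields a very small p-value.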

Variability in drought expression profiles across diverse crop and model plants

Our analyses suggested that drought RNAseq data in Arabidopsis is wildly variable, but
is this a unique feature of Arabidopsis or a wider issue with drought studies in plants? To
answer this question, we re-analyzed published RNAseq data for five additional plant species
with 12 to 57 individual drought experiments each. This includes 179 soybean, 318 tomato,
137 wheat, 1,701 maize, and 981 rice RNAseq samples (Supplemental Table 1). Notably, these
experiments exhibit greater variability compared to Arabidopsis in terms of genotypic diversity,
tissue type, drought assay methods, environmental conditions (e.g., greenhouse, growth
chamber, or field settings), and stress severity. We processed this data using the same
analytical pipeline that we applied to the Arabidopsis data, utilizing the most recent or
highest-quality reference genome available for each species. Similar to the Arabidopsis results,
Principal Component Analysis (PCA) did not reveal clear distinctions between drought-exposed
and control samples for any of the species (Figure 5). This lack of separation could potentially
be attributed to experimental artifacts, and to address this, we applied ComBat to mitigate batch
effects across all of the BioProjects. The application of ComBat led to a reduction in variability
among the studies for maize and rice, resulting in distinct clustering of drought and control
groups in the adjusted expression data (Figure 5). However, when applying ComBat to the
wheat, soy, and tomato datasets, we did not observe a clear clustering pattern based on stress
level, mirroring our observations in Arabidopsis. We applied Leave-One-Group-Out
cross-validation in rice to test whether predictive models could more efficiently discern drought
and control given the improved sample comparability compared to Arabidopsis. The aggregate
performance across all models was 0.82, with a precision of 0.84 and a recall of 0.92. Performance
metrics for individual datasets generally exceeded those observed in Arabidopsis, with 18 datasets
achieving perfect predictive accuracy (Figure 4e). Together, this suggests datasets outside of
Arabidopsis profile drought more consistently and are more comparable, but there is still
significant variability.

Discussion

Genome-scale datasets have increased exponentially, and millions of individual studies
from plants are publicly available. These datasets span the genomic, transcriptomic, epigenetic,
and chromatin landscapes for thousands of diverse species aimed at numerous research
questions related to plant form and function. Historically, most of these datasets were examined
in isolation, often due to the specialized nature of each study and the lack of tools or interest in
integrative analyses (Sielemann et al., 2020). However, recent computational and
methodological advancements encourage data integration into more comprehensive
comparative frameworks to identify conserved or emergent features of plant systems
(Sreedasyam et al., 2023). A critical aspect of comparing multiple datasets is ensuring
uniformity and consistency across experiments. Without this, the nuances of the data can
become blurred, as sparsity and heterogeneity have the potential to overshadow crucial
biological insights. Here, we re-analyzed the wealth of Arabidopsis drought gene expression
data and found that inconsistencies across experiments, potential quality issues, and a lack of
paired physiology data complicate integrative analyses. Below, we discuss the factors
underlying these issues and propose guidelines that can enable enhanced comparability and
reproducibility of future studies.

Applying consistent abiotic stress is relatively straightforward for heat, cold, salinity, UV,
light, or nutrient deficiencies. Temperature can be raised or lowered for a set time, plants can be
subjected to too much or too little light, and solutes or micronutrients can be maintained to
predetermined molarities or concentrations to induce the desired deficiency. These discrete
conditions enhance reproducibility across labs and enable cross-study comparisons. Our
re-analysis of heat stress data in Arabidopsis supports this claim, as we identified a consistent
expression profile under heat regardless of differences in growth conditions, developmental
timing, tissue, or other variables across labs and separate studies. Drought stress, in contrast, is
more difficult to apply, and the lack of standardization makes it difficult to compare experiments
or even distinguish between control and stress samples in some cases. Drought itself is a
complex and ill-defined stress with unclear delineations of mild and severe or tolerance and
avoidance. More than other abiotic stresses, environmental contexts such as the volume and
composition of soil or media, temperature, vapor pressure deficit, light intensity, and air flow
affect the progression of drought stress. Plants growing in small pots, well-draining soil, higher
temperatures, more light, or low humidity will dry faster. The number of days of drying is reported
for most Arabidopsis studies, but this is an arbitrary metric that does not reflect discrete plant
physiological states and is impossible to standardize across conditions without additional data to
measure the water status of the plant and its soil. Even drought studies performed in the same
environment can be challenging to standardize, as different genotypes or mutants can use
water at different rates, causing individual plants to experience differing drought severities within
the same timeframe and experiment (Ginzburg et al., 2022).

We observed that a surprising 39% of drought RNAseq experiments in Arabidopsis
report no drought physiology data, potentially undermining any findings, and preventing these
datasets from being used in comparative studies or meta-analyses. Many of these studies
simply withheld watering for days to weeks and collected control and ‘drought’ stressed tissues
for RNAseq and other downstream analyses. No measurements of soil water content or
physiological responses were performed. Other studies measured only soil water content
without paired measurements in plant tissues, providing no estimate of how stress was affecting
the plants. These factors likely explain why we were unable to clearly separate stressed and
control samples across studies, even when controlling for experimental artifacts and batch
effects. Many ‘drought’ samples have expression signatures that mirror well-watered tissues,
suggesting that many plants were not experiencing the physiologically relevant drought stress
that the authors thought had been applied. This noisiness and heterogeneity made it difficult to
develop an accurate predictive model of drought stress, and the predictive accuracy was more or
less random for almost a third of studies.

Drought datasets outside of Arabidopsis are generally higher quality, with more
consistent application of drought conditions and distinct signatures of water deficit responses.
The reason for this is unclear. It is not difficult to stress Arabidopsis, and physiological
signatures of water stress can be seen with a moderate decrease in water potential (-0.5 to -1.5
MPa). Compared to Arabidopsis, maize, rice, tomato, wheat, and soybean are generally grown at
higher temperatures under higher light intensity, and they have much higher rates of
evapotranspiration. When combined with greater plant biomass and the use of small pots, these
species may experience drought stress after only a few days without watering in growth
chambers and greenhouse settings. In contrast, small or uncrowded Arabidopsis plants may
grow well for a week or more between waterings without experiencing water deficit (Ginzburg et
al., 2022). Consequently, the community standard of withholding water for 7-10 days for
Arabidopsis may be insufficient for stressing plants, and this could explain the prevalence of
lower quality datasets in Arabidopsis research.
Applying consistent drought is challenging, and efforts to standardize stress severity
have seen mixed success. Automated phenotyping systems can accurately manage soil water
content, but they are expensive, have limited scale and flexibility, and plants are still subjected
to other environmental fluctuations. While solutes like PEG or sorbitol can help control soil water
potential, they may induce unnatural responses in plants (Gonzalez et al., 2022). Rainout
shelters or controlled irrigation can help replicate natural drought conditions, but plants are often
subjected to other environmental stresses throughout a growing season. Although there is no
universally accepted method for conducting drought studies in plants, accurately quantifying
water status and associated physiological responses can facilitate the comparison of
experiments across different laboratories (Juenger and Verslues, 2022). Developing a deeper
understanding of drought responses requires integrating datasets that range from mild to
sublethal stress and that sample a wide range of genotypes, tissues, and conditions. Rather than
standardizing drought, we suggest that researchers collect paired physiological,
biochemical, and morphological datasets at sufficient temporal and spatial resolution to quantify
plant health. These traits can first verify that plants are experiencing the desired level of stress
prior to expensive sequencing, and then serve as features or covariates for integration and
re-analysis of multiple datasets.

Methods

Assembling a representative catalog of gene expression RNAseq data
We assembled a database of drought RNAseq data in Arabidopsis, soybean, tomato, rice,
maize, and wheat from the NCBI Sequence Read Archive (SRA). Bulk data were retrieved using a
series of drought or heat stress related keywords with the SRA Advanced Search Builder. The
following metadata were collected for each experiment: tissue type(s), developmental stage,
environment (e.g., greenhouse, field, or growth chamber), media type, duration of stress,
mechanism of drying, associated physiology datasets, genotype, number of timepoints, and
number of replicates. Across all six species, 112 studies had a linked publication in the NCBI
metadata and 130 had no associated publication. Similar metadata were retrieved for individual SRA
samples along with a binary classification of treatment (drought or control) where possible.
Metadata were retrieved from the SRA and associated publications, but the lack of publications
and ambiguity in some labels led to missing or sparse metadata for many
samples, and our manual annotations were conservative to reduce the mislabeling of samples for
analysis and downstream predictive modeling.

RNAseq data processing
Raw RNAseq reads were downloaded from the NCBI SRA and processed using a pipeline to
trim, align, and quantify gene expression data (https://github.com/pardojer23/RNAseqV2).
Briefly, sequence adapters were trimmed and a quality check was performed on the raw FASTQ
files using the fastp program (v0.23.2) (Chen et al., 2018). The cleaned sequencing reads were
then pseudo-aligned to the Arabidopsis TAIR10 (Cheng et al., 2017), maize (Zea mays B73 V5)
(Hufford et al., 2021), rice (Oryza sativa Kitaake v3.1) (Jain et al., 2019), tomato (Solanum
lycopersicum ITAG4.0) (Hosmani et al., 2019), soybean (Glycine max var. Williams 82 V4)
(Valliyodan et al., 2019), and wheat (Triticum aestivum cv. Chinese Spring RefSeq v2.1) (Zhu et al.,
2021) genomes using salmon (v1.6) (Patro et al., 2017). The transcript-level counts were
converted to gene level using the R package tximport (v1.22.0) (Soneson et al., 2015). Raw
TPMs or log2(TPM + 1) transformed values were used for downstream analyses. The median
alignment rate was 69.1% across all species: 70.8% in Arabidopsis, 79.5% in maize, 62.0% in
rice, 65.1% in tomato, 64.7% in soybean, and 64.9% in wheat. These alignment rates are
consistent with other meta-analyses of gene expression in Arabidopsis (Zhang et al., 2020).
Principal Component Analysis (PCA) was performed using built-in functions in scikit-learn
(Pedregosa et al., 2011) on the log2 transformed gene expression data (TPMs) to reduce
dimensionality and capture the main sources of variation within the datasets. The first two
principal components were plotted for each species and labeled by various factors.
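
The transformation and PCA steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' notebook code: the function name `pca_on_tpm`, the synthetic `demo_tpm` matrix, and the choice of a samples-by-genes layout are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_on_tpm(tpm, n_components=2):
    """Project samples onto principal components of log2(TPM + 1) values.

    tpm: array-like of shape (n_samples, n_genes) with non-negative TPMs.
    Returns the sample coordinates and the explained variance ratios.
    """
    tpm = np.asarray(tpm, dtype=float)
    log_tpm = np.log2(tpm + 1.0)  # pseudocount of 1 keeps zero TPMs finite
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(log_tpm)  # PCA centers each gene internally
    return coords, pca.explained_variance_ratio_


# Synthetic stand-in for an expression matrix (hypothetical data, 40 samples x 500 genes).
rng = np.random.default_rng(0)
demo_tpm = rng.gamma(shape=2.0, scale=50.0, size=(40, 500))
demo_coords, demo_ratio = pca_on_tpm(demo_tpm)
```

The two columns of `demo_coords` are what would be plotted and colored by treatment, BioProject, genotype, or tissue.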

Predictive modeling of drought stress responses
Our previous work on predictive modeling using drought gene expression in sorghum and maize
found that the Random Forest ensemble learning method performed best for classification
(Pardo et al., 2022), so we first tested Random Forest on our data. Random Forest models were
constructed using the RandomForestClassifier class from scikit-learn (v1.1.0) (Pedregosa et
al., 2011). To select the hyper-parameters, RandomizedSearchCV was used
with 100 iterations and 3-fold cross-validation to sample the parameter space. Samples
were split into 75% training and 25% testing, and model performance was compared using the
full unbalanced datasets for each species as well as balanced subsets. Analyses using
balanced datasets in Arabidopsis had slightly lower but similar precision and recall, so the full
set of samples was used. Feature importance was calculated using the mean decrease in
impurity (Gini score) as implemented in scikit-learn (v1.1.0). All genes were subsequently
ranked by their respective importance score. Enriched Gene Ontology terms were calculated for
the top predictive features in Arabidopsis using the PANTHER classification system (Mi et al.,
2013).
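
A minimal sketch of this workflow is shown below, under stated assumptions. The function name `fit_drought_forest`, the hyper-parameter ranges, the stratified split, and the synthetic data are illustrative choices that the text does not specify; only the 75/25 split, the 100-iteration 3-fold randomized search, and the impurity-based ranking come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def fit_drought_forest(X, y, n_iter=100, cv=3, seed=0):
    """Tune a Random Forest and rank features by mean decrease in impurity."""
    # 75% training / 25% testing; stratifying by label is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Hypothetical search space; the ranges actually searched are not given.
    param_distributions = {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "max_features": ["sqrt", "log2"],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        random_state=seed,
    )
    search.fit(X_train, y_train)
    best = search.best_estimator_
    # feature_importances_ is the impurity-based (Gini) importance.
    ranking = np.argsort(best.feature_importances_)[::-1]
    return best, best.score(X_test, y_test), ranking


# Synthetic example: 80 samples, 30 "genes", the first three shifted in class 1.
rng = np.random.default_rng(1)
demo_y = np.repeat([0, 1], 40)
demo_X = rng.normal(size=(80, 30))
demo_X[:, :3] += 3.0 * demo_y[:, None]
demo_model, demo_acc, demo_rank = fit_drought_forest(demo_X, demo_y, n_iter=5)
```

In the example call, `n_iter=5` only keeps the demonstration fast; the top entries of `demo_rank` are the column indices that would be passed on to Gene Ontology enrichment.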
We also tested predictive classification with three additional machine learning algorithms:
Linear Support Vector Classifier (LinearSVC), Simple Neural Network (via the Multi-layer
Perceptron classifier, MLPClassifier), and Histogram-based Gradient Boosting Classifier
(HistGradientBoostingClassifier), all implemented using scikit-learn (v1.1.0). The Linear Support
Vector Classifier was implemented using the LinearSVC class; the fit method was used to
train the model, and the predict method was applied to generate predictions. A simple neural
network was developed using the MLPClassifier class. We initialized the MLP with one hidden
layer of 100 neurons, and trained the model using the fit method before making predictions with
the predict method. The Histogram-based Gradient Boosting Classification Tree, a variant of
gradient boosting that is much faster than the traditional Gradient Boosting Classifier, was
implemented using the HistGradientBoostingClassifier class from the sklearn.ensemble module.
This algorithm is capable of handling missing values, and it also applies the 'Early-Stopping'
method to avoid overfitting. For all the models, we evaluated performance by calculating
the accuracy score and generating a classification report, which included precision, recall, f1-score,
and support for each class. Random Forest outperformed the other classifiers in all
instances and was thus used for downstream analyses.

Data availability: The data analyzed in this meta-analysis are detailed in Supplemental Table
1, including Sequence Read Archive (SRA) identifiers, PubMed IDs, and all other sample
metadata. Raw expression values, expressed in transcripts per million, are accessible on Dryad
(https://doi.org/10.5061/dryad.7sqv9s50g). Jupyter notebooks containing all Python code used
in this project, along with additional metadata, are available on GitHub:
https://github.com/bobvanburen/Drought_meta_analysis_VanBuren_etal_2024.

Acknowledgements
This work was funded, in part, by the Water and Life Interface Institute
(NSF-DBI-2213983) to SYR, RAM, and RV, the United States Department of Agriculture
National Institute of Food and Agriculture (USDA-NIFA 2022-67013-36118) to RV, and the US
Department of Energy, Office of Science, Office of Biological and Environmental Research,
Genomics Sciences Program grants DE-SC0021286 and DE-SC0023160 to SYR. CM, JP,
and JS were supported by the predoctoral training award T32-GM110523 from the National
Institute of General Medical Sciences of the NIH. AP, JS, and MLW were supported by the
National Science Foundation Research Traineeship Program (NSF-NRT 1828149).
References

Artur, M.A.S. and Kajala, K. (2021). Convergent evolution of gene regulatory networks
underlying plant adaptations to dry environments. Plant Cell Environ. 44: 3211–3222.
Basu, S., Ramegowda, V., Kumar, A., and Pereira, A. (2016). Plant adaptation to drought
stress. F1000Res. 5.
Behdenna, A., Colange, M., Haziza, J., Gema, A., Appé, G., Azencott, C.-A., and Nordor, A.
(2023). pyComBat, a Python tool for batch effects correction in high-throughput molecular
data using empirical Bayes methods. BMC Bioinformatics 24: 459.
Bowles, A.M.C., Paps, J., and Bechtold, U. (2021). Evolutionary Origins of Drought Tolerance
in Spermatophytes. Front. Plant Sci. 12: 655924.
Bray, E.A. (2004). Genes commonly regulated by water-deficit stress in Arabidopsis thaliana. J.
Exp. Bot. 55: 2331–2341.
Chaves, M.M. and Oliveira, M.M. (2004). Mechanisms underlying plant resilience to water
deficits: prospects for water-saving agriculture. J. Exp. Bot. 55: 2365–2384.
Cheng, C.-Y., Krishnakumar, V., Chan, A.P., Thibaud-Nissen, F., Schobel, S., and Town,
C.D. (2017). Araport11: a complete reannotation of the Arabidopsis thaliana reference
genome. Plant J. 89: 789–804.
Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ
preprocessor. Bioinformatics 34: i884–i890.
Clauw, P. et al. (2016). Leaf Growth Response to Mild Drought: Natural Variation in
Arabidopsis Sheds Light on Trait Architecture. Plant Cell 28: 2417–2434.
Clauw, P., Coppens, F., De Beuf, K., Dhondt, S., Van Daele, T., Maleux, K., Storme, V.,
Clement, L., Gonzalez, N., and Inzé, D. (2015). Leaf responses to mild drought stress in
natural variants of Arabidopsis. Plant Physiol. 167: 800–816.
Danilevskaya, O.N., Yu, G., Meng, X., Xu, J., Stephenson, E., Estrada, S., Chilakamarri, S.,
Zastrow-Hayes, G., and Thatcher, S. (2019). Developmental and transcriptional
responses of maize to drought stress under field conditions. Plant Direct 3: e00129.
Des Marais, D.L., McKay, J.K., Richards, J.H., Sen, S., Wayne, T., and Juenger, T.E.
(2012). Physiological genomics of response to soil drying in diverse Arabidopsis
accessions. Plant Cell 24: 893–914.
Dubois, M., Claeys, H., Van den Broeck, L., and Inzé, D. (2017). Time of day determines
Arabidopsis transcriptome and growth dynamics under mild drought. Plant Cell Environ. 40:
180–189.
Eckardt, N.A. et al. (2023). Climate change challenges, plant science solutions. Plant Cell 35:
24–66.
Ekundayo, O.Y., Abiodun, B.J., and Kalumba, A.M. (2022). Global quantitative and
qualitative assessment of drought research from 1861 to 2019. International Journal of
Disaster Risk Reduction 70: 102770.
Farooq, M., Wahid, A., Kobayashi, N., Fujita, D., and Basra, S.M.A. (2009). Plant Drought
Stress: Effects, Mechanisms and Management. In Sustainable Agriculture, E. Lichtfouse,
M. Navarrete, P. Debaeke, S. Véronique, and C. Alberola, eds (Springer Netherlands:
Dordrecht), pp. 153–188.
Ginzburg, D.N., Bossi, F., and Rhee, S.Y. (2022). Uncoupling differential water usage from
drought resistance in a dwarf Arabidopsis mutant. Plant Physiol. 190: 2115–2121.
Gonzalez, S., Swift, J., Xu, J., Illouz-Eliaz, N., Nery, J.R., and Ecker, J.R. (2022). Mimicking
genuine drought responses using a high throughput plate assay. bioRxiv:
2022.11.25.517922.
Granier, C. et al. (2006). PHENOPSIS, an automated platform for reproducible phenotyping of
plant responses to soil water deficit in Arabidopsis thaliana permitted the identification of an
accession with low sensitivity to soil water deficit. New Phytol. 169: 623–635.
Groen, S.C., Joly-Lopez, Z., Platts, A.E., Natividad, M., Fresquez, Z., Mauck, W.M.,
Quintana, M.R., Cabral, C.L.U., Torres, R.O., Satija, R., Purugganan, M.D., and Henry,
A. (2022). Evolutionary systems biology reveals patterns of rice adaptation to
drought-prone agro-ecosystems. Plant Cell 34: 759–783.
Großkinsky, D.K., Svensgaard, J., Christensen, S., and Roitsch, T. (2015). Plant phenomics
and the need for physiological phenotyping across scales to narrow the
genotype-to-phenotype knowledge gap. J. Exp. Bot. 66: 5429–5440.
Gupta, A., Rico-Medina, A., and Caño-Delgado, A.I. (2020). The physiology of plant
responses to drought. Science 368: 266–269.
Hincha, D.K., Zuther, E., and Popova, A.V. (2021). Stabilization of Dry Sucrose Glasses by
Four LEA_4 Proteins from Arabidopsis thaliana. Biomolecules 11.
Hosmani, P.S. et al. (2019). An improved de novo assembly and annotation of the tomato
reference genome using single-molecule sequencing, Hi-C proximity ligation and optical
maps. bioRxiv: 767764.
Hufford, M.B. et al. (2021). De novo assembly, annotation, and comparative analysis of 26
diverse maize genomes. Science 373: 655–662.
Jain, R. et al. (2019). Genome sequence of the model rice variety KitaakeX. BMC Genomics
20: 905.
Juenger, T.E. and Verslues, P.E. (2022). Time for a drought experiment: Do you know your
plants’ water status? The Plant Cell.
Klepikova, A.V., Kasianov, A.S., Gerasimov, E.S., Logacheva, M.D., and Penin, A.A.
(2016). A high resolution map of the Arabidopsis thaliana developmental transcriptome
based on RNA-seq profiling. Plant J. 88: 1058–1070.
Lamesch, P. et al. (2012). The Arabidopsis Information Resource (TAIR): improved gene
annotation and new tools. Nucleic Acids Res. 40: D1202–10.
Liu, Q., Kasuga, M., Sakuma, Y., Abe, H., Miura, S., Yamaguchi-Shinozaki, K., and
Shinozaki, K. (1998). Two transcription factors, DREB1 and DREB2, with an EREBP/AP2
DNA binding domain separate two cellular signal transduction pathways in drought- and
low-temperature-responsive gene expression, respectively, in Arabidopsis. Plant Cell 10:
1391–1406.
Mi, H., Muruganujan, A., Casagrande, J.T., and Thomas, P.D. (2013). Large-scale gene
function analysis with the PANTHER classification system. Nat. Protoc. 8: 1551–1566.
Nordin, K., Heino, P., and Palva, E.T. (1991). Separate signal pathways regulate the
expression of a low-temperature-induced gene in Arabidopsis thaliana (L.) Heynh. Plant
Mol. Biol. 16: 1061–1071.
Osmolovskaya, N., Shumilina, J., Kim, A., Didio, A., Grishina, T., Bilova, T., Keltsieva,
O.A., Zhukov, V., Tikhonovich, I., Tarakhovskaya, E., Frolov, A., and Wessjohann,
L.A. (2018). Methodology of Drought Stress Research: Experimental Setup and
Physiological Characterization. Int. J. Mol. Sci. 19.
Pardo, J., Wai, C.M., Harman, M., Nguyen, A., Kremling, K.A., Romay, C., Lepak, N.,
Bauerle, T.L., Buckler, E.S., Thompson, A.M., and VanBuren, R. (2022). Cross-species
predictive modeling reveals conserved drought responses between maize and sorghum.
bioRxiv: 2022.09.26.509573.
Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., and Kingsford, C. (2017). Salmon provides
fast and bias-aware quantification of transcript expression. Nat. Methods 14: 417–419.
Pedregosa, Varoquaux, and Gramfort (2011). Scikit-learn: Machine learning in Python. of
machine Learning ….
Sielemann, K., Hafner, A., and Pucker, B. (2020). The reuse of public datasets in the life
sciences: potential risks and rewards. PeerJ 8: e9954.
Soneson, C., Love, M.I., and Robinson, M.D. (2015). Differential analyses for RNA-seq:
transcript-level estimates improve gene-level inferences. F1000Res. 4: 1521.
Sreedasyam, A. et al. (2023). JGI Plant Gene Atlas: an updateable transcriptome resource to
improve functional gene descriptions across the plant kingdom. Nucleic Acids Res.
Tardieu, F. (2012). Any trait or trait-related allele can confer drought tolerance: just design the
right drought scenario. J. Exp. Bot. 63: 25–31.
Turner, N.C. (1986). Adaptation to Water Deficits: a Changing Perspective. Funct. Plant Biol.
13: 175–190.
Valliyodan, B. et al. (2019). Construction and comparison of three reference-quality genome
assemblies for soybean. Plant J. 100: 1066–1082.
Verslues, P.E. et al. (2023). Burning questions for a warming and changing world: 15
unknowns in plant abiotic stress. Plant Cell 35: 67–108.
Verslues, P.E. and Juenger, T.E. (2011). Drought, metabolites, and Arabidopsis natural
variation: a promising combination for understanding adaptation to water-limited
environments. Curr. Opin. Plant Biol. 14: 240–245.
Yamaguchi-Shinozaki, K. and Shinozaki, K. (1993). The plant hormone abscisic acid
mediates the drought-induced expression but not the seed-specific expression of rd22, a
gene responsive to dehydration stress in Arabidopsis thaliana. Mol. Gen. Genet. 238: 17–25.
Yoshiba, Y., Kiyosue, T., Katagiri, T., Ueda, H., Mizoguchi, T., Yamaguchi-Shinozaki, K.,
Wada, K., Harada, Y., and Shinozaki, K. (1995). Correlation between the induction of a
gene for delta 1-pyrroline-5-carboxylate synthetase and the accumulation of proline in
Arabidopsis thaliana under osmotic stress. Plant J. 7: 751–760.
Zhang, H., Zhang, F., Yu, Y., Feng, L., Jia, J., Liu, B., Li, B., Guo, H., and Zhai, J. (2020). A
comprehensive online database for exploring ∼20,000 public Arabidopsis RNA-seq
libraries. Mol. Plant 13: 1231–1233.
Zhu, T. et al. (2021). Optical maps refine the bread wheat Triticum aestivum cv. Chinese Spring
genome assembly. Plant J. 107: 303–314.
Figures

Figure 1. Summary of Arabidopsis drought gene expression metadata. Metadata was
collected for the 36 BioProjects with associated publications. The top row shows a histogram of
developmental stage or age of plants (in days) at sampling, the environment where studies were
conducted, and the media plants were propagated in. The bottom row shows a histogram of the
duration of water deficit stress, mechanism of drying, and paired physiology data. Studies using
PEG, air drying, and ABA are not plotted in the stress duration graph, as experiment times
ranged from 1-8 hours. Abbreviations are as follows: relative water content (RWC), soil moisture
content (SMC), fresh weight (FW), and electrolyte leakage (EL).

Figure 2. Dramatic variability of public drought gene expression datasets in Arabidopsis.
(a) Principal component analysis of 1,301 drought related RNAseq samples collected from the
sequence read archive (SRA). The first two principal components are plotted for all samples and
colored by different factors including a binary classification of drought and control (upper left),
BioProject (upper right), genotype/accession of the sample (Col-0 or others; bottom left), and
the tissue type (bottom right). (b) Violin plot of log2 transformed TPM of RNAseq data for nine
drought marker genes in samples classified as control (left) and drought (right).

Figure 3. Comparison of public heat stress gene expression data in Arabidopsis. (a)
Principal component analysis of 156 heat stress related RNAseq samples collected from the
sequence read archive (SRA). The first two principal components are plotted for all samples and
colored by different factors including a binary classification of heat stressed and control (top),
BioProject (middle), and the tissue type (bottom). (b) Violin plot of log2 transformed TPM of
RNAseq data for four drought marker genes in samples classified as control (left) and heat
stress (right). (c) Histogram of the same log2 transformed gene expression data as (b) for two
heat stress marker genes of HSP70 (AT3G12580; left), and DREB2A (AT5G05410; right).

Figure 4. Predictive modeling of water stress in drought expression data. (a) Receiver
operating characteristic curve showing the performance of the Random Forest based drought
classification model across all classification thresholds. (b) Confusion matrix of the drought
predictive model. (c) Multi-dimensional scaling plot showing clusters of enriched gene ontology
terms for the top 100 most important features (genes) in the Arabidopsis Random Forest
machine learning models. The size of each circle is proportional to the number of genes
annotated with each term and the circles are colored by the log10 of the adjusted p-value.
(d, e) Histograms of the predictive accuracy of the Random Forest models for classifying drought using
a leave one experiment out approach for Arabidopsis (d) and rice (e). A predictive accuracy of
1 corresponds to perfect prediction, and 0.5 is approximately random.

Figure 5. Variability of drought gene expression datasets across model and crop species.
PCA plots are shown for drought gene expression datasets in the eudicots Arabidopsis,
soybean, and tomato (top) and monocots rice, maize, and wheat (bottom). The PCA of
Arabidopsis data from Figure 2a is re-drawn here to enable easier comparison. The first two
principal components are plotted for all samples and colored by different factors including a
binary classification of drought (blue) and control (gold). Samples corresponding to recovery or
unclear timepoints are not included. PCAs from log2 transformation of the raw TPMs are shown
on the top panel for each species, and ComBat corrected samples are shown below.