Variability in drought gene expression datasets highlight the need for community standardization

doi:10.1101/2024.02.04.578814

2024 · doi:10.1101/2024.02.04.578814

preprint OA: closed

Abstract

Physiologically relevant drought stress is difficult to apply consistently, and the heterogeneity in experimental design, growth conditions, and sampling schemes makes it challenging to compare water deficit studies in plants. Here, we re-analyzed hundreds of drought gene expression experiments across diverse model and crop species and quantified the variability across studies. We found that drought studies are surprisingly incomparable, even when accounting for differences in genotype, environment, drought severity, and method of drying. Many studies, including most Arabidopsis work, lack high-quality phenotypic and physiological datasets to accompany gene expression, making it impossible to assess the severity or in some cases the occurrence of water deficit stress events. From these datasets, we developed supervised learning classifiers that can accurately predict if RNA-seq samples have experienced a physiologically relevant drought stress, and suggest this can be used as a quality control for future studies. Together, our analyses highlight the need for more community standardization, and the importance of paired physiology data to quantify stress severity for reproducibility and future data analyses.

Introduction

bioRxiv preprint doi: https://doi.org/10.1101/2024.02.04.578814; this version posted February 6, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Drought, increasingly prevalent in both natural and agricultural landscapes, is escalating in frequency and severity due to the dynamic climate. This trend has spurred the development of an extensive and increasingly interdisciplinary research community focused on understanding plant adaptation to water-limited environments (Osmolovskaya et al., 2018; Ekundayo et al., 2022). Meteorologically, drought manifests as drier than normal conditions, but its physiological impact on plants varies based on the duration, severity, and timing of the stress events, alongside local soil and habitat conditions (Gupta et al., 2020; Tardieu, 2012). Mild, infrequent drought events may result in only slight reductions in photosynthesis and growth, often without significant impacts on biomass or yield. In contrast, recurrent or severe bouts of drought may cause unrecoverable damage or even plant death (Farooq et al., 2009). Central to understanding and engineering drought resilience is the ability to apply consistent, physiologically relevant, and reproducible stress events across scales (Großkinsky et al., 2015). Such standardization is necessary to develop a community framework that allows for comparison and expansion of previous experiments (Juenger and Verslues, 2022).

Water deficit responses likely evolved during terrestrialization, and they have been continually refined, repurposed, and diversified to enable plants to colonize virtually every biome (Bowles et al., 2021).
Resistance to drought is an emergent phenotype involving the synchronization of numerous physiological and genetic processes, and diverse lineages of plants have evolved numerous adaptations to avoid, escape, and tolerate water deficits (Artur and Kajala, 2021; Chaves and Oliveira, 2004; Turner, 1986). Different plant lineages, populations, or even individual genotypes use combinations of these strategies to tolerate water limitations (Verslues and Juenger, 2011; Basu et al., 2016; Farooq et al., 2009). The genetic mechanisms underlying responses or tolerance to drought stress are highly complex and involve the activation of hundreds to thousands of pathways that collectively enable resilience to water deficit. Most drought-related pathways were discovered and characterized in the model plant Arabidopsis, but core regulatory, biochemical, and physiological responses are broadly conserved across green plants (Shinozaki and Yamaguchi-Shinozaki, 2007). The genetic basis of adaptations to water deficit is an active and exciting area of plant science research, and numerous important research gaps still remain (Verslues et al., 2023; Eckardt et al., 2023).

One promising approach to closing the knowledge gaps in understanding the genetic basis of drought adaptation is using large-scale omics technologies. Numerous large-scale datasets have been collected across diverse plant lineages to study the effects of drought stress. Some studies have measured physiological responses in naturally water limited environments (Pardo et al., 2022; Danilevskaya et al., 2019; Groen et al., 2022), but most use simulated drought events under controlled or semi-controlled conditions to induce water deficit responses (Gonzalez et al., 2022). Simulated drought studies range in scale and severity from large rainout shelters withholding water from thousands of plants in an ecological or agricultural setting to agar plates containing solutes to lower water potential. Each of these approaches has benefits and drawbacks related to cost, consistency, and accuracy of applying drought.
For example, using polyethylene glycol, mannitol, and salt to lower water potential may not actually induce true drought responsive pathways (Gonzalez et al., 2022), and restricting plants to small pots in growth chambers or greenhouses can impact root growth and lead to physiologically irrelevant and irreproducible drying (Granier et al., 2006). Individual labs utilize radically different experimental approaches, growth conditions, and sampling schemes for drought assays, and these added variables mask emergent properties of an already complex phenotype. A major challenge for cross-species analysis is finding comparable biological datasets with similar design, implementation, and sampling.

Here, to evaluate the comparability and reusability of drought gene expression data, we compared public datasets across labs and experiments, and searched for patterns that delineate drought and control conditions. We first focused on data from the model plant Arabidopsis thaliana and then expanded our analyses to include five additional model and crop species with the most published drought data. We found that drought gene expression data are more variable than data from other abiotic stresses, and many studies lack basic physiological data to assess the magnitude or even presence of water deficit stress. Our analyses highlight the need for community standardization to enable the reuse and integration of datasets across scales from different laboratories.

Results

Variability of gene expression datasets on drought in Arabidopsis

Plant responses to drought are a well-studied topic. There are ~34,000 articles in PubMed related to drought stress in plants, and this wealth of knowledge has uncovered numerous phenotypes, pathways, and genes underlying responses to water deficit (Gupta et al., 2020). Most of our understanding of drought responses at the molecular genetic level is based on work in Arabidopsis, including over 100 studies surveying genome-wide gene expression (RNAseq) changes under water deficit across different accessions, environmental conditions, and mutant backgrounds (Supplemental Table 1). Collectively, these datasets have been incorporated into public gene expression atlases, co-expression networks, and other tools that are broadly used by the plant science community to understand which genes and pathways underlie drought responses (Lamesch et al., 2012; Klepikova et al., 2016). However, drought experiments vary wildly in the degree, severity, and implementation of water deficit. Many experiments are analyzed in isolation and arrive at independent conclusions. The choice of which drought experiments to reference for future studies can drastically alter hypothesis generation and inference of biological function. This raises a fundamental question: how comparable are drought studies across different experiments?

To survey the variability in public drought data, we re-analyzed 109 water deficit RNAseq experiments in Arabidopsis obtained from the Sequence Read Archive (SRA). We manually curated metadata from 1,301 RNAseq samples across these 109 BioProjects. These datasets include a range of genotypes and mutant backgrounds, developmental time points and tissues, and differences in stress severity and duration across a range of natural or controlled drought conditions (Figure 1; Supplemental Table 1).
Based on the available metadata, 81% of studies were conducted in growth chambers in standard potting media, 13% on agar plates, and 5% in greenhouses. Half of the studies (51%) applied a natural dry down by stopping irrigation, 27% had controlled drying to a set soil moisture content, 8% removed plants from media and let them air dry, and 14% used PEG to lower water potential or ABA to simulate water deficit responses. Surprisingly, 39% of Arabidopsis studies did not report paired physiology data measuring plant stress, such as gas exchange, photosynthesis, leaf water potential, or leaf relative water content. Next, we processed the raw reads through a common pipeline to remove variation arising from the different algorithmic and statistical frameworks used in each individual study. Raw Illumina RNAseq reads were quality trimmed and aligned to the TAIR10 gene models, and raw or batch corrected expression values in transcripts per million (TPM) were used as a basis for downstream analysis.

To identify any factors that clearly delineate samples within or across experiments, we used dimensionality reduction with an expectation that samples should cluster by water stress status. Principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) show no clear separation between drought-treated and control samples across Arabidopsis drought experiments (Figure 2a; Supplemental Figure 1, 2). Within experiments, some BioProjects show clear separation of drought and control samples/replicates, but a surprising number have interspersed samples (Supplemental Figure 3).
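
As a point of reference for the TPM values used throughout, the conversion from raw counts to transcripts per million can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline; the gene names, counts, and lengths are invented toy values, and the pseudocount of 1 in the log transform is an assumption.

```python
import math

def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    # Reads per kilobase: normalize each gene's count by its length.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Rescale so that values within one sample sum to one million.
    total_rpk = sum(rpk.values())
    if total_rpk == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / total_rpk * 1e6 for gene, value in rpk.items()}

def log2_transform(tpm, pseudocount=1.0):
    """Log-transform TPM values; the pseudocount (an assumption) avoids log2(0)."""
    return {gene: math.log2(value + pseudocount) for gene, value in tpm.items()}

# Toy sample with hypothetical genes (not real Arabidopsis data).
counts = {"geneA": 100, "geneB": 300, "geneC": 0}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 500}

tpm = counts_to_tpm(counts, lengths_bp)
log_tpm = log2_transform(tpm)
```

Because TPM is computed within each sample, it accounts for sequencing depth and gene length but does not by itself remove differences between experiments.
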
Across experiments, samples were broadly separated by tissue type and BioProject. Root, seedling, inflorescence, and silique samples form groups in a similar dimensional space, whereas clusters of leaf and whole plant samples were more dispersed across PC1 and PC2 or dimensions 1 and 2 for PCA and t-SNE, respectively (Figure 2a, Supplemental Figure 2). Samples within the same experiment or BioProject tended to cluster together, suggesting that the experimental effects are responsible for much of the separation we observed (Supplemental Figure 3). Individual accessions of Arabidopsis show remarkable differences in expression dynamics under drought (Des Marais et al., 2012), and we tested if genotype differences may explain the lack of correlation between stressed samples across experiments. Similar to other experimental factors, there is no clear separation by accession (Figure 2a). There is also no clear separation of samples by the reported duration, severity, or type of water deficit (e.g., drying vs solute based), or technical variables such as sequencing read length, technology, read chemistry, or publication year.

To test if the observed separation by experiment (BioProject) rather than stress-related factors is caused by batch effects, we applied ComBat (Behdenna et al., 2023) to the expression matrix. This method leverages an empirical Bayes framework to estimate and subsequently adjust for batch effects. Post-ComBat adjustment, the first two principal components of the expression values accounted for only 19% of the total variance (Supplemental Figure 4). Even when batch effects associated with BioProject were addressed, the samples did not show clear differentiation between stress and non-stress conditions, and much of the variance that was removed relates to true biological differences rather than technical artifacts.
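
The risk described above, that batch adjustment can remove biological signal along with technical variation, can be illustrated with a deliberately simplified stand-in for ComBat. The sketch below only subtracts each batch's mean from one gene's log expression; it does not implement ComBat's empirical Bayes shrinkage or variance adjustment, and the BioProject names and values are invented.

```python
from collections import defaultdict
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (location-only adjustment)."""
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in grouped.items()}
    return [value - batch_mean[batch] for value, batch in zip(values, batches)]

# Balanced design: each hypothetical BioProject has 2 control and 2 drought samples.
balanced = center_by_batch(
    [2.0, 2.0, 6.0, 6.0, 5.0, 5.0, 9.0, 9.0],
    ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
)

# Confounded design: P3 holds only drought samples, P4 only control samples.
confounded = center_by_batch(
    [6.0, 6.0, 2.0, 2.0],
    ["P3", "P3", "P4", "P4"],
)
```

In the balanced case the offset between projects is removed and the drought effect (a difference of 4) is retained; in the confounded case the same adjustment erases it entirely.
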
Together, this suggests drought experiments in Arabidopsis have extreme variability and individual datasets are largely incomparable using traditional approaches.

We next sought to understand why the Arabidopsis drought RNAseq data appeared so variable across experiments. We first hypothesized that since dimensionality reduction provides a summary across all genes, individual factors such as drought stress may be confounded by other experimental factors. Therefore, to see if confounding factors are masking clustering of drought-stressed samples, we surveyed the expression pattern of nine drought marker genes across all the samples. If confounding factors indeed masked the drought-response programs of relevant genes, we would expect the drought-marker genes to be consistently induced in drought across all experiments. The nine drought marker genes included RD29A (Nordin et al., 1991), RD22 (Yamaguchi-Shinozaki and Shinozaki, 1993), RAB18, DREB2A (Liu et al., 1998), COR15A (Hincha et al., 2021), LEA4-5 (Bray, 2004), P5CS1 (Yoshiba et al., 1995), RD20, and KIN2. Marker genes were generally expressed at higher levels in drought-stressed samples compared to well-watered, but this is highly variable across our datasets (Figure 2b, Supplemental Figure 5). For instance, the dehydrin RAB18 and the ABA-induced transcription factor DREB2A were highly expressed under drought but generally not expressed under well-watered conditions. However, roughly a third of samples labeled as ‘drought’ had no detectable expression of these two genes (Figure 2b, Supplemental Figure 5). This pattern is consistent across all nine drought marker genes, suggesting the hypothesis that confounding factors are masking the drought response program is not supported. Instead, the data show that many drought-treated samples lack a molecular signature of water deficit, and may not have experienced a physiologically relevant drought stress.

We then wondered whether the observed inconsistencies in drought expression data might simply reflect a wide range of responses to various drought regimes applied, given the varying drought intensities and diverse growing conditions of different experiments. To understand if a similar variability exists in response to other stresses, we re-analyzed publicly available expression datasets related to heat stress in Arabidopsis. There are fewer published heat expression datasets, and we curated 21 experiments where plants were subjected to physiologically relevant heat stress in Arabidopsis at temperatures of 35-42 °C. Metadata indicated that growth conditions were as variable as in the drought studies, and the raw reads were processed as described above. Strikingly, when we performed dimensionality reduction on log-transformed expression data, the samples were distinctly categorized into either 'heat' or 'control' groups based on principal component 2, which explained 14% of the variation (Figure 3a). Similar to the drought dataset, principal component 1, which explained 21% of the variation, differentiated the samples based on tissue type, grouping them as whole seedlings or leaves. Unlike in the drought dataset, there was no distinct separation by experiment (BioProject). We surveyed the expression patterns of four heat marker genes (HSP70, HSP90, MBF1c, and DREB2A) to see if individual genes have a similarly clear pattern.
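
The marker-gene surveys used in this section, for drought above and for heat here, can be sketched as a simple check: what fraction of samples carrying a stress label show no detectable expression of any marker? The example below is an illustration only; the sample records and TPM values are invented, and the detection threshold of 1 TPM is an assumption rather than a value reported in the study.

```python
def fraction_without_markers(samples, markers, label="drought", min_tpm=1.0):
    """Fraction of samples with the given label in which no marker reaches min_tpm."""
    labeled = [s for s in samples if s["label"] == label]
    if not labeled:
        return 0.0
    silent = [
        s for s in labeled
        if all(s["tpm"].get(gene, 0.0) < min_tpm for gene in markers)
    ]
    return len(silent) / len(labeled)

# Invented toy values; only the gene names come from the text.
samples = [
    {"id": "s1", "label": "drought", "tpm": {"RAB18": 250.0, "DREB2A": 40.0}},
    {"id": "s2", "label": "drought", "tpm": {"RAB18": 90.0, "DREB2A": 12.0}},
    {"id": "s3", "label": "drought", "tpm": {"RAB18": 0.0, "DREB2A": 0.2}},
    {"id": "s4", "label": "control", "tpm": {"RAB18": 0.0, "DREB2A": 0.5}},
]

silent_fraction = fraction_without_markers(samples, ["RAB18", "DREB2A"])
```

In this toy set, one of the three drought-labeled samples (s3) looks like the control for both markers, the kind of sample that would be flagged for closer inspection.
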
Heat marker genes have notably higher expression in all heat stressed samples compared to control, and there is little overlap in the distribution of expression levels of these marker genes between the two conditions (Figure 3b, 3c). Thus, public heat stress expression datasets demonstrate a clear molecular signature of heat stress. This signature persists through any experimental variation across studies, indicating that the inconsistencies noted in the drought stress data are not simply artifacts of work conducted in different laboratories.

Developing a predictive model for classifying drought gene expression

Dimensionality reduction and clustering approaches were unable to delineate drought-stressed from control samples, and we hypothesize this was driven by quality issues of the underlying datasets. We previously developed a cross-species predictive model to accurately classify drought stressed RNAseq data in maize and sorghum (Pardo et al., 2022), and sought to test if this approach could differentiate among drought and control samples in Arabidopsis. We developed Random Forest (RF) based predictive models to classify the Arabidopsis samples as “drought” or “control” based on normalized gene expression values alone. We divided the RNAseq samples into a training set with 75% of the experiments (BioProjects) and a testing set with the remaining 25%. The overall accuracy of our predictive model was 66% (Supplemental Table 2), which is substantially lower than the model developed with high-quality maize and sorghum data (Pardo et al., 2022). The precision and recall for ‘drought’ samples were 0.79 and 0.51, respectively, and 0.61 and 0.84 for control.
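
For readers less familiar with these metrics, the sketch below shows how per-class precision and recall, and overall accuracy, follow from true and predicted labels. The labels are a toy example constructed for illustration; they are not the study's predictions, and the function names are hypothetical.

```python
def class_metrics(true_labels, predicted_labels, positive):
    """Precision and recall for one class, treating `positive` as the target label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Toy labels: 10 drought and 10 control samples (invented).
truth = ["drought"] * 10 + ["control"] * 10
predicted = ["drought"] * 5 + ["control"] * 5 + ["control"] * 9 + ["drought"] * 1

drought_precision, drought_recall = class_metrics(truth, predicted, "drought")
control_precision, control_recall = class_metrics(truth, predicted, "control")
overall_accuracy = accuracy(truth, predicted)
```

In this toy case, as in the reported Arabidopsis metrics, recall for ‘drought’ is lower than precision: samples predicted as drought are usually labeled drought, but many drought-labeled samples are predicted as control.
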
We tested four other classifiers including Linear Support Vector Classifier (SVC), Simple Neural Network (MLP), Histogram-based Gradient Boosting Classifier (HGB), and K-Nearest Neighbor Classifier (KNN) to see if this improved predictive accuracy. HGB had similar performance to RF (overall accuracy 65%) but SVC, KNN, and MLP performed significantly worse with 52%, 52%, and 56% overall accuracy, respectively (Supplemental Table 2). The relatively low predictive accuracy in Arabidopsis was initially surprising, as our model was trained with significantly more data and tested within a single species with less genetic diversity than either maize or sorghum. Further, Arabidopsis experiments are generally conducted within a narrower set of conditions compared to the diverse growth chamber, greenhouse, and field environments for maize and sorghum. We suspect that the reduced predictive accuracy in Arabidopsis might be due to the inclusion of datasets with plants that are not physiologically stressed, thereby diminishing the effectiveness of our models.

To assess the efficiency of the RF models in predicting drought and control samples for each Arabidopsis experiment, we employed the Leave-One-Group-Out cross-validation method. This method involved iterative training of the model using all datasets except one, and then gauging both its overall and individual sample performances. The aggregate performance was 0.71, with a precision of 0.77 and a recall of 0.75. Performance metrics for individual datasets varied widely, from a purely random prediction (approximately 0.5) to absolute accuracy (1.0) (Figure 4d).
Random Forest had perfect or near perfect prediction in six experiments, but notably, 14 Arabidopsis experiments had performances nearing random results, with scores less than 0.65. Three of these experiments reported ‘mild’ drought stress events where the authors collected limited physiology data to support the degree of plant stress (Dubois et al., 2017; Clauw et al., 2016, 2015). This includes two large-scale experiments of 6 and 98 Arabidopsis accessions subjected to ‘mild’ drought stress in an automated plant phenotyping platform (Clauw et al., 2016, 2015). These experiments collected limited physiology data aside from growth rate and soil moisture content, making it difficult to evaluate if the plants were experiencing a physiologically relevant drought stress event, or if they were simply growing with less but sufficient soil water content.

To identify the most important underlying features in the model, we developed a second Random Forest classifier using training data from each Arabidopsis BioProject. The overall accuracy improved significantly to 86% with a 0.77 precision and 0.87 recall for control and 0.92 and 0.85 for drought, respectively (Figure 4a, b). Random Forest classifiers rank and quantify the importance of each feature in the underlying testing dataset, and we surveyed which features (genes) were most important for our drought predictive model. The top 100 genes with the most predictive power are enriched in Gene Ontology (GO) terms exclusively related to drought processes (Figure 4c). This includes abscisic acid-activated signaling, and responses to osmotic stress, water stress, salt stress, oxygen-containing compounds, and ABA among others. This is perhaps not surprising, but supports that our model is using genes with well supported roles in drought responses to make its classification.
Among the top predictors are regulators of the ABA signaling pathway (HAI1, HAI2, and ABI2), ABA responsive genes (RD29B, DIG2, RAB1), LEA proteins (LEA4-5, ABR, and RAB18), and a lipid transfer protein involved in cuticle formation (LTP3) (Supplemental Table 3). The top predictors also include genes with unknown function (Supplemental Table 3), and predictive modeling may be used to identify new genes with uncharacterized roles in drought stress responses.

Variability in drought expression profiles across diverse crop and model plants

Our analyses suggested that drought RNAseq data in Arabidopsis is wildly variable, but is this a unique feature of Arabidopsis or a wider issue with drought studies in plants? To answer this question, we re-analyzed published RNAseq data for five additional plant species with 12 to 57 individual drought experiments each. This includes 179 soybean, 318 tomato, 137 wheat, 1,701 maize, and 981 rice RNAseq samples (Supplemental Table 1). Notably, these experiments exhibit greater variability compared to Arabidopsis in terms of genotypic diversity, tissue type, drought assay methods, environmental conditions (e.g., greenhouse, growth chamber, or field settings), and stress severity. We processed this data using the same analytical pipeline that we applied to the Arabidopsis data, utilizing the most recent or highest-quality reference genome available for each species. Similar to the Arabidopsis results, Principal Component Analysis (PCA) did not reveal clear distinctions between drought-exposed and control samples for any of the species (Figure 5).
This lack of separation could potentially be attributed to experimental artifacts, and to address this, we applied ComBat to mitigate batch effects across all of the BioProjects. The application of ComBat led to a reduction in variability among the studies for maize and rice, resulting in distinct clustering of drought and control groups in the adjusted expression data (Figure 5). However, when applying ComBat to the wheat, soy, and tomato datasets, we did not observe a clear clustering pattern based on stress level, mirroring our observations in Arabidopsis. We applied Leave-One-Group-Out cross-validation in rice to test if predictive models could more efficiently discern drought and control given the improved sample comparability compared to Arabidopsis. The aggregate performance across all models was 0.82, with a precision of 0.84 and a recall of 0.92. Performance metrics for individual datasets generally exceeded those observed in Arabidopsis, with 18 datasets achieving perfect predictive accuracy (Figure 4e). Together, this suggests datasets outside of Arabidopsis profile drought more consistently and are more comparable, but there is still significant variability.
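
The Leave-One-Group-Out procedure used for Arabidopsis and rice can be sketched as below, with each BioProject treated as a group that is held out in turn. To keep the example self-contained, a one-feature midpoint threshold stands in for the Random Forest models used in the study; the expression values, labels, and project names are invented, and the toy classifier assumes the feature is higher under drought.

```python
from statistics import mean

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one group at a time."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test

def fit_threshold(x, y):
    """Toy classifier: midpoint between the class means of a single feature.
    Assumes both classes are present in the training data."""
    drought_mean = mean(v for v, label in zip(x, y) if label == "drought")
    control_mean = mean(v for v, label in zip(x, y) if label == "control")
    return (drought_mean + control_mean) / 2.0

def logo_accuracy(x, y, groups):
    """Per-group accuracy when each group is predicted by a model trained on the rest."""
    scores = {}
    for held_out, train, test in leave_one_group_out(groups):
        threshold = fit_threshold([x[i] for i in train], [y[i] for i in train])
        hits = [
            1.0 if ("drought" if x[i] > threshold else "control") == y[i] else 0.0
            for i in test
        ]
        scores[held_out] = mean(hits)
    return scores

# Invented marker expression for three hypothetical BioProjects; in project C the
# 'drought' samples are indistinguishable from the controls.
x = [1.0, 1.0, 5.0, 5.0, 2.0, 2.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0]
y = ["control", "control", "drought", "drought"] * 3
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

scores = logo_accuracy(x, y, groups)
```

Holding out whole projects, rather than random samples, means each score reflects how well a model trained on other experiments transfers to an unseen one. In the toy data, projects A and B are predicted perfectly while project C scores 0.5.
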

Discussion

Genome-scale datasets have increased exponentially, and millions of individual studies from plants are publicly available. These datasets span the genomic, transcriptomic, epigenetic, and chromatin landscapes for thousands of diverse species aimed at numerous research questions related to plant form and function. Historically, most of these datasets were examined in isolation, often due to the specialized nature of each study and the lack of tools or interest in integrative analyses (Sielemann et al., 2020). However, recent computational and methodological advancements encourage data integration into more comprehensive comparative frameworks to identify conserved or emergent features of plant systems (Sreedasyam et al., 2023). A critical aspect of comparing multiple datasets is ensuring uniformity and consistency across experiments. Without this, the nuances of the data can become blurred, as sparsity and heterogeneity have the potential to overshadow crucial biological insights. Here, we re-analyzed the wealth of Arabidopsis drought gene expression data and found that inconsistencies across experiments, potential quality issues, and a lack of paired physiology data complicate integrative analyses. Below, we discuss the factors underlying these issues and propose guidelines that can enable enhanced comparability and reproducibility of future studies.

Applying consistent abiotic stress is relatively straightforward for heat, cold, salinity, UV, light, or nutrient deficiencies. Temperature can be raised or lowered for a set time, plants can be subjected to too much or too little light, and solutes or micronutrients can be maintained at predetermined molarities or concentrations to induce the desired deficiency. These discrete conditions enhance reproducibility across labs and enable cross-study comparisons. Our re-analysis of heat stress data in Arabidopsis supports this claim, as we identified a consistent expression profile under heat regardless of differences in growth conditions, developmental timing, tissue, or other variables across labs and separate studies. Drought stress, in contrast, is more difficult to apply, and the lack of standardization makes it difficult to compare experiments or even distinguish between control and stress samples in some cases. Drought itself is a complex and ill-defined stress with unclear delineations of mild and severe or tolerance and avoidance. More than other abiotic stresses, environmental contexts such as the volume and composition of soil or media, temperature, vapor pressure deficit, light intensity, and air flow affect the progression of drought stress. Plants growing in small pots, well-draining soil, higher temperatures, more light, or low humidity will dry faster. Days of drying are reported for most Arabidopsis studies, but this is an arbitrary metric that does not reflect discrete plant physiological states and is impossible to standardize across conditions without additional data to measure the water status of the plant and its soil. Even drought studies performed in the same environment can be challenging to standardize, as different genotypes or mutants can use water at different rates, causing individual plants to experience differing drought severities within the same timeframe and experiment (Ginzburg et al., 2022).
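
Leaf relative water content, one of the paired physiology measurements listed in the Results, is a common way to express plant water status on a scale that does not depend on days of drying. The sketch below uses the standard fresh, turgid, and dry weight formulation, along with gravimetric soil water content; the function names and the example weights are illustrative assumptions, not values from any study analyzed here.

```python
def relative_water_content(fresh_g, turgid_g, dry_g):
    """Leaf relative water content (%) = (FW - DW) / (TW - DW) * 100."""
    if turgid_g <= dry_g:
        raise ValueError("turgid weight must exceed dry weight")
    return (fresh_g - dry_g) / (turgid_g - dry_g) * 100.0

def gravimetric_water_content(wet_soil_g, dry_soil_g):
    """Soil water content as grams of water per gram of oven-dry soil."""
    if dry_soil_g <= 0:
        raise ValueError("dry soil weight must be positive")
    return (wet_soil_g - dry_soil_g) / dry_soil_g

# Hypothetical weights in grams for one leaf and one pot of soil.
rwc = relative_water_content(fresh_g=0.80, turgid_g=1.00, dry_g=0.20)
soil = gravimetric_water_content(wet_soil_g=120.0, dry_soil_g=100.0)
```

Reporting values like these alongside each RNAseq sample would let two experiments be compared by measured water status rather than by the number of days without irrigation.
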

We observed that a surprising 39% of drought RNAseq experiments in Arabidopsis report no drought physiology data, potentially undermining any findings, and preventing these datasets from being used in comparative studies or meta-analyses. Many of these studies simply withheld watering for days to weeks and collected control and ‘drought’ stressed tissues for RNAseq and other downstream analyses. No measurements of soil water content or physiological responses were performed. Other studies measured only soil water content without paired measurements in plant tissues, providing no estimate of how stress was affecting the plants. These factors likely explain why we were unable to clearly separate stressed and control samples across studies, even when controlling for experimental artifacts and batch effects. Many ‘drought’ samples have expression signatures that mirror well-watered tissues, suggesting that many plants were not experiencing the physiologically relevant drought stress that authors thought had been applied. This noisiness and heterogeneity made it difficult to develop an accurate predictive model of drought stress, and the predictive accuracy was more or less random for almost a third of studies.

Drought datasets outside of Arabidopsis are generally higher quality, with more consistent application of drought conditions and distinct signatures of water deficit responses. The reason for this is unclear. It is not difficult to stress Arabidopsis, and physiological signatures of water stress can be seen with a moderate decrease in water potential (-0.5 to -1.5 MPa). Compared to Arabidopsis, maize, rice, tomato, wheat, and soybean are generally grown at higher temperatures under higher light intensity, and they have much higher rates of evapotranspiration.
When combined with greater plant biomass and the use of small pots, these species may experience drought stress after only a few days without watering in growth chambers and greenhouse settings. In contrast, small or uncrowded Arabidopsis plants may grow well for a week or more between waterings without experiencing water deficit (Ginzburg et al., 2022). Consequently, the community standard of withholding water for 7-10 days for Arabidopsis may be insufficient for stressing plants, and this could explain the prevalence of lower quality datasets in Arabidopsis research.

Applying consistent drought is challenging and efforts to standardize stress severity have seen mixed success. Automated phenotyping systems can accurately manage soil water content, but they are expensive, have limited scale and flexibility, and plants are still subjected to other environmental fluctuations. While solutes like PEG or sorbitol can help control soil water potential, they may induce unnatural responses in plants (Gonzalez et al., 2022). Rainout shelters or controlled irrigation can help replicate natural drought conditions, but plants are often subjected to other environmental stresses throughout a growing season. Although there is no universally accepted method for conducting drought studies in plants, accurately quantifying water status and associated physiological responses can facilitate the comparison of experiments across different laboratories (Juenger and Verslues, 2022).
Developing a deeper understanding of drought responses requires integrating datasets that range from mild to sublethal and from a wide sampling of genotypes, tissues, and conditions. Rather than standardizing drought, we suggest that researchers should collect paired physiology, biochemistry, and morphological datasets at sufficient temporal and spatial resolution to quantify plant health. These traits can first verify that plants are experiencing the desired level of stress prior to expensive sequencing, and serve as features or covariates for integration and re-analysis of multiple datasets.

Methods

Assembling a representative catalog of gene expression RNAseq data
We assembled a database of drought RNAseq data in Arabidopsis, soybean, tomato, rice, maize, and wheat from the NCBI Sequence Read Archive (SRA). Bulk data was retrieved using a series of drought or heat stress related keywords with the SRA Advanced Search Builder. The following metadata was collected for each experiment: tissue type(s), developmental stage, environment (e.g., greenhouse, field, growth chamber), media type, duration of stress, mechanism of drying, associated physiology datasets, genotype, number of timepoints, and number of replicates. Across all six species, 112 studies had a linked publication in the NCBI metadata and 130 had no associated publication. Similar metadata was retrieved for individual SRA samples along with a binary classification of treatment (drought or control) where possible. Metadata was retrieved from the SRA and associated publications, but the lack of publications and ambiguity in some labels led to a high degree of missing or sparse metadata for many samples. Our manual annotations were conservative to reduce the mislabeling of samples for analysis and downstream predictive modeling.

RNAseq data processing
Raw RNAseq reads were downloaded from the NCBI SRA and processed using a pipeline to trim, align, and quantify gene expression data (https://github.com/pardojer23/RNAseqV2). Briefly, sequence adapters were trimmed and a quality check was performed on the raw FASTQ files using the fastp program (v0.23.2) (Chen et al., 2018). The cleaned sequencing reads were then pseudo-aligned to the Arabidopsis TAIR10 (Cheng et al., 2017), maize (Zea mays B73 V5) (Hufford et al., 2021), rice (Oryza sativa Kitaake v3.1) (Jain et al., 2019), tomato (Solanum lycopersicum ITAG4.0) (Hosmani et al., 2019), soybean (Glycine max var. Williams 82 V4) (Valliyodan et al., 2019), and wheat (Triticum aestivum cv. Chinese Spring RefSeq v2.1) (Zhu et al., 2021) genomes using salmon (v1.6) (Patro et al., 2017). The transcript level counts were converted to gene level using the R package tximport (v1.22.0) (Soneson et al., 2015). Raw TPMs or log2(TPM + 1) transformed values were used for downstream analyses. The median alignment rate is 69.1% across all species: 70.8% in Arabidopsis, 79.5% in maize, 62.0% in rice, 65.1% in tomato, 64.7% in soybean, and 64.9% in wheat. These alignment rates are consistent with other meta-analyses of gene expression in Arabidopsis (Zhang et al., 2020). Principal Component Analysis (PCA) was performed using built-in functions in scikit-learn (Pedregosa et al., 2011) on the log2 transformed gene expression data (TPMs) to reduce dimensionality and capture the main sources of variation within the datasets. The first two principal components were plotted for each species and labeled by various factors.

Predictive modeling of drought stress responses
Our previous work on predictive modeling using drought gene expression in sorghum and maize found that the Random Forest ensemble learning method performed best for classification (Pardo et al., 2022), so we first tested Random Forest on our data. Random Forest models were constructed using the RandomForestClassifier class from scikit-learn (v1.1.0) (Pedregosa et al., 2011). To select the hyper-parameters, the RandomizedSearchCV function was used, with 100 iterations and 3-fold cross-validation to traverse the parameter space. Samples were split into 75% training and 25% testing, and model performance was compared using the full unbalanced datasets for each species as well as balanced subsets. Analyses using balanced datasets in Arabidopsis had slightly lower but similar precision and recall, so the full set of samples was used. Feature importance was calculated using the mean decrease in impurity (Gini importance) as implemented in scikit-learn (v1.1.0). All genes were subsequently ranked by their respective importance score. Enriched Gene Ontology terms were calculated for the top predictive features in Arabidopsis using the PANTHER classification system (Mi et al., 2013).

We also tested predictive classification with three additional machine learning algorithms implemented in scikit-learn (v1.1.0): a Linear Support Vector Classifier (LinearSVC), a simple neural network (via the Multi-layer Perceptron classifier, MLPClassifier), and a Histogram-based Gradient Boosting Classifier (HistGradientBoostingClassifier). The Linear Support Vector Classifier was implemented using the LinearSVC class; the fit method was used to train the model, and the predict method was applied to generate predictions. A simple neural network was developed using the MLPClassifier class. We initialized the MLP with one hidden layer of 100 neurons, and trained the model using the fit method before making predictions with the predict method. The Histogram-based Gradient Boosting Classification Tree, a variant of gradient boosting that is much faster than the traditional Gradient Boosting Classifier, was implemented using the HistGradientBoostingClassifier class from the sklearn.ensemble module.
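
The Random Forest tuning and feature-ranking steps described earlier in this subsection could look roughly like the sketch below. It is an illustration only, not the authors' code (their notebooks are the authoritative source). The expression matrix is synthetic, the gene identifiers and the hyper-parameter search space are hypothetical because the text does not list the ranges sampled, and the stratified split and fixed random seeds are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a samples-by-genes matrix of log2(TPM + 1) values.
rng = np.random.default_rng(0)
n_samples, n_genes = 120, 50
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)        # 0 = control, 1 = drought
X[y == 1, :5] += 2.0                          # make the first five "genes" informative
gene_ids = np.array([f"gene_{i}" for i in range(n_genes)])  # hypothetical IDs

# 75% training / 25% testing split, as in the text (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; the ranges actually sampled are not given in the text.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # the text reports 100 iterations; reduced here to run quickly
    cv=3,             # 3-fold cross-validation, as in the text
    random_state=0,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

# Rank genes by mean decrease in impurity, highest importance first.
order = np.argsort(best_model.feature_importances_)[::-1]
ranked_genes = gene_ids[order]

print(f"held-out accuracy: {test_accuracy:.2f}")
print("top-ranked genes:", ranked_genes[:5].tolist())
```

With real data, `X` would hold one row per SRA sample and one column per gene, and the top of `ranked_genes` would be the candidate list passed on to Gene Ontology enrichment.
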
HistGradientBoostingClassifier is capable of handling missing values, and it also applies early stopping to avoid overfitting. For all the models, we evaluated their performance by calculating the accuracy score and generating a classification report, which included precision, recall, F1-score, and support for each class. Random Forest outperformed the other classifiers in all instances, and was thus used for downstream analyses.

Data availability: The data analyzed in this meta-analysis are detailed in Supplemental Table 1, including Sequence Read Archive (SRA) identifiers, PubMed IDs, and all other sample metadata. Raw expression values, expressed in transcripts per million, are accessible on Dryad (https://doi.org/10.5061/dryad.7sqv9s50g). Jupyter notebooks containing all Python code used in this project, along with additional metadata, are available on GitHub: https://github.com/bobvanburen/Drought_meta_analysis_VanBuren_etal_2024.

Acknowledgements

This work was funded, in part, by the Water and Life Interface Institute (NSF-DBI-2213983) to SYR, RAM, and RV, the United States Department of Agriculture National Institute of Food and Agriculture (USDA-NIFA 2022-67013-36118) to RV, and the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomics Sciences Program grants DE-SC0021286 and DE-SC0023160 to SYR. CM, JP, and JS were supported by the predoctoral training award T32-GM110523 from the National Institute of General Medical Sciences of the NIH. AP, JS, and MLW were supported by the National Science Foundation Research Traineeship Program (NSF-NRT 1828149).

References

Artur, M.A.S. and Kajala, K. (2021). Convergent evolution of gene regulatory networks underlying plant adaptations to dry environments. Plant Cell Environ. 44: 3211–3222.
Basu, S., Ramegowda, V., Kumar, A., and Pereira, A. (2016). Plant adaptation to drought stress. F1000Res. 5.
Behdenna, A., Colange, M., Haziza, J., Gema, A., Appé, G., Azencott, C.-A., and Nordor, A. (2023). pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. BMC Bioinformatics 24: 459.
Bowles, A.M.C., Paps, J., and Bechtold, U. (2021). Evolutionary Origins of Drought Tolerance in Spermatophytes. Front. Plant Sci. 12: 655924.
Bray, E.A. (2004). Genes commonly regulated by water-deficit stress in Arabidopsis thaliana. J. Exp. Bot. 55: 2331–2341.
Chaves, M.M. and Oliveira, M.M. (2004). Mechanisms underlying plant resilience to water deficits: prospects for water-saving agriculture. J. Exp. Bot. 55: 2365–2384.
Cheng, C.-Y., Krishnakumar, V., Chan, A.P., Thibaud-Nissen, F., Schobel, S., and Town, C.D. (2017). Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89: 789–804.
Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34: i884–i890.
Clauw, P. et al. (2016). Leaf Growth Response to Mild Drought: Natural Variation in Arabidopsis Sheds Light on Trait Architecture. Plant Cell 28: 2417–2434.
Clauw, P., Coppens, F., De Beuf, K., Dhondt, S., Van Daele, T., Maleux, K., Storme, V., Clement, L., Gonzalez, N., and Inzé, D. (2015). Leaf responses to mild drought stress in natural variants of Arabidopsis. Plant Physiol. 167: 800–816.
Danilevskaya, O.N., Yu, G., Meng, X., Xu, J., Stephenson, E., Estrada, S., Chilakamarri, S., Zastrow-Hayes, G., and Thatcher, S. (2019). Developmental and transcriptional responses of maize to drought stress under field conditions. Plant Direct 3: e00129.
Des Marais, D.L., McKay, J.K., Richards, J.H., Sen, S., Wayne, T., and Juenger, T.E. (2012). Physiological genomics of response to soil drying in diverse Arabidopsis accessions. Plant Cell 24: 893–914.
Dubois, M., Claeys, H., Van den Broeck, L., and Inzé, D. (2017). Time of day determines Arabidopsis transcriptome and growth dynamics under mild drought. Plant Cell Environ. 40: 180–189.
Eckardt, N.A. et al. (2023). Climate change challenges, plant science solutions. Plant Cell 35: 24–66.
Ekundayo, O.Y., Abiodun, B.J., and Kalumba, A.M. (2022). Global quantitative and qualitative assessment of drought research from 1861 to 2019. International Journal of Disaster Risk Reduction 70: 102770.
Farooq, M., Wahid, A., Kobayashi, N., Fujita, D., and Basra, S.M.A. (2009). Plant Drought Stress: Effects, Mechanisms and Management. In Sustainable Agriculture, E. Lichtfouse, M. Navarrete, P. Debaeke, S. Véronique, and C. Alberola, eds (Springer Netherlands: Dordrecht), pp. 153–188.
Ginzburg, D.N., Bossi, F., and Rhee, S.Y. (2022). Uncoupling differential water usage from drought resistance in a dwarf Arabidopsis mutant. Plant Physiol. 190: 2115–2121.
Gonzalez, S., Swift, J., Xu, J., Illouz-Eliaz, N., Nery, J.R., and Ecker, J.R. (2022). Mimicking genuine drought responses using a high throughput plate assay. bioRxiv: 2022.11.25.517922.
Granier, C. et al. (2006). PHENOPSIS, an automated platform for reproducible phenotyping of plant responses to soil water deficit in Arabidopsis thaliana permitted the identification of an accession with low sensitivity to soil water deficit. New Phytol. 169: 623–635.
Groen, S.C., Joly-Lopez, Z., Platts, A.E., Natividad, M., Fresquez, Z., Mauck, W.M., Quintana, M.R., Cabral, C.L.U., Torres, R.O., Satija, R., Purugganan, M.D., and Henry, A. (2022). Evolutionary systems biology reveals patterns of rice adaptation to drought-prone agro-ecosystems. Plant Cell 34: 759–783.
Großkinsky, D.K., Svensgaard, J., Christensen, S., and Roitsch, T. (2015). Plant phenomics and the need for physiological phenotyping across scales to narrow the genotype-to-phenotype knowledge gap. J. Exp. Bot. 66: 5429–5440.
Gupta, A., Rico-Medina, A., and Caño-Delgado, A.I. (2020). The physiology of plant responses to drought. Science 368: 266–269.
Hincha, D.K., Zuther, E., and Popova, A.V. (2021). Stabilization of Dry Sucrose Glasses by Four LEA_4 Proteins from Arabidopsis thaliana. Biomolecules 11.
Hosmani, P.S. et al. (2019). An improved de novo assembly and annotation of the tomato genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. bioRxiv: 767764.
Hufford, M.B. et al. (2021). De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373: 655–662.
Jain, R. et al. (2019). Genome sequence of the model rice variety KitaakeX. BMC Genomics 20: 905.
Juenger, T.E. and Verslues, P.E. (2022). Time for a drought experiment: Do you know your plants’ water status? The Plant Cell.
Klepikova, A.V., Kasianov, A.S., Gerasimov, E.S., Logacheva, M.D., and Penin, A.A. (2016). A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant J. 88: 1058–1070.
Lamesch, P. et al. (2012). The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40: D1202–10.
Liu, Q., Kasuga, M., Sakuma, Y., Abe, H., Miura, S., Yamaguchi-Shinozaki, K., and Shinozaki, K. (1998). Two transcription factors, DREB1 and DREB2, with an EREBP/AP2 DNA binding domain separate two cellular signal transduction pathways in drought- and low-temperature-responsive gene expression, respectively, in Arabidopsis. Plant Cell 10: 1391–1406.
Mi, H., Muruganujan, A., Casagrande, J.T., and Thomas, P.D. (2013). Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8: 1551–1566.
Nordin, K., Heino, P., and Palva, E.T. (1991). Separate signal pathways regulate the expression of a low-temperature-induced gene in Arabidopsis thaliana (L.) Heynh. Plant Mol. Biol. 16: 1061–1071.
Osmolovskaya, N., Shumilina, J., Kim, A., Didio, A., Grishina, T., Bilova, T., Keltsieva, O.A., Zhukov, V., Tikhonovich, I., Tarakhovskaya, E., Frolov, A., and Wessjohann, L.A. (2018). Methodology of Drought Stress Research: Experimental Setup and Physiological Characterization. Int. J. Mol. Sci. 19.
Pardo, J., Wai, C.M., Harman, M., Nguyen, A., Kremling, K.A., Romay, C., Lepak, N., Bauerle, T.L., Buckler, E.S., Thompson, A.M., and VanBuren, R. (2022). Cross-species predictive modeling reveals conserved drought responses between maize and sorghum. bioRxiv: 2022.09.26.509573.
Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14: 417–419.
Pedregosa, Varoquaux, and Gramfort (2011). Scikit-learn: Machine learning in Python. of machine Learning ….
Sielemann, K., Hafner, A., and Pucker, B. (2020). The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ 8: e9954.
Soneson, C., Love, M.I., and Robinson, M.D. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 4: 1521.
Sreedasyam, A. et al. (2023). JGI Plant Gene Atlas: an updateable transcriptome resource to improve functional gene descriptions across the plant kingdom. Nucleic Acids Res.
Tardieu, F. (2012). Any trait or trait-related allele can confer drought tolerance: just design the right drought scenario. J. Exp. Bot. 63: 25–31.
Turner, N.C. (1986). Adaptation to Water Deficits: a Changing Perspective. Funct. Plant Biol. 13: 175–190.
Valliyodan, B. et al. (2019). Construction and comparison of three reference-quality genome assemblies for soybean. Plant J. 100: 1066–1082.
Verslues, P.E. et al. (2023). Burning questions for a warming and changing world: 15 unknowns in plant abiotic stress. Plant Cell 35: 67–108.
Verslues, P.E. and Juenger, T.E. (2011). Drought, metabolites, and Arabidopsis natural variation: a promising combination for understanding adaptation to water-limited environments. Curr. Opin. Plant Biol. 14: 240–245.
Yamaguchi-Shinozaki, K. and Shinozaki, K. (1993). The plant hormone abscisic acid mediates the drought-induced expression but not the seed-specific expression of rd22, a gene responsive to dehydration stress in Arabidopsis thaliana. Mol. Gen. Genet. 238: 17–25.
Yoshiba, Y., Kiyosue, T., Katagiri, T., Ueda, H., Mizoguchi, T., Yamaguchi-Shinozaki, K., Wada, K., Harada, Y., and Shinozaki, K. (1995). Correlation between the induction of a gene for delta 1-pyrroline-5-carboxylate synthetase and the accumulation of proline in Arabidopsis thaliana under osmotic stress. Plant J. 7: 751–760.
Zhang, H., Zhang, F., Yu, Y., Feng, L., Jia, J., Liu, B., Li, B., Guo, H., and Zhai, J. (2020). A comprehensive online database for exploring ~20,000 public Arabidopsis RNA-seq libraries. Mol. Plant 13: 1231–1233.
Zhu, T. et al. (2021). Optical maps refine the bread wheat Triticum aestivum cv. Chinese Spring genome assembly. Plant J. 107: 303–314.

Figures

Figure 1. Summary of Arabidopsis drought gene expression metadata. Metadata was collected for the 36 BioProjects with associated publications. The top row shows a histogram of developmental stage or age of plants (in days) at sampling, the environment where studies were conducted, and the media plants were propagated in. The bottom row shows a histogram of the duration of water deficit stress, mechanism of drying, and paired physiology data. Studies using PEG, air drying, and ABA are not plotted in the stress duration graph, as experiment times ranged from 1-8 hours. Abbreviations are as follows: relative water content (RWC), soil moisture content (SMC), fresh weight (FW), and electrolyte leakage (EL).

Figure 2. Dramatic variability of public drought gene expression datasets in Arabidopsis. (a) Principal component analysis of 1,301 drought related RNAseq samples collected from the sequence read archive (SRA). The first two principal components are plotted for all samples and colored by different factors including a binary classification of drought and control (upper left), BioProject (upper right), genotype/accession of the sample (Col-0 or others; bottom left), and the tissue type (bottom right). (b) Violin plot of log2 transformed TPM of RNAseq data for nine drought marker genes in samples classified as control (left) and drought (right).

Figure 3. Comparison of public heat stress gene expression data in Arabidopsis. (a) Principal component analysis of 156 heat stress related RNAseq samples collected from the sequence read archive (SRA). The first two principal components are plotted for all samples and colored by different factors including a binary classification of heat stressed and control (top), BioProject (middle), and the tissue type (bottom). (b) Violin plot of log2 transformed TPM of RNAseq data for four drought marker genes in samples classified as control (left) and heat stress (right). (c) Histogram of the same log2 transformed gene expression data as (b) for two heat stress marker genes, HSP70 (AT3G12580; left) and DREB2A (AT5G05410; right).

Figure 4. Predictive modeling of water stress in drought expression data. (a) Receiver operating characteristic curve showing the performance of the Random Forest based drought classification model across all classification thresholds. (b) Confusion matrix of the drought predictive model. (c) Multi-dimensional scaling plot showing clusters of enriched gene ontology terms for the top 100 most important features (genes) in the Arabidopsis Random Forest machine learning models. The size of each circle is proportional to the number of genes annotated with each term and the circles are colored by the log10 of the adjusted p-value. Histogram of the predictive accuracy of the Random Forest models for classifying drought using a leave one experiment out approach for Arabidopsis (d) and rice (e). A predictive accuracy of 1 corresponds to perfect prediction, and 0.5 is more or less random.

Figure 5. Variability of drought gene expression datasets across model and crop species. PCA plots are shown for drought gene expression datasets in the eudicots Arabidopsis, soybean, and tomato (top) and monocots rice, maize, and wheat (bottom). The PCA of Arabidopsis data from Figure 2a is re-drawn here to enable easier comparison. The first two principal components are plotted for all samples and colored by different factors including a binary classification of drought (blue) and control (gold). Samples corresponding to recovery or unclear timepoints are not included. PCAs from log2 transformation of the raw TPMs are shown on the top panel for each species, and ComBat corrected samples are shown below.
