GLARE: Discovering hidden patterns in spaceflight transcriptome using representation learning

doi:10.1101/2024.06.04.597470

2024 · doi:10.1101/2024.06.04.597470

preprint OA: closed CC-BY-4.0

Abstract

Spaceflight studies present novel insights into biological processes through exposure to stressors outside the evolutionary path of terrestrial organisms. Despite limited access to space environments, numerous transcriptomic datasets from spaceflight experiments are now available through NASA’s GeneLab data repository, which allows public access to these datasets, encouraging further analysis. While various computational pipelines and methods have been used to process these transcriptomic datasets, learning-model-driven analyses have yet to be applied to a broad array of such spaceflight-related datasets. In this study, we propose an open-source framework, GLARE: GeneLAb Representation learning pipelinE, which consists of training different representation learning approaches from manifold learning to self-supervised learning that enhances the performance of downstream analytical tasks such as pattern recognition. We illustrate the utility of GLARE by applying it to gene-level transcriptional values from the results of the CARA spaceflight experiment, an Arabidopsis root tip transcriptome dataset that spanned light, dark, and microgravity treatments. We show that GLARE not only substantiated the findings of the original study concerning cell wall remodeling but also revealed additional patterns of gene expression affected by the treatments, including evidence of hypoxia. This work suggests there is great potential to supplement the insights drawn from initial studies on spaceflight omics-level data through further machine-learning-enabled analyses.

Keywords

Machine Learning, Representation Learning, Spaceflight, RNA-seq, Transcriptomics

bioRxiv preprint doi: https://doi.org/10.1101/2024.06.04.597470; this version posted June 6, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

1 Introduction

Spaceflight studies present unprecedented insights into biological processes through exposure to unique environmental stressors that have not been experienced by any form of life on Earth. In response to the spaceflight environment, organisms initiate specific transcriptional responses to novel conditions. Thus, one key to understanding how biology responds to spaceflight stressors like microgravity, radiation, and hypoxia is through transcriptomic analysis to study the gene expression profiles that drive physiological adaptation triggered by the spaceflight environment (Mustroph et al., 2010). Space-related transcriptional studies have now also broadened into multi-omic spaceflight investigations that are well-suited to multiple rounds of analysis facilitated by the publicly available datasets in the NASA GeneLab database.

The importance of studying plant biology specifically in space has been identified both for exploring the fundamental responses of biology to the spaceflight environment and at a very practical level for developing bio-regenerative life support systems for long-term space exploration (Rutter et al., 2020; Fu et al., 2016). Understanding of transcriptomic and physiological changes elicited in plants by spaceflight conditions through analyzing transcriptional and other -omic patterns is therefore a focus of much current plant space biology experimentation (e.g., Paul et al. (2013); Villacampa et al. (2021)).
For example, the CARA (Characterizing Arabidopsis Root Attraction) experiment was designed to compare the spaceflight transcriptome responses between different genotypes of Arabidopsis thaliana’s root tips under various conditions (Paul et al., 2017). This experiment explored the patterns of gene expression from root tip cells in the spaceflight environment on the International Space Station (ISS), with comparable ground controls and the lighting sub-environments among three different genotypes. While these kinds of experiments in plant space biology have provided many key insights, they have so far largely relied upon the primary transcriptomic analysis of the original research team. To provide a framework that can be applied to increase the depth of transcriptomic analyses for previous and future spaceflight experiments, we introduce GLARE: GeneLAb Representation learning pipelinE. We show the utility of the GLARE pipeline by applying it to the CARA dataset to illustrate how applying novel machine-learning methods to transcriptomic datasets extends insights beyond the original transcriptomic analysis of this data, adding new perspectives.

Our analysis pipeline applies state-of-the-art representation learning models to find underlying patterns in the FPKM values (fragments per kilobase of transcript per million mapped fragments) that are proportional to the abundance of each locus’s transcript. These representation learning models allow for better data point representation and clustering using unsupervised learning methods. These methods allow for further investigation of the effects of spaceflight on, e.g., phytohormone signaling and associated physiological phenotypes (Abts et al., 2017; Ferl and Paul, 2016; Iqbal et al., 2017). Moreover, considering that the CARA experiment also utilized lighting sub-environments, we can shine further light on the potential spaceflight effects that were neglected in past studies. Overall, the GLARE method will provide insights to better understand plant behavior in the spaceflight environment based on its endogenous and exogenous cues.

2 Materials and Methods

2.1 GeneLab Data System and Data Entries

The GeneLab Data System (GLDS) is a public, space-related -omics data repository, which curates data from a wide variety of species and experimental spaceflight conditions (Ray et al., 2018). GLDS obtains spaceflight-related -omics datasets from multiple sources, such as the Gene Expression Omnibus (GEO), the European Bioinformatics Institute (EBI), publications directly, and others (Ray et al., 2018). This data is then cataloged with the relevant metadata, such as protocols, payload numbers, and experimental variables, and made available as an Open Science Dataset (OSD) in NASA’s Open Science Data Repository (OSDR).

The CARA dataset (OSD-120; https://osdr.nasa.gov/bio/repo/data/studies/OSD-120) was chosen from GeneLab for use with GLARE due to its many experimental conditions.
The CARA experiments were conducted with three ecotypes/genotypes of Arabidopsis thaliana: wild-type Wassilewskija (WS), wild-type Columbia-0 (Col-0), and a mutant in the PHYTOCHROME D gene in the Col-0 background (PHYD) (Paul et al., 2017). Briefly, these genotypes were planted on gel media in Petri dishes and grown in either ambient light conditions or in the dark on the ISS for 11 days; parallel controls were performed on the ground. After the 11 days, germinated seedlings were photographed and collected into Kennedy Space Center Fixation Tubes (KFTs; Ferl et al. (2011)) containing RNAlater. Seedlings preserved in RNAlater were returned to Earth frozen, and then the roots were dissected into the last 2 mm of the tip for the light-grown plants and the last 1 mm for the dark-grown plants. RNA was extracted and sent to the Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, for RNA sequencing using a NextSeq 500 system, producing ∼40 million paired-end reads per sample. Finally, these paired-end reads were mapped to the TAIR10 A. thaliana reference genome using Spliced Transcripts Alignment to a Reference (STAR) software, and differential expression analysis was performed using the Cufflinks tool (Dobin et al., 2013; Trapnell et al., 2012).

2.2 High-dimensional Data Analysis

Overview: Statistical methods have been widely integrated into the bioinformatics pipeline in multi-omics studies for both analyzing and preprocessing the data.
Specifically, because multi-omics datasets have complex data topology, dimension reduction and clustering are two commonly used techniques for further investigation (Rappoport and Shamir, 2018). GLARE capitalizes upon such approaches. For example, Principal Component Analysis (PCA) and Factor Analysis are fundamental methods with widespread application for dimensionality reduction (Zeng and Lumley, 2018). After achieving a statistical representation of the dataset with these dimensionality reduction techniques, clustering methods are utilized to group similar representations to uncover underlying patterns within the dataset. Among these, K-means and hierarchical clustering are two of the most favored methodologies (Hulot et al., 2020).

2.2.1 Learning Data Representations

While PCA is popular for its simplicity, its linear embedding can lose essential features, which often degrades the clustering quality (Gan et al., 2020). Several alternative methods that rely not only on data point distribution but also leverage latent data structures via learned representations have shown advantages in handling biological data, thereby enhancing clustering precision (Karim et al., 2021). GLARE also uses these approaches in its analyses. These alternatives to PCA include t-distributed Stochastic Neighbour Embedding (t-SNE) (Van der Maaten and Hinton, 2008), a non-linear dimensionality reduction technique particularly adept at preserving local structures within high-dimensional data, and Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018), a manifold learning approach that efficiently captures complex relationships within the data. However, alternative deep-learning-based approaches for obtaining data representations have been largely neglected in the field of plant biology, despite their ability to capture contextual information from non-linear mappings.
Specifically, this approach of capturing contextual information through complex, higher-level features is known as representation learning or feature extraction (Aljalbout et al., 2018). Therefore, along with PCA, t-SNE, and UMAP, we have investigated the application of the Sparse Autoencoder (SAE) as one of the representation learning methods in the GLARE pipeline. SAE is an unsupervised learning algorithm based on a neural network that aims to learn an approximation of the identity function that represents the data. The model is trained by encoding the data in its feedforward phase, but with sparsity constraints that only activate neurons with the largest activation, allowing the discovery of the unique structure in the data (Makhzani and Frey, 2013). While autoencoders are more commonly used for reconstructing the original input data, prior studies show that autoencoders also work well as a representation learning approach for multi-omics datasets (Chaudhary et al., 2018).

Upon employing multiple approaches to obtain data representations, evaluating these representations is critical to understanding the strengths and limitations of the various techniques. Prior research has used several evaluation techniques to assess both the fidelity between data representations and the original dataset and the quality of the data representation structure, so we have used these methods in the development of the GLARE approach. Reconstruction error analysis, often conducted through linear regression, and trustworthiness scores that measure faithfulness are widely applied to test fidelity by comparing the original data and the learned representation (Hinton and Salakhutdinov, 2006; Van Der Maaten, 2009).
To test the quality of the data structure, the K-Nearest Neighbors (KNN) classifier can be utilized to assess neighborhood preservation, showing the ability of the representation to maintain local structure and inherent relationships (Liao and Vemuri, 2002). Furthermore, the Silhouette score measured via k-means clustering is widely used to gain insight into the clustering performance and compactness of the data representation (Rousseeuw, 1987).

2.2.2 Clustering Data Representations

Within the clustering paradigm, several alternative methods to K-means exist for the effective organization of these representations, and we have explored their application as part of the GLARE pipeline. Among these, Gaussian Mixture Models (GMM) with the Expectation-Maximization (EM) algorithm offer a probabilistic framework, wherein each cluster is represented by a Gaussian distribution, facilitating more nuanced cluster assignments (Reynolds et al., 2009). Density-based clustering methods have gained considerable attention for their ability to detect clusters of arbitrary shapes and sizes, thus overcoming some of the limitations associated with distance-based methods (Ester et al., 1996). Notably, an extension of this approach, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), utilizes a hierarchical approach to density-based clustering to robustly identify clusters at multiple levels with varying densities (Campello et al., 2013).
Additionally, spectral clustering presents an alternative approach, leveraging the eigenstructure of the similarity matrix to partition the data into clusters, thereby offering an effective means of characterizing complex structures within the dataset (Ng et al., 2001).

Ensemble clustering is an additional powerful technique that combines these multiple clustering solutions to obtain consensus clusters that are more robust and accurate. Several ensemble clustering methods have been proposed, including Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005), which accumulates evidence from different base clustering algorithms to build a co-association matrix. Applying hierarchical clustering to this matrix derives a final consensus clustering result. Other notable examples include the HyperGraph-Partitioning Algorithm (HGPA) (Strehl and Ghosh, 2002), which derives consensus clustering by partitioning a hypergraph in which each base clustering set is a hyperedge and the vertices represent data points. These ensemble techniques have demonstrated their utility in various domains, such as bioinformatics, text mining, and computer vision, where data is often high-dimensional, noisy, and complex (Vega-Pons and Ruiz-Shulcloper, 2011). Therefore, they are strong candidates for equivalent analyses of the often highly complex structures that make up plant transcriptomics datasets.

3 Results

3.1 GLARE: GeneLAb Representation learning pipelinE

We introduce GLARE, a representation learning pipeline designed to empower researchers to move beyond conventional dimensionality reduction techniques in their omics-focused research, such as reliance on PCA or t-SNE. The GLARE framework enables the extraction of data representations using a trained learning-based model, thereby allowing the exploration of latent structures to unveil the hidden patterns inherent within the dataset.
We first report a verification study by training a classification model on the CARA study’s spaceflight and ground control data, followed by the full analytical pipeline to highlight GLARE’s ability to both confirm patterns revealed in the published primary analyses and reveal novel patterns within the data.

3.1.1 Verification study

Prior to applying the full end-to-end pipeline of GLARE, we perform a verification study through a prediction task to ensure that learnable patterns indeed exist within spaceflight transcriptome datasets. We focused this analysis on the CARA dataset (OSD-120). To enable this analysis, we first had to reorder the dataset in OSD-120 to be indexed by each experimental feature (analogous to re-sorting the data table to have column headings/labels for each factor). We, therefore, restructured the unlabeled original data by extracting the feature vectors that represent each experiment (such as genotype, spaceflight versus ground control, and lighting regime) and performing data discretization (i.e., reordering the numerical FPKM data within the data table to be indexed, not simply by gene locus versus sample but by gene locus versus each individual experiment factor, Figure 1(a)). This approach produced discrete labels for each instance of a feature, creating a pseudo multi-view dataset X′.
Essentially, X′ contains a reindexed version of the information in the original data, which has 36 continuous features as it contains results from two locations (spaceflight versus ground), under two different light conditions (light versus dark), and for three genotypes (WS, Col-0, and PHYD mutants), with three replicate samples of each.

This restructuring allows the experiment environment to be explicitly indicated through the labels (Xu et al., 2013). We then trained the classification model using XGBoost (Chen and Guestrin, 2016) on the restructured and discretized data X′, using the concatenated feature vectors as the input matrix and the discretized labels that represent the experiment environment as target labels. High predictive performance would indicate that learnable patterns are indeed present in the original data, thus motivating the use of unsupervised representation learning techniques from GLARE. Figure 1 shows an illustration of how we reconstructed our dataset through data discretization and the prediction performance of the best-performing data model. A comparison of prediction performance across multiple data discretization models on a held-out test set of X′ is presented in Table 1. This test set was set aside during training to be used for evaluating the model performance on unseen data. Data models that were tested include our ‘base’ discretization model, where we have location labels indicating if the experiments were performed in space (ISS) or on the ground (KSC), having 18 continuous features and one flight versus ground label. We also discretize other experiment settings such as ‘Genotype’ and ‘Light condition’, adding these additional discretized labels to act as further categorical predictors.
We found that our base data model that only discretizes the location variable yields the highest performance, with ∼91% test accuracy on predicting whether the experiments were done in space or on the ground based on the normalized counts of FPKM values.

Figure 1: Illustration of data discretization for data reconstruction and prediction performances. (a) Illustration showing the restructuring process of our base data model where we discretize the experiment location. Raw FPKM numerical data (denoted as ### in tables) from the OSDR record is organized as reads per locus (e.g., AT1G01010, AT1G01020 across all ∼25,000 genes in the Arabidopsis genome) for each experimental sample (e.g., Flight sample of Columbia ecotype grown in the light, FLT_col_Light, or Ground control sample of Columbia ecotype grown in the light, GC_col_Light) as shown left. After discretization (right table), each gene has two instances, one from the space sample and one from the ground sample. (b) ROC curves using the training and test dataset on the best-performing data model, which is the base model. Blue line represents the XGBoost classifier, showing the true positive rate against the false positive rate of the model predictions from the training data (left) and the test set (right). Red line is the random chance baseline. FLT, spaceflight; GC, ground control.
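
The restructuring in Figure 1(a) can be illustrated with a minimal sketch. It assumes the wide table is a mapping from gene ID to per-sample FPKM values and that sample names begin with a location prefix (FLT or GC), as in the sample names quoted in the caption. The function name `discretize_by_location` and the toy FPKM values are hypothetical and are not taken from the GLARE code repository.

```python
def discretize_by_location(wide_table, location_prefixes=("FLT", "GC")):
    """Split each gene's row into one instance per location.

    wide_table maps gene ID -> {sample name -> FPKM}. Sample names are
    assumed to start with a location prefix, e.g. "FLT_col_Light".
    Returns a list of (gene, features, location_label) tuples.
    """
    instances = []
    for gene, samples in wide_table.items():
        for prefix in location_prefixes:
            # Drop the location prefix so that FLT and GC instances share
            # the same feature names; the location becomes the label.
            features = {
                name[len(prefix) + 1:]: value
                for name, value in sorted(samples.items())
                if name.startswith(prefix + "_")
            }
            if features:
                instances.append((gene, features, prefix))
    return instances


# Toy FPKM values, invented purely for illustration.
wide = {
    "AT1G01010": {"FLT_col_Light": 5.2, "GC_col_Light": 4.8,
                  "FLT_col_Dark": 1.1, "GC_col_Dark": 0.9},
    "AT1G01020": {"FLT_col_Light": 12.0, "GC_col_Light": 15.5,
                  "FLT_col_Dark": 7.3, "GC_col_Dark": 6.6},
}
rows = discretize_by_location(wide)
for gene, features, label in rows:
    print(gene, label, features)
```

Each gene yields two instances, one labeled FLT and one labeled GC, so a classifier trained on the features is asked to recover the location label from expression values alone.
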

Data Model                                    Test Accuracy (%) ↑   F1-score ↑      ROC-AUC ↑
Light & Genotype Discretization               68.69 ± 0.14          0.686 ± 0.001   0.772 ± 0.001
Genotype Discretization                       77.09 ± 0.14          0.770 ± 0.002   0.865 ± 0.001
Light Discretization                          83.18 ± 0.07          0.831 ± 0.001   0.919 ± 0.001
No additional Discretization (‘base model’)   91.29 ± 0.26          0.913 ± 0.002   0.975 ± 0.001

Table 1: Classification performances on held-out test set using XGBoost on data from different data models (with ± standard deviation). F1-score: The harmonic mean of precision (avoidance of false positives) and recall (avoidance of false negatives). ROC-AUC: Area under the Receiver Operating Characteristic curve, summarizing the true positive vs. false positive trade-off. Test Accuracy is the % correctness of predictions for classifying a sample as spaceflight or ground control in the test set using each data model.

The verification study serves dual purposes: 1) As a validity check for the approach prior to deploying the full GLARE pipeline. If the data did not exhibit any learnable and distinctive pattern between the experiment settings that we wanted to compare, then applying unsupervised methods to that data would be ineffective, as the extracted representations would not capture meaningful latent information and would yield poor predictions on the test set. 2) The prediction task from the verification study can serve as the foundation for post-pipeline analysis, enabling the incorporation of feature importance explanation schemes, such as SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017). The feature importance values can reveal, e.g., within CARA, which genotypes and light conditions contributed the most to the predictions overall, as well as provide more insights into specific genes of interest.
Combining these insights with the clustering results from the GLARE pipeline should substantially empower researchers in general to see new patterns in their omics-level data.

Encouraged by the outcome of this verification study, which indicates that a machine learning approach should be able to extract potentially novel features from spaceflight datasets, we implemented a full GLARE pipeline, as described in the following sections.

3.1.2 Preprocessing

GLARE starts by investigating the dataset with the conventional dimensionality reduction approach of PCA to obtain an initial data representation. Then, we utilize the PCA representations to conduct clustering using the k-means algorithm. Figure 2 shows the distribution of the principal components and the clustering results on them for the CARA data. Notably, results of both spaceflight (FLT) and ground control (GC) experiments exhibit similar distributions and clustering patterns, characterized by a concentration of data points within a single cluster.

Figure 2: Outlier detection via PCA and k-means. (a) Clustering result on spaceflight (FLT) data (without any discretization). (b) Clustering results on ground control (GC) data (without any discretization). ∼98% of the data is clustered on cluster A (blue) for both FLT and GC. In this study, from the >25,000 genes in the datasets, we only discard three genes (cluster D and E in (a) and cluster D in (b)) for both FLT and GC that are separated from concentrated clusters A, B, and C. These genes are: AT1G0759, AT3G41768, ATMG00020.
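
The cluster-based outlier filter described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the GLARE implementation: the 2-D points stand in for the first two principal component scores (the PCA step itself is omitted), k-means is a plain Lloyd's algorithm with a deterministic farthest-point initialization chosen here for reproducibility, and the values k = 4 and n_keep = 2 are arbitrary toy settings. The helper names `kmeans` and `keep_largest_clusters` are hypothetical.

```python
import math


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members))
                if members else centers[j]  # keep the old center if empty
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def keep_largest_clusters(labels, n_keep):
    """Return indices of points that fall in the n_keep most populated clusters."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    kept = set(sorted(sizes, key=sizes.get, reverse=True)[:n_keep])
    return [i for i, lab in enumerate(labels) if lab in kept]


# Toy data: one dense blob, one smaller blob, and two isolated points.
points = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
points += [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
points += [(100.0, 100.0), (-80.0, 60.0)]

labels = kmeans(points, k=4)
kept = keep_largest_clusters(labels, n_keep=2)
print(f"kept {len(kept)} of {len(points)} points")
```

Ranking clusters by size means the filter needs no distance threshold: isolated points end up in their own tiny clusters and are dropped together with them.
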

Leveraging this outcome, GLARE employs initial investigation using PCA and k-means clustering as a means of outlier detection (Lei et al., 2012), discarding out-of-distribution clusters and keeping the concentrated clusters. For both FLT and GC data, we take the three most concentrated clusters to the next step of the pipeline.

3.1.3 Representation Learning

Taking the preprocessed data from the prior step, GLARE offers a range of widely applied representation learning techniques, including classic dimension reduction methods like PCA, t-SNE, and UMAP. However, GLARE also incorporates Sparse Autoencoder (SAE), a deep learning-based model that enables efficient data compression while preserving salient features. In this way, it can capture intricate hierarchical structures within the data by simultaneously learning both compressed data representation and the features necessary for reconstruction (Ng et al., 2011; Ranzato et al., 2007). The illustration of the overall pipeline and details of GLARE are shown in Figure 3.

Figure 3: Overall pipeline of GLARE: GeneLAb Representation learning pipelinE. (a) Illustration of GLARE, starting with a verification study followed by preprocessing through detecting outliers using k-means clustering. Using the clean dataset, GLARE provides options for representation learning from PCA to state-of-the-art SAE pre-trained with high-throughput single-cell data. Retrieved data representation is then processed through ensemble clustering to find the hidden patterns within the data. Results from the verification study and ensemble clustering are then used for post-pipeline analysis. (b) Model architecture illustration of employed SAE for both training with and without pre-training. (c) Ensemble clustering using three base clustering algorithms based on different statistical methodologies.
Evidence accumulation clustering is used to derive consensus clusters from these algorithms.

Our implementation of SAE is constructed with a sequence of building blocks, each comprising a Linear layer followed by LayerNorm and Exponential Linear Unit (ELU) activation (Ba et al., 2016; Clevert et al., 2015). We added the LayerNorm block to improve convergence and stabilize optimization, considering that our data consists of multiple experimental results from different environment settings. For the same reason, we employ ELU activation. We use three of these building blocks for the encoder and three blocks for the decoder to make the SAE. The sparsity is induced via L1-regularization to deal with the sparse and heterogeneous nature of normalized counts of FPKM values. The model training is optimized using mean squared error loss and the Adam optimizer (Kingma and Ba, 2014) with weight decay, early stopping, and gradient clipping to address exploding gradients and ensure stable training. A number of hyperparameter settings were tested to find optimal parameter sets, which are described in our shared code repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare). Finally, after the model training, we extract the data representation from the bottleneck layer between the encoder and decoder using this optimized model.

To further enhance the utility of these representations for downstream tasks such as clustering, we introduce an additional self-supervised learning step.
We pre-trained the SAE on high-throughput single-cell data, specifically a single-cell root transcriptome dataset from Shulse et al. (2019), as the CARA dataset is drawn from root tip samples. This pre-training step complements the representations from the model by incorporating detailed single-cell transcriptome profiling of plant root cell types. We then take the pre-trained weights to fine-tune the SAE using our normalized counts data to build the Fine-Tuned SAE (FT-SAE). We maintain the original model structure and introduce adapter layers atop the main model to reconcile the differing dimensions of the single-cell matrix and our data, ensuring seamless integration into our SAE framework. We follow the same procedure for model optimization. The suggested self-supervised learning step offers several advantages, including the augmentation of feature granularity and the incorporation of cellular-level insights, thereby enhancing the fidelity and relevance of the learned representations for downstream analyses (Kiselev et al., 2018). Similar to our approach, foundation models pre-trained with high-throughput single-cell data have demonstrated great utility in a diverse array of tasks in the life science field, including pattern recognition, by incorporating foundational knowledge of the data (Hao et al., 2023).

3.1.4 Ensemble Clustering

GLARE provides an ensemble clustering scheme to improve upon the commonly used application of single clustering approaches. GLARE adopts Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005) as its ensemble clustering method, integrating three base clustering algorithms: GMM, HDBSCAN, and Spectral clustering. Ensemble clustering offers several advantages over relying on a single clustering algorithm.
By merging the clustering outcomes from distinct statistical foundations through consensus voting, followed by hierarchical clustering with average linkage on the generated co-association matrix, we can mitigate the biases and noise inherent in each base clustering method to create more robust and reliable clustering results. Notably, when working with complex data such as representations retrieved from a fine-tuned sparse autoencoder, ensemble clustering can effectively address inherent complexities to capture hidden patterns and discover biologically meaningful clusters (Monti et al., 2003).

In addition to obtaining consensus cluster labels using EAC, researchers can leverage GLARE results from three base clustering algorithms to get unique clusters for each gene by retrieving the intersected cluster from its respective cluster assignments.

3.2 Data Representation Evaluation

In this section, we compare data representations from different algorithms that could be retrieved from GLARE. Figure 4 shows visualizations of each of the representations from FLT and GC using PCA, t-SNE, UMAP, SAE, and FT-SAE. Data representations from SAE and FT-SAE have n dimensions, where n is the number of neurons in the bottleneck layer. This value is determined through hyperparameter tuning and was set as n = 16. All other data representations from PCA, t-SNE, and UMAP are 2-dimensional. As we discovered from the preprocessing step of GLARE, the PCA representation data points are highly condensed in a single region of the map, while the t-SNE and UMAP representations exhibit a more widespread distribution.
On the other hand, the SAE and FT-SAE representations show more cluster-forming shapes in their t-SNE coordinates, where locally condensed points are separated from the others.

Table 2 shows the next element of the analysis, examining these data representations using multiple quantitative evaluation measures: reconstruction error through linear regression, trustworthiness score, neighborhood preservation through KNN classifier accuracy, and Silhouette score through k-means. Among the data representations that can be retrieved from GLARE, we compare all the methods that perform a non-linear transformation of the original dataset, leaving out PCA.

Figure 4: Comparison of data representations retrieved from GLARE. PCA, t-SNE, UMAP, SAE, and FT-SAE from left to right for both FLT and GC data. t-SNE was used for the visualization of the n-dimensional data representations from SAE and FT-SAE.

Environment  Data Representation  Reconstruction Error ↓  Trustworthiness Score ↑  KNN Accuracy ↑  Silhouette Score ↑
FLT          t-SNE                2020.07                 0.964                    98.11           0.3638
FLT          UMAP                 1926.06                 0.949                    97.85           0.3772
FLT          SAE                  2033.51                 0.951                    97.39           0.3782
FLT          FT-SAE               1845.12                 0.884                    98.75           0.5323
GC           t-SNE                2029.77                 0.967                    97.89           0.3584
GC           UMAP                 1968.41                 0.956                    97.49           0.3756
GC           SAE                  2066.24                 0.946                    97.99           0.3871
GC           FT-SAE               1954.56                 0.864                    98.07           0.5397

Table 2: Comparison of various evaluation metrics on data representations. FT-SAE shows the lowest linear reconstruction error, highest KNN accuracy, and highest Silhouette score while having a lower trustworthiness score compared to others for both FLT and GC.
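Of the four measures in Table 2, the trustworthiness score is perhaps the least familiar. The sketch below is a minimal, dependency-free illustration of it, assuming Euclidean distances and the standard rank-based definition (the one also implemented by scikit-learn's `sklearn.manifold.trustworthiness`). The function names, the toy points, and the choice k = 2 are hypothetical and are not taken from the GLARE code.

```python
import math


def neighbor_ranks(points, i):
    """Rank every other point by Euclidean distance from point i (1 = nearest)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return {j: rank for rank, j in enumerate(others, start=1)}


def trustworthiness(original, embedded, k):
    """Penalize points that are k-nearest neighbors in the embedding
    but not in the original space, weighted by their original rank.
    The normalization assumes k < n / 2."""
    n = len(original)
    penalty = 0
    for i in range(n):
        rank_original = neighbor_ranks(original, i)
        rank_embedded = neighbor_ranks(embedded, i)
        for j, rank in rank_embedded.items():
            if rank <= k and rank_original[j] > k:
                penalty += rank_original[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Hypothetical toy data: two well-separated groups on a line.
original = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
# An embedding that keeps every pairwise distance intact.
faithful = [(0,), (1,), (2,), (10,), (11,), (12,)]
# An embedding that interleaves the two groups.
scrambled = [(0,), (10,), (1,), (11,), (2,), (12,)]

print(trustworthiness(original, faithful, k=2))
print(trustworthiness(original, scrambled, k=2))
```

An embedding that preserves every neighborhood scores exactly 1, and the score falls as unrelated points intrude into local neighborhoods, which is the sense in which the FT-SAE value in Table 2 can be lower even when its cluster separation is higher.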
Linear reconstruction provides an effective way to assess how well these non-linear methods preserve the global structure of the data. FT-SAE outperforms the other methods on linear reconstruction, having the lowest error for both FLT and GC. Measuring the Silhouette score and performing KNN classification on the labels from simple k-means clustering offers another perspective on the quality of a data representation, specifically its utility in downstream tasks and its preservation of local neighborhood structure. FT-SAE outperforms the others on these measures as well in all cases. In contrast, t-SNE shows the highest trustworthiness score for both FLT and GC. Although FT-SAE retains a fair score of > 0.8 (Lee et al., 2007), its score is the lowest of the four methods. This is likely due to the incorporated transfer learning scheme and its emphasis on sparse representations, which may sacrifice fidelity. Overall, this analysis indicates that FT-SAE shows great promise across these metrics in its ability to learn sparse, nonlinear representations that effectively capture local and global structures in the data.

3.3 Clustering Results

Here, we present clustering results using GLARE on the CARA dataset. Figure 5 shows ensemble clustering results on the best-performing data representation, FT-SAE. We show individual clustering results from the base clustering algorithms we considered, GMM, HDBSCAN, and Spectral clustering, along with a final consensus clustering through evidence accumulation clustering. We note that GMM and spectral clustering require a user-defined number of clusters.
These were set to 20 and 25, respectively, for FLT and to 25 and 20 for GC, guided by results from previous studies (Shulse et al., 2019; Shahan et al., 2022). HDBSCAN determines its own number of clusters.

Figure 5: Ensemble clustering via EAC. Results from the base clustering algorithms, GMM, HDBSCAN, and Spectral clustering, are shown from left to right for both FLT and GC. EAC results are shown at the right, with 16 consensus cluster labels for FLT and 15 for GC (depicted as different colors).

Spaceflight. Clustering of the FLT dataset resulted in the identification of 20, 13, and 25 clusters for GMM, HDBSCAN, and spectral clustering, respectively. GMM produced two large clusters, containing 7,623 and 5,778 genes, with most of the other clusters being smaller, at 300 to 1,000 genes. HDBSCAN produced fewer clusters, most of which had 1,000 to 2,500 genes. Spectral clustering produced the most consistent cluster sizes of the three, with most clusters having 1,000 to 1,300 genes. These results highlight how the precise nature of the clusters depends on the clustering approach taken. Each clustering strategy has distinct strengths: GMM works well when the data does not have well-defined boundaries, HDBSCAN is useful for datasets with noise and outliers, and spectral clustering is well suited to data with non-linear manifold structures. In order to leverage all of these advantages for a robust and reliable analysis of the CARA data representation, we combined all three approaches via ensemble clustering through consensus voting (Vega-Pons and Ruiz-Shulcloper, 2011). These ensemble clusters exhibited diverse characteristics, ranging from clusters of < 1,000 genes to two large clusters, finding patterns in local structure, containing 7,627 and 4,715 genes, similar to the clusters identified by GMM.
The number of clusters and the sizes of the remaining clusters, ranging from 1,000 to 2,500 genes, are suggestive of the outputs of HDBSCAN and spectral clustering, finding patterns throughout the global structure (Jain, 2010).

Ground Control. Clustering of the GC dataset resulted in the identification of 25, 15, and 20 clusters for GMM, HDBSCAN, and spectral clustering, respectively. Despite the slight change in the number of clusters, the qualitative characteristics of the results remained largely consistent with those obtained from FLT. Specifically, the GMM results revealed two large clusters, containing 7,445 and 6,157 genes, with most of the other clusters being smaller, at 200 to 900 genes, highlighting local patterns. Similarly, the HDBSCAN and spectral clusters had gene counts comparable to those of the FLT clusters, finding patterns throughout the global structure.
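The evidence accumulation step applied to both FLT and GC (consensus voting into a co-association matrix, then average-linkage hierarchical clustering) can be illustrated with a minimal sketch in plain Python. This is not the GLARE implementation: the six-item label lists, the `NOISE = -1` convention for unassigned points, and the fixed `n_clusters` stopping rule are all assumptions made for illustration, and the source does not state how the final number of consensus clusters was chosen.

```python
from itertools import combinations

NOISE = -1  # assumed label for points a base clusterer leaves unassigned


def co_association(partitions):
    """Fraction of base partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j] and labels[i] != NOISE:
                    votes[i][j] += 1
    return [[v / len(partitions) for v in row] for row in votes]


def average_linkage(co, n_clusters):
    """Merge the closest pair of clusters (distance = 1 - co-association,
    averaged over all cross-cluster pairs) until n_clusters remain."""
    clusters = [[i] for i in range(len(co))]

    def distance(a, b):
        return sum(1.0 - co[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]

    labels = [0] * len(co)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Hypothetical base partitions of six genes from three clusterers.
gmm = [0, 0, 0, 1, 1, 1]
hdbscan = [0, 0, NOISE, 1, 1, 1]
spectral = [0, 0, 1, 1, 2, 2]

co = co_association([gmm, hdbscan, spectral])
consensus = average_linkage(co, n_clusters=2)
print(consensus)
```

In this toy case the third gene is placed differently by each base method, and the accumulated votes assign it to the group it co-occurs with most often. For datasets of realistic size, a library routine for average-linkage clustering would replace the quadratic loops used here.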
Ensemble clustering demonstrated similar outcomes to FLT as well, exhibiting a diverse range of gene counts within each cluster.

3.4 Post Pipeline Analysis

Lastly, we demonstrate the full utility of GLARE by using the results derived from ensemble clustering on the learned data representation of the CARA data and by applying feature explanation analysis to the prediction task that we undertook for the verification study.

3.4.1 Gene Ontology Analysis

Gene Ontology (GO) analysis, in conjunction with clustering results, is a widely used approach to find the functional significance of co-expressed genes in the clusters and to provide a comprehensive understanding of the biological functions and processes underlying the observed gene expression patterns. We use the Metascape platform (http://metascape.org), which integrates various functional annotation databases (Zhou et al., 2019), to perform GO enrichment analysis. We take the clusters from EAC on FT-SAE and process them through Metascape after excluding clusters with extreme sizes, as this tool only accepts gene lists of fewer than 3,000 genes for enrichment analysis. Specifically, we excluded the two large clusters in each of the FLT and GC datasets, along with one small cluster comprising only 2 genes in the FLT dataset, which leaves us 13 significant clusters for both FLT and GC. GO analysis on these clusters revealed various groups of ontologies, including cellular metabolic processes, oxidative phosphorylation, light response and signaling, and vesicle-mediated transport. The prior study on the CARA dataset (Paul et al., 2017) found that genes associated with cell wall metabolism seemed most prevalent among the differentially expressed genes. We found that clusters associated with vesicle-mediated transport were the most prevalent group for both FLT and GC clusters.
Specifically, for GC these vesicle-mediated transport clusters were related to plant-specific metabolic and developmental pathways, such as root morphogenesis and cell wall organization. In contrast, FLT clusters were more related to metabolic and catabolic processes, including protein processing and RNA processing (Supplementary Figure S1). Moreover, we found a unique hypoxia-related cluster that appeared only in the FLT results. Root zone hypoxia is predicted to occur in spaceflight, as the loss of buoyancy-driven convection in microgravity should limit oxygen resupply to intensely respiring tissues (e.g., Porterfield, 2002). However, transcriptional fingerprints of the hypoxia response in plants in spaceflight have often proven elusive. We therefore focused the rest of our analysis on this hypoxic cluster. In Figure 6, we show a heatmap of the FPKM values for the genes within the hypoxia cluster (Figure 6(a)), GO analysis results for the hypoxia cluster using Metascape (Zhou et al., 2019) (Figure 6(b)), and the Stress Knowledge Map (SKM) (Bleker et al., 2023) centered around the transcription factors (TFs) in the hypoxia cluster (Figure 6(c)).

The Stress Knowledge Map (SKM; https://skm.nib.si/) is a curated resource offering two types of knowledge graphs on plant molecular interactions and stress signaling (Bleker et al., 2023). We used the Comprehensive Knowledge Network (CKN) to gain insights into stress signaling and associated plant biological processes around our genes of interest. The map in Figure 6(c) was drawn with the five TFs that we found in the 43-gene hypoxia cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ / ZAT10’. The CKN map shows an intricate network of TFs and their interactions in the context of stress response mechanisms and related signaling pathways with other genes such as ‘HY5 / TED5’, ‘ABI1’, and ‘JAZ1’.
Inspection of this network reveals ethylene as a likely important player in this response. GO analysis of the network for biological function (Supplementary Table S1) also indicates that elements of defense, water stress, and cold response may be important for further study.

Figure 6: Analysis of the hypoxia cluster found in the FLT clustering result. (a) Heatmap of normalized FPKM values for the hypoxia cluster. (b) Enriched ontology for the hypoxia cluster from Metascape. (c) Stress Knowledge Map (SKM) on the five transcription factors (TFs) in the hypoxia cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ / ZAT10’.

3.4.2 SHAP Analysis

Up to this point, our analysis has been directed toward uncovering distinct patterns between FLT and GC by generating separate data representations for clustering and GO analysis. However, we chose CARA as a dataset to interrogate because of the multiple experimental factors in the experiment’s design. Therefore, after identifying the patterns within the FLT data using GLARE, particularly the hypoxia cluster, we used this newly identified cluster to evaluate the effect of varying light conditions on different genotypes in each location. We took the TFs found within the hypoxia cluster and applied SHAP analysis to quantify feature contribution, thereby explaining which experimental conditions had the most effect in classifying this pattern within the data between FLT and GC.
SHAP analysis provides a way to understand the impact of each feature on the model’s predictions, enabling better model transparency and insights into the underlying relationships within the data (Lundberg and Lee, 2017). Higher positive SHAP scores reflect features that contribute more toward designating a sample as FLT, while negative values reveal factors that have a negative impact on the FLT assignment, i.e., that mark the data as GC. In Figure 7, we show local bar plots explaining the feature importance among the five identified TFs in the FLT hypoxia cluster. Among these five TFs, ‘ZAT12’ has the largest aggregate difference in SHAP values between FLT and GC and ‘MYC2’ the smallest.

We see that PHYD mutants in the dark setting contributed most to model prediction in FLT for both ‘ZAT12’ and ‘MYC2’, while the WS genotype in the light setting for ‘ZAT12’ and PHYD mutants in the light setting for ‘MYC2’ had a notable negative effect on FLT prediction. On the other hand, the Col genotype in the dark setting contributed most to model prediction in GC for both ‘ZAT12’ and ‘MYC2’, indicating a strong differentiation between conditions in a different location. The large difference in aggregated SHAP value between FLT and GC for ‘ZAT12’ suggests that the relative importance and contributions of these features vary significantly between FLT and GC. In contrast, the contributions for ‘MYC2’ appear more consistent and stable across both FLT and GC classifications. Lastly, in Figure 8, we present summary SHAP plots on these features, varying light conditions on different genotypes, to offer a more comprehensive understanding of feature contribution across the entire dataset.

Figure 7: SHAP analysis on transcription factors (TFs) in the hypoxia cluster. A positive SHAP value (red) means that the feature value made a greater contribution than others in classifying the gene as FLT, while a negative SHAP value (blue) suggests a greater contribution to GC classification. (a) ZAT12 - FLT (b) ZAT12 - GC (c) MYC2 - FLT (d) MYC2 - GC (e) Summary of the difference in SHAP value between FLT and GC for the 5 TFs in the hypoxia cluster.

We can observe features with different degrees of impact on the model’s prediction in the SHAP value scatterplot in Figure 8(a). For example, for the WS genotype in a light setting, the majority of the data aligns with positive SHAP values, supporting FLT classification, whereas under dark conditions the trend is reversed. The beeswarm plot (Figure 8(b)) illustrates the distribution of SHAP values for each feature. The color gradient from blue to red represents the feature value (FPKM value), with blue indicating low expression and red indicating high expression. Figure 8(b) shows that PHYD mutants in a dark setting have the largest effect on the classification, with longer tails toward positive values, while most of the high FPKM values have negative SHAP values, suggesting that high expression levels from PHYD mutants in dark settings decrease the likelihood of FLT classification. Similarly, the Col genotype in a dark setting has tails toward negative values, while most of the high FPKM values have positive SHAP values. These observations underscore the presence of intricate interactions between gene expressions, reflecting the complexity of the transcriptome data and the underlying biological mechanisms.

Figure 8: SHAP value distribution for each treatment. Comparing SHAP values from a classification using XGBoost on the discretized CARA dataset. (a) The summary SHAP value scatterplot for each feature displays the distribution of SHAP values alongside raw feature values. (b) The summary SHAP beeswarm plots, where features are ordered by their importance (measured by mean absolute SHAP values), with the most impactful features appearing at the top. The color bar represents raw feature values. Both plots present the same information.

SHAP analysis provides a unique perspective on the patterns within the dataset, especially when the data comprises various environmental settings as features. By analyzing the differences and similarities in SHAP values, researchers can identify genes that are sensitive to complex environment-by-genotype-dependent patterns in the data.

4 Discussion

In this study, we present an analysis pipeline, GLARE, that employs a state-of-the-art representation learning model with self-supervised learning. We chose a previously analyzed dataset, the CARA experiment (OSD-120), which allows for an investigation of the overall utility of the pipeline itself and a comparison with the prior findings. For analysis of the root samples in the CARA spaceflight data, we trained the system using high-throughput plant root single-cell data, along with ensemble clustering, to identify hidden patterns in the spaceflight transcriptome.
For other spaceflight datasets, such as whole seedlings, shoot tissues, microbes, animal tissues, or cell types, matching training datasets to the particular experimental design would similarly add significant depth to these analyses. Following the full pipeline, we present a recommended framework for post-pipeline analysis that employs select bioinformatics tools and adds post hoc explainability to the deep learning approach through methods such as SHAP analysis. Such analyses confirmed previous patterns found in the data, such as cell wall remodeling and vesicle-mediated transport, but critically revealed new features, notably a molecular signature of hypoxic stress in the spaceflight samples that is predicted from the lack of buoyancy-driven convection in spaceflight but that has proven complex to extract from many plant transcriptomic datasets. However, our analyses also revealed that this cryptic signature was dependent on experimental conditions such as plant genotype and lighting regime. For example, the SHAP analysis in Figure 7 of the 5 signature spaceflight-related, hypoxia-response transcription factors identified in this study potentially helps explain why these signals can be complex to identify in current spaceflight datasets without machine learning interrogation.

Although we present one post hoc analysis pipeline for the output of GLARE, researchers can readily use their preferred analytics tools when applying GLARE to their datasets to uncover patterns. To this end, we actively encourage contributions and novel suggestions through our open science repository. Its open-source nature means researchers can readily adapt GLARE to other datasets from GeneLab and elsewhere to reinforce their initial studies and expand on these computational findings. The recent rapid advancement of the machine learning field warrants future work on GLARE.
Similar to our approach, the integration of single-cell datasets has been widely adopted for its advantage in providing nuanced insights at the cellular level. Indeed, transformer-based foundation models for single-cell multi-omics have been proposed (Cui et al., 2024), which offer the potential to generate synthetic data or to infer gene networks. Our future vision for GLARE is to extend beyond autoencoder-based models by adding more advanced self-supervised representation learning models, such as contrastive learning methods that are widely used in computer vision and natural language processing (Chen et al., 2020), to enhance robustness for smaller datasets with fewer features. Additionally, causal representation learning methods can be employed to discover causal relationships between related genes (Uelwer et al., 2023; Schölkopf et al., 2021).

Conflict of Interest Statement

The authors declare they have no conflicts of interest.

Author Contributions

DH.S. and R.B. conceived of the study and fundamental design. DH.S. and H.F.S. contributed to model testing, data analysis, and figure preparation. DH.S., H.F.S., M.Z., R.B., A.-L.P., R.J.F., and S.G. contributed to manuscript preparation. All authors contributed to the manuscript review and editing.

Funding

The CARA experiment was supported by grant number GA-2013-104, Center for Advancement of Science in Space to A.-L. Paul (PI) and R.J. Ferl (CoI).
We gratefully acknowledge support from NASA 80NSSC19K0126 and 80NSSC21K0577 to S.G.

Acknowledgments

The authors would like to acknowledge the sequencing and bioinformatics services provided by the Interdisciplinary Center for Biotechnology Research’s (ICBR) Gene Expression (RRID:SCR_019145), NextGen Sequencing (RRID:SCR_019152), and Bioinformatics (RRID:SCR_019120) cores.

Data Availability Statement

The dataset (OSD-120) utilized in this method can be found on the NASA GeneLab Data System (https://genelab.nasa.gov/). The code utilized for data analysis can be found on the publicly available GitHub repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare).

References

Abts, W., Vandenbussche, B., De Proft, M. P., and Van de Poel, B. (2017). The role of auxin-ethylene crosstalk in orchestrating primary root elongation in sugar beet. Frontiers in Plant Science, 8:444.

Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., and Cremers, D. (2018). Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648.

Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Bleker, C., Ramšak, Ž., Bittner, A., Podpečan, V., Zagorščak, M., Wurzinger, B., Baebler, Š., Petek, M., Križnik, M., van Dieren, A., et al. (2023). Stress knowledge map: A knowledge graph resource for systems biology analysis of plant stress responses. bioRxiv, pages 2023–11.

Campello, R. J., Moulavi, D., and Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer.

Chaudhary, K., Poirion, O. B., Lu, L., and Garmire, L. X. (2018). Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clinical Cancer Research, 24(6):1248–1259.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.

Cui, H., Wang, C., Maan, H., Pang, K., Luo, F., Duan, N., and Wang, B. (2024). scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, pages 1–11.

Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21.

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.

Ferl, R. J. and Paul, A.-L. (2016). The effect of spaceflight on the gravity-sensing auxin gradient of roots: GFP reporter gene microscopy on orbit. npj Microgravity, 2(1):1–9.

Ferl, R. J., Zupanska, A., Spinale, A., Reed, D., Manning-Roach, S., Guerra, G., Cox, D. R., and Paul, A.-L. (2011). The performance of KSC fixation tubes with RNAlater for orbital experiments: A case study in ISS operations for molecular biology. Advances in Space Research, 48(1):199–206.

Fred, A. L. and Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850.

Fu, Y., Li, L., Xie, B., Dong, C., Wang, M., Jia, B., Shao, L., Dong, Y., Deng, S., Liu, H., et al. (2016). How to establish a bioregenerative life support system for long-term crewed missions to the Moon or Mars. Astrobiology, 16(12):925–936.

Gan, G., Ma, C., and Wu, J. (2020). Data clustering: theory, algorithms, and applications. SIAM.

Hao, M., Gong, J., Zeng, X., Liu, C., Guo, Y., Cheng, X., Wang, T., Ma, J., Song, L., and Zhang, X. (2023). Large scale foundation model on single-cell transcriptomics. bioRxiv, pages 2023–05.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Hulot, A., Chiquet, J., Jaffrézic, F., and Rigaill, G. (2020). Fast tree aggregation for consensus hierarchical clustering. BMC Bioinformatics, 21(1):1–12.

Iqbal, N., Khan, N. A., Ferrante, A., Trivellini, A., Francini, A., and Khan, M. (2017). Ethylene role in plant growth, development and senescence: interaction with other phytohormones. Frontiers in Plant Science, 8:475.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666.

Karim, M. R., Beyan, O., Zappa, A., Costa, I. G., Rebholz-Schuhmann, D., Cochez, M., and Decker, S. (2021). Deep learning-based clustering approaches for bioinformatics. Briefings in Bioinformatics, 22(1):393–415.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kiselev, V. Y., Yiu, A., and Hemberg, M. (2018). scmap: projection of single-cell RNA-seq data across data sets. Nature Methods, 15(5):359–362.

Lee, J. A., Verleysen, M., et al. (2007). Nonlinear dimensionality reduction, volume 1. Springer.

Lei, D., Zhu, Q., Chen, J., Lin, H., and Yang, P. (2012). Automatic k-means clustering algorithm for outlier detection. In Information Engineering and Applications: International Conference on Information Engineering and Applications (IEA 2011), pages 363–372. Springer.

Liao, Y. and Vemuri, V. R. (2002). Use of k-nearest neighbor classifier for intrusion detection. Computers & Security, 21(5):439–448.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.

Makhzani, A. and Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.

McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118.

Mustroph, A., Lee, S. C., Oosumi, T., Zanetti, M. E., Yang, H., Ma, K., Yaghoubi-Masihi, A., Fukao, T., and Bailey-Serres, J. (2010). Cross-kingdom comparison of transcriptomic adjustments to low-oxygen stress highlights conserved and plant-specific responses. Plant Physiology, 152(3):1484–1500.

Ng, A. et al. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19.

Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.

Paul, A.-L., Sng, N. J., Zupanska, A. K., Krishnamurthy, A., Schultz, E. R., and Ferl, R. J. (2017). Genetic dissection of the Arabidopsis spaceflight transcriptome: Are some responses dispensable for the physiological adaptation of plants to spaceflight? PLoS One, 12(6):e0180186.

Paul, A.-L., Zupanska, A. K., Schultz, E. R., and Ferl, R. J. (2013). Organ-specific remodeling of the Arabidopsis transcriptome in response to spaceflight. BMC Plant Biology, 13(112).

Porterfield, D. M. (2002). The biophysical limitations in physiological transport and exchange in plants grown in microgravity. Journal of Plant Growth Regulation, 21(2).

Ranzato, M., Boureau, Y.-L., Cun, Y., et al. (2007). Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20.

Rappoport, N. and Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46(20):10546–10562.

Ray, S., Gebre, S., Fogle, H., Berrios, D. C., Tran, P. B., Galazka, J. M., and Costes, S. V. (2018). GeneLab: Omics database for spaceflight experiments. Bioinformatics, 35(10):1753–1759.

Reynolds, D. A. et al. (2009). Gaussian mixture models. Encyclopedia of Biometrics, 741(659-663).

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Rutter, L., Barker, R., Bezdan, D., Cope, H., Costes, S., Degoricija, L., Fisch, K., Gabitto, M., Gebre, S., Giacomello, S., et al. (2020). A new era for space life science: international standards for space omics processing (ISSOP). Patterns.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.

Shahan, R., Hsu, C.-W., Nolan, T. M., Cole, B. J., Taylor, I. W., Greenstreet, L., Zhang, S., Afanassiev, A., Vlot, A. H. C., Schiebinger, G., et al. (2022). A single-cell Arabidopsis root atlas reveals developmental trajectories in wild-type and cell identity mutants. Developmental Cell, 57(4):543–560.

Shulse, C. N., Cole, B. J., Ciobanu, D., Lin, J., Yoshinaga, Y., Gouran, M., Turco, G. M., Zhu, Y., O’Malley, R. C., Brady, S. M., et al. (2019). High-throughput single-cell transcriptome profiling of plant cell types. Cell Reports, 27(7):2241–2247.

Strehl, A. and Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617.

Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3):562–578.

Uelwer, T., Robine, J., Wagner, S. S., Höftmann, M., Upschulte, E., Konietzny, S., Behrendt, M., and Harmeling, S. (2023). A survey on self-supervised representation learning. arXiv preprint arXiv:2308.11455.

Van Der Maaten, L. (2009). Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pages 384–391. PMLR.

Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).

Vega-Pons, S. and Ruiz-Shulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337–372.

Villacampa, A., Ciska, M., Manzano, A., Vandenbrink, J. P., Kiss, J. Z., Herranz, R., and Medina, F. J. (2021). From spaceflight to Mars g-levels: Adaptive response of A. thaliana seedlings in a reduced gravity environment is enhanced by red-light photostimulation. International Journal of Molecular Sciences, 22(2):899.

Xu, C., Tao, D., and Xu, C. (2013). A survey on multi-view learning. arXiv preprint arXiv:1304.5634.

Zeng, I. S. L. and Lumley, T. (2018). Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinformatics and Biology Insights, 12:1177932218759292.

Zhou, Y., Zhou, B., Pache, L., Chang, M., Khodabakhshi, A. H., Tanaseichuk, O., Benner, C., and Chanda, S. K. (2019). Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications, 10(1):1523.

Supplementary Data and Table

We provide supplementary data at the OSDR GitHub repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare), including the codes for the method and reproducible results such as single-cell pre-trained model weights, data representations, ensemble clustering results, Gene Ontology analysis

Results

for all clusters, and predicted SHAP values for both FLT and GC. Supplementary Table S1 is also included in534 the repository.535 Supplementary Figure536 The supplementary Figure S1 provides an enriched ontology analysis from Metascape (Zhou et al., 2019) on vesicle-537 mediated transport related clusters using FT-SAE data representations from FLT and GC datasets. FLT clusters538 emphasize metabolic and catabolic processes, such as “small molecule catabolic process" and “reactive oxygen species539 metabolic process". Regulatory processes, such as “regulation of programmed cell death" are also prominent in FLT. In540 contrast, GC clusters focus on cellular structure and developmental pathways, with terms like “cellular macromolecule541 localization" and “plant-type cell wall organization", as well as developmental terms like “embryo development".542 This analysis illustrates the distinct biological pathways captured by FLT and GC data representations, offering543 complementary insights to prior study (Paul et al., 2017) with plant-speciﬁc metabolic and developmental pathways.544 Figure S1: Vesicle-mediated transport related clusters in FLT and GC. (a) Enriched ontology analysis from Metascape on vesicle-mediated transport related clusters using FLT FT-SAE data representations. (b) Equivalent analysis using GC FT-SAE data representations. 19 .CC-BY 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597470doi: bioRxiv preprint
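The FLT versus GC contrast described for Figure S1 amounts to comparing two sets of enriched ontology terms. The following is a minimal sketch of such a comparison, not the authors' code: the function name `compare_term_sets` is hypothetical, and the two term sets contain only the example terms quoted in the text above rather than the full Metascape output.

```python
# Illustrative sketch only. The term sets hold just the examples quoted in the
# supplementary text; they are NOT the complete Metascape enrichment results.
flt_terms = {
    "small molecule catabolic process",
    "reactive oxygen species metabolic process",
    "regulation of programmed cell death",
}
gc_terms = {
    "cellular macromolecule localization",
    "plant-type cell wall organization",
    "embryo development",
}


def compare_term_sets(a, b):
    """Summarize overlap between two sets of enriched terms.

    Returns the shared terms, the terms unique to each set, and the
    Jaccard index (|a & b| / |a | b|), which is 0.0 for disjoint sets.
    """
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "jaccard": jaccard,
    }


# Hypothetical usage: FLT is set "a", GC is set "b".
result = compare_term_sets(flt_terms, gc_terms)
for key in ("shared", "only_a", "only_b"):
    print(key, result[key])
print("jaccard", result["jaccard"])
```

With the full per-cluster Gene Ontology results from the repository, the same comparison could be run per cluster pair. A low Jaccard index would then indicate that the two conditions emphasize different pathways.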

License: CC-BY-4.0