{"paper_id":"23bde1aa-e764-4aff-b4c2-2260fe099ee4","body_text":"GLARE: DISCOVERING HIDDEN PATTERNS IN SPACEFLIGHT TRANSCRIPTOME USING REPRESENTATION LEARNING\nA bioRxiv PREPRINT\nDongHyeon Seo1, Hunter F. Strickland2,3, Mingqi Zhou3, Richard Barker4, Robert J Ferl3,5, Anna-Lisa Paul3,6, Simon Gilroy7\n1Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292\n2Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32611\n3Department of Horticultural Sciences, University of Florida, Gainesville, FL 32611\n4Blue Marble Space Institute of Science, Seattle, WA 98104\n5Office of Research, University of Florida, Gainesville, FL 32611\n6Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32610\n7Department of Botany, University of Wisconsin-Madison, Madison, WI 53706\nABSTRACT\nSpaceflight studies present novel insights into biological processes through exposure to stressors outside the evolutionary path of terrestrial organisms. Despite limited access to space environments, numerous transcriptomic datasets from spaceflight experiments are now available through NASA’s GeneLab data repository, which allows public access to these datasets, encouraging further analysis. While various computational pipelines and methods have been used to process these transcriptomic datasets, learning-model-driven analyses have yet to be applied to a broad array of such spaceflight-related datasets. In this study, we propose an open-source framework, GLARE: GeneLAb Representation learning pipelinE, which trains a range of representation learning approaches, from manifold learning to self-supervised learning, to enhance the performance of downstream analytical tasks such as pattern recognition. 
We illustrate the utility of GLARE by applying it to gene-level transcriptional values from the results of the CARA spaceflight experiment, an Arabidopsis root tip transcriptome dataset that spanned light, dark, and microgravity treatments. We show that GLARE not only substantiated the findings of the original study concerning cell wall remodeling but also revealed additional patterns of gene expression affected by the treatments, including evidence of hypoxia. This work suggests there is great potential to supplement the insights drawn from initial studies on spaceflight omics-level data through further machine-learning-enabled analyses.\nKeywords: Machine Learning, Representation Learning, Spaceflight, RNA-seq, Transcriptomics\nbioRxiv preprint doi: https://doi.org/10.1101/2024.06.04.597470; this version posted June 6, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\n1 Introduction\nSpaceflight studies present unprecedented insights into biological processes through exposure to unique environmental stressors that have not been experienced by any form of life on Earth. In response to the spaceflight environment, organisms initiate specific transcriptional responses to novel conditions. Thus, one key to understanding how biology responds to spaceflight stressors like microgravity, radiation, and hypoxia is through transcriptomic analysis to study the gene expression profiles that drive physiological adaptation triggered by the spaceflight environment (Mustroph et al., 2010). 
Space-related transcriptional studies have now also broadened into multi-omic spaceflight investigations that are well-suited to multiple rounds of analysis facilitated by the publicly available datasets in the NASA GeneLab database.\nThe importance of studying plant biology specifically in space has been identified both for exploring the fundamental responses of biology to the spaceflight environment and at a very practical level for developing bio-regenerative life support systems for long-term space exploration (Rutter et al., 2020; Fu et al., 2016). Understanding the transcriptomic and physiological changes elicited in plants by spaceflight conditions through analyzing transcriptional and other -omic patterns is therefore a focus of much current plant space biology experimentation (e.g., Paul et al. (2013); Villacampa et al. (2021)). For example, the CARA (Characterizing Arabidopsis Root Attraction) experiment was designed to compare the spaceflight transcriptome responses of root tips from different genotypes of Arabidopsis thaliana under various conditions (Paul et al., 2017). This experiment explored the patterns of gene expression from root tip cells in the spaceflight environment on the International Space Station (ISS), with comparable ground controls, across lighting sub-environments and three different genotypes. While these kinds of experiments in plant space biology have provided many key insights, they have so far largely relied upon the primary transcriptomic analysis of the original research team. To provide a framework that can be applied to increase the depth of transcriptomic analyses for previous and future spaceflight experiments, we introduce GLARE: GeneLab Representation learning pipelinE. 
We show the utility of the GLARE pipeline by applying it to the CARA dataset to illustrate how applying novel machine-learning methods to transcriptomic datasets extends insights beyond the original transcriptomic analysis of this data, adding new perspectives.\nOur analysis pipeline applies state-of-the-art representation learning models to find underlying patterns in the FPKM values (fragments per kilobase of transcript per million mapped fragments) that are proportional to the abundance of each locus’s transcript. These representation learning models allow for better data point representation and clustering using unsupervised learning methods. These methods allow for further investigation of the effects of spaceflight on, e.g., phytohormone signaling and associated physiological phenotypes (Abts et al., 2017; Ferl and Paul, 2016; Iqbal et al., 2017). Moreover, considering that the CARA experiment also utilized lighting sub-environments, we can shine further light on the potential spaceflight effects that were neglected in past studies. Overall, the GLARE method will provide insights to better understand plant behavior in the spaceflight environment based on its endogenous and exogenous cues.\n2 Materials and Methods\n2.1 GeneLab Data System and Data Entries\nThe GeneLab Data System (GLDS) is a public, space-related -omics data repository, which curates data from a wide variety of species and experimental spaceflight conditions (Ray et al., 2018). GLDS obtains spaceflight-related -omics datasets from multiple locations such as the Gene Expression Omnibus (GEO), European Bioinformatics Institute (EBI), publications directly, and others (Ray et al., 2018). 
This data is then cataloged with the relevant metadata, such as protocols, payload numbers, and experimental variables, and made available as an Open Science Dataset (OSD) in NASA’s Open Science Data Repository (OSDR).\nThe CARA dataset (OSD-120; https://osdr.nasa.gov/bio/repo/data/studies/OSD-120) was chosen from GeneLab for use with GLARE due to its many experimental conditions. The CARA experiments were conducted with three ecotypes/genotypes of Arabidopsis thaliana: wild-type Wassilewskija (WS), wild-type Columbia-0 (Col-0), and a mutant in the PHYTOCHROME D gene in the Col-0 background (PHYD) (Paul et al., 2017). Briefly, these genotypes were planted on gel media in Petri dishes and grown in either ambient light conditions or in the dark on the ISS for 11 days; parallel controls were performed on the ground. After the 11 days, germinated seedlings were photographed and collected into Kennedy Space Center Fixation Tubes (KFTs; Ferl et al. (2011)) containing RNAlater. Seedlings preserved in RNAlater were returned to Earth frozen, and then the roots were dissected to isolate the last 2 mm of the tip for the light-grown plants and the last 1 mm for the dark-grown plants. RNA was extracted and sent to the Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, for RNA sequencing using a NextSeq 500 system, producing ∼40 million paired-end reads per sample. Finally, these paired-end reads were mapped to the TAIR10 A. 
thaliana reference genome using Spliced Transcripts Alignment to a Reference (STAR) software, and differential expression analysis was performed using the Cufflinks tool (Dobin et al., 2013; Trapnell et al., 2012).\n2.2 High-dimensional Data Analysis\nOverview: Statistical methods have been widely integrated into the bioinformatics pipeline in multi-omics studies for analyzing the data as well as preprocessing the data. Specifically, because multi-omics datasets have complex data topology, dimension reduction and clustering are two commonly used techniques for further investigation (Rappoport and Shamir, 2018). GLARE capitalizes upon such approaches. For example, Principal Component Analysis (PCA) and Factor Analysis are fundamental methods with widespread application for dimensionality reduction (Zeng and Lumley, 2018). After achieving a statistical representation of the dataset with these dimensionality reduction techniques, clustering methods are utilized to group similar representations to uncover underlying patterns within the dataset. Among these, K-means and hierarchical clustering are featured as two of the most favored methodologies (Hulot et al., 2020).\n2.2.1 Learning Data Representations\nWhile PCA is popularly used for its simplicity, it can lose essential features through linear embedding, which often degrades the clustering quality (Gan et al., 2020). Several alternative methods that do not only rely on data point distribution but also leverage latent data structures via learned representations have shown advantages in handling biological data, thereby enhancing clustering precision (Karim et al., 2021). GLARE also uses these approaches in its analyses. 
These alternatives to PCA include t-distributed Stochastic Neighbour Embedding (t-SNE), a non-linear dimensionality reduction technique particularly adept at preserving local structures within high-dimensional data, and Uniform Manifold Approximation and Projection (UMAP), a manifold learning approach that efficiently captures complex relationships within the data (Van der Maaten and Hinton, 2008; McInnes et al., 2018). However, alternative deep-learning-based approaches for obtaining data representations have been largely neglected in the field of plant biology, despite their ability to capture contextual information through non-linear mappings. Specifically, this approach of capturing contextual information through complex, higher-level features is known as representation learning or feature extraction (Aljalbout et al., 2018). Therefore, along with PCA, t-SNE, and UMAP, we have investigated the application of the Sparse Autoencoder (SAE) as one of the representation learning methods in the GLARE pipeline. SAE is an unsupervised learning algorithm based on a neural network that aims to learn an approximation of the identity function that represents the data. The model is trained by encoding the data in its feedforward phase, but with sparsity constraints that keep only the neurons with the largest activations active, allowing the discovery of the unique structure in the data (Makhzani and Frey, 2013). While autoencoders are more commonly used for reconstructing the original input data, prior studies show that autoencoders also work favorably as a representation learning approach in the context of multi-omics datasets (Chaudhary et al., 2018).\nUpon employing multiple approaches to obtain data representations, evaluating these representations is critical to understanding the strengths and limitations of the various techniques. 
Prior research has used several evaluation techniques to assess the fidelity between data representations and the original dataset and the quality of the data representation structure, so we have used these methods in the development of the GLARE approach. Reconstruction error analysis, often conducted through linear regression, and trustworthiness scores that measure faithfulness are widely applied to test fidelity by comparing the original data and the learned representation (Hinton and Salakhutdinov, 2006; Van Der Maaten, 2009). To test the quality of the data structure, the K-Nearest Neighbors (KNN) classifier can be utilized to assess neighborhood preservation, showing the ability of the representation to maintain local structure and inherent relationships (Liao and Vemuri, 2002). Furthermore, the Silhouette score measured via k-means clustering is widely used to gain insight into the clustering performance and compactness of the data representation (Rousseeuw, 1987).\n2.2.2 Clustering Data Representations\nWithin the clustering paradigm, several alternative methods to K-means exist for the effective organization of these representations, and we have explored their application as part of the GLARE pipeline. 
Among these, Gaussian Mixture Models (GMM) with the Expectation-Maximization (EM) algorithm offer a probabilistic framework, wherein each cluster is represented by a Gaussian distribution, facilitating more nuanced cluster assignments (Reynolds et al., 2009). Density-based clustering methods have gained considerable attention with respect to their ability to detect clusters of arbitrary shapes and sizes, thus overcoming some of the limitations associated with distance-based methods (Ester et al., 1996). Notably, an extension of this approach, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), utilizes a hierarchical approach to density-based clustering to robustly identify clusters at multiple levels with varying densities (Campello et al., 2013). Additionally, spectral clustering presents an alternative approach, leveraging the eigenstructure of the similarity matrix to partition the data into clusters, thereby offering an effective means of characterizing complex structures within the dataset (Ng et al., 2001).\nEnsemble clustering is an additional powerful technique that combines these multiple clustering solutions to obtain consensus clusters that are more robust and accurate. Several ensemble clustering methods have been proposed, including Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005), which accumulates evidence from different base clustering algorithms to build a co-association matrix. Applying hierarchical clustering to this matrix derives a final consensus clustering result. Other notable examples include the HyperGraph-Partitioning Algorithm (HGPA) (Strehl and Ghosh, 2002), which derives consensus clustering by partitioning a hypergraph whose hyperedges are the base clustering sets and whose vertices represent data points. 
These ensemble techniques have demonstrated their utility in various domains, such as bioinformatics, text mining, and computer vision, where data is often high-dimensional, noisy, and complex (Vega-Pons and Ruiz-Shulcloper, 2011). Therefore, they are strong candidates for equivalent analyses of the often highly complex structures that make up plant transcriptomics datasets.\n3 Results\n3.1 GLARE: GeneLAb Representation learning pipelinE\nWe introduce GLARE, a representation learning pipeline designed to empower researchers to move beyond reliance on conventional dimensionality reduction techniques, such as PCA or t-SNE, in their omics-focused research. The GLARE framework enables the extraction of data representations using a trained learning-based model, thereby allowing the exploration of latent structures to unveil the hidden patterns inherent within the dataset. We first report a verification study by training a classification model on the CARA study’s spaceflight and ground control data, followed by the full analytical pipeline to highlight GLARE’s ability to both confirm patterns revealed in the published primary analyses and reveal novel patterns within the data.\n3.1.1 Verification study\nPrior to applying the full end-to-end pipeline of GLARE, we perform a verification study through a prediction task to ensure that learnable patterns indeed exist within spaceflight transcriptome datasets. We focused this analysis on the CARA dataset (OSD-120). To enable this analysis, we first had to reorder the dataset in OSD-120 to be indexed by each experimental feature (analogous to resorting the data table to have column headings/labels for each factor). 
We therefore restructured the unlabeled original data by extracting the feature vectors that represent each experiment (such as genotype, spaceflight versus ground control, and lighting regime) and performing data discretization (i.e., reordering the numerical FPKM data within the data table to be indexed, not simply by gene locus versus sample but by gene locus versus each individual experiment factor, Figure 1(a)). This approach produced discrete labels for each instance of a feature, creating a pseudo multi-view dataset X′. Essentially, X′ contains a reindexed version of the information in the original data, which has 36 continuous features as it contains results from two locations (spaceflight versus ground), under two different light conditions (light versus dark), and for three genotypes (WS, Col-0, and PHYD mutants), with three replicate samples of each.\nThis restructuring allows the experiment environment to be explicitly indicated through the labels (Xu et al., 2013). We then trained the classification model using XGBoost (Chen and Guestrin, 2016) on the restructured and discretized data X′, using the concatenated feature vectors as the input matrix and the discretized labels that represent the experiment environment as target labels. High predictive performance would indicate that learnable patterns are indeed present in the original data, thus motivating the use of unsupervised representation learning techniques from GLARE. 
Figure 1 shows an illustration of how we restructured our dataset through data discretization and the prediction performance of the best-performing data model. A comparison of prediction performance across multiple data discretization models on a held-out test set of X′ is presented in Table 1. This test set was set aside during training to be used for evaluating the model performance on unseen data. Data models that were tested include our ‘base’ discretization model, where we have location labels indicating if the experiments were performed in space (ISS) or on the ground (KSC), having 18 continuous features and one flight versus ground label. We also discretize other experiment settings such as ‘Genotype’ and ‘Light condition’, adding these additional discretized labels to act as further categorical predictors. We found that our base data model that only discretizes the location variable yields the highest performance with ∼91% test accuracy on predicting if the experiments were done in space or on the ground based on the normalized counts of FPKM values.\nFigure 1: Illustration of data discretization for data reconstruction and prediction performances. (a) Illustration showing the restructuring process of our base data model where we discretize the experiment location. Raw FPKM numerical data (denoted as ### in tables) from the OSDR record is organized as reads per locus (e.g., AT1G01010, AT1G01020 across all ∼25,000 genes in the Arabidopsis genome) for each experimental sample (e.g., Flight sample of Columbia ecotype grown in the light, FLT_col_Light, or Ground control sample of Columbia ecotype grown in the light, GC_col_Light) as shown left. After discretization (right table), each gene has two instances, one from space and one from the ground sample separately. (b) ROC curves using the training and test dataset on the best-performing data model, which is the base model. 
The blue line represents the XGBoost classifier, showing the trade-off between the true positive rate and the false positive rate of the model predictions on the training data (left) and the test set (right). The red line is the random-chance baseline. FLT, spaceflight; GC, ground control.\nData Model | Test Accuracy (%) ↑ | F1-score ↑ | ROC-AUC ↑\nLight & Genotype Discretization | 68.69 ± 0.14 | 0.686 ± 0.001 | 0.772 ± 0.001\nGenotype Discretization | 77.09 ± 0.14 | 0.770 ± 0.002 | 0.865 ± 0.001\nLight Discretization | 83.18 ± 0.07 | 0.831 ± 0.001 | 0.919 ± 0.001\nNo additional Discretization (‘base model’) | 91.29 ± 0.26 | 0.913 ± 0.002 | 0.975 ± 0.001\nTable 1: Classification performances on held-out test set using XGBoost on data from different data models (with ± standard deviation). F1-score: the harmonic mean of precision (avoidance of false positives) and recall (avoidance of false negatives). ROC-AUC: area under the Receiver Operating Characteristic curve, summarizing the true positive vs. false positive trade-off. Test Accuracy is the % correctness of predictions for classifying a sample as spaceflight or ground control in the test set using each data model.\nThe verification study serves dual purposes: 1) As a validity check for the approach prior to deploying the full GLARE pipeline. If the data did not exhibit any learnable and distinctive pattern between the experiment settings being compared, the classifier would make poor predictions on the test set, and applying unsupervised methods to that data would be ineffective, as the extracted representations would not capture meaningful latent information. 
2) The prediction task from the verification study can serve as the foundation for post-pipeline analysis, enabling the incorporation of feature importance explanation schemes, such as SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017). The feature importance values can reveal, e.g., within CARA, which genotypes and light conditions contributed the most to the predictions overall, as well as provide more insights into specific genes of interest. Combining these insights with the clustering results from the GLARE pipeline should substantially empower researchers in general to see new patterns in their omics-level data.\nEncouraged by the outcome of this verification study, which indicates that a machine learning approach should be able to extract potentially novel features from spaceflight datasets, we implemented a full GLARE pipeline, as described in the following sections.\n3.1.2 Preprocessing\nGLARE starts by investigating the dataset with the conventional dimensionality reduction approach of PCA to achieve an initial data representation. Then, we utilize the PCA representations to conduct clustering using the k-means algorithm. Figure 2 shows the distribution of the principal components and the clustering results on them for the CARA data. Notably, results of both spaceflight (FLT) and ground control (GC) experiments exhibit similar distributions and clustering patterns, characterized by a concentration of data points within a single cluster.\nFigure 2: Outlier detection via PCA and k-means. (a) Clustering result on spaceflight (FLT) data (without any discretization). (b) Clustering results on ground control (GC) data (without any discretization). ∼98% of the data is clustered on cluster A (blue) for both FLT and GC. 
In this study, from the >25,000 genes in the datasets, we only discard three genes (cluster D and E in (a) and cluster D in (b)) for both FLT and GC that are separated from concentrated clusters A, B, and C. These genes are: AT1G0759, AT3G41768, ATMG00020.\nLeveraging this outcome, GLARE employs initial investigation using PCA and k-means clustering as a means of outlier detection (Lei et al., 2012), discarding out-of-distribution clusters and keeping the concentrated clusters. For both FLT and GC data, we take the three most concentrated clusters to the next step of the pipeline.\n3.1.3 Representation Learning\nTaking the preprocessed data from the prior step, GLARE offers a range of widely applied representation learning techniques, including classic dimension reduction methods like PCA, t-SNE, and UMAP. However, GLARE also incorporates Sparse Autoencoder (SAE), a deep learning-based model that enables efficient data compression while preserving salient features. In this way, it can capture intricate hierarchical structures within the data by simultaneously learning both compressed data representation and the features necessary for reconstruction (Ng et al., 2011; Ranzato et al., 2007). The illustration of the overall pipeline and details of GLARE are shown in Figure 3.\nFigure 3: Overall pipeline of GLARE: GeneLAb Representation learning pipelinE. (a) Illustration of GLARE, starting with a verification study followed by preprocessing through detecting outliers using k-means clustering. 
Using the clean dataset, GLARE provides options for representation learning from PCA to state-of-the-art SAE pre-trained with high-throughput single-cell data. Retrieved data representation is then processed through ensemble clustering to find the hidden patterns within the data. Results from the verification study and ensemble clustering are then used for post-pipeline analysis. (b) Model architecture illustration of employed SAE for both training with and without pre-training. (c) Ensemble clustering using three base clustering algorithms based on different statistical methodologies. Evidence accumulation clustering is used to derive consensus clusters from these algorithms.\nOur implementation of SAE is constructed with a sequence of building blocks, each comprising a Linear layer followed by LayerNorm and Exponential Linear Unit (ELU) activation (Ba et al., 2016; Clevert et al., 2015). We chose to add the LayerNorm block to improve convergence and stabilize optimization, considering that our data consists of multiple experimental results from different environment settings. For the same reason, we employ ELU activation. We use three of these building blocks for the encoder and three blocks for the decoder to make the SAE. Sparsity is induced via L1-regularization to deal with the sparse and heterogeneous nature of normalized counts of FPKM values. The model is trained with a mean squared error loss and the Adam optimizer (Kingma and Ba, 2014) with weight decay, together with early stopping and gradient clipping to address exploding gradients and ensure stable training. 
A range of hyperparameter settings was tested to find optimal parameter sets, which are described in our shared code repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare). Finally, after model training, we extract the data representation from the bottleneck layer between the encoder and decoder using this optimized model.\nTo further enhance the utility of these representations for downstream tasks such as clustering, we introduce an additional self-supervised learning step. We pre-train the SAE on high-throughput single-cell data, specifically a single-cell root transcriptome dataset from Shulse et al. (2019), as the CARA dataset is drawn from root tip samples. This pre-training step complements the representations from the model by incorporating detailed single-cell transcriptome profiling of plant root cell types. We then take the pre-trained weights to fine-tune the SAE using our normalized counts data to build the Fine-Tuned SAE (FT-SAE). We maintain the original model structure and introduce adapter layers atop the main model to reconcile the differing dimensions of the single-cell matrix and our data, ensuring seamless integration into our SAE framework. We follow the same procedure for model optimization. The suggested self-supervised learning step offers several advantages, including the augmentation of feature granularity and the incorporation of cellular-level insights, thereby enhancing the fidelity and relevance of the learned representations for downstream analyses (Kiselev et al., 2018). 
Similar to our approach, such building of foundation models pre-trained with high-throughput single-cell data has demonstrated great utility in a diverse array of tasks in the life science field, including pattern recognition by incorporating foundational knowledge of the data (Hao et al., 2023).\n3.1.4 Ensemble Clustering\nGLARE provides an ensemble clustering scheme to improve upon the commonly used application of single clustering approaches. GLARE adopts Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005) as its ensemble clustering method, integrating three base clustering algorithms: GMM, HDBSCAN, and Spectral clustering. Ensemble clustering offers several advantages over relying on a single clustering algorithm. By merging the clustering outcomes from distinct statistical foundations through consensus voting, followed by hierarchical clustering with average linkage on the generated co-association matrix, we can mitigate the biases and noise inherent in each base clustering method to create more robust and reliable clustering results. Notably, when working with complex data such as representations retrieved from a fine-tuned sparse autoencoder, ensemble clustering can effectively address inherent complexities to capture hidden patterns and discover biologically meaningful clusters (Monti et al., 2003).\nIn addition to obtaining consensus cluster labels using EAC, researchers can leverage GLARE results from three base clustering algorithms to get unique clusters for each gene by retrieving the intersected cluster from its respective cluster assignments.\n3.2 Data Representation Evaluation\nIn this section, we compare data representations from different algorithms that could be retrieved from GLARE. 
Figure 4 shows visualizations of each of the representations from FLT and GC using PCA, t-SNE, UMAP, SAE, and FT-SAE. The data representations from SAE and FT-SAE are n-dimensional, where n is the number of neurons in the bottleneck layer. This value was determined through hyperparameter tuning and set to n = 16. All other data representations, from PCA, t-SNE, and UMAP, are 2-dimensional. As we discovered in the preprocessing step of GLARE, the PCA representation data points are highly condensed in a single region of the map, while the t-SNE and UMAP representations exhibit a more widespread distribution. On the other hand, the SAE and FT-SAE representations show more cluster-forming shapes in their t-SNE coordinates, where the locally condensed points are separated from others.\nTable 2 shows the next element of the analysis, examining these data representations using multiple quantitative evaluation measures: reconstruction error through linear regression, trustworthiness score, neighborhood preservation through KNN classifier accuracy, and Silhouette score through k-means. Among the data representations that could be retrieved from GLARE, we compare all the methods that perform a non-linear transformation of the original dataset, leaving out PCA.\nFigure 4: Comparison of data representations retrieved from GLARE. PCA, t-SNE, UMAP, SAE, and FT-SAE from left to right for both FLT and GC data. 
t-SNE was used to visualize the n-dimensional data representations from SAE and FT-SAE.\nEnvironment | Data Representation | Reconstruction Error ↓ | Trustworthiness Score ↑ | KNN Accuracy ↑ | Silhouette Score ↑\nFLT | t-SNE | 2020.07 | 0.964 | 98.11 | 0.3638\nFLT | UMAP | 1926.06 | 0.949 | 97.85 | 0.3772\nFLT | SAE | 2033.51 | 0.951 | 97.39 | 0.3782\nFLT | FT-SAE | 1845.12 | 0.884 | 98.75 | 0.5323\nGC | t-SNE | 2029.77 | 0.967 | 97.89 | 0.3584\nGC | UMAP | 1968.41 | 0.956 | 97.49 | 0.3756\nGC | SAE | 2066.24 | 0.946 | 97.99 | 0.3871\nGC | FT-SAE | 1954.56 | 0.864 | 98.07 | 0.5397\nTable 2: Comparison of various evaluation metrics on data representations. FT-SAE shows the lowest linear reconstruction error, highest KNN accuracy, and highest Silhouette score, while having a lower trustworthiness score than the others, for both FLT and GC.\nLinear reconstruction provides an effective way to assess how well these non-linear methods preserve the global structure of the data. FT-SAE outperforms the other methods on linear reconstruction, having the lowest error for both FLT and GC. Measuring the Silhouette score and performing KNN classification on the labels from simple k-means clustering offers another perspective on the quality of a data representation, specifically its utility in downstream tasks and its preservation of local neighborhood structure. FT-SAE outperforms the others on these measures as well in all cases. In contrast, t-SNE shows the highest trustworthiness score for both FLT and GC. Although FT-SAE retains a fair score of > 0.8 (Lee et al., 2007), its score is the lowest of the four. This is likely due to the incorporated 
transfer learning scheme and its emphasis on sparse representations, which may sacrifice fidelity. Overall, this analysis indicates that FT-SAE shows great promise across multiple metrics and can learn sparse, non-linear representations that effectively capture local and global structures in the data.\n3.3 Clustering Results\nHere, we present clustering results using GLARE on the CARA dataset. Figure 5 shows ensemble clustering results on the best-performing data representation, FT-SAE. We show individual clustering results from the base clustering algorithms we considered, GMM, HDBSCAN, and Spectral clustering, along with a final consensus clustering through evidence accumulation clustering. We note that GMM and spectral clustering require a user-defined number of clusters. These were set to 20 and 25, respectively, for FLT and 25 and 20 for GC, guided by results from previous studies (Shulse et al., 2019; Shahan et al., 2022). HDBSCAN determines its own number of clusters.\nFigure 5: Ensemble clustering via EAC. Results from the base clustering algorithms, GMM, HDBSCAN, and Spectral clustering, are shown from left to right for both FLT and GC. EAC results are shown at the right, with 16 consensus cluster labels for FLT and 15 for GC (depicted as different colors).\nSpaceflight. Clustering of the FLT dataset resulted in the identification of 20, 13, and 25 clusters for GMM, HDBSCAN, and spectral clustering, respectively. The GMM result had two large clusters, containing 7,623 and 5,778 genes, with most of the other clusters being smaller, at 300 to 1,000 genes. HDBSCAN produced a smaller number of clusters, most of which had 1,000 to 2,500 genes. Spectral clustering had the most consistent cluster sizes compared to GMM and HDBSCAN, with most of the clusters having 1,000 to 1,300 genes. 
These results highlight how the precise nature of the clusters differs depending on the clustering approach taken. Each clustering strategy has distinct strengths. GMM works well when the data do not have well-defined boundaries, HDBSCAN is useful for datasets with noise and outliers, and spectral clustering is well suited to data with non-linear manifold structures. To leverage all of these advantages for a robust and reliable analysis of the CARA data representation, we combined all three approaches via ensemble clustering through consensus voting (Vega-Pons and Ruiz-Shulcloper, 2011). These ensemble clusters exhibited diverse characteristics, ranging from clusters of < 1,000 genes to two large clusters capturing patterns in local structure, containing 7,627 and 4,715 genes, similar to the clusters identified by GMM. The number of clusters and the size of the remaining clusters, ranging from 1,000 to 2,500 genes, are suggestive of the outputs of HDBSCAN and spectral clustering, which find patterns throughout the global structure (Jain, 2010).\nGround Control. Clustering of the GC dataset resulted in the identification of 25, 15, and 20 clusters for GMM, HDBSCAN, and spectral clustering, respectively. Despite the slight change in the number of clusters, the qualitative characteristics of the results remained largely consistent with those obtained from FLT. Specifically, the GMM results revealed two large clusters, containing 7,445 and 6,157 genes for GC, with most of the other clusters being smaller, at 200 to 900 genes, highlighting local patterns. 
Similarly, the HDBSCAN and spectral clusters had gene counts comparable to those of the FLT clusters, finding patterns throughout the global structure. Ensemble clustering likewise produced outcomes similar to FLT, exhibiting a diverse range of gene counts within each cluster.\n3.4 Post Pipeline Analysis\nLastly, we demonstrate the full utility of GLARE using the results derived from ensemble clustering on the learned data representation of the CARA data and applying feature explanation analysis from the prediction task that we undertook for the verification study.\n3.4.1 Gene Ontology Analysis\nGene Ontology (GO) analysis, in conjunction with clustering results, is a widely used approach to find the functional significance of co-expressed genes in the clusters and provide a comprehensive understanding of the biological functions and processes underlying the observed gene expression patterns. We use the Metascape platform (http://metascape.org), which integrates various functional annotation databases (Zhou et al., 2019), to perform GO enrichment analysis. We take the clusters from EAC on FT-SAE and process them through Metascape after excluding clusters with extreme sizes, as this tool only accepts gene lists of fewer than 3,000 genes for enrichment analysis. Specifically, we excluded the two large clusters in each of the FLT and GC datasets, along with one small cluster comprising only 2 genes in the FLT dataset, which leaves us with 13 clusters for analysis in both FLT and GC. GO analysis on these clusters revealed various groups of ontologies, including cellular metabolic processes, oxidative phosphorylation, light response and signaling, and vesicle-mediated transport. The prior study on the CARA dataset (Paul et al., 2017) found that genes associated with cell wall metabolism seemed most prevalent among the differentially expressed genes. 
We found that clusters associated with vesicle-mediated transport were the most prevalent group for both FLT and GC. Specifically, for GC these vesicle-mediated transport clusters were related to plant-specific metabolic and developmental pathways, such as root morphogenesis and cell wall organization. In contrast, the FLT clusters were more related to metabolic and catabolic processes, including protein processing and RNA processing (Supplementary Figure S1). Moreover, we found a hypoxia-related cluster that was unique to the FLT results. Root zone hypoxia is predicted to occur in spaceflight because the loss of buoyancy-driven convection in microgravity should limit oxygen resupply to intensely respiring tissues (e.g., Porterfield, 2002). However, transcriptional fingerprints of the hypoxia response in plants in spaceflight have often proven elusive. We therefore focused the rest of our analysis on this hypoxia cluster. In Figure 6, we show a heatmap of the FPKM values for the genes within the hypoxia cluster (Figure 6(a)), GO analysis results for the hypoxia cluster using Metascape (Zhou et al., 2019) (Figure 6(b)), and the Stress Knowledge Map (SKM) (Bleker et al., 2023) centered around the Transcription Factors (TFs) in the hypoxia cluster (Figure 6(c)).\nThe Stress Knowledge Map (SKM; https://skm.nib.si/) is a curated resource offering two types of knowledge graphs on plant molecular interactions and stress signaling (Bleker et al., 2023). We used the Comprehensive Knowledge Network (CKN) to gain insights into stress signaling and associated plant biological processes around our genes of interest. The map in Figure 6(c) was drawn with the five transcription factors (TFs) that we found in the 43-gene hypoxia cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ / ZAT10’. 
The CKN map shows an intricate network of TFs and their interactions in the context of stress response mechanisms and related signaling pathways with other genes such as ‘HY5 / TED5’, ‘ABI1’, and ‘JAZ1’. Inspection of this network reveals ethylene as a likely important player in this response. GO analysis of the network for biological function (Supplementary Table S1) also indicates that elements of defense, water stress, and cold response may be important subjects for further study.\nFigure 6: Analysis of the hypoxia cluster found in the FLT clustering result. (a) Heatmap of normalized FPKM values for the hypoxia cluster. (b) Enriched ontology terms for the hypoxia cluster from Metascape. (c) Stress Knowledge Map (SKM) for the five Transcription Factors (TFs) in the hypoxia cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ / ZAT10’.\n3.4.2 SHAP Analysis\nUp to this point, our analysis has been directed toward uncovering distinct patterns between FLT and GC by generating separate data representations for clustering and GO analysis. However, we chose CARA as a dataset to interrogate because of the multiple experimental factors in the experiment’s design. Therefore, after identifying the patterns within the FLT data using GLARE, particularly a hypoxia cluster, we used this newly identified cluster to evaluate the effect of varying light conditions on different genotypes in each location. 
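SHAP values, which this subsection relies on, are Shapley values: each feature is credited with its average marginal contribution to the model output over all orders in which features can be switched from a baseline to their observed values. The brute-force sketch below illustrates that definition on a made-up three-feature scoring function. It is not the authors' pipeline, which applies SHAP to an XGBoost classifier; `toy_model`, the baseline of zeros, and the input values are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    # Average each feature's marginal contribution over all feature orderings.
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its observed value
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_model(v):
    # Hypothetical score: one additive feature plus an interaction between two others.
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [2.0, 0.5, 0.5]
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features that share it. Exact enumeration grows factorially with the number of features, which is why practical SHAP implementations use model-specific or sampling-based estimators.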
We took the TFs found within the hypoxia cluster and applied SHAP analysis to quantify feature contribution, thereby explaining which experimental conditions had the greatest effect in classifying this pattern within the data as FLT or GC. SHAP analysis provides a way to understand the impact of each feature on the model’s predictions, enabling better model transparency and insights into the underlying relationships within the data (Lundberg and Lee, 2017). Higher positive SHAP values reflect features that contribute more toward designating a sample as FLT, while negative values reveal factors that have a negative impact on the FLT assignment, i.e., that indicate GC. In Figure 7, we show local bar plots explaining the feature importance among the five identified TFs in the FLT hypoxia cluster. Among these five TFs, ‘ZAT12’ has the largest aggregate difference in SHAP values between FLT and GC and ‘MYC2’ the smallest.\nWe see that PHYD mutants in the dark setting made the largest contribution to the model’s FLT prediction for both ‘ZAT12’ and ‘MYC2’, while the WS genotype in the light setting for ‘ZAT12’ and PHYD mutants in the light setting for ‘MYC2’ had a notable negative effect on FLT prediction. On the other hand, the Col genotype in the dark setting made the largest contribution to the model’s GC prediction for both ‘ZAT12’ and ‘MYC2’, indicating a strong differentiation between 
conditions in the different locations. The large difference in aggregated SHAP value between FLT and GC for ‘ZAT12’ suggests that the relative importance and contributions of these features vary significantly between FLT and GC. In contrast, the contributions for ‘MYC2’ appear more consistent and stable across both FLT and GC classifications. Lastly, in Figure 8, we present summary SHAP plots on these features, varying light conditions on different genotypes, to offer a more comprehensive understanding of feature contribution across the entire dataset.\nFigure 7: SHAP analysis on Transcription Factors (TFs) in the hypoxia cluster. A positive SHAP value (red) means that the feature value made a greater contribution than others in classifying the gene as FLT, while a negative SHAP value (blue) suggests a greater contribution to GC classification. (a) ZAT12 - FLT (b) ZAT12 - GC (c) MYC2 - FLT (d) MYC2 - GC (e) Summary of the difference in SHAP value between FLT and GC for the 5 TFs in the hypoxia cluster.\nWe can observe features with different degrees of impact on the model’s prediction in the SHAP value scatterplot in Figure 8(a). For example, for the WS genotype in a light setting, the majority of the data aligns with positive SHAP values, supporting FLT classification, whereas under dark conditions the trend is reversed. The beeswarm plot (Figure 8(b)) illustrates the distribution of SHAP values for each feature. The color gradient from blue to red represents the feature value (FPKM values), with blue indicating low expression and red indicating high expression. Figure 8(b) illustrates that PHYD mutants in a dark setting have the largest effect on the classification, with longer tails toward positive values, while most of the high FPKM values have negative SHAP values. 
This suggests that high expression levels in PHYD mutants in dark settings decrease the likelihood of FLT classification. Similarly, the Col genotype in a dark setting has tails toward negative values, while most of the high FPKM values have positive SHAP values. These observations underscore the presence of intricate interactions between gene expressions, reflecting the complexity of the transcriptome data and the underlying biological mechanisms.\nFigure 8: SHAP value distribution for each treatment. Comparison of SHAP values from a classification using XGBoost on the discretized CARA dataset. (a) The summary SHAP value scatterplot for each feature displays the distribution of SHAP values alongside raw feature values. (b) The summary SHAP beeswarm plot, where features are ordered by their importance (measured by mean absolute SHAP values), with the most impactful features appearing at the top. The color bar represents raw feature values. Both plots present the same information.\nSHAP analysis provides a unique perspective on the patterns within the dataset, especially when the data comprises various environmental settings as features. By analyzing the differences and similarities in SHAP values, researchers can identify genes that are sensitive to complex environment-by-genotype-dependent patterns in the data.\n4 Discussion\nIn this study, we present an analysis pipeline, GLARE, that employs a state-of-the-art representation learning model with self-supervised learning. 
We chose a previously analyzed dataset, the CARA experiment (OSD-120), which allows for an investigation of the overall utility of the pipeline itself and a comparison with the prior findings. For analysis of the root samples in the CARA spaceflight data, we trained the system using high-throughput plant root single-cell data, along with ensemble clustering, to identify hidden patterns in the spaceflight transcriptome. For other spaceflight datasets, such as those from whole seedlings, shoot tissues, microbes, animal tissues, or cell types, matching training datasets to the particular experimental design would similarly add significant depth to these analyses. Beyond the pipeline itself, we present a recommended framework for post-pipeline analysis that employs select bioinformatics tools and adds post hoc explainability to the deep learning approach through methods such as SHAP analysis. Such analyses confirmed previous patterns found in the data, such as cell wall remodeling and vesicle-mediated transport, but critically revealed new features, notably a molecular signature of hypoxic stress in the spaceflight samples that is predicted from the lack of buoyancy-driven convection in spaceflight but that has proven difficult to extract from many plant transcriptomic datasets. However, our analyses also revealed that this cryptic signature was dependent on experimental conditions such as plant genotype and lighting regime. For example, the SHAP analysis in Figure 7 of the 5 signature spaceflight-related, hypoxia-response transcription factors identified in this study potentially helps explain why these signals can be difficult to identify in current spaceflight datasets without machine learning interrogation.\nAlthough we present one post hoc analysis pipeline for the output of GLARE, researchers can readily leverage their preferred analytics tools when applying GLARE to their datasets to uncover patterns. 
To this end, we actively encourage contributions and novel suggestions through our open science repository. Its open-source nature means researchers can readily adapt GLARE to other datasets from GeneLab and elsewhere to reinforce their initial studies and expand on these computational findings. The recent rapid advancement in the machine learning field warrants future work on GLARE. As in our approach, the integration of single-cell datasets has been widely adopted for its advantage in providing nuanced insights at the cellular level. Indeed, transformer-based foundation models for single-cell multi-omics have been proposed (Cui et al., 2024), which offer the potential to generate synthetic data or to perform gene network inference. Our future vision for GLARE is to extend beyond autoencoder-based models to add more advanced self-supervised representation learning models, such as contrastive learning methods that are widely used in the fields of computer vision and natural language processing (Chen et al., 2020), to enhance robustness for smaller datasets with fewer features. Additionally, causal representation learning methods can be employed to discover the causal relationships between related genes (Uelwer et al., 2023; Schölkopf et al., 2021).\nConflict of Interest Statement\nThe authors declare they have no conflicts of interest.\nAuthor Contributions\nDH.S. and R.B. conceived of the study and fundamental design. DH.S. and H.F.S. contributed to model testing, data analysis, and figure preparation. DH.S., H.F.S., M.Z., R.B., A.-L.P., R.J.F., and S.G. 
contributed to manuscript preparation. All authors contributed to the manuscript review and editing.\nFunding\nThe CARA experiment was supported by grant number GA-2013-104 from the Center for Advancement of Science in Space to A.-L. Paul (PI) and R.J. Ferl (CoI). We gratefully acknowledge support from NASA 80NSSC19K0126 and 80NSSC21K0577 to S.G.\nAcknowledgments\nThe authors would like to acknowledge the sequencing and bioinformatics services provided by the Interdisciplinary Center for Biotechnology Research’s (ICBR) Gene Expression (RRID:SCR_019145), NextGen Sequencing (RRID:SCR_019152), and Bioinformatics (RRID:SCR_019120) cores.\nData Availability Statement\nThe dataset (OSD-120) utilized in this work can be found on the NASA GeneLab Data System (https://genelab.nasa.gov/). The code utilized for data analysis can be found in the publicly available GitHub repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare).\nReferences\nAbts, W., Vandenbussche, B., De Proft, M. P., and Van de Poel, B. (2017). The role of auxin-ethylene crosstalk in orchestrating primary root elongation in sugar beet. Frontiers in Plant Science, 8:444.\nAljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., and Cremers, D. (2018). Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648.\nBa, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.\nBleker, C., Ramšak, Ž., Bittner, A., Podpečan, V., Zagorščak, M., Wurzinger, B., Baebler, Š., Petek, M., Križnik, M., van Dieren, A., et al. (2023). Stress knowledge map: A knowledge graph resource for systems biology analysis of plant stress responses. bioRxiv, pages 2023–11.\nCampello, R. J., Moulavi, D., and Sander, J. (2013). 
Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining, pages 160–172. Springer.\nChaudhary, K., Poirion, O. B., Lu, L., and Garmire, L. X. (2018). Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clinical Cancer Research, 24(6):1248–1259.\nChen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794.\nChen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR.\nClevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.\nCui, H., Wang, C., Maan, H., Pang, K., Luo, F., Duan, N., and Wang, B. (2024). scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods, pages 1–11.\nDobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R. (2013). Star: ultrafast universal rna-seq aligner. Bioinformatics, 29(1):15–21.\nEster, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, volume 96, pages 226–231.\nFerl, R. J. and Paul, A.-L. (2016). 
The effect of spaceflight on the gravity-sensing auxin gradient of roots: Gfp reporter gene microscopy on orbit. npj Microgravity, 2(1):1–9.\nFerl, R. J., Zupanska, A., Spinale, A., Reed, D., Manning-Roach, S., Guerra, G., Cox, D. R., and Paul, A.-L. (2011). The performance of ksc fixation tubes with rnalater for orbital experiments: A case study in iss operations for molecular biology. Advances in Space Research, 48(1):199–206.\nFred, A. L. and Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE transactions on pattern analysis and machine intelligence, 27(6):835–850.\nFu, Y., Li, L., Xie, B., Dong, C., Wang, M., Jia, B., Shao, L., Dong, Y., Deng, S., Liu, H., et al. (2016). How to establish a bioregenerative life support system for long-term crewed missions to the moon or mars. Astrobiology, 16(12):925–936.\nGan, G., Ma, C., and Wu, J. (2020). Data clustering: theory, algorithms, and applications. SIAM.\nHao, M., Gong, J., Zeng, X., Liu, C., Guo, Y., Cheng, X., Wang, T., Ma, J., Song, L., and Zhang, X. (2023). Large scale foundation model on single-cell transcriptomics. bioRxiv, pages 2023–05.\nHinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. science, 313(5786):504–507.\nHulot, A., Chiquet, J., Jaffrézic, F., and Rigaill, G. (2020). Fast tree aggregation for consensus hierarchical clustering. BMC bioinformatics, 21(1):1–12.\nIqbal, N., Khan, N. A., Ferrante, A., Trivellini, A., Francini, A., and Khan, M. (2017). Ethylene role in plant growth, development and senescence: interaction with other phytohormones. Frontiers in plant science, 8:475.\nJain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.\nKarim, M. R., Beyan, O., Zappa, A., Costa, I. G., Rebholz-Schuhmann, D., Cochez, M., and Decker, S. (2021). 
Deep learning-based clustering approaches for bioinformatics. Briefings in bioinformatics, 22(1):393–415.\nKingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.\nKiselev, V. Y., Yiu, A., and Hemberg, M. (2018). scmap: projection of single-cell rna-seq data across data sets. Nature methods, 15(5):359–362.\nLee, J. A., Verleysen, M., et al. (2007). Nonlinear dimensionality reduction, volume 1. Springer.\nLei, D., Zhu, Q., Chen, J., Lin, H., and Yang, P. (2012). Automatic k-means clustering algorithm for outlier detection. In Information Engineering and Applications: International Conference on Information Engineering and Applications (IEA 2011), pages 363–372. Springer.\nLiao, Y. and Vemuri, V. R. (2002). Use of k-nearest neighbor classifier for intrusion detection. Computers & security, 21(5):439–448.\nLundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.\nMakhzani, A. and Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.\nMcInnes, L., Healy, J., and Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.\nMonti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). 
Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning, 52:91–118.\nMustroph, A., Lee, S. C., Oosumi, T., Zanetti, M. E., Yang, H., Ma, K., Yaghoubi-Masihi, A., Fukao, T., and Bailey-Serres, J. (2010). Cross-kingdom comparison of transcriptomic adjustments to low-oxygen stress highlights conserved and plant-specific responses. Plant Physiology, 152(3):1484–1500.\nNg, A. et al. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19.\nNg, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14.\nPaul, A.-L., Sng, N. J., Zupanska, A. K., Krishnamurthy, A., Schultz, E. R., and Ferl, R. J. (2017). Genetic dissection of the arabidopsis spaceflight transcriptome: Are some responses dispensable for the physiological adaptation of plants to spaceflight? PLoS One, 12(6):e0180186.\nPaul, A.-L., Zupanska, A. K., Schultz, E. R., and Ferl, R. J. (2013). Organ-specific remodeling of the arabidopsis transcriptome in response to spaceflight. BMC Plant Biology, 13(112).\nPorterfield, D. M. (2002). The biophysical limitations in physiological transport and exchange in plants grown in microgravity. Journal of Plant Growth Regulation, 21(2).\nRanzato, M., Boureau, Y.-L., Cun, Y., et al. (2007). Sparse feature learning for deep belief networks. Advances in neural information processing systems, 20.\nRappoport, N. and Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic acids research, 46(20):10546–10562.\nRay, S., Gebre, S., Fogle, H., Berrios, D. C., Tran, P. B., Galazka, J. M., and Costes, S. V. (2018). GeneLab: Omics database for spaceflight experiments. Bioinformatics, 35(10):1753–1759. 
_eprint: https://academic.oup.com/bioinformatics/article-pdf/35/10/1753/48969335/bioinformatics_35_10_1753.pdf.\nReynolds, D. A. et al. (2009). Gaussian mixture models. Encyclopedia of biometrics, 741(659-663).\nRousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.\nRutter, L., Barker, R., Bezdan, D., Cope, H., Costes, S., Degoricija, L., Fisch, K., Gabitto, M., Gebre, S., Giacomello, S., et al. (2020). A new era for space life science: international standards for space omics processing (ISSOP). Patterns.\nSchölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.\nShahan, R., Hsu, C.-W., Nolan, T. M., Cole, B. J., Taylor, I. W., Greenstreet, L., Zhang, S., Afanassiev, A., Vlot, A. H. C., Schiebinger, G., et al. (2022). A single-cell Arabidopsis root atlas reveals developmental trajectories in wild-type and cell identity mutants. Developmental cell, 57(4):543–560.\nShulse, C. N., Cole, B. J., Ciobanu, D., Lin, J., Yoshinaga, Y., Gouran, M., Turco, G. M., Zhu, Y., O’Malley, R. C., Brady, S. M., et al. (2019). High-throughput single-cell transcriptome profiling of plant cell types. Cell reports, 27(7):2241–2247.\nStrehl, A. and Ghosh, J. (2002). 
Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617.\nTrapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3):562–578.\nUelwer, T., Robine, J., Wagner, S. S., Höftmann, M., Upschulte, E., Konietzny, S., Behrendt, M., and Harmeling, S. (2023). A survey on self-supervised representation learning. arXiv preprint arXiv:2308.11455.\nVan Der Maaten, L. (2009). Learning a parametric embedding by preserving local structure. In Artificial intelligence and statistics, pages 384–391. PMLR.\nVan der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).\nVega-Pons, S. and Ruiz-Shulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337–372.\nVillacampa, A., Ciska, M., Manzano, A., Vandenbrink, J. P., Kiss, J. Z., Herranz, R., and Medina, F. J. (2021). From spaceflight to Mars g-levels: Adaptive response of A. thaliana seedlings in a reduced gravity environment is enhanced by red-light photostimulation. International Journal of Molecular Sciences, 22(2):899.\nXu, C., Tao, D., and Xu, C. (2013). A survey on multi-view learning. arXiv preprint arXiv:1304.5634.\nZeng, I. S. L. and Lumley, T. (2018). Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinformatics and biology insights, 12:1177932218759292.\nZhou, Y., Zhou, B., Pache, L., Chang, M., Khodabakhshi, A. H., Tanaseichuk, O., Benner, C., and Chanda, S. K. (2019). Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. 
Nature communications, 10(1):1523.\nSupplementary Data and Table\nWe provide supplementary data at the OSDR GitHub repository (https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare), including the code for the method and reproducible results such as single-cell pre-trained model weights, data representations, ensemble clustering results, Gene Ontology analysis results for all clusters, and predicted SHAP values for both FLT and GC. Supplementary Table S1 is also included in the repository.\nSupplementary Figure\nSupplementary Figure S1 provides an enriched ontology analysis from Metascape (Zhou et al., 2019) on vesicle-mediated transport related clusters using FT-SAE data representations from the FLT and GC datasets. FLT clusters emphasize metabolic and catabolic processes, such as “small molecule catabolic process” and “reactive oxygen species metabolic process”. Regulatory processes, such as “regulation of programmed cell death”, are also prominent in FLT. In contrast, GC clusters focus on cellular structure and developmental pathways, with terms like “cellular macromolecule localization” and “plant-type cell wall organization”, as well as developmental terms like “embryo development”. This analysis illustrates the distinct biological pathways captured by FLT and GC data representations, offering complementary insights to the prior study (Paul et al., 2017) with plant-specific metabolic and developmental pathways.\nFigure S1: Vesicle-mediated transport related clusters in FLT and GC. 
(a) Enriched ontology analysis from Metascape on vesicle-mediated transport related clusters using FLT FT-SAE data representations. (b) Equivalent analysis using GC FT-SAE data representations.","source_license":"CC-BY-4.0","license_restricted":false}