Methods
to transcriptomic datasets extends insights beyond the original transcriptomic analysis of this data, adding new
perspectives.
Our analysis pipeline applies state-of-the-art representation learning models to find underlying patterns in the FPKM
values (fragments per kilobase of transcript per million mapped fragments), which are proportional to the abundance of
each locus’s transcript. These representation learning models allow for better data point representation and clustering
using unsupervised learning methods. These methods allow for further investigation of the effects of spaceflight on, e.g.,
phytohormone signaling and associated physiological phenotypes (Abts et al., 2017; Ferl and Paul, 2016; Iqbal et al.,
2017). Moreover, considering that the CARA experiment also utilized lighting sub-environments, we can shine further
light on the potential spaceflight effects that were neglected in past studies. Overall, the GLARE method will provide
insights to better understand plant behavior in the spaceflight environment based on its endogenous and exogenous cues.
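For reference, the standard definition of FPKM for a single locus can be written as follows. The symbols C_g, L_g, and N are notation introduced here for illustration only and do not come from the original analysis.

```latex
\mathrm{FPKM}_g \;=\; \frac{C_g}{\left(L_g / 10^{3}\right)\left(N / 10^{6}\right)}
             \;=\; \frac{10^{9}\, C_g}{L_g \, N}
% C_g : number of fragments mapped to locus g
% L_g : transcript length of locus g, in bases
% N   : total number of mapped fragments in the sample
```

Dividing by transcript length and by sequencing depth is what makes the value comparable across loci and samples.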
2 Materials and Methods
2.1 GeneLab Data System and Data Entries
The GeneLab Data System (GLDS) is a public, space-related -omics data repository, which curates data from a wide
variety of species and experimental spaceflight conditions (Ray et al., 2018). GLDS obtains spaceflight-related -omics
datasets from multiple sources, such as the Gene Expression Omnibus (GEO), the European Bioinformatics Institute (EBI),
publications directly, and others (Ray et al., 2018). These data are then cataloged with the relevant metadata, such as
protocols, payload numbers, and experimental variables, and made available as an Open Science Dataset (OSD) in
NASA’s Open Science Data Repository (OSDR).
The CARA dataset (OSD-120; https://osdr.nasa.gov/bio/repo/data/studies/OSD-120) was chosen
from GeneLab for use with GLARE due to its many experimental conditions. The CARA experiments were conducted
with three ecotypes/genotypes of Arabidopsis thaliana: wild-type Wassilewskija (WS), wild-type Columbia-0 (Col-0),
and a mutant in the PHYTOCHROME D gene in the Col-0 background (PHYD) (Paul et al., 2017). Briefly, these
genotypes were planted on gel media in Petri dishes and grown in either ambient light conditions or in the dark on
the ISS for 11 days; parallel controls were performed on the ground. After the 11 days, germinated seedlings were
photographed and collected into Kennedy Space Center Fixation Tubes (KFTs; Ferl et al., 2011) containing RNAlater.
bioRxiv preprint doi: https://doi.org/10.1101/2024.06.04.597470; this version posted June 6, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Seedlings preserved in RNAlater were returned to Earth frozen, and the roots were then dissected to isolate the last 2 mm
of the tip for the light-grown plants and the last 1 mm for the dark-grown plants. RNA was extracted and sent to
the Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, for RNA sequencing using a
NextSeq 500 system, producing ∼40 million paired-end reads per sample. Finally, these paired-end reads were mapped to
the TAIR10 A. thaliana reference genome using Spliced Transcripts Alignment to a Reference (STAR) software, and
differential expression analysis was performed using the Cufflinks tool (Dobin et al., 2013; Trapnell et al., 2012).
2.2 High-dimensional Data Analysis
Overview: Statistical methods have been widely integrated into the bioinformatics pipeline in multi-omics studies, both for
analyzing and for preprocessing the data. Specifically, because multi-omics datasets have a complex data
topology, dimension reduction and clustering are two commonly used techniques for further investigation (Rappoport
and Shamir, 2018). GLARE capitalizes upon such approaches. For example, Principal Component Analysis (PCA)
and Factor Analysis are fundamental methods with widespread application for dimensionality reduction (Zeng and
Lumley, 2018). After achieving a statistical representation of the dataset with these dimensionality reduction techniques,
clustering methods are utilized to group similar representations to uncover underlying patterns within the dataset.
Among these, K-means and hierarchical clustering are two of the most widely used methods (Hulot et al.,
2020).
2.2.1 Learning Data Representations
While PCA is popularly used for its simplicity, its linear embedding can lose essential features,
which often degrades the clustering quality (Gan et al., 2020). Several alternative methods that do not rely only on data
point distribution but also leverage latent data structures via learned representations have shown advantages in handling
biological data, thereby enhancing clustering precision (Karim et al., 2021). GLARE also uses these approaches in
its analyses. These alternatives to PCA include t-distributed Stochastic Neighbour Embedding (t-SNE), a non-linear
dimensionality reduction technique particularly adept at preserving local structures within high-dimensional data, and
Uniform Manifold Approximation and Projection (UMAP) (Van der Maaten and Hinton, 2008; McInnes et al., 2018), a manifold
learning approach that efficiently captures complex relationships within the data. However, alternative deep-learning-based
approaches for obtaining data representations have been largely neglected in the field of plant biology, despite
their ability to capture contextual information through non-linear mappings. Specifically, this approach
of capturing contextual information through complex, higher-level features is known as representation learning or
feature extraction (Aljalbout et al., 2018). Therefore, along with PCA, t-SNE, and UMAP, we have investigated the
application of the Sparse Autoencoder (SAE) as one of the representation learning methods in the GLARE pipeline. An SAE
is an unsupervised learning algorithm based on a neural network that aims to learn an approximation of the identity
function that represents the data. The model is trained by encoding the data in its feedforward phase but with sparsity
constraints that only activate neurons with the largest activation, allowing the discovery of the unique structure in
the data (Makhzani and Frey, 2013). While autoencoders are more commonly used for reconstructing the original
input data, prior studies show that autoencoders also work favorably as a representation learning approach in the context of
multi-omics datasets (Chaudhary et al., 2018).
When multiple approaches are employed to obtain data representations, evaluating these representations is critical
to understanding the strengths and limitations of each technique. Prior research has used
several evaluation techniques to assess both the fidelity between data representations and the original dataset and the quality
of the data representation structure, so we have used these methods in the development of the GLARE approach.
Reconstruction error analysis, often conducted through linear regression, and trustworthiness scores that measure
faithfulness are widely applied to test fidelity by comparing the original data and the learned representation (Hinton
and Salakhutdinov, 2006; Van Der Maaten, 2009). To test the quality of the data structure, the K-Nearest Neighbors
(KNN) classifier can be utilized to assess neighborhood preservation, showing the ability of the representation to
maintain local structure and inherent relationships (Liao and Vemuri, 2002). Furthermore, the Silhouette score measured
via k-means clustering is widely used to assess the clustering performance and compactness of the data
representation (Rousseeuw, 1987).
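As an illustration of the last of these measures, the Silhouette score can be sketched in a few lines of plain Python. This is a minimal sketch on made-up two-dimensional points, not the evaluation code used in GLARE; the function name and the toy data are assumptions introduced here.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Rousseeuw, 1987)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Mean Euclidean distance from point i to the listed members (excluding i).
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: distance to own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across it
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.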
2.2.2 Clustering Data Representations
Within the clustering paradigm, several alternative methods to K-means exist for the effective organization of these
representations, and we have explored their application as part of the GLARE pipeline. Among these, Gaussian Mixture
Models (GMM) with the Expectation-Maximization (EM) algorithm offer a probabilistic framework, wherein each
cluster is represented by a Gaussian distribution, facilitating more nuanced cluster assignments (Reynolds et al., 2009).
Density-based clustering methods have gained considerable attention with respect to their ability to detect clusters of
arbitrary shapes and sizes, thus overcoming some of the limitations associated with distance-based methods (Ester et al.,
1996). Notably, an extension of this approach, Hierarchical Density-Based Spatial Clustering of Applications with
Noise (HDBSCAN), utilizes a hierarchical approach to density-based clustering to robustly identify clusters at multiple
levels with varying densities (Campello et al., 2013). Additionally, spectral clustering presents an alternative approach,
leveraging the eigenstructure of the similarity matrix to partition the data into clusters, thereby offering an effective
means of characterizing complex structures within the dataset (Ng et al., 2001).
Ensemble clustering is an additional powerful technique that combines these multiple clustering solutions to obtain
consensus clusters that are more robust and accurate. Several ensemble clustering methods have been proposed,
including Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005), which accumulates evidence from different
base clustering algorithms to build a co-association matrix. Applying hierarchical clustering to this matrix derives
a final consensus clustering result. Other notable examples include the HyperGraph-Partitioning Algorithm (HGPA)
(Strehl and Ghosh, 2002), which derives a consensus clustering by partitioning a hypergraph in which each cluster from a base
clustering is a hyperedge and the vertices represent data points. These ensemble techniques have
demonstrated their utility in various domains, such as bioinformatics, text mining, and computer vision, where data
is often high-dimensional, noisy, and complex (Vega-Pons and Ruiz-Shulcloper, 2011). Therefore, they are strong
candidates for equivalent analyses of the often highly complex structures that make up plant transcriptomics datasets.
3 Results
3.1 GLARE: GeneLAb Representation learning pipelinE
We introduce GLARE, a representation learning pipeline designed to empower researchers to move beyond conventional
dimensionality reduction techniques in their omics-focused research, such as reliance on PCA or t-SNE. The GLARE
framework enables the extraction of data representations using a trained learning-based model, thereby allowing the
exploration of latent structures to unveil the hidden patterns inherent within the dataset. We first report a verification
study by training a classification model on the CARA study’s spaceflight and ground control data, followed by the full
analytical pipeline to highlight GLARE’s ability to both confirm patterns revealed in the published primary analyses
and reveal novel patterns within the data.
3.1.1 Verification study
Prior to applying the full end-to-end pipeline of GLARE, we perform a verification study through a prediction task to
ensure that learnable patterns indeed exist within spaceflight transcriptome datasets. We focused this analysis on the
CARA dataset (OSD-120). To enable this analysis, we first had to reorder the dataset in OSD-120 to be indexed by
each experimental feature (analogous to re-sorting the data table to have column headings/labels for each factor). We,
therefore, restructured the unlabeled original data by extracting the feature vectors that represent each experiment (such
as genotype, spaceflight versus ground control, and lighting regime) and performing data discretization (i.e., reordering
the numerical FPKM data within the data table to be indexed, not simply by gene locus versus sample but by gene locus
versus each individual experiment factor, Figure 1(a)). This approach produced discrete labels for each instance of a
feature, creating a pseudo multi-view dataset X′. Essentially, X′ contains a reindexed version of the information in the
original data, which has 36 continuous features as it contains results from two locations (spaceflight versus ground),
under two different light conditions (light versus dark), and for three genotypes (WS, Col-0, and PHYD mutants), with
three replicate samples of each.
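The restructuring for the location variable can be sketched as follows. This is a minimal illustration, assuming sample names of the form `<location>_<condition>` in the style of the Figure 1 labels (e.g., FLT_col_Light); the function name and the FPKM values are made up for the example and are not taken from OSD-120.

```python
def discretize_location(wide):
    """Split each gene's row into one instance per location (FLT or GC).

    `wide` maps gene locus -> {sample name: FPKM}. Each returned instance holds
    the continuous features for one location plus that location as a label.
    """
    instances = []
    for gene, samples in wide.items():
        by_location = {}
        for sample, fpkm in samples.items():
            location, condition = sample.split("_", 1)
            by_location.setdefault(location, {})[condition] = fpkm
        for location, values in sorted(by_location.items()):
            # Sort the conditions so every instance has the same feature order.
            features = [values[c] for c in sorted(values)]
            instances.append({"gene": gene, "features": features, "location": location})
    return instances

# Hypothetical toy table: two loci, two conditions per location.
wide_table = {
    "AT1G01010": {"FLT_col_Light": 5.0, "FLT_col_Dark": 3.0,
                  "GC_col_Light": 4.0, "GC_col_Dark": 2.5},
    "AT1G01020": {"FLT_col_Light": 0.5, "FLT_col_Dark": 0.0,
                  "GC_col_Light": 1.5, "GC_col_Dark": 1.0},
}
rows = discretize_location(wide_table)
```

Each gene contributes one spaceflight and one ground instance, so the number of rows doubles while the number of continuous features per row halves.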
This restructuring allows the experiment environment to be explicitly indicated through the labels (Xu et al., 2013).
We then trained the classification model using XGBoost (Chen and Guestrin, 2016) on the restructured and discretized
data X′, using the concatenated feature vectors as the input matrix and the discretized labels that represent the experiment
environment as target labels. High predictive performance would indicate that learnable patterns are indeed present in
the original data, thus motivating the use of unsupervised representation learning techniques from GLARE. Figure 1
shows an illustration of how we reconstructed our dataset through data discretization and the prediction performance of
the best-performing data model. A comparison of prediction performance across multiple data discretization models on a
held-out test set of X′ is presented in Table 1. This test set was set aside during training to be used for evaluating the
model performance on unseen data. Data models that were tested include our ‘base’ discretization model, where we
have location labels indicating if the experiments were performed in space (ISS) or on the ground (KSC), having 18
continuous features and one flight versus ground label. We also discretize other experiment settings such as ‘Genotype’
and ‘Light condition’, adding these additional discretized labels to act as further categorical predictors. We found that
our base data model that only discretizes the location variable yields the highest performance with ∼91% test accuracy
on predicting if the experiments were done in space or on the ground based on the normalized counts of FPKM values.
Figure 1: Illustration of data discretization for data reconstruction and prediction performances. (a) Illustration
showing the restructuring process of our base data model where we discretize the experiment location. Raw FPKM
numerical data (denoted as ### in tables) from the OSDR record is organized as reads per locus (e.g., AT1G01010,
AT1G01020 across all ∼25,000 genes in the Arabidopsis genome) for each experimental sample (e.g., Flight sample of
Columbia ecotype grown in the light, FLT_col_Light, or Ground control sample of Columbia ecotype grown in the
light, GC_col_Light) as shown left. After discretization (right table), each gene has two instances, one from space and
one from the ground sample separately. (b) ROC curves using the training and test dataset on the best-performing data
model, which is the base model. Blue line represents XGBoost classifier, showing the ratio of true positives to false
positives in the model predictions from the training data (left) and the test set (right). Red line is the random chance
baseline. FLT, spaceflight; GC, ground control.
Data Model                                     Test Accuracy (%) ↑   F1-score ↑       ROC-AUC ↑
Light & Genotype Discretization                68.69 ± 0.14          0.686 ± 0.001    0.772 ± 0.001
Genotype Discretization                        77.09 ± 0.14          0.770 ± 0.002    0.865 ± 0.001
Light Discretization                           83.18 ± 0.07          0.831 ± 0.001    0.919 ± 0.001
No additional Discretization (‘base model’)    91.29 ± 0.26          0.913 ± 0.002    0.975 ± 0.001
Table 1: Classification performances on held-out test set using XGBoost on data from different data models
(with ± standard deviation). F1-score: The harmonic mean of precision (avoidance of false positives) and recall
(avoidance of false negatives), ROC-AUC: Area under the Receiver Operating Characteristic curve, summarizing true
positive vs. false positive trade-off. Test Accuracy is the % correctness of predictions for classifying a sample as
spaceflight or ground control in the test set using each data model.
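The accuracy and F1-score definitions above can be made concrete with a small sketch. The function name, the choice of FLT as the positive class, and the toy predictions are assumptions made for illustration; how the scores in Table 1 were averaged is not stated here.

```python
def binary_metrics(y_true, y_pred, positive="FLT"):
    """Accuracy, precision, recall, and F1 for a two-class prediction task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for six test instances.
truth = ["FLT", "FLT", "FLT", "GC", "GC", "GC"]
preds = ["FLT", "FLT", "GC", "GC", "GC", "FLT"]
metrics = binary_metrics(truth, preds)
```

In this toy case two of three spaceflight instances are recovered and one ground instance is mislabeled, so precision, recall, F1, and accuracy all equal 2/3.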
The verification study serves dual purposes: 1) It acts as a validity check for the approach prior to deploying the full
GLARE pipeline. If the data did not exhibit any learnable and distinctive pattern between the experimental settings
being compared, then applying unsupervised methods to those data would be ineffective,
as the extracted representations would not capture meaningful latent information, and such data would yield poor predictions on
the test set. 2) The prediction task from the verification study can serve as the foundation for post-pipeline analysis,
enabling the incorporation of feature importance explanation schemes, such as SHapley Additive exPlanations (SHAP)
(Lundberg and Lee, 2017). The feature importance values can reveal, e.g., within CARA, which genotypes and light
conditions contributed the most to the predictions overall, as well as provide more insights into specific genes of
interest. Combining these insights with the clustering results from the GLARE pipeline should substantially empower
researchers in general to see new patterns in their omics-level data.
Encouraged by the outcome of this verification study that a machine learning approach should be able to extract
potentially novel features from spaceflight datasets, we implemented a full GLARE pipeline, as described in the
following sections.
3.1.2 Preprocessing
GLARE begins by investigating the dataset with the conventional dimensionality reduction
approach of PCA to obtain an initial data representation. Then, we utilize the PCA representations to conduct clustering
using the k-means algorithm. Figure 2 shows the distribution of the principal components and clustering results on
them for the CARA data. Notably, results of both spaceflight (FLT) and ground control (GC) experiments exhibit
similar distributions and clustering patterns, characterized by a concentration of data points within a single cluster.
Figure 2: Outlier detection via PCA and k-means. (a) Clustering result on spaceflight (FLT) data (without any
discretization). (b) Clustering results on ground control (GC) data (without any discretization). ∼98% of the data is
clustered in cluster A (blue) for both FLT and GC. In this study, from the >25,000 genes in the datasets, we only discard
three genes (clusters D and E in (a) and cluster D in (b)) for both FLT and GC that are separated from concentrated
clusters A, B, and C. These genes are: AT1G0759, AT3G41768, ATMG00020.
Leveraging this outcome, GLARE uses this initial investigation with PCA and k-means clustering as a means of outlier
detection (Lei et al., 2012), discarding out-of-distribution clusters and keeping the concentrated clusters. For both FLT
and GC data, we take the three most concentrated clusters to the next step of the pipeline.
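The filtering step can be sketched as below. This is a minimal illustration that assumes "most concentrated" can be read as "largest by membership" and that k-means labels are already available; the function name, gene names, and labels are hypothetical.

```python
from collections import Counter

def keep_largest_clusters(genes, labels, n_keep=3):
    """Retain genes in the n_keep largest clusters; treat the rest as outliers."""
    sizes = Counter(labels)
    kept = {lab for lab, _ in sizes.most_common(n_keep)}
    retained = [g for g, lab in zip(genes, labels) if lab in kept]
    discarded = [g for g, lab in zip(genes, labels) if lab not in kept]
    return retained, discarded

# Hypothetical toy input: ten genes, four k-means clusters of sizes 4, 3, 2, 1.
toy_genes = [f"gene_{i}" for i in range(1, 11)]
toy_labels = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
retained, discarded = keep_largest_clusters(toy_genes, toy_labels)
```

Only the single gene in the smallest cluster is dropped here, mirroring the way a handful of isolated genes are discarded while the bulk of the data proceeds to representation learning.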
3.1.3 Representation Learning
Taking the preprocessed data from the prior step, GLARE offers a range of widely applied representation learning
techniques, including classic dimension reduction methods like PCA, t-SNE, and UMAP. However, GLARE also
incorporates the Sparse Autoencoder (SAE), a deep learning-based model that enables efficient data compression while
preserving salient features. In this way, it can capture intricate hierarchical structures within the data by simultaneously
learning both a compressed data representation and the features necessary for reconstruction (Ng et al., 2011; Ranzato
et al., 2007). The illustration of the overall pipeline and details of GLARE are shown in Figure 3.
Figure 3: Overall pipeline of GLARE: Gene LAb Representation learning pipelinE. (a) Illustration of GLARE,
starting with a verification study followed by preprocessing through detecting outliers using k-means clustering. Using
the clean dataset, GLARE provides options for representation learning from PCA to state-of-the-art SAE pre-trained
with high-throughput single-cell data. Retrieved data representation is then processed through ensemble clustering
to find the hidden patterns within the data. Results from the verification study and ensemble clustering are then used
for post-pipeline analysis. (b) Model architecture illustration of employed SAE for both training with and without
pre-training. (c) Ensemble clustering using three base clustering algorithms based on different statistical methodologies.
Evidence accumulation clustering is used to derive consensus clusters from these algorithms.
Our implementation of SAE is constructed with a sequence of building blocks, each comprising a Linear layer
followed by LayerNorm and Exponential Linear Unit (ELU) activation (Ba et al., 2016; Clevert et al., 2015). We chose
to add the LayerNorm block to improve convergence and stabilize optimization, considering that our data consists of
multiple experimental results from different environment settings. We employ ELU activation for the same reason.
We use three of these building blocks for the encoder and three blocks for the decoder to make the SAE. The
sparsity is induced via L1-regularization to deal with the sparse and heterogeneous nature of normalized counts of FPKM
values. The model is trained with a mean squared error loss and the Adam optimizer (Kingma and Ba, 2014) with
weight decay, early stopping, and gradient clipping to address exploding gradients and ensure stable training. A number
of hyperparameter settings were tested to find optimal parameter sets, which are described in our shared code repository
(https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare). Finally,
after the model training, we extract the data representation from the bottleneck layer between the encoder and decoder
using this optimized model.
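One way to write the training objective described above is given below. This is a sketch under stated assumptions: the L1 penalty is assumed to act on the bottleneck activations (it could instead act on the weights), and the symbols f, g, and λ are notation introduced here rather than taken from the shared code. Weight decay, early stopping, and gradient clipping belong to the optimization procedure and are not shown.

```latex
\mathcal{L}(\theta, \phi)
  \;=\; \frac{1}{N} \sum_{i=1}^{N}
        \left\lVert x_i - g_{\phi}\!\big(f_{\theta}(x_i)\big) \right\rVert_2^{2}
  \;+\; \lambda \, \frac{1}{N} \sum_{i=1}^{N}
        \left\lVert f_{\theta}(x_i) \right\rVert_1
% x_i        : normalized FPKM feature vector of instance i
% f_\theta   : encoder (three Linear + LayerNorm + ELU blocks)
% g_\phi     : decoder (three Linear + LayerNorm + ELU blocks)
% f_\theta(x_i) : bottleneck representation extracted after training
% \lambda    : sparsity weight (a tuned hyperparameter)
```

The first term is the mean squared reconstruction error, and the second term encourages sparse bottleneck codes.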
To further enhance the utility of these representations for downstream tasks such as clustering, we introduce an
additional self-supervised learning step. We pre-train the SAE on high-throughput
single-cell data, specifically a single-cell root transcriptome dataset from Shulse et al. (2019), as the CARA
dataset is drawn from root tip samples. This pre-training step complements the representations from the model by
incorporating detailed single-cell transcriptome profiling of plant root cell types. We then take the pre-trained weights
to fine-tune the SAE using our normalized counts data to build the Fine-Tuned SAE (FT-SAE). We maintain the original model
structure and introduce adapter layers atop the main model to reconcile the differing dimensions of the single-cell matrix
and our data, ensuring seamless integration into our SAE framework. We follow the same procedure for
model optimization. The suggested self-supervised learning step offers several advantages, including the augmentation
of feature granularity and the incorporation of cellular-level insights, thereby enhancing the fidelity and relevance of
the learned representations for downstream analyses (Kiselev et al., 2018). Similar to our approach, the building of
foundation models pre-trained with high-throughput single-cell data has demonstrated great utility in a diverse array of
tasks in the life science field, including pattern recognition by incorporating foundational knowledge of the data (Hao
et al., 2023).
3.1.4 Ensemble Clustering
GLARE provides an ensemble clustering scheme to improve upon the commonly used application of single clustering
approaches. GLARE adopts Evidence Accumulation Clustering (EAC) (Fred and Jain, 2005) as its ensemble clustering
method, integrating three base clustering algorithms: GMM, HDBSCAN, and Spectral clustering. Ensemble clustering
offers several advantages over relying on a single clustering algorithm. By merging the clustering outcomes from
distinct statistical foundations through consensus voting, followed by hierarchical clustering with average linkage on
the generated co-association matrix, we can mitigate the biases and noise inherent in each base clustering method to
create more robust and reliable clustering results. Notably, when working with complex data such as representations
retrieved from a fine-tuned sparse autoencoder, ensemble clustering can effectively address inherent complexities to
capture hidden patterns and discover biologically meaningful clusters (Monti et al., 2003).
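The two EAC steps, building the co-association matrix and applying average-linkage hierarchical clustering to it, can be sketched in plain Python. This is a small illustrative implementation, not the GLARE code: the function names, the choice to never co-cluster points labelled -1 (HDBSCAN noise), and the user-supplied number of consensus clusters are all assumptions. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would typically replace the hand-written merge loop.

```python
def co_association(labelings, noise=-1):
    """Fraction of base clusterings that place each pair of points together."""
    n = len(labelings[0])
    votes = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(i + 1, n):
                # Assumption: noise points are never counted as co-clustered.
                if labels[i] == labels[j] and labels[i] != noise:
                    votes[i][j] += 1
                    votes[j][i] += 1
    m = len(labelings)
    return [[1.0 if i == j else votes[i][j] / m for j in range(n)] for i in range(n)]

def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

def evidence_accumulation(labelings, n_clusters):
    """Consensus labels from several base clusterings (Fred and Jain, 2005)."""
    coassoc = co_association(labelings)
    dist = [[1.0 - c for c in row] for row in coassoc]
    return average_linkage(dist, n_clusters)

# Hypothetical base labelings for six points (e.g., GMM, HDBSCAN, spectral).
gmm_labels = [0, 0, 0, 1, 1, 1]
hdbscan_labels = [0, 0, -1, 1, 1, 1]
spectral_labels = [1, 1, 1, 0, 0, 2]
base = [gmm_labels, hdbscan_labels, spectral_labels]
coassoc = co_association(base)
consensus = evidence_accumulation(base, n_clusters=2)
```

Although the three base labelings disagree on two of the six points, the accumulated votes still separate the first three points from the last three.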
In addition to obtaining consensus cluster labels using EAC, researchers can leverage the GLARE results from the three
base clustering algorithms to obtain a unique cluster for each gene by taking the intersection of its respective
cluster assignments.
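A minimal sketch of this intersection, assuming each base algorithm returns one label per gene in the same gene order, is shown below. The function name and the label values are hypothetical.

```python
from collections import defaultdict

def intersect_assignments(genes, *labelings):
    """Group genes by their tuple of base-cluster labels.

    Two genes fall in the same intersected cluster only if every base
    algorithm assigned them to the same cluster.
    """
    groups = defaultdict(list)
    for gene, combo in zip(genes, zip(*labelings)):
        groups[combo].append(gene)
    return dict(groups)

# Hypothetical labels for four loci from GMM, HDBSCAN, and spectral clustering.
loci = ["AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040"]
gmm = [0, 0, 1, 1]
hdbscan = [2, 2, 2, 3]
spectral = [5, 5, 5, 5]
intersected = intersect_assignments(loci, gmm, hdbscan, spectral)
```

Here only the first two loci share all three assignments, so they form one intersected cluster and the remaining two loci each form their own.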
3.2 Data Representation Evaluation
In this section, we compare data representations from different algorithms that could be retrieved from GLARE. Figure
4 shows visualizations of each of the representations from FLT and GC using PCA, t-SNE, UMAP, SAE, and FT-SAE.
Data representations from SAE and FT-SAE are n-dimensional, where n is the number of neurons in the bottleneck
layer. This value is determined through hyperparameter tuning and was set to n = 16. All other data representations,
from PCA, t-SNE, and UMAP, are 2-dimensional. As we discovered from the preprocessing step of GLARE,
the PCA representation data points are highly condensed in a single region of the map, while the t-SNE and UMAP
representations exhibit a more widespread distribution. On the other hand, SAE and FT-SAE representations show more
cluster-forming shapes for their t-SNE coordinates, where the locally condensed points are separated from others.
Table 2 shows the next element of the analysis, examining these data representations using multiple quantitative
evaluation measures: reconstruction error through linear regression, trustworthiness score, neighborhood preservation
through KNN classifier accuracy, and Silhouette Score through k-means. Among the data representations that could be
retrieved from GLARE, we compare all the methods that perform a non-linear transformation to the original dataset,
leaving out PCA.
Figure 4: Comparison of data representations retrieved from GLARE. PCA, t-SNE, UMAP, SAE, and FT-SAE
from left to right for both FLT and GC data. t-SNE was used for the visualization of n-dimensional data representation
from SAE and FT-SAE.
                                      Evaluation Metrics
Environment   Data Representation   Reconstruction Error ↓   Trustworthiness Score ↑   KNN Accuracy ↑   Silhouette Score ↑
FLT           t-SNE                 2020.07                  0.964                     98.11            0.3638
FLT           UMAP                  1926.06                  0.949                     97.85            0.3772
FLT           SAE                   2033.51                  0.951                     97.39            0.3782
FLT           FT-SAE                1845.12                  0.884                     98.75            0.5323
GC            t-SNE                 2029.77                  0.967                     97.89            0.3584
GC            UMAP                  1968.41                  0.956                     97.49            0.3756
GC            SAE                   2066.24                  0.946                     97.99            0.3871
GC            FT-SAE                1954.56                  0.864                     98.07            0.5397
Table 2: Comparison of various evaluation metrics on data representations. FT-SAE shows the lowest linear
reconstruction error, highest KNN accuracy, and highest Silhouette score while having a lower trustworthiness score
compared to others for both FLT and GC.
Linear reconstruction provides an effective way to see how well these non-linear methods preserve
the global structure of the data. FT-SAE outperforms other methods on linear reconstruction, having the lowest error
for both FLT and GC. Measuring the Silhouette score and performing the KNN classification on the labels from
simple k-means clustering offers another perspective on the quality of data representation, specifically, their utility in
downstream tasks and local neighborhood structure preservation. FT-SAE outperforms the other methods on these measures as well
in all cases. In contrast, t-SNE shows the highest trustworthiness score for both FLT and GC. Although FT-SAE
retains a fair score of > 0.8 (Lee et al., 2007), it has the lowest score among the methods. This is likely due to the incorporated
transfer learning scheme and its emphasis on sparse representations, which may sacrifice fidelity. Overall, this analysis
indicates that FT-SAE shows great promise across several metrics and in its ability to learn sparse, nonlinear representations that
effectively capture local and global structures in the data.
3.3 Clustering Results
Here, we present clustering results using GLARE on the CARA dataset. Figure 5 shows ensemble clustering results
on the best-performing data representation, FT-SAE. We show individual clustering results from the base clustering
algorithms we considered, GMM, HDBSCAN, and Spectral clustering, along with a final consensus cluster through
evidence accumulation clustering. We note that GMM and spectral clustering require a user-defined number of clusters. These
were set to 20 and 25, respectively, for FLT and 25 and 20 for GC, driven by results from previous studies (Shulse et al.,
2019; Shahan et al., 2022). HDBSCAN defines its own cluster number.
Figure 5: Ensemble clustering via EAC. Results from base clustering algorithms, GMM, HDBSCAN, and Spectral
clustering, are shown starting from left to right for both FLT and GC. EAC results are shown at the right, with FLT
having 16 consensus cluster labels and 15 consensus cluster labels for GC (depicted as different colors).
Spaceflight. Clustering of the FLT dataset identified 20, 13, and 25 clusters for GMM,
HDBSCAN, and spectral clustering, respectively. GMM produced two large clusters, containing 7,623 and 5,778 genes,
with most of the other clusters being smaller, at 300 to 1,000 genes. HDBSCAN produced fewer
clusters, most of which had 1,000 to 2,500 genes. Spectral clustering yielded the most consistent cluster sizes
of the three methods, with most clusters having 1,000 to 1,300 genes. These results highlight
how the precise nature of the clusters depends on the clustering approach taken. Each clustering strategy has
distinct strengths: GMM works well when the data do not have well-defined boundaries, HDBSCAN is useful
for datasets with noise and outliers, and spectral clustering is well suited to data with non-linear manifold structures.
To leverage all of these advantages for a robust and reliable analysis of the CARA data representation, we combined
all three approaches via ensemble clustering through consensus voting (Vega-Pons and Ruiz-Shulcloper, 2011). The
ensemble clusters exhibited diverse characteristics, ranging from clusters of < 1,000 genes to two large clusters
capturing patterns in local structure, containing 7,627 and 4,715 genes, similar to the clusters identified by GMM. The
number of clusters and the sizes of the remaining clusters, ranging from 1,000 to 2,500 genes, resemble the outputs of
HDBSCAN and spectral clustering, which find patterns throughout the global structure (Jain, 2010).
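The consensus step described above can be illustrated with a minimal sketch of evidence accumulation: each base partition votes on whether two genes belong together, the votes are averaged into a co-association matrix, and a final partition is extracted from that matrix. The majority threshold of 0.5, the single-linkage cut, the handling of noise labels, and the toy label vectors are all assumptions made for illustration, not the configuration used in GLARE.

```python
from itertools import combinations


def co_association(partitions, noise_label=-1):
    """Fraction of base partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            # Noise points (e.g. HDBSCAN label -1) never vote for co-membership.
            if labels[i] == labels[j] and labels[i] != noise_label:
                counts[i][j] += 1
                counts[j][i] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_labels(coassoc, threshold=0.5):
    """Single-linkage cut: link pairs whose co-association exceeds the threshold,
    then label the connected components."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical base partitions of six genes from three algorithms.
gmm = [0, 0, 0, 1, 1, 1]
hdbscan = [0, 0, -1, 1, 1, 1]   # -1 marks a noise point
spectral = [0, 0, 1, 1, 2, 2]

C = co_association([gmm, hdbscan, spectral])
print(consensus_labels(C))  # [0, 0, 1, 2, 2, 2]
```

In this toy case the third gene is grouped with different neighbors by each algorithm, so no pairing reaches a majority and it ends up in a singleton consensus cluster, which is one way very small consensus clusters can arise.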
Ground Control. Clustering of the GC dataset identified 25, 15, and 20 clusters for GMM,
HDBSCAN, and spectral clustering, respectively. Despite the slight change in the number of clusters, the qualitative characteristics
of the results remained largely consistent with those obtained from FLT. Specifically, GMM again produced two large
clusters, containing 7,445 and 6,157 genes for GC, with most of the other clusters being smaller, at
200 to 900 genes, highlighting local patterns. Similarly, the HDBSCAN and spectral clusters had gene counts
comparable to those of the FLT clusters, capturing patterns throughout the global structure. Ensemble clustering likewise
produced outcomes similar to those for FLT, with a diverse range of gene counts per cluster.
3.4 Post Pipeline Analysis
Lastly, we demonstrate the full utility of GLARE by analyzing the results of ensemble clustering on the learned data
representation of the CARA data and by applying feature explanation analysis to the prediction task that we undertook
for the verification study.
3.4.1 Gene Ontology Analysis
Gene Ontology (GO) analysis, in conjunction with clustering results, is a widely used approach to find the functional
significance of co-expressed genes in the clusters and to provide a comprehensive understanding of the biological
functions and processes underlying the observed gene expression patterns. We use the Metascape platform
(http://metascape.org), which integrates various functional annotation databases (Zhou et al., 2019), to perform GO
enrichment analysis. We take the clusters from EAC on FT-SAE and process them through Metascape after excluding
clusters with extreme sizes, as this tool only accepts gene lists of fewer than 3,000 genes for the enrichment analysis.
Specifically, we excluded the two large clusters in each of the FLT and GC datasets, along with one small cluster comprising only 2 genes in the
FLT dataset, which leaves 13 significant clusters for both FLT and GC. GO analysis on these clusters revealed various
groups of ontologies, including cellular metabolic processes, oxidative phosphorylation, light response and signaling,
and vesicle-mediated transport. The prior study on the CARA dataset (Paul et al., 2017) found that genes associated
with cell wall metabolism seemed most prevalent among the differentially expressed genes. We found that clusters
associated with vesicle-mediated transport were the most prevalent group for both FLT and GC clusters. Specifically,
for GC, these vesicle-mediated transport clusters were related to plant-specific metabolic and developmental pathways,
such as root morphogenesis and cell wall organization. In contrast, the FLT clusters were more related to metabolic and
catabolic processes, including protein processing and RNA processing (Supplementary Figure S1). Moreover, we
found a unique hypoxia-related cluster that appeared only in the FLT results. Root zone hypoxia is predicted to occur in
spaceflight because the loss of buoyancy-driven convection in microgravity should limit oxygen resupply to intensely respiring
tissues (e.g., Porterfield, 2002). However, transcriptional fingerprints of hypoxia response in plants in spaceflight have
often proven elusive. We therefore focused the rest of our analysis on this hypoxia cluster. In Figure 6,
we show a heatmap of the FPKM values for the genes within the hypoxia cluster (Figure 6(a)), GO analysis results for the hypoxia
cluster using Metascape (Zhou et al., 2019) (Figure 6(b)), and a Stress Knowledge Map (SKM) (Bleker et al., 2023)
centered around the transcription factors (TFs) in the hypoxia cluster (Figure 6(c)).
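The size-based exclusion of clusters before enrichment analysis can be sketched as follows. The upper limit of 3,000 genes comes from the text; the lower cutoff of 3 genes is an assumption chosen only so that the 2-gene FLT cluster is dropped, and all cluster sizes other than the three excluded FLT clusters are hypothetical placeholders.

```python
MAX_GENES = 3000   # gene lists must be smaller than this (limit stated in the text)
MIN_GENES = 3      # assumed cutoff that drops the 2-gene FLT cluster


def select_clusters(cluster_sizes, min_genes=MIN_GENES, max_genes=MAX_GENES):
    """Return the cluster ids whose gene counts fall in [min_genes, max_genes)."""
    return [cid for cid, size in cluster_sizes.items()
            if min_genes <= size < max_genes]


# Sizes of the two large FLT clusters and the 2-gene cluster are from the text;
# the remaining 13 sizes are hypothetical placeholders.
flt_sizes = {0: 7627, 1: 4715, 2: 2}
flt_sizes.update({cid: 1000 + 100 * cid for cid in range(3, 16)})

kept = select_clusters(flt_sizes)
print(len(flt_sizes), len(kept))  # 16 13
```

Starting from the 16 FLT consensus clusters, removing the two large clusters and the 2-gene cluster leaves 13; for GC, removing two large clusters from 15 gives the same count.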
The Stress Knowledge Map (SKM; https://skm.nib.si/) is a curated resource offering two types of knowledge
graphs on plant molecular interactions and stress signaling (Bleker et al., 2023). We used the Comprehensive Knowledge
Network (CKN) to gain insights into stress signaling and associated plant biological processes around our genes of
interest. The map in Figure 6(c) was drawn with five transcription factors (TFs) that we found in the 43-gene hypoxia
cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ / ZAT10’. The CKN map shows an
intricate network of TFs and their interactions in the context of stress response mechanisms and related signaling
pathways with other genes such as ‘HY5 / TED5’, ‘ABI1’, and ‘JAZ1’. Inspection of this network reveals ethylene as a
likely important player in this response. GO analysis of the network for biological function (Supplementary Table S1)
also indicates that elements of defense, water stress, and cold response may be important subjects for further study.
Figure 6: Analysis of the hypoxia cluster found in the FLT clustering result. (a) Heatmap of normalized FPKM values for the
hypoxia cluster. (b) Enriched ontology terms for the hypoxia cluster from Metascape. (c) Stress Knowledge Map (SKM) for the five
transcription factors (TFs) in the hypoxia cluster: ‘DREB2A’, ‘RHL41 / ZAT12’, ‘MYC2’, ‘RRTF1 / ERF109’, and ‘STZ /
ZAT10’.
3.4.2 SHAP Analysis
Up to this point, our analysis has been directed toward uncovering distinct patterns between FLT and GC by generating
separate data representations for clustering and GO analysis. However, we chose CARA as a dataset to interrogate because of
the multiple experimental factors within the experiment’s design. Therefore, after identifying the patterns within the
FLT data using GLARE, particularly a hypoxia cluster, we used this newly identified cluster to evaluate the effect of
varying light conditions on different genotypes in each location. We took the TFs found within the hypoxia cluster and
applied SHAP analysis to quantify feature contribution, thereby explaining which experimental conditions had the greatest
effect in classifying this pattern within the data as FLT or GC. SHAP analysis provides a way to understand the
impact of each feature on the model’s predictions, enabling better model transparency and insights into the underlying
relationships within the data (Lundberg and Lee, 2017). Higher positive SHAP values reflect features that contribute more
to designating a sample as FLT, while negative values reveal factors that have a
negative impact on the FLT assignment, i.e., that mark the data as GC. In Figure 7, we show local bar plots explaining the
feature importance for the five identified TFs in the FLT hypoxia cluster. Among these five TFs, ‘ZAT12’ has the
largest aggregate difference in SHAP values between FLT and GC and ‘MYC2’ the smallest.
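To make the sign convention concrete, the sketch below computes exact Shapley values, the quantity that SHAP estimates, for a small hypothetical scoring function. This is not the XGBoost classifier or the SHAP tooling used in the study: the scoring function, the three condition features, and the all-zero baseline are assumptions chosen so that the result can be checked by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample x relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    The cost is exponential in the number of features, so this is only
    practical for small toy examples.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_flt_score(z):
    # Hypothetical score: larger values favor the FLT class.
    # z = [expression under condition A, condition B, condition C]
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_flt_score, x, baseline)
print([round(v, 6) for v in phi])  # [2.25, -1.0, 0.25]
```

Here condition A pushes the score toward FLT (positive value), condition B pushes it toward GC (negative value), and the interaction between A and C is split equally between them. The attributions sum to the difference between the score for the sample and the score for the baseline (1.5 in this example).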
We see that PHYD mutants in the dark setting contributed most to the model’s FLT prediction for both ‘ZAT12’
and ‘MYC2’, while the WS genotype in the light setting for ‘ZAT12’ and PHYD mutants in the light setting for ‘MYC2’
had a notable negative effect on FLT prediction. On the other hand, the Col genotype in the dark setting
contributed most to the model’s GC prediction for both ‘ZAT12’ and ‘MYC2’, indicating a strong differentiation between
conditions at the different locations. The large difference in aggregated SHAP value between FLT and GC for ‘ZAT12’
suggests that the relative importance and contributions of these features vary significantly between FLT and GC.
In contrast, the contributions for ‘MYC2’ appear more consistent and stable across both FLT and GC classifications.
Lastly, in Figure 8, we present summary SHAP plots for these features (the light conditions for each genotype)
to offer a more comprehensive understanding of feature contribution across the entire dataset.
Figure 7: SHAP analysis on transcription factors (TFs) in the hypoxia cluster. A positive SHAP value (red)
means that the feature value made a greater contribution than others in classifying the gene as FLT, while a negative
SHAP value (blue) suggests a greater contribution to GC classification. (a) ZAT12 - FLT (b) ZAT12 - GC (c)
MYC2 - FLT (d) MYC2 - GC (e) Summary of the difference in SHAP value between FLT and GC for the 5 TFs in the hypoxia cluster.
The SHAP value scatterplot in Figure 8(a) shows that features differ in their degree of impact on the model’s prediction.
For example, for the WS genotype in a light setting, the majority of the data aligns with positive SHAP
values, supporting FLT classification, whereas under dark conditions the trend is reversed. The beeswarm plot (Figure
8(b)) illustrates the distribution of SHAP values for each feature. The color gradient from blue to red represents the
feature value (FPKM values), with blue indicating low expression and red indicating high expression. Figure 8(b)
illustrates that PHYD mutants in a dark setting have the greatest effect on the classification, with longer tails towards
positive values, while most of the high FPKM values have negative SHAP values, suggesting that high expression levels
from PHYD mutants in dark settings decrease the likelihood of FLT classification. Similarly, the Col genotype in a
dark setting has tails toward negative values, while most of the high FPKM values have positive SHAP values. These
observations underscore the presence of intricate interactions in gene expression, reflecting the complexity of
the transcriptome data and the underlying biological mechanisms.
Figure 8: SHAP value distribution for each treatment. SHAP values are compared for a classification using
XGBoost on the discretized CARA dataset. (a) The summary SHAP value scatterplot for each feature displays the
distribution of SHAP values alongside raw feature values. (b) The summary SHAP beeswarm plot, in which features are
ordered by their importance (measured by mean absolute SHAP values), with the most impactful features appearing at
the top. The color bar represents raw feature values. Both plots present the same information.
SHAP analysis provides a unique perspective on the patterns within the dataset, especially when the data comprises
various environmental settings as features. By analyzing the differences and similarities in SHAP values,
researchers can identify genes that are sensitive to complex environment-by-genotype-dependent patterns in the data.
4 Discussion
In this study, we present an analysis pipeline, GLARE, that employs a state-of-the-art representation learning model
with self-supervised learning. We chose a previously analyzed dataset, the CARA experiment (OSD-120), which allows
for an investigation of the overall utility of the pipeline itself and a comparison with the prior findings. For analysis
of the root samples in the CARA spaceflight data, we trained the system using high-throughput plant root single-cell
data and used ensemble clustering to identify hidden patterns in the spaceflight transcriptome. For other spaceflight
datasets, such as those from whole seedlings, shoot tissues, microbes, animal tissues, or specific cell types, matching training datasets to
the particular experimental design would similarly add significant depth to these analyses. Beyond the core pipeline,
we present a recommended framework for post-pipeline analysis that employs select bioinformatics tools and adds
post hoc explainability to the deep learning approach through methods such as SHAP analysis. Such analyses
confirmed previous patterns found in the data, such as cell wall remodeling and vesicle-mediated transport, but critically
revealed new features, notably a molecular signature of hypoxic stress in the spaceflight samples that is predicted
from the lack of buoyancy-driven convection in spaceflight but that has proven complex to extract from many plant
transcriptomic datasets. However, our analyses also revealed that this cryptic signature was dependent on experimental
conditions such as plant genotype and lighting regime. For example, the SHAP analysis in Figure 7 of the 5
signature spaceflight-related, hypoxia-response transcription factors identified in this study potentially helps explain why
these signals can be complex to identify in current spaceflight datasets without machine learning interrogation.
Although we present one post hoc analysis pipeline for the output of GLARE, researchers can readily apply their
preferred analytics tools when using GLARE on their own datasets to uncover patterns. To this end, we actively encourage
contributions and novel suggestions through our open science repository. Its open-source nature means researchers
can readily adapt GLARE to other datasets from GeneLab and elsewhere to reinforce their initial studies and expand
on these computational findings. The recent rapid advancement of the machine learning field warrants future work
on GLARE. Similar to our approach, integrating single-cell datasets has been widely adopted for its advantage in
providing nuanced insights at the cellular level. Indeed, transformer-based foundation models for single-cell multi-omics
have been proposed (Cui et al., 2024), which offer the potential to generate synthetic data or to perform gene network inference.
Our future vision for GLARE is to extend beyond autoencoder-based models by adding more advanced self-supervised
representation learning models, such as contrastive learning methods that are widely used in the fields of computer vision
and natural language processing (Chen et al., 2020), to enhance robustness for smaller datasets with fewer features.
Additionally, causal representation learning methods can be employed to discover causal relationships between
related genes (Uelwer et al., 2023; Schölkopf et al., 2021).
Conflict of Interest Statement
The authors declare they have no conflicts of interest.
Author Contributions
DH.S. and R.B. conceived of the study and fundamental design. DH.S. and H.F.S. contributed to model testing,
data analysis, and figure preparation. DH.S., H.F.S., M.Z., R.B., A.-L.P., R.J.F., and S.G. contributed to manuscript
preparation. All authors contributed to the manuscript review and editing.
Funding
The CARA experiment was supported by grant number GA-2013-104, Center for Advancement of Science in
Space to A.-L. Paul (PI) and R.J. Ferl (CoI). We gratefully acknowledge support from NASA 80NSSC19K0126
and 80NSSC21K0577 to S.G.
Acknowledgments
The authors would like to acknowledge the sequencing and bioinformatics services provided by the Interdisciplinary
Center for Biotechnology Research’s (ICBR) Gene Expression (RRID:SCR_019145), NextGen Sequencing
(RRID:SCR_019152), and Bioinformatics (RRID:SCR_019120) cores.
Data Availability Statement
The dataset (OSD-120) utilized in this method can be found on the NASA GeneLab Data System (https://genelab.nasa.gov/).
The code utilized for data analysis can be found on the publicly available GitHub repository
(https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare).
References
Abts, W., Vandenbussche, B., De Proft, M. P., and Van de Poel, B. (2017). The role of auxin-ethylene crosstalk in
orchestrating primary root elongation in sugar beet. Frontiers in Plant Science, 8:444.
Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., and Cremers, D. (2018). Clustering with deep learning: Taxonomy
and new methods. arXiv preprint arXiv:1801.07648.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Bleker, C., Ramšak, Ž., Bittner, A., Podpečan, V., Zagorščak, M., Wurzinger, B., Baebler, Š., Petek, M., Križnik, M.,
van Dieren, A., et al. (2023). Stress knowledge map: A knowledge graph resource for systems biology analysis of
plant stress responses. bioRxiv, pages 2023–11.
Campello, R. J., Moulavi, D., and Sander, J. (2013). Density-based clustering based on hierarchical density estimates.
In Pacific-Asia conference on knowledge discovery and data mining, pages 160–172. Springer.
Chaudhary, K., Poirion, O. B., Lu, L., and Garmire, L. X. (2018). Deep learning–based multi-omics integration robustly
predicts survival in liver cancer. Clinical Cancer Research, 24(6):1248–1259.
Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd
international conference on knowledge discovery and data mining, pages 785–794.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual
representations. In International conference on machine learning, pages 1597–1607. PMLR.
Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear
units (elus). arXiv preprint arXiv:1511.07289.
Cui, H., Wang, C., Maan, H., Pang, K., Luo, F., Duan, N., and Wang, B. (2024). scgpt: toward building a foundation
model for single-cell multi-omics using generative ai. Nature Methods, pages 1–11.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R.
(2013). Star: ultrafast universal rna-seq aligner. Bioinformatics, 29(1):15–21.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large
spatial databases with noise. In kdd, volume 96, pages 226–231.
Ferl, R. J. and Paul, A.-L. (2016). The effect of spaceflight on the gravity-sensing auxin gradient of roots: Gfp reporter
gene microscopy on orbit. npj Microgravity, 2(1):1–9.
Ferl, R. J., Zupanska, A., Spinale, A., Reed, D., Manning-Roach, S., Guerra, G., Cox, D. R., and Paul, A.-L. (2011). The
performance of ksc fixation tubes with rnalater for orbital experiments: A case study in iss operations for molecular
biology. Advances in Space Research, 48(1):199–206.
Fred, A. L. and Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE transactions on
pattern analysis and machine intelligence, 27(6):835–850.
Fu, Y., Li, L., Xie, B., Dong, C., Wang, M., Jia, B., Shao, L., Dong, Y., Deng, S., Liu, H., et al. (2016). How to
establish a bioregenerative life support system for long-term crewed missions to the moon or mars. Astrobiology,
16(12):925–936.
Gan, G., Ma, C., and Wu, J. (2020). Data clustering: theory, algorithms, and applications. SIAM.
Hao, M., Gong, J., Zeng, X., Liu, C., Guo, Y., Cheng, X., Wang, T., Ma, J., Song, L., and Zhang, X. (2023). Large scale
foundation model on single-cell transcriptomics. bioRxiv, pages 2023–05.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. science,
313(5786):504–507.
Hulot, A., Chiquet, J., Jaffrézic, F., and Rigaill, G. (2020). Fast tree aggregation for consensus hierarchical clustering.
BMC bioinformatics, 21(1):1–12.
Iqbal, N., Khan, N. A., Ferrante, A., Trivellini, A., Francini, A., and Khan, M. (2017). Ethylene role in plant growth,
development and senescence: interaction with other phytohormones. Frontiers in plant science, 8:475.
Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.
Karim, M. R., Beyan, O., Zappa, A., Costa, I. G., Rebholz-Schuhmann, D., Cochez, M., and Decker, S. (2021). Deep
learning-based clustering approaches for bioinformatics. Briefings in bioinformatics, 22(1):393–415.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiselev, V. Y., Yiu, A., and Hemberg, M. (2018). scmap: projection of single-cell rna-seq data across data sets. Nature
methods, 15(5):359–362.
Lee, J. A., Verleysen, M., et al. (2007). Nonlinear dimensionality reduction, volume 1. Springer.
Lei, D., Zhu, Q., Chen, J., Lin, H., and Yang, P. (2012). Automatic k-means clustering algorithm for outlier detection. In
Information Engineering and Applications: International Conference on Information Engineering and Applications
(IEA 2011), pages 363–372. Springer.
Liao, Y. and Vemuri, V. R. (2002). Use of k-nearest neighbor classifier for intrusion detection. Computers & security,
21(5):439–448.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg,
U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information
Processing Systems 30, pages 4765–4774. Curran Associates, Inc.
Makhzani, A. and Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.
McInnes, L., Healy, J., and Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension
reduction. arXiv preprint arXiv:1802.03426.
Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus clustering: a resampling-based method for class
discovery and visualization of gene expression microarray data. Machine learning, 52:91–118.
Mustroph, A., Lee, S. C., Oosumi, T., Zanetti, M. E., Yang, H., Ma, K., Yaghoubi-Masihi, A., Fukao, T., and
Bailey-Serres, J. (2010). Cross-kingdom comparison of transcriptomic adjustments to low-oxygen stress highlights conserved
and plant-specific responses. Plant Physiology, 152(3):1484–1500.
Ng, A. et al. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19.
Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in neural
information processing systems, 14.
Paul, A.-L., Sng, N. J., Zupanska, A. K., Krishnamurthy, A., Schultz, E. R., and Ferl, R. J. (2017). Genetic dissection of
the arabidopsis spaceflight transcriptome: Are some responses dispensable for the physiological adaptation of plants
to spaceflight? PLoS One, 12(6):e0180186.
Paul, A.-L., Zupanska, A. K., Schultz, E. R., and Ferl, R. J. (2013). Organ-specific remodeling of the arabidopsis
transcriptome in response to spaceflight. BMC Plant Biology, 13(112).
Porterfield, D. M. (2002). The biophysical limitations in physiological transport and exchange in plants grown in
microgravity. Journal of Plant Growth Regulation, 21(2).
Ranzato, M., Boureau, Y.-L., Cun, Y., et al. (2007). Sparse feature learning for deep belief networks. Advances in
neural information processing systems, 20.
Rappoport, N. and Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark.
Nucleic acids research, 46(20):10546–10562.
Ray, S., Gebre, S., Fogle, H., Berrios, D. C., Tran, P. B., Galazka, J. M., and Costes, S. V. (2018).
GeneLab: Omics database for spaceflight experiments. Bioinformatics, 35(10):1753–1759.
Reynolds, D. A. et al. (2009). Gaussian mixture models. Encyclopedia of biometrics, 741(659-663).
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of
computational and applied mathematics, 20:53–65.
Rutter, L., Barker, R., Bezdan, D., Cope, H., Costes, S., Degoricija, L., Fisch, K., Gabitto, M., Gebre, S., Giacomello,
S., et al. (2020). A new era for space life science: international standards for space omics processing (issop). patterns.
Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal
representation learning. Proceedings of the IEEE, 109(5):612–634.
Shahan, R., Hsu, C.-W., Nolan, T. M., Cole, B. J., Taylor, I. W., Greenstreet, L., Zhang, S., Afanassiev, A., Vlot,
A. H. C., Schiebinger, G., et al. (2022). A single-cell arabidopsis root atlas reveals developmental trajectories in
wild-type and cell identity mutants. Developmental cell, 57(4):543–560.
Shulse, C. N., Cole, B. J., Ciobanu, D., Lin, J., Yoshinaga, Y., Gouran, M., Turco, G. M., Zhu, Y., O’Malley, R. C.,
Brady, S. M., et al. (2019). High-throughput single-cell transcriptome profiling of plant cell types. Cell reports,
27(7):2241–2247.
Strehl, A. and Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
Journal of machine learning research, 3(Dec):583–617.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., and
Pachter, L. (2012). Differential gene and transcript expression analysis of rna-seq experiments with tophat and
cufflinks. Nature protocols, 7(3):562–578.
Uelwer, T., Robine, J., Wagner, S. S., Höftmann, M., Upschulte, E., Konietzny, S., Behrendt, M., and Harmeling, S.
(2023). A survey on self-supervised representation learning. arXiv preprint arXiv:2308.11455.
Van Der Maaten, L. (2009). Learning a parametric embedding by preserving local structure. In Artificial intelligence
and statistics, pages 384–391. PMLR.
Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11).
Vega-Pons, S. and Ruiz-Shulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of
Pattern Recognition and Artificial Intelligence, 25(03):337–372.
Villacampa, A., Ciska, M., Manzano, A., Vandenbrink, J. P., Kiss, J. Z., Herranz, R., and Medina, F. J. (2021). From
spaceflight to mars g-levels: Adaptive response of a. thaliana seedlings in a reduced gravity environment is enhanced
by red-light photostimulation. International Journal of Molecular Sciences, 22(2):899.
Xu, C., Tao, D., and Xu, C. (2013). A survey on multi-view learning. arXiv preprint arXiv:1304.5634.
Zeng, I. S. L. and Lumley, T. (2018). Review of statistical learning methods in integrated omics studies (an integrated
information science). Bioinformatics and biology insights, 12:1177932218759292.
Zhou, Y., Zhou, B., Pache, L., Chang, M., Khodabakhshi, A. H., Tanaseichuk, O., Benner, C., and Chanda, S. K. (2019).
Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature communications,
10(1):1523.
Supplementary Data and Table
We provide supplementary data at the OSDR GitHub repository
(https://github.com/OpenScienceDataRepo/Plants_AWG/tree/main/Manuscript_Code/glare), including the code for the method and reproducible results
such as single-cell pre-trained model weights, data representations, ensemble clustering results, Gene Ontology analysis