ATMeQ: A Machine Learning-Based Framework for Amyotrophic Lateral Sclerosis Disease using RNA-seq Meta-Analysis

Ahmed Saif, Md Tarikul Islam, Md Aktaruzzaman

Research Square
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8614090/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1.

Abstract
[...] methods Random Forest importance, Gradient Boosting, Recursive Feature Elimination (RFE), and the Boruta algorithm, narrowed this set down to a biologically meaningful six-gene signature (ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, LOC107987008). Employing this signature, fifteen machine learning models were trained and optimized through hyperparameter tuning.
The top-performing model, a Gradient Boosting Classifier (GBC), was validated through k-fold cross-validation, achieving 96% accuracy, a 0.92 Matthews Correlation Coefficient (MCC), 0.937 precision, 0.991 recall, 0.962 F1-score, and a 0.993 AUC-ROC. Therefore, this model was deployed as ATMeQ, a publicly available web tool (https://atmeq-ai.streamlit.app/) with potential utility for clinicians and researchers to predict ALS risk and validate biomarkers. Collectively, the study demonstrates that integrative transcriptomics and machine learning can significantly reduce potential diagnostic delays and enable biomarker-driven detection in ALS.

Subject areas: Health sciences/Biomarkers; Biological sciences/Computational biology and bioinformatics; Biological sciences/Genetics; Health sciences/Neurology; Biological sciences/Neuroscience
Keywords: Amyotrophic Lateral Sclerosis (ALS); RNA-seq; Meta-analysis; Differential Gene Expression; Machine Learning-based Diagnosis; Gene signature

1. Introduction
Amyotrophic Lateral Sclerosis (ALS) is a rare and progressive neurodegenerative disorder that primarily affects motor neurons in the brain and spinal cord, leading to muscle weakness, atrophy, and eventually paralysis [ 1 , 2 ]. This degeneration specifically affects both upper motor neurons, which originate in the cerebral cortex and extend to the brainstem and spinal cord, and lower motor neurons, which transmit signals directly from the brainstem or spinal cord to the muscles [ 3 ]. As a result, essential voluntary movements such as walking, talking, and breathing become increasingly impaired. The disease can be classified into two types based on its clinical presentation: familial ALS (fALS), which accounts for about 5–10% of cases and has a genetic etiology, and sporadic ALS (sALS), which makes up the remaining 90–95% of cases and may result from a combination of genetic predispositions and environmental factors [ 4 ].
Sadly, most research indicates that the progression of ALS often leads to death within 2 to 5 years after the onset of symptoms, primarily due to respiratory failure[ 5 ]. Moreover, while current statistics show that approximately 9.9 individuals per 100,000 are affected globally, projections suggest that cases could rise by as much as 69% by 2040, presenting an escalating and critical challenge for neurology and translational neuroscience [ 6 ]. Due to these reasons, for over half a century, translational research in ALS has driven numerous clinical trials and advanced scientific methods to explore neuroprotective compounds, but despite these efforts, a cure remains undiscovered [ 7 – 9 ]. Indeed, since the 1990s, over 50 investigational drugs for ALS have failed in Phase II/III clinical trials, underscoring the immense challenges of developing effective therapies for this fatal neurodegenerative disease [ 10 ]. To date, only two treatments, riluzole and edaravone, have gained regulatory approval. Riluzole, a glutamate modulator that reduces excitotoxicity, extends median survival by approximately 2–3 months but does not meaningfully halt disease progression[ 11 ]. In contrast, Edaravone, an antioxidant designed to mitigate oxidative stress, demonstrated a 33% slower rate of functional decline over 24 weeks in clinical trials involving a narrowly defined subset of early-stage ALS patients[ 11 , 12 ]. Given these limited therapeutic options, timely and precise diagnosis is critical not only to rule out mimicking conditions but also to initiate approved therapies at the earliest possible stage, maximizing their modest benefits. ALS diagnosis begins with a detailed clinical examination, assessing muscle strength, reflexes, and other neurological signs. Features like hyperreflexia, muscle wasting, and weakness help differentiate ALS from other conditions[ 13 , 14 ]. 
Clinicians also use electrodiagnostic tests: electromyography (EMG) measures muscle electrical activity and helps confirm nerve involvement, and nerve conduction studies check nerve function [ 15 , 16 ]. Imaging, such as MRI, helps rule out other problems like spinal cord compression [ 17 ]. Beyond these, researchers are exploring advanced MRI techniques like diffusion tensor imaging (DTI) and diffusion-weighted imaging (DWI) to examine brain and spinal cord microstructure, as well as Positron Emission Tomography (PET) to measure brain activity [ 18 – 20 ]. The Gold Coast criteria, introduced in 2019, have further simplified ALS diagnosis by focusing on progressive motor impairment and upper and lower motor neuron dysfunction in at least one body region [ 21 – 23 ]. Although progress has been made, diagnosing ALS remains challenging, largely because its symptoms overlap significantly with those of other neurological disorders and no definitive diagnostic test is available [ 24 ]. ALS diagnosis continues to face a median delay of 12 months from symptom onset, with patients typically consulting three or more specialists before receiving a confirmed diagnosis [ 25 , 26 ]. Moreover, ALS presents variably, as either bulbar-onset disease (affecting speech and swallowing) or limb-onset disease (impacting peripheral muscles like the hands and feet). This heterogeneity complicates early recognition and diagnosis [ 27 ]. While the El Escorial, revised El Escorial, and Awaji criteria offer diagnostic frameworks, they lack sensitivity and are primarily designed for research rather than clinical practice [ 28 , 29 ]. Most importantly, there is currently no established biomarker that has been validated for clinical application in ALS, which significantly hinders early detection and disease monitoring [ 30 ].
Given that biomarkers are crucial for the early and accurate diagnosis of neurodegenerative diseases (NDs) like Alzheimer’s and Parkinson’s[ 31 , 32 ] identifying reliable diagnostic and prognostic gene biomarkers could substantially improve our understanding and management of ALS. In light of these challenges, high-throughput RNA-seq has emerged as a promising approach to bridge this diagnostic gap. This technology has revolutionized the field of transcriptomics by providing a comprehensive view of the transcriptome, enabling the identification of novel biomarkers for various diseases [ 33 , 34 ]. Unlike traditional methods such as microarrays, RNA-seq enables the detection of both known and novel transcripts with high sensitivity and accuracy while also providing precise quantification of gene expression [ 35 , 36 ]. A key strength of RNA-seq lies in its ability to uncover subtle disease-associated expression patterns that may serve as diagnostic, prognostic, or therapeutic indicators [ 37 , 38 ]. By capturing the full dynamic range of gene expression, including low-abundance transcripts and splice variants, RNA-seq reveals molecular signatures often missed by other technologies [ 39 ]. Furthermore, standardized computational pipelines now allow researchers to reliably identify differentially expressed genes (DEGs), reducing variability and enhancing reproducibility in biomarker discovery [ 40 , 41 ]. These capabilities have positioned RNA-seq as a transformative tool for identifying disease-specific biomarkers, particularly for complex conditions like neurodegenerative diseases, where molecular stratification is critical. In neurodegenerative disease research, brain and blood samples form a synergistic duo, where brain tissue provides direct insights into molecular pathology, while blood offers a scalable, non-invasive platform for early diagnosis and monitoring [ 42 , 43 ]. 
RNA-seq bridges these domains, revealing biomarkers such as blood-derived microRNAs and neurofilament light chain (NfL) that mirror pathological changes in the brain, enabling breakthroughs in detecting neurological diseases years before symptoms emerge [ 44 ]. By leveraging both types of samples, researchers can accelerate the discovery of actionable biomarkers, ultimately transforming how we predict, track, and combat neurodegeneration [ 45 ]. However, the vast and intricate nature of RNA-seq datasets necessitates the use of advanced computational techniques to fully unlock their potential. Machine learning (ML) algorithms have shown great promise in analyzing vast and intricate datasets, such as those generated by high-throughput RNA sequencing [ 46 ]. By leveraging these advanced computational techniques, researchers can pinpoint gene expression patterns unique to specific diseases, accurately classify biological samples, and forecast disease progression [ 47 ]. ML models excel at learning directly from the data, uncovering subtle relationships and patterns that often elude traditional statistical methods, even amidst the noise and variability typical of biological datasets [ 48 ]. A key application is the use of supervised ML techniques to identify critical genes from RNA-seq data, where models are trained on labeled datasets to recognize important genetic markers [ 49 ]. Numerous RNA-seq studies compare cases and controls. One of them developed a logistic regression model that identified 22 biomarker genes (AUC: 0.990) from PBMC RNA-seq data, linking immune response, cell signaling, and metabolism to ALS mechanisms [ 50 ]. In another study, RefMap integrated GWAS with RNA-seq/ATAC-seq data from iPSC-derived motor neurons, uncovering 690 ALS-associated genes and validating KANK1’s role in TDP-43 pathology [ 51 ].
Meanwhile, a multi-omic approach combined unsupervised clustering and the MOALS model to analyze 9,847 ALS-related genes and 7,699 rare variants, boosting prediction accuracy by 1.7–6.2% [ 52 ]. Deep learning via a Keras/TensorFlow23 classifier processed WGS, RNA-seq, and chromatin data to classify ALS cases and reveal novel transcriptional/mutational signatures [ 53 ]. WGCNA and classification models extracted a 20-gene signature from peripheral blood RNA-seq (96 sALS vs. 48 controls), achieving 78% accuracy. GLM, Decision Trees, and Random Forests analyzed spinal cord RNA-seq data, yielding 83% cross-validation and 77% test accuracy [ 54 ]. Finally, CNNs and logistic regression leveraged voice recordings (AUC: 0.86 for bulbar function) and accelerometer data (median AUC: 0.73 for limb function) to predict ALS severity via ALSFRS-R scores, showcasing digital biomarker potential [ 55 ]. However, integrating diverse biological specimens, such as brain tissue and blood, with an array of multiple ML algorithms could provide a more robust approach to ALS detection than relying on a single method. To advance ALS diagnostics, we aim to integrate high-throughput RNA-seq data and machine learning (ML) to develop a predictive framework for ALS classification. Using publicly available ALS-associated gene expression datasets, we will implement a next-generation sequencing (NGS) pipeline to identify DEGs between ALS and control samples. These candidates will be refined through four advanced feature selection methods to define a pathophysiology-driven gene signature. We will then systematically evaluate 15 ML algorithms to optimize accuracy in distinguishing ALS from control samples. 
To translate these findings into clinical utility, the finalized model will be deployed via ATMeQ, a publicly accessible web application designed to enable clinicians and researchers to predict ALS risk and validate candidate biomarkers, thereby enhancing diagnostic precision, accelerating therapeutic development, and improving outcomes for ALS patients. The workflow for this study is presented in Fig. 1.

2. Methods

2.1. Retrieval of NGS data
Next-generation sequencing (NGS) data for this study were retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/), a publicly accessible repository managed by the National Center for Biotechnology Information (NCBI) [ 56 ]. Three independent projects, BioProject PRJNA512012 (GEO Series GSE124439), PRJNA831563 (GEO Series GSE201407), and PRJNA1163403 (GEO Series GSE277709), were selected to compile RNA-Seq datasets from a total of 224 postmortem samples, comprising 183 amyotrophic lateral sclerosis (ALS) patients and 41 non-ALS controls. All data were downloaded using NCBI’s SRA Toolkit. The curated datasets encompassed key brain regions implicated in ALS pathology, such as the motor cortex and prefrontal cortex. The selection of these datasets was guided by stringent criteria, including the availability of high-quality RNA-Seq data, comprehensive metadata, and adequate sample size, to support biomarker discovery and therapeutic target identification in ALS research and to ensure biological relevance for downstream analysis. The selection criteria for these datasets are outlined in Fig. 2, while Supplementary Information (SI) 1 provides detailed information, including project ID, sample size, ALS and control distributions, gender, age range, brain region of origin, disease stage, and relevant references. Subsequent computational processing pipelines are described in the following sections.

2.2. Preprocessing of Raw Data

Quality Control of FASTQ Files
The quality of raw sequencing reads was evaluated using FastQC (version 0.11.9) [ 57 ], a widely used tool for high-throughput sequencing data quality control, including assessments of read quality, GC content, adapter contamination, and sequence duplication levels.

Trimming FASTQ Files
To enhance sequencing read quality by removing low-quality bases and adapter sequences, read trimming was performed using Trimmomatic (version 0.39) [ 58 ], a tool designed to process Illumina data. The trimming parameters included TRAILING:10, SLIDINGWINDOW:4:15, MINLEN:36, and -phred33 for quality score encoding. After trimming, the processed FASTQ files were reanalyzed with FastQC to verify improved read quality.

Alignment to the Reference Genome
High-quality trimmed reads were aligned to the human reference genome (GRCh38) using HISAT2 (version 2.2.1) [ 59 ], a fast and splice-aware aligner optimized for RNA-Seq data. The alignment leveraged the pre-built HISAT2 index for GRCh38, which includes splice site annotations to ensure accurate mapping of reads spanning exon-exon junctions. The resulting SAM file was subsequently converted to a sorted BAM file using SAMtools (version 1.16) [ 4 ].

Quantification Using FeatureCounts
After alignment, gene expression levels were quantified using featureCounts [ 60 ], a tool optimized for assigning RNA-seq reads to genomic features. We used the Ensembl GRCh38 release 106 annotation file (Homo_sapiens.GRCh38.106.gtf) to ensure accurate read counting across annotated genes, exons, and transcripts. This annotation file was accessed directly from the Ensembl FTP repository (https://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/) to maintain consistency with the reference genome used during alignment.

Filtering of Count Results
After generating the raw counts, we filtered out genes with low expression levels to focus on genes with sufficient coverage for further analysis.
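A low-count filter of this kind can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the gene names and counts are hypothetical, the cutoff is passed as a parameter (10 total reads in this example, treated as inclusive), and the function name `filter_low_count_genes` is not taken from the study's pipeline.

```python
def filter_low_count_genes(counts, min_total=10):
    """Keep genes whose read counts, summed over all samples, reach min_total."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


# Hypothetical raw counts: gene -> reads per sample (three samples).
raw_counts = {
    "GENE_A": [0, 3, 2],       # total 5, dropped
    "GENE_B": [4, 4, 2],       # total 10, kept (cutoff is inclusive)
    "GENE_C": [120, 98, 143],  # total 361, kept
}

kept = filter_low_count_genes(raw_counts)
print(sorted(kept))  # ['GENE_B', 'GENE_C']
```

Filtering on the row sum rather than on per-sample counts keeps genes that are expressed in only a subset of samples, which matters when cases and controls differ.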
Specifically, genes were discarded if their total read count across all samples fell below a threshold of 10 reads. This can be formalized as:

$$\text{Retained genes: } \sum_{j=1}^{N} C_{ij} \ge 10$$

where $C_{ij}$ represents the read count for gene i in sample j, and N is the total number of samples. This step ensures that only genes with adequate expression are retained for downstream statistical modeling.

2.3. Identification of differentially expressed genes (DEGs)
After preprocessing the data, we employed the DESeq2 statistical tool to identify differentially expressed genes (DEGs) [ 61 ]. To ensure the reliability of these identified DEGs, we adjusted the P-values using the false discovery rate (FDR) method [ 62 ]. For each gene, we calculated the fold change (FC) between the ALS and control groups. Genes with an adjusted P-value (P-adjusted) ≤ 0.05 and |log2FC| > 0.5 were considered significant DEGs [ 61 ]. For downstream machine learning (ML) applications, normalized counts were variance-stabilized using the DESeq2 vst() transformation to mitigate mean-variance dependence. To address potential confounding technical variation, the limma::removeBatchEffect() function [ 63 ] was applied to the variance-stabilized data to remove batch effects. The resulting datasets, which were normalized, variance-stabilized, and batch effect-corrected, were then used for feature selection.

2.4. Train-Test Split
The dataset was split into training and testing sets using a 70/30 ratio, where 70% of the data was used for model training and 30% for testing. This division ensures that the model is trained on a majority of the data while reserving a smaller portion for unbiased evaluation of its performance [ 64 ].

2.5. Oversampling Technique for the Minority Class
The dataset exhibited class imbalance, with the minority class being under-represented compared to the majority class.
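The 70/30 split and the resulting imbalance can be illustrated with a small plain-Python sketch that uses the class totals from Section 2.1 (183 ALS, 41 controls). Several choices here are assumptions, not statements about the authors' code: the split is taken to be stratified by class, the test size is rounded up, leftover slots are assigned by largest fractional remainder, and the function names are hypothetical.

```python
import random


def allocate_train_counts(class_counts, n_train):
    """Share n_train across classes in proportion to class size (largest remainder)."""
    total = sum(class_counts.values())
    exact = {c: n * n_train / total for c, n in class_counts.items()}
    alloc = {c: int(v) for c, v in exact.items()}
    leftover = n_train - sum(alloc.values())
    # Give remaining slots to the classes with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1
    return alloc


def stratified_split(labels, test_percent=30, seed=0):
    """Return (train_idx, test_idx) with class ratios preserved as far as possible."""
    n = len(labels)
    n_test = -(-n * test_percent // 100)  # ceiling division on integers
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    alloc = allocate_train_counts(counts, n - n_test)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for c in counts:
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        train_idx += idx[:alloc[c]]
        test_idx += idx[alloc[c]:]
    return sorted(train_idx), sorted(test_idx)


labels = ["ALS"] * 183 + ["control"] * 41
train_idx, test_idx = stratified_split(labels)
train_counts = {c: sum(labels[i] == c for i in train_idx) for c in ("ALS", "control")}
print(len(train_idx), len(test_idx), train_counts)
# 156 68 {'ALS': 127, 'control': 29}
```

Under these assumptions the training set holds 127 ALS and 29 control samples, which agrees with the counts reported in Section 3.3 and shows how strongly the controls are outnumbered.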
To address this issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE) [ 65 ] exclusively to the training set after splitting. SMOTE generates synthetic samples for the minority class by interpolating between existing instances. This helps to mitigate the risk of model bias toward the majority class.

2.6. Feature Selection for ML Models
This study employed a diverse set of feature selection strategies to identify the most critical features required for training various machine learning models. Feature importance was determined through the application of four distinct methodologies: Random Forest Classifier [ 66 ], Gradient Boosting Classifier [ 67 ], Recursive Feature Elimination [ 68 ], and Boruta [ 69 ]. To prevent information leakage, feature selection was performed exclusively on the training set. We used the scikit-learn “SelectFromModel” function with the Random Forest Classifier and Gradient Boosting Classifier algorithms to evaluate the relative importance of each feature in the model. The recursive feature elimination technique iteratively removes features with the least significance by using a linear regression model. Additionally, the Boruta technique was employed to assess feature importance by iterating over randomized decision trees and highlighting the most relevant features. These combined strategies facilitated the identification of key features from our dataset. A Venn diagram was constructed to determine the set of features that were common across all methods, and these shared features were subsequently selected. The selected features were then used to develop and refine machine-learning models for ALS classification.

2.7. Machine Learning Model Training
Feature scaling plays a vital role in preparing data for machine learning models, and in this study, the input features were standardized using the StandardScaler function from scikit-learn's preprocessing module [ 70 ].
The mean and standard deviation derived from the training dataset were used to scale both the training and test datasets, ensuring no data leakage occurred during the preprocessing step. Once scaled, the test dataset was utilized to evaluate the performance of 15 distinct machine learning algorithms trained on the training data. The models assessed included Gradient Boosting Classifier, Light Gradient Boosting Machine (LightGBM), Extra Trees Classifier, Random Forest Classifier, AdaBoost Classifier, Extreme Gradient Boosting (XGBoost), K Neighbors Classifier (KNN), Linear Discriminant Analysis (LDA), Naive Bayes, Logistic Regression, Decision Tree Classifier, Ridge Classifier, Quadratic Discriminant Analysis (QDA), Dummy Classifier, and Support Vector Machine - Linear Kernel (SVM). Each algorithm was independently trained and tested to compare their performance.

AdaBoost Classifier
AdaBoost (Adaptive Boosting) is an iterative ensemble method that focuses on misclassified examples by adjusting their weights in subsequent iterations [ 71 ]. Initially, all training samples are assigned equal weights. After each iteration, the weights of misclassified samples are increased, forcing the model to prioritize them in the next round. AdaBoost typically uses weak learners, such as decision stumps, and combines them into a strong classifier. Its adaptability makes it suitable for both binary and multi-class classification tasks.

Decision Tree Classifier
Decision Tree Classifier is a hierarchical model that recursively splits the dataset into subsets based on feature values [ 72 ]. Each internal node represents a decision rule, and each leaf node corresponds to a class label. Decision trees are easy to interpret and visualize but prone to overfitting. Pruning techniques and ensemble methods (e.g., Random Forest) are often employed to improve generalization.
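To make the splitting rule concrete, the following minimal sketch shows how a single decision-tree node might choose a threshold on one feature by minimizing the weighted Gini impurity of the two child subsets. Gini is assumed here as the split criterion (the text does not name one), and the function names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels coded 0/1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split: impurity of the parent node
    for (v_lo, _), (v_hi, _) in zip(pairs, pairs[1:]):
        if v_lo == v_hi:
            continue  # identical values cannot be separated
        thr = (v_lo + v_hi) / 2  # candidate threshold between neighbours
        left = [y for v, y in pairs if v <= thr]
        right = [y for v, y in pairs if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best


# Hypothetical expression values for one gene; 0 = control, 1 = ALS.
expression = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
status = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(expression, status)
print(threshold, impurity)  # 2.5 0.0
```

A full tree repeats this search over every feature and then recurses into each child subset, which is why unpruned trees can keep splitting until they memorize the training data.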
Dummy Classifier
Dummy Classifier is a baseline model that generates predictions without using any feature information [ 73 ]. It serves as a benchmark for evaluating the performance of more sophisticated models. Common strategies include predicting the most frequent class, generating random predictions, or using prior probabilities. Dummy Classifier helps identify whether a proposed model provides meaningful improvements over trivial baselines.

Extra Trees Classifier
The Extra Trees Classifier, or Extremely Randomized Trees, is an ensemble learning method that builds multiple decision trees during training. Unlike Random Forest, it introduces additional randomness by selecting random splits for each feature rather than searching for the best split [ 74 ]. This approach reduces variance and overfitting, making it robust for noisy datasets. The final prediction is obtained by aggregating the outputs of all trees, either through voting (for classification) or averaging (for regression).

Gradient Boosting Classifier
Gradient Boosting Classifier is an ensemble learning technique that combines multiple weak learners (typically decision trees) to form a strong predictive model. It operates by iteratively minimizing the loss function through gradient descent optimization. In each iteration, the algorithm fits a new model to the residuals of the previous model, thereby improving accuracy progressively [ 75 ]. This method is particularly effective for handling complex datasets with non-linear relationships between features and the target variable.

K Neighbors Classifier (KNN)
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for classification and regression [ 76 ]. For classification, KNN predicts the class label of a query point based on the majority vote of its k-nearest neighbors in the feature space. The distance metric (e.g., Euclidean, Manhattan) determines the similarity between points. Despite its simplicity, KNN is effective for small datasets but can become computationally expensive for large datasets due to its reliance on storing all training samples.

Light Gradient Boosting Machine (LightGBM)
LightGBM is an optimized gradient-boosting framework designed for efficiency and scalability. It employs two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to reduce computational overhead while maintaining high accuracy [ 77 ]. By focusing on instances with larger gradients and bundling mutually exclusive features, LightGBM achieves faster training times compared to traditional gradient boosting methods. It is widely used in large-scale machine-learning tasks, such as ranking and classification.

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction and classification technique that seeks to maximize the separation between classes [ 78 ]. It projects the data onto a lower-dimensional space while preserving class-discriminative information. LDA assumes that the data follows a Gaussian distribution and that all classes share the same covariance matrix. It is particularly useful when the number of features exceeds the number of samples.

Logistic Regression
Logistic Regression is a statistical model used for binary and multi-class classification tasks [ 79 ]. It estimates the probability of a class label using a logistic function applied to a linear combination of input features. Logistic Regression is interpretable, computationally efficient, and works well for linearly separable data. Regularization techniques like L1 (Lasso) and L2 (Ridge) can be incorporated to handle multicollinearity and prevent overfitting.

Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes' theorem, which assumes conditional independence between features given the class label [ 80 ]. Despite this "naive" assumption, the algorithm performs surprisingly well in text classification and spam filtering tasks. Variants of Naive Bayes, such as Gaussian Naive Bayes and Multinomial Naive Bayes, cater to different types of data distributions.

Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis (QDA) is an extension of LDA that relaxes the assumption of shared covariance matrices across classes [ 81 ]. QDA models each class with its own covariance matrix, resulting in quadratic decision boundaries. While more flexible than LDA, QDA requires more data to estimate the additional parameters accurately.

Random Forest Classifier
Random Forest Classifier is a popular ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of their predictions for classification tasks [ 82 ]. It mitigates overfitting by introducing randomness in two ways: bootstrapping samples for each tree and selecting a random subset of features at each split. Random Forest is highly versatile and performs well across a wide range of problems, including feature selection and missing data imputation.

Ridge Classifier
Ridge Classifier is a variant of Ridge Regression adapted for classification tasks [ 83 ]. It applies L2 regularization to penalize large coefficients, reducing overfitting and improving stability. Unlike Logistic Regression, the Ridge Classifier directly minimizes the squared loss instead of maximizing likelihood. It is particularly effective when dealing with multicollinear features.

Support Vector Machine - Linear Kernel (SVM)
Support Vector Machine (SVM) with a linear kernel is a powerful classification algorithm that identifies the optimal hyperplane separating classes in the feature space [ 84 ]. The margin between the hyperplane and the nearest data points (support vectors) is maximized to ensure robustness.
SVM is effective for high-dimensional data and can incorporate kernel functions to handle non-linear relationships.

Extreme Gradient Boosting (XGBoost)
XGBoost is an advanced implementation of gradient boosting that incorporates regularization techniques (L1 and L2) to prevent overfitting [ 85 ]. It also uses second-order gradient information of the loss function, enabling faster convergence and higher accuracy. XGBoost supports parallel processing, missing-value handling, and custom objective functions, making it a preferred choice for structured/tabular data competitions and real-world applications.

2.8. Hyperparameter Tuning
Hyperparameter tuning is an essential step in optimizing the performance of ML models and was a critical component of this study. The primary goal of this process is to identify the configuration of hyperparameters that maximizes model performance while preserving robustness and generalization. To achieve this, we utilized the scikit-learn library in Python, which provides a comprehensive suite of tools for hyperparameter optimization [ 73 ]. Our approach involved employing GridSearchCV, a systematic method for traversing the hyperparameter space by evaluating all possible combinations within a predefined grid [ 86 ]. This exhaustive search strategy ensures that no combination within the grid is overlooked, enabling the identification of the best-performing hyperparameters among those considered. This methodology lays a foundation for the robustness of the ML models.

2.9. K-Fold Cross-Validation with the Best-Performing Model
Cross-validation is an important technique in machine learning that offers a more reliable estimate of a model's performance on unseen data compared to a single train-test split. It helps mitigate the variability that can arise from relying on just one partition of the data for testing.
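As a rough illustration of the idea, the plain-Python sketch below deals samples into stratified folds, holds out each fold once, and averages the Matthews correlation coefficient across folds. The toy data, the `fit_midpoint_rule` stand-in model, and the helper names are hypothetical; they are not part of the study's scikit-learn pipeline.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Deal sample indices into k folds, class by class, so class ratios stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fit_midpoint_rule(xs, ys):
    """Toy stand-in for a real model: cut halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int((x > cut) == (mean1 > mean0))


# Hypothetical one-feature toy data: 30 cases (1) and 10 controls (0).
rng = random.Random(42)
y = [1] * 30 + [0] * 10
X = [(2.0 if label == 1 else 0.0) + rng.uniform(-0.3, 0.3) for label in y]

folds = stratified_folds(y, k=5)
scores = []
for held_out in folds:
    held = set(held_out)
    train = [i for i in range(len(y)) if i not in held]
    predict = fit_midpoint_rule([X[i] for i in train], [y[i] for i in train])
    scores.append(mcc([y[i] for i in held_out], [predict(X[i]) for i in held_out]))

mean_mcc = sum(scores) / len(scores)
print(f"mean MCC over {len(folds)} folds: {mean_mcc:.3f}")
```

Because the model is refitted inside the loop, each held-out fold is scored by a model that never saw it; any preprocessing fitted on data (scaling, oversampling, feature selection) would need to sit inside the same loop to keep that property.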
After training and hyperparameter tuning of 15 models, the best-performing model was selected. To further assess its robustness, we employed the “StratifiedKFold” function from scikit-learn to conduct a 10-fold cross-validation [ 73 ]. This involved merging the training and test datasets and dividing them into 10 stratified folds. In each iteration, one fold served as the validation set, while the remaining nine were used for training. The model's performance was evaluated using four key metrics: accuracy, Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC–ROC), and the F1 score. This process was repeated across all 10 folds, and the results from each iteration were averaged to provide an overall measure of the model's expected performance on unseen data [ 87 ]. This approach provides a more comprehensive and reliable evaluation of the model's generalization capabilities than a single split.

2.10. Deployment of the Model as a Web Application
Finally, we deployed the developed Gradient Boosting Classifier (GBC) model as a user-friendly web application, making it easily accessible to the research community. The web application, named ATMeQ, was built using the Streamlit Python framework (https://www.streamlit.io/) and hosted on the Streamlit Share cloud platform. The source code for the application is maintained in a dedicated GitHub repository, ensuring transparency and facilitating collaboration. The ATMeQ web app is designed to accept input data in the form of a VST file (provided as a CSV file), process it through the GBC model, and return predictions for ALS disease status. Additionally, users can download the prediction results directly from the app, enhancing its utility for research and analysis purposes.

3. Results

3.1. Quantification of the High-quality Raw Reads
The quality of the raw sequencing data retrieved from NCBI was evaluated using FastQC v0.11.5.
All raw reads from a total of 224 samples were evaluated and confirmed to meet high-quality standards. Following this quality control step, the reads were aligned to the human reference genome. Alignment identified a total of 26,396 genes in project PRJNA512012 (GSE124439), 19,908 genes in PRJNA831563 (GSE201407), and 25,034 genes in PRJNA1163403 (GSE277709). These gene sets were subsequently subjected to DEG analysis.

3.2. Identification of Differentially Expressed Genes (DEGs)

Differential gene expression analysis was performed using the DESeq2 package in R. Genes were classified as DEGs based on an adjusted p-value ≤ 0.05 and |log2FC| > 0.5. In the PRJNA512012 (GSE124439) dataset, 1,609 significant DEGs were identified by comparing case samples to normal controls. Similarly, analyses of two additional transcriptomic datasets, PRJNA831563 (GSE201407) and PRJNA1163403 (GSE277709), revealed 1,302 and 2,223 DEGs, respectively. Volcano plots and MA plots illustrating the distribution of DEGs for each dataset are presented in Fig. 3. Additionally, a cross-dataset comparison identified 32 DEGs that were consistently differentially expressed across all three datasets (Supplementary Information (SI) 2 and Fig. S1).

Figure 3. Differential gene expression analysis comparing ALS and healthy controls across three independent RNA-seq datasets. Panels A–B show volcano and MA plots for GSE124439, panels C–D show volcano and MA plots for GSE201407, and panels E–F show volcano and MA plots for GSE277709, illustrating log₂ fold-change distributions and expression-dependent differential regulation between ALS and control samples.

3.3. DEG-Based Data Preprocessing and Feature Selection

The dataset, derived from DEGs and consisting of 224 samples with 32 features, was split into training and testing subsets.
The training set comprised 70% of the data (156 samples), while the testing set contained 30% (68 samples). In the training dataset, a class imbalance was observed, with healthy samples being the minority class (29 samples) compared to ALS samples (127 samples). To address this imbalance, we applied SMOTE, a technique that synthetically generates additional samples for the minority class. This preprocessing step balanced the distribution of healthy and ALS samples, as illustrated in Fig. 4. Following data balancing, we performed feature selection using four distinct methodologies to identify the most relevant features associated with the target variable (see Table 1). We then identified a set of common features that consistently ranked as highly relevant across all four approaches, as illustrated in Fig. S2. These features included ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, and LOC107987008 (see Supplementary Information (SI) 3).

Figure 4 Class balance before and after SMOTE in the training dataset. (A) Original class imbalance with fewer healthy samples compared to ALS samples. (B) Balanced class distribution achieved after applying SMOTE.

Table 1 Gene features selected by four independent feature selection methods: Random Forest, Gradient Boosting Classifier, Recursive Feature Elimination (RFE), and Boruta, for ALS versus control classification.
Random forest Gradient boosting classifier Recursive feature elimination Boruta HERC2P2 HERC2P2 HERC2P2 HERC2P2 LOC105371874 LOC105371874 LOC105371874 LOC107987008 NPY PTGER2 HSPA2 PTGER2 LOC107987008 LOC112268045 LOC107987008 SLC1A7 PTGER2 ABCA8 LOC105379442 KCNE4 LOC112268045 GREM1 KCNE4 COL6A4P2 LOC105379442 LOC107987071 ABCA8 ACTA1 KCNE4 BOK.AS1 GREM1 MYBPC2 LOC107987071 SLC1A7 LOC107987071 LOC107987075 BOK.AS1 BVES SLC1A7 ABCA4 RASSF9 LRRC63 PAPLN.AS1 LRRC63 COL6A4P2 RASSF9 COL6A4P2 ACTA1 LACC1 ACTA1 MYBPC2 COL6A4P2 MYBPC2 EFHD1 ACTA1 EFHD1 LOC107987003 EFHD1 LOC107987003 LOC105370803 LOC107987003 LOC105370803 LOC107987075 LOC105370803 LOC107987075 ABCA4 LOC107987075 ABCA4 ABCA4

3.4. Model Training and Hyperparameter Optimization

In this step, we developed and systematically optimized multiple machine learning (ML) models for a supervised classification task using the compact feature set of six genes identified through the preceding feature selection procedure. A total of 15 distinct ML algorithms were trained and evaluated on this dataset. Model performance was assessed using multiple complementary metrics: accuracy, precision, recall, area under the receiver operating characteristic curve (AUC–ROC), F1 score, and Matthews correlation coefficient (MCC). These results are summarized visually in Fig. 5, whose panels present accuracy and MCC, precision and recall, and AUC–ROC and F1 score, respectively. To further enhance predictive performance and model robustness, we performed systematic hyperparameter tuning for each of the 15 ML models. This optimization explored hyperparameter combinations using a grid search strategy, which makes the selection of configurations reproducible. The final tuned hyperparameter settings for each model are detailed in Table 2, while a comprehensive comparison of baseline and tuned model performance metrics is provided in Supplementary Information (SI) 4.
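For reference, the threshold-based metrics listed above can be written directly in terms of confusion-matrix counts. The sketch below uses the standard binary definitions; the counts are hypothetical and chosen only to add up to a 68-sample test set, and the function name `binary_metrics` is illustrative. The text does not state whether the reported precision, recall, and F1 are binary or class-averaged variants, so this should be read as the textbook definitions rather than the study's exact computation. AUC–ROC is omitted because it requires predicted scores rather than counts.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts (ALS = positive class), illustrative only.
metrics = binary_metrics(tp=50, fp=3, fn=5, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy and F1 can remain high while MCC is noticeably lower, which is one reason to report MCC alongside the other metrics.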
Overall, hyperparameter tuning resulted in consistent performance improvements across all evaluated metrics, with these gains summarized visually in Fig. 6. Among the 15 tuned models, the Gradient Boosting Classifier (GBC) stood out as the top performer. It achieved the highest scores on most key metrics: an accuracy of 0.9171, an MCC of 0.7197, a precision of 0.9243, a recall of 0.9171, an AUC–ROC of 0.9385, and an F1-score of 0.9107. The high accuracy and MCC indicate strong overall classification ability, while the high precision and recall show the model’s effectiveness in identifying true positives and minimizing errors. The AUC–ROC and F1 scores further confirm the GBC’s discriminatory power and balanced performance, making it the standout model for this classification task.

Table 2 Optimized hyperparameter configurations for the 15 machine-learning models.

ML model: hyperparameter = selected best value
Gradient Boosting Classifier: learning_rate = 0.01; n_estimators = 300; max_depth = 3; subsample = 0.8
LightGBM: learning_rate = 0.1; n_estimators = 150; max_depth = 4; num_leaves = 15
Extra Trees Classifier: n_estimators = 100; max_depth = 8; min_samples_split = 5
Random Forest Classifier: n_estimators = 200; max_depth = 10; min_samples_leaf = 2
AdaBoost Classifier: learning_rate = 0.1; n_estimators = 50; base_estimator = DecisionTree(max_depth = 2)
XGBoost: learning_rate = 0.05; n_estimators = 50; max_depth = 3
KNN: n_neighbors = 5; weights = 'distance'
Linear Discriminant Analysis: solver = 'lsqr'; shrinkage = 0.1
Naive Bayes: var_smoothing = 1e-08
Logistic Regression: C = 0.1; penalty = 'l2'
Decision Tree Classifier: max_depth = 5; min_samples_split = 5
Ridge Classifier: alpha = 10; solver = 'cholesky'
Quadratic Discriminant Analysis: reg_param = 0.1
Dummy Classifier: strategy = 'stratified'
Support Vector Machine (SVM): C = 0.1; gamma = 0.001

3.5. K-fold cross-validation with hyperparameter-tuned Gradient Boosting Classifier

Cross-validation is a fundamental technique in machine learning that helps assess a model’s performance more reliably on new data. Instead of depending on a single train-test split, it uses multiple data divisions to provide a more balanced evaluation. This approach reduces the risk of misleading results that can arise from testing on just one specific partition. In this analysis, we used the StratifiedKFold class from scikit-learn to perform 10-fold stratified cross-validation. This method ensures that the class proportions in each fold mirror those of the entire dataset, which is critical for dependable results when classes are imbalanced. We evaluated the Gradient Boosting Classifier, configured with the tuned hyperparameters, using this method. The cross-validation process unfolded as follows: the dataset was partitioned into 10 folds of approximately equal size. In each of the 10 iterations, nine folds were used to train the model, while the tenth fold served as the validation set. This cycle repeated until every fold had been used for validation exactly once. During each iteration, we evaluated the model’s performance on the validation fold using accuracy, precision, recall, F1 score, and AUC-ROC. To summarize the model’s expected performance, we averaged these metrics across all 10 folds. The results showed strong overall performance, with an average accuracy of 0.921, a precision of 0.944, a recall of 0.906, an F1 score of 0.920, and an AUC-ROC of 0.978, as illustrated in Fig. 7. These averaged values offer a more reliable estimate of how well the model is expected to perform on new, unseen data than a single split would.

3.6. Model deployment as the ATMeQ web app and assessment

To make the prediction model easily accessible for biologists and chemists in their research, we have developed it as a publicly available web application called ATMeQ, hosted at [ https://share.streamlit.io/user/saiflab ]. Below is a brief guide on how to use the ATMeQ web app (see more details in Fig. 8):

1. Data Preparation: Generate a CSV file containing variance-stabilizing transformation (VST) data derived from DESeq2. VST, provided by DESeq2 (an R package designed for RNA-Seq analysis), stabilizes variance across expression levels, which makes the data better suited to clustering and visualization.
2. Accessing the Application: Enter the URL above into a web browser to reach the ATMeQ web app’s prediction page.
3. File Upload: Use the “Browse files” button to submit the prepared CSV file to the web app.
4. Make Prediction: Launch the prediction process by pressing the “Initiate Analysis” button.
5. Results Review: Examine the outcomes displayed beneath the “Prediction results” heading. Processing typically concludes within a few seconds, and the predictions can be retrieved in CSV format by selecting the “Download Predictions” button.

4. Discussion

Amyotrophic lateral sclerosis (ALS) presents a persistent and formidable diagnostic challenge, primarily due to the nonspecific, heterogeneous, and often subtle nature of its initial symptoms, which significantly overlap with those of other neuromuscular disorders [ 88 , 89 ]. This reality forces a diagnostic process heavily reliant on the exclusion of alternative conditions and the nuanced judgment of specialized clinicians, contributing to a critical and well-documented delay [ 90 ].
Contemporary population-level analyses consistently reveal a substantial diagnostic latency, with a median delay of approximately 11 to 12 months from the onset of first symptoms to a confirmed diagnosis [ 91 , 92 ]. This protracted timeline is especially consequential in a rapidly and relentlessly progressive disease, where lost time equates to lost neurons and diminished therapeutic opportunity [ 92 ]. It powerfully motivates the urgent quest for objective, biological biomarkers that can complement clinical criteria and accelerate diagnostic certainty, thereby enabling earlier intervention. In this pursuit, high-throughput RNA-seq has emerged as a preeminent and powerfully positioned technology. It facilitates an unbiased, genome-wide, and quantitative survey of transcript abundance, supporting detection across an exceptionally broad dynamic range [ 93 ]. Established best-practices frameworks emphasize that RNA-seq workflows encompassing rigorous quality control, accurate alignment, and precise quantification can be standardized to yield highly reproducible transcriptomic profiles [ 94 ]. Notably, direct comparative analyses have consistently reported that RNA-seq holds significant advantages over previous microarray technologies, including superior resolution, a wider dynamic range, lower background noise, and a reduced susceptibility to technical variation [ 95 ]. Perhaps most importantly, RNA-seq uniquely enables the discovery of novel transcripts and the discrimination of biologically critical isoforms, capabilities that are essential for unraveling complex diseases like ALS. These technical strengths collectively solidify its suitability for the discovery of next-generation biomarkers. However, the translation of transcriptomic data into robust, clinically actionable ALS biomarkers is fraught with significant challenges arising from both technical and biological complexity. 
On the technical front, batch effects and other non-biological sources of variation are a pervasive threat in high-throughput studies; if unaddressed, they can create confounding signals that are erroneously attributed to the disease state, compromising the validity of any downstream conclusions. Biologically, ALS is increasingly understood not as a monolithic entity but as a syndrome encompassing considerable molecular heterogeneity [ 96 ]. Emerging research indicates the existence of distinct transcriptomic subtypes within the sporadic ALS population, observable even in peripheral blood, suggesting divergent underlying pathological programs [ 97 ]. Furthermore, recent conceptual frameworks describe ALS "molecular subtypes" as integrative combinations of cellular dysfunctions including neuroinflammation, mitochondrial stress, and cytoskeletal defects that correlate with clinical variation [ 98 , 99 ]. This inherent biological diversity necessitates analytical approaches that can distinguish consistent, core disease signatures from noise and subtype-specific signals [ 97 ]. Confronted by these challenges, the design of this study was explicitly guided by two foundational principles to maximize the clinical relevance and robustness of our findings: (i) a paramount emphasis on reproducibility, achieved through the integration of multiple independent RNA-seq cohorts to isolate transcriptional signals consistent across diverse datasets and technical platforms; and (ii) a rigorous prioritization of model robustness, implemented through leakage-aware data partitioning, conservative consensus feature selection, and systematic benchmarking of machine learning algorithms. Machine learning is particularly well-suited to this task, as it excels at identifying complex, non-linear patterns and feature interactions within high-dimensional biological data, moving beyond simple differential expression to build predictive models of disease state [ 100 ]. 
Our analysis began by quantifying the scope of transcriptomic dysregulation across three independent ALS brain tissue cohorts (PRJNA512012, PRJNA831563, PRJNA1163403). Differential expression analysis revealed extensive remodeling in each dataset, identifying 1,609, 1,302, and 2,223 differentially expressed genes (DEGs), respectively. Strikingly, however, the intersection of these three sizable lists yielded only 32 shared DEGs. This limited overlap illustrates the substantial heterogeneity introduced by factors such as cohort-specific demographics, disease stage at sample collection, tissue dissection protocols, and technical batch effects. It reinforces a critical lesson from prior transcriptomic meta-analyses: reproducibility across independent cohorts is a far stronger indicator of a robust disease association than the statistical magnitude of change within any single study. To build a generalizable classifier, we first established a supervised learning framework using a 70/30 train-test split, strictly preserving the independence of the test set. Recognizing that an imbalance between ALS and control samples in the training data could bias the classifier toward the majority class, we applied the Synthetic Minority Over-sampling Technique (SMOTE) exclusively to the training set. SMOTE generates synthetic minority-class samples by interpolating between neighboring minority samples in feature space, mitigating bias and improving sensitivity without violating the integrity of the hold-out test set, a crucial consideration for ensuring credible performance estimates. The 32 shared DEGs constituted our initial feature universe, which we then refined through a rigorous, consensus-driven feature selection pipeline. We employed four complementary methodologies: Random Forest permutation importance, Gradient Boosting built-in importance, Recursive Feature Elimination (RFE), and the all-relevant selection algorithm Boruta.
This multi-method approach was designed to circumvent the limitations inherent to any single technique. Boruta, in particular, serves as a stringent benchmark, as it uses a wrapper approach around Random Forest to identify all features that perform significantly better than random shadow variables, thereby capturing features that are genuinely relevant even if their individual effect size is moderate. The convergence of these distinct methods onto a compact set of six genes (ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, and LOC107987008) provides strong evidence for the stability and reliability of this signature, reducing the likelihood that it is an artifact of a specific algorithmic bias. The biological composition of this six-gene panel reflects a convergence of molecular functions plausibly linked to ALS pathophysiology. ACTA1 encodes skeletal muscle α-actin, the predominant actin isoform in sarcomeric thin filaments and an essential structural component for muscle contraction and cytoskeletal integrity [ 101 ]. Altered ACTA1 expression in ALS may therefore reflect secondary muscle remodeling in response to denervation. KCNE4, a β-subunit of voltage-gated potassium channels, suppresses Kv1.3 currents by modulating gating and surface trafficking [ 102 ]. Because neuronal hyperexcitability is an early hallmark of ALS, dysregulation of KCNE4-mediated channel modulation could contribute to excitatory imbalance in motor circuits. ABCA4, a photoreceptor ATP-binding cassette transporter, catalyzes the transport of N-retinylidene-phosphatidylethanolamine to remove reactive retinal derivatives and maintain lipid homeostasis in photoreceptor membranes [ 103 ]. Although ABCA4 is primarily retinal, its lipid transport function points to broader metabolic processes that may influence neuronal vulnerability. COL6A4P2 is a pseudogene derived from the COL6A4 gene of the collagen VI family, which organizes the extracellular matrix (ECM) and supports neuronal and glial survival [ 104 ].
Thus, COL6A4P2 expression may mark ECM remodeling or glial activation observed in ALS tissues, though its own function remains uncharacterized. HERC2P2, a pseudogene of the ubiquitin ligase HERC2, has been found to be transcriptionally active and associated with DNA repair–related pathways in other biological contexts. Considering the role of its parent HERC2 in ubiquitin-dependent proteostasis [ 105 ], HERC2P2 may similarly reflect stress-response dysregulation relevant to neurodegeneration. Finally, LOC107987008 represents an uncharacterized non-coding RNA locus, consistent with reports that many reproducible ALS transcriptomic signatures involve unannotated long non-coding RNAs [ 106 ]. Collectively, these genes capture distinct biological axes (structural integrity, excitability, lipid metabolism, extracellular matrix maintenance, ubiquitin-linked regulation, and non-coding RNA signaling) that together mirror the molecular heterogeneity of ALS. With this refined feature set, we carried out a comprehensive benchmarking phase, training and evaluating fifteen distinct machine learning classifiers spanning linear models, support vector machines, k-nearest neighbors, Bayesian classifiers, and ensemble methods. After systematic hyperparameter optimization, a Gradient Boosting Classifier emerged as the top-performing model. On the completely held-out test set, it achieved an accuracy of 0.9171, a Matthews Correlation Coefficient (MCC) of 0.7197, a precision of 0.9243, a recall (sensitivity) of 0.9171, an AUC-ROC of 0.9385, and an F1-score of 0.9107. The model’s robustness was further validated via stratified 10-fold cross-validation on the training data, yielding consistently high mean metrics (e.g., AUC-ROC of 0.978).
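The stratified 10-fold evaluation mentioned above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, `shuffle=True` and the random seeds are assumptions, and the classifier settings are the tuned Gradient Boosting values reported in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the six-gene expression matrix (NOT the study data).
X, y = make_classification(
    n_samples=224, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.2, 0.8], random_state=0,
)

# Tuned Gradient Boosting settings as listed in Table 2.
model = GradientBoostingClassifier(
    learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.8,
    random_state=0,
)

# Each fold keeps roughly the same class proportions as the full dataset.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # shuffle is an assumption

scoring = {
    "accuracy": "accuracy",
    "f1": "f1",
    "roc_auc": "roc_auc",
    "mcc": make_scorer(matthews_corrcoef),
}

# cross_validate refits a fresh clone of the model on each training split.
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Average each metric over the 10 validation folds.
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

If oversampling such as SMOTE is part of the workflow, it would need to be applied inside each training split (for example via a pipeline) so that synthetic samples never leak into the validation fold.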
Gradient boosting’s success in this context is theoretically grounded; it builds a strong predictive model by sequentially combining weak learners (typically decision trees), each fitted to correct the errors of the ensemble so far, which makes it well suited to modeling complex, non-linear interactions within a parsimonious feature set. To translate this computational model into a practical resource, we operationalized it as a lightweight, publicly accessible web application named ATMeQ. This application is designed to accept user-submitted, normalized RNA-seq expression data for the six signature genes and return a predicted classification, along with relevant confidence metrics. By packaging the model in this accessible format, we lower the barrier for independent validation, external testing, and exploratory use by the broader research community, addressing a common translational gap in bioinformatics research. In conclusion, this study demonstrates a principled pathway from the recognition of pervasive transcriptomic heterogeneity in ALS to the development of a parsimonious, reproducible, and high-performing diagnostic classifier. The workflow, which integrates multi-cohort analysis, consensus feature selection, rigorous class imbalance handling, and exhaustive model benchmarking, provides a robust template for biomarker discovery in complex diseases. The resulting six-gene signature, while requiring further validation, captures intersecting aspects of ALS pathophysiology involving neuromuscular integrity, ionic excitability, and cellular homeostasis. We openly acknowledge several limitations. The use of postmortem brain tissue inherently captures late-stage pathology, which may not fully reflect the early molecular events most relevant for timely diagnosis. Although multi-cohort integration mitigates batch effects, unmeasured technical or biological confounders may persist.
The biological functions of some signature genes, particularly the non-coding elements, require deeper mechanistic investigation. Most critically, prospective validation in independent, ideally multi-center cohorts, including samples from pre-symptomatic or early-stage individuals and from accessible tissues like blood, is the essential next step to evaluate true clinical potential. Future directions should focus on this external validation, while also exploring the signature’s utility in stratifying patients into molecular subtypes, predicting disease progression, and evaluating treatment response. Integrating this transcriptomic signal with other multi-omic data layers will further refine our understanding and move the field closer to a future where molecular diagnostics significantly shorten the protracted and difficult diagnostic journey faced by ALS patients today.

5. Conclusion

Based on the preceding discussion, this study successfully navigates the substantial heterogeneity and technical challenges inherent in ALS transcriptomics to identify a concise, reproducible six-gene signature and a high-performance diagnostic classifier. By rigorously integrating multiple independent cohorts and employing consensus feature selection alongside advanced machine learning, we developed a model that achieves robust accuracy and has been operationalized as the publicly accessible ATMeQ web application. While derived from postmortem brain tissue and thus reflective of late-stage pathology, the signature implicates biologically plausible pathways in ALS, including cytoskeletal integrity, ion channel function, and metabolic regulation. The critical next steps involve prospective validation in accessible biospecimens from early-stage patients and exploration of the signature’s utility for disease stratification.
Ultimately, this work provides a principled framework and an applicable tool to advance the urgent quest for molecular biomarkers, aiming to shorten the extended diagnostic interval in ALS and enable earlier therapeutic intervention.

Abbreviations

ALS: Amyotrophic Lateral Sclerosis
RNA-seq: RNA sequencing
DEG: Differentially Expressed Gene
ML: Machine Learning
GBC: Gradient Boosting Classifier
RF: Random Forest
RFE: Recursive Feature Elimination
SMOTE: Synthetic Minority Over-sampling Technique
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
MCC: Matthews Correlation Coefficient
NGS: Next-Generation Sequencing
VST: Variance Stabilizing Transformation
QC: Quality Control
GEO: Gene Expression Omnibus
ATMeQ: ALS Prediction Tool using Machine Learning and RNA-Seq

Declarations

Competing Interests: The authors declare no competing interests.

Funding: This study has no funding.

Author Contribution

Ahmed Saif: Conceptualization, Data curation, Methodology, Software, Formal analysis, Result interpretation, Investigation, Validation, Visualization, Writing – original draft, Writing – review and editing, Supervision. Md. Tarikul Islam and Md Aktaruzzaman: Writing – review and editing.

Acknowledgment

We are thankful to Biological Research on the Brain (BRB), Jashore 7408, Bangladesh.

Data Availability

The RNA-sequencing datasets analyzed in this study were obtained from publicly available datasets deposited in the Gene Expression Omnibus (GEO) repository. The datasets include GSE124439 (PRJNA512012), GSE201407 (PRJNA831563), and GSE277709 (PRJNA1163403), and are accessible through their corresponding web links:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124439
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201407
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE277709

References

Brown, R. H. & Al-Chalabi, A. Amyotrophic lateral sclerosis. N. Engl. J. Med. 377 , 162–172 (2017). Feldman, E. L. et al. Amyotrophic lateral sclerosis.
Lancet 400 , 1363–1380 (2022). Hardiman, O. et al. Amyotrophic lateral sclerosis. Nat. Rev. Dis. Primers . 3 , 1–19 (2017). Masrori, P. & Van Damme, P. Amyotrophic lateral sclerosis: a clinical review. Eur. J. Neurol. 27 , 1918–1929 (2020). Wijesekera, L. C., Nigel, P. & Leigh Amyotrophic lateral sclerosis. Orphanet J. Rare Dis. 4 , 1–22 (2009). Bradford, D. & Rodgers, K. E. Advancements and challenges in amyotrophic lateral sclerosis. Front. Neurosci. 18 , 1401706 (2024). Montes, J. et al. Translational research in ALS, in: Animal and Translational Models for CNS Drug Discovery, Elsevier, : pp. 267–310. (2008). Turner, M. R., Parton, M. J. & Leigh, P. N. Clinical trials in ALS: an overview, in: Semin Neurol, Copyright© 2001 by Thieme Medical Publishers, Inc., 333 Seventh Avenue, New … pp. 167–176. Petrov, D., Mansfield, C., Moussy, A. & Hermine, O. ALS clinical trials review: 20 years of failure. Are we any closer to registering a new treatment? Front. Aging Neurosci. 9 , 68 (2017). Turnbull, J. Why is ALS so Difficult to Treat? Can. J. Neurol. Sci. 41 , 144–155 (2014). Jaiswal, M. K. Riluzole and edaravone: A tale of two amyotrophic lateral sclerosis drugs. Med. Res. Rev. 39 , 733–748 (2019). Sawada, H. Clinical efficacy of edaravone for the treatment of amyotrophic lateral sclerosis. Expert Opin. Pharmacother . 18 , 735–738 (2017). Gordon, P. H., Cheng, B., Katz, I. B., Mitsumoto, H. & Rowland, L. P. Clinical features that distinguish PLS, upper motor neuron–dominant ALS, and typical ALS. Neurology 72 , 1948–1952 (2009). Nechay, A., Stetsenko, T. & Savchenko, O. P101–2285: Amyotrophic lateral sclerosis with juvenile onset. Case report. Eur. J. Pediatr. Neurol. 19 , S122–S123 (2015). Štětkářová, I. & Ehler, E. Diagnostics of amyotrophic lateral sclerosis: up to date. Diagnostics 11 , 231 (2021). De Carvalho, M. et al. Electrodiagnostic criteria for diagnosis of ALS. Clin. Neurophysiol. 119 , 497–503 (2008). Iwasaki, Y., Ikeda, K., Ichikawa, Y., Igarashi, O. 
& Kinoshita, M. MRI in ALS patients. Acta Neurol. Scand. 107 (2003). Kassubek, J. & Pagani, M. Imaging in amyotrophic lateral sclerosis: MRI and PET. Curr. Opin. Neurol. 32 , 740–746 (2019). Gatto, R. G., Li, W., Gao, J. & Magin, R. L. In vivo diffusion MRI detects early spinal cord axonal pathology in a mouse model of amyotrophic lateral sclerosis. NMR Biomed. 31 , e3954 (2018). Jamali, A. M., Kethamreddy, M., Burkett, B. J., Port, J. D. & Pandey, M. K. PET and SPECT imaging of ALS: an educational review. Mol. Imaging . 2023 , 5864391 (2023). Shen, D. et al. The Gold Coast criteria increases the diagnostic sensitivity for amyotrophic lateral sclerosis in a Chinese population. Transl Neurodegener . 10 , 1–8 (2021). Hannaford, A. et al. Diagnostic utility of gold coast criteria in amyotrophic lateral sclerosis. Ann. Neurol. 89 , 979–986 (2021). de Jongh, A. D. et al. Characterising ALS disease progression according to El Escorial and Gold Coast criteria. J. Neurol. Neurosurg. Psychiatry . 93 , 865–870 (2022). Campanari, M. L., Bourefis, A. R. & Kabashi, E. Diagnostic challenge and neuromuscular junction contribution to ALS pathogenesis. Front. Neurol. 10 , 68 (2019). Segura, T. et al. Alcahut-Rodríguez, Symptoms timeline and outcomes in amyotrophic lateral sclerosis using artificial intelligence. Sci. Rep. 13 , 702 (2023). Salameh, J. S., Brown, R. H. Jr & Berry, J. D. Amyotrophic lateral sclerosis pp. 469–476 (in: Semin Neurol, Thieme Medical, 2015). Bradford, D. & Rodgers, K. E. Advancements and challenges in amyotrophic lateral sclerosis. Front. Neurosci. 18 , 1401706 (2024). Masrori, P. & Van Damme, P. Amyotrophic lateral sclerosis: a clinical review. Eur. J. Neurol. 27 , 1918–1929 (2020). Vidovic, M., Müschen, L. H., Brakemeier, S., Machetanz, G. & Naumann, M. Castro-Gomez, Current state and future directions in the diagnosis of amyotrophic lateral sclerosis. Cells 12 , 736 (2023). Vu, L. T. & Bowser, R. Fluid-based biomarkers for amyotrophic lateral sclerosis. 
Neurotherapeutics 14 , 119–134 (2017). Dubois, B., von Arnim, C. A. F., Burnie, N., Bozeat, S. & Cummings, J. Biomarkers in Alzheimer’s disease: role in early and differential diagnosis and recognition of atypical variants. Alzheimers Res. Ther. 15 , 175 (2023). Yamashita, K. Y., Bhoopatiraju, S., Silverglate, B. D. & Grossberg, G. T. Biomarkers in Parkinson’s disease: A state of the art review. Biomark. Neuropsychiatry . 9 , 100074 (2023). Xu, H., Nottingham, R. M. & Lambowitz, A. M. TGIRT-seq protocol for the comprehensive profiling of coding and non-coding RNA biotypes in cellular, extracellular vesicle, and plasma RNAs. Bio Protoc. 11 , e4239–e4239 (2021). Smail, C. & Montgomery, S. B. RNA sequencing in disease diagnosis. Annu. Rev. Genomics Hum. Genet. 25 (2024). Ozsolak, F. & Milos, P. M. RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12 , 87–98 (2011). Sierro, N., Martin, F., Poussin, C., Hoeng, J. & Ivanov, N. V. Comparison of oligonucleotide microarray and RNA-seq technologies in the context of gene expression analysis. EMBnet J. 19 , 88 (2013). Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20 , 631–656 (2019). Han, H. & Jiang, X. Disease biomarker query from RNA-seq data. Cancer Inf. 13 , CIN–S13876 (2014). Zhang, Y. et al. A reliable and quick method for screening alternative splicing variants for low-abundance genes. PLoS One . 19 , e0305201 (2024). Lataretu, M. & Hölzer, M. RNAflow: An effective and simple RNA-seq differential gene expression pipeline using nextflow. Genes (Basel) . 11 , 1487 (2020). Costa-Silva, J., Domingues, D. S., Menotti, D., Hungria, M. & Lopes, F. M. Computational methods for differentially expressed gene analysis from RNA-Seq: an overview. ArXiv Preprint ArXiv :210903625 (2021). Chatterjee, P. & Roy, D. Comparative analysis of RNA-Seq data from brain and blood samples of Parkinson’s disease. Biochem. Biophys. Res. Commun. 484 , 557–564 (2017). Dube, U. et al. 
An atlas of cortical circular RNA expression in Alzheimer disease brains demonstrates clinical and pathological associations. Nat. Neurosci. 22, 1903–1912 (2019). Sproviero, D. et al. Different miRNA profiles in plasma derived small and large extracellular vesicles from patients with neurodegenerative diseases. Int. J. Mol. Sci. 22, 2737 (2021). Shi, M., Caudle, W. M. & Zhang, J. Biomarker discovery in neurodegenerative diseases: a proteomic approach. Neurobiol. Dis. 35, 157–164 (2009). Mittal, S., Jena, M. K. & Pathak, B. Machine Learning-Assisted Direct RNA Sequencing with Epigenetic RNA Modification Detection via Quantum Tunneling. Anal. Chem. 96, 11516–11524 (2024). Vadapalli, S., Abdelhalim, H., Zeeshan, S. & Ahmed, Z. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. Brief. Bioinform. 23, bbac191 (2022). Dudek, G. et al. Machine learning-based prediction of rheumatoid arthritis with development of ACPA autoantibodies in the presence of non-HLA genes polymorphisms. PLoS One 19, e0300717 (2024). Wenric, S. & Shemirani, R. Using supervised learning methods for gene selection in RNA-Seq case-control studies. Front. Genet. 9, 297 (2018). Vu, D. L. & Le, H. C. Machine Learning-Based ALS Diagnosis Using Gene Expression Data, in: 2023 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 354–359 (IEEE, 2023). Zhang, S. et al. Genome-wide identification of the genetic basis of amyotrophic lateral sclerosis. Neuron 110, 992–1008 (2022). Rad, H. N. et al. Amyotrophic lateral sclerosis diagnosis using machine learning and multi-omic data integration. Heliyon 10 (2024). Catanese, A. et al. Multiomics and machine-learning identify novel transcriptional and mutational signatures in amyotrophic lateral sclerosis. Brain 146, 3770–3782 (2023). Grima, N. et al. 
RNA sequencing of peripheral blood in amyotrophic lateral sclerosis reveals distinct molecular subtypes: considerations for biomarker discovery. Neuropathol. Appl. Neurobiol. 49, e12943 (2023). Vieira, F. G. et al. A machine-learning based objective measure for ALS disease severity. NPJ Digit. Med. 5, 45 (2022). Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012). Andrews, S. FastQC: a quality control tool for high throughput sequence data. (2017). (2010). Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014). Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015). Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014). Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 1–21 (2014). Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57, 289–300 (1995). Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015). Salazar, J. J., Garland, L., Ochoa, J. & Pyrcz, M. J. Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy. J. Pet. Sci. Eng. 209, 109885 (2022). Blagus, R. & Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 14, 1–16 (2013). Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001). Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001). Zeng, X., Chen, Y. W. & Tao, C. 
Feature selection using recursive feature elimination for handwritten digit recognition, in: 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 1205–1208 (IEEE, 2009). Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010). Li, D., Zhang, B. & Li, C. A feature-scaling-based k-nearest neighbor algorithm for indoor positioning systems. IEEE Internet Things J. 3, 590–597 (2015). Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997). Quinlan, J. R. Induction of decision trees. Mach. Learn. 1, 81–106 (1986). Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006). Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001). Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967). Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017). Fisher, R. A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188 (1936). Hosmer, D. W. Jr, Lemeshow, S. & Sturdivant, R. X. Applied logistic regression (Wiley, 2013). Zhang, H. The optimality of naive Bayes. Aa 1, 3 (2004). Ziegel, E. R. The elements of statistical learning (2003). Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001). Hoerl, A. E. & Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970). Cortes, C. Support-vector networks. Mach. Learn. (1995). Chen, T. & Guestrin, C. 
XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016). Bischl, B. et al. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 13, e1484 (2023). Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection (Morgan Kaufmann Publishing, 1995). Galvin, M. et al. The path to specialist multidisciplinary care in amyotrophic lateral sclerosis: a population-based study of consultations, interventions and costs. PLoS One 12, e0179796 (2017). Gupta, D., Shiralkar, M. & Chaudhari, V. Conventional remedy to Lou Gehrig’s disease-Amyotrophic Lateral Sclerosis (ALS): a rare clinical entity (2023). Chieia, M. A., Oliveira, A. S. B., Silva, H. C. A. & Gabbai, A. A. Amyotrophic lateral sclerosis: considerations on diagnostic criteria. Arq. Neuropsiquiatr. 68, 837–842 (2010). Falcão de Campos, C. et al. Trends in the diagnostic delay and pathway for amyotrophic lateral sclerosis patients across different countries. Front. Neurol. 13, 1064619 (2023). Cellura, E., Spataro, R., Taiello, A. C. & Bella, V. L. Factors affecting the diagnostic delay in amyotrophic lateral sclerosis. Clin. Neurol. Neurosurg. 114, 550–554 (2012). Olsen, R. H. & Christensen, H. Transcriptomics: RNA-seq, in: Introduction to Bioinformatics in Microbiology, pp. 177–188 (Springer, 2018). Watson, M. Quality assessment and control of high-throughput sequencing data. Front. Genet. 5, 235 (2014). Gunter, H. M. et al. mRNA vaccine quality analysis using RNA sequencing. Nat. Commun. 14, 5663 (2023). Floriddia, E. Transcriptomics and ALS outcome. Nat. Neurosci. 26, 175 (2023). Schweingruber, C. et al. Single-cell RNA-sequencing reveals early mitochondrial dysfunction unique to motor neurons shared across FUS-and TARDBP-ALS. Nat. Commun. 16, 4633 (2025). 
Harley, J., Clarke, B. E. & Patani, R. The interplay of RNA binding proteins, oxidative stress and mitochondrial dysfunction in ALS. Antioxidants 10, 552 (2021). Rossi, S. & Cozzolino, M. Dysfunction of RNA/RNA-binding proteins in ALS astrocytes and microglia. Cells 10, 3005 (2021). Raza, K. Machine learning in single-cell RNA-seq data analysis (Springer, 2024). Laing, N. G. et al. Mutations and polymorphisms of the skeletal muscle α-actin gene (ACTA1). Hum. Mutat. 30, 1267–1277 (2009). Solé, L. et al. KCNE4 suppresses Kv1.3 currents by modulating trafficking, surface expression and channel gating. J. Cell. Sci. 122, 3738–3748 (2009). Molday, R. S., Zhong, M. & Quazi, F. The role of the photoreceptor ABC transporter ABCA4 in lipid transport and Stargardt macular degeneration. Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids 1791, 573–583 (2009). Sabatelli, P. et al. Expression of the collagen VI α5 and α6 chains in normal human skin and in skin of patients with collagen VI-related myopathies. J. Invest. Dermatology 131, 99–107 (2011). Kolenda, T. et al. AURKAPS1, HERC2P2 and SDHAP1 pseudogenes: molecular role in development and progression of head and neck squamous cell carcinomas and their diagnostic utility. Rep. Practical Oncol. Radiotherapy 29, 718–731 (2024). Bilbao-Arribas, M. & Jugo, B. M. Transcriptomic meta-analysis reveals unannotated long non-coding RNAs related to the immune response in sheep. Front. Genet. 13, 1067350 (2022). Additional Declarations No competing interests reported. 
Supplementary Files SupplementaryInformationSI2.xlsx SupplementaryInformationSI3.xlsx SupplementaryInformationSI4.xlsx SupplementaryInformationSI1.docx Status: Under Review Version 1 posted Reviewers invited by journal 16 Apr, 2026 Editor assigned by journal 14 Apr, 2026 Editor invited by journal 28 Jan, 2026 Submission checks completed at journal 22 Jan, 2026 First submitted to journal 22 Jan, 2026 
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8614090","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":624231839,"identity":"f608584b-bb1e-4ac6-8bb9-c5eea5e7a4b0","order_by":0,"name":"Ahmed Saif","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABAUlEQVRIiWNgGAWjYJCCA0CcAMQGQCQnBxZ5gEc5D5oWY2OwlgQCWhgQWhiMExtgXFzAnv104qEbDHZ5uu3Nmz/8KDBInx92+CHQFjs53QYctvDkbjicw5BcbHbmWJlkj4FB7sbbaQZALcnGZgdwOQyshTlx240cMwYegz+5G2cngLQcSNyGSwv/W5CW+sRt998Yf/xjYJBuODv9A34tEmBbDgNt4TGQ5jEwSJCXziFgyw2QLQbHE7edSSuTljEwMNwgnVNwIMEAt1/Y+3M3f86pqE7cdvzw5o9v/hjIy89O3/zhQ4WdHC4tEGCAzD6ALkIQyDeQonoUjIJRMApGAgAA9cVmK/7glCwAAAAASUVORK5CYII=","orcid":"","institution":"University of Rajshahi","correspondingAuthor":true,"prefix":"","firstName":"Ahmed","middleName":"","lastName":"Saif","suffix":""},{"id":624231841,"identity":"2a7f49ca-c6e4-49b4-a06e-56f53241c84f","order_by":1,"name":"Md Tarikul Islam","email":"","orcid":"","institution":"Jashore University of Science and Technology","correspondingAuthor":false,"prefix":"","firstName":"Md","middleName":"Tarikul","lastName":"Islam","suffix":""},{"id":624231846,"identity":"a0dc7075-e775-4d05-b668-8e394609235c","order_by":2,"name":"Md Aktaruzzaman","email":"","orcid":"","institution":"Jashore University of Science and Technology","correspondingAuthor":false,"prefix":"","firstName":"Md","middleName":"","lastName":"Aktaruzzaman","suffix":""}],"badges":[],"createdAt":"2026-01-16 
00:23:11","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8614090/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8614090/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":107483294,"identity":"8cd864ad-4109-4b8e-a5cb-d45930ad8b55","added_by":"auto","created_at":"2026-04-22 02:27:15","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":138094,"visible":true,"origin":"","legend":"\u003cp\u003eThe complete workflow of this study.\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/084b47eeb76dda9dfabedc68.jpeg"},{"id":107184385,"identity":"c5a0bdd8-5a03-4c26-8aaa-969571daa1ce","added_by":"auto","created_at":"2026-04-17 18:27:18","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":287181,"visible":true,"origin":"","legend":"\u003cp\u003eWorkflow illustrating the search strategy, inclusion and exclusion criteria, and selection process used to identify human ALS RNA-seq datasets from the NCBI GEO database, resulting in the final selection of BioProjects PRJNA512012, PRJNA831563, and PRJNA1163403.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/3784ebee05399c49fc606b33.png"},{"id":107184389,"identity":"e86d7f8c-6234-4df0-ac68-642b97b4d4fb","added_by":"auto","created_at":"2026-04-17 18:27:18","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":235662,"visible":true,"origin":"","legend":"\u003cp\u003eDifferential gene expression analysis comparing ALS and healthy controls across three independent RNA-seq datasets. 
Panels A–B show volcano and MA plots for GSE124439, panels C–D show volcano and MA plots for GSE201407, and panels E–F show volcano and MA plots for GSE277709, illustrating log₂ fold-change distributions and expression-dependent differential regulation between ALS and control samples.\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/bbce63d8a569ed0db2e3c0bc.jpeg"},{"id":107481727,"identity":"59eb6c68-1d7d-4b52-a800-79ed74b1807b","added_by":"auto","created_at":"2026-04-22 02:19:49","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":43199,"visible":true,"origin":"","legend":"\u003cp\u003eClass balance before and after SMOTE in the training dataset. (A) Original class imbalance with fewer healthy samples compared to ALS samples. (B) Balanced class distribution achieved after applying SMOTE.\u003c/p\u003e","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/2e437f5c7fe70cbc456dc075.jpeg"},{"id":107481850,"identity":"1dc30a41-69ca-4a5e-86d4-61c4848e238b","added_by":"auto","created_at":"2026-04-22 02:20:26","extension":"jpeg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":258911,"visible":true,"origin":"","legend":"\u003cp\u003ePerformance comparison of machine-learning models. 
Fifteen ML algorithms were evaluated using multiple metrics: (A) accuracy, (B) Matthews correlation coefficient (MCC), (C) precision, (D) recall, (E) AUC–ROC, and (F) F1 score, providing a comprehensive comparison of classification performance across models.\u003c/p\u003e","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/e3260a710ccf5d1cd39d8d81.jpeg"},{"id":107483295,"identity":"e0d58c1f-c9a0-48f4-8530-2457b0188c09","added_by":"auto","created_at":"2026-04-22 02:27:15","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":46872,"visible":true,"origin":"","legend":"\u003cp\u003eImpact of hyperparameter tuning on model performance. Stacked normalized scores across accuracy, MCC, precision, recall, AUC-ROC, and F1 score illustrate consistent performance improvements for the tuned versions of all 15 machine-learning models compared to their baseline counterparts. Distinct colors denote individual performance metrics, as indicated in the legend.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/e3a2aa73685b57ce57464905.png"},{"id":107483320,"identity":"ea95b4c3-78dd-482a-9b77-85b69509c679","added_by":"auto","created_at":"2026-04-22 02:27:21","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":159326,"visible":true,"origin":"","legend":"\u003cp\u003eTen-fold stratified cross-validation performance of the Gradient Boosting classifier. 
Heatmap showing fold-wise accuracy, precision, recall, F1 score, and AUC–ROC across 10 stratified folds, with the total (sum) for each metric indicated.\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/c0d537a6721073810becf525.png"},{"id":107184390,"identity":"6e3f3ee3-4536-489b-84e0-5e7e7e1c6a9f","added_by":"auto","created_at":"2026-04-17 18:27:18","extension":"jpeg","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":166196,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of the ATMeQ web application interface and diagnostic workflow.\u003c/p\u003e","description":"","filename":"floatimage8.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/f0ae1056936666c64d0e6ee3.jpeg"},{"id":107485934,"identity":"183511e5-e0fb-44ba-b943-688c391a290d","added_by":"auto","created_at":"2026-04-22 02:36:51","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2292940,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/a9a170e4-db70-4469-9bc0-457b83b4e1e7.pdf"},{"id":107184382,"identity":"3a8453b6-2050-46dd-bec0-352df2bfa90e","added_by":"auto","created_at":"2026-04-17 18:27:18","extension":"xlsx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":80601,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformationSI2.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/922622e246eb45159d17398a.xlsx"},{"id":107483323,"identity":"8a30b665-e547-45bc-a456-f0e47bef052e","added_by":"auto","created_at":"2026-04-22 
02:27:21","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":24925,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformationSI3.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/859e0a9a50d7b00cfc81516b.xlsx"},{"id":107482364,"identity":"b22a2692-5ffe-416f-a9b5-5221444ea010","added_by":"auto","created_at":"2026-04-22 02:23:20","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":12076,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformationSI4.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/94caa006a4eb06c293c49f44.xlsx"},{"id":107184387,"identity":"bb173ccc-c006-48f1-9fbc-afb1bf4af964","added_by":"auto","created_at":"2026-04-17 18:27:18","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":495669,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformationSI1.docx","url":"https://assets-eu.researchsquare.com/files/rs-8614090/v1/57b5b61bf507fd7f8c3aa9bc.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"ATMeQ: A Machine Learning-Based Framework for Amyotrophic Lateral Sclerosis Disease using RNA-seq Meta-Analysis","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eAmyotrophic Lateral Sclerosis (ALS) is a rare and progressive neurodegenerative disorder that primarily affects motor neurons in the brain and spinal cord, leading to muscle weakness, atrophy, and eventually paralysis [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
This degeneration specifically affects both upper motor neurons, which originate in the cerebral cortex and extend to the brainstem and spinal cord, and lower motor neurons, which transmit signals directly from the brainstem or spinal cord to the muscles [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. As a result, essential voluntary movements such as walking, talking, and breathing become increasingly impaired. The disease can be classified into two types based on its clinical presentation: familial ALS (fALS), which accounts for about 5\u0026ndash;10% of cases and has a genetic etiology, and sporadic ALS (sALS), which makes up the remaining 90\u0026ndash;95% of cases and may result from a combination of genetic predispositions and environmental factors[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Sadly, most research indicates that the progression of ALS often leads to death within 2 to 5 years after the onset of symptoms, primarily due to respiratory failure[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Moreover, while current statistics show that approximately 9.9 individuals per 100,000 are affected globally, projections suggest that cases could rise by as much as 69% by 2040, presenting an escalating and critical challenge for neurology and translational neuroscience [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eDue to these reasons, for over half a century, translational research in ALS has driven numerous clinical trials and advanced scientific methods to explore neuroprotective compounds, but despite these efforts, a cure remains undiscovered [\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. 
Indeed, since the 1990s, over 50 investigational drugs for ALS have failed in Phase II/III clinical trials, underscoring the immense challenges of developing effective therapies for this fatal neurodegenerative disease [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. To date, only two treatments, riluzole and edaravone, have gained regulatory approval. Riluzole, a glutamate modulator that reduces excitotoxicity, extends median survival by approximately 2\u0026ndash;3 months but does not meaningfully halt disease progression[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. In contrast, Edaravone, an antioxidant designed to mitigate oxidative stress, demonstrated a 33% slower rate of functional decline over 24 weeks in clinical trials involving a narrowly defined subset of early-stage ALS patients[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eGiven these limited therapeutic options, timely and precise diagnosis is critical not only to rule out mimicking conditions but also to initiate approved therapies at the earliest possible stage, maximizing their modest benefits. ALS diagnosis begins with a detailed clinical examination, assessing muscle strength, reflexes, and other neurological signs. Features like hyperreflexia, muscle wasting, and weakness help differentiate ALS from other conditions[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. 
Clinicians also use electrodiagnostic tests: electromyography (EMG) to measure muscle electrical activity and confirm nerve issues, and nerve conduction studies to check nerve function [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Imaging, such as MRI, helps rule out other problems like spinal cord compression [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Beyond these, researchers are exploring advanced MRI techniques like diffusion tensor imaging (DTI) and diffusion-weighted imaging (DWI) to examine brain and spinal cord structure, as well as Positron Emission Tomography (PET) to measure brain activity [\u003cspan additionalcitationids=\"CR19\" citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. The Gold Coast criteria, introduced in 2019, have further simplified ALS diagnosis by focusing on progressive motor impairment and upper and lower motor neuron dysfunction in at least one body region [\u003cspan additionalcitationids=\"CR22\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAlthough progress has been made, diagnosing ALS remains challenging, largely because its symptoms overlap significantly with those of other neurological disorders, and there is no definitive diagnostic test available to identify the condition conclusively [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. 
ALS diagnosis continues to face a median delay of 12 months from symptom onset, with patients typically consulting three or more specialists before receiving a confirmed diagnosis [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Moreover, the variability in ALS symptoms manifests as either bulbar-onset (affecting speech and swallowing) or limb-onset (impacting peripheral muscles like the hands and feet). This heterogeneity complicates early recognition and diagnosis [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. While the El Escorial, revised El Escorial, and Awaji criteria offer diagnostic frameworks, they lack sensitivity and are primarily designed for research rather than clinical practice[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. Most importantly, there is currently no established biomarker that has been validated for clinical application in ALS, which significantly hinders early detection and disease monitoring [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. Given that biomarkers are crucial for the early and accurate diagnosis of neurodegenerative diseases (NDs) like Alzheimer\u0026rsquo;s and Parkinson\u0026rsquo;s[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] identifying reliable diagnostic and prognostic gene biomarkers could substantially improve our understanding and management of ALS.\u003c/p\u003e \u003cp\u003eIn light of these challenges, high-throughput RNA-seq has emerged as a promising approach to bridge this diagnostic gap. 
This technology has revolutionized the field of transcriptomics by providing a comprehensive view of the transcriptome, enabling the identification of novel biomarkers for various diseases [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. Unlike traditional methods such as microarrays, RNA-seq enables the detection of both known and novel transcripts with high sensitivity and accuracy while also providing precise quantification of gene expression [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. A key strength of RNA-seq lies in its ability to uncover subtle disease-associated expression patterns that may serve as diagnostic, prognostic, or therapeutic indicators [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. By capturing the full dynamic range of gene expression, including low-abundance transcripts and splice variants, RNA-seq reveals molecular signatures often missed by other technologies [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. Furthermore, standardized computational pipelines now allow researchers to reliably identify differentially expressed genes (DEGs), reducing variability and enhancing reproducibility in biomarker discovery [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. These capabilities have positioned RNA-seq as a transformative tool for identifying disease-specific biomarkers, particularly for complex conditions like neurodegenerative diseases, where molecular stratification is critical. 
In neurodegenerative disease research, brain and blood samples are complementary: brain tissue provides direct insights into molecular pathology, while blood offers a scalable, non-invasive platform for early diagnosis and monitoring [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]. RNA-seq bridges these domains, revealing biomarkers such as blood-derived microRNAs and neurofilament light chain (NfL) that mirror pathological changes in the brain, enabling breakthroughs in detecting neurological diseases years before symptoms emerge [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. By leveraging both types of samples, researchers can accelerate the discovery of actionable biomarkers, ultimately transforming how we predict, track, and combat neurodegeneration [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. However, the vast and intricate nature of RNA-seq datasets necessitates the use of advanced computational techniques to fully unlock their potential.\u003c/p\u003e \u003cp\u003eMachine learning (ML) algorithms have shown great promise in analyzing vast and intricate datasets, such as those generated by high-throughput RNA sequencing [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e]. By leveraging these advanced computational techniques, researchers can pinpoint gene expression patterns unique to specific diseases, accurately classify biological samples, and forecast disease progression [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e]. 
ML models excel at learning directly from the data, uncovering subtle relationships and patterns, even amidst the noise and variability typical of biological datasets, that often elude traditional statistical methods [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]. A key application is the use of supervised ML techniques to identify critical genes from RNA-seq data, where models are trained on labeled datasets to recognize important genetic markers [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]. Among RNA-seq case-control studies, one developed a logistic regression model that identified 22 biomarker genes (AUC: 0.990) from PBMC RNA-seq data, linking immune response, cell signaling, and metabolism to ALS mechanisms [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e]. In another study, RefMap integrated GWAS with RNA-seq/ATAC-seq data from iPSC-derived motor neurons, uncovering 690 ALS-associated genes and validating KANK1\u0026rsquo;s role in TDP-43 pathology [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e]. Meanwhile, a multi-omic approach combined unsupervised clustering and the MOALS model to analyze 9,847 ALS-related genes and 7,699 rare variants, boosting prediction accuracy by 1.7\u0026ndash;6.2% [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e]. Deep learning via a Keras/TensorFlow23 classifier processed WGS, RNA-seq, and chromatin data to classify ALS cases and reveal novel transcriptional/mutational signatures [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e]. WGCNA and classification models extracted a 20-gene signature from peripheral blood RNA-seq (96 sALS vs. 48 controls), achieving 78% accuracy. 
GLM, Decision Trees, and Random Forests analyzed spinal cord RNA-seq data, yielding 83% cross-validation and 77% test accuracy [\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e]. Finally, CNNs and logistic regression leveraged voice recordings (AUC: 0.86 for bulbar function) and accelerometer data (median AUC: 0.73 for limb function) to predict ALS severity via ALSFRS-R scores, showcasing digital biomarker potential [\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e]. However, integrating diverse biological specimens, such as brain tissue and blood, with an array of multiple ML algorithms could provide a more robust approach to ALS detection than relying on a single method.\u003c/p\u003e \u003cp\u003eTo advance ALS diagnostics, we aim to integrate high-throughput RNA-seq data and machine learning (ML) to develop a predictive framework for ALS classification. Using publicly available ALS-associated gene expression datasets, we will implement a next-generation sequencing (NGS) pipeline to identify DEGs between ALS and control samples. These candidates will be refined through four advanced feature selection methods to define a pathophysiology-driven gene signature. We will then systematically evaluate 15 ML algorithms to optimize accuracy in distinguishing ALS from control samples. To translate these findings into clinical utility, the finalized model will be deployed via ATMeQ, a publicly accessible web application designed to enable clinicians and researchers to predict ALS risk and validate candidate biomarkers, thereby enhancing diagnostic precision, accelerating therapeutic development, and improving outcomes for ALS patients. The workflow for this study is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"2. Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. 
Retrieval of NGS data\u003c/h2\u003e \u003cp\u003eNext-generation sequencing (NGS) data for this study were retrieved from the Gene Expression Omnibus (GEO) database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ncbi.nlm.nih.gov/geo/\u003c/span\u003e\u003cspan address=\"https://www.ncbi.nlm.nih.gov/geo/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), a publicly accessible repository managed by the National Center for Biotechnology Information (NCBI) [\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e]. Three independent projects, BioProject PRJNA512012 (GEO Series GSE124439), PRJNA831563 (GEO Series GSE201407), and PRJNA1163403 (GEO Series GSE277709), were selected to compile RNA-Seq datasets from a total of 224 postmortem samples, comprising 183 amyotrophic lateral sclerosis (ALS) patients and 41 non-ALS controls. All data were downloaded using NCBI\u0026rsquo;s SRA Toolkit. The curated datasets encompassed key brain regions implicated in ALS pathology, such as the motor cortex and prefrontal cortex. Dataset selection was guided by stringent criteria: the availability of high-quality RNA-Seq data, comprehensive metadata, and a sample size adequate for biomarker discovery and therapeutic target identification in ALS research. The selection criteria for these datasets are outlined in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, while \u003cb\u003eSupplementary Information (SI) 1\u003c/b\u003e provides detailed information, including project ID, sample size, ALS and control distributions, gender, age range, brain region of origin, disease stage, and relevant references. 
Subsequent computational processing pipelines are described in the following sections.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2. Preprocessing of Raw Data\u003c/h2\u003e \u003cp\u003e \u003cb\u003eQuality Control of FASTQ Files\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe quality of raw sequencing reads was evaluated using FastQC (version 0.11.9) [\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e], a widely-used tool for high-throughput sequencing data quality control, including assessments of read quality, GC content, adapter contamination, and sequence duplication levels.\u003c/p\u003e \u003cp\u003e \u003cb\u003eTrimming FASTQ Files\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo enhance sequencing read quality by removing low-quality bases and adapter sequences, read trimming was performed using Trimmomatic (version 0.39) [\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e], a tool designed to process Illumina data. The trimming parameters included TRAILING:10, SLIDINGWINDOW:4:15, MINLEN:36, and -phred33 for quality score encoding. After trimming, the processed FASTQ files were reanalyzed with FastQC to verify improved read quality.\u003c/p\u003e \u003cp\u003e \u003cb\u003eAlignment to the Reference Genome\u003c/b\u003e \u003c/p\u003e \u003cp\u003eHigh-quality trimmed reads were aligned to the human reference genome (GRCh38) using HISAT2 (version 2.2.1) [\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e], a fast and splice-aware aligner optimized for RNA-Seq data. The alignment leveraged the pre-built HISAT2 index for GRCh38, which includes splice site annotations to ensure accurate mapping of reads spanning exon-exon junctions. 
The resulting SAM file was subsequently converted to a sorted BAM file using SAMtools (version 1.16) [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cb\u003eQuantification Using FeatureCounts\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAfter alignment, gene expression levels were quantified using featureCounts [\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e], a tool optimized for assigning RNA-seq reads to genomic features. We used the Ensembl GRCh38 release 106 annotation files (Homo_sapiens.GRCh38.106.gtf) to ensure accurate read counting across annotated genes, exons, and transcripts. This annotation file was accessed directly from the Ensembl FTP repository (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/\u003c/span\u003e\u003cspan address=\"https://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e to maintain consistency with the reference genome used during alignment.\u003c/p\u003e \u003cp\u003e \u003cb\u003eFiltering of Count Results\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAfter generating the raw counts, we filtered out genes with low expression levels to focus on genes with sufficient coverage for further analysis. Specifically, genes were discarded if their total read count across all samples fell below a threshold of 10 reads. 
This can be formalized as:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\text{Retained genes:}\\quad \\sum_{j=1}^{N} C_{ij} \\ge 10$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eC\u003c/em\u003e\u003csub\u003e\u003cem\u003eij\u003c/em\u003e\u003c/sub\u003e represents the read count for gene \u003cem\u003ei\u003c/em\u003e in sample \u003cem\u003ej\u003c/em\u003e, and \u003cem\u003eN\u003c/em\u003e is the total number of samples. This step ensures that only genes with adequate expression are retained for downstream statistical modeling.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3. Identification of differentially expressed genes (DEGs)\u003c/h2\u003e \u003cp\u003eAfter preprocessing the data, we employed the DESeq2 statistical tool to identify differentially expressed genes (DEGs) [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e]. To ensure the reliability of these identified DEGs, we adjusted the P-values using the false discovery rate (FDR) method [\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e]. For each gene, we calculated the fold change (FC) between the ALS and control (non-ALS) groups. Genes with an adjusted P-value (P-adjusted)\u0026thinsp;\u0026lt;\u0026thinsp;0.05 and an absolute log2-transformed fold change (|Log2FC|) \u0026gt; 0.5 were considered significant DEGs [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e]. For downstream machine learning (ML) applications, normalized counts were variance-stabilized using the DESeq2 vst() transformation to mitigate mean-variance dependence. 
To address potential confounding technical variation, the limma::removeBatchEffect() function [\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e] was applied to the variance-stabilized data to eliminate any batch effects. The resulting datasets, which were normalized, variance-stabilized, and batch effect-corrected, were then used for feature selection.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4. Train-Test Split\u003c/h2\u003e \u003cp\u003eThe dataset was split into training and testing sets using a 70/30 ratio, where 70% of the data was used for model training and 30% for testing. This division ensures that the model is trained on a majority of the data while reserving a smaller portion for unbiased evaluation of its performance[\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5. Oversampling Technique for the Minority Class\u003c/h2\u003e \u003cp\u003eThe dataset exhibited class imbalance, with the minority class being under-represented compared to the majority class. To address this issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE) [\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e] exclusively to the training set after splitting. SMOTE generates synthetic samples for the minority class by interpolating between existing instances. This helps to mitigate the risk of model bias toward the majority class.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.6. Feature Selection for ML Models\u003c/h2\u003e \u003cp\u003eThis study employed a diverse set of feature selection strategies to identify the most critical features required for training various machine learning models. 
Feature importance was determined through the application of four distinct methodologies: Random Forest Classifier [\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e], Gradient Boosting Classifier [\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e], Recursive Feature Elimination [\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e], and Boruta [\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e]. To prevent information leakage, feature selection was performed exclusively on the training set. We used the scikit-learn \u0026ldquo;SelectFromModel\u0026rdquo; function with the Random Forest Classifier and Gradient Boosting Classifier algorithms to evaluate the relative importance of each feature in the model. The recursive feature elimination technique iteratively removes the least significant features, using a linear regression model as the estimator. Additionally, the Boruta technique was employed to assess feature importance by iterating over randomized decision trees and highlighting the most relevant features. These combined strategies facilitated the identification of key features from our dataset. A Venn diagram was constructed to determine the set of features that were common across all methods, and these shared features were subsequently selected. The selected features were then used to develop and refine machine-learning models for ALS classification.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.7. Machine Learning Model Training\u003c/h2\u003e \u003cp\u003eFeature scaling plays a vital role in preparing data for machine learning models, and in this study, the input features were standardized using the StandardScaler function from scikit-learn's preprocessing module [\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e]. 
The mean and standard deviation derived from the training dataset were used to scale both the training and test datasets, ensuring no data leakage occurred during the preprocessing step. Once scaled, the test dataset was utilized to evaluate the performance of 15 distinct machine learning algorithms trained on the training data. The models assessed included Gradient Boosting Classifier, Light Gradient Boosting Machine (LightGBM), Extra Trees Classifier, Random Forest Classifier, Ada Boost Classifier, Extreme Gradient Boosting (XGBoost), K Neighbors Classifier (KNN), Linear Discriminant Analysis (LDA), Naive Bayes, Logistic Regression, Decision Tree Classifier, Ridge Classifier, Quadratic Discriminant Analysis (QDA), Dummy Classifier, and Support Vector Machine - Linear Kernel (SVM). Each algorithm was independently trained and tested to compare their performance.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eAdaBoost Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eAdaBoost (Adaptive Boosting) is an iterative ensemble method that focuses on misclassified examples by adjusting their weights in subsequent iterations[\u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e71\u003c/span\u003e]. Initially, all training samples are assigned equal weights. After each iteration, the weights of misclassified samples are increased, forcing the model to prioritize them in the next round. AdaBoost typically uses weak learners, such as decision stumps, and combines them into a strong classifier. 
Its adaptability makes it suitable for both binary and multi-class classification tasks.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eDecision Tree Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eDecision Tree Classifier is a hierarchical model that recursively splits the dataset into subsets based on feature values [\u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e]. Each internal node represents a decision rule, and each leaf node corresponds to a class label. Decision trees are easy to interpret and visualize but prone to overfitting. Pruning techniques and ensemble methods (e.g., Random Forest) are often employed to improve generalization.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eDummy Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eDummy Classifier is a baseline model that generates predictions without using any feature information [\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e]. It serves as a benchmark for evaluating the performance of more sophisticated models. Common strategies include predicting the most frequent class, generating random predictions, or using prior probabilities. Dummy Classifier helps identify whether a proposed model provides meaningful improvements over trivial baselines.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eExtra Trees Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe Extra Trees Classifier, or Extremely Randomized Trees, is an ensemble learning method that builds multiple decision trees during training. Unlike Random Forest, it introduces additional randomness by selecting random splits for each feature rather than searching for the best split [\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e]. This approach reduces variance and overfitting, making it robust for noisy datasets. 
The final prediction is obtained by aggregating the outputs of all trees, either through voting (for classification) or averaging (for regression).\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eGradient Boosting Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eGradient Boosting Classifier is an ensemble learning technique that combines multiple weak learners (typically decision trees) to form a strong predictive model. It operates by iteratively minimizing the loss function through gradient descent optimization. In each iteration, the algorithm fits a new model to the residuals of the previous model, thereby improving accuracy progressively [\u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e75\u003c/span\u003e]. This method is particularly effective for handling complex datasets with non-linear relationships between features and the target variable.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eK Neighbors Classifier (KNN)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eK-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for classification and regression [\u003cspan citationid=\"CR76\" class=\"CitationRef\"\u003e76\u003c/span\u003e]. For classification, KNN predicts the class label of a query point based on the majority vote of its k-nearest neighbors in the feature space. The distance metric (e.g., Euclidean, Manhattan) determines the similarity between points. Despite its simplicity, KNN is effective for small datasets but can become computationally expensive for large datasets due to its reliance on storing all training samples.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eLight Gradient Boosting Machine (LightGBM)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eLightGBM is an optimized gradient-boosting framework designed for efficiency and scalability. 
It employs two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to reduce computational overhead while maintaining high accuracy [\u003cspan citationid=\"CR77\" class=\"CitationRef\"\u003e77\u003c/span\u003e]. By focusing on instances with larger gradients and bundling mutually exclusive features, LightGBM achieves faster training times compared to traditional gradient boosting methods. It is widely used in large-scale machine-learning tasks, such as ranking and classification.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eLinear Discriminant Analysis (LDA)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eLinear Discriminant Analysis (LDA) is a supervised dimensionality reduction and classification technique that seeks to maximize the separation between classes [\u003cspan citationid=\"CR78\" class=\"CitationRef\"\u003e78\u003c/span\u003e]. It projects the data onto a lower-dimensional space while preserving class-discriminative information. LDA assumes that the data follows a Gaussian distribution and that all classes share the same covariance matrix. When the number of features exceeds the number of samples, the within-class covariance estimate becomes singular, so a regularized (shrinkage) variant is typically required.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eLogistic Regression\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eLogistic Regression is a statistical model used for binary and multi-class classification tasks [\u003cspan citationid=\"CR79\" class=\"CitationRef\"\u003e79\u003c/span\u003e]. It estimates the probability of a class label using a logistic function applied to a linear combination of input features. Logistic Regression is interpretable, computationally efficient, and works well for linearly separable data. 
Regularization techniques like L1 (Lasso) and L2 (Ridge) can be incorporated to handle multicollinearity and prevent overfitting.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eNaive Bayes\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eNaive Bayes is a probabilistic classifier based on Bayes' theorem, which assumes conditional independence between features given the class label [\u003cspan citationid=\"CR80\" class=\"CitationRef\"\u003e80\u003c/span\u003e]. Despite this \"naive\" assumption, the algorithm performs surprisingly well in text classification and spam filtering tasks. Variants of Naive Bayes, such as Gaussian Naive Bayes and Multinomial Naive Bayes, cater to different types of data distributions.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eQuadratic Discriminant Analysis (QDA)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eQuadratic Discriminant Analysis (QDA) is an extension of LDA that relaxes the assumption of shared covariance matrices across classes [\u003cspan citationid=\"CR81\" class=\"CitationRef\"\u003e81\u003c/span\u003e]. QDA models each class with its own covariance matrix, resulting in quadratic decision boundaries. While more flexible than LDA, QDA requires more data to estimate the additional parameters accurately.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eRandom Forest Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eRandom Forest Classifier is a popular ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of their predictions for classification tasks [\u003cspan citationid=\"CR82\" class=\"CitationRef\"\u003e82\u003c/span\u003e]. It mitigates overfitting by introducing randomness in two ways: bootstrapping samples for each tree and selecting a random subset of features at each split. 
Random Forest is highly versatile and performs well across a wide range of problems, including feature selection and missing data imputation.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eRidge Classifier\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eRidge Classifier is a variant of Ridge Regression adapted for classification tasks [\u003cspan citationid=\"CR83\" class=\"CitationRef\"\u003e83\u003c/span\u003e]. It applies L2 regularization to penalize large coefficients, reducing overfitting and improving stability. Unlike Logistic Regression, the Ridge Classifier directly minimizes the squared loss instead of maximizing likelihood. It is particularly effective when dealing with multicollinear features.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eSupport Vector Machine - Linear Kernel (SVM)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eSupport Vector Machine (SVM) with a linear kernel is a powerful classification algorithm that identifies the optimal hyperplane separating classes in the feature space [\u003cspan citationid=\"CR84\" class=\"CitationRef\"\u003e84\u003c/span\u003e]. The margin between the hyperplane and the nearest data points (support vectors) is maximized to ensure robustness. SVM is effective for high-dimensional data and can incorporate kernel functions to handle non-linear relationships.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eExtreme Gradient Boosting (XGBoost)\u003c/b\u003e \u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eXGBoost is an advanced implementation of gradient boosting that incorporates regularization techniques (L1 and L2) to prevent overfitting [\u003cspan citationid=\"CR85\" class=\"CitationRef\"\u003e85\u003c/span\u003e]. It also optimizes the second-order gradient of the loss function, enabling faster convergence and higher accuracy. 
XGBoost supports parallel processing, handling missing values, and custom objective functions, making it a preferred choice for structured/tabular data competitions and real-world applications.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e2.8. Hyperparameter Tuning\u003c/h2\u003e \u003cp\u003eHyperparameter tuning is an essential step in optimizing the performance of ML models and was a critical component of this study. The primary goal of this process is to identify the most effective configuration of hyperparameters that maximizes model performance while ensuring robustness and generalization. To achieve this, we utilized the scikit-learn library in Python, which provides a comprehensive suite of tools for hyperparameter optimization [\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e]. Our approach involved employing GridSearchCV, a systematic method for traversing the hyperparameter space by evaluating all possible combinations within a predefined grid [\u003cspan citationid=\"CR86\" class=\"CitationRef\"\u003e86\u003c/span\u003e]. This exhaustive search strategy ensures that no potential combination is overlooked, enabling the identification of the best-performing hyperparameters. By using this rigorous methodology, we ensured that the chosen hyperparameters were optimized for both accuracy and generalizability, which lays a strong foundation for the robustness of the ML models.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e2.9. K-Fold Cross-Validation with the Best-Performing Model\u003c/h2\u003e \u003cp\u003eCross-validation is an important technique in machine learning that offers a more reliable estimate of a model's performance on unseen data compared to a single train-test split. It helps mitigate the variability that can arise from relying on just one partition of the data for testing. 
After training and hyperparameter tuning of 15 models, the best-performing model was selected. To further assess its robustness, we employed the \u0026ldquo;StratifiedKFold\u0026rdquo; function from scikit-learn to conduct 10-fold cross-validation [\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e]. This involved merging the training and test datasets and dividing them into 10 stratified folds. In each iteration, one fold served as the validation set, while the remaining nine were used for training. The model's performance was evaluated using four key metrics: accuracy, Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC\u0026ndash;ROC), and the F1 score. This process was repeated across all 10 folds, and the results from each iteration were averaged to provide an overall measure of the model's expected performance on unseen data [\u003cspan citationid=\"CR87\" class=\"CitationRef\"\u003e87\u003c/span\u003e]. This approach ensures a comprehensive and reliable evaluation of the model's generalization capabilities.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e2.10. Deployment of the Model as Web Application\u003c/h2\u003e \u003cp\u003eFinally, we deployed the developed Gradient Boosting Classifier (GBC) model as a user-friendly web application, making it easily accessible to the research community. The web application, named ATMeQ, was built using the Streamlit Python framework (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.streamlit.io/\u003c/span\u003e\u003cspan address=\"https://www.streamlit.io/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and hosted on the Streamlit Share cloud platform. The source code for the application is maintained in a dedicated GitHub repository, ensuring transparency and facilitating collaboration. 
The ATMeQ web app is designed to accept input data in the form of a VST file (provided as a CSV file), process it through the GBC model, and return predictions for ALS disease status. Additionally, users can download the prediction results directly from the app, enhancing its utility for research and analysis purposes.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Quantification of the High-quality Raw Reads\u003c/h2\u003e \u003cp\u003eThe quality of the raw sequencing data retrieved from NCBI was evaluated using FastQC v0.11.5. All raw reads from a total of 224 samples were evaluated and confirmed to meet high-quality standards. Following this quality control step, the reads were aligned to the human reference genome. Alignment to the reference genome identified a total of 26,396 genes in the project PRJNA512012 (GSE124439), 19,908 genes in PRJNA831563 (GSE201407), and 25,034 genes in PRJNA1163403 (GSE277709). These gene sets were subsequently subjected to DEG analysis during the quantification phase.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Identification of Differentially Expressed Genes (DEGs)\u003c/h2\u003e \u003cp\u003eDifferential gene expression analysis was performed using the DESeq2 package in R. Genes were classified as DEGs based on an adjusted p-value threshold of \u0026le;\u0026thinsp;0.05 and a |Log2FC| \u0026gt; 0.5. In the PRJNA512012 (GSE124439) dataset, 1,609 significant DEGs were identified by comparing case samples to normal controls. Similarly, analyses of two additional transcriptomic datasets, PRJNA831563 (GSE201407) and PRJNA1163403 (GSE277709), revealed 1,302 and 2,223 DEGs, respectively. Volcano plots and MA plots illustrating the distribution of DEGs for each dataset are presented in \u003cb\u003eFig.\u0026nbsp;3\u003c/b\u003e. 
Additionally, a cross-dataset comparison identified 32 DEGs that were consistently differentially expressed across all three datasets \u003cb\u003e(Supplementary Information (SI) 2 and Fig.\u0026nbsp;\u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e).\u003c/b\u003e\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eFigure\u0026nbsp;3.\u003c/b\u003e Differential gene expression analysis comparing ALS and healthy controls across three independent RNA-seq datasets. Panels A\u0026ndash;B show volcano and MA plots for GSE124439, panels C\u0026ndash;D show volcano and MA plots for GSE201407, and panels E\u0026ndash;F show volcano and MA plots for GSE277709, illustrating log₂ fold-change distributions and expression-dependent differential regulation between ALS and control samples.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.3. DEG-Based Data Preprocessing and Feature Selection\u003c/h2\u003e \u003cp\u003eThe dataset, derived from DEGs and consisting of 224 samples with 32 features, was split into training and testing subsets. The training set comprised 70% of the data (156 samples), while the testing set contained 30% (68 samples). In the training dataset, a class imbalance was observed, with healthy samples being the minority class (29 samples) compared to ALS samples (127 samples). To address this imbalance, we applied SMOTE, a technique that synthetically generates additional samples for the minority class. This preprocessing step balanced the distribution of healthy and ALS samples, as illustrated in \u003cb\u003eFig.\u0026nbsp;4.\u003c/b\u003e\u003c/p\u003e \u003cp\u003eFollowing data balancing, we performed feature selection using four distinct methodologies to identify the most relevant features associated with the target variable (see Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
We then identified a set of common features that consistently ranked as highly relevant across all four approaches, as illustrated in \u003cb\u003eFig.\u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e\u003c/b\u003e. These features included ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, and LOC107987008 (see \u003cb\u003eSupplementary Information (SI) 3\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003cb\u003eFigure\u0026nbsp;4\u003c/b\u003e Class balance before and after SMOTE in the training dataset. (A) Original class imbalance with fewer healthy samples compared to ALS samples. (B) Balanced class distribution achieved after applying SMOTE.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eGene features selected by four independent feature selection methods: Random Forest, Gradient Boosting Classifier, Recursive Feature Elimination (RFE), and Boruta, for ALS versus control classification.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom forest\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGradient boosting classifier\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRecursive feature elimination\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003eBoruta\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHERC2P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHERC2P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHERC2P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHERC2P2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC105371874\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC105371874\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC105371874\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLOC107987008\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNPY\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePTGER2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHSPA2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePTGER2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC107987008\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC112268045\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC107987008\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSLC1A7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePTGER2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eABCA8\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC105379442\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eKCNE4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC112268045\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGREM1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eKCNE4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCOL6A4P2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC105379442\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC107987071\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eABCA8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eACTA1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKCNE4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBOK.AS1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGREM1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMYBPC2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC107987071\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSLC1A7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC107987071\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLOC107987075\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBOK.AS1\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBVES\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSLC1A7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eABCA4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRASSF9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLRRC63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePAPLN.AS1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLRRC63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCOL6A4P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRASSF9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCOL6A4P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eACTA1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLACC1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eACTA1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMYBPC2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCOL6A4P2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMYBPC2\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEFHD1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eACTA1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEFHD1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC107987003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEFHD1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC107987003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC105370803\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC107987003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC105370803\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLOC107987075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC105370803\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLOC107987075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eABCA4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLOC107987075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eABCA4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eABCA4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e3.4. Model Training and Hyperparameter Optimization\u003c/h2\u003e \u003cp\u003eIn this step, we trained and systematically optimized multiple machine learning (ML) models for a supervised classification task, using the compact six-gene feature set identified by the preceding feature selection procedure. A total of 15 distinct ML algorithms were trained and evaluated on this dataset. Model performance was assessed using six complementary metrics: accuracy, precision, recall, area under the receiver operating characteristic curve (AUC\u0026ndash;ROC), F1 score, and Matthews correlation coefficient (MCC). These results are summarized visually in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e5\u003c/span\u003e, which presents accuracy and MCC, precision and recall, and AUC\u0026ndash;ROC and F1 score. To further improve predictive performance and model robustness, we tuned the hyperparameters of each of the 15 ML models. This optimization explored hyperparameter combinations with a grid search strategy, so that the selection of the best configuration is systematic and reproducible. The final tuned hyperparameter settings for each model are detailed in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, while a comprehensive comparison of baseline and tuned model performance metrics is provided in \u003cb\u003eSupplementary Information (SI) 4\u003c/b\u003e. 
Overall, hyperparameter tuning resulted in consistent performance improvements across all evaluated metrics, with these gains visually summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e6\u003c/span\u003e. Among the 15 tuned models, the tuned Gradient Boosting Classifier (GBC) was the top performer. It achieved the highest scores on most key metrics: an accuracy of 0.9171, an MCC of 0.7197, a precision of 0.9243, a recall of 0.9171, an AUC\u0026ndash;ROC of 0.9385, and an F1-score of 0.9107. The high accuracy and MCC indicate strong overall classification ability, while the precision and recall show that the model identifies true positives effectively while keeping errors low. The AUC\u0026ndash;ROC and F1 score further support GBC\u0026rsquo;s discriminatory power and balanced performance, making it the preferred model for this classification task.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eOptimized hyperparameter configurations for the 15 machine-learning models.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eML model\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHyperparameters\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSelected Best 
Value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGradient Boosting Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.01\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e300\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esubsample\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLightGBM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e150\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003enum_leaves\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eExtra Trees Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emin_samples_split\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e200\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emin_samples_leaf\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAdaBoost Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ebase_estimator\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDecisionTree(max_depth\u0026thinsp;=\u0026thinsp;2)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.05\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_neighbors\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eweights\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e'distance'\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLinear Discriminant Analysis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esolver\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e'lsqr'\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eshrinkage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNaive Bayes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003evar_smoothing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1e-08\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003epenalty\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e'l2'\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emin_samples_split\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRidge Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ealpha\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esolver\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e'cholesky'\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuadratic Discriminant Analysis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ereg_param\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDummy Classifier\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003estrategy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e'stratified'\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSupport Vector Machine (SVM)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egamma\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e3.5. K-fold cross-validation with hyperparameter-tuned Gradient Boosting Classifier\u003c/h2\u003e \u003cp\u003eCross-validation is a fundamental technique in machine learning that helps assess a model\u0026rsquo;s performance more reliably on new data. Instead of depending on a single train-test split, it uses multiple data divisions to provide a more balanced evaluation. This approach reduces the risk of misleading results that can arise from testing on just one specific dataset. In this analysis, we used the StratifiedKFold function from scikit-learn to perform 10-fold stratified cross-validation. 
This method ensures that the class proportions in each fold mirror those of the entire dataset, a critical feature for achieving dependable results, particularly when dealing with imbalanced classes. We evaluated a Gradient Boosting Classifier, fine-tuned with optimized hyperparameters, using this robust method. The cross-validation process unfolded as follows: the dataset was partitioned into 10 equal folds. In each of the 10 iterations, nine folds were dedicated to training the model, while the tenth fold served as the validation set. This cycle repeated until every fold had been used for validation exactly once. During each iteration, we evaluated the model\u0026rsquo;s performance using key metrics such as accuracy, precision, recall, F1 score, and AUC-ROC based on its predictions on the validation fold. To gain a well-rounded understanding of the model\u0026rsquo;s expected performance, we calculated the average of these metrics across all 10 folds. The results showed a strong overall performance, with an average accuracy of 0.921, a precision of 0.944, a recall of 0.906, an F1 score of 0.920, and an AUC-ROC of 0.978, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e7\u003c/span\u003e. These averaged values offer a reliable estimate of how well the model is expected to perform on new, unseen data, which highlights the effectiveness of this cross-validation approach.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e3.6. 
Model deployment as the ATMeQ web app and assessment\u003c/h2\u003e \u003cp\u003eTo make the prediction model easily accessible to biologists and chemists in their research, we deployed it as a publicly available web application called ATMeQ, hosted at [\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://share.streamlit.io/user/saiflab\u003c/span\u003e\u003cspan address=\"https://share.streamlit.io/user/saiflab\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e]. Below is a brief guide on how to use the ATMeQ web app (see more details in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e8\u003c/span\u003e):\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003e1. Data Preparation\u003c/b\u003e: Generate a CSV file containing expression data that have undergone the variance-stabilizing transformation (VST) in DESeq2. The VST, part of DESeq2, an R package designed for RNA-seq analysis, stabilizes variance across the range of expression levels, which makes the data better suited to clustering and visualization.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003e2. Accessing the Application and File Upload\u003c/b\u003e: Enter the URL given above into a web browser to reach the ATMeQ web app\u0026rsquo;s prediction page, then use the \u0026ldquo;Browse files\u0026rdquo; button to upload the prepared CSV file.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003e3. Make Prediction\u003c/b\u003e: Launch the prediction process by pressing the \u0026ldquo;Initiate Analysis\u0026rdquo; button.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003e4. 
Results Review\u003c/b\u003e: Examine the outcomes displayed in the section beneath the \u0026ldquo;Prediction results\u0026rdquo; heading. The processing typically concludes within a few seconds, and you have the option to retrieve the predicted data in CSV format by selecting the \u0026ldquo;Download Predictions\u0026rdquo; button.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eAmyotrophic lateral sclerosis (ALS) presents a persistent and formidable diagnostic challenge, primarily due to the nonspecific, heterogeneous, and often subtle nature of its initial symptoms, which significantly overlap with those of other neuromuscular disorders [\u003cspan citationid=\"CR88\" class=\"CitationRef\"\u003e88\u003c/span\u003e, \u003cspan citationid=\"CR89\" class=\"CitationRef\"\u003e89\u003c/span\u003e]. This reality forces a diagnostic process heavily reliant on the exclusion of alternative conditions and the nuanced judgment of specialized clinicians, contributing to a critical and well-documented delay [\u003cspan citationid=\"CR90\" class=\"CitationRef\"\u003e90\u003c/span\u003e]. Contemporary population-level analyses consistently reveal a substantial diagnostic latency, with a median delay of approximately 11 to 12 months from the onset of first symptoms to a confirmed diagnosis [\u003cspan citationid=\"CR91\" class=\"CitationRef\"\u003e91\u003c/span\u003e, \u003cspan citationid=\"CR92\" class=\"CitationRef\"\u003e92\u003c/span\u003e]. This protracted timeline is especially consequential in a rapidly and relentlessly progressive disease, where lost time equates to lost neurons and diminished therapeutic opportunity [\u003cspan citationid=\"CR92\" class=\"CitationRef\"\u003e92\u003c/span\u003e]. 
It powerfully motivates the urgent quest for objective, biological biomarkers that can complement clinical criteria and accelerate diagnostic certainty, thereby enabling earlier intervention.\u003c/p\u003e \u003cp\u003eIn this pursuit, high-throughput RNA-seq has emerged as a preeminent and powerfully positioned technology. It facilitates an unbiased, genome-wide, and quantitative survey of transcript abundance, supporting detection across an exceptionally broad dynamic range [\u003cspan citationid=\"CR93\" class=\"CitationRef\"\u003e93\u003c/span\u003e]. Established best-practices frameworks emphasize that RNA-seq workflows encompassing rigorous quality control, accurate alignment, and precise quantification can be standardized to yield highly reproducible transcriptomic profiles [\u003cspan citationid=\"CR94\" class=\"CitationRef\"\u003e94\u003c/span\u003e]. Notably, direct comparative analyses have consistently reported that RNA-seq holds significant advantages over previous microarray technologies, including superior resolution, a wider dynamic range, lower background noise, and a reduced susceptibility to technical variation [\u003cspan citationid=\"CR95\" class=\"CitationRef\"\u003e95\u003c/span\u003e]. Perhaps most importantly, RNA-seq uniquely enables the discovery of novel transcripts and the discrimination of biologically critical isoforms, capabilities that are essential for unraveling complex diseases like ALS. These technical strengths collectively solidify its suitability for the discovery of next-generation biomarkers.\u003c/p\u003e \u003cp\u003eHowever, the translation of transcriptomic data into robust, clinically actionable ALS biomarkers is fraught with significant challenges arising from both technical and biological complexity. 
On the technical front, batch effects and other non-biological sources of variation are a pervasive threat in high-throughput studies; if unaddressed, they can create confounding signals that are erroneously attributed to the disease state, compromising the validity of any downstream conclusions. Biologically, ALS is increasingly understood not as a monolithic entity but as a syndrome encompassing considerable molecular heterogeneity [\u003cspan citationid=\"CR96\" class=\"CitationRef\"\u003e96\u003c/span\u003e]. Emerging research indicates the existence of distinct transcriptomic subtypes within the sporadic ALS population, observable even in peripheral blood, suggesting divergent underlying pathological programs [\u003cspan citationid=\"CR97\" class=\"CitationRef\"\u003e97\u003c/span\u003e]. Furthermore, recent conceptual frameworks describe ALS \"molecular subtypes\" as integrative combinations of cellular dysfunctions including neuroinflammation, mitochondrial stress, and cytoskeletal defects that correlate with clinical variation [\u003cspan citationid=\"CR98\" class=\"CitationRef\"\u003e98\u003c/span\u003e, \u003cspan citationid=\"CR99\" class=\"CitationRef\"\u003e99\u003c/span\u003e]. 
This inherent biological diversity necessitates analytical approaches that can distinguish consistent, core disease signatures from noise and subtype-specific signals [\u003cspan citationid=\"CR97\" class=\"CitationRef\"\u003e97\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eConfronted by these challenges, the design of this study was explicitly guided by two foundational principles to maximize the clinical relevance and robustness of our findings: (i) a paramount emphasis on reproducibility, achieved through the integration of multiple independent RNA-seq cohorts to isolate transcriptional signals consistent across diverse datasets and technical platforms; and (ii) a rigorous prioritization of model robustness, implemented through leakage-aware data partitioning, conservative consensus feature selection, and systematic benchmarking of machine learning algorithms. Machine learning is particularly well-suited to this task, as it excels at identifying complex, non-linear patterns and feature interactions within high-dimensional biological data, moving beyond simple differential expression to build predictive models of disease state [\u003cspan citationid=\"CR100\" class=\"CitationRef\"\u003e100\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOur analytical journey began by quantifying the scope of transcriptomic dysregulation across three independent ALS brain tissue cohorts (PRJNA512012, PRJNA831563, PRJNA1163403). Differential expression analysis revealed extensive remodeling in each dataset, identifying 1,609, 1,302, and 2,223 differentially expressed genes (DEGs), respectively. Strikingly, however, the intersection of these three sizable lists yielded only 32 shared DEGs. This profound lack of overlap starkly illustrates the substantial heterogeneity introduced by factors such as cohort-specific demographics, disease stage at sample collection, tissue dissection protocols, and technical batch effects. 
It reinforces a critical lesson from prior transcriptomic meta-analyses: reproducibility across independent cohorts is a far stronger indicator of a robust disease association than the statistical magnitude of change within any single study.\u003c/p\u003e \u003cp\u003eTo build a generalizable classifier, we first established a supervised learning framework using a 70/30 train-test split, strictly preserving the independence of the test set. Recognizing that an imbalance between ALS and control samples in the training data could bias the classifier toward the majority class, we applied the Synthetic Minority Over-sampling Technique (SMOTE) exclusively to the training fold. SMOTE generates synthetic minority-class samples by interpolating between a minority-class sample and its nearest minority-class neighbors in feature space, mitigating bias and improving sensitivity without violating the integrity of the hold-out test set, a crucial consideration for ensuring credible performance estimates.\u003c/p\u003e \u003cp\u003eThe 32 shared DEGs constituted our initial feature universe, which we then refined through a rigorous, consensus-driven feature selection pipeline. We employed four complementary methodologies: Random Forest permutation importance, Gradient Boosting built-in importance, Recursive Feature Elimination (RFE), and the all-relevant selection algorithm Boruta. This multi-method approach was designed to circumvent the limitations inherent to any single technique. Boruta, in particular, serves as a stringent benchmark, as it uses a wrapper approach around Random Forest to identify all features whose importance is significantly higher than that of randomly permuted shadow copies of the features, thereby capturing features that are genuinely relevant even if their individual effect size is moderate. 
The convergence of these distinct methods onto a compact set of six genes (ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, and LOC107987008) provides strong evidence for the stability and reliability of this signature, reducing the likelihood that it is an artifact of a specific algorithmic bias.\u003c/p\u003e \u003cp\u003eThe biological composition of this six-gene panel reflects a convergence of molecular functions plausibly linked to ALS pathophysiology. ACTA1 encodes skeletal muscle α-actin, the predominant actin isoform in sarcomeric thin filaments and an essential structural component for muscle contraction and cytoskeletal integrity [\u003cspan citationid=\"CR101\" class=\"CitationRef\"\u003e101\u003c/span\u003e]. Altered ACTA1 expression in ALS may therefore reflect secondary muscle remodeling in response to denervation. KCNE4, a β-subunit of voltage-gated potassium channels, suppresses Kv1.3 currents by modulating gating and surface trafficking [\u003cspan citationid=\"CR102\" class=\"CitationRef\"\u003e102\u003c/span\u003e]. Because neuronal hyperexcitability is an early hallmark of ALS, dysregulation of KCNE4-mediated channel modulation could contribute to excitatory imbalance in motor circuits. ABCA4, a photoreceptor ATP-binding cassette transporter, catalyzes the transport of N-retinylidene-phosphatidylethanolamine to remove reactive retinal derivatives and maintain lipid homeostasis in photoreceptor membranes [\u003cspan citationid=\"CR103\" class=\"CitationRef\"\u003e103\u003c/span\u003e]. Although primarily retinal, its lipid transport function underscores broader metabolic processes that may influence neuronal vulnerability. COL6A4P2 is a pseudogene derived from the COL6A4 gene of the collagen VI family, which organizes the extracellular matrix (ECM) and supports neuronal and glial survival [\u003cspan citationid=\"CR104\" class=\"CitationRef\"\u003e104\u003c/span\u003e]. 
Thus, COL6A4P2 expression may mark ECM remodeling or glial activation observed in ALS tissues, though its own function remains uncharacterized. HERC2P2, a pseudogene of the ubiquitin ligase HERC2, has been found transcriptionally active and associated with DNA repair\u0026ndash;related pathways in other biological contexts. Considering the role of its parent HERC2 in ubiquitin-dependent proteostasis [\u003cspan citationid=\"CR105\" class=\"CitationRef\"\u003e105\u003c/span\u003e], HERC2P2 may similarly reflect stress-response dysregulation relevant to neurodegeneration. Finally, LOC107987008 represents an uncharacterized non-coding RNA locus, consistent with reports that many reproducible ALS transcriptomic signatures involve unannotated long non-coding RNAs [\u003cspan citationid=\"CR106\" class=\"CitationRef\"\u003e106\u003c/span\u003e]. Collectively, these genes capture distinct biological axes (structural integrity, excitability, lipid metabolism, extracellular matrix maintenance, ubiquitin-linked regulation, and non-coding RNA signaling) that together mirror the molecular heterogeneity of ALS.\u003c/p\u003e \u003cp\u003eWith this refined feature set, we embarked on a comprehensive benchmarking phase, training and evaluating fifteen distinct machine learning classifiers spanning linear models, support vector machines, k-nearest neighbors, Bayesian classifiers, and ensemble methods. After systematic hyperparameter optimization, a Gradient Boosting Classifier emerged as the top-performing model. On the completely held-out test set, it achieved an accuracy of 0.9171, a Matthews Correlation Coefficient (MCC) of 0.7197, a precision of 0.9243, a recall (sensitivity) of 0.9171, an AUC-ROC of 0.9385, and an F1-score of 0.9107. The model\u0026rsquo;s robustness was further validated via stratified 10-fold cross-validation on the training data, yielding consistently high mean metrics (e.g., AUC-ROC of 0.978). 
Gradient boosting\u0026rsquo;s success in this context is consistent with how the method works: it builds a strong predictive model by sequentially adding weak learners (typically shallow decision trees), each fitted to correct the errors of the current ensemble, which makes it well suited to modeling complex, non-linear interactions within a parsimonious feature set.\u003c/p\u003e \u003cp\u003eTo translate this computational model into a practical resource, we operationalized it as a lightweight, publicly accessible web application dubbed ATMeQ. This application is designed to accept user-submitted, normalized RNA-seq expression data for the six signature genes and return a predicted classification, along with relevant confidence metrics. By packaging the model in this accessible format, we actively lower the barrier for independent validation, external testing, and exploratory use by the broader research community, addressing a common translational gap in bioinformatics research.\u003c/p\u003e \u003cp\u003eIn conclusion, this study demonstrates a principled pathway from the recognition of pervasive transcriptomic heterogeneity in ALS to the development of a parsimonious, reproducible, and high-performing diagnostic classifier. The workflow, which integrates multi-cohort analysis, consensus feature selection, rigorous class imbalance handling, and exhaustive model benchmarking, provides a robust template for biomarker discovery in complex diseases. The resulting six-gene signature, while requiring further validation, captures intersecting aspects of ALS pathophysiology involving neuromuscular integrity, ionic excitability, and cellular homeostasis.\u003c/p\u003e \u003cp\u003eWe openly acknowledge several limitations. The use of postmortem brain tissue inherently captures late-stage pathology, which may not fully reflect the early molecular events most relevant for timely diagnosis. Although multi-cohort integration mitigates batch effects, unmeasured technical or biological confounders may persist. 
The biological functions of some signature genes, particularly the non-coding elements, require deeper mechanistic investigation. Most critically, prospective validation in independent, ideally multi-center cohorts, including samples from pre-symptomatic or early-stage individuals and from accessible tissues like blood, is the essential next step to evaluate true clinical potential.\u003c/p\u003e \u003cp\u003eFuture directions should focus on this external validation, while also exploring the signature\u0026rsquo;s utility in stratifying patients into molecular subtypes, predicting disease progression, and evaluating treatment response. Integrating this transcriptomic signal with other multi-omic data layers will further refine our understanding and move the field closer to a future where molecular diagnostics significantly shorten the protracted and difficult diagnostic journey faced by ALS patients today.\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eBased on the preceding discussion, this study successfully navigates the substantial heterogeneity and technical challenges inherent in ALS transcriptomics to identify a concise, reproducible six-gene signature and a high-performance diagnostic classifier. By rigorously integrating multiple independent cohorts and employing consensus feature selection alongside advanced machine learning, we developed a model that achieves robust accuracy and has been operationalized as the publicly accessible ATMeQ web application. While derived from postmortem brain tissue and thus reflective of late-stage pathology, the signature implicates biologically plausible pathways in ALS, including cytoskeletal integrity, ion channel function, and metabolic regulation. The critical next steps involve prospective validation in accessible biospecimens from early-stage patients and exploration of the signature\u0026rsquo;s utility for disease stratification. 
Ultimately, this work provides a principled framework and an applicable tool to advance the urgent quest for molecular biomarkers, aiming to shorten the extended diagnostic interval in ALS and enable earlier therapeutic intervention.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eALS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAmyotrophic Lateral Sclerosis\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRNA-seq\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRNA sequencing\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eDEG\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eDifferentially Expressed Gene\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eML\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMachine Learning\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eGBC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eGradient Boosting Classifier\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRF\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRFE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRecursive Feature Elimination\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSMOTE\u003c/div\u003e \u003cdiv 
class=\"Description\"\u003e \u003cp\u003eSynthetic Minority Over-sampling Technique\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eAUC-ROC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eArea Under the Receiver Operating Characteristic Curve\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMCC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMatthews Correlation Coefficient\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eNGS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNext-Generation Sequencing\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eVST\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eVariance Stabilizing Transformation\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eQC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eQuality Control\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eGEO\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eGene Expression Omnibus\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eATMeQ\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eALS Prediction Tool using Machine Learning and RNA-Seq\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eCompeting Interests:\u003c/strong\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e 
\u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e \u003cp\u003eThis study received no funding.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAhmed Saif: Conceptualization, Data curation, Methodology, Software, Formal analysis, Result interpretation, Investigation, Validation, Visualization, Writing \u0026ndash; original draft, Writing \u0026ndash; review and editing, Supervision. Md Tarikul Islam and Md Aktaruzzaman: Writing \u0026ndash; review and editing.\u003c/p\u003e\u003ch2\u003eAcknowledgment\u003c/h2\u003e \u003cp\u003eWe are thankful to Biological Research on the Brain (BRB), Jashore 7408, Bangladesh.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe RNA-sequencing datasets analyzed in this study are publicly available in the Gene Expression Omnibus (GEO) repository. The datasets include GSE124439 (PRJNA512012), GSE201407 (PRJNA831563), and GSE277709 (PRJNA1163403), and are accessible through their corresponding web links: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124439, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201407, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE277709.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBrown, R. H. \u0026amp; Al-Chalabi, A. Amyotrophic lateral sclerosis. \u003cem\u003eN. Engl. J. Med.\u003c/em\u003e \u003cb\u003e377\u003c/b\u003e, 162\u0026ndash;172 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeldman, E. L. et al. Amyotrophic lateral sclerosis. \u003cem\u003eLancet\u003c/em\u003e \u003cb\u003e400\u003c/b\u003e, 1363\u0026ndash;1380 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHardiman, O. et al. Amyotrophic lateral sclerosis. \u003cem\u003eNat. Rev. Dis. Primers\u003c/em\u003e \u003cb\u003e3\u003c/b\u003e, 1\u0026ndash;19 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMasrori, P. 
\u0026amp; Van Damme, P. Amyotrophic lateral sclerosis: a clinical review. \u003cem\u003eEur. J. Neurol.\u003c/em\u003e \u003cb\u003e27\u003c/b\u003e, 1918\u0026ndash;1929 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWijesekera, L. C. \u0026amp; Leigh, P. N. Amyotrophic lateral sclerosis. \u003cem\u003eOrphanet J. Rare Dis.\u003c/em\u003e \u003cb\u003e4\u003c/b\u003e, 1\u0026ndash;22 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBradford, D. \u0026amp; Rodgers, K. E. Advancements and challenges in amyotrophic lateral sclerosis. \u003cem\u003eFront. Neurosci.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e, 1401706 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMontes, J. et al. Translational research in ALS, in: Animal and Translational Models for CNS Drug Discovery, Elsevier, pp. 267\u0026ndash;310 (2008).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTurner, M. R., Parton, M. J. \u0026amp; Leigh, P. N. Clinical trials in ALS: an overview, in: Semin Neurol, Copyright\u0026copy; 2001 by Thieme Medical Publishers, Inc., 333 Seventh Avenue, New \u0026hellip; pp. 167\u0026ndash;176.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePetrov, D., Mansfield, C., Moussy, A. \u0026amp; Hermine, O. ALS clinical trials review: 20 years of failure. Are we any closer to registering a new treatment? \u003cem\u003eFront. Aging Neurosci.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e, 68 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTurnbull, J. Why is ALS so Difficult to Treat? \u003cem\u003eCan. J. Neurol. Sci.\u003c/em\u003e \u003cb\u003e41\u003c/b\u003e, 144\u0026ndash;155 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaiswal, M. K. Riluzole and edaravone: A tale of two amyotrophic lateral sclerosis drugs. \u003cem\u003eMed. Res. 
Rev.\u003c/em\u003e \u003cb\u003e39\u003c/b\u003e, 733\u0026ndash;748 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSawada, H. Clinical efficacy of edaravone for the treatment of amyotrophic lateral sclerosis. \u003cem\u003eExpert Opin. Pharmacother\u003c/em\u003e. \u003cb\u003e18\u003c/b\u003e, 735\u0026ndash;738 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGordon, P. H., Cheng, B., Katz, I. B., Mitsumoto, H. \u0026amp; Rowland, L. P. Clinical features that distinguish PLS, upper motor neuron\u0026ndash;dominant ALS, and typical ALS. \u003cem\u003eNeurology\u003c/em\u003e \u003cb\u003e72\u003c/b\u003e, 1948\u0026ndash;1952 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNechay, A., Stetsenko, T. \u0026amp; Savchenko, O. P101\u0026ndash;2285: Amyotrophic lateral sclerosis with juvenile onset. Case report. \u003cem\u003eEur. J. Pediatr. Neurol.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e, S122\u0026ndash;S123 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eŠtětk\u0026aacute;řov\u0026aacute;, I. \u0026amp; Ehler, E. Diagnostics of amyotrophic lateral sclerosis: up to date. \u003cem\u003eDiagnostics\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 231 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDe Carvalho, M. et al. Electrodiagnostic criteria for diagnosis of ALS. \u003cem\u003eClin. Neurophysiol.\u003c/em\u003e \u003cb\u003e119\u003c/b\u003e, 497\u0026ndash;503 (2008).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIwasaki, Y., Ikeda, K., Ichikawa, Y., Igarashi, O. \u0026amp; Kinoshita, M. MRI in ALS patients. \u003cem\u003eActa Neurol. Scand.\u003c/em\u003e \u003cb\u003e107\u003c/b\u003e (2003).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKassubek, J. \u0026amp; Pagani, M. Imaging in amyotrophic lateral sclerosis: MRI and PET. \u003cem\u003eCurr. Opin. 
Neurol.\u003c/em\u003e \u003cb\u003e32\u003c/b\u003e, 740\u0026ndash;746 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGatto, R. G., Li, W., Gao, J. \u0026amp; Magin, R. L. In vivo diffusion MRI detects early spinal cord axonal pathology in a mouse model of amyotrophic lateral sclerosis. \u003cem\u003eNMR Biomed.\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e, e3954 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJamali, A. M., Kethamreddy, M., Burkett, B. J., Port, J. D. \u0026amp; Pandey, M. K. PET and SPECT imaging of ALS: an educational review. \u003cem\u003eMol. Imaging\u003c/em\u003e. \u003cb\u003e2023\u003c/b\u003e, 5864391 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen, D. et al. The Gold Coast criteria increases the diagnostic sensitivity for amyotrophic lateral sclerosis in a Chinese population. \u003cem\u003eTransl Neurodegener\u003c/em\u003e. \u003cb\u003e10\u003c/b\u003e, 1\u0026ndash;8 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHannaford, A. et al. Diagnostic utility of gold coast criteria in amyotrophic lateral sclerosis. \u003cem\u003eAnn. Neurol.\u003c/em\u003e \u003cb\u003e89\u003c/b\u003e, 979\u0026ndash;986 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ede Jongh, A. D. et al. Characterising ALS disease progression according to El Escorial and Gold Coast criteria. \u003cem\u003eJ. Neurol. Neurosurg. Psychiatry\u003c/em\u003e. \u003cb\u003e93\u003c/b\u003e, 865\u0026ndash;870 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCampanari, M. L., Bourefis, A. R. \u0026amp; Kabashi, E. Diagnostic challenge and neuromuscular junction contribution to ALS pathogenesis. \u003cem\u003eFront. Neurol.\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e, 68 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSegura, T. et al. 
Symptoms timeline and outcomes in amyotrophic lateral sclerosis using artificial intelligence. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e, 702 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSalameh, J. S., Brown, R. H. Jr \u0026amp; Berry, J. D. \u003cem\u003eAmyotrophic lateral sclerosis\u003c/em\u003e pp. 469\u0026ndash;476 (in: Semin Neurol, Thieme Medical, 2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBradford, D. \u0026amp; Rodgers, K. E. Advancements and challenges in amyotrophic lateral sclerosis. \u003cem\u003eFront. Neurosci.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e, 1401706 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMasrori, P. \u0026amp; Van Damme, P. Amyotrophic lateral sclerosis: a clinical review. \u003cem\u003eEur. J. Neurol.\u003c/em\u003e \u003cb\u003e27\u003c/b\u003e, 1918\u0026ndash;1929 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVidovic, M., M\u0026uuml;schen, L. H., Brakemeier, S., Machetanz, G., Naumann, M. \u0026amp; Castro-Gomez, S. Current state and future directions in the diagnosis of amyotrophic lateral sclerosis. \u003cem\u003eCells\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 736 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVu, L. T. \u0026amp; Bowser, R. Fluid-based biomarkers for amyotrophic lateral sclerosis. \u003cem\u003eNeurotherapeutics\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 119\u0026ndash;134 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDubois, B., von Arnim, C. A. F., Burnie, N., Bozeat, S. \u0026amp; Cummings, J. Biomarkers in Alzheimer\u0026rsquo;s disease: role in early and differential diagnosis and recognition of atypical variants. \u003cem\u003eAlzheimers Res. Ther.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e, 175 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYamashita, K. 
Y., Bhoopatiraju, S., Silverglate, B. D. \u0026amp; Grossberg, G. T. Biomarkers in Parkinson\u0026rsquo;s disease: A state of the art review. \u003cem\u003eBiomark. Neuropsychiatry\u003c/em\u003e. \u003cb\u003e9\u003c/b\u003e, 100074 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, H., Nottingham, R. M. \u0026amp; Lambowitz, A. M. TGIRT-seq protocol for the comprehensive profiling of coding and non-coding RNA biotypes in cellular, extracellular vesicle, and plasma RNAs. \u003cem\u003eBio Protoc.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, e4239\u0026ndash;e4239 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmail, C. \u0026amp; Montgomery, S. B. RNA sequencing in disease diagnosis. \u003cem\u003eAnnu. Rev. Genomics Hum. Genet.\u003c/em\u003e \u003cb\u003e25\u003c/b\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOzsolak, F. \u0026amp; Milos, P. M. RNA sequencing: advances, challenges and opportunities. \u003cem\u003eNat. Rev. Genet.\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 87\u0026ndash;98 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSierro, N., Martin, F., Poussin, C., Hoeng, J. \u0026amp; Ivanov, N. V. Comparison of oligonucleotide microarray and RNA-seq technologies in the context of gene expression analysis. \u003cem\u003eEMBnet J.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e, 88 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStark, R., Grzelak, M. \u0026amp; Hadfield, J. RNA sequencing: the teenage years. \u003cem\u003eNat. Rev. Genet.\u003c/em\u003e \u003cb\u003e20\u003c/b\u003e, 631\u0026ndash;656 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan, H. \u0026amp; Jiang, X. Disease biomarker query from RNA-seq data. \u003cem\u003eCancer Inf.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e, CIN\u0026ndash;S13876 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, Y. et al. 
A reliable and quick method for screening alternative splicing variants for low-abundance genes. \u003cem\u003ePLoS One\u003c/em\u003e. \u003cb\u003e19\u003c/b\u003e, e0305201 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLataretu, M. \u0026amp; H\u0026ouml;lzer, M. RNAflow: An effective and simple RNA-seq differential gene expression pipeline using nextflow. \u003cem\u003eGenes (Basel)\u003c/em\u003e. \u003cb\u003e11\u003c/b\u003e, 1487 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCosta-Silva, J., Domingues, D. S., Menotti, D., Hungria, M. \u0026amp; Lopes, F. M. Computational methods for differentially expressed gene analysis from RNA-Seq: an overview. \u003cem\u003eArXiv Preprint ArXiv\u003c/em\u003e :210903625 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChatterjee, P. \u0026amp; Roy, D. Comparative analysis of RNA-Seq data from brain and blood samples of Parkinson\u0026rsquo;s disease. \u003cem\u003eBiochem. Biophys. Res. Commun.\u003c/em\u003e \u003cb\u003e484\u003c/b\u003e, 557\u0026ndash;564 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDube, U. et al. An atlas of cortical circular RNA expression in Alzheimer disease brains demonstrates clinical and pathological associations. \u003cem\u003eNat. Neurosci.\u003c/em\u003e \u003cb\u003e22\u003c/b\u003e, 1903\u0026ndash;1912 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSproviero, D. et al. Different miRNA profiles in plasma derived small and large extracellular vesicles from patients with neurodegenerative diseases. \u003cem\u003eInt. J. Mol. Sci.\u003c/em\u003e \u003cb\u003e22\u003c/b\u003e, 2737 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShi, M., Caudle, W. M. \u0026amp; Zhang, J. Biomarker discovery in neurodegenerative diseases: a proteomic approach. \u003cem\u003eNeurobiol. 
Dis.\u003c/em\u003e \u003cb\u003e35\u003c/b\u003e, 157\u0026ndash;164 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMittal, S., Jena, M. K. \u0026amp; Pathak, B. Machine Learning-Assisted Direct RNA Sequencing with Epigenetic RNA Modification Detection via Quantum Tunneling. \u003cem\u003eAnal. Chem.\u003c/em\u003e \u003cb\u003e96\u003c/b\u003e, 11516\u0026ndash;11524 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVadapalli, S., Abdelhalim, H., Zeeshan, S. \u0026amp; Ahmed, Z. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. \u003cem\u003eBrief. Bioinform\u003c/em\u003e. \u003cb\u003e23\u003c/b\u003e, bbac191 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDudek, G. et al. Machine learning-based prediction of rheumatoid arthritis with development of ACPA autoantibodies in the presence of non-HLA genes polymorphisms. \u003cem\u003ePLoS One\u003c/em\u003e. \u003cb\u003e19\u003c/b\u003e, e0300717 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWenric, S. \u0026amp; Shemirani, R. Using supervised learning methods for gene selection in RNA-Seq case-control studies. \u003cem\u003eFront. Genet.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e, 297 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVu, D. L. \u0026amp; Le, H. C. Machine Learning-Based ALS Diagnosis Using Gene Expression Data, in: 2023 RIVF International Conference on Computing and Communication Technologies (RIVF), IEEE, pp. 354\u0026ndash;359 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, S. et al. Genome-wide identification of the genetic basis of amyotrophic lateral sclerosis. \u003cem\u003eNeuron\u003c/em\u003e \u003cb\u003e110\u003c/b\u003e, 992\u0026ndash;1008 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRad, H. N. et al. 
Amyotrophic lateral sclerosis diagnosis using machine learning and multi-omic data integration. \u003cem\u003eHeliyon\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCatanese, A. et al. Multiomics and machine-learning identify novel transcriptional and mutational signatures in amyotrophic lateral sclerosis. \u003cem\u003eBrain\u003c/em\u003e \u003cb\u003e146\u003c/b\u003e, 3770\u0026ndash;3782 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrima, N. et al. RNA sequencing of peripheral blood in amyotrophic lateral sclerosis reveals distinct molecular subtypes: considerations for biomarker discovery. \u003cem\u003eNeuropathol. Appl. Neurobiol.\u003c/em\u003e \u003cb\u003e49\u003c/b\u003e, e12943 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVieira, F. G. et al. A machine-learning based objective measure for ALS disease severity. \u003cem\u003eNPJ Digit. Med.\u003c/em\u003e \u003cb\u003e5\u003c/b\u003e, 45 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarrett, T. et al. NCBI GEO: archive for functional genomics data sets\u0026mdash;update. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e41\u003c/b\u003e, D991\u0026ndash;D995 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndrews, S. FastQC: a quality control tool for high throughput sequence data. (2017). (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBolger, A. M., Lohse, M. \u0026amp; Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e, 2114\u0026ndash;2120 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, D., Langmead, B. \u0026amp; Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. \u003cem\u003eNat. Methods\u003c/em\u003e. 
\u003cb\u003e12\u003c/b\u003e, 357\u0026ndash;360 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiao, Y., Smyth, G. K. \u0026amp; Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e, 923\u0026ndash;930 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLove, M. I., Huber, W. \u0026amp; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e, 1\u0026ndash;21 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBenjamini, Y. \u0026amp; Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. \u003cem\u003eJ. Roy. Stat. Soc.: Ser. B (Methodol.)\u003c/em\u003e. \u003cb\u003e57\u003c/b\u003e, 289\u0026ndash;300 (1995).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRitchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e43\u003c/b\u003e, e47\u0026ndash;e47 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSalazar, J. J., Garland, L., Ochoa, J. \u0026amp; Pyrcz, M. J. Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy. \u003cem\u003eJ. Pet. Sci. Eng.\u003c/em\u003e \u003cb\u003e209\u003c/b\u003e, 109885 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlagus, R. \u0026amp; Lusa, L. SMOTE for high-dimensional class-imbalanced data. \u003cem\u003eBMC Bioinform.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 1\u0026ndash;16 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman, L. Random forests. \u003cem\u003eMach. 
Learn.\u003c/em\u003e \u003cb\u003e45\u003c/b\u003e, 5\u0026ndash;32 (2001).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFriedman, J. H. Greedy function approximation: a gradient boosting machine. \u003cem\u003eAnn. Stat.\u003c/em\u003e 1189\u0026ndash;1232. (2001).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZeng, X., Chen, Y. W. \u0026amp; Tao, C. Feature selection using recursive feature elimination for handwritten digit recognition, in: 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, : pp. 1205\u0026ndash;1208. (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKursa, M. B. \u0026amp; Rudnicki, W. R. Feature selection with the Boruta package. \u003cem\u003eJ. Stat. Softw.\u003c/em\u003e \u003cb\u003e36\u003c/b\u003e, 1\u0026ndash;13 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, D., Zhang, B. \u0026amp; Li, C. A feature-scaling-based \u003cspan\u003e$\u003c/span\u003e k \u003cspan\u003e$\u003c/span\u003e-nearest neighbor algorithm for indoor positioning systems. \u003cem\u003eIEEE Internet Things J.\u003c/em\u003e \u003cb\u003e3\u003c/b\u003e, 590\u0026ndash;597 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFreund, Y. \u0026amp; Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. \u003cem\u003eJ. Comput. Syst. Sci.\u003c/em\u003e \u003cb\u003e55\u003c/b\u003e, 119\u0026ndash;139 (1997).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQuinlan, J. R. Induction of decision trees. \u003cem\u003eMach. Learn.\u003c/em\u003e \u003cb\u003e1\u003c/b\u003e, 81\u0026ndash;106 (1986).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePedregosa, F. et al. Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 2825\u0026ndash;2830. 
(2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeurts, P., Ernst, D. \u0026amp; Wehenkel, L. Extremely randomized trees. \u003cem\u003eMach. Learn.\u003c/em\u003e \u003cb\u003e63\u003c/b\u003e, 3\u0026ndash;42 (2006).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFriedman, J. H. Greedy function approximation: a gradient boosting machine. \u003cem\u003eAnn. Stat.\u003c/em\u003e 1189\u0026ndash;1232. (2001).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCover, T. \u0026amp; Hart, P. Nearest neighbor pattern classification. \u003cem\u003eIEEE Trans. Inf. Theory\u003c/em\u003e. \u003cb\u003e13\u003c/b\u003e, 21\u0026ndash;27 (1967).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKe, G. et al. Lightgbm: A highly efficient gradient boosting decision tree. \u003cem\u003eAdv. Neural Inf. Process. Syst.\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFisher, R. A. The use of multiple measurements in taxonomic problems. \u003cem\u003eAnn. Eugen\u003c/em\u003e. \u003cb\u003e7\u003c/b\u003e, 179\u0026ndash;188 (1936).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHosmer, D. W. Jr, Lemeshow, S. \u0026amp; Sturdivant, R. X. \u003cem\u003eApplied logistic regression\u003c/em\u003e (Wiley, 2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, H. The optimality of naive Bayes. \u003cem\u003eAa\u003c/em\u003e \u003cb\u003e1\u003c/b\u003e, 3 (2004).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZiegel, E. R. The elements of statistical learning, (2003).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman, L. Random forests. \u003cem\u003eMach. Learn.\u003c/em\u003e \u003cb\u003e45\u003c/b\u003e, 5\u0026ndash;32 (2001).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoerl, A. E. \u0026amp; Kennard, R. W. 
Ridge regression: Biased estimation for nonorthogonal problems. \u003cem\u003eTechnometrics\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 55\u0026ndash;67 (1970).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCortes, C. Support-Vector Networks, Mach Learn (1995).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, T. \u0026amp; Guestrin, C. Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, : pp. 785\u0026ndash;794. (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBischl, B. et al. Boulesteix, Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. \u003cem\u003eWiley Interdiscip Rev. Data Min. Knowl. Discov\u003c/em\u003e. \u003cb\u003e13\u003c/b\u003e, e1484 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKohavi, R. \u003cem\u003eA study of cross-validation and bootstrap for accuracy estimation and model selection\u003c/em\u003e (Morgan Kaufman Publishing, 1995).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGalvin, M. et al. The path to specialist multidisciplinary care in amyotrophic lateral sclerosis: a population-based study of consultations, interventions and costs. \u003cem\u003ePLoS One\u003c/em\u003e. \u003cb\u003e12\u003c/b\u003e, e0179796 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGupta, D., Shiralkar, M. \u0026amp; Chaudhari, V. Conventional remedy to Lou Gehrig\u0026rsquo;s disease-Amyotrophic Lateral Sclerosis (ALS): a rare clinical entity., (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChieia, M. A., Oliveira, A. S. B., Silva, H. C. A. \u0026amp; Gabbai, A. A. Amyotrophic lateral sclerosis: considerations on diagnostic criteria. \u003cem\u003eArq. 
Neuropsiquiatr.\u003c/em\u003e \u003cb\u003e68\u003c/b\u003e, 837\u0026ndash;842 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFalc\u0026atilde;o de Campos, C. et al. Trends in the diagnostic delay and pathway for amyotrophic lateral sclerosis patients across different countries. \u003cem\u003eFront. Neurol.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e, 1064619 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCellura, E., Spataro, R., Taiello, A. C. \u0026amp; Bella, V. L. Factors affecting the diagnostic delay in amyotrophic lateral sclerosis. \u003cem\u003eClin. Neurol. Neurosurg.\u003c/em\u003e \u003cb\u003e114\u003c/b\u003e, 550\u0026ndash;554 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOlsen, R. H. \u0026amp; Christensen, H. \u003cem\u003eTranscriptomics: RNA-seq, in: Introduction to Bioinformatics in Microbiology\u003c/em\u003e pp. 177\u0026ndash;188 (Springer, 2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWatson, M. Quality assessment and control of high-throughput sequencing data. \u003cem\u003eFront. Genet.\u003c/em\u003e \u003cb\u003e5\u003c/b\u003e, 235 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGunter, H. M. et al. mRNA vaccine quality analysis using RNA sequencing. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 5663 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFloriddia, E. Transcriptomics and ALS outcome. \u003cem\u003eNat. Neurosci.\u003c/em\u003e \u003cb\u003e26\u003c/b\u003e, 175 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchweingruber, C. et al. Single-cell RNA-sequencing reveals early mitochondrial dysfunction unique to motor neurons shared across FUS-and TARDBP-ALS. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cb\u003e16\u003c/b\u003e, 4633 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarley, J., Clarke, B. E. 
\u0026amp; Patani, R. The interplay of RNA binding proteins, oxidative stress and mitochondrial dysfunction in ALS. \u003cem\u003eAntioxidants\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e, 552 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRossi, S. \u0026amp; Cozzolino, M. Dysfunction of RNA/RNA-binding proteins in ALS astrocytes and microglia. \u003cem\u003eCells\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e, 3005 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaza, K. \u003cem\u003eMachine learning in single-cell RNA-seq data analysis\u003c/em\u003e (Springer, 2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaing, N. G. et al. Mutations and polymorphisms of the skeletal muscle α‐actin gene (ACTA1). \u003cem\u003eHum. Mutat.\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e, 1267\u0026ndash;1277 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSol\u0026eacute;, L. et al. KCNE4 suppresses Kv1. 3 currents by modulating trafficking, surface expression and channel gating. \u003cem\u003eJ. Cell. Sci.\u003c/em\u003e \u003cb\u003e122\u003c/b\u003e, 3738\u0026ndash;3748 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMolday, R. S., Zhong, M. \u0026amp; Quazi, F. The role of the photoreceptor ABC transporter ABCA4 in lipid transport and Stargardt macular degeneration, Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids \u003cb\u003e1791\u003c/b\u003e 573\u0026ndash;583. (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSabatelli, P. et al. Expression of the collagen VI α5 and α6 chains in normal human skin and in skin of patients with collagen VI-related myopathies. \u003cem\u003eJ. Invest. Dermatology\u003c/em\u003e. \u003cb\u003e131\u003c/b\u003e, 99\u0026ndash;107 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKolenda, T. et al. 
AURKAPS1, HERC2P2 and SDHAP1 pseudogenes: molecular role in development and progression of head and neck squamous cell carcinomas and their diagnostic utility. \u003cem\u003eRep. Practical Oncol. Radiotherapy\u003c/em\u003e. \u003cb\u003e29\u003c/b\u003e, 718\u0026ndash;731 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBilbao-Arribas, M. \u0026amp; Jugo, B. M. Transcriptomic meta-analysis reveals unannotated long non-coding RNAs related to the immune response in sheep. \u003cem\u003eFront. Genet.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e, 1067350 (2022).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Amyotrophic Lateral Sclerosis (ALS), RNA-seq Meta-analysis, Differential Gene Expression, Machine Learning-based Diagnosis, Gene signature","lastPublishedDoi":"10.21203/rs.3.rs-8614090/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8614090/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003emethods\u003c/h2\u003e 
\u003cp\u003eRandom Forest importance, Gradient Boosting, Recursive Feature Elimination (RFE), and the Boruta algorithm, narrowed this set down to a biologically meaningful six-gene signature (ACTA1, ABCA4, COL6A4P2, HERC2P2, KCNE4, LOC107987008). Employing this signature, fifteen machine learning models were trained and optimized through hyperparameter tuning. The top-performing model, a Gradient Boosting Classifier (GBC), was validated through k-fold cross-validation, achieving 96% accuracy, a 0.92 Matthews Correlation Coefficient (MCC), 0.937 precision, 0.991 recall, 0.962 F1-score, and a 0.993 AUC-ROC. Therefore, this model was deployed as ATMeQ, a publicly available web tool (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://atmeq-ai.streamlit.app/\u003c/span\u003e\u003cspan address=\"https://atmeq-ai.streamlit.app/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e with potential utility for clinicians and researchers to predict ALS risk and validate biomarkers. 
Collectively, the study demonstrates that integrative transcriptomics and machine learning can significantly reduce potential diagnostic delays and enable biomarker-driven detection in ALS.\u003c/p\u003e","manuscriptTitle":"ATMeQ: A Machine Learning-Based Framework for Amyotrophic Lateral Sclerosis Disease using RNA-seq Meta-Analysis","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-17 18:27:12","doi":"10.21203/rs.3.rs-8614090/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewersInvited","content":"","date":"2026-04-16T05:18:03+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-04-14T18:22:51+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-01-28T09:33:07+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-01-23T02:28:41+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-01-23T02:22:10+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3de4c604-9875-47ce-89b8-d73c550bf020","owner":[],"postedDate":"April 17th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":66440412,"name":"Health sciences/Biomarkers"},{"id":66440413,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":66440414,"name":"Biological 
sciences/Genetics"},{"id":66440415,"name":"Health sciences/Neurology"},{"id":66440416,"name":"Biological sciences/Neuroscience"}],"tags":[],"updatedAt":"2026-04-17T18:27:12+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-17 18:27:12","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8614090","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8614090","identity":"rs-8614090","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
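The abstract reports accuracy, precision, recall, F1-score, and the Matthews Correlation Coefficient (MCC) for the Gradient Boosting Classifier. As a minimal sketch of how these metrics follow from a binary confusion matrix, the example below computes them with the Python standard library only. The function name `binary_metrics` and the counts passed to it are hypothetical and chosen for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (NOT the paper's results).
metrics = binary_metrics(tp=45, fp=5, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the ALS and control classes are imbalanced. In a k-fold setting, metrics such as these would typically be computed per fold and then averaged, which is one reason a reported F1-score need not equal the value obtained by combining the averaged precision and recall directly.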
