Sex-dependent prediction of autism

Justin O'Sullivan, Catriona Miller, Theo Portlock, Denis Nyaga

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6323696/v1
This work is licensed under a CC BY 4.0 License.
Status: Version 1 posted.

Abstract

Autism has a global prevalence of 1%, with a male-to-female diagnosis ratio of roughly 4:1. Several models have been developed to predict autism using genetic information. However, the influence of biological sex on prediction outcomes remains underexplored. We present an ensemble model to predict autism, which integrates polygenic risk scores (PRSs), common genetic variants, and autism-associated genes with the MSSNG whole genome sequencing (WGS) dataset. Following training, our model achieved an accuracy of 0.68, an area under the receiver operating characteristic curve (AUROC) of 0.72, and a recall of 0.77 on the test dataset.
Notably, common variants contributed significantly more to autism prediction in males than in females (p < 0.001), with accuracies of 0.69 and 0.66, respectively. The 16p11 locus emerged as particularly predictive for females (p < 0.001). Gene enrichment analysis using the Allen Brain Atlas revealed that expression of the autism-associated genes that were significant in females was enriched (FWER < 0.05) in the primary somatosensory cortex, inferior parietal cortex, and parietal neocortex during fetal development. By contrast, expression of the autism-associated genes that were significant in males was enriched (FWER < 0.05) in the dorsolateral prefrontal cortex and anterior cingulate cortex across developmental stages (fetal to adult). These findings underscore a sex-dependent role for common genetic variants in autism development. In doing so, they highlight the utility of ensemble models that incorporate common variation and biological sex for autism prediction.

Subjects: Biological sciences/Genetics; Health sciences/Biomarkers/Predictive markers; Biological sciences/Neuroscience

Introduction

Autism is heterogeneous and characterised by challenges with social skills, repetitive behaviours, and communication deficits. Autism is highly heritable, with a twin-based genetic heritability of 64–91% [1]. As with other neuropsychiatric and neurodevelopmental traits, common variants individually have a low impact but collectively contribute additively to the heritability of autism [2]. In contrast, rare variants have a high individual impact but a low overall effect on liability [3].

Autism has a global prevalence of 1%. However, the prevalence varies across regions and demographics [4]. Notably, roughly four times as many males as females are diagnosed [4, 5]. Different hypotheses have been proposed to explain the ‘female protective effect’.
These hypotheses include the liability threshold model, which suggests that females have a higher threshold for diagnosis [6]. The liability threshold is thought to be influenced by both genetic (e.g. sex chromosome and methylation differences) and non-genetic (e.g. diagnostic bias and hormones) factors [6]. Failure to account for the female protective effect during model development and deployment may introduce significant bias into research outcomes [7]. Conversely, including biological sex as a variable when predicting autism could help to elucidate the genetic contributions to the putative protective effect.

The use of genetic models to predict autism and other neuropsychiatric conditions has had mixed success [8, 9]. Early methods used polygenic risk scores (PRSs) as predictors in logistic regression [10, 11]. However, the last decade has witnessed a substantial increase in the application of artificial intelligence (AI) and machine learning techniques [12, 13]. Incorporating whole genome sequencing into prediction models has also improved results, in part through the inclusion of non-coding regions, which are recognised as having a role in determining traits [14, 15]. Finally, machine learning approaches have been applied to the genetics of neurodevelopmental conditions for diagnosis, condition subtyping, variant identification, and biomarker prioritisation [9].

The reported accuracy of predicting the probability of being autistic from genetics has ranged from 0.52 to 0.81 [8]. However, the majority of the models that have been used for prediction are limited by: 1) a lack of separate validation data [16], 2) the use of a small, homogeneous population [17], or 3) no breakdown of predictive performance by sex and limited reporting of results (e.g. accuracy only) [8].
In this study, we developed a robust ensemble model to predict the probability of being autistic using common genetic variation, leveraging the Autism Speaks’ MSSNG whole-genome sequencing dataset, which comprises more than 11,000 individuals [18]. Our approach identified key genetic variants and genes contributing to autism probability and their sex-dependent effects.

Materials and Methods

Data Preparation

Controlled access to the MSSNG database was applied for and approved by the MSSNG Database’s Data Access Committee (DACO-2020-04). MSSNG consists of whole-genome sequences from 5,102 autistic individuals (4,074 males, 1,028 females), 6,079 unaffected individuals (3,033 males, 3,046 females), and 131 autism-related individuals (61 males, 70 females) [18]. Of these, 1,732 individuals’ samples were sequenced on Complete Genomics (CG) platforms, and 9,580 were sequenced on Illumina HiSeqX platforms. Quality control was performed separately on the CG and Illumina datasets using PLINK (version 1.9) [19]. SNPs with a missing rate above 5% or a Hardy-Weinberg equilibrium (HWE) p-value below 10⁻⁶ were removed. Additionally, we filtered out: (1) individuals with a missing rate above 5%, (2) individuals with outlying homozygosity (more than three standard deviations [SD] from the mean level of homozygosity), (3) ancestry outliers (more than three SD away from the cluster mean of principal components 1 and 2), and (4) similar samples (identity by descent [IBD] above 0.5).

Feature Selection

The CG dataset was used for model training and testing. A 56:24:20 split (training: test 1: test 2) was applied to the CG dataset. The 56% training dataset was used to train 1) a model using significant SNPs, 2) a model using autism-associated genes, and 3) a model using PRS data (Fig. 1). The Illumina dataset was kept separate for validation purposes.
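The 56:24:20 partition can be illustrated with a short sketch. This is not the authors' code: the function name `three_way_split`, the fixed seed, and the use of a plain shuffle (rather than, say, a split stratified by diagnosis or sex) are assumptions made purely for illustration.

```python
import random


def three_way_split(sample_ids, fractions=(0.56, 0.24, 0.20), seed=0):
    """Shuffle sample IDs and cut them into train / test 1 / test 2 partitions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(sample_ids)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_test1 = round(fractions[1] * len(ids))
    train = ids[:n_train]
    test1 = ids[n_train:n_train + n_test1]
    test2 = ids[n_train + n_test1:]  # whatever remains (about 20%)
    return train, test1, test2


# Hypothetical sample identifiers 0..999
train, test1, test2 = three_way_split(range(1000))
print(len(train), len(test1), len(test2))
```

Note that the same proportions arise from an 80:20 split followed by a 70:30 split of the larger part, since 0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24.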
Model 1 – Fisher SNPs

The 56% training dataset underwent linkage disequilibrium (LD) pruning using PLINK’s --indep function [19] (window size of 50 variants, step size of 5 variants, variance inflation factor [VIF] threshold of 2). A Fisher exact test was then performed on the pruned dataset and identified 790 SNPs associated with autism with a p-value < 5 × 10⁻⁵ (Table S1). A feature table of these SNPs (encoded as 0, 1, or 2 for genotypes AA, Aa, and aa, respectively) was fed into a random forest model in PyCaret (version 3.3.2) [20]. Ten-fold cross-validation was performed for hyperparameter optimisation using the 56% training dataset. The model was then tested on the 24% test 1 dataset.

Model 2 – SFARI genes

A list of 1,037 genes associated with autism was downloaded (2024/03/28) from the SFARI database [21] (Table S2). The number of variants per gene was calculated for each individual and formed the feature table. Ten-fold cross-validation was performed for hyperparameter optimisation in PyCaret (version 3.3.2) on the 56% training dataset. The model was then tested on the 24% test 1 dataset.

Model 3 – PRS

A list of 40 SNPs associated with autism [2] was used to create a feature table (encoded as 0, 1, or 2 for genotypes AA, Aa, and aa, respectively), which was fed into a random forest model (Table S3). Ten-fold cross-validation was performed for hyperparameter optimisation in PyCaret (version 3.3.2) on the 56% training dataset. The model was then tested on the 24% test 1 dataset.

Ensemble Model Development

The prediction scores from the three models, alongside biological sex, were used to form a feature table (https://github.com/Catriona-Miller/autism_ml). A random forest model was trained on the 24% test 1 data (ten-fold cross-validation) and then tested on the 20% test 2 data. This model was then validated using the Illumina data.

Feature Contribution Analysis

Shapley additive explanations (SHAP) values were used to analyse how each model and the features within these models contributed to the overall prediction [22].
SHAP is a game-theoretic approach that calculates the contribution of each feature to the overall model prediction [23, 24]. The Python SHAP package (version 0.46) was used to calculate SHAP values for each model and the overall ensemble model [22].

Developmental Enrichment Analysis of SFARI Genes

A Mann-Whitney U test was performed to compare the number of mutations in the set of genes previously associated with autism (all SFARI genes) between autistic and neurotypical individuals, for females and males separately. Multiple testing was corrected for using the false discovery rate (Benjamini-Hochberg procedure). All genes that had a statistically significant difference (p-value < 0.05) in variants across their gene length between autistic and neurotypical individuals were referred to as male and/or female SFARI genes. RNA-Seq data from 42 individuals’ brains at five developmental stages (prenatal, infant, child, adolescent, adult) were obtained from the Allen Brain Atlas (ABA) and analysed to identify the brain regions and developmental time points in which expression of the male and female SFARI gene sets was enriched [25]. Briefly, ABAEnrichment (version 1.2.2) was employed to perform a developmental enrichment analysis [26]. Genes were annotated to a brain region if their expression was above the 0.8 cutoff [26]. A hypergeometric test was used to identify brain regions and developmental time points in which genes from the male or female SFARI gene list were over-represented among the annotated genes, compared to background control genes. A family-wise error rate (FWER) was calculated by comparing the enrichment against 1,000 random sets of equal size [26]. Brain regions with an FWER < 0.05 were deemed to be significantly enriched. Brain regions that were enriched across development were determined using ABA’s developmental effect score dataset, which contains gene age effect scores based on developmental gene expression changes [26].
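To make the enrichment test concrete, the sketch below computes a one-sided hypergeometric p-value for a single brain region and time point. It is an illustration in plain Python and not the ABAEnrichment implementation; the function name `hypergeom_upper_tail` and all gene counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: total genes considered (candidate plus background)
    K: genes annotated to the region (expressed above the cutoff)
    n: genes in the candidate set (e.g. a sex-specific gene set)
    k: candidate genes that are also annotated to the region
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of observing k, k+1, ..., upper annotated candidates.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 20 genes in total, 8 annotated to the region,
# 5 candidate genes, 4 of which are annotated.
p = hypergeom_upper_tail(k=4, K=8, n=5, N=20)
print(round(p, 4))
```

A permutation-based FWER, as described above, would repeat such a test on many random gene sets of equal size and compare the observed p-value with the smallest p-values obtained from those random sets.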
Clustering Analysis of PRS

To generate features for clustering, the PRS random forest model was rerun on the full CG dataset and SHAP values were calculated. A test dataset was not required as the goal was to analyse the clusters produced, not to use them for prediction. A Uniform Manifold Approximation and Projection (UMAP) was performed to reduce the SHAP value data to two dimensions. After visualisation, k-means clustering was used to produce six clusters. To analyse the autism PRS for each cluster, GWAS SNP odds ratios (ORs) were downloaded (https://www.ebi.ac.uk/gwas/efotraits/EFO_0003756; accessed 2024/09/26). PRS was calculated using PLINK’s --score function [19]. A pairwise Tukey test was used to compare the mean PRS z-scores between clusters [27].

Data Availability

Table S4 lists the datasets and software that were used in our analyses. All scripts are available on GitHub (https://github.com/Catriona-Miller/autism_ml).

Results

An ensemble model that incorporates biological sex has the greatest accuracy when predicting autism

Random forest models were trained on three independent feature sets (SFARI genes, PRS, and Fisher SNPs), with the Fisher SNPs model reaching the highest recall (0.86) and the PRS model reaching the highest accuracy (0.64) when applied to the CG test dataset (Table 1). By creating an ensemble model that combined these three models alongside sex, an accuracy of 0.68 and an AUC of 0.72 were achieved on the test dataset (Table 1, Fig. 2A). Recall (the proportion of autistic individuals correctly predicted to be autistic) was higher than precision (the proportion of individuals predicted to be autistic who are autistic) at 0.77 and 0.61, respectively (Fig. S1). When analysing predictions by sex, the model achieved accuracies of 0.69 (males) and 0.66 (females) (Fig. 2B-C). On the Illumina validation dataset (an independent, unseen dataset), the model achieved an accuracy of 0.63, an AUC of 0.66, and a recall of 0.79.
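As a reminder of how the metrics reported above relate to one another, the following sketch derives accuracy, precision, and recall from confusion-matrix counts. The counts are invented for illustration and are not taken from Table 1; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: autistic individuals predicted autistic
    fp: neurotypical individuals predicted autistic
    tn: neurotypical individuals predicted neurotypical
    fn: autistic individuals predicted neurotypical
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up counts for 100 hypothetical individuals
metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)
print({name: round(value, 3) for name, value in metrics.items()})
```

A model with recall above precision, as reported here, misses relatively few autistic individuals at the cost of more false positives.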
Common variants make a large, sex-dependent contribution to the model

The prediction scores from the three models are values between 0 and 1 that quantify the confidence the model has in the individual being neurotypical (from 0, certain, to 0.5, uncertain) or autistic (from 0.5, uncertain, to 1, certain). The three prediction scores (from the SFARI genes, PRS, and Fisher SNPs models) were incorporated into the ensemble model alongside sex and, from here on, will be referred to as features. SHAP scores were used to determine the contribution of the four features to the overall ensemble model. The Fisher SNPs model had the highest impact on the ensemble model predictions, followed by PRS, then sex (Fig. 3A). This is expected given that the Fisher SNPs model had the highest metric scores of the three individual models (Table 1).

We tested for a sex-dependent correlation between Fisher SNP prediction scores and impact on the final prediction (SHAP value) (Fig. 3B). We identified a clear relationship between sex and Fisher SNP prediction value in the model, with only males predicted to be autistic from the Fisher SNP model. For a male and a female with the same neurotypical Fisher SNP predictive value, the ensemble model places a higher weight on common variants for neurotypical males than females (Fig. 3B).

The Fisher SNPs model was retrained after balancing the number of males and females within the training dataset. Autistic males within the training dataset were randomly subsampled to match the number of autistic females. The balanced model had slightly lower scores across all metrics than the original model (AUC of 0.70, Table S5). This is likely due to the reduced population size. The Fisher SNPs model prediction values still made the greatest contribution to the balanced model. However, sex was the second largest contributor. The contribution of the PRS dropped substantially, which likely indicates that its contribution was largely associated with sex (Fig. 3C).
In addition, plotting the individual Fisher SNP prediction values against their model impacts (i.e. SHAP values) and colouring by sex highlights sex-dependent impacts of common variants on prediction (Fig. 3D). The balanced model retains greater confidence in male predictions than female predictions from common variants (Mann-Whitney U p-value < 0.001). Female SHAP scores tended toward zero while male SHAP scores tended toward larger values in both directions (negative: neurotypical, positive: autistic). These results demonstrate the model’s higher confidence in predicting autism probability for males than for females based on common genetic variants. Thus, despite the balanced sex distribution in our dataset, the model exhibits an enhanced discriminative capacity for male-specific autism-associated genetic factors.

Top contributing common variants included sex-specific and shared variants

Variant contributions to the balanced Fisher model were ranked according to their mean absolute SHAP score for females and males separately, and the top 10% of variants for each sex were selected (Fig. 4A). Most variants (67/89, 75%) in the top 10% were shared between males and females. The top 10 Fisher variants (Fig. 4B) and genes (Fig. 4C) that contributed to the prediction had similar mean absolute SHAP values in males and females. However, SHAP values for rs58741612 were significantly different between sexes (t-test p-value < 0.001). rs58741612 (MAF 0.20 in males and females; European: 0.19, African: 0.28, Asian: 0.125 [28]) has not previously been associated with neurodevelopment. By contrast, the top-ranked variant, chr11:10509093, falls within MTRNR2L8, which has been shown to be fivefold upregulated in Saudi autistic individuals [29]. The second-ranked variant (i.e. rs1443089352) is located within an intron of P4HA3, which was shown to be differentially expressed within mildly autistic individuals [30].
Notably, there were male-specific (11/89, 12%) and female-specific (11/89, 12%) loci (Fig. 4A). More sex-specific variants were observed in the original (imbalanced) model, likely reflecting a bias towards sex in the original Fisher model (Fig. S2). 16p11.2 was identified as a female-specific locus for autism in both the original and balanced models (t-test p-value < 0.001).

Male- and female-specific SFARI genes were significantly expressed during different brain developmental windows

We performed sex-stratified analyses comparing mutation rates in SFARI genes between autistic and neurotypical individuals. This revealed statistically significant differences in mutation rates in 35 genes specific to females, 455 genes specific to males, and 124 genes shared between both sexes (Fig. S3A, Table S6, Table S7). These genes map to different regions of the genome. Unsurprisingly, a large percentage of the female (9/31; 29%) and shared (10/48; 21%) gene sets were located on the X chromosome (Fig. S3B-D). Notably, both male-specific (male + shared; 579) and female-specific (female + shared; 159) sets demonstrated significantly lower (p < 0.001) loss-of-function observed/expected upper bound fraction (LOEUF) scores compared to control SFARI genes without differential mutation rates (Fig. S3E-F). These findings indicate that these sex-specific genetic contributors are under stronger evolutionary constraint, underscoring their importance in neurodevelopment.

The Allen Brain Atlas [25] was used to identify the enrichment of sex-specific gene sets within the brain across developmental timepoints. Expression of genes within the female-specific set was enriched in the primary somatosensory cortex, inferior parietal cortex, and the parietal neocortex during the fetal timepoint (FWER = 0.013, 0.014, and 0.010, respectively) (Fig. S3G).
By contrast, expression of genes within the male gene set was enriched (FWER < 0.05) in multiple brain regions at most developmental time points (fetal, infant, child, and adult). Significant developmental changes in gene expression (i.e. across all developmental windows) were observed for the male gene set within the dorsolateral prefrontal cortex and the anterior cingulate cortex (FWER = 0.048 and 0.006, respectively). By contrast, there were no significant developmental changes in expression in the female gene set (Fig. S3G).

Autism PRS scores can be used to cluster individuals

We tested whether it was possible to cluster autistic and neurotypical individuals according to their genotypes. A random forest model was trained on all individuals using only the PRS SNPs to predict autism. As the purpose was clustering rather than prediction, a separate test dataset was not required. SHAP values were used for clustering and a Uniform Manifold Approximation and Projection (UMAP) was performed to project the SHAP values into two dimensions (Fig. 5A). K-means clustering (k = 6) identified three autistic clusters (clusters 0, 3, and 5) and three neurotypical clusters (clusters 1, 2, and 4). The majority of the autistic individuals (482/613, 79%) were present in cluster 0. There was no statistically significant difference in the female-to-male ratio between the clusters, indicating that biological sex was not a discriminating factor in the clustering model (Fig. 5B).

Comparing the PRS values for all individuals, there was no statistically significant difference in the distribution between the autistic and neurotypical groups (Fig. 5C). This is likely because the neurotypical group comprises family members of the autistic individuals, who share significant genomic information. However, comparisons of the PRS scores by cluster identified significant differences (pairwise Tukey p < 0.001) between clusters (Fig. 5D).
Notably, cluster 3 (autistic) had the highest average PRS.

To understand the genetic features responsible for separating the clusters, we compared the presence of each SNP between each autistic cluster and all neurotypical individuals. As was observed for overall PRS scores, no SNPs showed a statistically significant difference in presence between all autistic and neurotypical individuals. By contrast, within cluster 0, SNPs rs112635299 and rs11787216 had a statistically significant difference in distribution compared to all neurotypical individuals (chi-squared, adjusted p-values < 0.001). SNP rs112635299 was statistically significantly associated with autistic individuals in cluster 3. SNP rs11787216 was significantly associated (chi-squared, adjusted p-values < 0.001) with autistic individuals in cluster 5. These findings suggest that while no individual SNPs distinguish autistic individuals as a whole from neurotypical individuals, specific genetic variations may be relevant within certain autistic subgroups, highlighting the genetic heterogeneity within the autistic population.

Discussion

Our ensemble machine learning approach predicted autism with an accuracy of 0.68 and an AUC of 0.72. The ensemble model combined biological sex with three random forest models trained on: 1) statistically significant SNPs associated with autism in the training dataset (Fisher SNPs), 2) autism PRS SNPs [2], and 3) autism-associated genes (SFARI genes listed in Table S2). By analysing the decisions the ensemble model made, we identified genes and variants that are influential for autism prediction. These included rs58741612, rs1443089352, MTRNR2L8, and the 16p11 locus. Collectively, these common variants contributed to prediction with greater confidence in males than in females (accuracy = 0.69 and 0.66, respectively).
Expression of the SFARI gene sets that contributed to this sex-specificity was enriched at different developmental timepoints and in different brain regions in males and females. These insights pave the way for more targeted and sex-informed approaches to understanding autism genetics.

A strength of this study is its use of the comprehensive MSSNG whole-genome sequencing database [18]. However, several limitations need to be considered when interpreting our findings. First, the MSSNG dataset is largely European (75%), which may limit model generalisability to other ancestry groups. Second, while we retained the individuals who were sequenced using the Illumina platform as an independent validation dataset, a completely separately curated WGS cohort is required to fully test the model’s applicability to other cohorts. Third, our model’s exclusive focus on common genetic variation overlooks the contribution of pathogenic structural variants (SVs), which have been detected in 6% of autistic individuals within the MSSNG database [18]. Finally, SHAP values can only explain why the model predicted individuals as autistic or neurotypical, not necessarily the ‘true’ biological causality.

Our ensemble model is more confident in predictions for males than females, even after adjusting to reduce male bias. Our models identified common and X-chromosome-based variation as being important for sex-specific prediction (Fig. 4A). Notably, a locus on the X chromosome has previously been associated with autism and decreased levels of maternal serum sex-hormone-binding globulin in males but not females [31]. Our finding that common variants contribute more to male autism extends these earlier observations. It remains possible that the increased confidence in male predictions could be explained by dataset bias; specifically, females are less likely to be diagnosed [32]. However, the use of a balanced dataset (male:female) in our ensemble model argues against this.
Therefore, we contend that our results support the hypothesis that there is a biological role for sex hormones in autism [31]. These hypotheses should be tested further using larger, sex-balanced autism datasets.

Social behaviour difficulties involving the prefrontal cortex manifest differently in males and females [33, 34]. Autistic females tend to have more subtle and less obvious difficulties in social communication than males and are better at adapting behaviours to conform to societal norms [35]. The expression of the female-specific SFARI gene set showed no significant developmental changes in any brain region (i.e. over five timepoints from fetal to adulthood). By contrast, expression of the male-specific SFARI genes showed significant changes across development (samples from 8 pcw to 40 years) in the dorsolateral prefrontal cortex (dPFC) and the medial prefrontal cortex (mPFC). The dPFC and mPFC regions have been associated with autism in males. The mPFC is important for social behaviour, and circuitry changes in this region have been identified in both autism clinical studies and rodent models with majority-male [36, 37] or entirely male cohorts [37, 38]. Notably, we identified SHANK3 as associated with autism in males but not females. SHANK3 mutations have been shown to impact mPFC connectivity and social deficits in male but not female mice [39]. Our findings of sex-dependent SFARI gene expression in developmental windows within the brain are consistent with the existence of different mechanisms for the development of autism in males and females. Importantly, this idea is not new, as animal- and MRI-based studies indicate the existence of brain-based sex differences in autism [40, 41].

Many studies analyse the genetic contributions to autism using autistic individuals and their family members [42–44].
Yet differences in polygenic risk scores between autistic individuals and their neurotypical family members are expected to be minimal, because relatives share much of their genomic information. Clustering weighted PRS SNPs (encoded as 0, 1, or 2 for genotypes AA, Aa, and aa, respectively) overcomes this limitation. By clustering weighted PRS scores, we identified SNPs rs112635299 (associated with coronary artery disease and bronchodilation [45]) and rs11787216 (associated with educational attainment [2]) as being responsible for the greatest discriminatory difference between the autistic and neurotypical clusters. Therefore, we hypothesise that the individuals within these clusters are more likely to have heart-related and education-related co-occurring traits, respectively. Notably, epidemiological studies suggest that individuals with congenital heart disease (CHD) have approximately double the odds of developing autism (1.99; 95% CI 1.77–2.24) compared with those without CHD [46].

The 16p11.2 locus is important for autism prediction within the MSSNG dataset, particularly for females (p-value < 0.001). The 16p11.2 region falls within a known autism-associated copy number variant (CNV) that has sex-specific phenotypes in mouse models [47–52]. Specifically, male mouse models with 16p11.2 deletions are more likely to experience hyperactivity, sleep disturbances, and reduced brain sizes compared to females with the same mutation [53, 54]. By contrast, female mouse models exhibit increased stress-induced anxiety and excitability in their cortical neurons [49, 55]. This was hypothesised to be related to increased excitatory synaptic drive in central amygdala neurons, potentially due to lipid breakdown, impacting a central amygdala to globus pallidus externa (GPe) pathway that has a role in fear learning. However, it remains unclear what causes the sex-specific difference. Future studies on the role of the 16p11.2 locus in autism should focus on identifying the sex-specific molecular impacts.
This would provide insights that could improve understanding of the development of autism in males versus females, potentially with significant impacts on diagnosis.

In conclusion, this study demonstrated the utility of an ensemble machine learning model trained on common genetic variants for autism prediction. The identification of distinct male and female genetic signatures, with divergent spatiotemporal expression patterns and developmental trajectories, provides compelling evidence for sex-specific etiological pathways in autism development. Future work should explore the mechanisms underlying the sex-specific effects of the 16p11 locus.

Declarations

Acknowledgements

We would like to thank the Genomics and Systems Biology Group (Liggins Institute, University of Auckland) for their insightful suggestions and discussions. Data used in this work come from the Autism Speaks’ MSSNG dataset. CM was funded by the University of Auckland Doctoral Scholarship.

Conflicts of interest

The authors have no competing interests to declare.

Supplementary information is available at MP’s website.

References

1. Tick B, Bolton P, Happé F, Rutter M, Rijsdijk F. Heritability of autism spectrum disorders: A meta-analysis of twin studies. J Child Psychol Psychiatry. 2016;57:585–595.
2. Grove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 2019;51:431–444.
3. Gaugler T, Klei L, Sanders SJ, Bodea CA, Goldberg AP, Lee AB, et al. Most genetic risk for autism resides with common variation. Nat Genet. 2014;46:881–885.
4. Zeidan J, Fombonne E, Scorah J, Ibrahim A, Durkin MS, Saxena S, et al. Global prevalence of autism: A systematic review update. Autism Research. 2022;15:778–790.
5. Bougeard C, Picarel-Blanchot F, Schmid R, Campbell R, Buitelaar J. Prevalence of Autism Spectrum Disorder and Co-morbidities in Children and Adolescents: A Systematic Literature Review. Front Psychiatry. 2021;12.
6. Dougherty JD, Marrus N, Maloney SE, Yip B, Sandin S, Turner TN, et al. Can the “female protective effect” liability threshold model explain sex differences in autism spectrum disorder? Neuron. 2022;110:3243–3262.
7. Cirillo D, Catuara-Solarz S, Morey C, Guney E, Subirats L, Mellino S, et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit Med. 2020;3:81.
8. Bracher-Smith M, Crawford K, Escott-Price V. Machine learning for genetic prediction of psychiatric disorders: a systematic review. Mol Psychiatry. 2021;26:70–79.
9. Gupta C, Chandrashekar P, Jin T, He C, Khullar S, Chang Q, et al. Bringing machine learning to research on intellectual and developmental disabilities: taking inspiration from neurological diseases. J Neurodev Disord. 2022;14.
10. Jansen AG, Dieleman GC, Jansen PR, Verhulst FC, Posthuma D, Polderman TJC. Psychiatric Polygenic Risk Scores as Predictor for Attention Deficit/Hyperactivity Disorder and Autism Spectrum Disorder in a Clinical Child and Adolescent Sample. Behav Genet. 2020;50:203–212.
11. Benca CE, Derringer JL, Corley RP, Young SE, Keller MC, Hewitt JK, et al. Predicting Cognitive Executive Functioning with Polygenic Risk Scores for Psychiatric Disorders. Behav Genet. 2017;47:11–24.
12. Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet. 2023;24:125–137.
13. Manduchi E, Romano JD, Moore JH. The promise of automated machine learning for the genetic analysis of complex traits. Hum Genet. 2022;141:1529–1544.
14. Rosenquist R, Cuppen E, Buettner R, Caldas C, Dreau H, Elemento O, et al. Clinical utility of whole-genome sequencing in precision oncology. Semin Cancer Biol. 2022;84:32–39.
15. Elkon R, Agami R. Characterization of noncoding regulatory DNA in the human genome. Nat Biotechnol. 2017;35:732–746.
16. Ghafouri-Fard S, Taheri M, Omrani MD, Daaee A, Mohammad-Rahimi H, Kazazi H. Application of Single-Nucleotide Polymorphisms in the Diagnosis of Autism Spectrum Disorders: A Preliminary Study with Artificial Neural Networks. Journal of Molecular Neuroscience. 2019;68:515–521.
17. Wang D, Liu S, Warrell J, Won H, Shi X, Navarro FCP, et al. Comprehensive functional genomic resource and integrative model for the human brain. Science. 2018;362.
18. Trost B, Thiruvahindrapuram B, Chan AJS, Engchuan W, Higginbotham EJ, Howe JL, et al. Genomic architecture of autism from comprehensive whole-genome sequence annotation. Cell. 2022;185:4409–4427.e18.
19. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575.
20. Ali M. PyCaret: An open source, low-code machine learning library in Python. https://www.pycaret.org. 2020.
21. Abrahams BS, Arking DE, Campbell DB, Mefford HC, Morrow EM, Weiss LA, et al. SFARI Gene 2.0: a community-driven knowledgebase for the autism spectrum disorders (ASDs). Mol Autism. 2013;4:36.
22. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2:56–67.
23. Shapley LS. 17. A Value for n-Person Games. Contributions to the Theory of Games (AM-28), Volume II, Princeton University Press; 1953. p. 307–318.
24. Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. Adv Neural Inf Process Syst. 2017;30:4765–4774.
25. Sunkin SM, Ng L, Lau C, Dolbeare T, Gilbert TL, Thompson CL, et al. Allen Brain Atlas: an integrated spatio-temporal portal for exploring the central nervous system. Nucleic Acids Res. 2012;41:D996–D1008.
26. Grote S, Prüfer K, Kelso J, Dannemann M. ABAEnrichment: An R package to test for gene set expression enrichment in the adult and developing human brain. Bioinformatics. 2016;32:3201–3203.
27. Tukey JW. Comparing Individual Means in the Analysis of Variance. Biometrics. 1949;5:99.
28. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443.
29. Almandil NB, AlSulaiman A, Aldakeel SA, Alkuroud DN, Aljofi HE, Alzahrani S, et al. Integration of Transcriptome and Exome Genotyping Identifies Significant Variants with Autism Spectrum Disorder. Pharmaceuticals. 2022;15:158.
30. Lee EC, Hu VW. Phenotypic Subtyping and Re-Analysis of Existing Methylation Data from Autistic Probands in Simplex Families Reveal ASD Subtype-Associated Differentially Methylated Genes and Biological Functions. Int J Mol Sci. 2020;21:6877.
31. Mendes M, Chen DZ, Engchuan W, Leal TP, Thiruvahindrapuram B, Trost B, et al. Chromosome X-wide common variant association study in autism spectrum disorder. The American Journal of Human Genetics. 2025;112:135–153.
32. Hull L, Mandy W. Protective Effect or Missed Diagnosis? Females with Autism Spectrum Disorder. Future Neurol. 2017;12:159–169.
33. Leow KQ, Tonta MA, Lu J, Coleman HA, Parkington HC. Towards understanding sex differences in autism spectrum disorders. Brain Res. 2024;1833:148877.
34. Laine MA, Greiner EM, Shansky RM. Sex differences in the rodent medial prefrontal cortex – What Do and Don’t we know? Neuropharmacology. 2024;248:109867.
35. Ratto AB, Kenworthy L, Yerys BE, Bascom J, Wieckowski AT, White SW, et al. What About the Girls? Sex-Based Differences in Autistic Traits and Adaptive Skills. J Autism Dev Disord. 2018;48:1698–1711.
36. Ibrahim K, Soorya L V., Halpern DB, Gorenstein M, Siper PM, Wang AT. Social cognitive skills groups increase medial prefrontal cortex activity in children with autism spectrum disorder. Autism Research. 2021;14:2495–2511.
37. Mediane DH, Basu S, Cahill EN, Anastasiades PG. Medial prefrontal cortex circuitry and social behaviour in autism. Neuropharmacology. 2024;260:110101.
38. Li L, He C, Jian T, Guo X, Xiao J, Li Y, et al. Attenuated link between the medial prefrontal cortex and the amygdala in children with autism spectrum disorder: Evidence from effective connectivity within the “social brain”. Prog Neuropsychopharmacol Biol Psychiatry. 2021;111:110147.
39. Kim S, Kim Y-E, Song I, Ujihara Y, Kim N, Jiang Y-H, et al. Neural circuit pathology driven by Shank3 mutation disrupts social behaviors. Cell Rep. 2022;39:110906.
40. Lai M, Lerch JP, Floris DL, Ruigrok ANV, Pohl A, Lombardo M V., et al. Imaging sex/gender and autism in the brain: Etiological implications. J Neurosci Res. 2017;95:380–397.
41. Walsh MJM, Wallace GL, Gallegos SM, Braden BB. Brain-based sex differences in autism spectrum disorder across the lifespan: A systematic review of structural MRI, fMRI, and DTI findings. Neuroimage Clin. 2021;31:102719.
42. Klei L, McClain LL, Mahjani B, Panayidou K, De Rubeis S, Grahnat A-CS, et al. How rare and common risk variation jointly affect liability for autism spectrum disorder. Mol Autism. 2021;12:66.
43. Schendel D, Munk Laursen T, Albiñana C, Vilhjalmsson B, Ladd-Acosta C, Fallin MD, et al. Evaluating the interrelations between the autism polygenic score and psychiatric family history in risk for autism. Autism Research. 2022;15:171–182.
44. LaBianca S, LaBianca J, Pagsberg AK, Jakobsen KD, Appadurai V, Buil A, et al. Copy Number Variants and Polygenic Risk Scores Predict Need of Care in Autism and/or ADHD Families. J Autism Dev Disord. 2021;51:276–285.
45. Aherrahrou R, Reinberger T, Hashmi S, Erdmann J. GWAS breakthroughs: mapping the journey from one locus to 393 significant coronary artery disease associations. Cardiovasc Res. 2024;120:1508–1530.
46. Gu S, Katyal A, Zhang Q, Chung W, Franciosi S, Sanatani S. The Association Between Congenital Heart Disease and Autism Spectrum Disorder: A Systematic Review and Meta-Analysis. Pediatr Cardiol. 2023;44:1092–1107.
47. Lynch JF, Ferri SL, Angelakos C, Schoch H, Nickl-Jockschat T, Gonzalez A, et al. Comprehensive Behavioral Phenotyping of a 16p11.2 Del Mouse Model for Neurodevelopmental Disorders. Autism Research. 2020;13:1670–1684.
48. Agarwalla S, Arroyo NS, Long NE, O’Brien WT, Abel T, Bandyopadhyay S. Male-specific alterations in structure of isolation call sequences of mouse pups with 16p11.2 deletion. Genes Brain Behav. 2020;19.
49. Giovanniello J, Ahrens S, Yu K, Li B. Sex-Specific Stress-Related Behavioral Phenotypes and Central Amygdala Dysfunction in a Mouse Model of 16p11.2 Microdeletion. Biological Psychiatry Global Open Science. 2021;1:59–69.
50. Kim J, Koo B-K, Knoblich JA. Human organoids: model systems for human biology and medicine. Nat Rev Mol Cell Biol. 2020;21:571–584.
51. Niarchou M, Chawner SJRA, Doherty JL, Maillard AM, Jacquemont S, Chung WK, et al. Psychiatric disorders in children with 16p11.2 deletion and duplication. Transl Psychiatry. 2019;9:8.
52. Rein B, Yan Z. 16p11.2 Copy Number Variations and Neurodevelopmental Disorders. Trends Neurosci. 2020;43:886–901.
53. Angelakos CC, Watson AJ, O’Brien WT, Krainock KS, Nickl-Jockschat T, Abel T. Hyperactivity and male-specific sleep deficits in the 16p11.2 deletion mouse model of autism. Autism Research. 2017;10:572–584.
54. Kretz PF, Wagner C, Mikhaleva A, Montillot C, Hugel S, Morella I, et al. Dissecting the autism-associated 16p11.2 locus identifies multiple drivers in neuroanatomical phenotypes and unveils a male-specific role for the major vault protein. Genome Biol. 2023;24:261.
55. Tomasello DL, Kim JL, Khodour Y, McCammon JM, Mitalipova M, Jaenisch R, et al. 16pdel lipid changes in iPSC-derived neurons and function of FAM57B in lipid metabolism and synaptogenesis. iScience. 2022;25:103551.

Tables
Table 1 is available in the Supplementary Files section.

Additional Declarations
The authors have declared there is NO conflict of interest to disclose.

Supplementary Files
FigS1.tif
Supplementary Figure S1: The ensemble model achieves a recall of 0.77 and a precision of 0.61.
Recall is the proportion of autistic individuals (TP + FN; 136) who are correctly predicted to be autistic (TP; 105). Precision is the proportion of individuals predicted to be autistic (TP + FP; 171) who are autistic (TP; 105).
FigS2.tif
Supplementary Figure S2: Common variants before the model was balanced were largely sex-specific. Phenogram showing the top 10% of common variants for prediction in the original, unbalanced model. Blue dots were only in the top 10% of variants for females, green dots were only in the top 10% of variants for males, and red dots were in the top 10% of variants for both males and females.
FigS3.tif
Supplementary Figure S3: SFARI genes that have a significant difference in mutation count between autistic and neurotypical individuals. A: Venn diagram showing the number of SFARI genes that are significantly more/less mutated in autistic individuals than in neurotypical individuals, for males and females. B-D: Pie charts showing the chromosomes containing the genes that had significantly different mutation rates between neurotypical and autistic individuals in females (B), both males and females (C), and males (D). E&F: LOEUF scores of SFARI genes that had significantly different numbers of mutations in males (F) and females (G) compared to those that were insignificant for males and females. p values < 0.001. G: The areas of the brain that were significantly enriched in the gene expression of the female genes (n=159) or male genes (n=579) at different developmental timepoints, from the Allen Brain Atlas. DFC and MFC have significant developmental changes in gene expression for the male gene list. M1C: primary motor cortex, CBC: cerebellar cortex, OFC: orbital frontal cortex, IPC: posteroventral (inferior) parietal cortex, VFC: ventrolateral prefrontal cortex, STC: posterior (caudal) superior temporal cortex, S1C: primary somatosensory cortex, PCx: parietal neocortex, DFC: dorsolateral prefrontal cortex, MFC: anterior (rostral) cingulate (medial prefrontal) cortex.
Table1.xlsx
Table 1: The ensemble model incorporating biological sex outperforms sub-models for predicting autism probability. Metrics are presented for the models’ predictive performance on the 20% test set of the Complete Genomics data. The final row shows the ensemble model applied to the Illumina validation dataset. a, AUC (area under the curve) represents the probability that a randomly chosen positive (autistic) instance ranks higher than a randomly chosen negative (neurotypical) instance; b, recall is the percentage of autistic individuals correctly predicted to be autistic; c, precision is the percentage of individuals predicted to be autistic who are autistic; and d, F1 is the harmonic mean of recall (b) and precision (c).
SupplementaryTables.docx
Supplementary Figure Legends
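The recall, precision, and F1 definitions given in the legends above can be checked against the counts reported for Supplementary Figure S1. The following is a minimal sketch in Python, not the authors' code: the function name `classification_metrics` is illustrative, and the false negative and false positive counts are derived here as 136 − 105 and 171 − 105 from the reported totals.

```python
def classification_metrics(tp: int, fn: int, fp: int) -> dict:
    """Return recall, precision, and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)      # share of autistic individuals predicted autistic
    precision = tp / (tp + fp)   # share of predicted-autistic individuals who are autistic
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"recall": recall, "precision": precision, "f1": f1}


# Counts from Supplementary Figure S1: TP = 105, TP + FN = 136, TP + FP = 171.
# FN and FP are derived from those totals (an assumption of this sketch).
metrics = classification_metrics(tp=105, fn=136 - 105, fp=171 - 105)
print({name: round(value, 2) for name, value in metrics.items()})
```

Rounded to two decimals, the recall and precision computed this way agree with the 0.77 and 0.61 stated in the Supplementary Figure S1 legend.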
Briefly, ABAEnrichment (version 1.2.2) was employed to perform a developmental enrichment analysis [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Genes were annotated to a brain region if the expression was above the 0.8 cutoff [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. A hypergeometric test was used to identify brain regions and developmental time points where the annotated gene list was enriched in the male or female SFARI gene list, compared to background control genes. A family-wise error rate (FWER) was calculated by comparing the enrichment against 1000 random sets of equal size [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Brain regions with an FWER\u0026thinsp;\u0026lt;\u0026thinsp;0.05 were deemed to be significantly enriched. Brain regions that were enriched across development were determined using ABA\u0026rsquo;s developmental effect score dataset which contains gene age effect scores based on developmental gene expression changes [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eClustering Analysis of PRS\u003c/h2\u003e \u003cp\u003eSHAP values were generated to use as features for clustering by rerunning the PRS random forest model on the full CG dataset. A test dataset was not required as the goal was to analyse the clusters produced and not use them for prediction. A Uniform Manifold Approximation and Projection (UMAP) was performed to reduce the SHAP value data to two dimensions. After visualisation, k-means clustering was used to produce six clusters. 
To analyse the autism PRS for each cluster, GWAS SNP odds ratios (ORs) were downloaded (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ebi.ac.uk/gwas/efotraits/EFO_0003756\u003c/span\u003e\u003cspan address=\"https://www.ebi.ac.uk/gwas/efotraits/EFO_0003756\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e, accessed 2024/09/26). The PRS was calculated using PLINK\u0026rsquo;s \u0026lsquo;score\u0026rsquo; function [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. A pairwise Tukey test was used to compare the mean PRS z-scores between clusters [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eData Availability\u003c/h2\u003e \u003cp\u003eTable \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e lists the datasets and software that were used in our analyses.\u003c/p\u003e \u003cp\u003eAll scripts are available on GitHub (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/Catriona-Miller/autism_ml\u003c/span\u003e\u003cspan address=\"https://github.com/Catriona-Miller/autism_ml\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eAn ensemble model that incorporates biological sex has the greatest accuracy when predicting autism\u003c/h2\u003e \u003cp\u003eRandom forest models were trained on three independent feature sets (SFARI genes, PRS, and Fisher SNPs), with the Fisher SNPs model reaching the highest recall (0.86) and the PRS model reaching the highest accuracy (0.64) when applied to the CG test dataset (Table\u0026nbsp;1). 
By creating an ensemble model that combined these three models alongside sex, an accuracy of 0.68 and an AUC of 0.72 were achieved on the test dataset (Table\u0026nbsp;1, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). Recall (the percentage of autistic individuals correctly predicted to be autistic) was higher than precision (the percentage of individuals predicted to be autistic who are autistic) at 0.77 and 0.61, respectively (Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). When analysing predictions by sex, the model achieved accuracies of 0.69 (M) and 0.66 (F) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB-C). On the Illumina validation dataset (an independent, unseen dataset), the model achieved an accuracy of 0.63, AUC of 0.66, and recall of 0.79.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eCommon variants make a large, sex-dependent contribution to the model\u003c/h2\u003e \u003cp\u003eThe prediction scores from the three models are values between 0\u0026ndash;1, quantifying the confidence the model has in the individual being neurotypical (0: certain to 0.5 uncertain) or autistic (0.5 uncertain to 1 certain). The three prediction scores (from the a) SFARI genes, b) PRS, and c) Fisher SNPs models) were incorporated into the ensemble model alongside sex and, from here on, will be referred to as features.\u003c/p\u003e \u003cp\u003eSHAP scores were used to determine the contribution of the four features to the overall ensemble model. The Fisher SNPs model had the highest impact on the ensemble model predictions, followed by PRS, then sex (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). This is expected given that the Fisher SNPs model had the highest metric scores of the three individual models (Table\u0026nbsp;1). 
We tested for a sex-dependent correlation between Fisher SNP prediction scores and impact on the final prediction (SHAP value) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). We identified a clear relationship between sex and Fisher SNP prediction value in the model, with only males predicted to be autistic from the Fisher SNP model. For a male and a female with the same neurotypical Fisher SNP predictive value, the ensemble model places a higher weight on common variants for neurotypical males than females (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe Fisher SNPs model was retrained after balancing the number of males and females within the training dataset. Autistic males within the training dataset were randomly subsampled to match the number of autistic females. The balanced model had slightly lower scores across all metrics than the original model (AUC of 0.70, Table S5). This is likely due to the reduced population size. The Fisher SNPs model prediction values still made the greatest contribution to the balanced model. However, sex was the second largest contributor. The contribution of the PRS dropped significantly, likely suggesting that its contribution was largely associated with sex (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). In addition, plotting the individual Fisher SNP prediction values against their model impacts (i.e. SHAP values) and colouring by sex, highlights sex-dependent impacts of common variants on prediction (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD). The balanced model retains greater confidence in male predictions than female predictions from common variants (Mann-Whitney U p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.001). 
Female SHAP scores tended toward zero while male SHAP scores tended toward larger values in both directions (negative: neurotypical, positive: autistic). These results demonstrate that the model predicts autism probability from common genetic variants with higher confidence for males than for females. Thus, despite the balanced sex distribution in our dataset, the model exhibits an enhanced discriminative capacity for male-specific autism-associated genetic factors.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eTop contributing common variants included sex-specific and shared variants\u003c/h2\u003e \u003cp\u003eVariant contributions to the balanced Fisher model were ranked according to their mean absolute SHAP score for females and males separately, and the top 10% of variants for each sex were selected (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). Most variants (67/89, 75%) in the top 10% were shared between males and females.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe top 10 Fisher variants (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB) and genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC) that contributed to the prediction had similar mean absolute SHAP values in males and females. However, SHAP values for rs58741612 were significantly different between sexes (t-test p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.001). rs58741612 (MAF 0.20 in males and females; European: 0.19, African: 0.28, Asian: 0.125; Karczewski et al., 2020) has not previously been associated with neurodevelopment. By contrast, the top-ranked variant, chr11:10509093, falls within \u003cem\u003eMTRNR2L8\u003c/em\u003e, which has been shown to be fivefold upregulated in Saudi autistic individuals [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. 
The second-ranked variant (i.e. rs1443089352) is located within an intron of \u003cem\u003eP4HA3\u003c/em\u003e, which was shown to be differentially expressed within mildly autistic individuals [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eNotably, there were male- (11/89, 12%) and female-specific (11/89, 12%) loci (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). More sex-specific variants were observed in the original (imbalanced) model, likely reflecting a bias towards sex in the original Fisher model (Fig. \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). 16p11.2 was identified as a female specific locus for autism in both the original and balanced models (t-test p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eMale- and female-specific SFARI genes were significantly expressed during different brain developmental windows\u003c/h2\u003e \u003cp\u003eWe performed sex-stratified analyses comparing mutation rates in SFARI genes between autistic and neurotypical individuals across males and females separately. This revealed statistically significant differences in mutation rates in 35 genes specific to females, 455 genes specific to males, and 124 genes shared between both sexes (Fig. \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003eA, Table S6, Table S7). These genes are mapped across different regions of the genome. Unsurprisingly, a large percentage of the female (9/31; 29%) and shared (10/48; 21%) gene sets were located on the X chromosome (Fig. 
\u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003eB-D).\u003c/p\u003e \u003cp\u003eNotably, both male- (male\u0026thinsp;+\u0026thinsp;shared; 579) and female-specific (female\u0026thinsp;+\u0026thinsp;shared; 159) sets demonstrated significantly lower (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) loss-of-function observed/expected upper bound fraction (LOEUF) scores compared to control SFARI genes without differential mutation rates (Fig. \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003eE-F). These findings indicate that these sex-specific genetic contributors are under stronger evolutionary constraint, underscoring their importance in neurodevelopment.\u003c/p\u003e \u003cp\u003eThe Allen Brain Atlas [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] was used to identify the enrichment of sex-specific gene sets within the brain across developmental timepoints. Expression of genes within the female-specific set was enriched in the primary somatosensory cortex, inferior parietal cortex, and the parietal neocortex during the fetal timepoint (FWER\u0026thinsp;=\u0026thinsp;0.013, 0.014, 0.010 respectively) (Fig. \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003eG). By contrast, expression of genes within the male gene set was enriched (FWER\u0026thinsp;\u0026lt;\u0026thinsp;0.05) in multiple brain regions at most developmental time points (fetal, infant, child, and adult).\u003c/p\u003e \u003cp\u003eSignificant developmental changes in gene expression (i.e. across all developmental windows) were observed for the expression of the male gene set within the dorsolateral prefrontal cortex and the anterior cingulate cortex (FWER\u0026thinsp;=\u0026thinsp;0.048 and 0.006, respectively). By contrast, there were no significant developmental changes in expression in the female gene set (Fig. 
\u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003eG).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eAutism PRS scores can be used to cluster individuals\u003c/h2\u003e \u003cp\u003eWe tested whether it was possible to cluster autistic and neurotypical individuals according to their genotypes. A random forest model was trained on all individuals using only the PRS SNPs to predict autism. As the purpose was clustering rather than prediction, a separate test dataset was not required. SHAP values were used for clustering and a Uniform Manifold Approximation and Projection (UMAP) was performed to project the SHAP values into two dimensions (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). K-means clustering (k\u0026thinsp;=\u0026thinsp;6) identified three autistic clusters (clusters 0, 3, and 5) and three neurotypical clusters (clusters 1, 2, and 4). The majority of the autistic individuals (482/613, 79%) were present in cluster 0. There was no statistically significant difference in the female-to-male ratio between the clusters, indicating that biological sex was not a discriminating factor in the clustering model (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWhen the PRS values were compared across all individuals, there was no statistically significant difference in the distribution between the autistic and neurotypical groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC). This is likely because the neurotypical group comprises family members of the autistic individuals, who share substantial genomic information. 
However, comparisons of the PRS scores by clusters identified significant differences (pairwise Tukey p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) between clusters (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). Notably, cluster 3 (autistic) had the highest average PRS.\u003c/p\u003e \u003cp\u003eTo understand the genetic features responsible for separating the clusters, we compared the presence of each SNP between each autistic cluster and all neurotypical individuals. As was observed for overall PRS scores, no SNPs showed a statistically significant difference in presence between all autistic and neurotypical individuals. By contrast, within cluster 0, SNPs rs112635299 and rs11787216 had a statistically significant difference in distribution compared to all neurotypical individuals (chi-squared, adjusted p-values\u0026thinsp;\u0026lt;\u0026thinsp;0.001). SNP rs112635299 was statistically significantly associated with autistic individuals in cluster 3. SNP rs11787216 was significantly associated (chi-squared, adjusted p-values\u0026thinsp;\u0026lt;\u0026thinsp;0.001) with autistic individuals in cluster 5. These findings suggest that while no individual SNPs distinguish autistic individuals as a whole from neurotypical individuals, specific genetic variations may be relevant within certain autistic subgroups, highlighting the genetic heterogeneity within the autistic population.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eOur ensemble machine learning approach successfully predicted autism with an accuracy of 0.68 and an AUC of 0.72. 
The ensemble model was composed of three random forest models trained on: 1) statistically significant SNPs associated with autism in the training dataset (Fisher SNPs), 2) autism PRS SNPs [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], and 3) autism associated genes (SFARI genes listed in Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e), and biological sex. By analysing the decisions the ensemble model made, we identified genes and variants that are influential for autism prediction. These variants included rs58741612, rs1443089352, \u003cem\u003eMTRNR2L8\u003c/em\u003e, and the 16p11 locus. Collectively, these common variants contributed to prediction with greater confidence in males compared to females (accuracy\u0026thinsp;=\u0026thinsp;0.69 and 0.66 respectively). Expression of the SFARI gene sets that contributed to this sex-specificity was enriched at different developmental timepoints and in different brain regions in males and females. These insights pave the way for more targeted and sex-informed approaches to understanding autism genetics.\u003c/p\u003e \u003cp\u003eThe strengths of this study lie in leveraging the comprehensive MSSNG whole-genome sequencing database [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. However, several limitations need to be considered when interpreting our findings. Firstly, the MSSNG dataset is largely European (75%) which may limit model generalisability to other ancestry groups. Second, while we retained the individuals who were sequenced using the Illumina platform as an independent validation dataset, a completely separately curated WGS cohort is required to test the model\u0026rsquo;s applicability to other cohorts fully. 
Third, our model\u0026rsquo;s exclusive focus on common genetic variation overlooks the contribution of pathogenic structural variants (SVs) which have been detected in 6% of autistic individuals within the MSSNG database [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Finally, SHAP values can only explain why the model predicted individuals as autistic or neurotypical, not necessarily the \u0026lsquo;true\u0026rsquo; biological causality.\u003c/p\u003e \u003cp\u003eOur ensemble model is more confident in predictions for males than females, even after adjusting to reduce male bias. Our models identified common and X-chromosome based variation as being important for sex-specific prediction (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). Notably, a locus on the X chromosome has previously been associated with autism and decreased levels of maternal serum sex-hormone-binding globulin in males \u0026ndash; not females [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Our findings that common variants contribute more to male autism expand these earlier observations. It remains possible that the increased confidence in male predictions could be explained by dataset bias; specifically, females are less likely to be diagnosed [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. However, the use of a balanced dataset (male:female) in our ensemble model argues against this. Therefore, we contend that our results support the hypothesis that there is a biological role for sex hormones in autism [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. 
These hypotheses should be tested further using larger, sex-balanced autism datasets.\u003c/p\u003e \u003cp\u003eSocial behaviour difficulties involving the prefrontal cortex manifest differently in males versus females [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. Autistic females tend to have more subtle and less obvious difficulties in social communication than males and are better at adapting behaviours to conform to societal norms [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. The expression of the female-specific SFARI gene set showed no significant developmental changes in expression in any brain regions (i.e. over five timepoints from fetal to adulthood). By contrast, expression of the male-specific SFARI genes showed significant changes in expression across development (samples from 8 pcw to 40 years) in the dorsolateral prefrontal cortex (dPFC) and the medial prefrontal cortex (mPFC). The dPFC and mPFC regions have been associated with autism in males. The mPFC is important for social behaviour, and circuitry changes in this region have been identified in both autism clinical studies and rodent models of majority male [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e] or entirely male cohorts [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Notably, we identified \u003cem\u003eSHANK3\u003c/em\u003e as associated with autism in males but not females. \u003cem\u003eSHANK3\u003c/em\u003e mutations have been shown to impact mPFC connectivity and social deficits in male but not female mice [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. 
Our findings of sex-dependent SFARI gene expression in developmental windows within the brain are consistent with the existence of different mechanisms for the development of autism in males and females. Importantly, this idea is not new, as animal and MRI-based studies indicate the existence of brain-based sex differences in autism [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eMany studies analyse the genetic contributions to autism using autistic individuals and their family members [\u003cspan additionalcitationids=\"CR43\" citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. Yet differences in polygenic risk scores between autistic individuals and their neurotypical family members are expected to be small, because relatives share much of their genome. Clustering weighted (0,1,2; AA, Aa, aa, respectively) PRS SNPs can overcome this limitation. By clustering weighted PRS scores, we identified SNPs rs112635299 (associated with coronary artery disease and bronchodilation [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]) and rs11787216 (associated with educational attainment [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]) as being responsible for the greatest discriminatory difference between the autistic and neurotypical clusters. Therefore, we hypothesise that the individuals within these clusters are more likely to have heart- and education-related co-occurring traits, respectively. 
Notably, epidemiological studies suggest that individuals with congenital heart disease (CHD) have roughly double the odds of developing autism (odds ratio 1.99; 95% CI 1.77\u0026ndash;2.24) compared with those without CHD [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe 16p11.2 locus is important for autism prediction within the MSSNG dataset, particularly for females (p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.001). The 16p11.2 region falls within a known autism-associated CNV, which has sex-specific phenotypes in mouse models [\u003cspan additionalcitationids=\"CR48 CR49 CR50 CR51\" citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e]. Specifically, male mouse models with 16p11.2 deletions are more likely to experience hyperactivity, sleep disturbances, and reduced brain sizes than females with the same mutation [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e, \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e]. By contrast, female mouse models exhibit increased stress-induced anxiety and excitability in their cortical neurons [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e]. This was hypothesised to be related to increased excitatory synaptic drive in central amygdala neurons, potentially due to lipid breakdown, impacting a central amygdala to globus pallidus externa (GPe) pathway that has a role in fear learning. However, it remains unclear what causes the sex-specific difference. Future studies on the role of the 16p11.2 locus in autism should focus on identifying the sex-specific molecular impacts. 
This would provide insights that could help explain how autism develops in males versus females, with potentially significant impacts on diagnosis.\u003c/p\u003e \u003cp\u003eIn conclusion, this study demonstrated the utility of an ensemble machine learning model trained on common genetic variants for autism prediction. The identification of distinct male and female genetic signatures, with divergent spatiotemporal expression patterns and developmental trajectories, provides compelling evidence for sex-specific etiological pathways in autism development. Future work should explore the mechanisms underlying the sex-specific effects of the 16p11 locus.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe would like to thank the Genomics and Systems Biology Group (Liggins Institute, University of Auckland) for their insightful suggestions and discussions. Data used in this work come from the Autism Speaks\u0026rsquo; MSSNG dataset.\u003c/p\u003e\n\u003cp\u003eCM was funded by the University of Auckland Doctoral Scholarship.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflicts of interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have no competing interest to declare.\u003c/p\u003e\n\u003cp\u003eSupplementary information is available at MP\u0026rsquo;s website.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eTick B, Bolton P, Happ\u0026eacute; F, Rutter M, Rijsdijk F. Heritability of autism spectrum disorders: A meta-analysis of twin studies. J Child Psychol Psychiatry. 2016;57:585\u0026ndash;595.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 
Tables

Table 1 is available in the Supplementary Files section.

Posted July 23rd, 2025 (version 1). Subject areas: Biological sciences/Genetics; Health sciences/Biomarkers/Predictive markers; Biological sciences/Neuroscience.

