RNAScope-Ancestry: A Cross-Modality Framework for Inferring Genetic Ancestry from RNA-Seq with Application to MECA

Research Square preprint, Method Article

Rashi Verma, Shivam Sharma, Harriet NA Blankson, Emine Guven, and 10 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7801062/v1

This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Genetic ancestry inference traditionally relies on genotyping arrays or whole-genome sequencing. We present RNAScope-Ancestry, a computational pipeline that leverages short-read RNA-seq for dual-purpose ancestry and transcriptomic analysis. Using RNA-seq from 490 MECA participants and 1000 Genomes reference populations, we performed variant calling, quality filtering, and principal component analysis, and retained 230 high-frequency SNPs for ancestry estimation with the Rye algorithm.
MECA participants aligned with African and admixed populations, predominantly West African. Correlation with gene expression identified the top 50 ancestry-associated transcripts. Validation in 109 longitudinal samples confirmed reproducibility. The pipeline is open-source and generalizable: https://github.com/rob-meller/ ; https://github.com/Vermarashi/ .

Keywords: Genetic Ancestry, Cardiovascular Disease, Short-Read RNA-Seq, Race, Ethnicity

Introduction

Genetic ancestry analysis has traditionally relied on genotyping arrays and whole-genome sequencing, which enable high-resolution inference of genetic variation by capturing a comprehensive set of variants across the genome [1, 2]. These approaches are well validated and widely used for ancestry estimation because of their robustness and accuracy. In contrast, RNA sequencing (RNA-seq), primarily designed for transcriptomic profiling, offers an alternative approach by simultaneously quantifying gene expression and identifying genetic variants. Although RNA-seq does not surpass traditional methods in variant resolution, it presents distinct advantages, particularly in leveraging existing transcriptomic data to infer genetic ancestry. RNA-seq is widely used and readily available in many genomic studies, making it a valuable resource for genetic ancestry analysis without requiring additional sequencing efforts. Moreover, integrating genetic ancestry inference with transcriptome data provides a complementary dimension that captures gene expression patterns influenced by both genetic background and environmental factors. RNA-seq data, with its ability to capture the full spectrum of transcript variants, presents a unique opportunity to enhance the resolution and accuracy of genetic ancestry analyses [3].
This dual functionality makes RNA-seq particularly valuable in studies of complex diseases, where gene expression data can provide deeper insights into population-specific genetic traits and their functional implications. The application of RNA-seq for genetic ancestry analysis remains relatively unexplored [4-6]. Thus, this study presents RNAScope-Ancestry to assess the genetic ancestry of MECA study participants, providing a high-resolution view using short-read RNA-seq data. By contextualizing genetic and clinical findings within African, European, and American ancestries, it offers insights into cardiovascular health disparities among Black adults. By integrating genetic ancestry into CVD research, we can better understand how gene-environment interactions shape health outcomes. This understanding can guide targeted interventions to reduce CVD disparities and promote health equity.

Results

Characteristics of MECA dataset

The study includes a diverse cohort of Black adults (n=490, visit 1; n=109, visit 2). It encompasses a broad age range (>36) and both genders, with detailed demographic, clinical, and behavioral data, including blood pressure, cholesterol, BMI, glucose, HbA1c, C-reactive protein, lifestyle factors (diet, exercise, smoking, alcohol, sleep), psychosocial measures (stress, resilience), and environmental data (geographic location, healthcare access). Longitudinal follow-up tracks cardiovascular risk trajectories [8].

Selection of high-quality SNPs

High-quality RNA-seq data is essential for accurate SNP inference. From the Visit 1 dataset, 18.86M variants were identified and filtered to 2.59M SNPs, while the reference dataset was reduced from 82.49M to 81.57M variants. Comparative analysis revealed 1.26M common SNPs, which were further refined through quality control and LD pruning to 1,415 (Visit 1) and 163,279 (reference) independent variants. The intersection of these two sets yielded 230 common variants for PCA and ancestry analysis (Figure 2).
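The final intersection step can be illustrated with a minimal Python sketch. It assumes that each dataset's pruned variants are available as (chromosome, position, reference allele, alternate allele) tuples; the helper name `common_variants` and the toy variants are hypothetical and are not taken from the authors' scripts.

```python
from typing import Iterable, Set, Tuple

# A variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def common_variants(query: Iterable[Variant], reference: Iterable[Variant]) -> Set[Variant]:
    """Return the variants present in both the query and reference sets."""
    return set(query) & set(reference)


# Toy data standing in for the LD-pruned Visit 1 and reference variant lists.
query_pruned = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "G", "A"),
]
reference_pruned = [
    ("chr1", 1000, "A", "G"),
    ("chr3", 3000, "G", "A"),
    ("chr4", 4000, "T", "C"),
]

shared = common_variants(query_pruned, reference_pruned)
print(sorted(shared))
```

Keying on alleles as well as position is a design choice in this sketch: it avoids pairing variants that share a coordinate but differ in their alternate allele.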
Inference of ancestry from RNA-seq data

The genetic relationships among MECA participants and reference samples from four regional ancestry groups, computed using PCA, are shown in Figure 3A. The PCA plot illustrates distinct clustering patterns for individuals from various ancestral backgrounds based on principal components (PC1 and PC2). Individuals of European ancestry are clearly separated along PC1, whereas African ancestry forms a tight cluster closer to the origin, with admixed and Native American populations distributed between these clusters. The MECA participants predominantly overlap with the African ancestry cluster, suggesting that most of these individuals are of African descent, with a few showing potential admixture. Ancestry proportions were quantified using the Rye algorithm. African reference populations principally display African ancestry, with negligible contributions from European or American ancestries (Figure 3B). Conversely, European reference populations are primarily European. Admixed individuals exhibit varying proportions of African, European, and Native American ancestries, reflecting the complexity of their genetic backgrounds. MECA study participants exhibit ancestry proportions similar to those of admixed individuals, with significant African and European contributions. These analyses highlight the utility of PCA and admixture estimation in elucidating the ancestry patterns of diverse populations and contextualizing participant genetic profiles. The ancestry estimates are consistent with participants’ self-identified ethnic backgrounds, a second level of ethnic identity beneath the ethnic group designation.

Inference of sub-ancestry from RNA-seq data

Next, we assessed the genetic structure of sub-African populations and their relationship to the study participants. This is critical not only to test the sensitivity of our approach but also for interpreting genetic associations within the MECA cohort.
Consequently, it establishes the participants’ primary African genetic ancestry while also highlighting any admixture. Understanding this genetic structure enhances the context of our analyses and aligns with the study's aim to investigate ancestry-driven cardiovascular health in a population using short-read RNA-Seq data analysis. PCA plots across the top four PCs showed enhanced population clustering patterns with HWE cutoffs from 1×10⁻⁴ to 0.4. At a cutoff of 1×10⁻⁴, reference populations formed poorly differentiated clusters and were widely separated from the participant population along PC1 (Figure 4A, B). Interestingly, the plots for PC2, PC3, and PC4 revealed clear clustering of reference populations, with the participant population closely aligning with Western African populations, particularly ESN and YRI (Figure 4C, D). When we applied an HWE cutoff of 0.4, clustering patterns along PC1 remained the same. However, PC2, PC3, and PC4 clearly depicted the participant population's distribution across Western African populations, suggesting that probable artifacts had been filtered from the data (Figure 4E, F). These findings emphasize the sensitivity of PCA clustering to HWE thresholds, illustrating how parameter selection can impact population differentiation and the identification of genetic structure. Ancestry proportions showed largely East African ancestry in the East African group and largely West African ancestry in the West African group. The MECA participants exhibited a mixed ancestry pattern, with a predominant contribution from West African populations and a minimal contribution from East African populations (Figure 5).

Genes Correlated to Ancestry

Correlation analysis between CPM values and ancestry fractions of European, African, and American populations did not reveal any significant associations.
However, further analysis focusing specifically on Sub-Saharan African ancestry uncovered distinct gene sets associated with East and West African ancestry fractions. Among the 13,566 significant genes, 54 genes exhibited positive correlations with West African ancestry, while 10,488 genes showed positive correlations with East African ancestry. We report the top 50 genes with the highest positive and negative correlations with East and West African ancestry (Figure 6). Moreover, the genes with the strongest correlations (both positive and negative) had high −log10(adjusted p-value) scores, further emphasizing their robust association with the respective ancestry fractions.

Performance accuracy with dataset 2

Visit 2 participants aligned closely with their corresponding groups in Visit 1, validating the method's accuracy in assigning ancestry (Figure 7A). African ancestry dominates both visits, with 106 and 108 participants, respectively, while European and American ancestries show negligible representation (Figure 7B). The decrease in the number of European participants from Visit 1 to Visit 2 could be attributed to technical variations or sample reprocessing. The near-identical distributions reinforce the reproducibility of the methodology when applied to subsets of the same population. PCA plots for Sub-Saharan African ancestry align well with the findings from the Visit 1 dataset (Figure 7C, D). Ancestry fractions for Visit 2 revealed African ancestry (Figure 7E) and West African ancestry (Figure 7F) as the primary components, with minimal contributions from other ancestries. This uniform pattern across visits underscores the method's reliability in detecting and quantifying ancestry proportions consistently.

DISCUSSION

The application of bioinformatics tools and computational approaches in understanding genetic ancestry represents a significant advancement in the field of genomics.
While genotyping arrays and whole-genome sequencing remain the gold standard for ancestry inference, RNA-Seq technologies have emerged as a complementary approach. These advancements are driven by their high-throughput data generation, genome-wide coverage, enhanced accuracy, ability to analyze admixed populations, flexibility in analysis, and adaptability for studying ancient DNA [6, 18-20]. This leads to enhanced detection of complex genetic variations, including structural variants, haplotype phasing, and isoform diversity, which are crucial for accurate ancestry analysis. Although RNA-Seq does not provide the same variant resolution as traditional methods, its widespread availability and routine use in transcriptomic studies make it a valuable resource for ancestry analysis. By leveraging genetic variants embedded within transcriptomic data, RNAScope-Ancestry allows for ancestry inference without the need for additional genomic sequencing, making it a cost-effective alternative in studies where DNA-based data may not be available. Furthermore, its application extends to admixed populations, where it can aid in identifying population-specific expression patterns influenced by genetic ancestry. This approach also facilitates the investigation of how ancestry-related genetic factors contribute to gene regulation, enhancing our understanding of complex traits and disease susceptibilities across diverse populations. We aimed to explore ancestry-driven cardiovascular insights in participants of the MECA study using short-read RNA-Seq data. To achieve this, we developed a systematic workflow to assess and stratify the genetic makeup of MECA participants against Admixed, European, American, and African reference populations. Additionally, we focused on their Sub-Saharan African genetic background by comparing the cohort with various African reference populations.
Our approach was further validated using data from participants who attended a second visit, as well as a subset from the initial visit. Principal Component Analysis (PCA) showed distinct clustering patterns, revealing a predominance of African ancestry within our MECA participant cohort. The study revealed a diverse genetic makeup among MECA participants, who clustered near the African and Admixed groups. At the sub-African level, we observed that MECA participants predominantly clustered with West African (Niger-Congo) ancestry, along with a minor East African component. The East African component primarily reflects Bantu ancestry, which originated from the southwestern regions of Africa during the Bantu expansion, not directly from East Africa. The LWK sample from Kenya, which was used to represent East African ancestry, is more accurately described as Bantu due to its linguistic and genetic connection to the Bantu-speaking populations of Central and Southern Africa. This distinction is important because most enslaved Africans brought to the Americas came from West and Central Africa, not East Africa. Therefore, while it is correct to label the LWK sample as East African geographically, referring to it as Bantu ancestry provides a more accurate historical and genetic context. This clarification aligns with the historical migration patterns and enhances the precision of our findings, reflecting the complex heritage of African Americans. Correlation analysis identified the top 50 genes significantly correlated with sub-Saharan African ancestries. Notably, the genes exhibit a positive correlation with East African ancestry and an equally strong negative correlation with West African ancestry, with adjusted p-values indicating high statistical significance (Suppl. Fig. 1). The observed correlations follow a predictable pattern due to the inherent non-independence of ancestry components.
Since the sum of ancestry proportions is constrained to 100%, a positive correlation for one ancestry component necessarily implies a negative correlation for the other. This non-independence acts as an internal control, ensuring that our results are consistent and correctly interpreted. Such patterns have been noted in previous ancestry studies and are a fundamental property of proportional data. To address this dependency, we recalculated East and West African ancestry proportions as unscaled absolute values by multiplying total African ancestry by their respective relative proportions, thereby removing artificial dependencies and preserving the natural structure of the data (Figure 6). We validated our RNAScope-Ancestry pipeline using data from MECA participants who returned for a second visit (Visit 2, n = 109). The validation confirms the reliability of our approach, which reproduces consistent ancestry distributions across related cohorts and self-identified ethnic backgrounds. Although all participants self-identified as African American, our genetic ancestry analysis revealed substantial sub-continental African heterogeneity, including contributions from West African, East African, and European ancestries. These findings support the observation that self-identified race does not align with genetic ancestry. Therefore, biomedical research relying solely on race labels overlooks biologically relevant ancestry-related variation. Integrating ancestry analysis provides a more accurate framework for interpreting gene expression patterns in diverse populations [21].

CHALLENGES: The genetic ancestry analysis of MECA samples using short-read RNA-seq data presented several challenges, particularly in achieving proper clustering in the PCA plots along PC1. Initial PCA plots exhibited significant clustering for reference populations; however, the MECA samples displayed an unusual spread, indicating potential noise or confounding factors in the dataset (Suppl. Fig. 2A, B).
To address this, we implemented a series of filtering steps, beginning with LD pruning. Despite this, PC1 and PC2 remained suboptimal for the MECA samples (Suppl. Fig. 2C), while PC1 and PC3 demonstrated better clustering with some residual noise (Suppl. Fig. 2D). We then filtered out variants with >10% missing genotypes across all samples and applied a Hardy-Weinberg equilibrium threshold (1×10⁻⁴). These adjustments effectively removed low-quality variants and potential artifacts, thereby enhancing the resolution of genetic structure (Figure 3A). The improved separation in PCA plots following LD pruning and HWE filtering highlights the importance of minimizing noise and ensuring data quality. The next challenge arose in the sub-Saharan African ancestry analysis, likely because most of the variants are common across these populations. Initial PCA results revealed dispersed clustering of MECA samples (Suppl. Fig. 3), which suggested that rare variants needed to be incorporated to improve clustering accuracy. To address this, imputation was incorporated into the pipeline, which improved clustering on PCA plots, particularly for PC2, PC3, and PC4 (Figure 4C-D). Further refinement with HWE thresholds resulted in better alignment of participant samples with reference populations (Figure 4E, F). Notably, the MECA query samples appear to shift towards the European pole along PC1 (Figure 4A-B). This shift is plausible, as these samples might have a genetic composition that is aligned with European populations. As a separate query group, the MECA samples are not admixed per se but may share genetic features associated with European ancestry, which is reflected in their position along the principal component axis. The shift along PC1 further supports the validity of our results and underscores the importance of considering the genetic background of the query samples in interpreting the PCA plots.
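The missingness and Hardy-Weinberg filters discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses a 1-degree-of-freedom chi-square test as a simple stand-in for the exact test that variant-filtering tools typically apply, and the function names `hwe_chi2_pvalue` and `passes_filters` are hypothetical. The default thresholds (10% missingness, P = 1×10⁻⁴) are the values quoted in the text.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_filters(n_aa: int, n_ab: int, n_bb: int, n_missing: int,
                   max_missing: float = 0.10, hwe_p: float = 1e-4) -> bool:
    """Keep a variant if its missingness and HWE p-value are both acceptable."""
    total = n_aa + n_ab + n_bb + n_missing
    if total == 0:
        return False
    if n_missing / total > max_missing:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_p


# A variant in equilibrium is kept; one with no heterozygotes at all is dropped.
print(passes_filters(25, 50, 25, n_missing=5))
print(passes_filters(50, 0, 50, n_missing=0))
```

Raising `hwe_p` towards 0.4, as explored for the sub-African analysis, makes the filter far stricter, which is one way to see why the retained variant set and the resulting PCA clustering are sensitive to this parameter.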
While genome-wide studies generally require hundreds of thousands to millions of variants, our analysis demonstrated that even a small subset of variants (n<500) can effectively capture significant ancestry-informative signals, particularly in admixed populations. Winkler et al. highlighted that a sufficient number of ancestry-informative markers can effectively support genome-wide scans for disease associations, particularly in admixed populations [22]. Notably, the alignment of PCA clustering with self-identified ethnicities underscores the utility of this approach, suggesting that a well-curated set of variants can provide meaningful insights into genetic ancestry. This finding highlights the potential of short-read RNA sequencing data, which, despite its lower variant count, can yield valuable genetic information. Our results advocate for the use of smaller variant panels in genetic ancestry studies, especially when cost-effective methods like short-read RNA sequencing are preferred over whole-genome sequencing, offering an accessible and efficient alternative for research on genetic admixture. These results also have implications for genetic epidemiology and precision medicine. Given that MECA study participants exhibit genetic affinities with specific West African populations, it is crucial to acknowledge these subpopulations in future genetic studies and healthcare applications. Furthermore, the genetic data presented here highlight the importance of utilizing ancestry-informative markers in the Mende and Yoruba West African groups, which could be valuable for refining genetic models to predict disease risk and enabling more targeted and personalized healthcare strategies. Additionally, this study highlights the need to diversify genomic databases to better represent African genetic diversity, as African populations have been historically underrepresented.
Expanding the inclusion of groups will improve the generalizability of research findings and help create more accurate genetic models, benefiting populations of African descent worldwide. While this study utilizes RNA-seq data to infer genetic ancestry, it is important to acknowledge that RNA-seq is primarily designed for gene expression analysis rather than ancestry estimation. Unlike genotyping arrays or whole-genome sequencing, RNA-seq-based ancestry inference may be influenced by several factors, including expression variability, batch effects, and sample-specific biases. Additionally, gene expression is context-dependent and may vary across tissues, environmental conditions, and disease states, which could introduce variability in ancestry estimations. Limited reference populations may not fully represent East African diversity, highlighting the need for broader datasets in genetic databases. The use of a small number of variants may reduce the resolution of ancestry inference compared to studies utilizing high-density genotype data. Replicating these findings in independent datasets while addressing these limitations will strengthen the evidence for ancestry, ancestry-related expression differences, and their health implications. Despite these limitations, our findings exhibit strong consistency with established ancestry patterns, reinforcing the robustness of the approach. The workflow demonstrates the efficacy of filtering and utilizing high-quality SNPs to infer genetic ancestry in a diverse cohort of MECA participants. It highlights genetic variation that influences gene expression, offering insights into the biological processes linked to genetic ancestry. It generates consistent results across datasets, which makes it more reliable when analyzing diverse populations.
It allows the integration of ancestry results with gene expression analysis, providing a complementary layer of information that makes short-read RNA-seq particularly useful in studying traits or diseases influenced by regulatory mechanisms. Through integration with genetic data, the workflow can provide a more comprehensive understanding of how ancestral origins shape gene expression, offering a deeper perspective on the complex interplay between genetics and phenotype.

Conclusion

RNAScope-Ancestry demonstrates that short-read RNA-seq can be repurposed beyond transcriptomics to provide reliable genetic ancestry inference, linking population structure with gene expression. By validating the pipeline in MECA participants and ensuring reproducibility across visits, our framework establishes a scalable, generalizable approach for ancestry-aware analyses, particularly in underrepresented populations. This method enables integrative, dual-purpose studies that can advance precision medicine, population genetics, and complex disease research using existing RNA-seq datasets.

METHODOLOGY

The RNAScope-Ancestry pipeline starts with sample collection and RNA-Seq library preparation. Figure 1 highlights the steps of the protocol. We applied this framework to the available participants of the Morehouse-Emory Cardiovascular Center for Health Equity (MECA) study [7]. The Visit 1 dataset was used for ancestry estimation, while the Visit 2 dataset (comprising subsequent visits of participants already included in the Visit 1 dataset) validated the findings.

Data and Sample Collection

A multi-faceted approach was employed to investigate cardiovascular health disparities among Black adults in Atlanta, classifying neighborhoods as "at-risk" or "resilient" based on cardiovascular outcomes hospitalization [8].
Over 1,400 individuals were surveyed, with 599 participants undergoing clinical evaluations and blood collection across two visits (Visit 1: n=490; Visit 2: n=109). All eligible participants in this cohort self-identified as Black or African American adults, aged 18+, residing in targeted neighborhoods, with exclusions for non-residency, inability to consent, or conditions interfering with assessments [8]. Additionally, for the intervention subset, individuals who could not adhere to the eHealth tools or coaching protocols due to technological or other barriers were excluded. These criteria ensured a representative sample while maintaining the study's scientific rigor and relevance. This protocol ensured a robust cohort for examining genetic and environmental factors influencing cardiovascular risk and resilience.

RNA-Seq Library Preparation

Blood was collected into PAXgene Blood RNA tubes (PreAnalytiX, Qiagen), and the RNA was extracted using the MagMAX for Stabilized Blood Tubes RNA Isolation Kit, compatible with PAXgene Blood RNA Tubes (ThermoFisher Scientific). RNA quality was assessed using a Fragment Analyzer (Agilent), and then one microgram of total RNA was subjected to globin transcript depletion using the GLOBINclear Kit, human (ThermoFisher Scientific). Ten nanograms of the globin-depleted RNA were used as input for cDNA synthesis using the Clontech SMART-Seq v4 Ultra Low Input RNA kit (Takara Bio) according to the manufacturer’s instructions. Amplified cDNA was fragmented and appended with dual-indexed bar codes using the Nextera XT DNA Library Preparation kit (Illumina). Libraries were validated by capillary electrophoresis on a TapeStation 4200 (Agilent), pooled at equimolar concentrations, and sequenced with PE100 reads on an Illumina NovaSeq 6000, yielding ~30 million reads per sample on average.
Data Alignment and Variant Calling

The sequenced data were trimmed (TrimGalore v0.6.4) and aligned to the GRCh38 human reference genome (STAR v2.7.3a and Bowtie2 v2.3.5.1) [9, 10]. Aligned BAM files were sorted and indexed (SAMtools v1.10) [11]. Variant calling was performed with the GATK pipeline [12]. PCR duplicates were marked (MarkDuplicates), spliced alignments were processed (SplitNCigarReads), and base quality scores were recalibrated (BaseRecalibrator) to correct systematic sequencing errors. HaplotypeCaller was used in GVCF mode for variant calling, and the resulting GVCF files were jointly genotyped in GenomicsDB (GenotypeGVCFs). Variants were then filtered by applying variant quality score recalibration (ApplyVQSR).

Preprocessing of MECA and Reference Samples

Variants were annotated (bcftools) with databases such as dbSNP138, HapMap, and Mills and 1000G indels. Quality filters were applied, retaining only non-singleton variants with quality scores > 30 and depth > 10. Normalization was conducted to ensure consistent representation of variants. Genotype data from the MECA samples were designated as “query samples.” Reference data (African, American, European, Admixed; n = 1,249) from the 1000 Genomes Project served as “reference samples” (Table 1). Native American ancestry is represented by Peruvian (PEL) samples, which have been shown to carry a high proportion of Native American ancestry (>80%), as reported in Conley et al., 2023 [13]. African ancestry is represented by Esan (ESN), Gambian (GWD), Luhya (LWK), Mende (MSL), and Yoruba (YRI) samples. European ancestry is represented by Utah residents (CEU), Finnish (FIN), British (GBR), Iberian (IBS), and Toscani (TSI) populations. Admixed American populations, including African Caribbean in Barbados (ACB) and African Americans in the Southwest U.S. (ASW), are categorized as such due to their known two-way African-European admixture.
While these populations are sometimes classified under African ancestry in other studies, we use the admixed category to better reflect their genetic background. This classification allows for a more accurate interpretation of genetic structure in our study cohort. Chromosome names were standardized, and variants were normalized to a biallelic format. Quality control steps included filtering on minor allele frequency (MAF < 0.05), genotype missingness (0.1), and Hardy-Weinberg equilibrium (P < 1 × 10⁻⁴). To reduce redundant variants, linkage disequilibrium (LD) pruning was performed (100:50:0.5). The resulting intersected variants from the query and reference datasets were merged and converted into PLINK binary format (Script).

Ancestry Estimation

PCA was conducted using PLINK (v2.0) to detect genetic structure among MECA participants [14]. Ancestry proportions were estimated with Rapid ancestrY Estimation (Rye), an efficient algorithm that leverages principal components for robust ancestry inference [13]. The final dataset included reference groups from the 1000 Genomes Project categorized as African, European, American, and Admixed. A population-to-group mapping file aggregated populations into these groups: African (ESN, GWD, LWK, MSL, YRI), European (CEU, FIN, GBR, IBS, TSI), American (PEL), and Admixed (ACB, ASW) (Table 1). The Rye analysis used 30 principal components, allowing robust ancestry fraction estimation for each participant in the MECA study.

Sub-African Ancestry Analysis

Phasing and genotype imputation were performed using Beagle v5.1 for the Sub-Saharan African ancestry analysis of the MECA samples [15]. Phasing reconstructed the chromosomal phase of variants, which improved the accuracy of imputation. Imputation used the phased haplotypes to infer missing genotypes by referencing haplotypes from the 1000 Genomes Project, resulting in a complete and high-quality dataset for downstream ancestry analysis.
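As an illustration of two of the preprocessing steps described above, the following Python sketch computes a minor allele frequency filter and aggregates 1000 Genomes population codes into the groups listed in the text. It assumes that "MAF < 0.05" denotes the variants that are removed (so variants with MAF ≥ 0.05 are kept) and that genotypes are encoded as alternate-allele dosages; the function names are hypothetical and not part of the published pipeline.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence

# Population-to-group mapping as listed in the text (1000 Genomes population codes).
POP_TO_GROUP = {
    "ESN": "African", "GWD": "African", "LWK": "African", "MSL": "African", "YRI": "African",
    "CEU": "European", "FIN": "European", "GBR": "European", "IBS": "European", "TSI": "European",
    "PEL": "American",
    "ACB": "Admixed", "ASW": "Admixed",
}


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """MAF from alternate-allele dosages (0, 1 or 2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def keep_variant(genotypes: Sequence[Optional[int]], min_maf: float = 0.05) -> bool:
    """Assumed reading of the filter: retain variants with MAF >= 0.05."""
    return minor_allele_frequency(genotypes) >= min_maf


def group_counts(sample_populations: Iterable[str]) -> Counter:
    """Count reference samples per ancestry group."""
    return Counter(POP_TO_GROUP[pop] for pop in sample_populations)


print(keep_variant([0, 1, 1, 0]))      # common variant
print(keep_variant([0] * 19 + [1]))    # rare variant
print(group_counts(["YRI", "CEU", "ASW", "LWK"]))
```

Keeping the mapping in one dictionary mirrors the role of the population-to-group mapping file: changing how ACB and ASW are classified would only require editing two entries.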
Sub-Saharan African populations, comprising West African (ESN, YRI, GWD, MSL) and East African (LWK) groups, were used as references (Table 1). The quality control process included filtering, linkage disequilibrium (LD) pruning, merging of common variants, and principal component analysis (PCA). Ancestry proportions were calculated using Rye.

Gene Expression Correlation Analysis

Next, we aimed to identify genes significantly correlated with ancestry fractions in the dataset. Libraries were quantified using StringTie2 [16]. Gene expression was calculated as fragments per kilobase of transcript per million mapped reads (FPKM) and normalized using the trimmed mean method. The counts per million (CPM) were computed for each gene across all samples. The CPM matrix was then used to calculate Spearman correlation coefficients between CPM values and ancestry fractions. Since ancestry proportions are constrained to sum to 100%, the correlations for East African ancestry would be the exact inverse of those for West African ancestry. To address this dependency between ancestry components, we calculated East and West African ancestry as unscaled absolute values based on the total African ancestry and their relative proportions. To account for multiple testing, adjusted p-values were determined using the Benjamini-Hochberg method [17]. For visualization, we selected the top 50 genes with the highest positive and strongest negative correlations.

Method Validation

We validated our approach for ancestry analysis by applying it to the Visit 2 dataset. The same parameters were used to check the reproducibility of the proposed approach. Accuracy was assessed by comparing the ancestry PCA plots and ancestry fractions of Visit 1 participants with those of Visit 2.

Declarations

ACKNOWLEDGEMENT

Next generation sequencing services were provided by the Emory NPRC Genomics Core, which is supported in part by NIH P51 OD011132.
Sequencing data were acquired on an Illumina NovaSeq6000 funded by NIH S10 OD026799.

FUNDING

This work is supported by R01NS112422 (PI Meller), RM1HG012334 (PI Meller) and U54HG013595-01 (NIH/NIGMS). The sponsors, which are public or nonprofit organizations dedicated to general science, were not involved in the collection, analysis, or interpretation of the data. The content of this publication does not necessarily represent the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organizations imply endorsement by the U.S. government. The authors express their gratitude to the study volunteers and the staff at the GRA clinical center at Emory University Hospital.

DATA AND CODE AVAILABILITY STATEMENT

R and Linux based codes for the project are available on GitHub (https://github.com/Vermarashi/; https://github.com/rob-meller/). Genomic data will be available in dbGaP (awaiting ID number). The PI (Dr. Meller) can be contacted to request data and codes.

CONFLICT OF INTEREST DISCLOSURE

The authors declare no competing interests.

ETHICAL STATEMENT

The protocol was approved by the Institutional Review Boards at Morehouse School of Medicine and Emory University (IRB00083584). All participants included in the study gave their informed consent.

CONTRIBUTIONS

R.V., R.M., and I.K.J. contributed to the conception and design of the study and supervised the analyses. P.P., D.J., A.A.Q., and H.T. conceived the MECA project and recruited participants. R.V. and S.S. performed data processing and analyses. R.V. drafted the manuscript and prepared the figures. H.N.B. and E.G. contributed statistical and bioinformatics expertise and critically reviewed the analyses. A.P., C.D.S., P.B., and T.L. revised the manuscript for important intellectual content. All authors reviewed the manuscript, contributed to its revision, and approved the final version.
References

1. Li, J.Z., et al., Worldwide human relationships inferred from genome-wide patterns of variation. Science, 2008. 319(5866): p. 1100-4.
2. Pritchard, J.K., M. Stephens, and P. Donnelly, Inference of population structure using multilocus genotype data. Genetics, 2000. 155(2): p. 945-59.
3. Montgomery, S.B., et al., Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 2010. 464(7289): p. 773-7.
4. Belleau, P., et al., Correction: Genetic Ancestry Inference from Cancer-Derived Molecular Data across Genomic and Transcriptomic Platforms. Cancer Res, 2023. 83(2): p. 347.
5. Fachrul, M., et al., Direct inference and control of genetic population structure from RNA sequencing data. Commun Biol, 2023. 6(1): p. 804.
6. Barral-Arca, R., et al., Ancestry patterns inferred from massive RNA-seq data. RNA, 2019. 25(7): p. 857-868.
7. Scheepers, B., P. Clough, and C. Pickles, The misdiagnosis of epilepsy: findings of a population study. Seizure, 1998. 7(5): p. 403-6.
8. Islam, S.J., et al., Cardiovascular Risk and Resilience Among Black Adults: Rationale and Design of the MECA Study. J Am Heart Assoc, 2020. 9(9): p. e015247.
9. Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.
10. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods, 2012. 9(4): p. 357-9.
11. Danecek, P., et al., Twelve years of SAMtools and BCFtools. Gigascience, 2021. 10(2).
12. McKenna, A., et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 2010. 20(9): p. 1297-303.
13. Conley, A.B., et al., Rye: genetic ancestry inference at biobank scale. Nucleic Acids Res, 2023. 51(8): p. e44.
14. Purcell, S., et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 2007. 81(3): p. 559-75.
15. Browning, B.L. and S.R. Browning, Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 2016. 98(1): p. 116-26.
16. Kovaka, S., et al., Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol, 2019. 20(1): p. 278.
17. Glickman, M.E., S.R. Rao, and M.R. Schultz, False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. J Clin Epidemiol, 2014. 67(8): p. 850-7.
18. Deshpande, D., et al., RNA-seq data science: From raw data to effective interpretation. Front Genet, 2023. 14: p. 997383.
19. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63.
20. Smith, O., et al., Ancient RNA from Late Pleistocene permafrost and historical canids shows tissue-specific transcriptome survival. PLoS Biol, 2019. 17(7): p. e3000166.
21. Gouveia, M.H., et al., Subcontinental genetic variation in the All of Us Research Program: Implications for biomedical research. Am J Hum Genet, 2025. 112(6): p. 1286-1301.
22. Winkler, C.A., G.W. Nelson, and M.W. Smith, Admixture mapping comes of age. Annu Rev Genomics Hum Genet, 2010. 11: p. 65-89.

Table 1

Table 1 is available in the Supplementary Files section.

Additional Declarations

No competing interests reported.

Supplementary Files

TABLE1.docx
SUPPLEMENTARYDATAlegends.docx
SupplFig.pdf
Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7801062","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Method Article","associatedPublications":[],"authors":[{"id":600129047,"identity":"9aa9458b-b995-4640-be2e-bcd43f2ee8cf","order_by":0,"name":"Rashi Verma","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAoklEQVRIiWNgGAWjYPACGwjFQ4KWNNK1HCZBi/mM5GOfbtScj5afdoDxwds2IrTI3EhLnp1z7HbuhtsJzIZzidEiwXPGmDmHDahFOoFNmpc4Lec/M+f8O5c7f3YC+2/itLD3MDPnth3IbbidwMZMpJY2Y+bcvmSgXxKbJeecI0YLM/Nj5pxvdkCHJR/88KaMCC1IgLGBNPWjYBSMglEwCnADAEYhM00KM41GAAAAAElFTkSuQmCC","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":true,"prefix":"","firstName":"Rashi","middleName":"","lastName":"Verma","suffix":""},{"id":600129048,"identity":"09bd9fcb-904a-4d4b-a7c1-882cb5e6fe5c","order_by":1,"name":"Shivam Sharma","email":"","orcid":"","institution":"Georgia Institute of Technology","correspondingAuthor":false,"prefix":"","firstName":"Shivam","middleName":"","lastName":"Sharma","suffix":""},{"id":600129049,"identity":"8277df94-b276-45a1-880e-f3534df9d0c8","order_by":2,"name":"Harriet NA Blankson","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Harriet","middleName":"NA","lastName":"Blankson","suffix":""},{"id":600129050,"identity":"cc09e628-bd54-4ead-989a-2c658fd78a0f","order_by":3,"name":"Emine 
Guven","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Emine","middleName":"","lastName":"Guven","suffix":""},{"id":600129051,"identity":"c597291f-3f2c-4d40-8c00-17ae59865b0c","order_by":4,"name":"Andrea Pearson","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Andrea","middleName":"","lastName":"Pearson","suffix":""},{"id":600129052,"identity":"c394fc16-674b-475e-8270-5644a6214a91","order_by":5,"name":"Charles D. Searles","email":"","orcid":"","institution":"Atlanta VA Health Care System","correspondingAuthor":false,"prefix":"","firstName":"Charles","middleName":"D.","lastName":"Searles","suffix":""},{"id":600129053,"identity":"55076df7-7bf5-48de-8cbc-c8576af1a4d1","order_by":6,"name":"Peter Baltrus","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Peter","middleName":"","lastName":"Baltrus","suffix":""},{"id":600129054,"identity":"ed16a920-449b-436d-8f41-931ffb8813d4","order_by":7,"name":"Tene T. 
Lewis","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Tene","middleName":"T.","lastName":"Lewis","suffix":""},{"id":600129055,"identity":"ba6350de-cb4f-4449-ba3a-320fee620cf7","order_by":8,"name":"Priscilla Pemu","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Priscilla","middleName":"","lastName":"Pemu","suffix":""},{"id":600129056,"identity":"28530588-f87c-413f-846a-7bfb03e5514e","order_by":9,"name":"Dean Jones","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Dean","middleName":"","lastName":"Jones","suffix":""},{"id":600129057,"identity":"701a983d-fdf0-4458-baa8-b2bd7000de6f","order_by":10,"name":"Arshed Ali Quyyumi","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Arshed","middleName":"Ali","lastName":"Quyyumi","suffix":""},{"id":600129058,"identity":"b05cf9f1-3c24-4fb5-b095-c214c44b8549","order_by":11,"name":"Herman Taylor","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Herman","middleName":"","lastName":"Taylor","suffix":""},{"id":600129059,"identity":"e43ecc3c-d9b4-44e9-a35a-082f5ca15fc9","order_by":12,"name":"I. 
King Jordan","email":"","orcid":"","institution":"Georgia Institute of Technology","correspondingAuthor":false,"prefix":"","firstName":"I.","middleName":"King","lastName":"Jordan","suffix":""},{"id":600129060,"identity":"371eeab0-55aa-4bbd-99dd-cf1476b31150","order_by":13,"name":"Robert Meller","email":"","orcid":"","institution":"Morehouse School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Robert","middleName":"","lastName":"Meller","suffix":""}],"badges":[],"createdAt":"2025-10-07 15:38:32","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7801062/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7801062/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104179985,"identity":"3bea7341-602f-4f86-92a0-aece2cee61a9","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":281948,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eOverview of the genetic ancestry analysis pipeline using RNA-seq data. \u003c/strong\u003eThe schematic illustrates the step-by-step protocol employed to infer genetic ancestry from RNA sequencing, including variant calling, filtering, and downstream population structure analysis.\u003c/p\u003e","description":"","filename":"Figure1.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/55d47af5197c79bae627e643.png"},{"id":104404999,"identity":"cc196c54-f6c9-4d25-bd67-1f9ba9853cf3","added_by":"auto","created_at":"2026-03-11 12:21:32","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":383823,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVariant filtering progression across the analysis pipeline. 
\u003c/strong\u003eThe figure presents the number of variants retained after each major filtering step, highlighting the impact of quality control, Hardy-Weinberg equilibrium testing, and linkage disequilibrium pruning on the dataset.\u003c/p\u003e","description":"","filename":"Figure2.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/8350a4dac7fb622c3b2bcb8f.png"},{"id":104404542,"identity":"32d255f0-66f7-4a96-ad7f-4dc8c462a5ec","added_by":"auto","created_at":"2026-03-11 12:20:29","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1132729,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eGenetic ancestry inferences on MECA participants (visit 1). (A) \u003c/strong\u003ePCA of MECA participants (hot pink) and ancestry group reference samples (colored as shown). (\u003cstrong\u003eB) \u003c/strong\u003eAncestry and admixture patterns for MECA participants. Ancestry fractions (colored as shown) are indicated for each participant. All groups are scaled based on the number of participants.\u003c/p\u003e","description":"","filename":"Figure3.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/8a91044781932b1089d3f8f8.png"},{"id":104403770,"identity":"ba3a032a-9989-4122-ac94-903636f7b805","added_by":"auto","created_at":"2026-03-11 12:19:00","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1758144,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePCA analysis illustrating the effect of Hardy-Weinberg Equilibrium filtering thresholds on Sub-Saharan African ancestry clustering. (A)\u003c/strong\u003e PCA plots generated after applying a HWE filter (P \u0026lt; 1×10⁻⁴), showing clustering patterns based on (i) PC1 vs PC2, (ii) PC1 vs PC3, (iii) PC2 vs PC3, and (iv) PC3 vs PC4. Study participants (visit 1) and reference populations clustered separately. 
\u003cstrong\u003e(B)\u003c/strong\u003ePCA plots generated after applying a HWE threshold (P \u0026lt; 0.4), showing (i) PC2 vs PC3 and (ii) PC3 vs PC4. PC1 vs PC2 and PC1 vs PC3 are not shown due to negligible differences from plots in section \u003cstrong\u003eA\u003c/strong\u003e. The ancestry clusters, particularly for MECA participants (visit 1), appear closer and show partial overlap with reference populations, providing enhanced resolution of subtle ancestry patterns.\u003c/p\u003e","description":"","filename":"Figure4.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/0313d63f5b129876cfb422b7.png"},{"id":104179989,"identity":"a79e27ae-5755-4541-8bf5-6b2418e1e133","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":129107,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSub-Sharan African Ancestry and admixture patterns for MECA participants. \u003c/strong\u003e\u0026nbsp;Ancestry pattern of MECA participants (visit 1) is predominantly aligns with west African reference samples. Ancestry fractions (colored as shown) are indicated for each participant. All groups are scaled based on the number of participants.\u003c/p\u003e","description":"","filename":"Figure5.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/0eb5c6c5441f6cce13609d9e.png"},{"id":104179992,"identity":"76595cc3-8e38-4ebb-8ee2-14507f16ec42","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":2432592,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eGene correlation with Sub-Sharan African Ancestry. 
(A) \u003c/strong\u003eTop 50 positively correlated genes to East and West African Ancestry (visit 1).\u003cstrong\u003e (B) \u003c/strong\u003eTop 50 negatively correlated genes to East and West African Ancestry (visit 1).\u003c/p\u003e","description":"","filename":"Figure6.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/ed847d80e269c66398824734.png"},{"id":104179993,"identity":"ffa05e61-4770-4d2d-b226-ca7854f711a1","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":1459919,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eValidation of genetic ancestry analysis pipeline with MECA participants (visit 2). (A) \u003c/strong\u003ePCA of MECA participants visit 2 (hot pink) and ancestry group reference samples (colored as shown). \u003cstrong\u003e(B) \u003c/strong\u003eNumber of participants present in visit 1 and visit 2.\u003cstrong\u003e (C) \u003c/strong\u003ePCA plots highlighting Sub-Sharan African Ancestry using Hardy-Weinberg Equilibrium (HWE) filtering thresholds (P \u0026lt; 1×10⁻⁴). \u003cstrong\u003e(D)\u003c/strong\u003e PCA plot using a HWE threshold (P \u0026lt; 0.4), showing comparatively tighter clustering and partial overlap. \u003cstrong\u003e(E)\u003c/strong\u003eCorrelation analysis of overall ancestry proportions between MECA participants in Visit 1 and Visit 2. 
(F) Correlation of sub-Saharan African ancestry estimates between Visit 1 and Visit 2, supporting consistency and robustness of the ancestry inference pipeline.\u003c/p\u003e","description":"","filename":"Figure7.png","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/e8e42d46e7d2b027364c105d.png"},{"id":104409247,"identity":"ff579fa5-47d3-46c1-afa0-a8645ddd4a7d","added_by":"auto","created_at":"2026-03-11 12:44:27","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":8234203,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/fdfc14d5-e6ce-4f47-b61d-dd66d7b3e4c9.pdf"},{"id":104179991,"identity":"843a571c-28a7-42f3-bb55-049ba796b338","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":16759,"visible":true,"origin":"","legend":"","description":"","filename":"TABLE1.docx","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/bd8bdad2165c66a0838d9515.docx"},{"id":104179987,"identity":"a7b94643-4ba1-4b0a-894b-5838d9bd94b5","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":14062,"visible":true,"origin":"","legend":"","description":"","filename":"SUPPLEMENTARYDATAlegends.docx","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/0126d82ff617d398fda7fd1d.docx"},{"id":104179995,"identity":"4101bc4b-d711-4ec0-8b6e-b114ade6c4c1","added_by":"auto","created_at":"2026-03-08 17:10:42","extension":"pdf","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":3042372,"visible":true,"origin":"","legend":"","description":"","filename":"SupplFig.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7801062/v1/c61d5c2a84c10748d2568c81.pdf"}],"financialInterests":"No competing 
interests reported.","formattedTitle":"RNAScope-Ancestry: A Cross-Modality Framework for Inferring Genetic Ancestry from RNA-Seq with Application to MECA","fulltext":[{"header":"Introduction","content":"\u003cp\u003eGenetic ancestry analysis has traditionally relied on genotyping arrays and whole-genome sequencing that enable high-resolution inferences of genetic variation by capturing a comprehensive set of variants across the genome [1, 2]. These approaches are well-validated and widely used for ancestry estimation due to their robustness and accuracy. In contrast, RNA sequencing (RNA-seq), primarily designed for transcriptomic profiling, offers an alternative approach by simultaneously quantifying gene expression and identifying genetic variants. Although RNA-seq does not surpass traditional methods in variant resolution, it presents distinct advantages, particularly in leveraging existing transcriptomic data to infer genetic ancestry.\u003c/p\u003e\n\u003cp\u003eRNA-seq is widely used and readily available in many genomic studies, making it a valuable resource for genetic ancestry analysis without requiring additional sequencing efforts. Moreover, the integration of genetic ancestry inferences with transcriptome data providing a complementary dimension that captures gene expression patterns influenced by both genetic background and environmental factors. RNA-seq data, with its ability to capture the full spectrum of transcript variants, presents a unique opportunity to enhance the resolution and accuracy of genetic ancestry analyses [3]. This dual functionality makes RNA-seq particularly valuable in studies of complex diseases, where gene expression data can provide deeper insights into population-specific genetic traits and their functional implications.\u003c/p\u003e\n\u003cp\u003eThe application of RNA-seq for genetic ancestry analysis remains relatively unexplored [4-6]. 
Thus, the study presents RNAScope-ancestry to assesses the genetic ancestry of MECA study participants, providing a high-resolution view using short-read RNA-seq data. By contextualizing genetic and clinical findings within African, European, and American ancestries, it offers insights into cardiovascular health disparities among Black adults. By integrating genetic ancestry into CVD research, we can better understand how gene-environmental interactions shape health outcomes. This understanding can guide targeted interventions to reduce CVD disparities and promote health equity.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eCharacteristics of MECA dataset\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study includes a diverse cohort of Black adults (n=490, visit 1; n=109, visit 2). It encompasses a broad age range (\u0026gt;36) and both genders, with detailed demographic, clinical, and behavioral data, including blood pressure, cholesterol, BMI, glucose, HbA1c, C-reactive protein, lifestyle factors (diet, exercise, smoking, alcohol, sleep), psychosocial measures (stress, resilience), and environmental data (geographic location, healthcare access). Longitudinal follow-up tracks cardiovascular risk trajectories [8].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSelection of high-quality SNPs\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHigh-quality RNA-seq data is essential for accurate SNP inference. From the Visit 1 dataset, 18.86M variants were identified and filtered to 2.59M SNPs, while the reference dataset was reduced from 82.49M to 81.57M variants. Comparative analysis revealed 1.26M common SNPs, further refined through quality control and LD pruning to 1,415 (Visit 1) and 163,279 (reference) independent variants. 
The intersection yielded 230 common variants for PCA and ancestry analysis, ensuring robust results (\u003cstrong\u003eFigure 2\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInference of ancestry from RNA-seq data\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe genetic relationship among MECA participants and reference samples from four regional ancestry groups, computed using PCA, are shown in \u003cstrong\u003eFigure 3A\u003c/strong\u003e. The PCA plot illustrates distinct clustering patterns for individuals from various ancestral backgrounds based on principal components (PC1 and PC2). Individuals of European ancestry are clearly separated along PC1, whereas African ancestry forms a tight cluster closer to the origin, with admixed and Native American populations distributed between these clusters. The MECA participants predominantly overlap with the African ancestry cluster, suggesting that most of these individuals are of African descent, with a few showing potential admixture.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAncestry proportions were quantified using Rye algorithm. African reference populations principally display African ancestry, with negligible contributions from European or American ancestries (\u003cstrong\u003eFigure 3B\u003c/strong\u003e). Conversely, European reference populations are primarily European. Admixed individuals exhibit varying proportions of African, European, and Native American ancestries, reflecting the complexity of their genetic backgrounds. MECA study participants exhibit ancestry proportions like admixed individuals, with significant African and European contributions. These analyses highlight the utility of PCA and admixture estimation in elucidating the ancestry patterns of diverse populations and contextualizing participant genetic profiles. 
The ancestry estimates are consistent with participants’ self-identified ethnic backgrounds, which is a second level of ethnic identity beneath the ethnic group designation\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInference of sub-ancestry from RNA-seq data\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNext, we were able to assess the genetic structure of sub-African populations and their relationship to the study participants. This is critical not only to test the sensitivity of our approach but also for interpreting genetic associations within the MECA cohort. Consequently, it establishes the participant’s primary African genetic ancestry with highlighting any admixture as well. Understanding this genetic structure enhances the context of our analyses and aligns the study's aim to investigate ancestry driven cardiovascular health in a population using short read RNA-Seq data analysis.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePCA plots across the top four PCs showed enhanced population clustering patterns with HWE cutoffs from 1x10⁻⁴ to 0.4. \u0026nbsp;At a cutoff of 1x10⁻⁴, reference populations formed poorly differentiated clusters and showed highly distant separation from the participant population along PC1 (\u003cstrong\u003eFigure 4A; B\u003c/strong\u003e). Interestingly, the plots for PC2, PC3, and PC4 revealed clear clustering of reference populations, with the participant population closely aligning with Western African populations, particularly ESN and YRI (\u003cstrong\u003eFigure 4C; D)\u003c/strong\u003e. As we applied HWE cutoff 0.4, clustering patterns along PC1 remain the same. However, PC2, PC3, and PC4 clearly depicted the participant population's distribution across Western African populations, suggesting the filtration of probable artifacts from the data (\u003cstrong\u003eFigure 4E; F\u003c/strong\u003e). 
These findings emphasize the sensitivity of PCA clustering to HWE thresholds, illustrating how parameter selection can impact population differentiation and the identification of genetic structure.\u003c/p\u003e\n\u003cp\u003eAncestry proportion showed largely East African ancestry in the East African group and largely West African ancestry in the West African group. The MECA participants exhibited a mixed ancestry pattern, with a predominant contribution from West African populations and minimal contribution from East African populations (\u003cstrong\u003eFigure 5).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGenes Correlated to Ancestry\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;\u003c/strong\u003eCorrelation analysis between CPM values and ancestry fractions of European, African, and American populations did not reveal any significant associations. However, further analysis focusing specifically on Sub-Saharan African ancestry uncovered distinct gene sets associated with East and West African ancestry fractions. Among the 13,566 significant genes, 54 genes exhibited positive correlations with West African ancestry, while 10,488 genes showed positive correlations with East African ancestry. We reported top 50 genes with the highest positive and negative correlations with East and West African ancestry (\u003cstrong\u003eFigure 6)\u003c/strong\u003e. 
Moreover, the genes with the strongest correlations (both positive and negative) demonstrated highly significant −log10(adjusted p-values), further emphasizing their robust association with the respective ancestry fractions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePerformance accuracy with dataset 2\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe analyzed the data from Visit 2 participants that were aligned closely with their corresponding groups in Visit 1, validating the method's accuracy in assigning ancestry (\u003cstrong\u003eFigure 7A\u003c/strong\u003e). African ancestry dominates both visits, with 106 and 108 participants, respectively, while European and American ancestries show negligible representation. (\u003cstrong\u003eFigure 7B\u003c/strong\u003e). The decrease in the number of European participants from Visit 1 to Visit 2 could be attributed to technical variations or sample reprocessing. The near-identical distributions reinforce the reproducibility of the methodology when applied to subsets of the same population. PCA plots for Sub-Saharan African ancestry results aligns well with the findings from the Visit 1 dataset (\u003cstrong\u003eFigure 7C; D\u003c/strong\u003e). Ancestry fraction for visit 2 revealed the African ancestry (\u003cstrong\u003eFigure 7E\u003c/strong\u003e) and West-African ancestry (\u003cstrong\u003eFigure 7F\u003c/strong\u003e) as primary component with minimal contributions from other ancestries. This uniform pattern across visits underscores the method's reliability in detecting and quantifying ancestry proportions consistently.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDISCUSSION\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe application of bioinformatics tools and computational approaches in understanding genetic ancestry represents a significant advancement in the field of genomics. 
While genotyping arrays and whole-genome sequencing remain the gold standard for ancestry inference, RNA-Seq technologies have emerged as a complementary approach. These advancements are driven by their ability to generate high-throughput data, genome-wide coverage, enhanced accuracy, ability to analyze admixed population, flexibility in analysis and adaptability for studying ancient DNA [6, 18-20]. This leads to enhance detection of complex genetic variations, including structural variants, haplotype phasing, and isoform diversity, which are crucial for accurate ancestry analysis. Although RNA-Seq does not provide the same variant resolution as traditional methods, its widespread availability and routine use in transcriptomic studies make it a valuable resource for ancestry analysis.\u003c/p\u003e\n\u003cp\u003eBy leveraging genetic variants embedded within transcriptomic data, RNAScope-Ancestry allows for ancestry inference without the need for additional genomic sequencing, making it a cost-effective alternative in studies where DNA-based data may not be available. Furthermore, its application extends to admixed populations, where it can aid in identifying population-specific expression patterns influenced by genetic ancestry. This approach also facilitates the investigation of how ancestry-related genetic factors contribute to gene regulation, enhancing our understanding of complex traits and disease susceptibilities across diverse populations.\u003c/p\u003e\n\u003cp\u003eWe aimed to explore ancestry-driven cardiovascular insights in participants of MECA study using short-read RNA-Seq data. To achieve this, we developed a systematic workflow to assess and stratify the genetic makeup of MECA participants against reference populations from Admixed, Europe, America, and Africa. Additionally, we focused on their Sub-Saharan African genetic background by comparing the cohort with various African reference populations. 
Our approach was further validated using data from participants who attended a second visit, as well as a subset from the initial visit. Principal Component Analysis (PCA) revealed distinct clustering patterns, indicating a predominance of African ancestry within our MECA participant cohort.\u003c/p\u003e\n\u003cp\u003eThe study revealed a diverse genetic makeup among MECA participants, who clustered near the African and Admixed groups. At the sub-African level, we observed that MECA participants predominantly clustered with West African (Niger-Congo) ancestry, along with a minor East African component. The East African component primarily reflects Bantu ancestry, which originated from the southwestern regions of Africa during the Bantu expansion, not directly from East Africa. The LWK sample from Kenya, which was used to represent East African ancestry, is more accurately described as Bantu due to its linguistic and genetic connection to the Bantu-speaking populations of Central and Southern Africa. This distinction is important because most enslaved Africans brought to the Americas came from West and Central Africa, not East Africa. Therefore, while it is correct to label the LWK sample as East African geographically, referring to it as Bantu ancestry provides a more accurate historical and genetic context. This clarification aligns with the historical migration patterns and enhances the precision of our findings, reflecting the complex heritage of African Americans.\u003c/p\u003e\n\u003cp\u003eCorrelation analysis identified the top 50 genes\u0026nbsp;significantly correlated with sub-Saharan African ancestries. Notably, the genes exhibit a positive correlation with East African ancestry and an equally strong negative correlation with West African ancestry, with adjusted p-values indicating high statistical significance (\u003cstrong\u003eSuppl. Fig. 1\u003c/strong\u003e). 
The observed correlations follow a predictable pattern due to the inherent non-independence of ancestry components. Since the sum of ancestry proportions is constrained to 100%, a positive correlation for one of two ancestry components necessarily implies a negative correlation for the other. This non-independence acts as an internal control, ensuring that our results are consistent and correctly interpreted. Such patterns have been noted in previous ancestry studies and are a fundamental property of proportional data. To address this dependency, we recalculated East and West African ancestry proportions as unscaled absolute values by multiplying total African ancestry by their respective relative proportions, thereby reducing artificial dependencies and preserving the natural structure of the data (\u003cstrong\u003eFigure 6\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eWe validated our RNAScope-Ancestry pipeline using data from MECA participants who returned for a second visit (Visit 2, n = 109).\u0026nbsp;The validation supports the reliability of our approach, which reproduces consistent ancestry distributions across related cohorts and self-identified ethnic backgrounds. Although all participants self-identified as African American, our genetic ancestry analysis revealed substantial sub-continental African heterogeneity, including contributions from West African, East African, and European ancestries. These findings support the observation that self-identified race does not necessarily align with genetic ancestry. Therefore, biomedical research relying solely on race labels may overlook biologically relevant ancestry-related variation. 
Integrating ancestry analysis provides a more accurate framework for interpreting gene expression patterns in diverse populations [21].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCHALLENGES:\u003c/strong\u003e The genetic ancestry analysis of MECA samples using short-read RNA-seq data presented several challenges, particularly in achieving proper clustering in the PCA plots along PC1. Initial PCA plots exhibited clear clustering for reference populations; however, the MECA samples displayed an unusual spread, indicating potential noise or confounding factors in the dataset (\u003cstrong\u003eSuppl. Fig. 2A; B\u003c/strong\u003e). To address this, we implemented a series of filtering steps, beginning with LD pruning. Despite this, PC1 and PC2 remained suboptimal for the MECA samples (\u003cstrong\u003eSuppl. Fig. 2C\u003c/strong\u003e), while PC1 and PC3 demonstrated better clustering with some residual noise (\u003cstrong\u003eSuppl. Fig. 2D\u003c/strong\u003e). We then filtered out variants with a missing genotype rate \u0026gt;10% across all samples and variants failing the Hardy-Weinberg equilibrium (HWE) threshold (1x10\u003csup\u003e-4\u003c/sup\u003e). These adjustments effectively removed low-quality variants and potential artifacts, thereby enhancing the resolution of genetic structure (\u003cstrong\u003eFigure 3A\u003c/strong\u003e). The improved separation in PCA plots following LD pruning and HWE filtering highlights the importance of minimizing noise and ensuring data quality.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe next challenge arose in the sub-Saharan African ancestry analysis, likely because most common variants are shared across these populations. Initial PCA results revealed dispersed clustering of MECA samples (\u003cstrong\u003eSuppl. Fig. 3\u003c/strong\u003e), which suggested the need to incorporate rare variants to improve clustering accuracy. 
To address this, imputation was incorporated into the pipeline, which improved clustering on PCA plots, particularly for PC2, PC3, and PC4 (\u003cstrong\u003eFigure 4C-D\u003c/strong\u003e). Further refinement with HWE thresholds resulted in better alignment of participant samples with reference populations (\u003cstrong\u003eFigure 4E; F\u003c/strong\u003e). Notably, the MECA query samples appear to shift towards the European pole along PC1 (\u003cstrong\u003eFigure 4A-B\u003c/strong\u003e). This shift is plausible, as these samples might have a genetic composition that is partly aligned with European populations. As a separate query group, the MECA samples are not admixed per se but may share genetic features associated with European ancestry, which is reflected in their position along the principal component axis. The shift along PC1 is consistent with our results and underscores the importance of considering the genetic background of the query samples in interpreting the PCA plots.\u003c/p\u003e\n\u003cp\u003eWhile genome-wide studies generally require hundreds of thousands to millions of variants, our analysis demonstrated that even a small subset of variants (n\u0026lt;500) can effectively capture significant ancestry-informative signals, particularly in admixed populations. Winkler et al. highlighted that a sufficient number of ancestry-informative markers can effectively support genome-wide scans for disease associations, particularly in admixed populations [22]. Notably, the alignment of PCA clustering with self-identified ethnicities underscores the utility of this approach, suggesting that a well-curated set of variants can provide meaningful insights into genetic ancestry. This finding highlights the potential of short-read RNA sequencing data, which, despite its lower variant count, can yield valuable genetic information. 
Our results advocate for the use of smaller variant panels in genetic ancestry studies, especially when cost-effective methods like short-read RNA sequencing are preferred over whole-genome sequencing, offering an accessible and efficient alternative for research on genetic admixture.\u003c/p\u003e\n\u003cp\u003eThese results also have profound implications for genetic epidemiology and precision medicine. Given that MECA study participants exhibit genetic affinities with specific West African population, it is crucial to acknowledge these subpopulations in future genetic studies and healthcare applications. Furthermore, the genetic data presented here highlights the importance of utilizing ancestry-informative markers in Mende and Yoruba West groups that could be valuable for refining genetic models to predict disease risk and enabling more targeted and personalized healthcare strategies. Additionally, this study also highlights the need to diversify genomic databases to better represent African genetic diversity, as African populations have been historically underrepresented. Expanding the inclusion of groups will improve the generalizability of research findings and help create more accurate genetic models, benefiting African descent populations worldwide.\u003c/p\u003e\n\u003cp\u003eWhile this study utilizes RNA-seq data to infer genetic ancestry, it is important to acknowledge that RNA-seq is primarily designed for gene expression analysis rather than ancestry estimation. Unlike genotyping arrays or whole-genome sequencing, RNA-seq-based ancestry inference may be influenced by several factors, including expression variability, batch effects, and sample-specific biases. Additionally, gene expression is context-dependent and may vary across tissues, environmental conditions, and disease states, which could introduce variability in ancestry estimations. 
Limited reference populations may not fully represent East African diversity, highlighting the need for broader datasets in genetic databases. The use of a small number of variants may reduce the resolution of ancestry inference compared to studies utilizing high-density genotype data. Replicating these findings in independent datasets while addressing these limitations will strengthen the evidence for ancestry, ancestry-related expression differences, and their health implications.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eDespite these limitations, our findings exhibit strong consistency with established ancestry patterns, reinforcing the robustness of the approach. The workflow demonstrates the efficacy of filtering and utilizing high-quality SNPs to infer genetic ancestry in a diverse cohort of MECA participants. It highlights genetic variation that influences gene expression, offering insights into the biological processes linked to genetic ancestry. It generates consistent results from one dataset to another, which makes it more reliable when analyzing diverse populations. It allows the integration of ancestry results with gene expression analysis, which provides a complementary layer of information that makes short-read RNA-seq particularly useful in studying traits or diseases influenced by regulatory mechanisms. By integrating with genetic data, the workflow can provide a more comprehensive understanding of how ancestral origins shape gene expression, offering a deeper perspective on the complex interplay between genetics and phenotype.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eRNAScope-Ancestry demonstrates that short-read RNA-seq can be repurposed beyond transcriptomics to provide reliable genetic ancestry inference, linking population structure with gene expression. 
By validating the pipeline in MECA participants and ensuring reproducibility across visits, our framework establishes a scalable, generalizable approach for ancestry-aware analyses, particularly in underrepresented populations. This method enables integrative, dual-purpose studies that can advance precision medicine, population genetics, and complex disease research using existing RNA-seq datasets.\u003c/p\u003e\n"},{"header":"METHODOLOGY","content":"\u003cp\u003eThe RNAScope-Ancestry pipeline starts with sample collection and RNA-Seq library preparation. \u003cstrong\u003eFigure 1\u003c/strong\u003e highlights the steps of the protocol. We applied this framework to the available participants of the Morehouse-Emory Cardiovascular Center for Health Equity (MECA) study [7]. The Visit 1 dataset was used for ancestry estimation, while the Visit 2 dataset (comprising subsequent visits of participants already included in the Visit 1 dataset) was used to validate the findings. \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eData and Sample Collection\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA multi-faceted approach was employed to investigate cardiovascular health disparities among Black adults in Atlanta, classifying neighborhoods as \u0026quot;at-risk\u0026quot; or \u0026quot;resilient\u0026quot; based on hospitalization for cardiovascular outcomes [8]. Over 1,400 individuals were surveyed, with 599 participants undergoing clinical evaluations and blood collection across two visits (Visit 1: n=490; Visit 2: n=109). All eligible participants in this cohort self-identified as Black or African American adults, aged 18+, residing in targeted neighborhoods, with exclusions for non-residency, inability to consent, or conditions interfering with assessments [8]. Additionally, for the intervention subset, individuals who could not adhere to the eHealth tools or coaching protocols due to technological or other barriers were excluded. 
These criteria ensured a representative sample while maintaining the study\u0026apos;s scientific rigor and relevance. This protocol ensured a robust cohort for examining genetic and environmental factors influencing cardiovascular risk and resilience.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRNA-Seq Library Preparation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBlood was collected into PAXgene Blood RNA tubes (Pre Analytix, Qiagen) and the RNA was extracted using the MagMAX for Stabilized Blood Tubes RNA Isolation Kit, compatible with PAXgene Blood RNA Tubes (ThermoFisher Scientific). RNA quality was assessed using a Fragment Analyzer (Agilent) and then one microgram of total RNA was subjected to globin transcript depletion using the GLOBINclear Kit, human (ThermoFisher Scientific). Ten nanograms of the globin-depleted RNA were used as input for cDNA synthesis using the Clontech SMART-Seq v4 Ultra Low Input RNA kit (Takara Bio) according to the manufacturer\u0026rsquo;s instructions. Amplified cDNA was fragmented and appended with dual-indexed bar codes using the Nextera XT DNA Library Preparation kit (Illumina). Libraries were validated by capillary electrophoresis on a TapeStation 4200 (Agilent), pooled at equimolar concentrations, and sequenced with PE100 reads on an Illumina NovaSeq 6000, yielding ~30 million reads per sample on average.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eData Alignment and Variant Calling\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe sequenced data were trimmed (TrimGalore v0.6.4) and aligned to the GRCh38 human reference genome (STAR v2.7.3a and Bowtie2 v2.3.5.1) [9, 10]. Aligned BAM files were sorted and indexed (SAMtools v1.10) [11]. Variant calling was performed (GATK pipeline) [12]. PCR duplicates were marked (MarkDuplicates), spliced alignments processed (SplitNCigarReads), and base quality scores were recalibrated (BaseRecalibrator) to correct systematic sequencing errors. 
HaplotypeCaller was used in GVCF mode for variant calling, and the resulting GVCF files were jointly genotyped from GenomicsDB (GenotypeGVCFs). Variants were then filtered by variant quality score recalibration (ApplyVQSR).\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003ePreprocessing of MECA and Reference Samples\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eVariants were annotated (bcftools) with databases such as dbSNP138, HapMap, and Mills and 1000G indels. Quality filters were applied, retaining only non-singleton variants with quality scores \u0026gt; 30 and depth \u0026gt; 10. Normalization was conducted to ensure consistent representation of variants. Genotype data from the MECA samples were designated as \u0026ldquo;query samples.\u0026rdquo; \u003c/p\u003e\n\u003cp\u003eReference data (African, American, European, and Admixed; n = 1249) from the 1000 Genomes Project served as \u0026ldquo;reference samples\u0026rdquo; (\u003cstrong\u003eTable 1\u003c/strong\u003e). Native American ancestry is represented by Peruvian (PEL) samples, which have been shown to carry a high proportion of Native American ancestry (\u0026gt;80%) as reported in Conley et al., 2023 [13]. African ancestry is represented by Esan (ESN), Gambian (GWD), Luhya (LWK), Mende (MSL), and Yoruba (YRI) samples. European ancestry is represented by Utah residents (CEU), Finnish (FIN), British (GBR), Iberian (IBS), and Toscani (TSI) populations. Admixed American populations, including African Caribbean in Barbados (ACB) and African Americans in the Southwest U.S. (ASW), are categorized as such due to their known two-way African-European admixture. While these populations are sometimes classified under African ancestry in other studies, we use the admixed category to better reflect their genetic background. 
This classification allows for a more accurate interpretation of genetic structure in our study cohort.\u003c/p\u003e\n\u003cp\u003eChromosome names were standardized, and variants were normalized to a biallelic format. Quality control steps removed variants with minor allele frequency (MAF) \u0026lt; 0.05, genotype missingness above 0.1, or Hardy-Weinberg equilibrium P \u0026lt; 1 \u0026times; 10\u003csup\u003e-4\u003c/sup\u003e. To reduce redundant variants, linkage disequilibrium (LD) pruning was performed (100:50:0.5). The resulting intersected variants from query and reference datasets were merged and converted into PLINK binary format (\u003cstrong\u003eScript\u003c/strong\u003e).\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eAncestry Estimation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePCA was conducted using PLINK (v2.0) to detect genetic structure among MECA participants [14]. Ancestry proportions were then estimated with Rapid ancestrY Estimation (Rye), an efficient algorithm that leverages principal components for robust ancestry inference [13]. The final dataset included reference groups from the 1000 Genomes Project categorized as African, European, American, and Admixed. A population-to-group mapping file aggregated populations into these groups: African (ESN, GWD, LWK, MSL, YRI), European (CEU, FIN, GBR, IBS, TSI), American (PEL), and Admixed (ACB, ASW) (\u003cstrong\u003eTable 1\u003c/strong\u003e). Rye analysis used 30 principal components, allowing robust ancestry fraction estimation for each participant in the MECA study. \u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSub-African Ancestry Analysis \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePhasing and genotype imputation were performed using Beagle v5.1 for the Sub-Saharan African ancestry analysis of the MECA samples [15]. Phasing involved the reconstruction of chromosomal phases for variants, which improved the accuracy of imputation. 
Imputation used the phased haplotypes to infer missing genotypes by referencing haplotypes from the 1000 Genomes Project, resulting in a more complete dataset for downstream ancestry analysis. Sub-Saharan African populations, namely West African (ESN, YRI, GWD, MSL) and East African (LWK), were used as references (\u003cstrong\u003eTable 1\u003c/strong\u003e). The quality control process included filtering, linkage disequilibrium (LD) pruning, merging of common variants, and principal component analysis (PCA). Ancestry proportions were calculated using Rye. \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eGene Expression Correlation Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNext, we aimed to identify genes significantly correlated with ancestry fractions in the dataset. Libraries were quantified using StringTie2 [16]. Gene expression was calculated as fragments per kilobase of transcript per million mapped reads (FPKM), normalized using the trimmed mean method. The counts per million (CPM) were computed for each gene across all samples. The CPM matrix was then used to calculate Spearman correlation coefficients between CPM values and ancestry fractions. Since ancestry proportions are constrained to sum to 100%, the correlations for East African ancestry would be the exact inverse of those for West African ancestry. To address potential dependencies between ancestry components, we calculated East and West African ancestry as unscaled absolute values based on the total African ancestry and their relative proportions. To account for multiple testing, adjusted p-values were determined using the Benjamini-Hochberg method [17]. 
For visualization, we selected the top 50 genes with the highest positive and strongest negative correlations.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eMethod Validation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe validated our approach for ancestry analysis by applying it to the Visit 2 dataset participants. The same parameters were used to check the reproducibility of the proposed approach. Approach accuracy was measured by comparing the ancestry PCA plot and fractions calculated for participants of visit 1 to visit 2. \u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENT\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNext generation sequencing services were provided by the Emory NPRC Genomics Core which is supported in part by NIH P51 OD011132. Sequencing data was acquired on an Illumina NovaSeq6000 funded by NIH S10 OD026799. \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eFUNDING\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work is supported by R01NS112422 (PI Meller), RM1HG012334 (PI Meller) and U54HG013595-01 (NIH/NIGMS). The sponsors, which are public or nonprofit organizations dedicated to general science, weren\u0026rsquo;t involved in the collection, analysis, or interpretation of the data. The content of this publication does not necessarily represent the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organizations imply endorsement by the U.S. government. The authors express their gratitude to the study volunteers and the staff at the GRA clinical center at Emory University Hospital\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eDATA AND CODE AVAILABILITY STATEMENT\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR and Linux based codes for the project are available on the github page (https://github.com/Vermarashi/; https://github.com/rob-meller/). 
Genomic data will be available in dbGAP (awaiting id number). The PI (Dr. Meller) can be contacted to request data and codes.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCONFLICT OF INTEREST DISCLOSURE\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eETHICAL STATEMENT\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe protocol was approved by the Institutional Review Boards at Morehouse School of Medicine and Emory University (IRB00083584). All participants included in the study gave their informed consent.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCONTRIBUTIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR.V., R.M., and I.K.J. contributed to the conception and design of the study and supervised the analyses. P.P., D.J., A.A.Q., and H.T. conceived the MECA project and recruited participants. R.V. and S.S. performed data processing and analyses. R.V. drafted the manuscript and prepared the figures. H.N.B. and E.G. contributed statistical and bioinformatics expertise and critically reviewed the analyses. A.P., C.D.S., P.B., and T.L. revised the manuscript for important intellectual content. All authors reviewed the manuscript, contributed to its revision, and approved the final version.\u003c/p\u003e\n"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eLi, J.Z., et al., Worldwide human relationships inferred from genome-wide patterns of variation. Science, 2008. 319(5866): p. 1100-4.\u003c/li\u003e\n\u003cli\u003ePritchard, J.K., M. Stephens, and P. Donnelly, Inference of population structure using multilocus genotype data. Genetics, 2000. 155(2): p. 945-59.\u003c/li\u003e\n\u003cli\u003eMontgomery, S.B., et al., Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 2010. 464(7289): p. 
773-7.\u003c/li\u003e\n\u003cli\u003eBelleau, P., et al., Correction: Genetic Ancestry Inference from Cancer-Derived Molecular Data across Genomic and Transcriptomic Platforms. Cancer Res, 2023. 83(2): p. 347.\u003c/li\u003e\n\u003cli\u003eFachrul, M., et al., Direct inference and control of genetic population structure from RNA sequencing data. Commun Biol, 2023. 6(1): p. 804.\u003c/li\u003e\n\u003cli\u003eBarral-Arca, R., et al., Ancestry patterns inferred from massive RNA-seq data. RNA, 2019. 25(7): p. 857-868.\u003c/li\u003e\n\u003cli\u003eScheepers, B., P. Clough, and C. Pickles, The misdiagnosis of epilepsy: findings of a population study. Seizure, 1998. 7(5): p. 403-6.\u003c/li\u003e\n\u003cli\u003eIslam, S.J., et al., Cardiovascular Risk and Resilience Among Black Adults: Rationale and Design of the MECA Study. J Am Heart Assoc, 2020. 9(9): p. e015247.\u003c/li\u003e\n\u003cli\u003eDobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.\u003c/li\u003e\n\u003cli\u003eLangmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods, 2012. 9(4): p. 357-9.\u003c/li\u003e\n\u003cli\u003eDanecek, P., et al., Twelve years of SAMtools and BCFtools. Gigascience, 2021. 10(2).\u003c/li\u003e\n\u003cli\u003eMcKenna, A., et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 2010. 20(9): p. 1297-303.\u003c/li\u003e\n\u003cli\u003eConley, A.B., et al., Rye: genetic ancestry inference at biobank scale. Nucleic Acids Res, 2023. 51(8): p. e44.\u003c/li\u003e\n\u003cli\u003ePurcell, S., et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 2007. 81(3): p. 559-75.\u003c/li\u003e\n\u003cli\u003eBrowning, B.L. and S.R. Browning, Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 2016. 98(1): p. 
116-26.\u003c/li\u003e\n\u003cli\u003eKovaka, S., et al., Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol, 2019. 20(1): p. 278.\u003c/li\u003e\n\u003cli\u003eGlickman, M.E., S.R. Rao, and M.R. Schultz, False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. J Clin Epidemiol, 2014. 67(8): p. 850-7.\u003c/li\u003e\n\u003cli\u003eDeshpande, D., et al., RNA-seq data science: From raw data to effective interpretation. Front Genet, 2023. 14: p. 997383.\u003c/li\u003e\n\u003cli\u003eWang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63.\u003c/li\u003e\n\u003cli\u003eSmith, O., et al., Ancient RNA from Late Pleistocene permafrost and historical canids shows tissue-specific transcriptome survival. PLoS Biol, 2019. 17(7): p. e3000166.\u003c/li\u003e\n\u003cli\u003eGouveia MH, Meeks KAC, Borda V, Leal TP, Kehdy FSG, Mogire R, Doumatey AP, Tarazona-Santos E, Kittles RA, Mata IF, O\u0026apos;Connor TD, Adeyemo AA, Shriner D, Rotimi CN. Subcontinental genetic variation in the All of Us Research Program: Implications for biomedical research. Am J Hum Genet. 2025 Jun 5;112(6):1286-1301. doi: 10.1016/j.ajhg.2025.04.012.\u003c/li\u003e\n\u003cli\u003eWinkler, C.A., G.W. Nelson, and M.W. Smith, Admixture mapping comes of age. Annu Rev Genomics Hum Genet, 2010. 11: p. 
65-89.\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Table 1","content":"\u003cp\u003eTable 1 is available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Genetic Ancestry, Cardiovascular Disease, Short-Read RNA-Seq, Race, Ethnicity","lastPublishedDoi":"10.21203/rs.3.rs-7801062/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7801062/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"Genetic ancestry inference traditionally relies on genotyping arrays or whole-genome sequencing. We present RNAScope-Ancestry, a computational pipeline that leverages short-read RNA-seq for dual-purpose ancestry and transcriptomic analysis. Using RNA-seq from 490 MECA participants and 1000 Genomes reference populations, we performed variant calling, quality filtering, principal component analysis and retained 230 high-frequency SNPs for ancestry estimation via Rye algorithm for ancestry inferences. MECA participants aligned with African and admixed populations, predominantly West African. 
Correlation with gene expression identified the top 50 ancestry-associated transcripts. Validation in 109 longitudinal samples confirmed reproducibility. The pipeline is open-source and generalizable: https://github.com/rob-meller/; https://github.com/Vermarashi/.","manuscriptTitle":"RNAScope-Ancestry: A Cross-Modality Framework for Inferring Genetic Ancestry from RNA-Seq with Application to MECA","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-08 17:10:37","doi":"10.21203/rs.3.rs-7801062/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"00d815d1-3753-4abd-82e3-87479cc97428","owner":[],"postedDate":"March 8th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-03-17T18:35:17+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-08 17:10:37","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7801062","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7801062","identity":"rs-7801062","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
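The Gene Expression Correlation Analysis step in the methodology above describes two small computations: rescaling East and West African proportions into unscaled absolute values, and Benjamini-Hochberg adjustment of p-values. The following is a minimal sketch of both in plain Python, not the authors' code. The function names `unscaled_fractions` and `benjamini_hochberg` and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def unscaled_fractions(total_african: float, rel_west: float,
                       rel_east: float) -> Tuple[float, float]:
    """Scale sub-African relative proportions by the total African fraction.

    rel_west and rel_east are assumed to be the relative proportions from the
    sub-African analysis (summing to 1). The returned values are absolute
    fractions, which no longer sum to a fixed constant across participants.
    """
    if abs(rel_west + rel_east - 1.0) > 1e-6:
        raise ValueError("relative proportions are expected to sum to 1")
    return total_african * rel_west, total_african * rel_east


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical participant: 80% African, split 90/10 West/East.
    west, east = unscaled_fractions(total_african=0.80, rel_west=0.90, rel_east=0.10)
    print(west, east)
    # Hypothetical raw p-values for four genes.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the rescaled values depend on each participant's total African fraction, East and West African values are no longer forced to be exact mirror images of each other, which is the dependency the methodology sets out to address.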
