A Stacking Framework for Polygenic Risk Prediction in Admixed Individuals

doi:10.1101/2024.01.31.24302103

2024 · doi:10.1101/2024.01.31.24302103

preprint OA: closed CC-BY-ND-4.0

Methods

have been proposed that 1) consider local ancestry by matching chosen risk variants with an individual's local ancestry at that position [23,27] and 2) ignore local ancestry and construct a joint PRS as a linear combination of global European and African PRS [28]. Cavazos and Witte comprehensively reviewed both approaches in simulations [24]. While the first approach, deconvoluting ancestry and matching risk variants to population-specific GWAS effect sizes, was initially suggested to perform well [27], this result failed to replicate consistently in Cavazos' simulations and Bitarello's real data application [24,27,28]. Surprisingly, the second approach, which ignores local ancestry information (a linear combination of global European and African PRS), was found to efficiently optimize prediction across a range of European ancestry quantiles in admixed African American individuals. However, global population-specific PRS ignore the unique local admixture present in any given region within a sample of admixed individuals, missing potential population-specific risk variants in a region or local GxG interactions on a specific ancestral background. Thus, it is possible that the performance of local population-specific PRS (i.e., a PRS using only risk variants in a genomic region and a specific population's GWAS effect sizes) will vary across admixed individuals.

In this work we propose slaPRS (stacking local ancestry PRS), a novel stacking framework for constructing admixed PRS for quantitative traits that combines local population-specific PRS built from population-specific effect sizes in local genomic regions. Stacking is an ensemble machine learning method that aims to optimize prediction accuracy by combining separate prediction models [29,30].
In target samples of a single ancestry, Privé et al. successfully used stacking to optimize the commonly used clumping and thresholding (C+T) PRS method by deriving a linear combination of PRS across all possible parameters, rather than learning a single set of optimal parameters [31]. Outside of PRS construction, stacking has been used in other genetic methods such as the recent REGENIE method for GWAS, which improved computational efficiency by orders of magnitude by conditioning on the predicted individual trait values from combining local polygenic risk predictors [32]. In our approach, we first divide the genome into windows of a predetermined size and, in each local window, compute population-specific local PRS using the respective population-specific GWAS effect sizes via C+T. In training data, we then fit a penalized regression model to combine local population-specific PRS across the genome and determine weights that are used to predict the phenotype in testing data. We show in extensive simulations and a real data application to admixed African Americans and African British individuals that slaPRS removes the ancestry dependence of PRS performance present in traditional single-population GWAS PRS and outperforms or performs comparably to existing methods in an efficient, data-driven process.

medRxiv preprint, this version posted February 3, 2024; doi: https://doi.org/10.1101/2024.01.31.24302103. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
1.3 Methods

Consider a sample of N admixed individuals with ancestral contributions from populations A and B (slaPRS is not restricted to two-way genetic admixture, but two-way admixture is assumed here for notational simplicity). Let X be the $N \times M$ admixed genotype matrix (M is the total number of variants genome-wide) and Y the $N \times 1$ phenotype vector. Let $L$ be an $N \times M$ matrix whose entry $L_{ij}$ denotes the haplotype-level local ancestry ($l_{ij1}$, $l_{ij2}$) of individual $i$ at marker $j$. We assume the phenotype can be expressed as:

$$Y_i = \sum_{j=1}^{M} X_{ij}\, f(\beta_{A_j}, \beta_{B_j}, L_{ij}) + \epsilon_i$$

where $X_{ij}$ is the genotype dosage for individual $i$ at marker $j$, and $\beta_{A_j}$, $\beta_{B_j}$ are the effects of marker $j$ on the phenotype in populations A and B, respectively.
Here, $f(\beta_{A_j}, \beta_{B_j}, L_{ij})$ is a weighted average of population-specific GWAS effect sizes and local ancestry (see supplementary for derivation):

$$f(\beta_{A_j}, \beta_{B_j}, L_{ij}) = \beta_{A_j}\left(w_{k,\beta_{A_j}} + w^{(A)}_{k,L_{ij}} L_{ij}\right) + \beta_{B_j}\left(w_{k,\beta_{B_j}} + w^{(B)}_{k,L_{ij}} L_{ij}\right)$$

where $w_{k,\beta_{A_j}}$ and $w^{(A)}_{k,L_{ij}}$ (and similarly for population B) are the weights for the population A effect size $\beta_{A_j}$ and its local ancestry interaction in genomic region $k$, learned via ensemble learning (stacking) in the slaPRS framework (see details below).

1.3.1 slaPRS Framework

We developed slaPRS for constructing admixed PRS using three main features: 1) a local window approach, 2) local population-specific PRS, and 3) an ensemble stacking framework to combine local population-specific PRS. For slaPRS, we assume that GWAS effect size estimates exist for each ancestral population of an admixed population. We first partition the admixed genotype matrix into K non-overlapping genotype blocks $G = \{G_1, G_2, \ldots, G_K\}$, with blocks predefined by physical distance.
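The partition into blocks by physical distance can be illustrated with a minimal sketch. It assumes fixed-width windows counted from position 0 of a single chromosome; the function name `assign_blocks`, the toy positions, and the window anchoring are illustrative assumptions and are not taken from the slaPRS implementation.

```python
import numpy as np

def assign_blocks(positions_bp, block_size_bp=5_000_000):
    """Return a consecutive block index (0..K-1) for each variant on one chromosome."""
    positions_bp = np.asarray(positions_bp)
    window = positions_bp // block_size_bp  # fixed-width windows counted from position 0
    # Relabel so that empty windows do not leave gaps in the block numbering.
    _, block_index = np.unique(window, return_inverse=True)
    return block_index

# Toy base-pair positions; the values are illustrative only.
positions = [120_000, 4_900_000, 5_000_001, 17_300_000]
block_index = assign_blocks(positions)             # 5Mb windows
m_k = np.bincount(block_index)                     # number of SNPs in each block
fine_index = assign_blocks(positions, 1_000_000)   # 1Mb windows
```

Because every variant falls in exactly one window, the per-block SNP counts `m_k` always sum to the total number of variants.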
In our analysis we considered blocks spanning 1Mb and 5Mb of physical distance, each with $m_k$ SNPs such that $\sum_{k=1}^{K} m_k = M$.

Level 0 Local Population-Specific PRS and Ancestry

In the training set of admixed individuals, in each block $G_k$ across the genome (using the $m_k$ SNPs in the block) we first separately computed vectors of local population A PRS ($A_k$) and local population B PRS ($B_k$) using clumping and thresholding (C+T). While C+T was used in slaPRS, any PRS construction method could be used in our framework. In this step, each block's C+T-optimized ancestry PRS can be viewed as a level 0 model prediction to be stacked in our stacking framework (Figure 1). Clumping first removes variants in strong LD with others using in-sample LD for that region, while greedily retaining the most significant variants [33]. Varying p-value thresholds $p = \{5 \times 10^{-2}, 5 \times 10^{-4}, 5 \times 10^{-6}, 5 \times 10^{-8}\}$ were considered (cross validation in the level 1 stacking model was used to select the optimal $p$ to use in the testing set) to construct ancestry-specific local PRS in each block using the respective population's estimated effect sizes. In this step, we make no assumption on whether risk variants are shared across ancestral populations, and thus local PRS $A_k$ and $B_k$ can be built from different risk variants.

Figure 1. Diagram of local window and level 0 population specific PRS model predictions.
Admixed genomes are split into 5Mb windows, and in each window local population A and B PRS are computed using population-specific effect sizes. Local ancestry is further computed to form a covariate vector for the level 1 stacking model.

For each sample, we computed the vector of local ancestries $Anc_k$ in block $k$ as the % of population A ancestry. We constructed interaction terms $A_k \times Anc_k$ and $B_k \times Anc_k$ to allow the effects of the local population PRS $A_k$ and $B_k$ to vary by ancestry. Following completion of level 0 in our framework, block $k$ has the covariates (Figure 1):

$$C_{B_k} = \{A_k, B_k, Anc_k, A_k \times Anc_k, B_k \times Anc_k\}$$

After aggregating the local block covariates across the genome, let C be the matrix with N rows:

$$C = [C_{B_1}, C_{B_2}, \ldots, C_{B_K}]$$

Level 1 Elastic Net Stacking Model

We then trained an elastic net [34] penalized regression model to stack the local level 0 predictions (local population-specific PRS and ancestry) across the genome. The population whose GWAS optimizes the local PRS can vary across the genome (see introduction) in an admixed sample, and stacking provides a data-driven approach to inform which population's local PRS should be upweighted or shrunk. We used elastic net, which combines ridge regression [35] and LASSO [36], because the genetic architecture of a trait is unknown a priori (it is unknown which local blocks harbor causal risk variants and how local block heritability is distributed). When most local windows are weakly informative, ridge tends to have higher prediction accuracy, while LASSO would likely outperform when only a small number of local windows are highly informative.
Elastic net allows a data-adaptive approach to inform the amount of shrinkage and whether shrinkage patterns should favor ridge or LASSO to best accommodate a trait's genetic architecture.

To determine which aspects of our stacking framework drive increases in PRS performance, we considered three level 1 elastic net stacking models that vary in the covariates included from block $k$:

1) Local population A PRS only: $C_{B_k} = \{A_k\}$
2) Local population A and B PRS only: $C_{B_k} = \{A_k, B_k\}$
3) Local population A and B PRS, ancestry, and interactions: $C_{B_k} = \{A_k, B_k, Anc_k, A_k \times Anc_k, B_k \times Anc_k\}$

Model 1 considered only the local population A PRS $A_k$ to investigate how stacking local PRS alone improves on a global population A PRS. Model 2 added the local population B PRS $B_k$ to assess the benefit of adding population B GWAS information, while Model 3 further included ancestry and interaction terms to allow the effect of a local population-specific PRS to vary with ancestral background. The total covariates in each proposed level 1 model aggregate the covariates $C_{B_k}$ across all blocks genome-wide.

For each considered model, we fit a level 1 elastic net model [34] to combine the level 0 ancestry-specific PRS and additional covariates across the genome.
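As an illustration of how the three covariate sets could be assembled, the sketch below stacks per-block columns into a genome-wide matrix. It assumes the local PRS and local ancestry have already been computed as N x K arrays; the function names and the random toy inputs are hypothetical and do not come from the slaPRS code.

```python
import numpy as np

def block_covariates(prs_a, prs_b, anc, model=3):
    """Columns contributed by one block under level 1 model 1, 2, or 3."""
    if model == 1:
        cols = [prs_a]
    elif model == 2:
        cols = [prs_a, prs_b]
    elif model == 3:
        # local PRS, local ancestry, and PRS-by-ancestry interaction terms
        cols = [prs_a, prs_b, anc, prs_a * anc, prs_b * anc]
    else:
        raise ValueError("model must be 1, 2, or 3")
    return np.column_stack(cols)

def genome_covariates(local_prs_a, local_prs_b, local_anc, model=3):
    """Concatenate the per-block covariates; every input is an N x K array."""
    n_blocks = local_prs_a.shape[1]
    return np.hstack([
        block_covariates(local_prs_a[:, k], local_prs_b[:, k], local_anc[:, k], model)
        for k in range(n_blocks)
    ])

# Toy inputs: N = 6 individuals, K = 3 blocks (random values, illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))     # local population A PRS per block
B = rng.normal(size=(6, 3))     # local population B PRS per block
Anc = rng.uniform(size=(6, 3))  # fraction of population A ancestry per block
C_full = genome_covariates(A, B, Anc, model=3)   # N x 5K
C_a_only = genome_covariates(A, B, Anc, model=1) # N x K
```

Under model 3 each block contributes five columns, so the number of level 1 covariates grows as 5K, which is one reason a penalized regression is a natural choice for the stacking step.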
The level 1 model is

$$Y = w_0 + \mathbf{w}_1 C_{B_1} + \mathbf{w}_2 C_{B_2} + \cdots + \mathbf{w}_K C_{B_K}$$

where $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K$ are vectors of regression coefficients for the covariates in $C_{B_k}$. Estimates of $\mathbf{w}_k$ in the above model, given the genome-wide covariate matrix, are obtained by minimizing the penalized objective function with respect to $w$:

$$\hat{w}(\lambda) = \arg\min_{w} \left( \left[ \sum_{i=1}^{N} (y_i - C_i w)^2 \right] + \left[ \lambda \left( \alpha \sum_{j} |w_j| + (1-\alpha) \sum_{j} w_j^2 \right) \right] \right)$$

where the sums over $j$ run over all coefficients in $w$. The parameter $\lambda$ determines the amount of shrinkage in the model coefficients, while $\alpha \in [0, 1]$ balances the L2 penalty of ridge regression ($\alpha = 0$) and the L1 penalty of LASSO ($\alpha = 1$). To optimize all parameters, including the p-value threshold $p = \{5 \times 10^{-4}, 5 \times 10^{-6}, 5 \times 10^{-8}\}$ used in constructing level 0 local ancestry PRS via C+T, $\alpha = \{0, 0.1, 0.2, \ldots, 1\}$, and $\lambda = \{10^{-3}, \ldots, 10^{3}\}$, we employed K-fold cross validation with 10 folds and selected the set of $p$, $\alpha$, and $\lambda$ that produced the highest adjusted $R^2$.
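A cross-validated grid search over the elastic net penalties can be sketched as follows. This is a minimal illustration under several assumptions: it uses a plain coordinate descent solver with a glmnet-style scaling of the objective (which differs from the formula above by constant factors), it scores folds by held-out $R^2$ rather than adjusted $R^2$, and it omits the outer loop over C+T p-value thresholds. The function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def elastic_net(C, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)||y - w0 - Cw||^2 + lam*(alpha*||w||_1 + (1-alpha)/2*||w||_2^2)."""
    n, p = C.shape
    mu, sd = C.mean(axis=0), C.std(axis=0)
    sd[sd == 0] = 1.0
    Z = (C - mu) / sd                   # standardized covariates
    w = np.zeros(p)
    r = y - y.mean()                    # residual when all weights are zero
    for _ in range(n_iter):
        for j in range(p):
            r = r + Z[:, j] * w[j]      # add back covariate j's contribution
            rho = Z[:, j] @ r / n
            w[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            r = r - Z[:, j] * w[j]
    w = w / sd                          # back to the original covariate scale
    return y.mean() - mu @ w, w

def cv_select(C, y, lams, alphas, n_folds=10, seed=0):
    """Return (mean held-out R^2, lam, alpha) for the best grid point."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best = (-np.inf, None, None)
    for alpha in alphas:
        for lam in lams:
            scores = []
            for test_idx in folds:
                train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
                w0, w = elastic_net(C[train_idx], y[train_idx], lam, alpha)
                pred = w0 + C[test_idx] @ w
                ss_res = np.sum((y[test_idx] - pred) ** 2)
                ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
                scores.append(1.0 - ss_res / ss_tot)
            if np.mean(scores) > best[0]:
                best = (float(np.mean(scores)), lam, alpha)
    return best

# Toy data: only the first covariate is informative.
rng = np.random.default_rng(0)
C = rng.normal(size=(200, 8))
y = 2.0 * C[:, 0] + rng.normal(size=200)
best_r2, best_lam, best_alpha = cv_select(
    C, y, lams=[0.01, 1.0, 10.0], alphas=[0.0, 0.5, 1.0], n_folds=5)
w0, w = elastic_net(C, y, best_lam, best_alpha)
prs_hat = w0 + C @ w                    # stacked prediction
```

In practice an established solver such as glmnet would normally be preferred; the hand-written solver here only keeps the example self-contained.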
Estimates of $\mathbf{w}_k$ for each block across the genome can be used (see supplementary for derivation) to express the weight of each variant in PRS construction as a linear combination of the population A ($\beta_{A_j}$) and B ($\beta_{B_j}$) GWAS effect sizes and the learned block weights:

$$Y_i = \sum_{j=1}^{M} X_{ij}\, f(\beta_{A_j}, \beta_{B_j}, L_{ij}) + \epsilon_i$$

$$f(\beta_{A_j}, \beta_{B_j}, L_{ij}) = \beta_{A_j}\left(w_{k,\beta_{A_j}} + w^{(A)}_{k,L_{ij}} L_{ij}\right) + \beta_{B_j}\left(w_{k,\beta_{B_j}} + w^{(B)}_{k,L_{ij}} L_{ij}\right)$$

where $w_{k,\beta_{A_j}}$ and $w^{(A)}_{k,L_{ij}}$ (and similarly for population B) are the weights for the population A specific local PRS $A_k$ and its local ancestry interaction term.

Once weights from the level 1 elastic net stacking models had been estimated from the training data, we computed the same level 0 model predictions and covariates in each block of the testing data and aggregated them genome-wide:

$$C = [C_{B_1}, C_{B_2}, \ldots, C_{B_K}]$$

where $C_{B_k}$ is defined by one of the three considered level 1 models.
We then predicted trait values using the estimated weights from the elastic net model:

$$\widehat{PRS} = C\hat{w}$$

The estimated PRS is then tested against simulated phenotypes or trait values in real data.

Genotype, Phenotype, and Population-Specific GWAS Simulation

For our simulations and real data applications we focused on admixed African Americans/British with European and African ancestral backgrounds. To simulate genotype and phenotype data for African and European populations with realistic allele frequencies and linkage disequilibrium patterns, we used the coalescent-based pipeline described by Martin et al. [16] and Cavazos et al. [16,24]. Using msprime [37] with an out-of-Africa demographic model, modeling HapMap [38] chromosome 20 haplotypes, we simulated n=10,000 European samples and varying African sample sizes n={2000, 5000, 10,000}. Simulated population-specific genotypes were then used to estimate marginal variant effect sizes. We then simulated quantitative trait phenotypes using the simulated genotypes. We first assumed complete transethnic sharing of genetic architecture across African and European populations, in which true causal variants, causal effect sizes, and overall heritability are consistent across populations. Under this scenario, the performance of estimated PRS should vary only because of differences in allele frequency and LD across populations. We subset variants with minor allele frequency > 5% in both populations and randomly sampled m={100, 500} shared causal variants.
True causal effect sizes were drawn from a normal distribution $\beta \sim N(0, h^2/m)$, where $h^2 = \{0.10, 0.30\}$ is the SNP-based heritability. In results, we focused on the most realistic simulation scenario, with $h^2 = 0.10$ and $m = 100$. We then considered the simulation scenario in which genetic architecture differs across ancestral populations by assuming that true causal variant locations and overall heritability are shared, but now simulating causal effects

$$\boldsymbol{\beta} \sim MVN\left(\mathbf{0}, \begin{bmatrix} h^2/m & \rho h^2/m \\ \rho h^2/m & h^2/m \end{bmatrix}\right)$$

with varying transethnic genetic correlation $\rho = \{0.20, 0.50, 0.80\}$. In both simulation scenarios, the true genetic score $G$ was then defined as the product of sampled causal genotypes and their respective simulated effect sizes ($g = \sum_{j=1}^{m} X_j \beta_j$), standardized to ensure a total heritability of $h^2$:

$$G = \frac{g - \mu_g}{\sigma_g} * h^2$$

We then simulated the environmental effect from a normal distribution with variance comprising the remaining phenotypic variance, $\epsilon \sim N(0, 1 - h^2)$, and similarly standardized:

$$E = \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} * (1 - h^2)$$

We defined the phenotype data Y for both populations as the sum of the standardized true genetic score and the environmental effect, $Y = G + E$. We then estimated effect sizes $\hat{\beta}$ for each variant genome-wide using a linear model $Y = X\beta + \epsilon$, using each population's respective simulated phenotype and genotype data.
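The phenotype simulation and marginal effect estimation can be sketched as below. Several choices here are assumptions of the sketch rather than details confirmed by the text: the standardized components are scaled by the square roots $\sqrt{h^2}$ and $\sqrt{1-h^2}$ so that $\mathrm{Var}(G) = h^2$ and $\mathrm{Var}(E) = 1 - h^2$, the genotypes are toy binomial draws rather than msprime output, and the function names are hypothetical.

```python
import numpy as np

def correlated_effects(m, h2, rho, rng):
    """Draw m pairs of population-specific effects with correlation rho."""
    v = h2 / m
    cov = np.array([[v, rho * v], [rho * v, v]])
    return rng.multivariate_normal(np.zeros(2), cov, size=m)  # shape (m, 2)

def simulate_phenotype(X_causal, h2, rng):
    """Simulate Y = G + E for causal genotypes X_causal (n x m)."""
    n, m = X_causal.shape
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)           # beta ~ N(0, h2/m)
    g = X_causal @ beta
    G = (g - g.mean()) / g.std() * np.sqrt(h2)                # assumed scaling: Var(G) = h2
    eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    E = (eps - eps.mean()) / eps.std() * np.sqrt(1.0 - h2)    # Var(E) = 1 - h2
    return G + E, G, beta

def marginal_effects(X, y):
    """Per-variant simple linear regression slope (marginal GWAS effect)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / np.sum(Xc ** 2, axis=0)

rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(2000, 100)).astype(float)     # toy genotype dosages
Y, G, beta = simulate_phenotype(X, h2=0.10, rng=rng)
beta_hat = marginal_effects(X, Y)
effects = correlated_effects(m=100, h2=0.10, rho=0.5, rng=rng)
```

With this scaling the realized ratio `G.var() / Y.var()` sits close to the target heritability, up to the small sample covariance between `G` and `E`.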
We additionally simulated n=1,000 European and n=1,000 African founder samples to simulate n=10,000 admixed African American genotypes via RFMix [39] with s=12 generations of admixture for training and testing slaPRS. Simulated admixed genotypes had known phase and known local ancestry. We followed the same pipeline described above to generate the phenotype given the simulated genotypes. In the scenario where causal effects differed across populations, we considered haploid chromosomes $H_{ij1}$ and $H_{ij2}$ (corresponding to haplotypes 1 and 2 for individual $i$ at variant $j$) and matched the population-specific effect sizes to the local ancestry of a variant's haplotype background to derive the true genetic component:

$$g_i = \sum_{j=1}^{m} \Big\{ \beta_{j,AFR}\big[H_{ij1} I(l_{ij1} = AFR) + H_{ij2} I(l_{ij2} = AFR)\big] + \beta_{j,EUR}\big[H_{ij1} I(l_{ij1} = EUR) + H_{ij2} I(l_{ij2} = EUR)\big] \Big\}$$

To prevent overfitting, we split our sample into training and testing data using a 70:30 split, resulting in n=7000 and n=3000 admixed samples in the training and testing data, respectively. The outlined simulation procedure was repeated 150 times to evaluate slaPRS and perform method comparisons.
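The ancestry-matched genetic component can be written compactly with indicator masks. The sketch below assumes haplotype allele counts coded 0/1 and local ancestry coded 0 for AFR and 1 for EUR; that coding, the function name, and the toy values are illustrative assumptions.

```python
import numpy as np

AFR, EUR = 0, 1  # assumed integer coding of local ancestry labels

def ancestry_matched_score(H1, H2, anc1, anc2, beta_afr, beta_eur):
    """True genetic component with effects matched to each haplotype's local ancestry.

    H1, H2     : N x m allele counts (0/1) on haplotypes 1 and 2
    anc1, anc2 : N x m local ancestry labels of haplotypes 1 and 2
    """
    afr_dosage = H1 * (anc1 == AFR) + H2 * (anc2 == AFR)  # alleles on African background
    eur_dosage = H1 * (anc1 == EUR) + H2 * (anc2 == EUR)  # alleles on European background
    return afr_dosage @ beta_afr + eur_dosage @ beta_eur

# One individual, two causal variants (toy values).
H1 = np.array([[1, 0]])
H2 = np.array([[1, 1]])
anc1 = np.array([[AFR, EUR]])
anc2 = np.array([[EUR, EUR]])
beta_afr = np.array([0.5, 0.2])
beta_eur = np.array([0.1, 0.3])
g = ancestry_matched_score(H1, H2, anc1, anc2, beta_afr, beta_eur)
```

In the toy example, variant 1 carries one allele on each background (0.5 + 0.1) and variant 2 carries a single allele on a European background (0.3), giving a score of 0.9.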
1.3.2 Comparison of Methods

Clumping and Thresholding (C+T)

We first compared the proposed slaPRS method against global single-population PRS, $PRS_{EUR}$ and $PRS_{AFR}$, constructed using clumping and thresholding (C+T) with GWAS effect sizes from the respective population separately. In the C+T algorithm, we first clumped SNPs using each population's GWAS effect sizes with a window size of 250Kb and linkage threshold $r^2 = 0.10$, and then optimized the threshold parameter in the 70% training set over $-\log_{10}(p)$ p-value thresholds {1, 2, …, 8}. The threshold that optimized PRS performance was then used in the 30% testing set to select the clumped risk variants included in the PRS.

Linear Combination of Global Population Specific PRS

The second approach we compared against was the method proposed by Marquez-Luna et al. [28], which constructs a PRS as a linear combination of two global population-specific PRS:

$$PRS_{Marquez} = \alpha_{EUR} PRS_{EUR} + \alpha_{AFR} PRS_{AFR}$$

Here, $PRS_{EUR}$ and $PRS_{AFR}$ are the same global PRS constructed using C+T and the respective population GWAS as described above.
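A greedy clumping and thresholding step of the kind used for these global PRS can be sketched as follows. This is a simplified illustration, not the software used in the study: it assumes the 250Kb window means a maximum distance from the index SNP, computes LD as the squared Pearson correlation of in-sample dosages, and uses hypothetical function names and toy data.

```python
import numpy as np

def clump(pvals, positions_bp, genotypes, window_bp=250_000, r2_max=0.10):
    """Greedy clumping: keep the most significant SNP, drop nearby SNPs in LD with it."""
    pvals = np.asarray(pvals)
    positions_bp = np.asarray(positions_bp)
    available = np.ones(len(pvals), dtype=bool)
    kept = []
    for idx in np.argsort(pvals):           # most significant first
        if not available[idx]:
            continue
        kept.append(idx)
        available[idx] = False
        near = np.abs(positions_bp - positions_bp[idx]) <= window_bp
        for j in np.flatnonzero(near & available):
            r = np.corrcoef(genotypes[:, idx], genotypes[:, j])[0, 1]
            if r ** 2 > r2_max:
                available[j] = False        # clumped away by the index SNP
    return np.array(sorted(kept))

def ct_prs(genotypes, betas, pvals, kept, p_threshold):
    """PRS from clumped SNPs whose p-value passes the threshold."""
    use = kept[pvals[kept] <= p_threshold]
    return genotypes[:, use] @ betas[use]

# Toy data: SNP 1 duplicates SNP 0 (r^2 = 1); SNP 2 lies outside the window.
rng = np.random.default_rng(2)
geno = rng.binomial(2, 0.4, size=(50, 3)).astype(float)
geno[:, 1] = geno[:, 0]
pvals = np.array([1e-9, 1e-5, 1e-3])
pos = np.array([1_000, 2_000, 900_000])
betas = np.array([0.3, 0.3, -0.1])
kept = clump(pvals, pos, geno)
prs = ct_prs(geno, betas, pvals, kept, p_threshold=1e-4)
```

To tune the threshold as described, one would loop `p_threshold = 10.0 ** -t` for t in 1 to 8 on the training set and keep the value giving the best-performing PRS.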
To estimate the mixing weights ($\alpha_{EUR}$, $\alpha_{AFR}$) and global polygenic risk scores ($PRS_{EUR}$, $PRS_{AFR}$), we followed proposed guidelines and used cross validation. The 70% training set of admixed samples was first split in half, where the first half was used to estimate the thresholding parameter in the C+T algorithm. In the second half we constructed $PRS_{EUR}$ and $PRS_{AFR}$ using the optimal p-value threshold from the European GWAS (as it is typically larger), as done by Marquez-Luna et al. In this same second half of the training set, we then estimated $\alpha_{EUR}$ and $\alpha_{AFR}$ by finding the least squares estimates for:

$$Y = \alpha_{EUR} PRS_{EUR} + \alpha_{AFR} PRS_{AFR}$$

With the optimal p-value threshold and mixing weights $\alpha_{EUR}$ and $\alpha_{AFR}$ derived from training data, we then constructed $PRS_{Marquez}$ as the weighted sum of $PRS_{EUR}$ and $PRS_{AFR}$.

1.3.3 Quantifying Performance of Estimated PRS

To quantify and compare the performance of each PRS across methods, we computed the proportion of variance explained (adjusted $R^2$) of the simulated quantitative phenotype by the estimated PRS, adjusting for % European ancestry. Because one of our main objectives is to create a PRS whose performance is independent of the global ancestry of an admixed individual, we further stratified our adjusted $R^2$ performance metric by European ancestry quantiles [0-20%, 20-40%, 40-60%, 60-80%, and 80-100%].
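The stratified metric can be illustrated with a short sketch. It assumes that "adjusted $R^2$ adjusting for % European ancestry" means the adjusted $R^2$ of an ordinary least squares fit of the phenotype on the PRS and global European ancestry; whether the study instead reports an incremental $R^2$ is not stated here, so treat that reading, the function names, and the toy data as assumptions.

```python
import numpy as np

def adjusted_r2(y, covariates):
    """Adjusted R^2 of an OLS fit of y on the given covariates plus an intercept."""
    n = len(y)
    D = np.column_stack([np.ones(n), covariates])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    p = D.shape[1] - 1                      # number of predictors without intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def stratified_adjusted_r2(y, prs, eur_anc, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Adjusted R^2 of y ~ PRS + ancestry within each European ancestry bin."""
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = eur_anc <= hi if hi == edges[-1] else eur_anc < hi
        mask = (eur_anc >= lo) & upper
        if mask.sum() > 3:                  # need n - p - 1 > 0 with p = 2
            label = f"{round(lo * 100)}-{round(hi * 100)}%"
            out[label] = adjusted_r2(y[mask], np.column_stack([prs[mask], eur_anc[mask]]))
    return out

# Toy data with a PRS effect that does not depend on ancestry.
rng = np.random.default_rng(3)
n = 2000
eur = rng.uniform(0.0, 1.0, n)
prs = rng.normal(size=n)
y = 0.3 * prs + 0.2 * eur + rng.normal(size=n)
by_bin = stratified_adjusted_r2(y, prs, eur)
```

A PRS whose performance is independent of ancestry should give similar values across the five bins, which is the pattern the stratification is designed to check.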
We also compared the mean simulated phenotype value in the top 10% PRS quantile with that in the bottom 10% PRS quantile to assess the PRS' ability to identify high-risk and low-risk individuals.

1.3.4 Real Data Application

We evaluated slaPRS in real data applications using n=20,262 admixed African British individuals in the UK Biobank [40]. To choose samples, we selected admixed samples falling on the diagonal between the European and African corners of the PC plot (Supplementary Figure 1). We used autosomal imputed genotypes in constructing polygenic risk scores. Phenotype data included the lipid biomarkers LDL, HDL, and total cholesterol. Lipid biomarker phenotypes were chosen because the Global Lipids Genetics Consortium [41] had collected large-sample (excluding UK Biobank samples) ancestry-specific GWAS in Europeans (n=1.32 million) and admixed Africans or Africans (n=99.4k). For all 20,262 samples we inferred local ancestry, with genotypes first phased using BEAGLE 5.0 [42]. We used RFMix [39] to infer local ancestry using phased haplotypes from European and African subpopulations of 1000 Genomes [43] individuals as references. From the inferred local ancestry, we further computed global ancestry using tract lengths for sample stratification. We split the admixed dataset into 70% training and 30% testing for model training and method comparison.
Because the true PRS is unknown in real data, to quantify PRS performance across methods we computed the proportion of variance explained (adjusted $R^2$) between the estimated PRS and the phenotypic value (instead of the true genetic score) from the model including the first 4 principal components:

$$Y = \beta_0 + \beta_{PRS}\, PRS + \beta_{PC_1} PC_1 + \cdots + \beta_{PC_4} PC_4$$

As in simulations, we computed adjusted $R^2$ across the entire testing sample and then also stratified by European ancestry quantiles. We also compared the mean phenotype value in the top 10% PRS quantile with that in the bottom 10% PRS quantile. Performance metrics were computed with the median reported over 50 folds.

1.4 Results

1.4.1 Comparison of PRS Performance Assuming Shared Genetic Architecture across Ancestral Populations

To evaluate the performance of slaPRS, we first conducted simulations with complete sharing of genetic architecture across ancestral populations (i.e., true effect sizes and risk variants are shared across European and African populations) for various disease architectures (see methods). Under this setup, differences in GWAS-estimated effect sizes across ancestral populations are a function solely of LD. We constructed our stacked PRS using simulated European and African GWAS effect sizes for simulated admixed African Americans of varying ancestry proportions.
The distribution of overall European ancestry in our simulated admixed African Americans was approximately normal, with a mean of around 50% (Supplementary Figure 2). We focus first on the full level 1 model with 5Mb windows using the local African and European PRS and local ancestry information in each block ($C_{B_k} = \{A_k, E_k, Anc_k, A_k \times Anc_k, E_k \times Anc_k\}$), with heritability $h^2 = 0.10$, number of causal variants $m = 100$, and equal European and African GWAS sample sizes $n = 10,000$. Across simulations, our stacked PRS generally had a higher adjusted $R^2$ with the simulated phenotype than the existing approaches. slaPRS had a 5.93% median adjusted $R^2$ for the true PRS across all admixed individuals in the testing set, compared to C+T $PRS_{EUR}$ (3.17%), $PRS_{AFR}$ (3.18%), and $PRS_{Marquez}$ (3.39%), which globally combines $PRS_{EUR}$ and $PRS_{AFR}$. Comparing individuals in the top vs bottom 10% of the PRS distribution, slaPRS had higher trait stratification ability, with larger mean differences (0.84 vs 0.62, 0.64, and 0.64 for $PRS_{EUR}$, $PRS_{AFR}$, and $PRS_{Marquez}$, respectively). We further stratified testing samples by quantiles of European ancestry and found that our stacking approach using the full model explained more variance of the phenotype than $PRS_{EUR}$, $PRS_{AFR}$, and $PRS_{Marquez}$.
Across all ancestry quantiles, the percent increase in median adjusted $R^2$ for slaPRS compared to the other methods ranged from 38.46% to 120.61% (Figure 2). Most notably, slaPRS strongly reduced the ancestry dependence of PRS performance compared to $PRS_{EUR}$ and $PRS_{AFR}$. When quantified through a simple linear model, the adjusted $R^2$ for slaPRS increased by 0.0009 for every European ancestry quantile increase, ranging from 5.69% (0-20% European ancestry) to 5.91% (80-100% European ancestry). On the other hand, single-population $PRS_{EUR}$ and $PRS_{AFR}$ had larger changes in $R^2$ of 0.004 (2.60% to 4.22%) and -0.001 (4.11% to 3.60%), respectively, for every quantile increase. $PRS_{Marquez}$ compared similarly to slaPRS, with an increase of 0.0008 for every quantile increase, ranging from 3.46% to 3.91%.

Figure 2. Boxplots comparing performance of slaPRS (differing in choice of level 0 predictors from each block), $PRS_{Marquez}$, and single population PRS: $PRS_{EUR}$ & $PRS_{AFR}$ (see methods), quantified through adjusted $R^2$. Testing samples stratified by overall % of European ancestry.
While thus far we only considered the full slaPRS model ($C_{B_k} = \{E_k, A_k, Anc_k, A_k \times Anc_k, E_k \times Anc_k\}$), we then considered slaPRS under our alternative level 1 models that vary the predictors from each local window. For the simplest case $C_{B_k} = \{E_k\}$ (i.e., only the European GWAS is considered and local European PRS are stacked across blocks), slaPRS had adjusted $R^2$ ranging from 3.28% for 0-20% European ancestry to 5.45% for 80-100% European ancestry and noticeably outperformed $PRS_{EUR}$. However, slaPRS under $C_{B_k} = \{E_k\}$ exhibited the strongest ancestry dependence (0.005 increase in adjusted $R^2$ across ancestry quantiles) across all methods. For $C_{B_k} = \{E_k, A_k\}$ (i.e., integrating European and African GWAS and stacking local European and African PRS across blocks), slaPRS further increased performance (compared to the single population case $C_{B_k} = \{E_k\}$) with adjusted $R^2$ ranging from 5.77% to 6.27% and had noticeably reduced ancestry dependence (0.001 increase in adjusted $R^2$ across ancestry quantiles). The full level 1 model ($C_{B_k} = \{E_k, A_k, Anc_k, A_k \times Anc_k, E_k \times Anc_k\}$) further added local ancestry with interaction terms and performed comparably to the previous model ignoring ancestry, $C_{B_k} = \{E_k, A_k\}$.
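The three level 1 predictor sets compared here differ only in which columns each window contributes to the stacking design matrix. The sketch below shows one way to assemble them, assuming per-individual lists of local European PRS, local African PRS, and a local ancestry value for each window. The function names and the coding of local ancestry (here a count of African haplotypes, 0, 1, or 2) are illustrative assumptions, not the interface of the slaPRS package.

```python
def window_features(E_k, A_k, anc_k, model="full"):
    """Level 1 predictors contributed by one window, one row per individual.

    E_k, A_k : local European / African PRS per individual
    anc_k    : local ancestry per individual (assumed coding: 0, 1, or 2)
    model    : "eur" -> {E_k}; "eur_afr" -> {E_k, A_k};
               "full" -> {E_k, A_k, Anc_k, A_k x Anc_k, E_k x Anc_k}
    """
    rows = []
    for e, a, anc in zip(E_k, A_k, anc_k):
        if model == "eur":
            rows.append([e])
        elif model == "eur_afr":
            rows.append([e, a])
        elif model == "full":
            rows.append([e, a, anc, a * anc, e * anc])
        else:
            raise ValueError(f"unknown model: {model}")
    return rows


def stack_windows(windows, model="full"):
    """Concatenate the per-window predictors column-wise into one design matrix."""
    per_window = [window_features(E, A, anc, model) for (E, A, anc) in windows]
    n = len(per_window[0])
    return [[value for w in per_window for value in w[i]] for i in range(n)]
```

With K windows, the full model yields 5K columns, which is why a penalized level 1 model is a natural fit for the stacking step.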
Negligible differences between the full model and the model excluding local ancestry were present only in simulations of complete sharing of transethnic genetic effects.

Effect of Overall Heritability, Number of Causal Variants, Window Size, and African GWAS Sample Size

We quantified how slaPRS fared against other approaches across different simulation settings including: overall heritability $h^2 \in \{0.10, 0.30\}$, number of causal variants $m = \{5, 100, 500, 1000\}$, African GWAS sample size $n \in \{2000, 5000, 10000\}$, window sizes $\in \{1\text{Mb}, 5\text{Mb}\}$ (see Supplementary), and training data size $\in \{3000, 7000\}$ (see Supplementary). Across all settings, slaPRS generally improved performance as compared to the single ancestry PRS, $PRS_{AFR}$ and $PRS_{EUR}$ (Supplementary Figure 3). Two factors had a sizable impact on the performance of slaPRS generally and on its comparison to $PRS_{Marquez}$. The first major factor impacting PRS performance was the African GWAS sample size. As the African GWAS sample size decreased (while fixing $h^2$ and $m$), the C+T $PRS_{AFR}$ performed increasingly worse compared to other methods (Figure 3). The performance of the full slaPRS model similarly decreased as the African GWAS sample size decreased, reflecting less informative contributions about the true risk variants from the African cohort.
Furthermore, slaPRS exhibited a stronger ancestry dependence (converging towards the European-only slaPRS model) as the African GWAS sample size decreased: for every increase in European ancestry quantile, slaPRS under the full model had an average change in adjusted $R^2$ of 0.0009, 0.001, and 0.003 for African GWAS sample sizes of n=10000, n=5000, and n=2000 respectively. However, even for the smallest African GWAS sample size scenario, slaPRS had the highest adjusted $R^2$ across ancestry quantiles.

Figure 3. Line graph comparing PRS performance across methods (quantified by median adjusted $R^2$ between estimated PRS and phenotype value) as the African GWAS sample size changes (n=2000, 5000, 10,000). Testing admixed samples stratified by European ancestry quantile.

The second factor impacting slaPRS, especially compared to $PRS_{Marquez}$, was polygenicity and the distribution of per-variant effect sizes (Supplementary Figures 3, 4). slaPRS generally had the greatest improvement in polygenic ($m = 100, 500$) simulations with moderate to large per-variant effect sizes ($h^2 = 0.30$, $m = 100, 500$ and $h^2 = 0.10$, $m = 100$) driving clear genetic signals. Under these simulation parameters, the median adjusted $R^2$ of the full slaPRS model was 58.1% to 96.7% larger than the median adjusted $R^2$ of $PRS_{Marquez}$. In such settings, the local ancestry PRS of a majority of windows contribute genetic signal to the stacking model.
On the opposite end, when polygenicity was lower ($m = 5$ causal variants, $h^2 = 0.10$), the median adjusted $R^2$ for slaPRS was more similar to that of $PRS_{Marquez}$ (23.4% increase), as a few large per-variant effect sizes drive a small number of windows to dominate the genetic signal, with the remaining windows adding noise to the model. slaPRS likewise performed more similarly to $PRS_{Marquez}$ (21.1% and 27.3% increase in adjusted $R^2$) in simulations of high polygenicity with low per-variant effect sizes ($m = 500, 1000$ and $h^2 = 0.10$), as most windows are uninformative and those with very small genetic signal are likely overly penalized and shrunk.

1.4.2 Comparison of PRS Performance Assuming Differences in Genetic Architecture across Ancestral Populations

We also considered simulations in which the genetic architecture differed across ancestral populations (i.e., unique population-specific effect sizes), causing population-specific GWAS to vary from both differences in LD and true underlying effects across populations. We computed slaPRS using GWAS effect sizes while varying the transethnic genetic correlation across risk variants, $\rho = \{0.2, 0.5, 0.8\}$. We again focused on our base simulation parameters (heritability $h^2 = 0.10$, number of causal variants $m = 100$, and equal European and African GWAS sample sizes of $n = 10{,}000$).
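One standard way to draw effect sizes for two populations with a target correlation $\rho$ is to mix two independent normal draws. The sketch below illustrates that construction only. The function names, the per-variant variance $h^2/m$ (which presumes standardized genotypes), and the seed are assumptions; the simulation procedure actually used is the one described in the methods.

```python
import math
import random


def simulate_effects(m, rho, h2, seed=0):
    """Draw m pairs of (African, European) causal effects with correlation rho.

    Assumed construction: beta_AFR = sd * z1 and
    beta_EUR = sd * (rho * z1 + sqrt(1 - rho^2) * z2), with z1, z2 ~ N(0, 1),
    so both effects have variance sd^2 = h2 / m and correlation rho.
    """
    rng = random.Random(seed)
    sd = math.sqrt(h2 / m)
    beta_afr, beta_eur = [], []
    for _ in range(m):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        beta_afr.append(sd * z1)
        beta_eur.append(sd * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return beta_afr, beta_eur


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Setting `rho=1.0` reproduces the fully shared architecture, while `rho=0.2` gives largely population-specific effects. With only $m = 100$ causal variants, the realized sample correlation in any single draw can deviate noticeably from the target.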
For the single population $PRS_{EUR}$ and $PRS_{AFR}$, which do not consider a risk variant's local background, the adjusted $R^2$ from the PRS model was stable in their corresponding admixed groups (80-100% European and 0-20% European) across changing transethnic genetic correlation. However, when transethnic genetic correlation was low ($\rho = 0.2$), $PRS_{EUR}$ and $PRS_{AFR}$ notably had an increased decay in PRS performance as the admixed ancestry group diverged from the population GWAS (Figure 4): comparing $\rho = 0.20$ with the shared transethnic genetic architecture case, the change in adjusted $R^2$ across ancestry quantiles was 0.005 vs 0.004 for $PRS_{EUR}$ and -0.006 vs -0.001 for $PRS_{AFR}$. For slaPRS, the full level 1 stacking model ($C_{B_i} = \{A_i, E_i, Anc_i, A_i \times Anc_i, E_i \times Anc_i\}$), which models local ancestry and interactions, notably outperformed the model using only the local ancestry PRS ($C_{B_i} = \{E_i, A_i\}$) as the transethnic genetic correlation decreased. When genetic effects across ancestral populations were similar ($\rho = 0.8$), the percent increase in adjusted $R^2$ between the full model and the model ignoring local ancestry ranged from 10.9% to 14.3% across ancestry quantiles, as compared to 23.4% to 50.5% when transethnic genetic effects were vastly different ($\rho = 0.2$) (Figure 4).
Notably, the overall adjusted $R^2$ of the full level 1 model, which models ancestry-specific effects dependent on a variant's ancestral background, was stable across values of $\rho = \{0.2, 0.5, 0.8\}$ ($R^2 = 5.27\%, 5.18\%, 5.67\%$) as compared to the model ignoring local ancestry ($R^2 = 3.65\%, 4.09\%, 5.18\%$).

Figure 4. Line graph comparing PRS performance as quantified through median adjusted $R^2$ between the estimated PRS and phenotype value. Transethnic genetic correlation varies over $\rho = \{0.2, 0.5, 0.8\}$ and testing admixed samples are stratified by European ancestry quantile.

1.4.3 Real Data Application

We conducted a real data application of our stacking method slaPRS using genotype and phenotype data from the UK Biobank. We considered three quantitative lipid traits, HDL, LDL, and total cholesterol (TC), using estimated European and African American GWAS effect sizes from the Global Lipids Genetics Consortium (see methods for details). We first compared our approach to $PRS_{EUR}$ and $PRS_{AFR}$ (C+T using European and African GWAS effect sizes separately) and to $PRS_{Marquez}$ (combining $PRS_{EUR}$ and $PRS_{AFR}$ globally) across all samples. For all three traits, slaPRS improved the median adjusted $R^2$ values compared to $PRS_{EUR}$ and $PRS_{AFR}$ (Table 1).
Similarly, slaPRS improved stratification ability, as shown by larger differences in mean phenotype between individuals in the top and bottom 10% of the PRS distribution: HDL (0.373 vs 0.365, 0.324), LDL (1.019 vs 0.858, 0.905), TC (1.317 vs 1.028, 1.203). However, slaPRS performed similarly to $PRS_{Marquez}$ across all three traits with respect to both metrics, a pattern observed in simulation scenarios of lower polygenicity in which fewer windows contribute to trait heritability (Table 1). Across the three traits, only 1.6% (HDL), 6.6% (LDL), and 2.1% (TC) of all level 0 local population PRS across the genome had an $R^2$ > 0.10 with the overall trait PRS. For LDL, which had the highest signal to noise ratio, there was a minor improvement in both adjusted $R^2$ and top vs bottom 10% stratification ability for slaPRS. Furthermore, we found limited improvement in slaPRS using the full level 1 stacking model ($C_{B_k} = \{E_k, A_k, Anc_k, A_k \times Anc_k, E_k \times Anc_k\}$) compared to the reduced model ($C_{B_k} = \{E_k, A_k\}$).

Table 1. Performance metrics for lipid phenotypes in UKB. a) Median adjusted $R^2$ from model PHENO ~ PRS + PC1 + PC2 + PC3 + PC4. b) Difference in mean phenotype for individuals in top 10% of PRS distribution vs bottom 10%.

We then stratified our testing samples by European ancestry quantile to 1) reassess overall PRS performance on admixed individuals in quantiles of 20%-80% European ancestry (removing primarily European or African admixed African British) and 2) quantify ancestry dependence of PRS performance across all five ancestry quantiles. In the bottom and top quantiles of predominantly homogeneous African or European admixed African British, the single ancestry $PRS_{AFR}$ and $PRS_{EUR}$ tended to outperform the other methods.
However, in the more heterogeneous admixed samples (20-80% European ancestry), slaPRS and $PRS_{Marquez}$ had the best median adjusted $R^2$ across all methods, with comparable results for the three traits: HDL (0.066 and 0.070), LDL (0.103 and 0.098), TC (0.079 and 0.081) (Figure 5). Regarding the ancestry dependence of each PRS method, across traits $PRS_{EUR}$ and $PRS_{AFR}$ exhibited the strongest ancestry dependence, performing better as the proportion of European or African ancestry increased. On the other hand, methods using multiple ancestry GWAS had reduced ancestry dependence, with slaPRS having the smallest dependence followed by $PRS_{Marquez}$. For HDL, the average change in adjusted $R^2$ for each European quantile increase for slaPRS, $PRS_{EUR}$, $PRS_{AFR}$, and $PRS_{Marquez}$ was 0.004, 0.019, -0.006, and 0.011 respectively. LDL (-0.003, 0.014, -0.016, and 0.003) and TC (-0.002, 0.012, -0.014, and -0.005) had similar patterns across methods.

Figure 5. Line graph comparing PRS performance for UKB lipid phenotypes. Performance quantified through median adjusted $R^2$ from model PHENO ~ PRS + PC1 + PC2 + PC3 + PC4. Testing admixed samples are stratified by European ancestry quantile.

1.5 Discussion

In this work we proposed a novel stacking framework to locally incorporate GWAS from multiple populations into construction of PRS for admixed individuals.
Our method, slaPRS, segments admixed genomes into local regions of varying ancestry and optimizes a linear combination of local population specific PRS, local ancestry, and potential interactions. In simulations, we first recapitulated previous findings that traditional PRS constructed using a single population GWAS in admixed samples are ancestry dependent. We then showed across a range of genetic architectures (varying heritability, number of causal variants, underrepresented GWAS sample size, and transethnic genetic correlation across ancestral populations) that slaPRS can outperform existing approaches ($PRS_{EUR}$, $PRS_{AFR}$, and $PRS_{Marquez}$) and reduce the ancestry dependence compared to $PRS_{EUR}$ and $PRS_{AFR}$. In real data, we leveraged ancestry specific GWAS for lipid traits from the Global Lipids Genetics Consortium to compare slaPRS to existing PRS methods in admixed African British from the UK Biobank. We found in these lipid traits that incorporating multiple ancestry GWAS similarly improved performance and strongly reduced the ancestry dependence of PRS performance. From our simulations and real data applications, we conclude that slaPRS is likely optimal (compared to existing approaches) for PRS in admixed individuals when traits have high heritability and polygenicity. slaPRS extends $PRS_{Marquez}$ to combine information locally as opposed to globally, and comparing the two yielded several notable findings. In simulations, we found the smallest improvements were in trait architectures with low polygenicity (few windows meaningfully contribute to trait heritability while the others add noise to the model) or in highly polygenic settings where per-variant effect sizes are small (it is hard to distinguish signal from noise and genetic signals may be over-shrunk).
In real data applications, we found slaPRS and $PRS_{Marquez}$ performed similarly across the three lipid traits, likely driven by their trait genetic architecture. For the lipid traits studied, the former simulation scenario may be most prevalent, as only 2-6% of all local PRS across windows contributed to the estimated PRS, causing most regions to solely add noise to the model. As a result, noticeable improvements in slaPRS over $PRS_{Marquez}$ may be observed in more heritable and polygenic traits, such as height, in which more local windows across the genome will contribute genetic signal. However, evaluating slaPRS yielded a surprising finding: explicitly modeling local ancestry in the slaPRS model (vs the model excluding local ancestry) gave the most improvement when there existed at least moderate heterogeneity in true causal variant effect sizes across ancestral backgrounds. In simulations, this was shown through the largest increase in PRS performance between slaPRS models when transethnic genetic correlation was low ($\rho = 0.20$), with no improvements under scenarios of shared transethnic genetic architecture. In lipid traits from the UK Biobank, we observed similar findings regarding modeling local ancestry.
In such traits, modeling local ancestry in the slaPRS model only provided marginal improvements, consistent with high estimated transethnic genetic correlations from Million Veteran Program participants for HDL ($\rho = 0.84$) and moderate correlation for the other traits ($\rho \in [0.47, 0.69]$) 41. High transethnic genetic correlations for the considered lipid traits are consistent with recent findings from Hou et al., which suggest that a majority of common traits likely have similar causal effects across populations 18. Such findings have immediate implications, as slaPRS and other approaches considering local ancestry background may find the most improvement in traits with significant differences in transethnic genetic architecture. Historically in genetic studies, individuals are often discretized into ancestral populations and treated as homogeneous within the group. Ding et al. recently challenged this historical paradigm by showing that PRS accuracy varies between individuals even within a “homogeneous” genetic ancestry cluster, ultimately pushing for treating genetic ancestry on a continuum 22.

Method

slaPRS is tailored to treat genetic ancestry on a continuum by taking a local approach to PRS prediction in admixed samples. As mentioned, $PRS_{Marquez}$ previously combined global population specific PRS successfully in admixed individuals, though in doing so it uses a single weight for population specific effects. Potential heterogeneity in true population specific risk variants, estimated population specific GWAS effect sizes, and admixture proportions across loci and individuals would cause use of a single weight to be suboptimal. slaPRS extends $PRS_{Marquez}$ by combining population specific PRS at the local level instead, to 1) allow for varying effects of local population specific PRS across the genome and 2) increase overall external GWAS sample sizes to improve effect size estimation and identify the true causal variants. The first benefit is accomplished through our level 1 elastic net stacking model, which learns a linear combination of local population specific PRS (and local ancestry with interaction effects) to inform which population's local PRS should be upweighted or shrunk. In the case that the true causal effect differs due to ancestral background, slaPRS handles this scenario by modeling the local ancestry and its interactions with the local population specific PRS, allowing the effect of a local population specific PRS to differ based on its ancestral background. The second benefit is accomplished by increasing the overall effective GWAS sample size through incorporating information from each population's GWAS. In the case that the genetic architecture is shared across ancestral backgrounds, using information from both GWAS will boost power and improve effect size estimation of the shared risk variants and their locations.
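To make the level 1 step concrete, here is a small coordinate-descent elastic net in plain Python. It is a sketch under stated assumptions, not the slaPRS code: the objective follows the common glmnet-style parameterization $\frac{1}{2n}\lVert y - b - Xw\rVert_2^2 + \lambda\left[\alpha\lVert w\rVert_1 + \frac{1-\alpha}{2}\lVert w\rVert_2^2\right]$, and `lam`, `alpha`, and the function names are hypothetical choices. In practice a tuned library routine with cross-validated penalties would be preferable.

```python
from statistics import mean


def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=500, tol=1e-10):
    """Fit intercept and weights by cyclic coordinate descent.

    X : list of rows (individuals) of level 1 predictors, e.g. local PRS columns
    y : phenotype values
    """
    n, p = len(X), len(X[0])
    # Center columns and phenotype so the intercept can be recovered at the end.
    col_means = [mean(row[j] for row in X) for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = mean(y)
    resid = [yi - y_mean for yi in y]
    w = [0.0] * p
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the partial residual (excluding w[j]).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * w[j]
            new_w = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                w[j] = new_w
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return intercept, w
```

Windows whose local PRS carry little signal are shrunk towards zero by the L1 term, while the ridge term stabilizes weights when neighbouring predictors are correlated. With `lam=0` the fit reduces to ordinary least squares.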
However, when the genetic architecture differs across populations, it is unclear whether using multiple population GWAS can be viewed in a similar manner. slaPRS has desirable statistical and computational properties as well. First, similar to other machine learning-based PRS methods such as TL-PRS 44 in the context of cross population prediction incorporating multiple ancestry GWAS, slaPRS avoids the need for any distributional assumptions on transethnic effect sizes, as compared to the cross population PRS methods PRS-CSx 21 and PolyPred 45 (which utilizes BOLT-LMM 46 and PRS-CS 47, which treat SNP effects as random). As a result, our approach makes no assumption on whether a risk variant is shared across populations, and each local population PRS in a genomic region can include its own set of risk variants. Second, slaPRS does not require an external LD reference panel or genotypes outside of the admixed genotypes. Third, slaPRS can accommodate any PRS algorithm to construct local population PRS (here we use the C+T algorithm for simplicity). For example, REGENIE 32 uses a ridge regression based approach to construct level 0 local PRS before stacking. Lastly, our approach is computationally very efficient, as discretizing the genome into local windows facilitates efficient parallel processing of level 0 predictions, with a final level 1 elastic net model that can be fit very fast with standard statistical packages. While slaPRS provides a novel stacking approach to combine population specific GWAS information locally, it has a few limitations to consider.
We assume the existence of GWAS from each ancestry contributing to a genetic admixture, though high powered GWAS in understudied homogeneous populations such as Africans are currently limited or non-existent. As a result, our real data application was limited to using African American GWAS as proxies for African GWAS, with only a handful of lipid traits from the Global Lipids Genetics Consortium having sufficiently large GWAS sample sizes. Recent efforts for genomic research in diverse populations such as the African biobank 48 should help to resolve this issue. Furthermore, we describe our framework for continuous value phenotypes, owing to currently limited access to large sample GWAS for binary case/control traits in each ancestral population. Extending this framework to case/control traits using a logistic regression elastic net and liability threshold model should be straightforward. Lastly, while we push to treat admixed individuals on a genetic ancestry continuum, our approach assumes that super population groups such as “European” and “African” have homogeneous genetic architecture with respect to a complex trait across their subpopulations. However, studies have shown a high degree of genetic diversity across the African continent 49,50, with unique demographic histories driving substantial cultural and ethnic differences that may cause treating all African subpopulations as homogeneous to be problematic 22,51. Despite these limitations, slaPRS provides an efficient data driven framework for constructing polygenic risk scores in admixed samples that leverage multiple population GWAS.
In providing a method that not only performs well in admixed samples, but performs equally well across varying ancestry proportions, we strive to improve on the current inequity in genetics research that is fast resolving in our community. Furthermore, as sample sizes increase in underrepresented populations for more traits, we expect slaPRS to have additional applications. Lastly, while our work thus far only considered two-way admixture, our approach can easily accommodate three or more ancestral populations and their respective external GWAS. In coming years admixture will likely extend beyond the historically predominant African American and Latino admixed groups as people and cultures from various ancestral backgrounds are brought together geographically. As a result, we believe our method's flexibility to accommodate increasingly complex admixture types using information from multiple GWAS will become even more relevant.

Data Availability

This research used genetic and phenotypic data from the UK Biobank Resource under Application Number 24460. Data are available for download for approved researchers of the UK Biobank. High powered ancestry specific GWAS from the Global Lipids Genetics Consortium are publicly available: http://csg.sph.umich.edu/willer/public/glgc-lipids2021/.

Code Availability

slaPRS and its necessary functions have been implemented as an R package, which can be installed by running devtools::install_github('kliao12/slaPRS') using the devtools library in R. An example workflow is available at https://github.com/kliao12/slaPRS

Acknowledgements

Funding for this project was provided by National Institutes of Health grants R01 HG011031 and R01 HG005855 (S.Z.) and the NIH/National Human Genome Research Institute Genome Science Training Program (T32HG00040). We thank UK Biobank participants and study teams for providing high quality genetic and phenotypic data. Lastly, we appreciate the helpful insights and feedback given by Dr. Jean Morrison on general method development and analysis.

1.6 Supplementary

1.6.1 Derivation of weighted function learned from slaPRS

We restate our model setup consisting of a sample of N admixed individuals with ancestral contributions from populations A and B. Let $X$ be the $N \times M$ admixed genotype matrix (M is the total number of variants genome wide) and $Y$ the $N \times 1$ phenotype vector. Let $L$ be the $N \times M$ matrix whose entry $L_{ij}$ denotes the haplotype-level local ancestry ($l_{ij1}, l_{ij2}$) of individual $i$ at marker $j$. We assume the phenotype can be expressed as:

$$Y_i = \sum_{j=1}^{M} X_{ij}\, f(\beta_{A_j}, \beta_{B_j}, L_{ij}) + \epsilon_i$$

where $X_{ij}$ is the genotype dosage for individual $i$ at marker $j$, and $\beta_{A_j}, \beta_{B_j}$ are the effects of marker $j$ on the phenotype in populations A and B respectively. Here, $f(\beta_{A_j}, \beta_{B_j}, L_{ij})$ is a weighted average of population specific GWAS effect sizes and local ancestry learned via our stacking approach.
Following construction of level 0 model predictions in each window $C_{B_k}$ across the genome (which include the local population A PRS $A_k$, the local population B PRS $B_k$, local ancestry, and interaction terms), we fit the following stacking model:

$$Y = w_0 + \mathbf{w}_1 C_{B_1} + \mathbf{w}_2 C_{B_2} + \cdots + \mathbf{w}_k C_{B_k}$$

Expanding out terms for the k-th window:

$$= w_0 + \left[ w_{k,A_k} A_k + w_{k,B_k} B_k + w_{k,anc} Anc_k + w_{k,anc:A_k}\, Anc_k \times A_k + w_{k,anc:B_k}\, Anc_k \times B_k \right] + \cdots$$

$$= w_0 + \left[ w_{k,anc} Anc_k + A_k \left( w_{k,A_k} + w_{k,anc:A_k} Anc_k \right) + B_k \left( w_{k,B_k} + w_{k,anc:B_k} Anc_k \right) \right] + \cdots$$

The stacking procedure learns a linear combination of the level 0 model predictions in each window $C_{B_k}$ across the genome by estimating the weights $\mathbf{w}_k$. $A_k$ and $B_k$ are themselves weighted sums of risk variants using population specific GWAS effect sizes, reducing the form to:
$$= w_0 + \left\{ w_{k,anc} Anc_k + \left[ \sum_{j=1}^{m_k} X_{ij} \beta_{A_j} \right] \left( w_{k,A_k} + w_{k,anc:A_k} Anc_k \right) + \left[ \sum_{j=1}^{m_k} X_{ij} \beta_{B_j} \right] \left( w_{k,B_k} + w_{k,anc:B_k} Anc_k \right) \right\} + \cdots$$

$$= w_0 + \left\{ w_{k,anc} Anc_k + \sum_{j=1}^{m_k} X_{ij} \left[ \beta_{A_j} \left( w_{k,A_k} + w_{k,anc:A_k} Anc_k \right) + \beta_{B_j} \left( w_{k,B_k} + w_{k,anc:B_k} Anc_k \right) \right] \right\} + \cdots$$

where $m_k$ is the number of variants in window $k$, and $w_{k,A_k}$ and $w_{k,anc:A_k}$ are the weights for the population A specific local PRS $A_k$ and its local ancestry interaction term. Because $A_k$ (and likewise $B_k$) is a function of population A GWAS effect sizes that is shared across all variants in the window $k$, we replace the notation $w_{k,A_k}$ with $w_{k,\beta_{A_j}}$, and similarly $Anc_k$ is a function of $L_{ij}$, so we replace $w_{k,anc:A_k}$ with $w_{k,L_{ij}(A)}$.
/g1858/g4672 /g2010 /g3002 /g3285 ,/g2010 /g3003 /g3285 ,/g1838 /g3036/g3037 /g4673/g3404/g2010 /g3002 /g3285 /g4672/g1875 /g3038,/g308: /g3250 /g3285 /g3397/g1875 /g3038,/g30:3 /g3284/g3285 /g4666/g3002/g4667 /g1838 /g3036/g3037 /g4673/g3397/g2010 /g3003 /g3285 /g4672/g1875 /g3038,/g308: /g3251 /g3285 /g3397/g1875 /g3038,/g30:3 /g3284/g3285 /g4666/g3003/g4667 /g1838 /g3036/g3037 /g4673 1.6.2 Effect of window size and training dataset size slaPRS takes a sliding local window approach to construct local population-specific polygenic risk scores and thus may be sensitive to the size of the window. In simulations under our base scenario ( /g1860 /g2870 /g3404 0.10, /g1865 /g3404 100/g4667 we considered both 1Mb and 5Mb windows. PRS performance quantified by adjusted /g1844 /g2870 with the phenotype were highly consistent across window sizes suggesting slaPRS is robust to window size (Supplementary Figure 5). We further quantified the effect of varying the training dataset size of admixed individuals (n = 3000, n = 7000). As compared to /g1842/g1844/g1845 /g3006/g3022/g30:9 and /g1842/g1844/g1845 /g3002/g3007/g30:9 , slaPRS uses the training data to weight local population specific PRS (and the variants effects themselves) and increased performance should be dependent on the training dataset size. In general, slaPRS for training sizes n=3000 and n=7000 generally had increased adjusted /g1844 /g2870 when the training size was larger compared to /g1842/g1844/g1845 /g3006/g3022/g30:9 (77.3%, 84.9%), /g1842/g1844/g1845 /g3002/g3007/g30:9 /g466635.3%, 66.1%/g4667 and /g1842/g1844/g1845 /g30:4/g3028/g3045/g3044/g3048/g3032/g3053 /g466664.4%, 66.4%/g4667 . . CC-BY-ND 4.0 International licenseIt is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted February 3, 2024. 
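To make the weight collapse in the derivation above and the role of window size concrete, the following is a minimal sketch on toy data, not the authors' implementation. It checks numerically that the window-level stacked form (weights on $A_k$, $B_k$, local ancestry, and their interactions) equals the per-variant form built from $f(\beta_{A_j}, \beta_{B_j}, L_{ij})$, for 1Mb and 5Mb windows. All function names, the weight labels in `WEIGHT_NAMES`, the fixed non-overlapping windows, the 0/1/2 coding of local ancestry, and the simulated dosages and effect sizes are assumptions made for illustration. The identity also assumes local ancestry is constant within a window, so that $L_{ij}$ equals the window-level ancestry term for every variant $j$.

```python
import random

WEIGHT_NAMES = ("A", "B", "anc", "anc:A", "anc:B")  # hypothetical labels for one window's weights


def group_by_window(positions_bp, window_bp):
    """Group variant indices into fixed, non-overlapping windows (an assumption of this sketch)."""
    groups = {}
    for j, pos in enumerate(positions_bp):
        groups.setdefault(pos // window_bp, []).append(j)
    return groups


def effective_weight(beta_a, beta_b, l_ij, w):
    """Per-variant weight f(beta_A, beta_B, L_ij) for one window's stacking weights w."""
    return (beta_a * (w["A"] + w["anc:A"] * l_ij)
            + beta_b * (w["B"] + w["anc:B"] * l_ij))


def stacked_window_term(x, beta_a, beta_b, anc, w):
    """Window contribution written with the local PRS A_k, B_k and their ancestry interactions."""
    a_k = sum(xj * ba for xj, ba in zip(x, beta_a))  # local population A PRS
    b_k = sum(xj * bb for xj, bb in zip(x, beta_b))  # local population B PRS
    return (w["A"] * a_k + w["B"] * b_k + w["anc"] * anc
            + w["anc:A"] * anc * a_k + w["anc:B"] * anc * b_k)


def collapsed_window_term(x, beta_a, beta_b, anc, w):
    """Same contribution as w_anc * Anc_k + sum_j X_ij * f(...), taking L_ij = Anc_k."""
    per_variant = sum(xj * effective_weight(ba, bb, anc, w)
                      for xj, ba, bb in zip(x, beta_a, beta_b))
    return w["anc"] * anc + per_variant


def compare_forms(window_bp, seed=0):
    """Return (number of windows, largest |stacked - collapsed| over windows) on toy data."""
    rng = random.Random(seed)
    positions = [j * 50_000 for j in range(200)]        # 200 toy variants spanning 10 Mb
    x = [rng.choice([0, 1, 2]) for _ in positions]      # allele dosages for one individual
    beta_a = [rng.gauss(0.0, 0.1) for _ in positions]   # toy population A effect sizes
    beta_b = [rng.gauss(0.0, 0.1) for _ in positions]   # toy population B effect sizes
    groups = group_by_window(positions, window_bp)
    worst = 0.0
    for idx in groups.values():
        anc = rng.choice([0, 1, 2])                      # toy local ancestry dosage in this window
        w = {name: rng.gauss(0.0, 1.0) for name in WEIGHT_NAMES}
        args = ([x[j] for j in idx], [beta_a[j] for j in idx],
                [beta_b[j] for j in idx], anc, w)
        diff = abs(stacked_window_term(*args) - collapsed_window_term(*args))
        worst = max(worst, diff)
    return len(groups), worst


n_windows_1mb, diff_1mb = compare_forms(1_000_000)
n_windows_5mb, diff_5mb = compare_forms(5_000_000)
print(n_windows_1mb, n_windows_5mb)
print(diff_1mb < 1e-9, diff_5mb < 1e-9)
```

With the even toy spacing used here, the same 200 variants fall into 10 windows at 1Mb and 2 windows at 5Mb, and the two forms should agree up to floating-point rounding in both cases. The sketch only illustrates the algebra; it says nothing about the predictive performance reported in Supplementary Figure 5.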
1.6.3 Supplementary Tables and Figures

S Figure 1. Scatterplot of n=20,262 UKB samples containing African ancestry along diagonal of PC1.

S Figure 2. Histogram of the distribution of overall European ancestry across n=10,000 simulated admixed African Americans (for a single simulation).

S Figure 3. Line graph comparing PRS performance across PRS methods for different simulation settings using adjusted $R^2$ between estimated PRS and simulated phenotype. Simulation parameters: heritability ($h^2$ = 0.1, 0.3) and number of causal variants (m = 100, 500).

S Figure 4. Line graph comparing PRS performance across PRS methods for different simulation settings using adjusted $R^2$ between estimated PRS and simulated phenotype. Simulation parameters: heritability ($h^2$ = 0.1) and number of causal variants (m = 5, 100, 500, 1000).

S Figure 5. Comparison of PRS performance across methods (quantified by adjusted $R^2$ between estimated PRS and phenotype value) as the window size in slaPRS varies (1Mb, 5Mb).

1.7 References

1. Loos, R. J. F. 15 years of genome-wide association studies and no signs of slowing down. Nat. Commun. 11, 5900 (2020).
2. Dudbridge, F. Power and Predictive Accuracy of Polygenic Risk Scores. PLoS Genetics vol. 9 e1003348 Preprint at https://doi.org/10.1371/journal.pgen.1003348 (2013).
3. Polygenic Risk Score Task Force of the International Common Disease Alliance. Responsible use of polygenic risk scores in the clinic: potential benefits, risks and gaps. Nat. Med. 27, 1876–1884 (2021).
4. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
5. Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults: Implications for primary prevention. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
6. Ganna, A. et al. Multilocus genetic risk scores for coronary heart disease prediction. Arterioscler. Thromb. Vasc. Biol. 33, 2267–2272 (2013).
7. Oram, R. A. et al. A type 1 diabetes genetic risk score can aid discrimination between type 1 and type 2 diabetes in young adults. Diabetes Care 39, 337–344 (2016).
8. Udler, M. S., McCarthy, M. I., Florez, J. C. & Mahajan, A. Genetic risk scores for diabetes diagnosis and precision medicine. Endocr. Rev. 40, 1500–1520 (2019).
9. Mars, N. et al. The role of polygenic risk and susceptibility genes in breast cancer over the course of life. Nat. Commun. 11, 6383 (2020).
10. Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 104, 21–34 (2019).
11. The International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
12. Lewis, A. C. F., Green, R. C. & Vassy, J. L. Polygenic risk scores in the clinic: Translating risk into action. HGG Adv. 2, 100047 (2021).
13. Kong, A. et al. The nature of nurture: Effects of parental genotypes. Science 359, 424–428 (2018).
14. Plomin, R. & von Stumm, S. Polygenic scores: prediction versus explanation. Mol. Psychiatry 27, 49–52 (2022).
15. Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature Communications vol. 10 Preprint at https://doi.org/10.1038/s41467-019-11112-0 (2019).
16. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Preprint at https://doi.org/10.1101/070797.
17. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
18. Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55, 549–558 (2023).
19. The impact of Linkage Disequilibrium on differences in predictive ability of polygenic risk score across populations.
20. Miao, J. et al. Quantifying portable genetic effects and improving cross-ancestry genetic prediction with GWAS summary statistics. Nat. Commun. 14, 832 (2023).
21. Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54, 573–580 (2022).
22. Ding, Y., Hou, K., Xu, Z., Pimplaskar, A., Petter, E., Boulier, K., ... & Pasaniuc, B. Polygenic scoring accuracy varies across the genetic ancestry continuum in all human populations. bioRxiv (2022).
23. Bitarello, B. D. & Mathieson, I. Polygenic Scores for Height in Admixed Populations. G3 10, 4027–4036 (2020).
24. Cavazos, T. B. & Witte, J. S. Inclusion of Variants Discovered from Diverse Populations Improves Polygenic Risk Score Transferability. Preprint at https://doi.org/10.1101/2020.05.21.108845.
25. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 1080 (2019).
26. All of Us Research Program Investigators et al. The “All of Us” Research Program. N. Engl. J. Med. 381, 668–676 (2019).
27. Marnetto, D. et al. Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat. Commun. 11, 1628 (2020).
28. Márquez-Luna, C., Loh, P.-R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price, A. L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
29. Breiman, L. Stacked regressions. Machine Learning vol. 24 49–64 Preprint at https://doi.org/10.1007/bf00117832 (1996).
30. Džeroski, S. & Ženko, B. Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning vol. 54 255–273 Preprint at https://doi.org/10.1023/b:mach.0000015881.36452.6e (2004).
31. Privé, F., Vilhjálmsson, B. J., Aschard, H. & Blum, M. G. B. Making the Most of Clumping and Thresholding for Polygenic Scores. Am. J. Hum. Genet. 105, 1213–1221 (2019).
32. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
33. Choi, S. W., Mak, T. S.-H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
34. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 67, 301–320 (2005).
35. Hoerl, A. E. & Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55 (1970).
36. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58, 267–288 (1996).
37. Kelleher, J., Etheridge, A. M. & McVean, G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology vol. 12 e1004842 Preprint at https://doi.org/10.1371/journal.pcbi.1004842 (2016).
38. International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
39. Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
40. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
41. Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
42. Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 103, 338–348 (2018).
43. The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
44. Zhao, Z., Fritsche, L. G., Smith, J. A., Mukherjee, B. & Lee, S. The construction of cross-population polygenic risk scores using transfer learning. Am. J. Hum. Genet. 109, 1998–2008 (2022).
45. Weissbrod, O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 54, 450–458 (2022).
46. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
47. Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
48. Maxmen, A. The next chapter for African genomics. Nature 578, 350–354 (2020).
49. Tishkoff, S. A. et al. The genetic structure and history of Africans and African Americans. Science 324, 1035–1044 (2009).
50. Majara, L. et al. Low and differential polygenic score generalizability among African populations due largely to genetic diversity. HGG Adv. 4, 100184 (2023).
51. Fatumo, S. et al. Promoting the genomic revolution in Africa through the Nigerian 100K Genome Project. Nat. Genet. 54, 531–536 (2022).


License: CC-BY-ND-4.0