Mendelianization: Concentrating Polygenic Signal into a Single Causal Locus

Eric V. Strobl
University of Pittsburgh
For correspondence: eric.strobl{at}pitt.edu
medRxiv preprint, doi: https://doi.org/10.1101/2025.10.31.25339237

Abstract

Complex disorders such as depression and alcohol use involve numerous genetic variants, and the number of implicated loci continues to grow with sample size. This proliferation hampers interpretability, as the mechanisms by which so many variants jointly contribute to pathophysiology remain unclear. In contrast, classical Mendelian diseases arise from a single causal locus and are easier to interpret. We thus introduce Mendelianization, an algorithm distinct from Mendelian randomization, that learns weighted combinations of outcomes so that each aggregated phenotype concentrates association at one locus. We prove that this locus is causal under four structural assumptions natural to genetic data. The method handles partial sample overlap, provides calibrated hypothesis tests, maps coefficients to interpretable scales, and quantifies the degree of Mendelianism using summary $z$-statistics alone. In experiments, Mendelianization enhances statistical power to detect Mendelian symptom profiles even in heterogeneous disorders like major depression, generalized anxiety, and alcohol use disorder. An R implementation is available at github.com/ericstrobl/Mendelianization.

Introduction

Complex diseases arise from diverse genetic and environmental risk factors. As a result, genome-wide association studies (GWAS) have identified hundreds of associated variants, with larger sample sizes only expanding the catalog [35]. However, the accumulation of associations often complicates rather than clarifies disease mechanism. Even after fine-mapping prioritizes putative causal variants within loci [27], the genetic architecture remains dispersed across many regions and resists reductionist interpretation [29, 17].

A major culprit is phenotypic heterogeneity, where complex disorders like depression manifest as different symptom constellations (e.g., low mood, anhedonia, fatigue, insomnia). Conventional GWAS collapse this diversity into a single diagnosis or composite, broadening the phenotype and inflating the number of hits without sharper insight [8, 35].

Existing multi-trait methods do not reduce this proliferation of associated loci. Structure-learning methods (e.g., Genomic-SEM [10], PDR [2], SHAHER [31], FactorGo [37]) learn latent outcomes that preserve trait-variant covariance, typically yielding even more polygenic, diffuse signals rather than locus-specific effects. By contrast, power-seeking methods (e.g., HIPO [21], metaCCA [4], fastASSET [22]) combine traits to maximize a detection criterion such as a non-centrality statistic, identifying still more variants rather than clarifying mechanism; their nominal $p$-values also tend to be anti-conservative because the outcome is learned on the same data used for testing. Meanwhile, causal inference methods either refine signals within loci (multi-trait fine-mapping [1, 38]) or estimate exposure-outcome relations (multi-response MR [40] and DAG methods [39]), rather than trying to identify variant-to-outcome causal effects genome-wide.

We thus advance an alternative approach that leverages multiple outcomes to simplify, not expand, the set of associated variants by learning weighted outcome combinations that approximate "Mendelianized" traits whose genetic associations concentrate in a single causal locus (Figure 1). We specifically make the following contributions:

We introduce Mendelianization, a new framework (distinct from Mendelian randomization) that constructs composite outcomes which concentrate associations at a single locus via a substantially modified canonical correlation analysis (CCA).
We prove that this locus contains at least one causal variant under just four structural assumptions.
We develop a corresponding fast hypothesis test that remains valid after outcome learning.
We operationalize Mendelianization at scale by addressing missing data, preserving coefficient interpretability, ensuring accurate inference, and introducing a continuous Score of Mendelianism to quantify achieved Mendelian properties.
We demonstrate that Mendelianization uncovers interpretable Mendelian-like traits in complex disorders using symptom-level summary $z$-statistics [12].

Fig. 1. Main idea. Standard outcomes, such as diagnosis or total symptom severity, associate with many loci and obscure underlying mechanisms (left). Mendelianization instead learns an outcome whose association converges on a single locus, yielding a solitary Manhattan plot tower (right). In practice, multiple such outcomes are inferred, one for each lead variant. The red line marks genome-wide significance ($5 \times 10^{-8}$).

Together, these advances provide a reductionist framework for identifying Mendelian-like symptom profiles dominated by a single causal locus.

Methods

One-Dimensional Canonical Correlation Analysis

One-dimensional CCA will serve as the underlying engine of Mendelianization. We thus consider summary statistics comprising the partial correlation matrix $r \in \mathbb{R}^{q \times m}$ between $q$ variants $V$ and $m$ outcomes $Y$. We assume that all covariates like biological sex, age, and ancestry have already been partialed out from $V$ and $Y$. We focus on population-level properties for ease of presentation and address sample-level issues later. We thus have $r_{jk} = \operatorname{cor}(V_j, Y_k)$. Let $r_j$ correspond to the row of $r$ associated with the variant $V_j$. For each $V_j$, we construct a composite outcome $Y\alpha_j$ by one-dimensional CCA, so that the correlation between $V_j$ and $Y\alpha_j$ is as large as possible:

$$\alpha_j = \arg\max_{\alpha \in \mathbb{R}^m} \operatorname{cor}(V_j, Y\alpha). \tag{1}$$

We will also show that this procedure implicitly reduces the correlations between loci not containing $V_j$ and $Y\alpha_j$, so that $Y\alpha_j$ resembles a Mendelian trait.
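The correlation objective can be checked numerically. The sketch below is illustrative only: it assumes standardized variables, so that $\operatorname{cor}(V_j, Y\alpha) = r_j\alpha / \sqrt{\alpha^\top \Sigma \alpha}$ with $\Sigma$ the outcome correlation matrix, and it uses the standard closed-form maximizer $\alpha \propto \Sigma^{-1} r_j^\top$. The function names and the toy values of `Sigma` and `r_j` are hypothetical and are not taken from the paper or its R package.

```python
import numpy as np


def composite_correlation(r_j, Sigma, alpha):
    """Correlation between a standardized variant and the composite Y @ alpha."""
    return float(r_j @ alpha / np.sqrt(alpha @ Sigma @ alpha))


def cca_weights(r_j, Sigma):
    """Closed-form maximizer (up to scale): alpha = Sigma^{-1} r_j^T."""
    return np.linalg.solve(Sigma, r_j)


# Hypothetical toy inputs: one variant, three outcomes.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
r_j = np.array([0.05, -0.02, 0.04])

alpha_j = cca_weights(r_j, Sigma)
best = composite_correlation(r_j, Sigma, alpha_j)

# No randomly drawn weight vector should beat the closed-form solution.
rng = np.random.default_rng(0)
random_best = max(
    composite_correlation(r_j, Sigma, rng.normal(size=3)) for _ in range(2000)
)
```

By the Cauchy-Schwarz inequality in the geometry of $\Sigma$, the maximized correlation equals $\sqrt{r_j \Sigma^{-1} r_j^\top}$, so the composite is at least as strongly associated with the variant as any single outcome.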
Without loss of generality, we assume that each variant and outcome has already been standardized to mean zero and unit variance. We can then simplify the above objective to:

$$\max_{\alpha \in \mathbb{R}^m} \frac{r_j \alpha}{\sqrt{\alpha^\top \Sigma \alpha}}, \tag{2}$$

where $\Sigma$ corresponds to the positive definite correlation matrix of $Y$. Notice that the numerator maximizes alignment with $V_j$, while the denominator minimizes the scale of $\alpha$ according to the geometry of $\Sigma$. The solution to Expression (2) is the familiar CCA weight vector that whitens the outcomes and then aligns with $r_j$. In particular, up to an arbitrary scale factor, the optimal weights satisfy $\alpha_j \propto \Sigma^{-1} r_j^\top$, so we set $\alpha_j = A r_j^\top$ with $A = \Sigma^{-1}$ from here on. We provide a detailed derivation of this solution in Supplementary Materials A.

Causal Analysis

We next ask when the CCA-derived composite $Y\alpha_j$ admits a causal interpretation. In particular, we show that, under a set of structural assumptions motivated by genetic data, the optimization problem in Expression (1) produces a composite outcome whose association pattern is concentrated within a single linkage disequilibrium (LD) block, and that this block must contain at least one causal variant for $Y\alpha_j$. We refer to this property as Mendelianization.

Structure of Latent Confounding

Even after adjusting for observed covariates such as age, sex, and ancestry, genetic association studies typically retain residual confounding from processes such as batch effects, residual population structure, and dynastic effects [35]. These processes tend to act on many phenotypes at once and therefore induce shared structure across outcomes. We formalize this by assuming:

Assumption 1 (Low-dimensional latent confounding) For sufficiently large $m$, the remaining confounding effects lie in a subspace $U^m \subset \mathbb{R}^m$, and there exists an integer constant $r_0 \ge 0$ such that $\dim(U^m) \le r_0$.

Remark. Assumption 1 states that $V$ and $Y$ are influenced by only a small number of latent confounders, or by many confounders whose effects span a low-dimensional subspace.
This matches empirical experience in large biobanks, where a handful of unmeasured processes generate broad, shared patterns across many traits. Assumption 1 implies that each vector of variant-outcome correlations can be written as where lies in the confounding subspace U m and lies in its orthogonal complement (with respect to A m = (Σ m ) − 1 ). Informally, captures residual confounding, while captures the causal component of the association between V j and the outcomes. We use this decomposition to study how Mendelianization behaves with and without latent confounding. Without Latent Confounding We first consider the idealized case where latent confounding is absent, so for all j . Let denote the vector of associations between each variant and the composite outcome defined by . For each variant V k , where denotes the inner product induced by A . When k = j , this quantity measures how strongly V j associates with its own learned outcome. When k ≠ j , the same expression measures the average alignment between the causal pattern of V j and that of V k across the m outcomes. To obtain a single-locus “tower” in the Manhattan plot of , we require that this average alignment vanish for variants outside the LD block around V j . Let 𝒮 j denote the set of variants in LD with V j (the locus or LD block containing V j ). We assume: Assumption 2 (Outcome diversity) We have Remark . In the absence of confounding, represents the causal pattern of effects of V j on the outcomes. Assumption 2 says that, as we add more non-redundant outcomes and whiten them, the causal patterns from different loci become increasingly distinct on average. Mixed positive and negative effects from variants outside 𝒮 j tend to cancel out when averaged across many outcomes, driving their contribution to toward zero. By contrast, variants in S j share the same underlying causal mechanism for and therefore retain strong alignment. 
This is plausible in high-dimensional phenotyping because adding more granular traits (e.g., subphenotypes or symptom items) typically increases the heterogeneity of causal effects across loci. In sum, the vector develops a sharp tower within S j as m grows, while associations at other loci approach zero. With Latent Confounding In real data, latent confounding is present and can also create tall peaks in Manhattan plots. For example, the composite might load on a pattern driven purely by confounding, such that even if no variant in 𝒮 j is truly causal for . To rule out such purely confounded towers, we impose additional assumptions on the strength of both causal and confounded components. We first assume that the relevant signals do not vanish as we add more outcomes: Assumption 3 (Signal floors) For sufficiently large m, if l ∈ 𝒮 j causally influences , then there exists a constant β c > 0 such that Moreover, there exists a constant β on > 0 such that for every l ∈ 𝒮 j that does not cause . Remark . The first part of Assumption 3 requires that if 𝒮 j contains a causal variant for Y α j , its causal signal remains non-negligible even as we increase the number of outcomes. This is a minimal requirement for any causal effect to be detectable. The second part states that non-causal variants in the same LD block share a non-vanishing confounded component, because LD makes them inherit similar exposure to the same local confounders (such as local ancestry or technical artifacts). This is consistent with how confounding typically behaves within LD regions in GWAS. We also formalize how low-dimensional confounding propagates across the genome: Assumption 4 (Isolation-spillover) If some l ∈ 𝒮 j causally influences , then Conversely, if no l ∈ 𝒮 j causes , then there exists a variant k ∉ 𝒮 j such that for sufficiently large m . Remark . 
The first clause states that when 𝒮 j contains a causal variant for Y α j , the associated confounding pattern does not align strongly with distant loci, so any remaining confounding is effectively local to the LD block. The second clause describes the opposite scenario: if the tower at 𝒮 j is driven purely by low-dimensional confounding, then at least one other locus must share the same confounding pattern with non-negligible strength, leading to detectable “spillover” of confounded signal elsewhere in the genome. This matches empirical experience, where broad confounding (for example, uncorrected population structure) usually produces widespread inflation rather than a single isolated peak. Asymptotic and Finite Mendelianism Under Assumptions 1–4, the CCA estimator enjoys a strong causal localization property: Theorem 1 (Asymptotic Mendelianism) Consider Assumptions 1–4 and suppose Σ m is positive definite for sufficiently large m. We have for sufficiently large m and (i . e ., Mendelianism is asymptotically achieved for at 𝒮 j ) if and only if at least one variant l ∈ 𝒮 j causes . Remark . We provide all proofs in the Supplementary Materials. Theorem 1 establishes both directions: the composite outcome becomes asymptotically Mendelian at the LD region S j if and only if this region contains at least one causal variant for . Intuitively, under the four structural assumptions (low-dimensional latent confounding, outcome diversity, signal floors, and isolation-spillover), maximizing the correlation between V j and Y α forces the resulting composite outcome to localize its causal signal to a single LD block. This contrasts with common practice, which fixes a single outcome or factor and tests all variants, often yielding many significant loci scattered across the genome ( Figure 1 , left). 
By comparison, the estimator learns a variant-specific composite outcome for each V j , and Theorem 1 shows that all causal variants for this learned outcome concentrate in one LD region ( Figure 1 , right). Beyond this localization guarantee, Theorem 1 also changes how we interpret CCA loadings. For each variant V j , we estimate the canonical coefficients and form the composite . We then hold this composite fixed and evaluate association genome-wide – not just at V j itself. Each thus defines a Mendelianized trait that we scan across all variants. We interpret as a Mendelian trait when the corresponding Manhattan plot shows a single tower confined to an LD region containing V j . Finally, we recover a stronger version when a finite outcome set is already sufficiently rich: Proposition 1 (Perfect Mendelianism at finite m) Consider Assumptions 5–8 (Supplementary Materials A) with a prespecified m and suppose Σ m is positive definite. We have and (i . e ., Mendelianism is perfectly achieved for 𝒮 j at m) if and only if at least one variant l ∈ 𝒮 j causes . Remark . Proposition 1 shows that, under stronger assumptions, Mendelianization can in principle recover a classical single-locus architecture for the learned outcome at a fixed number of outcomes m . We emphasize Theorem 1 because it uses weaker, more realistic assumptions and allows Mendelianism to emerge asymptotically as we enrich the outcome set. This distinction is important for complex, polygenic diseases, where perfect single-locus resolution may not be achievable at any fixed m . Fast Hypothesis Testing We now test the null of no dependence between V j and Y , even though we learn α j from the same data. We henceforth drop the superscript m to simplify notation. The null and alternative hypotheses are: A naive Pearson test would be anti-conservative, because it treats α j as fixed even though we estimated it using the observed associations. 
To obtain a valid test, we construct a statistic whose null distribution accounts for the fact that $\alpha_j$ is data-driven. Under $\mathcal{H}_0$, with i.i.d. samples and finite fourth moments,

$$z_j = \sqrt{n}\, \hat{r}_j \xrightarrow{d} \mathcal{N}(0, \Sigma),$$

where $n$ denotes sample size (footnote 1). The canonical coefficients are proportional to $A z_j^\top$, which motivates the quadratic form $Q_j = z_j A z_j^\top$. To understand the null distribution of $Q_j$, write the Cholesky decomposition $\Sigma = LL^\top$ and let $u \sim \mathcal{N}(0, I_m)$ so that $z_j = u^\top L^\top$. Substituting into $Q_j$ gives

$$Q_j = u^\top L^\top (LL^\top)^{-1} L u = u^\top u,$$

so that $Q_j \sim \chi^2_m$ under $\mathcal{H}_0$. We then compute the right-tail probability of $Q_j$ under the $\chi^2_m$ distribution to obtain the $p$-value.

The above $\chi^2$ result is asymptotic, and its finite-sample accuracy depends on how well we approximate $\Sigma$. In the next section, we show that a modified version of $\Sigma$ can be consistently estimated using millions of variants under realistic LD and sparsity conditions, so that large-$n$ and large-$q$ asymptotics both work in our favor.

Operationalizing Mendelianization at Scale

Implementing the above hypothesis test requires access to $\Sigma$. However, biobanks rarely report $\Sigma$ in their summary statistics, and the outcomes needed to estimate $\Sigma$ are typically measured on partially overlapping sets of individuals. Scaling Mendelianization also raises two additional challenges beyond test calibration: redundant traits can obscure the interpretation of raw canonical coefficients, and the available traits may not be rich enough to approach near-perfect Mendelianism. We therefore (a) replace $\Sigma$ with a quantity that is estimable from summary $z$-statistics under partial overlap, (b) show how to estimate it at scale despite LD, (c) rescale coefficients for interpretability, and (d) introduce a metric to quantify how close a locus comes to Mendelian behavior.

From Σ to Γ Under Partial Overlap

If all traits are fully observed and $V \perp\!\!\!\perp Y$, then $\operatorname{cov}(z_{ja}, z_{jb}) \to \Sigma_{ab}$ with i.i.d. sampling, and we can consistently estimate $\Sigma$ using (approximately) independent variants.
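The quadratic-form test and the idea of estimating $\Sigma$ from null $z$-statistics can be sketched together. This is a minimal simulation under the global null with fully observed traits and independent variants. The helper names, the toy `Sigma`, and the simulated matrix `Z` are assumptions made for illustration, and the sketch does not handle LD or partial sample overlap.

```python
import numpy as np
from scipy.stats import chi2


def estimate_outcome_correlation(Z):
    """Average of z_ja * z_jb over variants (rows of Z)."""
    return Z.T @ Z / Z.shape[0]


def quadratic_form_pvalues(Z, Sigma):
    """Q_j = z_j Sigma^{-1} z_j^T per variant, with chi-square(m) right-tail p-values."""
    A = np.linalg.inv(Sigma)
    Q = np.einsum("ja,ab,jb->j", Z, A, Z)
    return Q, chi2.sf(Q, df=Z.shape[1])


# Hypothetical setup: 20,000 independent null variants, three outcomes.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
rng = np.random.default_rng(1)
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((20000, 3)) @ L.T  # each row is z_j = u^T L^T

Sigma_hat = estimate_outcome_correlation(Z)
Q, p = quadratic_form_pvalues(Z, Sigma_hat)
null_rejection_rate = float((p < 0.05).mean())
```

Under this null simulation the rejection rate at the 0.05 level should sit close to 0.05, which is the calibration property the test is designed to deliver.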
In practice, samples overlap only partially across traits, so z -statistics and correlations are computed from unequal sample sizes. Let M a,i ∈ {0, 1} indicate whether Y a,i is observed, and define S a = { i : M a,i = 1}, n a = | S a |, and n ab = | S a ∩ S b |. We assume that V contains no missing values, and that each Y a has been centered so that E ( Y a | M ) = 0 for the realized missingness pattern M . Let denote the second-moment matrix conditional on both Y a and Y b being observed (i.e., over S a ∩ S b ). Then: Proposition 2 Assume i . i . d. samples across individuals, V ╨ Y, and V ╨ M | Y . Then Remark . The assumption V ╨ M | Y is reasonable in biobank settings since missingness in Y is largely driven by the presence or severity of symptoms themselves (as well as factors such as sex, age, and ancestry, which we already adjusted for). Proposition 2 also motivates replacing Σ with the matrix Γ in Expression (2), where Replacing Σ with Γ has a convenient interpretation. Define modified outcomes which set missing entries of Y a to zero and rescale by the square root of the observed sample proportion. Then E ( Y ′ | M ) = 0 for the realized M , and hence . Moreover: Proposition 3 Assume i . i . d. samples across individuals, p a = ℙ( M a = 1) > 0, and p b = ℙ( M b = 1) > 0. Then and as n → ∞. Remark . Γ ab is therefore the conditional covariance of Y ′ for the realized missingness pattern M , but it converges to the unconditional covariance of Y ′ as sample size grows, so dependence on the specific pattern of missingness fades. As a result, we can maximize the (asymptotically) unconditional correlation in Expression (2) by working with Y ′ instead of Y : Importantly, this reparameterization targets Γ directly and therefore does not require missingness-at-random (MAR) 14 for Σ. Theorem 1 then holds for Y ′ at the population level with Σ replaced by Γ. 
The scaling factors used to define Y ′ do not affect the optimization, since: where D = diag( n/n 1 , …, n/n m ) and γ = D 1 / 2 α . Hence: Even better, we never need explicit access to Y ′ because the sample covariance simplifies to: which implies: except that we replace in the denominator with its conditional counterpart Γ, which has the same large-sample limit by Proposition 3. Thus, we only need the z -statistics z j and an estimate of Γ. This reasoning motivates us to maximize: which mirrors Expression (2) with Σ replaced by Γ. We henceforth take as our solution, where Ω = Γ − 1 . Estimating Γ with LD Although Γ is natural to target, empirical estimation is nontrivial. In principle, Proposition 2 shows that genome-wide z -statistics identify Γ under the global null V ╨ Y . In practice, two main obstacles remain: LD induces dependence between nearby variants, and a subset of variants truly associate with Y . Fortunately, we can handle both these issues with the following result: Theorem 2 Consider the estimator: . Let G q index the variants that are independent of Y, and B q index the variants that are dependent on Y . Let , L and D denote finite constants. Assume: (Bounded moments for G q ) For j ∈ G q , we have 𝔼( z ja z jb | M ) = Γ ab and . (Bounded moments for B q ) For j ∈ B q , we have 𝔼( z ja z jb | M ) = µ j with L and . (Local LD) There is an undirected dependency graph with edge set E q and maximum degree D such that |cov( z ja z jb , z ka z kb | M )| ≤ C 2 when ( j, k ) ∈ E q and cov( z ja z jb , z ka z kb | M ) = 0 when ( j, k ) ∉ E q . (Vanishing contamination) We have as q → ∞. Then , conditional on M as q → ∞. Remark . The key condition is the local LD assumption. In practice, investigators routinely treat LD as local by bounding each variant’s LD neighborhood (for example, within 1 Mb), which makes the dependency degree D finite. Under this condition, we can show that the variance of Ŝ ab goes to zero, so given the realized missingness pattern. 
Moreover, the vanishing contamination assumption reflects the sparse genetic architecture of most traits: after standard quality control, only a tiny fraction of variants are typically associated with any given outcome, so the non-null set B q grows more slowly than q . This matches empirical experience in large biobanks, where the vast majority of variants behave as null. Taken together, these conditions imply a weak law of large numbers despite LD. In practice, with millions of variants, we can estimate each entry of Γ to negligible error, so we treat Γ as known in subsequent derivations. Consequently, under , and we replace Σ with Γ and A with Ω = Γ − 1 so that in the test of Expression (3). We will later show that this test works very well in real data. Interpretable Coefficients We can now solve Expression (5) using an estimate of Γ. However, the raw coefficients correspond to: where is the residual of after regressing it onto , so they depend on the residual variance var , which varies across outcomes and complicates comparisons. To address this, we report rescaled coefficients for interpretability: Proposition 4 We have for every Y a ∈ Y . Thus, equals up to the common factor . The rescaled canonical coefficients are unitless, directly comparable across outcomes for a given variant V j , and their signs match the allelic direction of effect on the unique component of (the residual after adjusting for the other traits). In practice, can be viewed as a standardized effect of V j on the part of trait a that is not explained by the remaining traits in Y . Quantifying the Degree of Mendelianism Near-perfect Mendelianism rarely occurs when the outcome set is small (for example, m ≤ 15). Although Theorem 1 provides a yes/no criterion in the limit, applied analyses benefit from a continuous measure that captures how close Y α j comes to Mendelian behavior. 
We therefore fix the coefficients $\alpha_j$ learned at the lead variant $V_j$ and, for any variant $V_k$ (including $k = j$), compute

$$\mathcal{Z}_{k|j} = \frac{z_k \alpha_j}{\sqrt{\alpha_j^\top \Gamma \alpha_j}}.$$

Under $\mathcal{H}_{0,k}: r_k \alpha_j = 0$, we have $\mathcal{Z}_{k|j} \xrightarrow{d} \mathcal{N}(0, 1)$ with i.i.d. samples and finite fourth moments, yielding the two-sided $p$-value $p_{k|j} = 2\Phi(-|\mathcal{Z}_{k|j}|)$ (footnote 2). We compute $\mathcal{Z}_{k|j}$ and $p_{k|j}$ for all $V_k \in V$ (i.e., genome-wide). If we reject $\mathcal{H}_0$ in Expression (3) for $V_j$, we summarize how concentrated the signal is at the $j$-locus with the Score of Mendelianism (SoM) $\mathcal{M}_j$, which is built from $h(\mathcal{Z}) = (|\mathcal{Z}| - z_\alpha)_+$, where $z_\alpha$ is the $z$-score corresponding to a two-sided $p$-value of $\alpha$ (default $\alpha = 5 \times 10^{-8}$). We construct $\mathcal{S}_j$ using proximity-based clustering around $V_j$ (single-linkage with a 500 kb gap, padded by ±500 kb), though alternatives such as LD-based clustering relative to $V_j$ are also possible.

Remark. The SoM lies in $[0, 1]$ and can be interpreted as the fraction of "excess" Mendelianized signal (above the genome-wide threshold $z_\alpha$) that lies within the LD region of interest. A value of $\mathcal{M}_j = 1$ indicates complete isolation of the signal to a single locus, while $\mathcal{M}_j$ approaches 0 when the signal lies entirely outside $\mathcal{S}_j$ or is highly polygenic. We average within $\mathcal{S}_j$ to avoid favoring larger loci and do not average outside $\mathcal{S}_j$ so that even small off-target regions remain visible rather than being diluted across millions of variants. In practice, we use $\mathcal{M}_j$ only as a ranking metric among loci that are already genome-wide significant; we do not use it for hypothesis testing or confidence intervals.
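The steps of this section can be sketched end to end: learn weights at a lead variant, rescale them, scan the fixed composite across all variants, and summarize concentration. The sketch below rests on several assumptions, none taken from the paper's R package. It treats $\Gamma$ as known, rescales weights by dividing by the square roots of the diagonal of $\Omega$, and assumes one reading of the SoM that matches the remark above: mean excess inside the locus divided by that mean plus the summed excess outside. All function names and toy data are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np


def mendelianized_scan(Z, Gamma, j):
    """Learn weights at lead variant j, then scan the fixed composite genome-wide."""
    Omega = np.linalg.inv(Gamma)
    alpha = Omega @ Z[j]                                  # alpha_j = Omega z_j^T
    scaled = alpha / np.sqrt(np.diag(Omega))              # unitless, comparable weights
    scan = Z @ alpha / math.sqrt(alpha @ Gamma @ alpha)   # Z_{k|j} for every variant k
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in scan])  # 2*Phi(-|Z|)
    return alpha, scaled, scan, pvals


def score_of_mendelianism(scan, in_locus, level=5e-8):
    """ASSUMED form: mean excess inside the locus / (that mean + summed excess outside)."""
    z_alpha = NormalDist().inv_cdf(1 - level / 2)
    excess = np.maximum(np.abs(scan) - z_alpha, 0.0)
    inside = excess[in_locus].mean()
    outside = excess[~in_locus].sum()
    total = inside + outside
    return inside / total if total > 0 else float("nan")


# Hypothetical toy data: 1,000 variants, 3 outcomes, one planted locus.
Gamma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
rng = np.random.default_rng(2)
Z = rng.uniform(-1.0, 1.0, size=(1000, 3))  # bounded background, not a realistic null
signal = np.array([8.0, -3.0, 6.0])
Z[100:105] = signal
in_locus = np.zeros(1000, dtype=bool)
in_locus[100:105] = True

alpha, scaled, scan, pvals = mendelianized_scan(Z, Gamma, j=102)
som_clean = score_of_mendelianism(scan, in_locus)

# Copy the lead signal to one distant variant to mimic off-locus spillover.
Z_spill = Z.copy()
Z_spill[500] = signal
_, _, scan_spill, _ = mendelianized_scan(Z_spill, Gamma, j=102)
som_spill = score_of_mendelianism(scan_spill, in_locus)
```

In this toy setup the score is 1 when all excess signal sits inside the locus and drops to 0.5 once a single distant variant carries the same excess. This illustrates why the outside term is summed rather than averaged under the assumed form.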
Summary

Mendelianization corresponds to one-dimensional CCA, modified in five ways: (1) it uses $z$-statistics in place of raw correlation coefficients, (2) it overcomes missingness by normalizing with $\Gamma$ estimated over variants rather than with $\Sigma$ estimated over outcomes, (3) it replaces Wilks' $\Lambda$ with a $\chi^2$ test for significance, (4) it rescales canonical coefficients by the square roots of the diagonals of $\Omega$ for interpretability, and (5) it applies each significant composite outcome genome-wide by Theorem 1. Pseudocode and a formal time complexity analysis (Supplementary Materials B) show that Mendelianization scales cubically with the number of outcome variables but only linearly with the number of variants. This remains computationally practical in modern biobanks, where the number of outcomes rarely exceeds 100, whereas the number of variants often surpasses one million.

Results

Comparators

We compared Mendelianization against:

Meta-analytic Canonical Correlation Analysis (metaCCA) [4]: performs CCA on summary statistics using Wilks' $\Lambda$, which assumes perfectly overlapping and equal sample sizes per trait. metaCCA also estimates $\Sigma$ from linear regression $\beta$ coefficients, so it cannot fully combine linear and logistic regression models whose coefficients differ in scale and null variances. Finally, metaCCA lacks causal guarantees, and its scale-dependent canonical coefficients are not directly comparable across traits.

Heritability Informed Power Optimization (HIPO) [21]: learns multiple sets of outcome weights that increase statistical power by maximizing a standardized non-centrality parameter across, rather than within, loci. We compare against the first and most powerful set of HIPO weights.

Fast Association analysis based on SubSETs (fastASSET) [22]: a multi-trait analysis method that first pre-screens traits with suggestive associations and adjusts for this selection.
It then searches over all (two-sided) subsets of traits – allowing opposite effect directions – and combines evidence into a single, multiple testing adjusted omnibus p -value per variant. Mendelianization is the only method that optimizes composite outcomes so that each learned outcome is associated with a single tower in the Manhattan plot that contains at least one causal variant. We finally also compared Mendelianization against three ablated variants : (i) using variant-specific outcomes (i.e., 𝒵 i | j = 𝒵 i | i ) as opposed to fixed outcomes applied genome-wide, (ii) estimating Γ using β coefficients instead of z -statistics, (iii) a combination of both (i) and (ii). Metrics Our primary evaluation metric was the Score of Mendelianism (SoM) , where higher values indicate a greater ability to isolate a single locus. The metric provides a fair basis for comparison because no algorithm, including Mendelianization, optimizes SoM. Given that Mendelianization and metaCCA have different coefficient interpretations, we computed SoM with variant-specific outcomes for metaCCA, and with fixed outcomes applied genome-wide for Mendelianization (the outcome learned at the lead V j , applied to all variants). This ensures each method is evaluated in line with its intended use. Beyond detecting a Mendelian-like pattern, rigorous applications also require valid, efficient inference. As a result, we evaluated Type I error control using the signed Kolmogorov–Smirnov (KS) distance between observed p - values above 0.05 and the uniform distribution on [0.05, 1]. Well-calibrated methods should yield nearly uniform p -values in this range, since most variants are uncorrelated with the target. For methods with positive (anti-conservative) KS-distances below 0.01, we then assessed the statistical power of the hypothesis tests for p -values below 10 − 4 , since most variants with p -values below this threshold are likely to be associated with their learned outcome. 
Finally, we measured run-time . Simulations We first conducted simulations using individual-level autosomal genotype data from European-ancestry participants in the 1000 Genomes Phase 3 dataset. 5 After regressing out the first five principal components, we partitioned the genome into 1,006 LD blocks using the optimal LD-splitting algorithm 20 (parameters: r 2 threshold for ignoring pairwise correlations = 0.02, window size = 500 Kb, maximum number of variants per block = 2,000, minimum number of variants per block = 200). We then randomly sampled 100,000 variants and, for each of 40 bootstrap replicates, randomly selected five causal variants V c from this set. For each replicate, we first generated 50 latent traits according to and then constructed 32 observed outcomes, where contained 50 variables, each element of ζ was drawn from a uniform distribution on [−0.05, −0.01] ∪ [0.01, 0.05], each entry of τ was drawn from a standard normal distribution, and ε Y also followed a standard normal distribution. This construction mimics realistic settings in which observed outcomes are noisy, imperfect measurements that do not map cleanly onto the underlying biology. For each outcome, we randomly drew the sample size from a uniform distribution between 50,000 and 100,000 individuals by bootstrap resampling individuals from 1000 Genomes. Within each LD block, we then performed a second-level bootstrap to disrupt cross-block correlations and to approximate GWAS-scale sampling variability given the relatively small 1000 Genomes sample size. We applied all methods to the resulting 40 simulated datasets for each outcome configuration (2, 4, 8, 16, and 32 outcomes). We summarize the simulation results in Figure 2 , using primarily violin plots. Mendelianization achieved the highest SoM across all numbers of outcomes ( Figure 2 (a) ). 
Consistent with Theorem 1, the SoM steadily increased toward one as the number of outcome variables increased, indicating that the algorithm identified increasingly Mendelian-like traits with more outcomes. In contrast, the other methods produced SoMs near zero, reflecting the fact that we generated the data from five randomly chosen causal variants rather than a single causal variant.

Fig. 2. Main results from the synthetic data experiments. All plots share the same legend. (a) Mendelianization achieved the highest SoM across all numbers of outcomes, with SoM increasing as the number of outcomes grew, consistent with Theorem 1. (b) Mendelianization was the only algorithm that produced well-calibrated p-values, in agreement with Theorem 2. (c,d) Mendelianization maintained statistical power of approximately 0.85 and a near-perfect true positive rate across all outcome settings. Panel (d) is displayed as a line plot because the mean true positive rate assumes only a discrete set of values. Error bars denote 95% confidence intervals for the mean. (e) Mendelianization completed in under one second for all runs.

Mendelianization also yielded the most well-calibrated p-values for null variants, with calibration close to ideal, consistent with Theorem 2 (Figure 2(b)). By comparison, most alternative algorithms showed progressively poorer calibration as the number of outcomes increased, indicating a failure to control the type I error rate under outcome learning. Mendelianization was further distinguished as the only method that consistently retained statistical power of approximately 0.85 across all outcome dimensions (Figure 2(c)), whereas the power of competing approaches fluctuated substantially, in line with their unstable p-value calibration.
In terms of discovery performance, Mendelianization almost always recovered all five causal variants, and its true positive rate increased with the number of outcomes ( Figure 2 (d) ), demonstrating robust power even in high-dimensional outcome settings. Finally, Mendelianization completed in under one second for all runs, whereas other algorithms required approximately three-to four-fold longer computation times ( Figure 2 (e) ). Overall, these results indicate that Mendelianization identifies more Mendelian-like traits as the number of outcomes grows, maintains well-calibrated p -values with high power, and remains computationally efficient. Finally, ablation results confirmed the necessity of applying canonical coefficients genome-wide and using z -statistics to achieve maximal performance ( Supplementary Figure 1 ). Depression and Generalized Anxiety We next evaluated the algorithms using real European ancestry summary statistics from the Pan-UK Biobank. 3 , 12 The outcome variables comprised all 52 non-collinear items in the UK Biobank Mental Health Questionnaire assessing lifetime major depression and generalized anxiety. For all experiments, we restricted analyses to high-quality HapMap3 autosomal, non-MHC, biallelic SNPs with INFO > 0.9 and MAF > 1% in UK Biobank and, when available, gnomAD genome/exome. Figure 3 summarizes the results. Mendelianization identified three loci that surpassed the significance threshold of 5 × 10 − 8 . These loci also attained perfect SoMs of one, whereas most loci identified by the other algorithms had scores close to zero ( Figure 3 (a) ). Applying each of the three learned composite Y α j genome-wide and plotting the p -values associated with the z -statistics Z · | j yielded a single tower per locus, as implied by the SoMs and predicted by Theorem 1 ( Figure 3 (b) ). 
The interpretable coefficients further delineated three distinct patient profiles: (i) uncontrollable worry with recurrent depression, (ii) melancholic-like depression characterized by insomnia and worthlessness, and (iii) controllable worry accompanied by somatic complaints (Supplementary Figure 2). Together, these results demonstrate that Mendelianization achieved the greatest spikedness while yielding interpretable coefficient patterns.

Fig. 3. Main results for depression and generalized anxiety. (a) Each dot denotes the SoM for the lead variant within a Manhattan-plot tower, and the red lines correspond to the mean SoMs. Only Mendelianization achieved perfect SoMs for all lead variants. No variants reached the genome-wide significance threshold under fastASSET. (b) For each of the three dots in the Mendelianization column of panel (a), the learned outcome corresponded to an essentially perfect Mendelian trait because it produced a Manhattan plot with a single tower, consistent with Theorem 1. (c) Only Mendelianization produced well-calibrated p-values; fastASSET was conservative. (d) Among methods with well-calibrated or conservative p-values, Mendelianization achieved the highest statistical power by a wide margin. (e) Mendelianization was approximately four orders of magnitude faster than alternative methods.

The three loci discovered by Mendelianization correspond to biologically plausible causal regions. The chromosome 15 signal with lead variant rs8038726 lies in the 15q11-q13 GABA_A receptor cluster (GABRB3, GABRA5, GABRG3), implicated in abnormalities of inhibitory signaling, anxiety and mood disorders, and sleep disturbance. 11,15 The cluster also mediates the pharmacological effects of benzodiazepines and related hypnotics 28 – consistent with a phenotype dominated by uncontrollable worry and recurrent depression.
The dense chromosome 17 block with lead variant rs2696455 falls within the 17q21.31 inversion containing CRHR1, whose variants have been associated with major depression, frequently via gene-environment interactions involving childhood maltreatment and through effects on hypothalamic-pituitary-adrenal (HPA) axis reactivity. 30 , 18 , 34 Melancholic depression, in particular, is the subtype most tightly linked to HPA-axis hyperactivity, elevated corticotropin-releasing hormone (CRH), and increased noradrenergic tone, with classic work demonstrating hyper-cortisolism and elevated CRH levels in melancholic compared with non-melancholic depression. 9 The chromosome 7 locus with lead variant rs1990622 overlaps TMEM106B, which multiple GWAS and pleiotropic analyses implicate in depression risk and symptomatology. 13 , 6 Experimental work has also linked this gene to an anxiety-like phenotype in mice, 19 broadly consistent with a profile characterized by controllable worry and prominent somatic complaints. Overall, these three loci localize to biologically plausible regions harboring putative causal variants in line with Theorem 1 . We next evaluated secondary performance metrics. Only Mendelianization produced well-calibrated p -values ( Figure 3 (c) ). FastASSET generated conservative p -values, whereas metaCCA and HIPO yielded anti-conservative estimates, indicating unreliable significance levels. We plot the individual histograms in Supplementary Figure 3 . Among methods with calibrated or conservative behavior, Mendelianization achieved the greatest statistical power by far ( Figure 3 (d) ). The algorithm also required only 2 seconds to complete, while the other methods were slower by at least four orders of magnitude ( Figure 3 (e) ). Overall, Mendelianization uniquely combined well-calibrated inference, high power, and rapid computation. Finally, ablation results confirmed the necessity of all components of the algorithm ( Supplementary Figure 4 ). 
Alcohol Use Disorder

We increased the difficulty in alcohol use disorder by running the algorithms with only 10 outcome variables corresponding to all individual items of the Alcohol Use Disorders Identification Test (AUDIT). 26 We again used European ancestry summary statistics from the Pan-UK Biobank with the same quality control. According to Theorem 1 and the simulation results, Mendelianization is unlikely to recover near-perfect Mendelian traits with only 10 outcomes, but we still expect it to find approximate Mendelian traits.

The results, summarized in Figure 4, support this expectation. Mendelianization did not achieve perfect SoMs, but every locus achieved higher scores than those from any competing method. Visualization of the per-outcome genome-wide Manhattan plots revealed a substantial, though not perfect, degree of Mendelianism (Figure 4(b)). Each trait was associated with an interpretable set of canonical coefficients (Supplementary Figure 5).

Fig. 4. Main results for alcohol use disorder. (a) Mendelianization achieved the highest SoMs for each lead variant. (b) The learned outcome for each point in the Mendelianization column of panel (a) approximated a Mendelian trait. (c) Only Mendelianization produced well-calibrated p-values. (d) Both Mendelianization and fastASSET attained high statistical power. (e) Mendelianization completed in 0.58 seconds.

The three loci have strong prior support for causal involvement in alcohol-related phenotypes. The chromosome 17 tower with lead variant rs41382552 again lies in the 17q21.31 inversion encompassing CRHR1–MAPT; CRHR1 variants associate with binge drinking and alcohol dependence 33 and interact with stress exposure, 24 consistent with a mechanism in which stress-axis signaling contributes causally to episodic binge drinking.
The chromosome 2 signal with lead variant rs780093 maps to GCKR, and correlated variants robustly associate with alcohol consumption and AUDIT scores in large GWAS. 25 Experimental perturbations of the GCKR–glucokinase axis in C. elegans 32 and a GCKR P446L knock-in mouse model 16 further suggest that this pathway can causally influence alcohol-related behaviors. The chromosome 4 locus with lead variant rs11940694 overlaps KLB, encoding the FGF21 co-receptor; human GWAS and experimental manipulations of FGF21 pathways in rodents 36 and non-human primates 7 converge on a model in which this axis directly regulates alcohol preference and intake. External data therefore support a locus-level causal role for CRHR1/17q21.31, GCKR, and KLB in distinct patterns of alcohol use. As before, Mendelianization was the only method to yield well-calibrated p -values ( Figure 4 (c) ; Supplementary Figure 6 ), while both Mendelianization and fastASSET achieved high statistical power ( Figure 4 (d) ). Finally, Mendelianization completed at least three orders of magnitude faster than competing methods ( Figure 4 (e) ), and ablation results confirmed the necessity of the individual components of the algorithm ( Supplementary Figure 7 ). We conclude that Mendelianization was again the only method that simultaneously achieved high degrees of Mendelianism, well-calibrated inference, high power, and rapid computation. Discussion Mendelianization learns composite outcomes whose associations asymptotically concentrate at a single locus. The locus contains at least one causal variant under just four structural assumptions – low-dimensional confounding, outcome diversity, signal floors, and isolation-spillover. The algorithm also handles partial sample-overlap, yields cross-outcome comparable canonical coefficients, and quantifies single-locus concentration via the SoM. 
In short, Mendelianization performs genome-wide, locus-level causal inference using a substantially modified one-dimensional CCA.

Despite these strengths, Mendelianization has limitations. First, it cannot resolve LD within a locus because it is a one-dimensional method; isolating variant-level causality generally requires fine-mapping – a separate class of techniques fundamentally distinct from Mendelianization. Second, Mendelianization cannot accommodate completely non-overlapping samples between outcome pairs, which poses challenges when distinct populations are assessed with different measures. Third, Mendelianization remains a linear method that may miss nonlinear confounding and causal effects. Future research should therefore explore combining Mendelianization with fine-mapping approaches, extending it to handle fully non-overlapping samples, and generalizing it to nonlinear settings. In summary, Mendelianization provides a powerful framework for isolating causal loci to the extent permitted by available outcomes, advancing a reductionist approach that helps clarify – rather than complicate – the pathophysiological understanding of complex disease.

Data Availability

All data are publicly available in the Pan-UK Biobank (https://pan.ukbb.broadinstitute.org/).

Competing interests

No competing interest is declared.

Author contributions statement

E.V.S. conceived, designed, and executed the project in its entirety.

Supplementary Materials

A. Derivation of CCA Solution

Correlation and, likewise, the solution to Expression (1) is scale invariant, so we can impose the constraint αᵀΣα = 1 and maximize the numerator r_j α. We invoke the Cauchy-Schwarz inequality in the Σ-inner product:

$$r_j \alpha = (\Sigma^{-1} r_j^{T})^{T} \Sigma\, \alpha \le \sqrt{r_j \Sigma^{-1} r_j^{T}}\,\sqrt{\alpha^{T} \Sigma\, \alpha} = \sqrt{r_j \Sigma^{-1} r_j^{T}},$$

where the last equality follows because we imposed the constraint αᵀΣα = 1. The inequality becomes an equality when $\alpha \propto \Sigma^{-1} r_j^{T}$. Recall that the solution to Expression (2) is scale-invariant, so we set $\alpha = A r_j^{T}$ with A = Σ⁻¹.

A. Finite m Assumptions

Assumption 5 (Low-dimensional latent confounding) For a prespecified m ∈ ℕ⁺, the confounding effects are contained in a subspace U_m ⊂ ℝ^m, and there exists an integer constant r_0 ≥ 0 such that dim(U_m) ≤ r_0.

Assumption 6 (Outcome diversity) We have for a prespecified m .

Assumption 7 (Signal floor) For a prespecified m, if l ∈ 𝒮_j causes Y α_j, then there exists a constant β_c > 0 such that . Moreover, there exists a constant β_on > 0 such that for every l ∈ 𝒮_j that does not cause Y α_j .

Assumption 8 (Isolation-spillover) If some l ∈ 𝒮_j causes Y α_j, we have for a prespecified m. On the other hand, if there does not exist l ∈ 𝒮_j that causes Y α_j, then ∃ k ∉ 𝒮_j with and for a prespecified m .

B. Pseudocode & Time Complexity Analysis

We present the pseudocode for Mendelianization in Algorithm 1. The algorithm first computes S in Line 1, which serves as a consistent estimator of Γ under Theorem 2. Using this estimate, it then derives the raw canonical coefficients in Line 2 by solving Expression (5). These coefficients also form the basis of the Q statistics in Expression (3) that enable hypothesis testing; recall that each Q_j follows a distribution under the null, so the algorithm can readily compute p-values in Line 4. Next, Mendelianization computes SoMs for lead variants that surpass a genome-wide significance threshold (e.g., 5 × 10⁻⁸) in Line 5. Finally, the algorithm converts the raw coefficients into their interpretable form α′ by applying Proposition 4 in Line 6. Thus, although the theory of Mendelianization is intricate, the resulting algorithm is remarkably straightforward.

Algorithm 1. Mendelianization

The algorithm also has a favorable time complexity. In particular, estimating the correlation matrix Γ described in Line 1 costs O(qm²) time. In Line 2, inverting Γ requires O(m³) time, and the matrix product Ω zᵀ required to estimate α costs O(m²q).
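The costs quoted for Lines 1 and 2 can be made concrete with a minimal NumPy sketch. It assumes, hypothetically, that S is the average outer product of the z-statistic rows and that Ω is its inverse; the exact estimator of Theorem 2 and Expression (5) are not reproduced in this section, so the function name `lines_1_and_2`, the (q, m) array layout, and the absence of any missing-data handling are illustrative choices rather than the author's implementation.

```python
import numpy as np

def lines_1_and_2(Z):
    """Illustrative sketch of Lines 1-2 under assumed formulas.

    Z is a (q, m) array of z-statistics: one row per variant,
    one column per outcome, with no missing entries (an assumption).
    """
    q, m = Z.shape
    # Line 1 (assumed form): average outer product over variants, O(q m^2).
    S = (Z.T @ Z) / q
    # Line 2a: invert the m x m estimate, O(m^3).
    Omega = np.linalg.inv(S)
    # Line 2b: raw coefficients for every variant at once, O(m^2 q).
    # Row j of alpha equals (Omega @ Z[j]) because Omega is symmetric.
    alpha = Z @ Omega
    return S, Omega, alpha
```

With q on the order of a million variants and m below 100 outcomes, the two matrix products dominate and the m × m inversion is negligible, which matches the linear-in-q scaling stated in the text.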
Constructing all Q_j statistics and evaluating their p-values in Lines 3 and 4 adds O(mq), and computing α′ in Line 6 requires another O(mq). For the SoM step, suppose there are s significant variants, and each constitutes its own tower to take the worst case. Computing each statistic Ƶ_{j|k} requires O(m) time for the numerator and O(m²) for the denominator in Equation (6). Moreover, the numerator must be evaluated genome-wide over q variants, so an upper bound on the total cost of SoM amounts to O(s(mq + m²)). Summing these components, the overall complexity of Mendelianization is O(m³ + m²q + sqm), where the simplification uses q ≫ m. Thus, the algorithm scales linearly with the number of variants and cubically with the number of outcome variables. In practice, Mendelianization completes within seconds and is at least three orders of magnitude faster than the competing methods.

C. Proofs

Theorem 1 (Asymptotic Mendelianism) Consider Assumptions 1-4 and suppose Σ_m is positive definite for sufficiently large m. We have for sufficiently large m and (i.e., Mendelianism is asymptotically achieved for at 𝒮_j) if and only if at least one variant l ∈ 𝒮_j causes .

Proof We first prove the forward direction by contrapositive. Suppose there does not exist a variant l ∈ 𝒮_j that causes . As a result, j ∈ 𝒮_j does not cause . We need to show or (or both) for m sufficiently large. We will show that . Assumption 1 allows us to write: Choose M_1 > 0. By Assumptions 3 and 4, ∃ M_2 ≥ M_1 such that ∀ m ≥ M_2, ∃ k ∉ 𝒮_j where: The reverse triangle inequality allows us to write: Let . Under Assumption 2, ∃ M_3 ≥ M_2 such that ∀ m > M_3 we have . As a result: where . This completes the proof of sufficiency – i.e., asymptotic Mendelianism is sufficient to establish causality.

To prove the backward direction, assume that there exists at least one variant l ∈ 𝒮_j that causes . Then, we can invoke Assumption 1 and write Equation (7) for any k.
We have two situations: If k ∉ 𝒮_j, then by Assumption 2 . Moreover, by the first clause of Assumption 4 . Hence, . If k ∈ 𝒮_j and k = j, then we have two situations. If j causes , then the first clause of Assumption 3 allows us to write: for sufficiently large m. On the other hand, if j does not cause , then the second clause of Assumption 3 allows us to write: for sufficiently large m. We conclude that for sufficiently large m and , which completes the proof of necessity – i.e., asymptotic Mendelianism is necessary to establish causality. □

Proposition 1 (Perfect Mendelianism at finite m) Consider Assumptions 5-8 with a prespecified m and suppose Σ_m is positive definite. We have and (i.e., Mendelianism is perfectly achieved for 𝒮_j at m) if and only if at least one variant l ∈ 𝒮_j causes .

Proof The proof almost exactly parallels that of Theorem 1, but we include it here for completeness. We first prove the forward direction by contrapositive. Suppose there does not exist a variant l ∈ 𝒮_j that causes . As a result, j ∈ 𝒮_j does not cause . We need to show or (or both) for prespecified m. We will show that . Assumption 5 allows us to write: By Assumptions 7 and 8, there exists k ∉ 𝒮_j where: The reverse triangle inequality allows us to write: Thus, under Assumption 6, we have: This completes the proof of sufficiency – i.e., perfect Mendelianism is sufficient to establish causality.

To prove the backward direction, assume that there exists at least one variant l ∈ 𝒮_j that causes . Then, we can invoke Assumption 5 and write Equation (8) for any k. We have two situations: If k ∉ 𝒮_j, then by Assumption 6 . Moreover, by the first clause of Assumption 8 . Hence, . If k ∈ 𝒮_j and k = j, then we have two situations.
If j causes , then the first clause of Assumption 7 allows us to write: On the other hand, if j does not cause , then the second clause of Assumption 7 allows us to write: We conclude that and , which completes the proof of necessity – i.e., perfect Mendelianism is necessary to establish causality. □

Proposition 2 Assume i.i.d. samples across individuals, V ╨ Y and V ╨ M | Y. Then .

Proof Note that z-statistics are computed on observed samples, so we can write: Cross-terms cancel by i.i.d. samples. Moreover, V ╨ Y and V ╨ M | Y imply V ╨ Y | M by the contraction and weak union semi-graphoid axioms. As a result, we have: where . Recall that genotypes were standardized so that 𝔼(V_j) = 0 and Var(V_j) = 1. Thus: We applied V ╨ M | Y in the second equality, and V ╨ Y and standardization in the third. Hence: We have established the equality .

Proposition 3 Assume i.i.d. samples across individuals, p_a = ℙ(M_a = 1) > 0, and p_b = ℙ(M_b = 1) > 0. We have and .

Proof For the first statement, we may write: Note that: where p_ab = ℙ(S_a ∩ S_b), and almost sure convergence holds by the strong law of large numbers and the continuous mapping theorem. Hence, We apply the law of total expectation for the second statement: Moreover, we have by cosine similarity, so the sequence is uniformly bounded. We combine this with the almost sure convergence above to invoke the dominated convergence theorem: It follows that: Hence,

Theorem 2 Consider the estimator: . Let G_q index the variants that are independent of Y, and B_q index the variants that are dependent on Y. Let C_1, C_2, L and D denote finite constants. Assume:

1. (Bounded moments for G_q) For j ∈ G_q, we have 𝔼(z_ja z_jb | M) = Γ_ab and .
2. (Bounded moments for B_q) For j ∈ B_q, we have 𝔼(z_ja z_jb | M) = µ_j with and .
3. (Local LD) There is an undirected dependency graph with edge set E_q and maximum degree D such that |cov(z_ja z_jb, z_ka z_kb | M)| ≤ C_2 when (j, k) ∈ E_q and cov(z_ja z_jb, z_ka z_kb | M) = 0 when (j, k) ∉ E_q.
4. (Vanishing contamination) We have as q → ∞.

Then , conditional on M as q → ∞.

Proof Let X_j = z_ja z_jb. Consider the numerator of Ŝ_ab and write: We have 𝔼(N_G | M) = 0 by Assumption 1 of the theorem. We can also write: The inequality follows by Assumption 2 of the theorem. Thus |𝔼(N_B | M)|/q ≤ L|B_q|/q → 0 by Assumption 4 of the theorem. We proved that 𝔼(Ŝ_ab | M) → Γ_ab. With Assumption 3 of the theorem, we can write: The first and second ratios converge to zero. Hence, Var(Ŝ_ab | M) → 0. Let m = 𝔼(Ŝ_ab | M). We invoke the triangle inequality to write: where we have moved the conditioning notation to the subscripts to avoid confusion with absolute value notation. We previously showed that |m − Γ_ab| → 0. Now apply Chebyshev’s inequality when |m − Γ_ab| < ε: We also already showed that Var(Ŝ_ab | M) → 0, so Ŝ_ab → Γ_ab as q → ∞. □

Proposition 4 We have for every Y_a ∈ Y.

Proof Choose any Y_a ∈ Y. Consider regressing on : We thus have: where and e_a denotes the a-th canonical basis vector. We also have . Now consider: Since Var(V_j) = 1 and , we have: We previously showed that . It thus follows that: We conclude that for every Y_a ∈ Y because we chose Y_a arbitrarily. □

D. Additional Experimental Results

Simulations

Supplementary Fig. 1. Ablation results. “-GW” denotes not applying each learned outcome genome-wide, “-Z” denotes estimating Γ from β coefficients instead of z-statistics, and “-GW, Z” denotes using both modifications. (a) Misapplying the CCA-derived outcomes (-GW) or using β coefficients in place of z-statistics (-Z) reduced the SoM to essentially zero; for two of the ablations, the SoM violin plots collapse at the bottom and are therefore not visible. (b) Using β coefficients produced highly conservative p-values.
(c) The corresponding p-value histogram was strongly U-shaped, indicating severe miscalibration and yielding highly unstable power across outcome dimensions. Despite these deficiencies, the -GW and -Z variants still achieved near-perfect true positive rates and completed in under one second, comparable to Mendelianization.

Depression and Generalized Anxiety

Supplementary Fig. 2. Heatmap of |α′| values for the lead variants in depression and anxiety. Each column displays the magnitudes of the interpretable coefficients for a lead variant, normalized to the range [0, 1]. The coefficient patterns resolve three phenotypic profiles: (i) uncontrollable worry with recurrent depression; (ii) melancholic-like depression characterized by insomnia, worthlessness, and cognitive slowing; and (iii) largely controllable worry accompanied by prominent somatic complaints. These profiles illustrate distinct modes of experiencing depression and anxiety, underscoring the disorders’ heterogeneity.

Supplementary Fig. 3. Histograms of p-values across all variants. Mendelianization was the only method to produce a near-uniform distribution for variants associated with p-values above 0.05 (red line). By contrast, metaCCA and HIPO yielded anti-conservative distributions with shrunken p-values, whereas fastASSET was overly conservative, producing p-values near one for many variants.

Supplementary Fig. 4. Ablation results. (a) Misapplying the learned outcomes or using β coefficients in place of z-statistics (or both) decreased the SoM score to near zero. (b) The histogram obtained by β coefficients was very conservative. (c) Inspection of the histogram derived from the β coefficient estimates revealed highly miscalibrated p-values following a U-shape curve.

Alcohol Use Disorder

Supplementary Fig. 5.
Heatmap of |α′| values for the lead variants in alcohol use disorder. The coefficient patterns correspond to three phenotypic profiles: (i) episodic binge drinking, (ii) frequent moderate use of alcohol with blackouts, and (iii) frequent heavy use without blackouts.

Supplementary Fig. 6. Histograms of p-values across all variants. Mendelianization was again the only method to produce a near-uniform distribution on [0.05, 1].

Supplementary Fig. 7. Ablation results. (a) Mendelianization again achieved the highest SoMs, and (b, c) estimating Γ from β coefficients yielded highly miscalibrated p-values.

Acknowledgments

TBD

Footnotes

Added simulation studies and provided biological interpretations of the results obtained from the real data.

1 All arguments can be modified by substituting sample sizes with Fisher-effective sample sizes in generalized linear models with o_p(1) remainders.

2 The p-values are not always calibrated in practice, because the test assumes that α_j is fixed. We do not use this test for formal inference, but only for visualizing and summarizing the degree of Mendelianism.

References

1. Arvanitis, M., Tayeb, K., Strober, B.J., Battle, A.: Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity. The American Journal of Human Genetics 109(2), 223–239 (2022)
2. Ballard, J.L., O’Connor, L.J.: Shared components of heritability across genetically correlated traits. The American Journal of Human Genetics 109(6), 989–1006 (2022)
3. Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L.T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J., et al.: The UK Biobank resource with deep phenotyping and genomic data. Nature 562(7726), 203–209 (2018)
4. Cichonska, A., Rousu, J., Marttinen, P., Kangas, A.J., Soininen, P., Lehtimäki, T., Raitakari, O.T., Järvelin, M.R., Salomaa, V., Ala-Korpela, M., et al.: metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics 32(13), 1981–1989 (2016)
5. 1000 Genomes Project Consortium, et al.: A global reference for human genetic variation. Nature 526(7571), 68 (2015)
6. Fabbri, C., Pain, O., Hagenaars, S.P., Lewis, C.M., Serretti, A.: Transcriptome-wide association study of treatment-resistant depression and depression subtypes for drug repurposing. Neuropsychopharmacology 46(10), 1821–1829 (2021)
7. Flippo, K.H., Trammell, S.A., Gillum, M.P., Aklan, I., Perez, M.B., Yavuz, Y., Smith, N.K., Jensen-Cody, S.O., Zhou, B., Claflin, K.E., et al.: FGF21 suppresses alcohol consumption through an amygdalo-striatal circuit. Cell Metabolism 34(2), 317–328 (2022)
8. Fried, E.I., Nesse, R.M.: Depression is not a consistent syndrome: An investigation of unique symptom patterns in the STAR*D study. Journal of Affective Disorders 172, 96–102 (2015)
9. Gold, P., Chrousos, G.: Organization of the stress system and its dysregulation in melancholic and atypical depression: high vs low CRH/NE states. Molecular Psychiatry 7(3), 254–275 (2002)
10. Grotzinger, A.D., Rhemtulla, M., de Vlaming, R., Ritchie, S.J., Mallard, T.T., Hill, W.D., Ip, H.F., Marioni, R.E., McIntosh, A.M., Deary, I.J., et al.: Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nature Human Behaviour 3(5), 513–525 (2019)
11. Hodges, L.M., Fyer, A.J., Weissman, M.M., Logue, M.W., Haghighi, F., Evgrafov, O., Rotondo, A., Knowles, J.A., Hamilton, S.P.: Evidence for linkage and association of GABRB3 and GABRA5 to panic disorder. Neuropsychopharmacology 39(10), 2423–2431 (2014)
12. Karczewski, K.J., Gupta, R., Kanai, M., Lu, W., Tsuo, K., Wang, Y., Walters, R.K., Turley, P., Callier, S., Shah, N.N., et al.: Pan-UK Biobank genome-wide association analyses enhance discovery and resolution of ancestry-enriched effects. Nature Genetics pp. 1–10 (2025)
13. Li, Y., Dang, X., Chen, R., Teng, Z., Wang, J., Li, S., Yue, Y., Mitchell, B.L., Zeng, Y., Yao, Y.G., et al.: Cross-ancestry genome-wide association study and systems-level integrative analyses implicate new risk genes and therapeutic targets for depression. Nature Human Behaviour pp. 1–18 (2025)
14. Little, R.J., Rubin, D.B.: Statistical analysis with missing data. John Wiley & Sons (2019)
15. Luscher, B., Shen, Q., Sahir, N.: The GABAergic deficit hypothesis of major depressive disorder. Molecular Psychiatry 16(4), 383–406 (2011)
16. Mehrhoff, E.A., Bunch, L., Bower, M., Fowler, N., Yang, E., Hendricks, L., Verner, J., Lee, H., Aki, C., Branney, M., Funke, A., Ehringer, M.A.: Alcohol-related behaviors in a mouse model containing the human GCKR SNP rs1260326 (P446L). Alcohol: Clinical and Experimental Research 48(S1), 117–118 (2024). doi: 10.1111/acer.15316, speaker abstract, 47th Annual Research Society on Alcoholism Scientific Meeting, Minneapolis, MN, USA
17. Mostafavi, H., Spence, J.P., Naqvi, S., Pritchard, J.K.: Systematic differences in discovery of genetic effects on gene expression and complex traits. Nature Genetics 55(11), 1866–1875 (2023)
18. Nievergelt, C.M., Maihofer, A.X., Atkinson, E.G., Chen, C.Y., Choi, K.W., Coleman, J.R., Daskalakis, N.P., Duncan, L.E., Polimanti, R., Aaronson, C., et al.: Genome-wide association analyses identify 95 risk loci and provide insights into the neurobiology of post-traumatic stress disorder. Nature Genetics 56(5), 792–808 (2024)
19. Perneel, J., Lastra Osua, M., Alidadiani, S., Peeters, N., De Witte, L., Heeman, B., Manzella, S., De Rycke, R., Brooks, M., Perkerson, R.B., et al.: Increased TMEM106B levels lead to lysosomal dysfunction which affects synaptic signaling and neuronal health. Molecular Neurodegeneration 20(1), 1–26 (2025)
20. Privé, F.: Optimal linkage disequilibrium splitting. Bioinformatics 38(1), 255–256 (2022)
21. Qi, G., Chatterjee, N.: Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits. PLoS Genetics 14(10), e1007549 (2018)
22. Qi, G., Chhetri, S.B., Ray, D., Dutta, D., Battle, A., Bhattacharjee, S., Chatterjee, N.: Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants. Nature Communications 15(1), 6985 (2024)
23. Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010)
24. Ray, L.A., Sehl, M., Bujarski, S., Hutchison, K., Blaine, S., Enoch, M.A.: The CRHR1 gene, trauma exposure, and alcoholism risk: a test of G×E effects. Genes, Brain and Behavior 12(4), 361–369 (2013)
25. Sanchez-Roige, S., Palmer, A.A., Fontanillas, P., Elson, S.L., 23andMe Research Team, the Substance Use Disorder Working Group of the Psychiatric Genomics Consortium, Adams, M.J., Howard, D.M., Edenberg, H.J., Davies, G., Crist, R.C., et al.: Genome-wide association study meta-analysis of the alcohol use disorders identification test (AUDIT) in two population-based cohorts. American Journal of Psychiatry 176(2), 107–118 (2019)
26. Saunders, J.B., Aasland, O.G., Babor, T.F., De la Fuente, J.R., Grant, M.: Development of the alcohol use disorders identification test (AUDIT): WHO collaborative project on early detection of persons with harmful alcohol consumption-II. Addiction 88(6), 791–804 (1993)
27. Schaid, D.J., Chen, W., Larson, N.B.: From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics 19(8), 491–504 (2018)
28. Sigel, E., Ernst, M.: The benzodiazepine binding sites of GABAA receptors. Trends in Pharmacological Sciences 39(7), 659–671 (2018)
29. Strobl, E.V., Gamazon, E.R.: Transcriptome-wide root causal inference. PLOS Computational Biology 21(9), e1013461 (2025)
30. Su, X., Li, W., Lv, L., Li, X., Yang, J., Luo, X.J., Liu, J.: Transcriptome-wide association study provides insights into the genetic component of gene expression in anxiety. Frontiers in Genetics 12, 740134 (2021)
31. Svishcheva, G.R., Tiys, E.S., Elgaeva, E.E., Feoktistova, S.G., Timmers, P.R., Sharapov, S.Z., Axenovich, T.I., Tsepilov, Y.A.: A novel framework for analysis of the shared genetic background of correlated traits. Genes 13(10), 1694 (2022)
32. Thompson, A., Cook, J., Choquet, H., Jorgenson, E., Yin, J., Kinnunen, T., Barclay, J., Morris, A.P., Pirmohamed, M.: Functional validity, role, and implications of heavy alcohol consumption genetic loci. Science Advances 6(3), eaay5034 (2020)
33. Treutlein, J., Kissling, C., Frank, J., Wiemann, S., Dong, L.
, Depner , M. , Saam , C. , Lascorz , J. , Soyka , M. , Preuss , U. , et al : Genetic association of the human corticotropin releasing hormone receptor 1 (crhr1) with binge drinking and alcohol intake patterns in two independent samples . Molecular Psychiatry 11 ( 6 ), 594 – 602 ( 2006 ) OpenUrl CrossRef PubMed Web of Science 34. ↵ Tyrka , A.R. , Price , L.H. , Gelernter , J. , Schepker , C. , Anderson , G.M. , Carpenter , L.L. : Interaction of childhood maltreatment with the corticotropin-releasing hormone receptor gene: effects on hypothalamic-pituitary-adrenal axis reactivity . Biological psychiatry 66 ( 7 ), 681 – 685 ( 2009 ) OpenUrl CrossRef PubMed Web of Science 35. ↵ Uffelmann , E. , Huang , Q.Q. , Munung , N.S. , De Vries , J. , Okada , Y. , Martin , A.R. , Martin , H.C. , Lappalainen , T. , Posthuma , D. : Genome-wide association studies . Nature Reviews Methods Primers 1 ( 1 ), 59 ( 2021 ) OpenUrl 36. ↵ Wang , T. , Tyler , R.E. , Ilaka , O. , Cooper , D. , Farokhnia , M. , Leggio , L. : The crosstalk between fibroblast growth factor 21 (fgf21) system and substance use . iScience 27 ( 7 ) ( 2024 ) 37. ↵ Zhang , Z. , Jung , J. , Kim , A. , Suboc , N. , Gazal , S. , Mancuso , N. : A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using gwas summary statistics . The American Journal of Human Genetics 110 ( 11 ), 1863 – 1874 ( 2023 ) OpenUrl CrossRef PubMed 38. ↵ Zou , Y. , Carbonetto , P. , Xie , D. , Wang , G. , Stephens, M.: Fast and flexible joint fine-mapping of multiple traits via the sum of single effects model . bioRxiv pp. 2023 – 04 ( 2024 ) 39. ↵ Zuber , V. , Cronjé , T. , Cai , N. , Gill , D. , Bottolo , L. : Bayesian causal graphical model for joint mendelian randomization analysis of multiple exposures and outcomes . The American Journal of Human Genetics 112 ( 5 ), 1173 – 1198 ( 2025 ) OpenUrl PubMed 40. ↵ Zuber , V. , Lewin , A. , Levin , M.G. , Haglund , A. , Ben-Aicha , S. , Emanueli , C. 
, Damrauer , S. , Burgess , S. , Gill , D. , Bottolo , L. : Multi-response mendelian randomization: Identification of shared and distinct exposures for multimorbidity and multiple related disease outcomes . The American Journal of Human Genetics 110 ( 7 ), 1177 – 1199 ( 2023 ) OpenUrl CrossRef PubMed View the discussion thread. Back to top Previous Next Posted December 10, 2025. Download PDF Data/Code Email Thank you for your interest in spreading the word about medRxiv. NOTE: Your email address is requested solely to identify you as the sender of this article. Your Email * Your Name * Send To * Enter multiple addresses on separate lines or separate them with commas. You are going to email the following Mendelianization: Concentrating Polygenic Signal into a Single Causal Locus Message Subject (Your Name) has forwarded a page to you from medRxiv Message Body (Your Name) thought you would like to see this page from the medRxiv website. Your Personal Message CAPTCHA This question is for testing whether or not you are a human visitor and to prevent automated spam submissions. Share Mendelianization: Concentrating Polygenic Signal into a Single Causal Locus Eric V. Strobl medRxiv 2025.10.31.25339237; doi: https://doi.org/10.1101/2025.10.31.25339237 Share This Article: Copy Citation Tools Mendelianization: Concentrating Polygenic Signal into a Single Causal Locus Eric V. 