A simple phylogenetic approach to analyze hypermutated HIV proviruses reveals insights into their dynamics and persistence during antiretroviral therapy

Research Article (Research Square preprint)

Aniqa Shahid, Bradley R. Jones, Maggie C. Duncan, Signe MacLennan, and 12 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4549934/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Hypermutated proviruses, which arise in a single HIV replication cycle when host antiviral APOBEC3 proteins introduce extensive G-to-A mutations throughout the viral genome, persist in all people living with HIV receiving antiretroviral therapy (ART).
However, the within-host evolutionary origins of hypermutated sequences are incompletely understood because phylogenetic inference algorithms, which assume that mutations gradually accumulate over generations, incorrectly reconstruct their ancestor-descendant relationships. Using >1400 longitudinal single-genome-amplified HIV env-gp120 sequences isolated from six women over a median 18 years of follow-up (including plasma HIV RNA sequences collected over a median 9 years between seroconversion and ART initiation, and >500 proviruses isolated over a median 9 years on ART), we evaluated three approaches for removing hypermutation from nucleotide alignments. Our goals were to 1) reconstruct accurate phylogenies that can be used for molecular dating and 2) phylogenetically infer the integration dates of hypermutated proviruses persisting during ART. Two of the tested approaches (stripping all positions containing putative APOBEC3 mutations from the alignment, or replacing individual putative APOBEC3 mutations in hypermutated sequences with the ambiguous base R) consistently normalized tree topologies, eliminated erroneous clustering of hypermutated proviruses, and brought env-intact and hypermutated proviruses into comparable ranges with respect to multiple tree-based metrics. Importantly, these corrected trees produced integration date estimates for env-intact proviruses that were highly concordant with those from benchmark trees that excluded hypermutated sequences, indicating that the corrected trees can be used for molecular dating. Use of these trees to infer the integration dates of hypermutated proviruses persisting during ART revealed that these spanned a wide age range, with the oldest ones dating to shortly after infection. This indicates that hypermutated proviruses, like other provirus types, begin to be seeded into the proviral pool immediately following infection, and can persist for decades.
In two of the six participants, hypermutated proviruses differed from env-intact ones in terms of their age distributions, suggesting that different provirus types decay at heterogeneous rates in some hosts. These simple approaches to reconstructing the evolutionary histories of hypermutated proviruses allow insights into their in vivo origins and longevity, towards a more comprehensive understanding of HIV persistence during ART.

Introduction

Antiretroviral therapy (ART) is not curative because HIV persists as an integrated provirus within a small fraction of infected cell reservoirs (Finzi et al. 1997; Finzi et al. 1999). Entry of HIV sequences into these reservoirs begins immediately following infection (Gantner et al. 2023; Whitney et al. 2014) and continues until viral suppression is achieved on ART, yielding a genetically diverse viral reservoir (Brodin et al. 2016; Brooks et al. 2020; Jones et al. 2018; Kinloch et al. 2023; Nicolas et al. 2022; Pankau et al. 2020). However, only a minority (~2–5%) of integrated proviruses persisting on ART are genetically intact and capable of producing replication-competent HIV; the remainder are genetically defective and cannot produce infectious virus (Bruner et al. 2016; Ho et al. 2013; Imamichi et al. 2020; Sanchez et al. 1997). Large deletions, which occur during the minus-strand synthesis step of reverse transcription, are the most common defects, followed by hypermutation (Bruner et al. 2016; Hiener et al. 2017; Ho et al. 2013; Kinloch et al. 2023; Lee et al. 2017). Hypermutated proviruses arise in a single HIV replication cycle when host antiviral APOBEC3 proteins catalyze widespread cytidine-to-uridine deamination within the minus-strand HIV DNA genome that is produced during reverse transcription, yielding extensive guanine to adenine (G-to-A) mutations during plus-strand synthesis (Fitzgibbon et al. 1993; Goodenow et al.
1989 ; Vartanian et al. 1991 ; Vartanian et al. 1994 ). Hypermutation is normally deleterious, yielding stop codons in one or more HIV reading frames (Harris and Liddament 2004 ; Vartanian et al. 1991 ; Waldron 2015 ). As a result, hypermutated proviruses do not generally yield evolutionary descendants (Kieffer et al. 2005 ; Sheehy et al. 2002 ). Nevertheless, hypermutated sequences readily persist, typically representing 15% (though as much as > 50%) of all proviruses during long-term ART (Bruner et al. 2016 ; Hiener et al. 2017 ; Ho et al. 2013 ; Kinloch et al. 2023 ; Lee et al. 2017 ). Hypermutated HIV sequences pose challenges for phylogenetic inference algorithms, which assume that mutations gradually accumulate over generations, not all at once in a single round of replication (Gorbalenya 2017 ). Phylogenies inferred from sequence alignments containing hypermutated proviruses will therefore inaccurately reflect the ancestor-descendant relationships of these sequences. Due to their large number of G-to-A mutations, the terminal branch lengths of hypermutated sequences are typically extended in these trees, and they will also often cluster together due to a type of phylogenetic error known as long branch attraction, whereby divergent sequences are classified as being more similar to one another simply because they have undergone a large amount of change, not because they share recent ancestry (Bergsten 2005 ). Though hypermutated sequences are routinely included in phylogenies simply as a way to visualize complete datasets (Halvas et al. 2020 ; Kearney et al. 2016 ; Patro et al. 2019 ), such trees should not be used for formal hypothesis testing. To our knowledge, no standard approaches exist to correctly infer ancestor-descendant relationships in datasets that include hypermutated sequences. Instead, these sequences are typically removed from HIV alignments, excluding them from phylogenetic hypothesis testing entirely (Bozzi et al. 2019 ; Brodin et al. 
2016 ; Brooks et al. 2020 ; Jones et al. 2018 ; Jones and Joy 2023 ; Pinzone et al. 2019 ). As a result, relatively little is known about the within-host origins and longevity of hypermutated proviruses. To address these gaps, we used longitudinal within-host HIV env-gp120 sequence datasets from six participants of the Women's Interagency HIV Study (WIHS) (Shahid et al. 2024 ) to evaluate the ability of three nucleotide alignment modification strategies to normalize the topologies of trees containing hypermutated proviruses. Using these corrected trees, we then estimated the integration dates of env -intact and hypermutated proviruses persisting during ART, towards better understanding the within-host evolutionary dynamics of these different proviral types. Methods Study participants and within-host HIV sequence datasets We analyzed longitudinal, single-genome-amplified HIV env - gp120 sequence datasets previously collected from six WIHS participants with documented HIV seroconversion (Shahid et al. 2024 ). WIHS is a multi-center cohort of women living with (or without) HIV in the United States (Adimora et al. 2018 ; Bacon et al. 2005 ; Barkan et al. 1998 ), that has now merged into the MACS/WIHS Combined Cohort Study (MWCCS) (D'Souza et al. 2021 ). Each participant's longitudinal dataset comprised plasma HIV RNA env-gp120 sequences collected between seroconversion and ART initiation, along with env-gp120 proviral sequences sampled during ART (Shahid et al. 2024 ) ( Table 1 ). All sequences were collected by single-genome amplification, where those with nucleotide mixtures, defects ( e.g. , deletions causing frameshifts) or evidence of within-host recombination (identified using RDP4 v4.1 (Martin et al. 2015 )) were excluded (Shahid et al. 2024 ). Sequences that were 100% identical in env-gp120 were collapsed to a single representative sequence prior to phylogenetic inference. 
Within-host datasets comprised a median 242 (IQR 119–337) distinct sequences per participant.

Ethics statement

Institutional review boards at each WIHS clinical research site approved the study protocol. All participants provided written informed consent. This nested sub-study was additionally approved by the institutional review boards at Providence Health Care/University of British Columbia, and Simon Fraser University.

Identification of hypermutated sequences and sequence alignment modification

Hypermutated HIV sequences were identified using Hypermut 2.0, available at https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html (Rose and Korber 2000). This program takes a nucleotide alignment as input, where the first sequence is used as a reference to which all others are compared. As recommended for within-host datasets (Rose and Korber 2000), we chose the most frequently observed env-gp120 sequence from the first plasma HIV RNA sampling timepoint as the reference wherever possible. Hypermut defines APOBEC3 target sites as GRD; that is, a G followed by either A or G (denoted by the IUPAC code R (Cornish-Bowden 1985)), then followed by A, G or T (denoted by D), where the leading G is the APOBEC3 target site. Non-APOBEC3 target sites are defined as GY (where Y denotes C or T) or GRC, again with the leading G as the site evaluated. Hypermut identifies all target and non-target sites within each sequence, and categorizes each as mutated (i.e., harboring an A) or not (i.e., harboring a C, G or T). The program then compares the proportion of mutated target and non-target sites in each sequence using Fisher's exact test. Sequences enriched in G-to-A mutations at target sites with p < 0.05 are identified as hypermutated.

We then prepared five within-host env-gp120 sequence alignments for each participant, where the first two were controls and the last three used different strategies to remove hypermutation.
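The Hypermut classification described above can be illustrated with a minimal Python sketch. This is not the Hypermut 2.0 implementation: the function names are hypothetical, the downstream context is read from the query sequence (an assumption), and the one-sided Fisher's exact test is computed directly from the hypergeometric distribution.

```python
from math import comb

def classify_sites(reference, query):
    """Count mutated/unmutated target (GRD) and control (GY or GRC) sites.

    Assumption: a site is evaluated wherever the reference has a G, and the
    two downstream bases that define the context are read from the query.
    """
    counts = {"target_mut": 0, "target_unmut": 0,
              "control_mut": 0, "control_unmut": 0}
    for i in range(len(reference) - 2):
        if reference[i] != "G":
            continue
        base, d1, d2 = query[i], query[i + 1], query[i + 2]
        if d1 in ("A", "G") and d2 in ("A", "G", "T"):
            kind = "target"            # G followed by R then D
        elif d1 in ("C", "T") or (d1 in ("A", "G") and d2 == "C"):
            kind = "control"           # G followed by Y, or by R then C
        else:
            continue                   # gaps or ambiguous context: skip
        if base == "A":
            counts[kind + "_mut"] += 1
        elif base in ("C", "G", "T"):
            counts[kind + "_unmut"] += 1
    return counts

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows are target/control sites; columns are mutated/unmutated.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

def is_hypermutated(reference, query, alpha=0.05):
    """Flag a query enriched for G-to-A changes at target sites."""
    c = classify_sites(reference, query)
    p = fisher_right_tail(c["target_mut"], c["target_unmut"],
                          c["control_mut"], c["control_unmut"])
    return p < alpha
```

For instance, a toy query in which every GAA motif of the reference has become AAA while every GCT motif is unchanged would be flagged, whereas a query identical to the reference would not.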
Sequence alignments were performed in a codon-aware manner using MAFFT v7.471 (Katoh and Standley 2013 ) and manually inspected in AliView v1.26 (Larsson 2014 ). The first alignment contained all pre-ART env-gp120 plasma HIV RNA sequences plus only the env- intact proviruses sampled during ART ( i.e. , hypermutated proviruses were excluded, as is the current practice in the field (Brooks et al. 2020 ; Jones et al. 2018 ; Jones et al. 2020 ; Kinloch et al. 2023 )). We called this the " env- intact only" alignment, where the resulting phylogeny was used as the benchmark for provirus molecular dating. The second alignment contained all pre-ART plasma HIV RNA sequences plus all ( i.e. , both env- intact and hypermutated) proviruses sampled during ART, where the phylogeny inferred from this "HM-Unaltered" alignment served to illustrate the skewed topologies of resulting trees. The next three alignments were modifications of this second one, in which we tested different strategies to remove hypermutation and thereby normalize topology. The first strategy, HM-Stripped, removed all nucleotide positions that harbored an A at an APOBEC3 target site in at least one hypermutated sequence, yielding a shorter overall alignment. The second strategy, HM-Replacedw/R, individually replaced all A bases at APOBEC3 target sites within hypermutated sequences with R. The third strategy, HM-Replacedw/G, individually replaced all A bases at APOBEC3 target sites within hypermutated sequences with G. Both these strategies preserved the alignment length. Here, replacing with G assumes that all A bases at target sites are the result of APOBEC3 effects, whereas replacing with R recognizes the possibility that some may be legitimate A bases that are not attributable to APOBEC3 effects. Visualizations of the HM Unaltered, HM-Stripped and HM-Replacedw/R alignments are provided in Supplementary Fig. 1 . Phylogenies inferred from these alignments were evaluated as described below. 
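The three alignment modification strategies can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the alignment is assumed to be a dictionary of equal-length strings, and `flagged` is a hypothetical dictionary mapping each hypermutated sequence name to the set of alignment columns where it carries an A at an APOBEC3 target site.

```python
def strip_columns(alignment, flagged):
    """HM-Stripped: drop every column flagged in at least one hypermutated
    sequence, from all sequences, yielding a shorter alignment."""
    drop = set().union(*flagged.values())
    return {name: "".join(b for i, b in enumerate(seq) if i not in drop)
            for name, seq in alignment.items()}

def replace_bases(alignment, flagged, new_base):
    """HM-Replacedw/R (new_base="R") or HM-Replacedw/G (new_base="G"):
    replace flagged bases only within the hypermutated sequences that
    carry them, preserving the alignment length."""
    out = {}
    for name, seq in alignment.items():
        cols = flagged.get(name, set())
        out[name] = "".join(new_base if i in cols else b
                            for i, b in enumerate(seq))
    return out

# Toy example (hypothetical sequences)
alignment = {"plasma_1": "GGAGCT", "provirus_hm_1": "AGAGCT"}
flagged = {"provirus_hm_1": {0}}

stripped = strip_columns(alignment, flagged)
with_r = replace_bases(alignment, flagged, "R")
with_g = replace_bases(alignment, flagged, "G")
```

Note the difference in scope: stripping removes information from every sequence at the affected columns, whereas replacement alters only the hypermutated sequences.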
Within-host phylogenetic inference, rooting and tree metrics Maximum likelihood phylogenies were inferred from sequence alignments following automated model selection using an Akaike information criterion (AIC) in IQ-TREE 2. Best-fit models are reported in Supplementary Table 1 . Branch support values were derived using the ultrafast bootstrap option (1,000 bootstraps) (Hoang et al. 2018 ; Minh et al. 2020 ). Phylogenies were visualized using the R package ggtree (Yu 2020 ). Most of our downstream analyses required rooting the tree at the inferred most recent common ancestor (MRCA) of the dataset. As previously described, we used a modified root-to-tip regression approach where we explored all positions in the tree to identify the location that maximized the (Pearson's) correlation between the root-to-tip distances of all plasma HIV RNA sequences collected prior to ART initiation , and their sampling dates (Jones et al. 2018 ). This location was set as the tree root, which represents the estimated transmitted/founder virus, or a close descendant thereof, in these datasets. To evaluate the extent to which the three alignment modification strategies normalized the position of hypermutated proviruses in the tree, we compared env- intact and hypermutated proviruses with respect to various tree-based metrics, explained in Fig. 1 . We quantified terminal branch length (TBL), which is the length of the branch connecting each sequence to the tree, in estimated substitutions per nucleotide site (Fig. 1 B ) . We computed root-to-tip distance (RTT) , which is the total distance between each tip and the tree root (Fig. 1 C). We computed two measures of evolutionary distinctiveness: Fair Proportion Evolutionary Distinctiveness (FP-ED) and Equal Splits Evolutionary Distinctiveness (ES-ED), both of which distribute the root-to-tip distances in a tree among the descendant sequences at the tips (Pavoine 2017). 
FP-ED does this by dividing the shared evolutionary history represented by an internal branch equally among all its descendant tips, regardless of branching order (Isaac et al. 2007 ; Redding et al. 2014 ) (Fig. 1 D), whereas ES-ED assigns a longer portion of shared internal branches to immediate descendants (Redding and Mooers 2006 ) (Fig. 1 E). FP-ED and ES-ED were computed using a custom R script with package picante (v1.8.2) (Kembel et al. 2010 ). We computed each proviral sequence's median topological distance (TD) from all other sequences of the same type ( i.e., env- intact or hypermutated), where distance was defined as the number of nodes separating each pair (Fig. 1 F). Finally, we used the Slatkin-Maddison (SM) test (Slatkin and Maddison 1989 ), implemented using the R package slatkin.maddison (v0.1.0; https://github.com/prmac/slatkin.maddison ) to assess the extent to which env- intact and hypermutated sequences displayed population structure in the tree. This test determines the minimum number of migrations between groups to explain the distribution of groups at the tree tips: the smaller the number, the stronger the support for population structure. Statistical support is based on the number of migrations that would be expected in a randomly-structured population, simulated by permuting group labels between tips. Note that Slatkin-Maddison returns an estimated p-value, where a value of 0 can be interpreted as p < 0.001, as 1,000 permutations were performed. Within-host phylogenetic inference and proviral dating We inferred the integration dates of env- intact and hypermutated proviruses persisting during ART using a published phylogenetic approach (Jones et al. 2018 ). Using the rooted trees, we fit a linear model relating the root-to-tip distances of pre-ART plasma HIV sequences to their collection dates. 
The slope of this line represents the average within-host env-gp120 evolutionary rate during untreated HIV infection, and the x-intercept represents the inferred root date. Model quality was assessed by comparing the model's AIC to that of a null model with zero slope. To pass quality control (QC), the linear model needed to have an AIC value at least 10 units lower than the null model (ΔAIC ≥ 10), and a root date prior to the first plasma sampling. All phylogenies met these criteria ( Supplementary Table 1 ). We then used the linear model to convert proviral root-to-tip distances to their integration dates. The custom R script for this method is available at https://github.com/cfe-lab/phylodating . Statistical analysis Spearman’s correlation (ρ) and Lin’s concordance correlation coefficient (ρc) were calculated in R. All other statistical analyses were performed in Prism, v10.0.2 (GraphPad Software). A threshold of p < 0.05 was used to denote statistical significance. Results Within-host HIV sequence datasets We analyzed 1,408 single-genome-amplified HIV env-gp120 sequences collected longitudinally from six WIHS participants who experienced HIV seroconversion (a seventh participant from the original study was not included here, as no hypermutated proviruses were isolated from their samples) (Shahid et al. 2024 ) ( Table 1 ). The data included 866 distinct HIV RNA env-gp120 sequences (median 157 per participant) isolated from plasma over a median of 9 time points spanning a median of 7 years between seroconversion and ART initiation. The data also included 542 distinct env-gp120 proviral sequences, including 449 env- intact ones (median 62 per participant) and 93 hypermutated ones (median 19 per participant) isolated from peripheral blood at a minimum of 3 time points over a median of 8.7 years during ART ( Table 1 ). All participants had HIV subtype B, with no evidence of dual or super-infection. 
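The dating model described in the Methods (a linear regression of pre-ART plasma root-to-tip distance on sampling date, with quality control based on ΔAIC and the root date) can be sketched as follows. This is a minimal illustration rather than the published script: the function names and toy values are hypothetical, dates are assumed to be decimal years, and the AIC is computed in its Gaussian least-squares form, n·ln(RSS/n) + 2k, which is an assumption about how the comparison is made.

```python
from math import log

def fit_dating_model(dates, rtt):
    """Least-squares fit of root-to-tip distance (rtt) against sampling date."""
    n = len(dates)
    mean_x, mean_y = sum(dates) / n, sum(rtt) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, rtt))
    slope = sxy / sxx                      # evolutionary rate (subs/site/year)
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(dates, rtt))
    rss_null = sum((y - mean_y) ** 2 for y in rtt)   # zero-slope model
    aic_linear = n * log(rss / n) + 2 * 3  # slope, intercept, variance
    aic_null = n * log(rss_null / n) + 2 * 2
    return {"slope": slope, "intercept": intercept,
            "root_date": -intercept / slope,  # x-intercept
            "delta_aic": aic_null - aic_linear}

def passes_qc(model, dates):
    """QC as described: delta AIC >= 10 and root date before first sampling."""
    return model["delta_aic"] >= 10 and model["root_date"] < min(dates)

def integration_date(model, provirus_rtt):
    """Convert a provirus root-to-tip distance to an inferred date."""
    return (provirus_rtt - model["intercept"]) / model["slope"]

# Toy pre-ART plasma data (hypothetical values)
plasma_dates = [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0]
plasma_rtt = [0.011, 0.019, 0.031, 0.039, 0.051, 0.059]
model = fit_dating_model(plasma_dates, plasma_rtt)
```

A provirus whose root-to-tip distance falls on the regression line at a given date is assigned that date; confidence intervals, which the published method also reports, are omitted from this sketch.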
Identifying sites of hypermutation Between 7 and 42% of participants' proviral sequences were hypermutated (though hypermutation was not observed in plasma HIV RNA sequences, as expected). In a given within-host alignment, between 9–11% of env-gp120 nucleotide positions had a putative APOBEC3-driven A in at least one sequence ( Table 2 ). Hypermutated proviruses harbored a grand median of 45 putative APOBEC3 mutations (representing 31% of all possible target sites, and 3% of all env-gp120 nucleotides), but the overall range was 9 to 83 putative APOBEC3 mutations per env-gp120 sequence (representing 6–61% of all possible target sites, and 0.6-5% of all env-gp120 nucleotides). For context, the grand median of putative APOBEC3 mutations in env- intact (non-hypermutated) proviruses was 5. Assessing how alignment modification strategies normalized tree topology and metrics We next investigated how well our sequence alignment modification strategies helped normalize tree topologies, beginning with participant WIHS-P2 as an example. Participant WIHS-P2's dataset included 227 plasma HIV RNA env-gp120 sequences sampled over 9 years during untreated infection, and 75 proviruses (53 env- intact, 22 hypermutated) sampled over ~ 7 years during ART (Fig. 2 A). WIHS-P2’s unmodified nucleotide alignment yielded a phylogeny that placed nearly all hypermutated proviruses into a single clade with high (≥ 90%) bootstrap support (Fig. 2 B; a larger tree with branch support values is shown in Supplementary Fig. 2 ). Hypermutated provirus terminal branch lengths in this tree were on average four times longer than env -intact ones (p < 0.0001; Fig. 2 C), though their root-to-tip distances were not significantly inflated (p = 0.2, Supplementary Fig. 3A ). Hypermutated proviruses also exhibited significantly higher evolutionary distinctiveness (ED) than env -intact ones in this tree (p < 0.0001 for both fair proportion and equal splits ED; Fig. 2 D and Supplementary Fig. 4A ). 
Also reflecting the erroneous clustering of hypermutated sequences in this tree, the median number of nodes separating hypermutated sequences from one another ( i.e. , topological distance) was on average only half of that separating env -intact proviruses (p < 0.0001; Fig. 2 E). A Slatkin-Maddison test also returned significant evidence of genetic population structure ( i.e. , "compartmentalization") between hypermutated and env -intact proviruses in this tree (three inferred migrations; estimated p = 0; Fig. 2 B inset ). By contrast, the tree inferred from WIHS-P2's HM-Stripped alignment, in which 140 (of 1515) env-gp120 positions harboring putative APOBEC3 mutations had been removed, exhibited a substantially normalized topology (Fig. 2 F). The same was true for the tree inferred from the HM-Replacedw/R alignment, where a median of 43 putative APOBEC3-driven A bases in hypermutated sequences had been replaced with R (Fig. 2 G; larger trees in Supplementary Fig. 2 ). In both trees, hypermutated proviruses were now comparable to env -intact ones in terms of terminal branch lengths (both p > 0.1; Figs. 2 H and 2 I), evolutionary distinctiveness (all p > 0.1; Figs. 2 J and 2 K; Supplementary Figs. 4B and 4C ) and topological distance (both p > 0.1, Figs. 2 L and 2 M). Genetic compartmentalization between env- intact and hypermutated proviruses was also markedly reduced (15 inferred migrations compared to the original 3), though the p-values remained marginally significant (both p ≤ 0.01; Figs. 2 F and 2 G, insets ). Of note, root-to-tip distances of hypermutated proviruses in these two trees were now shorter than those of env- intact ones (both p < 0.001; Supplementary Figs. 3B and 3C ). 
In contrast, while the tree inferred from participant WIHS-P2's HM-Replacedw/G alignment (where putative APOBEC3-driven A bases in hypermutated sequences were replaced with G) appeared broadly normalized, env- intact and hypermutated sequences remained highly significantly compartmentalized in this tree (estimated p = 0; Supplementary Fig. 5 ). As our second example, participant WIHS-P4's dataset included 182 plasma HIV RNA env-gp120 sequences sampled over ~ 11 years pre-ART, and 155 proviruses (132 env- intact; 23 hypermutated) sampled during 12 years of ART (Fig. 3 A). The unaltered alignment produced a phylogeny (Fig. 3 B; larger tree in Supplementary Fig. 6 ) where hypermutated sequences exhibited significantly inflated branch lengths, root-to-tip distances and evolutionary distinctiveness (all p < 0.0001; Figs. 3 C and 3 D, Supplementary Figs. 3D and 4D ), erroneous clustering (p < 0.0001 Fig. 3 E) and significant compartmentalization (estimated p = 0; Fig. 3 B inset ). By contrast, the HM-Stripped and HM-Replacedw/R alignments produced substantially normalized trees (Figs. 3 F and 3 G respectively; larger trees in Supplementary Fig. 6 ) with no genetic compartmentalization between env- intact and hypermutated sequences (22 migrations compared to the original 6; both p > 0.1; Figs. 3 F and 3 G insets ). The ranges of terminal branch lengths, root-to-tip distance measurements, evolutionary distinctiveness measures and topological distances were now also comparable between env- intact and hypermutated proviruses, though the latter remained modestly yet statistically significantly different from env- intact sequences by most measures (p-values from 0.001 to 0.039, Figs. 3 H to 3 M; Supplementary Figs. 3E and 3F; Supplementary Figs. 4E and 4F ). In contrast, hypermutated sequences remained highly compartmentalized in the phylogeny inferred from WIHS-P4's HM-Replacedw/G alignment ( Supplementary Fig. 7 ). 
The same analyses were applied to participants WIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6 (small trees and select metrics in Supplementary Figs. 8–11 ; large trees in Supplementary Figs. 12–15; remaining metrics in Supplementary Figs. 3 and 4 ). Broadly, the trees inferred from the HM-Stripped and HM-Replacedw/R alignments were markedly normalized and yielded metric values for env- intact and hypermutated proviruses that spanned comparable ranges. For some participants, these metrics normalized such that env- intact and hypermutated viruses became statistically comparable ( e.g. , WIHS-P5; Supplementary Fig. 10 ). For others, hypermutated sequences remained somewhat distinctive ( e.g. , hypermutated provirus terminal branch lengths and evolutionary distinctiveness remained slightly elevated for WIHS-P6; Supplementary Figs. 4 and 11 ), but in all cases these differences were far smaller in magnitude than those from the trees inferred from unaltered alignments. Indeed, the p-values derived from comparing env- intact and hypermutated proviruses in the HM-Stripped and HM-Replacedw/R trees were an average > 3 logs higher than those from the HM-Unaltered trees, with 56% of comparisons yielding p-values > 0.05 (Fig. 4 ). By contrast, the HM-Replacedw/G approach did not reliably normalize the trees. In particular, WIHS-P5's HM-Replacedw/G phylogeny maintained obvious clustering of hypermutated sequences and very strong compartmentalization, while terminal branch lengths, fair proportion evolutionary distinctiveness, and topological distance also remained highly skewed for one or more participants (Fig. 4 , and data not shown). As such, only the HM-Stripped and HM-Replacedw/R trees were advanced to further evaluation. 
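The tree-based metrics compared in this section can be illustrated on a toy rooted tree. The Python sketch below is not the authors' R/picante code: the tree encoding (each node mapped to its parent and branch length) and function names are hypothetical, and topological distance is assumed to count every node on the path between two tips, including their most recent common ancestor.

```python
# child: (parent, branch length); the root has no entry of its own
TREE = {
    "n1": ("root", 0.02),
    "A": ("n1", 0.01),
    "B": ("n1", 0.03),
    "C": ("root", 0.05),
}

def tips(tree):
    parents = {parent for parent, _ in tree.values()}
    return sorted(node for node in tree if node not in parents)

def path_to_root(tree, node):
    """Nodes whose parent branches lie between `node` and the root."""
    path = []
    while node in tree:
        path.append(node)
        node = tree[node][0]
    return path

def terminal_branch_length(tree, tip):
    return tree[tip][1]

def root_to_tip(tree, tip):
    return sum(tree[n][1] for n in path_to_root(tree, tip))

def fair_proportion(tree, tip):
    """FP-ED: each branch length is shared equally among descendant tips."""
    n_desc = {node: 0 for node in tree}
    for t in tips(tree):
        for node in path_to_root(tree, t):
            n_desc[node] += 1
    return sum(tree[n][1] / n_desc[n] for n in path_to_root(tree, tip))

def topological_distance(tree, a, b):
    """Number of nodes separating two tips (assumed to include their MRCA)."""
    anc_a = [tree[n][0] for n in path_to_root(tree, a)]
    anc_b = [tree[n][0] for n in path_to_root(tree, b)]
    for i, node in enumerate(anc_a):
        if node in anc_b:
            return i + anc_b.index(node) + 1
    raise ValueError("tips do not share a root")
```

In this toy tree, tip A has a terminal branch length of 0.01, a root-to-tip distance of 0.03 and an FP-ED of 0.02 (its own branch plus half of the shared internal branch). The FP-ED values of all tips sum to the total tree length, which is a convenient sanity check.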
Inferring proviral integration dates from corrected trees: a validation

We next investigated whether accurate evolutionary information can be extracted from these corrected trees, by phylogenetically inferring the integration dates of proviruses sampled during ART. Figure 5 illustrates how this is done. Briefly, we first root the phylogeny at the location that maximizes the correlation between the root-to-tip distances of the pre-ART plasma HIV RNA sequences and their sampling dates (proviruses sampled during ART, though included in the tree, are not considered in this correlation; Fig. 5B). This root represents the MRCA of the dataset (i.e., the estimated founder virus). We then fit a linear model relating the root-to-tip genetic distances of the pre-ART plasma sequences to their sampling dates (Fig. 5C). This model is then used to convert the root-to-tip distance of each on-ART provirus to its inferred integration date (plus 95% confidence interval; Fig. 5D). Application of this approach to WIHS-P2's unaltered and corrected trees yielded estimated root dates that were consistent with the clinically-estimated infection date (Table 1) and comparable to the root date inferred from the benchmark (env-intact only) tree (Supplementary Table 1; the unaltered tree likely produced reasonable root dates and evolutionary rate estimates because these metrics are computed from pre-ART plasma HIV RNA sequences only). We next verified the extent to which the integration dates of env-intact proviruses inferred from the corrected trees matched those inferred from the benchmark tree (which, as per current field standards, excluded hypermutated sequences entirely).
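Agreement of this kind is summarized in this study with Lin's concordance correlation coefficient (ρc), which, unlike a correlation coefficient alone, penalizes systematic shifts between two sets of estimates. A minimal Python sketch of the standard formula follows; the function names and toy dates are hypothetical, and population (1/n) variances are used.

```python
from math import sqrt

def _moments(x, y):
    """Means, population variances and covariance of paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov

def lins_ccc(x, y):
    """Lin's concordance: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    mx, my, vx, vy, cov = _moments(x, y)
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def pearson_r(x, y):
    """Pearson correlation, shown only for contrast with concordance."""
    _, _, vx, vy, cov = _moments(x, y)
    return cov / sqrt(vx * vy)

# Toy integration dates (hypothetical): the second set is shifted by one year
benchmark = [2001.0, 2002.0, 2003.0]
shifted = [2002.0, 2003.0, 2004.0]
```

Here the shifted dates correlate perfectly with the benchmark, yet their concordance is only 4/7 (about 0.57), which is why a concordance measure is the more stringent check when validating dates against a benchmark tree.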
Reassuringly, env-intact proviral integration dates inferred from the HM-Stripped tree were highly concordant with those inferred from the benchmark tree (Spearman's rho [ρ] = 0.95, p < 0.0001; Lin's concordance correlation coefficient [ρc] = 0.96), as were those inferred from the HM-Replacedw/R tree (ρ = 0.98, p < 0.0001; ρc = 0.97) (Fig. 6A). These results indicate that WIHS-P2's corrected trees can be used for molecular dating, and produce valid proviral integration dates. We next inferred the integration dates of all proviruses from the corrected trees, including the hypermutated ones. Inferred integration dates were highly concordant between the two approaches, yielding ρc between 0.93 and 0.97 depending on whether we compared env-intact, hypermutated or all proviruses (Fig. 6B). Moreover, there was no bias between the two methods (p = 0.65) (Fig. 6C). Thus, for participant WIHS-P2, both methods recovered proviral ages equally well. By contrast, the phylogeny inferred from the unaltered alignment produced hypermutated provirus integration dates that were poorly concordant with those from the corrected trees (HM-Stripped ρc = 0.46; HM-Replacedw/R ρc = 0.45; Fig. 6D). This illustrates the pitfalls of inferring evolutionary information from the former tree type.

We obtained similar results for WIHS-P4. Again, the integration dates of env-intact proviruses inferred from both corrected trees were highly concordant with those inferred from the benchmark tree (both ρc = 0.98; Fig. 7A), indicating that the corrected trees are appropriate for molecular dating. Moreover, proviral integration dates inferred from the corrected trees were highly concordant with one another (ρc 0.97 to 0.98) (Fig. 7B), and showed no bias between methods (p = 0.25) (Fig. 7C). By contrast, the phylogeny inferred from the unaltered alignment produced hypermutated provirus integration dates that were highly discordant with those inferred from the corrected trees (both ρc = 0.08; Fig.
7D), again illustrating the pitfalls of inferring evolutionary information from the former tree type. WIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6's corrected trees similarly produced env-intact proviral integration dates that were strongly concordant with those inferred from their benchmark trees (ρc: 0.81 to 0.93), and proviral integration dates that were generally highly concordant with one another, with no bias between methods (Supplementary Figs. 16–19). Again, the phylogenies inferred from their unaltered alignments produced hypermutated provirus integration dates that were generally poorly concordant with those inferred from the corrected trees. Together, these observations demonstrate that removing hypermutation from alignments is possible, and yields phylogenies that can be used to infer the integration dates of both hypermutated and env-intact proviruses.

Longevity and dynamics of hypermutated proviruses persisting on ART

Having demonstrated that proviral integration dates can be inferred from the corrected trees, we compared the integration dates of env-intact and hypermutated proviruses persisting on ART. Again, we begin with participant WIHS-P2. Both of this participant's corrected trees indicated that the hypermutated proviruses, like the env-intact ones, spanned essentially the entire duration of untreated infection, with the earliest dating to early 2004, approximately one year after seroconversion (Figs. 8A and 8B). On average, however, hypermutated proviruses were older than env-intact ones in this participant (both trees p = 0.001; Figs. 8A and 8B). Longitudinal analysis further revealed that, while integration date distributions of env-intact proviruses remained stable during the first seven years of ART (both trees p ≥ 0.1; Figs. 8C and 8D), hypermutated proviruses gradually shifted towards earlier integration dates over time (both trees p < 0.02; Figs.
8E and 8F), presumably because those with more recent integration dates were preferentially eliminated during long-term ART. WIHS-P4's proviruses also spanned essentially the entire duration of untreated infection (Figs. 8G and 8H). In contrast to WIHS-P2, however, the integration dates of this participant's hypermutated proviruses were on average more recent than those of the env-intact ones (both trees p ≤ 0.02; Figs. 8G and 8H). As previously reported (Shahid et al. 2024), WIHS-P4's env-intact proviruses gradually shifted towards earlier integration dates over time on ART (both trees p ≤ 0.003; Figs. 8I and 8J), likely because those with more recent integration dates decayed more rapidly following ART initiation. In contrast, hypermutated provirus integration date distributions remained stable during ART (both trees p > 0.1; Figs. 8K and 8L). WIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6's hypermutated proviruses also spanned broad age ranges, but in contrast to WIHS-P2 and WIHS-P4, they did not differ from env-intact ones in terms of their overall integration date distributions (Supplementary Figs. 20 and 21). As reported previously, their env-intact proviral integration date distributions remained stable except for participant WIHS-P5, in whom the proviral pool shifted slightly towards later integration dates over time (Supplementary Figs. 21C and 21D) (Shahid et al. 2024). Hypermutated proviral integration date distributions were also stable over time except in WIHS-P1, whose proviral date distributions differed markedly by visit (Supplementary Figs. 20E and 20F). Though this could suggest dynamic changes over time, limited sampling must be acknowledged. Notably, the HM-Stripped and HM-Replacedw/R approaches produced comparable results except in the temporal analysis of env-intact proviruses for WIHS-P3, where HM-Stripped suggested a modest shift towards more recent integration dates over time, whereas HM-Replacedw/R indicated no change (Supplementary Figs.
20I and 20J).

Discussion

Though hypermutated proviruses persist in all people living with HIV (PLWH) (Bruner et al. 2016; Ho et al. 2013; Kinloch et al. 2023), we know relatively little about their within-host origins because they cannot be readily incorporated into phylogenies. We explored three simple approaches to remove hypermutation from nucleotide alignments, with the dual goals of 1) inferring phylogenies that accurately reconstruct the within-host evolutionary histories of hypermutated sequences and 2) applying molecular dating approaches to these trees to gain insights into the within-host origins and longevity of hypermutated proviruses. Of the approaches we evaluated, stripping nucleotide positions containing putative APOBEC3 mutations from the alignment, or replacing individual putative APOBEC3 mutations in hypermutated sequences with R, consistently normalized tree topologies and metrics. By contrast, replacing putative APOBEC3 mutations in hypermutated sequences with G failed to consistently resolve their erroneous clustering in the tree. We speculate that this is because G replacement is an overcorrection, as not all A bases at target sites are necessarily the result of APOBEC3 activity: the HIV genome is naturally high in A bases (Kypr and Mrazek 1987; Kypr et al. 1989). Across-the-board G replacement therefore likely obscures some legitimate ancestral information (i.e., inherited A bases), leaving these sequences at continued risk of long-branch attraction. By contrast, replacing putative APOBEC3 mutations with R mitigates this risk by acknowledging this ambiguity. We therefore advise against replacement of APOBEC3 mutations in hypermutated sequences with G. We further showed that the integration dates of env-intact proviruses inferred from the HM-Stripped and HM-Replacedw/R approaches were highly concordant with those inferred from benchmark trees that excluded hypermutated sequences entirely, as is the current practice.
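For readers who wish to reproduce the concordance comparisons, Lin's concordance correlation coefficient (ρc) can be computed directly from paired date estimates. The following is a minimal sketch, not the code used in this study: the function name `lins_ccc` and the example dates (expressed as decimal years) are hypothetical, and population (1/n) moments are assumed, following Lin's original definition.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Population (1/n) variances and covariance (assumption, see lead-in).
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared difference in means penalizes systematic bias between
    # methods, which a rank correlation alone would not detect.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical integration dates (decimal years) from two trees.
benchmark_dates = [2004.0, 2006.5, 2009.0, 2011.5]
corrected_dates = [2004.0, 2006.5, 2009.0, 2011.5]
shifted_dates = [d + 1.0 for d in benchmark_dates]  # uniform one-year bias

ccc_identical = lins_ccc(benchmark_dates, corrected_dates)
ccc_shifted = lins_ccc(benchmark_dates, shifted_dates)
```

In this toy example the identical estimates give ρc = 1, whereas the uniformly shifted estimates give a lower ρc even though their rank order is unchanged, which is why ρc is reported alongside Spearman's ρ.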
The demonstration that these corrected trees provide valid molecular dating results is important because it provides, for the first time, an approach to study the within-host evolutionary origins and longevity of the large and genetically diverse population of hypermutated proviruses that persist in all PLWH during ART. Proviral integration date estimates produced by the two approaches were highly concordant, and there was no clear difference in their performance. While the p-values derived from comparing the tree-based metrics of env -intact and hypermutated sequences, shown in Fig. 4 , are overall slightly higher for the HM-Replacedw/R compared to the HM-Stripped approach, we caution against interpreting this to mean that the former is superior. Though we applied statistical tests to guide interpretation, the main goal was to produce tree metric values for hypermutated and env -intact sequences that were in the same range as one another. Both HM-Stripped and HM-Replacedw/R approaches achieved this. We did not necessarily expect that env -intact and hypermutated sequence metrics would all normalize completely ( i.e. , produce non-significant p-values) because some evolutionary attributes of env -intact and hypermutated sequences might plausibly differ. As hypermutated sequences don't normally yield descendants for example, their closest neighbors in the tree might be more distant than those for env -intact proviruses, simply because of the lower likelihood of sampling a close relative (which, for a hypermutated sequence, could only be an ancestor). Differential evolutionary dynamics between hypermutated and env -intact proviruses could also produce differential root-to-tip measurements (and by extension integration date estimates) between groups, a phenomenon that was indeed observed in WIHS-P2 and WIHS-P4. We therefore offer the following considerations when choosing an approach. 
Since the HM-Replacedw/R approach retains the full alignment, it should preserve more phylogenetic signal than the HM-Stripped approach, where an average of 9% of each env-gp120 alignment was removed. This could be advantageous for HIV regions that are relatively conserved, yet hotspots for APOBEC3 mutation, for example parts of pol (Kieffer et al. 2005; Kijak et al. 2008). Before implementing the HM-Replacedw/R approach, however, it is essential to verify that the chosen phylogenetic inference package supports ambiguous characters. IQ-TREE 2, used in the present study, assigns equal likelihood to each component character (Minh et al. 2020), but other packages, such as the approximate maximum likelihood algorithm FastTree, treat all non-ACTG characters as missing data (Price et al. 2010). It is also important to recognize when sequence alignment modifications are warranted. For routine phylogenetic visualization of HIV datasets, hypermutated sequences can be incorporated directly. Such trees might even be adequate for some limited tree-based inferences, as suggested by our finding that uncorrected trees produced reasonable root dates and evolutionary rates, likely because these calculations only use information from pre-ART plasma HIV RNA sequences. Nevertheless, our demonstration that uncorrected trees erroneously reconstructed the ancestry of hypermutated proviruses, and produced inaccurate (and often nonsensical) integration dates for them, underscores why they cannot be used to answer questions about the evolutionary history of hypermutated proviruses. For such questions, the above alignment modification approaches should be used. Our results also reveal insights into hypermutated provirus evolutionary dynamics. Like env-intact ones, hypermutated proviruses spanned a broad age range. From WIHS-P2 for example, we isolated hypermutated proviruses that had integrated as early as a year following seroconversion.
This indicates that hypermutated proviruses, like other provirus types, begin to be seeded into the proviral pool essentially immediately following transmission, and can persist for decades thereafter. Our results also revealed evidence of differential evolutionary dynamics of hypermutated and env -intact proviruses in two of the six participants studied, namely WIHS-P2, whose hypermutated proviruses were on average older than env -intact ones, and WIHS-P4, in whom the opposite was observed. This suggests that the decay rates of different types of proviruses can be heterogeneous within a given host, as well as heterogeneous between hosts. Our study has some limitations. We analyzed the present dataset (Shahid et al. 2024 ) because it is among the most comprehensive of its type (in terms of sequence N, follow-up time and sampling near seroconversion) and because env-gp120 is commonly used for within-host HIV evolutionary studies (Brooks et al. 2020 ; Dapp et al. 2017 ). That said, participants WIHS-P3 and WIHS-P6 had only modest numbers of hypermutated proviruses, which limited our power to detect differences between these and env -intact proviruses in their data. Furthermore, while our proposed method should be applicable to any HIV gene region, we did not explicitly investigate this. The identification of hypermutated sequences, on which our method depends, is by definition imperfect, as it relies on a statistical cut-off and can be subtly influenced by the choice of reference sequence, particularly if a heterologous sequence ( e.g. HXB2 HIV reference strain) is used for this purpose (Rose and Korber 2000 ). 
As recommended, we used the most frequent sequence observed post-seroconversion as the reference (Rose and Korber 2000 ), though we verified that use of a different sequence impacted the identification of hypermutated sequences minimally or not at all ( e.g ., using an arbitrarily-chosen reference sequence from WIHS-P2's earliest sampling time point yielded 137 (out of 1515) nucleotide positions with putative APOBEC3 mutations, versus the original 140). Finally, we cannot assume that intact env-gp120 sequences come from fully intact HIV genomes. As such, the comparison group for hypermutated sequences in the present study is not the replication competent HIV reservoir, but rather the pool of proviruses with intact env-gp120 sequences, many of which will have defects elsewhere. In summary, the current practice of excluding hypermutated proviruses from phylogenies used for hypothesis testing has been a major barrier to understanding the in vivo evolutionary origins and longevity of these sequences. Here, we validated two simple nucleotide alignment modification approaches that, for the first time, allow hypermutated sequences to be correctly incorporated into phylogenies that can be used for molecular dating. Overall, our observations reveal that hypermutated proviruses, like other provirus types, are archived throughout untreated infection and can persist for years on ART. Our observations further suggest that the evolutionary dynamics of hypermutated proviruses may differ from those of other proviral types in some individuals. In addition to enriching our understanding of HIV persistence towards the ultimate goal of HIV cure, the approaches developed here could be extended to between-host phylogenies, and testing of other hypotheses related to within-host evolutionary origins of hypermutated sequences. 
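The alignment modifications evaluated here (HM-Stripped, HM-Replacedw/R, and the HM-Replacedw/G approach that we advise against) can be illustrated with a minimal sketch. This is not the pipeline used in this study. It assumes an ungapped toy alignment and a simplified, Hypermut-style definition of a putative APOBEC3 mutation: a G in the reference that reads A in the query and is followed in the query by [A/G][A/G/T]. All function names and sequences below are hypothetical. R is the IUPAC code for "A or G".

```python
import re

# Simplified downstream context for a putative APOBEC3 mutation (assumption).
DOWNSTREAM_CONTEXT = re.compile(r"[AG][AGT]")


def putative_apobec3_sites(reference: str, query: str) -> list[int]:
    """Return 0-based positions where reference G reads A in an APOBEC3 context."""
    sites = []
    for i in range(len(reference) - 2):
        if (reference[i] == "G" and query[i] == "A"
                and DOWNSTREAM_CONTEXT.fullmatch(query[i + 1:i + 3])):
            sites.append(i)
    return sites


def replace_sites(sequence: str, sites: list[int], symbol: str) -> str:
    """Replace the bases at the given positions with one symbol (R or G)."""
    chars = list(sequence)
    for i in sites:
        chars[i] = symbol
    return "".join(chars)


def strip_columns(alignment: dict[str, str], columns: list[int]) -> dict[str, str]:
    """Remove the given alignment columns from every sequence."""
    drop = set(columns)
    return {name: "".join(base for i, base in enumerate(seq) if i not in drop)
            for name, seq in alignment.items()}


# Toy data: one env-intact and one hypermutated sequence (hypothetical).
reference = "ATGGAGGATC"
alignment = {"intact": "ATGGCGGATC", "hypermutated": "ATAGAGAATC"}

sites = putative_apobec3_sites(reference, alignment["hypermutated"])

# HM-Replacedw/R and HM-Replacedw/G modify only the hypermutated sequence.
hm_replaced_r = replace_sites(alignment["hypermutated"], sites, "R")
hm_replaced_g = replace_sites(alignment["hypermutated"], sites, "G")

# HM-Stripped removes the affected columns from the whole alignment. With
# several hypermutated sequences, the union of their sites would be removed.
hm_stripped = strip_columns(alignment, sites)
```

Note that in this toy example G replacement makes the hypermutated sequence identical to the reference, which illustrates how across-the-board G replacement can overwrite A bases that might have been inherited, whereas R leaves that question open.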
Declarations Data availability The nucleotide sequences reported in this paper are available in GenBank (proviral DNA accession numbers: OR404056 - OR404777, OR404820 - OR404981; HIV RNA accession numbers: OR403057 - OR403738). Conflict of interest The authors declare that they have no conflicts of interest. Acknowledgments and funding We thank Drs. Art F.Y. Poon and Natalie N. Kinloch for helpful discussions. The authors gratefully acknowledge the contributions of the study participants and dedication of the staff at the MWCCS sites. Data in this manuscript were collected by the Women’s Interagency HIV Study (WIHS), now the MACS/WIHS Combined Cohort Study (MWCCS). The contents of this publication are solely the responsibility of the authors and do not represent the official views of the National Institutes of Health (NIH) or other funders. MWCCS (Principal Investigators): Atlanta CRS (Ighovwerha Ofotokun, Anandi Sheth, and Gina Wingood), U01-HL146241; Baltimore CRS (Todd Brown and Joseph Margolick), U01-HL146201; Bronx CRS (Kathryn Anastos, David Hanna, and Anjali Sharma), U01-HL146204; Brooklyn CRS (Deborah Gustafson and Tracey Wilson), U01-HL146202; Data Analysis and Coordination Center (Gypsyamber D’Souza, Stephen Gange and Elizabeth Topper), U01-HL146193; Chicago-Cook County CRS (Mardge Cohen, Audrey French, and Ryan Ross), U01-HL146245; Chicago-Northwestern CRS (Steven Wolinsky, Frank Palella, and Valentina Stosor), U01-HL146240; Northern California CRS (Bradley Aouizerat, Jennifer Price, and Phyllis Tien), U01-HL146242; Los Angeles CRS (Roger Detels and Matthew Mimiaga), U01-HL146333; Metropolitan Washington CRS (Seble Kassaye and Daniel Merenstein), U01-HL146205; Miami CRS (Maria Alcaide, Margaret Fischl, and Deborah Jones), U01-HL146203; Pittsburgh CRS (Jeremy Martinson and Charles Rinaldo), U01-HL146208; UAB-MS CRS (Mirjam-Colette Kempf, James B. Brock, and Deborah Konkle-Parker), U01-HL146192; UNC CRS (M. 
Bradley Drummond and Michelle Floris-Moore), U01-HL146194. The MWCCS is funded primarily by the National Heart, Lung, and Blood Institute (NHLBI), with additional co-funding from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute on Aging (NIA), National Institute of Dental and Craniofacial Research (NIDCR), National Institute of Allergy And Infectious Diseases (NIAID), National Institute of Neurological Disorders and Stroke (NINDS), National Institute of Mental Health (NIMH), National Institute on Drug Abuse (NIDA), National Institute of Nursing Research (NINR), National Cancer Institute (NCI), National Institute on Alcohol Abuse and Alcoholism (NIAAA), National Institute on Deafness and Other Communication Disorders (NIDCD), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute on Minority Health and Health Disparities (NIMHD), and in coordination and alignment with the research priorities of the National Institutes of Health, Office of AIDS Research (OAR). MWCCS data collection is also supported by UL1-TR000004 (UCSF CTSA), UL1-TR003098 (JHU ICTR), UL1-TR001881 (UCLA CTSI), P30-AI-050409 (Atlanta CFAR), P30-AI-073961 (Miami CFAR), P30-AI-050410 (UNC CFAR), P30-AI-027767 (UAB CFAR), P30-MH-116867 (Miami CHARM), UL1-TR001409 (DC CTSA), KL2-TR001432 (DC CTSA), and TL1-TR001431 (DC CTSA). In addition, this work was supported by the Canadian Institutes of Health Research (CIHR) through a project grant (PJT-159625 to Z.L.B. and J.B.J.) and a focused team grant (HB1-164063 to Z.L.B.). This work was also supported by the Martin Delaney "REACH" Collaboratory (NIH grant 1-UM1AI164565-01 to Z.L.B.), which is supported by the following NIH co-funding Institutes: NIMH, NIDA, NINDS, NIDDK, NHLBI, and NIAID. This work was also supported by the Einstein-Rockefeller-CUNY Center for AIDS Research (NIH grant # P30AI124414 to H.G.). A.S. and B.R.J. 
were supported by CIHR Doctoral Research Awards. S.M. was supported by an FHS Undergraduate Student Research Award. M.C.D. was supported by a CIHR Canada Graduate Scholarship—Master’s award. Z.L.B. was supported by a Scholar Award from Michael Smith Health Research BC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. References Adimora AA et al (2018) Cohort Profile: The Women's Interagency HIV Study (WIHS). Int J Epidemiol 47(2):393–94i Bacon MC et al (2005) The Women's Interagency HIV Study: an observational cohort brings clinical sciences to the bench. Clin Diagn Lab Immunol 12(9):1013–1019 Barkan SE et al (1998) The Women's Interagency HIV Study. WIHS Collaborative Study Group. Epidemiology 9(2):117–125 Bergsten J (2005) A review of long-branch attraction. Cladistics 21(2):163–193 Bozzi G et al (2019) 'No evidence of ongoing HIV replication or compartmentalization in tissues during combination antiretroviral therapy: Implications for HIV eradication', Sci Adv , 5 (9), eaav2045 Brodin J et al (2016) 'Establishment and stability of the latent HIV-1 DNA reservoir'. Elife, 5 Brooks K et al (2020) 'HIV-1 variants are archived throughout infection and persist in the reservoir'. PLoS Pathog, 16 (6), e1008378 Bruner KM et al (2016) Defective proviruses rapidly accumulate during acute HIV-1 infection. Nat Med 22(9):1043–1049 Cornish-Bowden A (1985) Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res 13(9):3021–3030 D'Souza G et al (2021) Characteristics of the MACS/WIHS Combined Cohort Study: Opportunities for Research on Aging With HIV in the Longest US Observational Study of HIV. Am J Epidemiol 190(8):1457–1475 Dapp MJ et al (2017) 'Patterns and rates of viral evolution in HIV-1 subtype B infected females and males'. 
PLoS ONE, 12 (10), e0182443 Finzi D et al (1997) Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science 278(5341):1295–1300 Finzi D et al (1999) Latent infection of CD4 + T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat Med 5(5):512–517 Fitzgibbon JE, Mazar S, Dubin DT (1993) A new type of G–>A hypermutation affecting human immunodeficiency virus. AIDS Res Hum Retroviruses 9(9):833–838 Gantner P et al (2023) 'HIV rapidly targets a diverse pool of CD4(+) T cells to establish productive and latent infections'. Immunity, 56 (3), 653 – 68 e5. Goodenow M et al (1989) 'HIV-1 isolates are rapidly evolving quasispecies: evidence for viral mixtures and preferred nucleotide substitutions', J Acquir Immune Defic Syndr (1988) , 2 (4), 344 – 52 Gorbalenya AE (2017) 'Phylogeny of Viruses', Reference Module in Biomedical Sciences Halvas EK et al (2020) HIV-1 viremia not suppressible by antiretroviral therapy can originate from large T cell clones producing infectious virus. J Clin Invest 130(11):5847–5857 Harris RS, Liddament MT (2004) Retroviral restriction by APOBEC proteins. Nat Rev Immunol 4(11):868–877 Hiener B et al (2017) Identification of Genetically Intact HIV-1 Proviruses in Specific CD4(+) T Cells from Effectively Treated Participants. Cell Rep 21(3):813–822 Ho YC et al (2013) Replication-competent noninduced proviruses in the latent reservoir increase barrier to HIV-1 cure. Cell 155(3):540–551 Hoang DT et al (2018) UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol 35(2):518–522 Imamichi H et al (2020) Defective HIV-1 proviruses produce viral proteins. Proc Natl Acad Sci U S A 117(7):3704–3710 Isaac NJ et al (2007) 'Mammals on the EDGE: conservation priorities based on threat and phylogeny'. 
PLoS ONE, 2 (3), e296 Jones BR, Joy JB (2023) 'Inferring Human Immunodeficiency Virus 1 Proviral Integration Dates With Bayesian Inference', Mol Biol Evol , 40 (8) Jones BR et al (2018) Phylogenetic approach to recover integration dates of latent HIV sequences within-host. Proc Natl Acad Sci U S A 115(38):E8958–E67 Jones BR et al (2020) 'Genetic Diversity, Compartmentalization, and Age of HIV Proviruses Persisting in CD4(+) T Cell Subsets during Long-Term Combination Antiretroviral Therapy', J Virol , 94 (5) Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780 Kearney MF et al (2016) Origin of Rebound Plasma HIV Includes Cells with Identical Proviruses That Are Transcriptionally Active before Stopping of Antiretroviral Therapy. J Virol 90(3):1369–1376 Kembel SW et al (2010) Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26(11):1463–1464 Kieffer TL et al (2005) G–>A hypermutation in protease and reverse transcriptase regions of human immunodeficiency virus type 1 residing in resting CD4 + T cells in vivo. J Virol 79(3):1975–1980 Kijak GH et al (2008) Variable contexts and levels of hypermutation in HIV-1 proviral genomes recovered from primary peripheral blood mononuclear cells. Virology 376(1):101–111 Kinloch NN et al (2023) 'HIV reservoirs are dominated by genetically younger and clonally enriched proviruses', mBio , e0241723 Kypr J, Mrazek J (1987) Unusual codon usage of HIV. Nature 327(6117):20 Kypr J, Mrazek J, Reich J (1989) Nucleotide composition bias and CpG dinucleotide content in the genomes of HIV and HTLV 1/2. Biochim Biophys Acta 1009(3):280–282 Larsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30(22):3276–3278 Lee GQ et al (2017) Clonal expansion of genome-intact HIV-1 in functionally polarized Th1 CD4 + T cells. 
J Clin Invest 127(7):2689–2696 Martin DP et al (2015) RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evol 1(1):vev003 Minh BQ et al (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 37(5):1530–1534 Nicolas A et al (2022) 'Genotypic and Phenotypic Diversity of the Replication-Competent HIV Reservoir in Treated Patients'. Microbiol Spectr, 10 (4), e0078422 Pankau MD et al (2020) 'Dynamics of HIV DNA reservoir seeding in a cohort of superinfected Kenyan women'. PLoS Pathog, 16 (2), e1008286 Patro SC et al (2019) Combined HIV-1 sequence and integration site analysis informs viral dynamics and allows reconstruction of replicating viral ancestors. Proc Natl Acad Sci U S A 116(51):25891–25899 Pavoine et al (2017) From phylogenetic to functional originality: Guide through indices and new developments. Ecol Ind 82:196–205 Pinzone MR et al (2019) Longitudinal HIV sequencing reveals reservoir expression leading to decay which is obscured by clonal expansion. Nat Commun 10(1):728 Price MN, Dehal PS, Arkin AP (2010) 'FastTree 2–approximately maximum-likelihood trees for large alignments'. PLoS ONE, 5 (3), e9490 Redding DW, Mooers AØ (2006) Incorporating evolutionary measures into conservation prioritization. Conserv Biol 20(6):1670–1678 Redding DW, Mazel F, Mooers A (2014) 'Measuring Evolutionary Isolation for Conservation'. PLoS ONE, 9 (12), e113490 Rose PP, Korber BT (2000) Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics 16(4):400–401 Sanchez G et al (1997) Accumulation of defective viral genomes in peripheral blood mononuclear cells of human immunodeficiency virus type 1-infected individuals. 
J Virol 71(3):2233–2240 Shahid A et al (2024) 'The replication-competent HIV reservoir is a genetically restricted, younger subset of the overall pool of HIV proviruses persisting during therapy, which is highly genetically stable over time'. J Virol, 98 (2), e0165523 Sheehy AM et al (2002) Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein. Nature 418(6898):646–650 Slatkin M, Maddison WP (1989) A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123(3):603–613 Vartanian JP et al (1991) Selection, recombination, and G----A hypermutation of human immunodeficiency virus type 1 genomes. J Virol 65(4):1779–1788 Vartanian JP et al (1994) G–>A hypermutation of the human immunodeficiency virus type 1 genome: evidence for dCTP pool imbalance during reverse transcription. Proc Natl Acad Sci U S A 91(8):3092–3096 Waldron D (2015) Hypermutation of HIV-1 in vivo. Nat Rev Genet 16(11):626–626 Whitney JB et al (2014) Rapid seeding of the viral reservoir prior to SIV viraemia in rhesus monkeys. Nature 512(7512):74–77 Yu G (2020) Using ggtree to Visualize Data on Tree-Like Structures. Curr Protoc Bioinf 69(1):e96

Tables

Table 1: Participant information, HIV sampling and sequencing details

| ID^ | Estimated date of infection | Duration of uncontrolled infection (years) | No. of pre-ART plasma HIV RNA time points | Distinct pre-ART plasma HIV env-gp120 sequences | ART initiation date | Years of ART until last proviral sampling | No. of on-ART proviral time points | Distinct on-ART HIV env-gp120 proviral sequences (Hypermutated N; %) |
|---|---|---|---|---|---|---|---|---|
| WIHS-P2 | Jan 2003 | 9 | 10 | 227 | Jan 2012 | 6.8 | 3 | 75 (22; 28%) |
| WIHS-P4* | Jul 1995 | 10.9 | 9 | 182 | Jun 2006 | 12.3 | 4 | 155 (23; 15%) |
| WIHS-P1 | Dec 1995 | 12 | 13 | 207 | Jan 2008 | 10.3 | 4 | 85 (15; 13%) |
| WIHS-P3 | Jul 2002 | 5.5 | 9 | 132 | Jan 2008 | 8.8 | 3 | 59 (5; 8%) |
| WIHS-P5 | Mar 2008 | 1.9 | 2 | 45 | Feb 2010 | 8.7 | 3 | 74 (22; 30%) |
| WIHS-P6 | Aug 2006 | 3.9 | 6 | 73 | Jul 2010 | 8.3 | 4 | 94 (6; 6%) |

^ Participants are numbered in the same order as the original manuscript (Shahid et al. 2024). That is, WIHS-P2 in the present study is Participant 2 in (Shahid et al. 2024).
* The MWCCS database indicated that participant 4 initiated ART in 2003, but no reductions in plasma viral load (pVL) were observed until June 2006. For this reason, we considered June 2006 as this participant's effective ART start date.

Table 2: Hypermutated sequence details

| ID | Hypermutated proviruses | Aligned HIV env-gp120 sequence length (bp) | Putative hypermutated nucleotide positions in the alignment (a) | Hypermutated sites identified per sequence, median (range) (b) |
|---|---|---|---|---|
| WIHS-P2 | 22 | 1515 | 140 | 43 (20 – 68) |
| WIHS-P4 | 23 | 1541 | 176 | 55 (34 – 83) |
| WIHS-P1 | 15 | 1483 | 141 | 41 (10 – 75) |
| WIHS-P3 | 5 | 1486 | 127 | 40 (36 – 64) |
| WIHS-P5 | 22 | 1501 | 152 | 57 (9 – 78) |
| WIHS-P6 | 6 | 1523 | 122 | 47 (35 – 75) |

(a) The total number of nucleotide positions that harbored an A at an APOBEC3 target site in at least one hypermutated sequence in the participant’s sequence alignment. These positions were stripped out of the alignment in the HM-Stripped approach.
(b) Statistics summarizing the overall number of A bases at APOBEC3 target sites in the participant's hypermutated sequences. These A bases were changed to R or G, respectively, in the HM-Replacedw/R and HM-Replacedw/G approaches.

Additional Declarations

The authors declare no competing interests.
Supplementary Files

ShahidetalSupplementaryfigures.pdf: Supplementary figures with legends
ShahidetalSupplementaryTables20240607.docx: Supplementary Tables 1 and 2
Brumme","email":"","orcid":"","institution":"Faculty of Health Sciences, Simon Fraser University, Burnaby, British Columbia, Canada","correspondingAuthor":false,"prefix":"","firstName":"Zabrina","middleName":"L.","lastName":"Brumme","suffix":""}],"badges":[],"createdAt":"2024-06-08 09:36:48","currentVersionCode":1,"declarations":{"humanSubjects":true,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":true,"humanSubjectConsent":true,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-4549934/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4549934/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":58293106,"identity":"cf693ee2-6a17-4535-88e4-8ee3c20e89e8","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":3023063,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eTree-based metrics \u003c/strong\u003e\u003c/em\u003e(A) Hypothetical tree containing six sequences, labeled A through F, each in a unique color. Vertical dotted lines depict the distance scale in hypothetical units numbered below the tree. All other panels depict this same tree. (B) Colored horizontal lines trace the terminal branch lengths (TBL) of sequences A-F, with the values also shown at the right of the tree. (C) Each sequence's path from root to tip is traced with a unique color, where the sum of these lengths (representing the root-to-tip distance; RTT) is shown at the right of the tree. (D and E)\u003cem\u003e Fair Proportion Evolutionary Distinctiveness \u003c/em\u003e(FP-ED) divides the shared evolutionary history represented by an internal branch equally among all its descendant\u003cem\u003e \u003c/em\u003esequences at the tips. 
Here, colored lines and associated fractional branch lengths show how internal branch lengths are apportioned to each sequence. The sum of each sequence's branch measurements, the FP-ED, is shown at the right of the tree.\u003cem\u003e \u003c/em\u003e(E) In contrast, \u003cem\u003eEqual Splits Evolutionary Distinctiveness \u003c/em\u003e(ES-ED) assigns 50% of each internal branch length to each immediate descendant. As such, branches leading to a single descendant assign 50% of that branch to this descendant, whereas branches leading to multiple descendants further split the remaining 50% among them using this same scheme. The sum of these measurements, the ES-ED, is shown at the right of the tree. (F)\u003cem\u003e \u003c/em\u003eThe topological distance (TD) separating sequence A from all others is shown to the right of each tip, where TD is computed as\u003cem\u003e \u003c/em\u003ethe\u003cem\u003e \u003c/em\u003etotal number of nodes separating A from all others in the tree. Here, the median TD separating A from all others in the tree is 4.\u003c/p\u003e","description":"","filename":"Figure1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/c11ce7b8da61c388e1f286b4.jpg"},{"id":58293109,"identity":"a7d87d6f-8d13-4ae4-b272-ee5847ea4b2d","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":4194895,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eWIHS-P2: clinical history, within-host phylogenies and tree metrics.\u003c/strong\u003e\u003c/em\u003e (A) Participant WIHS-P2's plasma viral load history and sampling timeline. Closed grey circles denote pre-ART plasma HIV RNA sampling. Open circles denote proviral sampling on ART (blue for \u003cem\u003eenv\u003c/em\u003e-intact proviruses and red for hypermutated proviruses). Grey shading denotes ART. 
(B) Participant WIHS-P2's rooted maximum-likelihood phylogeny, inferred from all within-host \u003cem\u003eenv-gp120\u003c/em\u003e sequences including hypermutated proviruses. Branches are colored by sequence type (pre-ART HIV RNA = grey; on-ART \u003cem\u003eenv\u003c/em\u003e-intact provirus = blue; on-ART hypermutated provirus = red). Inset shows the number of inferred migrations between \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequence groups computed using the Slatkin-Maddison (SM) test, along with the estimated p-value. Here, p=0 can be interpreted as p\u0026lt;0.001, as 1,000 permutations were performed. (C) Terminal Branch Lengths (TBL) of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequences in this tree. Horizontal black lines denote the median values. P-value computed using the Mann-Whitney U-test. (D) Fair Proportion Evolutionary Distinctiveness (FP-ED) values for \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequences in this tree. (E) Median Topological distances (TD) separating \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses from others of the same type (F-L) same as panels B through E, but for the phylogeny inferred from an alignment where positions containing hypermutation were stripped out. (G-M) same as panels B through E, but for the phylogeny inferred from an alignment where hypermutated sites were replaced with R.\u003c/p\u003e","description":"","filename":"Figure2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/732222725aa81ba35845c438.jpg"},{"id":58293108,"identity":"28107780-b1fe-4cf4-9464-e1d13b42ac6c","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":4214518,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eWIHS-P4 clinical history, within-host phylogenies and tree metrics. 
\u003c/strong\u003e\u003c/em\u003eLegend as in Figure 2, except the data are for WIHS-P4.\u003c/p\u003e","description":"","filename":"Figure3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/0931e427d93bad366956235b.jpg"},{"id":58293642,"identity":"3fadb9b1-252f-4799-b929-77732c7d44c0","added_by":"auto","created_at":"2024-06-13 14:10:39","extension":"tiff","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":490314,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eSummary of tree metrics across all participants. \u003c/strong\u003e\u003c/em\u003eFor each participant (each shown with a distinct symbol), the p-value derived from comparing \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses in the tree for each of the phylogenetic metrics (each shown in a distinct color) is plotted for each tree type. TBL = Terminal Branch Length; FP-ED = Fair Proportion Evolutionary Distinctiveness; TD = Topological Distance; ES-ED = Equal Splits Evolutionary Distinctiveness; RTT = root-to-tip distance; SM = Slatkin-Maddison. For consistency with the other metrics, SM estimated p-values of 0 are shown here as p\u0026lt;0.0001. The horizontal dashed line at p=0.05 denotes the standard threshold for statistical significance.\u003c/p\u003e","description":"","filename":"Figure4.tiff","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/089fa30326b90086ec35de3d.tiff"},{"id":58293112,"identity":"3613f3cb-7dca-484c-a6d8-809bd2f514f4","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":1474226,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eWithin-host phylogenetic approach to infer proviral integration dates.\u003c/strong\u003e\u003c/em\u003e (A) Viral load and sampling timeline for a hypothetical participant. 
Closed grey circles denote plasma HIV RNA sampling prior to ART, while the open blue circle denotes proviral HIV DNA sampling during ART. Light grey shading represents ART. (B) Rooted, maximum likelihood within-host phylogeny, with branches colored by sequence type (grey = pre-ART plasma HIV RNA; blue = on-ART proviruses). (C) HIV sequence divergence from the root over time. The blue dashed diagonal represents the linear model relating the root-to-tip distances of distinct pre-ART plasma HIV RNA sequences (closed grey circles) to their sampling dates. This model is used to convert the root-to-tip distances of proviral sequences sampled during ART (open blue circles) to their integration dates. Faint grey lines trace the ancestral relationships between HIV sequences. (D) Integration date point estimates (and 95% confidence intervals) for each distinct provirus sequence sampled during ART, sorted from oldest to youngest. The provirus shown at the bottom right of panel C for example, was estimated to have integrated in October 1998, and is shown at the bottom left of panel D.\u003c/p\u003e","description":"","filename":"Figure5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/5d6e0879f6bc03c899a2232e.jpg"},{"id":58293115,"identity":"01a60f48-4ffe-4823-b173-6f4ff4d787df","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":3695303,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eInferring proviral integration dates from corrected trees: validation using WIHS-P2's data. 
\u003c/strong\u003e\u003c/em\u003e(\u003cstrong\u003eA\u003c/strong\u003e)\u003cem\u003e\u003cstrong\u003e \u003c/strong\u003e\u003c/em\u003eCorrelation between inferred integration dates of \u003cem\u003eenv\u003c/em\u003e-intact proviruses from the benchmark versus corrected trees, where the dates inferred from the HM-Stripped tree are in orange and those inferred from the HM-Replacedw/R tree are in teal. Spearman's ρ, associated p-value, and Lin's concordance correlation coefficient (ρc) are shown for each comparison. Regression lines in matching colors are also provided to help visualize these relationships. The dotted diagonal denotes a hypothetical perfect concordance. (B) Correlation between inferred integration dates of\u003cu\u003e\u003cem\u003e all\u003c/em\u003e\u003c/u\u003e proviruses from HM-Stripped versus HM-Replacedw/R trees, with hypermutated proviruses in red and \u003cem\u003eenv\u003c/em\u003e-intact proviruses in blue. Statistics are computed for all proviruses (black), hypermutated proviruses only (red) and \u003cem\u003eenv\u003c/em\u003e-intact proviruses only (blue). (C) Inferred integration dates of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses from HM-Stripped and HM-Replacedw/R trees, presented as paired measurements connected with matching-colored lines. P-value computed using the Wilcoxon matched-pairs signed rank test. 
(D) Correlation between inferred integration dates of hypermutated proviruses from the HM-Unaltered and corrected trees (HM-Stripped tree = maroon; HM-Replacedw/R tree = gold).\u003c/p\u003e","description":"","filename":"Figure6.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/6c76f1c0f4887917927aa5ea.jpg"},{"id":58293114,"identity":"3c3fbcb7-3d0c-4539-9ea1-53bb99802fcd","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":3952354,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eInferring proviral integration dates from corrected trees: validation using WIHS-P4's data. \u003c/strong\u003e\u003c/em\u003eLegend as in Figure 6, except for WIHS-P4.\u003c/p\u003e","description":"","filename":"Figure7.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/46eed22fe4315299c19f2a7c.jpg"},{"id":58293113,"identity":"3d50792c-589d-4db9-a46f-e6910eb13ea2","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"jpg","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":4161458,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eWIHS-P2\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e and \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eP4\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e: \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eIntegration dates of env\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e-\u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eintact and hypermutated proviruses persisting during ART\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e. \u003c/strong\u003eTop: HIV plasma viral load and sampling history for participant WIHS-P2. 
(A, B) Integration dates of \u003cem\u003eenv\u003c/em\u003e-intact (blue) and hypermutated proviruses (HM; red) inferred from the HM-Stripped (panel A) and HM-Replacedw/R (panel B) trees. All proviruses of the same type are grouped together regardless of sampling date on ART. P-value derived from the Mann-Whitney U-test. Horizontal black lines represent the median values. (C-D) These are the same \u003cem\u003eenv\u003c/em\u003e-intact provirus integration dates as shown in panels A and B, but now stratified by sampling date on ART. P-value is from a Kruskal-Wallis test comparing all groups. (E-F) These are the same hypermutated provirus integration dates as shown in panels A and B, but now stratified by sampling date on ART. The large P-value at the top is from a Kruskal-Wallis test comparing all groups. The smaller p-values below represent the significant pairwise post-tests after correction for multiple comparisons. (G-L) Same as for panels A-F, except for participant WIHS-P4.\u003c/p\u003e","description":"","filename":"Figure8.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/0699218e017813d170fa3853.jpg"},{"id":58821646,"identity":"844aa9b9-b993-4f45-8960-c33dd4e0362e","added_by":"auto","created_at":"2024-06-21 15:20:58","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":26349492,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/22139868-c5f6-4680-b54f-99e129d87651.pdf"},{"id":58294052,"identity":"26e20b32-e4c6-44e2-8394-317cc492f98a","added_by":"auto","created_at":"2024-06-13 14:18:39","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":4935094,"visible":true,"origin":"","legend":"\u003cp\u003eSupplementary figures with 
legends\u003c/p\u003e","description":"","filename":"ShahidetalSupplementaryfigures.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/411d12f77a8a49149e097545.pdf"},{"id":58293107,"identity":"f3e9cc6a-f632-4734-90e8-c6dd0b1afce2","added_by":"auto","created_at":"2024-06-13 14:02:39","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":26217,"visible":true,"origin":"","legend":"\u003cp\u003eSupplementary Table 1 and 2\u003c/p\u003e","description":"","filename":"ShahidetalSupplementaryTables20240607.docx","url":"https://assets-eu.researchsquare.com/files/rs-4549934/v1/4ca61fba0dc1332ad94b3ded.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"A simple phylogenetic approach to analyze hypermutated HIV proviruses reveals insights into their dynamics and persistence during antiretroviral therapy","fulltext":[{"header":"Introduction","content":"\u003cp\u003eAntiretroviral therapy (ART) is not curative because HIV persists as an integrated provirus within a small fraction of infected cell reservoirs (Finzi et al. \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e1997\u003c/span\u003e; Finzi et al. \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e1999\u003c/span\u003e). Entry of HIV sequences into these reservoirs begins immediately following infection (Gantner et al. \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Whitney et al. \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) and continues until viral suppression is achieved on ART, yielding a genetically diverse viral reservoir (Brodin et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Brooks et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Jones et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Kinloch et al. 
\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Nicolas et al. \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Pankau et al. \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Only a minority (~\u0026thinsp;2\u0026ndash;5%) of integrated proviruses persisting on ART however are genetically intact and capable of producing replication-competent HIV; the remainder are genetically defective and cannot produce infectious virus (Bruner et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Ho et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Imamichi et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Sanchez et al. \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e1997\u003c/span\u003e). Large deletions, which occur during the minus-strand synthesis step of reverse transcription, are the most common defects, followed by hypermutation (Bruner et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Hiener et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Ho et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Kinloch et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Lee et al. \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2017\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eHypermutated proviruses arise in a single HIV replication cycle when host antiviral APOBEC3 proteins catalyze widespread cytidine-to-uridine deamination within the minus-strand HIV DNA genome that is produced during reverse transcription, yielding extensive guanine to adenine (G-to-A) mutations during plus-strand synthesis (Fitzgibbon et al. \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e1993\u003c/span\u003e; Goodenow et al. 
\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e1989\u003c/span\u003e; Vartanian et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e1991\u003c/span\u003e; Vartanian et al. \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e1994\u003c/span\u003e). Hypermutation is normally deleterious, yielding stop codons in one or more HIV reading frames (Harris and Liddament \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2004\u003c/span\u003e; Vartanian et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e1991\u003c/span\u003e; Waldron \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). As a result, hypermutated proviruses do not generally yield evolutionary descendants (Kieffer et al. \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Sheehy et al. \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2002\u003c/span\u003e). Nevertheless, hypermutated sequences readily persist, typically representing 15% (though as much as \u0026gt;\u0026thinsp;50%) of all proviruses during long-term ART (Bruner et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Hiener et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Ho et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Kinloch et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Lee et al. \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2017\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eHypermutated HIV sequences pose challenges for phylogenetic inference algorithms, which assume that mutations gradually accumulate over generations, not all at once in a single round of replication (Gorbalenya \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). 
Phylogenies inferred from sequence alignments containing hypermutated proviruses will therefore inaccurately reflect the ancestor-descendant relationships of these sequences. Due to their large number of G-to-A mutations, the terminal branch lengths of hypermutated sequences are typically extended in these trees, and they will also often cluster together due to a type of phylogenetic error known as long branch attraction, whereby divergent sequences are classified as being more similar to one another simply because they have undergone a large amount of change, not because they share recent ancestry (Bergsten \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2005\u003c/span\u003e). Though hypermutated sequences are routinely included in phylogenies simply as a way to visualize complete datasets (Halvas et al. \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Kearney et al. \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Patro et al. \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), such trees should not be used for formal hypothesis testing. To our knowledge, no standard approaches exist to correctly infer ancestor-descendant relationships in datasets that include hypermutated sequences. Instead, these sequences are typically removed from HIV alignments, excluding them from phylogenetic hypothesis testing entirely (Bozzi et al. \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Brodin et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Brooks et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Jones et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Jones and Joy \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Pinzone et al. 
\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). As a result, relatively little is known about the within-host origins and longevity of hypermutated proviruses.\u003c/p\u003e \u003cp\u003eTo address these gaps, we used longitudinal within-host HIV \u003cem\u003eenv-gp120\u003c/em\u003e sequence datasets from six participants of the Women's Interagency HIV Study (WIHS) (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) to evaluate the ability of three nucleotide alignment modification strategies to normalize the topologies of trees containing hypermutated proviruses. Using these corrected trees, we then estimated the integration dates of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses persisting during ART, towards better understanding the within-host evolutionary dynamics of these different proviral types.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy participants and within-host HIV sequence datasets\u003c/h2\u003e \u003cp\u003eWe analyzed longitudinal, single-genome-amplified HIV \u003cem\u003eenv\u003c/em\u003e-\u003cem\u003egp120\u003c/em\u003e sequence datasets previously collected from six WIHS participants with documented HIV seroconversion (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). WIHS is a multi-center cohort of women living with (or without) HIV in the United States (Adimora et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Bacon et al. \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Barkan et al. \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e1998\u003c/span\u003e), that has now merged into the MACS/WIHS Combined Cohort Study (MWCCS) (D'Souza et al. \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). 
Each participant's longitudinal dataset comprised plasma HIV RNA \u003cem\u003eenv-gp120\u003c/em\u003e sequences collected between seroconversion and ART initiation, along with \u003cem\u003eenv-gp120\u003c/em\u003e proviral sequences sampled during ART (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) (\u003cb\u003eTable\u0026nbsp;1\u003c/b\u003e). All sequences were collected by single-genome amplification, where those with nucleotide mixtures, defects (\u003cem\u003ee.g.\u003c/em\u003e, deletions causing frameshifts) or evidence of within-host recombination (identified using RDP4 v4.1 (Martin et al. \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2015\u003c/span\u003e)) were excluded (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Sequences that were 100% identical in \u003cem\u003eenv-gp120\u003c/em\u003e were collapsed to a single representative sequence prior to phylogenetic inference. Within-host datasets comprised a median 242 (IQR 119\u0026ndash;337) distinct sequences per participant.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eEthics statement\u003c/h2\u003e \u003cp\u003e Institutional review boards at each WIHS clinical research site approved the study protocol. All participants provided written informed consent. 
This nested sub-study was additionally approved by the institutional review boards at Providence Health Care/University of British Columbia, and Simon Fraser University.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eIdentification of hypermutated sequences and sequence alignment modification\u003c/h2\u003e \u003cp\u003eHypermutated HIV sequences were identified using Hypermut 2.0, available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html\u003c/span\u003e\u003cspan address=\"https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (Rose and Korber \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2000\u003c/span\u003e). This program takes a nucleotide alignment as input, where the first sequence is used as a reference to which all others are compared. As recommended for within-host datasets (Rose and Korber \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2000\u003c/span\u003e), we chose the most frequently-observed \u003cem\u003eenv-gp120\u003c/em\u003e sequence from the first plasma HIV RNA sampling timepoint as the reference wherever possible. Hypermut defines APOBEC3 target sites as \u003cspan type=\"BoldUnderline\" class=\"BoldUnderline\" name=\"Emphasis\"\u003eG\u003c/span\u003eRD; that is, a \u003cspan type=\"BoldUnderline\" class=\"BoldUnderline\" name=\"Emphasis\"\u003eG\u003c/span\u003e followed by either A or G (denoted by the IUPAC code R (Cornish-Bowden \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e1985\u003c/span\u003e)), then followed by A, G or T (denoted by D), where the bold and underlined \u003cspan type=\"BoldUnderline\" class=\"BoldUnderline\" name=\"Emphasis\"\u003eG\u003c/span\u003e is the APOBEC3 target site. 
Non-APOBEC3 target sites are defined as \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eG\u003c/span\u003eY (where Y denotes C or T), or \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eG\u003c/span\u003eRC. Hypermut identifies all target and non-target sites within each sequence, and categorizes each as mutated (\u003cem\u003ei.e.\u003c/em\u003e, harboring an A) or not (\u003cem\u003ei.e.\u003c/em\u003e, harboring a C, G or T). The program then compares the proportion of mutated target and non-target sites in each sequence using Fisher's exact test. Sequences enriched in G-to-A mutations at target sites with p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 are identified as hypermutated.\u003c/p\u003e \u003cp\u003eWe then prepared five within-host \u003cem\u003eenv-gp120\u003c/em\u003e sequence alignments for each participant, where the first two were controls and the last three used different strategies to remove hypermutation. Sequence alignments were performed in a codon-aware manner using MAFFT v7.471 (Katoh and Standley \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2013\u003c/span\u003e) and manually inspected in AliView v1.26 (Larsson \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). The first alignment contained all pre-ART \u003cem\u003eenv-gp120\u003c/em\u003e plasma HIV RNA sequences plus only the \u003cem\u003eenv-\u003c/em\u003eintact proviruses sampled during ART (\u003cem\u003ei.e.\u003c/em\u003e, hypermutated proviruses were excluded, as is the current practice in the field (Brooks et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Jones et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Jones et al. \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Kinloch et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)). 
We called this the \"\u003cem\u003eenv-\u003c/em\u003eintact only\" alignment, where the resulting phylogeny was used as the benchmark for provirus molecular dating. The second alignment contained all pre-ART plasma HIV RNA sequences plus all (\u003cem\u003ei.e.\u003c/em\u003e, both \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated) proviruses sampled during ART, where the phylogeny inferred from this \"HM-Unaltered\" alignment served to illustrate the skewed topologies of resulting trees. The next three alignments were modifications of this second one, in which we tested different strategies to remove hypermutation and thereby normalize topology. The first strategy, HM-Stripped, removed all nucleotide positions that harbored an A at an APOBEC3 target site in at least one hypermutated sequence, yielding a shorter overall alignment. The second strategy, HM-Replacedw/R, individually replaced all A bases at APOBEC3 target sites within hypermutated sequences with R. The third strategy, HM-Replacedw/G, individually replaced all A bases at APOBEC3 target sites within hypermutated sequences with G. Both of the replacement strategies preserved the alignment length. Here, replacing with G assumes that all A bases at target sites are the result of APOBEC3 effects, whereas replacing with R recognizes the possibility that some may be legitimate A bases that are not attributable to APOBEC3 effects. Visualizations of the HM-Unaltered, HM-Stripped and HM-Replacedw/R alignments are provided in \u003cb\u003eSupplementary Fig.\u0026nbsp;1\u003c/b\u003e. Phylogenies inferred from these alignments were evaluated as described below.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eWithin-host phylogenetic inference, rooting and tree metrics\u003c/h2\u003e \u003cp\u003eMaximum likelihood phylogenies were inferred from sequence alignments following automated model selection using the Akaike information criterion (AIC) in IQ-TREE 2. 
Best-fit models are reported in \u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e. Branch support values were derived using the ultrafast bootstrap option (1,000 bootstraps) (Hoang et al. \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Minh et al. \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Phylogenies were visualized using the R package \u003cem\u003eggtree\u003c/em\u003e (Yu \u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e2020\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eMost of our downstream analyses required rooting the tree at the inferred most recent common ancestor (MRCA) of the dataset. As previously described, we used a modified root-to-tip regression approach where we explored all positions in the tree to identify the location that maximized the (Pearson's) correlation between the root-to-tip distances of \u003cem\u003eall plasma HIV RNA sequences collected prior to ART initiation\u003c/em\u003e, and their sampling dates (Jones et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). This location was set as the tree root, which represents the estimated transmitted/founder virus, or a close descendant thereof, in these datasets.\u003c/p\u003e \u003cp\u003eTo evaluate the extent to which the three alignment modification strategies normalized the position of hypermutated proviruses in the tree, we compared \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses with respect to various tree-based metrics, explained in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. We quantified \u003cem\u003eterminal branch length\u003c/em\u003e (TBL), which is the length of the branch connecting each sequence to the tree, in estimated substitutions per nucleotide site (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB\u003cb\u003e)\u003c/b\u003e. 
We computed \u003cem\u003eroot-to-tip distance (RTT)\u003c/em\u003e, which is the total distance between each tip and the tree root (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC). We computed two measures of evolutionary distinctiveness: \u003cem\u003eFair Proportion Evolutionary Distinctiveness\u003c/em\u003e (FP-ED) and \u003cem\u003eEqual Splits Evolutionary Distinctiveness\u003c/em\u003e (ES-ED), both of which distribute the root-to-tip distances in a tree among the descendant sequences at the tips (Pavoine 2017). FP-ED does this by dividing the shared evolutionary history represented by an internal branch equally among all its descendant tips, regardless of branching order (Isaac et al. \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Redding et al. \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD), whereas ES-ED assigns a longer portion of shared internal branches to immediate descendants (Redding and Mooers \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2006\u003c/span\u003e) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eE). FP-ED and ES-ED were computed using a custom R script with package \u003cem\u003epicante\u003c/em\u003e (v1.8.2) (Kembel et al. \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2010\u003c/span\u003e). We computed each proviral sequence's median \u003cem\u003etopological distance (TD)\u003c/em\u003e from all other sequences of the same type (\u003cem\u003ei.e., env-\u003c/em\u003eintact or hypermutated), where distance was defined as the \u003cem\u003enumber of nodes\u003c/em\u003e separating each pair (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eF). 
Finally, we used the Slatkin-Maddison (SM) test (Slatkin and Maddison \u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e1989\u003c/span\u003e), implemented using the R package \u003cem\u003eslatkin.maddison\u003c/em\u003e (v0.1.0; \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/prmac/slatkin.maddison\u003c/span\u003e\u003cspan address=\"https://github.com/prmac/slatkin.maddison\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) to assess the extent to which \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated sequences displayed population structure in the tree. This test determines the minimum number of migrations between groups to explain the distribution of groups at the tree tips: the smaller the number, the stronger the support for population structure. Statistical support is based on the number of migrations that would be expected in a randomly-structured population, simulated by permuting group labels between tips. Note that Slatkin-Maddison returns an estimated p-value, where a value of 0 can be interpreted as p\u0026thinsp;\u0026lt;\u0026thinsp;0.001, as 1,000 permutations were performed.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eWithin-host phylogenetic inference and proviral dating\u003c/h2\u003e \u003cp\u003eWe inferred the integration dates of \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses persisting during ART using a published phylogenetic approach (Jones et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Using the rooted trees, we fit a linear model relating the root-to-tip distances of \u003cem\u003epre-ART plasma HIV sequences\u003c/em\u003e to their collection dates. 
The slope of this line represents the average within-host \u003cem\u003eenv-gp120\u003c/em\u003e evolutionary rate during untreated HIV infection, and the x-intercept represents the inferred root date. Model quality was assessed by comparing the model's AIC to that of a null model with zero slope. To pass quality control (QC), the linear model needed to have an AIC value at least 10 units lower than the null model (ΔAIC\u0026thinsp;\u0026ge;\u0026thinsp;10), and a root date prior to the first plasma sampling. All phylogenies met these criteria (\u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e). We then used the linear model to convert proviral root-to-tip distances to their integration dates. The custom R script for this method is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/cfe-lab/phylodating\u003c/span\u003e\u003cspan address=\"https://github.com/cfe-lab/phylodating\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eSpearman\u0026rsquo;s correlation (ρ) and Lin\u0026rsquo;s concordance correlation coefficient (ρc) were calculated in R. All other statistical analyses were performed in Prism, v10.0.2 (GraphPad Software). 
A threshold of p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was used to denote statistical significance.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eWithin-host HIV sequence datasets\u003c/h2\u003e \u003cp\u003eWe analyzed 1,408 single-genome-amplified HIV \u003cem\u003eenv-gp120\u003c/em\u003e sequences collected longitudinally from six WIHS participants who experienced HIV seroconversion (a seventh participant from the original study was not included here, as no hypermutated proviruses were isolated from their samples) (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) (\u003cb\u003eTable\u0026nbsp;1\u003c/b\u003e). The data included 866 distinct HIV RNA \u003cem\u003eenv-gp120\u003c/em\u003e sequences (median 157 per participant) isolated from plasma over a median of 9 time points spanning a median of 7 years between seroconversion and ART initiation. The data also included 542 distinct \u003cem\u003eenv-gp120\u003c/em\u003e proviral sequences, including 449 \u003cem\u003eenv-\u003c/em\u003eintact ones (median 62 per participant) and 93 hypermutated ones (median 19 per participant) isolated from peripheral blood at a minimum of 3 time points over a median of 8.7 years during ART (\u003cb\u003eTable\u0026nbsp;1\u003c/b\u003e). All participants had HIV subtype B, with no evidence of dual or super-infection.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eIdentifying sites of hypermutation\u003c/h2\u003e \u003cp\u003eBetween 7 and 42% of participants' proviral sequences were hypermutated (though hypermutation was not observed in plasma HIV RNA sequences, as expected). In a given within-host alignment, between 9 and 11% of \u003cem\u003eenv-gp120\u003c/em\u003e nucleotide positions had a putative APOBEC3-driven A in at least one sequence (\u003cb\u003eTable\u0026nbsp;2\u003c/b\u003e). 
Hypermutated proviruses harbored a grand median of 45 putative APOBEC3 mutations (representing 31% of all possible target sites, and 3% of all \u003cem\u003eenv-gp120\u003c/em\u003e nucleotides), but the overall range was 9 to 83 putative APOBEC3 mutations per \u003cem\u003eenv-gp120\u003c/em\u003e sequence (representing 6\u0026ndash;61% of all possible target sites, and 0.6\u0026ndash;5% of all \u003cem\u003eenv-gp120\u003c/em\u003e nucleotides). For context, the grand median of putative APOBEC3 mutations in \u003cem\u003eenv-\u003c/em\u003eintact (non-hypermutated) proviruses was 5.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eAssessing how alignment modification strategies normalized tree topology and metrics\u003c/h2\u003e \u003cp\u003eWe next investigated how well our sequence alignment modification strategies helped normalize tree topologies, beginning with participant WIHS-P2 as an example.\u003c/p\u003e \u003cp\u003eParticipant WIHS-P2's dataset included 227 plasma HIV RNA \u003cem\u003eenv-gp120\u003c/em\u003e sequences sampled over 9 years during untreated infection, and 75 proviruses (53 \u003cem\u003eenv-\u003c/em\u003eintact, 22 hypermutated) sampled over ~\u0026thinsp;7 years during ART (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). WIHS-P2\u0026rsquo;s unmodified nucleotide alignment yielded a phylogeny that placed nearly all hypermutated proviruses into a single clade with high (\u0026ge;\u0026thinsp;90%) bootstrap support (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB; a larger tree with branch support values is shown in \u003cb\u003eSupplementary Fig.\u0026nbsp;2\u003c/b\u003e). 
Hypermutated provirus terminal branch lengths in this tree were on average four times longer than \u003cem\u003eenv\u003c/em\u003e-intact ones (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC), though their root-to-tip distances were not significantly inflated (p\u0026thinsp;=\u0026thinsp;0.2, \u003cb\u003eSupplementary Fig.\u0026nbsp;3A\u003c/b\u003e). Hypermutated proviruses also exhibited significantly higher evolutionary distinctiveness (ED) than \u003cem\u003eenv\u003c/em\u003e-intact ones in this tree (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001 for both fair proportion and equal splits ED; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eD and \u003cb\u003eSupplementary Fig.\u0026nbsp;4A\u003c/b\u003e). Also reflecting the erroneous clustering of hypermutated sequences in this tree, the median number of nodes separating hypermutated sequences from one another (\u003cem\u003ei.e.\u003c/em\u003e, topological distance) was on average only half of that separating \u003cem\u003eenv\u003c/em\u003e-intact proviruses (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eE). 
A Slatkin-Maddison test also returned significant evidence of genetic population structure (\u003cem\u003ei.e.\u003c/em\u003e, \"compartmentalization\") between hypermutated and \u003cem\u003eenv\u003c/em\u003e-intact proviruses in this tree (three inferred migrations; estimated p\u0026thinsp;=\u0026thinsp;0; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB \u003cb\u003einset\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBy contrast, the tree inferred from WIHS-P2's HM-Stripped alignment, in which 140 (of 1515) \u003cem\u003eenv-gp120\u003c/em\u003e positions harboring putative APOBEC3 mutations had been removed, exhibited a substantially normalized topology (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eF). The same was true for the tree inferred from the HM-Replacedw/R alignment, where a median of 43 putative APOBEC3-driven A bases in hypermutated sequences had been replaced with R (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eG; larger trees in \u003cb\u003eSupplementary Fig.\u0026nbsp;2\u003c/b\u003e). 
In both trees, hypermutated proviruses were now comparable to \u003cem\u003eenv\u003c/em\u003e-intact ones in terms of terminal branch lengths (both p\u0026thinsp;\u0026gt;\u0026thinsp;0.1; Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eH and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eI), evolutionary distinctiveness (all p\u0026thinsp;\u0026gt;\u0026thinsp;0.1; Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eJ and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eK; \u003cb\u003eSupplementary Figs.\u0026nbsp;4B and 4C\u003c/b\u003e) and topological distance (both p\u0026thinsp;\u0026gt;\u0026thinsp;0.1, Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eL and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eM). Genetic compartmentalization between \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses was also markedly reduced (15 inferred migrations compared to the original 3), though the p-values remained marginally significant (both p\u0026thinsp;\u0026le;\u0026thinsp;0.01; Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eF and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eG, \u003cb\u003einsets\u003c/b\u003e). Of note, root-to-tip distances of hypermutated proviruses in these two trees were now shorter than those of \u003cem\u003eenv-\u003c/em\u003eintact ones (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001; \u003cb\u003eSupplementary Figs.\u0026nbsp;3B and 3C\u003c/b\u003e). 
In contrast, while the tree inferred from participant WIHS-P2's HM-Replacedw/G alignment (where putative APOBEC3-driven A bases in hypermutated sequences were replaced with G) appeared broadly normalized, \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated sequences remained highly significantly compartmentalized in this tree (estimated p\u0026thinsp;=\u0026thinsp;0; \u003cb\u003eSupplementary Fig.\u0026nbsp;5\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eAs our second example, participant WIHS-P4's dataset included 182 plasma HIV RNA \u003cem\u003eenv-gp120\u003c/em\u003e sequences sampled over ~\u0026thinsp;11 years pre-ART, and 155 proviruses (132 \u003cem\u003eenv-\u003c/em\u003eintact; 23 hypermutated) sampled during 12 years of ART (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The unaltered alignment produced a phylogeny (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB; larger tree in \u003cb\u003eSupplementary Fig.\u0026nbsp;6\u003c/b\u003e) where hypermutated sequences exhibited significantly inflated branch lengths, root-to-tip distances and evolutionary distinctiveness (all p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; Figs.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC and \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD, \u003cb\u003eSupplementary Figs.\u0026nbsp;3D and 4D\u003c/b\u003e), erroneous clustering (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE) and significant compartmentalization (estimated p\u0026thinsp;=\u0026thinsp;0; Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB \u003cb\u003einset\u003c/b\u003e). 
By contrast, the HM-Stripped and HM-Replacedw/R alignments produced substantially normalized trees (Figs.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF and \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eG respectively; larger trees in \u003cb\u003eSupplementary Fig.\u0026nbsp;6\u003c/b\u003e) with no genetic compartmentalization between \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated sequences (22 migrations compared to the original 6; both p\u0026thinsp;\u0026gt;\u0026thinsp;0.1; Figs.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF and \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eG \u003cb\u003einsets\u003c/b\u003e). The ranges of terminal branch lengths, root-to-tip distance measurements, evolutionary distinctiveness measures and topological distances were now also comparable between \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses, though the latter remained modestly yet statistically significantly different from \u003cem\u003eenv-\u003c/em\u003eintact sequences by most measures (p-values from 0.001 to 0.039, Figs.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eH to \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eM; \u003cb\u003eSupplementary Figs.\u0026nbsp;3E and 3F; Supplementary Figs.\u0026nbsp;4E and 4F\u003c/b\u003e). 
In contrast, hypermutated sequences remained highly compartmentalized in the phylogeny inferred from WIHS-P4's HM-Replacedw/G alignment (\u003cb\u003eSupplementary Fig.\u0026nbsp;7\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe same analyses were applied to participants WIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6 (small trees and select metrics in \u003cb\u003eSupplementary Figs.\u0026nbsp;8\u0026ndash;11\u003c/b\u003e; large trees in \u003cb\u003eSupplementary Figs.\u0026nbsp;12\u0026ndash;15;\u003c/b\u003e remaining metrics in \u003cb\u003eSupplementary Figs.\u0026nbsp;3 and 4\u003c/b\u003e). Broadly, the trees inferred from the HM-Stripped and HM-Replacedw/R alignments were markedly normalized and yielded metric values for \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses that spanned comparable ranges. For some participants, these metrics normalized such that \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated viruses became statistically comparable (\u003cem\u003ee.g.\u003c/em\u003e, WIHS-P5; \u003cb\u003eSupplementary Fig.\u0026nbsp;10\u003c/b\u003e). For others, hypermutated sequences remained somewhat distinctive (\u003cem\u003ee.g.\u003c/em\u003e, hypermutated provirus terminal branch lengths and evolutionary distinctiveness remained slightly elevated for WIHS-P6; \u003cb\u003eSupplementary Figs.\u0026nbsp;4 and 11\u003c/b\u003e), but in all cases these differences were far smaller in magnitude than those from the trees inferred from unaltered alignments. 
Indeed, the p-values derived from comparing \u003cem\u003eenv-\u003c/em\u003eintact and hypermutated proviruses in the HM-Stripped and HM-Replacedw/R trees were on average\u0026thinsp;\u0026gt;\u0026thinsp;3 logs higher than those from the HM-Unaltered trees, with 56% of comparisons yielding p-values\u0026thinsp;\u0026gt;\u0026thinsp;0.05 (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBy contrast, the HM-Replacedw/G approach did not reliably normalize the trees. In particular, WIHS-P5's HM-Replacedw/G phylogeny maintained obvious clustering of hypermutated sequences and very strong compartmentalization, while terminal branch lengths, fair proportion evolutionary distinctiveness, and topological distance also remained highly skewed for one or more participants (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, and data not shown). As such, only the HM-Stripped and HM-Replacedw/R trees were advanced to further evaluation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eInferring proviral integration dates from corrected trees: a validation\u003c/h2\u003e \u003cp\u003eWe next investigated whether accurate evolutionary information can be extracted from these corrected trees, by phylogenetically inferring the integration dates of proviruses sampled during ART. Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e illustrates how this is done. 
Briefly, we first root the phylogeny at the location that maximizes the correlation between the root-to-tip distances of \u003cem\u003ethe pre-ART plasma HIV RNA sequences\u003c/em\u003e and their sampling dates (proviruses sampled during ART, though included in the tree, are \u003cem\u003enot\u003c/em\u003e considered in this correlation; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB). This root represents the MRCA of the dataset (\u003cem\u003ei.e.\u003c/em\u003e, the estimated founder virus). We then fit a linear model relating the root-to-tip genetic distances of the pre-ART plasma sequences to their sampling dates (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC). This model is then used to convert the root-to-tip distance of each on-ART provirus to its inferred integration date (plus 95% confidence interval; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eApplication of this approach to WIHS-P2's unaltered and corrected trees yielded estimated root dates that were consistent with the clinically-estimated infection date (\u003cb\u003eTable\u0026nbsp;1\u003c/b\u003e) and comparable to the root date inferred from the benchmark (\u003cem\u003eenv-\u003c/em\u003eintact only) tree (\u003cb\u003eSupplementary Table\u0026nbsp;1;\u003c/b\u003e the likely reason that the unaltered tree produced reasonable root dates and evolutionary rate estimates is that these metrics are computed from pre-ART plasma HIV RNA sequences only). We next verified the extent to which the integration dates of \u003cem\u003eenv-\u003c/em\u003eintact proviruses inferred from the corrected trees matched those inferred from the benchmark tree (which, as per current field standards, excluded hypermutated sequences entirely). 
Reassuringly, \u003cem\u003eenv-\u003c/em\u003eintact proviral integration dates inferred from the HM-Stripped tree were highly concordant with those inferred from the benchmark tree (Spearman\u0026rsquo;s rho [ρ]\u0026thinsp;=\u0026thinsp;0.95, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; Lin\u0026rsquo;s concordance correlation coefficient [ρc]\u0026thinsp;=\u0026thinsp;0.96), as were those inferred from the HM-Replacedw/R tree (ρ\u0026thinsp;=\u0026thinsp;0.98, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.0001; ρc\u0026thinsp;=\u0026thinsp;0.97) (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA). These results indicate that WIHS-P2's corrected trees can be used for molecular dating, and produce valid proviral integration dates.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe next inferred the integration dates of all proviruses from the corrected trees, including the hypermutated ones. Inferred integration dates were highly concordant between the two approaches, yielding ρc between 0.93 and 0.97 depending on whether we compared \u003cem\u003eenv-\u003c/em\u003eintact, hypermutated or all proviruses (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB). Moreover, there was no bias between the two methods (p\u0026thinsp;=\u0026thinsp;0.65) (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC). Thus, for participant WIHS-P2, both methods recovered proviral ages equally well. By contrast, the phylogeny inferred from the unaltered alignment produced hypermutated provirus integration dates that were poorly concordant with those from the corrected trees (HM-Stripped ρc\u0026thinsp;=\u0026thinsp;0.46; HM-Replacedw/R ρc\u0026thinsp;=\u0026thinsp;0.45; Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eD). 
This illustrates the pitfalls of inferring evolutionary information from the former tree type.\u003c/p\u003e \u003cp\u003eWe obtained similar results for WIHS-P4. Again, the integration dates of \u003cem\u003eenv\u003c/em\u003e-intact proviruses inferred from both corrected trees were highly concordant with those inferred from the benchmark tree (both ρc\u0026thinsp;=\u0026thinsp;0.98; Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eA), indicating that the corrected trees are appropriate for molecular dating. Moreover, proviral integration dates inferred from the corrected trees were highly concordant with one another (ρc 0.97 to 0.98) (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eB), and showed no bias between methods (p\u0026thinsp;=\u0026thinsp;0.25) (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eC). By contrast, the phylogeny inferred from the unaltered alignment produced hypermutated provirus integration dates that were highly discordant with those inferred from the corrected trees (both ρc\u0026thinsp;=\u0026thinsp;0.08; Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eD), again illustrating the pitfalls of inferring evolutionary information from the former tree type.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6's corrected trees similarly produced \u003cem\u003eenv\u003c/em\u003e-intact proviral integration dates that were strongly concordant with those inferred from their benchmark trees (ρc: 0.81 to 0.93), and proviral integration dates that were generally highly concordant with one another, with no bias between methods (\u003cb\u003eSupplementary Figs.\u0026nbsp;16\u0026ndash;19\u003c/b\u003e). 
Again, the phylogenies inferred from their unaltered alignments produced hypermutated provirus integration dates that were generally poorly concordant with those inferred from the corrected trees.\u003c/p\u003e \u003cp\u003eTogether, these observations demonstrate that removing hypermutation from alignments is possible, and yields phylogenies that can be used to infer the integration dates of both hypermutated and \u003cem\u003eenv\u003c/em\u003e-intact proviruses.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eLongevity and dynamics of hypermutated proviruses persisting on ART\u003c/h2\u003e \u003cp\u003eHaving demonstrated that proviral integration dates can be inferred from the corrected trees, we compared the integration dates of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses persisting on ART. Again, we begin with participant WIHS-P2. Both of this participant's corrected trees indicated that the hypermutated proviruses, like the \u003cem\u003eenv\u003c/em\u003e-intact ones, spanned essentially the entire duration of untreated infection, with the earliest dating to early 2004, approximately one year after seroconversion (Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eA and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eB). On average, however, hypermutated proviruses were older than \u003cem\u003eenv\u003c/em\u003e-intact ones in this participant (both trees p\u0026thinsp;=\u0026thinsp;0.001; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eA and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eB). 
Longitudinal analysis further revealed that, while integration date distributions of \u003cem\u003eenv\u003c/em\u003e-intact proviruses remained stable during the first seven years of ART (both trees p\u0026thinsp;\u0026ge;\u0026thinsp;0.1; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eC and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eD), hypermutated proviruses gradually shifted towards earlier integration dates over time (both trees p\u0026thinsp;\u0026lt;\u0026thinsp;0.02; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eE and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eF), presumably because those with more recent integration dates were preferentially eliminated during long-term ART.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWIHS-P4's proviruses also spanned essentially the entire duration of untreated infection (Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eG and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eH\u003cb\u003e)\u003c/b\u003e. In contrast to WIHS-P2 however, the integration dates of their hypermutated proviruses were on average more recent than their \u003cem\u003eenv\u003c/em\u003e-intact ones (both trees p\u0026thinsp;\u0026le;\u0026thinsp;0.02; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eG and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eH). As previously reported (Shahid et al. 
\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), WIHS-P4's \u003cem\u003eenv\u003c/em\u003e-intact proviruses gradually shifted towards earlier integration dates over time on ART (both trees p\u0026thinsp;\u0026le;\u0026thinsp;0.003; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eI and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eJ), likely because those with more recent integration dates decayed more rapidly following ART initiation. In contrast, hypermutated provirus integration date distributions remained stable during ART (both trees p\u0026thinsp;\u0026gt;\u0026thinsp;0.1; Figs.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eK and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eL).\u003c/p\u003e \u003cp\u003eWIHS-P1, WIHS-P3, WIHS-P5, and WIHS-P6's hypermutated proviruses also spanned broad age ranges, but in contrast to WIHS-P2 and WIHS-P4, they did not differ from \u003cem\u003eenv\u003c/em\u003e-intact ones in terms of their overall integration date distributions (\u003cb\u003eSupplementary Figs.\u0026nbsp;20 and 21\u003c/b\u003e). As reported previously, their \u003cem\u003eenv-\u003c/em\u003eintact proviral integration date distributions remained stable except for participant WIHS-P5 in whom the proviral pool shifted slightly towards later integration dates over time (\u003cb\u003eSupplementary Figs.\u0026nbsp;21C and 21D\u003c/b\u003e) (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Hypermutated proviral integration date distributions were also stable over time except in WIHS-P1, whose proviral date distributions differed markedly by visit (\u003cb\u003eSupplementary Figs.\u0026nbsp;20E and 20F\u003c/b\u003e). Though this could suggest dynamic changes over time, limited sampling must be acknowledged. 
Notably, the HM-Stripped and HM-Replacedw/R approaches produced comparable results except in the temporal analysis of \u003cem\u003eenv\u003c/em\u003e-intact proviruses for WIHS-P3, where HM-Stripped suggested a modest shift towards more recent integration dates over time, whereas HM-Replacedw/R indicated no change (\u003cb\u003eSupplementary Figs.\u0026nbsp;20I and 20J\u003c/b\u003e).\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThough hypermutated proviruses persist in all people living with HIV (PLWH) (Bruner et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Ho et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Kinloch et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), we know relatively little about their within-host origins because they cannot be readily incorporated into phylogenies. We explored three simple approaches to remove hypermutation from nucleotide alignments, with the dual goals of 1) inferring phylogenies that accurately reconstruct the within-host evolutionary histories of hypermutated sequences and 2) applying molecular dating approaches to these trees to gain insights into the within-host origins and longevity of hypermutated proviruses.\u003c/p\u003e \u003cp\u003eOf the approaches we evaluated, stripping nucleotide positions containing putative APOBEC3 mutations from the alignment, or replacing individual APOBEC3 mutations in hypermutated sequences with R, consistently normalized tree topologies and metrics. By contrast, replacing APOBEC3 mutations in hypermutated sequences with G failed to consistently resolve their erroneous clustering in the tree. 
We speculate that this is because G replacement is an overcorrection, as not all A bases at target sites are necessarily the result of APOBEC3 activity: the HIV genome is naturally rich in A bases (Kypr and Mrazek \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e1987\u003c/span\u003e; Kypr et al. \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e1989\u003c/span\u003e). Across-the-board G replacement therefore likely obscures some legitimate ancestral information (\u003cem\u003ei.e.\u003c/em\u003e, inherited A bases), leaving these sequences at continued risk of long-branch attraction. By contrast, replacing putative APOBEC3 mutations with R mitigates this risk by acknowledging this ambiguity. We therefore advise against replacement of APOBEC3 mutations in hypermutated sequences with G.\u003c/p\u003e \u003cp\u003eWe further showed that the integration dates of \u003cem\u003eenv\u003c/em\u003e-intact proviruses inferred from the HM-Stripped and HM-Replacedw/R approaches were highly concordant with those inferred from benchmark trees that excluded hypermutated sequences entirely, as is the current practice. The demonstration that these corrected trees yield valid molecular dating results is important because it provides, for the first time, an approach to study the within-host evolutionary origins and longevity of the large and genetically diverse population of hypermutated proviruses that persist in all PLWH during ART.\u003c/p\u003e \u003cp\u003eProviral integration date estimates produced by the two approaches were highly concordant, and there was no clear difference in their performance. 
While the p-values derived from comparing the tree-based metrics of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequences, shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, are overall slightly higher for the HM-Replacedw/R approach than for the HM-Stripped approach, we caution against interpreting this to mean that the former is superior. Though we applied statistical tests to guide interpretation, the main goal was to produce tree metric values for hypermutated and \u003cem\u003eenv\u003c/em\u003e-intact sequences that were in the same range as one another. Both HM-Stripped and HM-Replacedw/R approaches achieved this. We did not necessarily expect that \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequence metrics would all normalize completely (\u003cem\u003ei.e.\u003c/em\u003e, produce non-significant p-values) because some evolutionary attributes of \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated sequences might plausibly differ. For example, because hypermutated sequences do not normally yield descendants, their closest neighbors in the tree might be more distant than those for \u003cem\u003eenv\u003c/em\u003e-intact proviruses, simply because of the lower likelihood of sampling a close relative (which, for a hypermutated sequence, could only be an ancestor). Differential evolutionary dynamics between hypermutated and \u003cem\u003eenv\u003c/em\u003e-intact proviruses could also produce differential root-to-tip measurements (and by extension integration date estimates) between groups, a phenomenon that was indeed observed in WIHS-P2 and WIHS-P4.\u003c/p\u003e \u003cp\u003eWe therefore offer the following considerations when choosing an approach. Since the HM-Replacedw/R approach retains the full alignment, it should also preserve more phylogenetic signal than the HM-Stripped approach, where an average of 9% of each \u003cem\u003eenv-gp120\u003c/em\u003e alignment was removed. 
This could be advantageous for HIV regions that are relatively conserved, yet hotspots for APOBEC3 mutation, for example parts of \u003cem\u003epol\u003c/em\u003e (Kieffer et al. \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Kijak et al. \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2008\u003c/span\u003e). Before implementing the HM-Replacedw/R approach, however, it is essential to verify that the chosen phylogenetic inference package supports ambiguous characters. IQ-TREE 2, used in the present study, assigns equal likelihood to each component character (Minh et al. \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), but other packages, such as the approximate maximum likelihood algorithm FastTree, treat all non-ACTG characters as missing data (Price et al. \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2010\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIt is also important to recognize when sequence alignment modifications are warranted. For routine phylogenetic visualization of HIV datasets, hypermutated sequences can be incorporated directly. Such trees might even be adequate for some limited tree-based inferences, as suggested by our finding that uncorrected trees produced reasonable root dates and evolutionary rates, likely because these calculations only use information from pre-ART plasma HIV RNA sequences. Nevertheless, our demonstration that uncorrected trees erroneously reconstructed the ancestry of hypermutated proviruses, and produced inaccurate (and often nonsensical) integration dates for them, underscores why they cannot be used to answer questions about the evolutionary history of hypermutated proviruses. For such questions, the above alignment modification approaches should be used.\u003c/p\u003e \u003cp\u003eOur results also reveal insights into hypermutated provirus evolutionary dynamics. 
Like \u003cem\u003eenv\u003c/em\u003e-intact ones, hypermutated proviruses spanned a broad age range. From WIHS-P2, for example, we isolated hypermutated proviruses that had integrated as early as a year following seroconversion. This indicates that hypermutated proviruses, like other provirus types, begin to be seeded into the proviral pool essentially immediately following transmission, and can persist for decades thereafter. Our results also revealed evidence of differential evolutionary dynamics of hypermutated and \u003cem\u003eenv\u003c/em\u003e-intact proviruses in two of the six participants studied, namely WIHS-P2, whose hypermutated proviruses were on average older than \u003cem\u003eenv\u003c/em\u003e-intact ones, and WIHS-P4, in whom the opposite was observed. This suggests that the decay rates of different types of proviruses can be heterogeneous within a given host, as well as heterogeneous between hosts.\u003c/p\u003e \u003cp\u003eOur study has some limitations. We analyzed the present dataset (Shahid et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) because it is among the most comprehensive of its type (in terms of number of sequences, follow-up time and sampling near seroconversion) and because \u003cem\u003eenv-gp120\u003c/em\u003e is commonly used for within-host HIV evolutionary studies (Brooks et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Dapp et al. \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). That said, participants WIHS-P3 and WIHS-P6 had only modest numbers of hypermutated proviruses, which limited our power to detect differences between these and \u003cem\u003eenv\u003c/em\u003e-intact proviruses in their data. Furthermore, while our proposed method should be applicable to any HIV gene region, we did not explicitly investigate this. 
The identification of hypermutated sequences, on which our method depends, is by definition imperfect, as it relies on a statistical cut-off and can be subtly influenced by the choice of reference sequence, particularly if a heterologous sequence (\u003cem\u003ee.g.\u003c/em\u003e, the HXB2 HIV reference strain) is used for this purpose (Rose and Korber \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2000\u003c/span\u003e). As recommended, we used the most frequent sequence observed post-seroconversion as the reference (Rose and Korber \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2000\u003c/span\u003e), though we verified that use of a different sequence impacted the identification of hypermutated sequences minimally or not at all (\u003cem\u003ee.g.\u003c/em\u003e, using an arbitrarily chosen reference sequence from WIHS-P2's earliest sampling time point yielded 137 (out of 1515) nucleotide positions with putative APOBEC3 mutations, versus the original 140). Finally, we cannot assume that intact \u003cem\u003eenv-gp120\u003c/em\u003e sequences come from fully intact HIV genomes. As such, the comparison group for hypermutated sequences in the present study is not the replication-competent HIV reservoir, but rather the pool of proviruses with intact \u003cem\u003eenv-gp120\u003c/em\u003e sequences, many of which will have defects elsewhere.\u003c/p\u003e \u003cp\u003eIn summary, the current practice of excluding hypermutated proviruses from phylogenies used for hypothesis testing has been a major barrier to understanding the \u003cem\u003ein vivo\u003c/em\u003e evolutionary origins and longevity of these sequences. Here, we validated two simple nucleotide alignment modification approaches that, for the first time, allow hypermutated sequences to be correctly incorporated into phylogenies that can be used for molecular dating. 
Overall, our observations reveal that hypermutated proviruses, like other provirus types, are archived throughout untreated infection and can persist for years on ART. Our observations further suggest that the evolutionary dynamics of hypermutated proviruses may differ from those of other proviral types in some individuals. In addition to enriching our understanding of HIV persistence towards the ultimate goal of HIV cure, the approaches developed here could be extended to between-host phylogenies, and testing of other hypotheses related to within-host evolutionary origins of hypermutated sequences.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eData availability\u003c/h2\u003e \u003cp\u003eThe nucleotide sequences reported in this paper are available in GenBank (proviral DNA accession numbers: OR404056 - OR404777, OR404820 - OR404981; HIV RNA accession numbers: OR403057 - OR403738).\u003c/p\u003e \u003c/div\u003e\u003cp\u003e \u003ch2\u003eConflict of interest\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no conflicts of interest.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAcknowledgments and funding\u003c/h2\u003e \u003cp\u003eWe thank Drs. Art F.Y. Poon and Natalie N. 
Kinloch for helpful discussions.\u003c/p\u003e\u003cp\u003eThe authors gratefully acknowledge the contributions of the study participants and dedication of the staff at the MWCCS sites.\u003c/p\u003e\u003cp\u003eData in this manuscript were collected by the Women\u0026rsquo;s Interagency HIV Study (WIHS), now the MACS/WIHS Combined Cohort Study (MWCCS).\u003c/p\u003e\u003cp\u003eThe contents of this publication are solely the responsibility of the authors and do not represent the official views of the National Institutes of Health (NIH) or other funders.\u003c/p\u003e\u003cp\u003eMWCCS (Principal Investigators): Atlanta CRS (Ighovwerha Ofotokun, Anandi Sheth, and Gina Wingood), U01-HL146241; Baltimore CRS (Todd Brown and Joseph Margolick), U01-HL146201; Bronx CRS (Kathryn Anastos, David Hanna, and Anjali Sharma), U01-HL146204; Brooklyn CRS (Deborah Gustafson and Tracey Wilson), U01-HL146202; Data Analysis and Coordination Center (Gypsyamber D\u0026rsquo;Souza, Stephen Gange and Elizabeth Topper), U01-HL146193; Chicago-Cook County CRS (Mardge Cohen, Audrey French, and Ryan Ross), U01-HL146245; Chicago-Northwestern CRS (Steven Wolinsky, Frank Palella, and Valentina Stosor), U01-HL146240; Northern California CRS (Bradley Aouizerat, Jennifer Price, and Phyllis Tien), U01-HL146242; Los Angeles CRS (Roger Detels and Matthew Mimiaga), U01-HL146333; Metropolitan Washington CRS (Seble Kassaye and Daniel Merenstein), U01-HL146205; Miami CRS (Maria Alcaide, Margaret Fischl, and Deborah Jones), U01-HL146203; Pittsburgh CRS (Jeremy Martinson and Charles Rinaldo), U01-HL146208; UAB-MS CRS (Mirjam-Colette Kempf, James B. Brock, and Deborah Konkle-Parker), U01-HL146192; UNC CRS (M. Bradley Drummond and Michelle Floris-Moore), U01-HL146194. 
The MWCCS is funded primarily by the National Heart, Lung, and Blood Institute (NHLBI), with additional co-funding from the \u003cem\u003eEunice Kennedy Shriver\u003c/em\u003e National Institute of Child Health and Human Development (NICHD), National Institute on Aging (NIA), National Institute of Dental and Craniofacial Research (NIDCR), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Neurological Disorders and Stroke (NINDS), National Institute of Mental Health (NIMH), National Institute on Drug Abuse (NIDA), National Institute of Nursing Research (NINR), National Cancer Institute (NCI), National Institute on Alcohol Abuse and Alcoholism (NIAAA), National Institute on Deafness and Other Communication Disorders (NIDCD), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute on Minority Health and Health Disparities (NIMHD), and in coordination and alignment with the research priorities of the National Institutes of Health, Office of AIDS Research (OAR). MWCCS data collection is also supported by UL1-TR000004 (UCSF CTSA), UL1-TR003098 (JHU ICTR), UL1-TR001881 (UCLA CTSI), P30-AI-050409 (Atlanta CFAR), P30-AI-073961 (Miami CFAR), P30-AI-050410 (UNC CFAR), P30-AI-027767 (UAB CFAR), P30-MH-116867 (Miami CHARM), UL1-TR001409 (DC CTSA), KL2-TR001432 (DC CTSA), and TL1-TR001431 (DC CTSA).\u003c/p\u003e\u003cp\u003eIn addition, this work was supported by the Canadian Institutes of Health Research (CIHR) through a project grant (PJT-159625 to Z.L.B. and J.B.J.) and a focused team grant (HB1-164063 to Z.L.B.). This work was also supported by the Martin Delaney \"REACH\" Collaboratory (NIH grant 1-UM1AI164565-01 to Z.L.B.), which is supported by the following NIH co-funding Institutes: NIMH, NIDA, NINDS, NIDDK, NHLBI, and NIAID. This work was also supported by the Einstein-Rockefeller-CUNY Center for AIDS Research (NIH grant # P30AI124414 to H.G.). A.S. and B.R.J. 
were supported by CIHR Doctoral Research Awards. S.M. was supported by an FHS Undergraduate Student Research Award. M.C.D. was supported by a CIHR Canada Graduate Scholarship\u0026mdash;Master\u0026rsquo;s award. Z.L.B. was supported by a Scholar Award from Michael Smith Health Research BC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAdimora AA et al (2018) Cohort Profile: The Women's Interagency HIV Study (WIHS). Int J Epidemiol 47(2):393\u0026ndash;94i\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBacon MC et al (2005) The Women's Interagency HIV Study: an observational cohort brings clinical sciences to the bench. Clin Diagn Lab Immunol 12(9):1013\u0026ndash;1019\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarkan SE et al (1998) The Women's Interagency HIV Study. WIHS Collaborative Study Group. Epidemiology 9(2):117\u0026ndash;125\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBergsten J (2005) A review of long-branch attraction. Cladistics 21(2):163\u0026ndash;193\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBozzi G et al (2019) 'No evidence of ongoing HIV replication or compartmentalization in tissues during combination antiretroviral therapy: Implications for HIV eradication', \u003cem\u003eSci Adv\u003c/em\u003e, 5 (9), eaav2045\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrodin J et al (2016) 'Establishment and stability of the latent HIV-1 DNA reservoir'. Elife, 5\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrooks K et al (2020) 'HIV-1 variants are archived throughout infection and persist in the reservoir'. 
PLoS Pathog, 16 (6), e1008378\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBruner KM et al (2016) Defective proviruses rapidly accumulate during acute HIV-1 infection. Nat Med 22(9):1043\u0026ndash;1049\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCornish-Bowden A (1985) Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res 13(9):3021\u0026ndash;3030\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eD'Souza G et al (2021) Characteristics of the MACS/WIHS Combined Cohort Study: Opportunities for Research on Aging With HIV in the Longest US Observational Study of HIV. Am J Epidemiol 190(8):1457\u0026ndash;1475\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDapp MJ et al (2017) 'Patterns and rates of viral evolution in HIV-1 subtype B infected females and males'. PLoS ONE, 12 (10), e0182443\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFinzi D et al (1997) Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science 278(5341):1295\u0026ndash;1300\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFinzi D et al (1999) Latent infection of CD4\u0026thinsp;+\u0026thinsp;T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat Med 5(5):512\u0026ndash;517\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFitzgibbon JE, Mazar S, Dubin DT (1993) A new type of G\u0026ndash;\u0026gt;A hypermutation affecting human immunodeficiency virus. AIDS Res Hum Retroviruses 9(9):833\u0026ndash;838\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGantner P et al (2023) 'HIV rapidly targets a diverse pool of CD4(+) T cells to establish productive and latent infections'. 
Immunity, 56 (3), 653\u0026thinsp;\u0026ndash;\u0026thinsp;68 e5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoodenow M et al (1989) 'HIV-1 isolates are rapidly evolving quasispecies: evidence for viral mixtures and preferred nucleotide substitutions', \u003cem\u003eJ Acquir Immune Defic Syndr (1988)\u003c/em\u003e, 2 (4), 344\u0026thinsp;\u0026ndash;\u0026thinsp;52\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGorbalenya AE (2017) 'Phylogeny of Viruses', \u003cem\u003eReference Module in Biomedical Sciences\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHalvas EK et al (2020) HIV-1 viremia not suppressible by antiretroviral therapy can originate from large T cell clones producing infectious virus. J Clin Invest 130(11):5847\u0026ndash;5857\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarris RS, Liddament MT (2004) Retroviral restriction by APOBEC proteins. Nat Rev Immunol 4(11):868\u0026ndash;877\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHiener B et al (2017) Identification of Genetically Intact HIV-1 Proviruses in Specific CD4(+) T Cells from Effectively Treated Participants. Cell Rep 21(3):813\u0026ndash;822\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHo YC et al (2013) Replication-competent noninduced proviruses in the latent reservoir increase barrier to HIV-1 cure. Cell 155(3):540\u0026ndash;551\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoang DT et al (2018) UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol 35(2):518\u0026ndash;522\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eImamichi H et al (2020) Defective HIV-1 proviruses produce viral proteins. Proc Natl Acad Sci U S A 117(7):3704\u0026ndash;3710\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIsaac NJ et al (2007) 'Mammals on the EDGE: conservation priorities based on threat and phylogeny'. 
PLoS ONE, 2 (3), e296\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJones BR, Joy JB (2023) 'Inferring Human Immunodeficiency Virus 1 Proviral Integration Dates With Bayesian Inference', \u003cem\u003eMol Biol Evol\u003c/em\u003e, 40 (8)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJones BR et al (2018) Phylogenetic approach to recover integration dates of latent HIV sequences within-host. Proc Natl Acad Sci U S A 115(38):E8958\u0026ndash;E67\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJones BR et al (2020) 'Genetic Diversity, Compartmentalization, and Age of HIV Proviruses Persisting in CD4(+) T Cell Subsets during Long-Term Combination Antiretroviral Therapy', \u003cem\u003eJ Virol\u003c/em\u003e, 94 (5)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKatoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772\u0026ndash;780\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKearney MF et al (2016) Origin of Rebound Plasma HIV Includes Cells with Identical Proviruses That Are Transcriptionally Active before Stopping of Antiretroviral Therapy. J Virol 90(3):1369\u0026ndash;1376\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKembel SW et al (2010) Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26(11):1463\u0026ndash;1464\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKieffer TL et al (2005) G\u0026ndash;\u0026gt;A hypermutation in protease and reverse transcriptase regions of human immunodeficiency virus type 1 residing in resting CD4\u0026thinsp;+\u0026thinsp;T cells in vivo. J Virol 79(3):1975\u0026ndash;1980\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKijak GH et al (2008) Variable contexts and levels of hypermutation in HIV-1 proviral genomes recovered from primary peripheral blood mononuclear cells. 
Virology 376(1):101\u0026ndash;111\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKinloch NN et al (2023) 'HIV reservoirs are dominated by genetically younger and clonally enriched proviruses', \u003cem\u003emBio\u003c/em\u003e, e0241723\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKypr J, Mrazek J (1987) Unusual codon usage of HIV. Nature 327(6117):20\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKypr J, Mrazek J, Reich J (1989) Nucleotide composition bias and CpG dinucleotide content in the genomes of HIV and HTLV 1/2. Biochim Biophys Acta 1009(3):280\u0026ndash;282\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLarsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30(22):3276\u0026ndash;3278\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee GQ et al (2017) Clonal expansion of genome-intact HIV-1 in functionally polarized Th1 CD4\u0026thinsp;+\u0026thinsp;T cells. J Clin Invest 127(7):2689\u0026ndash;2696\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMartin DP et al (2015) RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evol 1(1):vev003\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMinh BQ et al (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 37(5):1530\u0026ndash;1534\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNicolas A et al (2022) 'Genotypic and Phenotypic Diversity of the Replication-Competent HIV Reservoir in Treated Patients'. Microbiol Spectr, 10 (4), e0078422\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePankau MD et al (2020) 'Dynamics of HIV DNA reservoir seeding in a cohort of superinfected Kenyan women'. 
PLoS Pathog, 16 (2), e1008286\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePatro SC et al (2019) Combined HIV-1 sequence and integration site analysis informs viral dynamics and allows reconstruction of replicating viral ancestors. Proc Natl Acad Sci U S A 116(51):25891\u0026ndash;25899\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePavoine et al (2017) From phylogenetic to functional originality: Guide through indices and new developments. Ecol Ind 82:196\u0026ndash;205\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePinzone MR et al (2019) Longitudinal HIV sequencing reveals reservoir expression leading to decay which is obscured by clonal expansion. Nat Commun 10(1):728\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrice MN, Dehal PS, Arkin AP (2010) 'FastTree 2\u0026ndash;approximately maximum-likelihood trees for large alignments'. PLoS ONE, 5 (3), e9490\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRedding DW, Mooers A\u0026Oslash; (2006) Incorporating evolutionary measures into conservation prioritization. Conserv Biol 20(6):1670\u0026ndash;1678\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRedding DW, Mazel F, Mooers A (2014) 'Measuring Evolutionary Isolation for Conservation'. PLoS ONE, 9 (12), e113490\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRose PP, Korber BT (2000) Detecting hypermutations in viral sequences with an emphasis on G --\u0026gt; A hypermutation. Bioinformatics 16(4):400\u0026ndash;401\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSanchez G et al (1997) Accumulation of defective viral genomes in peripheral blood mononuclear cells of human immunodeficiency virus type 1-infected individuals. 
J Virol 71(3):2233\u0026ndash;2240\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShahid A et al (2024) 'The replication-competent HIV reservoir is a genetically restricted, younger subset of the overall pool of HIV proviruses persisting during therapy, which is highly genetically stable over time'. J Virol, 98 (2), e0165523\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSheehy AM et al (2002) Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein. Nature 418(6898):646\u0026ndash;650\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSlatkin M, Maddison WP (1989) A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123(3):603\u0026ndash;613\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVartanian JP et al (1991) Selection, recombination, and G----A hypermutation of human immunodeficiency virus type 1 genomes. J Virol 65(4):1779\u0026ndash;1788\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVartanian JP et al (1994) G\u0026ndash;\u0026gt;A hypermutation of the human immunodeficiency virus type 1 genome: evidence for dCTP pool imbalance during reverse transcription. Proc Natl Acad Sci U S A 91(8):3092\u0026ndash;3096\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaldron D (2015) Hypermutation of HIV-1 in vivo. Nat Rev Genet 16(11):626\u0026ndash;626\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWhitney JB et al (2014) Rapid seeding of the viral reservoir prior to SIV viraemia in rhesus monkeys. Nature 512(7512):74\u0026ndash;77\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu G (2020) Using ggtree to Visualize Data on Tree-Like Structures. 
Curr Protoc Bioinf 69(1):e96\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003e\u003cstrong\u003eTable 1: Participant information, HIV sampling and sequencing details\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"left\" width=\"933\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003e\u003cstrong\u003eID^\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003e\u003cstrong\u003eEstimated date of infection\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e\u003cstrong\u003eDuration of uncontrolled infection (years)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e\u003cstrong\u003eNo. of pre-ART plasma HIV RNA time points\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e\u003cstrong\u003eDistinct pre-ART\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eplasma HIV\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u003cem\u003eenv-gp120\u003c/em\u003e\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;sequences\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e\u003cstrong\u003eART initiation date\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e\u003cstrong\u003eYears of ART until last\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eproviral sampling\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e\u003cstrong\u003eNo. 
of\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eon-ART proviral\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003etime points\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e\u003cstrong\u003eDistinct on-ART HIV\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u003cem\u003eenv-gp120\u003c/em\u003e\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;proviral sequences\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e(Hypermutated N; %)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eJan 2003\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e227\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eJan 2012\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e6.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e75 (22; 28%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P4*\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eJul 1995\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e10.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e\u0026nbsp;182\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eJun 2006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e12.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e155 (23; 15%)\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eDec 1995\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e207\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eJan 2008\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e10.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e85 (15; 13%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eJul 2002\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e5.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e132\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eJan 2008\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e8.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e59 (5; 8%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eMar 2008\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e1.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eFeb 2010\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e8.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e74 (22; 30%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.82422293676313%\"\u003e\n \u003cp\u003eWIHS-P6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"7.07395498392283%\"\u003e\n \u003cp\u003eAug 2006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e3.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.183279742765274%\"\u003e\n \u003cp\u003e73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eJul 2010\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
width=\"11.146838156484458%\"\u003e\n \u003cp\u003e8.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.146838156484458%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"20.257234726688104%\"\u003e\n \u003cp\u003e94 (6; 6%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e^\u003c/strong\u003eParticipants are numbered in the same order as the original manuscript\u0026nbsp;(Shahid et al, 2024). That is, WIHS-P2 in the present study is Participant 2 in (Shahid et al, 2024).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e*\u0026nbsp;\u003c/strong\u003eThe MWCCS database indicated that participant 4 initiated ART in 2003, but no reductions in plasma viral load (pVL) were observed until June 2006. For this reason, we considered June 2006 as this participant\u0026apos;s effective ART start date.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2: Hypermutated sequence details\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"782\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"NaN%\" valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eID\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"NaN%\"\u003e\n \u003cp\u003e\u003cstrong\u003eHypermutated proviruses\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"NaN%\"\u003e\n \u003cp\u003e\u003cstrong\u003eAligned HIV \u003cem\u003eenv-gp120\u003c/em\u003e sequence length (bp)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"NaN%\"\u003e\n \u003cp\u003e\u003cstrong\u003ePutative hypermutated\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003enucleotide positions in the alignment\u003csup\u003ea\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"NaN%\"\u003e\n \u003cp\u003e\u003cstrong\u003eHypermutated sites identified per sequence\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eMedian (range)\u003csup\u003eb\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\"\u003e\n \u003cp\u003e22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1515\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e140\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e43 (20 \u0026ndash; 68)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\"\u003e\n \u003cp\u003e23\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1541\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e176\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e55 (34 \u0026ndash; 83)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\" valign=\"top\"\u003e\n \u003cp\u003e15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1483\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e141\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e41 (10 \u0026ndash; 75)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1486\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e127\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e40 (36 \u0026ndash; 64)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\"\u003e\n \u003cp\u003e22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1501\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e152\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e57 (9 \u0026ndash; 78)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.10230179028133%\" valign=\"top\"\u003e\n \u003cp\u003eWIHS-P6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.391304347826086%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e1523\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.309462915601024%\" valign=\"top\"\u003e\n \u003cp\u003e122\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.887468030690535%\" valign=\"top\"\u003e\n \u003cp\u003e47 (35 \u0026ndash; 75)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003csup\u003ea\u003c/sup\u003eThe total number of nucleotide positions that harbored an A at an APOBEC3 target site in at least one hypermutated sequence in the participant\u0026rsquo;s sequence alignment. These positions were stripped out of the alignment in the HM-Stripped approach.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003csup\u003eb\u0026nbsp;\u003c/sup\u003eStatistics summarizing the overall number of A bases at APOBEC3 target sites in the participant\u0026apos;s hypermutated\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003esequences. These A bases were changed to R or G, respectively, in the HM-Replacedw/R and HM-Replacedw/G approaches.\u0026nbsp;\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Simon Fraser University","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4549934/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4549934/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eHypermutated proviruses, which arise in a single HIV replication cycle when host antiviral APOBEC3 proteins introduce extensive G-to-A mutations throughout the viral genome, persist in all people living with HIV receiving antiretroviral therapy (ART). But, the within-host evolutionary origins of hypermutated sequences are incompletely understood because phylogenetic inference algorithms, which assume that mutations gradually accumulate over generations, incorrectly reconstruct their ancestor-descendant relationships. Using \u0026gt;1400 longitudinal single-genome-amplified HIV \u003cem\u003eenv-gp120\u003c/em\u003e sequences isolated from six women over a median 18 years of follow-up − including plasma HIV RNA\u003cem\u003e \u003c/em\u003esequences collected over a median 9 years between seroconversion and ART initiation, and \u0026gt;500 proviruses isolated over a median 9 years on ART − we evaluated three approaches for removing hypermutation from nucleotide alignments. Our goals were to 1) reconstruct accurate phylogenies that can be used for molecular dating and 2) phylogenetically infer the integration dates of hypermutated proviruses persisting during ART. 
Two of the tested approaches (stripping all positions containing putative APOBEC3 mutations from the alignment, or replacing individual putative APOBEC3 mutations in hypermutated sequences with the ambiguous base R) consistently normalized tree topologies, eliminated erroneous clustering of hypermutated proviruses, and brought \u003cem\u003eenv\u003c/em\u003e-intact and hypermutated proviruses into comparable ranges with respect to multiple tree-based metrics. Importantly, these corrected trees produced integration date estimates for \u003cem\u003eenv\u003c/em\u003e-intact proviruses that were highly concordant with those from benchmark trees that excluded hypermutated sequences, indicating that the corrected trees can be used for molecular dating. Use of these trees to infer the integration dates of hypermutated proviruses persisting during ART revealed that these spanned a wide age range, with the oldest ones dating to shortly after infection. This indicates that hypermutated proviruses, like other provirus types, begin to be seeded into the proviral pool immediately following infection, and can persist for decades. In two of the six participants, hypermutated proviruses differed from \u003cem\u003eenv\u003c/em\u003e-intact ones in terms of their age distributions, suggesting that different provirus types decay at heterogeneous rates in some hosts. 
These simple approaches to reconstructing hypermutated proviruses' evolutionary histories allow insights into their \u003cem\u003ein vivo\u003c/em\u003e origins and longevity, towards a more comprehensive understanding of HIV persistence during ART.\u003c/p\u003e","manuscriptTitle":"A simple phylogenetic approach to analyze hypermutated HIV proviruses reveals insights into their dynamics and persistence during antiretroviral therapy","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-06-13 14:02:34","doi":"10.21203/rs.3.rs-4549934/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3d351ffe-a2a2-444b-981e-3e89de37b041","owner":[],"postedDate":"June 13th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-06-13T14:02:34+00:00","versionOfRecord":[],"versionCreatedAt":"2024-06-13 14:02:34","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4549934","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4549934","identity":"rs-4549934","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}