{"paper_id":"21d3fee7-fff5-4526-ba01-6718c18cef44","body_text":"Hypermut 3: Identifying specific mutational patterns in a defined nucleotide \ncontext that allows multistate characters \n \nZena Lapp1, Hyejin Yoon1, Brian Foley1, Thomas Leitner1* \n \n1Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM \n87545, USA \n \n*To whom correspondence should be addressed \n \n \nAbstract \n \nSummary: The detection of APOBEC3F- and APOBEC3G-induced mutations in virus \nsequences is useful for identifying hypermutated sequences. These sequences are not \nrepresentative of viral evolution and can therefore alter the results of downstream sequence \nanalyses if included. We previously published the software Hypermut, which detects \nhypermutation events in sequences relative to a reference. Two versions of this method are \navailable as a webtool. Neither of these methods considers multistate characters or gaps in the \nsequence alignment. Here, we present an updated, user-friendly web and command-line version \nof Hypermut with functionality to handle multistate characters and gaps in the sequence \nalignment. This tool allows for straightforward integration of hypermutation detection into \nsequence analysis pipelines. As with the previous tool, while the main purpose is to identify G to \nA hypermutation events, any mutational pattern and context can be specified. \nAvailability and implementation: Hypermut 3 is written in Python 3. It is available as a \ncommand-line tool at https://github.com/MolEvolEpid/hypermut3 and as a webtool at \nhttps://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermutv3.html.  \nContact: tkl@lanl.gov or seq-info@lanl.gov  \n \n \nIntroduction \n \nAs part of the human immune response, APOBEC3F and APOBEC3G proteins can deaminate \ncytosine residues to uracil on viral DNA, leading to a guanine to adenine mutation in subsequent \nviral sequences (Refsland et al., 2012). 
These mutations usually occur in the context of \ndownstream RD nucleotides (R: A or G; D: not C) (Refsland et al., 2012; Yu et al., 2004). This pattern \nof hypermutation has been detected in sequences from human immunodeficiency virus (HIV) \n(Vartanian et al., 1991), hepatitis B virus (Noguchi et al., 2005), and mpox (Desingu et al., \n2024), among others. Since these mutations do not arise from standard vertical evolution of viral \nmutations during replication, they need to be removed or modified prior to performing analyses \nthat use mutations to estimate relatedness between sequences, such as computing phylogenies, \nidentifying transmission clusters, or dating latent proviral sequences. \n \nbioRxiv preprint doi: https://doi.org/10.1101/2024.10.24.620069; this version posted October 29, 2024. \nThe copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. \nIt is made available under a CC-BY 4.0 International license. \n \nWe previously published a webtool, Hypermut (Rose and Korber, 2000), that detects patterns \nconsistent with hypermutation in genome sequences. Given an alignment including a reference \nsequence that is assumed to have no signature of hypermutation, Hypermut detects potentially \nhypermutated positions in each query sequence in the alignment. A decade later, we updated the \nwebtool (Hypermut 2; unpublished) to enable users to compare the number of mutations occurring in \nuser-defined upstream and downstream nucleotide primary contexts vs. control contexts (Figure 1A). \nThe context can be enforced in the reference sequence, the query sequence, or both. The use of \nmultistate characters in the user-defined contexts indicates that any of the included nucleotides \nmay be considered a match in the alignment. If the correct mutation occurs in a primary context \n(Figure 1B, q1), it is considered a primary match. 
If an incorrect mutation, or no mutation, \noccurs in a primary context (Figure 1B, q2), it is considered a potential primary match (but not a \nmatch). The same logic is followed to identify potential and actual matches for control contexts \n(Figure 1B, q3). Hypermut 2 returns a summary of the number of mutations in contexts of \ninterest compared to control contexts for each sequence and uses these numbers to compute \nFisher’s exact p-values that quantify whether a sequence has more mutations than expected in \ncontexts of interest compared to control contexts. Users can also download a file including the \npositions of potential mutation sites in the primary and control contexts, and whether there was a \nmutation match at the site. However, Hypermut 2 only matches to ACGT characters in the \nalignment and therefore does not consider sites with multistate characters or gaps, which may \noccur due to virus population diversity.  \n \nSince the original publication of Hypermut 25 years ago, there has been an explosion in available \nsequencing data and corresponding development of automated bioinformatic pipelines. Here, we \npresent a substantial update to Hypermut that allows for the integration of hypermutation \ndetection into automated bioinformatic pipelines, and that can handle multistate characters and \nalignment gaps. \n \n \nNew developments in Hypermut 3 \n \nHypermut 3 extends Hypermut 2 by optionally handling gaps and multistate characters in the \nalignment. Furthermore, it is available as a command-line tool that can be straightforwardly \nintegrated into bioinformatic pipelines. We have simplified the input compared to Hypermut 2 by \nautomatically setting the control context to be the exact complement of the primary context, \nwhich will reduce accidental user error for complex control patterns. The output information \nremains the same.  \n \nGap handling \nGaps in the mutation site are not considered (Figure 1B, q4). 
By default, gaps in the context are \nignored when considering potential mutation sites of interest (Figure 1B, q5), i.e., gaps are \nskipped over and the following characters in the sequences are used for the context patterns. We \nalso provide an option to keep gaps, using them as characters, in which case gaps in the context \nare not ignored. \n \nMultistate character handling \nMatching at positions with multistate characters in the alignment can be handled in two ways \n(Figure 1B, q6-q9): (1) strict mode ignores positions where the multistate character in the \nsequence is broader than the user-defined pattern, and (2) partial mode identifies partial matches \nfor positions with some overlap between the multistate character and the user-defined pattern. \nOnly strict matching can be performed if the reference sequence contains multistate characters. \n \nStrict mode \nIn strict mode, sites with multistate characters are considered only when the nucleotides that \nmake up the multistate character are entirely included in the user-defined search pattern. In cases \nwhere the multistate characters contain nucleotides that are not included in the user-defined \nsearch pattern, the location is ignored and not considered a potential primary or control match.  \n \nPartial mode \nPartial mode returns the same complete matches as strict mode, but also assigns partial matches \nto locations with multistate characters that contain some, but not all, nucleotides in the user-\ndefined search pattern. 
For this mode, we assume that each nucleotide is present at equal \nfrequency in the population, and that all potential combinations of nucleotides are also present at \nequal frequency. In this case, there may be partial matching in the context or the mutation site. \nPartial matching in the context leads to a fractional potential match “count,” while partial \nmatching in the mutation site leads to a fractional actual match “count.” For a given position, we \nquantify the extent of matching by determining the proportion $P_{s,c}$ of standard nucleotides \n(ACGT) in the IUPAC code of the sequence that are in the correct context: \n \n$$P_{s,c} = \\frac{|s \\cap c|}{|s|},$$ \n \nwhere $s$ is the set of nucleotides present in the IUPAC code for the sequence, $c$ is the set of \nnucleotides present in the IUPAC code for the context or mutation of interest, and $|\\cdot|$ is the \ncardinality (i.e., length) of the set. For a given context of length $l$, the potential match “count” $M$ \nis the product of the proportions for each position in the context: \n \n$$M = \\prod_{i=1}^{l} P_{s,c}^{i},$$ \n \nwhere $P_{s,c}^{i}$ is the proportion of matching nucleotides for the $i$th position in the context pattern. If \nthere are multiple possible contexts, then the total potential match count is the sum of the \nindividual match counts. The actual match count is computed as: \n \n$$m = P_{s,c}^{\\delta} \\sum_{i=1}^{n} M_i,$$ \n \nwhere $\\delta$ designates the mutation of interest, $n$ is the number of possible contexts, and $M_i$ is the \npotential match count for context $i$.  \n \nPotential and actual match counts are computed for the user-defined primary context as well as \nthe inferred control context.  \n \n
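The partial-mode counts described above can be sketched in Python as follows. This is a minimal illustration of the formulas, not the Hypermut 3 source code: the function names (proportion, potential_match, actual_match) and the IUPAC lookup table are assumptions introduced here, and the sketch ignores gap handling and the check that the reference carries the unmutated base. The loop evaluates rows q6, q8, and q9 of Figure 1B for the G to A mutation with downstream context RD.

```python
# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def proportion(seq_code, pattern_code):
    # P_{s,c}: share of the bases in the sequence code that the pattern allows.
    s = set(IUPAC[seq_code])
    c = set(IUPAC[pattern_code])
    return len(s & c) / len(s)


def potential_match(seq_context, pattern_context):
    # M: product of the per-position proportions over the context.
    count = 1.0
    for seq_code, pattern_code in zip(seq_context, pattern_context):
        count *= proportion(seq_code, pattern_code)
    return count


def actual_match(seq_site, mutation, potential_counts):
    # m: proportion at the mutation site times the summed potential counts.
    return proportion(seq_site, mutation) * sum(potential_counts)


# Query sequences from Figure 1B: first character is the mutation site,
# the next two characters are the downstream context (hypothetical layout).
for query in ('RGTC', 'ANTC', 'RNTC'):
    potential = potential_match(query[1:3], 'RD')
    print(query, potential, actual_match(query[0], 'A', [potential]))
```

Under these assumptions the loop prints potential and actual counts of 1.0 and 0.5 for RGTC, 0.5 and 0.5 for ANTC, and 0.5 and 0.25 for RNTC, which agree with the partial-mode columns of Figure 1B.
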
Availability and implementation  \nHypermut 3 is written in Python 3 and requires the scipy package (Virtanen et al., 2020). It is \navailable as a command-line tool at https://github.com/MolEvolEpid/hypermut3 and as a \nwebtool at https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermutv3.html.  \n \n \nExample of use \n \nTo investigate the effects of gaps and multistate characters on the detection of potentially \nhypermutated positions, we downloaded from the LANL HIV database 1541 high-quality HIV-1 \ngag and env population sequences aligned to HXB2 (GenBank accession number K03455; all \ndata available at https://github.com/MolEvolEpid/hypermut3/tree/main/manuscript/data). Due to \nthe diversity of the sequences in the alignment, the mean percent of gaps within a query \nsequence was 24.9% (range: 0%-40%). The mean percent of non-ACGT characters was 1.1% \n(range: 0.06%- 7.0%). We ran Hypermut 3 on these sequences, using HXB2 as the reference \nsequence, in both strict and partial modes, with and without skipping gaps in the context.  \n \nWe first investigated whether sequences had signatures of hypermutation using a Fisher’s exact \np-value threshold of 0.05. When keeping gaps in the context, 2 sequences fell below the \nthreshold in both strict and partial modes, and 1 additional sequence fell below the threshold in \npartial mode only. When skipping gaps in the context, 8 sequences had signatures of \nhypermutation in both strict and partial modes.  \n \nIn general, skipping gaps increased the number of potential contexts in 94.7% (1460/1541) of \nsequences in both strict and partial modes, and the number of actual matches in over 70% of \nsequences (strict: 1132/1541, 73.5%; partial: 1154/1541, 74.9%). \n \nWe next compared potential and actual matches for the strict and partial modes, both with gap \nskipping. 
When compared to the strict mode, using the partial mode led to 49.2% (758/1541) of \nsequences having more potential contexts and 40.2% (619/1541) of sequences having more \npotential hypermutations. As expected, we observed a positive correlation between the percent of \nnon-ACGT characters and the number of additional potential sites (Figure 1C). An example of \nthe difference in the cumulative number of potential sites and matches under different multistate \nmatch mode and gap handling conditions is shown in Figure 1D. \n \n \nPotential applications \n \nHypermut 3 can be used to detect APOBEC3F- and APOBEC3G-induced mutations as well as \nother mutations of interest. Among other applications, the output from Hypermut may be useful \nas a quality control step in a sequence analysis pipeline to remove sequences with certain \nmutational signatures or to mask certain positions within sequences, thus avoiding biases in \ndownstream results. The new partial matching mode is particularly useful for identifying \npotential hypermutation in viral population sequences derived from, for example, Sanger \nsequencing of a virus population or a resolved consensus sequence of Illumina reads.   \n \nThe web version of Hypermut 3 is straightforward to use for even those with no programming \nbackground and provides a means to easily identify potentially hypermutated sequences from an \nalignment of interest. For those with more programming experience, we also provide a \ncommand-line version of Hypermut 3 that can be incorporated into bioinformatic pipelines. 
This \ngreatly expands the potential applications of Hypermut by allowing it to become a standard tool \nused to automatically identify and flag hypermutated sequences in real time for purposes ranging \nfrom research to public health. \n \n \nConclusion \n \nWe have updated the original Hypermut software to accommodate the increase in genome \nsequencing and the development of automated bioinformatic pipelines. The added functionality \nof Hypermut 3 compared to the original published version allows for the identification of \npotential hypermutation events for multistate characters (as of Hypermut 3), provides a \ncomparison of mutations in primary and control contexts to more easily identify hypermutated \nsequences (as of Hypermut 2), and enables straightforward integration into bioinformatic \npipelines (as of Hypermut 3). Positions or sequences identified as likely hypermutants can then \nbe removed or transformed prior to further sequence analyses, thus reducing potential biases in \nthe downstream results.  \n \n \nFunding \n \nThis work was supported by the National Institutes of Health [grant R01AI087520 to TL, \ninteragency agreement AAI24007-001-00000 HIV/SIV, Database and Analysis Unit to HY and \nBF] and the Los Alamos National Laboratory [Laboratory Directed Research and Development \nprogram fellowship project no. 20230873PRD4 to ZL]. \n \n \nReferences \n \nDesingu,P.A. et al. (2024) Molecular evolution of 2022 multi-country outbreak-causing \nmonkeypox virus Clade IIb. iScience, 27. \nNoguchi,C. et al. (2005) G to A hypermutation of hepatitis B virus. Hepatology, 41, 626–633. \nRefsland,E.W. et al. (2012) Endogenous Origins of HIV-1 G-to-A Hypermutation and \nRestriction in the Nonpermissive T Cell Line CEM2n. PLOS Pathog., 8, e1002800. \nRose,P.P. and Korber,B.T. (2000) Detecting hypermutations in viral sequences with an emphasis \non G → A hypermutation. Bioinformatics, 16, 400–401. \nVartanian,J.P. et al. 
(1991) Selection, recombination, and G----A hypermutation of human \nimmunodeficiency virus type 1 genomes. J. Virol., 65, 1779–1788. \nVirtanen,P. et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. \nNat. Methods, 17, 261–272. \nYu,Q. et al. (2004) Single-strand specificity of APOBEC3G accounts for minus-strand \ndeamination of the HIV genome. Nat. Struct. Mol. Biol., 11, 435–442. \n \n \nFigure 1: Hypermut 3 overview. (A) Example mutation site and context (for APOBEC3G and \nAPOBEC3F) using IUPAC codes. The control pattern is the complement of the primary pattern \nand inferred by the program. (B) Example reference and query sequences, and whether they are \nconsidered a potential primary match and an actual primary match for three Hypermut versions: \n2.0, 3.0 strict, and 3.0 partial. These numbers assume that gaps are skipped (for Hypermut 3) and \nthat the context is enforced on the query sequence. In the sequence, the underlined nucleotide is \nthe mutation site, correct mutations or contexts are bolded, and matches for Hypermut 3 partial \nmode are colored in red. The IUPAC code R indicates A and/or G, and the IUPAC code N \nindicates any of the bases. (C) Correlation between percent non-ACGT characters in the query \nsequence and the number of additional potential primary matches identified when using partial \nmatching compared to strict matching. The color indicates the number of additional primary \nmatches observed. (D) For an example sequence, the cumulative number of potential sites vs. the \ncumulative number of actual matches identified for each combination of partial vs. 
strict \nmatching (color) and keeping vs. skipping gaps in the alignment (line type). \n \nFigure 1A, patterns (mutation site given as reference base to query base): \nPrimary: G to A, downstream context RD \nControl: G to A, downstream context YN|RC \n \nFigure 1B, values (2.0 / strict / partial): \nID   Seq   Potential primary match   Primary match score \nref  GATC \nq1   AGTC  1 / 1 / 1                 1 / 1 / 1 \nq2   TGTC  1 / 1 / 1                 0 / 0 / 0 \nq3   AGCC  0 / 0 / 0                 0 / 0 / 0 \nq4   -GTC  1 / 0 / 0                 0 / 0 / 0 \nq5   A-GT  0 / 1 / 1                 0 / 1 / 1 \nq6   RGTC  1 / 0 / 1                 0 / 0 / 0.5 \nq7   AGRC  0 / 1 / 1                 0 / 1 / 1 \nq8   ANTC  0 / 0 / 0.5               0 / 0 / 0.5 \nq9   RNTC  0 / 0 / 0.5               0 / 0 / 0.25 \n \nFigure 1C and 1D: plots, described in the caption above.","source_license":"CC-BY-4.0","license_restricted":false}