Hypermut 3: Identifying specific mutational patterns in a defined nucleotide context that allows multistate characters

doi:10.1101/2024.10.24.620069

Hypermut 3: Identifying specific mutational patterns in a defined nucleotide context that allows multistate characters

2024 · doi:10.1101/2024.10.24.620069

preprint OA: closed CC-BY-4.0

📄 Open PDF Full text JSON View at publisher

⚙ AI-generated deep summary by claude@2026-06, 2026-06-18 · read from full text ⓘ

This paper introduces Hypermut 3, an updated Python-based web and command-line tool for detecting hypermutation patterns induced by APOBEC3F/APOBEC3G in aligned viral sequences, with an emphasis on correctly handling multistate characters and alignment gaps. Using the definition that reference and query positions can fall into “primary” and “control” nucleotide context patterns (and calculating enrichment via Fisher’s exact tests), Hypermut 3 extends prior versions by adding configurable gap handling and two modes for multistate matching (strict vs partial) with fractional counts under partial mode; a key stated caveat is that gaps and multistate sites are treated with these explicit assumptions rather than fully representing viral evolution. In an HIV-1 gag and env example alignment (n=1541), gap skipping changed which sequences passed a hypermutation significance threshold and generally increased potential context counts, while partial multistate matching frequently yielded more potential sites and hypermutations than strict mode. This paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice. How this works

Full text 17,999 characters · extracted from oa-pdf · 7 sections · click to expand

Abstract

Summary: The detection of APOBEC3F- and APOBEC3G-induced mutations in virus sequences is useful for identifying hypermutated sequences. These sequences are not representative of viral evolution and can therefore alter the results of downstream sequence analyses if included. We previously published the software Hypermut, which detects hypermutation events in sequences relative to a reference. Two versions of this method are available as a webtool. Neither of these methods consider multistate characters or gaps in the sequence alignment. Here, we present an updated, user-friendly web and command-line version of Hypermut with functionality to handle multistate characters and gaps in the sequence alignment. This tool allows for straightforward integration of hypermutation detection into sequence analysis pipelines. As with the previous tool, while the main purpose is to identify G to A hypermutation events, any mutational pattern and context can be specified. Availability and implementation: Hypermut 3 is written in Python 3. It is available as a command-line tool at https://github.com/MolEvolEpid/hypermut3 and as a webtool at https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermutv3.html. Contact: [email protected] or [email protected]

Introduction

As part of the human immune response, APOBEC3F and APOBEC3G proteins can deaminate cytosine residues to uracil on viral DNA, leading to a guanine to adenine mutation in subsequent viral sequences (Refsland et al., 2012). These mutations usually occur in the context of downstream RD (A or G, not C) nucleotides (Refsland et al., 2012; Yu et al., 2004). This pattern of hypermutation has been detected in sequences from human immunodeficiency virus (HIV) (Vartanian et al., 1991), hepatitis B virus (Noguchi et al., 2005), and mpox (Desingu et al., 2024), among others. Since these mutations do not arise from standard vertical evolution of viral mutations during replication, they need to be removed or modified prior to performing analyses that use mutations to estimate relatedness between sequences, such as computing phylogenies, identifying transmission clusters, or dating latent proviral sequences. We previously published a webtool, Hypermut (Rose and Korber, 2000), that detects patterns consistent with hypermutation in genome sequences. Given an alignment including a reference sequence that is assumed to have no signature of hypermutation, Hypermut detects potentially .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint hypermutated positions in each query sequence in the alignment. A decade later, we updated the webtool (unpublished) to enable users to compare the number of mutations occurring in user- defined upstream and downstream nucleotide primary contexts vs. control contexts (Figure 1A). The context can be enforced in the reference sequence, the query sequence, or both. The use of multistate characters in the user-defined contexts indicate that any of the included nucleotides may be considered a match in the alignment. If the correct mutation occurs in a primary context (Figure 1B, q1), it is considered a primary match. If an incorrect mutation, or no mutation, occurs in a primary context (Figure 1B, q2), it is considered a potential primary match (but not a match). The same logic is followed to identify potential and actual matches for control contexts (Figure 1B, q3). Hypermut 2 returns a summary of the number of mutations in contexts of interest compared to control contexts for each sequence and uses these numbers to compute Fisher’s exact p-values that quantify whether a sequence has more mutations than expected in contexts of interest compared to control contexts. Users can also download a file including the positions of potential mutation sites in the primary and control contexts, and whether there was a mutation match at the site. However, Hypermut 2 only matches to ACGT characters in the alignment and therefore does not consider sites with multistate characters or gaps, which may occur due to virus population diversity. Since the original publication of Hypermut 25 years ago, there has been an explosion in available sequencing data and corresponding development of automated bioinformatic pipelines. Here, we present a substantial update to Hypermut that allows for the integration of hypermutation detection into automated bioinformatic pipelines, and that can handle multistate characters and alignment gaps. New developments in Hypermut 3 Hypermut 3 extends Hypermut 2 by optionally handling gaps and multistate characters in the alignment. Furthermore, it is available as a command-line tool that can be straightforwardly integrated into bioinformatic pipelines. We have simplified the input compared to Hypermut 2 by automatically setting the control context to be the exact complement of the primary context, which will reduce accidental user error for complex control patterns. The output information remains the same. Gap handling Gaps in the mutation site are not considered (Figure 1B, q4). By default, gaps in the context are ignored when considering potential mutation sites of interest (Figure 1B, q5), i.e., gaps are skipped over and the following characters in the sequences are used for the context patterns. We also provide an option to keep gaps, using them as characters, in which case gaps in the context are not ignored. Multistate character handling Matching at positions with multistate characters in the alignment can be handled in two ways (Figure 1B, q6-q9): (1) strict mode ignores positions where the multistate character in the sequence is broader than the user-defined pattern, and (2) partial mode identifies partial matches .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint for positions with some overlap between the multistate character and the user-defined pattern. Only strict matching can be performed if the reference sequence contains multistate characters. Strict mode In strict mode, sites with multistate characters are considered only when the nucleotides that make up the multistate character are entirely included in the user-defined search pattern. In cases where the multistate characters contain nucleotides that are not included in the user-defined search pattern, the location is ignored and not considered a potential primary or control match. Partial mode Partial mode returns the same complete matches as strict mode, but also assigns partial matches to locations with multistate characters that contain some, but not all, nucleotides in the user- defined search pattern. For this mode, we assume that each nucleotide is present at equal frequency in the population, and that all potential combinations of nucleotides are also present at equal frequency. In this case, there may be partial matching in the context or the mutation site. Partial matching in the context leads to a fractional potential match “count,” while partial matching in the mutation site leads to a fractional actual match “count.” For a given position, we quantify the extent of matching by determining the proportion 𝑃!,# of standard nucleotides (ACGT) in the IUPAC code of the sequence that are in the correct context: 𝑃!,# = |𝑠 ∩ 𝑐| |𝑠| , where 𝑠 is the set of nucleotides present in the IUPAC code for the sequence, 𝑐 is the set of nucleotides present in the IUPAC code for the context or mutation of interest, and |.| is the cardinality (i.e., length) of the set. For a given context of length 𝑙, the potential match “count” 𝑀 is the product of the proportions for each position in the context: 𝑀 = + 𝑃!,#$ % $&' , where 𝑃!,# $ is the proportion of matching nucleotides for the ith position in the context pattern. If there are multiple possible contexts, then the total potential match count is the sum of the individual match counts. The actual match count is computed as: 𝑚 = 𝑃!,#( - 𝑀$, ) $&' where 𝛿 designates the mutation of interest, 𝑛 is the number of possible contexts, and 𝑀$ is the potential match count for context i. Potential and actual match counts are computed for the user-defined primary context as well as the inferred control context. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint Availability and implementation Hypermut 3 is written in Python 3 and requires the scipy package (Virtanen et al., 2020). It is available as a command-line tool at https://github.com/MolEvolEpid/hypermut3 and as a webtool at https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermutv3.html. Example of use To investigate the effects of gaps and multistate characters on the detection of potentially hypermutated positions, we downloaded from the LANL HIV database 1541 high-quality HIV-1 gag and env population sequences aligned to HXB2 (GenBank accession number K03455; all data available at https://github.com/MolEvolEpid/hypermut3/tree/main/manuscript/data). Due to the diversity of the sequences in the alignment, the mean percent of gaps within a query sequence was 24.9% (range: 0%-40%). The mean percent of non-ACGT characters was 1.1% (range: 0.06%- 7.0%). We ran Hypermut 3 on these sequences, using HXB2 as the reference sequence, in both strict and partial modes, with and without skipping gaps in the context. We first investigated whether sequences had signatures of hypermutation using a Fisher’s exact p-value threshold of 0.05. When keeping gaps in the context, 2 sequences fell below the threshold in both strict and partial modes, and 1 additional sequence fell below the threshold in partial mode only. When skipping gaps in the context, 8 sequences had signatures of hypermutation in both strict and partial modes. In general, skipping gaps increased the number of potential contexts in 94.7% (1460/1541) of sequences in both strict and partial modes, and the number of actual matches in over 70% of sequences (strict: 1132/1541, 73.5%; partial: 1154/1541, 74.9%). We next compared potential and actual matches for the strict and partial modes, both with gap skipping. When compared to the strict mode, using the partial mode led to 49.2% (758/1541) of sequences having more potential contexts and 40.2% (619/1541) of sequences having more potential hypermutations. As expected, we observed a positive correlation between the percent of non-ACGT characters and the number of additional potential sites (Figure 1C). An example of the difference in the cumulative number of potential sites and matches under different multistate match mode and gap handling conditions is shown in Figure 1D. Potential applications Hypermut 3 can be used to detect APOBEC3F- and APOBEC3G-induced mutations as well as other mutations of interest. Among other applications, the output from Hypermut may be useful as a quality control step in a sequence analysis pipeline to remove sequences with certain mutational signatures or to mask certain positions within sequences, thus avoiding biases in downstream results. The new partial matching mode is particularly useful for identifying potential hypermutation in viral population sequences derived from, for example, Sanger sequencing of a virus population or a resolved consensus sequence of Illumina reads. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint The web version of Hypermut 3 is straightforward to use for even those with no programming

Background

and provides a means to easily identify potentially hypermutated sequences from an alignment of interest. For those with more programming experience, we also provide a command-line version of Hypermut 3 that can be incorporated into bioinformatic pipelines. This greatly expands the potential applications of Hypermut by allowing it to become a standard tool used to automatically identify and flag hypermutated sequences in real time for purposes ranging from research to public health.

Conclusion

We have updated the original Hypermut software to accommodate the increase in genome sequencing and the development of automated bioinformatic pipelines. The added functionality of Hypermut 3 compared to the original published version allows for the identification of potential hypermutation events for multistate characters (as of Hypermut 3), provides a comparison of mutations in primary and control contexts to more easily identify hypermutated sequences (as of Hypermut 2), and enables straightforward integration into bioinformatic pipelines (as of Hypermut 3). Positions or sequences identified as likely hypermutants can then be removed or transformed prior to further sequence analyses, thus reducing potential biases in the downstream results. Funding This work was supported by the National Institutes of Health [grant R01AI087520 to TL, interagency agreement AAI24007-001-00000 HIV/SIV, Database and Analysis Unit to HY and BF] and the Los Alamos National Laboratory [Laboratory Directed Research and Development program fellowship project no. 20230873PRD4 to ZL].

References

Desingu,P.A. et al. (2024) Molecular evolution of 2022 multi-country outbreak-causing monkeypox virus Clade IIb. iScience, 27. Noguchi,C. et al. (2005) G to A hypermutation of hepatitis B virus. Hepatology, 41, 626–633. Refsland,E.W. et al. (2012) Endogenous Origins of HIV-1 G-to-A Hypermutation and Restriction in the Nonpermissive T Cell Line CEM2n. PLOS Pathog., 8, e1002800. Rose,P.P. and Korber,B.T. (2000) Detecting hypermutations in viral sequences with an emphasis on G → A hypermutation. Bioinformatics, 16, 400–401. Vartanian,J.P. et al. (1991) Selection, recombination, and G----A hypermutation of human immunodeficiency virus type 1 genomes. J. Virol., 65, 1779–1788. Virtanen,P. et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods, 17, 261–272. Yu,Q. et al. (2004) Single-strand specificity of APOBEC3G accounts for minus-strand deamination of the HIV genome. Nat. Struct. Mol. Biol., 11, 435–442. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint Figure 1: Hypermut 3 overview. (A) Example mutation site and context (for APOBEC3G and APOBEC3F) using IUPAC codes. The control pattern is the complement of the primary pattern and inferred by the program. (B) Example reference and query sequences, and whether they are considered a potential primary match and an actual primary match for three Hypermut versions: 2.0, 3.0 strict, and 3.0 partial. These numbers assume that gaps are skipped (for Hypermut 3) and that the context is enforced on the query sequence. In the sequence, the underlined nucleotide is the mutation site, correct mutations or contexts are bolded, and matches for Hypermut 3 partial mode are colored in red. The IUPAC code R indicates A and/or G, and the IUPAC code N indicates any of the bases. (C) Correlation between percent non-ACGT characters in the query sequence and the number of additional potential primary matches identified when using partial matching compared to strict matching. The color indicates the number of additional primary matches observed. (D) For an example sequence, the cumulative number of potential sites vs. the cumulative number of actual matches identified for each combination of partial vs. strict matching (color) and keeping vs. skipping gaps in the alignment (line type). G A RD G A YN|RC Upstream context Mutation site Downstream context

Reference

Query

Reference

Query PrimaryControl Potential primary match Primary match score ID Seq 2.0 strict partial 2.0 strict partial ref GATC q1 AGTC 1 1 1 1 1 1 q2 TGTC 1 1 1 0 0 0 q3 AGCC 0 0 0 0 0 0 q4 -GTC 1 0 0 0 0 0 q5 A-GT 0 1 1 0 1 1 q6 RGTC 1 0 1 0 0 0.5 q7 AGRC 0 1 1 0 1 1 q8 ANTC 0 0 0.5 0 0 0.5 q9 RNTC 0 0 0.5 0 0 0.25 A B D C .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 29, 2024. ; https://doi.org/10.1101/2024.10.24.620069doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-24T02:00:01.246996+00:00

License: CC-BY-4.0