{"paper_id":"8a2ee01e-64a2-45d0-824d-92964b07fbfc","body_text":"Journal Title Here, 2022, 1–10\ndoi: DOI HERE\nAdvance Access Publication Date: Day Month Year\nPaper\nGenotyping Short Tandem Repeats Across Copy\nNumber Alterations, Aneuploidies, and Polyploid\nOrganisms\nMax A. Verbiest ,1,2,3∗ Elena Grassi ,4,5 Andrea Bertotti 4,5\nand Maria Anisimova 1,3∗\n1Institute of Computational Life Sciences, Zurich University of Applied Sciences, 8820 W¨ adenswil,\nSwitzerland, 2Department of Molecular Life Sciences, University of Zurich, 8057 Z¨ urich, Switzerland, 3Swiss\nInstitute of Bioinformatics, 1015 Lausanne, Switzerland, 4Department of Oncology, University of Torino,\n10060 Candiolo, Torino, Italy and 5Candiolo Cancer Institute - FPO IRCCS, 10060 Candiolo, Torino, Italy\n∗Corresponding authors max.verbiest@zhaw.ch, maria.anisimova@zhaw.ch\nFOR PUBLISHER ONLY Received on Date Month Year; revised on Date Month Year; accepted on Date Month Year\nAbstract\nShort tandem repeats (STRs) are a rich source of genetic variation, but are difficult to genotype. While\nspecialized repeat variant callers exist, they typically assume a euploid human genome. This means recent\nfindings regarding phenotypic effects of STR variants in human health and disease cannot be readily extended\nto polyploid organisms or cancer, which is characterised by copy number alterations (CNAs). Here we present\nConSTRain, a novel STR variant caller that explicitly accounts for the copy number of loci in its genotyping\napproach. We benchmark ConSTRain using a euploid human 100X whole genome sequencing sample where it\ncalls STR allele lengths for over 1.7×106 loci in under 20 minutes with an accuracy of 98.28%. Subsequently,\nwe show that ConSTRain resolves complex STR genotypes in an artificial trisomy 21 sample and a polyploid\nDwarf Cavendish banana harbouring a large duplication. Finally, we analyse a microsatellite instable colorectal\ncancer tumoroid, where ConSTRain tackles CNAs and whole-genome duplications. ConSTRain is the first\nSTR variant caller that allows for the investigation of repeats affected by CNAs, aneuploidies, and polyploid\ngenomes. This unlocks the investigation of STRs across a wide range of contexts and organisms where they\npreviously could not be easily studied.\nKey words: Short tandem repeat, Microsatellite, Variant calling, Cancer, Copy number alteration, Polyploidy\nIntroduction\nShort tandem repeats (STRs), also known as microsatellites, are\ngenomic regions where a DNA motif one to six base pairs (bp)\nin length is repeated consecutively. STRs are highly variable.\nEspecially prevalent are insertion and deletion (indel) mutations\nthat expand or contract the repeat by one or more unit [1].\nSuch STR variants may cause frameshift mutations or affect the\nphenotype by regulating gene expression levels in health and\ndisease [2, 3, 4]. STR loci for which the allele length is associated\nwith gene expression levels are called expression STRs (eSTRs).\nThe distinct mutational characteristics of STRs cause issues\nwhen genotyping them with general-purpose variant calling tools.\nTo this end, specialized STR variant calling algorithms have been\ndeveloped [5, 6, 7, 8]. While these tools enable accurate variant\ncalling of STR loci from human sequencing samples, there are\nseveral key points they do not address. Notably, current STR\ngenotypers were developed with the euploid human genome in\nmind. This means such tools expect two copies of each repeat\nlocus to be present, with some tools supporting a ploidy of one for\nsex chromosomes.\nWhile this may generally hold for mammalian genomes, it is\nnot representative of the full range of genomic variation. Copy\nnumber alterations (CNAs) can change the ploidy of parts of a\nchromosome — and thus of the STRs located in those regions.\nCNAs can be present in the germline of healthy individuals [9].\nFurthermore, somatic CNAs are a key feature of cancer, where\nthey contribute to carcinogenesis by deleting and upregulating\nbiological functions [10]. There are also more extreme cases where\nthe ploidies of whole chromosomes (e.g., trisomy 21) or the full\ngenome (i.e., whole-genome duplications) are affected. We recently\ndescribed a panel of putative eSTRs in colorectal cancer [4].\nHowever, since current STR variant callers do not account for\n© The Author 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 1\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n2 Verbiest et al.\nCNAs, our eSTR detection approach had to exclude all STRs\nthat were located in regions affected by CNAs. This lead to a\nsubstantial fraction of information — around 15% of all calls —\nbeing removed, meaning we may have missed important eSTR loci.\nBesides not addressing aneuploidies or CNAs, the focus of current\nSTR variant callers on the human genome also means that such\ntools cannot be readily used to study STRs in polyploid organisms.\nWhile polyploidy occurs sporadically in animals, it is widespread\nin plants [11]. Among the polyploid plants are many important\nfood crops like wheat, maize, and banana [11, 12, 13]. Despite the\nsocietal importance of such species, current computational tools\ndo not allow for the extension of findings regarding the phenotypic\neffects of STR variants to polyploid organisms.\nTo address these open issues, we here introduce a new STR\nvariant caller named ConSTRain (co\npy number guided STR allele\ninference). The fundamental idea of ConSTRain is that the copy\nnumber of each STR locus is explicitly considered in the variant\ncalling process. The copy number can be set at the chromosome\nlevel by specifying the karyotype of the organism. Furthermore,\nConSTRain allows the copy numbers of specific genomic regions\nto be changed by specifying CNAs known to be present in a sample.\nWe demonstrate that our new method is highly competitive:\nConSTRain’s accuracy is at least as high as state-of-the-art STR\nvariant callers on a euploid human benchmark, while the runtime\nis substantially lower (especially when running multithreaded).\nFurthermore, we apply ConSTRain in aneuploid settings on\nsimulated trisomy 21 data, and on whole genome sequencing\n(WGS) data from a triploid Musa acuminata Dwarf Cavendish\nbanana. The original publication of this M. acuminata sequencing\ndata reported a large duplication on the long arm of chromosome\n2 [13]. We show that ConSTRain is able to account for this\nduplication when the coordinates of the affected region are\nprovided. Finally, we analyse STRs in four WGS samples from\na microsatellite instable (MSI) colorectal cancer (CRC) tumoroid\n[14, 15]. One of these samples represents the original tumoroid\nline and the other three are clonal organoids, two of which have\nundergone whole-genome duplication. While these samples stem\nfrom the same tumour, we observe differences in STR allele lengths\nin pairwise sample comparisons. This indicates that ConSTRain\ncan be useful for analysing tumour heterogeneity and tracing\nclonal lineages in cancer, even in closely related samples. Overall,\nConSTRain is a flexible, fast, and accurate STR variant caller that\ncan genotype repeats in human and non-human sequencing data\nwhile addressing ploidy-altering events.\nMaterials and Methods\nConSTRain implementation\nConSTRain is an STR variant caller implemented in Rust. It\nrelies on the htslib C library [16] through the rust-htslib crate\n[17]. All analyses reported in this manuscript were performed\nusing ConSTRain version 0.9.1. A visual overview of ConSTRain’s\ngenotyping approach is shown in Figure 1. ConSTRain requires\nthree input files: an alignment of sequencing reads to a\nreference genome (SAM/BAM/CRAM format), a file specifying\nthe locations of STR loci (BED format), and a file specifying the\nkaryotype (i.e, the ploidy of each chromosome) (JSON format). If\nthe alignment file is in CRAM format, the reference genome must\nalso be supplied (FASTA format). Optionally, a file specifying the\nlocation and copy number of regions affected by CNAs can be\nsupplied (BED format). The estimated STR genotypes for each\nlocus in the input STR panel are written to stdout in VCF format.\nSource code, details of input and output file formats, and an\noverview of available command line arguments are available at\nhttps://github.com/acg-team/ConSTRain.\nVocabulary\nDifferent fields and research groups use inconsistent terminology\nto describe the various characteristics of STRs. To avoid confusion,\nwe will explicitly define the vocabulary used by ConSTRain here:\nSTRs are made up of a sequence of repeated units. Currently,\nConSTRain allows only perfect STRs, without any mismatches,\ninsertions, or deletions between the different units of a locus. The\nnumber of nucleotides in the unit is referred to as the period. The\nnumber of times a unit is repeated is called the STR allele length.\nA genotype is the combination of allele lengths that exist for an\nSTR locus in a sample. Finally, each STR has a copy number,\nwhich indicates how many instances of the STR locus exist in the\ngenome.\nInitialisation\nConSTRain starts by reading STR loci from the STR panel file\nand the ploidy of contigs from the karyotype file. The copy number\nof an STR is initially set based on the ploidy of the contig it is\nlocated on. E.g., the copy number will typically be set to two for\nSTRs located on human autosomal chromosomes. However, if a file\nwith CNAs is provided and the coordinates of the STR intersect\nwith the coordinates of a CNA, the copy number of the STR is\nupdated to that of the CNA. Subsequently, ConSTRain fetches\nall reads from the alignment file that fully span the STR locus\nand parses CIGAR strings to extract the STR allele length from\neach read. This yields the observed allele length distribution for\nthat STR. For reasons that are discussed below in ‘Generating all\npossible genotypes’, the distribution is sorted such that the STR\nallele lengths are listed according to their observed frequencies, in\ndescending order (Figure 1, step 3). The main task for ConSTRain\nis to infer the most likely genotype for each STR locus, given its\nobserved allele length distribution and copy number.\nEstimating the most likely STR genotype\nRather than using a heuristic optimisation approach to estimate\nthe most likely genotype, ConSTRain explicitly generates all\npossible genotypes for an STR locus. From each of these possible\ngenotypes, an expected allele length distribution is generated. The\ngenotype for which the expected allele length distribution has\nthe lowest absolute error (Manhattan distance) to the observed\nallele length distribution is chosen as the most likely genotype.\nTo make this process tractable, ConSTRain operates under three\nassumptions:\n1. STRs exist at integer copy numbers.\n2. There are at most as many distinct allele lengths as the STR\ncopy number.\n3. Each STR allele in the genotype contributes an equal number\nof reads to the allele length distribution.\nUnder these assumptions ConSTRain can generate all possible\ngenotypes for an STR locus, given its copy number. ConSTRain\nonly considers genotypes where the alleles are in descending order\nof abundance (‘Generating all possible genotypes’ for details).\nThus, if the STR has copy number two the possible genotypes are\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 3\n‘AA’ and ‘AB’, if the copy number is three, the possible genotypes\nare ‘AAA’, ‘AAB’, and ‘ABC’, etc. Internally, ConSTRain\nrepresents genotypes as matrices. E.g., for copy number three the\npossible genotypes are:\nG × ⃗ a=\n\n\n3 0 0\n2 1 0\n1 1 1\n\n ×\n\n\nA\nB\nC\n\n (1)\nwhere each row in G represents a possible genotype, each column\nrepresents an STR allele length (encoded in ⃗ a), and each value\nrepresents the number of times an STR allele length is present\nin a genotype. For a given locus, the shape of G will always\nbe such that the number of columns equals the copy number of\nthe locus, and the number of rows equals the number of integer\npartitions that exist for that copy number (‘Generating all possible\ngenotypes’ for details). Next, ConSTRain uses G to generate a\nmatrix of expected allele length distributions, denoted as D. To\ndo this, ConSTRain must first know the number of reads each\nallele in the genotype is expected to contribute to the STR allele\nlength distribution. Under assumptions (2) and (3) we can find this\nnumber by dividing the total number of reads mapped to the STR\nlocus by the copy number of the STR locus. ConSTRain multiplies\nG by this scalar which results in matrixD where each row contains\nthe expected allele length distribution for the corresponding row\nin G. For each row in D, the absolute error to the observed allele\nlength distribution of the STR is calculated. If the number of\nallele lengths observed for a locus is greater than the copy number\n(such as in Figure 1) only as many alleles as the copy number\nare considered. Conversely, if the number of observed alleles is\nfewer than the copy number, zero values are appended to the\nobserved allele length distribution until its length equals the copy\nnumber. ConSTRain reports the genotype in G for which the\nassociated expected allele length distribution in D has the lowest\nerror to the observed allele length distribution. In cases where\nmultiple genotypes are equally likely, ConSTRain does not report\na genotype and sets a VCF filter tag to indicate why a genotype\nwas not inferred. However, the observed allele length distribution\nwill still be included in the VCF record’s FORMAT field.\nGenerating all possible genotypes\nAs noted above, ConSTRain does not consider genotypes where\nallele abundances are not in descending order. Under assumption\n(3) it is not possible for an STR allele that is less abundant\nto contribute more reads to the allele length distribution than\nanother allele that is more abundant. By sorting the observed allele\nlength distribution we thus do not need to consider genotypes with\nnon-descending allele abundances. Genotypes of this form would\nresult in expected allele length distributions that are impossible\nunder ConSTRain’s assumptions. Sorting the observed allele\nlength distribution is a way to reduce the combinatorial space of\npossible genotypes: without doing this the number of genotypes\nto generate for a locus would be equal to the number of weak\ninteger compositions of a size equal to the STR copy number. A\nweak integer composition refers to the representation of an integer\nas the sum of a sequence of non-negative integers. For a given\ninteger, the number of weak compositions of a specific size (i.e.,\nthe number of terms to represent the integer as) is given by:\nnumber of weak compositions =\n\u0010n + k − 1\nn\n\u0011\n(2)\nwhere n is the integer and k is the composition size. For our\npurposes n equals k equals the STR copy number. By sorting the\nobserved allele length distribution we instead only need to generate\na number of genotypes equal to the number of integer partitions of\nthe STR copy number. Integer partitions differ from compositions\nin that the terms of the partition are not ordered, i.e., different\norders of the same terms are considered identical. No closed-form\nsolution is known to determine the number of partitions for an\ninteger, but Sloane’s sequence A000041 enumerates the number of\npartitions for a range of integers [18]. Going back to the example\nin equation Equation 1, we can use Equation 2 to calculate that\nthere exist ten weak integer compositions when n and k are both\nthree. On the other hand, A000041 tells us that there are three\ninteger partitions — a difference of seven. This may not seem very\nimpactful, but the difference becomes much more pronounced for\nhigher copy numbers: for n = 20 there exist more than 68.9 × 109\nweak compositions of size 20, but only 627 partitions. Further,\ngiven that an STR panel can contain hundreds of thousands of\nloci (e.g., over 1 .7 × 106 for the human genome), even small\noptimisations make a difference in overall runtime.\nUpdating an existing VCF file\nBesides the standard mode of running ConSTRain outlined above,\nConSTRain also supports reanalysing previously generated VCF\nfiles. This may be useful if novel CNA information for a sample\nbecomes available after an input alignment has already been\nanalysed, or if it is necessary to adjust filtering parameters. It\nalso prevents having to re-download large alignment files from\nremote repositories. This is possible because ConSTRain includes\nthe observed allele length distribution of each STR in a FORMAT\nfield of the output VCF. Since it is much faster to read the observed\nallele length distribution from a VCF file than to extract it from\nsequencing reads in an alignment, running ConSTRain in this\nmode is typically a matter of seconds.\nFiltering ConSTRain output\nGenomic regions where the depth of coverage is lower or higher\nthan expected may indicate a large number of technical artifacts\nfor that region. This can lead to inaccurate variant calls. To\naddress this, ConSTRain allows for the filtering of STR loci based\non their normalised depth of coverage. The normalised depth is\ncalculated by dividing the number of mapped reads by the locus\ncopy number. This normalisation is important because the copy\nnumber of a locus is expected to affect the depth of coverage.\nWhen analysing an alignment of human male sequencing reads, for\ninstance, loci on the sex chromosomes are expected to have roughly\nhalf the depth of coverage as loci on autosomes. Similar effects exist\nfor genomic regions that are amplified or deleted by structural\nvariants. Dividing the depth of coverage by the locus copy number\nwill force all loci to occupy the same range of normalised depth\nvalues, which makes filtering more straightforward. The desired\nminimum and maximum normalised depth values can be set at\nthe command line via the --min-norm-depth (default: 1.0) and\n--max-norm-depth (not set by default) arguments, respectively.\nThese upper and lower bounds can be set manually to reasonable\nvalues before running ConSTRain. Another option is to first run\nConSTRain without filters and then set bounds based on the\nobserved distribution of normalised sequencing depths across all\nloci in the sample. This can help identify the range of acceptable\nnormalised depth values for a specific sample. Once the minimum\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n4 Verbiest et al.\nand maximum values are found, ConSTRain can be rerun on the\nVCF file with the updated filtering parameters. Since running\nConSTRain on a VCF file is extremely fast (around 20 seconds\nfor 1733646 loci on a 2020 MacBook Pro) this only marginally\nincreases the overall computational workload. A Python script\nto generate a distribution of normalised depth values from a\nConSTRain VCF file is included in the ConSTRain GitHub\nrepository.\nSTR reference panels\nConSTRain needs a reference panel of STR loci to know where\nSTRs are located in the reference genome. The reference panel\nthat was used in all experiments involving human data reported\nin this manuscript is based on the GRCh38 version 13 reference\npanel provided by GangSTR [7]. While ConSTRain is primarily\naimed at genotyping STRs with periods between one and six,\nthe repeat panel provided by GangSTR contains a small number\n(20481) of repeat loci with longer periods (up to 20), which\nwe did not remove. Furthermore, the GangSTR panel does not\ncontain mononucleotide repeats. We therefore extended this panel\nto include perfect mononucleotide repeats of at least allele length\nten, which were identified using mreps [19]. This resulted in a panel\ncontaining 1733646 repeat loci in the GRCh38 human reference\ngenome (Supplementary Figure 1A). The total region length of\nmost of these repeats was relatively short, with only 3.38% being\nlonger than 30bp in the reference assembly (Supplementary Figure\n1B). This suggests that — barring large expansions — the vast\nmajority of repeat loci in this panel should be resolvable with\nshort sequencing reads.\nWe created a novel STR referene panel for the DH-Pahang v4\nbanana reference genome [20]. This was also done using mreps [19],\nsetting command line arguments such that perfect repeats with\nperiods one through six were reported. We subsequently filtered\nmreps output using custom Python scripts to retain only perfect\nSTRs with at least allele length ten, six, four, three, three, and\nthree for STRs with period one through six, respectively. This\nyielded a reference panel of 183345 STR loci across the 11 main\nchromosomes in the DH-Pahang v4 reference.\nHG002 benchmark\nSTR genotyping tools were benchmarked using haplotypes\nprovided by the telomere-to-telomere (T2T) consortium’s Q100\nproject [21, 22]. The Q100 project provides high-quality, phased\nhaplotypes of the HG002 cell line which have been used previously\nto benchmark STR variant calls [22]. To obtain ground-truth allele\nlengths for loci in our STR reference panel in the HG002 cell\nline, the Q100 haplotypes were mapped to the GRCh38 reference\ngenome using minimap2 [23]. The resulting PAF file was parsed\nto find a ground-truth STR allele lengths in GRCh38 coordinate\nspace. The HG002 allele length could be recovered for 1695865\nSTR loci in our panel (97.82% of the total).\nSubsequently, STR variant callers were used to genotype the\nSTR reference panel in an alignment of 2x250 Illumina whole-\ngenome sequencing reads of HG002, which is available through\nGenome in a Bottle [24]. STR allele lengths generated by the\ndifferent variant callers were compared to the ground-truth allele\nlengths derived from the Q100 haplotypes. Genotyping accuracy\nwas calculated by determining the fraction of loci for which the\nbiallelic genotypes reported by a variant caller exactly matched\nthe allele lengths observed in the Q100 haplotypes.\nSimulating trisomy 21\nWe simulated 2x150 paired-end reads from chromosome 21 of\nthe maternal and paternal haplotypes of HG002, as well as\nGRCh38. Since we were not interested in modelling sequencing\nerrors, we simulated error-free, paired-end reads to a depth of\ncoverage of 15X for each of the three haplotypes using wgsim\n(https://github.com/lh3/wgsim). Simulated reads from the three\nhaplotypes were then combined to form a 45X sequencing sample\nof a triploid chromosome 21. These reads were mapped back to\nthe GRCh38 reference genome using minimap2 [23].\nMusa acuminata whole-genome sequencing data\nThe M. acuminata sequencing data used here consist of two\nsequencing experiments of the same organism, one performed on\nan Illumina HiSeq1500 machine and the other on an Illumina\nNextSeq500 [13]. We downloaded all sequencing reads (European\nNucleotide Archive, study PRJEB33317) and combined outputs of\nsequencing runs into two FASTQ files, one for the HiSeq1500, one\nfor the NextSeq500. The two FASTQ files were mapped to the DH-\nPahang v4 reference genome [20] using minimap2 [23], removing\nimproper pairs, duplicate alignments, and low-quality alignments.\nThese alignments will be referred to as the ‘HiSeq1500 alignment’\nand the ‘NextSeq500 alignment’. Subsequently, the HiSeq1500 and\nNextSeq500 alignments were concatenated to form the ‘merged\nalignment’.\nColorectal cancer whole-genome sequencing data\nWe obtained WGS data of a patient-derived cancer CRC\ntumoroid generated as a part of a previously published mutation\naccumulation experiment [14]. These data are available through\nthe European Genome-phenome Archive under accession number\nEGAD50000000411. Briefly, this experiment was set up so that\nindividual cells were isolated from a CRC tumoroid [25] and\nallowed to grow for six weeks. At the six week mark, WGS was\nperformed on each clone to obtain a high-quality representation of\nthe genome of the individually isolated cells. Subsequently, clones\nwere repeatedly bottlenecked to 100 cells every two weeks for six\nmonths, followed by WGS of the resulting clones [14].\nAs a demonstration of ConSTRain’s applicability to cancer\nsequencing data we analysed four WGS samples from a single\nmicrosatellite instable tumoroid. The first of these samples was\ntaken from the original tumoroid line and the other samples\nrepresented three different clones (01-0, 05-0, and 07-0) sequenced\nafter six weeks of growth. For each sample, CNA calls generated\nby Sequenza were available [14, 26]. These CNA calls indicated\nthat while the original tumoroid line and the 05-0 clone were\ndiploid, the 01-0 and 07-0 clones had undergone whole-genome\nduplications and were tetraploid. We ran ConSTRain on all four\nsamples, providing the appropriate Sequenza CNA calls each\ntime. Then, we computed pairwise STR-based distances between\nsamples based on the genotypes returned by ConSTRain. We\nlimited this analysis to high confidence STR genotypes where\nthe unit size was between three and six and the normalised\ndepth of coverage was at least 5. For comparisons between\ndiploid and tetraploid samples all genotypes in the diploid sample\nwere artificially duplicated before performing comparisons. This\nmeant that the diploid genotype [10, 10] would be represented as\n[10, 10, 10, 10], and [8,9] as [8, 8, 9, 9], etc. Loci that were annotated\nwith a different copy number in the two samples of a pair were\nnot considered when calculating pairwise distances. The impact\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 5\nof this filter varied depending on which two samples were being\ncompared: when comparing the two 2n samples 4.81% of loci did\nnot have the same copy number, whereas up to 26.85% of loci\nhad to be removed when comparing 4n samples. This is likely\ndue to the fact that accurately calling copy number levels from\nsequencing data is a difficult task, especially for higher copy\nnumbers. Pairwise sample distances were calculated by taking the\nsum of Manhattan distances between STR genotypes for all loci\nwith a high confidence call in both samples, normalising by the\ntotal number of compared loci. This resulted in a distance between\nsamples with a unit of ‘average difference in allele length per locus’.\nResults\nConSTRain accurately genotypes STRs in euploid human\nsequencing data\nWe first evaluated ConSTRain’s performance when analysing\nsequencing data from a euploid human genome. We ran\nConSTRain with default parameters on 100X short-read WGS\ndata of the HG002 human cell line. Using high-quality HG002\nassemblies as ground truth, we determined ConSTRain’s accuracy\n(Figure 2A&B, Supplementary Figure 3). ConSTRain returned\nallele length estimates for 1655655 out of the 1695865 repeat loci\n(97.63%) for which a ground truth was available. For 95.25% of\nthese, the allele length(s) returned by ConSTRain exactly matched\nthose of the ground truth.\nNext, we investigated if accuracy could be increased by\nfiltering STR loci based on their normalised depth (see Methods).\nTo this end, we generated the distribution of normalised read\ndepths shown in Figure 2A. The distribution is left-skewed.\nUpon further investigation, we found that this was caused by\nmononucleotide repeats, whereas the distribution for repeats with\nhigher periods followed a normal distribution (Supplementary\nFigure 2). Therefore, we decided to not consider mononucleotide\nrepeats when defining our filter parameters, and instead used\nthe distribution of loci with periods greater than one. In this\ndistribution, we set bounds such that they excluded loci that fell\nin the lowest 2.5% and highest 2.5% of normalised depth values.\nThese bounds were then used as parameters to rerun ConSTRain\non the VCF (including mononucleotide repeats) previously created\nfrom the HG002 alignment. Additionally, we filtered out loci that\noverlapped known segmental duplications in the human genome.\nTogether, this decreased the number of called loci to 1393426\n(82.17% of the total), but increased the accuracy to 98.28%\n(Figure 2B, Supplementary Figure 3).\nHaving found that ConSTRain was able to accurately\ndetermine repeat allele lengths in a 100X short-read sequencing\nalignment, we wanted to see how it performed on samples with\nlower sequencing depths. To examine this, we downsampled the\nHG002 alignment to 30X and 10X depth of coverage. The accuracy\nof unfiltered allele length calls was 94.51% and 93.06% for the\n30X and 10X alignments, respectively (Figure 2B). Importantly,\nnormalised depth-based filtering of loci proved to be effective for\nthe downsampled alignments as well. After filtering the genotyping\naccuracy rose to 96.65% and 94.75% for the 30X and 10X\nalignments, respectively (Figure 2B).\nConSTRain’s accuracy is competitive with existing STR variant\ncallers\nNext, we sought to compare ConSTRain’s performance to that\nof other STR variant callers. We ran GangSTR and HipSTR\non the 100X HG002 alignment using the same STR reference\npanel used for ConSTRain and again compared reported allele\nlengths to the ground truth haplotypes. For both tools, we\nanalysed their unfiltered outputs, as well as the outputs filtered\naccording to instructions in the respective tool’s documentation.\nThe results are shown in Table 1. The accuracy of the filtered\noutputs of the three methods are very similar: ConSTRain had an\naccuracy of 98.28%, GangSTR 97.69%, and HipSTR 97.74%. A\nnotable difference between the three methods was that HipSTR\ncalled substantially fewer loci (69.28% of the STR reference\npanel) than both ConSTRain (82.17%) and GangSTR (80.95%).\nAnother major difference was that ConSTRain had a much lower\nruntime than the other two tools. When running single-threaded,\nConSTRain was around 2.2 times as fast as GangSTR, and 1.8\ntimes as fast as HipSTR. Moreover, ConSTRain supports running\non multiple threads. With 32 threads, ConSTRain took 19 minutes\nand 31 seconds to genotype our reference panel of over 1.7 × 106\nloci in the 100X HG002 alignment, making it 45.8 times as fast as\nGangSTR and 36.9 times as fast as HipSTR in this benchmark.\nConSTRain resolves STR genotypes in a simulated trisomy 21\nsample\nHaving demonstrated that ConSTRain accurately recovers STR\nallele lengths from sequencing data of a diploid genome, we were\ncurious to see how it performed at other copy numbers. We\nmimicked a trisomy 21 event by simulating short sequencing reads\nfrom three different assemblies of the human chromosome 21 and\nmapping them to GRCh38 (see Methods). We ran ConSTRain on\nthe resulting alignment, specifying that the ploidy of chromosome\n21 was three. After filtering, ConSTRain was able to estimate a\ngenotype for 18241 out of the 21482 loci located on chr21 in our\nreference panel. At 13465 of these, the ground truth consisted\nof one distinct allele length (genotype AAA) across the three\ninput assemblies. For 3923 and 853 loci two (genotype AAB)\nand three (genotype ABC) distinct allele lengths were present in\nthe input haplotypes, respectively. Overall, ConSTRain reported\nthe correct genotype for 98.39% of loci. Accuracy depended on\nthe number of distinct alleles at a locus: ConSTRain reported\nthe correct genotype for all loci with genotype AAA, 94.88%\nof loci with genotype AAB, and 92.76% of loci with genotype\nABC. Similar to the HG002 benchmark, the majority of errors\nwere because the distribution of generated allele lengths was not\nrepresentative of the underlying genotype. This is possible because\nthere was some stochasticity in the simulation of sequencing reads\nfrom reference haplotypes with regards to the depth of coverage\nalong the input sequence. For example, ConSTRain reported an\nincorrect genotype for a mononucleotide repeat for which the\ngenotype across the three haplotypes consisted of two alleles of\nlength 19 and one of length 20. For this locus, 35 spanning reads\nwith allele length 19 had been generated, and only three reads\nwith allele length 20. Therefore, ConSTRain reported a genotype\nconsisting of three alleles of length 19. Similar observations were\nmade for other incorrectly called loci.\nWe were also interested to see what genotypes HipSTR and\nGangSTR would report in this setting. We ran both variant callers\non the alignment of simulated reads without filtering and parsed\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n6 Verbiest et al.\ntheir outputs. Both tools reported homozygous genotypes for all\nrepeat loci with genotype AAA. For loci with genotype AAB they\nusually reported a genotype that was homozygous for the more\nabundant of the two alleles, missing the other allele length. In a\nsubset of these loci (7.53% for GangSTR, 6.49% for HipSTR) a\nheterozygous AB genotype was reported. Finally, for the loci with\ngenotype ABC, both tools almost always reported heterozygous\ngenotypes containing two of the three alleles. Which of the three\nalleles at a locus was ignored seemed to be dictated by which allele\nwas represented by the fewest reads.\nConSTRain accounts for CNAs in a triploid Musa acuminata\nGiven that we could accurately resolve STR genotypes in\nsimulated reads of a triploid chromosome, we were curious how\nConSTRain performed on a real polyploid sample. We obtained\nWGS reads from a M. acuminata Dwarf Cavendish banana, which\nis a triploid, and mapped them to the DH-Pahang v4 reference\ngenome [20]. This particular sample was reported to have a\nduplication of around 6 megabases on the long arm of chromosome\n02 [13], making it an even more relevant test case for ConSTRain.\nThere was no ground truth available for STR genotypes\nin this analysis. However, there were two separate sequencing\nexperiments performed for the same sample (see Methods). Ideally,\nSTR genotypes reported by ConSTRain should be consistent\nbetween the two samples [27]. We tested for consistency by\nrunning ConSTRain on the NextSeq500 and HiSeq1500 alignments\nseparately (including coordinates of the chr02 duplication), and\ncomparing STR genotypes between the two outputs (Figure 3A).\nWe considered only genotypes that were exactly the same between\nthe two outputs to be consistent. Initially, we again set the\nminimum and maximum normalised depth values such that 2.5%\nof the lowest and 2.5% of the highest depth loci with periods\n> 1 were excluded from each sample. Even with these rather\nlenient filter parameters, genotype calls for STRs with periods\n> 2 were consistent at over 90% of loci (Figure 3A). Genotype\ncalls were much less consistent for mononucleotide (74.29% of calls\nconsistent) and dinucleotide (63.66% of calls consistent) repeats,\nhowever. Consistency between samples could be increased by\nraising the minimum normalised depth value, also reaching around\n90% for mono- and dinucleotide repeats when using a threshold of\n10. or 15. (Figure 3A).\nNext, we genotyped our banana STR reference panel using\nthe merged alignment (see Methods), specifying that three copies\nexisted of each chromosome but without providing coordinates for\nthe chr02 duplication. ConSTRain ran in 70 seconds on 16 threads\nand reported genotypes for 153167 out of 183345 STRs in the panel\nbefore filtering. Afterwards, we ran ConSTRain on the resulting\nVCF file, this time providing coordinates of the duplicated region\n(Supplementary Figure 4). We found that 2699 of the genotyped\nSTR loci were located in the duplicated region. Figure 3B shows\nthe distribution of normalised depths of coverage reported by\nConSTRain across STRs. The normalised depth distribution for\nSTRs in the duplicated region is shown separately — both before\nand after providing duplication coordinates to ConSTRain. The\nmean normalised depth for the non-amplified STRs was 18.84,\nwhile the mean normalised depth for the amplified STRs was\n27.29. This difference in mean normalised depth is roughly in\nline with a tetraploid region being mistakenly analysed as triploid\n(theoretically, normalised depth at tetraploid loci should be 1 1\n3\ntimes that of the normalised depth in triploid loci in this case).\nFurthermore, the distribution of normalised depths reported for\namplified STRs when the duplication coordinates were provided to\nConSTRain largely overlapped the distribution for loci in the rest\nof the genome (Figure 3B). This highlights an additional benefit\nof setting filter parameters based on the normalised depth of loci\nobserved in a sample: using bounds such that 2.5% of the lowest\nand 2.5% of the highest normalised depth loci with periods > 1\nwere excluded genome wide, 27.04% of STRs in the duplicated\nregion were excluded when running ConSTRain without CNA\ninformation. When CNA information was provided to ConSTRain,\nhowever, only 3.44% of duplicated loci were excluded, since their\nnormalised depth values were much more in line with the rest\nof the genome. This indicates that when CNA information is not\navailable or is incorrect, a substantial portion of loci with incorrect\ncopy number values may be excluded by filtering on normalised\ndepth of coverage. This effect is expected to be even stronger when\nthe difference between the annotated and true copy number of loci\nis larger.\nAfter filtering, ConSTRain reported genotypes for 148532\nSTRs, 2612 of which were located in the duplicated region\non chr02. At 55.15% of the triploid loci ConSTRain reported\none distinct allele, 33.02% had two disctinct allele lengths, and\n11.83% had three distinct alleles (Supplementary Figure 5). This\ndistribution was very similar for loci located in the duplicated\nregion on chr02 (Supplementary Figure 5). Interestingly, we also\nobserved 31 loci with four distinct alleles in the duplicated region,\n18 of which had a normalised depth of coverage ≥ 10. While this\nwas only a small fraction of loci in the duplicated region, it suggests\nthat some STR loci may have mutated after the duplication event.\nConSTRain resolves STR genotypes in whole-genome\nduplicated colorectal cancer\nFinally, we were curious to see whether ConSTRain can be used to\nstudy STRs in cancer sequencing data. To this end, we obtained\nfour WGS samples that were derived from a microsatellite instable\nCRC tumoroid (see Methods) [14]. Of the four WGS samples,\none was taken directly from the original tumoroid line. The other\nthree samples represented clones 01-0, 05-0, and 07-0 that had\nbeen grown from single cells taken from the original tumoroid\nline, where clones 01-0 and 07-0 had undergone whole-genome\nduplications.\nWe ran ConSTRain on these four samples, providing CNA\ninformation each time. Next, we calculated STR-based pairwise\ndistances between samples by comparing the STR genotypes (see\nMethods). Even though the four samples were all derived from\nthe same tumoroid with only six weeks between the isolation\nof individual cells and the sequencing of the resulting clones,\nthere were already some differences in STR genotypes between\nsamples (Figure 4). The smallest distance was observed between\nthe original tumoroid line and the diploid clone 05-0. The two\ntetraploid clones were more distinct from the original tumoroid\nline, with the STR-based distances between the original tumoroid\nline and both tetraploid clones being roughly similar. The largest\npairwise distance we observed across this dataset was between the\nSTR genotypes of the two tetraploid clones (Figure 4).\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 7\nDiscussion\nTo the best of our knowledge, ConSTRain is the first STR variant\ncaller that enables rapid and accurate analyses of microsatellites\nwhile accounting for copy number alterations and polyploidy.\nOn a benchmark of a euploid human genome ConSTRain\nreached a genotyping accuracy of 98.28%, which was competitive\nwith state-of-the-art STR variant callers. ConSTRain was faster\nthan the two other STR variant callers included in our analyses,\neven when running single threaded. ConSTRain’s runtime can be\neasily reduced even further by using multiple compute threads:\nfor example, when running on 32 threads, ConSTRain genotyped\nover 1 .7 × 106 repeats in under 20 minutes from an alignment\nof 100X WGS reads. We showed that ConSTRain’s genotyping\naccuracy remained high for a simulated trisomy 21 event, even\nat loci with three distinct alleles. Then, we demonstrated\nConSTRain’s ability to call STR genotypes from sequencing data\nof a polyploid M. acuminata Dwarf Cavendish banana while\naccounting for a large amplified region on chromosome 02. The\nfinal analysis we presented here focused on four WGS samples from\nan MSI CRC tumoroid. Interestingly, two of these samples had\nundergone whole-genome duplications and were tetraploid. This\nis uncharacteristic for microsatellite instable tumours, which are\ntypically chromosomally stable [14]. When we determined STR-\nbased pairwise distances between all four samples, we found that\nthe comparison between the two tetraploid clones yielded the\nlargest distance in this dataset. This could indicate that these\ntwo clones are derived from different lineages within the tumour,\nmeaning that at least two separate whole-genome duplication\nevents occurred. While a more in-depth analysis is needed to\ndetermine this for certain, we have shown that ConSTRain could\nbe used for such a study.\nAs with any method, it is important to be aware of\nConSTRain’s limitations. In our HG002 benchmark and trisomy\n21 analyses, we observed two main sources of errors. First, for\nsome STR loci the observed allele length distribution strongly\ndeviated from the underlying genotype, with some alleles being\nover- or underrepresented in the distribution. This was the largest\nsource of errors in the HG002 benchmark (79.65% of errors were\nof this type). We do not consider this an inherent failing of\nConSTRain’s genotype estimation approach: it reported the most\nlikely genotype based on the observed allele length distribution in\neach case. However, it does highlight that an incorrect inference\nwill be made if the observed read distribution strongly deviates\nfrom the underlying genotype. This is an issue for variant calling\nin general, and not specific to our method.\nThe second source of errors in the HG002 benchmark was due\nto rare instances where STR loci had an insertion or deletion\nthat did not consist of an addition or removal of one or more\ncomplete repeat units. Similar to other STR variant callers\n[6, 7], ConSTRain only considers in-phase indel mutations, and\nthus does not return the correct allele lengths at such loci.\nThe fact that this was observed at only 4873 out of 1393426\nrepeats in the filtered 100X benchmark suggests that this heuristic\nis reasonable. This is an opportunity for future extensions or\nupdates to ConSTRain, however. To address out-of-phase indels\nConSTRain’s core genotype inference approach would not need\nto be updated, but it would mean that the full sequence in each\nread mapping to repeat loci needs to be resolved. This is likely\nto increase runtimes compared to the current implementation,\nalthough it is impossible to say in advance by how much.\nAnother potential source of issues is the use of external\nCNA information, which ConSTRain does not validate. If\nincorrect information is provided, it is likely to result in incorrect\ngenotype calls. Our M. acuminata analysis suggests this may be\nmitigated by setting filtering parameters based on the distribution\nof normalised depth values observed across loci in a sample\n(Figure 3B). An alternative approach could be to estimate the\ncopy number of repeat loci as part of the method itself. However,\nsince it is a non-trivial task and many existing tools are available,\nwe decided not to implement this functionality in ConSTRain at\nthis point.\nRecent years have seen an increased appreciation of the\nregulatory roles STR variants play in human physiology and\ndisease. We believe ConSTRain unlocks the extension of such\nfindings to other settings and organisms. By studying how STR\nvariability interacts with other structural variants, we hope to\nlearn more about the role these highly variable genomic elements\nplay in the context of cancer. Furthermore, some of the crops\nupon which we rely most for our food production have polyploid\ngenomes. By leveraging ConSTRain to analyse microsatellites in\nthese species we may discover more about how their genomes and\nphenotypes are regulated.\nData Availability\nSource code and precompiled binaries for ConSTRain, as well\nas STR reference panels, are available on the ConSTRain\nGitHub page: https://github.com/acg-team/ConSTRain. Scripts\nand notebooks that were used to perform analyses and generate\nvisualisations included in this manuscript are available in\na separate GitHub repository: https://github.com/acg-team/\nConSTRain-analyses/tree/main. HG002 Q100 haplotypes can be\ndownloaded according to instructions on the T2T consortium’s\nHG002 Q100 GitHub page: https://github.com/marbl/HG002/\ntree/main. The aligned Illumina sequencing reads for the HG002\ncell line are hosted by NCBI and can be downloaded fromhttps://\nftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_\nNA24385_son/NIST_Illumina_2x250bps/novoalign_bams/. The Musa\nacuminata sequencing data are hosted by the European Nucleotide\nArchive under study accession number PRJEB33317. The CRC\ntumoroid sequencing data are hosted by the European Genome-\nphenome Archive under accession number EGAD50000000411.\nFunding\nThis work was supported by the Swiss National Science\nFoundation [Sinergia CRSII5\n193832 to M.A.]; the Horizon 2020\nMarie Sk lodowska-Curie research and innovation program [823886\nto M.A.]; and the Associazione Italiana per la Ricerca sul Cancro\n[5x1000 grant 21091 to A.B.].\nConflict of Interest Disclosure\nThe authors declare no conflict of interest.\nReferences\n1. Max A. Verbiest, M. Maksimov, Y. Jin, M. Anisimova,\nM. Gymrek, and T. Bilgin Sonay. Mutation and selection\nprocesses regulating short tandem repeats give rise to\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n8 Verbiest et al.\ngenetic and phenotypic diversity across species. Journal of\nEvolutionary Biology, 36(2):321–336, 2023.\n2. Stephanie Feupe Fotsing, Jonathan Margoliash, Catherine\nWang, Shubham Saini, Richard Yanicky, Sharona Shleizer-\nBurko, Alon Goren, and Melissa Gymrek. The impact of short\ntandem repeat variation on gene expression. Nature Genetics,\n51(11):1652–1659, 2019.\n3. Yirong Shi, Yiwei Niu, Peng Zhang, Huaxia Luo, Shuai\nLiu, Sijia Zhang, Jiajia Wang, Yanyan Li, Xinyue Liu,\nTingrui Song, Tao Xu, and Shunmin He. Characterization of\ngenome-wide STR variation in 6487 human genomes. Nature\nCommunications, 14(1):2092, 2023.\n4. Max A. Verbiest, Oxana Lundstr¨ om, Feifei Xia, Michael\nBaudis, Tugce Bilgin Sonay, and Maria Anisimova. Short\ntandem repeat mutations regulate gene expression in colorectal\ncancer. Scientific Reports, 14(1):3331, 2024.\n5. Thomas Willems, Dina Zielinski, Jie Yuan, Assaf Gordon,\nMelissa Gymrek, and Yaniv Erlich. Genome-wide profiling\nof heritable and de novo STR variations. Nature Methods ,\n14(6):590–592, 2017.\n6. Egor Dolzhenko, Viraj Deshpande, Felix Schlesinger, Peter\nKrusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius,\nAndrew Gross, Giuseppe Narzisi, Brett Bowman, Konrad\nScheffler, Joke J F A van Vugt, Courtney French, Alba\nSanchis-Juan, Kristina Ib´ a˜ nez, Arianna Tucci, Bryan R Lajoie,\nJan H Veldink, F Lucy Raymond, Ryan J Taft, David R\nBentley, and Michael A Eberle. ExpansionHunter: a sequence-\ngraph-based tool to analyze variation in short tandem repeat\nregions. Bioinformatics, 35(22):4754–4756, 2019.\n7. Nima Mousavi, Sharona Shleizer-Burko, Richard Yanicky,\nand Melissa Gymrek. Profiling the genome-wide landscape\nof tandem repeat expansions. Nucleic Acids Research ,\n47(15):e90–e90, 2019.\n8. Hope A. Tanudisastro, Ira W. Deveson, Harriet Dashnow,\nand Daniel G. MacArthur. Sequencing and characterizing\nshort tandem repeats in the human genome. Nature Reviews\nGenetics, pages 1–16, 2024.\n9. Liwen Hu, Xinyue Yao, Hairong Huang, Zhong Guo, Xi Cheng,\nYang Xu, Yi Shen, Biao Xu, and Demin Li. Clinical\nsignificance of germline copy number variation in susceptibility\nof human diseases. Journal of Genetics and Genomics = Yi\nChuan Xue Bao, 45(1):3–12, 2018.\n10. Rameen Beroukhim, Craig H. Mermel, Dale Porter, Guo\nWei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina,\nJesse S. Boehm, Jennifer Dobson, Mitsuyoshi Urashima,\nKevin T. Mc Henry, Reid M. Pinchback, Azra H. Ligon,\nYoon-Jae Cho, Leila Haery, Heidi Greulich, Michael Reich,\nWendy Winckler, Michael S. Lawrence, Barbara A. Weir,\nKumiko E. Tanaka, Derek Y. Chiang, Adam J. Bass, Alice\nLoo, Carter Hoffman, John Prensner, Ted Liefeld, Qing Gao,\nDerek Yecies, Sabina Signoretti, Elizabeth Maher, Frederic J.\nKaye, Hidefumi Sasaki, Joel E. Tepper, Jonathan A. Fletcher,\nJosep Tabernero, Jos´ e Baselga, Ming-Sound Tsao, Francesca\nDemichelis, Mark A. Rubin, Pasi A. Janne, Mark J. Daly,\nCarmelo Nucera, Ross L. Levine, Benjamin L. Ebert, Stacey\nGabriel, Anil K. Rustgi, Cristina R. Antonescu, Marc Ladanyi,\nAnthony Letai, Levi A. Garraway, Massimo Loda, David G.\nBeer, Lawrence D. True, Aikou Okamoto, Scott L. Pomeroy,\nSamuel Singer, Todd R. Golub, Eric S. Lander, Gad Getz,\nWilliam R. Sellers, and Matthew Meyerson. The landscape of\nsomatic copy-number alteration across human cancers.Nature,\n463(7283):899–905, 2010.\n11. Sarah P. Otto and Jeannette Whitton. Polyploid incidence\nand evolution. Annual Review of Genetics , 34(Volume 34,\n2000):401–437, 2000.\n12. Yves Van de Peer, Eshchar Mizrachi, and Kathleen Marchal.\nThe evolutionary significance of polyploidy. Nature Reviews\nGenetics, 18(7):411–424, 2017.\n13. Mareike Busche, Boas Pucker, Prisca Vieh¨ over, Bernd\nWeisshaar, and Ralf Stracke. Genome Sequencing of Musa\nacuminata Dwarf Cavendish Reveals a Duplication of a Large\nSegment of Chromosome 2. G3 Genes |Genomes|Genetics,\n10(1):37–42, 2020.\n14. Elena Grassi, Valentina Vurchio, George D. Cresswell, Irene\nCatalano, Barbara Lupo, Francesco Sassi, Francesco Galimi,\nSofia Borgato, Martina Ferri, Marco Viviani, Simone Pompei,\nGianvito Urgese, Bingjie Chen, Eugenia R. Zanella, Francesca\nCottino, Alberto Bardelli, Marco Cosentino Lagomarsino,\nAndrea Sottoriva, Livio Trusolino, and Andrea Bertotti.\nHeterogeneity and evolution of DNA mutation rates in\nmicrosatellite stable colorectal cancer, 2024.\n15. Sophie F. Roerink, Nobuo Sasaki, Henry Lee-Six, Matthew D.\nYoung, Ludmil B. Alexandrov, Sam Behjati, Thomas J.\nMitchell, Sebastian Grossmann, Howard Lightfoot, David A.\nEgan, Apollo Pronk, Niels Smakman, Joost van Gorp,\nElizabeth Anderson, Stephen J. Gamble, Chris Alder, Marc\nvan de Wetering, Peter J. Campbell, Michael R. Stratton, and\nHans Clevers. Intra-tumour diversification in colorectal cancer\nat the single-cell level. Nature, 556(7702):457–462, 2018.\n16. James K Bonfield, John Marshall, Petr Danecek, Heng Li,\nValeriu Ohan, Andrew Whitwham, Thomas Keane, and\nRobert M Davies. HTSlib: C library for reading/writing\nhigh-throughput sequencing data. GigaScience, 10(2):giab007,\n2021.\n17. Johannes K¨ oster. Rust-Bio: a fast and safe bioinformatics\nlibrary. Bioinformatics, 32(3):444–446, 2016.\n18. OEIS Foundation Inc. The On-Line Encyclopedia of Integer\nSequences, 2024.\n19. Roman Kolpakov, Ghizlane Bana, and Gregory Kucherov.\nmreps: efficient and flexible detection of tandem repeats in\nDNA. Nucleic Acids Research, 31(13):3672–3678, 2003.\n20. Caroline Belser, Franc-Christophe Baurens, Benjamin Noel,\nGuillaume Martin, Corinne Cruaud, Benjamin Istace, Nabila\nYahiaoui, Karine Labadie, Eva Hˇ ribov´ a, Jaroslav Doleˇ zel,\nArnaud Lemainque, Patrick Wincker, Ang´ elique D’Hont, and\nJean-Marc Aury. Telomere-to-telomere gapless chromosomes\nof banana using nanopore sequencing. Communications\nBiology, 4(1):1–12, 2021.\n21. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen,\nAndrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger,\nNicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey\nAganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A.\nLogsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew\nBorchers, Gerard G. Bouffard, Shelise Y. Brooks, Gina V.\nCaldas, Nae-Chyun Chen, Haoyu Cheng, Chen-Shan Chin,\nWilliam Chow, Leonardo G. de Lima, Philip C. Dishuck,\nRichard Durbin, Tatiana Dvorkina, Ian T. Fiddes, Giulio\nFormenti, Robert S. Fulton, Arkarachai Fungtammasan, Erik\nGarrison, Patrick G. S. Grady, Tina A. Graves-Lindsay,\nIra M. Hall, Nancy F. Hansen, Gabrielle A. Hartley,\nMarina Haukness, Kerstin Howe, Michael W. Hunkapiller,\nChirag Jain, Miten Jain, Erich D. Jarvis, Peter Kerpedjiev,\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 9\nMelanie Kirsche, Mikhail Kolmogorov, Jonas Korlach, Milinn\nKremitzki, Heng Li, Valerie V. Maduro, Tobias Marschall,\nAnn M. McCartney, Jennifer McDaniel, Danny E. Miller,\nJames C. Mullikin, Eugene W. Myers, Nathan D. Olson,\nBenedict Paten, Paul Peluso, Pavel A. Pevzner, David\nPorubsky, Tamara Potapova, Evgeny I. Rogaev, Jeffrey A.\nRosenfeld, Steven L. Salzberg, Valerie A. Schneider, Fritz J.\nSedlazeck, Kishwar Shafin, Colin J. Shew, Alaina Shumate,\nYing Sims, Arian F. A. Smit, Daniela C. Soto, Ivan Sovi´ c,\nJessica M. Storer, Aaron Streets, Beth A. Sullivan, Fran¸ coise\nThibaud-Nissen, James Torrance, Justin Wagner, Brian P.\nWalenz, Aaron Wenger, Jonathan M. D. Wood, Chunlin\nXiao, Stephanie M. Yan, Alice C. Young, Samantha Zarate,\nUrvashi Surti, Rajiv C. McCoy, Megan Y. Dennis, Ivan A.\nAlexandrov, Jennifer L. Gerton, Rachel J. O’Neill, Winston\nTimp, Justin M. Zook, Michael C. Schatz, Evan E. Eichler,\nKaren H. Miga, and Adam M. Phillippy. The complete\nsequence of a human genome. Science, 376(6588):44–53, 2022.\n22. Helyaneh Ziaei Jam, Justin M. Zook, Sara Javadzadeh,\nJonghun Park, Aarushi Sehgal, and Melissa Gymrek. LongTR:\ngenome-wide profiling of genetic variation at tandem repeats\nfrom long reads. Genome Biology, 25(1):176, 2024.\n23. Heng Li. New strategies to improve minimap2 alignment\naccuracy. Bioinformatics, 37(23):4572–4574, 2021.\n24. Justin M. Zook, David Catoe, Jennifer McDaniel, Lindsay\nVang, Noah Spies, Arend Sidow, Ziming Weng, Yuling Liu,\nChristopher E. Mason, Noah Alexander, Elizabeth Henaff,\nAlexa B. R. McIntyre, Dhruva Chandramohan, Feng Chen,\nErich Jaeger, Ali Moshrefi, Khoa Pham, William Stedman,\nTiffany Liang, Michael Saghbini, Zeljko Dzakula, Alex Hastie,\nHan Cao, Gintaras Deikus, Eric Schadt, Robert Sebra, Ali\nBashir, Rebecca M. Truty, Christopher C. Chang, Natali\nGulbahce, Keyan Zhao, Srinka Ghosh, Fiona Hyland, Yutao\nFu, Mark Chaisson, Chunlin Xiao, Jonathan Trow, Stephen T.\nSherry, Alexander W. Zaranek, Madeleine Ball, Jason Bobe,\nPreston Estep, George M. Church, Patrick Marks, Sofia\nKyriazopoulou-Panagiotopoulou, Grace X. Y. Zheng, Michael\nSchnall-Levin, Heather S. Ordonez, Patrice A. Mudivarti,\nKristina Giorda, Ying Sheng, Karoline Bjarnesdatter Rypdal,\nand Marc Salit. Extensive sequencing of seven human genomes\nto characterize benchmark reference materials.Scientific Data,\n3(1):160025, 2016.\n25. Simonetta M. Leto, Elena Grassi, Marco Avolio, Valentina\nVurchio, Francesca Cottino, Martina Ferri, Eugenia R.\nZanella, Sofia Borgato, Giorgio Corti, Laura di Blasio, Desiana\nSomale, Marianela Vara-Messler, Francesco Galimi, Francesco\nSassi, Barbara Lupo, Irene Catalano, Marika Pinnelli, Marco\nViviani, Luca Sperti, Alfredo Mellano, Alessandro Ferrero,\nCaterina C. Zingaretti, Alberto Puliafito, Luca Primo,\nAndrea Bertotti, and Livio Trusolino. XENTURION is a\npopulation-level multidimensional resource of xenografts and\ntumoroids from metastatic colorectal cancer patients. Nature\nCommunications, 15(1):7495, 2024.\n26. F. Favero, T. Joshi, A. M. Marquard, N. J. Birkbak,\nM. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. Sequenza:\nallele-specific copy number and mutation profiles from tumor\nsequencing data. Annals of Oncology, 26(1):64–70, 2015.\n27. Daniel P. Cooke, David C. Wedge, and Gerton Lunter.\nBenchmarking small-variant genotyping in polyploids.\nGenome Research, 32(2):403–408, 2022.\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n10 Verbiest et al.\nTables and Figures\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 11\nFig. 1: ConSTRain overview and example. (1) An STR locus are loaded from the input files. The locus reference information is parsed\nfrom the STR panel. The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNA. (2)\nReads overlapping the STR region are extracted from the alignment file, and the length of the STR region in each read is determined.\n(3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept. (4) This yields the final\nobserved allele length distribution. (5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G. (6)\nFrom G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR\ncopy number (3 in the example). Each row in D corresponds to the expected allele length distribution of one of the genotypes in G. (7)\nThe expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row\nin D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value. (9) The genotype in G\nwith the lowest error is selected (10) and reported in the output. The inferred genotype of the STR locus in this example consists of an\nallele of 4 CAG units (present once), an allele of 5 CAG units (present once), and an allele of 8 CAG units (also present once).\nTable 1. Results for ConSTRain, GangSTR, and HipSTR on the HG002 human benchmark. Data are shown before and after filtering the output of each\ntool. The total number of loci in the benchmark was 1695865. The percentage of this number that was called by each variant caller is shown in brackets\nin the ’Loci called’ columns. The best value in each column is printed in bold.\nUnfiltered Filtered\nMethod Runtime (hrs) Throughput (loci/s) Loci called (%) Accuracy Loci called (%) Accuracy\nConSTRain∗ 0.33 1480.48 1655655 (97.63) 0.9525 1393426 (82.17) 0.9828\nGangSTR 14.91 32.31 1654569 (97.56) 0.9513 1372842 (80.95) 0.9769\nHipSTR 11.99 40.15 1225530 (72.27) 0.9750 1174890 (69.28) 0.9774\n∗data for ConSTRain running on 32 threads are shown. Single-threaded runtime was 6.73 hours (71.50 loci/s)\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\n12 Verbiest et al.\nFig. 2: ConSTRain performance on Q100 benchmark. ( A) Distribution of normalised sequencing depth observed by ConSTRain across\n167114 repeat loci in the 100X HG002 WGS sample. The x-axis shows the sequencing depth normalised by the copy number of repeat\nloci. The left y-axis shows the accuracy of allele length calls (blue line and dots). The right y-axis shows the proportion of loci (grey\nhistogram). Note: only normalised depth values between 0 and 60 are shown for visual clarity. (B ) Accuracy of unfiltered and filtered\nConSTRain STR allele length calls for 100X WGS of HG002, as well as for the same sample downsampled to 30X and 10X depth of\ncoverage. Note: y-axis starts at 0.75.\nFig. 3: Genotyping STRs in a triploidM. acuminata sample with a large duplication on chr02.(A) Consistency of STR genotypes between\nthe HiSeq1500 and NextSeq500 samples for different normalised depth filtering thresholds. X-axis: STR period, y-axis: proportion of\nloci for which the inferred genotype matched exactly between the two alignments. (B) Distributions of the depth of coverage for STR\nloci normalised by copy number for STRs in the alignment of combined HiSeq1500 and NextSeq500 reads. The blue distribution shows\nthe normalised depths for loci not affected by CNAs. The orange distribution shows the normalised depth reported for loci in the chr02\nduplication when CNA information was not provided to ConSTRain. The green distribution shows normalised depth for the loci in the\nchr02 duplication when CNA information was provided to ConSTRain. Vertical dashed lines indicate filtering bounds that exclude the\n2.5% of loci with the highest and the 2.5% of loci with the lowest depth of coverage in the overall sample.\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint \n\nConSTRain 13\nFig. 4: Pairwise STR-based distances between four samples stemming from the same patient-derived tumoroid. Each cell represents\nthe comparison between two samples, with the colour and value of cells indicating the normalised distances between samples (average\ndifference in allele length per locus).\n.CC-BY 4.0 International licenseperpetuity. It is made available under a \npreprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nThe copyright holder for thisthis version posted December 17, 2024. ; https://doi.org/10.1101/2024.12.13.628141doi: bioRxiv preprint","source_license":"CC-BY-4.0","license_restricted":false}