Result
in expected allele length distributions that are impossible
under ConSTRain’s assumptions. Sorting the observed allele
length distribution is a way to reduce the combinatorial space of
possible genotypes: without doing this the number of genotypes
to generate for a locus would be equal to the number of weak
integer compositions of a size equal to the STR copy number. A
weak integer composition refers to the representation of an integer
as the sum of a sequence of non-negative integers. For a given
integer, the number of weak compositions of a specific size (i.e.,
the number of terms to represent the integer as) is given by:
$$\text{number of weak compositions} = \binom{n + k - 1}{n} \qquad (2)$$
where n is the integer and k is the composition size. For our
purposes n equals k equals the STR copy number. By sorting the
observed allele length distribution we instead only need to generate
a number of genotypes equal to the number of integer partitions of
the STR copy number. Integer partitions differ from compositions
in that the terms of the partition are not ordered, i.e., different
orders of the same terms are considered identical. No closed-form
solution is known to determine the number of partitions for an
integer, but Sloane’s sequence A000041 enumerates the number of
partitions for a range of integers [18]. Going back to the example
in Equation 1, we can use Equation 2 to calculate that
there exist ten weak integer compositions when n and k are both
three. On the other hand, A000041 tells us that there are three
integer partitions — a difference of seven. This may not seem very
impactful, but the difference becomes much more pronounced for
higher copy numbers: for n = 20 there exist more than 68.9 × 10^9
weak compositions of size 20, but only 627 partitions. Further,
given that an STR panel can contain hundreds of thousands of
loci (e.g., over 1.7 × 10^6 for the human genome), even small
optimisations make a difference in overall runtime.
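The two counts discussed above can be checked with a few lines of Python. This is a minimal sketch for illustration and not part of ConSTRain. The function names are hypothetical, and the partition count is computed with a standard dynamic programme rather than looked up in A000041.

```python
from math import comb


def weak_compositions(n: int, k: int) -> int:
    """Number of weak compositions of n into k non-negative terms (Equation 2)."""
    return comb(n + k - 1, n)


def partition_count(n: int) -> int:
    """Number of integer partitions of n (the quantity listed in OEIS A000041)."""
    # ways[t] holds the number of partitions of t using the part sizes seen so far.
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


# n equals k equals the STR copy number, as in the text.
for copy_number in (3, 20):
    print(copy_number,
          weak_compositions(copy_number, copy_number),
          partition_count(copy_number))
```

For a copy number of three this prints ten weak compositions and three partitions, matching the example in the text.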
Updating an existing VCF file
Besides the standard mode of running ConSTRain outlined above,
ConSTRain also supports reanalysing previously generated VCF
files. This may be useful if novel CNA information for a sample
becomes available after an input alignment has already been
analysed, or if it is necessary to adjust filtering parameters. It
also prevents having to re-download large alignment files from
remote repositories. This is possible because ConSTRain includes
the observed allele length distribution of each STR in a FORMAT
field of the output VCF. Since it is much faster to read the observed
allele length distribution from a VCF file than to extract it from
sequencing reads in an alignment, running ConSTRain in this
mode is typically a matter of seconds.
Filtering ConSTRain output
Genomic regions where the depth of coverage is lower or higher
than expected may indicate a large number of technical artifacts
for that region. This can lead to inaccurate variant calls. To
address this, ConSTRain allows for the filtering of STR loci based
on their normalised depth of coverage. The normalised depth is
calculated by dividing the number of mapped reads by the locus
copy number. This normalisation is important because the copy
number of a locus is expected to affect the depth of coverage.
When analysing an alignment of human male sequencing reads, for
instance, loci on the sex chromosomes are expected to have roughly
half the depth of coverage of loci on autosomes. Similar effects exist
for genomic regions that are amplified or deleted by structural
variants. Dividing the depth of coverage by the locus copy number
will force all loci to occupy the same range of normalised depth
values, which makes filtering more straightforward. The desired
minimum and maximum normalised depth values can be set at
the command line via the --min-norm-depth (default: 1.0) and
--max-norm-depth (not set by default) arguments, respectively.
These upper and lower bounds can be set manually to reasonable
values before running ConSTRain. Another option is to first run
ConSTRain without filters and then set bounds based on the
observed distribution of normalised sequencing depths across all
loci in the sample. This can help identify the range of acceptable
normalised depth values for a specific sample. Once the minimum
bioRxiv preprint doi: https://doi.org/10.1101/2024.12.13.628141; this version posted December 17, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
and maximum values are found, ConSTRain can be rerun on the
VCF file with the updated filtering parameters. Since running
ConSTRain on a VCF file is extremely fast (around 20 seconds
for 1733646 loci on a 2020 MacBook Pro) this only marginally
increases the overall computational workload. A Python script
to generate a distribution of normalised depth values from a
ConSTRain VCF file is included in the ConSTRain GitHub
repository.
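The normalised depth filter described above can be sketched as follows. This is an illustration under stated assumptions and not ConSTRain's implementation. The function names are hypothetical, the bounds are treated as inclusive, and loci with a copy number of zero are simply rejected. The default lower bound of 1.0 and the unset upper bound mirror the defaults given for --min-norm-depth and --max-norm-depth.

```python
from typing import Optional


def normalised_depth(n_reads: int, copy_number: int) -> float:
    """Number of mapped reads divided by the locus copy number."""
    return n_reads / copy_number


def passes_depth_filter(
    n_reads: int,
    copy_number: int,
    min_norm_depth: float = 1.0,
    max_norm_depth: Optional[float] = None,
) -> bool:
    """Return True if the locus falls within the normalised depth bounds."""
    if copy_number <= 0:
        # Assumption: a locus without any copies cannot be normalised.
        return False
    depth = normalised_depth(n_reads, copy_number)
    if depth < min_norm_depth:
        return False
    if max_norm_depth is not None and depth > max_norm_depth:
        return False
    return True


# An autosomal locus (two copies, 30 reads) and an X-linked locus in a male
# sample (one copy, 15 reads) end up with the same normalised depth of 15.
print(normalised_depth(30, 2), normalised_depth(15, 1))
```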
STR reference panels
ConSTRain needs a reference panel of STR loci to know where
STRs are located in the reference genome. The reference panel
that was used in all experiments involving human data reported
in this manuscript is based on the GRCh38 version 13 reference
panel provided by GangSTR [7]. While ConSTRain is primarily
aimed at genotyping STRs with periods between one and six,
the repeat panel provided by GangSTR contains a small number
(20481) of repeat loci with longer periods (up to 20), which
we did not remove. Furthermore, the GangSTR panel does not
contain mononucleotide repeats. We therefore extended this panel
to include perfect mononucleotide repeats with an allele length of at
least ten, which were identified using mreps [19]. This resulted in a panel
containing 1733646 repeat loci in the GRCh38 human reference
genome (Supplementary Figure 1A). The total region length of
most of these repeats was relatively short, with only 3.38% being
longer than 30bp in the reference assembly (Supplementary Figure
1B). This suggests that — barring large expansions — the vast
majority of repeat loci in this panel should be resolvable with
short sequencing reads.
We created a novel STR reference panel for the DH-Pahang v4
banana reference genome [20]. This was also done using mreps [19],
setting command line arguments such that perfect repeats with
periods one through six were reported. We subsequently filtered
mreps output using custom Python scripts to retain only perfect
STRs with minimum allele lengths of ten, six, four, three, three, and
three for STRs with periods one through six, respectively. This
yielded a reference panel of 183345 STR loci across the 11 main
chromosomes in the DH-Pahang v4 reference.
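The period-specific length thresholds can be expressed as a small lookup table. The sketch below is not the custom script used for the panel. The record layout (chromosome, start, period, allele length) and the function name are hypothetical, and the allele length is assumed to be expressed in the same unit as the thresholds in the text.

```python
# Minimum allele length per repeat period, as listed in the text.
MIN_ALLELE_LENGTH = {1: 10, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def keep_repeat(period: int, allele_length: int) -> bool:
    """Return True if a perfect repeat meets the threshold for its period."""
    minimum = MIN_ALLELE_LENGTH.get(period)
    # Periods outside one through six are not part of this panel.
    return minimum is not None and allele_length >= minimum


# Hypothetical records: (chromosome, start, period, allele_length).
records = [
    ("chr01", 1_000, 1, 12),  # kept: mononucleotide repeat, length >= 10
    ("chr01", 5_000, 2, 5),   # dropped: dinucleotide repeat, length < 6
    ("chr02", 9_000, 6, 3),   # kept: hexanucleotide repeat, length >= 3
]
panel = [rec for rec in records if keep_repeat(rec[2], rec[3])]
print(panel)
```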
HG002 benchmark
STR genotyping tools were benchmarked using haplotypes
provided by the telomere-to-telomere (T2T) consortium’s Q100
project [21, 22]. The Q100 project provides high-quality, phased
haplotypes of the HG002 cell line which have been used previously
to benchmark STR variant calls [22]. To obtain ground-truth allele
lengths for loci in our STR reference panel in the HG002 cell
line, the Q100 haplotypes were mapped to the GRCh38 reference
genome using minimap2 [23]. The resulting PAF file was parsed
to find ground-truth STR allele lengths in GRCh38 coordinate
space. The HG002 allele length could be recovered for 1695865
STR loci in our panel (97.82% of the total).
Subsequently, STR variant callers were used to genotype the
STR reference panel in an alignment of 2x250 Illumina whole-
genome sequencing reads of HG002, which is available through
Genome in a Bottle [24]. STR allele lengths generated by the
different variant callers were compared to the ground-truth allele
lengths derived from the Q100 haplotypes. Genotyping accuracy
was calculated by determining the fraction of loci for which the
biallelic genotypes reported by a variant caller exactly matched
the allele lengths observed in the Q100 haplotypes.
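The accuracy measure can be illustrated with a short sketch. It rests on two assumptions that the text does not spell out: the two alleles of a genotype are compared without regard to their order, and the denominator is the set of loci that have both a call and a ground-truth genotype. The function name and the data layout are hypothetical.

```python
from typing import Dict, Tuple

Genotype = Tuple[int, int]


def genotyping_accuracy(calls: Dict[str, Genotype], truth: Dict[str, Genotype]) -> float:
    """Fraction of compared loci where the called genotype exactly matches the truth."""
    compared = [locus for locus in calls if locus in truth]
    if not compared:
        raise ValueError("no loci shared between calls and ground truth")
    # Sorting makes the comparison independent of allele order (an assumption).
    matches = sum(sorted(calls[locus]) == sorted(truth[locus]) for locus in compared)
    return matches / len(compared)


calls = {"locus_a": (10, 12), "locus_b": (8, 8), "locus_c": (5, 6)}
truth = {"locus_a": (12, 10), "locus_b": (8, 9), "locus_c": (5, 6)}
print(genotyping_accuracy(calls, truth))  # two of three loci match exactly
```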
Simulating trisomy 21
We simulated 2x150 paired-end reads from chromosome 21 of
the maternal and paternal haplotypes of HG002, as well as
GRCh38. Since we were not interested in modelling sequencing
errors, we simulated error-free, paired-end reads to a depth of
coverage of 15X for each of the three haplotypes using wgsim
(https://github.com/lh3/wgsim). Simulated reads from the three
haplotypes were then combined to form a 45X sequencing sample
of a triploid chromosome 21. These reads were mapped back to
the GRCh38 reference genome using minimap2 [23].
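Read simulators are commonly given a number of read pairs rather than a target depth, so the depth has to be converted first. The arithmetic is sketched below. The function name is hypothetical, and the sequence length in the example is a round placeholder and not the actual length of any chromosome 21 haplotype.

```python
def read_pairs_for_coverage(coverage: float, sequence_length: int, read_length: int) -> int:
    """Number of read pairs needed to reach a target depth with paired-end reads."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Each pair contributes two reads of read_length bases.
    return round(coverage * sequence_length / (2 * read_length))


# Placeholder sequence length of 40 Mb, 2x150 reads, 15X per haplotype.
pairs_per_haplotype = read_pairs_for_coverage(15, 40_000_000, 150)
print(pairs_per_haplotype)

# Three haplotypes at 15X each give a combined depth of 45X.
print(3 * 15)
```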
Musa acuminata whole-genome sequencing data
The M. acuminata sequencing data used here consist of two
sequencing experiments of the same organism, one performed on
an Illumina HiSeq1500 machine and the other on an Illumina
NextSeq500 [13]. We downloaded all sequencing reads (European
Nucleotide Archive, study PRJEB33317) and combined outputs of
sequencing runs into two FASTQ files, one for the HiSeq1500, one
for the NextSeq500. The two FASTQ files were mapped to the DH-
Pahang v4 reference genome [20] using minimap2 [23], removing
improper pairs, duplicate alignments, and low-quality alignments.
These alignments will be referred to as the ‘HiSeq1500 alignment’
and the ‘NextSeq500 alignment’. Subsequently, the HiSeq1500 and
NextSeq500 alignments were concatenated to form the ‘merged
alignment’.
Colorectal cancer whole-genome sequencing data
We obtained WGS data of a patient-derived colorectal cancer (CRC)
tumoroid generated as a part of a previously published mutation
accumulation experiment [14]. These data are available through
the European Genome-phenome Archive under accession number
EGAD50000000411. Briefly, this experiment was set up so that
individual cells were isolated from a CRC tumoroid [25] and
allowed to grow for six weeks. At the six week mark, WGS was
performed on each clone to obtain a high-quality representation of
the genome of the individually isolated cells. Subsequently, clones
were repeatedly bottlenecked to 100 cells every two weeks for six
months, followed by WGS of the resulting clones [14].
As a demonstration of ConSTRain’s applicability to cancer
sequencing data we analysed four WGS samples from a single
microsatellite instable tumoroid. The first of these samples was
taken from the original tumoroid line and the other samples
represented three different clones (01-0, 05-0, and 07-0) sequenced
after six weeks of growth. For each sample, CNA calls generated
by Sequenza were available [14, 26]. These CNA calls indicated
that while the original tumoroid line and the 05-0 clone were
diploid, the 01-0 and 07-0 clones had undergone whole-genome
duplications and were tetraploid. We ran ConSTRain on all four
samples, providing the appropriate Sequenza CNA calls each
time. Then, we computed pairwise STR-based distances between
samples based on the genotypes returned by ConSTRain. We
limited this analysis to high confidence STR genotypes where
the unit size was between three and six and the normalised
depth of coverage was at least 5. For comparisons between
diploid and tetraploid samples all genotypes in the diploid sample
were artificially duplicated before performing comparisons. This
meant that the diploid genotype [10, 10] would be represented as
[10, 10, 10, 10], and [8, 9] as [8, 8, 9, 9], etc. Loci that were annotated
with a different copy number in the two samples of a pair were
not considered when calculating pairwise distances. The impact
of this filter varied depending on which two samples were being
compared: when comparing the two 2n samples 4.81% of loci did
not have the same copy number, whereas up to 26.85% of loci
had to be removed when comparing 4n samples. This is likely
due to the fact that accurately calling copy number levels from
sequencing data is a difficult task, especially for higher copy
numbers. Pairwise sample distances were calculated by taking the
sum of Manhattan distances between STR genotypes for all loci
with a high confidence call in both samples, normalising by the
total number of compared loci. This resulted in a distance between
samples with a unit of ‘average difference in allele length per locus’.
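The distance calculation can be sketched as follows. This is an illustration and not the analysis code used for the manuscript. It assumes that genotypes are stored as lists of allele lengths, that alleles are paired after sorting, and that a copy number mismatch shows up as a difference in genotype length. All names are hypothetical.

```python
from typing import Dict, List


def duplicate_genotype(genotype: List[int]) -> List[int]:
    """Artificially duplicate a genotype, e.g. [8, 9] becomes [8, 8, 9, 9]."""
    return sorted(allele for allele in genotype for _ in range(2))


def manhattan(genotype_a: List[int], genotype_b: List[int]) -> int:
    """Manhattan distance between two genotypes of equal size."""
    return sum(abs(a - b) for a, b in zip(sorted(genotype_a), sorted(genotype_b)))


def sample_distance(
    sample_a: Dict[str, List[int]],
    sample_b: Dict[str, List[int]],
    duplicate_a: bool = False,
) -> float:
    """Average difference in allele length per locus between two samples."""
    total = 0
    n_compared = 0
    for locus, genotype_a in sample_a.items():
        genotype_b = sample_b.get(locus)
        if genotype_b is None:
            continue  # no call for this locus in the second sample
        if duplicate_a:
            genotype_a = duplicate_genotype(genotype_a)
        if len(genotype_a) != len(genotype_b):
            continue  # copy number differs between the two samples
        total += manhattan(genotype_a, genotype_b)
        n_compared += 1
    if n_compared == 0:
        raise ValueError("no comparable loci")
    return total / n_compared


diploid = {"a": [10, 10], "b": [8, 9], "c": [5, 5]}
tetraploid = {"a": [10, 10, 10, 11], "b": [8, 8, 9, 9], "c": [5, 5, 5]}
print(sample_distance(diploid, tetraploid, duplicate_a=True))
```

In this toy example locus "c" is skipped because its copy number differs between the samples, and the remaining two loci contribute distances of 1 and 0.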
References
1. Max A. Verbiest, M. Maksimov, Y. Jin, M. Anisimova,
M. Gymrek, and T. Bilgin Sonay. Mutation and selection
processes regulating short tandem repeats give rise to
genetic and phenotypic diversity across species. Journal of
Evolutionary Biology, 36(2):321–336, 2023.
2. Stephanie Feupe Fotsing, Jonathan Margoliash, Catherine
Wang, Shubham Saini, Richard Yanicky, Sharona Shleizer-
Burko, Alon Goren, and Melissa Gymrek. The impact of short
tandem repeat variation on gene expression. Nature Genetics,
51(11):1652–1659, 2019.
3. Yirong Shi, Yiwei Niu, Peng Zhang, Huaxia Luo, Shuai
Liu, Sijia Zhang, Jiajia Wang, Yanyan Li, Xinyue Liu,
Tingrui Song, Tao Xu, and Shunmin He. Characterization of
genome-wide STR variation in 6487 human genomes. Nature
Communications, 14(1):2092, 2023.
4. Max A. Verbiest, Oxana Lundström, Feifei Xia, Michael
Baudis, Tugce Bilgin Sonay, and Maria Anisimova. Short
tandem repeat mutations regulate gene expression in colorectal
cancer. Scientific Reports, 14(1):3331, 2024.
5. Thomas Willems, Dina Zielinski, Jie Yuan, Assaf Gordon,
Melissa Gymrek, and Yaniv Erlich. Genome-wide profiling
of heritable and de novo STR variations. Nature Methods,
14(6):590–592, 2017.
6. Egor Dolzhenko, Viraj Deshpande, Felix Schlesinger, Peter
Krusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius,
Andrew Gross, Giuseppe Narzisi, Brett Bowman, Konrad
Scheffler, Joke J F A van Vugt, Courtney French, Alba
Sanchis-Juan, Kristina Ibáñez, Arianna Tucci, Bryan R Lajoie,
Jan H Veldink, F Lucy Raymond, Ryan J Taft, David R
Bentley, and Michael A Eberle. ExpansionHunter: a sequence-
graph-based tool to analyze variation in short tandem repeat
regions. Bioinformatics, 35(22):4754–4756, 2019.
7. Nima Mousavi, Sharona Shleizer-Burko, Richard Yanicky,
and Melissa Gymrek. Profiling the genome-wide landscape
of tandem repeat expansions. Nucleic Acids Research,
47(15):e90–e90, 2019.
8. Hope A. Tanudisastro, Ira W. Deveson, Harriet Dashnow,
and Daniel G. MacArthur. Sequencing and characterizing
short tandem repeats in the human genome. Nature Reviews
Genetics, pages 1–16, 2024.
9. Liwen Hu, Xinyue Yao, Hairong Huang, Zhong Guo, Xi Cheng,
Yang Xu, Yi Shen, Biao Xu, and Demin Li. Clinical
significance of germline copy number variation in susceptibility
of human diseases. Journal of Genetics and Genomics = Yi
Chuan Xue Bao, 45(1):3–12, 2018.
10. Rameen Beroukhim, Craig H. Mermel, Dale Porter, Guo
Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina,
Jesse S. Boehm, Jennifer Dobson, Mitsuyoshi Urashima,
Kevin T. Mc Henry, Reid M. Pinchback, Azra H. Ligon,
Yoon-Jae Cho, Leila Haery, Heidi Greulich, Michael Reich,
Wendy Winckler, Michael S. Lawrence, Barbara A. Weir,
Kumiko E. Tanaka, Derek Y. Chiang, Adam J. Bass, Alice
Loo, Carter Hoffman, John Prensner, Ted Liefeld, Qing Gao,
Derek Yecies, Sabina Signoretti, Elizabeth Maher, Frederic J.
Kaye, Hidefumi Sasaki, Joel E. Tepper, Jonathan A. Fletcher,
Josep Tabernero, José Baselga, Ming-Sound Tsao, Francesca
Demichelis, Mark A. Rubin, Pasi A. Janne, Mark J. Daly,
Carmelo Nucera, Ross L. Levine, Benjamin L. Ebert, Stacey
Gabriel, Anil K. Rustgi, Cristina R. Antonescu, Marc Ladanyi,
Anthony Letai, Levi A. Garraway, Massimo Loda, David G.
Beer, Lawrence D. True, Aikou Okamoto, Scott L. Pomeroy,
Samuel Singer, Todd R. Golub, Eric S. Lander, Gad Getz,
William R. Sellers, and Matthew Meyerson. The landscape of
somatic copy-number alteration across human cancers. Nature,
463(7283):899–905, 2010.
11. Sarah P. Otto and Jeannette Whitton. Polyploid incidence
and evolution. Annual Review of Genetics, 34:401–437, 2000.
12. Yves Van de Peer, Eshchar Mizrachi, and Kathleen Marchal.
The evolutionary significance of polyploidy. Nature Reviews
Genetics, 18(7):411–424, 2017.
13. Mareike Busche, Boas Pucker, Prisca Viehöver, Bernd
Weisshaar, and Ralf Stracke. Genome Sequencing of Musa
acuminata Dwarf Cavendish Reveals a Duplication of a Large
Segment of Chromosome 2. G3 Genes|Genomes|Genetics,
10(1):37–42, 2020.
14. Elena Grassi, Valentina Vurchio, George D. Cresswell, Irene
Catalano, Barbara Lupo, Francesco Sassi, Francesco Galimi,
Sofia Borgato, Martina Ferri, Marco Viviani, Simone Pompei,
Gianvito Urgese, Bingjie Chen, Eugenia R. Zanella, Francesca
Cottino, Alberto Bardelli, Marco Cosentino Lagomarsino,
Andrea Sottoriva, Livio Trusolino, and Andrea Bertotti.
Heterogeneity and evolution of DNA mutation rates in
microsatellite stable colorectal cancer, 2024.
15. Sophie F. Roerink, Nobuo Sasaki, Henry Lee-Six, Matthew D.
Young, Ludmil B. Alexandrov, Sam Behjati, Thomas J.
Mitchell, Sebastian Grossmann, Howard Lightfoot, David A.
Egan, Apollo Pronk, Niels Smakman, Joost van Gorp,
Elizabeth Anderson, Stephen J. Gamble, Chris Alder, Marc
van de Wetering, Peter J. Campbell, Michael R. Stratton, and
Hans Clevers. Intra-tumour diversification in colorectal cancer
at the single-cell level. Nature, 556(7702):457–462, 2018.
16. James K Bonfield, John Marshall, Petr Danecek, Heng Li,
Valeriu Ohan, Andrew Whitwham, Thomas Keane, and
Robert M Davies. HTSlib: C library for reading/writing
high-throughput sequencing data. GigaScience, 10(2):giab007,
2021.
17. Johannes Köster. Rust-Bio: a fast and safe bioinformatics
library. Bioinformatics, 32(3):444–446, 2016.
18. OEIS Foundation Inc. The On-Line Encyclopedia of Integer
Sequences, 2024.
19. Roman Kolpakov, Ghizlane Bana, and Gregory Kucherov.
mreps: efficient and flexible detection of tandem repeats in
DNA. Nucleic Acids Research, 31(13):3672–3678, 2003.
20. Caroline Belser, Franc-Christophe Baurens, Benjamin Noel,
Guillaume Martin, Corinne Cruaud, Benjamin Istace, Nabila
Yahiaoui, Karine Labadie, Eva Hřibová, Jaroslav Doležel,
Arnaud Lemainque, Patrick Wincker, Angélique D’Hont, and
Jean-Marc Aury. Telomere-to-telomere gapless chromosomes
of banana using nanopore sequencing. Communications
Biology, 4(1):1–12, 2021.
21. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen,
Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger,
Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey
Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A.
Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew
Borchers, Gerard G. Bouffard, Shelise Y. Brooks, Gina V.
Caldas, Nae-Chyun Chen, Haoyu Cheng, Chen-Shan Chin,
William Chow, Leonardo G. de Lima, Philip C. Dishuck,
Richard Durbin, Tatiana Dvorkina, Ian T. Fiddes, Giulio
Formenti, Robert S. Fulton, Arkarachai Fungtammasan, Erik
Garrison, Patrick G. S. Grady, Tina A. Graves-Lindsay,
Ira M. Hall, Nancy F. Hansen, Gabrielle A. Hartley,
Marina Haukness, Kerstin Howe, Michael W. Hunkapiller,
Chirag Jain, Miten Jain, Erich D. Jarvis, Peter Kerpedjiev,
Melanie Kirsche, Mikhail Kolmogorov, Jonas Korlach, Milinn
Kremitzki, Heng Li, Valerie V. Maduro, Tobias Marschall,
Ann M. McCartney, Jennifer McDaniel, Danny E. Miller,
James C. Mullikin, Eugene W. Myers, Nathan D. Olson,
Benedict Paten, Paul Peluso, Pavel A. Pevzner, David
Porubsky, Tamara Potapova, Evgeny I. Rogaev, Jeffrey A.
Rosenfeld, Steven L. Salzberg, Valerie A. Schneider, Fritz J.
Sedlazeck, Kishwar Shafin, Colin J. Shew, Alaina Shumate,
Ying Sims, Arian F. A. Smit, Daniela C. Soto, Ivan Sović,
Jessica M. Storer, Aaron Streets, Beth A. Sullivan, Françoise
Thibaud-Nissen, James Torrance, Justin Wagner, Brian P.
Walenz, Aaron Wenger, Jonathan M. D. Wood, Chunlin
Xiao, Stephanie M. Yan, Alice C. Young, Samantha Zarate,
Urvashi Surti, Rajiv C. McCoy, Megan Y. Dennis, Ivan A.
Alexandrov, Jennifer L. Gerton, Rachel J. O’Neill, Winston
Timp, Justin M. Zook, Michael C. Schatz, Evan E. Eichler,
Karen H. Miga, and Adam M. Phillippy. The complete
sequence of a human genome. Science, 376(6588):44–53, 2022.
22. Helyaneh Ziaei Jam, Justin M. Zook, Sara Javadzadeh,
Jonghun Park, Aarushi Sehgal, and Melissa Gymrek. LongTR:
genome-wide profiling of genetic variation at tandem repeats
from long reads. Genome Biology, 25(1):176, 2024.
23. Heng Li. New strategies to improve minimap2 alignment
accuracy. Bioinformatics, 37(23):4572–4574, 2021.
24. Justin M. Zook, David Catoe, Jennifer McDaniel, Lindsay
Vang, Noah Spies, Arend Sidow, Ziming Weng, Yuling Liu,
Christopher E. Mason, Noah Alexander, Elizabeth Henaff,
Alexa B. R. McIntyre, Dhruva Chandramohan, Feng Chen,
Erich Jaeger, Ali Moshrefi, Khoa Pham, William Stedman,
Tiffany Liang, Michael Saghbini, Zeljko Dzakula, Alex Hastie,
Han Cao, Gintaras Deikus, Eric Schadt, Robert Sebra, Ali
Bashir, Rebecca M. Truty, Christopher C. Chang, Natali
Gulbahce, Keyan Zhao, Srinka Ghosh, Fiona Hyland, Yutao
Fu, Mark Chaisson, Chunlin Xiao, Jonathan Trow, Stephen T.
Sherry, Alexander W. Zaranek, Madeleine Ball, Jason Bobe,
Preston Estep, George M. Church, Patrick Marks, Sofia
Kyriazopoulou-Panagiotopoulou, Grace X. Y. Zheng, Michael
Schnall-Levin, Heather S. Ordonez, Patrice A. Mudivarti,
Kristina Giorda, Ying Sheng, Karoline Bjarnesdatter Rypdal,
and Marc Salit. Extensive sequencing of seven human genomes
to characterize benchmark reference materials. Scientific Data,
3(1):160025, 2016.
25. Simonetta M. Leto, Elena Grassi, Marco Avolio, Valentina
Vurchio, Francesca Cottino, Martina Ferri, Eugenia R.
Zanella, Sofia Borgato, Giorgio Corti, Laura di Blasio, Desiana
Somale, Marianela Vara-Messler, Francesco Galimi, Francesco
Sassi, Barbara Lupo, Irene Catalano, Marika Pinnelli, Marco
Viviani, Luca Sperti, Alfredo Mellano, Alessandro Ferrero,
Caterina C. Zingaretti, Alberto Puliafito, Luca Primo,
Andrea Bertotti, and Livio Trusolino. XENTURION is a
population-level multidimensional resource of xenografts and
tumoroids from metastatic colorectal cancer patients. Nature
Communications, 15(1):7495, 2024.
26. F. Favero, T. Joshi, A. M. Marquard, N. J. Birkbak,
M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. Sequenza:
allele-specific copy number and mutation profiles from tumor
sequencing data. Annals of Oncology, 26(1):64–70, 2015.
27. Daniel P. Cooke, David C. Wedge, and Gerton Lunter.
Benchmarking small-variant genotyping in polyploids.
Genome Research, 32(2):403–408, 2022.
Tables and Figures
Fig. 1: ConSTRain overview and example. (1) An STR locus is loaded from the input files. The locus reference information is parsed
from the STR panel. The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNA. (2)
Reads overlapping the STR region are extracted from the alignment file, and the length of the STR region in each read is determined.
(3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept. (4) This yields the final
observed allele length distribution. (5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G. (6)
From G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR
copy number (3 in the example). Each row in D corresponds to the expected allele length distribution of one of the genotypes in G. (7)
The expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row
in D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value. (9) The genotype in G
with the lowest error is selected (10) and reported in the output. The inferred genotype of the STR locus in this example consists of an
allele of 4 CAG units (present once), an allele of 5 CAG units (present once), and an allele of 8 CAG units (also present once).
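Steps (3) through (10) of the caption can be sketched in plain Python. This is a simplified illustration and not ConSTRain's implementation. The read counts per allele are invented so that they sum to the 51 reads of the example. The total is taken over the retained alleles only, and ties are resolved in favour of the first genotype generated. Both choices are assumptions, and all names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional


def integer_partitions(n: int, max_part: Optional[int] = None) -> Iterator[List[int]]:
    """Yield the integer partitions of n as non-increasing lists."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield [first] + rest


def call_genotype(observed: Dict[int, int], copy_number: int) -> Dict[int, int]:
    """Map allele length to inferred number of copies for one STR locus."""
    # Sort alleles by read support and keep at most copy_number of them.
    ranked = sorted(observed.items(), key=lambda item: item[1], reverse=True)[:copy_number]
    lengths = [length for length, _ in ranked]
    counts = [count for _, count in ranked] + [0] * (copy_number - len(ranked))
    reads_per_copy = sum(counts) / copy_number

    best_row, best_error = None, float("inf")
    for partition in integer_partitions(copy_number):
        row = partition + [0] * (copy_number - len(partition))  # one row of G
        expected = [copies * reads_per_copy for copies in row]  # one row of D
        error = sum(abs(e - o) for e, o in zip(expected, counts))
        if error < best_error:
            best_row, best_error = row, error
    return {length: copies for length, copies in zip(lengths, best_row) if copies > 0}


# Invented read counts for alleles of 4, 5, and 8 units (51 reads in total).
print(call_genotype({4: 18, 5: 16, 8: 17}, copy_number=3))
```

With these counts the genotype with one copy of each allele has the lowest error, in line with the genotype reported in the caption.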
Table 1. Results for ConSTRain, GangSTR, and HipSTR on the HG002 human benchmark. Data are shown before and after filtering the output of each
tool. The total number of loci in the benchmark was 1695865. The percentage of this number that was called by each variant caller is shown in brackets
in the ‘Loci called’ columns. The best value in each column is printed in bold.
Unfiltered Filtered