hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data

Research Article

Bing Guo, Stephen F. Schaffner, Aimee R. Taylor, Timothy D. O’Connor, and Shannon Takala-Harrison

This is a preprint posted on Research Square; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7004070/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. The journal publication appeared in Malaria Journal on 09 Feb, 2026. You are reading the latest preprint version (Version 1).

Abstract

Background

Identity-by-descent (IBD), which describes recent genetic co-ancestry between pairs of genomes, is a fundamental concept in population genomics. It has been used to estimate genetic relatedness, detect selection signals, and understand population demography.
The IBD detection method hmmIBD demonstrates high accuracy in inferring IBD segments between haploid genomes, including Plasmodium falciparum, and is widely used in malaria genomic surveillance. However, the current single-threaded implementation of hmmIBD does not utilize the full capacity of multi-processor computers, making it difficult to apply to large data sets, and does not accommodate non-uniform recombination rates across the genome.

Methods

We developed an enhanced implementation of hmmIBD in the Rust programming language, named hmmibd-rs, which leverages multi-threaded computing to parallelize IBD inference over genome pairs and which supports optional, user-defined recombination rate maps for more accurate IBD detection and filtration from genomes with non-uniform recombination. We further streamlined large-scale IBD detection by incorporating auxiliary built-in functionalities to preprocess input directly from the standard binary variant call format (BCF) and to filter IBD output to reduce disk usage.

Results

With our new implementation, IBD detection time decreases almost in proportion to the number of CPU threads used; using 128 threads shortens IBD detection time from 5.2 days to 1.3 hours for 220 million pairs of simulated Plasmodium falciparum-like chromosomes, increasing computational speed by approximately 100x over the single-threaded hmmIBD algorithm. Incorporating non-uniform recombination rates in hmmibd-rs enhances the accuracy of IBD inference by mitigating the overestimation of IBD breakpoints in recombination cold spots and their underestimation in hot spots. It also improves IBD segment length filtration, reducing the false positive rate in recombination cold spots and the false negative rate in hot spots.
When applied to empirical data sets, hmmibd-rs completes the detection of IBD from MalariaGEN Pf7 (n ≈ 10,000 monoclonal samples) within hours, enabling a single-day IBD analysis pipeline for large genomic data sets.

Conclusion

hmmibd-rs builds upon, accelerates, and enhances hmmIBD for efficient and accurate IBD detection, serving as a crucial tool for advancing large-scale malaria genomic surveillance.

Keywords: Identity-By-Descent, Hidden Markov Model, Parallelization, Population Genomics, Recombination Rate Map, Plasmodium

BACKGROUND

Identity-by-descent (IBD) refers to alleles or genomic regions (segments) that are identical between two individuals/genomes due to shared ancestry. For species with high recombination rates relative to mutation rates, such as malaria parasites, metrics that leverage recombination, e.g., IBD, capture finer-scale and higher-resolution dynamics in population demography [1–3]. IBD, inferred from malaria parasite genomic data, conveys important information about the recent history of populations, including genetic relatedness within and between populations, loci under natural selection, and time-specific demography (effective population size and population structure), thus playing a crucial role in malaria genomic surveillance [4–14].

Accurate detection of IBD segments often requires genotype data with sufficient marker density [4, 15, 16]. Species with a high ratio of recombination rate to mutation rate, such as the malaria parasite Plasmodium falciparum, tend to have a (common variant) marker density two orders of magnitude lower than that of humans [9, 15, 17–20]. Thus, an IBD detection algorithm robust to low marker density is crucial for high-recombining species.
Our accompanying project suggests that hmmIBD stands out among many other IBD detection methods, uniquely providing high-quality IBD segment calls, including shorter segments, that allow for the generation of accurate results even for quality-sensitive inferences [16]. Despite its high accuracy and wide adoption in malaria research, hmmIBD could be improved in several areas: it uses only a single thread of multi-core CPUs, it assumes a uniform recombination rate across the genome, and it requires substantial preprocessing of input data into a specific format prior to analysis [5]. These limitations may prevent its wider application to larger data sets or to analyses that rely on a non-uniform genetic map [20, 21], an important consideration given the demand for analysis of large-scale whole-genome sequence (WGS) data [20] and opportunities to construct high-resolution recombination rate maps based on recent genetic crosses [22, 23].

In this work, we addressed these limitations by reimplementing and enhancing the Hidden Markov Model described in the original paper [5] using the Rust programming language. Our new implementation, hmmibd-rs, offers three key features: parallelized IBD inference, support for non-uniform recombination rates, and streamlined data management.

METHODS

We enabled parallelization for the HMM inference process at the level of single haploid genome pairs or groups of haploid genome pairs. In our reimplementation, we first modularized the original algorithm into multiple components, including data processing modules and different subcomponents of the HMM inference process. Relying on the modular structure, we then isolated the HMM inference process for a genome pair as the basic unit for parallelization. The original sequential HMM inference process, iterated over genome pairs, was converted into parallelizable tasks, utilizing the Rayon crate, a library designed for data parallelism [24].
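The pair-level decomposition described above can be illustrated with a minimal sketch. It is written in Python rather than Rust, and it uses the standard-library thread pool in place of Rayon, so it shows only how genome pairs become independent tasks and does not reproduce the speedup reported for hmmibd-rs. The function names (`fraction_matching`, `analyze_pairs`), the encoding of missing calls as -1, and the toy per-pair statistic that stands in for the HMM inference are assumptions made for illustration; none of them is taken from the hmmibd-rs source.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def fraction_matching(a, b):
    """Toy per-pair statistic standing in for the HMM inference.

    Returns the fraction of sites, among those called in both genomes,
    at which the two haploid genomes carry the same allele.
    Missing calls are encoded as -1 (an assumption of this sketch).
    """
    both_called = [(x, y) for x, y in zip(a, b) if x >= 0 and y >= 0]
    if not both_called:
        return None  # no informative sites for this pair
    return sum(x == y for x, y in both_called) / len(both_called)


def analyze_pairs(genomes, n_workers=4):
    """Run the per-pair task for every unordered pair of genomes.

    Each pair is an independent unit of work, so the tasks can be
    handed to a pool of workers in any order.
    """
    pairs = list(combinations(sorted(genomes), 2))

    def task(pair):
        first, second = pair
        return pair, fraction_matching(genomes[first], genomes[second])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(task, pairs))


if __name__ == "__main__":
    # Hypothetical toy genotypes: three haploid genomes, four sites.
    toy = {
        "s1": [0, 1, 1, 0],
        "s2": [0, 1, 0, 0],
        "s3": [1, -1, 1, 0],
    }
    for pair, value in analyze_pairs(toy).items():
        print(pair, value)
```

Because no task depends on another, the number of tasks grows as n(n − 1)/2 with the number of genomes n while each task stays small, which is the property that makes the work straightforward to distribute over threads.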
In addition, we provide options to optimize memory usage based on parameters such as the maximum number of alternative alleles per locus and the output file buffer sizes.

We enabled non-uniform recombination rates for HMM inference and IBD segment filtration using user-provided genetic maps. For HMM inference, we replaced the term \(e^{-k\rho d_{t}}\) in the transition probability matrix [5] with the term \(e^{-k c_{t}}\), where \(\rho\) is the recombination rate per generation per bp, \(d_{t}\) is the physical distance between the \(t\)-th and \((t-1)\)-th markers, and \(c_{t}\) is the genetic distance between the two markers as given by the user. Under a uniform rate, \(c_{t}=\rho d_{t}\); our new implementation instead maps the physical distance between markers to genetic distance using the supplied map and uses \(c_{t}\) directly for HMM inference, thus removing the assumption of a uniform recombination rate along the genome. Beyond the HMM inference steps, the recombination rate map can also be used by the built-in, post-inference filtration of IBD segments by length in genetic units. Such filtration is usually needed for analyses based on IBD segments, since short IBD segment estimates are more error-prone and are often removed before downstream analyses [15, 16, 25].

We improved the efficiency and ergonomics of input and output data management. We developed a simple, cross-platform auxiliary library, bcf_reader, in Rust and used it to process input data directly from a common genotype file format, the binary variant call format (BCF). Based on this library, we implemented two main built-in functions.
The first main function constructs haploid genomes by replacing heteroallelic genotype calls in monoclonal samples with the dominant allele if the total allele depth (via the command line option min_depth) and the fraction of reads supporting the dominant allele (via the options min_ratio and min_r1_r2) are high; otherwise, the calls are set to missing (as detailed in Supplementary Table 1). We note that the default criteria for deciding between the dominant allele and missing data are somewhat subjective. Users may opt for stringent thresholds, such as setting all heteroallelic calls to missing, with the caveat that more sites and samples are removed during the subsequent genotype filtering step. Alternatively, they may choose more permissive thresholds that retain all heteroallelic calls by using the allele with the highest read support, which may introduce substantial genotyping errors.

The second built-in function iteratively filters samples and sites (based on the missingness of genotype calls) to obtain high-quality genotype data while retaining balanced numbers of markers and samples (Supplementary Table 1).

The dominant-allele-based haploid genome construction from BCF files is a heuristic strategy for working with monoclonal samples. Haploid genomes from polyclonal infections may be inferred via external deconvolution programs such as DEploid or DEploidIBD [26, 27] and provided to hmmibd-rs in the traditional table format used by hmmIBD.

We have included additional options, made available via the command line interface, to customize HMM parameters and data management parameters (e.g., BCF processing, IBD filtering, output buffering, and optional suppression) to allow the user to balance analytical needs and computational/storage efficiency.
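The dominant-allele rule described above for heteroallelic calls can be sketched as follows. Only the option names min_depth, min_ratio, and min_r1_r2 come from the text; their exact definitions and defaults are given in Supplementary Table 1, which is not reproduced here. The default values below and the reading of min_r1_r2 as a minimum ratio of the highest to the second-highest allele depth are therefore assumptions, and `call_haploid` is a hypothetical helper, not a function of hmmibd-rs.

```python
def call_haploid(allele_depths, min_depth=5, min_ratio=0.7, min_r1_r2=3.0):
    """Pick the dominant allele for one genotype call, or return None (missing).

    allele_depths: read depth per allele, e.g. [ref_depth, alt1_depth, ...].
    min_depth:  minimum total depth over all alleles (placeholder default).
    min_ratio:  minimum fraction of reads supporting the dominant allele
                (placeholder default).
    min_r1_r2:  minimum ratio of the top depth to the runner-up depth
                (assumed interpretation, placeholder default).
    """
    total = sum(allele_depths)
    if total == 0 or total < min_depth:
        return None  # too few reads to trust any allele

    # Allele indices ordered from highest to lowest depth.
    ranked = sorted(range(len(allele_depths)),
                    key=lambda i: allele_depths[i], reverse=True)
    top = allele_depths[ranked[0]]
    second = allele_depths[ranked[1]] if len(ranked) > 1 else 0

    if top / total < min_ratio:
        return None  # dominant allele is not dominant enough
    if second > 0 and top / second < min_r1_r2:
        return None  # runner-up allele is too well supported
    return ranked[0]


if __name__ == "__main__":
    # Hypothetical allele depths for a biallelic site.
    for depths in ([20, 1], [1, 30], [6, 5], [8, 3], [2, 0]):
        print(depths, call_haploid(depths))
```

Raising the thresholds moves the rule toward the stringent end described in the text (more calls set to missing), whereas setting them all to zero keeps the allele with the highest read support for every call.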
Additionally, hmmibd-rs is designed to be fully compatible with hmmIBD to facilitate the transition to hmmibd-rs and, by default, generates both files for IBD segments and files for the fraction of sites IBD (estimates of genetic relatedness), as hmmIBD does.

Methods used for simulation, measurement of computation time, and downstream analysis of detected IBD segments are similar to those of our previous analyses [14, 16], with further description provided in the Supplementary Methods. Details of these analyses can be found in the related pipeline and source code listed in the Availability of Data and Materials.

RESULTS

Our new implementation, hmmibd-rs, improves the computational efficiency of HMM IBD inference both by increasing single-thread performance and by enabling multithreading. When both hmmibd-rs and hmmIBD were forced to use a single thread, run times with hmmibd-rs were about 40% shorter than those of hmmIBD (Fig. 1a), owing to the more compact memory representation of the genotype matrix and reduced disk read and write operations in hmmibd-rs. When multithreading is enabled, the performance of hmmibd-rs scales almost linearly with the number of threads and the number of genome pairs. To test the performance on a large data set, we ran hmmIBD and hmmibd-rs on simulated P. falciparum-like genomic data with a sample size of up to n = 30,000, which is the same order of magnitude as the MalariaGEN Pf7 data set (n > 21,000). Our new implementation completed the IBD detection from the simulated data set in 1.3 hours using 128 threads on an AMD EPYC 9654 CPU, whereas the single-threaded hmmIBD took an estimated 5.2 days to complete IBD detection (Fig. 1c). Additionally, when IBD segment length filtering options were used, the resulting file sizes were substantially reduced (Supplementary Table 2). Thus, in this example, the new implementation accelerates the process by two orders of magnitude.
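The figures quoted above can be checked with a few lines of arithmetic. The sketch below counts unordered genome pairs as n(n − 1)/2 and computes the ratio of the two reported run times (5.2 days and 1.3 hours); the helper names are hypothetical and the calculation uses only numbers stated in the text.

```python
import math


def n_pairs(n_genomes):
    """Number of unordered pairs among n genomes: n(n - 1)/2."""
    return n_genomes * (n_genomes - 1) // 2


def genomes_for_pairs(pairs):
    """Smallest number of genomes whose pair count reaches `pairs`."""
    n = (1 + math.isqrt(1 + 8 * pairs)) // 2
    while n_pairs(n) < pairs:
        n += 1
    return n


def speedup(baseline_hours, new_hours):
    """Ratio of baseline run time to new run time."""
    return baseline_hours / new_hours


if __name__ == "__main__":
    print(n_pairs(21_000))                 # pairs among 21,000 genomes
    print(genomes_for_pairs(220_000_000))  # genomes needed for 220 million pairs
    print(speedup(5.2 * 24, 1.3))          # 5.2 days versus 1.3 hours
```

Under these formulas, 220 million pairs correspond to roughly 21,000 genomes, and 5.2 days against 1.3 hours is a ratio of about 96, in line with the "approximately 100x" and "two orders of magnitude" statements.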
To understand how recombination rate misspecification affects IBD detection, we simulated genomes with a non-uniform recombination rate map that had the same mean recombination rate as the P. falciparum genome (Fig. 2a and b). When focusing on IBD segments ≥ 2 cM and using the average recombination rate (hmmIBD), we found that the number of ends (breakpoints) of the detected IBD segments decreases in recombination hot spots and increases in cold spots when compared to those obtained using the true non-uniform rates (hmmibd-rs) (Fig. 2c). Consistent with this, the error rates (false negative rates and false positive rates, Fig. 2e and f) and the deviation from the true IBD coverage pattern (Fig. 2g) were significantly higher in hmmIBD results than in hmmibd-rs results. The differences between hmmIBD- and hmmibd-rs-derived IBD segments are substantially reduced when the true rates are used to calculate the lengths by which hmmIBD-inferred segments are filtered, suggesting that an accurate recombination map is important for filtering IBD segments by length in genetic units (Supplementary Fig. 1).

To test whether the recombination rate affects HMM inference, we analyzed unfiltered IBD segments called with the true rates (hmmibd-rs) and with the average rate (hmmIBD). We found that rate misspecification indeed affects the detection of IBD breakpoints (Supplementary Fig. 2). The general underestimation of IBD breakpoints in recombination hot spots likely arises from two main factors: the IBD-merging bias in the HMM and the high error rates caused by low marker densities per genetic unit. In addition, this issue is aggravated by the reduced state switching rate in the HMM and by aggressive IBD removal in length filtering due to recombination rate misspecification (see Supplementary Note for more details). This finding highlights the importance of accurately characterizing local recombination rate variation and using it to improve the detection and filtration of IBD segments.
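The effect of rate misspecification on both the transition term \(e^{-k c_{t}}\) from the Methods and the length filter can be made concrete with a small sketch. It assumes a piecewise-linear genetic map given as matching lists of physical (bp) and genetic (cM) positions, converts cM to Morgans before exponentiating (since \(\rho d_{t}\) is expressed per generation per bp), and treats k simply as the multiplier that appears in the expression. The map values, the function names, and the linear interpolation are illustrative assumptions and are not taken from hmmibd-rs or from the simulated map in Fig. 2.

```python
import math
from bisect import bisect_right


def genetic_position(bp, map_bp, map_cm):
    """Linearly interpolate the genetic position (cM) of a physical position."""
    if bp <= map_bp[0]:
        return map_cm[0]
    if bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, bp) - 1  # map interval containing bp
    frac = (bp - map_bp[i]) / (map_bp[i + 1] - map_bp[i])
    return map_cm[i] + frac * (map_cm[i + 1] - map_cm[i])


def segment_length_cm(start_bp, end_bp, map_bp, map_cm):
    """Segment length in cM under the non-uniform map."""
    return (genetic_position(end_bp, map_bp, map_cm)
            - genetic_position(start_bp, map_bp, map_cm))


def uniform_length_cm(start_bp, end_bp, cm_per_bp):
    """Segment length in cM under a single average rate."""
    return (end_bp - start_bp) * cm_per_bp


def no_switch_term(k, c_cm):
    """exp(-k * c_t), with c_t converted from cM to Morgans."""
    return math.exp(-k * c_cm / 100.0)


# Hypothetical map: a cold spot (1 cM over the first 100 kb) followed by
# a hot spot (10 cM over the next 100 kb).
MAP_BP = [0, 100_000, 200_000]
MAP_CM = [0.0, 1.0, 11.0]
MEAN_CM_PER_BP = MAP_CM[-1] / MAP_BP[-1]

if __name__ == "__main__":
    for name, (start, end) in {"cold": (0, 100_000),
                               "hot": (150_000, 180_000)}.items():
        true_cm = segment_length_cm(start, end, MAP_BP, MAP_CM)
        avg_cm = uniform_length_cm(start, end, MEAN_CM_PER_BP)
        print(name, true_cm, avg_cm,
              no_switch_term(1, true_cm), no_switch_term(1, avg_cm))
```

With a 2 cM threshold in this toy map, the cold-spot segment is 1.0 cM under the map but 5.5 cM under the average rate, so the average rate keeps a segment that should be removed. The hot-spot segment is 3.0 cM under the map but 1.65 cM under the average rate, so the average rate removes a segment that should be kept. This mirrors the direction of the false positive and false negative patterns described above.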
Another obstacle in the hmmIBD-based analytical pipeline is the need for an hmmIBD-specific input format. This adds a data formatting step to IBD detection and downstream analysis, which is particularly cumbersome when samples and variants are filtered iteratively. We mitigated these issues by implementing optional, built-in, all-in-memory functions for iterative sample and site filtering and for haploid genome construction using dominant alleles (Supplementary Table 1). We present a simple pipeline based on these features, implemented in hmmibd-rs, to demonstrate its applicability to large-scale WGS data sets, from VCF/BCF files to IBD-based estimates. The pipeline includes genotype filtering and haploid genome construction (bcftools [28] and hmmibd-rs), IBD calling (hmmibd-rs), and IBD coverage calculation. We were able to finish IBD calling from the raw genotype call files within a single day using 64 CPU threads (Fig. 3a), with the majority of time spent in bcftools for the initial genotype filtering step. As a proof of concept, the resulting IBD data show signals of positive selection (Fig. 3b) consistent with previous reports [11, 14].

DISCUSSION

This study presents an improved implementation of hmmIBD with three important features for large-scale population genomics: high computational performance, optional recombination rate map specification, and improved data management. Compared to other probabilistic IBD (segment) detection methods popular in malaria research, including hmmIBD [5], isoRelate [6], and DEploidIBD [27], hmmibd-rs is the first attempt to leverage the memory-safe language Rust and its rich ecosystem to embrace the era of large-scale genomics by enabling computational parallelization, employing a standard input format, and lowering the barrier to long-term software maintenance and to further development that incorporates more complex models.
Although hmmibd-rs has mainly been applied to Plasmodium data, it is expected to work with data from other sexually recombining species with high recombination rates for which haploid genomes can be constructed [29]. These may include non-Plasmodium apicomplexan species such as Theileria parva [30], insects such as Apis mellifera [31, 32], and fungi such as Saccharomyces cerevisiae [33, 34]. The new features of hmmibd-rs, including its parallelizability and support for a non-uniform genetic map, may allow the detection of inter-individual IBD segments from phased data on diploids, e.g., mosquitoes, by treating each phase of a diploid individual as a haploid genome. However, given advances in human genetics, a superior approach for diploids may also exist.

One caveat of our analyses of hmmibd-rs is the lack of reliable non-uniform recombination rates for empirical data sets of high-recombining species like P. falciparum, despite initial efforts to estimate either the average rate or high-resolution rate maps based on limited samples [22, 23]. Ongoing work is needed to estimate high-resolution recombination rate maps based on existing genetic cross data, as well as WGS data from large-scale population samples [21, 22, 35–37]. This will allow further evaluation of biases in IBD-based analysis due to the use of a simple average rate in empirical data. We also note that the optional built-in function that constructs haploid genomes based on dominant alleles is misspecified for polyclonal samples. Using more advanced genotype deconvolution tools, such as DEploid and DEploidIBD [26, 27], may better utilize polyclonal infections, although it may significantly increase the computational burden.
CONCLUSION

hmmibd-rs enhances the original IBD detection algorithm hmmIBD with key features that significantly accelerate IBD detection from large-scale genomic data and enable the incorporation of a genetic map for improved accuracy in genomes with non-uniform recombination. The new implementation allows for more efficient, accurate, and streamlined IBD-based analysis of Plasmodium genomes, which will contribute to timely malaria genomic surveillance.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

Source code for hmmibd-rs in Rust is available at https://github.com/bguo068/hmmibd-rs; this repository also includes a version of hmmIBD in C, modified to allow users to specify the average recombination rate via a command line option. Source code for the cross-platform library for reading the BCF file format, bcf-reader, is available at https://github.com/bguo068/bcf-reader. The pipeline that simulates data and benchmarks and characterizes hmmibd-rs and hmmIBD is accessible at https://github.com/bguo068/hmmibd-rs-bench. The pipeline for a full demonstration of the application of hmmibd-rs to MalariaGEN Pf7 is provided at https://github.com/bguo068/hmmibd-rs-bench-empirical. The genotype data of the MalariaGEN Pf7 data set [20] and its sample meta information, including estimates, are publicly available at https://www.malariagen.net/resource/34/.

Competing interests

The authors declare no competing interests.

Funding

This work was supported by U.S. National Institutes of Health (NIH) grant 1R01AI145852 to ST-H and TDO.

Authors’ contributions

BG: Led algorithm enhancement, developed and tested the program, and led paper writing; SFS: Helped improve the paper; ART: Helped improve the paper; TDO: Supervised work; ST-H: Supervised work and helped improve the paper.
Acknowledgements

This publication uses MalariaGEN data as described in ‘Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples’ MalariaGEN et al., Wellcome Open Research 2023, 8:22 https://doi.org/10.12688/wellcomeopenres.18681.1.

References

1. Neafsey DE, Taylor AR, MacInnis BL. Advances and opportunities in malaria population genomics. Nat Rev Genet. 2021;22:502–17.
2. Camponovo F, Buckee CO, Taylor AR. Measurably recombining malaria parasites. Trends Parasitol. 2023;39:17–25.
3. Guo B, Rowley E, O’Connor TD, Takala-Harrison S. Potential and pitfalls of using identity-by-descent for malaria genomic surveillance. Trends Parasitol. 2025;41:387–400.
4. Taylor AR, Jacob PE, Neafsey DE, Buckee CO. Estimating relatedness between malaria parasites. Genetics. 2019;212:1337–51.
5. Schaffner SF, Taylor AR, Wong W, Wirth DF, Neafsey DE. HmmIBD: Software to infer pairwise identity by descent between haploid genotypes. Malar J. 2018;17:10–3.
6. Henden L, Lee S, Mueller I, Barry A, Bahlo M. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet. 2018;14:e1007279.
7. Browning SR, Browning BL. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet. 2015;97:404–18.
8. Morgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, et al. Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malar J. 2020;19:47.
9. Shetty AC, Jacob CG, Huang F, Li Y, Agrawal S, Saunders DL, et al. Genomic structure and diversity of Plasmodium falciparum in Southeast Asia reveal recent parasite migration patterns. Nat Commun. 2019;10:1–11.
10. Al-Asadi H, Petkova D, Stephens M, Novembre J. Estimating recent migration and population-size surfaces. PLoS Genet. 2019;15:e1007908.
11. Amambua-Ngwa A, Amenga-Etego L, Kamau E, Amato R, Ghansah A, Golassa L, et al. Major subpopulations of Plasmodium falciparum in sub-Saharan Africa. Science. 2019;365:813–6.
12. Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, et al. Toward a fine-scale population health monitoring system. Cell. 2021;184:2068–83.e11.
13. Borda V, Loesch DP, Guo B, Laboulaye R, Veliz-Otani D, French JN, et al. Genetics of Latin American Diversity Project: Insights into population genetics and association studies in admixed groups in the Americas. Cell Genom. 2024;4:100692.
14. Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, et al. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun. 2024;15:2499.
15. Zhou Y, Browning SR, Browning BL. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data. Am J Hum Genet. 2020;106:426–37.
16. Guo B, Takala-Harrison S, O’Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes. eLife [Internet]. 2025 [cited 2025 Jan 22];14. Available from: https://elifesciences.org/reviewed-preprints/101924
17. Conway DJ, Roper C, Oduola AMJ, Arnot DE, Kremsner PG, Grobusch MP, et al. High recombination rate in natural populations of Plasmodium falciparum. Proc Natl Acad Sci U S A. 1999;96:4506–11.
18. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, et al. A high-resolution recombination map of the human genome. Nat Genet. 2002;31:241–7.
19. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–9.
20. MalariaGEN, Ahouidi A, Ali M, Almagro-Garcia J, Amambua-Ngwa A, Amaratunga C, et al. An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. Wellcome Open Res. 2021;6:42.
21. Jiang H, Li N, Gopalan V, Zilversmit MM, Varma S, Nagarajan V, et al. High recombination rates and hotspots in a Plasmodium falciparum genetic cross. Genome Biol. 2011.
22. Vendrely KM, Kumar S, Li X, Vaughan AM. Humanized Mice and the Rebirth of Malaria Genetic Crosses. Trends Parasitol. 2020;36:850–63.
23. Kane J, Li X, Kumar S, Button-Simons KA, Vendrely Brenneman KM, Dahlhoff H, et al. A Plasmodium falciparum genetic cross reveals the contributions of pfcrt and plasmepsin II/III to piperaquine drug resistance. mBio. 2024;15:e0080524.
24. Stone J, Matsakis N. rayon: Simple work-stealing parallelism for Rust [Internet]. 2024. Available from: https://github.com/rayon-rs/rayon
25. Tang K, Naseri A, Wei Y, Zhang S, Zhi D. Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts. GigaScience. 2022;11:giac111.
26. Zhu SJ, Almagro-Garcia J, McVean G. Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics. 2018;34:9–15.
27. Zhu SJ, Hendry JA, Almagro-Garcia J, Pearson RD, Amato R, Miles A, et al. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. eLife. 2019;8:e40845.
28. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008.
29. Stapley J, Feulner PGD, Johnston SE, Santure AW, Smadja CM. Variation in recombination frequency and distribution across eukaryotes: patterns and processes. Philosophical Trans Royal Soc B: Biol Sci. 2017;372:20160455.
30. Sivakumar T, Hayashida K, Sugimoto C, Yokoyama N. Evolution and genetic diversity of Theileria. Infect Genet Evol. 2014;27:250–63.
31. Kent CF, Minaei S, Harpur BA, Zayed A. Recombination is associated with the evolution of genome structure and worker behavior in honey bees. Proc Natl Acad Sci U S A. 2012;109:18012–7.
32. Leroy T, Faux P, Basso B, Eynard S, Wragg D, Vignal A. Inferring Long-Term and Short-Term Determinants of Genetic Diversity in Honey Bees: Beekeeping Impact and Conservation Strategies. Mol Biol Evol. 2024;41:msae249.
33. Peter J, De Chiara M, Friedrich A, Yue J-X, Pflieger D, Bergström A, et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature. 2018;556:339–44.
34. Barton AB, Pekosz MR, Kurvathi RS, Kaback DB. Meiotic Recombination at the Ends of Chromosomes in Saccharomyces cerevisiae. Genetics. 2008;179:1221–35.
35. Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, et al. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res. 2016;26:1288–99.
36. Naseri A, Yue W, Zhang S, Zhi D. Fast inference of genetic recombination rates in biobank scale data. Genome Res. 2023;33:1015–22.
37. Zhou Y, Browning BL, Browning SR. Population-Specific Recombination Maps from Segments of Identity by Descent. Am J Hum Genet. 2020;107:137–48.

Supplementary Files: hmmibdrsmanuscriptsuppl.docx

Editorial history at the journal: first submitted 29 Jun, 2025; submission checks completed and editor assigned 30 Jun, 2025; reviewers invited 01 Jul, 2025; reviewers agreed 18 Jul, 20 Jul, and 21 Jul, 2025; reviews received 08 Aug and 10 Aug, 2025; editorial decision (revision requested) 16 Sep, 2025.
We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7004070","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":479235756,"identity":"40f18901-14ee-49d3-a7d5-1a485afaa96c","order_by":0,"name":"Bing Guo","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABEklEQVRIiWNgGAWjYBACA2Yogx8hxtzAwMAGpA8Q0CLZABdjJKAFzkAoIKDFnJ338AuGX4fljM8fPrqBsW2bvPmMxMbPBWUMcnw3ErBqsWzmS7Ng7DtsbHYjLe0GY9ttwzk3EpulZ5xjMJbEocXgMI+ZAWPP4cRtN3jMQFoYZ0gkNkjztjEkbiCkZXP/GbAWe6CW5t9ALfV4tBg/YPhxOHEDQw5YSyJQSxvIlgQDnH7hMWNIbEg3lgD5JeHc7eQZPA/brHnOSRjOPPMAe4jxnzH+8OGPtRx//+FjNz6U3badwZ58+DZPmY0833HstgABG9AlUCZYjQCYlMClHASYPzD8QebzH8CnehSMglEwCkYgAAAxvmWSwRUccgAAAABJRU5ErkJggg==","orcid":"","institution":"University of Maryland School of Medicine","correspondingAuthor":true,"prefix":"","firstName":"Bing","middleName":"","lastName":"Guo","suffix":""},{"id":479235757,"identity":"1b75b3ab-c80f-4511-a21a-fb1be6a7484d","order_by":1,"name":"Stephen F. Schaffner","email":"","orcid":"","institution":"Broad Institute of MIT and Harvard","correspondingAuthor":false,"prefix":"","firstName":"Stephen","middleName":"F.","lastName":"Schaffner","suffix":""},{"id":479235758,"identity":"e7506145-dfb5-4abc-ae8a-f6a5cbc6421d","order_by":2,"name":"Aimee R. 
Taylor","email":"","orcid":"","institution":"Institut Pasteur, Universit´e Paris Cit´e","correspondingAuthor":false,"prefix":"","firstName":"Aimee","middleName":"R.","lastName":"Taylor","suffix":""},{"id":479235759,"identity":"ed394d67-024e-44e6-965a-39b82e1579ff","order_by":3,"name":"Timothy D. O’Connor","email":"","orcid":"","institution":"University of Maryland School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Timothy","middleName":"D.","lastName":"O’Connor","suffix":""},{"id":479235760,"identity":"9db0daf7-ff3c-4c8c-a1ed-245afb45c426","order_by":4,"name":"Shannon Takala-Harrison","email":"","orcid":"","institution":"University of Maryland School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Shannon","middleName":"","lastName":"Takala-Harrison","suffix":""}],"badges":[],"createdAt":"2025-06-29 17:08:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7004070/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7004070/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12936-026-05814-2","type":"published","date":"2026-02-09T15:59:03+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":85846539,"identity":"5afb4362-70d3-4c13-91c0-05a574764c3e","added_by":"auto","created_at":"2025-07-02 09:43:05","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":415307,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003ehmmibd-rs\u003c/em\u003esignificantly reduces the computation time for IBD detection in simulated \u003cem\u003eP. falciparum\u003c/em\u003e-like chromosomes when compared to \u003cem\u003ehmmIBD\u003c/em\u003e. a, Comparison of single-thread runtime of \u003cem\u003ehmmibd-rs\u003c/em\u003eand \u003cem\u003ehmmIBD\u003c/em\u003e for simulated data with different sample sizes. b, Multithreading performance of \u003cem\u003ehmmibd-rs\u003c/em\u003e. 
c, Runtime for detecting IBD from simulated data on a MalariaGEN-scale using \u003cem\u003ehmmibd-rs\u003c/em\u003e with 128 threads. Note that in (a and c), sample sizes are indicated as both the number of chromosomes (top horizontal axis ticks) and the number of chromosome pairs (bottom horizontal axis ticks). Dotted line in (c) indicates that the top right data point (star) is an estimate based on the extrapolation of data points involving smaller sample sizes.\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7004070/v1/91befdaca6e2e41d0a328a94.jpeg"},{"id":85846540,"identity":"71315f56-1137-49b0-b638-057cd72b5a2f","added_by":"auto","created_at":"2025-07-02 09:43:05","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":216545,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003ehmmibd-rs\u003c/em\u003e improves the accuracy of reported IBD segments by incorporating a non-uniform recombination rate map in the HMM algorithm and the IBD length filtering step. a-b, the true, non-uniform recombination rate map (blue line) and the constant chromosome-wide average rates (black dotted line). c. Number of detected IBD breakpoints (two ends of IBD segments over 2 centimorgans, cM) in 15kb windows along the simulated \u003cem\u003eP. falciparum\u003c/em\u003e-like genomes. e-f, False negative rate (e) and false positive rate (f) of detected IBD segments (over 2 cM) using IBD segment overlapping analysis (see Methods). g, Coverage of detected IBD segments (over 2 cM), normalized to the coverage of true IBD segments (over 2 cM, determined by \u003cem\u003etskibd \u003c/em\u003e[14]). 
Note: 1) The true (non-uniform) rate was used to simulate the genotype data; 2) \u003cem\u003etskibd\u003c/em\u003e was used to generate true IBD segments from simulated genealogy trees, and the true recombination rate was used for true IBD length calculation and filtration; 3) For \u003cem\u003ehmmibd-rs\u003c/em\u003e, the true rate was used for both the HMM inference and the IBD length filtration; 4) For \u003cem\u003ehmmIBD\u003c/em\u003e, the average rate was used for both the HMM and length filtration steps. See Supplementary Figure 1 for similar analyses using the true rate for length calculation and filtering for \u003cem\u003ehmmIBD\u003c/em\u003e-inferred IBD segments. Also, see Supplementary Figure 2 for the IBD breakpoint analysis when all IBD segments are included (without length filtering).\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7004070/v1/e3f7f10b23d2d9f258d5adab.png"},{"id":85846541,"identity":"673898c6-12b6-4816-a1d2-44379489a336","added_by":"auto","created_at":"2025-07-02 09:43:05","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":249436,"visible":true,"origin":"","legend":"\u003cp\u003eThe \u003cem\u003ehmmibd-rs\u003c/em\u003e-based, VCF-to-IBD pipeline enables fast and streamlined detection of IBD segments from the MalariaGEN Pf7 data set within a single day. a, Descriptions, tools involved, numbers of samples and records analyzed, computation time, and output file sizes over the five steps of the VCF-to-IBD pipeline. b, Example analysis (IBD coverage) of the detection of IBD segments from the MalariaGEN Pf7 data set. 
Genes of interest (potentially under positive selection) were labeled above the plot.\u003c/p\u003e","description":"","filename":"floatimage322.png","url":"https://assets-eu.researchsquare.com/files/rs-7004070/v1/c86b3967e768674e2744b234.png"},{"id":102785591,"identity":"5bfc046a-36db-4cbe-896f-0d8752724f6e","added_by":"auto","created_at":"2026-02-16 16:08:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1337628,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7004070/v1/036ec9ac-9b51-4125-9996-ea097ecaf46c.pdf"},{"id":85846545,"identity":"b2ea5bca-5fbb-43db-baf5-14d619aa1b7a","added_by":"auto","created_at":"2025-07-02 09:43:05","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":3549736,"visible":true,"origin":"","legend":"","description":"","filename":"hmmibdrsmanuscriptsuppl.docx","url":"https://assets-eu.researchsquare.com/files/rs-7004070/v1/0fb538df8ab78e8891308293.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data","fulltext":[{"header":"BACKGROUND","content":"\u003cp\u003eIdentity-by-descent (IBD) refers to alleles or genomic regions (segments) that are identical between two individuals/genomes due to shared ancestry. For species with high recombination rates relative to mutation rates, such as malaria parasites, metrics that leverage recombination, e.g., IBD, capture finer-scale and higher-resolution dynamics in population demography [\u003cspan additionalcitationids=\"CR2\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. 
IBD, inferred from malaria parasite genomic data, conveys important information about the recent history of populations, including genetic relatedness within and between populations, loci under natural selection, and time-specific demography (effective population size and population structure), thus playing a crucial role in malaria genomic surveillance [\u003cspan additionalcitationids=\"CR5 CR6 CR7 CR8 CR9 CR10 CR11 CR12 CR13\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAccurate detection of IBD segments often requires genotype data with sufficient marker density [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Species with a high ratio of recombination rate to mutation rate, such as the malaria parasite \u003cem\u003ePlasmodium falciparum\u003c/em\u003e, tend to have a (common variant) marker density two orders of magnitude lower than that of humans [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan additionalcitationids=\"CR18 CR19\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Thus, an IBD detection algorithm robust to low marker density is crucial for high-recombining species. 
Our accompanying project suggests that \u003cem\u003ehmmIBD\u003c/em\u003e stands out among many other IBD detection methods, uniquely providing high-quality IBD segment calls, including shorter segments, that allow for the generation of accurate results even for quality-sensitive inferences [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Despite its high accuracy and wide adoption in malaria research, \u003cem\u003ehmmIBD\u003c/em\u003e could be improved in several areas: it uses only a single thread of multi-core CPUs, it assumes a uniform recombination rate across the genome, and it requires substantial preprocessing of input data into a specific format prior to analysis [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. These limitations may prevent its wider application to larger data sets or to analyses that rely on a non-uniform genetic map [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], an important consideration given the demand for analysis of large-scale whole-genome sequence (WGS) data [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] and opportunities to construct high-resolution recombination rate maps based on recent genetic crosses [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn this work, we addressed these limitations by reimplementing and enhancing the Hidden Markov Model described in the original paper [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] using the Rust programming language. 
Our new implementation, \u003cem\u003ehmmibd-rs\u003c/em\u003e, offers three key features: parallelized IBD inference, support for non-uniform recombination rates, and streamlined data management.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003eWe enabled parallelization for the HMM inference process at the level of single haploid genome pairs or groups of haploid genome pairs. In our reimplementation, we first modularized the original algorithm into multiple components, including data processing modules and different subcomponents of the HMM inference process. Relying on the modular structure, we then isolated the HMM inference process for a genome pair as the basic unit for parallelization. The original sequential HMM inference process, iterated over genome pairs, was converted into parallelizable tasks, utilizing the \u003cem\u003eRayon\u003c/em\u003e crate, a library designed for data parallelism [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. In addition, we provide options to optimize memory usage based on parameters such as the maximum number of alternative alleles per locus and the output file buffer sizes.\u003c/p\u003e \u003cp\u003eWe enabled non-uniform recombination rates for HMM inferences and IBD segment filtration using user-provided genetic maps. 
For HMM inference, we replaced the term \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(e^{-k\\rho d_{t}}\\)\u003c/span\u003e\u003c/span\u003e in the transition probability matrix [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] with the term \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(e^{-k c_{t}}\\)\u003c/span\u003e\u003c/span\u003e, where ρ is the recombination rate per generation per bp, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(d_{t}\\)\u003c/span\u003e\u003c/span\u003e is the physical distance between the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(t\\)\u003c/span\u003e\u003c/span\u003eth marker and the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\((t-1)\\)\u003c/span\u003e\u003c/span\u003eth marker, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(c_{t}\\)\u003c/span\u003e\u003c/span\u003e is the genetic distance between the two markers given by the user. Under a uniform recombination rate, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(c_{t}=\\rho d_{t}\\)\u003c/span\u003e\u003c/span\u003e; when a genetic map is provided, our new implementation maps the physical distances between markers to genetic distances and uses the genetic distance \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(c_{t}\\)\u003c/span\u003e\u003c/span\u003e directly for HMM inference, thus removing the assumption of a uniform recombination rate along the genome. 
Beyond HMM inference, the recombination rate map can also be used for the built-in, post-inference filtration of IBD segments by length in genetic units. Such filtration is usually needed for analyses based on IBD segments because short inferred segments are more error-prone and are often removed before downstream analyses [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eWe improved the efficiency and ergonomics of input and output data management. We developed a simple, cross-platform auxiliary library \u003cem\u003ebcf_reader\u003c/em\u003e in Rust and used it to process input data directly from the common genotype file format, binary variant call format (BCF). Based on this library, we implemented two main built-in functions. The first constructs haploid genomes by replacing heteroallelic genotype calls in monoclonal samples with the dominant allele when the total allele depth (command line option \u003cem\u003emin_depth\u003c/em\u003e) and the fraction of reads supporting the dominant allele (options \u003cem\u003emin_ratio\u003c/em\u003e and \u003cem\u003emin_r1_r2\u003c/em\u003e) are sufficiently high; otherwise, the calls are set to missing (as detailed in Supplementary Table\u0026nbsp;1). We note that the default criteria for choosing between the dominant allele and missing data are somewhat subjective. Users may opt for more stringent thresholds, such as setting all heteroallelic calls to missing, at the cost of removing more sites and samples during the subsequent genotype filtering step, or for more permissive ones that retain all heteroallelic calls by using the allele with the highest read support, which may introduce substantial genotyping errors. 
The second built-in function iteratively filters samples and sites (based on the missingness of genotype calls) to obtain high-quality genotype data while retaining balanced numbers of markers and samples (Supplementary Table\u0026nbsp;1). The dominant-allele-based haploid genome construction from BCF files is a heuristic strategy for working with monoclonal samples. Haploid genomes from polyclonal infections may be inferred via external deconvolution programs such as \u003cem\u003eDEploid\u003c/em\u003e or \u003cem\u003eDEploidIBD\u003c/em\u003e [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] and provided to \u003cem\u003ehmmibd-rs\u003c/em\u003e in the traditional table format used by \u003cem\u003ehmmIBD\u003c/em\u003e.\u003c/p\u003e \u003cp\u003eWe have included additional options, made available via the command line interface, to customize HMM parameters and data management parameters (e.g., BCF processing, IBD filtering, output buffering, and optional output suppression) to allow the user to balance analytical needs and computational/storage efficiency. Additionally, \u003cem\u003ehmmibd-rs\u003c/em\u003e is designed to be fully compatible with \u003cem\u003ehmmIBD\u003c/em\u003e to facilitate the transition to \u003cem\u003ehmmibd-rs\u003c/em\u003e and, by default, generates both IBD segment files and files for the fraction of sites IBD (estimates of genetic relatedness), as \u003cem\u003ehmmIBD\u003c/em\u003e does.\u003c/p\u003e \u003cp\u003eMethods used for simulation, measurement of computation time, and downstream analysis of detected IBD segments are similar to those of our previous analyses [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], with further description provided in the Supplementary Methods. 
Details of these analyses can be found in the related pipeline and source code listed in the Availability of Data and Materials.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003eOur new implementation, \u003cem\u003ehmmibd-rs\u003c/em\u003e, improves the computational efficiency of HMM IBD inference both by increasing single-thread performance and by enabling multithreading. When both \u003cem\u003ehmmibd-rs\u003c/em\u003e and \u003cem\u003ehmmIBD\u003c/em\u003e were restricted to a single thread, run times with \u003cem\u003ehmmibd-rs\u003c/em\u003e were about 40% shorter than those of \u003cem\u003ehmmIBD\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea), due to the more compact memory representation of the genotype matrix and reduced disk read and write operations in \u003cem\u003ehmmibd-rs\u003c/em\u003e. When multithreading is enabled, the speedup of \u003cem\u003ehmmibd-rs\u003c/em\u003e is almost linear in the number of threads, and its run time scales almost linearly with the number of genome pairs. To test the performance on a large data set, we ran \u003cem\u003ehmmIBD\u003c/em\u003e and \u003cem\u003ehmmibd-rs\u003c/em\u003e on simulated \u003cem\u003eP. falciparum\u003c/em\u003e-like genomic data with a sample size of up to n\u0026thinsp;=\u0026thinsp;30,000, which is the same order of magnitude as the MalariaGEN Pf7 data set (n\u0026thinsp;\u0026gt;\u0026thinsp;21,000). Our new implementation completed the IBD detection from the simulated data set in 1.3 hours using 128 threads on an AMD EPYC 9654 CPU, whereas the single-threaded \u003cem\u003ehmmIBD\u003c/em\u003e took an estimated 5.2 days to complete IBD detection (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). Additionally, when IBD segment length filtering options were used, the resulting output file sizes were substantially reduced (Supplementary Table\u0026nbsp;2). 
Thus, this new implementation, in this example, can accelerate the process by two orders of magnitude.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo understand how recombination rate misspecification affects IBD detection, we simulated genomes with a non-uniform recombination rate map that had the same mean recombination rate as the P. falciparum genome (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea and b). When focusing on IBD segments\u0026thinsp;\u0026ge;\u0026thinsp;2 cM and using the average recombination rate (\u003cem\u003ehmmIBD\u003c/em\u003e), we found that the number of ends (breakpoints) of the detected IBD segments decreases for recombination hot spots and increases for cold spots when compared to those using the true non-uniform rates (\u003cem\u003ehmmibd-rs\u003c/em\u003e) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec). Consistently, the error rates (false negative rates and false positive rates, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee and f) and deviation from the true IBD coverage pattern (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eg) were significantly higher in \u003cem\u003ehmmIBD\u003c/em\u003e results than in \u003cem\u003ehmmibd-rs\u003c/em\u003e. The differences between \u003cem\u003ehmmIBD\u003c/em\u003e- and \u003cem\u003ehmmibd-rs\u003c/em\u003e-derived IBD segments are largely reduced when using the true rates to calculate length used to filter segments inferred by \u003cem\u003ehmmIBD\u003c/em\u003e, suggesting that an accurate recombination map is important for filtering IBD segments by length in genetic units (Supplementary Fig.\u0026nbsp;1). 
To test whether the recombination rate affects HMM inference, we analyzed unfiltered IBD segments called with the true rates (\u003cem\u003ehmmibd-rs\u003c/em\u003e) and with the average rate (\u003cem\u003ehmmIBD\u003c/em\u003e). We found that rate misspecification indeed affects the detection of IBD breakpoints (Supplementary Fig.\u0026nbsp;2). The general underestimation of IBD breakpoints in recombination hot spots likely arises from two main factors: the IBD-merging bias in the HMM and the high error rates caused by low marker densities per genetic unit. In addition, this issue is aggravated by a reduced state-switching rate in the HMM and by aggressive IBD removal in length filtering, both due to recombination rate misspecification (see Supplementary Note for more details). This finding highlights the importance of accurately characterizing the local recombination rate variation and using it to improve the detection and filtration of IBD segments.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAnother obstacle in the \u003cem\u003ehmmIBD\u003c/em\u003e-based analytical pipeline is the need for an \u003cem\u003ehmmIBD\u003c/em\u003e-specific input format, which adds a data formatting step to IBD detection and downstream analysis; this step is particularly cumbersome when samples and variants are filtered iteratively. We mitigated these issues by implementing optional, built-in, all-in-memory functions for iterative sample and site filtering, and for haploid genome construction using dominant alleles (Supplementary Table\u0026nbsp;1). 
We present a simple pipeline based on these features, implemented in \u003cem\u003ehmmibd-rs\u003c/em\u003e, to demonstrate its applicability to large-scale WGS data sets, from VCF/BCF files to IBD-based estimates. The pipeline includes genotype filtering and haploid genome construction (\u003cem\u003ebcftools\u003c/em\u003e [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] and \u003cem\u003ehmmibd-rs\u003c/em\u003e), IBD calling (\u003cem\u003ehmmibd-rs\u003c/em\u003e), and IBD coverage calculation. We were able to finish IBD calling from the raw genotype call files within a single day using 64 CPU threads (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea), with the majority of time spent on \u003cem\u003ebcftools\u003c/em\u003e for the initial genotype filtering step. As a proof of concept, the resulting IBD data show signals of positive selection (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb) consistent with previous reports [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eThis study presents an improved implementation of \u003cem\u003ehmmIBD\u003c/em\u003e with three important features for large-scale population genomics: high computational performance, optional recombination rate map specification, and improved data management. 
Compared to other probabilistic IBD (segment) detection methods popular in malaria research, including \u003cem\u003ehmmIBD\u003c/em\u003e [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], isoRelate [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] and DEploidIBD [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], \u003cem\u003ehmmibd-rs\u003c/em\u003e is the first attempt to leverage the memory-safe language Rust and its rich ecosystem to embrace the era of large-scale genomics by enabling computational parallelization, employing a standard input format, and easing long-term software maintenance and further development to incorporate more complex models. Although \u003cem\u003ehmmibd-rs\u003c/em\u003e has mainly been applied to \u003cem\u003ePlasmodium\u003c/em\u003e data, it is expected to work with data from other sexually recombining species with high recombination rates for which haploid genomes can be constructed [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], which may include non-\u003cem\u003ePlasmodium\u003c/em\u003e Apicomplexan species, such as \u003cem\u003eTheileria parva\u003c/em\u003e [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], insects such as \u003cem\u003eApis mellifera\u003c/em\u003e [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], and fungi such as \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. The new features of \u003cem\u003ehmmibd-rs\u003c/em\u003e, including its parallelizability and support for a non-uniform genetic map, may allow the detection of inter-individual IBD segments from phased data on diploids, e.g. 
mosquitoes, by treating each phase of a diploid individual as a haploid genome. However, given advances in human genetics, a superior approach for diploids may also exist.\u003c/p\u003e \u003cp\u003eOne caveat of our analyses of \u003cem\u003ehmmibd-rs\u003c/em\u003e is the lack of reliable non-uniform recombination rates for empirical data sets of high-recombining species like \u003cem\u003eP. falciparum\u003c/em\u003e, despite initial efforts to estimate either the average rate or high-resolution rate maps based on limited samples [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. Ongoing work is needed to estimate high-resolution recombination rate maps based on existing genetic cross data, as well as WGS data from large-scale population samples [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan additionalcitationids=\"CR36\" citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. This will allow further evaluation of biases in IBD-based analysis due to the use of a simple average rate in empirical data. We also note that the optional built-in function that constructs haploid genomes based on dominant alleles is misspecified for polyclonal samples. 
Using more advanced genotype deconvolution tools, such as \u003cem\u003eDEploid\u003c/em\u003e and \u003cem\u003eDEploidIBD\u003c/em\u003e [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], may better utilize polyclonal infections, although it may significantly increase the computational burden.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003e \u003cem\u003ehmmibd-rs\u003c/em\u003e enhances the original IBD detection algorithm \u003cem\u003ehmmIBD\u003c/em\u003e with key features that significantly accelerate IBD detection from large-scale genomic data and enable the incorporation of a genetic map for improved accuracy in genomes with non-uniform recombination. The new implementation allows for more efficient, accurate, and streamlined IBD-based analysis of \u003cem\u003ePlasmodium\u003c/em\u003e genomes, which will contribute to timely malaria genomic surveillance.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSource code for \u003cem\u003ehmmibd-rs\u003c/em\u003e in Rust is available at https://github.com/bguo068/hmmibd-rs; this repository also includes a version of \u003cem\u003ehmmIBD\u003c/em\u003e in C, modified to allow users to specify the average recombination rate via a command line option. Source code for bcf-reader, the cross-platform library for reading the BCF file format, is available at https://github.com/bguo068/bcf-reader. 
The pipeline that simulates data and that benchmarks and characterizes \u003cem\u003ehmmibd-rs\u003c/em\u003e and \u003cem\u003ehmmIBD\u003c/em\u003e is accessible at https://github.com/bguo068/hmmibd-rs-bench. The pipeline for a full demonstration of \u003cem\u003ehmmibd-rs\u003c/em\u003e\u0026rsquo;s application to MalariaGEN Pf7 is provided at https://github.com/bguo068/hmmibd-rs-bench-empirical. The genotype data of the MalariaGEN Pf7 data set [20] and its sample meta information, including \u003cimg width=\"21\" height=\"17\" src=\"https://myfiles.space/user_files/127393_c7e80a1c9bb65875/127393_custom_files/img1751448812.gif\" alt=\"image\"\u003e\u0026nbsp;estimates, are publicly available at https://www.malariagen.net/resource/34/.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by grant 1R01AI145852 from the U.S. National Institutes of Health (NIH) to ST-H and TDO.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBG: Led algorithm enhancement, developed and tested the program, and led paper writing; SFS: Helped improve the paper; ART: Helped improve the paper; TDO: Supervised work; ST-H: Supervised work and helped improve the paper.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis publication uses MalariaGEN data as described in \u0026lsquo;Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples\u0026rsquo; MalariaGEN et al., Wellcome Open Research 2023, 8:22 https://doi.org/10.12688/wellcomeopenres.18681.1.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eNeafsey DE, Taylor AR, MacInnis BL. 
Advances and opportunities in malaria population genomics. Nat Rev Genet. 2021;22:502\u0026ndash;17.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCamponovo F, Buckee CO, Taylor AR. Measurably recombining malaria parasites. Trends Parasitol. 2023;39:17\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo B, Rowley E, O\u0026rsquo;Connor TD, Takala-Harrison S. Potential and pitfalls of using identity-by-descent for malaria genomic surveillance. Trends Parasitol. 2025;41:387\u0026ndash;400.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTaylor AR, Jacob PE, Neafsey DE, Buckee CO. Estimating relatedness between malaria parasites. Genetics. 2019;212:1337\u0026ndash;51.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchaffner SF, Taylor AR, Wong W, Wirth DF, Neafsey DE. HmmIBD: Software to infer pairwise identity by descent between haploid genotypes. Malar J. 2018;17:10\u0026ndash;3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHenden L, Lee S, Mueller I, Barry A, Bahlo M. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet. 2018;14:e1007279.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrowning SR, Browning BL. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet. 2015;97:404\u0026ndash;18.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMorgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, et al. Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malar J. 2020;19:47.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShetty AC, Jacob CG, Huang F, Li Y, Agrawal S, Saunders DL, et al. Genomic structure and diversity of Plasmodium falciparum in Southeast Asia reveal recent parasite migration patterns. Nat Commun. 
2019;10:1\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAl-Asadi H, Petkova D, Stephens M, Novembre J. Estimating recent migration and population-size surfaces. PLoS Genet. 2019;15:e1007908.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmambua-Ngwa A, Amenga-Etego L, Kamau E, Amato R, Ghansah A, Golassa L, et al. Major subpopulations of Plasmodium falciparum in sub-Saharan Africa. Science. 2019;365:813\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBelbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, et al. Toward a fine-scale population health monitoring system. Cell. 2021;184:2068\u0026ndash;83.e11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBorda V, Loesch DP, Guo B, Laboulaye R, Veliz-Otani D, French JN, et al. Genetics of Latin American Diversity Project: Insights into population genetics and association studies in admixed groups in the Americas. Cell Genom. 2024;4:100692.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, et al. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun. 2024;15:2499.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou Y, Browning SR, Browning BL. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data. Am J Hum Genet. 2020;106:426\u0026ndash;37.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo B, Takala-Harrison S, O\u0026rsquo;Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes. eLife [Internet]. 2025 [cited 2025 Jan 22];14. 
Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://elifesciences.org/reviewed-preprints/101924\u003c/span\u003e\u003cspan address=\"https://elifesciences.org/reviewed-preprints/101924\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eConway DJ, Roper C, Oduola AMJ, Arnot DE, Kremsner PG, Grobusch MP, et al. High recombination rate in natural populations of Plasmodium falciparum. Proc Natl Acad Sci U S A. 1999;96:4506\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, et al. A high-resolution recombination map of the human genome. Nat Genet. 2002;31:241\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTaliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMalariaGEN, Ahouidi A, Ali M, Almagro-Garcia J, Amambua-Ngwa A, Amaratunga C, et al. An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. Wellcome Open Res. 2021;6:42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJiang H, Li N, Gopalan V, Zilversmit MM, Varma S, Nagarajan V et al. High recombination rates and hotspots in a Plasmodium falciparum genetic cross. Genome Biol. 2011.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVendrely KM, Kumar S, Li X, Vaughan AM. Humanized Mice and the Rebirth of Malaria Genetic Crosses. Trends Parasitol. 2020;36:850\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKane J, Li X, Kumar S, Button-Simons KA, Vendrely Brenneman KM, Dahlhoff H, et al. 
A Plasmodium falciparum genetic cross reveals the contributions of pfcrt and plasmepsin II/III to piperaquine drug resistance. mBio. 2024;15:e0080524.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStone J, Matsakis N, rayon. Simple work-stealing parallelism for Rust [Internet]. 2024. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e{https://github.com/rayon-rs/rayon}\u003c/span\u003e\u003cspan address=\"http://{https://github.com/rayon-rs/rayon}\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang K, Naseri A, Wei Y, Zhang S, Zhi D. Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts. GigaScience. 2022;11:giac111.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu SJ, Almagro-Garcia J, McVean G. Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics. 2018;34:9\u0026ndash;15.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu SJ, Hendry JA, Almagro-Garcia J, Pearson RD, Amato R, Miles A, et al. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. Elife. 2019;8:e40845.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDanecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10:giab008.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStapley J, Feulner PGD, Johnston SE, Santure AW, Smadja CM. Variation in recombination frequency and distribution across eukaryotes: patterns and processes. Philosophical Trans Royal Soc B: Biol Sci. 2017;372:20160455.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSivakumar T, Hayashida K, Sugimoto C, Yokoyama N. Evolution and genetic diversity of Theileria. Infect Genet Evol. 
2014;27:250\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKent CF, Minaei S, Harpur BA, Zayed A. Recombination is associated with the evolution of genome structure and worker behavior in honey bees. Proceedings of the National Academy of Sciences. 2012;109:18012\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeroy T, Faux P, Basso B, Eynard S, Wragg D, Vignal A. Inferring Long-Term and Short-Term Determinants of Genetic Diversity in Honey Bees: Beekeeping Impact and Conservation Strategies. Mol Biol Evol. 2024;41:msae249.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePeter J, De Chiara M, Friedrich A, Yue J-X, Pflieger D, Bergstr\u0026ouml;m A, et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature. 2018;556:339\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarton AB, Pekosz MR, Kurvathi RS, Kaback DB. Meiotic Recombination at the Ends of Chromosomes in Saccharomyces cerevisiae. Genetics. 2008;179:1221\u0026ndash;35.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMiles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, et al. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res. 2016;26:1288\u0026ndash;99.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNaseri A, Yue W, Zhang S, Zhi D. Fast inference of genetic recombination rates in biobank scale data. Genome Res. 2023;33:1015\u0026ndash;22.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou Y, Browning BL, Browning SR. Population-Specific Recombination Maps from Segments of Identity by Descent. Am J Hum Genet. 
2020;107:137\u0026ndash;48.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"