Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors

doi:10.21203/rs.3.rs-8865258/v1

Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors

2026 · doi:10.21203/rs.3.rs-8865258/v1

preprint OA: closed

Full text JSON View at publisher

Full text 14,645 characters · extracted from preprint-html · click to expand

Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors Nathalie Gocht, Alexander Schliep This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-8865258/v1 This work is licensed under a CC BY 4.0 License Status: Under Review Version 1 posted 7 You are reading this latest preprint version Abstract Background: Estimating how many approximate matches a query sequence has in a genome is a fundamental task in bioinformatics, with applications including read mapping or genome mappability analysis. While exact string algorithms and specialized index data structures enable low-error matching, counting matches for short k-mers at high dissimilarity (small k/e ratio) becomes computationally intractable as the number of potential matches grow exponentially for Hamming distance, and even more so for other dissimilarity measures including edit distance. Results: We use machine learning to solve this algorithmic problem. Specifically, we propose deep learning models which in contrast to learned index approaches do not accelerate exact algorithms, but rather directly predict Hit Frequency Vectors (HFVs), which count approximate matches at increasing distance e = 1, 2, 3, . . .. The models are trained on HFVs which are computed using brute-force (for Hamming distance), respectively an efficient sampling approach followed by a search to account for undersampling close matches and correcting counts. Model training and inference are computationally efficient, and enable rapid genome-scale estimation of approximate match frequencies for short query sequences and small k/e. The models capture genome-wide match frequency distributions with high accuracy and allow fast batch prediction even in regimes where exhaustive computation is computationally infeasible for exact algorithms relying on seed-and-extend due to limited seed length leading to exponential effort. We provide loss functions and evaluation measures appropriate for histograms with bin count varying by several orders of magnitude. Conclusion: Our approach enables practical approximation of HFVs and can be easily extended to complex dissimilarities which are too computationally expensive to evaluate exhaustively across an entire genome. The combination of approximate HFV generation and predictive modeling extends the scope of sequence uniqueness analyses considerably. Our approach can trivially be extended to continuous dissimilarities like oligonucleotide binding affinity via appropriate binning for applications in DNA storage, target depletion, or oligonucleotide therapeutics. Lastly, we believe that our work can provide a blue-print for further applications in bioinformatics where straight-forward ML prediction—possibly followed by confirmation with exact algorithms—could have great utility. hit frequencies approximate string matching deep learning Full Text Additional Declarations No competing interests reported. Supplementary Files DLMappabilityPredictionHFVSupplement.pdf Cite Share Download PDF Status: Under Review Version 1 posted Reviews received at journal 15 Apr, 2026 Reviewers agreed at journal 12 Apr, 2026 Reviewers agreed at journal 10 Apr, 2026 Reviewers invited by journal 24 Feb, 2026 Editor assigned by journal 16 Feb, 2026 Submission checks completed at journal 16 Feb, 2026 First submitted to journal 12 Feb, 2026 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8865258","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":596489710,"identity":"bcfe8f7b-d392-47c3-80f0-dc73abecb873","order_by":0,"name":"Nathalie Gocht","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA+klEQVRIiWNgGAWjYHACNgaGAgkwi5mBwSYBzOLBq4MZqMUApiUhLQFkBDFaGGBaDhPWYs5+/tiDHwYWcgz8iw9+LvxxPo9BvseA4U0Fbi2WPcnshj0GEsYMEs+SpWck3C5mYOMxYJxzBrcWgwPJbBI8BhKJDRJnzJh5Em4n7j/GY8DM24ZHy/nHbJJ/DCTqoVrOJTawgbT8w6PlRjKbNNCWBAb+HpCWA1AtDfi0PDaTljGQMGyTYEuW5klLBvolreDgnGP4HJb4TPJNRZ08P//hg595bOzyGJgPb3zwpga3FjhgA7oNDg4QoQEI+IlUNwpGwSgYBSMPAAAX00RXmS131wAAAABJRU5ErkJggg==","orcid":"","institution":"Brandenburg University of Technology Cottbus–Senftenberg","correspondingAuthor":true,"prefix":"","firstName":"Nathalie","middleName":"","lastName":"Gocht","suffix":""},{"id":596489724,"identity":"9c4d311f-8397-4075-b17d-fd9bfb3586de","order_by":1,"name":"Alexander Schliep","email":"","orcid":"","institution":"Brandenburg University of Technology Cottbus–Senftenberg","correspondingAuthor":false,"prefix":"","firstName":"Alexander","middleName":"","lastName":"Schliep","suffix":""}],"badges":[],"createdAt":"2026-02-12 19:53:17","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8865258/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8865258/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104398228,"identity":"6317101e-8bb9-4a46-92db-8978c6b96fa5","added_by":"auto","created_at":"2026-03-11 12:00:50","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1560477,"visible":true,"origin":"","legend":"","description":"","filename":"DLMappabilityPredictionHFV.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8865258/v1_covered_9c0e40c2-930a-4082-9cb0-0753c0666d03.pdf"},{"id":103546280,"identity":"01a978dd-d8bd-48ec-9f9e-f2957a9be81e","added_by":"auto","created_at":"2026-02-26 23:48:45","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1491189,"visible":true,"origin":"","legend":"","description":"","filename":"DLMappabilityPredictionHFVSupplement.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8865258/v1/e194a73a4b4af9d9b6f30ff4.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003ePredicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors\u003c/p\u003e","fulltext":[],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":true,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":true,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"algorithms-for-molecular-biology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"amob","sideBox":"Learn more about [Algorithms for Molecular Biology](http://almob.biomedcentral.com/)","snPcode":"13015","submissionUrl":"https://submission.nature.com/new-submission/13015/3","title":"Algorithms for Molecular Biology","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"hit frequencies, approximate string matching, deep learning","lastPublishedDoi":"10.21203/rs.3.rs-8865258/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8865258/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBackground: Estimating how many approximate matches a query sequence has\u0026nbsp;in a genome is a fundamental task in bioinformatics, with applications including\u0026nbsp;read mapping or genome mappability analysis. While exact string algorithms and\u0026nbsp;specialized index data structures enable low-error matching, counting matches\u0026nbsp;for short k-mers at high dissimilarity (small k/e ratio) becomes computationally\u0026nbsp;intractable as the number of potential matches grow exponentially for Hamming distance, and even more so for other dissimilarity measures including edit\u0026nbsp;distance.\u003c/p\u003e\n\u003cp\u003eResults: We use machine learning to solve this algorithmic problem. Specifically, we propose deep learning models which in contrast to learned index\u0026nbsp;approaches do not accelerate exact algorithms, but rather directly predict Hit\u0026nbsp;Frequency Vectors (HFVs), which count approximate matches at increasing distance e = 1, 2, 3, . . .. The models are trained on HFVs which are computed\u0026nbsp;using brute-force (for Hamming distance), respectively an efficient sampling\u0026nbsp;approach followed by a search to account for undersampling close matches and correcting counts. Model training and inference are computationally efficient,\u0026nbsp;and enable rapid genome-scale estimation of approximate match frequencies for\u0026nbsp;short query sequences and small k/e. The models capture genome-wide match\u0026nbsp;frequency distributions with high accuracy and allow fast batch prediction even\u0026nbsp;in regimes where exhaustive computation is computationally infeasible for exact\u0026nbsp;algorithms relying on seed-and-extend due to limited seed length leading to exponential effort. We provide loss functions and evaluation measures appropriate for\u0026nbsp;histograms with bin count varying by several orders of magnitude.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eConclusion: Our approach enables practical approximation of HFVs and can\u0026nbsp;be easily extended to complex dissimilarities which are too computationally\u0026nbsp;expensive to evaluate exhaustively across an entire genome. The combination\u0026nbsp;of approximate HFV generation and predictive modeling extends the scope\u0026nbsp;of sequence uniqueness analyses considerably. Our approach can trivially be\u0026nbsp;extended to continuous dissimilarities like oligonucleotide binding affinity via\u0026nbsp;appropriate binning for applications in DNA storage, target depletion, or oligonucleotide therapeutics. Lastly, we believe that our work can provide a blue-print\u0026nbsp;for further applications in bioinformatics where straight-forward ML prediction—possibly followed by confirmation with exact algorithms—could have great\u0026nbsp;utility.\u003c/p\u003e","manuscriptTitle":"Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-26 23:48:41","doi":"10.21203/rs.3.rs-8865258/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"editorInvitedReview","content":"","date":"2026-04-15T12:37:10+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"213790725685625080699098248309522984629","date":"2026-04-12T18:49:16+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"146085521913518188987542647088968598515","date":"2026-04-10T12:26:54+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-24T18:24:25+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-16T13:29:21+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-16T11:14:42+00:00","index":"","fulltext":""},{"type":"submitted","content":"Algorithms for Molecular Biology","date":"2026-02-12T19:44:22+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"algorithms-for-molecular-biology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"amob","sideBox":"Learn more about [Algorithms for Molecular Biology](http://almob.biomedcentral.com/)","snPcode":"13015","submissionUrl":"https://submission.nature.com/new-submission/13015/3","title":"Algorithms for Molecular Biology","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"40af3d69-b4d0-404d-9dc4-00500ad8cc3b","owner":[],"postedDate":"February 26th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-02-26T23:48:41+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-26 23:48:41","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8865258","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8865258","identity":"rs-8865258","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: preprint-html ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2026) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00