Distributed Compressive Genomics: Fundamental Pattern Matching Primitives via Spark

Research Article

Lorenzo Di Rocco (Sapienza University of Rome), Umberto Ferraro Petrillo (Sapienza University of Rome), Raffaele Giancarlo (University of Palermo), and Giuseppe Cattaneo (University of Salerno, corresponding author)

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-4747701/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted. Version 1, posted August 12th, 2024.

Abstract

Background: Compressive genomics is a set of techniques that, for some important bioinformatics tasks such as sequence comparison and search, leverage a compressed representation of the data in order to improve time performance. That is, good compression is not the final goal in itself: the new, smaller representation of the data should also support fast processing. Big Data technologies, such as the distributed systems exemplified by Spark, have no apparent limit on the overall amount of memory available, which makes compression seem less appealing. They may nevertheless benefit from compressive genomics for a variety of reasons (e.g., the economic cost of adding more computational nodes may be significant). Unfortunately, no study has evaluated how convenient such an approach would be on, for concreteness, Spark.

Results: Although porting compressive genomics techniques to a distributed environment is neither simple nor immediate, we present here the first study regarding the benefits of compressive genomics for Big Data technologies such as Spark. In particular, we provide the first Spark version of the FM-Index, a fundamental compressed pattern matching data structure with pervasive use in bioinformatics. For completeness, we also propose Spark versions of the Compressed Boyer-Moore Pattern Matching algorithm. A carefully designed experimental analysis indicates the clear advantages of using those two compressive genomics primitives within Spark. Moreover, we propose a general method, with associated software, that simplifies the development of compressive genomics techniques on Spark.

Conclusions: We provide the first, and much-needed, study regarding the potential advantages of using compressive genomics within Big Data technologies: the proof of principle is given via a fundamental compressed data structure for bioinformatics, the FM-Index, on a leading system, Spark. The software platform associated with this research is intended as a fundamental building block for further studies in this area.

Availability: https://github.com/ldirocco/SparkGeco

Keywords: MapReduce, Hadoop, Sequence Analysis, Data Compression

Additional Declarations: No competing interests reported.

Supplementary Files: supplementary.pdf
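For readers unfamiliar with the FM-Index named in the abstract, the following is a minimal single-machine sketch of its textbook backward-search query, written in Python. It is not the authors' Spark implementation and does not reflect the SparkGeco API. The function names `build_fm_index` and `count_and_locate` are hypothetical, the suffix array is built by naive sorting, and the full suffix array and occurrence table are stored uncompressed, so the sketch shows only the query logic and none of the space savings of a real FM-Index.

```python
def build_fm_index(text):
    """Build a toy FM-index (hypothetical helper, illustration only)."""
    text = text + "$"  # sentinel, assumed smaller than every other symbol
    # Naive suffix array by sorting suffixes; only suitable for tiny inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i] (uncompressed here).
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def count_and_locate(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern using backward search."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, c_table, occ = build_fm_index("ACGTACGTTACG")
    print(count_and_locate("ACG", sa, c_table, occ))  # [0, 4, 9]
```

The search consumes the pattern right to left and narrows a range of suffix-array rows with one table lookup per character, so the query cost depends on the pattern length rather than on the text length. That property is what makes the structure attractive for repeated searches over large genomic sequences.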