IGD: A simple, efficient genotype data format

Drew DeHaas, Xinzhu Wei
Department of Computational Biology, Cornell University, Ithaca, NY
For correspondence: aprilwei{at}cornell.edu
bioRxiv, doi: https://doi.org/10.1101/2025.02.05.636549

Abstract

Motivation
While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable bioinformatics.

Results
We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz on Biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.

Availability
A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.

Introduction

Genetic polymorphism data is typically stored in a tabular format that can be thought of as an S × N matrix. The rows represent S sites and the columns represent N individuals, where each site has at least one alternate allele that differs from the reference sequence, but may have many (a multi-allelic site). Variant Call Format (VCF) (Danecek et al., 2011) and its compressed form (.vcf.gz) are mainstays of tooling that processes such tabular genotype data. The VCF format is very flexible, and its plaintext nature makes it easy to understand, construct, and parse. However, it is inefficient to store and process for large-scale datasets, as evidenced by the proliferation of faster and more compact formats. Among alternate formats, BCF (Li, 2011), BGEN (Band and Marchini, 2018), and BED (Purcell et al., 2007) have popular tooling support.
Newer, more efficient formats such as PGEN (Rivas and Chang, 2024), XSI (Wertenbroek et al., 2022), Savvy (LeFaive et al., 2021), GTshark (Deorowicz and Danek, 2019), GRG (DeHaas et al., 2024), and others (Lan et al., 2021; Browning et al., 2018) leverage the similarity between samples at nearby genetic positions (due to linkage disequilibrium (LD)) to compress genotype data to an impressive extent.

Here we present the Indexable Genotype Data (IGD) format, which is designed around the (sometimes conflicting) principles of simplicity and efficiency. IGD encodes tabular genotype data as hard calls, similar to pVCF and BED. The only meta-data it stores (optionally) are identifiers for variants and individuals; the expectation is that most meta-data can be stored separately in general purpose file formats like CSV or JSON (Pezoa et al., 2016). IGD is uncompressed, which makes reading and writing the format easy to implement and avoids the need for external compression libraries, which may not be easily usable across platforms or programming languages. IGD is a binary format that supports multi-allelic variants and any ploidy up to 255, is contained in a single file, and can be constructed in one pass over the input data. IGD can represent both phased and unphased data, but all data in the file must have the same phasedness.

Methods

Given that we have N individuals in a dataset, we number them 0…(N-1). There are N_H = N × ploidy haploid samples of these individuals, which are similarly numbered 0…(N_H-1). Throughout, when we refer to a "sample" we mean a haploid sample. There are M variants in a dataset, each of which can be uniquely identified by the pair (base-pair position, alternate allele), and Q variants that contain at least one sample with missing data. There are a few significant aspects of the IGD format worth highlighting.
All polymorphic sites are stored using bi-allelic format
Instead of storing a row per site (S × N_H matrix), IGD stores a row per variant (M × N_H matrix). Multi-allelic sites are supported by expanding them into a row per variant. For example, a (k+1)-allelic site with k alternate alleles (without missing data) is expanded into k variants that all have the same position and reference allele, but different alternate alleles. The original multi-allelic sites can be recovered by aggregating the IGD variants by position; however, keeping the data as an M × N_H matrix is often convenient for statistical genetics or population genetics applications. In practice, IGD is an (M + Q) × N_H matrix, since missing data is encoded as a row that lists the samples with missing data, instead of the samples with the alternate allele.

IGD contains an internal index
The index contains the genomic position (in base-pairs) of each variant, and can be cross-referenced to the genotype data, the allele strings, or variant IDs. By keeping the contents of the index small (16 bytes per variant) we keep the cost of reading it from disk very small. All variant-related data in IGD can be randomly accessed by i, the row number of that variant in the IGD index.

IGD uses one of two compact genotype formats per variant
Each row of genotype data is represented as the set of samples that have the variant/alternate allele. The two simple, compact ways to store this data are either (a) sparsely as a list of sample numbers or (b) densely as a bit-vector where each bit at position i reflects whether the i-th sample has the alternate allele (1) or not (0). We can choose between representations (a) and (b) by examining the allele frequency p_i of the variant in question. Sample numbers are represented by 32-bit unsigned integers, so a sparse row costs 32 bits per listed sample while a bit-vector costs one bit per sample. This means that if p_i < 1/32 the sparse representation (a) is more compact; otherwise the bit-vector representation (b) is smaller.
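The two ideas above, expanding a site into one row per alternate allele and picking the smaller row representation, can be sketched in a few lines of Python. This is an illustration only and is not taken from picovcf or pyigd: the function names `expand_site` and `choose_row_type`, the tuple layout, and the use of `None` both for a missing call and for the alternate allele of a missing-data row are assumptions made for the example. The size comparison counts payload bits only (32 per listed sample versus one per haploid sample), which is where a frequency cut-off of roughly 1/32 comes from.

```python
from typing import List, Optional, Tuple

# (position, ref allele, alt allele, is_missing, haploid samples in the row)
Row = Tuple[int, str, Optional[str], bool, List[int]]


def expand_site(position: int, ref: str, alts: List[str],
                calls: List[Optional[int]]) -> List[Row]:
    """Split one (possibly multi-allelic) site into bi-allelic rows.

    calls[i] is the allele carried by haploid sample i:
    0 = reference, k = the k-th alternate allele, None = missing (assumed).
    """
    rows: List[Row] = []
    for k, alt in enumerate(alts, start=1):
        samples = [i for i, call in enumerate(calls) if call == k]
        rows.append((position, ref, alt, False, samples))
    missing = [i for i, call in enumerate(calls) if call is None]
    if missing:
        # Extra row listing the samples without a call (the "+ Q" rows).
        rows.append((position, ref, None, True, missing))
    return rows


def choose_row_type(num_alt_samples: int, num_samples: int) -> str:
    """Return "sparse" or "dense", whichever has the smaller payload."""
    sparse_bits = 32 * num_alt_samples  # one uint32 per listed sample
    dense_bits = num_samples            # one bit per haploid sample
    return "sparse" if sparse_bits < dense_bits else "dense"


rows = expand_site(100, "A", ["C", "T"], [0, 1, 2, None, 1])
for row in rows:
    print(row, choose_row_type(len(row[4]), 1000))
```

With 1000 haploid samples, a row listing 31 samples is still smaller as a sparse list (992 bits versus 1000), while a row listing 32 samples is not (1024 bits).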
It is important to note that a valid IGD file can be constructed using only the sparse list representation (a), only the bit-vector representation (b), or any mixture of the two, but the optimally sized IGD will determine which to use on a per-variant basis. Converting between these two representations is trivial, and thus the representation on disk does not have to be the representation used for computation.

2.1 File Format Details

The layout of an IGD file is shown in Fig. 1. The header contains file offsets for each of the sections after the genotype data, as their positions are unpredictable otherwise, and random access to them is useful.

Fig. 1. IGD file layout. The layout of an IGD file on disk. M is the number of variants, N is the number of individuals, and N_H is the number of haploid samples. Genotype data rows may be either sparse (a list of haploid sample indexes containing the alternate allele) or dense (a bit-vector with a 1 at an index i iff the i-th haploid sample contains the alternate allele).

We use the following storage type definitions.

uint32: A 32-bit unsigned integer.
uint64: A 64-bit unsigned integer.
string: A uint32 for the length k, followed by k bytes for the contents.
list32: A uint32 for the length k, followed by k uint32 values for the contents.
bv(w): A bit-vector of w bits stored at the byte granularity. The number of bytes used is ceil(w/8). Given a sample index b that we want to store as a 1, the byte offset is determined by floor(b/8). Within that byte we set the (7 - (b mod 8))-th least significant bit; that is, if (b mod 8) = 0 we will set the most significant bit.

2.1.1 Header

The header is a fixed-size (128 byte) table as described in Table 1.

Table 1. IGD header details

2.1.2 Flags

The least-significant bit of the flags (in the header) signifies phasedness, where a value of 1 indicates phased data.
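The storage types defined above (string, list32, and bv(w)) are simple enough to sketch directly with Python's standard `struct` module. The code below is a minimal illustration, not the picovcf or pyigd implementation. The text does not state a byte order or a text encoding, so little-endian integers and UTF-8 strings are assumptions here, and all function names are hypothetical.

```python
import struct
from typing import Iterable, List


def pack_string(text: str) -> bytes:
    """string: a uint32 length k followed by k bytes (UTF-8 assumed)."""
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def pack_list32(samples: List[int]) -> bytes:
    """list32: a uint32 length k followed by k uint32 values."""
    return struct.pack("<I", len(samples)) + struct.pack(
        f"<{len(samples)}I", *samples)


def samples_to_bv(samples: Iterable[int], num_samples: int) -> bytes:
    """bv(w): ceil(w/8) bytes, sample b stored in byte floor(b/8)."""
    buf = bytearray((num_samples + 7) // 8)
    for b in samples:
        # Bit (7 - (b mod 8)): b mod 8 == 0 sets the most significant bit.
        buf[b // 8] |= 1 << (7 - (b % 8))
    return bytes(buf)


def bv_to_samples(buf: bytes, num_samples: int) -> List[int]:
    """Recover the sample list from a bit-vector (dense to sparse)."""
    return [b for b in range(num_samples)
            if buf[b // 8] & (1 << (7 - (b % 8)))]


dense = samples_to_bv([0, 9], 10)
print(dense.hex(), bv_to_samples(dense, 10))
```

The pair `samples_to_bv` and `bv_to_samples` also shows why converting between the sparse and dense row representations is cheap: each direction is a single pass over the samples.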
2.1.2 Description strings

Immediately following the header are two string values. The first is a string describing how the file was created (e.g., "converted from foo.vcf.gz") and the second is a generic description field.

2.1.3 Genotype data

Immediately following the description strings are M+Q rows of genotype data, where each row is either a list32 or a bv(N_H). The Index (described next) has a flag that indicates the type of each row.

2.1.4 Index

The Index is M+Q rows of 16 bytes each, and each row can be viewed as two uint64 values. The first uint64 value contains the base-pair position associated with the variant in the least-significant 48 bits, followed by an 8-bit unsigned integer numCopies, and finally bitwise flags in the most-significant 8 bits. The currently defined flags are:

SPARSE=0x01: If this flag is set the corresponding genotype row is a list32, otherwise it is a bv(N_H).
IS_MISSING=0x02: If this flag is set the corresponding genotype row's sample list represents missing data. The samples listed in the row do not have a variant call for that polymorphic site.

The numCopies value is 0 for phased data, but is 1 ≤ numCopies ≤ ploidy for unphased data. The second uint64 value in the Index row contains the file offset of the genotype row for the current variant. The i-th variant can be randomly accessed directly at IndexStart + (16 × i).

2.1.5 Allele strings

This section is M+Q rows; each row contains first the string for the reference allele and then the string for the alternate allele.

2.1.6 Individual IDs

This section is a uint64 for the number of strings, followed by that many strings, where the k-th is the identifier for individual k. This section is only present in the file if the corresponding header file offset entry is non-zero.

2.1.7 Variant IDs

This section is a uint64 for the number of strings, followed by that many strings, where the i-th one is the variant identifier for the i-th variant.
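As a sketch of how the 16-byte Index rows (section 2.1.4) and the ID sections described here could be read, the Python below packs and unpacks an Index row and reads an ID table. It is an illustration under assumptions, not the picovcf or pyigd code: little-endian byte order and UTF-8 identifiers are assumed, and the function names and the `index_start` and `offset` parameters (standing in for values that would come from the header) are hypothetical.

```python
import io
import struct
from typing import BinaryIO, List, Optional, Tuple

SPARSE = 0x01
IS_MISSING = 0x02
INDEX_ROW_BYTES = 16


def pack_index_row(position: int, num_copies: int, flags: int,
                   genotype_offset: int) -> bytes:
    """Position in the low 48 bits, then numCopies, then flags on top."""
    first = (flags << 56) | (num_copies << 48) | position
    return struct.pack("<QQ", first, genotype_offset)


def unpack_index_row(row: bytes) -> Tuple[int, int, int, int]:
    first, genotype_offset = struct.unpack("<QQ", row)
    position = first & ((1 << 48) - 1)
    num_copies = (first >> 48) & 0xFF
    flags = (first >> 56) & 0xFF
    return position, num_copies, flags, genotype_offset


def read_index_row(f: BinaryIO, index_start: int,
                   i: int) -> Tuple[int, int, int, int]:
    """Random access to variant i at IndexStart + (16 * i)."""
    f.seek(index_start + INDEX_ROW_BYTES * i)
    return unpack_index_row(f.read(INDEX_ROW_BYTES))


def read_id_table(f: BinaryIO, offset: int) -> Optional[List[str]]:
    """Read a uint64 count followed by that many strings."""
    if offset == 0:
        return None  # a zero header offset means the section is absent
    f.seek(offset)
    (count,) = struct.unpack("<Q", f.read(8))
    ids = []
    for _ in range(count):
        (k,) = struct.unpack("<I", f.read(4))
        ids.append(f.read(k).decode("utf-8"))
    return ids


# Tiny in-memory example: two Index rows starting at byte 5.
index = io.BytesIO(b"\x00" * 5
                   + pack_index_row(10, 0, 0, 100)
                   + pack_index_row(20, 1, SPARSE | IS_MISSING, 200))
print(read_index_row(index, 5, 1))
```

The same `read_id_table` helper would serve for both the Individual IDs and the Variant IDs sections, since the two share one layout.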
This section is only present in the file if the corresponding header file offset entry is non-zero.

2.2 Phasedness

Unphased data is stored by clearing the phased flag in the header, and storing separate variants for each number of copies of each alternate allele. The numCopies value in the index (see above) indicates the zygosity of the currently stored sample list. Additionally, instead of storing haploid sample lists, IGD stores individual-based sample lists for unphased data. That is, it represents an (M + Q) × N matrix (instead of (M + Q) × N_H).

2.3 Access Patterns

There are two typical access patterns for an IGD file. If the variant index i is known, we can seek directly to it in the Index and then seek directly to the genotype data for that variant. If any of the string data is needed (alleles, individual IDs, variant IDs) those tables will need to be loaded into memory so they are indexable by i, or just scanned on disk to find the i-th entry.

Alternatively, traversing an IGD file is done by seeking to the start of the Index. Starting at variant i = 0, each row of the Index is read and if the base-pair position is of interest then the genotype data is accessed using the file offset found in the current (i-th) row of the Index. String data can be read into RAM one time, or a file pointer can be maintained to the current entry for each string table and incremented whenever i is incremented.

Results

We compared file size and data traversal time between IGD, .vcf.gz, BCF, and PGEN (Fig. 2). These formats were chosen for their apparent popularity as well as for capturing a spectrum from simple and inefficient (.vcf.gz) to more complex, yet very efficient (PGEN). PGEN is on the more complex side because it uses LD-based compression, and also supports 8 different storage modes, some of which are for backwards compatibility (Chang, 2024).

Fig. 2. Comparison of file formats.
The upper panels show time to traverse the file for UK Biobank WGS data (left) and simulated data (right). The lower panels show the file sizes for UK Biobank WGS data (left) and simulated data (right). Allele frequency calculation was used for traversal time, as it is a trivial calculation over all the data in the file. For formats that encode an allele count, such as BCF, PGEN, and IGD (for sparse variants), we do not use that count but read the full sample data for each variant in order to measure the data traversal overhead.

plink2 (Purcell and Chang, 2024) was used for conversion to BCF and PGEN formats, as well as for allele frequency calculation for .vcf.gz and BCF. The API PgrGetDifflistOrGenovec() was used for calculating PGEN allele frequency. The BCF files were stripped of additional meta-data, and only contained variants and genotypes. The simulated data was generated via stdpopsim (Adrion et al., 2020) and msprime (Kelleher et al., 2016), using an out-of-Africa demographic model (Jouganous et al., 2017) and sampling 500,000 European individuals, in an attempt to generate data similar to the UK Biobank whole-genome sequence (WGS) data (Hofmeister et al., 2023).

File sizes are all fairly similar on the simulated dataset, ranging from 54GB (PGEN) to 84GB (.vcf.gz), with IGD and BCF being similarly sized at 73-74GB. PGEN and IGD are the two smallest formats on the UKB dataset, which is much richer in low-frequency variants (∼96% of variants are MAF<0.1% (Hofmeister et al., 2023)) than our simulated dataset (∼73% of variants are MAF<0.1%), and thus more compactly stored with a sparse representation. BCF and .vcf.gz rely heavily on standard compression algorithms, which is illustrated by the traversal times for IGD and PGEN being many times faster, especially on the UKB data. PGEN is the smallest and fastest file format, about 30% smaller and 5x faster than IGD.

File format conversion times are summarized in Table 2.
PGEN is again fastest, likely partially due to plink's highly optimized .vcf.gz read functionality.

Table 2. Conversion times from .vcf.gz

The compactness and simplicity of the IGD format make it easily usable in bioinformatic tool development. IGD has been successfully used as the input to Genotype Representation Graph (GRG) construction (DeHaas et al., 2024), and is essential for the efficiency of that process on biobank-scale data. GRG construction requires fast indexing of genomic regions as well as fast genotype data access. Indexing compressed formats (such as .vcf.gz and BCF) can be complex, and creates a separate file for the index. For IGD the index is a fundamental part of the file format. Fig. 3 shows that constructing a GRG tree, the first part of GRG construction, is 13-15x faster with IGD than with .vcf.gz. We extracted the same region (for varying lengths, on the x-axis) from a simulated dataset with 1 million haploid samples, and then timed the GRG tree construction for that region.

Fig. 3. File format impact on GRG tree construction. Genotype Representation Graph (GRG) tree construction time for .vcf.gz vs. IGD file formats, for small regions of the genome (x-axis) from a simulated dataset with 1 million haploid samples.

IGD provides an extremely simple, yet efficient, alternative to existing file formats. It focuses on genotype data storage, and ease-of-use for developers of scalable prototypes and tools.

Funding

This work has been partly supported by NIH R35GM150579 to X.W.

Conflict of Interest

None declared.

Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number 97908.

References

Adrion, J.R. et al. (2020) A community-maintained standard library of population genetic models. eLife, 9, e54967.
Band, G. and Marchini, J. (2018) Bgen: a binary file format for imputed genotype and haplotype data. bioRxiv, 308296.
Browning, B.L. et al. (2018) A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet., 103, 338–348.
Chang, C. (2024) PLINK 2 File Format Specification Draft.
Danecek, P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
DeHaas, D. et al. (2024) Enabling efficient analysis of biobank-scale data with genotype representation graphs. Nat. Comput. Sci., 1–13.
Deorowicz, S. and Danek, A. (2019) GTShark: genotype compression in large projects. Bioinformatics, 35, 4791–4793.
Hofmeister, R.J. et al. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat. Genet., 55, 1243–1249.
Jouganous, J. et al. (2017) Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics, 206, 1549–1567.
Kelleher, J. et al. (2016) Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol., 12, e1004842.
Lan, D. et al. (2021) Genozip: a universal extensible genomic data compressor. Bioinformatics, 37, 2225–2230.
LeFaive, J. et al. (2021) Sparse allele vectors and the savvy software suite. Bioinformatics, 37, 4248–4250.
Li, H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.
Pezoa, F. et al. (2016) Foundations of JSON schema. In, Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pp. 263–273.
Purcell, S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
Rivas, M.A. and Chang, C. (2024) Efficient storage and regression computation for population-scale genome sequencing studies. bioRxiv.
Wertenbroek, R. et al. (2022) XSI—a genotype compression tool for compressive genomics in large biobanks. Bioinformatics, 38, 3778–3784.

Posted February 08, 2025.