vcfsim: flexible simulation of all-sites VCFs with missing data

preprint OA: closed

Abstract

Background

VCFs are the most widely used data format for encoding genetic variation. By design, standard VCFs do not include data from sites where all individuals are homozygous for the reference allele (“invariant sites”) and thus do not differentiate these from sites where data are completely missing. However, missing data are a key feature of biological datasets across all domains of genomics, and many recent studies have shown that missing data can introduce a variety of statistical biases in the estimation of key population genetic parameters. A solution to this limitation is to include invariant sites in a standard VCF, creating an “all-sites VCF” that exposes missing and invariant sites explicitly. One hurdle to the wider adoption of all-sites VCFs is the lack of a reliable parameterized simulation framework for generating biologically realistic all-sites VCFs.
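To make the source of bias concrete, the following minimal sketch computes average pairwise diversity over a toy region in two ways: once assuming that every site absent from a variants-only VCF is invariant, and once using only the sites that an all-sites VCF would show as actually called. It is an illustration only and is not taken from vcfsim; the function name `mean_pairwise_diff` and the toy data are hypothetical.

```python
from itertools import combinations

def mean_pairwise_diff(alleles):
    """Fraction of allele pairs at one site that differ (hypothetical helper)."""
    pairs = list(combinations(alleles, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Toy region of 10 sites with 4 haploid samples each.
# A list holds allele calls (0 = reference, 1 = alternate); None = no data at the site.
sites = [
    [0, 0, 0, 0],   # invariant, called
    [0, 1, 0, 1],   # variant
    None,           # missing
    None,           # missing
    None,           # missing
    None,           # missing
    [0, 0, 0, 0],   # invariant, called
    [0, 0, 1, 0],   # variant
    [0, 0, 0, 0],   # invariant, called
    [0, 0, 0, 0],   # invariant, called
]

called = [s for s in sites if s is not None]
total_diff = sum(mean_pairwise_diff(s) for s in called)

# Variants-only view: missing sites are indistinguishable from invariant ones,
# so the denominator is the full region length.
pi_variants_only = total_diff / len(sites)

# All-sites view: only sites with data contribute to the denominator.
pi_all_sites = total_diff / len(called)

print(f"variants-only estimate: {pi_variants_only:.4f}")
print(f"all-sites estimate:     {pi_all_sites:.4f}")
```

In this toy case, four of the ten sites carry no data, so the variants-only estimate is deflated by the factor 6/10 relative to the all-sites estimate.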

Results

Here, we introduce an open-source command line tool, vcfsim, that interfaces with the popular coalescent simulation platform msprime and provides convenience functions for simulating all-sites VCFs with variable levels of ploidy and missing data. We show that the post-processed VCFs generated using vcfsim align precisely with population genetic expectations (i.e. are statistically identical to raw msprime output), accurately introduce missing data, and permit the simulation of data with varying ploidy levels, including the simulation of intraindividual ploidy variation (e.g. heterogametic sex chromosomes) and population structures.
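As an illustration of what introducing missing data with variable ploidy can look like at the level of VCF genotype fields, the sketch below masks genotype calls at a fixed per-call rate while preserving each call's ploidy and phasing separator. This is an assumption-laden sketch and not vcfsim's implementation or interface; `mask_genotype` and `add_missing` are hypothetical names, and the mixed haploid/diploid sample list stands in for a heterogametic sex chromosome.

```python
import random
import re

def mask_genotype(gt):
    """Return the missing-data form of a VCF GT string, keeping ploidy and separator.

    Hypothetical helper: "0|1" -> ".|.", "1" -> ".", "0/1/1/0" -> "./././."
    """
    sep = "|" if "|" in gt else "/"
    n_alleles = len(re.split(r"[/|]", gt))
    return sep.join(["."] * n_alleles)

def add_missing(genotypes, missing_rate, rng):
    """Mask each genotype call independently with probability missing_rate."""
    return [
        mask_genotype(gt) if rng.random() < missing_rate else gt
        for gt in genotypes
    ]

# One site, four samples: diploid and haploid calls mixed, as on an X chromosome
# sampled from both sexes (toy data, not simulation output).
site_genotypes = ["0|1", "1", "0|0", "0"]

rng = random.Random(42)  # fixed seed so the masking is reproducible
masked = add_missing(site_genotypes, missing_rate=0.5, rng=rng)
print(masked)
```

Masking after simulation, rather than during it, leaves the underlying genealogy untouched, which is one simple way to keep the non-missing calls consistent with the original simulated data.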

Conclusions

Our results suggest that vcfsim is a useful and easy-to-use tool for benchmarking new software tools, performing population genetic inference, training machine learning models, and exploring the effects of missing data in genomics data sets.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

We performed more extensive benchmarking and updated the software to allow for the simulation of multiple populations.


Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.
