Spark4VCF: A Novel Big Data Framework to Accelerate Genomics Analysis | Research Square

Vinh Chi Duong, Thien Khac Nguyen, Giang Minh Vu, Sang Van Nguyen, and 6 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8910343/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 11

Abstract

In recent years, the exponential growth of Next Generation Sequencing (NGS) has led to an unprecedented increase in the amount of genomics data. While NGS technologies enable us to read the entire human genome, the analysis of functions of variants and phenotype prediction found in human sequences are still limited by computational tools that usually require high computing overhead due to the gigabytes or terabytes of data to be analyzed. Here we report a powerful big data framework called Spark4VCF which uses Apache Spark engine to accelerate genomics pipelines.
Spark4VCF leverages the independence between variants and between samples to speed up commonly used computational tools through parallel computing, while maintaining output quality and optimizing I/O tasks. We illustrate the superior speed, the CPU and memory usage, and the new capabilities of Spark4VCF through example applications of three popular genomics toolboxes: GATK, VEP, and PyPGx. In summary, Spark4VCF is a high-performance framework that provides not only the capacity to analyze large genomics datasets but also user-friendly applications in big data settings.

Subject areas: Biological sciences/Computational biology and bioinformatics; Biological sciences/Genetics; Physical sciences/Mathematics and computing
Keywords: Apache Spark; Easy-to-config setting; Variant Call Format; Variant Annotation; Variant Calling; Phenotype Prediction

Introduction

Genomic data have recently grown exponentially owing to the rapid decrease in the cost of microarray genotyping and next-generation sequencing (NGS). To take advantage of these massive data, numerous bioinformatics tools have been developed for variant calling, annotation, phenotype prediction, genotype imputation [1, 2], and many other tasks. For example, GATK comprises a set of robust and efficient analysis tools for NGS data and is considered the gold standard for many tasks, such as identifying genomic variants, including Single Nucleotide Polymorphisms (SNPs), Short Insertions/Deletions (INDELs), Copy Number Variants (CNVs), and Structural Variants (SVs) [3]. A crucial step in genomic analysis is variant annotation, i.e., describing the effects and underlying mechanisms of genomic variants on transcripts and post-transcriptional processes, and eventually deducing genome-to-phenome relationships and the association of genomic variants with human diseases.
Many annotation tools have been developed for this task, such as SnpEff [4], Variant Effect Predictor (VEP) [5], ANNOVAR [6], and Invitae [7]. For example, VEP is an open-access tool that is widely used for variant annotation, that is, for analyzing the impact of variants (SNPs, INDELs, CNVs, or SVs) on regulatory regions, genes, transcripts, and protein sequences. Moreover, a number of tools have been developed to foster the application of genomics data in clinical practice, such as PyPGx [8] and Stargazer [9], which are widely used to predict phenotypes for multiple pharmacogenes.

Despite the emergence of advanced hardware solutions that reduce the time of genomic analysis, such as Illumina's Dragen [10] and BGI's MegaBolt, the enormous and increasing amount of data generated by earlier platforms poses many computational challenges for extracting useful information. The challenges also stem from the hardness of genomic data analysis problems, which depends on genome complexity and grows when genomes contain complex repeat structures [11]. Several multi-node annotation tools have been developed to overcome these challenges, such as Hail (https://hail.is/) and Cannoli (bigdatagenomics.github.io). While Hail does not support exporting annotated information into output files, Cannoli, which uses the ADAM application programming interfaces (APIs), can provide more analyses. However, despite its impressive speed compared with traditional methods, Cannoli is still limited by constraints within this general genomics framework that can hamper its performance. The following describes current popular tools and frameworks that aim to accelerate genomic data analyses (see more details in Table 1).

| Name | Functions | Feature | Advantages |
|---|---|---|---|
| VariantSpark | 1. Variant association 2. Population genetics studies | Parallelizes population-scale tasks based on Spark and the associated MLlib | 80% faster than ADAM, Hadoop/Mahout version, and ADMIXTURE; more than 90% faster than R and Python implementations |
| SparkBWA | 1. Alignment 2. Mapping | Consists of three main stages (1. RDD creation, 2. Map, 3. Reduce phases) and employs two independent software layers | For shorter reads, averages 1.9x and 1.4x faster than SEAL and pBWA; for longer reads, averages 1.4x faster than BigBWA and Halvade |
| GATK-HaplotypeCaller-Spark | 1. Sequence analysis | Takes full account of compute, workload, and characteristics | Achieves more than 37 times increased speed |

Table 1: Popular frameworks that integrate Spark for bioinformatics functions.

Hadoop [12] is an open-source framework that has been broadly used to accelerate data analyses in various disciplines, including bioinformatics (e.g., Biodoop [13], FASTdoop [14, 15]). With Hadoop, multiple computer nodes are organized to provide a scalable distributed file system (HDFS). Hadoop uses a mechanism called MapReduce to process vast amounts of data: a large computational program is divided into many small sub-programs that run in parallel on large computing clusters with thousands of nodes in a reliable, fault-tolerant manner. However, because of its disk-based input/output (I/O) access pattern, Hadoop MapReduce suffers from high latency. Moreover, MapReduce is not an optimal solution for applications that require iterative in-memory computation. These limitations led to the development of the open-source tool Spark, which works together with Hadoop. Spark provides the resilient distributed dataset (RDD) and caches datasets in memory across cluster nodes, eliminating or reducing the disk I/O bottleneck and speeding up some applications by about 100-fold.
Notable applications using Spark and Hadoop MapReduce include alignment and mapping [16–26], sequence analysis [27–30], phylogeny [31], drug discovery [32], single-cell RNA sequencing [33], and variant association in population genetics studies [34, 35]. In addition, [36] offers tips for installing Spark in bioinformatics settings, providing insight into developing bioinformatics tools.

VariantSpark, a powerful tool for genomic data analysis, overcomes the limitations of Hadoop by minimizing reliance on hard disk input-output operations (disk I/O) [34]. In an experiment using the 1KGP dataset, the authors demonstrated that VariantSpark outperformed Apache Spark, ADAM, Hadoop MapReduce, R, Python, and ADMIXTURE. By accurately grouping individuals from super-populations (AMR, AFR, EAS, and SAS), VariantSpark demonstrated superior speed, resource efficiency, and scalability. VariantSpark thus enables the application of advanced machine learning algorithms to genomic data.

SparkBWA is designed for sequence alignment and offers a significant advantage: it requires no changes to the original BWA source code [19]. Notably, SparkBWA outperformed SEAL and pBWA by factors of 1.9× and 1.4×, respectively, when aligning shorter reads with the BWA-backtrack algorithm. For longer reads, using the BWA-MEM algorithm, SparkBWA achieved an average speedup of 1.4× over the BigBWA and Halvade tools.

GATKSpark, a complement to the GATK toolset, focuses on read alignment and variant calling tasks. In particular, on a 256-core cluster, GATKSpark reduced execution time by a factor of 37 [28]. This experiment involved running various tools, including the original GATK, GATK-queue, and GATK-Spark, on 1 to 32 nodes (equivalent to 8 to 256 cores).
Key bottlenecks addressed in GATKSpark include single-process inefficiency, I/O challenges during SAM/BAM manipulation in the Cleaner step, and the merging of BAM files with deduplication. Notable improvements in GATKSpark include parallel processing techniques and a transition from SAM/BAM to the more efficient ADAM format, which minimizes disk access.

These popular tools still have notable drawbacks. They either target specific aspects of the analysis or convert the original data into their own data types, which is time-consuming and requires additional effort in the main computation stage to handle those data types. It is therefore essential to develop a new framework that performs well with many tools. Such a framework should address the factors that influence not only running time but also resource usage, with the data preprocessing modules and the framework structure as vital elements.

In this work, we report a novel framework, Spark4VCF, that leverages the Apache Spark structure and in-memory computation to accelerate diverse genomic analysis pipelines. Spark4VCF parallelizes tasks by dividing the entire workload into multiple segments that are processed concurrently. Task distribution among worker nodes can be based on the number of samples, variants, or sequencing reads. Spark4VCF supports common tasks such as variant calling and, in particular, two tasks that were not fully integrated into the existing popular Spark-based tools: variant annotation and phenotype prediction. The main advantage of Spark4VCF is its dual functionality: it offers a user-friendly, easy-to-configure interface, and it integrates bioinformatics tools without requiring modifications to their source code. By seamlessly integrating common tasks such as variant annotation and phenotype prediction, Spark4VCF helps users analyze genomic data easily and efficiently.
Spark4VCF has been tested on diverse datasets, and it has shown a significant improvement in performance over the original tools.

Results

2.1 Experimental Setup

| Item | Host | Server |
|---|---|---|
| Hardware Model | ASUSTeK COMPUTER INC. PRIME Z390-P | HPE ProLiant DL385 Gen10 Plus |
| Processor | Intel® Core™ i9-9900K × 8-Core 2-Thread | 2 × AMD EPYC 7742 × 64-Core 2-Thread |
| Memory | 64 GB | 1008 GB |
| Local Storage | Samsung SSD 870 EVO (1TiB), WDC WD4005FZBX-00K5WB0 (4TiB) | Logical Volume (8.75TiB) |
| Data plane NICs | Virtualbox Internal network | Virtualbox Internal network |
| BIOS | Version: 2804, Release Date: 04/15/2020 | Version: A42, Release Date: 02/10/2022 |
| Hyper-threading | Enable | Enable |

Table 2: Summary of the host and server computer configuration

We simulated 4 computers using virtual machine (VM) technology on a host computer, so that the experiments were validated on the same configuration for each computer. The VMs were set up with 1, 2, and 3 CPU cores and 14 GB of RAM. Several test cases were designed on the VMs to assess the performance of Spark4VCF. We evaluated the original tools and the Spark Join module against the number of intervals, samples, and variants. In addition, we ran the framework on a server provided by GeneStory JSC to validate the results. The settings were simulated in the same way as on the host computer. The server configuration, listed in Table 2, was set up with the number of CPU cores growing from 1 to 32. We compared the results for these machines by running both Spark4VCF and the other pipelines. We logged execution time and resource utilization via the time and psrecord commands to evaluate file size, running time, CPU usage (%), and memory usage (%). Our environments use Hadoop version 2.7, Spark version 2.4.0, VEP version 108, GATK 3.2.2, PyPGx 0.20.0, and Scala version 2.11.12. File sizes are 133, 454, and 2300 MB; the data are WGS of chromosome 22 from the 1000 Genomes Project (1KGP).
Spark4VCF uses worker nodes that run various variant calling, variant annotation, and phenotype prediction tools, such as GATK-HaplotypeCaller, VEP, and PyPGx. Specifically, we used the BAM file of the HG00131 sample for variant calling, the VCF file of one random sample for variant annotation, and 3202 samples for phenotype prediction.

2.1.1 GATK-HaplotypeCaller

Based on the interval range of the HaplotypeCaller tool, we divided variant locations into chunks and pushed them into individual Spark map tasks to call variants. The results returned by the map tasks are then merged into one output. The experiment was designed to compare the run time of GATK-HaplotypeCaller and GATK-HaplotypeCaller-Spark4VCF over multiple primary and alternative chromosome regions, here referred to as intervals.

2.1.2 VEP

We split the sequencing dataset into multiple files by variant location; each file is then mapped, as standard input, to a process on a worker node. Worker nodes run the tasks in parallel according to the position range of the genomic variants, after which the reducer combines all map buffers into the final standard output. The goal of the experiment was to compare the running times of VEP and VEP-Spark4VCF for different numbers of variants.

2.1.3 PyPGx

We converted all of the samples into a sample chunk format, and the mappers read and process the chunks concurrently. When the map tasks are complete, the Spark pipeline reads each sample line in each worker to predict sample phenotypes or identify star alleles. The final output is then aggregated from the output of each worker node. The purpose of the experiment was to compare the run times of PyPGx and PyPGx-Spark4VCF across different numbers of samples.

2.2 Performance Evaluation

Figure 1 shows the execution time of the original modules and the Spark Join modules versus input size (number of intervals, samples, or variants).
Because merging VCF files is I/O-bound, the running time of the validation processes grows linearly with file size. The Spark Join module had a significantly lower running time than the original tools as the number of samples grew, thanks to Spark's in-memory processing. GATK-HaplotypeCaller and GATK-HaplotypeCaller-Spark4VCF were used for calling variants (Figure 1a). The execution time of GATK-HaplotypeCaller alone was higher than that of GATK-HaplotypeCaller-Spark4VCF across the different numbers of intervals. Both running times started at nearly 100 minutes for 20 intervals; at 95 intervals, that of GATK-HaplotypeCaller had increased to roughly 900 minutes and that of GATK-HaplotypeCaller-Spark4VCF to roughly 423 minutes. For the phenotype prediction process (Figure 1b), we ran PyPGx and PyPGx-Spark4VCF while steadily increasing the number of samples. The running time of PyPGx increased sharply, from approximately 1 minute to 1000 minutes when processing 3202 samples, while the running time of PyPGx-Spark4VCF increased gradually, from approximately 1 to 110 minutes for the same sample size. For variant annotation (Figure 1c), VEP and VEP-Spark4VCF were run with different numbers of variants. The running time of VEP rose strongly, from approximately 70 minutes to 160 minutes at 900000 variants, whereas that of VEP-Spark4VCF rose only from approximately 40 to 50 minutes at 900000 variants. Table 3 shows the detailed running time of each module. Figure 2 shows that Spark4VCF is faster than the original tools in all test cases. In variant calling on a single core, the running time fell from approximately 983 minutes for GATK-HaplotypeCaller to 562 minutes for GATK-HaplotypeCaller-Spark4VCF, a speedup of 1.75 times over the original pipeline. We then expanded the validation experiment to 2 and 3 CPU cores.
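The speedup figures quoted in this section are ratios of running times (original tool divided by its Spark4VCF counterpart). As a minimal sketch of that arithmetic, the snippet below recomputes the ratios from the running times reported in Table 4; the dictionary layout and the function name `speedup` are illustrative choices, not part of Spark4VCF, and the computed values may differ slightly from the rounded figures in the text.

```python
# Running times in minutes, copied from Table 4 (host computer).
# Keys are (task, cpu_cores); values are (original tool, Spark4VCF version).
RUNNING_TIME_MIN = {
    ("variant calling", 1): (983.43, 562.52),
    ("variant calling", 2): (960.91, 338.79),
    ("variant calling", 3): (1006.67, 344.73),
    ("phenotype prediction", 1): (1319.36, 366.65),
    ("phenotype prediction", 2): (1306.77, 137.08),
    ("phenotype prediction", 3): (1102.61, 109.00),
    ("variant annotation", 1): (185.53, 118.82),
    ("variant annotation", 2): (182.49, 97.01),
    ("variant annotation", 3): (197.75, 70.16),
}


def speedup(original_minutes, accelerated_minutes):
    """Speedup = original running time / accelerated running time."""
    return original_minutes / accelerated_minutes


# One speedup value per (task, cores) pair.
speedups = {key: speedup(*times) for key, times in RUNNING_TIME_MIN.items()}

for (task, cores), value in sorted(speedups.items()):
    print(f"{task:22s} {cores} core(s): {value:.2f}x")
```

Because the original tools run on a single thread, most of the gain at 2 and 3 cores comes from the Spark4VCF side of each ratio shrinking while the original running time stays roughly constant.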
With Spark, GATK-HaplotypeCaller ran approximately 2.8 times faster on 2 and 3 cores. The running time of PyPGx-Spark4VCF was also considerably lower than that of the original PyPGx. With 1 core, the running time fell from approximately 1319.37 minutes to 366.66 minutes, a speedup of 3.6 times. Likewise, in the phenotype prediction task the running time fell from 1306.78 minutes to 137.1 minutes with 2 cores (a speedup of 9.5 times) and from 1102.62 minutes to 109 minutes with 3 cores (a speedup of 10.12 times). Finally, VEP-Spark4VCF was faster than VEP in all three CPU settings for annotating variants: the running time fell from approximately 185.5 minutes to 118.8 minutes (1.5 times) with a single core, from 182.49 minutes to 97 minutes (1.8 times) with 2 cores, and from 197.75 minutes to 70.17 minutes (2.8 times) with 3 cores. However, the Spark Join module used approximately 50% more CPU than the original tools during execution. Thus, although memory usage (Figure 2c) and CPU usage (Figure 2b) were spread more widely than for the original tools, the additional Spark module reduced the running time considerably (Figure 2a). Table 4 presents a detailed summary. We also assessed the framework on our server. Figure 3 shows the running time of the three tasks on the server as the number of CPU cores increases from 1 to 32. While the running time of the GATK-HaplotypeCaller-Spark4VCF pipeline dropped at 32 CPU cores, the execution time of GATK-HaplotypeCaller increased significantly as the number of CPU cores increased.
For example, the GATK-HaplotypeCaller-Spark4VCF pipeline took about 150 minutes on the server with one CPU core and about 50 minutes with 32 CPU cores, while the GATK-HaplotypeCaller execution time hovered around 275 minutes (Figure 3a). The running time of the PyPGx-Spark4VCF pipeline decreased significantly as the number of CPU cores increased, unlike that of the PyPGx pipeline. For example, PyPGx-Spark4VCF took roughly 2250 minutes with 1 CPU core and approximately 273.11 minutes with 32 CPU cores, while the execution time of PyPGx fluctuated around 4000 minutes from 1 to 32 CPU cores (Figure 3b). Finally, the running time of the VEP-Spark4VCF pipeline also improved in the Spark setup. The running time of GATK-HaplotypeCaller-Spark4VCF fell from approximately 250 minutes to 100 minutes at 32 CPU cores, while that of GATK-HaplotypeCaller increased slightly to approximately 600 minutes at 32 CPU cores (Figure 3c). The running times of all three original tools merely fluctuated, because each tool ran on a single thread and therefore did not benefit from additional cores.

Table 3: The execution time (in minutes) of the original tools and Spark Join modules with different input settings on the host computer.

| Number of samples | PyPGx | PyPGx-Spark4VCF |
|---|---|---|
| 100 | 8.22 | 4.09 |
| 500 | 59.93 | 9.14 |
| 1000 | 189.25 | 16.05 |
| 1500 | 376.62 | 29.28 |
| 2000 | 579.76 | 36.78 |
| 2500 | 892.01 | 50.49 |
| 3202 | 1102.62 | 109.00 |

| Number of variants | VEP | VEP-Spark4VCF |
|---|---|---|
| 300000 | 71.05 | 40.32 |
| 400000 | 81.21 | 47.50 |
| 500000 | 95.44 | 43.52 |
| 600000 | 124.90 | 47.74 |
| 700000 | 136.69 | 41.09 |
| 800000 | 139.40 | 44.09 |
| 900000 | 162.46 | 52.26 |

| Number of intervals | GATK-HaplotypeCaller | GATK-HaplotypeCaller-Spark4VCF |
|---|---|---|
| 20 | 118.79 | 72.45 |
| 40 | 364.06 | 213.89 |
| 60 | 519.50 | 230.97 |
| 80 | 707.23 | 359.61 |
| 95 | 894.14 | 423.10 |

Table 4: Summary of the performance results of Spark4VCF on the host computer. N is the number of samples. W is the number of workers.
| Dataset | Types | Tools | N | File size (MB) | Running Time (minutes) | CPU (%) | Maximum memory used (%) | W | CPU cores | RAM Usage | Disk Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1KGP-chr 22 | Variant Annotation | VEP | 1 | 133 | 185.53 | 100 | 12.8 | 1 | 1 | 14GiB | HDD |
| 1KGP-chr 22 | Variant Annotation | VEP | 1 | 133 | 182.49 | 178 | 7.6 | 1 | 2 | 14GiB | HDD |
| 1KGP-chr 22 | Variant Annotation | VEP | 1 | 133 | 197.75 | 140 | 12.9 | 1 | 3 | 14GiB | HDD |
| 1KGP-chr 22 | Variant Annotation | VEP-Spark | 1 | 133 | 118.82 | 100 | 40.33 | 3 | 1 | 14GiB | HDD |
| 1KGP-chr 22 | Variant Annotation | VEP-Spark | 1 | 133 | 97.01 | 200 | 41.33 | 3 | 2 | 14GiB | HDD |
| 1KGP-chr 22 | Variant Annotation | VEP-Spark | 1 | 133 | 70.16 | 300 | 41.33 | 3 | 3 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx | 3202 | 454 | 1319.36 | 100 | 5.7 | 1 | 1 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx | 3202 | 454 | 1306.77 | 163 | 5.7 | 1 | 2 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx | 3202 | 454 | 1102.61 | 140 | 5.7 | 1 | 3 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx-Spark | 3202 | 454 | 366.65 | 100 | 14 | 3 | 1 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx-Spark | 3202 | 454 | 137.08 | 200 | 14 | 3 | 2 | 14GiB | HDD |
| 1KGP-chr 22 | Phenotype Prediction | PyPGx-Spark | 3202 | 454 | 109.00 | 250 | 14 | 3 | 3 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK | 1 | 2300 | 983.43 | 100 | 14 | 1 | 1 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK | 1 | 2300 | 960.91 | 200 | 11.5 | 1 | 2 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK | 1 | 2300 | 1006.67 | 225 | 14.1 | 1 | 3 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK-Spark | 1 | 2300 | 562.52 | 100 | 20 | 3 | 1 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK-Spark | 1 | 2300 | 338.79 | 200 | 39.233 | 3 | 2 | 14GiB | HDD |
| BAM file HG00131 | Variant Calling | GATK-Spark | 1 | 2300 | 344.73 | 300 | 40.333 | 3 | 3 | 14GiB | HDD |

Background

Apache Spark is an open-source, multi-language engine for distributed data processing designed to address the limitations of Hadoop. Spark was originally developed at the University of California, Berkeley [37], and is sponsored by the Apache Software Foundation. Spark uses a master-slave architecture with a single central driver and many distributed workers that carry out tasks or analytics.
An RDD in Apache Spark is a read-only collection of data objects partitioned across multiple machines [38]. It can be created by distributing an existing in-memory collection (a list or set) or by loading an external dataset from any of the numerous sources supported by Hadoop, including the local file system, HDFS [12], Parquet (https://parquet.apache.org/), and AWS (https://aws.amazon.com). Parallel operations on RDDs fall into two kinds: Transformations and Actions. Transformations operate on existing RDDs and return new RDDs; examples are map (apply a function to each element), filter (select elements based on a condition), join (combine data from two RDDs), and groupByKey (group elements with the same key). The output RDDs can be kept in memory for faster processing in later steps, and Spark also allows persisting them to disk if needed. Actions, by contrast, trigger computations on the RDDs; examples are collect (get all elements as a collection), count (get the number of elements), and take (get the first N elements). An Action returns its output to the driver program (the main application) or writes it to external storage. Importantly, Transformations are evaluated lazily: rather than executing immediately, they build a logical execution plan that records the order in which the Transformations should be applied. This approach allows Spark to optimize the workflow by avoiding unnecessary computation when the result of a Transformation is never used by any Action. Apache Spark supports multiple programming languages, such as Python, Scala, and Java, enabling users to work with data that is spread across several machines and stored on disk or in memory. Spark can be used with a cluster manager such as YARN [39] or Apache Mesos [40], or it can run locally.

Methods

In this section, we introduce Spark4VCF, which integrates various bioinformatics tools into the Apache Spark framework (Fig. 4).
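Since the design builds on Spark's execution model, it may help to make the lazy Transformation/Action distinction from the Background concrete. The following is a plain-Python analogy, not the Spark API: the class `LazyDataset` and the helper `parse` are hypothetical names introduced only for illustration. Transformations merely append to a recorded plan, and work happens only when an Action is called.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD: records Transformations, runs them on an Action."""

    def __init__(self, items, plan=()):
        self._items = list(items)
        self._plan = tuple(plan)  # ordered (kind, function) pairs

    # --- Transformations: return a new dataset, compute nothing ---
    def map(self, fn):
        return LazyDataset(self._items, self._plan + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._items, self._plan + (("filter", predicate),))

    def _run(self):
        # Apply the recorded plan to each element, one element at a time.
        for item in self._items:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # --- Actions: trigger the computation ---
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


parse_calls = []  # records every time the map function actually runs


def parse(record):
    parse_calls.append(record)
    chrom, pos = record.split(":")
    return chrom, int(pos)


variants = LazyDataset(["22:100", "22:250", "21:90"])
on_chr22 = variants.map(parse).filter(lambda v: v[0] == "22")

calls_before_action = len(parse_calls)  # nothing has been parsed yet
chr22_variants = on_chr22.collect()     # the Action runs the whole plan
```

In this toy model, `take(1)` stops after the first matching element, which mirrors how deferring execution lets an engine skip work that no Action needs.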
The goals of building Spark4VCF are improved performance, easy configuration, and flexibility in integrating diverse tools into the framework.

4.1 Architecture

The Spark4VCF workflow consists of three main stages: a pre-processing stage, a main computation stage, and a post-processing stage. The pre-processing stage loads the data from files in one of three formats (BAM, VCF, or FASTQ) into HDFS storage. The main computation stage comprises three steps: 1) data is distributed from HDFS to the worker nodes through the Spark API and injected into multiple pipes so that the tools can process it; 2) on each worker node, an external process is launched that receives input data from the pipe and starts its task; 3) the output data of the pipes is returned to the driver. The post-processing stage comprises two steps: 1) the output files are uploaded to HDFS, and 2) they can be exported to local storage, a file server, or columnar storage such as Parquet, Elasticsearch, Cassandra, or HBase. Notably, we used HDFS as the distributed file system. The details of the workflow are presented in Fig. 5.

4.2 In-memory computing

The pre-processing stage has two primary tasks: loading data from the local disk and uploading the data to HDFS. Loading and uploading are I/O tasks that can be handled by Direct Memory Access. Uploading data relies on the network channel, while loading data depends on disk I/O. It would be unnecessary to load data from HDFS to the local disk at each step during the main computation stage. Instead, the data is kept in memory for later phases thanks to the Spark mechanism. This stage was therefore developed so that data can be fed directly into the next processes, which minimizes the number of read/write cycles on disk, whereas the original tools read from and write to disk in each process (Fig. 6). Moreover, each independent hardware component processes only one set of data per time unit, so two processes handle 1–2 data units in each time slot.
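The main computation stage described above (distribute chunks, pipe each chunk through an external process, return the outputs for merging) can be sketched in plain Python. This is a simplified stand-in, not Spark4VCF's implementation: it uses a thread pool in place of Spark workers, and `TOOL_SCRIPT` is a hypothetical one-line "tool" that tags each record, standing in for a real program such as VEP. All function names are assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an external tool: reads records on stdin and
# writes each record back to stdout with an extra tab-separated field.
TOOL_SCRIPT = """
import sys
for line in sys.stdin:
    sys.stdout.write(line.rstrip("\\n") + "\\tANNOTATED\\n")
"""
TOOL_COMMAND = [sys.executable, "-c", TOOL_SCRIPT]


def split_into_chunks(records, chunk_size):
    """Return (chunk_id, chunk_content) pairs covering all records in order."""
    return [(start // chunk_size, records[start:start + chunk_size])
            for start in range(0, len(records), chunk_size)]


def run_tool_on_chunk(chunk):
    """Pipe one chunk through the external process and capture its output."""
    chunk_id, lines = chunk
    result = subprocess.run(TOOL_COMMAND, input="\n".join(lines) + "\n",
                            capture_output=True, text=True, check=True)
    return chunk_id, result.stdout.splitlines()


def process(records, chunk_size=2, workers=3):
    """Split, process chunks concurrently, then merge outputs by chunk id."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(run_tool_on_chunk, chunks))
    merged = []
    for _, lines in sorted(outputs, key=lambda pair: pair[0]):
        merged.extend(lines)
    return merged


records = ["22\t100\tA\tG", "22\t250\tC\tT", "22\t300\tG\tA",
           "22\t410\tT\tC", "22\t520\tA\tC"]
annotated = process(records)
```

Data moves between the driver and the tool through standard input and output rather than intermediate files, which is the property the in-memory design relies on; sorting by chunk id before merging keeps the output in the original record order.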
In particular, the chunk identifier of an input file is used as the key in the RDDs. The RDD variable generated by Spark thus has the format { chunk id, chunk content }, where chunk content holds the variant information of the chunk identified by chunk id. This variable is fed to the main computation stage, for example to the VEP, PyPGx, or GATK-HaplotypeCaller algorithm. Once the computation phase is complete, users can merge all the outputs into one file.

4.3 Spark4VCF Console

Easy configuration is one of the key features of Spark4VCF for performing tasks such as variant annotation, variant calling, or phenotype prediction. With this in mind, Spark4VCF is built on the Apache Spark big data technology and can be installed through GitHub. The Spark4VCF application is written in Scala and packaged as a Java Archive (JAR). It accepts a variety of options to run a specific task (see Listing 1; an example is given in supplement-listing-1). We implemented VEP (variant annotation), GATK-HaplotypeCaller (variant calling), and PyPGx (phenotype prediction) to verify that the application can be applied to various tasks. We integrated Spark4VCF with the spark-submit command, which provides a way to run it from the terminal and leverage the distributed computing power. The command supports a variety of arguments and options with which the user can control the run based on the YARN configuration. For example, it is possible to choose the number of executors, the number of cores used per worker, or the amount of memory (the basic syntax is given in Listing 2; an example of spark-submit is shown in supplement-listing-2). As a result, Spark4VCF is an easy-to-configure tool that enables a user to run tasks using the standard spark-submit from the terminal.
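To illustrate how the Spark4VCF options (Listing 1) and the spark-submit options (Listing 2) might be combined, the sketch below assembles a command line in Python. It is an assumption-laden example, not a documented invocation: all paths, the tool arguments, and the resource values are placeholders; `--executor-cores` and `--executor-memory` are standard spark-submit flags that are not shown in Listing 2; and the sketch assumes the JAR declares its own main class so that no `--class` option is needed.

```python
import shlex


def build_spark_submit(tool, tool_dir, input_path, output_path, tool_args,
                       jar="spark4vcf.jar", master="yarn",
                       deploy_mode="cluster", num_executors=3,
                       executor_cores=1, executor_memory="14g"):
    """Assemble the argument list for a hypothetical Spark4VCF run."""
    # Options consumed by spark-submit itself.
    spark_options = [
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]
    # Options passed through to the Spark4VCF application (see Listing 1).
    application_args = [
        "--bioinformaticstool", tool,
        "--tooldir", tool_dir,
        "-i", input_path,
        "-o", output_path,
        "--toolargs", tool_args,
    ]
    return ["spark-submit", *spark_options, jar, *application_args]


# Placeholder paths and tool arguments, chosen only for illustration.
command = build_spark_submit(
    tool="vep",
    tool_dir="/opt/vep/vep",
    input_path="/data/input.vcf",
    output_path="/data/annotated.vcf",
    tool_args="--cache --offline",
)
print(shlex.join(command))
```

Keeping the command as a list and quoting it only for display ensures that the tool arguments reach Spark4VCF as a single quoted value, as Listing 1 requires.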
Listing 1: The basic syntax for Spark4VCF

    $ java -jar spark4vcf.jar --help
    Usage: Spark4VCF [options]
      --bioinformaticstool : Bioinformatics tool name (e.g., vep, pypgx, gatk-haplotypecaller)
      --tooldir            : Executable path
      -i, --inputfile      : Absolute path to input file in HDFS storage
      -o, --outputfile     : Absolute path to output file in HDFS storage
      --toolargs           : Bioinformatics tool arguments, in quotes
      -h, --help           : Print usage

Listing 2: The basic syntax for spark-submit

    $ spark-submit [options] <app jar | python file> [application arguments]

    <app jar | python file>: This can be the path to the JAR file or the Python script containing the Spark application.
    [options]: These are optional arguments that allow you to configure the Spark application. Here are some commonly used options:
      --master        : Specifies the cluster manager to use (e.g., yarn, local)
      --deploy-mode   : Sets the deployment mode (client or cluster)
      --memory        : Defines the memory allocation for the driver and executors.
      --num-executors : Sets the number of worker nodes to use.

4.4 Bioinformatics Tools

4.4.1 GATK-HaplotypeCaller

GATK introduced a variant calling tool named HaplotypeCaller, specifically designed for high-accuracy variant genotyping in diploid genomes, to address sequencing errors and mismapped reads in variant calling [41]. GATK-HaplotypeCaller employs a de Bruijn graph-based approach to assemble contigs from aligned sequencing reads [42]. This strategy allows for the reconstruction of haplotypes, the complete set of alleles on a single chromosome within a sequencing sample. By analyzing these reconstructed haplotypes, HaplotypeCaller can effectively distinguish true variants from sequencing errors based on their presence within the assembled contigs. By leveraging a haplotype-based assembly approach, the toolkit overcomes the limitations of traditional variant callers and delivers a more accurate and comprehensive set of identified variants, although HaplotypeCaller has been evaluated as the most time-consuming step.
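To give a feel for the de Bruijn graph idea mentioned above, here is a toy sketch in Python. It is not GATK's implementation, which involves considerably more machinery; the reads, the k-mer size, and the function names are all made up for illustration. Two reads that differ by a single base produce a branch (a "bubble") in the graph, and each path through the bubble spells one candidate haplotype.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, end, path=None):
    """List every simple path from start to end (fine for a tiny graph)."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for successor in sorted(graph.get(start, ())):
        if successor not in path:  # avoid revisiting nodes (no cycles)
            paths.extend(enumerate_paths(graph, successor, end, path))
    return paths


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Two made-up reads covering the same locus, differing at one position (G/A).
reads = ["ATGGCGT", "ATGACGT"]
graph = build_de_bruijn(reads, k=3)

# Nodes with more than one successor mark where the haplotypes diverge.
branch_nodes = sorted(node for node, nxt in graph.items() if len(nxt) > 1)

# Each path from the shared start to the shared end is a candidate haplotype.
haplotypes = sorted(spell(p) for p in enumerate_paths(graph, "AT", "GT"))
```

In this example the graph branches at node `TG` and rejoins at `CG`, so the two paths recover the two input sequences as distinct haplotypes.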
4.4.2 The Variant Effect Predictor (VEP)

VEP, developed by the Ensembl Project, plays a crucial role in variant annotation tasks [5]. To cope with enormous volumes of genomic data, VEP has been designed as a thorough in-silico annotation tool that predicts the effects of genetic variants. The tool accepts a list of variants, typically in the Variant Call Format (VCF). VEP determines the potential effects of a variant, including its location within a gene (e.g., exonic, intronic), predicted protein consequences (e.g., missense mutation, synonymous mutation), and potential impact on protein function (e.g., predicted damaging or benign). Users can quickly determine the genes and transcripts affected by the variants by entering the chromosome coordinates of the variants and the nucleotide alterations. Including VEP in research workflows offers several significant benefits. VEP automates the variant annotation process, replacing manual analysis and thus saving time and resources. Moreover, with relevant databases, VEP provides valuable insights into the potential functional effects of variants.

4.4.3 PyPGx

PyPGx, developed at Macrogen Inc. [8], uses the 1000 Genomes Project reference panel to obtain information about individual diplotype calls, copy number plots, and allele fractions. First, genomic variants are phased by the Beagle tool before being matched to star alleles in the Gene Haplotype translation reference. Then, the per-base read depth of the target gene is translated into copy numbers using an intra-sample normalization method with a control gene. For targeted sequencing data, inter-sample normalization is additionally carried out to account for the variation in total coverage across samples.

Conclusion

In this paper, we reported Spark4VCF, a framework in which bioinformatics tools are built on top of Apache Spark, achieving superior performance in an easy-to-configure manner.
Spark4VCF consists of three stages: pre-processing, main computation, and post-processing. In the main computation stage, we improved processing speed by avoiding intermediate data stored on the local disk. Applying Spark4VCF to large-scale biological data yielded high-performance tools that enable routine analysis of large numbers of variants in human genomes with common tools such as PyPGx and VEP. With the drastic increase in the number of samples and features in genomics research, efficient tools for genomic data processing are crucial for the research community. Our work shows that the Spark framework can be used effectively in a variety of genomic data analyses and that Spark4VCF can serve as a complementary solution for such tasks. Currently, Spark4VCF has several limitations: it integrates only a few tools, and it runs them step by step rather than as a complete pipeline from raw data to final output. We plan to develop Spark4VCF into a more powerful and comprehensive solution for genomic data analysis.

Declarations

6 Author contributions statement

T.H.H. and V.C.D. wrote the manuscript and performed data analysis. V.C.D. deployed the system and performed QC. T.H.H. and T.D.D. designed the system and developed the driver algorithm (main program). C.D.L. wrote test cases and tested annotation tools. V.H.P. developed an automated deployment tool for Spark4VCF. Q.N. contributed to the experiment design and revised the manuscript. N.S.V. conceived the project, supervised the experiments, revised the manuscript, and coordinated the overall project. All authors revised the manuscript, discussed the analysis results, and contributed innovative ideas to the manuscript.
7 Code availability statement

The source code of Spark4VCF is available from https://github.com/vinhdc10998/Spark4VCF

8 Data access statement

The 1000 Genomes Project phase 3 (1KGP3) dataset, comprising WGS data of 2504 unrelated individuals, was used. The HG00131 sample was downloaded from https://www.internationalgenome.org/data-portal. The 1KGP3 dataset has its own consent form. All datasets follow the relevant guidelines and regulations (e.g., the Declaration of Helsinki).

9 Human Ethics and Consent to Participate declarations

Not applicable.

10 Ethics Approval declaration

The project (grant VINIF.DA.2020.02) was approved by the Institutional Review Board of the Hanoi Medical University, Hanoi, Vietnam - IRB-VN01.001/IRB00003121.

11 Funding Declaration

This work is funded by the Institute of VinUni Big Data Research, VinUniversity internal funding, and partly supported by the Vingroup Innovation Foundation (grant VINIF.DA.2020.02).

12 Competing interests

The authors declare that they have no competing interests.

References

1. Maarala, A. I., Pärn, K., Nuñez-Fontarnau, J. & Heljanko, K. Sparkbeagle: Scalable genotype imputation from distributed whole-genome reference panel cloud. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. BCB ’20. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3388440.3414860
2. Chi Duong, V. et al. A rapid and reference-free imputation method for low-cost genotyping platforms. Sci. Rep. 13 (1), 23083 (2023).
3. Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
4. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6 (2), 80–92 (2012).
5. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17 (1), 1–14 (2016).
6. Wang, K., Li, M. & Hakonarson, H. Annovar: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38 (16), 164–164 (2010).
7. Hart, R. K. et al. A python package for parsing, validating, mapping and formatting sequence variants using hgvs nomenclature. Bioinformatics 31 (2), 268–270 (2015).
8. Lee, S., Shin, J. Y., Kwon, N. J., Kim, C. & Seo, J. S. Clinpharmseq: A targeted sequencing panel for clinical pharmacogenetics implementation. PLoS One 17 (7), 0272129 (2022).
9. Lee, S. et al. Stargazer: a software tool for calling star alleles from next-generation sequencing data using cyp2d6 as a model. Genet. Med. 21 (2), 361–372 (2019).
10. Schobers, G. et al. Genome sequencing as a generic diagnostic strategy for rare disease. Genome Med. 16 (1), 32 (2024).
11. Phan, V., Gao, S., Tran, Q. & Vo, N. S. How genome complexity can explain the hardness of aligning reads to genomes. In: 2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp. 1–2 (2014). https://doi.org/10.1109/ICCABS.2014.6863916
12. Apache Software Foundation. Hadoop. https://hadoop.apache.org
13. Leo, S., Santoni, F. & Zanetti, G. Biodoop: Bioinformatics on hadoop. In: 2009 International Conference on Parallel Processing Workshops, pp. 415–422 (2009). https://doi.org/10.1109/ICPPW.2009.37
14. Ferraro Petrillo, U., Roscigno, G., Cattaneo, G. & Giancarlo, R. Fastdoop: a versatile and efficient library for the input of fasta and fastq files for mapreduce hadoop bioinformatics applications. Bioinformatics 33 (10), 1575–1577 (2017). https://doi.org/10.1093/bioinformatics/btx010
15. Alnasir, J. J. & Shanahan, H. P. The application of hadoop in structural bioinformatics. Brief. Bioinform. 21 (1), 96–105 (2020).
16. Zhao, G., Ling, C. & Sun, D. Sparksw: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 845–852 IEEE (2015).
17. Xu, B. et al. Dsa: scalable distributed sequence alignment system using simd instructions. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 758–761 IEEE (2017).
18. Xu, B. et al. Efficient distributed smith-waterman algorithm based on apache spark. In: 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pp. 608–615 IEEE (2017).
19. Abuín, J. M., Pichel, J. C., Pena, T. F. & Amigo, J. Sparkbwa: speeding up the alignment of high-throughput dna sequencing data. PLoS One 11 (5), 0155461 (2016).
20. Mushtaq, H., Ahmed, N. & Al-Ars, Z. Streaming distributed dna sequence alignment using apache spark. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 188–193 IEEE (2017).
21. Abuín, J. M., Pena, T. F. & Pichel, J. C. Pastaspark: multiple sequence alignment meets big data. Bioinformatics 33 (18), 2948–2950 (2017).
22. Lladós, J., Guirado, F. & Cores, F. Ppcas: Implementation of a probabilistic pairwise model for consistency-based multiple alignment in apache spark. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 601–610 Springer (2017).
23. Castro, M. R., Tostes, C. S., Dávila, A. M., Senger, H. & Silva, F. A. Sparkblast: scalable blast processing using in-memory operations. BMC Bioinform. 18, 1–13 (2017).
24. Zhou, W. et al. Metaspark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes. Bioinformatics 33 (7), 1090–1092 (2017).
25. Ferraro Petrillo, U., Palini, F., Cattaneo, G. & Giancarlo, R. Alignment-free genomic analysis via a big data spark platform. Bioinformatics 37 (12), 1658–1665 (2021).
26. AlJame, M. & Ahmad, I. Dna short read alignment on apache spark. Appl. Comput. Inf. 19 (1/2), 64–81 (2023).
27. Deng, L., Huang, G., Zhuang, Y., Wei, J. & Yan, Y. Higene: A high-performance platform for genomic data analysis. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 576–583 IEEE (2016).
28. Li, X. et al. Accelerating large-scale genomic analysis with spark. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 747–751 IEEE (2016).
29. Wiewiórka, M. S. et al. Sparkseq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 30 (18), 2652–2653 (2014).
30. Demirbaga, Ü., Aujla, G. S., Jindal, A. & Kalyon, O. Big data analytics in bioinformatics. In: Big Data Analytics: Theory, Techniques, Platforms, and Applications, pp. 265–284 Springer (2024).
31. Xu, X., Ji, Z. & Zhang, Z. Cloudphylo: a fast and scalable tool for phylogeny reconstruction. Bioinformatics 33 (3), 438–440 (2017).
32. Harnie, D. et al. Scaling machine learning for target prediction in drug discovery using apache spark. Future Generation Comput. Syst. 67, 409–417 (2017).
33. Yang, A., Troup, M., Lin, P. & Ho, J. W. Falco: a quick and flexible single-cell rna-seq processing framework on the cloud. Bioinformatics 33 (5), 767–769 (2017).
34. O’Brien, A. R. et al. Variantspark: population scale clustering of genotype information. BMC Genom. 16, 1–9 (2015).
35. Guha Neogi, A., Eltaher, A. & Sargsyan, A. NGS data analysis with apache spark. In: Computational Life Sciences: Data Engineering and Data Mining for Life Sciences, pp. 441–467 Springer International Publishing, Cham (2023).
36. Chicco, D., Petrillo, F. & Cattaneo, U. Ten quick tips for bioinformatics analyses using an apache spark distributed computing environment. PLoS Comput. Biol. 19 (7), 1011272 (2023).
37. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: Cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (2010).
38. Zaharia, M. et al. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15–28 (2012).
39. Vavilapalli, V. K. et al. Apache hadoop yarn: Yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, pp. 1–16 (2013).
40. Hindman, B. et al. Mesos: A platform for Fine-Grained resource sharing in the data center. In: 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11) (2011).
41. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178 (2017).
42. Ren, S., Bertels, K. & Al-Ars, Z. Efficient acceleration of the pair-hmms forward algorithm for gatk haplotypecaller on graphics processing units. Evolutionary Bioinf. 14, 1176934318760543 (2018).

Additional Declarations

No competing interests reported.

Status: Under Review
Editorial decision: Revision requested 30 Mar, 2026
Reviews received at journal 27 Mar, 2026
Reviewers agreed at journal 27 Feb, 2026
Reviewers agreed at journal 27 Feb, 2026
Reviews received at journal 25 Feb, 2026
Reviewers agreed at journal 25 Feb, 2026
Reviewers invited by journal 25 Feb, 2026
Editor assigned by journal 25 Feb, 2026
Editor invited by journal 24 Feb, 2026
Submission checks completed at journal 23 Feb, 2026
First submitted to journal 23 Feb, 2026
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8910343","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":598858544,"identity":"e33734e6-e06f-4b72-8fd8-4439c4926485","order_by":0,"name":"Vinh Chi Duong","email":"","orcid":"","institution":"GeneStory JSC","correspondingAuthor":false,"prefix":"","firstName":"Vinh","middleName":"Chi","lastName":"Duong","suffix":""},{"id":598858551,"identity":"c5581d1a-ac9f-4c4d-a586-0e74f54922ad","order_by":1,"name":"Thien Khac Nguyen","email":"","orcid":"","institution":"Vingroup Big Data Institute","correspondingAuthor":false,"prefix":"","firstName":"Thien","middleName":"Khac","lastName":"Nguyen","suffix":""},{"id":598858552,"identity":"967fc469-f5f1-4d7c-9ab3-c5aef52e699c","order_by":2,"name":"Giang Minh Vu","email":"","orcid":"","institution":"VinUniversity","correspondingAuthor":false,"prefix":"","firstName":"Giang","middleName":"Minh","lastName":"Vu","suffix":""},{"id":598858553,"identity":"11beae35-3de0-4fac-a983-fe1296de3f11","order_by":3,"name":"Sang Van Nguyen","email":"","orcid":"","institution":"GeneStory JSC","correspondingAuthor":false,"prefix":"","firstName":"Sang","middleName":"Van","lastName":"Nguyen","suffix":""},{"id":598858554,"identity":"a51022ce-a5bf-46f9-b764-cd35a00cb291","order_by":4,"name":"Quan Nguyen","email":"","orcid":"","institution":"The University of Queensland","correspondingAuthor":false,"prefix":"","firstName":"Quan","middleName":"","lastName":"Nguyen","suffix":""},{"id":598858555,"identity":"223de6d8-7dec-4eaa-91e7-ccaaf0c70f38","order_by":5,"name":"Vu Hoang 
Pham","email":"","orcid":"","institution":"Vingroup Big Data Institute","correspondingAuthor":false,"prefix":"","firstName":"Vu","middleName":"Hoang","lastName":"Pham","suffix":""},{"id":598858556,"identity":"c71243f6-2845-4c1b-bd60-bebb2ab3d461","order_by":6,"name":"Cuong Dinh Le","email":"","orcid":"","institution":"Vingroup Big Data Institute","correspondingAuthor":false,"prefix":"","firstName":"Cuong","middleName":"Dinh","lastName":"Le","suffix":""},{"id":598858558,"identity":"bc586031-6be1-4c34-87a2-7a6540856300","order_by":7,"name":"Toan Dang Dao","email":"","orcid":"","institution":"Vingroup Big Data Institute","correspondingAuthor":false,"prefix":"","firstName":"Toan","middleName":"Dang","lastName":"Dao","suffix":""},{"id":598858559,"identity":"f40945d0-3ee8-4a1b-b92c-05afbf946e28","order_by":8,"name":"Nam Sy Vo","email":"","orcid":"","institution":"VinUniversity","correspondingAuthor":false,"prefix":"","firstName":"Nam","middleName":"Sy","lastName":"Vo","suffix":""},{"id":598858560,"identity":"654af887-b5c9-42d9-8aa0-cd6640eba5a2","order_by":9,"name":"Tham Hong Hoang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA+UlEQVRIiWNgGAWjYDACCQYGZhgDCGyAmLHxAD4dPGha0kBaGkjSchhM4tViL938dHNBxR27Bunmg58Lfp23W9t+GGhLjU00TltkjpndnnHmWXKDzLFk6Zl9t5O3nUkEajmWltuA02EJZrd52w4nM0jkGEjz9txONjsA1MLYcBiPlvRvMC3Gv3l7ziWbnX9ISEsO2BY7oBYzaZ4fB+zMbhCy5UZO2W2eM4cT2CTS0qx5G5ITzG4AbUnA4xf2GenbbvNUHLbnl0g+fJvnj5292fn0hw8+1Njg1AIDiW0gkrGNIRGsMoGAchCwh1B/YIxRMApGwSgYBQgAAAceYo9IRTX4AAAAAElFTkSuQmCC","orcid":"","institution":"VinUniversity","correspondingAuthor":true,"prefix":"","firstName":"Tham","middleName":"Hong","lastName":"Hoang","suffix":""}],"badges":[],"createdAt":"2026-02-18 
14:39:33","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8910343/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8910343/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":103736738,"identity":"87112c85-40d7-4e15-8837-20abea97b382","added_by":"auto","created_at":"2026-03-02 10:06:27","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":219165,"visible":true,"origin":"","legend":"\u003cp\u003eThe execution time (minutes) of GATK-HaplotypeCaller (GATK-HC) on the number of intervals(a), PyPGx on the number of samples(b), and VEP on the number of variants(c). The experiments of original tools were implemented on 3 CPU cores, and 14 GB memory usage while that of Spark Join modules were applied on 3 clusters that had the same settings on the host computer.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/766b38b97a8c5fc58e7810fc.png"},{"id":104399812,"identity":"d6e8b610-f877-4993-bf7a-ed6318f8f9bb","added_by":"auto","created_at":"2026-03-11 12:07:43","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":115482,"visible":true,"origin":"","legend":"\u003cp\u003eThe graphs illustrate the running time(a), the percentage of CPU usage(b), and the percentage of memory usage(c) of the original pipeline and pipeline with Spark among 3 CPU cores on the host computer. 
HC: HaplotypeCaller.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/3769b15159f58cc68ce90df1.png"},{"id":103736733,"identity":"ca141f48-4a25-41e8-9b7f-6988a7cf62b8","added_by":"auto","created_at":"2026-03-02 10:06:27","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":142216,"visible":true,"origin":"","legend":"\u003cp\u003eThe graphs illustrate the running time of the original pipeline and the pipeline with Spark on the server with different numbers of CPU cores. HC: HaplotypeCaller\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/e0d5af40d748f957e51dc374.png"},{"id":104400195,"identity":"e3d79281-2086-47a1-a6b2-82fdb4eb04a5","added_by":"auto","created_at":"2026-03-11 12:09:10","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":447986,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of Spark4VCF, set up in a specific environment such as servers or computers. It can take three types of input (FASTQ, BAM, and VCF) and distribute input to computing resources to run tasks (e.g., GATK-HaplotypeCaller, PyPGx, and VEP). The output from the task will be exported to the dashboard.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/7eaa24e6bf5e929b817292d1.png"},{"id":103736734,"identity":"e549a4d5-77de-447f-92e0-afec442d1209","added_by":"auto","created_at":"2026-03-02 10:06:27","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":235436,"visible":true,"origin":"","legend":"\u003cp\u003eA schematic overview of Spark4VCF workflow. The input on the storage is transferred to the storage of the cluster manager. It is distributed to computing slaves to map a specific task like VEP, GATK-HaplotypeCaller, or PyPGx. 
The output of each map is combined with a list of actions (shuffling, sorting, and reducing ) to export a final output file.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/ef44c468a426a2766d5deaae.png"},{"id":103736737,"identity":"9f5c3ca1-bf3e-4cd8-be8c-0cbdc577a1e1","added_by":"auto","created_at":"2026-03-02 10:06:27","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":106753,"visible":true,"origin":"","legend":"\u003cp\u003eSpark In-Memory Computing. During the task stages, there is no need to load data from the disk. Instead, the spark mechanism allows it to be saved in memory for later stages. This stage was therefore created to enable data to be fed directly into the tasks that follow.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/6e8532dc8887f44071be98cb.png"},{"id":105751824,"identity":"c4b153ad-6a3b-4e8c-8026-251143928904","added_by":"auto","created_at":"2026-03-30 15:46:15","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2095958,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8910343/v1/922465e9-95c3-4320-8f4a-9855ec976d13.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Spark4VCF: A Novel Big Data Framework to Accelerate Genomics Analysis","fulltext":[{"header":"Introduction","content":"\u003cp\u003eRecently, the genomic data have been exponentially proliferating due to the rapid decrease in the cost of microarray genotyping and next-generation sequencing (NGS). To take advantage of the massive genomic data, numerous bioinformatics tools have been developed for variant calling, annotation, phenotype prediction, genotype imputation [1, 2], and many others. 
For example, GATK has comprised a set of robust and efficient analysis tools for NGS data, which was considered to be the gold stan- dard for many tasks such as identifying genomic variants, including Single Nucleotide Polymorphisms (SNPs), Short Insertions/Deletions (INDELs), Copy Number Variants (CNVs), and Structural Variants (SVs) [3]. A crucial step in genomic analysis is variant annotation, i.e. describing the effects and underlying mechanisms of genomic variants on transcripts and post-transcriptional processes, eventually deducing genome-to-phenome relationships and the association of genomic variants with human diseases. Many annotation tools have been developed for this task, such as SnpEff[4], Variant Effect Predictor (VEP)[5], ANNOVAR[6] or Invitae[7]. For example, VEP is an open-access tool which has been widely used for variant annotation, or analyzing the impact of the variants (SNPs, INDELs, CNVs, or SVs) on regulatory areas, genes, transcripts, and protein sequences. Moreover, a number of tools have been developed to foster the applications of genomics data in clinical practice, such as PyPGx[8] and Stargazer[9], which have been widely used to predict phenotypes of multiple pharmacogenes.\u003c/p\u003e\n\u003cp\u003eDespite the emergence of many advanced hardware solutions able to reduce the time of genomic analysis, such as Illumina\u0026rsquo;s Dragen [10], BGI\u0026rsquo;s MegaBolt, the enormous and increasing amount of available data generated from other previous platforms posed plenty of computational challenges in extracting useful information. The challenges also come from the hardness of genomic data analysis problems, which is affected by the genome complexity and becomes harder if the genomes are embedded with complex repeat structures [11]. Several multi-node annotation tools have been developed to overcome these challenges, such as Hail (https://hail.is/) or Cannoli (bigdatagenomics.github.io). 
While Hail does not support exporting annotated information into output files, Cannoli, which uses ADAM application programming interfaces (APIs), could provide more analyses. However, despite its impressive speed compared to traditional methods, Canoli is still limited by constraints within this general genomics framework that can hamper its performance. The following describes current popular tools and frameworks that aim to accelerate genomic data analyses (see more details in Table 1).\u003c/p\u003e\n\n\n\n\n\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"567\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 102px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eName\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 78px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFunctions\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 204px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFeature\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 183px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAdvantages\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 102px;\"\u003e\n \u003cp\u003eVariantSpark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 78px;\"\u003e\n \u003cp\u003e1. Variant association\u003c/p\u003e\n \u003cp\u003e2. 
Population genetics studies\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 204px;\"\u003e\n \u003cp\u003eParallels population-scale tasks\u003c/p\u003e\n \u003cp\u003ebased on Spark and the associated MLlib\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 183px;\"\u003e\n \u003cp\u003e80% faster than ADAM, Hadoop/Mahout version, and ADMIXTURE\u003c/p\u003e\n \u003cp\u003eMore than 90% faster than R and Python implementations\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 102px;\"\u003e\n \u003cp\u003eSparkBWA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 78px;\"\u003e\n \u003cp\u003e1. Alignment\u003c/p\u003e\n \u003cp\u003e2. Mapping\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 204px;\"\u003e\n \u003cp\u003eConsists of three main stages:\u003c/p\u003e\n \u003cp\u003e1. RDD creation\u003c/p\u003e\n \u003cp\u003e2. Map\u003c/p\u003e\n \u003cp\u003e3. Reduce phases;\u003c/p\u003e\n \u003cp\u003eand employs two independents\u003c/p\u003e\n \u003cp\u003esoftware layers\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 183px;\"\u003e\n \u003cp\u003eFor shorter reads, averages 1.9x and 1.4x faster than SEAL and pBWA\u003c/p\u003e\n \u003cp\u003eFor longer reads, averages 1.4x\u003c/p\u003e\n \u003cp\u003efaster than BigBWA and Halvade\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 102px;\"\u003e\n \u003cp\u003eGATK-HaplotypeCaller-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 78px;\"\u003e\n \u003cp\u003e1. 
Sequence analysis\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 204px;\"\u003e\n \u003cp\u003eTakes full account of compute, workload, and characteristics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 183px;\"\u003e\n \u003cp\u003eAchieves more than 37 times increased speed\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1\u003c/strong\u003e: Popular frameworks which integrated Spark for bioinformatic functions.\u003c/p\u003e\n\n\u003cp\u003eHadoop[12] is an open-source framework that has been broadly utilized to accelerate data analyses in various disciplines, including bioinformatics (e.g., Biodoop[13], FASTdoop[14, 15]). With Hadoop, multiple computer nodes are organized to provide a scalable distributed file system (HDFS). Hadoop uses a mechanism called MapReduce to process vast amounts of data, dividing a large computational program into many small sub-programs that run in parallel on large computing clusters with thousands of nodes in a reliable, fault-tolerant manner. However, due to its disk-based input/output (I/O) access pattern, Hadoop MapReduce suffers from high latency. Moreover, MapReduce is not an optimal solution for applications that require iterative in-memory computation. These limitations led to the development of the open-source tool, Spark, which works together with Hadoop. Spark provides a resilient distributed dataset (RDD) and caches datasets in memory across cluster nodes, eliminating or reducing the bottleneck of disk I/O to speed up performance by about 100-fold for some applications. Notable applications using Spark and Hadoop MapReduce include alignment and mapping [16\u0026ndash;26], sequence analysis [27\u0026ndash;30], phylogeny [31], drug discovery [32], single-cell RNA sequencing [33], and variant association in population genetics studies [34, 35]. 
Besides, tips for installing Spark in bioinformatics was also researched by [36] to provide insight into developing bioinformatics tools.\u003c/p\u003e\n\n\u003cp\u003eVariantSpark, a powerful tool for genomic data analysis, overcomes the limitations of Hadoop by minimizing reliance on hard disk input-output operations (disk IO) [34]. In an experiment using the 1KGP dataset, the authors demonstrated that VariantSpark outperformed Apache Spark, ADAM, Hadoop MapReduce, R, Python, and ADMIXTURE. By accurately grouping individuals from super-populations (AMR, AFR, EAS, and SAS), VariantSpark proved its superior speed, resource efficiency, and scalability. VariantSpark, thus, enables the application of advanced machine learning algorithms to genomic data.\u003c/p\u003e\n\u003cp\u003eSparkBWA is designed for sequencing alignment and offers a significant advantage, which does not necessitate any changes to the original BWA source code [19]. Notably, SparkBWA outperformed SEAL and pBWA by a factor of 1.9\u0026times; and 1.4\u0026times;, respectively, for aligning shorter reads using the BWA-backtrack algorithm. For longer reads, implementing the BWA-MEM algorithm, SparkBWA achieved an average speed up of 1.4\u0026times; compared to BigBWA and Halvade tools.\u003c/p\u003e\n\u003cp\u003eGATKSpark, a complementary for the GATK toolset, focuses on read alignment and variant calling tasks. In particular, by leveraging a robust 256-core cluster, GATKSpark\u0026rsquo;s execution time has been remarkably reduced by a factor of 37x [28]. This experiment involved running various tools, including GATK original, GATK-queue, and GATK-Spark on 1 to 32 nodes (equivalent to 8 to 256 cores). Key bottlenecks which were addressed in GATKSpark include single-process inefficiency, I/O challenges during SAM/BAM manipulation in the Cleaner step, and merging of BAM files with deduplication. 
Notable improvements of GATKSpark involve parallel processing techniques and a transition from SAM/BAM to the more efficient ADAM format, which minimizes disk access.\u003c/p\u003e\n\u003cp\u003eIt is known that these popular tools still have many drawbacks. They aim at specific aspects of the analysis function, or trying to convert the original data into their specific data types, which are time-consuming and need more effort in the main computation stage to deal with their specific data types. Therefore, it is essential to develop a new framework that can perform well with many tools. This framework should be able to deal with a lot of factors that influence not only running time but also resource usage, including data preprocessing modules and framework structure as vital elements.\u003c/p\u003e\n\u003cp\u003eIn this work, we report a novel framework, namely Spark4VCF, that leverages Apache Spark structure and in-memory computation to accelerate diverse genomic analysis pipelines. Spark4VCF is a tool designed to parallelize tasks by dividing the entire workload into multiple segments that are processed concurrently. Task distribution among worker nodes can be based on the number of samples, variants, or sequencing reads. Spark4VCF supports some common tasks such as variant calling and in particular, two tasks that were not fully integrated into the existing popular Spark-based tools: variant annotation and phenotype prediction. The main advantage of the Spark4VCF is its dual functionality: it offers a user-friendly and easy-to-config interface which integrating bioinformatics tools without requiring modifications to the source code of these tools. By seamlessly integrating common tasks such as variant annotation and phenotype prediction, Spark4VCF facilitates users in analyzing genomic data in an easy and efficient way. 
Spark4VCF has been tested on diverse datasets, and it has shown a significant improvement in performance over the original tools.\u003c/p\u003e\n"},{"header":"Results","content":"\u003ch2\u003e2.1 Experimental Setup \u003c/h2\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"520\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eItem\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eHost\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eServer\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHardware Model\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eASUSTeK COMPUTER INC. PRIME Z390-P\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHPE ProLiant DL385 Gen10 Plus\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eProcessor\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eIntel® Core™ i9-9900K × 8-Core 2-Thread\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2 × AMD EPYC 7742 × 64-Core 2-Thread\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eMemory\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e64 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1008 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eLocal Storage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eSamsung SSD 870 EVO (1TiB)\u003c/p\u003e\n \u003cp\u003eWDC 
WD4005FZBX-00K5WB0 (4TiB)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eLogical Volume (8.75TiB)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eData plane NICs\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVirtualBox internal network\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVirtualBox internal network\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBIOS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVersion: 2804\u003c/p\u003e\n \u003cp\u003eRelease Date: 04/15/2020\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVersion: A42\u003c/p\u003e\n \u003cp\u003eRelease Date: 02/10/2022\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHyper-threading\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eEnabled\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eEnabled\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2\u003c/strong\u003e: Summary of the host and server computer configurations.\u003c/p\u003e\n\u003cp\u003eWe simulated 4 computers using virtual machine (VM) technology on a host computer so that the experiments ran on the same configuration for each computer. The VMs were set up with 1, 2, or 3 CPU cores and 14 GB of RAM each. Several test cases were designed on the VMs to assess the performance of Spark4VCF. We evaluated the original tools and the Spark Join module while varying the number of intervals, samples, and variants. 
In addition, we ran the framework on a server provided by GeneStory JSC to validate the results. The settings mirrored those of the host computer. The server, whose configuration is listed in Table 2, was set up with the number of CPU cores growing from 1 to 32. We compared the results on these machines by running both Spark4VCF and the other pipelines. We logged execution time and resource utilization via the \u003cem\u003etime\u003c/em\u003e and \u003cem\u003epsrecord\u003c/em\u003e commands, and report file size, running time, CPU usage (%), and memory usage (%).\u003c/p\u003e\n\u003cp\u003eOur environments use Hadoop version 2.7, Spark version 2.4.0, VEP version 108, GATK 3.2.2, PyPGx 0.20.0, and Scala version 2.11.12. The input files are 133, 454, and 2300 MB in size and contain WGS data of chromosome 22 from the 1000 Genomes Project (1KGP). Spark4VCF distributes variant calling, variant annotation, and phenotype prediction across worker nodes that run GATK-HaplotypeCaller, VEP, and PyPGx, respectively. Specifically, we used the BAM file of sample HG00131 for variant calling, the VCF file of one random sample for variant annotation, and 3202 samples for phenotype prediction. \u003c/p\u003e\n\u003ch3\u003e2.1.1 GATK-HaplotypeCaller\u003c/h3\u003e\n\u003cp\u003eBased on the interval range of the HaplotypeCaller tool, we divided variant locations into chunks and passed each chunk to a Spark map task to call variants. The map outputs were then merged into one output. The experiment was designed to compare the run time of GATK-HaplotypeCaller and GATK-HaplotypeCaller-Spark4VCF over multiple primary and alternative chromosome regions, here referred to as intervals. \u003c/p\u003e\n\u003ch3\u003e2.1.2 VEP\u003c/h3\u003e\n\u003cp\u003eWe split the sequencing datasets by variant location into multiple files that serve as standard inputs, and these files were then mapped to processes on the worker nodes. 
Worker nodes process their tasks in parallel based on the position range of the genomic variants, after which the reducer combines all of the map buffers into the standard output. The goal of the experiment was to compare the running times of VEP and VEP-Spark4VCF for different numbers of variants. \u003c/p\u003e\n\u003ch3\u003e2.1.3 PyPGx\u003c/h3\u003e\n\u003cp\u003eWe converted all of the samples into a sample-chunk format, and the mappers read and processed the chunks concurrently. When the \u003cem\u003emap\u003c/em\u003e tasks are completed, the Spark pipeline reads each sample line on each worker to predict the sample phenotype or identify star alleles. The final output was then aggregated from the outputs of the worker nodes. The purpose of the experiment was to compare the run times of PyPGx and PyPGx-Spark4VCF across different numbers of samples. \u003c/p\u003e\n\u003ch2\u003e2.2 Performance Evaluation\u003c/h2\u003e\n\u003cp\u003eFigure 1 shows the execution time of the original tools and the Spark Join module as the number of intervals, samples, or variants varies. Because merging VCF files is I/O bound, the running time of the validation processes grows linearly with file size. The Spark Join module had a significantly lower running time than the original tools as the input size grew, thanks to Spark's in-memory processing. GATK-HaplotypeCaller and GATK-HaplotypeCaller-Spark4VCF were used for calling variants (Figure 1a). The execution time of GATK-HaplotypeCaller alone was higher than that of GATK-HaplotypeCaller-Spark4VCF across the different numbers of intervals. At 20 intervals, the running times were approximately 119 and 72 minutes, respectively (Table 3); at 95 intervals, that of GATK-HaplotypeCaller had increased to roughly 894 minutes, whereas that of GATK-HaplotypeCaller-Spark4VCF had increased to roughly 423 minutes. 
For the phenotype prediction process (Figure 1b), we ran PyPGx and PyPGx-Spark4VCF while steadily increasing the number of samples. The running time of PyPGx increased sharply from approximately 1 minute to roughly 1100 minutes when processing 3202 samples, while the running time of PyPGx-Spark4VCF increased gradually from approximately 1 to 110 minutes for the same sample size. For variant annotation (Figure 1c), VEP and VEP-Spark4VCF were run with different numbers of variants. The running time of VEP rose strongly from approximately 70 minutes to 160 minutes at 900000 variants, whereas that of VEP-Spark4VCF rose only from approximately 40 to 50 minutes at 900000 variants. Table 3 shows the detailed running time of each module. Figure 2 shows that Spark4VCF ran faster than the other methods in all test cases. In variant calling, the running time dropped from approximately 983 minutes with GATK-HaplotypeCaller to 562 minutes with GATK-HaplotypeCaller-Spark4VCF, a speedup of 1.75 times over the original pipeline on a single core. We then expanded the number of CPU cores to 2 and 3 for the validation experiment. With Spark, GATK-HaplotypeCaller achieved a speedup of up to approximately 2.8 times on 2 and 3 cores. The running time of PyPGx-Spark4VCF was also considerably shorter than that of the original PyPGx. With 1 core, the running time decreased from approximately 1319.37 minutes with PyPGx to 366.66 minutes with PyPGx-Spark4VCF, a speedup of 3.6 times. Similarly, with 2 and 3 cores, the running time of the phenotype prediction task decreased from 1306.78 minutes to 137.1 minutes (a speedup of 9.5 times) and from 1102.62 minutes to 109 minutes (a speedup of 10.12 times), respectively. 
Finally, VEP-Spark4VCF was faster than VEP in all three CPU-core settings when annotating variants: the running time fell from approximately 185.5 minutes to 118.8 minutes (1.6 times) on a single core, from 182.49 minutes to 97 minutes (1.9 times) on 2 cores, and from 197.75 minutes to 70.17 minutes (2.8 times) on 3 cores. However, the CPU utilization of the Spark join module was approximately 50% higher than that of the original tools during execution. Thus, even though memory usage (Figure 2c) and CPU usage (Figure 2b) spread more widely than for the original tools, the running time was considerably reduced with the additional Spark module (Figure 2a). The detailed summary is presented in Table 4.\u003c/p\u003e\n\u003cp\u003eMoreover, we assessed the framework on our server. Figure 3 shows the running time of the three tasks on the server as the number of CPU cores increases from 1 to 32. While the running time of the GATK-HaplotypeCaller-Spark4VCF pipeline dropped as cores were added, that of GATK-HaplotypeCaller did not improve. For example, GATK-HaplotypeCaller-Spark4VCF took about 150 minutes with one CPU core and about 50 minutes with 32 CPU cores, while the GATK-HaplotypeCaller execution time hovered around 275 minutes (Figure 3a). The running time of the PyPGx-Spark4VCF pipeline likewise decreased significantly as the number of CPU cores increased, unlike that of the PyPGx pipeline. For example, PyPGx-Spark4VCF took roughly 2250 minutes with 1 CPU core and approximately 273.11 minutes with 32 CPU cores, while the execution time of PyPGx fluctuated around 4000 minutes from 1 to 32 CPU cores (Figure 3b). Finally, the running time of the VEP-Spark4VCF pipeline also improved under the Spark setup. 
The running time of VEP-Spark4VCF decreased from approximately 250 minutes to 100 minutes at 32 CPU cores, while that of VEP increased slightly to approximately 600 minutes at 32 CPU cores (Figure 3c). The running times of all three original tools merely fluctuated, without a clear trend, because each tool ran on a single thread regardless of the number of cores available.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3\u003c/strong\u003e: The execution time (in minutes) of the original tools and the Spark join modules with different input settings on the host computer. \u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"532\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eNumber of Samples\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003ePyPGx\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003ePyPGx-Spark4VCF\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e8.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.09\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e500\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e59.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e9.14\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e189.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e16.05\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1500\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e376.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e29.28\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e579.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e36.78\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2500\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e892.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e50.49\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1102.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e109.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eNumber of Variants\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eVEP\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eVEP-Spark4VCF\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e300000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e71.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e40.32\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003e400000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e81.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e47.50\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e500000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e95.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e43.52\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e600000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e124.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e47.74\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e700000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e136.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e41.09\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e800000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e139.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e44.09\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e900000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e162.46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e52.26\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eNumber of intervals\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eGATK-HaplotypeCaller\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eGATK-HaplotypeCaller-Spark4VCF\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e118.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e72.45\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e364.06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e213.89\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e519.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e230.97\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e707.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e359.61\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e894.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e423.10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 4\u003c/strong\u003e: Summary of the performance results of Spark4VCF on the host computer. N is number of samples. W is number of workers. 
\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"575\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eDataset\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eTypes\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eTools\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eN\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eFile size\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e(MB)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eRunning\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eTime\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e(minutes)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eCPU (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eMaximum memory\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eused (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eW\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eCPU\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003ecores\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eRAM\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eUsage\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eDisk\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eType\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e185.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e12.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e182.49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e178\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003e7.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e197.75\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e140\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e12.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"bottom\"\u003e\n \u003cp\u003e118.82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e40.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e97.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e41.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Annotation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVEP-Spark\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e133\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e70.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e41.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1319.36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e5.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1306.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e163\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e5.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1102.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e140\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e5.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e366.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e137.08\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1KGP-chr 22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePhenotype Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003ePyPGX-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e454\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e109.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e250\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e983.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e960.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e11.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1006.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e225\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e562.52\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant 
Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e338.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e39.233\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eBAM file HG00131\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVariant Calling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eGATK-Spark\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e2300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e344.73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e40.333\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e14GiB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eHDD\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Background","content":"\u003cp\u003eApache Spark is an open-source, multi-language engine for distributed data processing, designed to address the limitations of Hadoop. Spark was originally developed at the University of California, Berkeley [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e], and is sponsored by the Apache Software Foundation. Spark uses a master-slave architecture with a single central driver and many distributed workers to carry out tasks or analytics. A Resilient Distributed Dataset (RDD) in Apache Spark is a read-only collection of data objects partitioned across multiple machines [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]; it can be created by distributing an existing in-memory collection (\u003cem\u003elist\u003c/em\u003e or \u003cem\u003eset\u003c/em\u003e), or by loading an external dataset from one of the many sources supported by Hadoop, including the local file system, HDFS [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], Parquet (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://parquet.apache.org/\u003c/span\u003e\u003cspan address=\"https://parquet.apache.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), and AWS (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://aws.amazon.com\u003c/span\u003e\u003cspan address=\"https://aws.amazon.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eRDD parallel operations are of two kinds: \u003cem\u003eTransformations\u003c/em\u003e and \u003cem\u003eActions\u003c/em\u003e. 
\u003cem\u003eTransformations\u003c/em\u003e operate on existing RDDs to return new RDDs, such as \u003cem\u003emap\u003c/em\u003e (apply a function to each element), \u003cem\u003efilter\u003c/em\u003e (select elements based on a condition), \u003cem\u003ejoin\u003c/em\u003e (combine data from two RDDs), and \u003cem\u003egroupByKey\u003c/em\u003e (group elements with the same key). The output RDDs can be kept in memory for faster processing in subsequent steps, and Spark also allows persisting them to disk if needed. In contrast, \u003cem\u003eActions\u003c/em\u003e trigger computations on the RDDs, such as \u003cem\u003ecollect\u003c/em\u003e (get all elements as a collection), \u003cem\u003ecount\u003c/em\u003e (get the number of elements), and \u003cem\u003etake\u003c/em\u003e (get the first N elements), and return an output to the driver program (main application) or to external storage. Importantly, \u003cem\u003eTransformations\u003c/em\u003e are evaluated lazily: rather than executing immediately, they build a logical execution plan that records the order in which the \u003cem\u003eTransformations\u003c/em\u003e should be applied. This approach allows Spark to optimize the workflow by avoiding unnecessary computations if the result of a \u003cem\u003eTransformation\u003c/em\u003e is not ultimately used by any \u003cem\u003eAction\u003c/em\u003e.\u003c/p\u003e \u003cp\u003eApache Spark supports multiple programming languages such as Python, Scala, and Java, enabling users to work with data that is spread across several machines and stored on disk or in memory. 
Spark can be used with a cluster manager such as YARN [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e] or Apache Mesos [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e], or it can run locally.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eIn this section, we introduce Spark4VCF, which integrates various bioinformatics tools into the Apache Spark framework (Fig. \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e). The goals of Spark4VCF are improved performance, ease of configuration, and flexibility in integrating diverse tools into the framework.\u003c/p\u003e\n\u003ch2\u003e4.1 Architecture\u003c/h2\u003e\n\u003cp\u003eThe Spark4VCF workflow consists of three main stages: a pre-processing stage, a main computation stage, and a post-processing stage. The pre-processing stage loads data from files in one of three formats (BAM, VCF, or FASTQ) into HDFS storage. The main computation stage comprises three steps: 1) data is distributed from HDFS to the worker nodes by the Spark API and injected into multiple pipes so that the tools can process it; 2) each worker node launches an external process, which receives input data from the pipe and starts the task; and 3) the output data of the pipes is returned to the driver. The post-processing stage comprises two steps: 1) the output files are uploaded to HDFS, and 2) they can be exported to local storage, a file server, or a storage system such as Parquet, Elasticsearch, Cassandra, or HBase. Notably, we used HDFS as the distributed file system. The details of the workflow are presented in Fig. \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e.\u003c/p\u003e\n\u003ch2\u003e4.2 In-memory computing\u003c/h2\u003e\n\u003cp\u003eIn the pre-processing stage, there are two primary tasks: loading data from the local disk and uploading the data to HDFS. Loading and uploading are disk I/O tasks that can be handled by Direct Memory Access. 
Uploading data relies on the network channel, while loading data depends on disk I/O. During the main computation stage, it is unnecessary to load data from HDFS to the local disk at each step; instead, the data is kept in memory for later phases by Spark\u0026rsquo;s in-memory mechanism. This stage was therefore designed so that data is fed directly into the next process, which minimizes the number of disk read/write cycles, whereas the original tools read from and write to disk in each process (Fig. \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e). Moreover, each independent hardware component processes only one set of data in each time unit. Therefore, two processes handle 1\u0026ndash;2 data units in each time slot.\u003c/p\u003e\n\u003cp\u003eSpecifically, the chunk identifier of an input file is used as the key in RDDs. The RDD generated by Spark thus has the format {\u003cem\u003echunk id, chunk content\u003c/em\u003e}, where \u003cem\u003echunk content\u003c/em\u003e contains the variant information of the chunk identified by \u003cem\u003echunk id\u003c/em\u003e. This RDD is then fed to the tools of the main computation stage, such as VEP, PyPGx, and GATK-HaplotypeCaller. Once the computation phase is complete, users can merge all the outputs into one file.\u003c/p\u003e\n\u003ch2\u003e4.3 Spark4VCF Console\u003c/h2\u003e\n\u003cp\u003eEase of configuration is one of the key features of Spark4VCF for performing various tasks such as variant annotation, variant calling, or phenotype prediction. With this in mind, Spark4VCF is built on Apache Spark, a big data technology, and can be installed through GitHub.\u003c/p\u003e\n\u003cp\u003eThe Spark4VCF application is written in Scala and packaged in Java Archive (JAR) format. It accepts a variety of options to run a specific task (see Listing 1 and the example in supplement-listing-1). 
We implemented VEP (variant annotation), GATK-HaplotypeCaller (variant calling), and PyPGx (phenotype prediction) to verify that the application can be applied to various tasks.\u003c/p\u003e\n\u003cp\u003eWe integrated Spark4VCF with the \u003cem\u003espark-submit\u003c/em\u003e command, which provides a way to run it from the terminal and leverage distributed computing power. The command supports a variety of arguments and options with which the user can tune a run based on the YARN configuration. For example, it is possible to choose the number of executors, the number of cores used per worker, or the amount of memory (the basic syntax is given in Listing 2, and an example of \u003cem\u003espark-submit\u003c/em\u003e is shown in supplement-listing-2).\u003c/p\u003e\n\u003cp\u003eAs a result, Spark4VCF is an easy-to-configure tool that enables a user to run tasks using the standard \u003cem\u003espark-submit\u003c/em\u003e from the terminal.\u003c/p\u003e\n\u003cp\u003eListing 1: The basic syntax for Spark4VCF\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e$\u003c/span\u003e java -jar spark4vcf.jar --help\u003c/p\u003e\n\u003cp\u003eUsage: Spark4VCF [options]\u003c/p\u003e\n\u003cp\u003e--bioinformaticstool \u003cem\u003e\u0026lt;\u003c/em\u003evalue\u003cem\u003e\u0026gt;\u003c/em\u003e: Bioinformatics tool name (e.g., vep, pypgx, gatk-haplotypecaller)\u003c/p\u003e\n\u003cp\u003e--tooldir \u003cem\u003e\u0026lt;\u003c/em\u003evalue\u003cem\u003e\u0026gt;\u003c/em\u003e: Executable path\u003c/p\u003e\n\u003cp\u003e-i, --inputfile \u003cem\u003e\u0026lt;\u003c/em\u003evalue\u003cem\u003e\u0026gt;\u003c/em\u003e: Absolute path to input file in HDFS storage\u003c/p\u003e\n\u003cp\u003e-o, --outputfile \u003cem\u003e\u0026lt;\u003c/em\u003evalue\u003cem\u003e\u0026gt;\u003c/em\u003e: Absolute path to output file in HDFS storage\u003c/p\u003e\n\u003cp\u003e--toolargs 
\u003cem\u003e\u0026lt;\u003c/em\u003evalue\u003cem\u003e\u0026gt;\u003c/em\u003e: Bioinformatics tool arguments, in quotes\u003c/p\u003e\n\u003cp\u003e-h, --help: Print usage\u003c/p\u003e\n\u003cp\u003eListing 2: The basic syntax for spark-submit\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e$\u003c/span\u003e spark-submit [options] \u003cem\u003e\u0026lt;\u003c/em\u003eapplication\u003cem\u003e\u0026gt;\u003c/em\u003e [application arguments]\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u0026lt;\u003c/em\u003eapplication\u003cem\u003e\u0026gt;\u003c/em\u003e: This can be the path to the JAR file or the Python script containing the Spark application.\u003c/p\u003e\n\u003cp\u003e[options]: These are optional arguments that allow you to configure the Spark application. Here are some commonly used options:\u003c/p\u003e\n\u003cp\u003e--master: Specifies the cluster manager to use (e.g., yarn, local)\u003c/p\u003e\n\u003cp\u003e--deploy-mode: Sets the deployment mode (client or cluster)\u003c/p\u003e\n\u003cp\u003e--driver-memory, --executor-memory: Define the memory allocation for the driver and for each executor.\u003c/p\u003e\n\u003cp\u003e--num-executors: Sets the number of executors to launch.\u003c/p\u003e\n\u003ch2\u003e4.4 Bioinformatics Tools\u003c/h2\u003e\n\u003ch2\u003e4.4.1 GATK-HaplotypeCaller\u003c/h2\u003e\n\u003cp\u003eGATK provides HaplotypeCaller, a variant calling tool designed for high-accuracy variant genotyping in diploid genomes that addresses sequencing errors and mismapped reads in variant calling [\u003cspan class=\"CitationRef\"\u003e41\u003c/span\u003e]. GATK-HaplotypeCaller employs a de Bruijn graph-based approach to assemble contigs from aligned sequencing reads [\u003cspan class=\"CitationRef\"\u003e42\u003c/span\u003e]. This strategy allows for the reconstruction of haplotypes, the complete set of alleles on a single chromosome within a sequencing sample. 
By analyzing these reconstructed haplotypes, \u003cem\u003eHaplotypeCaller\u003c/em\u003e can effectively distinguish true variants from sequencing errors based on their presence within the assembled contigs. By leveraging a haplotype-based assembly approach, the tool overcomes the limitations of traditional variant callers and delivers a more accurate and comprehensive set of variants, although \u003cem\u003eHaplotypeCaller\u003c/em\u003e was found to be the most time-consuming step.\u003c/p\u003e\n\u003ch2\u003e4.4.2 The Variant Effect Predictor (VEP)\u003c/h2\u003e\n\u003cp\u003eVEP, developed by the Ensembl Project, plays a crucial role in variant annotation [\u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e]. VEP is designed as a thorough in-silico annotation tool that predicts the effects of genetic variants across large genomic datasets. The tool accepts a list of variants, typically in the Variant Call Format (VCF). VEP reports the potential effects of a variant, including its location within a gene (e.g., exonic, intronic), predicted protein consequences (e.g., missense or synonymous mutation), and potential impact on protein function (e.g., predicted damaging or benign). Users can quickly determine the genes and transcripts affected by the variants by entering the chromosome coordinates of the variants and the nucleotide alterations.\u003c/p\u003e\n\u003cp\u003eThe inclusion of VEP in research workflows offers several significant benefits. VEP automates the variant annotation process, replacing manual analysis, which saves time and resources. 
Moreover, with relevant databases, VEP provides valuable insights into the potential functional effects of variants.\u003c/p\u003e\n\u003ch2\u003e4.4.3 PyPGx\u003c/h2\u003e\n\u003cp\u003ePyPGx, developed at Macrogen Inc. [\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e], uses the 1000 Genomes Project reference panel to produce individual diplotype calls and plots of copy number and allele fraction. First, genomic variants are phased with the Beagle tool before being matched to star alleles in the Gene Haplotype translation reference. Then, the target gene\u0026rsquo;s per-base read depth is converted to copy number using an intra-sample normalization method with a control gene. For targeted sequencing data, inter-sample normalization is additionally carried out to account for the variation in total coverage across samples.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn this paper, we reported Spark4VCF, a framework in which bioinformatics tools are built on top of Apache Spark, achieving superior performance while remaining easy to configure. Spark4VCF consists of three stages: pre-processing, main computation, and post-processing. In the main computation stage, we improved processing speed by avoiding intermediate data stored on the local disk. By applying Spark4VCF to large-scale biological data, we obtained high-performance tools that enable routine analyses of large numbers of variants in human genomes using common tools such as PyPGx and VEP. With the drastic increase in the number of samples and features in genomics research, efficient tools for genomic data processing are crucial for the research community. Our work showed that the Spark framework can be used effectively in various genomic data analyses and that Spark4VCF can serve as a complementary solution for such tasks. 
Currently, Spark4VCF has several limitations: it integrates only a few tools, and it runs them step by step rather than as a whole pipeline from raw data to final output. We plan to develop Spark4VCF into a more powerful and comprehensive solution for genomic data analyses.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e6 Author contributions statement\u003c/p\u003e\n\u003cp\u003eT.H.H. and V.C.D. wrote the manuscript and performed data analysis. V.C.D. deployed and performed QC for the system. T.H.H. and T.D.D. designed the system and developed the driver algorithm (main program). C.D.L. wrote test cases and tested annotation tools. V.H.P. developed an automated deployment tool for Spark4VCF. Q.N. contributed to the experiment design and revised the manuscript. N.S.V. conceived the project, supervised the experiments, revised the manuscript, and coordinated the overall project. All authors revised the manuscript, discussed the analysis results, and contributed to bringing innovative ideas into the manuscript.\u003c/p\u003e\n\n\u003cp\u003e7 Code availability statement\u003c/p\u003e\n\u003cp\u003eThe source code of Spark4VCF is available from: https://github.com/vinhdc10998/Spark4VCF\u003c/p\u003e\n\n\u003cp\u003e8 Data access statement\u003c/p\u003e\n\u003cp\u003eThe 1000 Genomes Project phase 3 (1KGP3) dataset, comprising WGS data from 2504 unrelated individuals, was used. The HG00131 sample was downloaded from: https://www.internationalgenome.org/data-portal. This 1KGP3 dataset has its own consent form. All datasets follow the relevant guidelines and regulations (e.g. 
Helsinki Declaration).\u003c/p\u003e\n\n\u003cp\u003e9 Human Ethics and Consent to Participate declarations \u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\n\u003cp\u003e10 Ethics Approval declaration\u003c/p\u003e\n\u003cp\u003eThe project (grant VINIF.DA.2020.02) was approved by the Institutional Review Board of the Hanoi Medical University, Hanoi, Vietnam - IRB-VN01.001/IRB00003121.\u003c/p\u003e\n\n\u003cp\u003e11 Funding Declaration\u003c/p\u003e\n\u003cp\u003eThis work is funded by the Institute of VinUni Big Data Research, VinUniversity internal funding, and is partly supported by the Vingroup Innovation Foundation (grant VINIF.DA.2020.02).\u003c/p\u003e\n\n\u003cp\u003e12 Competing interests\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMaarala, A. I., P\u0026auml;rn, K., N\u0026uacute;\u0026ntilde;ez-Fontarnau, J. \u0026amp; Heljanko, K. Sparkbeagle: Scalable genotype imputation from distributed whole-genome reference panel cloud. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. BCB \u0026rsquo;20. Association for Computing Machinery, New York, NY, USA (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3388440.3414860\u003c/span\u003e\u003cspan address=\"10.1145/3388440.3414860\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChi Duong, V. et al. A rapid and reference-free imputation method for low-cost genotyping platforms. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e (1), 23083 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAuwera, G. A. \u0026amp; O\u0026rsquo;Connor, B. D. 
\u003cem\u003eGenomics in the Cloud: Using Docker, GATK, and WDL in Terra\u003c/em\u003e (O\u0026rsquo;Reilly Media, 2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. \u003cem\u003eFly\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e (2), 80\u0026ndash;92 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcLaren, W. et al. The ensembl variant effect predictor. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e17\u003c/b\u003e (1), 1\u0026ndash;14 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, K., Li, M. \u0026amp; Hakonarson, H. Annovar: functional annotation of genetic variants from high-throughput sequencing data. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e38\u003c/b\u003e (16), 164\u0026ndash;164 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHart, R. K. et al. A python package for parsing, validating, mapping and formatting sequence variants using hgvs nomenclature. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e (2), 268\u0026ndash;270 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee, S., Shin, J. Y., Kwon, N. J., Kim, C. \u0026amp; Seo, J. S. Clinpharmseq: A targeted sequencing panel for clinical pharmacogenetics implementation. \u003cem\u003ePLoS One\u003c/em\u003e \u003cb\u003e17\u003c/b\u003e (7), 0272129 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee, S. et al. Stargazer: a software tool for calling star alleles from next-generation sequencing data using cyp2d6 as a model. \u003cem\u003eGenet. Sci.\u003c/em\u003e \u003cb\u003e21\u003c/b\u003e (2), 361\u0026ndash;372 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchobers, G. et al. Genome sequencing as a generic diagnostic strategy for rare disease. 
\u003cem\u003eGenome Med.\u003c/em\u003e \u003cb\u003e16\u003c/b\u003e (1), 32 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePhan, V., Gao, S., Tran, Q. \u0026amp; Vo, N. S. How genome complexity can explain the hardness of aligning reads to genomes. In: 2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp. 1\u0026ndash;2 (2014). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICCABS.2014.6863916\u003c/span\u003e\u003cspan address=\"10.1109/ICCABS.2014.6863916\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eApache Software Foundation. Hadoop. https://hadoop.apache.org.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeo, S., Santoni, F. \u0026amp; Zanetti, G. Biodoop: Bioinformatics on hadoop. In: 2009 International Conference on Parallel Processing Workshops, pp. 415\u0026ndash;422 (2009). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICPPW.2009.37\u003c/span\u003e\u003cspan address=\"10.1109/ICPPW.2009.37\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerraro Petrillo, U., Roscigno, G., Cattaneo, G. \u0026amp; Giancarlo, R. Fastdoop: a versatile and efficient library for the input of fasta and fastq files for mapreduce hadoop bioinformatics applications. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (10), 1575\u0026ndash;1577. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/bioinformatics/btx010\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btx010\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlnasir, J. J. \u0026amp; Shanahan, H. P. The application of hadoop in structural bioinformatics. \u003cem\u003eBrief. Bioinform.\u003c/em\u003e \u003cb\u003e21\u003c/b\u003e (1), 96\u0026ndash;105 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao, G., Ling, C. \u0026amp; Sun, D. Sparksw: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 845\u0026ndash;852 IEEE (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, B. et al. Dsa: scalable distributed sequence alignment system using simd instructions. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 758\u0026ndash;761 IEEE (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, B. et al. Efficient distributed smith-waterman algorithm based on apache spark. In: 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pp. 608\u0026ndash;615 IEEE (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbu\u0026iacute;n, J. M., Pichel, J. C., Pena, T. F. \u0026amp; Amigo, J. Sparkbwa: speeding up the alignment of high-throughput dna sequencing data. \u003cem\u003ePLoS One\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e (5), 0155461 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMushtaq, H., Ahmed, N. \u0026amp; Al-Ars, Z. Streaming distributed dna sequence alignment using apache spark. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 
188\u0026ndash;193 IEEE (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbu\u0026iacute;n, J. M., Pena, T. F. \u0026amp; Pichel, J. C. Pastaspark: multiple sequence alignment meets big data. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (18), 2948\u0026ndash;2950 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLlad\u0026oacute;s, J., Guirado, F. \u0026amp; Cores, F. Ppcas: Implementation of a probabilistic pairwise model for consistency-based multiple alignment in apache spark. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 601\u0026ndash;610 Springer (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCastro, M. R., Tostes, C. S., D\u0026aacute;vila, A. M., Senger, H. \u0026amp; Silva, F. A. Sparkblast: scalable blast processing using in-memory operations. \u003cem\u003eBMC Bioinform.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e, 1\u0026ndash;13 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou, W. et al. Metaspark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (7), 1090\u0026ndash;1092 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerraro Petrillo, U., Palini, F., Cattaneo, G. \u0026amp; Giancarlo, R. Alignment-free genomic analysis via a big data spark platform. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e37\u003c/b\u003e (12), 1658\u0026ndash;1665 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlJame, M. \u0026amp; Ahmad, I. Dna short read alignment on apache spark. \u003cem\u003eAppl. Comput. Inf.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e (1/2), 64\u0026ndash;81 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeng, L., Huang, G., Zhuang, Y., Wei, J. \u0026amp; Yan, Y. 
Higene: A high-performance platform for genomic data analysis. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 576\u0026ndash;583 IEEE (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, X. et al. Accelerating large-scale genomic analysis with spark. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 747\u0026ndash;751 IEEE (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWiewi\u0026oacute;rka, M. S. et al. Sparkseq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e (18), 2652\u0026ndash;2653 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDemirbaga, \u0026Uuml;., Aujla, G. S., Jindal, A. \u0026amp; Kalyon, O. Big data analytics in bioinformatics. In: Big Data Analytics: Theory, Techniques, Platforms, and Applications, pp. 265\u0026ndash;284. Springer (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, X., Ji, Z. \u0026amp; Zhang, Z. Cloudphylo: a fast and scalable tool for phylogeny reconstruction. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (3), 438\u0026ndash;440 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarnie, D. et al. Scaling machine learning for target prediction in drug discovery using apache spark. \u003cem\u003eFuture Generation Comput. Syst.\u003c/em\u003e \u003cb\u003e67\u003c/b\u003e, 409\u0026ndash;417 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, A., Troup, M., Lin, P. \u0026amp; Ho, J. W. Falco: a quick and flexible single-cell rna-seq processing framework on the cloud. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (5), 767\u0026ndash;769 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eO\u0026rsquo;Brien, A. R. et al. Variantspark: population scale clustering of genotype information. 
\u003cem\u003eBMC Genom.\u003c/em\u003e \u003cb\u003e16\u003c/b\u003e, 1\u0026ndash;9 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuha Neogi, A., Eltaher, A. \u0026amp; Sargsyan, A. NGS data analysis with apache spark. In Computational Life Sciences: Data Engineering and Data Mining for Life Sciences (441\u0026ndash;467). Cham: Springer International Publishing. (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChicco, D., Petrillo, F. \u0026amp; Cattaneo, U. Ten quick tips for bioinformat- ics analyses using an apache spark distributed computing environment. \u003cem\u003ePLoS Comput. Biol.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e (7), 1011272 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. \u0026amp; Stoica, I. Spark: Cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZaharia, M. et al. Resilient distributed datasets: A {Fault-Tolerant} abstraction for {In-Memory} cluster computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15\u0026ndash;28 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVavilapalli, V. K. et al. : Apache hadoop yarn: Yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, pp. 1\u0026ndash;16 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHindman, B. et al. Mesos: A platform for {Fine-Grained} resource sharing in the data center. In: 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11) (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePoplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. 
\u003cem\u003eBioRxiv\u003c/em\u003e, 201178 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRen, S., Bertels, K. \u0026amp; Al-Ars, Z. Efficient acceleration of the pair-hmms forward algorithm for gatk haplotypecaller on graphics processing units. \u003cem\u003eEvolutionary Bioinf.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 1176934318760543 (2018).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Apache Spark, Easy-to-config setting, Variant Call Format, Variant Annotation, Variant Calling, Phenotype Prediction","lastPublishedDoi":"10.21203/rs.3.rs-8910343/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8910343/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIn recent years, the exponential growth of Next Generation Sequencing (NGS) has led to an unprecedented increase in the amount of genomics data. While NGS technologies enable us to read the entire human genome, the analysis of functions of variants and phenotype prediction found in human sequences are still limited by computational tools that usually require high computing overhead due to the gigabytes or terabytes of data to be analyzed. Here we report a powerful big data framework called Spark4VCF which uses Apache Spark engine to accelerate genomics pipelines. Spark4VCF leverages independent attributes between variants and samples to speed up commonly used computational tools while maintaining quality and optimizing I/O tasks through parallel computing. We illustrated the superior speed, CPU usage and memory usage as well as new capability of Spark4VCF by showing example applications of three popular genomics toolboxes: GATK, VEP, and PyPGx. 
In summary, Spark4VCF is a high-performance framework that provides not only capacity of analyzing high quantities of genomics datasets but also user-friendly applications in big data settings.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e","manuscriptTitle":"Spark4VCF: A Novel Big Data Framework to Accelerate Genomics Analysis","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-02 10:06:22","doi":"10.21203/rs.3.rs-8910343/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-30T09:11:18+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-27T19:43:27+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"248635956922386043886827455830284360961","date":"2026-02-27T19:16:20+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"83551523410952781239516935016420620389","date":"2026-02-27T17:06:13+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-25T23:57:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"273924388347148321411307381708258742567","date":"2026-02-25T22:49:45+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-25T18:07:38+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-25T17:58:54+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-02-24T17:45:05+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-23T18:24:17+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-02-23T18:19:36+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"e7751e03-9988-4576-80ad-ab0eedf878a6","owner":[],"postedDate":"March 2nd, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":63725018,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":63725019,"name":"Biological sciences/Genetics"},{"id":63725020,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2026-05-19T11:53:06+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-02 10:06:22","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8910343","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8910343","identity":"rs-8910343","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}