{"paper_id":"21fc4637-e0a4-4edb-a175-5d0f46f326e7","body_text":"OPTIMIZING GENOMICS PIPELINE EXECUTION WITH INTEGER\nLINEAR PROGRAMMING\nA PREPRINT\nOlesya Melnichenko and Venkat S. Malladi\nResearch & AI\nMicrosoft, Redmond W A 98052, USA\nolesya.melnichenko@microsoft.com\nvenkat.malladi@microsoft.com\nFebruary 6, 2024\nABSTRACT\nIn the field of genomics, bioinformatics pipelines play a crucial role in processing and analyzing\nvast biological datasets. These pipelines, consisting of interconnected tasks, can be optimized for\nefficiency and scalability by leveraging cloud platforms such as Microsoft Azure. The choice of\ncompute resources introduces a trade-off between cost and time. This paper introduces an approach\nthat uses Linear Programming (LP) to optimize pipeline execution. We consider optimizing two\ncompeting cases: minimizing cost with a run duration restriction and minimizing duration with a cost\nrestriction. Our results showcase the utility of using LP in guiding researchers to make informed\ncompute decisions based on specific data sets, cost and time requirements, and resource constraints.\n1 Introduction\nIn the realm of genomics, bioinformatics pipelines are integral to processing and analyzing vast amounts of biological\ndata. A pipeline, in this context, is a series of data processing steps, or tasks, that are interconnected. Each task takes\na set of inputs, processes it, and produces a set of outputs that may be used as inputs for the subsequent tasks or are\nconsidered terminal outputs. This sequence of tasks can be distributed, or divided into smaller, more manageable pieces\nor shards, to enhance efficiency and scalability. These pipelines can be run on cloud platforms like Microsoft Azure,\nleveraging its robust computational resources and storage capabilities. These pipelines are typically run using some of\nthe following workflow managers all able to run on Azure: Nextflow [1, 2], Cromwell [3, 4], Snakemake [5, 6].\nIt’s important to note that the cost and duration of running these pipelines on Azure are influenced by the choice of\ncompute resources. More powerful resources can process data faster but at a higher cost, necessitating a balance\nbetween speed, efficiency, and budget. Choosing the appropriate compute resources is dependent on the tools and data\nstorage (memory and disk) requirements of each task.\nWith a variety of compute resources available, there is always a trade-off between cost and time when running a\ngenomics pipeline. The preference for this trade-off depends on the researcher’s circumstances. If the researcher has\nlimited funding and is less restricted on time, they might be interested in getting results as cheaply as possible without\nspending more than Tmax hours per sample. In clinical settings where time is critical, the preference might be to get\nresults as quickly as possible.\nEfficient resources utilization remains a key issue in parallel and distributed computing environments. Resource\nallocation and task scheduling problems are well-known in this field, and a lot of effort has been invested in improving\nthe management of Cloud resources [7, 8, 9, 10]. In particular, cost optimization of scientific workflows has been a\nfocus of attention [11]. Existing open-source workflow engines in genomics field, such as CromwellOnAzure [ 12],\nSnakemake [5] or Nextflow [1], require researchers to provide a virtual machine-to-task assignment list in advance as\none of the workflow inputs or make a default assignment with no guarantees for optimal performance.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nTable 1: Set of experiments defined for a given pipeline\nExperiment 1 Experiment 2\nT ask1 V M11 V M12\nT ask2 V M21 V M22\nT ask3 V M31 V M32\nTable 2: Expected execution statistics: average time to complete each pipeline task\nExperiment 1 Experiment 2\nT ask1 t11 t12\nT ask2 t21 t22\nT ask3 t31 t32\nTable 3: Expected execution statistics: average cost to complete each pipeline task\nExperiment 1 Experiment 2\nT ask1 c11 c12\nT ask2 c21 c22\nT ask3 c31 c32\nWhen a researcher’s institution does not use a workflow management system that monitors workflow execution and\noptimizes task scheduling, the researcher faces the challenge of optimizing a pipeline according to project requirements\nwhile taking into account the variety of compute options and non-linear pipeline topology. In this paper, we demonstrate\nthat linear programming can be a useful tool for pipeline execution optimization in a bioinformatics researcher’s toolbox.\nWe consider two use cases for pipeline optimization:\nUC 1 minimizing the cost of running the pipeline with a restriction on run duration,\nUC 2 minimizing the duration of the pipeline run with a restriction on cost.\nGenomics workloads often involve running a given pipeline on a large number of samples. Typically, data from these\nsamples have the same number of reads, length, coverage and quality level, i.e. the sample dataset is homogeneous.\nGiven that the dataset is homogeneous, we assume that average execution statistics collected for a subset of data are\nsufficient for preliminary pipeline performance estimates and optimizations. We also assume that statistics have been\ncollected, and the scope of this work includes the following steps performed by the researcher: selecting a sample subset,\nselecting a virtual machine subset for each pipeline task, executing the pipeline on the selected subset of samples and\nvirtual machines, collecting execution statistics, and computing the average cost and execution time for each pipeline\ntask. For example, if virtual machine assignment is described by Table 1, then corresponding execution statistics will\ndefine matrices C = {cij} and T = {tij} presented as Tables 2 and 3.\n2 Formulating linear programming problems\nMaximizing or minimizing a function subject to constraints is a common problem in many fields. Linear programming\n(LP) is a special case of mathematical optimization that deals with linear functions and constraints. In LP, we assume\nthat the function to be maximized or minimized is a linear function and that the constraints are linear equations or\nlinear inequalities [13]. LP has proven useful in modeling diverse types of problems in planning, routing, scheduling,\nassignment, and design. In this section, we formulate LP problems for the use cases defined above.\nSuppose we have a pipeline consisting of n tasks and m experiments defined by the researcher. The cost and time to\ncomplete each task are given by matricesC = {cij} and T = {tij}, respectively, wherei = 1, . . . , n and j = 1, . . . , m.\nWe can formulate the following multiple-choice problems for the use cases in consideration:\nUC 1 For each task, pick exactly one virtual machine out of m options in order to minimize the total cost of running\nthe pipeline without exceeding the allotted time Tmax.\nUC 2 For each task, pick exactly one virtual machine out of m options in order to minimize the duration of the\npipeline run without exceeding the allowed cost Cmax.\n2\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nWe can formulate LP problems for pipeline optimization by following the approach used for the multiple-choice\nknapsack problem (MCKP) [14]. To do this, we introduce indicator variables Iij, where i = 1, . . . , n, j = 1, . . . , m.\nThe variable Iij is equal to 1 if we choose virtual machine (VM) from Experiment j for T aski, and 0 otherwise.\n2.1 Linear pipeline topology\nA linear pipeline consists of a chain of tasks arranged so that the output of each element is the input of the next. The\nprocessing of data is done in a linear and sequential manner, see example on Figure 1. In the case of a linear topology,\nthe total time and cost to run a pipeline can be computed as a sum of all tasks time and cost. Following MCKP approach,\nthe linear programming formulations can be expressed as:\nLP 1 – Minimize cost subject to time constraint\nMinimize\nnX\ni=1\nmX\nj=1\ncijIij total cost to run pipeline,\nsubject to\nmX\nj=1\nIij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,\nnX\ni=1\nmX\nj=1\ntijIij ≤ Tmax ensures time constraint is not exceeded.\nIij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}\nLP 2 – Minimize time subject to cost constraint\nMinimize\nnX\ni=1\nmX\nj=1\ntijIij → min total time to run pipeline,\nsubject to\nmX\nj=1\nIij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,\nnX\ni=1\nmX\nj=1\ncijIij ≤ Cmax ensures cost constraint is not exceeded.\nIij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}\n2.2 Distributed tasks\nSharding is a technique used to split a large task into smaller, more manageable pieces. The input data for the task is\npartitioned into shards based on a predetermined criterion, such as sequencing group or chromosome. Once the data\nis partitioned, it is distributed across multiple nodes for processing. Each node processes a subset of the data, which\nreduces the processing time and improves the overall performance of the system.\nWhen a task is split into shards (as shown in Figure 2), we aggregate the shards and represent the task as a single vertex\non the pipeline graph for several reasons. First, existing implementations of shard functionality don’t allow different\nvirtual machines to be assigned to different shards, and each shard of the task is assigned the same type of virtual\nmachine. Second, the goal of creating shards is usually to split the task into pieces as equally as possible. Therefore,\nwe assume that the execution time and cost won’t be drastically different between shards. When collecting execution\nstatistics, we define the cost to complete the distributed task as the sum of all shards’ costs, and the time to complete the\ndistributed task as the maximum across shards’ run duration.\nA pipeline with tasks split into shards can also be represented as a pipeline with non-linear topology (as shown in 2.3), it\nwill increase the complexity of LP problem. It might be interesting to preform a comparison study to better understand\nthe benefits from using non-linear topology versus the shard aggregation approach. This question is out of scope of this\npaper, and for simplicity, we will consider the distributed task as one vertex on pipeline graph.\n3\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nFigure 1: Example of linear pipeline topology\nFigure 2: Example of distributed task and shards’ execution statistics aggregation\nFigure 3: Example of non-linear pipeline topology\n2.3 Non-linear pipeline topology\nThe vast majority of genomics pipelines have a non-linear topology. In the non-linear case, the cost can still be computed\nas a sum of all costs, but the time to complete the pipeline should be computed differently. Let’s consider the pipeline\npresented in Figure 3. We can compute the time to complete it in the following way:\nTtotal = timeT ask1 + max(timeT ask2 + timeT ask3 , timeT ask4 ) + timeT ask5 + timeT ask6\nUnfortunately, we cannot replace the constraint in problem LP 1 and the objective function in problem LP 2 with the\nexpression given above because it breaks the definition of an LP problem. However, there are tips and tricks that can\nhelp deal with non-linear functions, such as maximum, absolute value, etc., which frequently appear in real-world\nproblems [13]. Usually, this is accomplished by creating additional variables.\nWe introduce a variable Fi for each T aski, which describes amount of time required to complete a part of the pipeline\nfrom the start up to T aski. We also define a set of immediate predecessors for each vertex of the pipeline graph\nImP redi. For example, for T ask5 the set of immediate predecessors is ImP red5 = {T ask3, T ask4}, and two\ninequalities hold F3 + timeT ask5 ≤ F5 and F4 + timeT ask5 ≤ F5. These inequalities ensure that T ask3 and T ask4\n4\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nare completed prior to T ask5, and then timeT ask5 is required to complete the part of the pipeline from the start and up\nto T ask5.\nWe also introduce Ftotal, which describes the amount of time required to complete the whole pipeline from the start\nto the end. With additional variables, linear programming formulations for non-linear topology can be expressed as\nfollows:\nLP 3 – Minimize cost subject to time constraint\nMinimize\nnX\ni=1\nmX\nj=1\ncijIij total cost to run pipeline,\nsubject to\nmX\nj=1\nIij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,\nFi ≤ Tmax, i ∈ {1, . . . , n} ensures time constraint is not exceeded,\nFpr +\nmX\nj=1\ntijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task.\nIij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}\nFi ≥ 0, i ∈ {1, . . . , n}\nLP 4 – Minimize time subject to cost constraint\nMinimize Ftotal total time to run pipeline,\nsubject to\nmX\nj=1\nIij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,\nnX\ni=1\nmX\nj=1\ncijIij ≤ Cmax ensures cost constraint is not exceeded.\nFpr +\nmX\nj=1\ntijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task,\nFi ≤ Ftotal, i ∈ {1, . . . , n} ensures all pipeline tasks are completed.\nIij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}\nFtotal ≥ 0, Fi ≥ 0, i ∈ {1, . . . , n}\n3 Application: optimizing genome data pre-processing pipeline\n3.1 Pipeline and data\nWe leveraged LP problems formulated in the previous section to optimize UnmappedBamToAlignedBam workflow\nwhich implements the data pre-processing step of the Broad Institute Whole Genome Germline Single Sample\nPipeline [15]. The version of the pipeline adapted to Azure is available in [16]. The UnmappedBamToAlignedBam\nworkflow graph is presented in Figure 4: it has 13 tasks total, 5 of which are distributed. The list of immediate\npredecessors for each task is presented in Listing S4.\nThe pipeline employs different sharding strategies for distributed tasks. For SamToFastqAndBwaMemAndMba (labeled\nas BW A on the pipeline graph), CollectQualityYieldMetrics, and CollectUnsortedReadgroupBamQualityMetrics, one\nshard is created for each input file with no guarantees that the shards will be of equal size. For BaseRecalibrator\nand ApplyBQSR (labeled as BR and BQSR respectively), an attempt is made to split the input bam file into equal\npieces with sequencing groupings. SamToFastqAndBwaMemAndMba is a compute-intensive task, and its performance\nsignificantly affects the performance of the entire pipeline. We use it to observe how the shard-per-file strategy affects\npipeline performance in terms of cost and run duration.\nWe selected a subset of the 1000 Genomes Project dataset [17, 18]: 49 samples in total, each with 30x coverage. Of\nthese, 46 samples have 12 unmapped bam files per sample, while the remaining 3 samples have 8, 13, and 14 files,\n5\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nTable 4: Subset of samples used to collect pipeline execution statistics\nSample Number of files File size statistics (in GB) Total size (in GB)\nmean stdev min max\nNA19017 12 4.13 0.19 3.95 4.45 49.6\nHG03436 12 3.77 1.17 2.88 5.41 45.19\nNA06984 12 3.64 0.36 3.13 4.04 43.64\nHG01982 12 3.62 0.68 2.95 4.65 43.45\nNA20764 12 3.6 0.44 2.85 3.92 43.14\nFigure 4: Pipeline graph for UnmappedBamToAlignedBam workflow. Tasks ’sink’ and ’source’ are introduced for\nimplementation purposes to mark entry and exit points of the pipeline.\nrespectively. Detailed sample statistics are presented in Table S1. To collect execution statistics, we randomly selected\n5 samples out of the 46 samples with 12 files, as shown in Table 4. Genome assembly GRCh38 was used as a reference.\n3.2 Virtual machine assignments and execution statistics collection\nWe executed the UnmappedBamToAlignedBam workflow on Azure using a modified version of CromwellOn-\nAzure v3.0.0 [12]. The modifications were made to label each run with the sample and experiment name to simplify\nexecution statistics collection and processing. No other functionality was changed.\nThree experiments were defined: default VM assignment, Dv2 family, and Ddv4 family. By default, CromwellOn-\nAzure assigns the cheapest VM per hour that satisfies the task’s compute requirements, such as disk, memory, or number\nof CPUs. The list of defaults was saved and then passed as an additional input file for the pipeline run to guarantee\nthe same defaults were used for all samples. For Dv2 and Ddv4 family experiments, we created VM assignment lists\nmanually based on the tasks’ compute requirements. These lists were also passed as additional input files for the\ncorresponding pipeline runs. All VM assignment lists are available in the Supplementary materials, see Listings S1–S3.\nWe were inspired by Intel’s benchmarking experiment [ 19] when choosing the VM families for our experiments.\nHowever, both Intel’s experiment and our experiments were conducted about two years ago. Therefore, the results\npresented in the next sections should not be considered as an optimal choice recommendation. The Azure VM offering\nhas changed since then, with some families being deprecated and several new families being added. As a result, we\nhighly recommend evaluating the currently available VM families to find up-to-date optimal compute resources.\nTo collect execution statistics we ran the pipeline three times per each sample defined in Table 4 and each experiment\ndefined above, resulting in 45 runs in total. Then, for each task-experiment pair, we computed the average cost and\nexecution time over runs and samples (see Table S2). The average cost and time to complete whole pipeline are\npresented in Table 5. We can observe that simple switch from default VM assignment to Ddv4 family decreases time\nto complete the pipeline twice with a small increase in cost. We also computed the lower and upper bounds for cost and\ntime that can be achieved by mixing VMs from the three experiments:\n$ 2.71 ≤ Cost to complete pipeline ≤ $ 4.23,\n6.45 hr ≤ Time to complete pipeline ≤ 13.91 hr.\n6\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nTable 5: Cost and time to complete pipeline for each experiment.\nExperiment Cost (in $) Time (in hr)\ndefault 3.26 13.88\nDv2 3.55 8.23\nDdv4 3.57 6.55\nTable 6: Cost constraints and corresponding minimal times to complete pipeline.\nCmax (in $) Tmin (in hr)\n2.71 6.82\n2.86 6.53\n3.01 6.53\n3.17 6.53\n3.32 6.53\n3.47 6.45\n3.62 6.45\n3.77 6.45\n3.93 6.45\n4.08 6.45\n4.23 6.45\nTable 7: Recommended VM assignment when minimizing time with cost constraint Cmax = 2.71\nTask Recommended Recommended\nexperiment VM\nApplyBQSR Dv2 Standard_D3_v2\nBaseRecalibrator Ddv4 Standard_D4d_v4\nCheckContamination Ddv4 Standard_D4d_v4\nCollectQualityYieldMetrics Ddv4 Standard_D2d_v4\nCollectUnsortedReadgroupBamQualityMetrics Ddv4 Standard_D2d_v4\nCreateSequenceGroupingTSV Ddv4 Standard_D2d_v4\nCrossCheckFingerprints Ddv4 Standard_D2d_v4\nGatherBamFiles Ddv4 Standard_D4d_v4\nGatherBqsrReports Ddv4 Standard_D2d_v4\nMarkDuplicates Ddv4 Standard_D8d_v4\nSamToFastqAndBwaMemAndMba default Standard_F16s_v2\nSortSampleBam Ddv4 Standard_D8d_v4\nSumFloats Ddv4 Standard_D2d_v4\n3.3 Implementing and solving LP problems\nWe used PuLP [20], a Python-based modeling language, to implement problems LP 3 and LP 4 (see Listings S5 and\nS6). PuLP can call a variety of open-source and commercial linear programming solvers, we used COIN-OR CBC [21],\nwhich is bundled with the PuLP package.\nWe selected UC 2 from the two use cases described in the introduction, which involves minimizing the time to complete\nthe pipeline with a restriction on cost. As shown above, the cost lower and upper bounds are $ 2.71 and $ 4.23\nrespectively. We took several points from the cost range\n[2.71, 4.23] and used each of them as the cost constraint\nCmax, thus generating and solving several time minimization LP problems. The time matrix T and cost matrix C\nwere obtained from the execution statistics presented in Table S2, pipeline graph was defined in JSON format (see\nListing S4).\nTable 6 displays the values used as cost constraints Cmax and minimum times Tmin to complete the pipeline corre-\nsponding to these constraints. Tmin is the minimum of the objective function for problem LP 4 with cost constraint\nCmax. The results presented demonstrate the cost-time trade-off that we described in the introduction. For instance, if\nwe are willing to spend $ 4.23 on each pipeline run, then the minimal time to complete the pipeline is 6.45 hours. On\n7\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nthe other hand, if we do not want to spend more than $ 3.3 on each run, then we need a minimum of 6.53 hours. Table 6\nalso allows us to make a choice based on cost-time requirements specific to a research project. We decided to proceed\nwith Cmax = 2.71 and Tmin = 6.82. Solution of LP problem with Cmax = 2.71 is presented in Table 7.\n3.4 Using recommended VM assignment to process full set of samples\nWe analyzed the full set of 49 samples using the compute resources recommended in Table 7. One run was performed\nfor each sample, and for the majority of samples, the pipeline run was completed within (or very close to) the cost-time\nboundaries defined by Cost ≤ $ 2.71 and T ime ≤ 6.82 hr (see the bottom left section of Figure 5). Sample NA12872\nis not present on the figure, as the run for this sample failed due to its large size and lack of disk space for processing a\nfile by one of the non-distributed tasks. Samples that appear on the right section of Figure 5 are characterized by a\nlarger standard deviation of sample file sizes, a large total sample size, or a combination of both (see Table 8).\nSamples HG03436, NA20869, and NA12340 illustrate how the duration of the pipeline run is affected by the variability\nin sample file sizes. As previously mentioned, the task SamToFastqAndBwaMemAndMba is a distributed task where one\nshard is created for each input file, and the same type of VM is assigned to all shards of the task. For sample NA12340,\ninput files of size 1.93 GB and 11.65 GB are processed in parallel on two VMs of the same type (Standard_F16s_v2).\nSince larger files require more processing time, variability in file sizes leads to variability in shard processing times.\nThe task SamToFastqAndBwaMemAndMba is completed when the longest shard is completed, and for NA12340, it\ntakes longer compared to evenly distributed samples. At the same time, the total amount of processed data for NA12340\n(46.15 GB) is similar to that of other samples, so we do not observe an increase in the cost of the pipeline run. Overall,\nthe uneven distribution of samples across input files shifts samples to the right on the Time vs. Cost graph, whilst an\nincrease in total sample size affects both cost and time to complete the pipeline, and shifts samples to the top right (see\nsample HG01565).\nFigure 5: Cost and time to complete data pre-processing pipeline with recommended VM assignment, each point\ncorresponds to one sample.\n4 Discussion\nWhile our assumptions such as average execution statistics, aggregated shards, and multiple-choice problem formulation\nhelp simplify the LP problem and make it more accessible to non-experts, they also introduce limitations to the approach\ndescribed above. In this section, we discuss several enhancements that can be implemented and the scenarios in which\nthey might be useful.\nDealing with distributed tasks As demonstrated earlier, combining shards into a single node on the pipeline graph\nmay not be effective when there is an uneven distribution of inputs across shards. If this scenario is anticipated, an\nalternative approach to consider is to represent each shard with a separate node and use a non-linear pipeline topology.\n8\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nTable 8: Top 15 samples by size; for samples in bold cost or time to complete pipeline exceeded limits Cost ≤ $ 2.71\nand T ime ≤ 6.82 hr\nSample Number of files File size statistics (in GB) Total size (in GB)\nmean stdev min max\n1 NA12872 12 4.6 0.97 3.21 5.56 55.25\n2 HG03006 12 4.34 2.14 2.13 7.2 52.07\n3 HG01565 12 4.32 0.27 3.85 4.63 51.81\n4 HG02855 12 4.19 0.44 3.66 4.8 50.25\n5 NA19017 12 4.13 0.19 3.95 4.45 49.6\n6 NA20869 12 4.05 2.05 2.43 6.89 48.6\n7 NA19035 12 3.89 0.58 3.13 4.67 46.73\n8 HG03727 12 3.85 0.14 3.66 4.15 46.24\n9 NA12340 14 3.3 3.21 1.93 11.65 46.15\n10 HG01583 12 3.84 0.15 3.67 4.08 46.08\n11 HG03742 12 3.81 0.23 3.57 4.16 45.71\n12 NA20502 12 3.77 0.45 3.35 4.42 45.3\n13 NA20509 12 3.77 0.15 3.57 4.03 45.24\n14 NA18995 12 3.77 0.11 3.61 3.91 45.22\n15 HG03436 12 3.77 1.17 2.88 5.41 45.19\nCloud compute resources are limited We made the assumption that VMs used for experiments are available in any\nquantity when we wrote multiple-choice problems and formulated LP problems. This assumption might be acceptable\nif samples are analyzed in small quantities in sequential order. For instance, if a few new samples come every 8 hours,\nthey’re sent for analysis right away and it’s completed before the next batch comes. In this scenario, resource capacity\nissues might be prevented with careful evaluation of compute requirements, Cloud quotas and competing workloads. If\nthe capacity limits for some virtual machines are reached when analyzing one sample and changing the quota is not\npossible, then a capacity constraint must be added to the LP problems.\nWhen analyzing samples at a large scale, capacity limits may be easily reached by multiple distributed tasks with\nhundreds or thousands of shards running in parallel. For instance, one HaplotypeCaller task from Whole Genome\nGermline Single Sample Pipeline requires 50 shards for a 30x coverage sample. In this case, it might be beneficial to\noptimize resource utilization in general and consider a resource allocation and scheduling problems that are well-known\nin Cloud computing. We recommend reviewing the current state of research in this field and carefully evaluating the\ncomplexity of the solution that your project requires.\nIs execution statistics required? We assumed that the researcher is willing to collect execution statistics before\noptimizing the pipeline, which might be impossible or cost-prohibitive in some cases. Additionally, we used average\ncost and execution time values to solve LP problems, which resulted in the loss of information about individual sample\nsize, task cost, and execution time distributions. We also did not take into account the input size of incoming sample to\nprocess.\nIf we can accurately predict task execution time and cost for each sample, then VM assignment recommendation can\nbe done individually as well. When historical execution data is available, various machine learning techniques can be\nused to predict task runtime for new samples based on their size, workflow structure, etc. [22, 23, 24, 25, 26] However,\napproaches relying on historical data cannot be applied to workflows without any historical traces or workflows with\nchanged parameters and configurations on any kind of cluster. In such cases, methods employing microbenchmarks for\npredicting the runtimes of tasks before they are executed might be very useful [27].\n5 Conclusion\nIn this paper, we present a linear programming method that can help bioinformatics researchers optimize the execution\nof scientific pipelines for either cost or time. Our method is designed to work with all scientific pipelines that have a\ndirected acyclic graph topology. The implementation we present uses execution data from representative samples to\ndetermine the optimal compute needed for the pipeline execution. The same methodology might be used with individual\ntask runtimes predicted on historical data or microbenchmarks. In the future, we want to explore the limitation of\ncompute resources and capacity when dealing with competing interests, such as running competing pipelines or scaling\na single pipeline at thousands of parallel samples.\n9\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nReferences\n[1] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, C. Notredame, Nextflow enables reproducible\ncomputational workflows, Nat Biotechnol 35 (4) (2017) 316–319. doi:10.1038/nbt.3820.\n[2] Nextflow documentation: Running on azure, https://www.nextflow.io/docs/edge/azure.html, [Online;\naccessed: January 25, 2024].\n[3] K. V oss, G. V . D. Auwera, J. Gentry, Full-stack genomics pipelining with gatk4 + wdl + cromwell,\nhttps://f1000research.com/slides/6-1381, [version 1; not peer reviewed] (2017). doi:10.7490/\nf1000research.1114634.1.\n[4] Cromwell documentation: Azure backend, https://cromwell.readthedocs.io/en/stable/backends/\nAzure, [Online; accessed: January 25, 2024].\n[5] F. Mölder, K. P. Jablonski, B. Letcher, et al., Sustainable data analysis with snakemake, F1000Research 10 (2021)\n33, [version 1; peer review: 1 approved, 1 approved with reservations].doi:10.12688/f1000research.29032.\n1.\n[6] Snakemake documentation: Executing a snakemake workflow via azure batch,\nhttps://snakemake.readthedocs.io/en/v7.31.1/executing/cloud.html#\nexecuting-a-snakemake-workflow-via-azure-batch , [Online; accessed: January 25, 2024].\n[7] S. Srivastava, I. Banicescu, Scheduling in parallel and distributed computing systems, in: Topics in Par-\nallel and Distributed Computing, Springer International Publishing, 2018, pp. 313–337.\ndoi:10.1007/\n978-3-319-93109-8_11 .\n[8] V . Vinothina, R. Sridaran, P. Ganapathi, A survey on resource allocation strategies in cloud computing, Int. J. Adv.\nComput. Sci. Appl. 3 (6) (2012) 97–104.\n[9] S. H. H. Madni, M. S. A. Latiff, Y . Coulibaly, S. M. Abdulhamid, Recent advancements in resource allocation\ntechniques for cloud computing environment: a systematic review, Cluster Computing 20 (2017) 2489–2533.\n[10] K. Saidi, D. Bardou, Task scheduling and vm placement to resource allocation in cloud computing: challenges\nand opportunities, Cluster Computing 26 (2023) 3069–3087.\n[11] E. N. Alkhanak, S. P. Lee, R. Rezaei, R. M. Parizi, Cost optimization approaches for scientific workflow scheduling\nin cloud and grid computing: A review, classifications, and open issues, Journal of Systems and Software 113\n(2016) 1–26. doi:https://doi.org/10.1016/j.jss.2015.11.023.\n[12] https://github.com/microsoft/CromwellOnAzure, [Online; accessed: January 18, 2024].\n[13] T. Hu, A. Kahng, Linear and Integer Programming Made Easy, Springer International Publishing Switzerland,\n2016.\n[14] H. Kellerer, U. Pferschy, D. Pisinger, Knapsack Problems, Springer Berlin Heidelberg, 2004.\n[15] D. Caetano-Anolles, Data pre-processing for variant discovery, https://gatk.broadinstitute.org/hc/\nen-us/articles/360035535912 (2020).\n[16] https://github.com/microsoft/gatk4-genome-processing-pipeline-azure , [Online; accessed: Jan-\nuary 18, 2024].\n[17] Overview of the 1000 genomes project, https://www.internationalgenome.org/\n1000-genomes-summary, [Online; accessed: January 18, 2024].\n[18] Azure open datasets: genomics data lake – 1000 genomes, https://learn.microsoft.com/en-us/azure/\nopen-datasets/dataset-1000-genomes, [Online; accessed: January 18, 2024].\n[19] R. Pruitt, M. Powers, J. Chia, P. Sebastian, K. Mannthey, Evaluating genomics pipelines on azure: Intel-based\nvirtual machines, https://techcommunity.microsoft.com/t5/azure-high-performance-computing/\nevaluating-genomics-pipelines-on-azure-intel-based-virtual/ba-p/2824608 (2021).\n[20] https://pypi.org/project/PuLP, [Online; accessed: January 18, 2024].\n[21] https://github.com/coin-or/Cbc, [Online; accessed: January 18, 2024].\n[22] S. M. Sadjadi, S. Shimizu, J. Figueroa, R. Rangaswami, J. Delgado, H. A. Duran-Limon, X. J. Collazo-Mojica, A\nmodeling approach for estimating execution time of long-running scientific applications, 2008 IEEE International\nSymposium on Parallel and Distributed Processing (2008) 1–8.\n[23] R. F. da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, M. Livny, Toward fine-grained\nonline task characteristics estimation in scientific workflows, Proceedings of the 8th Workshop on Workflows in\nSupport of Large-Scale Science (2013) 58–67.\n10\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\n[24] R. F. da Silva, G. Juve, M. Rynge, E. Deelman, M. Livny, Online task resource consumption prediction for\nscientific workflows, Parallel Process. Lett. 25 (2015) 1541003:1–1541003:25.\n[25] F. Nadeem, D. M. Alghazzawi, A. S. Mashat, K. A. Fakeeh, A. S. A.-M. Al-Ghamdi, H. Hagras, Modeling and\npredicting execution time of scientific workflows in the grid using radial basis function neural network, Cluster\nComputing 20 (2017) 2805–2819.\n[26] M. J. Rosa, C. G. Ralha, M. Holanda, A. P. Araujo, Computational resource and cost prediction service for\nscientific workflows in federated clouds, Future Generation Computer Systems 125 (2021) 844–858.\n[27] J. Bader, F. Lehmann, L. Thamsen, U. Leser, O. Kao, Lotaru: Locally predicting workflow task runtimes\nfor resource management on heterogeneous infrastructures, Future Generation Computer Systems 150 (2024)\n171–185.\n11\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nSupplementary materials\nCompute resources used for experiments\n1 {\n2 \"CollectQualityYieldMetrics\": \"Standard_D2_v3\",\n3 \"SamToFastqAndBwaMemAndMba\": \"Standard_F16s_v2\",\n4 \"CreateSequenceGroupingTSV\": \"Standard_A2_v2\",\n5 \"CollectUnsortedReadgroupBamQualityMetrics\": \"Standard_D2_v3\",\n6 \"SumFloats\": \"Standard_A2_v2\",\n7 \"MarkDuplicates\": \"Standard_L4s\",\n8 \"SortSampleBam\": \"Standard_L4s\",\n9 \"CrossCheckFingerprints\": \"Standard_A2\",\n10 \"BaseRecalibrator\": \"Standard_D2_v2\",\n11 \"CheckContamination\": \"Standard_D11_v2\",\n12 \"GatherBqsrReports\": \"Standard_A2_v2\",\n13 \"ApplyBQSR\": \"Standard_A3\",\n14 \"GatherBamFiles\": \"Standard_A2\",\n15 }\nListing S1: Default virtual machine assignment made by CromwellOnAzure\n1 {\n2 \"CollectQualityYieldMetrics\": \"Standard_D2_v2\",\n3 \"SamToFastqAndBwaMemAndMba\": \"Standard_D5_v2\",\n4 \"CreateSequenceGroupingTSV\": \"Standard_D2_v2\",\n5 \"CollectUnsortedReadgroupBamQualityMetrics\": \"Standard_D2_v2\",\n6 \"SumFloats\": \"Standard_D2_v2\",\n7 \"MarkDuplicates\": \"Standard_D4_v2\",\n8 \"SortSampleBam\": \"Standard_D4_v2\",\n9 \"CrossCheckFingerprints\": \"Standard_D2_v2\",\n10 \"BaseRecalibrator\": \"Standard_D2_v2\",\n11 \"CheckContamination\": \"Standard_D3_v2\",\n12 \"GatherBqsrReports\": \"Standard_D2_v2\",\n13 \"ApplyBQSR\": \"Standard_D3_v2\",\n14 \"GatherBamFiles\": \"Standard_D3_v2\",\n15 }\nListing S2: Virtual machine assignment for Dv2 family experiment\n12\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\n1 {\n2 \"CollectQualityYieldMetrics\": \"Standard_D2d_v4\",\n3 \"SamToFastqAndBwaMemAndMba\": \"Standard_D16d_v4\",\n4 \"CreateSequenceGroupingTSV.vm_size\": \"Standard_D2d_v4\",\n5 \"CollectUnsortedReadgroupBamQualityMetrics\": \"Standard_D2d_v4\",\n6 \"SumFloats\": \"Standard_D2d_v4\",\n7 \"MarkDuplicates\": \"Standard_D8d_v4\",\n8 \"SortSampleBam\": \"Standard_D8d_v4\",\n9 \"CrossCheckFingerprints\": \"Standard_D2d_v4\",\n10 \"BaseRecalibrator\": \"Standard_D4d_v4\",\n11 \"CheckContamination\": \"Standard_D4d_v4\",\n12 \"GatherBqsrReports\": \"Standard_D2d_v4\",\n13 \"ApplyBQSR\": \"Standard_D8d_v4\",\n14 \"GatherBamFiles\": \"Standard_D4d_v4\",\n15 }\nListing S3: Virtual machine assignment for Ddv4 family experiment\nUnmappedBamToAlignedBam workflow graph\n1 {\n2 \"CollectQualityYieldMetrics\": [\"source\"],\n3 \"SamToFastqAndBwaMemAndMba\": [\"source\"],\n4 \"CreateSequenceGroupingTSV\": [\"source\"],\n5 \"CollectUnsortedReadgroupBamQualityMetrics\": [\"SamToFastqAndBwaMemAndMba\"],\n6 \"SumFloats\": [\"SamToFastqAndBwaMemAndMba\"],\n7 \"MarkDuplicates\": [\"SumFloats\"],\n8 \"SortSampleBam\": [\"MarkDuplicates\"],\n9 \"CrossCheckFingerprints\": [\"SortSampleBam\"],\n10 \"BaseRecalibrator\": [\"SortSampleBam\", \"CreateSequenceGroupingTSV\"],\n11 \"CheckContamination\": [\"SortSampleBam\"],\n12 \"GatherBqsrReports\": [\"BaseRecalibrator\"],\n13 \"ApplyBQSR\": [\"GatherBqsrReports\"],\n14 \"GatherBamFiles\": [\"ApplyBQSR\"] ,\n15 \"sink\":\n16 [\"GatherBamFiles\",\n17 \"CheckContamination\",\n18 \"CrossCheckFingerprints\",\n19 \"CollectUnsortedReadgroupBamQualityMetrics\",\n20 \"CollectQualityYieldMetrics\"]\n21 }\nListing S4: Immediate predecessors list for each task. Tasks ’sink’ and ’source’ are introduced for implementation\npurposes to mark entry and exit points of the pipeline.\n13\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nImplementation of LP problems for non-linear pipeline topology using PuLP\nBelow we assume that pipeline is an instance of a to-be-implemented Pipeline class with the following properties:\n• tasks – list of task names,\n• experiments – list of experiment names,\n• costs – cost matrix C in the following dictionary format\n{'Task_1': {'Experiment_1': c_11, 'Experiment_2': c_12},\n'Task_2': {'Experiment_1': c_21, 'Experiment_2': c_22}}\n• times – time matrix T in the same dictionary format as presented above for cost,\n• graph – dictionary that maps task name to a list of its immediate predecessors.\n1 from pulp import *\n2\n3 def cost_minimization_problem(pipeline, resource_limit, model_name):\n4 tasks = pipeline.tasks\n5 experiments = pipeline.experiments\n6 costs = pipeline.costs\n7 times = pipeline.times\n8 graph = pipeline.graph\n9\n10 choices = [(t,e) for t in tasks for e in experiments]\n11\n12 # Create model to contain the problem data\n13 model = LpProblem(model_name, LpMinimize)\n14\n15 # Create indicator variables corresponding to I_ij in LP formulation\n16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary)\n17\n18 # The objective function is added to model first\n19 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices]), \"total_cost\")\n20\n21 # These constraints guarantee that only one VM is selected per task\n22 for t in tasks:\n23 model += (lpSum([ivars[t][e] for e in experiments]) == 1, \"limit_for_task_%s\"%t)\n24\n25 if resource_limit != None:\n26 # Create additional variables F_i (finishing time for Task_i)\n27 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous')\n28 # These constraints ensure that predecessors are completed prior to the task\n29 for t in tasks:\n30 for pr in graph[t]:\n31 if pr == 'source':\n32 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments])\n33 - fvars[t] <= 0, \"finish_time_task_%s_%s\"%(t, pr))\n34 else:\n35 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments])\n36 - fvars[t] <= 0, \"finish_time_task_%s_%s\"%(t, pr))\n37\n38 # Main resource constraint ensures that alotted time is not exceeded\n39 for t in tasks:\n40 model += (fvars[t] <= resource_limit, \"main_resource_constraint_%s\"%t)\n41\n42 return model\nListing S5: PuLP model for cost minimization problem\n14\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\n1 from pulp import *\n2\n3 def time_minimization_problem(pipeline, resource_limit, model_name):\n4 tasks = pipeline.tasks\n5 experiments = pipeline.configs\n6 costs = pipeline.costs\n7 times = pipeline.times\n8 graph = pipeline.graph\n9\n10 choices = [(t,e) for t in tasks for e in experiments]\n11\n12 # Create model to contain the problem data\n13 model = LpProblem(model_name, LpMinimize)\n14\n15 # Create indicator variables corresponding to I_ij in LP formulation\n16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary)\n17 # Create additional variables F_i (finishing time for Task_i)\n18 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous')\n19 # Create additional variable F_max (finishing time for full pipeline)\n20 dvar = LpVariable('F_max', 0, None, 'Continuous')\n21\n22 # The objective function is added to model first\n23 model += (lpSum(dvar), \"total_time\")\n24\n25 # These constraints guarantee that only one VM is selected per task\n26 for t in tasks:\n27 model += (lpSum([ivars[t][e] for e in experiments]) == 1, \"limit_for_task_%s\"%t)\n28\n29 # These constraints ensure that predecessors are completed prior to the task\n30 for t in tasks:\n31 for pr in graph[t]:\n32 if pr == 'source':\n33 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments])\n34 - fvars[t] <= 0, \"finish_time_task_%s_%s\"%(t, pr))\n35 else:\n36 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments])\n37 - fvars[t] <= 0, \"finish_time_task_%s_%s\"%(t, pr))\n38\n39 # These constraints ensure that all tasks are completed\n40 for t in tasks:\n41 model += (fvars[t] <= dvar, \"max_replacement_task_%s\"%t)\n42\n43 # Main resource constraint ensures that allowed cost is not exceeded\n44 if resource_limit != None:\n45 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices])\n46 <= resource_limit, \"main_resource_constraint\")\n47\n48 return model\nListing S6: PuLP model for time minimization problem\n15\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nList of samples and execution statistics\nTable S1: Full list of samples analyzed by data pre-processing pipeline, ordered by total sample size. Samples in bold\nwere used to collect pipeline execution statistics presented in Table S2\nSample Number of files File size statistics (in GB) Total size (in GB)\nmean stdev min max\n1 NA12872 12 4.6 0.97 3.21 5.56 55.25\n2 HG03006 12 4.34 2.14 2.13 7.2 52.07\n3 HG01565 12 4.32 0.27 3.85 4.63 51.81\n4 HG02855 12 4.19 0.44 3.66 4.8 50.25\n5 NA19017 12 4.13 0.19 3.95 4.45 49.6\n6 NA20869 12 4.05 2.05 2.43 6.89 48.6\n7 NA19035 12 3.89 0.58 3.13 4.67 46.73\n8 HG03727 12 3.85 0.14 3.66 4.15 46.24\n9 NA12340 14 3.3 3.21 1.93 11.65 46.15\n10 HG01583 12 3.84 0.15 3.67 4.08 46.08\n11 HG03742 12 3.81 0.23 3.57 4.16 45.71\n12 NA20502 12 3.77 0.45 3.35 4.42 45.3\n13 NA20509 12 3.77 0.15 3.57 4.03 45.24\n14 NA18995 12 3.77 0.11 3.61 3.91 45.22\n15 HG03436 12 3.77 1.17 2.88 5.41 45.19\n16 HG01595 12 3.76 0.3 3.3 4.04 45.17\n17 HG03756 12 3.75 0.27 3.4 4.22 45.04\n18 HG03449 12 3.72 0.17 3.48 3.89 44.68\n19 HG03100 12 3.69 0.24 3.35 3.92 44.24\n20 HG03722 12 3.68 0.46 3.03 4.18 44.19\n21 NA11894 12 3.66 0.19 3.49 3.92 43.93\n22 HG00096 12 3.66 0.24 3.3 3.88 43.87\n23 HG00419 12 3.66 0.16 3.47 3.91 43.87\n24 HG00410 12 3.64 0.2 3.26 3.85 43.71\n25 NA06984 12 3.64 0.36 3.13 4.04 43.64\n26 NA12878 12 3.63 0.12 3.47 3.79 43.61\n27 NA10847 12 3.63 1 2.72 4.99 43.59\n28 HG01958 12 3.62 0.17 3.45 3.88 43.48\n29 HG01982 12 3.62 0.68 2.95 4.65 43.45\n30 NA19648 12 3.62 0.28 3.17 3.93 43.38\n31 HG01879 12 3.59 0.26 3.36 3.98 43.14\n32 NA20764 12 3.6 0.44 2.85 3.92 43.14\n33 HG02642 12 3.59 0.25 3.19 3.9 43.09\n34 NA12489 12 3.58 0.11 3.43 3.74 42.94\n35 NA20895 13 3.28 1.39 2.07 6.36 42.69\n36 HG03052 12 3.53 0.18 3.23 3.75 42.38\n37 HG00759 12 3.51 0.27 3.11 3.79 42.16\n38 HG01112 12 3.51 0.2 3.23 3.82 42.16\n39 NA20321 12 3.49 0.1 3.37 3.67 41.92\n40 NA20752 12 3.49 0.08 3.38 3.59 41.91\n41 NA20126 12 3.49 0.52 2.75 3.9 41.86\n42 HG01500 12 3.48 0.52 2.91 4.22 41.77\n43 NA19661 8 5.21 0.59 4.62 5.83 41.68\n44 NA19102 12 3.46 0.49 2.75 3.93 41.55\n45 HG00268 12 3.46 0.17 3.24 3.77 41.54\n46 NA18956 12 3.36 0.23 3.14 3.75 40.35\n47 HG02085 12 3.31 0.33 2.93 3.74 39.75\n48 HG01393 12 3.28 0.46 2.84 3.98 39.38\n49 HG02069 12 3.18 0.31 2.82 3.61 38.2\n16\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint \n\nOptimizing genomics pipeline execution with integer linear programming A P REPRINT\nTable S2: Pipeline execution statistics\nTask Experiment Cost (in $) Time (in hr)\nApplyBQSR Ddv4 0.591 0.459\nApplyBQSR Dv2 0.524 0.757\nApplyBQSR default 0.903 1.39\nBaseRecalibrator Ddv4 0.338 0.531\nBaseRecalibrator Dv2 0.356 1.19\nBaseRecalibrator default 0.37 1.19\nCheckContamination Ddv4 0.0302 0.668\nCheckContamination Dv2 0.043 0.769\nCheckContamination default 0.0425 1.15\nCollectQualityYieldMetrics Ddv4 0.0175 0.0824\nCollectQualityYieldMetrics Dv2 0.0283 0.132\nCollectQualityYieldMetrics default 0.0227 0.19\nCollectUnsortedReadgroupBamQualityMetrics Ddv4 0.0381 0.175\nCollectUnsortedReadgroupBamQualityMetrics Dv2 0.0615 0.248\nCollectUnsortedReadgroupBamQualityMetrics default 0.0503 0.354\nCreateSequenceGroupingTSV Ddv4 0.000223 0.00986\nCreateSequenceGroupingTSV Dv2 0.000314 0.0108\nCreateSequenceGroupingTSV default 0.000358 0.0199\nCrossCheckFingerprints Ddv4 0.00509 0.225\nCrossCheckFingerprints Dv2 0.00858 0.296\nCrossCheckFingerprints default 0.0126 0.523\nGatherBamFiles Ddv4 0.0272 0.602\nGatherBamFiles Dv2 0.0368 0.659\nGatherBamFiles default 0.117 4.86\nGatherBqsrReports Ddv4 0.000737 0.0326\nGatherBqsrReports Dv2 0.00133 0.0459\nGatherBqsrReports default 0.0022 0.122\nMarkDuplicates Ddv4 0.194 2.15\nMarkDuplicates Dv2 0.28 2.5\nMarkDuplicates default 0.202 2.93\nSamToFastqAndBwaMemAndMba Ddv4 2.17 1.03\nSamToFastqAndBwaMemAndMba Dv2 1.97 0.931\nSamToFastqAndBwaMemAndMba default 1.37 1.01\nSortSampleBam Ddv4 0.157 1.74\nSortSampleBam Dv2 0.24 2.14\nSortSampleBam default 0.163 2.37\nSumFloats Ddv4 0.000225 0.00998\nSumFloats Dv2 0.000261 0.009\nSumFloats default 0.000459 0.0255\n17\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint","source_license":"CC-BY-4.0","license_restricted":false}