Abstract
In the field of genomics, bioinformatics pipelines play a crucial role in processing and analyzing
vast biological datasets. These pipelines, consisting of interconnected tasks, can be optimized for
efficiency and scalability by leveraging cloud platforms such as Microsoft Azure. The choice of
compute resources introduces a trade-off between cost and time. This paper introduces an approach
that uses Linear Programming (LP) to optimize pipeline execution. We consider optimizing two
competing cases: minimizing cost with a run duration restriction and minimizing duration with a cost
restriction. Our results showcase the utility of using LP in guiding researchers to make informed
compute decisions based on specific data sets, cost and time requirements, and resource constraints.
1 Introduction
In the realm of genomics, bioinformatics pipelines are integral to processing and analyzing vast amounts of biological
data. A pipeline, in this context, is a series of data processing steps, or tasks, that are interconnected. Each task takes
a set of inputs, processes it, and produces a set of outputs that may be used as inputs for the subsequent tasks or are
considered terminal outputs. This sequence of tasks can be distributed, or divided into smaller, more manageable pieces
or shards, to enhance efficiency and scalability. These pipelines can be run on cloud platforms like Microsoft Azure,
leveraging its robust computational resources and storage capabilities. These pipelines are typically run using some of
the following workflow managers all able to run on Azure: Nextflow [1, 2], Cromwell [3, 4], Snakemake [5, 6].
It’s important to note that the cost and duration of running these pipelines on Azure are influenced by the choice of
compute resources. More powerful resources can process data faster but at a higher cost, necessitating a balance
between speed, efficiency, and budget. Choosing the appropriate compute resources is dependent on the tools and data
storage (memory and disk) requirements of each task.
With a variety of compute resources available, there is always a trade-off between cost and time when running a
genomics pipeline. The preference for this trade-off depends on the researcher’s circumstances. If the researcher has
limited funding and is less restricted on time, they might be interested in getting results as cheaply as possible without
spending more than Tmax hours per sample. In clinical settings where time is critical, the preference might be to get
Results
as quickly as possible.
Efficient resources utilization remains a key issue in parallel and distributed computing environments. Resource
allocation and task scheduling problems are well-known in this field, and a lot of effort has been invested in improving
the management of Cloud resources [7, 8, 9, 10]. In particular, cost optimization of scientific workflows has been a
focus of attention [11]. Existing open-source workflow engines in genomics field, such as CromwellOnAzure [ 12],
Snakemake [5] or Nextflow [1], require researchers to provide a virtual machine-to-task assignment list in advance as
one of the workflow inputs or make a default assignment with no guarantees for optimal performance.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Table 1: Set of experiments defined for a given pipeline
Experiment 1 Experiment 2
T ask1 V M11 V M12
T ask2 V M21 V M22
T ask3 V M31 V M32
Table 2: Expected execution statistics: average time to complete each pipeline task
Experiment 1 Experiment 2
T ask1 t11 t12
T ask2 t21 t22
T ask3 t31 t32
Table 3: Expected execution statistics: average cost to complete each pipeline task
Experiment 1 Experiment 2
T ask1 c11 c12
T ask2 c21 c22
T ask3 c31 c32
When a researcher’s institution does not use a workflow management system that monitors workflow execution and
optimizes task scheduling, the researcher faces the challenge of optimizing a pipeline according to project requirements
while taking into account the variety of compute options and non-linear pipeline topology. In this paper, we demonstrate
that linear programming can be a useful tool for pipeline execution optimization in a bioinformatics researcher’s toolbox.
We consider two use cases for pipeline optimization:
UC 1 minimizing the cost of running the pipeline with a restriction on run duration,
UC 2 minimizing the duration of the pipeline run with a restriction on cost.
Genomics workloads often involve running a given pipeline on a large number of samples. Typically, data from these
samples have the same number of reads, length, coverage and quality level, i.e. the sample dataset is homogeneous.
Given that the dataset is homogeneous, we assume that average execution statistics collected for a subset of data are
sufficient for preliminary pipeline performance estimates and optimizations. We also assume that statistics have been
collected, and the scope of this work includes the following steps performed by the researcher: selecting a sample subset,
selecting a virtual machine subset for each pipeline task, executing the pipeline on the selected subset of samples and
virtual machines, collecting execution statistics, and computing the average cost and execution time for each pipeline
task. For example, if virtual machine assignment is described by Table 1, then corresponding execution statistics will
define matrices C = {cij} and T = {tij} presented as Tables 2 and 3.
2 Formulating linear programming problems
Maximizing or minimizing a function subject to constraints is a common problem in many fields. Linear programming
(LP) is a special case of mathematical optimization that deals with linear functions and constraints. In LP, we assume
that the function to be maximized or minimized is a linear function and that the constraints are linear equations or
linear inequalities [13]. LP has proven useful in modeling diverse types of problems in planning, routing, scheduling,
assignment, and design. In this section, we formulate LP problems for the use cases defined above.
Suppose we have a pipeline consisting of n tasks and m experiments defined by the researcher. The cost and time to
complete each task are given by matricesC = {cij} and T = {tij}, respectively, wherei = 1, . . . , n and j = 1, . . . , m.
We can formulate the following multiple-choice problems for the use cases in consideration:
UC 1 For each task, pick exactly one virtual machine out of m options in order to minimize the total cost of running
the pipeline without exceeding the allotted time Tmax.
UC 2 For each task, pick exactly one virtual machine out of m options in order to minimize the duration of the
pipeline run without exceeding the allowed cost Cmax.
2
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
We can formulate LP problems for pipeline optimization by following the approach used for the multiple-choice
knapsack problem (MCKP) [14]. To do this, we introduce indicator variables Iij, where i = 1, . . . , n, j = 1, . . . , m.
The variable Iij is equal to 1 if we choose virtual machine (VM) from Experiment j for T aski, and 0 otherwise.
2.1 Linear pipeline topology
A linear pipeline consists of a chain of tasks arranged so that the output of each element is the input of the next. The
processing of data is done in a linear and sequential manner, see example on Figure 1. In the case of a linear topology,
the total time and cost to run a pipeline can be computed as a sum of all tasks time and cost. Following MCKP approach,
the linear programming formulations can be expressed as:
LP 1 – Minimize cost subject to time constraint
Minimize
nX
i=1
mX
j=1
cijIij total cost to run pipeline,
subject to
mX
j=1
Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,
nX
i=1
mX
j=1
tijIij ≤ Tmax ensures time constraint is not exceeded.
Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}
LP 2 – Minimize time subject to cost constraint
Minimize
nX
i=1
mX
j=1
tijIij → min total time to run pipeline,
subject to
mX
j=1
Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,
nX
i=1
mX
j=1
cijIij ≤ Cmax ensures cost constraint is not exceeded.
Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}
2.2 Distributed tasks
Sharding is a technique used to split a large task into smaller, more manageable pieces. The input data for the task is
partitioned into shards based on a predetermined criterion, such as sequencing group or chromosome. Once the data
is partitioned, it is distributed across multiple nodes for processing. Each node processes a subset of the data, which
reduces the processing time and improves the overall performance of the system.
When a task is split into shards (as shown in Figure 2), we aggregate the shards and represent the task as a single vertex
on the pipeline graph for several reasons. First, existing implementations of shard functionality don’t allow different
virtual machines to be assigned to different shards, and each shard of the task is assigned the same type of virtual
machine. Second, the goal of creating shards is usually to split the task into pieces as equally as possible. Therefore,
we assume that the execution time and cost won’t be drastically different between shards. When collecting execution
statistics, we define the cost to complete the distributed task as the sum of all shards’ costs, and the time to complete the
distributed task as the maximum across shards’ run duration.
A pipeline with tasks split into shards can also be represented as a pipeline with non-linear topology (as shown in 2.3), it
will increase the complexity of LP problem. It might be interesting to preform a comparison study to better understand
the benefits from using non-linear topology versus the shard aggregation approach. This question is out of scope of this
paper, and for simplicity, we will consider the distributed task as one vertex on pipeline graph.
3
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Figure 1: Example of linear pipeline topology
Figure 2: Example of distributed task and shards’ execution statistics aggregation
Figure 3: Example of non-linear pipeline topology
2.3 Non-linear pipeline topology
The vast majority of genomics pipelines have a non-linear topology. In the non-linear case, the cost can still be computed
as a sum of all costs, but the time to complete the pipeline should be computed differently. Let’s consider the pipeline
presented in Figure 3. We can compute the time to complete it in the following way:
Ttotal = timeT ask1 + max(timeT ask2 + timeT ask3 , timeT ask4 ) + timeT ask5 + timeT ask6
Unfortunately, we cannot replace the constraint in problem LP 1 and the objective function in problem LP 2 with the
expression given above because it breaks the definition of an LP problem. However, there are tips and tricks that can
help deal with non-linear functions, such as maximum, absolute value, etc., which frequently appear in real-world
problems [13]. Usually, this is accomplished by creating additional variables.
We introduce a variable Fi for each T aski, which describes amount of time required to complete a part of the pipeline
from the start up to T aski. We also define a set of immediate predecessors for each vertex of the pipeline graph
ImP redi. For example, for T ask5 the set of immediate predecessors is ImP red5 = {T ask3, T ask4}, and two
inequalities hold F3 + timeT ask5 ≤ F5 and F4 + timeT ask5 ≤ F5. These inequalities ensure that T ask3 and T ask4
4
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
are completed prior to T ask5, and then timeT ask5 is required to complete the part of the pipeline from the start and up
to T ask5.
We also introduce Ftotal, which describes the amount of time required to complete the whole pipeline from the start
to the end. With additional variables, linear programming formulations for non-linear topology can be expressed as
follows:
LP 3 – Minimize cost subject to time constraint
Minimize
nX
i=1
mX
j=1
cijIij total cost to run pipeline,
subject to
mX
j=1
Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,
Fi ≤ Tmax, i ∈ {1, . . . , n} ensures time constraint is not exceeded,
Fpr +
mX
j=1
tijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task.
Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}
Fi ≥ 0, i ∈ {1, . . . , n}
LP 4 – Minimize time subject to cost constraint
Minimize Ftotal total time to run pipeline,
subject to
mX
j=1
Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task,
nX
i=1
mX
j=1
cijIij ≤ Cmax ensures cost constraint is not exceeded.
Fpr +
mX
j=1
tijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task,
Fi ≤ Ftotal, i ∈ {1, . . . , n} ensures all pipeline tasks are completed.
Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m}
Ftotal ≥ 0, Fi ≥ 0, i ∈ {1, . . . , n}
3 Application: optimizing genome data pre-processing pipeline
3.1 Pipeline and data
We leveraged LP problems formulated in the previous section to optimize UnmappedBamToAlignedBam workflow
which implements the data pre-processing step of the Broad Institute Whole Genome Germline Single Sample
Pipeline [15]. The version of the pipeline adapted to Azure is available in [16]. The UnmappedBamToAlignedBam
workflow graph is presented in Figure 4: it has 13 tasks total, 5 of which are distributed. The list of immediate
predecessors for each task is presented in Listing S4.
The pipeline employs different sharding strategies for distributed tasks. For SamToFastqAndBwaMemAndMba (labeled
as BW A on the pipeline graph), CollectQualityYieldMetrics, and CollectUnsortedReadgroupBamQualityMetrics, one
shard is created for each input file with no guarantees that the shards will be of equal size. For BaseRecalibrator
and ApplyBQSR (labeled as BR and BQSR respectively), an attempt is made to split the input bam file into equal
pieces with sequencing groupings. SamToFastqAndBwaMemAndMba is a compute-intensive task, and its performance
significantly affects the performance of the entire pipeline. We use it to observe how the shard-per-file strategy affects
pipeline performance in terms of cost and run duration.
We selected a subset of the 1000 Genomes Project dataset [17, 18]: 49 samples in total, each with 30x coverage. Of
these, 46 samples have 12 unmapped bam files per sample, while the remaining 3 samples have 8, 13, and 14 files,
5
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Table 4: Subset of samples used to collect pipeline execution statistics
Sample Number of files File size statistics (in GB) Total size (in GB)
mean stdev min max
NA19017 12 4.13 0.19 3.95 4.45 49.6
HG03436 12 3.77 1.17 2.88 5.41 45.19
NA06984 12 3.64 0.36 3.13 4.04 43.64
HG01982 12 3.62 0.68 2.95 4.65 43.45
NA20764 12 3.6 0.44 2.85 3.92 43.14
Figure 4: Pipeline graph for UnmappedBamToAlignedBam workflow. Tasks ’sink’ and ’source’ are introduced for
implementation purposes to mark entry and exit points of the pipeline.
respectively. Detailed sample statistics are presented in Table S1. To collect execution statistics, we randomly selected
5 samples out of the 46 samples with 12 files, as shown in Table 4. Genome assembly GRCh38 was used as a reference.
3.2 Virtual machine assignments and execution statistics collection
We executed the UnmappedBamToAlignedBam workflow on Azure using a modified version of CromwellOn-
Azure v3.0.0 [12]. The modifications were made to label each run with the sample and experiment name to simplify
execution statistics collection and processing. No other functionality was changed.
Three experiments were defined: default VM assignment, Dv2 family, and Ddv4 family. By default, CromwellOn-
Azure assigns the cheapest VM per hour that satisfies the task’s compute requirements, such as disk, memory, or number
of CPUs. The list of defaults was saved and then passed as an additional input file for the pipeline run to guarantee
the same defaults were used for all samples. For Dv2 and Ddv4 family experiments, we created VM assignment lists
manually based on the tasks’ compute requirements. These lists were also passed as additional input files for the
corresponding pipeline runs. All VM assignment lists are available in the Supplementary materials, see Listings S1–S3.
We were inspired by Intel’s benchmarking experiment [ 19] when choosing the VM families for our experiments.
However, both Intel’s experiment and our experiments were conducted about two years ago. Therefore, the results
presented in the next sections should not be considered as an optimal choice recommendation. The Azure VM offering
has changed since then, with some families being deprecated and several new families being added. As a result, we
highly recommend evaluating the currently available VM families to find up-to-date optimal compute resources.
To collect execution statistics we ran the pipeline three times per each sample defined in Table 4 and each experiment
defined above, resulting in 45 runs in total. Then, for each task-experiment pair, we computed the average cost and
execution time over runs and samples (see Table S2). The average cost and time to complete whole pipeline are
presented in Table 5. We can observe that simple switch from default VM assignment to Ddv4 family decreases time
to complete the pipeline twice with a small increase in cost. We also computed the lower and upper bounds for cost and
time that can be achieved by mixing VMs from the three experiments:
$ 2.71 ≤ Cost to complete pipeline ≤ $ 4.23,
6.45 hr ≤ Time to complete pipeline ≤ 13.91 hr.
6
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Table 5: Cost and time to complete pipeline for each experiment.
Experiment Cost (in $) Time (in hr)
default 3.26 13.88
Dv2 3.55 8.23
Ddv4 3.57 6.55
Table 6: Cost constraints and corresponding minimal times to complete pipeline.
Cmax (in $) Tmin (in hr)
2.71 6.82
2.86 6.53
3.01 6.53
3.17 6.53
3.32 6.53
3.47 6.45
3.62 6.45
3.77 6.45
3.93 6.45
4.08 6.45
4.23 6.45
Table 7: Recommended VM assignment when minimizing time with cost constraint Cmax = 2.71
Task Recommended Recommended
experiment VM
ApplyBQSR Dv2 Standard_D3_v2
BaseRecalibrator Ddv4 Standard_D4d_v4
CheckContamination Ddv4 Standard_D4d_v4
CollectQualityYieldMetrics Ddv4 Standard_D2d_v4
CollectUnsortedReadgroupBamQualityMetrics Ddv4 Standard_D2d_v4
CreateSequenceGroupingTSV Ddv4 Standard_D2d_v4
CrossCheckFingerprints Ddv4 Standard_D2d_v4
GatherBamFiles Ddv4 Standard_D4d_v4
GatherBqsrReports Ddv4 Standard_D2d_v4
MarkDuplicates Ddv4 Standard_D8d_v4
SamToFastqAndBwaMemAndMba default Standard_F16s_v2
SortSampleBam Ddv4 Standard_D8d_v4
SumFloats Ddv4 Standard_D2d_v4
3.3 Implementing and solving LP problems
We used PuLP [20], a Python-based modeling language, to implement problems LP 3 and LP 4 (see Listings S5 and
S6). PuLP can call a variety of open-source and commercial linear programming solvers, we used COIN-OR CBC [21],
which is bundled with the PuLP package.
We selected UC 2 from the two use cases described in the introduction, which involves minimizing the time to complete
the pipeline with a restriction on cost. As shown above, the cost lower and upper bounds are $ 2.71 and $ 4.23
respectively. We took several points from the cost range
[2.71, 4.23] and used each of them as the cost constraint
Cmax, thus generating and solving several time minimization LP problems. The time matrix T and cost matrix C
were obtained from the execution statistics presented in Table S2, pipeline graph was defined in JSON format (see
Listing S4).
Table 6 displays the values used as cost constraints Cmax and minimum times Tmin to complete the pipeline corre-
sponding to these constraints. Tmin is the minimum of the objective function for problem LP 4 with cost constraint
Cmax. The results presented demonstrate the cost-time trade-off that we described in the introduction. For instance, if
we are willing to spend $ 4.23 on each pipeline run, then the minimal time to complete the pipeline is 6.45 hours. On
7
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
the other hand, if we do not want to spend more than $ 3.3 on each run, then we need a minimum of 6.53 hours. Table 6
also allows us to make a choice based on cost-time requirements specific to a research project. We decided to proceed
with Cmax = 2.71 and Tmin = 6.82. Solution of LP problem with Cmax = 2.71 is presented in Table 7.
3.4 Using recommended VM assignment to process full set of samples
We analyzed the full set of 49 samples using the compute resources recommended in Table 7. One run was performed
for each sample, and for the majority of samples, the pipeline run was completed within (or very close to) the cost-time
boundaries defined by Cost ≤ $ 2.71 and T ime ≤ 6.82 hr (see the bottom left section of Figure 5). Sample NA12872
is not present on the figure, as the run for this sample failed due to its large size and lack of disk space for processing a
file by one of the non-distributed tasks. Samples that appear on the right section of Figure 5 are characterized by a
larger standard deviation of sample file sizes, a large total sample size, or a combination of both (see Table 8).
Samples HG03436, NA20869, and NA12340 illustrate how the duration of the pipeline run is affected by the variability
in sample file sizes. As previously mentioned, the task SamToFastqAndBwaMemAndMba is a distributed task where one
shard is created for each input file, and the same type of VM is assigned to all shards of the task. For sample NA12340,
input files of size 1.93 GB and 11.65 GB are processed in parallel on two VMs of the same type (Standard_F16s_v2).
Since larger files require more processing time, variability in file sizes leads to variability in shard processing times.
The task SamToFastqAndBwaMemAndMba is completed when the longest shard is completed, and for NA12340, it
takes longer compared to evenly distributed samples. At the same time, the total amount of processed data for NA12340
(46.15 GB) is similar to that of other samples, so we do not observe an increase in the cost of the pipeline run. Overall,
the uneven distribution of samples across input files shifts samples to the right on the Time vs. Cost graph, whilst an
increase in total sample size affects both cost and time to complete the pipeline, and shifts samples to the top right (see
sample HG01565).
Figure 5: Cost and time to complete data pre-processing pipeline with recommended VM assignment, each point
corresponds to one sample.
4 Discussion
While our assumptions such as average execution statistics, aggregated shards, and multiple-choice problem formulation
help simplify the LP problem and make it more accessible to non-experts, they also introduce limitations to the approach
described above. In this section, we discuss several enhancements that can be implemented and the scenarios in which
they might be useful.
Dealing with distributed tasks As demonstrated earlier, combining shards into a single node on the pipeline graph
may not be effective when there is an uneven distribution of inputs across shards. If this scenario is anticipated, an
alternative approach to consider is to represent each shard with a separate node and use a non-linear pipeline topology.
8
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Table 8: Top 15 samples by size; for samples in bold cost or time to complete pipeline exceeded limits Cost ≤ $ 2.71
and T ime ≤ 6.82 hr
Sample Number of files File size statistics (in GB) Total size (in GB)
mean stdev min max
1 NA12872 12 4.6 0.97 3.21 5.56 55.25
2 HG03006 12 4.34 2.14 2.13 7.2 52.07
3 HG01565 12 4.32 0.27 3.85 4.63 51.81
4 HG02855 12 4.19 0.44 3.66 4.8 50.25
5 NA19017 12 4.13 0.19 3.95 4.45 49.6
6 NA20869 12 4.05 2.05 2.43 6.89 48.6
7 NA19035 12 3.89 0.58 3.13 4.67 46.73
8 HG03727 12 3.85 0.14 3.66 4.15 46.24
9 NA12340 14 3.3 3.21 1.93 11.65 46.15
10 HG01583 12 3.84 0.15 3.67 4.08 46.08
11 HG03742 12 3.81 0.23 3.57 4.16 45.71
12 NA20502 12 3.77 0.45 3.35 4.42 45.3
13 NA20509 12 3.77 0.15 3.57 4.03 45.24
14 NA18995 12 3.77 0.11 3.61 3.91 45.22
15 HG03436 12 3.77 1.17 2.88 5.41 45.19
Cloud compute resources are limited We made the assumption that VMs used for experiments are available in any
quantity when we wrote multiple-choice problems and formulated LP problems. This assumption might be acceptable
if samples are analyzed in small quantities in sequential order. For instance, if a few new samples come every 8 hours,
they’re sent for analysis right away and it’s completed before the next batch comes. In this scenario, resource capacity
issues might be prevented with careful evaluation of compute requirements, Cloud quotas and competing workloads. If
the capacity limits for some virtual machines are reached when analyzing one sample and changing the quota is not
possible, then a capacity constraint must be added to the LP problems.
When analyzing samples at a large scale, capacity limits may be easily reached by multiple distributed tasks with
hundreds or thousands of shards running in parallel. For instance, one HaplotypeCaller task from Whole Genome
Germline Single Sample Pipeline requires 50 shards for a 30x coverage sample. In this case, it might be beneficial to
optimize resource utilization in general and consider a resource allocation and scheduling problems that are well-known
in Cloud computing. We recommend reviewing the current state of research in this field and carefully evaluating the
complexity of the solution that your project requires.
Is execution statistics required? We assumed that the researcher is willing to collect execution statistics before
optimizing the pipeline, which might be impossible or cost-prohibitive in some cases. Additionally, we used average
cost and execution time values to solve LP problems, which resulted in the loss of information about individual sample
size, task cost, and execution time distributions. We also did not take into account the input size of incoming sample to
process.
If we can accurately predict task execution time and cost for each sample, then VM assignment recommendation can
be done individually as well. When historical execution data is available, various machine learning techniques can be
used to predict task runtime for new samples based on their size, workflow structure, etc. [22, 23, 24, 25, 26] However,
approaches relying on historical data cannot be applied to workflows without any historical traces or workflows with
changed parameters and configurations on any kind of cluster. In such cases, methods employing microbenchmarks for
predicting the runtimes of tasks before they are executed might be very useful [27].
5 Conclusion
In this paper, we present a linear programming method that can help bioinformatics researchers optimize the execution
of scientific pipelines for either cost or time. Our method is designed to work with all scientific pipelines that have a
directed acyclic graph topology. The implementation we present uses execution data from representative samples to
determine the optimal compute needed for the pipeline execution. The same methodology might be used with individual
task runtimes predicted on historical data or microbenchmarks. In the future, we want to explore the limitation of
compute resources and capacity when dealing with competing interests, such as running competing pipelines or scaling
a single pipeline at thousands of parallel samples.
9
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
References
[1] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, C. Notredame, Nextflow enables reproducible
computational workflows, Nat Biotechnol 35 (4) (2017) 316–319. doi:10.1038/nbt.3820.
[2] Nextflow documentation: Running on azure, https://www.nextflow.io/docs/edge/azure.html, [Online;
accessed: January 25, 2024].
[3] K. V oss, G. V . D. Auwera, J. Gentry, Full-stack genomics pipelining with gatk4 + wdl + cromwell,
https://f1000research.com/slides/6-1381, [version 1; not peer reviewed] (2017). doi:10.7490/
f1000research.1114634.1.
[4] Cromwell documentation: Azure backend, https://cromwell.readthedocs.io/en/stable/backends/
Azure, [Online; accessed: January 25, 2024].
[5] F. Mölder, K. P. Jablonski, B. Letcher, et al., Sustainable data analysis with snakemake, F1000Research 10 (2021)
33, [version 1; peer review: 1 approved, 1 approved with reservations].doi:10.12688/f1000research.29032.
1.
[6] Snakemake documentation: Executing a snakemake workflow via azure batch,
https://snakemake.readthedocs.io/en/v7.31.1/executing/cloud.html#
executing-a-snakemake-workflow-via-azure-batch , [Online; accessed: January 25, 2024].
[7] S. Srivastava, I. Banicescu, Scheduling in parallel and distributed computing systems, in: Topics in Par-
allel and Distributed Computing, Springer International Publishing, 2018, pp. 313–337.
doi:10.1007/
978-3-319-93109-8_11 .
[8] V . Vinothina, R. Sridaran, P. Ganapathi, A survey on resource allocation strategies in cloud computing, Int. J. Adv.
Comput. Sci. Appl. 3 (6) (2012) 97–104.
[9] S. H. H. Madni, M. S. A. Latiff, Y . Coulibaly, S. M. Abdulhamid, Recent advancements in resource allocation
techniques for cloud computing environment: a systematic review, Cluster Computing 20 (2017) 2489–2533.
[10] K. Saidi, D. Bardou, Task scheduling and vm placement to resource allocation in cloud computing: challenges
and opportunities, Cluster Computing 26 (2023) 3069–3087.
[11] E. N. Alkhanak, S. P. Lee, R. Rezaei, R. M. Parizi, Cost optimization approaches for scientific workflow scheduling
in cloud and grid computing: A review, classifications, and open issues, Journal of Systems and Software 113
(2016) 1–26. doi:https://doi.org/10.1016/j.jss.2015.11.023.
[12] https://github.com/microsoft/CromwellOnAzure, [Online; accessed: January 18, 2024].
[13] T. Hu, A. Kahng, Linear and Integer Programming Made Easy, Springer International Publishing Switzerland,
2016.
[14] H. Kellerer, U. Pferschy, D. Pisinger, Knapsack Problems, Springer Berlin Heidelberg, 2004.
[15] D. Caetano-Anolles, Data pre-processing for variant discovery, https://gatk.broadinstitute.org/hc/
en-us/articles/360035535912 (2020).
[16] https://github.com/microsoft/gatk4-genome-processing-pipeline-azure , [Online; accessed: Jan-
uary 18, 2024].
[17] Overview of the 1000 genomes project, https://www.internationalgenome.org/
1000-genomes-summary, [Online; accessed: January 18, 2024].
[18] Azure open datasets: genomics data lake – 1000 genomes, https://learn.microsoft.com/en-us/azure/
open-datasets/dataset-1000-genomes, [Online; accessed: January 18, 2024].
[19] R. Pruitt, M. Powers, J. Chia, P. Sebastian, K. Mannthey, Evaluating genomics pipelines on azure: Intel-based
virtual machines, https://techcommunity.microsoft.com/t5/azure-high-performance-computing/
evaluating-genomics-pipelines-on-azure-intel-based-virtual/ba-p/2824608 (2021).
[20] https://pypi.org/project/PuLP, [Online; accessed: January 18, 2024].
[21] https://github.com/coin-or/Cbc, [Online; accessed: January 18, 2024].
[22] S. M. Sadjadi, S. Shimizu, J. Figueroa, R. Rangaswami, J. Delgado, H. A. Duran-Limon, X. J. Collazo-Mojica, A
modeling approach for estimating execution time of long-running scientific applications, 2008 IEEE International
Symposium on Parallel and Distributed Processing (2008) 1–8.
[23] R. F. da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, M. Livny, Toward fine-grained
online task characteristics estimation in scientific workflows, Proceedings of the 8th Workshop on Workflows in
Support of Large-Scale Science (2013) 58–67.
10
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
[24] R. F. da Silva, G. Juve, M. Rynge, E. Deelman, M. Livny, Online task resource consumption prediction for
scientific workflows, Parallel Process. Lett. 25 (2015) 1541003:1–1541003:25.
[25] F. Nadeem, D. M. Alghazzawi, A. S. Mashat, K. A. Fakeeh, A. S. A.-M. Al-Ghamdi, H. Hagras, Modeling and
predicting execution time of scientific workflows in the grid using radial basis function neural network, Cluster
Computing 20 (2017) 2805–2819.
[26] M. J. Rosa, C. G. Ralha, M. Holanda, A. P. Araujo, Computational resource and cost prediction service for
scientific workflows in federated clouds, Future Generation Computer Systems 125 (2021) 844–858.
[27] J. Bader, F. Lehmann, L. Thamsen, U. Leser, O. Kao, Lotaru: Locally predicting workflow task runtimes
for resource management on heterogeneous infrastructures, Future Generation Computer Systems 150 (2024)
171–185.
11
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Supplementary materials
Compute resources used for experiments
1 {
2 "CollectQualityYieldMetrics": "Standard_D2_v3",
3 "SamToFastqAndBwaMemAndMba": "Standard_F16s_v2",
4 "CreateSequenceGroupingTSV": "Standard_A2_v2",
5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2_v3",
6 "SumFloats": "Standard_A2_v2",
7 "MarkDuplicates": "Standard_L4s",
8 "SortSampleBam": "Standard_L4s",
9 "CrossCheckFingerprints": "Standard_A2",
10 "BaseRecalibrator": "Standard_D2_v2",
11 "CheckContamination": "Standard_D11_v2",
12 "GatherBqsrReports": "Standard_A2_v2",
13 "ApplyBQSR": "Standard_A3",
14 "GatherBamFiles": "Standard_A2",
15 }
Listing S1: Default virtual machine assignment made by CromwellOnAzure
1 {
2 "CollectQualityYieldMetrics": "Standard_D2_v2",
3 "SamToFastqAndBwaMemAndMba": "Standard_D5_v2",
4 "CreateSequenceGroupingTSV": "Standard_D2_v2",
5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2_v2",
6 "SumFloats": "Standard_D2_v2",
7 "MarkDuplicates": "Standard_D4_v2",
8 "SortSampleBam": "Standard_D4_v2",
9 "CrossCheckFingerprints": "Standard_D2_v2",
10 "BaseRecalibrator": "Standard_D2_v2",
11 "CheckContamination": "Standard_D3_v2",
12 "GatherBqsrReports": "Standard_D2_v2",
13 "ApplyBQSR": "Standard_D3_v2",
14 "GatherBamFiles": "Standard_D3_v2",
15 }
Listing S2: Virtual machine assignment for Dv2 family experiment
12
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
1 {
2 "CollectQualityYieldMetrics": "Standard_D2d_v4",
3 "SamToFastqAndBwaMemAndMba": "Standard_D16d_v4",
4 "CreateSequenceGroupingTSV.vm_size": "Standard_D2d_v4",
5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2d_v4",
6 "SumFloats": "Standard_D2d_v4",
7 "MarkDuplicates": "Standard_D8d_v4",
8 "SortSampleBam": "Standard_D8d_v4",
9 "CrossCheckFingerprints": "Standard_D2d_v4",
10 "BaseRecalibrator": "Standard_D4d_v4",
11 "CheckContamination": "Standard_D4d_v4",
12 "GatherBqsrReports": "Standard_D2d_v4",
13 "ApplyBQSR": "Standard_D8d_v4",
14 "GatherBamFiles": "Standard_D4d_v4",
15 }
Listing S3: Virtual machine assignment for Ddv4 family experiment
UnmappedBamToAlignedBam workflow graph
1 {
2 "CollectQualityYieldMetrics": ["source"],
3 "SamToFastqAndBwaMemAndMba": ["source"],
4 "CreateSequenceGroupingTSV": ["source"],
5 "CollectUnsortedReadgroupBamQualityMetrics": ["SamToFastqAndBwaMemAndMba"],
6 "SumFloats": ["SamToFastqAndBwaMemAndMba"],
7 "MarkDuplicates": ["SumFloats"],
8 "SortSampleBam": ["MarkDuplicates"],
9 "CrossCheckFingerprints": ["SortSampleBam"],
10 "BaseRecalibrator": ["SortSampleBam", "CreateSequenceGroupingTSV"],
11 "CheckContamination": ["SortSampleBam"],
12 "GatherBqsrReports": ["BaseRecalibrator"],
13 "ApplyBQSR": ["GatherBqsrReports"],
14 "GatherBamFiles": ["ApplyBQSR"] ,
15 "sink":
16 ["GatherBamFiles",
17 "CheckContamination",
18 "CrossCheckFingerprints",
19 "CollectUnsortedReadgroupBamQualityMetrics",
20 "CollectQualityYieldMetrics"]
21 }
Listing S4: Immediate predecessors list for each task. Tasks ’sink’ and ’source’ are introduced for implementation
purposes to mark entry and exit points of the pipeline.
13
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Implementation of LP problems for non-linear pipeline topology using PuLP
Below we assume that pipeline is an instance of a to-be-implemented Pipeline class with the following properties:
• tasks – list of task names,
• experiments – list of experiment names,
• costs – cost matrix C in the following dictionary format
{'Task_1': {'Experiment_1': c_11, 'Experiment_2': c_12},
'Task_2': {'Experiment_1': c_21, 'Experiment_2': c_22}}
• times – time matrix T in the same dictionary format as presented above for cost,
• graph – dictionary that maps task name to a list of its immediate predecessors.
1 from pulp import *
2
3 def cost_minimization_problem(pipeline, resource_limit, model_name):
4 tasks = pipeline.tasks
5 experiments = pipeline.experiments
6 costs = pipeline.costs
7 times = pipeline.times
8 graph = pipeline.graph
9
10 choices = [(t,e) for t in tasks for e in experiments]
11
12 # Create model to contain the problem data
13 model = LpProblem(model_name, LpMinimize)
14
15 # Create indicator variables corresponding to I_ij in LP formulation
16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary)
17
18 # The objective function is added to model first
19 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices]), "total_cost")
20
21 # These constraints guarantee that only one VM is selected per task
22 for t in tasks:
23 model += (lpSum([ivars[t][e] for e in experiments]) == 1, "limit_for_task_%s"%t)
24
25 if resource_limit != None:
26 # Create additional variables F_i (finishing time for Task_i)
27 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous')
28 # These constraints ensure that predecessors are completed prior to the task
29 for t in tasks:
30 for pr in graph[t]:
31 if pr == 'source':
32 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments])
33 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr))
34 else:
35 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments])
36 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr))
37
38 # Main resource constraint ensures that alotted time is not exceeded
39 for t in tasks:
40 model += (fvars[t] <= resource_limit, "main_resource_constraint_%s"%t)
41
42 return model
Listing S5: PuLP model for cost minimization problem
14
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
1 from pulp import *
2
3 def time_minimization_problem(pipeline, resource_limit, model_name):
4 tasks = pipeline.tasks
5 experiments = pipeline.configs
6 costs = pipeline.costs
7 times = pipeline.times
8 graph = pipeline.graph
9
10 choices = [(t,e) for t in tasks for e in experiments]
11
12 # Create model to contain the problem data
13 model = LpProblem(model_name, LpMinimize)
14
15 # Create indicator variables corresponding to I_ij in LP formulation
16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary)
17 # Create additional variables F_i (finishing time for Task_i)
18 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous')
19 # Create additional variable F_max (finishing time for full pipeline)
20 dvar = LpVariable('F_max', 0, None, 'Continuous')
21
22 # The objective function is added to model first
23 model += (lpSum(dvar), "total_time")
24
25 # These constraints guarantee that only one VM is selected per task
26 for t in tasks:
27 model += (lpSum([ivars[t][e] for e in experiments]) == 1, "limit_for_task_%s"%t)
28
29 # These constraints ensure that predecessors are completed prior to the task
30 for t in tasks:
31 for pr in graph[t]:
32 if pr == 'source':
33 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments])
34 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr))
35 else:
36 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments])
37 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr))
38
39 # These constraints ensure that all tasks are completed
40 for t in tasks:
41 model += (fvars[t] <= dvar, "max_replacement_task_%s"%t)
42
43 # Main resource constraint ensures that allowed cost is not exceeded
44 if resource_limit != None:
45 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices])
46 <= resource_limit, "main_resource_constraint")
47
48 return model
Listing S6: PuLP model for time minimization problem
15
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
List of samples and execution statistics
Table S1: Full list of samples analyzed by data pre-processing pipeline, ordered by total sample size. Samples in bold
were used to collect pipeline execution statistics presented in Table S2
Sample Number of files File size statistics (in GB) Total size (in GB)
mean stdev min max
1 NA12872 12 4.6 0.97 3.21 5.56 55.25
2 HG03006 12 4.34 2.14 2.13 7.2 52.07
3 HG01565 12 4.32 0.27 3.85 4.63 51.81
4 HG02855 12 4.19 0.44 3.66 4.8 50.25
5 NA19017 12 4.13 0.19 3.95 4.45 49.6
6 NA20869 12 4.05 2.05 2.43 6.89 48.6
7 NA19035 12 3.89 0.58 3.13 4.67 46.73
8 HG03727 12 3.85 0.14 3.66 4.15 46.24
9 NA12340 14 3.3 3.21 1.93 11.65 46.15
10 HG01583 12 3.84 0.15 3.67 4.08 46.08
11 HG03742 12 3.81 0.23 3.57 4.16 45.71
12 NA20502 12 3.77 0.45 3.35 4.42 45.3
13 NA20509 12 3.77 0.15 3.57 4.03 45.24
14 NA18995 12 3.77 0.11 3.61 3.91 45.22
15 HG03436 12 3.77 1.17 2.88 5.41 45.19
16 HG01595 12 3.76 0.3 3.3 4.04 45.17
17 HG03756 12 3.75 0.27 3.4 4.22 45.04
18 HG03449 12 3.72 0.17 3.48 3.89 44.68
19 HG03100 12 3.69 0.24 3.35 3.92 44.24
20 HG03722 12 3.68 0.46 3.03 4.18 44.19
21 NA11894 12 3.66 0.19 3.49 3.92 43.93
22 HG00096 12 3.66 0.24 3.3 3.88 43.87
23 HG00419 12 3.66 0.16 3.47 3.91 43.87
24 HG00410 12 3.64 0.2 3.26 3.85 43.71
25 NA06984 12 3.64 0.36 3.13 4.04 43.64
26 NA12878 12 3.63 0.12 3.47 3.79 43.61
27 NA10847 12 3.63 1 2.72 4.99 43.59
28 HG01958 12 3.62 0.17 3.45 3.88 43.48
29 HG01982 12 3.62 0.68 2.95 4.65 43.45
30 NA19648 12 3.62 0.28 3.17 3.93 43.38
31 HG01879 12 3.59 0.26 3.36 3.98 43.14
32 NA20764 12 3.6 0.44 2.85 3.92 43.14
33 HG02642 12 3.59 0.25 3.19 3.9 43.09
34 NA12489 12 3.58 0.11 3.43 3.74 42.94
35 NA20895 13 3.28 1.39 2.07 6.36 42.69
36 HG03052 12 3.53 0.18 3.23 3.75 42.38
37 HG00759 12 3.51 0.27 3.11 3.79 42.16
38 HG01112 12 3.51 0.2 3.23 3.82 42.16
39 NA20321 12 3.49 0.1 3.37 3.67 41.92
40 NA20752 12 3.49 0.08 3.38 3.59 41.91
41 NA20126 12 3.49 0.52 2.75 3.9 41.86
42 HG01500 12 3.48 0.52 2.91 4.22 41.77
43 NA19661 8 5.21 0.59 4.62 5.83 41.68
44 NA19102 12 3.46 0.49 2.75 3.93 41.55
45 HG00268 12 3.46 0.17 3.24 3.77 41.54
46 NA18956 12 3.36 0.23 3.14 3.75 40.35
47 HG02085 12 3.31 0.33 2.93 3.74 39.75
48 HG01393 12 3.28 0.46 2.84 3.98 39.38
49 HG02069 12 3.18 0.31 2.82 3.61 38.2
16
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Optimizing genomics pipeline execution with integer linear programming A P REPRINT
Table S2: Pipeline execution statistics
Task Experiment Cost (in $) Time (in hr)
ApplyBQSR Ddv4 0.591 0.459
ApplyBQSR Dv2 0.524 0.757
ApplyBQSR default 0.903 1.39
BaseRecalibrator Ddv4 0.338 0.531
BaseRecalibrator Dv2 0.356 1.19
BaseRecalibrator default 0.37 1.19
CheckContamination Ddv4 0.0302 0.668
CheckContamination Dv2 0.043 0.769
CheckContamination default 0.0425 1.15
CollectQualityYieldMetrics Ddv4 0.0175 0.0824
CollectQualityYieldMetrics Dv2 0.0283 0.132
CollectQualityYieldMetrics default 0.0227 0.19
CollectUnsortedReadgroupBamQualityMetrics Ddv4 0.0381 0.175
CollectUnsortedReadgroupBamQualityMetrics Dv2 0.0615 0.248
CollectUnsortedReadgroupBamQualityMetrics default 0.0503 0.354
CreateSequenceGroupingTSV Ddv4 0.000223 0.00986
CreateSequenceGroupingTSV Dv2 0.000314 0.0108
CreateSequenceGroupingTSV default 0.000358 0.0199
CrossCheckFingerprints Ddv4 0.00509 0.225
CrossCheckFingerprints Dv2 0.00858 0.296
CrossCheckFingerprints default 0.0126 0.523
GatherBamFiles Ddv4 0.0272 0.602
GatherBamFiles Dv2 0.0368 0.659
GatherBamFiles default 0.117 4.86
GatherBqsrReports Ddv4 0.000737 0.0326
GatherBqsrReports Dv2 0.00133 0.0459
GatherBqsrReports default 0.0022 0.122
MarkDuplicates Ddv4 0.194 2.15
MarkDuplicates Dv2 0.28 2.5
MarkDuplicates default 0.202 2.93
SamToFastqAndBwaMemAndMba Ddv4 2.17 1.03
SamToFastqAndBwaMemAndMba Dv2 1.97 0.931
SamToFastqAndBwaMemAndMba default 1.37 1.01
SortSampleBam Ddv4 0.157 1.74
SortSampleBam Dv2 0.24 2.14
SortSampleBam default 0.163 2.37
SumFloats Ddv4 0.000225 0.00998
SumFloats Dv2 0.000261 0.009
SumFloats default 0.000459 0.0255
17
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.