Optimizing genomics pipeline execution with integer linear programming

preprint OA: closed CC-BY-4.0
📄 Open PDF Full text JSON View at publisher
Full text 50,445 characters · extracted from oa-pdf · 3 sections · click to expand

Abstract

In the field of genomics, bioinformatics pipelines play a crucial role in processing and analyzing vast biological datasets. These pipelines, consisting of interconnected tasks, can be optimized for efficiency and scalability by leveraging cloud platforms such as Microsoft Azure. The choice of compute resources introduces a trade-off between cost and time. This paper introduces an approach that uses Linear Programming (LP) to optimize pipeline execution. We consider optimizing two competing cases: minimizing cost with a run duration restriction and minimizing duration with a cost restriction. Our results showcase the utility of using LP in guiding researchers to make informed compute decisions based on specific data sets, cost and time requirements, and resource constraints. 1 Introduction In the realm of genomics, bioinformatics pipelines are integral to processing and analyzing vast amounts of biological data. A pipeline, in this context, is a series of data processing steps, or tasks, that are interconnected. Each task takes a set of inputs, processes it, and produces a set of outputs that may be used as inputs for the subsequent tasks or are considered terminal outputs. This sequence of tasks can be distributed, or divided into smaller, more manageable pieces or shards, to enhance efficiency and scalability. These pipelines can be run on cloud platforms like Microsoft Azure, leveraging its robust computational resources and storage capabilities. These pipelines are typically run using some of the following workflow managers all able to run on Azure: Nextflow [1, 2], Cromwell [3, 4], Snakemake [5, 6]. It’s important to note that the cost and duration of running these pipelines on Azure are influenced by the choice of compute resources. More powerful resources can process data faster but at a higher cost, necessitating a balance between speed, efficiency, and budget. Choosing the appropriate compute resources is dependent on the tools and data storage (memory and disk) requirements of each task. With a variety of compute resources available, there is always a trade-off between cost and time when running a genomics pipeline. The preference for this trade-off depends on the researcher’s circumstances. If the researcher has limited funding and is less restricted on time, they might be interested in getting results as cheaply as possible without spending more than Tmax hours per sample. In clinical settings where time is critical, the preference might be to get

Results

as quickly as possible. Efficient resources utilization remains a key issue in parallel and distributed computing environments. Resource allocation and task scheduling problems are well-known in this field, and a lot of effort has been invested in improving the management of Cloud resources [7, 8, 9, 10]. In particular, cost optimization of scientific workflows has been a focus of attention [11]. Existing open-source workflow engines in genomics field, such as CromwellOnAzure [ 12], Snakemake [5] or Nextflow [1], require researchers to provide a virtual machine-to-task assignment list in advance as one of the workflow inputs or make a default assignment with no guarantees for optimal performance. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Table 1: Set of experiments defined for a given pipeline Experiment 1 Experiment 2 T ask1 V M11 V M12 T ask2 V M21 V M22 T ask3 V M31 V M32 Table 2: Expected execution statistics: average time to complete each pipeline task Experiment 1 Experiment 2 T ask1 t11 t12 T ask2 t21 t22 T ask3 t31 t32 Table 3: Expected execution statistics: average cost to complete each pipeline task Experiment 1 Experiment 2 T ask1 c11 c12 T ask2 c21 c22 T ask3 c31 c32 When a researcher’s institution does not use a workflow management system that monitors workflow execution and optimizes task scheduling, the researcher faces the challenge of optimizing a pipeline according to project requirements while taking into account the variety of compute options and non-linear pipeline topology. In this paper, we demonstrate that linear programming can be a useful tool for pipeline execution optimization in a bioinformatics researcher’s toolbox. We consider two use cases for pipeline optimization: UC 1 minimizing the cost of running the pipeline with a restriction on run duration, UC 2 minimizing the duration of the pipeline run with a restriction on cost. Genomics workloads often involve running a given pipeline on a large number of samples. Typically, data from these samples have the same number of reads, length, coverage and quality level, i.e. the sample dataset is homogeneous. Given that the dataset is homogeneous, we assume that average execution statistics collected for a subset of data are sufficient for preliminary pipeline performance estimates and optimizations. We also assume that statistics have been collected, and the scope of this work includes the following steps performed by the researcher: selecting a sample subset, selecting a virtual machine subset for each pipeline task, executing the pipeline on the selected subset of samples and virtual machines, collecting execution statistics, and computing the average cost and execution time for each pipeline task. For example, if virtual machine assignment is described by Table 1, then corresponding execution statistics will define matrices C = {cij} and T = {tij} presented as Tables 2 and 3. 2 Formulating linear programming problems Maximizing or minimizing a function subject to constraints is a common problem in many fields. Linear programming (LP) is a special case of mathematical optimization that deals with linear functions and constraints. In LP, we assume that the function to be maximized or minimized is a linear function and that the constraints are linear equations or linear inequalities [13]. LP has proven useful in modeling diverse types of problems in planning, routing, scheduling, assignment, and design. In this section, we formulate LP problems for the use cases defined above. Suppose we have a pipeline consisting of n tasks and m experiments defined by the researcher. The cost and time to complete each task are given by matricesC = {cij} and T = {tij}, respectively, wherei = 1, . . . , n and j = 1, . . . , m. We can formulate the following multiple-choice problems for the use cases in consideration: UC 1 For each task, pick exactly one virtual machine out of m options in order to minimize the total cost of running the pipeline without exceeding the allotted time Tmax. UC 2 For each task, pick exactly one virtual machine out of m options in order to minimize the duration of the pipeline run without exceeding the allowed cost Cmax. 2 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT We can formulate LP problems for pipeline optimization by following the approach used for the multiple-choice knapsack problem (MCKP) [14]. To do this, we introduce indicator variables Iij, where i = 1, . . . , n, j = 1, . . . , m. The variable Iij is equal to 1 if we choose virtual machine (VM) from Experiment j for T aski, and 0 otherwise. 2.1 Linear pipeline topology A linear pipeline consists of a chain of tasks arranged so that the output of each element is the input of the next. The processing of data is done in a linear and sequential manner, see example on Figure 1. In the case of a linear topology, the total time and cost to run a pipeline can be computed as a sum of all tasks time and cost. Following MCKP approach, the linear programming formulations can be expressed as: LP 1 – Minimize cost subject to time constraint Minimize nX i=1 mX j=1 cijIij total cost to run pipeline, subject to mX j=1 Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task, nX i=1 mX j=1 tijIij ≤ Tmax ensures time constraint is not exceeded. Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m} LP 2 – Minimize time subject to cost constraint Minimize nX i=1 mX j=1 tijIij → min total time to run pipeline, subject to mX j=1 Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task, nX i=1 mX j=1 cijIij ≤ Cmax ensures cost constraint is not exceeded. Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m} 2.2 Distributed tasks Sharding is a technique used to split a large task into smaller, more manageable pieces. The input data for the task is partitioned into shards based on a predetermined criterion, such as sequencing group or chromosome. Once the data is partitioned, it is distributed across multiple nodes for processing. Each node processes a subset of the data, which reduces the processing time and improves the overall performance of the system. When a task is split into shards (as shown in Figure 2), we aggregate the shards and represent the task as a single vertex on the pipeline graph for several reasons. First, existing implementations of shard functionality don’t allow different virtual machines to be assigned to different shards, and each shard of the task is assigned the same type of virtual machine. Second, the goal of creating shards is usually to split the task into pieces as equally as possible. Therefore, we assume that the execution time and cost won’t be drastically different between shards. When collecting execution statistics, we define the cost to complete the distributed task as the sum of all shards’ costs, and the time to complete the distributed task as the maximum across shards’ run duration. A pipeline with tasks split into shards can also be represented as a pipeline with non-linear topology (as shown in 2.3), it will increase the complexity of LP problem. It might be interesting to preform a comparison study to better understand the benefits from using non-linear topology versus the shard aggregation approach. This question is out of scope of this paper, and for simplicity, we will consider the distributed task as one vertex on pipeline graph. 3 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Figure 1: Example of linear pipeline topology Figure 2: Example of distributed task and shards’ execution statistics aggregation Figure 3: Example of non-linear pipeline topology 2.3 Non-linear pipeline topology The vast majority of genomics pipelines have a non-linear topology. In the non-linear case, the cost can still be computed as a sum of all costs, but the time to complete the pipeline should be computed differently. Let’s consider the pipeline presented in Figure 3. We can compute the time to complete it in the following way: Ttotal = timeT ask1 + max(timeT ask2 + timeT ask3 , timeT ask4 ) + timeT ask5 + timeT ask6 Unfortunately, we cannot replace the constraint in problem LP 1 and the objective function in problem LP 2 with the expression given above because it breaks the definition of an LP problem. However, there are tips and tricks that can help deal with non-linear functions, such as maximum, absolute value, etc., which frequently appear in real-world problems [13]. Usually, this is accomplished by creating additional variables. We introduce a variable Fi for each T aski, which describes amount of time required to complete a part of the pipeline from the start up to T aski. We also define a set of immediate predecessors for each vertex of the pipeline graph ImP redi. For example, for T ask5 the set of immediate predecessors is ImP red5 = {T ask3, T ask4}, and two inequalities hold F3 + timeT ask5 ≤ F5 and F4 + timeT ask5 ≤ F5. These inequalities ensure that T ask3 and T ask4 4 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT are completed prior to T ask5, and then timeT ask5 is required to complete the part of the pipeline from the start and up to T ask5. We also introduce Ftotal, which describes the amount of time required to complete the whole pipeline from the start to the end. With additional variables, linear programming formulations for non-linear topology can be expressed as follows: LP 3 – Minimize cost subject to time constraint Minimize nX i=1 mX j=1 cijIij total cost to run pipeline, subject to mX j=1 Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task, Fi ≤ Tmax, i ∈ {1, . . . , n} ensures time constraint is not exceeded, Fpr + mX j=1 tijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task. Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m} Fi ≥ 0, i ∈ {1, . . . , n} LP 4 – Minimize time subject to cost constraint Minimize Ftotal total time to run pipeline, subject to mX j=1 Iij = 1, i ∈ {1, . . . , n} guarantees exactly one VM is selected per task, nX i=1 mX j=1 cijIij ≤ Cmax ensures cost constraint is not exceeded. Fpr + mX j=1 tijIij ≤ Fi, i ∈ {1, . . . , n}, pr ∈ ImP redi ensures predecessors are completed prior to the task, Fi ≤ Ftotal, i ∈ {1, . . . , n} ensures all pipeline tasks are completed. Iij ∈ {0, 1}, i ∈ {1, . . . , n}, j ∈ {1, . . . , m} Ftotal ≥ 0, Fi ≥ 0, i ∈ {1, . . . , n} 3 Application: optimizing genome data pre-processing pipeline 3.1 Pipeline and data We leveraged LP problems formulated in the previous section to optimize UnmappedBamToAlignedBam workflow which implements the data pre-processing step of the Broad Institute Whole Genome Germline Single Sample Pipeline [15]. The version of the pipeline adapted to Azure is available in [16]. The UnmappedBamToAlignedBam workflow graph is presented in Figure 4: it has 13 tasks total, 5 of which are distributed. The list of immediate predecessors for each task is presented in Listing S4. The pipeline employs different sharding strategies for distributed tasks. For SamToFastqAndBwaMemAndMba (labeled as BW A on the pipeline graph), CollectQualityYieldMetrics, and CollectUnsortedReadgroupBamQualityMetrics, one shard is created for each input file with no guarantees that the shards will be of equal size. For BaseRecalibrator and ApplyBQSR (labeled as BR and BQSR respectively), an attempt is made to split the input bam file into equal pieces with sequencing groupings. SamToFastqAndBwaMemAndMba is a compute-intensive task, and its performance significantly affects the performance of the entire pipeline. We use it to observe how the shard-per-file strategy affects pipeline performance in terms of cost and run duration. We selected a subset of the 1000 Genomes Project dataset [17, 18]: 49 samples in total, each with 30x coverage. Of these, 46 samples have 12 unmapped bam files per sample, while the remaining 3 samples have 8, 13, and 14 files, 5 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Table 4: Subset of samples used to collect pipeline execution statistics Sample Number of files File size statistics (in GB) Total size (in GB) mean stdev min max NA19017 12 4.13 0.19 3.95 4.45 49.6 HG03436 12 3.77 1.17 2.88 5.41 45.19 NA06984 12 3.64 0.36 3.13 4.04 43.64 HG01982 12 3.62 0.68 2.95 4.65 43.45 NA20764 12 3.6 0.44 2.85 3.92 43.14 Figure 4: Pipeline graph for UnmappedBamToAlignedBam workflow. Tasks ’sink’ and ’source’ are introduced for implementation purposes to mark entry and exit points of the pipeline. respectively. Detailed sample statistics are presented in Table S1. To collect execution statistics, we randomly selected 5 samples out of the 46 samples with 12 files, as shown in Table 4. Genome assembly GRCh38 was used as a reference. 3.2 Virtual machine assignments and execution statistics collection We executed the UnmappedBamToAlignedBam workflow on Azure using a modified version of CromwellOn- Azure v3.0.0 [12]. The modifications were made to label each run with the sample and experiment name to simplify execution statistics collection and processing. No other functionality was changed. Three experiments were defined: default VM assignment, Dv2 family, and Ddv4 family. By default, CromwellOn- Azure assigns the cheapest VM per hour that satisfies the task’s compute requirements, such as disk, memory, or number of CPUs. The list of defaults was saved and then passed as an additional input file for the pipeline run to guarantee the same defaults were used for all samples. For Dv2 and Ddv4 family experiments, we created VM assignment lists manually based on the tasks’ compute requirements. These lists were also passed as additional input files for the corresponding pipeline runs. All VM assignment lists are available in the Supplementary materials, see Listings S1–S3. We were inspired by Intel’s benchmarking experiment [ 19] when choosing the VM families for our experiments. However, both Intel’s experiment and our experiments were conducted about two years ago. Therefore, the results presented in the next sections should not be considered as an optimal choice recommendation. The Azure VM offering has changed since then, with some families being deprecated and several new families being added. As a result, we highly recommend evaluating the currently available VM families to find up-to-date optimal compute resources. To collect execution statistics we ran the pipeline three times per each sample defined in Table 4 and each experiment defined above, resulting in 45 runs in total. Then, for each task-experiment pair, we computed the average cost and execution time over runs and samples (see Table S2). The average cost and time to complete whole pipeline are presented in Table 5. We can observe that simple switch from default VM assignment to Ddv4 family decreases time to complete the pipeline twice with a small increase in cost. We also computed the lower and upper bounds for cost and time that can be achieved by mixing VMs from the three experiments: $ 2.71 ≤ Cost to complete pipeline ≤ $ 4.23, 6.45 hr ≤ Time to complete pipeline ≤ 13.91 hr. 6 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Table 5: Cost and time to complete pipeline for each experiment. Experiment Cost (in $) Time (in hr) default 3.26 13.88 Dv2 3.55 8.23 Ddv4 3.57 6.55 Table 6: Cost constraints and corresponding minimal times to complete pipeline. Cmax (in $) Tmin (in hr) 2.71 6.82 2.86 6.53 3.01 6.53 3.17 6.53 3.32 6.53 3.47 6.45 3.62 6.45 3.77 6.45 3.93 6.45 4.08 6.45 4.23 6.45 Table 7: Recommended VM assignment when minimizing time with cost constraint Cmax = 2.71 Task Recommended Recommended experiment VM ApplyBQSR Dv2 Standard_D3_v2 BaseRecalibrator Ddv4 Standard_D4d_v4 CheckContamination Ddv4 Standard_D4d_v4 CollectQualityYieldMetrics Ddv4 Standard_D2d_v4 CollectUnsortedReadgroupBamQualityMetrics Ddv4 Standard_D2d_v4 CreateSequenceGroupingTSV Ddv4 Standard_D2d_v4 CrossCheckFingerprints Ddv4 Standard_D2d_v4 GatherBamFiles Ddv4 Standard_D4d_v4 GatherBqsrReports Ddv4 Standard_D2d_v4 MarkDuplicates Ddv4 Standard_D8d_v4 SamToFastqAndBwaMemAndMba default Standard_F16s_v2 SortSampleBam Ddv4 Standard_D8d_v4 SumFloats Ddv4 Standard_D2d_v4 3.3 Implementing and solving LP problems We used PuLP [20], a Python-based modeling language, to implement problems LP 3 and LP 4 (see Listings S5 and S6). PuLP can call a variety of open-source and commercial linear programming solvers, we used COIN-OR CBC [21], which is bundled with the PuLP package. We selected UC 2 from the two use cases described in the introduction, which involves minimizing the time to complete the pipeline with a restriction on cost. As shown above, the cost lower and upper bounds are $ 2.71 and $ 4.23 respectively. We took several points from the cost range [2.71, 4.23] and used each of them as the cost constraint Cmax, thus generating and solving several time minimization LP problems. The time matrix T and cost matrix C were obtained from the execution statistics presented in Table S2, pipeline graph was defined in JSON format (see Listing S4). Table 6 displays the values used as cost constraints Cmax and minimum times Tmin to complete the pipeline corre- sponding to these constraints. Tmin is the minimum of the objective function for problem LP 4 with cost constraint Cmax. The results presented demonstrate the cost-time trade-off that we described in the introduction. For instance, if we are willing to spend $ 4.23 on each pipeline run, then the minimal time to complete the pipeline is 6.45 hours. On 7 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT the other hand, if we do not want to spend more than $ 3.3 on each run, then we need a minimum of 6.53 hours. Table 6 also allows us to make a choice based on cost-time requirements specific to a research project. We decided to proceed with Cmax = 2.71 and Tmin = 6.82. Solution of LP problem with Cmax = 2.71 is presented in Table 7. 3.4 Using recommended VM assignment to process full set of samples We analyzed the full set of 49 samples using the compute resources recommended in Table 7. One run was performed for each sample, and for the majority of samples, the pipeline run was completed within (or very close to) the cost-time boundaries defined by Cost ≤ $ 2.71 and T ime ≤ 6.82 hr (see the bottom left section of Figure 5). Sample NA12872 is not present on the figure, as the run for this sample failed due to its large size and lack of disk space for processing a file by one of the non-distributed tasks. Samples that appear on the right section of Figure 5 are characterized by a larger standard deviation of sample file sizes, a large total sample size, or a combination of both (see Table 8). Samples HG03436, NA20869, and NA12340 illustrate how the duration of the pipeline run is affected by the variability in sample file sizes. As previously mentioned, the task SamToFastqAndBwaMemAndMba is a distributed task where one shard is created for each input file, and the same type of VM is assigned to all shards of the task. For sample NA12340, input files of size 1.93 GB and 11.65 GB are processed in parallel on two VMs of the same type (Standard_F16s_v2). Since larger files require more processing time, variability in file sizes leads to variability in shard processing times. The task SamToFastqAndBwaMemAndMba is completed when the longest shard is completed, and for NA12340, it takes longer compared to evenly distributed samples. At the same time, the total amount of processed data for NA12340 (46.15 GB) is similar to that of other samples, so we do not observe an increase in the cost of the pipeline run. Overall, the uneven distribution of samples across input files shifts samples to the right on the Time vs. Cost graph, whilst an increase in total sample size affects both cost and time to complete the pipeline, and shifts samples to the top right (see sample HG01565). Figure 5: Cost and time to complete data pre-processing pipeline with recommended VM assignment, each point corresponds to one sample. 4 Discussion While our assumptions such as average execution statistics, aggregated shards, and multiple-choice problem formulation help simplify the LP problem and make it more accessible to non-experts, they also introduce limitations to the approach described above. In this section, we discuss several enhancements that can be implemented and the scenarios in which they might be useful. Dealing with distributed tasks As demonstrated earlier, combining shards into a single node on the pipeline graph may not be effective when there is an uneven distribution of inputs across shards. If this scenario is anticipated, an alternative approach to consider is to represent each shard with a separate node and use a non-linear pipeline topology. 8 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Table 8: Top 15 samples by size; for samples in bold cost or time to complete pipeline exceeded limits Cost ≤ $ 2.71 and T ime ≤ 6.82 hr Sample Number of files File size statistics (in GB) Total size (in GB) mean stdev min max 1 NA12872 12 4.6 0.97 3.21 5.56 55.25 2 HG03006 12 4.34 2.14 2.13 7.2 52.07 3 HG01565 12 4.32 0.27 3.85 4.63 51.81 4 HG02855 12 4.19 0.44 3.66 4.8 50.25 5 NA19017 12 4.13 0.19 3.95 4.45 49.6 6 NA20869 12 4.05 2.05 2.43 6.89 48.6 7 NA19035 12 3.89 0.58 3.13 4.67 46.73 8 HG03727 12 3.85 0.14 3.66 4.15 46.24 9 NA12340 14 3.3 3.21 1.93 11.65 46.15 10 HG01583 12 3.84 0.15 3.67 4.08 46.08 11 HG03742 12 3.81 0.23 3.57 4.16 45.71 12 NA20502 12 3.77 0.45 3.35 4.42 45.3 13 NA20509 12 3.77 0.15 3.57 4.03 45.24 14 NA18995 12 3.77 0.11 3.61 3.91 45.22 15 HG03436 12 3.77 1.17 2.88 5.41 45.19 Cloud compute resources are limited We made the assumption that VMs used for experiments are available in any quantity when we wrote multiple-choice problems and formulated LP problems. This assumption might be acceptable if samples are analyzed in small quantities in sequential order. For instance, if a few new samples come every 8 hours, they’re sent for analysis right away and it’s completed before the next batch comes. In this scenario, resource capacity issues might be prevented with careful evaluation of compute requirements, Cloud quotas and competing workloads. If the capacity limits for some virtual machines are reached when analyzing one sample and changing the quota is not possible, then a capacity constraint must be added to the LP problems. When analyzing samples at a large scale, capacity limits may be easily reached by multiple distributed tasks with hundreds or thousands of shards running in parallel. For instance, one HaplotypeCaller task from Whole Genome Germline Single Sample Pipeline requires 50 shards for a 30x coverage sample. In this case, it might be beneficial to optimize resource utilization in general and consider a resource allocation and scheduling problems that are well-known in Cloud computing. We recommend reviewing the current state of research in this field and carefully evaluating the complexity of the solution that your project requires. Is execution statistics required? We assumed that the researcher is willing to collect execution statistics before optimizing the pipeline, which might be impossible or cost-prohibitive in some cases. Additionally, we used average cost and execution time values to solve LP problems, which resulted in the loss of information about individual sample size, task cost, and execution time distributions. We also did not take into account the input size of incoming sample to process. If we can accurately predict task execution time and cost for each sample, then VM assignment recommendation can be done individually as well. When historical execution data is available, various machine learning techniques can be used to predict task runtime for new samples based on their size, workflow structure, etc. [22, 23, 24, 25, 26] However, approaches relying on historical data cannot be applied to workflows without any historical traces or workflows with changed parameters and configurations on any kind of cluster. In such cases, methods employing microbenchmarks for predicting the runtimes of tasks before they are executed might be very useful [27]. 5 Conclusion In this paper, we present a linear programming method that can help bioinformatics researchers optimize the execution of scientific pipelines for either cost or time. Our method is designed to work with all scientific pipelines that have a directed acyclic graph topology. The implementation we present uses execution data from representative samples to determine the optimal compute needed for the pipeline execution. The same methodology might be used with individual task runtimes predicted on historical data or microbenchmarks. In the future, we want to explore the limitation of compute resources and capacity when dealing with competing interests, such as running competing pipelines or scaling a single pipeline at thousands of parallel samples. 9 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT

References

[1] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, C. Notredame, Nextflow enables reproducible computational workflows, Nat Biotechnol 35 (4) (2017) 316–319. doi:10.1038/nbt.3820. [2] Nextflow documentation: Running on azure, https://www.nextflow.io/docs/edge/azure.html, [Online; accessed: January 25, 2024]. [3] K. V oss, G. V . D. Auwera, J. Gentry, Full-stack genomics pipelining with gatk4 + wdl + cromwell, https://f1000research.com/slides/6-1381, [version 1; not peer reviewed] (2017). doi:10.7490/ f1000research.1114634.1. [4] Cromwell documentation: Azure backend, https://cromwell.readthedocs.io/en/stable/backends/ Azure, [Online; accessed: January 25, 2024]. [5] F. Mölder, K. P. Jablonski, B. Letcher, et al., Sustainable data analysis with snakemake, F1000Research 10 (2021) 33, [version 1; peer review: 1 approved, 1 approved with reservations].doi:10.12688/f1000research.29032. 1. [6] Snakemake documentation: Executing a snakemake workflow via azure batch, https://snakemake.readthedocs.io/en/v7.31.1/executing/cloud.html# executing-a-snakemake-workflow-via-azure-batch , [Online; accessed: January 25, 2024]. [7] S. Srivastava, I. Banicescu, Scheduling in parallel and distributed computing systems, in: Topics in Par- allel and Distributed Computing, Springer International Publishing, 2018, pp. 313–337. doi:10.1007/ 978-3-319-93109-8_11 . [8] V . Vinothina, R. Sridaran, P. Ganapathi, A survey on resource allocation strategies in cloud computing, Int. J. Adv. Comput. Sci. Appl. 3 (6) (2012) 97–104. [9] S. H. H. Madni, M. S. A. Latiff, Y . Coulibaly, S. M. Abdulhamid, Recent advancements in resource allocation techniques for cloud computing environment: a systematic review, Cluster Computing 20 (2017) 2489–2533. [10] K. Saidi, D. Bardou, Task scheduling and vm placement to resource allocation in cloud computing: challenges and opportunities, Cluster Computing 26 (2023) 3069–3087. [11] E. N. Alkhanak, S. P. Lee, R. Rezaei, R. M. Parizi, Cost optimization approaches for scientific workflow scheduling in cloud and grid computing: A review, classifications, and open issues, Journal of Systems and Software 113 (2016) 1–26. doi:https://doi.org/10.1016/j.jss.2015.11.023. [12] https://github.com/microsoft/CromwellOnAzure, [Online; accessed: January 18, 2024]. [13] T. Hu, A. Kahng, Linear and Integer Programming Made Easy, Springer International Publishing Switzerland, 2016. [14] H. Kellerer, U. Pferschy, D. Pisinger, Knapsack Problems, Springer Berlin Heidelberg, 2004. [15] D. Caetano-Anolles, Data pre-processing for variant discovery, https://gatk.broadinstitute.org/hc/ en-us/articles/360035535912 (2020). [16] https://github.com/microsoft/gatk4-genome-processing-pipeline-azure , [Online; accessed: Jan- uary 18, 2024]. [17] Overview of the 1000 genomes project, https://www.internationalgenome.org/ 1000-genomes-summary, [Online; accessed: January 18, 2024]. [18] Azure open datasets: genomics data lake – 1000 genomes, https://learn.microsoft.com/en-us/azure/ open-datasets/dataset-1000-genomes, [Online; accessed: January 18, 2024]. [19] R. Pruitt, M. Powers, J. Chia, P. Sebastian, K. Mannthey, Evaluating genomics pipelines on azure: Intel-based virtual machines, https://techcommunity.microsoft.com/t5/azure-high-performance-computing/ evaluating-genomics-pipelines-on-azure-intel-based-virtual/ba-p/2824608 (2021). [20] https://pypi.org/project/PuLP, [Online; accessed: January 18, 2024]. [21] https://github.com/coin-or/Cbc, [Online; accessed: January 18, 2024]. [22] S. M. Sadjadi, S. Shimizu, J. Figueroa, R. Rangaswami, J. Delgado, H. A. Duran-Limon, X. J. Collazo-Mojica, A modeling approach for estimating execution time of long-running scientific applications, 2008 IEEE International Symposium on Parallel and Distributed Processing (2008) 1–8. [23] R. F. da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, M. Livny, Toward fine-grained online task characteristics estimation in scientific workflows, Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science (2013) 58–67. 10 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT [24] R. F. da Silva, G. Juve, M. Rynge, E. Deelman, M. Livny, Online task resource consumption prediction for scientific workflows, Parallel Process. Lett. 25 (2015) 1541003:1–1541003:25. [25] F. Nadeem, D. M. Alghazzawi, A. S. Mashat, K. A. Fakeeh, A. S. A.-M. Al-Ghamdi, H. Hagras, Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network, Cluster Computing 20 (2017) 2805–2819. [26] M. J. Rosa, C. G. Ralha, M. Holanda, A. P. Araujo, Computational resource and cost prediction service for scientific workflows in federated clouds, Future Generation Computer Systems 125 (2021) 844–858. [27] J. Bader, F. Lehmann, L. Thamsen, U. Leser, O. Kao, Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures, Future Generation Computer Systems 150 (2024) 171–185. 11 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Supplementary materials Compute resources used for experiments 1 { 2 "CollectQualityYieldMetrics": "Standard_D2_v3", 3 "SamToFastqAndBwaMemAndMba": "Standard_F16s_v2", 4 "CreateSequenceGroupingTSV": "Standard_A2_v2", 5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2_v3", 6 "SumFloats": "Standard_A2_v2", 7 "MarkDuplicates": "Standard_L4s", 8 "SortSampleBam": "Standard_L4s", 9 "CrossCheckFingerprints": "Standard_A2", 10 "BaseRecalibrator": "Standard_D2_v2", 11 "CheckContamination": "Standard_D11_v2", 12 "GatherBqsrReports": "Standard_A2_v2", 13 "ApplyBQSR": "Standard_A3", 14 "GatherBamFiles": "Standard_A2", 15 } Listing S1: Default virtual machine assignment made by CromwellOnAzure 1 { 2 "CollectQualityYieldMetrics": "Standard_D2_v2", 3 "SamToFastqAndBwaMemAndMba": "Standard_D5_v2", 4 "CreateSequenceGroupingTSV": "Standard_D2_v2", 5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2_v2", 6 "SumFloats": "Standard_D2_v2", 7 "MarkDuplicates": "Standard_D4_v2", 8 "SortSampleBam": "Standard_D4_v2", 9 "CrossCheckFingerprints": "Standard_D2_v2", 10 "BaseRecalibrator": "Standard_D2_v2", 11 "CheckContamination": "Standard_D3_v2", 12 "GatherBqsrReports": "Standard_D2_v2", 13 "ApplyBQSR": "Standard_D3_v2", 14 "GatherBamFiles": "Standard_D3_v2", 15 } Listing S2: Virtual machine assignment for Dv2 family experiment 12 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT 1 { 2 "CollectQualityYieldMetrics": "Standard_D2d_v4", 3 "SamToFastqAndBwaMemAndMba": "Standard_D16d_v4", 4 "CreateSequenceGroupingTSV.vm_size": "Standard_D2d_v4", 5 "CollectUnsortedReadgroupBamQualityMetrics": "Standard_D2d_v4", 6 "SumFloats": "Standard_D2d_v4", 7 "MarkDuplicates": "Standard_D8d_v4", 8 "SortSampleBam": "Standard_D8d_v4", 9 "CrossCheckFingerprints": "Standard_D2d_v4", 10 "BaseRecalibrator": "Standard_D4d_v4", 11 "CheckContamination": "Standard_D4d_v4", 12 "GatherBqsrReports": "Standard_D2d_v4", 13 "ApplyBQSR": "Standard_D8d_v4", 14 "GatherBamFiles": "Standard_D4d_v4", 15 } Listing S3: Virtual machine assignment for Ddv4 family experiment UnmappedBamToAlignedBam workflow graph 1 { 2 "CollectQualityYieldMetrics": ["source"], 3 "SamToFastqAndBwaMemAndMba": ["source"], 4 "CreateSequenceGroupingTSV": ["source"], 5 "CollectUnsortedReadgroupBamQualityMetrics": ["SamToFastqAndBwaMemAndMba"], 6 "SumFloats": ["SamToFastqAndBwaMemAndMba"], 7 "MarkDuplicates": ["SumFloats"], 8 "SortSampleBam": ["MarkDuplicates"], 9 "CrossCheckFingerprints": ["SortSampleBam"], 10 "BaseRecalibrator": ["SortSampleBam", "CreateSequenceGroupingTSV"], 11 "CheckContamination": ["SortSampleBam"], 12 "GatherBqsrReports": ["BaseRecalibrator"], 13 "ApplyBQSR": ["GatherBqsrReports"], 14 "GatherBamFiles": ["ApplyBQSR"] , 15 "sink": 16 ["GatherBamFiles", 17 "CheckContamination", 18 "CrossCheckFingerprints", 19 "CollectUnsortedReadgroupBamQualityMetrics", 20 "CollectQualityYieldMetrics"] 21 } Listing S4: Immediate predecessors list for each task. Tasks ’sink’ and ’source’ are introduced for implementation purposes to mark entry and exit points of the pipeline. 13 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Implementation of LP problems for non-linear pipeline topology using PuLP Below we assume that pipeline is an instance of a to-be-implemented Pipeline class with the following properties: • tasks – list of task names, • experiments – list of experiment names, • costs – cost matrix C in the following dictionary format {'Task_1': {'Experiment_1': c_11, 'Experiment_2': c_12}, 'Task_2': {'Experiment_1': c_21, 'Experiment_2': c_22}} • times – time matrix T in the same dictionary format as presented above for cost, • graph – dictionary that maps task name to a list of its immediate predecessors. 1 from pulp import * 2 3 def cost_minimization_problem(pipeline, resource_limit, model_name): 4 tasks = pipeline.tasks 5 experiments = pipeline.experiments 6 costs = pipeline.costs 7 times = pipeline.times 8 graph = pipeline.graph 9 10 choices = [(t,e) for t in tasks for e in experiments] 11 12 # Create model to contain the problem data 13 model = LpProblem(model_name, LpMinimize) 14 15 # Create indicator variables corresponding to I_ij in LP formulation 16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary) 17 18 # The objective function is added to model first 19 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices]), "total_cost") 20 21 # These constraints guarantee that only one VM is selected per task 22 for t in tasks: 23 model += (lpSum([ivars[t][e] for e in experiments]) == 1, "limit_for_task_%s"%t) 24 25 if resource_limit != None: 26 # Create additional variables F_i (finishing time for Task_i) 27 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous') 28 # These constraints ensure that predecessors are completed prior to the task 29 for t in tasks: 30 for pr in graph[t]: 31 if pr == 'source': 32 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments]) 33 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr)) 34 else: 35 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments]) 36 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr)) 37 38 # Main resource constraint ensures that alotted time is not exceeded 39 for t in tasks: 40 model += (fvars[t] <= resource_limit, "main_resource_constraint_%s"%t) 41 42 return model Listing S5: PuLP model for cost minimization problem 14 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT 1 from pulp import * 2 3 def time_minimization_problem(pipeline, resource_limit, model_name): 4 tasks = pipeline.tasks 5 experiments = pipeline.configs 6 costs = pipeline.costs 7 times = pipeline.times 8 graph = pipeline.graph 9 10 choices = [(t,e) for t in tasks for e in experiments] 11 12 # Create model to contain the problem data 13 model = LpProblem(model_name, LpMinimize) 14 15 # Create indicator variables corresponding to I_ij in LP formulation 16 ivars = LpVariable.dicts('choice', (tasks, experiments), 0, 1, LpBinary) 17 # Create additional variables F_i (finishing time for Task_i) 18 fvars = LpVariable.dicts('F', tasks, 0, None, cat = 'Continuous') 19 # Create additional variable F_max (finishing time for full pipeline) 20 dvar = LpVariable('F_max', 0, None, 'Continuous') 21 22 # The objective function is added to model first 23 model += (lpSum(dvar), "total_time") 24 25 # These constraints guarantee that only one VM is selected per task 26 for t in tasks: 27 model += (lpSum([ivars[t][e] for e in experiments]) == 1, "limit_for_task_%s"%t) 28 29 # These constraints ensure that predecessors are completed prior to the task 30 for t in tasks: 31 for pr in graph[t]: 32 if pr == 'source': 33 model += (lpSum([ivars[t][e]*times[t][e] for e in experiments]) 34 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr)) 35 else: 36 model += (fvars[pr]+lpSum([ivars[t][e]*times[t][e] for e in experiments]) 37 - fvars[t] <= 0, "finish_time_task_%s_%s"%(t, pr)) 38 39 # These constraints ensure that all tasks are completed 40 for t in tasks: 41 model += (fvars[t] <= dvar, "max_replacement_task_%s"%t) 42 43 # Main resource constraint ensures that allowed cost is not exceeded 44 if resource_limit != None: 45 model += (lpSum([ivars[t][e]*costs[t][e] for (t,e) in choices]) 46 <= resource_limit, "main_resource_constraint") 47 48 return model Listing S6: PuLP model for time minimization problem 15 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT List of samples and execution statistics Table S1: Full list of samples analyzed by data pre-processing pipeline, ordered by total sample size. Samples in bold were used to collect pipeline execution statistics presented in Table S2 Sample Number of files File size statistics (in GB) Total size (in GB) mean stdev min max 1 NA12872 12 4.6 0.97 3.21 5.56 55.25 2 HG03006 12 4.34 2.14 2.13 7.2 52.07 3 HG01565 12 4.32 0.27 3.85 4.63 51.81 4 HG02855 12 4.19 0.44 3.66 4.8 50.25 5 NA19017 12 4.13 0.19 3.95 4.45 49.6 6 NA20869 12 4.05 2.05 2.43 6.89 48.6 7 NA19035 12 3.89 0.58 3.13 4.67 46.73 8 HG03727 12 3.85 0.14 3.66 4.15 46.24 9 NA12340 14 3.3 3.21 1.93 11.65 46.15 10 HG01583 12 3.84 0.15 3.67 4.08 46.08 11 HG03742 12 3.81 0.23 3.57 4.16 45.71 12 NA20502 12 3.77 0.45 3.35 4.42 45.3 13 NA20509 12 3.77 0.15 3.57 4.03 45.24 14 NA18995 12 3.77 0.11 3.61 3.91 45.22 15 HG03436 12 3.77 1.17 2.88 5.41 45.19 16 HG01595 12 3.76 0.3 3.3 4.04 45.17 17 HG03756 12 3.75 0.27 3.4 4.22 45.04 18 HG03449 12 3.72 0.17 3.48 3.89 44.68 19 HG03100 12 3.69 0.24 3.35 3.92 44.24 20 HG03722 12 3.68 0.46 3.03 4.18 44.19 21 NA11894 12 3.66 0.19 3.49 3.92 43.93 22 HG00096 12 3.66 0.24 3.3 3.88 43.87 23 HG00419 12 3.66 0.16 3.47 3.91 43.87 24 HG00410 12 3.64 0.2 3.26 3.85 43.71 25 NA06984 12 3.64 0.36 3.13 4.04 43.64 26 NA12878 12 3.63 0.12 3.47 3.79 43.61 27 NA10847 12 3.63 1 2.72 4.99 43.59 28 HG01958 12 3.62 0.17 3.45 3.88 43.48 29 HG01982 12 3.62 0.68 2.95 4.65 43.45 30 NA19648 12 3.62 0.28 3.17 3.93 43.38 31 HG01879 12 3.59 0.26 3.36 3.98 43.14 32 NA20764 12 3.6 0.44 2.85 3.92 43.14 33 HG02642 12 3.59 0.25 3.19 3.9 43.09 34 NA12489 12 3.58 0.11 3.43 3.74 42.94 35 NA20895 13 3.28 1.39 2.07 6.36 42.69 36 HG03052 12 3.53 0.18 3.23 3.75 42.38 37 HG00759 12 3.51 0.27 3.11 3.79 42.16 38 HG01112 12 3.51 0.2 3.23 3.82 42.16 39 NA20321 12 3.49 0.1 3.37 3.67 41.92 40 NA20752 12 3.49 0.08 3.38 3.59 41.91 41 NA20126 12 3.49 0.52 2.75 3.9 41.86 42 HG01500 12 3.48 0.52 2.91 4.22 41.77 43 NA19661 8 5.21 0.59 4.62 5.83 41.68 44 NA19102 12 3.46 0.49 2.75 3.93 41.55 45 HG00268 12 3.46 0.17 3.24 3.77 41.54 46 NA18956 12 3.36 0.23 3.14 3.75 40.35 47 HG02085 12 3.31 0.33 2.93 3.74 39.75 48 HG01393 12 3.28 0.46 2.84 3.98 39.38 49 HG02069 12 3.18 0.31 2.82 3.61 38.2 16 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint Optimizing genomics pipeline execution with integer linear programming A P REPRINT Table S2: Pipeline execution statistics Task Experiment Cost (in $) Time (in hr) ApplyBQSR Ddv4 0.591 0.459 ApplyBQSR Dv2 0.524 0.757 ApplyBQSR default 0.903 1.39 BaseRecalibrator Ddv4 0.338 0.531 BaseRecalibrator Dv2 0.356 1.19 BaseRecalibrator default 0.37 1.19 CheckContamination Ddv4 0.0302 0.668 CheckContamination Dv2 0.043 0.769 CheckContamination default 0.0425 1.15 CollectQualityYieldMetrics Ddv4 0.0175 0.0824 CollectQualityYieldMetrics Dv2 0.0283 0.132 CollectQualityYieldMetrics default 0.0227 0.19 CollectUnsortedReadgroupBamQualityMetrics Ddv4 0.0381 0.175 CollectUnsortedReadgroupBamQualityMetrics Dv2 0.0615 0.248 CollectUnsortedReadgroupBamQualityMetrics default 0.0503 0.354 CreateSequenceGroupingTSV Ddv4 0.000223 0.00986 CreateSequenceGroupingTSV Dv2 0.000314 0.0108 CreateSequenceGroupingTSV default 0.000358 0.0199 CrossCheckFingerprints Ddv4 0.00509 0.225 CrossCheckFingerprints Dv2 0.00858 0.296 CrossCheckFingerprints default 0.0126 0.523 GatherBamFiles Ddv4 0.0272 0.602 GatherBamFiles Dv2 0.0368 0.659 GatherBamFiles default 0.117 4.86 GatherBqsrReports Ddv4 0.000737 0.0326 GatherBqsrReports Dv2 0.00133 0.0459 GatherBqsrReports default 0.0022 0.122 MarkDuplicates Ddv4 0.194 2.15 MarkDuplicates Dv2 0.28 2.5 MarkDuplicates default 0.202 2.93 SamToFastqAndBwaMemAndMba Ddv4 2.17 1.03 SamToFastqAndBwaMemAndMba Dv2 1.97 0.931 SamToFastqAndBwaMemAndMba default 1.37 1.01 SortSampleBam Ddv4 0.157 1.74 SortSampleBam Dv2 0.24 2.14 SortSampleBam default 0.163 2.37 SumFloats Ddv4 0.000225 0.00998 SumFloats Dv2 0.000261 0.009 SumFloats default 0.000459 0.0255 17 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 8, 2024. ; https://doi.org/10.1101/2024.02.06.579197doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-24T02:00:01.246996+00:00
License: CC-BY-4.0