Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation

Research Square · Research Article

Muqing Yan, Qiandong Zeng, Zhenxi Zhang, Patricia Okamoto, Stanley Letovsky, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6513733/v1

This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 06 Aug, 2025 in BMC Genomics. Version 1 posted 12

Abstract

Background: Orthogonal confirmation of variants identified by next-generation sequencing (NGS) is routinely performed in many clinical laboratories to improve assay specificity.
However, confirmatory testing of all clinically significant variants increases both turnaround time and operating costs for laboratories. Refinements of early NGS methods and bioinformatics algorithms have dramatically improved variant calling accuracy, particularly for single nucleotide variants (SNVs), thus calling into question the necessity of confirmatory testing for all variant types. The purpose of this study is to develop a new machine learning approach to capture false positive heterozygous SNVs from whole exome sequencing (WES) data.

Results: WES variant calls from Genome in a Bottle (GIAB) cell lines and their associated quality features were used to train five different machine learning models to predict whether a variant was a true positive or false positive based on quality metrics. Logistic regression and random forest models exhibited the highest false positive capture rates among the selected models, but gradient boosting achieved the best balance between false positive capture rates and true positive flag rates. Further assessment using simulated false positive events as well as different combinations of quality features showed that model performance can be refined. Integration of the highest-performing models into a custom two-tiered confirmation bypass pipeline with additional guardrail metrics achieved 99.9% precision and 98% specificity in the identification of true positive heterozygous SNVs within the GIAB benchmark regions. Furthermore, testing on an independent set of heterozygous SNVs (n = 93) detected by exome sequencing of patient samples and cell lines demonstrated 100% accuracy.

Conclusions: Machine learning models can be trained to classify SNVs into high- or low-confidence categories with high precision, thus reducing the level of confirmatory testing required.
Laboratories interested in deploying such models should consider incorporating additional quality criteria and thresholds to serve as guardrails in the assessment process.

Keywords: Next generation sequencing, Sanger confirmation, Machine learning, Clinical decision-support tool

Background

Sanger sequencing is a first-generation DNA sequencing method that has long been considered the gold standard for the accurate detection of small sequence variants [1], but its use as a primary approach for variant detection has been largely supplanted by NGS in clinical laboratories. Many laboratories continue to employ Sanger sequencing as an orthogonal method to confirm variants identified by NGS; however, the sensitivity and accuracy of NGS methods and bioinformatic tools have improved significantly since the technology's inception [2–4]. Numerous studies examining the necessity of Sanger sequencing report concordance rates of > 99% between NGS and Sanger sequencing results for single nucleotide variants (SNVs) and insertion-deletion variants (indels) in high-complexity regions [1, 5–9], whereas low-complexity regions comprising repetitive elements, homologous regions, and high-GC content, as well as technical artifacts, are more likely to be enriched for false positive variants with relatively poor quality metrics.

Machine learning is a type of artificial intelligence that can make decisions or render predictions based on inferred relationships between features (a.k.a. parameters) of a dataset without explicit programming. Diagnostic applications of machine learning frequently rely on supervised learning models that require training on labeled data to learn which features are significant for a given class, while unsupervised approaches attempt to establish relationships between features without a priori knowledge of truth.
Within the field of medical genetics, supervised machine learning models have been reported to significantly reduce the confirmation rate of NGS variant calls using sequencing parameters such as read depth, allele frequency, sequencing quality, and mapping quality as variables to train models [10–12]. The success of these models can be attributed to quantitative and qualitative differences that separate high- and low-confidence variant calls. While these studies demonstrate proof-of-concept for implementing machine learning models for triaging variants in confirmatory testing, pipeline-specific differences in quality features necessitate de novo model building and clinical validation before integrating these models into a clinical genetic workflow.

In this study, we aim to employ supervised machine learning models to differentiate between two types of heterozygous SNVs: high-confidence variants that do not require orthogonal confirmation, and low-confidence variants that require additional review and confirmatory testing. Random forest (RF), logistic regression (LR), AdaBoost, gradient boosting (GB), and EasyEnsemble methods were selected for comparison in this study. Multiple iterations of supervised machine learning were performed to identify which features and statistical methods yielded optimal results, and a two-tiered model with guardrails for allele frequency and sequence context was developed to achieve the optimal balance between sensitivity and specificity. Testing of the final model suggested that our approach significantly reduces the number of true positive variants requiring confirmation while mitigating the risk of reporting false positives.

Methods

Cell lines and specimens

Genomic DNA isolated from Genome in a Bottle (GIAB) reference specimens NA12878, NA24385, NA24149, NA24143, NA24631, NA24694, and NA24695 was purchased from the Coriell Institute for Medical Research (Camden, NJ) (Additional file 1: Table S1).
Additional lymphoblast cell line DNA (Coriell) and de-identified patient specimens were used in a separate validation of the final model. Informed consent for the clinical testing was obtained by referring physicians prior to sample submission, and residual DNA was de-identified prior to use in this study. Per the United States Code of Federal Regulations for the Protection of Human Subjects, institutional review board exemption is applicable due to de-identification of the patient data presented herein (45 CFR part 46.101(b)(4)).

Data downloads and source materials

GIAB benchmark files (version v4.2.1 for GRCh37) containing high-confidence variant calls were downloaded from the National Center for Biotechnology Information (NCBI) FTP site for use as truth sets for supervised learning and assessment of model performance. Genomic regions ineligible for Sanger bypass were compiled by downloading the following BED files from the UCSC Genome Browser: ENCODE blacklist, NCBI NGS high stringency, NCBI NGS low stringency, NCBI NGS dead zone, and segmental duplication tracks. These data were supplemented with additional regions of low mappability identified by an internal assessment.

NGS library preparation and data processing

Whole exome libraries for GIAB cell lines were sequenced twice on two separate flow cells. Library preparation and target enrichment were carried out using an internally developed automation workflow on the Hamilton NGS Star workstation (Hamilton Company, Reno, NV). Briefly, libraries were prepared from 250 ng of genomic DNA using Kapa HyperPlus reagents (Kapa Biosystems, Inc./Roche, Wilmington, MA) for enzymatic fragmentation, end-repair, A-tailing and adaptor ligation. Each library was indexed with unique dual barcodes (IDT, Coralville, IA) to eliminate the possibility of index hopping between samples.
For target enrichment, twelve normalized libraries were pooled together, and a custom panel of biotinylated, double-stranded DNA probes (Twist Biosciences, South San Francisco, CA) was used to capture exome sequences as well as other regions of interest (~41.4 Mb total). The hybridized libraries were further purified using streptavidin beads, and the library pools were quantified via the Kapa qPCR Library Quantification kit (Kapa Biosystems Inc./Roche, Wilmington, MA) on a QuantStudio® 7 (Thermo Fisher Scientific, Waltham, MA). After normalization, the library pools were combined, and ~1–2% PhiX library control was spiked into the final pool to monitor sequencing quality in real time. Up to 192 libraries were sequenced (paired-end, 2 × 150 cycles) per S4 flow cell on the NovaSeq 6000 sequencer (Illumina, San Diego, CA). Sequencing run quality metrics were tracked in real time using the Illumina Sequencing Analysis Viewer v.2.4.7 software for percent of clusters passing filter, fraction of bases at > Q30, sequencing yield and flow cell occupancy. Additional metrics related to the PhiX control, such as alignment, error rate and pre-phasing/phasing, were used for troubleshooting. Sequencing data were demultiplexed with the bcl2fastq2 v.2.20 or BCLConvert v.3.8.2 software (Illumina, San Diego, CA), and the FASTQ files were processed through a customized data analysis pipeline that consisted of the CLCBio Genomics CLS WebService v.21.0.5 and Workbench v.21.0.5 (Qiagen Bioinformatics, Redwood City, CA) software and plugins thereof as well as internally developed algorithms. Reads were trimmed to remove adaptor sequences and low-quality bases (< Q20), and then aligned to the GRCh37/hg19 NCBI reference genome, followed by duplicate read removal, local re-alignment and variant detection.
Data quality was assessed based on metrics such as mean target coverage, fraction of bases at minimum coverage, coverage uniformity expressed as Fold 80 base penalty, on-target rate and insert size, all of which were calculated using Picard v.2.3.0 tools in the Genome Analysis Toolkit (GATK; Broad Institute, Cambridge, MA). Sequence data were analyzed with the CLCBio Clinical Lab Service to generate annotated TR.xml files with quality features (Table S2) used for training and testing various machine learning algorithms. All heterozygous SNVs called from the internal pipeline were intersected with the GIAB benchmark BED files, and variants in the high-confidence regions were annotated as 0 if present in the truth set (i.e., true positive (TP)) or 1 if absent (i.e., false positive (FP)).

Sanger sequencing

Confirmation of select variants was performed by Sanger sequencing. Primers flanking the test variants were designed online using Primer3Plus software, and primer specificity was verified using the UCSC Genome Browser (University of California, Santa Cruz, CA) in silico PCR tool. Sanger sequencing was performed by capillary electrophoresis on the Applied Biosystems 3730xl genetic analyzer (Thermo Fisher Scientific, Waltham, MA). GeneStudio™ Pro (Informer Technologies Inc., New York City, NY) and UGENE (Unipro, Novosibirsk, Russia) software were used for alignment and analysis of Sanger sequencing traces.

Model training and testing strategies

Predictive modeling of high-confidence variants detected in the GIAB specimens was performed using logistic regression, random forest, gradient boosting, EasyEnsemble, and AdaBoost. The features used for model training included allele frequency, read count metrics, coverage, quality, read position probability, read direction probability, homopolymer presence, and overlap with low-complexity sequence (i.e., complex regions) (Table S2).
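The truth labeling described in the data processing section (0 for calls present in the GIAB truth set, 1 for calls absent from it, restricted to the benchmark regions) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline. The tuple layout `(chrom, pos, ref, alt)`, the helper names, and the toy coordinates are assumptions, and regions are treated as sorted, non-overlapping, half-open BED-style intervals with 0-based positions.

```python
from bisect import bisect_right

def build_index(regions):
    """Turn {chrom: [(start, end), ...]} into sorted start/end lists per chromosome."""
    index = {}
    for chrom, intervals in regions.items():
        ordered = sorted(intervals)
        index[chrom] = ([s for s, _ in ordered], [e for _, e in ordered])
    return index

def in_regions(index, chrom, pos):
    """True if 0-based pos lies in a half-open [start, end) interval (BED convention)."""
    starts, ends = index.get(chrom, ([], []))
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]

def label_calls(calls, truth, regions):
    """Keep calls inside the benchmark regions; label 0 if in the truth set (TP), else 1 (FP)."""
    index = build_index(regions)
    labeled = []
    for call in calls:
        chrom, pos, _ref, _alt = call
        if in_regions(index, chrom, pos):
            labeled.append((call, 0 if call in truth else 1))
    return labeled

# Toy data (hypothetical coordinates, not taken from the study).
regions = {"chr1": [(100, 200), (500, 600)]}
truth = {("chr1", 150, "A", "G")}
calls = [
    ("chr1", 150, "A", "G"),  # inside a region and in the truth set
    ("chr1", 160, "C", "T"),  # inside a region but absent from the truth set
    ("chr1", 300, "G", "A"),  # outside the benchmark regions, so dropped
]
labeled = label_calls(calls, truth, regions)
```

Calls outside the benchmark regions are dropped rather than labeled, because their truth status is unknown.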
All annotated variants from the two flow cells were evenly split into two subsets, stratified by truth label so that the proportions of FPs and TPs were similar in each. The first half of the data was used for leave-one-sample-out cross validation (LOOCV), in which each GIAB sample was left out once and used as the testing set while the other six samples were used as the training set; the second half of the data was reserved to test the models trained on the first dataset. In the cross validation (CV) experiment, the machine learning models were trained using all high-confidence variants with known truth (imbalanced raw data) and all available features. Hyperparameters were tuned and selected in this phase. A second phase of training was performed using the first half of the data as the training set and the second half as the testing set to evaluate the importance of quality features, assess the impact of imbalanced data, and select the best model combination. Feature coefficients were estimated on both raw and scaled data (min-max scaling in Python scikit-learn). Because false positive variants are the primary target and were labeled as 1, features with positive coefficients were more likely to be associated with false positive variants, whereas features with negative coefficients were more likely to be associated with true positive variants. High-impact features were selected based on the estimates in this step.

Since our data are imbalanced, with most variants representing true positive calls, comparisons among balanced and imbalanced datasets were performed in the final testing. Methods selected to achieve balanced datasets for evaluation included simple oversampling (SOS), which randomly duplicates data points from the minority class, and the synthetic minority oversampling technique (SMOTE), which generates synthetic minority data points by interpolating between a minority data point and its k-nearest minority-class neighbors.
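Simple oversampling as described above amounts to drawing minority-class rows with replacement until the classes are the same size. The sketch below shows that idea using only the standard library. The function name, the seed, and the toy feature rows are assumptions made for illustration, and the source does not state which implementation was used (libraries such as imbalanced-learn provide `RandomOverSampler` and `SMOTE`). SMOTE is not shown here.

```python
import random

def simple_oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        # Draw with replacement from the existing members to fill the gap.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels

# Toy rows: [allele frequency, coverage]; label 1 marks a false positive call.
rows = [[0.48, 60], [0.51, 72], [0.50, 65], [0.12, 9]]
labels = [0, 0, 0, 1]
balanced_rows, balanced_labels = simple_oversample(rows, labels)
```

Oversampling is normally applied to the training portion only, so that duplicated rows never appear in the data used to score a model.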
Common metrics for ML classification, including accuracy, the confusion matrix, the area under the receiver operating characteristic curve (AUC or AUROC), and the F1-score, were examined to assess the performance of the models. AUC summarizes the trade-off between the true positive rate (TPR), or sensitivity, and the false positive rate (FPR), which equals one minus the true negative rate (TNR), or specificity, across classification thresholds, whereas the F1-score combines precision and recall and is informative for highly imbalanced data. For both AUC and F1-score, a greater value reflects better model performance. The confusion matrix tabulates the true positive, false positive, true negative, and false negative counts; accuracy is the sum of true positives and true negatives divided by the total number of samples [13]. In the context of this study, since the primary target is false positive variants, which are labeled as 1 in the model training, TPR and false positive capture rate are used interchangeably. True positive flag rate is defined as the rate at which models incorrectly tag true positive variants as low-confidence variants.

External validation

External validation of the model was performed using a new variant set comprising 93 heterozygous SNVs identified in 83 de-identified specimens or cell lines. Criteria for the selection of test variants were (1) location within a genomic region validated in any of our NGS assays and (2) the variant’s clinical significance. For the purpose of this validation, clinical significance refers to the minimum variant classification required to be considered reportable for a given panel (e.g., only likely pathogenic or pathogenic variants in a carrier screening gene; pathogenic, likely pathogenic and variants of uncertain significance (VUS) in genes overlapping a cardiogenetic or hereditary cancer panel).

Results

Dataset preparation

Seven GIAB cell lines were sequenced twice on two flow cells, resulting in 282,076 heterozygous SNVs.
Intersection of these variants with the GIAB benchmark high-confidence regions resulted in a total of 222,489 heterozygous SNVs, of which 212,397 variants were annotated as TP and 10,092 labeled as FP (Table 1a). Labeled variants in each sample were then split into two halves such that the counts and percentages for TPs and FPs were evenly divided between the two subsets (Table 1b).

Table 1 Detailed counts by sample along the workflow for heterozygous single-nucleotide variants (SNVs)

a. Counts of heterozygous SNVs throughout the workflow

Flow cell    NIST ID  Raw      In GIAB benchmark regions  Present in truth set (TP)  Absent in truth set (FP)
Flow Cell 1  HG001    20,612   15,933                     15,109                     824
Flow Cell 1  HG002    20,625   16,600                     15,829                     771
Flow Cell 1  HG003    20,540   16,322                     15,583                     739
Flow Cell 1  HG004    21,090   16,630                     15,929                     701
Flow Cell 1  HG005    19,844   15,627                     14,808                     819
Flow Cell 1  HG006    18,938   14,970                     14,223                     747
Flow Cell 1  HG007    19,618   15,315                     14,663                     652
Flow Cell 2  HG001    20,645   15,880                     15,120                     760
Flow Cell 2  HG002    20,620   16,553                     15,879                     674
Flow Cell 2  HG003    20,511   16,330                     15,598                     732
Flow Cell 2  HG004    20,990   16,600                     15,943                     657
Flow Cell 2  HG005    19,504   15,439                     14,807                     632
Flow Cell 2  HG006    18,890   14,976                     14,220                     756
Flow Cell 2  HG007    19,649   15,314                     14,686                     628
Total        -        282,076  222,489                    212,397                    10,092

Seven cell lines were sequenced twice in two separate flow cells as technical replicates. Truth was retrieved from GIAB benchmark regions. True positive (TP) is a variant present in the truth set; false positive (FP) is a variant absent in the truth set.

b.
Counts of heterozygous SNVs for cross-validation and final test

                      First half for CV             Second half for final training
Flow cell    NIST ID  Present (TP)  Absent (FP)     Present (TP)  Absent (FP)     Total
Flow Cell 1  HG001    7,554         412             7,555         412             15,933
Flow Cell 1  HG002    7,914         386             7,915         385             16,600
Flow Cell 1  HG003    7,791         370             7,792         369             16,322
Flow Cell 1  HG004    7,964         351             7,965         350             16,630
Flow Cell 1  HG005    7,404         409             7,404         410             15,627
Flow Cell 1  HG006    7,111         374             7,112         373             14,970
Flow Cell 1  HG007    7,331         326             7,332         326             15,315
Flow Cell 2  HG001    7,560         380             7,560         380             15,880
Flow Cell 2  HG002    7,939         337             7,940         337             16,553
Flow Cell 2  HG003    7,799         366             7,799         366             16,330
Flow Cell 2  HG004    7,971         329             7,972         328             16,600
Flow Cell 2  HG005    7,403         316             7,404         316             15,439
Flow Cell 2  HG006    7,110         378             7,110         378             14,976
Flow Cell 2  HG007    7,343         314             7,343         314             15,314
Total        -        106,194       5,048           106,203       5,044           222,489

For each cell line in each flow cell, both TP and FP variants were split evenly into two groups. The first group was used for cross-validation; both groups were used in the second and final testing phases.

Cross-validation

A LOOCV was performed using the first subset for both training and testing of the logistic regression, random forest, gradient boosting, AdaBoost, and EasyEnsemble models. In this first phase of model training, all 13 quality features were used, and target false positive capture rates (recall) of 95% and 99% were tested, meaning that 5 in 100 and 1 in 100 false positive calls would be missed, respectively. Logistic regression and random forest models exhibited the best performance with respect to false positive capture rates, while gradient boosting achieved the best all-around performance with a high FP capture rate and low TP flag rate (Table 2). Low standard deviations were observed for all models across different genetic backgrounds, indicating consistent and robust model performance.
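The two mechanics in this phase, holding out one GIAB sample at a time and choosing a score cutoff that reaches a target recall of false positives, can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code. The function names and toy scores are assumptions, and a score is taken to mean a model's predicted probability that a call is a false positive (label 1). scikit-learn's `LeaveOneGroupOut` produces the same kind of split when sample IDs are passed as groups.

```python
import math

def leave_one_sample_out(sample_ids):
    """Yield (held_out, train_idx, test_idx), holding out each distinct sample once."""
    for held_out in sorted(set(sample_ids)):
        train = [i for i, s in enumerate(sample_ids) if s != held_out]
        test = [i for i, s in enumerate(sample_ids) if s == held_out]
        yield held_out, train, test

def threshold_for_recall(scores, labels, target_recall):
    """Largest cutoff t such that 'score >= t' captures at least target_recall of label-1 calls."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no label-1 (false positive) examples")
    # Small epsilon guards against floating-point noise in the product.
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]

def capture_and_flag_rates(scores, labels, threshold):
    """Return (FP capture rate, TP flag rate), i.e. TPR and FPR with label 1 as the positive class."""
    captured = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    flagged = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_fp_calls = sum(labels)
    n_tp_calls = len(labels) - n_fp_calls
    return captured / n_fp_calls, flagged / n_tp_calls

# Toy scores (hypothetical): four false positive calls, then four true positive calls.
scores = [0.9, 0.8, 0.6, 0.2, 0.7, 0.1, 0.3, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoff = threshold_for_recall(scores, labels, 0.75)
rates = capture_and_flag_rates(scores, labels, cutoff)
splits = list(leave_one_sample_out(["HG001", "HG002", "HG001"]))
```

Raising the target recall lowers the cutoff, which captures more false positives but also flags more true positives for confirmation. That is the trade-off the 0.95 and 0.99 settings explore.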
Table 2 Summary of cross-validation experiments for five models on heterozygous SNVs

Variant type: SNV-heterozygous

                    Recall 0.95 (TPR)                                   Recall 0.99 (TPR)
Model               CV FP capture rate (TPR %)  CV TP flag rate (FPR %)  CV FP capture rate (TPR %)  CV TP flag rate (FPR %)  CV ROC AUC (%)
GradientBoosting    91.34 ± 2.32                19.25 ± 3.72             96.56 ± 0.66                54.33 ± 4.26             94.77 ± 0.81
LogisticRegression  94.88 ± 1.52                41.81 ± 6.89             99.00 ± 0.45                89.43 ± 3.54             94.52 ± 0.71
EasyEnsemble        93.81 ± 1.63                34.46 ± 5.17             98.50 ± 0.76                75.12 ± 5.66             94.34 ± 0.88
AdaBoost            88.19 ± 2.88                12.75 ± 2.47             91.90 ± 2.27                29.62 ± 4.25             93.83 ± 1.04
RandomForest        94.22 ± 1.65                50.68 ± 8.22             99.07 ± 0.64                82.30 ± 4.86             92.79 ± 1.05

Mean and standard deviation for true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic curve (ROC AUC) in cross-validation (CV) under 0.95 and 0.99 recall rates. Models are sorted by AUC values.

Model evaluation and selection

In the second phase of testing, all five models were trained using the first dataset for training and the second dataset for testing. The results of phase two were comparable to phase one, with gradient boosting exhibiting the most balanced performance and random forest and logistic regression models exceeding EasyEnsemble and AdaBoost in false positive capture rates (Table 3 and Fig. 1). Refinement of model performance was then explored by comparing model performance with the full set of sequence features versus select high-impact features. Feature coefficients (applicable to LR) and importance (applicable to RF and GB) were estimated using both raw and scaled data to determine the relative contribution of each feature to the associated true positive or false positive label, and to eliminate redundant features.
This assessment indicated that scaling of LR coefficients yielded inconsistent patterns for several features (Table 4); however, eight features (frequency, read count, coverage, forward count, reverse count, forward/reverse ratio, read position probability, and overlap with complex regions) exhibited consistent trends between raw and scaled data. The features with the highest contributions were selected as key features for subsequent model training and comparisons. Density plots for these key features showing the difference between true positives and false positives can be reviewed in Additional file 1: Figure S3. The effects of imbalanced data were also investigated using two methods for balancing skewed datasets: SOS and SMOTE. In total, eighteen combinations with variable feature sets and relative balance were evaluated across RF, LR and GB models to determine the optimal conditions for training (Table S3). The top-performing model configurations according to the F1 scores were: GB trained with all quality features and raw imbalanced data, LR trained with selected key features and SOS balanced data, and RF trained with all quality features and raw imbalanced data (Table 5; the full list of statistics for all combinations can be found in Additional file 1: Table S4).

Because each optimized model has distinct advantages, we decided to combine all three models and use the thresholds obtained at the 0.99 recall rate in the training set to form a two-tiered (2T) workflow for the evaluation of heterozygous SNVs (Fig. 2). Logistic regression and random forest models demonstrated the highest sensitivity in identifying false positives. Therefore, we integrated these two models as the initial classifier to detect prominent low-confidence calls. Given the superior robustness of the gradient boosting model, we implemented it as a secondary layer to classify the uncertain cases that emerged from the combined random forest and logistic regression analysis.
Additional criteria were also integrated into the pipeline, including allele frequency ranges, minimum coverage, and genomic location, to prevent bypass of variants with atypical features.

Table 3 Summary of second-phase training experiments for five models on heterozygous SNVs

Variant type: SNV-het. Recall 0.99 (TPR) in the second-phase training.

Model               FP capture rate (TPR %)  TP flag rate (FPR %)  TP capture rate (TNR %)  FP flag rate (FNR %)  ROC AUC (%)  Threshold (%)
GradientBoosting    98.12                    10.74                 89.26                    1.88                  98.67        0.42
EasyEnsemble        98.67                    19.41                 80.59                    1.33                  98.3         48.64
LogisticRegression  98.68                    36.18                 63.82                    1.32                  98.07        0.29
AdaBoost            94.94                    6.47                  93.53                    5.06                  98.02        49.44
RandomForest        99.13                    37                    63                       0.87                  97.85        10.27

True positive rate (TPR), false positive rate (FPR), area under the receiver operating characteristic curve (ROC AUC) and threshold drawn under 0.99 recall rate in the second phase. Models are sorted by AUC values.

Table 4 Assessment of feature importance/coefficients for RF, LR and GB using both raw and scaled data (HET SNV)

Feature                     LR_coefficients  LR_coefficients_scaled  RF_importance  GB_importance
frequency                   -0.15            -11.44                  0.22           0.24
read count                  -0.23            -1.68                   0.01           0.03
coverage                    0                2.18                    0.03           0.11
forward count               0.2              15.63                   0.04           0.07
reverse count               0.21             16.59                   0.03           0.05
forward/reverse ratio       -7.54            -1.79                   0              0.01
average quality             0.1              -0.4                    0              0
probability                 0.47             -0.63                   0              0
read position probability   -1.43            -3.16                   0.23           0.36
read direction probability  -0.58            0.61                    0.05           0.02
homopolymer                 -0.02            -0.15                   0              0
homopolymer length          0.07             0.39                    0              0
in complex region           2.65             2.51                    0.38           0.11

Coefficient evaluation for logistic regression (LR) and feature importance evaluation for random forest (RF) and gradient boosting (GB) for both raw and scaled data using all thirteen next-generation sequencing (NGS) features that were initially available.
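One way to read the raw and scaled logistic regression columns of Table 4 is to note that min-max scaling is a per-feature affine map. The derivation below is a sketch under the assumption of an unregularized fit with a unique optimum; the source does not state the regularization setting. Here m_j and M_j denote the minimum and maximum of feature j in the training data, and primes mark quantities for the scaled fit.

```latex
x_j' = \frac{x_j - m_j}{M_j - m_j}
\quad\Longleftrightarrow\quad
x_j = m_j + (M_j - m_j)\, x_j'

\beta_0 + \sum_j \beta_j x_j
  = \Big(\beta_0 + \sum_j \beta_j m_j\Big) + \sum_j \beta_j (M_j - m_j)\, x_j'

\Rightarrow\quad
\beta_j' = \beta_j\,(M_j - m_j), \qquad
\beta_0' = \beta_0 + \sum_j \beta_j m_j
```

Under that assumption the scaled coefficient is the raw coefficient multiplied by the feature's range, so its sign is preserved and only its magnitude changes. When a penalty is applied (scikit-learn's `LogisticRegression` uses L2 regularization by default) or when features are strongly correlated, the two fits are no longer related this simply. That may contribute to the sign changes between the raw and scaled columns for some features.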
Table 5 Final model selection of the top-performing combinations of features and datasets for heterozygous SNVs

Features  Data status   Configuration  Model  TPR    FPR    TNR    FNR   ROC    Threshold (%)  Ss    F1
all       imbalanced    imb_all        GB     98.12  10.74  89.26  1.88  98.67  0.42           0.90  0.90
key       SOS balanced  sos_key        LR     98.69  28.66  71.34  1.31  98.30  4.91           0.97  0.82
all       imbalanced    imb_all        RF     99.13  37.00  63.00  0.87  97.85  10.27          1.00  0.77

Best models picked for the Sanger bypass pathway are gradient boosting (GB) trained with all 13 features and imbalanced original data (imb_all), logistic regression (LR) trained with key features and simple-oversampling (SOS) balanced data (sos_key), and random forest (RF) trained with all 13 features and imbalanced original data (imb_all). Ss: scaled capture rate score. Models are sorted by F1 values.

Final model evaluation

The final 2T workflow was then tested on the other half of the GIAB variants, which served as the testing set (a total of 106,203 known TPs and 5,044 FPs, Table 1b). Tier one (RF + LR) returned predictions for 24,683 variants (22,320 high-confidence + 2,363 low-confidence) (Table 6a). However, approximately 77.8% (86,564/111,247) of variants with known truth could not be predicted as present or absent at the selected thresholds (“Unknown” by RF + LR: 13,972 + 72,592) and were therefore passed to the GB model as the second tier of the confirmation bypass workflow. GB correctly classified 86.7% (72,503/83,628) of the remaining true positive variants and 96.9% (2,847/2,936) of the remaining false positive calls. Taken together, the 2T predictions were concordant with GIAB truth for 89.7% (99,766/111,247) of the total variants.
According to the established workflow, 89.3% ((22,314 + 72,503)/106,203) of GIAB true positives were labeled as high-confidence and would be eligible to bypass orthogonal confirmation, while 16,335 variants predicted to be low-confidence (2,363 by RF + LR and 13,972 by GB), including 11,386 true positive variants (261 by RF + LR and 11,125 by GB), would require reflex to Sanger sequencing (Table 6a). Additionally, 95 false positive variants (6 by RF + LR and 89 by GB) were incorrectly predicted to be true positive by the combined models, corresponding to a false positive rate of 1.88% (95/5,044). Overall, the 2T model delivers 99.9% (94,817/(94,817 + 95)) PPV/precision, 89.3% (94,817/(94,817 + 11,386)) sensitivity, and 98.1% (4,949/(4,949 + 95)) specificity (Table 6b) in the context of correctly predicting true positives in the truth set.

Table 6 a. Performance of three models combined on the GIAB testing set

                                                                           GIAB cell line variant truth
Random Forest + Logistic Regression  Gradient Boosting prediction          Present (Positive)  Absent (Negative)  Totals
High-confidence (Positive)           Low-confidence (Negative)             0                   0                  0
High-confidence (Positive)           High-confidence (Positive)            22,314              6                  22,320
Low-confidence (Negative)            Low-confidence (Negative)             261                 2,102              2,363
Low-confidence (Negative)            High-confidence (Positive)            0                   0                  0
Unknown                              Low-confidence (Negative)             11,125              2,847              13,972
Unknown                              High-confidence (Positive)            72,503              89                 72,592
Totals                               -                                     106,203             5,044              111,247

Table 6 b. Confusion matrix of 2T model on the GIAB testing set

                            The truth
2T prediction               Present (Positive)  Absent (Negative)
High-confidence (Positive)  94,817              95
Low-confidence (Negative)   11,386              4,949

The first step in the pipeline combines the random forest and logistic regression models to classify variants; gradient boosting then serves as the third model to classify the remaining variants. High-confidence variants predicted by the 2T model qualify for Sanger bypass, whereas low-confidence variants require Sanger confirmation.
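The step from Table 6a to Table 6b, and from there to the reported summary statistics, is plain arithmetic and can be checked with a few lines of code. The sketch below copies the non-zero counts from Table 6a. The dictionary layout and function names are assumptions made for illustration, and "tier" here means the stage that made the final call.

```python
# Counts copied from Table 6a: (deciding tier, final prediction) -> (present in truth, absent from truth)
TABLE_6A = {
    ("RF+LR", "high"): (22_314, 6),
    ("RF+LR", "low"): (261, 2_102),
    ("GB", "high"): (72_503, 89),
    ("GB", "low"): (11_125, 2_847),
}

def collapse(table):
    """Sum the tier-level counts into the 2x2 layout of Table 6b."""
    totals = {"high": [0, 0], "low": [0, 0]}
    for (_tier, prediction), (present, absent) in table.items():
        totals[prediction][0] += present
        totals[prediction][1] += absent
    return totals

def summarize(totals):
    """Metrics with 'present in truth set' as the positive class, as in Table 6b."""
    tp, fp = totals["high"]  # true variants bypassed; false calls that slipped through
    fn, tn = totals["low"]   # true variants sent to Sanger; false calls caught
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

totals = collapse(TABLE_6A)
metrics = summarize(totals)
```

With these counts, precision rounds to 0.999, sensitivity to 0.893, specificity to 0.981, and overall concordance to 0.897.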
External validation

Validation of the models was performed on a subset of new heterozygous SNVs (n = 93) identified through end-to-end testing of samples on the exome panel to assess overfitting of the final models to the training dataset. The accuracy of the predictions for this new dataset was determined using Sanger sequencing data as a source of truth. The concordance rate between the machine learning predictions and Sanger sequencing results was 100% in this validation study (Additional file 1: Table S5), suggesting that overfitting did not contribute to the high sensitivity rates observed in the GIAB testing dataset.

Discussion

The development of artificial intelligence for decision support in healthcare is rapidly gaining acceptance among the medical and scientific communities. In recent years, several publications have described machine learning tools that can distinguish between true positive and false positive NGS calls based on sequence metrics and variant characteristics. In this study, we built upon the framework developed by Holt et al. [12] by using a combination of continuous and binary models to establish a workflow for bypassing confirmation of heterozygous SNVs detected by NGS. The decision to limit the scope of our development efforts to heterozygous variants was based on our observation that the overwhelming majority of clinically significant variants detected by NGS in our laboratory are heterozygous sequence changes. Consequently, developing models that can accurately classify heterozygous variants as true or false positives provides the greatest benefit in terms of financial savings and improved turnaround. Of note, heterozygous structural variants (deletions, insertions, and indels) in the GIAB dataset were initially included in our early assessment of different models; however, our preliminary data suggested poor performance (data not shown).
Though previous studies have reported success in applying models to indel prediction [11, 12, 14–16], our discordant outcomes can be reasonably attributed to differences between methodologies and bioinformatics pipelines.

The decision to adopt any strategy to bypass confirmatory testing in a clinical setting, regardless of whether machine learning is involved, should be taken only after a thorough risk assessment. While several studies describe models with impressive performance, the majority of approaches failed to achieve a false positive capture rate of 100%. Training and testing of various models in our laboratory also failed to capture all false positives in the GIAB datasets. Thus, to reduce the risk of reporting false positives, additional criteria should be considered when designing confirmatory bypass workflows, such as thresholds for allele frequency range, minimum coverage, and genomic regions ineligible for bypass. Additionally, it may be advisable to limit the use of predictive models for medically actionable conditions to avoid the immediate and irreversible harm stemming from the reporting of potential false positives.

Although we believe the machine learning models and the proposed workflow described here are conservative and robust, we recognize that our approach also has limitations. Notably, training was performed exclusively on variant calls in the GIAB benchmark regions. The use of large datasets with known truth is ideal for training robust models; however, the final models may not perform well on variants beyond those high-confidence regions if those variants have different characteristics. Laboratories may wish to circumvent this limitation by restricting the training data to regions of interest that correspond to target capture regions in their specific panels.
This approach would yield a reduced number of variants for training and testing, which may result in more customized models with better performance due to reduced complexity of the input variants. Additionally, training models on the GIAB dataset is predicated on the assumption that GIAB benchmark files do not contain errors. Using verified datasets for model training is clearly preferable, but it is not feasible for any laboratory to perform confirmatory tests for all GIAB benchmark variants. Internal datasets might be a viable alternative for some laboratories, provided that these datasets contain a sufficient number of false positives for a model to learn the distinguishing characteristics of the minority class.

Conclusions

To summarize, our study suggests that the general approach of using GIAB benchmark data along with variant quality features to train machine learning models can significantly improve clinical NGS workflows by easing the burden of orthogonal confirmations on labor, cost, and turnaround time. Although these models were developed on whole exome data using our internal NGS pipeline, customized models can also be developed for different pipelines and NGS libraries.

Abbreviations

NGS: next-generation sequencing
SNV: single nucleotide variant
WES: whole exome sequencing
GIAB: Genome in a Bottle
TPR: true positive rate
RF: random forest model
LR: logistic regression model
GB: gradient boosting model
NCBI: National Center for Biotechnology Information
LOOCV: leave-one-sample-out cross validation
CV: cross validation
SOS: simple over sampling
SMOTE: synthetic minority oversampling
AUC: area under the curve
AUROC: area under the receiver operating characteristic curve
TNR: true negative rate
FPR: false positive rate
VUS: variants of uncertain significance
2T: two-tiered
PPV: positive predictive value
Declarations

Ethics approval and consent to participate

All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2024. Per the US Federal Policy for the Protection of Human Subjects, institutional review board exemption is applicable due to de-identification of the presented data (45 CFR part 46.101(b)(4)). The contents of this manuscript have been reviewed for compliance by the Labcorp Legal department and the Department of Science and Technology.

Consent for publication

Not applicable.

Availability of data and materials

The raw sequencing data (fastq files) for the GIAB cell lines generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1257936. Sequence datasets generated from patient specimens will not be made available for distribution as an additional measure to protect patient privacy. The software described in this article is proprietary and subject to company regulations. However, inquiries regarding access or usage may be directed to the corresponding author.

Competing interests

MY, ZZ, QZ, AK, PO, SL, NL, and JR are current employees of Labcorp, a commercial laboratory that receives compensation for clinical testing. A provisional patent application for the machine learning model presented herein has been submitted with MY, QZ, AK, SL and JR listed as inventors.

Funding

Funding for this study was provided solely by Labcorp Genetics.

Authors' contributions

MY: Conceptualization, Methodology, Software, Data Curation, Writing-Original Draft, Review & Editing, Visualization; JR: Conceptualization, Methodology, Clinical Analysis, Investigation, Writing-Original Draft, Review & Editing. MY and JR were major contributors to preparing the manuscript.
ZZ performed Sanger sequencing confirmation, including amplicon design, the Sanger sequencing assay, and data analysis. PO supervised the overall wet lab process. AK, QZ, SL and NL: Review and Editing, Supervision.

Acknowledgements

Not applicable.

Authors' information

MY: Sr. Bioinformatics Scientist; JR: Clinical Laboratory Director, American Board of Medical Genetics and Genomics certification in Cytogenetics and Molecular Genetics, FACMG; NL: Sr. Laboratory Director, American Board of Medical Genetics and Genomics certification in Cytogenetics and Molecular Genetics, FACMG; ZZ: Research and Development Scientist III; PO: Technical Director II; QZ: Principal Bioinformatics Scientist; AK: Director of Information Technology; SL: Executive Director, Scientific Projects, Diagnostics & Precision Medicine.

References

1. Mu W, Lu H-M, Chen J, Li S, Elliott AM. Sanger Confirmation Is Required to Achieve Optimal Sensitivity and Specificity in Next-Generation Sequencing Panel Testing. The Journal of Molecular Diagnostics. 2016;18(6):923-932.
2. McCourt CM, McArt DG, Mills K, Catherwood MA, Maxwell P, Waugh DJ, Hamilton P, O'Sullivan JM, Salto-Tellez M. Validation of Next Generation Sequencing Technologies in Comparison to Current Diagnostic Gold Standards for BRAF, EGFR and KRAS Mutational Analysis. PLoS ONE. 2013;8(7):e69604.
3. Lincoln SE, Truty R, Lin C-F, Zook JM, Paul J, Ramey VH, Salit M, Rehm HL, Nussbaum RL, Lebo MS. A Rigorous Interlaboratory Examination of the Need to Confirm Next-Generation Sequencing–Detected Variants with an Orthogonal Method in Clinical Genetic Testing. The Journal of Molecular Diagnostics. 2019;21(2):318-329.
4. Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Human Genomics. 2022;16(1).
5. Arteche-López A, Ávila-Fernández A, Romero R, Riveiro-Álvarez R, López-Martínez MA, Giménez-Pardo A, Vélez-Monsalve C, Gallego-Merlo J, García-Vara I, Almoguera B, et al. Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. Scientific Reports. 2021;11(1).
6. Baudhuin LM, Lagerstedt SA, Klee EW, Fadra N, Oglesbee D, Ferber MJ. Confirming Variants in Next-Generation Sequencing Panel Testing by Sanger Sequencing. The Journal of Molecular Diagnostics. 2015;17(4):456-461.
7. Beck TF, Mullikin JC; NISC Comparative Sequencing Program; Biesecker LG. Systematic Evaluation of Sanger Validation of Next-Generation Sequencing Variants. Clinical Chemistry. 2016;62(4):647-654.
8. De Cario R, Kura A, Suraci S, Magi A, Volta A, Marcucci R, Gori AM, Pepe G, Giusti B, Sticchi E. Sanger Validation of High-Throughput Sequencing in Genetic Diagnosis: Still the Best Practice? Frontiers in Genetics. 2020;11:592588.
9. Pellegrino E, Jacques C, Beaufils N, Nanni I, Carlioz A, Metellus P, Ouafik LH. Machine learning random forest for predicting oncosomatic variant NGS analysis. Scientific Reports. 2021;11(1).
10. Marceddu G, Dallavilla T, Guerri G, Zulian A, Marinelli C, Bertelli M. Analysis of machine learning algorithms as integrative tools for validation of next generation sequencing data. European Review for Medical and Pharmacological Sciences. 2019;23:8139-8147.
11. van den Akker J, Mishne G, Zimmer AD, Zhou AY. A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing. BMC Genomics. 2018;19:263.
12. Holt JM, Kelly M, Sundlof B, Nakouzi G, Bick D, Lyon E. Reducing Sanger confirmation testing through false positive prediction algorithms. Genetics in Medicine. 2021;23(7):1255-1262.
13. Handelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, Lee MJ, Asadi H. Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods. AJR American Journal of Roentgenology. 2019;212(1):38-43.
14. Huang Y-S, Hsu C, Chune Y-C, Liao IC, Wang H, Lin Y-L, Hwu W-L, Lee N-C, Lai F. Diagnosis of a Single-Nucleotide Variant in Whole-Exome Sequencing Data for Patients With Inherited Diseases: Machine Learning Study Using Artificial Intelligence Variant Prioritization. JMIR Bioinformatics and Biotechnology. 2022;3(1):e37701.
15. Li J, Jew B, Zhan L, Hwang S, Coppola G, Freimer NB, Sul JH. ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. PLOS Computational Biology. 2019;15(12):e1007556.
16. Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Briefings in Bioinformatics. 2021;22(3):bbaa177.

Supplementary Files

sangerbypasssupplement0422submit.docx

Additional files

Additional file 1
Table S1. GIAB cell lines overview
Table S2. Complete list of all candidate quality features for model training
Figure S3. Density plots for eight selected key features between true positives and false positives
Table S4. Summary of metrics for combinations of features and datasets
Table S5.
Summary of Sanger confirmation results of SNV calls that were classified as high-quality by the two-tiered workflow

Editorial history at the journal:
Editorial decision: Revision requested 28 May, 2025
Reviews received at journal 27 May, 2025
Reviews received at journal 17 May, 2025
Reviewers agreed at journal 13 May, 2025
Reviewers agreed at journal 09 May, 2025
Reviewers agreed at journal 07 May, 2025
Reviewers agreed at journal 07 May, 2025
Reviewers invited by journal 06 May, 2025
Editor invited by journal 05 May, 2025
Editor assigned by journal 05 May, 2025
Submission checks completed at journal 02 May, 2025
First submitted to journal 02 May, 2025
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20,645\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,880\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e760\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG002\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20,620\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,553\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,879\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e674\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG003\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20,511\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,330\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,598\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e732\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG004\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20,990\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,600\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,943\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e657\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG005\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e19,504\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e15,439\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,807\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e632\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e18,890\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,976\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,220\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e756\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG007\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e19,649\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,314\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,686\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e628\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e282,076\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e222,489\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e212,397\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10,092\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colspan=\"6\"\u003e\n \u003cp\u003eSeven cell lines were sequenced twice in two separate flow cells as technical replicates. Truth was retrieved from GIAB benchmark regions. 
True positive (TP) is a variant present in the truth set; false positive (FP) is a variant absent in the truth set.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003e\u003cstrong\u003eb.\u003c/strong\u003e\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003cdiv align=\"left\" class=\"colspec\"\u003e\u003cbr\u003e\u003c/div\u003e\n \u003ctable id=\"Taba\" border=\"1\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" colspan=\"7\"\u003e\n \u003cp\u003eCounts of heterozygous SNVs for cross-validation and final test\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eFirst half for CV\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eSecond half for final training\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFlow cells\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNIST ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePresent (TP)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAbsent (FP)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePresent (TP)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAbsent (FP)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"7\"\u003e\n \u003cp\u003eFlow Cell 1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eHG001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,554\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e412\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,555\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e412\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,933\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG002\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,914\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e386\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,915\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e385\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,600\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG003\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,791\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e370\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,792\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e369\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,322\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG004\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,964\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e351\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,965\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e350\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,630\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG005\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,404\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e409\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,404\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e410\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,627\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,111\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e374\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,112\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e373\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,970\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG007\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,331\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e326\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,332\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e326\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,315\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"7\"\u003e\n \u003cp\u003eFlow Cell 2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e7,560\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e380\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,560\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e380\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,880\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG002\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,939\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e337\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,940\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e337\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,553\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG003\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,799\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e366\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,799\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e366\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,330\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG004\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,971\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e329\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,972\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e328\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16,600\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG005\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,403\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e316\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,404\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e316\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,439\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,110\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e378\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,110\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e378\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14,976\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHG007\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,343\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e314\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7,343\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e314\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15,314\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e106,194\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5,048\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e106,203\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5,044\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e222,489\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colspan=\"7\"\u003e\n \u003cp\u003eFor each cell line in each flow cell, both TP and FP variants were split evenly into two groups. The first group was used for cross-validation; both groups were used in the second and final testing phases.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003eCross-validation\u003c/h2\u003e\n \u003cp\u003eLOOCV was performed on the first subset to train and test logistic regression, random forest, gradient boosting, AdaBoost and EasyEnsemble models. In this first phase of model training, all 13 quality features were used and target capture rates of 95% and 99% were tested, corresponding to 5 in 100 and 1 in 100 false positive calls being missed, respectively. Logistic regression and random forest models achieved the highest false positive capture rates, while gradient boosting achieved the best all-around performance, combining a high FP capture rate with a low TP flag rate (Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). 
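\u003c/p\u003e
\u003cp\u003eThe capture-rate targets described above can be illustrated with a minimal Python sketch. It assumes that each variant has a model score (a higher score meaning more likely false positive) and a label of 1 for false positive variants. The function names and the toy data are hypothetical and are not taken from the study\u0026rsquo;s pipeline.\u003c/p\u003e

```python
import math


def threshold_for_capture_rate(scores, labels, target_rate):
    """Return the highest score cutoff that flags at least `target_rate`
    of the label-1 (false positive variant) examples."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("at least one label-1 example is required")
    # Number of label-1 examples that must score at or above the cutoff.
    # The small epsilon guards against floating-point round-up.
    needed = max(1, math.ceil(target_rate * len(positives) - 1e-9))
    return positives[needed - 1]


def capture_and_flag_rates(scores, labels, cutoff):
    """Return (FP capture rate, TP flag rate) when flagging scores >= cutoff."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    captured = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    flagged = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return captured / n_pos, flagged / n_neg


# Toy data (hypothetical): four false positive variants (label 1)
# followed by four true positive variants (label 0).
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoff = threshold_for_capture_rate(scores, labels, 0.75)
print(cutoff, capture_and_flag_rates(scores, labels, cutoff))
```

\u003cp\u003eLowering the cutoff to reach a higher capture rate also flags more true positive variants, which is the trade-off reported in Table 2.\u003c/p\u003e
\u003cp\u003e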
Low standard deviations were observed for all models across different genetic backgrounds, indicating consistent and robust model performance.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eSummary of cross-validation experiments for five models on heterozygous SNVs\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eRecall 0.95 (TPR)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eRecall 0.99 (TPR)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVariant type\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eModels\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV FP capture rate (TPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV TP flag rate (FPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV FP capture rate (TPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV TP flag rate (FPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV ROC AUC (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"5\"\u003e\n \u003cp\u003eSNV-heterozygous\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradientBoosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e91.34 \u0026plusmn; 2.32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e19.25 \u0026plusmn; 3.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e96.56 \u0026plusmn; 0.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e54.33 \u0026plusmn; 4.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.77 \u0026plusmn; 0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogisticRegression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.88 \u0026plusmn; 1.52\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e41.81 \u0026plusmn; 6.89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e99.00 \u0026plusmn; 0.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89.43 \u0026plusmn; 3.54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.52 \u0026plusmn; 0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEasyEnsemble\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e93.81 \u0026plusmn; 1.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e34.46 \u0026plusmn; 5.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.50 \u0026plusmn; 0.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e75.12 \u0026plusmn; 5.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.34 \u0026plusmn; 0.88\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdaBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e88.19 \u0026plusmn; 2.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12.75 \u0026plusmn; 2.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e91.90 \u0026plusmn; 2.27\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e29.62 \u0026plusmn; 4.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e93.83 \u0026plusmn; 1.04\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRandomForest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.22 \u0026plusmn; 1.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e50.68 \u0026plusmn; 8.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e99.07 \u0026plusmn; 0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.30 \u0026plusmn; 4.86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e92.79 \u0026plusmn; 1.05\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eMean and standard deviation for true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic curve (ROC AUC) in cross-validation (CV) at 0.95 and 0.99 recall rates. Models are sorted by AUC values.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003eModel evaluation and selection\u003c/h2\u003e\n \u003cp\u003eIn the second phase of testing, all five models were trained on the first dataset and tested on the second dataset. The results of phase two were comparable to those of phase one: gradient boosting exhibited the most balanced performance, and the random forest and logistic regression models exceeded EasyEnsemble and AdaBoost in false positive capture rates (Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e and Fig. 1). 
Refinement of model performance was then explored by comparing the full set of sequence features with selected high-impact features.\u003c/p\u003e\n \u003cp\u003eFeature coefficients (applicable to LR) and importance (applicable to RF and GB) were estimated using both raw and scaled data to determine the relative contribution of each feature to the associated true positive or false positive label, and to eliminate redundant features. This assessment indicated that scaling yielded inconsistent patterns in the LR coefficients for several features (Table \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e); however, eight features (frequency, read count, coverage, forward count, reverse count, forward/reverse ratio, read position probability, and overlap with complex regions) exhibited consistent trends between raw and scaled data. The features with the highest contributions were selected as key features for subsequent model training and comparisons. Density plots for these key features, showing the difference between true positive and false positive variants, are provided in Additional file 1: Figure S3. The effects of imbalanced data were also investigated using two resampling methods for balancing skewed datasets: SOS and SMOTE. In total, eighteen combinations of feature sets and class balance were evaluated across the RF, LR and GB models to determine the optimal conditions for training (Table S3). 
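\u003c/p\u003e
\u003cp\u003eSimple oversampling as described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study\u0026rsquo;s implementation: the function name simple_oversample, the seed and the toy data are hypothetical, and SMOTE is not shown.\u003c/p\u003e

```python
import random


def simple_oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class has as
    many rows as the largest class (simple oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw duplicates with replacement from the existing rows of this class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical): eight true positive variants (label 0)
# and two false positive variants (label 1), one feature each.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = simple_oversample(rows, labels, seed=1)
print(balanced_labels.count(0), balanced_labels.count(1))
```

\u003cp\u003eIn common practice, resampling of this kind is applied to the training portion only, so that duplicated data points do not leak into the data used for testing.\u003c/p\u003e
\u003cp\u003e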
The top-performing model configurations according to F1-score were GB trained with all quality features and raw imbalanced data, LR trained with selected key features and SOS-balanced data, and RF trained with all quality features and raw imbalanced data (Table \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e; the full list of statistics for all combinations can be found in Additional file 1: Table S4).\u003c/p\u003e\n \u003cp\u003eBecause each optimized model has distinct advantages, we combined all three models, using the thresholds obtained at the 0.99 recall rate in the training set, to form a two-tiered (2T) workflow for the evaluation of heterozygous SNVs (Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). The logistic regression and random forest models demonstrated the highest sensitivity in identifying false positives; we therefore integrated these two models as the initial classifier to detect prominent low-confidence calls. Given the superior robustness of the gradient boosting model, we implemented it as a secondary layer to classify the uncertain cases that emerged from the combined random forest and logistic regression analysis. 
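\u003c/p\u003e
\u003cp\u003eOne way such a two-tiered decision could be wired together is sketched below in Python. The combination logic is an assumption made for illustration only: agreement between LR and RF is treated as decisive and disagreement as uncertain, whereas the actual workflow is the one defined in Fig. 2. The function name is hypothetical, and the default cutoffs are the Table 3 thresholds converted from percent to probabilities, used purely as placeholders.\u003c/p\u003e

```python
def two_tier_call(lr_score, rf_score, gb_score,
                  lr_cutoff=0.0029, rf_cutoff=0.1027, gb_cutoff=0.0042):
    """Classify one heterozygous SNV from three model scores.

    Scores are assumed to be predicted probabilities that the variant
    is a false positive (label 1). Cutoffs are illustrative placeholders.
    """
    lr_flag = lr_score >= lr_cutoff
    rf_flag = rf_score >= rf_cutoff

    # Tier 1 (assumed rule): both first-tier models agree.
    if lr_flag and rf_flag:
        return "low-confidence"
    if not lr_flag and not rf_flag:
        return "high-confidence"

    # Tier 2 (assumed rule): the models disagree, so the case is treated
    # as uncertain and gradient boosting makes the final call.
    return "low-confidence" if gb_score >= gb_cutoff else "high-confidence"


# Hypothetical scores for three variants.
print(two_tier_call(0.50, 0.60, 0.70))    # both first-tier models flag the call
print(two_tier_call(0.001, 0.05, 0.90))   # neither first-tier model flags the call
print(two_tier_call(0.50, 0.05, 0.001))   # disagreement, resolved by the second tier
```

\u003cp\u003eIn this sketch only low-confidence calls would be sent for orthogonal confirmation; any additional rule-based criteria would be applied as separate checks.\u003c/p\u003e
\u003cp\u003e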
Additional criteria, including allele frequency ranges, minimum coverage and genomic location, were also integrated into the pipeline to prevent the bypass of variants with atypical features.\u003c/p\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eSummary of second-phase training experiments for five models on heterozygous SNVs\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSecond-phase training\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"6\"\u003e\n \u003cp\u003eRecall 0.99 (TPR) in the second-phase training\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVariant type\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eModels\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFP capture rate (TPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTP flag rate (FPR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTP capture rate (TNR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFP flag rate (FNR %)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROC AUC (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThreshold (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"5\"\u003e\n \u003cp\u003eSNV-het\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradientBoosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e98.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10.74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.42\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEasyEnsemble\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e19.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e48.64\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogisticRegression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63.82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.29\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdaBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e6.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e93.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e49.44\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRandomForest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e99.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e97.85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10.27\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003eTrue positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), area under the receiver operating characteristic curve (ROC AUC) and decision threshold obtained at the 0.99 recall rate in the second phase. 
Models are sorted by AUC values.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u0026nbsp;\u003ctable id=\"Tab4\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAssessment of feature importances/coefficients for RF, LR and GB using both raw and scaled data\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHET SNV\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003efrequency\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eread count\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ecoverage\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eforward count\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ereverse count\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eforward/reverse ratio\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eaverage quality\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLR_coefficients\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-7.54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.1\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLR_coefficients_scaled\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-11.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-1.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e16.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-1.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF_importance\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.03\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.04\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.03\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGB_importance\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.03\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e0.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eprobability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eread position probability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eread direction probability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehomopolymer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehomopolymer length\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ein complex region\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLR_coefficients\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-1.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLR_coefficients_scaled\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-0.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-3.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e-0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF_importance\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGB_importance\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eCoefficient evaluation for logistic regression (LR) and feature importance evaluation for random forest (RF) and gradient boosting (GB) for both raw and scaled data using all thirteen next-generation sequencing (NGS) features which are 
initially available.\u0026nbsp;\u003c/p\u003e\n \u003ctable id=\"Tab5\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eFinal models selection of the top performing combinations of features and datasets for heterozygous SNVs\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003efeatures\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003edata status\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emodel name\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emodel name\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eTPR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFPR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eTNR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFNR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eROC\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eThreshold (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eS\u003csub\u003es\u003c/sub\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eF1\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eimbalanced\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eimb_all\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.12\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10.74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ekey\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSOS balanced\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003esos_key\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e28.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e71.34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.82\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eimbalanced\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eimb_all\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e99.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e37.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e97.85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003eThe best models selected for the Sanger bypass pathway are gradient boosting (GB) trained with all 13 features and imbalanced original data (imb_all), logistic regression (LR) trained with key features and simple-oversampling (SOS) balanced data (sos_key), and random forest (RF) trained with all 13 features and imbalanced original data (imb_all). S\u003csub\u003es\u003c/sub\u003e: scaled capture rate score. Models are sorted by F1 values.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003eFinal model evaluation\u003c/h2\u003e\n \u003cp\u003eThe final 2T workflow was then tested on the other half of the GIAB variants, which served as the testing set (a total of 106,203 known TPs and 5,044 FPs, Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e). Tier one (RF\u0026thinsp;+\u0026thinsp;LR) returned variant predictions for 24,683 variants (22,320 high-confidence\u0026thinsp;+\u0026thinsp;2,363 low-confidence) (Table \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003ea). 
However, approximately 77.8% (86,564/111,247) of variants with known truth could not be predicted as present or absent at the selected thresholds (\u0026ldquo;Unknown\u0026rdquo; by RF\u0026thinsp;+\u0026thinsp;LR: 13,972\u0026thinsp;+\u0026thinsp;72,592); these variants therefore required processing by the GB machine learning model as the second tier of the confirmatory bypass workflow. GB correctly classified 86.7% (72,503/83,628) of the remaining true positive variants and 96.9% (2,847/2,936) of the true negative variants. Taken together, the 2T predictions were concordant with GIAB truth for 89.7% (99,766/111,247) of the total variants. According to the established workflow, 89.3% ((22,314\u0026thinsp;+\u0026thinsp;72,503)/106,203) of GIAB true positives were labeled as high-confidence and would be eligible to bypass orthogonal confirmation, while the 16,335 variants predicted to be low-confidence (2,363 by RF\u0026thinsp;+\u0026thinsp;LR and 13,972 by GB), including 11,386 true positive variants (261 by RF\u0026thinsp;+\u0026thinsp;LR and 11,125 by GB), would require reflex to Sanger sequencing (Table \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003ea). Additionally, 95 (6 by RF\u0026thinsp;+\u0026thinsp;LR and 89 by GB) false positive variants were incorrectly predicted to be true positive by the combined models, resulting in a false positive rate of 1.88%. 
Overall, the 2T model delivers a PPV/precision of 99.9% (94,817/(94,817\u0026thinsp;+\u0026thinsp;95)), a sensitivity of 89.3% (94,817/(94,817\u0026thinsp;+\u0026thinsp;11,386)), and a specificity of 98.1% (4,949/(95\u0026thinsp;+\u0026thinsp;4,949)) (Table \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003eb) for identifying true positive variants in the truth set.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u0026nbsp;\u003ctable id=\"Tab6\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ea. Performance of three models combined on the GIAB testing set\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"3\"\u003e\n \u003cp\u003eGIAB cell lines variants truth\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradient Boosting Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePresent (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAbsent (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotals\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"6\"\u003e\n \u003cp\u003eRandom Forest\u0026thinsp;+\u0026thinsp;Logistic Regression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eHigh-confidence (Positive)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003eLow-confidence (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh-confidence (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e22,314\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e22,320\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eLow-confidence (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLow-confidence (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e261\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2,102\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2,363\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh-confidence (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eUnknown\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLow-confidence (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11,125\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e2,847\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e13,972\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh-confidence (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e72,503\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e72,592\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotals\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e106,203\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5,044\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e111,247\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003cdiv align=\"left\" class=\"colspec\"\u003e\u003cbr\u003e\u003c/div\u003e\u0026nbsp;\u0026nbsp;\u003ctable id=\"Tab7\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eb. 
Confusion matrix of the 2T model on the GIAB testing set\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eThe Truth\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePresent (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAbsent (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003e2T Prediction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh-confidence (Positive)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94,817\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e95\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLow-confidence (Negative)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11,386\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4,949\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eThe first step in the pipeline combines the random forest and logistic regression models to classify variants; gradient boosting then serves as the third model to further predict the status of variants. 
High-confidence variants predicted by the 2T model qualify for Sanger bypass, whereas low-confidence variants require Sanger confirmation.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003eExternal validation\u003c/h2\u003e\n \u003cp\u003eValidation of the models was performed on a subset of new heterozygous SNVs (n\u0026thinsp;=\u0026thinsp;93) identified through end-to-end testing of samples on the exome panel to assess overfitting of the final models to the training dataset. The accuracy of the predictions for this new dataset was determined using Sanger sequencing data as a source of truth. The concordance rate between the machine-learning predictions and Sanger sequencing results was 100% in this validation study (Additional file 1: Table S5), suggesting that overfitting did not contribute to the high sensitivity rates observed in the GIAB testing dataset.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe development of artificial intelligence for decision support in healthcare is rapidly gaining acceptance among the medical and scientific communities. In recent years, several publications have described machine learning tools that can distinguish between true positive and false positive NGS calls based on sequence metrics and variant characteristics. In this study, we built upon the framework developed by Holt et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] by using a combination of continuous and binary models to establish a workflow for bypassing confirmation of heterozygous SNVs detected by NGS. The decision to limit the scope of our development efforts to heterozygous variants was based on our observation that the overwhelming majority of clinically significant variants detected by NGS in our laboratory are heterozygous sequence changes. 
Consequently, developing models that can accurately classify heterozygous variants as true or false positives provides the greatest benefit in terms of financial savings and improved turnaround. Of note, heterozygous structural variants (deletions, insertions, and indels) in the GIAB dataset were initially included in our early assessment of different models; however, our preliminary data suggested poor performance (data not shown). Though previous studies have reported success in applying models to indel prediction [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan additionalcitationids=\"CR15\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], our discordant outcomes can be reasonably attributed to differences between methodologies and bioinformatics pipelines.\u003c/p\u003e \u003cp\u003eThe decision to adopt any strategy to bypass confirmatory testing in a clinical setting, regardless of whether machine learning is involved, should be taken only after a thorough risk assessment. While several studies describe models with impressive performance, the majority of approaches failed to achieve a false positive capture rate of 100%. Training and testing of various models in our laboratory also failed to capture all false positives in the GIAB datasets. Thus, to reduce the risk of reporting false positives, additional criteria should be considered when designing confirmatory bypass workflows, such as thresholds for allele frequency range, minimum coverage, and genomic regions ineligible for bypass. 
Additionally, it may be advisable to limit the use of predictive models for medically actionable conditions to avoid the immediate and irreversible harm stemming from the reporting of potential false positives.\u003c/p\u003e \u003cp\u003eAlthough we believe the machine learning models and the proposed workflow described here are conservative and robust, we recognize that our approach also has limitations. Notably, training was performed exclusively on variant calls in the GIAB benchmark regions. The use of large datasets with known truth is ideal for training robust models; however, the final models may not perform well on variants beyond those high-confidence regions if those variants have different characteristics. Laboratories may wish to circumvent this limitation by restricting the training data to regions of interest that correspond to target capture regions in their specific panels. This approach would yield a reduced number of variants for training and testing, which may result in more customized models with better performance due to reduced complexity of the input variants. Additionally, training models on the GIAB dataset is predicated on the assumption that GIAB benchmark files do not contain errors. Using verified datasets for model training is clearly preferable, but it is not feasible for any lab to perform confirmatory tests for all GIAB benchmark variants. 
Internal datasets might be a viable alternative for some laboratories provided that these datasets have a sufficient number of false positives for a model to learn the distinguishing characteristics of the minority class.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eTo summarize, our study suggests that the general approach of using GIAB benchmark data along with variant quality features to train machine learning models can significantly improve clinical NGS workflows by easing the burden of orthogonal confirmations on labor, cost, and turnaround time. Although these models were developed on whole exome data using our internal NGS pipeline, customized models can also be developed according to different pipelines and NGS libraries.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eNGS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNext generation sequencing\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSNV\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSingle nucleotide variant\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eWES\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eWhole exome sequencing\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eGIAB\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eGenome in a bottle\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eTPR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003etrue positive rate\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv 
class=\"Term\"\u003eRF\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRandom Forest model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eLR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLogistic Regression model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eGB\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eGradient Boosting model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eNCBI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNational Center for Biotechnology Information\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eLOOCV\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eleave-one-sample-out cross validation\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCV\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003ecross validation\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSOS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003esimple oversampling\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSMOTE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003esynthetic minority oversampling technique\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eAUC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003earea under the curve\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e 
\u003cdiv class=\"Term\"\u003eAUROC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003earea under receiver operating characteristics\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eTNR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003etrue negative rate\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eFPR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003efalse positive rate\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eVUS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003evariants with uncertain significance\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e2T\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003etwo-tiered\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePPV\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003epositive predictive value.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approvals and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2024. Per the US Federal Policy for the Protection of Human Subjects, institutional review board exemption is applicable due to de-identification of the presented data (45 CFR part 46.101(b)(4)). 
The contents of this manuscript have been reviewed for compliance by the Labcorp Legal department and the Department of Science and Technology.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe raw sequencing data (fastq files) for the GIAB cell lines generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1257936. Sequence datasets generated from patient specimens will not be made available for distribution as an additional measure to protect patient privacy. The software described in this article is proprietary and subject to company regulations. However, inquiries regarding access or usage may be directed to the corresponding author.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMY, ZZ, QZ, AK, PO, SL, NL, and JR are current employees of Labcorp, a commercial laboratory that receives compensation for clinical testing. A provisional patent application for the machine learning model presented herein has been submitted with MY, QZ, AK, SL and JR listed as inventors.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFunding for this study was solely provided by Labcorp Genetics.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMY-Conceptualization, Methodology, Software, Data Curation, Writing-Original Draft, Review \u0026amp; Editing, Visualization; JR-Conceptualization, Methodology, Clinical Analysis, Investigation, Writing-Original Draft, Review \u0026amp; Editing. MY and JR were major contributors to preparing the manuscript. 
ZZ performed Sanger sequencing confirmation including amplicon design, Sanger sequencing assay and data analysis. PO supervised the overall wet lab process. AK, QZ, SL and NL \u0026ndash; Review and Editing, Supervision.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot Applicable\u003c/p\u003e\n\u003cp\u003eAuthors\u0026rsquo; information\u003c/p\u003e\n\u003cp\u003eMY-Sr. Bioinformatics Scientist; JR-Clinical laboratory director, American Board of Medical Genetics and Genomics certification in Cytogenetics and Molecular Genetics, FACMG; NL-Sr. Laboratory Director, American Board of Medical Genetics and Genomics certification in Cytogenetics and Molecular Genetics, FACMG; ZZ-Research and Development Scientist III; PO-Technical Director II; QZ-Principal Bioinformatics Scientist; AK-Director of Information Technology; SL-Executive Director, Scientific Projects, Diagnostics \u0026amp; Precision Medicine.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eMu W, Lu H-M, Chen J, Li S, Elliott AM. Sanger Confirmation Is Required to Achieve Optimal Sensitivity and Specificity in Next-Generation Sequencing Panel Testing. The Journal of Molecular Diagnostics. 2016;18(6):923-932.\u003c/li\u003e\n \u003cli\u003eMcCourt CM, McArt DG, Mills K, Catherwood MA, Maxwell P, Waugh DJ, Hamilton P, O\u0026apos;Sullivan JM, Salto-Tellez M. Validation of Next Generation Sequencing Technologies in Comparison to Current Diagnostic Gold Standards for BRAF, EGFR and KRAS Mutational Analysis. PLoS ONE. 2013;8(7):e69604.\u003c/li\u003e\n \u003cli\u003eLincoln SE, Truty R, Lin C-F, Zook JM, Paul J, Ramey VH, Salit M, Rehm HL, Nussbaum RL, Lebo MS. A Rigorous Interlaboratory Examination of the Need to Confirm Next-Generation Sequencing\u0026ndash;Detected Variants with an Orthogonal Method in Clinical Genetic Testing. The Journal of Molecular Diagnostics. 
2019;21(2):318-329.\u003c/li\u003e\n \u003cli\u003eAlharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Human Genomics. 2022;16(1).\u003c/li\u003e\n \u003cli\u003eArteche-L\u0026oacute;pez A, \u0026Aacute;vila-Fern\u0026aacute;ndez A, Romero R, Riveiro-\u0026Aacute;lvarez R, L\u0026oacute;pez-Mart\u0026iacute;nez MA, Gim\u0026eacute;nez-Pardo A, V\u0026eacute;lez-Monsalve C, Gallego-Merlo J, Garc\u0026iacute;a-Vara I, Almoguera B et al. Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. Scientific Reports. 2021;11(1).\u003c/li\u003e\n \u003cli\u003eBaudhuin LM, Lagerstedt SA, Klee EW, Fadra N, Oglesbee D, Ferber MJ. Confirming Variants in Next-Generation Sequencing Panel Testing by Sanger Sequencing. The Journal of Molecular Diagnostics. 2015;17(4):456-461.\u003c/li\u003e\n \u003cli\u003eBeck TF, Mullikin JC; NISC Comparative Sequencing Program; Biesecker LG. Systematic Evaluation of Sanger Validation of Next-Generation Sequencing Variants. Clin Chem. 2016;62(4):647-654.\u003c/li\u003e\n \u003cli\u003eDe Cario R KA, Suraci S, Magi A, Volta A, Marcucci R, Gori AM, Pepe G, Giusti B, Sticchi E. Sanger Validation of High-Throughput Sequencing in Genetic Diagnosis: Still the Best Practice? Front Genet. 2020;11(592588).\u003c/li\u003e\n \u003cli\u003ePellegrino E, Jacques C, Beaufils N, Nanni I, Carlioz A, Metellus P, Ouafik LH. Machine learning random forest for predicting oncosomatic variant NGS analysis. Scientific Reports. 2021;11(1).\u003c/li\u003e\n \u003cli\u003eG. Marceddu TD, G. Guerri, A. zulian, C. Marinelli, M. Bertelli. Analysis of machine learning algorithms as integrative tools for validation of next generation sequencing data. European Review for Medical and Pharmacological Sciences. 2019;23(8139-8147).\u003c/li\u003e\n \u003cli\u003eJeroen van den Akker GM, Anjali D. Zimmer and Alicia Y. Zhou. 
A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing. BMC Genomics. 2018;19:263.\u003c/li\u003e\n \u003cli\u003eHolt JM, Kelly M, Sundlof B, Nakouzi G, Bick D, Lyon E. Reducing Sanger confirmation testing through false positive prediction algorithms. Genetics in Medicine. 2021;23(7):1255-1262.\u003c/li\u003e\n \u003cli\u003eHandelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, Lee MJ, Asadi H. Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods. American Journal of Roentgenology. 2019;212(1):38-43.\u003c/li\u003e\n \u003cli\u003eHuang Y-S, Hsu C, Chune Y-C, Liao IC, Wang H, Lin Y-L, Hwu W-L, Lee N-C, Lai F. Diagnosis of a Single-Nucleotide Variant in Whole-Exome Sequencing Data for Patients With Inherited Diseases: Machine Learning Study Using Artificial Intelligence Variant Prioritization. JMIR Bioinformatics and Biotechnology. 2022;3(1):e37701.\u003c/li\u003e\n \u003cli\u003eLi J, Jew B, Zhan L, Hwang S, Coppola G, Freimer NB, Sul JH. ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. PLOS Computational Biology. 2019;15(12):e1007556.\u003c/li\u003e\n \u003cli\u003eTalukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Briefings in Bioinformatics. 
2021;22(3):bbaa177.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Next generation sequencing, Sanger confirmation, Machine learning, Clinical decision-support tool","lastPublishedDoi":"10.21203/rs.3.rs-6513733/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6513733/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground: \u003c/strong\u003eOrthogonal confirmation of variants identified by next-generation sequencing (NGS) is routinely performed in many clinical laboratories to improve assay specificity. However, confirmatory testing of all clinically significant variants increases both turnaround time and operating costs for laboratories. Improvements to early NGS methods and bioinformatics algorithms have dramatically improved variant calling accuracy, particularly for single nucleotide variants (SNVs), thus calling into question the necessity of confirmatory testing for all variant types. 
The purpose of this study is to develop a new machine learning approach to capture false positive heterozygous SNVs from whole exome sequencing (WES) data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults: \u003c/strong\u003eWES variant calls from Genome in a Bottle (GIAB) cell lines and their associated quality features were used to train five different machine learning models to predict whether a variant was a true positive or false positive based on quality metrics. Logistic regression and random forest models exhibited the highest false positive capture rates among the selected models, but gradient boosting achieved the best balance between false positive capture rates and true positive flag rates. Further assessment using simulated false positive events as well as different combinations of quality features showed that model performance can be refined. Integration of the highest-performing models into a custom two-tiered confirmation bypass pipeline with additional guardrail metrics achieved 99.9% precision and 98% specificity in the identification of true positive heterozygous SNVs within the GIAB benchmark regions. Furthermore, testing on an independent set of heterozygous SNVs (n=93) detected by exome sequencing of patient samples and cell lines demonstrated 100% accuracy.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions:\u003c/strong\u003e Machine learning models can be trained to classify SNVs into high- or low-confidence categories with high precision, thus reducing the level of confirmatory testing required. 
Laboratories interested in deploying such models should consider incorporating additional quality criteria and thresholds to serve as guardrails in the assessment process.\u003c/p\u003e","manuscriptTitle":"Determination of high-confidence germline genetic variants in next- generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-13 09:49:23","doi":"10.21203/rs.3.rs-6513733/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-05-28T09:29:38+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-27T18:17:28+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-17T12:20:24+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"310205574888617841559290576777883226801","date":"2025-05-13T04:39:26+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"332680662121470907872165777154157404801","date":"2025-05-09T13:04:13+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"94872764866677970003661044508197589915","date":"2025-05-07T13:28:55+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"288532618917849727527053694822220647871","date":"2025-05-07T05:31:33+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-05-07T00:35:24+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-05-05T11:52:41+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-05-05T11:42:06+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-05-02T18:09:54+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC 
Genomics","date":"2025-05-02T18:08:52+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"e4c30328-a44d-48f5-9ad9-3eabb35cce60","owner":[],"postedDate":"May 13th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-08-11T16:02:58+00:00","versionOfRecord":{"articleIdentity":"rs-6513733","link":"https://doi.org/10.1186/s12864-025-11889-z","journal":{"identity":"bmc-genomics","isVorOnly":false,"title":"BMC Genomics"},"publishedOn":"2025-08-06 15:57:48","publishedOnDateReadable":"August 6th, 2025"},"versionCreatedAt":"2025-05-13 09:49:23","video":"","vorDoi":"10.1186/s12864-025-11889-z","vorDoiUrl":"https://doi.org/10.1186/s12864-025-11889-z","workflowStages":[]},"version":"v1","identity":"rs-6513733","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6513733","identity":"rs-6513733","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

