Genome-wide classification of tumor-derived reads from bulk long-read sequencing

preprint OA: closed CC-BY-4.0
Abstract

DNA extracted from tissue samples typically derives from a complex mixture of cell types. Without single-cell analysis, it has generally been impossible to determine the cell type of origin for most molecules. One clear example of this is the complex milieu of a human neoplasm. Here, we develop ROCIT (https://github.com/tobybaker/rocit), a transformer-based model to classify the tumor or non-tumor origin of individual reads from bulk tumor samples sequenced with long-read whole genome sequencing. Using somatic mutations to derive training data, ROCIT uses read-level methylation patterns to accurately classify reads from anywhere in the genome without requiring adjacent normal tissue or the explicit identification of tumor differentially methylated regions. We apply ROCIT to a cohort of prostate and ovarian tumors and demonstrate high classification accuracy across the entire genome. We then demonstrate the potential of ROCIT predictions to improve somatic variant calling. ROCIT represents a major step forward in the analysis of bulk tumors with long reads, enabling the accurate and sensitive identification of reads with specific cell types of origin genome-wide.

Competing Interest Statement

P.T.S. is a shareholder in Convergent Genomics, a consultant to Illumina Inc and Exact Biosciences.

Citation neighborhood

No in-corpus citations are linked to this paper yet. This is a recent paper (2026); citing works typically take a year or two to appear, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-24T02:00:01.246996+00:00
License: CC-BY-4.0