Leveraging the largest harmonized epigenomic data collection for metadata prediction validated and augmented over 350,000 public epigenomic datasets

doi:10.1101/2025.09.04.670545

2025 · doi:10.1101/2025.09.04.670545

preprint OA: closed


Abstract

Epigenomic data found in public databases often suffer from non-standardized and incomplete metadata. There are currently no automated approaches to validate or correct missing or inaccurate information listed in databases. To tackle this challenge, we harnessed the extensive harmonized data and metadata provided by the EpiATLAS project of the International Human Epigenome Consortium (IHEC) to train EpiClass, a suite of machine learning classifiers that can predict key metadata (∼98% accuracy), including experimental assay, donor sex, biospecimen and sample cancer status. The development of these classifiers enabled the identification of a few mislabeled and low-quality datasets in the EpiATLAS project, while also filling in most of the missing metadata with high confidence. These classifiers were also validated on ENCODE datasets absent from the initial training, then applied to assess more than 350,000 human ChIP-Seq and RNA-Seq datasets from public repositories. Overall, this effort not only validated the accuracy of the vast majority of assays reported by the original authors, but also unveiled ∼500 datasets with discrepancies, in particular data swaps within series of experiments. More importantly, EpiClass also supplied high-confidence predictions for over 320,000 metadata attributes of the biological sample, such as sex, cancer status and biomaterial type, which had been originally omitted in the majority of cases. Our work introduces the first systematic approach for metadata correction and augmentation, enhancing the quality and reliability of publicly available epigenomic data.

Competing Interest Statement

The authors have declared no competing interest.
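The validation and augmentation workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the EpiClass implementation: the `Prediction` record, the `triage` function, the outcome names, and the 0.9 confidence threshold are all assumptions made here for illustration. The idea is only that a confident prediction either confirms a reported label, contradicts it, or fills in a missing one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical record pairing a reported label with a classifier output."""
    dataset_id: str
    reported: Optional[str]  # label supplied by the submitter, None if missing
    predicted: str           # classifier's top label
    confidence: float        # classifier's probability for the top label


def triage(pred: Prediction, threshold: float = 0.9) -> str:
    """Sort a prediction into one of four outcomes.

    The threshold is an assumed value, not one taken from the paper.
    """
    if pred.confidence < threshold:
        return "uncertain"    # not confident enough to act on
    if pred.reported is None:
        return "augmented"    # missing metadata filled in
    if pred.reported == pred.predicted:
        return "validated"    # reported label confirmed
    return "discrepancy"      # candidate mislabel or data swap


if __name__ == "__main__":
    examples = [
        Prediction("ds1", "H3K27ac", "H3K27ac", 0.99),
        Prediction("ds2", "H3K4me3", "H3K27me3", 0.97),
        Prediction("ds3", None, "female", 0.95),
        Prediction("ds4", "male", "female", 0.55),
    ]
    for p in examples:
        print(p.dataset_id, triage(p))
```

Low-confidence predictions are kept in a separate bucket so that neither corrections nor augmentations are made on weak evidence.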


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00