In silico generation of synthetic cancer genomes using generative AI

preprint OA: closed
Abstract

Cancer originates from alterations in the genome, and understanding how these changes lead to disease is crucial for achieving the goals of precision oncology. Connecting genomic alterations to health outcomes requires extensive computational analysis using accurate algorithms. Over the years, these algorithms have become increasingly sophisticated, but a severe shortage of open-access gold-standard datasets presents a fundamental challenge. Because genomic data are considered personal health information, only an extremely limited number of deeply sequenced legacy cancer genomes can be shared and redistributed. As a result, tool benchmarking is often conducted on the same small set of genomes, which were sequenced with older technologies and have uncertain ground truth. This is a major obstacle to the development of improved analytic tools. To address this issue, we have developed OncoGAN, a novel generative AI tool that uses a combination of generative adversarial networks and tabular variational autoencoders to generate realistic but entirely synthetic cancer genomes based on training sets derived from large-scale genomic projects. Our results demonstrate that this approach accurately reproduces the scale, distribution, and characteristics of somatic point mutations, copy number alterations and structural variants across multiple common cancer types, while protecting donor privacy. OncoGAN accurately recapitulates tumor type-specific mutational signatures as well as the positional distribution of somatic mutations. To evaluate the fidelity of the simulations, we tested the synthetic genomes using DeepTumour, a software tool that identifies tumor types based on mutational patterns, and demonstrated a high level of concordance between the tumor type each synthetic genome was generated to represent and the type predicted by DeepTumour. We also showed that augmenting real donor data with OncoGAN-generated synthetic data could be used to train a more accurate version of DeepTumour.
This tool will allow the generation of an extensive and realistic set of training and testing cancer genomes whose ground truth is known exactly. This advance provides computational biologists with the ability to develop realistic cancer genome benchmarking sets and make them available to the research community for the testing, development and enhancement of cancer genome analysis tools.

Competing Interest Statement
The authors have declared no competing interest.

Footnotes
- Added new results on simulation of copy number alterations and structural variants
- Added evaluation of driver calling algorithms
- Figure 3 revised
- Supplemental files updated

https://huggingface.co/collections/anderdnavarro/oncogan-67110940dcbafe5f1aa2d524
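
The abstract evaluates fidelity by comparing the tumor type each synthetic genome was generated to represent with the type a classifier predicts for it. A minimal sketch of that kind of concordance calculation follows. It is illustrative only: the function name `concordance`, the label strings, and the toy inputs are assumptions made for this example. It does not call DeepTumour or OncoGAN, and its numbers are not results from the paper.

```python
from collections import Counter


def concordance(intended, predicted):
    """Return (overall, per_type) agreement between two label lists.

    intended:  tumor type each synthetic genome was generated to represent
    predicted: tumor type assigned to the same genome by a classifier
    Both are hypothetical inputs; any string labels work.
    """
    if len(intended) != len(predicted):
        raise ValueError("intended and predicted must have the same length")
    if not intended:
        raise ValueError("at least one genome is required")

    # Number of genomes generated for each intended tumor type.
    totals = Counter(intended)
    # Number of genomes whose predicted type matches the intended type.
    hits = Counter(t for t, p in zip(intended, predicted) if t == p)

    per_type = {t: hits[t] / n for t, n in totals.items()}
    overall = sum(hits.values()) / len(intended)
    return overall, per_type


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    intended = ["type_A", "type_A", "type_B", "type_B"]
    predicted = ["type_A", "type_B", "type_B", "type_B"]
    overall, per_type = concordance(intended, predicted)
    print(f"overall concordance: {overall:.2f}")
    for tumor_type, fraction in sorted(per_type.items()):
        print(f"{tumor_type}: {fraction:.2f}")
```

Reporting the per-type fractions alongside the overall value shows whether agreement is uniform or driven by a few well-represented tumor types.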


Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-19T01:45:01.086888+00:00