ABSTRACT
Three-dimensional genome organization is essential for gene regulation, yet it is driven by different biological mechanisms in different species. Species-specific factors and DNA sequences influence chromatin folding, complicating cross-species comparisons. Leveraging Hi-C data and machine learning, we introduce Chimaera, a convolutional neural network that predicts Hi-C maps from DNA sequences, enabling exploration of genome folding in evolution. Chimaera’s latent representations revealed an unsupervised atlas of key chromatin features (such as insulation, loops, fountains/jets) and supported the detection and quantification of structural signatures in processes such as the cell cycle and embryogenesis. Targeted search in the latent space linked DNA sequence elements to specific chromatin structures. Applying Chimaera across multiple species confirmed the insulator roles of CTCF in vertebrates and BEAF-32 in D. melanogaster and identified a previously unreported insulator motif in D. melanogaster. In the amoeba D. discoideum, gene orientation on the DNA strand was shown to influence loop formation. Models for other organisms also showed chromatin folding patterns associated with gene location. Finally, using cross-species predictions, we tested the transferability of chromatin folding patterns and revealed evolutionary relationships, culminating in a chromatin structure-based cluster tree spanning plants to mammals.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
† Aleksei Shkolikov and Aleksandra Galitsyna should be regarded as joint first authors.
New species have been added to the analysis, some methods have been revised and the results have been supplemented.
DATA AVAILABILITY
Raw and processed Hi-C and Micro-C data were obtained from BioProject accessions PRJNA606649 (Xenopus tropicalis), PRJNA630123 (Anopheles merus), PRJNA665323 (Culex quinquefasciatus), PRJNA749654 (Sarcoptes scabiei), PRJNA683935 (Archegozetes longisetosus), PRJNA680311 (Arion vulgaris), PRJNA427478 (Pomacea canaliculata), PRJNA792953 (Cataglyphis hispanica), PRJCA014302 (Arabidopsis thaliana); BioSample accession SAMN13118423 (Apis cerana); GEO datasets GSE178982 (Mus musculus ESC), GSE129997 (Mus musculus cell cycle), GSE178982 (Mus musculus with depleted structural proteins), GSE171396 (Drosophila melanogaster), GSE128568 (Caenorhabditis elegans), GSE151553 (Saccharomyces cerevisiae wild type), GSE217017 (Saccharomyces cerevisiae with exogenous DNA), GSE85220 (Schizosaccharomyces pombe), GSE260572 (Trichoplax adhaerens, Mnemiopsis leidyi), GSE152150 (Symbiodinium microadriaticum), GSE195609 (Danio rerio embryos), GSE134055 (Danio rerio muscle cells), GSE247397 (Dictyostelium discoideum), GSM7120275 (Bombyx mori); 4DN dataset 4DNBSZOFFFM6 (Homo sapiens).
Preprocessed data for all studied organisms and trained models are posted online at OSF (doi: 10.17605/OSF.IO/YF7CR). Chimaera code and illustrative examples are available at Zenodo (doi: 10.5281/zenodo.17418710).