CoExpPhylo – A Novel Pipeline for Biosynthesis Gene Discovery

preprint OA: closed

Abstract

Background

The rapid advancement of sequencing technologies has drastically increased the availability of plant genomic and transcriptomic data, shifting the challenge from data generation to functional interpretation. Identifying genes involved in specialized metabolism remains difficult. Coexpression analysis is widely used to identify genes acting in the same pathway or process, but it has limitations: in particular, it cannot readily distinguish genes that are coexpressed because of shared regulatory triggers from genes that are directly involved in the same pathway. Integrating phylogenetic analysis adds a layer of confidence to functional predictions by taking evolutionary conservation into account. Here, we introduce CoExpPhylo, a computational pipeline that systematically combines coexpression analysis and phylogenetics to identify candidate genes involved in specialized biosynthetic pathways across multiple species, starting from one or more bait genes.
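As a rough illustration of what bait-based coexpression analysis means in practice, the sketch below ranks genes by their correlation with a single bait gene across samples. It is not taken from CoExpPhylo: the function names, the Pearson correlation on log2(TPM + 1) values, the cutoff of 0.7, and the toy gene identifiers are all assumptions made for this example, and the pipeline itself may use a different correlation measure and additional filters.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def coexpressed_with_bait(expr, bait, r_cutoff=0.7):
    """Return {gene: r} for genes whose expression correlates with the bait.

    expr maps gene ID -> list of TPM values (same sample order for all genes).
    r_cutoff is a hypothetical default, not a value from the paper.
    """
    bait_vec = [math.log2(v + 1.0) for v in expr[bait]]
    hits = {}
    for gene, values in expr.items():
        if gene == bait:
            continue
        r = pearson(bait_vec, [math.log2(v + 1.0) for v in values])
        if r is not None and r >= r_cutoff:
            hits[gene] = r
    return hits

# Toy expression table (made-up values, five samples per gene).
expr = {
    "CHS_bait": [1, 5, 20, 80, 160],
    "geneA": [2, 9, 41, 150, 330],    # rises with the bait
    "geneB": [100, 60, 30, 10, 2],    # falls as the bait rises
    "geneC": [7, 7, 7, 7, 7],         # constant, correlation undefined
}
hits = coexpressed_with_bait(expr, "CHS_bait")
```

In this toy table only `geneA` passes the cutoff. A gene that passes such a filter may still be coexpressed only because it shares a regulatory trigger with the bait, which is the limitation that the phylogenetic layer is meant to address.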

Results

CoExpPhylo systematically integrates coexpression information and phylogenetic signals to identify candidate genes involved in specialized biosynthetic pathways. The pipeline consists of multiple computational steps: (1) species-specific coexpression analysis, (2) local sequence alignment to identify orthologs, (3) clustering of candidate genes into Orthologous Coexpressed Groups (OCGs), (4) functional annotation, (5) global sequence alignment, (6) phylogenetic tree generation, and optionally (7) visualization. The workflow is highly customizable, allowing users to adjust correlation thresholds, filtering parameters, and annotation sources. Benchmarking CoExpPhylo on multiple pathways, including the biosynthesis of anthocyanins, proanthocyanidins, and flavonols, as well as lutein and zeaxanthin, confirmed its ability to recover known genes while also suggesting novel candidates.
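Step (3), grouping candidates into OCGs, can be pictured with the following minimal sketch. It assumes, purely for illustration, that coexpressed candidates from different species are joined whenever a local-alignment hit links them (single-linkage grouping with a union-find structure) and that a group is kept only if it spans a minimum number of species. The function name, the `min_species` parameter, and the gene identifiers are hypothetical; the abstract does not specify the clustering criterion that CoExpPhylo actually applies.

```python
def group_into_ocgs(candidates, similarity_links, min_species=2):
    """Group coexpressed candidates connected by sequence-similarity links.

    candidates: dict gene ID -> species name (genes that passed coexpression).
    similarity_links: iterable of (gene_a, gene_b) pairs, e.g. alignment hits.
    Returns a list of sorted gene-ID lists, one per retained group.
    """
    parent = {gene: gene for gene in candidates}

    def find(gene):
        # Follow parent pointers to the group representative (path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in similarity_links:
        # Ignore hits to genes that were not coexpressed with any bait.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for gene in candidates:
        groups.setdefault(find(gene), set()).add(gene)

    # Keep only groups supported by enough different species.
    return [
        sorted(members)
        for members in groups.values()
        if len({candidates[m] for m in members}) >= min_species
    ]

# Toy input with made-up gene IDs from three species.
candidates = {
    "At_g1": "Arabidopsis thaliana",
    "Vv_g1": "Vitis vinifera",
    "Sl_g1": "Solanum lycopersicum",
    "At_g2": "Arabidopsis thaliana",
    "At_g3": "Arabidopsis thaliana",
}
links = [
    ("At_g1", "Vv_g1"),
    ("Vv_g1", "Sl_g1"),
    ("At_g2", "At_g3"),        # same species only, so the group is dropped
    ("At_g1", "Xx_unknown"),   # partner was not a coexpressed candidate
]
ocgs = group_into_ocgs(candidates, links)
```

Here the three linked genes from different species form one group, while the pair found only in a single species is discarded. Single-linkage grouping of this kind tends to merge members of large gene families into one group, which is consistent with the refinement for large gene families that the authors list as future work.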

Conclusion

CoExpPhylo provides a systematic framework for identifying candidate genes involved in specialized metabolism. By integrating coexpression data with phylogenetic clustering, it facilitates the discovery of both conserved and lineage-specific genes. The resulting OCGs offer a strong foundation for further experimental validation, bridging the gap between computational predictions and functional characterization. Future improvements, such as incorporating multi-species reference databases and refining clustering for large gene families, could further enhance its resolution. Overall, CoExpPhylo represents a valuable tool for accelerating pathway elucidation and advancing our understanding of specialized metabolism in plants.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

- Carotenoid biosynthesis added as proof of concept
- Selection of cutoff values and parameter settings explained
- Run time analysis included
- Example dataset added on GitHub
- Extended documentation

Abbreviations

3GT - Anthocyanidin-3-O-Glycosyltransferase
4CL - 4-Coumarate:CoA Ligase
ANR - Anthocyanidin Reductase
ANS - Anthocyanidin Synthase
arGST - Anthocyanin-related Glutathione S-Transferase
BAN - BANYULS
BCH - Beta carotenoid hydroxylase
β-LCY - Lycopene β-cyclase
C4H - Cinnamate 4-Hydroxylase
CDS - Coding Sequence
CHI - Chalcone Isomerase
CHS - Chalcone Synthase
CRTISO - Carotenoid isomerase
CYP97 - Cytochrome P450, family 97
DFR - Dihydroflavonol 4-Reductase
ε-LCY - Lycopene ε-cyclase
F3’5’H - Flavonoid 3’,5’-Hydroxylase
F3’H - Flavonoid 3’-Hydroxylase
F3H - Flavanone 3-Hydroxylase
FLS - Flavonol Synthase
GGPP - Geranylgeranyl diphosphate
LAR - Leucoanthocyanidin Reductase
LDOX - Leucoanthocyanidin Dioxygenase
LUT - Lutein deficient
NCBI - National Center for Biotechnology Information
OCG - Orthologous Coexpressed Group
OMT - O-Methyltransferase
PA - Proanthocyanidin
PAL - Phenylalanine Ammonia-Lyase
PDS - Phytoene desaturase
PSY - Phytoene synthase
SRA - Sequence Read Archive
TPM - Transcripts Per Million
TT - Transparent Testa
UFGT - UDP-Glucose:Flavonoid Glucosyltransferase
ZDS - ζ-carotene desaturase


Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00