Accessible and Robust Machine Learning Approaches to Improve the Opsin Genotype-Phenotype Map

doi:10.1101/2025.08.22.671864

2025

preprint OA: closed CC-BY-4.0

Abstract

Predicting phenotypes from genetic variation is a central challenge in biology. Linking genotypes and phenotypes using machine learning (ML) offers great promise, but its use is limited by poor accessibility, overestimated performance, and a “data-cliff” (a gap between abundant sequences and scarce functional measurements). Opsin genes, which encode visual pigments that strongly influence animal spectral sensitivity and for which extensive phenotypic information exists, are an outstanding model system for developing more robust genotype–phenotype prediction methods. Here we advance ML characterization of the opsin genotype–phenotype map through four main contributions. First, we introduce the Opsin Phenotype Tool for Inference of Color Sensitivity (OPTICS), a user-friendly platform for predicting maximum wavelength sensitivity (λmax) from amino-acid sequences. Second, we show that encoding sequences with amino-acid physicochemical properties improves predictive performance and reveals mechanistic relationships. Third, we develop Phylogenetically Weighted Cross-Validation (PW-CV), a method that accounts for non-independence among related sequences, providing more realistic assessments of model generalizability. Finally, we present the Mine-N-Match (MNM) pipeline, which systematically links published opsin sequences to compiled in vivo λmax data, expanding genotype–phenotype coverage and improving prediction, especially for invertebrate opsins with undersampled heterologous data. By integrating accessible software, biologically informed encoding, phylogeny-aware evaluation, and data harmonization, our framework improves confidence, accuracy, and interpretability of genotype–phenotype prediction. An accurate genotype–phenotype map allows simulating molecular evolution of function, reconstructing the history of visual phenotypes, designing functional proteins, and generating new hypotheses that can be tested with heterologous phenotyping.

Competing Interest Statement

The authors have declared no competing interest.
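
The abstract names two ideas that a small example may make more concrete: encoding sequences by amino-acid physicochemical properties, and evaluating models so that related sequences do not appear on both sides of a train/test split. The sketch below is only an illustration under stated assumptions and is not the OPTICS or PW-CV implementation, whose details the abstract does not give. It assumes the Kyte-Doolittle hydropathy scale as a single example property and uses plain leave-one-clade-out splitting as a much coarser stand-in for phylogenetic weighting. The function names, toy sequence fragments, and clade labels are all hypothetical.

```python
from collections import defaultdict

# Kyte-Doolittle hydropathy index: one example physicochemical scale
# (an assumption here; the abstract does not say which properties are used).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_hydropathy(aligned_seq, gap_value=0.0):
    """Map each position of an aligned sequence to a hydropathy value.

    Gaps and any unrecognized symbols fall back to gap_value.
    """
    return [KD_HYDROPATHY.get(res, gap_value) for res in aligned_seq.upper()]


def leave_one_clade_out(clade_labels):
    """Yield (held_out_clade, train_idx, test_idx), one clade at a time.

    Holding out whole clades keeps close relatives out of the training
    set for each fold. This is a coarse proxy for phylogeny-aware
    evaluation, not the PW-CV method named in the abstract.
    """
    by_clade = defaultdict(list)
    for i, clade in enumerate(clade_labels):
        by_clade[clade].append(i)
    for clade in sorted(by_clade):
        test_idx = by_clade[clade]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(clade_labels)) if i not in held_out]
        yield clade, train_idx, test_idx


# Toy, made-up aligned fragments and clade labels (illustration only).
toy_seqs = ["AKL-", "AKLF", "GRIW", "GRLW"]
toy_clades = ["vertebrate", "vertebrate", "invertebrate", "invertebrate"]

features = [encode_hydropathy(s) for s in toy_seqs]
splits = list(leave_one_clade_out(toy_clades))
```

In this toy setup, each fold trains on one clade and tests on the other, which typically gives a less optimistic error estimate than a random split in which near-identical sequences land in both sets. A real analysis would combine several property scales and derive groups or weights from an actual phylogeny.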

