Decrypting viral dark matter through key proteins using an NLP-enhanced framework

Research Article (Research Square)

Zhihua Du, Min Li, Kaihuang Lin, Bo Xing, Yuehua Ou, Wenchen Song, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8534670/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Viral sequences in diverse environments remain largely uncharacterized, impeding our comprehension of their genetic makeup, biological interactions, and potential applications. This underscores an urgent need for innovative analytical methods. Here we present the VirHost Hunter framework, which employs phage tails and lysins, bypassing the requirement for full genomes, for efficient and high-resolution host assignment.
By harnessing Protein Language Models and Vision Transformers, VirHost Hunter captures protein functional homology despite sequence dissimilarity, significantly boosting prediction accuracy. In the scenario of disease-associated gut bacteria, calibrated VirHost Hunter surpassed existing methods, doubling phage host assignments, expanding taxonomic reach, and revealing new phages targeting gut bacteria, including Akkermansia and Prevotella. Therefore, we established a gut phage lysin database, enabling the synthesis of a lysin that effectively and specifically targets an obesity-inducing bacterium. VirHost Hunter's precision and scalability mark a significant leap forward in virome research and present a promising avenue for microbiome therapies.

Keywords: Virology, microbiome, machine learning, phage-host prediction

Introduction

The virome is a significant component of Earth's ecosystems and has a profound impact on ecological and human health. Uncharacterized viral genomes and sequences are widespread across environments because of limitations in current analytical techniques, and are referred to as viral dark matter. This concept highlights the need for innovative approaches to uncover and understand these hidden viral entities 1 . The intricate interplay between bacteria and their viruses, bacteriophages (phages), has garnered significant attention in recent years, fueled by advances in predictive modeling and therapeutic applications. Identifying the host range of phages is essential in studying phage resistance of bacteria, coevolution of phage-bacteria 2 , 3 , the influence of community context on phage-bacteria systems 4 , and the role of phages in human health and diseases 5 , 6 . In clinical settings, phages have already been adopted to treat infections caused by drug-resistant bacteria, offering advantages in precision medicine due to their host specificity and minimal disturbance of normal gut flora 7 , 8 .
Phages also hold great promise for modulating the gut microbiota, but their efficacy hinges on the availability of phages targeting gut bacteria, particularly those associated with chronic diseases. Only a handful of phages have been reported targeting gut anaerobes, and isolating gut phages has been reported to be arduous 9 , 10 . Phage lysins have demonstrated effective antimicrobial effects in animal models, food industries, and clinical therapies 11 – 15 , presenting broad industrial and medical applications. Lysins possess moderate host specificity, are easy to synthesize, and are especially suitable where phages are unavailable or the available phages are too host-specific to apply. However, existing lysin databases focus on clinical pathogens rather than gut commensal bacteria 16 , limiting their application in gut microbiota modulation. Predicting phage hosts and establishing a phage lysin database that specifically targets gut bacteria, by leveraging gut phage databases, offers an alternative solution.

Various computational approaches have been developed for predicting phage hosts, falling into two categories: alignment-dependent and alignment-free methods (Table S1). Alignment-dependent methods rely on phage marker genes 17 – 19 , phage-host relatedness 20 , and CRISPR spacers 21 , 22 , but they have limitations such as database size, data source dependency, alignment parameters, and applicability only to phages with specific marker genes or CRISPR signals 23 , 24 . Alignment-free approaches utilize phage-bacterium interaction matrices 25 – 36 , phage whole genomes 37 , 38 or sequences of receptor-binding proteins (RBPs) 39 , 40 to predict phage hosts. Notably, Gonzales et al. 40 utilized protein language models (PLMs) for feature extraction from RBPs, highlighting the potential of computational techniques in this area.
PLMs are an application of Natural Language Processing (NLP) that interprets biological sequences in a manner akin to human language. This approach significantly improves contextual understanding and enables the identification of complex patterns that were previously difficult to discern.

Current viral databases predominantly use alignment-dependent methods and CRISPR spacers for host assignment, resulting in incomplete coverage and limited recall. For instance, in their study of Earth's virome, Paez-Espino et al. 41 identified 9,992 putative virus-host associations covering only 7.7% of metagenomic viral contigs (mVCs). In the past three years, several human gut virome databases have also been released. The Metagenomic Gut Virus (MGV) database assigned hosts to 81% (n = 153,892) of its phages, followed by 69% (n = 31,259) within the Cenote Human Virome Database (CHVD), 42% (n = 13,954) within the Gut Virome Database (GVD), and 29% (n = 40,932) within the Gut Phage Database (GPD) 42 – 45 . The GPD had the most stringent criteria, resulting in the lowest recall of the four databases: it used only CRISPR spacers from 2,898 high-quality genomes of cultured human gut bacteria and tolerated zero mismatches across the whole length of the spacers. Therefore, a high-quality alignment-free method, with improved machine learning models and input features, can complement the CRISPR spacer method to increase the sensitivity of host prediction without compromising precision. Indeed, a recent tool, iPHoP 46 , integrates alignment-dependent and alignment-free methods for host prediction, including BLAST, CRISPR, WIsH, VHM, and PHP. Although iPHoP is the most comprehensive tool to date for phage host prediction, the authors discussed its limitations, including slow running time and the fact that it only achieves genus-level resolution, which may impact its practical applications.
Alignment-free computational methods based on host-specific proteins such as tails and lysins, instead of the whole genomes of phages, may overcome these challenges: (1) they require minimal data input, avoiding vast redundant information and overuse of computing resources; (2) they can handle incomplete genome assemblies resulting from virome sequencing; (3) they can achieve high-resolution host prediction, likely at the species or strain level, for phage therapy applications; and (4) they facilitate applications in synthetic biology, including host range modulation by swapping or engineering phage RBPs 47 , 48 , delivery vehicles based on proteins that recognize and attach to host surfaces 49 , and therapeutic agents based on lytic proteins that break down bacterial cell walls 50 .

In this study, we develop a framework for phage host prediction that integrates key design elements in feature extraction, dataset construction, and model selection. We verify the role of each design element by conducting control analyses, followed by a comprehensive comparison to other methods across family to species levels. We calibrate the model to facilitate its application towards disease-associated gut bacteria and validate its robustness under targeted scenarios. We apply the calibrated model to the GPD and identify a large number of phages targeting disease-associated gut bacteria, including new ones targeting renowned bacterial species whose phages have hardly been characterized before. To further promote application of the resource, we extract lysins from the GPD with expanded host assignment to establish a repository. As a proof of concept, we select a lysin from the repository and synthesize it to verify its function against an obesity-inducing bacterium. This work elucidates the design of a predictive framework for phage host prediction and provides insights into how to utilize machine learning to serve genomics data mining and protein function prediction.
Deciphering gut phages using this tool not only enhances our understanding of phage diversity and phage-bacteria interactions, but also facilitates downstream application of the gut phage resources to disease intervention.

Results

Designing a phage host prediction framework

To predict the host of phages, full genome sequences of phages and bacteria are usually used. However, whole genome-based methods introduce a significant amount of non-essential data, including proteins unrelated to host recognition or infection, which can create noise and interfere with the prediction accuracy, resolution, and efficiency. Concentrating on phage proteins conferring specificity (those directly involved in the infection cycle) offers a more targeted approach. Some methods have utilized receptor-binding proteins (RBPs) to predict hosts 39 , 40 , but it can be challenging to annotate RBPs for many phages. We initially counted the number of RBPs and tail proteins in 7,598 phage genomes from NCBI (December 29, 2021), revealing an average of 1.33 RBPs and 15.24 tail proteins per phage (Figure S1). Therefore, we expanded the dataset to include specific proteins beyond tail fibers and tail spikes: non-RBPs of phage tails, such as tail sheath, tail tube, baseplate, and tail collar proteins; and lysins, which are enzymes highly active against the bacterial cell wall 51 . These proteins are key to the infection cycle and more widely annotated, and are thus included for prediction as well.

Proteins may share low sequence similarity while still performing similar functions across diverse species, rendering traditional sequence alignment methods less effective in capturing these functional similarities. To overcome the challenge of predicting host specificity using these proteins, particularly when sequence similarity is low, we employed protein language models (PLMs).
PLMs provide a powerful solution by learning deep contextual and functional patterns within protein sequences, enabling them to capture viral protein function and viral biology even in cases of minimal sequence homology 52 . Because the same protein sequence can be encoded by different DNA sequences, we incorporate DNA sequence features of tail proteins and lysins into the framework. DNA sequences provide additional insights into phages' genomic context, such as codon usage bias, GC content, and nucleotide frequency, which can further refine predictions by accounting for genomic stability and evolutionary constraints 39 . To uncover long-range dependencies and global patterns in DNA sequence data, we utilized a Vision Transformer (ViT) 53 . As a transformer-based model, ViT can capture complex relationships and contextual information inherent in DNA sequences. These patterns can reveal insights into genetic structures, functions, or relationships that are not easily discerned by examining individual sequences alone.

As a result, we present the VirHost Hunter framework with the above characteristics (Fig. 1 A). It consists of two primary components: a feature extractor and a classifier. The feature extractor integrates protein sequence embeddings from the ProtT5 model 54 , physicochemical features of DNA sequences, and K-mer features derived from DNA sequences using a deep neural network (DNN) 55 . Utilizing three convolutional neural networks (Figure S2) and a Vision Transformer (ViT), the DNN extracts multi-scale features from the data. The final classification step uses a multi-layer perceptron (MLP) and a Random Forest (RF) classifier 56 , with RF refining high-confidence predictions from the MLP to improve accuracy. We next ask whether each of these design choices (combining protein and DNA features, constructing datasets from specific proteins, and using models such as PLMs and ViT) enhances host prediction as expected.
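The DNA-level inputs mentioned above (GC content and K-mer features) can be illustrated with a minimal sketch. This is not the authors' feature extractor: the function names, the choice of k = 3, and the handling of ambiguous bases are assumptions made only for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(dna: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    dna = dna.upper()
    valid = sum(dna.count(base) for base in "ACGT")
    if valid == 0:
        return 0.0
    return (dna.count("G") + dna.count("C")) / valid


def kmer_frequencies(dna: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the ACGT alphabet.

    Windows containing ambiguous bases (for example N) are skipped.
    The value k = 3 is an illustrative default, not taken from the paper.
    """
    dna = dna.upper()
    counts = Counter(
        dna[i:i + k]
        for i in range(len(dna) - k + 1)
        if set(dna[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


if __name__ == "__main__":
    gene = "ATGGCGTTAGCCGGATAA"  # hypothetical coding sequence
    print(round(gc_content(gene), 3))
    print(len(kmer_frequencies(gene, k=3)))  # one entry per possible 3-mer
```

A fixed-length vector like this (4^k entries plus scalar properties) is one common way to make sequences of different lengths comparable before they are passed to a neural network.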
Using both protein and DNA features improves learning over either alone

To evaluate whether integrating protein and DNA features offers superior performance compared to using either individually, we conducted ablation experiments using two benchmark datasets: the Bacteriophage RBP (Drug-Resistant receptor-binding proteins, DRRBP) dataset (n = 4,845) 39 and the Bacteriophage Tail Proteins (Drug-Resistant tail, DRTail) dataset (n = 12,509). We measured performance using accuracy (ACC), precision, and F1 scores under three experimental conditions: using only protein features, using only DNA features, and using a combination of both. Relying on a single type of feature led to inconsistent model performance across datasets (Fig. 2 A). In the DRRBP dataset, models that used only protein features outperformed those that used only DNA features. In contrast, for the DRTail dataset, DNA features alone provided better performance than protein features. This inconsistency reveals the limitations of using only one feature type, as neither approach fully captures the complexity of phage-host interactions. Integrating both protein and DNA features, on the other hand, consistently improved model performance across all datasets and metrics. Using both feature sets together resulted in the highest performance, with an accuracy of 0.9081 and 0.8927, precision of 0.9090 and 0.8930, and F1 scores of 0.9077 and 0.8925 on the DRRBP and DRTail datasets, respectively, significantly outperforming models that used either feature set alone. This demonstrates that integrating protein and DNA features not only enhances predictive accuracy but also provides greater consistency and stability across datasets, particularly in the context of bacteriophage host prediction.
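For readers who want to reproduce this kind of comparison, the three reported metrics can be computed for a multi-class host prediction as sketched below. The source does not state how precision and F1 were averaged across host classes; macro-averaging is assumed here, and the function name and example labels are hypothetical.

```python
def macro_scores(y_true: list[str], y_pred: list[str]) -> tuple[float, float, float]:
    """Return (accuracy, macro precision, macro F1) for multi-class labels.

    Macro-averaging (an unweighted mean over classes) is an assumption;
    the paper does not specify its averaging scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1s.append(f1)

    return accuracy, sum(precisions) / len(labels), sum(f1s) / len(labels)


if __name__ == "__main__":
    # Hypothetical host labels for four phage proteins.
    truth = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
    preds = ["Escherichia", "Klebsiella", "Klebsiella", "Klebsiella"]
    print(macro_scores(truth, preds))
```

Running the same function on predictions from the protein-only, DNA-only, and combined models gives directly comparable numbers for an ablation table.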
Phage tail components and lysins drive host prediction without full-genome data

To confirm that using all tail components (RBPs, tail sheath, tail tube, baseplate, tail collar, etc.) for host prediction is feasible, we conducted a 10-fold cross-validation on the DRRBP and DRTail datasets using our method, DeepHost 38 , Random Forest (RF, Boeckaerts' method) 39 , and Protein Embeddings 40 (Table S2). The results suggest that host prediction accuracy using all tail proteins is comparable to that using RBPs alone, which favors using all tail proteins since they are roughly 10 times more abundant than RBPs. We also found that VirHost Hunter outperformed the other methods, achieving an accuracy of 0.9081 and 0.8927, precision of 0.9090 and 0.8930, and F1 scores of 0.9077 and 0.8925 on the DRRBP and DRTail datasets, respectively (Table S2).

To compare the efficacy of phage tails and lysins with that of non-specific proteins for host prediction, we conducted tests using head proteins and terminases from the same phage datasets. As shown in Fig. 2 B, phage tails and lysins consistently outperformed head proteins and terminases across family, genus, and species levels. For all sequence similarity thresholds tested, phage tails achieved the highest accuracy, followed closely by lysins. In contrast, head proteins and terminases reached significantly lower accuracy, with a notable decline in performance at lower sequence similarity thresholds, especially at the species level (Fig. 2 B). This illustrates that phage tails and lysins maintain their predictive power even at reduced sequence similarity, unlike the non-specific control proteins. It further implies that relying on whole genomes for host prediction may be redundant, as focusing on key proteins provides more accurate and efficient predictions.
We further demonstrated that VirHost Hunter can reach species-level resolution and had superior accuracy, precision, and F1 compared with the other methods, on a multi-taxonomic dataset of 7,598 phage genomes (Fig. 1 B, Supplementary Results, Figure S2A and Table S3-S4).

Functional homology is captured even in low-similarity sequences

To demonstrate the ability of VirHost Hunter to capture functional homology using NLP-based representations, we evaluated its performance across datasets with varying sequence similarities. Using CD-HIT 57 , we partitioned the multi-taxonomic dataset into subsets with sequence similarity thresholds of 50%, 60%, 70%, 80%, and 90%, enabling us to assess VirHost Hunter's capability of predicting phage hosts based on functional relationships rather than strict sequence homology. We compared VirHost Hunter's performance to that of other models, including Boeckaerts et al. 39 , DeepHost 38 , and M. Gonzales et al. 40 , across different similarity thresholds. As illustrated in Fig. 2 C, VirHost Hunter consistently outperformed the other methods across taxonomic ranks (family, genus, and species), highlighting its superior ability to leverage functional homology in low-similarity datasets. Crucially, as sequence similarity decreased, the performance gap between VirHost Hunter and the other methods widened, particularly at the family and genus levels. This underscores the increasing importance of capturing functional homology in low-similarity regions, where conventional sequence similarity-based methods typically fail. VirHost Hunter's integration of protein language models (PLMs), such as ProtT5, and DNA sequence features enables it to move beyond reliance on sequence similarity alone. Instead, it identifies deeper functional relationships, resulting in robust and accurate predictions, even under low similarity conditions.
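The idea of reducing a dataset to a given similarity threshold can be sketched with a greedy, longest-first clustering in the spirit of CD-HIT. This is not CD-HIT itself and not the authors' pipeline: the similarity score below uses Python's `difflib.SequenceMatcher` ratio as a crude stand-in for alignment-based sequence identity, and the sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Greedy incremental clustering, processing the longest sequences first.

    Each sequence joins the first cluster whose representative (the first
    member) is at least `threshold` similar; otherwise it founds a new cluster.
    """
    clusters: list[list[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if similarity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    proteins = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]  # hypothetical sequences
    for cutoff in (0.5, 0.8, 0.95):
        representatives = [c[0] for c in greedy_cluster(proteins, cutoff)]
        print(cutoff, len(representatives))
```

Keeping only the cluster representatives at each cutoff yields progressively less redundant datasets, which is the setting in which a model must rely on something other than near-identical training sequences.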
Robust phage host prediction for targeted scenarios

Given the substantial impact of gut bacteria on human health conditions such as inflammatory bowel disease (IBD) 58 , 59 , colorectal cancer 60 – 62 , and metabolic diseases 63 – 67 , obtaining more information on phages targeting these bacteria is advantageous. Such information can expand our knowledge of gut phage-bacteria interactions and gut phage diversity, and can be used for therapeutic purposes. Phage information can be obtained either through co-culturing with a bacterial host or by mining data from high-throughput sequencing. However, gut phages, especially those targeting obligate anaerobes, are hard to culture and isolate. Investigating gut phages by analyzing sequencing data is, therefore, usually considered more efficient. Having validated the superior performance of VirHost Hunter in accuracy, precision, and resolution, we next evaluate its effectiveness in identifying phages targeting disease-associated gut bacteria.

We compiled a dataset consisting of 60 gut bacterial species associated with various diseases, including carotid atherosclerosis, inflammatory bowel disease (IBD), and obesity (Fig. 1 B, Table S5). We annotated prophage tails and lysins from the dataset, resulting in a total of 328,701 unique tail proteins and 312,565 unique lysins. We calibrated the VirHost Hunter model using these sequences across 29 families, 40 genera, and 60 species. Consistent with previous evaluations, VirHost Hunter outperformed the other three tested methods when applied to this dataset. At the family, genus, and species levels, VirHost Hunter-tail (based on gut phage tails) yielded ACC scores of 0.9516, 0.935, and 0.9132, Precision scores of 0.9513, 0.9341, and 0.9112, and F1 scores of 0.9512, 0.9342, and 0.9115, respectively (Figure S3B, Table S6).
VirHost Hunter-lysin (based on gut phage lysins) exhibited ACC scores of 0.9817, 0.9756, and 0.9590, Precision scores of 0.9817, 0.9755, and 0.958, and F1 scores of 0.9817, 0.9755, and 0.9582, respectively (Figure S3B, Table S7). We also examined how sample sequence similarity would affect model performance. VirHost Hunter consistently outperformed other methods across various similarity thresholds and taxonomic levels (Figure S4), further highlighting its robustness in predicting gut phage hosts associated with chronic diseases.

To further validate VirHost Hunter's performance on isolated gut phages, we used a previously reported collection of cultivated gut phages 68 targeting Bifidobacterium, Bacillus, Bacteroides, Campylobacter, Clostridium, Enterococcus, and Streptococcus (Fig. 1 B). From 156 gut phages, all with experimentally verified host data, 702 tail proteins and 373 lysins were extracted. VirHost Hunter and the CRISPR-based method were tested under equivalent precision thresholds, because the CRISPR-based method was the one most frequently used to assign bacterial hosts in previous work. At a 95% precision cutoff, VirHost Hunter correctly identified hosts for 73/156 phages at the family level and 58/156 at the genus level, while the CRISPR-based method yielded no assignments, reflecting its low recall (Table 1 ). At the 84% and 69% cutoffs, VirHost Hunter performed comparably with the CRISPR-based method, and combining both methods further increased the number of correctly assigned phages to 101/156 (84% cutoff) and 113/156 (69% cutoff) at the family level, and to 107/156 (84% cutoff) and 117/156 (69% cutoff) at the genus level (Table 1 ). Additionally, VirHost Hunter achieved species-level predictions, a resolution not attainable by the CRISPR-based method, correctly assigning 9/156 (95% cutoff), 20/156 (84% cutoff), and 26/156 (69% cutoff) phages, with hosts including Bacteroides fragilis, Phocaeicola vulgatus, and Eggerthella lenta (Table 1 , Table S8).
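One way a precision cutoff such as the 95%, 84%, and 69% levels above can be turned into a confidence threshold is sketched below. The source does not describe how its cutoffs were derived, so this is only an assumed procedure on a labeled validation set; the function name and the example scores are hypothetical.

```python
from typing import Optional, Sequence


def threshold_for_precision(
    scores: Sequence[float], correct: Sequence[bool], target: float
) -> Optional[float]:
    """Lowest score cutoff whose retained predictions reach the target precision.

    Predictions are ranked by confidence; a cutoff keeps every prediction
    scoring at or above it. Returns None if no cutoff reaches the target.
    """
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    best = None
    hits = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        hits += bool(is_correct)
        # Evaluate only where the score changes, so tied scores stay together.
        last_of_tie = i == len(ranked) or ranked[i][0] != score
        if last_of_tie and hits / i >= target:
            best = score
    return best


if __name__ == "__main__":
    # Hypothetical validation predictions: confidence and whether the host was right.
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    right = [True, True, False, True, False]
    print(threshold_for_precision(conf, right, 0.95))
    print(threshold_for_precision(conf, right, 0.70))
```

Lowering the target precision admits more predictions, which is the trade-off visible in Table 1: more phages receive a host assignment as the cutoff is relaxed.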
Table 1. Host prediction for cultivated gut phages by VirHost Hunter and CRISPR-based method

Rank     Precision  VirHost Hunter  CRISPR-based  Combined
Family   95%        73              0             73
Family   84%        82              95            101
Family   69%        96              105           113
Genus    95%        58              0             58
Genus    84%        94              95            107
Genus    69%        105             105           117
Species  95%        9               N.D.          9
Species  84%        20              N.D.          20
Species  69%        26              N.D.          26

To sum up, VirHost Hunter demonstrated superior performance in comparison to the other three alignment-free methods tested. Furthermore, it significantly outperformed the CRISPR-based method in an independent gut phage-host dataset under a 95% precision cutoff and achieved comparable performance under 84% and 69% precision cutoffs. Additionally, our experiment revealed that the combination of VirHost Hunter and the CRISPR-based method significantly enhances the proportion of true positive predictions, particularly for high-resolution phage-host predictions in the gut microbiota at species level. Overall, these results highlight the scalability of VirHost Hunter across different environments.

Phages targeting disease-associated gut bacteria are vastly expanded

The four most recently published gut virus databases typically adopted commensal bacteria as their CRISPR libraries. Among them, the GVD 42 , the MGV 44 , and the CHVD 45 used looser cutoffs than the GPD 43 , which allowed zero mismatches and consequently had a low assignment rate. Although these databases are comprehensive, a tailored approach is needed for specific application scenarios, such as for intestinal pathogenic bacteria. Considering that the GPD had the lowest host assignment recall (28.66%) among the four databases, and that its precision at the genus level was 84% as evaluated by Dion et al. 21 , we used VirHost Hunter to assign hosts for the GPD at 95% and 84% precision, aiming to explore the dark matter in the human gut associated with chronic diseases (Fig. 1 C).
Using our optimized annotation pipeline, we identified 163,590 lysins and 388,894 tail proteins from 142,809 assembled gut phages in the GPD. We applied precision filters of 84% and 95% to predict hosts at different taxonomic levels (Table S9). Phylogenetic composition analysis showed that the host assignments covered 8 phyla, 13 classes, 21 orders, 29 families, 40 genera, and 58 species, including 42 species of obligate anaerobes (Fig. 3 A). The host assignment results for each phage combined the predictions by tails and lysins. Notably, 7 families could only be assigned by VirHost Hunter-lysin but not VirHost Hunter-tail, including Eubacteriaceae, Atopobiaceae, Leuconostocaceae, Prevotellaceae, Peptoniphilaceae, Gemellaceae, and Aerococcaceae (Fig. 3 A).

We evaluated the host assignment results of VirHost Hunter at 95% and 84% precision and compared them with the previous results of the GPD. We found that both VirHost Hunter-tail and VirHost Hunter-lysin can enhance the host assignment of gut phages. At 95% precision, VirHost Hunter newly assigned hosts to 15.91% (22,724/142,809) of the GPD phages, with 10.98% (15,677/142,809) by VirHost Hunter-tail and 9.41% (13,432/142,809) by VirHost Hunter-lysin (Fig. 3 B). At 84% precision, VirHost Hunter newly assigned hosts to 33.99% (48,545/142,809) of the GPD phages, with VirHost Hunter-tail contributing 20.16% (28,790/142,809) and VirHost Hunter-lysin contributing 25.37% (36,236/142,809), boosting the final host assignment ratio to 62.66% (89,478/142,809) (Fig. 3 B). These data illustrate that applying VirHost Hunter to either tails or lysins can enhance the host assignment of gut phages, while combining the VirHost Hunter results based on different key proteins with those of the CRISPR method could optimize the outcome.

By integrating VirHost Hunter and the CRISPR-based method, we assessed the improvement and refinement of host assignment results in the GPD.
Both VirHost Hunter-tail and VirHost Hunter-lysin significantly enhanced host taxonomic classification compared to the previous results. The host assignment results of VirHost Hunter-tail newly covered 3 families, 8 genera, and 20 species and that of VirHost Hunter-lysin newly covered 5 families, 12 genera, and 25 species (Fig. 3 C-E). Overall, at the family level, VirHost Hunter identified phages targeting 5 new families accounting for 1.38% of total assignments under 84% precision, while phages targeting Aerococcaceae were not detected at the 95% cutoff. Lachnospiraceae and Bacteroidaceae , recognized as the two most prevalent host families by both VirHost Hunter and the CRISPR-based method, collectively accounted for over 50% of total assignments at both the 84% and 95% cutoffs (Fig. 3 C). At the genus level, VirHost Hunter identified phages targeting 12 new genera accounting for 21.58% of total assignments at the 84% cutoff, while three of the new genera were not detected at the 95% cutoff. Bacteroides is the most abundant host genus identified by both VirHost Hunter and the CRISPR-based method (Fig. 3 D). At the species level, VirHost Hunter identified phages targeting 25 new species accounting for 0.14% of total assignments at the 84% cutoff, while four of the new species were not predicted at the 95% cutoff. Notably, VirHost Hunter identified phages targeting Cronobacter sakazakii as predominant, which was not detected by the CRISPR-based method, likely due to differences in training datasets (Fig. 3 E). In the refined database, there are five newly annotated host families, including Aerococcaceae , Akkermansiaceae , Gemellaceae , Prevotellaceae , and Xanthomonadaceae (Fig. 1 D, 3 A, 3 C). Among them, Akkermansia muciniphila within the Akkermansiaceae family has been extensively reported due to its ability to modulate multiple diseases, including obesity 69 , diabetes 70 , 71 , inflammatory bowel disease 72 – 74 , and schizophrenia 75 . 
However, phages targeting Akkermansia muciniphila had never been characterized in any previous publication. We identified 36 phages targeting Akkermansia muciniphila at the 95% precision cutoff and 95 phages at the 84% precision cutoff, and we examined the 36 phages from the more stringent cutoff (Table S9). The genome sizes of the Akkermansia muciniphila phages range from 11,830 to 92,135 bp and the GC content ranges from 49.11% to 60% (Figure S4). The number of CDS is between 21 and 127 and the annotation rate is between 23.81% and 52.17% (Figure S5). Prevotella copri, within the Prevotellaceae family, is another renowned species associated with rheumatoid arthritis 76 – 78 and type 2 and type 1 diabetes mellitus 79 – 81 . Megaphages were the only phages reported to target Prevotella copri, but previous attempts to isolate them failed 82 . We identified 15 phages targeting Prevotella copri at the 95% cutoff and 22 phages at the 84% cutoff, and we examined the 15 phages from the more stringent cutoff (Table S9). The genome sizes of the Prevotella copri phages range from 12,114 to 127,100 bp and the GC content ranges from 39.17% to 48% (Figure S5). The number of CDS is between 16 and 166 and the annotation rate is between 23.17% and 56.25% (Figure S5). We selected representative phages targeting Akkermansia muciniphila and Prevotella copri using CD-HIT with a coverage threshold of 0.6 and an identity threshold of 0.6, and annotated their genomes using our refined pipeline. The functional elements of these phages mainly include lysis, lysogeny-related, structure, DNA maintenance, packaging and assembly, replication and transcription, transport, and regulation (Fig. 3 F).

Diversity and geographic distribution of gut phages

To gain further insights from the expanded host assignments, we first analyzed the phylogenetic lineages of phages of the refined database (Fig. 1 D).
Out of the 89,478 phages, 11.42% were classified under six viral families, including Siphoviridae, Myoviridae, Podoviridae, Herelleviridae, Tectiviridae, and Microviridae, covering all taxonomic classifications identified in the GPD (Fig. 4 A). The remaining 88.57% of assigned phages were unclassified (Fig. 4 A). Compared to the previous results of the GPD, VirHost Hunter increased host assignments by nearly 1-fold for Siphoviridae and Herelleviridae phages, nearly 2-fold for Podoviridae and Myoviridae phages, and 33.3% for Microviridae phages, significantly enhancing host assignments across multiple taxonomic levels. As a result, we assigned hosts to 79,332 unclassified phages, 4,566 Siphoviridae phages, 2,902 Myoviridae phages, 2,598 Podoviridae phages, 75 Herelleviridae phages, 4 Microviridae phages and 1 Tectiviridae phage (Fig. 4 A). It is noteworthy that Microviridae, a family of tailless phages, were assigned hosts by VirHost Hunter-lysin instead of VirHost Hunter-tail, as expected. Therefore, it is important to combine the results of VirHost Hunter-tail and VirHost Hunter-lysin for downstream analyses. These findings demonstrate the broad applicability of VirHost Hunter for host prediction across diverse phage lineages, regardless of whether the phages have tails.

Given the large number of phages predicted to target identical hosts, we evaluated phage diversity within bacterial families across diverse phyla by calculating the ratio of VC numbers to phage counts sharing the same host (Fig. 1 D). We observed a wide distribution of phage diversity across bacterial families, especially within Bacteroidetes. Notably, 23 bacterial families exhibited the highest viral diversity, with 15 of these families belonging to Firmicutes (Fig. 4 B), a finding that is consistent with the GPD.
Additionally, we newly found that some bacterial families belonging to Actinobacteria, Bacteroidetes and Proteobacteria showed high viral diversity, such as Pseudomonadaceae , Neisseriaceae , Muribaculaceae , Moraxellaceae , Dermabacteriaceae , Corynebacteriaceae , Coprobacteriaceae and Cellulomonadaceae (Fig. 4 B). In contrast, the lowest viral diversity was detected in Bacteroidaceae , DTU089 , Marinifilaceae , Rikenellaceae and Tannerellaceae , all belonging to the Bacteroidetes (Fig. 4 B). Firmicutes , Bacteroidetes , Proteobacteria , and Actinobacteriota were previously reported as the common phyla in the human gut, and they are also prominently featured in our data. To gain insights into the relationship between the VC number of host families and their geographic distribution, we analyzed the dominant families and performed principal coordinate analysis (PCoA) (Fig. 1 D, 4 C-D). Asia and Europe had higher total phage counts than the other continents, which may be attributed to the greater number of human metagenomic sequencing studies conducted in Asia and Europe (Fig. 4 C). Host families showed similar geographic distribution patterns across continents, with Lachnospiraceae and Bacteroidaceae dominating on all continents, indicating their role as hosts for globally prevalent gut phages (Fig. 4 C). Asia, Europe and Africa had more overlapping regions, suggesting similar phage compositions, while Oceania, North America and South America were more distinct, indicating different phage communities (Fig. 4 C). The PCoA gave a similar result: although some continents do not completely overlap with others in host bacterial community composition, the slight differences observed are not statistically significant (Pr(> F) = 0.065), indicating similar phage compositions (Fig. 4 D).
An expansive lysin repository countering a broad array of gut bacteria

Considering VirHost Hunter's precision in predicting gut phage and lysin hosts and its complementarity with the CRISPR-based method (Fig. 3 B, Supplementary Results), we established the Gut Phage Lysin Database (GPLD), which encompasses 117,698 lysins precisely targeting 29 disease-related gut bacterial families (Fig. 1 E, Table S9). Of these, 35.20% (n = 41,429) were identified by both VirHost Hunter and the CRISPR-based method, 13.27% (n = 15,617) were identified only by the CRISPR-based method, and 51.53% (n = 60,652) were identified exclusively by VirHost Hunter. Hydrolases, holins, and endolysins were the predominant functional categories (Fig. 5 A). To better understand the functionality, stability, and potential applications of lysin proteins, we analyzed their physicochemical properties. Their secondary structures comprised turn, sheet, and helix fractions (Fig. 5 B), and their lengths ranged from 30 bp to 5811 bp, with a mean length of 195 bp. The molecular weight ranged from 2.8 kDa to 65 kDa, with a mean of 21.6 kDa (Fig. 5 C), suggesting favorable attributes for efficient synthesis and manipulation. A majority (81.98%) were stable, with an instability index below 40 (Fig. 5 C). Amino acid frequency analysis indicated a prevalence of the hydrophobic residues alanine, leucine, and isoleucine, potentially enhancing protein stability and function. To analyze the functional diversity and sequence-function relationships within the gut lysin protein family, we employed the sequence similarity network (SSN) tool 83 , 84 , which provided insights into their sequence and functional divergence. Lysins were grouped into 603 clusters based on sequence similarity, with nodes colored according to the taxonomic phylum of the host (Fig. 5 D).
The SSN was predominantly populated by proteins from Firmicutes (n = 4559), followed by Proteobacteria (n = 996), Bacteroidetes (n = 906), and Actinobacteria (n = 600) (Fig. 5 D). The protein clusters in the SSN were categorized according to their respective protein types, indicating that proteins with similar functions might possess conserved domains that confer these functions (Fig. 5 D). Notably, holins, unlike the other protein types in the network, exhibited high diversity in their sequences and structures (Fig. 5 D). Furthermore, proteins against the same host phylum tended to cluster together, implying that lytic proteins, much like phages, exhibit a high degree of host specificity. This host specificity is likely mediated by conserved regions within the lytic proteins that are essential for identifying and binding to host cells. To uncover the conserved functional motifs and the underlying mechanisms, we generated sequence logos 85 for three representative clusters. Cluster 1, the largest cluster, contained 1,735 protein sequences from Firmicutes , Actinobacteria , Bacteroidetes , and Proteobacteria hosts, including holins (n = 3170), hydrolases (n = 3098), endolysins (n = 1125) and lysis proteins (n = 796). The sequence logo analysis revealed three conserved motifs (RHTKAPAVLIECCFVDNKDD, NVTVHRDFANKSCPG, and RSWCSSSAANDNRAITIEVA), all located in the N-acetylmuramoyl-L-alanine amidase domain, which is crucial for phage-mediated bacterial lysis (Fig. 5 E). Cluster 2 comprised 1,213 representative protein sequences belonging to holins, and two conserved motifs were detected in the toxin secretion domain, which facilitates the release of lytic enzymes to lyse bacterial cells (Fig. 5 E). Cluster 3 consisted of 296 representative hydrolase sequences, and its motif was mainly associated with the N-(deoxy)ribosyltransferase-like domain, which functions in degrading bacterial cell walls during phage infection (Fig. 5 E).
Functionally important residues were found to be conserved in putative isofunctional clusters, with motif and domain analyses revealing differences between different types of phage lytic proteins. The findings have valuable implications for the design and engineering of lysins and their application in lysin therapy.

Lysin Ply491_6 effectively and specifically inhibits an obesity-inducing bacterium

Obesity has emerged as a significant global health concern, with the gut microbiome implicated in its onset and progression 86 . Comparative analyses have revealed distinct microbiome profiles between obese and non-obese individuals, suggesting an association between obesity and certain bacterial genera, including Bacteroides , Megamonas , Ruminococcus , Dorea , Coprococcus , Fusobacterium , Blautia , and Eubacterium 63 , 87 , 88 . While phage therapy holds promise for modulating the gut microbiota, the lack of reported phages targeting Megamonas and our failure in repeated attempts to isolate Megamonas phages prompted us to investigate the therapeutic potential of lysins from the Gut Phage Lysin Database (GPLD) against this bacterial genus (Fig. 1 F). We identified 526 unique lysin sequences specific to Megamonas from GPLD, grouped into 167 distinct clusters (Fig. 6 A). Ply491_6 (ivig_491_6) is the representative sequence of the protein cluster with the highest number of proteins (Fig. 6 A, 6 B). The cDNA sequence encoding Ply491_6 spans 561 base pairs. Ply491_6 comprises 187 amino acids, with a molecular weight of 20.8 kDa and a theoretical isoelectric point (pI) of 5.37. Ply491_6 is hydrophilic, with a grand average of hydropathicity (GRAVY) value of -0.207. The instability index is 25.93, indicating that Ply491_6 is a stable protein. Additionally, Ply491_6 is devoid of signal peptides and transmembrane regions and is structurally characterized by four predominant α-helices alongside multiple β-sheets (Fig. 6 C).
Ply491_6 shares high sequence identity (99.46%) with QIW89318.1, a cell wall hydrolase autolysin from Caudoviricetes sp. , and contains a conserved N-acetylmuramoyl-L-alanine amidase domain. We therefore synthesized and purified the Ply491_6 protein for in vitro assays to verify its lytic activity against Megamonas . We incubated Ply491_6 with Megamonas rupellensis and monitored the bacterial turbidity over time. Ply491_6 effectively lysed bacterial cells at concentrations as low as 20 µg/mL, with a significant reduction in bacterial turbidity observed within 150 minutes (Fig. 6 D). To further assess the specificity of Ply491_6, we measured its lytic activity against other high-abundance gut bacteria and common probiotics, including Bacteroides fragilis , Clostridium perfringens , Ruminococcus gnavus , Bifidobacterium longum , Lacticaseibacillus paracasei , and Lactiplantibacillus plantarum. Ply491_6 had minimal impact on the viability of these bacteria (Fig. 6 E-F). These results underscore the efficacy and specificity of Ply491_6 against Megamonas , positioning it as a promising candidate for targeted bactericidal therapy against obesity-associated dysbiosis. These findings contribute valuable insights into phage-bacteria interactions in the gut and offer essential data for the development of precision therapies against intestinal pathobionts.

Discussion

The VirHost Hunter framework presented here integrates three highlights. Through control analyses, we verified that each highlight enhanced the prediction performance. In a comprehensive comparison with other methods across multiple taxonomic levels, VirHost Hunter demonstrated superior precision and recall, and it also showed a higher resolution, as it reached accurate species-level host prediction.
This can be attributed to three key factors: 1) the integration of a large language model, specifically ProtT5, allows for advanced contextual understanding of protein sequences, enabling VirHost Hunter to capture functional homology effectively; 2) by focusing on phage tails and lysins, VirHost Hunter relates directly to the functional roles of these key proteins in phage-host interactions and makes accurate, high-resolution predictions even when genomic data are incomplete; 3) the incorporation of DNA sequence features, such as codon usage and nucleotide composition, as a complement to protein features further enriches the predictive capabilities of VirHost Hunter. These results provide insights into how to leverage machine learning to predict protein function and mine sequencing data in the future. Because the CRISPR-based method has been the most widely used tool for assigning bacterial hosts, we also compared the performance of VirHost Hunter and the CRISPR-based method using two independent datasets with biological experimental evidence: a collection of 156 cultivated gut phages, and another collection of 31 lysins (Supplementary Results). Both methods had similar recall when precision was set at 84% and 69%, but VirHost Hunter had higher recall than the CRISPR-based method when precision was set at 95%. Interestingly, combining both methods resulted in an improved host assignment ratio compared with either alone, likely in part because of the differences in training datasets. The synergy between VirHost Hunter and the CRISPR-based predictions allowed us to expand the host assignment ratio of the GPD from 28.66% to 62.66%. We therefore propose a guideline: users should prioritize VirHost Hunter when aiming for highly precise or species-level prediction, and can use VirHost Hunter and the CRISPR-based method in parallel for general purposes.
Using the calibrated model, we greatly improved the host assignment ratio of the gut phage database, particularly for phages associated with chronic diseases. We also identified dozens of new phages targeting Akkermansia muciniphila and Prevotella copri , whose phages have hardly been characterized before. To further promote application of the resource, we established the Gut Phage Lysin Database, cataloging 117,698 host-specific lysins targeting various gut bacteria. This database is pivotal for identifying and engineering lysins, particularly against bacteria linked to chronic diseases. As a proof of concept, we selected a lysin from the database for synthesis and verified its efficacy and specificity against Megamonas , an obesity-inducing bacterium. We are not aware of any previously reported means of targeting Megamonas , and our repeated efforts to isolate Megamonas phages over the past few years have failed. In fact, it has been rather difficult to isolate phages targeting obligate anaerobic bacteria in general, so VirHost Hunter can be exceptionally useful in this scenario, deciphering new phages to reveal biological insights and discovering new lytic proteins to inform therapeutic potential. Fujimoto et al. showed that an E. faecalis phage-derived endolysin worked effectively in humanized gnotobiotic acute graft-versus-host disease (GVHD) mice, as it decreased levels of intestinal cytolysin-positive E. faecalis and significantly increased survival 89 . Compared with the 7-log demonstrated by Fujimoto et al. , lysin Ply491_6 inhibited bacterial growth by only 1- to 2-log, which is a good start but requires further engineering for downstream application.
Some possible directions for engineering include: 1) fusing lytic proteins with functional peptides to form nanoparticles, which can enhance both lytic efficacy and stability 90 ; 2) integrating the enzymatic active domains (EAD) and cell wall binding domains (CBD) from different lysins, particularly for endolysins, to boost lytic activity and broaden the host range 91 ; 3) introducing targeted mutations at active sites or increasing positive charges to enhance lytic activity and binding efficiency 92 ; 4) fusing lysins with receptor-binding proteins to improve the targeting specificity 93 . While VirHost Hunter demonstrated strong predictive performance, it has some limitations. Firstly, we only utilized phage tails and lysins for model training and host prediction. Although our data ruled out the possibility of using two structural proteins, other proteins might also confer different levels of host specificity. Secondly, VirHost Hunter should be robust when calibrated with any dataset, but owing to the focus of this work we only verified the scenario of searching for phages and lysins targeting disease-associated gut bacteria. Future research should aim to refine VirHost Hunter by incorporating a broader range of datasets, including diverse protein datasets and environmental contexts. A great advantage is that VirHost Hunter only requires key proteins as input, which can be extracted from prophages integrated within bacterial genomes and from fragmented phage genomes in metagenomic sequencing data, vastly expanding the scale of datasets. For instance, VirHost Hunter can be calibrated to target other gut bacteria that are not necessarily disease-associated, further improving the host assignment ratio of the gut phage database. The implications of this study also extend to the broader field of environmental microbiology beyond the gut microbiome, as environmental microbiologists face an even worse situation in phage host assignment.
With an estimated 10³¹ particles globally, phages are a key component of Earth’s ecosystems and play crucial roles in regulating microbial populations, nutrient cycling, and ecosystem dynamics 94 , 95 . VirHost Hunter can be calibrated to target environmental bacteria, shedding light on the "viral dark matter" and its interactions with bacteria in various ecosystems, among which extreme environments will be of special interest. Identifying the hosts of environmental phages will enhance our understanding of virus-host-environment interactions, their role in microbial community structures, and their influence on biogeochemical processes. These insights can inform conservation efforts, bioremediation strategies, and the management of microbial communities in natural and engineered environments.

Methods

Establishment of the VirHost Hunter framework

VirHost Hunter consists of two primary components: a feature extractor and a classifier (Fig. 1 A). To extract features, we use three distinct tools to process phage-specific proteins and their corresponding DNA sequences, resulting in three types of features: protein sequence embeddings from the pre-trained ProtT5 model, physical-chemical characteristics of DNA sequences, and k-mer features of DNA sequences extracted via a DNN. These features are described in detail below. For protein sequence representation, we leveraged the pre-trained protein language model ProtT5 to generate dense vector representations (embeddings) of protein sequences. Specifically, we utilized only the encoder portion of the ProtT5 model. The encoder integrates essential components such as a multi-head attention mechanism and feedforward layers, enabling it to capture intricate relationships between amino acid residues in the input protein sequence. This process yields rich embedding vectors containing valuable information regarding protein structure and functionality.
We extract the average embedding vector from the last layer of the pre-trained model to generate the embedded feature vector, resulting in a 1024-dimensional feature vector for each protein sequence. The physical-chemical features used to represent DNA sequences follow the methodology proposed by Boeckaerts et al. 39 . These features encompass nucleotide frequency, GC content, codon frequency, and codon usage bias, amounting to a total of 133 dimensions for the representation of DNA sequences. To preserve the intrinsic sequence information of DNA sequences, we encoded them following the approach outlined by Wang et al. in their study DeepHost 38 . This method represents DNA sequences through K-mer frequency. We then construct a deep neural network (DNN). The DNN incorporates a convolutional neural network with three paths, each outputting a different number of channels, facilitating the capture of feature information at varying scales (Fig. 1 B). By using multiple channels in parallel and fusing their outputs, the model can simultaneously learn abstract features at different levels. Next, we leverage the Vision Transformer (ViT) 53 , 96 , using its self-attention mechanism to capture global relationships and multi-channel feature representations, yielding richer embeddings of the original sequence features. Finally, we merge the three types of features learned by the model into a unified vector, which serves as input for the classifier ensemble. This ensemble includes an MLP neural network, an autoencoder, and a random forest (RF). Throughout the training process, the MLP neural network updates its parameters. We employ the softmax function as the activation function and cross-entropy as the loss function, and use the Adam algorithm to optimize the loss function.
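The DNA-derived features described above (nucleotide frequency, GC content, and K-mer frequency) can be illustrated with a minimal sketch. The function names and the choice of k = 3 are assumptions made for illustration only; the 133-dimensional feature set of Boeckaerts et al. and the DeepHost encoding include details that are not reproduced here.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"

def nucleotide_frequency(seq):
    """Fraction of A, C, G and T among the unambiguous bases of seq."""
    counts = Counter(seq.upper())
    total = sum(counts[n] for n in NUCLEOTIDES)
    return {n: (counts[n] / total if total else 0.0) for n in NUCLEOTIDES}

def gc_content(seq):
    """GC content as a fraction between 0 and 1."""
    freq = nucleotide_frequency(seq)
    return freq["G"] + freq["C"]

def kmer_frequency(seq, k=3):
    """Normalized frequencies of all 4**k k-mers, in lexicographic order.

    Overlapping windows are counted; windows containing characters other
    than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

For a coding sequence `cds`, `kmer_frequency(cds, k=3)` returns a 64-dimensional vector that could be concatenated with the other feature types before classification.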
Owing to its inherent characteristics, the softmax function can produce high-confidence predictions for incorrect categories, with confidence levels exceeding those justified by true probability estimates. To address this issue, we integrate both the RF and the MLP neural network into the classification prediction process. We first train the MLP neural network and the feature extractor, then fix the parameters of the feature extractor. Next, we train the autoencoder and the RF. During testing, predictions are a blend of MLP and RF outputs, with the RF correcting highly confident but potentially inaccurate MLP predictions. Supplementary Table S10 provides the detailed VirHost Hunter construction parameters.

Bioinformatics pipeline for phage genome annotations

A bioinformatics pipeline was developed to enable the rapid and efficient annotation of phage tails and lytic proteins. The pipeline involved several steps. Firstly, proteins were predicted from phage genomes using Prodigal v2.6.3 (-f gff -c -p meta) 97 . Secondly, the predicted proteins were aligned against multiple databases, including 1) the NR phage protein database using Blastp v2.3.0 (-evalue 1e-5 -max_target_seqs 1 -outfmt ‘6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore’) 98 , 2) the Uniref phage protein database using phmmer v3.1b2 (-E 1e-5), 3) the Uniprotkb phage protein database using phmmer v3.1b2 (-E 1e-5) and 4) the TIGRFAM, SMART, CDD, ProSiteProfiles, SUPERFAMILY, PRINTS, PANTHER, Gene3D, PIRSF, Pfam, Coils, and MobiDBLite databases using hmmscan v3.1b2 (-E 1e-5). The final annotation was obtained by merging all alignment results. For phage tail protein identification, the keyword ‘tail’ was used to extract sequences from the final annotation results through Seqkit v0.16.0 99 .
For phage lysin proteins, additional filters were applied in the blastp step, including ≥ 50% coverage and ≥ 50% identity, and specific keywords (‘lysis’/‘lyase’/‘lysin’/‘holin’/‘hydrolase’/‘spanin’/‘endolysin’) were used to extract sequences from the final annotation results through Seqkit v0.16.0 99 .

Construction of benchmark datasets

Complete phage genomes were collected from NCBI using specific keywords related to bacterial hosts, including ‘ Staphylococcus ’, ‘ Acinetobacter ’, ‘ Escherichia ’, ‘ Clostridium ’, ‘ Klebsiella ’, ‘ Pseudomonas ’, and ‘ Salmonella ’. A total of 3,116 phage genomes were collected. Protein annotation was performed using the bioinformatics pipeline, resulting in 22,151 phage tail proteins (21,264 from the pipeline and 887 from a published paper by Boeckaerts et al . 39 ). From these, 7,493 RBPs were screened out using specific keywords related to tail protein functions, including ‘fiber’, ‘fibre’, and ‘spike’. Three filters were applied to clean the tail protein and RBP datasets: 1) sequences shorter than 50 amino acids or longer than 1,500 amino acids were removed, 2) sequences containing the undetermined amino acid ‘X’ in the protein sequence or undetermined nucleotides ‘N’ in the CDS were excluded, and 3) identical protein sequences with different hosts were discarded to remove redundancy. The final benchmark datasets consisted of 4,845 RBPs in DRRBP and 12,509 tail proteins in DRTail, respectively.

Construction of tail protein and lysin datasets at multi-taxonomic levels

Phage genomes from the viral category were screened in the NCBI database as of December 29, 2021 ( https://www.ncbi.nlm.nih.gov/genome/browse/#!/viruses ). Entries containing partial genomes and coding sequences were excluded from the dataset. Genomic sequences in FASTA format and annotation files in GBFF format were downloaded from the corresponding table on the NCBI FTP site.
This screening process resulted in a total of 7,598 phage genomes for further analysis. Next, information on the host organism was extracted from the annotation files (GBFF format) using a custom script. If the ‘host=’ field was empty, the species information preceding the phage in the GenBank title (ORGANISM) was selected as the host information. For instance, if the host information of phage AF234172 was empty, we selected ‘Escherichia’ as the host based on the record ‘ORGANISM: Escherichia virus P1’. Then, the NCBI taxonomy toolkit TaxonKit was used to obtain the taxonomy ID and taxonomic rank of the host organism (taxonkit name2taxid --show-rank). The host taxonomic information was transformed into a standard format including phylum, class, order, family, genus, species, and strain (taxonkit lineage | taxonkit reformat | cut -f 1,3). This process resulted in the compilation of phage-host taxonomic rank information. We counted the number of RBPs and tail proteins in the datasets and observed an average of 1.33 RBPs and 15.24 tail proteins per phage (Figure S1). We also found that 53.10% of phages lacked RBPs, prompting us to construct datasets at different taxonomic ranks that enable a tail protein-based VirHost Hunter (VirHost Hunter-tail) for broader applications. We filtered the phage data and created a phage tail protein dataset, including 37 families, 54 genera, and 57 species. Additionally, we trained VirHost Hunter on lysins, another type of host-specific protein, using 37,469 lysin protein sequences from the same 7,598 phages to construct a lysin-based VirHost Hunter (VirHost Hunter-lysin). The lysin dataset comprised 37 families, 42 genera and 47 species. Family, genus, and species datasets for tail proteins were constructed based on the taxonomic ranks obtained in the previous step. The three datasets were filtered using the same three filters as used in constructing the benchmark datasets.
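The host-lookup fallback described above (use the ‘host=’ field when present, otherwise the name preceding the phage designation in the ORGANISM record) can be sketched as follows. This is an illustrative reconstruction, not the custom script used in this work: the function name `infer_host` and the list of marker words are assumptions.

```python
# Words assumed to mark the start of the phage designation in an
# ORGANISM record such as "Escherichia virus P1" (hypothetical list).
PHAGE_MARKERS = ("virus", "phage", "bacteriophage")

def infer_host(host_field, organism):
    """Return the annotated host, or fall back to the ORGANISM record.

    host_field: value of the GenBank 'host=' qualifier (may be empty or None)
    organism:   value of the ORGANISM record
    """
    host = (host_field or "").strip()
    if host:
        return host
    words = organism.split()
    for i, word in enumerate(words):
        if word.lower() in PHAGE_MARKERS:
            # Everything before the marker word is taken as the host name.
            return " ".join(words[:i]) or None
    return None  # no marker found: host remains unknown
```

The returned name would then be passed to TaxonKit to resolve its taxonomy ID and rank.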
Categories with fewer than 50 counts in each taxonomic dataset were discarded. After filtering, 47 families, 72 genera, and 120 species remained in the tail protein datasets. These three datasets were used to train the VirHost Hunter-tail model. To address bias issues observed in certain taxa, taxa with precision lower than 0.7 were eliminated from the datasets. For example, Enterobacteriaceae had a recall of 0.9098 and a precision of 0.6941 at the family level, Escherichia had a recall of 0.7125 and a precision of 0.3526 at the genus level, and Escherichia coli had a recall of 0.7373 and a precision of 0.3199 at the species level. As a result, the final set of taxa included 37 families, 54 genera, and 56 species. Each dataset was randomly split into training, validation, and testing sets in a proportion of 6:2:2. The original lysin dataset contained a total of 37,469 protein sequences. The same three filters as previously applied to the benchmark datasets were used to clean the lysin dataset. However, the maximum allowed sequence length was set to 1,000 amino acids, since protein sequences longer than 1,000 amino acids accounted for less than 2%. The screening process and building procedures for the VirHost Hunter-lysin dataset followed a similar approach to VirHost Hunter-tail. Taxa with precision lower than 0.62 were eliminated from the family, genus, and species datasets based on the training results. This step ensured reliable predictions for the remaining taxa. After eliminating low-precision taxa, the final lysin datasets consisted of 37 families, 42 genera, and 47 species. Each dataset was randomly split into training, validation, and testing sets in a proportion of 6:2:2.
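The dataset preparation steps above (the three cleaning filters, removal of categories with fewer than 50 entries, and the 6:2:2 split) can be sketched as below. This is a minimal illustration under stated assumptions: the record layout (dicts with `protein`, `cds`, and `host` keys) and all function names are hypothetical, and filter 3 is interpreted here as dropping every copy of a protein sequence that occurs with more than one host.

```python
import random
from collections import Counter, defaultdict

def clean_records(records, min_len=50, max_len=1500):
    """Apply the three cleaning filters to a list of record dicts."""
    # Filters 1 and 2: length bounds, no undetermined 'X' or 'N'.
    kept = [
        r for r in records
        if min_len <= len(r["protein"]) <= max_len
        and "X" not in r["protein"].upper()
        and "N" not in r["cds"].upper()
    ]
    # Filter 3 (assumed reading): discard identical proteins seen with
    # different hosts.
    hosts_by_protein = defaultdict(set)
    for r in kept:
        hosts_by_protein[r["protein"]].add(r["host"])
    return [r for r in kept if len(hosts_by_protein[r["protein"]]) == 1]

def drop_rare_categories(records, min_count=50):
    """Discard host categories with fewer than min_count records."""
    counts = Counter(r["host"] for r in records)
    return [r for r in records if counts[r["host"]] >= min_count]

def split_622(records, seed=0):
    """Randomly split into training, validation and testing sets (6:2:2)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For the lysin dataset, the same sketch would be called with `max_len=1000`, matching the shorter length limit described above.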
Additional filter for higher precision at multi-taxonomic ranks

Since the range of categories that our model can cover is limited, an additional filter was applied to VirHost Hunter trained on the multi-taxonomic-level datasets and the gut prophage dataset, generating an ‘Unknown’ output for any input that fell outside the prediction range. To determine the appropriate cutoff for this filter, two datasets were constructed: a Positive Control, comprising samples from the test dataset, and a Negative Control, containing samples not belonging to any predefined class in the training dataset. The recall and precision on the Positive Control and the specificity on the Negative Control are illustrated in Figure S5 and Figure S6. These figures show that more stringent cutoffs resulted in higher precision and lower recall. This occurred because, as the cutoff increased, more data were classified as ‘Unknown’ and the remaining data were considered more reliable by VirHost Hunter. To benchmark VirHost Hunter’s performance against other methods, we considered the work of Dion et al. 21 , who evaluated the precision and recall of a CRISPR spacer-based method under different cutoffs of mismatch number or e-value at the genus level. They found that with an e-value of 10⁻⁹, the method achieved the highest precision of 95% but the lowest recall of 2.5%. With zero mismatches, the method achieved 84% precision and 31% recall. By tolerating two mismatches, the method obtained a balanced performance of 69% precision and 49% recall. Accordingly, several probability cutoffs were selected at the family, genus, and species levels to achieve the same precision values of 95%, 84%, and 69%, respectively (Table S11). Consequently, when the precision on the Positive Control and the specificity on the Negative Control both surpassed 95%, VirHost Hunter was considered to have a precision of 95%.
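The probability cutoff filter described above can be sketched as follows. This is a simplified illustration, not the calibration procedure used in this work: precision is taken here as the fraction of assigned (non-‘Unknown’) predictions that are correct and recall as the fraction of all samples that are correctly assigned, and the function names and candidate cutoffs are assumptions.

```python
def predict_with_unknown(probabilities, labels, cutoff):
    """Return the top-scoring label, or 'Unknown' below the cutoff."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best] if probabilities[best] >= cutoff else "Unknown"

def precision_recall_at_cutoff(scored, cutoff):
    """scored: list of (top_probability, is_correct) pairs from a
    positive-control set. Returns (precision, recall) at the cutoff."""
    assigned = [ok for p, ok in scored if p >= cutoff]
    correct = sum(assigned)
    # With no assigned predictions, precision is reported as 0.0 here.
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(scored) if scored else 0.0
    return precision, recall

def lowest_cutoff_for_precision(scored, target, candidates):
    """Smallest candidate cutoff whose precision reaches the target,
    returned as (cutoff, precision, recall), or None if none qualifies."""
    for cutoff in sorted(candidates):
        precision, recall = precision_recall_at_cutoff(scored, cutoff)
        if precision >= target:
            return cutoff, precision, recall
    return None
```

Choosing the smallest qualifying cutoff keeps recall as high as possible for a given precision target, which mirrors the trade-off between precision and recall noted above.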
Extraction of synthetic lysins from PhaLP

The latest SQL file (v2021_04) was downloaded from the largest available Phage Lytic Protein database (PhaLP) 16 . The SQL file provided lysin IDs, corresponding phage genome IDs, lysin annotation information, host taxonomy information, and experimental support information. Phage genome IDs that were not used in VirHost Hunter construction were marked in the dataset collected from NCBI, resulting in 3,448 phage genomes. Lysins that had been synthesized and experimentally validated, together with their corresponding 31 phages, were screened out from the dataset. A total of 138 tail proteins were annotated using the custom bioinformatics pipeline. Three phages could not be annotated with tail proteins, leaving final real-world evidence of 31 phage genomes and 138 phage tail protein sequences.

Extraction of phage tail and lysin proteins from GPD

The Gut Phage Database (GPD) and the corresponding taxonomy information table by Camarillo-Guerrero et al. were downloaded 43 . Based on the ‘Host_range_taxo’ field in the information table, phages were categorized into two groups: those with host information and those without host information. Phage tail and lysin proteins were annotated using a custom annotation pipeline. A total of 163,590 lysin sequences and 388,894 tail protein sequences were obtained from 111,355 phages, which accounted for 77.97% of the total 142,809 phage genomes. A comprehensive dataset of 42,586 proteins was downloaded from NCBI (as of February 22, 2024) using keywords (lysis protein, lysin, lyase, holin, hydrolase, and endolysin AND phage). Lysins encoded by gut phages were identified by comparison with this dataset using BLASTP with thresholds of 60% identity and 50% coverage. The VirHost Hunter-lysin model (95%, 84%), the VirHost Hunter-tail model (95%, 84%), and the CRISPR-based method (84%) were used to construct the Gut Phage Lysin Database (GPLD) targeting human gut commensal bacteria.
Statistical analysis and sequence similarity network for the Gut Phage Lysin Database (GPLD)

Biopython was employed to conduct a comprehensive statistical analysis of the GPLD, covering protein categories, secondary structure proportions, length, amino acid composition, molecular weight, isoelectric point, and stability index. The results were visualized using ggplot2 100 . A sequence similarity network was established using a tool developed by Miguel M. Sandin (available at https://github.com/MiguelMSandin/SSNetworks ), based on lysin sequences clustered using CD-HIT with a 70% similarity threshold. The network construction parameters were set at an identity level of 35% and a coverage level of 50%. The resulting networks were visualized using Cytoscape. Furthermore, MEME was used with default parameters to identify conserved motif sites within the three primary clusters of the network.

Identification of Megamonas-targeting lysin from GPLD

A total of 536 unique lysin sequences specific to the genus Megamonas were identified from the GPLD. These sequences were subsequently clustered into 167 distinct groups using CD-HIT with a sequence similarity threshold of 95% and a coverage threshold of 90% (-c 0.95 -aL 0.9). Ply491_6 (ivig_491_6), representing the largest cluster among these groups, was chosen for in-depth characterization and experimental validation of lytic activity. This process involved the prediction of signal peptides using SignalP ( https://services.healthtech.dtu.dk/services/SignalP-6.0/ ), identification of transmembrane regions with HMMTOP ( https://services.healthtech.dtu.dk/services/TMHMM-2.0/ ), and assessment of physicochemical properties via ProtParam ( https://web.expasy.org/protparam/ ).
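As an illustration of one of the physicochemical properties reported by ProtParam, the grand average of hydropathicity (GRAVY) is the mean Kyte-Doolittle hydropathy value over all residues of a protein, with negative values indicating a hydrophilic protein. The sketch below computes it directly; the function name `gravy` is an assumption for illustration, and any sequence passed to it in this example is a toy input, not the Ply491_6 sequence.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(protein):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    unknown = set(protein) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)
```

Rejecting non-standard residues rather than skipping them keeps the average well defined; a pipeline that tolerates ambiguous residues would need an explicit rule for them.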
Synthesis and purification of Ply491_6

To synthesize and purify Ply491_6, the gene encoding Ply491_6 was synthesized and subcloned into the pET-30a(+) plasmid using the NdeI and XhoI restriction sites. The plasmids were constructed and transformed into BL21 (DE3) competent cells. The transformed cells were cultured at 37°C on agar plates containing kanamycin at a final concentration of 50 µg/mL. Colonies were picked from the plates and cultured until the optical density at 600 nm (OD 600 ) reached 0.6–0.8. Protein expression was induced by adding IPTG to a final concentration of 0.5 mM, followed by incubation of the cultures for an additional 4 hours at 37°C. Cells were then harvested and lysed, and the lysates were subjected to SDS-PAGE to verify protein expression. The proteins were then purified using Ni-NTA affinity chromatography. The purified proteins were dialyzed into phosphate-buffered saline (PBS) containing 300 mM NaCl and 10% glycerol, adjusted to pH 7.4, followed by filter sterilization.

Bacterial strains

Megamonas rupellensis strain 150922 was used for the in vitro lysin activity assay of Ply491_6. Bacteroides fragilis bf2 (BF1), B. fragilis bf5 (BF2), Clostridium perfringens 0840 (CP1), C. perfringens 0812 (CP2), Ruminococcus gnavus 1177 (RG1), R. gnavus 1186 (RG2), Bifidobacterium longum 4486 (BL1), B. longum 2366 (BL2), Lacticaseibacillus paracasei LAC-F (LP1), L. paracasei LAC-J (LP2) and Lactiplantibacillus plantarum SZHD0015 ( L. plantarum ) were used for comparing the lysin activity of Ply491_6 in vitro . All bacterial strains were isolated from human feces. All bacterial strains were grown overnight in BHI-YH (Brain Heart Infusion medium supplemented with 5 g/L yeast extract and 5 mg/L hemin). To maintain anaerobic conditions, all media and buffers were additionally supplemented with 0.5 g/L L-cysteine hydrochloride and 0.25 g/L anhydrous sodium sulfide, serving as reducing agents.

Lytic activity and specificity of Ply491_6

M. rupellensis strain 150922 was grown overnight, diluted 1:100, and grown to the mid-logarithmic phase. The bacterial cells were centrifuged, washed, and resuspended in phosphate-buffered saline (PBS, pH 7.4) to an OD 600 of 0.9. Phage lysin Ply491_6 was added to the bacterial suspension at a final concentration of 20 µg/mL. Each concentration was plated in a U-bottomed 96-well plate in triplicate. Ply491_6 dilution plates were then incubated at 37°C in a BioTek Epoch2 Microplate Spectrophotometer (BioTek Instruments, Inc., USA) for 240 minutes. The OD 600 was measured every 10 min. To verify the specificity of Ply491_6, B. fragilis (n = 2), C. perfringens (n = 2), R. gnavus (n = 2), B. longum (n = 2), L. paracasei (n = 2) and L. plantarum (n = 1) strains were each grown overnight. The bacterial cells were then centrifuged, washed, and resuspended in PBS, and were incubated with 20 µg/mL Ply491_6 or PBS at 37°C for 240 min. The OD 600 was measured every 10 min.

Declarations

Author Information

M.X. conceived the study. Z.D., K.L., and Y.O. developed the tool. M.L. and B.X. compiled the training, validation, and test sets. K.L., M.L., B.X., and M.X. analyzed the viral dark matter. K.L., M.L., B.X., and Y.O. drafted the manuscript and made the figures. Z.D., M.X., and Junhua L. revised the manuscript. Jianqiang L., J.W., H.Y., and X.X. provided consultation. All authors read, edited, and approved the final manuscript.

Acknowledgements

This work is supported by the National Key R&D Program of China (2020YFA0908700) and National Natural Science Foundation of China Grants 32100130 and 62176164. We sincerely thank the China National GeneBank DataBase (CNGB) for providing valuable data support and computational resources. We extend our heartfelt sympathy to Min Li and Kaihuang Lin, who, despite being co-first authors, unfortunately did not witness the fruition of their work before their graduation. Their unwavering support since then has been invaluable.
We hope that the next-generation co-second authors, Bo Xing and Yuehua Ou, will enjoy greater fortune in their academic endeavors.

References
1. Bayfield OW et al (2023) Structural atlas of a human gut crassvirus. Nature 617:409–416
2. Koskella B, Brockhurst MA (2014) Bacteria-phage coevolution as a driver of ecological and evolutionary processes in microbial communities. FEMS Microbiol Rev 38:916–931
3. Borin JM, Avrani S, Barrick JE, Petrie KL, Meyer JR (2021) Coevolutionary phage training leads to greater bacterial suppression and delays the evolution of phage resistance. Proc Natl Acad Sci U S A 118
4. Blazanin M, Turner PE (2021) Community context matters for bacteria-phage ecology and evolution. ISME J 15:3119–3128
5. Lawrence D, Baldridge MT, Handley SA (2019) Phages and Human Health: More Than Idle Hitchhikers. Viruses 11
6. Federici S, Nobs SP, Elinav E (2021) Phages and their potential to modulate the microbiome and immunity. Cell Mol Immunol 18:889–904
7. Bhargava K, Nath G, Bhargava A, Aseri GK, Jain N (2021) Phage therapeutics: from promises to practices and prospectives. Appl Microbiol Biotechnol 105:9047–9067
8. Vijay A, Valdes AM (2022) Role of the gut microbiome in chronic diseases: a narrative review. Eur J Clin Nutr 76:489–501
9. Guerin E, Hill C (2020) Shining Light on Human Gut Bacteriophages. Front Cell Infect Microbiol 10:481
10. Porter NT et al (2020) Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat Microbiol 5:1170–1181
11. Vazquez R, Garcia E, Garcia P (2018) Phage Lysins for Fighting Bacterial Respiratory Infections: A New Generation of Antimicrobials. Front Immunol 9:2252
12. Ghose C, Euler CW (2020) Gram-Negative Bacterial Lysins. Antibiotics (Basel) 9
13. Danis-Wlodarczyk KM, Wozniak DJ, Abedon ST (2021) Treating Bacterial Infections with Bacteriophage-Based Enzybiotics: In Vitro, In Vivo and Clinical Application. Antibiotics (Basel) 10
14. Rahman MU et al (2021) Endolysin, a Promising Solution against Antimicrobial Resistance. Antibiotics (Basel) 10
15. Lee C, Kim H, Ryu S (2023) Bacteriophage and endolysin engineering for biocontrol of food pathogens/pathogens in the food: recent advances and future trends. Crit Rev Food Sci Nutr 63:8919–8938
16. Criel B, Taelman S, Van Criekinge W, Stock M, Briers Y (2021) PhaLP: A Database for the Study of Phage Lytic Proteins and Their Evolution. Viruses 13
17. Coutinho FH et al (2021) RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns (N Y) 2:100274
18. Pons JC et al (2021) VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 37:1805–1813
19. Amgarten D, Iha BKV, Piroupo CM, da Silva AM, Setubal JC (2022) vHULK, a New Tool for Bacteriophage Host Prediction Based on Annotated Genomic Features and Neural Networks. Phage (New Rochelle) 3:204–212
20. Zielezinski A, Barylski J, Karlowski WM (2021) Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships. BMC Biol 19:223
21. Dion MB et al (2021) Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res 49:3127–3138
22. Zhang R et al (2021) SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37:3364–3366
23. Edwards RA, McNair K, Faust K, Raes J, Dutilh BE (2016) Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol Rev 40:258–272
24. Li J, Yang F, Xiao M, Li A (2022) Advances and challenges in cataloging the human gut virome. Cell Host Microbe 30:908–916
25. Galiez C, Siebert M, Enault F, Vincent J, Soding J (2017) WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33:3113–3114
26. Leite DMC et al (2018) Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinformatics 19:420
27. Li M et al (2021) A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction. IEEE/ACM Trans Comput Biol Bioinform 18:1801–1810
28. Li M, Zhang W (2022) PHIAF: prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform 23
29. Liu D, Ma Y, Jiang X, He T (2019) Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinformatics 20:594
30. Wang W et al (2020) A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform 2:lqaa044
31. Lu C et al (2021) Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol 19:5
32. Shang J, Sun Y (2021) Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol 19:250
33. Shang J, Sun Y (2022) CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. Brief Bioinform 23
34. Tan J et al (2022) HoPhage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics 38:543–545
35. Tang T, Hou S, Fuhrman JA, Sun F (2022) Phage-bacterial contig association prediction with a convolutional neural network. Bioinformatics 38:i45–i52
36. Zielezinski A, Deorowicz S, Gudys A (2022) PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics 38:1447–1449
37. Villarroel J et al (2016) HostPhinder: A Phage Host Prediction Tool. Viruses 8
38. Ruohan W, Xianglilan Z, Jianping W, Shuai Cheng LI (2022) DeepHost: phage host prediction with convolutional neural network. Brief Bioinform 23
39. Boeckaerts D et al (2021) Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci Rep 11:1467
40. Gonzales MEM, Ureta JC, Shrestha AMS (2023) Protein embeddings improve phage-host interaction prediction. PLoS ONE 18:e0289030
41. Paez-Espino D et al (2016) Uncovering Earth's virome. Nature 536:425–430
42. Gregory AC et al (2020) The Gut Virome Database Reveals Age-Dependent Patterns of Virome Diversity in the Human Gut. Cell Host Microbe 28:724–740e728
43. Camarillo-Guerrero LF, Almeida A, Rangel-Pineros G, Finn RD, Lawley TD (2021) Massive expansion of human gut bacteriophage diversity. Cell 184:1098–1109e1099
44. Nayfach S et al (2021) Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat Microbiol 6:960–970
45. Tisza MJ, Buck CB (2021) A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc Natl Acad Sci U S A 118
46. Roux S et al (2023) iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol 21:e3002083
47. Dams D, Brondsted L, Drulis-Kawa Z, Briers Y (2019) Engineering of receptor-binding proteins in bacteriophages and phage tail-like bacteriocins. Biochem Soc Trans 47:449–460
48. Yehl K et al (2019) Engineering Phage Host-Range and Suppressing Bacterial Resistance through Phage Tail Fiber Mutagenesis. Cell 179:459–469e459
49. Opperman CJ, Wojno JM, Brink AJ (2022) Treating bacterial infections with bacteriophages in the 21st century. S Afr J Infect Dis 37:346
50. Rakhuba DV, Kolomiets EI, Dey ES, Novik GI (2010) Bacteriophage receptors, mechanisms of phage adsorption and penetration into host cell. Pol J Microbiol 59:145–155
51. Nelson D, Schuch R, Chahales P, Zhu S, Fischetti VA (2006) PlyC: a multimeric bacteriophage lysin. Proc Natl Acad Sci U S A 103:10765–10770
52. Flamholz ZN, Biller SJ, Kelly L (2024) Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 9:537–549
53. Dosovitskiy A et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
54. Elnaggar A et al (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 44:7112–7127
55. Kim GB, Gao Y, Palsson BO, Lee SY (2021) DeepTFactor: A deep learning-based tool for the prediction of transcription factors. Proc Natl Acad Sci U S A 118
56. Breiman L (2001) Random forests. Mach Learn 45:5–32
57. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152
58. Lloyd-Price J et al (2019) Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569:655–662
59. Schirmer M, Garner A, Vlamakis H, Xavier RJ (2019) Microbial genes and pathways in inflammatory bowel disease. Nat Rev Microbiol 17:497–511
60. Ternes D et al (2022) The gut microbial metabolite formate exacerbates colorectal cancer progression. Nat Metab 4:458–475
61. Wong SH, Yu J (2019) Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat Rev Gastroenterol Hepatol 16:690–704
62. Qin Y et al (2024) Consistent signatures in the human gut microbiome of old- and young-onset colorectal cancer. Nat Commun 15:3396
63. Liu R et al (2017) Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat Med 23:859–868
64. Jie Z et al (2017) The gut microbiome in atherosclerotic cardiovascular disease. Nat Commun 8:845
65. Wu C et al (2024) Obesity-enriched gut microbe degrades myo-inositol and promotes lipid absorption. Cell Host Microbe 32:1301–1314e1309
66. Wang T et al (2024) Divergent age-associated and metabolism-associated gut microbiome signatures modulate cardiovascular disease risk. Nat Med 30:1722–1731
67. Qin J et al (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55–60
68. Shen J et al (2023) Large-scale phage cultivation for commensal human gut bacteria. Cell Host Microbe 31:665–677e667
69. Dao MC et al (2016) Akkermansia muciniphila and improved metabolic health during a dietary intervention in obesity: relationship with gut microbiome richness and ecology. Gut 65:426–436
70. Shin NR et al (2014) An increase in the Akkermansia spp. population induced by metformin treatment improves glucose homeostasis in diet-induced obese mice. Gut 63:727–735
71. Shih CT, Yeh YT, Lin CC, Yang LY, Chiang CP (2020) Akkermansia muciniphila is Negatively Correlated with Hemoglobin A1c in Refractory Diabetes. Microorganisms 8
72. Zhang T et al (2020) Alterations of Akkermansia muciniphila in the inflammatory bowel disease patients with washed microbiota transplantation. Appl Microbiol Biotechnol 104:10203–10215
73. Lo Sasso G et al (2021) Inflammatory Bowel Disease-Associated Changes in the Gut: Focus on Kazan Patients. Inflamm Bowel Dis 27:418–433
74. Danilova NA et al (2019) Markers of dysbiosis in patients with ulcerative colitis and Crohn's disease. Ter Arkh 91:17–24
75. Zhu F et al (2020) Metagenome-wide association of gut microbiome features for schizophrenia. Nat Commun 11:1612
76. Alpizar-Rodriguez D et al (2019) Prevotella copri in individuals at risk for rheumatoid arthritis. Ann Rheum Dis 78:590–593
77. Scher JU et al (2013) Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2:e01202
78. Maeda Y et al (2016) Dysbiosis Contributes to Arthritis Development via Activation of Autoreactive T Cells in the Intestine. Arthritis Rheumatol 68:2646–2661
79. Tsai CY et al (2023) Abundance of Prevotella copri in gut microbiota is inversely related to a healthy diet in patients with type 2 diabetes. J Food Drug Anal 31:599–608
80. Yue T et al (2022) High-risk genotypes for type 1 diabetes are associated with the imbalance of gut microbiome and serum metabolites. Front Immunol 13:1033393
81. Yang C et al (2024) Prevotella copri alleviates hyperglycemia and regulates gut microbiota and metabolic profiles in mice. mSystems 9:e0053224
82. Devoto AE et al (2019) Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat Microbiol 4:693–700
83. Weston J, Elisseeff A, Zhou D, Leslie CS, Noble WS (2004) Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci U S A 101:6559–6563
84. Copp JN, Anderson DW, Akiva E, Babbitt PC, Tokuriki N (2019) Exploring the sequence, function, and evolutionary space of protein superfamilies using sequence similarity networks and phylogenetic reconstructions. Methods Enzymol 620:315–347
85. Dey KK, Xie D, Stephens M (2018) A new sequence logo plot to highlight enrichment and depletion. BMC Bioinformatics 19:473
86. Gupta A, Osadchiy V, Mayer EA (2020) Brain-gut-microbiome interactions in obesity and food addiction. Nat Rev Gastroenterol Hepatol 17:655–672
87. Kasai C et al (2015) Comparison of the gut microbiota composition between obese and non-obese individuals in a Japanese population, as analyzed by terminal restriction fragment length polymorphism and next-generation sequencing. BMC Gastroenterol 15:100
88. Kocelak P et al (2013) Resting energy expenditure and gut microbiota in obese and normal weight subjects. Eur Rev Med Pharmacol Sci 17:2816–2821
89. Fujimoto K et al (2024) An enterococcal phage-derived enzyme suppresses graft-versus-host disease. Nature 632:174–181
90. Dzuvor CKO et al (2022) Engineering Self-Assembled Endolysin Nanoparticles against Antibiotic-Resistant Bacteria. ACS Appl Bio Mater
91. Lee C, Kim J, Son B, Ryu S (2021) Development of Advanced Chimeric Endolysin to Control Multidrug-Resistant Staphylococcus aureus through Domain Shuffling. ACS Infect Dis 7:2081–2092
92. Diez-Martinez R et al (2013) Improving the lethal effect of cpl-7, a pneumococcal phage lysozyme with broad bactericidal activity, by inverting the net charge of its cell wall-binding module. Antimicrob Agents Chemother 57:5355–5365
93. Zampara A et al (2020) Exploiting phage receptor binding proteins to enable endolysins to kill Gram-negative bacteria. Sci Rep 10:12087
94. Hendrix RW, Smith MC, Burns RN, Ford ME, Hatfull GF (1999) Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci U S A 96:2192–2197
95. Adriaenssens EM (2021) Phage Diversity in the Human Gut Microbiome: a Taxonomist's Perspective. mSystems 6:e0079921
96. Raghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A (2021) Do vision transformers see like convolutional neural networks? Adv Neural Inf Process Syst 34:12116–12128
97. Hyatt D et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119
98. McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20–25
99. Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11:e0163962
100. Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer
101. Guo X et al (2020) CNSA: a data repository for archiving omics data. Database (Oxford) 2020
102. Chen FZ et al (2020) CNGBdb: China National GeneBank DataBase. Yi Chuan 42:799–809

Additional Declarations
The authors declare no competing interests.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8534670","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":573444394,"identity":"ea72cb84-86d7-466e-8816-14a208d7344a","order_by":0,"name":"Zhihua Du","email":"","orcid":"","institution":"College of Computer Science and Software Engineering, Shenzhen University, Guangdong Province, PR China","correspondingAuthor":false,"prefix":"","firstName":"Zhihua","middleName":"","lastName":"Du","suffix":""},{"id":573444395,"identity":"8258dc97-13b8-4f96-9d49-2da133167c2c","order_by":1,"name":"Min Li","email":"","orcid":"","institution":"BGI Research, Shenzhen 518083, China","correspondingAuthor":false,"prefix":"","firstName":"Min","middleName":"","lastName":"Li","suffix":""},{"id":573444396,"identity":"e1ea1d02-4a0e-4446-afab-eb2174d7dfa2","order_by":2,"name":"Kaihuang Lin","email":"","orcid":"","institution":"College of Computer Science and Software Engineering, Shenzhen University, Guangdong Province, PR China","correspondingAuthor":false,"prefix":"","firstName":"Kaihuang","middleName":"","lastName":"Lin","suffix":""},{"id":573444397,"identity":"f9290626-3ee1-4125-bac8-41714bf9d8d8","order_by":3,"name":"Bo Xing","email":"","orcid":"","institution":"BGI Research, Shenzhen 518083, China","correspondingAuthor":false,"prefix":"","firstName":"Bo","middleName":"","lastName":"Xing","suffix":""},{"id":573444398,"identity":"378bc6d2-56ac-46bc-878c-bd464782ec2a","order_by":4,"name":"Yuehua Ou","email":"","orcid":"","institution":"College of Computer Science and Software Engineering, Shenzhen University, Guangdong Province, PR 
China","correspondingAuthor":false,"prefix":"","firstName":"Yuehua","middleName":"","lastName":"Ou","suffix":""},{"id":573444399,"identity":"d05ba8b5-936f-4a75-a3b7-7db2a5d76599","order_by":5,"name":"Wenchen Song","email":"","orcid":"","institution":"BGI Research, Shenzhen 518083, China","correspondingAuthor":false,"prefix":"","firstName":"Wenchen","middleName":"","lastName":"Song","suffix":""},{"id":573444400,"identity":"3db4c4e8-43df-4dd1-854f-439768c3b93d","order_by":6,"name":"Jie Chen","email":"","orcid":"","institution":"College of Computer Science and Software Engineering, Shenzhen University, Guangdong Province, PR China","correspondingAuthor":false,"prefix":"","firstName":"Jie","middleName":"","lastName":"Chen","suffix":""},{"id":573444401,"identity":"0c44a6d2-ac6a-48db-b1bf-36f0eda7f57c","order_by":7,"name":"Junhua Li","email":"","orcid":"","institution":"BGI Research, Shenzhen 518083, China","correspondingAuthor":false,"prefix":"","firstName":"Junhua","middleName":"","lastName":"Li","suffix":""},{"id":573444402,"identity":"b22a57a7-576b-453c-909d-e0805b76af03","order_by":8,"name":"Jianqiang Li","email":"","orcid":"","institution":"College of Computer Science and Software Engineering, Shenzhen University, Guangdong Province, PR China","correspondingAuthor":false,"prefix":"","firstName":"Jianqiang","middleName":"","lastName":"Li","suffix":""},{"id":573444403,"identity":"3a1c29c4-7322-49a4-a8df-8b721d80c867","order_by":9,"name":"Minfeng 
Xiao","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA+klEQVRIiWNgGAWjYDACZhBhwMAPJBkfIESI0CLZAGQaEKcFCkBa2CSIUmpwnPnhoxsFdyT4pduvVX6puCOn2878+ANDzR3cpjezGRvnGDyTkJxzpuy2zJlnxmaH2cwkGI49w6mFn5nBTDrH4HCdwY2ctNuSbYcTtx3mYWNgbDiMUwsbM/s3kBYJe6CWYsl/YC3MH/Bp4WfmAdsiYSCRfozxYwNYC4MEPi2SzTzFxiAtEjdymKUZjh2G+CXhGG4tBuePb3yc8+ewBP+M9Icff9QcljM7f/jxhw81uLUgAR4DZh4YO4EYDQwM7A8YfxCnchSMglEwCkYYAAAKAFQrzGvCQQAAAABJRU5ErkJggg==","orcid":"","institution":"BGI Research, Shenzhen 518083, China","correspondingAuthor":true,"prefix":"","firstName":"Minfeng","middleName":"","lastName":"Xiao","suffix":""}],"badges":[],"createdAt":"2026-01-06 19:58:21","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-8534670/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8534670/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":100368231,"identity":"8ce80661-80cb-4c8b-9ba0-9b53d0281d75","added_by":"auto","created_at":"2026-01-16 07:57:44","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":236354,"visible":true,"origin":"","legend":"","description":"","filename":"VirHostHuntermanuscript20241206NC.docx","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/71c7f915c93dc7e0d7f589c4.docx"},{"id":100368239,"identity":"3142a0b8-7a65-41ee-a27f-89aba7aa1424","added_by":"auto","created_at":"2026-01-16 
07:57:44","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":342,"visible":true,"origin":"","legend":"","description":"","filename":"rs8534670.json","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/b974119a5f0d021393afd850.json"},{"id":100173628,"identity":"067e0ea6-02a2-40d8-ab66-9e59625ee298","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":211227,"visible":true,"origin":"","legend":"","description":"","filename":"rs85346700enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/bedbd477cccdede89d7539d4.xml"},{"id":100173626,"identity":"bb43e5c9-7b97-4e31-97ac-bb63ace3177d","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":207270,"visible":true,"origin":"","legend":"","description":"","filename":"rs85346700structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/ed973921e086f9e994608a6e.xml"},{"id":100173627,"identity":"443dbba0-7772-4739-8b00-e6959a533b5c","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"html","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":229707,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/b2757d4c5d94138057693173.html"},{"id":100173618,"identity":"2a525765-b6da-4d98-b2ce-03d4a50bd283","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":108231,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe design, validation, and application of the VirHost Hunter 
framework\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/ef9eca8f7a3dbd19ef578c86.jpg"},{"id":100173619,"identity":"9a09507f-6ed1-4d26-a749-b55df334f2c4","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":80807,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe contribution of combining protein and DNA features, constructing datasets from specific proteins, and using language models to the framework\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/99323fb21381aab63028e2f1.jpg"},{"id":100173625,"identity":"7e6100cc-d602-4d0f-af4d-5b352bf7b862","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":254067,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTens of thousands of host assignments newly uncovered\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/ed890915c26a0c9e25faaab2.jpg"},{"id":100173620,"identity":"a0b63df8-47da-47e5-b11a-48f44f2553e9","added_by":"auto","created_at":"2026-01-13 17:17:21","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":121852,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDiversity and global distribution of gut phages\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/a99c08a278f05c0ae11f8fed.jpg"},{"id":100369492,"identity":"77ac61c0-27da-48b4-9fa5-f2149f6eeb92","added_by":"auto","created_at":"2026-01-16 07:59:04","extension":"jpg","order_by":5,"title":"Figure 
5","display":"","copyAsset":false,"role":"figure","size":156398,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCharacteristics of the Gut Phage Lysin Database\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/8db334027271ab4d93745a12.jpg"},{"id":100369154,"identity":"55926a70-fb10-4c4e-b630-905789b99337","added_by":"auto","created_at":"2026-01-16 07:58:44","extension":"jpg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":103568,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eScreening and functional verification of Ply491_6\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"f6.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/28e4d0281b41cd93c72242db.jpg"},{"id":100382382,"identity":"5ca7fd45-5762-47a4-8ff6-51f2add38415","added_by":"auto","created_at":"2026-01-16 10:42:32","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2225489,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8534670/v1/b0d1224f-9619-4712-9032-5f380f575fc8.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"Decrypting viral dark matter through key proteins using an NLP-enhanced framework","fulltext":[{"header":"Introduction","content":"\u003cp\u003eVirome is a significant component of Earth\u0026rsquo;s ecosystems and has a profound impact on ecological and human health. In various environments, uncharacterized viral genomes and sequences widely exist due to limitations in current analytical techniques and are referred to as viral dark matter. 
This concept highlights the need for innovative approaches to uncover and understand these hidden viral entities\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. The intricate interplay between bacteria and their viruses - bacteriophages (phages) - has garnered significant attention in recent years, fueled by advances in predictive modeling and therapeutic applications. Identifying the host range of phages is essential in studying phage resistance of bacteria, coevolution of phage-bacteria\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e, the influence of community context on phage-bacteria systems\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e, and the role of phages in human health and diseases\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. In clinical settings, phages have already been adopted to treat infections caused by drug-resistant bacteria, offering advantages in precision medicine due to their host specificity and minimal disturbance of normal gut flora\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Whereas phages also hold great promise in modulating gut microbiota, its efficacy hinges on the availability of phages targeting gut bacteria, particularly those associated with chronic diseases. 
Only a handful of phages have been reported targeting gut anaerobes, and it has been implicated that isolating gut phages is arduous\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003ePhage lysins have demonstrated effective antimicrobial effects in animal models, food industries, and clinical therapies\u003csup\u003e\u003cspan additionalcitationids=\"CR12 CR13 CR14\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e, presenting broad industrial and medical applications. Lysins possess moderate host specificity, are easy to synthesize and are especially suitable under scenarios where phages are unavailable, or the available phages are too host-specific to apply. However, existing lysin databases focus on clinical pathogens rather than gut commensal bacteria\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, limiting their application in gut microbiota modulation. Predicting phage hosts and establishing a phage lysin database, by leveraging gut phage databases, specifically targeting gut bacteria serve as an alternative solution.\u003c/p\u003e \u003cp\u003eVarious computational approaches have been developed for predicting phage hosts, falling into two categories: alignment-dependent and alignment-free methods (Table S1). 
Alignment-dependent methods rely on phage marker genes\u003csup\u003e\u003cspan additionalcitationids=\"CR18\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e, phage-host relatedness\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e, and CRISPR spacers\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, but they have limitations such as database size, data source dependency, alignment parameters, and applicability only to phages with specific marker genes or CRISPR signals\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. Alignment-free approaches utilize phage-bacterium interaction matrices\u003csup\u003e\u003cspan additionalcitationids=\"CR26 CR27 CR28 CR29 CR30 CR31 CR32 CR33 CR34 CR35\" citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e, phage whole genomes\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e or sequences of receptor-binding proteins (RBPs)\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e to predict phage hosts. Notably, Gonzales et al.\u003csup\u003e40\u003c/sup\u003e utilized protein language models (PLMs) for feature extraction from RBPs, highlighting the potential of computational techniques in this area. 
PLMs apply Natural Language Processing (NLP) techniques to interpret biological sequences in a manner akin to human language. This approach significantly improves contextual understanding and enables the identification of complex patterns that were previously difficult to discern.\u003c/p\u003e \u003cp\u003eCurrent viral databases predominantly use alignment-dependent methods and CRISPR spacers for host assignment, resulting in incomplete coverage and limited recall. For instance, Paez-Espino \u003cem\u003eet al\u003c/em\u003e.\u003csup\u003e41\u003c/sup\u003e identified 9,992 putative virus-host associations covering only 7.7% of metagenomic viral contigs (mVCs) in their study of Earth\u0026rsquo;s virome\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e. In the past three years, several human gut virome databases have also been released: the Metagenomic Gut Virus (MGV) database assigned hosts to 81% (n\u0026thinsp;=\u0026thinsp;153,892) of its phages, followed by 69% (n\u0026thinsp;=\u0026thinsp;31,259) within the Cenote Human Virome Database (CHVD), 42% (n\u0026thinsp;=\u0026thinsp;13,954) within the Gut Virome Database (GVD), and 29% (n\u0026thinsp;=\u0026thinsp;40,932) within the Gut Phage Database (GPD)\u003csup\u003e\u003cspan additionalcitationids=\"CR43 CR44\" citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. The GPD had the most stringent criteria, resulting in the lowest recall of the four databases: it used only CRISPR spacers from 2,898 high-quality genomes of cultured human gut bacteria and tolerated zero mismatches across the whole length of the spacers. 
Therefore, a high-quality alignment-free method, with improved machine learning models and input features, can complement the CRISPR spacer method to increase the sensitivity of host prediction without compromising precision.\u003c/p\u003e \u003cp\u003eIndeed, a recent tool, iPHoP\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e, integrates alignment-dependent and alignment-free methods for host prediction, including BLAST, CRISPR, WIsH, VHM, and PHP. Although iPHoP is the most comprehensive tool to date for phage host prediction, the authors discussed its limitations, including slow running time and the fact that it only achieves genus-level resolution, which may impact its practical applications. Alignment-free computational methods based on host-specific proteins such as tails and lysins, instead of the whole genomes of phages, may overcome these challenges: (1) they require minimal data input, avoiding vast redundant information and overuse of computing resources; (2) they can handle incomplete genome assemblies resulting from virome sequencing; (3) they can achieve high-resolution host prediction, likely species or strain level, for phage therapy applications; and (4) they facilitate applications in synthetic biology, including host range modulation by swapping or engineering phage RBPs\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e, delivery vehicles based on proteins that recognize and attach to host surfaces\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e, and therapeutic agents based on lytic proteins that break down bacterial cell walls\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eIn this study, we develop a framework 
for phage host prediction integrating innovations in feature extraction, dataset construction, and model selection. We verify the role of each design element by conducting control analyses, followed by a comprehensive comparison to other methods across family to species levels. We calibrate the model to facilitate its application towards disease-associated gut bacteria and validate its robustness under targeted scenarios. We apply the calibrated model to the GPD and identify a great number of phages targeting disease-associated gut bacteria, including new ones targeting renowned bacterial species whose phages have hardly been characterized before. To further promote application of the resource, we extract lysins from the GPD with expanded host assignment to establish a repository. As a proof of concept, we select a lysin from the repository and synthesize it to verify its function against an obesity-inducing bacterium. This work elucidates the design of a predictive framework for phage host prediction and provides insights into how to utilize machine learning to serve genomics data mining and protein function prediction. Deciphering gut phages using this tool not only enhances our understanding of phage diversity and phage-bacteria interactions, but also facilitates downstream application of gut phage resources in disease intervention.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eDesigning a phage host prediction framework\u003c/h2\u003e \u003cp\u003eTo predict phage hosts, full genome sequences of phages and bacteria are usually used. However, whole genome-based methods introduce a significant amount of non-essential data, including proteins unrelated to host recognition or infection, which can create noise and interfere with prediction accuracy, resolution, and efficiency. 
Concentrating on phage proteins conferring specificity\u0026mdash;those directly involved in the infection cycle\u0026mdash;offers a more targeted approach. Some methods have utilized receptor-binding proteins (RBPs) to predict hosts\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e, but it can be challenging to annotate RBPs for many phages. We first counted the number of RBPs and tail proteins in 7,598 phage genomes from NCBI (December 29, 2021), revealing an average of 1.33 RBPs and 15.24 tail proteins per phage (Figure S1). Therefore, we expanded the dataset to include specific proteins beyond tail fibers and tail spikes: non-RBPs of phage tails, such as tail sheath, tail tube, baseplate, and tail collar proteins; and lysins, which are enzymes highly active against the bacterial cell wall\u003csup\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e. These proteins are key to the infection cycle and are more widely annotated than RBPs, and are thus included for prediction as well.\u003c/p\u003e \u003cp\u003eProteins may share low sequence similarity while still performing similar functions across diverse species, rendering traditional sequence alignment methods less effective in capturing these functional similarities. To overcome the challenge of predicting host specificity using these proteins, particularly when sequence similarity is low, we employed protein language models (PLMs). 
PLMs provide a powerful solution by learning deep contextual and functional patterns within protein sequences, enabling them to capture viral protein function and viral biology even in cases of minimal sequence homology\u003csup\u003e\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eBecause the same protein sequence can be encoded by different DNA sequences, we incorporated DNA sequence features of tail proteins and lysins into the framework. DNA sequences provide additional insights into phages\u0026rsquo; genomic context, such as codon usage bias, GC content, and nucleotide frequency, which can further refine predictions by accounting for genomic stability and evolutionary constraints\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e. To uncover long-range dependencies and global patterns in DNA sequence data, we utilized a Vision Transformer (ViT)\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e. As a transformer-based model, ViT can capture complex relationships and contextual information inherent in DNA sequences. These patterns can reveal insights into genetic structures, functions, or relationships that are not easily discerned by examining individual sequences alone.\u003c/p\u003e \u003cp\u003eAccordingly, we present the VirHost Hunter framework with the above characteristics (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA). It consists of two primary components: a feature extractor and a classifier. 
The feature extractor integrates protein sequence embeddings from the ProtT5 model\u003csup\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e, physicochemical features of DNA sequences, and K-mer features derived from DNA sequences using a deep neural network (DNN)\u003csup\u003e\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e\u003c/sup\u003e. Utilizing three convolutional neural networks (Figure S2) and a Vision Transformer (ViT), the DNN extracts multi-scale features from the data. The final classification step uses a multi-layer perceptron (MLP) and a Random Forest (RF) classifier\u003csup\u003e\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e\u003c/sup\u003e, with RF refining high-confidence predictions from the MLP to improve accuracy. We next ask whether combining protein and DNA features, constructing datasets from specific proteins, and using language models such as PLMs and ViT each enhance host prediction as expected.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eUsing both protein and DNA features improves learning over either alone\u003c/h3\u003e\n\u003cp\u003eTo evaluate whether integrating protein and DNA features offers superior performance compared to using either individually, we conducted ablation experiments using two benchmark datasets: the Bacteriophage RBP (Drug-Resistant receptor-binding proteins, DRRBP) dataset (n\u0026thinsp;=\u0026thinsp;4,845)\u003csup\u003e39\u003c/sup\u003e and the Bacteriophage Tail Proteins (Drug-Resistant tail, DRTail) dataset (n\u0026thinsp;=\u0026thinsp;12,509). We measured performance using accuracy (ACC), precision, and F1 scores under three experimental conditions: using only protein features, using only DNA features, and using a combination of both. 
Relying on a single type of feature led to inconsistent model performance across different datasets (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). In particular, on the DRRBP dataset, models that used only protein features outperformed those that used only DNA features. In contrast, on the DRTail dataset, DNA features alone provided better performance than protein features. This inconsistency reveals the limitations of using only one feature type, as neither approach fully captures the complexity of phage-host interactions.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eOn the other hand, integrating both protein and DNA features consistently improved model performance across all datasets and metrics. Using both feature sets together resulted in the highest performance, with an accuracy of 0.9081 and 0.8927, precision of 0.9090 and 0.8930, and F1 scores of 0.9077 and 0.8925 on the DRRBP and DRTail datasets, respectively, significantly outperforming models that used either feature set alone. 
This demonstrates that integrating protein and DNA features not only enhances predictive accuracy but also provides greater consistency and stability across datasets, particularly in the context of bacteriophage host prediction.\u003c/p\u003e\n\u003ch3\u003ePhage tail components and lysins drive host prediction without full-genome data\u003c/h3\u003e\n\u003cp\u003eTo confirm that using all tail components (RBPs, tail sheath, tail tube, baseplate, tail collar, \u003cem\u003eetc\u003c/em\u003e.) for host prediction is feasible, we conducted a 10-fold cross-validation on the DRRBP and DRTail datasets using our method, DeepHost\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e, Random Forest (RF, Boeckaerts' method)\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e, and Protein Embeddings\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e (Table S2). The results suggest that host prediction accuracy using all tail proteins is comparable to that using RBPs alone, which favors using all tail proteins, since they are roughly 10 times more abundant than RBPs. We also found that VirHost Hunter outperformed the other methods, achieving an accuracy of 0.9081 and 0.8927, precision of 0.9090 and 0.8930, and F1 scores of 0.9077 and 0.8925 on the DRRBP and DRTail datasets, respectively (Table S2).\u003c/p\u003e \u003cp\u003eTo compare the efficacy of phage tails and lysins with that of non-specific proteins for host prediction, we conducted tests using head proteins and terminases from the same phage datasets. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB, phage tails and lysins consistently outperformed head proteins and terminases across family, genus, and species levels. 
For all sequence similarity thresholds tested, phage tails achieved the highest accuracy, followed closely by lysins. In contrast, head proteins and terminases reached significantly lower accuracy, with a notable decline in performance at lower sequence similarity thresholds, especially at the species level (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). This further illustrates that phage tails and lysins maintain their predictive power, even at reduced sequence similarity, unlike the non-specific control proteins. This further implies that relying on whole genomes for host prediction may be redundant, as focusing on key proteins provides more accurate and efficient predictions. We further demonstrated that VirHost Hunter can reach species-level resolution and had superior accuracy, precision, and F1 compared with the other methods, on a multi-taxonomic dataset of 7,598 phage genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB, Supplementary Results, Figure S2A and Table S3-S4).\u003c/p\u003e\n\u003ch3\u003eFunctional homology is captured even in low-similarity sequences\u003c/h3\u003e\n\u003cp\u003eTo demonstrate the ability of VirHost Hunter to capture functional homology using NLP-based representations, we evaluated its performance across datasets with varying sequence similarities. 
Using CD-HIT\u003csup\u003e\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e\u003c/sup\u003e, we partitioned the multi-taxonomic dataset into subsets with sequence similarity thresholds of 50%, 60%, 70%, 80%, and 90%, enabling us to assess VirHost Hunter\u0026rsquo;s capability of predicting phage hosts based on functional relationships rather than strict sequence homology.\u003c/p\u003e \u003cp\u003eWe compared VirHost Hunter\u0026rsquo;s performance to that of other models, including Boeckaerts et al.\u003csup\u003e39\u003c/sup\u003e, DeepHost\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e, and M. Gonzales et al.\u003csup\u003e40\u003c/sup\u003e, across different similarity thresholds. As illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC, VirHost Hunter consistently outperformed the other methods across taxonomic ranks\u0026mdash;family, genus, and species\u0026mdash;highlighting its superior ability to leverage functional homology in low-similarity datasets.\u003c/p\u003e \u003cp\u003eCrucially, as sequence similarity decreased, the performance gap between VirHost Hunter and the other methods widened, particularly at the family and genus levels. This underscores the increasing importance of capturing functional homology in low-similarity regions, where conventional sequence similarity-based methods typically fail. VirHost Hunter\u0026rsquo;s integration of protein language models (PLMs), such as ProtT5, and DNA sequence features enables it to move beyond reliance on sequence similarity alone. 
Instead, it identifies deeper functional relationships, resulting in robust and accurate predictions, even under low similarity conditions.\u003c/p\u003e\n\u003ch3\u003eRobust phage host prediction for targeted scenarios\u003c/h3\u003e\n\u003cp\u003eGiven the substantial impact of gut bacteria on human health, including their roles in inflammatory bowel disease (IBD)\u003csup\u003e\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e, \u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e\u003c/sup\u003e, colorectal cancer\u003csup\u003e\u003cspan additionalcitationids=\"CR61\" citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e, and metabolic diseases\u003csup\u003e\u003cspan additionalcitationids=\"CR64 CR65 CR66\" citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e\u003c/sup\u003e, obtaining more information on phages targeting these bacteria is advantageous. Such information can expand our knowledge of gut phage-bacteria interactions and gut phage diversity, and can be used for therapeutic purposes. Phage information can be obtained either through co-culturing with bacterial hosts or by mining data from high-throughput sequencing. However, gut phages, especially those targeting obligate anaerobes, are hard to culture and isolate. Investigating gut phages by analyzing sequencing data is, therefore, usually considered more efficient. Now that we have validated the superior performance of VirHost Hunter, including accuracy, precision, and resolution, we next evaluate its effectiveness in identifying phages targeting disease-associated gut bacteria. 
We compiled a dataset consisting of 60 gut bacterial species associated with various diseases, including carotid atherosclerosis, inflammatory bowel disease (IBD), and obesity (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB, Table S5). We annotated prophage tails and lysins from the dataset, resulting in a total of 328,701 unique tail proteins and 312,565 unique lysins. We calibrated the VirHost Hunter model using these sequences across 29 families, 40 genera, and 60 species.\u003c/p\u003e \u003cp\u003eConsistent with previous evaluations, VirHost Hunter outperformed the other three tested methods when applied to this dataset. At the family, genus, and species levels, VirHost Hunter-tail (based on gut phage tails) yielded ACC scores of 0.9516, 0.935, and 0.9132, Precision scores of 0.9513, 0.9341, and 0.9112, and F1 scores of 0.9512, 0.9342, and 0.9115, respectively (Figure S3B, Table S6). VirHost Hunter-lysin (based on gut phage lysins) exhibited ACC scores of 0.9817, 0.9756, and 0.9590, Precision scores of 0.9817, 0.9755, and 0.958, and F1 scores of 0.9817, 0.9755, and 0.9582, respectively (Figure S3B, Table S7). We also examined how sample sequence similarity would affect model performance. 
VirHost Hunter consistently outperformed other methods across various similarity thresholds and taxonomic levels (Figure S4), further highlighting its robustness in predicting gut phage hosts associated with chronic diseases.\u003c/p\u003e \u003cp\u003eTo further validate VirHost Hunter\u0026rsquo;s performance on isolated gut phages, we used a previously reported collection of cultivated gut phages\u003csup\u003e\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e\u003c/sup\u003e targeting \u003cem\u003eBifidobacterium\u003c/em\u003e, \u003cem\u003eBacillus\u003c/em\u003e, \u003cem\u003eBacteroides\u003c/em\u003e, \u003cem\u003eCampylobacter\u003c/em\u003e, \u003cem\u003eClostridium\u003c/em\u003e, \u003cem\u003eEnterococcus\u003c/em\u003e, and \u003cem\u003eStreptococcus\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB). In total, 702 tail proteins and 373 lysins were extracted from 156 gut phages, all with experimentally verified host data. VirHost Hunter and the CRISPR-based method were tested under equivalent precision thresholds, because the CRISPR-based method was the one most frequently used to assign bacterial hosts in previous work. At a 95% precision cutoff, VirHost Hunter correctly identified hosts for 73/156 phages at the family level and 58/156 at the genus level, while the CRISPR-based method yielded no assignments, reflecting its low recall (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). At 84% and 69% cutoffs, VirHost Hunter performed comparably with the CRISPR-based method, and combining both methods further increased the number of correct assignments to 101/156 (84% cutoff) and 113/156 (69% cutoff) at the family level and to 107/156 (84% cutoff) and 117/156 (69% cutoff) at the genus level (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
Additionally, VirHost Hunter achieved species-level predictions, a resolution not attainable by the CRISPR-based method, correctly assigning hosts for 9/156 (95% cutoff), 20/156 (84% cutoff), and 26/156 (69% cutoff) phages, respectively, with host species including \u003cem\u003eBacteroides fragilis\u003c/em\u003e, \u003cem\u003ePhocaeicola vulgatus\u003c/em\u003e, and \u003cem\u003eEggerthella lenta\u003c/em\u003e (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, Table S8).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eNumber of cultivated gut phages (out of 156) with correctly predicted hosts by VirHost Hunter, the CRISPR-based method, and both combined\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"10\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" 
morerows=\"1\" rowspan=\"2\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003e95% precision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003e84% precision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c10\" namest=\"c8\"\u003e \u003cp\u003e69% precision\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eVirHost Hunter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCRISPR-based\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ecombined\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eVirHost Hunter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eCRISPR-based\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003ecombined\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eVirHost Hunter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eCRISPR-based\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003ecombined\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFamily\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e73\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e73\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003e95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e113\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGenus\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e58\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e58\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e107\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e117\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSpecies\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eN.D.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c6\"\u003e \u003cp\u003eN.D.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eN.D.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTo sum up, VirHost Hunter demonstrated superior performance in comparison to the other three alignment-free methods tested. Furthermore, it significantly outperformed the CRISPR-based method in an independent gut phage-host dataset under a 95% precision cutoff and achieved comparable performance under 84% and 69% precision cutoffs. Additionally, our experiment revealed that the combination of VirHost Hunter and the CRISPR-based method significantly enhances the proportion of true positive predictions, particularly for high-resolution phage-host predictions in the gut microbiota at species level. Overall, these results highlight the scalability of VirHost Hunter across different environments.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003ePhages targeting disease-associated gut bacteria are vastly expanded\u003c/h2\u003e \u003cp\u003eThe four most recently published gut virus databases typically adopted commensal bacteria as their CRISPR libraries. 
Among them, the GVD\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e, the MGV\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e, and the CHVD\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e set looser cutoffs than the GPD\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, which allowed zero mismatches and consequently had a low assignment rate. Although these databases are comprehensive, a tailored approach is needed for specific application scenarios, such as for intestinal pathogenic bacteria. Considering that the GPD had the lowest host assignment recall among the four databases (28.66%) and that its precision at the genus level was 84%, as evaluated by Dion et al.\u003csup\u003e21\u003c/sup\u003e, we used VirHost Hunter to assign hosts for GPD with 95% and 84% precision, aiming to explore the dark matter in the human gut associated with chronic diseases (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC).\u003c/p\u003e \u003cp\u003eUsing our optimized annotation pipeline, we identified 163,590 lysins and 388,894 tail proteins from 142,809 assembled gut phages in the GPD. We applied precision filters of 84% and 95% to predict hosts at different taxonomic levels (Table S9). Phylogenetic composition analysis showed that the annotation results covered 8 phyla, 13 classes, 21 orders, 29 families, 40 genera, and 58 species, including 42 species of obligate anaerobes (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The host assignment results for each phage combined the predictions by tails and lysins. 
Notably, 7 families could only be assigned by VirHost Hunter-lysin but not VirHost Hunter-tail, including \u003cem\u003eEubacteriaceae\u003c/em\u003e, \u003cem\u003eAtopobiaceae\u003c/em\u003e, \u003cem\u003eLeuconostocaceae\u003c/em\u003e, \u003cem\u003ePrevotellaceae\u003c/em\u003e, \u003cem\u003ePeptoniphilaceae\u003c/em\u003e, \u003cem\u003eGemellaceae\u003c/em\u003e, and \u003cem\u003eAerococcaceae\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). We evaluated the host assignment results of VirHost Hunter using 95% and 84% precision and compared them with the previous results of the GPD. We found that both VirHost Hunter-tail and VirHost Hunter-lysin can enhance the host assignment of gut phages. At 95% precision, VirHost Hunter newly assigned hosts to 15.91% (22,724/142,809) of the GPD phages, with 10.98% (15,677/142,809) by VirHost Hunter-tail and 9.41% (13,432/142,809) by VirHost Hunter-lysin (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). At 84% precision, VirHost Hunter newly assigned hosts to 33.99% (48,545/142,809) of the GPD phages, with VirHost Hunter-tail contributing 20.16% (28,790/142,809) and VirHost Hunter-lysin contributing 25.37% (36,236/142,809), boosting the final host assignment ratio to 62.66% (89,478/142,809) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). These data illustrate that applying VirHost Hunter to either tails or lysins can enhance the host assignment of gut phages, while combining the results of VirHost Hunter based on different key proteins with those of the CRISPR method could optimize the outcome.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBy integrating VirHost Hunter and the CRISPR-based method, we assessed the improvement and refinement of host assignment results in the GPD. 
Both VirHost Hunter-tail and VirHost Hunter-lysin significantly enhanced host taxonomic classification compared to the previous results. The host assignment results of VirHost Hunter-tail newly covered 3 families, 8 genera, and 20 species and that of VirHost Hunter-lysin newly covered 5 families, 12 genera, and 25 species (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC-E). Overall, at the family level, VirHost Hunter identified phages targeting 5 new families accounting for 1.38% of total assignments under 84% precision, while phages targeting \u003cem\u003eAerococcaceae\u003c/em\u003e were not detected at the 95% cutoff. \u003cem\u003eLachnospiraceae\u003c/em\u003e and \u003cem\u003eBacteroidaceae\u003c/em\u003e, recognized as the two most prevalent host families by both VirHost Hunter and the CRISPR-based method, collectively accounted for over 50% of total assignments at both the 84% and 95% cutoffs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). At the genus level, VirHost Hunter identified phages targeting 12 new genera accounting for 21.58% of total assignments at the 84% cutoff, while three of the new genera were not detected at the 95% cutoff. \u003cem\u003eBacteroides\u003c/em\u003e is the most abundant host genus identified by both VirHost Hunter and the CRISPR-based method (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD). At the species level, VirHost Hunter identified phages targeting 25 new species accounting for 0.14% of total assignments at the 84% cutoff, while four of the new species were not predicted at the 95% cutoff. 
Notably, VirHost Hunter identified phages targeting \u003cem\u003eCronobacter sakazakii\u003c/em\u003e as predominant, which was not detected by the CRISPR-based method, likely due to differences in training datasets (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE).\u003c/p\u003e \u003cp\u003eIn the refined database, there are five newly annotated host families, including \u003cem\u003eAerococcaceae\u003c/em\u003e, \u003cem\u003eAkkermansiaceae\u003c/em\u003e, \u003cem\u003eGemellaceae\u003c/em\u003e, \u003cem\u003ePrevotellaceae\u003c/em\u003e, and \u003cem\u003eXanthomonadaceae\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). Among them, \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e within the \u003cem\u003eAkkermansiaceae\u003c/em\u003e family has been extensively reported due to its ability to modulate multiple diseases, including obesity\u003csup\u003e\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e\u003c/sup\u003e, diabetes\u003csup\u003e\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e, \u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e71\u003c/span\u003e\u003c/sup\u003e, inflammatory bowel disease\u003csup\u003e\u003cspan additionalcitationids=\"CR73\" citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e\u003c/sup\u003e, and schizophrenia\u003csup\u003e\u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e75\u003c/span\u003e\u003c/sup\u003e. However, phages targeting \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e had never been characterized by any previous publications. 
We successfully identified 36 phages targeting \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e at the 95% precision cutoff and 95 phages at the 84% precision cutoff, and we examined the 36 phages from the more stringent cutoff (Table S9). The genome sizes of the \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e phages ranged from 11,830 to 92,135 bp and the GC content ranged from 49.11% to 60% (Figure S4). The number of CDS was between 21 and 127 and the annotation rate was between 23.81% and 52.17% (Figure S5). \u003cem\u003ePrevotella copri\u003c/em\u003e, within the \u003cem\u003ePrevotellaceae\u003c/em\u003e family, is another well-known species associated with rheumatoid arthritis\u003csup\u003e\u003cspan additionalcitationids=\"CR77\" citationid=\"CR76\" class=\"CitationRef\"\u003e76\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR78\" class=\"CitationRef\"\u003e78\u003c/span\u003e\u003c/sup\u003e and type 2 and type 1 diabetes mellitus\u003csup\u003e\u003cspan additionalcitationids=\"CR80\" citationid=\"CR79\" class=\"CitationRef\"\u003e79\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR81\" class=\"CitationRef\"\u003e81\u003c/span\u003e\u003c/sup\u003e. Megaphages were the only phages reported to target \u003cem\u003ePrevotella copri\u003c/em\u003e, but previous attempts to isolate them failed\u003csup\u003e\u003cspan citationid=\"CR82\" class=\"CitationRef\"\u003e82\u003c/span\u003e\u003c/sup\u003e. We successfully identified 15 phages targeting \u003cem\u003ePrevotella copri\u003c/em\u003e at the 95% cutoff and 22 phages at the 84% cutoff, and we examined the 15 phages from the more stringent cutoff (Table S9). The genome sizes of the \u003cem\u003ePrevotella copri\u003c/em\u003e phages ranged from 12,114 to 127,100 bp and the GC content ranged from 39.17% to 48% (Figure S5). The number of CDS was between 16 and 166 and the annotation rate was between 23.17% and 56.25% (Figure S5). 
We selected representative phages targeting \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e and \u003cem\u003ePrevotella copri\u003c/em\u003e using CD-HIT with a coverage threshold of 0.6 and an identity threshold of 0.6, and annotated their genomes using our refined pipeline. The functional elements of these phages mainly include lysis, lysogeny-related, structure, DNA maintenance, packaging and assembly, replication and transcription, transport, and regulation (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eDiversity and geographic distribution of gut phages\u003c/h3\u003e\n\u003cp\u003eTo gain further insights from the expanded host assignments, we first analyzed the phylogenetic lineages of phages in the refined database (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD). Out of the 89,478 phages, 11.42% were classified under six viral families, including \u003cem\u003eSiphoviridae, Myoviridae, Podoviridae, Herelleviridae, Tectiviridae, and Microviridae\u003c/em\u003e, covering all taxonomic classifications identified in the GPD (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). The remaining 88.57% of assigned phages were unclassified (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). Compared to the previous results of the GPD, VirHost Hunter increased host assignments by nearly 1-fold for \u003cem\u003eSiphoviridae\u003c/em\u003e and \u003cem\u003eHerelleviridae\u003c/em\u003e phages, nearly 2-fold for \u003cem\u003ePodoviridae\u003c/em\u003e and \u003cem\u003eMyoviridae\u003c/em\u003e phages, and 33.3% for \u003cem\u003eMicroviridae\u003c/em\u003e phages, significantly enhancing host assignments across multiple taxonomic levels. 
As a result, we assigned hosts to 79,332 unclassified phages, 4,566 \u003cem\u003eSiphoviridae\u003c/em\u003e phages, 2,902 \u003cem\u003eMyoviridae\u003c/em\u003e phages, 2,598 \u003cem\u003ePodoviridae\u003c/em\u003e phages, 75 \u003cem\u003eHerelleviridae\u003c/em\u003e phages, 4 \u003cem\u003eMicroviridae\u003c/em\u003e phages and 1 \u003cem\u003eTectiviridae\u003c/em\u003e phage (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). It is noteworthy that \u003cem\u003eMicroviridae\u003c/em\u003e, a family of tailless phages, were, as expected, assigned hosts by VirHost Hunter-lysin rather than VirHost Hunter-tail. Therefore, it is important to combine the results of VirHost Hunter-tail and VirHost Hunter-lysin for downstream analyses. These findings demonstrate the broad applicability of VirHost Hunter for host prediction across diverse phage lineages, regardless of whether the phages have tails.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eGiven the large number of phages predicted to target identical hosts, we evaluated phage diversity within bacterial families across diverse phyla by calculating the ratio of VC numbers to phage counts sharing the same host (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD). We observed a wide distribution of phage diversity across bacterial families, especially within \u003cem\u003eBacteroidetes\u003c/em\u003e. Notably, 23 bacterial families exhibited the highest viral diversity, with 15 of these families belonging to \u003cem\u003eFirmicutes\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB), a finding that is consistent with the GPD. 
Additionally, we newly found that some bacterial families belonging to \u003cem\u003eActinobacteria, Bacteroidetes\u003c/em\u003e and \u003cem\u003eProteobacteria\u003c/em\u003e showed high viral diversity, such as \u003cem\u003ePseudomonadaceae\u003c/em\u003e, \u003cem\u003eNeisseriaceae\u003c/em\u003e, \u003cem\u003eMuribaculaceae\u003c/em\u003e, \u003cem\u003eMoraxellaceae\u003c/em\u003e, \u003cem\u003eDermabacteriaceae\u003c/em\u003e, \u003cem\u003eCorynebacteriaceae\u003c/em\u003e, \u003cem\u003eCoprobacteriaceae\u003c/em\u003e and \u003cem\u003eCellulomonadaceae\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB). In contrast, the lowest viral diversity was detected in \u003cem\u003eBacteroidaceae\u003c/em\u003e, \u003cem\u003eDTU089\u003c/em\u003e, \u003cem\u003eMarinifilaceae\u003c/em\u003e, \u003cem\u003eRikenellaceae\u003c/em\u003e and \u003cem\u003eTannerellaceae\u003c/em\u003e, all belonging to the \u003cem\u003eBacteroidetes\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB). \u003cem\u003eFirmicutes\u003c/em\u003e, \u003cem\u003eBacteroidetes\u003c/em\u003e, \u003cem\u003eProteobacteria\u003c/em\u003e, and \u003cem\u003eActinobacteriota\u003c/em\u003e were previously reported as the common phyla in the human gut, and they are also prominently featured in our data.\u003c/p\u003e \u003cp\u003eTo gain insights into the relationship between the VC number of host families and their geographic distribution, we analyzed the dominant families and performed principal coordinate analysis (PCoA) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC-D). 
The results showed that Asia and Europe have a higher total phage count compared to others, which may be attributed to the greater number of human metagenomic sequencing studies conducted in Asia and Europe (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC). Host families show similar geographic distribution patterns across continents, with \u003cem\u003eLachnospiraceae\u003c/em\u003e and \u003cem\u003eBacteroidaceae\u003c/em\u003e dominating across all continents, indicating their role as hosts for globally prevalent gut phages (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC). Similarly, Asia, Europe and Africa have more overlapping regions, suggesting similar phage compositions, while Oceania, North America and South America are more distinct, indicating different phage communities (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC). The PCoA reveals a similar result, showing that while some continents do not completely overlap with others in host bacterial community compositions, the slight differences observed are not statistically significant (Pr(\u0026gt;\u0026thinsp;F)\u0026thinsp;=\u0026thinsp;0.065), indicating similar phage compositions (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eD).\u003c/p\u003e\n\u003ch3\u003eAn expansive lysin repository countering a broad array of gut bacteria\u003c/h3\u003e\n\u003cp\u003eConsidering VirHost Hunter's precision in predicting gut phage and lysin hosts and its complementarity with the CRISPR-based method (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB, Supplementary Results), we established the Gut Phage Lysin Database (GPLD), which encompasses 117,698 lysins precisely targeting 29 disease-related gut bacterial families (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eE, Table S9). 
Of these, 35.20% (n\u0026thinsp;=\u0026thinsp;41,429) were identified by both VirHost Hunter and the CRISPR-based method, 13.27% (n\u0026thinsp;=\u0026thinsp;15,617) were identified only by the CRISPR-based method, and 51.53% (n\u0026thinsp;=\u0026thinsp;60,652) were identified exclusively by VirHost Hunter. Hydrolases, holins, and endolysins were the predominant functional categories (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). To better understand the functionality, stability, and potential applications of lysin proteins, we conducted various analyses focusing on their physicochemical properties. Their secondary structures comprised turn, sheet, and helix fractions (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB), and the lysins varied in length from 30 bp to 5811 bp, with a mean length of 195 bp. The molecular weight ranged from 2.8 kDa to 65 kDa, with a mean of 21.6 kDa (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC), suggesting favorable attributes for efficient synthesis and manipulation. A majority (81.98%) were stable, with an instability index below 40 (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC). Amino acid frequency analysis indicated a prevalence of hydrophobic alanine, leucine, and isoleucine, potentially enhancing protein stability and function.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo analyze the functional diversity and sequence-function relationships within the gut lysin protein family, we employed the sequence similarity network (SSN) tool\u003csup\u003e\u003cspan citationid=\"CR83\" class=\"CitationRef\"\u003e83\u003c/span\u003e, \u003cspan citationid=\"CR84\" class=\"CitationRef\"\u003e84\u003c/span\u003e\u003c/sup\u003e, which provided insights into their sequence and functional divergence. 
Lysins were grouped into 603 clusters based on sequence similarity, with nodes colored according to the host taxonomical phylum (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). The SSN was predominantly populated by proteins from \u003cem\u003eFirmicutes\u003c/em\u003e (n\u0026thinsp;=\u0026thinsp;4559), followed by \u003cem\u003eProteobacteria\u003c/em\u003e (n\u0026thinsp;=\u0026thinsp;996), \u003cem\u003eBacteroidetes\u003c/em\u003e (n\u0026thinsp;=\u0026thinsp;906), and \u003cem\u003eActinobacteria\u003c/em\u003e (n\u0026thinsp;=\u0026thinsp;600) (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). The protein clusters in the sequence similarity network (SSN) were categorized according to their respective protein types, indicating that proteins with similar functions might possess conserved domains that confer these functions (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). Notably, holins, differing from other protein types in the network, exhibit high diversity in their sequences and structures (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). Furthermore, proteins against the same host phylum tend to cluster together, implying that lytic proteins, much like phages, exhibit a high degree of host specificity. This host specificity is likely mediated by conserved regions within the lytic proteins that are essential for identifying and binding to host cells.\u003c/p\u003e \u003cp\u003eTo uncover the conserved functional motifs and the underlying mechanisms, we generated sequence logos\u003csup\u003e\u003cspan citationid=\"CR85\" class=\"CitationRef\"\u003e85\u003c/span\u003e\u003c/sup\u003e for three representative clusters. 
Cluster 1, the largest cluster, contained 1,735 protein sequences from \u003cem\u003eFirmicutes\u003c/em\u003e, \u003cem\u003eActinobacteria\u003c/em\u003e, \u003cem\u003eBacteroidetes\u003c/em\u003e, and \u003cem\u003eProteobacteria\u003c/em\u003e hosts, including holins (n\u0026thinsp;=\u0026thinsp;3170), hydrolases (n\u0026thinsp;=\u0026thinsp;3098), endolysins (n\u0026thinsp;=\u0026thinsp;1125) and lysis proteins (n\u0026thinsp;=\u0026thinsp;796). The sequence logo analysis revealed three conserved motifs (RHTKAPAVLIECCFVDNKDD, NVTVHRDFANKSCPG, and RSWCSSSAANDNRAITIEVA), all located in the N-acetylmuramoyl-L-alanine amidase domain, which is crucial for phage-mediated bacterial lysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eE). Cluster 2 comprised 1,213 representative protein sequences belonging to holins, and two conserved motifs were detected in the toxin secretion domain, which facilitates the release of lytic enzymes to lyse bacterial cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eE). Cluster 3 consisted of 296 representative hydrolase sequences, and its motif was mainly associated with the N-(deoxy)ribosyltransferase-like domain, which functions in degrading bacterial cell walls during phage infection (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eE). Functionally important residues were found to be conserved in putative isofunctional clusters, with motif and domain analyses revealing differences between different types of phage lytic proteins. 
The findings have valuable implications for the design and engineering of lysins and their application in lysin therapy.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eLysin Ply491_6 effectively and specifically inhibits an obesity-inducing bacterium\u003c/h2\u003e \u003cp\u003eObesity has emerged as a significant global health concern, with the gut microbiome implicated in its onset and progression\u003csup\u003e\u003cspan citationid=\"CR86\" class=\"CitationRef\"\u003e86\u003c/span\u003e\u003c/sup\u003e. Comparative analyses have revealed distinct microbiome profiles between obese and non-obese individuals, suggesting association between certain bacterial genera and obesity, including \u003cem\u003eBacteroides\u003c/em\u003e, \u003cem\u003eMegamonas\u003c/em\u003e, \u003cem\u003eRuminococcus\u003c/em\u003e, \u003cem\u003eDorea\u003c/em\u003e, \u003cem\u003eCoprococcus\u003c/em\u003e, \u003cem\u003eFusobacterium\u003c/em\u003e, \u003cem\u003eBlautia\u003c/em\u003e, and \u003cem\u003eEubacterium\u003c/em\u003e\u003csup\u003e\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e, \u003cspan citationid=\"CR87\" class=\"CitationRef\"\u003e87\u003c/span\u003e, \u003cspan citationid=\"CR88\" class=\"CitationRef\"\u003e88\u003c/span\u003e\u003c/sup\u003e. 
While phage therapy holds promise for modulating the gut microbiota, the lack of reported phages targeting \u003cem\u003eMegamonas\u003c/em\u003e and our repeated unsuccessful attempts to isolate \u003cem\u003eMegamonas\u003c/em\u003e phages prompted us to investigate the therapeutic potential of lysins from the Gut Phage Lysin Database (GPLD) against this bacterial genus (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eF).\u003c/p\u003e \u003cp\u003eWe identified 526 unique lysin sequences specific to \u003cem\u003eMegamonas\u003c/em\u003e in the GPLD, grouped into 167 distinct clusters (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA). Ply491_6 (ivig_491_6) is the representative sequence of the protein cluster with the highest number of proteins (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA, \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB). The cDNA sequence encoding Ply491_6 spans 561 base pairs. Ply491_6 comprises 187 amino acids, with a molecular weight of 20.8 kDa and a theoretical isoelectric point (pI) of 5.37. Ply491_6 is hydrophilic, with a grand average of hydropathicity (GRAVY) value of -0.207. The instability index is 25.93, indicating that Ply491_6 is a stable protein. Additionally, Ply491_6 is devoid of signal peptides and transmembrane regions and is structurally characterized by four predominant α-helices alongside multiple β-sheets (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC). 
Ply491_6 shares high sequence identity (99.46%) with QIW89318.1, a cell wall hydrolase autolysin from \u003cem\u003eCaudoviricetes sp.\u003c/em\u003e, and contains a conserved N-acetylmuramoyl-L-alanine amidase domain.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTherefore, we synthesized and purified the Ply491_6 protein for \u003cem\u003ein vitro\u003c/em\u003e assays to verify its lytic activity against \u003cem\u003eMegamonas\u003c/em\u003e. We incubated Ply491_6 with \u003cem\u003eMegamonas rupellensis\u003c/em\u003e and monitored the bacterial turbidity over time. Ply491_6 effectively lysed bacterial cells at concentrations as low as 20 \u0026micro;g/mL, with a significant reduction in bacterial turbidity observed within 150 minutes (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eD). To further assess the specificity of Ply491_6, we measured its lytic activity against other high-abundance gut bacteria and common probiotics, including \u003cem\u003eBacteroides fragilis\u003c/em\u003e, \u003cem\u003eClostridium perfringens\u003c/em\u003e, \u003cem\u003eRuminococcus gnavus\u003c/em\u003e, \u003cem\u003eBifidobacterium longum\u003c/em\u003e, \u003cem\u003eLacticaseibacillus paracasei\u003c/em\u003e, and \u003cem\u003eLactiplantibacillus plantarum\u003c/em\u003e. Ply491_6 demonstrated minimal impact on the viability of these bacteria (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eE-F). These results underscore the efficacy and specificity of Ply491_6 for \u003cem\u003eMegamonas\u003c/em\u003e, positioning it as a promising candidate for targeted bactericidal therapy against obesity-associated dysbiosis. 
These findings contribute valuable insights into phage-bacteria interactions in the gut and offer essential data for the development of precision therapies against intestinal pathobionts.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe VirHost Hunter framework presented here integrates three highlights. By conducting control analyses, we verified that each highlight enhanced the prediction performance. In a comprehensive comparison with other methods across multiple taxonomic levels, VirHost Hunter demonstrated superior precision and recall, and it also showed higher resolution, as it achieved accurate species-level host prediction. This can be attributed to three key factors: 1) the integration of a large language model, specifically ProtT5, allows for advanced contextual understanding of protein sequences, enabling VirHost Hunter to capture functional homology effectively; 2) by focusing on phage tails and lysins, VirHost Hunter can directly relate to the functional roles of these key proteins in phage-host interactions and make accurate and high-resolution predictions even in cases of incomplete genomic data; 3) the incorporation of DNA sequence features, such as codon usage and nucleotide composition, as a complement to protein features, further enriches the predictive capabilities of VirHost Hunter. These results provide insights into how to leverage machine learning to predict protein function and mine sequencing data in the future.\u003c/p\u003e \u003cp\u003eBecause the CRISPR-based method has been the single most widely used tool for assigning bacterial hosts, we also compared the performance of VirHost Hunter and the CRISPR-based method using two independent datasets with biological experimental evidence: a collection of 156 cultivated gut phages, and another collection of 31 lysins (Supplementary Results). 
We demonstrated that both methods had similar recall when precision was set at 84% and 69%, but VirHost Hunter had higher recall than the CRISPR-based method when the precision was set at 95%. Interestingly, combining both methods resulted in an improved host assignment ratio compared with either alone, likely in part because of differences in training datasets. The synergy between VirHost Hunter and the CRISPR-based predictions allowed us to expand the host assignment ratio of the GPD from 28.66% to 62.66%. Therefore, we propose a guideline for users: prioritize VirHost Hunter when aiming for highly precise or species-level prediction, and use both VirHost Hunter and the CRISPR-based method in parallel for general purposes.\u003c/p\u003e \u003cp\u003eUsing the calibrated model, we greatly improved the host assignment ratio of the gut phage database, particularly for phages associated with chronic diseases. We also identified dozens of new phages targeting \u003cem\u003eAkkermansia muciniphila\u003c/em\u003e and \u003cem\u003ePrevotella copri\u003c/em\u003e, whose phages have hardly been characterized before. To further promote application of the resource, we established the Gut Phage Lysin Database, cataloging 117,698 host-specific lysins targeting various gut bacteria. This database is pivotal for identifying and engineering lysins, particularly against bacteria linked to chronic diseases. As a proof of concept, we selected a lysin from the database for synthesis, and verified its efficacy and specificity against \u003cem\u003eMegamonas\u003c/em\u003e, an obesity-inducing bacterium. We are not aware of any previously reported means of targeting \u003cem\u003eMegamonas\u003c/em\u003e, and we have failed to isolate \u003cem\u003eMegamonas\u003c/em\u003e phages in our repeated efforts over the past few years. 
In fact, it has been rather difficult to isolate phages targeting obligate anaerobic bacteria in general, and thus VirHost Hunter can be exceptionally useful in this scenario, deciphering new phages to reveal biological insights and discovering new lytic proteins to inform therapeutic potential.\u003c/p\u003e \u003cp\u003eFujimoto \u003cem\u003eet al.\u003c/em\u003e showed that an \u003cem\u003eE. faecalis\u003c/em\u003e phage-derived endolysin worked effectively in humanized gnotobiotic acute graft-versus-host disease (GVHD) mice, as it decreased levels of intestinal cytolysin-positive \u003cem\u003eE. faecalis\u003c/em\u003e and significantly increased survival\u003csup\u003e\u003cspan citationid=\"CR89\" class=\"CitationRef\"\u003e89\u003c/span\u003e\u003c/sup\u003e. Compared to the 7-log reduction demonstrated by Fujimoto \u003cem\u003eet al.\u003c/em\u003e, lysin Ply491_6 inhibited bacterial growth by only 1- to 2-log, which is a good start but requires further engineering for downstream application. 
Some possible directions for engineering include: 1) fusing lytic proteins with functional peptides to form nanoparticles, which can enhance both lytic efficacy and stability\u003csup\u003e\u003cspan citationid=\"CR90\" class=\"CitationRef\"\u003e90\u003c/span\u003e\u003c/sup\u003e; 2) integrating the enzymatic active domains (EAD) and cell wall binding domains (CBD) from different lysins, particularly for endolysins, to boost lytic activity and broaden the host range\u003csup\u003e\u003cspan citationid=\"CR91\" class=\"CitationRef\"\u003e91\u003c/span\u003e\u003c/sup\u003e; 3) introducing targeted mutations at active sites or increasing positive charges to enhance lytic activity and binding efficiency\u003csup\u003e\u003cspan citationid=\"CR92\" class=\"CitationRef\"\u003e92\u003c/span\u003e\u003c/sup\u003e; 4) fusing lysins with receptor-binding proteins to improve the targeting specificity\u003csup\u003e\u003cspan citationid=\"CR93\" class=\"CitationRef\"\u003e93\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWhile VirHost Hunter demonstrated strong predictive performance, it has some limitations. Firstly, we only utilized phage tails and lysins for model training and host prediction. Although our data ruled out the possibility of using two structural proteins, other proteins might also confer different levels of host specificity. Secondly, VirHost Hunter should be robust when calibrated with any dataset, but given the focus of this work we only verified the scenario of searching for phages and lysins targeting disease-associated gut bacteria. Future research should aim to refine VirHost Hunter by incorporating a broader range of datasets, including diverse protein datasets and environmental contexts. 
A great advantage is that VirHost Hunter only requires key proteins as input, which can be extracted from prophages integrated within bacterial genomes and from fragmented phage genomes in metagenomic sequencing data, vastly expanding the scale of usable datasets.\u003c/p\u003e \u003cp\u003eFor instance, VirHost Hunter can be calibrated to target other gut bacteria that are not necessarily disease-associated, further improving the host assignment ratio of the gut phage database. The implications of this study also extend to the broader field of environmental microbiology beyond the gut microbiome, as environmental microbiologists face an even worse situation in phage host assignment. With an estimated 10³¹ particles globally, phages are a key component of Earth’s ecosystems and play crucial roles in regulating microbial populations, nutrient cycling, and ecosystem dynamics\u003csup\u003e\u003cspan citationid=\"CR94\" class=\"CitationRef\"\u003e94\u003c/span\u003e, \u003cspan citationid=\"CR95\" class=\"CitationRef\"\u003e95\u003c/span\u003e\u003c/sup\u003e. VirHost Hunter can then be calibrated to target environmental bacteria, shedding light on the \"viral dark matter\" and its interactions with bacteria in various ecosystems, among which extreme environments will be of special interest. Identifying the hosts of environmental phages will enhance our understanding of virus-host-environment interactions, their role in shaping microbial community structures, and their influence on biogeochemical processes. 
These insights can inform conservation efforts, bioremediation strategies, and the management of microbial communities in natural and engineered environments.\u003c/p\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003c/div\u003e \u003c/div\u003e "},{"header":"Methods","content":"\u003ch2\u003eEstablishment of the VirHost Hunter framework\u003c/h2\u003e\u003cp\u003eVirHost Hunter consists of two primary components: a feature extractor and a classifier (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA). When extracting features for VirHost Hunter, we utilize three distinct tools to process phage specific proteins and their corresponding DNA sequences, resulting in three types of features: protein sequence embeddings from the pre-trained ProtT5 model, physical-chemical characteristics of DNA sequences, and k-mer features of DNA sequences extracted via a DNN network. These features will be elaborated upon in detail below.\u003c/p\u003e\u003cp\u003eFor protein sequence representation, we leveraged the capabilities of the pre-trained protein language model ProtT5 to generate dense vector representations (embeddings) of protein sequences. Specifically, we utilized only the encoder portion of the ProtT5 model. The encoder integrates essential components such as a multi-head attention mechanism and feedforward layers, enabling it to capture intricate relationships between amino acid residues in the input protein sequence. This process yields rich embedding vectors containing valuable information regarding protein structure and functionality. 
We extract the average embedding vector from the last layer of the pre-trained model to generate the embedded feature vector, resulting in a 1024-dimensional feature vector for each protein sequence.\u003c/p\u003e\u003cp\u003eThe physical-chemical features employed to represent DNA sequences align with the methodology proposed by Boeckaerts et al\u003csup\u003e39\u003c/sup\u003e. These features encompass nucleotide frequency, GC content, codon frequency, and codon usage bias, amounting to a total of 133 dimensions for the representation of DNA sequences.\u003c/p\u003e\u003cp\u003eTo preserve the intrinsic sequence information of DNA sequences, we encoded them following the approach outlined by Wang et al. in their study DeepHost\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. This method represents DNA sequences through K-mer frequency. Subsequently, we construct a deep neural network (DNN). The DNN incorporates a convolutional neural network with three paths, each outputting a different number of channels, facilitating the capture of feature information at varying scales (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB). By using multiple channels in parallel and fusing their outputs, the model can simultaneously learn abstract features at different levels. Subsequently, we leverage the Vision Transformer (ViT)\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e, \u003cspan citationid=\"CR96\" class=\"CitationRef\"\u003e96\u003c/span\u003e\u003c/sup\u003e, utilizing the self-attention mechanism of the ViT model to capture global relationships and multi-channel feature representations, yielding richer original sequence feature embeddings.\u003c/p\u003e\u003cp\u003eSubsequently, we merge the three types of features learned by the model into a unified vector, which serves as input for the classifier ensemble. 
This ensemble includes a multilayer perceptron (MLP), an autoencoder, and a random forest (RF). During training, the parameters of the MLP are updated. We employ softmax as the output activation function and cross-entropy as the loss function, and use the Adam algorithm to minimize the loss.\u003c/p\u003e\u003cp\u003eBecause of the inherent characteristics of the softmax function, the MLP can assign high confidence to incorrect categories, exceeding the confidence levels justified by true probability estimates. To address this issue, we integrate both the RF and the MLP into the classification prediction process. Initially, we train the MLP and the feature extractor, then freeze the parameters of the feature extractor. Next, we train the autoencoder and the RF. During testing, predictions are a blend of MLP and RF outputs, with the RF correcting highly confident but potentially inaccurate MLP predictions. Supplementary Table S10 provides detailed VirHost Hunter construction parameters.\u003c/p\u003e\u003ch2\u003eBioinformatics pipeline for phage genome annotations\u003c/h2\u003e\u003cp\u003eA bioinformatics pipeline was developed to enable the rapid and efficient annotation of phage tail and lytic proteins. The pipeline involved several steps. Firstly, proteins were predicted from phage genomes using Prodigal v2.6.3 (-f gff -c -p meta)\u003csup\u003e\u003cspan citationid=\"CR97\" class=\"CitationRef\"\u003e97\u003c/span\u003e\u003c/sup\u003e. 
Secondly, the predicted proteins were aligned against multiple databases, including 1) the NR phage protein database using BLASTP v2.3.0 (-evalue 1e-5 -max_target_seqs 1 -outfmt '6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore')\u003csup\u003e98\u003c/sup\u003e, 2) the UniRef phage protein database using phmmer v3.1b2 (-E 1e-5), 3) the UniProtKB phage protein database using phmmer v3.1b2 (-E 1e-5), and 4) the TIGRFAM, SMART, CDD, ProSiteProfiles, SUPERFAMILY, PRINTS, PANTHER, Gene3D, PIRSF, Pfam, Coils, and MobiDBLite databases using hmmscan v3.1b2 (-E 1e-5). The alignment results from all databases were then merged into the final annotation.\u003c/p\u003e\u003cp\u003eFor phage tail protein identification, the keyword ‘tail’ was used to extract sequences from the final annotation results with SeqKit v0.16.0\u003csup\u003e99\u003c/sup\u003e. For phage lysin proteins, additional filters were applied in the BLASTP step (≥ 50% coverage and ≥ 50% identity), and specific keywords (‘lysis’/‘lyase’/‘lysin’/‘holin’/‘hydrolase’/‘spanin’/‘endolysin’) were used to extract sequences from the final annotation results with SeqKit v0.16.0\u003csup\u003e99\u003c/sup\u003e.\u003c/p\u003e\u003ch2\u003eConstruction of benchmark datasets\u003c/h2\u003e\u003cp\u003eComplete phage genomes were collected from NCBI using keywords for bacterial host genera, including ‘\u003cem\u003eStaphylococcus\u003c/em\u003e’, ‘\u003cem\u003eAcinetobacter\u003c/em\u003e’, ‘\u003cem\u003eEscherichia\u003c/em\u003e’, ‘\u003cem\u003eClostridium\u003c/em\u003e’, ‘\u003cem\u003eKlebsiella\u003c/em\u003e’, ‘\u003cem\u003ePseudomonas\u003c/em\u003e’, and ‘\u003cem\u003eSalmonella\u003c/em\u003e’. A total of 3,116 phage genomes were collected. 
Protein annotation was performed using the bioinformatics pipeline, resulting in 22,151 phage tail proteins (21,264 from the pipeline and 887 from a published paper by Boeckaerts \u003cem\u003eet al\u003c/em\u003e.\u003csup\u003e39\u003c/sup\u003e). From these, 7,493 RBPs were selected using keywords related to tail protein function, including ‘fiber’, ‘fibre’, and ‘spike’.\u003c/p\u003e\u003cp\u003eThree filters were applied to clean the tail protein and RBP datasets: 1) sequences shorter than 50 amino acids or longer than 1,500 amino acids were removed, 2) sequences containing the undetermined amino acid ‘X’ in the protein sequence or undetermined nucleotides ‘N’ in the CDS were excluded, and 3) identical protein sequences assigned to different hosts were discarded. The final benchmark datasets consisted of 4,845 RBPs in DRRBP and 12,509 tail proteins in DRTail.\u003c/p\u003e\u003ch2\u003eConstruction of tail protein and lysin datasets at multi-taxonomic levels\u003c/h2\u003e\u003cp\u003ePhage genomes from the viral category were screened in the NCBI database as of December 29, 2021 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ncbi.nlm.nih.gov/genome/browse/#!/viruses\u003c/span\u003e\u003cspan address=\"https://www.ncbi.nlm.nih.gov/genome/browse/#!/viruses\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). Entries containing only partial genomes or coding sequences were excluded from the dataset. Genomic sequences in FASTA format and annotation files in GBFF format were downloaded from the corresponding table on the NCBI FTP site. This screening process resulted in a total of 7,598 phage genomes for further analysis. Next, information related to the host organism was extracted from the annotation files (GBFF format) using a custom script. 
If the ‘host=’ field was empty, the taxon name that precedes the phage designation in the GenBank title (ORGANISM) was selected as the host information. For instance, because the host information of phage AF234172 was empty, we selected ‘Escherichia’ as the host based on the record ‘ORGANISM: Escherichia virus P1’. Then, the NCBI taxonomy toolkit TaxonKit was used to obtain the taxonomy ID and taxonomic rank of the host organism (taxonkit name2taxid --show-rank). The host taxonomic information was transformed into a standard format including phylum, class, order, family, genus, species, and strain (taxonkit lineage | taxonkit reformat | cut -f 1,3). This process resulted in the compilation of phage-host taxonomic rank information.\u003c/p\u003e\u003cp\u003eWe counted the number of RBPs and tail proteins in the datasets and observed an average of 1.33 RBPs and 15.24 tail proteins per phage (Figure S1). We also found that 53.10% of phages lacked RBPs, prompting us to construct datasets at multiple taxonomic ranks and to establish a tail protein-based VirHost Hunter (VirHost Hunter-tail) for broader applications. We filtered the phage data and created a phage tail protein dataset, including 37 families, 54 genera, and 57 species. Additionally, we trained VirHost Hunter on lysins, another type of host-specific protein, using 37,469 lysin protein sequences from the same 7,598 phages to construct a lysin-based VirHost Hunter (VirHost Hunter-lysin). The lysin dataset comprised 37 families, 42 genera, and 47 species.\u003c/p\u003e\u003cp\u003eFamily, genus, and species datasets for tail proteins were constructed based on the taxonomic ranks obtained in the previous step. The three datasets were filtered using the same three filters as used in constructing the benchmark datasets. Categories with fewer than 50 entries in each taxonomic dataset were discarded. 
After filtering, 47 families, 72 genera, and 120 species remained in the tail protein datasets. These three datasets were used to train the VirHost Hunter-tail model. To address bias issues observed in certain taxa, taxa with precision lower than 0.7 were eliminated from the datasets. For example, \u003cem\u003eEnterobacteriaceae\u003c/em\u003e has a recall of 0.9098 and a precision of 0.6941 at the family level, \u003cem\u003eEscherichia\u003c/em\u003e has a recall of 0.7125 and a precision of 0.3526 at the genus level, and \u003cem\u003eEscherichia coli\u003c/em\u003e has a recall of 0.7373 and a precision of 0.3199 at the species level. As a result, the final set of taxa comprised 37 families, 54 genera, and 56 species. Each dataset was randomly split into training, validation, and testing sets with a proportion of 6:2:2.\u003c/p\u003e\u003cp\u003eThe original lysin dataset contained a total of 37,469 protein sequences. The same three filters as previously applied to the benchmark datasets were used to clean the lysin dataset. However, the maximum allowed sequence length was set to 1,000 amino acids, since protein sequences longer than 1,000 amino acids accounted for less than 2% of the dataset. The screening process and building procedures for the VirHost Hunter-lysin dataset followed a similar approach to VirHost Hunter-tail. Taxa with precision lower than 0.62 were eliminated from the family, genus, and species taxonomy datasets based on the training results. This step ensured reliable predictions for the remaining taxa. After eliminating low-precision taxa, the final lysin datasets consisted of 37 families, 42 genera, and 47 species. 
Each dataset was randomly split into training, validation, and testing sets with a proportion of 6:2:2.\u003c/p\u003e\u003ch2\u003eAdditional filter for higher precision at multi-taxonomic ranks\u003c/h2\u003e\u003cp\u003eSince the range of categories that our model can cover is limited, an additional filter was applied to the VirHost Hunter models trained on the multi-taxonomic-level datasets and the gut prophage dataset, so that any input falling outside the prediction range yields an ‘Unknown’ output. To determine the appropriate cutoff for this filter, two datasets were constructed: a Positive Control, which comprised samples from the test dataset, and a Negative Control, which contained samples not belonging to any predefined class in the training dataset. The recall and precision on the Positive Control and the specificity on the Negative Control are shown in Figures S5 and S6. These figures show that more stringent cutoffs resulted in higher precision and lower recall. This occurred because, as the cutoff increased, more data were classified as ‘Unknown’, and the remaining predictions were those that VirHost Hunter considered more reliable.\u003c/p\u003e\u003cp\u003eTo benchmark VirHost Hunter’s performance against other methods, we considered the work of Dion \u003cem\u003eet al.\u003c/em\u003e\u003csup\u003e21\u003c/sup\u003e, who evaluated the precision and recall of a CRISPR spacer-based method at the genus level under different cutoffs of mismatch number or e-value. They found that with an e-value of 10\u003csup\u003e−9\u003c/sup\u003e, the method achieved the highest precision of 95% but the lowest recall of 2.5%. With zero mismatches, the method achieved 84% precision and 31% recall. By tolerating two mismatches, the method obtained a balanced performance of 69% precision and 49% recall. 
Accordingly, probability cutoffs were selected at each of the family, genus, and species levels to match these precision values of 95%, 84%, and 69% (Table S11). Consequently, VirHost Hunter was regarded as achieving a precision of 95% when both the precision on the Positive Control and the specificity on the Negative Control surpassed 95%.\u003c/p\u003e\u003ch2\u003eExtraction of synthetic lysins from PhaLP\u003c/h2\u003e\u003cp\u003eThe latest SQL file (v2021_04) was downloaded from the largest available Phage Lytic Protein database (PhaLP)\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. The SQL file provided lysin IDs, corresponding phage genome IDs, lysin annotation information, host taxonomy information, and experimental support information. Phage genome IDs that were not used in VirHost Hunter construction were marked in the dataset collected from NCBI, resulting in 3,448 phage genomes. Lysins that were synthesized and experimentally validated, together with their corresponding 31 phages, were selected from the dataset. A total of 138 tail proteins were annotated using the custom bioinformatics pipeline. Three phages could not be annotated with tail proteins, leaving a final real-world evidence set of 31 phage genomes and 138 phage tail protein sequences.\u003c/p\u003e\u003ch2\u003eExtraction of phage tail and lysin proteins from GPD\u003c/h2\u003e\u003cp\u003eThe Gut Phage Database (GPD) and the corresponding taxonomy information table by Camarillo-Guerrero et al. were downloaded\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e. Based on the ‘Host_range_taxo’ field in the information table, phages were categorized into two groups: those with host information and those without host information. Phage tail and lysin proteins were annotated using a custom annotation pipeline. 
A total of 163,590 lysin sequences and 388,894 tail protein sequences were obtained from 111,355 phages, which accounted for 77.97% of the total 142,809 phage genomes. A comprehensive dataset of 42,586 proteins was downloaded from NCBI (as of February 22, 2024) using keywords(lysis protein, lysin, lyase, holin, hydrolase, and endolysin AND phage). Lysins encoded by gut phages were identified by comparing with the dataset using BLASTP with a threshold of 60% identity and 50% coverage. VirHost Hunter-lysin model (95%, 84%), VirHost Hunter-tail model (95%, 84%), and CRISPR-based method (84%) were used to construct the Gut Phage Lysin Database (GPLD) targeting human gut commensal bacteria.\u003c/p\u003e\u003ch2\u003eStatistical analysis and sequence similarity network for the Gut Phage Lysin Database (GPLD)\u003c/h2\u003e\u003cp\u003eBiopython was employed to conduct a comprehensive statistical analysis of the GPLD database, which included aspects such as protein categories, secondary structure proportions, length, amino acid composition, molecular weight, isoelectric point, and stability index. The results were visually represented using ggplot2\u003csup\u003e100\u003c/sup\u003e. A sequence similarity network was established using a tool developed by Miguel M. Sandin (available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/MiguelMSandin/SSNetworks\u003c/span\u003e\u003cspan address=\"https://github.com/MiguelMSandin/SSNetworks\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e, which was based on lysin sequences clustered using CD-HIT with a 70% similarity threshold. The network construction parameters were set at an identity level of 35% and a coverage level of 50%. The resulting networks were visualized using Cytoscape. 
Furthermore, MEME was used with default parameters to identify conserved motif sites within the three primary clusters of the network.\u003c/p\u003e\u003ch2\u003eIdentification of \u003cem\u003eMegamonas\u003c/em\u003e-targeting lysin from GPLD\u003c/h2\u003e\u003cp\u003eA total of 536 unique lysin sequences specific to the genus \u003cem\u003eMegamonas\u003c/em\u003e were identified from the GPLD. These sequences were subsequently clustered into 167 distinct groups using CD-HIT with a sequence similarity threshold of 95% and a coverage threshold of 90% (-c 0.95 -aL 0.9). Ply491_6 (ivig_491_6), representing the largest cluster among these groups, was chosen for in-depth characterization and experimental validation of lytic activity. This process involved the prediction of signal peptides using SignalP (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://services.healthtech.dtu.dk/services/SignalP-6.0/\u003c/span\u003e\u003cspan address=\"https://services.healthtech.dtu.dk/services/SignalP-6.0/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), identification of transmembrane regions with HMMTOP (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://services.healthtech.dtu.dk/services/TMHMM-2.0/\u003c/span\u003e\u003cspan address=\"https://services.healthtech.dtu.dk/services/TMHMM-2.0/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), and assessment of physicochemical properties via ProtParam (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://web.expasy.org/protparam/\u003c/span\u003e\u003cspan address=\"https://web.expasy.org/protparam/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e\u003ch2\u003eSynthesis and purification of Ply491_6\u003c/h2\u003e\u003cp\u003eTo synthesize and purify Ply491_6, the 
gene encoding Ply491_6 was synthesized and subcloned into the pET-30a(+) plasmid using NdeI and XhoI restriction sites. The constructed plasmids were transformed into BL21 (DE3) competent cells. The transformed cells were cultured at 37°C on agar plates containing kanamycin at a final concentration of 50 µg/mL. Colonies were picked from the plates and cultured until the optical density at 600 nm (OD\u003csub\u003e600\u003c/sub\u003e) reached 0.6–0.8. Protein expression was induced by adding IPTG to a final concentration of 0.5 mM, followed by incubation of the cultures for an additional 4 hours at 37°C. Cells were then harvested and lysed, and the lysates were subjected to SDS-PAGE to verify protein expression. The proteins were then purified using Ni-NTA affinity chromatography. The purified proteins were dialyzed into phosphate-buffered saline (PBS) containing 300 mM NaCl and 10% glycerol, adjusted to pH 7.4, followed by filter sterilization.\u003c/p\u003e\u003ch2\u003eBacterial strains\u003c/h2\u003e\u003cp\u003e \u003cem\u003eMegamonas rupellensis\u003c/em\u003e strain 150922 was used for the lysin activity assay of Ply491_6 \u003cem\u003ein vitro\u003c/em\u003e. \u003cem\u003eBacteroides fragilis\u003c/em\u003e bf2 (BF1), \u003cem\u003eB. fragilis\u003c/em\u003e bf5 (BF2), \u003cem\u003eClostridium perfringens\u003c/em\u003e 0840 (CP1), \u003cem\u003eC. perfringens\u003c/em\u003e 0812 (CP2), \u003cem\u003eRuminococcus gnavus\u003c/em\u003e 1177 (RG1), \u003cem\u003eR. gnavus\u003c/em\u003e 1186 (RG2), \u003cem\u003eBifidobacterium longum\u003c/em\u003e 4486 (BL1), \u003cem\u003eB. longum\u003c/em\u003e 2366 (BL2), \u003cem\u003eLacticaseibacillus paracasei\u003c/em\u003e LAC-F (LP1), \u003cem\u003eL. paracasei\u003c/em\u003e LAC-J (LP2) and \u003cem\u003eLactiplantibacillus plantarum\u003c/em\u003e SZHD0015 (\u003cem\u003eL. plantarum\u003c/em\u003e) were used for comparing the lysin activity of Ply491_6 \u003cem\u003ein vitro\u003c/em\u003e. 
All bacterial strains were isolated from human feces and were grown overnight in BHI-YH (Brain Heart Infusion medium supplemented with 5 g/L yeast extract and 5 mg/L hemin). To maintain anaerobic conditions, all media and buffers were additionally supplemented with 0.5 g/L L-cysteine hydrochloride and 0.25 g/L anhydrous sodium sulfide, serving as reducing agents.\u003c/p\u003e\u003ch2\u003eLytic activity and specificity of Ply491_6\u003c/h2\u003e\u003cp\u003e \u003cem\u003eM. rupellensis\u003c/em\u003e strain 150922 was grown overnight, diluted 1:100, and grown to the mid-logarithmic phase. The bacterial cells were centrifuged, washed, and resuspended in phosphate-buffered saline (PBS, pH 7.4) to an OD\u003csub\u003e600\u003c/sub\u003e of 0.9. Phage lysin Ply491_6 was added to the bacterial suspension at a final concentration of 20 µg/mL. Each concentration was plated in a U-bottomed 96-well plate in triplicate. Ply491_6 dilution plates were then incubated at 37°C in a BioTek Epoch2 Microplate Spectrophotometer (BioTek Instruments, Inc., USA) for 240 minutes. The OD\u003csub\u003e600\u003c/sub\u003e was measured every 10 min.\u003c/p\u003e\u003cp\u003eTo verify the specificity of Ply491_6, \u003cem\u003eB. fragilis\u003c/em\u003e (n = 2), \u003cem\u003eC. perfringens\u003c/em\u003e (n = 2), \u003cem\u003eR. gnavus\u003c/em\u003e (n = 2), \u003cem\u003eB. longum\u003c/em\u003e (n = 2), \u003cem\u003eL. paracasei\u003c/em\u003e (n = 2) and \u003cem\u003eL. plantarum\u003c/em\u003e (n = 1) strains were each grown overnight. The bacterial cells were then centrifuged, washed, and resuspended in PBS, and were then incubated with 20 µg/mL Ply491_6 or PBS at 37°C for 240 min. The OD\u003csub\u003e600\u003c/sub\u003e was measured every 10 min.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eAuthor Information\u003c/h2\u003e \u003cp\u003eM.X. conceived the study. Z.D., K.L., and Y.O. developed the tool. M.L. and B.X. 
compiled the training, validation, and test sets. K.L., M.L., B.X., and M.X. analyzed the viral dark matter. K.L., M.L., B.X., and Y.O. drafted the manuscript and made the figures. Z.D., M.X., and Junhua L. revised the manuscript. Jianqiang L., J.W., H.Y., and X.X. provided consultation. All authors read, edited, and approved the final manuscript.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eThis work is supported by the National Key R\u0026amp;D Program of China (2020YFA0908700) and the National Natural Science Foundation of China (grants 32100130 and 62176164). We sincerely thank the China National GeneBank DataBase (CNGB) for providing valuable data support and computational resources. We extend our heartfelt sympathy to Min Li and Kaihuang Lin, who, despite being co-first authors, unfortunately did not witness the fruition of their work before their graduation. Their unwavering support since then has been invaluable. We hope that the next-generation co-second authors, Bo Xing and Yuehua Ou, will enjoy greater fortune in their academic endeavors.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBayfield OW et al (2023) Structural atlas of a human gut crassvirus. Nature 617:409\u0026ndash;416\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoskella B, Brockhurst MA (2014) Bacteria-phage coevolution as a driver of ecological and evolutionary processes in microbial communities. FEMS Microbiol Rev 38:916\u0026ndash;931\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBorin JM, Avrani S, Barrick JE, Petrie KL, Meyer JR (2021) Coevolutionary phage training leads to greater bacterial suppression and delays the evolution of phage resistance. Proc Natl Acad Sci U S A 118\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlazanin M, Turner PE (2021) Community context matters for bacteria-phage ecology and evolution. 
ISME J 15:3119\u0026ndash;3128\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLawrence D, Baldridge MT, Handley SA (2019) Phages and Human Health: More Than Idle Hitchhikers. Viruses 11\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFederici S, Nobs SP, Elinav E (2021) Phages and their potential to modulate the microbiome and immunity. Cell Mol Immunol 18:889\u0026ndash;904\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhargava K, Nath G, Bhargava A, Aseri GK, Jain N (2021) Phage therapeutics: from promises to practices and prospectives. Appl Microbiol Biotechnol 105:9047\u0026ndash;9067\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVijay A, Valdes AM (2022) Role of the gut microbiome in chronic diseases: a narrative review. Eur J Clin Nutr 76:489\u0026ndash;501\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuerin E, Hill C (2020) Shining Light on Human Gut Bacteriophages. Front Cell Infect Microbiol 10:481\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePorter NT et al (2020) Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat Microbiol 5:1170\u0026ndash;1181\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVazquez R, Garcia E, Garcia P (2018) Phage Lysins for Fighting Bacterial Respiratory Infections: A New Generation of Antimicrobials. Front Immunol 9:2252\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhose C, Euler CW (2020) Gram-Negative Bacterial Lysins. Antibiot (Basel) 9\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDanis-Wlodarczyk KM, Wozniak DJ, Abedon ST (2021) Treating Bacterial Infections with Bacteriophage-Based Enzybiotics: In Vitro, In Vivo and Clinical Application. 
Antibiot (Basel) 10\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRahman MU et al (2021) Endolysin, a Promising Solution against Antimicrobial Resistance. Antibiot (Basel) 10\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee C, Kim H, Ryu S (2023) Bacteriophage and endolysin engineering for biocontrol of food pathogens/pathogens in the food: recent advances and future trends. Crit Rev Food Sci Nutr 63:8919\u0026ndash;8938\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCriel B, Taelman S, Van Criekinge W, Stock M, Briers Y (2021) PhaLP: A Database for the Study of Phage Lytic Proteins and Their Evolution. Viruses 13\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCoutinho FH et al (2021) RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns (N Y) 2:100274\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePons JC et al (2021) VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 37:1805\u0026ndash;1813\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmgarten D, Iha BKV, Piroupo CM, da Silva AM, Setubal JC (2022) vHULK, a New Tool for Bacteriophage Host Prediction Based on Annotated Genomic Features and Neural Networks. Phage (New Rochelle) 3:204\u0026ndash;212\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZielezinski A, Barylski J, Karlowski WM (2021) Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships. BMC Biol 19:223\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDion MB et al (2021) Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. 
Nucleic Acids Res 49:3127\u0026ndash;3138\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang R et al (2021) SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37:3364\u0026ndash;3366\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEdwards RA, McNair K, Faust K, Raes J, Dutilh BE (2016) Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol Rev 40:258\u0026ndash;272\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi J, Yang F, Xiao M, Li A (2022) Advances and challenges in cataloging the human gut virome. Cell Host Microbe 30:908\u0026ndash;916\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGaliez C, Siebert M, Enault F, Vincent J, Soding J (2017) WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33:3113\u0026ndash;3114\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeite DMC et al (2018) Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinformatics 19:420\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi M et al (2021) A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction. IEEE/ACM Trans Comput Biol Bioinform 18:1801\u0026ndash;1810\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi M, Zhang W (2022) PHIAF: prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform 23\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu D, Ma Y, Jiang X, He T (2019) Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinformatics 20:594\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang W et al (2020) A network-based integrated framework for predicting virus-prokaryote interactions. 
NAR Genom Bioinform 2:lqaa044\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu C et al (2021) Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol 19:5\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShang J, Sun Y (2021) Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol 19:250\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShang J, Sun Y (2022) CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. Brief Bioinform 23\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTan J et al (2022) HoPhage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics 38:543\u0026ndash;545\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang T, Hou S, Fuhrman JA, Sun F (2022) Phage-bacterial contig association prediction with a convolutional neural network. Bioinformatics 38:i45\u0026ndash;i52\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZielezinski A, Deorowicz S, Gudys A (2022) PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics 38:1447\u0026ndash;1449\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVillarroel J et al (2016) HostPhinder: A Phage Host Prediction Tool. Viruses 8\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRuohan W, Xianglilan Z, Jianping W, Shuai Cheng L (2022) DeepHost: phage host prediction with convolutional neural network. Brief Bioinform 23\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBoeckaerts D et al (2021) Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. 
Sci Rep 11:1467\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonzales MEM, Ureta JC, Shrestha AMS (2023) Protein embeddings improve phage-host interaction prediction. PLoS ONE 18:e0289030\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaez-Espino D et al (2016) Uncovering Earth's virome. Nature 536:425\u0026ndash;430\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGregory AC et al (2020) The Gut Virome Database Reveals Age-Dependent Patterns of Virome Diversity in the Human Gut. \u003cem\u003eCell Host Microbe\u003c/em\u003e 28, 724\u0026ndash;740 e728\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCamarillo-Guerrero LF, Almeida A, Rangel-Pineros G, Finn RD, Lawley TD (2021) Massive expansion of human gut bacteriophage diversity. \u003cem\u003eCell\u003c/em\u003e 184, 1098\u0026ndash;1109 e1099\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNayfach S et al (2021) Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat Microbiol 6:960\u0026ndash;970\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTisza MJ, Buck CB (2021) A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc Natl Acad Sci U S A 118\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoux S et al (2023) iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol 21:e3002083\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDams D, Brondsted L, Drulis-Kawa Z, Briers Y (2019) Engineering of receptor-binding proteins in bacteriophages and phage tail-like bacteriocins. Biochem Soc Trans 47:449\u0026ndash;460\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYehl K et al (2019) Engineering Phage Host-Range and Suppressing Bacterial Resistance through Phage Tail Fiber Mutagenesis. 
Cell 179:459\u0026ndash;469e459\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOpperman CJ, Wojno JM, Brink AJ (2022) Treating bacterial infections with bacteriophages in the 21st century. S Afr J Infect Dis 37:346\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRakhuba DV, Kolomiets EI, Dey ES, Novik GI (2010) Bacteriophage receptors, mechanisms of phage adsorption and penetration into host cell. Pol J Microbiol 59:145\u0026ndash;155\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNelson D, Schuch R, Chahales P, Zhu S, Fischetti VA (2006) PlyC: a multimeric bacteriophage lysin. Proc Natl Acad Sci U S A 103:10765\u0026ndash;10770\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFlamholz ZN, Biller SJ, Kelly L (2024) Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 9:537\u0026ndash;549\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDosovitskiy A An image is worth 16x16 words: Transformers for image recognition at scale. \u003cem\u003earXiv preprint arXiv\u003c/em\u003e:(2010). \u003cem\u003e11929\u003c/em\u003e (2020)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElnaggar A et al (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 44:7112\u0026ndash;7127\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim GB, Gao Y, Palsson BO, Lee SY, DeepTFactor (2021) A deep learning-based tool for the prediction of transcription factors. Proc Natl Acad Sci U S A 118\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman L (2001) Random forests. Mach Learn 45:5\u0026ndash;32\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. 
Bioinformatics 28:3150\u0026ndash;3152\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLloyd-Price J et al (2019) Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569:655\u0026ndash;662\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchirmer M, Garner A, Vlamakis H, Xavier RJ (2019) Microbial genes and pathways in inflammatory bowel disease. Nat Rev Microbiol 17:497\u0026ndash;511\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTernes D et al (2022) The gut microbial metabolite formate exacerbates colorectal cancer progression. Nat Metab 4:458\u0026ndash;475\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWong SH, Yu J (2019) Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat Rev Gastroenterol Hepatol 16:690\u0026ndash;704\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQin Y et al (2024) Consistent signatures in the human gut microbiome of old- and young-onset colorectal cancer. Nat Commun 15:3396\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu R et al (2017) Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat Med 23:859\u0026ndash;868\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJie Z et al (2017) The gut microbiome in atherosclerotic cardiovascular disease. Nat Commun 8:845\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu C et al (2024) Obesity-enriched gut microbe degrades myo-inositol and promotes lipid absorption. Cell Host Microbe 32:1301\u0026ndash;1314e1309\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang T et al (2024) Divergent age-associated and metabolism-associated gut microbiome signatures modulate cardiovascular disease risk. 
Nat Med 30:1722\u0026ndash;1731\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQin J et al (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55\u0026ndash;60\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen J et al (2023) Large-scale phage cultivation for commensal human gut bacteria. Cell Host Microbe 31:665\u0026ndash;677e667\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDao MC et al (2016) Akkermansia muciniphila and improved metabolic health during a dietary intervention in obesity: relationship with gut microbiome richness and ecology. Gut 65:426\u0026ndash;436\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShin NR et al (2014) An increase in the Akkermansia spp. population induced by metformin treatment improves glucose homeostasis in diet-induced obese mice. Gut 63:727\u0026ndash;735\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShih CT, Yeh YT, Lin CC, Yang LY, Chiang CP (2020) Akkermansia muciniphila is Negatively Correlated with Hemoglobin A1c in Refractory Diabetes. Microorganisms 8\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang T et al (2020) Alterations of Akkermansia muciniphila in the inflammatory bowel disease patients with washed microbiota transplantation. Appl Microbiol Biotechnol 104:10203\u0026ndash;10215\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLo Sasso G et al (2021) Inflammatory Bowel Disease-Associated Changes in the Gut: Focus on Kazan Patients. Inflamm Bowel Dis 27:418\u0026ndash;433\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDanilova NA et al (2019) Markers of dysbiosis in patients with ulcerative colitis and Crohn's disease. Ter Arkh 91:17\u0026ndash;24\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu F et al (2020) Metagenome-wide association of gut microbiome features for schizophrenia. 
Nat Commun 11:1612\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlpizar-Rodriguez D et al (2019) Prevotella copri in individuals at risk for rheumatoid arthritis. Ann Rheum Dis 78:590\u0026ndash;593\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eScher JU et al (2013) Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2:e01202\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaeda Y et al (2016) Dysbiosis Contributes to Arthritis Development via Activation of Autoreactive T Cells in the Intestine. Arthritis Rheumatol 68:2646\u0026ndash;2661\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTsai CY et al (2023) Abundance of Prevotella copri in gut microbiota is inversely related to a healthy diet in patients with type 2 diabetes. J Food Drug Anal 31:599\u0026ndash;608\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYue T et al (2022) High-risk genotypes for type 1 diabetes are associated with the imbalance of gut microbiome and serum metabolites. Front Immunol 13:1033393\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang C et al (2024) Prevotella copri alleviates hyperglycemia and regulates gut microbiota and metabolic profiles in mice. mSystems 9:e0053224\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevoto AE et al (2019) Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat Microbiol 4:693\u0026ndash;700\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeston J, Elisseeff A, Zhou D, Leslie CS, Noble WS (2004) Protein ranking: from local to global structure in the protein similarity network. 
Proc Natl Acad Sci U S A 101:6559\u0026ndash;6563\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCopp JN, Anderson DW, Akiva E, Babbitt PC, Tokuriki N (2019) Exploring the sequence, function, and evolutionary space of protein superfamilies using sequence similarity networks and phylogenetic reconstructions. Methods Enzymol 620:315\u0026ndash;347\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDey KK, Xie D, Stephens M (2018) A new sequence logo plot to highlight enrichment and depletion. BMC Bioinformatics 19:473\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGupta A, Osadchiy V, Mayer EA (2020) Brain-gut-microbiome interactions in obesity and food addiction. Nat Rev Gastroenterol Hepatol 17:655\u0026ndash;672\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKasai C et al (2015) Comparison of the gut microbiota composition between obese and non-obese individuals in a Japanese population, as analyzed by terminal restriction fragment length polymorphism and next-generation sequencing. BMC Gastroenterol 15:100\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKocelak P et al (2013) Resting energy expenditure and gut microbiota in obese and normal weight subjects. Eur Rev Med Pharmacol Sci 17:2816\u0026ndash;2821\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFujimoto K et al (2024) An enterococcal phage-derived enzyme suppresses graft-versus-host disease. Nature 632:174\u0026ndash;181\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDzuvor CKO et al (2022) Engineering Self-Assembled Endolysin Nanoparticles against Antibiotic-Resistant Bacteria. ACS Appl Bio Mater\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee C, Kim J, Son B, Ryu S (2021) Development of Advanced Chimeric Endolysin to Control Multidrug-Resistant Staphylococcus aureus through Domain Shuffling. 
ACS Infect Dis 7:2081\u0026ndash;2092\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiez-Martinez R et al (2013) Improving the lethal effect of cpl-7, a pneumococcal phage lysozyme with broad bactericidal activity, by inverting the net charge of its cell wall-binding module. Antimicrob Agents Chemother 57:5355\u0026ndash;5365\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZampara A et al (2020) Exploiting phage receptor binding proteins to enable endolysins to kill Gram-negative bacteria. Sci Rep 10:12087\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHendrix RW, Smith MC, Burns RN, Ford ME, Hatfull GF (1999) Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci U S A 96:2192\u0026ndash;2197\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAdriaenssens EM (2021) Phage Diversity in the Human Gut Microbiome: a Taxonomist's Perspective. \u003cem\u003emSystems\u003c/em\u003e 6, e0079921\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A (2021) Do vision transformers see like convolutional neural networks? Adv Neural Inf Process Syst 34:12116\u0026ndash;12128\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHyatt D et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20\u0026ndash;25\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen W, Le S, Li Y, Hu F, SeqKit (2016) A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11:e0163962\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWickham H, Wickham H (2016) Data analysis. 
Springer\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo X et al (2020) CNSA: a data repository for archiving omics data. Database (Oxford) 2020\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen FZ et al (2020) CNGBdb: China National GeneBank DataBase. Yi Chuan 42:799\u0026ndash;809\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"BGI Research","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"microbiome, machine learning, phage-host prediction","lastPublishedDoi":"10.21203/rs.3.rs-8534670/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8534670/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eViral sequences in diverse environments remain largely uncharacterized, impeding our comprehension of their genetic makeup, biological interactions, and potential applications. This underscores an urgent need for innovative analytical methods. Here we present the VirHost Hunter framework, which employs phage tails and lysins, bypassing the requirement for full genomes, for efficient and high-resolution host assignment. 
By harnessing Protein Language Models and Vision Transformers, VirHost Hunter captures protein functional homology despite sequence dissimilarity, significantly boosting prediction accuracy. In the scenario of disease-associated gut bacteria, calibrated VirHost Hunter surpassed existing methods, doubling phage host assignments, expanding taxonomic reach, and revealing new phages targeting gut bacteria, including \u003cem\u003eAkkermansia\u003c/em\u003e and \u003cem\u003ePrevotella\u003c/em\u003e. Therefore, we established a gut phage lysin database, enabling the synthesis of a lysin that effectively and specifically targets an obesity-inducing bacterium. VirHost Hunter's precision and scalability mark a significant leap forward in virome research and present a promising avenue for microbiome therapies.\u003c/p\u003e","manuscriptTitle":"Decrypting viral dark matter through key proteins using an NLP-enhanced framework","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-13 17:17:17","doi":"10.21203/rs.3.rs-8534670/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"d5615d09-9c09-4356-88e8-2ec407eb88cd","owner":[],"postedDate":"January 13th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":61016164,"name":"Virology"}],"tags":[],"updatedAt":"2026-01-13T17:17:17+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-13 
17:17:17","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8534670","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8534670","identity":"rs-8534670","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

