Verification and Comparison of Pig, Mouse, and Human Genome Similarities: Use of Manual Assembly and Analyses

Research Article (Research Square)

Harry D. Dawson, Celine Chen, Jack Ragonese, Allen D Smith, Joan K Lunney

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6856588/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background

Recently there have been numerous attempts to improve the genome of the pig. Despite these efforts, a substantial amount of work remains to obtain a “finished version” of the genome; analysis of incomplete versions can lead to incorrect biological interpretations. To that end, we manually assembled and annotated a non-redundant library of 16,146 pig RNA and 15,613 pig protein sequences.
We used it to assess the assembly and annotation status of the 3 latest builds of the pig genome and to compare the pig genome with the mouse and human genomes.

Results

Our analysis of 3,333 protein coding genes reveals that the percentages of error-free assembled and annotated genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 are 69.4, 50.1, and 40.0%, respectively. An examination of these errors revealed nine predominant sources that are detailed in the Results. Using our protein library, we determined 1:1 orthology to 16,496 mouse and 15,770 human proteins. Of these proteins, 73.8% were conserved among the 3 species; however, when a gene was missing from one of the three genomes, pigs were 5.0X more likely than mice to have the human gene. REACTOME, KEGG, GO BP Direct and Ingenuity Pathway Analysis functional enrichment analyses of pig-human orthologous genes revealed 8, 3, 14 and 32 conserved pathways, respectively, compared with 0, 3, 0, and 29 for human-mouse orthologous genes. Last, we conducted an analysis of functional domain preservation for 3,465 proteins and discovered that when a functional domain is missing from a protein in 1 of the 3 species, pigs are 1.5X more likely than mice to have the human domain.

Conclusions

These data strongly indicate that, overall, swine are a scientifically important intermediate species (rodent-human) for conducting scientific research on human health.

Subject: Epigenetics & Genomics
Keywords: pig, mouse, human, genome, nutrition, metabolism, immunity

Introduction

The human and mouse genomes have undergone automatic and extensive manual annotation by the Human And Vertebrate Analysis and Annotation (HAVANA) group [1]. Recently, the development of telomere to telomere sequencing has led to the complete sequencing of the human and mouse genomes [2, 3].
However, there is evidence that the human and mouse genome assemblies are still flawed and could benefit from reassembly; at least 92 human and 22 mouse genes have their NCBI Annotation category listed as “suggests misassembly”. By comparison, only 21 pig genes have their NCBI Annotation category listed as “suggests misassembly”. There have been numerous attempts to improve the annotation and assembly of the porcine genome [4–8]. Despite these efforts, several recent analyses indicate that a significant amount of work remains before the pig genome can be considered “finished” [9, 10], particularly with respect to manual annotation [6, 8].

Recent work by our group and others overwhelmingly suggests that pigs and humans exhibit greater genome similarity at the macro level and share more genes than do humans and mice [11–13]; our preliminary analysis also indicated that pigs and humans have greater conservation of protein functional domains than do humans and mice [11]. In contrast, a recent study concluded that humans and mice share more Kyoto Encyclopedia of Genes and Genomes (KEGG)-related pathways than do humans and pigs [14]; however, the authors of that study speculated that their conclusion is likely to be affected by the incomplete status of the annotations in the porcine genome.

The various pig genomes annotated by NCBI and Ensembl (Table 1) have a wide range of predicted genes (46,573–152,168), predicted protein coding genes (19,974–22,125), predicted transcripts (56,900–78,200) and predicted transcripts/gene (0.39–2.92). It is not plausible that natural variation among pig breeds accounts for differences of this size. Rather, the variation is likely intrinsic to the annotation pipelines and reflects the differing numbers of updates and patches applied to each build. This assertion is supported by the observation that, using the same sequence source (Duroc build 11.1), NCBI predicts 6.8x and 2.9x more pseudogenes and transcripts per gene, respectively, than Ensembl.
Furthermore, a recent paper using automated analysis noted the incongruity of the annotation of protein coding porcine genes in the NCBI and Ensembl builds of 11.1, with 2119 and 3371 discordant genes [15]. Potential sources of this discrepancy were not identified. These data make it difficult to assess the actual number of genes and transcripts in machine-annotated genomes. Furthermore, any cross-species functional comparison using pigs would be compromised by these discrepancies.

Previously, using a manually annotated set of sequences, we discovered several sources of systematic errors in the prediction or annotation pipelines of the earlier pig reference genome (Ensembl and NCBI builds 10.2): selenoproteins, taste receptor (TR) genes, intronless genes, and artifactually duplicated genes. Herein, we extend our analysis to Ensembl and NCBI builds 11.1 and MARC 1.0, with a much larger library of RNA and protein sequences, in order to uncover potential systematic errors. We used this library to determine the conservation of pig and human 5’ and 3’ untranslated regions (UTRs) and RNA splice variants. We then used this nonredundant and highly annotated pig proteome to identify 1:1 mouse and/or human orthologs. We compared the 1:1 orthologs for all 3 species to determine functional enrichment. A similar analysis was performed on proteins with shared or non-shared functional domains.

Results

A. Error Analysis and Categorization

1. Protein Coding Genes

In a comparison of a subset of the 18405 protein coding genes, we determined the percentage of correctly assembled and annotated genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 to be 58.9%, 51.7% and 47.1%, respectively (Table 2). The sources of errors were varied. The most frequent broadly defined error category, error in annotated locus, occurred in 24.9, 42.6, and 43.8% of genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively.
Because of the higher error rate of MARC 1.0, we examined a larger, randomly chosen set of genes in the Ensembl assembly of MARC 1.0 (data not shown). This larger search of MARC 1.0 revealed a significant number of missing genes (306, of which 261 are protein coding) compared to NCBI build 11.1 (28) and Ensembl build 11.1 (36) (Table 1S). The missing genes span approximately 99.2 Mb of the genome and involve significant segments of porcine chromosomes 1 (41 genes), 2 (41 genes) and 13 (27 genes). This missing gene rate is much higher than expected. These areas of MARC 1.0 should be targeted for resequencing in any future builds. Our analysis also discovered 565 protein coding genes that are not annotated in MARC 1.0. If these results were extended to the whole genome, over 1,100 protein-coding genes would not be annotated and 500 would be missing. We also examined 500 genes from the newly sequenced Ossabaw genome (build 1.0 deposited in Ensembl) and found that the error rate was similar to that of MARC build 1.0 (data not shown).

2. Proteins of Extreme Size/Indel Analysis

We analyzed 350 proteins of extreme size (>2,000 amino acids (AA)) (Table 2S). Our error analysis revealed that the percentage of correctly assembled genes for these proteins in NCBI build 11.1, Ensembl build 11.1 and Ensembl MARC build 1.0 is 53.1%, 27.4% and 23.4%, respectively. This analysis also identified the most serious source of error preventing correct assembly and annotation in the Ensembl and NCBI builds of 11.1: the presence of an indel every 12,465 bp. This is rather surprising since the coverage of the genome averaged 65x [8]. Searching NCBI with the term “low quality protein” yields 2807 porcine protein coding genes that are affected by this. This number is inflated because some of these low-quality proteins are pseudogenes; however, the number is likely to be much higher in Ensembl build 11.1 because of the presence of pre-genome, annotated reference sequences in NCBI build 11.1.
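To illustrate why a single uncorrected indel is so damaging to gene prediction, the following minimal sketch translates a toy coding sequence before and after a one-base deletion, and again after the deleted position is padded with an ambiguous nucleotide (N). The toy sequence, the function name `translate`, and the use of X for an unresolved codon are illustrative assumptions; this is not the pipeline used by NCBI, Ensembl, or the authors.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate complete codons in frame 0, stopping at the first stop codon.

    Codons containing an ambiguous base (e.g. N) are reported as 'X'.
    A trailing partial codon is ignored.
    """
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        protein.append(residue)
        if residue == "*":
            break
    return "".join(protein)


# Hypothetical 7-codon coding sequence (not a real pig gene).
REFERENCE = "ATGGCTGAATTCAAAGGTTGA"
# Assembly error: the base at index 4 is missing, shifting the reading frame.
DELETION = REFERENCE[:4] + REFERENCE[5:]
# Placeholder repair: the missing base is represented by N, restoring the frame.
N_PADDED = REFERENCE[:4] + "N" + REFERENCE[5:]

print(translate(REFERENCE))  # intact reading frame
print(translate(DELETION))   # frameshifted: residues change and the stop is lost
print(translate(N_PADDED))   # frame restored, one unresolved residue
```

In this toy case the deletion alters every downstream residue and removes the in-frame stop codon, whereas the N-padded sequence keeps the correct length with a single unknown residue.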
NCBI usually fills the indel with ambiguous nucleotides (N) as a placeholder and annotates the product as a correctly sized, but low-quality, protein. Ensembl does not do this. As a result, a great number of truncated or elongated proteins arise in the Ensembl assembly because the algorithm appears to search for the next best splice site. These genes may still be useful for RNASeq if the stringency of matching is lowered; however, this raises the risk of erroneously mapping high-similarity sequences. As expected, every protein-coding gene we evaluated and identified in NCBI build 11.1 as low quality (1437) was also incorrect in Ensembl build 11.1; however, only 28.7% of these proteins (412) were correctly assembled in MARC, reflecting the overall fundamental error in the gene assembly algorithm, particularly for genes that have a high number of exons. This error also limits the accuracy of splice variant determination, as NCBI does not annotate predicted splice variants of low-quality proteins, and of functional domain analysis, because many of the misassembled proteins are missing one or more functional domains. Last, although NCBI's insertion of placeholder nucleotides benefits the analysis of genes affected by a real indel, NCBI also inserts ambiguous nucleotide(s) into pseudogenes that contain a predicted stop codon and annotates them as low-quality proteins. We found that 218/1582 (13.8%) of NCBI-annotated pig low-quality genes were actually pseudogenes.

3. Intronless Genes

Intronless genes constitute a significant number of genes with errors. Intronless genes make up approximately 3% of the human genome [16]. These genes can be divided into 2 categories: genes that consist of a single exon (true intronless genes) and genes whose protein-coding region spans a single exon but which are interrupted by introns in the UTRs. Estimates of the number of human single exon coding region genes approach 2000 [17].
The number of intronless genes in the pig is likely to be much higher because of the large number of Olfactory Receptor (OR) genes. Previous analysis of the intronless genes in humans and mice revealed that automatic annotation of these genes is problematic. Two related protein superfamilies, keratin associated proteins (KRTAP) and late cornified envelope (LCE) proteins, are overrepresented in intronless genes. Another prominent class of protein coding genes overrepresented in intronless genes is the G-protein coupled receptors. Subclasses of intronless G-protein-coupled receptors include vomeronasal receptors (VMRs) and TRs.

a. Keratin Associated Proteins (KRTAP) and Late Cornified Envelope (LCE) proteins

Our study determined that humans have 124 (107 genes, 17 pseudogenes), pigs have 125 (110 genes and 15 pseudogenes) and mice have 187 (141 genes and 46 pseudogenes) KRTAP genes (Tables 3 and 3S). The vast majority of the porcine genes we identified are not annotated as genes in the NCBI 11.1 (80 missing), Ensembl 11.1 (64 missing) or MARC 1.0 (68 missing) genomes (Table 3S). Furthermore, they have very limited sequence homology, so assigning 1:1 orthology is difficult. Our study also determined that humans have 19, pigs have 15 and mice have 21 LCE proteins. Like the KRTAP genes, many of the 15 porcine LCE genes we identified are not annotated as genes in the NCBI 11.1 (10 missing), Ensembl 11.1 (7 missing) or MARC 1.0 (8 missing) genomes.

b. Vomeronasal Receptors (VMRs)

We found that pigs (14) and humans (5) have a significantly smaller number of VMRs (VMN1+VMN2) compared to mice (225). The larger number of VMR genes in pigs relative to humans arises because the pig VN1R4 gene has diverged into 12 paralogs (VN1R4, VN1R4L1, VN1R4L2, LOC110261363, LOC110261366, LOC110261370, LOC110261364, LOC102167894, LOC100520313, LOC100738896, LOC106510602) and 2 pseudogenes (VN1R4Ps1, VN1R4Ps2). We identified one intact pig-mouse VMN1 ortholog (VMN1R233).
VMN2R1, a rodent-specific VMN2 gene, is an expressed pseudogene in pigs.

4. Endogenous Retroviral Sequences (ERVs)

ERVs comprise 8% of the human genome [18]; however, few are translated into functional proteins. We discovered more than 500 endogenous retroviral sequences in Ensembl build 11.1 that are annotated as protein coding (data not shown). The majority of these are whole or fragmentary parts of retroviral endonucleases/reverse transcriptases and are likely to be artifacts. These errors significantly inflate the number of pig proteins, especially those that have been deemed to be pig specific. The vast majority of these sequences are filtered out of the human and mouse Ensembl genome builds (mice have 3). Only 3 bona fide, intact endonuclease/reverse transcriptases are found in the pig (ABR01162.1, 1272 AA), human (AL50637.1, 1275 AA) and mouse (AAC72793.1, 1281 AA) genomes.

5. Selenoproteins

In NCBI build 11.1, only one selenoprotein is incorrectly assembled; however, in Ensembl build 11.1 and MARC build 1.0 of the porcine genome, 11 out of 25 (44%) and 18 out of 25 (72%) selenoprotein genes, respectively, are assigned a premature stop codon or have additional errors (Tables 3 and 4S). All human and mouse selenoproteins are correctly assembled and annotated in NCBI and Ensembl.

6. Mucins

We identified 20 mucin genes in pigs. Of these, 33.3%, 14.3% and 19.0% are properly assembled and annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively, and only two genes, CD164 and MUC15, were assembled properly in all three builds. There is little overlap among the correctly assembled genes (Tables 3 and 5S), and a variety of assembly errors occur throughout all three builds.

7. Protocadherins

Among the 3 species, we identified 80 protocadherin genes. Sixty-seven protein-coding protocadherin genes exist in pigs.
Of these, 22.4%, 11.9% and 9.0% are properly assembled and annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively; however, there is little overlap in this gene set (Tables 3 and 6S).

8. Readthrough, fusion or conjoined genes

Via manual annotation, we determined that pigs can likely make 103 readthrough genes (Tables 3 and 7S). We developed a scoring system to assess the confidence of our predictions. The categories, from lowest to highest confidence, are: 0 = no transcript could be predicted, human, primate or mouse-specific gene; 1 = predicted transcript but no evidence of existence in other species; 2 = predicted transcript, limited evidence of existence (2 species or less); 3 = predicted transcript, limited evidence of existence (2 species or less), transcription demonstrated in pig; and 4 = predicted transcript, evidence of existence (2 species or more), transcription demonstrated in pig. The numbers of pig readthrough genes in categories 1, 2, 3 and 4 are 7, 30, 42 and 24, respectively. Ninety-two of these correspond to human or primate genes and many have orthology to transcripts in other mammalian species. We determined that only three pig genes are annotated as readthrough genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 combined.

9. Microproteins

Microproteins and small open reading frame (sORF)-encoded proteins (SEPs) are small (<100 AA) proteins containing a single domain. Standard genome annotation pipelines routinely miss these [19]. They can exist as distinct genes or arise as a result of a shift in the open reading frame during translation. More than 1,000 of these have been identified in mice and humans using the OpenProt 2.0 database [20]. Sheep and cow proteins, but not pig proteins, are indexed in this database.
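The readthrough confidence categories defined in section 8 can be expressed as a small decision function. The sketch below is one possible reading of those categories, not the authors' implementation: the function and parameter names are hypothetical, the species count is assumed to exclude the pig itself, exactly 2 supporting species is treated as "limited" evidence (the published definitions overlap at 2), and a transcript with broader evidence but no demonstrated pig transcription is assigned to category 2 because the text does not define that case.

```python
def readthrough_confidence(transcript_predicted: bool,
                           supporting_species: int,
                           transcribed_in_pig: bool) -> int:
    """Return a confidence category from 0 (lowest) to 4 (highest).

    supporting_species: number of other species with evidence for the
    readthrough transcript (assumption: the pig is not counted).
    """
    if not transcript_predicted:
        return 0  # no transcript could be predicted
    if supporting_species == 0:
        return 1  # predicted, but no evidence in other species
    if not transcribed_in_pig:
        return 2  # evidence in other species, no pig transcription shown
    # Transcription demonstrated in pig: split on breadth of evidence.
    # Assumption: more than 2 species counts as category 4.
    return 4 if supporting_species > 2 else 3


# Category counts reported in the text for the pig readthrough genes.
reported_counts = {1: 7, 2: 30, 3: 42, 4: 24}
total_readthrough_genes = sum(reported_counts.values())
print(total_readthrough_genes)
```

The reported category counts sum to the 103 readthrough genes stated in the text, which is a quick consistency check on the tally.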
We identified 153 human microproteins via a literature search, as these genes are sometimes not annotated as protein coding genes in OpenProt or in the NCBI-annotated genome. We discovered 115 pig orthologs or paralogs of these genes by translated BLAST (tblastn) of the human protein sequence against the pig genome. The corresponding DNA sequence was translated, and the putative DNA or protein sequences were used to determine whether these proteins were annotated as such in the 3 porcine genomes. Data are presented in Tables 3 and 8S. Only 49.1%, 51.7% and 47.4% of pig microprotein RNAs or proteins are properly assembled and/or annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively.

B. Protein Orthology and Functional Domain Analysis

1. Protein Orthology Analysis

We determined orthology for 47879 (15613 pig, 16496 mouse, and 15770 human) protein coding genes (Table 9S and Figure 1). NCBI lists 20790, 22192 and 20080 protein coding genes in pigs, mice, and humans, respectively; therefore, our analysis is estimated to encompass 75.1% of pig, 74.3% of mouse, and 78.5% of human proteins (a species average of 76.0%). We omitted OR, TCR and BCR, and MHC class I and II proteins from our analysis. Our species average estimate of coverage of the proteome excluding these groups is likely to exceed 80%. Our analysis shows that when a gene is missing from one of the three genomes, pigs are 5.0X more likely than mice to have the human gene. Mice had 2.2X and 2.5X the number of unique proteins compared to humans and pigs, respectively. These unique proteins, and their pseudogenes, were categorized by Superfamily and appear in Table 4. As expected, mice exhibited the largest number of Superfamily member expansions. We conducted DAVID data mining of differentially encoded proteins to determine pathway enrichment.
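For readers unfamiliar with how fold enrichment and corrected p values of this kind are typically derived, the following is a minimal sketch under stated assumptions. DAVID reports fold enrichment together with a modified Fisher exact statistic; here a plain hypergeometric upper tail is used as a stand-in, and Benjamini-Hochberg adjustment is shown as one common multiple-comparison correction. The function names and all counts are hypothetical and are not taken from the study's data.

```python
from math import comb


def fold_enrichment(hits: int, list_size: int,
                    pathway_size: int, background_size: int) -> float:
    """Ratio of the pathway fraction in the gene list to that in the background."""
    return (hits / list_size) / (pathway_size / background_size)


def hypergeom_upper_tail(hits: int, list_size: int,
                         pathway_size: int, background_size: int) -> float:
    """P(X >= hits) when list_size genes are drawn without replacement
    from a background containing pathway_size pathway members."""
    total = comb(background_size, list_size)
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy numbers: 2 of 2 listed genes fall in a 5-gene pathway
# drawn from a 10-gene background.
print(fold_enrichment(2, 2, 5, 10))
print(hypergeom_upper_tail(2, 2, 5, 10))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

A pathway is then called enriched only if its adjusted p value remains below the chosen threshold, which is why fewer terms survive correction than have raw p values below 0.05.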
We found 8 REACTOME pathways (Tables 5 and 10S) that were enriched, after correction for multiple comparisons, for the pig-human comparison: R-HSA-212436~Generic Transcription Pathway (3.0 fold, p = 5.27E-24), R-HSA-6805567~Keratinization (7.1 fold, p = 5.67E-22), R-HSA-73857~RNA Polymerase II Transcription (2.7 fold, p = 3.61E-21), R-HSA-74160~Gene expression (Transcription) (2.4 fold, p = 8.95E-18), R-HSA-1461957~Beta defensins (13.2 fold, p = 4.91E-12), R-HSA-6803157~Antimicrobial peptides (7.7 fold, p = 5.09E-11), R-HSA-1461973~Defensins (10.8 fold, p = 1.24E-10) and R-HSA-168249~Innate Immune System (1.5 fold, p = 3.93E-02). We found no significant pathway enrichment for the mouse-human REACTOME comparisons.

We found 3 conserved human-pig KEGG pathways (Tables 5 and 10S) that were enriched after correction for multiple comparisons: hsa04613:Neutrophil extracellular trap formation (4.4 fold, p = 3.15E-03), hsa04061:Viral protein interaction with cytokine and cytokine receptor (5.5 fold, p = 2.32E-02) and hsa05322:Systemic lupus erythematosus (4.3 fold, p = 3.42E-02). We also found 3 conserved human-mouse KEGG pathways (Tables 5 and 10S) that were enriched after correction for multiple comparisons: hsa00982:Drug metabolism - cytochrome P450 (12.7 fold, p = 3.81E-02), hsa00980:Metabolism of xenobiotics by cytochrome P450 (11.7 fold, p = 3.81E-02) and hsa00983:Drug metabolism - other enzymes (11.5 fold, p = 3.81E-02).

We conducted GO BP DIRECT mining of differentially encoded proteins to determine ontology enrichment. We found 13 GO terms that were enriched after correction for multiple comparisons for the pig-human comparison, although many more terms had p values < 0.05 before correction (Table 5).
These terms were GO:0006355~regulation of DNA-templated transcription (4.7 fold, p = 6.68E-41), GO:0006357~regulation of transcription by RNA polymerase II (4.7 fold, p = 6.68E-41), GO:0042742~defense response to bacterium (3.5 fold, p = 6.68E-41), GO:0045087~innate immune response (2.6 fold, p = 1.13E-04), GO:0061844~antimicrobial humoral immune response mediated by antimicrobial peptide (4.6 fold, p = 9.36E-04), GO:0031424~keratinization (5.5 fold, p = 1.22E-02), GO:0019373~epoxygenase P450 pathway (12.4 fold, p = 1.22E-02), GO:0048006~antigen processing and presentation, endogenous lipid antigen via MHC class Ib (33.3 fold, p = 1.55E-02), GO:0006805~xenobiotic metabolic process (4.1 fold, p = 1.81E-02), GO:0050829~defense response to Gram-negative bacterium (4.4 fold, p = 1.81E-02), GO:0031640~killing of cells of another organism (4.4 fold, p = 3.48E-02), GO:0048007~antigen processing and presentation, exogenous lipid antigen via MHC class Ib (23.6 fold, p = 3.50E-02) and GO:0006955~immune response (2.1 fold, p = 4.73E-02). We found no significant enrichment for the mouse-human GO BP DIRECT comparisons.

We conducted Ingenuity Pathway Analysis (IPA) of pig-human and mouse-human conserved genes and found 32 and 29 pathways, respectively. These are summarized in graphical form in Figure 2 and in tabular form in Table 10S. In addition to the pathways discovered by REACTOME and GO BP Direct, genes involved in the IL-13 Signaling Pathway (p = 1.48E-07) and the IL-17 Signaling Pathway (p = 1.43E-04), as well as Retinol Biosynthesis (p = 1.93E-02) and α-tocopherol Degradation (p = 2.79E-04), were enriched in the pig-human dataset (Figure 2A). For the mouse-human dataset (Figure 2B), Granulocyte Adhesion and Diapedesis, the top canonical pathway (p = 9 2.27E-04), and Interleukin-10 signaling (p = 1.78E-02) were the sole immune-related pathways. No nutrition-related pathways were found to be enriched by IPA in the mouse-human dataset.
Although there were similar numbers of pig-human and mouse-human conserved pathways, the number of genes per node and the statistical significance were lower for the mouse-human comparisons: 45 mouse-human pathways were significant at p < 0.01, whereas 35 pig-human pathways were significant at p < 0.01.

2. Functional Domain Comparison

We examined a randomly chosen subset of 3465 protein coding genes (10395 proteins overall) for preservation of protein Superfamily domains or other features (Tables 4 and 10S). Six examples of these are shown in Figure 3. We identified 644 structural differences between 1737 proteins (shared between pig, mouse, and human). We then conducted DAVID data mining of proteins with differing functional domains to determine pathway enrichment (Table 6). We found 2 REACTOME pathways that were enriched, after FDR correction for multiple comparisons, for the pig-human comparison: R-HSA-1474244~Extracellular matrix organization (4.8 fold, p = 1.22E-04) and R-HSA-168256~Immune System (1.9 fold, p = 4.46E-03). We found 4 pathways that were significantly enriched in the pig-mouse comparison: R-HSA-168256~Immune System (2.3 fold, p = 1.35E-05), R-HSA-168898~Toll-like Receptor Cascades (7.7 fold, p = 1.15E-04), R-HSA-1280215~Cytokine Signaling in Immune system (3.2 fold, p = 3.76E-04) and R-HSA-449147~Signaling by Interleukins (3.5 fold, p = 1.17E-02). In contrast, we found no pathway enrichment for the mouse-human comparison. Similar results were obtained from mining GO BP DIRECT.
No enrichment was found for the pig-human comparison, whereas 17 terms were significant in the pig-mouse comparison: GO:0045087~innate immune response (5.1 fold, p = 6.99E-07), GO:0006954~inflammatory response (5.9 fold, p = 6.99E-07), GO:0002224~toll-like receptor signaling pathway (28.3 fold, p = 7.27E-04), GO:0050729~positive regulation of inflammatory response (10.2 fold, p = 7.85E-04), GO:0043123~positive regulation of canonical NF-kappaB signal transduction (6.0 fold, p = 3.33E-03), GO:0032729~positive regulation of type II interferon production (11.4 fold, p = 6.32E-03), GO:0007157~heterophilic cell-cell adhesion via plasma membrane cell adhesion molecules (15.6 fold, p = 6.41E-03), GO:0051607~defense response to virus (5.3 fold, p = 1.61E-02), GO:0007155~cell adhesion (3.4 fold, p = 1.61E-02), GO:0032757~positive regulation of interleukin-8 production (11.8 fold, p = 1.69E-02), GO:0070555~response to interleukin-1 (17.7 fold, p = 1.69E-02), GO:1901224~positive regulation of non-canonical NF-kappaB signal transduction (11.4 fold, p = 1.69E-02), GO:0031297~replication fork processing (16.8 fold, p = 1.89E-02), GO:0051092~positive regulation of NF-kappaB transcription factor activity (7.4 fold, p = 2.94E-02), GO:0071260~cellular response to mechanical stimulus (9.5 fold, p = 3.11E-02) and GO:0006979~response to oxidative stress (7.0 fold, p = 3.64E-02). With 2 exceptions, all of these terms reflect genes involved in immunity and/or inflammation. No enrichment was observed for the mouse-human comparison.

We evaluated whether the differences in functional domains occurred randomly or whether particular functional domains were more abundant in each species.
Of the domains that appear more than 5 times in one species, we found that pigs had 6 (Atrophin-1, PHA03307, PHA03378, PRK03918, PRK07764, PTZ00121, PTZ0049), mice had 2 (Collagen, Herpes BLFF1, PRK03918) and humans had 5 (Atrophin-1, PHA03307, PHA03378, PTZ00121, PTZ0049) Superfamily domains that were increased or decreased by 25% relative to the other species. The meaning of this is not clear, as the vast majority of these structural domains do not have a defined function.

C. 5’ and 3’ UTR and splice variant analyses

1. 5’ and 3’ UTR mRNA conservation

Our current analysis of 8151 mRNAs indicates that 1.2% (98) and 5.09% (415) of genes have low conservation of the 5’ or 3’ UTR regions, respectively, while 0.85% (69) of genes have low conservation of both the 5’ and 3’ UTRs (Table 11S). DAVID REACTOME analysis indicated that the genes with low 3’ UTR conservation between human and pig were enriched in genes related to metabolism (19.7 fold) and the immune system (16.3 fold), although the relationship did not persist after correction for multiple comparisons. Our previous analysis indicates that conservation of mRNA sequences between pigs and humans averages 75%. A separate analysis indicated that the current version of Ensembl build 11.1 does not adequately capture the UTR regions of genes [10].

Next, we determined the potential conservation of 14,122 human transcript variants (NM_ + XM_) in pigs (Table 12S). We found that pigs could make 89.7% (12,556) of the 14,122 human transcript variants. We then examined the NCBI-annotated number of exons for 4,824 randomly selected pig and human genes. We discovered a serious under-annotation of exons in the pig genome, almost one exon fewer per gene, with a strong bias towards genes with a large number of exons.

Discussion

The data presented here should help to obtain a finished version of the swine genome and enable improved biological insights.
With our manually assembled and annotated non-redundant library of 16,146 pig RNA and 15,613 pig protein sequences, we assessed the assembly and annotation status of the 3 latest builds of the swine genome and compared them to the mouse and human genomes. Since we generated an extremely large amount of data, we will highlight the major and unique aspects of our analyses.

Error Analysis and Categorization

Our comparison of a subset of 6135 protein coding genes revealed the percentage of correctly assembled and annotated genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 to be 58.9%, 51.7% and 47.1%, respectively. The sources of errors were varied but could be systematized to a certain degree: an indel every 12 Kbp, failure to annotate genes for various reasons (intronless genes, mucins, protocadherins, readthrough genes), and annotation of endogenous retroviral sequences as protein coding genes. In addition, there are genes that are actually missing from one or more genomes. It is likely that genome build 11.1 is over 95% complete with regard to sequence representation; however, MARC 1.0 may be slightly less complete.

One gene, phospholipase A and acyltransferase 4 (PLAAT4), could not be assigned a chromosomal location because it is missing from all 3 genomes (NCBI build 11.1, Ensembl build 11.1 and MARC 1.0). It was also missing from several other porcine genomes (Berkshire_pig_v1, Hampshire_pig_v1, Large_White_v1, Pietrain_pig_v1, Tibetan_Pig_v2), but present in the Bamei_pig_v1, Jinhua_pig_v1, Meishan_pig_v1, Rongchang_pig_v1, and Wuzhishan minipig_v1.0 genomes. There are at least two possible explanations for this. The first is that these DNA regions are difficult to sequence. The second, and more interesting, possibility is that the presence or absence of the gene genuinely differs among pig breeds.
The gene is also missing from cows and rodents but is present in other ungulates, such as Phacochoerus africanus (order Artiodactyla) and Diceros bicornis and Ceratotherium simum simum (order Perissodactyla). The protein derived from this gene is a retinoic acid-induced negative regulator of cell proliferation with phospholipase A(1/2) activity [21, 22].

Errors due to Indels

The presence of an indel every 12 Kbp is a major flaw in the Ensembl and NCBI builds of 11.1. It affects the correct assembly and annotation of thousands of genes and limits the accuracy of splice variant determination, as NCBI does not annotate predicted splice variants of low-quality proteins. Although NCBI's insertion of placeholder nucleotides benefits the analysis of genes affected by a real indel, NCBI also inserts ambiguous nucleotide(s) into pseudogenes that contain a predicted stop codon and annotates them as low-quality proteins. We previously documented this error for type 1 IFNs [23], CD Markers (BTLA, CD160) [24] and components of the inflammasome (NLRC4, NAIP, NLRP14) [25].

Keratin Associated Proteins (KRTAP) and Late Cornified Envelope (LCE) proteins

Keratin associated protein genes are a family of intronless genes that encode type I keratins of the Keratin B2 Superfamily. Our study determined that humans have 124 (107 genes, 17 pseudogenes), pigs have 125 (110 genes and 15 pseudogenes) and mice have 187 (141 genes and 46 pseudogenes) KRTAP genes. A recent census of human genes discovered 93 protein coding KRTAP genes divided into 26 subfamilies [26]. An expanded KRTAP gene repertoire was previously reported in rodents compared to humans [27]. A preliminary characterization identified 121 KRTAP genes (102 genes and 19 pseudogenes) in pigs [28]; however, the full repertoire of these genes in pigs has not been reported, for several reasons.
The version of the pig genome used for that characterization (10.2) was missing the 5’ and 3’ regions flanking the KRTAP5 cluster [28]. The vast majority of the porcine genes we identified are not annotated as genes in the NCBI 11.1, Ensembl 11.1 or MARC genomes. Furthermore, they have very limited sequence homology, so assigning 1:1 orthology is difficult. KRTAP proteins are involved in the formation of intermediate filaments that provide structural support to cells in the body, particularly in the skin and hair. These proteins play a critical role in maintaining the integrity of the skin and hair, and defects in KRTAP genes have been associated with various skin and hair disorders in humans. The number of KRTAP genes roughly corresponds to the fur coverage of each species [26].

Vomeronasal Receptors (VMRs)

The number of VMRs is proportional to each species' reliance on pheromone communication. We found a larger number (225) of mouse VMRs than previously reported [29]; however, some of these may be pseudogenes or may not be expressed.

Selenoproteins

The reason for the discrepancies in the fidelity of selenoprotein assembly and annotation between the species is unknown; conceptually, however, it seems possible to automatically assemble and annotate porcine selenoproteins. Selenium, in the form of selenocysteine (Sec), is cotranslationally inserted into polypeptide chains in response to the UGA codon, whose normal function is to terminate translation [30]. To decode UGA as Sec, organisms evolved the Sec insertion machinery that allows incorporation of this amino acid at specific UGA codons in a process requiring a cis-acting Sec insertion sequence (SECIS) element [30].

Mucins

The repetitive nature of sequences in mucin genes makes them difficult to assemble [31]. Of the 87 human genes that have their NCBI Annotation category listed as “suggests misassembly”, 8 (MUC1, MUC2, MUC3A, MUC4, MUC5AC, MUC6, MUC8, MUC16, MUC19) are mucins.
The actual number of mucin genes in pigs has been difficult to determine. MUCL3 was previously reported to be a pseudogene in pigs [32] but instead is a misassembled gene and a truncated protein in all three builds. Conversely, MUC17 is a pseudogene in pigs, although a previous publication reported that this gene is expressed in pigs [5]. Some mucins provide the critical barrier between epithelial cells and the environment. Disorders of mucin synthesis can lead to human diseases such as bronchial asthma, ulcerative colitis/inflammatory bowel disease and cystic fibrosis. Having full-length pig sequences will aid in the characterization of these mucins in pig models of human disease.

Protocadherins

The high error rate found for protocadherin assembly and annotation arises because the annotation process is not readily amenable to automation. Protocadherins are arranged in tandemly linked gene clusters in a manner similar to that of B-cell and T-cell receptor gene clusters. Virtually nothing is known about pig protocadherins; a single study compared pig, mouse, and human PCDH11X genes and found that all exons present in mouse and pig transcripts had homologous sequences in the human genome, but that not all exons are represented in human transcripts [33]. Protocadherins play a general role in cell adhesion and development. They are particularly important in the function of neurons. Disruption of protocadherin expression or function is thought to play a role in the development of certain human cancers and of neurological and inflammatory diseases [34].

Readthrough Genes

To the best of our knowledge, only one pig readthrough gene (pig-specific), TNNI2-ACTA1, has been described in the literature [35]. Readthrough transcripts are RNA transcripts that are formed via exon splicing of more than one distinct gene. They are somewhat analogous to alternative splicing in terms of providing genome diversity, although their numbers are smaller.
An early estimate indicated that there were 751 conjoined or readthrough genes in the human genome that are supported by at least one mRNA or EST sequence [36]. More recent GENCODE estimates list 650 and 230 for the human and mouse genomes, respectively [37], but there is little overlap between the 2 species. Readthrough protein sequences can differ from their corresponding parent protein sequences due to frame shifting. In addition, a significant number of these are classified as ncRNA, and an even smaller subset has been designated as non-sense mediated decay (NMD) candidates. Because of the recency of their discovery, the function of readthrough RNAs or proteins is unknown. Genes that are fusions of genes with critical functions in immunity (IFNAR2-IL10RB, TNFSF12-TNFSF13) or metabolism (RBP1-NMNAT3, NT5C1B-RDH14) are likely to have vastly different functions than either parent gene.

Notable Gene Superfamilies Missing in Pigs

The Semenogelin and Seminal Vesicle Secretory Proteins (SVG) family consists of 2 family members in humans and 6 in mice. Although several catalog vendors claim to have antibodies that are cross-reactive to porcine G1 (SEMG1) [18] and an early report describes the presence of a peptide (HNKQEGRDHD) corresponding to human SEMG1 [38] in boar semen, we could not find any SVG family members in pigs. The protein is also missing from other determined proteomes of boar semen [39, 40]. The loss of this protein family is not unique to pigs, because we could not find any evidence of members of this protein family in mammalian orders outside of rodents and primates. The functional meaning of the absence of this protein family is unknown; however, SEMG1, the most abundant protein in human sperm, is essential for sperm coagulation and is broken down to form SgI-29, an antimicrobial peptide [41].

Splice Variant Analysis

Our analysis revealed that pigs have the potential to make 89.9% of human transcript variants.
To our knowledge, there are no recent estimates of this conservation in pigs or mice. Early estimates of the conservation of splice variants between human and mouse vary widely, from 50–70% [42, 43]. An early EST-based study estimated that around 70% of splice variants were conserved between humans and pigs [44]. We found several instances where automatic annotation led to exon omission. For example, the pig NCBI locus for myelin basic protein (MBP) has one transcript, corresponding to human and mouse isoform 1, but does not include the 3 exons required to make pig splice variants of human and mouse oligodendrocyte lineage (Golli)-MBP isoforms 1 and 2 [45]. These proteins are represented in the pig TSA archive (Golli-MBP isoform 1; HDB76269.1, HDA86907.1) and are partially represented in Ensembl build 11.1 (Golli-MBP isoform 1, ENSSSCP00000055770) and MARC build 1.0 (Golli-MBP isoform 1, ENSSSCP00070023358). The function and expression of Golli-MBPs are distinct from those of MBP, and Golli-MBPs are important for myelin repair [46].

Alternative splicing of genes is a significant source of diversity in the genome. Predicting the number of alternative splice variants has proven to be somewhat difficult. In a previous analysis, Ensembl failed to predict 14% and 20% of validated splice variants in the human and mouse genomes, respectively [47]. In the current versions of the pig genomes, the number of pig splice variants is undercounted for several reasons. As previously mentioned, NCBI does not predict splice variants for low-quality proteins (2,807 proteins are annotated as low quality in the NCBI build of 11.1). This set contains a large number of high molecular weight proteins, and their contribution to the undercount will be disproportionate because of the large number of nucleotides and potential splice variants they possess.
Irrespective of these errors, the algorithms used to predict splice variants and the number of exons seem to yield vastly different results among the platforms (NCBI versus Ensembl) and species (pigs versus humans or mice). For example, the human MBD1 gene in NCBI currently has 20 exons that can be rearranged to form 165 transcript variants. In Ensembl, the human gene has 28 transcript variants. The pig MBD1 gene has 26 exons and 50 predicted splice variants in the NCBI build 11.1, 4 predicted splice variants in Ensembl build 11.1 and 6 in MARC build 1.0. Both the Ensembl build 11.1 and MARC 1.0 loci fail to predict the longest protein-coding transcript. The human PTK2 gene has at least 162 transcript variants; predicted pig splice variants in the NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 number 40, 8 and 10, respectively. There are numerous other examples of this, but they are beyond the scope of the current manuscript.

Splice variants can give rise to identical proteins or can lead to distinct isoforms. Although it has been suggested that most proteins have a single isoform [48] and that many splice variants are not translated into proteins, there are multiple splice variants for 72% of annotated human genes and 205,000 transcripts had protein-coding potential (> 10 transcripts per gene). Knowledge of the predicted or actual functional consequences of each splice variant is incomplete, even for human and mouse genes. A full discussion of this is beyond the scope of the current manuscript. Instead, we provide a few examples of differential splice variants or isoforms where there is comparative, functional data. Pigs and humans can make all 17 transcript variants/isoforms of the T cell transcription factor TCF7L2, which is involved with antiviral responses [18].
In contrast, pigs can only make 1 out of 5 isoforms of the three prime repair exonuclease 1 (TREX1) and 1 out of 4 isoforms of interferon regulatory factor 9 (IRF9), proteins involved in antiviral immune responses [19]. Humans express 3 isoforms of interleukin 22 receptor, alpha 2 (IL22RA2) that differ in expression and/or function [49]. Pigs and mice lack exon 3, which gives rise to IL22RA2 isoform 1 in humans, and can only form the soluble form (isoform 2) [50]. Alternate splice forms of LY96 (MD-2) that inhibit signaling have been identified in both mice and humans, but these alternate splice forms arise from different splicing events. The mouse protein isoform MD-2B, formed by a 54 base pair deletion at the 5’ end of exon 3, inhibits TLR4 activation by LPS [51]. The human isoform 2 (MD-2s), formed by skipping exon 2, also inhibits LPS signaling through TLR4 and is not found in mice [52]. We predict that isoform 2 can be formed in pigs but found no evidence for isoform 2 expression in the EST and TSA archives. NCBI predicted that isoform 2 occurs in various canids (Canis lupus familiaris, Canis lupus dingo, Vulpes lagopus) and pinnipeds (Neomonachus schauinslandi, Mirounga angustirostris, Mirounga leonina, Halichoerus grypus).

Conclusions

In this manuscript, we analyzed Ensembl and NCBI builds 11.1 and MARC 1.0 with a large sequence library of manually assembled and annotated RNA and protein sequences, in order to better annotate the pig genome and discover systematic sources of errors. These sources include frequently occurring indels in genes encoding large proteins, which alter the predicted protein size or the designation of a gene as protein-coding. This leads to failure to properly assemble large genes, e.g., the mucin and protocadherin genes. Additional errors include selenoprotein genes being assigned a premature stop codon and endogenous retroviral sequences being annotated as protein-coding genes.
We identified several hundred putative pig protein-coding genes in the process. We analyzed the conservation of pig and human 5’ and 3’ UTR RNA regions and RNA splice variants. We assembled a partial, but nonredundant and highly annotated, pig RNAome and proteome and used it to identify 1:1 mouse and/or human orthologs. We compared the 1:1 orthologs, or proteins with shared or non-shared functional domains, for all 3 species to determine functional enrichment. The results are summarized in Table 7. These data overwhelmingly support the relevance and importance of the pig as a biomedical research model for humans. Our analysis also highlights areas where mice may be a better model and areas where both of these species are likely to be of limited use. For example, although we have made a strong case for the use of the pig as a biomedical model, particularly in the area of nutrition and immunity, our DAVID REACTOME analysis of genes with low 3’ UTR conservation between human and pig showed enrichment in genes related to metabolism and the immune system. The 3’ UTR of mRNA can contain binding sites for regulatory RNA and proteins.

Table 7. Summation of Results

In addition to these comparisons, we provide the first description of, and evidence for, over 100 potential porcine readthrough genes, the first formal identification (to our knowledge) of pig Golli-MBPs and a complete, comparative analysis of pig, mouse and human VMRs. One of the strengths of our approach is that we identify pig transcript variants and protein isoforms based upon their orthology to human counterparts. This would make the genome annotation process better align with the human genome and make potential functions of known human transcripts translatable to the pig. The need to better align the nomenclature of species used in biomedical research with that of their human counterparts has recently been emphasized [44].
One of the weaknesses of our approach is that we did not attempt to identify pig-specific transcripts. Furthermore, we did not characterize mouse transcripts. Another weakness is that, for protein functional domain analysis, we used results from the NCBI BLAST search. While this program identifies macrodomains, it cannot map certain fine structures of proteins as we have done in our previous analysis [25]. There are currently more than 31 sequenced pig genomes that are publicly available. The annotation states of these are highly variable. Development of more accurate, artificial intelligence-based annotation software is urgently needed. We propose using our templates and nomenclature system to train such software for use in any current or future generation of pig genomes.

Materials and Methods

One-to-one orthology was determined for pig-mouse, pig-human or mouse-human genes as previously described [53]. Briefly, reciprocal cross-BLASTing of pig (Sus scrofa) sequence sources in GenBank (non-redundant, high throughput genomic sequence, whole genome shotgun contig sequences (WGS), transcriptome shotgun assembly (TSA) and expressed sequence tag (EST)) was performed using discontiguous Megablast (default settings, word size = 11), using reference sequence accession numbers of human or mouse genes/proteins of interest. Ensembl build 11.1 (release 111 - Jan 2024) and Ensembl MARC build 1.0 (release 111 - Jan 2024) were searched using the default settings. When 1:1 orthology could not be established for pig genes based upon protein homology, the RNA was used. If orthology could not be determined by protein and RNA homology, the relative chromosomal location was used to determine orthology [23]. The human gene symbol was assigned for pig orthologs whenever appropriate. For pig-specific paralogs, the gene symbol assigned was based upon comparative homology to the human gene, following the convention of the human gene family.
The gene symbol was then terminated with an asterisk (*). For example, pig-specific paralogs of human SLC7A3 were assigned SLC7A3L1*, SLC7A3L2*, SLC7A3L3*, etc., with SLC7A3L1* being the closest in homology. Splice variant/exon conservation of the pig gene was determined relative to the human reference transcript. Pig-specific transcript variants were not determined. Predicted mRNAs were analyzed for errors (ambiguous nucleotides, gene duplication artifacts, mis-assemblies, mis-annotations). Whenever an ambiguous nucleotide was assigned by NCBI, we used BLAST to search the sequence against the WGS, TSA and EST databases to obtain a consensus sequence. Predicted mRNAs were translated into proteins using the ExPASy translate tool (http://web.expasy.org/translate/). The size (in amino acids) of the major protein isoforms was used as a checksum to determine whether the respective genome assembly was correct.

Potential porcine readthrough genes were identified by comparison to human and mouse readthrough genes. A predictive scoring system was developed based on whether the transcript exists in other species and whether there was evidence that pigs can make the respective transcript (sequence found in the TSA, EST, NCBI build 11.1, Ensembl build 11.1 or MARC 1.0 databases). Pig-specific readthroughs were distinguished from chimeric artifacts by determining whether the transcript appears in other species and whether there was support for the transcript (previously sequenced RNA). Where possible, consensus sequences were numerically annotated with the base pair positions aligning to the beginning and end of the human reference transcript. Conservation of the 5’ and 3’ regions of the predicted pig mRNA was then determined relative to the human reference transcript. Non-coding RNA will not be discussed here, with the exception of small nucleolar RNAs (snoRNAs) and small Cajal body-specific RNAs (ScaRNAs).
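The protein-size checksum described above can be illustrated with a short Python sketch. This is not the authors' workflow (they used the ExPASy translate tool); it is a minimal example under stated assumptions. The function names `translate` and `length_check`, the `sec_codons` parameter and the toy sequence are all hypothetical. The sketch translates a predicted coding sequence with the standard genetic code, stops at the first in-frame stop codon, marks codons containing an ambiguous base as X, and compares the resulting length with the expected size of the reference isoform. As an optional extra motivated by the selenoprotein discussion, in-frame TGA codons at listed positions are read as selenocysteine (U) instead of stop.

```python
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, sec_codons=frozenset()):
    """Translate a coding sequence from its first base to the first stop codon.

    cds        -- DNA or RNA string that starts at the first codon (hypothetical input)
    sec_codons -- zero-based codon indexes where TGA is read as Sec ('U');
                  this parameter is an assumption made for illustration
    Codons containing an ambiguous base (for example N) are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        codon = cds[start:start + 3]
        if codon == "TGA" and start // 3 in sec_codons:
            protein.append("U")  # recoded stop: selenocysteine
            continue
        residue = CODON_TABLE.get(codon, "X")
        if residue == "*":
            break  # first in-frame stop ends the protein
        protein.append(residue)
    return "".join(protein)


def length_check(cds, expected_aa, sec_codons=frozenset()):
    """Compare the translated length with the expected isoform length."""
    observed = len(translate(cds, sec_codons))
    return {
        "observed_aa": observed,
        "expected_aa": expected_aa,
        "matches": observed == expected_aa,
    }


# Toy coding sequence (not a real gene): ATG GCT TGA AAA TAA
toy_cds = "ATGGCTTGAAAATAA"
print(translate(toy_cds))             # MA   (TGA read as stop)
print(translate(toy_cds, {2}))        # MAUK (TGA at codon index 2 read as Sec)
print(length_check(toy_cds, 4, {2}))  # observed 4, expected 4, matches True
```

A length mismatch of this kind only flags a locus for inspection. It cannot by itself distinguish a misassembly or an inserted indel from a genuine species difference, which is why the manual comparison against WGS, TSA and EST evidence remains necessary.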
We determined 1:1:1 pig-mouse-human orthology for 12,720, 12,887 and 12,770 protein-coding genes in pigs, mice and humans, respectively (total 38,377). We excluded olfactory receptors (ORs), T and B cell receptors (TCR and BCR), and MHC class I and II proteins from our analysis and discussion because determining 1:1 orthology for these genes is difficult. The human and mouse genome nomenclature committees have adopted different conventions for assigning nomenclature, and assigning orthology is not straightforward (it cannot easily be determined by reciprocal cross-BLASTing and/or chromosomal location). These genes will be described in separate manuscripts. A similar situation exists with regard to TRs. To determine structural orthology of TRs, we conducted phylogenetic tree analysis using the Geneious Prime program (Geneious Pro v 20231.2) and the Jukes–Cantor algorithm. These unambiguous protein orthologs (from 3 species) were used for Venn analysis using Venny 2.1 (http://bioinfogp.cnb.csic.es/tools/venny/index.html). One-to-one pig/human, pig/mouse or mouse/human orthologs were analyzed using the DAVID (https://davidbioinformatics.nih.gov/), REACTOME (https://reactome.org/) and GO Direct BP (https://geneontology.org/) databases to determine functional enrichment. Alternatively, we used our highly annotated database to determine whether the protein was part of the immunome [13] or involved in nutrition and/or metabolism [53]. Differentially encoded genes were also compared using Ingenuity Pathway Analysis (IPA) software (QIAGEN Bioinformatics, Redwood City, CA). Conservation of protein functional domains was determined by comparing the BLAST graphic summary of the longest, comparable protein isoform sequences. Proteins with non-shared or shared domains were analyzed by DAVID to determine functional enrichment.
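The three-species Venn analysis described above amounts to partitioning gene symbols into the seven regions of a three-set diagram. The sketch below shows that partition with plain Python sets. It is an illustration only, not the Venny 2.1 implementation: the function name `venn3_regions`, the region labels and the placeholder gene symbols (GENE_A to GENE_D) are hypothetical, and it assumes that orthologs already share one symbol across species, as in the nomenclature system described in this section.

```python
def venn3_regions(pig, mouse, human):
    """Partition gene symbols into the seven regions of a three-set Venn diagram."""
    pig, mouse, human = set(pig), set(mouse), set(human)
    return {
        "pig_mouse_human": pig & mouse & human,
        "pig_human_only": (pig & human) - mouse,
        "mouse_human_only": (mouse & human) - pig,
        "pig_mouse_only": (pig & mouse) - human,
        "pig_only": pig - mouse - human,
        "mouse_only": mouse - pig - human,
        "human_only": human - pig - mouse,
    }


# Placeholder symbols, not real genes.
pig_genes = {"GENE_A", "GENE_B", "GENE_D"}
mouse_genes = {"GENE_A", "GENE_C", "GENE_D"}
human_genes = {"GENE_A", "GENE_B", "GENE_C"}

regions = venn3_regions(pig_genes, mouse_genes, human_genes)
for name, members in regions.items():
    print(f"{name}: {len(members)} {sorted(members)}")
```

Comparing the sizes of the `pig_human_only` and `mouse_human_only` regions gives a simple summary of which species more often retains a human gene that the other lacks. The counts are only as reliable as the underlying orthology calls.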
Abbreviations

AA: Amino Acid
BCR: B Cell receptor
EST: Expressed sequence tag
HAVANA: Human And Vertebrate Analysis and Annotation
KEGG: Kyoto Encyclopedia of Genes and Genomes
KRTAP: keratin associated proteins
LCE: late cornified envelope
miRNA: microRNA
ncRNA: Non-coding RNA
NMD: Non-sense mediated decay
OR: Olfactory Receptor
ScaRNA: Small Cajal body-specific RNA
SEP: sORF-encoded protein
SnoRNAs: Small nucleolar RNA
sORFs: small open reading frame
SRA: Short Read Archive Database
tblastn: translated BLAST
TSA: Transcriptome Shotgun Assembly Sequence Database
TR: Taste Receptor
UTR: Untranslated region
VMR: Vomeronasal Receptors
WGS: Whole genome shotgun contig sequences

Declarations

Ethics approval and consent to participate
Not Applicable

Consent for publication
All authors have given their consent to publish.

Availability of data and materials
All data generated or analyzed during this study are included in this published article [and its supplementary information files]. Complete data (pig RNA and protein sequences and their respective annotations) for these analyses can be found in our online database (http://tinyurl.com/hxxq3ur).

Competing interests
The authors do not have any competing interests.

Funding
This work was supported by USDA ARS projects 8040-51000-058 and 8042-32000-117.

Authors' contributions
HD, JR conducted the analysis of the data. HD and CC maintain the online version of the database. HD, JL, JR, CC and AS wrote the manuscript and participated in the editing of the manuscript.

References

1. Mudge JM, Harrow J (2015) Creating reference gene annotation for the mouse C57BL6/J genome assembly. Mammalian genome: official J Int Mammalian Genome Soc 26(9–10):366–378
2. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A et al (2022) The complete sequence of a human genome. Science 376(6588):44–53
3. Liu J, Li Q, Hu Y, Yu Y, Zheng K, Li D, Qin L, Yu X (2024) The complete telomere-to-telomere sequence of a mouse genome. Science 386(6726):1141–1146
4. Li M, Chen L, Tian S, Lin Y, Tang Q, Zhou X, Li D, Yeung CKL, Che T, Jin L et al (2017) Comprehensive variation discovery and recovery of missing sequence in the pig genome using multiple de novo assemblies. Genome Res 27(5):865–874
5. Gilbert DG (2019) Genes of the pig, Sus scrofa, reconstructed with EvidentialGene. PeerJ 7:e6374
6. Summers KM, Bush SJ, Wu C, Su AI, Muriuki C, Clark EL, Finlayson HA, Eory L, Waddell LA, Talbot R et al (2019) Functional Annotation of the Transcriptome of the Pig, Sus scrofa, Based Upon Network Analysis of an RNAseq Transcriptional Atlas. Front Genet 10:1355
7. Beiki H, Liu H, Huang J, Manchanda N, Nonneman D, Smith TPL, Reecy JM, Tuggle CK (2019) Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data. BMC Genomics 20(1):344
8. Warr A, Affara N, Aken B, Beiki H, Bickhart DM, Billis K, Chow W, Eory L, Finlayson HA, Flicek P et al (2020) An improved pig reference genome sequence to enable pig genetics and genomics research. GigaScience 9(6)
9. van der Hee B, Madsen O, Vervoort J, Smidt H, Wells JM (2020) Congruence of Transcription Programs in Adult Stem Cell-Derived Jejunum Organoids and Original Tissue During Long-Term Culture. Front Cell Dev Biol 8:375
10. Yu W, Moninger TO, Thurman AL, Xie Y, Jain A, Zarei K, Powers LS, Pezzulo AA, Stoltz DA, Welsh MJ (2022) Cellular and molecular architecture of submucosal glands in wild-type and cystic fibrosis pigs. Proc Natl Acad Sci USA 119(4)
11. Dawson HD (2011) A comparative assessment of the pig, mouse and human genomes. The Minipig in Biomedical Research. CRC, Boca Raton, FL, pp 323–342
12. Groenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, Rogel-Gaillard C, Park C, Milan D, Megens HJ et al (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491(7424):393–398
13. Dawson HD, Loveland JE, Pascal G, Gilbert JG, Uenishi H, Mann KM, Sang Y, Zhang J, Carvalho-Silva D, Hunt T et al (2013) Structural and functional annotation of the porcine immunome. BMC Genomics 14:332
14. Doncheva NT, Palasca O, Yarani R, Litman T, Anthon C, Groenen MAM, Stadler PF, Pociot F, Jensen LJ, Gorodkin J (2021) Human pathways in animal models: possibilities and limitations. Nucleic Acids Res 49(4):1859–1871
15. Triant DA, Walsh AT, Hartley GA, Petry B, Stegemiller MR, Nelson BM, McKendrick MM, Fuller EP, Cockett NE, Koltes JE et al (2023) AgAnimalGenomes: browsers for viewing and manually annotating farm animal genomes. Mammalian genome: official J Int Mammalian Genome Soc 34(3):418–436
16. Grzybowska EA (2012) Human intronless genes: functional groups, associated diseases, evolution, and mRNA processing in absence of splicing. Biochem Biophys Res Commun 424(1):1–6
17. Jorquera R, Gonzalez C, Clausen P, Petersen B, Holmes DS (2018) Improved ontology for eukaryotic single-exon coding sequences in biological databases. Database: J Biol databases curation 2018:1–6
18. Griffiths DJ (2001) Endogenous retroviruses in the human genome sequence. Genome Biol 2(6):REVIEWS1017
19. Leong AZ, Lee PY, Mohtar MA, Syafruddin SE, Pung YF, Low TY (2022) Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J Biomed Sci 29(1):19
20. Leblanc S, Yala F, Provencher N, Lucier JF, Levesque M, Lapointe X, Jacques JF, Fournier I, Salzet M, Ouangraoua A et al (2024) OpenProt 2.0 builds a path to the functional characterization of alternative proteins. Nucleic Acids Res 52(D1):D522–D528
21. DiSepio D, Ghosn C, Eckert RL, Deucher A, Robinson N, Duvic M, Chandraratna RA, Nagpal S (1998) Identification and characterization of a retinoid-induced class II tumor suppressor/growth regulatory gene. Proc Natl Acad Sci USA 95(25):14811–14815
22. Uyama T, Jin XH, Tsuboi K, Tonai T, Ueda N (2009) Characterization of the human tumor suppressors TIG3 and HRASLS2 as phospholipid-metabolizing enzymes. Biochim Biophys Acta 1791(12):1114–1124
23. Dawson HD, Sang Y, Lunney JK (2020) Porcine cytokines, chemokines and growth factors: 2019 update. Res Vet Sci 131:266–300
24. Dawson HD, Lunney JK (2018) Porcine cluster of differentiation (CD) markers 2018 update. Res Vet Sci 118:199–246
25. Dawson HD, Smith AD, Chen C, Urban JF Jr. (2017) An in-depth comparison of the porcine, murine and human inflammasomes; lessons from the porcine genome and transcriptome. Vet Microbiol 202:2–15
26. Litman T, Stein WD (2023) Ancient lineages of the keratin-associated protein (KRTAP) genes and their co-option in the evolution of the hair follicle. BMC Ecol Evol 23(1):7
27. Wu DD, Irwin DM, Zhang YP (2008) Molecular evolution of the keratin associated protein gene family in mammals, role in the evolution of mammalian hair. BMC Evol Biol 8:241
28. Khan I, Maldonado E, Vasconcelos V, O'Brien SJ, Johnson WE, Antunes A (2014) Mammalian keratin associated proteins (KRTAPs) subgenomes: disentangling hair diversity and adaptation to terrestrial and aquatic environments. BMC Genomics 15(1):779
29. Rodriguez I, Del Punta K, Rothman A, Ishii T, Mombaerts P (2002) Multiple new and isolated families within the mouse superfamily of V1r vomeronasal receptors. Nat Neurosci 5(2):134–140
30. Labunskyy VM, Hatfield DL, Gladyshev VN (2014) Selenoproteins: molecular pathways and physiological roles. Physiol Rev 94(3):739–777
31. Lang T, Pelaseyed T (2022) Discovery of a MUC3B gene reconstructs the membrane mucin gene cluster on human chromosome 7. PLoS ONE 17(10):e0275671
32. Shigenari A, Ando A, Renard C, Chardon P, Shiina T, Kulski JK, Yasue H, Inoko H (2004) Nucleotide sequencing analysis of the swine 433-kb genomic segment located between the non-classical and classical SLA class I gene clusters. Immunogenetics 55(10):695–705
33. Blanco-Arias P, Sargent CA, Affara NA (2004) A comparative analysis of the pig, mouse, and human PCDHX genes. Mammalian genome: official J Int Mammalian Genome Soc 15(4):296–306
34. Pancho A, Aerts T, Mitsogiannis MD, Seuntjens E (2020) Protocadherins at the Crossroad of Signaling Pathways. Front Mol Neurosci 13:117
35. Liu D, Xia J, Yang Z, Zhao X, Li J, Hao W, Yang X (2021) Identification of Chimeric RNAs in Pig Skeletal Muscle and Transcriptomic Analysis of Chimeric RNA TNNI2-ACTA1 V1. Front Vet Sci 8:742593
36. Prakash T, Sharma VK, Adati N, Ozawa R, Kumar N, Nishida Y, Fujikake T, Takeda T, Taylor TD (2010) Expression of conjoined genes: another mechanism for gene regulation in eukaryotes. PLoS ONE 5(10):e13284
37. Frankish A, Carbonell-Sala S, Diekhans M, Jungreis I, Loveland JE, Mudge JM, Sisu C, Wright JC, Arnan C, Barnes I et al (2023) GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res 51(D1):D942–D949
38. Jonakova V, Kraus M, Veselsky L, Cechova D, Bezouska K, Ticha M (1998) Spermadhesins of the AQN and AWN families, DQH sperm surface protein and HNK protein in the heparin-binding fraction of boar seminal plasma. J Reprod Fertil 114(1):25–34
39. Perez-Patino C, Parrilla I, Li J, Barranco I, Martinez EA, Rodriguez-Martinez H, Roca J (2019) The Proteome of Pig Spermatozoa Is Remodeled During Ejaculation. Mol Cell proteomics: MCP 18(1):41–50
40. Xu Y, Han Q, Ma C, Wang Y, Zhang P, Li C, Cheng X, Xu H (2021) Comparative Proteomics and Phosphoproteomics Analysis Reveal the Possible Breed Difference in Yorkshire and Duroc Boar Spermatozoa. Front Cell Dev Biol 9:652809
41. Zhao H, Lee WH, Shen JH, Li H, Zhang Y (2008) Identification of novel semenogelin I-derived antimicrobial peptide from liquefied human seminal plasma. Peptides 29(4):505–511
42. Modrek B, Lee CJ (2003) Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 34(2):177–180
43. Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet 12(11):1313–1320
44. Nygard AB, Cirera S, Gilchrist MJ, Gorodkin J, Jorgensen CB, Fredholm M (2010) A study of alternative splicing in the pig. BMC Res Notes 3:123
45. Campagnoni AT, Pribyl TM, Campagnoni CW, Kampf K, Amur-Umarjee S, Landry CF, Handley VW, Newman SL, Garbay B, Kitamura K (1993) Structure and developmental regulation of Golli-mbp, a 105-kilobase gene that encompasses the myelin basic protein gene and is expressed in cells in the oligodendrocyte lineage in the brain. J Biol Chem 268(7):4930–4938
46. Siu CR, Balsor JL, Jones DG, Murphy KM (2015) Classic and Golli Myelin Basic Protein have distinct developmental trajectories in human visual cortex. Front Neurosci 9:138
47. Tapial J, Ha KCH, Sterne-Weiler T, Gohr A, Braunschweig U, Hermoso-Pulido A, Quesnel-Vallieres M, Permanyer J, Sodaei R, Marquez Y et al (2017) An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms. Genome Res 27(10):1759–1768
48. Gonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A (2013) Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol 14(7):R70
49. Gomez-Fernandez P, Urtasun A, Paton AW, Paton JC, Borrego F, Dersh D, Argon Y, Alloza I, Vandenbroeck K (2018) Long Interleukin-22 Binding Protein Isoform-1 Is an Intracellular Activator of the Unfolded Protein Response. Front Immunol 9:2934
50. Weiss B, Wolk K, Grunberg BH, Volk HD, Sterry W, Asadullah K, Sabat R (2004) Cloning of murine IL-22 receptor alpha 2 and comparison with its human counterpart. Genes Immun 5(5):330–336
51. Ohta S, Bahrun U, Tanaka M, Kimoto M (2004) Identification of a novel isoform of MD-2 that downregulates lipopolysaccharide signaling. Biochem Biophys Res Commun 323(3):1103–1108
52. Gray P, Michelsen KS, Sirois CM, Lowe E, Shimada K, Crother TR, Chen S, Brikos C, Bulut Y, Latz E et al (2010) Identification of a novel human MD-2 splice variant that negatively regulates Lipopolysaccharide-induced TLR4 signaling. J Immunol 184(11):6359–6366
53. Dawson HD, Chen C, Gaynor B, Shao J, Urban JF Jr. (2017) The porcine translational research database: a manually curated, genomics and proteomics-based research resource. BMC Genomics 18(1):643

Tables

Tables 1 to 7 are available in the Supplementary Files section.

Additional Declarations

The authors declare no competing interests.

Supplementary Files

Table1SSummaryofGeneOmissionErrors042625.xlsx: Summary of Gene Omission Errors
Table2SAnalysisofProteinsofExtremeSize042125.xlsx: Analysis of Proteins of Extreme Size
Table3SStatusofKRTAPandLCEGenesinthe3PigGenomes042625.xlsx: Status of KRTAP and LCE Genes in the 3 Pig Genomes
Table4SSummaryofBuildErrorsinSelenoproteinGenes425.xlsx: Summary of Build Errors in Selenoprotein Genes
Table5S.StatusofMucinGenesinthe3PigGenomes0425.xlsx: Status of Mucin Genes in the 3 Pig Genomes
Table6SStatusofProtocadherinGenesinthe3PigGenomes425.xlsx: Status of Protocadherin Genes in the 3 Pig Genomes
Table7SStatusofReadthroughGenesinthe3PigGenomes042225.xlsx: Status of Readthrough Genes in the 3 Pig Genomes
Table8SMicroproteins42225.xlsx: Microproteins
Table9SPigMouseandHumanGeneandTotalProteinComparisons041825.xlsx: Pig, Mouse and Human Gene and Total Protein Comparisons
Table10S.AnalysesofStructuralDomainComparisons042625.xlsx: Analyses of Structural Domain Comparisons
Table11S.ComparisonofGeneswithLow5andor3ConservationBetweenHumansandPigs042525.xlsx: Comparison of Genes with Low 5' and/or 3' Conservation Between Humans and Pigs
Table12S.SummaryofConservedSwineHumanSpliceVariant.042225.xlsx: Summary of Conserved Swine/Human Splice Variants
ListofSupplementaryTablesLegends.docx
Tables.docx
Figure 1. Greater Pig-Human Similarity of Genes Revealed by Analysis of Non-conserved Protein-coding Genes. Venn diagrams were prepared using Venny 2.1 (http://bioinfogp.cnb.csic.es/tools/venny/index.html). When a gene is missing from one of the 3 genomes, pigs are 5.2 X more likely to have the human gene than mice.

Figure 2. Analysis of Pig-Human (A) or Mouse-Human (B) Conserved Pathways. Ingenuity Pathway Analysis functional enrichment revealed 32 and 29 enriched pathways for A) pig-human or B) mouse-human orthologous genes, respectively.

Figure 3. Corresponding Protein Isoforms with Non-conserved FN3 Functional Domains. Two-dimensional analyses were generated using NCBI Blast. Results showed greater Pig Human similarity occurs for A) IL2B, B) IL27RA and C) EBI3 proteins. Greater Mouse Human similarity occurs for D) PTPRJ and E) COL20A1 proteins.
Greater Pig-Mouse similarity occurs for \u003cstrong\u003eF\u003c/strong\u003e) L1CAM protein.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/80d1be9929a2376eb436fc5f.png"},{"id":84396968,"identity":"50c90930-4be4-48de-b2b1-b6dd47df93b6","added_by":"auto","created_at":"2025-06-11 12:44:05","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1385091,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/d73aa7cb-a253-4931-ba54-62152d2a535c.pdf"},{"id":84395501,"identity":"ecf53dd2-4c9f-40d5-9231-3b6325e3228c","added_by":"auto","created_at":"2025-06-11 12:28:04","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":33798,"visible":true,"origin":"","legend":"\u003cp\u003eSummary of Gene Omission Errors\u003c/p\u003e","description":"","filename":"Table1SSummaryofGeneOmissionErrors042625.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/7457c003e1831cffcdd4da8e.xlsx"},{"id":84394718,"identity":"77334e98-3a98-41cf-b094-029fa062533e","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":46895,"visible":true,"origin":"","legend":"\u003cp\u003e\u0026nbsp;Analysis of Proteins of Extreme Size\u0026nbsp;\u003c/p\u003e","description":"","filename":"Table2SAnalysisofProteinsofExtremeSize042125.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/ad6530e79cc72de6557d9356.xlsx"},{"id":84395504,"identity":"06bd2696-79b0-433e-96b4-b19cfda36350","added_by":"auto","created_at":"2025-06-11 12:28:04","extension":"xlsx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":31908,"visible":true,"origin":"","legend":"\u003cp\u003eStatus of KRTAP and LCE Genes in the 3 Pig 
Genomes\u003c/p\u003e","description":"","filename":"Table3SStatusofKRTAPandLCEGenesinthe3PigGenomes042625.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/fbaab1b89af4f87c18fac228.xlsx"},{"id":84394723,"identity":"2c899514-88e0-40ba-b43e-d1d9eb27c82a","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":14146,"visible":true,"origin":"","legend":"\u003cp\u003eSummary of Build Errors in Selenoprotein Genes\u003c/p\u003e","description":"","filename":"Table4SSummaryofBuildErrorsinSelenoproteinGenes425.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/64b284c5fb2d4ca30b29b781.xlsx"},{"id":84394722,"identity":"f8be82a1-474e-4f4b-96ec-2ad1adc47f16","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":19509,"visible":true,"origin":"","legend":"\u003cp\u003eStatus of Mucin Genes in the 3 Pig Genomes\u0026nbsp;\u003c/p\u003e","description":"","filename":"Table5S.StatusofMucinGenesinthe3PigGenomes0425.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/b22d2d28fe0781e0a471b42d.xlsx"},{"id":84395505,"identity":"59a00091-cfaa-4786-8f99-28ab9a4e5969","added_by":"auto","created_at":"2025-06-11 12:28:04","extension":"xlsx","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":20100,"visible":true,"origin":"","legend":"\u003cp\u003eStatus of Protocadherin Genes in the 3 Pig Genomes\u0026nbsp;\u003c/p\u003e","description":"","filename":"Table6SStatusofProtocadherinGenesinthe3PigGenomes425.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/12d7da341b6abf434c86c197.xlsx"},{"id":84395507,"identity":"0cdb3881-3ba3-4e15-b28c-8b54e085d7a7","added_by":"auto","created_at":"2025-06-11 
12:28:04","extension":"xlsx","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":25522,"visible":true,"origin":"","legend":"\u003cp\u003eStatus of Readthrough Genes in the 3 Pig Genomes\u003c/p\u003e","description":"","filename":"Table7SStatusofReadthroughGenesinthe3PigGenomes042225.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/ada47259e7e876037fafdacd.xlsx"},{"id":84395509,"identity":"2e66aec3-c567-472a-a61e-304e4c8a508c","added_by":"auto","created_at":"2025-06-11 12:28:04","extension":"xlsx","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":20013,"visible":true,"origin":"","legend":"\u003cp\u003eMicroproteins\u0026nbsp;\u003c/p\u003e","description":"","filename":"Table8SMicroproteins42225.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/2399a5ba26b9f50b0dc6a0f6.xlsx"},{"id":84395510,"identity":"7c2dbe3f-1274-4cf7-a474-efaf63e0e905","added_by":"auto","created_at":"2025-06-11 12:28:04","extension":"xlsx","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":962067,"visible":true,"origin":"","legend":"\u003cp\u003ePig, Mouse and Human Gene and Total Protein Comparisons\u003c/p\u003e","description":"","filename":"Table9SPigMouseandHumanGeneandTotalProteinComparisons041825.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/7151c0a26cd2f0d988f0a26e.xlsx"},{"id":84394733,"identity":"120f3f2e-b5d4-4143-ab2e-7bbc634f4d16","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":138459,"visible":true,"origin":"","legend":"\u003cp\u003e\u0026nbsp;Analyses of Structural Domain 
Comparisons\u003c/p\u003e","description":"","filename":"Table10S.AnalysesofStructuralDomainComparisons042625.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/13a0d77b6b760cc3dfe2d199.xlsx"},{"id":84394730,"identity":"c2cb818d-b1ea-4e08-89ce-e34996e66d45","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":11,"title":"","display":"","copyAsset":false,"role":"supplement","size":72595,"visible":true,"origin":"","legend":"\u003cp\u003e\u0026nbsp;Comparison of Genes with Low 5' and or 3 Conservation Between Humans and Pigs\u003c/p\u003e","description":"","filename":"Table11S.ComparisonofGeneswithLow5andor3ConservationBetweenHumansandPigs042525.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/1099b6051e7119f7c9f732de.xlsx"},{"id":84394742,"identity":"d11e5099-a069-498d-aea4-e11d4f9e06e5","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"xlsx","order_by":12,"title":"","display":"","copyAsset":false,"role":"supplement","size":149005,"visible":true,"origin":"","legend":"\u003cp\u003eSummary of Conserved Swine/Human Splice Variants\u003c/p\u003e","description":"","filename":"Table12S.SummaryofConservedSwineHumanSpliceVariant.042225.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/463f49987bf130092b2e5053.xlsx"},{"id":84394741,"identity":"8d43c39c-2a62-4186-9c59-3cb45c8f4f89","added_by":"auto","created_at":"2025-06-11 12:20:04","extension":"docx","order_by":13,"title":"","display":"","copyAsset":false,"role":"supplement","size":14093,"visible":true,"origin":"","legend":"","description":"","filename":"ListofSupplementaryTablesLegends.docx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/f092044938429457ae3f734b.docx"},{"id":84394743,"identity":"e4484a5a-d994-41a2-a1ae-880daab5174d","added_by":"auto","created_at":"2025-06-11 
12:20:05","extension":"docx","order_by":14,"title":"","display":"","copyAsset":false,"role":"supplement","size":135497,"visible":true,"origin":"","legend":"","description":"","filename":"Tables.docx","url":"https://assets-eu.researchsquare.com/files/rs-6856588/v1/d33ea6c03707f990c8cdd06c.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eVerification and Comparison of Pig, Mouse, and Human Genome Similarities: Use of Manual Assembly and Analyses\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe human and mouse genomes have undergone automatic and extensive manual annotation by the Human And Vertebrate Analysis and Annotation (HAVANA) group [\u003cspan class=\"CitationRef\"\u003e1\u003c/span\u003e]. Recently, the development of telomere to telomere sequencing has led to the complete sequencing of the human and mouse genomes [\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e3\u003c/span\u003e]. However, there is evidence that the human and mouse genome assemblies are still flawed and could benefit from reassembly; at least 92 human and 22 mouse genes have their NCBI Annotation category listed as \u0026ldquo;suggests misassembly\u0026rdquo;. In contrast, only 21 pig genes have their NCBI Annotation category listed as \u0026ldquo;suggests misassembly\u0026rdquo;. There have been numerous attempts to improve the annotation and assembly of the porcine genome [\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
Despite these efforts, several recent analyses indicate that a significant amount of work remains before a \u0026ldquo;finished\u0026rdquo; pig genome is obtained [\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e10\u003c/span\u003e], particularly with respect to manual annotation [\u003cspan class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eRecent work by our group and others overwhelmingly suggests that pigs and humans exhibit greater genome similarity at the macro level and share more genes [\u003cspan class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e13\u003c/span\u003e], and our preliminary analysis indicated that pigs and humans have greater conservation of protein functional domains [\u003cspan class=\"CitationRef\"\u003e11\u003c/span\u003e], than do humans and mice. A recent study concluded that humans and mice share more Kyoto Encyclopedia of Genes and Genomes (KEGG)-related pathways than do humans and pigs [\u003cspan class=\"CitationRef\"\u003e14\u003c/span\u003e]; however, the authors of that study speculated that their conclusion is likely to be affected by the incomplete status of the annotations in the porcine genome.\u003c/p\u003e\n\u003cp\u003eThe various pig genomes annotated by NCBI and Ensembl (Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e) have a wide range of predicted genes (46573\u0026ndash;152168), predicted protein coding genes (19974\u0026ndash;22125), predicted transcripts (56900\u0026ndash;78200) and predicted transcripts/gene (0.39\u0026ndash;2.92). It is not logical to assume that there is this much natural variation among pig breeds. Rather, this variation is likely intrinsic to the pipelines and due to the variable amounts of updates and patches applied to each. 
This assertion is supported by the observation that, using the same sequence source (Duroc build 11.1), NCBI predicts 6.8x and 2.9x more pseudogenes and transcripts per gene, respectively, than Ensembl. Furthermore, a recent paper using automated analysis has noted the incongruity of the annotation of protein coding porcine genes in the NCBI and Ensembl builds of 11.1 with 2119 and 3371 discordant genes [\u003cspan class=\"CitationRef\"\u003e15\u003c/span\u003e]. Potential sources of this discrepancy were not identified. These data make it difficult to assess the actual number of genes and transcripts in machine-annotated genomes. Furthermore, any cross-species functional comparison using pigs would be compromised by these discrepancies.\u003c/p\u003e\n\u003cp\u003ePreviously we discovered several sources of systematic errors in the earlier pig reference genome (Ensembl and NCBI builds 10.2) prediction or annotation pipelines (selenoproteins, taste receptor (TR) genes, intronless genes, artifactually duplicated genes) by using a manually annotated set of sequences. Herein, we extend our analysis to Ensembl and NCBI builds 11.1 and MARC 1.0, with a much larger sequence library of RNA and protein sequences, in order to uncover potential systematic errors. We used this library to determine the conservation of pig and human 5\u0026rsquo; and 3\u0026rsquo;UTR (untranslated regions) RNA regions and RNA splice variants. We then used this nonredundant and highly annotated pig proteome to identify 1\u0026ndash;1 mouse and/or human orthologs. We compared the 1\u0026ndash;1 orthologs for all 3 species, to determine functional enrichment. A similar analysis was performed on proteins with shared or non-shared functional domains.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eA. Error Analysis and Categorization\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e1. 
Protein Coding Genes\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eIn a comparison of a subset of the 18405 protein coding genes, we determined that the percentage of correctly assembled and annotated genes in the NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 to be 58.9%, 51.7% and 47.1%, respectively (\u003cstrong\u003eTable 2\u003c/strong\u003e). The sources of errors were varied. The most frequent broadly defined error category, error in annotated locus, occurred in 24.9, 42.6, and 43.8%, of NCBI build 11.1, Ensembl build 11.1 and MARC build 1, respectively. Because of the higher error rate of MARC 1.0, we examined a larger, randomly chosen, set in the Ensembl assembly of MARC 1.0 (data not shown). This larger search of MARC 1.0 revealed a significant number of missing genes (306 of which 261 are protein coding) compared to NCBI (28) and Ensembl build 11.1 (36) (\u003cstrong\u003eTable 1S\u003c/strong\u003e). The missing genes span approximately 99.2 Mb of the genome and involved significant segments of porcine chromosomes 1 (41 genes), 2 (41 genes) and 13 (27 genes). This missing gene rate is much higher than expected. These areas of MARC 1.0 should be targeted for resequencing in any future builds. Our analysis also discovered 565 protein coding genes that are not annotated in MARC 1.0. If these results were extended to the whole genome, over 1,100 protein-coding genes would not be annotated and 500 would be missing. We also examined 500 genes from the newly sequenced Ossabaw genome (build 1.0 deposited in Ensembl) and found the error rate was similar to that of MARC build 1.0 (data not shown).\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e2. Proteins of Extreme Size/Indel Analysis\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe analyzed 350 proteins of extreme size (\u0026gt;2,000 amino acids (AA)) (\u003cstrong\u003eTable 2S\u003c/strong\u003e). 
Our error analysis revealed that the percentage of correctly assembled genes for these proteins in NCBI build 11.1, Ensembl build 11.1 and Ensembl MARC build 1.0 is, respectively, 53.1%, 27.4% and 23.4%. This analysis also showed that the most serious source of error prohibiting correct assembly and annotation in the Ensembl and NCBI builds of 11.1 was the presence of an indel every 12,465 bp. This is rather surprising since the coverage of the genome averaged 65x [8]. Searching NCBI with the term \u0026ldquo;low quality protein\u0026rdquo; yields 2807 porcine protein coding genes that are affected by this error. This number of proteins is inflated because some of these low-quality proteins are pseudogenes; however, the number is likely to be much higher in Ensembl build 11.1 because of the presence of pre-genome, annotated reference sequences in NCBI build 11. NCBI usually fills the indel with ambiguous nucleotides (N) as a placeholder and annotates the product as a correctly sized, but low-quality, protein. Ensembl does not do this. As a result, a great number of truncated or elongated proteins arise in the Ensembl assembly because the algorithm appears to search for the next best splice site. These genes may still be useful for RNASeq if the stringency of matching is lowered; however, this raises the risk of erroneously mapping high-similarity sequences. 
As expected, every protein-coding gene we evaluated and identified in NCBI build 11.1 as low quality (1437) was also incorrect in Ensembl build 11.1; however, only 28.7% of these proteins (412) were correctly assembled in MARC, reflecting the overall fundamental error in the gene assembling algorithm, particularly with genes that have a high number of exons.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThis error also limits the accuracy of splice variant determination, as NCBI does not annotate predicted splice variants of low-quality proteins, and of functional domain analysis, as many of the misassembled proteins are missing one or more functional domains. Last, although the filling of indels with ambiguous nucleotides by NCBI benefits the analysis of genes affected by a real indel, NCBI also inserts an ambiguous nucleotide(s) and annotates pseudogenes as low-quality proteins in the presence of a predicted stop codon. We found that 218/1582 (13.8%) of NCBI-annotated pig low-quality genes were actually pseudogenes.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e3. Intronless Genes\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eIntronless genes constitute a significant number of genes with errors. Intronless genes make up approximately 3% of the human genome [16]. These genes can be divided into 2 categories: genes that consist of a single exon (true intronless genes) and genes whose protein-coding region spans a single exon but that are interrupted by introns in the UTRs. Estimates of the number of human single exon coding region genes approach 2000 [17]. The number of intronless genes in the pig is likely to be much higher because of the large number of Olfactory Receptor (OR) genes. Previous analysis of the intronless genes in humans and mice revealed that automatic annotation of these genes is problematic.\u0026nbsp;Two related protein superfamilies, keratin associated proteins (KRTAP) and late cornified envelope (LCE) proteins, are overrepresented in intronless genes. 
Other prominent classes of protein coding genes overrepresented in intronless genes are G-protein coupled receptors. Subclasses of genes found in intronless G-protein-coupled receptors include\u0026nbsp;vomeronasal receptors (VMRs) and TRs.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003ea.\u0026nbsp;\u003c/em\u003e\u003cem\u003eKeratin Associated Proteins (KRTAP) and Late Cornified Envelope (LCE) proteins\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eOur study determined that humans have 124 (107 genes, 17 pseudogenes), pigs have 125 (110 genes and 15 pseudogenes) and mice have 187 (141 genes and 46 pseudogenes) \u003cem\u003eKRTAP\u003c/em\u003e genes (\u003cstrong\u003eTables 3 and 3S\u003c/strong\u003e). The vast majority of the porcine genes we identified are not annotated genes in NCBI 11.1 (80 missing), Ensembl 11.1 (64 missing) or MARC 1.0 (68 missing) genomes (Table 3S). Furthermore, they have very limited sequence homology, so assigning 1:1 orthology is difficult. Our study also determined that humans have 19, pigs have 15 and mice have 21 LCE proteins. Like the KRTAP genes, many of the 15 porcine LCE genes we identified are not annotated genes in NCBI 11.1 (10 missing), Ensembl 11.1 (7 missing) or MARC 1.0 (8 missing) genomes.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eb. Vomeronasal Receptors (VMRs)\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe found that pigs (14) and humans (5) have a significantly smaller number of VMRs (VMN1+VMN2) compared to mice (225). The larger number of VMR genes in pigs relative to humans is because, the pig VN1R4 gene has diverged into 12 paralogs (VN1R4, VN1R4L1, VN1R4L2, LOC110261363, LOC110261366, LOC110261370, LOC110261364, LOC102167894, LOC100520313, LOC100738896, LOC106510602) and 2 pseudogenes (VN1R4Ps1, VN1R4Ps2). We identified one intact pig-mouse VMN1 ortholog (VMN1R233). \u0026nbsp;VMN2R1, a rodent-specific\u0026nbsp;VMN2\u0026nbsp;gene, is an expressed pseudogene in pigs.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e4. 
Endogenous Retroviral Sequences (ERVs)\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eERVs comprise 8% of the human genome [18]; however, few are translated into functional proteins. We discovered more than 500 endogenous retroviral sequences in Ensembl build 11.1 that are annotated as protein coding (data not shown). The majority of these are whole or fragmentary parts of retroviral endonucleases/reverse transcriptases and are likely to be artifacts. These errors significantly inflate the number of pig proteins, especially those that have been deemed to be pig specific. The vast majority of these are filtered out of the human and mouse Ensembl genome builds (mice have 3). Only 3 bona fide, intact endonuclease/reverse transcriptases are found in the pig (ABR01162.1, 1272 AA), human (AL50637.1, 1275 AA) and mouse (AAC72793.1, 1281 AA) genomes.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e5. Selenoproteins\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eIn NCBI build 11.1, only one selenoprotein is incorrectly assembled; however, in Ensembl build 11.1 and MARC build 1.0 of the porcine genome, 11 out of 25 (44%) and 18 out of 25 (72%), respectively, of selenoprotein genes are assigned a premature stop codon or have additional errors (\u003cstrong\u003eTables 3 and 4S\u003c/strong\u003e). All human and mouse proteins are correctly assembled and annotated in NCBI and Ensembl.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e6. Mucins\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe identified 20 mucin genes in pigs. Of these, 33.3%, 14.3% and 19.0% are properly assembled and annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively, and only two genes,\u0026nbsp;CD164 and MUC15,\u0026nbsp;were assembled properly in all three builds. There is, however, little overlap in this gene set (\u003cstrong\u003eTables 3 and 5S\u003c/strong\u003e), with a variety of assembly errors throughout all three builds.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e7. 
Protocadherins\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eAmong the 3 species, we identified 80 protocadherin genes. Sixty-seven protein-coding protocadherin genes exist in pigs. Of these, 22.4%, 11.9% and 9.0% are properly assembled and annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively; however, there is little overlap in this gene set (\u003cstrong\u003eTables 3 and 6S\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e8. Readthrough, fusion or conjoined genes\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eVia manual annotation, we determined that pigs can likely make 103 readthrough genes (\u003cstrong\u003eTable 3 and\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e7S\u003c/strong\u003e). We developed a scoring system to assess the confidence of our predictions. The categories, from lowest to highest confidence, are: 0 = no transcript could be predicted, human, primate or mouse-specific gene; 1 = Predicted transcript but no evidence of existence in other species; 2 = Predicted transcript, limited evidence of existence (2 species or fewer); 3 = Predicted transcript, limited evidence of existence (2 species or fewer), transcription demonstrated in pig; and 4 = Predicted transcript, evidence of existence (2 species or more), transcription demonstrated in pig. The numbers of pig readthrough genes are 7, 30, 42 and 24 for categories 1, 2, 3 and 4, respectively. Ninety-two of these correspond to human or primate genes and many have orthology to transcripts in other mammalian species. We determined that there are only three pig genes annotated as readthrough genes in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, combined.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e9. Microproteins\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eMicroproteins and small open reading frame (sORF)-encoded proteins (SEPs) are small (\u0026lt;100 AA) proteins containing a single domain. 
Standard genome annotation pipelines routinely miss these [19]. They can exist as distinct genes or arise as a result of a shift in the open reading frame during translation. More than 1,000 of these have been identified in mice and humans, using the OpenProt 2.0 database [20]. Sheep and cow proteins, but not pig proteins, are indexed in this database. We identified 153 human microproteins via a literature search, as these genes are sometimes not annotated as protein coding genes in the OpenProt or NCBI-annotated genome. We discovered 115 pig orthologs or paralogs of these genes by translated BLAST (tblastn) of the human protein sequence to the pig genome. The corresponding DNA sequence was translated, and the putative DNA or protein sequences were used to search whether these proteins were annotated as such in the 3 porcine genomes. Data are presented in\u0026nbsp;\u003cstrong\u003eTables 3 and 8S\u003c/strong\u003e.\u0026nbsp;Only 49.1%, 51.7% and 47.4% of pig microprotein RNAs or proteins are properly assembled and/or annotated in NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0, respectively.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eB. Protein Orthology and Functional Domain Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e1. Protein Orthology Analysis\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe determined orthology for 47879 (15613 pig, 16496 mouse, and 15770 human) protein coding genes (\u003cstrong\u003eTable 9S and Figure 1\u003c/strong\u003e).\u0026nbsp;NCBI lists 20790, 22192 and 20080 protein coding genes in pigs, mice, and humans, respectively; therefore, our analysis is estimated to encompass 75.1% of pig, 74.3% of mouse, and 78.5% of human proteins (a species average of 76.0%). We omitted OR, TCR, BCR, and MHC class I and II proteins from our analysis. 
Our species average estimate of coverage of the proteome excluding these groups is likely to exceed 80%.\u0026nbsp;Our analysis shows that when\u0026nbsp;a gene is missing from one of the three genomes, pigs are 5.0 X more likely to have the human gene than mice. Mice had 2.2 and 2.5 X the number of unique proteins compared to humans and pigs, respectively. These unique proteins, and their pseudogenes, were categorized by Superfamily and appear in \u003cstrong\u003eTable 4\u003c/strong\u003e. As expected, mice exhibited the largest number of Superfamily member expansions.\u003c/p\u003e\n\u003cp\u003eWe conducted DAVID data mining of differentially encoded proteins to determine pathway enrichment. We found 8 REACTOME pathways (\u003cstrong\u003eTables 5\u003c/strong\u003e \u003cstrong\u003eand 10S\u003c/strong\u003e) that were enriched after correction for multiple comparison, for the pig-human comparison; R-HSA-212436~Generic Transcription Pathway (3.0 fold, p = 5.27E-24), R-HSA-6805567~Keratinization (7.1 fold, p =\u0026nbsp;5.67E-22),\u0026nbsp;R-HSA-73857~RNA Polymerase II Transcription (2.7 fold, p = 3.61E-21), \u0026nbsp;R-HSA-74160~Gene expression (Transcription) (2.4 fold, p = 8.95E-18),\u0026nbsp;R-HSA-1461957~Beta defensins (13.2 fold, p = 4.91E-12),\u0026nbsp;R-HSA-6803157~Antimicrobial peptides (7.7 fold, p = 5.09E-11),\u0026nbsp;R-HSA-1461973~Defensins (10.8 fold, p = 1.24E-10) and\u0026nbsp;R-HSA-168249~Innate Immune System (1.5 fold, p = 3.93E-02).\u0026nbsp;We found no significant pathway enrichment for the mouse-human REACTOME comparisons. 
\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe found 3 conserved human-pig KEGG pathways (\u003cstrong\u003eTables 5\u003c/strong\u003e \u003cstrong\u003eand 10S\u003c/strong\u003e) that were enriched after correction for multiple comparisons: hsa04613:Neutrophil extracellular trap formation (4.4 fold, p = 3.15E-03), hsa04061:Viral protein interaction with cytokine and cytokine receptor (5.5 fold, p = 2.32E-02), and hsa05322:Systemic lupus erythematosus (4.3 fold, p = 3.42E-02). We also found 3 conserved human-mouse KEGG pathways (\u003cstrong\u003eTables 5 and 10S\u003c/strong\u003e) that were enriched after correction for multiple comparisons: hsa00982:Drug metabolism - cytochrome P450 (12.7 fold, p = 3.81E-02), hsa00980:Metabolism of xenobiotics by cytochrome P450 (11.7 fold, p = 3.81E-02), and hsa00983:Drug metabolism - other enzymes (11.5 fold, p = 3.81E-02).\u003c/p\u003e\n\u003cp\u003eWe conducted GO BP DIRECT mining of differentially encoded proteins to determine ontology enrichment. We found 13 GO terms that were enriched after correction for multiple comparisons for the pig-human comparison, although many more terms had p values \u0026lt; 0.05 before correction (\u003cstrong\u003eTable 5\u003c/strong\u003e): 
GO:0006355~regulation of DNA-templated transcription (4.7 fold, p = 6.68E-41), GO:0006357~regulation of transcription by RNA polymerase II (4.7 fold, p = 6.68E-41), GO:0042742~defense response to bacterium (3.5 fold, p = 6.68E-41), GO:0045087~innate immune response (2.6 fold, p = 1.13E-04), GO:0061844~antimicrobial humoral immune response mediated by antimicrobial peptide (4.6 fold, p = 9.36E-04), GO:0031424~keratinization (5.5 fold, p = 1.22E-02), GO:0019373~epoxygenase P450 pathway (12.4 fold, p = 1.22E-02), GO:0048006~antigen processing and presentation, endogenous lipid antigen via MHC class Ib (33.3 fold, p = 1.55E-02), GO:0006805~xenobiotic metabolic process (4.1 fold, p = 1.81E-02), GO:0050829~defense response to Gram-negative bacterium (4.4 fold, p = 1.81E-02), GO:0031640~killing of cells of another organism (4.4 fold, p = 3.48E-02), GO:0048007~antigen processing and presentation, exogenous lipid antigen via MHC class Ib (23.6 fold, p = 3.50E-02), and GO:0006955~immune response (2.1 fold, p = 4.73E-02). We found no significant pathway enrichment for the mouse-human GO BP DIRECT comparisons.\u003c/p\u003e\n\u003cp\u003eWe conducted Ingenuity Pathway Analysis (IPA)\u0026nbsp;of pig-human and mouse-human conserved genes and found\u0026nbsp;32 and 29 pathways, respectively. These are summarized in graphical form in \u003cstrong\u003eFigure 2\u003c/strong\u003e and in tabular form in \u003cstrong\u003eTable 10S\u003c/strong\u003e. In addition to the pathways discovered by REACTOME and GO BP DIRECT, genes involved in the IL-13 Signaling Pathway (p = 1.48E-07) and IL-17 Signaling Pathway (p = 1.43E-04), as well as Retinol Biosynthesis (p = 1.93E-02) and \u0026alpha;-tocopherol Degradation (p = 2.79E-04), were enriched in the pig-human dataset (\u003cstrong\u003eFigure 2A\u003c/strong\u003e). 
For the mouse-human dataset (\u003cstrong\u003eFigure 2B\u003c/strong\u003e), Granulocyte Adhesion and Diapedesis, the top canonical pathway (p = 9 2.27E-04), and Interleukin-10 signaling (p = 1.78E-02) were the sole immune-related pathways. There was no enrichment of nutrition-related pathways determined by IPA. Although there were similar numbers of pig-human and mouse-human conserved pathways, the number of genes per node and the statistical significance were lower for the mouse-human comparisons: 45 mouse-human pathways were significant at p \u0026lt; 0.01, whereas 35 pig-human pathways were significant at p \u0026lt; 0.01.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e2. Functional Domain Comparison\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe examined a randomly chosen subset of 3465 protein coding genes (10395 proteins overall) for preservation of protein Superfamily domains or other features (\u003cstrong\u003eTables 4 and 10S\u003c/strong\u003e). Six examples of these are shown in \u003cstrong\u003eFigure 3\u003c/strong\u003e. We identified 644 structural differences between 1737 proteins (shared between pig, mouse, and human). We then conducted DAVID data mining of differentially expressed functional domains to determine pathway enrichment (\u003cstrong\u003eTable 6\u003c/strong\u003e). We found 2 REACTOME pathways that were enriched after FDR correction for multiple comparisons for the pig-human comparison: R-HSA-1474244~Extracellular matrix organization (4.8 fold, p = 1.22E-04) and R-HSA-168256~Immune System (1.9 fold, p = 4.46E-03). 
We found 4 pathways that were significantly enriched in the pig-mouse comparison: R-HSA-168256~Immune System (2.3 fold, p = 1.35E-05), R-HSA-168898~Toll-like Receptor Cascades (7.7 fold, p = 1.15E-04), R-HSA-1280215~Cytokine Signaling in Immune system (3.2 fold, p = 3.76E-04) and R-HSA-449147~Signaling by Interleukins (3.5 fold, p = 1.17E-02). In contrast, we found no pathway enrichment for the mouse-human comparison.\u003c/p\u003e\n\u003cp\u003eSimilar results were obtained from mining GO BP DIRECT. No enrichment was found for the pig-human comparison, whereas 17 terms were significant in the pig-mouse comparison: GO:0045087~innate immune response (5.1 fold, p = 6.99E-07), GO:0006954~inflammatory response (5.9 fold, p = 6.99E-07), GO:0002224~toll-like receptor signaling pathway (28.3 fold, p = 7.27E-04), GO:0050729~positive regulation of inflammatory response (10.2 fold, p = 7.85E-04), GO:0043123~positive regulation of canonical NF-kappaB signal transduction (6.0 fold, p = 3.33E-03), GO:0032729~positive regulation of type II interferon production (11.4 fold, p = 6.32E-03), GO:0007157~heterophilic cell-cell adhesion via plasma membrane cell adhesion molecules (15.6 fold, p = 6.41E-03), GO:0051607~defense response to virus (5.3 fold, p = 1.61E-02), GO:0007155~cell adhesion (3.4 fold, p = 1.61E-02), GO:0032757~positive regulation of interleukin-8 production (11.8 fold, p = 1.69E-02), GO:0070555~response to interleukin-1 (17.7 fold, p = 1.69E-02), GO:1901224~positive regulation of non-canonical NF-kappaB signal transduction (11.4 fold, p = 1.69E-02), GO:0031297~replication fork processing (16.8 fold, p = 1.89E-02), GO:0051092~positive regulation of NF-kappaB transcription factor activity (7.4 fold, p = 2.94E-02), GO:0071260~cellular response to mechanical stimulus (9.5 fold, p = 3.11E-02), GO:0006979~response to oxidative stress (7.0 fold, p = 3.64E-02). With 2 exceptions, all of these overlapping pathways reflect 
genes involved in immunity and/or inflammation. No enrichment was observed for the mouse-human comparison.\u003c/p\u003e\n\u003cp\u003eWe evaluated whether the differences in functional domains occurred randomly or whether particular functional domains were more abundant in each species. Of the domains that appear more than 5 times in one species, we found that pigs had 6 (Atrophin-1, PHA03307, PHA03378, PRK03918, PRK07764, PTZ00121, PTZ0049) expanded functional domains, mice had 2 (Collagen, Herpes BLFF1, PRK03918) and humans had 5 (Atrophin-1, PHA03307, PHA03378, PTZ00121, PTZ0049) Superfamily domains that were increased or decreased by 25% from the other species. The meaning of this is not clear, as the vast majority of these structural domains do not have a defined function.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eC. 5\u0026rsquo; and 3\u0026rsquo; UTR and splice variant analyses.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e1. 5\u0026rsquo; and 3\u0026rsquo; UTR mRNA conservation\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eOur current analysis of 8151 mRNAs indicates that 1.2% (98) and 5.09% (415) of genes have low conservation of the 5\u0026rsquo; or 3\u0026rsquo; UTR, respectively, while 0.85% (69) of genes have combined low 5\u0026rsquo; and 3\u0026rsquo; conservation (\u003cstrong\u003eTable 11S\u003c/strong\u003e). DAVID REACTOME analysis indicated that the genes with low 3\u0026rsquo; conservation between human and pig were enriched in genes related to metabolism (19.7 fold) and the immune system (16.3 fold), although the relationship did not persist after correction for multiple comparisons. Our previous analysis indicates that conservation of mRNA sequences between pigs and humans averages 75%. 
A separate analysis indicated that the current version of Ensembl build 11.1 does not adequately capture the UTRs of genes [10].\u003c/p\u003e\n\u003cp\u003eNext, we determined the potential conservation of 14,122 human transcript variants (NM_ + XM_) in pigs (\u003cstrong\u003eTable 12S\u003c/strong\u003e). We found that pigs could make 89.7% (12,556) of 14,122 human transcript variants. We then examined the NCBI-annotated number of exons for 4,824 randomly selected pig and human genes. We discovered a serious under-annotation of exons in the pig genome, almost one exon fewer per gene, with a strong bias towards genes with a large number of exons.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe data presented here should help to obtain a finished version of the swine genome and enable improved biological insights. With our manually assembled and annotated non-redundant, 16,146 RNA and 15,613 pig protein sequence library, we assessed the assembly and annotation status of the 3 latest builds of the swine genome and compared them to the mouse and human genomes. Since we generated an extremely large amount of data, we will highlight the major and unique aspects of our analyses.\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eError Analysis and Categorization\u003c/h2\u003e \u003cp\u003eOur comparison of a subset of 6135 protein coding genes revealed that the percentage of correctly assembled and annotated genes in the NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 was 58.9%, 51.7% and 47.1%, respectively. The sources of errors were varied but could be systematized to a certain degree: an indel every 12 Kbp, intronless genes, failure to annotate genes for various reasons (intronless genes, mucins, protocadherins, readthrough genes), and annotation of endogenous retroviral sequences as protein coding genes. In addition, there are genes that are actually missing from one or more genomes. 
It is likely that the genome build 11.1 is over 95% complete with regard to sequence representation; however, MARC 1.0 may be slightly less complete. One gene, phospholipase A and acyltransferase 4 (PLAAT4), could not be assigned a chromosomal location because it is missing from all 3 genomes (NCBI build 11.1, Ensembl build 11.1 and MARC 1.0). It was also missing from several other porcine genomes (Berkshire_pig_v1, Hampshire_pig_v1, Large_White_v1, Pietrain_pig_v1, Tibetan_Pig_v2), but present in the Bamei_pig_v1, Jinhua_pig_v1, Meishan_pig_v1, Rongchang_pig_v1, and Wuzhishan minipig_v1.0 genomes. There are at least two possible explanations for this. The first is that this DNA region is difficult to sequence. The second, and more interesting, possibility is that the gene is genuinely present in some pig breeds and absent in others. The gene is also missing from cows and rodents but is present in other ungulates, such as the artiodactyl \u003cem\u003ePhacochoerus africanus\u003c/em\u003e and the perissodactyls \u003cem\u003eDiceros bicornis\u003c/em\u003e and \u003cem\u003eCeratotherium simum simum\u003c/em\u003e. The protein derived from this gene is a retinoic acid-induced negative regulator of cell proliferation with phospholipase A(1/2) activity [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eErrors due to Indels\u003c/h2\u003e \u003cp\u003eThe presence of an indel every 12 Kbp is a major flaw in the Ensembl and NCBI builds 11.1. It affects the correct assembly and annotation of thousands of genes and limits the accuracy of splice variant determination, as NCBI does not annotate predicted splice variants of low-quality proteins. 
Although the insertion of an indel by NCBI benefits the analysis of genes affected by a real indel, NCBI also inserts ambiguous nucleotide(s) and annotates pseudogenes as low-quality proteins in the presence of a predicted stop codon. We previously documented this error for type 1 IFNs [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], CD markers (BTLA, CD160) [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] and components of the inflammasome (NLRC4, NAIP, NLRP14) [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eKeratin Associated Proteins (KRTAP) and Late Cornified Envelope (LCE) proteins\u003c/h2\u003e \u003cp\u003eKeratin-associated protein genes are a family of intronless genes that encode type I keratins of the Keratin B2 Superfamily. Our study determined that humans have 124 (107 genes, 17 pseudogenes), pigs have 125 (110 genes and 15 pseudogenes) and mice have 187 (141 genes and 46 pseudogenes) KRTAP genes. A recent census of human genes discovered 93 protein coding KRTAP genes divided into 26 subfamilies [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. An expanded KRTAP gene repertoire was previously reported in rodents compared to humans [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. A preliminary characterization identified 121 KRTAP genes (102 genes and 19 pseudogenes) in pigs [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]; however, the full repertoire of the genes in pigs has not been reported for several reasons. 
The version of the pig genome used for that characterization (10.2) was missing the 5\u0026rsquo; and 3\u0026rsquo; regions flanking the KRTAP5 cluster [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. The vast majority of the porcine genes we identified are not annotated in the NCBI 11.1, Ensembl 11.1 or MARC genomes. Furthermore, they have very limited sequence homology, so assigning 1:1 orthology is difficult.\u003c/p\u003e \u003cp\u003eKRTAP proteins are involved in the formation of intermediate filaments that provide structural support to cells in the body, particularly in the skin and hair. These proteins play a critical role in maintaining the integrity of the skin and hair, and defects in KRTAP genes have been associated with various skin and hair disorders in humans. The number of KRTAP genes roughly corresponds to the fur coverage of each species [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eVomeronasal Receptors (VMRs)\u003c/h2\u003e \u003cp\u003eThe number of VMRs is proportional to each species\u0026rsquo; reliance on pheromone communication. We found a larger number (225) of mouse VMRs than previously reported [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]; however, some of these may be pseudogenes or may not be expressed.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eSelenoproteins\u003c/h2\u003e \u003cp\u003eThe reason for the discrepancies in the fidelity of selenoprotein assembly and analysis among the species is unknown; conceptually, however, it seems possible to automatically assemble and annotate porcine selenoproteins. 
Selenium, in the form of selenocysteine (Sec), is cotranslationally inserted into polypeptide chains in response to the UGA codon, whose normal function is to terminate translation [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. To decode UGA as Sec, organisms evolved the Sec insertion machinery that allows incorporation of this amino acid at specific UGA codons in a process requiring a cis-acting Sec insertion sequence (SECIS) element [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e].\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eMucins\u003c/h2\u003e \u003cp\u003eThe repetitive nature of sequences in mucin genes makes them difficult to assemble [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Of the 87 human genes that have their NCBI Annotation category listed as \u0026ldquo;suggests misassembly\u0026rdquo;, 8 (MUC1, MUC2, MUC3A, MUC4, MUC5AC, MUC6, MUC8, MUC16, MUC19) are mucins. The actual number of mucin genes in pigs has been difficult to determine. MUCL3 was previously reported to be a pseudogene in pigs [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] but instead is a misassembled gene and a truncated protein in all three builds. Conversely, MUC17 is a pseudogene in pigs, although a previous publication reported that this gene is expressed in pigs [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Some mucins provide the critical barrier between epithelial cells and the environment. Disorders of mucin synthesis can lead to human diseases such as bronchial asthma, ulcerative colitis/inflammatory bowel disease and cystic fibrosis. 
Having full-length pig sequences will aid in the characterization of these mucins in pig models of human disease.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eProtocadherins\u003c/h2\u003e \u003cp\u003eThe high error rate found for protocadherin assembly and annotation arises because the annotation process is not readily amenable to automation. Protocadherins are arranged in tandemly linked gene clusters in a manner similar to that of B-cell and T-cell receptor gene clusters. Virtually nothing is known about pig protocadherins; a single study compared pig, mouse, and human PCDH11X genes and found that all exons present in mouse and pig transcripts had homologous sequences in the human genome, but not all exons are represented in human transcripts [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. Protocadherins play a general role in cell adhesion and development. They are particularly important in the function of neurons. Disruption of protocadherin expression or function is thought to play a role in the development of certain human cancers and neurological and inflammatory diseases [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e].\u003c/p\u003e \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003eReadthrough Genes\u003c/h2\u003e \u003cp\u003eTo the best of our knowledge, only one pig readthrough gene (pig-specific), TNNI2-ACTA1, has been described in the literature [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Readthrough transcripts are RNA transcripts that are formed via exon splicing of more than one distinct gene. They are somewhat analogous to alternative splicing in terms of providing genome diversity, although their numbers are smaller. 
An early estimate indicated that there were 751 conjoined or read-through genes in the human genome that are supported by at least one mRNA or EST sequence [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. More recent GENCODE estimates list 650 and 230 for the human and mouse genomes, respectively [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e], but there is little overlap between the 2 species. Readthrough protein sequences can differ from their corresponding parent protein sequences due to frame shifting. In addition, a significant number of these are classified as ncRNA, and an even smaller subset has been designated as nonsense-mediated decay (NMD) candidates. Because of the recency of their discovery, the function of readthrough RNAs or proteins is unknown. Genes that are fusions of genes with critical functions in immunity (IFNAR2-IL10RB, TNFSF12-TNFSF13) or metabolism (RBP1-NMNAT3, NT5C1B-RDH14) are likely to have vastly different functions than either parent gene.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003eNotable Gene Superfamilies Missing in Pigs\u003c/h2\u003e \u003cp\u003eThe Semenogelin and Seminal Vesicle Secretory Proteins (SVG) Family consists of 2 family members in humans and 6 in mice. Although several catalog vendors claim to have antibodies that are cross-reactive to porcine semenogelin 1 (SEMG1) [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] and an early report describes the presence of a peptide (HNKQEGRDHD) corresponding to human SEMG1 [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e] in boar semen, we could not find any SVG family members in pigs. The protein is also missing from other determined proteomes of boar semen [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]. 
The loss of this protein family is not unique to pigs because we could not find any evidence of members of this protein family in mammalian orders outside of rodents and primates. The functional meaning of the absence of this protein family is unknown; however, SEMG1, the most abundant protein in human sperm, is essential for sperm coagulation and is broken down to form SgI-29, an antimicrobial peptide [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section3\"\u003e \u003ch2\u003eSplice Variant Analysis\u003c/h2\u003e \u003cp\u003eOur analysis revealed that pigs have the potential to make 89.9% of human transcript variants. To our knowledge, there are no recent estimates of this conservation in pigs or mice. Early estimates of conservation of splice variants between human and mouse vary widely, from 50\u0026ndash;70% [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]. An early study using ESTs [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e] estimated that around 70% of splice variants were conserved between humans and pigs. We found several instances where automatic annotation led to exon omission. For example, the pig NCBI locus for myelin basic protein (MBP) has one transcript, corresponding to human and mouse isoform 1, but does not include the 3 exons required to make pig splice variants of human and mouse oligodendrocyte lineage (Golli)-Mbp isoforms 1 and 2 [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. These proteins are represented in the pig TSA archive (Golli-MBP isoform 1; HDB76269.1, HDA86907.1) and are partially represented in Ensembl build 11.1 (Golli-MBP isoform 1, ENSSSCP00000055770) and MARC build 1.0 (Golli-MBP isoform 1, ENSSSCP00070023358). 
The function and expression of Golli-MBPs are distinct from those of MBP, and Golli-MBPs are important for myelin repair [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAlternate splicing of genes is a significant source of diversity in the genome. Predicting the number of alternative splice variants has proven to be somewhat difficult. In a previous analysis, Ensembl failed to predict 14 and 20% of validated splice variants in human and mouse genomes, respectively [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e]. In the current versions of the pig genomes, the number of pig splice variants is undercounted for several reasons. As previously mentioned, NCBI does not predict splice variants for low-quality proteins (2,807 proteins are annotated as low quality in the NCBI build 11.1). This set contains a large number of high molecular weight proteins, and their contributions to the undercount will be disproportionate because of the large number of nucleotides and potential splice variants they possess. Irrespective of these errors, the algorithms used to predict splice variants and number of exons seem to yield vastly different results among the platforms (NCBI versus Ensembl) and species (pigs versus humans or mice). For example, the human MBD1 gene in NCBI currently has 20 exons that can be rearranged to form 165 transcript variants. In Ensembl, the human gene has 28 transcript variants. The pig MBD1 gene has 26 exons and 50 predicted splice variants in the NCBI build 11.1, 4 predicted splice variants in Ensembl build 11.1 and 6 in MARC build 1.0. Both the Ensembl build 11.1 and MARC 1.0 loci fail to predict the longest protein coding transcript. The human PTK2 gene has at least 162 transcript variants; predicted pig splice variants in the NCBI build 11.1, Ensembl build 11.1 and MARC build 1.0 number 40, 8 and 10, respectively. 
There are numerous other examples of this, but they are beyond the scope of the current manuscript.\u003c/p\u003e \u003cp\u003eSplice variants can give rise to identical proteins or can lead to distinct isoforms. Although it has been suggested that most proteins have a single isoform [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e] and that many splice variants are not translated into proteins, there are multiple splice variants for 72% of annotated human genes and 205,000 transcripts have protein-coding potential (\u0026gt;\u0026thinsp;10 transcripts per gene). Knowledge of the predicted or actual functional consequences of each splice variant is incomplete, even for human and mouse genes. A full discussion of this is beyond the scope of the current manuscript. Instead, we will provide a few examples of differential splice variants or isoforms where there is comparative, functional data.\u003c/p\u003e \u003cp\u003ePigs and humans can make all 17 transcript variants/isoforms of the T cell transcription factor, TCF7L2, involved with antiviral responses [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. In contrast, pigs can only make 1 out of 5 isoforms of the three prime repair exonuclease 1 (TREX1) and 1 out of 4 isoforms of interferon regulatory factor 9 (IRF9), proteins involved in antiviral immune responses [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. Humans express 3 isoforms of interleukin 22 receptor, alpha 2 (IL22RA2) that differ in expression and/or function [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]. Pigs and mice lack exon 3, which gives rise to IL22RA2 isoform 1 in humans, and can only form the soluble form (isoform 2) [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e]. 
Alternate splice forms of LY96 (MD-2) that inhibit signaling have been identified in both mice and humans, but these alternate splice forms arise from different splicing events. The mouse protein isoform MD-2B, formed by a 54 base pair deletion at the 5\u0026rsquo; end of exon 3, inhibits TLR4 activation by LPS [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e]. The human isoform 2 (MD-2s), formed by skipping exon 2, also inhibits LPS signaling through TLR4 and is not found in mice [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e]. We predict that isoform 2 can be formed in pigs but found no evidence for isoform 2 expression in the EST and TSA archives. NCBI predicted that isoform 2 occurs in various Canids (\u003cem\u003eCanis lupus familiaris\u003c/em\u003e, \u003cem\u003eCanis lupus dingo\u003c/em\u003e, \u003cem\u003eVulpes lagopus\u003c/em\u003e) and Pinnipeds (\u003cem\u003eNeomonachus schauinslandi\u003c/em\u003e, \u003cem\u003eMirounga angustirostris\u003c/em\u003e, \u003cem\u003eMirounga leonina\u003c/em\u003e, \u003cem\u003eHalichoerus grypus\u003c/em\u003e).\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eIn this manuscript, we analyzed Ensembl and NCBI builds 11.1 and MARC 1.0 with a large sequence library of manually assembled and annotated RNA and protein sequences in order to better annotate the pig genome and discover systematic sources of errors. These sources include a frequently occurring indel in large proteins that alters the predicted size or designation of a gene as protein-coding. This leads to failure to properly assemble large genes, e.g., the mucin and protocadherin genes. Additional errors include selenoprotein genes being assigned a premature stop codon, and endogenous retroviral sequences being annotated as protein coding genes. 
We identified several hundred pig putative protein coding genes in the process.\u003c/p\u003e \u003cp\u003eWe analyzed the conservation of pig and human 5\u0026rsquo; and 3\u0026rsquo; UTR RNA regions and RNA splice variants. We assembled a partial, but nonredundant and highly annotated, pig RNAome and proteome and used it to identify 1\u0026ndash;1 mouse and/or human orthologs. We compared the 1\u0026ndash;1 orthologs or proteins with shared or non-shared functional domains, for all 3 species, to determine functional enrichment. The results are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e. These data overwhelmingly support the relevance and importance of the pig as a biomedical research model for humans. Our analysis also highlights areas where mice may be a better model and areas where both of these species are likely to be of limited use. For example, although we have made a strong case for the use of the pig as a biomedical model, particularly in the area of nutrition and immunity, our DAVID REACTOME analysis of genes with low 3\u0026rsquo; conservation between human and pig showed enrichment in genes related to metabolism and the immune system. 
The 3\u0026rsquo; UTR of mRNA can contain binding sites for regulatory RNA and proteins.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummation of Results\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eIn addition to these comparisons, we provide the first description and evidence for over 100 potential porcine readthrough genes, the first formal identification (to our knowledge) of pig Golli-MBPs and a complete, comparative analysis of pig, mouse and human VMRs.\u003c/p\u003e \u003cp\u003eOne of the strengths of our approach is that we identify pig transcript variants and protein isoforms based upon their orthology to human counterparts. This would make the genome annotation process better align with the human genome and make potential functions of known human transcripts translatable to the pig. The need to better align the nomenclature of species used in biomedical research to their human counterparts has recently been emphasized [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. One of the weaknesses of our approach is that we did not attempt to identify pig-specific transcripts. Furthermore, we did not characterize mouse transcripts. Another weakness is that for protein functional domain analysis, we used results from the NCBI BLAST search. While this program identifies macrodomains, it cannot map certain fine structures of proteins as we have done in our previous analysis [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThere are currently more than 31 sequenced pig genomes that are publicly available. The annotation states of these are highly variable. 
Development of more accurate, artificial intelligence-based annotation software is urgently needed. We propose using our templates and nomenclature system to train such software for use in any current or future generation of pig genomes.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003eOne-to-one orthology was determined for pig-mouse, pig-human or mouse-human genes as previously described [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e]. Briefly, reciprocal cross-BLASTing of pig (Sus scrofa) sequence sources in Genbank (non-redundant, high throughput genomic sequence, whole genome shotgun contig sequences (WGS), transcriptome shotgun assembly (TSA) and expressed sequence tag (EST)) was performed using discontiguous Megablast (default settings, word size\u0026thinsp;=\u0026thinsp;11), using reference sequence accession numbers for human or mouse genes/proteins of interest. Ensembl build 11.1 (release 111 - Jan 2024) and Ensembl MARC build 1.0 (release 111 - Jan 2024) were searched using the default settings. When 1\u0026ndash;1 orthology could not be established for pig genes based upon protein homology, the RNA was used. If orthology could not be determined by protein and RNA homology, the relative chromosomal location was used to determine orthology [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. The human gene symbol was assigned for pig orthologs whenever appropriate. For pig-specific paralogs, the gene symbol assigned was based upon comparative homology to the human gene, following the convention of the human gene family. The gene symbol was then terminated with an asterisk (*). For example, pig-specific paralogs of human SLC7A3 were assigned SLC7A3L1*, SLC7A3L2*, SLC7A3L3*, etc., with SLC7A3L1* being the closest in homology. Splice variant/exon conservation of the pig gene was determined relative to the human reference transcript. 
Pig-specific transcript variants were not determined.\u003c/p\u003e \u003cp\u003ePredicted mRNAs were analyzed for errors (ambiguous nucleotides, gene duplication artifacts, mis-assemblies, mis-annotations). Whenever an ambiguous nucleotide was assigned by NCBI, we blasted the sequence against the WGS, TSA and EST databases to obtain a consensus sequence. Predicted mRNAs were translated into proteins using the ExPASy translate tool (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://web.expasy.org/translate/\u003c/span\u003e\u003cspan address=\"http://web.expasy.org/translate/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The size (in amino acids) of the major protein isoforms was used as a checksum to determine whether the respective genome assembly was correct.\u003c/p\u003e \u003cp\u003ePotential porcine readthrough genes were identified by comparison to human and mouse readthrough genes. A predictive scoring system was developed based on whether the transcript exists in other species and whether there was evidence that pigs can make the respective transcript (sequence found in the TSA, EST, NCBI and Ensembl build 11.1 and MARC 1.0 databases). Pig-specific readthroughs were distinguished from chimeric artifacts by determining whether the transcript appears in other species and whether there was support for the transcript (previously sequenced RNA). Where possible, consensus sequences were numerically annotated with base pair positions aligning to the beginning and end of the human reference transcript. Conservation of the 5\u0026rsquo; and 3\u0026rsquo; predicted pig mRNA was then determined relative to the human reference transcript. 
Non-coding RNA will not be discussed here, with the exception of small nucleolar RNAs (snoRNAs) and small Cajal body-specific RNAs (ScaRNAs).\u003c/p\u003e \u003cp\u003eWe determined 1 to 1 to 1 pig-mouse-human orthology for 12,720, 12,887 and 12,770 protein coding genes in pigs, mice and humans, respectively (total 38,377). We excluded olfactory receptors (ORs), T and B cell receptors (TCR and BCR), and MHC class I and II proteins from our analysis and discussion because determining 1:1 orthology for these genes is difficult. The human and mouse genome nomenclature committees have adopted different conventions for assigning nomenclature, and assigning orthology is not straightforward (it cannot easily be determined by reciprocal cross-BLASTing and/or chromosomal location). These genes will be described in separate manuscripts. A similar situation exists with regard to taste receptors (TRs). To determine structural orthology of TRs, we conducted phylogenetic tree analysis using the Geneious Prime program (Geneious Pro v 20231.2) and the Jukes\u0026ndash;Cantor algorithm. These unambiguous protein orthologs (from 3 species) were used for Venn analysis using Venny 2.1 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://bioinfogp.cnb.csic.es/tools/venny/index.html\u003c/span\u003e\u003cspan address=\"http://bioinfogp.cnb.csic.es/tools/venny/index.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). 
One-to-one pig/human, pig/mouse or mouse/human orthologs were queried against the DAVID (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://davidbioinformatics.nih.gov/\u003c/span\u003e\u003cspan address=\"https://davidbioinformatics.nih.gov/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), REACTOME (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://reactome.org/\u003c/span\u003e\u003cspan address=\"https://reactome.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and GO BP Direct (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://geneontology.org/\u003c/span\u003e\u003cspan address=\"https://geneontology.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) databases to determine functional enrichment. Alternatively, we used our highly annotated database to determine whether the protein was part of the immunome [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] or involved in nutrition and/or metabolism [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e]. Differentially encoded genes were also compared using Ingenuity Pathway Analysis (IPA) software (QIAGEN Bioinformatics, Redwood City, CA). Conservation of protein functional domains was determined by comparing the BLAST graphic summary of the longest, comparable protein isoform sequences. 
Proteins with non-shared or shared domains were analyzed by DAVID to determine functional enrichment.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAA Amino acid\u003c/p\u003e \u003cp\u003eBCR B cell receptor\u003c/p\u003e \u003cp\u003eEST Expressed sequence tag\u003c/p\u003e \u003cp\u003eHAVANA Human And Vertebrate Analysis and Annotation\u003c/p\u003e \u003cp\u003eKEGG Kyoto Encyclopedia of Genes and Genomes\u003c/p\u003e \u003cp\u003eKRTAP Keratin-associated protein\u003c/p\u003e \u003cp\u003eLCE Late cornified envelope\u003c/p\u003e \u003cp\u003emiRNA microRNA\u003c/p\u003e \u003cp\u003encRNA Non-coding RNA\u003c/p\u003e \u003cp\u003eNMD Nonsense-mediated decay\u003c/p\u003e \u003cp\u003eOR Olfactory receptor\u003c/p\u003e \u003cp\u003eScaRNA Small Cajal body-specific RNA\u003c/p\u003e \u003cp\u003eSEP sORF-encoded protein\u003c/p\u003e \u003cp\u003esnoRNA Small nucleolar RNA\u003c/p\u003e \u003cp\u003esORF Small open reading frame\u003c/p\u003e \u003cp\u003eSRA Short Read Archive Database\u003c/p\u003e \u003cp\u003etblastn translated BLAST\u003c/p\u003e \u003cp\u003eTSA Transcriptome Shotgun Assembly Sequence Database\u003c/p\u003e \u003cp\u003eTR Taste receptor\u003c/p\u003e \u003cp\u003eUTR Untranslated region\u003c/p\u003e \u003cp\u003eVMR Vomeronasal receptor\u003c/p\u003e \u003cp\u003eWGS Whole genome shotgun contig sequences\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e\u003cem\u003eEthics approval and consent to participate\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot Applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eConsent for publication\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors have given their consent to publish.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAvailability of data and materials\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data generated or analyzed during this study 
are included in this published article [and its supplementary information files]. Complete data (pig RNA and protein sequences and their respective annotations) for these analyses can be found in our online database (http://tinyurl.com/hxxq3ur).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eCompeting interests\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors do not have any competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eFunding\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by USDA ARS projects 8040-51000-058 and 8042-32000-117.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAuthors\u0026apos; contributions\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHD, JR conducted the analysis of the data. HD and CC maintain the online version of the database. HD, JL, JR, CC and AS wrote the manuscript and participated in the editing of the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMudge JM, Harrow J (2015) Creating reference gene annotation for the mouse C57BL6/J genome assembly. Mammalian genome: official J Int Mammalian Genome Soc 26(9\u0026ndash;10):366\u0026ndash;378\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A et al (2022) The complete sequence of a human genome. Science 376(6588):44\u0026ndash;53\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu J, Li Q, Hu Y, Yu Y, Zheng K, Li D, Qin L, Yu X (2024) The complete telomere-to-telomere sequence of a mouse genome. 
Science 386(6726):1141\u0026ndash;1146\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi M, Chen L, Tian S, Lin Y, Tang Q, Zhou X, Li D, Yeung CKL, Che T, Jin L et al (2017) Comprehensive variation discovery and recovery of missing sequence in the pig genome using multiple de novo assemblies. Genome Res 27(5):865\u0026ndash;874\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGilbert DG (2019) Genes of the pig, Sus scrofa, reconstructed with EvidentialGene. PeerJ 7:e6374\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSummers KM, Bush SJ, Wu C, Su AI, Muriuki C, Clark EL, Finlayson HA, Eory L, Waddell LA, Talbot R et al (2019) Functional Annotation of the Transcriptome of the Pig, Sus scrofa, Based Upon Network Analysis of an RNAseq Transcriptional Atlas. Front Genet 10:1355\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBeiki H, Liu H, Huang J, Manchanda N, Nonneman D, Smith TPL, Reecy JM, Tuggle CK (2019) Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data. BMC Genomics 20(1):344\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWarr A, Affara N, Aken B, Beiki H, Bickhart DM, Billis K, Chow W, Eory L, Finlayson HA, Flicek P et al (2020) An improved pig reference genome sequence to enable pig genetics and genomics research. GigaScience 9(6)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003evan der Hee B, Madsen O, Vervoort J, Smidt H, Wells JM (2020) Congruence of Transcription Programs in Adult Stem Cell-Derived Jejunum Organoids and Original Tissue During Long-Term Culture. Front Cell Dev Biol 8:375\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu W, Moninger TO, Thurman AL, Xie Y, Jain A, Zarei K, Powers LS, Pezzulo AA, Stoltz DA, Welsh MJ (2022) Cellular and molecular architecture of submucosal glands in wild-type and cystic fibrosis pigs. 
Proc Natl Acad Sci USA 119(4)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD (2011) A comparative assessment of the pig, mouse and human genomes. The Minipig in Biomedical Research. CRC, Boca Raton, FL, pp 323\u0026ndash;342\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGroenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, Rogel-Gaillard C, Park C, Milan D, Megens HJ et al (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491(7424):393\u0026ndash;398\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD, Loveland JE, Pascal G, Gilbert JG, Uenishi H, Mann KM, Sang Y, Zhang J, Carvalho-Silva D, Hunt T et al (2013) Structural and functional annotation of the porcine immunome. BMC Genomics 14:332\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoncheva NT, Palasca O, Yarani R, Litman T, Anthon C, Groenen MAM, Stadler PF, Pociot F, Jensen LJ, Gorodkin J (2021) Human pathways in animal models: possibilities and limitations. Nucleic Acids Res 49(4):1859\u0026ndash;1871\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTriant DA, Walsh AT, Hartley GA, Petry B, Stegemiller MR, Nelson BM, McKendrick MM, Fuller EP, Cockett NE, Koltes JE et al (2023) AgAnimalGenomes: browsers for viewing and manually annotating farm animal genomes. Mammalian genome: official J Int Mammalian Genome Soc 34(3):418\u0026ndash;436\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrzybowska EA (2012) Human intronless genes: functional groups, associated diseases, evolution, and mRNA processing in absence of splicing. Biochem Biophys Res Commun 424(1):1\u0026ndash;6\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJorquera R, Gonzalez C, Clausen P, Petersen B, Holmes DS (2018) Improved ontology for eukaryotic single-exon coding sequences in biological databases. 
Database: J Biol databases curation 2018:1\u0026ndash;6\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGriffiths DJ (2001) Endogenous retroviruses in the human genome sequence. Genome Biol 2(6):REVIEWS1017\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeong AZ, Lee PY, Mohtar MA, Syafruddin SE, Pung YF, Low TY (2022) Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J Biomed Sci 29(1):19\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeblanc S, Yala F, Provencher N, Lucier JF, Levesque M, Lapointe X, Jacques JF, Fournier I, Salzet M, Ouangraoua A et al (2024) OpenProt 2.0 builds a path to the functional characterization of alternative proteins. Nucleic Acids Res 52(D1):D522\u0026ndash;D528\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiSepio D, Ghosn C, Eckert RL, Deucher A, Robinson N, Duvic M, Chandraratna RA, Nagpal S (1998) Identification and characterization of a retinoid-induced class II tumor suppressor/growth regulatory gene. Proc Natl Acad Sci USA 95(25):14811\u0026ndash;14815\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUyama T, Jin XH, Tsuboi K, Tonai T, Ueda N (2009) Characterization of the human tumor suppressors TIG3 and HRASLS2 as phospholipid-metabolizing enzymes. Biochim Biophys Acta 1791(12):1114\u0026ndash;1124\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD, Sang Y, Lunney JK (2020) Porcine cytokines, chemokines and growth factors: 2019 update. Res Vet Sci 131:266\u0026ndash;300\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD, Lunney JK (2018) Porcine cluster of differentiation (CD) markers 2018 update. Res Vet Sci 118:199\u0026ndash;246\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD, Smith AD, Chen C, Urban JF Jr. 
(2017) An in-depth comparison of the porcine, murine and human inflammasomes; lessons from the porcine genome and transcriptome. Vet Microbiol 202:2\u0026ndash;15\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLitman T, Stein WD (2023) Ancient lineages of the keratin-associated protein (KRTAP) genes and their co-option in the evolution of the hair follicle. BMC Ecol Evol 23(1):7\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu DD, Irwin DM, Zhang YP (2008) Molecular evolution of the keratin associated protein gene family in mammals, role in the evolution of mammalian hair. BMC Evol Biol 8:241\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKhan I, Maldonado E, Vasconcelos V, O'Brien SJ, Johnson WE, Antunes A (2014) Mammalian keratin associated proteins (KRTAPs) subgenomes: disentangling hair diversity and adaptation to terrestrial and aquatic environments. BMC Genomics 15(1):779\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRodriguez I, Del Punta K, Rothman A, Ishii T, Mombaerts P (2002) Multiple new and isolated families within the mouse superfamily of V1r vomeronasal receptors. Nat Neurosci 5(2):134\u0026ndash;140\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLabunskyy VM, Hatfield DL, Gladyshev VN (2014) Selenoproteins: molecular pathways and physiological roles. Physiol Rev 94(3):739\u0026ndash;777\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLang T, Pelaseyed T (2022) Discovery of a MUC3B gene reconstructs the membrane mucin gene cluster on human chromosome 7. PLoS ONE 17(10):e0275671\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShigenari A, Ando A, Renard C, Chardon P, Shiina T, Kulski JK, Yasue H, Inoko H (2004) Nucleotide sequencing analysis of the swine 433-kb genomic segment located between the non-classical and classical SLA class I gene clusters. 
Immunogenetics 55(10):695\u0026ndash;705\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlanco-Arias P, Sargent CA, Affara NA (2004) A comparative analysis of the pig, mouse, and human PCDHX genes. Mammalian genome: official J Int Mammalian Genome Soc 15(4):296\u0026ndash;306\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePancho A, Aerts T, Mitsogiannis MD, Seuntjens E (2020) Protocadherins at the Crossroad of Signaling Pathways. Front Mol Neurosci 13:117\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu D, Xia J, Yang Z, Zhao X, Li J, Hao W, Yang X (2021) Identification of Chimeric RNAs in Pig Skeletal Muscle and Transcriptomic Analysis of Chimeric RNA TNNI2-ACTA1 V1. Front Vet Sci 8:742593\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrakash T, Sharma VK, Adati N, Ozawa R, Kumar N, Nishida Y, Fujikake T, Takeda T, Taylor TD (2010) Expression of conjoined genes: another mechanism for gene regulation in eukaryotes. PLoS ONE 5(10):e13284\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFrankish A, Carbonell-Sala S, Diekhans M, Jungreis I, Loveland JE, Mudge JM, Sisu C, Wright JC, Arnan C, Barnes I et al (2023) GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res 51(D1):D942\u0026ndash;D949\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJonakova V, Kraus M, Veselsky L, Cechova D, Bezouska K, Ticha M (1998) Spermadhesins of the AQN and AWN families, DQH sperm surface protein and HNK protein in the heparin-binding fraction of boar seminal plasma. J Reprod Fertil 114(1):25\u0026ndash;34\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePerez-Patino C, Parrilla I, Li J, Barranco I, Martinez EA, Rodriguez-Martinez H, Roca J (2019) The Proteome of Pig Spermatozoa Is Remodeled During Ejaculation. 
Mol Cell proteomics: MCP 18(1):41\u0026ndash;50\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu Y, Han Q, Ma C, Wang Y, Zhang P, Li C, Cheng X, Xu H (2021) Comparative Proteomics and Phosphoproteomics Analysis Reveal the Possible Breed Difference in Yorkshire and Duroc Boar Spermatozoa. Front Cell Dev Biol 9:652809\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao H, Lee WH, Shen JH, Li H, Zhang Y (2008) Identification of novel semenogelin I-derived antimicrobial peptide from liquefied human seminal plasma. Peptides 29(4):505\u0026ndash;511\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eModrek B, Lee CJ (2003) Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 34(2):177\u0026ndash;180\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet 12(11):1313\u0026ndash;1320\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNygard AB, Cirera S, Gilchrist MJ, Gorodkin J, Jorgensen CB, Fredholm M (2010) A study of alternative splicing in the pig. BMC Res Notes 3:123\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCampagnoni AT, Pribyl TM, Campagnoni CW, Kampf K, Amur-Umarjee S, Landry CF, Handley VW, Newman SL, Garbay B, Kitamura K (1993) Structure and developmental regulation of Golli-mbp, a 105-kilobase gene that encompasses the myelin basic protein gene and is expressed in cells in the oligodendrocyte lineage in the brain. J Biol Chem 268(7):4930\u0026ndash;4938\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSiu CR, Balsor JL, Jones DG, Murphy KM (2015) Classic and Golli Myelin Basic Protein have distinct developmental trajectories in human visual cortex. 
Front Neurosci 9:138\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTapial J, Ha KCH, Sterne-Weiler T, Gohr A, Braunschweig U, Hermoso-Pulido A, Quesnel-Vallieres M, Permanyer J, Sodaei R, Marquez Y et al (2017) An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms. Genome Res 27(10):1759\u0026ndash;1768\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A (2013) Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol 14(7):R70\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGomez-Fernandez P, Urtasun A, Paton AW, Paton JC, Borrego F, Dersh D, Argon Y, Alloza I, Vandenbroeck K (2018) Long Interleukin-22 Binding Protein Isoform-1 Is an Intracellular Activator of the Unfolded Protein Response. Front Immunol 9:2934\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeiss B, Wolk K, Grunberg BH, Volk HD, Sterry W, Asadullah K, Sabat R (2004) Cloning of murine IL-22 receptor alpha 2 and comparison with its human counterpart. Genes Immun 5(5):330\u0026ndash;336\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOhta S, Bahrun U, Tanaka M, Kimoto M (2004) Identification of a novel isoform of MD-2 that downregulates lipopolysaccharide signaling. Biochem Biophys Res Commun 323(3):1103\u0026ndash;1108\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGray P, Michelsen KS, Sirois CM, Lowe E, Shimada K, Crother TR, Chen S, Brikos C, Bulut Y, Latz E et al (2010) Identification of a novel human MD-2 splice variant that negatively regulates Lipopolysaccharide-induced TLR4 signaling. J Immunol 184(11):6359\u0026ndash;6366\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDawson HD, Chen C, Gaynor B, Shao J, Urban JF Jr. 
(2017) The porcine translational research database: a manually curated, genomics and proteomics-based research resource. BMC Genomics 18(1):643\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003eTables 1 to 7 are available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"pig, mouse, human genome, nutrition, metabolism, immunity","lastPublishedDoi":"10.21203/rs.3.rs-6856588/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6856588/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eRecently there have been numerous attempts to improve the genome of the pig. Despite these efforts, there is a substantial amount of work remaining to obtain a \u0026ldquo;finished version\u0026rdquo; of the genome; analysis of incomplete versions can lead to incorrect biological interpretations. To that end, we manually assembled and annotated a non-redundant, 16,146 RNA and 15,613 pig protein sequence library. 
We used it to assess the assembly and annotation status of the 3 latest builds of the genome and to compare the pig genome with the mouse and human genomes.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eOur analysis of 3,333 protein coding genes reveals that the percentages of error-free assembled and annotated genes in NCBI and Ensembl builds 11.1 and MARC build 1.0 are 69.4, 50.1, and 40.0%, respectively. An examination of these errors revealed nine predominant sources that are detailed in the Results. Using our protein library, we determined 1:1 orthology to 16,496 mouse and 15,770 human proteins. 73.8% of these proteins were conserved among the 3 species; however, when a gene was missing from one of the three genomes, pigs were 5.0X more likely to have the human gene than mice. REACTOME, KEGG, GO BP Direct and Ingenuity Pathway Analysis functional enrichment analyses of pig-human orthologous genes revealed 8, 3, 14 and 32 conserved pathways, and 0, 3, 0, and 29 for human-mouse pathways, respectively. 
Last, we conducted an analysis of functional domain preservation for 3,465 proteins and discovered when a functional domain is missing from a protein in 1 of the 3 species, pigs are 1.5X more likely to have the human domain than mice.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThese data strongly indicate that, overall, swine are a scientifically important intermediate species (rodent-human) for conducting scientific research on human health.\u003c/p\u003e","manuscriptTitle":"Verification and Comparison of Pig, Mouse, and Human Genome Similarities: Use of Manual Assembly and Analyses","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-06-11 12:19:59","doi":"10.21203/rs.3.rs-6856588/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c17d3610-f23d-4c25-bb02-810355a2d218","owner":[],"postedDate":"June 11th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":49770956,"name":"Epigenetics \u0026 Genomics"}],"tags":[],"updatedAt":"2025-06-11T12:19:59+00:00","versionOfRecord":[],"versionCreatedAt":"2025-06-11 
12:19:59","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6856588","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6856588","identity":"rs-6856588","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
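The Materials and Methods describe translating predicted mRNAs and using the amino-acid length of the major protein isoform as a checksum on the genome assembly. The sketch below illustrates that idea only; it is not the authors' procedure (they report using the ExPASy translate tool). The names `translate_cds` and `length_checksum`, the `tolerance` parameter, and the rule of stopping at the first in-frame stop codon are all assumptions made for illustration. A length mismatch would only flag a gene for manual review, since orthologous isoforms can legitimately differ in length.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(mrna, start=0):
    """Translate from `start` until the first in-frame stop codon.

    Codons containing ambiguous nucleotides (e.g. N) become 'X'.
    """
    seq = mrna.upper().replace("U", "T")
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def length_checksum(pig_mrna, reference_length, tolerance=0):
    """Compare the translated pig protein length with a reference length.

    `tolerance` (hypothetical) is the allowed difference in amino acids.
    """
    pig_length = len(translate_cds(pig_mrna))
    return {
        "pig_length": pig_length,
        "reference_length": reference_length,
        "consistent": abs(pig_length - reference_length) <= tolerance,
    }
```

For example, a predicted mRNA with a premature in-frame stop codon translates to a protein shorter than the human reference, so `length_checksum` reports `consistent: False` and the gene can be queued for manual inspection.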
