Gene Length Bias in Human Papillomavirus Integration Sites: A Statistical Analysis Using KNIME

Short Report

Sahil Khandekar, Urja Bait

Research Square preprint, Version 1, posted March 3rd, 2026. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8941085/v1
This work is licensed under a CC BY 4.0 License.

Abstract

Integration of high-risk Human Papillomavirus (HPV) into the human genome represents a critical event in carcinogenesis. While recurrent integration hotspots are well documented, the structural determinants governing host gene susceptibility remain incompletely defined. Here, we evaluate whether human gene length influences HPV integration probability using VISDB integration datasets for HPV16 and HPV18 merged with Ensembl BioMart annotations.
Statistical analysis performed in KNIME reveals that HPV-integrated genes exhibit significantly greater length than non-integrated genes (two-sided Wilcoxon–Mann–Whitney test, p = 0.0001). Distributional analysis demonstrates a consistent rightward shift in log-transformed gene length among integrated genes. Genomic validation using the UCSC Genome Browser confirms integration within established oncogenic regions, including 8q24 near MYC. These findings identify gene length as a measurable structural bias contributing to HPV integration susceptibility, while underscoring the role of genomic context in hotspot formation.

Keywords: HPV integration, gene length bias, viral oncogenesis, genomic hotspots, statistical genomics, HPV16, HPV18

1 Introduction

High-risk HPV types, particularly HPV16 and HPV18, promote malignant transformation through integration into host genomic DNA. Integration events can disrupt tumor suppressor genes, amplify oncogenic loci, and alter regulatory architecture. Recurrent hotspots such as PVT1, FHIT, MACROD2, and loci within 8q24 have been consistently reported [1-3]. However, whether intrinsic structural properties of host genes influence integration frequency remains unclear.

We hypothesized that longer genes, by occupying larger genomic intervals, present increased opportunity for viral integration.

2 Materials and Methods

2.1 Data Acquisition

HPV16 and HPV18 integration site data were obtained from VISDB [4]. Human gene annotations were retrieved from Ensembl BioMart (GRCh38) [5]. Integration coordinates were mapped to corresponding host genes using genomic overlap criteria.

2.2 Computational Workflow

All preprocessing and statistical analyses were performed using the KNIME Analytics Platform [7].

VISDB Data Processing

HPV16 and HPV18 datasets were imported separately using CSV Reader nodes and subsequently concatenated.
The following preprocessing steps were applied:

- Column filtering to retain only relevant gene-level variables
- Removal of rows with missing gene annotations
- Cell splitting of multi-gene integration entries
- Ungrouping to separate duplicated gene entries
- Removal of duplicate rows
- Elimination of unnecessary identifiers (e.g., Ensembl ID where redundant)

For hotspot identification, gene counts were aggregated using GroupBy nodes, followed by rule-based filtering and ranking using Sorter and Top K Row Filter nodes.

Ensembl BioMart Processing

Gene annotation data from Ensembl BioMart were processed as follows:

- Records were filtered to retain essential genomic coordinates
- Unnecessary chromosome entries were removed
- Gene length was calculated from genomic start and end positions
- Gene length values were log-transformed to reduce skewness

Data Integration

Processed VISDB and BioMart datasets were merged using genomic overlap criteria through Joiner nodes. A rule-based column was generated to classify genes into two groups:

- Integrated
- Non-integrated

Statistical Validation

Group comparison was performed using the Wilcoxon–Mann–Whitney test node. Distributional visualization included:

- Box plot
- Kernel density estimation

2.3 Statistical Analysis

Gene length distributions between integrated and non-integrated groups were compared using the Wilcoxon–Mann–Whitney test. Statistical significance was defined at p < 0.05.

3 Results

3.1 Gene Length Distribution Bias

Integrated genes demonstrated higher median log-transformed lengths relative to non-integrated genes (Fig. 3). Distributional differences are further illustrated in the density plot (Fig. 4). Statistical testing confirmed a significant difference between groups (p = 0.0001), supporting a gene length bias in HPV integration.

A total of 10,063 integrated genes and 13,199 non-integrated genes were included in the analysis.
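The length calculation, log transformation, and group comparison described above can be sketched outside KNIME as follows. This is a minimal illustration under stated assumptions, not the authors' workflow: the coordinates are made-up toy values, `gene_length` assumes 1-based inclusive start and end positions (the paper does not state whether 1 is added), and `mann_whitney_u` is a hypothetical helper that uses the normal approximation with tie and continuity corrections, which may differ in detail from the KNIME test node.

```python
import math


def gene_length(start: int, end: int) -> int:
    """Gene length in bp, assuming 1-based inclusive coordinates."""
    return end - start + 1


def mann_whitney_u(x, y):
    """Two-sided Wilcoxon-Mann-Whitney test (normal approximation).

    Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering group membership (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at index i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    # Continuity-corrected z score, floored at zero.
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))
    return u_x, min(p, 1.0)


# Hypothetical (start, end) pairs; NOT values from VISDB or Ensembl.
integrated = [(1_000, 251_000), (5_000, 905_000), (200, 120_200), (40_000, 1_540_000)]
non_integrated = [(100, 2_100), (3_000, 18_000), (7_500, 9_000), (10, 45_010), (600, 30_600)]

int_log = [math.log10(gene_length(s, e)) for s, e in integrated]
non_log = [math.log10(gene_length(s, e)) for s, e in non_integrated]

u_stat, p_value = mann_whitney_u(int_log, non_log)
print(f"U = {u_stat}, two-sided p = {p_value:.4f}")
```

Because the test is rank-based, running it on raw lengths instead of log-transformed lengths gives the same U statistic; the log transformation matters for the box and density plots rather than for the test itself. In practice a library routine such as `scipy.stats.mannwhitneyu` with `alternative="two-sided"` would usually be preferred over a hand-written version.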

3.2 Hotspot Context

Genomic inspection using the UCSC Genome Browser [6] revealed that several integration hotspots reside within established oncogenic regions, including 8q24 proximal to MYC. Importantly, certain shorter genes within highly active regulatory domains also exhibited integration events, suggesting that chromatin state and regulatory architecture influence integration beyond structural length bias. The computational workflow used for hotspot identification is shown in Fig. 1 and Fig. 2.

4 Discussion

Our findings demonstrate that human gene length is significantly associated with HPV integration susceptibility. The most direct interpretation is probabilistic: longer genes span greater genomic territory and therefore present increased opportunity for viral insertion events.

However, gene length alone does not fully explain hotspot localization. Integration within shorter genes positioned in transcriptionally active or enhancer-rich regions indicates that epigenetic and chromatin features likely refine integration targeting. These observations support a composite model in which structural exposure establishes baseline susceptibility, while regulatory context determines hotspot specificity.

5 Conclusion

Human gene length is positively associated with HPV integration probability. Integrated genes are significantly longer than non-integrated genes. Although gene length contributes to susceptibility, integration hotspot formation likely arises from interaction between structural and regulatory genomic features.

6 Limitations

This analysis evaluated a single genomic feature (gene length), and integration frequency was not normalized by gene density or chromosomal length distribution.

7 Future Directions

Future investigations should integrate chromatin accessibility, histone modification profiles, gene expression data, replication timing, and three-dimensional genome organization to develop predictive models of HPV integration susceptibility.
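
The probabilistic interpretation in the Discussion, and the normalization caveat in the Limitations, can be made concrete with a small sketch. It assumes a deliberately simple null model that is not part of the paper: every integration event lands at a uniformly random genomic position, independently of the others. The genome size and event count are illustrative assumptions, and `p_at_least_one_hit` and `events_per_kb` are hypothetical helper names.

```python
def p_at_least_one_hit(gene_bp: int, n_events: int, genome_bp: int) -> float:
    """P(gene receives >= 1 event) if each event lands uniformly at random."""
    per_event = gene_bp / genome_bp
    return 1.0 - (1.0 - per_event) ** n_events


def events_per_kb(n_hits: int, gene_bp: int) -> float:
    """Length-normalized integration density (events per kilobase of gene)."""
    return n_hits / (gene_bp / 1000.0)


GENOME_BP = 3_100_000_000  # rough human genome size; an assumption
N_EVENTS = 10_000          # hypothetical number of integration events

for length_bp in (10_000, 100_000, 1_000_000):
    p = p_at_least_one_hit(length_bp, N_EVENTS, GENOME_BP)
    print(f"{length_bp:>9} bp gene: P(at least one event) = {p:.3f}")
```

Under this null model, longer genes are more likely to be classified as integrated even without any biological preference, which is why a comparison of events per kilobase (or an explicit length-matched background) would help separate structural exposure from regulatory targeting.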

Abbreviations

HPV: Human Papillomavirus; VISDB: Viral Integration Site Database.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
HPV integration data were obtained from VISDB (http://bioinfo.ahu.edu.cn/VISDB). Gene annotation data were retrieved from Ensembl BioMart (https://www.ensembl.org/biomart). Genomic validation was performed using the UCSC Genome Browser (https://genome.ucsc.edu).

Competing Interests
The authors declare no competing interests.

Funding
This research received no external funding.

Authors' contributions
Sahil Khandekar conceptualized the study, performed data analysis, and drafted the manuscript. Urja Bait contributed to data interpretation and manuscript revision. Both authors approved the final manuscript.

Acknowledgements
Not applicable.

References

1. Bodelon C, Untereiner ME, Machiela MJ, Vinokurova S, Wentzensen N. Genomic characterization of viral integration sites in HPV-related cancers. Int J Cancer. 2016;139(9):2001–2011.
2. Hu Z, Zhu D, Wang W, et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet. 2015;47(2):158–163.
3. Akagi K, Li J, Broutian TR, et al. Genome-wide analysis of HPV integration in human cancers reveals recurrent, focal genomic instability. Genome Res. 2014;24(2):185–199.
4. Cao L, et al. VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Res. 2020;48(D1):D633–D641.
5. Kinsella RJ, Kähäri A, Haider S, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford). 2011;2011:bar030.
6. Kent WJ, Sugnet CW, Furey TS, et al. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
7. Berthold MR, Cebron N, Dill F, et al. KNIME – the Konstanz Information Miner: version 2.0 and beyond. SIGKDD Explor. 2009;11(1):26–31.

Additional Declarations

No competing interests reported.

Supplementary Files

UCSCgenevalidation.docx

Figure legends

Figure 1. KNIME workflow for preprocessing VISDB and BioMart datasets, data integration, gene classification, and statistical testing.
Figure 2. Workflow for identification and ranking of HPV16 and HPV18 integration hotspots using aggregation and top-k filtering. This workflow is available on GitHub: https://github.com/iamsahill4/HPV-KNIME-Workflow
Figure 3. Box plot comparing log-transformed gene lengths between integrated and non-integrated genes.
Figure 4. Kernel density plot showing rightward shift in gene length distribution among integrated genes.
