An urgent call for a Human Knowledge Archive (HKA): safeguarding factual integrity in the age of generative AI

preprint OA: closed
This is a preprint and has not been peer reviewed. Data may be preliminary.

Authorea preprint · 21 July 2025 · V1 (latest version)
Author: Renhua Sun (ORCID 0000-0002-8203-4946)
https://doi.org/10.22541/au.175312789.99168429/v1
219 views · 113 downloads

Abstract

Generative Artificial Intelligence (AI) tools, such as Large Language Models (LLMs), have become powerful assistants in content creation and research. However, they pose a profound threat to the integrity of human knowledge by generating fictitious yet plausible-sounding information. This fabricated content, including non-existent scholarly citations, is increasingly integrated uncritically into public discourse, leading to an accelerating cycle of information contamination.
Standard verification methods, like time-limiting searches to a “pre-AI” era, are insufficient because AI can retroactively fabricate and timestamp historical data, polluting our understanding of the past itself. We are at a critical inflection point. To counter this existential threat, this article proposes the establishment of a Human Knowledge Archive (HKA). This archive would be a static, immutable, and comprehensive snapshot of our collective knowledge up to a designated point in time, adhering to principles of comprehensiveness, immutability, and public accessibility with strict isolation. The HKA’s critical isolation, involving measures like physical air-gapping and decentralized ledger technologies, is crucial to prevent contamination and ensure it remains untainted by future AI training. This initiative is a vital act of intellectual preservation, committed to providing future generations and responsible AI systems with a reliable, factual baseline of human civilization. While the task is immense, the cost of inaction, especially as future AI capabilities remain profoundly unpredictable, is the sanctity of truth itself.

Background: the emergence of a new epistemological crisis

The advent and widespread adoption of powerful generative AI tools, most notably Large Language Models (LLMs) like ChatGPT and Gemini, represent a paradigm shift in how information is generated and synthesized. These models offer unprecedented efficiency in drafting text, summarizing complex topics, and aiding in the research process. Their utility is undeniable, and their integration into academic, journalistic, and public workflows is rapidly accelerating. However, this technological leap is accompanied by a profound and insidious risk. A well-documented phenomenon known as “artificial hallucination” causes these models to fabricate information with a high degree of confidence and plausibility.
A particularly dangerous manifestation of this is the generation of non-existent literature. An LLM might produce a paragraph of text supported by a list of references that appear entirely legitimate, complete with authors, titles, journal names, and years of publication. Yet, upon inspection, these sources are revealed to be complete fabrications, phantoms of the digital ether.

The core of the problem lies not just in the AI’s capacity to err, but in the uncritical acceptance of its output by human users. In the rush for productivity, authors, researchers, and content creators are increasingly incorporating these AI-generated texts and their fictitious citations directly into manuscripts, reports, and articles without due diligence. Once these works are published, the fabricated information transcends its dubious origins. It becomes embedded in the formal record: a peer-reviewed paper, a news story, or a database entry. It becomes, for all intents and purposes, a “fact” that can be cited, shared, and propagated. We are witnessing the birth of a system where falsehoods are laundered into legitimacy, polluting the ecosystem of human knowledge.

The initiative: a proposal for a secure Human Knowledge Archive (HKA)

The accelerating contamination of our information landscape requires a bold and preemptive solution. We can no longer assume that the live, dynamic internet can serve as a reliable repository of factual knowledge. Therefore, we advocate an immediate, large-scale effort to create a Human Knowledge Archive (HKA). The HKA would be a static, immutable, and comprehensive snapshot of our collective knowledge up to a designated point in time.

Establishing this time point for the HKA will require careful consideration. While the aim is to capture knowledge prior to widespread AI contamination, the exact cutoff must be a subject of global consensus, balancing the need for a comprehensive baseline with the practicalities of data collection.
It could align with a specific technological milestone, a widely recognized date of mainstream AI adoption, or a mutually agreed-upon societal demarcation, ensuring the archive serves as a truly untainted intellectual bedrock.

This initiative would be guided by three core principles:

Comprehensiveness: The archive must be exhaustive. The effort should aim to preserve the full spectrum of human intellectual and cultural output. This includes, but is not limited to:
- Scientific and Academic Literature: all peer-reviewed articles, preprints, conference proceedings, and dissertations.
- Databases: public and scientific databases (e.g., GenBank, Protein Data Bank, economic datasets).
- Web Content: a comprehensive crawl and preservation of the invaluable, yet ephemeral, content of the World Wide Web.
- Digitized Books and Historical Documents: all materials from projects like Google Books, Project Gutenberg, and national library archives.
- Legal and Governmental Records: legislation, court records, patents, and public reports.
- News and Media Archives: the archives of reputable journalistic outlets and public broadcasters.
- Cultural Works: digitized art, music, and other forms of cultural expression that constitute a part of our shared heritage.

Immutability and Verifiability: The archived data must be stored in a “write-once” format that is resistant to editing or tampering. Each entry would be cryptographically timestamped to certify its state at the time of archiving. This ensures that the HKA remains a stable and uncorrupted baseline.

Public Accessibility with Isolation: The HKA must be fully searchable and accessible to the public for verification and research. However, its core dataset must be maintained as a “cold backup,” completely isolated from the live internet.
This isolation would likely involve a combination of physical air-gapping for primary storage, decentralized distributed ledger technologies for cryptographic timestamping and verification of integrity, and stringent access protocols. These measures are essential to prevent external alteration, which would compromise its fundamental purpose as an uncontaminated reference.

Discussion: confronting a corrupted future and insufficient solutions

The threat we face is existential for a civilization that prides itself on being built upon a foundation of evidence and fact. As AI-generated content proliferates, the proportion of verifiably true information within our accessible knowledge pool will decrease. Over time, distinguishing between genuine historical fact and plausible AI fiction will become monumentally difficult, if not impossible. The very rigor of human intellectual endeavor is at stake.

One might argue that a simpler solution exists: to filter information based on its date of creation, effectively treating the advent of mainstream generative AI as a chronological firewall. One could, for instance, choose to only trust sources published before 2023. This approach is dangerously naive and fundamentally misunderstands the nature of AI-driven contamination. The fabrications generated by AI are not confined to contemporary topics. An LLM can generate a detailed, convincing, yet entirely false account of a 19th-century scientific discovery, invent a Renaissance political treatise, or fabricate data for a supposed 1990s clinical trial. The contamination is retroactive. It injects falsehoods into the historical record, polluting our understanding of the pre-AI era itself. Therefore, a simple time filter is an inadequate defense. The year of AI’s emergence is not a clean boundary; it is the point from which the entire timeline of human knowledge became vulnerable to corruption.
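Unlike a date filter, the write-once, cryptographically timestamped storage proposed for the HKA would make a backdated insertion detectable. The following is a minimal sketch of one way this could work, assuming a simple hash chain in which each record commits to the one before it. The record fields, the function names (`entry_hash`, `append`, `verify`, `build_demo_chain`), and the sample timestamps are all hypothetical illustrations, not part of the proposal, and a real archive would need a far more robust design.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record (assumption)


def entry_hash(prev_hash: str, archived_at: str, content_sha256: str) -> str:
    """Hash a record's fields together with the previous record's hash."""
    payload = json.dumps(
        {"prev": prev_hash, "archived_at": archived_at, "content": content_sha256},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, archived_at: str, content: bytes) -> None:
    """Append a record whose hash commits to every earlier record."""
    prev = chain[-1]["hash"] if chain else GENESIS
    digest = hashlib.sha256(content).hexdigest()
    chain.append(
        {
            "archived_at": archived_at,
            "content": digest,
            "prev": prev,
            "hash": entry_hash(prev, archived_at, digest),
        }
    )


def verify(chain: list) -> bool:
    """Recompute every link; an edited or inserted record breaks the chain."""
    prev = GENESIS
    for record in chain:
        expected = entry_hash(prev, record["archived_at"], record["content"])
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = record["hash"]
    return True


def build_demo_chain() -> list:
    """Build a two-record chain from made-up sample documents."""
    chain = []
    append(chain, "2022-01-01T00:00:00Z", b"genuine document A")
    append(chain, "2022-01-02T00:00:00Z", b"genuine document B")
    return chain
```

In this sketch, slipping a fabricated record between two genuine ones causes `verify` to fail, because the later record still points at the hash of its original predecessor. An attacker who could rewrite every later record would defeat the check, so detection depends on the latest hash being held independently of the archive itself, which is one role the distributed ledger mentioned above could play.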
The HKA’s creation is a monumental undertaking, undoubtedly fraught with logistical, financial, and ethical challenges. While critical issues like selection criteria and the technical infrastructure needed for petabyte-scale storage demand immediate attention, the establishment and ongoing maintenance of the HKA also necessitate an unprecedented level of international collaboration and governance. The effort could ideally be overseen by a new, independent global body, perhaps operating under the auspices of the United Nations or a consortium of leading scientific and cultural institutions, responsible for securing funding and coordinating comprehensive collection efforts.

To ignore these challenges means facing an even grimmer alternative: a future where our civilization can no longer trust its own historical records. The very moment AI started generating publicly accessible content, we entered a new era where the integrity of information is no longer guaranteed. This discussion, however, is based solely on AI’s current capabilities. When artificial intelligence achieves genuine autonomy and self-awareness, the landscape of information integrity will fundamentally shift into an entirely unpredictable and far more daunting scenario.

Conclusion

Our civilization stands at a precipice. The tools we have created to augment our intellect now threaten to erode its very foundation. The uncritical integration of AI-generated content is actively seeding our collective knowledge with enduring falsehoods. We must act decisively to preserve the intellectual heritage created before this new form of contamination became widespread. The proposed Human Knowledge Archive is not a measure of pessimism towards technology; instead, it is a necessary act of intellectual preservation. It is a commitment to ensuring that future generations, and future, more responsible AI systems, have a reliable, factual baseline of human civilization from which to learn.
The task is immense, but the cost of inaction is the sanctity of truth itself.

Information & Authors

Version history: Version 1 (V1), 21 July 2025
Copyright: This work is licensed under a Non Exclusive No Reuse License.
Keywords: artificial intelligence (ai), chatgpt, computing and processing, copilot, gemini, generative ai, large language models (llms)

Citation

Renhua Sun. An urgent call for a Human Knowledge Archive (HKA): safeguarding factual integrity in the age of generative AI. Authorea. 21 July 2025. DOI: https://doi.org/10.22541/au.175312789.99168429/v1


Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00