Weakly Supervised Veracity Classification with LLM-Predicted Credibility Signals

Research Article

João Augusto Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5389911/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 21 Feb, 2025 in EPJ Data Science.

Abstract

Credibility signals represent a wide range of heuristics typically used by journalists and fact-checkers to assess the veracity of online content. Automating the extraction of credibility signals presents significant challenges due to the necessity of training high-accuracy, signal-specific extractors, coupled with the lack of sufficiently large annotated datasets. This paper introduces Pastel (Prompted weAk Supervision wiTh crEdibility signaLs), a weakly supervised approach that leverages large language models (LLMs) to extract credibility signals from web content, and subsequently combines them to predict the veracity of content without relying on human supervision. We validate our approach using four article-level misinformation detection datasets, demonstrating that Pastel outperforms zero-shot veracity detection by 38.3% and achieves 86.7% of the performance of the state-of-the-art system trained with human supervision. Moreover, in cross-domain settings where training and testing datasets originate from different domains, Pastel significantly outperforms the state-of-the-art supervised model by 63%. We further study the association between credibility signals and veracity, and perform an ablation study showing the impact of each signal on model performance. Our findings reveal that 12 out of the 19 proposed signals exhibit strong associations with veracity across all datasets, while some signals show domain-specific strengths.

Keywords: Veracity Classification, Large Language Models, Weak Supervision, Credibility Signals

Additional Declarations

No competing interests reported.
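The abstract describes a two-step idea: an LLM answers questions about credibility signals, and the answers are combined into a veracity label without human-labelled training data. The following is a minimal sketch of that idea under loud assumptions, not the authors' implementation. The signal names, the `toy_signal_extractor` stand-in for an LLM prompt, the yes/no/unsure answer format, and the unweighted majority vote are all hypothetical; the abstract does not specify the prompts or the aggregation method.

```python
from collections import Counter
from typing import Callable, Dict, List

# Vote values used by this sketch (an assumption, not taken from the paper).
MISINFO, CREDIBLE, ABSTAIN = 1, 0, -1

# Hypothetical signals: name -> label implied when the signal is present.
SIGNAL_POLARITY: Dict[str, int] = {
    "clickbait_title": MISINFO,
    "cites_sources": CREDIBLE,
    "emotional_language": MISINFO,
}


def toy_signal_extractor(article: str, signal: str) -> str:
    """Keyword-based stand-in for an LLM prompt; returns 'yes', 'no', or 'unsure'."""
    text = article.lower()
    if signal == "clickbait_title":
        return "yes" if "you won't believe" in text else "no"
    if signal == "cites_sources":
        return "yes" if "according to" in text else "no"
    if signal == "emotional_language":
        return "yes" if "!" in article else "unsure"
    return "unsure"


def signal_votes(article: str, extractor: Callable[[str, str], str]) -> List[int]:
    """Turn each signal answer into a weak vote for MISINFO, CREDIBLE, or ABSTAIN."""
    votes = []
    for signal, polarity in SIGNAL_POLARITY.items():
        answer = extractor(article, signal)
        if answer == "yes":
            votes.append(polarity)
        elif answer == "no":
            # Assumption: an absent signal counts as a vote for the opposite label.
            votes.append(1 - polarity)
        else:
            votes.append(ABSTAIN)
    return votes


def majority_vote(votes: List[int]) -> int:
    """Unweighted majority over non-abstaining votes; ties (or no votes) abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[MISINFO] == counts[CREDIBLE]:
        return ABSTAIN
    return MISINFO if counts[MISINFO] > counts[CREDIBLE] else CREDIBLE


if __name__ == "__main__":
    article = "According to the ministry, output rose by 2 percent."
    print(majority_vote(signal_votes(article, toy_signal_extractor)))
```

In practice the extractor would be replaced by a call to an LLM, and a learned aggregation model could replace the majority vote; both choices are left open here because the abstract does not describe them.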
Supplementary Files

PASTELcode.zip

Version 1 posted: 26 Nov, 2024
Editorial decision (revision requested): 24 Dec, 2024
Reviews received at journal: 22 Dec, 2024
Reviews received at journal: 21 Nov, 2024
Reviewers agreed at journal: 14 Nov, 2024
Reviewers agreed at journal: 12 Nov, 2024
Reviewers invited by journal: 12 Nov, 2024
Editor assigned by journal: 07 Nov, 2024
Submission checks completed at journal: 07 Nov, 2024
First submitted to journal: 04 Nov, 2024
Author affiliations: João Augusto Leite (corresponding author), Olesya Razuvayevskaya, Kalina Bontcheva, and Carolina Scarton are all at the University of Sheffield.
Version of record: https://doi.org/10.1140/epjds/s13688-025-00534-0 (EPJ Data Science, published 21 February 2025).
