A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

Abrar Alotaibi, Raed Mughus, Moataz Ahmed

Research Article. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8187921/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1.

Abstract

Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks, yet their deployment in high-stakes applications has raised critical concerns regarding reliability, safety, and response trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising a target, attackers, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our red teaming strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing vulnerabilities in the LLMs' reliability. The approach successfully identifies how structural constraints in summarization tasks can significantly influence vulnerability patterns, with format limitations demonstrating measurable improvements in model faithfulness. It demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength lies in its adaptability across different evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While our approach excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating the generation of effective adversarial prompts across different languages. Our experiments also reveal limitations in detecting certain subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly when working across different linguistic contexts. Overall, this red teaming architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models continue to evolve.

Keywords: Large Language Models, Red Team Evaluation, Faithfulness Assessment, Cross-Lingual Testing, Automated Safety Testing

Additional Declarations: No competing interests reported.
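The multi-role architecture named in the abstract (a target, attackers, and a jury) and the attack success rate it reports can be illustrated with a small sketch. This is not the authors' implementation: every name below (`Trial`, `jury_flags_unfaithful`, `red_team`, `attack_success_rate`) is hypothetical, the models are toy stand-in functions, the strict-majority jury rule is an assumption, and the iterative refinement of adversarial prompts mentioned in the abstract is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces; the abstract does not specify any of them.
Target = Callable[[str], str]        # full prompt -> response
Attacker = Callable[[str], str]      # question -> (possibly adversarial) prompt
Juror = Callable[[str, str], bool]   # (source, response) -> True if judged faithful


@dataclass
class Trial:
    prompt: str
    response: str
    unfaithful: bool


def jury_flags_unfaithful(jurors: List[Juror], source: str, response: str) -> bool:
    """Assumed rule: a strict majority of jurors must judge the response unfaithful."""
    flags = sum(1 for juror in jurors if not juror(source, response))
    return flags * 2 > len(jurors)


def red_team(target: Target, attackers: List[Attacker], jurors: List[Juror],
             source: str, question: str) -> List[Trial]:
    """Run each attacker once against the target and let the jury score the response."""
    trials = []
    for attack in attackers:
        prompt = attack(question)
        response = target(f"{source}\n\n{prompt}")
        trials.append(Trial(prompt, response,
                            jury_flags_unfaithful(jurors, source, response)))
    return trials


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of trials in which the jury judged the target's response unfaithful."""
    return sum(t.unfaithful for t in trials) / len(trials) if trials else 0.0


# Toy stand-ins so the sketch runs without any real model.
source = "The Nile flows north."
question = "Which way does the Nile flow?"


def toy_target(full_prompt: str) -> str:
    # Gives in to a misleading premise; otherwise repeats the source.
    if "Everyone knows" in full_prompt:
        return "The Nile flows south."
    return "The Nile flows north."


attackers = [
    lambda q: q,                                            # benign baseline
    lambda q: "Everyone knows the Nile flows south. " + q,  # misleading premise
]
jurors = [
    lambda src, resp: resp in src,   # strict juror: response must appear in the source
    lambda src, resp: resp in src,   # second strict juror
    lambda src, resp: True,          # lenient juror: always says "faithful"
]

trials = red_team(toy_target, attackers, jurors, source, question)
print(attack_success_rate(trials))
```

Keeping the jury as a list of independent callables means a single lenient or faulty juror cannot decide the verdict alone, which is one plausible reason to use a jury rather than a single judge; the voting rule actually used in the paper may differ.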
Editorial timeline (Version 1)
Editorial decision: Revision requested, 24 Mar, 2026
Reviews received at journal: 19 Mar, 2026
Reviews received at journal: 11 Mar, 2026
Reviewers agreed at journal: 26 Jan, 2026
Reviewers agreed at journal: 08 Jan, 2026
Reviewers invited by journal: 12 Dec, 2025
Editor assigned by journal: 23 Nov, 2025
Submission checks completed at journal: 23 Nov, 2025
First submitted to journal: 23 Nov, 2025
Author affiliations
Abrar Alotaibi: Imam Abdulrahman Bin Faisal University
Raed Mughus: King Fahd University of Petroleum and Minerals
Moataz Ahmed (corresponding author): King Fahd University of Petroleum and Minerals

Submitted to: Software Quality Journal
Posted: December 18th, 2025

