Let’s Read the Log: Root Cause Analysis of Railway Test Execution Logs with Large Language Models | Research Square
Research Article
Rahmanu Hermawan, Alessio Bucaioni, Eduard Enoiu, Wasif Afzal, Mehrdad Saadatmand, Nedim Zaimovic, and Md Saleh Ibtasham
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8869896/v1
This work is licensed under a CC BY 4.0 License
Status: Under Revision
Version 1 posted 20 Mar, 2026
Abstract
Software quality assurance is pivotal in safety-critical domains such as railway systems, where failures could have catastrophic consequences. In this context, the train control and management system, which enables communication and control across multiple subsystems (such as doors and information panels) within a modern train, and its software must undergo rigorous validation. Alstom Rail Sweden AB employs a digital twin infrastructure to simulate and validate train control and management system software.
While this setup improves system-level testing, root cause analysis of test failures remains a manual, time-consuming bottleneck.
In this study, we explore the potential of large language models to automate root cause analysis by interpreting test execution logs generated during digital twin-based testing. We benchmark nine state-of-the-art large language models (Aion-1.0, DeepSeek R1, DeepSeek V3 0324, Mistral Small 3.1 24B, GPT o3-mini, Gemini 2.5 Pro Experimental, QwB 32B, Gemini 2.0 Flash Experimental, and Amazon Nova 2 Lite), using zero-shot chain-of-thought prompting to assess their ability to reason about fault patterns in real-world industrial test execution logs. The logs were sourced from Alstom’s digital twin-based testing environment and captured complex operational behaviour typical of embedded, safety-critical systems.
Our results show that long-context large language models tended to achieve higher accuracy than smaller models. We also found that when a log exceeded an LLM’s context window, the model failed to reliably predict the root cause. Gemini 2.5 Pro Experimental achieved the best performance with 66.7% accuracy and produced strong reasoning in this domain, motivating further research on improving prediction accuracy for log-based root cause analysis.
Keywords: root cause analysis, log analysis, test log analysis, LLM
Additional Declarations: No competing interests reported.
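The abstract mentions zero-shot chain-of-thought prompting and notes that predictions became unreliable once a log exceeded a model's context window. A minimal sketch of both ideas follows. It is not the authors' pipeline: the prompt wording, the trigger phrase, the 4-characters-per-token estimate, and the names `build_rca_prompt`, `estimate_tokens`, and `PromptCheck` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: a common zero-shot chain-of-thought trigger phrase.
# The exact prompt used in the study is not given in the abstract.
COT_TRIGGER = "Let's think step by step."


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (hypothetical heuristic).

    A real check would use the tokenizer of the specific model.
    """
    return int(len(text) / chars_per_token) + 1


@dataclass
class PromptCheck:
    prompt: str
    estimated_tokens: int
    fits_context: bool


def build_rca_prompt(
    log_text: str,
    context_window: int,
    reserved_for_answer: int = 1024,
) -> PromptCheck:
    """Build a zero-shot chain-of-thought prompt for one test log and
    report whether it is likely to fit in the model's context window."""
    prompt = (
        "You are analysing a failed test execution log.\n"
        "Identify the most likely root cause of the failure.\n\n"
        f"Log:\n{log_text}\n\n"
        f"{COT_TRIGGER}"
    )
    tokens = estimate_tokens(prompt)
    # Leave room for the model's reasoning and final answer.
    fits = tokens + reserved_for_answer <= context_window
    return PromptCheck(prompt=prompt, estimated_tokens=tokens, fits_context=fits)
```

For example, `build_rca_prompt(log_text, context_window=128_000).fits_context` could be used to flag logs that would need truncation or chunking before being sent to a model; how such logs were actually handled in the study is not stated in the abstract.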
Editorial decision: Revision requested, 11 May, 2026
Reviews received at journal: 07 May, 2026
Reviews received at journal: 07 May, 2026
Reviews received at journal: 15 Apr, 2026
Reviewers agreed at journal: 24 Mar, 2026
Reviewers agreed at journal: 18 Mar, 2026
Reviewers agreed at journal: 17 Mar, 2026
Reviewers invited by journal: 17 Mar, 2026
Editor assigned by journal: 16 Mar, 2026
Submission checks completed at journal: 14 Feb, 2026
First submitted to journal: 13 Feb, 2026
Authors and affiliations
Rahmanu Hermawan (corresponding author), Mälardalen University
Alessio Bucaioni, Mälardalen University
Eduard Enoiu, Mälardalen University
Wasif Afzal, Mälardalen University
Mehrdad Saadatmand, RISE Research Institutes of Sweden
Nedim Zaimovic, Alstom (Sweden)
Md Saleh Ibtasham, Alstom (Sweden)
Submitted to: Innovations in Systems and Software Engineering (http://link.springer.com/journal/11334)
© Research Square 2026 | ISSN 2693-5015 (online)