Duplicate Pull Requests in Code Management Platforms: A Systematic Literature Review

Rania Ben Chekaya, Kamel Garrouch, Mohamed Nazih Omri

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8732393/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Context. Duplicate pull requests (PRs) plague code management platforms like GitHub, squandering valuable reviewer effort, delaying integrations, and frustrating contributors. While previous research has explored automated detection methods, the problem remains prevalent across both open-source and proprietary projects.

Objective. This study conducts a systematic literature review to comprehensively examine the phenomenon of duplicate PRs. We aim to synthesize existing knowledge on their root causes, frequency, detection methods, and impacts, while evaluating the effectiveness of current approaches and identifying gaps for future research.

Methods. Our research follows a systematic methodology for identifying and analyzing relevant literature. We selected and reviewed 11 primary studies focused specifically on duplicate PR detection, complemented by an analysis of 39 additional works on general PR management to provide context. The review process combined quantitative analysis of reported results with qualitative synthesis of methodologies, features, and limitations.

Results. The analysis reveals that approximately 3-12% of PRs in active repositories are duplicates, with the highest occurrence in projects lacking clear contribution guidelines. Current detection approaches fall into two categories, retrieval-based and classification-based methods, and use features ranging from simple textual similarity to complex combinations of textual and non-textual attributes. The evaluation shows that while traditional methods using TF-IDF and cosine similarity achieve 55-83% recall, more recent approaches incorporating deep learning and contextual analysis reach accuracy of up to 92%.

Conclusion. Duplicate PRs represent a substantial inefficiency that demands systematic solutions. Our synthesis suggests that combining improved detection algorithms with better community practices could significantly reduce duplication. The review identifies critical research gaps, including limited cross-platform studies, inadequate handling of semantic duplicates, and insufficient attention to human factors, providing a foundation for future work in this domain.

Keywords: GitHub, Pull Request, Duplicate Pull Request, Code Management, Natural Language Processing

Additional Declarations: No competing interests reported.
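To make the retrieval-based family named in the Results concrete, the following is a minimal sketch of ranking existing PRs against a new one by TF-IDF cosine similarity. It is an illustration under stated assumptions, not the method of any reviewed study: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_candidates`), the smoothed IDF formula, and the example PR titles are all hypothetical, and only PR titles are compared.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) so no weight is zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(new_pr, existing_prs, top_k=3):
    """Rank existing PRs by similarity to new_pr; return (index, score) pairs."""
    vectors = tfidf_vectors(existing_prs + [new_pr])
    query = vectors[-1]
    scores = [(i, cosine(query, v)) for i, v in enumerate(vectors[:-1])]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical PR titles, used only for illustration.
existing = [
    "Fix null pointer exception in user login handler",
    "Add dark mode toggle to settings page",
    "Update README with installation instructions",
]
new = "Fix null pointer crash in login handler"
ranking = rank_candidates(new, existing)
for index, score in ranking:
    print(f"{score:.3f}  {existing[index]}")
```

A retrieval-based detector would present the top-ranked candidates to a reviewer, whereas a classification-based one would feed such similarity scores, possibly together with non-textual attributes, into a model that decides whether a pair is a duplicate.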
Author affiliations: Rania Ben Chekaya (corresponding author), Higher Institute of Computer Science and Communication Technologies; Kamel Garrouch, Higher Institute of Management of Sousse; Mohamed Nazih Omri, National Engineering School of Sousse.
Posted on Research Square: February 23rd, 2026.