Robotic pursuit evasion problem in a constrained game area using deep reinforcement learning and self-play training

Research Article | Research Square

Chiraz BEN JABEUR, Hassene SEDDIK, Khaled KHNISSI, Ahmad HABLY

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6279213/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

The pursuit evasion game (PEG) belongs to the class of dynamic differential games, which has received a lot of attention thanks to its ability to model many real-life applications in areas such as military operations, aerospace and mobile robotics.
Several mathematical tools and processes are used to solve such problems, but recent techniques relying on deep reinforcement learning (DRL) have gained popularity, in particular DRL techniques suited to problems with continuous action spaces, such as Deep Deterministic Policy Gradient (DDPG). Most of these studies use a two-phase training approach: in the first phase only the pursuer is trained, against a fixed trajectory of the evader, and in the second phase both DRL agents are trained simultaneously. The first phase requires trajectory generation, which introduces bias, although the approach remains suited to unbounded game areas. DDPG, on the other hand, is known to suffer from a value overestimation problem, which led to the introduction of twin delayed DDPG (TD3). Only a small portion of the scientific literature uses TD3 for a one-versus-one pursuit evasion game, especially in a bounded game area and without relying on a two-phase training approach. This paper explores a one-on-one pursuit evasion game in a constrained game area, using two TD3 agents trained simultaneously and from scratch via self-play only. Several rewards are proposed which, when combined, improve the training. Three training alternatives are presented: a normal self-play case, a case with a buffer zone, and a case with noisy actions. The three alternatives produced similar results: both the pursuer and the evader agents were able to find optimal control strategies without any human intervention or trajectory generation. The simulations showed that the agents performed better than conventional methods such as Non-linear Model Predictive Control (NMPC). This study proposes a novel framework for a one-on-one PEG in a constrained environment, leveraging self-play to train two TD3 agents simultaneously and from scratch.
Theoretical contributions include the design of a multi-faceted reward function that integrates game status, evolution, duration, and agent performance to enhance training. Practical contributions involve evaluating three training configurations (normal self-play, self-play with a buffer zone, and noisy actions) and demonstrating their effectiveness. Results show that all configurations enable the agents to discover optimal strategies autonomously, outperforming conventional methods such as NMPC. Simulations reveal the agents' intelligent behaviors, with the pursuer leveraging game constraints and the evader executing evasive maneuvers. Comparison with existing methods highlights superior capture time and adaptability to noise, paving the way for real-world implementations.

Keywords: Pursuit evasion game (PEG), deep reinforcement learning (DRL), Deep Deterministic Policy Gradient (DDPG)

Additional Declarations: No competing interests reported.
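The abstract notes that TD3 was introduced to counter the value overestimation seen in DDPG. As a rough illustration of that general mechanism (not the authors' implementation), the sketch below computes a TD3-style learning target for a single transition with a scalar action. The function name `td3_target`, the callable arguments, and the default hyperparameters are assumptions chosen for illustration; none of them come from this preprint.

```python
import random


def td3_target(reward, done, next_state,
               actor_target, critic1_target, critic2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5,
               max_action=1.0, rng=random):
    """Clipped double-Q target for one transition (scalar action).

    All names and default values here are illustrative assumptions.
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = rng.gauss(0.0, policy_noise)
    noise = max(-noise_clip, min(noise_clip, noise))
    next_action = actor_target(next_state) + noise
    # Keep the smoothed action inside the valid action range.
    next_action = max(-max_action, min(max_action, next_action))

    # Clipped double-Q: use the smaller of the two target critics' estimates,
    # which is what limits the overestimation seen with a single critic.
    q_next = min(critic1_target(next_state, next_action),
                 critic2_target(next_state, next_action))

    # No bootstrapping past a terminal transition (e.g. capture or timeout).
    return reward + gamma * (1.0 - float(done)) * q_next
```

In a self-play setting such as the one described above, the pursuer and the evader would each hold their own actor and pair of critics and compute a target of this form from their own reward signal; how the preprint wires this together is not specified in the abstract.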
Author affiliations: Chiraz BEN JABEUR (ENSIT); Hassene SEDDIK, corresponding author (Université Virtuelle de Tunis (UVT)/ENSIT, RIFTSI-lab); Khaled KHNISSI (Université Virtuelle de Tunis (UVT)/ENSIT, RIFTSI-lab); Ahmad HABLY (Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab).

Posted: April 1st, 2025.