Application of reinforcement learning methods to allocate logistics resources to production halls in an automotive industry

P. Haghshenas, S.M.T. Fatemi Ghomi, H. Mosadegh

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-3864140/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Efficiently managing internal logistics in the contemporary automobile industry is paramount. This paper delves into the simulation of an internal logistics (IL) system within an automotive factory, employing reinforcement learning. By capturing the unique IL characteristics of the factory, this paper formulates a comprehensive simulation model characterized by its incorporation of sparse reward mechanisms. This paper uses two distinct algorithms.
The first algorithm is the multi-agent deep deterministic policy gradient (MADDPG), enhanced by integrating the Baseline to accommodate discrete actions. The second algorithm, the shared experience deep Q-network (SEDQN), leverages prioritized experience replay to improve its handling of sparse rewards. Numerical experiments are conducted to validate both the model's accuracy and the algorithms' efficacy.

Keywords: Simulation, Internal logistics, Reinforcement learning, Sparse rewards

1. Introduction

In today's rapidly evolving world, the automotive industry faces numerous challenges in its internal logistics processes, and optimizing transportation cost and time within these processes has become crucial. Trucking delivery costs have also become a significant consideration in the logistics industry. According to [1], the average trucking rates in 2023 are $2.76 per mile for vans, $3.19 per mile for reefers, and $3.14 per mile for flatbeds. These rates largely determine the cost of moving goods and materials, which places higher demands on internal transportation in the automobile industry. In response, prominent players in the automotive sector have rapidly evolved to address these challenges and adapt to the changing logistics landscape.

Traditionally, the automobile industry has relied on outdated systems to assign trucks for moving car bodies between the various salons (production halls), including the paint shop, body maker, and assembly line, as shown in Fig. 1. These trucks transport car bodies, ensuring a smooth workflow within the production process. However, such systems run into difficulty when specific trucks are left without assigned tasks because salons do not have enough car bodies, or when certain salons become overwhelmed with a backlog of bodies or trucks.
These challenges stem from allocating a fixed number of trucks to deliveries between each pair of salons, which leaves those trucks underutilized when no further work is available. Furthermore, a delayed truck increases the cost of waiting time and lowers the production level. Consequently, the truck allocation process and path planning need to be improved.

This study outlines its contributions as follows:

- Internal logistics optimization: The paper formulates a simulation model that replicates the factory environment. Within this model, trucks assume the role of agents, learning when and to which salons they should proceed. Through deep reinforcement learning, the objective is to allocate trucks efficiently, minimizing periods of inactivity and averting excessive queues at the salons. Central to the approach are two methods for assigning trucks to salons: the shared experience deep Q-network (SEDQN) and the multi-agent deep deterministic policy gradient (MADDPG).
- Prioritized experience replay: The paper applies prioritized experience replay to SEDQN and examines how its behavior changes as a result.
- Comparative analysis: The paper compares the results of MADDPG in a discrete action space with those of the SEDQN algorithm, to determine which approach yields better outcomes across various simulation scenarios.

2. Literature review

2.1. Internal logistics

The logistics concept can be traced back to its roots in the military sector [1]. In the business context, the term logistics was formally introduced in 1964 as a process known as business logistics, primarily focused on the physical distribution of goods [2].
In modern terms, logistics encompasses the planning, implementation, and control of the efficient and effective flow and storage of goods, services, and related information from the point of origin to the point of consumption, in order to meet customer requirements [2], [3]. Logistics can be further divided into external and internal logistics. External logistics pertains to transporting, storing, and delivering goods to customers or other businesses [2]. Internal logistics, on the other hand, deals with support operations and the movement of materials within a company [4]. Internal logistics systems can be classified into classical, mechanized, and automated systems based on the level of human operator involvement [5]. The literature is reviewed here from two aspects: internal logistics in the automobile industry and reinforcement learning applied to internal logistics.

Internal logistics in the industry

Over the years, numerous articles have addressed internal logistics within the automotive industry. For instance, Baller et al. [7] introduced a mixed-integer linear programming approach to address internal logistics for an automotive industry's first-tier suppliers at the factory's aggregation level. In [8], a mixed-integer linear programming approach was applied to assembly line optimization. Another work [9] tackled a mixed-model sequencing problem involving stochastic processing times in a multi-station assembly line. The authors of [5] used a simulation-based approach to evaluate assembly lines from an internal logistics perspective. Z. Čujan [10] explored the potential of Tecnomatix Plant Simulation software for simulating supply processes, particularly the Milk Run system. Anand et al. [11] designed cost-effective AGVs for internal logistics within the industry, highlighting the versatility of AGVs as an alternative to the trucks in our case study. Macedo et al. [12] proposed a simulation-based decision support tool for manufacturing supermarket operations. Nourmohammadi et al. [13] extended the scope of supermarket location decisions by accounting for the availability of various transport vehicles. Coelho et al. [14] developed two simulation models using the Simio simulation modeling framework with intelligent objects. The study in [15] addressed the routing problem in IL with an integer linear programming (ILP) model and a simulation-based iterated local search (SimILS) algorithm; the solution markedly enhanced the company's operational processes. The study in [16] examined a segment of a mixed-model assembly line encompassing a supermarket, assorted kits, human pickers, and automated guided vehicles (AGVs), exploring their influence on IL within the automotive industry.

Reinforcement learning in industry

Reinforcement learning has also been applied to logistics and production problems. For instance, [17] employed the Proximal Policy Optimization (PPO) method to optimize internal logistics within the automotive sector. In another study, [18] leveraged model-based reinforcement learning for AGVs in the automotive domain. Moreover, [19] focused on assembly line optimization through a multi-AGV dispatching strategy. The work in [20] applied a learning-based iterative method to vehicle routing, combining learning with operations research techniques. Additionally, [21] introduced a multi-agent attention model to tackle the vehicle routing problem, supported by experimental results.

Research gap

As the preceding sections show, the research landscape pertaining to IL and reinforcement learning is a rich and influential domain. However, a conspicuous gap remains in the application of reinforcement learning to internal logistics, specifically in the automotive industry context.
This paper aims to bridge this gap by directly comparing the performance of the discrete MADDPG and SEDQN algorithms in automotive internal logistics. Such a comparative assessment deepens our understanding of algorithm efficacy and sheds light on practical applicability under real operational constraints.

The paper is structured as follows. Section 3 discusses the factory layout and task definition, detailing the simulation setup. Section 4 introduces the multi-agent reinforcement learning algorithms. Section 5 presents the results of numerical experiments. Finally, Section 6 is devoted to the conclusion and recommendations.

3. Problem description

The company under study is tasked with allocating a specific number of trucks to facilitate material flow between designated salons, as illustrated in Fig. 2. These trucks transport bodies to their intended locations and return empty to reload with new bodies. The return of empty trucks wastes substantial time, which escalates the logistic costs of the system. Given the prohibitive expense of studying IL behavior in real-world scenarios, this paper resorts to simulation-based methods.

The paper employs the RWARE environment [22], [23]. This environment emulates a grid-world warehouse in which agents (robots) navigate to locate requested shelves, transport them to assigned workstations, and then return them. The reward is granted only upon successful delivery, and agents perceive a 3 × 3 grid containing information about neighboring agents and shelves. Agents can move, rotate, and load or unload a shelf. The environment encompasses three tasks characterized by variations in world size, agent count, and shelf requests.
The reward sparsity makes this a challenging environment, since agents must complete a series of actions before obtaining any reward. Additionally, observations within this environment are sparse and high-dimensional compared to other settings [23]. RWARE is an open-source framework released under the MIT License and is accessible at [24].

In our context, trucks are modeled as the agents and car bodies as "shelves," aligning with the environment's terminology. The paper devises a 7 × 11 grid world that represents the actual layout of the factory salons, illustrated in Fig. 3. This layout comprises a total of 9 salons in three categories: body makers (3 salons), paint shops (3 salons), and assembly lines (3 salons). The body maker salons exclusively send bodies and do not receive any. Consequently, each is modeled as a single state within our environment, acting as a source of shelves. The paint shops, on the other hand, both accept bodies and send them to other salons. Each is therefore modeled with two states: one to receive bodies (as a goal) and another to dispatch them (as shelves). Lastly, the assembly lines serve only as recipients, so each requires a single state representing a goal.

RWARE's custom layout capabilities are used to construct the environment layout depicted in Fig. 3. Turning salons into body senders requires a new function that generates new shelves in the designated state of the grid, effectively allowing the creation of multiple shelves. A second modification changes the behavior of agents delivering bodies to goals, in particular the requirement that they return the shelf to an empty position. This entails a redesign of the shelf mechanics so that a shelf is removed from the grid once it has been delivered to the goal. A third crucial modification is the integration of a stop function.
This function prevents agents from delivering a shelf collected from the body maker directly to the assembly line. Delivery is governed by multiple conditions to ensure that each body follows the proper sequence.

Agents receive a partial observation of the environment, a 3 × 3 grid centered on the agent, capturing the following elements:

- Location, rotation, and shelf-carrying status of the agent.
- Location and rotation of other robots.
- Positions of shelves and their presence in the request queue.

Each value in the observation represents a specific element, such as requested shelves, agent positions, or sensor data. Sensors extend an agent's perception beyond its immediate state. For instance, a sensor value of 1 increases the agent's visibility by one step. With this setting the observation comprises 88 values; without sensors, it reduces to 33.

Agents act in a discrete action space: A = {Turn Left, Turn Right, Move Forward, Load/Unload Shelf}. Actions thus encompass rotations, forward movement, and shelf interactions at predetermined locations. The environment dynamics reflect real-world warehousing: agents may pass beneath shelves when unloaded but must use the corridors when carrying a shelf, avoiding obstructions. Collisions are resolved in a way that allows for maximum mobility. When two or more agents attempt to move to the same location, priority is given to the one that blocks others. Otherwise, the selection is arbitrary.

Regarding rewards, a set number of shelves R is requested at each time interval. Delivering a requested shelf to a goal location triggers the sampling of another shelf for the request queue. Agents receive a reward of 1 for successfully delivering a requested shelf to a goal location.

4. Algorithm

4.1. Shared experience deep Q-learning

The Deep Q-Networks (DQN) algorithm was a groundbreaking work that led to substantial advancements in tackling intricate sequential decision-making challenges [25].
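For reference, the temporal-difference (TD) error and loss that DQN minimizes [25] can be written as below. The symbols are the conventional ones and are assumptions of this note rather than notation taken from the paper: \(\theta\) denotes the online network parameters, \(\theta^{-}\) the periodically synchronized target network parameters, \(\gamma\) the discount factor, and \(\mathcal{D}\) the replay buffer.

```latex
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}), \qquad
\delta = y - Q(s, a; \theta), \qquad
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \delta^{2} \right]
```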
Subsequent innovations have extended the capabilities of the DQN algorithm: Double DQN mitigates overestimation bias [26], prioritized experience replay enhances data efficiency [27], and the dueling network architecture improves generalization across actions [28]. This paper builds on the work of [22], which reports strong results under sparse rewards. Instead of passing each agent's transitions only to that agent, the transitions of all agents are passed to the learner of each agent. Agents can therefore learn from the experiences of other agents without necessarily having identical reward functions, as illustrated in Fig. 4, which shows the path each agent should take to reach the goal while learning from the other agents. This paper additionally applies prioritized experience replay to SEDQN: it calculates the TD error for all agents and then passes the TD errors back to the replay buffer as priorities.

4.2. Multi-agent deep deterministic policy gradient (MADDPG)

Traditional reinforcement learning methods such as Q-learning or policy gradient algorithms cannot be applied well to multi-agent environments, since the policy of each agent changes during training. From the perspective of a single agent, the environment becomes non-stationary, which undermines learning stability when historical data from the experience replay pool are reused for training. In addition, the high variance introduced by multiple agents degrades traditional algorithms. Therefore, [29] proposed the multi-agent deep deterministic policy gradient (MADDPG). In this algorithm, first, each agent selects its next action using only its local information; second, the method does not require a differentiable model of the environment dynamics, and no particular structure needs to be imposed on the interaction between agents; third, it can be used in competitive as well as cooperative environments, so its application scenarios are very wide. As shown in Fig. 5, each agent has an actor-critic structure.
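The actor-critic arrangement of MADDPG with discrete actions can be illustrated with a toy sketch. It is not the implementation used in the experiments: the sizes, the random linear "networks", and the names `Agent`, `act`, and `q_value` are hypothetical, and the Gumbel-Softmax sample stands in for the discretization technique revisited in [30]. The point is only the data flow: each actor reads its own observation, while each critic reads the observations and actions of all agents.

```python
import math
import random

N_AGENTS, OBS_DIM, N_ACTIONS = 2, 4, 4  # toy sizes chosen for illustration only


def linear(weights, x):
    """Bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


class Agent:
    def __init__(self, rng):
        # Decentralized actor: its input is this agent's own observation only.
        self.actor_w = [[rng.uniform(-1, 1) for _ in range(OBS_DIM)]
                        for _ in range(N_ACTIONS)]
        # Centralized critic: its input is every agent's observation and action.
        joint_dim = N_AGENTS * (OBS_DIM + N_ACTIONS)
        self.critic_w = [[rng.uniform(-1, 1) for _ in range(joint_dim)]]

    def act(self, own_obs, rng, tau=1.0):
        """Return a relaxed one-hot action vector over the discrete actions."""
        return gumbel_softmax(linear(self.actor_w, own_obs), tau, rng)

    def q_value(self, all_obs, all_actions):
        """Score the joint observation-action pair with this agent's critic."""
        joint = [v for obs in all_obs for v in obs]
        joint += [v for action in all_actions for v in action]
        return linear(self.critic_w, joint)[0]


rng = random.Random(0)
agents = [Agent(rng) for _ in range(N_AGENTS)]
observations = [[rng.random() for _ in range(OBS_DIM)] for _ in range(N_AGENTS)]
relaxed = [agent.act(obs, rng) for agent, obs in zip(agents, observations)]
# The environment receives the index of the largest entry as the discrete action.
discrete = [max(range(N_ACTIONS), key=probs.__getitem__) for probs in relaxed]
q_values = [agent.q_value(observations, relaxed) for agent in agents]
```

In a real implementation the linear maps would be trained neural networks and the relaxed sample would carry gradients from the critic back to the actor; the sketch omits training entirely.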
Through the actor network, an agent selects an action according to the current state, while the critic network scores the actor's choice according to the state and action and feeds this score back to the actor network, so that the actor can adjust its policy and strive for better performance next time. In addition, each critic network can receive the information of all agents and thus make better evaluations for its corresponding agent. This paper follows [30], which applies the Gumbel-Softmax to discretize the actions returned by the actor network.

5. Numerical experiments

5.1. Parameter setting

This section examines the IL optimization problem in various customized factory layouts, encompassing small-scale, medium-scale, and large-scale scenarios, to validate the effectiveness of both the model and the algorithms. The hyperparameters for SEDQN are set as follows: a_start: 0.5, a_end: 0.0, a_last_episode: 100, b_start: 0.4, b_end: 1.0, and b_last_episode: 100. These parameters reflect the prioritized replay configuration. The exploration parameters are eps_start: 1.0, eps_end: 0.15, and eps_last_episode: 1000. Additionally, batch_size (the size of each sampled batch) is set to 1000, capacity is 10000, gamma is 0.78423, hidden_size is 64, lr is 0.097577, and sync_rate is 10, which is the number of time steps between copies of the network to the target network. The MADDPG algorithm's parameters are actor_lr: 0.0001, critic_lr: 0.001, batch_size: 100, capacity: 10000, gamma: 0.99, gradient_clip: 0.5, hidden_size: 64, policy_regulariser: 0.01, and soft_update_size: 0.01. Notably, the hidden size is kept constant at 64 for both algorithms, while the other parameters have been adjusted for each algorithm. The grid sizes span from 3×5 to 7×11, with the number of agents varying from 2 to 5.
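To make the role of the a_*, b_*, and eps_* settings concrete, the following is a minimal sketch of a shared, proportionally prioritized replay buffer in the spirit of [27]. Several points are assumptions rather than statements from the paper: that a_* and b_* are the prioritization and importance-sampling exponents (usually written alpha and beta), that all three quantities are annealed linearly between their start and end values, and that every agent writes into one common buffer. The names `SharedPrioritizedReplay`, `Transition`, and `linear_schedule` are hypothetical.

```python
import random
from collections import namedtuple

# One stored step; "agent" records which truck produced the transition.
Transition = namedtuple("Transition", "agent obs action reward next_obs done")


def linear_schedule(start, end, last_episode, episode):
    """Interpolate linearly from start to end, then hold end (assumed shape)."""
    if episode >= last_episode:
        return end
    return start + (end - start) * episode / last_episode


class SharedPrioritizedReplay:
    """Single buffer filled by all agents, sampled in proportion to |TD error|."""

    def __init__(self, capacity, eps=1e-3):
        self.capacity = capacity
        self.eps = eps            # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0              # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions take the current maximum priority so they get replayed.
        priority = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, alpha, beta):
        scaled = [p ** alpha for p in self.priorities]
        total = sum(scaled)
        indices = random.choices(range(len(self.data)), weights=scaled, k=batch_size)
        n = len(self.data)
        # Importance-sampling weights offset the bias of non-uniform sampling.
        weights = [(n * scaled[i] / total) ** (-beta) for i in indices]
        largest = max(weights)
        weights = [w / largest for w in weights]   # normalize by the batch maximum
        return indices, [self.data[i] for i in indices], weights

    def update(self, indices, td_errors):
        # TD errors computed by the learners are written back as new priorities.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps


# Values from Section 5.1, evaluated at a hypothetical episode 50.
episode = 50
alpha = linear_schedule(0.5, 0.0, 100, episode)      # a_start, a_end, a_last_episode
beta = linear_schedule(0.4, 1.0, 100, episode)       # b_start, b_end, b_last_episode
epsilon = linear_schedule(1.0, 0.15, 1000, episode)  # eps_start, eps_end, eps_last_episode
```

Under this reading, alpha decaying to 0.0 means sampling becomes uniform after episode 100, while beta rising to 1.0 means the importance-sampling correction is applied in full by then.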
Within this range, grid sizes from 3×5 to 3×6 are categorized as small-scale cases, sizes from 5×6 to 5×7 as medium-scale, and sizes from 6×10 to 7×11 as large-scale scenarios.

5.2. Benchmarks

This paper performs a comparative analysis between the SEDQN and MADDPG algorithms.

(1) SEDQN: The training data is generated through random sampling. The dependencies needed to implement the SEDQN algorithm environment include Python (3.9.13), OpenAI gym (0.21.0), PyTorch (1.13.1), PyTorch Lightning (1.6.0), and NumPy (1.25.1). The experiments are conducted on a computer with a 3.6 GHz Intel Core i7 processor, 32 GB of RAM, and an Nvidia GeForce RTX 2070 SUPER with 8 GB of VRAM.

(2) MADDPG: As with SEDQN, the training data is obtained through random sampling. The dependencies required to establish the MADDPG algorithm environment consist of Python (3.9.13), OpenAI gym (0.21.0), PyTorch (1.13.1), PyTorch Lightning (1.6.0), and NumPy (1.14.5). The experiments are executed on a high-performance computing (HPC) cluster equipped with 64 CPU cores.

5.3. Experimental results

Figure 6 depicts the results of SEDQN with and without prioritized experience replay. The red charts represent the results of SEDQN with prioritized experience replay, and the orange charts illustrate the results for basic SEDQN. SEDQN with prioritized experience replay achieves much better results than the basic version; it may take longer to train, but it reaches better outcomes.

Table 1 reports the outcomes for the small-scale cases, comparing the numerical results of SEDQN with those obtained by MADDPG and underscoring the efficacy of both the model and the algorithms employed. Notably, in less intricate environments, MADDPG exhibits a remarkable capacity for learning, surpassing SEDQN in performance.
This observation suggests that increasing the sparsity of the rewards could enable SEDQN to excel. Conversely, reducing the sparsity of rewards could lead to MADDPG outperforming SEDQN while also learning more quickly. The results in Table 1, especially for the small grid size scenario (S1), underscore MADDPG's capability to achieve superior outcomes within shorter timeframes.

Table 1. The small-scale results (grid size and number of agents are identical for SEDQN and MADDPG)

| Case ID | Grid size | Number of agents | SEDQN reward | SEDQN time | MADDPG reward | MADDPG time | Sparsity |
| S1 | \(3\times 5\) | 2 | 34.50 | 3.00 h | 48.61 | 2.30 h | low |
| S2 | \(3\times 5\) | 2 | 74.88 | 5.00 h | 53.58 | 5.30 h | high |
| S3 | \(3\times 6\) | 2 | 53.98 | 5.00 h | 36.42 | 5.30 h | high |

This trend persists across cases M1 and M3, as demonstrated in Table 2 for medium-scale scenarios. However, as the grid size expands or the environment's complexity intensifies, MADDPG's performance declines discernibly. MADDPG also requires longer learning periods than SEDQN, particularly in scenarios characterized by sparse rewards. In such contexts, MADDPG may yield comparatively weaker results, which motivates the recommendation to favor SEDQN for larger applications, as evidenced in Table 3.

Table 2. The medium-scale results (grid size and number of agents are identical for SEDQN and MADDPG)

| Case ID | Grid size | Number of agents | SEDQN reward | SEDQN time | MADDPG reward | MADDPG time | Sparsity |
| M1 | \(5\times 6\) | 3 | 30.65 | 4.00 h | 36.96 | 9.00 h | low |
| M2 | \(5\times 6\) | 4 | 40.41 | 5.00 h | 30.82 | 11.00 h | high |
| M3 | \(5\times 7\) | 4 | 23.59 | 6.00 h | 25.40 | 18.00 h | medium |

Furthermore, the accompanying charts depict the impact of sparsity on SEDQN and MADDPG. The blue charts represent SEDQN results, while the red ones illustrate MADDPG outcomes.
Notably, MADDPG needs a higher number of steps than SEDQN to learn in each environment. The influence of sparsity on MADDPG's performance is clearly visible in Fig. 7 for the S2 scenario and in Fig. 8 for cases M1 and M3, which show that MADDPG requires additional time to explore the environment compared to SEDQN.

Table 3. The large-scale results (grid size and number of agents are identical for SEDQN and MADDPG)

| Case ID | Grid size | Number of agents | SEDQN reward | SEDQN time | MADDPG reward | MADDPG time |
| L1 | \(6\times 11\) | 4 | 14.56 | 4.30 h | 15.88 | 18.00 h |
| L2 | \(7\times 10\) | 5 | 37.22 | 4.00 h | 33.33 | 23.00 h |
| L3 | \(7\times 11\) | 5 | 70.33 | 30.00 h | 67.47 | 30.00 h |

Figure 7 depicts the small-scale results for S1, S2, and S3. The red charts represent the results from MADDPG, and the blue charts illustrate the results for SEDQN. MADDPG performs better only in S1; environments with higher reward sparsity cause MADDPG to obtain worse results. Although MADDPG yields better results at the start of learning, its results later deteriorate because of the sparsity, as it tries to explore more.

Figure 8 depicts the medium-scale results for M1, M2, and M3. The red charts represent the results from MADDPG, and the blue charts illustrate the results for SEDQN. In M1 and M2, MADDPG performed better than SEDQN at certain points in training, although its results worsened after further exploration. In M1, MADDPG is markedly better than SEDQN, which converged from the middle of the learning time onward. In M2, SEDQN performed better than MADDPG because of the higher reward sparsity. In M3, MADDPG and SEDQN achieve similar results.

Figure 9 depicts the large-scale results for L1, L2, and L3. The red charts illustrate the results from MADDPG, and the blue charts show the results for SEDQN. In L1, MADDPG performs better, although SEDQN could probably beat MADDPG in L1 if the learning time were increased.
In L2 and L3, SEDQN performs better than MADDPG. Given longer training times, all large-scale environments behave as high-sparsity settings, and on a large scale SEDQN consistently performs better.

6. Conclusion and recommendation

This paper examined the real-world application of machine learning to a challenging problem of internal logistics (IL) in transportation. It presented two methodologies to describe IL in a factory. The paper employs the RWARE environment to analyze the internal logistics of an automotive manufacturing setting. The simulation outcomes uncovered notable periods of inactivity and associated costs in truck deliveries, highlighting the potential to reduce queue times for trucks and bodies. Such a reduction could facilitate trucks' real-time decision-making based on their location and requested loads. The investigation revealed the impact of the prioritized replay buffer on the model's performance. While the MADDPG algorithm generally provided superior results across most simulations involving reinforcement learning (RL), incorporating a prioritized buffer into SEDQN improved its learning efficiency, particularly in larger-scale scenarios. The key to effectively addressing environments with sparse rewards lies in enhancing the prioritization of buffers.

This paper recommends that future research explore sparse reward implementations on the MAPPO algorithm to devise a prioritized variant of that algorithm. Another advancement could involve designing models that circumvent agent conflicts by employing a non-conflict version.

Declarations

Author contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Peyman Haghshenas, Seyed Mohammad Taghi Fatemi Ghomi and Hadi Mosadegh. The first draft of the manuscript was written by Peyman Haghshenas and all authors commented on previous versions of the manuscript.
All authors read and approved the final manuscript.

Funding

The authors did not receive support from any organization for the submitted work.

Data Availability

The code implementing the algorithms and simulations in this study serves as the primary data for this research. It is available upon request from the corresponding author for the purpose of transparency and reproducibility.

Conflict of interest

The authors declare that they have no conflict of interest.

Human Participants and/or Animals

This study did not involve human participants or animals.

Ethical approval

This study did not require ethical approval as it did not involve human participants, animals, or sensitive data.

Acknowledgment

The author(s) declare no additional acknowledgments for this research.

References

[1] Method. (n.d.). Trucking rates per mile. Retrieved from https://www.method.me/pricing-guides/trucking-rates-per-mile/
[2] R. H. Ballou, (2007). The evolution and future of logistics and supply chain management. European Business Review, 19(4), 332-348, doi: 10.1108/09555340710760152.
[3] M. Amr, M. Ezzat, and S. Kassem, (2019, October). Logistics 4.0: Definition and historical background. In 2019 Novel Intelligent and Leading Emerging Sciences Conference (NILES) (Vol. 1, pp. 46-49). IEEE, doi: 10.1109/NILES.2019.8909314.
[4] B. Gammelgaard and P. D. Larson, (2001). Logistics skills and competencies for supply chain management. Journal of Business Logistics, 22(2), 27-50.
[5] M. Fabri, H. Ramalhinho, M. Oliver, and J. C. Muñoz, (2022). Internal logistics flow simulation: A case study in automotive industry. Journal of Simulation, 16(2), 204-216, doi: 10.1080/17477778.2020.1781554.
[6] I. G. Pascu, G. C. Neacsu, E. L. Nitu, and A. C. Gavriluta, (2021). A brief review of the methods and techniques used in the innovative internal logistics processes and systems. In IOP Conference Series: Materials Science and Engineering (Vol. 1018, No. 1, p. 012023). IOP Publishing, doi: 10.1088/1757-899X/1018/1/012023.
[7] R. Baller, P. Fontaine, S. Minner, and Z. Lai, (2022). Optimizing automotive inbound logistics: A mixed-integer linear programming approach. Transportation Research Part E: Logistics and Transportation Review, 163, 102734, doi: 10.1016/j.tre.2022.102734.
[8] H. Mosadegh, M. Zandieh, and S. M. T. Fatemi Ghomi, (2012). Simultaneous solving of balancing and sequencing problems with station-dependent assembly times for mixed-model assembly lines. Applied Soft Computing, 12(4), 1359-1370, doi: 10.1016/j.asoc.2011.11.027.
[9] H. Mosadegh, S. M. T. Fatemi Ghomi, and G. A. Süer, (2020). Stochastic mixed-model assembly line sequencing problem: Mathematical modeling and Q-learning based simulated annealing hyper-heuristics. European Journal of Operational Research, 282(2), 530-544, doi: 10.1016/j.ejor.2019.09.021.
[10] Z. Čujan, (2016). Simulation of production lines supply within internal logistics systems. Open Engineering, 6(1), doi: 10.1515/eng-2016-0061.
[11] R. Anand, N. V. Vantagodi, K. A. Shanbhag, and M. Mahesh, (2019). Automated guided vehicles by permanent magnet synchronous motor: future of in-house logistics. Power Electronics and Drives, 4(1), 151-159, doi: 10.2478/pead-2019-0006.
[12] R. Macedo, F. Coelho, S. Relvas, and A. P. Barbosa-Póvoa, (2021). In-house logistics operations enhancement in the automobile industry using simulation. In Operational Research: IO 2019, Tomar, Portugal, July 22–24 20 (pp. 39-51). Springer International Publishing, doi: 10.1007/978-3-030-85476-8.
[13] A. Nourmohammadi, H. Eskandari, M. Fathi, and A. H. Ng, (2021). Integrated locating in-house logistics areas and transport vehicles selection problem in assembly lines. International Journal of Production Research, 59(2), 598-616, doi: 10.1080/00207543.2019.1701207.
[14] F. F. Coelho, S. Relvas, and A. P. Barbosa-Póvoa, (2021). Simulation-based decision support tool for in-house logistics: the basis for a digital twin. Computers & Industrial Engineering, 153, 107094, doi: 10.1016/j.cie.2020.107094.
[15] M. Fabri and H. Ramalhinho, (2023). The in-house logistics routing problem. International Transactions in Operational Research, 30(2), 1144-1168, doi: 10.1111/itor.12965.
[16] B. Zhou, J. Zhang, and Q. Fei, (2022). Bi-objective green in-house transportation scheduling and fleet size determination in mixed-model assembly lines with mobile robots. Engineering Computations, 39(7), 2630-2654, doi: 10.1108/EC-08-2021-0483.
[17] S. Mayer, T. Classen, and C. Endisch, (2021). Modular production control using deep reinforcement learning: proximal policy optimization. Journal of Intelligent Manufacturing, 32(8), 2335-2351, doi: 10.1007/s10845-021-01778-z.
[18] N. Feldkamp, S. Bergmann, and S. Strassburger, (2020, December). Simulation-based deep reinforcement learning for modular production systems. In 2020 Winter Simulation Conference (WSC) (pp. 1596-1607). IEEE, doi: 10.1109/WSC48552.2020.9384089.
[19] G. Wang, X. Wang, L. Wang, M. Shao, Y. Yu, and X. Cheng, (2021, October). Multi-AGVs dispatching strategy in automobile assembly line based on Deep Reinforcement Learning. In 2021 China Automation Congress (CAC) (pp. 6382-6386). IEEE, doi: 10.1109/CAC53003.2021.9727515.
[20] H. Lu, X. Zhang, and S. Yang, (2019, September). A learning-based iterative method for solving vehicle routing problems. In International Conference on Learning Representations.
[21] K. Zhang, F. He, Z. Zhang, X. Lin, and M. Li, (2020). Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121, 102861, doi: 10.1016/j.trc.2020.102861.
[22] F. Christianos, L. Schäfer, and S. Albrecht, (2020). Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 33, 10707-10717.
[23] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht, "Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks," no. NeurIPS, 2020, [Online]. Available: http://arxiv.org/abs/2006.07869.
[24] "GitHub - semitable_robotic-warehouse_ Multi-Robot Warehouse (RWARE)_ A multi-agent reinforcement learning environment."
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, ... and D. Hassabis, (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533, doi: 10.1038/nature14236.
[26] H. Van Hasselt, A. Guez, and D. Silver, (2016, March). Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), doi: 10.1609/aaai.v30i1.10295.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
[28] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, (2016, June). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995-2003). PMLR.
[29] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 6380-6391.
[30] C. R. Tilbury, F. Christianos, and S. V. Albrecht, (2023). Revisiting the Gumbel-Softmax in MADDPG. arXiv preprint arXiv:2302.11793.
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-3864140","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":505946759,"identity":"30f3b45a-6a2b-4800-a550-6ff9028f5619","order_by":0,"name":"P. Haghshenas","email":"","orcid":"","institution":"Amirkabir University of Technology","correspondingAuthor":false,"prefix":"","firstName":"P.","middleName":"","lastName":"Haghshenas","suffix":""},{"id":505946760,"identity":"faca5cef-e895-4c45-951c-c481d8ea1a60","order_by":1,"name":"S.M.T. Fatemi Ghomi","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAuElEQVRIiWNgGAWjYBAC9gYQaWAD40sQ1sJzgBmkJY2BgY00LQyHYVqIADzs5w8+ulFwXt7gfgPjhx8MFvmEtfAkMxvnGNw23HCMgVmyh0HCsoGQFnuGZDZpoBZGoBYGaaBfDAjbwv+Y/XeOwTl7kC2/idMikczGnGNwIBGohY1IWyQeGwMdlpw881him2WPAVEOS3z4OeePnW3f4cOHb/yoqCOsBQkwNgDjlBQNo2AUjIJRMApwAgBY9zHZvbgqEwAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0003-4363-994X","institution":"Amirkabir University of Technology","correspondingAuthor":true,"prefix":"","firstName":"S.M.T.","middleName":"Fatemi","lastName":"Ghomi","suffix":""},{"id":505946761,"identity":"d5abe496-4b6a-4a2e-8438-91cdb53267fd","order_by":2,"name":"H. 
Mosadegh","email":"","orcid":"","institution":"Amirkabir University of Technology","correspondingAuthor":false,"prefix":"","firstName":"H.","middleName":"","lastName":"Mosadegh","suffix":""}],"badges":[],"createdAt":"2024-01-14 19:16:34","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-3864140/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-3864140/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":90521544,"identity":"d8c03807-47bb-4333-a39a-db2536e5b13b","added_by":"auto","created_at":"2025-09-03 15:53:02","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":40514,"visible":true,"origin":"","legend":"\u003cp\u003eFlow of car bodies\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/ec061e7ecc7fb4e12f8ccd5a.png"},{"id":90521545,"identity":"706c6f01-74fb-40c1-aa6b-ef2a0de48a9d","added_by":"auto","created_at":"2025-09-03 15:53:02","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":40031,"visible":true,"origin":"","legend":"\u003cp\u003eTruck path\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/50b029e3dea461e7b3d604f0.png"},{"id":90523409,"identity":"90b414f5-3521-4018-907b-525607945a87","added_by":"auto","created_at":"2025-09-03 16:17:02","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":66847,"visible":true,"origin":"","legend":"\u003cp\u003eSalons positions\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/dd72138e5bd3fe59bb719d64.png"},{"id":90521548,"identity":"2a7893ca-53e8-4169-890b-68b5321b4ef2","added_by":"auto","created_at":"2025-09-03 15:53:02","extension":"png","order_by":4,"title":"Figure 
4","display":"","copyAsset":false,"role":"figure","size":24027,"visible":true,"origin":"","legend":"\u003cp\u003eFlow of agents(triangles) the goal (square)\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/b8f93fc2cba2562747423ce6.png"},{"id":90522481,"identity":"36c729e7-33e8-4291-9cab-233038745709","added_by":"auto","created_at":"2025-09-03 16:01:02","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":41393,"visible":true,"origin":"","legend":"\u003cp\u003eThe actor-critic structure of the agents.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/5103fb1cf97d746163022c80.png"},{"id":90522483,"identity":"34f89cda-1378-4427-99df-68dbb7ccd8e6","added_by":"auto","created_at":"2025-09-03 16:01:02","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":47123,"visible":true,"origin":"","legend":"\u003cp\u003ePrioritized Replay result on SEDQN\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/fa9b9fedd475919352d83527.png"},{"id":90523260,"identity":"122bfe24-fac5-4503-bd56-b19695f6328d","added_by":"auto","created_at":"2025-09-03 16:09:02","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":74541,"visible":true,"origin":"","legend":"\u003cp\u003eSmall-scale charts\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/73eb9927f28ba066a4327795.png"},{"id":90521550,"identity":"7be97272-b484-4966-bd1d-970a855b203a","added_by":"auto","created_at":"2025-09-03 15:53:02","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":75436,"visible":true,"origin":"","legend":"\u003cp\u003eMedium-scale 
charts\u003c/p\u003e","description":"","filename":"8.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/dd0e0e4a8c152960967c007e.png"},{"id":90521554,"identity":"347474c5-1d98-49eb-bf3b-2cc7032ce66f","added_by":"auto","created_at":"2025-09-03 15:53:02","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":60681,"visible":true,"origin":"","legend":"\u003cp\u003eLarge-scale charts\u003c/p\u003e","description":"","filename":"9.png","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/b30253ac9ca6afdd76f180c7.png"},{"id":91773562,"identity":"7906631b-a42c-4644-bfb5-782835b1e7aa","added_by":"auto","created_at":"2025-09-20 17:18:06","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1024742,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-3864140/v1/dba3cf08-3094-44b0-9592-a64380337d90.pdf"}],"financialInterests":"","formattedTitle":"Application of reinforcement learning methods to allocate logistics resources to production halls in an automotive industry","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIn today's rapidly evolving world, the automotive industry faces numerous challenges in its internal logistics processes. It becomes crucial to optimize transportation costs and time within the automobile industry's internal logistics. Moreover, trucking delivery costs have become a significant consideration in the logistics industry. According to recent data from [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], the average trucking rates per mile in 2023 are van rates at \u003cspan\u003e$\u003c/span\u003e2.76 per mile, reefer rates at \u003cspan\u003e$\u003c/span\u003e3.19 per mile, and flatbed rates at \u003cspan\u003e$\u003c/span\u003e3.14 per mile. 
These rates play a crucial role in determining transportation costs associated with moving goods and materials, which puts higher requirements for internal transportation in the automobile industry. Embracing this demand, prominent players in the automotive sector, have rapidly evolved to address these challenges and adapt to the changing logistics landscape.\u003c/p\u003e\u003cp\u003eTraditionally, the automobile industry has relied on outdated systems to assign trucks for moving car bodies between various salons, including the paint shop, body maker, and assembly line, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. These trucks transport car bodies, ensuring a smooth workflow within the production process. However, using such systems presents challenges when specific trucks are left without assigned tasks due to salons needing more car bodies or when certain salons become overwhelmed with a backlog of bodies or trucks. These challenges stem from a specific number of trucks allocated for deliveries between two salons, resulting in their underutilization when further work is unavailable. Furthermore, the delay of this truck will increase the cost of waiting time and lower the production level of the industry. Consequently, a need arises to improve the truck allocation process and path planning. This study outlines its contributions as follows:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eAddressing the issue of internal logistics optimization: In response to this challenge, the paper has formulated a simulation model that accurately replicates the factory environment. Within this model, trucks assume the role of agents, learning when and to which salons they should proceed. Through the integration of deep reinforcement learning, the objective is to allocate trucks efficiently, minimizing periods of inactivity and averting excessive queues at the salons. 
Central to our approach are two methods for assigning trucks to salons: the shared experience deep Q-network (SEDQN) and the multi-agent deep deterministic policy gradient (MADDPG).\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003ePrioritized experience replay: This paper applies prioritized experience replay to SEDQN and examines how the algorithm behaves once prioritization is added.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eComparative analysis: This paper compares the results of MADDPG in a discrete action space with those of the SEDQN algorithm. This comparison verifies which approach yields superior outcomes across various simulation scenarios.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"2. Literature review","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1. Internal logistics\u003c/h2\u003e\u003cp\u003eThe logistics concept can be traced back to its roots in the military sector [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In the business context, the term logistics was formally introduced in 1964 as a process known as business logistics, primarily focused on the physical distribution of goods [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. In modern terms, logistics encompasses planning, implementing, and controlling the efficient and effective flow and storage of goods, services, and related information from the point of origin to the point of consumption, all to meet customer requirements [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Logistics can be further divided into external and internal logistics. 
External logistics pertains to transporting, storing, and delivering goods to customers or other businesses [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. On the other hand, internal logistics deals with support operations and the movement of materials within a company [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Internal logistics systems can be classified into classical, mechanized, and automated systems based on the level of human operator involvement [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. The literature on internal logistics is reviewed from two aspects: internal logistics in the automobile industry and reinforcement learning for internal logistics.\u003c/p\u003e\u003cp\u003e\u003cb\u003eInternal logistics in the industry\u003c/b\u003e\u003c/p\u003e\u003cp\u003eOver the years, numerous articles have addressed internal logistics within the automotive industry. For instance, Baller et al. [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] introduced a mixed-integer linear programming approach to address internal logistics within an automotive industry's first-tier suppliers at the factory's aggregation level. In [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], a mixed-integer linear programming approach was applied to assembly line optimization. Another work [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] tackled a mixed-model sequencing problem involving stochastic processing times in a multi-station assembly line. The study in [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] utilized a simulation-based approach to evaluate assembly lines from an internal logistics perspective. Z. 
Čujan [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] explored the application potential of Tecnomatix Plant Simulation software in simulating supply processes, particularly the Milk Run system. Anand et al. [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] designed cost-effective AGVs for internal logistics within the industry, highlighting the versatility of AGVs as an alternative to trucks in our case study. Macedo et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] proposed a simulation-based decision support tool for manufacturing supermarket operations. Nourmohammadi et al. [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] extended the scope of supermarket location decisions by accounting for the availability of various transport vehicles. Coelho et al. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] developed two simulation models using the Simio simulation modeling framework with intelligent objects.\u003c/p\u003e\u003cp\u003eThe study in [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] addressed the routing problem in IL by employing an integer linear programming (ILP) model and a simulation-based iterated local search (SimILS) algorithm; the solution markedly enhanced the company's operational processes. 
The study [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] examined a segment of a mixed-model assembly line encompassing a supermarket, assorted kits, human pickers, and automated guided vehicles (AGVs), exploring their influence on IL within the automotive industry.\u003c/p\u003e\u003cp\u003e\u003cb\u003eReinforcement learning in industry\u003c/b\u003e\u003c/p\u003e\u003cp\u003eReinforcement learning has also been applied to industrial logistics problems. For instance, [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] employed the Proximal Policy Optimization (PPO) method to optimize internal logistics within the automotive sector. In another study, [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] leveraged model-based reinforcement learning for Automated Guided Vehicles (AGVs) in the automotive domain. Moreover, [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] focused on assembly line optimization through a multi-AGV dispatching strategy. The work in [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] applied a learning-based iterative method to vehicle routing, blending learning with operations research techniques. Additionally, [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] introduced the multi-agent attention model to tackle the vehicle routing problem, substantiated by a series of experimentally validated results.\u003c/p\u003e\u003cp\u003e\u003cb\u003eResearch gap\u003c/b\u003e\u003c/p\u003e\u003cp\u003eDrawing on the historical trajectory outlined in the preceding sections, the research landscape pertaining to IL and reinforcement learning emerges as a rich and influential domain. However, a conspicuous gap remains in the application of reinforcement learning to internal logistics, specifically in the automotive industry context. 
This paper is poised to bridge this gap by directly comparing the performance of discrete MADDPG and SEDQN algorithms within the intricate landscape of automotive internal logistics. By conducting such a comparative assessment, the paper deepens our understanding of algorithm efficacy and sheds light on their practical applicability under real operational constraints.\u003c/p\u003e\u003cp\u003eIn tackling this multifaceted challenge, this research relies on reinforcement learning techniques, specifically the MADDPG and the SEDQN algorithm.\u003c/p\u003e\u003cp\u003eThe paper is structured as follows. Section \u003cspan refid=\"Sec4\" class=\"InternalRef\"\u003e3\u003c/span\u003e discusses the factory layout and task definition, detailing the simulation setup. Section \u003cspan refid=\"Sec5\" class=\"InternalRef\"\u003e4\u003c/span\u003e introduces the multi-agent reinforcement learning algorithm. Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents the results of numerical experiments. Finally, Section \u003cspan refid=\"Sec12\" class=\"InternalRef\"\u003e6\u003c/span\u003e is devoted to the conclusion and recommendation.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Problem description","content":"\u003cp\u003eThe company under study is tasked with allocating a specific number of trucks to facilitate material flow between designated salons, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. These trucks are responsible for transporting bodies to their intended locations and returning empty to reload with new bodies. Consequently, this process incurs substantial time wastage due to the return of empty trucks, resulting in escalated logistic costs for the system. Given the prohibitive expense of studying IL behavior in real-world scenarios, this paper resorts to simulation-based methods. 
The paper employs the RWARE environment [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e], [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. This environment emulates a grid-world warehouse, wherein agents (robots) navigate to locate and transport requested shelves to assigned workstations before returning them. The reward is only granted upon successful delivery, and agents perceive a 3 \u0026times; 3 grid containing information about neighboring agents and shelves. Agents can maneuver, rotate, and load/unload a shelf. The environment encompasses three tasks characterized by variations in world size, agent count, and shelf requests. The reward sparsity makes this environment challenging, because agents must complete a series of actions before obtaining any reward. Additionally, observations within this environment are sparse and high-dimensional compared to other settings [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eRWARE is an open-source framework released under the MIT License, accessible at [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. In our context, this paper designates the trucks as agents and the car bodies as \"shelves,\" aligning with the environment's terminology. The paper devises a 7 \u0026times; 11 grid world that faithfully represents the actual layout of factory salons, illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. This layout comprises a total of 9 salons, segmented into three categories: body makers (3 salons), paint shops (3 salons), and assembly lines (3 salons). The body maker salons exclusively send bodies and do not receive any. Consequently, this paper models them as a single state within our environment, resembling shelves. The paint shops, on the other hand, accept and send bodies to other salons. 
The states are defined: one to receive bodies (as goals) and another for dispatching them (as shelves). Lastly, the assembly lines merely serve as recipients, necessitating a one-state representation denoting goals. RWARE's custom layout capabilities are utilized to construct our environment layout, as depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Transforming salons into body senders requires establishing the environment with a new function that generates new shelves in the designated state of the grid, effectively allowing the creation of multiple shelves. Additional modifications involve refining the behavior of agents delivering bodies to goals, including the requirement that they return the shelf to an empty position. This entails a redesign of the shelf mechanics to facilitate removal from the grid once successfully delivered to the goal. A third crucial implementation is the integration of a stop function. This function prevents agents from delivering a shelf collected from the body maker to the assembly line. The delivery process is carefully governed by multiple conditions in the delivery scenario to ensure proper execution.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAgents receive a partially observable environment, a 3x3 grid centered on the agent, capturing key elements:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eLocation, rotation, and shelf-carrying status of the agent.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLocation and rotation of other robots.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePositions of shelves and their presence in the request queue.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eOur observation space comprises 88 values, each representing a specific element, such as requested shelves, agent positions, or sensor data. 
Sensors extend agent capabilities, allowing it to perceive another state beyond the immediate state. For instance, a sensor value of 1 will increase the agent visibility one step ahead. This expands the observation shape to 88; without sensors, it reduces to 33. Agents navigate through a discrete action space: A = {Turn Left, Turn Right, Move Forward, Load/Unload Shelf}. Actions encompass rotations, forward movement, and shelf interactions at predetermined locations. The environment dynamic reflects real-world warehousing, permitting agents to navigate beneath shelves and use corridors when loaded, avoiding obstructions.\u003c/p\u003e\u003cp\u003eAny collisions are resolved in a way that allows for maximum mobility. When two or more agents attempt to move to the same location, this study prioritizes the one that blocks others. Otherwise, the selection is done arbitrarily. Regarding rewards, a set number of shelves R is requested at each time interval. Delivering a requested shelf to a goal location triggers the sampling of another shelf for ongoing requests. Agents receive a reward of 1 for successfully delivering a requested shelf to a goal location.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"4. Algorithm","content":"\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e4.1. Shared experience deep q learning\u003c/h2\u003e\u003cp\u003eThe Deep Q-Networks (DQN) algorithm's groundbreaking work led to substantial advancements in tackling intricate sequential decision-making challenges [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. 
Subsequent innovations, such as Double DQN mitigating overestimation bias [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e], prioritized experience replay enhancing data efficiency [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], and the dueling network architecture improving action generalization [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], have extended the capabilities of the DQN algorithm.\u003c/p\u003e\u003cp\u003eThis paper builds on the work of [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e], which achieves notable results under sparse rewards. Instead of training each agent only on its own transitions, the transitions of all agents are passed to the learning update of each agent, so agents can learn from the experiences of other agents without necessarily having identical reward functions. Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows the path each agent should take to reach the goal, with each agent learning from the others. This paper applies prioritized experience replay to SEDQN: it calculates the TD error for all agents and then passes the TD errors back to the replay buffer.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e4.2. Multi-agent deep deterministic policy gradient (MADDPG)\u003c/h2\u003e\u003cp\u003eTraditional reinforcement learning methods such as Q-learning or policy gradient algorithms cannot be applied well to multi-agent environments, since the strategy of each agent changes during training. From the perspective of a single agent, the environment becomes unstable, which undermines the learning stability of a model that is trained directly on historical data from the experience pool. 
In addition, the high variance introduced by multiple agents degrades the performance of the traditional algorithms. Therefore, [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] proposed a multi-agent deep deterministic policy gradient (MADDPG). In this algorithm, first, each agent selects its next action using only its local information; second, it does not require a differentiable model of the environment dynamics, and the interaction method between agents does not need to be specially structured; third, it can be used in either competitive or cooperative environments, so it applies to a wide range of scenarios.\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, each agent has an actor-critic structure. Through the actor network, an agent selects an action according to the current state, while the critic network scores the actor\u0026rsquo;s performance according to the state and action and feeds the score back to the actor network, so that the actor can adjust its strategy and strive for better performance next time. In addition, each critic network can receive the information of all agents and therefore make better evaluations for its corresponding agent.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThis paper follows the work of [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], which applies the Gumbel-Softmax to discretize the actions returned by the network.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Numerical experiments","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003e5.1. Parameter setting\u003c/h2\u003e\n \u003cp\u003eThis section examines the IL optimization problem across various customized factory layouts, encompassing small-scale, medium-scale, and large-scale scenarios, to validate the effectiveness of both the model and the algorithms. 
The hyperparameters for SEDQN are set as follows: a_end: 0.0, a_last_episode: 100, a_start: 0.5, b_end: 1.0, b_last_episode: 100, b_start: 0.4. These parameters reflect the prototyped configuration. Additionally, batch_size is set to 1000, representing the number of batches utilized. Capacity is 10000, while eps_end: 0.15, eps_last_episode: 1000, and eps_start: 1.0 are the parameters related to exploration. Gamma is 0.78423, hidden_size is 64, lr is 0.097577, and sync_rate is 10, which is the number of time steps between copies of the network to the target network.\u003c/p\u003e\n \u003cp\u003eThe MADDPG algorithm\u0026apos;s parameters consist of actor_lr: 0.0001, critic_lr: 0.001, batch_size: 100, capacity: 10000, gamma: 0.99, gradient_clip: 0.5, hidden_size: 64, policy_regulariser: 0.01, and soft_update_size: 0.01. Notably, the hidden size is kept constant at 64 for both networks, while other parameters have been adjusted according to the specific algorithm. The grid sizes span from 3\u0026times;5 to 7\u0026times;11, accommodating varying numbers of agents from 2 to 5. Within this range, grid sizes from 3\u0026times;5 to 3\u0026times;6 are categorized as small-scale cases, sizes from 5\u0026times;6 to 5\u0026times;7 are medium-scale, and sizes from 6\u0026times;10 to 7\u0026times;11 represent large-scale scenarios.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\n \u003ch2\u003e5.2. Benchmarks\u003c/h2\u003e\n \u003cp\u003eThis paper performs a comparative analysis between the SEDQN and MADDPG algorithms.\u003c/p\u003e\n \u003cp\u003e(1) SEDQN: The training data is generated through random sampling. The dependencies needed to set up the SEDQN algorithm environment include Python (3.9.13), OpenAI gym (0.21.0), PyTorch (1.13.1), PyTorch Lightning (1.6.0), and NumPy (1.25.1). 
The experiments are conducted on a computer with a 3.6 GHz Intel Core i7 processor, 32GB of RAM, and an Nvidia GeForce RTX 2070 SUPER GPU with 8GB of VRAM.\u003c/p\u003e\n \u003cp\u003e(2) MADDPG: Similar to SEDQN, the training data is obtained through random sampling. The dependencies required to establish the MADDPG algorithm environment consist of Python (3.9.13), OpenAI gym (0.21.0), PyTorch (1.13.1), PyTorch Lightning (1.6.0), and NumPy (1.14.5). The experiments are executed on a high-performance computing (HPC) cluster equipped with 64 CPU cores.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003e5.3. Experimental results\u003c/h2\u003e\n \u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e compares the results of SEDQN with and without prioritized experience replay. The red charts represent the results from SEDQN with prioritized experience replay, and the orange charts illustrate the results for basic SEDQN. SEDQN with prioritized experience replay achieves much better results than the basic version; training may take longer, but it reaches better outcomes.\u003c/p\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e details the outcomes of the small-scale cases, comparing the numerical experimental results of SEDQN against those obtained with MADDPG. The results in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e underscore the efficacy of both the model and the algorithms employed. Notably, in less intricate environments, MADDPG exhibits a remarkable capacity for learning, surpassing SEDQN in performance. 
More precisely, MADDPG outperforms SEDQN only in the low-sparsity case (S1), whereas SEDQN achieves the higher reward in the high-sparsity cases (S2 and S3). This suggests that increasing the sparsity of the rewards favors SEDQN, while reducing it favors MADDPG, which also learns more quickly. The results outlined in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, especially for the small grid size scenario (S1), underscore MADDPG\u0026apos;s capability to achieve superior outcomes within shorter training times.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eThe small-scale results\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eSEDQN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eMADDPG\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCase ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSparsity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eS1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 5\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e34.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3.00 h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 5\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e48.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.30h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003elow\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eS2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 5\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e74.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.00 h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 5\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e53.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.30h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehigh\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eS3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e53.98\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.00 h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(3\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36.42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.30h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehigh\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n 
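The comparison in Table 1 can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the authors' code: it copies the reward values from Table 1 and computes, for each case, which algorithm obtained the higher reward and MADDPG's relative gap to SEDQN. The names `table1` and `compare` are hypothetical and introduced only for illustration.

```python
# Reward values copied from Table 1 (small-scale results); no new data.
# The dictionary layout and function name are illustrative assumptions.
table1 = {
    "S1": {"sedqn": 34.50, "maddpg": 48.61, "sparsity": "low"},
    "S2": {"sedqn": 74.88, "maddpg": 53.58, "sparsity": "high"},
    "S3": {"sedqn": 53.98, "maddpg": 36.42, "sparsity": "high"},
}


def compare(rows):
    """Return {case: (winner, gap_percent, sparsity)}.

    gap_percent is MADDPG's reward relative to SEDQN's, in percent;
    a positive value means MADDPG obtained the higher reward.
    """
    summary = {}
    for case, row in rows.items():
        gap = 100.0 * (row["maddpg"] - row["sedqn"]) / row["sedqn"]
        # A tie (gap == 0) is attributed to SEDQN here purely by convention.
        winner = "MADDPG" if gap > 0 else "SEDQN"
        summary[case] = (winner, round(gap, 1), row["sparsity"])
    return summary


summary = compare(table1)
for case, (winner, gap, sparsity) in summary.items():
    print(f"{case} ({sparsity} sparsity): {winner} leads, MADDPG vs SEDQN = {gap:+.1f}%")
```

Under these values, MADDPG leads only in the low-sparsity case, and SEDQN leads in both high-sparsity cases, which is the pattern the surrounding text describes. The sketch compares rewards only; training times would need a separate comparison.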
\u003cp\u003e\u003c/p\u003e\n \u003cp\u003eThis trend persists across cases M1 and M3, as demonstrated in Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e for medium-scale scenarios. However, as the grid size expands or the environment\u0026apos;s complexity intensifies, MADDPG\u0026apos;s performance experiences a discernible decline. It is noteworthy that MADDPG necessitates more extensive learning periods than SEDQN, particularly in scenarios characterized by sparse rewards. In such contexts, MADDPG may yield comparatively weaker results, prompting the recommendation to favor SEDQN for more extensive applications, as evidenced in Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eThe medium-scale results\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eSEDQN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eMADDPG\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCase ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSparsity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eM1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36.96\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e9.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003elow\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eM2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e40.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 6\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30.82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehigh\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eM3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 7\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e23.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(5\\times 7\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e25.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e18.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003emedium\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eFurthermore, the accompanying charts depict the impact of sparsity on SEDQN and MADDPG. The blue charts represent SEDQN results, while the red ones illustrate MADDPG outcomes. Notably, MADDPG needs more steps than SEDQN to learn in each environment. The influence of sparsity on MADDPG\u0026apos;s performance is visible in Fig. \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003e for the S2 scenario and in Fig. 8 for cases M1 and M3, highlighting that MADDPG requires additional time to explore the environment compared to SEDQN.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eThe large-scale results\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eSEDQN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eMADDPG\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCase ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrid size\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of agents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReward\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTime\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eL1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(6\\times 11\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e14.56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.30 h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(6\\times 11\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e15.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e18.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eL2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(7\\times 10\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e37.22\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e4.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(7\\times 10\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e33.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e23.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eL3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(7\\times 11\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e70.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(7\\times 11\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e67.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30.00h\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eFigure 7 depicts the small-scale results for S1, S2, and S3. The red charts represent the results from MADDPG, and the blue charts illustrate the results for SEDQN. 
Only in S1 does MADDPG perform better; environments with higher reward sparsity cause MADDPG to obtain worse results. Although MADDPG yields better results at the start of learning, its results later worsen because of the sparsity, as it attempts to explore more.\u003c/p\u003e\n \u003cp\u003eFigure 8 depicts the medium-scale results for M1, M2, and M3. The red charts represent the results from MADDPG, and the blue charts illustrate the results for SEDQN. In M1 and M2, MADDPG performed better than SEDQN at certain points during training, although its results worsened after further exploration. In M1, MADDPG is considerably better than SEDQN, which converges from the middle of the learning time onward. In M2, SEDQN ultimately performed better than MADDPG because of the higher reward sparsity. In M3, MADDPG and SEDQN achieve similar results.\u003c/p\u003e\n \u003cp\u003eFigure 9 depicts the large-scale results for L1, L2, and L3. The red charts illustrate the results from MADDPG, and the blue charts demonstrate the results for SEDQN. In L1, MADDPG performs better, although SEDQN could probably overtake MADDPG if the learning time were increased. In L2 and L3, SEDQN performs better than MADDPG. Over longer training times, all large-scale environments behave as high-sparsity environments, and on a large scale SEDQN always performs better.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"6. Conclusion and recommendation","content":"\u003cp\u003eThis paper examined the real-world application of machine learning to a challenging internal logistics (IL) problem in transportation. It presented two methodologies to describe IL in a factory. The paper employs the RWARE environment to analyze the internal logistics of an automotive manufacturing setting. 
The simulation outcomes uncovered notable periods of inactivity and associated costs in truck deliveries, highlighting the potential to reduce queue times for trucks and bodies. Such reduction could facilitate trucks' real-time decision-making based on their location and requested loads.\u003c/p\u003e\u003cp\u003eThe investigation revealed the impact of the prioritized replay buffer on the model's performance. While the MADDPG algorithm generally provided superior results across most simulations involving reinforcement learning (RL), incorporating a prioritized buffer into SEDQN improved its learning efficiency, particularly in larger-scale scenarios. The key to effectively addressing environments with sparse rewards lies in enhancing the prioritization of buffers. This paper recommends that future research explore sparse reward implementations on the MAPPO algorithm to devise a prioritized algorithm. Another advancement could involve designing models that circumvent agent conflicts by employing a non-conflict version.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Peyman Haghshenas, Seyed Mohammad Taghi Fatemi Ghomi and Hadi Mosadegh. The first draft of the manuscript was written by Peyman Haghshenas and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors did not receive support from any organization for the submitted work.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code implementing the algorithms and simulations in this study serves as the primary data for this research. 
It is available upon request from the corresponding author for the purpose of transparency and reproducibility.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman Participants and/or Animals\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study did not involve human participants or animals.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eEthical approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study did not require ethical approval as it did not involve human participants, animals, or sensitive data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author(s) declare no additional acknowledgments for this research.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eMethod. (n.d.). Trucking rates per mile. Retrieved from https://www.method.me/pricing-guides/trucking-rates-per-mile/\u003c/li\u003e\n\u003cli\u003eR. H. Ballou, (2007). The evolution and future of logistics and supply chain management. European Business Review, 19(4), 332-348., doi: 10.1108/09555340710760152.\u003c/li\u003e\n\u003cli\u003eM. Amr, M. Ezzat, and S. Kassem, (2019, October). Logistics 4.0: Definition and historical background. In 2019 Novel Intelligent and Leading Emerging Sciences Conference (NILES) (Vol. 1, pp. 46-49). IEEE, doi: 10.1109/NILES.2019.8909314.\u003c/li\u003e\n\u003cli\u003eB. Gammelgaard, and P. D. Larson, (2001). Logistics skills and competencies for supply chain management. Journal of Business Logistics, 22(2), 27-50.\u003c/li\u003e\n\u003cli\u003eM. Fabri, H. Ramalhinho, M. Oliver, and J. C. Mu\u0026ntilde;oz, (2022). Internal logistics flow simulation: A case study in automotive industry. 
Journal of Simulation, 16(2), 204-216, doi: 10.1080/17477778.2020.1781554.\u003c/li\u003e\n\u003cli\u003eI. G. Pascu, G. C. Neacsu, E. L. Nitu, and A. C. Gavriluta, (2021). A brief review of the methods and techniques used in the innovative internal logistics processes and systems. In IOP Conference Series: Materials Science and Engineering (Vol. 1018, No. 1, p. 012023). IOP Publishing, doi: 10.1088/1757-899X/1018/1/012023.\u003c/li\u003e\n\u003cli\u003eR. Baller, P. Fontaine, S. Minner, and Z. Lai, (2022). Optimizing automotive inbound logistics: A mixed-integer linear programming approach. Transportation Research Part E: Logistics and Transportation Review, 163, 102734, doi: 10.1016/j.tre.2022.102734.\u003c/li\u003e\n\u003cli\u003eH. Mosadegh, M. Zandieh, and S. M. T. Fatemi Ghomi, (2012). Simultaneous solving of balancing and sequencing problems with station-dependent assembly times for mixed-model assembly lines. Applied Soft Computing, 12(4), 1359-1370, doi: 10.1016/j.asoc.2011.11.027.\u003c/li\u003e\n\u003cli\u003eH. Mosadegh, S. M. T. Fatemi Ghomi, and G. A. S\u0026uuml;er, (2020). Stochastic mixed-model assembly line sequencing problem: Mathematical modeling and Q-learning based simulated annealing hyper-heuristics. European Journal of Operational Research, 282(2), 530-544, doi: 10.1016/j.ejor.2019.09.021.\u003c/li\u003e\n\u003cli\u003eZ. Čujan, (2016). Simulation of production lines supply within internal logistics systems. Open Engineering, 6(1), doi: 10.1515/eng-2016-0061.\u003c/li\u003e\n\u003cli\u003eR. Anand, N. V. Vantagodi, K. A. Shanbhag, and M. Mahesh, (2019). Automated guided vehicles by permanent magnet synchronous motor: future of in-house logistics. Power Electronics and Drives, 4(1), 151-159, doi: 10.2478/pead-2019-0006.\u003c/li\u003e\n\u003cli\u003eR. Macedo, F. Coelho, S. Relvas, and A. P. Barbosa-P\u0026oacute;voa, (2021). In-house logistics operations enhancement in the automobile industry using simulation. 
In Operational Research: IO 2019, Tomar, Portugal, July 22\u0026ndash;24 20 (pp. 39-51). Springer International Publishing. doi: 10.1007/978-3-030-85476-8.\u003c/li\u003e\n\u003cli\u003eA. Nourmohammadi, H. Eskandari, M. Fathi, and A. H. Ng, (2021). Integrated locating in-house logistics areas and transport vehicles selection problem in assembly lines. International Journal of Production Research, 59(2), 598-616, doi: 10.1080/00207543.2019.1701207.\u003c/li\u003e\n\u003cli\u003eF. F Coelho, S. Relvas, and A. P. Barbosa-P\u0026oacute;voa, (2021). Simulation-based decision support tool for in-house logistics: the basis for a digital twin. Computers \u0026amp; Industrial Engineering, 153, 107094, doi: 10.1016/j.cie.2020.107094.\u003c/li\u003e\n\u003cli\u003eM. Fabri, and H. Ramalhinho, (2023). The in‐house logistics routing problem. International Transactions in Operational Research, 30(2), 1144-1168, doi: 10.1111/itor.12965.\u003c/li\u003e\n\u003cli\u003eB. Zhou, J. Zhang, and Q. Fei, (2022). Bi-objective green in-house transportation scheduling and fleet size determination in mixed-model assembly lines with mobile robots. Engineering Computations, 39(7), 2630-2654, doi: 10.1108/EC-08-2021-0483.\u003c/li\u003e\n\u003cli\u003eS. Mayer, T. Classen, and C. Endisch, (2021). Modular production control using deep reinforcement learning: proximal policy optimization. Journal of Intelligent Manufacturing, 32(8), 2335-2351, doi: 10.1007/s10845-021-01778-z.\u003c/li\u003e\n\u003cli\u003eN. Feldkamp, S. Bergmann, and S. Strassburger, (2020, December). Simulation-based deep reinforcement learning for modular production systems. In 2020 Winter Simulation Conference (WSC) (pp. 1596-1607). IEEE, doi: 10.1109/WSC48552.2020.9384089.\u003c/li\u003e\n\u003cli\u003eG. Wang, X. Wang, L. Wang, M. Shao, Y. Yu, and X. Cheng, (2021, October). Multi-AGVs dispatching strategy in automobile assembly line based on Deep Reinforcement Learning. In 2021 China Automation Congress (CAC) (pp. 
6382-6386). IEEE, doi: 10.1109/CAC53003.2021.9727515.\u003c/li\u003e\n\u003cli\u003eH. Lu, X. Zhang, and S. Yang, (2019, September). A learning-based iterative method for solving vehicle routing problems. In International Conference on Learning Representations.\u003c/li\u003e\n\u003cli\u003eK. Zhang, F. He, Z. Zhang, X. Lin, and M. Li, (2020). Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121, 102861, doi: 10.1016/j.trc.2020.102861.\u003c/li\u003e\n\u003cli\u003eF. Christianos, L. Sch\u0026auml;fer, and S. Albrecht, (2020). Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 33, 10707-10717.\u003c/li\u003e\n\u003cli\u003eG. Papoudakis, F. Christianos, L. Sch\u0026auml;fer, and S. V. Albrecht, \u0026ldquo;Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks,\u0026rdquo; no. NeurIPS, 2020, [Online]. Available: http://arxiv.org/abs/2006.07869.\u003c/li\u003e\n\u003cli\u003e\u0026ldquo;GitHub - semitable_robotic-warehouse_ Multi-Robot Warehouse (RWARE)_ A multi-agent reinforcement learning environment.\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eV. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, ... and D. Hassabis, (2015). Human-level control through deep reinforcement learning. \u003cem\u003eNature\u003c/em\u003e, \u003cem\u003e518\u003c/em\u003e(7540), 529-533, doi: 10.1038/nature14236.\u003c/li\u003e\n\u003cli\u003eH. Van Hasselt, A. Guez, and D. Silver, (2016, March). Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence 30(1), doi: 10.1609/aaai.v30i1.10295.\u003c/li\u003e\n\u003cli\u003eT. Schaul, J. Quan, I. Antonoglou, and D. Silver, (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.\u003c/li\u003e\n\u003cli\u003eZ. Wang, T. 
Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, (2016, June). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995-2003). PMLR.\u003c/li\u003e\n\u003cli\u003eR. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 6380-6391.\u003c/li\u003e\n\u003cli\u003eC. R. Tilbury, F. Christianos, and S. V. Albrecht, (2023). Revisiting the Gumbel-Softmax in MADDPG. arXiv preprint arXiv:2302.11793.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Simulation, Internal logistics, Reinforcement learning, Sparse rewards ","lastPublishedDoi":"10.21203/rs.3.rs-3864140/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-3864140/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eEfficiently managing internal logistics in the contemporary automobile industry is paramount. This paper delves into the simulation of an internal logistics (IL) system within an automotive factory, employing reinforcement learning. By capturing the unique IL characteristics of the factory, this paper formulates a comprehensive simulation model characterized by its incorporation of sparse reward mechanisms. This paper uses two distinct algorithms. The first algorithm is the multi-agent deep deterministic policy gradient, enhanced by integrating the Baseline to accommodate discrete actions. The second algorithm, shared experience deep Q-network, leverages the prioritized replay strategy to amplify its effectiveness in managing sparse rewards. This paper conducts rigorous numerical experiments to validate both the model's accuracy and the algorithms' efficacy.\u003c/p\u003e","manuscriptTitle":"Application of reinforcement learning methods to allocate logistics resources to production halls in an automotive industry","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-03 15:52:57","doi":"10.21203/rs.3.rs-3864140/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"ef678ce8-4b45-472c-ba6f-4a28de68ff66","owner":[],"postedDate":"September 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-09-20T17:09:58+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-03 15:52:57","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-3864140","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-3864140","identity":"rs-3864140","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}