Efficient exploration in single-agent and multi-agent deep reinforcement learning
dc.contributor.advisor
Albrecht, Stefano
dc.contributor.advisor
Storkey, Amos
dc.contributor.author
Schäfer, Lukas
dc.date.accessioned
2024-12-13T11:40:57Z
dc.date.available
2024-12-13T11:40:57Z
dc.date.issued
2024-12-13
dc.description.abstract
This thesis is concerned with reinforcement learning (RL) in which decision-making agents learn desirable behaviour by interacting with their environment and receiving feedback in the form of rewards. Learning decision making from interactions in this way is different from most machine learning paradigms in that RL agents need to both learn to collect data that informs their future decisions and learn the desired behaviour that maximises cumulative rewards. These processes are referred to as exploration and exploitation. In this thesis, we focus on the challenge of exploration for RL and propose novel solutions to guide agents towards efficient data collection and learn to balance exploration and exploitation.
The first part of this thesis is concerned with exploration for single-agent reinforcement learning in sparse-reward environments. In this setting, a single decision-making agent is learning from interactions with its environment and rarely receives (non-zero) rewards from the environment. A common approach to explore in such challenging environments where informative rewards are sparse is to introduce intrinsically computed rewards that incentivise agents to explore. However, by introducing this second optimisation objective, the agent needs to explicitly and carefully balance the exploration objective of intrinsic rewards and the exploitation objective of extrinsic rewards of the environment. We propose decoupled reinforcement learning (DeRL) to address this challenge. In DeRL, agents learn separate policies for both exploration and exploitation to account for the different objectives of these policies. The exploration policy is trained with intrinsic rewards and used to gather informative data for the exploitation policy while the exploitation policy is trained on the gathered data to solve the task at hand. We show that DeRL outperforms existing intrinsically motivated exploration approaches in terms of sample efficiency and robustness to hyperparameters that are responsible for the trade-off between exploration and exploitation.
For the subsequent parts of this thesis, we will shift our focus to exploration in the setting of multi-agent reinforcement learning (MARL). In MARL, multiple decision-making agents concurrently learn from interactions with their shared environment. In this thesis, we are concerned with environments that require agents to cooperate, i.e. agents need to learn to coordinate their actions to achieve their goals. This additional consideration in contrast to single-agent RL further complicates the learning process and exploration which now needs to account for the interactions between agents. The first contribution of this thesis to MARL is a comprehensive benchmark of ten algorithms across a set of 25 cooperative common-reward environments. As part of this study, we open-source EPyMARL, a codebase for MARL that extends the previously existing PyMARL codebase with more algorithms, support for more environments, and configurability. Following the analysis of this benchmark, we identify remaining challenges in MARL to efficiently train agents to cooperate in environments where informative feedback is sparse. Motivated by this challenge, we propose shared experience actor-critic (SEAC). SEAC leverages symmetry present in many multi-agent environments to share experiences across agents and learn from the collective experience of all agents using an actor-critic algorithm. In empirical experiments, we establish that experience sharing can significantly improve the efficiency of learning and help agents to learn skills simultaneously. However, the benefits of experience sharing are less pronounced for value-based algorithms where agents do not learn explicit policies. To guide exploration for value-based MARL algorithms, we propose ensemble value functions for multi-agent exploration (EMAX). EMAX trains an ensemble of value functions for each agent and steers the exploration of agents towards states and actions that might require cooperation between multiple agents. 
By doing so, agents learn to coordinate their actions more efficiently, and we show that EMAX as an extension of three common value-based MARL algorithms can significantly improve the sample efficiency and stability of training.
Lastly, this thesis presents a case study in which we discuss the application of MARL to warehouse logistics automation. The chapter is the result of an industry collaboration with Dematic GmbH for which we formalise warehouse logistics problems, and propose a twofold solution to the scalability challenge of this setting. Our approach leverages a hierarchical decomposition of the multi-agent learning architecture and masks out actions that are deemed ineffective. Together, these techniques simplify the learning objective of individual agents, allowing MARL agents to learn more efficiently and scale to larger warehouse instances with more agents and locations while outperforming industry-standard heuristics and standard MARL algorithms.
en
dc.identifier.uri
https://hdl.handle.net/1842/42887
dc.identifier.uri
http://dx.doi.org/10.7488/era/5441
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Lukas Schäfer, Filippos Christianos, Josiah P. Hanna, and Stefano V. Albrecht. “Decoupled reinforcement learning to stabilise intrinsically motivated exploration.” In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022. Available at arXiv 2107.08966
en
dc.relation.hasversion
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks.” In Advances in Neural Information Processing Systems, Track on Datasets and Benchmarks. 2021. Available at arXiv 2006.07869
en
dc.relation.hasversion
Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. “Shared experience actor-critic for multi-agent reinforcement learning.” In Advances in Neural Information Processing Systems. 2020. Available at arXiv 2006.07169
en
dc.relation.hasversion
Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, and David Mguni. “Ensemble value functions for efficient exploration in multi-agent reinforcement learning.” In Adaptive and Learning Agents Workshop at the AAMAS conference. 2023. Available at arXiv 2302.03439
en
dc.relation.hasversion
Aleksandar Krnjaic, Raul D. Steleac, Jonathan D. Thomas, Georgios Papoudakis, Lukas Schäfer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter Börsting, and Stefano V. Albrecht. “Scalable multi-agent reinforcement learning for warehouse logistics with robotic and human co-workers.” In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2024. Available at arXiv 2212.11498
en
dc.relation.hasversion
Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer. "Multi-agent reinforcement learning: Foundations and modern approaches." In MIT Press. Cambridge, MA, USA. 2024. Available at marl-book.com
en
dc.relation.hasversion
Trevor McInroe, Lukas Schäfer, and Stefano V. Albrecht. "Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning." In Transactions on Machine Learning Research. 2023. Available at arXiv 2206.11396
en
dc.relation.hasversion
Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht, and Josiah P. Hanna. "Robust on-policy sampling for data-efficient policy evaluation in reinforcement learning." In Advances in Neural Information Processing Systems. 2022. Available at arXiv 2111.14552
en
dc.relation.hasversion
Lukas Schäfer. "Task generalisation in multi-agent reinforcement learning." In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022. Available at ACM Digital Library
en
dc.relation.hasversion
Lukas Schäfer, Filippos Christianos, Amos Storkey, and Stefano V. Albrecht. "Learning task embeddings for teamwork adaptation in multi-agent reinforcement learning." In Workshop on Generalization in Planning at the NeurIPS conference. 2023. Available at arXiv 2207.02249
en
dc.relation.hasversion
Alain Andres, Lukas Schäfer, Esther Villar-Rodriguez, Stefano V. Albrecht, and Javier Del Ser. "Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments." In Adaptive and Learning Agents Workshop at the AAMAS conference. 2023. Available at arXiv 2304.09825
en
dc.relation.hasversion
Lukas Schäfer, Logan Jones, Anssi Kanervisto, Yuhan Cao, Tabish Rashid, Raluca Georgescu, David Bignell, Siddhartha Sen, Andrea Treviño Gavito, and Sam Devlin. "Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games." In arXiv preprint. 2023. Available at arXiv 2312.02312
en
dc.subject
autonomous decision-making entities
en
dc.subject
machine learning
en
dc.subject
reinforcement learning
en
dc.subject
exploration
en
dc.subject
efficient data collection
en
dc.subject
multi-agent reinforcement learning
en
dc.subject
multiple predictions
en
dc.subject
automating warehouse logistics
en
dc.title
Efficient exploration in single-agent and multi-agent deep reinforcement learning
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Schäfer2024.pdf
- Size: 23.94 MB
- Format: Adobe Portable Document Format

