Efficient exploration in single-agent and multi-agent deep reinforcement learning

Schäfer, Lukas

dc.contributor.advisor

Albrecht, Stefano

dc.contributor.advisor

Storkey, Amos

dc.contributor.author

Schäfer, Lukas

dc.date.accessioned

2024-12-13T11:40:57Z

dc.date.available

2024-12-13T11:40:57Z

dc.date.issued

2024-12-13

dc.description.abstract

This thesis is concerned with reinforcement learning (RL) in which decision-making agents learn desirable behaviour by interacting with their environment and receiving feedback in the form of rewards. Learning decision making from interactions in this way is different from most machine learning paradigms in that RL agents need to both learn to collect data that informs their future decisions and learn the desired behaviour that maximises cumulative rewards. These processes are referred to as exploration and exploitation. In this thesis, we focus on the challenge of exploration for RL and propose novel solutions to guide agents towards efficient data collection and learn to balance exploration and exploitation. The first part of this thesis is concerned with exploration for single-agent reinforcement learning in sparse-reward environments. In this setting, a single decision-making agent is learning from interactions with its environment and rarely receives (non-zero) rewards from the environment. A common approach to explore in such challenging environments where informative rewards are sparse is to introduce intrinsically computed rewards that incentivise agents to explore. However, by introducing this second optimisation objective, the agent needs to explicitly and carefully balance the exploration objective of intrinsic rewards and the exploitation objective of extrinsic rewards of the environment. We propose decoupled reinforcement learning (DeRL) to address this challenge. In DeRL, agents learn separate policies for both exploration and exploitation to account for the different objectives of these policies. The exploration policy is trained with intrinsic rewards and used to gather informative data for the exploitation policy while the exploitation policy is trained on the gathered data to solve the task at hand. 
We show that DeRL outperforms existing intrinsically motivated exploration approaches in terms of sample efficiency and robustness to hyperparameters that are responsible for the trade-off between exploration and exploitation. For the subsequent parts of this thesis, we will shift our focus to exploration in the setting of multi-agent reinforcement learning (MARL). In MARL, multiple decision-making agents concurrently learn from interactions with their shared environment. In this thesis, we are concerned with environments that require agents to cooperate, i.e. agents need to learn to coordinate their actions to achieve their goals. This additional consideration in contrast to single-agent RL further complicates the learning process and exploration which now needs to account for the interactions between agents. The first contribution of this thesis to MARL is a comprehensive benchmark of ten algorithms across a set of 25 cooperative common-reward environments. As part of this study, we open-source EPyMARL, a codebase for MARL that extends the previously existing PyMARL codebase with more algorithms, support for more environments, and configurability. Following the analysis of this benchmark, we identify remaining challenges in MARL to efficiently train agents to cooperate in environments where informative feedback is sparse. Motivated by this challenge, we propose shared experience actor-critic (SEAC). SEAC leverages symmetry present in many multi-agent environments to share experiences across agents and learn from the collective experience of all agents using an actor-critic algorithm. In empirical experiments, we establish that experience sharing can significantly improve the efficiency of learning and help agents to learn skills simultaneously. However, the benefits of experience sharing are less pronounced for value-based algorithms where agents do not learn explicit policies. 
To guide exploration for value-based MARL algorithms, we propose ensemble value functions for multi-agent exploration (EMAX). EMAX trains an ensemble of value functions for each agent and steers the exploration of agents towards states and actions that might require cooperation between multiple agents. By doing so, agents learn to coordinate their actions more efficiently, and we show that EMAX as an extension of three common value-based MARL algorithms can significantly improve the sample efficiency and stability of training. Lastly, this thesis presents a case study in which we discuss the application of MARL to warehouse logistics automation. The chapter is the result of an industry collaboration with Dematic GmbH for which we formalise warehouse logistics problems and propose a twofold solution to the scalability challenge of this setting. Our approach leverages a hierarchical decomposition of the multi-agent learning architecture and masks out actions that are deemed ineffective. Together, these techniques simplify the learning objective of individual agents, allowing MARL agents to learn more efficiently and scale to larger warehouse instances with more agents and locations while outperforming industry-standard heuristics and standard MARL algorithms.

en

dc.identifier.uri

https://hdl.handle.net/1842/42887

dc.identifier.uri

http://dx.doi.org/10.7488/era/5441

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Lukas Schäfer, Filippos Christianos, Josiah P. Hanna, and Stefano V. Albrecht. “Decoupled reinforcement learning to stabilise intrinsically motivated exploration.” In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022. Available at arXiv 2107.08966

en

dc.relation.hasversion

Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks.” In Advances in Neural Information Processing Systems, Track on Datasets and Benchmarks. 2021. Available at arXiv 2006.07869

en

dc.relation.hasversion

Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. “Shared experience actor-critic for multi-agent reinforcement learning.” In Advances in Neural Information Processing Systems. 2020. Available at arXiv 2006.07169

en

dc.relation.hasversion

Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, and David Mguni. “Ensemble value functions for efficient exploration in multi-agent reinforcement learning.” In Adaptive and Learning Agents Workshop at the AAMAS conference. 2023. Available at arXiv 2302.03439

en

dc.relation.hasversion

Aleksandar Krnjaic, Raul D. Steleac, Jonathan D. Thomas, Georgios Papoudakis, Lukas Schäfer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter Börsting, and Stefano V. Albrecht. “Scalable multi-agent reinforcement learning for warehouse logistics with robotic and human co-workers.” In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2024. Available at arXiv 2212.11498

en

dc.relation.hasversion

Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer. "Multi-agent reinforcement learning: Foundations and modern approaches." In MIT Press. Cambridge, MA, USA. 2024. Available at marl-book.com

en

dc.relation.hasversion

Trevor McInroe, Lukas Schäfer, and Stefano V. Albrecht. "Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning." In Transactions on Machine Learning Research. 2023. Available at arXiv 2206.11396

en

dc.relation.hasversion

Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht, and Josiah P. Hanna. "Robust on-policy sampling for data-efficient policy evaluation in reinforcement learning." In Advances in Neural Information Processing Systems. 2022. Available at arXiv 2111.14552

en

dc.relation.hasversion

Lukas Schäfer. "Task generalisation in multi-agent reinforcement learning." In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022. Available at ACM Digital Library

en

dc.relation.hasversion

Lukas Schäfer, Filippos Christianos, Amos Storkey, and Stefano V. Albrecht. "Learning task embeddings for teamwork adaptation in multi-agent reinforcement learning." In Workshop on Generalization in Planning at the NeurIPS conference. 2023. Available at arXiv 2207.02249

en

dc.relation.hasversion

Alain Andres, Lukas Schäfer, Esther Villar-Rodriguez, Stefano V. Albrecht, and Javier Del Ser. "Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments." In Adaptive and Learning Agents Workshop at the AAMAS conference. 2023. Available at arXiv 2304.09825

en

dc.relation.hasversion

Lukas Schäfer, Logan Jones, Anssi Kanervisto, Yuhan Cao, Tabish Rashid, Raluca Georgescu, David Bignell, Siddhartha Sen, Andrea Treviño Gavito, and Sam Devlin. "Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games." In arXiv preprint. 2023. Available at arXiv 2312.02312

en

dc.subject

autonomous decision-making entities

en

dc.subject

machine learning

en

dc.subject

reinforcement learning

en

dc.subject

exploration

en

dc.subject

efficient data collection

en

dc.subject

multi-agent reinforcement learning

en

dc.subject

multiple predictions

en

dc.subject

automating warehouse logistics

en

dc.title

Efficient exploration in single-agent and multi-agent deep reinforcement learning

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Schäfer2024.pdf
Size: 23.94 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection