RV-assisted Reinforcement Learning for Collaborative Multi-Agents


Reinforcement learning (RL) has been successfully used as a technique for acting in an unknown environment in order to achieve a goal. It enables a software agent to autonomously learn to perform actions by trial and error. RL is based on a feedback loop in which the agent performs actions on the environment in response to observations, receiving a numerical value (reward) for each action. The goal of the RL agent is to maximise the cumulative reward received over time.
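The feedback loop described above can be sketched with tabular Q-learning, one of the simplest RL algorithms. The environment interface (`reset`, `step`, `actions`) and all parameter values below are illustrative assumptions, not part of any specific framework:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn action values by trial and error."""
    q = defaultdict(float)  # (state, action) -> estimated cumulative reward
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # move the estimate towards reward plus discounted best future value
            best_next = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

The returned table maps each state-action pair to its learned value, from which a greedy policy can be derived.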

After training, the RL agent can effectively handle changes: when a change occurs, the system autonomously learns new policies for executing actions. RL is more challenging in the presence of more than one agent, as it then requires cooperation. In particular, it becomes possible to have collaborative agents that try to achieve a common goal [1].

The use of formal methods is often seen as a way to increase confidence in a software system. Techniques such as runtime verification (RV) can be used to monitor software executions: they can detect violations of safety properties at run-time and possibly react to the incorrect behaviour of the software agent whenever an error is detected.

Runtime verification techniques might be exploited to make exploration safer for RL agents. They enable the agent to be monitored as it explores the environment, preventing it from performing catastrophic actions. The designer of the system can encode rules in several monitors and enforce them at run-time whenever a monitor detects that the agent is about to violate them. In particular, it is possible to use multiple monitors that communicate with each other.
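One way to realise this enforcement is a monitor that sits between the learner and the environment and vetoes unsafe actions before they are executed (sometimes called "shielding"). The sketch below is a minimal illustration under assumed names; a real implementation might compile the safety predicate from a temporal-logic specification rather than take it as a plain Python function:

```python
class SafetyMonitor:
    """Runtime monitor that vetoes actions violating a safety rule.

    `is_safe` is a hypothetical predicate over (state, action); in a
    real system it would be derived from a formal safety specification.
    """
    def __init__(self, is_safe):
        self.is_safe = is_safe
        self.violations_prevented = 0

    def filter(self, state, proposed_action, fallback_action):
        # Enforce the rule: replace an unsafe proposed action with a
        # known-safe fallback before it reaches the environment.
        if self.is_safe(state, proposed_action):
            return proposed_action
        self.violations_prevented += 1
        return fallback_action
```

Because the monitor only intercepts actions, the RL algorithm itself is unchanged: during exploration every proposed action is passed through `filter` before being sent to the environment.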

A specific and interesting challenge is how to combine RL and RV in a collaborative setting. Agents should collaborate in order to achieve their common goal (e.g., to distribute the task of collectively cutting grass), but if an agent encounters a problem (e.g., running out of battery), then that particular agent needs to change its goal and pursue a local one instead (e.g., charging the battery). During learning, this requires that the learning of the common goal stops and switches to a local learning mode pursuing the current learning objective (e.g., charging the battery). During execution, we want to ensure that whenever such an event happens, the agents behave as expected. The above scenario requires a design where monitors communicate among themselves in order to enhance both learning and real-time execution.
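The monitor-triggered goal switch and the communication between monitors could take the following shape. All class, event, and threshold names here are illustrative assumptions for the grass-cutting scenario, not a prescribed design:

```python
class CollaborativeAgent:
    """Sketch of monitor-triggered goal switching between a shared and a local goal."""

    def __init__(self, name):
        self.name = name
        self.peers = []               # other agents whose monitors we notify
        self.goal = "global"          # shared objective, e.g. cutting grass
        self.inbox = []               # events received from other monitors

    def monitor_step(self, battery_level):
        # local monitor: on detecting a problem, switch the learning
        # objective from the common goal to a local one
        if battery_level < 0.2 and self.goal == "global":
            self.goal = "charge_battery"
            self.broadcast(("goal_changed", self.name, self.goal))

    def broadcast(self, event):
        # monitors communicate so that peers can adapt,
        # e.g. redistribute the remaining shared task
        for peer in self.peers:
            peer.inbox.append(event)
```

The key design point illustrated is that the goal switch is decided by the monitor, not by the learner, and that the event is propagated so the other agents can react at both learning and execution time.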

A second, related task in the above scenario is to ensure that the local goals (also specified as local/individual RV monitors) are compliant with the common/global goal. This may be done by using formal methods to prove that the monitors do not enforce contradictory properties.

In this master project the student(s) will design a mechanism for combining RL and RV to solve the above issues, both during learning and during the real execution of the system (using the safety monitors as a way to ensure the task is done, even if the learning has not taken a particular scenario into account).

We will work in a grid-world setting with discrete observations and actions. The student(s) may need to set up a simpler environment for cooperative agents, e.g. based on OpenAI's multi-agent environments [2].

This master project may be dimensioned for one or more students / thesis projects.

[1] https://arxiv.org/pdf/1706.02275.pdf

[2] https://github.com/openai/multiagent-competition

Contact: Gerardo Schneider (gersch@chalmers.se)

Courses: reinforcement learning, machine learning. A course on formal methods would be a plus.

Prerequisites: programming (Python). Some experience with formal methods and logic is desirable, though not strictly required.

Date range: 
October, 2018 to October, 2023