Safe Learning with Runtime Monitoring and Human Intervention


Reinforcement learning (RL) has been used successfully as a technique for acting in an unknown environment in order to achieve a goal. It enables a software agent to learn autonomously, by trial and error, which actions to perform. RL is based on a feedback loop in which the agent performs actions on the environment in response to its observations and receives a numerical value (reward) for each action. The goal of the RL agent is to maximise the cumulative reward received over time.
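The feedback loop described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Corridor` environment, its reward values, and the two policies are hypothetical and are not part of the project or of any library.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: cells 0..length-1, goal in the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left; walls clamp the move
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # assumed reward scheme: small cost per step, +1 on reaching the goal
        reward = 1.0 if done else -0.01
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Observation -> action -> reward loop; returns the cumulative reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def random_policy(obs):
    return random.choice([0, 1])


def always_right(obs):
    return 1
```

A learning algorithm would replace `random_policy` with a policy that is updated from the observed rewards; the loop itself stays the same.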

After training, the RL agent can effectively handle changes: when a change occurs, the system autonomously learns new policies for executing actions.

The use of formal methods is often seen as a way to increase confidence in a software system. Techniques such as runtime verification (RV) can be used to monitor software executions. RV can detect violations of safety properties at run-time, making it possible to react to incorrect behaviour of the software agent whenever an error is detected.

Runtime verification techniques can be exploited to make exploration safer for RL agents. They allow the agent to be monitored as it explores the environment, preventing it from performing catastrophic actions. The designer of the system can encode rules in several monitors, which enforce them at run-time whenever a monitor detects that the agent is about to violate one.
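As a rough illustration of such enforcement, the sketch below checks each proposed action against a list of rules before it is executed and substitutes a safe alternative when a rule would be violated. All names here (`SafetyMonitor`, `predict`, `LAVA`, `enters_lava`) are hypothetical, and the sketch assumes a known, deterministic grid model so that the effect of an action can be predicted; a real monitor may not have that luxury.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]  # (row, column) in a grid world

# assumed deterministic effect of each action on the agent's position
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def predict(state: State, action: str) -> State:
    """Predict the next cell under the assumed deterministic model."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


class SafetyMonitor:
    """Holds designer-written rules; a rule returns True when it is violated."""

    def __init__(self, rules: List[Callable[[State, str], bool]]):
        self.rules = rules

    def violates(self, state: State, action: str) -> bool:
        return any(rule(state, action) for rule in self.rules)

    def enforce(self, state: State, proposed: str) -> str:
        """Return the proposed action if safe, else the first safe alternative."""
        if not self.violates(state, proposed):
            return proposed
        for action in MOVES:
            if not self.violates(state, action):
                return action
        raise RuntimeError("no safe action available in state %r" % (state,))


# example rule (an assumption for illustration): never step onto a lava cell
LAVA = {(1, 1)}


def enters_lava(state: State, action: str) -> bool:
    return predict(state, action) in LAVA


monitor = SafetyMonitor([enters_lava])
```

Here the agent would call `monitor.enforce(state, action)` before every step, so that an unsafe proposal such as moving right from cell (1, 0) is replaced before it reaches the environment.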

Safe exploration of the environment can be achieved either with human guidance and intervention [4], or with automatic methods that synthesise some sort of safety envelope for the agent [2][3].

Using monitors that encode domain knowledge, we can help the agent achieve its goal faster and without breaking the rules encoded in the monitors. But what if, during exploration, the agent experiences something that it has never seen before and that is not encoded in any safety monitor?

In this project, we will enhance the monitoring mechanism to make it adaptive. The monitors become dynamic entities that also learn and adapt from the interactions of the agent with the environment. If the RL agent experiences something that it has never seen before, that experience can be used to enhance the monitoring system, either automatically or through human intervention.
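One possible shape for such an adaptive monitor is sketched below. It is not the project's design, only a minimal illustration under stated assumptions: the class `AdaptiveMonitor`, the `catastrophic` flag, and the `approve` callback standing in for the human are all hypothetical, and how a catastrophic outcome is actually detected is left open.

```python
class AdaptiveMonitor:
    """Blocks known-unsafe (state, action) pairs and grows its rule set over time."""

    def __init__(self, known_unsafe=None):
        # rules encoded by the designer before exploration starts
        self.blocked = set(known_unsafe or [])
        # novel unsafe experiences waiting for a human decision
        self.pending = []

    def allows(self, state, action):
        return (state, action) not in self.blocked

    def observe(self, state, action, catastrophic, auto_learn=True):
        """Record the outcome of an executed action.

        `catastrophic` is assumed to be supplied by the environment or by a
        human overseer; this sketch does not say how it is computed.
        """
        pair = (state, action)
        if not catastrophic or pair in self.blocked:
            return
        if auto_learn:
            self.blocked.add(pair)  # automatic adaptation
        elif pair not in self.pending:
            self.pending.append(pair)  # defer to human intervention

    def human_review(self, approve):
        """Apply a human decision (`approve(pair) -> bool`) to pending pairs."""
        for pair in self.pending:
            if approve(pair):
                self.blocked.add(pair)
        self.pending.clear()
```

The two paths in `observe` correspond to the two options mentioned above: with `auto_learn=True` the monitor extends itself, and with `auto_learn=False` the new rule only takes effect after `human_review` accepts it.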

We will work in a grid-world setting with discrete observations and actions, using a gridworld environment for OpenAI Gym [1].

Course requirements: reinforcement learning, machine learning. Having taken a formal methods course would be a plus.

Prerequisites: programming (Python). Some experience with formal methods and logic is desirable, though not strictly required.

Contact: Gerardo Schneider

References and Further Reading


[2] "Safe Reinforcement Learning via Shielding" by Alshiekh et al.

[3] "Safe Reinforcement Learning via Formal Methods" by Fulton et al.

[4] "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention" by Saunders et al.

Date range: October 2018 to October 2023