Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees

Kai-Chieh Hsu*    Allen Z. Ren*    Duy Phuong Nguyen    Anirudha Majumdar**    Jaime F. Fisac**   

*equal contribution in alphabetical order    **equal advising

Princeton University    Artificial Intelligence Journal (AIJ), October 2022

Progress and Challenges in Building Trustworthy Embodied AI Workshop, NeurIPS 2022
Oral, Generalizable Policy Learning in the Physical World Workshop, ICLR 2022

Journal | Paper | Code | Bibtex


We propose Sim-to-Lab-to-Real, a framework that combines Hamilton-Jacobi reachability analysis and the PAC-Bayes Control framework to improve the safety of robots during training and real-world deployment, and to provide generalization guarantees on robots’ performance and safety in real environments.


Method Overview

We leverage an intermediate training stage, Lab, between Sim and Real to safely bridge the Sim-to-Real gap in ego-vision indoor navigation tasks. Compared to Sim training, Lab training is (1) more realistic and (2) more safety-critical.

For safe Sim-to-Lab transfer, we learn a safety critic with Hamilton-Jacobi reachability RL and apply a supervisory control scheme to shield unsafe actions during exploration.
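Concretely, the shielding rule can be sketched as follows (notation ours; the sign convention, with higher critic values indicating greater danger, matches the deployment description below). Given the learned safety critic \(Q^b\), a shielding threshold \(\varepsilon\), and the action \(a^p\) proposed during exploration, the executed action is

\[
a =
\begin{cases}
a^p, & \text{if } Q^b(s, a^p) \le \varepsilon, \\
a^b \sim \pi^b(\cdot \mid s), & \text{otherwise,}
\end{cases}
\]

so the backup (safety) policy introduced below intervenes only when the proposed action is predicted to lead the robot toward failure.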

For safe Lab-to-Real transfer, we use the Probably Approximately Correct (PAC)-Bayes Control framework to provide lower bounds (70-90%) on the expected performance and safety of policies in unseen environments.
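For intuition, a standard PAC-Bayes bound of this form (the paper may use a tighter variant, such as a KL-inverse bound; this quadratic relaxation is shown for readability) states that, for a posterior policy distribution \(P\), a prior \(P_0\), and \(N\) Lab environments drawn from an unknown environment distribution \(\mathcal{D}\), with probability at least \(1-\delta\),

\[
\mathbb{E}_{E \sim \mathcal{D}}\, \mathbb{E}_{\pi \sim P}\!\left[ r(\pi; E) \right]
\;\ge\;
\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\pi \sim P}\!\left[ r(\pi; E_i) \right]
\;-\;
\sqrt{\frac{\mathrm{KL}(P \,\|\, P_0) + \log \frac{2\sqrt{N}}{\delta}}{2N}},
\]

where \(r \in [0, 1]\) measures the success or safety of a rollout in environment \(E\). Bounds of this type yield the 70-90% guarantees quoted above.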



Comparison with Other Methods

Our method employs a dual-policy setup with a reachability-based safety critic and shielding, which we demonstrate achieves superior safety during training and testing compared to a single-policy setup, a risk-based safety critic, or a reward penalty. We also provide tight generalization guarantees for unseen environments.

Reachability vs. Discounted Risk

We compare our learned safety critic with those learned by SQRL [Srinivasan'20] and Recovery RL [Thananjeyan'21], which use sparse (binary) safety indicators. Reachability RL provides the safety critic with dense signals from near-failures. As a result, it recovers a thicker unsafe set and significantly reduces the number of safety violations during Lab training and Real deployment.
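One common way to obtain such a dense learning target is the discounted safety (reachability) Bellman backup, stated here with a failure margin \(g(s) > 0\) inside the failure set so that higher critic values indicate greater danger (following Fisac et al., 2019; the exact formulation used in the paper may differ in details):

\[
Q^b(s, a) = (1 - \gamma)\, g(s) + \gamma \max\!\left\{ g(s),\; \min_{a'} Q^b\!\big(f(s, a), a'\big) \right\},
\]

so the critic propagates the closest approach to failure along the most cautious future actions, and every near-miss, not just an actual collision, shapes the value function.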

Safety-Ensured Policy Distribution

We train a dual policy, a performance policy (\pi^p) and a backup (safety) policy (\pi^b), both conditioned on latent variables sampled from a distribution. The performance policy guides the robot towards the goal, while the backup policy intervenes minimally, only when the safety critic deems the robot near danger. With a distribution of policies parameterized by the latent variables, the robot exhibits diverse trajectories around obstacles; a minimal rollout sketch is given below.
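As an illustrative sketch only (not the released code's API; the interfaces pi_perf, pi_backup, q_safety, latent_dist, and the threshold eps are hypothetical), a latent-conditioned dual-policy rollout with shielding could look like:

def rollout(env, pi_perf, pi_backup, q_safety, latent_dist, eps, max_steps=500):
    """Run one episode with a latent-conditioned dual policy and value-based shielding.

    pi_perf(obs, z)   -> action proposed by the performance policy
    pi_backup(obs, z) -> action proposed by the backup (safety) policy
    q_safety(obs, a)  -> scalar safety-critic value (higher = more dangerous)
    latent_dist()     -> sample of the latent variable z shared by both policies
    eps               -> shielding threshold on the critic value
    """
    z = latent_dist()                   # one latent sample per episode -> diverse trajectories
    obs = env.reset()
    num_shielded = 0
    for _ in range(max_steps):
        a = pi_perf(obs, z)             # performance policy proposes an action
        if q_safety(obs, a) > eps:      # proposed action predicted to be unsafe
            a = pi_backup(obs, z)       # backup policy overrides (shielding)
            num_shielded += 1
        obs, reward, done, info = env.step(a)
        if done:
            break
    return num_shielded

Sampling a new latent variable for each rollout is what produces the diverse behaviors around obstacles described above.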

Shielding in Real Deployment

We test the dual policy in 10 different real indoor spaces, including the one shown on the left. The trajectory color indicates the safety critic value (red for higher, blue for lower) at each location. When the value exceeds the threshold, shielding activates and the backup policy (green arrow) overrides the performance policy (red arrow) to steer the robot away from obstacles.

Physical Experiments with Robot Trajectories

We run the policy three times in each environment, sampling different latent variables from the posterior distribution. The three numbers in the images indicate the success/unfinished/failure split. The top images show the RGB observations used as input by the robot (no depth information is used).



Acknowledgements

Allen Z. Ren and Anirudha Majumdar were supported by the Toyota Research Institute (TRI), the NSF CAREER award [2044149], the Office of Naval Research [N00014-21-1-2803], and the School of Engineering and Applied Science at Princeton University through the generosity of William Addy ’82. This article solely reflects the opinions and conclusions of its authors and not ONR, NSF, TRI or any other Toyota entity. We would like to thank Zixu Zhang for his valuable advice on the setup of the physical experiments.