Risk-Averse, Adversarial Reinforcement Learning
ABOUT THE PROJECT
At a glance
An important characteristic of vehicle controllers is a low probability of producing catastrophic outcomes. A related notion is “robustness” in control, which most often means optimizing reward over a distribution of environment parameters. This prevents the controller from overfitting to a specific, idealized set of parameters and provides near-optimal reward across that distribution. But a robust controller optimizes the same reward as the baseline controller, and does not quantify the risk of a catastrophic outcome such as a vehicle crash. Catastrophic outcomes may be rare, and simple approaches to robustness that sample environment parameters may fail to detect those outcomes and quantify their risk. Certain catastrophic outcomes in vehicle control, such as traction loss, are also highly non-linear, and naïve sampling will not lead to accurate risk estimates. For this project, we will explore risk-averse design, incorporating an explicit risk objective into the controller’s reward. An adversary is used to selectively sample from environment and state parameters in the style of [1], so that the driving policy learns to recover from a variety of adverse states. There are two significant differences in our approach compared to [1]:
- An explicit risk term, which estimates the variance of the value function, is computed and added to the reward. Training the model thus explicitly minimizes risk. This contrasts with [1], in which only reward is optimized.
- The adversary in our approach incorporates a search (curiosity) objective. It is important for the adversary to systematically explore all states and environment parameters that lead to potentially catastrophic outcomes, since the driving policy needs to experience these states to learn to recover from them.
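The variance-based risk penalty described in the first point can be sketched as follows. This is a minimal numpy illustration, not the project's TensorFlow code; the function and parameter names (`risk_adjusted_reward`, `risk_coef`) are hypothetical.

```python
import numpy as np

def risk_adjusted_reward(reward, state_features, value_ensemble, risk_coef=1.0):
    """Penalize reward by the variance of an ensemble's value estimates (a risk proxy).

    value_ensemble: list of callables, each a value estimate trained on a
    different bootstrap resample of the history. Their disagreement on a state
    approximates the variance of the value function there.
    """
    values = np.array([v(state_features) for v in value_ensemble])
    risk = values.var()              # spread across bootstrap-trained heads
    return reward - risk_coef * risk  # explicit risk term subtracted from reward
```

Subtracting the variance term means the policy is rewarded not just for high return but for low uncertainty in that return, which is the sense in which training "explicitly minimizes risk."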
The adversary in our approach can be viewed as a learned importance sampler, allowing us to accurately estimate risk for rare events. It optimizes negative reward plus a multiple of risk. By using an appropriate policy for the adversary (see [2]), the policy enacts Thompson sampling according to this reward, which means it explores states/environment parameters according to their risk. It thus contrasts with simpler adversarial models such as [1], where the driver/adversary interaction is modeled as a two-player game. The goal of the adversary in our approach is not to “win” the game but to teach the driving policy by showing it a range of adverse states (not necessarily its best move).
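A minimal sketch of the adversary's Thompson-sampling selection, under the simplifying assumption that it scores a discrete set of candidate perturbations; the names (`adversary_pick`, `q_heads`) are illustrative, not the project's actual implementation.

```python
import numpy as np

def adversary_pick(perturbations, q_heads, rng, risk_coef=1.0):
    """Thompson-sampling adversary (sketch).

    q_heads: bootstrap-trained estimators of the driver's return under each
    candidate perturbation. Drawing one head at random is an approximate
    posterior sample; the adversary maximizes negative driver reward plus a
    multiple of risk (ensemble disagreement).
    """
    preds = np.array([[q(p) for p in perturbations] for q in q_heads])
    sampled = preds[rng.integers(len(q_heads))]   # one posterior draw of returns
    risk = preds.var(axis=0)                      # disagreement per perturbation
    scores = -sampled + risk_coef * risk          # adversary's objective
    return perturbations[int(np.argmax(scores))]
```

Because one head is drawn per decision, perturbations are visited in proportion to how plausibly adverse they are under the posterior, rather than the adversary greedily playing its single "best move" as in a two-player game.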
To quantify risk in value estimates, we use an ensemble of deep models trained on bootstrap samples of histories of environment actions/states [2]. The bootstrap sampling design supports a simple implementation of Thompson sampling [2], which we will use for the adversary to explore high-risk states. It remains a research question whether simple Thompson sampling provides “deep” enough search, i.e., whether it can explore multiple, disjoint modes in the distribution. We will also explore the deep exploration approach from [2], as well as perceptual novelty rewards like those in [3].
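The bootstrap sampling design might be sketched as follows: each ensemble head is trained on its own resample (with replacement) of the logged history, so the heads agree where data is plentiful and disagree where it is scarce. The helper name `make_bootstrap_datasets` is hypothetical.

```python
import numpy as np

def make_bootstrap_datasets(transitions, n_heads, rng):
    """Give each ensemble head an independent bootstrap resample of the history.

    transitions: the logged environment action/state history (any sequence).
    Returns n_heads datasets, each the same size as the original, drawn with
    replacement. Heads trained on these diverge on rarely-visited states,
    which is the risk signal used elsewhere in the design.
    """
    n = len(transitions)
    datasets = []
    for _ in range(n_heads):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        datasets.append([transitions[i] for i in idx])
    return datasets
```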
We have prototyped the above design using the TORCS driving simulator and TensorFlow. Early results verify that adding the adversary and risk objectives reduces the risk of crashes and spinouts. In the coming year, we will complete the design and explore various exploration models for the adversary. Even limited exploration is expected to improve the driving policy’s performance.
1. Pinto, L., Davidson, J., Sukthankar, R., Gupta, A. Robust Adversarial Reinforcement Learning. arXiv:1703.02702, 2017.
2. Osband, I., Blundell, C., Pritzel, A., Van Roy, B. Deep Exploration via Bootstrapped DQN. arXiv:1602.04621, 2016.
3. Pathak, D., Agrawal, P., Efros, A., Darrell, T. Curiosity-driven Exploration by Self-Supervised Prediction. arXiv:1705.05363, 2017.
Principal investigator: John Canny
Keywords: self-driving vehicles, safety, robust control, risk-averse models, adversarial models, reinforcement learning, ensemble models