Bringing in the Human Factor to Resolve Three Fundamental Problems of Deep Learning in Autonomous Driving


At a glance

Deep learning in autonomous driving has made great progress in recent years. However, it still faces fundamental unsolved problems. We argue that some of these problems can only be resolved by bringing in the human factor. We plan to address three of them in this proposal: (1) understanding the visual-social cues in driving scenes, (2) estimating the underlying costs of prediction errors to properly guide supervised learning, and (3) improving the reward/punishment function in driving for deep reinforcement learning.

Understanding the visual-social cues in driving scenes
Autonomous driving models based on CNN perception modules can recognize important features (e.g., lane marks) and objects (e.g., cars), and perform well in lane-following and car-following situations [1][2]. However, higher-level autonomous driving in complex environments (e.g., urban streets) requires understanding the visual-social cues in driving scenes. For example, recognizing the gaze of a pedestrian informs drivers about the pedestrian's intentions and the likelihood of a dart-out (and collision). A parked car can be deemed benign, but not if a driver is inferred to be in the driver's seat and about to exit the vehicle (Figure 1). We argue that these visual-social cues are challenging to learn because the labels currently in common use, i.e., car control signals and object segmentation maps, are either too sparse or not pertinent to driving.

With support from BDD last year, we developed an efficient method that uses eye tracking to generate human-labeled priority maps of driving-relevant information: human driver attention maps (Figure 1). We have collected human driver attention maps for 1,232 videos, focusing on cases where the car was driving in urban streets and the average speed was below 10 miles per hour. Moreover, we have built a deep neural network that predicts a human driver's attention from monocular dash camera videos alone and surpasses state-of-the-art performance. Our model demonstrates understanding of complex visual-social cues, such as watching out for a driver exiting a parked car (Figure 1). The data we collected and the code of our model have been uploaded to BDD repositories to share with our sponsors. The result was reported in an arXiv paper and was submitted to CVPR 2018.
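To make the comparison with human attention maps concrete, the sketch below shows one standard way such predictions are scored: treating the predicted and human attention maps as spatial probability distributions and measuring the KL divergence between them. This is a common saliency-evaluation metric, not necessarily the exact metric used in our paper; the map values are illustrative.

```python
import numpy as np

def attention_kl(pred_map, human_map, eps=1e-8):
    """KL divergence from a predicted attention map to a human driver
    attention map, both normalized to probability distributions.
    Lower values mean closer agreement with human gaze."""
    p = pred_map / (pred_map.sum() + eps)
    q = human_map / (human_map.sum() + eps)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

# Toy 2x2 maps: human gaze mass sits on the exiting driver (bottom right).
human = np.array([[0.0, 0.1], [0.2, 0.7]])
good_pred = np.array([[0.0, 0.1], [0.2, 0.7]])   # matches human attention
bad_pred = np.array([[0.7, 0.1], [0.1, 0.1]])    # attends to the wrong region
```

A model that attends where the human driver attends (`good_pred`) scores a lower divergence than one that misses the relevant pedestrian (`bad_pred`).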

Figure 1. Illustration of human driver attention maps and the prediction of our model. In this example, two pedestrians unexpectedly exit from a parked car. The human driver attention map highlights the exiting driver, who might be hit. The state-of-the-art model fails to detect the relevant pedestrian, whereas our model successfully predicts attention to the pedestrian most likely to be hit.

This year, we plan to continue collecting human driver eye movements and attention maps in additional situations, e.g., on highways, under variable road and lighting conditions, and at medium and high speeds. Additionally, we plan to test the potential contribution of our attention prediction model to autonomous car control. We will replace the perception module of a car control model [2] with the perception module of our attention prediction model. Since the perception module of our attention model was never trained to predict car control, one might expect this swap to hurt car control accuracy. We expect the opposite, especially in complex situations, because our attention model has learned high-level visual-social cues from human attention maps. The newly collected data and this new car control model will be shared with our sponsors.
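A minimal numpy sketch of the planned module swap, under stated assumptions: a fixed random projection stands in for the pretrained (and frozen) perception module of the attention model, synthetic arrays stand in for dash-camera frames and human-demonstrated control signals, and a least-squares linear head stands in for retraining the control module on top of the frozen features. All names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "perception module": a fixed nonlinear feature extractor.
# In the actual plan these weights would come from the pretrained
# attention prediction network; a random projection stands in here.
W_enc = rng.normal(size=(64, 256))

def perception(frames):
    """Map flattened frames (N, 256) to fixed features (N, 64)."""
    return np.tanh(frames @ W_enc.T)

# Synthetic stand-ins for training frames and demonstrated controls
# (e.g., steering angle and speed).
frames = rng.normal(size=(500, 256))
controls = rng.normal(size=(500, 2))

# "Retrain" only the control head on top of the frozen features,
# via ordinary least squares instead of gradient descent.
feats = perception(frames)
head, *_ = np.linalg.lstsq(feats, controls, rcond=None)

pred_controls = perception(frames) @ head   # (N, 2) predicted control signals
```

The point of the design is that only the control head is fit to the new task, so whatever visual-social structure the perception module learned from human attention maps is carried over unchanged.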

Estimating the underlying costs of prediction errors to guide supervised learning properly
Most applications of deep learning in autonomous driving follow one of two approaches: supervised learning and reinforcement learning. In supervised learning, autonomous driving models are trained to minimize prediction errors, e.g., the difference between the predicted speed and the speed demonstrated by a human driver. However, the cost of making a given error is not known. In some driving situations, e.g., when a pedestrian unexpectedly steps into the roadway, a small error can have fatal costs.

Last year, we showed that mean prediction error as a loss function can be misleading. We proposed a solution that samples the frames of a driving video dataset with different frequencies during training: important frames, i.e., those where the cost of making an error is very high, are sampled more often. We developed a method that uses human gaze data to estimate the importance level of each frame and thereby determine its sampling frequency, and we demonstrated the effectiveness of this method on the task of predicting human driver attention.
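The sampling scheme above can be sketched in a few lines: per-frame importance estimates (in our method, derived from human gaze) are normalized into sampling probabilities, and training frames are drawn from that distribution so high-cost frames appear more often in each batch. The importance values and the temperature parameter below are illustrative assumptions, not values from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_probs(importance, temperature=1.0):
    """Turn per-frame importance estimates (e.g., gaze-derived) into
    sampling probabilities; temperature sharpens or flattens them."""
    w = np.asarray(importance, dtype=float) ** temperature
    return w / w.sum()

# Four frames; the last one is a high-cost moment (pedestrian dart-out).
importance = [0.1, 0.1, 0.1, 5.0]
probs = sampling_probs(importance)

# Draw a training epoch: the high-cost frame dominates the samples.
epoch = rng.choice(len(importance), size=1000, p=probs)
counts = np.bincount(epoch, minlength=4)
```

Training on samples drawn this way weights the loss toward frames where errors are costly, instead of treating every frame of the video equally.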

This year, we plan to conduct a study that directly verifies how accurately our gaze-based method estimates the importance levels of the frames of a driving video dataset. We will run behavioral experiments to collect importance ratings of frames from human subjects and compare them with the predictions of our gaze-based method. The importance rating data will be shared with our sponsors.
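One plausible way to summarize the planned comparison is a correlation between the gaze-based importance estimates and the collected human ratings; the Pearson coefficient below is a sketch of that analysis, and all scores shown are invented for illustration (the behavioral data have not yet been collected).

```python
import numpy as np

# Hypothetical per-frame scores: gaze-based importance estimates
# versus importance ratings collected from human subjects (1-5 scale).
gaze_scores = np.array([0.2, 0.1, 0.8, 0.9, 0.3])
human_ratings = np.array([1.0, 1.0, 4.0, 5.0, 2.0])

# Agreement summarized as a Pearson correlation coefficient in [-1, 1].
r = np.corrcoef(gaze_scores, human_ratings)[0, 1]
```

A rank correlation would be an equally reasonable choice if the subject ratings turn out to be ordinal rather than interval-scaled.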

Improving the reward/punishment function in driving for deep reinforcement learning
In deep reinforcement learning (DRL), a good reward/punishment function is essential for guiding learning. Most studies of DRL in autonomous driving used racing games as driving simulators and treated the game score or racing time as the reward function [3]. But real-life driving is not about how fast one drives. Some studies [4] considered simulations of real-life driving and used the number of crashes, or the damage they caused, as the punishment function. However, autonomous driving is not simply about avoiding collisions, so a more informative punishment function is needed.

A byproduct of our eye-tracking experiment last year is a set of labels marking the moments that human observers perceived as dangerous in the driving videos. The observers were asked to imagine they were driving instructors sitting in the passenger seat while watching the driving videos, and to push a button whenever they felt the student driver was in danger. These labels of dangerous moments can be used to train a network that identifies dangers from the video input and the state of the car. This network can then serve as an informative punishment function for DRL in real-life driving simulations. This year, as we collect new human driver gaze data, we will also acquire more danger labels. We plan to use these data to train the network described above. The data will be shared with our sponsors, and the trained network can not only facilitate DRL but also be applied in advanced driver-assistance systems.
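How such a danger-prediction network could plug into a DRL reward is sketched below: the network's per-step danger probability is subtracted, with a weight, from a base progress reward. The function name, the `lam` weight, and the example values are all hypothetical; this is one simple reward-shaping scheme, not a design we have committed to.

```python
def shaped_reward(progress, danger_prob, lam=2.0):
    """Per-step DRL reward combining task progress with a danger penalty.

    progress    -- base reward for the step, e.g., distance covered
    danger_prob -- output of the danger-prediction network, in [0, 1]
    lam         -- hypothetical weight on the danger penalty
    """
    return progress - lam * danger_prob

# Same forward progress, but the step the network flags as dangerous
# (e.g., passing close to a pedestrian) earns a much lower reward.
safe_step = shaped_reward(1.0, 0.05)
risky_step = shaped_reward(1.0, 0.9)
```

Unlike a crash-count penalty, this signal punishes dangerous behavior continuously, even on episodes where no collision actually occurs.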

Expected outcome: human driver attention maps, labels of dangers in driving videos, importance ratings of video frames, a car control model based on the perception module of the attention prediction model, an evaluation of using gaze to predict the importance levels of video frames, and a model that predicts dangers in driving videos.

David Whitney, Ye Xia, Karl Zipser, Ken Nakayama
Keywords: Eye-tracking, Human Factor, Autonomous Perception