Superhuman Vision for Superhuman Driving


At a glance

Superhuman driving requires superhuman vision. By “superhuman vision” we mean imaging and computationally perceiving the world at finer detail than the human eye, in order to make the best autonomous driving decisions. Today, camera technology is not the bottleneck to such superhuman vision. 4K camera modules can be had for less than $10, and a cylindrical array of, say, a dozen such cameras pointed outward could image an autonomous car’s entire LIDAR field of view at 20/20 human acuity (approximately 100 megapixels). At video rate (30-60 frames per second), this would represent a visual bandwidth of 3-6 billion pixels per second.

The problem is that current computer vision algorithms cannot handle anywhere near such high pixel flows. For example, current segmentation algorithms operate at just 5-10 million pixels per second, and as a consequence it is typical for current autonomous cars to use only low-resolution video streams of, say, 0.5-1 megapixels at 10 frames per second. At this resolution, a pedestrian at even modest distance is covered by only a coarse set of pixels, and this lack of visual detail greatly constrains what can be inferred. In contrast, human drivers infer a pedestrian’s likely actions in a sophisticated manner, leveraging diverse visual cues that rely on high visual acuity. For example, the lifting of a heel may indicate an upcoming step, and a sideways glance may indicate awareness of the oncoming vehicle and a decreased likelihood of accidentally stepping into its path. The environment is filled with such small but critical visual cues about how the scene is about to change. For the human, as for the autonomous car, correctly reading these small visual cues requires high-resolution vision.

The goal of our research is to enable dramatically higher-resolution cameras in cars, by developing efficient and scalable processing that takes full advantage of the pixels. The key technical principles we will bring to bear on this problem are:

1. Multiresolution image processing – e.g. image pyramids and wavelets,
2. Preferential attention – learning which sparse sets of pixels in the multiresolution representation are best to process in each frame, and
3. Exploiting temporal consistency – flowing inferences from one frame to the next.

Our end goal is to achieve superior driving decisions from superhuman vision.
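To make the first two principles concrete, here is a minimal sketch, assuming NumPy only: it builds a toy image pyramid by repeated 2x2 mean downsampling, then scores coarse-level tiles and "attends" to only the top-k tiles at full resolution. The variance-based score and the tile size are hypothetical stand-ins for a learned attention policy, not the project's actual method.

```python
import numpy as np

def build_pyramid(img, levels):
    """Toy multiresolution pyramid via repeated 2x2 mean downsampling."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        pyr.append(pyr[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def attend_tiles(pyr, tile=4, k=2):
    """Score coarse-level tiles (here: by variance, a hypothetical proxy
    for a learned attention score) and return the top-k tile coordinates
    worth processing at the finest resolution."""
    coarse = pyr[-1]
    h, w = coarse.shape
    scores = {}
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            scores[(i, j)] = coarse[i:i + tile, j:j + tile].var()
    return sorted(scores, key=scores.get, reverse=True)[:k]

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[40:56, 8:24] = rng.random((16, 16))   # a "busy" region worth attending to
pyr = build_pyramid(img, levels=3)        # 64x64 -> 32x32 -> 16x16
print(attend_tiles(pyr, tile=4, k=2))     # coarse tiles covering the busy region
```

The point of the sketch is the cost structure: the attention decision is made on the 16x16 coarse level (cheap), and only the selected tiles would be revisited at the full 64x64 resolution, so the expensive fine-grained processing touches a sparse subset of the pixel flow.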

Principal investigators: Ren Ng
Researchers: Matthew Tancik, Fisher Yu
Themes: Multi-resolution processing, Image pyramid, Attention, Optical flow, 4K video