Unsupervised Representation Learning for Autonomous Driving
ABOUT THE PROJECT
At a glance
Recent deep learning methods have leveraged large datasets of millions of labeled examples to learn rich, high-performance visual representations. Yet efforts to scale these methods to truly Internet-scale datasets (i.e. hundreds of billions of images) are hampered by the sheer expense of the human annotation required. A natural way to address this difficulty would be to employ unsupervised learning, which aims to use data without any annotation.
Unfortunately, despite several decades of sustained effort, unsupervised methods have not yet been shown to extract useful information from large collections of full-sized, real images. After all, without labels, it is not even clear what should be represented.
In this project, we propose to employ "self-supervision": using the data as its own supervisory signal. The team will also explore the use of temporal and spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. This will be achieved in two ways: predicting the relative arrangement of pairs of patches, and predicting the actual content of patches from their context. The team will build upon and extend preliminary work to not only consider arrangement prediction within a single image, but more broadly predict the spatial and temporal arrangement of patches within entire scenes. Outdoor street data used in many driving applications would be a great source of imagery for such a training approach. This will make the arrangement prediction harder, leading to both a better and a more task-specific (driving) representation.
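The sketch below illustrates what the patch-arrangement pretext task could look like in practice: a shared encoder embeds a central patch and one of its eight neighbours, and a small head classifies which relative position the neighbour came from. The architecture, patch size, and module names here are illustrative assumptions, not the project's final design.

```python
# Minimal sketch (assumed architecture) of the relative patch-arrangement task.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Shared CNN that embeds a 96x96 RGB patch into a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class ArrangementClassifier(nn.Module):
    """Concatenates two patch embeddings and predicts one of 8 relative offsets."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = PatchEncoder(feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 8),  # 8 possible neighbour positions around the centre
        )

    def forward(self, center_patch, neighbor_patch):
        f1 = self.encoder(center_patch)
        f2 = self.encoder(neighbor_patch)
        return self.head(torch.cat([f1, f2], dim=1))

# One illustrative training step on random tensors standing in for street-scene patches.
model = ArrangementClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
center = torch.randn(16, 3, 96, 96)
neighbor = torch.randn(16, 3, 96, 96)
position = torch.randint(0, 8, (16,))  # ground-truth relative position label
loss = nn.functional.cross_entropy(model(center, neighbor), position)
loss.backward()
optimizer.step()
```

Because the position label comes for free from how the patches were sampled, no human annotation is needed; the same idea extends to spatial and temporal offsets across entire scenes.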
In a second part of the project, the team will work to predict the content of parts of the scene directly from their surroundings. This task is much more challenging than predicting the spatial arrangement, and potentially provides a much stronger supervisory signal. In order to succeed at this task, a model will need to both understand the content of the image and produce a plausible hypothesis for the missing parts.
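A minimal sketch of this content-prediction (inpainting) idea is given below: the centre of a scene crop is masked out and an encoder-decoder is trained to regress the missing pixels from the surrounding context. The sizes, layers, and plain L2 loss are assumptions for illustration; a full system could, for example, add an adversarial term to sharpen the predictions.

```python
# Minimal sketch (assumed architecture) of predicting masked content from context.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encoder-decoder that reconstructs a masked 32x32 centre from a 128x128 crop."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 32 -> 16
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.Conv2d(128, 3, 3, padding=1), nn.Sigmoid(),  # predicted 32x32 centre
        )

    def forward(self, masked_image):
        return self.decoder(self.encoder(masked_image))

# One illustrative step: zero out the centre, predict it, and apply an L2 loss.
model = ContextEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
image = torch.rand(8, 3, 128, 128)            # stand-in for street-scene crops
target = image[:, :, 48:80, 48:80].clone()    # ground-truth 32x32 centre region
masked = image.clone()
masked[:, :, 48:80, 48:80] = 0.0              # remove the centre before encoding
loss = nn.functional.mse_loss(model(masked), target)
loss.backward()
optimizer.step()
```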
PRINCIPAL INVESTIGATORS | RESEARCHERS | THEMES |
---|---|---|
Alexei (Alyosha) Efros | Carl Doersch, Abhinav Gupta, Phillip Isola, Philipp Krahenbuhl, Richard Zhang, Jun-Yan Zhu, Tinghui Zhou and Dinesh Jayaraman | Autonomous Vehicles, Deep Learning |