Self-supervision, Meta-supervision, Curiosity: The Harder to Learn, the Easier to Generalize


At a glance

Much of the effort, and excitement, of deep learning so far has focused on direct supervised learning: mapping input vectors X to annotated labels Y. While many practical problems can be successfully posed via this data association setup, the approach has a number of pitfalls, many of which are present in the automotive domain. In particular, direct supervised learning assumes that a large amount of labelled training data is available for a given problem. Unfortunately, this is often not the case, for example for rare events like road accidents. Perhaps even more problematic, direct supervised learning assumes that training data and test data are drawn from the same distribution. In real-world vision systems, this is almost never the case due to the prevalence of dataset bias. For example, ImageNet “cars” look very different from cars seen from an on-board camera, which explains why ImageNet-trained models perform so poorly in real-world settings. Other common scenarios in which training and test data differ include changes of season, illumination, and time of day.
When either of the two assumptions above is not met, catastrophic overfitting is likely to occur. This usually takes the form of the learning system finding and learning “shortcuts” in the training data that produce good results on training and even validation sets but fail to generalize to real-world scenarios. This is akin to a bad student cramming for a final exam, having missed most of the course lectures. The student might very well do OK on the exam, but still not really know the material.
How can we prevent classifiers from cheating and learning “shortcuts”? Having enough training data of the correct distribution is the obvious answer, but this is infeasible for most real-world scenarios. Another possible solution is using domain transfer to adapt a trained model to a different or novel domain. While this can sometimes work, often the originally trained classifier is already overfitting so much that no amount of domain transfer can rescue it.
Instead, in this project, we propose several better ways of training models that force them to generalize from the start, instead of just memorizing various data shortcuts. The idea is to make the learning algorithm work harder at training time to discover the regularities in the data, instead of just letting it memorize the training set. This is akin to a student trying to solve a math problem first, before looking at the back of the textbook to check the answer, compared to simply memorizing a list of problem/answer pairs.
In 2018, we will continue investigating ways of making learning harder, to make generalization easier. The investigation will focus on the following three main thrusts:

Self-supervised learning is a broad range of approaches in which raw data, instead of curated labels, is used as the supervisory signal in training. One benefit is that we can train high-capacity prediction models without the need for costly human annotation. More importantly, skipping the cumbersome and often subjective semantic labeling step turns out to produce models that are able to reason directly with the raw high-dimensional data signal, greatly reducing the threat of overfitting and “shortcut” finding. Our group was the first to use self-supervision in the context of deep learning, an area that has now grown to be extremely popular in computer vision and machine learning. Last year, we produced a number of breakthroughs in this area, such as the highly cited pix2pix method (see the BDD 2017 report for more details).
This year, we plan to make a big push towards self-supervised online learning. Most contemporary machine learning is practiced in batch mode, where a well-defined “training set” is used to train a model, which is then evaluated on a “test set”. While batch training makes a lot of sense in an academic setting, where different algorithms need to be evaluated on common benchmarks, online training offers a number of important advantages for real-world problems, particularly in the automotive domain. In online training, also known as life-long learning, every new piece of data (e.g. a video frame) is first treated as “test data” and then immediately used as “training data” to update the current model (using a self-supervised training paradigm). The advantages are: 1) there is no need to store the acquired training data, since every data point can be safely discarded once it has been used in training; 2) consequently, there is little worry of overfitting, as the model is only exposed to novel data and never sees the same piece of data more than once; 3) gradual domain shift is handled gracefully, as the system slowly adjusts from one domain to the next as the incoming data distribution starts to change. Despite these advantages, online training has not seen much use in practice, mainly due to the difficulty of coming up with effective self-supervised tasks that could be used for training. Here, we believe that our work on self-supervised prediction, such as Split-Brain Autoencoders, could be a perfect candidate to try and revive the fortunes of online training.
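The test-then-train loop described above can be illustrated with a toy sketch. It is not the project's method: the two-channel data stream, the single-weight model, and the names `stream` and `online_learn` are all assumptions made for illustration. Each "frame" has two channels, and the model predicts one channel from the other, loosely in the spirit of cross-channel prediction. Every frame is scored first, used for one gradient step, and then dropped, while the underlying relation drifts slowly to mimic gradual domain shift.

```python
import random


def stream(n_steps, seed=0):
    """Hypothetical raw data stream of two-channel frames (a, b).

    The relation b = slope * a drifts slowly from 2.0 toward 3.0,
    standing in for a gradual domain shift.
    """
    rng = random.Random(seed)
    for t in range(n_steps):
        slope = 2.0 + t / n_steps
        a = rng.uniform(-1.0, 1.0)
        yield a, slope * a


def online_learn(frames, lr=0.1):
    """Test-then-train: score each frame, update on it once, then drop it."""
    w = 0.0       # single-weight model predicting channel b from channel a
    errors = []
    for a, b in frames:
        err = w * a - b
        errors.append(err * err)    # 1) the frame is first used as test data
        w -= lr * 2.0 * err * a     # 2) then for one SGD step on squared error
        # 3) nothing is stored; the frame is never visited again
    return w, errors


w, errors = online_learn(stream(5000))
early_error = sum(errors[:100]) / 100    # model has barely seen any data
late_error = sum(errors[-100:]) / 100    # model is tracking the drifted relation
print(w, early_error, late_error)
```

In this toy setting the running error doubles as an online evaluation metric, since every frame is scored before the model has trained on it.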

Meta-supervision is a new concept being developed in our group as a counterpoint to direct supervision. The main idea is this: instead of directly supervising the output value for each input signal, we want to supervise some constraint on the outputs. Instead of telling the algorithm what the result should be, we tell it how the result should behave. One particular example of meta-supervision we have been exploiting is cycle-consistency: the idea that cycles of any length on a graph of transformations should result in a null transformation (CVPR’16, CVPR’17, ICCV’17). Our CycleGAN formulation in particular, developed last year as part of the BDD effort, has become very well known and has spawned many follow-ups. However, we believe that there are many other constraints one should be able to use as meta-supervision. This year, we propose to look at differential constraints as well as set-theoretic and graph-theoretic constraints (e.g. bisection, closure, etc.) that will allow us to offer supervision while preventing overfitting.
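As a minimal sketch of the cycle-consistency constraint for the simplest case, a cycle of length two, the snippet below scores a pair of mappings by how far each round trip lands from its starting point. The scalar mappings `g`, `f_good`, and `f_bad` and the L1 penalty are assumptions chosen for illustration; this is not the CycleGAN implementation. Note that the loss needs only unpaired samples from each domain: nothing tells the algorithm which y belongs to which x, only how the two mappings should behave together.

```python
def cycle_consistency_loss(g, f, xs, ys):
    """Mean L1 error of the round trips x -> g -> f and y -> f -> g.

    xs and ys are unpaired samples from the two domains.
    """
    forward = sum(abs(f(g(x)) - x) for x in xs) / len(xs)
    backward = sum(abs(g(f(y)) - y) for y in ys) / len(ys)
    return forward + backward


def g(x):
    """Hypothetical mapping from domain X to domain Y."""
    return 2.0 * x + 1.0


def f_good(y):
    """Exact inverse of g, so every round trip is a null transformation."""
    return (y - 1.0) / 2.0


def f_bad(y):
    """Not an inverse of g, so round trips drift away from their start."""
    return 0.5 * y


xs = [0.0, 1.0, 2.0]
ys = [-3.0, 4.0, 10.0]   # deliberately not the images of xs under g

loss_good = cycle_consistency_loss(g, f_good, xs, ys)
loss_bad = cycle_consistency_loss(g, f_bad, xs, ys)
print(loss_good, loss_bad)
```

In training, a penalty of this kind would be minimized with respect to the parameters of both mappings; here the mappings are fixed so that the behavior of the constraint itself is easy to inspect.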

Last year, we developed a practical method for training an agent without any extrinsic objectives or goals, using only the idea of curiosity. The idea is akin to a student learning how to draw just by doodling. We defined curiosity as “failure to predict”, which connects to the above idea of meta-supervision as another way to enforce model consistency. Our approach (ICML 2017) worked surprisingly well on learning to play the Super Mario Bros. video game from scratch and with no rewards. However, it did not generalize well to the physical scenario of a curious robot moving indoors. This year, we plan to address this shortcoming by incorporating the notions of a curiosity horizon and hierarchical curiosity.
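The “failure to predict” definition can be sketched as an intrinsic reward equal to the error of a learned forward model. The example below is a toy under stated assumptions, not the method from the ICML 2017 work: the tabular `ForwardModel`, the 1-D environment in `step`, and the squared-error reward are all hypothetical. It shows the qualitative behavior one would expect: the reward for a repeated transition shrinks as the model learns it, while an unseen transition still earns a high reward.

```python
def curiosity_reward(predicted_next, actual_next):
    """Intrinsic reward: squared prediction error of the forward model."""
    return (predicted_next - actual_next) ** 2


class ForwardModel:
    """Hypothetical tabular model of next_state given (state, action)."""

    def __init__(self, lr=0.5):
        self.table = {}
        self.lr = lr

    def predict(self, state, action):
        # Before any experience, assume the action changes nothing.
        return self.table.get((state, action), float(state))

    def update(self, state, action, next_state):
        pred = self.predict(state, action)
        self.table[(state, action)] = pred + self.lr * (next_state - pred)


def step(state, action):
    """Toy 1-D environment (an assumption): the action shifts the position."""
    return state + action


model = ForwardModel()
rewards = []
for _ in range(5):                      # experience the same transition 5 times
    nxt = step(0, 1)
    rewards.append(curiosity_reward(model.predict(0, 1), nxt))
    model.update(0, 1, nxt)

# A transition the agent has never tried is still surprising.
novel_reward = curiosity_reward(model.predict(3, -1), step(3, -1))
print(rewards, novel_reward)
```

An agent maximizing this reward is pushed toward transitions it cannot yet predict, with no extrinsic goal involved.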

Principal investigators: Alexei Efros
Themes: self-supervision, meta-supervision, curiosity, computer vision