Current paradigms for recognition in computer vision involve learning a generic feature representation on a large dataset of labeled images, and then specializing or fine-tuning it for the specific task at hand. Several different architectures for learning these generic feature representations have been proposed over the years, but all rely on the availability of a large dataset of labeled images. While there are many image modalities beyond RGB images, such as depth images, infrared images, aerial images, and LIDAR point clouds, the amount of labeled data in these modalities is still significantly smaller than in the RGB datasets used for learning features.
This project seeks to transfer models for vision tasks like object detection, segmentation, fine-grained categorization, and pose estimation, trained using large-scale annotated RGB datasets, to new modalities with no or very few such task-specific labels. Additionally, the team will investigate a complementary technique for transferring learned representations from one modality to another. This technique uses 'paired' images from the two modalities, utilizing the mid-level representations from the labeled modality to supervise learning representations on the paired unlabeled modality.
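A minimal sketch of this paired-supervision idea, under simplifying assumptions: here a fixed linear map stands in for the mid-level layer of a network trained on the labeled (RGB) modality, a learnable linear map stands in for the encoder of the unlabeled paired modality (e.g. depth), and both are trained on synthetic correlated data. All variable names and the linear-model setup are illustrative, not the project's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher": a frozen projection standing in for the mid-level
# feature layer of a network trained on labeled RGB images.
teacher_W = rng.standard_normal((64, 32))

# Hypothetical "student": the feature map for the unlabeled modality,
# initialized to zero and learned purely from the paired supervision signal.
student_W = np.zeros((64, 32))

# Paired data: each depth vector is aligned with an RGB vector of the same
# scene, simulated here as the RGB signal plus a small amount of noise.
rgb = rng.standard_normal((100, 64))
depth = rgb + 0.1 * rng.standard_normal((100, 64))

lr = 0.1
for _ in range(300):
    target = rgb @ teacher_W             # teacher's mid-level features
    pred = depth @ student_W             # student features on the paired modality
    err = pred - target
    grad = depth.T @ err / len(depth)    # gradient of the mean squared error
    student_W -= lr * grad               # no task labels used, only pairing

final_loss = np.mean((depth @ student_W - rgb @ teacher_W) ** 2)
```

The key point the sketch illustrates is that the student modality never sees task labels: its only training signal is agreement with the teacher's representation on paired inputs, after which the transferred features can be fine-tuned with whatever few task-specific labels exist.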
Preliminary experiments show promising results, with the possibility of further improvement through continued investigation. The implications for fields such as robotics could be substantial: robots will be able to leverage the benefits of auxiliary sensing within existing vision models without having to obtain additional annotations.