Efficient Neural Nets for Real-Time Object Recognition on High-Resolution Images


At a glance

Object recognition tasks, including object detection and semantic segmentation, are fundamental problems in the computer vision community and are crucial for autonomous driving perception. In recent years, Deep Neural Networks (DNNs) have become widely used and have achieved state-of-the-art accuracy in object recognition tasks.
One of the remaining challenges for autonomous driving perception is recognizing objects that are far away from the ego vehicle. This ability is crucial for safety as well as for a smooth passenger experience in the autonomous vehicle. The task is difficult because of the following dilemma: the farther away an object is, the smaller it appears in the camera view, so representing it adequately requires higher-resolution images. But high-resolution images increase computational cost, making it harder to satisfy the inference speed requirement for autonomous driving, particularly within a tight power budget. Resolving this dilemma therefore requires novel DNN architectures. In this project, we plan to explore efficient neural networks that can perform real-time object recognition tasks (object detection, semantic segmentation) on high-resolution images, such as 4K (3840 x 2160).
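To make the cost side of this dilemma concrete, the following sketch compares pixel counts and the multiply-accumulate (MAC) cost of a single convolution layer at two resolutions. The 1280 x 720 baseline and the layer shape (3 x 3 kernel, 64 input and 64 output channels) are illustrative assumptions, not values taken from this project.

```python
def pixel_count(width, height):
    """Number of pixels in an image of the given size."""
    return width * height


def conv_macs(width, height, c_in, c_out, kernel_size):
    """MACs for one stride-1, same-padded convolution layer.

    Each output pixel needs kernel_size**2 * c_in MACs per output channel.
    """
    return pixel_count(width, height) * c_in * c_out * kernel_size ** 2


# Hypothetical baseline resolution (assumption) versus 4K.
baseline = (1280, 720)
uhd_4k = (3840, 2160)

ratio = pixel_count(*uhd_4k) / pixel_count(*baseline)
print(f"4K pixels: {pixel_count(*uhd_4k):,}")
print(f"Pixel ratio 4K / baseline: {ratio:.1f}x")

# Hypothetical layer shape (assumption): 3x3 kernel, 64 -> 64 channels.
for name, (w, h) in [("baseline", baseline), ("4K", uhd_4k)]:
    print(f"{name}: {conv_macs(w, h, 64, 64, 3) / 1e9:.1f} GMACs")
```

Because the cost of a convolution grows linearly with the number of pixels, every such layer costs 9 times more at 4K than at the assumed baseline.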

Research agenda and challenges
We foresee two major challenges in the course of this project: 1) collecting labeled data and 2) designing resolution-scalable NN architectures.
Labeled data collection: To the best of our knowledge, there is no large-scale public dataset with 4K resolution for object recognition tasks. To build a dataset for our purpose, we plan to use Grand Theft Auto V (GTA-V) to synthesize labeled images. In the past year, we have built a simulation framework based on GTA-V, in which we can obtain the ground-truth labels (bounding boxes, pixel-wise class labels, depth, etc.) for each in-game image. Furthermore, since we can control a large number of environment variables in the scene, we can easily synthesize different scenarios to ensure the diversity of the dataset.

In addition to our own synthesized data, we are aware that BDD industrial partners, such as Samsung, are building high-resolution datasets and are willing to share them with us. It would be mutually beneficial to include this annotated real-world data in our effort.
Resolution-scalable NN architecture: The second challenge we will focus on is designing novel scalable NN architectures. Today, most neural nets for object detection and semantic segmentation operate on images with ~1K resolution. Directly feeding 4K images to contemporary models would not only incur a computational cost beyond the capability of most processors; the receptive fields of these models would also be too small to cover the whole image, leading to degraded accuracy.
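The receptive field argument can be illustrated with the standard recurrence for a stack of convolution layers: each layer grows the receptive field by (kernel size - 1) times the product of all preceding strides. The layer stack and the 1280-pixel baseline width below are hypothetical, chosen only to show the trend.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    layers: sequence of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs, in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical backbone (assumption): five stages, each a 3x3 stride-1
# convolution followed by a 3x3 stride-2 convolution.
backbone = [(3, 1), (3, 2)] * 5
rf = receptive_field(backbone)

# Fraction of the image width seen by one output unit.
for name, width in [("baseline (assumed 1280 wide)", 1280), ("4K", 3840)]:
    print(f"{name}: receptive field {rf} px = {rf / width:.1%} of width")
```

The receptive field is fixed in pixels, so the same network sees a three times smaller fraction of each image dimension at 4K than at the assumed baseline width.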

In our recent work, we proposed the “Shift” operation [Wu2017-2]: a zero-FLOP, zero-parameter alternative to spatial convolutions. This operation has been shown to be more efficient than spatial convolutions in various tasks and, more importantly, the computation and parameter size of “Shift” do not change as the receptive field increases. With the “Shift” operation, we expect to be able to build more efficient NN architectures for high-resolution images. We hope that the proposed neural networks can achieve both real-time speed and high accuracy for object recognition tasks on high-resolution images, and in particular demonstrate significant improvement in detecting small or far-away objects.
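As a rough illustration of the idea behind the “Shift” operation, the sketch below moves each channel of a feature map by a fixed spatial offset drawn from a k x k neighborhood. It is a simplified sketch in plain Python, not the implementation from [Wu2017-2]: the round-robin assignment of offsets to channels, the zero padding at the borders, and the function names are assumptions made for illustration.

```python
from itertools import product


def shift_channel(channel, dy, dx):
    """Return a copy of a 2D channel with out[y][x] = channel[y+dy][x+dx].

    Positions that would read outside the channel are filled with 0.
    """
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = channel[sy][sx]
    return out


def shift(feature_map, kernel_size=3):
    """Shift every channel by one offset from a kernel_size^2 neighborhood.

    feature_map: list of channels, each a list of rows.
    Offsets are assigned round-robin over channels (assumption).
    Only data movement happens: no multiplications, no learned weights.
    """
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=2))
    return [
        shift_channel(channel, *offsets[c % len(offsets)])
        for c, channel in enumerate(feature_map)
    ]


# Nine identical 3x3 channels, one per offset of a 3x3 neighborhood.
fm = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]] for _ in range(9)]
shifted = shift(fm, kernel_size=3)
print(shifted[0])  # offset (-1, -1)
print(shifted[4])  # offset (0, 0): unchanged
```

A larger `kernel_size` widens the set of offsets, and hence the receptive field, while each output value is still a single copy. Shifting alone does not combine channels; in [Wu2017-2] that role is played by pointwise (1 x 1) convolutions applied around the shift.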

Principal investigators: Kurt Keutzer
Themes: Efficient neural nets, computer vision on HD images