Low Latency Deep Inference for Self-Driving Vehicles


At a glance

Deep Neural Networks (DNNs) have revolutionized computer vision and speech recognition, and have recently proved themselves in reinforcement learning and policy optimization for robot and vehicle control. DNNs achieve high accuracy, but they are relatively expensive to both train and evaluate. Training a state-of-the-art CNN image recognizer or RNN semantic model takes several days on a single GPU machine and hours on a mid-sized compute cluster. For driving applications, low latency during evaluation is at least as important. Human reflexes work in tenths of a second, and vehicles and driving infrastructure have been developed around this time scale, so DNNs must complete perception-action cycles with similar latencies. The focus of this work is to reduce DNN latency: to develop software that executes perception-action cycles in a few tenths of a second and delivers this performance on state-of-the-art embedded computing hardware.

The team will approach this on two fronts:
- Embedded Kernel Optimization: Most deep learning toolkits have been developed to optimize training time. They use minibatch sizes of several hundred inputs, and performance often depends on the parallelism gained by processing data in batches. By contrast, the team will develop optimized kernels and algorithms for single inputs and minimum latency.
- End-to-End Kernel Development: The team’s past work on an NVIDIA GPU demonstrated an order-of-magnitude speedup over Google’s reference CPU implementation. While such speedups are not unusual for GPUs, the team’s implementation raised throughput to 200 GFLOPS, substantially more than had been demonstrated to date for inference on a sparse, power-law dataset. Three techniques contributed. First, the kernel was designed end to end, performing the entire forward-backward model update pipeline with one “touch” of the input data. Second, the calculation was moved into register memory, leveraging the large register storage on current GPUs. Third, the team developed a complementary algorithmic technique (negative sample sharing) to reduce memory bandwidth. While the goal in that project was to minimize training time, end-to-end design is also a key pattern for minimum latency.
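
The tension between batching and latency described in the first item can be made concrete with a little arithmetic. The sketch below is illustrative only: the linear cost model, the 30 frames-per-second camera rate, and the 20 ms and 2 ms cost figures are assumptions chosen for the example, not measurements from this project. It shows that the first frame in a batch must wait for the rest of the batch to arrive and then for the whole batch to be processed, so larger batches raise peak compute throughput while making per-frame latency far worse.

```python
def batch_compute_time_s(batch_size, fixed_cost_s, per_input_cost_s):
    """Hypothetical linear cost model: fixed overhead plus per-input work."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost_s + batch_size * per_input_cost_s


def worst_case_latency_s(batch_size, frame_interval_s, fixed_cost_s, per_input_cost_s):
    """Time from the first frame's arrival until its result is available.

    The first frame waits for the other (batch_size - 1) frames to arrive,
    then for the entire batch to be evaluated.
    """
    compute_s = batch_compute_time_s(batch_size, fixed_cost_s, per_input_cost_s)
    fill_wait_s = (batch_size - 1) * frame_interval_s
    return fill_wait_s + compute_s


def peak_throughput_fps(batch_size, fixed_cost_s, per_input_cost_s):
    """Inputs per second the device could sustain if batches were always full."""
    return batch_size / batch_compute_time_s(batch_size, fixed_cost_s, per_input_cost_s)


if __name__ == "__main__":
    # All three constants are assumed values for illustration.
    FRAME_INTERVAL_S = 1.0 / 30.0   # camera delivering 30 frames per second
    FIXED_COST_S = 0.020            # per-batch overhead
    PER_INPUT_COST_S = 0.002        # marginal cost of one more input

    for batch_size in (1, 8, 256):
        latency = worst_case_latency_s(
            batch_size, FRAME_INTERVAL_S, FIXED_COST_S, PER_INPUT_COST_S
        )
        throughput = peak_throughput_fps(batch_size, FIXED_COST_S, PER_INPUT_COST_S)
        print(f"batch={batch_size:4d}  latency={latency:7.3f} s  "
              f"peak throughput={throughput:7.1f} inputs/s")
```

Under these assumed numbers, only the single-input configuration stays within a budget of a few tenths of a second. This is why kernels tuned for minibatches of several hundred inputs are a poor fit for perception-action cycles, even when their throughput is high.
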
John Canny
Deep Neural Nets
Kernel Optimization and Development