Multi-modal Fusion of Deep Convolutional Neural Networks for 3D Object Detection


At a glance

Autonomous vehicles (AVs) learn to represent their environment from the data collected by their sensors. Currently, LiDAR is the primary sensor used to generate 3D bounding boxes around elements in the scene, a task that is crucial for collision avoidance. How other sensor modalities can complement LiDAR and compensate for its weaknesses is often overlooked. RGB cameras are affordable and offer far higher spatial resolution than LiDAR, which makes them a natural candidate for fusion with LiDAR in 3D object detection tasks. In particular, while LiDAR is excellent at measuring distances, RGB data is more useful for determining the orientation or dimensions of the bounding box. We propose a neural network architecture that fuses RGB data with depth. This is done by adding image features to a PointNet architecture that is already well tuned for 3D point cloud recognition. Our preliminary experimental results show improvements of up to 8.7% for hard pedestrian detection on the KITTI dataset. We plan to fine-tune the architecture of our existing system and to extend it to other sensor modalities such as radar.
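As a rough illustration of what adding image features to a PointNet-style network can look like, the NumPy sketch below projects each 3D point into the image, looks up an image feature vector at that pixel, concatenates it with the point coordinates, and applies a shared per-point layer followed by a max pool. The summary above does not say how the fusion is actually performed, so every function name, the projection matrix `P`, the random feature map, and the layer sizes here are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def project_points(points_xyz, P):
    """Project N x 3 camera-frame points to pixels with a 3 x 4 matrix P (assumed)."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])  # N x 4 homogeneous coords
    uvw = homo @ P.T                                 # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                  # N x 2 pixel coords (u, v)

def sample_image_features(feature_map, uv):
    """Nearest-pixel lookup of a C-dim feature per point from an H x W x C map."""
    h, w, _ = feature_map.shape
    cols = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return feature_map[rows, cols]                   # N x C

def fuse_and_pool(points_xyz, feature_map, P, W, b):
    """Concatenate point and image features, then shared layer + max pool."""
    uv = project_points(points_xyz, P)
    img_feat = sample_image_features(feature_map, uv)
    fused = np.hstack([points_xyz, img_feat])        # N x (3 + C)
    per_point = np.maximum(fused @ W + b, 0.0)       # same weights for every point
    return per_point.max(axis=0)                     # order-invariant global feature

# Hypothetical inputs: random points in front of the camera, a random
# stand-in for a CNN feature map, and untrained random weights.
rng = np.random.default_rng(0)
points = rng.uniform([-1, -1, 5], [1, 1, 20], size=(128, 3))
P = np.array([[700.0, 0.0, 320.0, 0.0],
              [0.0, 700.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
feature_map = rng.standard_normal((480, 640, 8))
W = rng.standard_normal((3 + 8, 16))
b = np.zeros(16)
global_feat = fuse_and_pool(points, feature_map, P, W, b)
```

The max pool is the property that makes a PointNet-style network insensitive to the ordering of the points, and concatenating before the shared layer is only one of several possible places to inject image features.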

Avideh Zakhor 

3D object detection, sensor fusion, deep neural networks, multi-modal, LiDAR, camera