Abstract

Teaser image

Semantic segmentation of the surroundings from a bird's-eye-view (BEV) perspective can be considered a minimum requirement for holistic scene understanding of mobile robots. Although recent vision-only methods have shown impressive progress in terms of performance, operation under adverse conditions such as rain or nighttime remains difficult. While active sensors can address this challenge, the usage of LiDARs remains controversial due to their high cost. Fusing camera data with automotive radars offers a less expensive alternative but has received less attention in prior research. In this work, we continue to investigate this promising path by introducing BEVCar for joint map and object BEV segmentation. The core novelty of our approach is to first learn a point-based encoding of the raw radar data and then leverage this representation to efficiently initialize the lifting of image features to the BEV space. In extensive experiments, we demonstrate that utilizing radar information significantly increases robustness under adverse environmental conditions and improves segmentation of distant objects. To foster future research, we release the weather split of the nuScenes dataset used in our experiments along with our code and trained models in our GitHub repository.

Technical Approach

Overview of our approach

In this work, we propose BEVCar, a novel approach to camera-radar sensor fusion for joint object and map segmentation in the bird's-eye-view (BEV) space. For encoding the surround-view camera data, we employ a frozen DINOv2 backbone, whose image representation captures more semantic information than the ResNet-based backbones used in previous methods. Inspired by LiDAR-based processing, we further utilize a learnable radar encoding module to obtain more abstract features than the raw radar metadata. We follow a learning-based approach to lift the encoded vision features from the 2D image plane to the BEV space. In particular, we propose a novel query initialization scheme that exploits radar-based depth to enhance the image feature lifting. Afterwards, we fuse the lifted image features with the encoded radar data using deformable attention. Finally, we perform multi-class BEV segmentation to obtain pixel-wise predictions for both vehicles and the map categories.
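To make the overall data flow more concrete, the following minimal PyTorch sketch outlines how the components could fit together. It is not the released implementation: the module names, feature dimensions, BEV grid size, and the use of standard multi-head attention in place of deformable attention are simplifying assumptions, and the frozen DINOv2 image tokens are assumed to be precomputed.

```python
# Minimal, illustrative sketch of a BEVCar-style camera-radar fusion pipeline.
# All names and dimensions are assumptions, not the official implementation.
import torch
import torch.nn as nn


class RadarPointEncoder(nn.Module):
    """Learned encoding of raw radar points (e.g., position, RCS, velocity) into a BEV grid."""

    def __init__(self, in_dim: int = 6, feat_dim: int = 64, bev_size: int = 100):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
        self.bev_size = bev_size
        self.feat_dim = feat_dim

    def forward(self, points: torch.Tensor, bev_idx: torch.Tensor) -> torch.Tensor:
        # points: (N, in_dim) raw radar returns, bev_idx: (N,) flattened BEV cell index per point
        feats = self.mlp(points)                                    # (N, feat_dim)
        bev = points.new_zeros(self.bev_size * self.bev_size, self.feat_dim)
        bev.index_add_(0, bev_idx, feats)                           # scatter point features into cells
        return bev.view(self.bev_size, self.bev_size, self.feat_dim)


class BEVCarSketch(nn.Module):
    def __init__(self, img_dim: int = 384, feat_dim: int = 64, bev_size: int = 100,
                 num_classes: int = 8):
        super().__init__()
        self.radar_enc = RadarPointEncoder(feat_dim=feat_dim, bev_size=bev_size)
        self.img_proj = nn.Linear(img_dim, feat_dim)
        # Radar-initialized BEV queries attend to image tokens
        # (standard attention used here as a stand-in for deformable attention).
        self.lift_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.fuse_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.seg_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.bev_size = bev_size

    def forward(self, img_tokens, radar_points, radar_bev_idx):
        # img_tokens: (1, T, img_dim) precomputed tokens from a frozen image backbone (e.g., DINOv2)
        radar_bev = self.radar_enc(radar_points, radar_bev_idx)     # (H, W, C)
        queries = radar_bev.view(1, -1, radar_bev.shape[-1])        # radar-driven BEV query init
        img_feats = self.img_proj(img_tokens)                       # (1, T, C)
        lifted, _ = self.lift_attn(queries, img_feats, img_feats)   # lift image features to BEV
        fused, _ = self.fuse_attn(lifted, queries, queries)         # fuse with encoded radar features
        bev = fused.view(1, self.bev_size, self.bev_size, -1).permute(0, 3, 1, 2)
        return self.seg_head(bev)                                   # (1, num_classes, H, W) logits
```

As a usage example under the same assumptions, `BEVCarSketch()(img_tokens, radar_points, radar_bev_idx)` with `img_tokens` of shape (1, T, 384), `radar_points` of shape (N, 6), and integer `radar_bev_idx` of shape (N,) returns per-cell segmentation logits over the BEV grid.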

Video

Code

A PyTorch-based implementation of this project can be found in our GitHub repository for academic use and is released under the CC BY-NC-SA 4.0 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada
BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation
arXiv preprint arXiv:2403.11761, 2024.

(PDF) (BibTeX)

Authors

Jonas Schramm

University of Freiburg

Niclas Vödisch

University of Freiburg

Kürsat Petek

University of Freiburg

B Ravi Kiran

Qualcomm SARL France

Senthil Yogamani

QT Technologies Ireland Limited

Wolfram Burgard

University of Technology Nuremberg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by Qualcomm Technologies Inc. and the German Research Foundation (DFG) Emmy Noether Program, grant no. 468878300.