Abstract

Teaser image

Semantic segmentation of the surroundings from a bird's-eye-view (BEV) perspective can be considered a minimum requirement for holistic scene understanding of mobile robots. Although recent vision-only methods have demonstrated impressive performance, they remain susceptible to adverse environmental conditions such as rain or nighttime. While active sensors can address this challenge, the use of LiDAR remains controversial due to its high cost. Fusing camera data with automotive radars offers a less expensive alternative but has received little attention in prior research. In this work, we investigate this promising path by introducing BEVCar for joint map and object BEV segmentation. The core novelty of our approach is to first learn a point-based encoding of the raw radar data and then leverage this representation to efficiently initialize the lifting of image features into the BEV space. In extensive experiments, we demonstrate that utilizing radar information significantly increases robustness under adverse environmental conditions and improves segmentation of distant objects. To foster future research, we release the weather split of the nuScenes dataset used in our evaluation, along with our code and trained models, in our GitHub repository.

Technical Approach

Overview of our approach

In this work, we propose BEVCar, a novel approach for camera-radar sensor fusion for joint object and map segmentation in the bird's-eye-view (BEV) space. To encode the surround-view camera data, we employ a frozen DINOv2 backbone, whose image representation captures more semantic information than the ResNet-based encoders used in previous methods. Inspired by LiDAR-based processing, we further utilize a learnable radar encoding module to obtain more abstract features than the raw radar returns. We follow a learning-based approach to lift the encoded image features from the 2D image plane to the BEV space. In particular, we propose a novel query initialization scheme that exploits depth information from radar to enhance the image feature lifting. Afterwards, we fuse the lifted image features with the encoded radar data using deformable attention. Finally, we perform multi-class BEV segmentation to obtain pixel-wise predictions for both vehicles and the map categories.
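To make the pipeline above concrete, the following is a minimal, self-contained PyTorch sketch of its main stages. It is not the official implementation (see the GitHub repository below): all module and parameter names (RadarPointEncoder, RadarInitLifting, BEVCarSketch, feat_dim, bev_size, num_classes) are hypothetical, a plain convolution stands in for the frozen DINOv2 backbone, standard multi-head attention replaces deformable attention, and the camera-radar fusion step is simplified to feature concatenation.

import torch
import torch.nn as nn

class RadarPointEncoder(nn.Module):
    """PointNet-style per-point MLP whose outputs are scattered onto a BEV grid (sketch)."""
    def __init__(self, in_dim=6, feat_dim=64, bev_size=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.bev_size = bev_size
        self.feat_dim = feat_dim

    def forward(self, points, bev_indices):
        # points: (B, N, in_dim) raw radar returns, e.g. position, RCS, velocity
        # bev_indices: (B, N) flattened BEV cell index per point (dtype long)
        B, N, _ = points.shape
        feats = self.mlp(points)                                    # (B, N, C)
        bev = points.new_zeros(B, self.bev_size ** 2, self.feat_dim)
        # max-pool point features that fall into the same BEV cell
        idx = bev_indices.unsqueeze(-1).expand(-1, -1, self.feat_dim)
        bev.scatter_reduce_(1, idx, feats, reduce="amax", include_self=False)
        return bev                                                  # (B, H*W, C)

class RadarInitLifting(nn.Module):
    """BEV queries initialized from radar features, refined by cross-attention to
    image features (standard attention stands in for deformable attention here)."""
    def __init__(self, feat_dim=64, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
            for _ in range(num_layers))

    def forward(self, radar_bev, img_feats):
        # radar_bev: (B, H*W, C) serves as the initial BEV queries
        # img_feats: (B, V*h*w, C) flattened multi-view image features
        queries = radar_bev
        for attn in self.layers:
            out, _ = attn(queries, img_feats, img_feats)
            queries = queries + out
        return queries                                              # (B, H*W, C)

class BEVCarSketch(nn.Module):
    def __init__(self, feat_dim=64, bev_size=100, num_classes=7):
        super().__init__()
        # A frozen DINOv2 ViT encodes the camera images in the actual method;
        # a small patchify convolution keeps this sketch self-contained.
        self.img_encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        self.radar_encoder = RadarPointEncoder(feat_dim=feat_dim, bev_size=bev_size)
        self.lifting = RadarInitLifting(feat_dim=feat_dim)
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)   # concat-fuse (simplification)
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.bev_size = bev_size

    def forward(self, images, radar_points, radar_bev_idx):
        # images: (B, V, 3, H, W) surround-view cameras
        B, V, _, H, W = images.shape
        img = self.img_encoder(images.flatten(0, 1))                # (B*V, C, h, w)
        img = img.flatten(2).transpose(1, 2).reshape(B, -1, img.shape[1])
        radar_bev = self.radar_encoder(radar_points, radar_bev_idx)
        bev = self.lifting(radar_bev, img)                          # lifted image features
        bev = self.fusion(torch.cat([bev, radar_bev], dim=-1))      # camera-radar fusion
        bev = bev.transpose(1, 2).reshape(B, -1, self.bev_size, self.bev_size)
        return self.head(bev)        # (B, num_classes, H_bev, W_bev) segmentation logits

The point this sketch illustrates is the query initialization: the radar BEV features themselves serve as the initial queries for the image-to-BEV lifting, so the attention starts from cells that the radar already indicates as occupied rather than from uninformed learned embeddings.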

Video

Code

A PyTorch implementation of this project is available in our GitHub repository for academic use and is released under the CC BY-NC-SA 4.0 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada
BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation
arXiv preprint arXiv:2403.11761, 2024.

(PDF) (BibTeX)

Authors

Jonas Schramm

University of Freiburg

Niclas Vödisch

University of Freiburg

Kürsat Petek

University of Freiburg

B Ravi Kiran

Qualcomm SARL France

Senthil Yogamani

QT Technologies Ireland Limited

Wolfram Burgard

University of Technology Nuremberg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by Qualcomm Technologies Inc., the German Research Foundation (DFG) Emmy Noether Program, grant No. 468878300, and an academic grant from NVIDIA.