Voxel-MAE: Masked Autoencoder for Self-Supervised Pre-Training on Lidar Point Clouds


Method overview. We pre-train a Transformer-based voxel encoder to reconstruct masked voxels and to distinguish between empty and non-empty voxels. The pre-trained model is then fine-tuned on 3D object detection.

Core idea

We extend the idea of Masked Autoencoders to voxelized point clouds. To capture the unique characteristics of point clouds, we propose three reconstruction losses tailored to the voxel representation. We show that our pre-training outperforms a randomly initialized equivalent by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset, and reduces the amount of annotated data required for fine-tuning by 60%.
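As a rough illustration of how such voxel-level objectives could be combined, the sketch below assumes three terms: a Chamfer-style loss on points reconstructed inside masked voxels, a regression loss on the number of points per masked voxel, and a binary classification loss separating empty from non-empty voxels. These specific choices and function names are illustrative assumptions; the exact formulation is given in the paper.

```python
# Hypothetical sketch of voxel-level pre-training losses (PyTorch).
# The concrete terms (Chamfer reconstruction, point-count regression,
# occupancy classification) are assumptions made for illustration.
import torch
import torch.nn.functional as F


def chamfer_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets (N, 3) and (M, 3)."""
    dists = torch.cdist(pred, target)  # (N, M) pairwise distances
    return dists.min(dim=1).values.mean() + dists.min(dim=0).values.mean()


def voxel_mae_loss(pred_points, gt_points, pred_num_points, gt_num_points,
                   pred_occupancy_logit, gt_occupancy):
    """Combine the three assumed pre-training terms.

    pred_points / gt_points: lists of (N_i, 3) tensors, one per masked voxel
        (the number of points per voxel is dynamic).
    pred_num_points / gt_num_points: (V,) predicted and true point counts.
    pred_occupancy_logit / gt_occupancy: (V,) logits and {0, 1} labels for
        empty vs. non-empty voxels.
    """
    recon = torch.stack([chamfer_distance(p, g)
                         for p, g in zip(pred_points, gt_points)]).mean()
    count = F.smooth_l1_loss(pred_num_points, gt_num_points.float())
    occ = F.binary_cross_entropy_with_logits(pred_occupancy_logit,
                                             gt_occupancy.float())
    return recon + count + occ  # relative weighting omitted for brevity
```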

Abstract

Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and, recently, point clouds. Raw automotive datasets are suitable candidates for self-supervised pre-training as they are generally cheap to collect compared to annotations for tasks like 3D object detection (OD). However, the development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward small and dense point clouds with homogeneous point densities. In this work, we study masked autoencoding for point clouds in an automotive setting, where point clouds are sparse and their density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves the 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent.

Background

Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and, recently, point clouds. However, its effectiveness for large-scale, sparse, automotive point clouds has not been studied. We devise a simple masked autoencoding pre-training scheme designed for voxel representations. The voxel representation is widely used for 3D perception tasks in the automotive domain; hence, we tailor our pre-training to this representation.
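To make the masked-autoencoding step concrete, the minimal sketch below randomly hides a fraction of the non-empty voxels before they reach the Transformer encoder. The 70% masking ratio and the helper name are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch: randomly mask a fraction of the non-empty voxels.
# The mask ratio and function name are assumptions for illustration.
import torch


def mask_voxels(voxel_features: torch.Tensor, mask_ratio: float = 0.7):
    """Split non-empty voxel features (V, C) into visible and masked subsets."""
    num_voxels = voxel_features.shape[0]
    num_masked = int(mask_ratio * num_voxels)
    perm = torch.randperm(num_voxels)  # random ordering of the voxels
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
    # Only the visible voxels are encoded; the masked ones are reconstructed
    # by the decoder during pre-training.
    return voxel_features[visible_idx], visible_idx, masked_idx
```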

Figure: patching strategies in MAE, Point-MAE, and Voxel-MAE. MAE divides images into non-overlapping patches of fixed size. Existing methods for masked point modeling create point cloud patches with a fixed number of points by using farthest point sampling and k-nearest neighbors. Our method uses non-overlapping voxels with a *dynamic* number of points.
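To make the difference concrete, here is a minimal NumPy sketch of grouping a lidar sweep into non-overlapping voxels, where each voxel simply keeps however many points fall into it. The grid resolution is an illustrative value, not the paper's setting.

```python
# Minimal sketch of voxelization with a dynamic number of points per voxel.
# The voxel size is an illustrative assumption.
from collections import defaultdict
import numpy as np


def voxelize(points: np.ndarray, voxel_size=(0.5, 0.5, 0.5)):
    """Group an (N, 3) lidar point cloud into non-overlapping voxels.

    Returns a dict mapping integer voxel indices (ix, iy, iz) to the points
    inside that voxel, so each voxel holds a *dynamic* number of points and
    empty voxels never appear as keys.
    """
    indices = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxels = defaultdict(list)
    for idx, point in zip(map(tuple, indices), points):
        voxels[idx].append(point)
    return {idx: np.stack(pts) for idx, pts in voxels.items()}


# Example usage on a random point cloud.
cloud = np.random.uniform(-50, 50, size=(1000, 3)).astype(np.float32)
voxel_dict = voxelize(cloud)
print(len(voxel_dict), "non-empty voxels")
```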

Results

We show that our pre-training outperforms a randomly initialized equivalent by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset, and reduces the amount of annotated data required for fine-tuning by 60%. For more details, please refer to our paper.