The recent success of neural networks enables better interpretation of 3D point clouds, but processing a large-scale 3D scene remains a challenging problem. Most current approaches divide a large-scale scene into small regions and combine the local predictions. However, this scheme inevitably requires additional pre- and post-processing stages and may also degrade the final output because predictions are made from a local perspective. This paper introduces Fast Point Transformer, which consists of a new lightweight self-attention layer. Our approach encodes continuous 3D coordinates, and its voxel hashing-based architecture boosts computational efficiency. We demonstrate the proposed method on 3D semantic segmentation and 3D object detection. The accuracy of our approach is competitive with the best voxel-based method, and our network achieves 129 times faster inference than the state-of-the-art Point Transformer, with a reasonable accuracy trade-off, for 3D semantic segmentation on the S3DIS dataset.

Overall architecture

Figure 1. We illustrate the overall architecture of the proposed Fast Point Transformer. The red points are input points and their features, and the purple points are output points and their features. The colored squares are non-empty voxels produced by voxelization. The blue and green points are centroids of non-empty voxels with their features.

Quantitative results

1. Results on S3DIS Area 5 test

Table 1. 3D semantic segmentation results on S3DIS Area 5. We analyze the theoretical time complexity of neighbor search algorithms and evaluate the per-scene wall-time latency of each network. We denote N as the number of dataset points, M as the number of query points (or voxel centroids), and K as the number of neighbors to search. Both M and N are much larger than K in a large-scale point cloud.
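The complexity argument in the caption above rests on voxel hashing: once non-empty voxels are stored in a hash table, the neighbors of a query voxel are found with a constant number of hash probes, so the total cost scales as O(MK) rather than growing with N as in tree-based search. A hedged sketch of this lookup (illustrative only; the kernel-window formulation is an assumption, not the authors' exact scheme):

```python
def build_hash_table(coords):
    """Hash integer voxel coordinates (3-tuples) to their row indices."""
    return {tuple(c): i for i, c in enumerate(coords)}

def kernel_neighbors(table, query, kernel_size=3):
    """Find occupied voxels inside a kernel_size^3 window around `query`.

    Each candidate offset is a single hash probe, so the cost per query
    is O(K) regardless of how many voxels the table holds.
    """
    r = kernel_size // 2
    neighbors = []
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                key = (query[0] + dx, query[1] + dy, query[2] + dz)
                if key in table:
                    neighbors.append(table[key])
    return neighbors

coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
table = build_hash_table(coords)
near_origin = kernel_neighbors(table, (0, 0, 0))  # finds rows 0 and 1
```

A kd-tree query instead costs O(log N) per neighbor search, which is why the latency gap widens on large-scale scenes where N reaches millions of points.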

2. Results on ScanNet V2 validation

(a) 3D semantic segmentation

(b) 3D object detection

Table 2. (a) 3D semantic segmentation results on ScanNet validation. (b) 3D object detection results on ScanNet validation. We report two mAP scores of VoteNet [1] with different backbones on ScanNet validation. Numbers except those of MinkowskiNet† and Fast Point Transformer are taken from Chaton et al. [2].

Qualitative results

1. 3D semantic segmentation on ScanNet V2 validation

Figure 2. (First column) Input point cloud, (Second column) semantic labels predicted by MinkowskiNet42†, (Third column) semantic labels predicted by the proposed Fast Point Transformer, and (Fourth column) ground truth. Both models are trained with a voxel size of 10 cm.

2. 3D object detection on ScanNet V2 validation

Figure 3. (Left) Bounding boxes predicted by VoteNet [1] with a MinkowskiNet backbone, (Middle) bounding boxes predicted by VoteNet [1] with the Fast Point Transformer backbone, and (Right) ground truth.

Consistency score results on ScanNet V2 validation

Table 3. Quantitative results

Figure 4. Qualitative results

Table 3 and Figure 4. We compare the consistency scores of Fast Point Transformer and MinkowskiNet42† (our reproduced model) under different transformation sets: 1) rotation only (R), 2) translation only (t), and 3) both (R and t). The voxel size is set to 10 cm, 5 cm, and 2 cm for 3D semantic segmentation on the ScanNet V2 validation set. Fast Point Transformer reduces the prediction inconsistency caused by voxelization artifacts. Please refer to our paper for the details of the consistency score (CScore).


This work was supported by Qualcomm, the IITP grants (2021-0-02068: AI Innovation Hub and 2019-0-01906: AI Grad. School Prog.) funded by the Korea government (MSIT), and the NRF grant (NRF-2020R1C1C1015260).


Fast Point Transformer
Chunghyun Park, Yoonwoo Jeong, Minsu Cho, and Jaesik Park
CVPR, 2022
[Paper] [Bibtex]


Check out our GitHub repository: [Github]


  • [1] Charles R. Qi, Or Litany, Kaiming He, and Leonidas J. Guibas. Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV, 2019.
  • [2] Thomas Chaton, Nicolas Chaulet, Sofiane Horache, and Loic Landrieu. Torch-Points3D: A Modular Multi-Task Framework for Reproducible Deep Learning on 3D Point Clouds. 3DV, 2020.