KITTI-360

Semantic Scene Understanding

2D Semantic Segmentation

Our evaluation table ranks all methods according to the confidence-weighted mean intersection-over-union (mIoU). The weighted IoU of one class is defined as \(\text{IoU} = \frac{\sum_{i\in{\{\text{TP}\}}}c_{i}}{\sum_{i\in{\{\text{TP, FP, FN}\}}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) are the sets of image pixels in the intersection and the union of the class label, respectively, and \(c_i \in [0, 1]\) denotes the confidence value at pixel \(i\). In contrast to standard evaluation, where \(c_i=1\) for all pixels, we adopt confidence-weighted metrics that use this uncertainty to account for the ambiguity in our automatically generated annotations.

mIoU class: mean Intersection over Union over classes
mIoU category: mean Intersection over Union over categories
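
The weighted IoU above can be illustrated with a minimal sketch. This is not the benchmark's evaluation script: the function names, the flat per-pixel lists, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration.

```python
def weighted_iou(pred, gt, conf, cls):
    """Confidence-weighted IoU of one class over flat per-pixel lists."""
    inter = 0.0
    union = 0.0
    for p, g, c in zip(pred, gt, conf):
        if p == cls or g == cls:        # pixel is a TP, FP or FN for this class
            union += c
            if p == cls and g == cls:   # pixel is a TP
                inter += c
    # Assumption: a class absent from both prediction and ground truth is undefined.
    return inter / union if union > 0 else float("nan")


def weighted_miou(pred, gt, conf, classes):
    """Mean of the per-class weighted IoUs, ignoring undefined classes."""
    ious = [weighted_iou(pred, gt, conf, k) for k in classes]
    valid = [v for v in ious if v == v]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid) if valid else float("nan")


# Toy example: six pixels, two classes, two low-confidence (ambiguous) pixels.
pred = [0, 0, 1, 1, 1, 0]
gt = [0, 1, 1, 1, 0, 0]
conf = [1.0, 0.5, 1.0, 1.0, 0.2, 1.0]

print(weighted_iou(pred, gt, conf, 0))          # confidence-weighted
print(weighted_iou(pred, gt, [1.0] * 6, 0))     # standard IoU, all c_i = 1
print(weighted_miou(pred, gt, conf, [0, 1]))
```

In this toy case both errors fall on low-confidence pixels, so the weighted IoU of class 0 (2.0 / 2.7) is higher than the standard IoU (0.5): mistakes on ambiguous annotations are penalized less.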

2D Instance Segmentation

Our evaluation table ranks all methods according to the Average Precision (AP) averaged over 10 IoU thresholds, ranging from 0.5 to 0.95 with a step size of 0.05. The IoU is weighted by the confidence as \(\text{IoU} = \frac{\sum_{i\in{\{\text{TP}\}}}c_{i}}{\sum_{i\in{\{\text{TP, FP, FN}\}}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) are the sets of image pixels in the intersection and the union of one instance, respectively, and \(c_i \in [0, 1]\) denotes the confidence value at pixel \(i\). In contrast to standard evaluation, where \(c_i=1\) for all pixels, we adopt confidence-weighted metrics that use this uncertainty to account for the ambiguity in our automatically generated annotations.

AP: Average Precision over 10 IoU thresholds ranging from 0.5 to 0.95
AP 50: Average Precision at a threshold of 0.5
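
As a rough sketch of how these quantities fit together, the code below computes a confidence-weighted mask IoU and averages a simple AP over the 10 thresholds. It is a simplification under stated assumptions, not the official evaluation: each prediction is assumed to be pre-assigned to its best-overlapping ground truth instance, AP is the non-interpolated area under the precision-recall curve, and all names are hypothetical.

```python
# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 (built from integers to avoid drift).
THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]


def weighted_mask_iou(pred_mask, gt_mask, conf):
    """Confidence-weighted IoU of one predicted and one ground truth instance mask."""
    inter = sum(c for p, g, c in zip(pred_mask, gt_mask, conf) if p and g)
    union = sum(c for p, g, c in zip(pred_mask, gt_mask, conf) if p or g)
    return inter / union if union > 0 else 0.0


def average_precision(preds, num_gt, thr):
    """Simplified AP at one IoU threshold.

    preds: list of (score, gt_id, iou), where gt_id is the ground truth
    instance the prediction overlaps most (an assumption of this sketch).
    """
    claimed = set()
    tp = 0
    ap = 0.0
    ranked = sorted(preds, key=lambda d: -d[0])  # highest score first
    for rank, (_, gt_id, iou) in enumerate(ranked, start=1):
        if iou >= thr and gt_id not in claimed:
            claimed.add(gt_id)
            tp += 1
            ap += tp / rank  # precision at each new recall level
    return ap / num_gt if num_gt else 0.0


def ap_over_thresholds(preds, num_gt):
    """Mean of the per-threshold APs."""
    return sum(average_precision(preds, num_gt, t) for t in THRESHOLDS) / len(THRESHOLDS)


# Toy example: three predictions for two ground truth instances.
preds = [(0.9, 0, 0.8), (0.8, 1, 0.6), (0.3, 1, 0.4)]
print(average_precision(preds, 2, 0.5))   # AP 50
print(ap_over_thresholds(preds, 2))       # AP
```

The toy example shows why AP is stricter than AP 50: both instances are found at a threshold of 0.5, but only one survives at 0.65 and above, and none at 0.85 and above.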

3D Semantic Segmentation

mIoU class: mean Intersection over Union over classes
mIoU category: mean Intersection over Union over categories

Rank	Method	Setting	Code	mIoU Class	mIoU Category	Runtime	Environment
1	DeepViewAggregation		code	58.25	73.66	-	NVIDIA V100
D. Robert, B. Vallet and L. Landrieu: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022.
2	MinkowskiNet		code	53.92	74.08	-	NVIDIA V100
C. Choy, J. Gwak and S. Savarese: 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019.
D. Robert, B. Vallet and L. Landrieu: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022.
3	PointNet++		code	35.66	58.28		NVIDIA V100
C. Qi, L. Yi, H. Su and L. Guibas: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS 2017.
4	PointNet		code	13.07	30.42		NVIDIA V100
C. Qi, H. Su, K. Mo and L. Guibas: Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR 2017.

3D Instance Segmentation

AP: Average Precision over 10 IoU thresholds ranging from 0.5 to 0.95
AP 50: Average Precision at a threshold of 0.5

3D Bounding Box Detection

We evaluate all methods using mean Average Precision (AP) computed at IoU thresholds of 0.25 and 0.5. Our evaluation table ranks all methods according to the AP at the IoU threshold of 0.5.

AP 50: Average Precision at a threshold of 0.5
AP 25: Average Precision at a threshold of 0.25
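
To make the effect of the two thresholds concrete, here is a small sketch of a 3D box IoU. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), which is a simplification chosen for illustration; the benchmark's boxes may be oriented, and the function names are hypothetical.

```python
def box_volume(b):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return max(0.0, b[3] - b[0]) * max(0.0, b[4] - b[1]) * max(0.0, b[5] - b[2])


def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes; 0.0 if they do not overlap."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


# A detection shifted by half the box length along x.
gt_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
det_box = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
iou = iou_3d_axis_aligned(gt_box, det_box)
print(iou, iou >= 0.25, iou >= 0.5)
```

Here the IoU is 1/3, so the detection counts as a true positive for AP 25 but not for AP 50, which is the stricter criterion.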

Semantic Scene Completion

We evaluate geometric completion and semantic estimation, and rank the methods according to the confidence-weighted mean intersection-over-union (mIoU). Geometric completion is evaluated via completeness and accuracy at a threshold of 20 cm. Completeness is the fraction of ground truth points whose distance to the closest reconstructed point is below the threshold. Accuracy instead measures the fraction of reconstructed points whose distance to the closest ground truth point is below the threshold. Because our ground truth reconstruction may be incomplete, we avoid unfairly penalizing reconstructed points by dividing the space into observed and unobserved regions; the unobserved volume is determined from a 3D occupancy map obtained using OctoMap. We further report the F1 score, the harmonic mean of completeness and accuracy.

Accuracy: Percentage of reconstructed points that are within a distance threshold to the ground truth points
Completeness: Percentage of ground truth points that are within a distance threshold to the reconstructed points
F1: Harmonic mean of the accuracy and completeness
mIoU Class: Confidence weighted mean intersection-over-union over object classes
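
The geometric metrics can be sketched in a few lines. This is an illustrative brute-force version with hypothetical names; it omits the observed/unobserved region handling described above and is not the benchmark's evaluation code.

```python
import math


def fraction_within(src, dst, thr):
    """Fraction of points in src whose nearest point in dst is closer than thr."""
    if not src or not dst:
        return 0.0
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < thr)
    return hits / len(src)


def completion_metrics(recon, gt, thr=0.2):
    """Return (accuracy, completeness, F1) at a distance threshold in meters."""
    accuracy = fraction_within(recon, gt, thr)       # reconstructed -> ground truth
    completeness = fraction_within(gt, recon, thr)   # ground truth -> reconstructed
    total = accuracy + completeness
    f1 = 2 * accuracy * completeness / total if total > 0 else 0.0
    return accuracy, completeness, f1


# Toy example: the reconstruction covers only half of the ground truth.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
recon_points = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
acc, comp, f1 = completion_metrics(recon_points, gt_points)
print(acc, comp, f1)
```

In this example every reconstructed point lies close to the ground truth (accuracy 1.0), but only half of the ground truth is covered (completeness 0.5), so F1 is 2/3. A sparse but precise reconstruction scores high on accuracy and low on completeness in the same way.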

Results: 2D Semantic Segmentation
Rank	Method	Setting	Code	mIoU Class	mIoU Category	Runtime	Environment
1	PSPNet		code	64.92	82.17	0.2 s	1 core @ 2.5 GHz (C/C++)
H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia: Pyramid Scene Parsing Network. CVPR 2017.
2	FCN			54.00	77.64	0.2 s	1 core @ 2.5 GHz (C/C++)
J. Long, E. Shelhamer and T. Darrell: Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Results: 2D Instance Segmentation
Rank	Method	Setting	Code	AP	AP 50	Runtime	Environment
1	Mask R-CNN (Res.101)		code	20.92	40.10	0.02 s	1 core @ 2.5 GHz (C/C++)
K. He, G. Gkioxari, P. Dollár and R. Girshick: Mask R-CNN. PAMI 2020.
2	Mask R-CNN (Res. 50)		code	19.51	36.25	0.02 s	1 core @ 2.5 GHz (C/C++)
K. He, G. Gkioxari, P. Dollár and R. Girshick: Mask R-CNN. PAMI 2020.

Results: 3D Bounding Box Detection
Rank	Method	Setting	Code	AP 25	AP 50	Runtime	Environment
1	PBEV+SeaBird		code	37.12	4.64	0.15 s	NVIDIA A100
A. Kumar, Y. Guo, X. Huang, L. Ren and X. Liu: SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. CVPR 2024.
2	BoxNet		code	23.59	4.08		NVIDIA V100
C. Qi, O. Litany, K. He and L. Guibas: Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV 2019.
3	VoteNet		code	30.61	3.40		NVIDIA V100
C. Qi, O. Litany, K. He and L. Guibas: Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV 2019.
4	I2M+SeaBird		code	35.04	3.14	0.02 s	NVIDIA A100
A. Kumar, Y. Guo, X. Huang, L. Ren and X. Liu: SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. CVPR 2024.
5	MonoDTR		code	39.76	3.02	0.04 s	NVIDIA A6000
K. Huang, T. Wu, H. Su and W. Hsu: MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer. CVPR 2022.
6	DEVIANT		code	26.96	0.88	0.04 s	NVIDIA A100
A. Kumar, G. Brazil, E. Corona, A. Parchami and X. Liu: DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection. ECCV 2022.
7	GUP Net		code	27.25	0.87	0.02 s	NVIDIA A100
Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, J. Yan and W. Ouyang: Geometry Uncertainty Projection Network for Monocular 3D Object Detection. ICCV 2021.
8	MonoDLE		code	28.99	0.85	0.04 s	NVIDIA A100
X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li and W. Ouyang: Delving into Localization Errors for Monocular 3D Object Detection. CVPR 2021.
9	Cube R-CNN		code	15.57	0.80	0.04 s	NVIDIA A100
G. Brazil, A. Kumar, J. Straub, N. Ravi, J. Johnson and G. Gkioxari: Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild. CVPR 2023.
10	MonoDETR		code	27.13	0.79	0.4 s	1 core @ 2.5 GHz (C/C++)
R. Zhang, H. Qiu, T. Wang, X. Xu, Z. Guo, Y. Qiao, P. Gao and H. Li: MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection. ICCV 2023.
11	GrooMeD-NMS		code	16.12	0.17	0.12 s	1 core @ 2.5 GHz (Python)
A. Kumar, G. Brazil and X. Liu: GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection. CVPR 2021.

Results: Semantic Scene Completion
Rank	Method	Setting	Code	Accuracy	Completeness	F1	mIoU Class	Runtime	Environment
1	EncDec			41.36	41.23	41.29	9.07		NVIDIA V100
Y. Liao, J. Xie and A. Geiger: KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. ARXIV 2021.
2	Raw Input			98.24	19.07	32.35	0.00		NVIDIA V100
Y. Liao, J. Xie and A. Geiger: KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. ARXIV 2021.