Our evaluation table ranks all methods according to the confidence-weighted mean intersection-over-union (mIoU). The weighted IoU of one class is defined as \(\text{IoU} = \frac{\sum_{i\in{\{\text{TP}\}}}c_{i}}{\sum_{i\in{\{\text{TP, FP, FN}\}}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) are the sets of image pixels in the intersection and in the union of the class label, respectively, and \(c_i \in [0, 1]\) denotes the confidence value at pixel \(i\). In contrast to standard evaluation, where \(c_i=1\) for all pixels, we adopt confidence-weighted evaluation metrics that leverage the uncertainty to account for the ambiguity in our automatically generated annotations.

**mIoU class:** mean Intersection over Union over classes
**mIoU category:** mean Intersection over Union over categories
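The confidence-weighted IoU above can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's official evaluation code; the function name and array layout are our own. With uniform confidence (\(c_i=1\)) it reduces to the standard IoU:

```python
import numpy as np

def confidence_weighted_iou(pred, gt, conf, class_id):
    """Confidence-weighted IoU for a single class.

    pred, gt : integer label maps of identical shape
    conf     : per-pixel confidence in [0, 1]; all ones gives standard IoU
    class_id : the class to evaluate
    """
    p = pred == class_id
    g = gt == class_id
    tp = p & g                 # intersection: {TP}
    union = p | g              # {TP, FP, FN}
    denom = conf[union].sum()
    if denom == 0:
        return float("nan")    # class absent from both prediction and ground truth
    return conf[tp].sum() / denom

# Toy 2x2 example with uniform confidence: 1 TP pixel, union of 3 pixels.
pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [1, 0]])
conf = np.ones_like(pred, dtype=float)
print(confidence_weighted_iou(pred, gt, conf, class_id=1))  # 1/3
```

The mIoU is then the mean of this quantity over all evaluated classes (or categories).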

Our evaluation table ranks all methods according to the Average Precision (AP) over 10 IoU thresholds, ranging from 0.5 to 0.95 with a step size of 0.05. The IoU is weighted by the confidence as \(\text{IoU} = \frac{\sum_{i\in{\{\text{TP}\}}}c_{i}}{\sum_{i\in{\{\text{TP, FP, FN}\}}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) are the sets of image pixels in the intersection and in the union of one instance, respectively, and \(c_i \in [0, 1]\) denotes the confidence value at pixel \(i\). In contrast to standard evaluation, where \(c_i=1\) for all pixels, we adopt confidence-weighted evaluation metrics that leverage the uncertainty to account for the ambiguity in our automatically generated annotations.

**AP:** Average Precision over 10 IoU thresholds ranging from 0.5 to 0.95
**AP 50:** Average Precision at a threshold of 0.5
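Averaging AP over several IoU thresholds can be sketched as follows. This is a simplified illustration, not the benchmark's evaluation code: it assumes each detection has already been greedily matched to at most one ground-truth instance (so `ious` holds the IoU of each detection with its assigned match), and it integrates the precision-recall curve with all-point interpolation:

```python
import numpy as np

def average_precision(ious, scores, num_gt, thresholds=np.arange(0.5, 1.0, 0.05)):
    """AP averaged over IoU thresholds (simplified sketch).

    ious   : IoU of each detection with its matched ground-truth instance
    scores : detection confidence scores, used only to rank detections
    num_gt : number of ground-truth instances
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    ious = np.asarray(ious, dtype=float)[order]
    aps = []
    for t in thresholds:
        tp = (ious >= t).astype(float)              # matched above threshold -> TP
        fp = 1.0 - tp                               # otherwise a false positive
        cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
        recall = cum_tp / num_gt
        precision = cum_tp / (cum_tp + cum_fp)
        ap = 0.0                                    # area under the PR curve
        prev_r = 0.0
        for r, p in zip(recall, precision):
            ap += (r - prev_r) * p
            prev_r = r
        aps.append(ap)
    return float(np.mean(aps))

# One detection with IoU 0.72: a TP at the five thresholds up to 0.7, FP above.
print(average_precision([0.72], [1.0], num_gt=1))  # 0.5
```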

We evaluate all methods using mean Average Precision (AP) calculated at IoU thresholds of 0.25 and 0.5, respectively. Our evaluation table ranks all methods according to the AP evaluated at the IoU threshold of 0.5.

**AP 50:** Average Precision at a threshold of 0.5
**AP 25:** Average Precision at a threshold of 0.25

We evaluate geometric completion and semantic estimation and rank the methods according to the confidence-weighted mean intersection-over-union (mIoU). Geometric completion is evaluated via completeness and accuracy at a threshold of 20 cm. Completeness is the fraction of ground-truth points whose distances to their closest reconstructed points are below the threshold. Accuracy instead measures the percentage of reconstructed points that are within the distance threshold of the ground-truth points. As our ground-truth reconstruction may be incomplete, we avoid penalizing reconstructed points in unobserved regions: observed and unobserved regions are determined from the unobserved volume of a 3D occupancy map obtained using OctoMap. We further report the F1 score, the harmonic mean of completeness and accuracy.

**Accuracy:** Percentage of reconstructed points that are within a distance threshold to the ground truth points
**Completeness:** Percentage of ground truth points that are within a distance threshold to the reconstructed points
**F1:** Harmonic mean of the accuracy and completeness
**mIoU Class:** Confidence weighted mean intersection-over-union over object classes
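The completeness, accuracy, and F1 computation can be sketched with a brute-force nearest-neighbor search in NumPy. This is an illustrative sketch only: it omits the OctoMap-based filtering of unobserved regions described above, and the function name and point layout are our own (real evaluations would use a KD-tree for large clouds):

```python
import numpy as np

def completeness_accuracy_f1(recon, gt, threshold=0.2):
    """Completeness, accuracy and F1 for two point sets (brute-force sketch).

    recon, gt : (N, 3) and (M, 3) arrays of 3D points in metres
    threshold : distance threshold; 0.2 m corresponds to the 20 cm used above
    """
    # Pairwise distances between every reconstructed and ground-truth point.
    d = np.linalg.norm(recon[:, None, :] - gt[None, :, :], axis=-1)
    accuracy = float((d.min(axis=1) <= threshold).mean())      # recon -> gt
    completeness = float((d.min(axis=0) <= threshold).mean())  # gt -> recon
    f1 = 0.0 if accuracy + completeness == 0 else \
        2 * accuracy * completeness / (accuracy + completeness)
    return completeness, accuracy, f1

# One ground-truth point near the first reconstructed point, far from the second.
recon = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
gt    = np.array([[0.0, 0.0, 0.1]])
print(completeness_accuracy_f1(recon, gt))  # completeness=1.0, accuracy=0.5
```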