Segmenting and Tracking Every Pixel (STEP) Evaluation

This benchmark is part of the ICCV 2021 workshop "Segmenting and Tracking Every Point and Pixel".

The Segmenting and Tracking Every Pixel (STEP) benchmark consists of 21 training sequences and 29 test sequences. It builds on the KITTI Tracking Evaluation and the Multi-Object Tracking and Segmentation (MOTS) benchmark, extending their annotations to the STEP task with dense pixel-wise segmentation labels. Every pixel carries a semantic label, and all pixels belonging to the two most salient object classes, car and pedestrian, additionally carry a unique tracking ID. Submitted results are evaluated using the Segmentation and Tracking Quality (STQ) metric:

  • STQ: The combined segmentation and tracking quality, given by the geometric mean of AQ and SQ.
  • AQ: The class-agnostic association quality, measuring how well predicted tracks match the ground-truth tracks.
  • SQ (IoU): The track-agnostic segmentation quality, given by the mean IoU over all semantic classes.
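As a minimal illustration of how the top-level score is combined (a sketch only; the underlying AQ and SQ computations over pixel masks are considerably more involved), STQ is simply the geometric mean of the two sub-metrics. With sub-scores matching the first table entry below:

```python
import math

def stq(aq: float, sq: float) -> float:
    """Segmentation and Tracking Quality: the geometric mean of
    association quality (AQ) and segmentation quality (SQ),
    both given as fractions in [0, 1]."""
    return math.sqrt(aq * sq)

# AQ = 67.20 %, SQ = 69.77 % combine to STQ = 68.47 %:
print(round(stq(0.6720, 0.6977), 4))  # -> 0.6847
```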

The submission instructions can be found on the submit results page. Please address any questions or feedback about KITTI-STEP and its evaluation to Mark Weber.
Important Policy Update: As more and more unpublished work and re-implementations of existing work are submitted to KITTI, we have established a new policy: from now on, only submissions with significant novelty that lead to a peer-reviewed paper in a conference or journal are allowed. Minor modifications of existing algorithms and student research projects are not allowed; such work must be evaluated on a split of the training set. To ensure that this policy is followed, new users must state their status, describe their work, and specify the targeted venue during registration. Furthermore, we will regularly delete all entries that are 6 months old but are still anonymous or have no associated paper. For conferences, 6 months is enough to determine whether a paper has been accepted and to add the bibliographic information. For longer review cycles, you need to resubmit your results.
Additional information used by the methods
  • Online: Online method (frame-by-frame processing, no latency)
  • Additional training data: Use of additional data sources for training

Rank  Method              Setting  Code  STQ      AQ       SQ (IoU)
1     Video-kMaX                         68.47 %  67.20 %  69.77 %
2     TubeFormer-DeepLab  online         65.25 %  60.59 %  70.27 %
3     siain               online         57.87 %  55.16 %  60.71 %
4     Motion-DeepLab      online   code  52.19 %  45.55 %  59.81 %

References:
1. I. Shin, D. Kim, Q. Yu, J. Xie, H. Kim, B. Green, I. Kweon, K. Yoon and L. Chen: Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation. arXiv preprint arXiv:2304.04694, 2023.
2. D. Kim, J. Xie, H. Wang, S. Qiao, Q. Yu, H. Kim, H. Adam, I. Kweon and L. Chen: TubeFormer-DeepLab: Video Mask Transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
3. J. Ryu and K. Yoon: An End-to-End Trainable Video Panoptic Segmentation Method using Transformers. 2021.
4. M. Weber, J. Xie, M. Collins, Y. Zhu, P. Voigtlaender, H. Adam, B. Green, A. Geiger, B. Leibe, D. Cremers, A. Osep, L. Leal-Taixe and L. Chen: STEP: Segmenting and Tracking Every Pixel. arXiv:2102.11859, 2021.


When using this dataset in your research, we will be happy if you cite us:

@inproceedings{weber2021step,
  author = {Mark Weber and Jun Xie and Maxwell Collins and Yukun Zhu and Paul Voigtlaender and Hartwig Adam and Bradley Green and Andreas Geiger and Bastian Leibe and Daniel Cremers and Aljosa Osep and Laura Leal-Taixe and Liang-Chieh Chen},
  title = {STEP: Segmenting and Tracking Every Pixel},
  booktitle = {Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
  year = {2021}
}