The KITTI Vision Benchmark Suite

Method

TubeFormer-DeepLab [on] [TubeFormer-DeepLab]

Submitted on 15 Nov. 2021 09:43 by
Dahun Kim (Korea Advanced Institute of Science and Technology)

Running time:		1 s
Environment:		1 core @ 2.5 GHz (C/C++)

Method Description:

TiMer (Time-Travel Mask Transformer) builds on top
of mask transformers for video segmentation by
directly predicting class-labeled tubes. A tube
contains segmentation masks linked along the time
axis. It is a clip-level pipeline that takes clip-
level frames and outputs per-clip results. We use
the clip length T = 2 to make it a near-online
method. To infer an entire video sequence, we use
simple tube IoU-based ID propagation to stitch the
clip-level results.
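
The stitching step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that consecutive clips share one frame (with T = 2, clip k covers frames k and k+1), that each mask is a set of (row, col) pixels, and that matching is greedy with a hypothetical threshold `iou_threshold`; class labels are ignored for brevity. The names `mask_iou` and `stitch_clips` are made up for this example.

```python
def mask_iou(a, b):
    """IoU of two masks, each given as a set of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def stitch_clips(clips, iou_threshold=0.5):
    """Propagate track IDs across clips.

    clips: list of dicts {local_id: [mask_frame_0, ..., mask_frame_T-1]}.
    Assumption: the first frame of a clip is the same video frame as the
    last frame of the previous clip, so tubes can be compared there.
    Returns a list of dicts {global_id: tube}, one dict per clip.
    """
    next_id = 0
    prev_last = {}  # global_id -> mask in the last frame of the previous clip
    stitched = []
    for clip in clips:
        assigned, used = {}, set()
        for tube in clip.values():
            # Greedily pick the unused previous tube with the highest IoU.
            best_id, best_iou = None, iou_threshold
            for gid, prev_mask in prev_last.items():
                iou = mask_iou(tube[0], prev_mask)
                if gid not in used and iou > best_iou:
                    best_id, best_iou = gid, iou
            if best_id is None:  # no sufficient overlap: start a new track
                best_id = next_id
                next_id += 1
            used.add(best_id)
            assigned[best_id] = tube
        prev_last = {gid: tube[-1] for gid, tube in assigned.items()}
        stitched.append(assigned)
    return stitched
```

A tube that overlaps no previous tube above the threshold starts a new track, so objects entering the scene receive fresh IDs.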

Parameters:

The clip length is set to 2. Runtime is not
measured.

Latex Bibtex:

@inproceedings{kim2022tubeformer,
  title={TubeFormer-DeepLab: Video Mask Transformer},
  author={Kim, Dahun and Xie, Jun and Wang, Huiyu and Qiao, Siyuan and Yu, Qihang and Kim, Hong-Seok and Adam, Hartwig and Kweon, In So and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13914--13924},
  year={2022}
}

Detailed Results

From all 29 test sequences, our benchmark computes the Segmentation and Tracking Quality metric (STQ) together with its two components, association quality (AQ) and segmentation quality (SQ, measured as IoU). The table below shows all of these metrics.

Benchmark	STQ	AQ	SQ (IoU)
KITTI-STEP	65.25 %	60.59 %	70.27 %
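
As a quick consistency check on the row above: STQ is defined as the geometric mean of AQ and SQ, so the reported values should satisfy STQ = sqrt(AQ x SQ). The short sketch below recomputes it from the table; the function name `stq` is chosen only for this example.

```python
import math


def stq(aq, sq):
    """Geometric mean of association quality (AQ) and segmentation quality (SQ)."""
    return math.sqrt(aq * sq)


# Values taken from the KITTI-STEP row, in percent.
print(round(stq(60.59, 70.27), 2))  # 65.25
```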


A project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago
