Method

MT-SfMLearner [MT-SfMLearner]


Submitted on 12 Oct. 2021 11:09 by
Hemang Chawla (Navinfo Europe)

Running time: 0.04 s
Environment: GPU @ 1.5 GHz (Python)

Method Description:
The advent of autonomous driving and advanced driver assistance systems necessitates continuous developments in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, which estimates the pixel-wise distance of objects from a single camera without ground-truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast to CNNs, which use localized linear operations and lose feature resolution across layers, vision transformers process features at constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study has investigated their impact on self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation.
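Self-supervised monocular depth training of this kind (SfMLearner-style) replaces ground-truth depth with a view-synthesis objective: predicted depth and relative camera pose are used to warp a neighbouring source frame into the target view, and the photometric difference supervises both networks. The sketch below illustrates that objective in numpy; it is not MT-SfMLearner's implementation, and the function names, nearest-neighbour sampling, and plain L1 error are simplifications (real systems use bilinear sampling and an SSIM term).

```python
import numpy as np

def reproject(depth, K, K_inv, T):
    """Back-project target pixels with predicted depth, transform by the
    predicted relative pose T (4x4), and project into the source view."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixels
    cam = (K_inv @ pix) * depth.reshape(1, -1)         # 3D points in the target camera
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src = (T @ cam_h)[:3]                              # points in the source camera
    proj = K @ src
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)    # source pixel coordinates
    return proj.reshape(2, h, w)

def photometric_loss(target, source, depth, K, T):
    """L1 photometric error between the target frame and the source frame
    warped into the target view (nearest-neighbour sampling for brevity)."""
    h, w = depth.shape
    proj = reproject(depth, K, np.linalg.inv(K), T)
    u = np.round(proj[0]).astype(int)
    v = np.round(proj[1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)    # pixels that land inside the source image
    warped = np.zeros_like(target)
    warped[valid] = source[v[valid], u[valid]]
    return np.abs(target - warped)[valid].mean()
```

With an identity pose and identical frames the warp is the identity and the loss is zero; during training, minimizing this loss over real frame pairs forces the depth (and pose, and here even intrinsics) predictions to be geometrically consistent.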
Parameters:
See paper for details
Latex Bibtex:
@conference{mtsfmlearner,
author={Arnav Varma and Hemang Chawla and Bahram Zonooz and Elahe Arani},
title={Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics},
booktitle={Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP},
year={2022},
pages={758-769},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010884000003124},
isbn={978-989-758-555-5},
}

Detailed Results

This page provides detailed results for the method(s) selected. For the first 20 test images, each table reports the depth-prediction error metrics: the scale-invariant logarithmic error (SILog), the squared relative error (sqErrorRel), the absolute relative error (absErrorRel), and the root-mean-square error of the inverse depth (iRMSE), computed over pixels with valid ground truth. Underneath each table, the left input image, the estimated depth map and the corresponding error map are shown. The error map uses the log-color scale described in Sparsity Invariant CNNs (THREEDV 2017), depicting small errors in blue and large errors in red color tones.
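The four reported metrics (SILog, sqErrorRel, absErrorRel, iRMSE) can be reproduced roughly as in the sketch below. This is a plain-numpy re-implementation, not the official KITTI devkit; the masking details and unit conventions (percentages for the relative errors, 1/km for iRMSE) are assumptions based on how these metrics are commonly defined.

```python
import numpy as np

def kitti_depth_errors(gt, pred):
    """Approximate KITTI depth-prediction metrics for one image.

    gt, pred: depth maps in metres; only pixels with valid
    ground truth (gt > 0) are evaluated.
    """
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    mask = gt > 0
    gt, pred = gt[mask], pred[mask]

    # Scale-invariant logarithmic error (leaderboard reports it x100).
    d = np.log(pred) - np.log(gt)
    var = np.mean(d ** 2) - np.mean(d) ** 2
    silog = np.sqrt(max(var, 0.0)) * 100.0   # clamp tiny negative float noise

    # Relative errors, reported as percentages.
    abs_rel = np.mean(np.abs(pred - gt) / gt) * 100.0
    sq_rel = np.mean(((pred - gt) ** 2) / gt) * 100.0

    # RMSE of the inverse depth, converted from 1/m to 1/km.
    irmse = np.sqrt(np.mean((1.0 / pred - 1.0 / gt) ** 2)) * 1000.0

    return {"SILog": silog, "sqErrorRel": sq_rel,
            "absErrorRel": abs_rel, "iRMSE": irmse}
```

Note that SILog is invariant to a global scaling of the prediction (a constant shift in log-depth cancels in the variance), which is why it is the primary ranking metric for monocular methods whose absolute scale is ambiguous.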

Test Set Average

SILog sqErrorRel absErrorRel iRMSE
Error 14.25 3.72 12.52 15.83

Test Image 0

SILog sqErrorRel absErrorRel iRMSE
Error 7.09 1.36 8.70 10.15
[The input image, estimated depth (D1 Result) and error map (D1 Error) are shown for each test image on the benchmark page.]


Test Image 1

SILog sqErrorRel absErrorRel iRMSE
Error 19.26 5.99 13.65 25.04


Test Image 2

SILog sqErrorRel absErrorRel iRMSE
Error 20.23 3.45 15.21 31.54


Test Image 3

SILog sqErrorRel absErrorRel iRMSE
Error 9.80 1.69 8.62 13.33


Test Image 4

SILog sqErrorRel absErrorRel iRMSE
Error 19.71 4.57 16.40 23.95


Test Image 5

SILog sqErrorRel absErrorRel iRMSE
Error 17.59 3.11 14.66 21.43


Test Image 6

SILog sqErrorRel absErrorRel iRMSE
Error 13.95 2.98 11.28 14.62


Test Image 7

SILog sqErrorRel absErrorRel iRMSE
Error 11.22 1.90 8.59 12.92


Test Image 8

SILog sqErrorRel absErrorRel iRMSE
Error 17.45 4.86 17.03 19.61


Test Image 9

SILog sqErrorRel absErrorRel iRMSE
Error 22.14 8.66 16.59 19.02


Test Image 10

SILog sqErrorRel absErrorRel iRMSE
Error 8.42 2.97 14.66 12.01


Test Image 11

SILog sqErrorRel absErrorRel iRMSE
Error 19.57 3.97 13.91 18.65


Test Image 12

SILog sqErrorRel absErrorRel iRMSE
Error 10.92 2.91 13.68 9.75


Test Image 13

SILog sqErrorRel absErrorRel iRMSE
Error 12.53 3.93 7.59 8.02


Test Image 14

SILog sqErrorRel absErrorRel iRMSE
Error 11.74 3.23 14.35 14.92


Test Image 15

SILog sqErrorRel absErrorRel iRMSE
Error 9.33 3.09 14.05 19.36


Test Image 16

SILog sqErrorRel absErrorRel iRMSE
Error 13.13 2.92 9.82 11.28


Test Image 17

SILog sqErrorRel absErrorRel iRMSE
Error 19.37 5.93 13.16 28.59


Test Image 18

SILog sqErrorRel absErrorRel iRMSE
Error 30.03 8.33 22.88 35.51


Test Image 19

SILog sqErrorRel absErrorRel iRMSE
Error 18.94 4.34 16.37 26.52



