Method

TopNet-HighRes


Submitted on 7 Sep. 2018 13:17 by
Sascha Wirges (Karlsruhe Institute of Technology)

Running time: 101 ms
Environment: NVIDIA GeForce 1080 Ti (tensorflow-gpu)

Method Description:
Object detection by deep convolutional networks, consistently adapted to the multi-layer occupancy grid domain. Only Velodyne data is used.
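
The multi-layer grid map input can be sketched as follows. This is a minimal illustration, assuming the per-cell layers named in the parameters below (intensity, min./max. z, detection count); the cell size, ranges, and layer order are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def points_to_grid_layers(points, intensities, cell=0.1,
                          x_range=(0.0, 60.0), y_range=(-30.0, 30.0)):
    """Accumulate a Velodyne point cloud into per-cell feature layers.

    points: (N, 3) array of x, y, z; intensities: (N,) array.
    Returns an (nx, ny, 4) grid: mean intensity, min z, max z, point count.
    (Layer choice and ranges are assumptions for illustration.)
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy = ix[valid], iy[valid]
    z, inten = points[valid, 2], intensities[valid]

    min_z = np.full((nx, ny), np.inf)
    max_z = np.full((nx, ny), -np.inf)
    count = np.zeros((nx, ny))
    inten_sum = np.zeros((nx, ny))
    # Unbuffered scatter-reduce per cell index.
    np.minimum.at(min_z, (ix, iy), z)
    np.maximum.at(max_z, (ix, iy), z)
    np.add.at(count, (ix, iy), 1)
    np.add.at(inten_sum, (ix, iy), inten)

    mean_inten = np.divide(inten_sum, count,
                           out=np.zeros_like(inten_sum), where=count > 0)
    empty = count == 0
    min_z[empty] = 0.0
    max_z[empty] = 0.0
    return np.stack([mean_inten, min_z, max_z, count], axis=-1)
```

The resulting tensor can be fed to a standard 2D detection network in place of an RGB image, which is the key idea of adapting image detectors to the grid domain.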
Parameters:
Meta architecture: Faster R-CNN;
Feature extractor: ResNet-101;
Grid cell features: intensity, min./max. z coordinate, detections, observations;
Grid cell size: 10 cm;
Box encoding: position, length, width, sin(2*angle)/cos(2*angle)
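
The sin(2*angle)/cos(2*angle) box encoding can be sketched as below. Doubling the angle before taking sine and cosine makes the encoding continuous and invariant to 180° box flips; the function names here are illustrative, not from the authors' code.

```python
import math

def encode_angle(theta):
    # Doubling the angle makes the encoding identical for theta and
    # theta + pi, i.e. invariant to 180-degree box flips.
    return math.sin(2.0 * theta), math.cos(2.0 * theta)

def decode_angle(s, c):
    # Recovers the heading modulo pi, in (-pi/2, pi/2].
    return 0.5 * math.atan2(s, c)
```

A regression target built this way avoids the discontinuity at ±90° that a raw angle target would have.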
Latex Bibtex:
@article{Wirges2018,
  abstract = {A detailed environment perception is a crucial component of automated vehicles. However, to deal with the amount of perceived information, we also require segmentation strategies. Based on a grid map environment representation, well-suited for sensor fusion, free-space estimation and machine learning, we detect and classify objects using deep convolutional neural networks. As input for our networks we use a multi-layer grid map efficiently encoding 3D range sensor information. The inference output consists of a list of rotated bounding boxes with associated semantic classes. We conduct extensive ablation studies, highlight important design considerations when using grid maps and evaluate our models on the KITTI Bird's Eye View benchmark. Qualitative and quantitative benchmark results show that we achieve robust detection and state of the art accuracy solely using top-view grid maps from range sensor data.},
  archivePrefix = {arXiv},
  arxivId = {1805.08689},
  author = {Wirges, Sascha and Fischer, Tom and Frias, Jesus Balado and Stiller, Christoph},
  eprint = {1805.08689},
  month = {may},
  title = {{Object Detection and Classification in Occupancy Grid Maps using Deep Convolutional Networks}},
  url = {http://arxiv.org/abs/1805.08689},
  year = {2018}
}

Detailed Results

Object detection and orientation estimation results. Results for object detection are given in terms of average precision (AP) and results for joint object detection and orientation estimation are provided in terms of average orientation similarity (AOS).
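
The AP metric behind the table can be sketched as follows. This is a minimal version of the 11-point interpolated average precision used by the KITTI benchmark at the time of this submission; the official evaluation additionally matches detections to ground truth at a class-specific overlap threshold, which is omitted here.

```python
def average_precision(recalls, precisions, n_points=11):
    """11-point interpolated AP: sample recall at 0, 0.1, ..., 1.0 and
    take the maximum precision over all operating points whose recall
    is at least the sample level (a simplified sketch of the metric)."""
    total = 0.0
    for i in range(n_points):
        r = i / (n_points - 1)
        ps = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(ps) if ps else 0.0
    return total / n_points
```

AOS weights each interpolated precision sample by the average cosine similarity of the predicted and ground-truth orientations, so AOS is always bounded above by AP.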


Benchmark                       Easy      Moderate   Hard
Car (Detection)                 59.77 %   48.87 %    43.15 %
Car (3D Detection)              15.29 %   12.58 %    12.25 %
Car (Bird's Eye View)           67.53 %   53.71 %    46.54 %
Pedestrian (Detection)          22.98 %   17.57 %    17.35 %
Pedestrian (3D Detection)       13.45 %    9.66 %     9.64 %
Pedestrian (Bird's Eye View)    24.30 %   19.08 %    18.46 %
Cyclist (Detection)             29.34 %   19.15 %    19.69 %
Cyclist (3D Detection)           4.48 %    5.98 %     6.18 %
Cyclist (Bird's Eye View)       15.70 %   12.45 %    12.76 %


2D object detection results.

3D object detection results.

Bird's eye view results.
