The KITTI Vision Benchmark Suite

Method

Depth-Aware Dynamic Matching for DETR-based Monocular 3D Detection [MonoD2OM]
https://github.com/vincentweikey/MonoD2OM

Submitted on 19 May 2026 09:27 by
Menghao Yang (Changsha University of Science and Technology)

Running time:		0.04 s
Environment:		GPU @ 2.5 GHz (Python)

Method Description:

We propose D²OM, a depth-aware dynamic one-to-many matching framework for DETR-based monocular 3D detectors. It defines anisotropic ellipsoidal positive sample regions aligned with the camera ray to explicitly model the larger error tolerance along the depth direction. A dynamic cost-thresholding mechanism adaptively controls the positive sample count throughout training. We further propose Matching-guided Adaptive Denoising (MAD), which dynamically adjusts denoising queries based on real-time matching statistics, reinforcing depth estimation accuracy with no additional inference overhead.
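
The ray-aligned ellipsoidal region described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `in_ray_aligned_ellipsoid`, the radii `r_depth` and `r_lateral`, and the assumption that the camera sits at the origin of the coordinate frame are all hypothetical choices made for illustration. The sketch only shows how a region can be made more tolerant along the depth (ray) direction than across it.

```python
import math

def in_ray_aligned_ellipsoid(query, center, r_depth, r_lateral):
    """Return True if `query` lies inside an ellipsoid centred on `center`
    whose long axis points along the camera ray through `center`.

    Assumption (not from the source): the camera is at the origin, so the
    ray direction is simply `center` normalised to unit length.
    """
    norm = math.sqrt(sum(c * c for c in center))
    if norm == 0.0:
        raise ValueError("center must not coincide with the camera origin")
    ray = [c / norm for c in center]

    # Offset of the query from the ground-truth centre.
    offset = [q - c for q, c in zip(query, center)]

    # Split the offset into a component along the ray and a lateral remainder.
    along = sum(o * r for o, r in zip(offset, ray))
    lateral_sq = max(sum(o * o for o in offset) - along * along, 0.0)

    # Standard ellipsoid test with one radius along the ray, one across it.
    return (along / r_depth) ** 2 + lateral_sq / (r_lateral ** 2) <= 1.0
```

With `center = (0, 0, 20)`, `r_depth = 2.0` and `r_lateral = 0.5`, a query displaced 1.5 m along the ray falls inside the region, while a query displaced 1.5 m sideways does not.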

Parameters:

None

LaTeX BibTeX:

None

Detailed Results

Object detection and orientation estimation results. Results for object detection are given in terms of average precision (AP) and results for joint object detection and orientation estimation are provided in terms of average orientation similarity (AOS).

Benchmark	Easy	Moderate	Hard
Car (Detection)	94.09 %	93.35 %	83.67 %
Car (Orientation)	93.87 %	92.83 %	82.95 %
Car (3D Detection)	29.27 %	21.13 %	18.44 %
Car (Bird's Eye View)	37.63 %	26.92 %	23.82 %
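
For readers unfamiliar with the orientation metric named above, the following sketch shows the per-detection term that AOS is built on: a cosine similarity between predicted and ground-truth orientation, rescaled to lie in [0, 1]. It is a simplification under stated assumptions: the helper name `orientation_similarity` is hypothetical, it averages over matched detections only, and it omits the averaging over recall levels and the handling of false positives that the full benchmark evaluation performs.

```python
import math

def orientation_similarity(angle_errors):
    """Mean of (1 + cos(delta)) / 2 over matched detections.

    `angle_errors` holds the orientation differences in radians between
    each matched detection and its ground-truth object. Simplified
    illustration only, not the benchmark's evaluation code.
    """
    if not angle_errors:
        return 0.0
    total = sum((1.0 + math.cos(delta)) / 2.0 for delta in angle_errors)
    return total / len(angle_errors)
```

A perfectly oriented detection contributes 1, one that is off by 180 degrees contributes 0, and one that is off by 90 degrees contributes 0.5.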


Figure: 2D object detection results.

Figure: Orientation estimation results.

Figure: 3D object detection results.

Figure: Bird's eye view results.


A project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago
