Introducing Asymmetric Convolution to Transformer with Decoupled Features for Monocular 3D Object Detection [MonoCDiT]
[Anonymous Submission]

Submitted on 8 Aug. 2023 07:36 by
[Anonymous Submission]

Running time: 0.05 s
Environment: GPU @ >3.5 GHz (Python)

Method Description:
In this work, to capture detailed scene features accurately and to avoid
interference from inaccurate depth information, we propose MonoCDiT, a
monocular 3D object detection method with a novel decoupled
encoder-decoder framework that introduces asymmetric convolution to the
transformer with decoupled depth and visual features. The convolution
encoder aggregates local features of the input image and encodes them
using depth-wise convolutions with multiple kernels of different shapes,
capturing finer image detail. The decoupled structure allows the visual
and depth features to be learned independently, without interfering with
each other. In addition, the combination of convolution and transformer
enables MonoCDiT to attend to both local and global features.

Architecture: DLA-102 + 2 conv encoders + 2 transformer encoders +
2 transformer decoders + 1 fusion module
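The asymmetric depth-wise convolution idea above can be sketched as follows. This is an illustrative NumPy implementation assuming an ACNet-style block (a square 3x3 branch plus asymmetric 1x3 and 3x1 branches whose outputs are summed); for brevity each branch uses one kernel shared across channels, whereas a real learned layer would have per-channel weights. It is a sketch of the technique, not the authors' actual layer.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Depth-wise 2D convolution with zero padding ('same' output size).
    x: (C, H, W) feature map; kernel: (kh, kw), shared across channels
    here for brevity (a learned layer has one kernel per channel)."""
    C, H, W = x.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            # Each channel is convolved independently (depth-wise).
            out[:, i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * kernel,
                                  axis=(1, 2))
    return out

def asymmetric_conv_block(x):
    """Sum of square (3x3) and asymmetric (1x3, 3x1) depth-wise branches,
    so the block responds to both isotropic and directional structure."""
    k_sq = np.full((3, 3), 1 / 9.0)  # square branch (averaging kernel)
    k_h = np.full((1, 3), 1 / 3.0)   # horizontal asymmetric branch
    k_v = np.full((3, 1), 1 / 3.0)   # vertical asymmetric branch
    return (depthwise_conv2d(x, k_sq)
            + depthwise_conv2d(x, k_h)
            + depthwise_conv2d(x, k_v))
```

Because all three branches are linear, they can be fused into a single 3x3 kernel at inference time, which is the usual motivation for this design.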
Latex Bibtex:

Detailed Results

Object detection and orientation estimation results. Results for object detection are given in terms of average precision (AP) and results for joint object detection and orientation estimation are provided in terms of average orientation similarity (AOS).
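For reference, the two metrics can be sketched as follows. This is a simplified illustration of interpolated average precision and of the orientation-similarity term that AOS averages over recall; the matching procedure and exact recall sampling of the official KITTI evaluation are omitted, and the function names and the 40-point sampling are illustrative assumptions.

```python
import numpy as np

def interpolated_ap(recalls, precisions, n_points=40):
    """Interpolated average precision: at each sampled recall level take
    the maximum precision achieved at any recall >= that level, then
    average over the sampled levels (40 points shown here)."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    samples = np.linspace(1.0 / n_points, 1.0, n_points)
    ap = 0.0
    for r in samples:
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / n_points

def orientation_similarity(delta_thetas):
    """Orientation similarity over matched detections at one recall
    level: mean of (1 + cos(delta_theta)) / 2, which lies in [0, 1] and
    equals 1 only when every predicted orientation is exact."""
    d = np.asarray(delta_thetas, dtype=float)
    return float(np.mean((1.0 + np.cos(d)) / 2.0))
```

AOS weights the precision at each recall level by this similarity term, so it is upper-bounded by the detection AP, consistent with the orientation row in the table being slightly below the detection row.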

Benchmark               Easy      Moderate  Hard
Car (Detection)         96.01 %   90.53 %   80.69 %
Car (Orientation)       95.44 %   89.58 %   79.68 %
Car (3D Detection)      23.52 %   17.13 %   14.37 %
Car (Bird's Eye View)   30.32 %   21.97 %   18.80 %

2D object detection results.

Orientation estimation results.

3D object detection results.

Bird's eye view results.
