Method

Learned Object Descriptors [LOD]
[Anonymous Submission]

Submitted on 23 Apr. 2025 16:49 by
[Anonymous Submission]

Running time: 0.1 s
Environment: GPU @ 2.0 GHz (Python)

Method Description:
This method relies only on the visual appearance of objects. We use a YOLO model to obtain detections and a ConvNeXt model to extract reID features from each detection. Track IDs are then assigned based on the highest cosine similarity between incoming reID features and the features stored from all previous frames. If duplicate IDs are detected within a frame, only the detection with the highest similarity score is kept and all other duplicates are rejected.
Parameters:
Similarity threshold \tau = 0.82
YOLO confidence threshold = 0.5
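The association step described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function and argument names are assumptions, features are L2-normalized so a dot product gives cosine similarity, and unmatched detections are assumed to start new tracks.

```python
import numpy as np
from collections import defaultdict

def _l2_normalize(x):
    # Unit-normalize each row so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def assign_track_ids(gallery_feats, gallery_ids, det_feats, tau=0.82, next_id=0):
    """Appearance-only ID assignment (illustrative sketch, not the submitted code).

    gallery_feats: (N, D) reID features collected from all previous frames
    gallery_ids:   track ID of each gallery feature
    det_feats:     (M, D) reID features of the current frame's detections
    Returns (ids, keep, next_id): a track ID per detection, a mask rejecting
    duplicate-ID detections with lower similarity, and the next unused ID.
    """
    sims = _l2_normalize(det_feats) @ _l2_normalize(gallery_feats).T  # (M, N)
    best = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(det_feats)), best]

    ids, scores = [], []
    for j, s in zip(best, best_sim):
        if s >= tau:                 # match to an existing track
            ids.append(gallery_ids[j])
            scores.append(float(s))
        else:                        # assumed behavior: start a new track
            ids.append(next_id)
            scores.append(-1.0)
            next_id += 1

    # Duplicate resolution: within a frame, keep only the detection with the
    # highest similarity for each track ID; reject all other duplicates.
    by_id = defaultdict(list)
    for i, t in enumerate(ids):
        by_id[t].append(i)
    keep = [True] * len(ids)
    for idxs in by_id.values():
        if len(idxs) > 1:
            best_i = max(idxs, key=lambda i: scores[i])
            for i in idxs:
                if i != best_i:
                    keep[i] = False
    return ids, keep, next_id
```

For example, two current detections that both match the same stored track above \tau resolve to a single kept detection; the lower-scoring one is rejected rather than reassigned.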

Detailed Results

For all 29 test sequences, the benchmark computes the commonly used tracking metrics: the CLEAR MOT scores, MT/PT/ML, identity switches, and fragmentations [1,2]. The tables below list all of these metrics.


Benchmark    MOTA     MOTP     MODA     MODP
CAR          68.10 %  84.88 %  69.19 %  88.15 %
PEDESTRIAN   39.14 %  76.77 %  39.65 %  93.93 %

Benchmark    Recall   Precision  F1       TP     FP   FN     FAR     #objects  #trajectories
CAR          72.62 %  98.78 %    83.70 %  27212  335  10262  3.01 %  29388     1764
PEDESTRIAN   42.39 %  95.01 %    58.62 %  9897   520  13452  4.67 %  10781     451

Benchmark    MT       PT       ML       IDS  FRAG
CAR          43.85 %  40.15 %  16.00 %  375  732
PEDESTRIAN   14.78 %  40.21 %  45.02 %  118  650
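The CLEAR MOT scores in the first table are derived from the error counts in the later tables. Below is a minimal sketch of the standard definitions from [1]; the benchmark's evaluation has additional matching details, so these one-liners are illustrative and will not exactly reproduce the table values.

```python
def mota(misses, false_positives, id_switches, num_gt):
    # Multiple Object Tracking Accuracy: 1 minus the ratio of all errors
    # (missed objects, false positives, identity switches) to ground-truth objects.
    return 1.0 - (misses + false_positives + id_switches) / num_gt

def motp(total_match_overlap, num_matches):
    # Multiple Object Tracking Precision: mean bounding-box overlap
    # over all correctly matched detection/ground-truth pairs.
    return total_match_overlap / num_matches
```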



[1] K. Bernardin, R. Stiefelhagen: Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. JIVP 2008.
[2] Y. Li, C. Huang, R. Nevatia: Learning to associate: HybridBoosted multi-target tracker for crowded scene. CVPR 2009.

