Folder Structure

Our dataset is divided into 11 individual sequences, each corresponding to a continuous driving trajectory. Because different sequences rarely overlap, we split training and test data according to the sequence ID. The full dataset, including raw data and semantic and instance labels in both 2D & 3D, is structured as follows, where {seq:0>4} denotes the sequence ID using 4 digits and {frame:0>10} denotes the frame ID using 10 digits.

  |-- calibration/
  │   |-- calib_cam_to_pose.txt
  │   |-- calib_cam_to_velo.txt
  │   |-- calib_sick_to_velo.txt
  |   |-- perspective.txt
  │   |-- image_02.yaml
  │   `-- image_03.yaml
  |-- data_2d_raw/
  |   `-- 2013_05_28_drive_{seq:0>4}_sync/
  |       `-- image_{00|01}/
  |           `-- data_rect/
  |               `-- {frame:0>10}.png
  |       `-- image_{02|03}/
  |           `-- data_rgb/
  |               `-- {frame:0>10}.png
  |-- data_2d_semantics/
  │   |-- train 
  |   |   `-- 2013_05_28_drive_{seq:0>4}_sync/
  |   |      |-- image_{00|01}/
  |   |      |    |-- semantic/
  |   |      |    |   `-- {frame:0>10}.png
  |   |      |    |-- semantic_rgb/
  |   |      |    |   `-- {frame:0>10}.png
  |   |      |    |-- instance/
  |   |      |    |   `-- {frame:0>10}.png
  |   |      |    `-- confidence/
  |   |      |        `-- {frame:0>10}.png
  |   |      `--  instanceDict.json
  |   ...

  |   ...
  |-- data_3d_raw/
  |   `-- 2013_05_28_drive_{seq:0>4}_sync/
  |       `-- velodyne_points/
  |           |-- data/
  |           |   `-- {frame:0>10}.bin 
  |           `-- timestamps.txt
  |       `-- sick_points/
  |           |-- data/
  |           |   `-- {frame:0>10}.bin 
  |           `-- timestamps.txt
  |-- data_3d_semantics/
  │   |-- train 
  |   |   `-- 2013_05_28_drive_{seq:0>4}_sync/
  |   |      |-- static/
  |   |      |    `-- {start_frame:0>10}_{end_frame:0>10}.ply
  |   |      `-- dynamic/
  |   |           `-- {start_frame:0>10}_{end_frame:0>10}.ply  
  │   `-- test 
  |       `-- 2013_05_28_drive_{seq:0>4}_sync/
  |          `-- static/
  |               `-- {start_frame:0>10}_{end_frame:0>10}.ply
  |-- data_3d_bboxes/
  |   `-- train
  |       `-- 2013_05_28_drive_{seq:0>4}_sync.xml
  `-- data_poses/
      `-- 2013_05_28_drive_{seq:0>4}_sync/
          |-- poses.txt
          `-- cam0_to_world.txt
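
The placeholders {seq:0>4} and {frame:0>10} happen to be valid Python format specifications (zero-fill, right-align, fixed width), so file paths can be built directly from them. The following is a minimal sketch, not part of the official toolkit; the dataset root "KITTI-360" and the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def sequence_name(seq: int) -> str:
    """Return the sequence folder name, e.g. 2013_05_28_drive_0000_sync."""
    return "2013_05_28_drive_{seq:0>4}_sync".format(seq=seq)


def raw_image_path(root: str, seq: int, camera: int, frame: int) -> PurePosixPath:
    """Build the path of one raw image (hypothetical helper).

    Cameras 0 and 1 are the perspective cameras (data_rect),
    cameras 2 and 3 are the fisheye cameras (data_rgb).
    """
    subdir = "data_rect" if camera in (0, 1) else "data_rgb"
    return PurePosixPath(
        root,
        "data_2d_raw",
        sequence_name(seq),
        "image_{:0>2}".format(camera),
        subdir,
        "{frame:0>10}.png".format(frame=frame),
    )


# "KITTI-360" is an assumed dataset root, chosen only for illustration.
print(raw_image_path("KITTI-360", 0, 0, 250))
# KITTI-360/data_2d_raw/2013_05_28_drive_0000_sync/image_00/data_rect/0000000250.png
```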


Development Toolkit

We provide a development toolkit for loading and inspecting the 2D and 3D labels. Please find more details here:

2D Data Format

Our 2D raw data include images collected by a pair of perspective cameras and a pair of fisheye cameras:
  • data_2d_raw/2013_05_28_drive_{seq:0>4}_sync/image_{00|01}/data_rect/{frame:0>10}.png:
    Stereo pairs in 8-bit PNG format.
  • data_2d_raw/2013_05_28_drive_{seq:0>4}_sync/image_{02|03}/data_rgb/{frame:0>10}.png:
    Fisheye images in 8-bit PNG format.
For each frame, we provide semantic & instance labels as well as confidence maps:
  • data_2d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/image_{00|01}/semantic/{frame:0>10}.png:
    Semantic label in single-channel 8-bit PNG format. Each pixel value denotes the corresponding semanticID.
  • data_2d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/image_{00|01}/semantic_rgb/{frame:0>10}.png:
    Semantic RGB image in 3-channel 8-bit PNG format. Each pixel value denotes the color-coded semantic label.
  • data_2d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/image_{00|01}/instance/{frame:0>10}.png:
    Instance label in single-channel 16-bit PNG format. Each pixel value denotes the corresponding instanceID. Here, instanceID = semanticID*1000 + classInstanceID, where classInstanceID denotes the instance ID within one class and classInstanceID = 0 for classes without instance labels. Note that instanceID is unique across the full sequence: for example, a building appearing in different frames has the same instanceID in all these frames.
  • data_2d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/image_{00|01}/confidence/{frame:0>10}.png:
    Confidence map in single-channel 8-bit PNG format. Each pixel value corresponds to a confidence score ranging from 0 to 255. Lower values suggest lower confidence.
Please find helper functions for loading 2D data & labels in our development toolkit. The toolkit can visualize our 2D labels in semantic or instance mode. It also visualizes the projected 3D bounding boxes in 2D.
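
The relation instanceID = semanticID*1000 + classInstanceID can be inverted with an integer division and a remainder. The sketch below shows this on plain integers; the function names and the example values are hypothetical and not taken from the toolkit.

```python
from typing import Tuple


def make_instance_id(semantic_id: int, class_instance_id: int = 0) -> int:
    """Compose an instanceID as described above."""
    return semantic_id * 1000 + class_instance_id


def split_instance_id(instance_id: int) -> Tuple[int, int]:
    """Recover (semanticID, classInstanceID) from an instanceID."""
    semantic_id, class_instance_id = divmod(instance_id, 1000)
    return semantic_id, class_instance_id


# Hypothetical pixel value read from an instance PNG.
print(split_instance_id(26005))   # (26, 5)
# A class without instance labels has classInstanceID = 0.
print(split_instance_id(7000))    # (7, 0)
```

The same arithmetic (`// 1000` and `% 1000`) applies element-wise if the instance image is loaded as an integer array.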

3D Data Format

We release the 3D raw scans as well as fused point clouds. The format of the 3D raw data is:
  • data_3d_raw/2013_05_28_drive_{seq:0>4}_sync/velodyne_points/data/{frame:0>10}.bin:
    Velodyne scans in BINARY format.
  • data_3d_raw/2013_05_28_drive_{seq:0>4}_sync/velodyne_points/timestamps.txt:
    Timestamps of Velodyne scans. Each line contains the timestamp of one scan.
  • data_3d_raw/2013_05_28_drive_{seq:0>4}_sync/sick_points/data/{frame:0>10}.bin:
    SICK scans in BINARY format. Note that the SICK laser scanner has a higher FPS, so the frame indices of SICK scans do not align with those of the images or the Velodyne scans.
  • data_3d_raw/2013_05_28_drive_{seq:0>4}_sync/sick_points/timestamps.txt:
    Timestamps of SICK scans. Each line contains the timestamp of one scan.
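
Because SICK frame indices do not align with image or Velodyne frame indices, one way to associate the two streams is by nearest timestamp. The sketch below assumes the timestamps have already been parsed into sorted lists of seconds (the text format of timestamps.txt is not specified here); the function name and the sample values are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_index(sorted_times: Sequence[float], t: float) -> int:
    """Return the index of the entry in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    # Prefer the earlier scan when both neighbours are equally close.
    return i if (after - t) < (t - before) else i - 1


def match_scans(sick_times: Sequence[float], velo_times: Sequence[float]) -> List[int]:
    """For each Velodyne timestamp, find the nearest SICK scan index."""
    return [nearest_index(sick_times, t) for t in velo_times]


# Hypothetical timestamps in seconds; SICK runs at the higher rate.
sick = [0.0, 0.25, 0.5, 0.75, 1.0]
velo = [0.1, 0.6, 1.2]
print(match_scans(sick, velo))  # [0, 2, 4]
```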
We divide the fused point clouds into windows to reduce the size of individual files, where each window is defined by the start_frame and the end_frame (both in 10 digits):
  • data_3d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/static/{start_frame:0>10}_{end_frame:0>10}.ply:
    Fused static point clouds in PLY format for training. The PLY file contains only vertices. Each vertex of the PLY contains the following information: x y z red green blue semanticID instanceID isVisible. Here, x y z (32-bit float) is the location of a 3D point in world coordinates; red green blue (8-bit uchar) is the color of the point, obtained by projecting it onto adjacent 2D images; and semanticID instanceID (32-bit int) give the label of the point, where instanceID is consistent with the 2D labels. The last value, isVisible (8-bit uchar), is a binary variable which is 0 when a 3D point is not visible in any of the perspective images. For these occluded points, we keep a 3D point only if it is uniquely labeled by a 3D bounding box, and we assign the label according to the annotation. Unlabeled or ambiguously labeled points are ignored.
  • data_3d_semantics/train/2013_05_28_drive_{seq:0>4}_sync/dynamic/{start_frame:0>10}_{end_frame:0>10}.ply:
    Fused dynamic point clouds in PLY format. The PLY file contains only vertices. Each vertex has an additional timestamp (32-bit int) value compared to the static points: x y z red green blue semantic instance isVisible timestamp.
  • data_3d_semantics/test/2013_05_28_drive_{seq:0>4}_sync/static/{start_frame:0>10}_{end_frame:0>10}.ply:
    Fused static point clouds in PLY format for testing. The test point clouds share the same format as the training point clouds except that labels are omitted: x y z red green blue isVisible.
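
The field types listed above fix the size of one vertex record, and the window file names encode the frame range. The sketch below works through both. It assumes a packed binary little-endian encoding (the PLY header of an actual file is authoritative) and treats end_frame as inclusive; the constant names, helper names, and example file names are hypothetical.

```python
import re
import struct
from typing import List, Tuple

# Assumed packed little-endian layouts, derived from the field types above.
TRAIN_STATIC = "<3f3B2iB"    # x y z | red green blue | semanticID instanceID | isVisible
TRAIN_DYNAMIC = "<3f3B2iBi"  # static fields plus a 32-bit int timestamp
TEST_STATIC = "<3f3BB"       # x y z | red green blue | isVisible

for name, fmt in [("train/static", TRAIN_STATIC),
                  ("train/dynamic", TRAIN_DYNAMIC),
                  ("test/static", TEST_STATIC)]:
    print(name, struct.calcsize(fmt), "bytes per vertex")
# train/static 24 bytes per vertex
# train/dynamic 28 bytes per vertex
# test/static 16 bytes per vertex

WINDOW_RE = re.compile(r"^(\d{10})_(\d{10})\.ply$")


def parse_window(filename: str) -> Tuple[int, int]:
    """Extract (start_frame, end_frame) from a window file name."""
    match = WINDOW_RE.match(filename)
    if match is None:
        raise ValueError("not a window file name: " + filename)
    return int(match.group(1)), int(match.group(2))


def windows_containing(filenames: List[str], frame: int) -> List[str]:
    """Return the window files whose frame range covers the given frame."""
    selected = []
    for name in filenames:
        start, end = parse_window(name)
        if start <= frame <= end:  # end_frame assumed inclusive
            selected.append(name)
    return selected


# Hypothetical file names, for illustration only.
files = ["0000000000_0000000100.ply", "0000000101_0000000200.ply"]
print(windows_containing(files, 150))  # ['0000000101_0000000200.ply']
```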
We also release the 3D bounding boxes:
  • data_3d_bboxes/train/2013_05_28_drive_{seq:0>4}_sync.xml:
    Each element object{d} denotes a bounding box whose semanticID and instanceID are consistent with the 2D labels. The vertices and faces matrices form the mesh of the bounding box in a local coordinate system. The transform matrix transforms this mesh to world coordinates. The timestamp denotes the frame ID for dynamic objects and is -1 for static objects.
Please find helper functions for loading 3D data & labels in our development toolkit. The toolkit supports visualizing the fused point cloud in semantic or instance mode as well as visualizing the 3D bounding boxes.
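
To illustrate how the transform matrix moves a bounding box mesh from its local coordinate system into world coordinates, here is a minimal sketch. It assumes the transform has already been read from the XML as a row-major matrix with at least three rows of four values (rotation and translation), and that vertices is a list of (x, y, z) tuples; XML parsing is omitted, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def transform_vertices(vertices: Sequence[Vertex],
                       transform: Sequence[Sequence[float]]) -> List[Vertex]:
    """Apply the rotation and translation in transform to each local vertex."""
    world = []
    for x, y, z in vertices:
        world.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in transform[:3]
        ))
    return world


def is_static(timestamp: int) -> bool:
    """Static objects carry timestamp -1, dynamic ones a frame ID."""
    return timestamp == -1


# Hypothetical box: rotate 90 degrees about z, then translate by (10, 0, 2).
transform = [
    [0.0, -1.0, 0.0, 10.0],
    [1.0,  0.0, 0.0,  0.0],
    [0.0,  0.0, 1.0,  2.0],
    [0.0,  0.0, 0.0,  1.0],
]
local_vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(transform_vertices(local_vertices, transform))
# [(10.0, 1.0, 2.0), (9.0, 0.0, 2.0)]
```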


The calibration folder contains intrinsics and extrinsics of our sensors.
  • calibration/calib_cam_to_pose.txt:
    Each line contains a 3x4 matrix denoting the transformation from a camera to the system pose. There are 4 lines, one for each of the two perspective cameras image_00, image_01 and the two fisheye cameras image_02, image_03.
  • calibration/calib_cam_to_velo.txt:
    A 3x4 matrix denoting the rigid transformation from the first camera (image_00) to the Velodyne.
  • calibration/calib_sick_to_velo.txt:
    A 3x4 matrix denoting the rigid transformation from the SICK laser scanner to the Velodyne.
  • calibration/perspective.txt:
    Intrinsics of the perspective cameras. The lines starting with P_rect_00 and P_rect_01 provide 3x4 perspective intrinsics. R_rect_00 and R_rect_01 correspond to 3x3 rectification matrices.
  • calibration/image_{02|03}.yaml:
    Intrinsics of the fisheye cameras.
Please find helper functions for loading calibrations in our development toolkit.
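
The 3x4 extrinsics can be chained once they are extended to 4x4 homogeneous matrices. As an example, a Velodyne-to-pose transform can be derived as T_velo_to_pose = T_cam0_to_pose * inverse(T_cam0_to_velo). The sketch below uses plain Python lists and assumes the matrices have already been parsed as rows of four numbers; all function names and the sample values are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def to_homogeneous(m34: Sequence[Sequence[float]]) -> Matrix:
    """Append the row [0, 0, 0, 1] to a 3x4 matrix."""
    return [list(row) for row in m34] + [[0.0, 0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Hypothetical calibration values, for illustration only.
cam0_to_velo = to_homogeneous([[1.0, 0.0, 0.0, 1.0],
                               [0.0, 1.0, 0.0, 0.0],
                               [0.0, 0.0, 1.0, 0.0]])
cam0_to_pose = to_homogeneous([[1.0, 0.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0, 2.0],
                               [0.0, 0.0, 1.0, 0.0]])

# Velodyne -> camera 0 -> system pose.
velo_to_pose = matmul(cam0_to_pose, invert_rigid(cam0_to_velo))
print([row[3] for row in velo_to_pose[:3]])  # [-1.0, 2.0, 0.0]
```

The closed-form inverse is valid only for rigid transforms (orthonormal rotation plus translation), which matches how cam_to_velo and sick_to_velo are described above.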


We release the system poses and poses of the first perspective camera.
  • data_poses/2013_05_28_drive_{seq:0>4}_sync/poses.txt:
    Each line has 13 numbers: the first is an integer denoting the frame index, and the remaining 12 form a 3x4 matrix denoting the system pose in a global Euclidean space.
  • data_poses/2013_05_28_drive_{seq:0>4}_sync/cam0_to_world.txt:
    Each line has 17 numbers: the first is an integer denoting the frame index, and the remaining 16 form a 4x4 matrix denoting the pose of camera 0 in a global Euclidean space.
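
Both pose files share the same line layout (a frame index followed by a matrix), so one parser can cover them. The sketch below assumes whitespace-separated values stored in row-major order; the function name and the sample line are hypothetical.

```python
from typing import List, Tuple


def parse_pose_line(line: str) -> Tuple[int, List[List[float]]]:
    """Parse one line of poses.txt (13 numbers) or cam0_to_world.txt (17 numbers).

    Returns the frame index and the matrix as a list of rows with 4 values each.
    """
    fields = line.split()
    frame = int(fields[0])
    values = [float(v) for v in fields[1:]]
    if len(values) not in (12, 16):
        raise ValueError("expected 13 or 17 numbers, got %d" % len(fields))
    rows = len(values) // 4  # 3 rows for poses.txt, 4 for cam0_to_world.txt
    matrix = [values[r * 4:(r + 1) * 4] for r in range(rows)]
    return frame, matrix


# Hypothetical line in the poses.txt layout: frame 5, identity rotation.
frame, pose = parse_pose_line("5 1 0 0 10 0 1 0 20 0 0 1 30")
print(frame)  # 5
print(pose)   # [[1.0, 0.0, 0.0, 10.0], [0.0, 1.0, 0.0, 20.0], [0.0, 0.0, 1.0, 30.0]]
```

Since every line carries its own frame index, storing the parsed poses in a dictionary keyed by frame index is a natural choice.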

Sensor Locations
