Publications

Publications of Kashyap Chitta

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
S. Gao, J. Yang, L. Chen, K. Chitta, Y. Qiu, A. Geiger, J. Zhang and H. Li
Advances in Neural Information Processing Systems (NeurIPS), 2024

Abstract: World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.

Latex Bibtex Citation:
@inproceedings{Gao2024NEURIPS,
author = {Shenyuan Gao and Jiazhi Yang and Li Chen and Kashyap Chitta and Yihang Qiu and Andreas Geiger and Jun Zhang and Hongyang Li},
title = {Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024}
}

NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al.
Advances in Neural Information Processing Systems (NeurIPS), 2024

Abstract: Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges.

Latex Bibtex Citation:
@inproceedings{Dauner2024NEURIPS,
author = {Daniel Dauner and Marcel Hallgarten and Tianyu Li and Xinshuo Weng and Zhiyu Huang and Zetong Yang and Hongyang Li and Igor Gilitschenski and Boris Ivanovic and Marco Pavone and Andreas Geiger and Kashyap Chitta},
title = {NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024}
}

Supplementary Material

SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
K. Chitta, D. Dauner and A. Geiger
European Conference on Computer Vision (ECCV), 2024

Abstract: SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for rule-based traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder. It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500x less storage to set up (<4 GB), making it a more accessible option and helping with democratizing future research in this field.

Latex Bibtex Citation:
@inproceedings{Chitta2024ECCV,
author = {Kashyap Chitta and Daniel Dauner and Andreas Geiger},
title = {SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024}
}

Supplementary Material

DriveLM: Driving with Graph Visual Question Answering (oral)
C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger and H. Li
European Conference on Computer Vision (ECCV), 2024

Abstract: We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can be the starting point to shed new light on how to apply VLMs for autonomous driving. To facilitate future research, all code, data, and models are available to the public.

Latex Bibtex Citation:
@inproceedings{Sima2024ECCV,
author = {Chonghao Sima and Katrin Renz and Kashyap Chitta and Li Chen and Hanxue Zhang and Chengen Xie and Ping Luo and Andreas Geiger and Hongyang Li},
title = {DriveLM: Driving with Graph Visual Question Answering},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024}
}

End-to-end Autonomous Driving: Challenges and Frontiers
L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger and H. Li
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Abstract: The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction. End-to-end systems, in comparison to modular pipelines, benefit from joint feature optimization for perception and planning. This field has flourished due to the availability of large-scale datasets, closed-loop evaluation, and the increasing need for autonomous driving algorithms to perform effectively in challenging scenarios. In this survey, we provide a comprehensive analysis of more than 250 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving. We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others. Additionally, we discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework. To facilitate future research, we maintain an active repository that contains up-to-date links to relevant literature and open-source projects at https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving.

Latex Bibtex Citation:
@article{Chen2024PAMI,
author = {Li Chen and Penghao Wu and Kashyap Chitta and Bernhard Jaeger and Andreas Geiger and Hongyang Li},
title = {End-to-end Autonomous Driving: Challenges and Frontiers},
journal = {Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2024}
}

Generalized Predictive Model for Autonomous Driving (highlight)
J. Yang, S. Gao, Y. Qiu, L. Chen, T. Li, B. Dai, K. Chitta, P. Wu, J. Zeng, P. Luo, et al.
Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Abstract: In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.

Latex Bibtex Citation:
@inproceedings{Yang2024CVPR,
author = {Jiazhi Yang and Shenyuan Gao and Yihang Qiu and Li Chen and Tianyu Li and Bo Dai and Kashyap Chitta and Penghao Wu and Jia Zeng and Ping Luo and Jun Zhang and Andreas Geiger and Yu Qiao and Hongyang Li},
title = {Generalized Predictive Model for Autonomous Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024}
}

Parting with Misconceptions about Learning-based Vehicle Motion Planning
D. Dauner, M. Hallgarten, A. Geiger and K. Chitta
Conference on Robot Learning (CoRL), 2023

Abstract: The release of nuPlan marks a new era in vehicle motion planning research, offering the first large-scale real-world dataset and evaluation schemes requiring both precise short-term planning and long-horizon ego-forecasting. Existing systems struggle to simultaneously meet both requirements. Indeed, we find that these tasks are fundamentally misaligned and should be addressed independently. We further assess the current state of closed-loop planning in the field, revealing the limitations of learning-based methods in complex real-world scenarios and the value of simple rule-based priors such as centerline selection through lane graph search algorithms. More surprisingly, for the open-loop sub-task, we observe that the best results are achieved when using only this centerline as scene context (ie, ignoring all information regarding the map and other agents). Combining these insights, we propose an extremely simple and efficient planner which outperforms an extensive set of competitors, winning the nuPlan planning challenge 2023.

Latex Bibtex Citation:
@inproceedings{Dauner2023CORL,
author = {Daniel Dauner and Marcel Hallgarten and Andreas Geiger and Kashyap Chitta},
title = {Parting with Misconceptions about Learning-based Vehicle Motion Planning},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2023}
}

Supplementary Material

Hidden Biases of End-to-End Driving Models
B. Jaeger, K. Chitta and A. Geiger
International Conference on Computer Vision (ICCV), 2023

Abstract: End-to-end driving systems have recently made rapid progress, in particular on CARLA. Independent of their major contribution, they introduce changes to minor system components. Consequently, the source of improvements is unclear. We identify two biases that recur in nearly all state-of-the-art methods and are critical for the observed progress on CARLA: (1) lateral recovery via a strong inductive bias towards target point following, and (2) longitudinal averaging of multimodal waypoint predictions for slowing down. We investigate the drawbacks of these biases and identify principled alternatives. By incorporating our insights, we develop TF++, a simple end-to-end method that ranks first on the Longest6 and LAV benchmarks, gaining 11 driving score over the best prior work on Longest6.

Latex Bibtex Citation:
@inproceedings{Jaeger2023ICCV,
author = {Bernhard Jaeger and Kashyap Chitta and Andreas Geiger},
title = {Hidden Biases of End-to-End Driving Models},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2023}
}

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz and A. Geiger
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Abstract: How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.

Latex Bibtex Citation:
@article{Chitta2022PAMI,
author = {Kashyap Chitta and Aditya Prakash and Bernhard Jaeger and Zehao Yu and Katrin Renz and Andreas Geiger},
title = {TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving},
journal = {Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2023}
}

Supplementary Material

End-to-end Autonomous Driving: Challenges and Frontiers
L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger and H. Li
Arxiv, 2023

Abstract: The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction. End-to-end systems, in comparison to modular pipelines, benefit from joint feature optimization for perception and planning. This field has flourished due to the availability of large-scale datasets, closed-loop evaluation, and the increasing need for autonomous driving algorithms to perform effectively in challenging scenarios. In this survey, we provide a comprehensive analysis of more than 250 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving. We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others. Additionally, we discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework.

Latex Bibtex Citation:
@article{Chen2023ARXIVa,
author = {Li Chen and Penghao Wu and Kashyap Chitta and Bernhard Jaeger and Andreas Geiger and Hongyang Li},
title = {End-to-end Autonomous Driving: Challenges and Frontiers},
journal = {Arxiv},
year = {2023}
}

PlanT: Explainable Planning Transformers via Object-Level Representations
K. Renz, K. Chitta, O. Mercea, A. Koepke, Z. Akata and A. Geiger
Conference on Robot Learning (CoRL), 2022

Abstract: Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations of the scene containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. With this representation, we demonstrate that information regarding the ego vehicle's route provides sufficient context regarding the road layout for planning. On the challenging Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision making. Our results indicate that PlanT can reliably focus on the most relevant object in the scene, even when this object is geometrically distant.

Latex Bibtex Citation:
@inproceedings{Renz2022CORL,
author = {Katrin Renz and Kashyap Chitta and Otniel-Bogdan Mercea and Almut Sophia Koepke and Zeynep Akata and Andreas Geiger},
title = {PlanT: Explainable Planning Transformers via Object-Level Representations},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2022}
}

Supplementary Material

KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients (oral)
N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya and A. Geiger
European Conference on Computer Vision (ECCV), 2022

Abstract: Simulators offer the possibility of safe, low-cost development of self-driving systems. However, current driving simulators exhibit naïve behavior models for background traffic. Hand-tuned scenarios are typically added during simulation to induce safety-critical situations. An alternative approach is to adversarially perturb the background traffic trajectories. In this paper, we study this approach to safety-critical driving scenario generation using the CARLA simulator. We use a kinematic bicycle model as a proxy to the simulator's true dynamics and observe that gradients through this proxy model are sufficient for optimizing the background traffic trajectories. Based on this finding, we propose KING, which generates safety-critical driving scenarios with a 20% higher success rate than black-box optimization. By solving the scenarios generated by KING using a privileged rule-based expert algorithm, we obtain training data for an imitation learning policy. After fine-tuning on this new data, we show that the policy becomes better at avoiding collisions. Importantly, our generated data leads to reduced collisions on both held-out scenarios generated via KING as well as traditional hand-crafted scenarios, demonstrating improved robustness.

Latex Bibtex Citation:
@inproceedings{Hanselmann2022ECCV,
author = {Niklas Hanselmann and Katrin Renz and Kashyap Chitta and Apratim Bhattacharyya and Andreas Geiger},
title = {KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2022}
}

Supplementary Material

Projected GANs Converge Faster
A. Sauer, K. Chitta, J. Müller and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021

Abstract: Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. They need careful regularization, vast amounts of compute, and expensive hyper-parameter sweeps. We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space. Motivated by the finding that the discriminator cannot fully exploit features from deeper layers of the pretrained model, we propose a more effective strategy that mixes features across channels and resolutions. Our Projected GAN improves image quality, sample efficiency, and convergence speed. It is further compatible with resolutions of up to one Megapixel and advances the state-of-the-art Fréchet Inception Distance (FID) on twenty-two benchmark datasets. Importantly, Projected GANs match the previously lowest FIDs up to 40 times faster, cutting the wall-clock time from 5 days to less than 3 hours given the same computational resources.

Latex Bibtex Citation:
@inproceedings{Sauer2021NEURIPS,
author = {Axel Sauer and Kashyap Chitta and Jens Müller and Andreas Geiger},
title = {Projected GANs Converge Faster},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2021}
}

Supplementary Material

NEAT: Neural Attention Fields for End-to-End Autonomous Driving
K. Chitta, A. Prakash and A. Geiger
International Conference on Computer Vision (ICCV), 2021

Abstract: Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial pre-requisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end Imitation Learning (IL) models. Our representation is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows our model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. NEAT nearly matches the state-of-the-art on the CARLA Leaderboard while being far less resource-intensive. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability. On a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert used to generate its training data.

Latex Bibtex Citation:
@inproceedings{Chitta2021ICCV,
author = {Kashyap Chitta and Aditya Prakash and Andreas Geiger},
title = {NEAT: Neural Attention Fields for End-to-End Autonomous Driving},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2021}
}

Supplementary Material

Benchmarking Unsupervised Object Representations for Video Sequences
M. Weis, K. Chitta, Y. Sharma, W. Brendel, M. Bethge, A. Geiger and A. Ecker
Journal of Machine Learning Research (JMLR), 2021

Abstract: Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video-extension of MONet, based on recurrent spatial attention, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial transformer based architectures. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.

Latex Bibtex Citation:
@article{Weis2021JMLR,
author = {Marissa Weis and Kashyap Chitta and Yash Sharma and Wieland Brendel and Matthias Bethge and Andreas Geiger and Alexander Ecker},
title = {Benchmarking Unsupervised Object Representations for Video Sequences},
journal = {Journal of Machine Learning Research (JMLR)},
year = {2021}
}

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
A. Prakash, K. Chitta and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Abstract: How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end driving models. In this work, we demonstrate that existing sensor fusion methods under-perform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, such as handling traffic oncoming from multiple directions at uncontrolled intersections. Therefore, we propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention. We experimentally validate the efficacy of our approach in urban settings involving complex scenarios using the CARLA urban driving simulator. Our approach achieves state-of-the-art driving performance while reducing collisions by 80% compared to geometry-based fusion.

Latex Bibtex Citation:
@inproceedings{Prakash2021CVPR,
author = {Aditya Prakash and Kashyap Chitta and Andreas Geiger},
title = {Multi-Modal Fusion Transformer for End-to-End Autonomous Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Supplementary Material

Label Efficient Visual Abstractions for Autonomous Driving
A. Behl, K. Chitta, A. Prakash, E. Ohn-Bar and A. Geiger
International Conference on Intelligent Robots and Systems (IROS), 2020

Abstract: It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, ie, the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (eg, object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.

Latex Bibtex Citation:
@inproceedings{Behl2020IROS,
author = {Aseem Behl and Kashyap Chitta and Aditya Prakash and Eshed Ohn-Bar and Andreas Geiger},
title = {Label Efficient Visual Abstractions for Autonomous Driving},
booktitle = {International Conference on Intelligent Robots and Systems (IROS)},
year = {2020}
}

Learning Situational Driving
E. Ohn-Bar, A. Prakash, A. Behl, K. Chitta and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Abstract: Human drivers have a remarkable ability to drive in diverse visual conditions and situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane markings to turning in a busy intersection while yielding to pedestrians. In contrast, we find that state-of-the-art sensorimotor driving models struggle when encountering diverse settings with varying relationships between observation and action. To generalize when making decisions across diverse conditions, humans leverage multiple types of situation-specific reasoning and learning strategies. Motivated by this observation, we develop a framework for learning a situational driving policy that effectively captures reasoning under varying types of scenarios. Our key idea is to learn a mixture model with a set of policies that can capture multiple driving modes. We first optimize the mixture model through behavior cloning, and show it to result in significant gains in terms of driving performance in diverse conditions. We then refine the model by directly optimizing for the driving task itself, i.e., supervised with the navigation task reward. Our method is more scalable than methods assuming access to privileged information, e.g., perception labels, as it only assumes demonstration and reward-based supervision. We achieve over 98% success rate on the CARLA driving benchmark as well as state-of-the-art performance on a newly introduced generalization benchmark.

Latex Bibtex Citation:
@inproceedings{Ohn-Bar2020CVPR,
author = {Eshed Ohn-Bar and Aditya Prakash and Aseem Behl and Kashyap Chitta and Andreas Geiger},
title = {Learning Situational Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

Supplementary Material

Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving
A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Abstract: Data aggregation techniques can significantly improve vision-based policy learning within a training environment, e.g., learning to drive in a specific simulation condition. However, as on-policy data is sequentially sampled and added in an iterative manner, the policy can specialize and overfit to the training conditions. For real-world applications, it is useful for the learned policy to generalize to novel scenarios that differ from the training conditions. To improve policy learning while maintaining robustness when training end-to-end driving policies, we perform an extensive analysis of data aggregation techniques in the CARLA environment. We demonstrate how the majority of them have poor generalization performance, and develop a novel approach with empirically better generalization performance compared to existing techniques. Our two key ideas are (1) to sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior, and (2) to incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy's state distribution. We evaluate the proposed approach on the CARLA NoCrash benchmark, focusing on the most challenging driving scenarios with dense pedestrian and vehicle traffic. Our approach improves driving success rate by 16% over state-of-the-art, achieving 87% of the expert performance while also reducing the collision rate by an order of magnitude without the use of any additional modality, auxiliary tasks, architectural modifications or reward from the environment.

Latex Bibtex Citation:
@inproceedings{Prakash2020CVPR,
author = {Aditya Prakash and Aseem Behl and Eshed Ohn-Bar and Kashyap Chitta and Andreas Geiger},
title = {Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

Supplementary Material