
Publications of Yiyi Liao

HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
H. Zhou, J. Shao, L. Xu, D. Bai, W. Qiu, B. Liu, Y. Wang, A. Geiger and Y. Liao
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Abstract: Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy, and reconstruct dynamic scenes, even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach.
Latex Bibtex Citation:
@inproceedings{Zhou2024CVPR,
  author = {Hongyu Zhou and Jiahao Shao and Lu Xu and Dongfeng Bai and Weichao Qiu and Bingbing Liu and Yue Wang and Andreas Geiger and Yiyi Liao},
  title = {HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024}
}
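Illustrative Python sketch (not the authors' exact formulation): the abstract above mentions regularizing moving object poses via physical constraints; a simple stand-in is a constant-velocity prior on the per-frame object translations, penalizing jerky trajectories during the joint optimization. All names and shapes below are assumptions for illustration.
import torch

def constant_velocity_regularizer(translations: torch.Tensor) -> torch.Tensor:
    # translations: (T, 3) learnable object centers over T frames
    velocity = translations[1:] - translations[:-1]      # (T-1, 3) finite-difference velocities
    acceleration = velocity[1:] - velocity[:-1]          # (T-2, 3) finite-difference accelerations
    return (acceleration ** 2).sum(dim=-1).mean()        # small when motion is near constant velocity

# Added to the rendering losses during joint optimization of the dynamic Gaussians:
poses = torch.randn(10, 3, requires_grad=True)
reg = constant_velocity_regularizer(poses)
reg.backward()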
VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids
K. Schwarz, A. Sauer, M. Niemeyer, Y. Liao and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2022
Abstract: State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.
Latex Bibtex Citation:
@inproceedings{Schwarz2022NEURIPS,
  author = {Katja Schwarz and Axel Sauer and Michael Niemeyer and Yiyi Liao and Andreas Geiger},
  title = {VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2022}
}
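Illustrative Python sketch (shapes are assumptions): the foreground/background split described above amounts to alpha-compositing the 3D foreground rendering over a separately generated 2D background.
import torch

def composite_fg_bg(fg_rgb, fg_alpha, bg_rgb):
    # fg_rgb, bg_rgb: (B, 3, H, W) images; fg_alpha: (B, 1, H, W) accumulated foreground opacity
    return fg_alpha * fg_rgb + (1.0 - fg_alpha) * bg_rgb

image = composite_fg_bg(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 3, 64, 64))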
Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation
X. Fu, S. Zhang, T. Chen, Y. Lu, L. Zhu, X. Zhou, A. Geiger and Y. Liao
International Conference on 3D Vision (3DV), 2022
Abstract: Large-scale training data with high-quality annotations is critical for training semantic and instance segmentation models. Unfortunately, pixel-wise annotation is labor-intensive and costly, raising the demand for more efficient labeling strategies. In this work, we present a novel 3D-to-2D label transfer method, Panoptic NeRF, which aims to obtain per-pixel 2D semantic and instance labels from easy-to-obtain coarse 3D bounding primitives. Our method utilizes NeRF as a differentiable tool to unify coarse 3D annotations and 2D semantic cues transferred from existing datasets. We demonstrate that this combination allows for improved geometry guided by semantic information, enabling rendering of accurate semantic maps across multiple views. Furthermore, this fusion process resolves label ambiguity of the coarse 3D annotations and filters noise in the 2D predictions. By inferring in 3D space and rendering to 2D labels, our 2D semantic and instance labels are multi-view consistent by design. Experimental results show that Panoptic NeRF outperforms existing semantic and instance label transfer methods in terms of accuracy and multi-view consistency on challenging urban scenes of the KITTI-360 dataset.
Latex Bibtex Citation:
@inproceedings{Fu2022THREEDV,
  author = {Xiao Fu and Shangzhan Zhang and Tianrun Chen and Yichong Lu and Lanyun Zhu and Xiaowei Zhou and Andreas Geiger and Yiyi Liao},
  title = {Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation},
  booktitle = {International Conference on 3D Vision (3DV)},
  year = {2022}
}
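Illustrative Python sketch (a generic NeRF-style semantic rendering, not the exact Panoptic NeRF pipeline): per-sample semantic logits are accumulated along a ray with the usual volume-rendering weights, which is what makes the rendered 2D labels multi-view consistent by construction.
import torch

def render_semantics(sigma, semantic_logits, deltas):
    # sigma: (N,) densities along one ray; semantic_logits: (N, C); deltas: (N,) step sizes
    alpha = 1.0 - torch.exp(-sigma * deltas)                                  # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                                   # volume-rendering weights
    probs = torch.softmax(semantic_logits, dim=-1)                            # per-sample class probabilities
    return (weights[:, None] * probs).sum(dim=0)                              # (C,) pixel-level class distribution

pixel_dist = render_semantics(torch.rand(64), torch.randn(64, 19), torch.full((64,), 0.01))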
KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D
Y. Liao, J. Xie and A. Geiger
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Abstract: For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, the successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k semantic and instance annotated images and 1B annotated 3D points. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of our grand challenges: the development of fully autonomous self-driving systems.
Latex Bibtex Citation:
@article{Liao2022PAMI,
  author = {Yiyi Liao and Jun Xie and Andreas Geiger},
  title = {KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D},
  journal = {Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
  year = {2022}
}
On the Frequency Bias of Generative Models
K. Schwarz, Y. Liao and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021
Abstract: The key objective of Generative Adversarial Networks (GANs) is to generate new data with the same statistics as the provided training data. However, multiple recent works show that state-of-the-art architectures still struggle to achieve this goal. In particular, they report an elevated amount of high frequencies in the spectral statistics which makes it straightforward to distinguish real and generated images. Explanations for this phenomenon are controversial: While most works attribute the artifacts to the generator, other works point to the discriminator. We take a sober look at those explanations and provide insights on what makes proposed measures against high-frequency artifacts effective. To achieve this, we first independently assess the architectures of both the generator and discriminator and investigate if they exhibit a frequency bias that makes learning the distribution of high-frequency content particularly problematic. Based on these experiments, we make the following four observations: 1) Different upsampling operations bias the generator towards different spectral properties. 2) Checkerboard artifacts introduced by upsampling cannot explain the spectral discrepancies alone as the generator is able to compensate for these artifacts. 3) The discriminator does not struggle with detecting high frequencies per se but rather struggles with frequencies of low magnitude. 4) The downsampling operations in the discriminator can impair the quality of the training signal it provides. In light of these findings, we analyze proposed measures against high-frequency artifacts in state-of-the-art GAN training but find that none of the existing approaches can fully resolve spectral artifacts yet. Our results suggest that there is great potential in improving the discriminator and that this could be key to matching the distribution of the training data more closely.
Latex Bibtex Citation:
@inproceedings{Schwarz2021NEURIPS,
  author = {Katja Schwarz and Yiyi Liao and Andreas Geiger},
  title = {On the Frequency Bias of Generative Models},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021}
}
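Illustrative Python sketch of the kind of spectral statistic discussed above: an azimuthally averaged power spectrum, commonly used to expose elevated high-frequency content in generated images (details differ from the paper's exact evaluation).
import numpy as np

def radial_power_spectrum(img):
    # img: (H, W) grayscale image; returns mean spectral power per radial frequency bin
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.sqrt((y - h / 2) ** 2 + (x - w / 2) ** 2).astype(int)   # integer radius of each frequency
    radial_sum = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return radial_sum / np.maximum(counts, 1)

spectrum = radial_power_spectrum(np.random.rand(128, 128))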
Shape As Points: A Differentiable Poisson Solver (oral)
S. Peng, C. Jiang, Y. Liao, M. Niemeyer, M. Pollefeys and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021
Abstract: In recent years, neural implicit representations gained popularity in 3D reconstruction due to their expressiveness and flexibility. However, the implicit nature of neural implicit representations results in slow inference times and requires careful initialization. In this paper, we revisit the classic yet ubiquitous point cloud representation and introduce a differentiable point-to-mesh layer using a differentiable formulation of Poisson Surface Reconstruction (PSR) which allows for a GPU-accelerated fast solution of the indicator function given an oriented point cloud. The differentiable PSR layer allows us to efficiently and differentiably bridge the explicit 3D point representation with the 3D mesh via the implicit indicator field, enabling end-to-end optimization of surface reconstruction metrics such as Chamfer distance. This duality between points and meshes hence allows us to represent shapes as oriented point clouds, which are explicit, lightweight and expressive. Compared to neural implicit representations, our Shape-As-Points (SAP) model is more interpretable, lightweight, and accelerates inference time by one order of magnitude. Compared to other explicit representations such as points, patches, and meshes, SAP produces topology-agnostic, watertight manifold surfaces. We demonstrate the effectiveness of SAP on the task of surface reconstruction from unoriented point clouds and learning-based reconstruction.
Latex Bibtex Citation:
@inproceedings{Peng2021NEURIPS,
  author = {Songyou Peng and Chiyu Max Jiang and Yiyi Liao and Michael Niemeyer and Marc Pollefeys and Andreas Geiger},
  title = {Shape As Points: A Differentiable Poisson Solver},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021}
}
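Illustrative Python sketch of the spectral step at the heart of a differentiable Poisson solve (the point rasterization and smoothing used in the paper are omitted; grid size and names are assumptions):
import math
import torch

def spectral_poisson_solve(div):
    # div: (D, D, D) divergence of the (rasterized, smoothed) oriented-point normal field
    D = div.shape[0]
    freqs = torch.fft.fftfreq(D)
    kx, ky, kz = torch.meshgrid(freqs, freqs, freqs, indexing="ij")
    denom = -(2.0 * math.pi) ** 2 * (kx ** 2 + ky ** 2 + kz ** 2)   # Fourier symbol of the Laplacian
    denom[0, 0, 0] = 1.0                                            # avoid division by zero at DC
    chi_hat = torch.fft.fftn(div) / denom
    chi_hat[0, 0, 0] = 0.0                                          # indicator is defined up to a constant
    return torch.fft.ifftn(chi_hat).real                            # (D, D, D) indicator grid

indicator = spectral_poisson_solve(torch.randn(32, 32, 32))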
KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs
C. Reiser, S. Peng, Y. Liao and A. Geiger
International Conference on Computer Vision (ICCV), 2021
Abstract: NeRF synthesizes novel views of a scene with unprecedented quality by fitting a neural radiance field to RGB images. However, NeRF requires querying a deep Multi-Layer Perceptron (MLP) millions of times, leading to slow rendering times, even on modern GPUs. In this paper, we demonstrate that significant speed-ups are possible by utilizing thousands of tiny MLPs instead of one single large MLP. In our setting, each individual MLP only needs to represent parts of the scene, thus smaller and faster-to-evaluate MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, rendering is accelerated by two orders of magnitude compared to the original NeRF model without incurring high storage costs. Further, using teacher-student distillation for training, we show that this speed-up can be achieved without sacrificing visual quality.
Latex Bibtex Citation:
@inproceedings{Reiser2021ICCV,
  author = {Christian Reiser and Songyou Peng and Yiyi Liao and Andreas Geiger},
  title = {KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year = {2021}
}
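Illustrative Python sketch of the divide-and-conquer routing described above (names and grid resolution are assumptions): each 3D sample is mapped to the index of the tiny MLP responsible for its grid cell, so only small networks need to be evaluated.
import torch

def assign_to_tiny_mlps(points, bbox_min, bbox_max, grid_res):
    # points: (N, 3) sample positions; bbox_min/bbox_max: (3,) scene bounding box corners
    normalized = (points - bbox_min) / (bbox_max - bbox_min)              # map into [0, 1)
    cell = (normalized * grid_res).long().clamp(0, grid_res - 1)          # (N, 3) integer cell coordinates
    return (cell[:, 0] * grid_res + cell[:, 1]) * grid_res + cell[:, 2]   # (N,) flat index of the responsible MLP

points = torch.rand(1024, 3)
mlp_index = assign_to_tiny_mlps(points, torch.zeros(3), torch.ones(3), grid_res=16)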
SMD-Nets: Stereo Mixture Density Networks
F. Tosi, Y. Liao, C. Schmitt and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Abstract: Although stereo matching accuracy has greatly improved with deep learning in the last few years, recovering sharp boundaries and high-resolution outputs efficiently remains challenging. In this paper, we propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures which ameliorates both issues. Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities while explicitly modeling the aleatoric uncertainty inherent in the observations. Moreover, we formulate disparity estimation as a continuous problem in the image domain, allowing our model to query disparities at arbitrary spatial precision. We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets. Our experiments demonstrate increased depth accuracy near object boundaries and prediction of ultra high-resolution disparity maps on standard GPUs. We demonstrate the flexibility of our technique by improving the performance of a variety of stereo backbones.
Latex Bibtex Citation:
@inproceedings{Tosi2021CVPR,
  author = {Fabio Tosi and Yiyi Liao and Carolin Schmitt and Andreas Geiger},
  title = {SMD-Nets: Stereo Mixture Density Networks},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2021}
}
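Illustrative Python sketch of a bimodal Laplacian output head as described above (names are assumptions): the network predicts two modes and a mixture weight per pixel, the training loss is the mixture negative log-likelihood, and the final disparity is the mean of the dominant mode, which avoids blurring across depth discontinuities.
import torch

def bimodal_nll(d_gt, mu1, b1, mu2, b2, pi, eps=1e-6):
    # all inputs are (H, W) maps; pi is the weight of the first Laplacian mode in [0, 1]
    lap1 = torch.exp(-torch.abs(d_gt - mu1) / b1) / (2.0 * b1)
    lap2 = torch.exp(-torch.abs(d_gt - mu2) / b2) / (2.0 * b2)
    return -torch.log(pi * lap1 + (1.0 - pi) * lap2 + eps).mean()

def bimodal_disparity(mu1, mu2, pi):
    # pick the mean of the dominant mode rather than the (blurry) mixture mean
    return torch.where(pi >= 0.5, mu1, mu2)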
Learning Steering Kernels for Guided Depth Completion
L. Liu, Y. Liao, Y. Wang, A. Geiger and Y. Liu
Transactions on Image Processing (TIP), 2021
Abstract: This paper addresses the guided depth completion task in which the goal is to predict a dense depth map given a guidance RGB image and sparse depth measurements. Recent advances on this problem nurture hopes that one day we can acquire accurate and dense depth at a very low cost. A major challenge of guided depth completion is to effectively make use of extremely sparse measurements, e.g., measurements covering less than 1% of the image pixels. In this paper, we propose a fully differentiable model that avoids convolving on sparse tensors by jointly learning depth interpolation and refinement. More specifically, we propose a differentiable kernel regression layer that interpolates the sparse depth measurements via learned kernels. We further refine the interpolated depth map using a residual depth refinement layer which leads to improved performance compared to learning absolute depth prediction using a vanilla network. We provide experimental evidence that our differentiable kernel regression layer not only enables end-to-end training from very sparse measurements using standard convolutional network architectures, but also leads to better depth interpolation results compared to existing heuristically motivated methods. We demonstrate that our method outperforms many state-of-the-art guided depth completion techniques on both NYUv2 and KITTI. We further show the generalization ability of our method with respect to the density and spatial statistics of the sparse depth measurements.
Latex Bibtex Citation:
@article{Liu2021TIP,
  author = {Lina Liu and Yiyi Liao and Yue Wang and Andreas Geiger and Yong Liu},
  title = {Learning Steering Kernels for Guided Depth Completion},
  journal = {Transactions on Image Processing (TIP)},
  year = {2021}
}
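Illustrative Python sketch of kernel-regression interpolation of sparse depth (a fixed isotropic Gaussian stands in for the learned steering kernels; a refinement network would follow in the actual pipeline):
import torch
import torch.nn.functional as F

def kernel_interpolate(sparse_depth, mask, bandwidth=3.0, radius=7):
    # sparse_depth, mask: (B, 1, H, W) float tensors; mask is 1 where a measurement exists
    ks = 2 * radius + 1
    yy, xx = torch.meshgrid(torch.arange(ks), torch.arange(ks), indexing="ij")
    dist2 = ((yy - radius) ** 2 + (xx - radius) ** 2).float()
    kernel = torch.exp(-dist2 / (2.0 * bandwidth ** 2)).view(1, 1, ks, ks)
    num = F.conv2d(sparse_depth * mask, kernel, padding=radius)
    den = F.conv2d(mask, kernel, padding=radius)
    return num / den.clamp(min=1e-6)    # Nadaraya-Watson estimate of a dense depth map

dense = kernel_interpolate(torch.rand(1, 1, 96, 320), (torch.rand(1, 1, 96, 320) < 0.01).float())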
GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis
K. Schwarz, Y. Liao, M. Niemeyer and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2020
Abstract: While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, e.g., the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields, which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.
Latex Bibtex Citation:
@inproceedings{Schwarz2020NEURIPS,
  author = {Katja Schwarz and Yiyi Liao and Michael Niemeyer and Andreas Geiger},
  title = {GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2020}
}
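Illustrative Python sketch of multi-scale patch sampling for the discriminator (values and names are assumptions): the discriminator only ever sees a K x K patch of pixels, drawn at a random scale and offset, so only K*K rays have to be rendered per generated image.
import torch

def sample_patch_coords(K=32):
    # returns (K, K, 2) normalized pixel coordinates in [-1, 1] for ray casting
    scale = torch.empty(1).uniform_(0.25, 1.0)                     # random patch scale
    lin = torch.linspace(-1.0, 1.0, K) * scale
    offset = torch.empty(2).uniform_(-1.0, 1.0) * (1.0 - scale)    # keep the patch inside the image
    v, u = torch.meshgrid(lin + offset[1], lin + offset[0], indexing="ij")
    return torch.stack([u, v], dim=-1)

coords = sample_patch_coords()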
Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis
Y. Liao, K. Schwarz, L. Mescheder and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Abstract: In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent w.r.t. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.
Latex Bibtex Citation:
@inproceedings{Liao2020CVPR,
  author = {Yiyi Liao and Katja Schwarz and Lars Mescheder and Andreas Geiger},
  title = {Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020}
}
Connecting the Dots: Learning Representations for Active Monocular Depth Estimation
G. Riegler, Y. Liao, S. Donne, V. Koltun and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2019
Abstract: We propose a technique for depth estimation with a monocular structured-light camera, i.e., a calibrated stereo set-up with one camera and one laser projector. Instead of formulating the depth estimation via a correspondence search problem, we show that a simple convolutional architecture is sufficient for high-quality disparity estimates in this setting. As accurate ground-truth is hard to obtain, we train our model in a self-supervised fashion with a combination of photometric and geometric losses. Further, we demonstrate that the projected pattern of the structured light sensor can be reliably separated from the ambient information. This can then be used to improve depth boundaries in a weakly supervised fashion by modeling the joint statistics of image and depth edges. The model trained in this fashion compares favorably to the state-of-the-art on challenging synthetic and real-world datasets. In addition, we contribute a novel simulator, which allows benchmarking active depth prediction algorithms in controlled conditions.
Latex Bibtex Citation:
@inproceedings{Riegler2019CVPR,
  author = {Gernot Riegler and Yiyi Liao and Simon Donne and Vladlen Koltun and Andreas Geiger},
  title = {Connecting the Dots: Learning Representations for Active Monocular Depth Estimation},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2019}
}
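Illustrative Python sketch of the self-supervised photometric term (the geometric and edge-based terms are omitted; shapes are assumptions): the second view is warped into the camera view with the predicted disparity and compared to the observed intensities.
import torch
import torch.nn.functional as F

def photometric_loss(img_cam, img_proj, disparity):
    # img_cam, img_proj: (B, 1, H, W) intensity images; disparity: (B, 1, H, W) in pixels
    B, _, H, W = img_cam.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.float().expand(B, H, W) - disparity[:, 0]             # shift along the rectified baseline
    ys = ys.float().expand(B, H, W)
    grid = torch.stack([2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1], dim=-1)   # (B, H, W, 2) in [-1, 1]
    warped = F.grid_sample(img_proj, grid, align_corners=True)    # (B, 1, H, W) warped second view
    return (img_cam - warped).abs().mean()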
On the Integration of Optical Flow and Action Recognition (oral)
L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger and M. Black
German Conference on Pattern Recognition (GCPR), 2018
Abstract: Most of the top performing action recognition methods use optical flow as a black box input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: 1) optical flow is useful for action recognition because it is invariant to appearance, 2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance, 3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, 4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and 5) optical flow learned for the task of action recognition differs from traditional optical flow especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
Latex Bibtex Citation:
@inproceedings{Sevilla-Lara2018GCPR,
  author = {Laura Sevilla-Lara and Yiyi Liao and Fatma Güney and Varun Jampani and Andreas Geiger and Michael Black},
  title = {On the Integration of Optical Flow and Action Recognition},
  booktitle = {German Conference on Pattern Recognition (GCPR)},
  year = {2018}
}
Deep Marching Cubes: Learning Explicit Surface Representations
Y. Liao, S. Donne and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Abstract: Existing learning based solutions to 3D surface prediction cannot be trained end-to-end as they operate on intermediate representations (e.g., TSDF) from which 3D surface meshes must be extracted in a post-processing step (e.g., via the marching cubes algorithm). In this paper, we investigate the problem of end-to-end 3D surface prediction. We first demonstrate that the marching cubes algorithm is not differentiable and propose an alternative differentiable formulation which we insert as a final layer into a 3D convolutional neural network. We further propose a set of loss functions which allow for training our model with sparse point supervision. Our experiments demonstrate that the model allows for predicting sub-voxel accurate 3D shapes of arbitrary topology. Additionally, it learns to complete shapes and to separate an object's inside from its outside even in the presence of sparse and incomplete ground truth. We investigate the benefits of our approach on the task of inferring shapes from 3D point clouds. Our model is flexible and can be combined with a variety of shape encoder and shape inference techniques.
Latex Bibtex Citation:
@inproceedings{Liao2018CVPR,
  author = {Yiyi Liao and Simon Donne and Andreas Geiger},
  title = {Deep Marching Cubes: Learning Explicit Surface Representations},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2018}
}
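Illustrative Python sketch (not the paper's exact layer) of why explicit surface extraction can be made differentiable: if the vertex on a grid edge is placed by soft interpolation of the occupancy values at its endpoints, gradients of a point-based surface loss flow back into the occupancy predictions.
import torch

def edge_vertex(p0, p1, occ0, occ1, eps=1e-6):
    # p0, p1: (3,) edge endpoints; occ0, occ1: soft occupancies in (0, 1)
    t = (0.5 - occ0) / (occ1 - occ0 + eps)     # where the interpolated occupancy crosses 0.5
    t = t.clamp(0.0, 1.0)                      # keep the vertex on the edge; differentiable w.r.t. occupancies
    return p0 + t * (p1 - p0)

vertex = edge_vertex(torch.zeros(3), torch.tensor([1.0, 0.0, 0.0]),
                     torch.tensor(0.2), torch.tensor(0.8))
# Gradients from a point-based surface loss (e.g. Chamfer distance) propagate back into occ0/occ1.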

