Publications of Songyou Peng

DepthSplat: Connecting Gaussian Splatting and Depth
H. Xu, S. Peng, F. Wang, H. Blum, D. Barath, A. Geiger and M. Pollefeys
Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Abstract: Gaussian splatting and single-view depth estimation are typically studied in isolation. In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. More specifically, we first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features, leading to high-quality feed-forward 3D Gaussian splatting reconstructions. We also show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models from large-scale multi-view posed datasets. We validate the synergy between Gaussian splatting and depth estimation through extensive ablation and cross-task transfer experiments. Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in terms of both depth estimation and novel view synthesis, demonstrating the mutual benefits of connecting both tasks. In addition, DepthSplat enables feed-forward reconstruction from 12 input views (512×960 resolution) in 0.6 seconds.

Latex Bibtex Citation:
@inproceedings{Xu2025CVPR,
author = {Haofei Xu and Songyou Peng and Fangjinhua Wang and Hermann Blum and Daniel Barath and Andreas Geiger and Marc Pollefeys},
title = {{DepthSplat}: Connecting {Gaussian} Splatting and Depth},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025}
}

Paper

Project Page

Renovating Names in Open-Vocabulary Segmentation Benchmarks
H. Huang, S. Peng, D. Zhang and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2024

Abstract: Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, the precision of these names is often overlooked in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Our framework features a renaming model that enhances the quality of names for each visual segment. Through experiments, we demonstrate that our renovated names help train stronger open-vocabulary models with up to 15% relative improvement and significantly enhance training efficiency with improved data quality. We also show that our renovated names improve evaluation by better measuring misclassification and enabling fine-grained model analysis. We will provide our code and relabelings for several popular segmentation datasets (MS COCO, ADE20K, Cityscapes) to the research community.

Latex Bibtex Citation:
@inproceedings{Huang2024NEURIPS,
author = {Haiwen Huang and Songyou Peng and Dan Zhang and Andreas Geiger},
title = {Renovating Names in Open-Vocabulary Segmentation Benchmarks},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024}
}

Paper

Arxiv

Project Page

NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM (oral, best paper runner-up award)
Z. Zhu, S. Peng, V. Larsson, Z. Cui, M. Oswald, A. Geiger and M. Pollefeys
International Conference on 3D Vision (3DV), 2024

Abstract: Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM. However, existing works either rely on RGB-D sensors or require a separate monocular SLAM approach for camera tracking, and fail to produce high-fidelity 3D dense reconstructions. To address these shortcomings, we present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis. To facilitate the optimization process for mapping, we integrate additional supervision signals including easy-to-obtain monocular geometric cues and optical flow, and also introduce a simple warping loss to further enforce geometric consistency. Moreover, to further boost performance in complex large-scale scenes, we also propose a local adaptive transformation from signed distance functions (SDFs) to density in the volume rendering equation. On multiple challenging indoor and outdoor datasets, NICER-SLAM demonstrates strong performance in dense mapping, novel view synthesis, and tracking, even competitive with recent RGB-D SLAM systems. Project page: https://nicer-slam.github.io/.

Latex Bibtex Citation:
@inproceedings{Zhu2024THREEDV,
author = {Zihan Zhu and Songyou Peng and Viktor Larsson and Zhaopeng Cui and Martin R. Oswald and Andreas Geiger and Marc Pollefeys},
title = {{NICER-SLAM}: Neural Implicit Scene Encoding for {RGB} {SLAM}},
booktitle = {International Conference on 3D Vision (3DV)},
year = {2024}
}

Paper

Video

Project Page

MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction
Z. Yu, S. Peng, M. Niemeyer, T. Sattler and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2022

Abstract: In recent years, neural implicit surface reconstruction methods have become popular for multi-view 3D reconstruction. In contrast to traditional multi-view stereo methods, these approaches tend to produce smoother and more complete reconstructions due to the inductive smoothness bias of neural networks. State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views. Yet, their performance drops significantly for larger and more complex scenes and scenes captured from sparse viewpoints. This is caused primarily by the inherent ambiguity in the RGB reconstruction loss that does not provide enough constraints, in particular in less-observed and textureless areas. Motivated by recent advances in the area of monocular geometry prediction, we systematically explore the utility these cues provide for improving neural implicit surface reconstruction. We demonstrate that depth and normal cues, predicted by general-purpose monocular estimators, significantly improve reconstruction quality and optimization time. Further, we analyse and investigate multiple design choices for representing neural implicit surfaces, ranging from monolithic MLP models over single-grid to multi-resolution grid representations. We observe that geometric monocular priors improve performance both for small-scale single-object as well as large-scale multi-object scenes, independent of the choice of representation.

Latex Bibtex Citation:
@inproceedings{Yu2022NEURIPS,
author = {Zehao Yu and Songyou Peng and Michael Niemeyer and Torsten Sattler and Andreas Geiger},
title = {{MonoSDF}: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2022}
}

Paper

Arxiv

Project Page

Shape As Points: A Differentiable Poisson Solver (oral)
S. Peng, C. Jiang, Y. Liao, M. Niemeyer, M. Pollefeys and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021

Abstract: In recent years, neural implicit representations gained popularity in 3D reconstruction due to their expressiveness and flexibility. However, the implicit nature of neural implicit representations results in slow inference times and requires careful initialization. In this paper, we revisit the classic yet ubiquitous point cloud representation and introduce a differentiable point-to-mesh layer using a differentiable formulation of Poisson Surface Reconstruction (PSR) which allows for a GPU-accelerated fast solution of the indicator function given an oriented point cloud. The differentiable PSR layer allows us to efficiently and differentiably bridge the explicit 3D point representation with the 3D mesh via the implicit indicator field, enabling end-to-end optimization of surface reconstruction metrics such as Chamfer distance. This duality between points and meshes hence allows us to represent shapes as oriented point clouds, which are explicit, lightweight and expressive. Compared to neural implicit representations, our Shape-As-Points (SAP) model is more interpretable, lightweight, and accelerates inference time by one order of magnitude. Compared to other explicit representations such as points, patches, and meshes, SAP produces topology-agnostic, watertight manifold surfaces. We demonstrate the effectiveness of SAP on the task of surface reconstruction from unoriented point clouds and learning-based reconstruction.

Latex Bibtex Citation:
@inproceedings{Peng2021NEURIPS,
author = {Songyou Peng and Chiyu Max Jiang and Yiyi Liao and Michael Niemeyer and Marc Pollefeys and Andreas Geiger},
title = {Shape As Points: A Differentiable {Poisson} Solver},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2021}
}

Paper

Supplementary Material

Poster

Video

Project Page

KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs
C. Reiser, S. Peng, Y. Liao and A. Geiger
International Conference on Computer Vision (ICCV), 2021

Abstract: NeRF synthesizes novel views of a scene with unprecedented quality by fitting a neural radiance field to RGB images. However, NeRF requires querying a deep Multi-Layer Perceptron (MLP) millions of times, leading to slow rendering times, even on modern GPUs. In this paper, we demonstrate that significant speed-ups are possible by utilizing thousands of tiny MLPs instead of one single large MLP. In our setting, each individual MLP only needs to represent parts of the scene, thus smaller and faster-to-evaluate MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, rendering is accelerated by two orders of magnitude compared to the original NeRF model without incurring high storage costs. Further, using teacher-student distillation for training, we show that this speed-up can be achieved without sacrificing visual quality.

Latex Bibtex Citation:
@inproceedings{Reiser2021ICCV,
author = {Christian Reiser and Songyou Peng and Yiyi Liao and Andreas Geiger},
title = {{KiloNeRF}: Speeding up Neural Radiance Fields with Thousands of Tiny {MLPs}},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2021}
}

Paper

Supplementary Material

UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction (oral)
M. Oechsle, S. Peng and A. Geiger
International Conference on Computer Vision (ICCV), 2021

Abstract: Neural implicit 3D representations have emerged as a powerful paradigm for reconstructing surfaces from multi-view images and synthesizing novel views. Unfortunately, existing methods such as DVR or IDR require accurate per-pixel object masks as supervision. At the same time, neural radiance fields have revolutionized novel view synthesis. However, NeRF's estimated volume density does not admit accurate surface reconstruction. Our key insight is that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering using the same model. This unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks. We compare our method on the DTU, BlendedMVS, and a synthetic indoor dataset. Our experiments demonstrate that we outperform NeRF in terms of reconstruction quality while performing on par with IDR without requiring masks.

Latex Bibtex Citation:
@inproceedings{Oechsle2021ICCV,
author = {Michael Oechsle and Songyou Peng and Andreas Geiger},
title = {{UNISURF}: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2021}
}

Paper

Supplementary Material

Video 1

Video 2

Project Page

Convolutional Occupancy Networks (spotlight)
S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys and A. Geiger
European Conference on Computer Vision (ECCV), 2020

Abstract: Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables the fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.

Latex Bibtex Citation:
@inproceedings{Peng2020ECCV,
author = {Songyou Peng and Michael Niemeyer and Lars Mescheder and Marc Pollefeys and Andreas Geiger},
title = {Convolutional Occupancy Networks},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2020}
}

Paper

Supplementary Material