Andreas Geiger

Publications of Katja Schwarz

WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space
K. Schwarz, S. Kim, J. Gao, S. Fidler, A. Geiger and K. Kreis
International Conference on Learning Representations (ICLR), 2024
Abstract: Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images’ underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
Latex Bibtex Citation:
@inproceedings{Schwarz2024ICLR,
  author = {Katja Schwarz and Seung Wook Kim and Jun Gao and Sanja Fidler and Andreas Geiger and Karsten Kreis},
  title = {WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2024}
}
VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids
K. Schwarz, A. Sauer, M. Niemeyer, Y. Liao and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2022
Abstract: State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.
Latex Bibtex Citation:
@inproceedings{Schwarz2022NEURIPS,
  author = {Katja Schwarz and Axel Sauer and Michael Niemeyer and Yiyi Liao and Andreas Geiger},
  title = {VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2022}
}
ARAH: Animatable Volume Rendering of Articulated Human SDFs
S. Wang, K. Schwarz, A. Geiger and S. Tang
European Conference on Computer Vision (ECCV), 2022
Abstract: Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve a realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.
Latex Bibtex Citation:
@inproceedings{Wang2022ECCV,
  author = {Shaofei Wang and Katja Schwarz and Andreas Geiger and Siyu Tang},
  title = {ARAH: Animatable Volume Rendering of Articulated Human SDFs},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2022}
}
StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
A. Sauer, K. Schwarz and A. Geiger
International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2022
Abstract: Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of 1024x1024 at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object~classes. Code, models, and supplementary videos can be found at https://sites.google.com/view/stylegan-xl/.
Latex Bibtex Citation:
@inproceedings{Sauer2022SIGGRAPH,
  author = {Axel Sauer and Katja Schwarz and Andreas Geiger},
  title = {StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets},
  booktitle = {International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH)},
  year = {2022}
}
On the Frequency Bias of Generative Models
K. Schwarz, Y. Liao and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021
Abstract: The key objective of Generative Adversarial Networks (GANs) is to generate new data with the same statistics as the provided training data. However, multiple recent works show that state-of-the-art architectures yet struggle to achieve this goal. In particular, they report an elevated amount of high frequencies in the spectral statistics which makes it straightforward to distinguish real and generated images. Explanations for this phenomenon are controversial: While most works attribute the artifacts to the generator, other works point to the discriminator. We take a sober look at those explanations and provide insights on what makes proposed measures against high-frequency artifacts effective. To achieve this, we first independently assess the architectures of both the generator and discriminator and investigate if they exhibit a frequency bias that makes learning the distribution of high-frequency content particularly problematic. Based on these experiments, we make the following four observations: 1) Different upsampling operations bias the generator towards different spectral properties. 2) Checkerboard artifacts introduced by upsampling cannot explain the spectral discrepancies alone as the generator is able to compensate for these artifacts. 3) The discriminator does not struggle with detecting high frequencies per se but rather struggles with frequencies of low magnitude. 4) The downsampling operations in the discriminator can impair the quality of the training signal it provides. In light of these findings, we analyze proposed measures against high-frequency artifacts in state-of-the-art GAN training but find that none of the existing approaches can fully resolve spectral artifacts yet. Our results suggest that there is great potential in improving the discriminator and that this could be key to match the distribution of the training data more closely.
Latex Bibtex Citation:
@inproceedings{Schwarz2021NEURIPS,
  author = {Katja Schwarz and Yiyi Liao and Andreas Geiger},
  title = {On the Frequency Bias of Generative Models},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021}
}
GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis
K. Schwarz, Y. Liao, M. Niemeyer and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2020
Abstract: While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, eg, the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.
Latex Bibtex Citation:
@inproceedings{Schwarz2020NEURIPS,
  author = {Katja Schwarz and Yiyi Liao and Michael Niemeyer and Andreas Geiger},
  title = {GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2020}
}
Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis
Y. Liao, K. Schwarz, L. Mescheder and A. Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Abstract: In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent wrt. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.
Latex Bibtex Citation:
@inproceedings{Liao2020CVPR,
  author = {Yiyi Liao and Katja Schwarz and Lars Mescheder and Andreas Geiger},
  title = {Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020}
}


eXTReMe Tracker