
Publications of Axel Sauer

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis (oral)
A. Sauer, T. Karras, S. Laine, A. Geiger and T. Aila
International Conference on Machine Learning (ICML), 2023
Abstract: Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.
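The abstract's core speed argument is that a GAN needs one network evaluation per sample while a diffusion sampler needs many. A toy sketch of that cost difference (the `generator` and `denoise_step` functions are trivial stand-ins, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(0)
evals = {"gan": 0, "diffusion": 0}  # count network evaluations per sampler

def generator(z):
    evals["gan"] += 1
    return np.tanh(z)            # stand-in for a GAN generator: one pass

def denoise_step(x):
    evals["diffusion"] += 1
    return 0.9 * x               # stand-in for one diffusion denoising step

def diffusion_sample(z, steps=50):
    # Iterative evaluation: the sampler calls the network `steps` times.
    x = z
    for _ in range(steps):
        x = denoise_step(x)
    return x

z = rng.normal(size=(3, 64, 64))
img_gan = generator(z)           # 1 evaluation
img_diff = diffusion_sample(z)   # 50 evaluations
```

With realistic networks of comparable size, the wall-clock gap scales roughly with this evaluation count, which is the competitiveness the paper targets.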
LaTeX BibTeX Citation:
@inproceedings{Sauer2023ICML,
  author = {Axel Sauer and Tero Karras and Samuli Laine and Andreas Geiger and Timo Aila},
  title = {StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2023}
}
VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids
K. Schwarz, A. Sauer, M. Niemeyer, Y. Liao and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2022
Abstract: State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.
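To illustrate why a sparse voxel grid renders quickly, here is a minimal sketch: only occupied cells are stored (free space is pruned), rays skip empty space at no cost, and a 3D foreground is composited over a 2D background. The orthographic ray march and all values are illustrative, not the paper's renderer:

```python
import numpy as np

res = 8
# Sparse voxel grid: only occupied cells are stored; free space is pruned.
grid = {}  # (i, j, k) -> (density, rgb)
for i in range(2, 6):
    for j in range(2, 6):
        for k in range(3, 5):
            grid[(i, j, k)] = (5.0, np.array([1.0, 0.5, 0.2]))

background = np.full((res, res, 3), 0.1)  # background modeled in 2D

def render_ortho(grid, background):
    """Alpha-composite occupied voxels front-to-back along +z rays."""
    img = np.zeros((res, res, 3))
    for i in range(res):
        for j in range(res):
            color, trans = np.zeros(3), 1.0
            for k in range(res):          # march along the ray
                if (i, j, k) in grid:     # empty cells cost nothing
                    density, rgb = grid[(i, j, k)]
                    alpha = 1.0 - np.exp(-density)
                    color += trans * alpha * rgb
                    trans *= 1.0 - alpha
            # remaining transmittance falls through to the 2D background
            img[i, j] = color + trans * background[i, j]
    return img

img = render_ortho(grid, background)
```

Because the whole grid is produced by a single generator forward pass, rendering a new viewpoint only repeats the cheap compositing step, not the network evaluation.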
LaTeX BibTeX Citation:
@inproceedings{Schwarz2022NEURIPS,
  author = {Katja Schwarz and Axel Sauer and Michael Niemeyer and Yiyi Liao and Andreas Geiger},
  title = {VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2022}
}
StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
A. Sauer, K. Schwarz and A. Geiger
International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2022
Abstract: Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of 1024x1024 at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes. Code, models, and supplementary videos are available online.
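The progressive growing strategy mentioned above trains at a low resolution first and doubles it stage by stage up to the final 1024x1024. A sketch of such a resolution schedule (the exact starting resolution and stage count here are illustrative):

```python
def progressive_schedule(start=16, target=1024):
    """Resolutions visited when progressively growing the generator."""
    res, stages = start, [start]
    while res < target:
        res *= 2              # each stage doubles the output resolution
        stages.append(res)
    return stages

stages = progressive_schedule()
```

Each new stage adds layers on top of the already-trained lower-resolution model, which stabilizes training on large, diverse data.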
LaTeX BibTeX Citation:
@inproceedings{Sauer2022SIGGRAPH,
  author = {Axel Sauer and Katja Schwarz and Andreas Geiger},
  title = {StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets},
  booktitle = {International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH)},
  year = {2022}
}
Projected GANs Converge Faster
A. Sauer, K. Chitta, J. Müller and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2021
Abstract: Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. They need careful regularization, vast amounts of compute, and expensive hyper-parameter sweeps. We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space. Motivated by the finding that the discriminator cannot fully exploit features from deeper layers of the pretrained model, we propose a more effective strategy that mixes features across channels and resolutions. Our Projected GAN improves image quality, sample efficiency, and convergence speed. It is further compatible with resolutions of up to one megapixel and advances the state-of-the-art Fréchet Inception Distance (FID) on twenty-two benchmark datasets. Importantly, Projected GANs match the previously lowest FIDs up to 40 times faster, cutting the wall-clock time from 5 days to less than 3 hours given the same computational resources.
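The key ideas of the abstract, projecting samples into a fixed pretrained feature space and mixing features across channels so the discriminator cannot ignore any of them, can be sketched with random linear maps standing in for the frozen pretrained network (everything here is a toy stand-in, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed "pretrained" feature extractor stand-in: weights are frozen,
# never updated during GAN training.
W_feat = rng.normal(size=(64, 16)) / 8.0

def project(x):
    # Map samples into the fixed feature space.
    return np.maximum(x @ W_feat, 0.0)

# Cross-channel mixing: a fixed random linear map (a 1x1-conv analogue)
# that blends channels so no feature subset can be ignored.
W_mix = rng.normal(size=(16, 16)) / 4.0

def mixed_features(x):
    return project(x) @ W_mix

real = rng.normal(size=(8, 64))
fake = rng.normal(size=(8, 64))
f_real, f_fake = mixed_features(real), mixed_features(fake)
# A discriminator would now be trained on f_real vs. f_fake, not on pixels.
```

Training the discriminator on these projected features, rather than raw pixels, is what yields the reported gains in sample efficiency and convergence speed.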
LaTeX BibTeX Citation:
@inproceedings{Sauer2021NEURIPS,
  author = {Axel Sauer and Kashyap Chitta and Jens Müller and Andreas Geiger},
  title = {Projected GANs Converge Faster},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021}
}
Counterfactual Generative Networks
A. Sauer and A. Geiger
International Conference on Learning Representations (ICLR), 2021
Abstract: Neural networks are prone to learning shortcuts - they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases.
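The decomposition into independent mechanisms can be sketched as alpha compositing: one mechanism produces the shape mask, one the texture, one the background, and swapping any single one yields a counterfactual image. The constant masks and colors below are illustrative placeholders for the learned generators:

```python
import numpy as np

H, W = 8, 8
mask = np.zeros((H, W, 1))
mask[2:6, 2:6] = 1.0                                # shape mechanism
texture = np.full((H, W, 3), [0.8, 0.2, 0.2])       # texture mechanism
background = np.full((H, W, 3), [0.1, 0.6, 0.1])    # background mechanism

def compose(mask, texture, background):
    # Independent mechanisms combine via alpha compositing; intervening on
    # any single input produces a counterfactual image.
    return mask * texture + (1.0 - mask) * background

img = compose(mask, texture, background)
# Counterfactual: same shape and background, different texture.
counterfactual = compose(mask, np.full((H, W, 3), [0.2, 0.2, 0.8]), background)
```

A classifier trained on such counterfactuals sees the same shape under varying textures and backgrounds, which discourages the shortcut correlations described above.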
LaTeX BibTeX Citation:
@inproceedings{Sauer2021ICLR,
  author = {Axel Sauer and Andreas Geiger},
  title = {Counterfactual Generative Networks},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2021}
}
Conditional Affordance Learning for Driving in Urban Environments (oral)
A. Sauer, N. Savinov and A. Geiger
Conference on Robot Learning (CoRL), 2018
Abstract: Most existing approaches to autonomous driving fall into one of two categories: modular pipelines, which build an extensive model of the environment, and imitation learning approaches, which map images directly to control outputs. A recently proposed third paradigm, direct perception, aims to combine the advantages of both by using a neural network to learn appropriate low-dimensional intermediate representations. However, existing direct perception approaches are restricted to simple highway situations, lacking the ability to navigate intersections, stop at traffic lights or respect speed limits. In this work, we propose a direct perception approach which maps video input to intermediate representations suitable for autonomous navigation in complex urban environments given high-level directional inputs. Compared to state-of-the-art reinforcement and conditional imitation learning approaches, we achieve an improvement of up to 68% in goal-directed navigation on the challenging CARLA simulation benchmark. In addition, our approach is the first to handle traffic lights, speed signs and smooth car-following, resulting in a significant reduction of traffic accidents.
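In direct perception, a network predicts low-dimensional affordances from video, and a simple hand-designed controller maps them to control commands. A toy longitudinal controller over such affordances (the affordance names and thresholds are illustrative, not the paper's exact set):

```python
def controller(affordances, target_speed=30.0):
    """Map low-dimensional affordances to a longitudinal control command.

    `affordances` would come from the perception network; here it is a
    plain dict with illustrative keys.
    """
    speed_limit = min(target_speed, affordances["speed_limit"])
    # Stop for red lights and for vehicles that are too close.
    if affordances["red_light"] or affordances["distance_to_vehicle"] < 5.0:
        return {"throttle": 0.0, "brake": 1.0}
    # Accelerate while below the effective speed limit, then coast.
    if affordances["speed"] < speed_limit:
        return {"throttle": 0.5, "brake": 0.0}
    return {"throttle": 0.0, "brake": 0.0}
```

Keeping the controller interpretable like this, while learning only the perception part, is what lets the approach handle traffic lights, speed signs, and car-following explicitly.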
LaTeX BibTeX Citation:
@inproceedings{Sauer2018CORL,
  author = {Axel Sauer and Nikolay Savinov and Andreas Geiger},
  title = {Conditional Affordance Learning for Driving in Urban Environments},
  booktitle = {Conference on Robot Learning (CoRL)},
  year = {2018}
}
