Publications of Haiwen Huang

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models (oral)
H. Huang, A. Chen, V. Havrylov, A. Geiger and D. Zhang
International Conference on Computer Vision (ICCV), 2025

Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks.
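The coordinate-based cross-attention idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single attention head, the random projections standing in for learned weights, and the choice to attach coordinates to both queries and keys are all assumptions made for illustration. Queries are built from high-resolution pixels plus their coordinates; keys and values come from the low-resolution features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_coords(h, w):
    """Normalized (y, x) pixel-centre coordinates in [0, 1], shape (h*w, 2)."""
    ys = (np.arange(h) + 0.5) / h
    xs = (np.arange(w) + 0.5) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)

def cross_attention_upsample(image, lowres_feats, d_k=16, seed=0):
    """Hypothetical sketch: image (H, W, 3), lowres_feats (h, w, C) -> (H, W, C).

    Random projections stand in for learned weights (an assumption).
    """
    H, W, _ = image.shape
    h, w, C = lowres_feats.shape
    rng = np.random.default_rng(seed)

    # Queries: one per high-resolution pixel, from RGB values plus coordinates.
    q_in = np.concatenate([image.reshape(H * W, 3), pixel_coords(H, W)], axis=-1)
    # Keys: one per low-resolution token, from features plus coordinates.
    values = lowres_feats.reshape(h * w, C)
    k_in = np.concatenate([values, pixel_coords(h, w)], axis=-1)

    q = q_in @ rng.standard_normal((q_in.shape[1], d_k))
    k = k_in @ rng.standard_normal((k_in.shape[1], d_k))

    # Each high-res pixel attends over all low-res tokens.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # shape (H*W, h*w)
    return (attn @ values).reshape(H, W, C)

rng = np.random.default_rng(1)
up = cross_attention_upsample(rng.random((8, 8, 3)), rng.standard_normal((2, 2, 4)))
print(up.shape)  # (8, 8, 4)
```

In this sketch every output feature is a convex combination of the low-resolution features, so it can adapt to any input and feature resolution without changing the weights' shapes.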

LaTeX BibTeX citation:
@inproceedings{Huang2025ICCV,
author = {Haiwen Huang and Anpei Chen and Volodymyr Havrylov and Andreas Geiger and Dan Zhang},
title = {{LoftUp}: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2025}
}

Renovating Names in Open-Vocabulary Segmentation Benchmarks
H. Huang, S. Peng, D. Zhang and A. Geiger
Advances in Neural Information Processing Systems (NeurIPS), 2024

Abstract: Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, the precision of these names is often overlooked in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Our framework features a renaming model that enhances the quality of names for each visual segment. Through experiments, we demonstrate that our renovated names help train stronger open-vocabulary models with up to 15% relative improvement and significantly enhance training efficiency with improved data quality. We also show that our renovated names improve evaluation by better measuring misclassification and enabling fine-grained model analysis. We will provide our code and relabelings for several popular segmentation datasets (MS COCO, ADE20K, Cityscapes) to the research community.

LaTeX BibTeX citation:
@inproceedings{Huang2024NEURIPS,
author = {Haiwen Huang and Songyou Peng and Dan Zhang and Andreas Geiger},
title = {Renovating Names in Open-Vocabulary Segmentation Benchmarks},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024}
}

GOOD: Exploring geometric cues for detecting objects in an open world
H. Huang, A. Geiger and D. Zhang
International Conference on Learning Representations (ICLR), 2023

Abstract: We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.
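The last sentence of the abstract gives both an absolute gain (5.0% AR@100) and a relative gain (24%). As a rough check of how the two relate, the sketch below back-computes the scores they imply. It assumes both figures refer to the same baseline and ignores rounding, so the values are only approximate and may differ from those reported in the paper; `implied_scores` is a hypothetical helper.

```python
def implied_scores(abs_gain, rel_gain):
    """Return (baseline, improved) implied by an absolute gain in points
    and a relative gain given as a fraction of the baseline."""
    baseline = abs_gain / rel_gain      # since rel_gain = abs_gain / baseline
    return baseline, baseline + abs_gain

# Figures from the abstract; the outputs are implied values, not reported ones.
baseline, improved = implied_scores(abs_gain=5.0, rel_gain=0.24)
print(f"{baseline:.1f} {improved:.1f}")  # 20.8 25.8
```

Under these assumptions, the strongest prior method would sit near 20.8% AR@100 and GOOD near 25.8%.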

LaTeX BibTeX citation:
@inproceedings{Huang2023ICLR,
author = {Haiwen Huang and Andreas Geiger and Dan Zhang},
title = {{GOOD}: Exploring geometric cues for detecting objects in an open world},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2023}
}
