Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Authors

Junha Lee1,2,*, Chunghyun Park1,2,*, Jaesung Choe1, Yu-Chiang Frank Wang1, Jan Kautz1, Minsu Cho2, Chris Choy1

1NVIDIA, 2POSTECH

* indicates equal contribution

Abstract

We tackle open-vocabulary 3D scene segmentation by introducing a novel data generation pipeline and training framework. Our work targets three essential aspects of an effective dataset: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware vision-language models (VLMs), we develop an automatic pipeline that produces high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of more than 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building on this data, we propose Mosaic3D, a 3D visual foundation model (3D-VFM) that combines a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks, including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.

Key Contributions

  • Mosaic3D-5.6M Dataset: The largest 3D mask-text paired dataset to date, covering over 30K indoor scenes and approximately 1M RGB-D frames, and yielding 5.6M region captions with 30M total text tokens
  • Precise Region Boundaries: Grounded-SAM and SEEM provide precise region boundaries, significantly improving over bounding-box-based approaches
  • Rich Contextual Descriptions: A region-aware VLM generates detailed contextual descriptions that capture both visual attributes and spatial relationships (the sketch after this list illustrates the overall data flow)
  • 3D Visual Foundation Model: State-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks

Bibtex

@article{lee2025mosaic3d,
    title={Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation},
    author={Junha Lee and Chunghyun Park and Jaesung Choe and Yu-Chiang Frank Wang and Jan Kautz and Minsu Cho and Chris Choy},
    journal={arXiv},
    year={2025}
}