ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

* Equal Contribution
1S-Lab, Nanyang Technological University, 2Shanghai Artificial Intelligence Laboratory, 3The Chinese University of Hong Kong, 4The Chinese University of Hong Kong, Shenzhen

Abstract

Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models that learn to infer the 3D model of an object without optimization. Though promising results have been achieved in single-object generation, these methods often struggle to model complex 3D assets that inherently contain multiple objects. In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. 1) We first perform an in-depth analysis of this "multi-object gap" from both model and data perspectives. 2) Next, with reconstructed 3D models of different objects, we adjust their sizes, rotation angles, and locations to create a 3D asset that matches the given image. 3) To automate this process, we apply spatially-aware score distillation sampling (SSDS) from pretrained diffusion models to guide the positioning of objects. Compared with standard score distillation sampling, our framework emphasizes the spatial alignment of objects and thus achieves more accurate results. Extensive experiments validate that ComboVerse achieves clear improvements over existing methods in generating compositional 3D assets.
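To make the spatially-aware guidance concrete, below is a minimal sketch of the SSDS objective described above. It assumes a pretrained text-to-image diffusion model whose cross-attention maps on chosen prompt tokens can be rescaled; the diffusion wrapper and its methods (add_noise, scale_attention, predict_noise, sds_weight) and the boost constant are hypothetical placeholders, not the paper's implementation.

import torch

def ssds_loss(image, diffusion, prompt_tokens, spatial_token_ids, boost=25.0):
    # Surrogate loss whose gradient w.r.t. the image equals w(t) * (eps_pred - eps),
    # as in standard score distillation sampling (SDS).
    t = torch.randint(20, 980, (1,))              # random diffusion timestep
    noise = torch.randn_like(image)
    noisy = diffusion.add_noise(image, noise, t)
    # Amplify cross-attention on spatial tokens ("on", "above", ...) so the
    # predicted score emphasizes object placement over per-object appearance.
    with diffusion.scale_attention(spatial_token_ids, boost):
        eps_pred = diffusion.predict_noise(noisy, t, prompt_tokens)
    w = diffusion.sds_weight(t)
    grad = (w * (eps_pred - noise)).detach()      # no backprop through the U-Net
    return (grad * image).sum()                   # d(loss)/d(image) == grad

Compared with standard SDS, the only change is the attention rescaling on spatial tokens, which is what steers the gradient toward correcting object positions rather than reshaping individual objects.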

Video

Method Overview

Overview of ComboVerse. Given an input image that contains multiple objects, our method generates high-quality 3D assets through a two-stage process. In the single-object reconstruction stage, we decompose each object in the image with object inpainting and perform single-image reconstruction to create individual 3D models. In the multi-object combination stage, we keep the geometry and texture of each object fixed while optimizing their scale, rotation, and translation parameters. A sketch of this combination stage follows.
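The sketch below illustrates the multi-object combination stage, assuming each object has already been reconstructed as a fixed 3D model. Only the per-object scale, rotation, and translation receive gradients; render_scene is a hypothetical differentiable compose-and-render function, and ssds_loss_fn stands in for the SSDS objective sketched earlier.

import torch

def combine_objects(objects, render_scene, ssds_loss_fn, steps=500, lr=1e-2):
    n = len(objects)
    scale = torch.ones(n, requires_grad=True)        # per-object isotropic scale
    rot = torch.zeros(n, 3, requires_grad=True)      # axis-angle rotation
    trans = torch.zeros(n, 3, requires_grad=True)    # translation
    opt = torch.optim.Adam([scale, rot, trans], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Geometry and texture stay frozen; only placement parameters are updated.
        image = render_scene(objects, scale, rot, trans)
        loss = ssds_loss_fn(image)
        loss.backward()
        opt.step()
    return scale.detach(), rot.detach(), trans.detach()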

Compositional Generation

Scene Generation

Related Links

Many excellent works inspired or relate to ours; ComboVerse builds on the following outstanding projects.

Wonder3D: Xiaoxiao Long*, Yuan-Chen Guo*, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang. Wonder3D: Single Image to 3D using Cross-Domain Diffusion. CVPR 2024.

LRM: Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. ICLR 2024.

OpenLRM: Zexin He, Tengfei Wang. OpenLRM: Open-Source Large Reconstruction Models.

SyncDreamer: Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, Wenping Wang. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. ICLR 2024.

TripoSR: Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv 2024.

BibTeX

@article{chen2024comboverse,
  title={ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance},
  author={Chen, Yongwei and Wang, Tengfei and Wu, Tong and Pan, Xingang and Jia, Kui and Liu, Ziwei},
  journal={arXiv preprint arXiv:2403.12409},
  year={2024}
}