3 Harbin Institute of Technology.
4 Xidian University.
Surface prediction and completion have been widely studied in various applications. Recently, research in surface completion has evolved from small objects to complex large-scale scenes. As a result, researchers have begun to scale up the volume of data and to leverage a greater variety of data modalities, including rendered RGB images, descriptive text, and depth images, to enhance algorithm performance. However, existing datasets lack both scene-level models and the corresponding multi-modal information, so an efficient method for scaling such datasets and generating their multi-modal data is essential. To bridge this research gap, we propose MASSTAR: a multi-modal large-scale scene dataset with a versatile toolchain for surface prediction and completion. We develop a versatile and efficient toolchain for processing raw 3D data from real environments. It screens out a set of fine-grained scene models and generates the corresponding multi-modal data. Using this toolchain, we then generate an example dataset of more than a thousand scene-level models, augmented with partial real-world data. We compare MASSTAR with existing datasets, which validates its key advantage: the ability to efficiently extract high-quality models from complex scenarios to expand the dataset. In addition, we benchmark several representative surface completion algorithms on MASSTAR, revealing that existing algorithms can hardly handle scene-level completion.
An overview of the 3D scene segmentation pipeline. We first render a bird's-eye-view depth image and RGB image of each scene. Users can then employ SAM to segment the top-view images in either manual or automatic mode. Finally, the 3D mesh model is sliced using Blender, and CLIP is utilized to filter out non-architectural categories.
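The first step above, rendering a bird's-eye view, can be sketched as an orthographic top-down projection of the scene geometry into a depth image. Below is a minimal NumPy sketch assuming the scene is given as a point cloud; the function name, grid resolution, and z-buffer scheme are illustrative simplifications, not the toolchain's actual renderer:

```python
import numpy as np

def render_topdown_depth(points, resolution=64):
    """Project a 3D point cloud (N, 3) orthographically along -z into a
    top-down depth image, keeping the highest point per pixel."""
    xy = points[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    # Map x/y coordinates to integer pixel indices in [0, resolution - 1].
    scale = (resolution - 1) / np.maximum(hi - lo, 1e-9)
    px = ((xy - lo) * scale).astype(int)
    depth = np.full((resolution, resolution), -np.inf)
    for (u, v), z in zip(px, points[:, 2]):
        depth[v, u] = max(depth[v, u], z)  # z-buffer: keep the topmost surface
    depth[np.isinf(depth)] = 0.0  # pixels with no points become background
    return depth

# Example: a synthetic scene sampled uniformly in the unit cube.
scene = np.random.default_rng(0).random((1000, 3))
bev = render_topdown_depth(scene, resolution=32)
```

An RGB bird's-eye view follows the same projection, storing per-pixel color instead of height.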
An example of the image rendering part of the toolchain. We offer a random mode (left) and a trajectory mode (right).
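The two modes differ in how camera viewpoints are chosen: random viewpoints scattered around the scene versus poses sampled along a user-supplied path. A minimal NumPy sketch of that idea follows; the function names, the sphere-sampling scheme, and the linear-interpolation trajectory are illustrative assumptions, not the toolchain's exact camera model:

```python
import numpy as np

def random_viewpoints(n, radius, seed=None):
    """Random mode: sample n camera positions uniformly on a sphere
    of the given radius centered on the scene origin."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize to unit sphere
    return v * radius

def trajectory_viewpoints(waypoints, n):
    """Trajectory mode: linearly interpolate n camera positions
    along a user-supplied polyline of waypoints (M, 3)."""
    waypoints = np.asarray(waypoints, dtype=float)
    t = np.linspace(0.0, len(waypoints) - 1, n)
    i = np.minimum(t.astype(int), len(waypoints) - 2)  # segment index
    frac = (t - i)[:, None]                            # position within segment
    return waypoints[i] * (1 - frac) + waypoints[i + 1] * frac

# Example: ten random views at radius 2, and five views along a straight path.
cams = random_viewpoints(10, radius=2.0, seed=0)
path = trajectory_viewpoints([[0, 0, 0], [2, 0, 0]], n=5)
```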
An example of the descriptive text generation part of the toolchain. BLIP is employed to perform zero-shot image-to-text generation.
An example of the partial point cloud rendering part of the toolchain.
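Rendering a partial point cloud from a viewpoint amounts to visibility culling: only surface points not occluded from the camera survive. A minimal NumPy sketch of depth-based culling with a per-pixel z-buffer is shown below; the orthographic projection, resolution, and function name are illustrative simplifications, not the toolchain's actual renderer:

```python
import numpy as np

def partial_point_cloud(points, cam, resolution=64):
    """Keep only the points of a cloud (N, 3) visible from camera position
    `cam`: for each projected pixel, the point closest to the camera survives."""
    rel = points - cam
    dist = np.linalg.norm(rel, axis=1)
    # Project onto the image plane (orthographic, for brevity).
    xy = rel[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    px = ((xy - lo) / np.maximum(hi - lo, 1e-9) * (resolution - 1)).astype(int)
    best = {}  # pixel -> index of the nearest point seen so far
    for idx, ((u, v), d) in enumerate(zip(map(tuple, px), dist)):
        if (u, v) not in best or d < dist[best[(u, v)]]:
            best[(u, v)] = idx
    return points[np.array(sorted(best.values()))]

# Example: the farther of two collinear points is occluded and dropped.
pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
visible = partial_point_cloud(pts, cam=np.array([0.0, 0.0, 0.0]))
```

A perspective camera model and a finer raster change the details but not the z-buffer principle.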
Some examples of independent data processing on existing datasets using the proposed toolchain.
@misc{zheng2024masstar,
title={MASSTAR: A Multi-Modal and Large-Scale Scene Dataset with a Versatile Toolchain for Surface Prediction and Completion},
author={Guiyong Zheng and Jinqi Jiang and Chen Feng and Shaojie Shen and Boyu Zhou},
year={2024},
eprint={2403.11681},
archivePrefix={arXiv},
primaryClass={cs.RO}
}