RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models

1University of Amsterdam 2Bosch Center for Artificial Intelligence

RealDiff formulates point cloud completion as a conditional generation problem directly on real-world measurements in a self-supervised way.

Abstract

Point cloud completion aims to recover the complete 3D shape of an object from partial observations. While approaches relying on synthetic shape priors have achieved promising results in this domain, their applicability and generalizability to real-world data remain limited. To tackle this problem, we propose a self-supervised framework, namely RealDiff, that formulates point cloud completion as a conditional generation problem directly on real-world measurements.

To better deal with noisy observations without resorting to training on synthetic data, we leverage additional geometric cues. Specifically, RealDiff simulates a diffusion process at the missing object parts while conditioning the generation on the partial input to address the multimodal nature of the task. We further regularize the training by matching object silhouettes and depth maps, predicted by our method, with the externally estimated ones. Experimental results show that our method consistently outperforms state-of-the-art methods in real-world point cloud completion.
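To make the conditional formulation concrete, here is a minimal PyTorch-style sketch of one training step, assuming a voxelized occupancy grid and a 3D denoising network. The network signature, the linear noise schedule, and the masking details are our illustrative assumptions, not the released implementation.

```python
import torch

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)      # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(denoiser, x0, occupied_mask):
    """One denoising step on a voxel grid.

    x0:            (B, 1, D, H, W) pseudo ground-truth occupancy grid
    occupied_mask: (B, 1, D, H, W) 1 where the partial input is observed
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1, 1)

    noise = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

    # Condition on the known parts: occupied voxels stay clean, so the
    # diffusion process is simulated only at the missing parts.
    x_t = occupied_mask * x0 + (1.0 - occupied_mask) * x_t

    pred_noise = denoiser(x_t, t, occupied_mask)  # hypothetical signature
    # Penalize the prediction only in the unknown region.
    loss = (((pred_noise - noise) * (1.0 - occupied_mask)) ** 2).mean()
    return loss
```

The key point of the sketch is that noise is injected and penalized only in the unknown region, so the observed voxels act as the condition throughout training.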

Overview

Given a pair of noisy point clouds representing an object, our pipeline takes one of them as input and creates a pseudo ground-truth by combining the two. A diffusion process is simulated at the missing parts of the voxelized input (unoccupied voxels), while the generation is conditioned on the known parts (occupied voxels). To suppress noise in the reconstructions, the silhouettes and depth maps rendered from the predicted shapes are constrained to match auxiliary silhouettes (e.g., from ScanNet) and depth maps (e.g., from a pre-trained Omnidata model). At generation time, only the trained diffusion model is used to reconstruct a complete 3D shape from the input real-world point cloud.
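The silhouette and depth constraints can be sketched in the same spirit. Below is an illustrative differentiable projection of soft voxel occupancies along the camera z-axis; the actual pipeline renders with estimated camera parameters, so the orthographic projection, the function names, and the loss weighting here are assumptions.

```python
import torch
import torch.nn.functional as F

def render_ortho(occ_prob):
    """occ_prob: (B, 1, D, H, W) occupancy probabilities, z = depth axis.

    Returns a soft silhouette (B, 1, H, W) and expected depth (B, 1, H, W).
    """
    B, _, D, H, W = occ_prob.shape
    # Probability that a ray survives all slices strictly before slice z.
    trans = torch.cumprod(1.0 - occ_prob + 1e-6, dim=2)
    trans = torch.cat([torch.ones_like(trans[:, :, :1]), trans[:, :, :-1]], dim=2)
    hit = trans * occ_prob                       # first-hit probability per slice
    silhouette = hit.sum(dim=2)                  # P(ray hits the object)
    z = torch.arange(D, device=occ_prob.device, dtype=occ_prob.dtype)
    depth = (hit * z.view(1, 1, D, 1, 1)).sum(dim=2) / (silhouette + 1e-6)
    return silhouette, depth

def aux_losses(occ_prob, sil_target, depth_target):
    """Match rendered silhouettes/depths to externally estimated ones."""
    sil, depth = render_ortho(occ_prob)
    sil_loss = F.binary_cross_entropy(sil.clamp(1e-6, 1 - 1e-6), sil_target)
    # Supervise depth only where the auxiliary silhouette says "object".
    depth_loss = (sil_target * (depth - depth_target).abs()).mean()
    return sil_loss + depth_loss
```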

Videos (Coming soon!)

Experiments

Point cloud completion results on the ScanNet dataset at a resolution of 16,384 points. Experimental results show that the proposed method outperforms the state-of-the-art baselines.

Visual comparison of point cloud completion results on the ScanNet dataset. From left to right: partial shapes sampled from depth images, completions from the baselines, our results, and ground-truth CAD model alignments from Scan2CAD annotations. For multimodal methods, we show a single output shape corresponding to a fixed random seed. Our method produces reconstructions that are more complete and better preserve the initially observed structures.

Multimodal shape completion on the ScanNet dataset with 16,384 points.

Multimodal completion results on the ScanNet dataset. Shapes are ordered from left to right, top to bottom. Our method generates multiple valid outputs across different runs. As the degree of input incompleteness rises, the uncertainty about how to complete the shape geometry also increases, which allows for higher diversity (first, third, and fourth shapes). When a more complete input is provided (second shape), we observe only slight changes in the recovered geometry between runs.
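Generating multiple completions amounts to re-running the reverse diffusion with different random seeds while re-imposing the observed voxels at every step. The following self-contained sketch illustrates this with standard DDPM sampling; the denoiser signature and schedule mirror the hypothetical training sketch above and are not the authors' code.

```python
import math
import torch

T = 1000                                   # must match the training schedule
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample_completions(denoiser, x_partial, occupied_mask, num_samples=4):
    """Draw several completions of one partial input, one per random seed."""
    results = []
    for seed in range(num_samples):
        torch.manual_seed(seed)                    # a different seed per run
        x = torch.randn_like(x_partial)            # unknown region starts as noise
        for i in reversed(range(T)):
            t = torch.full((x.shape[0],), i, device=x.device, dtype=torch.long)
            a = 1.0 - float(betas[i])
            a_bar = float(alphas_cumprod[i])
            eps = denoiser(x, t, occupied_mask)    # predicted noise
            mean = (x - (1.0 - a) / math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a)
            noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
            x = mean + math.sqrt(float(betas[i])) * noise
            # Re-impose the observed (occupied) voxels after every step.
            x = occupied_mask * x_partial + (1.0 - occupied_mask) * x
        results.append(x)
    return results
```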

BibTeX

@article{ocal2024realdiff,
  title={RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models},
  author={{\"O}cal, Ba{\c{s}}ak Melis and Tatarchenko, Maxim and Karaoglu, Sezer and Gevers, Theo},
  journal={arXiv preprint arXiv:2409.10180},
  year={2024}
}