IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

1 Fraunhofer Institute IVI
2 Technical University of Munich

Neural Information Processing Systems (NeurIPS) 2025

Visual Panoptic Scene Completion: Using camera images, infer the complete 3D structure of a scene as a voxel grid, including both visible and occluded regions. Every voxel carries (1) binary occupancy, (2) a semantic label, and (3) an instance ID to group countable objects.
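
To make the output format concrete, here is a minimal sketch of such a panoptic voxel grid in Python/NumPy. The grid resolution, class index, and array names are illustrative assumptions, not taken from the paper.

import numpy as np

# Assumed SemanticKITTI-style grid resolution; purely illustrative.
X, Y, Z = 256, 256, 32

occupancy = np.zeros((X, Y, Z), dtype=bool)      # (1) binary occupancy
semantics = np.zeros((X, Y, Z), dtype=np.uint8)  # (2) semantic label per voxel
instances = np.zeros((X, Y, Z), dtype=np.int32)  # (3) instance ID; 0 = none

# Example: one occupied voxel of a hypothetical "car" class, instance 7.
occupancy[120, 130, 5] = True
semantics[120, 130, 5] = 1   # hypothetical class index
instances[120, 130, 5] = 7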

Abstract

Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first method that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and achieves a runtime reduction exceeding 14$\times$. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.

Challenge & Solution

IPFormer (ours) vs. Symphonies

Previous methods (1) only infer occupancy and semantics in an end-to-end fashion (Semantic Scene Completion) and require subsequent, time-consuming Euclidean clustering to retrieve individual instances, and (2) reconstruct objects using a fixed set of learned queries that are updated with image context during training but remain static at test time, and thus fail to dynamically adapt to the observed scene. To address these challenges, our method (1) infers occupancy, semantics, and instances in an end-to-end fashion, and (2) initializes object queries from image context, referred to as instance proposals, which dynamically adapt to the observed scene at train and test time.

Context-Adaptivity of Instance Proposals

Instance-specific saliency. Through gradient-based attribution, we derive saliency maps that highlight, in green, the image regions from which an individual instance primarily retrieves context. Our instance proposals effectively adapt to scene characteristics by guiding feature aggregation, substantially improving identification, classification, and completion. In contrast, non-adaptive instance queries sample context in an undirected manner, causing misclassification and geometric ambiguity.
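
As a rough illustration of how such saliency maps can be computed, the following is a minimal gradient-based attribution sketch in PyTorch. The model interface (an image mapped to per-instance class logits) and tensor shapes are assumptions for illustration; the paper's exact attribution procedure may differ.

import torch

def instance_saliency(model, image, instance_idx):
    """Saliency = |d score / d pixel|, max-reduced over color channels.

    image: (C, H, W) tensor; model(image) is assumed to return
    (num_instances, num_classes) logits.
    """
    image = image.detach().clone().requires_grad_(True)
    logits = model(image)
    score = logits[instance_idx].max()    # top class score of this instance
    score.backward()
    return image.grad.abs().amax(dim=0)   # (H, W) saliency map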

IPFormer Architecture

IPFormer refines image features and a depth map to produce 3D context features, which are sampled based on visibility to generate context-adaptive instance and voxel proposals. In a two-stage training strategy, voxel proposals first handle Semantic Scene Completion, effectively guiding the latent space toward detailed geometry and semantics. The second stage attends instance proposals over the pretrained voxel features to register individual instances. This design aligns occupancy, semantics, and instances, enabling robust Panoptic Scene Completion.
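
The following is a high-level sketch of this two-stage design in PyTorch. Module choices, dimensions, and the instance-voxel affinity head are simplifying assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class IPFormerSketch(nn.Module):
    def __init__(self, d=128, n_classes=20):
        super().__init__()
        self.voxel_decoder = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.instance_decoder = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.sem_head = nn.Linear(d, n_classes)   # stage 1: per-voxel semantics
        self.cls_head = nn.Linear(d, n_classes)   # stage 2: per-instance class

    def forward(self, context_feats, voxel_proposals, instance_proposals):
        # Stage 1: voxel proposals attend over 3D context features,
        # yielding geometry and semantics (Semantic Scene Completion).
        voxel_feats = self.voxel_decoder(voxel_proposals, context_feats)
        ssc_logits = self.sem_head(voxel_feats)
        # Stage 2: context-adaptive instance proposals attend over the
        # (pretrained) voxel features to register individual instances.
        inst_feats = self.instance_decoder(instance_proposals, voxel_feats)
        inst_cls = self.cls_head(inst_feats)
        # Instance-voxel affinities give per-voxel instance assignments.
        masks = torch.einsum("bqd,bvd->bqv", inst_feats, voxel_feats)
        return ssc_logits, inst_cls, masks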

Quantitative Results

In-domain performance on the SemanticKITTI val. set. Best and second-best results are bold and underlined, respectively. Due to the absence of established baselines for vision-based PSC, we run state-of-the-art SSC methods and apply DBSCAN to their predictions to retrieve instances (see the sketch below). In summary, IPFormer exceeds all baselines in the overall panoptic metrics PQ and PQ$^{\dag}$, and achieves best or second-best results on individual metrics. Additionally, our method directly predicts a full panoptic scene, resulting in a significantly superior runtime of 0.33 seconds, a reduction of over 14$\times$.

In-Domain Quantitative Results
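
For reference, here is a minimal sketch of how such an SSC+DBSCAN baseline can be assembled: occupied voxels of each "thing" class are clustered into instances. The eps/min_samples values and the thing-class list are assumed hyperparameters, not the exact baseline configuration.

import numpy as np
from sklearn.cluster import DBSCAN

def ssc_to_instances(semantics, thing_classes, eps=1.5, min_samples=5):
    """semantics: (X, Y, Z) array of per-voxel class IDs (0 = empty)."""
    instance_ids = np.zeros(semantics.shape, dtype=np.int32)
    next_id = 1
    for cls in thing_classes:
        coords = np.argwhere(semantics == cls)  # voxel coords of this class
        if len(coords) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
        for k in np.unique(labels):
            if k == -1:          # DBSCAN noise: no instance assigned
                continue
            instance_ids[tuple(coords[labels == k].T)] = next_id
            next_id += 1
    return instance_ids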

Out-of-domain zero-shot generalization of IPFormer and the closest baseline, CGFormer+DBSCAN: both are trained on SemanticKITTI and evaluated on the distinct SSCBench-KITTI360 test set. IPFormer demonstrates superior absolute and relative generalization performance across PSC and SSC metrics.

Out-of-Domain Quantitative Results

Qualitative Results

Qualitative results on the SemanticKITTI val. set (zoom in for the best view). Each top row shows purely semantic information, following the SSC color map. Each bottom row shows individual instances, with randomly assigned colors to facilitate differentiation. Note that we show only instances of the Thing category for clarity. IPFormer surpasses existing approaches by excelling at identifying individual instances, inferring their semantics, and reconstructing geometry with exceptional fidelity. Even for extremely low-frequency categories such as person (0.07%) under adverse lighting conditions, and in the presence of trace artifacts from dynamic objects in the ground-truth data, our method proves visually superior. These advancements stem from IPFormer's instance proposals, which dynamically adapt to scene characteristics, thus preserving high precision in instance identification, semantic segmentation, and geometric completion.

IPFormer Full Explanation with Audio

Poster

BibTeX


@inproceedings{gross2025ipformer,
  title     = {{IPF}ormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals},
  author    = {Markus Gross and Aya Fahmy and Danit Niwattananan and Dominik Muhle and Rui Song and Daniel Cremers and Henri Meeß},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}