OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

Gross, Markus; Matha, Sai B.; Fahmy, Aya; Song, Rui; Cremers, Daniel; Meeß, Henri

Markus Gross¹,²,³,📧, Sai B. Matha¹, Aya Fahmy¹, Rui Song⁴, Daniel Cremers²,³, Henri Meeß¹

🌟 CVPR 2026 Oral 🌟

¹Fraunhofer Institute IVI

Autonomous Aerial Systems

²Technical University of Munich

Computer Vision Group

³Munich Center for Machine Learning

Research Group Daniel Cremers

⁴University of California, Los Angeles

Mobility Lab

TL;DR

Outdoor Semantic Scene Completion (SSC) datasets are restricted to ground-level views and rely on LiDAR sensors for data generation, which are costly, heavy, and inherently sparse from aerial viewpoints.
We address this with a LiDAR-free, camera-based framework: by labeling <10% of images in 2D and lifting them into a dense 3D representation, we enable scalable and cost-efficient dataset generation.
Built on this, we introduce OccuFly: a large-scale dataset for SSC and metric monocular depth estimation, with 20,000+ samples across multiple scenes, altitudes, environments, and seasons.
Benchmarking reveals a clear gap: state-of-the-art SSC models struggle in aerial scenarios, and even powerful 3D vision foundation models deviate by more than 500% in metric scale in zero-shot settings. After fine-tuning, their absolute depth errors still grow with altitude, which points to domain-specific challenges and the need for dedicated datasets like OccuFly.

Abstract

Semantic Scene Completion (SSC) is essential for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial settings like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors are the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR point clouds from elevated viewpoints. To address these limitations, we propose a LiDAR-free, camera-based data generation framework. By leveraging classical 3D reconstruction, our framework automates semantic label transfer by lifting <10% of annotated images into the reconstructed point cloud, substantially reducing manual 3D annotation effort. Based on this framework, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured across multiple altitudes and all seasons. OccuFly provides over 20,000 samples of images, semantic voxel grids, and metric depth maps across 21 semantic classes in urban, industrial, and rural environments, and follows established data organization for seamless integration. We benchmark both SSC and metric monocular depth estimation on OccuFly, revealing fundamental limitations of current vision foundation models in aerial settings and establishing new challenges for robust 3D scene understanding in the aerial domain.

Challenge & Solution

🆘Challenge: Autonomous aerial perception requires holistic 3D scene understanding from sparse visual observations, yet existing datasets and methods are predominantly designed for terrestrial environments. Such SSC datasets are typically built by fusing multiple sparse LiDAR sweeps to form a dense point cloud, manually annotating each point, and voxelizing the result into ground-truth labels. While this pipeline works well for ground vehicles, it is poorly suited for aerial settings: UAV platforms face strict payload and power constraints that limit the use of LiDAR sensors, and the elevated viewpoint further exacerbates LiDAR sparsity, leaving large regions unobserved and resulting in incomplete or low-quality ground truth.

💡Solution: We propose a novel and scalable data generation framework based on cameras, which are ubiquitous on modern UAVs. Our approach constructs SSC ground truth without LiDAR, mitigating point cloud sparsity, complying with UAV mass and energy constraints, and reducing manual annotation effort by shifting from costly 3D semantic labeling to efficient 2D labeling.

Data Generation Framework

3D Reconstruction: We utilize geo-referenced images to apply traditional multi-view reconstruction, generating a metric 3D point cloud. This approach additionally yields 2D–3D correspondences, allowing image pixels to be associated with reconstructed 3D points, effectively streamlining the creation of 3D semantic annotations.
Semantic Annotation: We enable highly efficient label transfer by manually annotating only a small subset of the camera images (<10% on average) and automatically lifting the semantic pixels into >99% of the reconstructed points. This reduces costly 3D annotation to efficient 2D image labeling, substantially lowering annotation effort.
Densification & Voxelization: Our pipeline first separates semantic classes into three distinct groups, and applies specialized densification strategies to each. Individual objects are then voxelized separately and aggregated to form a complete and densified scene-level voxel grid.
Ground-Truth Sampling: As all previous steps are performed on a global scene level, we finally retrieve per-frame ground-truth grids by frustum-culling the scene voxel grid using geo-referenced camera poses and intrinsics, resulting in one fixed-size semantic voxel grid per camera frame.
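
The label transfer and ground-truth sampling steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an undistorted pinhole camera, a world-to-camera pose given as rotation `R` and translation `t`, and a simple majority vote over the annotated pixels that each 3D point corresponds to. All function and parameter names are hypothetical.

```python
from collections import Counter


def lift_labels(correspondences, pixel_labels):
    """Assign each 3D point the majority label of its annotated pixels.

    correspondences: {point_id: [(image_id, u, v), ...]} from reconstruction
    pixel_labels:    {image_id: {(u, v): class_id}} for annotated images only
    Points seen only in unannotated images are left out of the result.
    (Majority voting is an assumption; the page does not name the rule.)
    """
    point_labels = {}
    for point_id, observations in correspondences.items():
        votes = Counter()
        for image_id, u, v in observations:
            label = pixel_labels.get(image_id, {}).get((u, v))
            if label is not None:
                votes[label] += 1
        if votes:
            point_labels[point_id] = votes.most_common(1)[0][0]
    return point_labels


def in_frustum(point, R, t, fx, fy, cx, cy, width, height, near, far):
    """Return True if a world-space point projects inside the image.

    `near` must be positive so that the perspective division is safe.
    """
    # World -> camera: x_cam = R @ x_world + t (R is a 3x3 nested list).
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i]
               for i in range(3))
    if not (near <= z <= far):
        return False
    # Pinhole projection to pixel coordinates.
    u = fx * x / z + cx
    v = fy * y / z + cy
    return 0 <= u < width and 0 <= v < height


def cull_voxels(voxels, **camera):
    """Keep voxels ({center: class_id}) whose centers lie in the frustum."""
    return {c: s for c, s in voxels.items() if in_frustum(c, **camera)}
```

Note that this sketch returns a sparse dictionary of visible voxels. Producing the fixed-size per-frame grid described above would additionally require resampling the culled voxels into a camera-aligned grid, which is omitted here.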

The OccuFly Dataset

OccuFly is the first real-world aerial 3D SSC benchmark dataset. It consists of 9 scenes that provide over 20,000 samples of RGB images, semantic occupancy grids, and metric depth maps, covering 21 semantic classes. OccuFly spans almost 200,000 m² at altitudes of 50 m, 40 m, and 30 m in urban, industrial, and rural environments during spring, summer, fall, and winter.

OccuFly in Comparison

Since no real-world aerial SSC datasets exist, we compare OccuFly with established vision-based terrestrial SSC datasets. Similar to SemanticKITTI, which introduced real-world SSC to autonomous driving, OccuFly introduces real-world SSC to the aerial domain, but at a substantially larger scale: the number of samples is more than 5x higher, and the total number of labeled voxels is over 6x larger than SemanticKITTI. OccuFly further provides the largest class taxonomy (21 classes) among the compared SSC datasets.

Furthermore, we compare OccuFly to other real-world, low-altitude aerial datasets that include metric depth maps. To the best of our knowledge, WildUAV and UseGeo are the only publicly available datasets of this kind. Notably, OccuFly is substantially larger, providing more than 13x and more than 24x as many metric depth maps, respectively, while spanning a broader range of environments and seasons. This positions OccuFly as the largest and most diverse publicly available low-altitude metric depth estimation dataset to date.

Semantic Scene Completion

We benchmark state-of-the-art SSC models and observe a pronounced domain gap when transferring from terrestrial to aerial data. While models can recover coarse scene geometry, semantic predictions degrade significantly, leading to inconsistent and noisy class assignments. This highlights that existing SSC approaches, largely developed for ground-level driving scenarios, fail to generalize to aerial viewpoints characterized by different scales, perspectives, and sparsity patterns. Notably, altitude-wise evaluation shows that performance remains uniformly low across all flight heights, indicating that viewpoint altitude has only a limited effect compared to the overall difficulty of the task. OccuFly therefore exposes fundamental limitations of current models and establishes a challenging benchmark for aerial 3D scene understanding.
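
The gap between coarse geometry and semantics described above is typically quantified with two standard SSC metrics: a class-agnostic occupancy IoU and the mean IoU (mIoU) over semantic classes. The sketch below assumes flat lists of per-voxel class ids with `0` as the empty class. This encoding and the function names are illustrative and are not taken from the OccuFly code.

```python
def occupancy_iou(pred, gt, empty=0):
    """Class-agnostic IoU: a voxel counts as occupied if it is not `empty`."""
    inter = sum(p != empty and g != empty for p, g in zip(pred, gt))
    union = sum(p != empty or g != empty for p, g in zip(pred, gt))
    return inter / union if union else float("nan")


def mean_iou(pred, gt, num_classes, empty=0):
    """Mean of per-class IoU over semantic classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:  # skip classes that appear in neither pred nor gt
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: occupancy is predicted perfectly, but half of the
# occupied voxels receive the wrong semantic class.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 1, 0]
print(occupancy_iou(pred, gt))            # 1.0
print(mean_iou(pred, gt, num_classes=3))  # about 0.33
```

In this toy case the geometry score is perfect while the semantic score drops to one third, which mirrors the pattern reported above. Benchmarks commonly accumulate intersections and unions over the whole evaluation set before dividing, rather than averaging per-sample scores.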

Metric Monocular Depth Estimation & 3D Foundation Models

We evaluate metric monocular depth estimation with 3D foundation models and observe that existing approaches struggle significantly in aerial settings. In particular, 3D foundation models such as DepthAnything3 exhibit severe metric inconsistencies, with an average scale deviation exceeding 500%, highlighting their inability to recover accurate geometry from aerial imagery. While fine-tuning on OccuFly substantially improves depth prediction accuracy across all metrics, a clear altitude-dependent trend remains: relative errors stay stable, whereas absolute errors increase with flight height due to larger scene depth and reduced resolution. These findings demonstrate that both geometric reconstruction and depth estimation remain challenging in aerial domains, positioning OccuFly as a crucial benchmark for improving metric 3D understanding from UAV imagery.
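
To see why relative errors can stay stable while absolute errors grow with altitude, consider two common depth metrics: absolute relative error (AbsRel) and root-mean-square error (RMSE). The sketch below is an illustration under stated assumptions. It models a prediction that is uniformly 10% too deep, and it uses a flat ground seen from directly above so that depth equals altitude. It also defines scale deviation as the relative difference between median predicted and median true depth. The page does not spell out its scale-deviation formula, so that definition is hypothetical.

```python
import math
from statistics import median


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt (dimensionless)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root-mean-square error in the same unit as the depths (meters)."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def scale_deviation(pred, gt):
    """Assumed definition: |median(pred) / median(gt) - 1|; 5.0 means 500%."""
    return abs(median(pred) / median(gt) - 1.0)


# Same 10% relative error at each altitude used in OccuFly.
for altitude in (30.0, 40.0, 50.0):
    gt = [altitude] * 4
    pred = [1.1 * g for g in gt]
    print(altitude, round(abs_rel(pred, gt), 3), round(rmse(pred, gt), 3))
```

With a constant 10% relative error, AbsRel stays at about 0.1 for every altitude, while RMSE grows from about 3 m at 30 m to about 5 m at 50 m. Under the assumed definition, a model that predicts six times the true depth would have a scale deviation of 500%.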

BibTeX


@inproceedings{gross2025occufly,
    title={{OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective}}, 
    author={Markus Gross and Sai B. Matha and Aya Fahmy and Rui Song and Daniel Cremers and Henri Meess},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2026},
}