OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
- We present OccuFly, the first real-world aerial SSC vision benchmark, consisting of 9 scenes that provide over 20,000 images with corresponding 3D semantic occupancy grids and metric depth maps across 22 semantic classes. OccuFly covers almost 200,000 m² at 50m, 40m, and 30m altitude in urban, industrial, and rural scenarios during spring, summer, fall, and winter.
- In addition to the SSC samples, OccuFly offers more than 20,000 per-frame metric depth maps. Furthermore, we fine-tune and evaluate Depth-Anything-V2 on these depth maps and release the resulting model to enable state-of-the-art SSC.
- We propose a novel and scalable data generation framework for constructing SSC ground truth that (i) relies on the camera modality to avoid the sparsity of LiDAR-based point clouds from elevated viewpoints, (ii) forgoes LiDAR hardware to comply with the mass and energy constraints of most UAVs, and (iii) reduces manual semantic labeling from tedious 3D annotation to efficient 2D annotation.
- We show that state-of-the-art SSC models trained on OccuFly can recover coarse geometry but struggle with semantic consistency, revealing fundamental domain-specific challenges and positioning OccuFly as a robust benchmark for advancing aerial vision-based 3D scene understanding.
Abstract
Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on the camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially reducing manual 3D annotation effort. Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.
Challenge & Solution
🆘 Challenge: Terrestrial SSC datasets are typically built by fusing multiple sparse LiDAR sweeps to form a dense point cloud, manually annotating each point, and voxelizing the result into ground-truth labels. While this pipeline works well for ground vehicles, it is poorly suited to aerial scenarios: UAV platforms face strict payload and power constraints that limit the use of LiDAR sensors, and the elevated viewpoint further exacerbates LiDAR sparsity, leaving large regions unobserved and resulting in incomplete or low-quality ground truth.
💡 Solution: We propose a novel and scalable data generation framework based on the camera modality, which is ubiquitous on modern UAVs. Our approach constructs SSC ground truth without LiDAR, mitigating point cloud sparsity, complying with UAV mass and energy constraints, and reducing manual annotation effort by shifting from costly 3D semantic labeling to efficient 2D labeling.
Data Generation Framework
- 3D Reconstruction: We apply traditional multi-view reconstruction to geo-referenced images, generating a metric 3D point cloud. This approach additionally yields 2D–3D correspondences, allowing image pixels to be associated with reconstructed 3D points and effectively streamlining the creation of 3D semantic annotations.
- 2D Semantic Annotation: We enable highly efficient label transfer by manually annotating only a small subset of the camera images (<10% on average) and lifting the semantic pixels into the reconstructed point cloud. This reduces costly 3D annotation to efficient 2D image labeling, substantially lowering annotation effort (a minimal sketch of this label transfer follows the list).
- Densification & Voxelization: Our pipeline first separates the semantic classes into three distinct groups and applies a specialized densification strategy to each. Individual objects are then voxelized separately and aggregated into a complete, densified scene-level voxel grid (see the voxelization sketch below).
- Ground-Truth Sampling: As all previous steps operate at the global scene level, we finally retrieve per-frame ground-truth grids by frustum-culling the scene voxel grid using geo-referenced camera poses and intrinsics, resulting in one fixed-size semantic voxel grid per camera frame (see the frustum-culling sketch below).
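To make the label-transfer step concrete, here is a minimal sketch, assuming a COLMAP-style reconstruction that exposes per-point tracks (the 2D observations of each 3D point). The names `tracks`, `masks`, `UNLABELED`, and `lift_labels` are illustrative, not the pipeline's actual API.

```python
from collections import Counter

# Hypothetical inputs (illustrative, not the paper's actual data structures):
#   tracks: dict mapping point_id -> list of (image_id, (u, v)) observations,
#           as produced by a COLMAP-style multi-view reconstruction
#   masks:  dict mapping image_id -> HxW integer array of 2D semantic labels,
#           present only for the manually annotated subset of images

UNLABELED = 255  # sentinel for points never observed in an annotated image

def lift_labels(tracks, masks):
    """Transfer 2D semantic labels to reconstructed 3D points by majority vote."""
    point_labels = {}
    for point_id, observations in tracks.items():
        votes = []
        for image_id, (u, v) in observations:
            mask = masks.get(image_id)
            if mask is None:  # this view was not annotated; skip it
                continue
            h, w = mask.shape
            col, row = int(round(u)), int(round(v))
            if 0 <= row < h and 0 <= col < w:
                votes.append(int(mask[row, col]))
        # majority vote over all annotated views observing this point
        point_labels[point_id] = (
            Counter(votes).most_common(1)[0][0] if votes else UNLABELED
        )
    return point_labels
```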
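One common way to turn the labeled point cloud into a scene-level semantic grid is a per-voxel majority vote, sketched below; it omits the class-specific densification step, and the voxel size and class count are placeholder values rather than the dataset's actual settings.

```python
import numpy as np

def voxelize_majority(points, labels, voxel_size=0.5, num_classes=23):
    """Aggregate labeled 3D points into a sparse semantic voxel grid.

    points: (N, 3) metric coordinates; labels: (N,) integer class ids.
    voxel_size and num_classes are illustrative, not the paper's settings.
    """
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    dims = idx.max(axis=0) + 1
    flat = np.ravel_multi_index(idx.T, dims)           # one scalar key per voxel
    # count how often each (voxel, class) pair occurs
    keys, counts = np.unique(flat * num_classes + labels, return_counts=True)
    vox, cls = keys // num_classes, keys % num_classes
    # sort by voxel, then by count, so the last entry per voxel is its majority class
    order = np.lexsort((counts, vox))
    vox, cls = vox[order], cls[order]
    last = np.r_[vox[1:] != vox[:-1], True]            # last occurrence per voxel
    coords = np.stack(np.unravel_index(vox[last], dims), axis=1)
    return coords, cls[last], origin                   # sparse grid + majority labels
```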
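Per-frame ground-truth sampling then reduces to a frustum test against the scene grid. This sketch assumes a standard pinhole model with world-to-camera extrinsics and keeps only voxels whose centers project into the image within a depth cutoff; re-indexing the survivors into the fixed-size per-frame grid is omitted, and `max_depth` is an assumed value.

```python
import numpy as np

def frustum_cull(coords, vox_labels, origin, voxel_size, R, t, K, image_size,
                 max_depth=80.0):
    """Keep the scene voxels that fall inside one camera's viewing frustum.

    R (3x3), t (3,): world-to-camera extrinsics from the geo-referenced pose.
    K (3x3): pinhole intrinsics; image_size: (width, height).
    max_depth is an illustrative cutoff, not the paper's setting.
    """
    centers = origin + (coords + 0.5) * voxel_size     # voxel centers, world frame
    cam = centers @ R.T + t                            # world -> camera frame
    z = cam[:, 2]
    uv = cam @ K.T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    w, h = image_size
    keep = ((z > 0) & (z < max_depth) &
            (uv[:, 0] >= 0) & (uv[:, 0] < w) &
            (uv[:, 1] >= 0) & (uv[:, 1] < h))
    return coords[keep], vox_labels[keep]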
The OccuFly Dataset
OccuFly introduces the first real-world aerial 3D SSC benchmark dataset, consisting of 9 scenes that provide over 20,000 samples of RGB images, semantic occupancy grids, and metric depth maps across 22 semantic classes. OccuFly covers almost 200,000 m² at 50m, 40m, and 30m altitude in urban, industrial, and rural scenarios during spring, summer, fall, and winter.
OccuFly in Comparison
Since no real-world aerial SSC datasets exist, we compare OccuFly with established vision-based terrestrial SSC datasets. Similar to SemanticKITTI, which introduced real-world SSC to autonomous driving, OccuFly introduces real-world SSC to the aerial domain, but at a substantially larger scale: the number of samples is more than 5x higher, and the total number of labeled voxels is over 6x larger than in SemanticKITTI. OccuFly further provides the largest class taxonomy (22 classes) among the compared SSC datasets.
Furthermore, we compare OccuFly to other real-world, low-altitude aerial datasets that include metric depth maps. To the best of our knowledge, WildUAV and UseGeo are the only publicly available datasets of this kind. Notably, OccuFly is substantially larger, providing more than 13x and more than 24x as many metric depth maps, respectively, while spanning a broader range of scenarios and seasons. This positions OccuFly as the largest and most diverse publicly available low-altitude metric depth estimation dataset to date.
Evaluation of Semantic Scene Completion
We benchmark CGFormer, an established state-of-the-art SSC method. Our results show that, although coarse geometry is captured, semantic consistency suffers significantly and falls well below terrestrial performance. This gap exposes the domain-specific challenges of aerial imagery and reveals that existing SSC models fall short in this domain. OccuFly therefore serves as a rigorous testbed to propel progress on aerial image-based 3D scene understanding.
Evaluation of Metric Monocular Depth Estimation
Notably, no established metric monocular depth estimation models exist for the aerial domain. We therefore evaluate the potential of OccuFly's metric depth maps by benchmarking Depth Anything V2 ViT-Small Metric (DAv2-metric), a state-of-the-art metric monocular depth estimation model. Following the DAv2 metric adaptation protocol, we fine-tune the affine-invariant model on OccuFly's training split with metric depth supervision and refer to the result as DAv2-OccuFly.
Our results show that DAv2-OccuFly consistently and substantially outperforms DAv2-metric across all metrics and altitudes. Notably, the normalized error measures (AbsRel, SILog) remain relatively stable with altitude, indicating robust scale-invariant behavior. In contrast, the absolute errors (RMSE, MAE) increase with altitude, suggesting a positive correlation between viewpoint height and metric error. Consequently, OccuFly provides a robust benchmark to advance aerial image-based 3D scene understanding.
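For reference, the reported error measures follow standard definitions; a minimal sketch (assuming `gt == 0` marks invalid pixels and the common ×100 scaling for SILog) might look like this:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, RMSE, MAE, and SILog over valid ground-truth pixels.

    pred, gt: HxW arrays of metric depth in meters.
    """
    valid = gt > 0                       # assumption: 0 marks missing depth
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)                 # normalized by depth
    rmse = np.sqrt(np.mean((p - g) ** 2))                # metric, in meters
    mae = np.mean(np.abs(p - g))                         # metric, in meters
    d = np.log(p) - np.log(g)                            # scale-invariant residuals
    silog = np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2) * 100.0
    return {"AbsRel": abs_rel, "RMSE": rmse, "MAE": mae, "SILog": silog}
```

Because AbsRel and SILog normalize out absolute scale while RMSE and MAE do not, the stability of the former and the growth of the latter with altitude are consistent with depth magnitudes simply increasing at higher viewpoints.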
BibTeX
@misc{gross2025occufly,
  title={{OccuFly}: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective},
  author={Markus Gross and Sai B. Matha and Aya Fahmy and Rui Song and Daniel Cremers and Henri Meess},
  year={2025},
  eprint={2512.20770},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.20770}
}