UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

The Hong Kong University of Science and Technology (HKUST)
Overview of UniFunc3D compared to existing fragmented pipelines. (Top) Prior methods like Fun3DU rely on a visually blind text-only LLM for initial task parsing. Coupled with single-scale passive heuristic frame selection, this fragmented approach suffers from three critical failure modes: semantic misinterpretations, spatial-temporal context inconsistencies, and imperceptible small targets. (Bottom) Our proposed UniFunc3D addresses these limitations by utilizing a unified Multimodal Large Language Model (MLLM) as an active observer, consolidating semantic, temporal, and spatial reasoning into a single forward pass.

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing, and are further limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats a multimodal large language model (MLLM) as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, allowing the model to adaptively select the correct video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin (a 59.9% relative mIoU improvement) without any task-specific training.

Method

UniFunc3D employs a single unified MLLM with active spatial-temporal grounding across two stages:

(1) Active spatial-temporal grounding with joint functional object identification. The coarse stage (Round 1) actively surveys low-resolution video frames across multiple sampling iterations and selects the most informative candidate via visual verification. The fine stage (Round 2) processes a dense temporal window at native high resolution, delivering zoom-in capability while preserving global scene context for precise localization.
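The coarse-to-fine loop above can be sketched in a few lines of Python. This is a toy illustration only: `score_fn` stands in for the MLLM's visual verification call, and the sampling count, iteration count, and window size are illustrative defaults, not the paper's settings.

```python
import random

def coarse_to_fine_select(frames, score_fn, n_samples=8, n_iters=3, window=5):
    """Two-round active frame selection (minimal sketch).

    Round 1 (coarse): repeatedly sample small sets of low-resolution
    frames and keep the index the verifier scores highest.
    Round 2 (fine): return a dense temporal window around that index,
    to be re-examined at native high resolution.
    """
    rng = random.Random(0)
    best_idx, best_score = 0, float("-inf")
    for _ in range(n_iters):  # multiple sampling iterations
        candidates = rng.sample(range(len(frames)), min(n_samples, len(frames)))
        for i in candidates:
            s = score_fn(frames[i])  # stands in for low-res MLLM verification
            if s > best_score:
                best_idx, best_score = i, s
    lo = max(0, best_idx - window)               # dense window for Round 2,
    hi = min(len(frames), best_idx + window + 1)  # processed at native resolution
    return best_idx, list(range(lo, hi))
```

In practice the scoring round and the fine-grained round would each be a forward pass of the same MLLM over the selected frames; the sketch only captures the control flow.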

(2) Visual mask generation and verification. Predicted affordance points prompt SAM3 for segmentation. Each mask is then verified by the same MLLM through visual overlay inspection before 3D lifting. Verified masks undergo multi-view agreement and 3D lifting to produce the final point cloud mask.
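The multi-view agreement step can be illustrated as a per-point vote over the verified 2D masks. This is a hypothetical sketch of the idea, not the paper's implementation: the projection and visibility inputs are assumed to be precomputed from camera poses, and the agreement threshold is illustrative.

```python
import numpy as np

def lift_masks_to_3d(masks, projections, visibility, min_agreement=0.5):
    """Multi-view agreement lifting of 2D masks to a 3D point mask (sketch).

    masks:       list of HxW boolean 2D masks (already MLLM-verified).
    projections: list of (N, 2) integer (u, v) pixel coordinates per view,
                 giving where each of the N scene points projects.
    visibility:  list of (N,) boolean arrays; point visible in that view.
    A 3D point joins the final mask if at least `min_agreement` of the
    views that see it also mask it.
    """
    n_points = projections[0].shape[0]
    hits = np.zeros(n_points)
    seen = np.zeros(n_points)
    for mask, uv, vis in zip(masks, projections, visibility):
        h, w = mask.shape
        inb = vis & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                  & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        seen += inb
        sel = np.where(inb)[0]
        hits[sel] += mask[uv[sel, 1], uv[sel, 0]]  # index mask as [row, col]
    seen = np.maximum(seen, 1)  # never-seen points get 0 agreement, not NaN
    return (hits / seen) >= min_agreement
```

Majority voting across views is what makes the lifted mask robust to a single bad 2D prediction: a spurious mask in one frame is outvoted by the other frames that see the same points.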
Method overview. The coarse stage actively surveys low-resolution frames with multiple sampling iterations; the fine stage processes a dense temporal window at native high resolution. Visual mask generation and verification uses SAM3 with MLLM-based mask verification, then multi-view 3D lifting to obtain the final 3D masks.

Results

State-of-the-art on SceneFun3D

UniFunc3D-30B achieves the best performance across all methods on all reported metrics on both splits of SceneFun3D. Remarkably, our training-free approach outperforms both training-free and training-based methods, including those using significantly larger models (72B).
  • Compared to Fun3DU (training-free): +84.9% relative AP50 and +59.9% relative mIoU on split0.
  • Compared to AffordBot-72B (training-based, fine-tuned for 1000 epochs): +49.4% relative AP50 and +68.5% relative mIoU.
  • UniFunc3D also achieves a 3.2× speedup over Fun3DU (~26 min vs. ~82 min per scene).
Method                 |        Split0 (30 scenes, val)       |      Split1 (200 scenes, train)
                       | AP50   AP25   AR50   AR25   mIoU     | AP50   AP25   AR50   AR25   mIoU
Training-based methods:
TASA-72B               | 26.9   28.6   –      –      19.7     | trained on split1
AffordBot-72B          | 20.91  24.76  18.99  22.84  14.42    | trained on split1
Training-free methods:
Fun3DU-9B              | 16.9   33.3   38.2   46.7   15.2     | 12.6   23.1   32.9   40.5   11.5
UniFunc3D-8B (Ours)    | 23.82  44.04  46.07  55.51  20.92    | 16.24  29.02  38.91  48.15  14.23
UniFunc3D-30B (Ours)   | 31.24  51.01  46.97  58.88  24.30    | 21.32  35.76  40.03  51.00  17.09
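The mIoU reported above is the standard mean intersection-over-union between predicted and ground-truth binary point-cloud masks, averaged over queries. A minimal sketch (a hypothetical helper, not the official SceneFun3D evaluator):

```python
import numpy as np

def miou(pred_masks, gt_masks):
    """Mean IoU over binary point-cloud masks (minimal sketch).

    Each mask is a boolean array over the scene's points. IoU is
    intersection / union per query, averaged across queries; an empty
    prediction against an empty ground truth counts as IoU 1.
    """
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```

AP50/AP25 (AR50/AR25) are average precision (recall) at IoU thresholds of 0.5 and 0.25 computed over the same per-query IoUs.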

Qualitative Results

Qualitative comparison across five representative queries. Our method clearly outperforms prior methods in spatial disambiguation and handling small interactive objects. For example, for "Open the top left drawer of the cabinet with the beauty products on top", our method finds the correct top-left knob, while AffordBot finds the wrong top-right knob and Fun3DU mistakenly segments the drawer face.

[Figure: each row shows, left to right, the predictions of AffordBot, Fun3DU, Ours, and the ground truth (GT) for one query.]

Queries:
  • "Open the top left drawer of the cabinet with the beauty products on top"
  • "Turn on the ceiling light"
  • "Control the water flow in the bathtub using the drain control dial"
  • "Select a washing program"
  • "Flush the toilet"

BibTeX

@article{Lin_UniFunc3D,
    author  = {Lin, Jiaying and Xu, Dan},
    title   = {UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation},
    journal = {arXiv preprint},
    year    = {2026},
}