UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

The Hong Kong University of Science and Technology (HKUST)
Overview of UniFunc3D compared to existing fragmented pipelines. (Top) Prior methods like Fun3DU rely on a visually blind text-only LLM for initial task parsing. Coupled with single-scale passive heuristic frame selection, this fragmented approach suffers from three critical failure modes: semantic misinterpretations, spatial-temporal context inconsistencies, and imperceptible small targets. (Bottom) Our proposed UniFunc3D addresses these limitations by utilizing a unified Multimodal Large Language Model (MLLM) as an active observer, consolidating semantic, temporal, and spatial reasoning into a single forward pass.

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing, and are further limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats a multimodal large language model (MLLM) as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, allowing the model to adaptively select the correct video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin (a 59.9% relative mIoU improvement) without any task-specific training.

Method

UniFunc3D employs a single unified MLLM with active spatial-temporal grounding across two stages:

(1) Active spatial-temporal grounding with joint functional object identification. The coarse stage (Round 1) actively surveys low-resolution video frames across multiple sampling iterations and selects the most informative candidate via visual verification. The fine stage (Round 2) processes a dense temporal window at native high resolution, delivering zoom-in capability while preserving global scene context for precise localization.
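The coarse-to-fine loop above can be sketched in a few lines of Python. This is a toy illustration only: `score_fn` stands in for the MLLM's visual verification call, and the sampling count, iteration count, and window size are illustrative defaults, not the paper's settings.

```python
import random

def coarse_to_fine_select(frames, score_fn, n_samples=8, n_iters=3, window=5):
    """Two-round active frame selection (minimal sketch).

    Round 1 (coarse): repeatedly sample small sets of low-resolution
    frames and keep the index the verifier scores highest.
    Round 2 (fine): return a dense temporal window around that index,
    to be re-examined at native high resolution.
    """
    rng = random.Random(0)
    best_idx, best_score = 0, float("-inf")
    for _ in range(n_iters):  # multiple sampling iterations
        candidates = rng.sample(range(len(frames)), min(n_samples, len(frames)))
        for i in candidates:
            s = score_fn(frames[i])  # stands in for low-res MLLM verification
            if s > best_score:
                best_idx, best_score = i, s
    lo = max(0, best_idx - window)               # dense window for Round 2,
    hi = min(len(frames), best_idx + window + 1)  # processed at native resolution
    return best_idx, list(range(lo, hi))
```

In practice the scoring round and the fine-grained round would each be a forward pass of the same MLLM over the selected frames; the sketch only captures the control flow.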

(2) Visual mask generation and verification. Predicted affordance points prompt SAM3 for segmentation. Each mask is then verified by the same MLLM through visual overlay inspection before 3D lifting. Verified masks undergo multi-view agreement and 3D lifting to produce the final point cloud mask.
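The multi-view agreement step can be illustrated as a per-point vote over the verified 2D masks. This is a hypothetical sketch of the idea, not the paper's implementation: the projection and visibility inputs are assumed to be precomputed from camera poses, and the agreement threshold is illustrative.

```python
import numpy as np

def lift_masks_to_3d(masks, projections, visibility, min_agreement=0.5):
    """Multi-view agreement lifting of 2D masks to a 3D point mask (sketch).

    masks:       list of HxW boolean 2D masks (already MLLM-verified).
    projections: list of (N, 2) integer (u, v) pixel coordinates per view,
                 giving where each of the N scene points projects.
    visibility:  list of (N,) boolean arrays; point visible in that view.
    A 3D point joins the final mask if at least `min_agreement` of the
    views that see it also mask it.
    """
    n_points = projections[0].shape[0]
    hits = np.zeros(n_points)
    seen = np.zeros(n_points)
    for mask, uv, vis in zip(masks, projections, visibility):
        h, w = mask.shape
        inb = vis & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                  & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        seen += inb
        sel = np.where(inb)[0]
        hits[sel] += mask[uv[sel, 1], uv[sel, 0]]  # index mask as [row, col]
    seen = np.maximum(seen, 1)  # never-seen points get 0 agreement, not NaN
    return (hits / seen) >= min_agreement
```

Majority voting across views is what makes the lifted mask robust to a single bad 2D prediction: a spurious mask in one frame is outvoted by the other frames that see the same points.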
Method overview. The coarse stage actively surveys low-resolution frames with multiple sampling iterations; the fine stage processes a dense temporal window at native high resolution. Visual mask generation and verification uses SAM3 with MLLM-based mask verification, then multi-view 3D lifting to obtain the final 3D masks.

Results

State-of-the-art on SceneFun3D

UniFunc3D-30B achieves the best performance across all methods on all reported metrics on both splits of SceneFun3D. Remarkably, our training-free approach outperforms both training-free and training-based methods, including those using significantly larger models (72B).
  • Compared to Fun3DU (training-free): +84.9% relative AP50 and +59.9% relative mIoU on split0.
  • Compared to AffordBot-72B (training-based, fine-tuned for 1000 epochs): +49.4% relative AP50 and +68.5% relative mIoU.
  • UniFunc3D also achieves a 3.2× speedup over Fun3DU (~26 min vs. ~82 min per scene).
Method                 |        Split0 (30 scenes, val)       |      Split1 (200 scenes, train)
                       | AP50   AP25   AR50   AR25   mIoU     | AP50   AP25   AR50   AR25   mIoU
Training-based methods:
TASA-72B               | 26.9   28.6   –      –      19.7     | trained on split1
AffordBot-72B          | 20.91  24.76  18.99  22.84  14.42    | trained on split1
Training-free methods:
Fun3DU-9B              | 16.9   33.3   38.2   46.7   15.2     | 12.6   23.1   32.9   40.5   11.5
UniFunc3D-8B (Ours)    | 23.82  44.04  46.07  55.51  20.92    | 16.24  29.02  38.91  48.15  14.23
UniFunc3D-30B (Ours)   | 31.24  51.01  46.97  58.88  24.30    | 21.32  35.76  40.03  51.00  17.09
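The mIoU reported above is the standard mean intersection-over-union between predicted and ground-truth binary point-cloud masks, averaged over queries. A minimal sketch (a hypothetical helper, not the official SceneFun3D evaluator):

```python
import numpy as np

def miou(pred_masks, gt_masks):
    """Mean IoU over binary point-cloud masks (minimal sketch).

    Each mask is a boolean array over the scene's points. IoU is
    intersection / union per query, averaged across queries; an empty
    prediction against an empty ground truth counts as IoU 1.
    """
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```

AP50/AP25 (AR50/AR25) are average precision (recall) at IoU thresholds of 0.5 and 0.25 computed over the same per-query IoUs.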

Qualitative Results

Qualitative comparison across five representative queries. Our method clearly outperforms prior methods in spatial disambiguation and handling small interactive objects. For example, for "Open the top left drawer of the cabinet with the beauty products on top", our method finds the correct top-left knob, while AffordBot finds the wrong top-right knob and Fun3DU mistakenly segments the drawer face.

[Figure: each row shows, left to right, the predictions of AffordBot, Fun3DU, Ours, and the ground truth (GT) for one query.]

Queries:
  • "Open the top left drawer of the cabinet with the beauty products on top"
  • "Turn on the ceiling light"
  • "Control the water flow in the bathtub using the drain control dial"
  • "Select a washing program"
  • "Flush the toilet"

BibTeX

@article{Lin_UniFunc3D,
    author  = {Lin, Jiaying and Xu, Dan},
    title   = {UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation},
    journal = {arXiv preprint},
    year    = {2026},
}