This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Each cell gives the score followed by the method's rank in that column in parentheses. Rows are sorted by Overall acc@0.5IoU. A bracketed tag after a method name is its entry in the leaderboard's Info column.

Method                            Unique        Unique        Multiple      Multiple      Overall       Overall
                                  acc@0.25IoU   acc@0.5IoU    acc@0.25IoU   acc@0.5IoU    acc@0.25IoU   acc@0.5IoU
ConcreteNet                       0.8120 (3)    0.6933 (1)    0.4479 (4)    0.3760 (2)    0.5296 (3)    0.4471 (1)
M3DRef-CLIP                       0.8170 (1)    0.6925 (2)    0.4582 (2)    0.3617 (3)    0.5386 (1)    0.4359 (2)
HAM                               0.7799 (8)    0.6373 (10)   0.4148 (10)   0.3324 (4)    0.4967 (9)    0.4007 (3)
    Jiaming Chen, Weixin Luo, Xiaolin Wei, Lin Ma, Wei Zhang: HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding.
CSA-M3LM                          0.8137 (2)    0.6241 (11)   0.4544 (3)    0.3317 (5)    0.5349 (2)    0.3972 (4)
D3Net [permissive]                0.7923 (4)    0.6843 (3)    0.3905 (14)   0.3074 (8)    0.4806 (12)   0.3919 (5)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
ContraRefer                       0.7832 (7)    0.6801 (6)    0.3850 (15)   0.2947 (9)    0.4743 (13)   0.3811 (6)
Clip                              0.7733 (12)   0.6810 (5)    0.3619 (18)   0.2919 (13)   0.4542 (17)   0.3791 (7)
Clip-pre                          0.7766 (10)   0.6843 (3)    0.3617 (20)   0.2904 (14)   0.4547 (16)   0.3787 (8)
3DJCG(Grounding) [permissive]     0.7675 (15)   0.6059 (12)   0.4389 (5)    0.3117 (6)    0.5126 (4)    0.3776 (9)
    Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
3DVG-Trans + [permissive]         0.7733 (12)   0.5787 (17)   0.4370 (6)    0.3102 (7)    0.5124 (5)    0.3704 (10)
    Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
FE-3DGQA                          0.7857 (5)    0.5862 (16)   0.4317 (7)    0.2935 (10)   0.5111 (6)    0.3592 (11)
D3Net - Pretrained [permissive]   0.7659 (16)   0.6579 (8)    0.3619 (18)   0.2726 (16)   0.4525 (19)   0.3590 (12)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
HGT                               0.7692 (14)   0.5886 (15)   0.4141 (11)   0.2924 (12)   0.4937 (10)   0.3588 (13)
InstanceRefer [permissive]        0.7782 (9)    0.6669 (7)    0.3457 (22)   0.2688 (17)   0.4427 (21)   0.3580 (14)
    Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
3DVG-Transformer [permissive]     0.7576 (17)   0.5515 (19)   0.4224 (9)    0.2933 (11)   0.4976 (8)    0.3512 (15)
    Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
SAVG                              0.7758 (11)   0.5664 (18)   0.4236 (8)    0.2826 (15)   0.5026 (7)    0.3462 (16)
PointGroup_MCAN                   0.7510 (18)   0.6397 (9)    0.3271 (24)   0.2535 (19)   0.4222 (23)   0.3401 (17)
TransformerVG                     0.7502 (19)   0.5977 (13)   0.3712 (16)   0.2628 (18)   0.4562 (15)   0.3379 (18)
TGNN                              0.6834 (24)   0.5894 (14)   0.3312 (23)   0.2526 (20)   0.4102 (24)   0.3281 (19)
    Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
BEAUTY-DETR [copyleft]            0.7848 (6)    0.5499 (20)   0.3934 (13)   0.2480 (21)   0.4811 (11)   0.3157 (20)
    Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
grounding                         0.3822 (31)   0.2421 (31)   0.7298 (1)    0.5458 (1)    0.4538 (18)   0.3046 (21)
henet                             0.7110 (21)   0.5180 (22)   0.3936 (12)   0.2472 (22)   0.4590 (14)   0.3030 (22)
SRGA                              0.7494 (20)   0.5128 (23)   0.3631 (17)   0.2218 (23)   0.4497 (20)   0.2871 (23)
SR-GAB                            0.7016 (22)   0.5202 (21)   0.3233 (25)   0.1959 (26)   0.4081 (25)   0.2686 (24)
SPANet                            0.5614 (30)   0.4641 (25)   0.2800 (29)   0.2071 (25)   0.3431 (30)   0.2647 (25)
ScanRefer [permissive]            0.6859 (23)   0.4353 (26)   0.3488 (21)   0.2097 (24)   0.4244 (22)   0.2603 (26)
    Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
scanrefer2                        0.6340 (27)   0.4353 (26)   0.3193 (26)   0.1947 (27)   0.3898 (26)   0.2486 (27)
TransformerRefer                  0.6010 (28)   0.4658 (24)   0.2540 (31)   0.1730 (31)   0.3318 (31)   0.2386 (28)
ScanRefer Baseline                0.6422 (26)   0.4196 (28)   0.3090 (27)   0.1832 (28)   0.3837 (27)   0.2362 (29)
ScanRefer_vanilla                 0.6488 (25)   0.4056 (29)   0.3052 (28)   0.1782 (29)   0.3823 (28)   0.2292 (30)
pairwisemethod                    0.5779 (29)   0.3603 (30)   0.2792 (30)   0.1746 (30)   0.3462 (29)   0.2163 (31)
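
The acc@0.25IoU and acc@0.5IoU columns are thresholded localization accuracies. As a minimal sketch of how such a score could be computed, the Python below assumes axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), and assumes a prediction counts as correct when its IoU with the ground-truth box is at least the threshold. The names box_iou_3d and acc_at_iou are illustrative and are not taken from the benchmark's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical toy data: one exact match, one partial overlap, one miss.
    unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    preds = [unit, unit, unit]
    gts = [unit, (0.0, 0.0, 0.0, 3.0, 1.0, 1.0), (5.0, 5.0, 5.0, 6.0, 6.0, 6.0)]
    print("acc@0.25IoU:", acc_at_iou(preds, gts, 0.25))
    print("acc@0.5IoU:", acc_at_iou(preds, gts, 0.5))
```

In the toy data the partial overlap has an IoU of one third, so it counts at the 0.25 threshold but not at 0.5. This mirrors why every method in the table scores lower in the acc@0.5IoU columns than in the matching acc@0.25IoU columns.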