This table lists the results on the ScanRefer Localization Benchmark, sorted by Overall acc@0.5IoU.


| Rank | Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| 1 | HAM | 0.7799 (6) | 0.6373 (8) | 0.4148 (8) | 0.3324 (2) | 0.4967 (7) | 0.4007 (1) |
| 2 | CSA-M3LM | 0.8137 (1) | 0.6241 (9) | 0.4544 (2) | 0.3317 (3) | 0.5349 (1) | 0.3972 (2) |
| 3 | D3Net (permissive) | 0.7923 (2) | 0.6843 (1) | 0.3905 (12) | 0.3074 (6) | 0.4806 (10) | 0.3919 (3) |
| 4 | ContraRefer | 0.7832 (5) | 0.6801 (4) | 0.3850 (13) | 0.2947 (7) | 0.4743 (11) | 0.3811 (4) |
| 5 | Clip | 0.7733 (10) | 0.6810 (3) | 0.3619 (16) | 0.2919 (11) | 0.4542 (15) | 0.3791 (5) |
| 6 | Clip-pre | 0.7766 (8) | 0.6843 (1) | 0.3617 (18) | 0.2904 (12) | 0.4547 (14) | 0.3787 (6) |
| 7 | 3DJCG (Grounding) (permissive) | 0.7675 (13) | 0.6059 (10) | 0.4389 (3) | 0.3117 (4) | 0.5126 (2) | 0.3776 (7) |
| 8 | 3DVG-Trans + (permissive) | 0.7733 (10) | 0.5787 (15) | 0.4370 (4) | 0.3102 (5) | 0.5124 (3) | 0.3704 (8) |
| 9 | FE-3DGQA | 0.7857 (3) | 0.5862 (14) | 0.4317 (5) | 0.2935 (8) | 0.5111 (4) | 0.3592 (9) |
| 10 | D3Net - Pretrained (permissive) | 0.7659 (14) | 0.6579 (6) | 0.3619 (16) | 0.2726 (14) | 0.4525 (17) | 0.3590 (10) |
| 11 | HGT | 0.7692 (12) | 0.5886 (13) | 0.4141 (9) | 0.2924 (10) | 0.4937 (8) | 0.3588 (11) |
| 12 | InstanceRefer (permissive) | 0.7782 (7) | 0.6669 (5) | 0.3457 (20) | 0.2688 (15) | 0.4427 (19) | 0.3580 (12) |
| 13 | 3DVG-Transformer (permissive) | 0.7576 (15) | 0.5515 (17) | 0.4224 (7) | 0.2933 (9) | 0.4976 (6) | 0.3512 (13) |
| 14 | SAVG | 0.7758 (9) | 0.5664 (16) | 0.4236 (6) | 0.2826 (13) | 0.5026 (5) | 0.3462 (14) |
| 15 | PointGroup_MCAN | 0.7510 (16) | 0.6397 (7) | 0.3271 (22) | 0.2535 (17) | 0.4222 (21) | 0.3401 (15) |
| 16 | TransformerVG | 0.7502 (17) | 0.5977 (11) | 0.3712 (14) | 0.2628 (16) | 0.4562 (13) | 0.3379 (16) |
| 17 | TGNN | 0.6834 (22) | 0.5894 (12) | 0.3312 (21) | 0.2526 (18) | 0.4102 (22) | 0.3281 (17) |
| 18 | BEAUTY-DETR (copyleft) | 0.7848 (4) | 0.5499 (18) | 0.3934 (11) | 0.2480 (19) | 0.4811 (9) | 0.3157 (18) |
| 19 | grounding | 0.3822 (29) | 0.2421 (29) | 0.7298 (1) | 0.5458 (1) | 0.4538 (16) | 0.3046 (19) |
| 20 | henet | 0.7110 (19) | 0.5180 (20) | 0.3936 (10) | 0.2472 (20) | 0.4590 (12) | 0.3030 (20) |
| 21 | SRGA | 0.7494 (18) | 0.5128 (21) | 0.3631 (15) | 0.2218 (21) | 0.4497 (18) | 0.2871 (21) |
| 22 | SR-GAB | 0.7016 (20) | 0.5202 (19) | 0.3233 (23) | 0.1959 (24) | 0.4081 (23) | 0.2686 (22) |
| 23 | SPANet | 0.5614 (28) | 0.4641 (23) | 0.2800 (27) | 0.2071 (23) | 0.3431 (28) | 0.2647 (23) |
| 24 | ScanRefer (permissive) | 0.6859 (21) | 0.4353 (24) | 0.3488 (19) | 0.2097 (22) | 0.4244 (20) | 0.2603 (24) |
| 25 | scanrefer2 | 0.6340 (25) | 0.4353 (24) | 0.3193 (24) | 0.1947 (25) | 0.3898 (24) | 0.2486 (25) |
| 26 | TransformerRefer | 0.6010 (26) | 0.4658 (22) | 0.2540 (29) | 0.1730 (29) | 0.3318 (29) | 0.2386 (26) |
| 27 | ScanRefer Baseline | 0.6422 (24) | 0.4196 (26) | 0.3090 (25) | 0.1832 (26) | 0.3837 (25) | 0.2362 (27) |
| 28 | ScanRefer_vanilla | 0.6488 (23) | 0.4056 (27) | 0.3052 (26) | 0.1782 (27) | 0.3823 (26) | 0.2292 (28) |
| 29 | pairwisemethod | 0.5779 (27) | 0.3603 (28) | 0.2792 (28) | 0.1746 (28) | 0.3462 (27) | 0.2163 (29) |

Each cell shows the accuracy followed by the method's rank within that column in parentheses. The tags (permissive) and (copyleft) indicate the license of the released code.

References for published methods:

- HAM: Jiaming Chen, Weixin Luo, Xiaolin Wei, Lin Ma, Wei Zhang: HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding.
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022.
- 3DJCG (Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans + / 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020.
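Every column in the table reports acc@kIoU: a localization counts as correct when the 3D IoU between the predicted bounding box and the ground-truth box is at least the threshold k (0.25 or 0.5). The sketch below illustrates this metric for axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names and box format are illustrative assumptions, not taken from the benchmark's evaluation code.

```python
def box3d_iou(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes.

    Boxes are (xmin, ymin, zmin, xmax, ymax, zmax); an illustrative
    format, assumed here rather than taken from the benchmark code.
    """
    # Overlap extent along each axis, clamped to zero when disjoint.
    dx = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
    dy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
    dz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))
    inter = dx * dy * dz
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds, gts, threshold):
    """acc@kIoU: fraction of predictions whose IoU with the
    matching ground-truth box meets the threshold k."""
    hits = sum(box3d_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under this definition the Unique/Multiple split only changes which samples are averaged over (descriptions whose target class appears once in the scene versus several times); the per-box criterion is identical.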