This table lists the results for the ScanRefer Localization Benchmark.


Each score is followed in parentheses by the method's rank in that column; a dash in the Info column means no tag is listed.

                                Unique       Unique       Multiple     Multiple     Overall      Overall
Method              Info        acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU
Chat-Scene          permissive  0.8887 (1)   0.8005 (1)   0.5421 (1)   0.4861 (1)   0.6198 (1)   0.5566 (1)
    Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
ConcreteNet         -           0.8607 (2)   0.7923 (2)   0.4746 (7)   0.4091 (2)   0.5612 (6)   0.4950 (2)
    Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024
cus3d               -           0.8384 (4)   0.7073 (6)   0.4908 (5)   0.4000 (3)   0.5688 (4)   0.4689 (3)
M3DRef-test         -           0.7865 (14)  0.6793 (14)  0.4963 (4)   0.3977 (4)   0.5614 (5)   0.4608 (5)
D-LISA              -           0.8195 (6)   0.6900 (8)   0.4975 (3)   0.3967 (5)   0.5697 (3)   0.4625 (4)
    Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024
pointclip           -           0.8211 (5)   0.7082 (5)   0.4803 (6)   0.3884 (6)   0.5567 (7)   0.4601 (6)
CORE-3DVG           -           0.8557 (3)   0.6867 (9)   0.5275 (2)   0.3850 (7)   0.6011 (2)   0.4527 (9)
M3DRef-CLIP         permissive  0.7980 (10)  0.7085 (4)   0.4692 (9)   0.3807 (8)   0.5433 (9)   0.4545 (8)
    Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
M3DRef-SCLIP        -           0.7997 (9)   0.7123 (3)   0.4708 (8)   0.3805 (9)   0.5445 (8)   0.4549 (7)
3DInsVG             -           0.8170 (7)   0.6925 (7)   0.4582 (11)  0.3617 (10)  0.5386 (10)  0.4359 (10)
RG-SAN              -           0.7964 (11)  0.6785 (15)  0.4591 (10)  0.3600 (11)  0.5348 (12)  0.4314 (11)
HAM                 -           0.7799 (19)  0.6373 (20)  0.4148 (21)  0.3324 (12)  0.4967 (21)  0.4007 (12)
    Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
CSA-M3LM            -           0.8137 (8)   0.6241 (21)  0.4544 (12)  0.3317 (13)  0.5349 (11)  0.3972 (13)
GALA-Grounder + 2D  -           0.7947 (12)  0.5713 (30)  0.4525 (14)  0.3202 (14)  0.5292 (13)  0.3765 (19)
GALA-Grounder       -           0.7824 (18)  0.5796 (28)  0.4391 (15)  0.3131 (15)  0.5161 (15)  0.3728 (20)
bo3d-1              -           0.7469 (33)  0.5606 (33)  0.4539 (13)  0.3124 (16)  0.5196 (14)  0.3680 (22)
3DJCG(Grounding)    permissive  0.7675 (27)  0.6059 (23)  0.4389 (16)  0.3117 (17)  0.5126 (16)  0.3776 (18)
    Daigang Cai, Lichen Zhao, Jing Zhang†, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
3DVG-Trans +        permissive  0.7733 (24)  0.5787 (29)  0.4370 (17)  0.3102 (18)  0.5124 (17)  0.3704 (21)
    Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
D3Net               permissive  0.7923 (13)  0.6843 (10)  0.3905 (25)  0.3074 (19)  0.4806 (24)  0.3919 (14)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
SAF                 -           0.6348 (42)  0.5647 (32)  0.3726 (30)  0.3009 (20)  0.4314 (36)  0.3601 (25)
ContraRefer         -           0.7832 (17)  0.6801 (13)  0.3850 (26)  0.2947 (21)  0.4743 (25)  0.3811 (15)
FE-3DGQA            -           0.7857 (15)  0.5862 (27)  0.4317 (18)  0.2935 (22)  0.5111 (18)  0.3592 (26)
3DVG-Transformer    permissive  0.7576 (29)  0.5515 (34)  0.4224 (20)  0.2933 (23)  0.4976 (20)  0.3512 (30)
    Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
secg                -           0.7288 (35)  0.6175 (22)  0.3696 (32)  0.2933 (23)  0.4501 (33)  0.3660 (24)
HGT                 -           0.7692 (26)  0.5886 (26)  0.4141 (22)  0.2924 (25)  0.4937 (22)  0.3588 (28)
Clip                -           0.7733 (24)  0.6810 (12)  0.3619 (35)  0.2919 (26)  0.4542 (30)  0.3791 (16)
Clip-pre            -           0.7766 (22)  0.6843 (10)  0.3617 (37)  0.2904 (27)  0.4547 (29)  0.3787 (17)
SAVG                -           0.7758 (23)  0.5664 (31)  0.4236 (19)  0.2826 (28)  0.5026 (19)  0.3462 (31)
Se2d                -           0.7799 (19)  0.6628 (17)  0.3636 (33)  0.2823 (29)  0.4569 (27)  0.3677 (23)
D3Net - Pretrained  permissive  0.7659 (28)  0.6579 (18)  0.3619 (35)  0.2726 (30)  0.4525 (32)  0.3590 (27)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
InstanceRefer       permissive  0.7782 (21)  0.6669 (16)  0.3457 (40)  0.2688 (31)  0.4427 (35)  0.3580 (29)
    Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li*, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
TransformerVG       -           0.7502 (31)  0.5977 (24)  0.3712 (31)  0.2628 (32)  0.4562 (28)  0.3379 (33)
PointGroup_MCAN     -           0.7510 (30)  0.6397 (19)  0.3271 (42)  0.2535 (33)  0.4222 (38)  0.3401 (32)
TGNN                -           0.6834 (39)  0.5894 (25)  0.3312 (41)  0.2526 (34)  0.4102 (41)  0.3281 (34)
    Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
BEAUTY-DETR         copyleft    0.7848 (16)  0.5499 (35)  0.3934 (24)  0.2480 (35)  0.4811 (23)  0.3157 (35)
    Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
henet               -           0.7110 (36)  0.5180 (38)  0.3936 (23)  0.2472 (36)  0.4590 (26)  0.3030 (37)
grounding           -           0.7298 (34)  0.5458 (36)  0.3822 (28)  0.2421 (37)  0.4538 (31)  0.3046 (36)
SRGA                -           0.7494 (32)  0.5128 (39)  0.3631 (34)  0.2218 (38)  0.4497 (34)  0.2871 (38)
ScanRefer           permissive  0.6859 (38)  0.4353 (42)  0.3488 (39)  0.2097 (39)  0.4244 (37)  0.2603 (41)
    Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
SPANet              -           0.5614 (46)  0.4641 (41)  0.2800 (48)  0.2071 (40)  0.3431 (49)  0.2647 (40)
SR-GAB              -           0.7016 (37)  0.5202 (37)  0.3233 (44)  0.1959 (41)  0.4081 (42)  0.2686 (39)
scanrefer2          -           0.6340 (43)  0.4353 (42)  0.3193 (45)  0.1947 (42)  0.3898 (44)  0.2486 (42)
ScanRefer Baseline  -           0.6422 (41)  0.4196 (44)  0.3090 (46)  0.1832 (43)  0.3837 (45)  0.2362 (44)
bo3d                -           0.5400 (47)  0.1550 (47)  0.3817 (29)  0.1785 (44)  0.4172 (40)  0.1732 (47)
ScanRefer_vanilla   -           0.6488 (40)  0.4056 (45)  0.3052 (47)  0.1782 (45)  0.3823 (46)  0.2292 (45)
pairwisemethod      -           0.5779 (45)  0.3603 (46)  0.2792 (49)  0.1746 (46)  0.3462 (48)  0.2163 (46)
TransformerRefer    -           0.6010 (44)  0.4658 (40)  0.2540 (50)  0.1730 (47)  0.3318 (50)  0.2386 (43)
Co3d3               -           0.5326 (48)  0.1369 (48)  0.3848 (27)  0.1651 (48)  0.4179 (39)  0.1588 (48)
Co3d2               -           0.5070 (49)  0.1195 (50)  0.3569 (38)  0.1511 (49)  0.3906 (43)  0.1440 (49)
bo3d0               -           0.4823 (50)  0.1278 (49)  0.3271 (42)  0.1394 (50)  0.3619 (47)  0.1368 (50)
Co3d                -           0.0000 (51)  0.0000 (51)  0.0000 (51)  0.0000 (51)  0.0000 (51)  0.0000 (51)
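
The page does not define the acc@0.25IoU and acc@0.5IoU columns. Under the usual reading, which is an assumption here, a prediction counts as correct when the IoU between the predicted and ground-truth 3D bounding boxes reaches the threshold, and the accuracy is the fraction of correct predictions. The Python sketch below illustrates that reading for axis-aligned boxes. The names `Box`, `box_iou_3d`, and `acc_at_iou` are hypothetical and are not taken from the benchmark's evaluation code, and whether the official comparison is strict (`>`) or inclusive (`>=`) is also an assumption.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (inclusive comparison is an assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy data: one exact match and one box shifted by half its width.
    preds = [(0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1)]
    gts = [(0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1)]
    print(acc_at_iou(preds, gts, 0.25), acc_at_iou(preds, gts, 0.5))
```

In the toy data, the shifted box has an IoU of 1/3 with its target, so it passes the 0.25 threshold but fails the 0.5 threshold. This mirrors why every method in the table scores lower at 0.5 than at 0.25.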