This table lists the results on the ScanRefer Localization Benchmark. Performance is reported as acc@kIoU, the fraction of language descriptions for which the predicted 3D bounding box overlaps the ground-truth box with an IoU of at least k, at thresholds k = 0.25 and k = 0.5. Scores are given for the "Unique" subset (the target is the only object of its class in the scene), the "Multiple" subset (same-class distractors are present), and "Overall". Rows are ordered by Multiple acc@0.25IoU; the number in parentheses after each score is the method's rank in that column, and the (permissive)/(copyleft) tags after some method names are code-license labels carried over from the benchmark listing.
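For context, the acc@kIoU metric can be computed as sketched below. This is a minimal illustration under simplifying assumptions, not the official evaluation script: it assumes axis-aligned boxes encoded as (xmin, ymin, zmin, xmax, ymax, zmax), and the names `box3d_iou` and `acc_at_iou` are made up for this example.

```python
import numpy as np

def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """acc@kIoU: fraction of descriptions whose predicted box matches
    the ground-truth box with IoU >= threshold."""
    hits = [box3d_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)

# Toy usage: one exact match and one weak overlap (IoU ~0.07).
preds = [np.array([0., 0., 0., 1., 1., 1.]), np.array([0., 0., 0., 2., 2., 2.])]
gts   = [np.array([0., 0., 0., 1., 1., 1.]), np.array([1., 1., 1., 3., 3., 3.])]
print(acc_at_iou(preds, gts, 0.25))  # 0.5
print(acc_at_iou(preds, gts, 0.50))  # 0.5
```

The Unique, Multiple, and Overall columns apply this same computation to the corresponding subset of descriptions.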


| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| CORE-3DVG | 0.8557 (2) | 0.6867 (7) | 0.5275 (1) | 0.3850 (5) | 0.6011 (1) | 0.4527 (7) |
| M3DRef-test | 0.7865 (11) | 0.6793 (12) | 0.4963 (2) | 0.3977 (3) | 0.5614 (3) | 0.4608 (3) |
| cus3d | 0.8384 (3) | 0.7073 (5) | 0.4908 (3) | 0.4000 (2) | 0.5688 (2) | 0.4689 (2) |
| pointclip | 0.8211 (4) | 0.7082 (4) | 0.4803 (4) | 0.3884 (4) | 0.5567 (5) | 0.4601 (4) |
| ConcreteNet | 0.8607 (1) | 0.7923 (1) | 0.4746 (5) | 0.4091 (1) | 0.5612 (4) | 0.4950 (1) |
| M3DRef-SCLIP | 0.7997 (7) | 0.7123 (2) | 0.4708 (6) | 0.3805 (7) | 0.5445 (6) | 0.4549 (5) |
| M3DRef-CLIP (permissive) | 0.7980 (8) | 0.7085 (3) | 0.4692 (7) | 0.3807 (6) | 0.5433 (7) | 0.4545 (6) |
| RG-SAN | 0.7964 (9) | 0.6785 (13) | 0.4591 (8) | 0.3600 (9) | 0.5348 (10) | 0.4314 (9) |
| 3DInsVG | 0.8170 (5) | 0.6925 (6) | 0.4582 (9) | 0.3617 (8) | 0.5386 (8) | 0.4359 (8) |
| CSA-M3LM | 0.8137 (6) | 0.6241 (19) | 0.4544 (10) | 0.3317 (11) | 0.5349 (9) | 0.3972 (11) |
| bo3d-1 | 0.7469 (29) | 0.5606 (29) | 0.4539 (11) | 0.3124 (12) | 0.5196 (11) | 0.3680 (18) |
| 3DJCG (Grounding) (permissive) | 0.7675 (23) | 0.6059 (21) | 0.4389 (12) | 0.3117 (13) | 0.5126 (12) | 0.3776 (16) |
| 3DVG-Trans + (permissive) | 0.7733 (20) | 0.5787 (26) | 0.4370 (13) | 0.3102 (14) | 0.5124 (13) | 0.3704 (17) |
| FE-3DGQA | 0.7857 (12) | 0.5862 (25) | 0.4317 (14) | 0.2935 (18) | 0.5111 (14) | 0.3592 (22) |
| SAVG | 0.7758 (19) | 0.5664 (27) | 0.4236 (15) | 0.2826 (24) | 0.5026 (15) | 0.3462 (27) |
| 3DVG-Transformer (permissive) | 0.7576 (25) | 0.5515 (30) | 0.4224 (16) | 0.2933 (19) | 0.4976 (16) | 0.3512 (26) |
| HAM | 0.7799 (15) | 0.6373 (18) | 0.4148 (17) | 0.3324 (10) | 0.4967 (17) | 0.4007 (10) |
| HGT | 0.7692 (22) | 0.5886 (24) | 0.4141 (18) | 0.2924 (21) | 0.4937 (18) | 0.3588 (24) |
| henet | 0.7110 (32) | 0.5180 (34) | 0.3936 (19) | 0.2472 (32) | 0.4590 (22) | 0.3030 (33) |
| BEAUTY-DETR (copyleft) | 0.7848 (13) | 0.5499 (31) | 0.3934 (20) | 0.2480 (31) | 0.4811 (19) | 0.3157 (31) |
| D3Net (permissive) | 0.7923 (10) | 0.6843 (8) | 0.3905 (21) | 0.3074 (15) | 0.4806 (20) | 0.3919 (12) |
| ContraRefer | 0.7832 (14) | 0.6801 (11) | 0.3850 (22) | 0.2947 (17) | 0.4743 (21) | 0.3811 (13) |
| Co3d3 | 0.5326 (44) | 0.1369 (44) | 0.3848 (23) | 0.1651 (44) | 0.4179 (35) | 0.1588 (44) |
| grounding | 0.7298 (30) | 0.5458 (32) | 0.3822 (24) | 0.2421 (33) | 0.4538 (27) | 0.3046 (32) |
| bo3d | 0.5400 (43) | 0.1550 (43) | 0.3817 (25) | 0.1785 (40) | 0.4172 (36) | 0.1732 (43) |
| SAF | 0.6348 (38) | 0.5647 (28) | 0.3726 (26) | 0.3009 (16) | 0.4314 (32) | 0.3601 (21) |
| TransformerVG | 0.7502 (27) | 0.5977 (22) | 0.3712 (27) | 0.2628 (28) | 0.4562 (24) | 0.3379 (29) |
| secg | 0.7288 (31) | 0.6175 (20) | 0.3696 (28) | 0.2933 (19) | 0.4501 (29) | 0.3660 (20) |
| Se2d | 0.7799 (15) | 0.6628 (15) | 0.3636 (29) | 0.2823 (25) | 0.4569 (23) | 0.3677 (19) |
| SRGA | 0.7494 (28) | 0.5128 (35) | 0.3631 (30) | 0.2218 (34) | 0.4497 (30) | 0.2871 (34) |
| D3Net - Pretrained (permissive) | 0.7659 (24) | 0.6579 (16) | 0.3619 (31) | 0.2726 (26) | 0.4525 (28) | 0.3590 (23) |
| Clip | 0.7733 (20) | 0.6810 (10) | 0.3619 (31) | 0.2919 (22) | 0.4542 (26) | 0.3791 (14) |
| Clip-pre | 0.7766 (18) | 0.6843 (8) | 0.3617 (33) | 0.2904 (23) | 0.4547 (25) | 0.3787 (15) |
| Co3d2 | 0.5070 (45) | 0.1195 (46) | 0.3569 (34) | 0.1511 (45) | 0.3906 (39) | 0.1440 (45) |
| ScanRefer (permissive) | 0.6859 (34) | 0.4353 (38) | 0.3488 (35) | 0.2097 (35) | 0.4244 (33) | 0.2603 (37) |
| InstanceRefer (permissive) | 0.7782 (17) | 0.6669 (14) | 0.3457 (36) | 0.2688 (27) | 0.4427 (31) | 0.3580 (25) |
| TGNN | 0.6834 (35) | 0.5894 (23) | 0.3312 (37) | 0.2526 (30) | 0.4102 (37) | 0.3281 (30) |
| PointGroup_MCAN | 0.7510 (26) | 0.6397 (17) | 0.3271 (38) | 0.2535 (29) | 0.4222 (34) | 0.3401 (28) |
| bo3d0 | 0.4823 (46) | 0.1278 (45) | 0.3271 (38) | 0.1394 (46) | 0.3619 (43) | 0.1368 (46) |
| SR-GAB | 0.7016 (33) | 0.5202 (33) | 0.3233 (40) | 0.1959 (37) | 0.4081 (38) | 0.2686 (35) |
| scanrefer2 | 0.6340 (39) | 0.4353 (38) | 0.3193 (41) | 0.1947 (38) | 0.3898 (40) | 0.2486 (38) |
| ScanRefer Baseline | 0.6422 (37) | 0.4196 (40) | 0.3090 (42) | 0.1832 (39) | 0.3837 (41) | 0.2362 (40) |
| ScanRefer_vanilla | 0.6488 (36) | 0.4056 (41) | 0.3052 (43) | 0.1782 (41) | 0.3823 (42) | 0.2292 (41) |
| SPANet | 0.5614 (42) | 0.4641 (37) | 0.2800 (44) | 0.2071 (36) | 0.3431 (45) | 0.2647 (36) |
| pairwisemethod | 0.5779 (41) | 0.3603 (42) | 0.2792 (45) | 0.1746 (42) | 0.3462 (44) | 0.2163 (42) |
| TransformerRefer | 0.6010 (40) | 0.4658 (36) | 0.2540 (46) | 0.1730 (43) | 0.3318 (46) | 0.2386 (39) |
| Co3d | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) |

References:

- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- 3DJCG (Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans + / 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.