This table lists the results on the ScanRefer Localization Benchmark. For each method, accuracy is reported at IoU thresholds of 0.25 and 0.5 (acc@0.25IoU, acc@0.5IoU) on the Unique, Multiple, and Overall subsets; the Unique subset contains descriptions whose target is the only object of its class in the scene, while the Multiple subset contains descriptions with same-class distractors. The number in parentheses after each value is the method's rank in that column, and rows are sorted by acc@0.5IoU on the Unique subset. A short sketch of the metric computation follows the table.


| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| ConcreteNet | 0.8607 (1) | 0.7923 (1) | 0.4746 (6) | 0.4091 (1) | 0.5612 (5) | 0.4950 (1) |
| M3DRef-SCLIP | 0.7997 (8) | 0.7123 (2) | 0.4708 (7) | 0.3805 (8) | 0.5445 (7) | 0.4549 (6) |
| M3DRef-CLIP (permissive) | 0.7980 (9) | 0.7085 (3) | 0.4692 (8) | 0.3807 (7) | 0.5433 (8) | 0.4545 (7) |
| pointclip | 0.8211 (4) | 0.7082 (4) | 0.4803 (5) | 0.3884 (5) | 0.5567 (6) | 0.4601 (5) |
| cus3d | 0.8384 (3) | 0.7073 (5) | 0.4908 (4) | 0.4000 (2) | 0.5688 (3) | 0.4689 (2) |
| 3DInsVG | 0.8170 (6) | 0.6925 (6) | 0.4582 (10) | 0.3617 (9) | 0.5386 (9) | 0.4359 (9) |
| DMO-3DG | 0.8195 (5) | 0.6900 (7) | 0.4975 (2) | 0.3967 (4) | 0.5697 (2) | 0.4625 (3) |
| CORE-3DVG | 0.8557 (2) | 0.6867 (8) | 0.5275 (1) | 0.3850 (6) | 0.6011 (1) | 0.4527 (8) |
| D3Net (permissive) | 0.7923 (11) | 0.6843 (9) | 0.3905 (22) | 0.3074 (16) | 0.4806 (21) | 0.3919 (13) |
| Clip-pre | 0.7766 (19) | 0.6843 (9) | 0.3617 (34) | 0.2904 (24) | 0.4547 (26) | 0.3787 (16) |
| Clip | 0.7733 (21) | 0.6810 (11) | 0.3619 (32) | 0.2919 (23) | 0.4542 (27) | 0.3791 (15) |
| ContraRefer | 0.7832 (15) | 0.6801 (12) | 0.3850 (23) | 0.2947 (18) | 0.4743 (22) | 0.3811 (14) |
| M3DRef-test | 0.7865 (12) | 0.6793 (13) | 0.4963 (3) | 0.3977 (3) | 0.5614 (4) | 0.4608 (4) |
| RG-SAN | 0.7964 (10) | 0.6785 (14) | 0.4591 (9) | 0.3600 (10) | 0.5348 (11) | 0.4314 (10) |
| InstanceRefer (permissive) | 0.7782 (18) | 0.6669 (15) | 0.3457 (37) | 0.2688 (28) | 0.4427 (32) | 0.3580 (26) |
| Se2d | 0.7799 (16) | 0.6628 (16) | 0.3636 (30) | 0.2823 (26) | 0.4569 (24) | 0.3677 (20) |
| D3Net - Pretrained (permissive) | 0.7659 (25) | 0.6579 (17) | 0.3619 (32) | 0.2726 (27) | 0.4525 (29) | 0.3590 (24) |
| PointGroup_MCAN | 0.7510 (27) | 0.6397 (18) | 0.3271 (39) | 0.2535 (30) | 0.4222 (35) | 0.3401 (29) |
| HAM | 0.7799 (16) | 0.6373 (19) | 0.4148 (18) | 0.3324 (11) | 0.4967 (18) | 0.4007 (11) |
| CSA-M3LM | 0.8137 (7) | 0.6241 (20) | 0.4544 (11) | 0.3317 (12) | 0.5349 (10) | 0.3972 (12) |
| secg | 0.7288 (32) | 0.6175 (21) | 0.3696 (29) | 0.2933 (20) | 0.4501 (30) | 0.3660 (21) |
| 3DJCG (Grounding) (permissive) | 0.7675 (24) | 0.6059 (22) | 0.4389 (13) | 0.3117 (14) | 0.5126 (13) | 0.3776 (17) |
| TransformerVG | 0.7502 (28) | 0.5977 (23) | 0.3712 (28) | 0.2628 (29) | 0.4562 (25) | 0.3379 (30) |
| TGNN | 0.6834 (36) | 0.5894 (24) | 0.3312 (38) | 0.2526 (31) | 0.4102 (38) | 0.3281 (31) |
| HGT | 0.7692 (23) | 0.5886 (25) | 0.4141 (19) | 0.2924 (22) | 0.4937 (19) | 0.3588 (25) |
| FE-3DGQA | 0.7857 (13) | 0.5862 (26) | 0.4317 (15) | 0.2935 (19) | 0.5111 (15) | 0.3592 (23) |
| 3DVG-Trans + (permissive) | 0.7733 (21) | 0.5787 (27) | 0.4370 (14) | 0.3102 (15) | 0.5124 (14) | 0.3704 (18) |
| SAVG | 0.7758 (20) | 0.5664 (28) | 0.4236 (16) | 0.2826 (25) | 0.5026 (16) | 0.3462 (28) |
| SAF | 0.6348 (39) | 0.5647 (29) | 0.3726 (27) | 0.3009 (17) | 0.4314 (33) | 0.3601 (22) |
| bo3d-1 | 0.7469 (30) | 0.5606 (30) | 0.4539 (12) | 0.3124 (13) | 0.5196 (12) | 0.3680 (19) |
| 3DVG-Transformer (permissive) | 0.7576 (26) | 0.5515 (31) | 0.4224 (17) | 0.2933 (20) | 0.4976 (17) | 0.3512 (27) |
| BEAUTY-DETR (copyleft) | 0.7848 (14) | 0.5499 (32) | 0.3934 (21) | 0.2480 (32) | 0.4811 (20) | 0.3157 (32) |
| grounding | 0.7298 (31) | 0.5458 (33) | 0.3822 (25) | 0.2421 (34) | 0.4538 (28) | 0.3046 (33) |
| SR-GAB | 0.7016 (34) | 0.5202 (34) | 0.3233 (41) | 0.1959 (38) | 0.4081 (39) | 0.2686 (36) |
| henet | 0.7110 (33) | 0.5180 (35) | 0.3936 (20) | 0.2472 (33) | 0.4590 (23) | 0.3030 (34) |
| SRGA | 0.7494 (29) | 0.5128 (36) | 0.3631 (31) | 0.2218 (35) | 0.4497 (31) | 0.2871 (35) |
| TransformerRefer | 0.6010 (41) | 0.4658 (37) | 0.2540 (47) | 0.1730 (44) | 0.3318 (47) | 0.2386 (40) |
| SPANet | 0.5614 (43) | 0.4641 (38) | 0.2800 (45) | 0.2071 (37) | 0.3431 (46) | 0.2647 (37) |
| scanrefer2 | 0.6340 (40) | 0.4353 (39) | 0.3193 (42) | 0.1947 (39) | 0.3898 (41) | 0.2486 (39) |
| ScanRefer (permissive) | 0.6859 (35) | 0.4353 (39) | 0.3488 (36) | 0.2097 (36) | 0.4244 (34) | 0.2603 (38) |
| ScanRefer Baseline | 0.6422 (38) | 0.4196 (41) | 0.3090 (43) | 0.1832 (40) | 0.3837 (42) | 0.2362 (41) |
| ScanRefer_vanilla | 0.6488 (37) | 0.4056 (42) | 0.3052 (44) | 0.1782 (42) | 0.3823 (43) | 0.2292 (42) |
| pairwisemethod | 0.5779 (42) | 0.3603 (43) | 0.2792 (46) | 0.1746 (43) | 0.3462 (45) | 0.2163 (43) |
| bo3d | 0.5400 (44) | 0.1550 (44) | 0.3817 (26) | 0.1785 (41) | 0.4172 (37) | 0.1732 (44) |
| Co3d3 | 0.5326 (45) | 0.1369 (45) | 0.3848 (24) | 0.1651 (45) | 0.4179 (36) | 0.1588 (45) |
| bo3d0 | 0.4823 (47) | 0.1278 (46) | 0.3271 (39) | 0.1394 (47) | 0.3619 (44) | 0.1368 (47) |
| Co3d2 | 0.5070 (46) | 0.1195 (47) | 0.3569 (35) | 0.1511 (46) | 0.3906 (40) | 0.1440 (46) |
| Co3d | 0.0000 (48) | 0.0000 (48) | 0.0000 (48) | 0.0000 (48) | 0.0000 (48) | 0.0000 (48) |

Method references:
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- D3Net, D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- 3DJCG (Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- 3DVG-Trans +, 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
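The acc@kIoU metric counts a description as correctly localized when the predicted 3D bounding box overlaps the annotated ground-truth box with an IoU of at least k (0.25 or 0.5). The snippet below is a minimal sketch of that computation, assuming axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax); the function names are illustrative and this is not the official evaluation script.

```python
import numpy as np

def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])                     # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])                     # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))      # intersection volume (0 if disjoint)
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of descriptions whose predicted box matches the ground truth
    with IoU >= threshold, i.e. acc@0.25IoU or acc@0.5IoU."""
    hits = [box3d_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))
```

The Unique, Multiple, and Overall columns correspond to evaluating this accuracy over the respective subsets of test descriptions.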