This table lists the benchmark results for the ScanRefer Localization Benchmark. For each method it reports acc@0.25IoU and acc@0.5IoU on the Unique, Multiple, and Overall subsets; the number in parentheses after each value is the method's rank within that column, and rows are sorted by Overall acc@0.5IoU. A short sketch of how acc@kIoU is computed follows the table.


| Method | Info | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| ConcreteNet | Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. | 0.8607 (1) | 0.7923 (1) | 0.4746 (4) | 0.4091 (1) | 0.5612 (3) | 0.4950 (1) |
| cus3d | | 0.8384 (3) | 0.7073 (4) | 0.4908 (2) | 0.4000 (2) | 0.5688 (2) | 0.4689 (2) |
| pointclip | | 0.8211 (4) | 0.7082 (3) | 0.4803 (3) | 0.3884 (3) | 0.5567 (4) | 0.4601 (3) |
| M3DRef-CLIP (permissive) | Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023 | 0.7980 (7) | 0.7085 (2) | 0.4692 (5) | 0.3807 (5) | 0.5433 (5) | 0.4545 (4) |
| CORE-3DVG | | 0.8557 (2) | 0.6867 (6) | 0.5275 (1) | 0.3850 (4) | 0.6011 (1) | 0.4527 (5) |
| 3DInsVG | | 0.8170 (5) | 0.6925 (5) | 0.4582 (6) | 0.3617 (6) | 0.5386 (6) | 0.4359 (6) |
| HAM | Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding. | 0.7799 (12) | 0.6373 (15) | 0.4148 (14) | 0.3324 (7) | 0.4967 (14) | 0.4007 (7) |
| CSA-M3LM | | 0.8137 (6) | 0.6241 (16) | 0.4544 (7) | 0.3317 (8) | 0.5349 (7) | 0.3972 (8) |
| D3Net (permissive) | Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | 0.7923 (8) | 0.6843 (7) | 0.3905 (18) | 0.3074 (12) | 0.4806 (17) | 0.3919 (9) |
| ContraRefer | | 0.7832 (11) | 0.6801 (10) | 0.3850 (19) | 0.2947 (13) | 0.4743 (18) | 0.3811 (10) |
| Clip | | 0.7733 (17) | 0.6810 (9) | 0.3619 (27) | 0.2919 (18) | 0.4542 (23) | 0.3791 (11) |
| Clip-pre | | 0.7766 (15) | 0.6843 (7) | 0.3617 (29) | 0.2904 (19) | 0.4547 (22) | 0.3787 (12) |
| 3DJCG (Grounding) (permissive) | Daigang Cai, Lichen Zhao, Jing Zhang†, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral | 0.7675 (20) | 0.6059 (18) | 0.4389 (9) | 0.3117 (10) | 0.5126 (9) | 0.3776 (13) |
| 3DVG-Trans + (permissive) | Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | 0.7733 (17) | 0.5787 (23) | 0.4370 (10) | 0.3102 (11) | 0.5124 (10) | 0.3704 (14) |
| bo3d-1 | | 0.7469 (26) | 0.5606 (25) | 0.4539 (8) | 0.3124 (9) | 0.5196 (8) | 0.3680 (15) |
| Se2d | | 0.7799 (12) | 0.6628 (12) | 0.3636 (25) | 0.2823 (21) | 0.4569 (20) | 0.3677 (16) |
| secg | | 0.7288 (28) | 0.6175 (17) | 0.3696 (24) | 0.2933 (15) | 0.4501 (26) | 0.3660 (17) |
| FE-3DGQA | | 0.7857 (9) | 0.5862 (22) | 0.4317 (11) | 0.2935 (14) | 0.5111 (11) | 0.3592 (18) |
| D3Net - Pretrained (permissive) | Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | 0.7659 (21) | 0.6579 (13) | 0.3619 (27) | 0.2726 (22) | 0.4525 (25) | 0.3590 (19) |
| HGT | | 0.7692 (19) | 0.5886 (21) | 0.4141 (15) | 0.2924 (17) | 0.4937 (15) | 0.3588 (20) |
| InstanceRefer (permissive) | Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li*, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021 | 0.7782 (14) | 0.6669 (11) | 0.3457 (32) | 0.2688 (23) | 0.4427 (28) | 0.3580 (21) |
| 3DVG-Transformer (permissive) | Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | 0.7576 (22) | 0.5515 (26) | 0.4224 (13) | 0.2933 (15) | 0.4976 (13) | 0.3512 (22) |
| SAVG | | 0.7758 (16) | 0.5664 (24) | 0.4236 (12) | 0.2826 (20) | 0.5026 (12) | 0.3462 (23) |
| PointGroup_MCAN | | 0.7510 (23) | 0.6397 (14) | 0.3271 (34) | 0.2535 (25) | 0.4222 (30) | 0.3401 (24) |
| TransformerVG | | 0.7502 (24) | 0.5977 (19) | 0.3712 (23) | 0.2628 (24) | 0.4562 (21) | 0.3379 (25) |
| TGNN | Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021 | 0.6834 (32) | 0.5894 (20) | 0.3312 (33) | 0.2526 (26) | 0.4102 (33) | 0.3281 (26) |
| BEAUTY-DETR (copyleft) | Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes. | 0.7848 (10) | 0.5499 (27) | 0.3934 (17) | 0.2480 (27) | 0.4811 (16) | 0.3157 (27) |
| grounding | | 0.7298 (27) | 0.5458 (28) | 0.3822 (21) | 0.2421 (29) | 0.4538 (24) | 0.3046 (28) |
| henet | | 0.7110 (29) | 0.5180 (30) | 0.3936 (16) | 0.2472 (28) | 0.4590 (19) | 0.3030 (29) |
| SRGA | | 0.7494 (25) | 0.5128 (31) | 0.3631 (26) | 0.2218 (30) | 0.4497 (27) | 0.2871 (30) |
| SR-GAB | | 0.7016 (30) | 0.5202 (29) | 0.3233 (36) | 0.1959 (33) | 0.4081 (34) | 0.2686 (31) |
| SPANet | | 0.5614 (38) | 0.4641 (33) | 0.2800 (40) | 0.2071 (32) | 0.3431 (41) | 0.2647 (32) |
| ScanRefer (permissive) | Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020 | 0.6859 (31) | 0.4353 (34) | 0.3488 (31) | 0.2097 (31) | 0.4244 (29) | 0.2603 (33) |
| scanrefer2 | | 0.6340 (35) | 0.4353 (34) | 0.3193 (37) | 0.1947 (34) | 0.3898 (36) | 0.2486 (34) |
| TransformerRefer | | 0.6010 (36) | 0.4658 (32) | 0.2540 (42) | 0.1730 (39) | 0.3318 (42) | 0.2386 (35) |
| ScanRefer Baseline | | 0.6422 (34) | 0.4196 (36) | 0.3090 (38) | 0.1832 (35) | 0.3837 (37) | 0.2362 (36) |
| ScanRefer_vanilla | | 0.6488 (33) | 0.4056 (37) | 0.3052 (39) | 0.1782 (37) | 0.3823 (38) | 0.2292 (37) |
| pairwisemethod | | 0.5779 (37) | 0.3603 (38) | 0.2792 (41) | 0.1746 (38) | 0.3462 (40) | 0.2163 (38) |
| bo3d | | 0.5400 (39) | 0.1550 (39) | 0.3817 (22) | 0.1785 (36) | 0.4172 (32) | 0.1732 (39) |
| Co3d3 | | 0.5326 (40) | 0.1369 (40) | 0.3848 (20) | 0.1651 (40) | 0.4179 (31) | 0.1588 (40) |
| Co3d2 | | 0.5070 (41) | 0.1195 (42) | 0.3569 (30) | 0.1511 (41) | 0.3906 (35) | 0.1440 (41) |
| bo3d0 | | 0.4823 (42) | 0.1278 (41) | 0.3271 (34) | 0.1394 (42) | 0.3619 (39) | 0.1368 (42) |
| Co3d | | 0.0000 (43) | 0.0000 (43) | 0.0000 (43) | 0.0000 (43) | 0.0000 (43) | 0.0000 (43) |
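
For reference, acc@kIoU counts a prediction as correct when the IoU between the predicted and the ground-truth 3D bounding box is at least k (0.25 or 0.5 in this table), and reports the fraction of correct predictions. The snippet below is a minimal sketch of that computation, assuming axis-aligned boxes encoded as (cx, cy, cz, dx, dy, dz); the helper names `box_iou_3d` and `acc_at_iou` are illustrative and not part of the official evaluation script.

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU for boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min, a_max = box_a[:3] - box_a[3:] / 2.0, box_a[:3] + box_a[3:] / 2.0
    b_min, b_max = box_b[:3] - box_b[3:] / 2.0, box_b[:3] + box_b[3:] / 2.0
    # Overlap extent per axis, clamped at zero when the boxes do not intersect.
    inter_dims = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = inter_dims.prod()
    union = box_a[3:].prod() + box_b[3:].prod() - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of descriptions whose predicted box reaches the IoU threshold."""
    ious = [box_iou_3d(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([iou >= threshold for iou in ious]))

# Toy example with two descriptions: the first prediction overlaps its target,
# the second misses entirely, giving acc@0.25 = acc@0.5 = 0.5.
pred = [np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]),
        np.array([2.0, 2.0, 2.0, 1.0, 1.0, 1.0])]
gt   = [np.array([0.1, 0.0, 0.0, 1.0, 1.0, 1.0]),
        np.array([3.5, 2.0, 2.0, 1.0, 1.0, 1.0])]
print(acc_at_iou(pred, gt, 0.25), acc_at_iou(pred, gt, 0.50))
```

On the benchmark, this score is computed separately for the Unique subset (only one object of the target class in the scene), the Multiple subset (several same-class distractors), and Overall.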