This table lists the results on the ScanRefer Localization Benchmark. Methods are evaluated by grounding accuracy at two IoU thresholds (acc@0.25IoU and acc@0.5IoU) on three subsets: Unique (the scene contains only one object of the target class), Multiple (the scene contains distractor objects of the same class), and Overall.


Ranks within each column are given in parentheses; bracketed numbers refer to the publications listed below the table. License tags (permissive, copyleft) are retained from the original listing.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| ConcreteNet [1] | 0.8607 (1) | 0.7923 (1) | 0.4746 (4) | 0.4091 (1) | 0.5612 (3) | 0.4950 (1) |
| CORE-3DVG | 0.8557 (2) | 0.6867 (7) | 0.5275 (1) | 0.3850 (4) | 0.6011 (1) | 0.4527 (6) |
| cus3d | 0.8384 (3) | 0.7073 (5) | 0.4908 (2) | 0.4000 (2) | 0.5688 (2) | 0.4689 (2) |
| pointclip | 0.8211 (4) | 0.7082 (4) | 0.4803 (3) | 0.3884 (3) | 0.5567 (4) | 0.4601 (3) |
| 3DInsVG | 0.8170 (5) | 0.6925 (6) | 0.4582 (7) | 0.3617 (7) | 0.5386 (7) | 0.4359 (7) |
| CSA-M3LM | 0.8137 (6) | 0.6241 (17) | 0.4544 (8) | 0.3317 (9) | 0.5349 (8) | 0.3972 (9) |
| 3D-REC | 0.7997 (7) | 0.7123 (2) | 0.4708 (5) | 0.3805 (6) | 0.5445 (5) | 0.4549 (4) |
| M3DRef-CLIP [2] (permissive) | 0.7980 (8) | 0.7085 (3) | 0.4692 (6) | 0.3807 (5) | 0.5433 (6) | 0.4545 (5) |
| D3Net [3] (permissive) | 0.7923 (9) | 0.6843 (8) | 0.3905 (19) | 0.3074 (13) | 0.4806 (18) | 0.3919 (10) |
| FE-3DGQA | 0.7857 (10) | 0.5862 (23) | 0.4317 (12) | 0.2935 (15) | 0.5111 (12) | 0.3592 (19) |
| BEAUTY-DETR [4] (copyleft) | 0.7848 (11) | 0.5499 (28) | 0.3934 (18) | 0.2480 (28) | 0.4811 (17) | 0.3157 (28) |
| ContraRefer | 0.7832 (12) | 0.6801 (11) | 0.3850 (20) | 0.2947 (14) | 0.4743 (19) | 0.3811 (11) |
| Se2d | 0.7799 (13) | 0.6628 (13) | 0.3636 (26) | 0.2823 (22) | 0.4569 (21) | 0.3677 (17) |
| HAM [5] | 0.7799 (13) | 0.6373 (16) | 0.4148 (15) | 0.3324 (8) | 0.4967 (15) | 0.4007 (8) |
| InstanceRefer [6] (permissive) | 0.7782 (15) | 0.6669 (12) | 0.3457 (33) | 0.2688 (24) | 0.4427 (29) | 0.3580 (22) |
| Clip-pre | 0.7766 (16) | 0.6843 (8) | 0.3617 (30) | 0.2904 (20) | 0.4547 (23) | 0.3787 (13) |
| SAVG | 0.7758 (17) | 0.5664 (25) | 0.4236 (13) | 0.2826 (21) | 0.5026 (13) | 0.3462 (24) |
| 3DVG-Trans + [7] (permissive) | 0.7733 (18) | 0.5787 (24) | 0.4370 (11) | 0.3102 (12) | 0.5124 (11) | 0.3704 (15) |
| Clip | 0.7733 (18) | 0.6810 (10) | 0.3619 (28) | 0.2919 (19) | 0.4542 (24) | 0.3791 (12) |
| HGT | 0.7692 (20) | 0.5886 (22) | 0.4141 (16) | 0.2924 (18) | 0.4937 (16) | 0.3588 (21) |
| 3DJCG (Grounding) [8] (permissive) | 0.7675 (21) | 0.6059 (19) | 0.4389 (10) | 0.3117 (11) | 0.5126 (10) | 0.3776 (14) |
| D3Net - Pretrained [3] (permissive) | 0.7659 (22) | 0.6579 (14) | 0.3619 (28) | 0.2726 (23) | 0.4525 (26) | 0.3590 (20) |
| 3DVG-Transformer [7] (permissive) | 0.7576 (23) | 0.5515 (27) | 0.4224 (14) | 0.2933 (16) | 0.4976 (14) | 0.3512 (23) |
| PointGroup_MCAN | 0.7510 (24) | 0.6397 (15) | 0.3271 (35) | 0.2535 (26) | 0.4222 (31) | 0.3401 (25) |
| TransformerVG | 0.7502 (25) | 0.5977 (20) | 0.3712 (24) | 0.2628 (25) | 0.4562 (22) | 0.3379 (26) |
| SRGA | 0.7494 (26) | 0.5128 (32) | 0.3631 (27) | 0.2218 (31) | 0.4497 (28) | 0.2871 (31) |
| bo3d-1 | 0.7469 (27) | 0.5606 (26) | 0.4539 (9) | 0.3124 (10) | 0.5196 (9) | 0.3680 (16) |
| grounding | 0.7298 (28) | 0.5458 (29) | 0.3822 (22) | 0.2421 (30) | 0.4538 (25) | 0.3046 (29) |
| secg | 0.7288 (29) | 0.6175 (18) | 0.3696 (25) | 0.2933 (16) | 0.4501 (27) | 0.3660 (18) |
| henet | 0.7110 (30) | 0.5180 (31) | 0.3936 (17) | 0.2472 (29) | 0.4590 (20) | 0.3030 (30) |
| SR-GAB | 0.7016 (31) | 0.5202 (30) | 0.3233 (37) | 0.1959 (34) | 0.4081 (35) | 0.2686 (32) |
| ScanRefer [9] (permissive) | 0.6859 (32) | 0.4353 (35) | 0.3488 (32) | 0.2097 (32) | 0.4244 (30) | 0.2603 (34) |
| TGNN [10] | 0.6834 (33) | 0.5894 (21) | 0.3312 (34) | 0.2526 (27) | 0.4102 (34) | 0.3281 (27) |
| ScanRefer_vanilla | 0.6488 (34) | 0.4056 (38) | 0.3052 (40) | 0.1782 (38) | 0.3823 (39) | 0.2292 (38) |
| ScanRefer Baseline | 0.6422 (35) | 0.4196 (37) | 0.3090 (39) | 0.1832 (36) | 0.3837 (38) | 0.2362 (37) |
| scanrefer2 | 0.6340 (36) | 0.4353 (35) | 0.3193 (38) | 0.1947 (35) | 0.3898 (37) | 0.2486 (35) |
| TransformerRefer | 0.6010 (37) | 0.4658 (33) | 0.2540 (43) | 0.1730 (40) | 0.3318 (43) | 0.2386 (36) |
| pairwisemethod | 0.5779 (38) | 0.3603 (39) | 0.2792 (42) | 0.1746 (39) | 0.3462 (41) | 0.2163 (39) |
| SPANet | 0.5614 (39) | 0.4641 (34) | 0.2800 (41) | 0.2071 (33) | 0.3431 (42) | 0.2647 (33) |
| bo3d | 0.5400 (40) | 0.1550 (40) | 0.3817 (23) | 0.1785 (37) | 0.4172 (33) | 0.1732 (40) |
| Co3d3 | 0.5326 (41) | 0.1369 (41) | 0.3848 (21) | 0.1651 (41) | 0.4179 (32) | 0.1588 (41) |
| Co3d2 | 0.5070 (42) | 0.1195 (43) | 0.3569 (31) | 0.1511 (42) | 0.3906 (36) | 0.1440 (42) |
| bo3d0 | 0.4823 (43) | 0.1278 (42) | 0.3271 (35) | 0.1394 (43) | 0.3619 (40) | 0.1368 (43) |
| Co3d | 0.0000 (44) | 0.0000 (44) | 0.0000 (44) | 0.0000 (44) | 0.0000 (44) | 0.0000 (44) |

References:

[1] Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
[2] Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
[3] Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
[4] Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
[5] Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
[6] Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
[7] Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
[8] Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
[9] Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
[10] Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
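The acc@kIoU metrics above count a prediction as correct when the predicted 3D bounding box overlaps the ground-truth box with an intersection-over-union of at least k (0.25 or 0.5). As a minimal sketch of how such a metric is computed, assuming axis-aligned boxes in `(xmin, ymin, zmin, xmax, ymax, zmax)` form (the official evaluation script may differ in details):

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    # Intersection extents along each axis, clamped at zero when boxes are disjoint.
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of predictions whose IoU with the ground-truth box meets the threshold."""
    hits = [iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)

# Hypothetical example: one predicted box shifted by one unit along x.
preds = [(0, 0, 0, 2, 2, 2)]
gts = [(1, 0, 0, 3, 2, 2)]
print(acc_at_iou(preds, gts, 0.25))  # IoU is 1/3, so this counts as a hit at 0.25
print(acc_at_iou(preds, gts, 0.5))   # ...but a miss at 0.5
```

This also illustrates why the acc@0.5IoU columns are uniformly lower than acc@0.25IoU: the stricter threshold rejects loosely localized boxes that the looser one accepts.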