This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Ranks within each column are shown in parentheses. Rows are sorted by Multiple acc@0.25IoU. A dash marks an empty Info cell.

                                Unique                    Multiple                  Overall
Method              Info        acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU
CORE-3DVG           -           0.8557 (2)   0.6867 (4)   0.5275 (1)   0.3850 (2)   0.6011 (1)   0.4527 (3)
ConcreteNet         -           0.8607 (1)   0.7923 (1)   0.4746 (2)   0.4091 (1)   0.5612 (2)   0.4950 (1)
    Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
M3DRef-CLIP         permissive  0.7980 (5)   0.7085 (2)   0.4692 (3)   0.3807 (3)   0.5433 (3)   0.4545 (2)
    Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
3DInsVG             -           0.8170 (3)   0.6925 (3)   0.4582 (4)   0.3617 (4)   0.5386 (4)   0.4359 (4)
CSA-M3LM            -           0.8137 (4)   0.6241 (14)  0.4544 (5)   0.3317 (6)   0.5349 (5)   0.3972 (6)
bo3d-1              -           0.7469 (24)  0.5606 (23)  0.4539 (6)   0.3124 (7)   0.5196 (6)   0.3680 (13)
3DJCG(Grounding)    permissive  0.7675 (18)  0.6059 (16)  0.4389 (7)   0.3117 (8)   0.5126 (7)   0.3776 (11)
    Daigang Cai, Lichen Zhao, Jing Zhang†, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
3DVG-Trans +        permissive  0.7733 (15)  0.5787 (21)  0.4370 (8)   0.3102 (9)   0.5124 (8)   0.3704 (12)
    Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
FE-3DGQA            -           0.7857 (7)   0.5862 (20)  0.4317 (9)   0.2935 (12)  0.5111 (9)   0.3592 (16)
SAVG                -           0.7758 (14)  0.5664 (22)  0.4236 (10)  0.2826 (18)  0.5026 (10)  0.3462 (21)
3DVG-Transformer    permissive  0.7576 (20)  0.5515 (24)  0.4224 (11)  0.2933 (13)  0.4976 (11)  0.3512 (20)
    Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
HAM                 -           0.7799 (10)  0.6373 (13)  0.4148 (12)  0.3324 (5)   0.4967 (12)  0.4007 (5)
    Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
HGT                 -           0.7692 (17)  0.5886 (19)  0.4141 (13)  0.2924 (15)  0.4937 (13)  0.3588 (18)
henet               -           0.7110 (27)  0.5180 (28)  0.3936 (14)  0.2472 (26)  0.4590 (17)  0.3030 (27)
BEAUTY-DETR         copyleft    0.7848 (8)   0.5499 (25)  0.3934 (15)  0.2480 (25)  0.4811 (14)  0.3157 (25)
    Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
D3Net               permissive  0.7923 (6)   0.6843 (5)   0.3905 (16)  0.3074 (10)  0.4806 (15)  0.3919 (7)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
ContraRefer         -           0.7832 (9)   0.6801 (8)   0.3850 (17)  0.2947 (11)  0.4743 (16)  0.3811 (8)
Co3d3               -           0.5326 (38)  0.1369 (38)  0.3848 (18)  0.1651 (38)  0.4179 (29)  0.1588 (38)
grounding           -           0.7298 (25)  0.5458 (26)  0.3822 (19)  0.2421 (27)  0.4538 (22)  0.3046 (26)
bo3d                -           0.5400 (37)  0.1550 (37)  0.3817 (20)  0.1785 (34)  0.4172 (30)  0.1732 (37)
TransformerVG       -           0.7502 (22)  0.5977 (17)  0.3712 (21)  0.2628 (22)  0.4562 (19)  0.3379 (23)
secg                -           0.7288 (26)  0.6175 (15)  0.3696 (22)  0.2933 (13)  0.4501 (24)  0.3660 (15)
Se2d                -           0.7799 (10)  0.6628 (10)  0.3636 (23)  0.2823 (19)  0.4569 (18)  0.3677 (14)
SRGA                -           0.7494 (23)  0.5128 (29)  0.3631 (24)  0.2218 (28)  0.4497 (25)  0.2871 (28)
Clip                -           0.7733 (15)  0.6810 (7)   0.3619 (25)  0.2919 (16)  0.4542 (21)  0.3791 (9)
D3Net - Pretrained  permissive  0.7659 (19)  0.6579 (11)  0.3619 (25)  0.2726 (20)  0.4525 (23)  0.3590 (17)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
Clip-pre            -           0.7766 (13)  0.6843 (5)   0.3617 (27)  0.2904 (17)  0.4547 (20)  0.3787 (10)
Co3d2               -           0.5070 (39)  0.1195 (40)  0.3569 (28)  0.1511 (39)  0.3906 (33)  0.1440 (39)
ScanRefer           permissive  0.6859 (29)  0.4353 (32)  0.3488 (29)  0.2097 (29)  0.4244 (27)  0.2603 (31)
    Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
InstanceRefer       permissive  0.7782 (12)  0.6669 (9)   0.3457 (30)  0.2688 (21)  0.4427 (26)  0.3580 (19)
    Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li*, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
TGNN                -           0.6834 (30)  0.5894 (18)  0.3312 (31)  0.2526 (24)  0.4102 (31)  0.3281 (24)
    Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
bo3d0               -           0.4823 (40)  0.1278 (39)  0.3271 (32)  0.1394 (40)  0.3619 (37)  0.1368 (40)
PointGroup_MCAN     -           0.7510 (21)  0.6397 (12)  0.3271 (32)  0.2535 (23)  0.4222 (28)  0.3401 (22)
SR-GAB              -           0.7016 (28)  0.5202 (27)  0.3233 (34)  0.1959 (31)  0.4081 (32)  0.2686 (29)
scanrefer2          -           0.6340 (33)  0.4353 (32)  0.3193 (35)  0.1947 (32)  0.3898 (34)  0.2486 (32)
ScanRefer Baseline  -           0.6422 (32)  0.4196 (34)  0.3090 (36)  0.1832 (33)  0.3837 (35)  0.2362 (34)
ScanRefer_vanilla   -           0.6488 (31)  0.4056 (35)  0.3052 (37)  0.1782 (35)  0.3823 (36)  0.2292 (35)
SPANet              -           0.5614 (36)  0.4641 (31)  0.2800 (38)  0.2071 (30)  0.3431 (39)  0.2647 (30)
pairwisemethod      -           0.5779 (35)  0.3603 (36)  0.2792 (39)  0.1746 (36)  0.3462 (38)  0.2163 (36)
TransformerRefer    -           0.6010 (34)  0.4658 (30)  0.2540 (40)  0.1730 (37)  0.3318 (40)  0.2386 (33)
Co3d                -           0.0000 (41)  0.0000 (41)  0.0000 (41)  0.0000 (41)  0.0000 (41)  0.0000 (41)
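
The acc@0.25IoU and acc@0.5IoU columns report the fraction of descriptions for which the predicted 3D box overlaps the ground-truth box with an intersection-over-union at or above the given threshold. The following is a minimal sketch of that computation, not the benchmark's evaluation code. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and a `>=` comparison against the threshold; the names `box_iou_3d` and `acc_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box layout (illustrative): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (the >= comparison is an assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1)]
    gts = [(0, 0, 0, 1, 1, 3), (5, 5, 5, 6, 6, 6)]
    print(acc_at_iou(preds, gts, 0.25))
    print(acc_at_iou(preds, gts, 0.5))
```

In the usage example, the first pair has an IoU of 1/3 and the second pair does not overlap, so one of two predictions counts as a hit at the 0.25 threshold and none at 0.5. The stricter threshold can only keep or lower the score, which matches every row of the table.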