This table lists the benchmark results for the ScanRefer Localization Benchmark scenario. Results are grounding accuracies at IoU thresholds of 0.25 and 0.5 (acc@0.25IoU, acc@0.5IoU), reported separately for the Unique and Multiple subsets and Overall; the number in parentheses after each value is the method's rank for that column.


| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| 3DVLP-baseline | 0.7766 (31) | 0.6373 (22) | 0.4572 (15) | 0.3469 (13) | 0.5288 (21) | 0.4120 (13) |
| TransformerVG | 0.7502 (42) | 0.5977 (29) | 0.3712 (43) | 0.2628 (44) | 0.4562 (40) | 0.3379 (44) |
| ConcreteNet | 0.8607 (3) | 0.7923 (3) | 0.4746 (8) | 0.4091 (3) | 0.5612 (7) | 0.4950 (3) |
| scanrefer2 | 0.6340 (56) | 0.4353 (55) | 0.3193 (58) | 0.1947 (54) | 0.3898 (57) | 0.2486 (54) |
| CSA-M3LM | 0.8137 (9) | 0.6241 (25) | 0.4544 (20) | 0.3317 (18) | 0.5349 (13) | 0.3972 (17) |
| ScanRefer_vanilla | 0.6488 (53) | 0.4056 (58) | 0.3052 (60) | 0.1782 (57) | 0.3823 (59) | 0.2292 (58) |
| HAM | 0.7799 (28) | 0.6373 (22) | 0.4148 (32) | 0.3324 (17) | 0.4967 (32) | 0.4007 (16) |
| SPANet | 0.5614 (59) | 0.4641 (53) | 0.2800 (61) | 0.2071 (52) | 0.3431 (62) | 0.2647 (52) |
| henet | 0.7110 (48) | 0.5180 (50) | 0.3936 (34) | 0.2472 (48) | 0.4590 (38) | 0.3030 (49) |
| grounding | 0.7298 (46) | 0.5458 (47) | 0.3822 (39) | 0.2421 (49) | 0.4538 (43) | 0.3046 (48) |
| SAVG | 0.7758 (33) | 0.5664 (40) | 0.4236 (30) | 0.2826 (39) | 0.5026 (30) | 0.3462 (42) |
| HGT | 0.7692 (36) | 0.5886 (32) | 0.4141 (33) | 0.2924 (36) | 0.4937 (33) | 0.3588 (39) |
| BEAUTY-DETR (copyleft) | 0.7848 (22) | 0.5499 (46) | 0.3934 (35) | 0.2480 (47) | 0.4811 (34) | 0.3157 (47) |
| Clip-pre | 0.7766 (31) | 0.6843 (11) | 0.3617 (49) | 0.2904 (38) | 0.4547 (41) | 0.3787 (26) |
| Clip | 0.7733 (34) | 0.6810 (13) | 0.3619 (47) | 0.2919 (37) | 0.4542 (42) | 0.3791 (25) |
| FE-3DGQA | 0.7857 (21) | 0.5862 (33) | 0.4317 (29) | 0.2935 (33) | 0.5111 (29) | 0.3592 (37) |
| 3DInsVG | 0.8170 (8) | 0.6925 (8) | 0.4582 (13) | 0.3617 (11) | 0.5386 (11) | 0.4359 (11) |
| ContraRefer | 0.7832 (24) | 0.6801 (14) | 0.3850 (37) | 0.2947 (32) | 0.4743 (36) | 0.3811 (22) |
| D3Net (permissive) | 0.7923 (18) | 0.6843 (11) | 0.3905 (36) | 0.3074 (30) | 0.4806 (35) | 0.3919 (19) |
| D3Net - Pretrained (permissive) | 0.7659 (38) | 0.6579 (19) | 0.3619 (47) | 0.2726 (41) | 0.4525 (44) | 0.3590 (38) |
| TransformerRefer | 0.6010 (57) | 0.4658 (52) | 0.2540 (63) | 0.1730 (59) | 0.3318 (63) | 0.2386 (55) |
| SR-GAB | 0.7016 (49) | 0.5202 (49) | 0.3233 (57) | 0.1959 (53) | 0.4081 (55) | 0.2686 (51) |
| pairwisemethod | 0.5779 (58) | 0.3603 (59) | 0.2792 (62) | 0.1746 (58) | 0.3462 (61) | 0.2163 (59) |
| PointGroup_MCAN | 0.7510 (41) | 0.6397 (21) | 0.3271 (55) | 0.2535 (45) | 0.4222 (50) | 0.3401 (43) |
| 3DJCG(Grounding) (permissive) | 0.7675 (37) | 0.6059 (27) | 0.4389 (27) | 0.3117 (28) | 0.5126 (27) | 0.3776 (27) |
| 3DVG-Transformer (permissive) | 0.7576 (39) | 0.5515 (45) | 0.4224 (31) | 0.2933 (34) | 0.4976 (31) | 0.3512 (41) |
| 3DVG-Trans + (permissive) | 0.7733 (34) | 0.5787 (38) | 0.4370 (28) | 0.3102 (29) | 0.5124 (28) | 0.3704 (32) |
| SRGA | 0.7494 (43) | 0.5128 (51) | 0.3631 (46) | 0.2218 (50) | 0.4497 (46) | 0.2871 (50) |
| InstanceRefer (permissive) | 0.7782 (30) | 0.6669 (17) | 0.3457 (52) | 0.2688 (43) | 0.4427 (47) | 0.3580 (40) |
| ScanRefer Baseline | 0.6422 (54) | 0.4196 (57) | 0.3090 (59) | 0.1832 (55) | 0.3837 (58) | 0.2362 (56) |
| TGNN | 0.6834 (52) | 0.5894 (31) | 0.3312 (53) | 0.2526 (46) | 0.4102 (54) | 0.3281 (46) |
| M3DRef-CLIP (permissive) | 0.7980 (14) | 0.7085 (5) | 0.4692 (10) | 0.3807 (9) | 0.5433 (10) | 0.4545 (9) |
| Co3d | 0.0000 (65) | 0.0000 (65) | 0.0000 (65) | 0.0000 (65) | 0.0000 (65) | 0.0000 (65) |
| 3dvlp-with-judge | 0.7807 (27) | 0.6472 (20) | 0.4498 (24) | 0.3407 (14) | 0.5240 (24) | 0.4094 (14) |
| TFVG3D ++ (permissive) | 0.7453 (45) | 0.5458 (48) | 0.3793 (41) | 0.2690 (42) | 0.4614 (37) | 0.3311 (45) |
| 3dvlp-judge-h | 0.7552 (40) | 0.6051 (28) | 0.4458 (26) | 0.3340 (16) | 0.5152 (26) | 0.3948 (18) |
| ScanRefer-3dvlp-test | 0.7824 (25) | 0.6298 (24) | 0.4532 (22) | 0.3405 (15) | 0.5270 (23) | 0.4054 (15) |
| ScanRefer-test | 0.6999 (50) | 0.4361 (54) | 0.3274 (54) | 0.1725 (60) | 0.4109 (53) | 0.2316 (57) |
| 3DVLP | 0.0038 (64) | 0.0019 (64) | 0.0049 (64) | 0.0023 (64) | 0.0047 (64) | 0.0022 (64) |
| UniVLG | 0.8895 (1) | 0.8236 (1) | 0.5921 (1) | 0.5030 (1) | 0.6588 (1) | 0.5749 (1) |
| GALA-Grounder-D3 | 0.7939 (17) | 0.5952 (30) | 0.4625 (11) | 0.3229 (20) | 0.5368 (12) | 0.3839 (20) |
| LAG-3D-3 | 0.7815 (26) | 0.5837 (34) | 0.4556 (18) | 0.3219 (21) | 0.5287 (22) | 0.3806 (23) |
| LAG-3D-2 | 0.7964 (15) | 0.5812 (36) | 0.4572 (15) | 0.3245 (19) | 0.5333 (15) | 0.3821 (21) |
| LAG-3D | 0.7881 (19) | 0.5606 (43) | 0.4579 (14) | 0.3169 (24) | 0.5320 (17) | 0.3715 (31) |
| Graph-VG-4 | 0.7848 (22) | 0.5631 (42) | 0.4560 (17) | 0.3164 (26) | 0.5298 (19) | 0.3717 (30) |
| Graph-VG-3 | 0.8038 (11) | 0.5812 (36) | 0.4515 (23) | 0.3169 (24) | 0.5305 (18) | 0.3762 (28) |
| Graph-VG-2 | 0.8021 (12) | 0.5829 (35) | 0.4546 (19) | 0.3217 (22) | 0.5325 (16) | 0.3802 (24) |
| GALA-Grounder-D1 | 0.8104 (10) | 0.5754 (39) | 0.4479 (25) | 0.3176 (23) | 0.5292 (20) | 0.3754 (29) |
| Chat-Scene (permissive) | 0.8887 (2) | 0.8005 (2) | 0.5421 (2) | 0.4861 (2) | 0.6198 (2) | 0.5566 (2) |
| Co3d2 | 0.5070 (62) | 0.1195 (63) | 0.3569 (50) | 0.1511 (62) | 0.3906 (56) | 0.1440 (62) |
| D-LISA | 0.8195 (7) | 0.6900 (9) | 0.4975 (4) | 0.3967 (6) | 0.5697 (4) | 0.4625 (5) |
| M3DRef-test | 0.7865 (20) | 0.6793 (15) | 0.4963 (5) | 0.3977 (5) | 0.5614 (6) | 0.4608 (6) |
| RG-SAN | 0.7964 (15) | 0.6785 (16) | 0.4591 (12) | 0.3600 (12) | 0.5348 (14) | 0.4314 (12) |
| SAF | 0.6348 (55) | 0.5647 (41) | 0.3726 (42) | 0.3009 (31) | 0.4314 (48) | 0.3601 (36) |
| M3DRef-SCLIP | 0.7997 (13) | 0.7123 (4) | 0.4708 (9) | 0.3805 (10) | 0.5445 (9) | 0.4549 (8) |
| cus3d | 0.8384 (5) | 0.7073 (7) | 0.4908 (6) | 0.4000 (4) | 0.5688 (5) | 0.4689 (4) |
| pointclip | 0.8211 (6) | 0.7082 (6) | 0.4803 (7) | 0.3884 (7) | 0.5567 (8) | 0.4601 (7) |
| Se2d | 0.7799 (28) | 0.6628 (18) | 0.3636 (45) | 0.2823 (40) | 0.4569 (39) | 0.3677 (34) |
| secg | 0.7288 (47) | 0.6175 (26) | 0.3696 (44) | 0.2933 (34) | 0.4501 (45) | 0.3660 (35) |
| CORE-3DVG | 0.8557 (4) | 0.6867 (10) | 0.5275 (3) | 0.3850 (8) | 0.6011 (3) | 0.4527 (10) |
| bo3d-1 | 0.7469 (44) | 0.5606 (43) | 0.4539 (21) | 0.3124 (27) | 0.5196 (25) | 0.3680 (33) |
| Co3d3 | 0.5326 (61) | 0.1369 (61) | 0.3848 (38) | 0.1651 (61) | 0.4179 (51) | 0.1588 (61) |
| bo3d0 | 0.4823 (63) | 0.1278 (62) | 0.3271 (55) | 0.1394 (63) | 0.3619 (60) | 0.1368 (63) |
| bo3d | 0.5400 (60) | 0.1550 (60) | 0.3817 (40) | 0.1785 (56) | 0.4172 (52) | 0.1732 (60) |
| ScanRefer (permissive) | 0.6859 (51) | 0.4353 (55) | 0.3488 (51) | 0.2097 (51) | 0.4244 (49) | 0.2603 (53) |

References:
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
- 3DVG-Transformer / 3DVG-Trans +: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
- TFVG3D ++: Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024
- Chat-Scene: Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
- D-LISA: Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
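
The acc@kIoU numbers above count a description as correctly grounded when the predicted 3D bounding box overlaps the annotated box with an IoU of at least k (0.25 or 0.5). The sketch below illustrates that computation for axis-aligned boxes in corner format; the function names and box layout are illustrative assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lower = np.maximum(box_a[:3], box_b[:3])          # lower corner of the intersection
    upper = np.minimum(box_a[3:], box_b[3:])          # upper corner of the intersection
    inter = np.clip(upper - lower, 0.0, None).prod()  # zero if the boxes do not overlap
    vol_a = (box_a[3:] - box_a[:3]).prod()
    vol_b = (box_b[3:] - box_b[:3]).prod()
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of descriptions whose predicted box reaches the IoU threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

# Toy example: one description, predicted box shifted 0.2 m along x (IoU ~0.67).
preds = [np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])]
gts = [np.array([0.2, 0.0, 0.0, 1.2, 1.0, 1.0])]
print(acc_at_iou(preds, gts, 0.25), acc_at_iou(preds, gts, 0.5))  # 1.0 1.0
```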

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario. The first four columns report captioning metrics at an IoU threshold of 0.5 (CIDEr, BLEU-4, Rouge-L, METEOR, grouped as Captioning F1-Score on the benchmark), DCmAP is the dense-captioning mAP, and mAP@0.5 is the object-detection mAP; the number in parentheses after each value is the method's rank for that column.


| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| Chat-Scene-thres0.5 (permissive) | 0.3456 (1) | 0.1859 (2) | 0.3162 (1) | 0.1527 (1) | 0.1415 (8) | 0.4856 (4) |
| Vote2Cap-DETR++ | 0.3360 (2) | 0.1908 (1) | 0.3012 (2) | 0.1386 (2) | 0.1864 (1) | 0.5090 (1) |
| TMP | 0.3029 (4) | 0.1728 (4) | 0.2898 (3) | 0.1332 (3) | 0.1801 (3) | 0.4605 (5) |
| vote2cap-detr (permissive) | 0.3128 (3) | 0.1778 (3) | 0.2842 (4) | 0.1316 (4) | 0.1825 (2) | 0.4454 (6) |
| CFM | 0.2360 (5) | 0.1417 (5) | 0.2253 (5) | 0.1034 (5) | 0.1379 (10) | 0.3008 (10) |
| CM3D-Trans+ | 0.2348 (6) | 0.1383 (6) | 0.2250 (7) | 0.1030 (6) | 0.1398 (9) | 0.2966 (12) |
| Forest-xyz | 0.2266 (7) | 0.1363 (7) | 0.2250 (6) | 0.1027 (7) | 0.1161 (15) | 0.2825 (15) |
| D3Net - Speaker (permissive) | 0.2088 (8) | 0.1335 (9) | 0.2237 (8) | 0.1022 (8) | 0.1481 (7) | 0.4198 (7) |
| 3DJCG(Captioning) (permissive) | 0.1918 (10) | 0.1350 (8) | 0.2207 (9) | 0.1013 (9) | 0.1506 (6) | 0.3867 (8) |
| Chat-Scene-thres0.01 | 0.2053 (9) | 0.1103 (10) | 0.1884 (10) | 0.0907 (10) | 0.1527 (5) | 0.5076 (2) |
| REMAN | 0.1662 (11) | 0.1070 (11) | 0.1790 (11) | 0.0815 (11) | 0.1235 (13) | 0.2927 (14) |
| NOAH | 0.1382 (12) | 0.0901 (12) | 0.1598 (12) | 0.0747 (12) | 0.1359 (11) | 0.2977 (11) |
| SpaCap3D (permissive) | 0.1359 (13) | 0.0883 (13) | 0.1591 (13) | 0.0738 (13) | 0.1182 (14) | 0.3275 (9) |
| SUN+ | 0.1148 (17) | 0.0846 (14) | 0.1564 (14) | 0.0711 (14) | 0.1143 (16) | 0.2958 (13) |
| X-Trans2Cap (permissive) | 0.1274 (14) | 0.0808 (15) | 0.1392 (15) | 0.0653 (15) | 0.1244 (12) | 0.2795 (16) |
| MORE-xyz (permissive) | 0.1239 (16) | 0.0796 (16) | 0.1362 (16) | 0.0631 (16) | 0.1116 (17) | 0.2648 (17) |
| Chat-Scene-all | 0.1257 (15) | 0.0671 (17) | 0.1150 (17) | 0.0554 (17) | 0.1539 (4) | 0.5076 (2) |
| Scan2Cap (permissive) | 0.0849 (18) | 0.0576 (18) | 0.1073 (18) | 0.0492 (18) | 0.0970 (18) | 0.2481 (18) |

References:
- Chat-Scene (all variants): Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021
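
The captioning columns are likewise IoU-gated: a generated caption contributes its score only if the predicted box localizes the annotated object with an IoU of at least 0.5, and otherwise contributes zero to the average. Below is a minimal sketch of that gating with a stand-in sentence scorer; the data layout and the dummy_metric helper are assumptions for illustration, not the official Scan2Cap evaluation code.

```python
import numpy as np

def captioning_at_iou(matched, caption_metric, iou_threshold=0.5):
    """Average caption score over annotated objects, gated by localization quality.

    `matched` holds one tuple per ground-truth object: (best IoU of the matched
    predicted box, predicted caption, list of reference captions); objects with
    no matching prediction can simply carry an IoU of 0. `caption_metric` is any
    sentence-level scorer (CIDEr, BLEU-4, METEOR, Rouge-L, ...) returning a float.
    """
    scores = []
    for iou, pred_caption, ref_captions in matched:
        if iou >= iou_threshold:
            scores.append(caption_metric(pred_caption, ref_captions))
        else:
            scores.append(0.0)  # poorly localized objects contribute zero
    return float(np.mean(scores)) if scores else 0.0

# Toy usage with a stand-in scorer (real evaluation uses CIDEr/BLEU/METEOR/Rouge-L).
dummy_metric = lambda pred, refs: float(pred in refs)
matched = [
    (0.62, "a brown wooden chair next to the table", ["a brown wooden chair next to the table"]),
    (0.31, "a white pillow on the bed", ["a white pillow on the bed"]),  # below the 0.5 IoU gate
]
print(captioning_at_iou(matched, dummy_metric))  # 0.5
```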