This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Each cell shows the score followed by that metric's per-column rank in parentheses. Rows are sorted by Overall acc@0.5IoU. License tags ("permissive", "copyleft") follow the method name where the benchmark lists them.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| Chat-Scene (permissive) | 0.8887 (1) | 0.8005 (1) | 0.5421 (1) | 0.4861 (1) | 0.6198 (1) | 0.5566 (1) |
| ConcreteNet | 0.8607 (2) | 0.7923 (2) | 0.4746 (7) | 0.4091 (2) | 0.5612 (6) | 0.4950 (2) |
| cus3d | 0.8384 (4) | 0.7073 (6) | 0.4908 (5) | 0.4000 (3) | 0.5688 (4) | 0.4689 (3) |
| D-LISA | 0.8195 (6) | 0.6900 (8) | 0.4975 (3) | 0.3967 (5) | 0.5697 (3) | 0.4625 (4) |
| M3DRef-test | 0.7865 (19) | 0.6793 (14) | 0.4963 (4) | 0.3977 (4) | 0.5614 (5) | 0.4608 (5) |
| pointclip | 0.8211 (5) | 0.7082 (5) | 0.4803 (6) | 0.3884 (6) | 0.5567 (7) | 0.4601 (6) |
| M3DRef-SCLIP | 0.7997 (12) | 0.7123 (3) | 0.4708 (8) | 0.3805 (9) | 0.5445 (8) | 0.4549 (7) |
| M3DRef-CLIP (permissive) | 0.7980 (13) | 0.7085 (4) | 0.4692 (9) | 0.3807 (8) | 0.5433 (9) | 0.4545 (8) |
| CORE-3DVG | 0.8557 (3) | 0.6867 (9) | 0.5275 (2) | 0.3850 (7) | 0.6011 (2) | 0.4527 (9) |
| 3DInsVG | 0.8170 (7) | 0.6925 (7) | 0.4582 (12) | 0.3617 (10) | 0.5386 (10) | 0.4359 (10) |
| RG-SAN | 0.7964 (14) | 0.6785 (15) | 0.4591 (11) | 0.3600 (11) | 0.5348 (13) | 0.4314 (11) |
| HAM | 0.7799 (25) | 0.6373 (20) | 0.4148 (27) | 0.3324 (12) | 0.4967 (27) | 0.4007 (12) |
| CSA-M3LM | 0.8137 (8) | 0.6241 (21) | 0.4544 (18) | 0.3317 (13) | 0.5349 (12) | 0.3972 (13) |
| D3Net (permissive) | 0.7923 (17) | 0.6843 (10) | 0.3905 (31) | 0.3074 (25) | 0.4806 (30) | 0.3919 (14) |
| GALA-Grounder-D3 | 0.7939 (16) | 0.5952 (25) | 0.4625 (10) | 0.3229 (15) | 0.5368 (11) | 0.3839 (15) |
| LAG-3D-2 | 0.7964 (14) | 0.5812 (31) | 0.4572 (14) | 0.3245 (14) | 0.5333 (14) | 0.3821 (16) |
| ContraRefer | 0.7832 (23) | 0.6801 (13) | 0.3850 (32) | 0.2947 (27) | 0.4743 (31) | 0.3811 (17) |
| LAG-3D-3 | 0.7815 (24) | 0.5837 (29) | 0.4556 (16) | 0.3219 (16) | 0.5287 (20) | 0.3806 (18) |
| Graph-VG-2 | 0.8021 (11) | 0.5829 (30) | 0.4546 (17) | 0.3217 (17) | 0.5325 (15) | 0.3802 (19) |
| Clip | 0.7733 (30) | 0.6810 (12) | 0.3619 (42) | 0.2919 (32) | 0.4542 (37) | 0.3791 (20) |
| Clip-pre | 0.7766 (28) | 0.6843 (10) | 0.3617 (44) | 0.2904 (33) | 0.4547 (36) | 0.3787 (21) |
| 3DJCG(Grounding) (permissive) | 0.7675 (33) | 0.6059 (23) | 0.4389 (22) | 0.3117 (23) | 0.5126 (22) | 0.3776 (22) |
| Graph-VG-3 | 0.8038 (10) | 0.5812 (31) | 0.4515 (20) | 0.3169 (19) | 0.5305 (17) | 0.3762 (23) |
| GALA-Grounder-D1 | 0.8104 (9) | 0.5754 (34) | 0.4479 (21) | 0.3176 (18) | 0.5292 (19) | 0.3754 (24) |
| Graph-VG-4 | 0.7848 (21) | 0.5631 (37) | 0.4560 (15) | 0.3164 (21) | 0.5298 (18) | 0.3717 (25) |
| LAG-3D | 0.7881 (18) | 0.5606 (38) | 0.4579 (13) | 0.3169 (19) | 0.5320 (16) | 0.3715 (26) |
| 3DVG-Trans + (permissive) | 0.7733 (30) | 0.5787 (33) | 0.4370 (23) | 0.3102 (24) | 0.5124 (23) | 0.3704 (27) |
| bo3d-1 | 0.7469 (39) | 0.5606 (38) | 0.4539 (19) | 0.3124 (22) | 0.5196 (21) | 0.3680 (28) |
| Se2d | 0.7799 (25) | 0.6628 (17) | 0.3636 (40) | 0.2823 (35) | 0.4569 (34) | 0.3677 (29) |
| secg | 0.7288 (42) | 0.6175 (22) | 0.3696 (39) | 0.2933 (29) | 0.4501 (40) | 0.3660 (30) |
| SAF | 0.6348 (49) | 0.5647 (36) | 0.3726 (37) | 0.3009 (26) | 0.4314 (43) | 0.3601 (31) |
| FE-3DGQA | 0.7857 (20) | 0.5862 (28) | 0.4317 (24) | 0.2935 (28) | 0.5111 (24) | 0.3592 (32) |
| D3Net - Pretrained (permissive) | 0.7659 (34) | 0.6579 (18) | 0.3619 (42) | 0.2726 (36) | 0.4525 (39) | 0.3590 (33) |
| HGT | 0.7692 (32) | 0.5886 (27) | 0.4141 (28) | 0.2924 (31) | 0.4937 (28) | 0.3588 (34) |
| InstanceRefer (permissive) | 0.7782 (27) | 0.6669 (16) | 0.3457 (47) | 0.2688 (38) | 0.4427 (42) | 0.3580 (35) |
| 3DVG-Transformer (permissive) | 0.7576 (35) | 0.5515 (40) | 0.4224 (26) | 0.2933 (29) | 0.4976 (26) | 0.3512 (36) |
| SAVG | 0.7758 (29) | 0.5664 (35) | 0.4236 (25) | 0.2826 (34) | 0.5026 (25) | 0.3462 (37) |
| PointGroup_MCAN | 0.7510 (36) | 0.6397 (19) | 0.3271 (49) | 0.2535 (40) | 0.4222 (45) | 0.3401 (38) |
| TransformerVG | 0.7502 (37) | 0.5977 (24) | 0.3712 (38) | 0.2628 (39) | 0.4562 (35) | 0.3379 (39) |
| TFVG3D ++ (permissive) | 0.7453 (40) | 0.5458 (43) | 0.3793 (36) | 0.2690 (37) | 0.4614 (32) | 0.3311 (40) |
| TGNN | 0.6834 (46) | 0.5894 (26) | 0.3312 (48) | 0.2526 (41) | 0.4102 (48) | 0.3281 (41) |
| BEAUTY-DETR (copyleft) | 0.7848 (21) | 0.5499 (41) | 0.3934 (30) | 0.2480 (42) | 0.4811 (29) | 0.3157 (42) |
| grounding | 0.7298 (41) | 0.5458 (42) | 0.3822 (34) | 0.2421 (44) | 0.4538 (38) | 0.3046 (43) |
| henet | 0.7110 (43) | 0.5180 (45) | 0.3936 (29) | 0.2472 (43) | 0.4590 (33) | 0.3030 (44) |
| SRGA | 0.7494 (38) | 0.5128 (46) | 0.3631 (41) | 0.2218 (45) | 0.4497 (41) | 0.2871 (45) |
| SR-GAB | 0.7016 (44) | 0.5202 (44) | 0.3233 (51) | 0.1959 (48) | 0.4081 (49) | 0.2686 (46) |
| SPANet | 0.5614 (53) | 0.4641 (48) | 0.2800 (55) | 0.2071 (47) | 0.3431 (56) | 0.2647 (47) |
| ScanRefer (permissive) | 0.6859 (45) | 0.4353 (49) | 0.3488 (46) | 0.2097 (46) | 0.4244 (44) | 0.2603 (48) |
| scanrefer2 | 0.6340 (50) | 0.4353 (49) | 0.3193 (52) | 0.1947 (49) | 0.3898 (51) | 0.2486 (49) |
| TransformerRefer | 0.6010 (51) | 0.4658 (47) | 0.2540 (57) | 0.1730 (54) | 0.3318 (57) | 0.2386 (50) |
| ScanRefer Baseline | 0.6422 (48) | 0.4196 (51) | 0.3090 (53) | 0.1832 (50) | 0.3837 (52) | 0.2362 (51) |
| ScanRefer_vanilla | 0.6488 (47) | 0.4056 (52) | 0.3052 (54) | 0.1782 (52) | 0.3823 (53) | 0.2292 (52) |
| pairwisemethod | 0.5779 (52) | 0.3603 (53) | 0.2792 (56) | 0.1746 (53) | 0.3462 (55) | 0.2163 (53) |
| bo3d | 0.5400 (54) | 0.1550 (54) | 0.3817 (35) | 0.1785 (51) | 0.4172 (47) | 0.1732 (54) |
| Co3d3 | 0.5326 (55) | 0.1369 (55) | 0.3848 (33) | 0.1651 (55) | 0.4179 (46) | 0.1588 (55) |
| Co3d2 | 0.5070 (56) | 0.1195 (57) | 0.3569 (45) | 0.1511 (56) | 0.3906 (50) | 0.1440 (56) |
| bo3d0 | 0.4823 (57) | 0.1278 (56) | 0.3271 (49) | 0.1394 (57) | 0.3619 (54) | 0.1368 (57) |
| Co3d | 0.0000 (58) | 0.0000 (58) | 0.0000 (58) | 0.0000 (58) | 0.0000 (58) | 0.0000 (58) |

References:
- Chat-Scene: Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024
- D-LISA: Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- D3Net, D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
- 3DVG-Trans +, 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
- TFVG3D ++: Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020
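The acc@kIoU metrics above count a prediction as correct when its 3D bounding box overlaps the ground-truth box with intersection-over-union of at least k (0.25 or 0.5). The sketch below illustrates that computation; it is not the official evaluation code, and it assumes axis-aligned boxes in (cx, cy, cz, dx, dy, dz) center/size form, with one prediction per ground-truth object:

```python
def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo      # accumulate intersection volume
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the paired ground-truth
    box reaches the threshold (0.25 or 0.5 on this benchmark)."""
    hits = sum(box3d_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

The Unique / Multiple split in the table applies this same accuracy separately to queries whose target class is unique in the scene versus queries with same-class distractors.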

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


Each cell shows the score followed by that metric's per-column rank in parentheses. The captioning metrics (CIDEr, BLEU-4, Rouge-L, METEOR) are computed at the 0.5 IoU threshold; DCmAP measures dense captioning and mAP@0.5 measures object detection. Rows are sorted by CIDEr@0.5IoU. License tags ("permissive") follow the method name where the benchmark lists them.

| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| Vote2Cap-DETR++ | 0.3360 (1) | 0.1908 (1) | 0.3012 (1) | 0.1386 (1) | 0.1864 (1) | 0.5090 (1) |
| Chat-Scene-thres0.5 (permissive) | 0.3128 (2) | 0.1679 (4) | 0.2862 (3) | 0.1376 (2) | 0.1478 (8) | 0.4981 (4) |
| vote2cap-detr (permissive) | 0.3128 (3) | 0.1778 (2) | 0.2842 (4) | 0.1316 (4) | 0.1825 (2) | 0.4454 (6) |
| TMP | 0.3029 (4) | 0.1728 (3) | 0.2898 (2) | 0.1332 (3) | 0.1801 (3) | 0.4605 (5) |
| CFM | 0.2360 (5) | 0.1417 (5) | 0.2253 (5) | 0.1034 (5) | 0.1379 (10) | 0.3008 (10) |
| CM3D-Trans+ | 0.2348 (6) | 0.1383 (6) | 0.2250 (7) | 0.1030 (6) | 0.1398 (9) | 0.2966 (12) |
| Forest-xyz | 0.2266 (7) | 0.1363 (7) | 0.2250 (6) | 0.1027 (7) | 0.1161 (15) | 0.2825 (15) |
| D3Net - Speaker (permissive) | 0.2088 (8) | 0.1335 (9) | 0.2237 (8) | 0.1022 (8) | 0.1481 (7) | 0.4198 (7) |
| Chat-Scene-thres0.01 | 0.2053 (9) | 0.1103 (10) | 0.1884 (10) | 0.0907 (10) | 0.1527 (5) | 0.5076 (2) |
| 3DJCG(Captioning) (permissive) | 0.1918 (10) | 0.1350 (8) | 0.2207 (9) | 0.1013 (9) | 0.1506 (6) | 0.3867 (8) |
| REMAN | 0.1662 (11) | 0.1070 (11) | 0.1790 (11) | 0.0815 (11) | 0.1235 (13) | 0.2927 (14) |
| NOAH | 0.1382 (12) | 0.0901 (12) | 0.1598 (12) | 0.0747 (12) | 0.1359 (11) | 0.2977 (11) |
| SpaCap3D (permissive) | 0.1359 (13) | 0.0883 (13) | 0.1591 (13) | 0.0738 (13) | 0.1182 (14) | 0.3275 (9) |
| X-Trans2Cap (permissive) | 0.1274 (14) | 0.0808 (15) | 0.1392 (15) | 0.0653 (15) | 0.1244 (12) | 0.2795 (16) |
| Chat-Scene-all | 0.1257 (15) | 0.0671 (17) | 0.1150 (17) | 0.0554 (17) | 0.1539 (4) | 0.5076 (2) |
| MORE-xyz (permissive) | 0.1239 (16) | 0.0796 (16) | 0.1362 (16) | 0.0631 (16) | 0.1116 (17) | 0.2648 (17) |
| SUN+ | 0.1148 (17) | 0.0846 (14) | 0.1564 (14) | 0.0711 (14) | 0.1143 (16) | 0.2958 (13) |
| Scan2Cap (permissive) | 0.0849 (18) | 0.0576 (18) | 0.1073 (18) | 0.0492 (18) | 0.0970 (18) | 0.2481 (18) |

References:
- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- Chat-Scene variants: Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. IJCAI 2022
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi_ORder RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021
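The m@0.5IoU convention behind the captioning columns scores a generated caption only when the box it describes overlaps a ground-truth box with IoU of at least 0.5; otherwise that object contributes a score of zero, so localization and caption quality are evaluated jointly. The sketch below shows just the gating logic; it is not the official Scan2Cap evaluation, and `caption_score` is a toy unigram-precision stand-in for a real metric such as CIDEr or BLEU-4:

```python
def caption_score(pred, ref):
    """Toy stand-in for a real captioning metric (CIDEr, BLEU-4, ...)."""
    pred_tokens, ref_tokens = pred.split(), set(ref.split())
    if not pred_tokens:
        return 0.0
    return sum(t in ref_tokens for t in pred_tokens) / len(pred_tokens)

def m_at_k_iou(matches, k=0.5):
    """Average IoU-gated caption score over ground-truth objects.

    `matches` holds one (iou, predicted_caption, reference_caption) tuple
    per ground-truth object, where iou is the overlap between the matched
    predicted box and the ground-truth box. Captions whose box overlap
    falls below k score zero.
    """
    scores = [caption_score(pred, ref) if iou >= k else 0.0
              for iou, pred, ref in matches]
    return sum(scores) / len(scores)
```

For example, a perfect caption attached to a poorly localized box (IoU 0.3) scores zero under this gating, which is why detection quality (mAP@0.5) and the @0.5IoU captioning metrics tend to move together in the table above.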