Benchmark Results - ScanRefer Benchmark

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.

		Captioning F1-Score				Dense Captioning	Object Detection
Method	Info	CIDEr@0.5IoU	BLEU-4@0.5IoU	Rouge-L@0.5IoU	METEOR@0.5IoU	DCmAP	mAP@0.5
Method	Info
Chat-Scene-thres0.5		0.3456 1	0.1859 2	0.3162 1	0.1527 1	0.1415 8	0.4856 4
Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
Vote2Cap-DETR++		0.3360 2	0.1908 1	0.3012 2	0.1386 2	0.1864 1	0.5090 1
Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
TMP		0.3029 4	0.1728 4	0.2898 3	0.1332 3	0.1801 3	0.4605 5

vote2cap-detr		0.3128 3	0.1778 3	0.2842 4	0.1316 4	0.1825 2	0.4454 6
Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang YU, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
CFM		0.2360 5	0.1417 5	0.2253 5	0.1034 5	0.1379 10	0.3008 10

Forest-xyz		0.2266 7	0.1363 7	0.2250 6	0.1027 7	0.1161 15	0.2825 15

CM3D-Trans+		0.2348 6	0.1383 6	0.2250 7	0.1030 6	0.1398 9	0.2966 12
Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
D3Net - Speaker		0.2088 8	0.1335 9	0.2237 8	0.1022 8	0.1481 7	0.4198 7
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
3DJCG(Captioning)		0.1918 10	0.1350 8	0.2207 9	0.1013 9	0.1506 6	0.3867 8
Daigang Cai, Lichen Zhao, Jing Zhang†, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR2022 Oral
Chat-Scene-thres0.01		0.2053 9	0.1103 10	0.1884 10	0.0907 10	0.1527 5	0.5076 2

REMAN		0.1662 11	0.1070 11	0.1790 11	0.0815 11	0.1235 13	0.2927 14

NOAH		0.1382 12	0.0901 12	0.1598 12	0.0747 12	0.1359 11	0.2977 11

SpaCap3D		0.1359 13	0.0883 13	0.1591 13	0.0738 13	0.1182 14	0.3275 9
Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. the 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
SUN+		0.1148 17	0.0846 14	0.1564 14	0.0711 14	0.1143 16	0.2958 13

X-Trans2Cap		0.1274 14	0.0808 15	0.1392 15	0.0653 15	0.1244 12	0.2795 16
Yuan, Zhihao and Yan, Xu and Liao, Yinghong and Guo, Yao and Li, Guanbin and Cui, Shuguang and Li, Zhen: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
MORE-xyz		0.1239 16	0.0796 16	0.1362 16	0.0631 16	0.1116 17	0.2648 17
Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi_ORder RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
Chat-Scene-all		0.1257 15	0.0671 17	0.1150 17	0.0554 17	0.1539 4	0.5076 2

Scan2Cap		0.0849 18	0.0576 18	0.1073 18	0.0492 18	0.0970 18	0.2481 18
Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021

Scan2Cap Benchmark