This table lists the benchmark results for the ScanRefer Localization Benchmark.


Entries are sorted by Overall acc@0.5IoU; the number in parentheses after each score is the method's rank within that column. Tags in brackets indicate the released code license.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| M3DRef-CLIP | 0.7980 (3) | 0.7085 (1) | 0.4692 (1) | 0.3807 (1) | 0.5433 (1) | 0.4545 (1) |
| ConcreteNet | 0.8120 (2) | 0.6933 (2) | 0.4479 (3) | 0.3760 (2) | 0.5296 (3) | 0.4471 (2) |
| HAM | 0.7799 (8) | 0.6373 (10) | 0.4148 (9) | 0.3324 (3) | 0.4967 (9) | 0.4007 (3) |
| CSA-M3LM | 0.8137 (1) | 0.6241 (11) | 0.4544 (2) | 0.3317 (4) | 0.5349 (2) | 0.3972 (4) |
| D3Net [permissive] | 0.7923 (4) | 0.6843 (3) | 0.3905 (13) | 0.3074 (7) | 0.4806 (12) | 0.3919 (5) |
| ContraRefer | 0.7832 (7) | 0.6801 (6) | 0.3850 (14) | 0.2947 (8) | 0.4743 (13) | 0.3811 (6) |
| Clip | 0.7733 (12) | 0.6810 (5) | 0.3619 (18) | 0.2919 (12) | 0.4542 (17) | 0.3791 (7) |
| Clip-pre | 0.7766 (10) | 0.6843 (3) | 0.3617 (20) | 0.2904 (13) | 0.4547 (16) | 0.3787 (8) |
| 3DJCG(Grounding) [permissive] | 0.7675 (15) | 0.6059 (12) | 0.4389 (4) | 0.3117 (5) | 0.5126 (4) | 0.3776 (9) |
| 3DVG-Trans + [permissive] | 0.7733 (12) | 0.5787 (17) | 0.4370 (5) | 0.3102 (6) | 0.5124 (5) | 0.3704 (10) |
| FE-3DGQA | 0.7857 (5) | 0.5862 (16) | 0.4317 (6) | 0.2935 (9) | 0.5111 (6) | 0.3592 (11) |
| D3Net - Pretrained [permissive] | 0.7659 (16) | 0.6579 (8) | 0.3619 (18) | 0.2726 (15) | 0.4525 (19) | 0.3590 (12) |
| HGT | 0.7692 (14) | 0.5886 (15) | 0.4141 (10) | 0.2924 (11) | 0.4937 (10) | 0.3588 (13) |
| InstanceRefer [permissive] | 0.7782 (9) | 0.6669 (7) | 0.3457 (22) | 0.2688 (16) | 0.4427 (21) | 0.3580 (14) |
| 3DVG-Transformer [permissive] | 0.7576 (17) | 0.5515 (19) | 0.4224 (8) | 0.2933 (10) | 0.4976 (8) | 0.3512 (15) |
| SAVG | 0.7758 (11) | 0.5664 (18) | 0.4236 (7) | 0.2826 (14) | 0.5026 (7) | 0.3462 (16) |
| PointGroup_MCAN | 0.7510 (18) | 0.6397 (9) | 0.3271 (24) | 0.2535 (18) | 0.4222 (23) | 0.3401 (17) |
| TransformerVG | 0.7502 (19) | 0.5977 (13) | 0.3712 (16) | 0.2628 (17) | 0.4562 (15) | 0.3379 (18) |
| TGNN | 0.6834 (25) | 0.5894 (14) | 0.3312 (23) | 0.2526 (19) | 0.4102 (24) | 0.3281 (19) |
| BEAUTY-DETR [copyleft] | 0.7848 (6) | 0.5499 (20) | 0.3934 (12) | 0.2480 (20) | 0.4811 (11) | 0.3157 (20) |
| grounding | 0.7298 (21) | 0.5458 (21) | 0.3822 (15) | 0.2421 (22) | 0.4538 (18) | 0.3046 (21) |
| henet | 0.7110 (22) | 0.5180 (23) | 0.3936 (11) | 0.2472 (21) | 0.4590 (14) | 0.3030 (22) |
| SRGA | 0.7494 (20) | 0.5128 (24) | 0.3631 (17) | 0.2218 (23) | 0.4497 (20) | 0.2871 (23) |
| SR-GAB | 0.7016 (23) | 0.5202 (22) | 0.3233 (25) | 0.1959 (26) | 0.4081 (25) | 0.2686 (24) |
| SPANet | 0.5614 (31) | 0.4641 (26) | 0.2800 (29) | 0.2071 (25) | 0.3431 (30) | 0.2647 (25) |
| ScanRefer [permissive] | 0.6859 (24) | 0.4353 (27) | 0.3488 (21) | 0.2097 (24) | 0.4244 (22) | 0.2603 (26) |
| scanrefer2 | 0.6340 (28) | 0.4353 (27) | 0.3193 (26) | 0.1947 (27) | 0.3898 (26) | 0.2486 (27) |
| TransformerRefer | 0.6010 (29) | 0.4658 (25) | 0.2540 (31) | 0.1730 (31) | 0.3318 (31) | 0.2386 (28) |
| ScanRefer Baseline | 0.6422 (27) | 0.4196 (29) | 0.3090 (27) | 0.1832 (28) | 0.3837 (27) | 0.2362 (29) |
| ScanRefer_vanilla | 0.6488 (26) | 0.4056 (30) | 0.3052 (28) | 0.1782 (29) | 0.3823 (28) | 0.2292 (30) |
| pairwisemethod | 0.5779 (30) | 0.3603 (31) | 0.2792 (30) | 0.1746 (30) | 0.3462 (29) | 0.2163 (31) |

References:
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang, "Learning Point-Language Hierarchical Alignment for 3D Visual Grounding".
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang, "D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding", ECCV 2022.
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu, "3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds", CVPR 2022 (Oral).
- 3DVG-Trans + / 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu, "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds", ICCV 2021.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui, "InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring", ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu, "Text-Guided Graph Neural Network for Referring 3D Instance Segmentation", AAAI 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki, "Looking Outside the Box to Ground Language in 3D Scenes".
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner, "ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language", ECCV 2020.
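The acc@kIoU columns are the standard grounding metric: a prediction counts as correct when its 3D bounding box overlaps the ground-truth box with an IoU of at least k (0.25 or 0.5), and accuracy is the fraction of correct predictions. A minimal sketch for axis-aligned boxes (function names are ours, not from the benchmark code):

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))  # overlap extent along x
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))  # overlap extent along y
    iz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))  # overlap extent along z
    inter = ix * iy * iz
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, thresh):
    """Fraction of predictions whose IoU with the matching ground-truth box
    reaches the threshold (0.25 or 0.5 on this leaderboard)."""
    hits = sum(box_iou_3d(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

Note that methods on the benchmark typically predict oriented or tight boxes from point clouds; the axis-aligned formula above is only the simplest illustration of the IoU test.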

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark.


Entries are sorted by CIDEr@0.5IoU; the number in parentheses after each score is the method's rank within that column. The first four columns are the captioning F1-scores at IoU 0.5, DC mAP is the dense-captioning mAP, and mAP@0.5 measures object detection. Tags in brackets indicate the released code license.

| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DC mAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| vote2cap-detr | 0.3128 (1) | 0.1778 (1) | 0.2842 (1) | 0.1316 (1) | 0.1825 (1) | 0.4454 (1) |
| CFM | 0.2360 (2) | 0.1417 (2) | 0.2253 (2) | 0.1034 (2) | 0.1379 (5) | 0.3008 (5) |
| CM3D-Trans+ | 0.2348 (3) | 0.1383 (3) | 0.2250 (4) | 0.1030 (3) | 0.1398 (4) | 0.2966 (7) |
| Forest-xyz | 0.2266 (4) | 0.1363 (4) | 0.2250 (3) | 0.1027 (4) | 0.1161 (10) | 0.2825 (10) |
| D3Net - Speaker [permissive] | 0.2088 (5) | 0.1335 (6) | 0.2237 (5) | 0.1022 (5) | 0.1481 (3) | 0.4198 (2) |
| 3DJCG(Captioning) [permissive] | 0.1918 (6) | 0.1350 (5) | 0.2207 (6) | 0.1013 (6) | 0.1506 (2) | 0.3867 (3) |
| REMAN | 0.1662 (7) | 0.1070 (7) | 0.1790 (7) | 0.0815 (7) | 0.1235 (8) | 0.2927 (9) |
| NOAH | 0.1382 (8) | 0.0901 (8) | 0.1598 (8) | 0.0747 (8) | 0.1359 (6) | 0.2977 (6) |
| SpaCap3D [permissive] | 0.1359 (9) | 0.0883 (9) | 0.1591 (9) | 0.0738 (9) | 0.1182 (9) | 0.3275 (4) |
| X-Trans2Cap [permissive] | 0.1274 (10) | 0.0808 (11) | 0.1392 (11) | 0.0653 (11) | 0.1244 (7) | 0.2795 (11) |
| MORE-xyz [permissive] | 0.1239 (11) | 0.0796 (12) | 0.1362 (12) | 0.0631 (12) | 0.1116 (12) | 0.2648 (12) |
| SUN+ | 0.1148 (12) | 0.0846 (10) | 0.1564 (10) | 0.0711 (10) | 0.1143 (11) | 0.2958 (8) |
| Scan2Cap [permissive] | 0.0849 (13) | 0.0576 (13) | 0.1073 (13) | 0.0492 (13) | 0.0970 (13) | 0.2481 (13) |

References:
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma, "Contextual Modeling for 3D Dense Captioning on Point Clouds".
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang, "D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding", ECCV 2022.
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu, "3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds", CVPR 2022 (Oral).
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai, "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI 2022.
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li, "X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning", CVPR 2022.
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang, "MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes", ECCV 2022.
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang, "Scan2Cap: Context-aware Dense Captioning in RGB-D Scans", CVPR 2021.
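The @0.5IoU suffix on the captioning metrics reflects the IoU-gated evaluation introduced with Scan2Cap: a generated caption's language score (CIDEr, BLEU-4, Rouge-L, or METEOR) is counted only when the predicted box overlaps the ground-truth object with IoU of at least 0.5, and contributes zero otherwise, before averaging over objects. A minimal sketch of that gating (the helper name is ours, and the underlying language scores are assumed to be precomputed):

```python
def caption_score_at_iou(pairs, thresh=0.5):
    """pairs: (iou, caption_score) per ground-truth object, where caption_score is
    a precomputed language metric (e.g. CIDEr). A caption contributes its score
    only when its predicted box reaches the IoU threshold; otherwise it counts as 0."""
    if not pairs:
        return 0.0
    return sum(score if iou >= thresh else 0.0 for iou, score in pairs) / len(pairs)
```

This is why a method can have strong raw captions but a low metric@0.5IoU: poorly localized boxes zero out otherwise good captions.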