This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Each cell gives the score followed by the method's rank for that column in parentheses.

| Method | Info (license) | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| secg | | 0.7230 (25) | 0.6026 (15) | 0.3548 (27) | 0.2816 (18) | 0.4373 (25) | 0.3536 (18) |
| pairwisemethod | | 0.5779 (34) | 0.3603 (35) | 0.2792 (38) | 0.1746 (35) | 0.3462 (37) | 0.2163 (35) |
| TransformerVG | | 0.7502 (21) | 0.5977 (16) | 0.3712 (21) | 0.2628 (21) | 0.4562 (18) | 0.3379 (22) |
| FE-3DGQA | | 0.7857 (7) | 0.5862 (19) | 0.4317 (9) | 0.2935 (12) | 0.5111 (9) | 0.3592 (14) |
| ContraRefer | | 0.7832 (9) | 0.6801 (8) | 0.3850 (17) | 0.2947 (11) | 0.4743 (16) | 0.3811 (8) |
| D3Net | permissive | 0.7923 (6) | 0.6843 (5) | 0.3905 (16) | 0.3074 (10) | 0.4806 (15) | 0.3919 (7) |
| D3Net - Pretrained | permissive | 0.7659 (18) | 0.6579 (10) | 0.3619 (23) | 0.2726 (19) | 0.4525 (22) | 0.3590 (15) |
| TransformerRefer | | 0.6010 (33) | 0.4658 (29) | 0.2540 (39) | 0.1730 (36) | 0.3318 (39) | 0.2386 (32) |
| SR-GAB | | 0.7016 (27) | 0.5202 (26) | 0.3233 (33) | 0.1959 (30) | 0.4081 (31) | 0.2686 (28) |
| PointGroup_MCAN | | 0.7510 (20) | 0.6397 (11) | 0.3271 (31) | 0.2535 (22) | 0.4222 (27) | 0.3401 (21) |
| Clip-pre | | 0.7766 (12) | 0.6843 (5) | 0.3617 (25) | 0.2904 (16) | 0.4547 (19) | 0.3787 (10) |
| 3DJCG(Grounding) | permissive | 0.7675 (17) | 0.6059 (14) | 0.4389 (7) | 0.3117 (8) | 0.5126 (7) | 0.3776 (11) |
| 3DVG-Transformer | permissive | 0.7576 (19) | 0.5515 (23) | 0.4224 (11) | 0.2933 (13) | 0.4976 (11) | 0.3512 (19) |
| 3DVG-Trans + | permissive | 0.7733 (14) | 0.5787 (20) | 0.4370 (8) | 0.3102 (9) | 0.5124 (8) | 0.3704 (12) |
| SRGA | | 0.7494 (22) | 0.5128 (28) | 0.3631 (22) | 0.2218 (27) | 0.4497 (23) | 0.2871 (27) |
| InstanceRefer | permissive | 0.7782 (11) | 0.6669 (9) | 0.3457 (29) | 0.2688 (20) | 0.4427 (24) | 0.3580 (17) |
| ScanRefer Baseline | | 0.6422 (31) | 0.4196 (33) | 0.3090 (35) | 0.1832 (32) | 0.3837 (34) | 0.2362 (33) |
| TGNN | | 0.6834 (29) | 0.5894 (17) | 0.3312 (30) | 0.2526 (23) | 0.4102 (30) | 0.3281 (23) |
| Clip | | 0.7733 (14) | 0.6810 (7) | 0.3619 (23) | 0.2919 (15) | 0.4542 (20) | 0.3791 (9) |
| BEAUTY-DETR | copyleft | 0.7848 (8) | 0.5499 (24) | 0.3934 (15) | 0.2480 (24) | 0.4811 (14) | 0.3157 (24) |
| CORE-3DVG | | 0.8557 (2) | 0.6867 (4) | 0.5275 (1) | 0.3850 (2) | 0.6011 (1) | 0.4527 (3) |
| M3DRef-CLIP | permissive | 0.7980 (5) | 0.7085 (2) | 0.4692 (3) | 0.3807 (3) | 0.5433 (3) | 0.4545 (2) |
| bo3d-1 | | 0.7469 (23) | 0.5606 (22) | 0.4539 (6) | 0.3124 (7) | 0.5196 (6) | 0.3680 (13) |
| Co3d3 | | 0.5326 (37) | 0.1369 (37) | 0.3848 (18) | 0.1651 (37) | 0.4179 (28) | 0.1588 (37) |
| bo3d0 | | 0.4823 (39) | 0.1278 (38) | 0.3271 (31) | 0.1394 (39) | 0.3619 (36) | 0.1368 (39) |
| bo3d | | 0.5400 (36) | 0.1550 (36) | 0.3817 (20) | 0.1785 (33) | 0.4172 (29) | 0.1732 (36) |
| Co3d2 | | 0.5070 (38) | 0.1195 (39) | 0.3569 (26) | 0.1511 (38) | 0.3906 (32) | 0.1440 (38) |
| Co3d | | 0.0000 (40) | 0.0000 (40) | 0.0000 (40) | 0.0000 (40) | 0.0000 (40) | 0.0000 (40) |
| 3DInsVG | | 0.8170 (3) | 0.6925 (3) | 0.4582 (4) | 0.3617 (4) | 0.5386 (4) | 0.4359 (4) |
| ConcreteNet | | 0.8607 (1) | 0.7923 (1) | 0.4746 (2) | 0.4091 (1) | 0.5612 (2) | 0.4950 (1) |
| HGT | | 0.7692 (16) | 0.5886 (18) | 0.4141 (13) | 0.2924 (14) | 0.4937 (13) | 0.3588 (16) |
| scanrefer2 | | 0.6340 (32) | 0.4353 (31) | 0.3193 (34) | 0.1947 (31) | 0.3898 (33) | 0.2486 (31) |
| CSA-M3LM | | 0.8137 (4) | 0.6241 (13) | 0.4544 (5) | 0.3317 (6) | 0.5349 (5) | 0.3972 (6) |
| ScanRefer_vanilla | | 0.6488 (30) | 0.4056 (34) | 0.3052 (36) | 0.1782 (34) | 0.3823 (35) | 0.2292 (34) |
| HAM | | 0.7799 (10) | 0.6373 (12) | 0.4148 (12) | 0.3324 (5) | 0.4967 (12) | 0.4007 (5) |
| SPANet | | 0.5614 (35) | 0.4641 (30) | 0.2800 (37) | 0.2071 (29) | 0.3431 (38) | 0.2647 (29) |
| henet | | 0.7110 (26) | 0.5180 (27) | 0.3936 (14) | 0.2472 (25) | 0.4590 (17) | 0.3030 (26) |
| grounding | | 0.7298 (24) | 0.5458 (25) | 0.3822 (19) | 0.2421 (26) | 0.4538 (21) | 0.3046 (25) |
| SAVG | | 0.7758 (13) | 0.5664 (21) | 0.4236 (10) | 0.2826 (17) | 0.5026 (10) | 0.3462 (20) |
| ScanRefer | permissive | 0.6859 (28) | 0.4353 (31) | 0.3488 (28) | 0.2097 (28) | 0.4244 (26) | 0.2603 (30) |

Publications listed for the methods above:

D3Net, D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
3DVG-Transformer, 3DVG-Trans +: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
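
The acc@0.25IoU and acc@0.5IoU columns report the fraction of descriptions whose predicted box overlaps the ground-truth box by at least the given IoU. The following is a minimal sketch of that computation, not the benchmark's evaluation code. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), one predicted box per description, and an inclusive threshold; the names `box_iou_3d` and `acc_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    reaches the threshold (inclusive comparison is an assumption)."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    gts = [(0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1)]
    preds = [(0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1)]
    # The second pair overlaps with IoU 1/3, so it counts at 0.25 but not at 0.5.
    print(acc_at_iou(preds, gts, 0.25), acc_at_iou(preds, gts, 0.5))
```

Under these assumptions, the Unique, Multiple, and Overall columns would apply the same function to different subsets of the test descriptions.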

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


Column groups: Captioning F1-Score (CIDEr, BLEU-4, Rouge-L, and METEOR at 0.5 IoU), Dense Captioning (DCmAP), and Object Detection (mAP@0.5). Each cell gives the score followed by the method's rank for that column in parentheses. Rows are sorted by BLEU-4@0.5IoU.

| Method | Info (license) | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|---|
| vote2cap-detr | permissive | 0.3128 (1) | 0.1778 (1) | 0.2842 (1) | 0.1316 (1) | 0.1825 (1) | 0.4454 (1) |
| CFM | | 0.2360 (2) | 0.1417 (2) | 0.2253 (2) | 0.1034 (2) | 0.1379 (5) | 0.3008 (5) |
| CM3D-Trans+ | | 0.2348 (3) | 0.1383 (3) | 0.2250 (4) | 0.1030 (3) | 0.1398 (4) | 0.2966 (7) |
| Forest-xyz | | 0.2266 (4) | 0.1363 (4) | 0.2250 (3) | 0.1027 (4) | 0.1161 (10) | 0.2825 (10) |
| 3DJCG(Captioning) | permissive | 0.1918 (6) | 0.1350 (5) | 0.2207 (6) | 0.1013 (6) | 0.1506 (2) | 0.3867 (3) |
| D3Net - Speaker | permissive | 0.2088 (5) | 0.1335 (6) | 0.2237 (5) | 0.1022 (5) | 0.1481 (3) | 0.4198 (2) |
| REMAN | | 0.1662 (7) | 0.1070 (7) | 0.1790 (7) | 0.0815 (7) | 0.1235 (8) | 0.2927 (9) |
| NOAH | | 0.1382 (8) | 0.0901 (8) | 0.1598 (8) | 0.0747 (8) | 0.1359 (6) | 0.2977 (6) |
| SpaCap3D | permissive | 0.1359 (9) | 0.0883 (9) | 0.1591 (9) | 0.0738 (9) | 0.1182 (9) | 0.3275 (4) |
| SUN+ | | 0.1148 (12) | 0.0846 (10) | 0.1564 (10) | 0.0711 (10) | 0.1143 (11) | 0.2958 (8) |
| X-Trans2Cap | permissive | 0.1274 (10) | 0.0808 (11) | 0.1392 (11) | 0.0653 (11) | 0.1244 (7) | 0.2795 (11) |
| MORE-xyz | permissive | 0.1239 (11) | 0.0796 (12) | 0.1362 (12) | 0.0631 (12) | 0.1116 (12) | 0.2648 (12) |
| Scan2Cap | permissive | 0.0849 (13) | 0.0576 (13) | 0.1073 (13) | 0.0492 (13) | 0.0970 (13) | 0.2481 (13) |

Publications listed for the methods above:

vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021