This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Methods are ranked by Overall acc@0.5IoU. The number in parentheses after each score is the method's rank on that column. Tags in brackets give the code license shown on the benchmark page.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| ConcreteNet | 0.8607 (1) | 0.7923 (1) | 0.4746 (2) | 0.4091 (1) | 0.5612 (2) | 0.4950 (1) |
| M3DRef-CLIP [permissive] | 0.7980 (5) | 0.7085 (2) | 0.4692 (3) | 0.3807 (3) | 0.5433 (3) | 0.4545 (2) |
| CORE-3DVG | 0.8557 (2) | 0.6867 (4) | 0.5275 (1) | 0.3850 (2) | 0.6011 (1) | 0.4527 (3) |
| 3DInsVG | 0.8170 (3) | 0.6925 (3) | 0.4582 (4) | 0.3617 (4) | 0.5386 (4) | 0.4359 (4) |
| HAM | 0.7799 (10) | 0.6373 (13) | 0.4148 (12) | 0.3324 (5) | 0.4967 (12) | 0.4007 (5) |
| CSA-M3LM | 0.8137 (4) | 0.6241 (14) | 0.4544 (5) | 0.3317 (6) | 0.5349 (5) | 0.3972 (6) |
| D3Net [permissive] | 0.7923 (6) | 0.6843 (5) | 0.3905 (16) | 0.3074 (10) | 0.4806 (15) | 0.3919 (7) |
| ContraRefer | 0.7832 (9) | 0.6801 (8) | 0.3850 (17) | 0.2947 (11) | 0.4743 (16) | 0.3811 (8) |
| Clip | 0.7733 (15) | 0.6810 (7) | 0.3619 (25) | 0.2919 (16) | 0.4542 (21) | 0.3791 (9) |
| Clip-pre | 0.7766 (13) | 0.6843 (5) | 0.3617 (27) | 0.2904 (17) | 0.4547 (20) | 0.3787 (10) |
| 3DJCG(Grounding) [permissive] | 0.7675 (18) | 0.6059 (16) | 0.4389 (7) | 0.3117 (8) | 0.5126 (7) | 0.3776 (11) |
| 3DVG-Trans + [permissive] | 0.7733 (15) | 0.5787 (21) | 0.4370 (8) | 0.3102 (9) | 0.5124 (8) | 0.3704 (12) |
| bo3d-1 | 0.7469 (24) | 0.5606 (23) | 0.4539 (6) | 0.3124 (7) | 0.5196 (6) | 0.3680 (13) |
| Se2d | 0.7799 (10) | 0.6628 (10) | 0.3636 (23) | 0.2823 (19) | 0.4569 (18) | 0.3677 (14) |
| secg | 0.7288 (26) | 0.6175 (15) | 0.3696 (22) | 0.2933 (13) | 0.4501 (24) | 0.3660 (15) |
| FE-3DGQA | 0.7857 (7) | 0.5862 (20) | 0.4317 (9) | 0.2935 (12) | 0.5111 (9) | 0.3592 (16) |
| D3Net - Pretrained [permissive] | 0.7659 (19) | 0.6579 (11) | 0.3619 (25) | 0.2726 (20) | 0.4525 (23) | 0.3590 (17) |
| HGT | 0.7692 (17) | 0.5886 (19) | 0.4141 (13) | 0.2924 (15) | 0.4937 (13) | 0.3588 (18) |
| InstanceRefer [permissive] | 0.7782 (12) | 0.6669 (9) | 0.3457 (30) | 0.2688 (21) | 0.4427 (26) | 0.3580 (19) |
| 3DVG-Transformer [permissive] | 0.7576 (20) | 0.5515 (24) | 0.4224 (11) | 0.2933 (13) | 0.4976 (11) | 0.3512 (20) |
| SAVG | 0.7758 (14) | 0.5664 (22) | 0.4236 (10) | 0.2826 (18) | 0.5026 (10) | 0.3462 (21) |
| PointGroup_MCAN | 0.7510 (21) | 0.6397 (12) | 0.3271 (32) | 0.2535 (23) | 0.4222 (28) | 0.3401 (22) |
| TransformerVG | 0.7502 (22) | 0.5977 (17) | 0.3712 (21) | 0.2628 (22) | 0.4562 (19) | 0.3379 (23) |
| TGNN | 0.6834 (30) | 0.5894 (18) | 0.3312 (31) | 0.2526 (24) | 0.4102 (31) | 0.3281 (24) |
| BEAUTY-DETR [copyleft] | 0.7848 (8) | 0.5499 (25) | 0.3934 (15) | 0.2480 (25) | 0.4811 (14) | 0.3157 (25) |
| grounding | 0.7298 (25) | 0.5458 (26) | 0.3822 (19) | 0.2421 (27) | 0.4538 (22) | 0.3046 (26) |
| henet | 0.7110 (27) | 0.5180 (28) | 0.3936 (14) | 0.2472 (26) | 0.4590 (17) | 0.3030 (27) |
| SRGA | 0.7494 (23) | 0.5128 (29) | 0.3631 (24) | 0.2218 (28) | 0.4497 (25) | 0.2871 (28) |
| SR-GAB | 0.7016 (28) | 0.5202 (27) | 0.3233 (34) | 0.1959 (31) | 0.4081 (32) | 0.2686 (29) |
| SPANet | 0.5614 (36) | 0.4641 (31) | 0.2800 (38) | 0.2071 (30) | 0.3431 (39) | 0.2647 (30) |
| ScanRefer [permissive] | 0.6859 (29) | 0.4353 (32) | 0.3488 (29) | 0.2097 (29) | 0.4244 (27) | 0.2603 (31) |
| scanrefer2 | 0.6340 (33) | 0.4353 (32) | 0.3193 (35) | 0.1947 (32) | 0.3898 (34) | 0.2486 (32) |
| TransformerRefer | 0.6010 (34) | 0.4658 (30) | 0.2540 (40) | 0.1730 (37) | 0.3318 (40) | 0.2386 (33) |
| ScanRefer Baseline | 0.6422 (32) | 0.4196 (34) | 0.3090 (36) | 0.1832 (33) | 0.3837 (35) | 0.2362 (34) |
| ScanRefer_vanilla | 0.6488 (31) | 0.4056 (35) | 0.3052 (37) | 0.1782 (35) | 0.3823 (36) | 0.2292 (35) |
| pairwisemethod | 0.5779 (35) | 0.3603 (36) | 0.2792 (39) | 0.1746 (36) | 0.3462 (38) | 0.2163 (36) |
| bo3d | 0.5400 (37) | 0.1550 (37) | 0.3817 (20) | 0.1785 (34) | 0.4172 (30) | 0.1732 (37) |
| Co3d3 | 0.5326 (38) | 0.1369 (38) | 0.3848 (18) | 0.1651 (38) | 0.4179 (29) | 0.1588 (38) |
| Co3d2 | 0.5070 (39) | 0.1195 (40) | 0.3569 (28) | 0.1511 (39) | 0.3906 (33) | 0.1440 (39) |
| bo3d0 | 0.4823 (40) | 0.1278 (39) | 0.3271 (32) | 0.1394 (40) | 0.3619 (37) | 0.1368 (40) |
| Co3d | 0.0000 (41) | 0.0000 (41) | 0.0000 (41) | 0.0000 (41) | 0.0000 (41) | 0.0000 (41) |

Published methods:

- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- D3Net, D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans +, 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
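The acc@kIoU columns report the fraction of test descriptions for which the predicted 3D bounding box overlaps the ground-truth box with IoU at least k (0.25 or 0.5). A minimal sketch of that metric for axis-aligned boxes, assuming one prediction per description; the function names are illustrative, not the benchmark's evaluation code:

```python
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])          # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])          # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(preds, gts, thresh):
    """acc@kIoU: fraction of predictions whose IoU with the ground truth is >= thresh."""
    ious = [box_iou_3d(p, g) for p, g in zip(preds, gts)]
    return float(np.mean([iou >= thresh for iou in ious]))
```

On the leaderboard this is computed separately over the "unique" subset (only one object of the target class in the scene) and the harder "multiple" subset (distractors of the same class present), which is why the two columns differ so sharply.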

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


Methods are ranked by CIDEr@0.5IoU. The number in parentheses after each score is the method's rank on that column. The benchmark page groups the columns as Captioning F1-Score (CIDEr, BLEU-4, Rouge-L, METEOR, each @0.5IoU), Dense Captioning (DCmAP), and Object Detection (mAP@0.5). Tags in brackets give the code license shown on the benchmark page.

| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| vote2cap-detr [permissive] | 0.3128 (1) | 0.1778 (1) | 0.2842 (1) | 0.1316 (1) | 0.1825 (1) | 0.4454 (1) |
| CFM | 0.2360 (2) | 0.1417 (2) | 0.2253 (2) | 0.1034 (2) | 0.1379 (5) | 0.3008 (5) |
| CM3D-Trans+ | 0.2348 (3) | 0.1383 (3) | 0.2250 (4) | 0.1030 (3) | 0.1398 (4) | 0.2966 (7) |
| Forest-xyz | 0.2266 (4) | 0.1363 (4) | 0.2250 (3) | 0.1027 (4) | 0.1161 (10) | 0.2825 (10) |
| D3Net - Speaker [permissive] | 0.2088 (5) | 0.1335 (6) | 0.2237 (5) | 0.1022 (5) | 0.1481 (3) | 0.4198 (2) |
| 3DJCG(Captioning) [permissive] | 0.1918 (6) | 0.1350 (5) | 0.2207 (6) | 0.1013 (6) | 0.1506 (2) | 0.3867 (3) |
| REMAN | 0.1662 (7) | 0.1070 (7) | 0.1790 (7) | 0.0815 (7) | 0.1235 (8) | 0.2927 (9) |
| NOAH | 0.1382 (8) | 0.0901 (8) | 0.1598 (8) | 0.0747 (8) | 0.1359 (6) | 0.2977 (6) |
| SpaCap3D [permissive] | 0.1359 (9) | 0.0883 (9) | 0.1591 (9) | 0.0738 (9) | 0.1182 (9) | 0.3275 (4) |
| X-Trans2Cap [permissive] | 0.1274 (10) | 0.0808 (11) | 0.1392 (11) | 0.0653 (11) | 0.1244 (7) | 0.2795 (11) |
| MORE-xyz [permissive] | 0.1239 (11) | 0.0796 (12) | 0.1362 (12) | 0.0631 (12) | 0.1116 (12) | 0.2648 (12) |
| SUN+ | 0.1148 (12) | 0.0846 (10) | 0.1564 (10) | 0.0711 (10) | 0.1143 (11) | 0.2958 (8) |
| Scan2Cap [permissive] | 0.0849 (13) | 0.0576 (13) | 0.1073 (13) | 0.0492 (13) | 0.0970 (13) | 0.2481 (13) |

Published methods:

- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang YU: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023.
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. IJCAI 2022.
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022.
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022.
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021.
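The @0.5IoU suffix on the captioning columns means the scores are IoU-gated: a generated caption is scored with the usual NLP metric (CIDEr, BLEU-4, etc.) only if its predicted box overlaps the corresponding ground-truth box with IoU of at least 0.5, and contributes 0 otherwise. A minimal sketch of that gating, assuming predictions are already matched one-to-one to ground-truth objects; the function name is illustrative, not the benchmark's evaluation code:

```python
def caption_metric_at_iou(scores, ious, thresh=0.5):
    """m@kIoU-style gating: a caption's metric score counts only when its
    predicted box reaches IoU >= thresh with the ground truth; otherwise
    it contributes 0. Returns the mean over all matched objects."""
    gated = [s if iou >= thresh else 0.0 for s, iou in zip(scores, ious)]
    return sum(gated) / len(gated)
```

This gating is why captioning scores track detection quality: a model with fluent captions but poor boxes is penalized just as hard as one with accurate boxes but weak captions.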