This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Each cell gives the score followed, in parentheses, by the method's rank for that metric.

                                Unique                    Multiple                  Overall
Method              Info        acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU   acc@0.25IoU  acc@0.5IoU
D-LISA                          0.8195 (5)   0.6900 (7)   0.4975 (2)   0.3967 (4)   0.5697 (2)   0.4625 (3)
TransformerRefer                0.6010 (41)  0.4658 (37)  0.2540 (47)  0.1730 (44)  0.3318 (47)  0.2386 (40)
HGT                             0.7692 (23)  0.5886 (25)  0.4141 (19)  0.2924 (22)  0.4937 (19)  0.3588 (25)
BEAUTY-DETR         copyleft    0.7848 (14)  0.5499 (32)  0.3934 (21)  0.2480 (32)  0.4811 (20)  0.3157 (32)
    Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
Clip-pre                        0.7766 (19)  0.6843 (9)   0.3617 (34)  0.2904 (24)  0.4547 (26)  0.3787 (16)
Clip                            0.7733 (21)  0.6810 (11)  0.3619 (32)  0.2919 (23)  0.4542 (27)  0.3791 (15)
TransformerVG                   0.7502 (28)  0.5977 (23)  0.3712 (28)  0.2628 (29)  0.4562 (25)  0.3379 (30)
FE-3DGQA                        0.7857 (13)  0.5862 (26)  0.4317 (15)  0.2935 (19)  0.5111 (15)  0.3592 (23)
ContraRefer                     0.7832 (15)  0.6801 (12)  0.3850 (23)  0.2947 (18)  0.4743 (22)  0.3811 (14)
D3Net               permissive  0.7923 (11)  0.6843 (9)   0.3905 (22)  0.3074 (16)  0.4806 (21)  0.3919 (13)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
D3Net - Pretrained  permissive  0.7659 (25)  0.6579 (17)  0.3619 (32)  0.2726 (27)  0.4525 (29)  0.3590 (24)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
SR-GAB                          0.7016 (34)  0.5202 (34)  0.3233 (41)  0.1959 (38)  0.4081 (39)  0.2686 (36)
grounding                       0.7298 (31)  0.5458 (33)  0.3822 (25)  0.2421 (34)  0.4538 (28)  0.3046 (33)
pairwisemethod                  0.5779 (42)  0.3603 (43)  0.2792 (46)  0.1746 (43)  0.3462 (45)  0.2163 (43)
PointGroup_MCAN                 0.7510 (27)  0.6397 (18)  0.3271 (39)  0.2535 (30)  0.4222 (35)  0.3401 (29)
3DJCG(Grounding)    permissive  0.7675 (24)  0.6059 (22)  0.4389 (13)  0.3117 (14)  0.5126 (13)  0.3776 (17)
    Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
3DVG-Transformer    permissive  0.7576 (26)  0.5515 (31)  0.4224 (17)  0.2933 (20)  0.4976 (17)  0.3512 (27)
    Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
3DVG-Trans +        permissive  0.7733 (21)  0.5787 (27)  0.4370 (14)  0.3102 (15)  0.5124 (14)  0.3704 (18)
    Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
SRGA                            0.7494 (29)  0.5128 (36)  0.3631 (31)  0.2218 (35)  0.4497 (31)  0.2871 (35)
InstanceRefer       permissive  0.7782 (18)  0.6669 (15)  0.3457 (37)  0.2688 (28)  0.4427 (32)  0.3580 (26)
    Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
ScanRefer Baseline              0.6422 (38)  0.4196 (41)  0.3090 (43)  0.1832 (40)  0.3837 (42)  0.2362 (41)
TGNN                            0.6834 (36)  0.5894 (24)  0.3312 (38)  0.2526 (31)  0.4102 (38)  0.3281 (31)
    Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
SAVG                            0.7758 (20)  0.5664 (28)  0.4236 (16)  0.2826 (25)  0.5026 (16)  0.3462 (28)
henet                           0.7110 (33)  0.5180 (35)  0.3936 (20)  0.2472 (33)  0.4590 (23)  0.3030 (34)
M3DRef-test                     0.7865 (12)  0.6793 (13)  0.4963 (3)   0.3977 (3)   0.5614 (4)   0.4608 (4)
Co3d3                           0.5326 (45)  0.1369 (45)  0.3848 (24)  0.1651 (45)  0.4179 (36)  0.1588 (45)
RG-SAN                          0.7964 (10)  0.6785 (14)  0.4591 (9)   0.3600 (10)  0.5348 (11)  0.4314 (10)
SAF                             0.6348 (39)  0.5647 (29)  0.3726 (27)  0.3009 (17)  0.4314 (33)  0.3601 (22)
M3DRef-SCLIP                    0.7997 (8)   0.7123 (2)   0.4708 (7)   0.3805 (8)   0.5445 (7)   0.4549 (6)
cus3d                           0.8384 (3)   0.7073 (5)   0.4908 (4)   0.4000 (2)   0.5688 (3)   0.4689 (2)
pointclip                       0.8211 (4)   0.7082 (4)   0.4803 (5)   0.3884 (5)   0.5567 (6)   0.4601 (5)
Se2d                            0.7799 (16)  0.6628 (16)  0.3636 (30)  0.2823 (26)  0.4569 (24)  0.3677 (20)
secg                            0.7288 (32)  0.6175 (21)  0.3696 (29)  0.2933 (20)  0.4501 (30)  0.3660 (21)
CORE-3DVG                       0.8557 (2)   0.6867 (8)   0.5275 (1)   0.3850 (6)   0.6011 (1)   0.4527 (8)
bo3d-1                          0.7469 (30)  0.5606 (30)  0.4539 (12)  0.3124 (13)  0.5196 (12)  0.3680 (19)
bo3d0                           0.4823 (47)  0.1278 (46)  0.3271 (39)  0.1394 (47)  0.3619 (44)  0.1368 (47)
SPANet                          0.5614 (43)  0.4641 (38)  0.2800 (45)  0.2071 (37)  0.3431 (46)  0.2647 (37)
bo3d                            0.5400 (44)  0.1550 (44)  0.3817 (26)  0.1785 (41)  0.4172 (37)  0.1732 (44)
Co3d2                           0.5070 (46)  0.1195 (47)  0.3569 (35)  0.1511 (46)  0.3906 (40)  0.1440 (46)
Co3d                            0.0000 (48)  0.0000 (48)  0.0000 (48)  0.0000 (48)  0.0000 (48)  0.0000 (48)
3DInsVG                         0.8170 (6)   0.6925 (6)   0.4582 (10)  0.3617 (9)   0.5386 (9)   0.4359 (9)
M3DRef-CLIP         permissive  0.7980 (9)   0.7085 (3)   0.4692 (8)   0.3807 (7)   0.5433 (8)   0.4545 (7)
    Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
ConcreteNet                     0.8607 (1)   0.7923 (1)   0.4746 (6)   0.4091 (1)   0.5612 (5)   0.4950 (1)
    Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
scanrefer2                      0.6340 (40)  0.4353 (39)  0.3193 (42)  0.1947 (39)  0.3898 (41)  0.2486 (39)
CSA-M3LM                        0.8137 (7)   0.6241 (20)  0.4544 (11)  0.3317 (12)  0.5349 (10)  0.3972 (12)
ScanRefer_vanilla               0.6488 (37)  0.4056 (42)  0.3052 (44)  0.1782 (42)  0.3823 (43)  0.2292 (42)
HAM                             0.7799 (16)  0.6373 (19)  0.4148 (18)  0.3324 (11)  0.4967 (18)  0.4007 (11)
    Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
ScanRefer           permissive  0.6859 (35)  0.4353 (39)  0.3488 (36)  0.2097 (36)  0.4244 (34)  0.2603 (38)
    Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


Each cell gives the score followed, in parentheses, by the method's rank for that metric. Rows are sorted by DCmAP.

                                Captioning F1-Score                                             Dense Captioning  Object Detection
Method              Info        CIDEr@0.5IoU    BLEU-4@0.5IoU   Rouge-L@0.5IoU  METEOR@0.5IoU   DCmAP             mAP@0.5
Vote2Cap-DETR++                 0.3360 (1)      0.1908 (1)      0.3012 (1)      0.1386 (1)      0.1864 (1)        0.5090 (1)
    Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
vote2cap-detr       permissive  0.3128 (2)      0.1778 (2)      0.2842 (3)      0.1316 (3)      0.1825 (2)        0.4454 (3)
    Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
TMP                             0.3029 (3)      0.1728 (3)      0.2898 (2)      0.1332 (2)      0.1801 (3)        0.4605 (2)
3DJCG(Captioning)   permissive  0.1918 (8)      0.1350 (7)      0.2207 (8)      0.1013 (8)      0.1506 (4)        0.3867 (5)
    Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
D3Net - Speaker     permissive  0.2088 (7)      0.1335 (8)      0.2237 (7)      0.1022 (7)      0.1481 (5)        0.4198 (4)
    Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
CM3D-Trans+                     0.2348 (5)      0.1383 (5)      0.2250 (6)      0.1030 (5)      0.1398 (6)        0.2966 (9)
    Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
CFM                             0.2360 (4)      0.1417 (4)      0.2253 (4)      0.1034 (4)      0.1379 (7)        0.3008 (7)
NOAH                            0.1382 (10)     0.0901 (10)     0.1598 (10)     0.0747 (10)     0.1359 (8)        0.2977 (8)
X-Trans2Cap         permissive  0.1274 (12)     0.0808 (13)     0.1392 (13)     0.0653 (13)     0.1244 (9)        0.2795 (13)
    Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
REMAN                           0.1662 (9)      0.1070 (9)      0.1790 (9)      0.0815 (9)      0.1235 (10)       0.2927 (11)
SpaCap3D            permissive  0.1359 (11)     0.0883 (11)     0.1591 (11)     0.0738 (11)     0.1182 (11)       0.3275 (6)
    Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
Forest-xyz                      0.2266 (6)      0.1363 (6)      0.2250 (5)      0.1027 (6)      0.1161 (12)       0.2825 (12)
SUN+                            0.1148 (14)     0.0846 (12)     0.1564 (12)     0.0711 (12)     0.1143 (13)       0.2958 (10)
MORE-xyz            permissive  0.1239 (13)     0.0796 (14)     0.1362 (14)     0.0631 (14)     0.1116 (14)       0.2648 (14)
    Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi_ORder RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
Scan2Cap            permissive  0.0849 (15)     0.0576 (15)     0.1073 (15)     0.0492 (15)     0.0970 (15)       0.2481 (15)
    Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021