This table lists the benchmark results for the ScanRefer Localization Benchmark.


Each cell shows the accuracy followed by the method's rank on that metric in parentheses. Rows are sorted by Overall acc@0.5IoU.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| ConcreteNet | 0.8607 (1) | 0.7923 (1) | 0.4746 (4) | 0.4091 (1) | 0.5612 (3) | 0.4950 (1) |
| cus3d | 0.8384 (3) | 0.7073 (5) | 0.4908 (2) | 0.4000 (2) | 0.5688 (2) | 0.4689 (2) |
| pointclip | 0.8211 (4) | 0.7082 (4) | 0.4803 (3) | 0.3884 (3) | 0.5567 (4) | 0.4601 (3) |
| M3DRef-SCLIP | 0.7997 (7) | 0.7123 (2) | 0.4708 (5) | 0.3805 (6) | 0.5445 (5) | 0.4549 (4) |
| M3DRef-CLIP (permissive) | 0.7980 (8) | 0.7085 (3) | 0.4692 (6) | 0.3807 (5) | 0.5433 (6) | 0.4545 (5) |
| CORE-3DVG | 0.8557 (2) | 0.6867 (7) | 0.5275 (1) | 0.3850 (4) | 0.6011 (1) | 0.4527 (6) |
| 3DInsVG | 0.8170 (5) | 0.6925 (6) | 0.4582 (8) | 0.3617 (7) | 0.5386 (7) | 0.4359 (7) |
| RG-SAN | 0.7964 (9) | 0.6785 (12) | 0.4591 (7) | 0.3600 (8) | 0.5348 (9) | 0.4314 (8) |
| HAM | 0.7799 (14) | 0.6373 (17) | 0.4148 (16) | 0.3324 (9) | 0.4967 (16) | 0.4007 (9) |
| CSA-M3LM | 0.8137 (6) | 0.6241 (18) | 0.4544 (9) | 0.3317 (10) | 0.5349 (8) | 0.3972 (10) |
| D3Net (permissive) | 0.7923 (10) | 0.6843 (8) | 0.3905 (20) | 0.3074 (14) | 0.4806 (19) | 0.3919 (11) |
| ContraRefer | 0.7832 (13) | 0.6801 (11) | 0.3850 (21) | 0.2947 (16) | 0.4743 (20) | 0.3811 (12) |
| Clip | 0.7733 (19) | 0.6810 (10) | 0.3619 (30) | 0.2919 (21) | 0.4542 (25) | 0.3791 (13) |
| Clip-pre | 0.7766 (17) | 0.6843 (8) | 0.3617 (32) | 0.2904 (22) | 0.4547 (24) | 0.3787 (14) |
| 3DJCG(Grounding) (permissive) | 0.7675 (22) | 0.6059 (20) | 0.4389 (11) | 0.3117 (12) | 0.5126 (11) | 0.3776 (15) |
| 3DVG-Trans + (permissive) | 0.7733 (19) | 0.5787 (25) | 0.4370 (12) | 0.3102 (13) | 0.5124 (12) | 0.3704 (16) |
| bo3d-1 | 0.7469 (28) | 0.5606 (28) | 0.4539 (10) | 0.3124 (11) | 0.5196 (10) | 0.3680 (17) |
| Se2d | 0.7799 (14) | 0.6628 (14) | 0.3636 (28) | 0.2823 (24) | 0.4569 (22) | 0.3677 (18) |
| secg | 0.7288 (30) | 0.6175 (19) | 0.3696 (27) | 0.2933 (18) | 0.4501 (28) | 0.3660 (19) |
| SAF | 0.6348 (37) | 0.5647 (27) | 0.3726 (25) | 0.3009 (15) | 0.4314 (31) | 0.3601 (20) |
| FE-3DGQA | 0.7857 (11) | 0.5862 (24) | 0.4317 (13) | 0.2935 (17) | 0.5111 (13) | 0.3592 (21) |
| D3Net - Pretrained (permissive) | 0.7659 (23) | 0.6579 (15) | 0.3619 (30) | 0.2726 (25) | 0.4525 (27) | 0.3590 (22) |
| HGT | 0.7692 (21) | 0.5886 (23) | 0.4141 (17) | 0.2924 (20) | 0.4937 (17) | 0.3588 (23) |
| InstanceRefer (permissive) | 0.7782 (16) | 0.6669 (13) | 0.3457 (35) | 0.2688 (26) | 0.4427 (30) | 0.3580 (24) |
| 3DVG-Transformer (permissive) | 0.7576 (24) | 0.5515 (29) | 0.4224 (15) | 0.2933 (18) | 0.4976 (15) | 0.3512 (25) |
| SAVG | 0.7758 (18) | 0.5664 (26) | 0.4236 (14) | 0.2826 (23) | 0.5026 (14) | 0.3462 (26) |
| PointGroup_MCAN | 0.7510 (25) | 0.6397 (16) | 0.3271 (37) | 0.2535 (28) | 0.4222 (33) | 0.3401 (27) |
| TransformerVG | 0.7502 (26) | 0.5977 (21) | 0.3712 (26) | 0.2628 (27) | 0.4562 (23) | 0.3379 (28) |
| TGNN | 0.6834 (34) | 0.5894 (22) | 0.3312 (36) | 0.2526 (29) | 0.4102 (36) | 0.3281 (29) |
| BEAUTY-DETR (copyleft) | 0.7848 (12) | 0.5499 (30) | 0.3934 (19) | 0.2480 (30) | 0.4811 (18) | 0.3157 (30) |
| grounding | 0.7298 (29) | 0.5458 (31) | 0.3822 (23) | 0.2421 (32) | 0.4538 (26) | 0.3046 (31) |
| henet | 0.7110 (31) | 0.5180 (33) | 0.3936 (18) | 0.2472 (31) | 0.4590 (21) | 0.3030 (32) |
| SRGA | 0.7494 (27) | 0.5128 (34) | 0.3631 (29) | 0.2218 (33) | 0.4497 (29) | 0.2871 (33) |
| SR-GAB | 0.7016 (32) | 0.5202 (32) | 0.3233 (39) | 0.1959 (36) | 0.4081 (37) | 0.2686 (34) |
| SPANet | 0.5614 (41) | 0.4641 (36) | 0.2800 (43) | 0.2071 (35) | 0.3431 (44) | 0.2647 (35) |
| ScanRefer (permissive) | 0.6859 (33) | 0.4353 (37) | 0.3488 (34) | 0.2097 (34) | 0.4244 (32) | 0.2603 (36) |
| scanrefer2 | 0.6340 (38) | 0.4353 (37) | 0.3193 (40) | 0.1947 (37) | 0.3898 (39) | 0.2486 (37) |
| TransformerRefer | 0.6010 (39) | 0.4658 (35) | 0.2540 (45) | 0.1730 (42) | 0.3318 (45) | 0.2386 (38) |
| ScanRefer Baseline | 0.6422 (36) | 0.4196 (39) | 0.3090 (41) | 0.1832 (38) | 0.3837 (40) | 0.2362 (39) |
| ScanRefer_vanilla | 0.6488 (35) | 0.4056 (40) | 0.3052 (42) | 0.1782 (40) | 0.3823 (41) | 0.2292 (40) |
| pairwisemethod | 0.5779 (40) | 0.3603 (41) | 0.2792 (44) | 0.1746 (41) | 0.3462 (43) | 0.2163 (41) |
| bo3d | 0.5400 (42) | 0.1550 (42) | 0.3817 (24) | 0.1785 (39) | 0.4172 (35) | 0.1732 (42) |
| Co3d3 | 0.5326 (43) | 0.1369 (43) | 0.3848 (22) | 0.1651 (43) | 0.4179 (34) | 0.1588 (43) |
| Co3d2 | 0.5070 (44) | 0.1195 (45) | 0.3569 (33) | 0.1511 (44) | 0.3906 (38) | 0.1440 (44) |
| bo3d0 | 0.4823 (45) | 0.1278 (44) | 0.3271 (37) | 0.1394 (45) | 0.3619 (42) | 0.1368 (45) |
| Co3d | 0.0000 (46) | 0.0000 (46) | 0.0000 (46) | 0.0000 (46) | 0.0000 (46) | 0.0000 (46) |

Publications associated with the entries above:

- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022.
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans + / 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020.
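As a rough sketch of how acc@kIoU numbers like those above are computed: a localization counts as correct when the predicted 3D bounding box overlaps the ground-truth box with IoU of at least k (0.25 or 0.5), and the accuracy is the fraction of correct predictions. The snippet below assumes axis-aligned boxes in corner format and one ground-truth box per description; the function names are ours, not the benchmark's evaluation code.

```python
import numpy as np

def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])          # intersection lower corner
    hi = np.minimum(a[3:], b[3:])          # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, thresh):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [box3d_iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```

A prediction can therefore count toward acc@0.25IoU but not acc@0.5IoU, which is why the 0.5-threshold columns are uniformly lower.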

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark.


The first four columns are captioning metrics evaluated at a 0.5 IoU threshold, DCmAP is the dense-captioning mAP, and mAP@0.5 is the object-detection mAP. Each cell shows the score followed by the method's rank on that metric in parentheses; rows are sorted by CIDEr@0.5IoU.

| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| Vote2Cap-DETR++ | 0.3360 (1) | 0.1908 (1) | 0.3012 (1) | 0.1386 (1) | 0.1864 (1) | 0.5090 (1) |
| vote2cap-detr (permissive) | 0.3128 (2) | 0.1778 (2) | 0.2842 (3) | 0.1316 (3) | 0.1825 (2) | 0.4454 (3) |
| TMP | 0.3029 (3) | 0.1728 (3) | 0.2898 (2) | 0.1332 (2) | 0.1801 (3) | 0.4605 (2) |
| CFM | 0.2360 (4) | 0.1417 (4) | 0.2253 (4) | 0.1034 (4) | 0.1379 (7) | 0.3008 (7) |
| CM3D-Trans+ | 0.2348 (5) | 0.1383 (5) | 0.2250 (6) | 0.1030 (5) | 0.1398 (6) | 0.2966 (9) |
| Forest-xyz | 0.2266 (6) | 0.1363 (6) | 0.2250 (5) | 0.1027 (6) | 0.1161 (12) | 0.2825 (12) |
| D3Net - Speaker (permissive) | 0.2088 (7) | 0.1335 (8) | 0.2237 (7) | 0.1022 (7) | 0.1481 (5) | 0.4198 (4) |
| 3DJCG(Captioning) (permissive) | 0.1918 (8) | 0.1350 (7) | 0.2207 (8) | 0.1013 (8) | 0.1506 (4) | 0.3867 (5) |
| REMAN | 0.1662 (9) | 0.1070 (9) | 0.1790 (9) | 0.0815 (9) | 0.1235 (10) | 0.2927 (11) |
| NOAH | 0.1382 (10) | 0.0901 (10) | 0.1598 (10) | 0.0747 (10) | 0.1359 (8) | 0.2977 (8) |
| SpaCap3D (permissive) | 0.1359 (11) | 0.0883 (11) | 0.1591 (11) | 0.0738 (11) | 0.1182 (11) | 0.3275 (6) |
| X-Trans2Cap (permissive) | 0.1274 (12) | 0.0808 (13) | 0.1392 (13) | 0.0653 (13) | 0.1244 (9) | 0.2795 (13) |
| MORE-xyz (permissive) | 0.1239 (13) | 0.0796 (14) | 0.1362 (14) | 0.0631 (14) | 0.1116 (14) | 0.2648 (14) |
| SUN+ | 0.1148 (14) | 0.0846 (12) | 0.1564 (12) | 0.0711 (12) | 0.1143 (13) | 0.2958 (10) |
| Scan2Cap (permissive) | 0.0849 (15) | 0.0576 (15) | 0.1073 (15) | 0.0492 (15) | 0.0970 (15) | 0.2481 (15) |

Publications associated with the entries above:

- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang YU, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023.
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022.
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022.
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-ORder RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022.
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021.
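The @0.5IoU captioning columns follow the m@kIoU convention introduced with Scan2Cap: a generated caption's score (CIDEr, BLEU-4, Rouge-L, or METEOR) is kept only when its predicted box overlaps the target object with IoU of at least k, and contributes 0 otherwise, so weak detection drags down the captioning score. A minimal sketch of that gating, with a function name of our own choosing:

```python
def metric_at_iou(caption_scores, ious, thresh=0.5):
    """m@kIoU-style aggregation: zero out caption scores whose detection
    IoU falls below the threshold, then average over all instances."""
    gated = [s if iou >= thresh else 0.0 for s, iou in zip(caption_scores, ious)]
    return sum(gated) / len(gated)
```

For example, a model with perfect captions but boxes that never reach 0.5 IoU would score 0 on every @0.5IoU column while still posting a nonzero mAP@0.25-style detection number, which is why captioning and detection are reported separately.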