This table lists the benchmark results for the ScanRefer Localization Benchmark.


Scores are accuracy at IoU thresholds 0.25 and 0.5 on the Unique, Multiple, and Overall subsets. Each cell shows the score followed by its per-column rank in parentheses; rows are sorted by Overall acc@0.5IoU.

| Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|
| ConcreteNet | 0.8607 (1) | 0.7923 (1) | 0.4746 (5) | 0.4091 (1) | 0.5612 (4) | 0.4950 (1) |
| cus3d | 0.8384 (3) | 0.7073 (5) | 0.4908 (3) | 0.4000 (2) | 0.5688 (2) | 0.4689 (2) |
| M3DRef-test | 0.7865 (11) | 0.6793 (12) | 0.4963 (2) | 0.3977 (3) | 0.5614 (3) | 0.4608 (3) |
| pointclip | 0.8211 (4) | 0.7082 (4) | 0.4803 (4) | 0.3884 (4) | 0.5567 (5) | 0.4601 (4) |
| M3DRef-SCLIP | 0.7997 (7) | 0.7123 (2) | 0.4708 (6) | 0.3805 (7) | 0.5445 (6) | 0.4549 (5) |
| M3DRef-CLIP | 0.7980 (8) | 0.7085 (3) | 0.4692 (7) | 0.3807 (6) | 0.5433 (7) | 0.4545 (6) |
| CORE-3DVG | 0.8557 (2) | 0.6867 (7) | 0.5275 (1) | 0.3850 (5) | 0.6011 (1) | 0.4527 (7) |
| 3DInsVG | 0.8170 (5) | 0.6925 (6) | 0.4582 (9) | 0.3617 (8) | 0.5386 (8) | 0.4359 (8) |
| RG-SAN | 0.7964 (9) | 0.6785 (13) | 0.4591 (8) | 0.3600 (9) | 0.5348 (10) | 0.4314 (9) |
| HAM | 0.7799 (15) | 0.6373 (18) | 0.4148 (17) | 0.3324 (10) | 0.4967 (17) | 0.4007 (10) |
| CSA-M3LM | 0.8137 (6) | 0.6241 (19) | 0.4544 (10) | 0.3317 (11) | 0.5349 (9) | 0.3972 (11) |
| D3Net | 0.7923 (10) | 0.6843 (8) | 0.3905 (21) | 0.3074 (15) | 0.4806 (20) | 0.3919 (12) |
| ContraRefer | 0.7832 (14) | 0.6801 (11) | 0.3850 (22) | 0.2947 (17) | 0.4743 (21) | 0.3811 (13) |
| Clip | 0.7733 (20) | 0.6810 (10) | 0.3619 (31) | 0.2919 (22) | 0.4542 (26) | 0.3791 (14) |
| Clip-pre | 0.7766 (18) | 0.6843 (8) | 0.3617 (33) | 0.2904 (23) | 0.4547 (25) | 0.3787 (15) |
| 3DJCG(Grounding) | 0.7675 (23) | 0.6059 (21) | 0.4389 (12) | 0.3117 (13) | 0.5126 (12) | 0.3776 (16) |
| 3DVG-Trans + | 0.7733 (20) | 0.5787 (26) | 0.4370 (13) | 0.3102 (14) | 0.5124 (13) | 0.3704 (17) |
| bo3d-1 | 0.7469 (29) | 0.5606 (29) | 0.4539 (11) | 0.3124 (12) | 0.5196 (11) | 0.3680 (18) |
| Se2d | 0.7799 (15) | 0.6628 (15) | 0.3636 (29) | 0.2823 (25) | 0.4569 (23) | 0.3677 (19) |
| secg | 0.7288 (31) | 0.6175 (20) | 0.3696 (28) | 0.2933 (19) | 0.4501 (29) | 0.3660 (20) |
| SAF | 0.6348 (38) | 0.5647 (28) | 0.3726 (26) | 0.3009 (16) | 0.4314 (32) | 0.3601 (21) |
| FE-3DGQA | 0.7857 (12) | 0.5862 (25) | 0.4317 (14) | 0.2935 (18) | 0.5111 (14) | 0.3592 (22) |
| D3Net - Pretrained | 0.7659 (24) | 0.6579 (16) | 0.3619 (31) | 0.2726 (26) | 0.4525 (28) | 0.3590 (23) |
| HGT | 0.7692 (22) | 0.5886 (24) | 0.4141 (18) | 0.2924 (21) | 0.4937 (18) | 0.3588 (24) |
| InstanceRefer | 0.7782 (17) | 0.6669 (14) | 0.3457 (36) | 0.2688 (27) | 0.4427 (31) | 0.3580 (25) |
| 3DVG-Transformer | 0.7576 (25) | 0.5515 (30) | 0.4224 (16) | 0.2933 (19) | 0.4976 (16) | 0.3512 (26) |
| SAVG | 0.7758 (19) | 0.5664 (27) | 0.4236 (15) | 0.2826 (24) | 0.5026 (15) | 0.3462 (27) |
| PointGroup_MCAN | 0.7510 (26) | 0.6397 (17) | 0.3271 (38) | 0.2535 (29) | 0.4222 (34) | 0.3401 (28) |
| TransformerVG | 0.7502 (27) | 0.5977 (22) | 0.3712 (27) | 0.2628 (28) | 0.4562 (24) | 0.3379 (29) |
| TGNN | 0.6834 (35) | 0.5894 (23) | 0.3312 (37) | 0.2526 (30) | 0.4102 (37) | 0.3281 (30) |
| BEAUTY-DETR | 0.7848 (13) | 0.5499 (31) | 0.3934 (20) | 0.2480 (31) | 0.4811 (19) | 0.3157 (31) |
| grounding | 0.7298 (30) | 0.5458 (32) | 0.3822 (24) | 0.2421 (33) | 0.4538 (27) | 0.3046 (32) |
| henet | 0.7110 (32) | 0.5180 (34) | 0.3936 (19) | 0.2472 (32) | 0.4590 (22) | 0.3030 (33) |
| SRGA | 0.7494 (28) | 0.5128 (35) | 0.3631 (30) | 0.2218 (34) | 0.4497 (30) | 0.2871 (34) |
| SR-GAB | 0.7016 (33) | 0.5202 (33) | 0.3233 (40) | 0.1959 (37) | 0.4081 (38) | 0.2686 (35) |
| SPANet | 0.5614 (42) | 0.4641 (37) | 0.2800 (44) | 0.2071 (36) | 0.3431 (45) | 0.2647 (36) |
| ScanRefer | 0.6859 (34) | 0.4353 (38) | 0.3488 (35) | 0.2097 (35) | 0.4244 (33) | 0.2603 (37) |
| scanrefer2 | 0.6340 (39) | 0.4353 (38) | 0.3193 (41) | 0.1947 (38) | 0.3898 (40) | 0.2486 (38) |
| TransformerRefer | 0.6010 (40) | 0.4658 (36) | 0.2540 (46) | 0.1730 (43) | 0.3318 (46) | 0.2386 (39) |
| ScanRefer Baseline | 0.6422 (37) | 0.4196 (40) | 0.3090 (42) | 0.1832 (39) | 0.3837 (41) | 0.2362 (40) |
| ScanRefer_vanilla | 0.6488 (36) | 0.4056 (41) | 0.3052 (43) | 0.1782 (41) | 0.3823 (42) | 0.2292 (41) |
| pairwisemethod | 0.5779 (41) | 0.3603 (42) | 0.2792 (45) | 0.1746 (42) | 0.3462 (44) | 0.2163 (42) |
| bo3d | 0.5400 (43) | 0.1550 (43) | 0.3817 (25) | 0.1785 (40) | 0.4172 (36) | 0.1732 (43) |
| Co3d3 | 0.5326 (44) | 0.1369 (44) | 0.3848 (23) | 0.1651 (44) | 0.4179 (35) | 0.1588 (44) |
| Co3d2 | 0.5070 (45) | 0.1195 (46) | 0.3569 (34) | 0.1511 (45) | 0.3906 (39) | 0.1440 (45) |
| bo3d0 | 0.4823 (46) | 0.1278 (45) | 0.3271 (38) | 0.1394 (46) | 0.3619 (43) | 0.1368 (46) |
| Co3d | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) | 0.0000 (47) |

Published methods:
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- D3Net and D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
- 3DVG-Trans + and 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
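The acc@kIoU metric in the table above counts a prediction as correct when its 3D bounding box overlaps the ground-truth box with IoU of at least k (0.25 or 0.5). A minimal sketch for axis-aligned boxes; the function names and the (center, size) box encoding are illustrative assumptions, not the benchmark's evaluation code:

```python
import numpy as np

def box_iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes encoded as (cx, cy, cz, dx, dy, dz)."""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Overlap extent per axis, clamped at zero for disjoint boxes.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = np.prod(overlap)
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```

Because the two thresholds gate the same set of IoUs, acc@0.25IoU always upper-bounds acc@0.5IoU, which matches every row in the table.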

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark.


Columns group into captioning metrics computed at 0.5 IoU (CIDEr, BLEU-4, ROUGE-L, METEOR), dense captioning (DCmAP), and object detection (mAP@0.5). Each cell shows the score followed by its per-column rank in parentheses; rows are sorted by CIDEr@0.5IoU.

| Method | CIDEr@0.5IoU | BLEU-4@0.5IoU | ROUGE-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|
| Vote2Cap-DETR++ | 0.3360 (1) | 0.1908 (1) | 0.3012 (1) | 0.1386 (1) | 0.1864 (1) | 0.5090 (1) |
| vote2cap-detr | 0.3128 (2) | 0.1778 (2) | 0.2842 (3) | 0.1316 (3) | 0.1825 (2) | 0.4454 (3) |
| TMP | 0.3029 (3) | 0.1728 (3) | 0.2898 (2) | 0.1332 (2) | 0.1801 (3) | 0.4605 (2) |
| CFM | 0.2360 (4) | 0.1417 (4) | 0.2253 (4) | 0.1034 (4) | 0.1379 (7) | 0.3008 (7) |
| CM3D-Trans+ | 0.2348 (5) | 0.1383 (5) | 0.2250 (6) | 0.1030 (5) | 0.1398 (6) | 0.2966 (9) |
| Forest-xyz | 0.2266 (6) | 0.1363 (6) | 0.2250 (5) | 0.1027 (6) | 0.1161 (12) | 0.2825 (12) |
| D3Net - Speaker | 0.2088 (7) | 0.1335 (8) | 0.2237 (7) | 0.1022 (7) | 0.1481 (5) | 0.4198 (4) |
| 3DJCG(Captioning) | 0.1918 (8) | 0.1350 (7) | 0.2207 (8) | 0.1013 (8) | 0.1506 (4) | 0.3867 (5) |
| REMAN | 0.1662 (9) | 0.1070 (9) | 0.1790 (9) | 0.0815 (9) | 0.1235 (10) | 0.2927 (11) |
| NOAH | 0.1382 (10) | 0.0901 (10) | 0.1598 (10) | 0.0747 (10) | 0.1359 (8) | 0.2977 (8) |
| SpaCap3D | 0.1359 (11) | 0.0883 (11) | 0.1591 (11) | 0.0738 (11) | 0.1182 (11) | 0.3275 (6) |
| X-Trans2Cap | 0.1274 (12) | 0.0808 (13) | 0.1392 (13) | 0.0653 (13) | 0.1244 (9) | 0.2795 (13) |
| MORE-xyz | 0.1239 (13) | 0.0796 (14) | 0.1362 (14) | 0.0631 (14) | 0.1116 (14) | 0.2648 (14) |
| SUN+ | 0.1148 (14) | 0.0846 (12) | 0.1564 (12) | 0.0711 (12) | 0.1143 (13) | 0.2958 (10) |
| Scan2Cap | 0.0849 (15) | 0.0576 (15) | 0.1073 (15) | 0.0492 (15) | 0.0970 (15) | 0.2481 (15) |

Published methods:
- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi_ORder RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021
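The "@0.5IoU" qualifier on the caption metrics means a caption only counts when its predicted box matches a ground-truth object with IoU of at least 0.5; unmatched objects contribute zero, and the total is averaged over all ground-truth objects. A minimal sketch of that gating, assuming per-object caption scores have already been computed; the function name and argument layout are illustrative, not the Scan2Cap toolkit's API:

```python
def metric_at_kiou(ious, caption_scores, num_gt, k=0.5):
    """m@kIoU: zero out caption scores whose predicted box has IoU < k,
    then average over ALL ground-truth objects, not just the matched ones."""
    kept = (score for iou, score in zip(ious, caption_scores) if iou >= k)
    return sum(kept) / num_gt
```

Averaging over all ground-truth objects is what couples captioning quality to detection quality: a model that captions well but localizes poorly is penalized for every object it fails to match.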