This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Scores are accuracy at IoU thresholds of 0.25 and 0.5, reported for the "unique" subset (scenes containing a single object of the target class), the "multiple" subset, and overall. The number in parentheses is the method's rank on that metric; rows are sorted by Multiple acc@0.25IoU. The Info column gives the license of the released code, where available.

| Method | Info | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| Chat-Scene | permissive | 0.8887 (1) | 0.8005 (1) | 0.5421 (1) | 0.4861 (1) | 0.6198 (1) | 0.5566 (1) |
| CORE-3DVG | | 0.8557 (3) | 0.6867 (9) | 0.5275 (2) | 0.3850 (7) | 0.6011 (2) | 0.4527 (9) |
| D-LISA | | 0.8195 (6) | 0.6900 (8) | 0.4975 (3) | 0.3967 (5) | 0.5697 (3) | 0.4625 (4) |
| M3DRef-test | | 0.7865 (14) | 0.6793 (14) | 0.4963 (4) | 0.3977 (4) | 0.5614 (5) | 0.4608 (5) |
| cus3d | | 0.8384 (4) | 0.7073 (6) | 0.4908 (5) | 0.4000 (3) | 0.5688 (4) | 0.4689 (3) |
| pointclip | | 0.8211 (5) | 0.7082 (5) | 0.4803 (6) | 0.3884 (6) | 0.5567 (7) | 0.4601 (6) |
| ConcreteNet | | 0.8607 (2) | 0.7923 (2) | 0.4746 (7) | 0.4091 (2) | 0.5612 (6) | 0.4950 (2) |
| M3DRef-SCLIP | | 0.7997 (9) | 0.7123 (3) | 0.4708 (8) | 0.3805 (9) | 0.5445 (8) | 0.4549 (7) |
| M3DRef-CLIP | permissive | 0.7980 (10) | 0.7085 (4) | 0.4692 (9) | 0.3807 (8) | 0.5433 (9) | 0.4545 (8) |
| RG-SAN | | 0.7964 (11) | 0.6785 (15) | 0.4591 (10) | 0.3600 (11) | 0.5348 (12) | 0.4314 (11) |
| 3DInsVG | | 0.8170 (7) | 0.6925 (7) | 0.4582 (11) | 0.3617 (10) | 0.5386 (10) | 0.4359 (10) |
| CSA-M3LM | | 0.8137 (8) | 0.6241 (21) | 0.4544 (12) | 0.3317 (13) | 0.5349 (11) | 0.3972 (13) |
| bo3d-1 | | 0.7469 (33) | 0.5606 (33) | 0.4539 (13) | 0.3124 (16) | 0.5196 (14) | 0.3680 (22) |
| GALA-Grounder + 2D | | 0.7947 (12) | 0.5713 (30) | 0.4525 (14) | 0.3202 (14) | 0.5292 (13) | 0.3765 (19) |
| GALA-Grounder | | 0.7824 (18) | 0.5796 (28) | 0.4391 (15) | 0.3131 (15) | 0.5161 (15) | 0.3728 (20) |
| 3DJCG (Grounding) | permissive | 0.7675 (27) | 0.6059 (23) | 0.4389 (16) | 0.3117 (17) | 0.5126 (16) | 0.3776 (18) |
| 3DVG-Trans+ | permissive | 0.7733 (24) | 0.5787 (29) | 0.4370 (17) | 0.3102 (18) | 0.5124 (17) | 0.3704 (21) |
| FE-3DGQA | | 0.7857 (15) | 0.5862 (27) | 0.4317 (18) | 0.2935 (22) | 0.5111 (18) | 0.3592 (26) |
| SAVG | | 0.7758 (23) | 0.5664 (31) | 0.4236 (19) | 0.2826 (28) | 0.5026 (19) | 0.3462 (31) |
| 3DVG-Transformer | permissive | 0.7576 (29) | 0.5515 (34) | 0.4224 (20) | 0.2933 (23) | 0.4976 (20) | 0.3512 (30) |
| HAM | | 0.7799 (19) | 0.6373 (20) | 0.4148 (21) | 0.3324 (12) | 0.4967 (21) | 0.4007 (12) |
| HGT | | 0.7692 (26) | 0.5886 (26) | 0.4141 (22) | 0.2924 (25) | 0.4937 (22) | 0.3588 (28) |
| henet | | 0.7110 (36) | 0.5180 (38) | 0.3936 (23) | 0.2472 (36) | 0.4590 (26) | 0.3030 (37) |
| BEAUTY-DETR | copyleft | 0.7848 (16) | 0.5499 (35) | 0.3934 (24) | 0.2480 (35) | 0.4811 (23) | 0.3157 (35) |
| D3Net | permissive | 0.7923 (13) | 0.6843 (10) | 0.3905 (25) | 0.3074 (19) | 0.4806 (24) | 0.3919 (14) |
| ContraRefer | | 0.7832 (17) | 0.6801 (13) | 0.3850 (26) | 0.2947 (21) | 0.4743 (25) | 0.3811 (15) |
| Co3d3 | | 0.5326 (48) | 0.1369 (48) | 0.3848 (27) | 0.1651 (48) | 0.4179 (39) | 0.1588 (48) |
| grounding | | 0.7298 (34) | 0.5458 (36) | 0.3822 (28) | 0.2421 (37) | 0.4538 (31) | 0.3046 (36) |
| bo3d | | 0.5400 (47) | 0.1550 (47) | 0.3817 (29) | 0.1785 (44) | 0.4172 (40) | 0.1732 (47) |
| SAF | | 0.6348 (42) | 0.5647 (32) | 0.3726 (30) | 0.3009 (20) | 0.4314 (36) | 0.3601 (25) |
| TransformerVG | | 0.7502 (31) | 0.5977 (24) | 0.3712 (31) | 0.2628 (32) | 0.4562 (28) | 0.3379 (33) |
| secg | | 0.7288 (35) | 0.6175 (22) | 0.3696 (32) | 0.2933 (23) | 0.4501 (33) | 0.3660 (24) |
| Se2d | | 0.7799 (19) | 0.6628 (17) | 0.3636 (33) | 0.2823 (29) | 0.4569 (27) | 0.3677 (23) |
| SRGA | | 0.7494 (32) | 0.5128 (39) | 0.3631 (34) | 0.2218 (38) | 0.4497 (34) | 0.2871 (38) |
| D3Net - Pretrained | permissive | 0.7659 (28) | 0.6579 (18) | 0.3619 (35) | 0.2726 (30) | 0.4525 (32) | 0.3590 (27) |
| Clip | | 0.7733 (24) | 0.6810 (12) | 0.3619 (35) | 0.2919 (26) | 0.4542 (30) | 0.3791 (16) |
| Clip-pre | | 0.7766 (22) | 0.6843 (10) | 0.3617 (37) | 0.2904 (27) | 0.4547 (29) | 0.3787 (17) |
| Co3d2 | | 0.5070 (49) | 0.1195 (50) | 0.3569 (38) | 0.1511 (49) | 0.3906 (43) | 0.1440 (49) |
| ScanRefer | permissive | 0.6859 (38) | 0.4353 (42) | 0.3488 (39) | 0.2097 (39) | 0.4244 (37) | 0.2603 (41) |
| InstanceRefer | permissive | 0.7782 (21) | 0.6669 (16) | 0.3457 (40) | 0.2688 (31) | 0.4427 (35) | 0.3580 (29) |
| TGNN | | 0.6834 (39) | 0.5894 (25) | 0.3312 (41) | 0.2526 (34) | 0.4102 (41) | 0.3281 (34) |
| PointGroup_MCAN | | 0.7510 (30) | 0.6397 (19) | 0.3271 (42) | 0.2535 (33) | 0.4222 (38) | 0.3401 (32) |
| bo3d0 | | 0.4823 (50) | 0.1278 (49) | 0.3271 (42) | 0.1394 (50) | 0.3619 (47) | 0.1368 (50) |
| SR-GAB | | 0.7016 (37) | 0.5202 (37) | 0.3233 (44) | 0.1959 (41) | 0.4081 (42) | 0.2686 (39) |
| scanrefer2 | | 0.6340 (43) | 0.4353 (42) | 0.3193 (45) | 0.1947 (42) | 0.3898 (44) | 0.2486 (42) |
| ScanRefer Baseline | | 0.6422 (41) | 0.4196 (44) | 0.3090 (46) | 0.1832 (43) | 0.3837 (45) | 0.2362 (44) |
| ScanRefer_vanilla | | 0.6488 (40) | 0.4056 (45) | 0.3052 (47) | 0.1782 (45) | 0.3823 (46) | 0.2292 (45) |
| SPANet | | 0.5614 (46) | 0.4641 (41) | 0.2800 (48) | 0.2071 (40) | 0.3431 (49) | 0.2647 (40) |
| pairwisemethod | | 0.5779 (45) | 0.3603 (46) | 0.2792 (49) | 0.1746 (46) | 0.3462 (48) | 0.2163 (46) |
| TransformerRefer | | 0.6010 (44) | 0.4658 (40) | 0.2540 (50) | 0.1730 (47) | 0.3318 (50) | 0.2386 (43) |
| Co3d | | 0.0000 (51) | 0.0000 (51) | 0.0000 (51) | 0.0000 (51) | 0.0000 (51) | 0.0000 (51) |

References:

- Chat-Scene: Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- D-LISA: Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024.
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- 3DJCG (Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans+ / 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- D3Net / D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
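The acc@kIoU metrics above are the fraction of localization predictions whose 3D bounding box overlaps the ground-truth box with IoU of at least k (0.25 or 0.5). A minimal sketch of that computation for axis-aligned boxes follows; the box layout and function names are illustrative assumptions, not the benchmark's actual evaluation script:

```python
# Box format (assumed for illustration): (xmin, ymin, zmin, xmax, ymax, zmax).

def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):  # overlap extent along x, y, z
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(preds, gts, thresh):
    """acc@kIoU: fraction of predictions whose IoU with the GT box meets thresh."""
    hits = sum(box3d_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a 2x2x2 predicted box shifted by one unit along x against an identical ground-truth box has IoU 1/3, so it counts toward acc@0.25IoU but not acc@0.5IoU.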

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


The four caption metrics (CIDEr, BLEU-4, ROUGE-L, METEOR, each evaluated at a box IoU threshold of 0.5) are grouped as "Captioning F1-Score" on the benchmark page; DC mAP falls under "Dense Captioning" and mAP@0.5 under "Object Detection". The number in parentheses is the method's rank on that metric, and the Info column gives the license of the released code, where available.

| Method | Info | CIDEr@0.5IoU | BLEU-4@0.5IoU | ROUGE-L@0.5IoU | METEOR@0.5IoU | DC mAP | mAP@0.5 |
|---|---|---|---|---|---|---|---|
| Chat-Scene-thres0.5 | permissive | 0.3456 (1) | 0.1859 (2) | 0.3162 (1) | 0.1527 (1) | 0.1415 (8) | 0.4856 (4) |
| CM3D-Trans+ | | 0.2348 (6) | 0.1383 (6) | 0.2250 (7) | 0.1030 (6) | 0.1398 (9) | 0.2966 (12) |
| Scan2Cap | permissive | 0.0849 (18) | 0.0576 (18) | 0.1073 (18) | 0.0492 (18) | 0.0970 (18) | 0.2481 (18) |
| X-Trans2Cap | permissive | 0.1274 (14) | 0.0808 (15) | 0.1392 (15) | 0.0653 (15) | 0.1244 (12) | 0.2795 (16) |
| SpaCap3D | permissive | 0.1359 (13) | 0.0883 (13) | 0.1591 (13) | 0.0738 (13) | 0.1182 (14) | 0.3275 (9) |
| MORE-xyz | permissive | 0.1239 (16) | 0.0796 (16) | 0.1362 (16) | 0.0631 (16) | 0.1116 (17) | 0.2648 (17) |
| REMAN | | 0.1662 (11) | 0.1070 (11) | 0.1790 (11) | 0.0815 (11) | 0.1235 (13) | 0.2927 (14) |
| 3DJCG (Captioning) | permissive | 0.1918 (10) | 0.1350 (8) | 0.2207 (9) | 0.1013 (9) | 0.1506 (6) | 0.3867 (8) |
| SUN+ | | 0.1148 (17) | 0.0846 (14) | 0.1564 (14) | 0.0711 (14) | 0.1143 (16) | 0.2958 (13) |
| Chat-Scene-thres0.01 | | 0.2053 (9) | 0.1103 (10) | 0.1884 (10) | 0.0907 (10) | 0.1527 (5) | 0.5076 (2) |
| NOAH | | 0.1382 (12) | 0.0901 (12) | 0.1598 (12) | 0.0747 (12) | 0.1359 (11) | 0.2977 (11) |
| Forest-xyz | | 0.2266 (7) | 0.1363 (7) | 0.2250 (6) | 0.1027 (7) | 0.1161 (15) | 0.2825 (15) |
| CFM | | 0.2360 (5) | 0.1417 (5) | 0.2253 (5) | 0.1034 (5) | 0.1379 (10) | 0.3008 (10) |
| vote2cap-detr | permissive | 0.3128 (3) | 0.1778 (3) | 0.2842 (4) | 0.1316 (4) | 0.1825 (2) | 0.4454 (6) |
| Vote2Cap-DETR++ | | 0.3360 (2) | 0.1908 (1) | 0.3012 (2) | 0.1386 (2) | 0.1864 (1) | 0.5090 (1) |
| TMP | | 0.3029 (4) | 0.1728 (4) | 0.2898 (3) | 0.1332 (3) | 0.1801 (3) | 0.4605 (5) |
| Chat-Scene-all | | 0.1257 (15) | 0.0671 (17) | 0.1150 (17) | 0.0554 (17) | 0.1539 (4) | 0.5076 (2) |
| D3Net - Speaker | permissive | 0.2088 (8) | 0.1335 (9) | 0.2237 (8) | 0.1022 (8) | 0.1481 (7) | 0.4198 (7) |

References:

- Chat-Scene (all variants): Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021.
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022.
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. IJCAI 2022.
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022.
- 3DJCG (Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023.
- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
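The @0.5IoU suffix on the caption metrics above means a caption's score is counted only when its predicted box overlaps the ground-truth object with IoU of at least 0.5; otherwise that prediction contributes zero. A minimal sketch of that gating step follows; the function name and the simple averaging are illustrative assumptions, not the official Scan2Cap evaluation code:

```python
def caption_score_at_iou(scores, ious, thresh=0.5):
    """Average a per-object caption metric (e.g. CIDEr) over ground-truth
    objects, zeroing out predictions whose box IoU falls below thresh."""
    gated = [s if iou >= thresh else 0.0 for s, iou in zip(scores, ious)]
    return sum(gated) / len(gated)
```

For example, raw caption scores [0.8, 0.6, 0.4] with box IoUs [0.6, 0.4, 0.9] yield a gated average of (0.8 + 0.0 + 0.4) / 3 = 0.4, since the second prediction's box misses the 0.5 IoU threshold.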