This table lists the benchmark results for the ScanRefer Localization Benchmark.


Rows are ordered by Unique acc@0.5IoU rank; each cell shows the score with its per-column rank in parentheses. The Info column notes the reported license, where available.

| Method | Info | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| UniVLG | permissive | 0.8895 (1) | 0.8236 (1) | 0.5921 (1) | 0.5030 (1) | 0.6588 (1) | 0.5749 (1) |
| Chat-Scene | permissive | 0.8887 (2) | 0.8005 (2) | 0.5421 (2) | 0.4861 (2) | 0.6198 (2) | 0.5566 (2) |
| ConcreteNet | | 0.8607 (3) | 0.7923 (3) | 0.4746 (9) | 0.4091 (3) | 0.5612 (8) | 0.4950 (3) |
| M3DRef-SCLIP | | 0.7997 (14) | 0.7123 (4) | 0.4708 (10) | 0.3805 (10) | 0.5445 (10) | 0.4549 (8) |
| M3DRef-CLIP | permissive | 0.7980 (15) | 0.7085 (5) | 0.4692 (11) | 0.3807 (9) | 0.5433 (11) | 0.4545 (9) |
| pointclip | | 0.8211 (6) | 0.7082 (6) | 0.4803 (8) | 0.3884 (7) | 0.5567 (9) | 0.4601 (7) |
| cus3d | | 0.8384 (5) | 0.7073 (7) | 0.4908 (7) | 0.4000 (4) | 0.5688 (6) | 0.4689 (4) |
| 3DInsVG | | 0.8170 (8) | 0.6925 (8) | 0.4582 (14) | 0.3617 (11) | 0.5386 (12) | 0.4359 (11) |
| D-LISA | | 0.8195 (7) | 0.6900 (9) | 0.4975 (5) | 0.3967 (6) | 0.5697 (5) | 0.4625 (5) |
| CORE-3DVG | | 0.8557 (4) | 0.6867 (10) | 0.5275 (3) | 0.3850 (8) | 0.6011 (3) | 0.4527 (10) |
| D3Net | permissive | 0.7923 (19) | 0.6843 (11) | 0.3905 (37) | 0.3074 (31) | 0.4806 (36) | 0.3919 (20) |
| Clip-pre | | 0.7766 (32) | 0.6843 (11) | 0.3617 (50) | 0.2904 (39) | 0.4547 (42) | 0.3787 (27) |
| Clip | | 0.7733 (35) | 0.6810 (13) | 0.3619 (48) | 0.2919 (38) | 0.4542 (43) | 0.3791 (26) |
| ContraRefer | | 0.7832 (25) | 0.6801 (14) | 0.3850 (38) | 0.2947 (33) | 0.4743 (37) | 0.3811 (23) |
| M3DRef-test | | 0.7865 (21) | 0.6793 (15) | 0.4963 (6) | 0.3977 (5) | 0.5614 (7) | 0.4608 (6) |
| RG-SAN | | 0.7964 (16) | 0.6785 (16) | 0.4591 (13) | 0.3600 (12) | 0.5348 (15) | 0.4314 (12) |
| InstanceRefer | permissive | 0.7782 (31) | 0.6669 (17) | 0.3457 (53) | 0.2688 (44) | 0.4427 (48) | 0.3580 (41) |
| Se2d | | 0.7799 (29) | 0.6628 (18) | 0.3636 (46) | 0.2823 (41) | 0.4569 (40) | 0.3677 (35) |
| D3Net - Pretrained | permissive | 0.7659 (39) | 0.6579 (19) | 0.3619 (48) | 0.2726 (42) | 0.4525 (45) | 0.3590 (39) |
| 3dvlp-with-judge | | 0.7807 (28) | 0.6472 (20) | 0.4498 (25) | 0.3407 (14) | 0.5240 (25) | 0.4094 (14) |
| PointGroup_MCAN | | 0.7510 (42) | 0.6397 (21) | 0.3271 (56) | 0.2535 (46) | 0.4222 (51) | 0.3401 (44) |
| HAM | | 0.7799 (29) | 0.6373 (22) | 0.4148 (33) | 0.3324 (18) | 0.4967 (33) | 0.4007 (17) |
| 3DVLP-baseline | | 0.7766 (32) | 0.6373 (22) | 0.4572 (16) | 0.3469 (13) | 0.5288 (22) | 0.4120 (13) |
| Jung | | 0.8096 (11) | 0.6331 (24) | 0.5113 (4) | 0.3398 (16) | 0.5782 (4) | 0.4055 (15) |
| ScanRefer-3dvlp-test | | 0.7824 (26) | 0.6298 (25) | 0.4532 (23) | 0.3405 (15) | 0.5270 (24) | 0.4054 (16) |
| CSA-M3LM | | 0.8137 (9) | 0.6241 (26) | 0.4544 (21) | 0.3317 (19) | 0.5349 (14) | 0.3972 (18) |
| secg | | 0.7288 (48) | 0.6175 (27) | 0.3696 (45) | 0.2933 (35) | 0.4501 (46) | 0.3660 (36) |
| 3DJCG(Grounding) | permissive | 0.7675 (38) | 0.6059 (28) | 0.4389 (28) | 0.3117 (29) | 0.5126 (28) | 0.3776 (28) |
| 3dvlp-judge-h | | 0.7552 (41) | 0.6051 (29) | 0.4458 (27) | 0.3340 (17) | 0.5152 (27) | 0.3948 (19) |
| TransformerVG | | 0.7502 (43) | 0.5977 (30) | 0.3712 (44) | 0.2628 (45) | 0.4562 (41) | 0.3379 (45) |
| GALA-Grounder-D3 | | 0.7939 (18) | 0.5952 (31) | 0.4625 (12) | 0.3229 (21) | 0.5368 (13) | 0.3839 (21) |
| TGNN | | 0.6834 (53) | 0.5894 (32) | 0.3312 (54) | 0.2526 (47) | 0.4102 (55) | 0.3281 (47) |
| HGT | | 0.7692 (37) | 0.5886 (33) | 0.4141 (34) | 0.2924 (37) | 0.4937 (34) | 0.3588 (40) |
| FE-3DGQA | | 0.7857 (22) | 0.5862 (34) | 0.4317 (30) | 0.2935 (34) | 0.5111 (30) | 0.3592 (38) |
| LAG-3D-3 | | 0.7815 (27) | 0.5837 (35) | 0.4556 (19) | 0.3219 (22) | 0.5287 (23) | 0.3806 (24) |
| Graph-VG-2 | | 0.8021 (13) | 0.5829 (36) | 0.4546 (20) | 0.3217 (23) | 0.5325 (17) | 0.3802 (25) |
| Graph-VG-3 | | 0.8038 (12) | 0.5812 (37) | 0.4515 (24) | 0.3169 (25) | 0.5305 (19) | 0.3762 (29) |
| LAG-3D-2 | | 0.7964 (16) | 0.5812 (37) | 0.4572 (16) | 0.3245 (20) | 0.5333 (16) | 0.3821 (22) |
| 3DVG-Trans + | permissive | 0.7733 (35) | 0.5787 (39) | 0.4370 (29) | 0.3102 (30) | 0.5124 (29) | 0.3704 (33) |
| GALA-Grounder-D1 | | 0.8104 (10) | 0.5754 (40) | 0.4479 (26) | 0.3176 (24) | 0.5292 (21) | 0.3754 (30) |
| SAVG | | 0.7758 (34) | 0.5664 (41) | 0.4236 (31) | 0.2826 (40) | 0.5026 (31) | 0.3462 (43) |
| SAF | | 0.6348 (56) | 0.5647 (42) | 0.3726 (43) | 0.3009 (32) | 0.4314 (49) | 0.3601 (37) |
| Graph-VG-4 | | 0.7848 (23) | 0.5631 (43) | 0.4560 (18) | 0.3164 (27) | 0.5298 (20) | 0.3717 (31) |
| LAG-3D | | 0.7881 (20) | 0.5606 (44) | 0.4579 (15) | 0.3169 (25) | 0.5320 (18) | 0.3715 (32) |
| bo3d-1 | | 0.7469 (45) | 0.5606 (44) | 0.4539 (22) | 0.3124 (28) | 0.5196 (26) | 0.3680 (34) |
| 3DVG-Transformer | permissive | 0.7576 (40) | 0.5515 (46) | 0.4224 (32) | 0.2933 (35) | 0.4976 (32) | 0.3512 (42) |
| BEAUTY-DETR | copyleft | 0.7848 (23) | 0.5499 (47) | 0.3934 (36) | 0.2480 (48) | 0.4811 (35) | 0.3157 (48) |
| grounding | | 0.7298 (47) | 0.5458 (48) | 0.3822 (40) | 0.2421 (50) | 0.4538 (44) | 0.3046 (49) |
| TFVG3D ++ | permissive | 0.7453 (46) | 0.5458 (49) | 0.3793 (42) | 0.2690 (43) | 0.4614 (38) | 0.3311 (46) |
| SR-GAB | | 0.7016 (50) | 0.5202 (50) | 0.3233 (58) | 0.1959 (54) | 0.4081 (56) | 0.2686 (52) |
| henet | | 0.7110 (49) | 0.5180 (51) | 0.3936 (35) | 0.2472 (49) | 0.4590 (39) | 0.3030 (50) |
| SRGA | | 0.7494 (44) | 0.5128 (52) | 0.3631 (47) | 0.2218 (51) | 0.4497 (47) | 0.2871 (51) |
| TransformerRefer | | 0.6010 (58) | 0.4658 (53) | 0.2540 (64) | 0.1730 (60) | 0.3318 (64) | 0.2386 (56) |
| SPANet | | 0.5614 (60) | 0.4641 (54) | 0.2800 (62) | 0.2071 (53) | 0.3431 (63) | 0.2647 (53) |
| ScanRefer-test | | 0.6999 (51) | 0.4361 (55) | 0.3274 (55) | 0.1725 (61) | 0.4109 (54) | 0.2316 (58) |
| ScanRefer | permissive | 0.6859 (52) | 0.4353 (56) | 0.3488 (52) | 0.2097 (52) | 0.4244 (50) | 0.2603 (54) |
| scanrefer2 | | 0.6340 (57) | 0.4353 (56) | 0.3193 (59) | 0.1947 (55) | 0.3898 (58) | 0.2486 (55) |
| ScanRefer Baseline | | 0.6422 (55) | 0.4196 (58) | 0.3090 (60) | 0.1832 (56) | 0.3837 (59) | 0.2362 (57) |
| ScanRefer_vanilla | | 0.6488 (54) | 0.4056 (59) | 0.3052 (61) | 0.1782 (58) | 0.3823 (60) | 0.2292 (59) |
| pairwisemethod | | 0.5779 (59) | 0.3603 (60) | 0.2792 (63) | 0.1746 (59) | 0.3462 (62) | 0.2163 (60) |
| bo3d | | 0.5400 (61) | 0.1550 (61) | 0.3817 (41) | 0.1785 (57) | 0.4172 (53) | 0.1732 (61) |
| Co3d3 | | 0.5326 (62) | 0.1369 (62) | 0.3848 (39) | 0.1651 (62) | 0.4179 (52) | 0.1588 (62) |
| bo3d0 | | 0.4823 (64) | 0.1278 (63) | 0.3271 (56) | 0.1394 (64) | 0.3619 (61) | 0.1368 (64) |
| Co3d2 | | 0.5070 (63) | 0.1195 (64) | 0.3569 (51) | 0.1511 (63) | 0.3906 (57) | 0.1440 (63) |
| 3DVLP | | 0.0038 (65) | 0.0019 (65) | 0.0049 (65) | 0.0023 (65) | 0.0047 (65) | 0.0022 (65) |
| Co3d | | 0.0000 (66) | 0.0000 (66) | 0.0000 (66) | 0.0000 (66) | 0.0000 (66) | 0.0000 (66) |

References for the published methods:

- UniVLG: Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Alexander Sax, Franziska Meier, Katerina Fragkiadaki: Unifying 2D and 3D Vision-Language Understanding.
- Chat-Scene: Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- ConcreteNet: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024.
- M3DRef-CLIP: Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- D-LISA: Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024.
- D3Net, D3Net - Pretrained: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- InstanceRefer: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- HAM: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- 3DJCG(Grounding): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- TGNN: Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- 3DVG-Trans +, 3DVG-Transformer: Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- BEAUTY-DETR: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- TFVG3D ++: Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024.
- ScanRefer: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
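The acc@kIoU metrics above count a prediction as correct when the 3D intersection-over-union between the predicted and ground-truth bounding box reaches the threshold k (0.25 or 0.5). A minimal sketch of that computation, assuming axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax); this is an illustration of the metric's definition, not the benchmark's official evaluation code:

```python
import numpy as np

def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes, each (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])          # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, thresh):
    """acc@kIoU: fraction of predictions whose IoU with the matched
    ground-truth box meets the threshold."""
    hits = [box3d_iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```

For example, a box shifted by half its width against its ground truth has IoU 1/3, so it counts toward acc@0.25IoU but not acc@0.5IoU.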

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark.


Each cell shows the score with its per-column rank in parentheses. The original page groups the columns into Captioning F1-Score (CIDEr, BLEU-4, Rouge-L, and METEOR, each at 0.5 IoU), Dense Captioning (DCmAP), and Object Detection (mAP@0.5); the Info column notes the reported license, where available.

| Method | Info | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DCmAP | mAP@0.5 |
|---|---|---|---|---|---|---|---|
| Chat-Scene-thres0.5 | permissive | 0.3456 (1) | 0.1859 (2) | 0.3162 (1) | 0.1527 (1) | 0.1415 (8) | 0.4856 (4) |
| CM3D-Trans+ | | 0.2348 (6) | 0.1383 (6) | 0.2250 (7) | 0.1030 (6) | 0.1398 (9) | 0.2966 (12) |
| Scan2Cap | permissive | 0.0849 (18) | 0.0576 (18) | 0.1073 (18) | 0.0492 (18) | 0.0970 (18) | 0.2481 (18) |
| X-Trans2Cap | permissive | 0.1274 (14) | 0.0808 (15) | 0.1392 (15) | 0.0653 (15) | 0.1244 (12) | 0.2795 (16) |
| SpaCap3D | permissive | 0.1359 (13) | 0.0883 (13) | 0.1591 (13) | 0.0738 (13) | 0.1182 (14) | 0.3275 (9) |
| MORE-xyz | permissive | 0.1239 (16) | 0.0796 (16) | 0.1362 (16) | 0.0631 (16) | 0.1116 (17) | 0.2648 (17) |
| REMAN | | 0.1662 (11) | 0.1070 (11) | 0.1790 (11) | 0.0815 (11) | 0.1235 (13) | 0.2927 (14) |
| 3DJCG(Captioning) | permissive | 0.1918 (10) | 0.1350 (8) | 0.2207 (9) | 0.1013 (9) | 0.1506 (6) | 0.3867 (8) |
| SUN+ | | 0.1148 (17) | 0.0846 (14) | 0.1564 (14) | 0.0711 (14) | 0.1143 (16) | 0.2958 (13) |
| Chat-Scene-thres0.01 | | 0.2053 (9) | 0.1103 (10) | 0.1884 (10) | 0.0907 (10) | 0.1527 (5) | 0.5076 (2) |
| NOAH | | 0.1382 (12) | 0.0901 (12) | 0.1598 (12) | 0.0747 (12) | 0.1359 (11) | 0.2977 (11) |
| Forest-xyz | | 0.2266 (7) | 0.1363 (7) | 0.2250 (6) | 0.1027 (7) | 0.1161 (15) | 0.2825 (15) |
| CFM | | 0.2360 (5) | 0.1417 (5) | 0.2253 (5) | 0.1034 (5) | 0.1379 (10) | 0.3008 (10) |
| vote2cap-detr | permissive | 0.3128 (3) | 0.1778 (3) | 0.2842 (4) | 0.1316 (4) | 0.1825 (2) | 0.4454 (6) |
| Vote2Cap-DETR++ | | 0.3360 (2) | 0.1908 (1) | 0.3012 (2) | 0.1386 (2) | 0.1864 (1) | 0.5090 (1) |
| TMP | | 0.3029 (4) | 0.1728 (4) | 0.2898 (3) | 0.1332 (3) | 0.1801 (3) | 0.4605 (5) |
| Chat-Scene-all | | 0.1257 (15) | 0.0671 (17) | 0.1150 (17) | 0.0554 (17) | 0.1539 (4) | 0.5076 (2) |
| D3Net - Speaker | permissive | 0.2088 (8) | 0.1335 (9) | 0.2237 (8) | 0.1022 (8) | 0.1481 (7) | 0.4198 (7) |

References for the published methods:

- Chat-Scene (all variants): Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- CM3D-Trans+: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- Scan2Cap: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021.
- X-Trans2Cap: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022.
- SpaCap3D: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. IJCAI 2022.
- MORE-xyz: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order Relation Mining for Dense Captioning in 3D Scenes. ECCV 2022.
- 3DJCG(Captioning): Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- vote2cap-detr: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023.
- Vote2Cap-DETR++: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- D3Net - Speaker: Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
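The @0.5IoU captioning metrics above couple caption quality with localization: a generated caption contributes its score (CIDEr, BLEU-4, Rouge-L, or METEOR) only when its predicted box overlaps the ground-truth object with IoU of at least 0.5, and contributes 0 otherwise. A minimal sketch of this gating, assuming the per-caption scores have already been computed by an external captioning metric; `iou_gated_score` is a hypothetical helper illustrating the idea, not the benchmark's official evaluation code:

```python
import numpy as np

def iou_gated_score(caption_scores, ious, thresh=0.5):
    """Average caption score where each caption counts only if its
    predicted box reaches the IoU threshold; misses contribute 0.

    caption_scores: per-caption quality scores (e.g. CIDEr), precomputed.
    ious: IoU of each caption's predicted box with its ground-truth box.
    """
    scores = np.asarray(caption_scores, float)
    ious = np.asarray(ious, float)
    gated = np.where(ious >= thresh, scores, 0.0)  # zero out poorly localized captions
    return float(gated.mean())
```

This is why a method can score well on plain captioning yet poorly here: captions attached to badly localized boxes are discarded entirely rather than partially credited.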