This table lists the benchmark results for the ScanRefer Localization Benchmark scenario.


Scores are accuracies at the given IoU threshold; the parenthesized number after each score is the method's rank on that column. Rows are sorted by Overall acc@0.5IoU. The Info column shows the benchmark's license tag for released code, where one is listed.

| Method | Info | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| UniVLG | permissive | 0.8895 (1) | 0.8236 (1) | 0.5921 (1) | 0.5030 (1) | 0.6588 (1) | 0.5749 (1) |
| Chat-Scene | permissive | 0.8887 (2) | 0.8005 (2) | 0.5421 (2) | 0.4861 (2) | 0.6198 (2) | 0.5566 (2) |
| ConcreteNet | | 0.8607 (3) | 0.7923 (3) | 0.4746 (9) | 0.4091 (3) | 0.5612 (8) | 0.4950 (3) |
| cus3d | | 0.8384 (5) | 0.7073 (7) | 0.4908 (7) | 0.4000 (4) | 0.5688 (6) | 0.4689 (4) |
| D-LISA | | 0.8195 (7) | 0.6900 (9) | 0.4975 (5) | 0.3967 (6) | 0.5697 (5) | 0.4625 (5) |
| M3DRef-test | | 0.7865 (23) | 0.6793 (15) | 0.4963 (6) | 0.3977 (5) | 0.5614 (7) | 0.4608 (6) |
| pointclip | | 0.8211 (6) | 0.7082 (6) | 0.4803 (8) | 0.3884 (7) | 0.5567 (9) | 0.4601 (7) |
| M3DRef-SCLIP | | 0.7997 (14) | 0.7123 (4) | 0.4708 (10) | 0.3805 (10) | 0.5445 (10) | 0.4549 (8) |
| M3DRef-CLIP | permissive | 0.7980 (15) | 0.7085 (5) | 0.4692 (11) | 0.3807 (9) | 0.5433 (11) | 0.4545 (9) |
| CORE-3DVG | | 0.8557 (4) | 0.6867 (10) | 0.5275 (3) | 0.3850 (8) | 0.6011 (3) | 0.4527 (10) |
| 3DInsVG | | 0.8170 (8) | 0.6925 (8) | 0.4582 (16) | 0.3617 (11) | 0.5386 (14) | 0.4359 (11) |
| RG-SAN | | 0.7964 (16) | 0.6785 (16) | 0.4591 (15) | 0.3600 (12) | 0.5348 (17) | 0.4314 (12) |
| 3DVLP-rf-enhance | | 0.7939 (19) | 0.6546 (21) | 0.4651 (12) | 0.3491 (13) | 0.5388 (13) | 0.4176 (13) |
| 3DVLP-rf | | 0.7964 (16) | 0.6579 (19) | 0.4646 (13) | 0.3436 (15) | 0.5390 (12) | 0.4140 (14) |
| 3DVLP-baseline | | 0.7766 (35) | 0.6373 (24) | 0.4572 (18) | 0.3469 (14) | 0.5288 (24) | 0.4120 (15) |
| 3dvlp-with-judge | | 0.7807 (31) | 0.6472 (22) | 0.4498 (27) | 0.3407 (16) | 0.5240 (27) | 0.4094 (16) |
| Jung | | 0.8096 (11) | 0.6331 (26) | 0.5113 (4) | 0.3398 (18) | 0.5782 (4) | 0.4055 (17) |
| ScanRefer-3dvlp-test | | 0.7824 (29) | 0.6298 (27) | 0.4532 (25) | 0.3405 (17) | 0.5270 (26) | 0.4054 (18) |
| HAM | | 0.7799 (32) | 0.6373 (24) | 0.4148 (35) | 0.3324 (20) | 0.4967 (35) | 0.4007 (19) |
| CSA-M3LM | | 0.8137 (9) | 0.6241 (28) | 0.4544 (23) | 0.3317 (21) | 0.5349 (16) | 0.3972 (20) |
| 3dvlp-judge-h | | 0.7552 (45) | 0.6051 (31) | 0.4458 (29) | 0.3340 (19) | 0.5152 (29) | 0.3948 (21) |
| D3Net | permissive | 0.7923 (21) | 0.6843 (11) | 0.3905 (39) | 0.3074 (33) | 0.4806 (38) | 0.3919 (22) |
| GALA-Grounder-D3 | | 0.7939 (19) | 0.5952 (33) | 0.4625 (14) | 0.3229 (23) | 0.5368 (15) | 0.3839 (23) |
| LAG-3D-2 | | 0.7964 (16) | 0.5812 (39) | 0.4572 (18) | 0.3245 (22) | 0.5333 (18) | 0.3821 (24) |
| ContraRefer | | 0.7832 (27) | 0.6801 (14) | 0.3850 (40) | 0.2947 (35) | 0.4743 (39) | 0.3811 (25) |
| LAG-3D-3 | | 0.7815 (30) | 0.5837 (37) | 0.4556 (21) | 0.3219 (24) | 0.5287 (25) | 0.3806 (26) |
| Graph-VG-2 | | 0.8021 (13) | 0.5829 (38) | 0.4546 (22) | 0.3217 (25) | 0.5325 (19) | 0.3802 (27) |
| Clip | | 0.7733 (38) | 0.6810 (13) | 0.3619 (53) | 0.2919 (40) | 0.4542 (47) | 0.3791 (28) |
| Clip-pre | | 0.7766 (35) | 0.6843 (11) | 0.3617 (55) | 0.2904 (41) | 0.4547 (46) | 0.3787 (29) |
| 3DJCG(Grounding) | permissive | 0.7675 (42) | 0.6059 (30) | 0.4389 (30) | 0.3117 (31) | 0.5126 (30) | 0.3776 (30) |
| Graph-VG-3 | | 0.8038 (12) | 0.5812 (39) | 0.4515 (26) | 0.3169 (27) | 0.5305 (21) | 0.3762 (31) |
| GALA-Grounder-D1 | | 0.8104 (10) | 0.5754 (42) | 0.4479 (28) | 0.3176 (26) | 0.5292 (23) | 0.3754 (32) |
| Graph-VG-4 | | 0.7848 (25) | 0.5631 (45) | 0.4560 (20) | 0.3164 (29) | 0.5298 (22) | 0.3717 (33) |
| LAG-3D | | 0.7881 (22) | 0.5606 (46) | 0.4579 (17) | 0.3169 (27) | 0.5320 (20) | 0.3715 (34) |
| 3DVG-Trans + | permissive | 0.7733 (38) | 0.5787 (41) | 0.4370 (31) | 0.3102 (32) | 0.5124 (31) | 0.3704 (35) |
| bo3d-1 | | 0.7469 (49) | 0.5606 (46) | 0.4539 (24) | 0.3124 (30) | 0.5196 (28) | 0.3680 (36) |
| Se2d | | 0.7799 (32) | 0.6628 (18) | 0.3636 (50) | 0.2823 (43) | 0.4569 (43) | 0.3677 (37) |
| secg | | 0.7288 (53) | 0.6175 (29) | 0.3696 (48) | 0.2933 (37) | 0.4501 (50) | 0.3660 (38) |
| SAF | | 0.6348 (61) | 0.5647 (44) | 0.3726 (46) | 0.3009 (34) | 0.4314 (54) | 0.3601 (39) |
| FE-3DGQA | | 0.7857 (24) | 0.5862 (36) | 0.4317 (32) | 0.2935 (36) | 0.5111 (32) | 0.3592 (40) |
| D3Net - Pretrained | permissive | 0.7659 (43) | 0.6579 (19) | 0.3619 (53) | 0.2726 (44) | 0.4525 (49) | 0.3590 (41) |
| HGT | | 0.7692 (41) | 0.5886 (35) | 0.4141 (36) | 0.2924 (39) | 0.4937 (36) | 0.3588 (42) |
| InstanceRefer | permissive | 0.7782 (34) | 0.6669 (17) | 0.3457 (59) | 0.2688 (46) | 0.4427 (52) | 0.3580 (43) |
| 3DVG-Transformer | permissive | 0.7576 (44) | 0.5515 (49) | 0.4224 (34) | 0.2933 (37) | 0.4976 (34) | 0.3512 (44) |
| SAVG | | 0.7758 (37) | 0.5664 (43) | 0.4236 (33) | 0.2826 (42) | 0.5026 (33) | 0.3462 (45) |
| PointGroup_MCAN | | 0.7510 (46) | 0.6397 (23) | 0.3271 (62) | 0.2535 (48) | 0.4222 (56) | 0.3401 (46) |
| TransformerVG | | 0.7502 (47) | 0.5977 (32) | 0.3712 (47) | 0.2628 (47) | 0.4562 (45) | 0.3379 (47) |
| TFVG3D ++ | permissive | 0.7453 (50) | 0.5458 (53) | 0.3793 (44) | 0.2690 (45) | 0.4614 (41) | 0.3311 (48) |
| TGNN | | 0.6834 (58) | 0.5894 (34) | 0.3312 (60) | 0.2526 (49) | 0.4102 (60) | 0.3281 (49) |
| BEAUTY-DETR | copyleft | 0.7848 (25) | 0.5499 (50) | 0.3934 (38) | 0.2480 (50) | 0.4811 (37) | 0.3157 (50) |
| grounding | | 0.7298 (52) | 0.5458 (52) | 0.3822 (42) | 0.2421 (52) | 0.4538 (48) | 0.3046 (51) |
| henet | | 0.7110 (54) | 0.5180 (55) | 0.3936 (37) | 0.2472 (51) | 0.4590 (42) | 0.3030 (52) |
| scanrefer-rj-14bz | | 0.7832 (27) | 0.5524 (48) | 0.3746 (45) | 0.2275 (53) | 0.4662 (40) | 0.3004 (53) |
| scanrefer-test-14bz | | 0.7700 (40) | 0.5474 (51) | 0.3665 (49) | 0.2247 (54) | 0.4569 (43) | 0.2970 (54) |
| SRGA | | 0.7494 (48) | 0.5128 (56) | 0.3631 (51) | 0.2218 (55) | 0.4497 (51) | 0.2871 (55) |
| scanrefer-rj-org | | 0.7345 (51) | 0.4716 (57) | 0.3536 (57) | 0.2125 (56) | 0.4390 (53) | 0.2706 (56) |
| SR-GAB | | 0.7016 (55) | 0.5202 (54) | 0.3233 (64) | 0.1959 (59) | 0.4081 (61) | 0.2686 (57) |
| SPANet | | 0.5614 (65) | 0.4641 (59) | 0.2800 (68) | 0.2071 (58) | 0.3431 (69) | 0.2647 (58) |
| ScanRefer | permissive | 0.6859 (57) | 0.4353 (61) | 0.3488 (58) | 0.2097 (57) | 0.4244 (55) | 0.2603 (59) |
| scanrefer2 | | 0.6340 (62) | 0.4353 (61) | 0.3193 (65) | 0.1947 (60) | 0.3898 (63) | 0.2486 (60) |
| TransformerRefer | | 0.6010 (63) | 0.4658 (58) | 0.2540 (70) | 0.1730 (65) | 0.3318 (70) | 0.2386 (61) |
| ScanRefer Baseline | | 0.6422 (60) | 0.4196 (63) | 0.3090 (66) | 0.1832 (61) | 0.3837 (65) | 0.2362 (62) |
| ScanRefer-test | | 0.6999 (56) | 0.4361 (60) | 0.3274 (61) | 0.1725 (66) | 0.4109 (59) | 0.2316 (63) |
| ScanRefer_vanilla | | 0.6488 (59) | 0.4056 (64) | 0.3052 (67) | 0.1782 (63) | 0.3823 (66) | 0.2292 (64) |
| pairwisemethod | | 0.5779 (64) | 0.3603 (65) | 0.2792 (69) | 0.1746 (64) | 0.3462 (68) | 0.2163 (65) |
| bo3d | | 0.5400 (66) | 0.1550 (66) | 0.3817 (43) | 0.1785 (62) | 0.4172 (58) | 0.1732 (66) |
| Co3d3 | | 0.5326 (67) | 0.1369 (67) | 0.3848 (41) | 0.1651 (67) | 0.4179 (57) | 0.1588 (67) |
| Co3d2 | | 0.5070 (68) | 0.1195 (70) | 0.3569 (56) | 0.1511 (68) | 0.3906 (62) | 0.1440 (68) |
| test_submitt | | 0.4732 (70) | 0.1286 (68) | 0.3626 (52) | 0.1399 (69) | 0.3874 (64) | 0.1373 (69) |
| bo3d0 | | 0.4823 (69) | 0.1278 (69) | 0.3271 (62) | 0.1394 (70) | 0.3619 (67) | 0.1368 (70) |
| 3DVLP | | 0.0038 (71) | 0.0019 (71) | 0.0049 (71) | 0.0023 (71) | 0.0047 (71) | 0.0022 (71) |
| Co3d | | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) |

References (published methods):

- UniVLG — Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Alexander Sax, Franziska Meier, Katerina Fragkiadaki: Unifying 2D and 3D Vision-Language Understanding.
- Chat-Scene — Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- ConcreteNet — Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024.
- D-LISA — Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024.
- M3DRef-CLIP — Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023.
- HAM — Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
- D3Net, D3Net - Pretrained — Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- 3DJCG(Grounding) — Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- 3DVG-Trans +, 3DVG-Transformer — Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
- InstanceRefer — Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
- TFVG3D ++ — Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024.
- TGNN — Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
- BEAUTY-DETR — Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
- ScanRefer — Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020.
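The acc@kIoU metrics above count a localization as correct when the predicted 3D bounding box overlaps the ground-truth box with intersection-over-union of at least k (0.25 or 0.5). A minimal sketch of that computation for axis-aligned boxes — the function names and the `(xmin, ymin, zmin, xmax, ymax, zmax)` box encoding are illustrative choices, not part of the benchmark's tooling, and official evaluation may differ in details such as oriented boxes:

```python
import numpy as np

def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, dtype=float), np.asarray(box_b, dtype=float)
    lo = np.maximum(a[:3], b[:3])            # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])            # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, thresh):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [aabb_iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```

For example, two identical unit boxes give IoU 1.0, while boxes that merely touch at a corner give 0.0; acc@0.25IoU is simply `acc_at_iou(preds, gts, 0.25)` over all test descriptions.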

This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario.


Captioning metrics (CIDEr, BLEU-4, Rouge-L, METEOR) are reported at the 0.5 IoU threshold; DC mAP scores dense captioning and mAP@0.5 scores object detection. The parenthesized number after each score is the method's rank on that column. Rows are sorted by CIDEr@0.5IoU. The Info column shows the benchmark's license tag for released code, where one is listed.

| Method | Info | CIDEr@0.5IoU | BLEU-4@0.5IoU | Rouge-L@0.5IoU | METEOR@0.5IoU | DC mAP | mAP@0.5 |
|---|---|---|---|---|---|---|---|
| Chat-Scene-thres0.5 | permissive | 0.3456 (1) | 0.1859 (2) | 0.3162 (1) | 0.1527 (1) | 0.1415 (8) | 0.4856 (4) |
| Vote2Cap-DETR++ | | 0.3360 (2) | 0.1908 (1) | 0.3012 (2) | 0.1386 (2) | 0.1864 (1) | 0.5090 (1) |
| vote2cap-detr | permissive | 0.3128 (3) | 0.1778 (3) | 0.2842 (4) | 0.1316 (4) | 0.1825 (2) | 0.4454 (6) |
| TMP | | 0.3029 (4) | 0.1728 (4) | 0.2898 (3) | 0.1332 (3) | 0.1801 (3) | 0.4605 (5) |
| CFM | | 0.2360 (5) | 0.1417 (5) | 0.2253 (5) | 0.1034 (5) | 0.1379 (10) | 0.3008 (10) |
| CM3D-Trans+ | | 0.2348 (6) | 0.1383 (6) | 0.2250 (7) | 0.1030 (6) | 0.1398 (9) | 0.2966 (12) |
| Forest-xyz | | 0.2266 (7) | 0.1363 (7) | 0.2250 (6) | 0.1027 (7) | 0.1161 (15) | 0.2825 (15) |
| D3Net - Speaker | permissive | 0.2088 (8) | 0.1335 (9) | 0.2237 (8) | 0.1022 (8) | 0.1481 (7) | 0.4198 (7) |
| Chat-Scene-thres0.01 | | 0.2053 (9) | 0.1103 (10) | 0.1884 (10) | 0.0907 (10) | 0.1527 (5) | 0.5076 (2) |
| 3DJCG(Captioning) | permissive | 0.1918 (10) | 0.1350 (8) | 0.2207 (9) | 0.1013 (9) | 0.1506 (6) | 0.3867 (8) |
| REMAN | | 0.1662 (11) | 0.1070 (11) | 0.1790 (11) | 0.0815 (11) | 0.1235 (13) | 0.2927 (14) |
| NOAH | | 0.1382 (12) | 0.0901 (12) | 0.1598 (12) | 0.0747 (12) | 0.1359 (11) | 0.2977 (11) |
| SpaCap3D | permissive | 0.1359 (13) | 0.0883 (13) | 0.1591 (13) | 0.0738 (13) | 0.1182 (14) | 0.3275 (9) |
| X-Trans2Cap | permissive | 0.1274 (14) | 0.0808 (15) | 0.1392 (15) | 0.0653 (15) | 0.1244 (12) | 0.2795 (16) |
| Chat-Scene-all | | 0.1257 (15) | 0.0671 (17) | 0.1150 (17) | 0.0554 (17) | 0.1539 (4) | 0.5076 (2) |
| MORE-xyz | permissive | 0.1239 (16) | 0.0796 (16) | 0.1362 (16) | 0.0631 (16) | 0.1116 (17) | 0.2648 (17) |
| SUN+ | | 0.1148 (17) | 0.0846 (14) | 0.1564 (14) | 0.0711 (14) | 0.1143 (16) | 0.2958 (13) |
| Scan2Cap | permissive | 0.0849 (18) | 0.0576 (18) | 0.1073 (18) | 0.0492 (18) | 0.0970 (18) | 0.2481 (18) |

References (published methods):

- Chat-Scene variants — Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024.
- Vote2Cap-DETR++ — Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
- vote2cap-detr — Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023.
- CM3D-Trans+ — Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
- D3Net - Speaker — Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022.
- 3DJCG(Captioning) — Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
- SpaCap3D — Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. IJCAI 2022.
- X-Trans2Cap — Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022.
- MORE-xyz — Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order Relation Mining for Dense Captioning in 3D Scenes. ECCV 2022.
- Scan2Cap — Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021.
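The @0.5IoU suffix on the captioning columns means a caption's language score only counts when the box it describes actually localizes the ground-truth object. A minimal sketch of this IoU gating in the style of the m@kIoU measure introduced by Scan2Cap (the function name and per-object score inputs are illustrative; the official evaluation additionally handles matching predictions to objects):

```python
def metric_at_iou(caption_scores, ious, thresh=0.5):
    """IoU-gated captioning metric: each per-object caption score (e.g. CIDEr)
    contributes only if the associated predicted box reaches IoU >= thresh
    with the ground-truth object; otherwise it contributes 0. The result is
    the mean over all ground-truth objects."""
    gated = [s if iou >= thresh else 0.0 for s, iou in zip(caption_scores, ious)]
    return sum(gated) / len(gated)
```

So a model that writes perfect captions but localizes poorly still scores low on CIDEr@0.5IoU, which is why the detection column (mAP@0.5) and the gated captioning columns can rank methods quite differently, as in the Chat-Scene rows above.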