Benchmark Results - ScanRefer Benchmark

This table lists the benchmark results for the ScanRefer Localization Benchmark scenario. Each cell gives the score followed by the method's rank in that column.

		Unique	Unique	Multiple	Multiple	Overall	Overall
Method	Info	acc@0.25IoU	acc@0.5IoU	acc@0.25IoU	acc@0.5IoU	acc@0.25IoU	acc@0.5IoU
Chat-Scene		0.8887 1	0.8005 1	0.5421 1	0.4861 1	0.6198 1	0.5566 1
Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
CORE-3DVG		0.8557 3	0.6867 9	0.5275 2	0.3850 7	0.6011 2	0.4527 9

D-LISA		0.8195 6	0.6900 8	0.4975 3	0.3967 5	0.5697 3	0.4625 4
Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024
cus3d		0.8384 4	0.7073 6	0.4908 5	0.4000 3	0.5688 4	0.4689 3

M3DRef-test		0.7865 14	0.6793 14	0.4963 4	0.3977 4	0.5614 5	0.4608 5

ConcreteNet		0.8607 2	0.7923 2	0.4746 7	0.4091 2	0.5612 6	0.4950 2
Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024
pointclip		0.8211 5	0.7082 5	0.4803 6	0.3884 6	0.5567 7	0.4601 6

M3DRef-SCLIP		0.7997 9	0.7123 3	0.4708 8	0.3805 9	0.5445 8	0.4549 7

M3DRef-CLIP		0.7980 10	0.7085 4	0.4692 9	0.3807 8	0.5433 9	0.4545 8
Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023
3DInsVG		0.8170 7	0.6925 7	0.4582 11	0.3617 10	0.5386 10	0.4359 10

CSA-M3LM		0.8137 8	0.6241 21	0.4544 12	0.3317 13	0.5349 11	0.3972 13

RG-SAN		0.7964 11	0.6785 15	0.4591 10	0.3600 11	0.5348 12	0.4314 11

GALA-Grounder + 2D		0.7947 12	0.5713 30	0.4525 14	0.3202 14	0.5292 13	0.3765 19

bo3d-1		0.7469 33	0.5606 33	0.4539 13	0.3124 16	0.5196 14	0.3680 22

GALA-Grounder		0.7824 18	0.5796 28	0.4391 15	0.3131 15	0.5161 15	0.3728 20

3DJCG(Grounding)		0.7675 27	0.6059 23	0.4389 16	0.3117 17	0.5126 16	0.3776 18
Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
3DVG-Trans +		0.7733 24	0.5787 29	0.4370 17	0.3102 18	0.5124 17	0.3704 21
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
FE-3DGQA		0.7857 15	0.5862 27	0.4317 18	0.2935 22	0.5111 18	0.3592 26

SAVG		0.7758 23	0.5664 31	0.4236 19	0.2826 28	0.5026 19	0.3462 31

3DVG-Transformer		0.7576 29	0.5515 34	0.4224 20	0.2933 23	0.4976 20	0.3512 30
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021
HAM		0.7799 19	0.6373 20	0.4148 21	0.3324 12	0.4967 21	0.4007 12
Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
HGT		0.7692 26	0.5886 26	0.4141 22	0.2924 25	0.4937 22	0.3588 28

BEAUTY-DETR		0.7848 16	0.5499 35	0.3934 24	0.2480 35	0.4811 23	0.3157 35
Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
D3Net		0.7923 13	0.6843 10	0.3905 25	0.3074 19	0.4806 24	0.3919 14
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
ContraRefer		0.7832 17	0.6801 13	0.3850 26	0.2947 21	0.4743 25	0.3811 15

henet		0.7110 36	0.5180 38	0.3936 23	0.2472 36	0.4590 26	0.3030 37

Se2d		0.7799 19	0.6628 17	0.3636 33	0.2823 29	0.4569 27	0.3677 23

TransformerVG		0.7502 31	0.5977 24	0.3712 31	0.2628 32	0.4562 28	0.3379 33

Clip-pre		0.7766 22	0.6843 10	0.3617 37	0.2904 27	0.4547 29	0.3787 17

Clip		0.7733 24	0.6810 12	0.3619 35	0.2919 26	0.4542 30	0.3791 16

grounding		0.7298 34	0.5458 36	0.3822 28	0.2421 37	0.4538 31	0.3046 36

D3Net - Pretrained		0.7659 28	0.6579 18	0.3619 35	0.2726 30	0.4525 32	0.3590 27
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
secg		0.7288 35	0.6175 22	0.3696 32	0.2933 23	0.4501 33	0.3660 24

SRGA		0.7494 32	0.5128 39	0.3631 34	0.2218 38	0.4497 34	0.2871 38

InstanceRefer		0.7782 21	0.6669 16	0.3457 40	0.2688 31	0.4427 35	0.3580 29
Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021
SAF		0.6348 42	0.5647 32	0.3726 30	0.3009 20	0.4314 36	0.3601 25

ScanRefer		0.6859 38	0.4353 42	0.3488 39	0.2097 39	0.4244 37	0.2603 41
Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020
PointGroup_MCAN		0.7510 30	0.6397 19	0.3271 42	0.2535 33	0.4222 38	0.3401 32

Co3d3		0.5326 48	0.1369 48	0.3848 27	0.1651 48	0.4179 39	0.1588 48

bo3d		0.5400 47	0.1550 47	0.3817 29	0.1785 44	0.4172 40	0.1732 47

TGNN		0.6834 39	0.5894 25	0.3312 41	0.2526 34	0.4102 41	0.3281 34
Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021
SR-GAB		0.7016 37	0.5202 37	0.3233 44	0.1959 41	0.4081 42	0.2686 39

Co3d2		0.5070 49	0.1195 50	0.3569 38	0.1511 49	0.3906 43	0.1440 49

scanrefer2		0.6340 43	0.4353 42	0.3193 45	0.1947 42	0.3898 44	0.2486 42

ScanRefer Baseline		0.6422 41	0.4196 44	0.3090 46	0.1832 43	0.3837 45	0.2362 44

ScanRefer_vanilla		0.6488 40	0.4056 45	0.3052 47	0.1782 45	0.3823 46	0.2292 45

bo3d0		0.4823 50	0.1278 49	0.3271 42	0.1394 50	0.3619 47	0.1368 50

pairwisemethod		0.5779 45	0.3603 46	0.2792 49	0.1746 46	0.3462 48	0.2163 46

SPANet		0.5614 46	0.4641 41	0.2800 48	0.2071 40	0.3431 49	0.2647 40

TransformerRefer		0.6010 44	0.4658 40	0.2540 50	0.1730 47	0.3318 50	0.2386 43

Co3d		0.0000 51	0.0000 51	0.0000 51	0.0000 51	0.0000 51	0.0000 51
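The acc@0.25IoU and acc@0.5IoU columns above report the fraction of descriptions whose predicted box overlaps the ground-truth box with an IoU at or above the threshold. The sketch below illustrates that idea only. It assumes axis-aligned 3D boxes given as (min corner, max corner) pairs and a `>=` comparison. The names `box_iou_3d` and `acc_at_iou` are illustrative and are not taken from the benchmark's evaluation code.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (min_xyz, max_xyz)."""
    box_min, box_max = box
    volume = 1.0
    for lo, hi in zip(box_min, box_max):
        volume *= max(hi - lo, 0.0)
    return volume


def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union


def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of (prediction, ground truth) pairs with IoU >= threshold."""
    pairs = list(zip(pred_boxes, gt_boxes))
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gt in pairs if box_iou_3d(pred, gt) >= threshold)
    return hits / len(pairs)


# Hypothetical toy data: three predictions and their ground-truth boxes.
preds = [((0, 0, 0), (1, 1, 1)), ((0, 0, 0), (2, 2, 2)), ((0, 0, 0), (1, 1, 1))]
gts = [((0, 0, 0), (1, 1, 1)), ((1, 1, 1), (3, 3, 3)), ((0, 0, 0), (1, 1, 2))]
acc_025 = acc_at_iou(preds, gts, 0.25)
acc_05 = acc_at_iou(preds, gts, 0.5)
```

Under this reading, a method's acc@0.5IoU can never exceed its acc@0.25IoU, which matches every row of the table.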

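This table lists the benchmark results for the Scan2Cap Dense Captioning Benchmark scenario. Each cell gives the score followed by the method's rank in that column.

		Captioning F1-Score				Dense Captioning	Object Detection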
Method	Info	CIDEr@0.5IoU	BLEU-4@0.5IoU	Rouge-L@0.5IoU	METEOR@0.5IoU	DCmAP	mAP@0.5
Chat-Scene-thres0.5		0.3456 1	0.1859 2	0.3162 1	0.1527 1	0.1415 8	0.4856 4
Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024
CM3D-Trans+		0.2348 6	0.1383 6	0.2250 7	0.1030 6	0.1398 9	0.2966 12
Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma: Contextual Modeling for 3D Dense Captioning on Point Clouds.
Scan2Cap		0.0849 18	0.0576 18	0.1073 18	0.0492 18	0.0970 18	0.2481 18
Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. CVPR 2021
X-Trans2Cap		0.1274 14	0.0808 15	0.1392 15	0.0653 15	0.1244 12	0.2795 16
Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, Zhen Li: X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning. CVPR 2022
SpaCap3D		0.1359 13	0.0883 13	0.1591 13	0.0738 13	0.1182 14	0.3275 9
Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022
MORE-xyz		0.1239 16	0.0796 16	0.1362 16	0.0631 16	0.1116 17	0.2648 17
Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. ECCV 2022
REMAN		0.1662 11	0.1070 11	0.1790 11	0.0815 11	0.1235 13	0.2927 14

3DJCG(Captioning)		0.1918 10	0.1350 8	0.2207 9	0.1013 9	0.1506 6	0.3867 8
Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral)
SUN+		0.1148 17	0.0846 14	0.1564 14	0.0711 14	0.1143 16	0.2958 13

Chat-Scene-thres0.01		0.2053 9	0.1103 10	0.1884 10	0.0907 10	0.1527 5	0.5076 2

NOAH		0.1382 12	0.0901 12	0.1598 12	0.0747 12	0.1359 11	0.2977 11

Forest-xyz		0.2266 7	0.1363 7	0.2250 6	0.1027 7	0.1161 15	0.2825 15

CFM		0.2360 5	0.1417 5	0.2253 5	0.1034 5	0.1379 10	0.3008 10

vote2cap-detr		0.3128 3	0.1778 3	0.2842 4	0.1316 4	0.1825 2	0.4454 6
Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu, Taihao Li: End-to-End 3D Dense Captioning with Vote2Cap-DETR. CVPR 2023
Vote2Cap-DETR++		0.3360 2	0.1908 1	0.3012 2	0.1386 2	0.1864 1	0.5090 1
Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
TMP		0.3029 4	0.1728 4	0.2898 3	0.1332 3	0.1801 3	0.4605 5

Chat-Scene-all		0.1257 15	0.0671 17	0.1150 17	0.0554 17	0.1539 4	0.5076 2

D3Net - Speaker		0.2088 8	0.1335 9	0.2237 8	0.1022 8	0.1481 7	0.4198 7
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022
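Result rows in both tables share one layout: tab-separated cells, an empty Info cell, and result cells that each hold a score and a rank separated by a space. The sketch below reads one such row, assuming that layout is preserved when the table is copied as text. `parse_row` is an illustrative helper and is not part of any benchmark toolkit.

```python
def parse_row(line):
    """Split one leaderboard row into (method, [(score, rank), ...])."""
    cells = line.rstrip("\n").split("\t")
    method = cells[0].strip()
    results = []
    for cell in cells[1:]:
        cell = cell.strip()
        if not cell:
            continue  # skip the empty Info column
        score_text, rank_text = cell.split()
        results.append((float(score_text), int(rank_text)))
    return method, results


# Example using the Chat-Scene row of the localization table.
row = "Chat-Scene\t\t0.8887 1\t0.8005 1\t0.5421 1\t0.4861 1\t0.6198 1\t0.5566 1"
method, results = parse_row(row)
overall_acc_05, overall_rank_05 = results[-1]
```

Citation lines and blank lines do not follow this layout, so a caller would need to skip them before parsing.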
