
ScanRefer Benchmark
This table lists the results on the ScanRefer Localization Benchmark. Given a natural-language description of an object in a ScanNet scene, a method must predict the 3D bounding box of the referred object; acc@0.25IoU and acc@0.5IoU report the fraction of descriptions for which the predicted box overlaps the ground-truth box with an IoU of at least 0.25 and 0.5, respectively. The "Unique" columns cover descriptions whose target is the only object of its class in the scene, the "Multiple" columns cover targets with same-class distractors, and "Overall" aggregates both. The number in parentheses after each score is the method's rank on that metric; bracketed numbers in the Reference column point to the publications listed below the table.
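Both accuracy metrics are thresholded box-overlap scores. The sketch below is a minimal illustration of how acc@kIoU can be computed for axis-aligned boxes given as (min-corner, max-corner) tuples; the function names and box layout are illustrative assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    min_a, max_a = np.array(box_a[:3]), np.array(box_a[3:])
    min_b, max_b = np.array(box_b[:3]), np.array(box_b[3:])
    # Overlap extent along each axis, clamped at zero when the boxes do not intersect.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of queries whose predicted box matches the ground truth with IoU >= threshold."""
    hits = [box3d_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)

# Example: a single query with a perfect unit-cube prediction -> acc@0.5IoU = 1.0
pred = [(0, 0, 0, 1, 1, 1)]
gt = [(0, 0, 0, 1, 1, 1)]
print(acc_at_iou(pred, gt, threshold=0.5))
```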
| Method | Reference | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
|---|---|---|---|---|---|---|---|
| M3DRef-CLIP | | 0.7980 (3) | 0.7085 (1) | 0.4692 (1) | 0.3807 (1) | 0.5433 (1) | 0.4545 (1) |
| ConcreteNet | | 0.8120 (2) | 0.6933 (2) | 0.4479 (3) | 0.3760 (2) | 0.5296 (3) | 0.4471 (2) |
| HAM | [1] | 0.7799 (8) | 0.6373 (10) | 0.4148 (9) | 0.3324 (3) | 0.4967 (9) | 0.4007 (3) |
| CSA-M3LM | | 0.8137 (1) | 0.6241 (11) | 0.4544 (2) | 0.3317 (4) | 0.5349 (2) | 0.3972 (4) |
| D3Net | [2] | 0.7923 (4) | 0.6843 (3) | 0.3905 (13) | 0.3074 (7) | 0.4806 (12) | 0.3919 (5) |
| ContraRefer | | 0.7832 (7) | 0.6801 (6) | 0.3850 (14) | 0.2947 (8) | 0.4743 (13) | 0.3811 (6) |
| Clip | | 0.7733 (12) | 0.6810 (5) | 0.3619 (18) | 0.2919 (12) | 0.4542 (17) | 0.3791 (7) |
| Clip-pre | | 0.7766 (10) | 0.6843 (3) | 0.3617 (20) | 0.2904 (13) | 0.4547 (16) | 0.3787 (8) |
| 3DJCG (Grounding) | [3] | 0.7675 (15) | 0.6059 (12) | 0.4389 (4) | 0.3117 (5) | 0.5126 (4) | 0.3776 (9) |
| 3DVG-Trans + | [4] | 0.7733 (12) | 0.5787 (17) | 0.4370 (5) | 0.3102 (6) | 0.5124 (5) | 0.3704 (10) |
| FE-3DGQA | | 0.7857 (5) | 0.5862 (16) | 0.4317 (6) | 0.2935 (9) | 0.5111 (6) | 0.3592 (11) |
| D3Net - Pretrained | [2] | 0.7659 (16) | 0.6579 (8) | 0.3619 (18) | 0.2726 (15) | 0.4525 (19) | 0.3590 (12) |
| HGT | | 0.7692 (14) | 0.5886 (15) | 0.4141 (10) | 0.2924 (11) | 0.4937 (10) | 0.3588 (13) |
| InstanceRefer | [5] | 0.7782 (9) | 0.6669 (7) | 0.3457 (22) | 0.2688 (16) | 0.4427 (21) | 0.3580 (14) |
| 3DVG-Transformer | [4] | 0.7576 (17) | 0.5515 (19) | 0.4224 (8) | 0.2933 (10) | 0.4976 (8) | 0.3512 (15) |
| SAVG | | 0.7758 (11) | 0.5664 (18) | 0.4236 (7) | 0.2826 (14) | 0.5026 (7) | 0.3462 (16) |
| PointGroup_MCAN | | 0.7510 (18) | 0.6397 (9) | 0.3271 (24) | 0.2535 (18) | 0.4222 (23) | 0.3401 (17) |
| TransformerVG | | 0.7502 (19) | 0.5977 (13) | 0.3712 (16) | 0.2628 (17) | 0.4562 (15) | 0.3379 (18) |
| TGNN | [6] | 0.6834 (25) | 0.5894 (14) | 0.3312 (23) | 0.2526 (19) | 0.4102 (24) | 0.3281 (19) |
| BEAUTY-DETR | [7] | 0.7848 (6) | 0.5499 (20) | 0.3934 (12) | 0.2480 (20) | 0.4811 (11) | 0.3157 (20) |
| grounding | | 0.7298 (21) | 0.5458 (21) | 0.3822 (15) | 0.2421 (22) | 0.4538 (18) | 0.3046 (21) |
| henet | | 0.7110 (22) | 0.5180 (23) | 0.3936 (11) | 0.2472 (21) | 0.4590 (14) | 0.3030 (22) |
| SRGA | | 0.7494 (20) | 0.5128 (24) | 0.3631 (17) | 0.2218 (23) | 0.4497 (20) | 0.2871 (23) |
| SR-GAB | | 0.7016 (23) | 0.5202 (22) | 0.3233 (25) | 0.1959 (26) | 0.4081 (25) | 0.2686 (24) |
| SPANet | | 0.5614 (31) | 0.4641 (26) | 0.2800 (29) | 0.2071 (25) | 0.3431 (30) | 0.2647 (25) |
| ScanRefer | [8] | 0.6859 (24) | 0.4353 (27) | 0.3488 (21) | 0.2097 (24) | 0.4244 (22) | 0.2603 (26) |
| scanrefer2 | | 0.6340 (28) | 0.4353 (27) | 0.3193 (26) | 0.1947 (27) | 0.3898 (26) | 0.2486 (27) |
| TransformerRefer | | 0.6010 (29) | 0.4658 (25) | 0.2540 (31) | 0.1730 (31) | 0.3318 (31) | 0.2386 (28) |
| ScanRefer Baseline | | 0.6422 (27) | 0.4196 (29) | 0.3090 (27) | 0.1832 (28) | 0.3837 (27) | 0.2362 (29) |
| ScanRefer_vanilla | | 0.6488 (26) | 0.4056 (30) | 0.3052 (28) | 0.1782 (29) | 0.3823 (28) | 0.2292 (30) |
| pairwisemethod | | 0.5779 (30) | 0.3603 (31) | 0.2792 (30) | 0.1746 (30) | 0.3462 (29) | 0.2163 (31) |

References

[1] Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding.
[2] Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022.
[3] Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral).
[4] Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021.
[5] Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021.
[6] Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021.
[7] Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes.
[8] Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020.