
ScanRefer Benchmark
This table lists the results for the ScanRefer Localization Benchmark. Each cell shows the accuracy followed by the method's rank on that metric.
Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU |
---|---|---|---|---|---|---|
ConcreteNet | 0.8607 1 | 0.7923 1 | 0.4746 2 | 0.4091 1 | 0.5612 2 | 0.4950 1 |
Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. | | | | | | |
CORE-3DVG | 0.8557 2 | 0.6867 4 | 0.5275 1 | 0.3850 2 | 0.6011 1 | 0.4527 3 |
M3DRef-CLIP | 0.7980 5 | 0.7085 2 | 0.4692 3 | 0.3807 3 | 0.5433 3 | 0.4545 2 |
Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023 | | | | | | |
3DInsVG | 0.8170 3 | 0.6925 3 | 0.4582 4 | 0.3617 4 | 0.5386 4 | 0.4359 4 |
HAM | 0.7799 10 | 0.6373 12 | 0.4148 12 | 0.3324 5 | 0.4967 12 | 0.4007 5 |
Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding. | | | | | | |
CSA-M3LM | 0.8137 4 | 0.6241 13 | 0.4544 5 | 0.3317 6 | 0.5349 5 | 0.3972 6 |
bo3d-1 | 0.7469 23 | 0.5606 22 | 0.4539 6 | 0.3124 7 | 0.5196 6 | 0.3680 13 |
3DJCG(Grounding) | 0.7675 17 | 0.6059 14 | 0.4389 7 | 0.3117 8 | 0.5126 7 | 0.3776 11 |
Daigang Cai, Lichen Zhao, Jing Zhang†, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral | | | | | | |
3DVG-Trans + | 0.7733 14 | 0.5787 20 | 0.4370 8 | 0.3102 9 | 0.5124 8 | 0.3704 12 |
Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | | | | | | |
D3Net | 0.7923 6 | 0.6843 5 | 0.3905 16 | 0.3074 10 | 0.4806 15 | 0.3919 7 |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | | | | | | |
ContraRefer | 0.7832 9 | 0.6801 8 | 0.3850 17 | 0.2947 11 | 0.4743 16 | 0.3811 8 |
FE-3DGQA | 0.7857 7 | 0.5862 19 | 0.4317 9 | 0.2935 12 | 0.5111 9 | 0.3592 14 |
3DVG-Transformer | 0.7576 19 | 0.5515 23 | 0.4224 11 | 0.2933 13 | 0.4976 11 | 0.3512 19 |
Lichen Zhao∗, Daigang Cai∗, Lu Sheng†, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | | | | | | |
HGT | 0.7692 16 | 0.5886 18 | 0.4141 13 | 0.2924 14 | 0.4937 13 | 0.3588 16 |
Clip | 0.7733 14 | 0.6810 7 | 0.3619 23 | 0.2919 15 | 0.4542 20 | 0.3791 9 |
Clip-pre | 0.7766 12 | 0.6843 5 | 0.3617 25 | 0.2904 16 | 0.4547 19 | 0.3787 10 |
SAVG | 0.7758 13 | 0.5664 21 | 0.4236 10 | 0.2826 17 | 0.5026 10 | 0.3462 20 |
secg | 0.7230 25 | 0.6026 15 | 0.3548 27 | 0.2816 18 | 0.4373 25 | 0.3536 18 |
D3Net - Pretrained | 0.7659 18 | 0.6579 10 | 0.3619 23 | 0.2726 19 | 0.4525 22 | 0.3590 15 |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | | | | | | |
InstanceRefer | 0.7782 11 | 0.6669 9 | 0.3457 29 | 0.2688 20 | 0.4427 24 | 0.3580 17 |
Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li*, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021 | | | | | | |
TransformerVG | 0.7502 21 | 0.5977 16 | 0.3712 21 | 0.2628 21 | 0.4562 18 | 0.3379 22 |
PointGroup_MCAN | 0.7510 20 | 0.6397 11 | 0.3271 31 | 0.2535 22 | 0.4222 27 | 0.3401 21 |
TGNN | 0.6834 29 | 0.5894 17 | 0.3312 30 | 0.2526 23 | 0.4102 30 | 0.3281 23 |
Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021 | | | | | | |
BEAUTY-DETR | 0.7848 8 | 0.5499 24 | 0.3934 15 | 0.2480 24 | 0.4811 14 | 0.3157 24 |
Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes. | | | | | | |
henet | 0.7110 26 | 0.5180 27 | 0.3936 14 | 0.2472 25 | 0.4590 17 | 0.3030 26 |
grounding | 0.7298 24 | 0.5458 25 | 0.3822 19 | 0.2421 26 | 0.4538 21 | 0.3046 25 |
SRGA | 0.7494 22 | 0.5128 28 | 0.3631 22 | 0.2218 27 | 0.4497 23 | 0.2871 27 |
ScanRefer | 0.6859 28 | 0.4353 31 | 0.3488 28 | 0.2097 28 | 0.4244 26 | 0.2603 30 |
Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020 | | | | | | |
SPANet | 0.5614 35 | 0.4641 30 | 0.2800 37 | 0.2071 29 | 0.3431 38 | 0.2647 29 |
SR-GAB | 0.7016 27 | 0.5202 26 | 0.3233 33 | 0.1959 30 | 0.4081 31 | 0.2686 28 |
scanrefer2 | 0.6340 32 | 0.4353 31 | 0.3193 34 | 0.1947 31 | 0.3898 33 | 0.2486 31 |
ScanRefer Baseline | 0.6422 31 | 0.4196 33 | 0.3090 35 | 0.1832 32 | 0.3837 34 | 0.2362 33 |
bo3d | 0.5400 36 | 0.1550 36 | 0.3817 20 | 0.1785 33 | 0.4172 29 | 0.1732 36 |
ScanRefer_vanilla | 0.6488 30 | 0.4056 34 | 0.3052 36 | 0.1782 34 | 0.3823 35 | 0.2292 34 |
pairwisemethod | 0.5779 34 | 0.3603 35 | 0.2792 38 | 0.1746 35 | 0.3462 37 | 0.2163 35 |
TransformerRefer | 0.6010 33 | 0.4658 29 | 0.2540 39 | 0.1730 36 | 0.3318 39 | 0.2386 32 |
Co3d3 | 0.5326 37 | 0.1369 37 | 0.3848 18 | 0.1651 37 | 0.4179 28 | 0.1588 37 |
Co3d2 | 0.5070 38 | 0.1195 39 | 0.3569 26 | 0.1511 38 | 0.3906 32 | 0.1440 38 |
bo3d0 | 0.4823 39 | 0.1278 38 | 0.3271 31 | 0.1394 39 | 0.3619 36 | 0.1368 39 |
Co3d | 0.0000 40 | 0.0000 40 | 0.0000 40 | 0.0000 40 | 0.0000 40 | 0.0000 40 |
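
For readers unfamiliar with the acc@0.25IoU and acc@0.5IoU columns, the following is a minimal sketch of how such a metric is commonly computed. It is not the benchmark's evaluation code. It assumes each prediction and ground truth is an axis-aligned 3D box given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, and that a prediction counts as correct when its IoU is at least the threshold. The names `box_iou_3d` and `acc_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box layout (not taken from the benchmark page):
# (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (the >= comparison is an assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou_3d(p, g) >= threshold)
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6)]
    gts = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 3, 1, 1), (0, 0, 0, 1, 1, 1)]
    print(acc_at_iou(preds, gts, 0.25))
    print(acc_at_iou(preds, gts, 0.5))
```

In this toy example the three pairs have IoUs of 1, 1/3, and 0, so two of the three predictions pass the 0.25 threshold and only one passes 0.5. The same effect explains why every method in the table scores lower at 0.5 than at 0.25.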