ScanRefer Benchmark
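This table lists the results for the ScanRefer Localization Benchmark. Each cell shows a method's score followed by its rank in that column. "Unique" covers scenes in which the target is the only object of its class, "Multiple" covers scenes that contain several objects of the target's class, and "Overall" covers both.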
Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU | Info |
---|---|---|---|---|---|---|---|
ConcreteNet | 0.8607 1 | 0.7923 1 | 0.4746 5 | 0.4091 1 | 0.5612 4 | 0.4950 1 | |
Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool: Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. | |||||||
CORE-3DVG | 0.8557 2 | 0.6867 7 | 0.5275 1 | 0.3850 5 | 0.6011 1 | 0.4527 7 | |
cus3d | 0.8384 3 | 0.7073 5 | 0.4908 3 | 0.4000 2 | 0.5688 2 | 0.4689 2 | |
pointclip | 0.8211 4 | 0.7082 4 | 0.4803 4 | 0.3884 4 | 0.5567 5 | 0.4601 4 | |
3DInsVG | 0.8170 5 | 0.6925 6 | 0.4582 9 | 0.3617 8 | 0.5386 8 | 0.4359 8 | |
CSA-M3LM | 0.8137 6 | 0.6241 19 | 0.4544 10 | 0.3317 11 | 0.5349 9 | 0.3972 11 | |
M3DRef-SCLIP | 0.7997 7 | 0.7123 2 | 0.4708 6 | 0.3805 7 | 0.5445 6 | 0.4549 5 | |
M3DRef-CLIP | 0.7980 8 | 0.7085 3 | 0.4692 7 | 0.3807 6 | 0.5433 7 | 0.4545 6 | |
Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023 | |||||||
RG-SAN | 0.7964 9 | 0.6785 13 | 0.4591 8 | 0.3600 9 | 0.5348 10 | 0.4314 9 | |
D3Net | 0.7923 10 | 0.6843 8 | 0.3905 21 | 0.3074 15 | 0.4806 20 | 0.3919 12 | |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | |||||||
M3DRef-test | 0.7865 11 | 0.6793 12 | 0.4963 2 | 0.3977 3 | 0.5614 3 | 0.4608 3 | |
FE-3DGQA | 0.7857 12 | 0.5862 25 | 0.4317 14 | 0.2935 18 | 0.5111 14 | 0.3592 22 | |
BEAUTY-DETR | 0.7848 13 | 0.5499 31 | 0.3934 20 | 0.2480 31 | 0.4811 19 | 0.3157 31 | |
Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes. | |||||||
ContraRefer | 0.7832 14 | 0.6801 11 | 0.3850 22 | 0.2947 17 | 0.4743 21 | 0.3811 13 | |
HAM | 0.7799 15 | 0.6373 18 | 0.4148 17 | 0.3324 10 | 0.4967 17 | 0.4007 10 | |
Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding. | |||||||
Se2d | 0.7799 15 | 0.6628 15 | 0.3636 29 | 0.2823 25 | 0.4569 23 | 0.3677 19 | |
InstanceRefer | 0.7782 17 | 0.6669 14 | 0.3457 36 | 0.2688 27 | 0.4427 31 | 0.3580 25 | |
Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021 | |||||||
Clip-pre | 0.7766 18 | 0.6843 8 | 0.3617 33 | 0.2904 23 | 0.4547 25 | 0.3787 15 | |
SAVG | 0.7758 19 | 0.5664 27 | 0.4236 15 | 0.2826 24 | 0.5026 15 | 0.3462 27 | |
Clip | 0.7733 20 | 0.6810 10 | 0.3619 31 | 0.2919 22 | 0.4542 26 | 0.3791 14 | |
3DVG-Trans + | 0.7733 20 | 0.5787 26 | 0.4370 13 | 0.3102 14 | 0.5124 13 | 0.3704 17 | |
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | |||||||
HGT | 0.7692 22 | 0.5886 24 | 0.4141 18 | 0.2924 21 | 0.4937 18 | 0.3588 24 | |
3DJCG(Grounding) | 0.7675 23 | 0.6059 21 | 0.4389 12 | 0.3117 13 | 0.5126 12 | 0.3776 16 | |
Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral) | |||||||
D3Net - Pretrained | 0.7659 24 | 0.6579 16 | 0.3619 31 | 0.2726 26 | 0.4525 28 | 0.3590 23 | |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | |||||||
3DVG-Transformer | 0.7576 25 | 0.5515 30 | 0.4224 16 | 0.2933 19 | 0.4976 16 | 0.3512 26 | |
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | |||||||
PointGroup_MCAN | 0.7510 26 | 0.6397 17 | 0.3271 38 | 0.2535 29 | 0.4222 34 | 0.3401 28 | |
TransformerVG | 0.7502 27 | 0.5977 22 | 0.3712 27 | 0.2628 28 | 0.4562 24 | 0.3379 29 | |
SRGA | 0.7494 28 | 0.5128 35 | 0.3631 30 | 0.2218 34 | 0.4497 30 | 0.2871 34 | |
bo3d-1 | 0.7469 29 | 0.5606 29 | 0.4539 11 | 0.3124 12 | 0.5196 11 | 0.3680 18 | |
grounding | 0.7298 30 | 0.5458 32 | 0.3822 24 | 0.2421 33 | 0.4538 27 | 0.3046 32 | |
secg | 0.7288 31 | 0.6175 20 | 0.3696 28 | 0.2933 19 | 0.4501 29 | 0.3660 20 | |
henet | 0.7110 32 | 0.5180 34 | 0.3936 19 | 0.2472 32 | 0.4590 22 | 0.3030 33 | |
SR-GAB | 0.7016 33 | 0.5202 33 | 0.3233 40 | 0.1959 37 | 0.4081 38 | 0.2686 35 | |
ScanRefer | 0.6859 34 | 0.4353 38 | 0.3488 35 | 0.2097 35 | 0.4244 33 | 0.2603 37 | |
Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020 | |||||||
TGNN | 0.6834 35 | 0.5894 23 | 0.3312 37 | 0.2526 30 | 0.4102 37 | 0.3281 30 | |
Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021 | |||||||
ScanRefer_vanilla | 0.6488 36 | 0.4056 41 | 0.3052 43 | 0.1782 41 | 0.3823 42 | 0.2292 41 | |
ScanRefer Baseline | 0.6422 37 | 0.4196 40 | 0.3090 42 | 0.1832 39 | 0.3837 41 | 0.2362 40 | |
SAF | 0.6348 38 | 0.5647 28 | 0.3726 26 | 0.3009 16 | 0.4314 32 | 0.3601 21 | |
scanrefer2 | 0.6340 39 | 0.4353 38 | 0.3193 41 | 0.1947 38 | 0.3898 40 | 0.2486 38 | |
TransformerRefer | 0.6010 40 | 0.4658 36 | 0.2540 46 | 0.1730 43 | 0.3318 46 | 0.2386 39 | |
pairwisemethod | 0.5779 41 | 0.3603 42 | 0.2792 45 | 0.1746 42 | 0.3462 44 | 0.2163 42 | |
SPANet | 0.5614 42 | 0.4641 37 | 0.2800 44 | 0.2071 36 | 0.3431 45 | 0.2647 36 | |
bo3d | 0.5400 43 | 0.1550 43 | 0.3817 25 | 0.1785 40 | 0.4172 36 | 0.1732 43 | |
Co3d3 | 0.5326 44 | 0.1369 44 | 0.3848 23 | 0.1651 44 | 0.4179 35 | 0.1588 44 | |
Co3d2 | 0.5070 45 | 0.1195 46 | 0.3569 34 | 0.1511 45 | 0.3906 39 | 0.1440 45 | |
bo3d0 | 0.4823 46 | 0.1278 45 | 0.3271 38 | 0.1394 46 | 0.3619 43 | 0.1368 46 | |
Co3d | 0.0000 47 | 0.0000 47 | 0.0000 47 | 0.0000 47 | 0.0000 47 | 0.0000 47 | |
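
The acc@0.25IoU and acc@0.5IoU columns report the fraction of descriptions for which the predicted 3D box overlaps the ground-truth box with an intersection-over-union of at least the given threshold. The sketch below illustrates that computation under some assumptions that are not stated on this page: boxes are axis-aligned and given as (min corner, max corner), each description has exactly one predicted and one ground-truth box, and a prediction counts as a hit when its IoU is greater than or equal to the threshold. The names `box_iou_3d` and `acc_at_iou` are hypothetical helpers, not part of the benchmark's evaluation code.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (min_corner, max_corner), each an (x, y, z) triple.
Box = Tuple[Sequence[float], Sequence[float]]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    lo, hi = box
    vol = 1.0
    for l, h in zip(lo, hi):
        vol *= max(h - l, 0.0)
    return vol


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    inter = 1.0
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        overlap = min(ah, bh) - max(al, bl)
        if overlap <= 0:
            return 0.0  # no overlap along this axis, so no intersection
        inter *= overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: List[Box], gts: List[Box], threshold: float) -> float:
    """Fraction of (prediction, ground truth) pairs with IoU >= threshold.

    Using >= rather than > is an assumption made for this sketch.
    """
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Three made-up pairs, for illustration only.
    preds = [((0, 0, 0), (1, 1, 1)), ((0, 0, 0), (1, 1, 1)), ((0, 0, 0), (2, 2, 2))]
    gts = [((0, 0, 0), (1, 1, 3)), ((2, 2, 2), (3, 3, 3)), ((0, 0, 0), (2, 2, 3))]
    print("acc@0.25IoU:", acc_at_iou(preds, gts, 0.25))
    print("acc@0.5IoU:", acc_at_iou(preds, gts, 0.5))
```

Because the 0.5 threshold is stricter, a method's acc@0.5IoU can never exceed its acc@0.25IoU on the same split, which matches every row in the table above.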