ScanRefer Benchmark
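This table lists the results for the ScanRefer Localization Benchmark. Each metric cell gives a method's score followed by its rank in that column, and rows are sorted by Overall acc@0.5IoU. A line of authors and a title beneath an entry is the publication for that entry.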
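As a reading aid for the acc@0.25IoU and acc@0.5IoU columns, here is a minimal sketch of how such an accuracy can be computed: the fraction of queries whose predicted box overlaps the ground-truth box with an IoU of at least the threshold. It assumes axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and an inclusive `>=` comparison. The function names are hypothetical, and this is not the benchmark's official evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes count as zero."""
    dx, dy, dz = box[3] - box[0], box[4] - box[1], box[5] - box[2]
    return max(dx, 0.0) * max(dy, 0.0) * max(dz, 0.0)


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of (prediction, ground truth) pairs with IoU >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou_3d(p, g) >= threshold)
    return hits / len(preds)
```

Under these assumptions, `acc_at_iou(preds, gts, 0.25)` and `acc_at_iou(preds, gts, 0.5)` correspond to the two thresholds reported per subset.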
Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU | Info |
---|---|---|---|---|---|---|---|
Chat-Scene | 0.8887 1 | 0.8005 1 | 0.5421 1 | 0.4861 1 | 0.6198 1 | 0.5566 1 | |
Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024 | |||||||
ConcreteNet | 0.8607 2 | 0.7923 2 | 0.4746 7 | 0.4091 2 | 0.5612 6 | 0.4950 2 | |
Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024 | |||||||
cus3d | 0.8384 4 | 0.7073 6 | 0.4908 5 | 0.4000 3 | 0.5688 4 | 0.4689 3 | |
D-LISA | 0.8195 6 | 0.6900 8 | 0.4975 3 | 0.3967 5 | 0.5697 3 | 0.4625 4 | |
Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024 | |||||||
M3DRef-test | 0.7865 19 | 0.6793 14 | 0.4963 4 | 0.3977 4 | 0.5614 5 | 0.4608 5 | |
pointclip | 0.8211 5 | 0.7082 5 | 0.4803 6 | 0.3884 6 | 0.5567 7 | 0.4601 6 | |
M3DRef-SCLIP | 0.7997 12 | 0.7123 3 | 0.4708 8 | 0.3805 9 | 0.5445 8 | 0.4549 7 | |
M3DRef-CLIP | 0.7980 13 | 0.7085 4 | 0.4692 9 | 0.3807 8 | 0.5433 9 | 0.4545 8 | |
Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023 | |||||||
CORE-3DVG | 0.8557 3 | 0.6867 9 | 0.5275 2 | 0.3850 7 | 0.6011 2 | 0.4527 9 | |
3DInsVG | 0.8170 7 | 0.6925 7 | 0.4582 12 | 0.3617 10 | 0.5386 10 | 0.4359 10 | |
RG-SAN | 0.7964 14 | 0.6785 15 | 0.4591 11 | 0.3600 11 | 0.5348 13 | 0.4314 11 | |
HAM | 0.7799 25 | 0.6373 20 | 0.4148 27 | 0.3324 12 | 0.4967 27 | 0.4007 12 | |
Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding. | |||||||
CSA-M3LM | 0.8137 8 | 0.6241 21 | 0.4544 18 | 0.3317 13 | 0.5349 12 | 0.3972 13 | |
D3Net | 0.7923 17 | 0.6843 10 | 0.3905 31 | 0.3074 25 | 0.4806 30 | 0.3919 14 | |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | |||||||
GALA-Grounder-D3 | 0.7939 16 | 0.5952 25 | 0.4625 10 | 0.3229 15 | 0.5368 11 | 0.3839 15 | |
LAG-3D-2 | 0.7964 14 | 0.5812 31 | 0.4572 14 | 0.3245 14 | 0.5333 14 | 0.3821 16 | |
ContraRefer | 0.7832 23 | 0.6801 13 | 0.3850 32 | 0.2947 27 | 0.4743 31 | 0.3811 17 | |
LAG-3D-3 | 0.7815 24 | 0.5837 29 | 0.4556 16 | 0.3219 16 | 0.5287 20 | 0.3806 18 | |
Graph-VG-2 | 0.8021 11 | 0.5829 30 | 0.4546 17 | 0.3217 17 | 0.5325 15 | 0.3802 19 | |
Clip | 0.7733 30 | 0.6810 12 | 0.3619 42 | 0.2919 32 | 0.4542 37 | 0.3791 20 | |
Clip-pre | 0.7766 28 | 0.6843 10 | 0.3617 44 | 0.2904 33 | 0.4547 36 | 0.3787 21 | |
3DJCG(Grounding) | 0.7675 33 | 0.6059 23 | 0.4389 22 | 0.3117 23 | 0.5126 22 | 0.3776 22 | |
Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 (Oral) | |||||||
Graph-VG-3 | 0.8038 10 | 0.5812 31 | 0.4515 20 | 0.3169 19 | 0.5305 17 | 0.3762 23 | |
GALA-Grounder-D1 | 0.8104 9 | 0.5754 34 | 0.4479 21 | 0.3176 18 | 0.5292 19 | 0.3754 24 | |
Graph-VG-4 | 0.7848 21 | 0.5631 37 | 0.4560 15 | 0.3164 21 | 0.5298 18 | 0.3717 25 | |
LAG-3D | 0.7881 18 | 0.5606 38 | 0.4579 13 | 0.3169 19 | 0.5320 16 | 0.3715 26 | |
3DVG-Trans + | 0.7733 30 | 0.5787 33 | 0.4370 23 | 0.3102 24 | 0.5124 23 | 0.3704 27 | |
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | |||||||
bo3d-1 | 0.7469 39 | 0.5606 38 | 0.4539 19 | 0.3124 22 | 0.5196 21 | 0.3680 28 | |
Se2d | 0.7799 25 | 0.6628 17 | 0.3636 40 | 0.2823 35 | 0.4569 34 | 0.3677 29 | |
secg | 0.7288 42 | 0.6175 22 | 0.3696 39 | 0.2933 29 | 0.4501 40 | 0.3660 30 | |
SAF | 0.6348 49 | 0.5647 36 | 0.3726 37 | 0.3009 26 | 0.4314 43 | 0.3601 31 | |
FE-3DGQA | 0.7857 20 | 0.5862 28 | 0.4317 24 | 0.2935 28 | 0.5111 24 | 0.3592 32 | |
D3Net - Pretrained | 0.7659 34 | 0.6579 18 | 0.3619 42 | 0.2726 36 | 0.4525 39 | 0.3590 33 | |
Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. 17th European Conference on Computer Vision (ECCV), 2022 | |||||||
HGT | 0.7692 32 | 0.5886 27 | 0.4141 28 | 0.2924 31 | 0.4937 28 | 0.3588 34 | |
InstanceRefer | 0.7782 27 | 0.6669 16 | 0.3457 47 | 0.2688 38 | 0.4427 42 | 0.3580 35 | |
Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021 | |||||||
3DVG-Transformer | 0.7576 35 | 0.5515 40 | 0.4224 26 | 0.2933 29 | 0.4976 26 | 0.3512 36 | |
Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021 | |||||||
SAVG | 0.7758 29 | 0.5664 35 | 0.4236 25 | 0.2826 34 | 0.5026 25 | 0.3462 37 | |
PointGroup_MCAN | 0.7510 36 | 0.6397 19 | 0.3271 49 | 0.2535 40 | 0.4222 45 | 0.3401 38 | |
TransformerVG | 0.7502 37 | 0.5977 24 | 0.3712 38 | 0.2628 39 | 0.4562 35 | 0.3379 39 | |
TFVG3D ++ | 0.7453 40 | 0.5458 43 | 0.3793 36 | 0.2690 37 | 0.4614 32 | 0.3311 40 | |
Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024 | |||||||
TGNN | 0.6834 46 | 0.5894 26 | 0.3312 48 | 0.2526 41 | 0.4102 48 | 0.3281 41 | |
Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021 | |||||||
BEAUTY-DETR | 0.7848 21 | 0.5499 41 | 0.3934 30 | 0.2480 42 | 0.4811 29 | 0.3157 42 | |
Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes. | |||||||
grounding | 0.7298 41 | 0.5458 42 | 0.3822 34 | 0.2421 44 | 0.4538 38 | 0.3046 43 | |
henet | 0.7110 43 | 0.5180 45 | 0.3936 29 | 0.2472 43 | 0.4590 33 | 0.3030 44 | |
SRGA | 0.7494 38 | 0.5128 46 | 0.3631 41 | 0.2218 45 | 0.4497 41 | 0.2871 45 | |
SR-GAB | 0.7016 44 | 0.5202 44 | 0.3233 51 | 0.1959 48 | 0.4081 49 | 0.2686 46 | |
SPANet | 0.5614 53 | 0.4641 48 | 0.2800 55 | 0.2071 47 | 0.3431 56 | 0.2647 47 | |
ScanRefer | 0.6859 45 | 0.4353 49 | 0.3488 46 | 0.2097 46 | 0.4244 44 | 0.2603 48 | |
Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. 16th European Conference on Computer Vision (ECCV), 2020 | |||||||
scanrefer2 | 0.6340 50 | 0.4353 49 | 0.3193 52 | 0.1947 49 | 0.3898 51 | 0.2486 49 | |
TransformerRefer | 0.6010 51 | 0.4658 47 | 0.2540 57 | 0.1730 54 | 0.3318 57 | 0.2386 50 | |
ScanRefer Baseline | 0.6422 48 | 0.4196 51 | 0.3090 53 | 0.1832 50 | 0.3837 52 | 0.2362 51 | |
ScanRefer_vanilla | 0.6488 47 | 0.4056 52 | 0.3052 54 | 0.1782 52 | 0.3823 53 | 0.2292 52 | |
pairwisemethod | 0.5779 52 | 0.3603 53 | 0.2792 56 | 0.1746 53 | 0.3462 55 | 0.2163 53 | |
bo3d | 0.5400 54 | 0.1550 54 | 0.3817 35 | 0.1785 51 | 0.4172 47 | 0.1732 54 | |
Co3d3 | 0.5326 55 | 0.1369 55 | 0.3848 33 | 0.1651 55 | 0.4179 46 | 0.1588 55 | |
Co3d2 | 0.5070 56 | 0.1195 57 | 0.3569 45 | 0.1511 56 | 0.3906 50 | 0.1440 56 | |
bo3d0 | 0.4823 57 | 0.1278 56 | 0.3271 49 | 0.1394 57 | 0.3619 54 | 0.1368 57 | |
Co3d | 0.0000 58 | 0.0000 58 | 0.0000 58 | 0.0000 58 | 0.0000 58 | 0.0000 58 | |
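
The Overall columns in the table above appear to be a sample-weighted mean of the Unique and Multiple columns. This is an inference from the published numbers, not something the page states. The sketch below estimates the implied share of Unique samples from one row and checks whether that same share reproduces the Overall score of other rows. The helper names are hypothetical; the scores are the acc@0.25IoU values copied from the table.

```python
from typing import Tuple

# (unique, multiple, overall) scores of one method for a single metric.
Row = Tuple[float, float, float]


def implied_unique_fraction(row: Row) -> float:
    """Solve overall = w * unique + (1 - w) * multiple for the weight w."""
    unique, multiple, overall = row
    if unique == multiple:
        raise ValueError("weight is undetermined when unique == multiple")
    return (overall - multiple) / (unique - multiple)


def predicted_overall(row: Row, unique_fraction: float) -> float:
    """Overall score expected if the given share of samples is 'Unique'."""
    unique, multiple, _ = row
    return unique_fraction * unique + (1.0 - unique_fraction) * multiple


# acc@0.25IoU values copied from the table.
CHAT_SCENE: Row = (0.8887, 0.5421, 0.6198)
CONCRETENET: Row = (0.8607, 0.4746, 0.5612)
SCANREFER: Row = (0.6859, 0.3488, 0.4244)

# Assumption: every submission is scored on the same test split,
# so the weight derived from one row should carry over to the others.
w = implied_unique_fraction(CHAT_SCENE)
for name, row in (("ConcreteNet", CONCRETENET), ("ScanRefer", SCANREFER)):
    gap = abs(predicted_overall(row, w) - row[2])
    print(f"{name}: |predicted - reported| = {gap:.4f}")
```

With these three rows the implied Unique share comes out at roughly 0.22, and the reconstructed Overall scores agree with the reported ones to within the four-decimal rounding of the table.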