
ScanRefer Benchmark
This table lists the results on the ScanRefer Localization Benchmark. Methods are evaluated by acc@kIoU: the fraction of descriptions for which the predicted bounding box overlaps the ground-truth box with an IoU greater than k, for k = 0.25 and k = 0.5. Results are reported on the Unique subset (the target is the only object of its class in the scene), the Multiple subset (same-class distractors are present), and Overall. Each accuracy value is followed by the method's rank on that metric.
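For reference, the acc@kIoU metric can be sketched as below, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` tuples; the benchmark's actual evaluation script may use a different box parameterization.

```python
# Sketch of the acc@k-IoU localization metric, assuming axis-aligned
# 3D boxes encoded as (xmin, ymin, zmin, xmax, ymax, zmax).

def box_volume(b):
    """Volume of an axis-aligned box (0 if degenerate)."""
    return max(0.0, b[3] - b[0]) * max(0.0, b[4] - b[1]) * max(0.0, b[5] - b[2])

def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    ix = min(a[3], b[3]) - max(a[0], b[0])
    iy = min(a[4], b[4]) - max(a[1], b[1])
    iz = min(a[5], b[5]) - max(a[2], b[2])
    if ix <= 0 or iy <= 0 or iz <= 0:
        return 0.0
    inter = ix * iy * iz
    union = box_volume(a) + box_volume(b) - inter
    return inter / union

def acc_at_iou(pred_boxes, gt_boxes, k):
    """Fraction of predictions whose IoU with the ground truth exceeds k."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou_3d(p, g) > k)
    return hits / len(pred_boxes)
```

For example, a prediction shifted by half a box width against a unit ground-truth box has IoU 1/3, so it counts as a hit at the 0.25 threshold but a miss at 0.5.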
Method | Unique acc@0.25IoU | Unique acc@0.5IoU | Multiple acc@0.25IoU | Multiple acc@0.5IoU | Overall acc@0.25IoU | Overall acc@0.5IoU
---|---|---|---|---|---|---
UniVLG | 0.8895 (1) | 0.8236 (1) | 0.5921 (1) | 0.5030 (1) | 0.6588 (1) | 0.5749 (1)
_Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Alexander Sax, Franziska Meier, Katerina Fragkiadaki: Unifying 2D and 3D Vision-Language Understanding._ | | | | | |
Chat-Scene | 0.8887 (2) | 0.8005 (2) | 0.5421 (2) | 0.4861 (2) | 0.6198 (2) | 0.5566 (2)
_Haifeng Huang, Yilun Chen, Zehan Wang, et al.: Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers. NeurIPS 2024_ | | | | | |
ConcreteNet | 0.8607 (3) | 0.7923 (3) | 0.4746 (9) | 0.4091 (3) | 0.5612 (8) | 0.4950 (3)
_Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. ECCV 2024_ | | | | | |
cus3d | 0.8384 (5) | 0.7073 (7) | 0.4908 (7) | 0.4000 (4) | 0.5688 (6) | 0.4689 (4)
D-LISA | 0.8195 (7) | 0.6900 (9) | 0.4975 (5) | 0.3967 (6) | 0.5697 (5) | 0.4625 (5)
_Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh: Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention. NeurIPS 2024_ | | | | | |
M3DRef-test | 0.7865 (23) | 0.6793 (15) | 0.4963 (6) | 0.3977 (5) | 0.5614 (7) | 0.4608 (6)
pointclip | 0.8211 (6) | 0.7082 (6) | 0.4803 (8) | 0.3884 (7) | 0.5567 (9) | 0.4601 (7)
M3DRef-SCLIP | 0.7997 (14) | 0.7123 (4) | 0.4708 (10) | 0.3805 (10) | 0.5445 (10) | 0.4549 (8)
M3DRef-CLIP | 0.7980 (15) | 0.7085 (5) | 0.4692 (11) | 0.3807 (9) | 0.5433 (11) | 0.4545 (9)
_Yiming Zhang, ZeMing Gong, Angel X. Chang: Multi3DRefer: Grounding Text Description to Multiple 3D Objects. ICCV 2023_ | | | | | |
CORE-3DVG | 0.8557 (4) | 0.6867 (10) | 0.5275 (3) | 0.3850 (8) | 0.6011 (3) | 0.4527 (10)
3DInsVG | 0.8170 (8) | 0.6925 (8) | 0.4582 (16) | 0.3617 (11) | 0.5386 (14) | 0.4359 (11)
RG-SAN | 0.7964 (16) | 0.6785 (16) | 0.4591 (15) | 0.3600 (12) | 0.5348 (17) | 0.4314 (12)
3DVLP-rf-enhance | 0.7939 (19) | 0.6546 (21) | 0.4651 (12) | 0.3491 (13) | 0.5388 (13) | 0.4176 (13)
3DVLP-rf | 0.7964 (16) | 0.6579 (19) | 0.4646 (13) | 0.3436 (15) | 0.5390 (12) | 0.4140 (14)
3DVLP-baseline | 0.7766 (35) | 0.6373 (24) | 0.4572 (18) | 0.3469 (14) | 0.5288 (24) | 0.4120 (15)
3dvlp-with-judge | 0.7807 (31) | 0.6472 (22) | 0.4498 (27) | 0.3407 (16) | 0.5240 (27) | 0.4094 (16)
Jung | 0.8096 (11) | 0.6331 (26) | 0.5113 (4) | 0.3398 (18) | 0.5782 (4) | 0.4055 (17)
ScanRefer-3dvlp-test | 0.7824 (29) | 0.6298 (27) | 0.4532 (25) | 0.3405 (17) | 0.5270 (26) | 0.4054 (18)
HAM | 0.7799 (32) | 0.6373 (24) | 0.4148 (35) | 0.3324 (20) | 0.4967 (35) | 0.4007 (19)
_Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding._ | | | | | |
CSA-M3LM | 0.8137 (9) | 0.6241 (28) | 0.4544 (23) | 0.3317 (21) | 0.5349 (16) | 0.3972 (20)
3dvlp-judge-h | 0.7552 (45) | 0.6051 (31) | 0.4458 (29) | 0.3340 (19) | 0.5152 (29) | 0.3948 (21)
D3Net | 0.7923 (21) | 0.6843 (11) | 0.3905 (39) | 0.3074 (33) | 0.4806 (38) | 0.3919 (22)
_Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022_ | | | | | |
GALA-Grounder-D3 | 0.7939 (19) | 0.5952 (33) | 0.4625 (14) | 0.3229 (23) | 0.5368 (15) | 0.3839 (23)
LAG-3D-2 | 0.7964 (16) | 0.5812 (39) | 0.4572 (18) | 0.3245 (22) | 0.5333 (18) | 0.3821 (24)
ContraRefer | 0.7832 (27) | 0.6801 (14) | 0.3850 (40) | 0.2947 (35) | 0.4743 (39) | 0.3811 (25)
LAG-3D-3 | 0.7815 (30) | 0.5837 (37) | 0.4556 (21) | 0.3219 (24) | 0.5287 (25) | 0.3806 (26)
Graph-VG-2 | 0.8021 (13) | 0.5829 (38) | 0.4546 (22) | 0.3217 (25) | 0.5325 (19) | 0.3802 (27)
Clip | 0.7733 (38) | 0.6810 (13) | 0.3619 (53) | 0.2919 (40) | 0.4542 (47) | 0.3791 (28)
Clip-pre | 0.7766 (35) | 0.6843 (11) | 0.3617 (55) | 0.2904 (41) | 0.4547 (46) | 0.3787 (29)
3DJCG (Grounding) | 0.7675 (42) | 0.6059 (30) | 0.4389 (30) | 0.3117 (31) | 0.5126 (30) | 0.3776 (30)
_Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu: 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022 Oral_ | | | | | |
Graph-VG-3 | 0.8038 (12) | 0.5812 (39) | 0.4515 (26) | 0.3169 (27) | 0.5305 (21) | 0.3762 (31)
GALA-Grounder-D1 | 0.8104 (10) | 0.5754 (42) | 0.4479 (28) | 0.3176 (26) | 0.5292 (23) | 0.3754 (32)
Graph-VG-4 | 0.7848 (25) | 0.5631 (45) | 0.4560 (20) | 0.3164 (29) | 0.5298 (22) | 0.3717 (33)
LAG-3D | 0.7881 (22) | 0.5606 (46) | 0.4579 (17) | 0.3169 (27) | 0.5320 (20) | 0.3715 (34)
3DVG-Trans+ | 0.7733 (38) | 0.5787 (41) | 0.4370 (31) | 0.3102 (32) | 0.5124 (31) | 0.3704 (35)
_Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021_ | | | | | |
bo3d-1 | 0.7469 (49) | 0.5606 (46) | 0.4539 (24) | 0.3124 (30) | 0.5196 (28) | 0.3680 (36)
Se2d | 0.7799 (32) | 0.6628 (18) | 0.3636 (50) | 0.2823 (43) | 0.4569 (43) | 0.3677 (37)
secg | 0.7288 (53) | 0.6175 (29) | 0.3696 (48) | 0.2933 (37) | 0.4501 (50) | 0.3660 (38)
SAF | 0.6348 (61) | 0.5647 (44) | 0.3726 (46) | 0.3009 (34) | 0.4314 (54) | 0.3601 (39)
FE-3DGQA | 0.7857 (24) | 0.5862 (36) | 0.4317 (32) | 0.2935 (36) | 0.5111 (32) | 0.3592 (40)
D3Net - Pretrained | 0.7659 (43) | 0.6579 (19) | 0.3619 (53) | 0.2726 (44) | 0.4525 (49) | 0.3590 (41)
_Dave Zhenyu Chen, Qirui Wu, Matthias Niessner, Angel X. Chang: D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding. ECCV 2022_ | | | | | |
HGT | 0.7692 (41) | 0.5886 (35) | 0.4141 (36) | 0.2924 (39) | 0.4937 (36) | 0.3588 (42)
InstanceRefer | 0.7782 (34) | 0.6669 (17) | 0.3457 (59) | 0.2688 (46) | 0.4427 (52) | 0.3580 (43)
_Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021_ | | | | | |
3DVG-Transformer | 0.7576 (44) | 0.5515 (49) | 0.4224 (34) | 0.2933 (37) | 0.4976 (34) | 0.3512 (44)
_Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu: 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021_ | | | | | |
SAVG | 0.7758 (37) | 0.5664 (43) | 0.4236 (33) | 0.2826 (42) | 0.5026 (33) | 0.3462 (45)
PointGroup_MCAN | 0.7510 (46) | 0.6397 (23) | 0.3271 (62) | 0.2535 (48) | 0.4222 (56) | 0.3401 (46)
TransformerVG | 0.7502 (47) | 0.5977 (32) | 0.3712 (47) | 0.2628 (47) | 0.4562 (45) | 0.3379 (47)
TFVG3D++ | 0.7453 (50) | 0.5458 (53) | 0.3793 (44) | 0.2690 (45) | 0.4614 (41) | 0.3311 (48)
_Ali Solgi, Mehdi Ezoji: A Transformer-based Framework for Visual Grounding on 3D Point Clouds. AISP 2024_ | | | | | |
TGNN | 0.6834 (58) | 0.5894 (34) | 0.3312 (60) | 0.2526 (49) | 0.4102 (60) | 0.3281 (49)
_Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu: Text-Guided Graph Neural Network for Referring 3D Instance Segmentation. AAAI 2021_ | | | | | |
BEAUTY-DETR | 0.7848 (25) | 0.5499 (50) | 0.3934 (38) | 0.2480 (50) | 0.4811 (37) | 0.3157 (50)
_Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki: Looking Outside the Box to Ground Language in 3D Scenes._ | | | | | |
grounding | 0.7298 (52) | 0.5458 (52) | 0.3822 (42) | 0.2421 (52) | 0.4538 (48) | 0.3046 (51)
henet | 0.7110 (54) | 0.5180 (55) | 0.3936 (37) | 0.2472 (51) | 0.4590 (42) | 0.3030 (52)
scanrefer-rj-14bz | 0.7832 (27) | 0.5524 (48) | 0.3746 (45) | 0.2275 (53) | 0.4662 (40) | 0.3004 (53)
scanrefer-test-14bz | 0.7700 (40) | 0.5474 (51) | 0.3665 (49) | 0.2247 (54) | 0.4569 (43) | 0.2970 (54)
SRGA | 0.7494 (48) | 0.5128 (56) | 0.3631 (51) | 0.2218 (55) | 0.4497 (51) | 0.2871 (55)
scanrefer-rj-org | 0.7345 (51) | 0.4716 (57) | 0.3536 (57) | 0.2125 (56) | 0.4390 (53) | 0.2706 (56)
SR-GAB | 0.7016 (55) | 0.5202 (54) | 0.3233 (64) | 0.1959 (59) | 0.4081 (61) | 0.2686 (57)
SPANet | 0.5614 (65) | 0.4641 (59) | 0.2800 (68) | 0.2071 (58) | 0.3431 (69) | 0.2647 (58)
ScanRefer | 0.6859 (57) | 0.4353 (61) | 0.3488 (58) | 0.2097 (57) | 0.4244 (55) | 0.2603 (59)
_Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. ECCV 2020_ | | | | | |
scanrefer2 | 0.6340 (62) | 0.4353 (61) | 0.3193 (65) | 0.1947 (60) | 0.3898 (63) | 0.2486 (60)
TransformerRefer | 0.6010 (63) | 0.4658 (58) | 0.2540 (70) | 0.1730 (65) | 0.3318 (70) | 0.2386 (61)
ScanRefer Baseline | 0.6422 (60) | 0.4196 (63) | 0.3090 (66) | 0.1832 (61) | 0.3837 (65) | 0.2362 (62)
ScanRefer-test | 0.6999 (56) | 0.4361 (60) | 0.3274 (61) | 0.1725 (66) | 0.4109 (59) | 0.2316 (63)
ScanRefer_vanilla | 0.6488 (59) | 0.4056 (64) | 0.3052 (67) | 0.1782 (63) | 0.3823 (66) | 0.2292 (64)
pairwisemethod | 0.5779 (64) | 0.3603 (65) | 0.2792 (69) | 0.1746 (64) | 0.3462 (68) | 0.2163 (65)
bo3d | 0.5400 (66) | 0.1550 (66) | 0.3817 (43) | 0.1785 (62) | 0.4172 (58) | 0.1732 (66)
Co3d3 | 0.5326 (67) | 0.1369 (67) | 0.3848 (41) | 0.1651 (67) | 0.4179 (57) | 0.1588 (67)
Co3d2 | 0.5070 (68) | 0.1195 (70) | 0.3569 (56) | 0.1511 (68) | 0.3906 (62) | 0.1440 (68)
test_submitt | 0.4732 (70) | 0.1286 (68) | 0.3626 (52) | 0.1399 (69) | 0.3874 (64) | 0.1373 (69)
bo3d0 | 0.4823 (69) | 0.1278 (69) | 0.3271 (62) | 0.1394 (70) | 0.3619 (67) | 0.1368 (70)
3DVLP | 0.0038 (71) | 0.0019 (71) | 0.0049 (71) | 0.0023 (71) | 0.0047 (71) | 0.0022 (71)
Co3d | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72) | 0.0000 (72)