2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a semantic label for every pixel of an image.
Evaluation and metrics
Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU): IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively.
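As a concrete illustration, below is a minimal sketch of this computation, not the official evaluation script: the function name, the `num_classes` parameter, and the use of NumPy label maps are assumptions made for this example. On the benchmark, the TP/FP/FN counts would be accumulated over the whole test set before forming the ratio.

```python
# Minimal sketch of the IoU metric defined above; illustrative only,
# not the official evaluation code. `pred` and `gt` are assumed to be
# integer label maps of identical shape, and `num_classes` covers the
# benchmark classes.
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Return IoU = TP / (TP + FP + FN) per class; NaN if a class is absent."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))  # true positive pixels for class c
        fp = np.sum((pred == c) & (gt != c))  # false positive pixels
        fn = np.sum((pred != c) & (gt == c))  # false negative pixels
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom
    return ious

# Example usage (hypothetical arrays):
# ious = per_class_iou(pred, gt, num_classes=20)
# avg_iou = np.nanmean(ious)  # mean over classes, analogous to "avg IoU" below
```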
This table lists the benchmark results for the 2D semantic label scenario. Each cell shows a method's IoU for the corresponding class (or the class average), followed by the method's rank in that column; where available, the associated publication is listed directly beneath the method's row.
Method | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Virtual MVFusion (R) | 0.745 1 | 0.861 1 | 0.839 1 | 0.881 1 | 0.672 1 | 0.512 1 | 0.422 15 | 0.898 1 | 0.723 1 | 0.714 1 | 0.954 2 | 0.454 1 | 0.509 1 | 0.773 1 | 0.895 1 | 0.756 1 | 0.820 1 | 0.653 1 | 0.935 1 | 0.891 1 | 0.728 1 | |
Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020 |
BPNet_2D | 0.670 2 | 0.822 3 | 0.795 3 | 0.836 2 | 0.659 2 | 0.481 2 | 0.451 11 | 0.769 3 | 0.656 3 | 0.567 3 | 0.931 3 | 0.395 4 | 0.390 4 | 0.700 3 | 0.534 3 | 0.689 9 | 0.770 2 | 0.574 3 | 0.865 6 | 0.831 3 | 0.675 4 |
Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia and Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral) |
CU-Hybrid-2D Net | 0.636 3 | 0.825 2 | 0.820 2 | 0.179 20 | 0.648 3 | 0.463 3 | 0.549 2 | 0.742 6 | 0.676 2 | 0.628 2 | 0.961 1 | 0.420 2 | 0.379 5 | 0.684 6 | 0.381 15 | 0.732 2 | 0.723 3 | 0.599 2 | 0.827 13 | 0.851 2 | 0.634 6 | |
CMX | 0.613 4 | 0.681 7 | 0.725 9 | 0.502 12 | 0.634 5 | 0.297 15 | 0.478 9 | 0.830 2 | 0.651 4 | 0.537 6 | 0.924 4 | 0.375 5 | 0.315 12 | 0.686 5 | 0.451 12 | 0.714 4 | 0.543 18 | 0.504 5 | 0.894 4 | 0.823 4 | 0.688 3 | |
DMMF_3d | 0.605 5 | 0.651 8 | 0.744 7 | 0.782 3 | 0.637 4 | 0.387 4 | 0.536 3 | 0.732 7 | 0.590 6 | 0.540 5 | 0.856 18 | 0.359 9 | 0.306 13 | 0.596 11 | 0.539 2 | 0.627 18 | 0.706 4 | 0.497 7 | 0.785 18 | 0.757 16 | 0.476 19 | |
MCA-Net | 0.595 6 | 0.533 17 | 0.756 6 | 0.746 4 | 0.590 8 | 0.334 7 | 0.506 6 | 0.670 12 | 0.587 7 | 0.500 10 | 0.905 8 | 0.366 8 | 0.352 8 | 0.601 10 | 0.506 6 | 0.669 15 | 0.648 7 | 0.501 6 | 0.839 12 | 0.769 12 | 0.516 18 | |
RFBNet | 0.592 7 | 0.616 9 | 0.758 5 | 0.659 5 | 0.581 9 | 0.330 8 | 0.469 10 | 0.655 15 | 0.543 12 | 0.524 7 | 0.924 4 | 0.355 10 | 0.336 10 | 0.572 14 | 0.479 8 | 0.671 13 | 0.648 7 | 0.480 9 | 0.814 16 | 0.814 5 | 0.614 9 | |
FAN_NV_RVC | 0.586 8 | 0.510 18 | 0.764 4 | 0.079 23 | 0.620 7 | 0.330 8 | 0.494 7 | 0.753 4 | 0.573 8 | 0.556 4 | 0.884 13 | 0.405 3 | 0.303 14 | 0.718 2 | 0.452 11 | 0.672 12 | 0.658 5 | 0.509 4 | 0.898 3 | 0.813 6 | 0.727 2 | |
DCRedNet | 0.583 9 | 0.682 6 | 0.723 10 | 0.542 11 | 0.510 17 | 0.310 12 | 0.451 11 | 0.668 13 | 0.549 11 | 0.520 8 | 0.920 6 | 0.375 5 | 0.446 2 | 0.528 17 | 0.417 13 | 0.670 14 | 0.577 15 | 0.478 10 | 0.862 7 | 0.806 7 | 0.628 8 | |
MIX6D_RVC | 0.582 10 | 0.695 4 | 0.687 14 | 0.225 18 | 0.632 6 | 0.328 10 | 0.550 1 | 0.748 5 | 0.623 5 | 0.494 13 | 0.890 11 | 0.350 12 | 0.254 20 | 0.688 4 | 0.454 10 | 0.716 3 | 0.597 14 | 0.489 8 | 0.881 5 | 0.768 13 | 0.575 12 | |
SSMA | 0.577 11 | 0.695 4 | 0.716 12 | 0.439 14 | 0.563 11 | 0.314 11 | 0.444 13 | 0.719 8 | 0.551 10 | 0.503 9 | 0.887 12 | 0.346 13 | 0.348 9 | 0.603 9 | 0.353 17 | 0.709 5 | 0.600 12 | 0.457 12 | 0.901 2 | 0.786 8 | 0.599 11 |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019 |
UNIV_CNP_RVC_UE | 0.566 12 | 0.569 16 | 0.686 16 | 0.435 15 | 0.524 14 | 0.294 16 | 0.421 16 | 0.712 9 | 0.543 12 | 0.463 15 | 0.872 14 | 0.320 14 | 0.363 7 | 0.611 8 | 0.477 9 | 0.686 10 | 0.627 9 | 0.443 15 | 0.862 7 | 0.775 11 | 0.639 5 | |
EMSAFormer | 0.564 13 | 0.581 13 | 0.736 8 | 0.564 10 | 0.546 13 | 0.219 20 | 0.517 4 | 0.675 11 | 0.486 17 | 0.427 19 | 0.904 9 | 0.352 11 | 0.320 11 | 0.589 12 | 0.528 4 | 0.708 6 | 0.464 21 | 0.413 19 | 0.847 11 | 0.786 8 | 0.611 10 | |
SN_RN152pyrx8_RVC | 0.546 14 | 0.572 14 | 0.663 18 | 0.638 7 | 0.518 15 | 0.298 14 | 0.366 21 | 0.633 18 | 0.510 15 | 0.446 17 | 0.864 16 | 0.296 17 | 0.267 17 | 0.542 16 | 0.346 18 | 0.704 7 | 0.575 16 | 0.431 16 | 0.853 10 | 0.766 14 | 0.630 7 |
UDSSEG_RVC | 0.545 15 | 0.610 11 | 0.661 19 | 0.588 8 | 0.556 12 | 0.268 18 | 0.482 8 | 0.642 17 | 0.572 9 | 0.475 14 | 0.836 20 | 0.312 15 | 0.367 6 | 0.630 7 | 0.189 20 | 0.639 17 | 0.495 20 | 0.452 13 | 0.826 14 | 0.756 17 | 0.541 14 | |
segfomer with 6d | 0.542 16 | 0.594 12 | 0.687 14 | 0.146 21 | 0.579 10 | 0.308 13 | 0.515 5 | 0.703 10 | 0.472 18 | 0.498 11 | 0.868 15 | 0.369 7 | 0.282 15 | 0.589 12 | 0.390 14 | 0.701 8 | 0.556 17 | 0.416 18 | 0.860 9 | 0.759 15 | 0.539 16 | |
FuseNet | 0.535 17 | 0.570 15 | 0.681 17 | 0.182 19 | 0.512 16 | 0.290 17 | 0.431 14 | 0.659 14 | 0.504 16 | 0.495 12 | 0.903 10 | 0.308 16 | 0.428 3 | 0.523 18 | 0.365 16 | 0.676 11 | 0.621 11 | 0.470 11 | 0.762 19 | 0.779 10 | 0.541 14 |
Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016 |
AdapNet++ | 0.503 18 | 0.613 10 | 0.722 11 | 0.418 16 | 0.358 23 | 0.337 6 | 0.370 20 | 0.479 21 | 0.443 19 | 0.368 21 | 0.907 7 | 0.207 20 | 0.213 22 | 0.464 21 | 0.525 5 | 0.618 19 | 0.657 6 | 0.450 14 | 0.788 17 | 0.721 20 | 0.408 22 |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019 |
3DMV (2d proj) | 0.498 19 | 0.481 21 | 0.612 20 | 0.579 9 | 0.456 19 | 0.343 5 | 0.384 18 | 0.623 19 | 0.525 14 | 0.381 20 | 0.845 19 | 0.254 19 | 0.264 19 | 0.557 15 | 0.182 21 | 0.581 21 | 0.598 13 | 0.429 17 | 0.760 20 | 0.661 22 | 0.446 21 | |
Angela Dai, Matthias Niessner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV'18 |
MSeg1080_RVC | 0.485 20 | 0.505 19 | 0.709 13 | 0.092 22 | 0.427 20 | 0.241 19 | 0.411 17 | 0.654 16 | 0.385 23 | 0.457 16 | 0.861 17 | 0.053 23 | 0.279 16 | 0.503 19 | 0.481 7 | 0.645 16 | 0.626 10 | 0.365 21 | 0.748 21 | 0.725 19 | 0.529 17 |
John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020 |
ILC-PSPNet | 0.475 21 | 0.490 20 | 0.581 21 | 0.289 17 | 0.507 18 | 0.067 23 | 0.379 19 | 0.610 20 | 0.417 21 | 0.435 18 | 0.822 22 | 0.278 18 | 0.267 17 | 0.503 19 | 0.228 19 | 0.616 20 | 0.533 19 | 0.375 20 | 0.820 15 | 0.729 18 | 0.560 13 | |
Enet (reimpl) | 0.376 22 | 0.264 23 | 0.452 23 | 0.452 13 | 0.365 21 | 0.181 21 | 0.143 23 | 0.456 22 | 0.409 22 | 0.346 22 | 0.769 23 | 0.164 21 | 0.218 21 | 0.359 22 | 0.123 23 | 0.403 23 | 0.381 23 | 0.313 23 | 0.571 22 | 0.685 21 | 0.472 20 | |
Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. |
ScanNet (2d proj) | 0.330 23 | 0.293 22 | 0.521 22 | 0.657 6 | 0.361 22 | 0.161 22 | 0.250 22 | 0.004 23 | 0.440 20 | 0.183 23 | 0.836 20 | 0.125 22 | 0.060 23 | 0.319 23 | 0.132 22 | 0.417 22 | 0.412 22 | 0.344 22 | 0.541 23 | 0.427 23 | 0.109 23 |
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR'17 |
DMMF | 0.003 24 | 0.000 24 | 0.005 24 | 0.000 24 | 0.000 24 | 0.037 24 | 0.001 24 | 0.000 24 | 0.001 24 | 0.005 24 | 0.003 24 | 0.000 24 | 0.000 24 | 0.000 24 | 0.000 24 | 0.000 24 | 0.002 24 | 0.001 24 | 0.000 24 | 0.006 24 | 0.000 24 | |