2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a semantic label for every pixel of an image.
Evaluation and metrics
Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU). IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively.
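For reference, the following is a minimal sketch of how this per-class IoU can be computed from integer label maps. The function name and the ignore-label convention are illustrative assumptions, not part of the official benchmark toolkit:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore_label=255):
    """IoU = TP / (TP + FP + FN), computed per class over annotated pixels.

    pred, gt: integer label maps of identical shape.
    ignore_label: assumed ground-truth value for unannotated pixels.
    """
    valid = gt != ignore_label            # exclude unannotated pixels
    pred, gt = pred[valid], gt[valid]
    ious = np.full(num_classes, np.nan)   # NaN for classes absent from both maps
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn > 0:
            ious[c] = tp / (tp + fp + fn)
    return ious

# The "avg IoU" column corresponds to averaging the per-class scores,
# e.g. np.nanmean(per_class_iou(pred, gt, num_classes=20)).
```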
This table lists the benchmark results for the 2D semantic label scenario. Each cell gives a method's IoU score followed, in parentheses, by its rank for that column; bracketed numbers after method names refer to the publications listed below the table.
Method | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Virtual MVFusion (R) [1] | 0.745 (1) | 0.861 (1) | 0.839 (1) | 0.881 (1) | 0.672 (2) | 0.512 (1) | 0.422 (18) | 0.898 (1) | 0.723 (1) | 0.714 (1) | 0.954 (2) | 0.454 (1) | 0.509 (1) | 0.773 (1) | 0.895 (1) | 0.756 (1) | 0.820 (1) | 0.653 (1) | 0.935 (1) | 0.891 (1) | 0.728 (1)
BPNet_2D [2] | 0.670 (2) | 0.822 (3) | 0.795 (3) | 0.836 (2) | 0.659 (3) | 0.481 (2) | 0.451 (14) | 0.769 (4) | 0.656 (3) | 0.567 (4) | 0.931 (3) | 0.395 (6) | 0.390 (5) | 0.700 (4) | 0.534 (4) | 0.689 (11) | 0.770 (2) | 0.574 (3) | 0.865 (10) | 0.831 (3) | 0.675 (5)
CU-Hybrid-2D Net | 0.636 (3) | 0.825 (2) | 0.820 (2) | 0.179 (24) | 0.648 (4) | 0.463 (3) | 0.549 (2) | 0.742 (8) | 0.676 (2) | 0.628 (2) | 0.961 (1) | 0.420 (2) | 0.379 (6) | 0.684 (8) | 0.381 (19) | 0.732 (3) | 0.723 (3) | 0.599 (2) | 0.827 (17) | 0.851 (2) | 0.634 (8)
DMMF_3d | 0.605 (6) | 0.651 (10) | 0.744 (10) | 0.782 (3) | 0.637 (5) | 0.387 (4) | 0.536 (4) | 0.732 (9) | 0.590 (7) | 0.540 (6) | 0.856 (22) | 0.359 (11) | 0.306 (16) | 0.596 (15) | 0.539 (3) | 0.627 (21) | 0.706 (4) | 0.497 (8) | 0.785 (22) | 0.757 (20) | 0.476 (23)
MVF-GNN(2D) | 0.636 (3) | 0.606 (15) | 0.794 (4) | 0.434 (16) | 0.688 (1) | 0.337 (8) | 0.464 (13) | 0.798 (3) | 0.632 (5) | 0.589 (3) | 0.908 (9) | 0.420 (2) | 0.329 (13) | 0.743 (2) | 0.594 (2) | 0.738 (2) | 0.676 (5) | 0.527 (4) | 0.906 (2) | 0.818 (6) | 0.715 (3)
EMSANet [3] | 0.600 (7) | 0.716 (4) | 0.746 (9) | 0.395 (19) | 0.614 (9) | 0.382 (5) | 0.523 (5) | 0.713 (12) | 0.571 (11) | 0.503 (10) | 0.922 (7) | 0.404 (5) | 0.397 (4) | 0.655 (9) | 0.400 (16) | 0.626 (22) | 0.663 (6) | 0.469 (13) | 0.900 (4) | 0.827 (4) | 0.577 (15)
FAN_NV_RVC | 0.586 (10) | 0.510 (22) | 0.764 (6) | 0.079 (27) | 0.620 (8) | 0.330 (11) | 0.494 (9) | 0.753 (6) | 0.573 (9) | 0.556 (5) | 0.884 (17) | 0.405 (4) | 0.303 (17) | 0.718 (3) | 0.452 (13) | 0.672 (14) | 0.658 (7) | 0.509 (5) | 0.898 (5) | 0.813 (8) | 0.727 (2)
AdapNet++ [4] | 0.503 (22) | 0.613 (13) | 0.722 (14) | 0.418 (18) | 0.358 (27) | 0.337 (8) | 0.370 (24) | 0.479 (25) | 0.443 (23) | 0.368 (25) | 0.907 (10) | 0.207 (24) | 0.213 (26) | 0.464 (25) | 0.525 (6) | 0.618 (23) | 0.657 (8) | 0.450 (17) | 0.788 (21) | 0.721 (24) | 0.408 (26)
RFBNet | 0.592 (9) | 0.616 (12) | 0.758 (7) | 0.659 (5) | 0.581 (11) | 0.330 (11) | 0.469 (12) | 0.655 (19) | 0.543 (14) | 0.524 (8) | 0.924 (4) | 0.355 (13) | 0.336 (11) | 0.572 (18) | 0.479 (10) | 0.671 (15) | 0.648 (9) | 0.480 (10) | 0.814 (20) | 0.814 (7) | 0.614 (11)
MCA-Net | 0.595 (8) | 0.533 (21) | 0.756 (8) | 0.746 (4) | 0.590 (10) | 0.334 (10) | 0.506 (8) | 0.670 (16) | 0.587 (8) | 0.500 (12) | 0.905 (11) | 0.366 (10) | 0.352 (9) | 0.601 (14) | 0.506 (8) | 0.669 (17) | 0.648 (9) | 0.501 (7) | 0.839 (16) | 0.769 (16) | 0.516 (22)
DMMF | 0.567 (15) | 0.623 (11) | 0.767 (5) | 0.238 (21) | 0.571 (14) | 0.347 (6) | 0.413 (20) | 0.719 (10) | 0.472 (21) | 0.418 (23) | 0.895 (14) | 0.357 (12) | 0.260 (23) | 0.696 (5) | 0.523 (7) | 0.666 (18) | 0.642 (11) | 0.437 (19) | 0.895 (6) | 0.793 (11) | 0.603 (13)
UNIV_CNP_RVC_UE | 0.566 (16) | 0.569 (20) | 0.686 (20) | 0.435 (15) | 0.524 (18) | 0.294 (19) | 0.421 (19) | 0.712 (13) | 0.543 (14) | 0.463 (18) | 0.872 (18) | 0.320 (18) | 0.363 (8) | 0.611 (12) | 0.477 (11) | 0.686 (12) | 0.627 (12) | 0.443 (18) | 0.862 (11) | 0.775 (15) | 0.639 (7)
MSeg1080_RVC [5] | 0.485 (24) | 0.505 (23) | 0.709 (16) | 0.092 (26) | 0.427 (24) | 0.241 (23) | 0.411 (21) | 0.654 (20) | 0.385 (27) | 0.457 (19) | 0.861 (21) | 0.053 (27) | 0.279 (19) | 0.503 (23) | 0.481 (9) | 0.645 (19) | 0.626 (13) | 0.365 (25) | 0.748 (25) | 0.725 (23) | 0.529 (21)
FuseNet [6] | 0.535 (21) | 0.570 (19) | 0.681 (21) | 0.182 (23) | 0.512 (20) | 0.290 (20) | 0.431 (17) | 0.659 (18) | 0.504 (19) | 0.495 (14) | 0.903 (13) | 0.308 (20) | 0.428 (3) | 0.523 (22) | 0.365 (20) | 0.676 (13) | 0.621 (14) | 0.470 (12) | 0.762 (23) | 0.779 (14) | 0.541 (18)
SSMA [4] | 0.577 (14) | 0.695 (6) | 0.716 (15) | 0.439 (14) | 0.563 (15) | 0.314 (14) | 0.444 (16) | 0.719 (10) | 0.551 (12) | 0.503 (10) | 0.887 (16) | 0.346 (17) | 0.348 (10) | 0.603 (13) | 0.353 (21) | 0.709 (6) | 0.600 (15) | 0.457 (15) | 0.901 (3) | 0.786 (12) | 0.599 (14)
3DMV (2d proj) [7] | 0.498 (23) | 0.481 (25) | 0.612 (24) | 0.579 (9) | 0.456 (23) | 0.343 (7) | 0.384 (22) | 0.623 (23) | 0.525 (17) | 0.381 (24) | 0.845 (23) | 0.254 (23) | 0.264 (22) | 0.557 (19) | 0.182 (25) | 0.581 (25) | 0.598 (16) | 0.429 (21) | 0.760 (24) | 0.661 (26) | 0.446 (25)
MIX6D_RVC | 0.582 (13) | 0.695 (6) | 0.687 (18) | 0.225 (22) | 0.632 (7) | 0.328 (13) | 0.550 (1) | 0.748 (7) | 0.623 (6) | 0.494 (15) | 0.890 (15) | 0.350 (16) | 0.254 (24) | 0.688 (6) | 0.454 (12) | 0.716 (4) | 0.597 (17) | 0.489 (9) | 0.881 (8) | 0.768 (17) | 0.575 (16)
DCRedNet | 0.583 (12) | 0.682 (8) | 0.723 (13) | 0.542 (11) | 0.510 (21) | 0.310 (15) | 0.451 (14) | 0.668 (17) | 0.549 (13) | 0.520 (9) | 0.920 (8) | 0.375 (7) | 0.446 (2) | 0.528 (21) | 0.417 (15) | 0.670 (16) | 0.577 (18) | 0.478 (11) | 0.862 (11) | 0.806 (10) | 0.628 (10)
SN_RN152pyrx8_RVC | 0.546 (18) | 0.572 (18) | 0.663 (22) | 0.638 (7) | 0.518 (19) | 0.298 (17) | 0.366 (25) | 0.633 (22) | 0.510 (18) | 0.446 (20) | 0.864 (20) | 0.296 (21) | 0.267 (20) | 0.542 (20) | 0.346 (22) | 0.704 (8) | 0.575 (19) | 0.431 (20) | 0.853 (14) | 0.766 (18) | 0.630 (9)
segfomer with 6d | 0.542 (20) | 0.594 (16) | 0.687 (18) | 0.146 (25) | 0.579 (12) | 0.308 (16) | 0.515 (7) | 0.703 (14) | 0.472 (21) | 0.498 (13) | 0.868 (19) | 0.369 (9) | 0.282 (18) | 0.589 (16) | 0.390 (18) | 0.701 (9) | 0.556 (20) | 0.416 (22) | 0.860 (13) | 0.759 (19) | 0.539 (20)
WSGFormer | 0.585 (11) | 0.706 (5) | 0.708 (17) | 0.434 (16) | 0.574 (13) | 0.283 (21) | 0.538 (3) | 0.759 (5) | 0.542 (16) | 0.482 (16) | 0.924 (4) | 0.351 (15) | 0.333 (12) | 0.614 (11) | 0.393 (17) | 0.692 (10) | 0.551 (21) | 0.461 (14) | 0.874 (9) | 0.809 (9) | 0.673 (6)
CMX | 0.613 (5) | 0.681 (9) | 0.725 (12) | 0.502 (12) | 0.634 (6) | 0.297 (18) | 0.478 (11) | 0.830 (2) | 0.651 (4) | 0.537 (7) | 0.924 (4) | 0.375 (7) | 0.315 (15) | 0.686 (7) | 0.451 (14) | 0.714 (5) | 0.543 (22) | 0.504 (6) | 0.894 (7) | 0.823 (5) | 0.688 (4)
ILC-PSPNet | 0.475 (25) | 0.490 (24) | 0.581 (25) | 0.289 (20) | 0.507 (22) | 0.067 (27) | 0.379 (23) | 0.610 (24) | 0.417 (25) | 0.435 (21) | 0.822 (26) | 0.278 (22) | 0.267 (20) | 0.503 (23) | 0.228 (23) | 0.616 (24) | 0.533 (23) | 0.375 (24) | 0.820 (19) | 0.729 (22) | 0.560 (17)
UDSSEG_RVC | 0.545 (19) | 0.610 (14) | 0.661 (23) | 0.588 (8) | 0.556 (16) | 0.268 (22) | 0.482 (10) | 0.642 (21) | 0.572 (10) | 0.475 (17) | 0.836 (24) | 0.312 (19) | 0.367 (7) | 0.630 (10) | 0.189 (24) | 0.639 (20) | 0.495 (24) | 0.452 (16) | 0.826 (18) | 0.756 (21) | 0.541 (18)
EMSAFormer | 0.564 (17) | 0.581 (17) | 0.736 (11) | 0.564 (10) | 0.546 (17) | 0.219 (24) | 0.517 (6) | 0.675 (15) | 0.486 (20) | 0.427 (22) | 0.904 (12) | 0.352 (14) | 0.320 (14) | 0.589 (16) | 0.528 (5) | 0.708 (7) | 0.464 (25) | 0.413 (23) | 0.847 (15) | 0.786 (12) | 0.611 (12)
ScanNet (2d proj) [8] | 0.330 (27) | 0.293 (26) | 0.521 (26) | 0.657 (6) | 0.361 (26) | 0.161 (26) | 0.250 (26) | 0.004 (27) | 0.440 (24) | 0.183 (27) | 0.836 (24) | 0.125 (26) | 0.060 (27) | 0.319 (27) | 0.132 (26) | 0.417 (26) | 0.412 (26) | 0.344 (26) | 0.541 (27) | 0.427 (27) | 0.109 (27)
Enet (reimpl) [9] | 0.376 (26) | 0.264 (27) | 0.452 (27) | 0.452 (13) | 0.365 (25) | 0.181 (25) | 0.143 (27) | 0.456 (26) | 0.409 (26) | 0.346 (26) | 0.769 (27) | 0.164 (25) | 0.218 (25) | 0.359 (26) | 0.123 (27) | 0.403 (27) | 0.381 (27) | 0.313 (27) | 0.571 (26) | 0.685 (25) | 0.472 (24)

References
[1] Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020.
[2] Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral).
[3] Daniel Seichter, Söhnke Fischedick, Mona Köhler, Horst-Michael Gross: EMSANet: Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. IJCNN 2022.
[4] Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019.
[5] John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020.
[6] Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016.
[7] Angela Dai, Matthias Nießner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV 2018.
[8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR 2017.
[9] Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.