2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a semantic class label for every pixel of an image.
Evaluation and metrics
Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU): IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative pixels, respectively.
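As a concrete illustration of the metric, per-class IoU can be computed from a confusion matrix over the prediction and ground-truth label maps. The sketch below is illustrative only; the helper name and toy labels are not part of any benchmark toolkit.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """PASCAL-VOC-style IoU per class from flat integer label arrays.

    IoU_c = TP_c / (TP_c + FP_c + FN_c), derived from a confusion matrix
    whose rows are ground-truth classes and columns are predicted classes.
    """
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                 # correctly labeled pixels per class
    fp = cm.sum(axis=0) - tp         # predicted as class c, but wrong
    fn = cm.sum(axis=1) - tp         # class c in ground truth, but missed
    denom = tp + fp + fn
    # Classes absent from both prediction and ground truth get NaN.
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

# Toy example: 2 classes over 6 pixels.
gt   = np.array([0, 0, 1, 1, 1, 0])
pred = np.array([0, 1, 1, 1, 0, 0])
iou = per_class_iou(pred, gt, num_classes=2)  # both classes: 2/(2+1+1) = 0.5
```

The benchmark's "avg IoU" column then corresponds to the mean of the per-class values over the 20 evaluated classes.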
This table lists the benchmark results for the 2D semantic label scenario. Each cell shows a method's IoU score followed by its rank for that column; publication details, where available, appear on the line below the method's results.

Method | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Virtual MVFusion (R) | 0.745 1 | 0.861 1 | 0.839 1 | 0.881 1 | 0.672 2 | 0.512 1 | 0.422 17 | 0.898 1 | 0.723 1 | 0.714 1 | 0.954 2 | 0.454 1 | 0.509 1 | 0.773 1 | 0.895 1 | 0.756 1 | 0.820 1 | 0.653 1 | 0.935 1 | 0.891 1 | 0.728 1 | |
Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020
CU-Hybrid-2D Net | 0.636 3 | 0.825 2 | 0.820 2 | 0.179 23 | 0.648 4 | 0.463 3 | 0.549 2 | 0.742 7 | 0.676 2 | 0.628 2 | 0.961 1 | 0.420 2 | 0.379 6 | 0.684 8 | 0.381 18 | 0.732 3 | 0.723 3 | 0.599 2 | 0.827 16 | 0.851 2 | 0.634 7 | |
BPNet_2D | 0.670 2 | 0.822 3 | 0.795 3 | 0.836 2 | 0.659 3 | 0.481 2 | 0.451 13 | 0.769 4 | 0.656 3 | 0.567 4 | 0.931 3 | 0.395 6 | 0.390 5 | 0.700 4 | 0.534 4 | 0.689 10 | 0.770 2 | 0.574 3 | 0.865 9 | 0.831 3 | 0.675 5 | |
Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia and Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral)
EMSANet | 0.600 7 | 0.716 4 | 0.746 9 | 0.395 18 | 0.614 9 | 0.382 5 | 0.523 4 | 0.713 11 | 0.571 11 | 0.503 10 | 0.922 6 | 0.404 5 | 0.397 4 | 0.655 9 | 0.400 16 | 0.626 21 | 0.663 6 | 0.469 13 | 0.900 4 | 0.827 4 | 0.577 14 | |
Daniel Seichter, Söhnke Fischedick, Mona Köhler, Horst-Michael Gross: EMSANet: Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. IJCNN 2022
SSMA | 0.577 13 | 0.695 5 | 0.716 15 | 0.439 14 | 0.563 14 | 0.314 14 | 0.444 15 | 0.719 9 | 0.551 12 | 0.503 10 | 0.887 15 | 0.346 16 | 0.348 10 | 0.603 12 | 0.353 20 | 0.709 6 | 0.600 15 | 0.457 14 | 0.901 3 | 0.786 11 | 0.599 13 | |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019
MIX6D_RVC | 0.582 12 | 0.695 5 | 0.687 17 | 0.225 21 | 0.632 7 | 0.328 13 | 0.550 1 | 0.748 6 | 0.623 6 | 0.494 15 | 0.890 14 | 0.350 15 | 0.254 23 | 0.688 6 | 0.454 12 | 0.716 4 | 0.597 17 | 0.489 9 | 0.881 8 | 0.768 16 | 0.575 15 | |
DCRedNet | 0.583 11 | 0.682 7 | 0.723 13 | 0.542 11 | 0.510 20 | 0.310 15 | 0.451 13 | 0.668 16 | 0.549 13 | 0.520 9 | 0.920 7 | 0.375 7 | 0.446 2 | 0.528 20 | 0.417 15 | 0.670 15 | 0.577 18 | 0.478 11 | 0.862 10 | 0.806 9 | 0.628 9 | |
CMX | 0.613 5 | 0.681 8 | 0.725 12 | 0.502 12 | 0.634 6 | 0.297 18 | 0.478 10 | 0.830 2 | 0.651 4 | 0.537 7 | 0.924 4 | 0.375 7 | 0.315 14 | 0.686 7 | 0.451 14 | 0.714 5 | 0.543 21 | 0.504 6 | 0.894 7 | 0.823 5 | 0.688 4 | |
DMMF_3d | 0.605 6 | 0.651 9 | 0.744 10 | 0.782 3 | 0.637 5 | 0.387 4 | 0.536 3 | 0.732 8 | 0.590 7 | 0.540 6 | 0.856 21 | 0.359 11 | 0.306 15 | 0.596 14 | 0.539 3 | 0.627 20 | 0.706 4 | 0.497 8 | 0.785 21 | 0.757 19 | 0.476 22 | |
DMMF | 0.567 14 | 0.623 10 | 0.767 5 | 0.238 20 | 0.571 13 | 0.347 6 | 0.413 19 | 0.719 9 | 0.472 20 | 0.418 22 | 0.895 13 | 0.357 12 | 0.260 22 | 0.696 5 | 0.523 7 | 0.666 17 | 0.642 11 | 0.437 18 | 0.895 6 | 0.793 10 | 0.603 12 | |
RFBNet | 0.592 9 | 0.616 11 | 0.758 7 | 0.659 5 | 0.581 11 | 0.330 11 | 0.469 11 | 0.655 18 | 0.543 14 | 0.524 8 | 0.924 4 | 0.355 13 | 0.336 11 | 0.572 17 | 0.479 10 | 0.671 14 | 0.648 9 | 0.480 10 | 0.814 19 | 0.814 7 | 0.614 10 | |
AdapNet++ | 0.503 21 | 0.613 12 | 0.722 14 | 0.418 17 | 0.358 26 | 0.337 8 | 0.370 23 | 0.479 24 | 0.443 22 | 0.368 24 | 0.907 9 | 0.207 23 | 0.213 25 | 0.464 24 | 0.525 6 | 0.618 22 | 0.657 8 | 0.450 16 | 0.788 20 | 0.721 23 | 0.408 25 | |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019
UDSSEG_RVC | 0.545 18 | 0.610 13 | 0.661 22 | 0.588 8 | 0.556 15 | 0.268 21 | 0.482 9 | 0.642 20 | 0.572 10 | 0.475 16 | 0.836 23 | 0.312 18 | 0.367 7 | 0.630 10 | 0.189 23 | 0.639 19 | 0.495 23 | 0.452 15 | 0.826 17 | 0.756 20 | 0.541 17 | |
MVF-GNN(2D) | 0.636 3 | 0.606 14 | 0.794 4 | 0.434 16 | 0.688 1 | 0.337 8 | 0.464 12 | 0.798 3 | 0.632 5 | 0.589 3 | 0.908 8 | 0.420 2 | 0.329 12 | 0.743 2 | 0.594 2 | 0.738 2 | 0.676 5 | 0.527 4 | 0.906 2 | 0.818 6 | 0.715 3 | |
segfomer with 6d | 0.542 19 | 0.594 15 | 0.687 17 | 0.146 24 | 0.579 12 | 0.308 16 | 0.515 6 | 0.703 13 | 0.472 20 | 0.498 13 | 0.868 18 | 0.369 9 | 0.282 17 | 0.589 15 | 0.390 17 | 0.701 9 | 0.556 20 | 0.416 21 | 0.860 12 | 0.759 18 | 0.539 19 | |
EMSAFormer | 0.564 16 | 0.581 16 | 0.736 11 | 0.564 10 | 0.546 16 | 0.219 23 | 0.517 5 | 0.675 14 | 0.486 19 | 0.427 21 | 0.904 11 | 0.352 14 | 0.320 13 | 0.589 15 | 0.528 5 | 0.708 7 | 0.464 24 | 0.413 22 | 0.847 14 | 0.786 11 | 0.611 11 | |
SN_RN152pyrx8_RVC | 0.546 17 | 0.572 17 | 0.663 21 | 0.638 7 | 0.518 18 | 0.298 17 | 0.366 24 | 0.633 21 | 0.510 17 | 0.446 19 | 0.864 19 | 0.296 20 | 0.267 19 | 0.542 19 | 0.346 21 | 0.704 8 | 0.575 19 | 0.431 19 | 0.853 13 | 0.766 17 | 0.630 8 | |
FuseNet | 0.535 20 | 0.570 18 | 0.681 20 | 0.182 22 | 0.512 19 | 0.290 20 | 0.431 16 | 0.659 17 | 0.504 18 | 0.495 14 | 0.903 12 | 0.308 19 | 0.428 3 | 0.523 21 | 0.365 19 | 0.676 12 | 0.621 14 | 0.470 12 | 0.762 22 | 0.779 13 | 0.541 17 | |
Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016
UNIV_CNP_RVC_UE | 0.566 15 | 0.569 19 | 0.686 19 | 0.435 15 | 0.524 17 | 0.294 19 | 0.421 18 | 0.712 12 | 0.543 14 | 0.463 17 | 0.872 17 | 0.320 17 | 0.363 8 | 0.611 11 | 0.477 11 | 0.686 11 | 0.627 12 | 0.443 17 | 0.862 10 | 0.775 14 | 0.639 6 | |
MCA-Net | 0.595 8 | 0.533 20 | 0.756 8 | 0.746 4 | 0.590 10 | 0.334 10 | 0.506 7 | 0.670 15 | 0.587 8 | 0.500 12 | 0.905 10 | 0.366 10 | 0.352 9 | 0.601 13 | 0.506 8 | 0.669 16 | 0.648 9 | 0.501 7 | 0.839 15 | 0.769 15 | 0.516 21 | |
FAN_NV_RVC | 0.586 10 | 0.510 21 | 0.764 6 | 0.079 26 | 0.620 8 | 0.330 11 | 0.494 8 | 0.753 5 | 0.573 9 | 0.556 5 | 0.884 16 | 0.405 4 | 0.303 16 | 0.718 3 | 0.452 13 | 0.672 13 | 0.658 7 | 0.509 5 | 0.898 5 | 0.813 8 | 0.727 2 | |
MSeg1080_RVC | 0.485 23 | 0.505 22 | 0.709 16 | 0.092 25 | 0.427 23 | 0.241 22 | 0.411 20 | 0.654 19 | 0.385 26 | 0.457 18 | 0.861 20 | 0.053 26 | 0.279 18 | 0.503 22 | 0.481 9 | 0.645 18 | 0.626 13 | 0.365 24 | 0.748 24 | 0.725 22 | 0.529 20 | |
John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020
ILC-PSPNet | 0.475 24 | 0.490 23 | 0.581 24 | 0.289 19 | 0.507 21 | 0.067 26 | 0.379 22 | 0.610 23 | 0.417 24 | 0.435 20 | 0.822 25 | 0.278 21 | 0.267 19 | 0.503 22 | 0.228 22 | 0.616 23 | 0.533 22 | 0.375 23 | 0.820 18 | 0.729 21 | 0.560 16 | |
3DMV (2d proj) | 0.498 22 | 0.481 24 | 0.612 23 | 0.579 9 | 0.456 22 | 0.343 7 | 0.384 21 | 0.623 22 | 0.525 16 | 0.381 23 | 0.845 22 | 0.254 22 | 0.264 21 | 0.557 18 | 0.182 24 | 0.581 24 | 0.598 16 | 0.429 20 | 0.760 23 | 0.661 25 | 0.446 24 | |
Angela Dai, Matthias Niessner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV 2018
ScanNet (2d proj) | 0.330 26 | 0.293 25 | 0.521 25 | 0.657 6 | 0.361 25 | 0.161 25 | 0.250 25 | 0.004 26 | 0.440 23 | 0.183 26 | 0.836 23 | 0.125 25 | 0.060 26 | 0.319 26 | 0.132 25 | 0.417 25 | 0.412 25 | 0.344 25 | 0.541 26 | 0.427 26 | 0.109 26 | |
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR 2017
Enet (reimpl) | 0.376 25 | 0.264 26 | 0.452 26 | 0.452 13 | 0.365 24 | 0.181 24 | 0.143 26 | 0.456 25 | 0.409 25 | 0.346 25 | 0.769 26 | 0.164 24 | 0.218 24 | 0.359 25 | 0.123 26 | 0.403 26 | 0.381 26 | 0.313 26 | 0.571 25 | 0.685 24 | 0.472 23 | |
Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. |