2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a per-pixel semantic labeling of an image.
Evaluation and metrics
Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU): IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative pixels, respectively.
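The per-class IoU above can be sketched as follows; this is a minimal illustrative implementation (not the benchmark's official evaluation script), operating on flat lists of per-pixel integer labels:

```python
# Per-class IoU = TP / (TP + FP + FN), computed from per-pixel labels.
# gt and pred are equal-length sequences of integer class IDs.

def iou_per_class(gt, pred, num_classes):
    """Return a list with one IoU value per class (NaN if the class
    appears in neither the ground truth nor the prediction)."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gt, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gt, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gt, pred) if g == c and p != c)
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

def mean_iou(ious):
    """Average IoU over classes, skipping NaN (absent) classes."""
    valid = [v for v in ious if v == v]
    return sum(valid) / len(valid)
```

For example, with ground truth `[0, 0, 1, 1]` and prediction `[0, 1, 1, 1]`, class 0 gets IoU 1/2 (one true positive, one false negative) and class 1 gets 2/3, so the mean IoU is 7/12.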
This table lists the benchmark results for the 2D semantic label scenario. Each cell shows a method's IoU for that class followed by its rank on that class; methods are listed with their publication, where available.
Method | Info | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Virtual MVFusion (R) | 0.745 1 | 0.861 1 | 0.839 1 | 0.881 1 | 0.672 2 | 0.512 1 | 0.422 17 | 0.898 1 | 0.723 1 | 0.714 1 | 0.954 2 | 0.454 1 | 0.509 1 | 0.773 1 | 0.895 1 | 0.756 1 | 0.820 1 | 0.653 1 | 0.935 1 | 0.891 1 | 0.728 1 | |
Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020 | ||||||||||||||||||||||
MVF-GNN(2D) | 0.636 3 | 0.606 13 | 0.794 4 | 0.434 16 | 0.688 1 | 0.337 7 | 0.464 12 | 0.798 3 | 0.632 5 | 0.589 3 | 0.908 8 | 0.420 2 | 0.329 12 | 0.743 2 | 0.594 2 | 0.738 2 | 0.676 5 | 0.527 4 | 0.906 2 | 0.818 6 | 0.715 3 | |
DMMF_3d | 0.605 6 | 0.651 9 | 0.744 9 | 0.782 3 | 0.637 5 | 0.387 4 | 0.536 3 | 0.732 8 | 0.590 7 | 0.540 6 | 0.856 20 | 0.359 11 | 0.306 15 | 0.596 13 | 0.539 3 | 0.627 19 | 0.706 4 | 0.497 8 | 0.785 20 | 0.757 18 | 0.476 21 | |
BPNet_2D | 0.670 2 | 0.822 3 | 0.795 3 | 0.836 2 | 0.659 3 | 0.481 2 | 0.451 13 | 0.769 4 | 0.656 3 | 0.567 4 | 0.931 3 | 0.395 6 | 0.390 5 | 0.700 4 | 0.534 4 | 0.689 10 | 0.770 2 | 0.574 3 | 0.865 8 | 0.831 3 | 0.675 5 | |
Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia and Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral) | ||||||||||||||||||||||
EMSAFormer | 0.564 15 | 0.581 15 | 0.736 10 | 0.564 10 | 0.546 15 | 0.219 22 | 0.517 5 | 0.675 13 | 0.486 19 | 0.427 21 | 0.904 11 | 0.352 13 | 0.320 13 | 0.589 14 | 0.528 5 | 0.708 7 | 0.464 23 | 0.413 21 | 0.847 13 | 0.786 10 | 0.611 11 | |
AdapNet++ | 0.503 20 | 0.613 11 | 0.722 13 | 0.418 17 | 0.358 25 | 0.337 7 | 0.370 22 | 0.479 23 | 0.443 21 | 0.368 23 | 0.907 9 | 0.207 22 | 0.213 24 | 0.464 23 | 0.525 6 | 0.618 21 | 0.657 8 | 0.450 16 | 0.788 19 | 0.721 22 | 0.408 24 | |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019 | ||||||||||||||||||||||
MCA-Net | 0.595 8 | 0.533 19 | 0.756 7 | 0.746 4 | 0.590 10 | 0.334 9 | 0.506 7 | 0.670 14 | 0.587 8 | 0.500 12 | 0.905 10 | 0.366 10 | 0.352 9 | 0.601 12 | 0.506 7 | 0.669 16 | 0.648 9 | 0.501 7 | 0.839 14 | 0.769 14 | 0.516 20 | |
MSeg1080_RVC | 0.485 22 | 0.505 21 | 0.709 15 | 0.092 24 | 0.427 22 | 0.241 21 | 0.411 19 | 0.654 18 | 0.385 25 | 0.457 18 | 0.861 19 | 0.053 25 | 0.279 18 | 0.503 21 | 0.481 8 | 0.645 17 | 0.626 12 | 0.365 23 | 0.748 23 | 0.725 21 | 0.529 19 | |
John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020 | ||||||||||||||||||||||
RFBNet | 0.592 9 | 0.616 10 | 0.758 6 | 0.659 5 | 0.581 11 | 0.330 10 | 0.469 11 | 0.655 17 | 0.543 14 | 0.524 8 | 0.924 4 | 0.355 12 | 0.336 11 | 0.572 16 | 0.479 9 | 0.671 14 | 0.648 9 | 0.480 10 | 0.814 18 | 0.814 7 | 0.614 10 | |
UNIV_CNP_RVC_UE | 0.566 14 | 0.569 18 | 0.686 18 | 0.435 15 | 0.524 16 | 0.294 18 | 0.421 18 | 0.712 11 | 0.543 14 | 0.463 17 | 0.872 16 | 0.320 16 | 0.363 8 | 0.611 10 | 0.477 10 | 0.686 11 | 0.627 11 | 0.443 17 | 0.862 9 | 0.775 13 | 0.639 6 | |
MIX6D_RVC | 0.582 12 | 0.695 5 | 0.687 16 | 0.225 20 | 0.632 7 | 0.328 12 | 0.550 1 | 0.748 6 | 0.623 6 | 0.494 15 | 0.890 13 | 0.350 14 | 0.254 22 | 0.688 5 | 0.454 11 | 0.716 4 | 0.597 16 | 0.489 9 | 0.881 7 | 0.768 15 | 0.575 14 | |
FAN_NV_RVC | 0.586 10 | 0.510 20 | 0.764 5 | 0.079 25 | 0.620 8 | 0.330 10 | 0.494 8 | 0.753 5 | 0.573 9 | 0.556 5 | 0.884 15 | 0.405 4 | 0.303 16 | 0.718 3 | 0.452 12 | 0.672 13 | 0.658 7 | 0.509 5 | 0.898 5 | 0.813 8 | 0.727 2 | |
CMX | 0.613 5 | 0.681 8 | 0.725 11 | 0.502 12 | 0.634 6 | 0.297 17 | 0.478 10 | 0.830 2 | 0.651 4 | 0.537 7 | 0.924 4 | 0.375 7 | 0.315 14 | 0.686 6 | 0.451 13 | 0.714 5 | 0.543 20 | 0.504 6 | 0.894 6 | 0.823 5 | 0.688 4 | |
DCRedNet | 0.583 11 | 0.682 7 | 0.723 12 | 0.542 11 | 0.510 19 | 0.310 14 | 0.451 13 | 0.668 15 | 0.549 13 | 0.520 9 | 0.920 7 | 0.375 7 | 0.446 2 | 0.528 19 | 0.417 14 | 0.670 15 | 0.577 17 | 0.478 11 | 0.862 9 | 0.806 9 | 0.628 9 | |
EMSANet | 0.600 7 | 0.716 4 | 0.746 8 | 0.395 18 | 0.614 9 | 0.382 5 | 0.523 4 | 0.713 10 | 0.571 11 | 0.503 10 | 0.922 6 | 0.404 5 | 0.397 4 | 0.655 8 | 0.400 15 | 0.626 20 | 0.663 6 | 0.469 13 | 0.900 4 | 0.827 4 | 0.577 13 | |
Seichter, Daniel and Fischedick, Söhnke and Köhler, Mona and Gross, Horst-Michael: EMSANet: Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. IJCNN 2022 | ||||||||||||||||||||||
segfomer with 6d | 0.542 18 | 0.594 14 | 0.687 16 | 0.146 23 | 0.579 12 | 0.308 15 | 0.515 6 | 0.703 12 | 0.472 20 | 0.498 13 | 0.868 17 | 0.369 9 | 0.282 17 | 0.589 14 | 0.390 16 | 0.701 9 | 0.556 19 | 0.416 20 | 0.860 11 | 0.759 17 | 0.539 18 | |
CU-Hybrid-2D Net | 0.636 3 | 0.825 2 | 0.820 2 | 0.179 22 | 0.648 4 | 0.463 3 | 0.549 2 | 0.742 7 | 0.676 2 | 0.628 2 | 0.961 1 | 0.420 2 | 0.379 6 | 0.684 7 | 0.381 17 | 0.732 3 | 0.723 3 | 0.599 2 | 0.827 15 | 0.851 2 | 0.634 7 | |
FuseNet | 0.535 19 | 0.570 17 | 0.681 19 | 0.182 21 | 0.512 18 | 0.290 19 | 0.431 16 | 0.659 16 | 0.504 18 | 0.495 14 | 0.903 12 | 0.308 18 | 0.428 3 | 0.523 20 | 0.365 18 | 0.676 12 | 0.621 13 | 0.470 12 | 0.762 21 | 0.779 12 | 0.541 16 | |
Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016 | ||||||||||||||||||||||
SSMA | 0.577 13 | 0.695 5 | 0.716 14 | 0.439 14 | 0.563 13 | 0.314 13 | 0.444 15 | 0.719 9 | 0.551 12 | 0.503 10 | 0.887 14 | 0.346 15 | 0.348 10 | 0.603 11 | 0.353 19 | 0.709 6 | 0.600 14 | 0.457 14 | 0.901 3 | 0.786 10 | 0.599 12 | |
Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019 | ||||||||||||||||||||||
SN_RN152pyrx8_RVC | 0.546 16 | 0.572 16 | 0.663 20 | 0.638 7 | 0.518 17 | 0.298 16 | 0.366 23 | 0.633 20 | 0.510 17 | 0.446 19 | 0.864 18 | 0.296 19 | 0.267 19 | 0.542 18 | 0.346 20 | 0.704 8 | 0.575 18 | 0.431 18 | 0.853 12 | 0.766 16 | 0.630 8 | |
ILC-PSPNet | 0.475 23 | 0.490 22 | 0.581 23 | 0.289 19 | 0.507 20 | 0.067 25 | 0.379 21 | 0.610 22 | 0.417 23 | 0.435 20 | 0.822 24 | 0.278 20 | 0.267 19 | 0.503 21 | 0.228 21 | 0.616 22 | 0.533 21 | 0.375 22 | 0.820 17 | 0.729 20 | 0.560 15 | |
UDSSEG_RVC | 0.545 17 | 0.610 12 | 0.661 21 | 0.588 8 | 0.556 14 | 0.268 20 | 0.482 9 | 0.642 19 | 0.572 10 | 0.475 16 | 0.836 22 | 0.312 17 | 0.367 7 | 0.630 9 | 0.189 22 | 0.639 18 | 0.495 22 | 0.452 15 | 0.826 16 | 0.756 19 | 0.541 16 | |
3DMV (2d proj) | 0.498 21 | 0.481 23 | 0.612 22 | 0.579 9 | 0.456 21 | 0.343 6 | 0.384 20 | 0.623 21 | 0.525 16 | 0.381 22 | 0.845 21 | 0.254 21 | 0.264 21 | 0.557 17 | 0.182 23 | 0.581 23 | 0.598 15 | 0.429 19 | 0.760 22 | 0.661 24 | 0.446 23 | |
Angela Dai, Matthias Niessner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV'18 | ||||||||||||||||||||||
ScanNet (2d proj) | 0.330 25 | 0.293 24 | 0.521 24 | 0.657 6 | 0.361 24 | 0.161 24 | 0.250 24 | 0.004 25 | 0.440 22 | 0.183 25 | 0.836 22 | 0.125 24 | 0.060 25 | 0.319 25 | 0.132 24 | 0.417 24 | 0.412 24 | 0.344 24 | 0.541 25 | 0.427 25 | 0.109 25 | |
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR'17 | ||||||||||||||||||||||
Enet (reimpl) | 0.376 24 | 0.264 25 | 0.452 25 | 0.452 13 | 0.365 23 | 0.181 23 | 0.143 25 | 0.456 24 | 0.409 24 | 0.346 24 | 0.769 25 | 0.164 23 | 0.218 23 | 0.359 24 | 0.123 25 | 0.403 25 | 0.381 25 | 0.313 25 | 0.571 24 | 0.685 23 | 0.472 22 | |
Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. | ||||||||||||||||||||||
DMMF | 0.003 26 | 0.000 26 | 0.005 26 | 0.000 26 | 0.000 26 | 0.037 26 | 0.001 26 | 0.000 26 | 0.001 26 | 0.005 26 | 0.003 26 | 0.000 26 | 0.000 26 | 0.000 26 | 0.000 26 | 0.000 26 | 0.002 26 | 0.001 26 | 0.000 26 | 0.006 26 | 0.000 26 | |