2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a semantic class label for every pixel of an image.
Evaluation and metrics

Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU): IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively.
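The per-class counts above translate directly into code. A minimal pure-Python sketch, assuming flat integer label sequences for prediction and ground truth (`iou_per_class` and `mean_iou` are hypothetical helpers for illustration, not part of the benchmark toolkit):

```python
def iou_per_class(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) over flat label sequences."""
    ious = []
    for c in range(num_classes):
        tp = fp = fn = 0
        for p, g in zip(pred, gt):
            if p == c and g == c:
                tp += 1          # predicted c, truly c
            elif p == c:
                fp += 1          # predicted c, truly something else
            elif g == c:
                fn += 1          # missed a true c pixel
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(pred, gt, num_classes):
    """Average of per-class IoUs, skipping classes absent from both pred and gt."""
    vals = [v for v in iou_per_class(pred, gt, num_classes) if v == v]  # drop NaNs
    return sum(vals) / len(vals)
```

On the benchmark itself, TP/FP/FN are accumulated per pixel over the entire hidden test set before averaging across the 20 classes, rather than per image.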
This table lists the benchmark results for the 2D semantic label scenario. Each cell shows a method's IoU for that class, followed by its rank for that class among all submissions; the citation row beneath a method gives its reference, where available.
| Method | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Virtual MVFusion (R) | 0.745 1 | 0.861 1 | 0.839 1 | 0.881 1 | 0.672 2 | 0.512 1 | 0.422 19 | 0.898 1 | 0.723 1 | 0.714 1 | 0.954 2 | 0.454 1 | 0.509 1 | 0.773 1 | 0.895 1 | 0.756 1 | 0.820 1 | 0.653 1 | 0.935 1 | 0.891 1 | 0.728 1 | |
| *Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020* |
| CU-Hybrid-2D Net | 0.636 3 | 0.825 2 | 0.820 2 | 0.179 25 | 0.648 4 | 0.463 3 | 0.549 2 | 0.742 9 | 0.676 2 | 0.628 2 | 0.961 1 | 0.420 2 | 0.379 7 | 0.684 8 | 0.381 20 | 0.732 3 | 0.723 3 | 0.599 2 | 0.827 18 | 0.851 2 | 0.634 9 | |
| BPNet_2D | 0.670 2 | 0.822 3 | 0.795 3 | 0.836 2 | 0.659 3 | 0.481 2 | 0.451 15 | 0.769 5 | 0.656 3 | 0.567 4 | 0.931 3 | 0.395 6 | 0.390 6 | 0.700 4 | 0.534 4 | 0.689 11 | 0.770 2 | 0.574 3 | 0.865 11 | 0.831 3 | 0.675 6 | |
| *Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral)* |
| CMX | 0.613 6 | 0.681 9 | 0.725 13 | 0.502 13 | 0.634 6 | 0.297 19 | 0.478 12 | 0.830 2 | 0.651 4 | 0.537 7 | 0.924 4 | 0.375 7 | 0.315 16 | 0.686 7 | 0.451 15 | 0.714 5 | 0.543 23 | 0.504 6 | 0.894 7 | 0.823 5 | 0.688 5 | |
| MVF-GNN(2D) | 0.636 3 | 0.606 16 | 0.794 4 | 0.434 17 | 0.688 1 | 0.337 8 | 0.464 14 | 0.798 4 | 0.632 5 | 0.589 3 | 0.908 9 | 0.420 2 | 0.329 14 | 0.743 2 | 0.594 2 | 0.738 2 | 0.676 5 | 0.527 4 | 0.906 2 | 0.818 6 | 0.715 3 | |
| MIX6D_RVC | 0.582 14 | 0.695 6 | 0.687 19 | 0.225 23 | 0.632 7 | 0.328 13 | 0.550 1 | 0.748 8 | 0.623 6 | 0.494 16 | 0.890 16 | 0.350 17 | 0.254 25 | 0.688 6 | 0.454 13 | 0.716 4 | 0.597 18 | 0.489 10 | 0.881 8 | 0.768 18 | 0.575 17 | |
| DVEFormer | 0.626 5 | 0.616 12 | 0.764 6 | 0.690 5 | 0.583 11 | 0.322 14 | 0.540 3 | 0.809 3 | 0.593 7 | 0.502 12 | 0.900 14 | 0.374 9 | 0.433 3 | 0.660 9 | 0.528 5 | 0.665 19 | 0.663 6 | 0.491 9 | 0.871 10 | 0.810 9 | 0.705 4 | |
| DMMF_3d | 0.605 7 | 0.651 10 | 0.744 11 | 0.782 3 | 0.637 5 | 0.387 4 | 0.536 5 | 0.732 10 | 0.590 8 | 0.540 6 | 0.856 23 | 0.359 12 | 0.306 17 | 0.596 16 | 0.539 3 | 0.627 22 | 0.706 4 | 0.497 8 | 0.785 23 | 0.757 21 | 0.476 24 | |
| MCA-Net | 0.595 9 | 0.533 22 | 0.756 9 | 0.746 4 | 0.590 10 | 0.334 10 | 0.506 9 | 0.670 17 | 0.587 9 | 0.500 13 | 0.905 11 | 0.366 11 | 0.352 10 | 0.601 15 | 0.506 9 | 0.669 17 | 0.648 10 | 0.501 7 | 0.839 17 | 0.769 17 | 0.516 23 | |
| FAN_NV_RVC | 0.586 11 | 0.510 23 | 0.764 6 | 0.079 28 | 0.620 8 | 0.330 11 | 0.494 10 | 0.753 7 | 0.573 10 | 0.556 5 | 0.884 18 | 0.405 4 | 0.303 18 | 0.718 3 | 0.452 14 | 0.672 14 | 0.658 8 | 0.509 5 | 0.898 5 | 0.813 8 | 0.727 2 | |
| UDSSEG_RVC | 0.545 20 | 0.610 15 | 0.661 24 | 0.588 9 | 0.556 17 | 0.268 23 | 0.482 11 | 0.642 22 | 0.572 11 | 0.475 18 | 0.836 25 | 0.312 20 | 0.367 8 | 0.630 11 | 0.189 25 | 0.639 21 | 0.495 25 | 0.452 17 | 0.826 19 | 0.756 22 | 0.541 19 | |
| EMSANet | 0.600 8 | 0.716 4 | 0.746 10 | 0.395 20 | 0.614 9 | 0.382 5 | 0.523 6 | 0.713 13 | 0.571 12 | 0.503 10 | 0.922 7 | 0.404 5 | 0.397 5 | 0.655 10 | 0.400 17 | 0.626 23 | 0.663 6 | 0.469 14 | 0.900 4 | 0.827 4 | 0.577 16 | |
| *Daniel Seichter, Söhnke Fischedick, Mona Köhler, Horst-Michael Gross: EMSANet: Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. IJCNN 2022* |
| SSMA | 0.577 15 | 0.695 6 | 0.716 16 | 0.439 15 | 0.563 16 | 0.314 15 | 0.444 17 | 0.719 11 | 0.551 13 | 0.503 10 | 0.887 17 | 0.346 18 | 0.348 11 | 0.603 14 | 0.353 22 | 0.709 6 | 0.600 16 | 0.457 16 | 0.901 3 | 0.786 13 | 0.599 15 | |
| *Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. IJCV 2019* |
| DCRedNet | 0.583 13 | 0.682 8 | 0.723 14 | 0.542 12 | 0.510 22 | 0.310 16 | 0.451 15 | 0.668 18 | 0.549 14 | 0.520 9 | 0.920 8 | 0.375 7 | 0.446 2 | 0.528 22 | 0.417 16 | 0.670 16 | 0.577 19 | 0.478 12 | 0.862 12 | 0.806 11 | 0.628 11 | |
| UNIV_CNP_RVC_UE | 0.566 17 | 0.569 21 | 0.686 21 | 0.435 16 | 0.524 19 | 0.294 20 | 0.421 20 | 0.712 14 | 0.543 15 | 0.463 19 | 0.872 19 | 0.320 19 | 0.363 9 | 0.611 13 | 0.477 12 | 0.686 12 | 0.627 13 | 0.443 19 | 0.862 12 | 0.775 16 | 0.639 8 | |
| RFBNet | 0.592 10 | 0.616 12 | 0.758 8 | 0.659 6 | 0.581 12 | 0.330 11 | 0.469 13 | 0.655 20 | 0.543 15 | 0.524 8 | 0.924 4 | 0.355 14 | 0.336 12 | 0.572 19 | 0.479 11 | 0.671 15 | 0.648 10 | 0.480 11 | 0.814 21 | 0.814 7 | 0.614 12 | |
| WSGFormer | 0.585 12 | 0.706 5 | 0.708 18 | 0.434 17 | 0.574 14 | 0.283 22 | 0.538 4 | 0.759 6 | 0.542 17 | 0.482 17 | 0.924 4 | 0.351 16 | 0.333 13 | 0.614 12 | 0.393 18 | 0.692 10 | 0.551 22 | 0.461 15 | 0.874 9 | 0.809 10 | 0.673 7 | |
| 3DMV (2d proj) | 0.498 24 | 0.481 26 | 0.612 25 | 0.579 10 | 0.456 24 | 0.343 7 | 0.384 23 | 0.623 24 | 0.525 18 | 0.381 25 | 0.845 24 | 0.254 24 | 0.264 23 | 0.557 20 | 0.182 26 | 0.581 26 | 0.598 17 | 0.429 22 | 0.760 25 | 0.661 27 | 0.446 26 | |
| *Angela Dai, Matthias Nießner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV 2018* |
| SN_RN152pyrx8_RVC | 0.546 19 | 0.572 19 | 0.663 23 | 0.638 8 | 0.518 20 | 0.298 18 | 0.366 26 | 0.633 23 | 0.510 19 | 0.446 21 | 0.864 21 | 0.296 22 | 0.267 21 | 0.542 21 | 0.346 23 | 0.704 8 | 0.575 20 | 0.431 21 | 0.853 15 | 0.766 19 | 0.630 10 | |
| FuseNet | 0.535 22 | 0.570 20 | 0.681 22 | 0.182 24 | 0.512 21 | 0.290 21 | 0.431 18 | 0.659 19 | 0.504 20 | 0.495 15 | 0.903 13 | 0.308 21 | 0.428 4 | 0.523 23 | 0.365 21 | 0.676 13 | 0.621 15 | 0.470 13 | 0.762 24 | 0.779 15 | 0.541 19 | |
| *Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016* |
| EMSAFormer | 0.564 18 | 0.581 18 | 0.736 12 | 0.564 11 | 0.546 18 | 0.219 25 | 0.517 7 | 0.675 16 | 0.486 21 | 0.427 23 | 0.904 12 | 0.352 15 | 0.320 15 | 0.589 17 | 0.528 5 | 0.708 7 | 0.464 26 | 0.413 24 | 0.847 16 | 0.786 13 | 0.611 13 | |
| DMMF | 0.567 16 | 0.623 11 | 0.767 5 | 0.238 22 | 0.571 15 | 0.347 6 | 0.413 21 | 0.719 11 | 0.472 22 | 0.418 24 | 0.895 15 | 0.357 13 | 0.260 24 | 0.696 5 | 0.523 8 | 0.666 18 | 0.642 12 | 0.437 20 | 0.895 6 | 0.793 12 | 0.603 14 | |
| segfomer with 6d | 0.542 21 | 0.594 17 | 0.687 19 | 0.146 26 | 0.579 13 | 0.308 17 | 0.515 8 | 0.703 15 | 0.472 22 | 0.498 14 | 0.868 20 | 0.369 10 | 0.282 19 | 0.589 17 | 0.390 19 | 0.701 9 | 0.556 21 | 0.416 23 | 0.860 14 | 0.759 20 | 0.539 21 | |
| AdapNet++ | 0.503 23 | 0.613 14 | 0.722 15 | 0.418 19 | 0.358 28 | 0.337 8 | 0.370 25 | 0.479 26 | 0.443 24 | 0.368 26 | 0.907 10 | 0.207 25 | 0.213 27 | 0.464 26 | 0.525 7 | 0.618 24 | 0.657 9 | 0.450 18 | 0.788 22 | 0.721 25 | 0.408 27 | |
| *Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. IJCV 2019* |
| ScanNet (2d proj) | 0.330 28 | 0.293 27 | 0.521 27 | 0.657 7 | 0.361 27 | 0.161 27 | 0.250 27 | 0.004 28 | 0.440 25 | 0.183 28 | 0.836 25 | 0.125 27 | 0.060 28 | 0.319 28 | 0.132 27 | 0.417 27 | 0.412 27 | 0.344 27 | 0.541 28 | 0.427 28 | 0.109 28 | |
| *Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR 2017* |
| ILC-PSPNet | 0.475 26 | 0.490 25 | 0.581 26 | 0.289 21 | 0.507 23 | 0.067 28 | 0.379 24 | 0.610 25 | 0.417 26 | 0.435 22 | 0.822 27 | 0.278 23 | 0.267 21 | 0.503 24 | 0.228 24 | 0.616 25 | 0.533 24 | 0.375 25 | 0.820 20 | 0.729 23 | 0.560 18 | |
| Enet (reimpl) | 0.376 27 | 0.264 28 | 0.452 28 | 0.452 14 | 0.365 26 | 0.181 26 | 0.143 28 | 0.456 27 | 0.409 27 | 0.346 27 | 0.769 28 | 0.164 26 | 0.218 26 | 0.359 27 | 0.123 28 | 0.403 28 | 0.381 28 | 0.313 28 | 0.571 27 | 0.685 26 | 0.472 25 | |
| *Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.* |
| MSeg1080_RVC | 0.485 25 | 0.505 24 | 0.709 17 | 0.092 27 | 0.427 25 | 0.241 24 | 0.411 22 | 0.654 21 | 0.385 28 | 0.457 20 | 0.861 22 | 0.053 28 | 0.279 20 | 0.503 24 | 0.481 10 | 0.645 20 | 0.626 14 | 0.365 26 | 0.748 26 | 0.725 24 | 0.529 22 | |
| *John Lambert\*, Zhuang Liu\*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020* |
