2D Semantic Label Benchmark
The 2D semantic labeling task involves predicting a semantic label for every pixel of an image.
Evaluation and metrics

Our evaluation ranks all methods according to the PASCAL VOC intersection-over-union metric (IoU): IoU = TP/(TP+FP+FN), where TP, FP, and FN denote the numbers of true positive, false positive, and false negative pixels, respectively.
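As a concrete illustration, the per-class IoU defined above can be computed as follows. This is a minimal NumPy sketch; `per_class_iou` is a hypothetical helper for illustration, not part of the benchmark's evaluation toolkit:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """PASCAL VOC-style IoU = TP / (TP + FP + FN), computed per class.

    pred, gt: integer label maps of identical shape.
    Returns an array of per-class IoU values (NaN where a class
    appears in neither prediction nor ground truth).
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))  # correctly labeled pixels
        fp = np.sum((pred == c) & (gt != c))  # predicted c, but wrong
        fn = np.sum((pred != c) & (gt == c))  # missed pixels of class c
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom
    return ious

# Tiny worked example on a 2x2 image with 2 classes.
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(per_class_iou(pred, gt, 2))  # class 0: 1/2, class 1: 2/3
```

The benchmark's "avg iou" column is then the mean of these per-class IoUs over the 20 classes.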
This table lists the benchmark results for the 2D semantic label scenario. Each cell shows the IoU score followed by the method's rank for that class in parentheses.

Method | avg IoU | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | floor | otherfurniture | picture | refrigerator | shower curtain | sink | sofa | table | toilet | wall | window
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Virtual MVFusion (R) | 0.745 (1) | 0.861 (1) | 0.839 (1) | 0.881 (1) | 0.672 (1) | 0.512 (1) | 0.422 (11) | 0.898 (1) | 0.723 (1) | 0.714 (1) | 0.954 (2) | 0.454 (1) | 0.509 (1) | 0.773 (1) | 0.895 (1) | 0.756 (1) | 0.820 (1) | 0.653 (1) | 0.935 (1) | 0.891 (1) | 0.728 (1)
CU-Hybrid-2D Net | 0.636 (3) | 0.825 (2) | 0.820 (2) | 0.179 (17) | 0.648 (3) | 0.463 (3) | 0.549 (1) | 0.742 (4) | 0.676 (2) | 0.628 (2) | 0.961 (1) | 0.420 (2) | 0.379 (5) | 0.684 (4) | 0.381 (11) | 0.732 (2) | 0.723 (3) | 0.599 (2) | 0.827 (9) | 0.851 (2) | 0.634 (4)
BPNet_2D | 0.670 (2) | 0.822 (3) | 0.795 (3) | 0.836 (2) | 0.659 (2) | 0.481 (2) | 0.451 (7) | 0.769 (3) | 0.656 (3) | 0.567 (3) | 0.931 (3) | 0.395 (3) | 0.390 (4) | 0.700 (2) | 0.534 (3) | 0.689 (6) | 0.770 (2) | 0.574 (3) | 0.865 (4) | 0.831 (3) | 0.675 (3)
CMX | 0.613 (4) | 0.681 (6) | 0.725 (8) | 0.502 (11) | 0.634 (5) | 0.297 (13) | 0.478 (5) | 0.830 (2) | 0.651 (4) | 0.537 (5) | 0.924 (4) | 0.375 (4) | 0.315 (10) | 0.686 (3) | 0.451 (9) | 0.714 (3) | 0.543 (15) | 0.504 (5) | 0.894 (3) | 0.823 (4) | 0.688 (2)
DMMF | 0.597 (6) | 0.543 (12) | 0.755 (6) | 0.749 (4) | 0.585 (7) | 0.338 (6) | 0.494 (4) | 0.704 (7) | 0.598 (5) | 0.494 (11) | 0.911 (7) | 0.347 (9) | 0.327 (9) | 0.593 (8) | 0.527 (4) | 0.675 (8) | 0.646 (8) | 0.513 (4) | 0.842 (7) | 0.774 (9) | 0.527 (12)
DMMF_3d | 0.605 (5) | 0.651 (7) | 0.744 (7) | 0.782 (3) | 0.637 (4) | 0.387 (4) | 0.536 (2) | 0.732 (5) | 0.590 (6) | 0.540 (4) | 0.856 (14) | 0.359 (7) | 0.306 (11) | 0.596 (7) | 0.539 (2) | 0.627 (13) | 0.706 (4) | 0.497 (7) | 0.785 (13) | 0.757 (12) | 0.476 (14)
MCA-Net | 0.595 (7) | 0.533 (13) | 0.756 (5) | 0.746 (5) | 0.590 (6) | 0.334 (8) | 0.506 (3) | 0.670 (8) | 0.587 (7) | 0.500 (9) | 0.905 (9) | 0.366 (6) | 0.352 (6) | 0.601 (6) | 0.506 (6) | 0.669 (11) | 0.648 (6) | 0.501 (6) | 0.839 (8) | 0.769 (10) | 0.516 (13)
SSMA | 0.577 (10) | 0.695 (4) | 0.716 (11) | 0.439 (13) | 0.563 (9) | 0.314 (10) | 0.444 (9) | 0.719 (6) | 0.551 (8) | 0.503 (8) | 0.887 (11) | 0.346 (10) | 0.348 (7) | 0.603 (5) | 0.353 (13) | 0.709 (4) | 0.600 (11) | 0.457 (11) | 0.901 (2) | 0.786 (7) | 0.599 (8)
DCRedNet | 0.583 (9) | 0.682 (5) | 0.723 (9) | 0.542 (10) | 0.510 (12) | 0.310 (11) | 0.451 (7) | 0.668 (9) | 0.549 (9) | 0.520 (7) | 0.920 (6) | 0.375 (4) | 0.446 (2) | 0.528 (12) | 0.417 (10) | 0.670 (10) | 0.577 (13) | 0.478 (9) | 0.862 (5) | 0.806 (6) | 0.628 (6)
RFBNet | 0.592 (8) | 0.616 (8) | 0.758 (4) | 0.659 (6) | 0.581 (8) | 0.330 (9) | 0.469 (6) | 0.655 (11) | 0.543 (10) | 0.524 (6) | 0.924 (4) | 0.355 (8) | 0.336 (8) | 0.572 (9) | 0.479 (8) | 0.671 (9) | 0.648 (6) | 0.480 (8) | 0.814 (11) | 0.814 (5) | 0.614 (7)
3DMV (2d proj) | 0.498 (14) | 0.481 (16) | 0.612 (15) | 0.579 (9) | 0.456 (14) | 0.343 (5) | 0.384 (13) | 0.623 (14) | 0.525 (11) | 0.381 (15) | 0.845 (15) | 0.254 (14) | 0.264 (15) | 0.557 (10) | 0.182 (16) | 0.581 (16) | 0.598 (12) | 0.429 (14) | 0.760 (15) | 0.661 (17) | 0.446 (16)
SN_RN152pyrx8_RVC | 0.546 (11) | 0.572 (10) | 0.663 (14) | 0.638 (8) | 0.518 (10) | 0.298 (12) | 0.366 (16) | 0.633 (13) | 0.510 (12) | 0.446 (13) | 0.864 (12) | 0.296 (12) | 0.267 (13) | 0.542 (11) | 0.346 (14) | 0.704 (5) | 0.575 (14) | 0.431 (13) | 0.853 (6) | 0.766 (11) | 0.630 (5)
FuseNet | 0.535 (12) | 0.570 (11) | 0.681 (13) | 0.182 (16) | 0.512 (11) | 0.290 (14) | 0.431 (10) | 0.659 (10) | 0.504 (13) | 0.495 (10) | 0.903 (10) | 0.308 (11) | 0.428 (3) | 0.523 (13) | 0.365 (12) | 0.676 (7) | 0.621 (10) | 0.470 (10) | 0.762 (14) | 0.779 (8) | 0.541 (10)
AdapNet++ | 0.503 (13) | 0.613 (9) | 0.722 (10) | 0.418 (14) | 0.358 (18) | 0.337 (7) | 0.370 (15) | 0.479 (16) | 0.443 (14) | 0.368 (16) | 0.907 (8) | 0.207 (15) | 0.213 (17) | 0.464 (16) | 0.525 (5) | 0.618 (14) | 0.657 (5) | 0.450 (12) | 0.788 (12) | 0.721 (15) | 0.408 (17)
ScanNet (2d proj) | 0.330 (18) | 0.293 (17) | 0.521 (17) | 0.657 (7) | 0.361 (17) | 0.161 (17) | 0.250 (17) | 0.004 (18) | 0.440 (15) | 0.183 (18) | 0.836 (16) | 0.125 (17) | 0.060 (18) | 0.319 (18) | 0.132 (17) | 0.417 (17) | 0.412 (17) | 0.344 (17) | 0.541 (18) | 0.427 (18) | 0.109 (18)
ILC-PSPNet | 0.475 (16) | 0.490 (15) | 0.581 (16) | 0.289 (15) | 0.507 (13) | 0.067 (18) | 0.379 (14) | 0.610 (15) | 0.417 (16) | 0.435 (14) | 0.822 (17) | 0.278 (13) | 0.267 (13) | 0.503 (14) | 0.228 (15) | 0.616 (15) | 0.533 (16) | 0.375 (15) | 0.820 (10) | 0.729 (13) | 0.560 (9)
Enet (reimpl) | 0.376 (17) | 0.264 (18) | 0.452 (18) | 0.452 (12) | 0.365 (16) | 0.181 (16) | 0.143 (18) | 0.456 (17) | 0.409 (17) | 0.346 (17) | 0.769 (18) | 0.164 (16) | 0.218 (16) | 0.359 (17) | 0.123 (18) | 0.403 (18) | 0.381 (18) | 0.313 (18) | 0.571 (17) | 0.685 (16) | 0.472 (15)
MSeg1080_RVC | 0.485 (15) | 0.505 (14) | 0.709 (12) | 0.092 (18) | 0.427 (15) | 0.241 (15) | 0.411 (12) | 0.654 (12) | 0.385 (18) | 0.457 (12) | 0.861 (13) | 0.053 (18) | 0.279 (12) | 0.503 (14) | 0.481 (7) | 0.645 (12) | 0.626 (9) | 0.365 (16) | 0.748 (16) | 0.725 (14) | 0.529 (11)

Publications associated with the submitted methods:

- Virtual MVFusion (R) — Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru: Virtual Multi-view Fusion for 3D Semantic Segmentation. ECCV 2020
- BPNet_2D — Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, Tien-Tsin Wong: Bidirectional Projection Network for Cross Dimension Scene Understanding. CVPR 2021 (Oral)
- SSMA — Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019
- 3DMV (2d proj) — Angela Dai, Matthias Niessner: 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. ECCV 2018
- FuseNet — Caner Hazirbas, Lingni Ma, Csaba Domokos, Daniel Cremers: FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. ACCV 2016
- AdapNet++ — Abhinav Valada, Rohit Mohan, Wolfram Burgard: Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. International Journal of Computer Vision, 2019
- ScanNet (2d proj) — Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR 2017
- Enet (reimpl) — Re-implementation of Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.
- MSeg1080_RVC — John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen Koltun: MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. CVPR 2020