Documentation
Class labels and ids
Label IDs for the ScanNet benchmark
Label IDs for the ScanNet200 benchmark
Scene type classification
Submission format
Format for 3D Semantic Label Prediction
Format for 3D Semantic Instance Prediction
Format for 2D Semantic Label Prediction
Format for 2D Semantic Instance Prediction
Format for Scene Type Classification
Submission policy
The 1513 scans of the ScanNet dataset release may be used for learning the parameters of the algorithms. The test data should be used strictly for reporting the final results -- this benchmark is not meant for iterative testing sessions or parameter tweaking.
Parameter tuning is only allowed on the training data. Evaluating on the test data via this evaluation server must only be done once for the final system. It is not permitted to use it to train systems, for example by trying out different parameter values and choosing the best. Only one version (the one that performed best on the training data) may be evaluated. This is to avoid overfitting on the test data. Results of different parameter settings of an algorithm can therefore only be reported on the training set. To help enforce this policy, we block updates to the test set results of a method for two weeks after a test set submission. You may split the training data into training and validation sets yourself as you wish.
It is not permitted to register on this webpage with multiple e-mail addresses or with information misrepresenting the identity of the user. We will ban users or domains if necessary.
Data
Download: If you would like to download the ScanNet data, please fill out an agreement to the ScanNet Terms of Use and send it to us at the scannet group email. For more information regarding the ScanNet dataset, please see our git repo. The ScanNet and ScanNet200 benchmarks share the same source data but differ in how the annotations are parsed; please refer to the provided preprocessing scripts for ScanNet200.
Tasks Data Requirements: For all tasks, both 2D and 3D data modalities can be used as input.
2D Data: As the number of 2D RGB-D frames is particularly large (≈2.5 million), we provide an option to download a smaller subset of frames.
Data Formats
3D Data: 3D data is provided as the RGB-D video sequences (depth-color aligned) as well as reconstructed surface meshes (.ply files).
2D Data: 2D label and instance data are provided as .png images, described below.
The 2D data provided with the ScanNet release is formatted as follows:
- Label images: 16-bit .png, where each pixel stores the id value from scannetv2-labels.combined.tsv (0 corresponds to unannotated or no depth).
- Instance images: 8-bit .png, where each pixel stores an integer value per distinct instance (0 corresponds to unannotated or no depth).
We also provide a preprocessed subset of the ScanNet frames, formatted as follows:
- Label images: 8-bit .png, where each pixel stores the nyu40id value from scannetv2-labels.combined.tsv (0 corresponds to unannotated or no depth).
- Instance images: 16-bit .png, where each pixel stores an integer value per distinct instance, encoded as label*1000+inst, where label is the nyu40id and inst is a 1-indexed value counting the instances of that label in the image. Note that walls, floors, and ceilings do not contain instances, and that 0 corresponds to unannotated or no depth.
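As an illustration, a minimal sketch (not part of the benchmark tooling) for decoding such a preprocessed instance image back into per-pixel nyu40 labels and instance indices; the file path and the use of imageio are assumptions:

import numpy as np
import imageio.v2 as imageio

# Placeholder path; point it at one of the preprocessed 16-bit instance .png frames.
inst_img = imageio.imread("scene0707_00_instance/000000.png").astype(np.uint32)

label = inst_img // 1000  # nyu40id per pixel (0 = unannotated or no depth)
inst = inst_img % 1000    # 1-indexed instance id within that label (0 = no instance)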
Class labels and ids
ScanNet Benchmark
2D/3D semantic label and instance prediction: We use the NYUv2 40-label set (see all label ids here) and evaluate on the 20-class subset defined here. Note that for the instance tasks, the wall and floor classes are excluded from evaluation.
ScanNet200 Benchmark
- This benchmark follows the original train/val/test scene splits published with this dataset and available here
- We further split the 200 categories into three sets based on their point and instance frequencies, namely head, common, and tail. The category splits can be found in scannet200_split.py here.
- The raw annotations in the training set contain 550 distinct categories, many of which appear only once; these were filtered to produce the large-vocabulary, challenging ScanNet200 setting. The mapping of annotation category IDs to ScanNet200 valid categories can be found in scannet200_constants.py here.
- For preprocessing the raw data and color labels, please refer to the repository.
Scene type classification
There are 20 different scene types for the ScanNet dataset, described with ids here. We evaluate on a 13-class subset defined here.
Results for a method must be uploaded as a single .zip or .7z file (7z is preferred due to smaller file sizes), which, when unzipped, must contain the prediction files in its root. There must not be any additional files or folders in the archive except those specified below.
Format for 3D Semantic Label Prediction
Both for the ScanNet and ScanNet200 3D semantic label prediction tasks, results must be provided as class labels per vertex of the corresponding 3D scan mesh, i.e., one integer label id per line, for each vertex in the order provided by the scan's mesh (*_vh_clean_2.ply). For example, a prediction file could look like:
10
10
2
2
2
⋮
39
A submission must contain a .txt prediction file for each test scan, named with the corresponding ScanNet scan name, e.g.:
unzip_root/
|-- scene0707_00.txt
|-- scene0708_00.txt
|-- scene0709_00.txt
⋮
|-- scene0806_00.txt
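As an illustration (not official benchmark tooling), a minimal sketch for writing such a per-vertex prediction file with numpy; the function name and paths are placeholders:

import numpy as np

def write_semantic_labels(out_path, vertex_labels):
    # vertex_labels: one integer label id per mesh vertex, in mesh vertex order.
    np.savetxt(out_path, np.asarray(vertex_labels, dtype=np.int32), fmt="%d")

# e.g.: write_semantic_labels("unzip_root/scene0707_00.txt", per_vertex_labels)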
Format for 3D Semantic Instance Prediction
Results must be provided as a text file for each test scan. Each prediction file should contain a list of instances, one per line, where an instance is: (1) the relative path to the predicted binary mask file, (2) the integer class label id, and (3) the float confidence score, with the three values separated by spaces. Thus, the filenames in the prediction files must not contain spaces. The result text files must be named according to the corresponding test scan, with the instance masks in a predicted_masks subfolder, e.g.:
unzip_root/
|-- scene0707_00.txt
|-- scene0708_00.txt
|-- scene0709_00.txt
⋮
|-- scene0806_00.txt
|-- predicted_masks/
|-- scene0707_00_000.txt
|-- scene0707_00_001.txt
⋮
For example, scene0707_00.txt could look like:
predicted_masks/scene0707_00_000.txt 10 0.7234
predicted_masks/scene0707_00_001.txt 36 0.9038
⋮
Each predicted instance mask file should provide a binary mask over the vertices of the scan mesh, i.e., one line per vertex in the order provided by the scan's mesh (*_vh_clean_2.ply), with 1 indicating that the vertex belongs to the instance and 0 otherwise. For example, predicted_masks/scene0707_00_000.txt could look like:
0
0
0
1
1
⋮
0
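A minimal sketch of how such a per-scan instance submission could be written; the function and variable names are placeholders, not part of the official tooling:

import os
import numpy as np

def write_instance_predictions(out_dir, scan_id, instances):
    # instances: list of (vertex_mask, label_id, confidence), where vertex_mask is a
    # 0/1 array with one entry per mesh vertex, in mesh vertex order.
    mask_dir = os.path.join(out_dir, "predicted_masks")
    os.makedirs(mask_dir, exist_ok=True)
    lines = []
    for i, (vertex_mask, label_id, confidence) in enumerate(instances):
        mask_name = f"{scan_id}_{i:03d}.txt"
        np.savetxt(os.path.join(mask_dir, mask_name),
                   np.asarray(vertex_mask, dtype=np.int32), fmt="%d")
        # relative mask path, label id, confidence, separated by spaces
        lines.append(f"predicted_masks/{mask_name} {label_id} {confidence:.4f}")
    with open(os.path.join(out_dir, f"{scan_id}.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")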
Format for 2D Semantic Label Prediction
Results must be provided as 8-bit PNGs encoding the predicted class label ids.
A submission must contain a result image for each test image, named according to the corresponding test frame, e.g.:
unzip_root/
|-- scene0707_00_000000.png
|-- scene0707_00_000200.png
|-- scene0707_00_000400.png
⋮
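For example, a minimal sketch for writing one such result image; the use of imageio and the names below are assumptions:

import numpy as np
import imageio.v2 as imageio

def write_label_image(out_path, label_map):
    # label_map: (H, W) array of predicted class label ids, small enough to fit 8 bits.
    imageio.imwrite(out_path, np.asarray(label_map, dtype=np.uint8))

# e.g.: write_label_image("unzip_root/scene0707_00_000000.png", predicted_labels)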
Format for 2D Semantic Instance Prediction
Results must be provided as a text file for each test image. Each prediction file should contain a list of instances, one per line, where an instance is: (1) the relative path to a binary mask image of the instance, (2) the integer class label id, and (3) the float confidence score, with the three values separated by spaces. The result text files must be named according to the corresponding test frame, with the instance mask images in a predicted_masks subfolder, e.g.:
unzip_root/
|-- scene0707_00_000000.txt
|-- scene0707_00_000200.txt
|-- scene0707_00_000400.txt
⋮
|-- predicted_masks/
|-- scene0707_00_000000_000.png
|-- scene0707_00_000000_001.png
⋮
For example, scene0707_00_000000.txt could look like:
predicted_masks/scene0707_00_000000_000.png 33 0.7234
predicted_masks/scene0707_00_000000_001.png 5 0.9038
⋮
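A minimal sketch of how such a per-frame instance submission could be written; the names and the 0/255 mask encoding are assumptions (any clearly binary mask image should serve):

import os
import numpy as np
import imageio.v2 as imageio

def write_2d_instance_predictions(out_dir, frame_id, instances):
    # instances: list of (binary_mask, label_id, confidence), where binary_mask is an
    # (H, W) array with nonzero pixels marking the instance.
    mask_dir = os.path.join(out_dir, "predicted_masks")
    os.makedirs(mask_dir, exist_ok=True)
    lines = []
    for i, (binary_mask, label_id, confidence) in enumerate(instances):
        mask_name = f"{frame_id}_{i:03d}.png"
        mask_img = (np.asarray(binary_mask) > 0).astype(np.uint8) * 255
        imageio.imwrite(os.path.join(mask_dir, mask_name), mask_img)
        lines.append(f"predicted_masks/{mask_name} {label_id} {confidence:.4f}")
    with open(os.path.join(out_dir, f"{frame_id}.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")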
Format for Scene Type Classification
Results must be provided as a single text file, containing a line for each scan and its predicted scene type id, separated by spaces. For instance, an example submission could look like:
scene0707_00 1
scene0708_00 2
scene0709_00 3
⋮
scene0806_00 14
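Writing such a file is straightforward; a minimal sketch, where the scan ids, type ids, and output filename are placeholders:

# Placeholder predictions: scan id -> predicted scene type id.
predictions = {"scene0707_00": 1, "scene0708_00": 2, "scene0806_00": 14}

with open("scene_type_predictions.txt", "w") as f:
    for scan_id, type_id in sorted(predictions.items()):
        f.write(f"{scan_id} {type_id}\n")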