The ScanNet++ dataset currently consists of 380 scenes.
The data download contains one folder per scene with the laser scan,
DSLR, and iPhone data, plus several metadata files. The data is organized as follows:
- split/
- nvs_sem_train.txt: Training set for NVS and semantic tasks with 230 scenes
- nvs_sem_val.txt: Validation set for NVS and semantic tasks with 50 scenes
- nvs_test.txt: Test set for NVS with 50 scenes
- sem_test.txt: Test set for semantic tasks with 50 scenes
- Each file contains the list of scene IDs in the respective split
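As an illustration, a minimal sketch of reading the scene IDs of a split (the dataset root path "scannetpp" is a placeholder for your download location):

```python
from pathlib import Path

# Placeholder dataset root; adjust to where the download was extracted.
DATA_ROOT = Path("scannetpp")

def read_split(split_name: str) -> list[str]:
    """Read a split file (e.g. 'nvs_sem_train') and return its scene IDs."""
    split_file = DATA_ROOT / "split" / f"{split_name}.txt"
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

train_scenes = read_split("nvs_sem_train")
print(len(train_scenes))  # expected: 230
```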
- metadata/
- semantic_classes.txt: list of semantic classes
- instance_classes.txt: subset of semantic classes that have instances
(i.e., excludes wall, ceiling, floor, ...)
- semantic_benchmark/
- top100.txt: top 100 semantic classes for semantic segmentation benchmark
- top100_instance.txt: subset of 100 semantic classes for instance segmentation benchmark
- map_benchmark.csv: mapping from raw semantic labels to benchmark labels
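For reference, a sketch of loading the class lists and the benchmark mapping with standard Python; the column names used for map_benchmark.csv are assumptions and may differ from the actual header:

```python
import csv
from pathlib import Path

META = Path("scannetpp") / "metadata"  # placeholder dataset root

# One class name per line.
semantic_classes = META.joinpath("semantic_classes.txt").read_text().splitlines()
top100 = META.joinpath("semantic_benchmark", "top100.txt").read_text().splitlines()

# Mapping from raw semantic labels to benchmark labels.
raw_to_benchmark = {}
with open(META / "semantic_benchmark" / "map_benchmark.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names "class" and "benchmark_class" are assumed here;
        # check the actual CSV header before use.
        raw_to_benchmark[row["class"]] = row["benchmark_class"]
```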
- data/<scene_id>/
- scans/
- pc_aligned.ply: point cloud from laser scanner, axis-aligned
- pc_aligned_mask.txt: indices of anonymized points
- scanner_poses.json: contains the scanner positions, a 4x4 transformation matrix for each position
- mesh_aligned_0.05.ply: mesh decimated to 5% size, obtained from the point cloud
- mesh_aligned_0.05_mask.txt: indices of mesh vertices with anonymization applied
- mesh_aligned_0.05_semantic.ply
- The vertex "label" property contains the integer semantic label, which indexes into the classes in semantic_classes.txt
- Unlabeled vertices have the label -100
- segments.json: json_data["segIndices"] contains the segment ID for each vertex
- segments_anno.json: (a reading sketch follows this scans/ listing)
- json_data[i] corresponds to a single annotated object
- "label": the semantic label of this object
- "segments": all the segments belonging to this object
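A rough sketch of reading the per-vertex semantic labels and the object annotations, using the plyfile package; the scene path is a placeholder and the "segGroups" key is an assumption noted in the comments:

```python
import json
import numpy as np
from plyfile import PlyData  # pip install plyfile

scene_dir = "scannetpp/data/<scene_id>/scans"  # placeholder path

# Per-vertex semantic labels; -100 marks unlabeled vertices.
ply = PlyData.read(f"{scene_dir}/mesh_aligned_0.05_semantic.ply")
labels = np.asarray(ply["vertex"]["label"])

# Segment ID of each vertex.
with open(f"{scene_dir}/segments.json") as f:
    seg_indices = np.asarray(json.load(f)["segIndices"])

# Annotated objects; each entry has a "label" and the "segments" it covers.
with open(f"{scene_dir}/segments_anno.json") as f:
    json_data = json.load(f)

# Assumption: if the file wraps the objects in a "segGroups" key (ScanNet-style),
# unwrap it; otherwise iterate the top-level list directly.
objects = json_data["segGroups"] if isinstance(json_data, dict) else json_data

for obj in objects:
    vertex_mask = np.isin(seg_indices, obj["segments"])
    print(obj["label"], int(vertex_mask.sum()), "vertices")
```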
- dslr/
- resized_images: Fisheye DSLR images, resized, JPG
- resized_anon_masks: PNG. Specifies the pixels that have been anonymized (0: invalid, 255: valid pixels).
- original_images: Full resolution images, JPG
- original_anon_masks: PNG. Similar to resized masks
- colmap: contains the COLMAP camera model that has been aligned with the 3D scans, which means the poses are in metric scale. Make sure to use this if you want to do 2D-3D matching against the provided mesh.
- cameras.txt: Contains the camera type (OPENCV_FISHEYE) and the intrinsic parameters (fx, fy, cx, cy, distortion parameters)
- images.txt: Contains the extrinsics of each image: qvec (quaternion) and tvec (translation)
- points3D.txt: Contains the 3D feature points used by COLMAP
- Useful references: the Python (Open3D) visualizer provided by COLMAP
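As a sketch, one way to turn a pose line of images.txt into a 4x4 world-to-camera matrix; COLMAP stores qvec as (qw, qx, qy, qz) and tvec in the world-to-camera direction, and only standard NumPy/SciPy calls are used:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def colmap_pose_to_w2c(qvec, tvec):
    """Build a 4x4 world-to-camera matrix from COLMAP's qvec (qw, qx, qy, qz) and tvec."""
    qw, qx, qy, qz = qvec
    # SciPy expects quaternions in (x, y, z, w) order.
    R = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()
    w2c = np.eye(4)
    w2c[:3, :3] = R
    w2c[:3, 3] = tvec
    return w2c

# Example values only; real numbers come from the first of the two lines
# that images.txt stores per image.
w2c = colmap_pose_to_w2c((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
c2w = np.linalg.inv(w2c)  # camera-to-world pose
```

Since these poses are aligned with the 3D scans and therefore metric, the resulting camera-to-world matrices place the cameras directly in the mesh coordinate frame.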
- nerfstudio/transforms.json
- Contains the same camera poses in the format used by Nerfstudio (OpenGL/Blender convention), whose coordinate system differs from the OpenCV/COLMAP convention; see the conversion sketch after this block.
- poses:
- frames, test_frames: contain the poses for train and test images, respectively
- mask: filename of binary mask file
- is_bad: indicates if the image is blurry or contains heavy shadows.
- Camera model (as above):
- contained in fl_x, fl_y, ..., k1, k2, k3, k4, camera_model
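Because the conventions differ, a common conversion (a sketch, assuming the usual OpenGL/Blender camera layout with +Y up and the camera looking down -Z) is to flip the Y and Z camera axes to obtain an OpenCV-style camera-to-world matrix:

```python
import numpy as np

def opengl_to_opencv_c2w(c2w_gl: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from OpenGL/Blender to OpenCV convention.

    OpenGL/Blender cameras look down -Z with +Y up; OpenCV cameras look down +Z
    with +Y down, so the Y and Z columns of the rotation part are negated.
    """
    c2w_cv = c2w_gl.copy()
    c2w_cv[:3, 1:3] *= -1.0  # flip the Y and Z camera axes
    return c2w_cv
```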
- train_test_lists.json
- json["train"]: training images
- json["test"]: novel views, test images
- The split here is the same as the one in nerfstudio/transforms.json
- json["has_masks"]: global flag for the scene, indicating whether it has anonymization masks
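For example, a small sketch of loading the train/test split and the mask flag (the file path is a placeholder):

```python
import json

with open("scannetpp/data/<scene_id>/dslr/train_test_lists.json") as f:
    lists = json.load(f)

train_images = lists["train"]    # training image filenames
test_images = lists["test"]      # novel-view / test image filenames
has_masks = lists["has_masks"]   # whether the scene has anonymization masks
```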
- iphone/
- rgb.mp4: full RGB video, 60 FPS
- rgb_mask.mp4: Video of the anonymization masks, lossless compression. After decoding, the masks are similar to the DSLR masks.
- depth.bin: Depth images (16-bit, in millimeters) from the iPhone LiDAR sensor, packed into a single binary file. The depth images are aligned with the RGB images.
- rgb: RGB frames from the video, subsampled, obtained by running the processing script on rgb.mp4. The resolution is 1920 x 1440.
- depth: Depth images as 16-bit PNGs in millimeters, obtained by running the processing script on depth.bin. The depth images are aligned with the RGB images but at a much lower resolution, 256 x 192 (see the loading sketch below).
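A minimal sketch of loading one of these depth frames into meters, using OpenCV; the frame filename is a placeholder:

```python
import cv2
import numpy as np

# Read the 16-bit depth PNG without converting it to 8 bits.
depth_mm = cv2.imread("scannetpp/data/<scene_id>/iphone/depth/frame_000000.png",
                      cv2.IMREAD_UNCHANGED)
depth_m = depth_mm.astype(np.float32) / 1000.0  # millimeters -> meters
valid = depth_mm > 0  # zero values are treated as missing depth
```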
- pose_intrinsic_imu.json: contains ARKit poses and IMU information from the iPhone
- json["poses"] contain a 4x4 camera-to-world extrinsic matrix from raw ARKit
output.
The coordinate system is right-handed. +Z is the camera direction.
- json["intrinsic"] contains a 3x3 intrinsic matrix of the RGB image
- json["aligned_poses"] contains ARKit poses that are scaled and transformed to
our
mesh space
- The iPhone does not provide intrinsics for the LiDAR depth. The user can scale the RGB intrinsics to the LiDAR depth resolution, since RGB and depth are aligned (a sketch follows this item).
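Since the depth maps are aligned with the RGB frames, one way (a sketch, not part of the provided tooling) to obtain approximate depth intrinsics is to scale the 3x3 RGB intrinsic matrix by the resolution ratio (1920 x 1440 -> 256 x 192):

```python
import numpy as np

def scale_intrinsics(K_rgb: np.ndarray,
                     rgb_wh=(1920, 1440),
                     depth_wh=(256, 192)) -> np.ndarray:
    """Scale a 3x3 RGB intrinsic matrix to the lower depth resolution."""
    sx = depth_wh[0] / rgb_wh[0]
    sy = depth_wh[1] / rgb_wh[1]
    K_depth = K_rgb.astype(np.float64).copy()
    K_depth[0, :] *= sx  # fx, skew, cx
    K_depth[1, :] *= sy  # fy, cy
    return K_depth
```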
- nerfstudio: similar to DSLR
- colmap: similar to DSLR. Images here have been filtered based on the agreement of depth between the iPhone LiDAR and the laser scanner. The camera model is OPENCV, which has 4 distortion parameters: k1, k2, p1, p2.
- exif.json: EXIF information for each frame in the video
All data is anonymized using the magenta color with RGB value (255, 0, 255). The user may fill those regions with any color using the given binary mask.