Documentation

ScanNet++ Toolbox

Check out our ScanNet++ Toolbox on GitHub. It provides tools and code to

  • Read the dataset structure
  • Decode iPhone RGB, depth, and mask video
  • Undistort DSLR images with COLMAP
  • Render high-res depth maps from the mesh for DSLR and iPhone frames
  • Use the Nerfstudio dataparser for DSLR images (see this PR)
  • Prepare training data for semantic tasks
  • Run the official evaluation code for the benchmark

Data Structure

The ScanNet++ dataset currently consists of 380 scenes.

The data download contains one folder per scene containing laser scan, DSLR and iPhone data, and several metadata files. The data is organized as follows:

  • split/
    • nvs_sem_train.txt: Training set for NVS and semantic tasks with 230 scenes
    • nvs_sem_val.txt: Validation set for NVS and semantic tasks with 50 scenes
    • nvs_test.txt: Test set for NVS with 50 scenes
    • sem_test.txt: Test set for semantic tasks with 50 scenes
    • Each file contains the list of scene IDs in the respective split
  • metadata/
    • semantic_classes.txt: list of semantic classes
    • instance_classes.txt: subset of semantic classes that have instances (i.e., excludes wall, ceiling, floor, etc.)
    • semantic_benchmark/
      • top100.txt: top 100 semantic classes for semantic segmentation benchmark
      • top100_instance.txt: subset of 100 semantic classes for instance segmentation benchmark
      • map_benchmark.csv: mapping from raw semantic labels to benchmark labels
  • data/<scene_id>/
    • scans/
        • pc_aligned.ply: point cloud from laser scanner, axis-aligned
        • pc_aligned_mask.txt: indices of anonymized points
        • scanner_poses.json: contains scanner positions, 4x4 transformation matrix for each position
        • mesh_aligned_0.05.ply: mesh obtained from the point cloud, decimated to 5% of its original size
        • mesh_aligned_0.05_mask.txt: indices of mesh vertices with anonymization applied
        • mesh_aligned_0.05_semantic.ply
          • The vertex “label” property contains an integer semantic label that indexes into the classes in semantic_classes.txt (see the mesh-loading example at the end of this section)
          • Unlabeled vertices have the label -100
        • segments.json: json_data[“segIndices”] contains the segment ID for each vertex
        • segments_anno.json:
          • json_data[i] corresponds to a single annotated object
            • “label”: the semantic label of this object
            • “segments”: all the segments belonging to this object
    • dslr/
      • resized_images: Fisheye DSLR images, resized, JPG
      • resized_anon_masks: PNG. Specifies the pixels that have been anonymized (0: invalid, 255: valid pixels).
      • original_images: Full resolution images, JPG
      • original_anon_masks: PNG. Similar to resized masks
      • colmap: contains the COLMAP model that has been aligned with the 3D scans, which means the poses are in metric scale. Make sure to use this if you want to do 2D-3D matching between the images and the provided mesh.
      • nerfstudio/transforms.json
        • Contains the same camera poses in the format used by Nerfstudio, which follows the OpenGL/Blender convention. This coordinate system differs from the OpenCV/COLMAP convention (see the pose-conversion example at the end of this section).
        • poses:
          • frames, test_frames: contain poses for train and test images respectively
        • mask: filename of binary mask file
        • is_bad: indicates if the image is blurry or contains heavy shadows.
        • Camera model (as above):
          • contained in fl_x, fl_y, .., k1, k2, k3, k4, camera_model
      • train_test_lists.json
        • json[“train”]: training images
        • json[“test”]: novel views, test images
        • The split here is the same as the one in nerfstudio/transforms.json
        • json[“has_masks”]: global flag for the scene, indicating whether it has anonymization masks
    • iphone/
      • rgb.mp4: full RGB video, 60 FPS
      • rgb_mask.mp4: Video of the anonymization masks, lossless compression. After decoding, the masks are similar to the DSLR masks.
      • depth.bin: 16-bit depth images in millimeters from the iPhone LiDAR sensor, packed into a single binary file. The depth images are aligned with the RGB images.
      • rgb: Subsampled RGB frames extracted from rgb.mp4 with the processing script. The resolution is 1920 x 1440.
      • depth: Depth images as 16-bit PNGs in millimeters, extracted from depth.bin with the processing script. The depth images are aligned with the RGB images but have a much lower resolution: 256 x 192.
      • pose_intrinsic_imu.json: contains ARKit poses and IMU information from the iPhone
        • json["poses"] contains a 4x4 camera-to-world extrinsic matrix for each frame from the raw ARKit output. The coordinate system is right-handed, and +Z is the camera direction.
        • json["intrinsic"] contains a 3x3 intrinsic matrix of the RGB image
        • json["aligned_poses"] contains ARKit poses that are scaled and transformed to our mesh space
        • The iPhone does not provide intrinsics for the LiDAR depth. Since RGB and depth are aligned, the RGB intrinsics can be scaled to the depth resolution (see the depth back-projection example at the end of this section).
      • nerfstudio: similar to DSLR
      • colmap: similar to DSLR. Images here have been filtered based on the agreement between the iPhone LiDAR depth and the laser scan. The camera model is OPENCV, which has 4 distortion parameters: k1, k2, p1, p2.
      • exif.json: EXIF information for each frame in the video
  • All data is anonymized using magenta, RGB value (255, 0, 255). These regions can be filled with any color using the given binary masks, as in the last example below.
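
The sketches below illustrate how the files described above fit together. They are minimal Python examples, not part of the official toolbox; package choices (plyfile, numpy, OpenCV) and any file names shown are assumptions. The first sketch reads the per-vertex semantic labels from mesh_aligned_0.05_semantic.ply and groups vertices by annotated object using segments.json and segments_anno.json, following the JSON structure documented above.

    import json
    from pathlib import Path

    import numpy as np
    from plyfile import PlyData  # assumed third-party dependency for reading PLY files

    scan_dir = Path("data/<scene_id>/scans")  # placeholder scene path
    classes = Path("metadata/semantic_classes.txt").read_text().splitlines()

    # Per-vertex semantic labels; -100 marks unlabeled vertices.
    ply = PlyData.read(str(scan_dir / "mesh_aligned_0.05_semantic.ply"))
    labels = np.asarray(ply["vertex"]["label"])
    valid = labels != -100
    print("labeled vertices:", valid.sum(), "of", len(labels))
    print("class of first labeled vertex:", classes[labels[valid][0]])

    # Per-vertex segment IDs and the per-object annotations.
    seg_ids = np.asarray(json.loads((scan_dir / "segments.json").read_text())["segIndices"])
    objects = json.loads((scan_dir / "segments_anno.json").read_text())

    for obj in objects[:5]:  # first few annotated objects
        vertex_mask = np.isin(seg_ids, obj["segments"])  # vertices whose segment belongs to this object
        print(obj["label"], "->", int(vertex_mask.sum()), "vertices")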
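
Because dslr/nerfstudio/transforms.json stores camera-to-world poses in the OpenGL/Blender convention, they need a per-frame axis flip before use with OpenCV/COLMAP-convention code. A minimal sketch of that conversion, assuming the standard Nerfstudio per-frame keys file_path and transform_matrix (not spelled out above):

    import json
    from pathlib import Path

    import numpy as np

    transforms_path = Path("data/<scene_id>/dslr/nerfstudio/transforms.json")  # placeholder path
    meta = json.loads(transforms_path.read_text())

    # Negating the camera Y and Z axes converts an OpenGL/Blender camera-to-world
    # matrix (+X right, +Y up, -Z forward) to the OpenCV/COLMAP convention
    # (+X right, +Y down, +Z forward); the translation is left untouched.
    OPENGL_TO_OPENCV = np.diag([1.0, -1.0, -1.0, 1.0])

    poses_opencv = {}
    for frame in meta["frames"] + meta.get("test_frames", []):
        c2w_gl = np.array(frame["transform_matrix"])  # assumed standard Nerfstudio key
        c2w_cv = c2w_gl @ OPENGL_TO_OPENCV            # camera-to-world in OpenCV convention
        poses_opencv[frame["file_path"]] = c2w_cv     # assumed standard Nerfstudio key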
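
Since the iPhone RGB frames (1920 x 1440) and depth maps (256 x 192) are aligned, the RGB intrinsics can be rescaled to the depth resolution and used to back-project depth into the camera frame. A minimal sketch; exactly how pose_intrinsic_imu.json keys its per-frame entries is not spelled out above, so the intrinsics lookup and the depth file name are placeholders:

    import json
    from pathlib import Path

    import cv2
    import numpy as np

    iphone_dir = Path("data/<scene_id>/iphone")  # placeholder scene path
    meta = json.loads((iphone_dir / "pose_intrinsic_imu.json").read_text())

    # 3x3 RGB intrinsics; treat this lookup as a placeholder for "one frame's
    # intrinsic matrix" since the per-frame layout of the JSON may differ.
    K_rgb = np.array(meta["intrinsic"], dtype=np.float64).reshape(3, 3)

    # RGB frames are 1920 x 1440 and depth maps are 256 x 192; because the two
    # are aligned, the intrinsics can simply be rescaled per axis.
    sx, sy = 256 / 1920, 192 / 1440
    K_depth = K_rgb.copy()
    K_depth[0] *= sx  # fx, cx
    K_depth[1] *= sy  # fy, cy

    # Depth PNGs are 16-bit and in millimeters; back-project one frame into the
    # camera coordinate system (the file name is hypothetical).
    depth_mm = cv2.imread(str(iphone_dir / "depth" / "frame_000000.png"), cv2.IMREAD_UNCHANGED)
    depth_m = depth_mm.astype(np.float32) / 1000.0
    v, u = np.indices(depth_m.shape)
    z = depth_m
    x = (u - K_depth[0, 2]) * z / K_depth[0, 0]
    y = (v - K_depth[1, 2]) * z / K_depth[1, 1]
    points_cam = np.stack([x, y, z], axis=-1)[z > 0]  # drop pixels with no depth return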
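
Finally, anonymized regions (magenta, RGB (255, 0, 255)) can be filled or inpainted using the binary masks, where 0 marks anonymized pixels and 255 marks valid ones. A minimal sketch with OpenCV and hypothetical file names:

    import cv2
    import numpy as np

    # Hypothetical file names for one resized DSLR image and its anonymization mask.
    image = cv2.imread("resized_images/DSC00001.JPG")
    mask = cv2.imread("resized_anon_masks/DSC00001.png", cv2.IMREAD_GRAYSCALE)  # 0: invalid, 255: valid

    invalid = mask == 0  # anonymized (magenta) pixels

    # Option 1: fill the anonymized regions with a constant color, e.g. black.
    filled = image.copy()
    filled[invalid] = (0, 0, 0)

    # Option 2: inpaint the anonymized regions from their surroundings.
    inpainted = cv2.inpaint(image, invalid.astype(np.uint8) * 255, 3, cv2.INPAINT_TELEA)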