The ScanNet++ dataset currently consists of 380 scenes.
The data download contains one folder per scene with the laser scan,
DSLR, and iPhone data, plus several metadata files. The data is organized as follows:
- split/
- nvs_sem_train.txt: Training set for NVS and semantic tasks with 230 scenes
- nvs_sem_val.txt: Validation set for NVS and semantic tasks with 50 scenes
- nvs_test.txt: Test set for NVS with 50 scenes
- sem_test.txt: Test set for semantic tasks with 50 scenes
- Each file contains the list of scene IDs in the respective split
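As an illustration, a minimal sketch of reading the scene IDs of a split (the dataset root path "scannetpp" is a placeholder for your download location):

```python
from pathlib import Path

# Placeholder dataset root; adjust to where the download was extracted.
DATA_ROOT = Path("scannetpp")

def read_split(split_name: str) -> list[str]:
    """Read a split file (e.g. 'nvs_sem_train') and return its scene IDs."""
    split_file = DATA_ROOT / "split" / f"{split_name}.txt"
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

train_scenes = read_split("nvs_sem_train")
print(len(train_scenes))  # expected: 230
```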
- metadata/
- semantic_classes.txt: list of semantic classes
- instance_classes.txt: subset of semantic classes that have instances
(i.e., excludes wall, ceiling, floor, ...)
- semantic_benchmark/
- top100.txt: top 100 semantic classes for semantic segmentation benchmark
- top100_instance.txt: subset of 100 semantic classes for instance segmentation benchmark
- map_benchmark.csv: mapping from raw semantic labels to benchmark labels
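For reference, a sketch of loading the class lists and the benchmark mapping with standard Python; the column names used for map_benchmark.csv are assumptions and may differ from the actual header:

```python
import csv
from pathlib import Path

META = Path("scannetpp") / "metadata"  # placeholder dataset root

# One class name per line.
semantic_classes = META.joinpath("semantic_classes.txt").read_text().splitlines()
top100 = META.joinpath("semantic_benchmark", "top100.txt").read_text().splitlines()

# Mapping from raw semantic labels to benchmark labels.
raw_to_benchmark = {}
with open(META / "semantic_benchmark" / "map_benchmark.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names "class" and "benchmark_class" are assumed here;
        # check the actual CSV header before use.
        raw_to_benchmark[row["class"]] = row["benchmark_class"]
```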
- data/<scene_id>/
- scans/
- pc_aligned.ply: point cloud from laser scanner, axis-aligned
- pc_aligned_mask.txt: indices of anonymized points
- scanner_poses.json: contains the scanner positions, a 4x4 transformation matrix for each position
- mesh_aligned_0.05.ply: mesh decimated to 5% size, obtained from the point cloud
- mesh_aligned_0.05_mask.txt: indices of mesh vertices with anonymization applied
- mesh_aligned_0.05_semantic.ply
- The vertex "label" property contains the integer semantic label, which indexes into the classes in semantic_classes.txt
- Unlabeled vertices have the label -100
- segments.json: json_data["segIndices"] contains the segment ID for each vertex
- segments_anno.json: (a reading sketch follows this scans/ listing)
- json_data[i] corresponds to a single annotated object
- "label": the semantic label of this object
- "segments": all the segments belonging to this object
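A rough sketch of reading the per-vertex semantic labels and the object annotations, using the plyfile package; the scene path is a placeholder and the "segGroups" key is an assumption noted in the comments:

```python
import json
import numpy as np
from plyfile import PlyData  # pip install plyfile

scene_dir = "scannetpp/data/<scene_id>/scans"  # placeholder path

# Per-vertex semantic labels; -100 marks unlabeled vertices.
ply = PlyData.read(f"{scene_dir}/mesh_aligned_0.05_semantic.ply")
labels = np.asarray(ply["vertex"]["label"])

# Segment ID of each vertex.
with open(f"{scene_dir}/segments.json") as f:
    seg_indices = np.asarray(json.load(f)["segIndices"])

# Annotated objects; each entry has a "label" and the "segments" it covers.
with open(f"{scene_dir}/segments_anno.json") as f:
    json_data = json.load(f)

# Assumption: if the file wraps the objects in a "segGroups" key (ScanNet-style),
# unwrap it; otherwise iterate the top-level list directly.
objects = json_data["segGroups"] if isinstance(json_data, dict) else json_data

for obj in objects:
    vertex_mask = np.isin(seg_indices, obj["segments"])
    print(obj["label"], int(vertex_mask.sum()), "vertices")
```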
- dslr/
- resized_images: Fisheye DSLR images, resized, JPG
- resized_anon_masks: PNG. Specifies the pixels that have been anonymized (0: invalid, 255: valid pixels).
- original_images: Full resolution images, JPG
- original_anon_masks: PNG. Similar to resized masks
- colmap: contains the COLMAP camera model that has been aligned with the 3D scans, which means the poses are in metric scale. Make sure to use this if you want to do 2D-3D matching against the provided mesh.
- cameras.txt: Contains the camera type (OPENCV_FISHEYE) and the intrinsic parameters (fx, fy, cx, cy, distortion parameters)
- images.txt: Contains the extrinsics of each image: qvec (quaternion) and tvec (translation)
- points3D.txt: Contains the 3D feature points used by COLMAP
- Useful references: the Python (Open3D) visualizer provided by COLMAP
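As a sketch, one way to turn a pose line of images.txt into a 4x4 world-to-camera matrix; COLMAP stores qvec as (qw, qx, qy, qz) and tvec in the world-to-camera direction, and only standard NumPy/SciPy calls are used:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def colmap_pose_to_w2c(qvec, tvec):
    """Build a 4x4 world-to-camera matrix from COLMAP's qvec (qw, qx, qy, qz) and tvec."""
    qw, qx, qy, qz = qvec
    # SciPy expects quaternions in (x, y, z, w) order.
    R = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()
    w2c = np.eye(4)
    w2c[:3, :3] = R
    w2c[:3, 3] = tvec
    return w2c

# Example values only; real numbers come from the first of the two lines
# that images.txt stores per image.
w2c = colmap_pose_to_w2c((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
c2w = np.linalg.inv(w2c)  # camera-to-world pose
```

Since these poses are aligned with the 3D scans and therefore metric, the resulting camera-to-world matrices place the cameras directly in the mesh coordinate frame.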
- nerfstudio/transforms.json
- Contains the same camera poses in the format used by Nerfstudio (OpenGL/Blender convention), whose coordinate system differs from the OpenCV/COLMAP convention; see the conversion sketch after this block.
- poses:
- frames, test_frames: contain the poses for train and test images, respectively
- mask: filename of binary mask file
- is_bad: indicates if the image is blurry or contains heavy shadows.
- Camera model (as above):
- contained in fl_x, fl_y, ..., k1, k2, k3, k4, camera_model
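Because the conventions differ, a common conversion (a sketch, assuming the usual OpenGL/Blender camera layout with +Y up and the camera looking down -Z) is to flip the Y and Z camera axes to obtain an OpenCV-style camera-to-world matrix:

```python
import numpy as np

def opengl_to_opencv_c2w(c2w_gl: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from OpenGL/Blender to OpenCV convention.

    OpenGL/Blender cameras look down -Z with +Y up; OpenCV cameras look down +Z
    with +Y down, so the Y and Z columns of the rotation part are negated.
    """
    c2w_cv = c2w_gl.copy()
    c2w_cv[:3, 1:3] *= -1.0  # flip the Y and Z camera axes
    return c2w_cv
```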
- train_test_lists.json
- json["train"]: training images
- json["test"]: novel views, test images
- The split here is the same as the one in nerfstudio/transforms.json
- json["has_masks"]: global flag for the scene, indicating whether it has anonymization masks
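For example, a small sketch of loading the train/test split and the mask flag (the file path is a placeholder):

```python
import json

with open("scannetpp/data/<scene_id>/dslr/train_test_lists.json") as f:
    lists = json.load(f)

train_images = lists["train"]    # training image filenames
test_images = lists["test"]      # novel-view / test image filenames
has_masks = lists["has_masks"]   # whether the scene has anonymization masks
```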
- iphone/
- rgb.mp4: full RGB video, 60 FPS
- rgb_mask.mp4: Video of the anonymization masks, lossless compression. After decoding, the masks are similar to the DSLR masks.
- depth.bin: Depth images (16-bit, in millimeters) from the iPhone LiDAR sensor, packed into a single binary file. The depth images are aligned with the RGB images.
- rgb: RGB frames from the video, subsampled, obtained by running the processing script on rgb.mp4. The resolution is 1920 x 1440.
- depth: Depth images as 16-bit PNGs in millimeters, obtained by running the processing script on depth.bin. The depth images are aligned with the RGB images but at a much lower resolution, 256 x 192 (see the loading sketch below).
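A minimal sketch of loading one of these depth frames into meters, using OpenCV; the frame filename is a placeholder:

```python
import cv2
import numpy as np

# Read the 16-bit depth PNG without converting it to 8 bits.
depth_mm = cv2.imread("scannetpp/data/<scene_id>/iphone/depth/frame_000000.png",
                      cv2.IMREAD_UNCHANGED)
depth_m = depth_mm.astype(np.float32) / 1000.0  # millimeters -> meters
valid = depth_mm > 0  # zero values are treated as missing depth
```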
- pose_intrinsic_imu.json: contains ARKit poses and IMU information from the iPhone
- json["poses"] contain a 4x4 camera-to-world extrinsic matrix from raw ARKit
output.
The coordinate system is right-handed. +Z is the camera direction.
- json["intrinsic"] contains a 3x3 intrinsic matrix of the RGB image
- json["aligned_poses"] contains ARKit poses that are scaled and transformed to
our
mesh space
- The iPhone does not provide intrinsics for the LiDAR depth. The user can scale the RGB intrinsics to the LiDAR depth resolution, since RGB and depth are aligned (a sketch follows this item).
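Since the depth maps are aligned with the RGB frames, one way (a sketch, not part of the provided tooling) to obtain approximate depth intrinsics is to scale the 3x3 RGB intrinsic matrix by the resolution ratio (1920 x 1440 -> 256 x 192):

```python
import numpy as np

def scale_intrinsics(K_rgb: np.ndarray,
                     rgb_wh=(1920, 1440),
                     depth_wh=(256, 192)) -> np.ndarray:
    """Scale a 3x3 RGB intrinsic matrix to the lower depth resolution."""
    sx = depth_wh[0] / rgb_wh[0]
    sy = depth_wh[1] / rgb_wh[1]
    K_depth = K_rgb.astype(np.float64).copy()
    K_depth[0, :] *= sx  # fx, skew, cx
    K_depth[1, :] *= sy  # fy, cy
    return K_depth
```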
- nerfstudio: similar to DSLR
- colmap: similar to DSLR. Images here have been filtered based on the agreement of depth between the iPhone LiDAR and the laser scanner. The camera model is OPENCV, which has 4 distortion parameters: k1, k2, p1, p2.
- exif.json: EXIF information for each frame in the video
All data is anonymized using the magenta color with RGB value (255, 0, 255). The user may fill those regions with any color using the given binary mask.