Result details - ScanRefer Benchmark

Submitted anonymously.

Full name	Vision-Language Pre-training with Object Contrastive Learning
Description	We propose 3DVLP (3D vision-language pre-training with object contrastive learning), a vision-language pre-training framework that transfers flexibly to 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and to distinguish different objects in the scene, respectively.
Input Data Types	Uses XYZ coordinates, Uses RGB values, Uses Multiview Image Features, Uses Normal Vectors
Programming language(s)	Python
Hardware	RTX4090
Submission creation date	26 Feb, 2025
Last edited	26 Feb, 2025
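
The description says the OCC task aligns objects with their descriptions. As a rough illustration of what such an alignment objective can look like, here is a generic symmetric InfoNCE-style contrastive loss in NumPy. This is a sketch under assumptions, not the 3DVLP implementation: the function name `cross_contrastive_loss`, the one-to-one pairing of row i in `obj_feats` with row i in `text_feats`, and the `temperature` value are all hypothetical and do not come from the submission.

```python
import numpy as np

def cross_contrastive_loss(obj_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the submission's OCC loss).

    Assumes row i of obj_feats and row i of text_feats form a matching pair,
    and every other row in the batch serves as a negative.
    """
    obj = np.asarray(obj_feats, dtype=float)
    txt = np.asarray(text_feats, dtype=float)
    # L2-normalise so that dot products are cosine similarities.
    obj = obj / np.linalg.norm(obj, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = obj @ txt.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the
        # matching pair, which sits on the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the object-to-text and text-to-object directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy check: matched pairs should give a lower loss than mismatched pairs.
objects = np.eye(3)
aligned_loss = cross_contrastive_loss(objects, np.eye(3))
shuffled_loss = cross_contrastive_loss(objects, np.eye(3)[[1, 2, 0]])
```

In this toy setup `aligned_loss` is smaller than `shuffled_loss`, which is the behaviour an alignment objective is meant to reward.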

Unique acc@0.25IoU	Unique acc@0.5IoU	Multiple acc@0.25IoU	Multiple acc@0.5IoU	Overall acc@0.25IoU	Overall acc@0.5IoU
0.0038	0.0019	0.0049	0.0023	0.0047	0.0022
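
For readers unfamiliar with the metric, acc@kIoU is the fraction of descriptions whose predicted box overlaps the ground-truth box with a 3D IoU of at least k. The sketch below shows one way to compute it for axis-aligned boxes. It is not the official evaluation script: the box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the function names, and the choice of `>=` rather than `>` at the threshold are assumptions made for illustration.

```python
import numpy as np

def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    # Overlap extent along each axis, clipped at zero when the boxes are disjoint.
    lower = np.maximum(a[:3], b[:3])
    upper = np.minimum(a[3:], b[3:])
    intersection = np.prod(np.clip(upper - lower, 0.0, None))
    volume_a = np.prod(a[3:] - a[:3])
    volume_b = np.prod(b[3:] - b[:3])
    union = volume_a + volume_b - intersection
    return float(intersection / union) if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    hits = [box3d_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

# Toy data: one exact match, one partial overlap (IoU = 1/3), one miss.
predictions = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
ground_truth = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 3, 1, 1], [5, 5, 5, 6, 6, 6]]
acc_025 = acc_at_iou(predictions, ground_truth, 0.25)
acc_050 = acc_at_iou(predictions, ground_truth, 0.5)
```

On the toy data, two of three predictions clear the 0.25 threshold and only the exact match clears 0.5, so the stricter threshold always yields an equal or lower score, as in the table above.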

Results for Anonymous submission