Full name | Vision-Language Pre-training with Object Contrastive Learning |
Description | We propose a vision-language pre-training framework, 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly to 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task, which respectively align objects with their descriptions and distinguish different objects in the scene. |
Input Data Types | Uses XYZ coordinates, Uses RGB values, Uses Multiview Image Features, Uses Normal Vectors |
Programming language(s) | Python |
Hardware | RTX 4090 |
Submission creation date | 26 Feb, 2025 |
Last edited | 26 Feb, 2025 |
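The description names an Object-level Cross-Contrastive alignment (OCC) task but does not specify its loss. For illustration only, a common way to align paired object and description embeddings is a symmetric InfoNCE loss. The sketch below is not the 3DVLP implementation. It assumes that row i of the object features matches row i of the description features, uses a temperature of 0.07, and relies on NumPy. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cross_contrastive_loss(obj_feats, text_feats, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss (an assumption, not the authors' OCC).

    obj_feats:  (N, D) object-proposal embeddings
    text_feats: (N, D) description embeddings; row i is assumed to describe object i
    """
    obj = l2_normalize(obj_feats)
    txt = l2_normalize(text_feats)
    logits = obj @ txt.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(obj))
    # Object -> description: each object should pick out its own description.
    loss_o2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Description -> object: each description should pick out its own object.
    loss_t2o = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_o2t + loss_t2o)

rng = np.random.default_rng(0)
objects = rng.normal(size=(8, 16))
matched = objects + 0.01 * rng.normal(size=(8, 16))  # near-copies: well aligned
print(cross_contrastive_loss(objects, matched))        # small loss
print(cross_contrastive_loss(objects, objects[::-1]))  # mismatched pairs: large loss
```

The loss is symmetric because alignment is usually enforced in both directions. A self-contrastive term in the spirit of OSC would instead contrast object embeddings against each other within a scene. Its exact form is not given here.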