Full name | Dense 3D Visual Grounding Network |
Description | We propose ConcreteNet, a dense 3D visual grounding network featuring four novel stand-alone modules that improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module to disambiguate inter-instance relational cues; second, we construct a contrastive training scheme to induce separation in the latent space; third, we resolve view-dependent utterances via a learned global camera token; and finally, we employ multi-view ensembling to improve referred mask quality. ConcreteNet won the "3D Object Localization" challenge of the 3rd ICCV Workshop on Language for 3D Scenes. |
Publication title | Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding |
Publication authors | Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool |
Publication venue | ECCV 2024 |
Publication URL | http://arxiv.org/abs/2309.04561 |
Input Data Types | Uses XYZ coordinates, Uses RGB values |
Programming language(s) | Python |
Hardware | AMD EPYC 7742 64-Core CPU, GeForce RTX 3090, 526 GB RAM |
Website | https://ouenal.github.io/concretenet/ |
Source code or download URL | https://github.com/ouenal/concretenet |
Submission creation date | 20 Feb, 2023 |
Last edited | 26 Jul, 2024 |