Submitted by Ozan Ünal.

Submission data

Full name: Dense 3D Visual Grounding Network
Description: We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e., instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues; second, we construct a contrastive training scheme to induce separation in the latent space; third, we resolve view-dependent utterances via a learned global camera token; and finally, we employ multi-view ensembling to improve referred-mask quality. ConcreteNet won the "3D Object Localization" challenge at the ICCV 3rd Workshop on Language for 3D Scenes.
Publication title: Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
Publication authors: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool
Publication venue: ECCV 2024
Publication URL: http://arxiv.org/abs/2309.04561
Input data types: Uses XYZ coordinates, Uses RGB values
Programming language(s): Python
Hardware: AMD EPYC 7742 64-Core CPU, GeForce RTX 3090, 526 GB RAM
Website: https://ouenal.github.io/concretenet/
Source code or download URL: https://github.com/ouenal/concretenet
Submission creation date: 20 Feb, 2023
Last edited: 26 Jul, 2024
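The contrastive training scheme mentioned in the description pulls the referred instance's embedding toward its matching utterance embedding while pushing it away from same-class distractors. A minimal InfoNCE-style sketch of that idea is shown below; the function name, embedding sizes, and temperature are illustrative assumptions, not ConcreteNet's actual API.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss sketch (hypothetical, not ConcreteNet's implementation):
    the anchor (referred-instance embedding) should score highest against its
    positive (matching utterance embedding) relative to same-class distractors."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Similarity logits: positive pair first, then one logit per distractor.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits -= logits.max()  # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # negative log-likelihood of the positive pair

# Toy example: a positive near the anchor, four random distractors.
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.05 * rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(4)]
loss = contrastive_loss(anchor, positive, negatives)
```

Minimizing this loss over instance-utterance pairs induces the latent-space separation between repetitive instances that the description refers to.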

Localization

              acc@0.25IoU   acc@0.5IoU
Unique        0.8607        0.7923
Multiple      0.4746        0.4091
Overall       0.5612        0.4950