Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras


Grounded scene understanding from event streams. This work presents Talk2Event, a novel task for localizing objects from event cameras using natural language, where each unique object in the scene is described by four key attributes: (1) Appearance, (2) Status, (3) Relation-to-Viewer, and (4) Relation-to-Others. We find that modeling these attributes enables precise, interpretable, and temporally-aware grounding across diverse dynamic environments in the real world.
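For illustration, a single referring-expression annotation in this setting can be organized around the four attributes. The record below is a hypothetical sketch; the field names and values are our own assumptions, not the dataset's released schema.

```python
# Hypothetical annotation record for one referred object (illustrative only;
# field names are assumptions, not the dataset's released format).
annotation = {
    "scene_id": "scene_0001",             # driving scene identifier
    "timestamp_us": 1_250_000,            # grounding time t0 in microseconds
    "bbox": [412.0, 188.0, 96.0, 142.0],  # target box (x, y, w, h) in pixels
    "expression": "the dark sedan turning left in front of us",
    "attributes": {
        "appearance": "dark sedan",                        # what the object looks like
        "status": "turning left",                          # motion / state over time
        "relation_to_viewer": "in front of us",            # egocentric relation
        "relation_to_others": "ahead of the white truck",  # inter-object relation
    },
}
```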


Abstract

Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
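As a rough illustration of the attribute-aware fusion idea, the sketch below gates four attribute-specific feature streams with a learned softmax mixture, in the spirit of a Mixture of Event-Attribute Experts. It is a minimal PyTorch sketch under our own assumptions (module names, dimensions, and the gating input are illustrative), not the released EventRefer implementation.

```python
import torch
import torch.nn as nn

class MixtureOfAttributeExperts(nn.Module):
    """Minimal sketch: fuse per-attribute features with learned gating weights.

    Assumes four attribute experts (appearance, status, relation-to-viewer,
    relation-to-others), each producing a d-dimensional feature per object query.
    """

    def __init__(self, dim: int = 256, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        # Gate predicts one weight per expert from the shared query feature.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, query_feat: torch.Tensor) -> torch.Tensor:
        # query_feat: (batch, num_queries, dim) fused event/frame + text feature
        expert_outs = torch.stack(
            [expert(query_feat) for expert in self.experts], dim=-2
        )  # (batch, num_queries, num_experts, dim)
        weights = self.gate(query_feat).softmax(dim=-1)  # (batch, num_queries, num_experts)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=-2)

# Example: fuse features for 8 object queries in a batch of 2.
fused = MixtureOfAttributeExperts()(torch.randn(2, 8, 256))
```

The gating step is what lets the fusion adapt per query: an expression dominated by motion cues can up-weight the status expert, while a purely descriptive one can lean on appearance.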


Data Curation Pipeline

Figure. We leverage two surrounding frames to generate context-aware referring expressions at t0, covering appearance, motion, spatial relations, and interactions. Word clouds on the right highlight distinct linguistic patterns across the four grounding attributes.
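A rough sketch of such a context-aware curation loop is given below. The frame offsets, the `generate_expression` call, and the `validate` step are placeholders we introduce for illustration; they stand in for the actual annotation tooling rather than reproducing it.

```python
def curate_expressions(scene, t0, offset_us=50_000):
    """Illustrative curation loop: describe an object at t0 using temporal context.

    `scene.frame_at`, `scene.objects_at`, `generate_expression`, and `validate`
    are hypothetical helpers, not part of the released pipeline.
    """
    prev_frame = scene.frame_at(t0 - offset_us)   # context before t0
    curr_frame = scene.frame_at(t0)               # grounding time
    next_frame = scene.frame_at(t0 + offset_us)   # context after t0

    expressions = []
    for obj in scene.objects_at(t0):
        # Generate one expression per attribute from the surrounding frames,
        # so motion and interactions can be inferred from temporal change.
        for attribute in ("appearance", "status",
                          "relation_to_viewer", "relation_to_others"):
            text = generate_expression(
                frames=(prev_frame, curr_frame, next_frame),
                target=obj,
                attribute=attribute,
            )
            if validate(text, obj, curr_frame):   # human / automatic check
                expressions.append({"object_id": obj.id,
                                    "attribute": attribute,
                                    "text": text})
    return expressions
```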




Benchmark Comparison

Dataset    | Venue      | Sensor | Scene Type | Scenes | Objects | Expr.   | Avg. Len. | δa | δs | δv | δo
RefCOCO+   | ECCV'16    |        | Static     | 19,992 | 49,856  | 141,564 | 3.53      |    |    |    |
RefCOCOg   | ECCV'16    |        | Static     | 26,711 | 54,822  | 85,474  | 8.43      |    |    |    |
Nr3D       | ECCV'20    |        | Static     | 707    | 5,878   | 41,503  | -         |    |    |    |
Sr3D       | ECCV'20    |        | Static     | 1,273  | 8,863   | 83,572  | -         |    |    |    |
ScanRefer  | ECCV'20    |        | Static     | 800    | 11,046  | 51,583  | 20.3      |    |    |    |
Text2Pos   | CVPR'22    |        | Static     | -      | 6,800   | 43,381  | -         |    |    |    |
CityRefer  | NeurIPS'23 |        | Static     | -      | 5,866   | 35,196  | -         |    |    |    |
Ref-KITTI  | CVPR'23    |        | Static     | 6,650  | -       | 818     | -         |    |    |    |
M3DRefer   | AAAI'24    |        | Static     | 2,025  | 8,228   | 41,140  | 53.2      |    |    |    |
STRefer    | ECCV'24    |        | Static     | 662    | 3,581   | 5,458   | -         |    |    |    |
LifeRefer  | ECCV'24    |        | Static     | 3,172  | 11,864  | 25,380  | -         |    |    |    |
Talk2Event | Ours       |        | Dynamic    | 5,567  | 13,458  | 30,690  | 34.1      |    |    |    |

The Talk2Event Dataset

Each example below pairs an event stream, the corresponding RGB frame, and the referring expression.

Data examples of the Car class under the Daytime condition in our dataset.
Data examples of the Car class under the Nighttime condition in our dataset.
Data examples of the Truck class under the Daytime condition in our dataset.
Data examples of the Bus class under the Daytime condition in our dataset.
Data examples of the Pedestrian class under the Daytime condition in our dataset.
Data examples of the Pedestrian class under the Nighttime condition in our dataset.
Data examples of the Bicycle class under the Daytime condition in our dataset.
Data examples of the Motorcycle class under the Daytime condition in our dataset.
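To give a sense of how event-stream examples like these can be visualized next to RGB frames, here is a small sketch that accumulates raw events into a two-channel polarity image. The array layout (x, y, t, polarity) and the sensor resolution are assumptions for illustration, not the dataset's on-disk format.

```python
import numpy as np

def events_to_image(events: np.ndarray, height: int = 480, width: int = 640) -> np.ndarray:
    """Accumulate events into an (H, W, 3) visualization image.

    Assumes `events` has one row per event with columns (x, y, t, polarity),
    polarity in {0, 1}. This layout is an assumption, not the released format.
    """
    img = np.zeros((height, width, 3), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = events[:, 3].astype(int)
    # Positive events go to the red channel, negative events to the blue channel.
    np.add.at(img[..., 0], (y[pol == 1], x[pol == 1]), 1.0)
    np.add.at(img[..., 2], (y[pol == 0], x[pol == 0]), 1.0)
    return np.clip(img / max(img.max(), 1e-6), 0.0, 1.0)

# Example with random synthetic events.
demo = np.column_stack([
    np.random.randint(0, 640, 1000),   # x
    np.random.randint(0, 480, 1000),   # y
    np.sort(np.random.rand(1000)),     # t
    np.random.randint(0, 2, 1000),     # polarity
])
vis = events_to_image(demo)
```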