Visual Understanding

This part covers research related to visual understanding, focusing on Crowd Counting, Visual Grounding, Video Grounding, and Temporal Action Localization (TAL). Crowd Counting is the task of counting the people in an image. Unlike object detection, Crowd Counting must recognize targets of arbitrary size in varied situations, handling sparse and cluttered scenes at the same time. Visual Grounding aims to localize the region of a given image that a natural language query refers to. Video Grounding aims to locate the queried action or event in an untrimmed video based on a rich linguistic description.

Crowd Counting

Temporal Action Detection

Video Grounding

Visual Grounding

Human Behavior Analysis