The 4th International Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR2012)

In Conjunction with ECCV 2012

Florence, Italy, 12 October 2012



Keynote Speakers

Prof. Dong Xu, Nanyang Technological University, Singapore

Dong Xu is currently an associate professor at Nanyang Technological University, Singapore, where he leads the Visual Computing Group, working on new theories, algorithms and systems for the intelligent processing and understanding of visual data such as images and videos. He was a coauthor of the paper that won the Best Student Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2010.

Title: Classifying Images and Videos by Learning from Web Data

Abstract: Increasingly rich and massive social media data are being posted to photo and video sharing websites such as Flickr and YouTube. Keyword (also called tag) based search can readily be used to collect relevant and irrelevant Flickr images or YouTube videos, which can serve as positive and negative training data for learning classifiers to classify consumer images and videos. In this talk, I will first introduce a domain adaptation method called Adaptive Multiple Kernel Learning (A-MKL) for video event recognition, which can effectively cope with the considerable variation in feature distributions between web videos and consumer videos. I will also describe our approaches for text-based image retrieval, which use multiple instance learning (MIL) to handle noise in the loose labels of training images.

Dr. Tao Xiang, Queen Mary, University of London, UK

Dr. Tao Xiang received his Ph.D. in electrical and computer engineering from the National University of Singapore in 2002. He is currently a senior lecturer (associate professor) in the School of Electronic Engineering and Computer Science, Queen Mary, University of London. His research interests include computer vision, statistical learning, video processing, and machine learning, with a focus on interpreting and understanding human behaviour. He has published over 90 papers and a book, "Visual Analysis of Behaviour: From Pixels to Semantics".

Title: Weakly Supervised Learning for Video Tagging

Abstract: Providing methods to support semantic interaction with growing volumes of video data is an increasingly important challenge for computer vision and data mining. To this end, there has been some success in the recognition of simple objects and actions in video; however, most of this work requires strongly supervised training data. The supervision cost of these approaches therefore renders them economically non-scalable for real-world applications. This talk will focus on the problem of learning to annotate and retrieve semantic tags of actions and events in realistic video data, given only sparsely provided tags of semantically salient activities. This is challenging because of (1) the multi-label nature of the learning problem and (2) the fact that realistic videos are often dominated by (semantically uninteresting) background activity unsupported by any tags of interest, leading to a strong irrelevant-data problem. To address these challenges, a new topic-model-based approach to video tag annotation is introduced. The model simultaneously learns a low-dimensional representation of the video data, which of its dimensions are semantically relevant (i.e., supported by tags), and how to annotate videos with tags.

Technical Program

9:00 - 9:05

Opening Remarks: Ling Shao, Jianguo Zhang or Liang Wang

Keynote Speech 1 Chair: Shiguang Shan


Title: Classifying Images and Videos by Learning from Web Data
Speaker: Prof. Dong Xu, Nanyang Technological University, Singapore



Keynote Speech 2 Chair: Shiguang Shan


Title: Weakly Supervised Learning for Video Tagging
Speaker: Dr. Tao Xiang, Queen Mary, University of London, UK



Session A Chair: Shiguang Shan


Atomic Action Features: A New Feature for Action Recognition
Qiang Zhou (ADSC, Singapore), Gang Wang (NTU & ADSC, Singapore)

Spatio-Temporal SIFT and Its Application to Human Action Classification

Manal Alghamdi (University of Sheffield), Lei Zhang (Harbin Engineering University), Yoshihiko Gotoh (University of Sheffield)

Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition

Piotr Bilinski (INRIA), Francois Bremond (INRIA)

Visual Code-Sentences: A New Video Representation based on Image Descriptor Sequences

Yusuke Mitarai (Canon Inc.), Masakazu Matsugu (Canon Inc.)


Lunch Break

Session B Chair: Jingyu Yang


Action Recognition Robust to Background Clutter by using Stereo Vision
Jordi Sanchez-Riera (INRIA), Jan Cech (INRIA), Radu Horaud (INRIA)

Recognizing Unseen Actions Across Cameras by Exploring the Correlated Subspace
Chun-Hao Huang (Academia Sinica), Yi-Ren Yeh (Academia Sinica), Yu-Chiang Frank Wang (Academia Sinica)

Chinese Shadow Puppetry with an Interactive Interface Using the Kinect Sensor
Hui Zhang (United International College), Yuhao Song (United International College), Zhuo Chen (United International College), Ji Cai (United International College), Ke Lu (United International College)

Group Dynamics and Multimodal Interaction Modeling using a Smart Digital Signage

Tony Tung (Kyoto University), Randy Gomez (Kyoto University), Tatsuya Kawahara (Kyoto University), Takashi Matsuyama (Kyoto University)

Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions

Gertjan Burghouts (TNO), Patrick Hanckmann (TNO), Klamer Schutte (TNO)


Call for Papers

With the rapid growth of Internet capacity and speed, and the wide adoption of media technologies in people's daily lives, there is a pressing demand to efficiently process and organize the video events emerging rapidly from the Internet (e.g., YouTube), wide-area surveillance networks, mobile devices, smart cameras, etc. The human visual system can, without difficulty, interpret and recognize thousands of events in videos, despite high levels of object clutter, different types of scene context, variability of motion scales, appearance changes, occlusions and object interactions. For a computer vision system, however, automatic video event understanding has remained very challenging for decades. Broadly speaking, the challenges include robust detection of events under motion clutter, event interpretation in complex scenes, multi-level semantic event inference, placing events in context and across multiple cameras, event inference from object interactions, etc.

In recent years, steady progress has been made towards better models for video event categorization and recognition, e.g., from modeling events with bags of spatio-temporal features to discovering event context, from detecting events with a single camera to inferring events through a distributed camera network, and from low-level event feature extraction and description to high-level semantic event classification and recognition. However, current progress in video event analysis still falls far short of its promise: it remains very difficult to retrieve or categorise a specific video segment based on its content in a real multimedia system or in surveillance applications. Existing techniques are usually tested on simplified scenarios, such as the KTH dataset, whereas real-life applications are much more challenging and require special attention. To advance the field further, we must adapt existing approaches and find new solutions for intelligent video event understanding.

The goal of this workshop is to provide a forum for recent research advances in the area of video event categorisation, tagging and retrieval. The workshop seeks original high-quality submissions from leading researchers and practitioners in academia and industry, dealing with theories, applications and databases of visual event recognition. Depth sensors, such as the Kinect, and real-world applications, such as event analysis and recognition in videos from the Internet, surveillance cameras, and mobile devices, will be the theme of this year's workshop. Topics include, but are not limited to, the following:

  • Motion interpretation and grouping
  • Human action representation and recognition
  • Abnormal event detection
  • Contextual event inference
  • Event recognition across a distributed camera network
  • Multi-modal event recognition
  • Spatio-temporal features for event categorization
  • Hierarchical event recognition
  • Probabilistic graph models for event reasoning
  • Machine learning for event recognition
  • Global/local event descriptors
  • Metadata construction for event recognition
  • Bottom up and top down approaches for event recognition
  • Event-based video segmentation and summarization
  • Video event database gathering and annotation
  • Efficient indexing and concepts modeling for video event retrieval
  • Semantic-based video event retrieval
  • On-line video event tagging
  • Event recognition for depth cameras (Kinect)
  • Evaluation methodologies for event-based systems
  • Event-based applications (security, sports, news, etc.)

Important Dates

  • Submission Deadline: 18 July 2012 (extended)
  • Notification of Acceptance: 23 July 2012
  • Camera-Ready Submission: 1 August 2012
  • Workshop: 12 October 2012

General Chairs

  • Tieniu Tan, Chinese Academy of Sciences, China
  • Thomas S. Huang, University of Illinois at Urbana-Champaign, USA

Program Chairs

  • Ling Shao, The University of Sheffield, UK
  • Jianguo Zhang, University of Dundee, UK
  • Liang Wang, Chinese Academy of Sciences, China

Technical Program Committee  

  • Rama Chellappa, University of Maryland, USA
  • James Ferryman, University of Reading, UK
  • Gian Luca Foresti, University of Udine, Italy
  • Shaogang Gong, Queen Mary, University of London, UK
  • Ran He, Chinese Academy of Sciences, China
  • Yu-Gang Jiang, Columbia University, USA
  • Graeme A. Jones, Kingston University, UK
  • Xuelong Li, Chinese Academy of Sciences, China
  • Ram Nevatia, University of Southern California, USA
  • Carlo Regazzoni, University of Genoa, Italy
  • Shin'ichi Satoh, National Institute of Informatics, Japan
  • Ling Shao, The University of Sheffield, UK
  • Yan Song, University of Science and Technology of China, China
  • Peter Sturm, INRIA, France
  • Dacheng Tao, University of Technology, Sydney, Australia
  • Liang Wang, Chinese Academy of Sciences, China
  • Qi Wang, Chinese Academy of Sciences, China
  • Xin-Jing Wang, Microsoft Research Asia, China
  • Tao Xiang, Queen Mary, University of London, UK
  • Dong Xu, Nanyang Technological University, Singapore
  • Pingkun Yan, Chinese Academy of Sciences, China
  • Jianguo Zhang, University of Dundee, UK
  • Lei Zhang, Microsoft Research Asia, China
  • Zhang Zhang, Chinese Academy of Sciences, China



Each submission will be reviewed for originality, significance, clarity, soundness, relevance and technical content by at least two reviewers drawn from the program committee and external reviewers. Accepted papers will be published in a volume of Springer Lecture Notes in Computer Science. Selected high-quality papers will be invited for submission to a special issue of a leading computer vision journal after the workshop.