Semantic-Conditional Network for Micro-Video Summarization

Xiaowei Gu

Abstract

The goal of video summarization is to extract key information from a raw video so that long videos can be interpreted in a short time without losing much semantic information. Previous methods primarily consider the diversity and representativeness of the obtained summary without paying sufficient attention to the semantic information of the resulting frame set, especially when summaries are generated in response to user queries. In this paper, we break with convention in conditional video summarization and propose a new model, the Semantic-Conditional Network (SC-Net), that accepts user queries semantically. Technically, for each video we first retrieve the semantically relevant frames via a cross-modal retrieval model so as to capture the full semantic content of the user query. These rich semantics then serve as a semantic prior that guides the optimization of the summarization network, which produces summaries that are both diverse and representative. Furthermore, a novel one-stage training strategy reduces the training time complexity from polynomial to linear. Extensive experiments on publicly available datasets demonstrate promising results compared with state-of-the-art methods.
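The retrieval step described in the abstract — scoring video frames against a user query in a shared embedding space and keeping the most relevant ones — can be illustrated with a minimal toy sketch. This is not the paper's SC-Net implementation: the random vectors below stand in for frame and query embeddings that a real cross-modal encoder would produce, and `retrieve_relevant_frames` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def retrieve_relevant_frames(frame_embs, query_emb, top_k=3):
    """Rank frames by cosine similarity to a query embedding.

    frame_embs: (n_frames, dim) array of frame embeddings
    query_emb:  (dim,) array embedding the user query
    Returns the indices and scores of the top_k most similar frames.
    """
    # normalize so the dot product equals cosine similarity
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = f @ q
    top = np.argsort(-scores)[:top_k]  # highest-similarity frames first
    return top, scores[top]

# toy example: 10 "frames" and one "query" from a pretend shared encoder
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))
query = rng.normal(size=4)
idx, sc = retrieve_relevant_frames(frames, query)
```

In the paper's pipeline, the frames selected this way would then act as the semantic prior conditioning the summarization network, rather than being the summary themselves.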
