“Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames” accepted by the ICMR’22 conference.

We are pleased to share that the MIRROR project paper “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, authored by CERTH-ITI, was accepted for presentation at the 2022 ACM International Conference on Multimedia Retrieval.

Effectively and efficiently retrieving information based on user needs is one of the most exciting areas in multimedia research. The Annual ACM International Conference on Multimedia Retrieval (ICMR) offers a great opportunity for exchanging leading-edge multimedia retrieval ideas among researchers, practitioners and other potential users of multimedia retrieval systems.

Abstract:
In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the inability to parallelize the training process of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames’ dependencies based on global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix, and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames’ uniqueness and diversity, shows their relative contributions to the overall summarization performance.
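To illustrate the core idea of concentrated attention (restricting self-attention to non-overlapping blocks on the main diagonal of the attention matrix), here is a minimal NumPy sketch. This is not the authors' implementation: the block size, the scaled dot-product scoring, and the function names are illustrative assumptions, and the paper's uniqueness/diversity estimates are not modeled here.

```python
import numpy as np

def block_diagonal_mask(n_frames, block_size):
    """Boolean mask keeping only non-overlapping blocks on the main diagonal."""
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for start in range(0, n_frames, block_size):
        end = min(start + block_size, n_frames)
        mask[start:end, start:end] = True
    return mask

def concentrated_attention(features, block_size=4):
    """Sketch of block-diagonal self-attention over frame features.

    features: (n_frames, d) array of frame embeddings (hypothetical input).
    Returns the attended features and the attention weight matrix.
    """
    n, d = features.shape
    scores = features @ features.T / np.sqrt(d)    # scaled dot-product scores
    mask = block_diagonal_mask(n, block_size)
    scores = np.where(mask, scores, -np.inf)       # attend only within diagonal blocks
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ features, weights
```

Because each frame attends only to frames in its own block, the attention matrix is sparse by construction, which is one intuition behind the reduced parameter and computation footprint the abstract mentions.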

The paper will soon be available for download in the publications section of our website.