Abstract:
Objective Video platforms now generate massive amounts of heterogeneous video data every day, making it difficult for users to locate and obtain useful video information. Relying solely on manual screening, understanding, and description of video content is labor-intensive and inefficient. Dense video description, which localizes multiple events in a video and describes their content, therefore offers a fast and efficient way to obtain content information from large volumes of video. However, event durations in video are multi-scale and highly variable, so the features of short-term and long-term events differ greatly in scale, and existing methods struggle to extract effective features for events of different scales. In addition, videos often contain many similar or repeated frames, which leads to the extraction of a large amount of redundant features; as a result, the generated descriptions lose detail and suffer in coherence and accuracy. To address these problems, a dense video description method based on improved Mamba multi-scale feature extraction (MMFE) is proposed.
Methods First, this paper proposes a Mamba multi-scale feature extraction module that extracts and fuses multi-level features, improving the ability to capture event features at different scales and reducing the loss of detailed information. Second, to filter out redundant features, the model introduces trend-aware attention, which focuses on keyframes with significant semantic changes and thereby improves the accuracy of feature representation. Third, an event difference loss is added to encourage the model to attend to differences between events and to improve its ability to distinguish and predict different events. Multiple temporal convolutional layers are added to the encoder to perform multi-scale convolution. The decoder takes learnable event queries as input, and a skip connection is added to the captioning head to supplement contextual information and improve the model's understanding of global information, yielding a candidate set of predicted events. Specifically, the event-count prediction module in the encoder dynamically estimates the total number of candidate events, while the event boundary localization module and the event description generation module predict the start and end times and the textual descriptions of events in parallel. Finally, the model ranks events by confidence to produce its output. (An illustrative sketch of the multi-scale extraction step is given below.)
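As an illustration of the multi-scale feature extraction and fusion described above, the following PyTorch sketch builds a small temporal feature pyramid: each level halves the temporal resolution with a strided 1-D convolution, applies a sequence block (a plain convolutional block here as a stand-in for the paper's Mamba block), and all levels are upsampled and summed back to the original length. The module name, layer choices, and fusion by summation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleTemporalPyramid(nn.Module):
    """Sketch of a multi-scale temporal feature pyramid (illustrative only)."""

    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        # Strided convolutions halve the temporal length at each level.
        self.downs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_levels - 1)]
        )
        # Placeholder sequence blocks; the paper uses Mamba blocks instead.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                           nn.GELU())
             for _ in range(num_levels)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, frames)
        feats, cur = [], x
        for i, block in enumerate(self.blocks):
            feats.append(block(cur))
            if i < len(self.downs):
                cur = self.downs[i](cur)
        # Upsample every level to the original length and fuse by summation.
        fused = feats[0]
        for f in feats[1:]:
            fused = fused + F.interpolate(f, size=x.shape[-1],
                                          mode="linear", align_corners=False)
        return fused


if __name__ == "__main__":
    frames = torch.randn(2, 512, 64)                       # (batch, channels, frames)
    print(MultiScaleTemporalPyramid(512)(frames).shape)    # torch.Size([2, 512, 64])
```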
Results and Discussions Experimental results on the ActivityNet Captions dataset show that MMFE achieves recall, precision, and F1 scores of 59.85%, 60.45%, and 60.15%, respectively, which are 4.43%, 2.38%, and 3.44% higher than those of the second-best method, PDVC. For caption generation, MMFE achieves BLEU-4, CIDEr, and METEOR scores of 2.67%, 37.78%, and 8.79%, respectively, which are 0.71%, 9.19%, and 0.71% higher than those of PDVC. These results indicate that the descriptions generated by MMFE are more accurate, providing an effective tool for improving the efficiency of online information dissemination, strengthening information security supervision, and supporting the construction of an intelligent society.
Conclusions This paper uses the Mamba multi-scale feature extraction module to effectively extract video features, performing multi-level extraction and fusion of features at different scales through a pyramid structure, which effectively addresses the insufficient extraction of multi-scale event features. To reduce the influence of redundant feature information, the model adopts the trend-aware attention mechanism (sketched below), focusing on keyframes with significant semantic changes and thereby improving the accuracy and discriminability of feature representation. An event difference loss is then added to encourage the model to attend to differences between events and to improve its ability to distinguish and predict different events. Finally, a skip connection is added to the description generation head so that the model better understands contextual content.
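One possible way to realize the trend-aware attention mentioned above is sketched below: a standard scaled dot-product attention whose scores are biased toward frames whose features change sharply relative to their neighbors, using the frame-to-frame feature difference as a rough proxy for "significant semantic change". This is a minimal sketch under that assumption; the paper's actual formulation of trend-aware attention may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrendAwareAttention(nn.Module):
    """Illustrative attention biased toward frames with large feature change."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        # Frame-to-frame difference magnitude as a crude "trend" signal.
        diff = x[:, 1:] - x[:, :-1]
        trend = F.pad(diff.norm(dim=-1), (1, 0))            # (batch, frames)
        # Scaled dot-product attention with an additive bias on the keys,
        # so frames with sharper semantic change receive more attention.
        attn = (self.q(x) @ self.k(x).transpose(-2, -1)) * self.scale
        attn = attn + trend.unsqueeze(1)
        attn = attn.softmax(dim=-1)
        return attn @ self.v(x)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 512)                          # (batch, frames, dim)
    print(TrendAwareAttention(512)(clip).shape)             # torch.Size([2, 64, 512])
```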