Dense video captioning method based on improved Mamba multi-scale feature extraction

  • Abstract: Dense video captioning aims to extract multiple key events from a video and generate coherent textual descriptions for them, with broad applications in automatic narration, human-computer interaction, video retrieval, and daily-life assistance for visually impaired people. Existing methods extract multi-scale (short-term and long-term) event features insufficiently and accumulate redundant features from repeated or similar frames, so the descriptions they generate lose detail and suffer in coherence and accuracy. To address these problems, this study proposes a dense video captioning model based on Mamba multi-scale feature extraction (Mamba Multi-scale Feature Extraction for dense video caption, MMFE). First, a Mamba multi-scale feature extraction module is proposed: Mamba strengthens the capture of long-range dependencies, and multi-level feature extraction and fusion resolves the insufficient extraction of short-term and long-term multi-scale event features. Second, trend-aware attention is introduced to focus on keyframes with significant semantic changes, removing the redundancy caused by repeated or similar frames and improving the accuracy of the feature representation. Third, an event difference loss is added to make the model attend to the feature differences between events with different content in long videos, improving its ability to distinguish and predict diverse events. Finally, a skip connection is introduced into the caption head so that previously generated descriptions are selectively incorporated into the current decoding step; by referring to the overall narrative of the video, this supplements contextual information and improves the model's understanding of global information. Experiments on the ActivityNet Captions dataset show that, for localizing events of different durations (short-term and long-term), MMFE achieves a recall of 59.85%, a precision of 60.45%, and an F1 score of 60.15%, exceeding the second-best method PDVC by 4.43%, 2.38%, and 3.44%, respectively. For describing diverse events, MMFE obtains BLEU4, CIDEr, and METEOR scores of 2.67%, 37.78%, and 8.79%, which are 0.71%, 9.19%, and 0.71% higher than those of PDVC. These results indicate that the descriptions generated by MMFE are more accurate, providing an effective tool for improving the efficiency of online information dissemination, strengthening information security supervision, and advancing the construction of an intelligent society.
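The central component is the Mamba multi-scale feature extraction module, which extracts features at several temporal levels and fuses them. The following PyTorch sketch is only a minimal illustration of such a pyramid: the class name MultiScaleSequenceEncoder, the level count, and the feature dimensions are assumed for the example, and a bidirectional GRU is used as a stand-in where the actual Mamba block would sit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSequenceEncoder(nn.Module):
    """Illustrative pyramid encoder: each level halves the temporal
    resolution, runs a sequence model over it, and the levels are
    upsampled back and fused. A GRU stands in for the Mamba block."""

    def __init__(self, d_model: int = 256, levels: int = 3):
        super().__init__()
        self.levels = levels
        # Strided 1-D convolutions build the temporal pyramid.
        self.downsample = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
            for _ in range(levels - 1)
        )
        # Placeholder long-range sequence model per level (a Mamba block would go here).
        self.seq_blocks = nn.ModuleList(
            nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
            for _ in range(levels)
        )
        self.fuse = nn.Linear(levels * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model) frame-level video features
        pyramid, cur = [x], x
        for down in self.downsample:
            cur = down(cur.transpose(1, 2)).transpose(1, 2)
            pyramid.append(cur)
        T = x.size(1)
        outs = []
        for feats, block in zip(pyramid, self.seq_blocks):
            h, _ = block(feats)                       # per-level temporal modelling
            h = F.interpolate(h.transpose(1, 2), size=T,
                              mode="linear", align_corners=False).transpose(1, 2)
            outs.append(h)                            # back to full temporal resolution
        return self.fuse(torch.cat(outs, dim=-1))     # multi-level feature fusion


if __name__ == "__main__":
    frames = torch.randn(2, 64, 256)                  # dummy clip features
    print(MultiScaleSequenceEncoder()(frames).shape)  # -> torch.Size([2, 64, 256])
```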

     

    Abstract:
    Objective Video platforms now produce massive amounts of heterogeneous video data every day, making it difficult for users to find and obtain useful video information. Relying solely on manual screening, understanding, and description of video content is labor-intensive and inefficient. Dense video captioning, which identifies the content and temporal location of multiple events in a video, offers a fast and efficient way to extract content information from large volumes of video. However, event durations in videos are multi-scale and vary randomly, so the features of short-term and long-term events differ greatly in scale, which makes it difficult for existing methods to extract effective features for events at different scales. In addition, videos usually contain many similar or repeated frames, from which a large amount of redundant features are extracted. As a result, the generated descriptions lose detail, and their coherence and accuracy suffer. To address these problems, a dense video captioning method based on improved Mamba multi-scale feature extraction is proposed.
    Methods First, a Mamba multi-scale feature extraction module is proposed to extract and fuse multi-level features, improving the ability to capture event features at different scales and reducing the loss of detail. Second, to filter redundant features, trend-aware attention is introduced so that the model focuses on keyframes with significant semantic changes, improving the accuracy of feature representation. Third, an event difference loss is added so that the model attends to differences between events, improving its ability to distinguish and predict different events. The encoder is further equipped with multiple temporal convolutional layers for multi-scale convolution. The decoder takes learnable event queries as input, and a skip connection is added to the caption head to supplement contextual information and strengthen the model's understanding of global information, finally yielding the set of candidate events. Specifically, the event-count prediction module in the encoder dynamically estimates the total number of candidate events, while the event boundary localization module and the event caption generation module predict the start and end times and the textual description of each event in parallel. The model finally ranks events by confidence to produce its output.
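The decoding pipeline described above, in which learnable event queries are mapped in parallel to confidence scores, start and end boundaries, and captions, and candidates are then ranked by confidence, can be illustrated with the following minimal sketch. The module name ParallelEventHeads, the single-linear-layer caption head, and all dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ParallelEventHeads(nn.Module):
    """Illustrative parallel heads over decoded event queries:
    confidence, start/end boundaries, and caption logits per query."""

    def __init__(self, d_model: int = 256, vocab: int = 1000, max_words: int = 20):
        super().__init__()
        self.confidence = nn.Linear(d_model, 1)               # is this query a real event?
        self.boundary = nn.Linear(d_model, 2)                 # normalized (start, end)
        self.caption = nn.Linear(d_model, max_words * vocab)  # toy stand-in caption head
        self.vocab, self.max_words = vocab, max_words

    def forward(self, queries: torch.Tensor, top_k: int = 10):
        # queries: (batch, num_queries, d_model) decoded event queries
        conf = self.confidence(queries).squeeze(-1).sigmoid()
        bounds = self.boundary(queries).sigmoid()             # fractions of video length
        words = self.caption(queries).view(*queries.shape[:2], self.max_words, self.vocab)
        # Rank candidate events by confidence and keep the top-k as output.
        order = conf.argsort(dim=1, descending=True)[:, :top_k]
        return conf, bounds, words, order


if __name__ == "__main__":
    q = torch.randn(2, 50, 256)                               # dummy decoded queries
    conf, bounds, words, order = ParallelEventHeads()(q)
    print(conf.shape, bounds.shape, words.shape, order.shape)
```

In a full model these heads would sit on top of the decoder output and the caption head would be autoregressive; the sketch only makes the parallel-prediction and confidence-ranking arrangement concrete.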
    Results and Discussions Experimental results on the ActivityNet Captions dataset show that, for event localization, the recall, precision, and F1 score of MMFE reach 59.85%, 60.45%, and 60.15%, respectively, which are 4.43%, 2.38%, and 3.44% higher than those of the second-best method, PDVC. For event captioning, the BLEU4, CIDEr, and METEOR scores of MMFE are 2.67%, 37.78%, and 8.79%, respectively, 0.71%, 9.19%, and 0.71% higher than those of PDVC. These results indicate that the descriptions generated by MMFE are more accurate, providing an effective tool for improving the efficiency of online information dissemination, enhancing information security supervision, and promoting the construction of an intelligent society.
    Conclusions This paper uses a Mamba multi-scale feature extraction module to extract video features effectively, extracting and fusing features at different scales through a multi-level pyramid structure, which addresses the insufficient extraction of multi-scale event features. To reduce the influence of redundant feature information, the model adopts a trend-aware attention mechanism that focuses on keyframes with significant semantic changes, markedly improving the accuracy and discriminability of the feature representation. An event difference loss is then added so that the model attends to differences between events, improving its ability to distinguish and predict different events. Finally, a skip connection is added to the caption generation head so that the model better understands the contextual content.
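As a reading of the trend-aware attention idea summarized above, the toy function below re-weights frame features by the magnitude of change relative to the preceding frame, so that near-duplicate frames contribute less. The function name, the softmax temperature, and the rescaling are assumptions made for illustration and are not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def trend_aware_reweight(frames: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Toy trend-aware weighting: frames whose features differ strongly
    from the previous frame (large semantic change) receive higher weight,
    so near-duplicate frames contribute less redundant information.

    frames: (batch, T, d) frame-level features.
    returns: (batch, T, d) re-weighted features.
    """
    # Magnitude of change relative to the previous frame; the first frame
    # is assigned a unit change score so that every frame receives a weight.
    delta = frames[:, 1:] - frames[:, :-1]
    change = delta.norm(dim=-1)                                  # (batch, T-1)
    change = torch.cat([change.new_ones(frames.size(0), 1), change], dim=1)
    weights = F.softmax(change / temperature, dim=1)             # emphasize change-points
    return frames * weights.unsqueeze(-1) * frames.size(1)       # keep average weight near one


if __name__ == "__main__":
    x = torch.randn(2, 64, 256)
    print(trend_aware_reweight(x).shape)   # torch.Size([2, 64, 256])
```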

     
