Multimodal Feature Fusion for Fine-Grained Weakly Supervised Video Anomaly Detection
DOI:
Author:
Affiliation:

1. School of Information Network Security, People's Public Security University of China; 2. Graduate School, People's Public Security University of China

Author biography:

Corresponding author:

CLC number:

TP 391

Fund Project:

Double First-Class Special Project in Security and Protection Engineering, People's Public Security University of China (No. 2023SYL08)

    Abstract:

    To address the limited performance of existing video anomaly detection methods on fine-grained anomaly event prediction, this paper proposes ClipFusionVAD (CLIP-based Multimodal Feature Fusion for Fine-grained Video Anomaly Detection), a method built on the Contrastive Language-Image Pre-training (CLIP) model. First, the pre-trained CLIP model extracts features from video frames and from text separately. A feature aggregator module and a feature projector module then fuse these features across different dimensions, which improves the representation capability of the multimodal features and enables accurate fine-grained anomaly event classification. Next, a Global Attention Module (GAM) that combines channel attention and spatial attention is introduced to enhance the discriminative ability of the visual features. Experimental results show that ClipFusionVAD performs well on fine-grained anomaly detection, reaching 9.4% mAP on the UCF-Crime dataset and 28.07% mAP on the XD-Violence dataset, which supports the effectiveness of the proposed method.
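    The abstract does not give implementation details, so the following is only a minimal sketch of the general CLIP-style idea it describes: per-frame visual features are aggregated into a video-level vector and compared with one text feature per anomaly class. Mean pooling as the aggregator, the temperature of 0.07, the helper names (`mean_pool`, `classify_video`) and the toy 3-dimensional vectors are all assumptions made for illustration. They are not the authors' aggregator, projector, or GAM modules, and real CLIP features would come from the pre-trained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_feats):
    """Average per-frame features into one video-level vector.

    Assumption: a plain mean stands in for the paper's feature aggregator.
    """
    n = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(f[d] for f in frame_feats) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(frame_feats, text_feats, temperature=0.07):
    """Return class probabilities from cosine similarity of video and text features."""
    video = l2_normalize(mean_pool(frame_feats))
    logits = []
    for text in text_feats:
        text = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(video, text))
        logits.append(cosine / temperature)
    return softmax(logits)


# Toy data (hypothetical): three frames and three class text features.
frame_feats = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
class_names = ["fighting", "explosion", "normal"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = classify_video(frame_feats, text_feats)
predicted = class_names[probs.index(max(probs))]
print(predicted, [round(p, 4) for p in probs])
```

    In this toy setup the frames lie closest to the first text vector, so that class receives the highest probability. A lower temperature sharpens the distribution over classes, while a higher one flattens it.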

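Cite this article

张鑫奕, 刘特立, 李文斌, et al. Multimodal feature fusion for fine-grained video anomaly detection [J]. 科学技术与工程 (Science Technology and Engineering), , ():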

History
  • Received: 2025-11-10
  • Last revised: 2026-04-15
  • Accepted: 2026-04-21
  • Published online:
  • Publication date: