Abstract: To address the shortcomings of existing video anomaly detection methods in fine-grained anomaly event prediction, a CLIP-based Multimodal Feature Fusion for Fine-grained Video Anomaly Detection (ClipFusionVAD) method, built upon the Contrastive Language-Image Pre-training (CLIP) model, is proposed. First, the pre-trained CLIP model is used to extract features from video frames and text separately. A feature aggregator module and a feature projector module are designed to fuse features across different dimensions, which improves the representation capability of the multimodal features and enables accurate fine-grained anomaly event classification. Subsequently, a Global Attention Module (GAM) combining a channel attention mechanism and a spatial attention mechanism is introduced to enhance the discriminative ability of the visual features. The results show that, on fine-grained anomaly detection tasks, ClipFusionVAD reaches 9.4% mAP on the UCF-Crime dataset and 28.07% mAP on the XD-Violence dataset. These results indicate that the proposed method is effective for fine-grained video anomaly detection.
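
As a rough illustration of the CLIP-style matching between visual and text features that the abstract describes, the following minimal sketch pools per-frame features into one video-level feature and scores it against one text feature per anomaly class. It is not the ClipFusionVAD implementation: the mean-pooling aggregator, the cosine-similarity scoring, the temperature value of 0.07, and the names `aggregate_frames` and `classify` are all assumptions made for this example, and the toy vectors stand in for real CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate_frames(frame_feats):
    """Mean-pool per-frame features into one video-level feature.

    Assumption: a plain average stands in for the feature aggregator
    module, whose actual design is not specified in the abstract.
    """
    n_frames = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(f[d] for f in frame_feats) / n_frames for d in range(dim)]


def classify(frame_feats, text_feats, temperature=0.07):
    """Return softmax probabilities over classes from cosine similarities.

    Assumption: one text feature per anomaly class; the temperature is a
    hypothetical choice for this sketch.
    """
    video = l2_normalize(aggregate_frames(frame_feats))
    logits = []
    for text in text_feats:
        text = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(video, text))
        logits.append(cosine / temperature)
    # Subtract the maximum logit for a numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional features standing in for CLIP embeddings.
frame_feats = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two hypothetical classes
probs = classify(frame_feats, text_feats)
```

Here `probs` holds one probability per class, and the class whose text feature is closest in direction to the pooled video feature receives the largest value. In a full pipeline, the toy vectors would be replaced by the fused features produced by the aggregator, projector, and attention modules.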