Abstract: To improve the computational efficiency and cross-pathway feature interaction of the SlowFast network for action recognition, this study adds a Video Dynamic Sparse Token (VDST) selection module and a Bidirectional Gated Cross-Attention Module (Bi-CAM) to the original dual-pathway framework. The VDST module adaptively selects key spatiotemporal regions in the fast pathway, and Bi-CAM establishes fine-grained bidirectional semantic fusion between the slow and fast pathways; together they allow the trade-off between computational complexity and recognition performance in the improved SlowFast architecture to be examined. The results show that the VDST module reduces redundant feature computation, significantly lowering FLOPs while maintaining accuracy; that Bi-CAM strengthens semantic interaction across pathways and yields more comprehensive action representations; and that the combined model achieves a Top-1 accuracy of 95.5% on the UCF101 dataset with only 33.6 GFLOPs. The proposed multi-module SlowFast framework thus substantially improves computational efficiency while preserving high recognition accuracy, providing a feasible solution for efficient video understanding.