Abstract: A motion-enhanced two-stream network is proposed to address the low recognition accuracy caused by insufficient extraction of motion features in current action recognition methods. The network consists of a spatial stream and a temporal stream with identical structures but different inputs: the spatial stream takes a sequence of video frames, while the temporal stream takes a sequence of frame differences. Both streams use ResNet-50 as the backbone, in which the 3 × 3 convolutions are replaced by the global motion feature module and local motion feature module proposed in this article, so that the video's motion information is fully extracted; the outputs of the spatial and temporal streams are then fused to produce the final result. The model reaches accuracies of 96.8% on UCF101 and 75.3% on HMDB51, showing certain advantages over traditional algorithms.
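To make the two-stream layout described above concrete, the following is a minimal, hypothetical sketch (not the authors' code): both streams share a ResNet-50-style backbone with separate weights, the spatial stream receives RGB frames, the temporal stream receives consecutive-frame differences, and class scores are fused by simple averaging. The proposed global and local motion feature modules are not specified in the abstract and are therefore omitted here; the `torchvision` backbone and the averaging fusion are assumptions.

```python
# Illustrative sketch only: a generic two-stream classifier assuming a
# torchvision ResNet-50 backbone and late fusion by score averaging.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 101):
        super().__init__()
        # Spatial and temporal streams have identical structure but separate weights.
        self.spatial = resnet50(num_classes=num_classes)
        self.temporal = resnet50(num_classes=num_classes)

    @staticmethod
    def frame_difference(frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        # Subtracting consecutive frames yields the temporal-stream input.
        return frames[:, 1:] - frames[:, :-1]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = frames.shape
        diffs = self.frame_difference(frames)  # (b, t-1, c, h, w)

        # Per-frame predictions averaged over time (a common aggregation choice;
        # the paper's exact fusion scheme is not given in the abstract).
        spatial_logits = self.spatial(frames.reshape(b * t, c, h, w))
        spatial_logits = spatial_logits.reshape(b, t, -1).mean(dim=1)

        temporal_logits = self.temporal(diffs.reshape(b * (t - 1), c, h, w))
        temporal_logits = temporal_logits.reshape(b, t - 1, -1).mean(dim=1)

        # Late fusion: average the class scores of the two streams.
        return (spatial_logits + temporal_logits) / 2


if __name__ == "__main__":
    model = TwoStreamNet(num_classes=101)          # e.g. UCF101 has 101 classes
    clip = torch.randn(2, 8, 3, 224, 224)          # 2 clips of 8 RGB frames each
    scores = model(clip)                           # -> (2, 101) class scores
    print(scores.shape)
```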