Abstract: Environmental sound classification has become an important task in computer audition. It complements computer vision by helping devices better understand their environment and user needs, and it has broad application prospects that can benefit everyday life. In recent years, Transformer models with self-attention mechanisms have been adopted for environmental sound classification. However, existing models require large amounts of memory, rely on pre-trained vision models, and do not extract audio features well. To address these problems and improve classification accuracy, a new dual-branch environmental sound classification model, Swin Conformer, is proposed. The model combines a convolutional neural network with a Swin Transformer that uses window self-attention, fuses the features of the two branches interactively, and introduces a token semantic module. Swin Conformer achieves classification accuracies of 98.1% and 96.8% on the public ESC-50 and UrbanSound8K datasets, respectively. These accuracies are higher than those of existing models, which demonstrates the feasibility and superiority of the model for environmental sound classification.
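
To illustrate the window self-attention idea mentioned above, the following is a minimal sketch and not the Swin Conformer implementation. It assumes a toy 8 x 8 feature map with 2 channels, a window size of 4, and identity query/key/value projections. The names `window_partition`, `self_attention`, and `window_self_attention` are hypothetical helpers introduced only for this example. It shows that restricting attention to local windows shrinks the number of attention scores compared with global attention over all positions.

```python
import math

def window_partition(grid, window):
    """Split an H x W grid of feature vectors into non-overlapping window x window groups."""
    height, width = len(grid), len(grid[0])
    assert height % window == 0 and width % window == 0
    return [
        [grid[r][c] for r in range(r0, r0 + window) for c in range(c0, c0 + window)]
        for r0 in range(0, height, window)
        for c0 in range(0, width, window)
    ]

def self_attention(tokens):
    """Scaled dot-product self-attention.

    Assumption: identity projections (query = key = value = token),
    a simplification; a real model would use learned projections.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)])
    return outputs

def window_self_attention(grid, window):
    """Apply self-attention independently inside each window."""
    return [self_attention(tokens) for tokens in window_partition(grid, window)]

# Toy 8 x 8 feature map with 2 channels per position (illustrative values only).
grid = [[[math.sin(r + c), math.cos(r - c)] for c in range(8)] for r in range(8)]
out = window_self_attention(grid, window=4)

num_windows, tokens_per_window = len(out), len(out[0])
windowed_scores = num_windows * tokens_per_window ** 2  # scores computed per window
global_scores = (8 * 8) ** 2                            # scores for attention over all positions
print(num_windows, tokens_per_window, windowed_scores, global_scores)  # 4 16 1024 4096
```

In this toy setting the windowed variant computes 1024 attention scores instead of 4096, which hints at why window-based attention can reduce memory use. The shifted-window scheme of Swin Transformer and the interaction with the convolutional branch are not modeled here.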