Abstract: Speech synthesis, voice conversion, and related technologies are gradually becoming mainstream methods for generating speech, which poses potential risks to social stability and national security. To further improve the detection accuracy of synthesized and converted forged speech, three hybrid network models based on a CNN-RNN-DNN architecture are proposed, together with a study of feature selection: CNN-LSTM-DNN, CNN-GRU-DNN, and CNN-BiLSTM-DNN. In each model, the CNN part performs subsampling, the RNN part models the temporal structure of speech, and the DNN part performs classification; each hybrid model contains 20 network layers. Six extracted acoustic features were tested, among which the CNN-LSTM-DNN+MFCC combination performed best, achieving an equal error rate (EER) of 5.79%, a 28.43% relative reduction compared with the B02 baseline system provided by ASVspoof2019. The performance of the three hybrid networks combined with the six features is also compared. The results show that the hybrid network models proposed in this paper offer stable performance and high accuracy, and that the MFCC feature and the MFCC+LFCC fused feature are the best match for these hybrid networks.
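The 28.43% figure is a relative EER reduction, which can be checked with simple arithmetic. Note that the abstract does not state the baseline EER itself; the 8.09% value below is an assumption, consistent with the reported numbers:

```python
# Sketch of the relative EER reduction computation behind the abstract's 28.43% claim.
# baseline_eer is an ASSUMED value (not given in the abstract); proposed_eer is
# the reported CNN-LSTM-DNN + MFCC result.
baseline_eer = 8.09   # assumed B02 baseline EER (%)
proposed_eer = 5.79   # reported CNN-LSTM-DNN + MFCC EER (%)

relative_reduction = (baseline_eer - proposed_eer) / baseline_eer * 100
print(f"{relative_reduction:.2f}%")  # prints "28.43%"
```

Under this assumed baseline, the stated absolute EER (5.79%) and relative reduction (28.43%) are mutually consistent.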