Abstract: To enhance the reliability and accuracy of speaker recognition in the courtroom, and to support the shift toward a scientific evaluation paradigm for forensic voice analysis methods and procedures, a method for automatic forensic speaker recognition based on an improved ECAPA-TDNN network architecture is proposed. The method integrates spatial, channel, and multi-head attention mechanisms to improve the model's accuracy and generalization capability. The network model is trained on spectrogram and GFCC features, selecting whichever yields the better training performance as the input. The trained neural network serves as a deep feature extractor, and the strength of the speech evidence is then evaluated with a likelihood-ratio quantification framework designed specifically for courtroom evidence. Experimental results show that on the VoxCeleb1 dataset the system achieves a value of 0.156, outperforming previously published automatic forensic speaker recognition systems. On the Chinese AISHELL dataset, both the false acceptance rate and the false rejection rate reach zero, with a minimum likelihood ratio of 3.97×10^6 among trials supporting the same-speaker (homogeneity) hypothesis and a maximum likelihood ratio of 1.52×10^-31 among trials supporting the different-speaker (heterogeneity) hypothesis. The method therefore further improves the reliability and accuracy of the recognition system, providing robust support for conclusions drawn from speech evidence in the courtroom.
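To illustrate the likelihood-ratio evaluation step mentioned in the abstract, the following is a minimal sketch, not the paper's implementation: it assumes speaker embeddings from the deep feature extractor are compared by cosine similarity, and that same-speaker and different-speaker calibration scores are each modelled with a Gaussian. The function names and the Gaussian score model are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def likelihood_ratio(score: float,
                     same_scores: np.ndarray,
                     diff_scores: np.ndarray) -> float:
    """LR = p(score | same speaker) / p(score | different speakers).

    Each hypothesis is modelled as a Gaussian fitted to calibration
    scores (an illustrative choice, not the paper's stated model).
    LR >> 1 supports the homogeneity (same-speaker) hypothesis;
    LR << 1 supports the heterogeneity (different-speaker) hypothesis.
    """
    p_same = norm.pdf(score, loc=same_scores.mean(), scale=same_scores.std())
    p_diff = norm.pdf(score, loc=diff_scores.mean(), scale=diff_scores.std())
    return p_same / p_diff

# Usage with synthetic calibration scores (for illustration only):
rng = np.random.default_rng(0)
same_scores = rng.normal(0.7, 0.1, 1000)   # same-speaker trial scores
diff_scores = rng.normal(0.1, 0.1, 1000)   # different-speaker trial scores
print(likelihood_ratio(0.65, same_scores, diff_scores))  # LR > 1
print(likelihood_ratio(0.15, same_scores, diff_scores))  # LR < 1
```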