Abstract: To capture a speaker's pronunciation characteristics, we propose a bionics-inspired method based on spectrogram statistics: a characteristic spectrogram, formed as a linear superposition of short-time spectrograms, gives a stable representation of the speaker's pronunciation. To address the slow training and recognition speeds of speaker recognition systems on resource-constrained devices, we propose an adaptive-clustering self-organizing feature map (AC-SOM) algorithm based on the traditional SOM neural network. It automatically adjusts the number of neurons in the competition layer according to the number of speakers to be recognized, until the number of clusters matches the number of speakers. We also built a 100-speaker database of characteristic spectrogram samples and applied our AC-SOM model to it, yielding a maximum training time of only 304 s and a maximum per-sample recognition time of under 28 ms. Compared with other approaches applied to the same number of speakers, our method offers greatly improved training and recognition speeds, so it can more easily satisfy the real-time data processing and execution requirements of edge intelligence systems than previous speaker recognition methods.
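The characteristic-spectrogram idea can be illustrated with a short sketch. This is our reading of the abstract, not the authors' implementation: the frame length, hop size, Hann window, and truncation-to-shortest-utterance step are all assumptions made so the spectrograms can be superposed element-wise.

```python
import numpy as np

def short_time_spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram from framed, Hann-windowed FFTs.
    Frame length and hop are illustrative choices, not the paper's."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Shape: (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

def characteristic_spectrogram(utterances, n_fft=256, hop=128):
    """Linear superposition (here, a mean) of the short-time
    spectrograms of several utterances from one speaker.
    Spectrograms are truncated to the shortest frame count so
    they can be added element-wise -- an assumption on our part."""
    specs = [short_time_spectrogram(u, n_fft, hop) for u in utterances]
    min_frames = min(s.shape[0] for s in specs)
    return np.mean([s[:min_frames] for s in specs], axis=0)
```

Averaging many short-time spectrograms suppresses utterance-specific variation, leaving a statistic that is more stable across recordings of the same speaker, which is the property the abstract attributes to the characteristic spectrogram.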