Abstract: Speech synthesis for minority languages contributes to the preservation, protection, and development of national culture, yet research results in this field remain limited. To address synthesis errors in which words with different tones sound similar, this paper proposes a sub-syllable representation-based text-to-speech method for the Hmong language. The method uses sub-syllables as training primitives to represent Hmong pronunciation accurately, so that similar sounds in different syllables are learned distinctly. Because the alignment between the text sequence and the Mel-spectrogram is monotonic, a monotonic alignment loss is introduced to guide the attention module toward more accurate alignments, thereby reducing word skipping and repetition, errors inherent in the autoregressive attention mechanism. To verify the effectiveness of the proposed method, a self-built Hmong speech synthesis corpus, HmongSpeech (download link: http://sxjxsf.gzmu.edu.cn/info/1728/1214.htm), is used as the benchmark dataset, and comparative experiments are conducted against typical speech synthesis methods. The experimental results show that the proposed method reduces the synthesis error rate caused by the similar pronunciation of words with different tones. Notably, the word error rate is only 0.96%, outperforming the baseline method by 6.25%.
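The abstract does not give the exact form of the monotonic alignment loss, so the following is only an illustrative sketch of one common way to encourage near-diagonal attention: a guided-attention-style penalty that grows as attention mass moves away from the diagonal of the text-by-frame alignment matrix. The function names, the width parameter `g`, and the toy matrices are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def diagonal_penalty(num_text, num_frames, g=0.2):
    """Penalty matrix that is 0 on the normalized diagonal and approaches 1 far from it.

    Assumed form (not from the paper): W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)).
    """
    n = np.arange(num_text)[:, None] / num_text      # normalized text positions, shape (N, 1)
    t = np.arange(num_frames)[None, :] / num_frames  # normalized frame positions, shape (1, T)
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))


def monotonic_alignment_loss(attention, g=0.2):
    """Mean of attention weights multiplied by the off-diagonal penalty.

    `attention` has shape (num_text, num_frames); lower values mean the
    attention mass lies closer to a monotonic, diagonal alignment.
    """
    num_text, num_frames = attention.shape
    return float(np.mean(attention * diagonal_penalty(num_text, num_frames, g)))


# Toy check: a perfectly diagonal alignment versus a reversed one.
diagonal_attention = np.eye(8)
reversed_attention = np.fliplr(np.eye(8))
loss_diagonal = monotonic_alignment_loss(diagonal_attention)
loss_reversed = monotonic_alignment_loss(reversed_attention)
print(loss_diagonal < loss_reversed)
```

In training, a term of this kind would typically be added to the spectrogram reconstruction loss with a small weight, so that alignments that skip or repeat text positions are penalized without forcing a strictly linear alignment.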