Abstract: Current speech-driven face generation techniques suffer from insufficiently robust acoustic feature representations, distortion in the generated faces, and coarse cross-modal feature fusion. To address these issues, a progressive generative adversarial network (GAN) method is proposed that integrates an efficient multi-scale attention (EMA) mechanism with adaptive condition injection. First, a speech encoder based on ResNet18 and incorporating the EMA mechanism is constructed; it uses multi-scale grouped convolution to capture the long-term temporal dependencies and identity features of speech signals. Second, a three-stage progressive generation architecture is designed in which a self-attention mechanism explicitly models the spatial geometric relationships of facial features, guiding the model to synthesize faces progressively from contours to details. Finally, a cross-modal condition injection module based on adaptive instance normalization (AdaIN) is proposed; it converts acoustic features into dynamic parameters that modulate visual features layer by layer, achieving deep fusion of cross-modal information and fine-grained control over facial attributes. Experimental results show that the method lowers the Fréchet inception distance (FID) of generated images to 30.32, achieves a Top-1 accuracy of 4.01% in face retrieval, and raises the feature cosine similarity to 0.656. It outperforms current mainstream methods in image clarity, texture detail restoration, and identity consistency.
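
As an illustration of the AdaIN-based condition injection described above, the following is a minimal NumPy sketch of a single modulation step, not the paper's implementation. It assumes the speech embedding is mapped to a per-channel scale and shift by two linear projections; the function name `adain_modulate` and the parameters `w_gamma`, `b_gamma`, `w_beta`, and `b_beta` are hypothetical, and the actual module may use a different mapping and apply it at several generator layers.

```python
import numpy as np


def adain_modulate(visual, speech_emb, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Modulate one visual feature map with a speech embedding (AdaIN-style).

    Shapes (all names are illustrative assumptions, not from the paper):
      visual:      (C, H, W) feature map from one generator layer
      speech_emb:  (D,)      embedding produced by the speech encoder
      w_gamma, w_beta: (C, D) projection weights
      b_gamma, b_beta: (C,)   projection biases
    """
    # Instance normalization: per-channel statistics over the spatial dimensions.
    mean = visual.mean(axis=(1, 2), keepdims=True)
    std = visual.std(axis=(1, 2), keepdims=True)
    normalized = (visual - mean) / (std + eps)

    # Assumed mapping: linear projections turn the speech embedding into
    # a per-channel scale (gamma) and shift (beta).
    gamma = w_gamma @ speech_emb + b_gamma
    beta = w_beta @ speech_emb + b_beta

    # Broadcast the dynamic parameters over the spatial dimensions.
    return gamma[:, None, None] * normalized + beta[:, None, None]


# Usage with random placeholder data (no real model weights involved).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8, 8))
speech_emb = rng.normal(size=(16,))
w_gamma, w_beta = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
b_gamma, b_beta = np.ones(4), np.zeros(4)
out = adain_modulate(visual, speech_emb, w_gamma, b_gamma, w_beta, b_beta)
```

After modulation, each channel's spatial mean equals the speech-derived shift and its standard deviation is approximately the magnitude of the speech-derived scale, which is how the acoustic condition steers the visual statistics in this sketch.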