Speech-driven Face Portrait Method Based on Multi-scale Attention and Adaptive Injection
Authors:

刘佳迪, 曾昭龙, 马启明

Affiliation:

School of Information Network Security, People's Public Security University of China

CLC number:

TP391

Funding:

National Key Research and Development Program of China (2024YFC3306901)

    Abstract:

    Existing speech-driven face generation methods suffer from acoustic feature representations that lack robustness, generated faces that are prone to distortion, and coarse cross-modal feature fusion. To address these problems, this paper proposes a progressive generative adversarial network (GAN) method that combines an efficient multi-scale attention (EMA) mechanism with adaptive condition injection. First, a speech encoder is built on ResNet18 with the EMA mechanism, using multi-scale grouped convolutions to capture the long-term temporal dependencies and identity features of speech signals. Second, a three-stage progressive generation architecture is designed in which a self-attention mechanism explicitly models the spatial geometric relationships among facial features, guiding the model to synthesize the face image from contours to details. Finally, a cross-modal condition injection module based on adaptive instance normalization (AdaIN) converts acoustic features into dynamic parameters that modulate visual features layer by layer, enabling deep fusion of cross-modal information and fine-grained control over facial attributes. Experimental results show that the method achieves a Fréchet inception distance (FID) of 30.32 on generated images, reaches a Top-1 face retrieval accuracy of 4.01%, and raises the feature cosine similarity to 0.656, outperforming current mainstream methods in image clarity, texture detail restoration, and identity consistency.
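The AdaIN-based condition injection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `adain` and `speech_to_style`, the single affine layer that maps the speech embedding to per-channel parameters, the choice to center the scale at 1, and all tensor sizes are assumptions made only for illustration.

```python
import numpy as np


def adain(features, scale, shift, eps=1e-5):
    """Adaptive instance normalization for one feature map of shape (C, H, W).

    Each channel is normalized to zero mean and unit variance over its
    spatial positions, then rescaled and shifted by per-channel parameters.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(features.var(axis=(1, 2), keepdims=True) + eps)
    normalized = (features - mean) / std
    return scale[:, None, None] * normalized + shift[:, None, None]


def speech_to_style(embedding, weight, bias):
    """Hypothetical affine head: speech embedding (D,) -> scale and shift, each (C,)."""
    params = weight @ embedding + bias      # shape (2C,)
    scale, shift = np.split(params, 2)      # first half scales, second half shifts
    return 1.0 + scale, shift               # scale centered at 1 (an assumption)


rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16                    # toy sizes, not taken from the paper
visual = rng.normal(size=(C, H, W))         # stand-in for a generator feature map
speech = rng.normal(size=D)                 # stand-in for the speech encoder output
weight = rng.normal(scale=0.1, size=(2 * C, D))
bias = np.zeros(2 * C)

scale, shift = speech_to_style(speech, weight, bias)
modulated = adain(visual, scale, shift)
```

After modulation, each channel's spatial mean equals its `shift` and its spatial standard deviation is approximately `|scale|`, so the statistics of the visual features are set by the speech embedding. In a layer-by-layer scheme such as the one the abstract describes, a step of this kind would presumably be repeated at each generator stage with its own parameters.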

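Cite this article

刘佳迪, 曾昭龙, 马启明. Speech-driven Face Portrait Method Based on Multi-scale Attention and Adaptive Injection[J]. Science Technology and Engineering (科学技术与工程), , ():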

History
  • Received: 2025-12-19
  • Last revised: 2026-04-11
  • Accepted: 2026-05-20
  • Published online:
  • Published: