Conventional speaker recognition methods rely largely on convolutional neural networks (CNNs) to extract local time-frequency features of speech, and they struggle to model long-range temporal dependencies and full-band spectral information, which limits their discriminative performance under complex acoustic conditions. To address this, a speaker recognition method is proposed that combines the Context-Aware Masking network (CAM++) with a Lightweight Transformer Encoder Block (LTEB). The LTEB module is inserted between the FCM and D-TDNN modules of CAM++: it uses Nyström approximate attention to model global speech dependencies of up to 10 s, strengthening the model's temporal awareness, while the D-TDNN module focuses on extracting local semantic features. The two components work together so that local perception and contextual modeling are unified, improving both speaker discriminability and computational efficiency. The model is trained on fused MFCC and FBANK features as input. On the CN-Celeb dataset, the proposed LTEB-CAM++ model reduces the equal error rate (EER) and the minimum detection cost function (minDCF) by 7.39% and 11.17%, respectively, relative to the CAM++ baseline.
Abstract
Traditional speaker recognition methods based on convolutional neural networks (CNNs) often fail to capture long-term temporal dependencies and full-frequency information, limiting their robustness in complex acoustic environments. To address this, we propose LTEB-CAM++, a hybrid architecture combining a Context-Aware Masking Network (CAM++) with a Lightweight Transformer Encoder Block (LTEB). The LTEB module, inserted between CAM++'s FCM and D-TDNN layers, leverages Nyström-based self-attention to efficiently model global speech contexts (up to 10 s), while the D-TDNN module extracts fine-grained local features. Trained on fused MFCC-FBANK features, LTEB-CAM++ reduces EER and minDCF by 7.98% and 12.58%, respectively, on CN-Celeb versus the CAM++ baseline, demonstrating superior efficiency and discriminability.
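The Nyström-based self-attention in LTEB approximates the full softmax attention map using a small set of landmark points, reducing the cost from O(n²) to roughly O(n·m) for sequence length n and m landmarks. The following is a minimal NumPy sketch of the idea (the function name, landmark count, and segment-mean landmark choice are illustrative assumptions, following the Nyströmformer construction in [8]; the actual LTEB implementation may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Nyström-approximated self-attention (illustrative sketch).

    Q, K, V: (n, d) arrays; m: number of landmark points.
    Three small softmax kernels replace the full n x n attention map.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    # Landmarks: segment means of queries and keys, as in Nyströmformer.
    idx = np.array_split(np.arange(n), m)
    Qm = np.stack([Q[i].mean(axis=0) for i in idx])   # (m, d)
    Km = np.stack([K[i].mean(axis=0) for i in idx])   # (m, d)
    F1 = softmax(Q @ Km.T * scale)                    # (n, m)
    A = softmax(Qm @ Km.T * scale)                    # (m, m)
    F2 = softmax(Qm @ K.T * scale)                    # (m, n)
    # softmax(QK^T * scale) ≈ F1 · A^+ · F2, with A^+ the pseudoinverse.
    return F1 @ np.linalg.pinv(A) @ (F2 @ V)          # (n, d)

rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 16))
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
approx = nystrom_attention(Q, K, V, m=8)
print(approx.shape)  # (64, 16)
```

With m fixed, the landmark kernels stay small even for long utterances, which is what makes modeling 10 s of context affordable inside a lightweight encoder block.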
Key words
CAM++ /
speaker recognition /
Nyström attention /
global-local feature fusion /
lightweight encoder block
References
[1] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification[C]//Proceedings of INTERSPEECH. Brno, Czech Republic: ISCA, 2020: 3830-3834.
[2] Wang H, Zheng S, Chen Y, et al. CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking[J]. arXiv preprint arXiv:2303.00332, 2023.
[3] Cao D, Wang X, Zhou J, et al. LightCAM: A Fast and Light Implementation of Context-Aware Masking Based D-TDNN for Speaker Verification[J]. arXiv preprint arXiv:2402.06073, 2024.
[4] Sang M, Zhao Y, Liu G, et al. Improving Transformer-Based Networks with Locality for Automatic Speaker Verification[J]. arXiv preprint arXiv:2302.08639, 2023.
[5] Choi J H, Yang J Y, Chang J H. Efficient Lightweight Speaker Verification with Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[6] Yu Y Q, Zheng S, Suo H B, et al. CAM: Context-Aware Masking for Robust Speaker Verification[C]//IEEE ICASSP. 2021: 6703-6707.
[7] Yao Y, Yang J B, Zhang X W, et al. Single-channel speech enhancement based on a multi-dimensional attention mechanism[J]. Journal of Nanjing University (Natural Science), 2023, 59(4): 669-679. (in Chinese)
[8] Xiong Y, Zeng Z, Chakraborty R, et al. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(16): 14138-14148.
[9] Wu Z, Zhao D, Liang Q, et al. Dynamic Sparsity Neural Networks for Automatic Speech Recognition[C]//ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6014-6018.
Funding
National Natural Science Foundation of China project "Research on Intelligent and Efficient Acquisition of 3D Audio Based on Auditory Perception Mechanisms" (No. 62062025); Guizhou Provincial Science and Technology Program key project "Research on Efficient Perceptual Acquisition Algorithms for Three-Dimensional Audio" (No. Qiankehe Jichu [2019] 1432); Ministry of Education Industry Vocational Education Teaching Steering Committee project "Application of MR (Mixed Reality) Technology in Education and Teaching: A Case Study of Teaching Automotive 3D Acoustic-Signature Fault Diagnosis" (No. HBKC217112)