Speaker Verification Method Combining CAM++ and Lightweight Transformer Module

ZHAO Hong-xiang, YANG Cheng

Computer & Telecommunication ›› 2025, Vol. 1 ›› Issue (7) : 5-9.

Abstract

Traditional speaker recognition methods based on convolutional neural networks (CNNs) often fail to capture long-term temporal dependencies and full-frequency information, which limits their robustness in complex acoustic environments. To address this, we propose LTEB-CAM++, a hybrid architecture that combines the Context-Aware Masking network (CAM++) with a Lightweight Transformer Encoder Block (LTEB). The LTEB module, inserted between the FCM and D-TDNN layers of CAM++, uses Nyström-based self-attention to model global speech context efficiently (up to 10 s), while the D-TDNN module extracts fine-grained local features. Trained on fused MFCC-FBANK features, LTEB-CAM++ reduces the equal error rate (EER) and the minimum detection cost function (minDCF) on CN-Celeb by 7.98% and 12.58%, respectively, compared with the CAM++ baseline, demonstrating superior efficiency and discriminability.
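As an illustration of the Nyström-based self-attention mentioned in the abstract, the following is a minimal single-head NumPy sketch of the general technique, not the authors' LTEB implementation. The function names, the segment-mean choice of landmarks, and the use of an exact pseudo-inverse (rather than an iterative approximation) are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    """Standard scaled dot-product attention, O(n^2) in sequence length n."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v


def nystrom_attention(q, k, v, num_landmarks):
    """Nystrom approximation of softmax attention (hypothetical helper).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Landmarks are segment means of q and k (an assumption of this sketch),
    so n must be divisible by num_landmarks.
    """
    n, d = q.shape
    if n % num_landmarks != 0:
        raise ValueError("sequence length must be divisible by num_landmarks")
    scale = 1.0 / np.sqrt(d)
    seg = n // num_landmarks

    # Landmark queries/keys: mean over each contiguous segment of frames.
    q_l = q.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    k_l = k.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)

    f = softmax(q @ k_l.T * scale)    # (n, m): all queries vs landmark keys
    a = softmax(q_l @ k_l.T * scale)  # (m, m): landmark queries vs landmark keys
    b = softmax(q_l @ k.T * scale)    # (m, n): landmark queries vs all keys

    # Computing (b @ v) first keeps the cost linear in n for fixed m.
    # Exact pseudo-inverse used here for clarity (assumption of this sketch).
    return f @ np.linalg.pinv(a) @ (b @ v)
```

With `m` landmarks the three softmax matrices have sizes `(n, m)`, `(m, m)`, and `(m, n)`, so the full `(n, n)` attention map is never formed. When `num_landmarks` equals the sequence length, each segment holds one frame and the result coincides with full attention up to numerical error.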

Key words

CAM++ / speaker recognition / Nyström attention / global-local feature fusion / lightweight encoder block

Cite this article

ZHAO Hong-xiang, YANG Cheng. Speaker Verification Method Combining CAM++ and Lightweight Transformer Module[J]. Computer & Telecommunication. 2025, 1(7): 5-9
