Abstract
To address the difficulty of feature fusion and the limited expressive capability of existing models in voiceprint recognition, this paper proposes a method based on an improved Res2Net and a stacked long short-term memory network (Stacked Long Short-Term Memory, SLSTM), combined with the fusion of three features: MFCC, FBank, and LFBank. First, the three features are fused to comprehensively capture the characteristics of the voice signal. Next, the improved Res2Net operates in a finer-grained manner to obtain, for each input feature, feature representations at multiple combinations of scales. Finally, the extracted feature information is fed into the stacked LSTM to model the sequential structure and improve the model's expressive power. Experimental results show that the proposed method performs well on the CN-Celeb dataset, achieving an equal error rate of 2.89% and a minimum detection cost of 0.3725, demonstrating its robustness and accuracy.
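The front end of the pipeline described above (concatenating the three spectral features, then passing them through a Res2Net-style multi-scale block) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: random arrays stand in for real MFCC/FBank/LFBank matrices, a fixed smoothing kernel stands in for the learned convolutions, and the frame count, feature dimension, and split count `s=4` are illustrative assumptions.

```python
import numpy as np

# Hypothetical shapes: 200 frames, 40 coefficients per feature type.
rng = np.random.default_rng(0)
T, D = 200, 40
mfcc   = rng.standard_normal((T, D))   # stand-ins for real MFCC/FBank/LFBank
fbank  = rng.standard_normal((T, D))
lfbank = rng.standard_normal((T, D))

# Step 1: feature fusion by concatenation along the feature axis.
fused = np.concatenate([mfcc, fbank, lfbank], axis=1)   # (T, 3*D)

# Step 2: Res2Net-style multi-scale processing. The feature dimension is
# split into s groups; each group after the first is filtered and summed
# with the previous group's output, so later groups see progressively
# larger receptive fields. A fixed 1-D smoothing kernel stands in for
# the learned 3x3 convolutions of a real Res2Net block.
def res2net_split(x, s=4):
    chunks = np.split(x, s, axis=1)
    outputs = [chunks[0]]                      # first split passes through
    prev = None
    kernel = np.array([0.25, 0.5, 0.25])       # toy stand-in for a learned conv
    for c in chunks[1:]:
        inp = c if prev is None else c + prev  # hierarchical residual connection
        prev = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), 0, inp)
        outputs.append(prev)
    return np.concatenate(outputs, axis=1)

multi_scale = res2net_split(fused, s=4)
print(fused.shape, multi_scale.shape)  # (200, 120) (200, 120)
```

The key design point of the Res2Net split is that the hierarchical additions give each successive group a larger effective receptive field without increasing the output dimensionality.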
Key words
speaker recognition /
multi-feature fusion /
fullRes2Net /
SLSTM
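The back end of the pipeline, the stacked LSTM that models the frame sequence, can likewise be sketched as a minimal two-layer NumPy implementation. This is an illustrative sketch, not the paper's model: weights are randomly initialized rather than trained, and the sequence length and hidden size are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_layer(x, Wx, Wh, b):
    """Run one LSTM layer over a (T, D) sequence; returns hidden states (T, H)."""
    T, _ = x.shape
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    out = np.empty((T, H))
    for t in range(T):
        z = x[t] @ Wx + h @ Wh + b          # all four gates at once, shape (4H,)
        i, f = sig(z[:H]), sig(z[H:2*H])    # input and forget gates
        g, o = np.tanh(z[2*H:3*H]), sig(z[3*H:])
        c = f * c + i * g                   # cell state update
        h = o * np.tanh(c)                  # hidden state
        out[t] = h
    return out

def make_params(D, H):
    # Small random weights stand in for trained parameters.
    return (0.1 * rng.standard_normal((D, 4 * H)),
            0.1 * rng.standard_normal((H, 4 * H)),
            np.zeros(4 * H))

# Illustrative input: 50 frames of a 120-dim fused feature vector.
x = rng.standard_normal((50, 120))
h1 = lstm_layer(x, *make_params(120, 64))   # first LSTM layer
h2 = lstm_layer(h1, *make_params(64, 64))   # second layer stacked on the first
print(h2.shape)  # (50, 64)
```

Stacking is simply feeding each layer's hidden-state sequence as the next layer's input, which lets the upper layer model longer-range structure over the lower layer's representations.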
Funding
National Natural Science Foundation of China project "Auditory Immersive Perception and Intelligent Sound-Field Simulation for the Metaverse" (No. 62562018); National Natural Science Foundation of China project "Research on Intelligent and Efficient 3D Audio Acquisition Based on Auditory Perception Mechanisms" (No. 62062025); Key Project of the Guizhou Provincial Science and Technology Foundation "Research on Efficient Perceptual Acquisition Algorithms for 3D Audio" (No. Qiankehe Jichu [2019] 1432).