Abstract
To address the difficulty of feature fusion and the limited expressive capability of existing models in voiceprint recognition, this paper proposes a method based on an improved Res2Net and a stacked long short-term memory network (Stacked Long Short-Term Memory, SLSTM), combined with the fusion of three features: MFCC, FBank, and LFBank. First, the three features are fused to comprehensively capture the characteristics of the voice signal. Next, the improved Res2Net operates in a finer-grained manner to obtain, for each input feature, feature representations at multiple combinations of scales. Finally, the extracted feature information is fed into the stacked LSTM to model the sequential structure and improve the model's expressive power. Experimental results show that the proposed method performs well on the CN-Celeb dataset, achieving an equal error rate of 2.89% and a minimum detection cost of 0.3725, demonstrating its robustness and accuracy.
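The front end of the pipeline described above (concatenating the three spectral features, then passing them through a Res2Net-style multi-scale block) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: random arrays stand in for real MFCC/FBank/LFBank matrices, a fixed smoothing kernel stands in for the learned convolutions, and the frame count, feature dimension, and split count `s=4` are illustrative assumptions.

```python
import numpy as np

# Hypothetical shapes: 200 frames, 40 coefficients per feature type.
rng = np.random.default_rng(0)
T, D = 200, 40
mfcc   = rng.standard_normal((T, D))   # stand-ins for real MFCC/FBank/LFBank
fbank  = rng.standard_normal((T, D))
lfbank = rng.standard_normal((T, D))

# Step 1: feature fusion by concatenation along the feature axis.
fused = np.concatenate([mfcc, fbank, lfbank], axis=1)   # (T, 3*D)

# Step 2: Res2Net-style multi-scale processing. The feature dimension is
# split into s groups; each group after the first is filtered and summed
# with the previous group's output, so later groups see progressively
# larger receptive fields. A fixed 1-D smoothing kernel stands in for
# the learned 3x3 convolutions of a real Res2Net block.
def res2net_split(x, s=4):
    chunks = np.split(x, s, axis=1)
    outputs = [chunks[0]]                      # first split passes through
    prev = None
    kernel = np.array([0.25, 0.5, 0.25])       # toy stand-in for a learned conv
    for c in chunks[1:]:
        inp = c if prev is None else c + prev  # hierarchical residual connection
        prev = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), 0, inp)
        outputs.append(prev)
    return np.concatenate(outputs, axis=1)

multi_scale = res2net_split(fused, s=4)
print(fused.shape, multi_scale.shape)  # (200, 120) (200, 120)
```

The key design point of the Res2Net split is that the hierarchical additions give each successive group a larger effective receptive field without increasing the output dimensionality.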
Key words
speaker recognition /
multi-feature fusion /
fullRes2Net /
SLSTM
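The back end of the pipeline, the stacked LSTM that models the frame sequence, can likewise be sketched as a minimal two-layer NumPy implementation. This is an illustrative sketch, not the paper's model: weights are randomly initialized rather than trained, and the sequence length and hidden size are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_layer(x, Wx, Wh, b):
    """Run one LSTM layer over a (T, D) sequence; returns hidden states (T, H)."""
    T, _ = x.shape
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    out = np.empty((T, H))
    for t in range(T):
        z = x[t] @ Wx + h @ Wh + b          # all four gates at once, shape (4H,)
        i, f = sig(z[:H]), sig(z[H:2*H])    # input and forget gates
        g, o = np.tanh(z[2*H:3*H]), sig(z[3*H:])
        c = f * c + i * g                   # cell state update
        h = o * np.tanh(c)                  # hidden state
        out[t] = h
    return out

def make_params(D, H):
    # Small random weights stand in for trained parameters.
    return (0.1 * rng.standard_normal((D, 4 * H)),
            0.1 * rng.standard_normal((H, 4 * H)),
            np.zeros(4 * H))

# Illustrative input: 50 frames of a 120-dim fused feature vector.
x = rng.standard_normal((50, 120))
h1 = lstm_layer(x, *make_params(120, 64))   # first LSTM layer
h2 = lstm_layer(h1, *make_params(64, 64))   # second layer stacked on the first
print(h2.shape)  # (50, 64)
```

Stacking is simply feeding each layer's hidden-state sequence as the next layer's input, which lets the upper layer model longer-range structure over the lower layer's representations.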
Funding
National Natural Science Foundation of China project "Auditory Immersive Perception and Intelligent Sound-Field Simulation for the Metaverse" (No. 62562018); National Natural Science Foundation of China project "Research on Intelligent and Efficient 3D Audio Acquisition Based on Auditory Perception Mechanisms" (No. 62062025); Key Project of the Guizhou Provincial Science and Technology Foundation "Research on Efficient Perceptual Acquisition Algorithms for 3D Audio" (No. Qiankehe Jichu [2019] 1432).