To address the difficulty of fusing features in voiceprint recognition and the limited expressive capability of existing models, this paper proposes an enhanced Res2Net architecture combined with a stacked Long Short-Term Memory (SLSTM) neural network, and fuses three acoustic features: MFCC, FBank, and LFBank. First, the three features are fused to capture the characteristics of the speech signal more comprehensively. Next, the enhanced Res2Net operates at a finer granularity to produce, for each input feature, multiple representations at different combinations of scales. Finally, the extracted features are fed into the SLSTM to model their sequential dependencies and improve the model's expressive power. Experiments on the CN-Celeb dataset show that the proposed method achieves an equal error rate (EER) of 2.89% and a minimum detection cost function (minDCF) of 0.3725, indicating that it is both robust and accurate.
Key words: speaker recognition; multi-feature fusion; fullRes2Net; SLSTM
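The equal error rate reported in the abstract is the operating point at which the false acceptance rate equals the false rejection rate. The abstract does not describe the scoring or evaluation procedure, so the following is only a minimal sketch of how an EER can be estimated from verification scores. The function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from two lists of similarity scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest.
    """
    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (|FAR - FRR|, eer estimate, threshold)
    for t in thresholds:
        # FRR: fraction of genuine (target) trials rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # FAR: fraction of impostor (non-target) trials accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores invented for illustration only.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.6, 0.3, 0.2, 0.1]
    eer, threshold = equal_error_rate(genuine, impostor)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

This sketch sweeps only the observed scores as thresholds. Evaluation toolkits typically interpolate between operating points, so values on a real trial list may differ slightly.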