Speaker Recognition Based on Deep Bidirectional GRU and SE Block with Small Training Set

XU Xin, YANG Cheng

Computer & Telecommunication, 2025, Vol. 1, Issue 5: 22-27.

Section: Intelligent Recognition

Abstract

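With the development of deep learning and large models, speaker recognition models require sufficient training samples; when the training set is limited, they often fail to converge well. To address this small-training-set problem, a speaker recognition method (SE-DBGRU) is proposed that combines a deep bidirectional gated recurrent unit (DBGRU) neural network with a Squeeze-and-Excitation block (SE block). In this method, the deep bidirectional GRU network extracts multiple pieces of information from the input speech in different directions and at different depths; the SE block then assigns different weights to this information, and the weighted information is finally used for classification and recognition. Experimental results show that, with only 6 training utterances per speaker, the recognition accuracy reaches 90.34%, indicating that the model performs well with few training samples.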
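The abstract does not give the network configuration. As a rough illustration of how a deep bidirectional GRU can yield several pieces of information "in different directions and at different depths", here is a minimal pure-Python sketch, assuming a standard GRU cell (update gate, reset gate, candidate state) and assuming that each layer is summarized by the final state of its forward and backward GRU. Each layer runs one GRU forward and one backward over the frame sequence and passes the per-frame concatenation to the next layer. The layer count, hidden size, random placeholder weights, 13-dimensional input frames, and the names `GRUCell` and `deep_bigru` are hypothetical, not the authors' settings; a real model would learn its weights in a deep learning framework.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    # plain matrix-vector product on nested lists
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    # small random placeholder weights; a trained model would learn these
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Hypothetical minimal GRU cell: update gate z, reset gate r, candidate state c."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid = hid_dim
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrc"}   # input weights
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrc"}  # recurrent weights

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        # new state interpolates between the previous state and the candidate
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run_gru(cell, seq):
    h = [0.0] * cell.hid
    outputs = []
    for x in seq:
        h = cell.step(x, h)
        outputs.append(h)
    return outputs


def deep_bigru(seq, in_dim, hid_dim, num_layers, seed=0):
    """Stack bidirectional GRU layers and keep one summary per layer and direction."""
    rng = random.Random(seed)
    summaries = []
    layer_in, dim = seq, in_dim
    for _ in range(num_layers):
        fwd, bwd = GRUCell(dim, hid_dim, rng), GRUCell(dim, hid_dim, rng)
        f_out = run_gru(fwd, layer_in)
        b_out = run_gru(bwd, layer_in[::-1])[::-1]  # realign backward outputs with time
        summaries.append(f_out[-1])  # forward state after the last frame
        summaries.append(b_out[0])   # backward state after reaching the first frame
        layer_in = [f + b for f, b in zip(f_out, b_out)]  # per-frame concatenation
        dim = 2 * hid_dim
    return summaries


# 20 frames of 13-dim hypothetical input features (the abstract does not name the features)
frames = [[math.sin(0.3 * t + k) for k in range(13)] for t in range(20)]
feats = deep_bigru(frames, in_dim=13, hid_dim=8, num_layers=3)
print(len(feats), len(feats[0]))  # 6 8
```

With three layers the sketch returns six 8-dimensional vectors, one per layer and direction, which is the kind of multi-direction, multi-depth set that an SE block could then reweight.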

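The abstract states only that the SE block assigns different weights to the information extracted by the DBGRU before classification. A minimal sketch of a standard Squeeze-and-Excitation block (global averaging as the squeeze, two fully connected layers with ReLU and sigmoid as the excitation, then channel-wise scaling) follows, assuming each layer/direction summary is treated as one channel and that classification is a linear layer with softmax. Recognition accuracy, such as the reported 90.34%, is presumably the fraction of test utterances assigned to the correct speaker. The channel layout, reduction ratio, random placeholder weights, classifier, and the names `se_block`, `classify` and `accuracy` are illustrative assumptions, not the authors' configuration.

```python
import math
import random


def se_block(channels, reduction=2, seed=0):
    """Squeeze-and-Excitation over a list of channels (each channel is a list of floats)."""
    rng = random.Random(seed)
    c = len(channels)
    hidden = max(1, c // reduction)
    # placeholder weights for the two fully connected layers (learned in a real model)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(c)]
    # squeeze: global average of each channel
    s = [sum(ch) / len(ch) for ch in channels]
    # excitation: FC -> ReLU -> FC -> sigmoid, one weight in (0, 1) per channel
    z = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, z)))) for row in w2]
    # scale: multiply every element of a channel by that channel's weight
    scaled = [[wt * x for x in ch] for wt, ch in zip(weights, channels)]
    return scaled, weights


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scaled, num_speakers, seed=1):
    """Flatten the reweighted channels; apply a placeholder linear layer and softmax."""
    rng = random.Random(seed)
    flat = [x for ch in scaled for x in ch]
    w = [[rng.uniform(-0.1, 0.1) for _ in flat] for _ in range(num_speakers)]
    probs = softmax([sum(a * b for a, b in zip(row, flat)) for row in w])
    best = max(range(num_speakers), key=probs.__getitem__)
    return best, probs


def accuracy(predicted, true):
    # fraction of utterances whose predicted speaker matches the true speaker
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


# six hypothetical 8-dim summaries, e.g. one per GRU layer and direction
channels = [[math.sin(i + 0.5 * k) for k in range(8)] for i in range(6)]
scaled, weights = se_block(channels)
speaker, probs = classify(scaled, num_speakers=10)
print([round(w, 3) for w in weights], speaker)
```

The sigmoid gate keeps every channel weight between 0 and 1, so the block can emphasize or suppress whole channels without changing their shape; in a trained model, those weights would reflect which directions and depths are most useful for telling speakers apart.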

XU Xin, YANG Cheng. Speaker Recognition Based on Deep Bidirectional GRU and SE Block with Small Training Set[J]. Computer & Telecommunication, 2025, 1(5): 22-27.

CLC number: TN912.34
