Abstract
Traditional Automatic Speaker Verification (ASV) systems have difficulty distinguishing synthetic speech, so building a speaker protection system is an urgent task. To counter the intrusion of synthetic speech into speaker verification systems, a two-channel speech feature based on Empirical Mode Decomposition (EMD) of Mel Frequency Cepstral Coefficients (MFCC) and Inverse Mel Frequency Cepstral Coefficients (IMFCC) is proposed at the feature level as the front end for synthetic speech detection. On the back end, a Res2Net network is cascaded with a Squeeze-and-Excitation network (SENet) to form an SE-Res2Net classifier that improves the model's generalization ability. The scores of different feature-model combinations are then fused to further improve performance. Experimental results on the ASVspoof2019 dataset show that the proposed system detects synthetic speech effectively: compared with the baseline systems of the ASVspoof2019 challenge, the fused model reduces the Equal Error Rate (EER) and the tandem Detection Cost Function (t-DCF) by 49% and 64%, respectively.
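A rough, illustrative sketch of the front end described above, assuming the PyEMD and librosa packages; the IMF selection rule, the inverted-filterbank formulation of IMFCC, and all frame parameters below are assumptions for illustration, not the authors' exact configuration.

    import numpy as np
    import librosa
    from PyEMD import EMD
    from scipy.fftpack import dct

    def imfcc(y, sr, n_mfcc=20, n_fft=512, hop_length=160, n_filters=40):
        # Inverse-Mel cepstra: flip the mel filterbank so that the narrow
        # filters sit at high frequencies (assumed formulation of IMFCC).
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
        fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filters)
        fb = fb[::-1, ::-1]  # reverse filter order and frequency axis
        return dct(np.log(fb @ spec + 1e-10), axis=0, norm='ortho')[:n_mfcc]

    def dual_channel_features(y, sr, n_mfcc=20):
        # Decompose into intrinsic mode functions (IMFs) and rebuild the
        # signal from the leading IMFs; keeping three is an assumption.
        imfs = EMD()(y)
        y_rec = imfs[:3].sum(axis=0) if len(imfs) >= 3 else y
        mf = librosa.feature.mfcc(y=y_rec, sr=sr, n_mfcc=n_mfcc,
                                  n_fft=512, hop_length=160)
        return np.stack([mf, imfcc(y_rec, sr, n_mfcc=n_mfcc)])  # (2, n_mfcc, T)

The back end pairs Res2Net's multi-scale convolutions with Squeeze-and-Excitation channel attention. A minimal PyTorch sketch of one such block, with scale 4 and an identity residual path assumed (the abstract does not give the exact layer configuration):

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        # Squeeze-and-Excitation: reweight channels by pooled statistics.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):
            w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pool
            return x * w.view(x.size(0), -1, 1, 1)   # excite: rescale channels

    class SERes2NetBlock(nn.Module):
        # Res2Net-style hierarchical multi-scale convolution, then an SE gate.
        def __init__(self, channels, scale=4):
            super().__init__()
            assert channels % scale == 0
            self.width = channels // scale
            self.convs = nn.ModuleList(
                nn.Conv2d(self.width, self.width, 3, padding=1)
                for _ in range(scale - 1))
            self.se = SEBlock(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            xs = torch.split(x, self.width, dim=1)
            out, prev = [xs[0]], None
            for i, conv in enumerate(self.convs):
                prev = xs[i + 1] if prev is None else xs[i + 1] + prev
                prev = self.relu(conv(prev))
                out.append(prev)
            return self.relu(self.se(torch.cat(out, dim=1)) + x)  # residual add

Score-level fusion and the EER metric reported above can be computed as follows; the equal fusion weight is an assumption, and roc_curve comes from scikit-learn.

    import numpy as np
    from sklearn.metrics import roc_curve

    def fuse_scores(scores_a, scores_b, w=0.5):
        # Convex combination of two systems' scores (weight chosen arbitrarily).
        return w * np.asarray(scores_a) + (1 - w) * np.asarray(scores_b)

    def equal_error_rate(scores, labels):
        # EER: the point where false-accept and false-reject rates are equal.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        i = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[i] + fnr[i]) / 2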
Authors
LIANG Chao (梁超); GAO Yong (高勇)
School of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
Source
Radio Engineering (《无线电工程》), a Peking University core journal
2022, No. 9, pp. 1560-1565 (6 pages)