
Self-supervised Multimodal Emotion Recognition Combining Temporal Attention Mechanism and Unimodal Label Automatic Generation Strategy
Abstract: Most multimodal emotion recognition methods seek an effective fusion mechanism that constructs features from heterogeneous modalities so as to learn feature representations with semantic consistency. However, these methods usually ignore the differences in emotional semantics between modalities. To address this problem, a multi-task learning framework is proposed that jointly trains one multimodal task and three unimodal tasks, learning respectively the consistency information of emotional semantics shared across the multimodal features and the difference information of emotional semantics contained in each individual modality. First, to learn the emotionally consistent semantic information, a Temporal Attention Mechanism (TAM) based on a multilayer recurrent neural network is proposed, which describes the contribution of emotional features by assigning different weights to the time-series feature vectors. For multimodal fusion, fine-grained feature fusion is then performed per semantic dimension in the semantic space. Second, to learn effectively the difference information of emotional semantics in each modality, a self-supervised Unimodal Label Automatic Generation (ULAG) strategy based on inter-modal feature vector similarity is proposed. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets confirm that the proposed TAM-ULAG model is highly competitive: it improves on current benchmark models in both the classification metrics (Acc_2, F_1) and the regression metrics (MAE, Corr). For binary classification, the recognition accuracy is 87.2% on CMU-MOSI and 85.8% on CMU-MOSEI, and reaches 81.47% on CH-SIMS. These results indicate that simultaneously learning the consistency information of emotional semantics across modalities and the difference information of emotional semantics within each modality helps improve the performance of self-supervised multimodal emotion recognition methods.
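
The abstract describes TAM only at a high level, so the following is a minimal sketch of how an RNN-based temporal attention layer of this kind is typically built: a multilayer recurrent encoder produces per-time-step hidden states, a learned scoring layer assigns each step a weight, and the weighted sum forms the sequence-level emotional representation. The use of an LSTM, the layer sizes, and the single linear scoring head are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal attention: a multilayer RNN encodes the sequence,
    a learned scoring layer weights every time step, and the weighted sum of
    hidden states forms the utterance-level representation (an assumed design,
    not the paper's exact TAM)."""
    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        # Multilayer recurrent encoder; LSTM is chosen here purely for illustration.
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)  # per-time-step relevance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) time-series features of one modality
        h, _ = self.rnn(x)                            # (batch, seq_len, hidden_dim)
        alpha = torch.softmax(self.score(h), dim=1)   # attention weights over time
        return (alpha * h).sum(dim=1)                 # (batch, hidden_dim)

# Usage: pool a 20-step feature sequence (batch size and feature sizes are hypothetical).
feats = torch.randn(8, 20, 74)
pooled = TemporalAttention(input_dim=74, hidden_dim=128)(feats)
print(pooled.shape)  # torch.Size([8, 128])
```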
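Likewise, the abstract does not give the exact ULAG formula; the snippet below only illustrates the general idea of deriving unimodal pseudo-labels from the human-annotated multimodal label and inter-modal feature similarity. The cosine-similarity rule, the `scale` factor, and the modality names are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def generate_unimodal_labels(y_m, fused, unimodal, scale=0.5):
    """Hypothetical ULAG-style rule: start from the multimodal label y_m and shift
    it for each modality in proportion to how far that modality's feature vector
    drifts from the fused multimodal representation (measured by cosine similarity)."""
    labels = {}
    for name, z in unimodal.items():
        sim = F.cosine_similarity(z, fused, dim=-1)  # (batch,), in [-1, 1]
        labels[name] = y_m + scale * (sim - 1.0)     # identical features keep y_m
    return labels

# Usage with random features standing in for the three unimodal tasks.
batch, dim = 4, 128
fused = torch.randn(batch, dim)
uni = {"text": torch.randn(batch, dim),
       "audio": torch.randn(batch, dim),
       "vision": torch.randn(batch, dim)}
pseudo = generate_unimodal_labels(torch.tensor([1.0, -0.5, 0.0, 2.0]), fused, uni)
```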
Authors: SUN Qiang; WANG Shuyu (Department of Communication Engineering, School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048, China; Xi'an Key Laboratory of Wireless Optical Communication and Network Research, Xi'an 710048, China)
Source: Journal of Electronics & Information Technology (《电子与信息学报》; indexed in EI, CAS, CSCD, Peking University Core), 2024, No. 2, pp. 588-601 (14 pages)
Funding: Xi'an Science and Technology Plan Project (22GXFW0086); Science and Technology Plan Project of Beilin District, Xi'an (GX2243)
Keywords: Multimodal emotion recognition; Self-supervised label generation; Multi-task learning; Temporal attention mechanism; Multimodal fusion