Abstract
To address the insufficient use of glyph features in existing Chinese character vector representation methods, this paper exploits the scale invariance of vector graphics and proposes a character vector representation method based on the vector graphic features of Chinese characters, scalable vector graphics to vector (SVG2vec). In the preprocessing stage, pixel images of Chinese characters are converted into vector graphics, and sequences of glyph vector coordinate pairs are generated. In the feature learning stage, a bidirectional recurrent neural network (RNN) and an autoregressive mixture density RNN are used to build a vector graphic variational autoencoder, which learns the structural features of Chinese character glyphs. In the vector generation stage, the glyph vector coordinate pair sequence is fed to the encoder, which maps the glyph features into a continuous probability distribution space to produce the SVG2vec character vector. SVG2vec is compared with existing character vectors on tasks at different levels. The results show that, in named entity recognition, Chinese word segmentation, and short text similarity calculation, the mean F1 of SVG2vec exceeds that of Word2vec and GloVe, which do not use glyph features, by 1.27 and 0.4, 1.67 and 0.12, and 3.28 and 2.03, respectively; it exceeds that of glyph and meaning to vector (GnM2Vec) and character-enhanced word embedding (CWE), which use glyph features, by 1.02 and 1.07, 1.69 and 1.34, and 0.04 and 0.31, respectively. SVG2vec therefore makes more effective use of Chinese character glyph features.
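The preprocessing idea, turning a glyph into a size-independent sequence of coordinate pairs, can be illustrated with a minimal sketch. The abstract does not specify the sequence format, so everything here is an assumption: the (dx, dy, pen_lifted) triple layout is borrowed from common stroke-sequence models, and the function names `normalize` and `to_offset_sequence` are hypothetical, not the authors' code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def normalize(strokes: List[Stroke]) -> List[Stroke]:
    """Translate and scale a glyph so its longer side spans [0, 1].

    Assumption: this is one simple way to make the sequence independent
    of the size at which the vector outline was drawn.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0  # avoid divide-by-zero
    return [[((x - min_x) / span, (y - min_y) / span) for x, y in stroke]
            for stroke in strokes]


def to_offset_sequence(strokes: List[Stroke]) -> List[Tuple[float, float, int]]:
    """Flatten strokes into (dx, dy, pen_lifted) triples.

    dx, dy are offsets from the previous point (the first point is
    measured from the origin); pen_lifted is 1 on the last point of
    each stroke and 0 otherwise. Hypothetical format, for illustration.
    """
    sequence = []
    prev_x, prev_y = 0.0, 0.0
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            pen_lifted = 1 if i == len(stroke) - 1 else 0
            sequence.append((x - prev_x, y - prev_y, pen_lifted))
            prev_x, prev_y = x, y
    return sequence


# A cross-shaped glyph drawn at two sizes: one horizontal and one vertical stroke.
small = [[(0.0, 5.0), (10.0, 5.0)], [(5.0, 0.0), (5.0, 10.0)]]
large = [[(0.0, 15.0), (30.0, 15.0)], [(15.0, 0.0), (15.0, 30.0)]]

seq_small = to_offset_sequence(normalize(small))
seq_large = to_offset_sequence(normalize(large))
```

Under these assumptions both sizes produce the same sequence, which is the property a pixel image lacks. A sequence of this kind is what an encoder such as the bidirectional RNN described above would consume.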
Authors
TANG Shan-cheng; LU Biao; ZHANG Xue; ZHANG Ying; LIANG Shao-jun (School of Communication and Information Engineering, Xi'an University of Science and Technology, Xi'an 710054, China)
Source
Science Technology and Engineering
Peking University Core Journals
2023, Issue 16, pp. 6967-6973 (7 pages)
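Funding
National Key Research and Development Program of China (2018YFC0808300)
Shaanxi Province Science and Technology Program, Key Industry Innovation Chain (Cluster) Project (2020ZDLGY15-07)
Xi'an Science and Technology Program, Science and Technology Innovation Guidance Project (201805036YD14CG20(4))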
Keywords
Chinese character glyph
scalable vector graphics
character vector
variational autoencoder