Abstract
Objective: To study the value of text similarity calculation in improving the efficiency of manual judgment during the standardization of diagnosis-name data. Methods: Diagnosis names that carry a code in the national standard classification and codes of diseases were sorted by disease name and assigned IDs. A total of 23,681 diagnosis-name text records used between March 2020 and August 2021 were selected. Text similarity was calculated as cosine similarity. Text vectors were built by exhaustively segmenting each name into single characters and combinations of consecutive characters (character groups), weighted by frequency and by inverse document frequency (IDF), and the calculated results were corrected by a function. Results: Diagnosis names averaged 8.58 characters in length, and character groups of up to 9 characters were generated. The number of character groups stopped growing beyond 3-character groups and gradually decreased from 5-character groups onward, so the vector space built from groups of up to 9 characters stayed below 500,000 dimensions and could be processed by computer. In the cosine similarity calculation, non-standard data were represented by frequency vectors and standard data by frequency-IDF weighted vectors, which allowed the data to be standardized. When texts with similar diagnosis names were compared, their text vectors differed considerably, and each text reached its maximum similarity with itself. Different text vectors were tested under different character-group combination modes, using hypertension as an example. Across the combination modes there were 356 diagnosis names, and in 390 instances the maximum similarity was not with the name itself. In the case-by-case analysis, combination mode I was inconsistent and modes II to IX were consistent, recorded as 0, 1, 1, 1, 1, 1, 1, 1, 1. Conclusion: Text similarity calculation can support the standardization of diagnosis-name data and improve the efficiency of manual judgment.
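The vectorization and matching steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the smoothed IDF formula, and the sample names are assumptions made for illustration, and the correction function mentioned in the abstract is omitted because the abstract does not specify it.

```python
import math
from collections import Counter


def char_ngrams(text, max_n=9):
    """Enumerate every contiguous character group of length 1..max_n."""
    return [text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]


def idf_table(names, max_n=9):
    """IDF of each character group over the standard names.

    The smoothed form log((1 + N) / (1 + df)) + 1 is an assumption;
    the abstract does not state which IDF variant was used.
    """
    df = Counter()
    for name in names:
        df.update(set(char_ngrams(name, max_n)))
    total = len(names)
    return {g: math.log((1 + total) / (1 + d)) + 1 for g, d in df.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query, standard_names, max_n=9, top_k=3):
    """Rank standard names by similarity to a non-standard name.

    The query uses raw frequencies and each standard name uses
    frequency x IDF weights, following the abstract's description.
    """
    idf = idf_table(standard_names, max_n)
    q_vec = Counter(char_ngrams(query, max_n))
    scored = []
    for name in standard_names:
        tf = Counter(char_ngrams(name, max_n))
        s_vec = {g: c * idf[g] for g, c in tf.items()}
        scored.append((cosine(q_vec, s_vec), name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical standard names, not taken from the study's data.
standard = ["高血压", "糖尿病", "肺炎"]
for score, name in rank_candidates("高血压病", standard):
    print(f"{name}\t{score:.3f}")
```

In practice the ranked list would be shown to a coder as a short set of candidates, so that manual judgment is limited to confirming or rejecting the top matches rather than searching the full standard list.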
Author
ZHENG Jingwen (郑景文), Medical Record Room, Zhanjiang Nongken Central Hospital, Zhanjiang, Guangdong Province, 524002, China
Source
China Health Industry (《中国卫生产业》)
2022, No. 9, pp. 166-169 (4 pages)
Keywords
Manual judgment efficiency
Text similarity calculation
Diagnosis-name data standardization