Abstract
In text processing, filtering systems for ordinary sensitive words are already mature, but methods for filtering the deformed sensitive words that are now widespread still need improvement, especially for complex deformed sensitive words in Chinese. To address this problem, the paper analyzes and categorizes deformed sensitive words and proposes a deformed sensitive word filtering algorithm based on an improved Trie tree. The algorithm proceeds through four stages: analyzing and classifying deformed sensitive words, preprocessing the text by separation, constructing a Trie tree suited to the characteristics of Chinese, and filtering deformed sensitive words. Together these stages form a complete filtering system for Chinese text. Repeated experiments show that the algorithm not only finds ordinary sensitive words in Chinese text effectively but also filters out deformed sensitive words efficiently, with recall of 95.46% for all sensitive words and 92.49% for deformed sensitive words. This widens the search range for sensitive words and improves filtering accuracy.
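The general idea of Trie-based matching that tolerates deformation can be illustrated with a minimal sketch. This is not the paper's improved Trie, whose details are not given in the abstract. It is a plain character Trie, and it assumes that the only deformation is the insertion of non-alphanumeric separator characters between the characters of a sensitive word. The names `TrieNode`, `build_trie`, `is_noise`, and `find_sensitive` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of a character-level Trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a sensitive word ends at this node


def build_trie(words: List[str]) -> TrieNode:
    """Insert every sensitive word into a Trie, one character per level."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def is_noise(ch: str) -> bool:
    # Assumption: any non-alphanumeric character (symbols, spaces,
    # punctuation) is treated as an inserted separator. Chinese
    # characters count as alphanumeric for str.isalnum().
    return not ch.isalnum()


def find_sensitive(text: str, root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_span) for the longest match at each start.

    Separator characters inside a candidate match are skipped, so a word
    split by symbols is still found.
    """
    hits: List[Tuple[int, int, str]] = []
    n = len(text)
    for start in range(n):
        if is_noise(text[start]):
            continue  # a match never begins on a separator
        node = root
        last_end = -1
        i = start
        while i < n:
            ch = text[i]
            if is_noise(ch):
                i += 1  # skip the separator without moving in the Trie
                continue
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node = nxt
            if node.is_end:
                last_end = i  # remember the longest match so far
            i += 1
        if last_end >= 0:
            hits.append((start, last_end + 1, text[start:last_end + 1]))
    return hits
```

For example, with `root = build_trie(["敏感词", "bad"])`, calling `find_sensitive("这是敏*感 词和b-a-d", root)` reports the spans `敏*感 词` and `b-a-d`. Other kinds of deformation that a full system might need to handle, such as homophones or split characters, are outside the scope of this sketch.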
Author
YE Qing (叶情), College of Computer Science, Sichuan University, Chengdu 610065
Source
Modern Computer (《现代计算机》)
2018, No. 22, pp. 3-7 (5 pages)
Funding
National Natural Science Foundation of China (No. 61332001)
Keywords
Sensitive word filtering
Trie tree
Deformed sensitive words
Text separation
Fuzzy matching