Abstract
Keyword filtering is one of the most commonly used methods for content-based text filtering and is widely applied. Chinese characters are composed of components, and splitting a character into its components makes keyword filtering difficult. To address this problem, a keyword filtering technique based on combinations of Chinese character components is proposed. It relies on an annotated library of Chinese character structures and applies an improved multi-pattern matching algorithm to process massive volumes of text. Experimental results show that the method can efficiently find keywords that have been deliberately split.
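The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not describe the character structure library or the improvements to the matching algorithm, so the sketch assumes a tiny hand-written table (`COMPONENTS`, hypothetical, covering only left-right splits) and uses plain Aho-Corasick matching as a stand-in. The names `expand_keyword`, `build_filter`, and `AhoCorasick` are likewise illustrative.

```python
from collections import deque
from itertools import product

# Hypothetical structure table: character -> its left and right components.
# A real system would load this from a full character-structure library.
COMPONENTS = {
    "好": ("女", "子"),
    "明": ("日", "月"),
    "林": ("木", "木"),
}


def expand_keyword(keyword):
    """Return all variants of keyword with any subset of characters split."""
    options = []
    for ch in keyword:
        forms = [ch]
        if ch in COMPONENTS:
            forms.append("".join(COMPONENTS[ch]))
        options.append(forms)
    return {"".join(parts) for parts in product(*options)}


class AhoCorasick:
    """Plain Aho-Corasick automaton (stand-in for the paper's improved matcher)."""

    def __init__(self, patterns):
        # patterns maps each pattern string to the keyword it was derived from.
        self.goto = [{}]   # per-state transitions
        self.fail = [0]    # per-state failure links
        self.out = [[]]    # per-state (pattern, keyword) outputs
        for pat, kw in patterns.items():
            self._insert(pat, kw)
        self._build_failure_links()

    def _insert(self, pat, kw):
        node = 0
        for ch in pat:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append((pat, kw))

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep failure link 0.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit outputs reachable through the failure link.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, matched_text, keyword) for every hit in text."""
        node = 0
        hits = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pat, kw in self.out[node]:
                hits.append((i - len(pat) + 1, pat, kw))
        return hits


def build_filter(keywords):
    """Build one automaton covering every keyword and its split variants."""
    patterns = {}
    for kw in keywords:
        for variant in expand_keyword(kw):
            patterns.setdefault(variant, kw)
    return AhoCorasick(patterns)
```

For example, `build_filter(["明天", "好人"]).search("日月天见,女子人也来")` reports the split forms "日月天" and "女子人" and maps them back to the keywords "明天" and "好人". Expanding keywords ahead of time keeps the scan to a single pass over the text, at the cost of a pattern set that grows exponentially with the number of splittable characters per keyword.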
Source
《信息技术》
2008, No. 10, pp. 1-3, 10 (4 pages in total)
Information Technology
Funding
National Natural Science Foundation of China (60402019, 60502032)
Program for New Century Excellent Talents in University, Ministry of Education (NCET-06-0393)
Keywords
Chinese character components
multi-pattern matching
filtering