Abstract
Dictionary-based automatic word segmentation is fundamental to Chinese information processing. Guided by the principle of optimizing use of the computer's cache, this paper analyzes several typical segmentation dictionary mechanisms and identifies some of their problems. The binary-seek-by-word (whole-word binary search) method is improved, greatly increasing its speed. By combining a hash index with the PATRICIA tree search algorithm, a comprehensively optimized Chinese word segmentation system is proposed.
Source
Guizhou Science (《贵州科学》)
2008, No. 3, pp. 1-8 (8 pages)
Funding
Annual Plan Project of the Guizhou Provincial Department of Science and Technology
Supported under grant 黔科合(2004)JN057
Keywords
Chinese information processing
Chinese word segmentation
segmentation dictionary
cache optimization