Abstract
This paper presents PersonIndexer, a prototype system that automatically generates an index of Chinese personal information on the Internet. Preliminary experiments on all HTML texts under two CERNET web sites show that the average recall and precision are 97.8% and 61.9% for Chinese names, 100% and 64.5% for Chinese names in Pinyin form, and 94.5% and 92.1% for Chinese organization names; for email addresses and for telephone and fax numbers, recall and precision are both about 100%. Since Internet information retrieval and natural language processing each place demands on the other, we believe that integrating Chinese analysis techniques oriented to large-scale real text with information retrieval on the Internet will become a hot research topic in Chinese information processing in the near future.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journals (北大核心)
1999, No. 2, pp. 24-32 (9 pages)
Funding
Open Fund of the State Key Laboratory, Tsinghua University
Keywords
automatic index generator
Chinese name identification
personal information searching
Internet
information processing