Abstract
Automatic part-of-speech (POS) tagging adds value to learner corpora and makes more in-depth studies of interlanguage possible. This study investigates the feasibility of applying two automatic POS taggers to the written English of Chinese EFL learners. The Brill tagger and the CLAWS7 tagger were each used to tag a set of high-scoring essays and a set of low-scoring essays, and tagging accuracy was then calculated. The study aims 1) to compare how well a rule-based tagger and a probability-based tagger suit Chinese EFL learners' written language; 2) to determine whether the quality of learner writing significantly affects tagging accuracy; and 3) to identify the weaknesses that the two types of tagger reveal when processing learner language. The results indicate that CLAWS7, a probability-based tagger, outperforms the rule-based tagger and performs fairly reliably, with accuracy that essentially matches the level it achieves on native-speaker English. The Brill tagger, a rule-based tagger, is less stable: its accuracy is strongly affected by the quality of learner language, particularly by language errors. The findings suggest that learner corpora tagged with CLAWS7 can serve as reliable data for studying the syntactic features of Chinese EFL learners' written language.
Source
《外语教学与研究》
CSSCI
Peking University Core Journals (北大核心)
2006, Issue 4, pp. 279-286 (8 pages)
Foreign Language Teaching and Research