Abstract
Sentence Boundary Disambiguation (SBD) is a preprocessing step for natural language processing. Segmenting text into sentences is essential for Deep Learning (DL) and for pretraining language models. Tibetan punctuation marks can be ambiguous about where sentences begin and end. Hence, the ambiguous punctuation marks must be distinguished, and the sentence structure must be correctly encoded in language models. This study proposes a component-level Tibetan SBD approach based on DL models. These models can reduce the error amplification caused by word segmentation and part-of-speech tagging. Although most SBD methods consider only the text on the left side of punctuation marks, this study considers the text on both sides. A corpus of 465,669 Tibetan sentences is used, and a Bidirectional Long Short-Term Memory (Bi-LSTM) model performs SBD. The experimental results show that the Bi-LSTM model reaches an F1-score of 96%, the highest among the six models compared. Experiments are also performed on low-resource languages such as Turkish and Romanian, and on high-resource languages such as English and German, to verify the models' generalization.
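To illustrate what considering the text on both sides of a punctuation mark might look like in practice, the following is a minimal sketch of how left and right context windows could be extracted for each candidate boundary. It is not the study's implementation: the function name `context_windows`, the window size, the `<pad>` symbol, the choice of the Tibetan shad (U+0F0D) as the candidate mark, and the use of Unicode code points as a stand-in for the study's components are all assumptions made for illustration.

```python
from typing import List, Tuple

SHAD = "\u0f0d"  # Tibetan shad; assumed here to be the ambiguous candidate mark
PAD = "<pad>"    # hypothetical padding symbol for contexts cut off by the text edge


def context_windows(text: str, window: int = 5) -> List[Tuple[int, List[str], List[str]]]:
    """Return (position, left_context, right_context) for every candidate mark.

    Code points stand in for the study's components (an assumption).
    Both contexts are padded to exactly `window` units.
    """
    units = list(text)
    samples = []
    for i, unit in enumerate(units):
        if unit != SHAD:
            continue
        left = units[max(0, i - window):i]
        right = units[i + 1:i + 1 + window]
        # Pad on the outer side so the units nearest the mark stay adjacent to it.
        left = [PAD] * (window - len(left)) + left
        right = right + [PAD] * (window - len(right))
        samples.append((i, left, right))
    return samples
```

Each resulting pair of windows could then be encoded and passed to a sequence classifier such as a Bi-LSTM, which would label the mark as a sentence boundary or not. That classification step is omitted here.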
Funding
This work was supported by the National Key R&D Program of China (No. 2020YFC0832500),
the Ministry of Education-China Mobile Research Foundation (No. MCM20170206),
the Fundamental Research Funds for the Central Universities (Nos. lzujbky-2022-kb12, lzujbky-2021-sp43, lzujbky-2020-sp02, lzujbky-2019-kb51, and lzujbky-2018-k12),
the National Natural Science Foundation of China (No. 61402210),
the Science and Technology Plan of Qinghai Province (No. 2020-GX-164),
the Google Research Awards and Google Faculty Award, the Provincial Science and Technology Plan (Major Science and Technology Projects-Open Solicitation) (No. 22ZD6GA048),
the Gansu Provincial Science and Technology Major Special Innovation Consortium Project (No. 21ZD3GA002),
and the Gansu Province Green and Smart Highway Key Technology Research and Demonstration.