Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will...Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will be affected greatly. Aiming at this problem,hierarchical clustering algorithm based on single-pass is proposed,which is inspired by hierarchical and concurrent ideas to divide clustering process into three stages. News reports are classified into different categories firstly.Then there are twice single-pass clustering processes in the same category,and one agglomerative clustering among different categories. In addition,for semantic similarity in news reports,topic model is improved based on named entities. Experimental results show that the proposed method can effectively accelerate the process as well as improve the performance.展开更多
基金Supported by the National Natural Science Foundation of China(No.61502312)the Fundamental Research Funds for the Central Universities(No.2017BQ024)+1 种基金the Natural Science Foundation of Guangdong Province(No.2017A030310428)the Science and Technology Programm of Guangzhou(No.201806020075,20180210025)
文摘Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will be affected greatly. Aiming at this problem,hierarchical clustering algorithm based on single-pass is proposed,which is inspired by hierarchical and concurrent ideas to divide clustering process into three stages. News reports are classified into different categories firstly.Then there are twice single-pass clustering processes in the same category,and one agglomerative clustering among different categories. In addition,for semantic similarity in news reports,topic model is improved based on named entities. Experimental results show that the proposed method can effectively accelerate the process as well as improve the performance.