Abstract
Extracting text lines from handwritten document images is an important preprocessing step in document image analysis, but it remains challenging because handwritten text lines are often multi-skewed, curved, and overlapping. This paper proposes a segmentation method for offline handwritten Chinese text lines based on high-order correlation clustering. First, a hypergraph is constructed whose nodes correspond to connected components and whose hyperedges each connect at least two connected components. Then, under a learned similarity measure, high-order correlation clustering labels each pair of connected components as belonging or not belonging to the same text line. Finally, a union-find algorithm merges the connected components into text lines. In experiments on 803 unconstrained handwritten Chinese document images from the HIT-MW database, the method achieved a correct rate (recall) of 99.05% and an error rate of 1.96%.
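The final merging step described in the abstract can be illustrated with a minimal sketch. It assumes the pairwise "same text line" labels are already available as index pairs; the names `UnionFind`, `group_into_lines`, and `same_line_pairs` are hypothetical and are not taken from the paper, and the clustering step that produces the labels is not shown.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_into_lines(num_components, same_line_pairs):
    """Group component indices into text lines.

    same_line_pairs is an assumed input: pairs (i, j) of connected
    components that were labeled as belonging to the same text line.
    """
    uf = UnionFind(num_components)
    for a, b in same_line_pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for i in range(num_components):
        groups[uf.find(i)].append(i)
    return sorted(groups.values())


lines = group_into_lines(6, [(0, 1), (1, 2), (3, 4)])
print(lines)  # [[0, 1, 2], [3, 4], [5]]
```

Components that never appear in a positive pair, such as index 5 here, end up as single-component lines, which is one plausible way to treat isolated components in such a sketch.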
Authors
殷亚林
刘爱民
周祥东
YIN Yalin, LIU Aimin, ZHOU Xiangdong (Department of Digital Media Technology, Jianghan University, Wuhan 430056; Laboratory and Equipment Department, Central China Normal University, Wuhan 430079; Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714)
Source
《华中师范大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2017, No. 1, pp. 18-22, 34 (6 pages)
Journal of Central China Normal University: Natural Sciences
Funding
National Natural Science Foundation of China (Grant No. 61273269)
Keywords
handwritten text line segmentation
high-order correlation clustering
hypergraph