Daily newspapers publish a tremendous amount of information disseminated through the Internet.Freely available and easily accessible large online repositories are not indexed and are in an un-processable format.The ma...Daily newspapers publish a tremendous amount of information disseminated through the Internet.Freely available and easily accessible large online repositories are not indexed and are in an un-processable format.The major hindrance in developing and evaluating existing/new monolingual text in an image is that it is not linked and indexed.There is no method to reuse the online news images because of the unavailability of standardized benchmark corpora,especially for South Asian languages.The corpus is a vital resource for developing and evaluating text in an image to reuse local news systems in general and specifically for the Urdu language.Lack of indexing,primarily semantic indexing of the daily news items,makes news items impracticable for any querying.Moreover,the most straightforward search facility does not support these unindexed news resources.Our study addresses this gap by associating and marking the newspaper images with one of the widely spoken but under-resourced languages,i.e.,Urdu.The present work proposed a method to build a benchmark corpus of news in image form by introducing a web crawler.The corpus is then semantically linked and annotated with daily news items.Two techniques are proposed for image annotation,free annotation and fixed cross examination annotation.The second technique got higher accuracy.Build news ontology in protégéusing OntologyWeb Language(OWL)language and indexed the annotations under it.The application is also built and linked with protégéso that the readers and journalists have an interface to query the news items directly.Similarly,news items linked together will provide complete coverage and bring together different opinions at a single location for readers to do the analysis themselves.展开更多
基金King Saud University through Researchers Supporting Project number(RSP-2021/387),King Saud University,Riyadh,Saudi Arabia.
文摘Daily newspapers publish a tremendous amount of information disseminated through the Internet.Freely available and easily accessible large online repositories are not indexed and are in an un-processable format.The major hindrance in developing and evaluating existing/new monolingual text in an image is that it is not linked and indexed.There is no method to reuse the online news images because of the unavailability of standardized benchmark corpora,especially for South Asian languages.The corpus is a vital resource for developing and evaluating text in an image to reuse local news systems in general and specifically for the Urdu language.Lack of indexing,primarily semantic indexing of the daily news items,makes news items impracticable for any querying.Moreover,the most straightforward search facility does not support these unindexed news resources.Our study addresses this gap by associating and marking the newspaper images with one of the widely spoken but under-resourced languages,i.e.,Urdu.The present work proposed a method to build a benchmark corpus of news in image form by introducing a web crawler.The corpus is then semantically linked and annotated with daily news items.Two techniques are proposed for image annotation,free annotation and fixed cross examination annotation.The second technique got higher accuracy.Build news ontology in protégéusing OntologyWeb Language(OWL)language and indexed the annotations under it.The application is also built and linked with protégéso that the readers and journalists have an interface to query the news items directly.Similarly,news items linked together will provide complete coverage and bring together different opinions at a single location for readers to do the analysis themselves.