Purpose: Normalization is an important step in any natural language processing application that handles social media text. Social media text poses problems that are not present in regular text. A considerable amount of work has recently been done in this direction, but mostly for English. People who are not native English speakers often code-mix English with their native language and post on social media in Roman script, which further aggravates the normalization problem. This paper discusses the concept of normalization with respect to code-mixed social media text and proposes a model to normalize such text.
Design/methodology/approach: The system is divided into two phases: candidate generation and most probable sentence selection. Candidate generation is treated as a machine translation task in which Roman text is the source language and Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once candidates are generated, the second phase uses beam search to select the most probable sentence based on a hidden Markov model.
Findings: Character error rate (CER) and bilingual evaluation understudy (BLEU) scores are reported. The proposed system has been compared with Akhar software and the RB_R2G system, both of which can also transliterate Roman text to Gurmukhi. The proposed system outperforms Akhar software. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text.
Research limitations/implications: The system was observed to produce dialectal variants of a word, or words with minor errors such as a missing diacritic. A spell checker could improve the output by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which would further improve the output. The language model also warrants further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.
Practical implications: The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of the normalizing system, which is the first of its kind and proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi, and which can be extended to any pair of scripts; (4) the proposed system can be used for better analysis of social media text. Theoretically, the study helps in better understanding text normalization in the social media context and opens the door to further research in multilingual social media text normalization.
Originality/value: Existing research focuses on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.
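
The second phase and the CER metric described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the bigram table standing in for the hidden Markov model's transition probabilities, the `"<s>"` start marker, and the `UNSEEN_LOGP` floor are all assumptions made for illustration, and the toy tokens are placeholders rather than real Roman or Gurmukhi data. CER is computed here as edit distance divided by reference length over Unicode code points, so a Gurmukhi diacritic counts as its own character; the paper may segment characters differently.

```python
import math

# Assumed log-probability floor for transitions missing from the toy model.
UNSEEN_LOGP = math.log(1e-6)


def beam_search(candidates, bigram_logp, beam_width=3):
    """Return the highest-scoring word sequence found by beam search.

    candidates: one list per token position, each holding
        (word, candidate_log_prob) pairs from the candidate generator.
    bigram_logp: dict mapping (previous_word, word) to a log-probability;
        "<s>" marks the sentence start (hypothetical convention).
    """
    beams = [([], 0.0)]  # (words so far, cumulative log-probability)
    for options in candidates:
        if not options:
            raise ValueError("every token position needs at least one candidate")
        expanded = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for word, cand_logp in options:
                step = bigram_logp.get((prev, word), UNSEEN_LOGP) + cand_logp
                expanded.append((words + [word], score + step))
        # Keep only the beam_width best partial sentences.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy usage with placeholder tokens (not real transliteration data).
toy_candidates = [
    [("a", math.log(0.6)), ("b", math.log(0.4))],
    [("c", math.log(0.5)), ("d", math.log(0.5))],
]
toy_bigrams = {
    ("<s>", "a"): math.log(0.5), ("<s>", "b"): math.log(0.5),
    ("a", "c"): math.log(0.1), ("a", "d"): math.log(0.9),
    ("b", "c"): math.log(0.5), ("b", "d"): math.log(0.5),
}
print(beam_search(toy_candidates, toy_bigrams, beam_width=2))
print(cer("kitten", "sitting"))
```

Working in log-probabilities keeps long sentences from underflowing, and a small beam trades exactness for speed compared with an exhaustive search over all candidate combinations.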
Recognizing irregular text in natural images is a challenging task in computer vision. Existing approaches still have difficulty recognizing irregular text because of its diverse shapes. In this paper, we propose a simple yet powerful irregular text recognition framework based on an encoder-decoder architecture. The proposed framework is divided into four main modules. Firstly, in the image transformation module, a Thin Plate Spline (TPS) transformation is employed to transform the irregular text image into a readable text image. Secondly, we propose a novel Spatial Attention Module (SAM) to compel the model to concentrate on text regions and obtain enriched feature maps. Thirdly, a deep bi-directional long short-term memory (Bi-LSTM) network is used to build a contextual feature map from the visual feature map generated by a Convolutional Neural Network (CNN). Finally, we propose a Dual Step Attention Mechanism (DSAM) integrated with the Connectionist Temporal Classification (CTC)-Attention decoder to re-weight visual features and focus on intra-sequence relationships, generating a more accurate character sequence. The effectiveness of the proposed framework is verified through extensive experiments on various benchmark datasets, such as SVT, ICDAR, CUTE80, and IIIT5k. The performance of the proposed text recognition framework is analyzed with the accuracy metric. The results demonstrate that the proposed method outperforms existing approaches on both regular and irregular text. Additionally, the robustness of our approach is evaluated on grocery datasets containing complex text images, namely GroZi-120, WebMarket, SKU-110K, and Freiburg Groceries. Our framework also achieves superior performance on these grocery datasets.
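
To make the CTC part of the decoder concrete, here is a minimal Python sketch of standard CTC best-path (greedy) decoding: take the highest-scoring label in each frame, collapse consecutive repeats, then drop blanks. It does not reproduce the paper's combined CTC-Attention decoder or the DSAM re-weighting; the function name, the `labels` list, and the choice of index 0 for the blank are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of per-frame score lists, one score per label.
    labels: list of label symbols; labels[blank] is the CTC blank.
    """
    # Step 1: pick the best label index for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    # Step 2: collapse consecutive repeats, then remove blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


# Toy usage: three labels, where "-" plays the role of the blank.
toy_labels = ["-", "a", "b"]
toy_frames = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.2, 0.7, 0.1],  # a
    [0.1, 0.2, 0.7],  # b
    [0.1, 0.2, 0.7],  # b (repeat, collapsed)
]
print(ctc_greedy_decode(toy_frames, toy_labels))
```

The blank between repeated frames is what lets CTC emit a doubled character: without it, the two runs of "a" would collapse into one.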