An Efficient Mechanism for Product Data Extraction from E-Commerce Websites

Abstract: A large amount of data is available on the web that can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines. Making this data machine-readable therefore requires techniques for extracting it from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches learn rules from annotated pages and then apply these rules to extract data from unseen web pages. Because the structure of web documents is continuously evolving, new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique for extracting data from web pages using the visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and words correlated with the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and run on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison of the proposed technique with the existing systems shows encouraging results.
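The evaluation measures named in the abstract can be illustrated with a minimal Java sketch. It assumes that extracted data records and ground-truth records are compared as sets of identifiers; the class name `ExtractionMetrics` and its methods are hypothetical and are not taken from the paper, which does not describe its evaluation code here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not from the paper): scores one extraction run
// by comparing the extracted records with the expected ones.
final class ExtractionMetrics {

    // Count of extracted records that also appear in the ground truth.
    static <T> int truePositives(Set<T> extracted, Set<T> expected) {
        Set<T> hits = new HashSet<>(extracted);
        hits.retainAll(expected);
        return hits.size();
    }

    // Precision: share of extracted records that are correct.
    static <T> double precision(Set<T> extracted, Set<T> expected) {
        if (extracted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(extracted, expected) / extracted.size();
    }

    // Recall: share of expected records that were extracted.
    static <T> double recall(Set<T> extracted, Set<T> expected) {
        if (expected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(extracted, expected) / expected.size();
    }

    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

For instance, if a page holds five true records and a system returns four records of which three are correct, this sketch gives a precision of 0.75 and a recall of 0.6.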

Authors: Malik Javed Akhtar, Zahur Ahmad, Rashid Amin, Sultan H. Almotiri, Mohammed A. Al Ghamdi, Hamza Aldabbas

Affiliations: University of Engineering and Technology; Computer Science Department; Al-Balqa Applied University

Source: Computers, Materials & Continua (SCIE, EI), 2020, No. 12, pp. 2639-2663 (25 pages)

Keywords: document object model; rich data region; common tag sequence; web data extraction; deep web mining

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]

