Abstract
Ever-growing volumes of data are increasingly difficult to process, use, and manage effectively, which calls for a more suitable approach to acquiring and processing resources. Building on a big data architecture and combining web crawling, data cleaning, and information retrieval, this paper designs and develops an earthquake science popularization knowledge repository system, implemented with J2EE, Python, Hadoop, Elasticsearch, and MySQL. Earthquake science popularization resources are collected through web crawlers and manual uploads, cleaned and converted, and then classified automatically. The resources are finally uploaded to the repository's HDFS distributed file system, and their file information is saved to the Elasticsearch distributed index, which enables full-text retrieval under the big data architecture. A back-end management system is also built for the day-to-day administration and maintenance of the website. Compared with the previous cluster file system, the system is faster and more convenient, as well as more secure and stable.
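The clean, classify, store, and index flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, an in-memory dictionary stands in for HDFS, a simple inverted index stands in for Elasticsearch, and the category names, keywords, and the `Repository` class are hypothetical.

```python
import re
from collections import defaultdict
from html import unescape

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "preparedness": {"drill", "evacuation", "kit"},
    "science": {"fault", "magnitude", "seismic"},
}


def clean(raw_html):
    """Data cleaning: strip tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(text):
    """Pick the category whose keywords overlap most with the text."""
    tokens = set(tokenize(text))
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"


class Repository:
    """In-memory stand-in for the HDFS store and the Elasticsearch index."""

    def __init__(self):
        self.files = {}                # assumption: path -> cleaned text (HDFS role)
        self.index = defaultdict(set)  # assumption: term -> paths (search index role)

    def ingest(self, name, raw_html):
        """Clean, classify, store, and index one crawled or uploaded page."""
        text = clean(raw_html)
        path = f"/{classify(text)}/{name}"
        self.files[path] = text
        for term in tokenize(text):
            self.index[term].add(path)
        return path

    def search(self, query):
        """Return the paths of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)
```

In a real deployment the `files` dictionary would be replaced by writes to HDFS and the `index` dictionary by documents sent to an Elasticsearch index; the keyword overlap classifier is likewise only a placeholder for whatever automatic classification method the system uses.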
Source
Science & Technology Information (《科技资讯》)
2020, No. 5, pp. 17-18, 20 (3 pages)
Funding
Earthquake Science and Technology Spark Program of the China Earthquake Administration (Project No. XH18038).