Abstract: Spark is a memory-based distributed data processing framework, and memory allocation is a central question in Spark research. A good memory allocation scheme can effectively improve both task execution efficiency and memory utilization in Spark. Targeting the memory allocation problem in Spark 2.x, this paper optimizes the memory allocation strategy by analyzing the Spark memory model, the existing cache replacement algorithms, and the existing memory allocation methods, on the principle of minimizing the storage area and allocating the execution area on demand. The work consists of two parts: cache replacement optimization and memory allocation optimization. First, in the storage area, the cache replacement algorithm is optimized according to the characteristics of RDD partitions, combined with PCA dimensionality reduction. Four features of an RDD partition are selected, and each time the RDD cache is replaced, only the two most important features are chosen by PCA dimensionality reduction, which helps the cache replacement strategy generalize. Second, the memory allocation strategy of the execution area is optimized according to the memory requirements of each task and the memory space of the storage area. A series of experiments in Spark on YARN mode is carried out to verify the effectiveness of the optimization algorithm and its improvement of cluster performance.
Funding: This work is supported by the State Grid Science and Technology Project under Grant Nos. 520613180002 and 62061318C002; the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.201714); the Weihai Science and Technology Development Program (2016DXGJMS15); the Shandong Provincial Key Research and Development Program (2017GGX90103); the Fujian Young and Middle-aged Teacher Education Research Project (Grant No. JAT160466); the Jiangsu Polytechnic College of Agriculture and Forestry Key R&D Projects (2018kj11); and the project Study and Development of Smart Agriculture Control System Based on Spark Big Data Decision (2017N0029).
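The PCA step in the Spark abstract (four partition features, of which the two most important are kept at each replacement) can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not name the four features, so the feature names below are hypothetical, and both the importance measure (absolute PCA loadings weighted by explained variance) and the "evict the lowest score" rule are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical feature names; the abstract does not list the four features.
FEATURES = ["size_mb", "recompute_cost_s", "access_count", "idle_time_s"]


def standardize(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero after centering
    return (X - X.mean(axis=0)) / std


def select_features(X, k=2):
    """Return the indices of the k features with the largest PCA importance.

    Importance of a feature (an assumed measure) is the sum over principal
    components of |loading| weighted by the explained-variance ratio.
    """
    Z = standardize(X)
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # columns are components
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negatives
    total = eigvals.sum()
    if total == 0:
        return list(range(k))               # no variance at all: arbitrary
    weights = eigvals / total
    importance = np.abs(eigvecs) @ weights  # one value per original feature
    top = np.argsort(importance)[::-1][:k]
    return sorted(top.tolist())


def pick_victim(X, k=2):
    """Return the row index of the partition to evict.

    Assumption: a higher standardized value means "more worth keeping",
    so the partition with the lowest summed score is evicted.
    """
    Z = standardize(X)
    chosen = select_features(X, k)
    scores = Z[:, chosen].sum(axis=1)
    return int(np.argmin(scores))
```

Each row of `X` would describe one cached RDD partition. A real policy would also have to fix the sign of each feature (for example, a long idle time argues for eviction, not against it), which this sketch leaves out.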
Abstract: With the rapid development of social networks, public opinion monitoring based on social networks is becoming increasingly important. Many platforms have achieved some success in public opinion monitoring, but they perform poorly in scalability, fault tolerance, and real-time processing. In this paper, we propose a novel social-network-oriented public opinion monitoring platform based on ElasticSearch (SNES). First, SNES integrates a distributed crawler cluster module, which provides real-time access to social media data. Second, SNES integrates ElasticSearch, which can store and retrieve massive unstructured data in near real time. Finally, we design a subscription module based on Apache Kafka that connects the modules of the platform through message push and consumption, improving message throughput and the capacity for dynamic horizontal scaling. Extensive empirical experiments show that the platform adapts well to social networks with highly real-time data and performs well in public opinion monitoring.