Abstract
Big data processing frameworks such as Hadoop and Spark are now widely used in industry and academia for large-scale data processing and analysis. These frameworks adopt a distributed architecture and are generally written in object-oriented languages such as Java and Scala. They use the Java virtual machine (JVM) as the runtime environment on cluster nodes to execute computing tasks, and therefore rely on the JVM's automatic memory management mechanism to allocate and reclaim data objects. However, current JVMs are not designed for the computing characteristics of big data processing frameworks, which leads to problems such as long garbage collection (GC) time and high data object serialization and deserialization costs. As reported by users and researchers, GC time can exceed 50% of the overall application execution time in some big data scenarios. JVM memory management has therefore become both a performance bottleneck and an optimization hotspot for big data processing frameworks. This study systematically reviews recent research on JVM optimization for big data processing frameworks and makes three contributions. First, it summarizes the root causes of the performance degradation of big data applications running on the JVM. Second, it summarizes the existing JVM optimization techniques for big data processing frameworks, classifies them into levels, and compares their advantages and disadvantages in terms of optimization effect, scope of application, and burden on users. Finally, it discusses future JVM optimization directions that can help further improve the performance of big data processing frameworks.
Authors
汪钇丞
曾鸿斌
许利杰
王伟
魏峻
黄涛
WANG Yi-Cheng; ZENG Hong-Bin; XU Li-Jie; WANG Wei; WEI Jun; HUANG Tao (State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Nanjing Institute of Software Technology, Nanjing 211135, China)
Source
《软件学报》
EI
CSCD
Peking University Core Journals
2023, Issue 1, pp. 463-488 (26 pages)
Journal of Software
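Funding
National Key Research and Development Program of China (2017YFB1001804)
National Natural Science Foundation of China (61802377)
Youth Innovation Promotion Association of the Chinese Academy of Sciences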
Keywords
big data system
Java virtual machine(JVM)
distributed system
automatic memory management