Abstract
As data warehouses grow, the number of ETL (Extraction-Transformation-Loading) tasks required for data integration grows with them, and single-machine scheduling can no longer cope with large numbers of complex ETL tasks. To improve scheduling efficiency, shorten the waiting time of critical tasks, and raise resource utilization, this paper builds a distributed ETL task scheduling framework. The framework consists of one scheduler and several executors, and it schedules ETL tasks in three stages: task preprocessing, task scheduling and allocation, and task execution. In the preprocessing stage, a weight model is established for the ETL tasks, and the scheduling priority is determined from the weights. In the scheduling and allocation stage, the scheduler restricts the choice of executor nodes according to each node's performance and load, and a Greedy Balance (GB) algorithm distributes ETL task execution requests so that the load across executor nodes stays relatively balanced. In the execution stage, the Highest Response Ratio Next (HRRN) algorithm determines the execution priority of the tasks in each executor node's queue. Experimental results show that the framework and the corresponding Integrated Scheduling Execution (ISE) algorithm effectively improve cluster resource utilization and shorten the execution time of task scheduling.
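The allocation and execution stages can be illustrated with a minimal Python sketch. The abstract gives no details of the paper's weight model or GB algorithm, so everything here is an assumption for illustration: the `Task` and `Executor` classes, the `capacity` score, and the least-projected-load rule in `dispatch_greedy` are hypothetical stand-ins, not the authors' method. Only `response_ratio` follows the standard HRRN definition, (waiting time + service time) / service time.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    weight: float          # priority weight from preprocessing (hypothetical scale)
    service_time: float    # estimated run time, in seconds
    arrival: float = 0.0   # time the task entered an executor queue


@dataclass
class Executor:
    name: str
    capacity: float        # relative performance score (assumed, not from the paper)
    queue: list = field(default_factory=list)

    def load(self) -> float:
        # Queued work normalized by the node's capacity.
        return sum(t.service_time for t in self.queue) / self.capacity


def dispatch_greedy(tasks, executors):
    """Assign tasks, highest weight first, to the executor whose
    projected load after taking the task is lowest (generic greedy heuristic)."""
    for task in sorted(tasks, key=lambda t: t.weight, reverse=True):
        target = min(
            executors,
            key=lambda e: e.load() + task.service_time / e.capacity,
        )
        target.queue.append(task)


def response_ratio(task, now):
    """Standard HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - task.arrival
    return (waiting + task.service_time) / task.service_time


def pick_hrrn(queue, now):
    """Return the queued task with the highest response ratio at time `now`."""
    return max(queue, key=lambda t: response_ratio(t, now))


tasks = [
    Task("load_orders", weight=3.0, service_time=40.0),
    Task("clean_users", weight=2.0, service_time=10.0),
    Task("agg_sales", weight=1.0, service_time=12.0),
]
executors = [Executor("exec-a", capacity=2.0), Executor("exec-b", capacity=1.0)]
dispatch_greedy(tasks, executors)
next_task = pick_hrrn(executors[1].queue, now=30.0)
```

In this example the long, heavily weighted task goes to the faster node and the two short tasks go to the slower one. HRRN then favors the shorter of the two waiting tasks, since an equal wait raises its ratio faster, while still letting long tasks rise in priority as they wait.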
Authors
WANG Zhuo-hao (王卓昊)
YANG Dong-ju (杨冬菊)
XU Chen-yang (徐晨阳)
Affiliations: Institute of Scientific and Technical Information of China, Beijing 100038, China; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China; Data Engineering Institute, North China University of Technology, Beijing 100144, China
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2019, Issue 12, pp. 1-7 (7 pages)
Funding
Supported by the Key Program of the National Natural Science Foundation of China (61832004)
Keywords
Task scheduling
Load balancing
Dynamic allocation
Distributed cluster
ETL (Extraction-Transformation-Loading)
Data integration