Research on Distributed ETL Tasks Scheduling Strategy Based on ISE Algorithm (cited by: 12)
Abstract: As data warehouses grow, the number of ETL (Extraction-Transformation-Loading) tasks in data integration grows with them, and single-machine scheduling can no longer cope with the large number of complex ETL tasks. To improve scheduling efficiency, shorten the waiting time of critical tasks, and raise resource utilization, this paper builds a distributed ETL task scheduling framework. The framework consists of a scheduler and several executors, and it completes ETL task scheduling in three stages: task preprocessing, task scheduling and allocation, and task execution. In the task preprocessing stage, a weight model is built for the ETL tasks, and the scheduling priority is determined by weight. In the task scheduling and allocation stage, the scheduler constrains the choice of executor nodes according to each node's performance and load, and a Greedy Balance (GB) algorithm distributes the ETL task execution requests so that the load across executor nodes is relatively balanced. In the task execution stage, the Highest Response Ratio Next (HRRN) algorithm determines the execution priority of the tasks in each executor node's queue. Experimental results show that the distributed ETL task scheduling framework and the corresponding Integrated Scheduling Execution (ISE) algorithm effectively improve the utilization of cluster resources and shorten the execution time of task scheduling.
Authors: WANG Zhuo-hao; YANG Dong-ju; XU Chen-yang (Institute of Scientific and Technical Information of China, Beijing 100038, China; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China; Data Engineering Institute, North China University of Technology, Beijing 100144, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2019, No. 12, pp. 1-7 (7 pages)
Funding: Supported by the Key Program of the National Natural Science Foundation of China (61832004)
Keywords: Task scheduling; Load balancing; Dynamic allocation; Distributed clustering; Extraction-Transformation-Loading; Data integration
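
Illustrative sketch (not from the paper): the abstract names two scheduling rules without giving their details, so the Python below is only a minimal sketch under stated assumptions. `response_ratio` and `pick_next` use the standard textbook HRRN formula, (waiting time + service time) / service time. `greedy_assign` is a hypothetical stand-in for the Greedy Balance step: it places the heaviest task first on the currently least-loaded executor, which is one common greedy balancing rule and may differ from the authors' GB algorithm. The weight model and the node performance constraints mentioned in the abstract are not modeled. All names (`Task`, `service_time`, `arrival_time`, `greedy_assign`) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    name: str
    service_time: float  # estimated run time (hypothetical field)
    arrival_time: float  # time the task entered the executor queue


def response_ratio(task: Task, now: float) -> float:
    """Standard HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - task.arrival_time
    return (waiting + task.service_time) / task.service_time


def pick_next(queue: list[Task], now: float) -> Task:
    """Return the queued task with the highest response ratio."""
    return max(queue, key=lambda t: response_ratio(t, now))


def greedy_assign(costs: dict[str, float], loads: dict[str, float]) -> dict[str, str]:
    """Assumed greedy rule: heaviest task first, each to the least-loaded executor."""
    loads = dict(loads)  # work on a copy so the caller's dict is untouched
    placement: dict[str, str] = {}
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        target = min(loads, key=loads.get)  # executor with the smallest current load
        placement[name] = target
        loads[target] += cost
    return placement
```

Under HRRN, a short task that has waited a while can overtake a long task that arrived earlier, while the long task's ratio keeps growing as it waits, so it is not starved indefinitely.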