Abstract
As the number of nodes and the scale of computing tasks in cluster systems continue to grow, the probability of task failure keeps increasing. The distributed task fault redundancy mechanism presented here was designed and implemented to address this problem. The paper first introduces the architecture of distributed task fault redundancy management for cluster systems. It then describes the key technologies in detail, including task failure detection and recovery, elimination of the cluster's single point of failure, task status synchronization, and user interaction. Finally, tests in a laboratory environment indicate that the mechanism improves the reliability of the cluster system and helps keep its distributed tasks running stably.
Source
《江苏科技信息》
2015, Issue 21, pp. 37-39 (3 pages)
Jiangsu Science and Technology Information