Abstract
High reliability and high availability have become indispensable in high-performance computing (HPC). This paper designs and implements the H3C cluster system, which applies Symmetry Active/Active Replication to improve the reliability and availability of critical cluster services, so that service continues without interruption when a head node fails. Building on LAM/MPI and the BLCR checkpoint mechanism, the paper also develops HA/MPI, a highly available MPI runtime environment that effectively addresses the fault-tolerance problem of computing-node failure during parallel computation.
Source
Ship Electronic Engineering (《舰船电子工程》)
2008, No. 5, pp. 143-146 (4 pages)
Funding
Supported by the "Eleventh Five-Year Plan" National Defense Pre-Research Project (No. 513160201)
Keywords
high reliability
high availability
symmetry active/active replication
virtual synchrony
LAM
checkpoint/restart
process migration