Abstract
The high-performance computing (HPC) public platform of the Tsinghua National Laboratory for Information Science and Technology (in preparation) is one of Tsinghua University's university-level public service platforms. Establishing a thorough equipment management method and an efficient system operation and maintenance model is among the most basic tasks in managing an HPC platform. For equipment management, automated and manual management are combined in a comprehensive equipment management regime that keeps all cluster equipment running safely. For system operation and maintenance, an automatic detection and repair system for cluster nodes was developed in-house; it enables unattended operation and maintenance and keeps the cluster system stable. The equipment management and system maintenance systems are now in use on Tsinghua's "探索100" supercomputing system, providing a stable and efficient HPC environment for users inside and outside the university.
Source
《实验技术与管理》
CAS
Peking University Core Journals (北大核心)
2013, No. 5, pp. 87-90 (4 pages)
Experimental Technology and Management