Abstract
Checkpointing and rollback recovery is an effective way to improve system reliability and implement fault-tolerant computing. Its performance is usually evaluated by the overhead ratio, which is affected primarily by checkpoint overhead. Because parallel programs currently spend considerable time blocked on communication, this paper proposes a method, built on copy-on-write checkpoint buffering, that further reduces checkpoint overhead. By controlling the scheduling of the state-saving thread and choosing a suitable state-saving granularity, the method uses communication blocking time to hide the overhead that the state-saving thread incurs while running, and thereby further reduces the overhead ratio.
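The idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the class `CheckpointSaver`, the `communication()` helper, and `run_demo` are hypothetical names, a Python list stands in for the copy-on-write checkpoint buffer, and `time.sleep` stands in for a blocking receive. It only shows a state-saving thread that advances in chunks of a chosen granularity while the application thread is blocked on communication.

```python
import threading
import time
from contextlib import contextmanager


class CheckpointSaver:
    """Hypothetical saver that drains a checkpoint buffer only while the
    application thread is blocked in communication."""

    def __init__(self, pages, granularity):
        self.pending = list(pages)        # stand-in for buffered copy-on-write pages
        self.granularity = granularity    # pages saved per scheduling step
        self.saved = []                   # stand-in for stable storage
        self.blocked = threading.Event()  # set while the app waits on communication
        self.done = threading.Event()     # set once every page has been saved
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while self.pending:
            self.blocked.wait()           # stay idle until the app blocks
            chunk = self.pending[:self.granularity]
            del self.pending[:self.granularity]
            self.saved.extend(chunk)      # one chunk may overrun the blocking window
        self.done.set()

    @contextmanager
    def communication(self):
        """Mark the enclosed region as communication blocking time."""
        self.blocked.set()
        try:
            yield
        finally:
            self.blocked.clear()


def run_demo(num_pages=10, granularity=2):
    saver = CheckpointSaver(range(num_pages), granularity)
    saver.thread.start()
    while not saver.done.is_set():
        # A compute phase would run here; the saver stays idle during it.
        with saver.communication():
            time.sleep(0.01)              # stand-in for a blocking receive
    return saver.saved
```

In this sketch a smaller granularity keeps any single save step from running far past the end of a blocking window, at the cost of more scheduling steps; a larger one does the opposite.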
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2007, Issue 12, pp. 84-86 (3 pages)
Computer Engineering
Funding
Supported by the Chinese Academy of Sciences research fund for key technologies of new-generation clusters (KGCX2-SW-116)
Keywords
Checkpointing and rollback recovery
Checkpoint overhead
Communication blocking time