
Method for Reducing Checkpoint Overhead of Parallel Program (cited by: 3)
Abstract: Checkpointing and rollback recovery is an effective way to improve system reliability and implement fault-tolerant computation. Its performance is usually evaluated by the overhead ratio, which is affected primarily by checkpoint overhead. Because parallel programs currently spend considerable time blocked on communication at run time, this paper proposes a method, built on copy-on-write checkpoint buffering, to further reduce checkpoint overhead. By controlling the scheduling of the state-saving (checkpointing) thread and selecting a suitable state-saving granularity, the method uses communication blocking time to hide the overhead that the thread incurs while running, and thus further reduces the overhead ratio.
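The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `CheckpointSaver`, `comm_blocked`, and `blocking_communication` are hypothetical, a plain `bytes` copy stands in for the copy-on-write checkpoint buffer, and a `bytearray` stands in for stable storage. The sketch assumes the application can signal when it enters and leaves a blocking communication call.

```python
import threading


class CheckpointSaver:
    """Hypothetical helper: writes a buffered snapshot chunk by chunk,
    but only while the application reports it is blocked on communication."""

    def __init__(self, snapshot: bytes, chunk_size: int):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.snapshot = snapshot      # stands in for the copy-on-write buffer
        self.chunk_size = chunk_size  # state-saving granularity
        self.comm_blocked = threading.Event()
        self.saved = bytearray()      # stands in for stable storage
        self.chunks_written = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def join(self, timeout=None):
        self._thread.join(timeout)

    def _run(self):
        offset = 0
        while offset < len(self.snapshot):
            # Wait until the application is inside a blocking communication call.
            self.comm_blocked.wait()
            chunk = self.snapshot[offset:offset + self.chunk_size]
            self.saved.extend(chunk)
            self.chunks_written += 1
            offset += len(chunk)


def blocking_communication(saver, do_receive):
    """Wrap a blocking call so the saver may run only while it is pending."""
    saver.comm_blocked.set()
    try:
        return do_receive()
    finally:
        saver.comm_blocked.clear()
```

The saver rechecks the flag once per chunk, so the chunk size bounds how long it can keep running after the communication call returns. A smaller chunk gives tighter control at the cost of more checks, which is the granularity trade-off the abstract mentions.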
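Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list, 2007, No. 12, pp. 84-86 (3 pages)
Funding: Supported by the Chinese Academy of Sciences research fund for key technologies of next-generation clusters (KGCX2-SW-116)
Keywords: Checkpointing and rollback recovery; Checkpoint overhead; Communication blocking time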

References (3)

  • 1 Vaidya N H. Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme[J]. IEEE Transactions on Computers, 1997, 46(8): 942-947.
  • 2 Plank J S. Efficient Checkpointing on MIMD Architectures[D]. Princeton University, 1993.
  • 3 Chandy K M, Lamport L. Distributed Snapshots: Determining Global States of Distributed Systems[J]. ACM Transactions on Computer Systems, 1985, 3(1): 63-75.

