摘要
在分布式存储系统中,如何优化失效数据的修复时间以保证系统的高可靠性,已引起了人们的广泛关注.近几年的研究发现修复过程中不同的节点选择机制对数据的再生时间产生很大的影响,已有工作提出了单节点失效场景下的节点选择SPSN(select provider select newcomer)算法,系统中往往存在多个节点同时修复的情况,此时,SPSN算法巨大的时空开销使得数据的再生时间不再最优.对已有真实系统的失效数据及原因进行统计;基于已有算法特点和修复模型,提出了具有更优的多节点选择B-WSJ(bandwidth based weak and strong judgement)算法.为了更好地描述算法,对带宽中节点的关系进行分类,算法利用节点关系分别实现了修复模型中目标节点的浅度和深度判断,并加入一定的预处理和剪枝策略,最终快速选择出具有较优带宽的节点集合.为了评估B-WSJ算法性能,使用Waxman算法产生网络拓扑,依据FTA(failure trace archive)网站所给的真实系统的节点失效模型进行多次实验,仿真结果表明:B-WSJ算法使得节点修复性能得到了很大的提升.
In distributed storage systems,how to optimize the regeneration time of lost data so as to keep high reliability has attracted attention increasingly.Recent researches reveal that node selection mechanism during repair progress has great impact on regeneration time.SPSN(select provider select newcomer)algorithm has put forward by some studies,which suits the scenario of single node failure well.However,it is very common to repair many modes at the same time in actual system.In this scenario,SPSN algorithm will no longer be optimal taking large time and space consumption into consideration.By analyzing the data failure trace of real distributed file system,we propose a node selection algorithm B-WSJ(bandwidth based weak and strong judgement)based on the existing algorithms and repairing model with the characteristic of parallelism which is suitable for multi-failure scenario.In order to describe the algorithm better,we firstly define several concepts of noderelationship on a link.Secondly we use these concepts to realize the weak and strong judgment of target node with pre-process and pruning strategy added.Finally,the nodes with better bandwidth will be chosen.To evaluate the performance of NS algorithm,we use Waxman algorithm to generate network topology and do many experiments with node failure models in real system provided by FTA(failure trace archive).The experimental results show the performance of B-WSJ algorithm can be improved greatly compared with the existing algorithms.
作者
刘佩
蒋梓逸
曹袖
Liu Pei;Jiang Ziyi;Cao Xiu(School of Computer Science and Technology,Fudan University,Shanghai 201203;Engineering Research Center of Cyber Security Auditing and Monitoring(Fudan University),Ministry of Education,Shanghai 200433)
出处
《计算机研究与发展》
EI
CSCD
北大核心
2018年第7期1557-1568,共12页
Journal of Computer Research and Development
关键词
分布式存储系统
数据修复
再生时间
多节点失效
节点选择
distributed storage system
data repair
regeneration time
multi node failure
node selection