Abstract
Hard disk failures are the leading cause of failure in data centers and seriously undermine data reliability. Traditional fault-tolerance techniques generally rely on adding data redundancy, which has shortcomings. Proactive fault tolerance, which predicts hard disk failures and migrates data ahead of time, has therefore become a research hotspot. However, most existing work studies only failure prediction and neglects collection, migration, and feedback, which makes commercial deployment difficult. This paper proposes a full-process "Collection-Prediction-Migration-Feedback" proactive fault-tolerance mechanism. It comprises a time-sharing method for collecting hard disk information, a sliding-window method for merging records and building samples, a multi-type hard disk failure prediction method, a multi-disk joint data migration method, and a two-level validation method for prediction results with fast feedback. Tests show that collecting hard disk information affects front-end workloads by only 0.96%, that hard disk failure prediction reaches a recall of 94.66%, and that data repair time is 55.10% shorter than with traditional methods. The mechanism has been running stably in commercial use in ZTE's data centers and meets the core goals of proactive fault tolerance: high reliability, high intelligence, low interference, low cost, and wide applicability.
Authors
杨洪章
杨雅辉
屠要峰
孙广宇
吴中海
Yang Hongzhang; Yang Yahui; Tu Yaofeng; Sun Guangyu; Wu Zhonghai (School of Software & Microelectronics, Peking University, Beijing 102600; ZTE Corporation, Shenzhen, Guangdong 518057; School of Electronics Engineering and Computer Science, Peking University, Beijing 100871)
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Chinese Core Journals)
2020, No. 2, pp. 306-317 (12 pages)
Journal of Computer Research and Development
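Funding
National Key Research and Development Program of China (2018YFB1003302)
National Natural Science Foundation of China (61672062)
Jiangsu Province Special Fund for the Transformation and Upgrading of Industry and the Information Industry (2018GX02517)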
Keywords
disk failure
storage reliability
fault tolerance
artificial intelligence
operation and maintenance