Abstract
The Hadoop Distributed File System (HDFS) was originally designed for processing large files and does not store small files efficiently. This paper proposes ARMFS, an efficient small-file storage method based on association rule mining. ARMFS mines association rules from the Hadoop audit logs to obtain the associations among small files, and a file merging algorithm then merges and compresses the associated small files into HDFS. When a file in HDFS is requested, a prefetching algorithm uses the high-frequency access table and the prefetching table, both derived from the association rules, to further improve access efficiency. Experimental results show that ARMFS significantly improves memory efficiency on the NameNode and effectively improves the download speed and access efficiency of small files on HDFS.
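The pipeline described in the abstract (mine the audit log, group associated files for merging, prefetch on access) can be illustrated with a minimal Python sketch. This is not the ARMFS algorithm itself, whose details are not given here. It assumes pairwise rules only, treats all `cmd=open` entries from one client IP as a single session, and uses hypothetical function names and thresholds (`min_support`, `min_confidence`) chosen purely for illustration.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Matches the ip/cmd/src fields of an HDFS audit log entry for file opens.
AUDIT_OPEN = re.compile(r"ip=/(\S+)\s+cmd=open\s+src=(\S+)")


def sessions_from_audit(lines):
    """Group opened paths by client IP (assumption: one session per client)."""
    by_client = defaultdict(list)
    for line in lines:
        match = AUDIT_OPEN.search(line)
        if match:
            by_client[match.group(1)].append(match.group(2))
    return list(by_client.values())


def mine_pair_rules(sessions, min_support=2, min_confidence=0.6):
    """Return {file: [associated files]} from pairwise co-access counts."""
    file_count, pair_count = Counter(), Counter()
    for session in sessions:
        files = sorted(set(session))
        file_count.update(files)
        pair_count.update(combinations(files, 2))
    rules = defaultdict(list)
    for (a, b), n in sorted(pair_count.items()):
        if n < min_support:
            continue
        # Check the rule in both directions: x -> y holds if conf(x -> y) is high.
        for x, y in ((a, b), (b, a)):
            if n / file_count[x] >= min_confidence:
                rules[x].append(y)
    return dict(rules)


def merge_groups(rules):
    """Connected components of associated files: candidates for one merged file."""
    adj = defaultdict(set)
    for x, ys in rules.items():
        for y in ys:
            adj[x].add(y)
            adj[y].add(x)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                group.append(f)
                stack.extend(adj[f] - seen)
        groups.append(sorted(group))
    return groups


def prefetch_candidates(requested, rules, cached=()):
    """Files worth prefetching when `requested` is accessed."""
    return [f for f in rules.get(requested, []) if f not in cached]
```

Under these assumptions, `merge_groups(mine_pair_rules(sessions_from_audit(log_lines)))` yields the groups of small files that would be merged together, and `prefetch_candidates` plays the role of a lookup in a prefetching table.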
Source
Journal of East China University of Science and Technology (Natural Science Edition)
CAS
CSCD
Peking University Core Journals (北大核心)
2016, No. 5, pp. 708-714 (7 pages)
Funding
National Natural Science Foundation of China (61300041, 61272198)
Keywords
HDFS
association rule mining
association of small files
prefetching