Abstract
Traditional big data redundancy elimination algorithms cannot balance deduplication rate against throughput. To address this, a big data redundancy elimination algorithm based on hash calculation is designed. Data are first classified according to the marginal degree of each sample within the data set. A hash algorithm is then used to compute the similarity and entropy of the classified data, from which duplicate data are identified, completing the design of the redundancy elimination algorithm. Experimental results show that the proposed algorithm achieves a deduplication rate of up to 99% and a throughput of up to 26 MB/s, verifying that it effectively resolves the conflict between deduplication rate and throughput.
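The abstract only outlines the method (classification by marginal degree, hash-based similarity, and an entropy check), so the following Python sketch is an illustration under assumptions: the MinHash-style similarity estimate, the byte-level Shannon entropy, the thresholds, and all function names (shingles, minhash_signature, deduplicate, etc.) are hypothetical stand-ins for details the abstract does not specify, not the paper's actual implementation.

# Minimal sketch of hash-based deduplication in the spirit of the abstract.
# Classification by "marginal degree" is not specified there and is omitted;
# this covers only the hash / similarity / entropy duplicate-detection step.
import hashlib
import math
from collections import Counter

def shingles(data: bytes, k: int = 8):
    """Split a record into overlapping k-byte shingles (as a set)."""
    return {data[i:i + k] for i in range(max(1, len(data) - k + 1))}

def minhash_signature(data: bytes, num_hashes: int = 64):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over all shingles; the signature approximates the set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + s).digest()[:8], "big")
            for s in shingles(data)
        ))
    return sig

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching MinHash slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def byte_entropy(data: bytes):
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def deduplicate(records, sim_threshold=0.9, entropy_tolerance=0.05):
    """Keep a record only if no already-retained record is both highly
    similar (by MinHash) and close in entropy; otherwise drop it as a
    duplicate. Thresholds here are illustrative, not from the paper."""
    kept = []  # list of (signature, entropy, record)
    for rec in records:
        sig, ent = minhash_signature(rec), byte_entropy(rec)
        is_dup = any(
            estimate_similarity(sig, s) >= sim_threshold
            and abs(ent - e) <= entropy_tolerance
            for s, e, _ in kept
        )
        if not is_dup:
            kept.append((sig, ent, rec))
    return [rec for _, _, rec in kept]

if __name__ == "__main__":
    data = [b"the quick brown fox jumps over the lazy dog" * 4,
            b"the quick brown fox jumps over the lazy dog" * 4,  # duplicate
            b"an entirely different record with other content" * 4]
    print(len(deduplicate(data)))  # -> 2: the repeated record is removed

Comparing both a similarity signature and an entropy value, as the abstract describes, gives a cheap two-stage filter: the entropy check rejects records whose content statistics differ even when shingle overlap is coincidentally high.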
Author
ZHANG Shuqing (张淑清), Traffic Management Engineering College, Guangxi Police College, Nanning 530022, China
Source
Microcomputer Applications (《微型电脑应用》), 2021, No. 12, pp. 68-70 (3 pages)
Funding
Guangxi Science Research and Technology Development Program (2015BC17063).
Keywords
hash calculation
big data resources
redundancy elimination