Abstract
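MinHash, a locality-sensitive hashing (LSH) algorithm, quickly estimates the similarity of two sets and can be used to find duplicate web pages or similar news pages on the web. It measures the similarity of objects using Jaccard similarity. This paper analyzes the implementation and performance of MinHash on a distributed platform and presents a distributed MinHash algorithm. Finally, experiments verify the correctness and accuracy of the proposed MinHash algorithm on practical problems.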
MinHash is a locality-sensitive hashing (LSH) algorithm that can quickly estimate the similarity of two sets in order to find duplicate web pages or similar news pages on the web. This paper focuses on the implementation and performance of MinHash on a distributed platform and devises a distributed MinHash algorithm. To verify the soundness of the new version, the paper conducts extensive experiments with several real datasets. Experimental results confirm the validity and accuracy of the proposed implementation.
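To make the core idea concrete, here is a minimal single-machine sketch in Python of how MinHash signatures can approximate Jaccard similarity. It is not the distributed algorithm proposed in the paper. The function names, the signature length of 256, and the linear hash family `(a * h + b) mod p` are all assumptions chosen for illustration.

```python
import hashlib
import random

# Assumption: a large Mersenne prime used as the modulus of the hash family.
_PRIME = (1 << 61) - 1


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _stable_hash(item):
    """Map an item to a 64-bit integer that is stable across runs."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining num_hashes functions h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Return one minimum hash value per hash function for a non-empty set."""
    hashed = [_stable_hash(x) % _PRIME for x in set(items)]
    if not hashed:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min((a * h + b) % _PRIME for h in hashed) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(256, seed=42)
    doc_a = range(0, 100)    # hypothetical shingle IDs of one page
    doc_b = range(50, 150)   # hypothetical shingle IDs of another page
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact Jaccard:   ", jaccard(doc_a, doc_b))
    print("MinHash estimate:", estimate_similarity(sig_a, sig_b))
```

Each signature depends only on its own set and the shared hash parameters, so signatures for different documents can in principle be computed independently of one another, which is what makes the method a natural fit for a distributed platform. How the paper actually partitions the work is not stated in the abstract.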
Source
《智能计算机与应用》
2014, Issue 6, pp. 44-46 (3 pages)
Intelligent Computer and Applications