期刊文献+

面向Top-k快速查询的层次化LSH索引方法

A Fast Top-k Query Method Based on Hierarchical LSH
下载PDF
导出
摘要 局部敏感哈希(locality sensitive hashing,LSH)用于在海量高维数据中检索相似的数据项,它能高效地返回相似度大于用户给定阈值的数据对.但是,由于需要设置固定阈值,LSH无法直接处理Top-k相似查询.传统LSH索引算法需要设置一系列阈值,分别建立索引,时间和空间代价较大.提出了一种层次化的LSH索引算法,通过动态构建层次化相似度图,充分利用三角不等式,减少不必要的索引构建代价.具体来讲,首先通过高阈值构建相似度图,将高度相似的数据点抽象成"超点",再在"超点"上构建低阈值的相似度图.查询时,首先查询高阈值相似度图;数量不足时再查询低阈值相似度图.实验表明,相比传统LSH算法,本文方法在构建索引的时间和空间代价上减小一个数量级,查询更加高效. Locality sensitive hashing(LSH)is widely applied in similarity query over massive highdimensional data. When given a fixed threshold,LSH can return data pairs efficiently,whose similarity scores are no less than the threshold.However,LSH cant handle Top-k query naturally,for it depends heavily on a given threshold.A straightforward method to leverage LSH is to use a serial of thresholds and to construct indexes for each of them,leading to huge cost in time and space.In this paper,we propose a hierarchical locality sensitive hashing(HLSH)index method,which constructs hierarchical similarity graph dynamically,and utilizes triangle inequality to avoid unnecessary index construction.Specifically,we firstly construct a similarity graph based on a higher threshold,and abstract highly similar points into a single'super point'.Then we construct another similarity graph over these 'super points',based on a lower threshold.We dont query the lowerthreshold similarity graph,until the higher one cant return enough data. The experiments demonstrate that our method can reduce the cost of index construction in time and space to an order of magnitude,with higher query efficiency than Naive LSH method.
作者 罗雄才 高军
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第S1期56-63,共8页 Journal of Computer Research and Development
基金 国家自然科学基金项目(61272156)
关键词 层次化局部敏感哈希 Minhash TOP-K查询 相似度图 三角不等式 hierarchical locality sensitive hashing(HLSH) Minhash Top-kquery similarity graph triangle inequality
  • 相关文献

参考文献13

  • 1Indyk P,Motwani R.Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing . 1998
  • 2Xie H,Chen Z,Liu Y,et al.Data-dependent locality sensitive hashing. Advances in Multimedia Information Processing-PCM 2014 . 2014
  • 3Houle, M. E,Kriegel, H. P,Kroger, P,Schubert, E,Zimek, A.Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? (PDF). Scientific and Statistical Database Management . 2010
  • 4Gionis A,Indyk P,Motwani R.Similarity search in high dimensions via hashing. The VLDB Journal . 1999
  • 5Ji?í Matou?ek,Milo? Stojakovi?.On restricted min-wise independence of permutations. Random Structures & Algorithms . 2003
  • 6J. Pan,D. Manocha.Bi-level locality sensitive hashing for k-nearest neighbor computation. DEEE 28th International Conference on Data Engineering . 2012
  • 7Broder A Z,Charikar M,Frieze A M, et al.Min-wise independent permutations. Proceedings of the thirtieth annual ACM symposium on Theory of computing . 1998
  • 8Stefan Berchtold,Christian B?hm,Hans-Peter Kriegal.??The pyramid-technique(J)ACM SIGMOD Record . 1998 (2)
  • 9Rajaraman A,Ullman J D.Mining of Massive Datasets. . 2011
  • 10蔡衡,李舟军,孙健,李洋.基于LSH的中文文本快速检索[J].计算机科学,2009,36(8):201-204. 被引量:13

二级参考文献17

  • 1Stein B. Principles of hash - based text retrieval [C]//Annual ACM Conference on Research and Development in Information Retrieval Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007.
  • 2Athitsos V,Potamias M,Papapetrou P,et al. Nearest Neighbor Retrieval Using Distanee-Based Hashing[C] // Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. 2008.
  • 3IndykP, DatarM, ImmorlicaN. Locality-SensitiveHashingScheme Based on p-Stable[C]//Annual Symposium on Computational Geometry. 2004.
  • 4Arya S, Mount D. Ann: Library for approximate nearest neighbor search[OL], http: //www. cs. umd. edu/-mount/ANN/.
  • 5Indyk P, Motwani R. Approximate nearest neighbors : Towards removing the curse of dimensionality[C]//Jeffrey V, ed. Proc. of the 30th Annual ACM Symp. on Theory of Computing. New York: ACM Press, 1998 : 604-613.
  • 6Panigrahy R. Entropy based nearest neighbor searchin high dimensions[C]//Proc, of ACM-SIAMSymposium on Discrete Algorithms(SODA). 2006.
  • 7Ravichandran D,Pantel P, Hovy E. Randomized Algorithms and NLP..Using Locality Sensitive Hash Function for High Speed Noun Clustering[M]. Information Sciences Institute University of Southern California, 2004.
  • 8Cai Rui, Zhang Chao, Zhang Lei, et al. Scalable Music Recommendation by Search [C]// International Multimedia Conference. 2007.
  • 9Charikar M S. Similarity Estimation Techniques from Rounding Algorithms[C]//Annual ACM Symposium on Theory of Computing. 2002.
  • 10Qin L, Josephson W, Wang Zhe, et al. A TimeSpace Efficient Locality Sensitive Hashing Method for Similarity Search in High Dimensions.

共引文献12

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部