面向Top-k快速查询的层次化LSH索引方法

A Fast Top-k Query Method Based on Hierarchical LSH

下载PDF

导出

摘要局部敏感哈希(locality sensitive hashing,LSH)用于在海量高维数据中检索相似的数据项,它能高效地返回相似度大于用户给定阈值的数据对.但是,由于需要设置固定阈值,LSH无法直接处理Top-k相似查询.传统LSH索引算法需要设置一系列阈值,分别建立索引,时间和空间代价较大.提出了一种层次化的LSH索引算法,通过动态构建层次化相似度图,充分利用三角不等式,减少不必要的索引构建代价.具体来讲,首先通过高阈值构建相似度图,将高度相似的数据点抽象成"超点",再在"超点"上构建低阈值的相似度图.查询时,首先查询高阈值相似度图;数量不足时再查询低阈值相似度图.实验表明,相比传统LSH算法,本文方法在构建索引的时间和空间代价上减小一个数量级,查询更加高效. Locality sensitive hashing(LSH)is widely applied in similarity query over massive highdimensional data. When given a fixed threshold,LSH can return data pairs efficiently,whose similarity scores are no less than the threshold.However,LSH cant handle Top-k query naturally,for it depends heavily on a given threshold.A straightforward method to leverage LSH is to use a serial of thresholds and to construct indexes for each of them,leading to huge cost in time and space.In this paper,we propose a hierarchical locality sensitive hashing(HLSH)index method,which constructs hierarchical similarity graph dynamically,and utilizes triangle inequality to avoid unnecessary index construction.Specifically,we firstly construct a similarity graph based on a higher threshold,and abstract highly similar points into a single'super point'.Then we construct another similarity graph over these 'super points',based on a lower threshold.We dont query the lowerthreshold similarity graph,until the higher one cant return enough data. The experiments demonstrate that our method can reduce the cost of index construction in time and space to an order of magnitude,with higher query efficiency than Naive LSH method.

作者罗雄才高军

机构地区北京大学信息科学技术学院高可信软件技术教育部重点实验室(北京大学)

出处《计算机研究与发展》 EI CSCD 北大核心 2015年第S1期56-63,共8页 Journal of Computer Research and Development

基金国家自然科学基金项目(61272156)

关键词层次化局部敏感哈希 Minhash TOP-K查询相似度图三角不等式 hierarchical locality sensitive hashing(HLSH) Minhash Top-kquery similarity graph triangle inequality

分类号 TP391.3 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献13

1Indyk P,Motwani R.Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing . 1998
2Xie H,Chen Z,Liu Y,et al.Data-dependent locality sensitive hashing. Advances in Multimedia Information Processing-PCM 2014 . 2014
3Houle, M. E,Kriegel, H. P,Kroger, P,Schubert, E,Zimek, A.Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? (PDF). Scientific and Statistical Database Management . 2010
4Gionis A,Indyk P,Motwani R.Similarity search in high dimensions via hashing. The VLDB Journal . 1999
5Ji?í Matou?ek,Milo? Stojakovi?.On restricted min-wise independence of permutations. Random Structures & Algorithms . 2003
6J. Pan,D. Manocha.Bi-level locality sensitive hashing for k-nearest neighbor computation. DEEE 28th International Conference on Data Engineering . 2012
7Broder A Z,Charikar M,Frieze A M, et al.Min-wise independent permutations. Proceedings of the thirtieth annual ACM symposium on Theory of computing . 1998
8Stefan Berchtold,Christian B?hm,Hans-Peter Kriegal.??The pyramid-technique(J)ACM SIGMOD Record . 1998 (2)
9Rajaraman A,Ullman J D.Mining of Massive Datasets. . 2011
10蔡衡,李舟军,孙健,李洋.基于LSH的中文文本快速检索[J].计算机科学,2009,36(8):201-204. 被引量：13

二级参考文献17

1Stein B. Principles of hash - based text retrieval [C]//Annual ACM Conference on Research and Development in Information Retrieval Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007.
2Athitsos V,Potamias M,Papapetrou P,et al. Nearest Neighbor Retrieval Using Distanee-Based Hashing[C] // Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. 2008.
3IndykP, DatarM, ImmorlicaN. Locality-SensitiveHashingScheme Based on p-Stable[C]//Annual Symposium on Computational Geometry. 2004.
4Arya S, Mount D. Ann: Library for approximate nearest neighbor search[OL], http: //www. cs. umd. edu/-mount/ANN/.
5Indyk P, Motwani R. Approximate nearest neighbors : Towards removing the curse of dimensionality[C]//Jeffrey V, ed. Proc. of the 30th Annual ACM Symp. on Theory of Computing. New York: ACM Press, 1998 : 604-613.
6Panigrahy R. Entropy based nearest neighbor searchin high dimensions[C]//Proc, of ACM-SIAMSymposium on Discrete Algorithms(SODA). 2006.
7Ravichandran D,Pantel P, Hovy E. Randomized Algorithms and NLP..Using Locality Sensitive Hash Function for High Speed Noun Clustering[M]. Information Sciences Institute University of Southern California, 2004.
8Cai Rui, Zhang Chao, Zhang Lei, et al. Scalable Music Recommendation by Search [C]// International Multimedia Conference. 2007.
9Charikar M S. Similarity Estimation Techniques from Rounding Algorithms[C]//Annual ACM Symposium on Theory of Computing. 2002.
10Qin L, Josephson W, Wang Zhe, et al. A TimeSpace Efficient Locality Sensitive Hashing Method for Similarity Search in High Dimensions.

共引文献12

1易磊,仲红,袁先平,赵玉.支持容错检索的数据共享方案[J].计算机应用,2011,31(6):1525-1527.
2赵启潍,张乐,祝贝利,刘静.面向高维数据的LSH算法及应用[J].福建电脑,2012,28(4):13-14. 被引量：1
3高毫林,徐旭,李弼程.近似最近邻搜索算法——位置敏感哈希[J].信息工程大学学报,2013,14(3):332-340. 被引量：8
4於慧,谢萍,李士进,冯钧.基于多特征LSH索引的快速遥感图像检索[J].山西大学学报（自然科学版）,2013,36(3):350-356. 被引量：1
5卞艺杰,陈超,马玲玲,陈远磊.一种改进的LSH/MinHash协同过滤算法[J].计算机与现代化,2013(12):19-22. 被引量：5
6赵跃华,林聚伟.面向海量病毒样本家族聚类方法的研究[J].计算机工程与应用,2014,50(18):118-121.
7曹玉东,刘艳洋,孙福明,贾旭.低空间复杂度的LSH算法及其在图像检索中的应用[J].计算机工程与科学,2015,37(2):379-383. 被引量：2
8史建东,耿利川,秦永志.基于BSIFT的无人机影像快速拼接算法[J].测绘与空间地理信息,2015,38(5):124-127. 被引量：4
9曹玉东,刘艳洋,贾旭,王冬霞.基于改进的局部敏感哈希算法实现图像型垃圾邮件过滤[J].计算机应用研究,2016,33(6):1693-1696. 被引量：13
10顾志祥,谢龙恩,杜雨.文本相似度计算的Simhash算法的实现与改进[J].信息通信,2020,0(1):27-29. 被引量：5

1冯永亮.基于Canny算子的图像边缘检测方法改进研究[J].计算机与数字工程,2016,44(8):1567-1570. 被引量：13
2唐俊华,阎保平.基于LSH索引的快速图像检索[J].计算机工程与应用,2002,38(24):20-21. 被引量：6
3低阈值面部识别 iOS刷脸时代即将到来[J].新电脑,2015,0(8):15-15.
4卢艳君,徐望明.一种运用随机算法改进的图像检索方法[J].武汉科技大学学报,2015,38(1):72-76. 被引量：2
5印勇,刘平,刘丹平.采用改进Canny算子分割尿沉渣图像[J].计算机工程与应用,2010,46(14):196-198. 被引量：3
6蔡衡,李舟军,孙健,李洋.基于LSH的中文文本快速检索[J].计算机科学,2009,36(8):201-204. 被引量：13
7赵启潍,张乐,祝贝利,刘静.面向高维数据的LSH算法及应用[J].福建电脑,2012,28(4):13-14. 被引量：1
8曹玉东,刘艳洋,孙福明,贾旭.低空间复杂度的LSH算法及其在图像检索中的应用[J].计算机工程与科学,2015,37(2):379-383. 被引量：2
9陈强,朱立新,夏德深.结合Canny算子的图像二值化[J].计算机辅助设计与图形学学报,2005,17(6):1302-1306. 被引量：51
10拓小明,李云红,刘旭,曹浏,霍可,田冀达,陈航.基于Canny算子与阈值分割的边缘检测算法[J].西安工程大学学报,2014,28(6):745-749. 被引量：24

计算机研究与发展

2015年第S1期

浏览历史

内容加载中请稍等...

面向Top-k快速查询的层次化LSH索引方法

参考文献13

二级参考文献17

共引文献12

相关作者

相关机构

相关主题

浏览历史