摘要
局部敏感哈希(locality sensitive hashing,LSH)用于在海量高维数据中检索相似的数据项,它能高效地返回相似度大于用户给定阈值的数据对.但是,由于需要设置固定阈值,LSH无法直接处理Top-k相似查询.传统LSH索引算法需要设置一系列阈值,分别建立索引,时间和空间代价较大.提出了一种层次化的LSH索引算法,通过动态构建层次化相似度图,充分利用三角不等式,减少不必要的索引构建代价.具体来讲,首先通过高阈值构建相似度图,将高度相似的数据点抽象成"超点",再在"超点"上构建低阈值的相似度图.查询时,首先查询高阈值相似度图;数量不足时再查询低阈值相似度图.实验表明,相比传统LSH算法,本文方法在构建索引的时间和空间代价上减小一个数量级,查询更加高效.
Locality sensitive hashing(LSH)is widely applied in similarity query over massive highdimensional data. When given a fixed threshold,LSH can return data pairs efficiently,whose similarity scores are no less than the threshold.However,LSH cant handle Top-k query naturally,for it depends heavily on a given threshold.A straightforward method to leverage LSH is to use a serial of thresholds and to construct indexes for each of them,leading to huge cost in time and space.In this paper,we propose a hierarchical locality sensitive hashing(HLSH)index method,which constructs hierarchical similarity graph dynamically,and utilizes triangle inequality to avoid unnecessary index construction.Specifically,we firstly construct a similarity graph based on a higher threshold,and abstract highly similar points into a single'super point'.Then we construct another similarity graph over these 'super points',based on a lower threshold.We dont query the lowerthreshold similarity graph,until the higher one cant return enough data. The experiments demonstrate that our method can reduce the cost of index construction in time and space to an order of magnitude,with higher query efficiency than Naive LSH method.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2015年第S1期56-63,共8页
Journal of Computer Research and Development
基金
国家自然科学基金项目(61272156)