Abstract
When distance measures designed for low-dimensional spaces, such as the Lk-norm, are applied in high-dimensional spaces, the contrast between the distances among objects vanishes as the dimensionality increases. Finding effective distance or similarity (dissimilarity) measures for high-dimensional data is therefore an important and challenging problem. By analyzing why traditional distance and similarity (dissimilarity) measures are ill-suited to high-dimensional spaces, and by summarizing the existing similarity measures for high-dimensional data, this paper proposes HDsim(X,Y), an improved version of the high-dimensional similarity function Hsim(X,Y). HDsim(X,Y) integrates the similarity measures for different data types: it retains the advantages of the original Hsim(X,Y) for numerical data, of the Jaccard coefficient for binary data, and of the matching ratio for categorical data. A validity analysis and a case study demonstrate that HDsim(X,Y) is effective for computing the similarity between objects in high-dimensional spaces.
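To make the three building blocks named in the abstract concrete, the following is a minimal Python sketch. It assumes the commonly cited form of Hsim, the mean over dimensions of 1 / (1 + |x_i - y_i|), together with the standard Jaccard coefficient and the simple matching ratio. The function names and the `weighted_mix` helper are hypothetical illustrations; in particular, `weighted_mix` is only one plausible way to combine per-type scores and is not the paper's definition of HDsim(X,Y), which the abstract does not give.

```python
from typing import Hashable, Sequence, Tuple


def hsim_numeric(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed Hsim form: mean of 1 / (1 + |x_i - y_i|); result lies in (0, 1]."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y)) / len(x)


def jaccard_binary(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient: shared 1s divided by positions where either is 1."""
    if len(x) != len(y):
        raise ValueError("x and y must be of equal length")
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Convention chosen here (an assumption): two all-zero vectors count as identical.
    return both / either if either else 1.0


def matching_ratio(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Fraction of categorical attributes on which the two objects agree."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def weighted_mix(parts: Sequence[Tuple[float, int]]) -> float:
    """Hypothetical combination, NOT the paper's HDsim: average the per-type
    similarities, weighting each by its number of attributes."""
    total = sum(n for _, n in parts)
    if total == 0:
        raise ValueError("at least one attribute is required")
    return sum(s * n for s, n in parts) / total


if __name__ == "__main__":
    num = hsim_numeric([1.0, 2.0], [1.0, 4.0])
    bin_ = jaccard_binary([1, 1, 0, 0], [1, 0, 1, 0])
    cat = matching_ratio(["a", "b", "c"], ["a", "x", "c"])
    print(num, bin_, cat, weighted_mix([(num, 2), (bin_, 4), (cat, 3)]))
```

Each component returns a value in [0, 1], so a score mixed this way stays in the same range regardless of how many attributes of each type an object has.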
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core Journal
2010, Issue 5, pp. 92-96 (5 pages)
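Funding
National Key Technology R&D Program of China (2007BAH16B03)
National High Technology Research and Development Program of China (863 Program) (2009AA12Z228)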
Keywords
high dimensional data
similarity measurement
attribute similarity
spatial similarity