Abstract
This paper presents Squeezer, a new and efficient algorithm for clustering categorical data that produces high-quality clustering results while scaling well. Squeezer reads each tuple t in sequence and either assigns t to an existing cluster (initially there are none) or creates a new cluster containing t; the choice is determined by the similarities between t and the existing clusters. Because it processes the data in a single sequential pass, the algorithm is well suited to clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence seen so far using a small amount of memory and time. Squeezer can also handle outliers efficiently and directly. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
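The one-pass procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the similarity measure or how the assignment decision is made, so the `similarity` function (the per-attribute fraction of cluster members sharing the tuple's value, summed over attributes), the `threshold` parameter, and the name `squeezer_sketch` are all assumptions introduced here.

```python
from collections import Counter


def similarity(cluster, t):
    # Assumed measure (not given in the abstract): for each attribute, the
    # fraction of tuples in the cluster that share t's value, summed over
    # all attributes.
    size = cluster["size"]
    return sum(cluster["counts"][i][v] / size for i, v in enumerate(t))


def squeezer_sketch(tuples, threshold):
    """Single pass over the data; returns one cluster label per tuple."""
    clusters = []  # each cluster keeps only its size and per-attribute value counts
    labels = []
    for t in tuples:
        # Find the most similar existing cluster, if any.
        best, best_sim = None, -1.0
        for idx, c in enumerate(clusters):
            sim = similarity(c, t)
            if sim > best_sim:
                best, best_sim = idx, sim
        # No clusters yet, or none similar enough: start a new cluster for t.
        if best is None or best_sim < threshold:
            clusters.append({"size": 0, "counts": [Counter() for _ in t]})
            best = len(clusters) - 1
        # Update the chosen cluster's summary with t.
        c = clusters[best]
        c["size"] += 1
        for i, v in enumerate(t):
            c["counts"][i][v] += 1
        labels.append(best)
    return labels


data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]
print(squeezer_sketch(data, threshold=1.5))  # [0, 0, 1, 2]
```

Each cluster is represented only by value counts, so memory grows with the number of clusters and distinct attribute values rather than with the number of tuples, which is what makes a scheme of this shape plausible for streams. Under the assumed threshold rule, tuples that end up in very small clusters are natural outlier candidates.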
Funding
National Natural Science Foundation of China; IBM AS/400 Research Fund