Publications catalog - books
Grouping Multidimensional Data: Recent Advances in Clustering
Jacob Kogan ; Charles Nicholas ; Marc Teboulle (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
| Detected institution | Publication year | Browse | Download | Request |
|---|---|---|---|---|
| Not detected | 2006 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-28348-5
Electronic ISBN
978-3-540-28349-2
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2006
Publication rights information
© Springer-Verlag Berlin Heidelberg 2006
Subject coverage
Table of contents
The Star Clustering Algorithm for Information Organization
J.A. Aslam; E. Pelekhov; D. Rus
We present the star clustering algorithm for static and dynamic information organization. The offline star algorithm can be used for clustering static information systems, and the online star algorithm for clustering dynamic information systems. These algorithms organize a data collection into a number of clusters that are naturally induced by the collection, via a computationally efficient cover by dense subgraphs. We further show a lower bound on the accuracy of the clusters produced by these algorithms and demonstrate their computational efficiency. Finally, we discuss a number of applications of the star clustering algorithm and provide results from a number of experiments with the Text Retrieval Conference data.
Pp. 1-23
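The core of the offline star algorithm is a greedy cover of a thresholded similarity graph by star-shaped subgraphs. Below is a minimal Python sketch of that idea, assuming cosine similarity and dense toy data; the threshold name `sigma`, the variable names, and the tie-breaking are illustrative assumptions, not the authors' code.

```python
import numpy as np

def star_clusters(X, sigma=0.5):
    """Greedy star cover of the sigma-thresholded similarity graph.

    X     : (n_docs, n_terms) array of document vectors (nonzero rows)
    sigma : similarity threshold defining the graph's edges
    Returns a list of (center, satellites) pairs.
    """
    # Cosine-similarity matrix and thresholded adjacency.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    np.fill_diagonal(S, 0.0)
    adj = S >= sigma

    unmarked = set(range(len(X)))
    clusters = []
    while unmarked:
        # Highest-degree unmarked vertex becomes the next star center.
        center = max(unmarked, key=lambda v: adj[v].sum())
        satellites = set(np.flatnonzero(adj[center]))
        clusters.append((center, satellites))
        # Mark the center and its satellites; stars may overlap, since a
        # satellite can be adjacent to more than one center.
        unmarked -= satellites | {center}
    return clusters
```

Each iteration is cheap given the adjacency matrix, which is what makes the cover computationally efficient in the sense the abstract describes.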
A Survey of Clustering Data Mining Techniques
P. Berkhin
Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes; exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.
Pp. 25-71
Similarity-Based Text Clustering: A Comparative Study
J. Ghosh; A. Strehl
Clustering of text documents enables unsupervised categorization and facilitates browsing and search. Any clustering method has to embed the objects to be clustered in a suitable representational space that provides a measure of (dis)similarity between any pair of objects. While several clustering methods and the associated similarity measures have been proposed in the past for text clustering, there is no systematic comparative study of the impact of similarity measures on the quality of document clusters, possibly because most popular cost criteria for evaluating cluster quality do not readily translate across qualitatively different measures. This chapter compares popular similarity measures (Euclidean, cosine, Pearson correlation, extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hypergraph partitioning, generalized k-means, weighted graph partitioning) on a variety of high-dimensional sparse vector datasets representing text documents as bags of words. Performance is measured based on mutual information with a human-imposed classification. Our key findings are that in the quasi-orthogonal space of word frequencies: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior, especially when balanced clusters are desired; (iv) performance curves generally do cross.
Pp. 73-97
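The four similarity measures the chapter compares have compact definitions on bag-of-words vectors. A small sketch follows; the conversion of Euclidean distance into a similarity and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # Pearson correlation is the cosine of the mean-centered vectors.
    return cosine(x - x.mean(), y - y.mean())

def extended_jaccard(x, y):
    return x @ y / (x @ x + y @ y - x @ y)

def euclidean_sim(x, y):
    # One common way to turn a distance into a similarity score.
    return 1.0 / (1.0 + np.linalg.norm(x - y))

x = np.array([3.0, 0.0, 1.0, 2.0])   # toy term-frequency vectors
y = np.array([1.0, 0.0, 2.0, 0.0])
for f in (cosine, pearson, extended_jaccard, euclidean_sim):
    print(f.__name__, round(f(x, y), 3))
```

For nonnegative sparse word-frequency vectors, cosine, correlation, and extended Jaccard all normalize away document length, which is one intuition behind finding (i).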
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning
D. Littau; D. Boley
We present a method to cluster data sets too large to fit in memory, based on a Low-Memory Factored Representation (LMFR). The LMFR represents the original data in a factored form with much less memory, while preserving the individuality of each of the original samples. The scalable clustering algorithm Principal Direction Divisive Partitioning (PDDP) can use the factored form in a natural way to obtain a clustering of the original dataset.
The resulting algorithm is the PieceMeal PDDP (PMPDDP) method. The scalability of PMPDDP is demonstrated with a complexity analysis and experimental results. A discussion on the practical use of this method by a casual user is provided.
Pp. 99-126
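PDDP's basic step projects mean-centered data onto its leading principal direction and splits by the sign of the projection. The sketch below shows a single such split in Python, assuming the data fit in memory; the LMFR factored representation that makes PMPDDP scale is not reproduced here, and the function name is an assumption, not the authors' code.

```python
import numpy as np

def pddp_split(X):
    """One PDDP split of the rows of X (samples x features)."""
    centered = X - X.mean(axis=0)
    # Leading right singular vector = principal direction of the data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ Vt[0]
    left = np.flatnonzero(projections <= 0)
    right = np.flatnonzero(projections > 0)
    return left, right
```

Recursively applying this split, always to the cluster with the largest scatter, yields the PDDP hierarchy that the factored representation plugs into.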
Clustering with Entropy-Like k-Means Algorithms
M. Teboulle; P. Berkhin; I. Dhillon; Y. Guan; J. Kogan
The aim of this chapter is to demonstrate that many results attributed to the classical k-means clustering algorithm with the squared Euclidean distance can be extended to many other distance-like functions. We focus on entropy-like distances based on Bregman [88] and Csiszár [119] divergences, which have previously been shown to be useful in various optimization and clustering contexts. Further, the chapter reviews various versions of the classical k-means and BIRCH clustering algorithms with squared Euclidean distance and considers modifications of these algorithms with the proposed families of distance-like functions. Numerical experiments with some of these modifications are reported.
Pp. 127-160
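A sketch of the k-means iteration with a distance-like function swapped in, here the generalized Kullback-Leibler divergence, one of the entropy-like Bregman divergences the chapter studies. For any Bregman divergence the optimal centroid remains the arithmetic mean, which is what makes the substitution work; the epsilon smoothing and the initialization are illustrative assumptions.

```python
import numpy as np

def kl_divergence(x, c, eps=1e-12):
    # Generalized KL: sum x*log(x/c) - x + c, for nonnegative vectors.
    x, c = x + eps, c + eps
    return np.sum(x * np.log(x / c) - x + c)

def bregman_kmeans(X, k, iters=20, seed=0):
    X = np.asarray(X, dtype=float)       # assume nonnegative data
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid under the divergence.
        labels = np.array([
            np.argmin([kl_divergence(x, c) for c in centroids]) for x in X
        ])
        # Update step: the centroid is the arithmetic mean, exactly as
        # in squared-Euclidean k-means, for every Bregman divergence.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Replacing `kl_divergence` with the squared Euclidean distance recovers the classical algorithm, illustrating the chapter's point that the two steps of k-means survive the change of distance-like function.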
Sampling Methods for Building Initial Partitions
Z. Volkovich; J. Kogan; C. Nicholas
The initialization of iterative clustering algorithms is a difficult yet important problem in the practice of data mining. In this chapter, we discuss two new approaches for building such initial partitions. The first applies a procedure for selecting appropriate samples in the spirit of the Cross-Entropy (CE) method, and the second is based on a sequential summarizing schema. In the first approach, we use a sequential sample clustering procedure instead of the simulation step of the CE method. In this context, we state several facts related to the Projection Pursuit methodology for exploring the structure of a high-dimensional data set. In addition, we review several external and internal approaches to cluster validity testing. Experimental results for cluster initializations obtained via the CE method and the first of the presented methods are reported for a real data set.
Pp. 161-185
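The chapter's CE-based procedure is more elaborate than can be shown here; the sketch below only illustrates the underlying idea of sampling-based initialization: cluster several small random samples cheaply, score the candidate centroids, and seed the full run with the best set. The sample size, scoring rule, and plain Lloyd refinement are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def kmeans_cost(X, centroids):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def sample_init(X, k, n_samples=10, sample_size=100, seed=0):
    """Pick seed centroids by clustering small random samples."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        sample = X[idx]
        # Cheap candidate: k sample points, refined by a few Lloyd steps
        # on the sample only.
        centroids = sample[rng.choice(len(sample), k, replace=False)]
        for _ in range(5):
            labels = ((sample[:, None] - centroids[None]) ** 2).sum(2).argmin(1)
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = sample[labels == j].mean(axis=0)
        cost = kmeans_cost(sample, centroids)
        if cost < best_cost:
            best, best_cost = centroids, cost
    return best  # seed centroids for clustering the full data set
```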
TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections
D. Zeimpekis; E. Gallopoulos
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data presented in the form of large sparse term-document matrices (tdms). We present TMG, a research and teaching toolbox for the generation of sparse tdms from text collections and for the incremental modification of these tdms by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem-solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and the prototyping of algorithms for information retrieval. Several design issues concerning the use of MATLAB's sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
Pp. 187-210
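TMG itself is a MATLAB toolbox; purely as an illustration of what such a tool computes, here is a minimal term-document matrix with tf-idf weighting in Python. The whitespace tokenization, the particular weighting formula, and the toy corpus are assumptions for the sketch and do not mirror TMG's API.

```python
import numpy as np

docs = ["data mining finds clusters",
        "clusters of text documents",
        "mining text collections"]

# Vocabulary and raw term-frequency matrix (terms x documents).
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for d in docs] for w in vocab],
              dtype=float)

# tf-idf weighting: down-weight terms that occur in many documents.
df = (tf > 0).sum(axis=1)          # document frequency per term
idf = np.log(len(docs) / df)
tdm = tf * idf[:, None]

for w, row in zip(vocab, tdm):
    print(f"{w:12s}", np.round(row, 2))
```

In practice such matrices are stored sparse (as TMG does with MATLAB's sparse arrays), since most terms occur in few documents.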
Criterion Functions for Clustering on High-Dimensional Data
Y. Zhao; G. Karypis
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, in digital libraries and news sources, and on company-wide intranets. This has led to an increased interest in developing methods that can help users effectively navigate, summarize, and organize this information, with the ultimate goal of helping them find what they are looking for. Fast and high-quality document clustering algorithms play an important role toward this goal, as they have been shown both to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters, and to greatly improve retrieval performance via cluster-driven dimensionality reduction, term weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications have led to the development of a number of novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms with relatively low computational requirements are those that treat the clustering problem as an optimization process, seeking to maximize or minimize a particular criterion function defined over the entire clustering solution.
This chapter provides empirical and theoretical comparisons of the performance of a number of widely used criterion functions in the context of partitional clustering algorithms for high-dimensional datasets. The comparisons consist of a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on: (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
Pp. 211-237
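One widely used criterion function of the kind the chapter compares scores a clustering solution by the sum, over clusters, of the norm of the composite vector of the cosine-normalized documents (the I2 criterion behind spherical k-means style algorithms). A small sketch, with toy data and labels as illustrative assumptions:

```python
import numpy as np

def i2_criterion(X, labels):
    """I2 = sum over clusters of ||sum of unit-normalized docs||."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine-normalize rows
    return sum(np.linalg.norm(X[labels == r].sum(axis=0))
               for r in np.unique(labels))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = np.array([0, 0, 1, 1])   # groups the two similar pairs
bad = np.array([0, 1, 0, 1])    # mixes them
print(i2_criterion(X, good) > i2_criterion(X, bad))  # True
```

Tight clusters of nearly parallel unit vectors produce long composite vectors, so maximizing I2 rewards cohesion; criteria differ in how they trade such cohesion against cluster balance, which is exactly the axis the chapter's analysis examines.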