Publications catalog - books

Data Science and Classification

Vladimir Batagelj ; Hans-Hermann Bock ; Anuška Ferligoj ; Aleš Žiberna (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution | Year of publication | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-34415-5

Electronic ISBN

978-3-540-34416-2

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin · Heidelberg 2006

Table of contents

Empirical Comparison of a Monothetic Divisive Clustering Method with the Ward and the k-means Clustering Methods

Marie Chavent; Yves Lechevallier

DIVCLUS-T is a descendant hierarchical clustering method based on the same monothetic approach as classification and regression trees, but from an unsupervised point of view. The aim is not to predict a continuous variable (regression) or a categorical variable (classification) but to construct a hierarchy. The dendrogram of the hierarchy is easy to interpret and can be read as a decision tree. An example of this new type of dendrogram is given on a small categorical dataset. DIVCLUS-T is then compared empirically with two polythetic clustering methods: the Ward ascendant hierarchical clustering method and the k-means partitional method. The three algorithms are applied and compared on six databases from the UCI Machine Learning Repository.

Part II - Classification and Clustering | Pp. 83-90

Model Selection for the Binary Latent Class Model: A Monte Carlo Simulation

José G. Dias

This paper addresses model selection using information criteria for binary latent class (LC) models. A Monte Carlo study sets an experimental design to compare the performance of different information criteria for this model, some compared for the first time. Furthermore, the level of separation of latent classes is controlled using a new procedure. The results show that AIC3 (Akaike information criterion with 3 as penalizing factor) has a balanced performance for binary LC models.

Part II - Classification and Clustering | Pp. 91-99
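
As a rough illustration of the criterion named in the abstract above, the sketch below computes AIC3 as -2 log L + 3k for a binary latent class model with S classes and J items, where k = (S - 1) + S·J counts the free parameters. The function names and the log-likelihood values are hypothetical and serve only to show the arithmetic; they are not taken from the paper.

```python
import math


def n_free_parameters(n_classes: int, n_items: int) -> int:
    """Free parameters of a binary latent class model:
    (S - 1) class proportions plus S * J item-response probabilities."""
    return (n_classes - 1) + n_classes * n_items


def information_criterion(log_likelihood: float, n_params: int, penalty: float) -> float:
    """Generic penalised criterion: -2 log L + penalty * k."""
    return -2.0 * log_likelihood + penalty * n_params


def aic(log_likelihood: float, n_params: int) -> float:
    return information_criterion(log_likelihood, n_params, 2.0)


def aic3(log_likelihood: float, n_params: int) -> float:
    return information_criterion(log_likelihood, n_params, 3.0)


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    return information_criterion(log_likelihood, n_params, math.log(n_obs))


# Hypothetical maximised log-likelihoods for models with 1 to 4 classes
# fitted to J = 5 binary items (made-up numbers, for illustration only).
n_items = 5
log_likelihoods = {1: -3250.0, 2: -3100.0, 3: -3090.0, 4: -3088.0}

aic3_scores = {
    s: aic3(ll, n_free_parameters(s, n_items)) for s, ll in log_likelihoods.items()
}
best_n_classes = min(aic3_scores, key=aic3_scores.get)
print(aic3_scores, best_n_classes)
```

The model with the smallest criterion value is retained; changing the penalty from 3 to 2 or to log(n) gives AIC or BIC with the same code path.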

Finding Meaningful and Stable Clusters Using Local Cluster Analysis

Hans-Joachim Mucha

Let us consider the problem of finding clusters in a heterogeneous, high-dimensional setting. Usually a (global) cluster analysis model is applied to reach this aim. As a result, often ten or more clusters are detected in a heterogeneous data set. The idea of this paper is to perform subsequent local cluster analyses. Here the following two main questions arise. Is it possible to improve the stability of some of the clusters? Are there new clusters that are not yet detected by global clustering? The paper presents a methodology for such an iterative clustering that can be a useful tool in discovering stable and meaningful clusters. The proposed methodology is used successfully in the field of archaeometry. Here, without loss of generality, it is applied to hierarchical cluster analysis. The improvements of local cluster analysis will be illustrated by means of multivariate graphics.

Part II - Classification and Clustering | Pp. 101-108

Comparing Optimal Individual and Collective Assessment Procedures

Hans J. Vos; Ruth Ben-Yashar; Shmuel Nitzan

This paper focuses on the comparison between the optimal cutoff points set on single and multiple tests in predictor-based assessment, that is, assessing applicants as either suitable or unsuitable for a job. Our main result specifies the condition that determines the number of predictor tests, the collective assessment rule (aggregation procedure of predictor tests’ recommendations) and the function relating the tests’ assessment skills to the predictor cutoff points.

Part II - Classification and Clustering | Pp. 109-116

Some Open Problem Sets for Generalized Blockmodeling

Patrick Doreian

This paper provides an introduction to the blockmodeling problem of how to cluster networks, based solely on the structural information contained in the relational ties, and a brief overview of generalized blockmodeling as an approach for solving this problem. Following a formal statement of the core of generalized blockmodeling, a listing of the advantages of adopting this approach to partitioning networks is provided. These advantages, together with some of the disadvantages of this approach, in its current state, form the basis for proposing some open problem sets for generalized blockmodeling. Providing solutions to these problem sets will transform generalized blockmodeling into an even more powerful approach for clustering networks of relations.

Part III - Network and Graph Analysis | Pp. 119-130

Spectral Clustering and Multidimensional Scaling: A Unified View

François Bavaud

Spectral clustering is a procedure aimed at partitioning a weighted graph into minimally interacting components. The resulting eigen-structure is determined by a reversible Markov chain, or equivalently by a symmetric transition matrix. On the other hand, multidimensional scaling procedures (and factorial correspondence analysis in particular) consist of the spectral decomposition of a kernel matrix. This paper shows how the transition matrix and the kernel matrix can be related to each other through a linear or even non-linear transformation that leaves the eigenvectors invariant. As illustrated by examples, this makes it possible to define a transition matrix from a similarity matrix between objects, to define Euclidean distances between the vertices of a weighted graph, and to elucidate the “flow-induced” nature of spatial auto-covariances.

Part III - Network and Graph Analysis | Pp. 131-139
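
The idea of partitioning a weighted graph through the eigen-structure of a symmetric matrix can be sketched as follows. This is a common textbook construction (symmetric normalisation of the weight matrix, then a split on the sign of the second eigenvector), assuming NumPy is available; it is not necessarily the construction used in the paper, and the function name and the example graph are hypothetical.

```python
import numpy as np


def spectral_bipartition(weights):
    """Split a connected weighted graph in two using the second eigenvector."""
    w = np.asarray(weights, dtype=float)
    degrees = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}. S has the same
    # eigenvalues as the random-walk transition matrix D^{-1} W.
    sym = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(sym)  # ascending eigenvalues
    # Eigenvector of the second-largest eigenvalue, mapped back to the
    # corresponding eigenvector of the random-walk matrix.
    second = eigenvectors[:, -2] * d_inv_sqrt
    # The sign of an eigenvector is arbitrary, so the two labels may swap.
    return (second > 0).astype(int)


# Hypothetical example: two triangles joined by one weak edge (2, 3).
example_weights = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    example_weights[i, j] = example_weights[j, i] = 1.0
example_weights[2, 3] = example_weights[3, 2] = 0.1

labels = spectral_bipartition(example_weights)
print(labels)
```

The weak edge is the one cut by the partition, so each triangle ends up in its own group.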

Analyzing the Structure of U.S. Patents Network

Vladimir Batagelj; Nataša Kejžar; Simona Korenjak-Černe; Matjaž Zaveršnik

The U.S. patents network is a network of almost 3.8 million patents (network vertices) from 1963 to 1999 (Hall et al. (2001)) and more than 16.5 million citations (network arcs). It is an example of a very large citation network.

We analyzed the U.S. patents network with the tools of network analysis in order to gain insight into the structure of the network, as an initial step toward the study of innovations and technical changes based on patent citation network data.

In our approach the SPC (Search Path Count) weights, proposed by Hummon and Doreian (1989), are first calculated for vertices and arcs. Based on these weights, vertex and line islands (Batagelj and Zaveršnik (2004)) are determined to identify the main themes of the U.S. patents network. All analyses were done with Pajek, a program for analysis and visualization of large networks. The main U.S. patent topics obtained from the analysis are presented.

Part III - Network and Graph Analysis | Pp. 141-148
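
A minimal sketch of arc SPC weights on a small acyclic network, under the usual definition that the weight of an arc (u, v) is the number of source-to-sink paths passing through it, computed as (number of paths from any source to u) × (number of paths from v to any sink). The function name, the arc direction and the four-vertex example are hypothetical; this is not Pajek's implementation.

```python
from collections import defaultdict, deque


def spc_weights(arcs):
    """Return {arc: number of source-to-sink paths through that arc}."""
    successors = defaultdict(list)
    predecessors = defaultdict(list)
    vertices = set()
    for u, v in arcs:
        successors[u].append(v)
        predecessors[v].append(u)
        vertices.update((u, v))

    # Topological order (Kahn's algorithm); fails if the network has a cycle.
    indegree = {v: len(predecessors[v]) for v in vertices}
    queue = deque(sorted(v for v in vertices if indegree[v] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("network must be acyclic")

    # Number of paths reaching each vertex from any source.
    paths_in = {}
    for v in order:
        preds = predecessors[v]
        paths_in[v] = 1 if not preds else sum(paths_in[u] for u in preds)

    # Number of paths leaving each vertex towards any sink.
    paths_out = {}
    for v in reversed(order):
        succs = successors[v]
        paths_out[v] = 1 if not succs else sum(paths_out[w] for w in succs)

    return {(u, v): paths_in[u] * paths_out[v] for u, v in arcs}


# Hypothetical citation network with two source-to-sink paths:
# p1 -> p2 -> p3 -> p4 and p1 -> p3 -> p4.
example_arcs = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "p4")]
example_spc = spc_weights(example_arcs)
print(example_spc)
```

The arc (p3, p4) lies on both paths and therefore gets the largest weight, which is the kind of signal used to pick out the most traversed parts of a citation network.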

Identifying and Classifying Social Groups: A Machine Learning Approach

Matteo Roffilli; Alessandro Lomi

The identification of social groups remains one of the main analytical themes in the analysis of social networks and, in more general terms, in the study of social organization. Traditional network approaches to group identification encounter a variety of problems when the data to be analyzed involve two-mode networks, i.e., relations between two distinct sets of objects with no reflexive relation allowed within each set. In this paper we propose a relatively novel approach to the recognition and identification of social groups in data generated by network-based processes in the context of two-mode networks. Our approach is based on a family of learning algorithms called Support Vector Machines (SVM). The analytical framework of SVM offers a flexible statistical environment to solve classification tasks, and to reframe regression and density estimation problems. We explore the relative merits of our approach to the analysis of social networks in the context of the well-known “Southern women” (SW) data set collected by Davis, Gardner and Gardner. We compare our results with those that have been produced by different analytical approaches. We show that our method, which acts as a data-independent preprocessing step, is able to reduce the complexity of the clustering problem, enabling the application of simpler configurations of common algorithms.

Part III - Network and Graph Analysis | Pp. 149-157

Multidimensional Scaling of Histogram Dissimilarities

Patrick J. F. Groenen; Suzanne Winsberg

Multidimensional scaling aims at reconstructing dissimilarities between pairs of objects by distances in a low-dimensional space. However, in some cases the dissimilarity itself is unknown, but the range, or a histogram of the dissimilarities, is given. This type of data falls in the wider class of symbolic data (see Bock and Diday (2000)). We model a histogram of dissimilarities by a histogram of the distances defined as the minimum and maximum distance between two sets of embedded rectangles representing the objects. In this paper, we provide a new algorithm called Hist-Scal using iterative majorization, which is based on an algorithm, I-Scal, developed for the case where the dissimilarities are given by a range of values, i.e., an interval (see Groenen et al. (in press)). The advantage of iterative majorization is that each iteration is guaranteed to improve the solution until no improvement is possible. We present the results on an empirical data set on synthetic musical tones.

Part IV - Analysis of Symbolic Data | Pp. 161-170

Dependence and Interdependence Analysis for Interval-Valued Variables

Carlo Lauro; Federica Gioia

Data analysis is often affected by different types of errors, such as measurement errors, computation errors, and imprecision related to the method adopted for estimating the data. The methods that have been proposed for treating errors in the data may also be applied to different kinds of data that in real life are of interval type. The uncertainty in the data, which is strictly connected to the above errors, may be treated by considering, rather than a single value for each datum, the interval of values in which it may fall. The purpose of the present paper is to introduce methods for analyzing the dependence and interdependence among variables. Statistical units described by interval-valued variables can be regarded as a special case of Symbolic Object (SO). In Symbolic Data Analysis (SDA), these data are represented as boxes. Accordingly, the purpose of the present work is the extension of the [...] to obtain a visualization of such boxes in a lower-dimensional space. Furthermore, a new method for fitting an equation is developed. Unlike other approaches proposed in the literature, which work on scalar recodings of the intervals using classical tools of analysis, we make extensive use of interval algebra tools combined with some optimization techniques.

Part IV - Analysis of Symbolic Data | Pp. 171-183
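
To make the notion of interval-valued data and interval algebra in the abstract above more concrete, here is a minimal sketch of basic interval arithmetic. The `Interval` class and its method names are hypothetical and only illustrate the elementary operations; they do not reproduce the methods developed in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] standing in for an imprecise value."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound must not exceed upper bound")

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # The smallest difference pairs our lower bound with their upper bound.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        # The extremes of a product are attained at the endpoints.
        products = [
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        ]
        return Interval(min(products), max(products))

    @property
    def midpoint(self) -> float:
        return (self.lo + self.hi) / 2

    @property
    def radius(self) -> float:
        return (self.hi - self.lo) / 2


# Hypothetical measurements recorded as ranges rather than single values.
a = Interval(1.0, 2.0)
b = Interval(-1.0, 3.0)
print(a + b, a - b, a * b, a.midpoint, a.radius)
```

Working on the bounds directly keeps the uncertainty visible in every result, in contrast to recoding each interval as a single scalar such as its midpoint.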