Publications catalog - books


Data Science and Classification

Vladimir Batagelj ; Hans-Hermann Bock ; Anuška Ferligoj ; Aleš Žiberna (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution | Publication year | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-34415-5

Electronic ISBN

978-3-540-34416-2

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin · Heidelberg 2006

Table of contents

A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data

Antonio Irpino; Rosanna Verde

Symbolic Data Analysis (SDA) aims to describe and analyze complex and structured data extracted, for example, from large databases. Such data, which can be expressed as concepts, are modeled by symbolic objects described by multivalued variables. In this paper we present a new distance, based on the Wasserstein metric, for clustering a set of data described by distributions with finite continuous support or, as they are called in SDA, by “histograms”. The proposed distance permits us to define a measure of inertia of data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia. We propose to use this measure for an agglomerative hierarchical clustering of histogram data based on the Ward criterion. An application to real data validates the procedure.

Part IV - Analysis of Symbolic Data | Pp. 185-192
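
The kind of distance named in the abstract above can be illustrated with a minimal sketch. It computes the squared L2 Wasserstein distance between two histograms as the integral of the squared difference of their quantile functions. It assumes contiguous bins, strictly positive bin weights, and a uniform density inside each bin. The function names and the input layout (a list of bin edges plus a list of weights) are hypothetical, and the sketch is not claimed to reproduce the authors' exact definition.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative(weights):
    """Normalised cumulative weights, starting at 0.0 and ending at exactly 1.0."""
    total = float(sum(weights))
    cum = [0.0] + [c / total for c in accumulate(weights)]
    cum[-1] = 1.0  # guard against floating-point drift
    return cum


def quantile(edges, cum, t):
    """Quantile function of a histogram, linear inside each bin (uniform density assumed)."""
    i = min(bisect_right(cum, t) - 1, len(cum) - 2)
    frac = (t - cum[i]) / (cum[i + 1] - cum[i])
    return edges[i] + frac * (edges[i + 1] - edges[i])


def wasserstein2_sq(edges_a, w_a, edges_b, w_b):
    """Squared L2 Wasserstein distance between two histograms (illustrative sketch)."""
    cum_a, cum_b = cumulative(w_a), cumulative(w_b)
    # Between consecutive breakpoints both quantile functions are linear,
    # so the squared difference can be integrated in closed form.
    ts = sorted(set(cum_a) | set(cum_b))
    total = 0.0
    for lo, hi in zip(ts, ts[1:]):
        d_lo = quantile(edges_a, cum_a, lo) - quantile(edges_b, cum_b, lo)
        d_hi = quantile(edges_a, cum_a, hi) - quantile(edges_b, cum_b, hi)
        total += (hi - lo) * (d_lo * d_lo + d_lo * d_hi + d_hi * d_hi) / 3.0
    return total
```

For example, `wasserstein2_sq([0, 1], [1], [1, 2], [1])` compares a uniform histogram on [0, 1] with the same histogram shifted by one unit, so the squared distance equals the squared shift.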

Symbolic Clustering of Large Datasets

Yves Lechevallier; Rosanna Verde; Francisco de A. T. de Carvalho

We present an approach to cluster large datasets that integrates the Kohonen Self Organizing Maps (SOM) with a dynamic clustering algorithm of symbolic data (SCLUST). A preliminary data reduction using the SOM algorithm is performed. As a result, the individual measurements are replaced by micro-clusters. These micro-clusters are then grouped into a few clusters, which are modeled by symbolic objects. By computing the extension of these symbolic objects, the symbolic clustering algorithm allows the natural classes to be discovered. An application to a real data set shows the usefulness of this methodology.

Part IV - Analysis of Symbolic Data | Pp. 193-201

A Dynamic Clustering Method for Mixed Feature-Type Symbolic Data

Renata M. C. R. de Souza; Francisco de A. T. de Carvalho; Daniel Ferrari Pizzato

A dynamic clustering method for mixed feature-type symbolic data is presented. The proposed method requires a pre-processing step to transform Boolean symbolic data into modal symbolic data. The dynamic clustering method then takes as input a set of vectors of modal symbolic data and furnishes a partition and a prototype for each class by optimizing an adequacy criterion based on a suitable squared Euclidean distance. To show the usefulness of this method, examples with symbolic data sets are considered.

Part IV - Analysis of Symbolic Data | Pp. 203-210

Iterated Boosting for Outlier Detection

Nathalie Cheze; Jean-Michel Poggi

A procedure for detecting outliers in regression problems based on information provided by boosting trees is proposed. Boosting is meant for dealing with observations that are hard to predict, by giving them extra weights. In the present paper, such observations are considered to be possible outliers, and a procedure is proposed that uses the boosting results to diagnose which observations could be outliers. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate boosting after removing it. Several well-known benchmark data sets are considered, and a comparative study against two classical competitors shows the value of the method.

Part V - General Data Analysis Methods | Pp. 213-220

Sub-species of Biplots and Small Class Inference with Analysis of Distance

Sugnet Gardner; Niël J. le Roux

A canonical variate analysis (CVA) biplot can visually portray a one-way MANOVA. Both techniques are subject to the assumption of equal class covariance matrices. In the application considered, very small sample sizes resulted in some singular class covariance matrix estimates, and furthermore it seemed unlikely that the assumption of homogeneity of covariance matrices would hold. Analysis of distance (AOD) is employed as a nonparametric inference tool. In particular, AOD biplots are introduced for a visual display of samples and variables, analogous to the CVA biplot.

Part V - General Data Analysis Methods | Pp. 221-228

Revised Boxplot Based Discretization as the Kernel of Automatic Interpretation of Classes Using Numerical Variables

Karina Gibert; Alejandra Pérez-Bonilla

In this paper the impact of improving on the methodology of , oriented to the automatic generation of conceptual descriptions of classifications that can support later decision-making is presented.

Part V - General Data Analysis Methods | Pp. 229-237

Comparison of Two Methods for Detecting and Correcting Systematic Error in High-throughput Screening Data

Andrei Gagarin; Dmytro Kevorkov; Vladimir Makarenkov; Pablo Zentilli

High-throughput screening (HTS) is an efficient technological tool for drug discovery in the modern pharmaceutical industry. It consists of testing thousands of chemical compounds per day to select active ones. This process has many drawbacks that may result in missing a potential drug candidate or in selecting inactive compounds. We describe and compare two statistical methods for correcting systematic errors that may occur during HTS experiments. Namely, the collected HTS measurements and the hit selection procedure are corrected.

Part VI - Data and Web Mining | Pp. 241-249

kNN Versus SVM in the Collaborative Filtering Framework

Miha Grčar; Blaž Fortuna; Dunja Mladenič; Marko Grobelnik

We present experimental results comparing the k-Nearest Neighbor (kNN) algorithm with the Support Vector Machine (SVM) in the collaborative filtering framework, using datasets with different properties. While k-Nearest Neighbor is usually used for collaborative filtering tasks, Support Vector Machine is considered a state-of-the-art classification algorithm. Since collaborative filtering can also be interpreted as a classification/regression task, virtually any supervised learning algorithm (such as SVM) can also be applied. Experiments were performed on two standard, publicly available datasets and on a real-life corporate dataset that does not fit the profile of ideal data for collaborative filtering. We conclude that the quality of collaborative filtering recommendations is highly dependent on the quality of the data. Furthermore, kNN is dominant over SVM on the two standard datasets. On the real-life corporate dataset, which has a high level of sparsity, kNN fails because it is unable to form reliable neighborhoods. In this case SVM outperforms kNN.

Part VI - Data and Web Mining | Pp. 251-260
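
As an illustration of the kNN side of the comparison in the abstract above, here is a minimal user-based sketch. It assumes ratings stored as nested dictionaries, cosine similarity between users, and a similarity-weighted average over the k most similar users who rated the item. These choices, and the names `cosine` and `predict`, are assumptions made for the example and are not taken from the chapter.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item.

    Returns None when no neighbour with positive similarity exists,
    which is the situation that very sparse data tends to produce.
    """
    candidates = [
        (cosine(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours = sorted(candidates, reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0:
        return None
    return sum(sim * rating for sim, rating in neighbours) / weight


# Hypothetical toy data: users mapped to their item ratings.
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
    "dan": {"x": 3},
}
```

With this toy data, `predict(ratings, "ann", "c", k=1)` relies only on the most similar user, while `predict(ratings, "dan", "c")` returns `None` because that user shares no rated items with anyone, a small-scale version of the neighborhood problem the abstract attributes to sparse data.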

Mining Association Rules in Folksonomies

Christoph Schmitz; Andreas Hotho; Robert Jäschke; Gerd Stumme

Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. These systems currently provide relatively little structure. In this paper we discuss how association rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and supporting emergent semantics. We demonstrate our approach on a large-scale dataset stemming from an online system.

Part VI - Data and Web Mining | Pp. 261-270
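
To make the idea in the abstract above concrete, the following minimal sketch mines simple one-to-one tag rules from folksonomy data. It assumes the folksonomy is given as (user, tag, resource) triples and that each (user, resource) post is treated as one transaction of tags. That projection, the thresholds, and the function name `tag_rules` are assumptions chosen for illustration and are only one of several ways such data could be prepared.

```python
from collections import defaultdict
from itertools import permutations


def tag_rules(assignments, min_support=0.5, min_confidence=0.8):
    """Return rules (a, b), read as 'tag a implies tag b', with (support, confidence)."""
    # Project the triples onto transactions: one set of tags per post.
    posts = defaultdict(set)
    for user, tag, resource in assignments:
        posts[(user, resource)].add(tag)

    n = len(posts)
    count = defaultdict(int)       # number of posts containing a tag
    pair_count = defaultdict(int)  # number of posts containing both tags
    for tags in posts.values():
        for tag in tags:
            count[tag] += 1
        for a, b in permutations(tags, 2):
            pair_count[(a, b)] += 1

    rules = {}
    for (a, b), c in pair_count.items():
        support = c / n
        confidence = c / count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


# Hypothetical toy folksonomy: (user, tag, resource) triples.
assignments = [
    ("u1", "python", "r1"), ("u1", "programming", "r1"),
    ("u2", "python", "r2"), ("u2", "programming", "r2"),
    ("u3", "programming", "r3"), ("u3", "java", "r3"),
]
rules = tag_rules(assignments)
```

In this toy data every post tagged `python` is also tagged `programming`, so that rule passes both thresholds, while the reverse rule falls below the confidence threshold. Rules of this shape hint at a broader-term relation between tags.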

Empirical Analysis of Attribute-Aware Recommendation Algorithms with Variable Synthetic Data

Karen H. L. Tso; Lars Schmidt-Thieme

Recommender Systems (RS) have helped achieve success in e-commerce. Developing better RS algorithms is an ongoing research topic. However, it has always been difficult to find adequate datasets to help evaluate RS algorithms. Public data suitable for this kind of evaluation is limited, especially data containing content information (attributes). Previous research has shown that the performance of RS relies on the characteristics and quality of datasets. Although a few others have conducted studies on synthetically generated data to mimic user-product datasets, datasets containing attribute information are rarely investigated. In this paper, we review synthetic datasets used in RS and present our synthetic data generator that considers attributes. Moreover, we conduct empirical evaluations on existing hybrid recommendation algorithms and other state-of-the-art algorithms using these synthetic data and observe the sensitivity of the algorithms when varying qualities of attribute data are applied to them.

Part VI - Data and Web Mining | Pp. 271-278