Publications catalog - books
Data Science and Classification
Vladimir Batagelj ; Hans-Hermann Bock ; Anuška Ferligoj ; Aleš Žiberna (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2006 | SpringerLink |
Information
Resource type:
books
Print ISBN
978-3-540-34415-5
Electronic ISBN
978-3-540-34416-2
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2006
Publication rights information
© Springer-Verlag Berlin · Heidelberg 2006
Subject coverage
Table of contents
A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data
Antonio Irpino; Rosanna Verde
Symbolic Data Analysis (SDA) aims to describe and analyze complex and structured data extracted, for example, from large databases. Such data, which can be expressed as concepts, are modeled by symbolic objects described by multivalued variables. In the present paper we present a new distance, based on the Wasserstein metric, in order to cluster a set of data described by distributions with finite continuous support, or, as they are called in SDA, by “histograms”. The proposed distance permits us to define a measure of inertia of data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia. We propose to use this measure for an agglomerative hierarchical clustering of histogram data based on the Ward criterion. An application to real data validates the procedure.
Part IV - Analysis of Symbolic Data | Pp. 185-192
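As an illustration of the kind of distance the abstract above refers to, the following is a minimal sketch of a squared Wasserstein distance between two histograms. It assumes the L2 form computed from quantile functions and a uniform density inside each bin; the abstract does not state which variant the authors use, and the function names and the `(low, high, weight)` bin representation are hypothetical choices made for this example.

```python
from math import sqrt


def quantile(hist, t):
    """Inverse CDF of a histogram given as (low, high, weight) bins.

    Assumes the weights sum to 1 and the density is uniform inside each bin.
    """
    cum = 0.0
    for low, high, weight in hist:
        if weight > 0 and t <= cum + weight:
            return low + (high - low) * (t - cum) / weight
        cum += weight
    return hist[-1][1]


def cumulative_weights(hist):
    """Cumulative weights at the bin boundaries, starting at 0."""
    cum, points = 0.0, [0.0]
    for _, _, weight in hist:
        cum += weight
        points.append(cum)
    return points


def wasserstein2_squared(h1, h2):
    """Integral over [0, 1] of the squared difference of the quantile functions."""
    ts = sorted(set(cumulative_weights(h1) + cumulative_weights(h2) + [1.0]))
    ts = [t for t in ts if 0.0 <= t <= 1.0]
    total = 0.0
    for a, b in zip(ts, ts[1:]):
        if b - a < 1e-12:
            continue
        # Both quantile functions are linear on (a, b), so the squared
        # difference is quadratic and two-point Gauss-Legendre is exact.
        centre, half = (a + b) / 2, (b - a) / 2
        for node in (centre - half / sqrt(3), centre + half / sqrt(3)):
            diff = quantile(h1, node) - quantile(h2, node)
            total += half * diff * diff
    return total


# Hypothetical example: two single-bin histograms, one shifted by 2 units.
shifted = wasserstein2_squared([(0.0, 1.0, 1.0)], [(2.0, 3.0, 1.0)])
```

A pairwise matrix of such distances could then feed a standard agglomerative procedure; the Ward-type criterion mentioned in the abstract is not reproduced here.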
Symbolic Clustering of Large Datasets
Yves Lechevallier; Rosanna Verde; Francisco de A. T. de Carvalho
We present an approach to clustering large datasets that integrates Kohonen Self Organizing Maps (SOM) with a dynamic clustering algorithm for symbolic data (SCLUST). A preliminary data reduction using the SOM algorithm is performed. As a result, the individual measurements are replaced by micro-clusters. These micro-clusters are then grouped into a few clusters, which are modeled by symbolic objects. By computing the extension of these symbolic objects, the symbolic clustering algorithm allows the natural classes to be discovered. An application to a real data set shows the usefulness of this methodology.
Part IV - Analysis of Symbolic Data | Pp. 193-201
A Dynamic Clustering Method for Mixed Feature-Type Symbolic Data
Renata M. C. R. de Souza; Francisco de A. T. de Carvalho; Daniel Ferrari Pizzato
A dynamic clustering method for mixed feature-type symbolic data is presented. The proposed method requires a pre-processing step to transform Boolean symbolic data into modal symbolic data. The dynamic clustering method then takes as input a set of vectors of modal symbolic data and produces a partition and a prototype for each class by optimizing an adequacy criterion based on a suitable squared Euclidean distance. To show the usefulness of this method, examples with symbolic data sets are considered.
Part IV - Analysis of Symbolic Data | Pp. 203-210
Iterated Boosting for Outlier Detection
Nathalie Cheze; Jean-Michel Poggi
A procedure for detecting outliers in regression problems, based on information provided by boosting trees, is proposed. Boosting is meant for dealing with observations that are hard to predict by giving them extra weight. In the present paper, such observations are considered to be possible outliers, and a procedure is proposed that uses the boosting results to diagnose which observations could be outliers. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate boosting after removing it. Many well-known benchmark data sets are considered, and a comparative study against two classical competitors shows the value of the method.
Part V - General Data Analysis Methods | Pp. 213-220
Sub-species of Biplots and Small Class Inference with Analysis of Distance
Sugnet Gardner; Niël J. le Roux
A canonical variance analysis (CVA) biplot can visually portray a one-way MANOVA. Both techniques are subject to the assumption of equal class covariance matrices. In the application considered, very small sample sizes resulted in some singular class covariance matrix estimates, and furthermore it seemed unlikely that the assumption of homogeneity of covariance matrices would hold. Analysis of distance (AOD) is employed as a nonparametric inference tool. In particular, AOD biplots are introduced for a visual display of samples and variables, analogous to the CVA biplot.
Part V - General Data Analysis Methods | Pp. 221-228
Revised Boxplot Based Discretization as the Kernel of Automatic Interpretation of Classes Using Numerical Variables
Karina Gibert; Alejandra Pérez-Bonilla
In this paper the impact of improving on the methodology of , oriented to the automatic generation of conceptual descriptions of classifications that can support later decision-making is presented.
Part V - General Data Analysis Methods | Pp. 229-237
Comparison of Two Methods for Detecting and Correcting Systematic Error in High-throughput Screening Data
Andrei Gagarin; Dmytro Kevorkov; Vladimir Makarenkov; Pablo Zentilli
High-throughput screening (HTS) is an efficient technological tool for drug discovery in the modern pharmaceutical industry. It consists of testing thousands of chemical compounds per day to select active ones. This process has many drawbacks that may result in missing a potential drug candidate or in selecting inactive compounds. We describe and compare two statistical methods for correcting systematic errors that may occur during HTS experiments. Namely, the collected HTS measurements and the hit selection procedure are corrected.
Part VI - Data and Web Mining | Pp. 241-249
kNN Versus SVM in the Collaborative Filtering Framework
Miha Grčar; Blaž Fortuna; Dunja Mladenič; Marko Grobelnik
We present experimental results comparing the k-Nearest Neighbor (kNN) algorithm with the Support Vector Machine (SVM) in the collaborative filtering framework, using datasets with different properties. While k-Nearest Neighbor is usually used for collaborative filtering tasks, the Support Vector Machine is considered a state-of-the-art classification algorithm. Since collaborative filtering can also be interpreted as a classification/regression task, virtually any supervised learning algorithm (such as SVM) can also be applied. Experiments were performed on two standard, publicly available datasets and also on a real-life corporate dataset that does not fit the profile of ideal data for collaborative filtering. We conclude that the quality of collaborative filtering recommendations is highly dependent on the quality of the data. Furthermore, kNN is dominant over SVM on the two standard datasets. On the real-life corporate dataset, which has a high level of sparsity, kNN fails because it is unable to form reliable neighborhoods. In this case SVM outperforms kNN.
Part VI - Data and Web Mining | Pp. 251-260
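To make the neighborhood idea in the abstract above concrete, here is a minimal sketch of user-based kNN collaborative filtering. It is not the authors' experimental setup: the cosine similarity, the weighted-average prediction, the function names, and the toy ratings are all assumptions made for this example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[item] * v[item] for item in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, k=2):
    """Predict a rating as the similarity-weighted mean over the k nearest users.

    Returns None when no neighbor with positive similarity has rated the item.
    """
    candidates = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity > 0:
            candidates.append((similarity, other_ratings[item]))
    neighbours = sorted(candidates, reverse=True)[:k]
    if not neighbours:
        return None
    weight_sum = sum(s for s, _ in neighbours)
    return sum(s * r for s, r in neighbours) / weight_sum


# Hypothetical toy data: "dan" shares no rated item with anyone else.
toy_ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
    "dan": {"z": 3},
}
```

With this toy data, `predict(toy_ratings, "dan", "c")` returns `None` because no overlapping ratings exist, a small-scale version of the failure on sparse data that the abstract describes.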
Mining Association Rules in Folksonomies
Christoph Schmitz; Andreas Hotho; Robert Jäschke; Gerd Stumme
Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. These systems currently provide relatively little structure. In this paper, we discuss how association rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and supporting emergent semantics. We demonstrate our approach on a large-scale dataset stemming from an online system.
Part VI - Data and Web Mining | Pp. 261-270
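The association rule idea in the abstract above can be sketched as follows. This example assumes one possible projection of a folksonomy, in which each bookmark post becomes a transaction whose items are its tags; the abstract does not say which projection the authors use. The function name, thresholds, and sample posts are hypothetical.

```python
from collections import Counter
from itertools import permutations


def tag_rules(posts, min_support=0.5, min_confidence=0.8):
    """Find rules "tag a -> tag b" over posts, each post being a set of tags.

    support(a -> b)    = share of posts containing both a and b
    confidence(a -> b) = share of posts containing a that also contain b
    """
    n = len(posts)
    single = Counter()
    pair = Counter()
    for tags in posts:
        tags = set(tags)
        single.update(tags)
        # Ordered pairs, so (a, b) and (b, a) are counted as separate rules.
        pair.update(permutations(sorted(tags), 2))
    rules = {}
    for (a, b), count in pair.items():
        support = count / n
        confidence = count / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


# Hypothetical posts from a bookmarking system.
sample_posts = [
    {"python", "programming"},
    {"python", "programming", "web"},
    {"programming", "web"},
    {"python", "programming"},
]
sample_rules = tag_rules(sample_posts)
```

Rules such as "python -> programming" with high confidence hint at a more general tag subsuming a more specific one, which is the sort of structure that could inform ontology learning.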
Empirical Analysis of Attribute-Aware Recommendation Algorithms with Variable Synthetic Data
Karen H. L. Tso; Lars Schmidt-Thieme
Recommender Systems (RS) have helped achieve success in e-commerce. Devising better RS algorithms is an ongoing research topic. However, it has always been difficult to find adequate datasets for evaluating RS algorithms. Public data suitable for this kind of evaluation is limited, especially data containing content information (attributes). Previous research has shown that the performance of RS relies on the characteristics and quality of the datasets. Although a few others have conducted studies on synthetically generated data to mimic user-product datasets, datasets containing attribute information are rarely investigated. In this paper, we review synthetic datasets used in RS and present our synthetic data generator, which considers attributes. Moreover, we conduct empirical evaluations of existing hybrid recommendation algorithms and other state-of-the-art algorithms using these synthetic data, and observe the sensitivity of the algorithms when varying qualities of attribute data are applied to them.
Part VI - Data and Web Mining | Pp. 271-278