Publications catalog - books

Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006, Proceedings

Wee-Keong Ng; Masaru Kitsuregawa; Jianzhong Li; Kuiyu Chang (eds.)

In conference: 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Singapore, Singapore. April 9, 2006 - April 12, 2006

Abstract/description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Detected institution: Not detected
Year of publication: 2006
Browse: SpringerLink

Information

Resource type

books

Print ISBN

978-3-540-33206-0

Electronic ISBN

978-3-540-33207-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2006

Publication rights information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Constructive Meta-level Feature Selection Method Based on Method Repositories

Hidenao Abe; Takahira Yamaguchi

Feature selection is one of the key issues in data pre-processing for classification tasks in a data mining process. Although many efforts have been made to improve typical feature selection algorithms (FSAs), such as filter methods and wrapper methods, it is hard for any single FSA to perform well across diverse datasets. To address this problem, we propose another way to support the feature selection procedure: constructing a proper FSA for each given dataset. We discuss constructive meta-level feature selection, which de-composes representative FSAs into methods and re-constructs a proper FSA from a method repository for each given dataset. After implementing the constructive meta-level feature selection system, we show how well constructive meta-level feature selection performs on 32 common UCI data sets, comparing its accuracies with those of typical FSAs. As the result, our system shows the highest accuracy and the ability to construct a proper FSA for each given data set automatically.

- Classification | Pp. 70-80
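
The core loop of the meta-level idea can be illustrated in a few lines: de-compose FSAs into interchangeable parts (an attribute-scoring method and a subset-construction method), enumerate recombinations from a small "method repository", and keep whichever composed FSA cross-validates best on the dataset at hand. This is a minimal sketch, not the authors' system; the scikit-learn scorers, selectors, and naive Bayes evaluator below are illustrative stand-ins for the repository's methods.

```python
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectKBest, SelectPercentile,
                                       f_classif, mutual_info_classif)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
clf = GaussianNB()

# The "method repository": scoring methods and subset-construction methods
# that representative FSAs de-compose into.
scoring_methods = {"f_score": f_classif, "mutual_info": mutual_info_classif}
constructions = {
    "top_10": lambda s: SelectKBest(s, k=10),
    "top_30pct": lambda s: SelectPercentile(s, percentile=30),
}

# Re-construct candidate FSAs for this dataset and keep the best one.
best_acc, best_fsa = -1.0, None
for s_name, c_name in product(scoring_methods, constructions):
    selector = constructions[c_name](scoring_methods[s_name])
    Xs = selector.fit_transform(X, y)
    acc = cross_val_score(clf, Xs, y, cv=5).mean()
    if acc > best_acc:
        best_acc, best_fsa = acc, (s_name, c_name)

print(f"constructed FSA for this dataset: {best_fsa} (cv acc {best_acc:.3f})")
```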

Variable Randomness in Decision Tree Ensembles

Fei Tony Liu; Kai Ming Ting

In this paper, we propose Max-diverse.α, which has a mechanism to control the degree of randomness in decision tree ensembles. This control gives an ensemble the means to balance the two conflicting functions of a random ensemble, i.e., the abilities to model non-axis-parallel boundaries and to eliminate irrelevant features. We find that this control is more sensitive than the one provided by Random Forests. Using progressive training errors, we are able to estimate an appropriate degree of randomness for any given data prior to any predictive tasks. Experimental results show that Max-diverse.α is significantly better than Random Forests and Max-diverse Ensemble, and it is comparable to the state-of-the-art C5 boosting.

- Ensemble Learning | Pp. 81-90
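
A rough sketch of a single-knob randomness control: with probability alpha each member tree is grown with completely random splits, otherwise with greedy best splits. This only approximates Max-diverse.α, which varies randomness more finely and estimates alpha from progressive training errors; the alpha value, base learners, and data here are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

def fit_variable_random_ensemble(X, y, n_trees=100, alpha=0.5):
    # max_features=1 with random split thresholds -> completely random tree.
    random_proto = ExtraTreeClassifier(max_features=1)
    greedy_proto = DecisionTreeClassifier()
    trees = []
    for i in range(n_trees):
        proto = random_proto if rng.random() < alpha else greedy_proto
        tree = clone(proto)
        tree.set_params(random_state=i)
        boot = rng.integers(0, len(X), len(X))   # bootstrap for extra diversity
        trees.append(tree.fit(X[boot], y[boot]))
    return trees

def predict(trees, X):
    # Average class probabilities across members, then take the argmax.
    return np.mean([t.predict_proba(X) for t in trees], axis=0).argmax(axis=1)

trees = fit_variable_random_ensemble(X, y, alpha=0.7)
print("train acc:", (predict(trees, X) == y).mean())
```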

Further Improving Emerging Pattern Based Classifiers Via Bagging

Hongjian Fan; Ming Fan; Kotagiri Ramamohanarao; Mengxu Liu

Emerging Patterns (EPs) are those itemsets whose supports in one class are significantly higher than their supports in the other class. In this paper we investigate how to “bag” EP-based classifiers to build effective ensembles. We design a new scoring function based on growth rates to increase the diversity of individual classifiers and an effective scheme to combine the power of ensemble members. The experimental results confirm that our method of “bagging” EP-based classifiers can produce a more accurate and noise tolerant classifier ensemble.

- Ensemble Learning | Pp. 91-96
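
For context, the classical way an EP-based classifier turns growth rates into class scores looks roughly like the toy sketch below: an instance votes for a class by summing growth-rate-weighted supports of the class's EPs it contains. The paper's bagging wraps such base classifiers, mined from bootstrap replicates, with its own scoring and combination scheme; all patterns and numbers here are invented for illustration.

```python
# Each EP is (itemset, support in its home class, growth rate).
eps_by_class = {
    "pos": [({"a", "b"}, 0.40, 8.0), ({"c"}, 0.25, 5.0)],
    "neg": [({"d"}, 0.30, 6.0), ({"b", "d"}, 0.20, float("inf"))],
}

def ep_score(instance, eps):
    score = 0.0
    for itemset, support, gr in eps:
        if itemset <= instance:  # the EP's itemset is contained in the instance
            # Growth-rate weight in [0, 1); jumping EPs (infinite GR) weigh 1.
            w = 1.0 if gr == float("inf") else gr / (gr + 1.0)
            score += w * support
    return score

instance = {"a", "b", "d"}
scores = {c: ep_score(instance, eps) for c, eps in eps_by_class.items()}
print(scores, "->", max(scores, key=scores.get))
```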

Improving on Bagging with Input Smearing

Eibe Frank; Bernhard Pfahringer

Bagging is an ensemble learning method that has proved to be a useful tool in the arsenal of machine learning practitioners. Commonly applied in conjunction with decision tree learners to build an ensemble of decision trees, it often leads to reduced errors in the predictions when compared to using a single tree. A single tree is built from a training set of size n. Bagging is based on the idea that, ideally, we would like to eliminate the variance due to a particular training set by combining trees built from all training sets of size n. However, in practice, only one training set is available, and bagging simulates this platonic method by sampling with replacement from the original training data to form new training sets. In this paper we pursue the idea of sampling from a kernel density estimator of the underlying distribution to form new training sets, in addition to sampling from the data itself. This can be viewed as “smearing out” the resampled training data to generate new datasets, and the amount of “smear” is controlled by a parameter. We show that the resulting method, called “input smearing”, can lead to improved results when compared to bagging. We present results for both classification and regression problems.

- Ensemble Learning | Pp. 97-106
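
The smearing step itself is simple to sketch: resample with replacement as in bagging, then perturb each resampled numeric input with Gaussian noise scaled by a smear parameter times the per-attribute standard deviation. A minimal sketch assuming numeric attributes; the parameter value and base learner are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)

def smeared_ensemble(X, y, n_estimators=50, p=0.3):
    sigma = X.std(axis=0)                        # per-attribute spread
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), len(X))    # bootstrap sample, as in bagging
        Xb = X[idx] + rng.normal(0.0, p * sigma, X[idx].shape)  # "smear" inputs
        trees.append(DecisionTreeClassifier().fit(Xb, y[idx]))
    return trees

trees = smeared_ensemble(X, y)
pred = np.round(np.mean([t.predict(X) for t in trees], axis=0))
print("train acc:", (pred == y).mean())
```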

Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles

Yang Liu; Aijun An; Xiangji Huang

Learning from imbalanced datasets is inherently difficult due to lack of information about the minority class. In this paper, we study the performance of SVMs, which have gained great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs suffer from biased decision boundaries, and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique with an ensemble of SVMs to improve the prediction performance. The integrated sampling technique combines both over-sampling and under-sampling techniques. Through empirical study, we show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.

- Ensemble Learning | Pp. 107-118
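
A hedged sketch of the ensemble idea: each member SVM sees the minority class over-sampled with replacement plus a random under-sample of the majority class, and the members vote. Plain random over-/under-sampling stands in here for the paper's integrated sampling technique, and the member count and sample sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Highly skewed data: ~95% majority (class 0), ~5% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

def balanced_svm_ensemble(n_members=10, size=2 * len(min_idx)):
    members = []
    for _ in range(n_members):
        over = rng.choice(min_idx, size // 2, replace=True)    # over-sample minority
        under = rng.choice(maj_idx, size // 2, replace=False)  # under-sample majority
        idx = np.concatenate([over, under])
        members.append(SVC().fit(X[idx], y[idx]))
    return members

members = balanced_svm_ensemble()
votes = np.mean([m.predict(X) for m in members], axis=0)
pred = (votes >= 0.5).astype(int)                              # majority vote
print("recall on minority class:", (pred[min_idx] == 1).mean())
```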

DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking

Elke Achtert; Christian Böhm; Peer Kröger

Hierarchical clustering algorithms, e.g. Single-Link or OPTICS, compute the hierarchical clustering structure of data sets and visualize those structures by means of dendrograms and reachability plots. Both types of algorithms have their own drawbacks. Single-Link suffers from the well-known single-link effect and is not robust against noise objects. Furthermore, the interpretability of the resulting dendrogram deteriorates heavily with increasing database size. OPTICS overcomes these limitations by using a density estimator for data grouping and computing a reachability diagram which provides a clear presentation of the hierarchical clustering structure even for large data sets. However, it requires the non-intuitive parameter ε, which has a significant impact on the performance of the algorithm and the accuracy of the results. In this paper, we propose a novel and efficient k-nearest neighbor join closest-pair ranking algorithm to overcome the problems of both worlds. Our density-link clustering algorithm uses a similar density estimator for data grouping, but does not require the ε parameter of OPTICS and thus produces the optimal result w.r.t. accuracy. In addition, it provides a significant performance boost over Single-Link and OPTICS. Our experiments show both the improvement in accuracy and the gain in efficiency of our method compared to Single-Link and OPTICS.

- Clustering | Pp. 119-128
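
The ingredients above can be illustrated compactly, though what follows is not DeLi-Clu itself: a k-nearest-neighbor core distance serves as the density estimate, it is folded into density-adjusted pairwise distances, and a single-link hierarchy is built over them. DeLi-Clu's efficiency comes from a closest-pair ranking on a spatial index, which this brute-force sketch omits; k is an illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(60, 2))
k = 5

# Core distance: distance to the k-th nearest neighbour (a density estimate);
# n_neighbors=k+1 because the query point itself is returned first.
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
core = dist[:, -1]

# Density-adjusted pairwise distance ("mutual reachability").
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
reach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
np.fill_diagonal(reach, 0.0)

# Single-link hierarchy over the adjusted distances; the cluster structure
# can be read off the dendrogram, much like a reachability plot.
Z = linkage(squareform(reach), method="single")
```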

Iterative Clustering Analysis for Grouping Missing Data in Gene Expression Profiles

Dae-Won Kim; Bo-Yeong Kang

Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Because a clustering method requires a complete data matrix as input, we must estimate the missing values using an imputation method in the preprocessing step of clustering. However, a common limitation of these conventional approaches is that once the estimates of missing values are fixed in the preprocessing step, they are not changed during the subsequent clustering process. Badly estimated missing values obtained in data preprocessing are likely to degrade the quality and reliability of the clustering results. Thus, a new clustering method is required that improves the missing values during the iterative clustering process.

- Clustering | Pp. 129-138
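
A minimal sketch of re-estimating missing values during clustering rather than fixing them up front: a k-means loop in which, after each assignment step, every missing entry is re-imputed from its row's current cluster centroid. The distance computation and update rules are simplified relative to the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.normal(size=(100, 6))
mask = rng.random(X_true.shape) < 0.1                  # ~10% entries go missing
X_obs = np.where(mask, np.nan, X_true)

def iterative_kmeans(Xm, k=3, n_iter=20):
    missing = np.isnan(Xm)
    X = np.where(missing, np.nanmean(Xm, axis=0), Xm)  # initial mean imputation
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step on the currently imputed matrix.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
        # Re-impute each missing entry from its row's current centroid.
        X = np.where(missing, centroids[labels], Xm)
    return labels, X

labels, X_imputed = iterative_kmeans(X_obs)
```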

An EM-Approach for Clustering Multi-Instance Objects

Hans-Peter Kriegel; Alexey Pryakhin; Matthias Schubert

In many data mining applications the data objects are modeled as sets of feature vectors, or multi-instance objects. In this paper, we present an expectation maximization approach for clustering multi-instance objects. To this end, we present a statistical process that models multi-instance objects. Furthermore, we present M-steps and E-steps for EM clustering and a method for finding a good initial model. In our experimental evaluation, we demonstrate that the new EM algorithm is capable of increasing the cluster quality on three real-world data sets compared to a k-medoid clustering.

- Clustering | Pp. 139-148
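
A heavily reduced EM sketch for bags of vectors: each cluster is modeled by a single unit-variance spherical Gaussian, a bag's log-likelihood under a cluster sums over its instances, and the M-step weights bag means by responsibility times bag size. The paper's statistical process and initialization method are richer; everything below is an illustrative reduction on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
# 40 bags, each holding 3-8 instances in R^2, drawn around two bag-level centers.
bags = [rng.normal(loc=c, size=(rng.integers(3, 9), 2))
        for c in rng.choice([[0, 0], [5, 5]], size=40)]

k, d = 2, 2
mu = rng.normal(size=(k, d))
pi = np.full(k, 1.0 / k)

for _ in range(30):
    # E-step: bag log-likelihood per cluster (unit-variance Gaussians, so only
    # squared distances matter), summed over the bag's instances.
    ll = np.array([[-0.5 * ((b - mu[c]) ** 2).sum() for c in range(k)]
                   for b in bags])
    ll += np.log(pi)
    r = np.exp(ll - ll.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: means weighted by responsibility times instance count.
    n = np.array([len(b) for b in bags])
    bag_means = np.array([b.mean(0) for b in bags])
    mu = np.array([(r[:, c] * n) @ bag_means / (r[:, c] * n).sum()
                   for c in range(k)])
    pi = r.mean(axis=0)

print("cluster means:\n", mu)
```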

Mining Maximal Correlated Member Clusters in High Dimensional Database

Lizheng Jiang; Dongqing Yang; Shiwei Tang; Xiuli Ma; Dehui Zhang

Mining high dimensional data is an urgent problem of great practical importance. Although some data mining models such as frequent patterns and clusters have been proven very successful for analyzing very large data sets, they have their limitations. Frequent patterns are inadequate to describe the quantitative correlations among nominal members. Traditional cluster models ignore the distances of some pairs of members, so a pair of members in one big cluster may be far away from each other. As a combination and complement of both techniques, we propose the Maximal-Correlated-Member-Cluster (MCMC) model in this paper. The MCMC model is based on a statistical measure reflecting the relationship of nominal variables, and every pair of members in one cluster satisfies unified constraints. Moreover, in order to improve the algorithm's efficiency, we introduce pruning techniques to reduce the search space. In the first phase, a Tri-correlation inequality is used to eliminate unrelated member pairs, and in the second phase, an Inverse-Order-Enumeration-Tree (IOET) method is designed to share common computations. Experiments over both synthetic and real-life datasets are performed to examine our algorithm's performance. The results show that our algorithm has much higher efficiency than the naive algorithm, and that this model can discover meaningful correlated patterns in high dimensional databases.

- Clustering | Pp. 149-159
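
Because the model requires every pair of members in a cluster to satisfy the correlation constraint, maximal correlated member clusters correspond to maximal cliques of the "sufficiently correlated" graph. The brute-force sketch below uses plain Pearson correlation as a stand-in for the paper's nominal-variable measure and skips the Tri-correlation pruning and IOET enumeration that make the real algorithm scale.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=200)  # make members 0,1,2 correlate
data[:, 2] = data[:, 0] + 0.1 * rng.normal(size=200)

theta = 0.8                                  # pairwise correlation threshold
corr = np.corrcoef(data, rowvar=False)

# Edge between two members iff they meet the unified pairwise constraint.
G = nx.Graph()
G.add_nodes_from(range(data.shape[1]))
G.add_edges_from((i, j) for i in range(8) for j in range(i + 1, 8)
                 if abs(corr[i, j]) >= theta)

# Maximal cliques = maximal clusters in which all pairs meet the constraint.
print([c for c in nx.find_cliques(G) if len(c) > 1])
```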

Hierarchical Clustering Based on Mathematical Optimization

Le Hoai Minh; Le Thi Hoai An; Pham Dinh Tao

In this paper, a novel optimization model for bilevel hierarchical clustering is proposed. This is a hard nonconvex, nonsmooth optimization problem for which we investigate an efficient technique based on DC (Difference of Convex functions) programming and DCA (DC optimization Algorithm). Preliminary numerical results on some artificial and real-world databases show the efficiency and superiority of this approach with respect to related existing methods.

- Clustering | Pp. 160-173
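
For reference, the generic DCA scheme that such an approach builds on, in its standard form (not the paper's specific bilevel clustering objective): write the objective as a difference of convex functions and repeatedly linearize the concave part at the current iterate.

```latex
% Standard DC programming / DCA iteration: minimize f = g - h, g and h convex.
\begin{align*}
  \min_x \; f(x) &= g(x) - h(x), \qquad g,\, h \text{ convex},\\
  y^{k} &\in \partial h(x^{k}),\\
  x^{k+1} &\in \operatorname*{arg\,min}_{x} \bigl\{\, g(x) - \langle x,\, y^{k} \rangle \,\bigr\}.
\end{align*}
```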