Publications catalog - books

Advanced Data Mining and Applications: 1st International Conference, ADMA 2005, Wuhan, China, July 22-24, 2005, Proceedings

Xue Li; Shuliang Wang; Zhao Yang Dong (eds.)

Conference: 1st International Conference on Advanced Data Mining and Applications (ADMA), Wuhan, China, July 22-24, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Artificial Intelligence (incl. Robotics); Database Management; Software Engineering; Computer Appl. in Administrative Data Processing; Information Systems Applications (incl. Internet); Health Informatics

Availability

Detected institution: Not detected
Publication year: 2005
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-27894-8

Electronic ISBN

978-3-540-31877-4

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2005

Publication rights information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Learning k-Nearest Neighbor Naive Bayes for Ranking

Liangxiao Jiang; Harry Zhang; Jiang Su

Accurate probability-based ranking of instances is crucial in many real-world data mining applications. KNN (k-nearest neighbor) [1] has been intensively studied as an effective classification model for decades, but its ranking performance is unknown. In this paper, we conduct a systematic study of the ranking performance of KNN. First, we compare KNN and KNNDW (distance-weighted KNN) to decision trees and naive Bayes in ranking, measured by AUC (the area under the Receiver Operating Characteristics curve). Then, we propose to improve the ranking performance of KNN by combining it with naive Bayes: a naive Bayes model is learned using the k nearest neighbors of the test instance as training data and is then used to classify the test instance. A critical problem in combining KNN with naive Bayes is the lack of training data when k is small. We deal with it by sampling to expand the training data: each of the k nearest neighbors is “cloned” and the clones are added to the training data. We call the new model instance cloning local naive Bayes (ICLNB for short). We conduct an extensive empirical comparison of the related algorithms in two groups, in terms of AUC, using the 36 UCI datasets recommended by Weka [2]. In the first group, we compare ICLNB with algorithms of other types: C4.4 [3], naive Bayes and NBTree [4]. In the second group, we compare ICLNB with KNN, KNNDW and LWNB [5]. Our experimental results show that ICLNB significantly outperforms all of these algorithms. From our study, we draw two conclusions. First, KNN-related algorithms perform well in ranking. Second, our new algorithm ICLNB performs best among the algorithms compared in this paper and can be used in applications where an accurate ranking is desired.

- Classification | Pp. 175-185
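
The cloning recipe in the abstract above is easy to sketch. Below is a minimal, illustrative Python version for binary-labeled numeric data: for each test instance, its k nearest neighbors are cloned (closer neighbors more often) and a naive Bayes model is fit on the clones. GaussianNB and the similarity-proportional cloning factor are stand-in choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import GaussianNB

def iclnb_scores(X_train, y_train, X_test, k=10, max_clones=5):
    """P(class = 1 | x) for each test instance, ICLNB-style."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    scores = []
    for x in X_test:
        dist, idx = nn.kneighbors(x.reshape(1, -1))
        neigh_X, neigh_y = X_train[idx[0]], y_train[idx[0]]
        # clone each neighbor, closer neighbors more often, so naive Bayes
        # has enough local training data even when k is small
        sim = 1.0 / (1.0 + dist[0])
        reps = 1 + np.round(max_clones * sim / sim.max()).astype(int)
        nb = GaussianNB().fit(np.repeat(neigh_X, reps, axis=0),
                              np.repeat(neigh_y, reps))
        if len(nb.classes_) == 1:          # neighborhood is single-class
            scores.append(float(nb.classes_[0] == 1))
        else:
            scores.append(nb.predict_proba(x.reshape(1, -1))[0, 1])
    return np.array(scores)
```

Ranking quality can then be measured as in the paper with an AUC metric, e.g. sklearn.metrics.roc_auc_score(y_test, scores).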

One Dependence Augmented Naive Bayes

Liangxiao Jiang; Harry Zhang; Zhihua Cai; Jiang Su

In real-world data mining applications, an accurate ranking is as important as an accurate classification. Naive Bayes has been widely used in data mining as a simple and effective classification and ranking algorithm. Since its conditional independence assumption is rarely true, numerous algorithms have been proposed to improve it, for example SBC [1] and TAN [2]. Indeed, experimental results show that SBC and TAN achieve a significant improvement in terms of classification accuracy. Unfortunately, our experiments also show that SBC and TAN perform even worse than naive Bayes in ranking measured by AUC [3,4] (the area under the Receiver Operating Characteristics curve). This fact raises the question of whether we can improve naive Bayes in both classification and ranking. Responding to this question, we present a new learning algorithm called One Dependence Augmented Naive Bayes (ODANB). Our motivation is to improve naive Bayes' performance not only on classification, measured by accuracy, but also on ranking, measured by AUC. We tested our algorithm experimentally on all 36 UCI datasets recommended by Weka [5], and compared it to naive Bayes, SBC and TAN. The experimental results show that our algorithm significantly outperforms all the other algorithms in yielding accurate rankings, while also slightly outperforming them in terms of classification accuracy.

- Classification | Pp. 186-194
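
The abstract does not spell out ODANB's structure, so the sketch below only illustrates the general shape of a one-dependence model on discrete data: every attribute is conditioned on the class plus one attribute parent. The parent-selection rule (greedy conditional mutual information, rather than TAN's maximum spanning tree) and Laplace smoothing are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def cond_mutual_info(a, b, y):
    """I(A; B | Y) for discrete 1-D arrays, estimated from counts (nats)."""
    n = len(y)
    n_aby, n_ay = Counter(zip(a, b, y)), Counter(zip(a, y))
    n_by, n_y = Counter(zip(b, y)), Counter(y)
    return sum(c / n * np.log(c * n_y[yv] / (n_ay[(av, yv)] * n_by[(bv, yv)]))
               for (av, bv, yv), c in n_aby.items())

class OneDependenceClassifier:
    """Each attribute i depends on the class plus one attribute parent."""
    def fit(self, X, y):
        n, d = X.shape
        self.classes_ = np.unique(y)
        # greedy parent choice by conditional mutual information; a proper
        # TAN would build a maximum spanning tree instead
        self.pa = [max((j for j in range(d) if j != i),
                       key=lambda j: cond_mutual_info(X[:, i], X[:, j], y))
                   for i in range(d)]
        self.n_vals = [len(set(X[:, i])) for i in range(d)]
        self.prior = {c: (np.sum(y == c) + 1) / (n + len(self.classes_))
                      for c in self.classes_}
        self.joint = Counter()  # (i, x_i, parent value, class) -> count
        self.cond = Counter()   # (i, parent value, class) -> count
        for row, c in zip(X, y):
            for i, p in enumerate(self.pa):
                self.joint[(i, row[i], row[p], c)] += 1
                self.cond[(i, row[p], c)] += 1
        return self

    def predict_proba(self, x):
        logp = []
        for c in self.classes_:
            s = np.log(self.prior[c])
            for i, p in enumerate(self.pa):
                # P(x_i | parent value, class), Laplace-smoothed
                s += np.log((self.joint[(i, x[i], x[p], c)] + 1) /
                            (self.cond[(i, x[p], c)] + self.n_vals[i]))
            logp.append(s)
        w = np.exp(np.array(logp) - max(logp))
        return w / w.sum()
```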

A Genetic k-Modes Algorithm for Clustering Categorical Data

Guojun Gan; Zijiang Yang; Jianhong Wu

Many optimization-based clustering algorithms can stop at locally optimal partitions of data sets. In this paper, we present a genetic k-Modes algorithm (GKMODE) that finds a globally optimal partition of a given categorical data set into a specified number of clusters. We introduce a k-Modes operator in place of the normal crossover operator. Our analysis shows that the clustering results produced by GKMODE are highly accurate and that it performs much better than existing algorithms for clustering categorical data.

- Clustering | Pp. 195-202
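
As a rough illustration of the k-Modes operator mentioned above, the sketch below performs one assignment-and-update pass over integer-coded categorical data under simple-matching distance. The genetic wrapper (a population of partitions with selection and mutation) is omitted; this is not the paper's full GKMODE.

```python
import numpy as np

def kmodes_operator(X, modes):
    """One k-Modes pass: assign records to the nearest mode under
    simple-matching distance, then recompute per-attribute modes."""
    # simple matching distance: number of attributes that differ
    dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    new_modes = modes.copy()
    for k in range(len(modes)):
        members = X[labels == k]
        if len(members) == 0:
            continue
        for j in range(X.shape[1]):
            # per-attribute mode of the cluster's members
            vals, counts = np.unique(members[:, j], return_counts=True)
            new_modes[k, j] = vals[counts.argmax()]
    return labels, new_modes
```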

A Fast Fuzzy Clustering Algorithm for Large-Scale Datasets

Lukui Shi; Pilian He

The transitive closure method is one of the most frequently used fuzzy clustering techniques. It has O(n³ log n) time complexity and O(n²) space complexity for the matrix compositions required to build transitive closures. These drawbacks limit its application to large-scale databases. In this paper, we propose a fast fuzzy clustering algorithm that avoids matrix multiplications, based on the principle that the clustering results can be obtained directly from the λ-cut of the fuzzy similarity relation on the objects. Moreover, there is no need to compute and store the similarity matrix of the objects beforehand. The time complexity of the presented algorithm is at most O(n²) and its space complexity is O(1). Theoretical analysis and experiments demonstrate that although the new algorithm is equivalent to the transitive closure method, it is better suited to large-scale datasets because of its high computing efficiency.

- Clustering | Pp. 203-208
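
The equivalence the abstract relies on can be made concrete: at a given threshold λ, the λ-cut of the max-min transitive closure of a fuzzy similarity relation groups objects exactly like the connected components of the graph whose edges are the pairs with similarity ≥ λ. A union-find pass over similarities computed on the fly therefore clusters in O(n²) time without ever composing or storing the n × n matrix. The similarity function below is an illustrative choice.

```python
import numpy as np

def lambda_cut_clusters(X, lam, sim=None):
    """Cluster by connected components of the similarity >= lam graph."""
    n = len(X)
    if sim is None:
        # illustrative similarity: 1 / (1 + Euclidean distance)
        sim = lambda a, b: 1.0 / (1.0 + np.linalg.norm(a - b))
    parent = list(range(n))
    def find(i):
        while parent[i] != i:           # path-halving union-find
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sim(X[i], X[j]) >= lam:  # similarity computed on the fly
                parent[find(i)] = find(j)
    return np.array([find(i) for i in range(n)])
```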

Clustering with Noising Method

Yongguo Liu; Yan Liu; Kefei Chen

The minimum sum-of-squares clustering problem is a nonconvex program with many locally optimal values, so solutions often fall into these local traps. In this article, a recent metaheuristic technique, the noising method, is introduced to explore proper clusterings of data sets under the minimum sum-of-squares criterion. Meanwhile, the K-means algorithm is integrated into the noising method as a local improvement operation to boost the clustering algorithm's performance. Extensive computer simulations show that the proposed approach is feasible and effective.

- Clustering | Pp. 209-216
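
A minimal sketch of the noising idea for sum-of-squares clustering: candidate reassignments are judged on a noised cost change, the noise rate shrinks over the run, and K-means serves as the local improvement step. The noise schedule, move generator and scaling below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(X, labels, k):
    """Sum of squared distances to cluster means."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(k) if np.any(labels == c))

def noising_cluster(X, k, iters=2000, noise0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    cost = sse(X, labels, k)
    for t in range(iters):
        noise = noise0 * (1 - t / iters)      # decreasing noise rate
        i, c = rng.integers(len(X)), rng.integers(k)
        if c == labels[i]:
            continue
        trial = labels.copy()
        trial[i] = c                          # move one point to cluster c
        new_cost = sse(X, trial, k)
        # accept if the *noised* cost change is an improvement
        if new_cost - cost + rng.uniform(-noise, noise) * cost / len(X) < 0:
            labels, cost = trial, new_cost
    # K-means as the local improvement operation, as in the paper
    init = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                     else X[rng.integers(len(X))] for c in range(k)])
    return KMeans(n_clusters=k, init=init, n_init=1).fit_predict(X)
```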

Extracting the Representative Failure Executions via Clustering Analysis Based on Markov Profile Model

Chengying Mao; Yansheng Lu

During the debugging of a program about to be released, it is unnecessary and impractical for developers to check every failure execution. Extracting the typical ones from the vast set of failure executions is therefore very important for reducing debugging effort. In this paper, a revised Markov model for depicting program behavior is presented first. Based on this model, a dissimilarity measure between two profile matrices is defined. After separating executions into failure and non-failure subsets, iterative partition clustering and a sampling strategy called priority-ranked n-per-cluster are employed to extract representative failure executions. Finally, with the assistance of our prototype CppTest, we performed experiments on five subject programs. The results show that the clustering and sampling techniques based on the revised Markov model are more effective at finding faults than Podgurski's method.

- Clustering | Pp. 217-224
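
The profile construction described above can be sketched directly: each execution trace (a sequence of event ids, e.g. basic blocks numbered 0..m-1) becomes a row-normalized transition matrix. The L1 matrix dissimilarity below is a stand-in, since the abstract does not give the paper's exact definition.

```python
import numpy as np

def markov_profile(trace, m):
    """Row-normalized transition matrix of a trace over m event ids."""
    P = np.zeros((m, m))
    for a, b in zip(trace[:-1], trace[1:]):
        P[a, b] += 1                    # count each observed transition
    row = P.sum(axis=1, keepdims=True)
    return np.divide(P, row, out=np.zeros_like(P), where=row > 0)

def dissimilarity(P, Q):
    """Illustrative L1 dissimilarity between two profile matrices."""
    return np.abs(P - Q).mean()
```

Failure traces would then be clustered on this dissimilarity and a few representatives sampled from each cluster, as in the priority-ranked n-per-cluster strategy.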

Improvement on the Approximation Bound for Fuzzy-Neural Networks Clustering Method with Gaussian Membership Function

Weimin Ma; Guoqing Chen

A great deal of research has been devoted in recent years to designing Fuzzy-Neural Networks (FNN) from input-output data, and some work has also analyzed the performance of such methods from a rigorous mathematical point of view. In this paper, a new approximation bound is established for the clustering method used to design FNNs with Gaussian membership functions. It improves on a previous result in which the related approximation bound was somewhat complex. Detailed formulas are derived for the error bound between the nonlinear function to be approximated and the FNN system designed from the input-output data.

- Clustering | Pp. 225-231

Optimal Fuzzy Modeling Based on Minimum Cluster Volume

Can Yang; Jun Meng

This paper proposes a new fuzzy modeling method based on the Minimum Cluster Volume clustering algorithm. The cluster centers found are naturally taken as the centers of Gaussian membership functions, and the covariance matrix obtained from the clustering result is used to estimate the parameters of those membership functions. Simulations comparing this method with published methods indicate that it solves multi-dimensional problems more accurately, even with a less complex fuzzy model structure.

- Clustering | Pp. 232-239
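
The modeling recipe can be illustrated end to end: cluster the inputs, take cluster centers as Gaussian membership centers and per-cluster (here diagonal) covariances as widths, then predict with a normalized weighted sum of per-rule outputs. sklearn's GaussianMixture stands in for the paper's Minimum Cluster Volume algorithm, an assumption made only to keep the sketch runnable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_fuzzy_model(X, y, n_rules=5, seed=0):
    """Fuzzy model with Gaussian membership functions from clusters."""
    gm = GaussianMixture(n_components=n_rules, covariance_type='diag',
                         random_state=seed).fit(X)
    centers, widths = gm.means_, np.sqrt(gm.covariances_)
    # per-rule Gaussian memberships of the training inputs
    W = np.exp(-0.5 * (((X[:, None, :] - centers) / widths) ** 2).sum(axis=2))
    # per-rule consequent: membership-weighted mean of the outputs
    consequents = (W * y[:, None]).sum(axis=0) / W.sum(axis=0)
    def predict(Xq):
        Wq = np.exp(-0.5 * (((Xq[:, None, :] - centers) / widths) ** 2)
                    .sum(axis=2))
        # normalized weighted sum of rule outputs (centroid defuzzification)
        return (Wq * consequents).sum(axis=1) / Wq.sum(axis=1)
    return predict
```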

An Efficient Clustering Approach for Large Document Collections

Bo Han; Lishan Kang; Huazhu Song

Vast amounts of unstructured text data, such as scientific publications, commercial reports and webpages, must be quickly categorized into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods suffer from the huge number of documents and their high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections, which performs clustering in three stages: 1) using a permutation test, informative topic words are identified to reduce the feature dimension; 2) a small number of the most typical documents are selected to perform initial clustering; 3) the clustering is refined on all documents. The algorithm was tested on the 20-newsgroups data, and experimental results showed that, compared with methods that cluster the corpus directly on all documents and full features, this approach reduces the time cost by an order of magnitude while slightly improving clustering quality.

- Clustering | Pp. 240-247
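
A condensed sketch of the three-stage pipeline, with sklearn components as illustrative stand-ins: a document-frequency filter approximates stage 1's permutation test, a small random sample is clustered for stage 2, and the sample centroids seed the full-collection refinement of stage 3.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(docs, n_clusters, sample_size=500, seed=0):
    rng = np.random.default_rng(seed)
    # stage 1: prune the vocabulary to informative mid-frequency terms
    # (a document-frequency filter standing in for the permutation test)
    vec = TfidfVectorizer(min_df=5, max_df=0.5, stop_words='english')
    X = vec.fit_transform(docs)
    # stage 2: initial clustering on a small sample of documents
    idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]),
                     replace=False)
    init_km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    init_km.fit(X[idx])
    # stage 3: refine on all documents, seeded by the sample centroids
    km = KMeans(n_clusters=n_clusters, init=init_km.cluster_centers_,
                n_init=1, random_state=seed)
    return km.fit_predict(X)
```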

Clustering Categorical Data Using Coverage Density

Hua Yan; Lei Zhang; Yi Zhang

In this paper, a new algorithm based on the idea of coverage density is proposed for clustering categorical data. It uses average coverage density as the global criterion function, allowing large sparse categorical databases to be clustered effectively. Analysis of its time and space complexity shows that the algorithm uses little memory and time. Experiments on two real datasets illustrate the performance of the proposed algorithm.

- Clustering | Pp. 248-255
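
One plausible reading of coverage density for a cluster of transactions is the fill ratio of its transaction-by-distinct-item matrix: total item occurrences divided by (number of transactions × number of distinct items). The greedy assignment below uses that reading; the paper's exact criterion function may differ.

```python
from collections import Counter

def coverage_density(n_rows, item_counts):
    """Fill ratio of the cluster's transaction-by-item matrix."""
    width = len(item_counts)
    return sum(item_counts.values()) / (n_rows * width) if width else 0.0

def assign(transactions, n_clusters):
    """Greedily place each transaction where coverage density is highest."""
    clusters = [{'rows': 0, 'items': Counter()} for _ in range(n_clusters)]
    labels = []
    for t in transactions:
        best, best_cd = 0, -1.0
        for c, cl in enumerate(clusters):
            # coverage density if t joined cluster c
            cd = coverage_density(cl['rows'] + 1, cl['items'] + Counter(t))
            if cd > best_cd:
                best, best_cd = c, cd
        clusters[best]['rows'] += 1
        clusters[best]['items'].update(t)
        labels.append(best)
    return labels
```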