Publications catalog - books

Machine Learning and Data Mining in Pattern Recognition: 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007. Proceedings

Petra Perner (ed.)

Conference: 5th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM). Leipzig, Germany. July 18, 2007 - July 20, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Database Management; Data Mining and Knowledge Discovery; Pattern Recognition; Image Processing and Computer Vision

Availability

Detected institution: not detected
Year of publication: 2007
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-73498-7

Electronic ISBN

978-3-540-73499-4

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Data Clustering: User’s Dilemma

Anil K. Jain

Data clustering is a long-standing research problem in pattern recognition, computer vision, machine learning, and data mining, with applications in a number of diverse disciplines. The goal is to partition a set of n d-dimensional points into k clusters, where k may or may not be known. Most clustering techniques require the definition of a similarity measure between patterns, which is not easy to specify in the absence of any prior knowledge about cluster shapes. While a large number of clustering algorithms exist, there is no optimal algorithm. Each clustering algorithm imposes a specific structure on the data and has its own approach to estimating the number of clusters. No single algorithm can adequately handle the various cluster shapes and structures that are encountered in practice. Instead of devoting effort to devising yet another clustering algorithm, there is a need to build upon existing published techniques. In this talk we will address the following problems: (i) clustering via evidence accumulation, (ii) simultaneous clustering and dimensionality reduction, (iii) clustering under pair-wise constraints, and (iv) clustering with relevance feedback. Experimental results show that these approaches are promising for identifying arbitrarily shaped clusters in multidimensional data.
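
As a concrete illustration of item (i), clustering via evidence accumulation, the sketch below runs k-means repeatedly with random k, accumulates a co-association matrix counting how often each pair of points co-clusters, and extracts a final partition from that matrix. The run count, k range, and average-linkage extraction step are illustrative assumptions, not the talk's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def evidence_accumulation(X, n_runs=30, k_range=(2, 10), final_k=3, seed=0):
    """Cluster X by accumulating co-association evidence over k-means runs."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_runs):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs
    # Turn co-association similarity into a distance and cut a hierarchy on it;
    # points that often co-cluster end up in the same final cluster.
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=final_k, criterion="maxclust")
```

Because the final step works on co-association rather than raw coordinates, it can recover arbitrarily shaped clusters that no single k-means run would find.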

- Invited Talk | Pp. 1-1

On Concentration of Discrete Distributions with Applications to Supervised Learning of Classifiers

Magnus Ekdahl; Timo Koski

Computational procedures using independence assumptions in various forms are popular in machine learning, although checks on empirical data have given inconclusive results about their impact. Some theoretical understanding of when they work is available, but a definite answer seems to be lacking. This paper derives the distributions that maximize the statewise difference to the respective product of marginals. These distributions are, in a sense, the worst distributions for predicting an outcome of the data-generating mechanism by independence. We also restrict the scope of new theoretical results by showing explicitly that, depending on context, independent ('Naïve') classifiers can be as bad as tossing coins. Regardless of this, independence may beat the generating model in learning supervised classification, and we explicitly provide one such scenario.
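
A tiny numeric illustration of the "as bad as tossing coins" point, using a hypothetical maximally dependent 2x2 joint distribution:

```python
import numpy as np

# Hypothetical joint distribution over binary (X, Y); chosen for illustration.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])   # perfectly dependent: X == Y always

px = joint.sum(axis=1)           # marginal of X
py = joint.sum(axis=0)           # marginal of Y
product = np.outer(px, py)       # independence approximation P(X)P(Y)

# Statewise difference between the joint and its product of marginals.
diff = np.abs(joint - product)
print("max statewise difference:", diff.max())   # 0.25 for this joint

# A 'Naive' model built from the marginals assigns P(Y=y | X=x) = 0.5 for
# every state, ignoring the dependence entirely: on this distribution it
# predicts Y from X no better than tossing a coin.
```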

- Classification | Pp. 2-16

Comparison of a Novel Combined ECOC Strategy with Different Multiclass Algorithms Together with Parameter Optimization Methods

Marco Hülsmann; Christoph M. Friedrich

In this paper we consider multiclass learning tasks based on Support Vector Machines (SVMs). Currently used methods include one-against-one and one-against-all decompositions, but there is much need for improvement in the field of multiclass learning. We developed a novel combination algorithm based on posterior class probabilities. It assigns, according to the Bayesian rule, the respective instance to the class with the highest posterior probability. A problem with the usage of a multiclass method is the proper choice of parameters: many users simply take the default parameters of the respective learning algorithms (e.g. the regularization parameter C and the kernel parameter γ). We tested different parameter optimization methods on different learning algorithms and confirmed the better performance of SVMs over the other learning algorithms, which can be explained by the maximum margin approach of SVMs.
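
As a hedged illustration of parameter optimization versus library defaults, the scikit-learn sketch below grid-searches C and γ for an RBF SVM with Platt-scaled posteriors, then assigns each instance to the class with the highest posterior probability, echoing the Bayesian decision rule described above. The dataset, grid values, and library are our assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grid search over the regularization parameter C and RBF kernel parameter
# gamma, instead of relying on the library defaults criticized above.
grid = GridSearchCV(
    SVC(kernel="rbf", probability=True),   # Platt scaling yields posteriors
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# Bayes-style decision: pick the class with the highest posterior probability.
posteriors = grid.best_estimator_.predict_proba(X_te)
pred = grid.best_estimator_.classes_[np.argmax(posteriors, axis=1)]
print("best params:", grid.best_params_, "accuracy:", (pred == y_te).mean())
```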

- Classification | Pp. 17-31

Multi-source Data Modelling: Integrating Related Data to Improve Model Performance

Paul R. Trundle; Daniel C. Neagu; Qasim Chaudhry

Traditional methods in Data Mining cannot be applied to all types of data with equal success. Innovative methods for model creation are needed to address poor model performance on data from which it is difficult to extract relationships. This paper proposes a set of algorithms that allow the integration of data from multiple related datasets, as well as results from the implementation of these techniques using data from the field of Predictive Toxicology. The results show significant improvements when related data is used to aid the model creation process, both overall and in specific data ranges. The proposed algorithms have potential for use within any field where multiple datasets exist, particularly in fields combining computing, chemistry and biology.
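
The abstract does not spell out the integration mechanism, so the sketch below shows just one generic way of exploiting a related dataset, purely as an illustration: train a model on the related data and feed its predictions to the main task's model as an extra descriptor. All data and model choices here are hypothetical, not the paper's algorithms:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical related datasets sharing one feature space: X_main/y_main is
# the target task, X_rel/y_rel a larger dataset for a related endpoint.
rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, 0.0, 0.0, 1.0])
X_rel = rng.normal(size=(200, 5)); y_rel = X_rel @ w + rng.normal(size=200)
X_main = rng.normal(size=(50, 5)); y_main = X_main @ w + rng.normal(size=50)

# Step 1: fit a model on the larger, related dataset.
rel_model = RandomForestRegressor(random_state=0).fit(X_rel, y_rel)

# Step 2: append its predictions as an extra descriptor for the main task,
# letting the main model exploit the relationship between the endpoints.
X_aug = np.column_stack([X_main, rel_model.predict(X_main)])
main_model = RandomForestRegressor(random_state=0).fit(X_aug, y_main)
```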

- Classification | Pp. 32-46

An Empirical Comparison of Ideal and Empirical ROC-Based Reject Rules

Claudio Marrocco; Mario Molinara; Francesco Tortorella

Two-class classifiers are used in many complex problems in which the classification results could have serious consequences. In such situations the cost of a wrong classification can be so high that it may be convenient to avoid a decision and reject the sample. This paper presents a comparison between two different reject rules (Chow's rule and the ROC-based rule). In particular, the experiments show that Chow's rule is inappropriate when the estimates of the a posteriori probabilities are not reliable.
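
Chow's rule is compact enough to state in code: reject a sample whenever its maximum estimated a posteriori probability falls below a threshold, otherwise predict the most probable class. A minimal sketch (threshold and example posteriors are illustrative):

```python
import numpy as np

def chow_reject(posteriors, threshold):
    """Chow's rule: reject a sample when its maximum posterior probability
    falls below the threshold; otherwise predict the most probable class.

    posteriors: (n_samples, n_classes) estimated a posteriori probabilities.
    Returns predicted class indices, with -1 marking rejected samples.
    """
    max_p = posteriors.max(axis=1)
    pred = posteriors.argmax(axis=1)
    pred[max_p < threshold] = -1   # abstain: error cost exceeds reject cost
    return pred

# Illustrative two-class posteriors; the third sample is too uncertain.
p = np.array([[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]])
print(chow_reject(p, threshold=0.8))   # -> [0, 1, -1]
```

The comparison in the paper hinges on exactly this dependence on the posteriors: when they are poorly estimated, thresholding them (as above) selects the wrong samples to reject.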

- Classification | Pp. 47-60

Outlier Detection with Kernel Density Functions

Longin Jan Latecki; Aleksandar Lazarevic; Dragoljub Pokrajac

Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First we modify a nonparametric density estimate with a variable kernel to yield a robust local density estimation. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Our experiments performed on several simulated data sets have demonstrated that the proposed approach can outperform two widely used outlier detection algorithms (LOF and LOCI).
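
A simplified sketch of the density-comparison idea: estimate each point's density with a fixed-bandwidth Gaussian kernel over its k nearest neighbors, then score points by how far their density falls below the average density of those neighbors. The fixed bandwidth, k, and toy data are assumptions; the paper's estimator uses a variable kernel:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kde_outlier_scores(X, k=10, bandwidth=1.0):
    """Scores >> 1 suggest outliers: the point is far less dense than its
    neighbors. A fixed-bandwidth sketch of the density-comparison idea."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)      # column 0 is each point itself
    # Gaussian-kernel density estimate from the k nearest neighbors.
    dens = np.exp(-(dist[:, 1:] ** 2) / (2 * bandwidth ** 2)).sum(axis=1)
    neighbor_dens = dens[idx[:, 1:]].mean(axis=1)
    return neighbor_dens / (dens + 1e-12)

X = np.vstack([np.random.default_rng(0).normal(0, 1, (100, 2)),
               [[8.0, 8.0]]])          # one obvious outlier appended last
print(kde_outlier_scores(X).argmax())  # -> 100, the appended outlier
```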

- Classification | Pp. 61-75

Generic Probability Density Function Reconstruction for Randomization in Privacy-Preserving Data Mining

Vincent Yan Fu Tan; See-Kiong Ng

Data perturbation with random noise signals has been shown to be useful for data hiding in privacy-preserving data mining. Perturbation methods based on additive randomization allow accurate estimation of the Probability Density Function (PDF) via the Expectation-Maximization (EM) algorithm, but it has been shown that noise-filtering techniques can be used to reconstruct the original data in many cases, leading to security breaches. In this paper, we propose a PDF reconstruction algorithm that can be used on non-additive (and additive) randomization techniques for the purpose of privacy-preserving data mining. This two-step reconstruction algorithm is based on Parzen-Window reconstruction and Quadratic Programming over a convex set – the probability simplex. Our algorithm eliminates the usual need for the iterative EM algorithm and is generic for most randomization models. The simplicity of our two-step reconstruction algorithm, without iteration, also makes it attractive for use when dealing with large datasets.
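
The second step, quadratic programming over the probability simplex, has a standard closed-form solution in the Euclidean case: sort, find a threshold, and clip. The sketch below implements that generic sorting-based projection; it is a stand-in for the paper's QP step, not necessarily its exact solver, and the 5-bin input vector is hypothetical:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {p : p >= 0, sum(p) = 1}, i.e. the convex set constraining the QP,
    solved in closed form by sorting (Duchi et al.'s algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Hypothetical unnormalized Parzen-window estimate over 5 discrete bins.
raw = np.array([0.4, -0.1, 0.5, 0.3, 0.1])   # may be negative, sum != 1
p = project_to_simplex(raw)
print(p, p.sum())   # valid PDF: nonnegative entries summing to 1
```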

- Classification | Pp. 76-90

An Incremental Fuzzy Decision Tree Classification Method for Mining Data Streams

Tao Wang; Zhoujun Li; Yuejin Yan; Huowang Chen

One of the most important algorithms for mining data streams is VFDT. It uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the constructed tree. Gama et al. have extended VFDT in two directions: their system VFDTc can deal with continuous data and use more powerful classification techniques at tree leaves. In this paper, we revisit this problem and implement a system, fVFDT, on top of VFDT and VFDTc. We make the following four contributions: 1) we present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes, with lower per-value insertion time than VFDT's; when a new example arrives, VFDTc needs to update a set of attribute-tree nodes, whereas fVFDT only needs to update one node. 2) We improve the method of finding the best split-test point of a given continuous attribute, with lower processing time than the method used in VFDTc. 3) Compared to VFDTc, fVFDT reduces the number of candidate split-test points. 4) We improve the soft discretization method for use in data-stream mining, which overcomes the problem of noisy data and improves classification accuracy.
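
A minimal sketch of the kind of per-attribute binary search tree described in contribution 1 (unthreaded and simplified; the class and field names are illustrative, not fVFDT's actual implementation). Each node stores one observed attribute value and a class-count histogram, so an arriving example updates only a single root-to-leaf path:

```python
class AttrNode:
    """One observed attribute value with its class counts; a simplified,
    unthreaded sketch of a per-attribute search tree for continuous data.
    fVFDT's threaded variant also links nodes for fast in-order traversal."""
    def __init__(self, value, n_classes):
        self.value = value
        self.counts = [0] * n_classes   # class histogram for this exact value
        self.left = self.right = None

def insert(node, value, label, n_classes):
    """Insert one streaming example's (value, label); cost is O(tree height),
    touching only the nodes on one root-to-leaf path."""
    if node is None:
        node = AttrNode(value, n_classes)
        node.counts[label] += 1
    elif value < node.value:
        node.left = insert(node.left, value, label, n_classes)
    elif value > node.value:
        node.right = insert(node.right, value, label, n_classes)
    else:
        node.counts[label] += 1
    return node

# Feed a small stream of (attribute value, class label) pairs.
root = None
for v, c in [(2.3, 0), (1.1, 1), (3.7, 0), (2.3, 1)]:
    root = insert(root, v, c, n_classes=2)
print(root.counts)   # value 2.3 seen twice, one per class -> [1, 1]
```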

- Classification | Pp. 91-103

On the Combination of Locally Optimal Pairwise Classifiers

Gero Szepannek; Bernd Bischl; Claus Weihs

If their assumptions are not met, classifiers may fail. In this paper, the possibility of combining classifiers in multi-class problems is investigated. Multi-class classification problems are split into two-class problems. For each of the latter problems an optimal classifier is determined. The results of applying the optimal classifiers to the two-class problems can be combined using the algorithm of Hastie and Tibshirani (1998).

In this paper, exemplary situations are investigated where the respective assumptions of Naive Bayes or classical Linear Discriminant Analysis (LDA, Fisher, 1936) fail. It is investigated at what degree of violation of the assumptions it may be advantageous to use single methods or a classifier combination by Pairwise Coupling.
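
The Hastie and Tibshirani (1998) coupling step can be sketched compactly: starting from uniform class probabilities, iteratively rescale them until they are consistent with the pairwise estimates. The sketch assumes equal pairwise sample sizes for simplicity, and the 3-class pairwise matrix is illustrative:

```python
import numpy as np

def pairwise_coupling(R, n_iter=100, tol=1e-8):
    """Combine pairwise probability estimates R[i, j] ~= P(class i | i or j)
    into multiclass posteriors via Hastie-Tibshirani iterative rescaling."""
    k = R.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(k):
            mu = p[i] / (p[i] + p)       # pairwise probs implied by current p
            num = R[i].sum() - R[i, i]   # observed evidence for class i
            den = mu.sum() - mu[i]       # model's current evidence for class i
            p[i] *= num / den
        p /= p.sum()
        if np.abs(p - p_old).max() < tol:
            break
    return p

# Illustrative pairwise estimates for 3 classes (R[i, j] + R[j, i] = 1).
R = np.array([[0.0, 0.7, 0.8],
              [0.3, 0.0, 0.6],
              [0.2, 0.4, 0.0]])
print(pairwise_coupling(R))   # class 0 receives the highest posterior
```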

- Classification | Pp. 104-116

An Agent-Based Approach to the Multiple-Objective Selection of Reference Vectors

Ireneusz Czarnowski; Piotr Jȩdrzejowicz

The paper proposes an agent-based approach to the multiple-objective selection of reference vectors from original datasets. Effective and dependable selection procedures are of vital importance to machine learning and data mining. The suggested approach is based on the multiple-agent paradigm. The authors propose using the JABAT middleware as a tool and the original instance reduction procedure as a method for selecting reference vectors under multiple objectives. The paper contains a brief introduction to multiple-objective optimization, followed by the formulation of the multiple-objective, agent-based reference vector selection problem. Further sections of the paper provide details on the proposed algorithm for generating a non-dominated (Pareto-optimal) set of reference vector sets. To validate the approach, a computational experiment was planned and carried out. Presentation and discussion of the experimental results conclude the paper.
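
At the heart of producing a non-dominated set is the Pareto-dominance test. A minimal sketch follows; the two objectives shown (classification error and reference-set size, both minimized) are a plausible reading of the reference-vector selection setting, not a confirmed detail of the paper:

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if solution a Pareto-dominates b (all objectives minimized):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(solutions: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the Pareto-optimal subset of the candidate solutions."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical (classification error, reference-set size) pairs for
# candidate reference-vector sets; both objectives are minimized.
candidates = [(0.12, 40), (0.10, 55), (0.15, 30), (0.12, 50), (0.09, 80)]
print(non_dominated(candidates))
# -> [(0.12, 40), (0.10, 55), (0.15, 30), (0.09, 80)]; (0.12, 50) is dominated
```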

- Feature Selection, Extraction and Dimensionality Reduction | Pp. 117-130