Publications catalogue - books
Machine Learning and Data Mining in Pattern Recognition: 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007. Proceedings
Petra Perner (ed.)
Conference: 5th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2007), Leipzig, Germany, July 18-20, 2007
Abstract/Description - provided by the publisher
Not available.
Keywords - provided by the publisher
Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Database Management; Data Mining and Knowledge Discovery; Pattern Recognition; Image Processing and Computer Vision
Availability
| Detected institution | Publication year | Browse | Download | Request |
|---|---|---|---|---|
| Not detected | 2007 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-73498-7
Electronic ISBN
978-3-540-73499-4
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2007
Copyright information
© Springer-Verlag Berlin Heidelberg 2007
Table of contents
Data Clustering: User’s Dilemma
Anil K. Jain
Data clustering is a long-standing research problem in pattern recognition, computer vision, machine learning, and data mining, with applications in a number of diverse disciplines. The goal is to partition a set of n d-dimensional points into k clusters, where k may or may not be known. Most clustering techniques require the definition of a similarity measure between patterns, which is not easy to specify in the absence of any prior knowledge about cluster shapes. While a large number of clustering algorithms exist, there is no optimal algorithm: each imposes a specific structure on the data and has its own approach for estimating the number of clusters, and no single algorithm can adequately handle the variety of cluster shapes and structures encountered in practice. Rather than devising yet another clustering algorithm, there is a need to build upon existing published techniques. This talk addresses the following problems: (i) clustering via evidence accumulation, (ii) simultaneous clustering and dimensionality reduction, (iii) clustering under pairwise constraints, and (iv) clustering with relevance feedback. Experimental results show that these approaches are promising for identifying arbitrarily shaped clusters in multidimensional data.
- Invited Talk | Pp. 1-1
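The evidence-accumulation idea in point (i) can be sketched in a few lines: run a base clusterer many times, count how often each pair of points lands in the same cluster, and then cluster that co-association matrix. The following is a minimal illustration of that idea, not Jain's exact formulation; the parameter choices (number of runs, range of k, final cluster count) are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def evidence_accumulation(X, n_runs=30, k_range=(2, 10), final_k=3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_runs):
        # run k-means with a randomly chosen k and accumulate co-memberships
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_runs
    dist = 1.0 - coassoc                    # co-association -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=final_k, criterion="maxclust")

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6.0])
print(evidence_accumulation(X, final_k=2))
```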
On Concentration of Discrete Distributions with Applications to Supervised Learning of Classifiers
Magnus Ekdahl; Timo Koski
Computational procedures using independence assumptions in various forms are popular in machine learning, although checks on empirical data have given inconclusive results about their impact. Some theoretical understanding of when they work is available, but a definitive answer seems to be lacking. This paper derives the distributions that maximize the statewise difference to the respective product of marginals; these are, in a sense, the worst distributions for predicting an outcome of the data-generating mechanism under the independence assumption. We also restrict the scope of the new theoretical results by showing explicitly that, depending on context, independent ('Naïve') classifiers can be as bad as tossing coins. Regardless of this, independence may beat the generating model in learning supervised classification, and we explicitly provide one such scenario.
- Classification | Pp. 2-16
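A tiny numeric instance of the "as bad as tossing coins" claim, assuming an XOR-style joint distribution (an illustrative stand-in, not one of the paper's constructions): the class-conditional marginals of both features are uniform, so a classifier built on the independence assumption scores every input identically, while the true Bayes classifier is perfect.

```python
import itertools

# P(x1, x2 | y): class 0 puts mass on (0,0),(1,1); class 1 on (0,1),(1,0)
joint = {0: {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0},
         1: {(0, 1): 0.5, (1, 0): 0.5, (0, 0): 0.0, (1, 1): 0.0}}
prior = {0: 0.5, 1: 0.5}

def nb_score(y, x):
    # product of per-feature marginals: the independence assumption
    p = prior[y]
    for i, xi in enumerate(x):
        p *= sum(q for key, q in joint[y].items() if key[i] == xi)
    return p

for x in itertools.product((0, 1), repeat=2):
    # every x gives 0.125 for both classes: the classifier is a coin toss
    print(x, nb_score(0, x), nb_score(1, x))
```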
Comparison of a Novel Combined ECOC Strategy with Different Multiclass Algorithms Together with Parameter Optimization Methods
Marco Hülsmann; Christoph M. Friedrich
In this paper we consider multiclass learning tasks based on Support Vector Machines (SVMs). Currently used methods decompose the task into binary subproblems, for example one-against-one or one-against-all, but there is much need for improvement in the field of multiclass learning. We developed a novel combination algorithm based on posterior class probabilities that assigns, according to the Bayes rule, each instance to the class with the highest posterior probability. A further problem with the usage of a multiclass method is the proper choice of parameters: many users simply take the default parameters of the respective learning algorithms (e.g. the regularization parameter C and the kernel parameter γ). We tested different parameter optimization methods on different learning algorithms and confirmed the better performance of SVMs, which can be explained by their maximum margin approach.
- Classification | Pp. 17-31
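The parameter-optimization point is easy to make concrete. A generic sketch using scikit-learn's grid search over C and the RBF kernel parameter gamma on a stand-in dataset; this illustrates the common practice the abstract argues for, not the paper's specific optimization methods or grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# search over the two RBF-SVM parameters instead of taking library defaults
grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```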
Multi-source Data Modelling: Integrating Related Data to Improve Model Performance
Paul R. Trundle; Daniel C. Neagu; Qasim Chaudhry
Traditional methods in Data Mining cannot be applied to all types of data with equal success. Innovative methods for model creation are needed to address the lack of model performance on data from which it is difficult to extract relationships. This paper proposes a set of algorithms that allow the integration of data from multiple related datasets, and presents results from applying these techniques to data from the field of Predictive Toxicology. The results show significant improvements when related data is used to aid the model creation process, both overall and in specific data ranges. The proposed algorithms have potential for use within any field where multiple related datasets exist, particularly fields combining computing, chemistry and biology.
- Classification | Pp. 32-46
An Empirical Comparison of Ideal and Empirical ROC-Based Reject Rules
Claudio Marrocco; Mario Molinara; Francesco Tortorella
Two-class classifiers are used in many complex problems in which the classification results could have serious consequences. In such situations the cost of a wrong classification can be so high that it is convenient to avoid a decision and reject the sample. This paper presents a comparison between two different reject rules (Chow's rule and the ROC-based rule). In particular, the experiments show that Chow's rule is inappropriate when the estimates of the a posteriori probabilities are not reliable.
- Classification | Pp. 47-60
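Chow's rule itself is compact enough to state in code: classify to the class with the highest estimated posterior, unless that posterior falls below a threshold, in which case reject. A minimal sketch, assuming a two-class problem and calibrated posterior estimates (the very assumption the paper shows can be unreliable in practice):

```python
import numpy as np

def chow_reject(posteriors, threshold=0.8):
    """Classify to the max-posterior class; reject if that posterior < threshold."""
    posteriors = np.asarray(posteriors)
    decisions = posteriors.argmax(axis=1)
    reject = posteriors.max(axis=1) < threshold
    return np.where(reject, -1, decisions)  # -1 marks a rejected sample

print(chow_reject([[0.95, 0.05], [0.55, 0.45]]))  # -> [ 0 -1]
```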
Outlier Detection with Kernel Density Functions
Longin Jan Latecki; Aleksandar Lazarevic; Dragoljub Pokrajac
Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First, we modify a nonparametric density estimate with a variable kernel to yield a robust local density estimate. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Our experiments, performed on several simulated data sets, demonstrate that the proposed approach can outperform two widely used outlier detection algorithms (LOF and LOCI).
- Classification | Pp. 61-75
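The density-comparison idea can be illustrated with a fixed-bandwidth Gaussian kernel (the paper uses a variable kernel, so this is a simplified stand-in): compute a leave-one-out density estimate at each point and score each point by how low its density is relative to the mean density of its k nearest neighbors.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kde_outlier_scores(X, k=5, bandwidth=1.0):
    D = cdist(X, X)
    K = np.exp(-(D ** 2) / (2 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)                 # leave-one-out density estimate
    density = K.sum(axis=1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
    neighbor_density = density[nn].mean(axis=1)
    return neighbor_density / (density + 1e-12)  # large score -> outlier

np.random.seed(0)
X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])
print(kde_outlier_scores(X).argmax())  # the isolated point (index 50) scores highest
```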
Generic Probability Density Function Reconstruction for Randomization in Privacy-Preserving Data Mining
Vincent Yan Fu Tan; See-Kiong Ng
Data perturbation with random noise signals has been shown to be useful for data hiding in privacy-preserving data mining. Perturbation methods based on additive randomization allow accurate estimation of the Probability Density Function (PDF) via the Expectation-Maximization (EM) algorithm, but it has been shown that noise-filtering techniques can reconstruct the original data in many cases, leading to security breaches. In this paper, we propose a PDF reconstruction algorithm that can be used with non-additive (as well as additive) randomization techniques for privacy-preserving data mining. This two-step reconstruction algorithm is based on Parzen-window reconstruction and quadratic programming over a convex set, the probability simplex. Our algorithm eliminates the usual need for the iterative EM algorithm and is generic for most randomization models. The simplicity of our two-step, non-iterative reconstruction algorithm also makes it attractive when dealing with large datasets.
- Classification | Pp. 76-90
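The second step, quadratic programming over the probability simplex, can be sketched with a general-purpose solver. This is not the authors' formulation, just the flavor of the constrained fit: find the vector closest to a target that is nonnegative and sums to one.

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_probs(target):
    """Closest point to `target` on the probability simplex (least squares)."""
    target = np.asarray(target, dtype=float)
    n = target.size
    res = minimize(lambda p: ((p - target) ** 2).sum(),
                   x0=np.full(n, 1.0 / n),            # start at the uniform point
                   bounds=[(0.0, None)] * n,          # nonnegativity
                   constraints=({"type": "eq",        # probabilities sum to one
                                 "fun": lambda p: p.sum() - 1.0},))
    return res.x

print(fit_simplex_probs([0.5, 0.4, -0.2, 0.3]).round(3))
```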
An Incremental Fuzzy Decision Tree Classification Method for Mining Data Streams
Tao Wang; Zhoujun Li; Yuejin Yan; Huowang Chen
One of the most important algorithms for mining data streams is VFDT. It uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the constructed tree. Gama et al. have extended VFDT in two directions: their system VFDTc can deal with continuous attributes and uses more powerful classification techniques at the tree leaves. In this paper, we revisit this problem and implement a system, fVFDT, on top of VFDT and VFDTc. We make the following four contributions: 1) We present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes; insertion of a new value is cheaper than in VFDT, and when a new example arrives, fVFDT only needs to update a single node where VFDTc must update several attribute tree nodes. 2) We improve the method of finding the best split-test point of a given continuous attribute, reducing the processing time compared to the method used in VFDTc. 3) Compared to VFDTc, fVFDT considers fewer candidate split-tests. 4) We improve the soft discretization method for use in data stream mining, which overcomes the problem of noisy data and improves classification accuracy.
- Classification | Pp. 91-103
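The Hoeffding bound at the core of VFDT (and hence fVFDT) is worth stating: after n observations of a quantity with range R, the true mean lies within ε = √(R² ln(1/δ) / (2n)) of the observed mean with probability at least 1 - δ. VFDT splits a leaf once the observed gap between the two best split candidates exceeds ε. A direct transcription of the formula:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean of n samples (range R) with probability at least 1 - delta."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# split once the observed gain gap between the two best attributes exceeds epsilon
print(hoeffding_bound(R=1.0, delta=1e-6, n=5000))  # ~0.037
```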
On the Combination of Locally Optimal Pairwise Classifiers
Gero Szepannek; Bernd Bischl; Claus Weihs
If their assumptions are not met, classifiers may fail. In this paper, the possibility of combining classifiers in multi-class problems is investigated: a multi-class classification problem is split into two-class problems, an optimal classifier is determined for each of them, and the results of applying the optimal classifiers to the two-class problems are combined using the algorithm of Hastie and Tibshirani (1998).
Exemplary situations are investigated in which the respective assumptions of Naive Bayes or classical Linear Discriminant Analysis (LDA, Fisher, 1936) fail, examining at which degree of violation of the assumptions it is advantageous to use the single methods or a classifier combination by pairwise coupling.
- Classification | Pp. 104-116
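A compact sketch of the pairwise coupling step of Hastie and Tibshirani (1998), assuming equal pairwise sample counts (a simplification of their scheme): given pairwise estimates r[i][j] ≈ P(class i | class i or j), apply multiplicative updates until the multiclass probabilities p are consistent with the pairwise ones.

```python
import numpy as np

def pairwise_coupling(r, n_iter=100):
    """r[i, j] ~ P(class i | class i or j), with r[i, j] + r[j, i] = 1."""
    k = r.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # model's implied pairwise probabilities under the current p
        mu = p[:, None] / (p[:, None] + p[None, :] + 1e-12)
        for i in range(k):
            num = sum(r[i, j] for j in range(k) if j != i)
            den = sum(mu[i, j] for j in range(k) if j != i)
            p[i] *= num / max(den, 1e-12)
        p /= p.sum()
    return p

r = np.array([[0.0, 0.9, 0.6],
              [0.1, 0.0, 0.7],
              [0.4, 0.3, 0.0]])
print(pairwise_coupling(r).round(3))  # class 0 gets the largest probability
```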
An Agent-Based Approach to the Multiple-Objective Selection of Reference Vectors
Ireneusz Czarnowski; Piotr Jȩdrzejowicz
The paper proposes an agent-based approach to the multiple-objective selection of reference vectors from original datasets. Effective and dependable selection procedures are of vital importance to machine learning and data mining. The suggested approach is based on the multi-agent paradigm: the authors propose using the JABAT middleware as a tool and an original instance reduction procedure as a method for selecting reference vectors under multiple objectives. The paper contains a brief introduction to multiple-objective optimization, followed by the formulation of the multiple-objective, agent-based reference vector selection problem. Further sections provide details of the proposed algorithm, which generates a non-dominated (Pareto-optimal) set of reference vector sets. To validate the approach, a computational experiment was planned and carried out; presentation and discussion of the experimental results conclude the paper.
- Feature Selection, Extraction and Dimensionality Reduction | Pp. 117-130
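The non-dominated (Pareto-optimal) set the algorithm outputs has a simple definition worth making explicit. A minimal sketch for minimization objectives, with made-up objective values (error rate and retained reference-set size, two plausible but assumed objectives for instance selection):

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# objectives: (classification error, retained reference vectors), both minimized
print(pareto_front([(0.10, 40), (0.12, 25), (0.09, 60), (0.15, 70)]))
```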