Publications catalog - books


Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings

Alípio Mário Jorge ; Luís Torgo ; Pavel Brazdil ; Rui Camacho ; João Gama (eds.)

Conference: 9th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Porto, Portugal. October 3, 2005 - October 7, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution Publication year Browse Download Request
None detected 2005 SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-29244-9

Electronic ISBN

978-3-540-31665-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Rank Measures for Ordering

Jin Huang; Charles X. Ling

Many data mining applications require a ranking, rather than a mere classification, of cases. Examples of these applications are widespread, including Internet search engines (ranking of pages returned) and customer relationship management (ranking of profitable customers). However, few theoretical foundations and practical guidelines have been established to assess the merits of different rank measures for ordering. In this paper, we first review several general criteria to judge the merits of different single-number measures. Then we propose a novel rank measure, and compare the commonly used rank measures and our new one according to the criteria. This leads to a preference order for these rank measures. We conduct experiments on real-world datasets to confirm the preference order. The results of the paper will be very useful in evaluating and comparing rank algorithms.

- Short Papers | Pp. 503-510

Dynamic Ensemble Re-Construction for Better Ranking

Jin Huang; Charles X. Ling

Ensemble learning has been shown to be very successful in data mining. However, most work on ensemble learning concerns the task of classification. Little work has been done to construct ensembles that aim to improve ranking. In this paper, we propose an approach to re-construct new ensembles based on a given ensemble with the aim of improving the ranking performance, which is crucial in many data mining tasks. The experiments with real-world data sets show that our new approach achieves significant improvements in ranking over the original Bagging and Adaboost ensembles.

Keywords: Ranking Performance; Ensemble Learning; Test Subset; Data Mining Task; Original Ensemble.

- Short Papers | Pp. 511-518

Frequency-Based Separation of Climate Signals

Alexander Ilin; Harri Valpola

The paper presents an example of exploratory data analysis of climate measurements using a recently developed denoising source separation (DSS) framework. We analysed a combined dataset containing daily measurements of three variables: surface temperature, sea level pressure and precipitation around the globe. Components exhibiting slow temporal behaviour were extracted using DSS with linear denoising. These slow components were further rotated using DSS with nonlinear denoising which implemented a frequency-based separation criterion. The rotated sources give a meaningful representation of the slow climate variability as a combination of trends, interannual oscillations, the annual cycle and slowly changing seasonal variations.

Keywords: Power Spectrum; Independent Component Analysis; Empirical Orthogonal Function; Slow Component.

- Short Papers | Pp. 519-526

Efficient Processing of Ranked Queries with Sweeping Selection

Wen Jin; Martin Ester; Jiawei Han

Existing methods for top-k ranked queries employ techniques including sorting, updating thresholds and materializing views. In this paper, we propose two novel index-based techniques for top-k ranked queries: (1) indexing the layered skyline, and (2) indexing microclusters of objects into a grid structure. We also develop efficient algorithms for ranked queries by locating the answer points during the sweeping of the line/hyperplane of the score function over the indexed objects. Both methods can be easily plugged into typical multi-dimensional database indexes. The comprehensive experiments not only demonstrate that our methods outperform the existing ones, but also illustrate that the application of a data mining technique (microclustering) is a useful and effective solution for database query processing.

Keywords: Score Function; Query Processing; Query Time; Skyline Query; Sweeping Process.

- Short Papers | Pp. 527-535
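
For orientation, the top-k ranked query problem addressed in the abstract above can be sketched as a brute-force baseline. This is not the authors' index-based method: it simply scores every object with a linear score function and keeps the k best. The function name `top_k`, the tuple representation of objects, and the choice of a weighted sum as the score function are assumptions made for illustration.

```python
import heapq


def top_k(points, weights, k):
    """Return the k points with the highest linear score (brute-force baseline).

    Assumption: the score of a point is the weighted sum of its coordinates.
    Every point is scored, so no index or sweeping is used here.
    """
    def score(point):
        return sum(w * x for w, x in zip(weights, point))

    # heapq.nlargest returns the k highest-scoring points, best first.
    return heapq.nlargest(k, points, key=score)


if __name__ == "__main__":
    objects = [(1, 2), (3, 1), (0, 5), (2, 2)]
    print(top_k(objects, weights=(2, 1), k=2))
```

An index-based approach such as the one the abstract describes aims to avoid scoring every object; the baseline is only useful as a correctness reference on small inputs.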

Feature Extraction from Mass Spectra for Classification of Pathological States

Alexandros Kalousis; Julien Prados; Elton Rexhepaj; Melanie Hilario

Mass spectrometry is becoming an important tool in proteomics. The representation of mass spectra is characterized by very high dimensionality and a high level of redundancy. Here we present a feature extraction method for mass spectra that directly models for domain knowledge, reduces the dimensionality and redundancy of the initial representation and controls for the level of granularity of feature extraction by seeking to optimize classification accuracy. A number of experiments are performed which show that the feature extraction preserves the initial discriminatory content of the learning examples.

Keywords: Feature Extraction; Peak Detection; Discriminatory Information; Initial Representation; Spatial Redundancy.

- Short Papers | Pp. 536-543

Numbers in Multi-relational Data Mining

Arno J. Knobbe; Eric K. Y. Ho

Numeric data has traditionally received little attention in the field of Multi-Relational Data Mining (MRDM). It is often assumed that numeric data can simply be turned into symbolic data by means of discretisation. However, very few guidelines for successfully applying discretisation in MRDM exist. Furthermore, it is unclear whether the loss of information involved is negligible. In this paper, we consider different alternatives for dealing with numeric data in MRDM. Specifically, we analyse the adequacy of discretisation by performing a number of experiments with different existing discretisation approaches, and comparing the results with a procedure that handles numeric data dynamically. The discretisation procedures considered include an algorithm that is insensitive to the multi-relational structure of the data, and two algorithms that do involve this structure. With the empirical results thus obtained, we shed some light on the applicability of both dynamic and static procedures (discretisation), and give recommendations for when and how they can best be applied.

Keywords: Numeric Data; Dynamic Approach; Numeric Attribute; Inductive Logic Programming; Nominal Attribute.

- Short Papers | Pp. 544-551

Testing Theories in Particle Physics Using Maximum Likelihood and Adaptive Bin Allocation

Bruce Knuteson; Ricardo Vilalta

We describe a methodology to assist scientists in quantifying the degree of evidence in favor of a new proposed theory compared to a standard baseline theory. The figure of merit is the log-likelihood ratio of the data given each theory. The novelty of the proposed mechanism lies in the likelihood estimations; the central idea is to adaptively allocate histogram bins that emphasize regions in the variable space where there is a clear difference in the predictions made by the two theories. We describe a software system that computes this figure of merit in the context of particle physics, and describe two examples conducted at the Tevatron Ring at the Fermi National Accelerator Laboratory. Results show how two proposed theories compare to the Standard Model and how the likelihood ratio varies as a function of a physical parameter (e.g., by varying the particle mass).

Keywords: Particle Physics; Variable Space; Relative Entropy; Particle Accelerator; Actual Observation.

- Short Papers | Pp. 552-560
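
The figure of merit named in the abstract above, a log-likelihood ratio over histogram bins, can be illustrated with a minimal sketch. It assumes that each bin count follows an independent Poisson distribution whose mean is the count predicted by a theory; the abstract does not state this, nor does it describe how bins are allocated, so the adaptive binning step is not shown. The function names are hypothetical.

```python
import math


def poisson_log_likelihood(observed, expected):
    """Log-likelihood of observed bin counts given expected (predicted) counts.

    Assumption: independent Poisson counts per bin, with every expected
    count strictly positive.
    """
    total = 0.0
    for n, mu in zip(observed, expected):
        # log of Poisson pmf: n*log(mu) - mu - log(n!)
        total += n * math.log(mu) - mu - math.lgamma(n + 1)
    return total


def log_likelihood_ratio(observed, expected_new, expected_base):
    """Positive values favour the new theory, negative values the baseline."""
    return (poisson_log_likelihood(observed, expected_new)
            - poisson_log_likelihood(observed, expected_base))


if __name__ == "__main__":
    counts = [10, 2]  # made-up bin counts, for illustration only
    print(log_likelihood_ratio(counts, [10.0, 2.0], [5.0, 5.0]))
```

Bins in which the two theories predict the same count contribute nothing to the ratio, which is consistent with the abstract's emphasis on regions where the predictions clearly differ.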

Improved Naive Bayes for Extremely Skewed Misclassification Costs

Aleksander Kołcz; Abdur Chowdhury

Naive Bayes has been an effective and important classifier in the text categorization domain despite violations of its underlying assumptions. Although quite accurate, it tends to provide poor estimates of the posterior class probabilities, which hampers its application in the cost-sensitive context. The apparent high confidence with which certain errors are made is particularly problematic when misclassification costs are highly skewed, since conservative setting of the decision threshold may greatly decrease the classifier utility. We propose an extension of the Naive Bayes algorithm aiming to discount the confidence with which errors are made. The approach is based on measuring the amount of change to feature distribution necessary to reverse the initial classifier decision and can be implemented efficiently without over-complicating the process of Naive Bayes induction. In experiments with three benchmark document collections, the decision-reversal Naive Bayes is demonstrated to substantially improve over the popular multinomial version of the Naive Bayes algorithm, in some cases performing more than 40% better.

- Short Papers | Pp. 561-568

Clustering and Prediction of Mobile User Routes from Cellular Data

Kari Laasonen

Location-awareness and prediction of future locations are important problems in pervasive and mobile computing. In cellular systems (e.g., GSM) the serving cell is easily available as an indication of the user location, without any additional hardware or network services. With this location data and other context variables we can determine places that are important to the user, such as work and home. We devise online algorithms that learn routes between important locations and predict the next location when the user is moving. We incrementally build clusters of cell sequences to represent physical routes. Predictions are based on destination probabilities derived from these clusters. Other context variables such as the current time can be integrated into the model. We evaluate the model with real location data, and show that it achieves good prediction accuracy with relatively little memory, making the algorithms suitable for online use in mobile environments.

- Short Papers | Pp. 569-576

Elastic Partial Matching of Time Series

L. J. Latecki; V. Megalooikonomou; Q. Wang; R. Lakaemper; C. A. Ratanamahatana; E. Keogh

We consider a problem of elastic matching of time series. We propose an algorithm that automatically determines a subsequence b′ of a target time series b that best matches a query series a. In the proposed algorithm we map the problem of the best matching subsequence to the problem of a cheapest path in a DAG (directed acyclic graph). Our experimental results demonstrate that the proposed algorithm outperforms the commonly used Dynamic Time Warping in retrieval accuracy.

Keywords: Time Series; Dynamic Time Warping; Longest Common Subsequence; Poor Quality Data; Target Series.

- Short Papers | Pp. 577-584
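
The Dynamic Time Warping baseline mentioned in the abstract above can be sketched as follows. This is the textbook dynamic-programming formulation, not the authors' DAG-based partial matching algorithm; the function name `dtw_distance` and the use of absolute difference as the local cost are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Assumption: the local cost of aligning two samples is their absolute
    difference, and the whole of `a` is matched against the whole of `b`.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because this formulation aligns the two series end to end, it cannot skip parts of the target series, which is the limitation that a partial (subsequence) matching approach is meant to address.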