Catálogo de publicaciones - libros

Compartir en
redes sociales


Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings

Alípio Mário Jorge ; Luís Torgo ; Pavel Brazdil ; Rui Camacho ; João Gama (eds.)

En conferencia: 9º European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) . Porto, Portugal . October 3, 2005 - October 7, 2005

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

No disponibles.

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2005 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-29244-9

ISBN electrónico

978-3-540-31665-7

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2005

Tabla de contenidos

Data Analysis in the Life Sciences — Sparking Ideas —

Michael R. Berthold

Data from various areas of Life Sciences have increasingly caught the attention of data mining and machine learning researchers. Not only is the amount of data available mind-boggling but the diverse and heterogenous nature of the information is far beyond any other data analysis problem so far. In sharp contrast to classical data analysis scenarios, the life science area poses challenges of a rather different nature for mainly two reasons. Firstly, the available data stems from heterogenous information sources of varying degrees of reliability and quality and is, without the interactive, constant interpretation of a domain expert, not useful. Furthermore, predictive models are of only marginal interest to those users – instead they hope for new insights into a complex, biological system that is only partially represented within that data anyway. In this scenario, the data serves mainly to create new insights and generate new ideas that can be tested. Secondly, the notion of feature space and the accompanying measures of similarity cannot be taken for granted. Similarity measures become context dependent and it is often the case that within one analysis task several different ways of describing the objects of interest or measuring similarity between them matter. Some more recently published work in the data analysis area has started to address some of these issues. For example, data analysis in parallel universes[1], that is, the detection of patterns of interest in various different descriptor spaces at the same time, and mining of frequent, discriminative fragments in large, molecular data bases[2]. In both cases, sheer numerical performance is not the focus; it is rather the discovery of interpretable pieces of evidence that lights up new ideas in the users mind. Future work in data analysis in the life sciences needs to keep this in mind: the goal is to trigger new ideas and stimulate interesting associations.

- Invited Talks | Pp. 1-1

Machine Learning for Natural Language Processing (and Vice Versa?)

Claire Cardie

Over the past 10-15 years, the influence of methods from machine learning has transformed the way that research is done in the field of natural language processing. This talk will begin by covering the history of this transformation. In particular, learning methods have proved successful in producing stand-alone text-processing components to handle a number of linguistic tasks. Moreover, these components can be combined to produce systems that exhibit shallow text-understanding capabilities: they can, for example, extract key facts from unrestricted documents in limited domains or find answers to general-purpose questions from open-domain document collections. I will briefly describe the state of the art for these practical text-processing applications, focusing on the important role that machine learning methods have played in their development. The second part of the talk will explore the role that natural language processing might play in machine learning research. Here, I will explain the kinds of text-based features that are relatively easy to incorporate into machine learning data sets. In addition, I’ll outline some problems from natural language processing that require, or could at least benefit from, new machine learning algorithms.

- Invited Talks | Pp. 2-2

Statistical Relational Learning: An Inductive Logic Programming Perspective

Luc De Raedt

In the past few years there has been a lot of work lying at the intersection of probability theory, logic programming and machine learning [14,18,13,9,6,1,11]. This work is known under the names of statistical relational learning [7,5], probabilistic logic learning [4], or probabilistic inductive logic programming. Whereas most of the existing works have started from a probabilistic learning perspective and extended probabilistic formalisms with relational aspects, I shall take a different perspective, in which I shall start from inductive logic programming and study how inductive logic programming formalisms, settings and techniques can be extended to deal with probabilistic issues. This tradition has already contributed a rich variety of valuable formalisms and techniques, including probabilistic Horn abduction by David Poole, PRISMs by Sato, stochastic logic programs by Muggleton[13] and Cussens[2], Bayesian logic programs[10,8] by Kersting and De Raedt, and Logical Hidden Markov Models[11]. The main contribution of this talk is the introduction of three probabilistic inductive logic programming settings which are derived from the learning from entailment, from interpretations and from proofs settings of the field of inductive logic programming [3]. Each of these settings contributes different notions of probabilistic logic representations, examples and probability distributions. The first setting, probabilistic learning from entailment, is incorporated in the wellknown PRISM system [19] and Cussens’s Failure Adjusted Maximisation approach to parameter estimation in stochastic logic programs [2]. A novel system that was recently developed and that .ts this paradigm is the nFOIL system [12]. It combines key principles of the well-known inductive logic programming system FOIL [15] with the naïve Bayes’ appraoch. In probabilistic learning from entailment, examples are ground facts that should be probabilistically entailed by the target logic program. The second setting, probabilistic learning from interpretations, is incorporated in Bayesian logic programs [10,8], which integrate Bayesian networks with logic programs. This setting is also adopted by [6]. Examples in this setting are Herbrand interpretations that should be a probabilistic model for the target theory. The third setting, learning from proofs [17], is novel. It is motivated by the learning of stochastic context free grammars from tree banks. In this setting, examples are proof trees that should be probabilistically provable from the unknown stochastic logic programs. The sketched settings (and their instances presented) are by no means the only possible settings for probabilisticinductive logic programming, but still – I hope – provide useful insights into the state-of-the-art of this exciting field. For a full survey of statistical relational learning or probabilistic inductive logic programming, the author would like to refer to [4], and for more details on the probabilistic inductive logic programming settings to [16], where a longer and earlier version of this contribution can be found.

Palabras clave: Bayesian Network; Logic Program; Logic Programming; Inductive Logic; Inductive Logic Programming.

- Invited Talks | Pp. 3-5

Recent Advances in Mining Time Series Data

Eamonn Keogh

Much of the world’s supply of data is in the form of time series. Furthermore, as we shall see, many types of data can be meaningfully converted into ”time series”, including text, DNA, video, images etc. The last decade has seen an explosion of interest in mining time series data from the academic community. There has been significant work on algorithms to classify, cluster, segment, index, discover rules, visualize, and detect anomalies/novelties in time series. In this talk I will summarize the latest advances in mining time series data, including: – New representations of time series data. – New algorithms/definitions. – The migration from static problems to online problems. – New areas and applications of time series data mining. I will end the talk with a discussion of “what’s left to do” in time series data mining.

Palabras clave: Time Series; Data Mining; Knowledge Discovery; Time Series Data; Academic Community.

- Invited Talks | Pp. 6-6

Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce

Ron Kohavi

Electronic Commerce is now entering its second decade, with Amazon.com and eBay now in existence for ten years. With massive amounts of data, an actionable domain, and measurable ROI, multiple companies use data mining and knowledge discovery to understand their customers and improve interactions. We present important lessons and challenges using e-commerce examples across two dimensions: (i) business-level to technical, and (ii) the mining lifecycle from data collection, data warehouse construction, to discovery and deployment. Many of the lessons and challenges are applicable to domains outside e-commerce.

- Invited Talks | Pp. 7-7

Data Streams and Data Synopses for Massive Data Sets

Yossi Matias

With the proliferation of data intensive applications, it has become necessary to develop new techniques to handle massive data sets. Traditional algorithmic techniques and data structures are not always suitable to handle the amount of data that is required and the fact that the data often streams by and cannot be accessed again. A field of research established over the past decade is that of handling massive data sets using data synopses, and developing algorithmic techniques for data stream models. We will discuss some of the research work that has been done in the field, and provide a decades’ perspective to data synopses and data streams.

- Invited Talks | Pp. 8-9

k-Anonymous Patterns

Maurizio Atzori; Francesco Bonchi; Fosca Giannotti; Dino Pedreschi

It is generally believed that data mining results do not violate the anonymity of the individuals recorded in the source database. In fact, data mining models and patterns, in order to ensure a required statistical significance, represent a large number of individuals and thus conceal individual identities: this is the case of the minimum support threshold in association rule mining. In this paper we show that this belief is ill-founded. By shifting the concept of k-anonymity from data to patterns, we formally characterize the notion of a threat to anonymity in the context of pattern discovery, and provide a methodology to efficiently and effectively identify all possible such threats that might arise from the disclosure of a set of extracted patterns.

Palabras clave: Association Rule; Frequent Itemsets; Association Rule Mining; Support Threshold; Minimum Support Threshold.

- Long Papers | Pp. 10-21

Interestingness is Not a Dichotomy: Introducing Softness in Constrained Pattern Mining

Stefano Bistarelli; Francesco Bonchi

The paradigm of pattern discovery based on constraints was introduced with the aim of providing to the user a tool to drive the discovery process towards potentially interesting patterns, with the positive side effect of achieving a more efficient computation. So far the research on this paradigm has mainly focussed on the latter aspect: the development of efficient algorithms for the evaluation of constraint-based mining queries. Due to the lack of research on methodological issues, the constraint-based pattern mining framework still suffers from many problems which limit its practical relevance. As a solution, in this paper we introduce the new paradigm of pattern discovery based on Soft Constraints . Albeit simple, the proposed paradigm overcomes all the major methodological drawbacks of the classical constraint-based paradigm, representing an important step further towards practical pattern discovery.

Palabras clave: Pattern Mining; Frequent Itemsets; Mining Association Rule; Soft Constraint; Pattern Discovery.

- Long Papers | Pp. 22-33

Generating Dynamic Higher-Order Markov Models in Web Usage Mining

José Borges; Mark Levene

Markov models have been widely used for modelling users’ web navigation behaviour. In previous work we have presented a dynamic clustering-based Markov model that accurately represents second-order transition probabilities given by a collection of navigation sessions. Herein, we propose a generalisation of the method that takes into account higher-order conditional probabilities. The method makes use of the state cloning concept together with a clustering technique to separate the navigation paths that reveal differences in the conditional probabilities. We report on experiments conducted with three real world data sets. The results show that some pages require a long history to understand the users choice of link, while others require only a short history. We also show that the number of additional states induced by the method can be controlled through a probability threshold parameter.

Palabras clave: Markov Model; Link Prediction; Navigation Pattern; User Navigation; State Cloning.

- Long Papers | Pp. 34-45

Tree ^2 – Decision Trees for Tree Structured Data

Björn Bringmann; Albrecht Zimmermann

We present Tree ^2, a new approach to structural classification . This integrated approach induces decision trees that test for pattern occurrence in the inner nodes. It combines state-of-the-art tree mining with sophisticated pruning techniques to find the most discriminative pattern in each node. In contrast to existing methods, Tree ^2 uses no heuristics and only a single, statistically well founded parameter has to be chosen by the user. The experiments show that Tree ^2 classifiers achieve good accuracies while the induced models are smaller than those of existing approaches, facilitating better comprehensibility.

Palabras clave: Decision Tree; Association Rule; Information Gain; Correlation Measure; Association Rule Mining.

- Long Papers | Pp. 46-58