Catálogo de publicaciones - libros

Compartir en
redes sociales


Knowledge Discovery in Inductive Databases: 4th International Workshop, KDID 2005, Porto, Portugal, October 3, 2005, Revised Selected and Invited Papers

Francesco Bonchi ; Jean-François Boulicaut (eds.)

En conferencia: 4º International Workshop on Knowledge Discovery in Inductive Databases (KDID) . Porto, Portugal . October 3, 2005 - October 3, 2005

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

No disponibles.

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2006 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-33292-3

ISBN electrónico

978-3-540-33293-0

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2006

Tabla de contenidos

Data Mining in Inductive Databases

Arno Siebes

Ever since the seminal paper by Imielinski and Mannila [11], inductive databases have been a constant theme in the data mining literature. Operationally, such an inductive database is a database in which models and patterns are first class citizens.

In the extensive literature on inductive databases there is at least one consequence of this operational definition that is conspicuously missing. That is the question: if we have models and patterns in our inductive database, how does this help to discover other models and patterns? This question is the topic of this paper.

- Invited Papers | Pp. 1-23

Mining Databases and Data Streams with Query Languages and Rules

Carlo Zaniolo

Among data-intensive applications that are beyond the reach of traditional Data Base Management Systems (DBMS), data mining stands out because of practical importance and the complexity of the research problems that must be solved before the vision of Inductive DBMS can become a reality. In this paper, we first discuss technical developments that have occurred since the very notion of Inductive DBMS emerged as a result of the seminal papers authored by Imielinski and Mannila a decade ago. The research progress achieved since then can be subdivided into three main problem subareas as follows: (i) language (ii) optimization, and (iii) representation. We discuss the problems in these three areas and the different approaches to Inductive DBMS that are made possible by recent technical advances. Then, we pursue a language-centric solution, and introduce simple SQL extensions that have proven very effective at supporting data mining. Finally, we turn our attention to the related problem of supporting data stream mining using Data Stream Management Systems (DSMS) and introduce the notion of Inductive DSMS. In addition to continuous query languages, DSMS provide support for synopses, sampling, load shedding, and other built-in functions that are needed for data stream mining. Moreover, we show that Inductive DSMS can be achieved by generalizing DSMS to assure that their continuous query languages support efficiently data stream mining applications. Thus, DSMS extended with inductive capabilities will provide a uniquely supportive environment for data stream mining applications.

- Invited Papers | Pp. 24-37

Memory-Aware Frequent -Itemset Mining

Maurizio Atzori; Paolo Mancarella; Franco Turini

In this paper we show that the well known problem of computing frequent -itemsets (i.e. itemsets of cardinality ) in a given dataset can be reduced to the problem of finding iceberg queries from a stream of queries suitably constructed from the original dataset. Hence, algorithms for computing frequent -itemsets can be obtained by adapting algorithms for computing iceberg queries. In the paper we show that, for sparse datasets, this can be done directly, i.e. without generating frequent -itemsets, for each < , as done in the most common algorithms based on a level-wise approach. We exploit a recent algorithm for finding iceberg queries and define an algorithm which requires only three sequential passes over the dataset to compute the frequent -itemsets (even for > 3). An important feature of the algorithm is that the amount of main memory required can be determined in advance, and it is shown to be very low for sparse datasets. Experiments show that for very large datasets with millions of small transactions our proposal outperforms the state-of-the-art algorithms. Furthermore, we sketch a first extension of our algorithm that works over data streams.

- Contributed Papers | Pp. 38-54

Constraint-Based Mining of Fault-Tolerant Patterns from Boolean Data

Jérémy Besson; Ruggero G. Pensa; Céline Robardet; Jean-François Boulicaut

Thanks to an important research effort during the last few years, inductive queries on local patterns (e.g., set patterns) and their associated complete solvers have been proved extremely useful to support knowledge discovery. The more we use such queries on real-life data, e.g., biological data, the more we are convinced that inductive queries should return fault-tolerant patterns. This is obviously the case when considering formal concept discovery from noisy datasets. Therefore, we study various extensions of this kind of bi-set towards fault-tolerance. We compare three declarative specifications of fault-tolerant bi-sets by means of a constraint-based mining approach. Our framework enables a better understanding of the needed trade-off between extraction feasibility, completeness, relevance, and ease of interpretation of these fault-tolerant patterns. An original empirical evaluation on both synthetic and real-life medical data is given. It enables a comparison of the various proposals and it motivates further directions of research.

- Contributed Papers | Pp. 55-71

Experiment Databases: A Novel Methodology for Experimental Research

Hendrik Blockeel

Data mining and machine learning are experimental sciences: a lot of insight in the behaviour of algorithms is obtained by implementing them and studying how they behave when run on datasets. However, such experiments are often not as extensive and systematic as they ideally would be, and therefore the experimental results must be interpreted with caution. In this paper we present a new experimental methodology that is based on the concept of “experiment databases”. An experiment database can be seen as a special kind of inductive database, and the experimental methodology consists of filling and then querying this database. We show that the novel methodology has numerous advantages over the existing one. As such, this paper presents a novel and interesting application of inductive databases that may have a significant impact on experimental research in machine learning and data mining.

- Contributed Papers | Pp. 72-85

Quick Inclusion-Exclusion

Toon Calders; Bart Goethals

Many data mining algorithms make use of the well-known Inclusion-Exclusion principle. As a consequence, using this principle efficiently is crucial for the success of all these algorithms. Especially in the context of condensed representations, such as NDI, and in computing interesting measures, a quick inclusion-exclusion algorithm can be crucial for the performance. In this paper, we give an overview of several algorithms that depend on the inclusion-exclusion principle and propose an efficient algorithm to use it and evaluate its complexity. The theoretically obtained results are supported by experimental evaluation of the quick IE technique in isolation, and of an example application.

- Contributed Papers | Pp. 86-103

Towards Mining Frequent Queries in Star Schemes

Tao-Yuan Jen; Dominique Laurent; Nicolas Spyratos; Oumar Sy

The problem of mining frequent queries in a database is intractable, even if we consider conjunctive queries only. In this paper, we study this problem under reasonable restrictions on the database, namely: () the database scheme is a star scheme; () the data in the database satisfies a set of functional dependencies and a set of referential constraints.

We note that star schemes are considered to be the most appropriate schemes for data warehouses, while functional dependencies and referential constraints are the most common constraints that one encounters in real databases. Our approach is based on the weak instance semantics of databases and considers the class of selection-projection queries over weak instances. In such a context, we show that frequent queries can be mined using level-wise algorithms such as Apriori.

- Contributed Papers | Pp. 104-123

Inductive Databases in the Relational Model: The Data as the Bridge

Stefan Kramer; Volker Aufschild; Andreas Hapfelmeier; Alexander Jarasch; Kristina Kessler; Stefan Reckow; Jörg Wicker; Lothar Richter

We present a new and comprehensive approach to inductive databases in the relational model. The main contribution is a new inductive query language extending SQL, with the goal of supporting the whole knowledge discovery process, from pre-processing via data mining to post-processing. A prototype system supporting the query language was developed in the SINDBAD (structured inductive database development) project. Setting aside models and focusing on distance-based and instance-based methods, closure can easily be achieved. An example scenario from the area of gene expression data analysis demonstrates the power and simplicity of the concept. We hope that this preliminary work will help to bring the fundamental issues, such as the integration of various pattern domains and data mining techniques, to the attention of the inductive database community.

- Contributed Papers | Pp. 124-138

Transaction Databases, Frequent Itemsets, and Their Condensed Representations

Taneli Mielikäinen

Mining frequent itemsets is a fundamental task in data mining. Unfortunately the number of frequent itemsets describing the data is often too large to comprehend. This problem has been attacked by condensed representations of frequent itemsets that are subcollections of frequent itemsets containing only the frequent itemsets that cannot be deduced from other frequent itemsets in the subcollection, using some deduction rules. In this paper we review the most popular condensed representations of frequent itemsets, study their relationship to transaction databases and each other, examine their combinatorial and computational complexity, and describe their relationship to other important concepts in combinatorial data analysis, such as Vapnik-Chervonenkis dimension and hypergraph transversals.

- Contributed Papers | Pp. 139-164

Multi-class Correlated Pattern Mining

Siegfried Nijssen; Joost N. Kok

To mine databases in which examples are tagged with class labels, the minimum correlation constraint has been studied as an alternative to the minimum frequency constraint. We reformulate previous approaches and show that a minimum correlation constraint can be transformed into a disjunction of minimum frequency constraints. We prove that this observation extends to the multi-class correlation measure, and thus obtain an efficient new () prune test. We illustrate how the relation between correlation measures and minimum support thresholds allows for the reuse of previously discovered pattern sets, thus avoiding unneccessary database evaluations. We conclude with experimental results to assess the effectivity of algorithms based on our observations.

- Contributed Papers | Pp. 165-187