Catálogo de publicaciones - libros

Compartir en
redes sociales


Knowledge Discovery in Databases: PKDD 2007: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 17-21, 2007. Proceedings

Joost N. Kok ; Jacek Koronacki ; Ramon Lopez de Mantaras ; Stan Matwin ; Dunja Mladenič ; Andrzej Skowron (eds.)

En conferencia: 11º European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) . Warsaw, Poland . September 17, 2007 - September 21, 2007

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

No disponibles.

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2007 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-74975-2

ISBN electrónico

978-3-540-74976-9

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2007

Tabla de contenidos

Constructing High Dimensional Feature Space for Time Series Classification

Victor Eruhimov; Vladimir Martyanov; Eugene Tuv

The paper investigates a generic method of time series classification that is invariant to transformations of time axis. The state-of-art methods widely use Dynamic Time Warping (DTW) with One-Nearest-Neighbor (1NN). We use DTW to transform time axis of each signal in order to decrease the Euclidean distance between signals from the same class. The predictive accuracy of an algorithm that learns from a heterogeneous set of features extracted from signals is analyzed. Feature selection is used to filter out irrelevant predictors and a serial ensemble of decision trees is used for classification. We simulate a dataset for providing a better insight into the algorithm. We also compare our method to DTW+1NN on several publicly available datasets.

- Short Papers | Pp. 414-421

A Dynamic Clustering Algorithm for Mobile Objects

Dominique Fournier; Gaële Simon; Bruno Mermet

In this paper, a multiagent algorithm for dynamic clustering is presented. This kind of clustering is intended to manage mobile data and so, to be able to continuously adapt the built clusters. First of all, potential applications of this algorithm are presented. Then the specific constraints for this kind of clustering are studied. A multiagent architecture satisfying these constraints is described. It combines an ants algorithm with a cluster agents layer which are executed simultaneously. Finally, the first experimental results of our work are presented.

- Short Papers | Pp. 422-429

A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions

Richard Frank; Flavia Moser; Martin Ester

This paper presents a novel method for multi-relational classification via an aggregation-based Inductive Logic Programming (ILP) approach. We extend the classical ILP representation by aggregation of multiple-features which aid the classification process by allowing for the analysis of relationships and dependencies between different features. In order to efficiently learn rules of this rich format, we present a novel algorithm capable of performing aggregation with the use of virtual joins of the data. By using more expressive aggregation predicates than the existential quantifier used in standard ILP methods, we improve the accuracy of multi-relational classification. This claim is supported by experimental evaluation on three different real world datasets.

- Short Papers | Pp. 430-437

MINI: Mining Informative Non-redundant Itemsets

Arianna Gallo; Tijl De Bie; Nello Cristianini

Frequent itemset mining assists the data mining practitioner in searching for strongly associated items (and transactions) in large transaction databases. Since the number of frequent itemsets is usually extremely large and unmanageable for a human user, recent works have sought to define condensed representations of them, e.g. or frequent itemsets. We argue that not only these methods often still fall short in sufficiently reducing of the output size, but they also output many redundant itemsets. In this paper we propose a philosophically new approach that resolves both these issues in a computationally tractable way. We present and empirically validate a statistically founded approach called MINI, to compress the set of frequent itemsets down to a list of informative and non-redundant itemsets.

- Short Papers | Pp. 438-445

Stream-Based Electricity Load Forecast

João Gama; Pedro Pereira Rodrigues

Sensors distributed all around electrical-power distribution networks produce streams of data at high-speed. From a data mining perspective, this sensor network problem is characterized by a large number of variables (sensors), producing a continuous flow of data, in a dynamic non-stationary environment. Companies make decisions to buy or sell energy based on load profiles and forecast. We propose an architecture based on an online clustering algorithm where each cluster (group of sensors with high correlation) contains a neural-network based predictive model. The goal is to maintain in real-time a clustering model and a predictive model able to incorporate new information at the speed data arrives, detecting changes and adapting the decision models to the most recent information. We present results illustrating the advantages of the proposed architecture, on several temporal horizons, and its competitiveness with another predictive strategy.

- Short Papers | Pp. 446-453

Automatic Hidden Web Database Classification

Zhiguo Gong; Jingbai Zhang; Qian Liu

In this paper, a method for automatic classification of Hidden-Web databases is addressed. In our approach, the classification tree for Hidden Web databases is constructed by tailoring the well accepted classification tree of DMOZ Directory. Then the feature for each class is extracted from randomly selected Web documents in the corresponding category. For each Web database, query terms are selected from the class features based on their weights. A hidden-web database is then probed by analyzing the results of the class-specific query. To raise the performance further, we also use Web pages which have links pointing to the hidden-web database (HW-DB) as another important source to represent the database. We combine link-based evaluation and query-based probing as our final classification solution. The experiment shows that the combined method can produce much better performance for classification of hidden Web Databases.

- Short Papers | Pp. 454-461

Pruning Relations for Substructure Discovery of Multi-relational Databases

Hongyu Guo; Herna L. Viktor; Eric Paquet

Multirelational data mining methods discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a multi-relational database spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, financial management, and so on. When considering multirelational classification, it follows that these subdivisions will express different interests in the data, leading to the need to explore various subsets of relevant relations with high utility with respect to the target class. The paper presents a novel approach for pruning the uninteresting relations of a relational database where relations come from such different parties and spans many classification tasks. We aim to create a pruned structure and thus minimize predictive performance loss on the final classification model. Our method identifies a set of strongly uncorrelated subgraphs to use for training and discards all others. The experiments performed demonstrate that our strategy is able to significantly reduce the size of the relational schema without sacrificing predictive accuracy.

- Short Papers | Pp. 462-470

The Most Reliable Subgraph Problem

Petteri Hintsanen

We introduce the problem of finding the most reliable subgraph: given a probabilistic graph subject to random edge failures, a set of terminal vertices, and an integer find a subgraph ⊂ having fewer edges than , such that the probability of connecting the terminals in is maximized. The solution has applications in link analysis and visualization. We begin by formally defining the problem in a general form, after which we focus on a two-terminal, undirected case. Although the problem is most likely computationally intractable, we give a polynomial-time algorithm for a special case where is seriesparallel. For the general case, we propose a computationally efficient greedy heuristic. Our experiments on simulated graphs illustrate the usefulness of the concept of most reliable subgraph, and suggest that the heuristic for the general case is quite competitive.

- Short Papers | Pp. 471-478

Matching Partitions over Time to Reliably Capture Local Clusters in Noisy Domains

Frank Höppner; Mirko Böttcher

When seeking for small clusters it is very intricate to distinguish between incidental agglomeration of noisy points and true local patterns. We present the PAMALOC algorithm that addresses this problem by exploiting temporal information which is contained in most business data sets. The algorithm enables the detection of local patterns in noisy data sets more reliable compared to the case when the temporal information is ignored. This is achieved by making use of the fact that noise does not reproduce its incidental structure but even small patterns do. In particular, we developed a method to track clusters over time based on an optimal match of data partitions between time periods.

- Short Papers | Pp. 479-486

Searching for Better Randomized Response Schemes for Privacy-Preserving Data Mining

Zhengli Huang; Wenliang Du; Zhouxuan Teng

To preserve user privacy in Privacy-Preserving Data Mining (PPDM), the randomized response (RR) technique is widely used for categorical data. Although various RR schemes have been proposed, there is no study to systematically compare them in order to find optimal RR schemes. In the paper, we choose the R-U (Risk-Utility) confidentiality map to compare different randomization schemes. Using the R-U map as our metric, we present an optimal RR scheme for binary data, which helps us find an optimal class of RR matrices. From this optimal scheme, we have discovered several heuristic rules among the elements in the optimal class. We generalize these rules to find optimal class of RR matrices for categorical data. Based on these rules, we propose an RR scheme to find a class of RR matrices for categorical data. Our experimental results have shown that our scheme has much better performance than the existing RR schemes.

- Short Papers | Pp. 487-497