Publications catalog - books



Data Warehousing and Knowledge Discovery: 7th International Conference, DaWaK 2005, Copenhagen, Denmark, August 22-26, 2005, Proceedings

A Min Tjoa; Juan Trujillo (eds.)

Conference: 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Copenhagen, Denmark. August 22-26, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Detected institution: Not detected
Publication year: 2005
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-28558-8

Electronic ISBN

978-3-540-31732-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2005
Publication rights information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Information Driven Evaluation of Data Hiding Algorithms

Elisa Bertino; Igor Nai Fovino

Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. Due to the large number of possible techniques that can be used to achieve this goal, it is necessary to provide some standard evaluation metrics to determine the best algorithms for a specific application or context. Currently, however, there is no common set of parameters that can be used for this purpose. This paper explores the problem of PPDM algorithm evaluation, starting from the key goal of preserving data quality. To achieve this goal, we propose a formal definition of data quality specifically tailored for use in the context of PPDM algorithms, a set of evaluation parameters and an evaluation algorithm. The resulting evaluation core process is then presented as part of a more general three-step evaluation framework, which also takes into account other aspects of algorithm evaluation such as efficiency, scalability and level of privacy.

- Security and Privacy Issues | Pp. 418-427
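As a rough illustration of the data-quality concern raised above: one common proxy in the PPDM literature is the fraction of the original data's frequent patterns that survive sanitization. The sketch below implements that proxy; the tiny miner, thresholds, and data layout are illustrative assumptions, not the paper's formal definition or parameter set.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Tiny frequent-itemset probe (1- and 2-itemsets only)."""
    items = {i for t in transactions for i in t}
    result = {}
    for k in (1, 2):
        for cand in combinations(sorted(items), k):
            support = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
            if support >= min_support:
                result[cand] = support
    return result

def pattern_loss(original, sanitized, min_support=0.5):
    """Fraction of the original data's frequent patterns lost after
    sanitization: a rough proxy for data-quality degradation, not the
    paper's formal metric."""
    before = frequent_itemsets(original, min_support)
    after = frequent_itemsets(sanitized, min_support)
    lost = sum(1 for p in before if p not in after)
    return lost / max(len(before), 1)

# Hiding item "c" from two transactions loses half the frequent patterns.
original  = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
sanitized = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
print(pattern_loss(original, sanitized))  # 0.5
```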

Essential Patterns: A Perfect Cover of Frequent Patterns

Alain Casali; Rosine Cicchetti; Lotfi Lakhal

The extraction of frequent patterns often yields extremely voluminous results which are difficult to handle. Computing a concise representation, or cover, of the frequent pattern set is thus an interesting alternative investigated by various approaches. The work presented in this article fits into this trend. We introduce the concept of essential pattern and propose a new cover based on this concept. Such a cover makes it possible to decide whether a pattern is frequent or not, to compute its frequency and, in contrast with related work, to infer its disjunction and negation frequencies. A levelwise algorithm with a pruning step which uses the maximal frequent patterns for computing the essential patterns is proposed. Experiments show that when the number of frequent patterns is very high (strongly correlated data), the defined cover is significantly smaller than the cover considered minimal until now: the frequent closed patterns.

- Patterns | Pp. 428-437
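The disjunction and negation frequencies mentioned above are linked to ordinary (conjunctive) frequencies by inclusion-exclusion and De Morgan's laws. A minimal sketch, assuming all conjunctive frequencies are directly available (in the paper they would be recovered from the essential-pattern cover rather than stored in full):

```python
from itertools import combinations

def disjunctive_frequency(pattern, conj_freq):
    """freq(i1 or ... or ik) by inclusion-exclusion over the conjunctive
    frequencies of all non-empty subsets of `pattern`."""
    total = 0
    for k in range(1, len(pattern) + 1):
        for sub in combinations(pattern, k):
            total += (-1) ** (k + 1) * conj_freq[frozenset(sub)]
    return total

def negation_frequency(pattern, conj_freq, n_transactions):
    """Transactions containing no item of `pattern`, via De Morgan:
    not(i1) and ... and not(ik)  =  not(i1 or ... or ik)."""
    return n_transactions - disjunctive_frequency(pattern, conj_freq)

# Example over 10 transactions with hypothetical counts.
counts = {frozenset("a"): 6, frozenset("b"): 5, frozenset("ab"): 3}
print(disjunctive_frequency("ab", counts))   # 6 + 5 - 3 = 8
print(negation_frequency("ab", counts, 10))  # 10 - 8 = 2
```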

Processing Sequential Patterns in Relational Databases

Xuequn Shang; Kai-Uwe Sattler

Database integration of data mining has gained popularity and its significance is well recognized. However, the performance of SQL-based data mining is known to fall behind that of specialized implementations, owing to the prohibitive cost of extracting knowledge as well as the lack of suitable declarative query language support. Recent studies have found that, for association rule mining and sequential pattern mining with carefully tuned SQL formulations, it is possible to achieve performance comparable to systems that cache the data in files outside the DBMS. However, most of the previous pattern mining methods follow the candidate generation-and-test method, which still encounters problems when a sequence database is large and/or when the sequential patterns to be mined are numerous and long.

In this paper, we present a novel SQL-based approach that we recently proposed, called PROSPAD (PROjection Sequential PAttern Discovery). PROSPAD fundamentally differs from an Apriori-like candidate set generation-and-test approach: it is a pattern-growth approach without candidate generation. It grows longer patterns from shorter ones by successively projecting the sequential table into subsequential tables. Since a projected table for a sequential pattern contains all and only the information necessary for mining the sequential patterns that can grow from it, the size of the projected table usually shrinks quickly as mining proceeds to longer patterns. Moreover, to avoid the cost of creating and dropping temporary tables, a depth-first approach is used to facilitate the projection process.

- Patterns | Pp. 438-447
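A minimal in-memory sketch of the projection-based pattern growth described above, with each sequence simplified to a list of single-item events; the paper's actual contribution is the SQL formulation over projected tables, which is not reproduced here:

```python
def project(sequences, item):
    """Keep, for each sequence containing `item`, the suffix after its
    first occurrence: the in-memory analogue of a projected table."""
    out = []
    for seq in sequences:
        if item in seq:
            suffix = seq[seq.index(item) + 1:]
            if suffix:
                out.append(suffix)
    return out

def pattern_growth(sequences, min_support, prefix=()):
    """Depth-first pattern growth: count items, extend the prefix with
    every frequent one, and recurse on the projection. Each projection
    only shrinks, which is why mining longer patterns gets cheaper."""
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    found = []
    for item, cnt in sorted(counts.items()):
        if cnt >= min_support:
            pat = prefix + (item,)
            found.append((pat, cnt))
            found += pattern_growth(project(sequences, item), min_support, pat)
    return found

db = [list("abcb"), list("abbca"), list("bca")]
print(pattern_growth(db, min_support=2))
```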

Optimizing a Sequence of Frequent Pattern Queries

Mikołaj Morzy; Marek Wojciechowski; Maciej Zakrzewicz

Discovery of frequent patterns is a very important data mining problem with numerous applications. Frequent pattern mining is often regarded as advanced querying where a user specifies the source dataset and pattern constraints using a given constraint model. A significant amount of research on efficient processing of frequent pattern queries has been done in recent years, focusing mainly on constraint handling and reusing results of previous queries. In this paper we tackle the problem of optimizing a sequence of frequent pattern queries, submitted to the system as a batch. Our solutions are based on previously proposed techniques of reusing results of previous queries, and exploit the fact that knowing a sequence of queries a priori gives the system a chance to schedule and/or adjust the queries so that they can use results of queries executed earlier. We begin with simple query scheduling and then consider other transformations of the original batch of queries.

- Patterns | Pp. 448-457
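One simple instance of the scheduling idea, under the assumption that the batched queries differ only in dataset and minimum support: running the least restrictive query on each dataset first lets every later query on that dataset be answered by filtering a cached result instead of re-mining. The PatternQuery shape and the mine callback are hypothetical; the paper considers further transformations of the batch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternQuery:
    dataset: str
    min_support: float   # relative support threshold

def schedule(batch):
    """Per dataset, order queries from least to most restrictive."""
    return sorted(batch, key=lambda q: (q.dataset, q.min_support))

def execute(batch, mine):
    """Run a scheduled batch. `mine(dataset, min_support)` is any
    frequent-pattern miner returning {pattern: support}."""
    cache, results = {}, []
    for q in schedule(batch):
        prior = cache.get(q.dataset)
        if prior and prior[0] <= q.min_support:
            # reuse: filter the earlier, less restrictive result
            res = {p: s for p, s in prior[1].items() if s >= q.min_support}
        else:
            res = mine(q.dataset, q.min_support)
            cache[q.dataset] = (q.min_support, res)
        results.append((q, res))
    return results
```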

A General Effective Framework for Monotony and Tough Constraint Based Sequential Pattern Mining

Enhong Chen; Tongshu Li; Phillip C-Y Sheu

Sequential pattern mining has become an important data mining problem. For many practical applications, users may only be interested in those sequential patterns that satisfy constraints expressing their interest. The proposed constraints can in general be categorized into four classes, among which monotony and tough constraints are the most difficult to process. However, many of the available algorithms are designed for mining under specific classes of constraints and are thus difficult to adapt to other classes. In this paper we propose a new general framework called CBPSAlgm, based on the projection-based pattern-growth principle. Under this framework, ineffective-item pruning strategies are designed and integrated to construct effective algorithms for monotony and tough constraint based sequential pattern mining. Experimental results show that our proposed methods outperform other algorithms.

- Patterns | Pp. 458-467
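To see why monotone ("monotony") constraints are hard to prune on, consider a minimal sketch: once a pattern satisfies a monotone constraint, every super-pattern does too, so the check can be skipped below that point; but a violating pattern must still be grown, because an extension may come to satisfy it. The item prices and threshold are hypothetical, and the paper's CBPSAlgm pruning strategies are more involved than this.

```python
def satisfies_monotone(pattern, constraint, parent_satisfied):
    """Monotone constraints survive extension: once a parent pattern
    satisfies one, none of its descendants need re-checking. The
    converse fails, so violating patterns cannot be pruned from growth."""
    return parent_satisfied or constraint(pattern)

# Hypothetical item prices; sum(prices) >= 100 is a monotone constraint.
prices = {"a": 10, "b": 40, "c": 70}
min_sum = lambda pat: sum(prices[i] for i in pat) >= 100

assert not satisfies_monotone(("a", "b"), min_sum, False)       # 50 < 100
assert satisfies_monotone(("a", "b", "c"), min_sum, False)      # 120 >= 100
assert satisfies_monotone(("a", "b", "c", "b"), min_sum, True)  # inherited
```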

Hiding Classification Rules for Data Sharing with Privacy Preservation

Juggapong Natwichai; Xue Li; Maria Orlowska

In this paper, we propose a method of hiding sensitive classification rules from data mining algorithms for categorical datasets. Our approach is to reconstruct a dataset according to the classification rules that have been checked and approved by the data owner for release in data sharing. Unlike other heuristic modification approaches, our method first classifies a given dataset. Subsequently, the set of classification rules is shown to the data owner, who identifies the sensitive rules that should be hidden. After that we build a new decision tree constituted by only the non-sensitive rules. Finally, a new dataset is reconstructed. Our experiments show that the sensitive rules can be hidden completely in the reconstructed datasets, while the non-sensitive rules can still be discovered without any side effect. Moreover, our method preserves high usability of the reconstructed datasets.

- Cluster and Classification I | Pp. 468-477
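A toy analogue of the reconstruction step described above, assuming the rules have already been extracted and annotated with their coverage. The Rule shape is hypothetical, and the paper additionally rebuilds a decision tree from the non-sensitive rules before reconstructing the data:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict   # e.g. {"age": "young", "income": "high"}
    label: str         # predicted class
    coverage: int      # records the rule covered in the original data

def reconstruct(rules, sensitive):
    """Drop the rules the data owner marked as sensitive, then emit one
    record per unit of coverage for each surviving rule, so that only
    the non-sensitive rules remain discoverable."""
    dataset = []
    for rule in rules:
        if rule in sensitive:
            continue
        for _ in range(rule.coverage):
            record = dict(rule.conditions)
            record["class"] = rule.label
            dataset.append(record)
    return dataset
```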

Clustering-Based Histograms for Multi-dimensional Data

Filippo Furfaro; Giuseppe M. Mazzeo; Cristina Sirangelo

A new technique for constructing multi-dimensional histograms is proposed. This technique first invokes a density-based clustering algorithm to locate dense and sparse regions of the input data. Then the data distribution inside each of these regions is summarized by partitioning it into non-overlapping blocks laid onto a grid. The granularity of this grid is chosen depending on the underlying data distribution: the more homogeneous the data, the coarser the grid. Our approach is compared with state-of-the-art histograms on both synthetic and real-life data and is shown to be more effective.

- Cluster and Classification I | Pp. 478-487
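A small sketch of the granularity rule described above (the more homogeneous the data, the coarser the grid), using relative spread as the homogeneity measure; the specific mapping and the 2-D grid summarization are illustrative choices, not the paper's criteria:

```python
import numpy as np

def grid_resolution(values, max_cells=64):
    """Map homogeneity to granularity: low relative spread (coefficient
    of variation) yields a coarse grid, high spread a fine one."""
    mean = float(values.mean())
    cv = float(values.std()) / mean if mean else 0.0
    return int(np.clip(round(max_cells * min(cv, 1.0)), 1, max_cells))

def summarize_region(points, cells_per_dim):
    """Summarize a 2-D region by per-block counts on a regular grid."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=cells_per_dim)
    return hist

# Usage: probe a region's homogeneity, then bucket it accordingly.
pts = np.random.default_rng(0).normal(size=(500, 2))
probe = np.linalg.norm(pts, axis=1)        # any per-point statistic
blocks = summarize_region(pts, grid_resolution(probe))
```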

Weighted K-Means for Density-Biased Clustering

Kittisak Kerdprasop; Nittaya Kerdprasop; Pairote Sattayatham

Clustering is the task of grouping data based on similarity. The popular k-means algorithm groups data by first assigning all data points to their closest clusters and then recomputing the cluster means; it repeats these two steps until convergence. We propose a variation called weighted k-means to improve clustering scalability. To speed up the clustering process, we develop reservoir-biased sampling as an efficient data reduction technique, since it performs only a single scan over the data set. Our algorithm has been designed to group data drawn from mixture models. We present an experimental evaluation of the proposed method.

- Cluster and Classification I | Pp. 488-497
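A hedged sketch of the two ingredients named above: a single-scan reservoir sample (plain Algorithm R here; the paper's reservoir-biased variant adjusts the acceptance step, which is not reproduced) feeding a k-means in which each sampled point carries a weight, so that centroids become weighted means:

```python
import random
import numpy as np

def reservoir_sample(stream, k, rng=random):
    """Single-pass uniform reservoir sample of k items (Algorithm R)."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)   # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = x
    return reservoir

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """k-means where each point carries a weight (e.g. how many original
    records it stands for): centers are weighted means of their members."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the weighted mean of its members
        for c in range(k):
            member = labels == c
            if member.any():
                centers[c] = np.average(points[member], axis=0,
                                        weights=weights[member])
    return centers, labels
```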

A New Approach for Cluster Detection for Large Datasets with High Dimensionality

Matthew Gebski; Raymond K. Wong

The study of the use of computers through human-computer interfaces (HCI) is essential to improving productivity in any computer application environment. HCI analysts use a number of techniques to build models that are faithful to actual computer use. A key technique is eye tracking, in which the region of the screen being examined is recorded in order to determine key areas of use. Clustering techniques allow these regions to be grouped to help facilitate usability analysis. Historically, approaches such as the Expectation Maximization (EM) and k-means algorithms have performed well. Unfortunately, these approaches require the number of clusters to be known beforehand; in many real-world situations this hampers the effectiveness of the analysis. We propose a novel algorithm that is well suited for cluster discovery in HCI data: we do not require the number of clusters to be specified a priori, and our approach scales very well for both large datasets and high dimensionality. Experiments have demonstrated that our approach works well on real data from HCI applications.

- Cluster and Classification II | Pp. 498-508

Gene Expression Biclustering Using Random Walk Strategies

Fabrizio Angiulli; Clara Pizzuti

A biclustering algorithm, based on a greedy technique and enriched with a local search strategy to escape poor local minima, is proposed. The algorithm starts with an initial random solution and searches for a locally optimal solution by successive transformations that improve a gain function, combining the mean squared residue, the row variance, and the size of the bicluster. Different strategies to escape local minima are introduced and compared. Experimental results on yeast and lymphoma microarray data sets show that the method is able to find significant biclusters.

- Cluster and Classification II | Pp. 509-519
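The mean squared residue entering the gain function above is the standard bicluster coherence measure of Cheng and Church: for a submatrix with rows I and columns J it averages the squared residues a_ij - a_iJ - a_Ij + a_IJ, where a_iJ, a_Ij and a_IJ are the row, column and overall means. A small sketch:

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue H(I, J) of the bicluster given by index
    lists `rows` and `cols`. Lower values mean a more coherent bicluster."""
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall = sub.mean()
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

# A perfectly additive bicluster (a_ij = r_i + c_j) has residue 0.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0]])
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))  # 0.0
```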