Publications catalog - books

Data Warehousing and Knowledge Discovery: 7th International Conference, DaWaK 2005, Copenhagen, Denmark, August 22-26, 2005, Proceedings

A Min Tjoa ; Juan Trujillo (eds.)

Conference: 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Copenhagen, Denmark. August 22, 2005 - August 26, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution | Publication year | Browse | Download | Request
Not detected | 2005 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-28558-8

Electronic ISBN

978-3-540-31732-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

A Precise Blocking Method for Record Linkage

Patrick Lehti; Peter Fankhauser

Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Thus duplicate detection is usually performed in two phases, an efficient blocking phase that determines few potential candidate duplicates based on simple criteria, followed by a second phase performing an in-depth comparison of the candidate duplicates. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window with a size that depends dynamically on a maximum distance, rather than using a window with fixed size.
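The key idea in the abstract, a comparison window whose size follows a maximum distance instead of a fixed record count, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the blocking key is assumed to be numeric, and the function name `dynamic_window_blocking` and the sample records are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

R = TypeVar("R")


def dynamic_window_blocking(
    records: Sequence[R],
    key: Callable[[R], float],
    max_distance: float,
) -> List[Tuple[R, R]]:
    """Return candidate pairs whose blocking keys differ by at most max_distance.

    Assumption: the key is numeric, so a plain sort (a standard index would
    serve the same purpose) puts close keys next to each other.
    """
    ordered = sorted(records, key=key)
    pairs: List[Tuple[R, R]] = []
    for i, left in enumerate(ordered):
        j = i + 1
        # The window grows only while the next key is still within range,
        # so its size depends on the data rather than on a fixed constant.
        while j < len(ordered) and key(ordered[j]) - key(left) <= max_distance:
            pairs.append((left, ordered[j]))
            j += 1
    return pairs


# Hypothetical sample records, used only to exercise the sketch.
people = [
    {"name": "Ann Lee", "birth_year": 1970},
    {"name": "Anne Lee", "birth_year": 1971},
    {"name": "Bob Ray", "birth_year": 1985},
    {"name": "Rob Ray", "birth_year": 1985},
    {"name": "Cy Doe", "birth_year": 1999},
]
candidates = dynamic_window_blocking(
    people, key=lambda r: r["birth_year"], max_distance=1
)
candidate_names = [(a["name"], b["name"]) for a, b in candidates]
```

Only the pairs in `candidate_names` would go on to the expensive second-phase comparison; records whose keys are far apart are never compared.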

- Data Warehouse Queries and Database Processing Issues | Pp. 210-220

Flexible Query Answering in Data Cubes

Sami Naouali; Rokia Missaoui

This paper presents a new approach toward approximate query answering in data warehouses. The approach is based on an adaptation of rough set theory to multidimensional data, and offers cube exploration and mining facilities.

Since data in a data warehouse come from multiple heterogeneous sources with various degrees of reliability and data formats, users tend to be more tolerant in a data warehouse environment and prone to accept some information loss and discrepancy between actual data and manipulated ones.

The objective of this work is to integrate approximation mechanisms and associated operators into data cubes in order to produce views that can then be explored using OLAP or data mining techniques. The integration of data approximation capabilities with OLAP techniques offers additional facilities for cube exploration and analysis.

The proposed approach allows the user to work either in a restricted mode using a cube lower approximation or in a relaxed mode using a cube upper approximation. The former mode is useful when the query output is large, and hence allows the user to focus on a reduced set of fully matching tuples. The latter is useful when a query returns an empty or small answer set, and hence helps relax the query conditions so that a superset of the answer is returned. In addition, the proposed approach generates classification and characteristic rules for prediction, classification and association purposes.
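Rough set theory, on which the abstract builds, brackets a query answer between a lower approximation (groups of indistinguishable tuples that all match) and an upper approximation (groups in which at least one tuple matches). The sketch below applies these textbook definitions to flat rows. It does not reproduce the paper's cube operators, and the function name, column names and data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]


def approximations(
    rows: Sequence[Row],
    group_by: Sequence[str],
    matches: Callable[[Row], bool],
) -> Tuple[List[Row], List[Row]]:
    """Return (lower, upper) rough-set approximations of the matching rows.

    Rows with equal values on the group_by attributes are treated as
    indistinguishable (one equivalence class).
    """
    classes: Dict[tuple, List[Row]] = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in group_by)].append(row)

    lower: List[Row] = []
    upper: List[Row] = []
    for members in classes.values():
        hits = [matches(r) for r in members]
        if all(hits):   # every member satisfies the query
            lower.extend(members)
        if any(hits):   # at least one member satisfies the query
            upper.extend(members)
    return lower, upper


# Hypothetical sales facts; the query is "amount > 100".
sales = [
    {"id": 1, "region": "EU", "product": "A", "amount": 150},
    {"id": 2, "region": "EU", "product": "A", "amount": 200},
    {"id": 3, "region": "EU", "product": "B", "amount": 120},
    {"id": 4, "region": "EU", "product": "B", "amount": 40},
    {"id": 5, "region": "US", "product": "A", "amount": 10},
]
lower, upper = approximations(
    sales, group_by=["region", "product"], matches=lambda r: r["amount"] > 100
)
lower_ids = [r["id"] for r in lower]
upper_ids = [r["id"] for r in upper]
```

The lower approximation gives the smaller, fully matching answer, while the upper approximation gives the relaxed superset, which mirrors the two modes described in the abstract.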

- Data Warehouse Queries and Database Processing Issues | Pp. 221-232

An Extendible Array Based Implementation of Relational Tables for Multi Dimensional Databases

K. M. Azharul Hasan; Masayuki Kuroda; Naoki Azuma; Tatsuo Tsuji; Ken Higuchi

A new implementation scheme for relational tables in multidimensional databases is proposed and evaluated. The scheme implements a relational table by employing a multidimensional array. Using multidimensional arrays provides many advantages but also raises some problems. In our scheme, these problems are solved by an efficient record encoding scheme based on the notion of an extendible array. Our scheme exhibits good performance in space and time costs compared with conventional implementations.

- Data Warehouse Queries and Database Processing Issues | Pp. 233-242

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data

Evangelos Dellis; Bernhard Seeger; Akrivi Vlachou

In this paper, we present a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to use index-striping that vertically splits the data space into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized. Our experiments confirm the validity of our cost model and evaluate the performance of our approach.

- Data Mining Algorithms and Techniques | Pp. 243-253

A Machine Learning Approach to Identifying Database Sessions Using Unlabeled Data

Qingsong Yao; Xiangji Huang; Aijun An

In this paper, we describe a novel co-training based algorithm for identifying database user sessions from database traces. The algorithm learns to identify positive data (session boundaries) and negative data (non-session boundaries) incrementally by using two methods interactively in several iterations. In each iteration, previously identified positive and negative data are used to build better models, which in turn can label some new data and improve the performance of further iterations. We also present experimental results.

- Data Mining Algorithms and Techniques | Pp. 254-264

Hybrid System of Case-Based Reasoning and Neural Network for Symbolic Features

Kwang Hyuk Im; Tae Hyun Kim; Sang Chan Park

Case-based reasoning is one of the most frequently used tools in data mining. Though it has been proved to be useful in many problems, it is noted to have shortcomings such as feature weighting problems. In previous research, we proposed a hybrid system of case-based reasoning and neural network. In the system, the feature weights are extracted from the trained neural network, and used to improve retrieval accuracy of case-based reasoning. However, this system has worked best in domains in which all features had numeric values. When the feature values are symbolic, nearest neighbor methods typically resort to much simpler metrics, such as counting the features that match. A more sophisticated treatment of the feature space is required in symbolic domains. We propose another hybrid system of case-based reasoning and neural network, which uses value difference metric (VDM) for symbolic features. The proposed system is validated by datasets in symbolic domains.
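For readers unfamiliar with the value difference metric mentioned above, the sketch below computes the commonly used simplified form for one symbolic feature: the distance between two values is the sum, over classes, of the absolute difference of the class-conditional probabilities raised to a power `q`. The abstract does not say how the authors combine VDM with the neural-network feature weights, so that step is left out; the function names, the choice `q=2` and the toy data are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def vdm_table(
    values: Sequence[Hashable], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """Estimate P(class | value) for every value of one symbolic feature."""
    value_counts = Counter(values)
    joint_counts = Counter(zip(values, labels))
    classes = sorted(set(labels))
    return {
        v: [joint_counts[(v, c)] / n for c in classes]
        for v, n in value_counts.items()
    }


def vdm(table: Dict[Hashable, List[float]], x: Hashable, y: Hashable, q: int = 2) -> float:
    """Simplified VDM: sum over classes of |P(c|x) - P(c|y)| ** q."""
    return sum(abs(px - py) ** q for px, py in zip(table[x], table[y]))


# Toy training data for a single symbolic feature (hypothetical).
colors = ["red", "red", "blue", "blue", "green", "green"]
labels = ["yes", "yes", "yes", "no", "no", "no"]

table = vdm_table(colors, labels)
d_red_blue = vdm(table, "red", "blue")
d_red_green = vdm(table, "red", "green")
```

Unlike simple match counting, which would treat every pair of different values as equally far apart, this places "blue" closer to "red" than "green" is, because its class distribution is more similar.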

- Data Mining Algorithms and Techniques | Pp. 265-274

Spatio–temporal Rule Mining: Issues and Techniques

Győző Gidófalvi; Torben Bach Pedersen

Recent advances in communication and information technology, such as the increasing accuracy of GPS technology and the miniaturization of wireless communication devices, pave the road for Location–Based Services (LBS). To achieve high quality for such services, spatio–temporal data mining techniques are needed. In this paper, we describe experiences with spatio–temporal rule mining in a Danish data mining company. First, a number of real world spatio–temporal data sets are described, leading to a taxonomy of spatio–temporal data. Second, the paper describes a general methodology that transforms the spatio–temporal rule mining task to the traditional market basket analysis task and applies it to the described data sets, enabling traditional association rule mining methods to discover spatio–temporal rules for LBS. Finally, unique issues in spatio–temporal rule mining are identified and discussed.

- Data Mining | Pp. 275-284

Hybrid Approach to Web Content Outlier Mining Without Query Vector

Malik Agyemang; Ken Barker; Reda Alhajj

Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through the dynamic, unstructured, and ever-growing web data for outliers. This paper presents a hybrid algorithm that draws from the power of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary irrespective of the type of data used (words, n-grams, or hybrid). Also, there is remarkable improvement in recall with hybrid documents compared to using raw words and n-grams without a domain dictionary.

- Data Mining | Pp. 285-294

Incremental Data Mining Using Concurrent Online Refresh of Materialized Data Mining Views

Mikołaj Morzy; Tadeusz Morzy; Marek Wojciechowski; Maciej Zakrzewicz

Data mining is an iterative process. Users issue series of similar data mining queries, in each consecutive run slightly modifying either the definition of the mined dataset, or the parameters of the mining algorithm. This model of processing is most suitable for incremental mining algorithms that reuse the results of previous queries when answering a given query. Incremental mining algorithms require the results of previous queries to be available. One way to preserve those results is to use materialized data mining views. Materialized data mining views store the mined patterns and refresh them as the underlying data change.

Data mining and knowledge discovery often take place in a data warehouse environment. There can be many relatively small materialized data mining views defined over the data warehouse. Refreshing each materialized view separately can be expensive if the refresh process has to re-discover patterns in the original database. In this paper we present a novel approach to the refresh process for materialized data mining views. We show that the concurrent on-line refresh of a set of materialized data mining views is more efficient than the sequential refresh of individual views. We present a framework for integrating the data warehouse refresh process with the maintenance of materialized data mining views. Finally, we demonstrate the feasibility of our approach by conducting several experiments on synthetic data sets.

- Data Mining | Pp. 295-304

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases

Shichao Zhang; Xindong Wu; Jilian Zhang; Chengqi Zhang

Data mining and machine learning must confront the problem of pattern maintenance because data updating is a fundamental operation in data management. Most existing data-mining algorithms assume that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new data. While there are many efficient mining techniques for data additions to databases, in this paper, we propose a decremental algorithm for pattern discovery when data is being deleted from databases. We conduct extensive experiments for evaluating this approach, and illustrate that the proposed algorithm can well model and capture useful interactions within data when the data is decreasing.
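The general idea of handling deletions without rescanning the whole database can be illustrated with a deliberately simple scheme: keep support counts for all itemsets up to a small size, and on deletion scan only the removed transactions to decrement those counts. This is not the algorithm proposed in the paper, whose details the abstract does not give; the function names, the size limit `max_size` and the toy data are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence, Set, Tuple

Itemset = Tuple[str, ...]


def count_itemsets(transactions: Iterable[Sequence[str]], max_size: int = 2) -> Counter:
    """Count every itemset of up to max_size items in the given transactions."""
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return counts


def delete_transactions(
    counts: Counter, n_total: int, deleted: Sequence[Sequence[str]], max_size: int = 2
) -> Tuple[Counter, int]:
    """Update the counts by scanning only the deleted transactions."""
    # Counter subtraction drops itemsets whose count falls to zero.
    return counts - count_itemsets(deleted, max_size), n_total - len(deleted)


def frequent(counts: Counter, n_total: int, min_support: float) -> Set[Itemset]:
    """Itemsets whose relative support reaches min_support."""
    return {s for s, c in counts.items() if c / n_total >= min_support}


# Toy database (hypothetical) and one deleted transaction.
database = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
deleted = [["b", "c"]]

counts = count_itemsets(database)
before = frequent(counts, len(database), min_support=0.6)
counts_after, n_after = delete_transactions(counts, len(database), deleted)
after = frequent(counts_after, n_after, min_support=0.6)
```

In this toy run the deletion makes `("a", "b")` and `("a", "c")` frequent even though their absolute counts did not change, because the database shrank. Pattern maintenance under deletion therefore has to track more than the currently frequent itemsets.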

- Data Mining | Pp. 305-314