Publication catalog - books
Data Warehousing and Knowledge Discovery: 7th International Conference, DaWaK 2005, Copenhagen, Denmark, August 22-26, 2005, Proceedings
A Min Tjoa ; Juan Trujillo (eds.)
In conference: 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Copenhagen, Denmark. August 22, 2005 - August 26, 2005
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2005 | SpringerLink |
Information
Resource type:
books
Print ISBN
978-3-540-28558-8
Electronic ISBN
978-3-540-31732-6
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin Heidelberg 2005
Table of contents
doi: 10.1007/11546849_21
A Precise Blocking Method for Record Linkage
Patrick Lehti; Peter Fankhauser
Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Thus duplicate detection is usually performed in two phases, an efficient blocking phase that determines few potential candidate duplicates based on simple criteria, followed by a second phase performing an in-depth comparison of the candidate duplicates. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window with a size that depends dynamically on a maximum distance, rather than using a window with fixed size.
- Data Warehouse Queries and Database Processing Issues | Pp. 210-220
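The dynamic-window idea in the Lehti and Fankhauser abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the numeric blocking key, the function name `dynamic_window_pairs`, and the sample records are assumptions made for illustration, and a sorted list stands in for a standard index.

```python
from bisect import bisect_right


def dynamic_window_pairs(records, max_distance):
    """Return candidate pairs whose blocking keys differ by at most max_distance.

    records: list of (record_id, key) tuples. The key is numeric here, which is
    an assumption of this sketch; real blocking keys are often strings.
    """
    ordered = sorted(records, key=lambda r: r[1])  # stands in for a standard sorted index
    keys = [key for _, key in ordered]
    pairs = []
    for i, (rid, key) in enumerate(ordered):
        # The window ends at the last key within max_distance of the current key,
        # so its size varies per record instead of being a fixed neighbour count.
        end = bisect_right(keys, key + max_distance)
        for j in range(i + 1, end):
            pairs.append((rid, ordered[j][0]))
    return pairs


records = [("a", 1.0), ("b", 1.2), ("c", 1.3), ("d", 5.0), ("e", 5.1)]
candidates = dynamic_window_pairs(records, max_distance=0.25)
```

With a fixed window of size 2, the distant records `c` and `d` would also be paired; the distance-bounded window avoids that comparison.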
doi: 10.1007/11546849_22
Flexible Query Answering in Data Cubes
Sami Naouali; Rokia Missaoui
This paper presents a new approach toward approximate query answering in data warehouses. The approach is based on an adaptation of rough set theory to multidimensional data, and offers cube exploration and mining facilities.
Since data in a data warehouse come from multiple heterogeneous sources with various degrees of reliability and data formats, users tend to be more tolerant in a data warehouse environment and prone to accept some information loss and discrepancy between actual data and manipulated ones.
The objective of this work is to integrate approximation mechanisms and associated operators into data cubes in order to produce views that can then be explored using OLAP or data mining techniques. The integration of data approximation capabilities with OLAP techniques offers additional facilities for cube exploration and analysis.
The proposed approach allows the user to work either in a mode using a cube's lower approximation or in a mode using its upper approximation. The former mode is useful when the query output is large, and hence allows the user to focus on a reduced set of fully matching tuples. The latter is useful when a query returns an empty or small answer set, and hence helps relax the query conditions so that a superset of the answer is returned. In addition, the proposed approach generates classification and characteristic rules for prediction, classification and association purposes.
- Data Warehouse Queries and Database Processing Issues | Pp. 221-232
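As a rough illustration of the rough-set notions behind the Naouali and Missaoui abstract above, the following sketch computes the standard lower and upper approximations of a query answer over a handful of tuples. The function name `approximations`, the sample rows, and the choice of grouping attributes are assumptions for illustration; the paper's cube operators are not reproduced here.

```python
from collections import defaultdict


def approximations(rows, group_attrs, matches):
    """Return (lower, upper) sets of row ids for the rows where matches(row) is true.

    Rows that agree on group_attrs are treated as indiscernible.
    """
    classes = defaultdict(set)
    for row_id, row in rows.items():
        classes[tuple(row[a] for a in group_attrs)].add(row_id)
    target = {row_id for row_id, row in rows.items() if matches(row)}
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # every indiscernible row matches: certain answers
            lower |= members
        if members & target:    # at least one row matches: possible answers
            upper |= members
    return lower, upper


rows = {
    1: {"region": "north", "product": "tv", "sales": 120},
    2: {"region": "north", "product": "tv", "sales": 80},
    3: {"region": "south", "product": "tv", "sales": 150},
    4: {"region": "south", "product": "radio", "sales": 30},
}
lower, upper = approximations(rows, ["region", "product"], lambda r: r["sales"] >= 100)
```

Here the lower approximation is the smaller, fully matching set, and the upper approximation is a superset of the exact answer.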
doi: 10.1007/11546849_23
An Extendible Array Based Implementation of Relational Tables for Multi Dimensional Databases
K. M. Azharul Hasan; Masayuki Kuroda; Naoki Azuma; Tatsuo Tsuji; Ken Higuchi
A new implementation scheme for relational tables in multidimensional databases is proposed and evaluated. The scheme implements a relational table by employing a multidimensional array. Using multidimensional arrays provides many advantages but also raises some problems. In our scheme, these problems are solved by an efficient record encoding scheme based on the notion of an extendible array. Our scheme exhibits good performance in space and time costs compared with a conventional implementation.
- Data Warehouse Queries and Database Processing Issues | Pp. 233-242
doi: 10.1007/11546849_24
Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data
Evangelos Dellis; Bernhard Seeger; Akrivi Vlachou
In this paper, we present a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to use index-striping that vertically splits the data space into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized. Our experiments confirm the validity of our cost model and evaluate the performance of our approach.
- Data Mining Algorithms and Techniques | Pp. 243-253
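The merging step described in the Dellis, Seeger and Vlachou abstract above can be approximated with a threshold-style merge over per-subspace rankings. This is a hedged sketch rather than the paper's algorithm: sorted lists stand in for incremental nearest-neighbor cursors on real index structures, and the function name, data, and stripe assignment are hypothetical.

```python
def nearest_neighbor_striped(points, query, stripes):
    """Find the exact nearest neighbour by merging per-stripe rankings.

    points: dict mapping id -> tuple of coordinates
    stripes: list of lists of dimension indexes that partition all dimensions
    Returns (best_id, squared_distance, number_of_points_examined).
    """
    def partial(pid, dims):
        return sum((points[pid][d] - query[d]) ** 2 for d in dims)

    # Each sorted list stands in for an incremental NN cursor on one subspace index.
    rankings = [
        sorted(points, key=lambda pid, dims=dims: partial(pid, dims))
        for dims in stripes
    ]
    best_id, best_dist = None, float("inf")
    seen = set()
    for depth in range(len(points)):
        threshold = 0.0
        for dims, ranking in zip(stripes, rankings):
            pid = ranking[depth]
            threshold += partial(pid, dims)
            if pid not in seen:
                seen.add(pid)
                full = sum(partial(pid, s) for s in stripes)
                if full < best_dist:
                    best_id, best_dist = pid, full
        # Any unseen point is at least as far as the sum of the frontier distances.
        if best_dist <= threshold:
            break
    return best_id, best_dist, len(seen)


points = {
    "p1": (0, 0, 0, 0),
    "p2": (1, 1, 1, 1),
    "p3": (5, 5, 5, 5),
    "p4": (1, 0, 9, 9),
}
result = nearest_neighbor_striped(points, query=(1, 1, 1, 2), stripes=[[0, 1], [2, 3]])
```

In this example the search stops after examining one of the four points, because the best full distance already equals the threshold.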
doi: 10.1007/11546849_25
A Machine Learning Approach to Identifying Database Sessions Using Unlabeled Data
Qingsong Yao; Xiangji Huang; Aijun An
In this paper, we describe a novel co-training based algorithm for identifying database user sessions from database traces. The algorithm learns to identify positive data (session boundaries) and negative data (non-session boundaries) incrementally by using two methods interactively in several iterations. In each iteration, previously identified positive and negative data are used to build better models, which in turn can label some new data and improve performance of further iterations. We also present experimental results.
- Data Mining Algorithms and Techniques | Pp. 254-264
doi: 10.1007/11546849_26
Hybrid System of Case-Based Reasoning and Neural Network for Symbolic Features
Kwang Hyuk Im; Tae Hyun Kim; Sang Chan Park
Case-based reasoning is one of the most frequently used tools in data mining. Though it has been proved to be useful in many problems, it is noted to have shortcomings such as feature weighting problems. In previous research, we proposed a hybrid system of case-based reasoning and neural network. In the system, the feature weights are extracted from the trained neural network, and used to improve retrieval accuracy of case-based reasoning. However, this system has worked best in domains in which all features had numeric values. When the feature values are symbolic, nearest neighbor methods typically resort to much simpler metrics, such as counting the features that match. A more sophisticated treatment of the feature space is required in symbolic domains. We propose another hybrid system of case-based reasoning and neural network, which uses the value difference metric (VDM) for symbolic features. The proposed system is validated by datasets in symbolic domains.
- Data Mining Algorithms and Techniques | Pp. 265-274
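The value difference metric named in the Im, Kim and Park abstract above compares two symbolic values by how differently they distribute over the classes. The sketch below shows a common simplified form of VDM, without the neural-network feature weights the paper combines it with; the function name `build_vdm`, the exponent parameter `q`, and the sample cases are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_vdm(cases, labels, q=1):
    """Return d(feature_index, u, v): sum over classes of |P(c|u) - P(c|v)|**q."""
    value_counts = defaultdict(Counter)  # (feature, value) -> class label counts
    for case, label in zip(cases, labels):
        for f, value in enumerate(case):
            value_counts[(f, value)][label] += 1
    classes = set(labels)

    def distance(f, u, v):
        cu = value_counts.get((f, u), Counter())
        cv = value_counts.get((f, v), Counter())
        nu, nv = sum(cu.values()), sum(cv.values())
        if nu == 0 or nv == 0:
            raise ValueError("value not seen in the training cases")
        return sum(abs(cu[c] / nu - cv[c] / nv) ** q for c in classes)

    return distance


cases = [("red", "round"), ("red", "square"), ("blue", "round"), ("green", "square")]
labels = ["yes", "yes", "no", "no"]
vdm = build_vdm(cases, labels)
```

Unlike a match count, this treats `blue` and `green` as identical (both always predict `no`) while `red` and `blue` are maximally different.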
doi: 10.1007/11546849_27
Spatio–temporal Rule Mining: Issues and Techniques
Győző Gidófalvi; Torben Bach Pedersen
Recent advances in communication and information technology, such as the increasing accuracy of GPS technology and the miniaturization of wireless communication devices, pave the road for Location–Based Services (LBS). To achieve high quality for such services, spatio–temporal data mining techniques are needed. In this paper, we describe experiences with spatio–temporal rule mining in a Danish data mining company. First, a number of real world spatio–temporal data sets are described, leading to a taxonomy of spatio–temporal data. Second, the paper describes a general methodology that transforms the spatio–temporal rule mining task to the traditional market basket analysis task and applies it to the described data sets, enabling traditional association rule mining methods to discover spatio–temporal rules for LBS. Finally, unique issues in spatio–temporal rule mining are identified and discussed.
- Data Mining | Pp. 275-284
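One way to picture the transformation mentioned in the Gidófalvi and Pedersen abstract above is to discretize each observation into a symbolic item and collect the items per moving object into a basket. The abstract does not specify the methodology, so everything here is an assumption for illustration: the grid cells, the time slots, the item naming, and the function name `to_baskets`.

```python
from collections import defaultdict


def to_baskets(observations, cell_size, hours_per_slot):
    """Group discretised (region, time-slot) items into one basket per object.

    observations: iterable of (object_id, x, y, hour_of_day) tuples.
    """
    baskets = defaultdict(set)
    for obj_id, x, y, hour in observations:
        cell_x, cell_y = int(x // cell_size), int(y // cell_size)
        slot = int(hour // hours_per_slot)
        baskets[obj_id].add(f"cell{cell_x}_{cell_y}@slot{slot}")
    return dict(baskets)


observations = [
    ("u1", 120.0, 80.0, 8),
    ("u1", 450.0, 90.0, 9),
    ("u2", 130.0, 70.0, 8),
    ("u2", 900.0, 900.0, 20),
]
baskets = to_baskets(observations, cell_size=500.0, hours_per_slot=6)
```

The resulting baskets can be passed to any conventional association rule miner, which is the point of the reduction.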
doi: 10.1007/11546849_28
Hybrid Approach to Web Content Outlier Mining Without Query Vector
Malik Agyemang; Ken Barker; Reda Alhajj
Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through the dynamic, unstructured, and ever-growing web data for outliers. This paper presents a hybrid algorithm that draws from the power of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary irrespective of the type of data used (words, n-grams, or hybrid). Also, there is remarkable improvement in recall with hybrid documents compared to using raw words and n-grams without a domain dictionary.
- Data Mining | Pp. 285-294
doi: 10.1007/11546849_29
Incremental Data Mining Using Concurrent Online Refresh of Materialized Data Mining Views
Mikołaj Morzy; Tadeusz Morzy; Marek Wojciechowski; Maciej Zakrzewicz
Data mining is an iterative process. Users issue series of similar data mining queries, in each consecutive run slightly modifying either the definition of the mined dataset, or the parameters of the mining algorithm. This model of processing is most suitable for incremental mining algorithms that reuse the results of previous queries when answering a given query. Incremental mining algorithms require the results of previous queries to be available. One way to preserve those results is to use materialized data mining views. Materialized data mining views store the mined patterns and refresh them as the underlying data change.
Data mining and knowledge discovery often take place in a data warehouse environment. There can be many relatively small materialized data mining views defined over the data warehouse. Separate refresh of each materialized view can be expensive if the refresh process has to re-discover patterns in the original database. In this paper we present a novel approach to the materialized data mining view refresh process. We show that the concurrent on-line refresh of a set of materialized data mining views is more efficient than the sequential refresh of individual views. We present a framework for integrating the data warehouse refresh process with the maintenance of materialized data mining views. Finally, we demonstrate the feasibility of our approach by conducting several experiments on synthetic data sets.
- Data Mining | Pp. 295-304
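The benefit of concurrent refresh claimed in the Morzy et al. abstract above comes from sharing one scan of the data among several views. The following is a deliberately simplified sketch of that idea, not the authors' framework: each hypothetical view has a row filter and a fixed list of itemsets whose supports it stores, whereas a real refresh would also have to discover new patterns.

```python
def refresh_concurrently(transactions, views):
    """Refresh the stored support counts of every view in a single pass."""
    counts = {
        name: {itemset: 0 for itemset in view["itemsets"]}
        for name, view in views.items()
    }
    for tx in transactions:  # one scan shared by all views
        for name, view in views.items():
            if view["selects"](tx):
                for itemset in view["itemsets"]:
                    if itemset <= tx["items"]:
                        counts[name][itemset] += 1
    return counts


transactions = [
    {"day": 1, "items": frozenset({"a", "b"})},
    {"day": 2, "items": frozenset({"a", "c"})},
    {"day": 3, "items": frozenset({"a", "b", "c"})},
]
views = {
    "first_two_days": {
        "selects": lambda tx: tx["day"] <= 2,
        "itemsets": [frozenset({"a"}), frozenset({"a", "b"})],
    },
    "all_days": {
        "selects": lambda tx: True,
        "itemsets": [frozenset({"a", "b"})],
    },
}
counts = refresh_concurrently(transactions, views)
```

A sequential refresh would read the transactions once per view; here the number of scans stays at one regardless of how many views are defined.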
doi: 10.1007/11546849_30
A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases
Shichao Zhang; Xindong Wu; Jilian Zhang; Chengqi Zhang
Data mining and machine learning must confront the problem of pattern maintenance because data updating is a fundamental operation in data management. Most existing data-mining algorithms assume that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new data. While there are many efficient mining techniques for data additions to databases, in this paper, we propose a decremental algorithm for pattern discovery when data is being deleted from databases. We conduct extensive experiments for evaluating this approach, and illustrate that the proposed algorithm can well model and capture useful interactions within data when the data is decreasing.
- Data Mining | Pp. 305-314
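To make the decremental setting in the Zhang et al. abstract above concrete, the sketch below keeps support counts and subtracts the contribution of deleted transactions instead of rescanning the remaining data. It is a naive illustration, not the proposed algorithm: storing counts for every itemset up to `max_size` is an assumption that would not scale, and all function names and data are hypothetical.

```python
from itertools import combinations


def count_itemsets(transactions, max_size):
    """Count every itemset of up to max_size items (naive, for illustration)."""
    counts = {}
    for tx in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(tx), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return counts


def delete_and_update(counts, deleted, max_size):
    """Subtract the deleted transactions' contribution without a full rescan."""
    for tx in deleted:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(tx), size):
                counts[itemset] -= 1
                if counts[itemset] == 0:
                    del counts[itemset]
    return counts


def frequent(counts, n_transactions, min_support):
    return {i for i, c in counts.items() if c / n_transactions >= min_support}


database = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
deleted = [{"a", "c"}, {"b"}]
remaining = [{"a", "b"}, {"a", "b", "c"}]

counts = count_itemsets(database, max_size=2)
before = frequent(counts, len(database), min_support=0.75)
counts = delete_and_update(counts, deleted, max_size=2)
after = frequent(counts, len(remaining), min_support=0.75)
```

The example also shows why deletions need maintenance at all: the itemset `("a", "b")` is infrequent before the deletion and becomes frequent afterwards, because the database shrinks while its count stays the same.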