Publications catalog - books

Data Warehousing and Knowledge Discovery: 4th International Conference, DaWaK 2002 Aix-en-Provence, France, September 4-6, 2002. Proceedings

Yahiko Kambayashi; Werner Winiwarter; Masatoshi Arikawa (eds.)

Conference: 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Aix-en-Provence, France, September 4-6, 2002

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Year of publication: 2002. Available online via SpringerLink (no institutional access detected).

Information

Resource type:

books

Print ISBN

978-3-540-44123-6

Electronic ISBN

978-3-540-46145-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2002

Publication rights information

© Springer-Verlag Berlin Heidelberg 2002

Table of contents

An Algorithm for Building User-Role Profiles in a Trust Environment

Evimaria Terzi; Yuhui Zhong; Bharat Bhargava; Pankaj; Sanjay Madria

A good direction towards building secure systems that operate efficiently in large-scale environments (like the World Wide Web) is the deployment of Role Based Access Control (RBAC) methods. RBAC architectures do not deal with each user separately, but with discrete roles that users can acquire in the system. The goal of this paper is to present a classification algorithm that, during its training phase, classifies users' roles into clusters. The behavior of each user that enters the system holding a specific role is traced via audit trails, and any misbehavior is detected and reported (classification phase). This algorithm will be incorporated in the Role Server architecture, currently under development, enhancing its ability to dynamically adjust the amount of trust placed in each user and to update the corresponding role assignments.

- Web Mining and Security | Pp. 104-113
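
As an illustration of the general idea in the abstract above (cluster per-role behaviour during training, then flag sessions that deviate from their role's profile), a minimal sketch follows. All names, features and the mean-plus-two-sigma threshold are illustrative assumptions, not the authors' Role Server algorithm:

    import numpy as np

    def train_role_profiles(sessions, roles):
        """Training phase sketch: group audit-trail feature vectors by
        role and keep each role's centroid plus a distance threshold."""
        profiles = {}
        for role in set(roles):
            vecs = np.array([s for s, r in zip(sessions, roles) if r == role])
            centroid = vecs.mean(axis=0)
            dists = np.linalg.norm(vecs - centroid, axis=1)
            profiles[role] = (centroid, dists.mean() + 2 * dists.std())
        return profiles

    def is_misbehaving(session_vec, role, profiles):
        """Classification phase sketch: report a session whose behaviour
        is unusually far from the profile of the role it holds."""
        centroid, threshold = profiles[role]
        return np.linalg.norm(session_vec - centroid) > threshold

    profiles = train_role_profiles([[1, 0], [1, 1], [9, 9]],
                                   ["admin", "admin", "guest"])
    print(is_misbehaving(np.array([8, 8]), "admin", profiles))  # True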

Neural-Based Approaches for Improving the Accuracy of Decision Trees

Yue-Shi Lee; Show-Jane Yen

Decision-tree learning algorithms, e.g., C5, are good at dataset classification, but those algorithms usually work with only one attribute at a time, and the dependencies among attributes are not considered. Unfortunately, in the real world, most datasets contain attributes that are dependent. Generally, these dependencies are classified into two types: categorical-type and numerical-type dependencies. Thus, it is very important to construct a model that discovers the dependencies among attributes and improves the accuracy of decision-tree learning algorithms. A neural network model is a good choice for capturing these two types of dependencies. In this paper, we propose a Neural Decision Tree (NDT) model to deal with the problems described above. The NDT model combines neural network technologies with traditional decision-tree learning capabilities to handle complicated, real-world cases. The experimental results show that the NDT model can significantly improve the accuracy of C5.

- Data Mining Techniques | Pp. 114-123
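
One plausible way to realize the combination described above is to let a neural network learn inter-attribute dependencies and feed its hidden-layer activations to a decision tree as derived attributes. The scikit-learn sketch below illustrates that general idea only; it is not the NDT algorithm from the paper (the dataset, the layer size and the use of a CART-style tree in place of C5 are all assumptions):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Let an MLP learn dependencies among attributes.
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)

    def hidden_features(model, X):
        # First hidden-layer activations (ReLU is the sklearn default),
        # used as dependency-aware derived attributes.
        return np.maximum(0, X @ model.coefs_[0] + model.intercepts_[0])

    # Grow the tree on the original plus the derived attributes.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(np.hstack([X_tr, hidden_features(mlp, X_tr)]), y_tr)
    print(tree.score(np.hstack([X_te, hidden_features(mlp, X_te)]), y_te))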

Approximate k-Closest-Pairs with Space Filling Curves

Fabrizio Angiulli; Clara Pizzuti

An approximate algorithm to efficiently solve the k-Closest-Pairs problem in high-dimensional spaces is presented. The method is based on dimensionality reduction of the space ℝ^d through the Hilbert space filling curve and performs at most d+1 scans of the data set. After each scan, those points whose contribution to the solution has already been analyzed are eliminated from the data set. The pruning is lossless: in fact, the remaining points, together with the approximate solution found, can be used for the computation of the exact solution. Although we are able to guarantee an O(d^(1+1/t)) approximation to the solution, where t = 1,…,∞ denotes the L_t metric used, experimental results give the exact k-Closest-Pairs for all the data sets considered and show that the pruning of the search space is effective.

- Data Mining Techniques | Pp. 124-134
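
The core trick is that points that are close in space tend to be close along a space filling curve, so a single scan in curve order with a small sliding window already finds good candidate pairs. The sketch below uses a Morton (Z-order) key, which is easy to write down, where the paper uses the better locality-preserving Hilbert curve; the window width and the data are illustrative, and the paper's multi-scan lossless pruning is omitted:

    def morton_key(point, bits=10):
        """Interleave coordinate bits (Z-order). The paper uses the
        Hilbert curve; Morton order illustrates the same locality idea."""
        key = 0
        for b in range(bits):
            for d, x in enumerate(point):
                key |= ((x >> b) & 1) << (b * len(point) + d)
        return key

    def approx_closest_pairs(points, k, window=8):
        """Sort points in curve order and compare each point only with
        its next `window` successors."""
        order = sorted(points, key=morton_key)
        cand = []
        for i, p in enumerate(order):
            for q in order[i + 1:i + 1 + window]:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                cand.append((d2, p, q))
        cand.sort()
        return cand[:k]

    print(approx_closest_pairs([(1, 2), (2, 2), (40, 7), (41, 8), (5, 30)], k=2))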

Optimal Dimension Order: A Generic Technique for the Similarity Join

Christian Böhm; Florian Krebs; Hans-Peter Kriegel

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs whose distance does not exceed a given parameter ε. Although the similarity join is clearly CPU bound, most previous publications propose strategies that primarily improve the I/O performance, and only little effort has been taken to address CPU aspects. In this paper, we show that most of the computational overhead is dedicated to the final distance computations between the feature vectors. Consequently, we propose a generic technique to reduce the response time of a large number of basic algorithms for the similarity join. It is applicable to index-based join algorithms as well as to most join algorithms based on hashing or sorting. Our technique, called Optimal Dimension Order, is able to avoid and accelerate distance calculations between feature vectors by a careful ordering of the dimensions, determined according to a probability model. In the experimental evaluation, we show that our technique yields high performance improvements for various underlying similarity join algorithms such as the R-tree similarity join, the breadth-first R-tree join, the Multipage Index Join, and the ε-Grid-Order.

- Data Mining Techniques | Pp. 135-149
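
The CPU saving comes from aborting each distance computation as soon as the running sum of squared differences exceeds ε², and from visiting the dimensions in an order that triggers the abort early. A minimal sketch of that early-abort test follows; ordering dimensions by variance is an illustrative stand-in for the paper's probability model:

    import numpy as np

    def dimension_order(points):
        # High-variance dimensions first: large squared differences
        # tend to appear early, so the abort below triggers sooner.
        return np.argsort(-points.var(axis=0))

    def within_eps(p, q, eps, order):
        """Early-abort test of dist(p, q) <= eps."""
        limit, acc = eps * eps, 0.0
        for d in order:
            acc += (p[d] - q[d]) ** 2
            if acc > limit:      # this pair cannot join; stop early
                return False
        return True

    pts = np.random.rand(1000, 16)
    order = dimension_order(pts)
    join = [(i, j) for i in range(len(pts)) for j in range(i + 1, len(pts))
            if within_eps(pts[i], pts[j], 0.3, order)]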

Fast Discovery of Sequential Patterns by Memory Indexing

Ming-Yen Lin; Suh-Yin Lee

Mining sequential patterns is an important issue because of the complexity of discovering temporal patterns from sequences. Current mining approaches either require many database scans or generate several intermediate databases. As databases may fit into ever-increasing main memory, efficient memory-based discovery of sequential patterns becomes possible. In this paper, we propose a memory indexing approach for fast sequential pattern mining, named MEMISP. During the whole process, MEMISP scans the sequence database only once, to read the data sequences into memory. The find-then-index technique recursively finds the items which constitute a frequent sequence and constructs a compact index set which indicates the set of data sequences for further exploration. Through effective index advancing, fewer and shorter data sequences need to be processed as the discovered patterns get longer. Moreover, the maximum size of total memory required, which is independent of the minimum support threshold, can be estimated. The experiments indicate that MEMISP outperforms both the GSP and PrefixSpan algorithms and has good linear scalability even with very low minimum support. When the database is too large to fit in memory in a batch, we partition the database, mine patterns in each partition, and validate the true patterns in a second pass of database scanning. Therefore, MEMISP can efficiently mine databases of any size, for any minimum support value.

- Data Mining Techniques | Pp. 150-160
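
A much-simplified view of the find-then-index recursion: the database is read into memory once, the frequent items extending the current prefix are found, and an index of (sequence, position) pairs narrows the data touched at the next level. The sketch treats each sequence element as a single item and omits MEMISP's compact index sets, so it is illustrative only:

    from collections import defaultdict

    def mine(db, min_sup, prefix=(), index=None):
        if index is None:           # one scan loads db; start at position 0
            index = [(sid, 0) for sid in range(len(db))]
        # "find": first occurrence of each item after the prefix
        occ = defaultdict(dict)
        for sid, start in index:
            for pos in range(start, len(db[sid])):
                occ[db[sid][pos]].setdefault(sid, pos + 1)
        patterns = []
        for item, hits in occ.items():
            if len(hits) >= min_sup:
                pattern = prefix + (item,)
                patterns.append(pattern)
                # "index": recurse only into sequences containing the pattern
                patterns += mine(db, min_sup, pattern,
                                 [(sid, nxt) for sid, nxt in hits.items()])
        return patterns

    print(mine([list("abcb"), list("abb"), list("cab")], min_sup=2))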

Dynamic Similarity for Fields with NULL Values

Li Zhao; Sung Sam Yuan; Qi Xiao Yang; Sun Peng

One of the most important tasks in data cleansing is to deduplicate records, which requires comparing records to determine their equivalence. However, existing comparison methods, such as Record Similarity and Equational Theory, implicitly assume that the values in all fields are known; NULL values are treated as empty strings, which results in a loss of correct duplicate records. In this paper, we solve this problem by proposing a simple yet efficient method, Dynamic Similarity, which dynamically adjusts the similarity for fields with NULL values. Performance results on real and synthetic datasets show that the Dynamic Similarity method can identify more correct duplicate records without introducing more false positives, as compared with Record Similarity. Furthermore, the percentage of correct duplicate records obtained by Dynamic Similarity but not by Record Similarity increases as the number of fields with NULL values increases.

- Data Cleansing | Pp. 161-169
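
The contrast between the two policies is easy to state in code: treating NULL as an empty string drags the field similarity toward zero, while a dynamic adjustment can, for example, leave NULL fields out and renormalize the weights. The renormalization below is one plausible reading of the idea, not the paper's formula:

    import difflib

    def field_similarity(a, b):
        # Any standard string similarity would do; difflib is used
        # here only to keep the sketch self-contained.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def record_similarity(rec1, rec2, weights):
        """Skip NULL fields and redistribute their weight over the
        fields that are present in both records."""
        sims, ws = [], []
        for f1, f2, w in zip(rec1, rec2, weights):
            if f1 is None or f2 is None:
                continue
            sims.append(field_similarity(f1, f2))
            ws.append(w)
        total = sum(ws)
        return sum(s * w for s, w in zip(sims, ws)) / total if total else 0.0

    print(record_similarity(("john smith", None, "nyc"),
                            ("john smith", "42", "nyc"),
                            weights=(0.5, 0.2, 0.3)))   # 1.0, not 0.8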

Outlier Detection Using Replicator Neural Networks

Simon Hawkins; Hongxing He; Graham Williams; Rohan Baxter

We consider the problem of finding outliers in large multivariate databases. Outlier detection can be applied during the data cleansing process of data mining to identify problems with the data itself, and in fraud detection, where groups of outliers are often of particular interest. We use replicator neural networks (RNNs) to provide a measure of the outlyingness of data records. The performance of the RNNs is assessed using a ranked score measure. The effectiveness of the RNNs for outlier detection is demonstrated on two publicly available databases.

- Data Cleansing | Pp. 170-180
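
A replicator neural network is trained to reproduce its input through a narrow middle layer, and the reconstruction error then serves as the outlyingness score. The sketch below uses a plain autoencoder via scikit-learn's MLPRegressor on synthetic data; the original RNN's specific architecture (e.g., its staircase middle-layer activation) and the ranked score measure are not reproduced:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X[:5] += 6.0                          # plant five obvious outliers

    # Train the network to replicate its own input through a bottleneck.
    rnn = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=3000,
                       random_state=0).fit(X, X)

    # Records that the network reconstructs poorly score as outliers.
    scores = ((rnn.predict(X) - X) ** 2).mean(axis=1)
    print(np.argsort(scores)[-5:])        # ideally indices 0..4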

The Closed Keys Base of Frequent Itemsets

Viet Phan Luong

In data mining, concise representations are useful and necessary for apprehending the voluminous results of data processing. Recently, many different concise representations of frequent itemsets have been investigated. In this paper, we present yet another concise representation of frequent itemsets, called the closed keys representation, with the following characteristics: (i) it allows one to determine whether an itemset is frequent, and if so, the support of the itemset is immediately available; and (ii) based on the closed keys representation, it is straightforward to determine all frequent key itemsets and all frequent closed itemsets. An efficient algorithm for computing the closed keys representation is offered. We show that our approach has many advantages over existing approaches in terms of efficiency, conciseness and information inference.

- Data Cleansing | Pp. 181-190
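
Characteristic (i) follows from the fact that the support of any itemset equals the support of its closure, i.e. the largest support among its closed supersets. A minimal lookup over a toy closed-itemsets table (the key itemsets and the paper's algorithm are omitted) might look like this:

    def support(itemset, closed):
        """`closed` maps frozenset -> support. Returns the support of
        `itemset`, or None if it is not covered (not frequent)."""
        X = frozenset(itemset)
        sups = [s for C, s in closed.items() if X <= C]
        return max(sups) if sups else None

    closed = {frozenset("a"): 4, frozenset("ab"): 3, frozenset("abc"): 2}
    print(support("b", closed))     # 3: the closure of {b} is {a, b}
    print(support("bc", closed))    # 2: the closure of {b, c} is {a, b, c}
    print(support("d", closed))     # None: not frequent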

New Representation and Algorithm for Drawing RNA Structure with Pseudoknots

Yujin Lee; Wootaek Kim; Kyungsook Han

Visualization of a complex molecular structure is a valuable tool for understanding the structure. A drawing of an RNA pseudoknot structure is a graph (possibly nonplanar) with inner cycles within a pseudoknot as well as possible outer cycles formed between a pseudoknot and other structural elements. Thus, drawing RNA pseudoknot structures is computationally more difficult than depicting RNA secondary structures. Although several algorithms have been developed for drawing RNA secondary structures, none of them can be used to draw RNA pseudoknots, and visualizing RNA pseudoknots therefore relies on a significant amount of manual work. Visualizing RNA pseudoknots by hand becomes more difficult, and yields worse results, as the size and complexity of the RNA structures increase. We have developed a new representation method and an algorithm for visualizing RNA pseudoknots as a two-dimensional drawing, and we have implemented the algorithm in a program. The new representation produces uniform and clear drawings with no edge crossings for all kinds of pseudoknots, including H-type and other complex types. Given RNA structure data, we represent the whole structure as a tree rather than as a graph by hiding the inner cycles as well as the outer cycles in the nodes of the abstract tree. Once the top-level RNA structure is represented as a tree, the nodes of the tree are placed and drawn in increasing order of their depth values. Experimental results demonstrate that the algorithm generates a clear and aesthetically pleasing drawing of large-scale RNA structures containing any number of pseudoknots. This is the first algorithm for automatically drawing RNA structures with pseudoknots.

- Applications | Pp. 191-201
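
The drawing strategy sketched in the abstract (hide the cycles inside the nodes of an abstract tree, then place the nodes level by level) can be illustrated with a generic layered tree layout in which leaves get successive columns and each parent is centred over its children, which already guarantees no edge crossings. This is an illustration of that general layout idea, not the paper's coordinate rules; the node names are invented:

    def layout(tree, root, x=0, depth=0, pos=None):
        """Assign (column, row) positions: an in-order sweep gives each
        leaf its own column; parents are centred over their children
        and drawn at their depth value."""
        if pos is None:
            pos = {}
        xs = []
        for child in tree.get(root, []):
            _, x = layout(tree, child, x, depth + 1, pos)
            xs.append(pos[child][0])
        pos[root] = (sum(xs) / len(xs) if xs else x, depth)
        return pos, x + (0 if xs else 1)

    # "P" stands for a node hiding a pseudoknot's cycles; "H*" for helices.
    structure = {"S": ["H1", "P", "H2"], "P": ["H3"]}
    print(layout(structure, "S")[0])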

Boosting Naive Bayes for Claim Fraud Diagnosis

Stijn Viaene; Richard Derrig; Guido Dedene

In this paper we apply the weight of evidence reformulation of AdaBoosted naive Bayes scoring, due to Ridgeway et al. (1998), to the diagnosis of insurance claim fraud. The method effectively combines the advantages of boosting with the modelling power and representational attractiveness of the probabilistic weight of evidence scoring framework. We present the results of an experimental comparison with an emphasis on both the discriminatory power and the calibration of probability estimates. The data on which we evaluate the method consist of a representative set of closed personal injury protection automobile insurance claims from accidents that occurred in Massachusetts during 1993. The findings of the study reveal the method to be a valuable contribution to the design of effective, intelligible, accountable and efficient fraud detection support.

- Applications | Pp. 202-211
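
In the weight of evidence formulation, a naive Bayes score decomposes into additive, per-attribute log likelihood ratios, which is what makes the resulting fraud scores intelligible and accountable. The sketch below shows that un-boosted decomposition with Laplace smoothing; the AdaBoost reweighting applied in the paper is omitted, and all claim attributes are invented for illustration:

    import math
    from collections import defaultdict

    def train_woe(claims, labels, alpha=1.0):
        """Learn per-(attribute, value) weights of evidence, i.e. smoothed
        log likelihood ratios of fraud (1) versus legitimate (0)."""
        counts = {0: defaultdict(lambda: defaultdict(float)),
                  1: defaultdict(lambda: defaultdict(float))}
        n = {0: 0, 1: 0}
        for x, y in zip(claims, labels):
            n[y] += 1
            for f, v in x.items():
                counts[y][f][v] += 1
        def woe(f, v):
            p1 = (counts[1][f][v] + alpha) / (n[1] + 2 * alpha)
            p0 = (counts[0][f][v] + alpha) / (n[0] + 2 * alpha)
            return math.log(p1 / p0)
        return math.log(n[1] / n[0]), woe

    claims = [{"injury": "sprain", "lawyer": "yes"},
              {"injury": "sprain", "lawyer": "yes"},
              {"injury": "fracture", "lawyer": "no"}]
    prior, woe = train_woe(claims, labels=[1, 1, 0])
    # Posterior log odds of fraud = prior log odds + sum of evidence weights.
    print(prior + sum(woe(f, v) for f, v in claims[0].items()))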