Publications catalog - books

Data Warehousing and Knowledge Discovery: 4th International Conference, DaWaK 2002 Aix-en-Provence, France, September 4-6, 2002. Proceedings

Yahiko Kambayashi ; Werner Winiwarter ; Masatoshi Arikawa (eds.)

Conference: 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Aix-en-Provence, France. September 4, 2002 - September 6, 2002

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution | Publication year | Browse | Download | Request
Not detected | 2002 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-44123-6

Electronic ISBN

978-3-540-46145-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Table of contents

A Comparison between Query Languages for the Extraction of Association Rules

Marco Botta; Jean-Francois Boulicaut; Cyrille Masson; Rosa Meo

Recently, inductive databases (IDBs) have been proposed to address the problem of knowledge discovery from huge databases. With an IDB the user/analyst performs a set of very different operations on data using a special-purpose language, powerful enough to perform all the required manipulations, such as data preprocessing, pattern discovery and pattern post-processing. In this paper we present a comparison between query languages (MSQL, DMQL and MINE RULE) that have been proposed for association rule extraction in recent years and discuss their common features and differences. We present them using a set of examples taken from the real practice of data mining. This allows us to define language design guidelines, with particular attention to the open issues on IDBs.

- Association Rules | Pp. 1-10

Learning from Dissociations

Choh Man Teng

Standard association rules encapsulate the relationship between two sets of items: the presence of is a good predictor for the simultaneous presence of . We argue that the absence of an association rule conveys valuable information as well. Dissociation rules are rules that capture the relationship between two sets of items: the presence of is a good predictor for the presence of . We developed a representation for augmenting standard association rules with dissociation information, and presented some experimental results suggesting that such augmented rules can improve the quality of the associations obtained, both in terms of rule accuracy and in terms of using these rules as a guide to making decisions.

- Association Rules | Pp. 11-20

Mining Association Rules from XML Data

Daniele Braga; Alessandro Campi; Mika Klemettinen; PierLuca Lanzi

The eXtensible Markup Language (XML) rapidly emerged as a standard for representing and exchanging information. The fast-growing amount of available XML data creates a pressing need for languages and tools to manage collections of XML documents, as well as to out of them. Although the data mining community has not yet rushed into the use of XML, there have been some proposals to exploit XML. However, in practice these proposals mainly rely on more or less traditional relational databases with an XML interface. In this paper, we introduce association rules from native XML documents and discuss the new challenges and opportunities that this topic poses for the data mining community. More specifically, we introduce an for mining association rules. This extension is used throughout the paper to better define association rule mining within XML and to emphasize its implications in the XML context.

- Association Rules | Pp. 21-30

Estimating Joint Probabilities from Marginal Ones

Tao Li; Shenghuo Zhu; Mitsunori Ogihara; Yinhe Cheng

Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, and , to estimate joint probabilities. Both methods are based on a light-weight structure, . The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate joint probabilities of itemsets of higher cardinalities. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also demonstrate a case study of using the estimation methods in algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining/learning tasks. Experimental results show the effectiveness of the estimation methods.

- Association Rules | Pp. 31-41

Self-Tuning Clustering: An Adaptive Clustering Method for Transaction Data

Ching-Huang Yun; Kun-Ta Chuang; Ming-Syan Chen

In this paper, we devise an efficient algorithm for clustering market-basket data items. Market-basket data analysis has been well addressed in mining association rules for discovering the set of large items, which are the frequently purchased items among all transactions. In essence, clustering is meant to divide a set of data items into proper groups in such a way that items in the same group are as similar to one another as possible. In view of the nature of clustering market-basket data, we present a measurement, called the small-large (SL) ratio, which is in essence the ratio of the number of small items to that of large items. Clearly, the smaller the SL ratio of a cluster, the more similar to one another the items in the cluster are. Then, by utilizing a self-tuning technique for adaptively tuning the input and output SL ratio thresholds, we develop an efficient clustering algorithm, STC (standing for Self-Tuning Clustering), for clustering market-basket data. The objective of algorithm STC is “.” We conduct several experiments on real data and a synthetic workload for performance studies. Our experimental results show that by utilizing the self-tuning technique to adaptively minimize the input and output SL ratio thresholds, algorithm STC performs very well. Specifically, algorithm STC not only incurs an execution time that is significantly smaller than that of prior works but also leads to clustering results of very good quality.

- Clustering | Pp. 42-51
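
The small-large (SL) ratio described in the Self-Tuning Clustering abstract above can be illustrated with a minimal sketch. The abstract does not define how items are classified, so the sketch assumes that an item is "large" when the fraction of a cluster's transactions containing it reaches a threshold and "small" otherwise. The function name `sl_ratio` and the parameter `min_support` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sl_ratio(cluster, min_support):
    """Return (#small items) / (#large items) for one cluster of transactions.

    Assumption (not from the paper): an item is "large" if the fraction of
    the cluster's transactions that contain it is >= min_support, and
    "small" otherwise.
    """
    # Count, for every item, how many transactions contain it.
    counts = Counter(item for transaction in cluster for item in set(transaction))
    n = len(cluster)
    large = {item for item, c in counts.items() if c / n >= min_support}
    small = set(counts) - large
    if not large:
        # No large items at all: treat the cluster as maximally dissimilar.
        return float("inf")
    return len(small) / len(large)


# Two toy clusters of market-basket transactions.
tight = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "bread", "eggs"}]
loose = [{"milk", "tea"}, {"bread", "jam"}, {"milk", "eggs"}]

print(sl_ratio(tight, 0.6))  # lower ratio: transactions share most items
print(sl_ratio(loose, 0.6))  # higher ratio: transactions share few items
```

Under this assumed definition, the cluster whose transactions overlap heavily gets the smaller ratio, which matches the abstract's remark that a smaller SL ratio indicates more similar items.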

: An Algorithm for Non-distance Based Clustering in High Dimensional Spaces

Shenghuo Zhu; Tao Li; Mitsunori Ogihara

The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high dimensional spaces. In this paper, we propose algorithm, which is a non-distance based clustering algorithm for high dimensional spaces. Based on the maximum likelihood principle, is to optimize parameters to maximize the likelihood between data points and the model generated by the parameters. Experimental results on both synthetic data sets and a real data set show the efficiency and effectiveness of .

- Clustering | Pp. 52-62

An Efficient k-Medoids-Based Algorithm Using Previous Medoid Index, Triangular Inequality Elimination Criteria, and Partial Distance Search

Shu-Chuan Chu; John F. Roddick; J. S. Pan

Clustering in data mining is a discovery process that groups similar objects into the same cluster. Various clustering algorithms have been designed to fit various requirements and constraints of application. In this paper, we study several k-medoids-based algorithms including the and algorithms. A novel and efficient approach is proposed to reduce the computational complexity of such k-medoids-based algorithms by using previous medoid index, triangular inequality elimination criteria and partial distance search. Experimental results based on elliptic, curve and Gauss-Markov databases demonstrate that the proposed algorithm applied to may reduce the number of distance calculations by 67% to 92% while retaining the same average distance per object. In terms of the running time, the proposed algorithm may reduce computation time by 38% to 65% compared with the algorithm.

- Clustering | Pp. 63-72
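
The three speed-ups named in the entry above (previous medoid index, triangular inequality elimination, partial distance search) can be sketched for the step that assigns one object to its nearest medoid. This is a generic illustration under assumptions, not the authors' algorithm: it assumes Euclidean distance, and the names `nearest_medoid`, `partial_sq_dist`, `medoid_dists` and `prev` are hypothetical.

```python
import math


def partial_sq_dist(a, b, bound):
    """Squared Euclidean distance, abandoned early once it reaches `bound`.

    Returns None when the running sum shows that b cannot be closer
    than the current best candidate (partial distance search).
    """
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= bound:
            return None
    return total


def nearest_medoid(point, medoids, medoid_dists, prev=0):
    """Return (index, distance, skipped) for the medoid nearest to `point`.

    prev         -- index of the medoid the point was assigned to previously,
                    used as the first candidate (previous medoid index).
    medoid_dists -- precomputed pairwise distances between medoids.
    skipped      -- number of medoids ruled out without touching the point.
    """
    best, best_d = prev, math.dist(point, medoids[prev])
    skipped = 0
    for j, m in enumerate(medoids):
        if j == prev:
            continue
        # Triangle inequality: d(point, m_j) >= d(m_best, m_j) - d(point, m_best).
        # If d(m_best, m_j) >= 2 * best_d, then m_j cannot be closer than m_best.
        if medoid_dists[best][j] >= 2 * best_d:
            skipped += 1
            continue
        sq = partial_sq_dist(point, m, best_d ** 2)
        if sq is not None:
            best, best_d = j, math.sqrt(sq)
    return best, best_d, skipped


medoids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
medoid_dists = [[math.dist(a, b) for b in medoids] for a in medoids]

print(nearest_medoid((1.0, 1.0), medoids, medoid_dists, prev=0))
print(nearest_medoid((9.0, 1.0), medoids, medoid_dists, prev=0))
```

Starting from the previously assigned medoid tends to give a small initial bound, which is what lets the other two tests discard most candidates cheaply.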

A Hybrid Approach to Web Usage Mining

Søren E. Jespersen; Jesper Thorhauge; Torben Bach Pedersen

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web has become an important research area. Web usage mining, which is the main topic of this paper, focuses on knowledge discovery from the clicks in the web log for a given site (the so-called click-stream), especially on analysis of sequences of clicks. Existing techniques for analyzing click sequences have different drawbacks, namely huge storage requirements, excessive I/O cost, or scalability problems when additional information is introduced into the analysis.

In this paper we present a new approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) and Click Fact Table approaches. The approach allows for additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. The development is driven by experiences gained from industry collaboration. A prototype has been implemented and experiments are presented that show that the hybrid approach performs well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks.

- Web Mining and Security | Pp. 73-82
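
As a rough illustration of the probabilistic side of click-sequence analysis mentioned in the entry above, the sketch below estimates page-to-page transition probabilities from sessions and scores a click sequence with them. It is a simplified first-order model under assumptions, not the Hypertext Probabilistic Grammar or the hybrid approach of the paper; the function names and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed click sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count each consecutive pair of clicks within a session.
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }


def sequence_probability(probs, sequence):
    """Probability of following `sequence`, given that it starts at its first page."""
    p = 1.0
    for src, dst in zip(sequence, sequence[1:]):
        p *= probs.get(src, {}).get(dst, 0.0)  # unseen transitions get probability 0
    return p


# Hypothetical sessions: each list is the ordered pages one visitor clicked.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
]

probs = transition_probabilities(sessions)
print(probs["home"])
print(sequence_probability(probs, ["home", "products", "cart"]))
```

A model like this keeps only pairwise counts, which is compact but discards per-click attributes such as user demographics; the abstract's motivation for combining approaches is to allow that extra information back into the analysis.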

Building and Exploiting Ad Hoc Concept Hierarchies for Web Log Analysis

Carsten Pohle; Myra Spiliopoulou

Web usage mining aims at the discovery of interesting usage patterns from Web server log files. “Interestingness” relates to the business goals of the site owner. However, business goals refer to business objects rather than the page hits and script invocations recorded by the site server. Hence, Web usage analysis requires a preparatory mechanism that incorporates the business goals, the concepts reflecting them and the expert’s background knowledge on them into the mining process. To this purpose, we present a methodology and a mechanism for the establishment and exploitation of application-oriented concept hierarchies in Web usage analysis. We demonstrate our approach on a real data set and show how it can substantially improve both the search for interesting patterns by the mining algorithm and the interpretation of the mining results by the analyst.

- Web Mining and Security | Pp. 83-93

Authorization Based on Evidence and Trust

Bharat Bhargava; Yuhui Zhong

Developing authorization mechanisms for secure information access by a large community of users in an open environment is challenging. Current research efforts grant privileges to a user based on her objective properties, which are demonstrated by digital credentials (evidence). However, holding credentials is not sufficient to certify that a user is trustworthy. Therefore, we propose using the notion of trust to characterize the probability that a user will not harm an information system. We present a trust-enhanced role-mapping server, which cooperates with RBAC (Role-Based Access Control) mechanisms to jointly implement authorization based on evidence and trust. A prerequisite for this is our proposed formalization of trust and evidence.

- Web Mining and Security | Pp. 94-103