Publications catalog - books
Foundations and Advances in Data Mining
Wesley Chu ; Tsau Young Lin (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2005 | SpringerLink |
Information
Resource type:
books
Print ISBN
978-3-540-25057-9
Electronic ISBN
978-3-540-32393-8
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin/Heidelberg 2005
Table of contents
doi: 10.1007/11362197_1
The Mathematics of Learning: Dealing with Data
T. Poggio; S. Smale
Learning is key to developing systems tailored to a broad range of data analysis and information extraction tasks. We outline the mathematical foundations of learning theory and describe one of its key algorithms.
Pp. 1-19
doi: 10.1007/11362197_2
Logical Regression Analysis: From Mathematical Formulas to Linguistic Rules
H. Tsukimoto
Data mining means the discovery of knowledge from (a large amount of) data, and so data mining should provide not only predictions but also knowledge such as rules that are comprehensible to humans. Data mining techniques should satisfy the two requirements, that is, accurate predictions and comprehensible rules.
Pp. 21-61
doi: 10.1007/11362197_3
A Feature/Attribute Theory for Association Mining and Constructing the Complete Feature Set
T.Y. Lin
A correct selection of features (attributes) is vital in data mining. To this end, the complete set of features is constructed. Here are some important results: (1) Isomorphic relational tables have isomorphic patterns. Such an isomorphism classifies relational tables into isomorphic classes. (2) A unique canonical model for each isomorphic class is constructed; the canonical model is the bitmap index or one of its variants. (3) All possible features (attributes) are generated in the canonical model. (4) Through the isomorphism theorem, all un-interpreted features of any table can be obtained.
Pp. 63-78
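The abstract above names bitmap indexes as the canonical model. As a rough illustration only (not the chapter's construction), a bitmap index for one column of a small relational table might look like the following sketch. The function name `bitmap_index` and the sample rows are hypothetical.

```python
from collections import defaultdict

def bitmap_index(rows, column):
    """Map each distinct value of `column` to a bit vector over row positions."""
    index = defaultdict(lambda: [0] * len(rows))
    for position, row in enumerate(rows):
        index[row[column]][position] = 1
    return dict(index)

# Hypothetical three-row table used only for illustration.
rows = [
    {"city": "LA", "item": "pen"},
    {"city": "NY", "item": "ink"},
    {"city": "LA", "item": "ink"},
]
city_bits = bitmap_index(rows, "city")
item_bits = bitmap_index(rows, "item")

# Support of the pair (city=LA, item=ink): bitwise AND, then count the ones.
support = sum(a & b for a, b in zip(city_bits["LA"], item_bits["ink"]))
```

Because every attribute value becomes a bit vector, the support of a combination of values reduces to a bitwise AND followed by a count.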
doi: 10.1007/11362197_4
A New Theoretical Framework for K-Means-Type Clustering
J. Peng; Y. Xia
One of the fundamental clustering problems is to assign points to clusters based on the minimal sum-of-squares criterion (MSSC), a problem known to be NP-hard. In this paper, by using matrix arguments, we first model MSSC as a so-called 0-1 semidefinite programming (SDP) problem. The classical K-means algorithm can be interpreted as a special heuristic for the underlying 0-1 SDP. Moreover, the 0-1 SDP model can be further approximated by relaxed, polynomially solvable linear and semidefinite programs. This opens new avenues for solving MSSC. The 0-1 SDP model can be applied not only to MSSC but also to other clustering scenarios. In particular, we show that the recently proposed normalized k-cut and spectral clustering can also be embedded into the 0-1 SDP model in various kernel spaces.
Pp. 79-96
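For orientation, the classical K-means heuristic mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard alternating procedure for the sum-of-squares objective, not the chapter's 0-1 SDP model. Taking the first `k` points as initial centers is an assumption made here to keep the example deterministic.

```python
def sum_of_squares(points, centers, labels):
    """MSSC objective: total squared distance from each point to its center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )

def k_means(points, k, iterations=100):
    centers = [list(p) for p in points[:k]]  # assumption: naive initialization
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return centers, labels

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers, labels = k_means(points, k=2)
objective = sum_of_squares(points, centers, labels)
```

Each step cannot increase the objective, which is why the procedure converges, but only to a local optimum that depends on the initialization.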
doi: 10.1007/11362197_5
Clustering Via Decision Tree Construction
B. Liu; Y. Xia; P.S. Yu
Clustering is an exploratory data analysis task. It aims to find the intrinsic structure of data by organizing data objects into similarity groups or clusters. It is often called unsupervised learning because no class labels denoting an a priori partition of the objects are given. This is in contrast with supervised learning (e.g., classification) for which the data objects are already labeled with known classes. Past research in clustering has produced many algorithms. However, these algorithms have some shortcomings. In this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster (or dense) regions and empty (or sparse) regions (which produce outliers and anomalies). We achieve this by introducing virtual data points into the space and then applying a modified decision tree algorithm for the purpose. The technique is able to find “natural” clusters in large high dimensional spaces efficiently. It is suitable for clustering in the full dimensional space as well as in subspaces. It also provides easily comprehensible descriptions of the resulting clusters. Experiments on both synthetic data and real-life data show that the technique is effective and also scales well for large high dimensional datasets.
Pp. 97-124
doi: 10.1007/11362197_6
Incremental Mining on Association Rules
W.-G. Teng; M.-S. Chen
The discovery of association rules is known to be useful in selective marketing, decision analysis, and business management. An important application area of association rule mining is market basket analysis, which studies the buying behavior of customers by searching for sets of items that are frequently purchased together. With the increasing use of record-based databases to which data is continuously added, recent applications have called for incremental mining. In dynamic transaction databases, new transactions are appended and obsolete transactions are discarded as time advances. Several research works have developed feasible algorithms for deriving precise association rules efficiently and effectively in such dynamic databases. On the other hand, approaches that generate approximations from data streams have recently received a significant amount of research attention. In this chapter, previously proposed algorithms for each scheme are explored with examples that illustrate their concepts and techniques.
Pp. 125-162
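The dynamic setting described in the abstract above, where new transactions are appended and obsolete ones are discarded, can be illustrated with a small sliding-window counter. This is a simplified sketch under stated assumptions, not one of the algorithms surveyed in the chapter. The class name `SlidingWindowCounter`, the window size, and the cap on itemset length are all hypothetical choices.

```python
from collections import Counter, deque
from itertools import combinations

class SlidingWindowCounter:
    """Keep itemset counts for the most recent `window_size` transactions."""

    def __init__(self, window_size, max_len=2):
        self.window = deque()
        self.window_size = window_size
        self.max_len = max_len  # assumption: count itemsets up to this size only
        self.counts = Counter()

    def _itemsets(self, transaction):
        items = sorted(set(transaction))
        for size in range(1, self.max_len + 1):
            yield from combinations(items, size)

    def add(self, transaction):
        # Append the new transaction and add its itemsets to the counts.
        self.window.append(transaction)
        self.counts.update(self._itemsets(transaction))
        # Discard the oldest transaction once the window overflows.
        if len(self.window) > self.window_size:
            obsolete = self.window.popleft()
            self.counts.subtract(self._itemsets(obsolete))

    def frequent(self, min_support):
        return {s: c for s, c in self.counts.items() if c >= min_support}

counter = SlidingWindowCounter(window_size=3)
for basket in [["bread", "milk"], ["bread", "beer"],
               ["bread", "milk", "beer"], ["milk", "beer"]]:
    counter.add(basket)
frequent_sets = counter.frequent(min_support=2)
```

The counts are updated incrementally on each arrival, so the database is never rescanned. Enumerating all itemsets per transaction is only practical for short transactions and a small `max_len`.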
doi: 10.1007/11362197_7
Mining Association Rules from Tabular Data Guided by Maximal Frequent Itemsets
Q. Zou; Y. Chen; W.W. Chu; X. Lu
We propose the use of maximal frequent itemsets (MFIs) to derive association rules from tabular datasets. We first present an efficient method to derive MFIs directly from tabular data using the information from previous searches, known as tail information. Then we utilize the tabular format to derive MFIs, which can reduce the search space and the time needed for support counting. Tabular data allows us to use a spreadsheet as a user interface. The spreadsheet functions enable users to conveniently search and sort rules. To effectively present large numbers of rules, we organize rules into hierarchical trees from general to specific on the spreadsheet. Experimental results reveal that our proposed method of using tail information to generate MFIs yields significant improvements over conventional methods. Using inverted indices to compute supports for itemsets is faster than the hash tree counting method. We have applied the proposed technique to a set of tabular data that was collected from surgery outcomes and that contains a large number of dependent attributes. The technique was able to derive rules that assist physicians in their clinical decisions.
Pp. 163-181
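Two ideas from the abstract above, computing supports with inverted indices and keeping only maximal frequent itemsets, can be illustrated as follows. This is a brute-force sketch for small inputs; it does not implement the chapter's tail-information search. The function names and the sample transactions are hypothetical.

```python
from itertools import combinations

def inverted_index(transactions):
    """Map each item to the set of transaction ids that contain it."""
    index = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            index.setdefault(item, set()).add(tid)
    return index

def support(itemset, index):
    """Support = size of the intersection of the items' transaction-id sets."""
    tid_sets = [index.get(item, set()) for item in itemset]
    return len(set.intersection(*tid_sets))

def maximal_frequent_itemsets(transactions, min_support):
    index = inverted_index(transactions)
    items = sorted(index)
    # Exhaustive enumeration: exponential in the number of items.
    frequent = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(candidate, index) >= min_support
    ]
    # An itemset is maximal if no frequent proper superset exists.
    return [s for s in frequent if not any(s < other for other in frequent)]

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
mfis = maximal_frequent_itemsets(transactions, min_support=3)
```

Every frequent itemset is a subset of some MFI, so the MFIs form a compact summary from which rules can be derived.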
doi: 10.1007/11362197_8
Sequential Pattern Mining by Pattern-Growth: Principles and Extensions
J. Han; J. Pei; X. Yan
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP [30], a horizontal format-based sequential pattern mining method, and (ii) SPADE [36], a vertical format-based method; and (2) a pattern-growth method, represented by PrefixSpan [26] and its further extensions, such as CloSpan for mining closed sequential patterns [35].
Pp. 183-220
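The pattern-growth idea behind PrefixSpan, named in the abstract above, can be sketched for the simplified case where every sequence element is a single item rather than an itemset. That restriction is an assumption made here for brevity, and the function name `prefix_span` is illustrative rather than the authors' code.

```python
def prefix_span(sequences, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns."""
    results = []

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, counts[item]))
            # Project: keep the suffix after the first occurrence of `item`.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            mine(pattern, suffixes)

    mine([], sequences)
    return results

sequences = [list("abc"), list("acb"), list("ab")]
patterns = prefix_span(sequences, min_support=2)
```

Each recursive call works only on the suffixes that follow the current prefix, so patterns are grown from frequent items without generating and testing candidate subsequences.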
doi: 10.1007/11362197_9
Web Page Classification
B. Choi; Z. Yao
This chapter describes systems that automatically classify web pages into meaningful categories. It first defines two types of web page classification: subject-based and genre-based. It then describes the state-of-the-art techniques and subsystems used to build automatic web page classification systems, including web page representations, dimensionality reduction, web page classifiers, and the evaluation of web page classifiers. Such systems are essential tools for Web mining and for the future of the Semantic Web.
Pp. 221-274
doi: 10.1007/11362197_10
Web Mining – Concepts, Applications and Research Directions
J. Srivastava; P. Desikan; V. Kumar
From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining, i.e. the application of data mining techniques to extract knowledge from Web content, structure, and usage, is the collection of technologies to fulfill this potential. Interest in Web mining has grown rapidly in its short history, both in the research and practitioner communities. This paper provides a brief overview of the accomplishments of the field, both in terms of technologies and applications, and outlines key future research directions.
Pp. 275-307