Publication catalog – books
Foundations of Data Mining and Knowledge Discovery
Tsau Young Lin ; Setsuo Ohsuga ; Churn-Jung Liau ; Xiaohua Hu ; Shusaku Tsumoto (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Theory of Computation; Appl. Mathematics/Computational Methods of Engineering; Artificial Intelligence (incl. Robotics)
Availability
Detected institution | Publication year | Browse | Download | Request
---|---|---|---|---
Not detected | 2005 | SpringerLink | |
Information
Resource type:
books
Print ISBN
978-3-540-26257-2
Electronic ISBN
978-3-540-32408-9
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin/Heidelberg 2005
Table of contents
doi: 10.1007/11498186_1
Knowledge Discovery as Translation
Setsuo Ohsuga
This paper presents a view of discovery as a translation from non-symbolic to symbolic representation. First, the relation between symbolic and non-symbolic processing is discussed. An intermediate form is introduced to represent both in the same framework and to clarify the difference between the two. Symbolic representation is characterized by the elimination of quantitative measures and the suppression of mutual dependency between elements; non-symbolic processing has the opposite characteristics, so there is a large gap between them. This paper introduces a quantitative measure into the syntax of predicates, which makes it possible to measure the distance between symbolic and non-symbolic representations quantitatively. Consequently, even though there is no general method of translation from non-symbolic to symbolic representation, translation is possible whenever some symbolic representation lies at zero or small distance from the given non-symbolic representation; this amounts to discovering general rules from data. On this basis, the paper presents a way to discover implicative predicates in databases. Finally, two related issues are discussed: how hypotheses are generated, and the relation between data mining and discovery.
Pp. 1-19
doi: 10.1007/11498186_2
Mathematical Foundation of Association Rules – Mining Associations by Solving Integral Linear Inequalities
T.Y. Lin
Informally, data mining is the derivation of patterns from data. The mathematical mechanics of association mining (AM) is carefully examined from this point of view. The data is a table of symbols, and a pattern is any algebraic/logical expression derived from this table that has high support. Based on this view, we have the following theorem: a pattern (generalized association) of a relational table can be found by solving a finite set of linear inequalities within time polynomial in the table size. The main results are derived from a few key notions observed previously: (1) Isomorphism: isomorphic relations have isomorphic patterns. (2) Canonical representations: in each isomorphism class there is a unique bitmap-based model, called the granular data model.
Pp. 21-42
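The bitmap-based "granular data model" named in the abstract can be illustrated with a minimal sketch (an assumption-laden toy, not the chapter's actual construction): each attribute value is mapped to the set of row indices where it occurs, and the support of a conjunction of values is the size of the intersection of those sets.

```python
# Hedged sketch of the granular (bitmap-based) view of a relational table:
# each attribute value is represented by the "granule" of row indices that
# hold it; the support of a conjunctive pattern is the size of the
# intersection of the granules involved.

def granules(table, attr):
    """Map each value of column `attr` to the set of row indices holding it."""
    g = {}
    for i, row in enumerate(table):
        g.setdefault(row[attr], set()).add(i)
    return g

# A toy relational table (column names are illustrative only).
table = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
]

outlook = granules(table, "outlook")
play = granules(table, "play")

# Support of the pattern (outlook=sunny AND play=no) is the intersection size.
support = len(outlook["sunny"] & play["no"])
print(support)  # 2
```

Because granules depend only on which rows share a value, two isomorphic tables yield identical granules, which is the intuition behind the "isomorphic relations have isomorphic patterns" remark.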
doi: 10.1007/11498186_3
Comparative Study of Sequential Pattern Mining Models
Hye-Chung (Monica) Kum; Susan Paulsen; Wei Wang
The process of finding interesting, novel, and useful patterns from data is now commonly known as Knowledge Discovery and Data Mining (KDD). In this paper, we examine closely the problem of mining sequential patterns and propose a general evaluation method to assess the quality of the mined results. We propose four evaluation criteria, namely (1) recoverability, (2) the number of spurious patterns, (3) the number of redundant patterns, and (4) the degree of extraneous items in the patterns, to quantitatively assess the quality of the mined results from a wide variety of synthetic datasets with varying randomness and noise levels. Recoverability, a new metric, measures how much of the underlying trend has been detected. Such an evaluation method provides a basis for comparing different models for sequential pattern mining. Furthermore, such evaluation is essential in understanding the performance of approximate solutions. In this paper, the method is employed to conduct a detailed comparison of the traditional frequent sequential pattern model with an alternative approximate pattern model based on sequence alignment. We demonstrate that the alternative approach is able to better recover the underlying patterns with little confounding information under all circumstances we examined, including those where the frequent sequential pattern model fails.
Pp. 43-70
doi: 10.1007/11498186_4
Designing Robust Regression Models
Murlikrishna Viswanathan; Kotagiri Ramamohanarao
In this study we focus on the preference among competing models from a family of polynomial regressors. Classical statistics offers a number of well-known techniques for model selection in polynomial regression, namely Finite Prediction Error (FPE) [1], Akaike's Information Criterion (AIC) [2], Schwarz's criterion (SCH) [10], and Generalized Cross-Validation (GCV) [4]. Wallace's Minimum Message Length (MML) principle [16, 17, 18] and Vapnik's Structural Risk Minimization (SRM) [11, 12], which is based on the classical theory of VC dimensionality, are plausible additions to this family of model-selection principles. SRM and MML are generic in the sense that they can be applied to any family of models, and similar in their attempt to define a trade-off between the complexity of a given model and its goodness of fit to the observed data, although they use different trade-offs: MML's is Bayesian and SRM's is non-Bayesian in principle. Recent empirical evaluations [14, 15] comparing the performance of several methods for polynomial degree selection provide strong evidence in support of the MML and SRM methods over the other techniques.
Pp. 71-86
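The degree-selection task the abstract describes can be made concrete with one of the classical criteria it lists. The sketch below (an illustration of AIC only, not the chapter's MML or SRM methods; the data and noise level are invented for the example) fits polynomials of increasing degree and picks the degree minimizing AIC, which for Gaussian noise reduces to n·log(RSS/n) + 2k with k fitted coefficients.

```python
# Minimal sketch of polynomial degree selection by Akaike's Information
# Criterion (one criterion from the abstract's list). Higher degrees always
# reduce the residual sum of squares (RSS); the 2k penalty term discourages
# the extra coefficients that pure goodness of fit would reward.

import math
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = np.linspace(-2.0, 2.0, n)
y = 1.0 - 3.0 * x + 2.0 * x**2 + rng.normal(0.0, 0.3, n)  # true degree: 2

def aic_for_degree(d):
    coeffs = np.polyfit(x, y, d)                       # least-squares fit
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = d + 1                                          # fitted coefficients
    return n * math.log(rss / n) + 2 * k

aics = {d: aic_for_degree(d) for d in range(1, 7)}
best = min(aics, key=aics.get)
print(best)
```

The underfit degree-1 model is heavily penalized through its large RSS, so AIC always prefers degree 2 over degree 1 here; MML and SRM replace the 2k term with their own complexity accounting.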
doi: 10.1007/11498186_5
A Probabilistic Logic-based Framework for Characterizing Knowledge Discovery in Databases
Ying Xie; Vijay V. Raghavan
In order to further improve the KDD process in terms of both the degree of automation achieved and the types of knowledge discovered, we argue that a formal logical foundation is needed and suggest that Bacchus' probability logic is a good choice. By staying entirely within the expressiveness of Bacchus' probability logic language, we give formal definitions of "pattern" as well as its determiners, "previously unknown" and "potentially useful". These definitions provide a sound foundation for overcoming several deficiencies of current KDD systems with respect to novelty and usefulness judgments. Furthermore, based on this logic, we propose a logic induction operator that defines a standard process through which all the potentially useful patterns embedded in the given data can be discovered. Hence, general knowledge discovery (independent of any application) is defined to be any process functionally equivalent to the process specified by this logic induction operator with respect to the given data. By customizing the parameters and providing more constraints, users can guide the knowledge discovery process to obtain a specific subset of all previously unknown and potentially useful patterns, in order to satisfy their current needs.
Pp. 87-100
doi: 10.1007/11498186_6
A Careful Look at the Use of Statistical Methodology in Data Mining
Norman Matloff
Knowledge discovery in databases (KDD) is an inherently statistical activity, with a considerable literature drawing upon statistical science. However, the usage has typically been vague and informal at best, and at worst of a seriously misleading nature. In addition, much of the classical statistical methodology was designed for goals which can be very different from those of KDD. The present paper seeks to take a first step in remedying this problem by pairing precise mathematical descriptions of some of the concepts in KDD with practical interpretations and implications for specific KDD issues.
Pp. 101-117
doi: 10.1007/11498186_7
Justification and Hypothesis Selection in Data Mining
Tuan-Fang Fan; Duen-Ren Liu; Churn-Jung Liau
Data mining is an instance of the inductive methodology, and many philosophical considerations about induction can also be carried over to data mining. In particular, the justification of induction has been a long-standing problem in epistemology. This article recasts that problem in the context of data mining. We formulate the problem precisely in rough set-based decision logic and discuss its implications for data mining research.
Pp. 119-130
doi: 10.1007/11498186_8
On Statistical Independence in a Contingency Table
Shusaku Tsumoto
This paper gives a proof that statistical independence in a contingency table is a special type of linear dependence, in which the rank of the table viewed as a matrix is equal to 1. In particular, the equations obtained correspond to those of projective geometry, which suggests that a contingency matrix can be interpreted in a geometrical way.
Pp. 131-141
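The rank-1 characterization in the abstract is easy to verify numerically. The sketch below (a toy illustration, not the chapter's proof) builds a 2×2 table whose cells exactly satisfy the independence condition n_ij = n_i·n_·j/n and checks that its matrix rank is 1, while a table with a diagonal association has full rank.

```python
# Numerical illustration of the claim: a contingency table satisfying
# statistical independence (each cell equals row_total * column_total /
# grand_total) has matrix rank 1; an associated table does not.

import numpy as np

independent = np.array([[2, 4],
                        [3, 6]])   # rows proportional: (2,4) = (2/3)*(3,6)
associated = np.array([[5, 1],
                       [1, 5]])    # diagonal association, det = 24

# Check the independence condition n_ij = n_i. * n_.j / n for the first table.
n = independent.sum()
expected = np.outer(independent.sum(axis=1), independent.sum(axis=0)) / n
print(np.allclose(independent, expected))   # True
print(np.linalg.matrix_rank(independent))   # 1
print(np.linalg.matrix_rank(associated))    # 2
```

Rank 1 means every row is a scalar multiple of every other, which is exactly what the product form of the independence condition forces.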
doi: 10.1007/11498186_9
A Comparative Investigation on Model Selection in Binary Factor Analysis
Yujia An; Xuelei Hu; Lei Xu
Binary factor analysis has been widely used in data analysis, with various applications. Most studies assume a known number of hidden factors or determine it by one of the existing model selection criteria from the statistical learning literature. These criteria have to be implemented in two phases: first obtain a set of candidate models, then select the "optimal" model among the candidates according to a model selection criterion, which incurs huge computational costs. Under the framework of Bayesian Ying-Yang (BYY) harmony learning, not only has a criterion been obtained, but model selection can also be made automatically during parameter learning without requiring a two-stage implementation, with significant savings in computational cost. This paper further investigates the BYY criterion and BYY harmony learning with automatic model selection (BYY-AUTO) in comparison with typical existing criteria, including Akaike's information criterion (AIC), the consistent Akaike's information criterion (CAIC), the Bayesian inference criterion (BIC), and the cross-validation (CV) criterion. The study is conducted via experiments on data sets with different sample sizes, data space dimensions, noise variances, and numbers of hidden factors. Experiments show that in most cases BIC outperforms AIC, CAIC, and CV, while the BYY criterion and BYY-AUTO are either comparable with or better than BIC. Furthermore, BYY-AUTO takes much less time than the conventional two-stage learning methods, with an appropriate number of factors automatically determined during parameter learning. Therefore, BYY harmony learning is a preferable tool for determining the number of hidden factors.
Pp. 143-160
doi: 10.1007/11498186_10
Extraction of Generalized Rules with Automated Attribute Abstraction
Yohji Shidara; Mineichi Kudo; Atsuyoshi Nakamura
We propose a novel method for mining generalized rules with high support and confidence. Using our method, we can obtain generalized rules in which the abstraction of attribute values is carried out implicitly, without requiring additional information such as conceptual hierarchies. Our experimental results show that the obtained rules not only have high support and confidence but also have expressions that are conceptually meaningful.
Pp. 161-170