Publication catalog - books

Advanced Techniques in Knowledge Discovery and Data Mining

Nikhil R. Pal; Lakhmi Jain (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Detected institution: not detected
Year of publication: 2005
Access: SpringerLink

Information

Resource type:

books

Print ISBN

978-1-85233-867-1

Electronic ISBN

978-1-84628-183-9

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2005

Publication rights information

© Springer-Verlag London Limited 2005

Table of contents

Advanced Techniques in Knowledge Discovery and Data Mining

Nikhil R. Pal; Lakhmi Jain (eds.)

Pp. Not available

Trends in Data Mining and Knowledge Discovery

Krzysztof J. Cios; Lukasz A. Kurgan

Data mining and knowledge discovery (DMKD) is a fast-growing field of research. Its popularity is caused by an ever-increasing demand for tools that help in revealing and comprehending information hidden in huge amounts of data. Such data are generated on a daily basis by federal agencies, banks, insurance companies, retail stores, and the WWW. This explosion came about through the increasing use of computers, scanners, digital cameras, bar codes, etc. We are in a situation where rich sources of data, stored in databases, warehouses, and other data repositories, are readily available but not easily analyzable. This creates pressure from the federal, business, and industry communities for improvements in DMKD technology. What is needed is a clear and simple methodology for extracting the knowledge hidden in the data. In this chapter, an integrated DMKD process model based on technologies such as XML, PMML, SOAP, UDDI, and OLE DB-DM is introduced. These technologies help in designing flexible, semiautomated, and easy-to-use DMKD models, enabling the building of knowledge repositories and communication between data mining tools, databases, and knowledge repositories. They also enable integration and automation of DMKD tasks. This chapter describes a six-step DMKD process model and its component technologies.

Pp. 1-26
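Since the abstract mentions interchange standards such as XML and PMML, here is a small, purely illustrative sketch (not taken from the book): a heavily simplified PMML-style tree model parsed and evaluated with Python's standard library. The element names echo real PMML (DataDictionary, TreeModel, SimplePredicate), but the document, field names, and threshold are invented.

```python
# Hypothetical, heavily simplified PMML-style document; real PMML (dmg.org) is much richer.
import xml.etree.ElementTree as ET

TOY_PMML = """
<PMML version="3.0">
  <DataDictionary>
    <DataField name="income" optype="continuous"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <Node score="reject">
      <Node score="accept">
        <SimplePredicate field="income" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

def classify(record, pmml_text):
    """Walk the toy tree: use a child node's score when its predicate holds."""
    root = ET.fromstring(pmml_text)
    node = root.find("TreeModel/Node")
    score = node.get("score")
    for child in node.findall("Node"):
        pred = child.find("SimplePredicate")
        if pred is not None and pred.get("operator") == "greaterThan":
            if record[pred.get("field")] > float(pred.get("value")):
                score = child.get("score")
    return score

print(classify({"income": 72000.0}, TOY_PMML))   # accept
print(classify({"income": 30000.0}, TOY_PMML))   # reject
```

The point of such standards, as the abstract notes, is that a model serialized this way can be exchanged between data mining tools and knowledge repositories without custom glue code.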

Advanced Methods for the Analysis of Semiconductor Manufacturing Process Data

Andreas König; Achim Gratz

The analysis, control, and optimization of manufacturing processes in the semiconductor industry are applications with significant economic impact. Modern semiconductor manufacturing processes feature an increasing number of processing steps of increasing complexity, generating a flood of multivariate monitoring data. This exponentially increasing complexity, together with the associated information-processing and productivity demands, imposes stringent requirements that are hard to meet with state-of-the-art monitoring and analysis methods and tools. This chapter deals with the application of selected soft-computing methods to the analysis of deviations from allowed parameters or operating ranges, i.e., anomaly or novelty detection, and to the discovery of nonobvious multivariate dependencies among the involved parameters and of structure in the data for improved process control. Methods for online observation and offline interactive analysis employing novelty classification, dimensionality reduction, and interactive data visualization techniques are investigated in this feasibility study, based on an actual application problem and data extracted from a CMOS submicron process. The viability and feasibility of the investigated methods are demonstrated. In particular, the results of the interactive data visualization and automatic feature selection methods are most promising. The chapter introduces semiconductor manufacturing data acquisition, the application problems, and the soft-computing methods considered in a tutorial fashion. The results of the conducted data analysis and classification experiments are presented, and an outline of a system architecture based on this feasibility study and suited for industrial service is given.

Pp. 27-74
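A minimal sketch of the novelty-detection idea described above, under the assumption of roughly Gaussian in-spec data; it is not the chapter's soft-computing method. In-spec reference measurements define a mean and covariance, and new parameter vectors are flagged when their Mahalanobis distance exceeds a reference quantile. All data here are synthetic.

```python
# Minimal novelty-detection sketch (illustrative, not the chapter's method):
# fit a Gaussian model on in-spec reference measurements and flag new
# parameter vectors whose Mahalanobis distance exceeds a threshold.
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))            # synthetic in-spec multivariate monitoring data
mean = reference.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(reference, rowvar=False))

def mahalanobis(x):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Threshold: the 99th percentile of distances seen on the reference data.
threshold = np.quantile([mahalanobis(x) for x in reference], 0.99)

new_lot = rng.normal(loc=0.0, scale=1.0, size=8)
drifted_lot = rng.normal(loc=3.0, scale=1.0, size=8)  # simulated out-of-range lot
for name, x in [("new_lot", new_lot), ("drifted_lot", drifted_lot)]:
    print(name, "novel" if mahalanobis(x) > threshold else "normal")
```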

Clustering and Visualization of Retail Market Baskets

Joydeep Ghosh; Alexander Strehl

Transaction analysis, including clustering of market baskets, is a key application of data mining to the retail industry. This domain has some specific requirements, such as the need for obtaining easily interpretable and actionable results. It also exhibits some very challenging characteristics, mostly stemming from the fact that the data have thousands of features and are highly non-Gaussian and sparse. This chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional feature space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging clusters can be easily derived, and it also guides the user toward a suitable number of clusters. Results are presented on a real retail industry data set of several thousand customers and products.

Pp. 75-102
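The similarity-space idea can be sketched as follows, with spectral clustering standing in for the graph-partitioning tools used in the chapter; the basket data, group structure, and parameters are invented for illustration. The permuted similarity matrix at the end is the object one would plot to see clusters as bands.

```python
# Sketch of relationship-based clustering in a similarity space (illustrative):
# build a cosine-similarity matrix over sparse customer x product baskets,
# cluster on it, then reorder the matrix by cluster label so clusters appear
# as bands along the diagonal when visualized.
import numpy as np
from sklearn.cluster import SpectralClustering   # graph-based clustering on a precomputed affinity

rng = np.random.default_rng(1)
# Toy baskets: two customer groups, each buying mostly from its own 100 products.
group_a = (rng.random((30, 200)) < np.r_[np.full(100, 0.10), np.full(100, 0.01)]).astype(float)
group_b = (rng.random((30, 200)) < np.r_[np.full(100, 0.01), np.full(100, 0.10)]).astype(float)
baskets = np.vstack([group_a, group_b])[rng.permutation(60)]

norms = np.linalg.norm(baskets, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = baskets / norms
similarity = unit @ unit.T                        # cosine similarity between customers

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(similarity)

order = np.argsort(labels)                        # permute customers by cluster label
banded = similarity[np.ix_(order, order)]         # plotting this matrix shows clusters as bands
print(np.bincount(labels), banded.shape)
```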

Segmentation of Continuous Data Streams Based on a Change Detection Methodology

Gil Zeira; Mark Last; Oded Maimon

Most data mining algorithms assume that the historic data are the best estimator of what will happen in the future. As more data are accumulated in a database, one should examine whether the new data agree with the model induced from previous instances. The problem of recognizing a change in the underlying model is known as the change detection problem. Once all change points have been detected, a data stream can be represented as a series of nonoverlapping segments.

This work presents a new methodology for change detection and segmentation based on a set of statistical estimators. While traditional segmentation methods are aimed at analyzing univariate time series, our methodology detects statistically significant changes in incrementally built classification models of data mining. In our previous work, we have shown the methodology to be valid for change detection in a set of artificial and benchmark data sets. In this work, we apply the change detection procedure to real-world data sets from two distinct domains (education and finance), where we detect significant changes between succeeding segments and compare the quality of alternative segmentations.

Pp. 103-126
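A generic sketch of window-based change detection on a stream of model errors, using a two-proportion z-test rather than the authors' specific statistical estimators; the window size, significance level, and synthetic error stream are illustrative only.

```python
# Illustrative change-detection sketch: compare a model's error rate on a
# reference window with its error rate on the newest window; if the difference
# is significant, declare a change point and start a new segment.
from math import sqrt, erf

def error_rates_differ(err_ref, n_ref, err_new, n_new, alpha=0.01):
    """Two-proportion z-test on error counts err_* out of n_* instances."""
    p1, p2 = err_ref / n_ref, err_new / n_new
    p = (err_ref + err_new) / (n_ref + n_new)
    se = sqrt(p * (1 - p) * (1 / n_ref + 1 / n_new))
    if se == 0:
        return False
    z = abs(p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))   # two-sided normal tail
    return p_value < alpha

def segment(error_stream, window=100):
    """Split a stream of 0/1 model errors into segments at detected change points."""
    boundaries, start = [], 0
    for i in range(window, len(error_stream) - window, window):
        ref = error_stream[start:i]
        new = error_stream[i:i + window]
        if error_rates_differ(sum(ref), len(ref), sum(new), len(new)):
            boundaries.append(i)
            start = i
    return boundaries

# Error rate jumps from 0% to roughly 33% halfway through the stream.
stream = [0] * 500 + [1 if k % 3 == 0 else 0 for k in range(500)]
print(segment(stream))   # prints [500]
```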

Instance Selection Using Evolutionary Algorithms: An Experimental Study

José Ramón Cano; Francisco Herrera; Manuel Lozano

In this chapter, we carry out an empirical study of the performance of four representative evolutionary algorithm models considering two instance-selection perspectives: prototype selection and training set selection for data reduction in knowledge discovery. This study includes a comparison between these algorithms and other, nonevolutionary instance-selection algorithms. The results show that the evolutionary instance-selection algorithms consistently outperform the nonevolutionary ones, offering two main advantages simultaneously: better instance-reduction rates and higher classification accuracy.

Pp. 127-152
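A rough sketch of evolutionary prototype selection with a plain generational GA (not one of the four specific evolutionary models compared in the chapter): individuals are binary masks over the training set, and fitness combines 1-NN accuracy with the reduction rate. The data, operators, and weights are invented for illustration.

```python
# Evolutionary prototype selection sketch: evolve binary instance masks whose
# fitness trades off 1-NN accuracy against the achieved reduction rate.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy labeled data

def knn1_accuracy(train_idx):
    """Accuracy of 1-NN over the full set when only X[train_idx] is retained."""
    if train_idx.size == 0:
        return 0.0
    d = np.linalg.norm(X[:, None, :] - X[train_idx][None, :, :], axis=2)
    pred = y[train_idx][np.argmin(d, axis=1)]
    return float((pred == y).mean())

def fitness(mask, w=0.5):
    idx = np.flatnonzero(mask)
    reduction = 1.0 - idx.size / mask.size
    return w * knn1_accuracy(idx) + (1 - w) * reduction

pop = (rng.random((30, len(X))) < 0.5).astype(int)
for _ in range(40):   # binary tournament selection, uniform crossover, bit-flip mutation
    scores = np.array([fitness(m) for m in pop])
    parents = pop[[max(rng.choice(len(pop), 2), key=lambda i: scores[i]) for _ in range(len(pop))]]
    cross = rng.random(pop.shape) < 0.5
    children = np.where(cross, parents, np.roll(parents, 1, axis=0))
    pop = np.where(rng.random(pop.shape) < 0.01, 1 - children, children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected", int(best.sum()), "of", len(X), "instances")
```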

Using Cooperative Coevolution for Data Mining of Bayesian Networks

Man Leung Wong; Shing Yan Lee; Kwong Sak Leung

Bayesian networks are formal knowledge representation tools that support reasoning under uncertainty. The applications of Bayesian networks are widespread, including data mining, information retrieval, and various diagnostic systems. Although Bayesian networks are useful, the learning problem, namely constructing a network automatically from data, remains difficult. Recently, some researchers have adopted evolutionary computation for learning. However, the drawback is that this approach is slow. In this chapter, we propose a hybrid framework for Bayesian network learning. By combining the merits of two different learning approaches, we expect an improvement in learning speed. In brief, the new learning algorithm consists of two phases: the conditional independence (CI) test phase and the search phase. In the CI test phase, we conduct a dependency analysis, which helps to reduce the search space. In the search phase, we perform model searching using an evolutionary approach called cooperative coevolution. When comparing our new algorithm with an existing algorithm, we find that our algorithm performs faster and is more accurate in many cases.

Pp. 153-175
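The CI-test phase can be illustrated in a few lines (the cooperative-coevolution search phase is too involved for a short sketch): a chi-square independence test between each pair of discrete variables prunes pairs that look independent from the set of candidate edges. This assumes SciPy is available; the data and thresholds are synthetic.

```python
# Sketch of the CI-test phase of constraint-reduced Bayesian network learning:
# keep only variable pairs whose independence is rejected as candidate edges.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(3)
a = rng.integers(0, 2, 1000)
noise = (rng.random(1000) < 0.1).astype(int)
b = a ^ noise                                     # B depends on A (10% flips)
c = rng.integers(0, 2, 1000)                      # C is independent of both
data = {"A": a, "B": b, "C": c}

candidate_edges = []
names = list(data)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        table = np.zeros((2, 2))
        for u, v in zip(data[names[i]], data[names[j]]):
            table[u, v] += 1
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < 0.01:                        # dependent: keep as a candidate edge
            candidate_edges.append((names[i], names[j]))

print(candidate_edges)                            # typically prints [('A', 'B')]
```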

Knowledge Discovery and Data Mining in Medicine

Takumi Ichimura; Shinichi Oeda; Machi Suka; Akira Hara; Kenneth J. Mackin; Yoshida Katsumi

Medical databases store diagnostic information based on patients’ medical records. Because of deficits in patients’ medical records, medical databases do not provide all the information required by learning algorithms. Moreover, we may encounter contradictory cases, in which the pattern of input signals is the same but the pattern of output signals is different. Learning algorithms cannot correctly classify such cases. Even medical doctors require more information to make the final diagnosis. In this chapter, we describe three methods of classifying medical databases based on neural networks and genetic programming (GP). To verify the effectiveness of the proposed methods, we apply them to real medical databases and demonstrate their high classification capability. We also introduce techniques for extracting If-Then rules from the trained networks.

Pp. 177-210
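As a generic illustration of If-Then rule extraction from a trained network (not the chapter's technique), the sketch below fits a small neural classifier and then a shallow decision tree that mimics its predictions, printing the tree as readable rules. It assumes scikit-learn is available; the data and feature names are invented.

```python
# Surrogate-tree rule extraction sketch: approximate a trained network with a
# shallow decision tree and print that tree as If-Then style rules.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))                       # toy "patient attribute" vectors
y = ((X[:, 0] > 0.5) | (X[:, 2] < -1.0)).astype(int)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

# Fit the surrogate to the network's predictions, not to the original labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, net.predict(X))
print(export_text(surrogate, feature_names=["attr_0", "attr_1", "attr_2"]))
```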

Satellite Image Classification Using Cascaded Architecture of Neural Fuzzy Network

Chin-Teng Lin; Her-Chang Pu; Yin-Cheung Lee

Because satellite images usually contain many complex factors and mixed samples, a high recognition rate is not easy to attain. Especially for a nonhomogeneous region, the gray values of its satellite image vary greatly, so the direct use of gray values cannot accomplish the categorization task correctly. Classification of terrain cover using polarimetric radar is an area of considerable current interest and research. Without satellite imagery, we could not analyze the distribution of soils and cities for land development, or the variation of clouds and volcanoes for weather forecasting and disaster precaution. This chapter discusses a hybrid neural fuzzy network, combining unsupervised and supervised learning, for designing classifier systems. Based on systematic feature analysis, which is crucial for data mining and knowledge extraction, the proposed scheme constitutes a novel algebraic system identification method, which can be used for knowledge extraction in general and for satellite image analysis in particular. The goal of this chapter is to develop a cascaded architecture of a neural fuzzy network with feature mapping (CNFM) to aid the classification of satellite images.

Pp. 211-231
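A rough stand-in for the cascade idea of an unsupervised feature-mapping stage feeding a supervised stage; this is not the CNFM architecture itself. Pixel feature vectors are mapped to fuzzy memberships of k-means prototypes, and a simple supervised classifier is trained on those memberships. It assumes scikit-learn; the "spectral band" data and class labels are synthetic.

```python
# Cascade sketch: unsupervised prototype learning -> fuzzy membership features
# -> supervised classification of pixels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
pixels = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(3, 1, (300, 4))])  # toy spectral bands
labels = np.array([0] * 300 + [1] * 300)                                       # toy terrain classes

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(pixels)               # unsupervised stage
dist = np.linalg.norm(pixels[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
membership = np.exp(-dist**2 / 2.0)                                            # Gaussian fuzzy memberships

clf = LogisticRegression(max_iter=1000).fit(membership, labels)                # supervised stage
print("training accuracy:", clf.score(membership, labels))
```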

Discovery of Positive and Negative Rules from Medical Databases Based on Rough Sets

Shusaku Tsumoto

One of the important problems in rule-induction methods is that extracted rules do not plausibly represent information on experts’ decision processes. To solve this problem, the characteristics of medical reasoning are discussed. The concept of positive and negative rules is introduced. Then, for induction of positive and negative rules, two search algorithms are provided. The proposed rule-induction method is evaluated on medical databases. The experimental results show that the induced rules correctly represent experts’ knowledge, and several interesting patterns are discovered.

Pp. 233-252
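The accuracy/coverage style of rule evaluation commonly used in rough-set rule induction can be sketched as follows; the thresholds, record layout, and helper names are illustrative, not the chapter's exact algorithms. A candidate condition R is scored against a diagnosis D by accuracy(R -> D) = |R and D| / |R| and coverage(R -> D) = |R and D| / |D|; positive rules require high accuracy, while the exclusive conditions used to build negative rules require high coverage.

```python
# Rough-set-style rule scoring sketch: accuracy and coverage of a candidate
# condition with respect to a diagnosis, over a toy table of patient records.
def accuracy_and_coverage(records, condition, diagnosis):
    covered = [r for r in records if condition(r)]
    positive = [r for r in covered if r["diagnosis"] == diagnosis]
    in_class = [r for r in records if r["diagnosis"] == diagnosis]
    acc = len(positive) / len(covered) if covered else 0.0
    cov = len(positive) / len(in_class) if in_class else 0.0
    return acc, cov

records = [
    {"fever": "high", "headache": "yes", "diagnosis": "flu"},
    {"fever": "high", "headache": "yes", "diagnosis": "flu"},
    {"fever": "high", "headache": "no",  "diagnosis": "flu"},
    {"fever": "none", "headache": "yes", "diagnosis": "cold"},
    {"fever": "none", "headache": "no",  "diagnosis": "cold"},
]

cond = lambda r: r["fever"] == "high"
acc, cov = accuracy_and_coverage(records, cond, "flu")
print(f"accuracy={acc:.2f} coverage={cov:.2f}")
# Illustrative thresholds: a positive rule needs high accuracy, while an
# exclusive condition (negative-rule building block) needs high coverage.
print("positive rule" if acc >= 0.8 else "not a positive rule")
```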