Publications catalog - books



Discovery Science: 9th International Conference, DS 2006, Barcelona, Spain, October 7-10, 2006, Proceedings

Ljupčo Todorovski ; Nada Lavrač ; Klaus P. Jantke (eds.)

Conference: 9th International Conference on Discovery Science (DS). Barcelona, Spain. October 7, 2006 - October 10, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Philosophy of Science; Artificial Intelligence (incl. Robotics); Database Management; Information Storage and Retrieval; Computer Appl. in Administrative Data Processing; Computer Appl. in Social and Behavioral Sciences

Availability
Detected institution   Publication year   Browse   Download   Request
Not detected   2006   SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-46491-4

Electronic ISBN

978-3-540-46493-8

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Clustering Pairwise Distances with Missing Data: Maximum Cuts Versus Normalized Cuts

Jan Poland; Thomas Zeugmann

Clustering algorithms based on a matrix of pairwise similarities (kernel matrix) for the data are widely known and used, a particularly popular class being spectral clustering algorithms. In contrast, algorithms working with the pairwise distance matrix have been studied rarely for clustering. This is surprising, as in many applications, distances are directly given, and computing similarities involves another step that is error-prone, since the kernel has to be chosen appropriately, albeit computationally cheap. This paper proposes a clustering algorithm based on the SDP relaxation of the max-k-cut of the graph of pairwise distances, based on the work of Frieze and Jerrum. We compare the algorithm with Yu and Shi’s algorithm based on spectral relaxation of a norm-k-cut. Moreover, we propose a simple heuristic for dealing with missing data, i.e., the case where some of the pairwise distances or similarities are not known. We evaluate the algorithms on the task of clustering natural language terms with the Google distance, a semantic distance recently introduced by Cilibrasi and Vitányi, using relative frequency counts from WWW queries and based on the theory of Kolmogorov complexity.

II - Long Papers | Pp. 197-208
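
The Google distance mentioned in the abstract above is computed from relative frequency counts. As a minimal sketch, the function below uses the commonly cited form of the normalized Google distance; the function name, argument names, and example counts are illustrative assumptions, not taken from the paper, and the paper's missing-data heuristic is not reproduced here.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance from raw hit counts.

    fx, fy : number of pages containing term x and term y respectively
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (an assumed normalizing constant)
    """
    # A zero count has no logarithm; treat such a pair as infinitely distant.
    # (This fallback is an assumption made for illustration.)
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts: two terms that always co-occur, and two that rarely do.
always_together = normalized_google_distance(100, 100, 100, 10_000)
rarely_together = normalized_google_distance(100, 100, 10, 10_000)
```

Terms that always appear together get a distance of zero, and the distance grows as the joint count shrinks relative to the individual counts.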

Analysis of Linux Evolution Using Aligned Source Code Segments

Antti Rasinen; Jaakko Hollmén; Heikki Mannila

The Linux operating system embodies a development history of 15 years and the community effort of hundreds of voluntary developers. We examine the structure and evolution of the Linux kernel by considering the source code of the kernel as ordinary text without any regard to its semantics. After selecting three functionally central modules to study, we identified code segments using local alignments of source code from a reduced set of file comparisons. The further stages of the analyses take advantage of these identified alignments. We build module-specific visualizations, or descendant graphs, to visualize the overall code migration between versions and files. A more detailed view can be achieved with chain graphs, which show the time evolution of alignments between selected files. The methods used here may also prove useful in studying large collections of legacy code whose original maintainers are not available.

II - Long Papers | Pp. 209-218

Rule-Based Prediction of Rare Extreme Values

Rita Ribeiro; Luís Torgo

This paper describes a rule learning method that obtains models biased towards a particular class of regression tasks. The main distinguishing feature of these tasks is that the goal is to be accurate at predicting rare extreme values of the continuous target variable. Many real-world applications from scientific areas like ecology, meteorology, finance, etc., share this objective. Most existing approaches to regression problems search for the model parameters that optimize a given average error estimator (e.g. mean squared error). This means that they are biased towards achieving good performance on the most common cases. The motivation for our work is the claim that being accurate on a small set of rare cases requires different error metrics. Moreover, given the nature and relevance of this type of application, an interpretable model is usually of key importance to domain experts, as predicting these rare events is normally associated with costly decisions. Our proposed system (R-PREV) obtains a set of interpretable regression rules derived from a set of bagged regression trees using evaluation metrics that bias the resulting models to predict rare extreme values accurately. We provide an experimental evaluation of our method confirming the advantages of our proposal in terms of accuracy in predicting rare extreme values.

II - Long Papers | Pp. 219-230

A Pragmatic Logic of Scientific Discovery

Jean Sallantin; Christopher Dartnell; Mohammad Afshar

To the best of our knowledge, this paper is the first attempt to formalise a pragmatic logic of scientific discovery in a manner such that it can be realised by scientists assisted by machines. Using Institution Agents, we define a dialectic process to manage contradiction. This allows autoepistemic Institution Agents to learn from a supervised teaching process. We present an industrial application in the field of Drug Discovery, applying our system in the prediction of pharmaco-kinetic properties (ADME-T) and adverse side effects of therapeutic drug molecules.

II - Long Papers | Pp. 231-242

Change Detection with Kalman Filter and CUSUM

Milton Severo; João Gama

In most challenging applications, learning algorithms act in dynamic environments where the data is collected over time. A desirable property of these algorithms is the ability to incrementally incorporate new data into the current decision model. Several incremental learning algorithms have been proposed. However, most of them assume that the examples are drawn from a stationary distribution [13]. The aim of this study is to present a detection system (DSKC) for regression problems. The system is modular and works as a post-processor of a regressor. It is composed of a regression predictor, a Kalman filter and a Cumulative Sum of Recursive Residual (CUSUM) change detector. The system continuously monitors the error of the regression model. A significant increase in the error is interpreted as a change in the distribution that generates the examples over time. When a change is detected, the current regression model is deleted and a new one is constructed. In this paper we tested DSKC with a set of three artificial experiments and two real-world datasets: a physiological dataset and a clinical dataset of Sleep Apnoea. Sleep Apnoea is a common disorder characterized by periods of breathing cessation (apnoea) and periods of reduced breathing (hypopnea) [7]. This is a real application where the goal is to detect changes in the signals that monitor breathing. The experimental results showed that the system detected changes fast and with high probability. The results also showed that the system is robust to false alarms and can be applied efficiently to problems where the information is available over time.

II - Long Papers | Pp. 243-254
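
The error-monitoring idea in the abstract above can be illustrated with a plain one-sided CUSUM test on a stream of model errors. This is a simplified stand-in, not the DSKC system: it omits the Kalman filter and the recursive-residual formulation, and the function name, parameters, and data are assumptions chosen for illustration.

```python
def cusum_detect(errors, target_mean, slack, threshold):
    """Return the index of the first alarm, or None if no change is flagged.

    errors      : sequence of per-example model errors
    target_mean : error level expected while the model is still valid
    slack       : allowance subtracted at each step to ignore small drifts
    threshold   : cumulative excess that triggers an alarm
    """
    g = 0.0  # cumulative sum of excess error, clipped at zero
    for i, e in enumerate(errors):
        g = max(0.0, g + (e - target_mean - slack))
        if g > threshold:
            return i
    return None


# Hypothetical error stream: stable at 0.1, then jumping to 1.0 at index 20.
errors = [0.1] * 20 + [1.0] * 10
alarm = cusum_detect(errors, target_mean=0.1, slack=0.05, threshold=2.0)
```

In a setup like the one the abstract describes, an alarm would be the point at which the current regression model is discarded and a new one is trained on subsequent data.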

Automatic Recognition of Landforms on Mars Using Terrain Segmentation and Classification

Tomasz F. Stepinski; Soumya Ghosh; Ricardo Vilalta

Mars probes send back to Earth an enormous amount of data. Automating the analysis of this data and its interpretation represents a challenging test of significant benefit to the domain of planetary science. In this study, we propose combining terrain segmentation and classification to interpret Martian topography data and to identify constituent landforms of the Martian landscape. Our approach uses unsupervised segmentation to divide a landscape into a number of spatially extended but topographically homogeneous objects. Each object is assigned a 12-dimensional feature vector consisting of terrain attributes and neighborhood properties. The objects are classified, based on their feature vectors, into predetermined landform classes. We have applied our technique to the Tisia Valles test site on Mars. Support Vector Machines produced the most accurate results (84.6% mean accuracy) in the classification of topographic objects. An immediate application of our algorithm lies in the automatic detection and characterization of craters on Mars.

II - Long Papers | Pp. 255-266

A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms

György Szarvas; Richárd Farkas; András Kocsor

In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies NEs in the Hungarian and English languages by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible, and used a split and recombine technique to fully exploit its potential. This methodology provided an opportunity to train several independent decision tree classifiers based on different subsets of features and combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation purposes. Both consist entirely of newswire articles. Our system remains portable across languages without requiring any major modification; it slightly outperforms the best system of CoNLL 2003 and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top performing models for English, which makes our system extremely successful when used in combination with CoNLL models.

II - Long Papers | Pp. 267-278
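
The majority voting scheme mentioned in the abstract above combines the labels proposed by several independently trained classifiers. The sketch below shows only that combination step, under assumed label names; it is not the authors' system, and the tie-breaking rule (the label seen first wins) is an arbitrary choice made for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label proposed by the most classifiers.

    labels : list with one predicted label per classifier for a single token.
    Ties are resolved in favor of the label that appears first in the list.
    """
    if not labels:
        raise ValueError("at least one prediction is required")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical predictions from three classifiers for two tokens.
first_token = majority_vote(["PER", "ORG", "PER"])
second_token = majority_vote(["LOC", "LOC", "O"])
```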

Model-Based Estimation of Word Saliency in Text

Xin Wang; Ata Kabán

We investigate a generative latent variable model for model-based word saliency estimation for text modelling and classification. The estimation algorithm derived is able to infer the saliency of words with respect to the mixture modelling objective. We demonstrate experimental results showing that common stop-words as well as other corpus-specific common words are automatically down-weighted and this enhances our ability to capture the essential structure in the data, ignoring irrelevant details. As a classifier, our approach improves over the class prediction accuracy of the Naive Bayes classifier in all our experiments. Compared with a recent state of the art text classification method (Dirichlet Compound Multinomial model) we obtained improved results in two out of three benchmark text collections tested, and comparable results on one other data set.

II - Long Papers | Pp. 279-290

Learning Bayesian Network Equivalence Classes from Incomplete Data

Hanen Borchani; Nahla Ben Amor; Khaled Mellouli

This paper proposes a new method, named Greedy Equivalence Search-Expectation Maximization (GES-EM), for learning Bayesian networks from incomplete data. Our method extends the recently proposed GES algorithm to deal with incomplete data. The generated networks were evaluated using the expected Bayesian Information Criterion (BIC) scoring function. Experimental results show that the GES-EM algorithm yields more accurate structures than the standard Alternating Model Selection-Expectation Maximization (AMS-EM) algorithm.

III - Regular Papers | Pp. 291-295

Interesting Patterns Extraction Using Prior Knowledge

Laurent Brisson

One important challenge in data mining is to extract interesting knowledge and useful information for expert users. Since data mining algorithms extract a huge quantity of patterns, it is necessary to filter those patterns using various measures. This paper presents IMAK, an interestingness measure part-way between objective and subjective measures, which evaluates patterns considering expert knowledge. Our main contribution is to improve the extraction of interesting patterns using relationships defined in an ontology.

III - Regular Papers | Pp. 296-300