Publications catalog - books



Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings

Alípio Mário Jorge ; Luís Torgo ; Pavel Brazdil ; Rui Camacho ; João Gama (eds.)

Conference: 9th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Porto, Portugal. October 3, 2005 - October 7, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution | Publication year | Browse | Download | Request
None detected | 2005 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-29244-9

Electronic ISBN

978-3-540-31665-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification

Dimitrios Mavroeidis; George Tsatsaronis; Michalis Vazirgiannis; Martin Theobald; Gerhard Weikum

The introduction of hierarchical thesauri (HT) that contain significant semantic information has led researchers to investigate their potential for improving performance on the text classification task by extending the traditional “bag of words” representation to incorporate syntactic and semantic relationships among words. In this paper we address this problem by proposing a Word Sense Disambiguation (WSD) approach based on the intuition that word proximity in the document also implies proximity in the HT graph. We argue that the high precision exhibited by our WSD algorithm on various human-disambiguated benchmark datasets is appropriate for the classification task. Moreover, we define a semantic kernel, based on the general concept of GVSM kernels, that captures the semantic relations contained in the hierarchical thesaurus. Finally, we conduct experiments on various corpora, achieving a systematic improvement in classification accuracy with the SVM algorithm, especially when the training set is small.

- Long Papers | Pp. 181-192

Mining Paraphrases from Self-anchored Web Sentence Fragments

Marius Paşca

Near-synonyms or paraphrases are beneficial in a variety of natural language and information retrieval applications, but so far their acquisition has been confined to clean, trustworthy collections of documents with explicit external attributes. When such attributes are available, such as similar time stamps associated with a pair of news articles, previous approaches rely on them as signals of potentially high content overlap between the articles, often embodied in sentences that are only slight, paraphrase-based variations of each other. This paper introduces a new unsupervised method for extracting paraphrases from an information source of a completely different nature and scale, namely unstructured text across arbitrary Web textual documents. In this case, no useful external attributes are consistently available for all documents. Instead, the paper introduces linguistically motivated text anchors, which are identified automatically within the documents. The anchors are instrumental in the derivation of paraphrases through lightweight pairwise alignment of Web sentence fragments. A large set of categorized names, acquired separately from Web documents, serves as a filtering mechanism for improving the quality of the paraphrases. A set of paraphrases extracted from about a billion Web documents is evaluated both manually and through its impact on a natural-language Web search application.

Keywords: Relative Clause; News Article; Pairwise Alignment; Computational Linguistics; Lexical Resource.

- Long Papers | Pp. 193-204

M^2SP: Mining Sequential Patterns Among Several Dimensions

M. Plantevit; Y. W. Choong; A. Laurent; D. Laurent; M. Teisseire

Mining sequential patterns aims at discovering correlations between events through time. However, although many works have dealt with sequential pattern mining, none of them considers frequent sequential patterns involving several dimensions in the general case. In this paper, we propose a novel approach, called M^2SP, to mine multidimensional sequential patterns. The main originality of our proposal is that we obtain not only intra-pattern sequences but also inter-pattern sequences. Moreover, we consider generalized multidimensional sequential patterns, called jokerized patterns, in which some of the dimension values may not be instantiated. Experiments on synthetic data are reported and show the scalability of our approach.

Keywords: Data Mining; Sequential Patterns; Multidimensional Rules.

- Long Papers | Pp. 205-216

A Systematic Comparison of Feature-Rich Probabilistic Classifiers for NER Tasks

Benjamin Rosenfeld; Moshe Fresko; Ronen Feldman

In the CoNLL 2003 NER shared task, more than two thirds of the submitted systems used a feature-rich representation of the task. Most of them used maximum entropy to combine the features. Others used linear classifiers, such as SVM and RRM. Among all systems presented there, one of the MEMM-based classifiers took second place, losing only to a committee of four different classifiers, one of which was ME-based and another RRM-based. The lone RRM was fourth, and CRF came in the middle of the pack. In this paper we demonstrate, by running the three algorithms on the same tasks under exactly the same conditions, that this ranking is due to feature selection and other causes and not to the inherent qualities of the algorithms, which should be ranked otherwise.

Keywords: Conditional Random Field; Inductive Logic Programming; Shared Task; Entity Recognition; Sequence Label.

- Long Papers | Pp. 217-227

Knowledge Discovery from User Preferences in Conversational Recommendation

Maria Salamó; James Reilly; Lorraine McGinty; Barry Smyth

Knowledge discovery for personalizing the product recommendation task is a major focus of research in the area of conversational recommender systems, with the aim of increasing efficiency and effectiveness. Conversational recommender systems guide users through a product space, alternately making product suggestions and eliciting user feedback. Critiquing is a common and powerful form of feedback, where a user can express her feature preferences by applying a series of directional critiques over recommendations, instead of providing specific value preferences. For example, a user might ask for a ‘less expensive’ vacation in a travel recommender; thus ‘less expensive’ is a critique over the price feature. The expectation is that on each cycle, the system discovers more about the user’s soft product preferences from minimal information input. In this paper we describe three different strategies for knowledge discovery from user preferences that improve recommendation efficiency in a conversational system using critiquing. Moreover, we demonstrate that while the strategies work well separately, their combined effect has the potential to increase recommendation efficiency considerably further.

- Long Papers | Pp. 228-239
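
The directional critique described in the abstract above (‘less expensive’ applied to the price feature) can be illustrated with a minimal sketch. The `Product` class, the `apply_critique` function, and the sample vacations are hypothetical names and data introduced here for illustration; they are not taken from the paper, and the three knowledge discovery strategies it describes are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Hypothetical product with a single numeric feature."""
    name: str
    price: float


def apply_critique(candidates, current, feature, direction):
    """Keep the candidates that satisfy a directional critique.

    The critique is relative to the currently recommended product:
    "less" keeps products whose feature value is lower, "more" keeps
    products whose feature value is higher.
    """
    reference = getattr(current, feature)
    if direction == "less":
        return [p for p in candidates if getattr(p, feature) < reference]
    if direction == "more":
        return [p for p in candidates if getattr(p, feature) > reference]
    raise ValueError(f"unknown critique direction: {direction!r}")


# Illustrative data only (assumption, not from the paper).
vacations = [
    Product("Lisbon", 900.0),
    Product("Porto", 650.0),
    Product("Madeira", 1200.0),
]

# The user is shown "Lisbon" and asks for something less expensive.
remaining = apply_critique(vacations, vacations[0], "price", "less")
```

Each critique narrows the product space without the user having to state an exact price, which is the sense in which the feedback requires minimal input.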

Unsupervised Discretization Using Tree-Based Density Estimation

Gabi Schmidberger; Eibe Frank

This paper presents an unsupervised discretization method that performs density estimation for univariate data. The subintervals that the discretization produces can be used as the bins of a histogram. Histograms are a very simple and broadly understood means for displaying data, and our method automatically adapts bin widths to the data. It uses the log-likelihood as the scoring function to select cut points and the cross-validated log-likelihood to select the number of intervals. We compare this method with equal-width discretization where we also select the number of bins using the cross-validated log-likelihood and with equal-frequency discretization.

Keywords: Density Estimation; Training Instance; Split Point; Discretization Tree; Unsupervised Discretization.

- Long Papers | Pp. 240-251
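
The equal-width baseline mentioned in the abstract above, where the number of bins is chosen by cross-validated log-likelihood, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function names, the k-fold scheme, and the add-one smoothing of bin counts are choices made here, and the tree-based method itself is not reproduced.

```python
import math
import random


def histogram_log_likelihood(train, test, n_bins, lo, hi):
    """Average log-density of `test` under an equal-width histogram fit on `train`."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in train:
        # Clamp so that x == hi falls into the last bin.
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    total = 0.0
    for x in test:
        c = counts[min(int((x - lo) / width), n_bins - 1)]
        # Add-one smoothing (an assumption) avoids log(0) for empty bins
        # while keeping the density normalized to 1 over [lo, hi].
        density = (c + 1) / ((len(train) + n_bins) * width)
        total += math.log(density)
    return total / len(test)


def select_bin_count(data, candidates, k=5, seed=0):
    """Return the candidate bin count with the best k-fold log-likelihood."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    lo, hi = min(data), max(data)
    folds = [shuffled[i::k] for i in range(k)]
    best, best_score = None, -math.inf
    for n_bins in candidates:
        score = 0.0
        for i in range(k):
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            score += histogram_log_likelihood(train, folds[i], n_bins, lo, hi)
        if score > best_score:
            best, best_score = n_bins, score
    return best


# Synthetic sample for illustration only.
rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
chosen = select_bin_count(sample, candidates=[1, 2, 4, 8, 16])
```

Scoring on held-out folds penalizes histograms with too many bins, since narrow bins that are empty in the training folds assign low density to held-out points.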

Weighted Average Pointwise Mutual Information for Feature Selection in Text Categorization

Karl-Michael Schneider

Mutual information is a common feature score in feature selection for text categorization. Mutual information suffers from two theoretical problems: it assumes independent word variables, and longer documents are given higher weights in the estimation of the feature scores, in contrast to common evaluation measures, which do not distinguish between long and short documents. We propose a variant of mutual information, called Weighted Average Pointwise Mutual Information (WAPMI), that avoids both problems. We provide theoretical as well as extensive empirical evidence in favor of WAPMI. Furthermore, we show that WAPMI has a useful property that other feature metrics lack: it allows the best feature set size to be selected automatically by maximizing an objective function, which can be done using a simple heuristic without resorting to costly methods like EM and model selection.

Keywords: Feature Selection; Mutual Information; Text Categorization; Feature Ranking; Vocabulary Size.

- Long Papers | Pp. 252-263
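
As background for the abstract above, the sketch below computes standard pointwise mutual information between a word and a class from document counts. It is not WAPMI: the abstract does not specify the weighting scheme, so none is attempted here. The function name and the count-based estimates are assumptions made for illustration.

```python
import math


def pointwise_mutual_information(n_wc, n_w, n_c, n):
    """Standard PMI between a word w and a class c, estimated from counts.

    n_wc: number of documents of class c that contain w
    n_w:  number of documents that contain w
    n_c:  number of documents of class c
    n:    total number of documents

    PMI(w, c) = log( P(w, c) / (P(w) * P(c)) )
              = log( (n_wc * n) / (n_w * n_c) )
    """
    if n_wc == 0:
        return -math.inf  # the word never co-occurs with the class
    return math.log((n_wc * n) / (n_w * n_c))


# A word spread evenly across classes carries no information about c.
independent = pointwise_mutual_information(n_wc=10, n_w=20, n_c=50, n=100)

# A word that occurs only in class c is positively associated with it.
associated = pointwise_mutual_information(n_wc=20, n_w=20, n_c=50, n=100)
```

A score of zero means the word and the class are statistically independent under these estimates; positive scores indicate the word is more frequent in the class than chance would predict.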

Non-stationary Environment Compensation Using Sequential EM Algorithm for Robust Speech Recognition

Haifeng Shen; Jun Guo; Gang Liu; Qunxia Li

The paper presents a non-stationary environment compensation method that uses sequential EM estimation to track a complicated environment. All of the noisy features used in the recognition system are effectively compensated. The speech corruption in the log domain, affecting features such as the 24 log-filterbank coefficients and the log-energy feature, can be modeled as a nonlinear model. To estimate the noise parameters efficiently with the subsequent sequential Expectation-Maximization (EM) algorithm, the nonlinear environment model is linearized by the truncated first-order vector Taylor series (VTS) approximation. Because the cepstral features are nearly independent, we train the clean speech model using cepstral features and the log-energy feature, and then obtain a diagonal Gaussian mixture model in the log domain by taking the inverse discrete cosine transform (IDCT). The experiments are conducted on a large vocabulary continuous speech recognition (LVCSR) system. Results demonstrate that the method achieves attractive improvements when compared with CMN (cepstral mean normalization) and the batch-EM based compensation approach.

Keywords: Speech Recognition; Gaussian Mixture Model; Inverse Discrete Cosine Transform; Clean Speech; Noisy Speech.

- Long Papers | Pp. 264-273
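
The first-order Taylor linearization mentioned in the abstract above can be illustrated for a single log-filterbank channel. The abstract does not state the environment model, so the sketch assumes the commonly used additive-noise form y = x + log(1 + exp(n - x)), with x the clean log-energy and n the noise log-energy; the function names are hypothetical, and the sequential EM estimation itself is not shown.

```python
import math


def corrupt(x, n):
    """Assumed nonlinear environment model for one log-domain channel."""
    return x + math.log1p(math.exp(n - x))


def linearize(x0, n0):
    """Return the first-order Taylor approximation of `corrupt` around (x0, n0)."""
    # Partial derivatives of the assumed model at the expansion point:
    #   dy/dx = 1 / (1 + exp(n0 - x0)),   dy/dn = 1 - dy/dx
    dy_dx = 1.0 / (1.0 + math.exp(n0 - x0))
    dy_dn = 1.0 - dy_dx
    y0 = corrupt(x0, n0)

    def approx(x, n):
        return y0 + dy_dx * (x - x0) + dy_dn * (n - n0)

    return approx


# Expand around a clean log-energy of 2.0 and a noise log-energy of 1.0.
approx = linearize(2.0, 1.0)
exact_nearby = corrupt(2.01, 1.0)
approx_nearby = approx(2.01, 1.0)
```

Near the expansion point the linear form tracks the nonlinear model closely, which is what makes closed-form updates of the noise parameters tractable; the approximation degrades as the features move away from that point.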

Hybrid Cost-Sensitive Decision Tree

Shengli Sheng; Charles X. Ling

The cost-sensitive decision tree and cost-sensitive naïve Bayes are both recently proposed cost-sensitive learning models that minimize the total cost of tests and misclassifications. Each has its advantages and disadvantages. In this paper, we propose a novel cost-sensitive learning model, a hybrid cost-sensitive decision tree called DTNB, which reduces the total cost by integrating the advantages of the cost-sensitive decision tree and cost-sensitive naïve Bayes. We empirically evaluate it over various test strategies, and our experiments show that DTNB significantly outperforms both the cost-sensitive decision tree and cost-sensitive naïve Bayes in minimizing the total cost of tests and misclassifications under the same sequential test strategies and single batch strategies.

Keywords: Decision Tree; Test Strategy; Test Cost; Single Batch; Misclassification Cost.

- Long Papers | Pp. 274-284
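
The objective named in the abstract above, the total cost of tests and misclassifications, can be made concrete with a small sketch. It does not implement DTNB or either baseline: the function names, the cost matrix layout `cost[actual][predicted]`, and all numbers are assumptions chosen for illustration.

```python
def expected_misclassification_cost(probs, cost, predicted):
    """Expected cost of predicting `predicted` given class probabilities.

    probs: mapping from actual class to its estimated probability
    cost:  nested mapping, cost[actual][predicted]
    """
    return sum(p * cost[actual][predicted] for actual, p in probs.items())


def min_cost_label(probs, cost):
    """Choose the label with the lowest expected misclassification cost."""
    return min(probs, key=lambda label: expected_misclassification_cost(probs, cost, label))


def total_cost(test_costs, probs, cost):
    """Cost of the tests performed plus the expected misclassification cost."""
    label = min_cost_label(probs, cost)
    return sum(test_costs) + expected_misclassification_cost(probs, cost, label)


# Illustrative numbers only: a false negative costs ten times a false positive.
probs = {"pos": 0.25, "neg": 0.75}
cost = {
    "pos": {"pos": 0.0, "neg": 100.0},
    "neg": {"pos": 10.0, "neg": 0.0},
}
label = min_cost_label(probs, cost)
overall = total_cost([2.0, 3.0], probs, cost)
```

With these numbers the cheaper decision is the less probable class, because a missed positive is far more expensive than a false alarm; an additional test is worth its cost only if it lowers the expected misclassification cost by more than it adds.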

Characterization of Novel HIV Drug Resistance Mutations Using Clustering, Multidimensional Scaling and SVM-Based Feature Ranking

Tobias Sing; Valentina Svicher; Niko Beerenwinkel; Francesca Ceccherini-Silberstein; Martin Däumer; Rolf Kaiser; Hauke Walter; Klaus Korn; Daniel Hoffmann; Mark Oette; Jürgen K. Rockstroh; Gert Fätkenheuer; Carlo-Federico Perno; Thomas Lengauer

We present a case study on the discovery of clinically relevant domain knowledge in the field of HIV drug resistance. Novel mutations in the HIV genome associated with treatment failure were identified by mining a relational clinical database. Hierarchical cluster analysis suggests that two of these mutations form a novel mutational complex, while all others are involved in known resistance-conferring evolutionary pathways. The clustering is shown to be highly stable in a bootstrap procedure. Multidimensional scaling in mutation space indicates that certain mutations can occur within multiple pathways. Feature ranking based on support vector machines and matched genotype-phenotype pairs comprehensively reproduces current domain knowledge. Moreover, it indicates a prominent role of novel mutations in determining phenotypic resistance and in resensitization effects. These effects may be exploited deliberately to reopen lost treatment options. Together, these findings provide valuable insight into the interpretation of genotypic resistance tests.

Keywords: HIV; clustering; multidimensional scaling; support vector machines; feature ranking.

- Long Papers | Pp. 285-296