Publications catalog - books

From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V. University of Magdeburg, March 9-11, 2005

Myra Spiliopoulou ; Rudolf Kruse ; Christian Borgelt ; Andreas Nürnberger ; Wolfgang Gaul (eds.)

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Not available.

Availability
Detected institution | Year of publication | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-31313-7

Electronic ISBN

978-3-540-31314-4

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer Berlin · Heidelberg 2006

Table of contents

Boosting and ℓ1-Penalty Methods for High-dimensional Data with Some Applications in Genomics

Peter Bühlmann

We consider Boosting and ℓ1-penalty (regularization) methods for prediction and model selection (feature selection) and discuss some relations among the approaches. While Boosting was originally proposed in the machine learning community (Freund and Schapire (1996)), ℓ1-penalization was developed in numerical analysis and statistics (Tibshirani (1996)). Both methods are attractive for very high-dimensional data: they are computationally feasible and statistically consistent (e.g. Bayes risk consistent) even when the number of covariates (predictor variables) p is much larger than the sample size n, provided the true underlying function (mechanism) is sparse: e.g. we allow for arbitrary polynomial growth p = p_n = O(n^γ) for any γ > 0. We demonstrate high-dimensional classification, regression and graphical modeling and outline examples from genomic applications.

- Plenaries and Semi-plenaries | Pp. 1-12
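
As a rough illustration of why an ℓ1 penalty performs feature selection, the sketch below applies the soft-thresholding operator, which solves the one-dimensional problem of minimizing (1/2)(z - b)^2 + λ|b| over b. It is not taken from the chapter: the function names, the example coefficients and the value of λ are all assumptions chosen for illustration, and the coordinate-wise view only matches the full ℓ1-penalized least-squares solution in the special case of an orthonormal design.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + lam * abs(b) over b (lam >= 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def l1_shrink(coefficients, lam):
    """Apply soft-thresholding to each coefficient independently.

    Assumption: orthonormal design, so the penalized problem separates
    into one-dimensional problems, one per coefficient.
    """
    return [soft_threshold(z, lam) for z in coefficients]


# Hypothetical unpenalized estimates: two strong signals, three weak ones.
unpenalized = [3.0, -2.5, 0.4, -0.2, 0.1]
shrunk = l1_shrink(unpenalized, lam=0.5)

# Coefficients whose magnitude does not exceed lam are set exactly to zero.
selected = [i for i, b in enumerate(shrunk) if b != 0.0]
print(shrunk, selected)
```

Coefficients smaller in magnitude than λ are set exactly to zero, which is the sense in which the penalty selects features rather than merely shrinking them.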

Striving for an Adequate Vocabulary: Next Generation ‘Metadata’

Dieter Fellner; Sven Havemann

Digital Libraries (DLs) in general and technical or cultural preservation applications in particular offer a rich set of multimedia objects like audio, music, images, videos, and 3D models. But instead of handling these objects consistently as regular documents, in the same way we handle text documents, most applications handle them differently. This is because ‘standard’ tasks like content categorization, indexing, content representation or summarization have not yet been developed to a stage where DL technology could readily apply them to these types of documents. Instead, these tasks have to be done manually, making the activity almost prohibitively expensive. Consequently, the most pressing research challenge is the development of an adequate ‘vocabulary’ to characterize the content and structure of non-textual documents as the key to indexing, categorization, dissemination and access.

We argue that textual metadata items are insufficient for describing images, videos, 3D models, or audio adequately. A new type of is needed that permits expressing semantic information, which is a prerequisite for retrieving generalized documents based on their content rather than on static textual annotations. The crucial question is which methods and which types of technology will best support the definition of vocabularies and ontologies for non-textual documents.

We present one such method for the domain of 3D models. Our approach makes it possible to differentiate between the structure and the appearance of a 3D model, and we believe that this formalism can be generalized to other types of media.

- Plenaries and Semi-plenaries | Pp. 13-20

Scalable Swarm Based Fuzzy Clustering

Lawrence O. Hall; Parag M. Kanade

Iterative fuzzy clustering algorithms are sensitive to initialization. Swarm based clustering algorithms are able to do a broader search for the best extrema. A swarm inspired clustering approach which searches in the space of fuzzy cluster centroids is discussed. An evaluation function based on fuzzy cluster validity was used. A swarm based clustering algorithm can be computationally intensive, and a data distributed approach to clustering is shown to be effective. It is shown that the swarm based clustering results in excellent data partitions. Further, it is shown that the use of a cluster validity metric as the evaluation function enables the discovery of the number of clusters in the data in an automated way.

- Plenaries and Semi-plenaries | Pp. 21-31

SolEuNet: Selected Data Mining Techniques and Applications

Nada Lavrač

Data mining is concerned with the discovery of interesting patterns and models in data. In practice, data mining has become an established technology with applications in a wide range of areas that include marketing, health care, finance, and environmental planning, as well as e-commerce and e-science. This paper presents selected data mining techniques and applications developed in the course of the SolEuNet 5FP IST project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise (2000–2003).

- Plenaries and Semi-plenaries | Pp. 32-39

Inferred Causation Theory: Time for a Paradigm Shift in Marketing Science?

Josef A. Mazanec

Over the last two decades the analytical toolbox for examining the properties needed to claim causal relationships has been significantly extended. New approaches to the theory of causality rely on the concept of ‘intervention’ instead of ‘association’. Under an axiomatic framework they elaborate the conditions for safe causal inference from nonexperimental data. Inferred Causation Theory (ICT; Spirtes et al., 2000; Pearl, 2000) teaches us that the same independence relationships (or covariance matrix) may have been generated by numerous other graphs representing the cause-effect hypotheses. ICT combines elements of graph theory, statistics, logic, and computer science. It is not limited to parametric models in need of quantitative (ratio or interval scaled) data, but also operates much more generally on the observed conditional independence relationships among a set of qualitative (categorical) observations. Causal inference does not appear to be restricted to experimental data. This is particularly promising for research domains such as consumer behavior where policy makers and managers are unwilling to engage in experiments on real markets. A case example highlights the potential use of Inferred Causation methodology for analyzing the marketing researchers’ belief systems about their scientific orientation.

- Plenaries and Semi-plenaries | Pp. 40-51

Text Mining in Action!

Dunja Mladenič

Text mining methods have been successfully used on different problems where text data is involved. Some text mining approaches are capable of handling text relying only on statistics such as the frequency of words or phrases, while others assume the availability of additional resources such as natural language processing tools for the language in which the text is written, lexicons, ontologies of concepts, aligned corpora in several languages, or additional data sources such as links between the text units or other non-textual data. This paper aims at illustrating the potential of text mining by presenting several approaches having some of the listed properties. For this purpose, we present research applications that were developed mainly inside European projects in collaboration with end-users, and research prototypes that do not necessarily involve end-users.

- Plenaries and Semi-plenaries | Pp. 52-62

Identification of Real-world Objects in Multiple Databases

Mattis Neiling

Object identification is an important issue for the integration of data from different sources. The identification task is complicated if no global and consistent identifier is shared by the sources. Then, object identification can only be performed through the , the objects data provides itself. Unfortunately, real-world data is dirty, hence identification mechanisms like fail mostly; we have to take care of the variations and errors of the data. Consequently, object identification can no longer be guaranteed to be fault-free. Several methods tackle the object identification problem, e.g. , or the .

Based on a novel object identification framework, we assessed data quality and evaluated different methods on real data. One main result is that scalability is determined by the applied preselection technique and the usage of efficient data structures. As another result we can state that achieves better correctness and is more robust than .

- Plenaries and Semi-plenaries | Pp. 63-74

Kernels for Predictive Graph Mining

Stefan Wrobel; Thomas Gärtner; Tamás Horváth

In many application areas, graphs are a very natural way of representing structural aspects of a domain. While most classical algorithms for data analysis cannot directly deal with graphs, recently there has been increasing interest in approaches that can learn general classification models from graph-structured data. In this paper, we summarize and review the line of work that we have been following in recent years on making a particular class of methods suitable for predictive graph mining, namely the so-called kernel methods. Firstly, we state a result on fundamental computational limits to the possible expressive power of kernel functions for graphs. Secondly, we present two alternative graph kernels, one based on in a graph, the other based on and . The paper concludes with an empirical evaluation on a large chemical data set.

- Plenaries and Semi-plenaries | Pp. 75-86

PRISMA: Improving Risk Estimation with Parallel Logistic Regression Trees

Bert Arnrich; Alexander Albert; Jörg Walter

Logistic regression is a very powerful method to estimate models with binary response variables. With the previously suggested combination of tree-based approaches with local, piecewise valid logistic regression models in the nodes, interactions between the covariates are directly conveyed by the tree and can be interpreted more easily. We show that the restriction of partitioning the feature space only at the best attribute limits the overall estimation accuracy. Here we suggest (PRISMA) and demonstrate how the method can significantly improve risk estimation models in heart surgery and successfully perform a benchmark on three UCI data sets.

- Clustering | Pp. 87-94

Latent Class Analysis and Model Selection

José G. Dias

This paper discusses model selection for latent class (LC) models. A large experimental design is set up that allows the performance of different information criteria to be compared for these models, some of them for the first time. Furthermore, the level of separation of latent classes is controlled using a new procedure. The results show that AIC3 (Akaike information criterion with 3 as penalizing factor) outperforms other model selection criteria for LC models.

- Clustering | Pp. 95-102
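
The AIC3 criterion named in the abstract above can be sketched as follows, assuming the usual form of an Akaike-type criterion, -2 log L + c·k, with penalizing factor c = 3 and k the number of free parameters. The function names are illustrative only, and the log-likelihoods and parameter counts are made-up numbers, not results from the chapter.

```python
def information_criterion(log_likelihood, n_params, penalty):
    """Akaike-type criterion: -2 * log L + penalty * k (lower is better)."""
    return -2.0 * log_likelihood + penalty * n_params


def aic(log_likelihood, n_params):
    """Classical AIC, penalizing factor 2."""
    return information_criterion(log_likelihood, n_params, penalty=2.0)


def aic3(log_likelihood, n_params):
    """AIC3, penalizing factor 3."""
    return information_criterion(log_likelihood, n_params, penalty=3.0)


def select_model(candidates, criterion):
    """Return the name of the candidate with the smallest criterion value.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: criterion(*candidates[name]))


# Hypothetical fitted latent class models (illustrative numbers only).
candidates = {
    "2 classes": (-1050.0, 11),
    "3 classes": (-1020.0, 17),
    "4 classes": (-1016.0, 23),
}

best = select_model(candidates, aic3)
print(best, aic3(*candidates[best]))
```

Because the penalty grows with the number of parameters, the small likelihood gain of the hypothetical four-class model does not offset its six extra parameters, so the three-class model is selected in this made-up example.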