Publications catalog - books
Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings
Alípio Mário Jorge ; Luís Torgo ; Pavel Brazdil ; Rui Camacho ; João Gama (eds.)
In conference: 9th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Porto, Portugal, October 3-7, 2005
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
| Detected institution | Publication year | Browse | Download | Request |
|---|---|---|---|---|
| Not detected | 2005 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-29244-9
Electronic ISBN
978-3-540-31665-7
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin Heidelberg 2005
Table of contents
doi: 10.1007/11564126_11
Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results
Ian Davidson; S. S. Ravi
We explore the use of instance and cluster-level constraints with agglomerative hierarchical clustering. Though previous work has illustrated the benefits of using constraints for non-hierarchical clustering, their application to hierarchical clustering is not straightforward for two primary reasons. First, some constraint combinations make the feasibility problem (does there exist a single feasible solution?) NP-complete. Second, some constraint combinations, when used with traditional agglomerative algorithms, can cause the dendrogram to stop prematurely in a dead-end solution even though other feasible solutions exist with a significantly smaller number of clusters. When constraints lead to efficiently solvable feasibility problems and standard agglomerative algorithms do not give rise to dead-end solutions, we empirically illustrate the benefits of using constraints to improve cluster purity and average distortion. Furthermore, we introduce the new γ constraint and use it in conjunction with the triangle inequality to considerably improve the efficiency of agglomerative clustering.
Keywords: Hierarchical Cluster; Triangle Inequality; Feasibility Problem; Agglomerative Cluster; Hierarchical Cluster Algorithm.
- Long Papers | Pp. 59-70
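The dead-end behavior described in the abstract is easy to see in a toy implementation. Below is a minimal sketch, not the authors' algorithm: single-link agglomerative merging over 1-D points that refuses any merge joining a cannot-link pair. All names (`constrained_agglomerative`, `cannot_link`) are illustrative.

```python
def constrained_agglomerative(points, cannot_link, target_k):
    """Single-link agglomerative clustering of 1-D points that skips
    merges which would unite a cannot-link pair (illustrative sketch)."""
    clusters = [{i} for i in range(len(points))]

    def dist(a, b):
        # single-link: distance between the closest pair of members
        return min(abs(points[i] - points[j]) for i in a for j in b)

    def feasible(a, b):
        # a merge is infeasible if it would unite any cannot-link pair
        return not any((u in a and v in b) or (u in b and v in a)
                       for u, v in cannot_link)

    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if feasible(clusters[i], clusters[j]):
                    d = dist(clusters[i], clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        if best is None:
            break  # dead end: no feasible merge remains
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

When no feasible merge remains before `target_k` is reached, the loop stops in exactly the kind of dead-end solution the paper analyzes.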
doi: 10.1007/11564126_12
Cluster Aggregate Inequality and Multi-level Hierarchical Clustering
Chris Ding; Xiaofeng He
We show that (1) in hierarchical clustering, many linkage functions satisfy a cluster aggregate inequality, which allows an exact O(N^2) multi-level (using mutual nearest neighbors) implementation of the standard O(N^3) agglomerative hierarchical clustering algorithm; (2) a desirable close-friends cohesion of clusters can be translated into kNN consistency, which is guaranteed by the multi-level algorithm; (3) for similarity-based linkage functions, the multi-level algorithm is naturally implemented as graph contraction. The effectiveness of our algorithms is demonstrated on a number of real-life applications.
Keywords: Linkage Function; Hierarchical Cluster; Close Friend; Current Cluster; Large Linkage.
- Long Papers | Pp. 71-83
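To make the mutual-nearest-neighbor device concrete: a multi-level pass merges every pair of clusters that are each other's nearest neighbor. The following sketch, assuming 1-D points for simplicity, finds such pairs; `mutual_nn_pairs` is an illustrative name, not from the paper.

```python
def mutual_nn_pairs(points):
    """Return all pairs (i, j), i < j, of indices that are each
    other's nearest neighbour (illustrative sketch)."""
    def nearest(i):
        # index of the point closest to points[i], excluding i itself
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: abs(points[i] - points[j]))
    return {(i, nearest(i)) for i in range(len(points))
            if nearest(nearest(i)) == i and i < nearest(i)}
```

Merging all such pairs in one pass, then recomputing, is what yields the multi-level O(N^2) behavior the abstract refers to.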
doi: 10.1007/11564126_13
Ensembles of Balanced Nested Dichotomies for Multi-class Problems
Lin Dong; Eibe Frank; Stefan Kramer
A system of nested dichotomies is a hierarchical decomposition of a multi-class problem with c classes into c-1 two-class problems and can be represented as a tree structure. Ensembles of randomly generated nested dichotomies have proven to be an effective approach to multi-class learning problems [1]. However, sampling trees by giving each tree equal probability means that the depth of a tree is limited only by the number of classes, and very unbalanced trees can negatively affect runtime. In this paper we investigate two approaches to building balanced nested dichotomies, class-balanced and data-balanced, and evaluate them in the same ensemble setting. Using C4.5 decision trees as the base models, we show that both approaches can reduce runtime with little or no effect on accuracy, especially on problems with many classes. We also investigate the effect of caching models when building ensembles of nested dichotomies.
- Long Papers | Pp. 84-95
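A class-balanced nested dichotomy can be sketched in a few lines: recursively split the class set into near-equal halves, with each internal node standing for one two-class training problem, which yields exactly c-1 binary problems. This is a minimal sketch with illustrative names, not the paper's implementation.

```python
import random

def class_balanced_dichotomy(classes, rng):
    """Build a random class-balanced nested dichotomy as a nested tuple;
    each internal node corresponds to one two-class problem (sketch)."""
    if len(classes) == 1:
        return classes[0]               # leaf: a single class
    shuffled = list(classes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2            # near-equal split of the class set
    return (class_balanced_dichotomy(shuffled[:mid], rng),
            class_balanced_dichotomy(shuffled[mid:], rng))

def count_binary_problems(node):
    """Count internal nodes, i.e. two-class problems to train."""
    if not isinstance(node, tuple):
        return 0
    return 1 + count_binary_problems(node[0]) + count_binary_problems(node[1])
```

Because each node halves the class set, tree depth stays near log2(c) instead of the worst-case c-1 of an unconstrained random tree, which is the source of the runtime savings.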
doi: 10.1007/11564126_14
Protein Sequence Pattern Mining with Constraints
Pedro Gabriel Ferreira; Paulo J. Azevedo
Considering the characteristics of biological sequence databases, which typically have a small alphabet, very long sequences, and a relatively small size (several hundred sequences), we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and results from the combination of some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing a smooth and direct introduction of various types of features into the mining process, namely the extraction of rigid- and arbitrary-gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, has shown that our algorithm outperforms state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying patterns of interest to the user.
Keywords: Minimum Support; Support Variation; Sequence Pattern Mining; Alphabet Size; Frequent Pair.
- Long Papers | Pp. 96-107
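The arbitrary-gap patterns mentioned above can be illustrated with a simple support counter: a pattern matches a sequence if its symbols occur in order, with any number of residues allowed in between. A minimal sketch under that assumption (function names are illustrative, not gIL's API):

```python
def occurs_with_gaps(pattern, sequence):
    """True if the pattern's symbols appear in order in the sequence,
    with arbitrary gaps allowed between them (sketch)."""
    pos = 0
    for sym in pattern:
        idx = sequence.find(sym, pos)
        if idx < 0:
            return False
        pos = idx + 1
    return True

def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(occurs_with_gaps(pattern, s) for s in database)
```

A rigid-gap variant would instead fix the distance between consecutive symbols; a minimum-support constraint simply discards patterns whose `support` falls below a threshold.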
doi: 10.1007/11564126_15
An Adaptive Nearest Neighbor Classification Algorithm for Data Streams
Yan-Nei Law; Carlo Zaniolo
In this paper, we propose an incremental classification algorithm which uses a multi-resolution data representation to find adaptive nearest neighbors of a test point. The algorithm achieves excellent performance by using small classifier ensembles where approximation error bounds are guaranteed for each ensemble size. The very low update cost of our incremental classifier makes it highly suitable for data stream applications. Tests performed on both synthetic and real-life data indicate that our new classifier outperforms existing algorithms for data streams in terms of accuracy and computational costs.
Keywords: Feature Space; Data Stream; Test Point; Nearest Neighbor; Training Point.
- Long Papers | Pp. 108-120
doi: 10.1007/11564126_16
Support Vector Random Fields for Spatial Classification
Chi-Hoon Lee; Russell Greiner; Mark Schmidt
In this paper we propose Support Vector Random Fields (SVRFs), an extension of Support Vector Machines (SVMs) that explicitly models spatial correlations in multi-dimensional data. SVRFs are derived as Conditional Random Fields that take advantage of the generalization properties of SVMs. We also propose improvements to computing posterior probability distributions from SVMs, and present a local-consistency potential measure that encourages spatial continuity. SVRFs can be efficiently trained, converge quickly during inference, and can be trivially augmented with kernel functions. SVRFs are more robust to class imbalance than Discriminative Random Fields (DRFs), and are more accurate near edges. Our results on synthetic data and a real-world tumor detection task show the superiority of SVRFs over both SVMs and DRFs.
Keywords: Support Vector Machine; Support Vector; Class Label; Conditional Random Field; Class Imbalance.
- Long Papers | Pp. 121-132
doi: 10.1007/11564126_17
Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication
Jurij Leskovec; Deepayan Chakrabarti; Jon Kleinberg; Christos Faloutsos
How can we generate realistic graphs? In addition, how can we do so with a mathematically tractable model that makes it feasible to analyze their properties rigorously? Real graphs obey a long list of surprising properties: Heavy tails for the in- and out-degree distribution; heavy tails for the eigenvalues and eigenvectors; small diameters; and the recently discovered “Densification Power Law” (DPL). All published graph generators either fail to match several of the above properties, are very complicated to analyze mathematically, or both. Here we propose a graph generator that is mathematically tractable and matches this collection of properties. The main idea is to use a non-standard matrix operation, the Kronecker product , to generate graphs that we refer to as “Kronecker graphs”. We show that Kronecker graphs naturally obey all the above properties; in fact, we can rigorously prove that they do so. We also provide empirical evidence showing that they can mimic very well several real graphs.
Keywords: Random Graph; Degree Distribution; Kronecker Product; Heavy Tail; Scree Plot.
- Long Papers | Pp. 133-145
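The core construction is just the k-th Kronecker power of a small initiator adjacency matrix. A pure-Python sketch follows; the 2x2 initiator in the test is an arbitrary example, not one fitted in the paper.

```python
def kron(A, B):
    """Kronecker product of two 0/1 adjacency matrices (lists of lists)."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def kronecker_graph(initiator, k):
    """Adjacency matrix of the k-th Kronecker power of the initiator."""
    G = initiator
    for _ in range(k - 1):
        G = kron(G, initiator)
    return G
```

A useful sanity check: the number of edges (ones) in the k-th power is the initiator's edge count raised to the k-th power, which is one source of the model's analytical tractability.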
doi: 10.1007/11564126_18
A Correspondence Between Maximal Complete Bipartite Subgraphs and Closed Patterns
Jinyan Li; Haiquan Li; Donny Soh; Limsoon Wong
For an undirected graph G without self-loops, we prove: (i) that the number of closed patterns in the adjacency matrix of G is even; (ii) that the number of closed patterns is precisely double the number of maximal complete bipartite subgraphs of G; (iii) that for every maximal complete bipartite subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, we can enumerate all maximal complete bipartite subgraphs by using efficient algorithms for mining closed patterns, which have been extensively studied in the data mining field.
Keywords: Bipartite Graph; Association Rule; Adjacency Matrix; Undirected Graph; Protein Interaction Network.
- Long Papers | Pp. 146-156
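The correspondence can be checked by brute force on a tiny graph: a vertex set C is a closed pattern of the adjacency matrix when the common neighborhood of its supporting rows is exactly C. This sketch (names illustrative, exponential-time, for demonstration only) enumerates closed patterns and exhibits the even count from result (i).

```python
from itertools import combinations

def closed_patterns(adj):
    """Brute-force enumeration of closed patterns in the adjacency
    matrix of an undirected graph without self-loops (sketch).
    adj[v] is the set of neighbours of vertex v."""
    n = len(adj)
    closed = set()
    for r in range(1, n + 1):
        for cols in combinations(range(n), r):
            C = set(cols)
            rows = [v for v in range(n) if C <= adj[v]]  # supporting rows
            if not rows:
                continue
            # closure: the columns shared by every supporting row
            closure = set.intersection(*(adj[v] for v in rows))
            if closure == C:
                closed.add(frozenset(C))
    return closed
```

On the path graph 0-1-2 the closed patterns are {1} and {0, 2}: a single pair, matching the one maximal complete bipartite subgraph with vertex sets {1} and {0, 2}.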
doi: 10.1007/11564126_19
Improving Generalization by Data Categorization
Ling Li; Amrit Pratap; Hsuan-Tien Lin; Yaser S. Abu-Mostafa
In most learning algorithms, examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than others, and some may carry wrong information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely the selection cost, SVM confidence margin, and AdaBoost data weight, to automatically group training examples into these three categories. Experimental results on artificial datasets show that, although the three methods have quite different natures, they give similar and reasonable categorizations. Results on real-world datasets further demonstrate that treating the three data categories differently in learning can improve generalization.
Keywords: Support Vector Machine; Learning Algorithm; Target Function; Selection Cost; Intrinsic Function.
- Long Papers | Pp. 157-168
doi: 10.1007/11564126_20
Mining Model Trees from Spatial Data
Donato Malerba; Michelangelo Ceci; Annalisa Appice
Mining regression models from spatial data is a fundamental task in spatial data mining. We propose a method, namely Mrs-SMOTI, that takes advantage of a tight integration with spatial databases and mines regression models in the form of trees in order to partition the sample space. The method is characterized by three aspects. First, it is able to capture both spatially global and local effects of explanatory attributes. Second, the explanatory attributes that influence the response attribute do not necessarily come from a single layer. Third, the geometrical representation and relative positioning of spatial objects with respect to a reference system implicitly define both spatial relationships and properties. An application to real-world spatial data is reported.
Keywords: Spatial Data; Target Object; Response Attribute; Spatial Database; Spatial Object.
- Long Papers | Pp. 169-180