Publications catalog - books
Machine Learning and Data Mining in Pattern Recognition: 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007. Proceedings
Petra Perner (ed.)
In conference: 5th International Workshop on Machine Learning and Data Mining in Pattern Recognition (MLDM). Leipzig, Germany. July 18, 2007 - July 20, 2007
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Database Management; Data Mining and Knowledge Discovery; Pattern Recognition; Image Processing and Computer Vision
Availability
| Detected institution | Publication year | Browse | Download | Request |
|---|---|---|---|---|
| Not detected | 2007 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-73498-7
Electronic ISBN
978-3-540-73499-4
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2007
Publication rights information
© Springer-Verlag Berlin Heidelberg 2007
Table of contents
Choosing the Kernel Parameters for the Directed Acyclic Graph Support Vector Machines
Kuo-Ping Wu; Sheng-De Wang
Directed acyclic graph support vector machines (DAGSVMs) have been shown to provide classification accuracy comparable to standard multiclass SVM extensions such as Max Wins methods. The algorithm arranges binary SVM classifiers as the internal nodes of a directed acyclic graph (DAG). Each node represents a classifier trained on the data of a pair of classes with a specific kernel. The most popular method for choosing the kernel parameters is grid search. In the training process, classifiers are trained with different kernel parameters, and only one of the classifiers is required for the testing process. This makes the training process time-consuming. In this paper we propose using separation indexes to estimate the generalization ability of the classifiers. These indexes are derived from the inter-cluster distances in the feature spaces. Calculating such indexes costs much less computation time than training the corresponding SVM classifiers; thus the proper kernel parameters can be chosen much faster. Experimental results show that the testing accuracy of the resulting DAGSVMs is competitive with that of the standard ones, and the training time can be significantly shortened.
- Support Vector Machine | Pp. 276-285
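The idea of scoring kernel parameters by an inter-cluster distance in feature space, instead of training an SVM for each candidate, can be illustrated with a minimal sketch. It assumes an RBF kernel and uses the squared distance between the two class means in feature space as the index; the abstract does not specify the paper's exact indexes, so `separation_index`, `pick_gamma`, and the toy data are hypothetical.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def separation_index(class_a, class_b, gamma):
    """Squared distance between the class means in the kernel feature space.

    Assumed index for illustration; computed from kernel values only,
    so no classifier has to be trained.
    """
    return (mean_kernel(class_a, class_a, gamma)
            + mean_kernel(class_b, class_b, gamma)
            - 2.0 * mean_kernel(class_a, class_b, gamma))

def pick_gamma(class_a, class_b, candidates):
    """Return the candidate gamma with the largest separation index."""
    return max(candidates, key=lambda g: separation_index(class_a, class_b, g))

# Toy two-class data (hypothetical).
class_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
class_b = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
best = pick_gamma(class_a, class_b, [0.01, 0.1, 1.0, 100.0])
print(best)
```

In a DAGSVM, a selection like this would be repeated for each pair of classes, one per internal node.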
Data Selection Using SASH Trees for Support Vector Machines
Chaofan Sun; Ricardo Vilalta
This paper presents a data preprocessing procedure to select support vector (SV) candidates. We select decision boundary region vectors (BRVs) as SV candidates. Without the need to use the decision boundary, BRVs can be selected based on a vector’s nearest neighbor of opposite class (NNO). To speed up the process, two spatial approximation sample hierarchical (SASH) trees are used for estimating the BRVs. Empirical results show that our data selection procedure can reduce a full dataset to the number of SVs or only slightly higher. Training with the selected subset gives performance comparable to that of the full dataset. For large datasets, overall time spent in selecting and training on the smaller dataset is significantly lower than the time used in training on the full dataset.
- Support Vector Machine | Pp. 286-295
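Selecting boundary candidates through each vector's nearest neighbor of opposite class (NNO) can be sketched as follows. This is a brute-force illustration, not the SASH-based approximation described in the abstract, and the rule "keep every point that is the NNO of some other point" is an assumption; `boundary_candidates` is a hypothetical name.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def boundary_candidates(points, labels):
    """Indices of points that are the nearest opposite-class neighbor of some point.

    Brute-force O(n^2) search; a SASH tree would approximate the same
    nearest-neighbor queries faster on large datasets.
    """
    selected = set()
    for x, lx in zip(points, labels):
        opposite = [j for j, ly in enumerate(labels) if ly != lx]
        if not opposite:
            continue  # only one class present, nothing to select
        nno = min(opposite, key=lambda j: sq_dist(x, points[j]))
        selected.add(nno)
    return sorted(selected)

# Toy 1-D data (hypothetical): class 0 on the left, class 1 on the right.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 0, 1, 1, 1]
candidates = boundary_candidates(points, labels)
print(candidates)
```

Only the two points facing each other across the class gap are kept, which is the kind of reduced training set the abstract describes.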
Dynamic Distance-Based Active Learning with SVM
Jun Jiang; Horace H. S. Ip
In this paper, we present a novel active learning strategy, named dynamic active learning with SVM, to improve the effectiveness of sample selection in active learning. The algorithm is divided into two steps. The first step is similar to standard distance-based active learning with SVM [1], in which the sample nearest to the decision boundary is chosen to induce a hyperplane that can halve the current version space. To improve learning efficiency and the convergence rate, we propose in the second step a dynamic sample selection strategy that operates within the neighborhood of the “standard” sample. Theoretical analysis is given to show that our algorithm converges faster than the standard distance-based technique and uses fewer samples while maintaining the same classification precision. We also demonstrate the feasibility of the dynamic selection strategy through experiments on several benchmark datasets.
- Support Vector Machine | Pp. 296-309
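The first step, choosing the unlabeled sample nearest to the decision boundary, can be sketched for a linear decision function. The sketch assumes a hyperplane given by hypothetical weights `w` and bias `b`; the dynamic second step is not shown because the abstract does not describe its details.

```python
import math

def margin_distance(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm

def nearest_to_boundary(w, b, pool):
    """Index of the unlabeled sample closest to the decision boundary."""
    return min(range(len(pool)), key=lambda i: margin_distance(w, b, pool[i]))

# Hypothetical linear model and unlabeled pool.
w, b = (1.0, -1.0), 0.0
pool = [(3.0, 0.0), (1.0, 0.9), (0.0, 2.0)]
query_index = nearest_to_boundary(w, b, pool)
print(query_index)
```

The selected sample would then be labeled by the oracle and added to the training set before retraining.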
Off-Line Learning with Transductive Confidence Machines: An Empirical Evaluation
Stijn Vanderlooy; Laurens van der Maaten; Ida Sprinkhuizen-Kuyper
The recently introduced transductive confidence machines (TCMs) framework allows classifiers to be extended such that they satisfy the calibration property. This means that the error rate can be set by the user prior to classification. An analytical proof of the calibration property was given for TCMs applied in the on-line learning setting. However, the nature of this learning setting restricts the applicability of TCMs. In this paper we provide strong empirical evidence that the calibration property also holds in the off-line learning setting. Our results extend the range of applications in which TCMs can be applied. We may conclude that TCMs are appropriate in virtually any application domain.
- Transductive Inference | Pp. 310-323
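How a user-chosen error rate turns into predictions can be sketched with the p-value construction commonly used in conformal-style predictors. This is a simplified sketch under stated assumptions: nonconformity scores are taken as given, one shared score list is used for all labels, and `p_value`, `prediction_set`, and the numbers are hypothetical rather than taken from the paper.

```python
def p_value(reference_scores, test_score):
    """Fraction of scores (test example included) at least as nonconforming."""
    n_ge = sum(1 for s in reference_scores if s >= test_score)
    return (n_ge + 1) / (len(reference_scores) + 1)

def prediction_set(reference_scores, test_scores, epsilon):
    """Labels whose p-value exceeds the significance level epsilon.

    test_scores maps each tentative label to the nonconformity score the
    test example would have if it carried that label.
    """
    return sorted(label for label, score in test_scores.items()
                  if p_value(reference_scores, score) > epsilon)

# Hypothetical nonconformity scores of previously seen examples.
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
test_scores = {"spam": 0.25, "ham": 0.95}
labels = prediction_set(reference_scores, test_scores, epsilon=0.2)
print(labels)
```

Lowering `epsilon` makes the predictor more cautious, so prediction sets grow while the tolerated error rate shrinks.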
Transductive Learning from Relational Data
Michelangelo Ceci; Annalisa Appice; Nicola Barile; Donato Malerba
Transduction is an inference mechanism “from particular to particular”. Its application to classification tasks implies the use of both labeled (training) data and unlabeled (working) data to build a classifier whose main goal is to classify (only) the unlabeled data as accurately as possible. Unlike the classical inductive setting, no general rule valid for all possible instances is generated. Transductive learning is best suited to applications where the examples for which a prediction is needed are already known when training the classifier. Several approaches have been proposed in the literature for building transductive classifiers from data stored in a single table of a relational database. Nonetheless, no attention has been paid to the application of the transduction principle in a (multi-)relational setting, where data are stored in multiple tables of a relational database. In this paper we propose a new transductive classifier, named TRANSC, which is based on a probabilistic approach to making transductive inferences from relational data. This new method works in a transductive setting and employs a principled probabilistic classification in multi-relational data mining to face the challenges posed by some spatial data mining problems. Probabilistic inference allows us to compute the class probability and return, in addition to the result of transductive classification, the confidence in the classification. The predictive accuracy of TRANSC has been compared to that of its inductive counterpart in an empirical study involving both a benchmark relational dataset and two spatial datasets. The results obtained are generally in favor of TRANSC, although the improvements are small.
- Transductive Inference | Pp. 324-338
A Novel Rule Ordering Approach in Classification Association Rule Mining
Yanbo J. Wang; Qin Xin; Frans Coenen
A Classification Association Rule (CAR), a common type of mined knowledge in Data Mining, describes an implicative co-occurring relationship between a set of binary-valued data-attributes (items) and a pre-defined class, expressed in the form of an “antecedent ⇒ consequent-class” rule. Classification Association Rule Mining (CARM) is a recent Classification Rule Mining (CRM) approach that builds an Association Rule Mining (ARM) based classifier using CARs. Regardless of which particular methodology is used to build it, a classifier is usually presented as an ordered CAR list, based on an applied rule ordering strategy. Five existing rule ordering mechanisms can be identified: (1) Confidence-Support-size_of_Antecedent (CSA), (2) size_of_Antecedent-Confidence-Support (ACS), (3) Weighted Relative Accuracy (WRA), (4) Laplace Accuracy, and (5) χ² Testing. In this paper, we divide the above mechanisms into two groups: (i) pure “support-confidence” framework like, and (ii) additive score assigning like. We consequently propose a hybrid rule ordering approach by combining one approach taken from (i) and another approach taken from (ii). The experimental results show that the proposed rule ordering approach performs well with respect to the accuracy of classification.
- Association Rule Mining | Pp. 339-348
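Presenting a classifier as an ordered CAR list can be sketched with the CSA mechanism. The sort directions used here (confidence descending, then support descending, then antecedent size ascending) are an assumption, since the abstract names the mechanism without spelling them out; the `Rule` type and the sample rules are hypothetical, and the paper's hybrid ordering is not shown.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset   # set of items
    label: str              # consequent class
    confidence: float
    support: float

def order_csa(rules):
    """Order rules by confidence, then support, then antecedent size.

    Assumed directions: higher confidence first, higher support first,
    shorter antecedent first.
    """
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.antecedent)))

# Hypothetical mined rules.
rules = [
    Rule(frozenset({"a", "b"}), "r1", 0.90, 0.10),
    Rule(frozenset({"a"}), "r2", 0.90, 0.20),
    Rule(frozenset({"c"}), "r3", 0.95, 0.05),
    Rule(frozenset({"a", "b"}), "r4", 0.90, 0.20),
]
ordered = [r.label for r in order_csa(rules)]
print(ordered)
```

A new case would then be classified by the first rule in the list whose antecedent it satisfies.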
Distributed and Shared Memory Algorithm for Parallel Mining of Association Rules
J. Hernández Palancar; O. Fraxedas Tormo; J. Festón Cárdenas; R. Hernández León
The search for frequent patterns in transactional databases is considered one of the most important data mining problems. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the dataset to determine the set of frequent itemsets, thus implying high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also implying high synchronization cost. We present a novel algorithm that efficiently exploits the trade-offs between computation, communication, memory usage and synchronization. The algorithm was implemented on a cluster of SMP nodes, combining the distributed and shared memory paradigms. This paper presents the results of our algorithm for different data sizes and different numbers of processors, and studies the effect of these variations on the overall performance.
- Association Rule Mining | Pp. 349-363
Analyzing the Performance of Spam Filtering Methods When Dimensionality of Input Vector Changes
J. R. Méndez; B. Corzo; D. Glez-Peña; F. Fdez-Riverola; F. Díaz
Spam is a complex problem that makes the exploitation of Internet resources difficult. Several authorities have warned about the scale of this problem and urged everybody to fight against it. In this paper we present an extensive analysis showing how changing the dimensionality of message representation influences the accuracy of some well-known classical spam filtering techniques. The conclusions drawn from the experiments carried out will be useful for building a comparison of the dimensionality reorganization effects between classical filtering techniques and a successful spam filter model called .
- Mining Spam, Newsgroups, Blogs | Pp. 364-378
Blog Mining for the Fortune 500
James Geller; Sapankumar Parikh; Sriram Krishnan
In recent years there has been a tremendous increase in the number of users maintaining online blogs on the Internet. Companies, in particular, have become aware of this medium of communication and have taken a keen interest in what is being said about them through such personal blogs. This has given rise to a new field of research directed towards mining useful information from a large amount of unformatted data present in online blogs and online forums. We discuss an implementation of such a blog mining application. The application is broadly divided into two parts, the indexing process and the search module. Blogs pertaining to different organizations are fetched from a particular blog domain on the Internet. After analyzing the textual content of these blogs they are assigned a sentiment rating. Specific data from such blogs along with their sentiment ratings are then indexed on the physical hard drive. The search module searches through these indexes at run time for the input organization name and produces a list of blogs conveying both positive and negative sentiments about the organization.
- Mining Spam, Newsgroups, Blogs | Pp. 379-391
A Link-Based Rank of Postings in Newsgroup
Hongbo Liu; Jiahai Yang; Jiaxin Wang; Yu Zhang
Discussion systems such as Usenet, BBS and forums are important resources for information sharing, exchange of views, problem solving and product feedback on the Internet. The postings in newsgroups on Usenet represent the judgments and choices of the participants. The structure of the postings can provide helpful information for users. In this paper, we present a method called PostRank to rank postings based on the structure of the newsgroup. Its results correspond to the eigenvectors of the transition probability matrix and the stationary vectors of the Markov chains. It can provide useful global information about the newsgroup and can help users access the information in it more effectively and efficiently. This method can also be applied to other discussion systems. Some experimental results and discussions on real data sets collected by us are also provided.
- Mining Spam, Newsgroups, Blogs | Pp. 392-403
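Ranking by the stationary vector of a Markov chain can be sketched with power iteration on a row-stochastic transition matrix. How PostRank builds its transition probabilities from the newsgroup structure is not stated in the abstract, so the matrix below is a hypothetical example and `stationary_vector` is an illustrative name.

```python
def stationary_vector(P, tol=1e-12, max_iter=10000):
    """Approximate pi with pi = pi P by power iteration.

    P is a row-stochastic matrix given as a list of rows. Convergence is
    assumed, which holds for irreducible, aperiodic chains.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical two-posting chain; the self-loop on posting 0 makes it aperiodic.
P = [[0.5, 0.5],
     [1.0, 0.0]]
scores = stationary_vector(P)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking)
```

The stationary vector is the left eigenvector of the transition matrix for eigenvalue 1, which matches the correspondence the abstract mentions.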