Publications catalog - books



Biological and Medical Data Analysis: 7th International Symposium, ISBMDA 2006, Thessaloniki, Greece, December 7-8, 2006. Proceedings

Nicos Maglaveras ; Ioanna Chouvarda ; Vassilis Koutkias ; Rüdiger Brause (eds.)

Conference: 7th International Symposium on Biological and Medical Data Analysis (ISBMDA). Thessaloniki, Greece. December 7-8, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Biomedicine general; Data Mining and Knowledge Discovery; Artificial Intelligence (incl. Robotics); Information Storage and Retrieval; Probability and Statistics in Computer Science; Computational Biology/Bioinformatics

Availability
Detected institution | Year of publication | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-68063-5

Electronic ISBN

978-3-540-68065-9

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

HLA and HIV Infection Progression: Application of the Minimum Description Length Principle to Statistical Genetics

Peter T. Hraber; Bette T. Korber; Steven Wolinsky; Henry A. Erlich; Elizabeth A. Trachtenberg; Thomas B. Kepler

The minimum description length (MDL) principle was developed in the context of computational complexity and coding theory. It states that the best model to account for some data minimizes the sum of the lengths, in bits, of the descriptions of the model and the data as encoded via the model. The MDL principle gives a criterion for parameter selection, by using the description length as a test statistic. Class I HLA genes play a major role in the immune response to HIV, and are known to be associated with rates of progression to AIDS. However, these genes are extremely polymorphic, making it difficult to associate alleles with disease outcome, given statistical issues of multiple testing. Application of the MDL principle to immunogenetic data from a longitudinal cohort study (Chicago MACS) enables classification of alleles associated with plasma HIV RNA abundance, an indicator of infection progression. Variation in progression is strongly associated with HLA-B. Allele associations with viral levels support and extend previous studies. In particular, individuals without supertype alleles average viral RNA levels 3.6 times greater than individuals with them. Mechanisms for these associations include variation in epitope specificity and selection that favors rare alleles.

- Bioinformatics: Functional Genomics | Pp. 1-12
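
The two-part description length described in the abstract above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simple Bernoulli model per group, charges the common asymptotic cost of (1/2) log2 n bits per fitted parameter, and uses hypothetical counts (for example, individuals with a high viral load among carriers and non-carriers of an allele). The function names are illustrative only.

```python
import math

def bernoulli_code_length(successes, n):
    """Bits needed to encode n binary outcomes under the best-fitting Bernoulli model."""
    if n == 0 or successes in (0, n):
        return 0.0
    p = successes / n
    return -(successes * math.log2(p) + (n - successes) * math.log2(1 - p))

def description_length(groups, n_total):
    """Two-part length: model bits (assumed 0.5 * log2 n per parameter) plus data bits."""
    model_bits = 0.5 * len(groups) * math.log2(n_total)
    data_bits = sum(bernoulli_code_length(s, n) for s, n in groups)
    return model_bits + data_bits

def prefer_split(group_a, group_b):
    """True if giving each group its own rate yields a shorter total description."""
    (sa, na), (sb, nb) = group_a, group_b
    n_total = na + nb
    pooled = description_length([(sa + sb, n_total)], n_total)
    split = description_length([group_a, group_b], n_total)
    return split < pooled

# Hypothetical counts: (individuals with high viral load, group size).
clearly_different = prefer_split((45, 50), (5, 50))
nearly_identical = prefer_split((25, 50), (26, 50))
```

The extra parameter of the split model is only worth its cost when the two groups differ enough to shorten the encoded data, which is how a description-length criterion guards against over-fitting without a separate multiple-testing correction.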

Visualization of Functional Aspects of microRNA Regulatory Networks Using the Gene Ontology

Alkiviadis Symeonidis; Ioannis G. Tollis; Martin Reczko

The post-transcriptional regulation of genes by microRNAs (miRNAs) is a recently discovered mechanism of growing importance. To uncover functional relations between genes regulated by the same miRNA or groups of miRNAs, we suggest the simultaneous visualization of the miRNA regulatory network and the Gene Ontology (GO) categories of the targeted genes. The miRNA regulatory network is shown using circular drawings and the GO is visualized using treemaps. The GO categories of the genes targeted by user-selected miRNAs are highlighted in the treemap showing the complete GO hierarchy or selected branches of it. With this visualization method, patterns of recurring categories can easily be identified, supporting the discovery of the functional role of miRNAs. Executables for MS-Windows are available under

- Bioinformatics: Functional Genomics | Pp. 13-24

A Novel Method for Classifying Subfamilies and Sub-subfamilies of G-Protein Coupled Receptors

Majid Beigi; Andreas Zell

G-protein coupled receptors (GPCRs) are a large superfamily of integral membrane proteins that transduce signals across the cell membrane. Because of this property and the other physiological roles of the GPCR family, they have been an important target of therapeutic drugs. The function of many GPCRs is not known, and accurate classification of GPCRs can help to predict their function. In this study we suggest a kernel-based method to classify them at the subfamily and sub-subfamily level. To enhance the accuracy and sensitivity of the classifiers at the sub-subfamily level, where only a low number of sequences was available (imbalanced data), we used our new synthetic protein sequence oversampling (SPSO) algorithm. At the subfamily level we obtained an overall accuracy and Matthews correlation coefficient (MCC) of 98.4% and 0.98 for class A, nearly 100% and 1 for class B, and 96.95% and 0.91 for class C, respectively; at the sub-subfamily level the overall accuracy and MCC were 97.93% and 0.95. The results show that our oversampling technique can be used for other protein classification applications that suffer from imbalanced data.

- Bioinformatics: Functional Genomics | Pp. 25-36
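
The abstract above reports both overall accuracy and MCC. A small sketch of the two measures, computed from a binary confusion matrix, shows why MCC is the more informative figure on imbalanced data. The counts are hypothetical, and returning 0.0 when the denominator vanishes is a convention assumed here.

```python
import math

def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0.0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Hypothetical imbalanced test set: 10 positives, 90 negatives.
always_negative = accuracy_and_mcc(tp=0, tn=90, fp=0, fn=10)
half_recall = accuracy_and_mcc(tp=5, tn=90, fp=0, fn=5)
```

A classifier that never predicts the minority class still reaches 90% accuracy on this data, but its MCC is 0, which is why the two figures are usually quoted together.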

Integration Analysis of Diverse Genomic Data Using Multi-clustering Results

Hye-Sung Yoon; Sang-Ho Lee; Sung-Bum Cho; Ju Han Kim

In modern data mining applications, clustering algorithms are among the most important approaches, because these algorithms group elements in a dataset according to their similarities, and they do not require any class label information. In recent years, various methods for ensemble selection and clustering result combinations have been designed to optimize clustering results. Moreover, conducting data analysis using multiple sources, given the complexity of data objects, is a much more powerful method than evaluating each source separately. Therefore, a new paradigm is required that combines the genome-wide experimental results of multi-source datasets. However, multi-source data analysis is more difficult than single source data analysis. In this paper, we propose a new clustering ensemble approach for multi-source bio-data on complex objects. In addition, we present encouraging clustering results in a real bio-dataset examined using our proposed method.

- Bioinformatics: Functional Genomics | Pp. 37-48

Effectivity of Internal Validation Techniques for Gene Clustering

Chunmei Yang; Baikun Wan; Xiaofeng Gao

Clustering is a major exploratory technique for gene expression data in the post-genomic era. As essential tools within cluster analysis, cluster validation techniques can assess the quality of clustering results and the performance of clustering algorithms, which helps in interpreting clustering results. In this work, the validation ability of the Silhouette index, Dunn’s index, the Davies-Bouldin index and FOM in gene clustering was investigated with public gene expression datasets clustered by hierarchical single-linkage and average-linkage clustering, K-means and SOMs. The results show that the Silhouette index and FOM are preferable for validating the performance of clustering algorithms and the quality of clustering results. Dunn’s index should not be used directly in gene clustering validation because of its high susceptibility to outliers, while the Davies-Bouldin index affords better validation than Dunn’s index, except for its preference for hierarchical single-linkage clustering.

- Bioinformatics: Sequence and Structure Analysis | Pp. 49-59
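
As an illustration of one of the internal validation measures named above, the following sketch computes the mean Silhouette width, s(i) = (b - a) / max(a, b), where a is the mean distance from point i to its own cluster and b is the smallest mean distance to any other cluster. It assumes Euclidean distance, at least two clusters, and a width of 0 for singleton clusters. The toy points are hypothetical, not gene expression data.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette width over all points (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return sum(widths) / len(widths)

# Hypothetical 2-D points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])
bad = silhouette_index(points, [0, 1, 0, 1])
```

A partition that matches the geometry scores close to 1, while one that mixes the two pairs scores below 0.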

Intrinsic Splicing Profile of Human Genes Undergoing Simple Cassette Exon Events

Andigoni Malousi; Vassilis Koutkias; Sofia Kouidou; Nicos Maglaveras

Alternative pre-mRNA splicing presides over protein diversity and organism complexity. Alternative splicing isoforms in humans have been associated with specific developmental stages, tissue-specific expression and disease-causing factors. In this study, we identified and analysed intrinsic features that discriminate non-conserved human genes that undergo a single internal cassette exon event from constitutively spliced exons. Context-based analysis revealed a guanine-rich track at the donor of the cassette’s upstream intronic region that is absent in the constitutive dataset, as well as significant differences in the distribution of CpG and A3/G3 sequences between the alternative and the constitutive intronic regions. Interestingly, introns flanking cassette exons are larger than the constitutive ones, while exon lengths do not vary significantly. Splice sites flanking cassette exons are less identifiable, while splice sites at the outer ends are ‘stronger’ than constitutive introns. The results indicate that specific intrinsic features are linked with the inclusion/excision of internal exons, which are indicative of the underlying selection rules.

- Bioinformatics: Sequence and Structure Analysis | Pp. 60-71

Generalization Rules for Binarized Descriptors

Jürgen Paetz

Virtual screening of molecules is one of the hot topics in life science. Often, molecules are encoded by descriptors with numerical values as a basis for finding regions with a high enrichment of active molecules compared to non-active ones. In this contribution we demonstrate that a simpler binary version of a descriptor can be used for this task as well with similar classification performance, saving computational and memory resources. To generate binary valued rules for virtual screening, we used the GenIntersect algorithm that heuristically determines common properties of the binary descriptor vectors. The results are compared to the ones achieved with numerical rules of a neuro-fuzzy system.

- Bioinformatics: Sequence and Structure Analysis | Pp. 72-82

Application of Combining Classifiers Using Dynamic Weights to the Protein Secondary Structure Prediction – Comparative Analysis of Fusion Methods

Tomasz Woloszynski; Marek Kurzynski

We introduce a common framework for classifier fusion methods that use dynamic weights in the decision-making process. Both weighted average combiners with dynamic weights and combiners that dynamically estimate local competence are considered. A few algorithms presented in the literature are shown to fit our model. In addition, we propose two new methods for combining classifiers. The problem of protein secondary structure prediction was selected as a benchmark test. Experiments comparing the fusion algorithms were carried out on a previously prepared dataset of non-homologous proteins. The results show that the developed framework generalizes dynamic weighting approaches and should be investigated further.

- Bioinformatics: Sequence and Structure Analysis | Pp. 83-91
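
The idea of a weighted average combiner whose weights depend on the input can be sketched as follows. This is a generic illustration, not one of the methods from the paper: it assumes each classifier is a function returning class probabilities, estimates local competence as the accuracy on the k validation points nearest to the query, and falls back to equal weights when every competence is zero. All names and data are hypothetical.

```python
import math

def predict_label(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

def local_competence(classifier, x, validation, k=3):
    """Accuracy of the classifier on the k validation points nearest to x."""
    nearest = sorted(validation, key=lambda item: math.dist(item[0], x))[:k]
    hits = sum(predict_label(classifier(feats)) == label for feats, label in nearest)
    return hits / len(nearest)

def combine(classifiers, x, validation, k=3):
    """Weighted average of class probabilities with input-dependent weights."""
    weights = [local_competence(c, x, validation, k) for c in classifiers]
    total = sum(weights)
    if total == 0:  # assumed fallback: plain average
        weights = [1.0] * len(classifiers)
        total = float(len(classifiers))
    fused = [0.0] * len(classifiers[0](x))
    for w, c in zip(weights, classifiers):
        for j, p in enumerate(c(x)):
            fused[j] += w * p / total
    return fused

# Hypothetical 1-D validation set and two deliberately one-sided classifiers.
validation = [((-2.0,), 0), ((-1.5,), 0), ((-1.0,), 0),
              ((1.0,), 1), ((1.5,), 1), ((2.0,), 1)]
favours_class0 = lambda x: [0.9, 0.1]
favours_class1 = lambda x: [0.2, 0.8]
fused_right = combine([favours_class0, favours_class1], (1.2,), validation)
fused_left = combine([favours_class0, favours_class1], (-1.2,), validation)
```

Because the weights are recomputed for every query, each classifier dominates only in the region of feature space where it has been reliable.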

A Novel Data Mining Approach for the Accurate Prediction of Translation Initiation Sites

George Tzanis; Christos Berberidis; Ioannis Vlahavas

In an mRNA sequence, the prediction of the exact codon where the process of translation starts (Translation Initiation Site – TIS) is a particularly important problem. So far it has been tackled by several researchers who apply various statistical and machine learning techniques, achieving high accuracy levels, often over 90%. In this paper we propose a machine learning approach that can further improve the prediction accuracy. First, we provide a concise review of the literature in this field. Then we propose a novel feature set. We perform extensive experiments on a publicly available, real-world dataset for various vertebrate organisms using a variety of novel features and classification setups. We evaluate our results, compare them with a reference study, and show that our approach, which involves new features and a combination of the Ribosome Scanning Model with a meta-classifier, achieves higher accuracy in most cases.

- Bioinformatics: Sequence and Structure Analysis | Pp. 92-103
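
Combining a ribosome scanning model with a classifier can be sketched as scanning candidate start codons from the 5' end and accepting the first one the classifier approves. The sketch below is an assumption-laden illustration, not the paper's method: in place of the meta-classifier it uses a stand-in rule based on the Kozak context (a purine at position -3 and G at position +4), and the sequences are made up.

```python
def candidate_atgs(seq):
    """Start indices of every ATG in the sequence, in 5' to 3' order."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

def kozak_like(seq, i):
    """Stand-in scorer (assumption): purine at -3 and G at +4 relative to the A of ATG."""
    return i >= 3 and i + 3 < len(seq) and seq[i - 3] in "AG" and seq[i + 3] == "G"

def scan_for_tis(mrna, is_tis=kozak_like):
    """Return the index of the first ATG accepted by is_tis, or None if there is none."""
    seq = mrna.upper().replace("U", "T")
    for i in candidate_atgs(seq):
        if is_tis(seq, i):
            return i
    return None

# Hypothetical sequence: the first ATG lacks upstream context, the second one has it.
tis_index = scan_for_tis("CCATGCCGCCACCATGGCG")
```

Any trained classifier with the same `(seq, i) -> bool` signature could be passed as `is_tis` in place of the stand-in rule.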

SPSO: Synthetic Protein Sequence Oversampling for Imbalanced Protein Data and Remote Homology Detection

Majid Beigi; Andreas Zell

Many classifiers are designed with the assumption of well-balanced datasets. In real problems such as protein classification and remote homology detection, however, binary classifiers like support vector machines (SVMs) and kernel methods face imbalanced data, with a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to this issue in protein classification is to use a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially when the classes overlap, the efficiency and stability of that method decrease. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. Our oversampling method creates synthetic protein sequences of the minor class, considering the distribution of that class and also of the major class, and it operates in data space instead of feature space. This method is very useful in remote homology detection, and we used real and artificial data with different distributions and overlaps of minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the Receiver Operating Characteristic (ROC) curve.

- Bioinformatics: Sequence and Structure Analysis | Pp. 104-115
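
The evaluation measure named above, the area under the ROC curve, equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. A minimal sketch using that pairwise definition follows; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for two positives and three negatives.
auc = roc_auc([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0])
```

Because the measure depends only on the ranking of scores, it does not change with the decision threshold or with the ratio of positives to negatives, which makes it a common choice for imbalanced problems.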