Publications catalog - books



Data Mining and Bioinformatics: First International Workshop, VDMB 2006, Seoul, Korea, September 11, 2006, Revised Selected Papers

Mehmet M. Dalkilic ; Sun Kim ; Jiong Yang (eds.)

Conference: 1st VLDB Workshop on Data Mining and Bioinformatics (VDMB). Seoul, South Korea. September 11, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Artificial Intelligence (incl. Robotics); Data Mining and Knowledge Discovery; Information Storage and Retrieval; Computational Biology/Bioinformatics; Probability and Statistics in Computer Science; Health Informatics

Availability
Detected institution: Not detected
Publication year: 2006
Browse / Download / Request: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-68970-6

Electronic ISBN

978-3-540-68971-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Bioinformatics at Microsoft Research

Simon Mercer

The advancement of the life sciences in the last twenty years has been the story of increasing integration of computing with scientific research, and this trend is set to transform the practice of science in our lifetimes. Conversely, biological systems are a rich source of ideas that will transform the future of computing.

In addition to supporting academic research in the life sciences, Microsoft Research is a source of tools and technologies well suited to the needs of basic scientific research. Current projects include new languages to simplify data extraction and processing, tools for scientific workflows, and biological visualization.

Computer science researchers also bring new perspectives to problems in biology, such as the use of schema-matching techniques in merging ontologies, machine learning in vaccine design, and process algebra in understanding metabolic pathways.

Pp. 1-1

A Novel Approach for Effective Learning of Cluster Structures with Biological Data Applications

Miyoung Shin

Recently, DNA microarray gene expression studies have been actively performed to systematically mine unknown biological knowledge hidden in large volumes of gene expression data. In particular, the problem of finding groups of co-expressed genes or samples has been widely investigated because of its usefulness in characterizing unknown gene functions or performing more sophisticated tasks, such as modeling biological pathways. Nevertheless, identifying good clusters remains difficult in practice, since many clustering methods require the user to select the number of target clusters arbitrarily. In this paper we propose a novel approach to systematically identifying good candidates for the number of clusters, so that arbitrariness in cluster generation is minimized. Our experimental results on both a synthetic dataset and a real gene expression dataset show the applicability and usefulness of this approach in microarray data mining.

Pp. 2-13

Subspace Clustering of Microarray Data Based on Domain Transformation

Jongeun Jun; Seokkyung Chung; Dennis McLeod

We propose a mining framework that supports the identification of useful knowledge based on data clustering. Given the recent advancement of microarray technologies, we focus on mining gene expression datasets. In particular, given that genes are often co-expressed under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous approaches, our method is based on the observation that the number of subspace clusters is related to the number of maximal subspace clusters to which any gene pair can belong. After the gene expression profiles are discretized, the similarity between two genes is transformed into a sequence of symbols that represents the maximal subspace cluster for the gene pair. This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes. Based on the symbolic representations of genes, we present an efficient subspace clustering algorithm that scales with the number of dimensions. In addition, the running time can be drastically reduced by using an inverted index and pruning uninteresting subspaces. Experimental results indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for a yeast cell cycle dataset.

Pp. 14-28
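
The domain transformation described in the abstract above (discretized profiles, a symbol sequence per gene pair, an inverted index, and pruning) can be illustrated with a minimal sketch. The abstract does not give the actual discretization scheme or index layout, so everything here is an assumption made for illustration: the three-symbol alphabet `D`/`N`/`U`, the thresholds `low` and `high`, the `*` wildcard, the `min_dims` pruning rule, and the toy data are all hypothetical and are not the authors' algorithm.

```python
from collections import defaultdict
from itertools import combinations


def discretize(profile, low=-0.5, high=0.5):
    """Map each expression value to 'D' (down), 'N' (neutral) or 'U' (up).

    The thresholds are assumed values, chosen only for this example.
    """
    return "".join("D" if v < low else "U" if v > high else "N" for v in profile)


def pair_pattern(sym_a, sym_b):
    """Keep the shared symbol where two genes agree and '*' where they differ.

    The non-'*' positions form the largest set of conditions on which the
    pair agrees, i.e. a stand-in for the pair's maximal subspace.
    """
    return "".join(a if a == b else "*" for a, b in zip(sym_a, sym_b))


def index_pairs(profiles, min_dims=2):
    """Build an inverted index from pair pattern to the genes sharing it."""
    symbols = {gene: discretize(values) for gene, values in profiles.items()}
    index = defaultdict(set)
    for g1, g2 in combinations(sorted(symbols), 2):
        pattern = pair_pattern(symbols[g1], symbols[g2])
        # Prune patterns that cover too few conditions to be interesting.
        if sum(c != "*" for c in pattern) >= min_dims:
            index[pattern].update((g1, g2))
    return dict(index)


# Toy expression profiles over four conditions (made-up numbers).
profiles = {
    "g1": [1.2, 0.9, -1.0, 0.1],
    "g2": [1.0, 0.8, -0.9, 1.5],
    "g3": [1.1, 0.7, -1.2, -0.8],
    "g4": [-1.0, 0.0, 0.9, 0.2],
}

index = index_pairs(profiles)
print(index)
```

In this toy run, genes g1, g2 and g3 agree on the first three conditions and end up under the same pattern, while every pair involving g4 is pruned. Grouping genes by shared pattern is what makes the lookup cheap: candidate clusters are read from the index instead of being searched for across all subsets of conditions.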

Bayesian Hierarchical Models for Serial Analysis of Gene Expression

Seungyoon Nam; Seungmook Lee; Sanghyuk Lee; Seokmin Shin; Taesung Park

In Serial Analysis of Gene Expression (SAGE), statistical procedures have typically been performed after aggregating observations from the various libraries of the same class. Most studies have not accounted for within-class variability. Identifying differentially expressed genes based on class separation has not been easy because of the heteroscedasticity of the libraries. We propose a hierarchical Bayesian model that accounts for within-class variability. Differential expression is measured by a distribution-free silhouette width, which is introduced into SAGE differential expression analysis for the first time. It is shown that the silhouette width is more appropriate and easier to compute than the error rate.

Pp. 29-39
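
The silhouette width mentioned in the abstract above can be made concrete with a short sketch of its standard definition: for each observation, s = (b - a) / max(a, b), where a is the mean distance to the other members of its own class and b is the smallest mean distance to the members of any other class. The abstract does not say how the width is combined with the hierarchical Bayesian model, so the example below only applies the general definition to made-up per-library values for a single gene; the function name, the class labels, and the numbers are assumptions.

```python
from statistics import mean


def silhouette_widths(values, labels):
    """Silhouette width of each observation, using absolute distance on 1-D values."""
    points = list(zip(values, labels))
    widths = []
    for i, (x, own) in enumerate(points):
        # Distances to the other members of the same class.
        same = [abs(x - y) for j, (y, lab) in enumerate(points) if lab == own and j != i]
        others = {lab for _, lab in points if lab != own}
        if not same or not others:
            widths.append(0.0)  # common convention for singletons or a single class
            continue
        a = mean(same)
        # Mean distance to the nearest other class.
        b = min(mean([abs(x - y) for y, lab in points if lab == other]) for other in others)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical expression levels of one gene in four libraries, two per class.
values = [1.0, 2.0, 10.0, 11.0]
labels = ["normal", "normal", "tumor", "tumor"]

widths = silhouette_widths(values, labels)
average_width = mean(widths)
print(widths, average_width)
```

Widths close to 1 indicate that the two classes are well separated for this gene, values near 0 indicate overlap, and negative values indicate that an observation sits closer to the other class. No distributional assumption is needed, which matches the "distribution-free" description in the abstract.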

Applying Gaussian Distribution-Dependent Criteria to Decision Trees for High-Dimensional Microarray Data

Raymond Wan; Ichigaku Takigawa; Hiroshi Mamitsuka

Biological data presents unique problems for data analysis due to its high dimensionality. Microarray data is one example of such data that has received much attention in recent years. Machine learning algorithms such as support vector machines (SVMs) are ideal for microarray data due to their high classification accuracy. However, sometimes the information being sought is a list of genes that best separates the classes, and not a classification rate.

Decision trees are one alternative; they do not perform as well as SVMs, but their output is easily understood by non-specialists. A major obstacle to applying current decision tree implementations to high-dimensional data sets is their tendency to assign the same score to multiple attributes. In this paper, we propose two distribution-dependent criteria for decision trees to improve their usefulness for microarray classification.

Pp. 40-49

A Biological Text Retrieval System Based on Background Knowledge and User Feedback

Meng Hu; Jiong Yang

Efficiently finding the most relevant publications in a large corpus is an important research topic in information retrieval. The volume of biological literature in various publication databases is growing exponentially. The objective of this paper is to quickly identify useful publications from a large number of biological documents. We introduce a new iterative search paradigm that integrates biomedical background knowledge in organizing the results returned by search engines and uses user feedback to prune irrelevant documents through document classification. A new term weighting strategy based on the Gene Ontology is proposed to represent biomedical literature. A prototype text retrieval system is built on this iterative search approach. Experimental results on MEDLINE abstracts and different keyword inputs show that the system can filter out a large number of irrelevant documents in a reasonable time while keeping most of the useful documents. The results also show that the system is robust against different inputs and parameter settings.

Pp. 50-64

Automatic Annotation of Protein Functional Class from Sparse and Imbalanced Data Sets

Jaehee Jung; Michael R. Thon

In recent years, high-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function; however, proteins are usually annotated with GO terms through a tedious manual curation process carried out by trained professional annotators. With the wealth of genomic data now available, there is a need for accurate automated annotation methods. In this paper, we propose a method for automatically predicting GO terms for proteins by applying statistical pattern recognition techniques. We employ protein functional domains as features and learn independent Support Vector Machine (SVM) classifiers for each GO term. This approach creates sparse data sets with highly imbalanced class distributions. We show that these problems can be overcome with standard feature and instance selection methods. We also present a meta-learning scheme that utilizes multiple SVMs trained for each GO term, resulting in better overall performance than either SVM can achieve alone. The implementation of the tool is available at http://fcg.tamu.edu/AAPFC.

Pp. 65-77

Bioinformatics Data Source Integration Based on Semantic Relationships Across Species

Badr Al-Daihani; Alex Gray; Peter Kille

Bioinformatics databases are heterogeneous: they differ in their representation as well as in their query capabilities across diverse information held in distributed, autonomous resources. Current approaches to integrating heterogeneous bioinformatics data sources are based on one of the following: a common field, an ontology, or a cross-reference. In this paper we investigate the use of semantic relationships across species to link, integrate and annotate genes from publicly available data sources. A novel Soft Link approach is introduced to link information across species held in biological databases by providing a flexible method of joining related information from different databases, including non-bioinformatics databases. A measure of relationship closeness will afford biologists a new tool in their repertoire for analysis. Soft Links are identified as interrelated concepts and can be used to create a rich set of possible relation types supporting the investigation of alternative hypotheses.

Pp. 78-93

An Efficient Storage Model for the SBML Documents Using Object Databases

Seung-Hyun Jung; Tae-Sung Jung; Tae-Kyung Kim; Kyoung-Ran Kim; Jae-Soo Yoo; Wan-Sup Cho

As SBML is regarded as a de facto standard for expressing biological network data in systems biology, the number of SBML documents is increasing exponentially. We propose an SBML data management system (SMS) on top of an object database. Since the object database supports rich data types such as multi-valued attributes and object references, mapping SBML documents into the object database is straightforward. We adopt the event-based SAX parser instead of the DOM parser to deal with huge SBML documents, because the DOM parser suffers from excessive memory overhead during document parsing. To ensure high-quality data, SMS supports a data cleansing function that uses the Gene Ontology. Finally, SMS generates user query results in SBML format (for data exchange) or as visual graphs (for intuitive understanding). Experiments show that our approach is superior to one using conventional relational databases in terms of modeling capability, storage requirements, and data quality.

Pp. 94-105
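
The SAX-versus-DOM choice in the abstract above can be illustrated with a minimal sketch using Python's standard `xml.sax` module. This is not the SMS implementation: the handler class, the helper function, and the tiny SBML-like document are hypothetical and exist only to show how an event-based parser reacts to each element as it streams past, without building the whole document tree in memory as a DOM parser would.

```python
import io
import xml.sax


class SpeciesCollector(xml.sax.ContentHandler):
    """Collects the id of every <species> element as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.species_ids = []

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing else about the document is retained.
        if name == "species":
            self.species_ids.append(attrs.get("id"))


# Tiny hand-written SBML-like document, used only for illustration.
SAMPLE_SBML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="glucose" compartment="cell"/>
      <species id="atp" compartment="cell"/>
      <species id="adp" compartment="cell"/>
    </listOfSpecies>
  </model>
</sbml>
"""


def collect_species_ids(stream):
    """Parse an XML stream with SAX and return the species ids in document order."""
    handler = SpeciesCollector()
    xml.sax.parse(stream, handler)
    return handler.species_ids


species_ids = collect_species_ids(io.BytesIO(SAMPLE_SBML))
print(species_ids)
```

Because the handler keeps only the list of ids, memory use grows with the amount of extracted data, not with the size of the document. That is the property that makes event-based parsing attractive for very large files.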

Identification of Phenotype-Defining Gene Signatures Using the Gene-Pair Matrix Based Clustering

Chung-Wein Lee; Shuyu Dan Li; Eric W. Su; Birong Liao

Mining “meaningful” clues from vast amounts of expression profiling data remains a challenge for biologists. After all the statistical tests, biologists often struggle to decide what to do next with a large list of genes that has no obvious theme or mechanism, partly because most statistical analyses do not incorporate an understanding of biological systems beforehand. Here, we developed a novel method, “gene-pair difference within a sample”, to identify phenotype-defining gene signatures, based on the hypothesis that a biological state is governed by the relative differences among different biological processes. For gene expression, this is the relative difference among the genes within a sample (an individual, a cell, etc.); the frequency with which a gene contributes to the within-sample differences indicates its contribution to defining the biological state. We tested the method on three datasets and identified the most important gene pairs driving the phenotypic differences.

Pp. 106-119