Publications catalog - books

Data Mining and Bioinformatics: First International Workshop, VDMB 2006, Seoul, Korea, September 11, 2006, Revised Selected Papers

Mehmet M. Dalkilic ; Sun Kim ; Jiong Yang (eds.)

Conference: 1st VLDB Workshop on Data Mining and Bioinformatics (VDMB). Seoul, South Korea. September 11, 2006

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Artificial Intelligence (incl. Robotics); Data Mining and Knowledge Discovery; Information Storage and Retrieval; Computational Biology/Bioinformatics; Probability and Statistics in Computer Science; Health Informatics

Availability
Detected institution Year of publication Browse Download Request
Not detected 2006 SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-68970-6

Electronic ISBN

978-3-540-68971-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

TP+Close: Mining Frequent Closed Patterns in Gene Expression Datasets

YuQing Miao; GuoLiang Chen; Bin Song; ZhiHao Wang

Unlike traditional datasets, gene expression datasets typically contain a huge number of items and few transactions. Although many algorithms have been developed for mining frequent closed patterns, their running time increases exponentially with the average transaction length, which makes most current methods impractical for high-dimensional gene expression datasets. In this paper, we propose a new data structure, the tidset-prefix-plus tree (TP+-tree), to store the compressed transposed table of a dataset. Based on the TP+-tree, we develop an algorithm, TP+close, for mining frequent closed patterns in gene expression datasets. TP+close adopts top-down and divide-and-conquer search strategies on the transaction space and combines efficient pruning with effective optimization methods. Experiments on real-life gene expression datasets show that TP+close is faster than two existing algorithms, RERII and CARPENTER.

Pp. 120-130
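
To make the terms in the abstract above concrete, here is a minimal brute-force sketch of closed-pattern mining over a transposed table. It is not TP+close and does not build a TP+-tree; the function names `transpose` and `closed_patterns` and the toy data are assumptions made for illustration. It relies on the fact that every closed itemset is the intersection of the items of some group of transactions, so it enumerates groups of rows, which is only feasible when transactions are few.

```python
from itertools import combinations


def transpose(transactions):
    """Build the transposed table: item -> set of transaction ids (tidset)."""
    table = {}
    for tid, items in enumerate(transactions):
        for item in items:
            table.setdefault(item, set()).add(tid)
    return table


def closed_patterns(transactions, min_support):
    """Return {closed itemset: support} by enumerating groups of transactions.

    Illustrative only: the cost is exponential in the number of transactions.
    """
    table = transpose(transactions)
    n = len(transactions)
    closed = {}
    # A group of `size` rows yields a pattern with support >= size,
    # so starting at min_support enforces the support threshold.
    for size in range(max(1, min_support), n + 1):
        for rows in combinations(range(n), size):
            items = frozenset.intersection(
                *(frozenset(transactions[t]) for t in rows)
            )
            if not items:
                continue
            # Support is the size of the intersection of the item tidsets.
            tidset = set.intersection(*(table[i] for i in items))
            closed[items] = len(tidset)
    return closed


if __name__ == "__main__":
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
    for pattern, support in closed_patterns(toy, min_support=2).items():
        print(sorted(pattern), support)
```

On the toy data, the itemset {b} is frequent but not closed, because {a, b} occurs in exactly the same transactions.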

Exploring Essential Attributes for Detecting MicroRNA Precursors from Background Sequences

Yun Zheng; Wynne Hsu; Mong Li Lee; Limsoon Wong

MicroRNAs (miRNAs) have been shown to play important roles in post-transcriptional gene regulation. The hairpin structure is a key characteristic of microRNA precursors (pre-miRNAs). Encoding their hairpin structures is a critical step in correctly distinguishing pre-miRNAs from background sequences, i.e., pseudo miRNA precursors. In this paper, we propose to encode the hairpin structures of pre-miRNAs with a set of features that captures both their global and local structural characteristics. Furthermore, using an information theory approach, we find that four essential attributes are discriminatory for classifying human pre-miRNAs and background sequences. The experimental results show that the number of conserved essential attributes decreases as the phylogenetic distance between species increases. Specifically, one A-U pair in the pre-miRNAs, which produces the U at the start position of most mature miRNAs, is found to be well conserved across species for the purpose of biogenesis.

Pp. 131-145

A Gene Structure Prediction Program Using Duration HMM

Hongseok Tae; Eun-Bae Kong; Kiejung Park

Gene structure prediction, the task of predicting protein-coding regions in a given nucleotide sequence, is a critical step in annotating genes and greatly affects gene analysis and genome annotation. Because eukaryotic gene structure is much more complicated than that of prokaryotes, eukaryotic gene structure prediction requires more diverse and more complicated computational models. We have developed GeneChaser, a gene structure prediction program that uses a duration hidden Markov model. GeneChaser consists of two major processes: training on datasets to produce parameter values, and predicting protein-coding regions based on those parameter values. The program predicts multiple genes rather than a single gene from a DNA sequence. To predict the gene structure of a huge chromosomal DNA sequence, it splits the sequence into overlapping fragments and runs the prediction process on each fragment. A few computational models were implemented to detect signal patterns, and their scanning efficiency was evaluated. Based on a few criteria, its prediction performance was compared with that of a few commonly used programs, GeneID and Morgan.

Pp. 146-157

An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences

Rupali Patwardhan; Haixu Tang; Sun Kim; Mehmet Dalkilic

Motif discovery is an important problem in protein sequence analysis. Computationally, it can be viewed as an application of the more general multiple local alignment problem, which often runs into computing-time difficulties when many sequences are aligned. We introduce a new algorithm for multiple local alignment of protein sequences, based on the de Bruijn graph approach first proposed by Zhang and Waterman for aligning DNA sequences. We generalize their approach to protein sequences by building an approximate de Bruijn graph that allows gluing similar but not identical amino acids. We implement this algorithm and test it on motif discovery in 100 sets of protein sequences. The results show that our method achieves results comparable to those of other popular motif discovery programs while offering advantages in speed.

Pp. 158-169
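
As a rough illustration of the graph structure the abstract above builds on, the following sketch constructs an exact de Bruijn graph from a few sequences and lists the nodes that all sequences share. It is not the chapter's algorithm: the approximate gluing of similar amino acids is not modeled, and the function names `de_bruijn_graph` and `shared_nodes` and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it.

    Every k-mer in a sequence contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def shared_nodes(sequences, k):
    """Return the (k-1)-mers visited by every sequence.

    In an exact de Bruijn graph, identical (k-1)-mers are glued into one
    node, so shared nodes mark regions common to all sequences.
    """
    node_sets = [
        {seq[i:i + k - 1] for i in range(len(seq) - k + 2)}
        for seq in sequences
    ]
    return set.intersection(*node_sets)


if __name__ == "__main__":
    toy = ["MKVLA", "GKVLT"]  # hypothetical peptide fragments
    print(de_bruijn_graph(toy, k=3))
    print(shared_nodes(toy, k=3))
```

In the toy input, both sequences pass through the nodes KV and VL, and the graph branches after VL, which is where the two sequences diverge.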

Discovering Consensus Patterns in Biological Databases

Mohamed Y. ElTabakh; Walid G. Aref; Mourad Ouzzani; Mohamed H. Ali

Consensus patterns, such as motifs and tandem repeats, are highly conserved patterns with very few substitutions in which no gaps are allowed. In this paper, we present a progressive hierarchical clustering technique for discovering consensus patterns in biological databases over a certain length range. This technique can discover consensus patterns with various requirements by applying a post-processing phase. The progressive nature of the hierarchical clustering algorithm makes it scalable and efficient. Experiments to discover motifs and tandem repeats in real biological databases show a significant performance gain over non-progressive clustering techniques.

Pp. 170-184

Comparison of Modularization Methods in Application to Different Biological Networks

Zhuo Wang; Xin-Guang Zhu; Yazhu Chen; Yixue Li; Lei Liu

Most biological networks have been proposed to possess modular organization, which increases the robustness, flexibility, and stability of networks. Many clustering methods have been used in mining biological data and partitioning complex networks into functional modules. Most of these methods require presetting the number of modules and therefore can potentially obtain biased results. The Markov clustering method (MCL) and the simulated annealing module-detection method (SA) eliminate this requirement and can objectively separate relatively dense subgraphs. In this paper, we compared these two module-detection methods for three types of biological data: protein family classification, microarray clustering, and modularity of metabolic networks. We found that these two methods show differential advantages for different biological networks. In the case of the gene network based on Affymetrix microarray spike data, MCL exactly identified the same number of groups and same contents in each group set by the spike data. In the case of the gene network derived from actual expression data, although neither of the two methods can perfectly recover the natural classification, MCL performs slightly better than SA. However, with increased random noise added to the gene expression values, SA generates better modular structures with higher modularity. Next we compared the modularization results of MCL and SA for protein family classification and found the modules detected by SA could not be well matched with the Structural Classification of Proteins (SCOP database), which suggests that MCL is ideally suited to the rapid and accurate detection of protein families. In addition, we used both methods to detect modules in the metabolic network of . MCL gives a trivial clustering, which generates biologically insignificant modules. In contrast, SA detects modules well corresponding to the KEGG functional classification. 
Moreover, the modularity of several other metabolic networks detected by SA is also much higher than that detected by MCL. In summary, MCL is better suited to modularizing relatively complete and definite data, such as a protein family network. In contrast, SA is less sensitive to noise such as experimental error or incomplete data, and it outperforms MCL when modularizing gene networks based on microarray data and large-scale metabolic networks constructed from incomplete databases.

Pp. 185-195
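
For readers unfamiliar with Markov clustering, the sketch below shows the basic expansion and inflation loop on a small adjacency matrix, without presetting the number of modules. It is a minimal pure-Python illustration, not the MCL implementation evaluated in the chapter; the function names, the added self-loops, and the default inflation value, iteration count, and threshold are all assumptions.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] for c in range(n)] for r in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [
        [sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    """Return clusters as a set of frozensets of node indices.

    Assumes a square matrix of non-negative edge weights. Self-loops are
    added so that no column sums to zero.
    """
    n = len(adjacency)
    m = [
        [adjacency[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
        for r in range(n)
    ]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that keeps non-negligible mass defines one cluster.
    clusters = set()
    for row in m:
        members = frozenset(c for c, x in enumerate(row) if x > threshold)
        if members:
            clusters.add(members)
    return clusters


if __name__ == "__main__":
    # Two disconnected triangles: nodes 0-2 and nodes 3-5.
    tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    adjacency = [row + [0, 0, 0] for row in tri] + [[0, 0, 0] + row for row in tri]
    print(mcl(adjacency))
```

Higher inflation values tend to produce finer clusters, so the granularity is controlled by that parameter instead of a preset module count.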