Publications catalog - books
Data Mining and Bioinformatics: First International Workshop, VDMB 2006, Seoul, Korea, September 11, 2006, Revised Selected Papers
Mehmet M. Dalkilic ; Sun Kim ; Jiong Yang (eds.)
Conference: 1st VLDB Workshop on Data Mining and Bioinformatics (VDMB). Seoul, South Korea. September 11, 2006
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Artificial Intelligence (incl. Robotics); Data Mining and Knowledge Discovery; Information Storage and Retrieval; Computational Biology/Bioinformatics; Probability and Statistics in Computer Science; Health Informatics
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2006 | SpringerLink |
Information
Resource type:
books
Print ISBN
978-3-540-68970-6
Electronic ISBN
978-3-540-68971-3
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2006
Copyright information
© Springer-Verlag Berlin Heidelberg 2006
Table of contents
doi: 10.1007/11960669_11
TP+Close: Mining Frequent Closed Patterns in Gene Expression Datasets
YuQing Miao; GuoLiang Chen; Bin Song; ZhiHao Wang
Unlike traditional datasets, gene expression datasets typically contain a huge number of items and few transactions. Although a large number of algorithms had been developed for mining frequent closed patterns, their running time increased exponentially with the average transaction length. Therefore, most current methods were impractical for high-dimensional gene expression datasets. In this paper, we proposed a new data structure, the tidset-prefix-plus tree (TP+-tree), to store the compressed transposed table of the dataset. Based on the TP+-tree, an algorithm, TP+close, was developed for mining frequent closed patterns in gene expression datasets. TP+close adopted top-down and divide-and-conquer search strategies on the transaction space. Moreover, TP+close combined efficient pruning and effective optimization methods. Several experiments on real-life gene expression datasets showed that TP+close was faster than RERII and CARPENTER, two existing algorithms.
Pp. 120-130
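The following is a minimal sketch of why searching the transaction space pays off when transactions are few. It is not TP+close and uses no TP+-tree; it is a naive row enumeration that relies on the fact that every closed pattern is the intersection of the item sets of some group of transactions. The function name `closed_patterns` and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def closed_patterns(transactions, min_support):
    """Return {closed pattern: support} by enumerating groups of rows.

    Naive illustration only: the number of row groups grows exponentially
    with the number of transactions, so this is usable only for tiny inputs.
    """
    rows = [frozenset(t) for t in transactions]
    found = {}
    for size in range(max(1, min_support), len(rows) + 1):
        for combo in combinations(range(len(rows)), size):
            # Items shared by every row in this group.
            pattern = frozenset.intersection(*(rows[i] for i in combo))
            if not pattern:
                continue
            # Support is the number of rows containing the whole pattern.
            found[pattern] = sum(1 for r in rows if pattern <= r)
    return found

# Hypothetical toy dataset: three samples, four genes.
samples = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}]
for pattern, support in closed_patterns(samples, 2).items():
    print(sorted(pattern), support)
```

The cost here depends on the number of rows rather than the number of items, which is the property that makes row-oriented search attractive for datasets with many genes and few samples.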
doi: 10.1007/11960669_12
Exploring Essential Attributes for Detecting MicroRNA Precursors from Background Sequences
Yun Zheng; Wynne Hsu; Mong Li Lee; Limsoon Wong
MicroRNAs (miRNAs) have been shown to play important roles in post-transcriptional gene regulation. The hairpin structure is a key characteristic of microRNA precursors (pre-miRNAs). Encoding their hairpin structures is a critical step in correctly detecting pre-miRNAs from background sequences, i.e., pseudo miRNA precursors. In this paper, we propose to encode the hairpin structures of pre-miRNAs with a set of features that captures both the global and local structural characteristics of the pre-miRNAs. Furthermore, using an information theory approach, we find that four essential attributes are discriminatory for classifying human pre-miRNAs and background sequences. The experimental results show that the number of conserved essential attributes decreases as the phylogenetic distance between species increases. Specifically, one A-U pair in the pre-miRNAs, which produces the U at the start position of most mature miRNAs, is found to be well conserved across species for the purpose of biogenesis.
Pp. 131-145
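The abstract does not say which information-theoretic criterion is used to select attributes. As one possible illustration, assuming discrete attributes and a binary class label, the mutual information between an attribute and the label can be used to score how discriminatory that attribute is. The function name and the toy values below are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )

# Hypothetical data: label 1 = pre-miRNA, 0 = background sequence.
labels = [1, 1, 0, 0]
informative_attribute = ["AU", "AU", "GC", "GC"]
uninformative_attribute = ["AU", "GC", "AU", "GC"]
print(mutual_information(informative_attribute, labels))
print(mutual_information(uninformative_attribute, labels))
```

An attribute that fully determines the label scores as many bits as the label's entropy, while an attribute independent of the label scores zero.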
doi: 10.1007/11960669_13
A Gene Structure Prediction Program Using Duration HMM
Hongseok Tae; Eun-Bae Kong; Kiejung Park
Gene structure prediction, which is to predict protein coding regions in a given nucleotide sequence, is a critical process in annotating genes and greatly affects gene analysis and genome annotation. As the gene structure of eukaryotes is much more complicated than that of prokaryotes, eukaryotic gene structure prediction requires more diverse and more complicated computational models. We have developed GeneChaser, a gene structure prediction program, using a duration hidden Markov model. GeneChaser consists of two major processes: one trains on datasets to produce parameter values, and the other predicts protein coding regions based on those parameter values. The program predicts multiple genes rather than a single gene from a DNA sequence. To predict the gene structure for a huge chromosomal DNA sequence, it splits the sequence into overlapping fragments and performs the prediction process on each fragment. A few computational models were implemented to detect signal patterns, and their scanning efficiency was evaluated. Based on a few criteria, its prediction performance was compared with that of a few commonly used programs, GeneID and Morgan.
Pp. 146-157
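The fragment-splitting step can be sketched as follows. This is only an illustration of cutting a long sequence into overlapping windows; the fragment length, the overlap size, and the function name are assumptions, since the abstract does not state the values GeneChaser uses or how predictions from adjacent fragments are merged.

```python
def overlapping_fragments(sequence, fragment_len, overlap):
    """Split a sequence into (start, fragment) pairs that overlap by `overlap`."""
    if not 0 <= overlap < fragment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < fragment_len")
    step = fragment_len - overlap
    fragments = []
    start = 0
    while True:
        fragments.append((start, sequence[start:start + fragment_len]))
        # Stop once the current fragment reaches the end of the sequence.
        if start + fragment_len >= len(sequence):
            break
        start += step
    return fragments

# Hypothetical toy input; real chromosomal fragments would be far longer.
for start, fragment in overlapping_fragments("ACGTACGTAC", 4, 2):
    print(start, fragment)
```

Keeping the start offset with each fragment lets coordinates predicted inside a fragment be mapped back to positions on the full sequence.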
doi: 10.1007/11960669_14
An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences
Rupali Patwardhan; Haixu Tang; Sun Kim; Mehmet Dalkilic
Motif discovery is an important problem in protein sequence analysis. Computationally, it can be viewed as an application of the more general multiple local alignment problem, which often becomes computationally expensive when aligning many sequences. We introduce a new algorithm for multiple local alignment of protein sequences, based on the de Bruijn graph approach first proposed by Zhang and Waterman for aligning DNA sequences. We generalize their approach to protein sequences by building an approximate de Bruijn graph that allows gluing similar but not identical amino acids. We implement this algorithm and test it on motif discovery in 100 sets of protein sequences. The results show that our method achieves results comparable to those of other popular motif discovery programs, while offering advantages in terms of speed.
Pp. 158-169
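For orientation, the sketch below builds an exact de Bruijn graph from a set of sequences: nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. It does not implement the approximate gluing of similar amino acids described in the abstract; the function name and the toy sequences are assumptions.

```python
from collections import defaultdict

def de_bruijn_graph(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Hypothetical protein fragments sharing the segment "KVLA".
graph = de_bruijn_graph(["MKVLA", "KVLAG"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Segments shared between sequences collapse onto the same path in the graph, which is what makes the structure useful for locating common regions without pairwise alignment of every sequence.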
doi: 10.1007/11960669_15
Discovering Consensus Patterns in Biological Databases
Mohamed Y. ElTabakh; Walid G. Aref; Mourad Ouzzani; Mohamed H. Ali
Consensus patterns, like motifs and tandem repeats, are highly conserved patterns with very few substitutions where no gaps are allowed. In this paper, we present a progressive hierarchical clustering technique for discovering consensus patterns in biological databases over a certain length range. This technique can discover consensus patterns with various requirements by applying a post-processing phase. The progressive nature of the hierarchical clustering algorithm makes it scalable and efficient. Experiments to discover motifs and tandem repeats on real biological databases show significant performance gain over non-progressive clustering techniques.
Pp. 170-184
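To make the notion of a gap-free consensus with few substitutions concrete, the sketch below derives a consensus string from equal-length occurrences and counts the substitutions each occurrence needs. It is not the progressive hierarchical clustering technique from the paper; the function names and sample strings are hypothetical.

```python
from collections import Counter

def consensus(patterns):
    """Column-wise most frequent symbol of equal-length strings (no gaps)."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("patterns must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*patterns)
    )

def substitutions(a, b):
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical occurrences of one pattern.
occurrences = ["ACGT", "ACGA", "ACCT"]
pattern = consensus(occurrences)
print(pattern, [substitutions(pattern, o) for o in occurrences])
```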
doi: 10.1007/11960669_16
Comparison of Modularization Methods in Application to Different Biological Networks
Zhuo Wang; Xin-Guang Zhu; Yazhu Chen; Yixue Li; Lei Liu
Most biological networks have been proposed to possess modular organization, which increases the robustness, flexibility, and stability of networks. Many clustering methods have been used in mining biological data and partitioning complex networks into functional modules. Most of these methods require presetting the number of modules and therefore can potentially obtain biased results. The Markov clustering method (MCL) and the simulated annealing module-detection method (SA) eliminate this requirement and can objectively separate relatively dense subgraphs. In this paper, we compared these two module-detection methods for three types of biological data: protein family classification, microarray clustering, and modularity of metabolic networks. We found that these two methods show differential advantages for different biological networks. In the case of the gene network based on Affymetrix microarray spike data, MCL exactly identified the same number of groups and same contents in each group set by the spike data. In the case of the gene network derived from actual expression data, although neither of the two methods can perfectly recover the natural classification, MCL performs slightly better than SA. However, with increased random noise added to the gene expression values, SA generates better modular structures with higher modularity. Next we compared the modularization results of MCL and SA for protein family classification and found the modules detected by SA could not be well matched with the Structural Classification of Proteins (SCOP database), which suggests that MCL is ideally suited to the rapid and accurate detection of protein families. In addition, we used both methods to detect modules in the metabolic network of . MCL gives a trivial clustering, which generates biologically insignificant modules. In contrast, SA detects modules well corresponding to the KEGG functional classification. 
Moreover the modularity for several other metabolic networks detected by SA is also much higher than that by MCL. In summary, MCL is more suited to modularize relatively complete and definite data, such as a protein family network. In contrast, SA is less sensitive to noise such as experimental error or incomplete data and outperforms MCL when modularizing gene networks based on microarray data and large scale metabolic networks constructed from incomplete databases.
Pp. 185-195
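The abstract compares partitions by their modularity without defining it. Assuming the commonly used Newman-Girvan definition for an undirected, unweighted network, Q is the sum over modules of (internal edges / m) minus (total module degree / 2m) squared, where m is the number of edges. The function name and the example network are assumptions.

```python
def modularity(edges, module_of):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs; module_of: dict mapping node -> module label.
    """
    m = len(edges)
    internal = {}  # edges with both endpoints in the module
    degree = {}    # sum of node degrees in the module
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )

# Hypothetical network: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
two_modules = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_module = {node: 1 for node in two_modules}
print(modularity(edges, two_modules))
print(modularity(edges, one_module))
```

Splitting the two triangles into separate modules gives a positive score, while placing every node in a single module gives zero, so a higher value indicates a partition with denser-than-expected internal connections.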