Catálogo de publicaciones - libros
Bioinformatics Research and Applications: Third International Symposium, ISBRA 2007, Atlanta, GA, USA, May 7-10, 2007. Proceedings
Ion Măndoiu ; Alexander Zelikovsky (eds.)
En conferencia: 3º International Symposium on Bioinformatics Research and Applications (ISBRA) . Atlanta, GA, USA . May 7, 2007 - May 10, 2007
Resumen/Descripción – provisto por la editorial
No disponible.
Palabras clave – provistas por la editorial
No disponibles.
Disponibilidad
Institución detectada | Año de publicación | Navegá | Descargá | Solicitá |
---|---|---|---|---|
No detectada | 2007 | SpringerLink |
Información
Tipo de recurso:
libros
ISBN impreso
978-3-540-72030-0
ISBN electrónico
978-3-540-72031-7
Editor responsable
Springer Nature
País de edición
Reino Unido
Fecha de publicación
2007
Información sobre derechos de publicación
© Springer-Verlag Berlin Heidelberg 2007
Cobertura temática
Tabla de contenidos
Finding Minimal Sets of Informative Genes in Microarray Data
Kung-Hua Chang; Yong Kyun Kwon; D. Stott Parker
For a microarray dataset with attached phenotype information – which gives expression levels of various genes and a phenotype classification for each of a set of samples – an important problem is to find . These genes have high information content as attributes for classification, minimizing the expected number of tests needed to identify a phenotype. This study investigates the use of a heuristic method for finding complete sets of informative genes (sets that are sufficient for constructing a maximally discriminating classifier) that are as small as possible. These can be very useful in developing an appreciation for the data. Our method uses branch-and-bound depth-first search. Experimental results suggest that our method is effective in finding minimal gene sets, and the resulting classifiers have good performance in terms of classification accuracy.
Pp. 227-236
Noise-Based Feature Perturbation as a Selection Method for Microarray Data
Li Chen; Dmitry B. Goldgof; Lawrence O. Hall; Steven A. Eschrich
DNA microarrays can monitor the expression levels of thousands of genes simultaneously, providing the opportunity for the identification of genes that are differentially expressed across different conditions. Microarray datasets are generally limited to a small number of samples with a large number of gene expressions, therefore feature selection becomes a very important aspect of the microarray classification problem. In this paper, a new feature selection method, feature perturbation by adding noise, is proposed to improve the performance of classification. The experimental results on a benchmark colon cancer dataset indicate that the proposed method can result in more accurate class predictions using a smaller set of features when compared to the SVM-RFE feature selection method.
Pp. 237-247
Efficient Generation of Biologically Relevant Enriched Gene Sets
Igor Trajkovski; Nada Lavrač
Gene set enrichment analysis is a microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. In this paper we present a novel method integrating gene interaction information with Gene Ontology (GO) for the construction of new interesting enriched gene sets. The experimental results show that the introduced method improves over traditional methods that compute the enrichment of a single GO terms, i.e. that it is capable to find new statistically relevant descriptions of the biology governing the experiments not detectable by the existing methods.
Pp. 248-259
Space and Time Efficient Algorithms to Discover Endogenous RNAi Patterns in Complete Genome Data
Sudha Balla; Sanguthevar Rajasekaran
RNAi, short for RNA Interference, a phenomenon of inhibiting the expression of genes, is widely adopted in laboratories for the study of pathways and determination of gene function. Recent studies have shown that RNAi could be used as an approach to treat diseases like cancers and some genetic disorders in which the down-regulation of a protein could prevent or stop progression of the disease. In [7], the problem of detecting endogenous dsRNA control elements and their corresponding mRNA target, i.e., the gene under RNAi control by degradation, in complete genomes of species using a suffix tree data structure is discussed. While the algorithm identifies triple repeats in the genome sequence in linear time, its very high memory requirement (12 GB for the genome of size 100 Mbp) becomes a bottleneck for processing genomes of higher order. In this paper, we give algorithms that are space and time efficient in practice than the suffix tree based algorithm. Our algorithms are based on simple array data structures and adopt basic sorting techniques to identify the desired patterns in a given genome sequence. We achieve a speedup of 23 and reduction in memory requirement by a factor of 12 for the genome, over the suffix tree approach, making the processing of higher order genomes possible for detecting such endogenous controls and targets for RNAi by degradation.
Pp. 260-269
A Fast Approximate Covariance-Model-Based Database Search Method for Non-coding RNA
Scott F. Smith
A covariance-model-based search method for non-coding RNA genes is proposed which is much faster than dynamic programming, but which is shown to be very effective in experimental tests. The method incorporates secondary structure information in the entire first pass of the database, unlike the usual primary-sequence-only pre-filters applied when using dynamic programming. An iterative alignment refining algorithm which starts at an ungapped alignment and successively selects alignment breakpoints gives only an approximation to the optimal alignment, but appears to be sufficient for gene localization.
Pp. 270-281
Extensions of Naive Bayes and Their Applications to Bioinformatics
Raja Loganantharaj
In this paper we will study the naïve Bayes, one of the popular machine learning algorithms, and improve its accuracy without seriously affecting its computational efficiency. Naïve Bayes assumes positional independence, which makes the computation of the joint probability value easier at the expense of the accuracy or the underlying reality. In addition, the prior probabilities of positive and negative instances are computed from the training instances, which often do not accurately reflect the real prior probabilities. In this paper we address these two issues. We have developed algorithms that automatically perturb the computed prior probabilities and search around the neighborhood to maximize a given objective function. To improve the prediction accuracy we introduce limited dependency on the underlying pattern. We have demonstrated the importance of these extensions by applying them to solve the problem in discriminating a TATA box from putative TATA boxes found in promoter regions of plant genome. The best prediction accuracy of a naïve Bayes with 10 fold cross validation was 69% while the second extension gave the prediction accuracy of 79% which is better than the best solution from an artificial neural network prediction.
Pp. 282-292
The Solution Space of Sorting by Reversals
Marília D. V. Braga; Marie-France Sagot; Celine Scornavacca; Eric Tannier
In comparative genomics, algorithms that sort permutations by reversals are often used to propose evolutionary scenarios of large scale genomic mutations between species. One of the main problems of such methods is that they give one solution while the number of optimal solutions is huge, with no criteria to discriminate among them. Bergeron [4] started to give some structure to the set of optimal solutions, in order to be able to deliver more presentable results than only one solution or a complete list of all solutions. The structure is a way to group solutions into equivalence classes, and to identify in each class one particular representative. However, no algorithm exists so far to compute this set of representatives except through the enumeration of all solutions, which takes too much time even for small permutations. Bergeron [4] state as an open problem the design of such an algorithm. We propose in this paper an answer to this problem, that is, an algorithm which gives one representative for each class of solutions and counts the number of solutions in each class, with a better theoretical and practical complexity than the complete enumeration method. We give several biological examples where the result is more relevant than a unique optimal solution or the list of all solutions.
Pp. 293-304
A Fast and Exact Algorithm for the Perfect Reversal Median Problem
Matthias Bernt; Daniel Merkle; Martin Middendorf
We study the problem of finding for the gene orders of three taxa a potential ancestral gene order such that the corresponding rearrangement scenario has a minimal number of reversals where each of the reversals has to preserve the common intervals of the given input gene orders. Common intervals identify sets of genes that occur consecutively in all input gene orders. The problem of finding such an ancestral gene order is called the perfect reversal median problem (pRMP). A tree based data structure for the representation of the common intervals of all input gene orders is used for the design and realization of a fast and exact algorithm — called TCIP — for solving the pRMP. It is known that for two given gene orders the minimum number of reversals to transfer one gene order into the other can be computed in polynomial time, whereas the corresponding problem with the restriction that common intervals should not be destroyed by the reversals is already NP-hard. Nevertheless, we show empirically on biological and artificial data that TCIP for the pRMP is usually even faster than the fastest exact algorithm (Caprara’s median solver) for the reversal median problem (RMP), i.e., the corresponding problem in which the common intervals are not considered.
Pp. 305-316
Genomic Signatures from DNA Word Graphs
Lenwood S. Heath; Amrita Pati
Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons and -elements through genes and on to regions of conserved or divergent gene order. The DNA Words program aims to identify mathematical structures that characterize genomes at multiple scales. The focus of this work is the fine structure of genomic sequences, the manner in which short nucleotide sequences fit together to comprise the genome as an abstract sequence, within a graph-theoretic setting. A DNA word graph is a generalization of a de Bruijn graph that records the occurrence counts of node and edges in a genomic sequence. A DNA word graph can be derived from a genomic sequence generated by a finite Markov chain or a subsequence of a sequenced genome. Both theoretically and empirically, DNA word graphs give rise to genomic signatures. Several genomic signatures are derived from the structure of a DNA word graph, including an information-rich and visually appealing genomic bar code. Application of genomic signatures to several genomes demonstrate their practical value in identifying and distinguishing genomic sequences.
Pp. 317-328
Enhancing Motif Refinement by Incorporating Comparative Genomics Data
Erliang Zeng; Giri Narasimhan
Transcription factor binding sites (TFBS) are often located in the upstream regions of genes and transcription factors (TFs) cause transcription regulation by binding at these locations. Predicting these binding sites is a difficult problem, and traditional methods have a high degree of false positives in their predictions. Comparative genomics data can help to improve motif predictions. In this paper, a new strategy is presented, which refines motif by taking the comparative genomics data into account. Tested with the help of both simulation data and biological data, we show that our method makes improved predictions. We also propose a new metric to score a motif profile. This score is biologically motivated and helps the algorithm in its predictions.
Pp. 329-337