Catálogo de publicaciones - libros

Compartir en
redes sociales


Algorithms in Bioinformatics: 5th International Workshop, WABI 2005, Mallorca, Spain, October 3-6, 2005, Proceedings

Rita Casadio ; Gene Myers (eds.)

En conferencia: 5º International Workshop on Algorithms in Bioinformatics (WABI) . Mallorca, Spain . October 3, 2005 - October 6, 2005

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

No disponibles.

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2005 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-29008-7

ISBN electrónico

978-3-540-31812-5

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2005

Tabla de contenidos

A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds

Gregory Kucherov; Laurent Noé; Mikhail Roytberg

We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem – a set of target alignments, an associated probability distribution, and a seed model – that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.

Palabras clave: Hide Markov Model; Regular Language; Bernoulli Model; Full Path; Seed Model.

Pp. 251-263

Generalized Planted (l,d)-Motif Problem with Negative Set

Henry C. M. Leung; Francis Y. L. Chin

Finding similar patterns (motifs) in a set of sequences is an important problem in Computational Molecular Biology. Pevzner and Sze [18] defined the planted ( l , d )-motif problem as trying to find a length- l pattern that occurs in each input sequence with at most d substitutions. When d is large, this problem is difficult to solve because the input sequences do not contain enough information on the motif. In this paper, we propose a generalized planted ( l , d )-motif problem which considers as input an additional set of sequences without any substring similar to the motif (negative set) as extra information. We analyze the effects of this negative set on the finding of motifs, and define a set of unsolvable problems and another set of most difficult problems, known as “challenging generalized problems”. We develop an algorithm called VANS based on voting and other novel techniques, which can solve the (9,3), (11,4),(15,6) and (20,8)-motif problems which were unsolvable before as well as challenging problems of the planted ( l , d )-motif problem such as (9,2), (11,3), (15,5) and (20,7)-motif problems.

Palabras clave: Local Search; Input Sequence; Extra Information; Find Motif; Motif Problem.

Pp. 264-275

Alignment of Tandem Repeats with Excision, Duplication, Substitution and Indels (EDSI)

Michael Sammeth; Thomas Weniger; Dag Harmsen; Jens Stoye

Traditional sequence comparison by alignment applies a mutation model comprising two events, s ubstitutions and i ndels (insertions or deletions) of single positions (SI). However, modern genetic analysis knows a variety of more complex mutation events (e.g., duplications, excisions and rearrangements), especially regarding DNA. With the ever more DNA sequence data becoming available, the need to accurately compare sequences which have clearly undergone more complicated types of mutational processes is becoming critical. Herein we introduce a new model, where in total four mutational events are considered: e xcision and d uplication of tandem repeats, as well as s ubstitutions and i ndels of single positions ( EDSI ). Assuming the EDSI model, we develop a new algorithm for pairwisely aligning and comparing DNA sequences containing tandem repeats. To evaluate our method, we apply it to the spa VNTR (variable number of tandem repeats) of Staphylococcus aureus , a bacterium of great medical importance.

Palabras clave: Tandem Repeat; Duplication Event; Repeat Type; Edit Operation; Repeat Copy.

Pp. 276-290

The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity

Daniel Dalevi; Devdatt Dubhashi

Recently Peres and Shields discovered a new method for estimating the order of a stationary fixed order Markov chain [15]. They showed that the estimator is consistent by proving a threshold result. While this threshold is valid asymptotically in the limit, it is not very useful for DNA sequence analysis where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in data. We show that it compares favorably to other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to Variable Length Markov Chains. We give an application to the problem of detecting DNA sequence similarity using genomic signatures. Abbreviations: Mk = Fixed order Markov model of order k , PST = Prediction suffix tree, MC = Markov chain, VLMC = Variable length Markov chain.

Palabras clave: Markov Chain; Akaike Information Criterion; Bayesian Information Criterion; Sharp Transition; Lower Order Model.

Pp. 291-302

Multiple Structural RNA Alignment with Lagrangian Relaxation

Markus Bauer; Gunnar W. Klau; Knut Reinert

Many classes of functionally related RNA molecules show a rather weak sequence conservation but instead a fairly well conserved secondary structure. Hence, it is clear that any method that relates RNA sequences in form of (multiple) alignments should take structural features into account. Since multiple alignments are of great importance for subsequent data analysis, research in improving the speed and accuracy of such alignments benefits many other analysis problems. We present a formulation for computing provably optimal , structure-based, multiple RNA alignments and give an algorithm that finds such an optimal (or near-optimal) solution. To solve the resulting computational problem we propose an algorithm based on Lagrangian relaxation which already proved successful in the two-sequence case. We compare our implementation, mLARA , to three programs ( clustalW , MARNA , and pmmulti ) and demonstrate that we can often compute multiple alignments with consensus structures that have a significant lower minimum free energy term than computed by the other programs. Our prototypical experiments show that our new algorithm is competitive and, in contrast to other methods, is applicable to long sequences where standard dynamic programming approaches must fail. Furthermore, the Lagrangian method is capable of handling arbitrary pseudoknot structures.

Palabras clave: Lagrangian Relaxation; Minimum Free Energy; Consensus Structure; Lagrangian Problem; Consensus Secondary Structure.

Pp. 303-314

Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons

Pankaj K. Agarwal; Yonatan Bilu; Rachel Kolodny

Multiple Sequence Alignment (MSA) is one of the most fundamental problems in computational molecular biology. The running time of the best known scheme for finding an optimal alignment, based on dynamic programming, increases exponentially with the number of input sequences. Hence, many heuristics were suggested for the problem. We consider the following version of the MSA problem: In a preprocessing stage pairwise alignments are found for every pair of sequences. The goal is to find an optimal alignment in which matches are restricted to positions that were matched at the preprocessing stage. We present several techniques for making the dynamic programming algorithm more efficient, while still finding an optimal solution under these restrictions. Namely, in our formulation the MSA must conform with pairwise (local) alignments, and in return can be solved more efficiently. We prove that it suffices to find an optimal alignment of sequence segments, rather than single letters, thereby reducing the input size and thus improving the running time.

Palabras clave: Dynamic Programming; Multiple Sequence Alignment; Optimal Path; Dynamic Programming Algorithm; Pairwise Alignment.

Pp. 315-327

Ortholog Clustering on a Multipartite Graph

Akshay Vashist; Casimir Kulikowski; Ilya Muchnik

We present a method for automatically extracting groups of orthologous genes from a large set of genomes through the development of a new clustering method on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O (| E | + | V | log | V |), where V and E are the sets of vertices and edges, respectively in the graph. However, if we discretize the similarity scores into a constant number of bins, the run time improves to O (| E | + | V |). The proposed method was applied to seven complete eukaryote genomes on which manually curated ortholog clusters, KOG (eukaryotic ortholog clusters, http://www.ncbi.nlm.nih.gov/COG/new/) are constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters. Finally, we demonstrate how gene order information can be incorporated in the proposed method for improving ortholog detection.

Palabras clave: Linkage Function; Orthologous Relationship; Multiple Genome; Pfam Family; Ortholog Cluster.

Pp. 328-340

Linear Time Algorithm for Parsing RNA Secondary Structure

Baharak Rastegari; Anne Condon

Accurate prediction of pseudoknotted RNA secondary structure is an important computational challenge. Typical prediction algorithms aim to find a structure with minimum free energy according to some thermodynamic (“sum of loop energies”) model that is implicit in the recurrences of the algorithm. However, a clear definition of what exactly are the loops and stems in pseudoknotted structures, and their associated energies, has been lacking. We present a comprehensive classification of loops in pseudoknotted RNA secondary structures. Building on an algorithm of Bader et al. [2] we obtain a linear time algorithm for parsing a secondary structures into its component loops. We also give a linear time algorithm to calculate the free energy of a pseudoknotted secondary structure. This is useful for heuristic prediction algorithms which are widely used since (pseudoknotted) RNA secondary structure prediction is NP-hard. Finally, we give a linear time algorithm to test whether a secondary structure is in the class handled by Akutsu’s algorithm [1]. Using our tests, we analyze the generality of Akutsu’s algorithm for real biological structures.

Palabras clave: Secondary Structure; Dynamic Programming Algorithm; Closed Region; Band Region; Base Index.

Pp. 341-352

A Compressed Format for Collections of Phylogenetic Trees and Improved Consensus Performance

Robert S. Boyer; Warren A. Hunt; Serita M. Nelesen

Phylogenetic tree searching algorithms often produce thousands of trees which biologists save in Newick format in order to perform further analysis. Unfortunately, Newick is neither space efficient, nor conducive to post-tree analysis such as consensus. We propose a new format for storing phylogenetic trees that significantly reduces storage requirements while continuing to allow the trees to be used as input to post-tree analysis. We implemented mechanisms to read and write such data from and to files, and also implemented a consensus algorithm that is faster by an order of magnitude than standard phylogenetic analysis tools. We demonstrate our results on a collection of data files produced from both maximum parsimony tree searches and Bayesian methods.

Palabras clave: Consensus Tree; Storage Requirement; Input Tree; Consensus Algorithm; Majority Consensus.

Pp. 353-364

Optimal Protein Threading by Cost-Splitting

P. Veber; N. Yanev; R. Andonov; V. Poirriez

In this paper, we use integer programming approach for solving a hard combinatorial optimization problem, namely protein threading. For this sequence-to-structure alignment problem we apply cost-splitting technique to derive a new Lagrangian dual formulation. The optimal solution of the dual is sought by an algorithm of polynomial complexity. For most of the instances the dual solution provides an optimal or near-optimal (with negligible duality gap) alignment. The speed-up with respect to the widely promoted approach for solving the same problem in [17] is from 100 to 250 on computationally interesting instances. Such a performance turns computing score distributions, the heaviest task when solving PTP, into a routine operation.

Palabras clave: Computational Biology; Lagrangian Relaxation; Alignment Problem; Lagrangian Duality; Contact Graph.

Pp. 365-375