Catálogo de publicaciones - libros

Compartir en
redes sociales


Combinatorial Pattern Matching: 16th Annual Symposium, CPM 2005, Jeju Island, Korea, June 19-22, 2005, Proceedings

Alberto Apostolico ; Maxime Crochemore ; Kunsoo Park (eds.)

En conferencia: 16º Annual Symposium on Combinatorial Pattern Matching (CPM) . Jeju Island, South Korea . June 19, 2005 - June 22, 2005

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

Algorithm Analysis and Problem Complexity; Data Structures; Information Storage and Retrieval; Document Preparation and Text Processing; Pattern Recognition; Bioinformatics

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2005 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-26201-5

ISBN electrónico

978-3-540-31562-9

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2005

Tabla de contenidos

The Median Problem for the Reversal Distance in Circular Bacterial Genomes

Enno Ohlebusch; Mohamed Ibrahim Abouelhoda; Kathrin Hockel; Jan Stallkamp

In the median problem, we are given a distance or dissimilarity measure , three genomes ,, and , and we want to find a genome (a median) such that the sum ∑(,) is minimized. The median problem is a special case of the multiple genome rearrangement problem, where one wants to find a phylogenetic tree describing the most “plausible” rearrangement scenario for multiple species. The median problem is NP-hard for both the breakpoint and the reversal distance [5,14]. To the best of our knowledge, there is no approach yet that takes biological constraints on genome rearrangements into account. In this paper, we make use of the fact that in circular bacterial genomes the predominant mechanism of rearrangement are inversions that are centered around the origin or the terminus of replication [8,10,18]. This constraint simplifies the median problem significantly. More precisely, we show that the median problem for the reversal distance can be solved in linear time for circular bacterial genomes.

Pp. 116-127

Using PQ Trees for Comparative Genomics

Gad M. Landau; Laxmi Parida; Oren Weimann

Permutations on strings representing gene clusters on genomes have been studied earlier in [3, 12, 14, 17, 18] and the idea of a maximal permutation pattern was introduced in [12]. In this paper, we present a new tool for representation and detection of gene clusters in multiple genomes, using PQ trees [6]: this describes the inner structure and the relations between clusters succinctly, aids in filtering meaningful from apparently meaningless clusters and also gives a natural and meaningful way of visualizing complex clusters. We identify a minimal consensus PQ tree and prove that it is equivalent to a maximal pattern [12] and each subgraph of the PQ tree corresponds to a non-maximal permutation pattern. We present a general scheme to handle multiplicity in permutations and also give a linear time algorithm to construct the minimal consensus PQ tree. Further, we demonstrate the results on whole genome data sets. In our analysis of the whole genomes of human and rat we found about 1.5 million common gene clusters but only about 500 minimal consensus PQ trees, and, with and genomes we found only about 450 minimal consensus PQ trees out of about 15,000 gene clusters. Further, we show specific instances of functionally related genes in the two cases.

Pp. 128-143

Hardness of Optimal Spaced Seed Design

François Nicolas; Eric Rivals

Speeding up approximate pattern matching is a line of research in stringology since the 80’s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are excluded (filtered out) in a first step, and remaining regions are compared to the pattern by dynamic programming in a second step. Among the necessary conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, it was shown recently that counting spaced subwords instead of substrings improve the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called gapped seeds, for the subwords, depending on the search parameters. The seed design problems proposed up to now differ by the way the similarities to detect are given: either a set of similarities is given (this is a “region specific” problem), or one wishes to detect all similar regions having at most substitutions (general detection problem). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for both the region specific and general seed design problems, thereby justifying the exponential complexity of known algorithms. Moreover, we introduce a new formulation of the region specific seed design problem, in which the weight of the seed (, number of characters in the subwords) has to be maximized, and show it is as difficult to approximate than .

Pp. 144-155

Weighted Directed Word Graph

Meng Zhang; Liang Hu; Qiang Li; Jiubin Ju

Weighted Directed Word Graph is a new text-indexing structure which is based on a new compaction method on DAWG. WDWGs are basically cyclic, means that they may accept infinite strings. But by assigning weights to the edges, the acceptable strings are limited only to the substrings of input strings. The size of WDWGs is smaller than that of DAWGs both in theory and practice. A linear-time on-line construction algorithm for WDWGs is also presented.

Pp. 156-167

Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets

Shiri Dori; Gad M. Landau

We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, , of total length , in () time and space for integer alphabets. Processing a text of size over an alphabet Σ with the automaton costs , where there are occurrences of patterns in the text.

Pp. 168-177

An Extension of the Burrows Wheeler Transform and Applications to Sequence Comparison and Data Compression

Sabrina Mantaci; Antonio Restivo; G. Rosone; Marinella Sciortino

We introduce a generalization of the Burrows-Wheeler Transform () that can be applied to a multiset of words. The extended transformation, denoted by , is reversible, but, differently from , it is also surjective. The transformation allows to give a definition of distance between two sequences, that we apply here to the problem of the whole mitochondrial genome phylogeny. Moreover we give some consideration about compressing a set of words by using the transformation as preprocessing.

Pp. 178-189

DNA Compression Challenge Revisited: A Dynamic Programming Approach

Behshad Behzadi; Fabrice Le Fessant

Standard compression algorithms are not able to compress DNA sequences. Recently, new algorithms have been introduced specifically for this purpose, often using detection of long approximate repeats. In this paper, we present another algorithm, , based on dynamic programming. In comparison with former existing programs, it compresses DNA slightly better, while the cost of dynamic programming is almost negligible.

Pp. 190-200

On the Complexity of Sparse Exon Assembly

Carmel Kent; Gad M. Landau; Michal Ziv-Ukelson

Gene structure prediction is one of the most important problems in computational molecular biology. A combinatorial approach to the problem, denoted , was introduced by Gelfand, Mironov and Pevzner [5]. The method works by finding a set of blocks in a source genomic sequence whose concatenation (splicing) fits a target gene belonging to a homologous species. Let , and the candidate exons be sequences of size (). The innovative algorithm described in [5] yields an () result for spliced alignment, regardless of filtration mode.

In this paper we suggest a new algorithm which targets the case where filtering has been applied to the data, resulting in a set of () candidate exon blocks. Our algorithm yields an solution for this case.

Pp. 201-218

An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

Paul Horton; Wataru Fujibuchi

Motif discovery is the problem of finding local patterns or from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width (which is typically 5-20) and is a dramatic improvement relative to previously known bounds.

We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length , in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of (typically 4) the trivial bound is . Our bound is roughly (log).

We relate this theoretical result to the exact motif discovery program, TsukubaBB, whose algorithm contains ideas which inspired the result. We describe a recent improvement to the TsukubaBB program which can give a speed up of nine or more and use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.

Pp. 219-228

Incremental Inference of Relational Motifs with a Degenerate Alphabet

Nadia Pisanti; Henry Soldano; Mathilde Carpentier

In this paper we define a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of elements of the motifs. For this class of problems we give an algorithm that is a suitable extension of the KMR [3] paradigm and, in particular, of the KMRC [7] as it uses a degenerate alphabet. The algorithm contains several improvements with respect to [7] that result especially useful when – as it is required for relational motifs – the inference is made by partially overlapping shorter motifs. The efficiency, correctness and completeness of the algorithm is assured by several non-trivial properties. Finally, we list some possible applications and we focus on one of them: the study of 3D structures of proteins.

Pp. 229-240