Catálogo de publicaciones - libros

Compartir en
redes sociales


Combinatorial Pattern Matching: 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings

Moshe Lewenstein ; Gabriel Valiente (eds.)

En conferencia: 17º Annual Symposium on Combinatorial Pattern Matching (CPM) . Barcelona, Spain . July 5, 2006 - July 7, 2006

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

Pattern Recognition; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing; Information Storage and Retrieval; Bioinformatics; Data Structures

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2006 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-35455-0

ISBN electrónico

978-3-540-35461-1

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2006

Tabla de contenidos

Finding Common RNA Pseudoknot Structures in Polynomial Time

Patricia A. Evans

This paper presents the first polynomial time algorithm for finding common RNA substructures that include pseudoknots and similar structures. While a more general problem is known to be NP-hard, this algorithm exploits special features of RNA structures to match RNA bonds correctly in polynomial time. Although the theoretical upper bound on the algorithm’s time and space usage is high, the data-driven nature of its computation enables it to avoid computing unnecessary cases, dramatically reducing the actual running time. The algorithm works well in practice, and has been tested on sample RNA structures that include pseudoknots and pseudoknot-like tertiary structures.

Palabras clave: Polynomial Time; Index Combination; Common Substructure; Pseudoknot Structure; Turnip Yellow Mosaic Virus.

- Session 6. Applications in Molecular Biology II | Pp. 223-232

A Compact Mathematical Programming Formulation for DNA Motif Finding

Carl Kingsford; Elena Zaslavsky; Mona Singh

In the motif finding problem one seeks a set of mutually similar subsequences within a collection of biological sequences. This is an important and widely-studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework where the goal is to find subsequences of a given length such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem, which uses the fact that distances between subsequences come from a limited set of possibilities. We show how to tighten its linear programming relaxation by adding an exponential set of constraints and give an efficient separation algorithm that can find violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice in uncovering known transcription factor binding sites.

Palabras clave: Transcription Factor Binding Site; Integer Linear Programming; Linear Programming Relaxation; Separation Algorithm; Integer Linear Programming Formulation.

- Session 6. Applications in Molecular Biology II | Pp. 233-245

Local Alignment of RNA Sequences with Arbitrary Scoring Schemes

Rolf Backofen; Danny Hermelin; Gad M. Landau; Oren Weimann

Local similarity is an important tool in comparative analysis of biological sequences, and is therefore well studied. In particular, the Smith-Waterman technique and its normalized version are two established metrics for measuring local similarity in strings. In RNA sequences however, where one must consider not only sequential but also structural features of the inspected molecules, the concept of local similarity becomes more complicated. First, even in global similarity, computing global sequence-structure alignments is more difficult than computing standard sequence alignments due to the bi-dimensionality of information. Second, one can view locality in two different ways, in the sequential or structural sense, leading to different problem formulations. In this paper we introduce two sequentially-local similarity metrics for comparing RNA sequences. These metrics combine the global RNA alignment metric of Shasha and Zhang [16] with the Smith-Waterman metric [17] and its normalized version [2] used in strings. We generalize the familiar alignment graph used in string comparison to apply also for RNA sequences, and then utilize this generalization to devise two algorithms for computing local similarity according to our two suggested metrics. Our algorithms run in $\mathcal{O}(m^2 n \lg n)$ and $\mathcal{O}(m^2 n \lg n+n^2m)$ time respectively, where m ≤ n are the lengths of the two given RNAs. Both algorithms can work with any arbitrary scoring scheme.

Palabras clave: Local Alignment; Local Similarity; Alignment Score; Edit Operation; String Comparison.

- Session 6. Applications in Molecular Biology II | Pp. 246-257

An $O(n^{3/2}\sqrt{\log (n)})$ Algorithm for Sorting by Reciprocal Translocations

Michal Ozery-Flato; Ron Shamir

We prove that sorting by reciprocal translocations can be done in $O(n^{3/2}\sqrt{\log (n)})$ for an n -gene genome. Our algorithm is an adaptation of the Tannier et. al algorithm for sorting by reversals. This improves over the O ( n ^3) algorithm for sorting by reciprocal translocations given by Bergeron et al.

Palabras clave: Reciprocal Translocation; Internal Edge; Internal Component; External Edge; Discrete Apply Mathematic.

- Session 7. Applications in Molecular Biology III | Pp. 258-269

Longest Common Subsequences in Permutations and Maximum Cliques in Circle Graphs

Alexander Tiskin

For two strings a , b , the longest common subsequence (LCS) problem consists in comparing a and b by computing the length of their LCS . In a previous paper, we defined a generalisation, called “the all semi-local LCS problem”, for which we proposed an efficient output representation and an efficient algorithm. In this paper, we consider a restriction of this problem to strings that are permutations of a given set. The resulting problem is equivalent to the all local longest increasing subsequences (LIS) problem. We propose an algorithm for this problem, running in time O ( n ^1.5) on an input of size n . As an interesting application of our method, we propose a new algorithm for finding a maximum clique in a circle graph on n nodes, running in the same asymptotic time O ( n ^1.5). Compared to a number of previous algorithms for this problem, our approach presents a substantial improvement in worst-case running time.

Palabras clave: Maximum Clique; Circle Graph; Longe Common Subsequence; Interval Model; Maximum Clique Problem.

- Session 7. Applications in Molecular Biology III | Pp. 270-281

A Simpler Analysis of Burrows-Wheeler Based Compression

Haim Kaplan; Shir Landau; Elad Verbin

In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm purposed by Burrows and Wheeler in their first paper on the subject [6], called bw0 . This algorithm consists of the following three steps: 1) Compute the Burrows-Wheeler transform of the text, 2) Convert the transform into a sequence of integers using the move-to-front algorithm, 3) Encode the integers using Arithmetic code or any order-0 encoding (possibly with run-length encoding). We prove a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s , and μ > 1, the length of the compressed string is bounded by μ | s | H _ k ( s ) + log( ζ ( μ )) | s | + g _ k where H _ k is the k-th order empirical entropy, g _ k is a constant depending only on k and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \ldots $ is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by 1.7286 | s | H _ k ( s ) + g _ k . This bound is better than the bound we give for bw0 .

Palabras clave: Compression Ratio; Compression Algorithm; Input String; Arithmetic Code; Consecutive Occurrence.

- Session 8. Data Compression | Pp. 282-293

Statistical Encoding of Succinct Data Structures

Rodrigo González; Gonzalo Navarro

In recent work, Sadakane and Grossi [SODA 2006] introduced a scheme to represent any sequence S = s _1 s _2... s _ n , over an alphabet of size σ , using $nH_k(S)+O(\frac{n}{\log_\sigma n} (k \log \sigma + \log\log n))$ bits of space, where H _ k ( S ) is the k -th order empirical entropy of S . The representation permits extracting any substring of size Θ(log_ σ n ) in constant time, and thus it completely replaces S under the RAM model. This is extremely important because it permits converting any succinct data structure requiring o (| S |) = o ( n log σ ) bits in addition to S , into another requiring nH _ k ( S )+ o ( n log σ ) (overall) for any k = o (log_ σ n ). They achieve this result by using Ziv-Lempel compression, and conjecture that the result can in particular be useful to implement compressed full-text indexes. In this paper we extend their result, by obtaining the same space and time complexities using a simpler scheme based on statistical encoding. We show that the scheme supports appending symbols in constant amortized time. In addition, we prove some results on the applicability of the scheme for full-text self-indexing.

Palabras clave: Statistical Encode; Arithmetic Code; Metic Code; Wavelet Tree; Rank Query.

- Session 8. Data Compression | Pp. 294-305

Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Veli Mäkinen; Gonzalo Navarro

Given a sequence of n bits with binary zero-order entropy H _0, we present a dynamic data structure that requires nH _0 + o ( n ) bits of space, which is able of performing rank and select , as well as inserting and deleting bits at arbitrary positions, in O (log n ) worst-case time. This extends previous results by Hon et al. [ISAAC 2003] achieving O (log n /loglog n ) time for rank and select but $\Theta({\textrm{polylog}}(n))$ amortized time for inserting and deleting bits, and requiring n + o ( n ) bits of space; and by Raman et al. [SODA 2002] which have constant query time but a static structure. In particular, our result becomes the first entropy-bound dynamic data structure for rank and select over bit sequences. We then show how the above result can be used to build a dynamic full-text self-index for a collection of texts over an alphabet of size σ , of overall length n and zero-order entropy H _0. The index requires nH _0 + o ( n log σ ) bits of space, and can count the number of occurrences of a pattern of length m in time O ( m log n log σ ). Reporting the occ occurrences can be supported in O ( occ log^2 n log σ ) time, paying O ( n ) extra space. Insertion of text to the collection takes O (log n log σ ) time per symbol, which becomes O (log^2 n log σ ) for deletions. This improves a previous result by Chan et al. [CPM 2004]. As a consequence, we obtain an O ( n log n log σ ) time construction algorithm for a compressed self-index requiring nH _0 + o ( n log σ ) bits working space during construction.

Palabras clave: Extra Space; Wavelet Tree; Rank Query; Dynamic Data Structure; Select Query.

- Session 8. Data Compression | Pp. 306-317

Reducing the Space Requirement of LZ-Index

Diego Arroyuelo; Gonzalo Navarro; Kunihiko Sadakane

The LZ-index is a compressed full-text self-index able to represent a text P _1...m, over an alphabet of size $\sigma = O(\textrm{polylog}(u))$ and with k -th order empirical entropy H _ k ( T ), using 4 uH _ k ( T ) + o ( u log σ ) bits for any k = o (log_ σ u ). It can report all the occ occurrences of a pattern P _1...m in T in O ( m ^3log σ + ( m + occ )log u ) worst case time. Its main drawback is the factor 4 in its space complexity, which makes it larger than other state-of-the-art alternatives. In this paper we present two different approaches to reduce the space requirement of LZ-index. In both cases we achieve (2 + ε ) uH _ k ( T ) + o ( u log σ ) bits of space, for any constant ε > 0, and we simultaneously improve the search time to O ( m ^2log m + ( m + occ )log u ). Both indexes support displaying any subtext of length ℓ in optimal O (ℓ/log_ σ u ) time. In addition, we show how the space can be squeezed to (1 + ε ) uH _ k ( T ) + o ( u log σ ) to obtain a structure with O ( m ^2) average search time for $m \geqslant 2\log_\sigma{u}$ .

Palabras clave: Space Requirement; Navigation Scheme; Phrase Pair; Text Substring; Operation Parent.

- Session 8. Data Compression | Pp. 318-329

Faster Algorithms for Computing Longest Common Increasing Subsequences

Gerth Stølting Brodal; Kanela Kaligosi; Irit Katriel; Martin Kutz

We present algorithms for finding a longest common increasing subsequence of two or more input sequences. For two sequences of lengths m and n , where m ≥ n , we present an algorithm with an output-dependent expected running time of $O((m+n\ell) \log\log \sigma + {\ensuremath{\mathit{Sort}}})$ and O ( m ) space, where ℓ is the length of an LCIS, σ is the size of the alphabet, and ${\ensuremath{\mathit{Sort}}}$ is the time to sort each input sequence. For k ≥3 length- n sequences we present an algorithm which improves the previous best bound by more than a factor k for many inputs. In both cases, our algorithms are conceptually quite simple but rely on existing sophisticated data structures. Finally, we introduce the problem of longest common weakly-increasing (or non-decreasing) subsequences (LCWIS), for which we present an O ( m + n log n )-time algorithm for the 3-letter alphabet case. For the extensively studied longest common subsequence problem, comparable speedups have not been achieved for small alphabets.

Palabras clave: Input Sequence; Longe Common Subsequence; Longe Common Subsequence; Close Split; Longe Common Subsequence Problem.

- Session 9. String Matching II | Pp. 330-341