Catálogo de publicaciones - libros

Compartir en
redes sociales


Combinatorial Pattern Matching: 16th Annual Symposium, CPM 2005, Jeju Island, Korea, June 19-22, 2005, Proceedings

Alberto Apostolico ; Maxime Crochemore ; Kunsoo Park (eds.)

En conferencia: 16º Annual Symposium on Combinatorial Pattern Matching (CPM) . Jeju Island, South Korea . June 19, 2005 - June 22, 2005

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

Algorithm Analysis and Problem Complexity; Data Structures; Information Storage and Retrieval; Document Preparation and Text Processing; Pattern Recognition; Bioinformatics

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2005 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-26201-5

ISBN electrónico

978-3-540-31562-9

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2005

Tabla de contenidos

Sharper Upper and Lower Bounds for an Approximation Scheme for

Broňa Brejová; Daniel G. Brown; Ian M. Harrower; Alejandro López-Ortiz; Tomáš Vinař

We present sharper upper and lower bounds for a known polynomial-time approximation scheme due to Li, Ma and Wang [7] for the problem. This NP-hard problem is an abstraction of motif finding, a common bioinformatics discovery task. The PTAS due to Li is simple, and a preliminary implementation [8] gave reasonable results in practice. However, the previously known bounds on its performance are useless when runtimes are actually manageable. Here, we present much sharper lower and upper bounds on the performance of this algorithm that partially explain why its behavior is so much better in practice than what was previously predicted in theory. We also give specific examples of instances of the problem for which the PTAS performs poorly in practice, and show that the asymptotic performance bound given in the original proof matches the behaviour of a simple variant of the algorithm on a particularly bad instance of the problem.

Pp. 1-10

On the Longest Common Rigid Subsequence Problem

Bin Ma; Kaizhong Zhang

The longest common subsequence problem (LCS) and the closest substring problem (CSP) are two models for the finding of common patterns in strings. The two problem have been studied extensively. The former was previously proved to be not polynomial-time approximable within ratio for a constant . The latter was previously proved to be -hard and have a PTAS. In this paper, the longest common rigid subsequence problem (LCRS) is studied. LCRS shares similarity with LCS and CSP and has an important application in motif finding in biological sequences. LCRS is proved to be hard in this paper. An exact algorithm with quasi-polynomial average running time is also provided.

Pp. 11-20

Text Indexing with Errors

Moritz G. Maaß; Johannes Nowak

In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of results. This improves over previous work where the lookup time is not linear or depends upon the size of the document corpus. Our data structure has size on average and with high probability for input size and queries with up to errors. Additionally, we present a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear lookup time on average.

Pp. 21-32

A New Compressed Suffix Tree Supporting Fast Search and Its Construction Algorithm Using Optimal Working Space

Dong Kyue Kim; Heejin Park

The compressed suffix array and the compressed suffix tree for a given string are full-text index data structures occupying (log|Σ|) bits where is the length of and Σ is the alphabet from which symbols of are drawn. When they were first introduced, they were constructed from suffix arrays and suffix trees, which implies they were not constructed in optimal (log|Σ|)-bit working space. Recently, several methods were developed for constructing compressed suffix arrays and compressed suffix trees in optimal working space. By these methods, one can construct compressed suffix trees supporting the pattern search in (′ |Σ|) time where ′ = log, is the length of a pattern, and log is the time to find the th smallest suffix of from the compressed suffix array for any fixed 0 < ≤ 1. However, compressed suffix trees supporting the pattern search in (′ log|Σ| ) time are not constructed by these methods.

In this paper, we present a new compressed suffix tree supporting (′ log|Σ|)-time pattern search and its construction algorithm using optimal working space. To obtain this result, we developed a new succinct representation of the suffix trees, which is different from the classic succinct representation of parentheses encoding of the suffix trees. Our succinct representation technique can be generally applicable to succinct representation of other search trees.

Pp. 33-44

Succinct Suffix Arrays Based on Run-Length Encoding

Veli Mäkinen; Gonzalo Navarro

A succinct full-text self-index is a data structure built on a text =... , which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern =... in , and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take () or () bits, being the th order empirical entropy of . The time to count how many times does occur in ranges from () to (log ).

We present a new self-index, called run-length FM-index (RLFM index), that counts the occurrences of in in () time when the alphabet size is . The index requires log + () bits of space for small . We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare ours against the best existing implementations of other indexes and show that ours are fastest among indexes taking less space than the text.

Pp. 45-56

Linear-Time Construction of Compressed Suffix Arrays Using ( log )-Bit Working Space for Large Alphabets

Joong Chae Na

The is a fundamental index data structure in string algorithms and bioinformatics, and the and the are its compressed versions. Many algorithms for constructing these index data structures have been developed. Recently, Hon et al. [11] proposed a construction algorithm using ( ·loglog|Σ|) time and (log|Σ|)-bit working space, which is the fastest algorithm using (log|Σ|)-bit working space.

In this paper we give an efficient algorithm to construct the index data structures for large alphabets. Our algorithm constructs the suffix array, the CSA, and the FM-index using () time and -bit working space, where = log 2. Our algorithm takes less time and more space than Hon et al.’s algorithm. Our algorithm uses least working space among alphabet-independent linear-time algorithms.

Pp. 57-67

Faster Algorithms for ,-Matching and Related Problems

Peter Clifford; Raphaël Clifford; Costas Iliopoulos

We present new faster algorithms for the problems of and (, )-matching on numeric strings. In both cases the running time of the proposed algorithms is shown to be ( log ), where is the pattern length, is the text length and a given integer. Our approach makes use of Fourier transform methods and the running times are independent of the alphabet size. algorithms for the -matching and total-difference problems are also given. In all the above cases, we improve existing running time bounds in the literature.

Pp. 68-78

A Fast Algorithm for Approximate String Matching on Gene Sequences

Zheng Liu; Xin Chen; James Borneman; Tao Jiang

Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the approximate string matching problem, the , whose objective is to find all occurrences of a short pattern in a long text string with at most mismatches. FAAST generalizes the well-known Tarhio-Ukkonen algorithm by requiring two or more matches when calculating shift distances, which makes the approximate string matching process significantly faster than the Tarhio-Ukkonen algorithm. Theoretically, we prove that FAAST on average skips more characters than the Tarhio-Ukkonen algorithm in a single shift, and makes fewer character comparisons in an entire matching process. Experiments on both simulated data sets and real gene sequences also demonstrate that FAAST runs several times faster than the Tarhio-Ukkonen algorithm in all the cases that we tested.

Pp. 79-90

Approximate Matching in the Metric

Amihood Amir; Ohad Lipsky; Ely Porat; Julia Umanski

is one of the fundamental problems in pattern matching, and a ubiquitous problem in real applications. The Hamming distance is a simple and well studied example of approximate matching, motivated by typing, or noisy channels. Biological and image processing applications assign a different value to mismatches of different symbols.

We consider the problem of approximate matching in the metric – the . Given text =,..., and pattern =,..., strings of natural number, and a natural number , we seek all text locations where the distance of the pattern from the length substring of text starting at is not greater than , i.e. .

We provide an algorithm that solves the --distance problem in time . The algorithm applies a bounded divide-and-conquer approach and makes novel uses of non-boolean convolutions.

Pp. 91-103

An Efficient Algorithm for Generating Super Condensed Neighborhoods

Luís M. S. Russo; Arlindo L. Oliveira

Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Here, we point out that condensed neighborhoods are not a minimal representation of a pattern neighborhood. We show that we can restrict our attention to super condensed neighborhoods which are minimal. We then present an algorithm for generating Super Condensed Neighborhoods. The algorithm runs in (⌈ / ⌉ ), where is the pattern size, is the size of the super condensed neighborhood and the size of the processor word. Previous algorithms took (⌈ / ⌉ ) time, where is the size of the condensed neighborhood. We further improve this algorithm by using Bit-Parallelism and Increased Bit-Parallelism techniques. Our experimental results show that the resulting algorithm is very fast.

Pp. 104-115