Catálogo de publicaciones - libros

Compartir en
redes sociales


Combinatorial Pattern Matching: 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings

Moshe Lewenstein ; Gabriel Valiente (eds.)

En conferencia: 17º Annual Symposium on Combinatorial Pattern Matching (CPM) . Barcelona, Spain . July 5, 2006 - July 7, 2006

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

Pattern Recognition; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing; Information Storage and Retrieval; Bioinformatics; Data Structures

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2006 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-35455-0

ISBN electrónico

978-3-540-35461-1

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2006

Tabla de contenidos

Asynchronous Pattern Matching

Amihood Amir

This paper introduces a new pattern matching model that has been gaining importance recently, that of Asynchronous Pattern Matching . Traditional pattern matching has assumed the possibility of errors in the data content . We present motivation from text editing, computational biology, and computer architecture, that points to a new paradigm – where the errors occur in the address . It turns out that there are differences in techniques, complexities, and tools between the two different models, making it important to recognize their differences. We motivate and define the new model and present some problems that are worth pursuing.

Palabras clave: Pattern Match; String Match; Text Location; Address Error; Distance Problem.

- Asynchronous Pattern Matching | Pp. 1-10

SNP and Haplotype Analysis – Algorithms and Applications

Eran Halperin

The recent release of the Haplotype Mapping project (Nature, Oct. 26, 2005 – see also, e.g., NY Times, Oct. 27), and the rapid reduction in genotyping costs open new directions and opportunities in the study of complex genetic disease such as cancer or Alzheimer’s disease. The datasets collected for these studies are DNA sequences, with some noise and ambiguous information. In this talk I will discuss some of the algorithmic issues of disambiguating these DNA sequences, and the current and potential impact of these algorithms on genetics and medicine. In particular, I will discuss some of the problems in the field, such as genotype phasing, tag SNP selection (e.g. feature selection), and population stratification issues (e.g. clustering). The talk will be self contained, although some introductory material for the biological terminology and the HapMap project can be found at http://www.hapmap.org/whatishapmap.html.

Palabras clave: Feature Selection; Haplotype Analysis; Population Stratification; Rapid Reduction; HapMap Project.

- Asynchronous Pattern Matching | Pp. 11-11

Identifying Co-referential Names Across Large Corpora

Levon Lloyd; Andrew Mehler; Steven Skiena

A single logical entity can be referred to by several different names over a large text corpus. We present our algorithm for finding all such co-reference sets in a large corpus. Our algorithm involves three steps: morphological similarity detection, contextual similarity analysis, and clustering. Finally, we present experimental results on over large corpus of real news text to analyze the performance our techniques.

Palabras clave: Noun Phrase; Large Corpus; Cosine Similarity; Contextual Similarity; Computational Linguistics.

- Asynchronous Pattern Matching | Pp. 12-23

Adaptive Searching in Succinctly Encoded Binary Relations and Tree-Structured Documents

Jérémy Barbay; Alexander Golynski; J. Ian Munro; S. Srinivasa Rao

The most heavily used methods to answer conjunctive queries on binary relations (such as the one associating keywords with web pages) are based on inverted lists stored in sorted arrays and use variants of binary search. We show that a succinct representation of the binary relation permits much better results, while using space within a lower order term of the optimal. We apply our results not only to conjunctive queries on binary relations, but also to queries on semi-structured documents such as XML documents or file-system indexes, using a variant of an adaptive algorithm used to solve conjunctive queries on binary relations.

Palabras clave: conjunctive queries; intersection problem; succinct data structures; labeled trees; multi-labeled trees.

- Session 1. Data Structures | Pp. 24-35

Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE

Johannes Fischer; Volker Heun

The Range-Minimum-Query-Problem is to preprocess an array such that the position of the minimum element between two specified indices can be obtained efficiently. We present a direct algorithm for the general RMQ-problem with linear preprocessing time and constant query time, without making use of any dynamic data structure. It consumes less than half of the space that is needed by the method by Berkman and Vishkin. We use our new algorithm for RMQ to improve on LCA-computation for binary trees, and further give a constant-time LCE-algorithm solely based on arrays. Both LCA and LCE have important applications, e.g., in computational biology. Experimental studies show that our new method is almost twice as fast in practice as previous approaches, and asymptotically slower variants of the constant-time algorithms perform even better for today’s common problem sizes.

Palabras clave: Binary Tree; Query Time; Extra Space; Lower Common Ancestor; Query Length.

- Session 1. Data Structures | Pp. 36-48

A Linear Size Index for Approximate Pattern Matching

Ho-Leung Chan; Tak-Wah Lam; Wing-Kin Sung; Siu-Lung Tam; Swee-Seong Wong

This paper revisits the problem of indexing a text S [1.. n ] to support searching substrings in S that match a given pattern P [1.. m ] with at most k errors. A naive solution either has a worst-case matching time complexity of Ω( m ^ k ) or requires Ω( n ^ k ) space. Devising a solution with better performance has been a challenge until Cole et al. [5] showed an O ( n log^ k n )-space index that can support k -error matching in O ( m + occ + log^ k n loglog n ) time, where occ is the number of occurrences. Motivated by the indexing of DNA, we investigate in this paper the feasibility of devising a linear-size index that still has a time complexity linear in m . In particular, we give an O ( n )-space index that supports k -error matching in O ( m + occ + (log n ) $^{k({\it k}+1)}$ loglog n ) worst-case time. Furthermore, the index can be compressed from O ( n ) words into O ( n ) bits with a slight increase in the time complexity.

- Session 2. Indexing Data Structures | Pp. 49-59

On-Line Linear-Time Construction of Word Suffix Trees

Shunsuke Inenaga; Masayuki Takeda

Suffix trees are the key data structure for text string matching, and are used in wide application areas such as bioinformatics and data compression. Sparse suffix trees are kind of suffix trees that represent only a subset of suffixes of the input string. In this paper we study word suffix trees , which are one variation of sparse suffix trees. Let D be a dictionary of words and w be a string in D ^ + , namely, w is a sequence w _1 ⋯ w _ k of k words in D . The word suffix tree of w w.r.t. D is a path-compressed trie that represents only the k suffixes in the form of w _ i ⋯ w _ k . A typical example of its application is word- and phrase-level search on natural language documents. Andersson et al. proposed an algorithm to build word suffix trees in O ( n ) expected time with O ( k ) space. In this paper we present a new word suffix tree construction algorithm with O ( n ) running time and O ( k ) space in the worst cases. Our algorithm is on-line, which means that it can sequentially process the characters in the input, each by each, from left to right.

Palabras clave: Data Compression; Construction Algorithm; Suffix Tree; Input String; String Processing.

- Session 2. Indexing Data Structures | Pp. 60-71

Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Pang Ko; Srinivas Aluru

Designing external memory data structures for string data-bases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage and present construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length n , a pattern p and disk blocks of size B , we provide a substring search algorithm that uses O (| p |/ B + log_ B n ) disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length m that take O ( m log_ B ( n + m )) and O ( m log_ B n ) disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.

Palabras clave: Internal Node; Tree Construction; External Memory; Edge Label; Suffix Tree.

- Session 2. Indexing Data Structures | Pp. 72-83

Geometric Suffix Tree: A New Index Structure for Protein 3-D Structures

Tetsuo Shibuya

Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate query data structures for such 3-D structures are highly desired for research on proteins. This paper proposes a new data structure for indexing protein 3-D structures. For strings, there are many efficient indexing structures such as suffix trees, but it has been considered very difficult to design such sophisticated data structures against 3-D structures like proteins. Our index structure is based on the suffix trees and is called the geometric suffix tree. By using the geometric suffix tree for a set of protein structures, we can search for all of their substructures whose RMSDs (root mean square deviations) or URMSDs (unit-vector root mean square deviations) to a given query 3-D structure are not larger than a given bound. Though there are O ( N ^2) substructures, our data structure requires only O ( N ) space where N is the sum of lengths of the set of proteins. We propose an O ( N ^2) construction algorithm for it, while a naive algorithm would require O ( N ^3) time to construct it. Moreover we propose an efficient search algorithm. We also show computational experiments to demonstrate the practicality of our data structure. The experiments show that the construction time of the geometric suffix tree is practically almost linear to the size of the database, when applied to a protein structure database.

Palabras clave: Singular Value Decomposition; Index Structure; Outgoing Edge; Construction Algorithm; Suffix Tree.

- Session 2. Indexing Data Structures | Pp. 84-93

New Bounds for Motif Finding in Strong Instances

Broňa Brejová; Daniel G. Brown; Ian M. Harrower; Tomáš Vinař

Many algorithms for motif finding that are commonly used in bioinformatics start by sampling r potential motif occurrences from n input sequences. The motif is derived from these samples and evaluated on all sequences. This approach works extremely well in practice, and is implemented by several programs. Li, Ma and Wang have shown that a simple algorithm of this sort is a polynomial-time approximation scheme. However, in 2005, we showed specific instances of the motif finding problem for which the approximation ratio of a slight variation of this scheme converges to one very slowly as a function of the sample size r , which seemingly contradicts the high performance of sample-based algorithms. Here, we account for the difference by showing that, for a variety of different definitions of “strong” binary motifs, the approximation ratio of sample-based algorithms converges to one exponentially fast in r . We also describe “very strong” motifs, for which the simple sample-based approach always identifies the correct motif, even for modest values of r .

- Session 3. Probabilistic and Algebraic Techniques | Pp. 94-105