Catálogo de publicaciones - libros

Compartir en
redes sociales


Combinatorial Pattern Matching: 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings

Moshe Lewenstein ; Gabriel Valiente (eds.)

En conferencia: 17º Annual Symposium on Combinatorial Pattern Matching (CPM) . Barcelona, Spain . July 5, 2006 - July 7, 2006

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

Pattern Recognition; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing; Information Storage and Retrieval; Bioinformatics; Data Structures

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2006 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-35455-0

ISBN electrónico

978-3-540-35461-1

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2006

Tabla de contenidos

Fingerprint Clustering with Bounded Number of Missing Values

Paola Bonizzoni; Gianluca Della Vedova; Riccardo Dondi; Giancarlo Mauri

The problem of clustering fingerprint vectors with missing values is an interesting problem in Computational Biology that has been proposed in [6]. In this paper we show some improvements in closing the gaps between the known lower bounds and upper bounds on the approximability of variants of the biological problem. Moreover, we have studied two additional variants of the original problem. We prove that all such problems are APX-hard even when each fingerprint contains only two unknown positions and we present a greedy algorithm that has constant approximation factors for these variants. Despite the hardness of these restricted versions of the problem, we show that the general clustering problem on an unbounded number of missing values such that they occur for every fixed position of an input vector in at most one fingerprint is polynomial time solvable.

Palabras clave: Vertex Cover; Maximal Clique; Bound Number; Constant Approximation Factor; Minimum Vertex Cover.

- Session 3. Probabilistic and Algebraic Techniques | Pp. 106-116

Tiling an Interval of the Discrete Line

Olivier Bodini; Eric Rivals

We consider the problem of tiling a segment {0, ..., n } of the discrete line. More precisely, we ought to characterize the structure of the patterns that tile a segment and their number. A pattern is a subset of ℕ. A tiling pattern or tile for {0, ..., n } is a subset $A \in {{\mathcal P}({\mathbb{N}})}$ such that there exists $B \in {{\mathcal P}({\mathbb{N}})}$ and such that the direct sum of A and B equals {0, ..., n }. This problem is related to the difficult question of the decomposition in direct sums of the torus ℤ/ n ℤ (proposed by Minkowski). Using combinatorial and algebraic techniques, we give a new elementary proof of Krasner factorizations. We combinatorially prove that the tiles are direct sums of some arithmetic sequences of specific lengths. Besides, we show there are as many tiles whose smallest tilable segment is {0, ..., n } as tiles whose smallest tilable segment is {0, ..., d }, for all strict divisors d of n . This enables us to exhibit an optimal linear time algorithm to compute for a given pattern the smallest segment that it tiles if any, as well as a recurrence formula for counting the tiles of a segment.

Palabras clave: Linear Time Algorithm; Irreducible Factor; Integer Sequence; Algebraic Technique; Discrete Line.

- Session 3. Probabilistic and Algebraic Techniques | Pp. 117-128

Common Substrings in Random Strings

Eric Blais; Mathieu Blanchette

In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than methods previously published.

- Session 3. Probabilistic and Algebraic Techniques | Pp. 129-140

On the Repeat-Annotated Phylogenetic Tree Reconstruction Problem

Firas Swidan; Michal Ziv-Ukelson; Ron Y. Pinter

A new problem in phylogenetic inference is presented, based on recent biological findings indicating a strong association between reversals (aka inversions ) and repeats . These biological findings are formalized here in a new mathematical model, called repeat-annotated phylogenetic trees (RAPT). We show that, under RAPT, the evolutionary process — including both the tree-topology as well as internal node genome orders — is uniquely determined, a property that is of major significance both in theory and in practice. Furthermore, the repeats are employed to provide linear-time algorithms for reconstructing both the genomic orders and the phylogeny, which are NP-hard problems under the classical model of sorting by reversals (SBR).

Palabras clave: Edge Label; Xanthomonas Campestris; Repeat Pair; Repeat Subsequence; Legal Reversal.

- Session 4. Applications in Molecular Biology I | Pp. 141-152

Subsequence Combinatorics and Applications to Microarray Production, DNA Sequencing and Chaining Algorithms

Sven Rahmann

We investigate combinatorial enumeration problems related to subsequences of strings; in contrast to substrings, subsequences need not be contiguous. For a finite alphabet Σ, the following three problems are solved. (1) Number of distinct subsequences : Given a sequence s ∈Σ^ n and a nonnegative integer k ≤ n , how many distinct subsequences of length k does s contain? A previous result by Chase states that this number is maximized by choosing s as a repeated permutation of the alphabet. This has applications in DNA microarray production. (2) Number of ρ -restricted ρ -generated sequences : Given s ∈Σ^ n and integers k ≥1 and ρ ≥1, how many distinct sequences in Σ^ k contain no single nucleotide repeat longer than ρ and can be written as $s_1^{r_1}\dots s_n^{r_n}$ with 0≤ r _ i ≤ ρ for all i ? For ρ = ∞, the question becomes how many length- k sequences match the regular expression s _1 * s _2 * ... s _ n * . These considerations allow a detailed analysis of a new DNA sequencing technology (“454 sequencing”). (3) Exact length distribution of the longest increasing subsequence : Given Σ= {1,..., K } and an integer n ≥1, determine the number of sequences in Σ^ n whose longest strictly increasing subsequence has length k , where 0 ≤ k ≤ K . This has applications to significance computations for chaining algorithms.

Palabras clave: Arithmetic Operation; Regular Expression; Deposition Sequence; Alphabet Size; Motif Occurrence.

- Session 4. Applications in Molecular Biology I | Pp. 153-164

Solving the Maximum Agreement SubTree and the Maximum Compatible Tree Problems on Many Bounded Degree Trees

Sylvain Guillemot; François Nicolas

Given a set of leaf-labeled trees with identical leaf sets, the well-known Maximum Agreement SubTree problem (MAST) consists of finding a subtree homeomorphically included in all input trees and with the largest number of leaves. Its variant called Maximum Compatible Tree (MCT) is less stringent, as it allows the input trees to be refined. Both problems are of particular interest in computational biology, where trees encountered have often small degrees. In this paper, this paper, we study the parameterized complexity of MAST and MCT with respect to the maximum degree, denoted D , of the input trees. While MAST is polynomial for bounded D [1,6,3], we show that MAST is W[1]-hard with respect to parameter D . Moreover, relying on recent advances in parameterized complexity we obtain a tight lower bound: while MAST can be solved in O ( N $^{O({\it D})}$ ) time where N denotes the input length, we show that an O ( N $^{o({\it D})}$ ) bound is not achievable, unless SNP ⊆ SE. We also show that MCT is W[1]-hard with respect to D , and that MCT cannot be solved in $O\big(N^{o(2^{D/2})}\big)$ time, unless SNP ⊆ SE.

Palabras clave: Maximum Degree; Parameterized Complexity; Mast Problem; Input Tree; Leaf Label.

- Session 4. Applications in Molecular Biology I | Pp. 165-176

An Improved Algorithm for the Macro-evolutionary Phylogeny Problem

Behshad Behzadi; Martin Vingron

Macro-evolutionary processes (e.g., gene duplication and loss) have rarely been incorporated into gene phylogeny reconstruction methods. Durand et al. [5] have proposed a polynomial time dynamic programming algorithm to find the gene family tree that optimizes a macro-evolutionary criterion which is the weighted sum of the number of gene duplications and losses. The complexity of this algorithm is O ( nm ^2) where n is the number of species and m is the maximum number of copies of the gene in a species. In this paper, we propose an improved algorithm with time complexity of O ( nm ) for solving this problem. We also show, that the problem can be solved in O ( n ) if unit costs are considered for both loss and duplication.

Palabras clave: Species Tree; Gene Duplication; Gene Tree; Improve Algorithm; Optimal Interval.

- Session 4. Applications in Molecular Biology I | Pp. 177-187

Property Matching and Weighted Matching

Amihood Amir; Eran Chencinski; Costas Iliopoulos; Tsvi Kopelowitz; Hui Zhang

Pattern Matching with Properties (Property Matching, for short), involves a string matching between the pattern and the text, and the requirement that the text part satisfies some property. It is straightforward to do sequential matching in a text with properties. However, indexing in a text with properties becomes difficult if we desire the time to be output dependent. We present an algorithm for indexing a text with properties in O ( n log|Σ|+ n loglog n ) time for preprocessing and O (| P |log|Σ|+ tocc _ π ) per query, where n is the length of the text, P is the sought pattern, Σ is the alphabet, and tocc _ π is the number of occurrences of the pattern that satisfy some property π . As a practical use of Property Matching we show how to solve Weighted Matching problems using techniques from Property Matching. Weighted sequences have been introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a “statistical image” of this set, where we are given the probability of every symbol’s occurrence at every text location. Weighted matching problems are pattern matching problems where the given text is weighted. We present a reduction from Weighted Matching to Property Matching that allows off-the-shelf solutions to numerous weighted matching problems including indexing, swapped matching, parameterized matching, approximate matching, and many more. Assuming that one seeks the occurrence of pattern P with probability ε in weighted text T of length n , we reduce the problem to a property matching problem of pattern P in text T ′ of length $O(n(\frac{1}{\epsilon})^2 \log \frac{1}{\epsilon})$ .

Palabras clave: Pattern Match; Match Problem; Query Time; Weighted Match; Approximate Match.

- Session 5. String Matching I | Pp. 188-199

Faster Two Dimensional Scaled Matching

Amihood Amir; Eran Chencinski

The rapidly growing need for analysis of digitized images in multimedia systems has lead to a variety of interesting problems in multidimensional pattern matching. One of the problems is that of scaled matching , finding all appearances of a pattern, proportionally enlarged according to an arbitrary real-sized scale, in a given text. The best known algorithm for this problem uses techniques from dictionary matching to solve the problem in O ( nm ^3+ n ^2 m log m ) time using O ( nm ^3+ n ^2) space, where the text is a two-dimensional n × n array and the pattern is a two-dimensional m × m array. We present a new approach for solving the scaled matching problem improving both the running times and the space requirements. Our algorithm runs in O ( n ^2 m ) time and uses O ( n ^2) space. This time includes the preprocessing ( O ( m ^3) time and O ( m ^2) space), since the problem is only defined for m ≤ n .

- Session 5. String Matching I | Pp. 200-210

Approximation of RNA Multiple Structural Alignment

Marcin Kubica; Romeo Rizzi; Stéphane Vialette; Tomasz Waleń

In the context of non-coding RNA (ncRNA) multiple structural alignment, Davydov and Batzoglou introduced in [7] the problem of finding the largest nested linear graph that occurs in a set ${\mathcal{G}}$ of linear graphs, the so-called Max-NLS problem. This problem generalizes both the longest common subsequence problem and the maximum common homeomorphic subtree problem for rooted ordered trees. In the present paper, we give a fast algorithm for finding the largest nested linear subgraph of a linear graph and a polynomial-time algorithm for a fixed number ( k ) of linear graphs. Also, we strongly strengthen the result of [7] by proving that the problem is NP -complete even if ${\mathcal{G}}$ is composed of nested linear graphs of height at most 2, thereby precisely defining the borderline between tractable and intractable instances of the problem. Of particular importance, we improve the result of [7] by showing that the Max-NLS problem is approximable within ratio O (log m _ opt ) in O ( kn ^2) running time, where m _ opt is the size of an optimal solution. We also present ${{\mathcal O}}(1)$ -approximation of Max-NLS problem running in ${{\mathcal O}}(kn)$ time for restricted linear graphs. In particular, for ncRNA derived linear graphs, an $\frac{1}{4}$ -approximation is presented.

Palabras clave: Structural Alignment; Nest Loop; Linear Graph; Discrete Apply Mathematic; Independent Edge.

- Session 6. Applications in Molecular Biology II | Pp. 211-222