Catálogo de publicaciones - libros
Combinatorial Pattern Matching: 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings
Moshe Lewenstein ; Gabriel Valiente (eds.)
En conferencia: 17º Annual Symposium on Combinatorial Pattern Matching (CPM) . Barcelona, Spain . July 5, 2006 - July 7, 2006
Resumen/Descripción – provisto por la editorial
No disponible.
Palabras clave – provistas por la editorial
Pattern Recognition; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing; Information Storage and Retrieval; Bioinformatics; Data Structures
Disponibilidad
Institución detectada | Año de publicación | Navegá | Descargá | Solicitá |
---|---|---|---|---|
No detectada | 2006 | SpringerLink |
Información
Tipo de recurso:
libros
ISBN impreso
978-3-540-35455-0
ISBN electrónico
978-3-540-35461-1
Editor responsable
Springer Nature
País de edición
Reino Unido
Fecha de publicación
2006
Información sobre derechos de publicación
© Springer-Verlag Berlin Heidelberg 2006
Cobertura temática
Tabla de contenidos
doi: 10.1007/11780441_11
Fingerprint Clustering with Bounded Number of Missing Values
Paola Bonizzoni; Gianluca Della Vedova; Riccardo Dondi; Giancarlo Mauri
The problem of clustering fingerprint vectors with missing values is an interesting problem in Computational Biology that has been proposed in [6]. In this paper we show some improvements in closing the gaps between the known lower bounds and upper bounds on the approximability of variants of the biological problem. Moreover, we have studied two additional variants of the original problem. We prove that all such problems are APX-hard even when each fingerprint contains only two unknown positions and we present a greedy algorithm that has constant approximation factors for these variants. Despite the hardness of these restricted versions of the problem, we show that the general clustering problem on an unbounded number of missing values such that they occur for every fixed position of an input vector in at most one fingerprint is polynomial time solvable.
Palabras clave: Vertex Cover; Maximal Clique; Bound Number; Constant Approximation Factor; Minimum Vertex Cover.
- Session 3. Probabilistic and Algebraic Techniques | Pp. 106-116
doi: 10.1007/11780441_12
Tiling an Interval of the Discrete Line
Olivier Bodini; Eric Rivals
We consider the problem of tiling a segment {0, ..., n } of the discrete line. More precisely, we ought to characterize the structure of the patterns that tile a segment and their number. A pattern is a subset of ℕ. A tiling pattern or tile for {0, ..., n } is a subset $A \in {{\mathcal P}({\mathbb{N}})}$ such that there exists $B \in {{\mathcal P}({\mathbb{N}})}$ and such that the direct sum of A and B equals {0, ..., n }. This problem is related to the difficult question of the decomposition in direct sums of the torus ℤ/ n ℤ (proposed by Minkowski). Using combinatorial and algebraic techniques, we give a new elementary proof of Krasner factorizations. We combinatorially prove that the tiles are direct sums of some arithmetic sequences of specific lengths. Besides, we show there are as many tiles whose smallest tilable segment is {0, ..., n } as tiles whose smallest tilable segment is {0, ..., d }, for all strict divisors d of n . This enables us to exhibit an optimal linear time algorithm to compute for a given pattern the smallest segment that it tiles if any, as well as a recurrence formula for counting the tiles of a segment.
Palabras clave: Linear Time Algorithm; Irreducible Factor; Integer Sequence; Algebraic Technique; Discrete Line.
- Session 3. Probabilistic and Algebraic Techniques | Pp. 117-128
doi: 10.1007/11780441_13
Common Substrings in Random Strings
Eric Blais; Mathieu Blanchette
In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than methods previously published.
- Session 3. Probabilistic and Algebraic Techniques | Pp. 129-140
doi: 10.1007/11780441_14
On the Repeat-Annotated Phylogenetic Tree Reconstruction Problem
Firas Swidan; Michal Ziv-Ukelson; Ron Y. Pinter
A new problem in phylogenetic inference is presented, based on recent biological findings indicating a strong association between reversals (aka inversions ) and repeats . These biological findings are formalized here in a new mathematical model, called repeat-annotated phylogenetic trees (RAPT). We show that, under RAPT, the evolutionary process — including both the tree-topology as well as internal node genome orders — is uniquely determined, a property that is of major significance both in theory and in practice. Furthermore, the repeats are employed to provide linear-time algorithms for reconstructing both the genomic orders and the phylogeny, which are NP-hard problems under the classical model of sorting by reversals (SBR).
Palabras clave: Edge Label; Xanthomonas Campestris; Repeat Pair; Repeat Subsequence; Legal Reversal.
- Session 4. Applications in Molecular Biology I | Pp. 141-152
doi: 10.1007/11780441_15
Subsequence Combinatorics and Applications to Microarray Production, DNA Sequencing and Chaining Algorithms
Sven Rahmann
We investigate combinatorial enumeration problems related to subsequences of strings; in contrast to substrings, subsequences need not be contiguous. For a finite alphabet Σ, the following three problems are solved. (1) Number of distinct subsequences : Given a sequence s ∈Σ^ n and a nonnegative integer k ≤ n , how many distinct subsequences of length k does s contain? A previous result by Chase states that this number is maximized by choosing s as a repeated permutation of the alphabet. This has applications in DNA microarray production. (2) Number of ρ -restricted ρ -generated sequences : Given s ∈Σ^ n and integers k ≥1 and ρ ≥1, how many distinct sequences in Σ^ k contain no single nucleotide repeat longer than ρ and can be written as $s_1^{r_1}\dots s_n^{r_n}$ with 0≤ r _ i ≤ ρ for all i ? For ρ = ∞, the question becomes how many length- k sequences match the regular expression s _1 * s _2 * ... s _ n * . These considerations allow a detailed analysis of a new DNA sequencing technology (“454 sequencing”). (3) Exact length distribution of the longest increasing subsequence : Given Σ= {1,..., K } and an integer n ≥1, determine the number of sequences in Σ^ n whose longest strictly increasing subsequence has length k , where 0 ≤ k ≤ K . This has applications to significance computations for chaining algorithms.
Palabras clave: Arithmetic Operation; Regular Expression; Deposition Sequence; Alphabet Size; Motif Occurrence.
- Session 4. Applications in Molecular Biology I | Pp. 153-164
doi: 10.1007/11780441_16
Solving the Maximum Agreement SubTree and the Maximum Compatible Tree Problems on Many Bounded Degree Trees
Sylvain Guillemot; François Nicolas
Given a set of leaf-labeled trees with identical leaf sets, the well-known Maximum Agreement SubTree problem (MAST) consists of finding a subtree homeomorphically included in all input trees and with the largest number of leaves. Its variant called Maximum Compatible Tree (MCT) is less stringent, as it allows the input trees to be refined. Both problems are of particular interest in computational biology, where trees encountered have often small degrees. In this paper, this paper, we study the parameterized complexity of MAST and MCT with respect to the maximum degree, denoted D , of the input trees. While MAST is polynomial for bounded D [1,6,3], we show that MAST is W[1]-hard with respect to parameter D . Moreover, relying on recent advances in parameterized complexity we obtain a tight lower bound: while MAST can be solved in O ( N $^{O({\it D})}$ ) time where N denotes the input length, we show that an O ( N $^{o({\it D})}$ ) bound is not achievable, unless SNP ⊆ SE. We also show that MCT is W[1]-hard with respect to D , and that MCT cannot be solved in $O\big(N^{o(2^{D/2})}\big)$ time, unless SNP ⊆ SE.
Palabras clave: Maximum Degree; Parameterized Complexity; Mast Problem; Input Tree; Leaf Label.
- Session 4. Applications in Molecular Biology I | Pp. 165-176
doi: 10.1007/11780441_17
An Improved Algorithm for the Macro-evolutionary Phylogeny Problem
Behshad Behzadi; Martin Vingron
Macro-evolutionary processes (e.g., gene duplication and loss) have rarely been incorporated into gene phylogeny reconstruction methods. Durand et al. [5] have proposed a polynomial time dynamic programming algorithm to find the gene family tree that optimizes a macro-evolutionary criterion which is the weighted sum of the number of gene duplications and losses. The complexity of this algorithm is O ( nm ^2) where n is the number of species and m is the maximum number of copies of the gene in a species. In this paper, we propose an improved algorithm with time complexity of O ( nm ) for solving this problem. We also show, that the problem can be solved in O ( n ) if unit costs are considered for both loss and duplication.
Palabras clave: Species Tree; Gene Duplication; Gene Tree; Improve Algorithm; Optimal Interval.
- Session 4. Applications in Molecular Biology I | Pp. 177-187
doi: 10.1007/11780441_18
Property Matching and Weighted Matching
Amihood Amir; Eran Chencinski; Costas Iliopoulos; Tsvi Kopelowitz; Hui Zhang
Pattern Matching with Properties (Property Matching, for short), involves a string matching between the pattern and the text, and the requirement that the text part satisfies some property. It is straightforward to do sequential matching in a text with properties. However, indexing in a text with properties becomes difficult if we desire the time to be output dependent. We present an algorithm for indexing a text with properties in O ( n log|Σ|+ n loglog n ) time for preprocessing and O (| P |log|Σ|+ tocc _ π ) per query, where n is the length of the text, P is the sought pattern, Σ is the alphabet, and tocc _ π is the number of occurrences of the pattern that satisfy some property π . As a practical use of Property Matching we show how to solve Weighted Matching problems using techniques from Property Matching. Weighted sequences have been introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a “statistical image” of this set, where we are given the probability of every symbol’s occurrence at every text location. Weighted matching problems are pattern matching problems where the given text is weighted. We present a reduction from Weighted Matching to Property Matching that allows off-the-shelf solutions to numerous weighted matching problems including indexing, swapped matching, parameterized matching, approximate matching, and many more. Assuming that one seeks the occurrence of pattern P with probability ε in weighted text T of length n , we reduce the problem to a property matching problem of pattern P in text T ′ of length $O(n(\frac{1}{\epsilon})^2 \log \frac{1}{\epsilon})$ .
Palabras clave: Pattern Match; Match Problem; Query Time; Weighted Match; Approximate Match.
- Session 5. String Matching I | Pp. 188-199
doi: 10.1007/11780441_19
Faster Two Dimensional Scaled Matching
Amihood Amir; Eran Chencinski
The rapidly growing need for analysis of digitized images in multimedia systems has lead to a variety of interesting problems in multidimensional pattern matching. One of the problems is that of scaled matching , finding all appearances of a pattern, proportionally enlarged according to an arbitrary real-sized scale, in a given text. The best known algorithm for this problem uses techniques from dictionary matching to solve the problem in O ( nm ^3+ n ^2 m log m ) time using O ( nm ^3+ n ^2) space, where the text is a two-dimensional n × n array and the pattern is a two-dimensional m × m array. We present a new approach for solving the scaled matching problem improving both the running times and the space requirements. Our algorithm runs in O ( n ^2 m ) time and uses O ( n ^2) space. This time includes the preprocessing ( O ( m ^3) time and O ( m ^2) space), since the problem is only defined for m ≤ n .
- Session 5. String Matching I | Pp. 200-210
doi: 10.1007/11780441_20
Approximation of RNA Multiple Structural Alignment
Marcin Kubica; Romeo Rizzi; Stéphane Vialette; Tomasz Waleń
In the context of non-coding RNA (ncRNA) multiple structural alignment, Davydov and Batzoglou introduced in [7] the problem of finding the largest nested linear graph that occurs in a set ${\mathcal{G}}$ of linear graphs, the so-called Max-NLS problem. This problem generalizes both the longest common subsequence problem and the maximum common homeomorphic subtree problem for rooted ordered trees. In the present paper, we give a fast algorithm for finding the largest nested linear subgraph of a linear graph and a polynomial-time algorithm for a fixed number ( k ) of linear graphs. Also, we strongly strengthen the result of [7] by proving that the problem is NP -complete even if ${\mathcal{G}}$ is composed of nested linear graphs of height at most 2, thereby precisely defining the borderline between tractable and intractable instances of the problem. Of particular importance, we improve the result of [7] by showing that the Max-NLS problem is approximable within ratio O (log m _ opt ) in O ( kn ^2) running time, where m _ opt is the size of an optimal solution. We also present ${{\mathcal O}}(1)$ -approximation of Max-NLS problem running in ${{\mathcal O}}(kn)$ time for restricted linear graphs. In particular, for ncRNA derived linear graphs, an $\frac{1}{4}$ -approximation is presented.
Palabras clave: Structural Alignment; Nest Loop; Linear Graph; Discrete Apply Mathematic; Independent Edge.
- Session 6. Applications in Molecular Biology II | Pp. 211-222