Publications catalog - books



Bioinformatics Research and Development: First International Conference, BIRD 2007, Berlin, Germany, March 12-14, 2007. Proceedings

Sepp Hochreiter ; Roland Wagner (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution Year of publication Browse Download Request
None detected 2007 SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-71232-9

Electronic ISBN

978-3-540-71233-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data

Majid M. Beigi; Andreas Zell

Many classifiers are designed on the assumption of well-balanced datasets. In real problems such as protein classification and remote homology detection, however, binary classifiers like support vector machines (SVMs) and other kernel methods face imbalanced data: there are few protein sequences in the positive (minority) class compared with the negative (majority) class. A widely used remedy in protein classification is to assign different error costs or decision thresholds to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that when the datasets are highly imbalanced, and especially when the classes overlap, the efficiency and stability of that method decrease. This paper shows that combining it with our suggested oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. The Synthetic Protein Sequence Oversampling (SPSO) method creates synthetic protein sequences of the minority class, taking into account the distribution of that class as well as that of the majority class, and it operates in data space rather than feature space. We used G-protein-coupled receptor families as real data, classifying them at the subfamily and sub-subfamily levels (which have few sequences), and obtained better accuracy and Matthews correlation coefficient than previously published methods. We also generated artificial data with different distributions and degrees of overlap between the minority and majority classes to measure the efficiency of our method. The method was evaluated by the area under the receiver operating characteristic (ROC) curve.

- Session 7: Sequence Analysis II | Pp. 263-277
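
The abstract above reports the Matthews correlation coefficient alongside accuracy because accuracy alone can look strong on imbalanced data. A minimal sketch of that metric follows; the function name and the confusion-matrix counts are illustrative assumptions and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the coefficient is
    undefined there; 0.0 is a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 10 positives, 990 negatives.
tp, fn, tn, fp = 5, 5, 985, 5
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={accuracy:.3f}  mcc={mcc:.3f}")
```

In this made-up case the classifier misses half of the positives yet still scores 99% accuracy, while the coefficient stays near 0.5.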

Stem Kernels for RNA Sequence Analyses

Yasubumi Sakakibara; Kiyoshi Asai; Kengo Sato

Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from non-members and hence detect non-coding RNA regions from genome sequences.

A novel kernel function, the stem kernel, is proposed for the discrimination and detection of functional RNA sequences using support vector machines (SVMs). The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structure. It examines all possible common base pairs and stem structures of arbitrary length, including pseudoknots, between two RNA sequences and calculates the inner product of common stem structure counts. An efficient dynamic programming algorithm was developed to calculate the stem kernel. The stem kernel is then applied with an SVM to discriminate members of an RNA family from non-members. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Further, the potential of the stem kernel is demonstrated by detecting remotely homologous RNA families in terms of secondary structure, motivated by the fact that the string kernel has been shown to work for remote homology detection of protein sequences. These experimental results have convinced us to apply the stem kernel to find novel RNA families in genome sequences.

- Session 7: Sequence Analysis II | Pp. 278-291
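
The abstract above describes the stem kernel as an extension of the all-subsequences string kernel. The sketch below shows only that underlying string kernel, computed with a standard dynamic programming recurrence; it is not the stem kernel itself, and the function name is an assumption made for illustration.

```python
from typing import List


def all_subsequences_kernel(s: str, t: str) -> int:
    """Count pairs of (possibly non-contiguous) subsequence occurrences,
    one in s and one in t, that spell the same string.

    The empty subsequence is counted, so the result is always >= 1.
    """
    n, m = len(s), len(t)
    # dp[i][j] holds the kernel value for the prefixes s[:i] and t[:j].
    dp: List[List[int]] = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over pairs that skip s[i-1] or t[j-1].
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] - dp[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                # Pairs that match s[i-1] with t[j-1] as their last symbol.
                dp[i][j] += dp[i - 1][j - 1]
    return dp[n][m]


print(all_subsequences_kernel("GCAU", "GCAU"))
print(all_subsequences_kernel("GCAU", "GGAU"))
```

The table has one cell per pair of prefixes, so the cost is proportional to the product of the two sequence lengths.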

Prediction of Structurally-Determined Coiled-Coil Domains with Hidden Markov Models

Piero Fariselli; Daniele Molinini; Rita Casadio; Anders Krogh

The coiled-coil protein domain is a widespread structural motif known to be involved in a wealth of key interactions in cells and organisms. Recognizing coiled coils and predicting their location in a protein sequence are important steps in modeling protein structure and function. Thanks to the increasing number of experimentally determined protein structures, a significant number of coiled-coil protein domains are now available. This enables the development of methods that predict coiled-coil structural motifs from the protein sequence. Several methods have been developed to predict classical heptads using manually annotated coiled-coil domains. In this paper we focus on the prediction of structurally determined coiled-coil segments. We introduce a new method based on hidden Markov models that complements the existing methods and outperforms them in the task of locating structurally defined coiled-coil segments.

- Session 7: Sequence Analysis II | Pp. 292-302

Patch Prediction of Protein Interaction Sites: Validation of a Scoring Function for an Online Server

Susan Jones; Yoichi Mukarami

An online protein interaction server has been designed and implemented to make predictions for 256 nonhomologous protein-protein interaction sites using patch analysis. Predictions of interaction sites are made using a scoring function that ranks four parameters (Solvation Potential, Hydrophobicity, Accessible Surface Area and Residue Interface Propensity) for overlapping patches of surface residues. Using the server, correct predictions were made for 85% of an original hand-curated dataset of 28 homodimers and for 65% of a new dataset of 256 homodimeric proteins. This is an increased prediction rate over the original algorithm and shows that the method is valid for a larger set of proteins that includes more diverse interaction sites. In addition, a number of proteins for which predictions are categorized as incorrect are shown to have alternative protein interaction sites on their surfaces.

- Session 8: Proteomics I | Pp. 303-313

Statistical Inference on Distinct RNA Stem-Loops in Genomic Sequences

Shu-Yun Le; Jih-H. Chen

Functional RNA elements in post-transcriptional regulation of gene expression are often correlated with distinct RNA stem-loop structures that are both thermodynamically stable and highly well-ordered. Recent discoveries of microRNAs (miRNAs) and small regulatory RNAs indicate that there is a large class of small non-coding RNAs with the potential to form a distinct, well-ordered and/or stable stem-loop in a number of genomes. The distinct RNA structure can be evaluated by a quantitative measure: the energy difference between the optimal structure folded from the segment and its corresponding optimal restrained structure, in which all base pairings formed in the original optimal structure are forbidden. In this study, we present an efficient algorithm that computes this energy difference for local segments by scanning a window along a genomic sequence. The computational time complexity depends on the length of the genomic sequence and the size of the sliding window. Our results indicate that the known stem-loops folded by miRNA precursors have high normalized scores with high statistical significance. The distinct well-ordered structures related to the known miRNAs can be predicted in a genomic sequence by robust statistical inference. Our computational method, StemED, can be used as a general approach for the discovery of distinct stem-loops in genomic sequences.

- Session 8: Proteomics I | Pp. 314-327

Interpretation of Protein Subcellular Location Patterns in 3D Images Across Cell Types and Resolutions

Xiang Chen; Robert F. Murphy

Detailed knowledge of the subcellular location of all proteins and how they change under various conditions is essential for systems biology efforts to recreate the behavior of cells and organisms. Systematic study of subcellular patterns requires automated methods to determine the location pattern for each protein and how it relates to others. Our group has designed sets of numerical features that characterize the location patterns in high-resolution fluorescence microscope images, has shown that these can be used to distinguish patterns better than visual examination, and has used them to automatically group proteins by their patterns. In the current study, we sought to extend our approaches to images obtained from different cell types, microscopy techniques and resolutions. The results indicate that 1) transformation of subcellular location features can be performed so that similar patterns from different cell types are grouped by automated clustering; and 2) there are several basic location patterns whose recognition is insensitive to image resolution over a wide range. The results suggest strategies to be used for collecting and analyzing images from different cell types and with different resolutions.

- Session 8: Proteomics I | Pp. 328-342

Bayesian Inference for 2D Gel Electrophoresis Image Analysis

Ji Won Yoon; Simon J. Godsill; ChulHun Kang; Tae-Seong Kim

Two-dimensional gel electrophoresis (2DGE) is a technique for separating individual proteins in biological samples. The 2DGE technique yields gel images in which proteins appear as dark spots on a white background. However, the analysis and interpretation of these images are complicated by 1) contamination of gels, 2) superposition of proteins, 3) noisy background, and 4) weak protein spots. There is therefore a strong need for an automatic analysis technique that finds protein spots quickly, robustly, and objectively. In this paper, to find protein spots more accurately and reliably in gel images, we propose a reversible jump Markov chain Monte Carlo (RJMCMC) method that searches for underlying spots, which are assumed to have a Gaussian shape. Our statistical method identifies very weak spots, restores noisy spots, and separates mixed spots into several meaningful spots that would otherwise be likely to be ignored or missed. Our proposed approach estimates the proper number, centre position, width, and amplitude of the spots and has been successfully applied to the field of projection reconstruction NMR (PR-NMR) processing [15,16]. To obtain a 2DGE image, we performed 2DGE on purified mitochondrial protein from the liver of an adult Sprague-Dawley rat.

- Session 9: Proteomics II (Measurements) | Pp. 343-356

SimShiftDB: Chemical-Shift-Based Homology Modeling

Simon W. Ginzinger; Thomas Gräupl; Volker Heun

An important quantity measured in NMR spectroscopy is the chemical shift. The interpretation of these data is mostly done by human experts. We present a method, named SimShiftDB, which identifies structural similarities between a protein of unknown structure and a database of resolved proteins based on chemical shift data. To evaluate the performance of our approach, we use a small but very reliable test set and compare our results to those of 123D and TALOS. The evaluation shows that SimShiftDB outperforms 123D in the majority of cases. For a significant part of the predictions made by TALOS, our method strongly reduces the error. SimShiftDB also assesses the statistical significance of each similarity identified.

- Session 9: Proteomics II (Measurements) | Pp. 357-370

Annotation of LC/ESI-MS Mass Signals

Ralf Tautenhahn; Christoph Böttcher; Steffen Neumann

Mass spectrometry is the work-horse technology of the emerging field of metabolomics. The identification of mass signals remains the largest bottleneck for a non-targeted approach: due to the analytical method, each metabolite in a complex mixture will give rise to a number of mass signals. In contrast to GC/MS measurements, for soft ionisation methods such as ESI-MS there are no extensive libraries of reference spectra or established deconvolution methods. We present a set of annotation methods which aim to group together mass signals measured from a single metabolite, based on rules for mass differences and peak shape comparison.

The software and documentation is available as an R package on

- Session 9: Proteomics II (Measurements) | Pp. 371-380

Stochastic Protein Folding Simulation in the d-Dimensional HP-Model

K. Steinhöfel; A. Skaliotis; A. A. Albrecht

We present results from two- and three-dimensional protein folding simulations in the HP-model on selected benchmark problems. The importance of the HP-model for investigating general complexity issues of protein folding was recently demonstrated by Fu & Wang (LNCS 3142:630–644, 2004) in proving an ((·ln )) time bound for d-dimensional protein folding simulation of sequences of length . The time bound is close to the approximation of real folding times of (·±·/2) by Finkelstein & Badretdinov (FOLD DES 2:115–121, 1997), where and are constants close to unity. We utilise a stochastic local search procedure that is based on logarithmic simulated annealing. We obtain that after (/) Markov chain transitions the probability to be in a minimum energy conformation is at least 1 − , where  ≤ ()· is the maximum neighbourhood size for a small integer (), is a small constant, and is the maximum value of the minimum escape height from local minima of the underlying energy landscape. We note that the time bound is sequence-specific, and we conjecture  <  as a worst case upper bound. We analyse  <  experimentally on selected HP-model benchmark problems.

- Session 10: Proteomics III (Structure) | Pp. 381-394
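
The HP-model and the logarithmic cooling used by the stochastic search in the abstract above can be illustrated with a minimal sketch. It assumes the usual HP energy (minus one for every pair of H residues that are lattice neighbours but not chain neighbours) and a cooling schedule of the common form gamma / ln(k + 2). The function names, the parameter `gamma`, and the example sequence are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[int, ...]  # a lattice point in any dimension d


def _manhattan(a: Coord, b: Coord) -> int:
    return sum(abs(x - y) for x, y in zip(a, b))


def hp_energy(sequence: str, coords: Sequence[Coord]) -> int:
    """Energy of a lattice conformation in the HP-model.

    sequence: string over {'H', 'P'}; coords: one lattice point per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice point per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for a, b in zip(coords, coords[1:]):
        if _manhattan(a, b) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")
    contacts = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbours
            if (sequence[i] == "H" and sequence[j] == "H"
                    and _manhattan(coords[i], coords[j]) == 1):
                contacts += 1
    return -contacts


def log_temperature(step: int, gamma: float) -> float:
    """Logarithmic cooling schedule (assumed form): gamma / ln(step + 2)."""
    return gamma / math.log(step + 2)


# Hypothetical 4-residue chain folded into a unit square in 2D.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_energy("HPPH", square), hp_energy("HPPH", line))
```

Because the distance check works on coordinate tuples of any length, the same function applies to two- and three-dimensional lattices.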