Publications catalog - books

Advances in Nonlinear Speech Processing: International Conference on Non-Linear Speech Processing, NOLISP 2007 Paris, France, May 22-25, 2007 Revised Selected Papers

Mohamed Chetouani ; Amir Hussain ; Bruno Gas ; Maurice Milgram ; Jean-Luc Zarader (eds.)

Conference: International Conference on Nonlinear Speech Processing (NOLISP). Paris, France. May 22, 2007 - May 25, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Theory of Computation; Artificial Intelligence (incl. Robotics); Language Translation and Linguistics; Biometrics; Computer Appl. in Arts and Humanities; Image Processing and Computer Vision

Availability
Detected institution Publication year Browse Download Request
None detected 2007 SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-77346-7

Electronic ISBN

978-3-540-77347-4

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Word Recognition with a Hierarchical Neural Network

Xavier Domont; Martin Heckmann; Heiko Wersing; Frank Joublin; Stefan Menzel; Bernhard Sendhoff; Christian Goerick

In this paper we propose a feedforward neural network for syllable recognition. The core of the recognition system is based on a hierarchical architecture initially developed for visual object recognition. We show that, given the similarities between the primary auditory and visual cortexes, such a system can successfully be used for speech recognition. Syllables are used as basic units for the recognition. Their spectrograms, computed using a Gammatone filterbank, are interpreted as images and subsequently fed into the neural network after a preprocessing step that enhances the formant frequencies and normalizes the length of the syllables. The performance of our system has been analyzed on the recognition of 25 different monosyllabic words. The parameters of the architecture have been optimized using an evolutionary strategy. Compared to the Sphinx-4 speech recognition system, our system achieves better robustness and generalization capabilities in noisy conditions.

- Speech Recognition | Pp. 142-151

Hybrid Models for Automatic Speech Recognition: A Comparison of Classical ANN and Kernel Based Methods

Ana I. García-Moral; Rubén Solera-Ureña; Carmen Peláez-Moreno; Fernando Díaz-de-María

Support Vector Machines (SVMs) are state-of-the-art methods for machine learning but share with more classical Artificial Neural Networks (ANNs) the difficulty of their application to input patterns of non-fixed dimension. This is the case in Automatic Speech Recognition (ASR), in which the duration of the speech utterances is variable. In this paper we have recalled the hybrid (ANN/HMM) solutions provided in the past for ANNs and applied them to SVMs, performing a comparison between them. We have experimentally assessed both hybrid systems with respect to the standard HMM-based ASR system, for several noisy environments. On the one hand, the ANN/HMM system provides better results than the HMM-based system. On the other, the results achieved by the SVM/HMM system are slightly lower than those of the HMM system. Nevertheless, such results are encouraging given the current limitations of the SVM/HMM system.

- Speech Recognition | Pp. 152-160

Towards Phonetically-Driven Hidden Markov Models: Can We Incorporate Phonetic Landmarks in HMM-Based ASR?

Guillaume Gravier; Daniel Moraru

Automatic speech recognition mainly relies on hidden Markov models (HMM), which make little use of phonetic knowledge. As an alternative, landmark-based recognizers rely mainly on precise phonetic knowledge and exploit distinctive features. We propose a theoretical framework to combine both approaches by introducing phonetic knowledge in a non-stationary HMM decoder. To demonstrate the potential of the method, we investigate how broad phonetic landmarks can be used to improve an HMM decoder by focusing the best path search. We show that, assuming error-free landmark detection, every broad phonetic class brings a small improvement. The use of all the classes reduces the error rate from 22% to 14% on a broadcast news transcription task. We also experimentally validate that landmark boundaries do not need to be detected precisely and that the algorithm is robust to non-detection errors.

- Speech Recognition | Pp. 161-168

A Hybrid Genetic-Neural Front-End Extension for Robust Speech Recognition over Telephone Lines

Sid-Ahmed Selouani; Habib Hamam; Douglas O’Shaughnessy

This paper presents a hybrid technique combining the Karhunen-Loève Transform (KLT), the Multilayer Perceptron (MLP) and Genetic Algorithms (GAs) to obtain less-variant Mel-frequency parameters. The advantages of such an approach are that robustness can be reached without modifying the recognition system, and that neither assumption nor estimation of the noise is required. To evaluate the effectiveness of the proposed approach, an extensive set of continuous speech recognition experiments is carried out using the NTIMIT telephone speech database. The results show that the proposed approach outperforms the baseline and conventional systems.

- Speech Recognition | Pp. 169-178

Efficient Viterbi Algorithms for Lexical Tree Based Models

Salvador España-Boquera; Maria Jose Castro-Bleda; Francisco Zamora-Martínez; Jorge Gorbe-Moya

In this paper we propose a family of Viterbi algorithms specialized for lexical tree based FSA and HMM acoustic models. Two algorithms that decode a tree lexicon with left-to-right models, with or without skips, are presented, along with another algorithm that takes a directed acyclic graph as input and performs error-correcting decoding. They store the set of active states, topologically sorted, in contiguous memory queues. The number of basic operations needed to update each hypothesis is reduced, and better memory locality is obtained, which reduces the expected number of cache misses and yields a speed-up over other implementations.

- Speech Recognition | Pp. 179-187
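
For readers unfamiliar with the Viterbi recursion that the abstract above builds on, the following is a minimal textbook sketch in Python. It is not the specialized lexical-tree algorithm proposed in the paper: it uses plain lists rather than topologically sorted memory queues, and the function name `viterbi`, the helper `ln`, and the toy two-state left-to-right model are all assumptions made for illustration.

```python
import math

def ln(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for one observation sequence.

    log_init[s]     : log P(first state = s)
    log_trans[p][s] : log P(state s | previous state p)
    log_emit[t][s]  : log P(observation at frame t | state s)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []  # backptr[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-state left-to-right model (no skips) and three frames.
log_init = [ln(1.0), ln(0.0)]
log_trans = [[ln(0.5), ln(0.5)],
             [ln(0.0), ln(1.0)]]
log_emit = [[ln(0.9), ln(0.1)],
            [ln(0.2), ln(0.8)],
            [ln(0.1), ln(0.9)]]

best_log_prob, best_path = viterbi(log_init, log_trans, log_emit)
print(best_path, math.exp(best_log_prob))
```

In this toy model the best path stays in the first state for one frame and then moves to the second. The sketch scans every predecessor of every state at each frame; a decoder of the kind described in the abstract would instead visit only the active states.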

Non-stationary Self-consistent Acoustic Objects as Atoms of Voiced Speech

Friedhelm R. Drepper

To account for the strong non-stationarity of voiced speech and its nonlinear aero-acoustic origin, the classical source-filter model is extended to a cascaded drive-response model with a conventional linear secondary response, a synchronized and/or synchronously modulated primary response and a non-stationary fundamental drive which plays the role of the long time-scale part of the basic time-scale separation of acoustic perception. The transmission protocol of voiced speech is assumed to be based on non-stationary acoustic objects which can be synthesized as the described secondary response and which are analysed by introducing a self-consistent (filter stable) part-tone decomposition, suited to reconstruct the hidden fundamental drive and to confirm its topological equivalence to a glottal master oscillator. The filter-stable part-tone decomposition opens the option of a phase modulation transmission protocol of voiced speech. Aiming at communication channel invariant acoustic features of voiced speech, the phase modulation cues are expected to be particularly suited to extend and/or replace the classical feature vectors of phoneme and speaker recognition.

- Speech Analysis | Pp. 188-203

The Hartley Phase Cepstrum as a Tool for Signal Analysis

Ioannis Paraskevas; Maria Rangoussi

This paper proposes the use of the Hartley Phase Cepstrum as a tool for speech signal analysis. The phase of a signal conveys critical information, which is exploited in a variety of applications. The role of phase is particularly important for speech or audio signals. Accurate phase information extraction is a prerequisite for speech applications such as coding, synchronization, synthesis or recognition. However, signal phase extraction is not a straightforward procedure, mainly due to the discontinuities appearing in it (phase ‘wrapping’ effect). Several phase ‘unwrapping’ algorithms have been proposed to overcome this problem when accurate phase values must be extracted. In order to extract the phase content of a signal for subsequent utilization, it is necessary to choose a function that can efficiently encapsulate it. In this paper, through comparison of three alternative non-linear phase features, we propose the use of the Hartley Phase Cepstrum (HPC).

- Speech Analysis | Pp. 204-212
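
The phase ‘wrapping’ effect mentioned in the abstract above can be illustrated with a short Python sketch. It shows a generic unwrapping rule (fold each phase step into the interval from -pi to pi before accumulating), not the Hartley Phase Cepstrum itself. The function names and the synthetic linear-phase spectrum are assumptions chosen for illustration.

```python
import cmath
import math

def wrapped_phase(spectrum):
    """Phase of each complex bin, confined to (-pi, pi] by cmath.phase."""
    return [cmath.phase(z) for z in spectrum]

def unwrap(phases):
    """Remove 2*pi jumps, assuming the true phase changes by less than pi
    between neighbouring bins."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        # Fold the step into [-pi, pi) so that wrap-around jumps disappear.
        delta = (delta + math.pi) % (2 * math.pi) - math.pi
        out.append(out[-1] + delta)
    return out

# Synthetic example: a linear phase with slope -0.9 rad per bin.
true_phase = [-0.9 * k for k in range(20)]
spectrum = [cmath.exp(1j * theta) for theta in true_phase]

wrapped = wrapped_phase(spectrum)
recovered = unwrap(wrapped)

for k in (0, 4, 19):
    print(k, round(true_phase[k], 3), round(wrapped[k], 3), round(recovered[k], 3))
```

The wrapped values jump by 2*pi whenever the true phase leaves the principal interval. The unwrapped sequence follows the original straight line again, but only because the step between bins is smaller than pi; with larger steps this simple rule would fail.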

Voiced Speech Analysis by Empirical Mode Decomposition

Aïcha Bouzid; Noureddine Ellouze

Empirical Mode Decomposition has recently been proposed as a nonlinear tool for the analysis of non-stationary data. This paper concerns the Empirical Mode Decomposition (EMD) of speech signals into intrinsic oscillatory mode functions (IMFs) and their spectral analysis. EMD is applied to a speech signal, and the spectrograms of the speech and of its IMFs are analysed. The different modes explored underline the band-pass structure of the IMFs. LPC analysis of the different modes shows that the formant frequencies of the voiced speech signal are still preserved.

- Speech Analysis | Pp. 213-220
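
The LPC analysis referred to in the abstract above is commonly computed with the autocorrelation method and the Levinson-Durbin recursion. The Python sketch below shows that textbook procedure under stated assumptions: it is not the authors' code, the function names are hypothetical, and the test signal is a synthetic decaying exponential rather than speech or an IMF. Formant estimation would additionally require finding the roots of the prediction polynomial, which is omitted here.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation sums r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) where a[0] == 1 and the predictor is
    x[n] ~ -(a[1]*x[n-1] + ... + a[order]*x[n-order]);
    err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Synthetic first-order signal x[n] = 0.9**n (an assumption for illustration).
signal = [0.9 ** n for n in range(200)]
r = autocorrelation(signal, 2)
coeffs, residual = levinson_durbin(r, 2)
print(coeffs, residual)
```

For this signal a second-order fit should place almost all of the weight on the first coefficient, close to -0.9, with the second coefficient near zero, because each sample is simply 0.9 times the previous one.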

Estimation of Glottal Closure Instances from Speech Signals by Weighted Nonlinear Prediction

Karl Schnell

In this contribution, a method based on nonlinear prediction of speech signals is proposed for detecting the locations of the instances of glottal closures (GCI). For that purpose, feature signals are obtained from the nonlinear prediction of speech using a sliding window technique. The resulting feature signals show maxima caused by the glottal closures which can be utilized for the GCI detection. To assess the procedure, a speech database with corresponding EGG signals is analyzed providing GCIs as reference.

- Speech Analysis | Pp. 221-229

Quantitative Perceptual Separation of Two Kinds of Degradation in Speech Denoising Applications

Anis Ben Aicha; Sofia Ben Jebara

Classical objective criteria evaluate speech quality using a single quantity that embeds all possible kinds of degradation. For speech denoising applications, there is a great need to determine accurately the kind of degradation (residual background noise, speech distortion or both). In this work, we propose two perceptual bounds defining regions where the original and denoised signals are perceptually equivalent or different. Next, two quantitative criteria are developed to quantify the two kinds of degradation separately. Simulation results for speech denoising using different approaches show the usefulness of the proposed criteria.

- Speech Analysis | Pp. 230-245