Catálogo de publicaciones - libros



Speaker Classification II: Selected Projects

Christian Müller (ed.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Year of publication: 2007 (SpringerLink)

Information

Resource type:

books

Print ISBN

978-3-540-74121-3

Electronic ISBN

978-3-540-74122-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

A Linear-Scaling Approach to Speaker Variability in Poly-segmental Formant Ensembles

Frantz Clermont

A linear-scaling approach is introduced for handling acoustic-phonetic manifestations of inter-speaker differences. The approach is motivated (i) by the similarity commonly observed amongst formant-frequency patterns resulting from different speakers’ productions of the same utterance, and (ii) by the fact that there are linear-scaling properties associated with such similarity. In methodological terms, formant patterns are obtained for a set of segments selected from a fixed utterance, which we call poly-segmental formant ensembles. Linear transformations of these ensembles amongst different speakers are then sought and interpreted as a set of scaling relations. Using multi-speaker data based on Australian English “hello”, it is shown that the transformations afford a significant reduction of inter-speaker dissimilarity. The proposed approach is thus able to unlock regularity in formant-pattern variability from speaker to speaker, without prior knowledge of the exact causes of the speaker differences manifested in the data at hand.

Pp. 116-129
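The scaling relation described in this abstract can be illustrated with a minimal sketch: fit a single least-squares scale factor mapping one speaker's formant ensemble onto another's, and check how much inter-speaker distance the scaling removes. All formant values below are invented for illustration, not taken from the paper's data.

```python
# Hypothetical sketch: estimate a least-squares scale factor k that maps
# speaker A's formant ensemble onto speaker B's, then compare RMS distance
# before and after scaling. Numbers are invented, not the paper's data.

def scale_factor(ens_a, ens_b):
    """Least-squares k minimising sum((b - k*a)**2) over all formants."""
    num = sum(a * b for a, b in zip(ens_a, ens_b))
    den = sum(a * a for a in ens_a)
    return num / den

def rms_distance(ens_a, ens_b):
    n = len(ens_a)
    return (sum((a - b) ** 2 for a, b in zip(ens_a, ens_b)) / n) ** 0.5

# Flattened F1-F3 ensembles (Hz) over the same three segments of an utterance.
speaker_a = [310, 2020, 2600, 520, 1180, 2480, 700, 1220, 2560]
speaker_b = [340, 2230, 2850, 575, 1300, 2740, 770, 1350, 2820]

k = scale_factor(speaker_a, speaker_b)
scaled_a = [k * f for f in speaker_a]

before = rms_distance(speaker_a, speaker_b)
after = rms_distance(scaled_a, speaker_b)
print(f"k = {k:.3f}, RMS before = {before:.1f} Hz, after = {after:.1f} Hz")
```

Because the second speaker's formants here are close to a uniform scaling of the first's, a single factor removes most of the inter-speaker distance, which is the regularity the approach exploits.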

Sound Change and Speaker Identity: An Acoustic Study

Gea de Jong; Kirsty McDougall; Francis Nolan

This study investigates whether the pattern of diachronic sound change within a language variety can predict phonetic variability useful for distinguishing speakers. An analysis of Standard Southern British English (SSBE) monophthongs is undertaken to test whether individuals differ more widely in their realisation of sounds undergoing change than in their realisation of more stable sounds. Read speech of 20 male speakers of SSBE aged 18-25 from the DyViS database is analysed. Vowels demonstrated by previous research to be changing in SSBE are compared with relatively stable vowels. Results from Analysis of Variance and Discriminant Analysis based on F1 and F2 frequencies suggest that although ‘changing’ vowels exhibit greater levels of between-speaker variation than ‘stable’ vowels, they may also exhibit large within-speaker variation, resulting in poorer classification rates. Implications for speaker identification applications are discussed.

Pp. 130-141
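The trade-off the abstract describes can be made concrete with a one-way F-ratio (between-speaker variance over within-speaker variance), the statistic underlying the Analysis of Variance it reports. The F1 values below are invented: the 'changing' vowel has more widely spread speaker means, but its large within-speaker scatter still yields the lower F-ratio.

```python
# Illustrative sketch with invented F1 data (Hz): a one-way ANOVA F-ratio
# for a vowel measured over several speakers, several repetitions each.

def f_ratio(groups):
    """One-way ANOVA F statistic; groups is a list of per-speaker value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# A 'stable' vowel: speaker means differ modestly, repetitions are consistent.
stable = [[300, 305, 310], [330, 335, 332], [360, 355, 358]]
# A 'changing' vowel: speaker means spread widely, but so do repetitions.
changing = [[400, 470, 430], [520, 450, 580], [610, 500, 660]]

print(f_ratio(stable), f_ratio(changing))
```

Despite its larger between-speaker spread, the 'changing' vowel scores a far lower F-ratio here, mirroring the paper's finding that within-speaker variation can undo the discriminative benefit of sound change.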

Bayes-Optimal Estimation of GMM Parameters for Speaker Recognition

Guillermo Garcia; Sung-Kyo Jung; Thomas Eriksson

In text-independent speaker recognition, Gaussian Mixture Models (GMMs) are widely employed as statistical models of the speakers. It is assumed that the Expectation Maximization (EM) algorithm can estimate the optimal model parameters, such as the weight, mean and variance of each Gaussian component, for each speaker. However, this is not entirely true, since there are practical limitations such as the limited size of the training database and uncertainties in the model parameters. As is well known in the literature, limited-size databases are one of the greatest challenges in speaker recognition research. In this paper, we investigate methods to overcome the database and parameter uncertainty problem. By reformulating the GMM estimation problem in a Bayesian-optimal way (as opposed to ML-optimal, as with the EM algorithm), we are able to change the GMM parameters to better cope with limited database size and other parameter uncertainties. Experimental results show the effectiveness of the proposed approach.

Pp. 142-156
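The abstract does not spell out the paper's Bayes-optimal estimator, but the general idea of replacing an ML estimate with a prior-weighted one can be sketched with the standard MAP ("relevance") mean update widely used in GMM speaker modelling. The relevance factor and all data values below are illustrative assumptions, not the paper's method or numbers.

```python
# Hedged sketch: MAP estimation of a Gaussian component mean. With little
# enrolment data the estimate stays near the background (prior) mean; with
# ample data it approaches the ML mean. The relevance factor r is invented.

def map_mean(obs, prior_mean, r=16.0):
    """MAP estimate of a Gaussian mean from observations and a prior mean."""
    n = len(obs)
    ml_mean = sum(obs) / n
    alpha = n / (n + r)          # data weight grows with sample size
    return alpha * ml_mean + (1 - alpha) * prior_mean

ubm_mean = 0.0                    # prior mean, e.g. from a background model
few = [2.1, 1.8, 2.4]             # scarce data -> estimate stays near prior
many = [2.1, 1.8, 2.4] * 20       # ample data -> estimate nears ML mean 2.1

print(map_mean(few, ubm_mean), map_mean(many, ubm_mean))
```

This shrinkage is one concrete way a Bayesian formulation guards against the small-database problem the abstract highlights: uncertain parameters are pulled toward a well-estimated prior instead of being trusted outright.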

Speaker Individualities in Speech Spectral Envelopes and Fundamental Frequency Contours

Tatsuya Kitamura; Masato Akagi

Perceptual cues for speaker individualities embedded in spectral envelopes of vowels and fundamental frequency (F0) contours of words were investigated through psychoacoustic experiments. First, the frequency bands carrying speaker individualities are estimated using stimuli created by systematically varying the spectral shape in specific frequency bands. The results suggest that speaker individualities of vowel spectral envelopes mainly exist in higher frequency regions including and above the peak around 20–23 ERB rate (1,740–2,489 Hz). Second, three experiments are performed to clarify the relationship between the physical characteristics of F0 contours, extracted using Fujisaki and Hirose’s F0 model, and the perception of speaker identity. The results indicate that some specific parameters related to the dynamics of F0 contours carry many speaker-individuality features. The results also show that although there are speaker-individuality features in the time-averaged F0, they help to improve speaker identification less than the dynamics of the F0 contours.

Pp. 157-176

Speaker Segmentation for Air Traffic Control

Michael Neffe; Tuan Van Pham; Horst Hering; Gernot Kubin

In this contribution, a novel speaker segmentation system has been designed to improve the safety of voice communication in air traffic control. In addition to using the aircraft identification tag to assign speaker turns on the shared communication channel to aircraft, speaker verification is investigated as an add-on attribute to effectively improve the security level of air traffic control. The verification task is performed by training universal background models and speaker-dependent models based on the Gaussian mixture model approach. The feature extraction and normalization units are specially optimized to deal with severe bandwidth restrictions and very short speaker turns. To enhance the robustness of the verification system, a cross-verification unit is further applied. The designed system is tested on the SPEECHDAT-AT and WSJ0 databases to demonstrate its superior performance.

Pp. 177-191
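In a GMM-UBM verification system like the one this abstract describes, the accept/reject decision typically reduces to a log-likelihood ratio between the claimed speaker's model and the universal background model, averaged over frames. The sketch below shows that scoring step for one-dimensional Gaussians; the model parameters and frame values are invented, not taken from the paper.

```python
# Hedged sketch (invented parameters): frame-averaged log-likelihood ratio
# scoring of a claimed speaker model against a universal background model.
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr_score(frames, spk, ubm):
    """Mean log-likelihood ratio of frames under speaker vs background model."""
    return sum(log_gauss(x, *spk) - log_gauss(x, *ubm) for x in frames) / len(frames)

speaker_model = (1.0, 0.5)        # (mean, variance) adapted to the speaker
background_model = (0.0, 1.0)     # (mean, variance) of the UBM

genuine = [0.9, 1.2, 0.8, 1.1]    # frames resembling the claimed speaker
impostor = [-0.2, 0.1, -0.4, 0.3]

print(llr_score(genuine, speaker_model, background_model))
print(llr_score(impostor, speaker_model, background_model))
```

A positive average score favours the claimed speaker and a negative one favours the background model; in practice the decision threshold is tuned on held-out data, which matters especially for the very short turns this paper targets.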

Detection of Speaker Characteristics Using Voice Imitation

Elisabeth Zetterholm

When recognizing a voice we attend to particular features of the person’s speech and voice. Through voice imitation it is possible to investigate which aspects of the human voice need to be altered to successfully mislead the listener. This suggests that voice and speech imitation can be exploited as a methodological tool to find out which features a voice impersonator picks out in the target voice and which features in the human voice are not changed, thereby making it possible to identify the impersonator instead of the target voice. This article examines whether three impersonators, two professional and one amateur, selected the same features and speaker characteristics when imitating the same target speakers and whether they achieved similar degrees of success. The acoustic-auditory results give an insight into how difficult it is to focus on only one or two features when trying to identify one speaker from his voice.

Pp. 192-205

Reviewing Human Language Identification

Masahiko Komatsu

This article gives an overview of human language identification (LID) experiments, focusing especially on methods of stimulus modification, and describes the experimental designs and languages used. A variety of signals representing prosody have been used as stimuli in perceptual experiments: lowpass-filtered speech, laryngograph output, triangular pulse trains or sinusoidal signals, LPC-resynthesized or residual signals, white-noise driven signals, resynthesized signals preserving or degrading broad phonotactics, syllabic rhythm, or intonation, and the parameterized source component of the speech signal. Although all of these experiments showed that “prosody” plays a role in LID, the stimuli differ from each other in the amount of information they carry. The article discusses the acoustic nature of these signals and some theoretical background, focusing on the correspondence, in terms of the source-filter theory, between the source and prosody from a linguistic perspective. It also reviews LID experiments using unmodified speech, research into infants, dialectology and sociophonetic research, and research into foreign accent.

Pp. 206-228

Underpinning /nailon/: Automatic Estimation of Pitch Range and Speaker Relative Pitch

Jens Edlund; Mattias Heldner

In this study, we explore what is needed to get an automatic estimation of speaker relative pitch that is good enough for many practical tasks in speech technology. We present analyses of fundamental frequency (F0) distributions from eight speakers with a view to examining (i) the effect of the semitone transform on the shape of these distributions; (ii) the errors resulting from calculation of percentiles from the means and standard deviations of the distributions; and (iii) the amount of voiced speech required to obtain a robust estimation of speaker relative pitch. In addition, we provide a hands-on description of how such an estimation can be obtained under real-time online conditions using /nailon/ – our software for online analysis of prosody.

Pp. 229-242
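Two of the ingredients this abstract analyses, the semitone transform and percentile-based pitch-range estimation, can be sketched in a few lines. The F0 track, reference frequency and percentile choices below are illustrative assumptions, not values from the paper.

```python
# Sketch with invented F0 values: convert F0 (Hz) to semitones relative to
# a reference, then estimate a speaker's pitch range from percentiles of
# the resulting distribution.
import math

def to_semitones(f0_hz, ref_hz=100.0):
    """F0 in semitones relative to ref_hz (12 semitones per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def percentile(values, p):
    """Simple percentile by linear interpolation on the sorted sample."""
    s = sorted(values)
    idx = (len(s) - 1) * p / 100.0
    lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

f0_track = [95, 110, 120, 132, 105, 98, 140, 125, 118, 102]  # voiced frames, Hz
st = [to_semitones(f) for f in f0_track]

floor, ceiling = percentile(st, 5), percentile(st, 95)
print(f"pitch range ~ {floor:.2f} to {ceiling:.2f} st re 100 Hz")
```

Working in semitones makes F0 distributions more symmetric across speakers, which is why percentile-based floors and ceilings computed on the semitone scale are a common basis for speaker-relative pitch.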

Automatic Dialect Identification: A Study of British English

Emmanuel Ferragne; François Pellegrino

This contribution deals with the automatic identification of the dialects of the British Isles. Several methods based on the linguistic study of dialect-specific vowel systems are proposed and compared using the Accents of the British Isles (ABI) corpus. The first method examines differences in diphthongization for a single lexical set. Discrimination scores in a two-dialect discrimination task range from chance to ca. 98% correct decisions depending on the pair of dialects under test. Using the ACCDIST method (developed in [1,2]), the second and third experiments take dialectal differences in the structure of vowel systems into consideration; evaluation is performed on a 13-dialect closed-set identification task. Correct identification reaches up to 90% with two subsets of the ABI corpus (/hVd/ set and read passages). All these experiments rely on front-end automatic phonetic alignment and are therefore text-dependent. Results and possible improvements are discussed in the light of British dialectology.

Pp. 243-257

ACCDIST: An Accent Similarity Metric for Accent Recognition and Diagnosis

Mark Huckvale

ACCDIST is a metric of the similarity between speakers’ accents that is largely uninfluenced by the individual characteristics of the speakers’ voices. In this article we describe the ACCDIST approach and contrast its performance with formant and spectral-envelope similarity measures. Using a database of 14 regional accents of the British Isles, we show that the ACCDIST metric outperforms linear discriminant analysis based on either spectral-envelope or normalised formant features. Using vowel measurements from 10 male and 10 female speakers in each accent, the best spectral-envelope metric assigned the correct accent group to a held-out speaker 78.8% of the time, while the best normalised formant-frequency metric was correct 89.4% of the time. The ACCDIST metric based on spectral-envelope features scored 92.3%. ACCDIST is also effective in clustering speakers by accent and has applications in speech technology, language learning, forensic phonetics and accent studies.

Pp. 258-275
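The core ACCDIST idea, characterising each speaker by the table of distances between their own vowels and then comparing speakers by correlating those tables, can be sketched as follows. Because each table is speaker-internal, a uniform shift in a speaker's formants largely cancels out. The vowel positions below are invented for illustration and the two-formant Euclidean distance is a simplification of the features used in the paper.

```python
# Hedged sketch of an ACCDIST-style comparison: per-speaker inter-vowel
# distance tables, compared across speakers by Pearson correlation.
# Vowel (F1, F2) positions in Hz are invented.

def distance_table(vowels):
    """Flattened upper triangle of pairwise Euclidean vowel distances."""
    keys = sorted(vowels)
    return [
        ((vowels[a][0] - vowels[b][0]) ** 2 + (vowels[a][1] - vowels[b][1]) ** 2) ** 0.5
        for i, a in enumerate(keys) for b in keys[i + 1:]
    ]

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

accent1_spk_a = {"i": (300, 2300), "a": (750, 1200), "u": (350, 900)}
# Same accent, different voice: formants roughly uniformly scaled down.
accent1_spk_b = {k: (f1 * 0.85, f2 * 0.85) for k, (f1, f2) in accent1_spk_a.items()}
# Different accent: the relative arrangement of the vowels differs (fronted /u/).
accent2_spk = {"i": (320, 2100), "a": (650, 1500), "u": (400, 1800)}

same = correlation(distance_table(accent1_spk_a), distance_table(accent1_spk_b))
diff = correlation(distance_table(accent1_spk_a), distance_table(accent2_spk))
print(same, diff)
```

The same-accent pair correlates almost perfectly even though their absolute formant values differ, while the cross-accent pair does not, which is the speaker-independence property that lets ACCDIST outperform raw spectral or formant metrics.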