Publication catalogue - books



Speaker Classification II: Selected Projects

Christian Müller (ed.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability

Detected institution: not detected
Year of publication: 2007
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-74121-3

Electronic ISBN

978-3-540-74122-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

A Study of Acoustic Correlates of Speaker Age

Susanne Schötz; Christian Müller

Speaker age is a speaker characteristic which is always present in speech. Previous studies have found numerous acoustic features which correlate with speaker age. However, few attempts have been made to establish their relative importance. This study automatically extracted 161 acoustic features from six words produced by 527 speakers of both genders, and used normalised means to directly compare the features. Segment duration and sound pressure level (SPL) range were identified as the most important acoustic correlates of speaker age.

Pp. 1-9
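The chapter above ranks acoustic correlates of age by normalising feature values so they can be compared directly. The snippet below is a minimal sketch of that idea in Python: the feature names and values are invented placeholders, not the chapter's actual 161-feature set extracted from six words and 527 speakers.

```python
# Hypothetical sketch: rank acoustic features by their correlation with speaker age.
# Feature names and values are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(20, 80, size=200)                      # speaker ages in years
features = {                                               # per-speaker feature values
    "segment_duration_ms": 180 + 1.2 * ages + rng.normal(0, 20, 200),
    "spl_range_db":        30 - 0.1 * ages + rng.normal(0, 3, 200),
    "mean_f0_hz":          160 + rng.normal(0, 25, 200),
}

def normalised(x):
    """Z-score normalisation so features on different scales become comparable."""
    return (x - x.mean()) / x.std()

# Correlate each normalised feature with age and sort by absolute correlation.
ranking = sorted(
    ((name, float(np.corrcoef(normalised(vals), ages)[0, 1]))
     for name, vals in features.items()),
    key=lambda item: abs(item[1]), reverse=True,
)
for name, r in ranking:
    print(f"{name:22s} r = {r:+.2f}")
```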

The Impact of Visual and Auditory Cues in Age Estimation

Kajsa Amilon; Joost van de Weijer; Susanne Schötz

Several factors determine the ease and accuracy with which we can estimate a speaker’s age. The question addressed in our study is to what extent visual and auditory cues compete with each other. We investigated this question in a series of five related experiments. In the first four experiments, subjects estimated the age of 14 female speakers, either from still pictures, an audio recording, a video recording without sound, or a video recording with sound. The results from the first four experiments were used in the fifth experiment, to combine the speakers with new voices, so that there was a discrepancy in how old the speaker looked and how old she sounded. The estimated ages of these dubbed videos were not significantly different from those of the original videos, suggesting that voice has little impact on the estimation of age when visual cues are available.

Pp. 10-21

Development of a Femininity Estimator for Voice Therapy of Gender Identity Disorder Clients

Nobuaki Minematsu; Kyoko Sakuraba

This work describes the development of an automatic estimator of the perceptual femininity (PF) of an input utterance using speaker verification techniques. The estimator was designed for clinical use, and the target speakers are Gender Identity Disorder (GID) clients, especially MtF (Male-to-Female) transsexuals. The voice therapy for MtFs, which is conducted by the second author, comprises three kinds of training: 1) raising the baseline F0 range, 2) changing the baseline voice quality, and 3) enhancing F0 dynamics to produce an exaggerated intonation pattern. The first two focus on static acoustic properties of speech; the voice quality is mainly controlled by the size and shape of the articulators, which can be acoustically characterized by the spectral envelope. Gaussian Mixture Models (GMMs) of F0 values and spectra were built separately for biologically male and female speakers. Using the four models, PF was estimated automatically for each of 142 utterances of 111 MtFs. The estimated values were compared with the PF values obtained through listening tests with 3 female and 6 male novice raters. Results showed a very high correlation (r = 0.86) between the two, which is comparable to the intra- and inter-rater correlation.

Pp. 22-33
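The estimator in the chapter above scores an utterance against gender-specific Gaussian mixture models. Below is a minimal sketch of that scoring step, assuming scikit-learn and pre-extracted per-frame feature vectors; the training data and the mapping to perceptual femininity ratings are placeholders, not the authors' models.

```python
# Hypothetical sketch of GMM-based femininity scoring: train one GMM per gender
# on feature frames, then score an utterance by the female/male log-likelihood gap.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Placeholder training data: rows are per-frame feature vectors (e.g. F0 plus
# spectral envelope coefficients); a real system would extract them from speech.
female_frames = rng.normal(loc=+1.0, scale=1.0, size=(2000, 13))
male_frames   = rng.normal(loc=-1.0, scale=1.0, size=(2000, 13))

gmm_female = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(female_frames)
gmm_male   = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(male_frames)

def femininity_score(utterance_frames: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio: positive = closer to the female model."""
    return float(gmm_female.score(utterance_frames) - gmm_male.score(utterance_frames))

test_utterance = rng.normal(loc=0.4, scale=1.0, size=(300, 13))
print(f"femininity score: {femininity_score(test_utterance):+.2f}")
```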

Real-Life Emotion Recognition in Speech

Laurence Devillers; Laurence Vidrascu

This article is dedicated to real-life emotion detection using a corpus of real agent-client spoken dialogs from a medical emergency call center. Emotion annotation was performed by two experts using twenty verbal classes organized into eight macro-classes. Two studies are reported in this paper using four macro-classes: Relief, Anger, Fear and Sadness. The first investigates automatic emotion detection using linguistic information, with a detection score of about 78% and very good detection of Relief; the second investigates emotion detection with paralinguistic cues, reaching 60% correct detection, with Fear being best detected.

Pp. 34-42
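The first study above detects macro-class emotions from the words of a transcribed turn. As a rough illustration only, the sketch below trains a bag-of-words classifier over the four macro-classes; the tiny labelled set is invented, and the chapter's actual linguistic features and corpus are not reproduced here.

```python
# Hypothetical sketch: bag-of-words emotion detection over four macro-classes.
# The toy training transcripts below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transcripts = [
    "thank you so much the ambulance arrived quickly",      # Relief
    "this is unacceptable I have been waiting for hours",   # Anger
    "please hurry I think he is not breathing",             # Fear
    "she passed away last night I do not know what to do",  # Sadness
]
labels = ["Relief", "Anger", "Fear", "Sadness"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(transcripts, labels)

print(model.predict(["thank god you came so fast"]))  # expected: ['Relief']
```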

Automatic Classification of Expressiveness in Speech: A Multi-corpus Study

Mohammad Shami; Werner Verhelst

We present a study on the automatic classification of expressiveness in speech using four databases that belong to two distinct groups: the first group of two databases contains adult speech directed to infants, while the second group contains adult speech directed to adults. We performed experiments with two approaches to feature extraction, the approach developed for Sony's robotic dog AIBO (AIBO) and a segment-based approach (SBA), and with three machine learning algorithms for training the classifiers. In mono-corpus experiments, the classifiers were trained and tested on each database individually. The results show that AIBO and SBA are competitive on the four databases considered, although the AIBO approach works better with long utterances whereas the SBA seems better suited for the classification of short utterances. When training was performed on one database and testing on another database of the same group, little generalization across databases was observed, because emotions with the same label occupy different regions of the feature space in different databases. Fortunately, when the databases are merged, classification results are comparable to those of the within-database experiments, indicating that the existing approaches to the classification of emotions in speech are efficient enough to handle larger amounts of training data without any reduction in classification accuracy. This should lead to classifiers that are more robust to varying styles of expressiveness in speech.

Pp. 43-56
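The experiments above compare within-corpus, cross-corpus, and merged-corpus training. A minimal sketch of that evaluation loop follows, assuming utterance-level feature vectors are already extracted; the random feature matrices stand in for the chapter's AIBO/SBA features and are not its data.

```python
# Hypothetical sketch of cross-corpus vs merged-corpus evaluation.
# Feature matrices are random placeholders for utterance-level features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)

def make_corpus(shift):
    """Toy corpus: 200 utterances, 20 features, 3 expressiveness classes."""
    y = rng.integers(0, 3, size=200)
    X = rng.normal(size=(200, 20)) + y[:, None] * 0.8 + shift
    return X, y

corpora = {"corpus_A": make_corpus(0.0), "corpus_B": make_corpus(1.5)}

# Cross-corpus: train on one database, test on the other.
for train_name, test_name in [("corpus_A", "corpus_B"), ("corpus_B", "corpus_A")]:
    Xtr, ytr = corpora[train_name]
    Xte, yte = corpora[test_name]
    clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
    print(f"train {train_name} -> test {test_name}: "
          f"{accuracy_score(yte, clf.predict(Xte)):.2f}")

# Merged: pool both corpora, train on half, test on held-out utterances.
X = np.vstack([corpora[n][0] for n in corpora])
y = np.concatenate([corpora[n][1] for n in corpora])
clf = RandomForestClassifier(random_state=0).fit(X[::2], y[::2])
print(f"merged training, held-out test: {accuracy_score(y[1::2], clf.predict(X[1::2])):.2f}")
```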

Acoustic Impact on Decoding of Semantic Emotion

Erik J. Eriksson; Felix Schaeffler; Kirk P. H. Sullivan

This paper examines the interaction between the emotion indicated by the content of an utterance and the emotion indicated by the acoustics of an utterance, and considers whether a speaker can hide their emotional state by acting an emotion while remaining semantically honest. Three female and two male speakers of Swedish were recorded saying the sentences “Jag har vunnit en miljon på lotto” (I have won a million on the lottery), “Det finns böcker i bokhyllan” (There are books on the bookshelf) and “Min mamma har just dött” (My mother just died) as if they were happy, neutral (indifferent), angry or sad. Thirty-nine experimental participants (19 female and 20 male) heard 60 randomly selected stimuli, randomly coupled with the question “Do you consider this speaker to be emotionally X?”, where X could be angry, happy, neutral or sad. They were asked to respond yes or no; the listeners’ responses and reaction times were collected. The results show that semantic cues to emotion play little role in the decoding process. Only when there are few specific acoustic cues to an emotion do semantic cues come into play. However, longer reaction times for stimuli containing mismatched acoustic and semantic cues indicate that the semantic cues to emotion are processed even if they have little impact on the perceived emotion.

Pp. 57-69

Emotion from Speakers to Listeners: Perception and Prosodic Characterization of Affective Speech

Catherine Mathon; Sophie de Abreu

This article describes a project that reviews perceptual work on emotion and the prosodic description of affective speech. A study with a spontaneous French corpus, for which a corresponding acted version was built, shows that native listeners perceive the difference between acted and spontaneous emotions. The results of cross-linguistic perceptual studies indicate that emotions are perceived by listeners partly on the basis of prosody alone, suggesting the universality of emotions such as anger, and partly on the basis of variability. The latter assumption is supported by the fact that the characterization of anger by degree differs depending on the mother tongue of the listeners. Finally, a prosodic analysis of the emotional speech is presented, measuring F0 cues, duration parameters and intensity.

Pp. 70-82
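The prosodic analysis above measures F0, duration, and intensity over emotional utterances. A minimal sketch of extracting those three kinds of cues with librosa follows; the toolchain and the file path "utterance.wav" are assumptions for illustration, not the authors' setup.

```python
# Hypothetical sketch: extract basic prosodic cues (F0, duration, intensity)
# from an audio file with librosa; "utterance.wav" is a placeholder path.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

duration_s = len(y) / sr                                         # duration parameter
f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)   # frame-wise F0 (NaN when unvoiced)
rms = librosa.feature.rms(y=y)[0]                                # frame-wise intensity proxy

print(f"duration: {duration_s:.2f} s")
print(f"mean F0 (voiced frames): {np.nanmean(f0):.1f} Hz")
print(f"F0 range (voiced frames): {np.nanmax(f0) - np.nanmin(f0):.1f} Hz")
print(f"mean RMS energy: {rms.mean():.4f}")
```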

Effects of the Phonological Contents on Perceptual Speaker Identification

Kanae Amino; Takayuki Arai; Tsutomu Sugawara

It is known that the accuracy of perceptual speaker identification depends on the stimulus contents presented to the subjects. Two experiments were conducted in order to find out which sounds are effective and to investigate the effects of syllable structure on familiar speaker identification. The results showed that nasal sounds were effective for identifying the speakers in both onset and coda positions, and that coronal sounds were more effective than their labial counterparts. The onset consonants were found to be important, and identification accuracy was degraded in onsetless structures.

Pp. 83-92

Durations of Context-Dependent Phonemes: A New Feature in Speaker Verification

Charl Johannes van Heerden; Etienne Barnard

We present a text-dependent speaker verification system based on Hidden Markov Models. A set of features, based on the temporal duration of context-dependent phonemes, is used in order to distinguish amongst speakers. Our approach was tested using the YOHO corpus; it was found that the HMM-based system achieved an equal error rate (EER) of 0.68% using conventional (acoustic) features and an EER of 0.32% when the time features were combined with the acoustic features. This compares well with state-of-the-art results on the same test, and shows the value of the temporal features for speaker verification. These features may also be useful for other purposes, such as the detection of replay attacks, or for improving the robustness of speaker-verification systems to channel or speaker variations. Our results confirm earlier findings obtained on text-independent speaker recognition [1] and text-dependent speaker verification [2] tasks, and contain a number of suggestions on further possible improvements.

Pp. 93-103
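The verification results above are reported as equal error rates (EER), the operating point where false acceptances equal false rejections. A minimal sketch of computing an EER from genuine and impostor scores follows; the scores are synthetic, not YOHO results.

```python
# Hypothetical sketch: compute an equal error rate (EER) from verification scores.
# The genuine/impostor score distributions below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
genuine  = rng.normal(loc=2.0, scale=1.0, size=500)    # target-speaker trial scores
impostor = rng.normal(loc=0.0, scale=1.0, size=5000)   # impostor trial scores

scores = np.concatenate([genuine, impostor])
labels = np.concatenate([np.ones_like(genuine), np.zeros_like(impostor)])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]             # point where FAR ~= FRR
print(f"EER = {eer * 100:.2f}%")
```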

Language–Independent Speaker Classification over a Far–Field Microphone

Jerome R. Bellegarda

The speaker classification approach described in this contribution leverages the analysis of both speaker and verbal content information, so as to use two light-weight components for classification: a spectral matching component based on a global representation of the entire utterance, and a temporal alignment component based on more conventional frame-level evidence. The paradigm behind the spectral matching component is related to latent semantic mapping, which postulates that the underlying structure in the data is partially obscured by the randomness of local phenomena with respect to information extraction. Uncovering this latent structure results in a parsimonious continuous parameter description of feature frames and spectral bands, which then replaces the original parameterization in clustering and identification. Such global analysis can then be advantageously combined with elementary temporal alignment. This approach has been commercially deployed for the purpose of language-independent desktop voice login over a far-field microphone.

Pp. 104-115
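The spectral matching component above builds a global, low-rank representation of the whole utterance (feature frames by spectral bands) in the spirit of latent semantic mapping. Below is a minimal sketch under that interpretation: a truncated SVD summarises a frames-by-bands matrix and utterances are compared by cosine similarity. The data are random placeholders, and the details of the deployed system are not reproduced.

```python
# Hypothetical sketch of a latent-semantic-mapping-style global utterance match:
# factor a frames-by-bands matrix with a truncated SVD, keep a low-rank summary,
# and compare utterances by cosine similarity. Data are random placeholders.
import numpy as np

rng = np.random.default_rng(4)

def global_representation(frames_by_bands: np.ndarray, rank: int = 8) -> np.ndarray:
    """Low-rank summary of an utterance: singular-value-weighted band directions."""
    _, s, vt = np.linalg.svd(frames_by_bands, full_matrices=False)
    return (s[:rank, None] * vt[:rank]).ravel()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened representations."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

enrol = global_representation(rng.normal(size=(300, 24)))   # enrolment utterance
test  = global_representation(rng.normal(size=(280, 24)))   # test utterance

print(f"spectral-match score: {cosine(enrol, test):+.3f}")
```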