Publications catalog - books



Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, Southampton, UK, April 6-7, 2006, Revised Selected Papers

Rainer Stiefelhagen; John Garofolo (eds.)

Conference: 1st International Evaluation Workshop on Classification of Events, Activities and Relationships (CLEAR). Southampton, UK. April 6-7, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Pattern Recognition; Image Processing and Computer Vision; Artificial Intelligence (incl. Robotics); Computer Graphics; Biometrics; Algorithm Analysis and Problem Complexity

Availability

Detected institution | Year of publication | Browse | Download | Request
Not detected | 2007 | SpringerLink | – | –

Information

Resource type:

books

Print ISBN

978-3-540-69567-7

Electronic ISBN

978-3-540-69568-4

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Person Identification Based on Multichannel and Multimodality Fusion

Ming Liu; Hao Tang; Huazhong Ning; Thomas Huang

Person ID is very useful information for high-level video analysis and retrieval. In some scenarios, the recording is not only multimodal but also multichannel (microphone array, camera array). In this paper, we describe a multimodal person ID system based on multichannel and multimodal fusion. The audio-only system combines the 7-channel microphone recordings at the decision level of the individual single-channel audio-only systems. The audio system is modeled with the Universal Background Model (UBM) and Maximum a Posteriori (MAP) adaptation framework, which is very popular in the speaker recognition literature. The visual-only system works directly in the appearance space via a norm-based nearest-neighbor classifier. Linear fusion then combines the two modalities to improve ID performance. The experiments indicate the effectiveness of microphone-array fusion and audio/visual fusion.

- Person Identification | Pp. 241-248
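The abstract above names two concrete techniques: decision-level combination of the microphone channels and UBM modeling with MAP adaptation. Below is a minimal NumPy sketch of the GMM-UBM idea with mean-only MAP adaptation and an average log-likelihood-ratio score; the diagonal covariances, relevance factor, and function names are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def log_gmm(feats, w, mu, var):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    feats: (T, D); w: (K,); mu, var: (K, D)."""
    diff = feats[:, None, :] - mu[None, :, :]                 # (T, K, D)
    log_comp = -0.5 * ((diff ** 2 / var).sum(-1)
                       + np.log(2 * np.pi * var).sum(-1))     # (T, K)
    return logsumexp(log_comp + np.log(w), axis=1)            # (T,)

def map_adapt_means(feats, w, mu, var, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to one speaker's enrollment data."""
    diff = feats[:, None, :] - mu[None, :, :]
    log_comp = (-0.5 * ((diff ** 2 / var).sum(-1)
                        + np.log(2 * np.pi * var).sum(-1)) + np.log(w))
    post = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))
    n_k = post.sum(0)                                         # soft counts
    e_k = post.T @ feats / np.maximum(n_k, 1e-8)[:, None]     # data means
    alpha = (n_k / (n_k + relevance))[:, None]                # adaptation weight
    return alpha * e_k + (1 - alpha) * mu                     # adapted means

def speaker_score(test_feats, w, mu_ubm, var, mu_spk):
    """Average log-likelihood ratio: adapted speaker model vs. UBM."""
    return (log_gmm(test_feats, w, mu_spk, var)
            - log_gmm(test_feats, w, mu_ubm, var)).mean()
```

In a multichannel setup of the kind described, one such score would be produced per microphone channel and the per-channel decisions then combined.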

ISL Person Identification Systems in the CLEAR Evaluations

Hazım Kemal Ekenel; Qin Jin

In this paper, we present three person identification systems that we have developed for the CLEAR evaluations. Two of the developed identification systems are based on single modalities (audio and video), whereas the third system uses both of these modalities. The visual identification system analyzes the face images of the individuals to determine the identity of the person; it processes multi-view, multi-frame information to provide the identity estimate. The speaker identification system processes the audio data from different channels and tries to determine the identity. The multi-modal identification system fuses the similarity scores obtained by the audio and video modalities to reach an identity estimate.

- Person Identification | Pp. 249-257
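As a hedged sketch of the score-level fusion step this abstract describes, the snippet below normalizes per-identity similarity scores from each modality and combines them with a weighted sum; the min-max normalization and the 0.5 weight are generic choices, not the evaluated system's settings.

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, audio_weight=0.5):
    """audio_scores, video_scores: (N,) similarity per enrolled identity."""
    def minmax(s):
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    fused = (audio_weight * minmax(np.asarray(audio_scores, float))
             + (1 - audio_weight) * minmax(np.asarray(video_scores, float)))
    return int(np.argmax(fused)), fused        # winning identity, fused scores

# Example: identity 2 wins once both modalities are taken into account.
best, fused = fuse_scores([0.1, 0.4, 0.35], [2.0, 1.0, 9.0])
```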

Audio, Video and Multimodal Person Identification in a Smart Room

Jordi Luque; Ramon Morros; Ainara Garde; Jan Anguita; Mireia Farrus; Dušan Macho; Ferran Marqués; Claudi Martínez; Verónica Vilaplana; Javier Hernando

In this paper, we address the modality integration issue in the example of a smart room environment, aiming to enable person identification by combining acoustic features and 2D face images. First we introduce the monomodal audio and video identification techniques, and then we present the use of combined input speech and face images for person identification. The various sensory modalities, speech and faces, are processed both individually and jointly. It is shown that the multimodal approach results in improved performance in the identification of the participants.

- Person Identification | Pp. 258-269
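A minimal sketch of appearance-based face identification of the kind used in these smart-room systems: flattened face images are matched to a gallery by Euclidean distance, and per-frame decisions are pooled by majority vote. The multi-frame voting and image handling here are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def identify_track(test_faces, gallery, labels):
    """test_faces: (F, H, W) frames of one person; gallery: (G, H, W)
    enrolled face images; labels: (G,) identity per gallery image."""
    g = gallery.reshape(len(gallery), -1).astype(float)
    votes = []
    for face in test_faces.reshape(len(test_faces), -1).astype(float):
        dists = np.linalg.norm(g - face, axis=1)   # L2 in appearance space
        votes.append(labels[np.argmin(dists)])     # nearest gallery image
    ids, counts = np.unique(votes, return_counts=True)
    return ids[np.argmax(counts)]                  # majority vote over frames
```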

Head Pose Estimation on Low Resolution Images

Nicolas Gourier; Jérôme Maisonnasse; Daniela Hall; James L. Crowley

This paper addresses the problem of estimating head pose over a wide range of angles from low-resolution images. Faces are detected using chrominance-based features. Grey-level normalized face imagettes serve as input for a linear auto-associative memory. One memory is computed for each pose using the Widrow-Hoff learning rule. Head pose is classified with a winner-takes-all process. We compare results from our method with the abilities of human subjects to estimate head pose from the same data set. Our method achieves similar results in estimating orientation in the tilt (head nodding) angle, and higher precision in estimating orientation in the pan (side-to-side) angle.

- Head Pose Estimation | Pp. 270-280
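The pipeline in this abstract is concrete enough to sketch directly: one linear auto-associative memory per pose, trained with the Widrow-Hoff rule, and winner-takes-all classification by reconstruction quality. The learning rate, epoch count, and cosine scoring detail below are assumptions.

```python
import numpy as np

def train_memory(imagettes, lr=0.01, epochs=10):
    """imagettes: (N, D) normalized grey-level face vectors for ONE pose.
    Small imagettes keep the D x D memory matrix tractable."""
    d = imagettes.shape[1]
    W = np.zeros((d, d))
    for _ in range(epochs):
        for x in imagettes:
            W += lr * np.outer(x - W @ x, x)     # Widrow-Hoff update
    return W

def classify_pose(x, memories):
    """Winner-takes-all: the pose whose memory best reconstructs x."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = [cosine(x, W @ x) for W in memories]   # one score per pose
    return int(np.argmax(scores))
```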

Evaluation of Head Pose Estimation for Studio Data

Jilin Tu; Yun Fu; Yuxiao Hu; Thomas Huang

This paper introduces our head pose estimation system, which localizes the nose-tip of faces and estimates head pose in studio-quality pictures. After the nose-tips in the training data are manually labeled, the appearance variation caused by head pose changes is characterized by a tensor model. Given images with unknown head pose and nose-tip location, the nose-tip of the face is localized in a coarse-to-fine fashion, and the head pose is estimated simultaneously by the head pose tensor model. The image patches at the localized nose-tips are then cropped and sent to two other head pose estimators based on LEA and PCA techniques. We evaluated our system on the Pointing'04 head pose image database. With the nose-tip location known, our head pose estimators achieve 94-96% head pose classification accuracy (within ±15°). With the nose-tip unknown, we achieve 85% nose-tip localization accuracy (within 3 pixels of the ground truth), and 81-84% head pose classification accuracy (within ±15°).

- Head Pose Estimation | Pp. 281-290
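One plausible reading of the PCA-based estimator mentioned above is a per-pose subspace classifier: fit a PCA basis for each discrete pose and assign a test patch to the pose whose subspace reconstructs it with the lowest error. The component count and scoring below are illustrative, not the paper's settings.

```python
import numpy as np

def fit_pose_subspace(patches, n_components=20):
    """patches: (N, D) flattened nose-tip patches for one pose class."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:n_components]            # class mean, principal axes

def classify(patch, subspaces):
    """Assign the pose whose subspace gives the lowest reconstruction error."""
    errs = []
    for mean, basis in subspaces:             # one (mean, basis) per pose
        centered = patch - mean
        recon = basis.T @ (basis @ centered)  # project, then reconstruct
        errs.append(np.linalg.norm(centered - recon))
    return int(np.argmin(errs))
```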

Neural Network-Based Head Pose Estimation and Multi-view Fusion

Michael Voit; Kai Nickel; Rainer Stiefelhagen

In this paper, we present two systems that were used for head pose estimation during the CLEAR06 evaluation. We participated in two tasks: (1) estimating both pan and tilt orientation on synthetic, high-resolution head captures; (2) estimating horizontal head orientation only, on real seminar recordings that were captured with multiple cameras from different viewing angles. In both systems, we used a neural network to estimate the person's head orientation. For the seminar recordings, a Bayes filter framework is further used to provide a statistical fusion scheme, integrating every camera view into one joint hypothesis. We achieved a mean error of 12.3° on horizontal head orientation estimation in the monocular, high-resolution task; vertical orientation estimation achieved a 12.77° mean error. On the multi-view seminar recordings, our system correctly identified head orientation in 34.9% of frames (one of eight classes); if neighbouring classes were also accepted, 72.9% of the frames were correctly classified.

- Head Pose Estimation | Pp. 291-298
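A minimal sketch of the discrete Bayes filter fusion the abstract outlines, over the eight horizontal pose classes: the belief is propagated by a transition model and multiplied by each camera's class likelihoods. The transition values here are illustrative; in the real setup each camera's likelihoods would also first need rotating into a common world frame.

```python
import numpy as np

def bayes_filter_step(belief, camera_likelihoods, trans):
    """belief: (8,) posterior from the previous frame;
    camera_likelihoods: list of (8,) arrays, one per camera view;
    trans: (8, 8) pose transition matrix (rows sum to 1)."""
    predicted = trans.T @ belief               # predict: pose may drift
    for lik in camera_likelihoods:             # update: fuse each camera view
        predicted *= lik
    return predicted / predicted.sum()         # renormalized joint posterior

# Example transition model: mostly stay, some mass to neighbouring classes.
K = 8
trans = np.full((K, K), 0.01)
for i in range(K):
    trans[i, i] = 0.8
    trans[i, (i - 1) % K] = trans[i, (i + 1) % K] = 0.07
trans /= trans.sum(axis=1, keepdims=True)
```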

Head Pose Estimation in Seminar Room Using Multi View Face Detectors

Zhenqiu Zhang; Yuxiao Hu; Ming Liu; Thomas Huang

Head pose estimation at low resolution is a challenging problem. Traditional pose estimation algorithms, which assume faces have been well aligned before pose estimation, face much difficulty in this situation, since face alignment itself does not work well in low-resolution scenarios. In this paper, we propose to estimate head pose using view-based multi-view face detectors directly. A classifier is then applied to fuse the head pose information from multiple camera views. To model the temporal change of head pose, a Hidden Markov Model is used to obtain the head pose sequence with the greatest likelihood.

- Head Pose Estimation | Pp. 299-304
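Since the abstract's HMM step asks for the pose sequence with greatest likelihood, a standard Viterbi decoder is the natural sketch: per-frame detector scores act as log-emission scores, and a smoothing transition matrix discourages abrupt pose jumps. The matrices here are placeholders, not the paper's trained parameters.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """log_emit: (T, K) per-frame log scores; log_trans: (K, K);
    log_prior: (K,). Returns the most likely pose-class sequence."""
    T, K = log_emit.shape
    delta = log_prior + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans           # (prev K, next K)
        back[t] = np.argmax(cand, axis=0)           # best predecessor
        delta = cand[back[t], np.arange(K)] + log_emit[t]
    path = [int(np.argmax(delta))]                  # best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))         # backtrace
    return path[::-1]
```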

Head Pose Detection Based on Fusion of Multiple Viewpoint Information

Cristian Canton-Ferrer; Josep Ramon Casas; Montse Pardàs

This paper presents a novel approach to the problem of estimating the head pose and 3D face orientation of several people in low-resolution sequences from multiple calibrated cameras. Spatial redundancy is exploited: each head in the scene is detected and geometrically approximated by an ellipsoid, and skin patches from each detected head are located in each camera view. Data fusion is performed by back-projecting skin patches from single images onto the estimated 3D head model, thus providing a synthetic reconstruction of the head appearance. Finally, these data are processed in a pattern analysis framework to give an estimation of face orientation. Tracking over time is performed by Kalman filtering. Results of the proposed algorithm are provided in the SmartRoom scenario of the CLEAR Evaluation.

- Head Pose Estimation | Pp. 305-310
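The final tracking step ("Kalman filtering") can be illustrated with a constant-velocity filter on a single orientation angle; the noise levels and state model below are assumptions rather than the paper's values.

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1.0, r=25.0):
    """x: [angle, angular_velocity]; P: 2x2 covariance; z: measured angle (deg);
    q, r: process and measurement noise levels (assumed)."""
    F = np.array([[1.0, dt], [0.0, 1.0]])      # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])                 # we only measure the angle
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    x, P = F @ x, F @ P @ F.T + Q              # predict
    y = z - H @ x                              # innovation
    S = H @ P @ H.T + r
    K = P @ H.T / S                            # Kalman gain (2x1)
    x = x + (K * y).ravel()                    # correct the state
    P = (np.eye(2) - K @ H) @ P
    return x, P                                # filtered angle in x[0]
```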

CLEAR Evaluation of Acoustic Event Detection and Classification Systems

Andrey Temko; Robert Malkin; Christian Zieger; Dušan Macho; Climent Nadeu; Maurizio Omologo

In this paper, we present the results of the Acoustic Event Detection (AED) and Classification (AEC) evaluations carried out in February 2006 by the three participating partners from the CHIL project. The primary evaluation task was AED on the testing portions of the isolated-sound databases and seminar recordings produced in CHIL. Additionally, a secondary AEC evaluation task was designed using only the isolated-sound databases. The set of meeting-room acoustic event classes and the metrics were agreed upon by the three partners, and ELDA was in charge of the scoring task. The various systems for the AED and AEC tasks and their results are presented.

- Acoustic Scene Analysis | Pp. 311-322
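As an illustration of what AED scoring involves (this is not the official CLEAR metric), the sketch below matches hypothesized events to reference events of the same class by temporal overlap and counts correct detections, deletions, and insertions; the 50% overlap criterion and event labels are made up for the example.

```python
def score_aed(reference, hypothesis, min_overlap=0.5):
    """Events are (onset_sec, offset_sec, label) tuples."""
    matched_ref, matched_hyp = set(), set()
    for i, (r0, r1, rlab) in enumerate(reference):
        for j, (h0, h1, hlab) in enumerate(hypothesis):
            if j in matched_hyp or hlab != rlab:
                continue
            overlap = min(r1, h1) - max(r0, h0)        # temporal overlap
            if overlap >= min_overlap * (r1 - r0):
                matched_ref.add(i)
                matched_hyp.add(j)
                break
    correct = len(matched_ref)
    deletions = len(reference) - correct               # missed events
    insertions = len(hypothesis) - len(matched_hyp)    # spurious events
    return correct, deletions, insertions

# Example: one correct match, one deletion, one insertion.
ref = [(0.0, 1.0, "door_slam"), (2.0, 3.0, "phone_ring")]
hyp = [(0.1, 0.9, "door_slam"), (5.0, 6.0, "applause")]
print(score_aed(ref, hyp))   # (1, 1, 1)
```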

The CLEAR 2006 CMU Acoustic Environment Classification System

Robert G. Malkin

We describe the CLEAR 2006 acoustic environment classification evaluation and the CMU system used in the evaluation. Environment classification is a critical technology for the CHIL Connector service [1] in that Connector relies on maintaining awareness of user state to make intelligent decisions about the optimal times, places, and methods to deal with requests for human-to-human communication. Environment is an important aspect of user state with respect to this problem; humans may be more or less able to deal with voice or text communications depending on whether they are, for instance, in an office, a car, a cafe, or a cinema. We unfortunately cannot rely on the availability of the full CHIL sensor suite when users are not in the CHIL room; hence, we are motivated to explore the use of the only sensor which is reliably available on every mobile communication device: the microphone.

- Acoustic Scene Analysis | Pp. 323-330
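The abstract above is mostly motivation, but a generic microphone-only environment classifier can still be sketched (this is not the CMU system): MFCC features per clip, one GMM per environment class, and a maximum-likelihood decision. The feature settings, component count, and file layout are all assumptions.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000):
    """Frame-level MFCCs for one audio clip: (frames, 13)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def train_models(files_by_class, n_components=8):
    """Fit one GMM per environment class from labeled training clips."""
    models = {}
    for label, paths in files_by_class.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[label] = GaussianMixture(n_components).fit(feats)
    return models

def classify(path, models):
    """Pick the environment whose GMM gives the highest average log-likelihood."""
    feats = mfcc_features(path)
    return max(models, key=lambda label: models[label].score(feats))

# e.g. classify("clip.wav", train_models({"office": [...], "car": [...]}))
```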