Multimodal Affective User Interface (MAUI) Development
Semester 2006 Proposal

By
Agnès Abastado
Damien Birraux

Submitted to
Professor Christine Lisetti

Date
February 22nd, 2007


TABLE OF CONTENTS

1. RESEARCH OBJECTIVES
2. STATE OF THE ART
   2.1 Description of the MAUI paradigm
   2.2 Inputs model
   2.3 Multimodal fusion
   2.4 Action Units (AUs)
   2.5 Scherer's theory
   2.6 Social Agent
   2.7 Conclusion
3. THE SYSTEM AND ITS ARCHITECTURE
   3.1 General description of the MAUI architecture
   3.2 Focus on our work for the MAUI architecture implementation
4. THE ARCHITECTURE IMPLEMENTATION
   4.1 Design and motivations
   4.2 Overview of the implementation
   4.3 The threads versus the ActiveX controls
   4.4 Classes for the data flow
   4.5 Classes for the architecture modules
   4.6 Classes for the special modules
   4.7 Classes for the fusion modules
   4.8 Class MAUI.cpp: the brain of the software
5. THE GRAPHIC INTERFACE
   5.2 Design and functionalities
   5.3 Implementation
6. THE SIMULATION
7. FUTURE WORK
   6.5 Future work on the implemented algorithms
   6.6 Future work on the recognition engines
   6.7 Future work on the output management
   6.8 Future work on the intelligent modules
8. CONCLUSION
9. TENTATIVE SCHEDULE
10. WEEKLY STUDENT SCHEDULE
11. REFERENCES
12. ANNEXES
   12.1 How one can add an AU?
   12.2 How one can add a SEC?
   12.3 How one can add an emotion?
   12.4 How one can add a User Profile?
   12.5 Articles summaries


1. RESEARCH OBJECTIVES

Affective computing is a relatively new research field investigating the role of emotions in Human-Computer and Human-Robot Interaction. The Affective Social Computing Group at the Eurécom Institute (www.eurecom.fr/~lisetti and /~paleari/ascg/) is working toward building an Affective-Cognitive Architecture for an Affective Socially Intelligent Agent [Lisetti and Nasoz, 2004], which is psychologically sound and based on emotion theory [Scherer, 2001]. The project will focus on the implementation of a multimodal user interface capable of taking multimodal inputs (i.e. speech and video) and of communicating via an embodied conversational agent on different platforms: an avatar developed by Haptek and the Philips iCat robot. The development will include a representation of emotion according to Scherer's psychological theory [Scherer, 2001] and multimodal interface components for other modules (already developed or to be developed) such as voice recognition, face recognition, the multimodal emotion recognition system and the intelligent back-end cognitive architecture. Examples of a similar interface will be available and interface specifications will be fully provided to the students at the beginning of the project. The main goals of this project are to develop the Multimodal Affective User Interface skeleton and to code some basic classes in C++ in order to develop a demo.

In short, this project consists of:
• performing a short state-of-the-art (literature) search,
• designing the MAUI architecture and implementing its skeleton,
• coding a simple demo for emotive facial expressions.


2. STATE OF THE ART

The aim of this chapter is to present what has been done in the Affective Computing field towards the MAUI architecture development. We will try to give the basic references that explain it. We provide a state of the art that focuses on our needs; as a consequence, we only cover the main research relevant to the development of the affective interface.

2.1 Description of the MAUI paradigm:

The MAUI paradigm is a high-level architecture. The goal of this interface is to find the most probable emotion of the user, which can then be used to provide an appropriate response from the computer in a specific context. The MAUI framework is to sense and interpret the user's emotions, build a user profile (of the user) and an agent profile (of a virtual or robotic agent) in terms of experienced emotions, preferences and personality, and finally to use this information to compute an adaptive behaviour for the agent that interacts with the user in a specific context. The process is developed in three steps:
• the affect perception and recognition,
• the affect prediction and generation,
• and finally, the affect expression.
Multimodality is granted by the multimodal inputs. All the components are needed, as emotions are truly multi-layered and complex phenomena. Facial emotion recognition alone, for instance, would probably never recognize the right emotion if the user is concealing it. Body sensors alone would never give complete information about the emotion itself, as the facial expression would not be considered. This is one of the main ideas of the MAUI: a complete agent needs information from all the components in the list of facial recognition, voice recognition and others. It then fuses all these inputs before it determines the agent actions.


Fig. 1: The MAUI paradigm (Lisetti and Nasoz, 2001)


2.2 Inputs model:

It is first necessary to develop an accurate model of the inputs, since it is the link between the user and the interface. Different models of inputs have been developed. Taking into account the modern tools of computer science, one classical approach is to use an input model based on the three kinds of phenomena which give rise to emotions. The information used to model an emotion comes from sensory, cognitive and biological inputs. This is what is called the affect-cognition interface [Lisetti & Nasoz 2002]. This leads to a model associated with three phenomena, called the VKA(L) model. The first version of this model is the VKA (Visual, Kinesthetic and Auditory) model. Its advantage is to take into account the psychological and neuronal aspects of emotions, and to analyse the aspects that play a primordial role in emotion generation. Linguistic tools can be added in order to obtain more accurate signals; this extension is called the VKAL model [Lisetti & Nasoz 2002].

• The visual signal (V) deals with facial expression [Terzopoulos & Water 1990].
• The vocal emotion (A) is the way to analyse the emotion contained in the voice [Murray & Arnott, 1993].
• The (K) advanced bio-feedback takes into account physiological parameters [Healy, 2001]. In MAUI, it is the Autonomic Nervous System (ANS) signals which bring the information on these types of parameters: these signals carry emotional arousal and valence [Villon and Lisetti, 2006].
• The spoken or written natural language (L) [O'Rorke, 2000] makes it possible to take into account the subjective experience.

For us, this model has a particular importance because it is the one which has been used to simplify the MAUI implementation. Actually, the first step of the MAUI (the affect perception and recognition) is carried out by analyzing the VKAL signals.

2.3 Multimodal fusion:


Fusion consists of combining two or more signals. This step can only succeed with coupled and synchronized signals. Moreover, at the signal level, the signals have to be of the same nature. To achieve this fusion, Hidden Markov Models (HMM) or time-biased Neural Networks (NN) are often used. Multimodal fusion can be applied at several levels. For instance, it can be used to mix information about voice and lip movements for speech recognition, or voice and video features for emotion recognition. Voice and facial emotion can be recognized together to obtain better results [Lisetti & Paleari, 2006]. Multimodal information fusion is usually performed at the following three different levels:
• Signal level
• Feature level
• Decision or Conceptual level.
These different levels allow simultaneous emotion generation and recognition.

2.4 Action Units (AUs):

The first scientifically relevant works about facial expressions of emotions are those of Ekman and Friesen [Ekman, 1972], who describe a facial expression as a combination of independent parameters. These independent parameters are called Action Units (AUs). In 1971, Ekman analyzed the movement of 46 action units, 35 of which correspond to the minimal set of muscular face movements that cannot be split into smaller ones. The other 11 are more complex movements. Furthermore, an intensity value is attached to each unit to give more accuracy to the model.


Fig. 1: Some action units extracted from Cohn and Kanade’s database

The principle is to associate a basic facial movement or action with a given Action Unit. AUs define minimal actions that cannot be divided into smaller or simpler actions. We can then define and generate facial expressions in terms of combinations of AUs [Grizard, Paleari and Lisetti, 2006; Paleari and Lisetti, 2006]. Moreover, they have a particular importance since they allow an automation of the emotion generation processing. In fact, the generation of a facial expression consists of an AU combination and an intensity prediction. The AUs are a kind of response to the Sequential Evaluation Checks of Scherer's theory.

2.5 Scherer's theory:

The MAUI has been developed [Lisetti & Nasoz 2001] as an adaptive system architecture combining and taking into account psychological theories such as Scherer's. Scherer's theory brought a huge improvement to emotion modelling [Paleari, 2005]. K. R. Scherer has developed a sequential check theory of emotion differentiation. Emotions can be accurately recognized by analyzing the results of a sequence of sequential evaluation checks (SECs) [Scherer, 2001]. The SECs are the minimal number of binary criteria that allow us to differentiate emotional subjective states.


The emotion appraisal consists of five distinctive functions associated with five organismic subsystems (information processing, support, executive, action and monitor). The SECs are organized and processed in sequence, consisting of four stages called the appraisal objectives:

• The event relevance, which takes into account the novelty (suddenness and predictability), the intrinsic pleasantness and the suitability for the momentary goals.

• The event implications and consequences, which take into account whether an agent was responsible for the event or not, the individual's estimate of the event probability, the discrepancy with the individual's expectation, the conduciveness to the current goal achievement, and finally the urgency in terms of priority of the goals/needs.

• The coping potential, i.e. knowing the probability that an event can be influenced by an agent, the likelihood that this agent is able to influence a potentially controllable event, and whether the agent can adjust to or live with the consequences of this event.

• The normative significance, which refers to the personal values of the individual and his perception of the norms and standards of the society in which he is living.

However, the process of appraisal requires multiple rounds of appraisal at three differential levels: the sensory-motor level, occurring at the level of innate features and reflex systems; the schematic level, based on the learning history of the individual; and the conceptual level, involving memory storage. In fact, the theory considers emotions at three levels, giving guidelines for developing recognition and generation of affective cues. This theory is crucial for the MAUI development as it gives a computable representation of the emotions and a way to recognize and generate them. Actually, if Scherer's theory is used for emotion recognition, it can be used in the same way for emotion generation [Lisetti, Paleari & Grizard, 2006]. This has allowed researchers to improve the believability of emotions.

Scherer's theory permits the prediction of emotional responses from the face, the body, the gestures or the autonomic nervous system. From a given emotion it is possible to generate the corresponding facial expression with the help of Scherer's sequential evaluation checks.

2.6 Social Agent:

The role of the social agent is to create a better interaction with the user. The interactions between humans and social agents integrate two states: emotion expression and the emotion model. To give a simple definition, a rational agent is an agent who, given its beliefs about the environment, chooses an action in such a way that it maximizes its desires or Subjective Expected Utility (SEU) [Lisetti & Gmytrasiewicz, 2002]. The SEU weighs the utility of the possible outcomes by the agent's subjective probability estimates. Many models have been developed which assume different hypotheses in order to define the role of a rational agent. These models take into account observations of human beings, such as the fact that each individual tries to maximize his or her utility. Mathematical and abstract theories have been developed to correctly model a rational agent [Lisetti and Gmytrasiewicz, 1999]. These models simplify the implementation.

2.7 Conclusion:

Much research has been done on the several aspects of the MAUI implementation, and it is now time to fix the interface specifications and to develop the Multimodal Affective User Interface code skeleton, which will be our work during this project.


3. THE SYSTEM AND ITS ARCHITECTURE

3.1 General description of the MAUI architecture:

Using the state of the art of this field, we designed a complete and implementable architecture for the MAUI system. In this chapter, we are going to describe this complete MAUI architecture and its fundamental processing steps. In a second part we will detail the parts we are actually going to implement, but it seems essential to first have a global overview of the whole system.



The MAUI has five inputs:
• The BodyMedia armband, a wireless body monitor device which allows three measurements [Lisetti and Nasoz, 2002]: the skin conductance, the skin temperature compared with the ambient temperature, and the movement (Fig. 2).
• An additional device, called the Polar system, which measures the heartbeat [Lisetti and Nasoz, 2002].
• A microphone.
• A webcam that films the scene.
• The keyboard and mouse, for which we distinguish two types of signals: the haptic signal [Lisetti and Nasoz, 2002] and the key pressure that registers the keys or buttons pressed by the user.

Fig. 2: SenseWear Armband BodyMedia


Fig. 2: The global MAUI architecture. The block diagram shows, from inputs to outputs: the input signals; the signal level fusion; the signal processing and recognition modules (ANS, audio/speech, video and user identification, keyboard and mouse with text capture); the feature level fusion; the feature recognition modules, producing decisions in terms of Scherer's Theory SECs; the fusion of affective cues; the User Profile and Agent Profile (static information, interaction specific information, historical information, current affective information); the Affective Social Cognitive Architecture and its communicative functions (speech generation, text generation, definition of adaptive behaviors, transparent mirroring of the user's affective state); the fission of affective cues; the feature generation modules (speech expression, facial expression, movement, other modalities such as the iCat lights); and a final signal level fission toward the output signals.

Once we have retrieved all the input signals, the second step of the process is the signal level fusion. In this first fusion step, a pre-processing is performed and a fusion operation gathers all the signals and combines them (if there are several microphones or webcams, for instance, it computes the full audio and video signals). Then comes the signal processing level, which extracts features from the pre-processed signals. The MAUI system performs affective expression recognition in terms of arousal and valence from the ANS signals [Villon and Lisetti, 2006], vocal feature extraction (based on the pitch of the voice, the speed, etc.) and text extraction from both the voice and the keyboard. From images of the user's face, a system based on neural networks extracts the Action Units [Lisetti and Nasoz, 2002; Grizard, Paleari and Lisetti, 2006]. Finally, a keyboard and mouse signal processing extracts the haptic movement features, which refer to the mouse motion and the way the user presses the keys (for instance, it could detect that the user is shaking). A feature level fusion is then able to fuse the different features from the previous analyses into complex data representations. We are now able to recognize the emotions in terms of Scherer's Theory SECs [Scherer, 2001; Grizard, Paleari and Lisetti, 2006; Paleari and Lisetti, 2006]. As previously, we have one module for each type of signal: an ANS feature recognition module, a voice feature recognition module, a linguistic emotion recognition module for both speech and keyboard that uses linguistic tools to find the possible emotional state of the user, the image feature recognition and a haptic movement feature recognition.

The emotional information about the user goes into the User Profile [Lisetti and Nasoz, 2002]. It is a structure that gives a complex data representation of the user. In this data structure, all the useful static, interactive or emotional information is stored. We have the same structure for the Agent Profile, which aims to become a "fully emotional character" interacting with the user.

The affective social cognitive architecture is the crucial part of the process of creating an interactive and emotionally intelligent response, as it is where we really decide what the most adaptive behaviours are for the agent. It determines the best behaviour of the interactive agent, choosing in particular the speech or text that will be said by the agent and its emotional attitude. This information is then transmitted to the communicative functions, which are the intermediary functions transforming the decision about the adaptive emotions into Scherer's Theory SECs. The fission of affective cues level is the dual of the fusion of affective cues one, translating the SECs structure into another SECs structure and then into features (AUs, vocal features, little motion features, etc.) [Scherer, 2001; Grizard, Paleari and Lisetti, 2006]. Then comes the creation of signals from the emotion features. It is done by the speech, facial expression and other expression or movement generators. A final signal level fission extracts the actual physical, electrical signals or instruction code messages that need to be transmitted to the outputs.



The outputs are:
• A visual interface with a virtual Haptek avatar [Lisetti and Nasoz, 2002; Grizard, Paleari and Lisetti, 2006], a text field, and other functionalities.
• Loudspeakers.
• The iCat, a research platform for studying human-robot interaction topics, developed by Philips.

Fig. 3: Cherry, the Haptek virtual agent developed by the ASCG, and the iCat Philips robot

3.2 Focus on our work for the MAUI architecture implementation:

The goal of this chapter is to walk through the simplified architecture we designed, describing the goals and the functions of each module, as well as the information flow. We are now going to describe only the parts we are going to implement; the dotted modules on the diagram correspond to optional work, depending on our project schedule management.

3.2.1. The inputs:

We are going to place the emphasis on three inputs:
• A microphone that registers the user's voice.
• A webcam that films his face.
• The keyboard, for which we are going to consider the key pressure that registers the keys pressed by the user.

3.2.2. The signal recognition:

We are then going to implement three modules for the recognition linked to these inputs; their aim, after a signal level processing (called the signal level fusion), is to extract the signal features:

• Speech Recognition:
Input: Voice
Output: Recognized text
Function: Ideally, the sound measured by the microphone and accurately transformed by the pre-processing (so that it becomes usable by a speech recognition engine) is directly recognized for any user and transformed into text.
Our work: We are not going to focus on the voice and sound processing, as it will remain a minor part of our work in this project. For our project, we are going to try to use the Microsoft Speech Recognition Engine. We may have to perform a training of our voices for this engine.


Fig. 3: MAUI Architecture 1/3


• Video Signal Processing:
Input: Video sequence
Output: Recognized Ekman's (1971) AUs
Function: Image and video processing is necessary to adjust the images. For instance, the images need to be in a specific format for the facial recognition engine (specific lighting conditions, scale, etc.) [Lisetti and Nasoz, 2002]. Then a recognition system for AUs is necessary: the module would ideally be able to take the webcam video as input and analyze it in order to extract the AUs.
Our work: For the simulation, the module will take video files and "analyze" them by reading the pre-extracted AUs from attached text files.

• Text Capture:
Input: Text
Output: Text
Function: Finally, the keyboard analyzer consists of a text recovery engine whose aim is to recover the text typed on the keyboard.
Our work: The module will just take the text inserted with the keyboard in a text field of the interface and store it in an appropriate data structure.

3.2.3. The feature level fusion:

This fundamental step of the final MAUI system performs the fusion of the features:
Input: Data structures representing Ekman's AUs from the video, text from speech and text from the keyboard.
Output: Ekman's AUs and two texts.
Function: The module would ideally be able to fuse the different features into more complex data structures.
Our work: The module will actually be transparent, transmitting the AUs to the next step.

3.2.4. The feature recognition:

We are going to implement two modules of the feature recognition level:

• Video Feature Recognition:
Input: Sequences of Ekman's AUs
Output: Recognized SECs
Function: The module will compare the AUs in input with the definition of the SECs given by Scherer's theory [Scherer, 2001; Grizard, Paleari and Lisetti, 2006; Paleari and Lisetti, 2006].
Our work: We are going to perform a mapping between the AU set coming from the user's emotional expression and a database that makes the link between AUs and SECs.

• Linguistic Emotion Recognition:
Input: Text
Output: Recognized SECs
Function: The module will ideally search the recognized text for text patterns linked to specific SECs. Actually, linguistic terms are often associated with emotions, and a semantic analysis of the vocabulary could help with the user's emotion recognition. For this function, we have to build a semantic database listing the vocabulary associated with specific emotions.
Our work: Any work at this level will be optional.

3.2.5. The fusion of affective cues:

The aim of this fusion level is to merge the information from the incoming features into a new data structure that gives a better representation of the emotion:
Input: SECs.
Output: SEC structure.
Function: The scope of this module is to put together the information about the recognized SECs coming from the different feature recognizers.
Our work: This whole module is not expected to be developed for the project. The processing at this level will be optional, but we have to define the new SEC data structure. It will be developed as almost transparent, or as an aggregator of the recognized SECs.


Fig. 4: MAUI Architecture 2/3


3.2.6. The user profile:

The user profile gathers the user identification information:
Input: SEC Structure.
Output: SEC Structure.
Function: The module will contain data structures representing the user [Lisetti and Nasoz, 2002].
Note: ASIA (Affective Social Cognitive Architecture) will be able to access (read/write) the data contained in the profile.

We decided to split the information about the user into four parts:
• The static information is the data that is definitely associated with each user: the user enters this information at his first connection (the first time he uses the MAUI). It includes, for instance, the user's name, gender and personality (as the result of a quick test).
• The interaction specific information is the data that remains constant each time the user uses the MAUI but can easily be modified by him. In a non-exhaustive list, it could include the user's nickname and his favourite interface (favourite avatar or iCat) [Lisetti and Nasoz, 2002].
• We are also going to store some dynamic information, modified by the events occurring during each MAUI session. It is of two types:
  o The affective and event related historical information, which consists of emotion statistics computed on the user's moods and the storage of particular events, i.e. a history of what happens with this user. For instance, the computer has to be able to remember that, for this user, a particular word is frequently associated with sadness, or that the use of the text editor made him nervous, etc.
  o The current affective state information is directly correlated with the current emotion recognition and has a SEC based structure.

We are going to build two of these data structures: the static information sub-module and the current affective state information one.


Fig. 5: MAUI Architecture 3/3


3.2.7. The affective social cognitive architecture:

The affective social cognitive architecture is the module that defines the most adaptive behaviours for the agent. This includes the emotion expression as well as the text the agent will have to say/write and the correlated movements. As a consequence, its inputs are all the information stored in the user and agent profiles.
Input: SEC Structure
Output: SEC Structure
Function: speech generation, text generation, definition of adaptive behaviours, etc.
Our work: The module will be developed as completely transparent. The only function it will perform will be the direct mirroring of the affective expression.
Note: ASIA will be able to access (read/write) the data contained in both the user's and the agent's profiles.

3.2.8. The communicative functions:

For our project, we only consider the affective information module and the speech information module at this level:

• Affective Information:
Input: SEC Structure
Output: SEC Structure
Function: The aim of this module is to merge the information from the affective social cognitive architecture and the current affective information of the agent to generate the final affective state that will be communicated by the agent.
Our work: The module will contain a data structure able to store the information about the affective state that will be communicated.

• Speech Information:
Input: Text + SEC based structure
Output: Text
Function: This module has to generate the final adaptive text that will be said by the agent, taking into account the information from the affective social cognitive architecture.


Our work: Any work at this level will be optional, but we may use this function to enable the agent to say a predefined text (about what it is doing, for instance).

3.2.9. The fission of affective cues:

This module is the dual of the fusion of affective cues one:
Input: SEC Structure
Output: SECs for the different Feature Generators
Function: The module has to change the structure of the data so that it can be directly used by the generator functions.
Our work: The module will be developed as transparent. It will pass to the different feature generators the same SEC structure it has as input.

3.2.10. The feature generation level:

We only consider the facial expression feature generation at this level:
Input: SECs
Output: Ekman's AU sequences (AUs + intensity level)
Function: Ideally, the module will generate AUs for each SEC in input [Grizard, Paleari and Lisetti, 2006].
Our work: The module will actually store the whole SEC structure, perform recognition of the emotional label represented by the structure and determine the corresponding AUs.

3.2.11. The platform selector:

We then need to select the output platform; all the data pass through this last inter-signal level.
Input: Features from the different Feature Generators (AUs, speech parameters, etc.), speech information and other modality related signals.
Output: Filtered features
Function: Determine the wanted platform from the user's preferences and filter the features to keep only the ones available on the specific platform.
Our work: The module will be implemented as transparent. It will take AUs (and possibly speech information) as input and will send them to the next modules.

3.2.12. The commands generator:

Finally, all the information computed so far to produce an adaptive behaviour in response to the user's attitude will be transformed into commands for the selected platform.

• Cherry commands:
Input: Features, speech information & other.
Output: Commands (signals) for Cherry.
Function: The module will ideally aggregate and translate the information in input into commands for the avatar.
Our work: For our project and the final simulation, the module will take the AU sequence in input but read the emotion in a file to build the command to be sent to the avatar. At the same time it will send any speech text.

• iCat commands:
Input: Features, speech information & other.
Output: Commands (signals) for the iCat.
Function: The module will ideally aggregate and translate the information in input into commands for the iCat.
Our work: As an optional work for our project and the final simulation, the module will search a file for the AU sequence in input and read the command to be sent to the iCat. At the same time it will send any speech text.


4. THE ARCHITECTURE IMPLEMENTATION

4.1 Design and motivations:

The implementation of the architecture is done in the C++ language. Our goal is to build the simplest, most flexible and most adaptable architecture possible, as our work is going to be used for future research and to be completed and adapted to more complex applications.

4.2 Overview of the implementation:

Following these motivations, we decided to implement each module as a class. The data flow, i.e. the messages transmitted from one module to another, is also implemented as distinct classes. This allows us to have a clearer view of the whole software. Finally, we tried to stay consistent in the writing of the code and to have a homogeneous code for the whole project, and also to be as clear as possible by adding clear separators and comments whenever it was possible.

4.3 The threads versus the ActiveX controls:

4.3.1 Presentation of the threads:

"Thread" is short for "thread of execution". Threads can be thought of as lightweight processes, offering many of the advantages of processes without the communication requirements that separate processes require. Threads provide a means to divide the main flow of control into multiple, concurrently executing flows of control. At least one thread exists within each process. If multiple threads exist within a process, they share the same memory and file resources. Threads are pre-emptively multitasked if the operating system's process scheduler is pre-emptive.


4.3.2 Interest for the MAUI project:

We wondered whether we needed a parallel thread structure for the processes of the first step, as we need to perform the speech recognition, the video signal processing and the text capture at the same time, and an assembly-line work process using threads for the video chain. Threads could be useful for these two aspects of our project. Then, as we mainly consider the video chain, we wondered if we needed to build an assembly-line work process using threads. Actually, while the video feature recognition is processing the data collected at time 't', the video signal processing module must carry on collecting and processing the new data for time 't+1'.

4.3.3 The threads management in Visual C++:

After studying the implementation of threads in C++, we concluded that threads can be implemented with the class CThread, but that it is not easy to use. Historically, threads did not exist at the very beginning of C++, but developers quickly considered this a lack of the language. The class has been designed to be as safe as possible. It has only basic functionalities, and it is tricky to develop more complex applications using threads. In our case, the class could have been sufficient for our aim, but as we found an easier, more portable and more extendible solution, we preferred the use of ActiveX controls.

4.3.4 The use of ActiveX Controls:

Actually, we think that we can avoid the use of threads in order to make our software simpler. As a replacement, we decided to use the possibility offered by the MFC of using ActiveX Controls (also known as OLE controls). The heavy applications (Microsoft Speech Recognition, the video player and the Haptek Cherry controller) are managed with ActiveX controls. ActiveX controls are programs that run on the user's machine from within the software. Some security problems have been detected for web applications; for us, this is not a problem. Consequently, we can use this very powerful tool. In fact, the multitasking aspect is simply taken in charge by the dynamic link libraries (DLLs) of Windows. The other applications are lighter and do their job only at a specific time; it is quite easy to start them ourselves whenever we want. The main advantage of using ActiveX controls is an easier implementation of the "multimedia functions". For the video we used the control of VideoLAN Client (VLC). It permits the use of any kind of video (mpeg, avi, mov, wmv, etc.) without codecs. The only drawback is a problem of mobility, since the user has to install the software. Nevertheless, VLC is an open source program and it is simple to download (from www.telecharger.com, for instance).

4.4 Classes for the data flow:

The data flow is transmitted from one module to another at each step of the process, and it requires a specific work to find the most adaptive structure that could be designed. As we want to stay very simple, adaptable and portable, we chose to implement those messages as specific classes.

4.4.1 Feature level messages:

• Messages:

As we are going to manipulate time in all types of messages, we create a parent class called "Message" with a private variable: Time, of type CString. This variable has the structure "seconds.milliseconds", where the first part is the time in seconds since midnight (00:00:00), January 1st, 1970, coordinated universal time (UTC), and the second part is the fraction of a second in milliseconds. This allows us to have a simple form for the time (a CString) with both an absolute time and a good precision.

struct _timeb timebuffer;
CString _sc, _ms;
CString c_time;
_ftime( &timebuffer );
_ms.Format("%d", timebuffer.millitm);
_sc.Format("%d", timebuffer.time);
c_time = _sc + "." + _ms;

This code constructs the CString c_time, and we then just have to use the function set_time(CString _time) of the class "Message" whenever we need to set another time. By default, each message has a time equal to the computed CString at the moment of its declaration. This will highly simplify the time handling in the later code, as every piece of data flow has an associated time since its declaration.

• AU messages (inherit from "Message"):

Based on Ekman's representation of the facial expression of emotion, we implemented a class called "AU_Message" with five private variables (actually four, plus the time inherited from the class "Message"):

Table 1: The AU structure variables

Private variables | Types
AU (number in Ekman's AU list) | Integer between 0 and NBAUS (0 corresponding to the default value or "empty" case, for simplicity)
Intensity of the left side of the face | Character with an integer value corresponding to a, b, c, d and e
Intensity of the right side of the face | Character with an integer value corresponding to a, b, c, d and e
Likelihood | Integer corresponding to the probability (in %) that the estimation is correct (in our case 100% for all the AUs)
Time | CString "s.ms", inherited from "Message"

The AUs are integers and the value of NBAUS = 64 (63 Ekman AUs + the "0" case). Functions "get" and "set" are implemented in order to manipulate each variable.
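A possible shape for this class, following Table 1, is sketched below (member names and default values are illustrative assumptions, not the project's exact code):

const int NBAUS = 64;   // 63 Ekman AUs + the "0" (empty) case

class AU_Message : public Message
{
public:
    AU_Message() : m_AU(0), m_intensity_left('a'), m_intensity_right('a'), m_likelihood(100) {}

    void set_AU(int au) { if (au >= 0 && au < NBAUS) m_AU = au; }
    void set_intensities(char left, char right) { m_intensity_left = left; m_intensity_right = right; }
    void set_likelihood(int percent) { m_likelihood = percent; }

    int  get_AU() const { return m_AU; }
    char get_intensity_left() const { return m_intensity_left; }
    char get_intensity_right() const { return m_intensity_right; }
    int  get_likelihood() const { return m_likelihood; }

private:
    int  m_AU;               // AU number in Ekman's list, 0 = empty
    char m_intensity_left;   // 'a'..'e'
    char m_intensity_right;  // 'a'..'e'
    int  m_likelihood;       // probability of a correct estimation, in % (here always 100)
};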

• AUs list messages:

The recognition engine can detect one or several AUs at the same time, and that is why we need a flexible structure. We chose to implement a message called "AUs_list_Message", constructed as a list of "AU_Message". We define some functions to manipulate it (add elements, get elements, etc.). For this purpose, we use the class CList. The drawback is that the CList is not easy to manipulate and that its code is heavy; that is why we implement our own functions. On the other hand, the CList is a general structure, already defined in a library and therefore complete, and it is easy to find information about how to use it. It is consequently more suitable for future modifications.
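A minimal sketch of such a wrapper around MFC's CList is shown below (the wrapper function names are assumptions; only the CList calls themselves are standard MFC):

#include <afxtempl.h>   // CList (MFC)

class AUs_list_Message : public Message
{
public:
    void add_AU(const AU_Message& au) { m_list.AddTail(au); }

    int get_count() { return (int)m_list.GetCount(); }

    // Returns a copy of the i-th AU_Message of the list (0-based index).
    AU_Message get_AU(int i)
    {
        POSITION pos = m_list.FindIndex(i);
        return m_list.GetAt(pos);
    }

private:
    CList<AU_Message, const AU_Message&> m_list;
};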

• Text messages (inherit from "Message"):

The text message "Text_Mess" is a class with two variables:

Table 2: The text message variables

Private variables | Types
Text | Class CString
Time | CString "s.ms", inherited from "Message"



• Feature list messages (after the platform selection):

This message includes a list of features that are useful depending on which platform we select. In our case, it will be a simple list of AUs and, as a consequence, we are just going to use an AUs_list_Message.


4.4.2 Scherer's theory SECs messages:

• SEC messages (inherit from "Message"):

Based on Scherer's theory, the SECs are a sequence of the following sixteen pieces of information:
• facial expression (happiness, disgust, sadness, anger, fear) [Grizard, Paleari and Lisetti, 2006],
• valence (unpleasant, pleasant, unknown),
• intensity (very low, low, medium, high, very high: a, b, c, d, e),
• duration,
• focality (object, event, global, unknown),
• agency (who is responsible?),
• novelty (predictability of the emotion),
• intentionality (caused by someone),
• controllability,
• modifiability (time notion),
• certainty (anticipation of the effect to come),
• legitimacy,
• external norm (acceptable for the others),
• internal norm (acceptable for oneself),
• action tendency (identify the most appropriate coping strategies),
• causal chain (identify the cause of the event).

In order to cope with the emotion representation based on Scherer's theory, we define the integer NBSECS = 16 and the following CString SECs:

#define SEC1  "Novelty: Suddenness"
#define SEC2  "Novelty: Familiarity"
#define SEC3  "Novelty: Predictability"
#define SEC4  "Intrinsic pleasantness"
#define SEC5  "Goal-Need relevance"
#define SEC6  "Cause: Agent"
#define SEC7  "Cause: Motive"
#define SEC8  "Outcome probability"
#define SEC9  "Discrepancy from expectation"
#define SEC10 "Conduciveness"
#define SEC11 "Urgency"
#define SEC12 "Control"
#define SEC13 "Power"
#define SEC14 "Adjustment"
#define SEC15 "Internal standards compatibility"
#define SEC16 "External standards compatibility"

These sixteen SECs represent the thirteen Scherer sequential evaluation checks of the four appraisal objectives (relevance, implication, coping potential and normative significance), taking into account all their declinations (for instance, the novelty has three declinations: suddenness, familiarity and predictability). An intensity and a likelihood are linked to each SEC. The SEC_Message structure is:

Table 3: SEC message variables

Private variables | Types
SEC | CString: defined by SEC1 to SEC16
Intensity | Integer varying from LOWERINTENSITY = -100 to HIGHERINTENSITY = 100
Likelihood | Integer corresponding to the probability (in %) to have this SEC with its intensity [MINLIKELIHOOD = 0 to MAXLIKELIHOOD = 100]
Time | CString "s.ms", inherited from "Message"



• SECs structure messages (after the fusion of affective cues):

As the number of SECs is currently fixed by the theory, we implement the SECs structure as a simple vector of NBSECS elements of the type "SEC_Message" described above. The SECs structure is fixed, and we chose this solution because it is the easiest structure to implement and manipulate. The SECs that are not going to be considered in the recognition computation have their likelihood set to 0 by default.
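For illustration, the two structures could look like the following sketch (the constant values are taken from the tables above; class internals and accessor names are assumptions):

const int NBSECS          = 16;
const int LOWERINTENSITY  = -100;
const int HIGHERINTENSITY =  100;
const int MINLIKELIHOOD   =    0;
const int MAXLIKELIHOOD   =  100;

class SEC_Message : public Message
{
public:
    SEC_Message() : m_SEC(""), m_intensity(0), m_likelihood(MINLIKELIHOOD) {}

    void set_SEC(CString sec)  { m_SEC = sec; }          // one of SEC1..SEC16
    void set_intensity(int i)  { m_intensity = i; }      // LOWERINTENSITY..HIGHERINTENSITY
    void set_likelihood(int p) { m_likelihood = p; }     // MINLIKELIHOOD..MAXLIKELIHOOD

    CString get_SEC() const    { return m_SEC; }
    int get_intensity() const  { return m_intensity; }
    int get_likelihood() const { return m_likelihood; }

private:
    CString m_SEC;
    int m_intensity;
    int m_likelihood;
};

// The SECs structure after the fusion of affective cues: a plain vector of NBSECS SEC_Messages.
class SECs_vect_Message : public Message
{
public:
    SEC_Message& at(int i) { return m_SECs[i]; }
    const SEC_Message& at(int i) const { return m_SECs[i]; }

private:
    SEC_Message m_SECs[NBSECS];   // default likelihood 0 for the SECs that are not considered
};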

4.4.3 Command messages:

• List of commands for Cherry:

The commands that are sent to Cherry are simple CStrings of the form:

"\\load[file= [data/standard/safe_moods/Hap_" + Emotion + ".hap"

As a consequence, we use a "Text_Message" with the command as its CString.


Fig. 4: The data flow basic messages

Fig. 5: The data flow more complex structures


4.5 Classes for the architecture modules:

4.5.1 Speech Recognition:

Goal: speech → text

We tried to use the Microsoft Speech Recognition engine, drawing our inspiration from the file "talkback.cpp" available on the Affective Social Computing Group lab computer. We had some problems implementing this module: even if we quickly found the part of the code useful for the recognition, it raised several issues. First, it strongly reduces the exportability: it requires not only the relevant API but also Windows XP to be installed on the computer. Secondly, we did some tests in the lab of the Affective Social Computing Group and they revealed that the recognition works very badly: only a few words are recognized, resulting in incoherent sentences that are totally useless for MAUI. As a consequence, we decided it was not useful to reduce the exportability that much in order to include code that does not work well. We finally implemented the module with an empty function "recognize_Speech()" that will need to be written once the speech recognition performance is better (see future work).

4.5.2 Video Signal Processing:

Goal: video → AUs

This module is crucial, as we will see later on, because it is the point of departure of the whole simulation. It simulates the recognition of AUs related to the video that is launched from the interface. It actually reads a text file containing the pre-determined AU list for this video (same file name as the video, but with the extension ".au.txt"). It sends output "AUs_list_Message"s, reading the text file gradually and waiting for the video, such that the recognized AUs are read just after they appear on the video. The module sends one list per frame, i.e. one or several AUs if more than one AU is recognized in the same frame. Manipulations of the CString time are mandatory to wait for the video (the AU reading time is linked to the frame number multiplied by the duration of one frame: TIMEFRAME = 33 milliseconds) and to send messages with the right time information (the same time if the AUs were recognized in the same frame). For the purpose of running the simulation, two pieces of information need to be set at the beginning: the complete path to the video and the absolute starting time of the video. This can be done by calling the two Video Signal Processing module functions "set_Path" and "set_video_time". Then, we have to call "launch_video" to launch the reading of the AUs and consequently the whole video recognition chain.
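The pacing logic can be sketched as follows (the ".au.txt" line format, the function name and the use of GetTickCount as the common clock are assumptions made for the illustration, not the project's actual code):

#include <fstream>
#include <windows.h>   // Sleep, GetTickCount

const int TIMEFRAME = 33;   // milliseconds per video frame

// video_start_ms is assumed to come from the same clock as GetTickCount().
void read_AUs_paced(const char* au_file_path, DWORD video_start_ms)
{
    std::ifstream au_file(au_file_path);      // e.g. "happiness.avi.au.txt"
    int frame, au;
    char left, right;

    // Assumed layout: one line per recognized AU: "<frame> <AU> <left intensity> <right intensity>"
    while (au_file >> frame >> au >> left >> right)
    {
        // wall-clock time at which this AU appears in the video
        DWORD due_ms = video_start_ms + (DWORD)(frame * TIMEFRAME);
        DWORD now_ms = GetTickCount();
        if (due_ms > now_ms)
            Sleep(due_ms - now_ms);           // wait until the video reaches the frame

        // Here the real module would fill an AU_Message with (au, left, right),
        // append it to the per-frame AUs_list_Message and send it through MAUI.
    }
}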

4.5.3 Text Capture:

Goal: keyboard → text

This module performs the text capture: it fills in a "Text_Message" (output) with the text from the interface.

4.5.4 Video Feature Recognition:

Goal: AUs → SECs

4.5.4.1 Overview:

In this module, we perform the SECs recognition as Scherer explained it in his theory and as the Affective Social Computing Group of Eurécom interpreted it. Actually, we are going to implement a mapping between the set of the last AUs expressed by the user and the following database. This allows us to fill in the SEC vector, with the associated likelihoods corresponding to the distance between the input AU list and the database AU sets. The output of this module is a SEC vector with the right intensities and likelihoods.

Table 4: Scherer AUs-to-SECs estimation database

AUs | SEC
2 & 5; 7, 26 & 38 | Novelty
1, 4, 5, 12; 26 & 38 & 25 | Pleasantness (Valence)
4, 7, 9, 10, 15, 17, 24 & 39; 16, 19, 25 & 26 | Unpleasantness (Valence)
4, 7, 17 & 23 | Goal-Need Conduciveness (discrepant)
15, 25, 26, 41 & 43 | Coping Potential & No Control
4 & 5; 7, 23 & 25; 23, 24 & 38 | Coping Potential, Control & High Power
1, 2, 5, 20, 26 & 38 | Coping Potential, Control & Low Power

4.5.4.2 Algorithms:

The first part of the algorithm collects the "AUs_list_Message"s coming from the video recognition. Actually, for the SEC recognition we do not only use the last message sent, corresponding to the last video frame, but all the messages within a certain time slot called RECOGNITIONTIME, currently equal to 200 milliseconds. For this purpose, we implement a buffer of the AU list type and an update function linked to it. The recognition is then performed as a mapping between the buffer, completed with the input list of AUs, and the AUs of each SEC in the previous table. We decided to do this mapping through intermediate boolean vectors (boolean being the least expensive type). As we will see in the future work, this representation is optimal for our application but not for the final application using a real AU recognition engine, as it does not take into account the likelihood of the input AUs. However, the code can easily be transformed whenever it becomes necessary.

The input boolean vector (NBAUS booleans):
1 0 1 0 1 1 … 0 1 0

The boolean vector for one specific SEC (NBAUS booleans):
0 0 1 1 1 0 … 1 1 0

The "and" vector:
0 0 1 0 1 0 … 0 1 0

By computing the logical "and" between the boolean input and the boolean AU representation of each SEC, we find the AUs that are accurately recognized for each SEC. The likelihood of each SEC is linked to the number of accurately recognized AUs. Actually, they are proportional, but we also have to take into account the number of relevant AUs in each set. Evidently, if one SEC needs three AUs to be recognized at 100% and another one needs eight AUs, the second one is at a disadvantage compared with the first. We rectify this point by computing the likelihood as the number of accurately recognized AUs divided by the number of relevant AUs for this set. We do not need to take into account the number of AUs recognized at the input because, even if we can think that a too-sensitive system recognizing almost all the AUs could be bad, it will not have a real consequence on the SEC mapping: we can recognize several SECs at the same time, and their likelihoods depend only on the previous values. The consequence will be on the threshold we will choose later in the "intelligent" part of the MAUI, when we decide which behaviours to adopt and therefore which SECs are important.

The computation of the intensity of the output SECs is another critical part. It depends directly on the AUs recognized and on the "way" the SECs are mapped. Actually, in the general case, we suppose that the intensity of the SEC is the sum of the intensities of the relevant and accurately recognized AUs (the mean of the left and right intensities) divided by the number of relevant and accurately recognized AUs. These AUs are the "1"s within the "and" vector. With the Scherer AUs-to-SECs estimation (see the table above), it is not always simple to link the recognition to our 16-SEC representation. First, only five SECs are concerned by the recognition ("Novelty: Suddenness", "Intrinsic pleasantness", "Conduciveness", "Control", "Power"). Then, the SEC intensities are not directly linked to the AU intensities: for instance, we recognize the pleasantness (Intrinsic pleasantness with positive intensities) and the unpleasantness (Intrinsic pleasantness with negative intensities) separately, and we only recognize discrepant Conduciveness, which corresponds to the very lowest intensity, etc. Therefore, we adapted the output intensities to each case (see the intensity values of the SECs in the Facial Expression Generation).
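A sketch of this likelihood computation, under the boolean-vector representation described above, could be (names are illustrative; the SEC→AUs table is passed in as data):

const int NBAUS = 64;

// Likelihood (in %) of one SEC, given the buffered input AUs and the AU set that
// defines this SEC in the database (both as boolean vectors).
int compute_SEC_likelihood(const bool input[NBAUS], const bool sec_aus[NBAUS])
{
    int relevant = 0;   // number of AUs that define this SEC
    int matched  = 0;   // AUs both required by the SEC and recognized in the input
    for (int au = 0; au < NBAUS; ++au)
    {
        if (sec_aus[au])
        {
            ++relevant;
            if (input[au])      // logical "and" of the two boolean vectors
                ++matched;
        }
    }
    if (relevant == 0)
        return 0;
    return (100 * matched) / relevant;   // accurately recognized AUs / relevant AUs
}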

4.5.5 Affective Information:

Goal: SECs Struct

This module only transmits the SECs vector in our application. It is transparent.

4.5.6 Facial Expression Generation:

Goal: SECs → emotion → AUs

4.5.6.1 Overview:

This module aims to translate the SECs into AUs via an intermediate representation, namely the emotions. We used the following database to provide a mapping between SECs and emotions:


Fig. 6: Predicted appraisal patterns for some major emotions

4.5.6.2 Algorithms and implementation:

Emotions have a SECs structure with an additional name. That is why we decided to implement the class called "Emotion_Message", which inherits from "SECs_vect_Message" and has an additional private variable: a CString called Emotion. We defined the number of emotions NBEMOTIONS = 15 (the fourteen described in the above table + the neutral case), and we fixed the major modal emotion names:

#define EMO1  "Happiness"
#define EMO2  "Joy"
#define EMO3  "Disgust"
#define EMO4  "Contempt"
#define EMO5  "Sadness"
#define EMO6  "Despair"
#define EMO7  "Worry"
#define EMO8  "Fear"
#define EMO9  "ColdAnger"
#define EMO10 "HotAnger"
#define EMO11 "Boredom"
#define EMO12 "Shame"
#define EMO13 "Guilt"
#define EMO14 "Pride"
#define EMO15 "Neutral"

The "Facial_Expression_Feature_Generation" class has a vector of SECs as input and a list of AUs as output. Optionally, we can also get the absolute vector of Emotions and the intermediary emotion recognized. In the constructor of the "Facial_Expression_Feature_Generation" class, we call a function "fill_Emotions_vect" that fills in two vectors of Emotions with minimum intensities and maximum intensities, according to the previous table and the following definitions we fixed.

Table 5: Intensity levels for the SECs

Name | Intensity min | Intensity max
Very low | -100 | -60
Low | -60 | -20
Medium | -20 | 20
High | 20 | 60
Very high | 60 | 100
Open | -100 | 100
Chance | -100 | -30
Negligence | -30 | 30
Intentional | 30 | 100
Consonant | 0 | 100
Dissonant | -100 | 0
Obstruct | -100 | -100
Natural | -100 | -30
Other | -30 | 30
Self | 30 | 100

A first function computes the intermediary recognized emotion by comparing the input SECs message with the NBEMOTIONS emotions. The emotion that has the most SEC intensities matching the input is chosen (we only use the SECs that have been recognized by the Video Feature Recognition). The choice is computed with a likelihood: the input intensity is inside, near or far from the emotion intensity interval. Actually, the distance between the input SEC intensity value and the interval of each emotion is computed: the distance is equal to zero if the input intensity is inside the interval, and equal to the distance from the input intensity to the interval mean otherwise. We then sum these values over the SECs of each emotion and find the emotion that has the minimum value. We could have used a less sensitive absolute likelihood, not considering a distance to the interval but a boolean test (inside or not). It would have been easier and less expensive, but it would not have allowed us to rectify the output in the case illustrated below: the input values are much closer to the values of emotion 2 than to those of emotion 1. With our method, we actually choose emotion 2 and not emotion 1. However, one drawback still remains, as this gives a considerable advantage to emotions that have lots of "open" values. A future work may be to find a solution to this.

Fig. 7: Illustration of an emotion recognition issue
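The distance-to-interval scoring described above can be sketched as follows (the data layout is simplified and the function names are our own):

#include <cmath>

struct SEC_interval { int min; int max; };   // one SEC of an emotion prototype (Table 5 values)

// Distance of one input SEC intensity to the emotion's interval for that SEC.
double sec_distance(int input_intensity, SEC_interval interval)
{
    if (input_intensity >= interval.min && input_intensity <= interval.max)
        return 0.0;                                    // inside the interval
    double mean = (interval.min + interval.max) / 2.0;
    return fabs(input_intensity - mean);               // distance to the interval mean
}

// Total score of one emotion: sum over the recognized SECs only.
// The caller picks the emotion with the minimum total score.
double emotion_score(const int* input_intensity, const bool* recognized,
                     const SEC_interval* emotion, int nb_secs)
{
    double score = 0.0;
    for (int i = 0; i < nb_secs; ++i)
        if (recognized[i])
            score += sec_distance(input_intensity[i], emotion[i]);
    return score;
}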

Another function computes the output AU list. Actually, it reads a text file that gives a list of AUs, intensities, likelihoods and times that may correspond to an emotion. The fact is that, as we saw in a lab session using Cherry, the results are not very good: the expressions are not believable. Some of the AUs are too strong and some others have no reason to be expressed for this emotion. For instance, we obtain a kind of smile within the expression of fear, which is not believable at all. Currently, Cherry is controlled with an "emotion" information, and that is why we did not focus on this part and only wrote typical but not necessarily believable AU lists into the text files. Finally, we call the function "read_Emotion_to_AUs" that fills in the output AUs list message.

4.5.7 Cherry Commands:

Goal: AUs → emotion → commands

The Cherry commands are currently expressed in terms of emotions. That is why we need to translate the AU set back into an emotion. We still want an AU-type input in order not to lose generality for other functions and modalities that could be implemented later. In the Facial Expression Feature Generation module, the recognized emotion is written into a .txt file, which is then read by the Cherry Commands module. That "trick" is totally transparent for someone who does not look at the code in detail. The command is a CString containing the emotion name. It is sent through a Text_Message.

4.6 Classes for the special modules:

4.6.1 Profile:

• Profile:

We implement a general class "Profile" so that both the User Profile and the Agent Profile can inherit from it and have their own specificities. This class defines the two kinds of information data we look at: the static information and the current affective state information. The "Static Information" is implemented as a structure of the following form:

struct {
    CString FirstName;
    CString FamilyName;
    int Age;
    bool Gender;   // 0 for female, 1 for male
    CString Personality;
};

The Current Affective State Information is a vector of SEC messages (SECs_vect_Message).



• User Profile: inherits from "Profile".

We implement "User_Profile" as a specific class because we decided that its Static Information comes from a predefined .txt file, and we do not yet know how it will be handled for the Agent Profile. The text file stores the following information:

# User Profile
First name: Agnes
Family name: Abastado
# Date of birth: d/m/y
Date of birth: 17/12/1984
Gender: F
Personality: unknown

The User Profile class has a function called "compute_SI_age" that computes the age from the date of birth. Thanks to this we only store an integer, which is less expensive.
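
A possible implementation of such an age computation is sketched below; this is an illustration rather than the project's actual "compute_SI_age", and it assumes the d/m/y convention of the profile file above:

#include <ctime>

// Age in whole years from a d/m/y date of birth, using the system clock.
int computeAgeFromBirthDate(int day, int month, int year)
{
    std::time_t now = std::time(0);
    std::tm* t = std::localtime(&now);
    int age = (t->tm_year + 1900) - year;
    if (t->tm_mon + 1 < month || (t->tm_mon + 1 == month && t->tm_mday < day))
        --age;  // birthday not yet reached this year
    return age;
}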


Fig. 6: User Profile implementation

4.6.2 Affective Social Cognitive Architecture:
The Affective Information module is the one that concentrates the "intelligent" part of the whole system. We do not stress this part here. It has a function called "do_mirroring_AffState" that simply transfers the data. In our work, the only functionality of this module is to be transparent.

4.7 Classes for the fusion modules:
4.7.1 Feature Level Fusion: AUs and Texts
This step is the intermediary between the first Signal Processing step and the Feature Recognition step. We designed a class called "Feature_Level_Fusion" that takes three messages as input: an "AUs_list_Message" coming from the video signal processing, and two "Text_Message"s coming from the Speech Recognition and the Text Capture. In our project, this fusion level only transmits the AUs list to the next step. The only output we considered (reachable with the function "get_AUs_Message") is the list "AUs_Message_Out", which is equal to the input "AUs_Message_Video".
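
A minimal sketch of this pass-through behaviour is given below; the message types are reduced to plain strings and all names except those quoted above are illustrative:

#include <string>

typedef std::string AUsMessage;   // placeholder message types for the sketch
typedef std::string TextMessage;

class FeatureLevelFusion {
public:
    void setInputs(const AUsMessage& ausFromVideo,
                   const TextMessage& /*speechText*/,
                   const TextMessage& /*keyboardText*/)
    {
        ausOut_ = ausFromVideo;   // current behaviour: forward the AUs list unchanged
    }
    const AUsMessage& getAUsMessage() const { return ausOut_; }
private:
    AUsMessage ausOut_;
};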

4.7.2 Fusion of Affective Cues: SECs
The fusion of affective cues merges the information from the incoming "SECs_vect_Message"s into a new "SECs_vect_Message". We consider two inputs of type "SECs_vect_Message", one coming from the video feature extraction line and one possibly coming from the Linguistic Emotion Recognition module. In order to control the relative importance of each recognition process, we added two private variables called "r_Video" and "r_Linguistic". They are floats between 0 and 1. In our case, as we do not yet consider the linguistic input, we set r_Linguistic = 0 and r_Video = 1. The fusion function combines the likelihoods of the SECs vectors weighted by these ratios, depending on the influence of each input, and computes the new average intensity. As the intensities are stored as characters, we need to convert them into integers, do the computation with the ratios, and convert back into a character. Finally, we build the "SECs_vect_Message" SECs_Message_Out, which is reachable with the function "get_SECs_Message".
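
The ratio-weighted fusion can be sketched as follows; the SEC record layout and the character encoding of the intensities ('0'..'5') are assumptions made for the illustration, not the project's actual types:

#include <cstddef>
#include <vector>

struct Sec {
    char  intensity;    // e.g. '0'..'5' stored as a character
    float likelihood;   // between 0 and 1
};

// rVideo and rLinguistic are the influence ratios, each between 0 and 1.
std::vector<Sec> fuseSecs(const std::vector<Sec>& video,
                          const std::vector<Sec>& linguistic,
                          float rVideo, float rLinguistic)
{
    std::vector<Sec> out(video.size());
    for (std::size_t i = 0; i < video.size(); ++i) {
        // Convert the character intensities to integers before averaging.
        int iv = video[i].intensity - '0';
        int il = linguistic[i].intensity - '0';
        float avg = rVideo * iv + rLinguistic * il;
        out[i].intensity  = static_cast<char>('0' + static_cast<int>(avg + 0.5f));
        out[i].likelihood = rVideo * video[i].likelihood
                          + rLinguistic * linguistic[i].likelihood;
    }
    return out;
}

With r_Video = 1 and r_Linguistic = 0, the output is simply a copy of the video input, which matches the current behaviour.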

4.7.3 Fission of Affective Cues: SECs
The fission of affective cues is a transparent module, as we chose a fixed structure for the SECs and have only one input. This module transmits the "SECs_vect_Message" from the Affective Information module to the Facial Expression Feature Generation module.

4.7.4 Platform Selector: AUs
The Platform Selector is a transparent module, as we only work with Cherry in our project and do not directly use the AUs list to command Cherry's expressions.


4.8 Class MAUI.cpp: the brain of the software
Now that we have implemented all the modules, fusions and messages, we have to build a tool that lets them interact and controls the data flow. We chose to go through an external class instead of linking all the modules and fusion classes directly, in order to provide a clearer view of what happens and to allow easier modification and addition of modules. The MAUI class instantiates an element of each fusion and module class and controls the message transmission between them. We implemented three functions to transmit the data flow: "transmit_Text_Mess", "transmit_AUs_Mess" and "transmit_SECs_Mess", one for each type of message. These functions take as a parameter the name of the module from which the data are sent, and they call the "set" function of the input message on the next module in the chain. Nearly all the module and fusion classes (in fact, all except the chain beginners) have a "set" function that sets the value of their input messages, performs the computations, and sends the output on to the next step via MAUI. The MAUI class also controls some elements of the interface: when new messages are transmitted, it updates the interface windows. Currently, three elements are updated by MAUI: the AUs buffer list, the SECs values as input and output of the Video Feature Recognition module, and the command to Cherry. Finally, the MAUI class has the main function, called "run_MAUI", which needs to be called from the interface to launch the whole simulation.
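
The routing idea can be reduced to the following sketch; the class and method names are placeholders (only the transmit/set pattern described above is kept), not the real MAUI members:

#include <iostream>
#include <string>

struct FeatureLevelFusion {
    void setTextMessage(const std::string& text)
    {
        std::cout << "[fusion] received text: " << text << std::endl;
        // ... the fusion computation would run here and call back into MAUI ...
    }
};

class Maui {
public:
    // Called by a producing module (e.g. the Text Capture) when it has new output.
    void transmitTextMess(const std::string& fromModule, const std::string& text)
    {
        if (fromModule == "TextCapture")
            fusion_.setTextMessage(text);   // route to the next module of the chain
        // ... other sources are routed to their own successors ...
    }
private:
    FeatureLevelFusion fusion_;
};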


5. THE GRAPHIC INTERFACE
5.1. Motivation:
In order to make the software as pleasant as possible for the user, we decided to design a graphical interface. Our first idea was to show the architecture of the MAUI system we implemented in a graphical window on the screen. By clicking on each module, the user gets information about it in a new dialog box. Thus, the user can better understand how the MAUI system works and what the goal of each module is. We then improved this tool to let the user control the software from the interface. The main advantage is to give the user an intuitive way of controlling the software and to let him really understand, step by step, how the process works from the inputs (keyboard, webcam and microphone) to the outputs (the Cherry avatar in our case and, later, the iCat). Finally, the design of the interface has to mix buttons (active elements) with static elements (drawings). In addition, the interface has to exchange data with the rest of the code and to be updated regularly, at the right time, to give valid information to the user.


Fig. 8: The MAUI graphic interface


5.2. Design and functionalities:
The design reproduces the MAUI architecture with all the elements we implemented; as such, our interface is block based. We can split the functionalities of the interface into two parts: the information part and the command part. The first only gives the user information about each block; the second allows control of the system.

5.2.1 The information functionality:
When the user clicks on a module, a dialog box is opened and the user has access to information about the module (inputs, outputs, its role in the architecture, and how it works in our project). For this part, we chose to use modal dialog boxes. This kind of box has a particularity: while it is open, the user cannot interact with the rest of the software and has to close the window before doing anything else. A typical example is the dialog that appears when you want to open a file. Two main reasons motivated this choice. First, these windows are very easy to implement: there is no need to create a new class, we just have to call a function which creates the new window. Second, and this is the main reason, we do not want the user to open many windows while the software is running, both for aesthetic reasons and to avoid slowing the software down with useless windows.
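
A minimal MFC sketch of such a modal information box is shown below; it assumes a dialog resource IDD_MODULE_INFO exists in the project's resource script, and the identifier names are illustrative:

#include <afxwin.h>
#include "resource.h"   // assumed to define IDD_MODULE_INFO

void ShowModuleInfo(CWnd* pParent)
{
    CDialog dlg(IDD_MODULE_INFO, pParent);
    dlg.DoModal();       // blocks until the user closes the information box
}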

5.2.2 The controlling functionality:
In the interface, next to some of these dialog boxes, another button is available where relevant. These buttons give access to the real-time information going through the corresponding module, or directly control a part of the software. For example, we launch the video recognition by clicking on the appropriate button: the user launches the video he wants and the simulation starts. The user can then click on the "AU and SEC" button to access the recognized AUs and SECs, and on "launch Cherry" to see the avatar. Ideally, this interface would also allow the user to adapt the value of several parameters, such as the recognition time for the mapping from AUs to SECs.


Fig. 7: The Video Player Window

Fig. 9: Cherry avatar window

Fig. 10: The text capture interface

Fig. 8: The User Profile Interface


Fig. 11: Video Feature Recognition AUs and SECs window


5.3. Implementation:
The main difficulty of the implementation was to merge the code of the interface with the rest of the project. In order to keep the code adaptable, each window that can be opened by the user (more accurately, each dialog box) is associated with a class derived from CDialog. This class provides a set of options for the window and powerful dialog box management. The implementation of the two kinds of windows is very different. For modal windows, we used the function DoModal. The implementation only needs the ID of the window; we do not have to care about its handle, which makes the implementation very efficient and easy. The class also allows updating the data of the dialog box: a data map is generated to automatically handle the exchange of data between the member variables and the dialog box's controls. The data map provides functions that initialize the controls in the dialog box with the proper values, retrieve the data, and validate the data. All of these functions are needed in our case. The implementation of the modeless windows was more difficult. Here, we needed to create a specific class for each window, and since several dialog boxes of this kind can be displayed at the same time, we need to manage the handles of the different windows.
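
The data-map mechanism can be sketched as follows, assuming a dialog with a single edit control IDC_EMOTION_NAME; the class, resource and member names are illustrative, not the project's:

#include <afxwin.h>
#include "resource.h"   // assumed to define IDD_AUS_SECS and IDC_EMOTION_NAME

class CAusSecsDlg : public CDialog
{
public:
    CAusSecsDlg(CWnd* pParent = NULL) : CDialog(IDD_AUS_SECS, pParent) {}
    CString m_emotionName;                 // member variable mapped to the control
protected:
    virtual void DoDataExchange(CDataExchange* pDX)
    {
        CDialog::DoDataExchange(pDX);
        // The data map: copies between m_emotionName and the edit control,
        // in both directions depending on UpdateData(TRUE/FALSE).
        DDX_Text(pDX, IDC_EMOTION_NAME, m_emotionName);
    }
};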


6. THE SIMULATION
6.1. How to launch the simulation?
1. Launch the MAUI_2007 executable file. The main interface appears.
2. You can now get information about each module by clicking on the one you are interested in. Close the information window.
3. If you want, you can load a User Profile:
• Click on "load user profile".
• A new window appears; choose the user in the combo box.
• Information about him appears in the User Profile window.
4. You can open the windows you want to be visible during the simulation: the User Profile window, the Cherry window, the AUs and SECs window, and the Text Capture window that allows the user to enter text.
5. Then, launch the simulation:
• Click on "Launch video" within the main window: the video window appears.
• Click on "Launch video": the video called "angry.avi", located in the directory "Simulation Video", is launched.
• The simulation is running.
6. You can launch it again as many times as you want.

6.2. The simulation scenario:
The video we have at our disposal is a video of a man acting out anger.


6.3. The results:
The simulation works well in terms of pure efficiency and timing. The data flow transits perfectly and is correctly computed at each step, and the time management is correct. The interface is pleasant and fluid while updating the information, even on a modest computer, and we can rerun the simulation at will. Unfortunately, the result in terms of recognized emotions is poor: the algorithm recognizes joy and guilt instead of anger (cold or hot anger). This can be a consequence of several factors:
• The AUs list was annotated manually: some AUs may have been missed and the intensities are uncertain.
• Only five SECs are recognized at the feature level, which considerably reduces the emotion recognition performance, since the underlying theory uses the thirteen SECs with all their variants. In addition, the interval management used to translate the emotion descriptions into SEC intensities may not be optimal.
• Finally, we compute a single emotion every 200 ms, which is not a valid intermediary, as ideally we would not go through this representation before re-translating into AUs.
Another remaining issue is a synchronisation problem between the video and the launch of the recognition (the reading of the AUs). It is quite hard to see where the problem lies, because the timings observed while debugging are not the same as when the simulation is running. In any case, in the future the video should not be able to send the AUs before the corresponding image. We suggest some improvements in the future work chapter that may lead to better results. Nevertheless, we would like to remind the reader that the computation of an intermediary emotion is not the goal of the project, and in fact not that of MAUI, which ideally stays at the "SECs" representation level. It is only because of the currently available command for Cherry that we stress this part.


7. FUTURE WORK
6.4. Reminder of the project objectives:
Our project aimed to design a MAUI architecture in agreement with the MAUI paradigm, several theories of emotion, and the current state of the art in the affective computing and recognition fields. This architecture had to be developed and implemented as far as our work and current research allow, with emphasis on the video recognition chain. We stressed the video input and the avatar Cherry as output, developing all the intermediate steps and implementing almost all the modules around this main computational chain. The final goal was a flexible and adaptable architecture, so that the final software could easily be adapted or completed with future research and code. We also had to provide a simple demonstration scenario with a video and its corresponding recognized AUs text file.

6.5. Future work on the implemented algorithms
We implemented two major algorithms. The first translates AUs into SECs and appears in the Video Feature Recognition module. In this algorithm, we chose an intermediary representation of the SECs as boolean vectors whose indices represent the AU numbers. This representation is not optimal in general. For our application it is the best one, as it has the lowest memory footprint (a boolean is coded on a single bit), but it is restrictive: in our case, the likelihood of the AUs is always 100%, as we annotated them manually for a short pre-recorded video. In future work, AU recognition will be implemented and added to the MAUI architecture within the Video Signal Processing module; as a consequence, this likelihood (or level of recognition) will be computed and its values will vary between 0 and 100%. We would then have to take this into account in the computation of the resulting SEC likelihood. This could simply be done by using integers between 0 and 100 for the input AUs list vector and the "input(and)SEC" vector (a small sketch of this extension is given below), but as it was not needed for our work, we left our algorithm as it is, i.e. as simple and cheap as possible. The second major algorithm translates SECs into an emotion and is implemented in the Facial Expression Feature Generation module. One defect remains in the current likelihood computation, as it gives a considerable advantage to emotions that have many "open" values; finding a solution to this is left as future work. Another point is that the emotion is not always efficiently recognized, since Scherer's theory does not specify the relative importance of the SECs. We might have obtained better results if we could have stated that intrinsic pleasantness is one of the most important criteria in emotion determination, at least more important than power control. Nevertheless, this part may change radically, as it may no longer compute the AUs output list through the intermediate emotion; these issues would then no longer be a problem.
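
The extension mentioned above could look like the following sketch, which replaces the boolean AU vectors by integer likelihoods between 0 and 100; the names and the averaging rule are illustrative assumptions, not the project's code:

#include <cstddef>
#include <vector>

// auLikelihood[i] : recognized likelihood (0..100) of AU number i, 0 if absent.
// secTemplate[i]  : 100 if AU i belongs to the SEC's AU set, 0 otherwise.
// Returns a SEC likelihood between 0 and 100 (average of the matched AUs'
// likelihoods), generalizing the boolean "input (and) SEC" test.
int secLikelihood(const std::vector<int>& auLikelihood,
                  const std::vector<int>& secTemplate)
{
    int sum = 0, count = 0;
    for (std::size_t i = 0; i < secTemplate.size(); ++i) {
        if (secTemplate[i] > 0) {
            sum += auLikelihood[i];
            ++count;
        }
    }
    return count ? sum / count : 0;
}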

6.6. Future work on the recognition engines
As explained before, we did not implement the speech recognition engine, because the code we would have added (calling the Microsoft Recognition engine) significantly limits portability and currently gives very poor recognition results. However, we built the skeleton of the module, and future work will only consist in adding the actual code in the function "recognize_Speech", which sets up the private variable "speech_in". The Video Recognition is one of the crucial parts to improve. First, the ideal MAUI project would recognize the AUs directly from the video through automatic facial AU recognition. Research in that direction is making important progress, notably thanks to the use of neural networks for feature recognition, which gives hope for a valid implementation soon. This improvement will not require fundamental changes in the code: the function "fill_AUs_list" of the Video Signal Processing module will have to be implemented, replacing the current "read_AUsfile" call. Secondly, the real MAUI input is a webcam, which means the engine will have to perform image and video processing such as face tracking, and to perform real-time recognition. Some modifications at the interface level will also be needed in order to replace the video file with the webcam. The Text Capture module can also be improved by collecting the text, the commands and the haptic signal directly from the keyboard and mouse. We did not implement the Linguistic Feature Recognition, so this part also remains. It consists in using linguistic tools to analyze the text coming from both the speech and the keyboard and extracting emotional information in terms of SECs (words from specific lexical fields can help the emotion recognition process). Finally, the use of the ANS signal for emotion recognition can be implemented; Olivier Villon of the Affective Computing Group at Eurécom works in this direction and has conclusive results. As it was not the purpose of our project, we did not consider this part, but one can easily add new classes for the ANS Signal Processing and ANS Feature Recognition.

6.7. Future work on the output management
The MAUI aims to perform the emotion generation through the SEC structure representation, which would be directly translated into features such as AUs, speech, text, etc. This implies two improvements: first, the translation from SECs to AUs needs to be improved, as it is currently quite limited and gives very poor results (as we saw in the second lab session of the Affective Computing course). Second, the Cherry commands have to be changed to allow controlling Cherry's emotional expression directly with these AUs. We did not handle the iCat control, and one piece of future work will be to add the iCat output and manage the generation of its control commands. The Platform Selector has to be adapted so that its variable "choice" can be controlled from the interface; it is equal to false by default, corresponding to the Cherry choice.


Another significant improvement would be the generation of other features such as voice, onomatopoeia (exclamations and sighs), text, body motion, etc. It would considerably increase the believability of the expressions and add control over the agent behaviour generation.

6.8. Future work on the intelligent modules
In any case, much more work needs to be done, and future work should also develop the fundamental intelligent modules described in the architecture, with several affective and reasoning functions. In our framework, this could be achieved by creating an Agent Profile (inheriting from the Profile class) and the Affective Social Cognitive Architecture, coded to be the "brain" of the MAUI system. The idea would be to implement several scenarios, then allow training and improvement of the "agent knowledge", and finally create an intelligent agent able to interact with humans through a sensor interface.


8. CONCLUSION
In general, much work remains to be done, but we hope that in the not too distant future this project will evolve into software able to replicate and simulate human behaviour in a realistic, emotional, theory-founded and perhaps unexpected way. With this project, we have begun to develop a framework for the implementation of the Multimodal Affective User Interface. With this interface, the user can communicate with an active agent via video recognition. The code and the structure of the architecture are highly adaptable; we tried to make the code as clear as possible, with many comments, to allow its future development. Any developer should be able to bring improvements to our project; the first one would probably be to allow emotion recognition from speech. Our work was of particular interest to us. We worked only on a part of the MAUI, but we had to implement a whole processing chain, so we worked from the beginning to the end of the process. It was a great satisfaction to see the result: watching the recognition of an emotion from a video. We do want to point out that the implementation with Microsoft Visual C++ has been difficult. We met many problems, particularly in terms of portability: with the same version of the software, on different computers, we had errors appearing during compilation. In the future, we think that using another development environment would be relevant. To conclude, we especially want to thank Marco Paleari for his help during the project. He was always available to help us in our work, solve the problems we were faced with, and lead us in the right direction to reach our final target. He spent a lot of time with us; without him, we would probably never have achieved such a result.


9. TENTATIVE SCHEDULE
Below is the tentative schedule throughout the semester:

Week 1: 26/10

Perform state-of-art research on topic(s). Start an annotated bibliography. Do the schedule.

Week 2: 2/11

Define general architecture. Finish the annotated bibliography. C++ "theory".

Week 3: 9/11

Compile annotated bibliography into state-of-the-art synthesis. C++ exercises. Start the report. Refine the architecture.

16/11

Holidays (from 11/11 to 19/11)

Week 4: 23/11

Implement the “step” classes. Start the second part of the report.

Week 5: 30/11

Implement the “step” classes.

Week 6: 7/12

Implement the “sub-classes”.

Week 7: 14/12

Implement the “sub-classes”.

Week 8: 21/12

Finish the second part of the report. Finish the general skeleton.


21/12

Holidays (from 21/12 to 3/01)

Week 9: 4/01

Fill the defined classes.

Week 10: 11/01

Fill the defined classes. Feedback on the second part of the report.

Week 11: 18/01

Design the avatar interface. Start the third part of the report.

Week 12: 25/01

Implement the interface.

Week 13: 1/02

Implement the interface.

Week 14: 8/02

Test the interface. Finish the report.

Week 15: 15/02

Prepare the presentation.

Week 16: 22/02

Presentation.


10. WEEKLY STUDENT SCHEDULE

EVEN WEEKS (starting 20/11/2006): per-day timetable (Monday to Friday, morning and afternoon slots) for Agnes and Damien, covering the courses ManagIntro, AffComp, MobCom, MobServ, ImCompress and Languages (Spanish, English).

ODD WEEKS (starting 27/11/2006): per-day timetable for Agnes and Damien, covering the courses Property, ManagIntro, MMIR, MobCom, MobServ, ImCompress and Languages (Spanish, English).

11. REFERENCES

[1] A. Grizard, M. Paleari and C. L. Lisetti. Une théorie psychologique fondée sur les expressions faciales adaptée à deux plateformes différentes. In WACA 2006, Deuxième Workshop sur les Agents Conversationnels Animés, Toulouse, October 2006.
[2] C. L. Lisetti and F. Nasoz. MAUI: a multimodal affective user interface. In Proceedings of the ACM Multimedia International Conference 2002, Juan les Pins, December 2002.
[3] C. L. Lisetti and P. Gmytrasiewicz. Can a rational agent afford to be affectless? A formal approach. In Applied Artificial Intelligence, pages 577-609. University of Central Florida, Orlando, US, 2002.
[4] M. Paleari and C. L. Lisetti. Toward Multimodal Fusion of Affective Cues. In Proceedings of the 1st Workshop on Human-Centered Multimedia at ACM Multimedia, Santa Barbara, California, US, October 2006.
[5] K. R. Scherer. Appraisal processes in emotion: Theory, methods, research, chapter Appraisal Considered as a Process of Multilevel Sequential Checking, pages 92-120. New York, NY, US: Oxford University Press, 2001.
[6] O. Villon and C. Lisetti. A user-modelling approach to build user's psycho-physiological maps of emotions using bio-sensors. In IEEE RO-MAN 2006, 15th IEEE International Symposium on Robot and Human Interactive Communication, Session Emotional Cues in Human-Robot Interaction, Hatfield, United Kingdom, September 2006.


12. ANNEXES

12.1 How can one add an AU?
In order to add an AU, the developer has to update the constant NBAUS, defined in two files: Video_Feature_Recognition.h(19) and StdAfx.h(98). Currently, NBAUS equals 64 (63 AUs plus the default value 0 used for initialization).

12.2 How can one add a SEC?
First, the developer has to update the constant NBSECS, defined in StdAfx.h(99), in SECs_vect_Message.h(16) and in Facial_Expression_Feature_Generation.h(17); currently NBSECS = 16. Then one has to define the string constant for the new SEC (for instance, #define SEC16 "External standards compatibility") in the StdAfx.h file. The data files must also be changed: the emotion files in the "emotions" directory (add the interval values for the new SEC for each emotion) and additional lines in AUs_to_SECs.txt in the AUs_SECs directory. If it is a SEC that can be used for the mapping (i.e. recognized from the video AUs), one also has to modify the value of NBSECSVIDEO, which corresponds to the number of AU sets that are recognized, i.e. the number of significant lines in AUs_to_SECs.txt (currently equal to 12, defined in Video_Feature_Recognition.h(20) and StdAfx.h(177)). Finally, the developer has to adapt the reading of these files, which occurs in Video_Feature_Recognition.h and in Facial_Expression_Feature_Generation.h, and pay particular attention to the implementation of the emotion vectors in this last class.

12.3 How can one add an emotion?


First, the developer has to update the constant NBEMOTIONS, defined in both StdAfx.h(137) and Facial_Expression_Feature_Generation.h(18) and currently equal to 15. One then has to create a new string constant for the emotion (for instance, #define EMO15 "Neutral") in StdAfx.h. The creation of two data files is also essential: the emotion file "Your_Emotion.txt" in the "emotions" directory, which must contain the min and max values of the interval for each SEC, and the file "YourEmotion_to_AUs.txt", which enumerates the AUs that theoretically describe this emotion. One should pay particular attention to the emotion vectors in Facial_Expression_Feature_Generation.h and add the required lines. Finally, additional edit boxes need to be added to the interface (AUs and SECs window) and their display has to be updated in MAUI.cpp.
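
For illustration only, the header edit could look like the excerpt below; the values shown follow the text above and must of course be kept consistent with the existing constants in the project:

// StdAfx.h (illustrative excerpt)
#define NBEMOTIONS 16          // incremented once the new emotion is added
#define EMO15      "Neutral"   // string constant for the new emotion

// In addition (not shown here): create emotions/Your_Emotion.txt with the min and
// max interval values of each SEC, and YourEmotion_to_AUs.txt with the AUs that
// describe the new emotion.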

12.4 How can one add a User Profile?
A User Profile can simply be added by creating a text file in the corresponding data directory (FirstName.txt in "User Profiles", following the same layout of the user's information) and by adding an entry to the interface combo box.


12.5 Articles summaries:
K. R. Scherer. Appraisal processes in emotion: Theory, methods, research, chapter Appraisal Considered as a Process of Multilevel Sequential Checking, pages 92-120. New York, NY, US: Oxford University Press, 2001.
In this chapter, K. R. Scherer develops his sequential check theory of emotion differentiation. According to him, emotions can be accurately recognized by analyzing the results of a sequence of stimulus evaluation checks (SECs). The SECs are the minimal number of binary criteria that allow us to differentiate emotional subjective states. Emotion appraisal serves five distinctive functions associated with five organismic subsystems (information processing, support, executive, action and monitor). The SECs are organized and processed in sequence, consisting of four stages called the appraisal objectives, viz:
- the event relevance (which takes into account novelty, i.e. suddenness, familiarity and predictability, intrinsic pleasantness, and relevance to the momentary goals or needs),
- the event implications and consequences (which take into account whether an agent was responsible for the event or not, the individual's estimate of the event's probability, the discrepancy with the individual's expectations, the conduciveness to the current goals, and finally the urgency in terms of the priority of the goals/needs),
- the coping potential (knowing the probability that the event can be influenced by an agent, the likelihood that this agent is able to influence a potentially controllable event, and whether the agent can adjust to or live with the consequences of this event),
- the normative significance (which refers to the personal values of the individual and his perception of the norms and standards of the society in which he lives).
However, appraisal requires multiple rounds of processing at three different levels: the sensory-motor level, occurring at the level of innate features and reflex systems; the schematic level, based on the learning history of the individual; and the conceptual level, involving memory storage. K. R. Scherer then develops the componential patterning theory, stating that, since the subsystems involved in the SEC sequence are interdependent, the outcome of each SEC modifies the states of all other organismic subsystems. Thus, the prediction of specific emotion profiles is conceivable from the antecedent emotional situations. Finally, K. R. Scherer ends the chapter by presenting the empirical evidence for his sequential check theory of emotion differentiation.

C. L. Lisetti and F. Nasoz. MAUI: a multimodal affective user interface. In Proceedings of the ACM Multimedia International Conference 2002, Juan les Pins, December 2002.
In this article, Christine L. Lisetti and Fatma Nasoz defend the role of both affect and emotion in the cognition process and propose an adaptive system architecture. The authors first present the affect-cognition interface: three levels of inputs, known as the sensory inputs (VKA model), the cognitive inputs (involving memory phenomena) and the biological inputs (for instance, hormonal mechanisms or drug consumption), give rise to emotions. The emotion generation is then associated with three phenomena: autonomic nervous system (ANS) arousal, expression and subjective experience. The authors identify a number of phenomena governing the affect-cognition interaction. Christine L. Lisetti and Fatma Nasoz propose an architecture for the MAUI. The inputs are the visual, kinesthetic and auditory (VKA) signals as well as the subjective experience via linguistic tools (L). The goal of this interface is to find the most probable emotion of the user, which can then be used to provide an appropriate response. The process is developed in three steps: the affect perception and recognition (which analyses the VKAL signals), the affect prediction and generation, and finally the affect expression.

The authors describe three portions of the whole system:
- The first results on facial expression recognition, computed with a one-hidden-layer neural network. Using the E-FERET face database for training, they found that zooming in on three areas of the face is much more efficient for expression recognition than processing the full face.
- They designed a multicultural avatar (Haptek) in order to offer great diversity in facial properties and background scenery. They also built user profiles including the user's favourite avatar. This avatar is already able to mirror the user's facial expression, giving feedback to the user.
- They presented the wireless device they use for the ANS: the SenseWear Armband, which can measure various data such as galvanic skin response, heat flow or movement in "real life" situations.
They conclude by presenting the future of this project and their expectations in terms of human-computer interaction.

C. L. Lisetti and P. Gmytrasiewicz. Can a rational agent afford to be affectless? A formal approach. In Applied Artificial Intelligence, pages 577-609. University of Central Florida, Orlando, US, 2002.
In this article, the authors present the classical approach to agent modelling and propose a formal one relying on the fact that rationality and emotion are linked. They then describe a scheme to represent and classify emotions. In neo-classical theories, rationalism is the opposite of emotion: influenced by the Cartesian tradition, body and mind are independent, and this view implies suppressing emotion in order to be fully reasonable. But it does not correspond to the real world, where people can be both rational and emotional. That is why the authors define a new kind of rationalism, called rationality, which takes into account the link between rationalism and emotion.

This new notion leads to two views. In the first, rationality requires emotional guidance: rationality and emotion are clearly two different properties but converge toward a single aim; pure reason cannot exist, as it is always influenced. In the other view, rationality and emotion form a continuum: emotions give a feeling which helps us behave rationally, choices are based on beliefs stemming from emotional experience, and emotions affect the action once the choice has been made and are socially contagious. In short, emotions enlarge rationality. The advantage of this approach is that it predicts how humans should behave according to the theory, not merely how they do behave. These new views imply more complexity in the nature of emotion. That is why the authors propose a scheme to represent it formally, called AKR (Affective Knowledge Representation). This model includes affect, mood, emotion and personality, and a framework for emotional state dynamics using Markov models, taking into account the fact that people act rationally to maximize positive emotional experiences. Using a Markov model helps the agent predict a future emotional state; one of the advantages is to have a finite set of possible emotions for a given state. Emotion is divided into several components: ANS, facial expression and subjective experience. This approach offers the advantage of representing emotion according to a finite set of criteria, which allows an easier classification.

M. Paleari and C. L. Lisetti. Toward Multimodal Fusion of Affective Cues. In Proceedings of the 1st Workshop on Human-Centered Multimedia at ACM Multimedia, Santa Barbara, California, US, October 2006.
The main part of the information in face-to-face communication is conveyed through paralanguage signals. There are three areas in which emotion recognition has been developed, and different models (i.e. sets of features) have been developed. ASIA is based on BDI+E (Belief, Desire, Intention + Emotion). Affective information flows from the user to an agent, which interprets those signals; as a consequence, its decisions influence the user and the environment. Scherer's theory is the theory best linked to the MAUI framework. Indeed, it considers emotions on three levels and gives guidelines for developing both generation and recognition. It introduces the notion of SEC (Stimulus Evaluation Check): SECs are parameters which allow a sequential evaluation, chosen to represent the minimum set of dimensions. Moreover, Scherer's theory identifies five functions which justify the need for emotion in human beings. The advantage of the sequential approach is the reduction of processing, and the process used for recognition in the CPT can also be used for generation. The aim is to develop a modular system which allows the addition of new inputs. Multimodal fusion is done on three levels, which provides a good amount of information. However, the fusion algorithm suffers from a lack of dynamism; one solution is the "bufferized" approach, which solves the real-time problem (e.g. for the ANS signal) and gives a more accurate estimate of the affective state. The algorithms have to be able to check the dimensions of the buffers in order to evaluate different kinds of emotions. It is also possible to design unimodal emotion recognition systems (and link them to obtain better information); the algorithms then take the new recognition systems into account and use them to improve the estimations. Scherer's model is detailed, with its three-level CPT model, and links recognition and emotion. To sum up, the inputs of the system are the audio, video and ANS signals; the SEC chains can be seen as the outputs. Three operations lead from the inputs to the outputs: fusion, the algorithm, and two chains working in parallel.

A. Grizard, M. Paleari and C. L. Lisetti. Une théorie psychologique fondée sur les expressions faciales adaptée à deux plateformes différentes. In WACA 2006, Deuxième Workshop sur les Agents Conversationnels Animés, Toulouse, October 2006.


In this article, the authors present their results on improving the believability of facial expressions on two different platforms, using K. R. Scherer's theory of emotion appraisal and generation. Interactions between humans and intelligent social agents involve two aspects: the expression of emotions (the external state) and the modelling of emotions (the internal state). By applying Scherer's theory, the authors aim to make facial expressions more believable by linking them to the agents' internal states. Social agents, involved in social interactions, can be of two kinds:
- virtual, then often implemented as ACAs (Agents Conversationnels Animés, i.e. animated conversational agents),
- robotic, themselves either zoomorphic (often animal-looking toys) or anthropomorphic (dedicated to social interactions).
After presenting the main points of Scherer's theory, the authors explain how it makes it possible to predict emotional facial responses, defined in terms of AUs (Action Units: the position of each facial muscle during an expression). The emotion generation process can be automated: the generation of emotional facial expressions consists of a combination of AUs in response to the SECs, together with intensity predictions for the expressions. Finally, the authors present in more detail the two platforms they worked on, the difficulties encountered and the results of their work:
- The Philips iCat robot has only a reduced number of degrees of freedom for its expressions, but with a little exaggeration and the use of all the tools offered by the iCat, the experiments give a conclusive emotion recognition rate and an improvement of the believability of the emotions compared to the ones predefined by Philips.
- For the Haptek avatar, the authors had to add notions of AU duration and intensity to Scherer's theory in order to generate successful animations. The user studies also proved very satisfactory here, in terms of both recognition and believability.
