Physiological-Based Affect Event Detector for Entertainment Video Applications

Julien Fleureau, Philippe Guillotel, and Quan Huynh-Thu
Technicolor R&D France, Cesson-Sévigné, France
E-mail: {julien.fleureau, philippe.guillotel, quan.huynhthu}@technicolor.com

Abstract—In this paper, we propose a methodology to build a real-time affect detector dedicated to video viewing and entertainment applications. This detector combines the acquisition of traditional physiological signals, namely galvanic skin response, heart rate and electromyogram, with the use of supervised classification techniques by means of Gaussian Processes. It aims at detecting the emotional impact of a video clip in a new way: first by identifying emotional events in the affective stream (fast increases of the subject's excitation) and then by giving the associated binary valence (positive or negative) of each detected event. The study was conducted to be as close as possible to realistic conditions, in particular by minimizing the use of active calibration and by considering on-the-fly detection. Furthermore, the influence of each physiological modality is evaluated through three different key scenarios (mono-user, multi-user and extended multi-user) that may be relevant for consumer applications. A complete description of the experimental protocol and processing steps is given. The performance of the detector is evaluated on manually labelled sequences, and its robustness is discussed considering the different single and multi-user contexts.

Index Terms—Affective Computing, Emotion Detection, Physiological Signals, Machine Learning, Gaussian Processes.



1 INTRODUCTION

Knowing the emotional response of an audience watching a movie is of high interest to video content creation and distribution. The possible applications are obvious in an advertisement context, but other kinds of applications are also possible, such as affective search in a personal video database [1], recommendation in Video On Demand services [2] or even affective-based video summarization [3]. Nevertheless, real-time emotion detection is not an easy task. Obtaining a manual and direct self-assessment of each viewer, second by second, is clearly not realistic for consumer applications. Face analysis through an adapted camera would be an alternative [4], but the associated algorithms may be very sensitive to the user environment: the camera may be far from the viewer, and the illumination conditions may be complex or poor. Moreover, such algorithms are often optimized for non-spontaneous facial expressions, and performance on natural expressions is not sufficient for a "real" audience. Another approach, adopted in this paper, is based on the recording of physiological signals. Many physiological signals are known to vary with the emotional state of a subject. Some typical signals are the Skin Conductivity (SC) measured through the Galvanic Skin Response (GSR), the Skin Temperature (SKT), the Heart Rate Variability (HRV), which may be approximated by means of a PhotoPlethysmoGraphy (PPG) device [5], and the facial surface ElectroMyoGram (EMG).

The link between such signals and the affective state is well known in the literature [6], [7], [8], [9], and the devices to record such biodata are becoming more and more compact and non-obtrusive [10], [11], [12]. However, estimating the current affective state on-the-fly and directly from the physiological data remains a difficult task, especially if the application targets consumer needs. The physiological signals presented above are quite subject-specific and often require calibration. Moreover, even if a clear link exists between such signal variations and the emotional state, many other factors independent of the affective state (temperature, humidity, diet, ...) have an impact on the signal response. Many studies in the related literature address this specific topic, and they generally adopt a common strategy, which can be summed up into four main steps [13], [14], [15]: i) selection of emotion-eliciting videos, ii) viewing of the clips and acquisition of the associated physiological data, iii) collection of the emotions felt during each clip using a self-assessment survey, and iv) datamining to find links between the acquired physiological signals (or specific extracted features) and the associated affective state. The affective state may be given directly in terms of emotions (amusement, sadness, ...), or indirectly by an emotional model such as the classical bidimensional "valence / arousal" model [16]. Concerning the datamining step, two main groups of studies may be identified in the literature. A first category of works only tries to find statistical linear correlations between the physiological data (or features) and the associated emotion (or its equivalent representation in terms of valence and arousal). In [13], Soleymani et al. record physiological data (GSR, PPG, EMG, SKT) to characterize the emotional content of video clips.


By combining features extracted from those signals with other features coming from the audio and video streams, they found linear correlations with users' self-assessment of arousal and valence. Similarly, Canini et al. [14] tried to correlate the GSR signal with the arousal axis in a context of affective film content analysis. More specifically, the authors adopted a narrative point of view by classifying video sequences according to their narrative coherency. A second category of works adopts a classification strategy to link physiological data and movie-elicited emotions. In this context the model validation is not performed in terms of correlation coefficients but in terms of classification results, which may be more meaningful and explicit. In [15], a complete setup using this strategy is proposed, from the selection of emotion-eliciting movies via a pilot study to the classification step using three machine learning algorithms. Efficient classification is achieved (around 80%) for six emotions on multi-user data. Lee et al. reported similar findings using HRV and GSR [17]. Whereas these studies do not take real-time changes in emotion into account but process the whole signal content, an on-the-fly classification has been proposed with specific video stimuli containing records of people watching emotion-eliciting films [18]. Each captured signal (including SC, HRV and especially ECG) is manually labelled second by second by trained experts. Physiological features are selected and combined with facial expression features (from an additional dedicated camera) to estimate two emotions, namely sadness and amusement. Mono-user results are significantly better than multi-user ones.

In this paper, we propose a methodology to build and evaluate a new real-time affect detector. It includes: i) a new way to label the emotional data (which makes the classification process different), by first identifying the emotional events and then an associated binary valence, ii) an on-the-fly approach tested on realistic clips using physiological signals only (as opposed to Bailenson et al. [18]), with a new affect event detector architecture, iii) a complementary understanding of which physiological signals may be useful for a short-term evaluation of affect, iv) the use and evaluation of Gaussian Processes to perform the classification with few learning examples (50-50 split for learning and testing), which is not new but rare in this context, and v) the study of three different realistic scenarios, in particular an extended multi-user one, relevant in the targeted context.

2 EMOTIONAL EVENTS DATABASE

2.1 Emotional Event

Traditional works in the literature try to estimate specific emotions or to directly position a measured affective state in the discretized valence / arousal plane. Although these approaches are interesting, some applications may not require such a level of precision or may target other information about the emotional flow.

In our context, we focus on the emotional short-term impact that a movie may have on a user at a specific time. We are interested in the detection of fast and large changes in the affective state, rather than in the identification of a precise emotion. We will call such affective changes "emotional events" (or simply "events"). In a movie, they may be mainly caused by: i) relaxing, funny and exciting sequences in the case of positive valence, ii) shocking, frightening or upsetting sequences in the case of negative valence. Consequently, the proposed detector aims at automatically identifying those events in real-time and giving their associated binary valence (positive or negative). Given the variability of reactions when watching a video content, as well as the ambiguity or the proximity between some emotions, only binary valence is considered. By reducing the space of possibilities, we also expect a higher reliability.

2.2 Experimental setup

We developed a complete experimental setup to design and test the detector. We selected a set of 15 emotion-eliciting audio-video clips from the YouTube website, each of approximately two minutes in duration and with explicit emotional content. We used various contents that may cause positive or negative events in the viewer's biosignals. The videos were shown in a random presentation order, using a dedicated testing framework, to 10 different subjects (8 males and 2 females) aged between 20 and 60 years old. Each viewer was comfortably seated in a dedicated dark and quiet room. Headphones were used to isolate the subjects and increase their immersion. Before the experiment, subjects were asked to fill in a form to indicate their age and gender and to report any potential health issue or incompatibility with the experimental context.

Amongst the several physiological options to detect emotion changes, our choice was guided by: i) practical considerations in terms of potential integration in consumer applications, ii) the performance reported in studies from the literature. Skin conductance, facial muscle activity and heart rate were consequently considered. Besides their good potential to characterize emotional valence and arousal [6], they may be easily integrated in low-obtrusive devices (see [10] for an example) and are thus good candidates for the target application. During the viewing of each clip, the GSR was captured using sensor electrodes placed on the second and third medial phalanges of the non-dominant hand. A facial EMG was also recorded using two appropriate electrodes placed on the zygomatic and frontal muscles. In addition, the heart rate was monitored by means of a PPG sensor placed on the tip of the index finger. Finally, a video of the subjects was recorded to check for potential problems during the experiment and to retrieve additional information for the subsequent labelling step. The entire recording process was automatically synchronized and controlled by dedicated software. A BIOPAC MP36RW device was used for the raw data acquisition at a rate of 1 kHz (required for the EMG).
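
As background for the PPG-based features used later (F9 and F10 in Table 2 rely on a heart-rate approximation), the sketch below illustrates one common way to approximate heart rate from a PPG trace by detecting pulse peaks. It is an illustrative assumption rather than the authors' processing chain; in particular, the minimum peak spacing and the prominence threshold are arbitrary choices.

```python
# Illustrative sketch (not the paper's pipeline): heart-rate approximation from PPG.
import numpy as np
from scipy.signal import find_peaks

def heart_rate_from_ppg(ppg, fs=1000.0):
    """Return an instantaneous heart-rate series (beats per minute) from a PPG window."""
    # Require at least 0.4 s between pulse peaks (i.e. at most 150 bpm); both the
    # spacing and the prominence threshold are illustrative, not taken from the paper.
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs), prominence=np.std(ppg))
    ibi = np.diff(peaks) / fs          # inter-beat intervals in seconds
    return 60.0 / ibi                  # instantaneous heart rate in bpm

# Example on a synthetic 10-second pseudo-PPG sampled at 1 kHz (the acquisition rate above).
fs = 1000.0
t = np.arange(0.0, 10.0, 1.0 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.random.randn(t.size)  # ~72 "beats" per minute
print(heart_rate_from_ppg(ppg, fs).mean())
```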

After each clip, the subjects were asked to give an evaluation of their emotional feelings during the viewing, by selecting one or several predefined emotions (exciting, amusement, neutral, shocking, sadness and fear).

Fig. 1. Example of a record for one user. X-axis: time in seconds / Y-axis: modality value in specific units (top: PPG, middle: GSR, bottom: EMG). The 3 highlighted events correspond to 3 funny phases (the third one only elicits a smile - see the EMG peak).

2.3 Data Labelling

For each video clip and each subject, the signals (GSR, PPG and EMG) were manually labelled by one trained expert using a dedicated framework. More precisely, 10-second time windows with a 5-second overlap were sequentially considered, and a tag combining "Event / Not Event" and "Negative / Positive" valence was associated to each window. Both the GSR signal and the video information were used to label each time period. A strong (or high enough) emotional impact is known to cause changes in the Autonomic Nervous System and, incidentally, to modify the skin conductivity [6]. GSR is thus linearly correlated with arousal and reflects emotional shifts [19]. As a consequence, events may be quite easily identified and segmented by an analysis of the GSR signal, as depicted in Figure 1. In addition, the videos of the subjects recorded during the experiment, combined with the results of the self-assessment by each viewer, helped the expert to determine the associated binary valence. In the end, by combining all subjects and clips, an "emotional events" database D = {Di} is built (see Table 1). It consists of a set of 1494 10-second segments termed Di, including the associated signals (GSR, PPG, EMG) and tag. The unbalanced nature of each class (especially for the valence groups) reflects the variability of the emotional profiles from one subject to another.

TABLE 1
Database Description

Subject      #1   #2   #3   #4   #5   #6   #7   #8   #9  #10   Total
Not Event    34   62   78   66   63   47  137  155  146  139     927
Event        59   28   17   27   31   51  102   83   86   83     567
Negative     18    2    4    0    5   14   43   42   47   55     230
Positive     41   26   13   27   26   37   59   41   25   28     323

3 DETECTOR DESIGN

3.1 Overview

The detector should both detect emotional events and their binary valence from the recorded biosignals in real-time. To achieve this, a supervised classification approach is adopted in order to predict the tag associated to each video segment in the database (without using the video information). The detector works as follows:

1. Every 5 seconds, get the last 10 seconds s of the trivariate physiological signal (GSR, EMG and PPG).
2. Extract relevant features from s.
3. Run a previously trained classifier on the vector of extracted features.
4. Decide whether s corresponds to an emotional event and estimate the associated binary valence.

This real-time approach allows a time resolution of approximately 5 seconds. No specific user calibration is required since the classifier contains sufficient general knowledge to extend its initial learning to a larger group of subjects. Explicit management of the temporal consistency of the prediction is not yet integrated in this version of the detector. Nevertheless, the overlapping nature of the temporal windows on which the features are computed, as well as the continuous behaviour of the classifying function, ensure a certain temporal consistency over a short time interval.

The evaluation of the approach is performed here through the analysis of the data from D. The 10-second biosignals are taken from the database D, and each Di is used either as input to the detector for testing its performance or for training the associated classifier.

3.2 Features Extraction

For each channel of the trivariate biosignals, features close to those traditionally found in the literature are extracted (see [13] for an example). More precisely, three features (F1 to F3) are extracted from the EMG high-frequency content (fc = 5 Hz), termed EMG_HF; five features (F4 to F8) from the GSR filtered between 0.5 and 1 Hz, termed GSR_BP; and two features (F9 and F10) from the PPG record. Those features are essentially based on statistical functions of a specific spectral band for each modality (see Table 2). They have been chosen on the basis of previously published works and because of their expected relevance for the considered context.
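
To make the feature definitions concrete, the sketch below computes the EMG and GSR statistics of Table 2 on a single 10-second window. It is a minimal illustration under stated assumptions: the Butterworth filters, their orders and the handling of the local extrema are my own choices, not specified in the paper, and F9-F10 would be computed on the heart-rate approximation sketched in Section 2.2.

```python
# Illustrative sketch (filter design choices are assumptions): Table 2 statistics
# for one 10-second window of the EMG and GSR channels sampled at 1 kHz.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 1000.0  # sampling rate in Hz

def emg_features(emg, fs=FS):
    # EMG high-frequency content (cut-off fc = 5 Hz), then F1-F3: mean, std, max.
    sos = butter(4, 5.0, btype="highpass", fs=fs, output="sos")
    emg_hf = sosfiltfilt(sos, emg)
    return [np.mean(emg_hf), np.std(emg_hf), np.max(emg_hf)]          # F1, F2, F3

def gsr_features(gsr, fs=FS):
    # GSR band-passed between 0.5 and 1 Hz, then F4-F8.
    sos = butter(2, [0.5, 1.0], btype="bandpass", fs=fs, output="sos")
    gsr_bp = sosfiltfilt(sos, gsr)
    t = np.arange(gsr_bp.size) / fs
    slope = np.polyfit(t, gsr_bp, 1)[0]                               # F4: slope of linear fit
    ext = np.where(np.diff(np.sign(np.diff(gsr_bp))) != 0)[0] + 1     # local extrema indices
    steps = np.diff(gsr_bp[ext])
    amp = steps[np.argmax(np.abs(steps))] if steps.size else 0.0      # F5: max signed amplitude
    d = np.diff(gsr_bp) * fs                                          # first derivative
    return [slope, amp, np.mean(d), np.std(d), np.max(d)]             # F4-F8

# Usage on one (synthetic) window of 10 000 samples per channel:
rng = np.random.default_rng(0)
features = emg_features(rng.standard_normal(10_000)) + gsr_features(rng.standard_normal(10_000))
print(len(features))  # 8 values; the two PPG features would complete the 10-feature vector
```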

TABLE 2
Features Description (normalized values - see Section 3.2)

Modality   Feature   Description
EMG        F1        Mean value of EMG_HF
EMG        F2        Standard deviation (energy) of EMG_HF
EMG        F3        Maximal value of EMG_HF
GSR        F4        Slope value of the linear fit of GSR_BP
GSR        F5        Maximum signed amplitude between 2 consecutive extrema of GSR_BP
GSR        F6        Mean value of the derivative of GSR_BP
GSR        F7        Standard deviation (energy) of the derivative of GSR_BP
GSR        F8        Maximal value of the derivative of GSR_BP
PPG        F9        Standard deviation of the heart rate approximation
PPG        F10       Maximum of the heart rate approximation

Because of i) the great variability (and specificity) of shapes and amplitudes of each signal between subjects, and ii) possible variations in the acquisition conditions (temperature, humidity, ...), most of the proposed features have to be normalized for the classification. Normalization is performed using a relevant characteristic (mean value) computed on an emotionally neutral part of the signals. In our simulations, the affective neutral parts required to compute the normalization factors can be identified using the annotations of the trained expert (see Section 2.3). In a practical situation, evaluating such normalized features may be critical because it requires automatically identifying those "neutral" parts, which may vary strongly between subjects. Two solutions may be proposed to address this problem and keep the classification process as user-friendly as possible. A first solution is based on the video meta-data and especially on contextual information that could be directly attached to the Audio/Video streams. Some standards, e.g. MPEG-V, plan to integrate in the data stream affective features such as estimations of valence and arousal over the whole stream duration. Those data can be used to compute the normalization factors. Another solution consists in computing the normalization factors on the signal acquired before the A/V stream starts. More precisely, we assume that a subject in his or her daily life would be, on average, in a neutral affective state just before starting to watch a movie (during the opening or advertisements for instance).

From the "emotional events" database, a new dataset X = {Xi} is therefore built, where each Xi is a vector made of the normalized features extracted from the 10-second biosignals Di of D. This new dataset is then used to design the supervised classifier.
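
As a concrete illustration of this normalization step, the sketch below scales each window feature by its mean value over reference windows taken from a neutral portion of the recording (either the expert-annotated neutral segments or the pre-roll before the clip). Whether the characteristic value is used as a divisor or subtracted is not specified in the text; division is an assumption made here, and all names are illustrative.

```python
# Illustrative sketch (assumed interface and divisive normalization):
# scaling window features by their mean over emotionally neutral reference windows.
import numpy as np

def normalize_features(window_features, neutral_windows, eps=1e-8):
    """Divide each feature by its mean value over the neutral reference windows."""
    baseline = np.mean(np.asarray(neutral_windows, dtype=float), axis=0)
    return np.asarray(window_features, dtype=float) / (baseline + eps)

# neutral_windows: feature vectors extracted from windows tagged neutral by the expert,
# or from the signal recorded just before the clip starts (opening, advertisements, ...).
neutral_windows = [[0.9, 1.1, 2.0], [1.1, 0.9, 2.2]]
print(normalize_features([1.5, 1.8, 4.4], neutral_windows))  # ~[1.5, 1.8, 2.1]
```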

3.3 Classification Strategy

The classifier decides whether the analysed vector of features matches an emotional event and, if so, gives the associated binary valence. We used a supervised classification strategy to achieve this. More precisely, the detection strategy is based on two hierarchical binary problems of supervised classification:

1. Decide whether the vector of features designates an emotional event or not (sub-problem A1).
2. If so, decide whether the associated valence is positive or negative (sub-problem A2).

In this paper, those two problems are addressed independently (see Section 4). Using such a classification strategy may appear quite intuitive for sub-problem A2, but less so for sub-problem A1. Indeed, at first sight it seems reasonably easy to identify emotional events in a GSR sequence (see Section 2 and Figure 1). One may consequently expect that a more direct approach using classical signal processing techniques (filtering, thresholding, ...) applied to the GSR signal would be able to catch those events. However, the GSR variations as well as their shape are quite subject-specific, and using simple thresholds, for instance, may not be robust. Moreover, GSR may not be the only source of information to detect events: the other channels (EMG or PPG) may indeed be useful for the detection. For these reasons, using a classification strategy to solve sub-problem A1 provides a way to learn various user profiles as well as a natural way to take advantage of the information from the three biosignals. Once trained, a classifier is also a computationally efficient way to formulate complex decision rules. All these reasons explain our methodological choice of a fully supervised classification strategy.

3.4 Gaussian Processes

Sub-problems A1 and A2 obviously require the choice of a machine learning algorithm. We selected Gaussian Process Classifiers (GPCs) [20] because of their novelty and their interesting properties in terms of formalism and performance. Briefly, let us consider a binary Bayesian classification context with a learning dataset X, the associated labels y = {yi ∈ {−1, +1}} and a new sample x* to classify with an unknown class y*. A GPC tries to evaluate the predictive density p(y* = +1 | X, y, x*) using the prior p(y | x, f) = σ(y f(x)) where:
• f is a Gaussian Process defined by its mean function m(x) = E[f(x)] (generally set to zero) and its covariance function (also called kernel for its mathematical properties, a covariance matrix being positive-definite) k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))],
• σ is a symmetric sigmoid function (logistic or probit function for instance) that maps f to [0, 1].

In a Bayesian way and with f* = f(x*), it can be shown that the previous predictive density may be rewritten by introducing the latent variables:

p(y* = +1 | X, y, x*) = ∫∫ σ(f*) p(f* | f, X, x*) p(y | X, f) p(f | X) / p(y | X) df df*

This integral is analytically intractable because of the non-Gaussian likelihood (and hence non-Gaussian posterior), and specific methods have to be considered for its computation. Monte-Carlo schemes are a solution, but they remain expensive in terms of computational load. We used an alternative and fast approach, with high accuracy, called Expectation Propagation (EP), proposed by Minka [21].

The choice of the kernel involved in the definition of f may be decisive for the classification results. Two well-known and generic enough kernels are used for comparison:
• A tenth-order polynomial kernel (P10) defined by

  k(x, x') = (1 + x · x')^10

  where · denotes the canonical scalar product.
• An anisotropic squared exponential kernel (xSE), allowing one length scale per dimension (noted l_i for the i-th dimension) and a common scale factor s, defined by

  k(x, x') = s^2 exp(− Σ_i (x_i − x'_i)^2 / (2 l_i^2))

  and requiring D + 1 hyper-parameters if (x, x') lives in R^D × R^D.

GPCs offer both the advantages of a Bayesian and probabilistic framework and the power of kernel approaches (also used in Support Vector Machine theory). The model selection (setting of the kernel hyper-parameters) can be directly addressed in a Bayesian way by optimizing the marginal likelihood of the model [20]. Moreover, the EP complexity is reasonable. All these reasons therefore justify the choice of such a classifier.
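
The following sketch shows how the two kernels and the hierarchical decision of Section 3.3 could be assembled with scikit-learn's Gaussian Process Classifier. It is not the authors' implementation: scikit-learn's GPC uses a Laplace approximation with a logistic link rather than the EP / probit setting described above, and the function names, label coding and training interface are illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation): hierarchical GPC decision
# (A1 then A2) with a P10-like kernel and an anisotropic squared exponential kernel.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Exponentiation, RBF

N_FEATURES = 10
# P10 kernel: (1 + x . x')^10 (DotProduct with sigma_0 = 1, kept fixed, gives 1 + x . x').
p10 = Exponentiation(DotProduct(sigma_0=1.0, sigma_0_bounds="fixed"), exponent=10)
# xSE kernel: s^2 exp(-sum_i (x_i - x'_i)^2 / (2 l_i^2)), one length scale per dimension.
xse = ConstantKernel(1.0) * RBF(length_scale=np.ones(N_FEATURES))

event_clf = GaussianProcessClassifier(kernel=p10)    # sub-problem A1: event vs. not event
valence_clf = GaussianProcessClassifier(kernel=p10)  # sub-problem A2: negative vs. positive

def train(X_all, y_event, X_events, y_valence):
    """Train the two classifiers independently (binary labels assumed coded as 0/1)."""
    event_clf.fit(X_all, y_event)          # all windows, event / not-event labels
    valence_clf.fit(X_events, y_valence)   # event windows only, binary valence labels

def detect(x):
    """Hierarchical decision for one normalized 10-dimensional feature vector."""
    x = np.atleast_2d(x)
    if event_clf.predict(x)[0] == 0:
        return "no event"
    return "positive event" if valence_clf.predict(x)[0] == 1 else "negative event"
```

With the xSE kernel, the D + 1 hyper-parameters (the length scales and the scale factor) are tuned during fitting by maximizing the marginal likelihood, which mirrors the model-selection strategy mentioned above.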

4 DETECTOR EVALUATION

To evaluate the performance of the proposed detector, three key scenarios corresponding to different real-life configurations have been considered, namely the mono-user, multi-user and extended multi-user cases.

4.1 Scenarios Description

4.1.1 Mono-User

In a first simulation S1, a mono-user context is addressed. More precisely, the detector is trained and tested on a specific subject. In a practical context, this would correspond to a personal learning database for each user.

In this study, the records associated to each specific subject are extracted from X and randomly and equally divided into training and testing parts (50% of the whole individual dataset for learning and the other 50% for testing). Such a split is preferred to a leave-one-out cross-validation procedure (even if it makes the test unavailable for certain cases - see Table 3) because it is closer to a realistic context where the learning population should be limited with respect to the testing one. Sub-problems A1 and A2 are then solved independently for each of the 10 subjects of the database and for five random partitions of the learning / testing sets.

4.1.2 Multi-User

In a second simulation S2, a multi-user problem is considered. The detector is trained and tested on the whole multi-user database. In a consumer application, this would correspond to a database shared between different members of the same household. In this configuration the whole dataset X is equally divided into training and testing parts and, as previously, sub-problems A1 and A2 are solved independently and for five random partitions of the learning / testing sets.

4.1.3 Extended Multi-User

Finally, a last simulation S3 is proposed. In this case the learning step is performed on all but one user and the testing step is performed on the excluded subject (each subject being sequentially excluded). This configuration would correspond to a detector calibrated at the factory using a pre-built common database and tested at home. Such a detector would avoid any further user calibration and would allow an out-of-the-box system.

4.2 Detector Configuration

For each of those scenarios, S1 to S3, we adopted a common strategy for simulation and datamining purposes with 6 different configurations of the detector:

1. All features described in Table 2 are used with a P10 kernel (termed ALL).
2. The 3 EMG features only are used with a P10 kernel (termed EMG).
3. The 5 GSR features only are used with a P10 kernel (termed GSR).
4. The 2 PPG features only are used with a P10 kernel (termed PPG).
5. The features giving the best mean classification results (among the 1024 possibilities) are used with a P10 kernel (termed BEST).
6. These latter optimal features (obtained with a P10 kernel) are used with an xSE kernel.

Each of the first five configurations is meant to evaluate the influence of the modalities and features in the classification process, whereas the last one is meant to evaluate the role of the kernel when using GPCs. Regarding the fifth configuration, the optimal features are obtained by an exhaustive maximization of the accuracy × min(specificity, sensitivity) criterion over the 1024 feature combinations.
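
The exhaustive search over feature subsets can be written in a few lines; the sketch below evaluates every non-empty subset of the 10 features with the accuracy × min(specificity, sensitivity) criterion. The classifier factory and the data arrays are placeholders, and the scoring assumes binary labels coded as 0/1.

```python
# Illustrative sketch: exhaustive search over the 2^10 feature subsets using the
# accuracy * min(specificity, sensitivity) selection criterion (binary labels 0/1).
from itertools import combinations
import numpy as np

def spec_sens_acc(y_true, y_pred):
    tn = np.sum((y_true == 0) & (y_pred == 0)); fp = np.sum((y_true == 0) & (y_pred == 1))
    tp = np.sum((y_true == 1) & (y_pred == 1)); fn = np.sum((y_true == 1) & (y_pred == 0))
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    return spec, sens, (tp + tn) / y_true.size

def best_feature_subset(make_classifier, X_train, y_train, X_test, y_test, n_features=10):
    """Return the feature subset maximizing accuracy * min(specificity, sensitivity)."""
    # make_classifier: any scikit-learn-style classifier factory (e.g. a GPC as in Section 3.4).
    best_score, best_subset = -1.0, None
    for k in range(1, n_features + 1):                 # all non-empty subsets
        for subset in combinations(range(n_features), k):
            cols = list(subset)
            clf = make_classifier()
            clf.fit(X_train[:, cols], y_train)
            spec, sens, acc = spec_sens_acc(y_test, clf.predict(X_test[:, cols]))
            score = acc * min(spec, sens)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```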


4.3 Results

4.3.1 Overview

Simulation results are reported in Table 3, which gathers the three scenarios. For each scenario, the associated table reports the results for sub-problem A1 (event detection) and for sub-problem A2 (valence classification) for the first five configurations (the sixth configuration is not quantitatively reported here, but a qualitative analysis is provided below). Detection results are indicated for each of the 10 subjects in the case of scenarios S1 and S3, and for the whole population in the case of scenario S2. For each detector configuration, classification performance is evaluated on sub-problems A1 and A2 independently. Each classification result is characterized (first to third value respectively) by its specificity (true negative rate), its sensitivity (true positive rate) [22] and the total rate of good classification (accuracy), averaged over the different random partitions of the learning / testing sets when necessary (S1 and S2). In the case of scenarios S1 and S3, an average specificity, sensitivity and score over the different subjects is also given to provide a more global point of view.
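
For reference, the three figures reported in each cell of Table 3 can be computed from a confusion matrix as in the small sketch below (scikit-learn based; the 0/1 label convention is an assumption).

```python
# Illustrative sketch: specificity, sensitivity and accuracy from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def specificity_sensitivity_accuracy(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp)                  # true negative rate
    sensitivity = tp / (tp + fn)                  # true positive rate
    accuracy = (tp + tn) / (tn + fp + fn + tp)    # total rate of good classification
    return specificity, sensitivity, accuracy

# Scores are then averaged over the random learning / testing partitions (S1, S2)
# and, for S1 and S3, over the ten subjects.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(specificity_sensitivity_accuracy(y_true, y_pred))  # (0.667, 0.667, 0.667)
```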

4.3.2 Mono-User

Table 3-A presents the results obtained for scenario S1 and for sub-problems A1 and A2 respectively. First of all, it is clear that, in both sub-problems, using the whole set of features is not the optimal solution, with total scores varying between 50% and 90% from one observer to another. Indeed, considering an optimal subset significantly improves the results. Even if this may seem surprising, adding too many features during the classification stage may actually lead to over-fitting the learning data and thus to a lack of generalization on unknown data. Regarding the event detection, Table 3-A indicates that the GSR and EMG are the most discriminant modalities, whereas the PPG features remain generally unable to correctly discriminate between the two classes (low sensitivity). The importance of the GSR modality is quite logical considering the importance of this signal during the manual labelling step (see Section 2.3). It also appears that the best combination of features also makes use of the EMG modality. The fact that the PPG features (and therefore heart rate variability information) were not selected in the optimized model in our context may be explained by an inadequate time resolution: the features are computed on 10-second segments, which may be too short to capture discriminative variations. The best configuration yields a mean accuracy of 82.44% for the event classification, with satisfying associated specificity and sensitivity (87.41% and 71.26% respectively). This performance is quite homogeneous across all subjects individually, even if subjects #3 and #4 seem to present a lower sensitivity, which may be explained by some experimental issues (noisy records due to bad electrode positioning especially).

Regarding the valence classification, Table 3-A clearly shows that the EMG provides the best classification results and the most discriminative signal, with a mean accuracy of 90.34%. Indeed, correctly discriminating between positive and negative emotions when watching a movie is often equivalent to detecting a smile or a laugh on a face, which the EMG signal easily captures. Nevertheless, it is of course possible to feel positive emotions without smiling. The GSR also seems to react to emotions, according to the selected set of optimal features. In this latter case, quite high performance is obtained by the proposed detector, with around 87% of mean specificity, sensitivity and score over all observers (even for subjects #9 and #10, which show low classification results when using only the EMG and the GSR respectively). Finally, when looking at the kernel influence (not quantitatively reported here) in both sub-problems A1 and A2, although the xSE kernel seems to offer slightly lower performance, the two kernels exhibit a similar behaviour. Moreover, the lower performance of the xSE kernel may be explained by the fact that: i) the optimal features have been computed using a P10 kernel configuration and could be slightly different with an xSE kernel, ii) as can be observed in Figure 2, the predictive density (see Section 3) has different shapes in the learning (feature) space, with the P10 kernel providing better generalization properties for the targeted context.

4.3.3 Multi-User

Table 3-B presents the results obtained for scenario S2 and for sub-problems A1 and A2 respectively. As in the previous scenario, the same comments can be made regarding the kernel choice, the importance of the GSR modality to detect events, and the role of the EMG information to determine the associated valence. In fact, those results are even reinforced, and one can observe that using the EMG or PPG modality alone makes the event detection process impossible. Similarly, using the whole set of features, or only those associated to the PPG, produces low performance in the valence detection. Moreover, even if the optimal features are not exactly the same as for scenario S1, a combination of GSR and EMG features again leads to the best detection performance, with 78.05% of good detection for the events and 82.67% for the valence. Those performances remain acceptable for the targeted consumer context, and the proposed detector seems able to generalize its performance to a group of observers.

TABLE 3
Events and valence detection results for the three key scenarios. For each scenario, each subject and each configuration of the detector, specificity / sensitivity / accuracy values are reported (see Section 4.3.1). The best features used in the "BEST" configuration are also reported for each case (see Table 2). Each cell reads specificity/sensitivity/accuracy; columns are subjects #1 to #10 followed (after "|") by the average. N/A marks cases where a class is missing for a subject.

A. Mono-User - Events detection
ALL:             53.33/87.50/76.60  96.43/70.59/86.67  92.30/55.56/85.41  96.67/58.82/82.98  96.30/40.00/72.34  79.17/72.00/75.51  50.00/50.00/50.00  90.28/63.83/79.83  90.91/76.00/84.48  91.18/88.37/90.09  | 74.03/66.27/78.39
EMG:             73.33/75.00/74.47  85.71/35.29/66.67  94.87/11.11/79.17  93.33/70.59/85.11  66.67/64.00/65.30  62.50/56.00/59.18  64.06/50.00/57.50  81.94/19.15/57.14  63.64/46.00/56.03  83.82/69.77/78.38  | 76.99/57.38/67.90
GSR:             66.67/78.13/74.47  92.86/76.47/86.67  92.31/33.33/81.25  86.67/35.29/68.09  92.59/30.00/65.96  75.00/80.00/77.55  79.69/60.71/70.83  94.44/48.94/76.47  95.45/76.00/87.07  85.29/86.05/85.59  | 86.10/60.50/77.40
PPG:             53.33/78.13/70.21  82.14/00.00/51.11  84.62/22.22/77.92  86.67/29.41/65.96  85.18/20.00/57.45  54.17/44.00/48.98  50.00/50.00/50.00  73.61/25.53/54.62  74.24/44.00/61.20  69.12/37.21/56.76  | 71.40/35.05/58.92
BEST (F2,3,5,7): 73.33/81.25/78.72  96.43/82.35/91.11  100.0/44.44/89.58  96.67/52.94/80.85  92.59/80.00/87.23  79.17/80.00/79.59  78.13/69.64/74.17  84.72/65.96/77.31  84.85/70.00/78.45  88.24/86.05/87.39  | 87.41/71.26/82.44

A. Mono-User - Valence detection
ALL:             77.77/95.23/90.00  00.00/100.0/92.85  100.0/85.71/88.88  N/A  100.0/100.0/100.0  66.67/65.22/65.21  50.00/50.00/50.00  63.64/90.00/76.19  62.50/58.33/61.11  85.71/42.85/71.42  | 67.37/85.12/77.30
EMG:             77.78/100.0/93.33  100.0/100.0/100.0  100.0/100.0/100.0  N/A  66.67/100.0/93.75  100.0/100.0/100.0  62.50/100.0/82.35  90.90/95.00/92.85  50.00/66.67/55.55  96.43/92.86/95.24  | 82.70/94.95/90.34
GSR:             33.33/85.71/70.00  00.00/100.0/92.86  100.0/85.71/88.89  N/A  33.33/84.62/75.00  33.33/69.57/65.38  58.33/92.59/76.47  50.00/70.00/59.52  75.00/58.33/69.44  78.57/14.29/57.14  | 51.32/76.42/72.74
PPG:             44.44/71.43/63.33  00.00/84.61/78.57  00.00/85.71/66.67  N/A  00.00/76.92/62.50  66.67/69.57/68.23  50.00/50.00/50.00  36.36/60.00/47.62  83.33/25.00/63.88  82.14/42.86/69.05  | 40.33/62.90/63.32
BEST (F1,3,7):   77.78/100.0/93.33  100.0/100.0/100.0  100.0/100.0/100.0  N/A  100.0/100.0/100.0  50.00/50.00/50.00  91.27/100.0/96.08  90.91/80.00/85.71  66.67/83.33/72.22  96.43/71.43/88.10  | 85.90/87.20/87.27

B. Multi-User (whole population)
Events detection:   ALL 50.00/50.00/50.00   EMG 50.00/50.00/50.00   GSR 85.72/62.91/76.44   PPG 50.00/50.00/50.00   BEST (F1,3,4,5,8) 84.04/72.19/78.05
Valence detection:  ALL 50.00/50.00/50.00   EMG 67.54/77.30/73.29   GSR 53.51/70.55/73.29   PPG 50.00/50.00/50.00   BEST (F1,2,7,8) 85.09/80.98/82.67

C. Extended Multi-User - Events detection
ALL:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  58.16/81.75/26.47  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 50.82/53.17/47.65
EMG:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 50.00/50.00/50.00
GSR:             61.29/55.88/64.41  82.26/75.00/80.00  89.47/93.59/70.59  86.36/18.52/66.67  84.13/67.74/78.72  89.36/43.14/62.31  81.02/52.94/69.03  83.87/57.83/74.79  82.87/72.09/78.88  72.66/91.57/79.73  | 81.33/62.83/72.81
PPG:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 50.00/50.00/50.00
BEST (F5,6,7,8): 73.53/69.49/70.97  83.87/67.86/78.89  94.87/70.59/90.53  87.88/33.33/72.04  85.71/83.87/85.11  93.62/56.86/74.48  86.86/45.10/69.04  81.94/65.06/76.05  85.62/79.07/83.19  82.01/72.29/78.38  | 85.59/64.35/77.87

C. Extended Multi-User - Valence detection
ALL:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  N/A/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  72.09/08.47/35.29  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 52.45/45.85/48.23
EMG:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  N/A/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 50.00/50.00/50.00
GSR:             33.33/53.66/47.46  100.0/23.08/28.57  50.00/69.23/64.71  N/A/81.48/81.48  20.00/46.15/41.94  50.00/54.05/52.94  32.56/64.80/52.94  47.62/82.93/65.06  14.89/92.00/41.67  21.81/85.71/43.37  | 46.36/65.31/52.01
PPG:             50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  N/A/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  50.00/50.00/50.00  | 50.00/50.00/50.00
BEST (F2,4):     55.56/80.49/72.88  50.00/76.92/75.00  75.00/69.23/70.59  N/A/77.78/77.78  50.00/73.08/61.29  64.29/75.67/72.55  74.52/52.54/61.76  50.00/75.60/62.65  23.40/76.00/41.00  63.64/46.43/57.83  | 56.27/70.37/65.33

Fig. 2. Kernel influence. 3D plots of the predictive distribution computed with a P10 (left) and an xSE (right) kernel for the GPC associated to sub-problem A2 (binary valence detection) for subject #7 with the 3 optimal features. Each axis matches one of the features; each point matches one learning feature vector (blue ∼ negative event / red ∼ positive event). Each surface is an isoprobability contour with value 0.2 (blue), 0.5 (green) and 0.8 (red).

4.3.4 Extended Multi-User

Table 3-C presents the results obtained for scenario S3 and for sub-problems A1 and A2 respectively. This scenario is obviously the most ambitious: it requires higher generalization capabilities, since the detector learns on a group of subjects and is tested on a new, unknown one. Considering the great variability of physiological responses between subjects, this is thus a challenging task, which may partially explain the lower results observed.

Nevertheless, the event detection accuracy reaches 77.87% using the optimal feature set. This shows that the proposed detector is able to generalize its behaviour to totally unknown data. This performance also remains quite homogeneous when considering each subject individually.


The GSR is still the most important modality for building an efficient event detector, whereas the other modalities, especially the EMG, seem to play a less important role. Regarding the valence detection, the results are less convincing: the generalization from a group to a new subject seems to be more difficult. Better results are achieved when a test observer has a physiological behaviour close to one of the learning group (see subjects #3 or #6 for instance). Despite this, trends similar to the previous simulations may be observed: the EMG and GSR modalities remain the most discriminant, and different kernels do not really change the final results.

5 CONCLUSION AND PERSPECTIVES

In this paper, we proposed a new physiological-based and real-time oriented detector of affective states for video content production, distribution and rendering applications. It differs from other approaches by focusing on specific affective events of the emotional flow. The detector captures, in a user-friendly way, what we called "affective events", and gives their associated binary valence. It takes advantage of powerful machine learning techniques (Gaussian Process Classifiers) to automatically build complex decision rules combining multiple modalities. The detector has been evaluated through three different realistic scenarios, including mono-user, multi-user and extended multi-user simulations. Convincing results have been obtained regarding the event detection (∼ 80%). Prediction of the binary valence is also very efficient (80-90%), except for the extended multi-user case, which remains a hard task (∼ 60%) requiring further research. The role of the different features and their associated physiological modalities, as well as the GPC kernel choice, have also been addressed. We especially pointed out the importance of the GSR and EMG modalities and, conversely, the low relevance of the PPG in our context. Concerning the GPC kernel choice, even if a deeper analysis would be necessary, the use of two different kernels should not radically modify the performance (despite a slightly better generalization power for the P10 kernel).

Several options may be considered to improve the performance of the detector. As a machine-learning-based process, including more physiological profiles in the learning database is an obvious way to make the detection more robust. Secondly, exploring the use of more complex features (shape-based ones for instance) is another path to investigate. Regarding the classification step, a fuzzy interpretation of the GPC outputs (which were directly binarized in this work) could also be advantageous for the final interpretation. Finally, as discussed in Section 3, modelling the temporal evolution of affect could be interesting to improve the long-term detection consistency.

REFERENCES

[1] A. Sarcevic and M. Lesk, "Searching for Emotional Content in Digital Video," ACM CHI 2006 Workshop HCI and the Face, pp. 1–4, 2006.
[2] C. Calcanis, V. Callaghan, M. Gardner, and M. Walker, "Towards end-user physiological profiling for video recommendation engines," Conf. Proc. IE, pp. 1–5, 2008.
[3] A. G. Money and H. Agius, "Analysing user physiological responses for affective video summarisation," Displays, vol. 30, no. 2, pp. 59–70, Apr. 2009.
[4] M. Pantic and L. J. M. Rothkrantz, "Automatic Analysis of Facial Expressions: The State of the Art," IEEE PAMI, vol. 22, no. 12, pp. 1424–1445, 2000.
[5] G. Lu and F. Yang, "Limitations of oximetry to measure heart rate variability measures," Cardiovasc. Eng., vol. 9, pp. 119–125, 2009.
[6] P. Lang, "The emotion probe," Am. Psychol., vol. 50, no. 5, pp. 372–385, 1995.
[7] J. Wagner and E. Andre, "From Physiological Signals to Emotions: Implementing and Comparing Selected Methods for Feature Extraction and Classification," in Conf. Proc. IEEE Multimedia and Expo, 2005, pp. 940–943.
[8] E. Ganglbauer, J. Schrammel, S. Deutsch, and M. Tscheligi, "Applying Psychophysiological Methods for Measuring User Experience: Possibilities, Challenges, and Feasibility," in Proc. UXEM, 2009.
[9] R. Calvo and S. D'Mello, "Affect Detection: An Interdisciplinary Review of Models, Methods, and their Applications," IEEE TAC, vol. 1, no. 1, pp. 18–37, 2010.
[10] S. Rothwell, B. Lehane, C. Chan, A. Smeaton, N. O'Connor, G. Jones, and D. Diamond, "The CDVPlex biometric cinema: sensing physiological responses to emotional stimuli in film," in ICPA, 2006, pp. 1–4.
[11] R. Matthews, N. J. McDonald, P. Hervieux, P. J. Turner, and M. A. Steindorf, "A Wearable Physiological Sensor Suite for Unobtrusive Monitoring of Physiological and Cognitive State," in Conf. Proc. IEEE EMBS, 2007, pp. 5276–5281.
[12] R. R. Fletcher, K. Dobson, M. S. Goodwin, H. Eydgahi, O. Wilder-Smith, D. Fernholz, Y. Kuboyama, E. B. Hedman, M. Z. Poh, and R. W. Picard, "iCalm: Wearable Sensor and Network Architecture for Wirelessly Communicating and Logging Autonomic Activity," IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 2, 2010.
[13] M. Soleymani, G. Chanel, J. J. M. Kierkels, and T. Pun, "Affective Characterization of Movie Scenes Based on Multimedia Content Analysis and User's Physiological Emotional Responses," IEEE Int. Symp. Multimedia, pp. 228–235, Dec. 2008.
[14] L. Canini, S. Gilroy, M. Cavazza, R. Leonardi, and S. Benini, "Users' response to affective film content: A narrative perspective," in Int. Workshop CBMI, vol. 1, no. 1, 2010, pp. 1–6.
[15] C. L. Lisetti and F. Nasoz, "Using Noninvasive Wearable Computers to Recognize Human Emotions from Physiological Signals," Eur. J. Adv. Sig. Pro., vol. 2004, no. 11, pp. 1672–1687, 2004.
[16] J. Russell, "A circumplex model of affect," J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
[17] C. Lee, S. K. Yoo, Y. Park, N. Kim, K. Jeong, and B. Lee, "Using neural network to recognize human emotions from heart rate variability and skin resistance," Conf. Proc. IEEE EMBS, vol. 5, pp. 5523–5, Jan. 2005.
[18] J. Bailenson, E. Pontikakis, I. Mauss, J. Gross, M. Jabon, C. Hutcherson, C. Nass, and O. John, "Real-time classification of evoked emotions using facial feature tracking and physiological responses," Int. J. Hum. Comput. Stud., vol. 66, no. 5, pp. 303–317, May 2008.
[19] R. Mandryk, K. Inkpen, and T. Calvert, "Using psychophysiological techniques to measure user experience with entertainment technologies," Behav. & Inf. Technol., vol. 25, no. 2, pp. 141–158, Mar. 2006.
[20] C. Rasmussen and K. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[21] T. Minka, "Expectation propagation for approximate Bayesian inference," in Conf. Proc. Uncertainty in AI, vol. 17, 2001, pp. 362–369.
[22] D. G. Altman and J. M. Bland, "Diagnostic tests. 1: Sensitivity and specificity," BMJ, vol. 308, no. 6943, p. 1552, 1994.