Distribution of Semantic Features Across Speech & Gesture by Humans and Machines

Keywords: Affective body language, Affective speech, Emotion recognition, Multimodal fusion
The system used the Weka data mining tool.

Beyond Facial Expressions: Learning Human Emotion from Body Gestures

Vision-based human affect analysis is an interesting and challenging problem, impacting important applications in many areas. In this paper, going beyond facial expressions, we investigate affective body gesture analysis in video sequences, a relatively understudied problem. Spatio-temporal features are exploited to model body gestures. Moreover, we propose fusing facial expression and body gesture at the feature level using Canonical Correlation Analysis (CCA). By establishing the relationship between the two modalities, CCA derives a semantic “affect” space. Experimental results demonstrate the effectiveness of our approaches.

– Although bodily expression plays a vital role in conveying human emotional states, it remains understudied; the perception of facial expression is strongly influenced by concurrently presented body language.
– Recognition: SVM

A Support Vector Machine (SVM) classifier is used to recognize affective body gestures. SVM is a discriminative method grounded in statistical learning theory.

– They divided the data set randomly into five groups with roughly equal numbers of videos, then used the data from four groups for training and the remaining group for testing (five-fold cross-validation).
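The SVM classification with the five-fold protocol described above can be sketched as follows. This is a minimal illustration with scikit-learn, not the paper's code; the feature dimension, number of videos, and emotion label set are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins (assumed sizes): 100 videos, 64-dim gesture features,
# 4 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
y = rng.integers(0, 4, size=100)

# Five-fold protocol: each fold trains on four groups and tests on the fifth.
clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=5)
print(len(scores))  # 5 per-fold accuracy values
```

Averaging the five per-fold scores gives the overall recognition rate reported under this protocol.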

Distribution of Semantic Features Across Speech & Gesture by Humans and Machines

Key idea: the semantic and pragmatic content of the intended message.

– A deictic gesture accompanying the spoken words “that folder” may substitute for an expression that encodes all of the necessary information in the speech channel, such as “the folder on top of the stack to the left of my computer.”

– A growing body of evidence shows that people unwittingly produce gestures along with speech in many communicative situations. These gestures elaborate upon and enhance the content of accompanying speech (McNeill, 1992; Kendon, 1972).

– But when speech is ambiguous (Thompson & Massaro, 1986) or produced in a noisy situation (Rogers, 1978), listeners do rely on gestural cues (and the higher the noise-to-signal ratio, the greater the facilitation by gesture).

– When people are exposed to gestures and speech that convey slightly different information, whether additive or contradictory, they treat the information conveyed by gesture on an equal footing with that conveyed by speech, ultimately seeming to build one single representation out of information conveyed in two modalities (Cassell, McNeill & McCullough, in press)

– Hand gestures co-occur with their semantically parallel linguistic units, although in cases of hesitations, or syntactically complex speech, it is the gesture which appears first (McNeill, 1992)

– We believe that computers should not simply attempt to understand humans; they should generate human-like communicative behavior in response. We design communicative humanoid agents – animated human figures with faces and hands that can produce speech, intonation, appropriately timed gestures, and regulatory facial movements.

– This provisional solution produced pre-scripted gestural forms, and gestures that were redundant with the speech they accompanied rather than complementary or non-redundant. So, for example, the gesture in Figure 1 was produced by accessing the gesture dictionary once the discourse planner had generated the concept of “writing a check.”

– In particular, we decided to look at the semantic features of ‘manner’, ‘path’, ‘telicity’ (whether a motion has an endpoint or goal), ‘speed’, and ‘aspect’ (inherent iterativity or duration) within verb phrases and gestures describing the movement of volitional agents (Coyotes and Road Runners).


– Experiments used two different approaches to decision-level fusion.

The first approach selected the emotion that received the highest probability across the three modalities.

The second approach selected the emotion chosen by a majority vote of the three modalities; when no majority could be established (for example, when each unimodal system output a different emotion), the emotion with the highest probability across the three modalities was selected instead.
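The two decision-level fusion rules can be captured in a short function. This is a sketch of the logic as described in the notes, not the original system; the function name and the emotion labels in the example are made up.

```python
from collections import Counter

def fuse_decisions(predictions):
    """predictions: list of (emotion_label, probability) pairs,
    one per modality (e.g. face, gesture, speech).

    Second approach from the notes: majority vote across modalities;
    if no label wins a strict majority, fall back to the first
    approach and pick the label with the highest single probability."""
    labels = [label for label, _ in predictions]
    winner, count = Counter(labels).most_common(1)[0]
    if count > len(predictions) // 2:
        return winner
    # No majority: take the single most confident modality's label.
    return max(predictions, key=lambda p: p[1])[0]

# Two modalities agree, so the majority wins despite a lower probability.
print(fuse_decisions([("anger", 0.6), ("anger", 0.4), ("joy", 0.9)]))  # anger
# All three disagree: fall back to the highest-probability emotion.
print(fuse_decisions([("anger", 0.6), ("fear", 0.4), ("joy", 0.9)]))   # joy
```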

