
Time Invariant Gesture Recognition by Modelling Body Posture Space

    Binu M. Nair and Vijayan K. Asari

Computer Vision and Wide Area Surveillance Laboratory, Electrical and Computer Engineering,

300 College Park Avenue, Kettering Labs KL-302A, University of Dayton, Dayton, OH 45469, USA

{nairb1,vasari1}@udayton.edu
http://visionlab.udayton.edu

Abstract. We propose a framework for recognizing actions or gestures by modelling the variation of the corresponding body postures with respect to each action class, thereby removing the need to normalize for the speed of motion. The three main aspects are a shape descriptor suitable for describing the posture, the formation of a suitable posture space, and a regression mechanism to model the posture variations with respect to each action class. The Histogram of Gradients (HOG) is used as the shape descriptor, with the variations mapped to a reduced Eigenspace by PCA. The mapping of each action class from the HOG space to the reduced Eigenspace is modelled using a GRNN. Classification is performed by comparing points in the Eigenspace with those estimated by each of the action models using the Mahalanobis distance. The framework is evaluated on the Weizmann action dataset and the Cambridge Hand Gesture dataset, providing significant and positive results.

Keywords: Histogram of Gradients (HOG), Generalized Regression Neural Networks (GRNN), Human Action Modelling, Principal Component Analysis (PCA), K-Means Clustering.

    1 Introduction

Human gesture recognition has been a widely researched area over the last few years due to potential applications in the field of security and surveillance. Early research on gesture recognition used the concept of space-time shapes, which are concatenated silhouettes over a set of frames, to extract features corresponding to the variation within the spatio-temporal space. Gorelick et al. [7] modelled the variation within the space-time shape using Poisson's equation and extracted space-time structures which provide discriminatory features. Wang et al. recognized human activities using a derived form of the Radon transform known as the R-Transform [17,16]. A combination of a 3D distance transform and the R-Transform has been used to represent a space-time shape at multiple levels, with the result used as the corresponding action features [11].



Action sequences can also be represented as a collection of spatio-temporal words, with each word corresponding to a set of space-time interest points detected by 2D spatial Gaussian filters and 1D Gabor temporal filters [12]. Here, Niebles et al. compute the probability distribution of the spatio-temporal words corresponding to each class of human action using a probabilistic Latent Semantic Analysis model. A similar algorithm is given by Batra et al., where a dictionary of mid-level features called space-time shapelets is created to characterize the local motion patterns within a space-time shape, thereby representing an action sequence as a histogram of these space-time shapelets over the trained dictionary [2]. However, these methods are susceptible to illumination variation or require good foreground segmentation of the silhouettes.

Another approach is to model the non-linear dynamics of the human action by tracking the trajectories of certain points on the body and capturing features from those trajectories. Ali et al. used concepts from chaos theory to reconstruct the phase space from each of the trajectories and compute dynamic and metric invariants which are then used as action feature vectors [1]. This method is affected by partial occlusions, since some trajectories may be missing, which in turn affects the extracted metrics. Scovanner et al. used a 3D SIFT descriptor to represent spatio-temporal words in a bag-of-words representation of action videos [13]. Sun et al. extended this methodology by combining local descriptors based on SIFT features with holistic moment-based features [15]. The local features comprise the 2D SIFT and 3D SIFT features computed from suitable interest points, and the holistic features are the Zernike moments computed from motion energy images and motion history images. The approach taken there assumes that the scene is static, as it relies on frame differencing to obtain suitable interest points.

A different approach to characterizing human action sequences is to consider the sequences as multi-dimensional arrays called tensors. Kim et al. presented a framework called Tensor Canonical Correlation Analysis, where descriptive similarity features between two video volumes are used in a nearest-neighbour classification scheme for recognition [8]. Lui et al., however, studied the underlying geometry of the tensor space occupied by human action sequences and factorized this space to obtain product manifolds [10]. Classification is done by projecting a video, or tensor, onto this space and classifying it using a geodesic distance measure. Unlike the space-time approach, this type of methodology shows much improved performance on datasets with large variations in illumination and scale. However, classification is done per video sequence and not on a set of frames constituting a part of a video sequence. A 3D gradient-based shape descriptor representing the local variations was introduced by Klaser et al. [9] and is based on the 2D HOG descriptor used for human body detection [5,6]. Here, each space-time shape is divided into cubes, and in each cube a histogram is computed from the spatial and temporal gradients. Chin et al. analysed the modelling of the variation of human silhouettes with respect to time [4]. They studied the use of different dimensionality reduction techniques such as PCA and LLE, and the use of neural networks to model the mapping.


The proposed technique in this paper uses the histogram of spatial gradients in a region of interest, finds an underlying function which captures the temporal variation of these 2D shape descriptors with respect to each action, and classifies a set of contiguous frames irrespective of the speed of the action or the time instant of the body posture.

    2 Proposed Methodology

In this paper, we focus on three main aspects of the action recognition framework. The first is feature extraction, where a shape descriptor is computed for the region of interest in each frame. The second is the computation of an appropriate reduced space which spans the shape-change variations across time. The third is suitable modelling of the mapping from the shape descriptor space to the reduced space. A block diagram illustrating the framework is shown in Figure 1.

    Fig. 1. Gesture Recognition Framework

The histogram of gradients is used as the shape descriptor, as it provides a more local representation of the shape and is partially invariant to illumination. To obtain a reduced-dimensional space in which the inter-frame variation among the HOG descriptors is large, we use Principal Component Analysis. The inter-frame variation between the HOG descriptors is due to the change in shape of a body or hand with respect to the particular action being performed. So, we propose modelling these posture or shape variations with respect to each action, the underlying idea being that posture variations differ across action classes. Modelling actions in this manner removes the need for normalization with respect to time, so a slow or fast performance of the same action class makes no difference. Only the postures of each frame are correspondingly mapped onto a reduced space containing variations in time, thereby making the framework time-invariant. In other words, while classifying a slow action, the posture variations occupy a small part of the manifold compared with a fast action, where the posture variations occupy a large section of the action manifold. Moreover, due to the varying speed of the action across individuals, some of the postures in the action sequence may not be present during the training phase. So, when these particular postures occur in a test sequence of an action, the action model can estimate where that posture lies in the reduced space. This approach gives a more accurate estimate of the corresponding location of that particular shape on the action manifold than an approach which uses nearest neighbours to determine the corresponding reduced posture point.

  • 8/10/2019 IEAAIE_LNAI_2012

    4/10

    Time Invariant Gesture Recognition 127

Fig. 2. HOG descriptor extracted from a binary human silhouette from the Weizmann Database [7]: (a) silhouette, (b) gradient, (c) HOG descriptor

Fig. 3. HOG descriptor extracted from a gray-scale hand image from the Cambridge Hand Gesture Database [8]: (a) hand, (b) gradient, (c) HOG descriptor

In this paper, we use a separate model for each action class, and the modelling is done using a generalized regression neural network, which is a multiple-input multiple-output network.

    2.1 Shape Representation Using HOG

The histogram of gradients is computed by first taking the gradient of the image in the x and y directions and calculating the gradient magnitude and orientation at each pixel. The image is then divided into K overlapping blocks and the orientation range is divided into n bins. In each block, the gradient magnitudes of the pixels belonging to the same orientation bin are added up to form a histogram. The histograms from the various blocks are normalized and concatenated to form the HOG shape descriptor. An illustration of the HOG descriptor extracted from a masked human silhouette image is shown in Figure 2. It can be seen that, since the binary silhouette produces a gradient whose points all lie on the silhouette, the HOG descriptor produces a discriminative shape representation. Moreover, due to the block-wise operation during its computation, this descriptor provides a more local representation of the particular posture or shape. An illustration of the HOG descriptor (first 50 elements) applied to a gray-scale hand image is shown in Figure 3.


Unlike with the binary image, there is some noise in the gradient image which gets reflected in the HOG descriptor. Since the HOG descriptor is partially invariant to illumination, we can assume that under varying illumination conditions the feature descriptors do not vary much.
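A minimal sketch of the block-wise HOG computation described above, assuming NumPy and a 2D grayscale (or binary silhouette) input. The 7 x 7 grid, 9 orientation bins and per-block L2 normalization follow the parameters given in Section 3; the non-overlapping block layout and unsigned orientations are simplifying assumptions, since the paper uses overlapping blocks.

```python
import numpy as np

def hog_descriptor(image, grid=(7, 7), n_bins=9):
    """Block-wise histogram of gradients (simplified, non-overlapping blocks)."""
    # Spatial gradients, magnitude and (unsigned) orientation at every pixel.
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)            # range [0, pi)

    # Quantize orientations into n_bins bins.
    bin_idx = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)

    rows = np.array_split(np.arange(image.shape[0]), grid[0])
    cols = np.array_split(np.arange(image.shape[1]), grid[1])

    blocks = []
    for r in rows:
        for c in cols:
            mag = magnitude[np.ix_(r, c)].ravel()
            bins = bin_idx[np.ix_(r, c)].ravel()
            # Accumulate gradient magnitudes per orientation bin.
            hist = np.bincount(bins, weights=mag, minlength=n_bins)
            hist /= (np.linalg.norm(hist) + 1e-8)              # L2 normalization per block
            blocks.append(hist)
    return np.concatenate(blocks)                              # 7 * 7 * 9 = 441 elements
```

With the parameters above, a single frame yields a 441-element descriptor, matching the feature size reported in Section 3.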

    2.2 Computation of Reduced Posture Space Using PCA

The next step in the framework is to determine an appropriate space which represents the inter-frame variation of the HOG descriptors. An illustration of the reduced posture or shape space obtained using PCA with three Eigenvectors is shown for the Weizmann dataset and the Cambridge Hand dataset in Figure 4. Each action class of the reduced posture points shown in Figure 4 is color-coded to illustrate how close the action manifolds are and the separability existing between them. We can see that there is considerable overlap between different action manifolds in the reduced space, and our aim is to use a functional mapping for each manifold to distinguish between them. We first collect the HOG descriptors from all the possible postures of the body, irrespective of the action class, and form what is known as the action space, denoted by $S_D$. We can express the action space mathematically as

$$S_D = \{\, \mathbf{h}_{k,m} : 1 \le k \le K(m),\ 1 \le m \le M \,\} \qquad (1)$$

where $K(m)$ is the number of frames taken over all the training video sequences for action $m$ out of $M$ action classes, and $\mathbf{h}_{k,m}$ is the corresponding HOG descriptor of dimension $D \times 1$. The reduced action or posture space is obtained by extracting the principal components of the matrix $HH^T$, where $H = [\mathbf{h}_{1,1}\ \mathbf{h}_{2,1}\ \cdots\ \mathbf{h}_{K(M),M}]$, using PCA. This is done by finding the Eigenvectors, or Eigenpostures, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d$ corresponding to the largest variances between the HOG descriptors. In this reduced space, the inter-frame variation between the extracted HOG descriptors due to changes of body shape (due to the motion or the action) is maximized by selecting an appropriate number of Eigenpostures while, at the same time, the effect of noise due to illumination is reduced by removing the Eigenpostures with low Eigenvalues.

Fig. 4. Reduced posture space for the HOG descriptors extracted from video sequences: (a) Weizmann dataset, (b) Cambridge Hand dataset


In other words, the Eigenvectors with the highest Eigenvalues correspond to the directions along which the variance between the HOG descriptors due to posture or shape change is maximal, while the Eigenvectors with lower Eigenvalues can be considered directions corresponding to noise in the HOG shape descriptor.
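A sketch of how the action space and the reduced posture space could be built, assuming NumPy and that all training HOG descriptors are stacked row-wise. The eigen-decomposition follows Eq. (1) and the surrounding text (here computed via an SVD of the centred data, which yields the same Eigenpostures as $HH^T$); the mean-centring step and the choice of $d$ are implementation assumptions.

```python
import numpy as np

def build_posture_space(hog_descriptors, d=3):
    """Compute d Eigenpostures and the reduced posture points p_{k,m}.

    hog_descriptors: array of shape (num_frames, D), pooled over all actions.
    Returns the mean descriptor, the Eigenpostures (D x d) and the projections.
    """
    H = np.asarray(hog_descriptors, dtype=np.float64)
    mean = H.mean(axis=0)
    Hc = H - mean                                    # centre the descriptors

    # Principal components of the descriptor covariance via SVD of the
    # centred data matrix; rows of Vt are the Eigenpostures v_1 ... v_d.
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    eigenpostures = Vt[:d].T                         # shape (D, d)

    # Reduced posture points: projection of each frame's descriptor.
    projections = Hc @ eigenpostures                 # shape (num_frames, d)
    return mean, eigenpostures, projections
```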

2.3 Modelling of Mapping from HOG Space to Posture Space Using GRNN

The mapping from the HOG descriptor space ($D \times 1$) to the reduced posture or shape space ($d \times 1$) can be represented as $S_D \rightarrow S_d$, where $S_d = \{\, \mathbf{p}_{k,m} : 1 \le m \le M \,\}$ and $\mathbf{p}$ is a vector representing a point in the reduced posture space. In this framework, we aim to model the mapping from the HOG space to the posture space for each action $m$ separately using the Generalized Regression Neural Network [14,3]. This network is a one-pass learning algorithm which provides fast convergence to the optimal regression surface. It is memory intensive, as it requires the storage of the training input and output vectors, with each node in the first layer associated with one training point. The network models an equation of the form

$$\hat{\mathbf{y}} = \frac{\sum_{i=1}^{N} \mathbf{y}_i\, \mathrm{radbasis}(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{N} \mathrm{radbasis}(\mathbf{x} - \mathbf{x}_i)}$$

where $(\mathbf{y}_i, \mathbf{x}_i)$ are the training input/output pairs and $\hat{\mathbf{y}}$ is the estimated point for the test input $\mathbf{x}$.

In our algorithm, since many training points are present, a large number of nodes would have to be implemented for each class, which is not memory efficient. To obtain suitable training points that mark the transitions in the posture space for a particular action class $m$, k-means clustering is applied to obtain $L(m)$ clusters. The mapping of the HOG descriptor space to its reduced space for a particular action class $m$ can then be modelled by a general regression equation given as

$$\mathbf{p} = \frac{\sum_{i=1}^{L(m)} \mathbf{p}_{i,m}\, \exp\!\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}{\sum_{i=1}^{L(m)} \exp\!\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}, \qquad D_{i,m}^2 = (\mathbf{h} - \mathbf{h}_{i,m})^T (\mathbf{h} - \mathbf{h}_{i,m}) \qquad (2)$$

where $(\mathbf{p}_{i,m}, \mathbf{h}_{i,m})$ are the $i$th cluster centres in the posture space and the HOG descriptor space, respectively. The standard deviation $\sigma$ for each action class is taken as the median Euclidean distance between that action's cluster centres. The action class is determined by first projecting a consecutive set of $R$ frames onto the Eigenpostures. These projections of the frames, given by $\mathbf{p}_r : 1 \le r \le R$, are compared with the estimated projections $\mathbf{p}_r^{(m)}$ of the corresponding frames produced by each GRNN action model using the Mahalanobis distance. The action model which gives the closest estimates of the projections is selected as the action class.
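A sketch of one per-action GRNN model implementing Eq. (2), assuming NumPy and that the paired cluster centres $(\mathbf{h}_{i,m}, \mathbf{p}_{i,m})$ have already been obtained by k-means clustering in the HOG space and the posture space. The sigma rule follows the text (median Euclidean distance between the action's cluster centres); the class name and interface are illustrative.

```python
import numpy as np

class ActionGRNN:
    """Per-action GRNN of Eq. (2): maps a HOG descriptor to a posture-space point."""

    def __init__(self, hog_centres, posture_centres):
        # Paired cluster centres (h_{i,m}, p_{i,m}) for this action class.
        self.h = np.asarray(hog_centres, dtype=np.float64)       # (L, D)
        self.p = np.asarray(posture_centres, dtype=np.float64)   # (L, d)
        # Sigma: median Euclidean distance between this action's HOG cluster centres.
        diffs = self.h[:, None, :] - self.h[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        self.sigma = np.median(dists[np.triu_indices(len(self.h), k=1)])

    def estimate(self, hog):
        """Estimate the reduced posture point for one HOG descriptor (Eq. 2)."""
        d2 = ((self.h - hog) ** 2).sum(axis=1)                    # D^2_{i,m}
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))                 # radial-basis weights
        return (w[:, None] * self.p).sum(axis=0) / (w.sum() + 1e-12)
```

Each action class gets its own `ActionGRNN`, so a test frame can be run through all M models and the estimates compared against its actual projection.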


Fig. 5. Datasets used for testing: (a) Weizmann dataset [7], (b) Cambridge Hand Gesture dataset [8]

    3 Experiments and Results

The algorithm presented in this paper has been evaluated on two datasets, the Weizmann Human Action dataset [7] and the Cambridge Hand Gesture dataset [8]. The histogram of gradients feature descriptor is extracted by dividing the detection region into 7 x 7 overlapping cells. From each cell, a histogram of gradients is computed with 9 orientation bins and normalized by taking the L2 norm, and the normalized histograms are concatenated to form a feature vector of size 441 x 1.

3.1 Weizmann Action Dataset

This action dataset consists of 10 action classes, each containing 9-10 samples performed by different people. The background in these video sequences is static with uniform lighting at low resolution, so silhouettes of the person can be extracted by simple background segmentation. HOG features computed from these silhouettes represent the shape or posture of the silhouette at one particular instant. During the training phase, all the frames of every training sequence of each class are taken together to obtain the HOG feature set for that action class. The test sequence is split into overlapping windows (partial sequences) of size N with an overlap of N - 1. The HOG features of each frame of the window are compared with the features estimated for that frame by each action class model using the Mahalanobis distance, and the overall distance for each class is computed by taking the L2 norm of the per-frame distances. The action model which gives the minimum final distance to the test partial sequence is determined to be its action class. Table 1 gives the results for the framework with a GRNN having 10 clusters and a window size of 20 frames. Testing is done using a leave-10-out procedure, where 10 sequences, each corresponding to a particular action class, are taken as the testing set while the remaining sequences are used for training. The variation of the overall accuracy for window sizes of 10, 12, 15, 18, 20, 23, 25, 28 and 30 frames for the test partial sequences is shown in Figure 6.
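A sketch of the sliding-window classification described above, assuming the `ActionGRNN` models and the `build_posture_space` outputs from the earlier sketches, plus a precomputed inverse covariance for the Mahalanobis distance; the paper does not specify how that covariance is estimated, so it is treated here as a given input.

```python
import numpy as np

def classify_window(window_hogs, mean, eigenpostures, models, inv_cov):
    """Classify one window of R consecutive frames, following Section 3.1.

    window_hogs: (R, D) HOG descriptors of the window's frames.
    models: dict mapping action label -> ActionGRNN (one model per class).
    inv_cov: (d, d) inverse covariance of the posture space (assumed given).
    """
    # Actual projections p_r of the window's frames onto the Eigenpostures.
    projections = (window_hogs - mean) @ eigenpostures

    scores = {}
    for label, model in models.items():
        # Projections estimated by this class's GRNN model for the same frames.
        estimates = np.array([model.estimate(h) for h in window_hogs])
        diff = projections - estimates
        # Mahalanobis distance per frame, combined over the window with an L2 norm.
        frame_dists = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))
        scores[label] = np.linalg.norm(frame_dists)

    return min(scores, key=scores.get)               # closest action model wins
```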


Table 1. Confusion matrix for the Weizmann actions, summarised per class: correct classification rate (%) and the off-diagonal confusion (%) present in that row

    a1  bend          99
    a2  jplace       100
    a3  jack         100
    a4  jforward      99
    a5  run           98   (1 confused)
    a6  side          90   (9 confused)
    a7  wave1         97   (2 confused)
    a8  skip          97   (2 confused)
    a9  wave2         96   (3 confused)
    a10 walk          95   (4 confused)

Fig. 6. Average accuracy computed for the action classes for window sizes 10, 12, 15, 18, 20, 23, 25, 28 and 30

    3.2 Cambridge Hand Gesture Dataset

The dataset contains three main action classes showing different postures of the hand: flat, spread out and V-shape. Each main class has three sub-classes which differ in the direction of movement, so in total we have 9 different action classes which differ in the posture of the hand as well as its direction of motion. The main challenge is to differentiate between the different motions and shapes under different illumination conditions. The dataset is shown in Figure 5(b). There are 5 sets, each containing different illuminations of all the action classes, with each class having 20 sequences. From each video sequence, we applied skin segmentation to get a rough region of interest and extracted the HOG-based shape descriptor from the gray-scale detection region. Unlike the descriptors extracted from silhouettes in the Weizmann dataset, these descriptors contain noise variations due to the different illumination conditions. The testing strategy is the same as for the Weizmann dataset, with leave-9-out video sequences


Table 2. Results for the Cambridge Hand Gesture dataset

(a) Per-class correct classification rate (%, diagonal of the confusion matrix):
    a1: 94.0   a2: 91.0   a3: 95.0   a4: 91.0   a5: 85.0   a6: 99.0   a7: 83.0   a8: 86.0   a9: 77.0

(b) Overall accuracy (%) for each set:
    Set 1: 96.11   Set 2: 73.33   Set 3: 70.00   Set 4: 86.67   Set 5: 87.72

where each test sequence corresponds to an action class. The confusion matrix for the action classes obtained from the framework with 4 clusters is summarised in Table 2(a). We can see that if all the illumination conditions are included in the training, the overall accuracy obtained with the framework is high. Using the same testing strategy, we also tested the overall accuracy for each set separately; these results are given in Table 2(b). For set 1, the overall accuracy is high, as the non-uniform lighting does not affect the feature vectors and noise is diminished by the partial illumination invariance of the HOG descriptor. Sets 4 and 5 show moderate accuracies, while sets 2 and 3 give only average overall accuracy.

    4 Conclusions and Future Work

In this paper, we presented a framework for recognizing actions from partial video sequences which is invariant to the speed of the action being performed. We illustrated this approach using the Histogram of Gradients shape descriptor and computed the mapping from the HOG space to the reduced-dimensional posture space using Principal Component Analysis. The mapping from the HOG space to the reduced posture space for each action class is learned separately using a Generalized Regression Neural Network. Classification is done by projecting the HOG descriptors of the partial sequence onto the posture space and comparing the reduced-dimensional representation with the postures estimated by the GRNN action models using the Mahalanobis distance. The results show the accuracy of the framework, as illustrated on the Weizmann database. However, when gray-scale images are used to compute the HOG, severe illumination conditions can affect the framework, as illustrated by the Hand Gesture database results. In future work, we plan to extract a shape descriptor which represents a shape by a set of corner points whose relationships are determined at spatial and temporal scales. Other regression and classification schemes will also be investigated within this framework.


    References

1. Ali, S., Basharat, A., Shah, M.: Chaotic invariants for human action recognition. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1-8 (October 2007)

2. Batra, D., Chen, T., Sukthankar, R.: Space-time shapelets for action recognition. In: IEEE Workshop on Motion and Video Computing, WMVC 2008, pp. 1-6 (January 2008)

3. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer (2006), corr. 2nd printing edn. (October 2007)

4. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: IEEE International Conference on Image Processing, ICIP 2007, vol. 1, pp. 381-384 (October 2007)

5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886-893 (June 2005)

6. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 428-441. Springer, Heidelberg (2006)

7. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12), 2247-2253 (2007)

8. Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis for action classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1-8 (June 2007)

9. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the British Machine Vision Conference (BMVC 2008), pp. 995-1004 (September 2008)

10. Lui, Y.M., Beveridge, J., Kirby, M.: Action classification on product manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 833-839 (June 2010)

11. Nair, B., Asari, V.: Action recognition based on multi-level representation of 3D shape. In: Proceedings of the International Conference on Computer Vision Theory and Applications, pp. 378-386 (March 2010)

12. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: British Machine Vision Conference, BMVC 2006 (2006)

13. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings of the International Conference on Multimedia (MultiMedia 2007), pp. 357-360 (September 2007)

14. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568-576 (1991)

15. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors and holistic features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 58-65 (June 2009)

16. Tabbone, S., Wendling, L., Salmon, J.: A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding 102, 42-51 (2006)

17. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on R transform. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1-8 (June 2007)