
Learning and Association of Features for Action Recognition in Streaming Video

Binu M Nair and Vijayan K Asari

ECE Department, University of Dayton, Dayton, OH, USA
{nairb1,vasari1}@udayton.edu

Abstract. We propose a novel framework which learns and associates local motion pattern manifolds in streaming videos using generalized regression neural networks (GRNN) to facilitate real-time human action recognition. The motivation is to determine an individual's action even when the action cycle has not yet been completed. The GRNNs are trained to model the regression function of patterns in the latent action space on the input local motion-shape patterns. This manifold learning makes the framework invariant to different sequence lengths and varying action states. The latent action basis is computed using EOF analysis, and the association of local temporal patterns to an action class at runtime follows a probabilistic formulation: a test sequence is assigned to the action class for which the GRNN estimate is closest to the corresponding action basis projection. Experimental results on two datasets, KTH and UCF Sports, show accuracies above 90% obtained from only 15 to 25 frames.

1 Introduction

Human action recognition is a very broad research area where different issues such as viewpoint, scale variation and illumination conditions are tackled. Most of the research done in recent years varies in how the motion features are computed and in how the recognition framework is designed. Some recent algorithms focus on combining various attributes and motion of spatio-temporal interest points to determine what action is being performed in YouTube videos, Google videos or action movie clips, as illustrated in the HMDB dataset. This is geared towards solving the problem of automatic video annotation in wild action datasets for unconstrained video search applications. However, a crucial issue which has not been given due importance is real-time recognition of a human action or activity from streaming surveillance footage. Tackling this specific issue requires that the learning mechanism learn the underlying continuous temporal structure and be able to associate different temporal scales of the action cycle within a specific temporal window. These variations occur due to the different speeds at which a person performs a particular activity, resulting in variable action cycle lengths within a fixed temporal window. In this manuscript, we propose a novel human action recognition framework (Figure 1) which computes a motion and shape feature set in each frame and learns the non-linear manifold in a time-independent manner.


(a) Training of the action basis for the m-th action class and subsequent learning of the GRNN model.

(b) Testing of a streaming sequence by projection of test action features onto each action basis and comparison with GRNN estimations.

Fig. 1: Action Recognition Framework

Every action class has a non-linear manifold embedded in it, which can be represented by a time-independent orthogonal action basis and a learned generalized regression neural network (GRNN). A probabilistic formulation is used to associate a streaming sequence to one of the trained manifolds by computing the per-frame discrepancy between the GRNN estimates and the action basis feature projections.

2 Related Work

By detecting sparse features or interest points on video sequences, a robust representation can be obtained by computing a histogram of these sparse features in a bag-of-words model. Following this paradigm, a well-known interest point detector known as the Spatio-Temporal Interest Point (STIP) was proposed by Laptev et al. for detecting interesting events in video sequences and was later extended to classify actions [4]. Recent progress in human action and activity recognition uses these STIP points and the bag-of-words model as low-level features in complex learning frameworks. Wang et al. [13] developed a contextual descriptor for each STIP which describes its local neighborhood and the arrangement of points using a probabilistic model. Yuan et al. [16] computed the 3D-R transform to capture the global arrangement of the STIP points and proposed a contextual SVM kernel for sequence classification. However, these algorithms focus mainly on automatic video annotation and are not necessarily designed for real-time action recognition.


(a) Hierarchical division for L = 2. (b) Illustration of the LBP operator on the directional pattern.

Fig. 2: Extraction of Motion and Shape features between two consecutive frames.


The proposed algorithm follows the paradigm of extracting the temporal structure of the action, i.e. modeling the per-frame features with respect to time. Chin et al. [1] analyzed how human silhouettes vary with respect to time. Shao and Chen [11] explored a subspace learning methodology known as Spectral Regression Discriminant Analysis to learn the variation in human body poses described by masked silhouettes. Saghafi and Rajan [9] proposed an embedding technique based on spatio-temporal correlation distance for action sequences. Our earlier works [6, 7] proposed time-invariant approaches to characterize body posture for classifying human actions, but these required a good segmentation to remove features corresponding to the background. Here, the learning mechanism follows the GRNN/EOF model paradigm from our previous work; the key differences are the feature modeling computed within a bag of features and the probabilistic manifold matching scheme.

3 Feature Extraction

Two kinds of features are computed at each frame: shape features and motion features, which represent the pose and the motion of an individual respectively. The shape features are computed by applying the R-Transform (RT) on motion history images (MHI). This provides a shape characteristic profile of an action at a specific instant. The motion features at a frame consist of two different feature sets computed from the optical flow field: the Histogram of Flow (HOF) and the Local Binary Flow Patterns (LBFP). No prior segmentation is used to mask out features, as was done in [6, 7]. Let the optical flow field be represented as (A(p), α(p)) at each pixel p. θ(p) is the quantized (discretized) version of the optical flow direction α(p), given by θ(p) = −π + (2π/B)·b(p), where b(p) is the bin index of α(p) and B is the number of bins spanning the flow direction.
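As a concrete illustration of this quantization step, the sketch below (our own, not the authors' code) computes a dense optical flow field with OpenCV's Farneback method and discretizes its direction into B bins; the function name, parameter defaults and the bin-index helper b(p) are assumptions.

    import cv2
    import numpy as np

    def quantize_flow_direction(prev_gray, curr_gray, B=8):
        """Compute dense optical flow (A(p), alpha(p)) and quantize the
        direction into B bins: theta(p) = -pi + (2*pi/B) * b(p)."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        A = np.linalg.norm(flow, axis=2)                # magnitude A(p)
        alpha = np.arctan2(flow[..., 1], flow[..., 0])  # direction alpha(p) in (-pi, pi]
        # Bin index b(p) of each direction, then its discretized value theta(p).
        b = np.clip(((alpha + np.pi) * B / (2 * np.pi)).astype(int), 0, B - 1)
        theta = -np.pi + (2 * np.pi / B) * b
        return A, theta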

To compute features at different scales, we divide the region of interest into


different sub-regions in a pyramidal fashion as shown in Figure 2(a). Each local region gives a coarse representation of the motion associated with a body part, where the level l gives the extent of division. As we divide the region further, the effective number of bins at level l is computed as B(l) = B / 2^l, so that only coarse variations of flow in local regions are considered. At a single level l and for each sub-region, the feature vectors HOF (h_l), LBFP (lb_l) and RT (r_l) are computed. With regard to the motion features, the HOF represents only the distribution of first-order motion variation and does not provide any indication of the local arrangement of the flow vectors. Thus, we propose a motion descriptor known as Local Binary Flow Patterns (LBFP) which encodes the flow direction in a way that brings out the "flow texture". This textural information can also be interpreted as contextual information that the neighboring flow pixels provide to the center pixel in local regions of the body. "Flow texture" can then be defined as the second-order variation of the optical flow between two instants in a local neighborhood. We apply a variant of the LBP (g(p_c)) [5] on the flow direction to characterize this flow texture, where P is the number of neighbors around pixel p_c and R is the neighborhood size.

g(p_c) = Σ_{i=0}^{P−1} 2^i s(θ(p_c) − θ(p_i)),   s(z) = 1 if z = 0, 0 otherwise.   (1)

The LBFP (lb_l) motion feature is the histogram of the LBP-encoded (g_l) directional flow image. So, the action feature vectors computed at frame t are H(t) = [h_1, h_2, ..., h_L], LB(t) = [lb_1, lb_2, ..., lb_L] and R(t) = [r_1, r_2, ..., r_L], where L is the number of levels in the hierarchy and each h_l, lb_l, r_l concatenates the features of all sub-regions at level l. Assuming independence between the selected feature sets, we fuse them into a single action feature X = [H(t), LB(t), R(t)], where X ∈ R^D.
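To make the per-sub-region motion features concrete, the sketch below (our own hypothetical helpers; the 8-neighbour, radius-1 layout for Eq. (1) is an assumption) computes a magnitude-weighted HOF histogram over B(l) bins and the LBFP histogram from the quantized direction image of one sub-region.

    import numpy as np

    def hof(theta_q, magnitude, n_bins):
        """Histogram of Flow over n_bins = B(l) coarse direction bins,
        weighted by the flow magnitude."""
        edges = np.linspace(-np.pi, np.pi, n_bins + 1)
        hist, _ = np.histogram(theta_q, bins=edges, weights=magnitude)
        return hist / (hist.sum() + 1e-8)

    def lbfp(theta_q):
        """Local Binary Flow Pattern: Eq. (1) applied to the quantized flow
        direction image (P = 8 neighbours, R = 1), then a histogram of codes."""
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        centre = theta_q[1:-1, 1:-1]
        code = np.zeros_like(centre, dtype=np.uint8)
        for i, (dy, dx) in enumerate(offsets):
            nb = theta_q[1 + dy:theta_q.shape[0] - 1 + dy,
                         1 + dx:theta_q.shape[1] - 1 + dx]
            # s(z) = 1 when the neighbour lies in the same direction bin as the centre.
            code |= (np.isclose(centre, nb).astype(np.uint8) << i)
        hist, _ = np.histogram(code, bins=np.arange(257))
        return hist / (hist.sum() + 1e-8)

    # Per-frame fusion (R-Transform features omitted for brevity):
    # X = np.concatenate(H_blocks + LB_blocks + R_blocks)   # X in R^D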

3.1 Feature Selection using Symmetrical Uncertainty

Due to the high dimensionality of the per-frame feature vector X, we identify the relevant and non-redundant subset of the features using a modified feature selection technique based on symmetrical uncertainty [15]. The symmetrical uncertainty between random variables (Z_1, Z_2) is given by SU(Z_1, Z_2) = 2 IG(Z_1|Z_2) / (H(Z_1) + H(Z_2)), where IG(Z_1|Z_2) = H(Z_1) − H(Z_1|Z_2) is the information gain and H(Z_1), H(Z_2) are the corresponding entropy values. The higher the symmetrical uncertainty, the higher the information gain and the more correlated Z_2 is to Z_1. Consider the action feature set S = {z_d : 1 ≤ d ≤ D, z_d ∈ R^{N×1}} with N observations, N_m being the number of observations of class m such that Σ_m N_m = N, and D features, where these observations are accumulated across the action classes. The objective is to select a subset S′ = {z_d′ : z_d′ ∈ S, z_d′ ∈ R^{N×1}, 1 ≤ d′ ≤ D′, D′ ≤ D} ⊆ S of relevant and non-redundant features accumulated over all action classes. This feature selection is applied to the action feature X ∈ R^D to obtain X′ ∈ R^{D′}; the procedure is as follows (a code sketch follows the list).

1. Compute SU(z_d, C) of each feature z_d with the vector of class ids C = [1 .. m .. M]^T over all observations, where the labels belonging to class m form a sub-vector in R^{N_m×1}.


2. Form the set S′ = {z_d : SU(z_d, C) ≥ δ_f, z_d ∈ R^{N×1}}, which contains the relevant features.

3. Split the subset S′ with respect to each class, i.e. form S′_m = {z_d : z_d ∈ S′, z_d ∈ R^{N_m×1}}, where ∪_m S′_m = S′.

4. Compute SU(z_d, z_d′) between each pair of features within each class, where d ≠ d′.

5. Form the set S′_{m+} = {z_d : z_d ∈ S′_m, SU(z_d, z_d′) ≥ SU(z_d, C) for some d′ ≠ d}, which contains the redundant intra-class features.

6. Form the set S′_{m−} = (S′_{m+})^c, which contains the non-redundant features with respect to each class m. Combine these to form the final action feature set, i.e. ∪_m S′_{m−} = S″.
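Below is a minimal sketch of steps 1-2 (relevance filtering). It assumes the features have been discretized (e.g., by equal-width binning) so that the entropies can be estimated from counts, and the helper names are our own; steps 3-6 (intra-class redundancy removal) follow the same pattern per class.

    import numpy as np

    def entropy(x):
        """Shannon entropy of a discrete variable from empirical counts."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def symmetrical_uncertainty(z1, z2):
        """SU(Z1, Z2) = 2 * IG(Z1|Z2) / (H(Z1) + H(Z2)),
        with IG(Z1|Z2) = H(Z1) - H(Z1|Z2)."""
        h1, h2 = entropy(z1), entropy(z2)
        h1_given_2 = sum(np.mean(z2 == v) * entropy(z1[z2 == v])
                         for v in np.unique(z2))
        return 2.0 * (h1 - h1_given_2) / (h1 + h2 + 1e-12)

    def relevant_features(Z, C, delta_f):
        """Z: (N, D) discretized feature matrix, C: (N,) class labels.
        Keeps features whose SU with the class labels is at least delta_f."""
        su = np.array([symmetrical_uncertainty(Z[:, d], C)
                       for d in range(Z.shape[1])])
        return np.where(su >= delta_f)[0], su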

4 Learning: Computation and Modeling of Action Manifold

A basis for each action class can be computed by considering its corresponding set of per-frame features as a bag-of-words model. Within this bag of features, we can treat the features as time-series data from which suitable time-independent basis functions can be extracted. Nair et al. [7] analyzed time-series data using Empirical Orthogonal Function (EOF) analysis [2], where the data is represented as a linear combination of time-independent orthogonal basis functions. According to EOF analysis, a time series x(t) ∈ R^D for 1 ≤ t ≤ T, where T is the number of frames, can be written as x(t) = Σ_{d=1}^{D} a_d(t) e_d, where e_d ∈ R^D are the time-independent orthogonal basis functions (EOFs) and a(t) = [a_1(t) a_2(t) ... a_D(t)] ∈ R^D are the time-dependent coefficients. Each action class m has its own time-independent basis functions, which define the underlying lower-dimensional latent action manifold. To learn this manifold for class m, we accumulate the action features from all frames of the training sequences and form S_x(m) = {x_n : 1 ≤ n ≤ N_m, x_n ∈ R^D}, where N_m is the number of accumulated observations from class m. By performing SVD of the covariance matrix E[XX^T], where X = [x_1 x_2 ... x_{N_m}], we obtain the EOF basis functions E_m = [e_1 e_2 ... e_{d_m}], which we term the Eigenaction basis. The projections of the action feature vectors x from the set S_x(m) onto the Eigenaction basis E_m give the set of coefficients S_a(m) = {a_n : 1 ≤ n ≤ N_m, a_n ∈ R^{d_m}}, which forms the low-dimensional manifold. The dimensionality d_m of the manifold of each class is selected as the smallest value satisfying Σ_{d=1}^{d_m} λ_d / Σ_{d=1}^{D} λ_d > δ_m, where λ_d are the eigenvalues.
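The Eigenaction basis amounts to a PCA of the accumulated class features; a minimal sketch is shown below. The mean-centering step and the variable names are our own assumptions (the paper does not state centering explicitly).

    import numpy as np

    def eigenaction_basis(X_m, delta_m=0.99975):
        """X_m: (N_m, D) accumulated per-frame features of class m.
        Returns the basis E_m (D, d_m), the coefficients A_m (N_m, d_m)
        and the class mean."""
        mean = X_m.mean(axis=0)
        Xc = X_m - mean
        # SVD of the centred data yields the eigenvectors of the covariance.
        _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        eigvals = (s ** 2) / max(X_m.shape[0] - 1, 1)
        # Smallest d_m whose cumulative eigenvalue energy exceeds delta_m.
        energy = np.cumsum(eigvals) / eigvals.sum()
        d_m = int(np.searchsorted(energy, delta_m) + 1)
        E_m = Vt[:d_m].T          # Eigenaction basis e_1 ... e_{d_m}
        A_m = Xc @ E_m            # projections a_n forming the manifold
        return E_m, A_m, mean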

Modeling of an action manifold requires characterizing its surface using suitable transition points and finding an approximation. One way is to find clusters along the surface of the manifold. Using the bag-of-features model, we compute code-words or clusters with the kmeans++ algorithm. These code-words not only approximate the manifold but also provide suitable transition points. To learn the surface of the manifold, we need to learn the transition from one code-word to the next, and this is possible using a Generalized Regression Neural Network (GRNN) [12]. The main advantages of using a GRNN are its fast training scheme, requiring only a single pass over the training data, and its guaranteed convergence to the optimal regression surface.


(a) Illustration of the GRNN network. (b) Illustration of test sequence association.

Fig. 3: GRNN network and classification of a streaming test sequence.

Let the set of code-words of an action model m be S_x^c(m) = {x_k : 1 ≤ k ≤ K(m), x_k ∈ R^D} ⊂ S_x(m), and its corresponding projections onto the basis be S_a^c(m) = {a_k : 1 ≤ k ≤ K(m), a_k ∈ R^{d_m}} ⊂ S_a(m), where K(m) is the number of clusters. The GRNN (Figure 3(a)) learns the mapping S_x → S_a by storing the code-words and the corresponding projections as the input and output weights. The estimate y(t) = E[a_test(t)] = [y_1 ... y_{d_m}] of a test action feature x_test(t) at an instant t on the action manifold m is given by

y_d = Σ_{k=1}^{K(m)} a_{k,d} exp(−(x_test − x_k)^T (x_test − x_k) / (2 δ_x^2)) / Σ_{k=1}^{K(m)} exp(−(x_test − x_k)^T (x_test − x_k) / (2 δ_x^2))   (2)
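A small sketch of this modeling step is given below: code-words are obtained with k-means++ (here via scikit-learn's KMeans, an assumed choice) and Eq. (2) is evaluated as a Gaussian-weighted average of the stored projections. The bandwidth handling is our own, and eigenaction_basis refers to the earlier sketch.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_grnn(X_m, E_m, mean, K=250, random_state=0):
        """Cluster the class-m features into K code-words (k-means++ init) and
        store the code-words and their basis projections as GRNN weights."""
        km = KMeans(n_clusters=K, init="k-means++", n_init=10,
                    random_state=random_state).fit(X_m)
        codewords = km.cluster_centers_            # input weights x_k
        projections = (codewords - mean) @ E_m     # output weights a_k
        return codewords, projections

    def grnn_estimate(x_test, codewords, projections, delta_x=1.0):
        """Eq. (2): kernel-weighted estimate of the manifold coefficients."""
        d2 = np.sum((codewords - x_test) ** 2, axis=1)   # squared distances to x_k
        w = np.exp(-d2 / (2.0 * delta_x ** 2))           # Gaussian kernel weights
        return (w[:, None] * projections).sum(axis=0) / (w.sum() + 1e-12)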

5 Inference: Association of Test Sequences and Classification

Consider a test action sub-sequence with R frames and action feature set X_test = [x_1^test ... x_R^test]^T. The classification involves the following steps: 1) estimation of the action feature set with respect to each action manifold m by its respective GRNN model to obtain Y_test(m) = [y_1^test ... y_R^test]^T; 2) projection of the action feature set onto the action basis E_m to get A_test(m) = [a_1^test ... a_R^test]^T.

We first estimate the class of the features computed from a single frame. Intuitively, if the action feature x_r^test at the r-th frame belongs to class m*, the difference between the GRNN estimate y_r^test(m) and the basis projection a_r^test(m) should be minimal for m = m* and large for m ≠ m*. This measure can be formulated as a likelihood function given by

Prob(x_r^test | C_m) = Prob(x_r^test | a_r^test(m), ∆_a(m))   (3)


(a) Variation with window sizes. (b) Effectiveness of feature selection.

Fig. 4: Accuracy obtained with the proposed algorithm.

C(x_r^test) = argmax_m exp(−(x_r^test − a_r^test)^T ∆_A^{−1} (x_r^test − a_r^test))   (4)

where ∆_A(m) is the covariance of the manifold m and a_r^test is the corresponding mean of the local neighborhood. The class estimate of the corresponding frame, C(x_r^test), is the maximizer of the likelihood function Prob(x_r^test | C_m). This is illustrated in Figure 3(b). By treating the class estimate of a frame as a random variable, we obtain the final class estimate of the partial sequence by computing the probabilities of the per-frame class estimates and finding the mode. This is given by

Prob(X_test | A_test(m), ∆_a(m)) ∼ N(X_test | A_test(m), ∆_a(m))   (5)

C(X_test) = argmax_m Prob(C(x^test) = m)   (6)
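The association and classification steps can be sketched as follows (our own simplification: a diagonal manifold covariance, a dictionary of per-class models, and reuse of grnn_estimate from the sketch in Section 4). Each frame is scored with the likelihood of Eqs. (3)-(4) using the discrepancy between the GRNN estimate and the basis projection, and the sequence label is the mode of the per-frame estimates, as in Eqs. (5)-(6).

    import numpy as np

    def classify_subsequence(X_test, models):
        """X_test: (R, D) features of a streaming sub-sequence.
        models: per-class dict with keys 'E', 'mean', 'codewords',
        'projections' and 'cov' (diagonal manifold covariance)."""
        per_frame = []
        for x in X_test:
            scores = {}
            for m, mdl in models.items():
                a = (x - mdl["mean"]) @ mdl["E"]        # basis projection a_r(m)
                y = grnn_estimate(x, mdl["codewords"],  # GRNN estimate y_r(m)
                                  mdl["projections"])
                diff = y - a
                # Gaussian-style likelihood of Eq. (4), diagonal covariance.
                scores[m] = np.exp(-0.5 * np.sum(diff ** 2 / (mdl["cov"] + 1e-12)))
            per_frame.append(max(scores, key=scores.get))   # per-frame argmax, Eq. (4)
        # Final label: mode of the per-frame class estimates, Eqs. (5)-(6).
        labels, counts = np.unique(per_frame, return_counts=True)
        return labels[np.argmax(counts)]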

6 Experimental Results and Evaluations

The proposed algorithm has been tested on the KTH dataset [10] and the UCF Sports dataset [8]. The design of the algorithm requires setting three different parameters: the number of code-words K(m), the feature selection threshold δ_f and the action basis threshold δ_m. The algorithm is evaluated for accuracy at different sequence lengths on both datasets.

6.1 KTH Dataset

KTH is a low-resolution dataset consisting of 2400 sequences spanning 6 actions (boxing, hand clapping, hand waving, jogging, running and walking) performed by 25 subjects in 4 different conditions labeled as sets 1-4. After empirical evaluation, we set the number of code-words per action class to K(m) = 250, with the feature selection threshold δ_f = 0.225 and the action basis threshold δ_m = 0.99975 for each set.


Table 1: Comparison with the state of the art.

Method               (Window Size / Overlap)   Percentage Accuracy
Proposed (Set 1)     (20, 15)                  92.23%
Proposed (Set 2)     (20, 15)                  84.5%
Proposed (Set 3)     (20, 15)                  87.9%
Proposed (Set 4)     (20, 15)                  95.75%
Yeffet et al. [14]   Full                      90.1%
Wang et al. [13]     Full                      93.8%
Yuan et al. [16]     Full                      95.49%

To compute the motion-shape features, we use the annotations provided by Jiang et al. [3]. In Figure 4(a), we plot the overall accuracy achieved for each set for a particular length of test sub-sequence. The proposed algorithm achieves a high accuracy of 92% or more for sets 1 and 4 and a moderate accuracy of 84% or more for sets 2 and 3. The drop in accuracy in the latter sets is due to the challenging conditions in the scene, where the features are more susceptible to noisy artifacts introduced by camera shakiness and continuous scale change (set 2) and variation of clothing (set 3). In Figure 4(b), we see that, in spite of noisy annotations and foreground segmentation (MHI) and without the mean-shift tracking used in [3], the proposed framework achieves an overall accuracy of around 90% or higher. For each action, the feature selection technique boosts the accuracy by 1-2% with a reduced set of features, except for the jog action; this illustrates the effectiveness of the feature selection in retaining only the features relevant for action recognition. The proposed algorithm is also compared with some recent techniques from the literature in Table 1. Our algorithm comes close to the state of the art while using only 20 frames to identify the action; thus the proposed framework does not require the complete sequence for classification.

6.2 UCF Sports Dataset

This high-resolution dataset contains 182 video sequences and consists of 9 different actions, namely diving, golf swinging, lifting, kicking, horseback riding, running, skating, swinging and walking. Here, we set the number of code-words per action class to K(m) = 250, with the thresholds set as δ_f = 0.1 and δ_m = 0.995. This is a very challenging dataset, mainly because it contains large viewpoint variations and is collected from the web; its intended purpose is to test action recognition algorithms suited for unconstrained video search applications, and it therefore lets us evaluate the proposed algorithm under conditions of large viewpoint variation. In Table 2, we summarize the accuracy obtained for different sub-sequence lengths for each type of action. Although accuracies of 88% have been reported in the literature, no prior method analyzed or classified sub-sequences of length 15-25 frames. In Table 3, we compare our proposed learning scheme against the well-known combination of the bag-of-words model and kernel SVM used extensively in unconstrained video search applications [4].


Table 2: Accuracy for each action for specific combinations of window size/overlap, with and without feature selection (FS).

Features (win size, overlap)   dive  golf  kick  lift  ride  run  skate  swing  walk  overall
Proposed (8,6)                  93    67    21    87    71    58    49    100    49     66
Proposed (8,6) + FS             94    63    18    86    68    58    45     99    46     64
Proposed (10,5)                 94    74    18    92    72    62    51    100    50     68
Proposed (10,5) + FS            94    66    16    86    67    60    48    100    48     65
Proposed (12,6)                 94    69    21    89    69    61    49    100    51     67
Proposed (12,6) + FS            93    65    15    89    71    60    49    100    49     66
Proposed (14,7)                 96    73    20    92    75    60    52    100    52     69
Proposed (14,7) + FS            96    66    15    89    73    60    48    100    51     67

Table 3: Comparison on UCF Sports with state-of-the-art learning mechanisms.

Feature Set      Learning Mechanism                 Overall Accuracy
HOF+LBFP+RT      BoW + Multi-Channel Kernel SVM     63.8%
HOF+LBFP+RT      BoW + Linear SVM                   69.48%
HOF+LBFP+RT      BoW + Gaussian Kernel SVM          68%
HOF+LBFP+RT      PCA + GRNN + Prob. Association     69%

The proposed learning scheme comes close to the accuracy obtained with the bag-of-words model, but its key advantage lies in the flexibility of handling different sequence lengths at test time without any prior learning. For BoW-SVM learning mechanisms, prior learning of the features corresponding to the different sequence lengths is required before test time.

7 Conclusions

A novel algorithm is proposed which computes per-frame motion-shape features and models their temporal variation, thereby facilitating real-time human action recognition on streaming surveillance footage. Due to the time-series modeling by the GRNN and the probabilistic matching scheme, we can determine the action taking place within a fixed short temporal window of 20-25 frames, making the approach suitable for streaming video applications. Experimental results validate this model through extensive testing for different sub-sequence lengths on two datasets: one of low resolution with artifacts associated with surveillance cameras, and the other of high resolution with large scale and viewpoint variations.

Acknowledgments. This work is independent research done as a continuation of work initially supported by the US Department of Defense (US Army Medical Research and Materiel Command - USAMRMC) under the program "Bioelectrics Research for Casualty Care and Management".


References

1. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: IEEE International Conference on Image Processing, ICIP 2007. vol. 1, pp. 381–384 (October 2007)

2. Holmstrom, I.: Analysis of time series by means of empirical orthogonal functions. Tellus 22(6), 638–647 (1970)

3. Jiang, Z., Lin, Z., Davis, L.: Recognizing human actions by learning and matching shape-motion prototype trees. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34(3), 533–547 (March 2012)

4. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8 (2008)

5. Liao, S., Chung, A.: Face recognition with salient local gradient orientation binary patterns. In: Image Processing (ICIP), 2009 16th IEEE International Conference on. pp. 3317–3320 (November 2009)

6. Nair, B., Asari, V.: Time invariant gesture recognition by modelling body posture space. In: Advanced Research in Applied Artificial Intelligence, Lecture Notes in Computer Science, vol. 7345, pp. 124–133. Springer Berlin Heidelberg (2012)

7. Nair, B., Asari, V.: Regression based learning of human actions from video using HOF-LBP flow patterns. In: Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. pp. 4342–4347 (October 2013)

8. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8 (June 2008)

9. Saghafi, B., Rajan, D.: Human action recognition using pose-based discriminant embedding. Signal Processing: Image Communication 27(1), 96–111 (2012)

10. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004. vol. 3, pp. 32–36 (August 2004)

11. Shao, L., Chen, X.: Histogram of body poses and spectral regression discriminant analysis for human action categorization. In: Proc. BMVC. pp. 88.1–11 (2010)

12. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568–576 (November 1991)

13. Wang, J., Chen, Z., Wu, Y.: Action recognition with multiscale spatio-temporal contexts. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 3185–3192 (2011)

14. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: Computer Vision, 2009 IEEE 12th International Conference on. pp. 492–497 (September 2009)

15. Yu, L., Liu, H.: Feature selection for high-dimensional data: A fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-03). pp. 856–863 (2003)

16. Yuan, C., Li, X., Hu, W., Ling, H., Maybank, S.: 3D R transform on spatio-temporal interest points for action recognition. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. pp. 724–730 (2013)