
TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 03/17 pp13-21 Volume 16, Number 1, February 2011

Robust Hierarchical Framework for Image Classification via Sparse Representation*

ZUO Yuanyuan (左圆圆), ZHANG Bo (张 钹)**

State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology,

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: The sparse representation-based classification algorithm has been used for human face recognition, but the image database was restricted to human frontal faces with only slight illumination and expression changes, and cropping and normalization of the face had to be done beforehand. This paper uses a sparse representation-based algorithm for generic image classification with some intra-class variations and background clutter. A hierarchical framework based on the sparse representation is developed which flexibly combines different global and local features. Experiments with the hierarchical framework on 25 object categories selected from the Caltech101 dataset show that exploiting the advantage of local features with the hierarchical framework improves the classification performance and that the framework is robust to image occlusions, background clutter, and viewpoint changes.

Key words: image classification; keypoint detector; keypoint descriptor; sparse representation

Introduction

The problem of image classification has been extensively studied in recent years. Different systems have been developed for image retrieval[1] and video concept detection. Generic image classification involves the two important issues of image representation and classification.

Image representation methods have been developed using various global features. Region-based features have also been developed by segmenting the image into several locally uniform regions and extracting features from each region[1]. Recently, keypoint-based image features have been receiving more and more attention in computer vision. Keypoints, also known as interest points or salient regions, refer to local image patches which contain rich information, have some kind of saliency, and can be stably detected under a certain degree of variation. The extraction of keypoint-based image features usually includes two steps. First, keypoint detectors are used to automatically find the keypoints. Second, keypoint descriptors are used to represent the keypoint features. Mikolajczyk et al.[2] evaluated the performance of several different keypoint detectors, while Mikolajczyk and Schmid[3] evaluated keypoint descriptors.

Many classification algorithms have been developed using different kinds of image representation techniques[1,4-10]. Image classification models can be divided into two classes. One class is generative models. The representative work is the constellation model[6], which is a probabilistic model for object categories. The basic idea of this model is that an object is composed of several parts that are selected from the detected keypoints, with the appearance of the parts, scale, shape, and occlusion modeled by probability

Received: 2010-11-29; revised: 2010-12-17

* Supported by the National Natural Science Foundation of China (No. 90820305) and the National Basic Research and Development (973) Program of China (No. 2007CB311003)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62773875


density functions. A Bayesian hierarchical model[7] was developed for natural scene category recognition, which learned the distribution of the visual words in each category.

The other class of image classification models is discriminative models, which have been proved to be effective for object classification[8]. A support vector machine (SVM) using the orderless bag-of-keypoints image representation was demonstrated to effectively classify texture and object images[9]. Lazebnik et al.[10] presented a multi-layer bag of keypoints feature with modified pyramid match kernels, which demonstrated that a well-designed bag-of-features method can outperform more sophisticated approaches.

Recently, sparse coding has been used for learning visual vocabularies or codebooks and image representations. Yang et al.[11] adopted sparse coding instead of K-means clustering to quantize the image local features and proposed a linear spatial pyramid matching kernel using an image representation based on sparse codes. Considering the mutual dependence among local features, Gao et al.[12] proposed a Laplacian sparse coding method to learn the codebook and quantize local features more robustly.

Wright et al.[13] applied sparse signal representation to the problem of human face recognition. A sparse representation-based classification algorithm that solves an l1-minimization problem was proposed. The authors gave new insights into two important issues in face recognition: feature extraction and robustness to occlusion. The paper shows that even downsampled images and random projections can do as well as conventional complex features, if the feature space dimension is sufficiently large and the sparse representation is computed correctly. Although good performance was obtained, the image database was strictly confined to human frontal faces with only slight illumination and expression changes. Detection, cropping, and normalization of the face were done beforehand.

We applied the sparse representation-based classification algorithm to the problem of generic image classification[14] with background clutter and scale, translation, and rotation variations within the same image class. No preprocessing was necessary for each image. Experiments gave results comparable with SVM classifiers for different vocabulary sizes and numbers of training images, repeated several times with randomly divided training and testing sets.

Previous studies showed that global and local features describe different levels of information granularity. An effective combination of these two kinds of features may improve the classification performance. The global features represent an image as a high-dimensional vector describing the global information in the image. For example, the bag of visual words feature reflects the distribution of visual words detected in an image. The advantage of using the global features is that they are simple, efficient, and can be directly applied to different kinds of image retrieval systems and image classification algorithms. However, the local attributes of keypoints, which are lost in the global features, may contain distinctive patterns and be very helpful for recognizing certain classes of images. For example, in the bag of visual words model, the local features extracted from the keypoints are quantized into visual words. A visual word frequency histogram represents an image with no information about the original local features of the keypoints. This paper uses the hierarchical framework shown in Fig. 1, based on a sparse representation, which flexibly combines the advantages of global and local features.

1 Hierarchical Framework Based on Sparse Representation

The layout of the sparse representation-based hierarchical framework is shown in Fig. 1. The framework is composed of the image database pretreatment, the global layer computation based on sparse representation for image classification, and the local layer computation based on sparse representation.

The global and local features of an image database are extracted and saved in the pretreatment stage. Different keypoint detectors are utilized to find salient local patches, with descriptors used to represent the local features of the keypoints for image classification in the local layer. A randomly selected fraction of the total keypoint features is clustered to generate the visual vocabulary. The vocabulary is then used to extract a bag of visual words feature from an image for image classification in the global layer.

The image database is randomly divided into a training set and a testing set. The global feature of each training image is used to form the training matrix,


Fig. 1 Overview of a hierarchical framework based on sparse representation

which is then used to solve the sparse representation of the global feature of the test image. A threshold is used to decide if the sparse representation result is stable. If the result can be accepted, then the test image label is output. Otherwise, the local layer will be used to determine the final test image label.

The local features of all the training images are used to form the training matrix of the local layer. A local feature of each test image is represented as a sparse linear combination of the local features from the training images. The sparse representation procedure is repeated for each local feature of the test image to obtain each keypoint label. The class label voted for by the most keypoints is output as the final test image label.

Different types of global and local features can be flexibly applied in this framework. The advantage of the hierarchical framework is that it can effectively utilize the strong points of both the global and local features.

2 Image Feature Extraction

Many image classification systems have been developed using keypoint-based features to represent an image. The image feature extraction method includes the following four procedures.

(1) Keypoint detector

The salient region detector[15] and scale-invariant feature transform (SIFT)[16] interest point detector are two of the most popular keypoint detectors used in many image classification systems. The salient region detector can eliminate many keypoints located in cluttered backgrounds. It finds regions which exhibit unpredictability both in the local attribute space and the scale space. The unpredictability of image regions is measured by the Shannon entropy of the local image attributes, such as the pixel gray value. The number of regions detected in one image usually varies from dozens to hundreds. Figure 2 gives several examples of salient region detector results for images from the Caltech 101[17] dataset. The SIFT detector applies the DoG detector to detect regions that are stable and rotationally invariant over different scales. The number of interest points for this detector can vary from hundreds to thousands per image.

Fig. 2 Salient regions detected on some images from Caltech 101

(2) Keypoint descriptor

For keypoints detected by the SIFT detector, 128-dimensional SIFT vectors are used to represent the local features of the interest points. For keypoints detected by the salient region detector, the histogram of oriented gradients (HOG) descriptor[18], which is similar to the SIFT descriptor, is used to describe each keypoint feature. This computes gradients for every pixel in the keypoint local patch. The gradient orientation (unsigned 0°-180°, or signed 0°-360°) is quantized


into a certain number of bins. Local patches can be divided into different size blocks for computing the HOG features. Dalal and Triggs[18] tested different block sizes and normalization schemes to show that 2×2 blocks and the l2-norm performed well.
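As a minimal sketch of the HOG computation just described (per-pixel gradients, orientation quantized into bins, 2×2 blocks, l2 normalization), the following might be used; the function name and the choice of np.gradient for differentiation are our own assumptions, not the paper's implementation:

```python
import numpy as np

def hog_patch(patch, n_bins=18, signed=True):
    """Minimal HOG for one keypoint patch: 2x2 blocks, l2-normalized.

    patch: 2-D grayscale array. Returns a 4*n_bins vector
    (e.g. 72-D for 18 signed bins, as used later in the paper).
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                 # in [-pi, pi]
    if signed:
        ang = np.mod(ang, 2 * np.pi)         # signed 0-360 degrees
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    else:
        ang = np.mod(ang, np.pi)             # unsigned 0-180 degrees
        bins = (ang / np.pi * n_bins).astype(int) % n_bins

    h, w = patch.shape
    feats = []
    for bi in (slice(0, h // 2), slice(h // 2, h)):      # 2x2 block grid
        for bj in (slice(0, w // 2), slice(w // 2, w)):
            hist = np.bincount(bins[bi, bj].ravel(),
                               weights=mag[bi, bj].ravel(),
                               minlength=n_bins)[:n_bins]
            norm = np.linalg.norm(hist)      # l2-normalize each block
            feats.append(hist / norm if norm > 0 else hist)
    return np.concatenate(feats)
```

With 18 signed bins this yields the 72-dimensional descriptor used in Section 4.2.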

(3) Visual vocabulary generation

Part of the keypoint features are randomly selected from all the keypoint features detected in each image in the database. These randomly chosen features are clustered using the k-means method to generate the visual vocabulary {w_1, w_2, …, w_m} of size m, with each visual word corresponding to a cluster center.
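The vocabulary generation step can be sketched with a plain Lloyd's k-means loop; the function name is ours, and a library k-means implementation would serve equally well:

```python
import numpy as np

def build_vocabulary(descriptors, m=100, n_iter=20, seed=0):
    """Cluster a sample of keypoint descriptors into m visual words."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, dtype=float)
    # initialize centers from m randomly chosen descriptors
    centers = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for k in range(m):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers   # the visual vocabulary {w_1, ..., w_m}
```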

(4) Image feature representation

The bag of visual words is used for the image feature representation, since this has been shown to be effective[8-10]. For each keypoint local patch, compute the distance between the local patch feature and each visual word. The visual word w_i which has the minimum distance is assigned to the keypoint local patch. The frequency f_{w_i} of each visual word in the vocabulary is counted to represent the image as a histogram of the visual word frequencies, f = {f_{w_1}, f_{w_2}, …, f_{w_m}}. The problems related to the bag of visual words method, such as vocabulary size and weighting schemes, are discussed by Yang and Jiang[19].
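The quantization and histogram step can be sketched as follows (the function name is ours; this assumes Euclidean nearest-word assignment and a frequency-normalized histogram):

```python
import numpy as np

def bow_histogram(keypoint_feats, vocabulary):
    """Quantize each keypoint feature to its nearest visual word and
    return the word-frequency histogram f = (f_w1, ..., f_wm)."""
    F = np.asarray(keypoint_feats, dtype=float)
    W = np.asarray(vocabulary, dtype=float)
    # distance of every keypoint feature to every visual word
    d = np.linalg.norm(F[:, None, :] - W[None, :, :], axis=2)
    words = d.argmin(axis=1)                    # nearest word per keypoint
    hist = np.bincount(words, minlength=len(W)).astype(float)
    return hist / hist.sum()                    # frequency histogram
```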

3 Hierarchical Sparse Representation-Based Classification Algorithm

The sparse representation-based classification (SRC) algorithm has been used for human face recognition[13]. Experiments showed that if the feature space dimension is sufficiently large, the SRC performance is comparable with the SVM.

Sparse representation-based classification assumes that the training samples from a single class lie on a subspace. As a result, a test sample y can be represented as a sparse linear combination of the training samples A, y = Ax. Instead of solving the NP-hard problem of finding x with the minimum l0-norm, the theory of sparse representation[20-22] states that if the solution x is sparse enough, the solution with the minimum l0-norm is equal to the solution with the minimum l1-norm, which can be found in polynomial time using standard linear programming methods[23].

The following gives a detailed description of the hierarchical sparse representation-based classification (H-SRC) algorithm.

Step 1 Dataset preparation. Randomly select a certain number of images per category as the training set, with the remaining images as the testing set.

Step 2 Computation of the training feature matrix, A. For each training image from category i, extract the image feature f ∈ R^m, in which m is the global feature dimension. Image features belonging to the same category i form the sub-matrix A_i. Given a training set with k categories, matrix A is composed of every sub-matrix A_i: A = [A_1, A_2, …, A_k] ∈ R^{m×n}, in which n is the total number of images in the training set.

Step 3 Solve the optimization problem. For the given test image, extract the feature y ∈ R^m. Solve the l1-minimization problem:

x̂_1 = arg min ||x||_1  s.t.  Ax = y    (1)

x̂_1 = arg min ||x||_1  s.t.  ||Ax − y||_2 ≤ ε    (2)

Equation (2) assumes that the real data may have some noise by relaxing the strict constraint Ax = y to ||Ax − y||_2 ≤ ε.
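As a sketch, the equality-constrained problem of Eq. (1) can be recast as a linear program via the standard split x = u − v with u, v ≥ 0 and solved with an off-the-shelf LP solver; the function name is ours, and the noisy variant of Eq. (2) would need a second-order cone solver instead:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """Basis pursuit for Eq. (1): min ||x||_1 s.t. Ax = y.

    Writing x = u - v with u, v >= 0 makes ||x||_1 = sum(u) + sum(v),
    a linear objective, so the problem becomes a standard LP.
    """
    m, n = A.shape
    c = np.ones(2 * n)                 # minimize sum(u) + sum(v)
    A_eq = np.hstack([A, -A])          # enforce A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * n), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v
```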

Step 4 Compute the residual between y and its estimate for each category. Let δ_i(x̂_1) ∈ R^n keep only the nonzero entries in x̂_1 that are associated with category i. Approximate the test image feature y as ŷ_i = A δ_i(x̂_1), using only the coefficients of x̂_1 which correspond to category i. For each category, compute the residuals r_i = ||y − A δ_i(x̂_1)||_2 for i = 1, 2, …, k.

Step 5 Check if the acceptance condition is satisfied. The acceptance condition is defined on the ratio between the two smallest residuals. A lower ratio indicates that the test image is more effectively approximated by the training samples from one class instead of those from the other classes. If the ratio is below a certain threshold ε_r, then no local layer processing is necessary. Test image y is assigned to the category i that has the minimum residual between y and ŷ_i. The threshold depends on the image database and the features extracted, and can be set based on analysis of several image sparse representation results. If the ratio is above the threshold, then go to Step 6 for the local layer sparse representation.
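Steps 4 and 5 can be sketched as follows, assuming a `labels` array mapping each training column to its category (the function and parameter names are ours):

```python
import numpy as np

def classify_global(A, labels, y, x_hat, r_thresh=0.8):
    """Per-category residuals r_i = ||y - A*delta_i(x_hat)||_2 plus the
    acceptance test on the ratio of the two smallest residuals.

    Returns (category, accepted); accepted=False means the global-layer
    result is unstable and the local layer should decide (Step 6).
    """
    labels = np.asarray(labels)
    cats = np.unique(labels)
    residuals = []
    for c in cats:
        xi = np.where(labels == c, x_hat, 0.0)   # keep class-c coefficients
        residuals.append(np.linalg.norm(y - A @ xi))
    residuals = np.asarray(residuals)
    order = np.argsort(residuals)
    ratio = residuals[order[0]] / residuals[order[1]]
    return cats[order[0]], ratio < r_thresh
```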

Step 6 Processing of the local layer. Training matrix A′ is composed of the local features of images from the training set. For each local feature in the test image, solve the l1-minimization problem and output the local class label. The final test image label is assigned to the category voted for by the most local features.

Steps 3 to 6 are repeated for each image in the testing set. The classification precision is then calculated for each category.
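The local-layer voting of Step 6 reduces to a majority vote over the per-keypoint labels; as a sketch (the helper name is ours, and the optional global-layer vote follows the experimental setup described later in Section 4.2):

```python
from collections import Counter

def local_layer_vote(local_labels, global_label=None):
    """Each local feature's sparse-representation label casts one vote;
    the majority class is the final image label. Optionally the
    global-layer label is added as one extra vote."""
    votes = Counter(local_labels)
    if global_label is not None:
        votes[global_label] += 1
    return votes.most_common(1)[0][0]
```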

Figures 3 to 5 give an example of the H-SRC algorithm using only the global layer. Figure 3 shows an example image with the visual word frequency histogram feature. The abscissa indicates that the vocabulary includes 100 visual words. Figure 4 shows the values of x, which are the sparse representation coefficients of the test image in terms of the training feature matrix A. The abscissa corresponds to the 15 training samples per category from 25 categories. This figure shows that only the coefficients of a few training samples from the first category have large values. Figure 5 shows the residuals, r_i, for each category. The test image is correctly assigned to the first category, which has the minimum residual of 0.135. The residuals of the other categories are much larger than the first one, with the ratio between the two smallest residuals being 0.28.

Fig. 3 Example image (a) with the visual word frequency histogram feature (b)

Fig. 4 Sparse representation coefficients

Fig. 5 Residuals of a test image

4 Experiments

4.1 Experiment dataset

25 object categories were selected from the Caltech 101 dataset, with each category containing 30 images. The images are not cropped or normalized in advance and have some background clutter. Since the keypoint-based features can only tolerate a certain degree of intra-class variation caused by scale, translation, rotation, and background clutter, objects with rather large intra-class variations were not selected for the experiments. The task is to recognize objects on this 25-object-category dataset.

4.2 Classification performance

We experiment on the 25 object categories to verify the effectiveness of the hierarchical framework, using 5 training images per category.

For the global layer, we used two kinds of bag of visual words features. One is the salient region detector with the 72-dimensional HOG descriptor. The gradients on each salient region are computed with the signed orientation (0°-360°) quantized into 18 bins. Each region is divided into 2×2 blocks, with the l2-normalized HOG feature computed for each block to give a 72-dimensional feature. The other uses the SIFT detector with a 128-dimensional SIFT descriptor. The vocabulary size is set to 100 visual words. For each test image, we solve the optimization problem in Eq. (2) with an error tolerance ε = 0.05. The threshold ε_r for the acceptance condition of the global layer is set to 0.8.


The local layer uses a 72-dimensional HOG descriptor for the salient regions as the local features. The number of regions detected in each image varies from dozens to hundreds. The regions are then sorted by their saliency. To limit the computational complexity, we only keep the 10 most salient regions for training and testing. If the global layer result does not satisfy the acceptance condition, the sparse representation of the local features is then computed. The label output by the global layer and the labels of the local regions output by the local layer are all used to vote for the final result.
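The pruning to the 10 most salient regions can be sketched as (the helper name and the (saliency, feature) pairing are our assumptions):

```python
def top_salient(regions, k=10):
    """Keep the k most salient regions per image to limit the local
    layer's cost; each region is a (saliency, feature) pair."""
    return sorted(regions, key=lambda r: r[0], reverse=True)[:k]
```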

The H-SRC performance with only the global layer is compared with that of the H-SRC with both the global and local layers, using the bag of visual words based on the SIFT algorithm in Fig. 6 and that based on the salient regions as the global feature in Fig. 7.

Fig. 6 Performance comparison of H-SRC (global) and H-SRC (global+local) based on the SIFT local features

Fig. 7 Performance comparison of H-SRC (global) and H-SRC (global+local) based on the salient regions and HOG local features

The object categories with significant performance improvement by the hierarchical framework are listed in Table 1. A common property of the images from these categories is that the keypoints detected by the salient region detector have strong discriminative ability. For example, the spots on the Dalmatian are very distinctive, and the outputs of the sparse representation of the local features are almost identical for the most salient regions. Another example is the images from the face category, where the performance is greatly improved by the local features. The precision with the SIFT-based global layer increases from 12.8% to 84.8%. Although not many regions are detected by the salient region detector in each image, the detected regions are usually located on the eyes and the hair on the person's forehead, as shown in the right column of Fig. 8. For different people with different backgrounds, these salient regions have very similar features. The SIFT algorithm not only detects interest points on the person's face but also many points in the background, as shown in the left column of Fig. 8. The background clutter increases the distances between the SIFT-based bag-of-words features for the face category. Therefore, the global layer output is more likely to exceed the threshold, so the local layer is used to determine the test image label. For the face category, the local features extracted from the eyes and the hair on the person's forehead are very distinctive and tend to be predicted correctly by the sparse representation of the local layer. This significantly improves the classification performance for images from the face category.

Table 1 Object categories with significant performance improvement by the hierarchical framework

Precision (%)
Object category   Global (SIFT)   Global (SIFT)+local   Global (Salient)   Global (Salient)+local
accordion             61.6              82.4                 72.0                 84.0
airplanes             20.8              76.8                 89.6                 92.8
Dalmatian             80.8              96.8                 59.2                 88.0
face                  12.8              84.8                 57.6                 91.2
grand_piano           77.6              81.6                 54.4                 72.8
pagoda                37.6              59.2                 65.6                 71.2
stop_sign             64.8              85.6                 50.4                 74.4
Average               46.8              56.6                 45.6                 51.9


Fig. 8 Left column: keypoints detected by SIFT; right column: keypoints detected by salient region detector

The experiments also show that the performance can be greatly improved by introducing the local layer, as shown in the bottom row of Table 1. For the salient region detector-based global layer, the average precision over the 25 object categories increases from 45.6% to 51.9%. For the SIFT-based global layer, the average precision increases from 46.8% to 56.6%. The almost 10% better performance achieved by the SIFT-based global layer is due to the complementary properties of the features of the two layers. Therefore, complementary types of features should be used in the hierarchical framework to improve the performance.

4.3 Robustness to background clutter and viewpoint changes

The robustness of the hierarchical framework to changes of the background and viewpoint is evaluated using all the images in the leopard (not including 180°-rotated images) and Dalmatian categories from the Caltech 101 dataset. These images contain more background clutter, illumination variations, and viewpoint changes.

In the experiments, we use the bag of visual words feature based on the salient region detector with the HOG descriptor as the global feature. The HOG descriptor extracted on the salient regions is used for the local features. 5 images are randomly selected from each category as the training samples. For the leopard category, the remaining 95 images are then used as the test samples. For the Dalmatian category, the remaining 62 images are then used as the test samples.

The results in Table 2 show that for the leopard category, the sparse representation-based classification has good performance for the larger dataset with more background clutter and viewpoint changes. For the Dalmatian category, the performance using only the global layer drops significantly. However, use of the hierarchical framework with both the global layer and the local layer greatly increases the performance because of the distinctiveness of the extracted local features. Thus, these results show the robustness of the framework to background clutter and viewpoint changes.

Table 2 Robustness to background clutter and viewpoint changes

Precision (%)
Object category   H-SRC (Global)   H-SRC (Global+local)
Dalmatian             37.1                71.0
Leopards              70.5                72.6

4.4 Robustness to random block occlusion

This section tests the robustness of the hierarchical framework to random block occlusions. Images from the airplane and car side categories are tested by replacing a randomly located block in each test image with a black block. The level of image occlusion is varied from 10% to 50%. Some examples of the occluded images are shown in Fig. 9. Generally, occlusion levels above 30% visibly occlude the object, and much of the object can be occluded at 50% occlusion.

The same global and local features are used as in Section 4.3. The test results for robustness to block occlusions, shown in Figs. 10 and 11, show that the hierarchical framework performs well for up to 30% occlusion, correctly classifying over 85% of the test images. With more than 30% occlusion, the performance deteriorates because the objects are significantly occluded.

The results also show that the hierarchical framework outperforms use of only the global layer for all levels of occlusion. For the car side category, increasing the occlusion from 30% to 40% reduces the


Fig. 9 Random block occlusion of test images; the occlusion level varies from 10% (a) to 50% (e) in steps of 10%

Fig. 10 Performance for various levels of occlusions for the airplane category

Fig. 11 Performance for various levels of occlusions for the car_side category

hierarchical framework precision only a small amount from 93.3% to 83.3%, while the performance with only the global layer drops dramatically from 90% to 53.5%. Thus, the tests show that the hierarchical framework is robust to a certain degree of random block occlusion.

5 Conclusions and Future Work

A hierarchical framework based on the sparse representation has been used for generic image classification. The advantage of the hierarchical framework is that it effectively combines the benefits of both the global and the local features. If a low-dimensional subspace is found for the global feature, the test image can be labeled directly by the global layer. If the image category's global feature space has no low-dimensional subspace, the local features are used to find a low-dimensional structure for images within the same category. The test image can then be correctly classified through the local features.

This framework is very flexible. Different types of global and local features can be utilized in the framework, with complementary features recommended to achieve better performance. Experiments show that the classification performance of the hierarchical framework is improved and the framework is robust to a certain degree of image occlusions, background clutter, and viewpoint changes.

Although the results show that the sparse representation-based algorithm can be applied to generic image classification, the performance is closely related to the features being used. The features used in this work may be suitable for image categories that have some distinctive local properties. Future work will concentrate on how to select appropriate global and local features so that the hierarchical framework using sparse representation can be applied to large image databases.

References

[1] Jing F, Li M J, Zhang H J, et al. Relevance feedback in region-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 2004, 14(5): 672-681.

[2] Mikolajczyk K, Tuytelaars T, Schmid C, et al. A comparison of affine region detectors. International Journal of Computer Vision, 2005, 65(1-2): 43-72.

[3] Mikolajczyk K, Schmid C. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(10): 1615-1630.

[4] Jing F, Li M J, Zhang H J, et al. An efficient and effective region-based image retrieval framework. IEEE Transactions on Image Processing, 2004, 13(5): 699-709.

[5] Sivic J, Zisserman A. Video Google: A text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision. Nice, France, 2003: 1470-1477.

[6] Fergus R, Perona P, Zisserman A. Object class recognition by unsupervised scale-invariant learning. In: IEEE Conference on Computer Vision and Pattern Recognition. Madison, USA, 2003: 264-271.

[7] Li F F, Perona P. A Bayesian hierarchical model for learning natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition. San Diego, USA, 2005: 524-531.

[8] Csurka G, Dance C, Fan L, et al. Visual categorization with bags of keypoints. In: ECCV International Workshop on Statistical Learning in Computer Vision. Prague, Czech Republic, 2004.

[9] Zhang J, Marszalek M, Lazebnik S, et al. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 2007, 73(2): 213-238.

[10] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition. New York, USA, 2006: 2169-2178.

[11] Yang J C, Yu K, Gong Y H, et al. Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA, 2009.

[12] Gao S H, Tsang W I, Chia L T, et al. Local features are not lonely – Laplacian sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, USA, 2010.

[13] Wright J, Yang A Y, Ganesh A, et al. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2): 210-227.

[14] Zuo Y, Zhang B. General image classifications based on sparse representation. In: IEEE International Conference on Cognitive Informatics. Beijing, China, 2010.

[15] Kadir T, Zisserman A, Brady M. An affine invariant salient region detector. In: European Conference on Computer Vision. Prague, Czech Republic, 2004: 404-416.

[16] Lowe D. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91-110.

[17] Caltech 101 dataset. http://www.vision.caltech.edu/Image_Datasets/Caltech101, 2010.

[18] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition. San Diego, USA, 2005, 1: 886-893.

[19] Yang J, Jiang Y. Evaluating bag-of-visual-words representations in scene classification. In: Proc. of the International Workshop on Multimedia Information Retrieval. Augsburg, Germany, 2007.

[20] Donoho D. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 2006, 59(6): 797-829.

[21] Candes E, Romberg J, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 2006, 59(8): 1207-1223.

[22] Candes E, Tao T. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 2006, 52(12): 5406-5425.

[23] Chen S, Donoho D, Saunders M. Atomic decomposition by basis pursuit. SIAM Review, 2001, 43(1): 129-159.