978-1-4673-4937-6/13/$31.00 ©2013 IEEE

Surveillance System using IP Camera and Face-Detection Algorithm

Petre Anghelescu, *Ionut Serbanescu, Silviu Ionita Department of Electronics, Communications and Computers

University of Pitesti Pitesti, Romania

[email protected], [email protected], [email protected]

*Student in the final year of bachelor studies

Abstract – This paper presents an efficient video surveillance system consisting of a network camera and an algorithm that automatically detects human faces in the monitored area through real-time analysis of the video content. The main contribution of this research is a software application able to process the images received from the camera in order to detect human faces and trigger the saving of the live stream as a video file. The face-detection algorithm is based on the integral image (summed-area table), a representation that allows the sum of all pixel values in any rectangular area of the original image to be calculated with only four additions or subtractions. In addition, the application includes a file of tests needed by the detection algorithm, generated by analyzing multiple images used as templates. The application is written in the C# language, and experimental results are presented for images with different sizes and backgrounds.

Keywords- face detection, pattern recognition, integral image, IP camera, computer vision.

I. INTRODUCTION

The technological advances we have all witnessed in recent years have facilitated the development of complex systems specialized in interacting with people through real-time detection and tracking of various human features or even the whole body. Examples are numerous, ranging from the face, lip, and smile detection capabilities integrated even into the most affordable digital cameras, to bionic robots that analyze the environment visually in order to interact with a human agent and collect various information.

It is obvious that detecting the presence and determining the position of a face in an image is the first step in creating a complex system designed for processing images with human subjects.

The surveillance system presented in this document integrates a face-detection algorithm based on the integral image, introduced in digital graphics by Franklin Crow [1] and successfully used by Viola and Jones [2] in building an object-detection system.

This document is structured as follows. The next section presents the problem of detecting shapes in an image, and face detection in particular. The third section summarizes previously developed methods along with their advantages and disadvantages. Section 4 describes the entire algorithm and the way we generate tests using a tool developed exclusively for this purpose. Experiments and results are presented in Section 5. The final section contains conclusions, together with improvements that might be added to the detection algorithm or to the whole surveillance system.

II. THE PROBLEM OF FACE DETECTION

A. What does face detection mean?

Although the human brain finds this task very simple and carries it out with high accuracy, the same cannot be said about how computer systems perform detection. A computing system performs the operations dictated by an algorithm specialized in detecting a specific pattern.

There are cases in which the goal is to detect fixed shapes that can vary very little, perhaps only in size due to the distance from which the image was taken; there the process poses little trouble. In the case of human face detection, however, things become complicated: differences in illumination (number and directions of light sources), the presence of obstacles such as a mustache, a beard, or glasses, skin-color variations from one person to another, and so on.

Fig. 1. Image analyzed using a scanning frame

The way in which a computer system could perform this task can be explained as follows (Figure 1): it uses a scanning window that slides over the original image and, for each position, runs a set of tests based on which it decides whether the window contains a human face or not. After testing all possible locations, the window is resized and all placements of the new size are tested, and so on. The tests can range from analyzing the pixel colors to feeding the pixel values into a neural network previously trained for this purpose. However, there are algorithms that do not require scanning the image; one of them is summarized in the next section.

B. Requirements for real-time response

Most of the time, a system is wanted that can analyze the image and return an answer within a limited time. For example, the analysis of moving images at a rate of 15 frames per second requires a response time of at most 66 ms per frame. A balance must be struck between the depth of the scanning process and the required response time. Ideally, the algorithm should process the image at the maximum level of detail while still fulfilling the real-time constraints imposed.

III. EXISTING METHODS AND ALGORITHMS

A. Color-filter-based algorithm

The algorithm presented below is not the most popular or the most powerful, but it is relatively easy to understand and implement. It was published in 2007 by Yao-Jiunn Chen and Yen-Chun Lin [3].

Fig. 2. Flowchart of the algorithm proposed by Yao-Jiunn Chen and Yen-Chun Lin

"Minimum features" refers to the use of filters for skin and hair color. Starting from the filter results, the algorithm determines where the face actually appears in the image. The image filter that identifies skin pixels requires the following steps:

• Calculation of the following two functions, determined experimentally:

F1(r) = -1.376r^2 + 1.0743r + 0.2 (1)

F2(r) = -0.776r^2 + 0.5601r + 0.18 (2)

The values r and g represent the normalized red and green components of a pixel, respectively. They are obtained with the following two formulas:

r = R/ (R+G+B) (3)

g = G/ (R+G+B) (4)

The two functions are, respectively, the upper and lower limits for the skin color.

• In order to remove white, which also lies between these limits, the value of w is calculated using the following equation:

w = (r - 0.33)^2 + (g - 0.33)^2 (5)

• Finally, it is checked whether the following conditions hold simultaneously: g < F1(r), g > F2(r), and w > 0.001. If they do, the tested pixel has skin color. A code sketch of this test is given below.
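To make the test concrete, here is a minimal C# sketch of the skin-pixel filter from equations (1)-(5). It is our illustration, not the authors' code; the method name and the byte-valued RGB inputs are assumptions.

    // Sketch of the skin-color test from equations (1)-(5).
    // Returns true if the pixel (R, G, B) is classified as skin.
    static bool IsSkinPixel(byte R, byte G, byte B)
    {
        double sum = R + G + B;
        if (sum == 0) return false;                      // guard: avoid division by zero

        double r = R / sum;                              // normalized red, eq. (3)
        double g = G / sum;                              // normalized green, eq. (4)

        double f1 = -1.376 * r * r + 1.0743 * r + 0.2;   // upper limit, eq. (1)
        double f2 = -0.776 * r * r + 0.5601 * r + 0.18;  // lower limit, eq. (2)
        double w = (r - 0.33) * (r - 0.33) + (g - 0.33) * (g - 0.33); // white test, eq. (5)

        // Skin if g lies between the two curves and the pixel is not (near-)white.
        return g < f1 && g > f2 && w > 0.001;
    }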

Similar tests, aimed at detecting dark pixels, determine which pixels correspond to hair color. The authors established twelve hairstyles that can be tested by this algorithm, together with areas that should contain mostly dark pixels and areas that should contain skin-colored pixels.

Encouraging results are obtained under controlled conditions (background and clothes of colors different from the skin and/or hair). The method achieved a rate of 10 frames per second in face-tracking applications on moving images. The response time is not the best because of the relatively complex calculations (multiplications). The main drawback of this algorithm is that it will classify any skin-colored object as a face.

B. Feature-based methods

These methods first detect facial features (eyes, nose, mouth) and group them into candidates [5]. Various tests are then performed on these groups, regarding the positioning of the features, the ratios of their sizes, the distances between them, their number within the same group, possible overlaps, and so on.

The main advantage is that the features are generally invariant to changes in orientation and position.

Unfortunately, poor lighting, noise, or partial occlusion of these features, as well as complex backgrounds, degrade detection; compensating would force the use of expensive hardware to capture high-quality images, which in turn take longer to process.

C. Neural-network detectors

Previously trained Multi-Layer Perceptron (MLP) networks are used to classify the image [6]. A scanning window has to scroll through the image and change its size accordingly in order to localize all the faces in the image. If the scanning window is bigger than the input resolution of the MLP, the sub-image has to be rescaled.

More advanced face-detection systems that use neural networks can also detect rotated faces [5]. A network specialized for this task has to estimate a rotation angle for a possible face, even when it cannot tell for sure whether a face is present. Knowing the angle, the sub-image can then be tested with the neural network for upright frontal faces in order to return the correct result.

Fig. 3. The structure of a Multi-Layer Perceptron

The results obtained with neural-network detectors are good when training is done properly (with a large number of positive and negative templates). The main disadvantage of neural networks is the learning phase, especially determining the optimal number of neurons in the hidden layers and the learning rate. Neural networks have proved very useful when implemented in hardware structures, obtaining good results in real time. Examples of this kind are also found in industry, where they are used to analyze various shapes in order to verify the quality of products on production lines, and so on.

D. Viola-Jones detection algorithm

This is probably the best detection algorithm, applicable not only to faces but to any desired object. It has proved good performance in both respects: detection rate and computational time. A real implementation of this algorithm can be found in the Intel OpenCV library [4]. Tested on a 2001-era desktop computer, the algorithm was able to process moving images with a resolution of 384*288 pixels at a rate of 15 frames per second.

The main characteristics of this method are the use of the summed-area table, which we also used to build the algorithm described in the next section, and a training algorithm based on the AdaBoost technique, designed to find the important features, which are then combined in a cascaded testing structure.

Fig. 4. Haar-like features used by Viola-Jones

The Haar-like features are represented as rectangles, as shown in Figure 4. Their name comes from their similarity to the Haar wavelets used in mathematics. A simple Haar-like feature is composed of two rectangular surfaces; for each of them, the sum of the pixel values of the original image overlapping that surface must be calculated.

Initial experiments on a classifier composed of 200 feature tests showed a detection rate of 95% with a false-positive rate of 1 in 14,084.

IV. OUR ALGORITHM

It should be noted from the start that all images are converted from color to grayscale. By this method the color information is lost, but the information about the shape of the objects represented in the image is kept.

We used Haar-like features composed of only two rectangles placed horizontally or vertically (Figure 5).

Fig. 5. The features tested by the proposed algorithm

Next it’s described the calculation method of the sum of all the pixels values from the two complementary surfaces, using the integral image.

A. Integral image

The integral image is a two-dimensional array of the same size as the original image, but it stores different values: each element of the matrix memorizes the sum of all the pixels above it and to its left, inclusive. These values are obtained as follows:

• We designate by A the matrix corresponding to the grayscale image and by I the integral image.

• Element I[0][0] gets the value of A[0][0].

• Each element of the first row, I[0][i], is computed using formula (6):

I[0][i] = A[0][i] + I[0][i-1] , for i= 1…m (6)

• Each element of the first column, I[i][0], is computed using formula (7):

I[i][0] = A[i][0] + I[i-1][0] , for i= 1…n (7)

• At this moment, the first row and the first column of the integral image are complete. Next, formula (8) is used to fill in all the remaining elements of the matrix:

I[i][j] = A[i][j] + I[i][j-1] + I[i-1][j] – I[i-1][j-1] ,

where i= 1…n , j= 1…m ; (8)

From this matrix we can obtain the sum of the pixels in any rectangular window with only four additions or subtractions. As an example, using the notation from Figure 6, the sum of the green area is:

D=I(4) – I(2) – I(3) + I(1) (9)

The explanation is as follows: I(4) stores the sum of the whole surface (A+B+C+D); from it we subtract I(2) = A+B and then I(3) = A+C, and we notice that the value I(1) = A has been subtracted twice, so we must add I(1) back. A code sketch of both steps follows.
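The construction of the integral image (formulas (6)-(8)) and the four-corner sum (formula (9)) can be sketched in C# as follows. This is a minimal illustration written for this text under the definitions above, not the application's actual code; the (row, column) array layout and method names are assumptions.

    // Builds the integral image I from the grayscale matrix A, per formulas (6)-(8).
    // I[i, j] holds the sum of all pixels of A above and to the left of (i, j), inclusive.
    static long[,] BuildIntegralImage(byte[,] A)
    {
        int n = A.GetLength(0), m = A.GetLength(1);
        long[,] I = new long[n, m];
        I[0, 0] = A[0, 0];
        for (int j = 1; j < m; j++) I[0, j] = A[0, j] + I[0, j - 1];   // first row, eq. (6)
        for (int i = 1; i < n; i++) I[i, 0] = A[i, 0] + I[i - 1, 0];   // first column, eq. (7)
        for (int i = 1; i < n; i++)
            for (int j = 1; j < m; j++)
                I[i, j] = A[i, j] + I[i, j - 1] + I[i - 1, j] - I[i - 1, j - 1]; // eq. (8)
        return I;
    }

    // Sum of the rectangle with top-left (r1, c1) and bottom-right (r2, c2),
    // using the four-corner rule of formula (9): D = I(4) - I(2) - I(3) + I(1).
    static long RectSum(long[,] I, int r1, int c1, int r2, int c2)
    {
        long d = I[r2, c2];                              // I(4): whole area A+B+C+D
        if (r1 > 0) d -= I[r1 - 1, c2];                  // I(2): area above (A+B)
        if (c1 > 0) d -= I[r2, c1 - 1];                  // I(3): area to the left (A+C)
        if (r1 > 0 && c1 > 0) d += I[r1 - 1, c1 - 1];    // I(1): corner A, subtracted twice
        return d;
    }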

Unlike detectors based on neural networks, in which the image must be rescaled to fit the inputs of the network, in this case it is the detector that changes its size. This is possible because the processing cost is the same regardless of the size of the Haar-like feature.

The scanning window moves vertically and horizontally by a delta step, usually an integer between 1 and 10. Obviously the best detection rate is obtained when delta is set to 1. The loop below sketches the scan.
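As an illustration of the multi-scale scan (our sketch; the starting window size of 24 pixels, the 1.25 scale factor, and the classifier delegate are assumptions, not values taken from the paper):

    using System;

    // Multi-scale scan: slide a square window of growing size over the image
    // in steps of delta, handing every position to the cascaded classifier.
    static void ScanImage(long[,] I, int height, int width, int delta,
                          Func<long[,], int, int, int, bool> looksLikeFace)
    {
        // The detector grows instead of rescaling the image; 24 px and the
        // 1.25 growth factor are assumed starting values, not the paper's.
        for (int size = 24; size <= Math.Min(height, width); size = (int)(size * 1.25))
            for (int y = 0; y + size <= height; y += delta)       // vertical step
                for (int x = 0; x + size <= width; x += delta)    // horizontal step
                    if (looksLikeFace(I, y, x, size))
                        Console.WriteLine($"Candidate face at ({x}, {y}), size {size}");
    }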

B. Generating the features

The tool used to generate the tests and their expected results is developed in the C# programming language with Windows Forms (Figure 7), without an AdaBoost algorithm, which would be very difficult to implement. The user interface for the training phase is designed so that the operator can enter the coordinates from the keyboard or set them by clicking on the image on the left side of the window. There are two ways to generate a test: calculating the result of the test for the loaded image, or calculating the test for all images in a folder and obtaining an interval in which the result should fall.

Fig. 7. The software window of the feature generator

This tool is aimed at generating tests used specifically for face detection. One can notice that in grayscale images depicting human faces in the same position, areas with similar shades can be found in overlapping locations: the surface corresponding to the eyes is darker than the upper cheeks; the area between the eyes is lighter than the areas of the eyes and eyebrows; and so on.

In this way we can generate a set of tests valid for all the images used as templates; the set is saved in a file for later use in the detection algorithm itself.

Fig. 8. Example of generated text file

In Figure 8, each row represents a test together with its expected results. The first number can be 0 or 1 and indicates the orientation of the calculated areas: 0 for vertical orientation (the upper and lower summed areas are calculated), 1 for horizontal orientation (the left and right summed areas are calculated). The next four numbers are the coordinates of the top-left and bottom-right corners of the whole rectangle. The last two numbers are the lower and upper limits for the ratio of the two calculated sums. Generated tests whose result interval lies largely below or above 1 are preferred, as they highlight which of the two areas has a darker shade than its complement. A parsing and evaluation sketch is given below.
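Under the format just described, one test per line could be parsed and evaluated as in the following sketch. This is our illustration: the whitespace-separated field order, the x-before-y coordinate convention, and the half-and-half split of the rectangle are assumptions, and RectSum is the helper sketched in Section IV.A.

    using System;

    // One line of the generated file (assumed layout):
    //   orientation x1 y1 x2 y2 lowerLimit upperLimit
    sealed class FeatureTest
    {
        public int Orientation;        // 0 = vertical (upper vs. lower), 1 = horizontal (left vs. right)
        public int X1, Y1, X2, Y2;     // whole rectangle: top-left and bottom-right corners
        public double Lower, Upper;    // accepted interval for the ratio of the two sums

        public static FeatureTest Parse(string line)
        {
            string[] p = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            return new FeatureTest
            {
                Orientation = int.Parse(p[0]),
                X1 = int.Parse(p[1]), Y1 = int.Parse(p[2]),
                X2 = int.Parse(p[3]), Y2 = int.Parse(p[4]),
                Lower = double.Parse(p[5], System.Globalization.CultureInfo.InvariantCulture),
                Upper = double.Parse(p[6], System.Globalization.CultureInfo.InvariantCulture)
            };
        }

        // True if the ratio of the two complementary sums falls inside [Lower, Upper].
        public bool Passes(long[,] I)
        {
            long first, second;
            if (Orientation == 0)                         // vertical: upper half vs. lower half
            {
                int midY = (Y1 + Y2) / 2;
                first = RectSum(I, Y1, X1, midY, X2);
                second = RectSum(I, midY + 1, X1, Y2, X2);
            }
            else                                          // horizontal: left half vs. right half
            {
                int midX = (X1 + X2) / 2;
                first = RectSum(I, Y1, X1, Y2, midX);
                second = RectSum(I, Y1, midX + 1, Y2, X2);
            }
            double ratio = second == 0 ? double.MaxValue : first / (double)second;
            return ratio >= Lower && ratio <= Upper;
        }
    }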

C. Execution of the algorithm

As previously noted, the execution of the algorithm on a single image uses a scanning window that scrolls through the image and resizes. The necessary tests are run at each location in order to classify it. The mechanism shown in Figure 9 focuses on eliminating the candidates that are not faces by running only as many tests as are needed to produce a negative answer. Specifically, each test is a weak classifier that can only decide that a candidate does not represent a face.

The closer we get to feature N, the higher the probability that the current candidate is positive.

The tests in the chart below may consist of single features from the generated file or of groups of such features. For a group, the algorithm verifies whether at least a given number of features, less than or equal to the total number in the group, pass successfully. This is how the algorithm tolerates images that do not have the best quality or were captured in low-light conditions.

Our classifier consists of 67 tests grouped in sets that begin with 2 features per group and end with sets of 13 features.

The smaller groups have tolerance 0: all their conditions must be fulfilled in order to continue with the next, bigger groups, while the groups of 13 tests require 11 out of 13 conditions to be fulfilled. The groups of 13 tests are also the ones that make the final decision on a candidate, but they are not connected as a cascade, because their tests were obtained by splitting the initial image database into subclasses of images with different properties. A sketch of this grouped evaluation follows.
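A possible shape for this grouped, tolerance-based evaluation is sketched below. It is consistent with the description above but is our illustration; the tolerance rule for intermediate group sizes and the helper names are assumptions.

    using System.Collections.Generic;

    // Runs the grouped tests on one candidate sub-window, using the integral image I.
    // groups holds the 67 tests partitioned into groups of growing size (2 ... 13).
    static bool EvaluateCandidate(long[,] I, List<List<FeatureTest>> groups)
    {
        foreach (List<FeatureTest> group in groups)
        {
            // Small groups are strict (tolerance 0); the 13-test groups
            // accept up to 2 failures, i.e. 11 of 13 must pass.
            int tolerance = group.Count >= 13 ? 2 : 0;
            int failed = 0;
            foreach (FeatureTest test in group)
                if (!test.Passes(I) && ++failed > tolerance)
                    return false;                 // rejected by this weak-classifier group
        }
        return true;                              // survived all groups: reported as a face
    }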

Fig. 6. Calculating the sum of any rectangular surface (areas A, B, C, D delimited by the corner points 1, 2, 3, 4)

Detected faces are marked by drawing a green frame at the corresponding position in the original image. It is very important to exclude from further testing the areas that overlap partially or completely with a location already marked. This is justified because two faces never overlap in a picture; it avoids marking the same face multiple times and running unnecessary tests.

Fig. 9. Portion of the algorithm where the tests are run on the current sub-image (each feature test either rejects the candidate as "Not a face!" or passes it to the next test; a candidate that passes feature N is reported as "Face detected!")

D. The resulting surveillance system

The surveillance system, built in the C# programming language using Windows Forms, has the following facilities:

• Connecting to an IP camera by sending an HTTP request with the needed credentials (see the sketch after this list);

• Showing the live video stream;

• Changing various camera parameters (resolution, frame rate, image compression ratio, contrast, and brightness);

• Running multiple instances of the monitoring window, connected to different network cameras at the same time;

• Triggering the recording process when a face is detected in the live video stream, and stopping it if no face has been detected for a period of time;

• Playing back the recorded files.
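As an illustration of the first facility, opening the camera's HTTP stream with credentials might look like the sketch below. The endpoint path is hypothetical and depends on the camera model; the application's actual code is not reproduced here.

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    static class CameraClient
    {
        // Opens an MJPEG stream from an IP camera using HTTP Basic authentication.
        // "/video.mjpg" is a placeholder path; consult the camera's HTTP API.
        public static async Task<Stream> OpenStreamAsync(string host, string user, string password)
        {
            var client = new HttpClient();
            string token = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{user}:{password}"));
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);
            return await client.GetStreamAsync($"http://{host}/video.mjpg");
        }
    }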

V. TESTS AND RESULTS

All the results presented below were obtained on a laptop with an Intel Core i3-370M CPU running at 2.4 GHz (two physical cores with Hyper-Threading technology) and 8 GB of DDR3 RAM. The analysis itself was performed in a single thread on a single core of the CPU.

The first testing stage used a set of 280 pictures at 320*240 resolution obtained from the FRAV2D database, made available by Rey Juan Carlos University, Spain.

From this database we extracted the first 10 sets of 32 pictures, each set representing a different person. From each set of 32 pictures we removed the 4 pictures in which the subject covers half of the face with a hand. The remaining 10 sets of 28 pictures contain all the frontal images, the pictures showing the subject turned at an angle of 15 or 30 degrees, and the frontal pictures in which the subject displays various gestures.

In all the examples the actual face occupies about 30% of the picture and the background color is dark blue.

Table 1. Results obtained on images from the FRAV2D face database

Group name | Total faces | Detected | Missed
Group 1    | 28          | 25       | 3
Group 2    | 28          | 25       | 3
Group 3    | 28          | 23       | 5
Group 4    | 28          | 25       | 3
Group 5    | 28          | 26       | 2
Group 6    | 28          | 24       | 4
Group 7    | 28          | 25       | 3
Group 8    | 28          | 24       | 4
Group 9    | 28          | 26       | 2
Group 10   | 28          | 26       | 2
Total      | 280         | 249      | 31

Based on Table 1, the detection rate over all tested images is approximately 89% (249 of 280 faces).

The algorithm didn’t detect the faces especially in images that illustrate the subject turned at an angle of 30 degrees.

Fig. 10. Examples of detected (left) and missed (right) faces

On images with complex backgrounds, the algorithm may classify as faces shapes whose shadows and proportions resemble a human face, even though they are not. At the same time, faces that are not frontal or are covered by shadow may be missed. In Figure 11, the faces without well-defined facial features are missed, yielding 12 detected faces out of 17.

Fig. 11. Results on a group picture

Figure 12 shows another interesting case: the algorithm wrongly detects two faces, which prevents it from analyzing two areas that contain faces that most likely could have been detected.

Fig. 12. Wrong detection blocking the real faces from being analyzed.

Regarding processing time, the results are very encouraging. An average response time of 40 ms is obtained for a 320*240 image, meaning that a video stream at this resolution can be analyzed at a rate of 25 frames per second. For VGA images the average time is 0.15 seconds, meaning a rate of about 7 frames per second.

VI. CONCLUSIONS

Building a robust face-detection algorithm is complicated mainly because it involves using or implementing complex algorithms specialized in establishing the tests and the way these tests should be run. Once those tests are established, the analysis process itself is relatively easy.

The required performance is imposed by the application. For the proposed surveillance system, some conditions can be controlled, since the images are taken from a fixed location: lighting sources could be mounted in order to eliminate the problems caused by inappropriate lighting. Also, since the system works with both VGA and QVGA (320*240) video streams, a simple trick is to analyze only every fourth frame when the resolution is set to VGA.

Another improvement that will be made to this system, and in particular to the detection algorithm, is a set of routines for the automatic generation of tests with significant results, eliminating the need for an operator in the learning phase, together with an extension of the detection capabilities to patterns other than faces.

REFERENCES

[1] Franklin Crow, "Summed-area tables for texture mapping", Proceedings of SIGGRAPH, Conference on Computer Graphics and Interactive Techniques, Vol. 18, No. 3, ISBN 0-89791-138-5, pp. 207-212, July 1984.

[2] Paul Viola, Michael Jones, "Robust Real-time Object Detection", Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing and Sampling, Vancouver, Canada, pp. 1-25, July 2001.

[3] Yao-Jiunn Chen, Yen-Chun Lin, "Simple Face-detection Algorithm based on Minimum Facial Features", Industrial Electronics Society, IECON 2007, 33rd Annual Conference of the IEEE, pp. 455-460, November 2007.

[4] Paul Viola, Michael Jones, "Rapid object detection using a boosted cascade of simple features", Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I-511 - I-518, 2001.

[5] Ming-Hsuan Yang, "Recent Advances in Face Detection", Honda Research Institute, Mountain View, California, USA, IEEE ICPR tutorial, August 2004 (available at: http://faculty.ucmerced.edu/mhyang/papers/icpr04_tutorial.pdf).

[6] Henry A. Rowley, Shumeet Baluja, Takeo Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, pp. 23-38, January 1998.