Viola–Jones object detection framework: Difference between revisions

Content deleted Content added

Inline

Revision as of 00:05, 13 December 2014

The Viola–Jones object detection framework is the first object detection framework to provide competitive object detection rates in real-time proposed in 2001 by Paul Viola and Michael Jones.^[1]^[2] Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. This algorithm is implemented in OpenCV as cvHaarDetectObjects().

Problem Statement and Description

The basic problem to be solved is to implement an algorithm for detection of faces in an image. This can be solved easily by humans. However there is a stark contrast to how difficult it actually is to make a computer successfully solve this task.

In order to ease the task Viola-Jones limit themselves to full view frontal upright faces. That is, in order to be detected the entire face must point towards the camera and it should not be tilted to any side. This may compromise the requirement for being unconstrained a little bit, but considering that the detection algorithm most often will be succeeded by a recognition algorithm these demands seem quite reasonable.

Related Papers

Rapid object detection using a boosted cascade of simple features, 2001 by Paul Viola and Michael J. Jones.

Robust Real-Time Objection Detection, 2004 by Paul Viola and Michael J. Jones.

The main characteristics of viola-jones algorithm which makes it a good detection algorithm is –

1.Robust – very high Detection Rate (True-Positive Rate) & very low False-Positive Rate always. 2.Real Time – For practical applications at least 2 frames per second must be processed. 3.Face Detection and not recognition. The goal is to distinguish faces from non-faces (face detection is the first step in the identification process)

The algorithm has mainly 4 stages –

1.Haar Features Selection. 2.Creating Integral Image. 3.Adaboost Training algorithm. 4.Cascaded Classifiers.

Components of the framework

Feature types and evaluation

The features employed by the detection framework universally involve the sums of image pixels within rectangular areas. As such, they bear some resemblance to Haar basis functions, which have been used previously in the realm of image-based object detection.^[3] However, since the features used by Viola and Jones all rely on more than one rectangular area, they are generally more complex. The figure at right illustrates the four different types of features used in the framework. The value of any given feature is always simply the sum of the pixels within clear rectangles subtracted from the sum of the pixels within shaded rectangles. As is to be expected, rectangular features of this sort are rather primitive when compared to alternatives such as steerable filters. Although they are sensitive to vertical and horizontal features, their feedback is considerably coarser.

1.Haar Features – All human faces share some similar properties. This knowledge is used to construct certain features known as Haar Features. The properties that are similar for a human face are: • The eyes region is darker than the upper-cheeks. • The nose bridge region is brighter than the eyes. • That is useful domain knowledge • Location - Size: eyes & nose bridge region

There are mainly 4 types of Haar features used in the algorithm. They are –

1.

           Fig 1: Haar Feature that looks similar to the bridge of the nose is applied onto the face.

2.

Fig 2: Haar Feature that looks similar to the eye region which is darker than the upper cheeks, is applied onto a face.

3.

                    Fig 3: 3rd and 4th kind of Haar Feature.

 Rectangle features: • Value = ∑ (pixels in black area) - ∑ (pixels in white area) • Three types: two-, three-, four-rectangles, Viola & Jones used two-rectangle features • For example: the difference in brightness between the white &black rectangles over a specific area  Each feature is related to a special location in the sub-window

2.Integral Image - However, with the use of an image representation called the integral image, rectangular features can be evaluated in constant time, which gives them a considerable speed advantage over their more sophisticated relatives. Because each rectangular area in a feature is always adjacent to at least one other rectangle, it follows that any two-rectangle feature can be computed in six array references, any three-rectangle feature in eight, and any four-rectangle feature in just nine.

Using the integral image representation we can compute the value of any rectangular sum (part of features) in constant time.

For example the integral sum inside rectangle D can be computed as: ii(d) + ii(a) – ii(b) – ii(c) We have to select a subset of relevant features – which are informative to model a face. Hypothesis: “A very small subset of features can be combined to form an effective classifier”. This is done by the Adaboost Algorithm.

3.Adaboost (Adaptive Boost) Algorithm – Constructs a “strong” classifier as a linear combination of weighted simple “weak” classifiers.

Learning algorithm

The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of 162,336^[4] possible features, and it would be prohibitively expensive to evaluate them all. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them.

Features as weak classifiers • Each single rectangle feature may be regarded as a simple weak classifier An iterative algorithm • AdaBoost performs a series of trials, each time selecting a new weak classifier Weights are being applied over the set of the example images • During each iteration, each example/image receives a weight determining its importance  Given: example images labeled +/- . Here + represents faces and - represents non-faces.

• Initially, all weights set equally  Repeat T times • Step 1: choose the most efficient weak classifier that will be a component of the final strong classifier. This weak classifier has the least weighted error.

• Step 2: Update the weights to emphasize the examples which were incorrectly classified  This makes the next weak classifier to focus on “harder” examples  The misclassified ones are marked with red circles.

• Step 3: Increase the weights on the training examples that were misclassified.

• Step 4: Repeat Steps 1 to 3 “T” times where “T” is the number of weak classifiers.  Final (strong) classifier is a weighted combination of the T “weak” classifiers • Weighted according to their accuracy

  h(x) = 2 0 otherwisetherwise Where, h(x) is the final strong classifier, αt is the weight and ht(x) is the weak classifier. Summary of Adaboost:  AdaBoost starts with a uniform distribution of “weights” over training examples.  Select the classifier with the lowest weighted error (i.e. a “weak” classifier)  Increase the weights on the training examples that were misclassified.  (Repeat)  At the end, carefully make a linear combination of the weak classifiers obtained at all iterations

4.Cascading

 On average only 0.01% of all sub-windows are positive (faces)  Status Quo: equal computation time is spent on all sub-windows  Must spend most time only on potentially positive sub-windows.  A simple 2-feature classifier can achieve almost 100% detection rate with 50% FP rate.  That classifier can act as a 1st layer of a series to filter out most negative windows  2nd layer with 10 features can tackle “harder” negative-windows which survived the 1st layer, and so on…  A cascade of gradually more complex classifiers achieves even better detection rates.

Cascade architecture

The evaluation of the strong classifiers generated by the learning process can be done quickly, but it isn’t fast enough to run in real-time. For this reason, the strong classifiers are arranged in a cascade in order of complexity, where each successive classifier is trained only on those selected samples which pass through the preceding classifiers. If at any stage in the cascade a classifier rejects the sub-window under inspection, no further processing is performed and continue on searching the next sub-window (see figure at right). The cascade therefore has the form of a degenerate tree. In the case of faces, the first classifier in the cascade – called the attentional operator – uses only two features to achieve a false negative rate of approximately 0% and a false positive rate of 40%.^[5] The effect of this single classifier is to reduce by roughly half the number of times the entire cascade is evaluated.

 Here each stage consists of a strong classifier. So all the features are grouped into several stages where each stage has certain number of features.  The job of each stage is to determine whether a given sub-window is definitely not a face or may be a face. A given sub-window is immediately discarded as not a face if it fails in any of the stages. A simple framework for cascade training is given below: • User selects values for f, the maximum acceptable false positive rate per layer and d, the minimum acceptable detection rate per layer. • User selects target overall false positive rate Ftarget. • P = set of positive examples • N = set of negative examples • F0 = 1.0; D0 = 1.0; i = 0 While Fi > Ftarget i++ ni = 0; Fi = Fi-1 while Fi > f x Fi-1 o ni ++ o Use P and N to train a classifier with ni features using AdaBoost o Evaluate current cascaded classifier on validation set to determine Fi and Di o Decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d x Di-1 (this also affects Fi) N = ∅ If Fi > Ftarget then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N.

The cascade architecture has interesting implications for the performance of the individual classifiers. Because the activation of each classifier depends entirely on the behavior of its predecessor, the false positive rate for an entire cascade is:

F=\prod _{i=1}^{K}f_{i}.

Similarly, the detection rate is:

D=\prod _{i=1}^{K}d_{i}.

Thus, to match the false positive rates typically achieved by other detectors, each classifier can get away with having surprisingly poor performance. For example, for a 32-stage cascade to achieve a false positive rate of $10^{-6}$ , each classifier need only achieve a false positive rate of about 65%. At the same time, however, each classifier needs to be exceptionally capable if it is to achieve adequate detection rates. For example, to achieve a detection rate of about 90%, each classifier in the aforementioned cascade needs to achieve a detection rate of approximately 99.7%. ^{[citation needed]}

Advantages of Viola-Jones Algorithm:  Extremely fast feature computation  Efficient feature selection  Scale and location invariant detector • Instead of scaling the image itself (e.g. pyramid-filters), we scale the features.  Such a generic detection scheme can be trained for detection of other types of objects (e.g. cars, hands)

Disadvantages of Viola-Jones Algorithm:  Detector is most effective only on frontal images of faces • It can hardly cope with 45o face rotation both around the vertical and horizontal axis.  Sensitive to lighting conditions  We might get multiple detections of the same face, due to overlapping sub-windows.

Figure: Detected Face using the cascadeObjectDetector function in Matlab. This function uses the viola-jones algorithm to detect faces. Matlab Code for implementing the cascadeObjectDetector() function: clear all vid = videoinput('winvideo',1,'YUY2_640x480'); %giving the framesize vid.ReturnedColorSpace='RGB'; %mentioning RGB format vid.TriggerRepeat = Inf; %Triggers the camera repeatedly vid.FrameGrabInterval = 1; %time between successive frames start(vid); %starts capturing video detector = vision.CascadeObjectDetector();%create a detector using violajones while true %infinite loop to continuously detect the face img = flipdim(getdata(vid, 1),2); %flips the image horizontally imshow(img); hold on; %displays image

   bbox = step(detector, img); %creating bounding box using detector     if ~isempty(bbox) 
       rectangle('position', bbox(1,:), 'lineWidth', 2, 'edgeColor', 'y');     end    % draws a yellow rectangle around the detected face     flushdata(vid);  
   hold off; end

Related Face Detection and Tracking Algorithm: The algorithm that is similar to viola-jones but can detect and track even tilted and rotated faces is the KLT algorithm. Here the feature points are detected by scanning the face initially. Thereafter, the face can be detected and tracked even when the face is tilted and further away from the camera, which is not possible in case of viola jones algorithm. http://www.mathworks.com/help/vision/examples/face-detection-and-tracking-using-thekltalgorithm.html

Improvements over the Viola Jones Algorithm: An improved algorithm on Viola-Jones object detector http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6269796

MATLAB implementation of Viola-Jones

Algorithm: http://www.mathworks.in/help/vision/ref/vision.cascadeobjectdetectorclass.html?refresh=true Open CV implementation of Viola-Jones Algorithm: Haar Cascade Detection in OpenCV: http://docs.opencv.org/trunk/doc/py_tutorials/py_objdetect/py_face_detection/py_face _detection.html Cascade Classifier Training in OpenCV: http://docs.opencv.org/doc/user_guide/ug_traincascade.html Citations of the Viola-Jones Algorithm in Google Scholar: http://scholar.google.com/citations?view_op=view_citation&hl=en&user=G2nFaIAAAAJ&citati on_for_view=G2-nFaIAAAAJ:u5HHmVD_uO8C

References

^ Rapid object detection using a boosted cascade of simple features
^ Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See pages 1,3.
^ C. Papageorgiou, M. Oren and T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998
^ Viola-Jones' face detection claims 180k features
^ Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See page 11.

External links

[1] Rapid object detection using a boosted cascade of simple features

[2] Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See pages 1,3.

[3] C. Papageorgiou, M. Oren and T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998

[4] Viola-Jones' face detection claims 180k features

[5] Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See page 11.

[1]

[2]

[3]

[4]

[5]