# SURF

For the urban renewal body, see Scottish Urban Regeneration Forum. For the university computing organisation in the Netherlands, see SURFnet.

SURF (Speeded Up Robust Features) is a robust local feature detector, first presented by Herbert Bay et al. ECCV 9th in International Conference on Computer Vision held in Austria in May 2006, that can be used in computer vision tasks like object recognition or 3D reconstruction. It is partly inspired by the SIFT descriptor. The standard version of SURF is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT. SURF is based on sums of 2D Haar wavelet responses and makes an efficient use of integral images.

It uses an integer approximation to the determinant of Hessian blob detector, which can be computed extremely quickly with an integral image (3 integer operations). For features, it uses the sum of the Haar wavelet response around the point of interest. Again, these can be computed with the aid of the integral image. This information is treated to perform operations such as locate and recognition of certain objects, people or faces, make 3D scenes, object tracking and extraction of points of interest. This algorithm is part of that artificial intelligence, able to train a system to interpret images and determine the content.

An application of the algorithm is patented in the US.[1]

## Overview

SURF is a detector and a high-performance descriptor points of interest in an image where the image is transformed into coordinates, using a technique called multi-resolution. Is to make a copy of the original image with Pyramidal Gaussian or Laplacian Pyramid shape and obtain image with the same size but with reduced bandwidth. Thus a special blurring effect on the original image, called Scale-Space is achieved. This technique ensures that the points of interest are scale invariant. The SURF algorithm is based on the SIFT predecessor.

## Algorithm and Features

### Detection

The SURF algorithm is based on the same principles and steps off SIFT, but it uses a different scheme and should provide better results: it works much faster. In order to detect characteristic points on a scale invariably SIFT approach it uses cascaded filters, where the difference Gaussian, DOG, is calculated on rescaled images progressively.

#### Integral Image

Similar to SDoG.

Instead of using Gaussian averaging the image, squares are used (approximation). Making the convolution of the image with a square is much faster if the integral image is used. The integral image is defined as:

$I(x)=\sum_{i=0}^{i<=x}\sum_{j=0}^{j<=y} I(x,y)$

The sum of the original image within a rectangle D image can be evaluated quickly using this integrated image. I(x, y) added over the selected area requires four evaluations S(x, y) (A, B, C, D)

#### Points of interest in the Hessian matrix

SURF uses a BLOB detector based on the Hessian to find points of interest. The determinant of the Hessian matrix expresses the extent of the response and is an expression of a local change around the area.

The detector is based on the Hessian matrix, due to its good performance in accuracy. More precisely, BLOB structures are detected in places where the determining factor is the maximum. In contrast to the detector Hess - Laplace Mikolajczyk and Schmid, also is based on the determinant of the Hessian for selecting scale, as it is done by Lindeberg. Given a point x = (x, y) in an image I, the Hessian matrix H (x, σ) in x at scale σ , is defined as follows:

$H(x,\sigma)=\begin{pmatrix} Lxx(x,\sigma) & Lxy(x,\sigma) \\ Lxy(x,\sigma) & Lyy(x,\sigma) \end{pmatrix}$

where $Lxx=(x,\sigma)$ is the convolution of second order derivative $\partial x/\partial x^2 g(\sigma)$ with the image in the point x , y similarly with $Lxy=(x,\sigma)$ and $Lyy=(x,\sigma)$.

The Gaussians filters are optimal for scale analysis - space, but in practice should be quantized and clipped. This leads to a loss of repeatability image rotations around the odd multiple of π / 4. This weakness is true for Hessian-based detectors in general. Repeatability of peaks around multiples of π / 2. This is due to the square shape of the filter. However, the detectors still work well, the discretization has a slight effect on performance. As real filters are not ideal, in any case, given the success of Lowe with logarithmic approximations, they push the approximation of the Hessian further matrix with square filters. These filters second order Gaussian aproximate can be evaluated with a cost very low with the use of integrated computer images. Therefore, the calculation time is independent of the filter size. Here are some approaches: Gyy and Gxy (1)

Approximation Filters Gxx & Gxy (1)

The box filters 9x9 are approximations of a Gaussian with σ = 1.2 and represents the lowest level ( higher spatial resolution ) for computerized maps BLOB response. Is denoted Dxx , Dyy , Dxy . The weights applied to the rectangular regions are maintained by the efficiency of the CPU.

Images are calculated :

-Dxx (x, y ) from I ( x , y) and Gxx ( x , y)

-Dxy (x, y ) from I ( x , y) and Gxy (x, y )

-Dyy (x, y ) from I ( x , y) and Gxyyx , y)

Then, the following image is generated:

$Det(H aprox) = DxxDyy-(wDxy)^2$

The relative weighting (w) of the filter response is used to balance the expression for the Hessian determinant . It is necessary for the conservation of energy between Gaussian kernels and Gaussian kernels approximate.

$w=\frac{|Lxy(1.2)|F|Dyy(9)|F}{|Lyy(1.2)|F|Dxy(9)|F}$

0.9 factor appears such a correction factor using squares instead of gaussians. it can generate several images det (H) for several filter sizes . This is called multi- resolution analysis.

|x|F is the Frobenius norm:  Frobenius

Changes of weighting depends on scale σ . In practice, this factor is kept constant . How it keep constant? Normalizing the filter response relative to its size. This ensures the Frobenius norm for any other filter.

The approximation of the determinant of the Hessian matrix representing the response BLOB image on location x . These responses are stored in the BLOB response map on different scales.

Then, the local maxima are searched.

#### Scale-space representation & location of points of interest

The attractions can be found in different scales , partly because the search for correspondences often requires comparison images where they are seen at different scales. The scale spaces are generally applied as a pyramid image . Images are repeatedly smoothed with a Gaussian filter, then , is sub sampled to achieve a higher level of the pyramid. Therefore , several floors or stairs " det H" with various measures of the masks are calculated:

$\sigma approx = Current filter size * \left( \frac{\left(Base Filter scale\right) }{Base Filter Size} \right)$

The scale -space is divided into a number of octaves , Where an octave Refers to a series of response maps of covering a doubling of scale . In SURF The Lowest level of the Scale- space is Obtained from the output of the 9 × 9 filters.

After 3D maxima are looking at ( x, y , n) using the cube 3x3x3 neighborhood . From there it is proceed to do the interpolation of the maximum. Lowe rest of the layers of the pyramid to get the DOG (Difference of Gaussian ) find images contours and stains.

Specifically, it is entered by variant a quick and Van Gool Neubecker used . The maximum of the determinant of the Hessian matrix in scale and space interpolated image with Brown and Lowe proposed method. The approach of the determinant of the Hessian matrix represents the response of BLOB in the image to the location x . These responses are stored in the BLOB map of responses on different scales.

### Description

The next three introducing steps are explained in the following: The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point. Then, we construct a square region aligned to the selected orientation and extract the SURF descriptor from it. Finally, features are matched between two images.

#### Orientation Assignment

The goal of a descriptor is to purpose a unique and robust description of a set. Describe the intensity distribution of a content within the interest point of neighbourhoods. It is generated based in the surrounding area of a certain interest point, so that we obtain a descriptor vector for every interest point.

The dimensions of the descriptor have an direct impact with the acquisition time has been taken. So that, less dimensions are undesire for matching the interest points, eventhough it provides less distintctions than a bigger dimension. Below demonstrate the entire procedure is perdormed out in order to make the descriptor process.

The first step to obtain the descriptor, once calculated the scaling, is calculate the orientation of the interest point. To obtain a point invariant to rotations, illumination and orientation, For that purpose, first is calculated the Haar wavelet responses in x and y direction within a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected. Them have the principal feature of repetibility, that means if some point is considerated realiable, the detector will find the same point under differents perspective (scale, orientation, rotation, etc).

It has one position (x,y) for each interest point. Once the wavelet-Haar responses has been realised, with a Gaussian centred in the interest poin, the responses are plotted as points in the space, where the horizontal response is in the abscissa and the vertical response in the ordenate. Now, once calculated for all neighbourhoods, the dominant orientation is estimated with the sum of all resuts within a sliding window that covers an angle of pi/3.

The dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window of size π/3. The horizontal and vertical responses within the window are summed. The two summed responses then yield a local orientation vector. The longest such vector over all.

window defines the orientation of the interest point. The size of the sliding window is a parameter which had to be chosen carefully. Small sizes fire on single dominating gradients, large sizes tend to yield maxima in vector length that are not outspoken. Both result in a misorientation of the interest point.

#### Descriptor based on Sum of Haar Wavelet Responses

For the extraction of the descriptor, the first step consist of constructing a square region centred around the interest 6point and oriented along the orientation selected in the previous section. The size of this window is 20s.

The interest region is split up into smaller 4x4 square sub-regions, and for each one, it is computed Haar wavelet responses at 5x5 regularly spaced sample points., and weighted with a Gaussian (to offer more robustness for deformations, noise and translations, This responses are dx and dy.

The select interest point orientation is who defines de horizontality and verticality. For each sub-region the resultts of $dx$, $dy$ and their absolutes, are summed in a vector, the descriptor:

$v=(\sum_{} dx, \sum_{} dy, \sum_{} |dx|, \sum_{} |dy| )$

Which is distinctive and, at the same time, robust to noise , geometric and photometric distortions and errors. Surf descriptor obtained by the union of the sub- vectors .

### Matching

This section details the step back in search of characteristic points that provides the detector. This way it is possible to compare between descriptors and look for matching pairs of images between them. There are two ways to do it:

• Get the characteristic points of the first image and its descriptor and do the same with the second image . So you will be able to compare the two images descriptor correspondences between points and establish some kind of measure.
• Get the characteristic points of the first image with the descriptor. Then compare this descriptor with the points of the second image which is believed to be the partner concerned.

## Development of an algorithm using the SURF method

1- HOME : Initialization of all global variables and necessary to run the program functions.

2- SEARCH CAMERA : If a camera is connected , go to step : Capture sequence and stored on your computer . If not, the searching step begins , if one receives a sequence of video stored on your computer.

3 - VIDEO SEARCH : Only enter if the algorithm did not find any video device connected to your computer . If a route stored video sequence is received, it is passed to the step of capturing sequence. If path is not received, it is closed.

4 - IMAGE CAPTURE : You can capture the image from the camera or stored video. Extracted for each iteration a picture . Program execution if no box ( pointer image is NULL) ends .

5- THERE IS A PICTURE : Previous stage is reviewed , if the pointer does not mark any memory address, the execution is finished. If no image passes to the step of applying filter to soften the image. Gaussian filters , average , etc can be used ( they are the clearest ) ...

6 - SHOW VIDEO : Displayed on screen for each loop iteration , the image obtained in the image capture stage .

7- PAUSE VIDEO: The video can be stopped, and the last image obtained by the capture step is displayed on screen in a separate window that would go on to the next stage.

8 - SELECTION ROI ( Region Of Interest) : With the image in the window duplicated once made the pause, the image area or zone is selected. The selected area will appear in another window to be grasped as a base image that can analyze the rest of the video. This new image will be stored temporarily and parallel algorithm .

9 - SELECTION OF POINTS OF THE MASK : The mask is an image containing two colors: white and black . This image is taken as a basis for finding points of interest in the SURF algorithm in later steps.

10, IMAGE CONVERSION OF GRAY SCALE : The image is converted to grayscale , because the algorithm needs of the first selected by the user and on the boundary area blank point selection mask.

11- EXTRACTION POINT SURF : This step extracts the points of interest ( image descriptors ) of the first selected by the user and on the boundary area blank point selection mask.

12-STORAGE OF IMAGE OF BASE: Temporarily saves images to be picked for future comparison with the rest of the video sequence . These are:

-Image Obtained from the ROI selection gray ladder .

13 - COMPARISON USING SURF TEMPLATES : The images above are not changed and compared with other video sequence . At this stage you can identify image descriptors and descriptors of the video stream , drawing a line between them. Switching between these descriptors, we determine wether the selected object will be determined is partially clogged or has been removed.

14- OBSTRUCTION COUNTER : This stage is used in case the object has been partially or totally obstructed by a short period of time . It will have two temporary counters:

- Obstructor Counter : Timer or cycles that indicates whether the object has been overlapped or obstructed. While the object is overlapped the counter increases when it reaches the maximum, the end of program counter is activated.

- Counter End Program : Turns the program or show warning if it reaches a certain time or cycles where the object is not seen yet. Having peaked in the counter obstruction. If you have been a temporary obstruction, counters are reseted.

15- WARNING OBJECT BLOCKED : The video stops if confirmed whether the object is missing or has been blocked for a considerable time . It also indicates the second when the algorithm detects . Spent stop showing the video image and finally ends the program.

16- VALUES OF DESCRIPTORS : And taking the values ​​taken in the extraction step SURF points , a condition is created to determine whether the object is present or not . Yes 4 descriptors, X <4 , where X is the number of descriptors analyzed in each table are taken . Thus , it checks whether or not the object is .

17- LOCATION PURPOSE . POINT LOCATION FLANN . Relationships between descriptors points of both the image and the video stream are and a line is drawn .

Location object using homography .

18- Pass the video Tu

## Surf vs Sift

### Presentation of the algorithms

The recognition of images or objects, is one of the most important applications of computer vision, becomes a comparison of local descriptors SIFT (Scale Invariant Feature Transform) [David Lowe, 1999] and SURF (Speeded-UP Feature transform) [Bay and Tuytelaars 2006]. These two local descriptors to detect structures or very significant points in an image for a discriminating description of these areas from its neighboring points, in order to compare them with other descriptors using similar measures. Below are given some basic features of these two algorithms:

SIFT (Scale Invariant Feature Transform) was published by David Lower in 1999 with the idea of proposing an algorithm capable of extracting the characteristics of an image already splitting of these describe the set of objects that were contained in it. This algorithm can be broken into 4 stages:

-Detection of edges on the scale: the first stage performs a search in different scales and dimensions of the image by identifying possible points of interest, invariant to changes in orientation and scaling. This procedure is carried out with the function (DoG (Difference-of-Gaussian) giving different values to the σ, in the following equation):

$D \left( x, y, \sigma \right) = L \left( x, y, k_i\sigma \right) - L \left( x, y, k_j\sigma \right),$

Where G is the Gaussian function and the I is the image. Now Gaussian images are subtracted in order to produce the DoG, after that is submostrea the Gaussian image by a factor of 2 to obtain a DoG image sampling. Each pixel is set with its neighbors with a mask 3 x 3 to find maxim and minim local of D (x, y, σ).

-Location of the keypoints: key points are chosen on the basis of stability measures, removed the key points with low.

-Contrast or are located in the margins-orientation assignment: invariance to rotation is achieved by assigning to each of the points of orientation based on the local properties of the image and which represents the descriptor for this orientation.

-Descriptor keypoints: the local gradients of the image are measured in the region surrounding the key point. These are processed by means of a representation which allows distortion levels and changes in lighting locally.

Més informació del SIFT

SIFT more

SURF (Speeded-UP Feature Transform) as we have seen in previous sections, the SURF is an algorithm developed by Herbert Bay, Tinne Tuytelaars and Luc Van Gool (2008), works as a detector and descriptor of sites focused on object recognition. This algorithm has 3 stages:

-Detection of points of interest, keypoints.

-Assignment guidance.

### Theoretical differences

The main differences between the algorithms SIFT and SURF are the speed of implementation of respect for one another, but with both points of interest of the image keep invariants, scale and orientation, and changes in the lighting. The main differences of data in multiple experiments are that SIFT is kept the position, scale and orientation, since it is possible that in a same position (x, and), we find several points of interest to different scale and / or orientation σ. But another side in the SURF in a position (x, and) appears only a single point of interest, by what is not saved the scale and orientation, however if that registers the matrix of second order and the sign of the Laplacian.

### Experimental differences

#### Test 1

To make comparisons must take into account the versions of the algorithms that are used in each case, in this article obtained conclusions with version 4 of the Sift and SURF with C code version 1.0.9 and analyzing images with greyscale, are also drawn conclusions from results of experiments with robots and response of the algorithms in platforms Android. The result of tests with 109 images, as announced before, it has been confirmed that the SURF method is much faster than SIFT. From the point of view of invariant points detected, the SIFT algorithm outperforms surfing more than double points detected. This is caused because the SURF allows that multiple invariant points in a same position with different scale or orientation, while in the SIFT if that is possible there are. This fact means that the distance between the number of detected features would decrease if the SIFT not duplication of points. The graphs show that the SIFT method has some 2.68 invariant points that the SURF with respect to time, SIFT more used about 1.646 ms to analyze an image and the SURF 485 ms, is almost one-third of the time, efficiency in this factor is considerable. This factor is important in face of the purpose of each project, if you want an algorithm that can get a great amount of data would be more advisable to SIFT, but a quick response is needed and we don't give so much importance to the number of detected features SURF algorithm is the most suitable.

SIFT SURF

There is detail the order of the images by comparing the photo is very important, since results may vary if you choose a comparison with X vs and if done with and vs X. The SIFT method shows more points when the images being compared have more points in common, unlike the SURF gives more values when the images are more differentiated, false positive values. This happens with all photographs and leads to the conclusion that the SIFT algorithm has more durability in the time to SURF, obtaining more reliable pairings.

109 IMAGES

SIFT SURF
Detected points 1292 482
Average time 1646.53 ms 485.77 ms
• Data extracted from an analysis made by A. M. Romero and M. Cazorla, in the document "comparison of detectors of Visual characteristics and its application to SLAM ”.

#### Test 2

The following study is an experiment of performance, consumption and efficiency of algorithms, in which 100 prepared logos have been and was executed 10 times each test in functionality, in order to see if the algorithms modify their behavior depending on the mobile device. The components of the testing scenario were the following:

Database: is a tuple consisting of a vector characteristics and other points of training Interes.

Training component: saves the database and points of interest generated by test algoritmos.

Try component: allows to get the vector of characteristics and interest points generated by each descriptor (SIFT, SURF) from a search (matching) input.

Search (matching): allows you to obtain a set of correspondences between the obtained tuple and tuple from the database.

SIFT SURF
Average interest points 222 431
Runtime 2887.13 ms 959.87 ms
Power consumption 0.04 % 0.02 %
Memory used 102.30 KB 385.02 KB
Effectiveness 0.33 0.40

In the previous table, the results obtained with regard to the points of interest is observed the difference from the average of points detected by the SIFT method much less than surfing, because of the difference in the compared images. Regarding runtime SIFT is much slower than the SUFT, indicating that for real-time applications this algorithm should be ruled out. In battery consumption the results are proportional to the runtime, so SIFT consumes twice as much as the SUFT method.Regarding the memory used SUFT method exceeds the resources to store the vector of characteristics of images, almost four times more than the method SIFT on issues of effectiveness the SIFT method is also less effective than the SUFT, but slightly lower.

## Implementation SURF OpenCV

### Language and environment

C ++ is chosen for the following reasons:

Speed: image processing is needed to be low-level C ++ so fast and easy.

Usability: It is very typical to use C ++ in digital image processing. But it is also useful C or Matlab code. Portability: C ++ is portable to multiple platforms, so you can move across different platforms and compilations

Libraries for image processing: OpenCV is a library function in C ++ that allows reaching images, videos and live video from webcam or other devices. Supported by windows and linux.

A possible environment is Visual Basic C ++ together with VC ++, a powerful IDE that allows code to easily create and organize projects. OpenCV integrates nicely with e compiler. Visual C ++ and OpenCV are free, which will allow the library to be distributed without licensing restrictions.

### Architecture Design

#### Integral Image

Description: The module creates and manipulates embedded images.

Tickets: A Portrait.

Process: Create the representation of the integral image supplied input image. Calculates the sum of pixels on rectangular areas.

Outputs: The representation of the overall image.

#### Fast Hessian

Description: Find the Hessian based attractions / regions a comprehensive picture.

Tickets: A representation of the overall image of an image.

Processes: Build determinant of Hessian Response Map. Make a non-maximal suppression to locate points of interest on a scale space. Interpolates detects accurately points to sub - pixel.

Outputs: A vector of attractions located precisely.

#### SURF descriptor

Description: Removes descriptor components for a given set interest points detected.

Tickets: A representation of the overall image of an image, vector POI.

Processes: Calculate Haar wavelet responses to them. Calculate the dominant orientation of a landmark. Extracted 64 - dimensional descriptor vector sums of wavelet-based responses.

Outputs: A vector SURF descriptors, points of interest.

#### Points of Interest

Description: Stores data associated with each individual point of interest.

Posts: Specifics of interest.

Processes: access methods / switch for data.

Outputs: None

#### Services

Description: This module contains all specific functions do not SURF.