# User:Visionwiki/ASIFT (Affine-SIFT)

Affine-SIFT (ASIFT)

Affine-SIFT (ASIFT) is a computer vision algorithm that aims at establishing correspondences between objects in images taken potentially from different viewpoints and at different distances. The algorithm was published by Morel and Yu[1] in 2009. Typical applications include image recognition, 3D reconstruction, object tracking, robot localization and image registration.

ASIFT is an extension of the scale-nvariant feature transform (SIFT) of Lowe[2], a popular image matching algorithm that is invariant to rotation, translation, zoom, and partially invariant to illumination change. In addition to these invariances, ASIFT is invariant to camera angle change.

## How does ASIFT work?

File:ASIFT overview.png
ASIFT algorithm overview.

In short, ASIFT achieves fully affine invariance by first simulating two camera angle parameters and then applying SIFT between the simulated images.

### Camera model

To understand how ASIFT works, let us first look at a camera model that describes the procedure of image acquisition. Digital image acquisition of a flat object can be described as

${\displaystyle \mathbf {u} =\mathbf {S} _{1}G_{1}A{\mathcal {T}}u_{0}}$

where ${\displaystyle {\textbf {u}}}$ is a digital image and ${\displaystyle u_{0}}$ is an infinite resolution frontal view of the flat object. ${\displaystyle {\mathcal {T}}}$ and ${\displaystyle A}$ are respectively a plane translation and a planar projective map due to the camera motion. ${\displaystyle G_{1}}$ is a Gaussian convolution modeling the optical blur, and ${\displaystyle \mathbf {S} _{1}}$ is the standard sampling operator on a regular grid with mesh 1. The sampling operator ${\displaystyle \mathbf {S} _{1}}$ will be neglected in the following as the Shannon condition is assumed to hold, which allows to invert the sampling operator by a Shannon-Whittaker interpolation.

### From viewpoint invariance to affine invariance

In the camera model, the projective transforms ${\displaystyle A}$ can be locally approximated by affine transforms. Indeed, by the first order Taylor formula, any planar smooth deformation can be approximated around each point by an affine map. The apparent deformation of a plane object induced by a camera motion is a planar homographic transform, which is smooth, and therefore locally tangent to affine transforms.

As most other image matching algorithms that attempt to be viewpoint invariant[3], ASIFT seeks affine invariance as an approximation of viewpoint invariance.

### Affine space with 6 camera motion parameters

An affine deformation can be interpreted with 6 camera motion parameters.

File:Camera affine.png
Camera motion interpretation of an affine space.
• Translation. The camera can translate parallel to the image plane, as measured by the translation ${\displaystyle {\mathcal {T}}}$. (It contains two parameters ${\displaystyle x}$ and ${\displaystyle y}$.)
• Zoom. The camera can move forward or backward, as measured by the zoom parameter ${\displaystyle \lambda }$.
• Camera spin. The camera can rotate around its optical axis, measured by the rotation parameter ${\displaystyle \psi }$.
• Longitude angle. The plane containing the normal and the camera optical axis makes an angle ${\displaystyle \phi }$ with a fixed vertical plane.
• Latitude angle. The camera optical axis makes a ${\displaystyle \theta }$ angle with the normal to the image plane ${\displaystyle u}$. The latitude angle can be written in function of tilt ${\displaystyle t=1/\cos \theta }$, which measures the ratio of the image aspect-ratio between the frontal view and the slanted view.

#### High transition tilt

File:Transition tilt.png
The transition tilt between two images can be as big as the product of the absolute tilts of the two images.

The tilt ${\displaystyle t}$ defined above is the absolute tilt that measures the amount of aspect-ratio change between a frontal view and a slanted view. In real applications, both compared images are usually slanted views. The transition tilt is designed to quantify the amount of tilt between two such images. It can be shown that the transition tilt between two images can be as big as the product of the absolute tilts of the two images.

Transition tilt is an important notation. It measures the amount of viewpoint change between any two images. It can also be used to evaluate the affine invariance performance of image matching algorithms.

### How to achieve affine invariance?

There are essentially two approaches to make an image matching algorithm affine invariant.

• Simulation. The simulation approach first simulates the affine deformation on the images under comparison and then compare the deformed images. Affine invariance via simulation is reliable, but is computationally expensive to achieve. It is prohibitive to simulate all the 6 affine parameters.
• Normalization. Given an image patch that has undergone an unknown affine transform, the normalization approaches transforms somehow the patch into a standardized one that is independent of the affine transform. The comparison is performed on the patches normalization. The normalization approach is rapid, but it is unreliable when the affine deformation is large. Mathematically it can be shown that accurate normalization of the latitude angle ${\displaystyle \phi }$ and the longitude angle ${\displaystyle \theta }$ is impossible due to the camera optical blur[1].

### SIFT covers 4 parameters in the affine space

SIFT is invariant normalizes the translation and rotation parameters, and simulates scale changes in the scale-space. In consequence SIFT is invariant to translation, rotation and zoom.

SIFT does not treat the camera longitude angle and latitude angle (or equivalently tilt). Therefore in theory SIFT is not invariant to camera angle changes. Fortunately, in pratice SIFT is robust to moderate camera angle changes thanks to its local histogram coding strategy[2].

### ASIFT covers all 6 parameters in the affine space

#### ASIFT algorithm

ASIFT first simulates the camera longitude and latitude angles by simulating a number of rotation and tilt distortions on the images under comparison. (A tilt ${\displaystyle t}$ is simluated on the image by prefiltering the image along one direciton and then subsampling the image by a factor of ${\displaystyle t}$ along that direction.) The simulated images are then compared with a translation-, rotation-, and zoom-invariant image comparison algorithm, namely SIFT. By doing so, ASIFT covers all the 6 parameters in the affine space.

#### Full affine invariance

ASIFT is mathematically shown to be fully affine invariance, up to a sampling precision[1].

#### Longitude and latitude sampling

ASIFT simulates the lontitude angles ${\displaystyle \phi }$ and latitude angles ${\displaystyle \theta }$ by sampling a finite and small number of rotations ${\displaystyle \phi }$ and tilts ${\displaystyle t}$. The sampling is denser near the equator, so that the simulated images keep close to any other possible view generated by other values of ${\displaystyle \phi }$ and ${\displaystyle \theta }$. More precisely, ASIFT samples ${\displaystyle t}$ and ${\displaystyle \phi }$ as follows.

• Geometrical sampling of ${\displaystyle t}$. ${\displaystyle \Delta t=t_{k+1}/t_{k}=C_{t}}$, where ${\displaystyle C_{t}}$ is a constant.
• Arithmatical sampling of ${\displaystyle \phi }$. ${\displaystyle \Delta \phi =\phi _{k+1}-\phi _{k}=C_{\phi }/t}$., where ${\displaystyle C_{\phi }}$ is a constant.

A recommended configuration is ${\displaystyle t=1,{\sqrt {2}},2{\sqrt {2}},4,4{\sqrt {2}}}$ and ${\displaystyle \Delta \phi =72^{\circ }/t}$. The maximum simulated tilt ${\displaystyle t_{\max }=4{\sqrt {2}}}$ corresponds to a latitude angle ${\displaystyle \theta _{\max }\approx 80^{\circ }}$. The resulting ASIFT achieves a maximum transition tilt ${\displaystyle \tau =t_{\max }^{2}=32}$.

#### Coarse-to-fine acceleration

ASIFT is accelerated following a coarse-to-fine scheme that consists of the followin two steps.

• Low-resolution ASIFT. ASIFT is first performed on a subsampled version of the image. Low-resolution ASIFT allows to identify a few good affine transforms between the two images under comparison.
• High-resolution ASIFT. ASIFT is performed on the original full-resolution images by simulating only the good affine transforms.

High-resolution ASIFT is performed only if low-resolution ASIFT succeeds in establishing some correspondences. In a recommended configuration, Low-resolution ASIFT subsamples the images by a factor of ${\displaystyle 3\times 3}$.

#### Complexity of ASIFT

The complexity of ASIFT is proportional to the image area simulated by ASIFT. Following the recommended configuration and the coarse-to-fine acceleration, low-resolution ASIFT simulates about 1.5 times the area of the original image. The complexity of the low-resolution ASIFT feature computation is therefore 1.5 times that of SIFT. The resulting complexity of low-resolution ASIFT feature comparison is ${\displaystyle 1.5^{2}=2.25}$ times as much as that of SIFT.

If the image comparisons involve a large database where most comparisons will be failures, ASIFT stops essentially at the end of the low-resolution procedure, and the overall complexity is about twice the SIFT complexity, as argued above.

If the comparisons involve a set of images with high matching likeliness, then the high resolution step is no more negligible. The overall complexity of ASIFT depends on the number of the identified good affine transforms simulated in the high-resolution procedure as well as on the simulated tilt values ${\displaystyle t}$. However, in that case, ASIFT ensures many more detections than SIFT, because it explores many more viewpoint angles. In that case the complexity rate per match detection is in practice equal to or smaller than the per match detection complexity of a SIFT routine.

## Comparison to other popular methods

Most other popular affine invariant image matching algorithms such as MSER[4], Harris-Affine and Hessian-Affine[5] attempt to achieve affine invariance via normalization.

Using the maximum transition tilt ${\displaystyle \tau _{\max }}$ that an image comparison algorithm can attain as an objective measure, Morel and Yu[1] have checked the affine invariant performance of various algorithms. The conclusion they obtained experimentally is summarized as follows.

• ASIFT is robust to transition tilts ${\displaystyle \tau _{\max }\geq 36}$.
• SIFT is robust to transition tilts smaller than ${\displaystyle \tau _{\max }\approx 2}$.
• MSER is robust to transition tilts ${\displaystyle \tau _{\max }}$ between 5 and 10, if the images under comparison contain highly contrastly content and if there is no substantial zoom change between the images under comparison. If these conditions do not hold, the its performance drops dramatically.
• Harris-affine and Hessian-affine are robust to transition tilts smaller than ${\displaystyle \tau _{\max }\approx 2.5}$, if there is no substantial zoom change between the images under comparison. If these conditions do not hold, the their performance drops dramatically.

A comprehensive sets of experiments supporting this conclusion can be found here [[1]].

## Online demo, software and other material

• Online demo. ASIFT can be tested with users' own images via an online demo [[2]].
• Software and other materials. ASIFT software, referernce source code, posters, presentation slides, as well as some failure examples are available here [[3]].

## References

1. ^ a b c d J.M. Morel and G. Yu. "ASIFT: A New Framework for Fully Affine Invariant Image Comparison." SIAM Journal on Imaging Sciences, 2(2):438-469, 2009.
2. ^ a b Lowe, D. G., “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
3. ^ K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. Van Gool., A comparison of affine region detectors. In IJCV 65(1/2):43-72, 2005
4. ^ J. Matas, O. Chum, M. Urba, and T. Pajdla. "Robust wide baseline stereo from maximally stable extremal regions." Proc. of British Machine Vision Conference, pages 384-396, 2002.
5. ^