# Perspective-n-Point

Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of n 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6 degrees-of-freedom (DOF) which are made up of the rotation (roll, pitch, and yaw) and 3D translation of the camera with respect to the world. This problem originates from camera calibration and has many applications in computer vision and other areas, including 3D pose estimation, robotics and augmented reality. A commonly used solution to the problem exists for n = 3 called P3P, and many solutions are available for the general case of n ≥ 3. Implementations of these solutions are also available in open source software.

## Problem Specification

### Definition

Given a set of n 3D points in a world reference frame and their corresponding 2D image projections as well as the calibrated intrinsic camera parameters, determine the 6 DOF pose of the camera in the form of its rotation and translation with respect to the world. This follows the perspective project model for cameras:

${\displaystyle s\,p_{c}=K\,[\,R\,|\,T\,]\,p_{w}}$.

where ${\displaystyle \textstyle p_{w}={\begin{bmatrix}x&y&z&1\end{bmatrix}}^{T}}$ is the homogeneous world point, ${\displaystyle \textstyle p_{c}={\begin{bmatrix}u&v&1\end{bmatrix}}^{T}}$ is the corresponding homogeneous image point, ${\displaystyle \textstyle K}$ is the matrix of intrinsic camera parameters, (where ${\displaystyle \textstyle f_{x}}$ and ${\displaystyle f_{y}}$ are the scaled focal lengths, ${\displaystyle \textstyle \gamma }$ is the skew parameter which is sometimes assumed to be 0, and ${\displaystyle \textstyle (u_{0},\,v_{0})}$ is the principal point), ${\displaystyle \textstyle s}$ is a scale factor for the image point, and ${\displaystyle \textstyle R}$ and ${\displaystyle \textstyle T}$ are the desired 3D rotation and 3D translation of the camera (extrinsic parameters) that are being calculated. This leads to the following equation for the model:

${\displaystyle s{\begin{bmatrix}u\\v\\1\end{bmatrix}}={\begin{bmatrix}f_{x}&\gamma &u_{0}\\0&f_{y}&v_{0}\\0&0&1\end{bmatrix}}{\begin{bmatrix}r_{11}&r_{12}&r_{13}&t_{1}\\r_{21}&r_{22}&r_{23}&t_{2}\\r_{31}&r_{32}&r_{33}&t_{3}\\\end{bmatrix}}{\begin{bmatrix}x\\y\\z\\1\end{bmatrix}}}$.

### Assumptions and Data Characteristics

There are a few preliminary aspects of the problem that are common to all solutions of PnP. The assumption made in most solutions is that the camera is already calibrated. Thus, its intrinsic properties are already known, such as the focal length, principal image point, skew parameter, and other parameters. Some methods, such as UPnP.[1] or the Direct Linear Transform (DLT) applied to the projection model, are exceptions to this assumption as they estimate these intrinsic parameters as well as the extrinsic parameters which make up the pose of the camera that the original PnP problem is trying to find.

For each solution to PnP, the chosen point correspondences cannot be coplanar. In addition, PnP can have multiple solutions, and choosing a particular solution would require post-processing of the solution set. RANSAC is also commonly used with a PnP method to make the solution robust to outliers in the set of point correspondences. Most methods, however, assume that the data is noise-free.

## Methods

This following section describes two common methods that can be used to solve the PnP problem that are also readily available in open source software and how RANSAC can be used to deal with outliers in the data set.

### P3P

When n = 3, the PnP problem is in its minimal form of P3P and can be solved with three point correspondences. However, with just three point correspondences, P3P yields many solutions, so a fourth correspondence is used in practice to remove ambiguity. The setup for the problem is as follows.

Let P be the center of projection for the camera, A, B, and C be 3D world points with corresponding images points u, v, and w. Let X = |PA|, Y = |PB|, Z = |PC|, ${\displaystyle \alpha =\angle BPC}$, ${\displaystyle \beta =\angle APC}$, ${\displaystyle \gamma =\angle APB}$, ${\displaystyle p=2\cos \alpha }$, ${\displaystyle q=2\cos \beta }$, ${\displaystyle r=2\cos \gamma }$, ${\displaystyle a'=|AB|}$, ${\displaystyle b'=|BC|}$, ${\displaystyle c'=|AC|}$. This forms triangles PBC, PAC, and PAB from which we obtain the equation system for P3P:

${\displaystyle {\begin{cases}Y^{2}+Z^{2}-YZp-b'^{2}&=0\\Z^{2}+X^{2}-XZq-c'^{2}&=0\\X^{2}+Y^{2}-XYr-a'^{2}&=0\\\end{cases}}}$.

It's common to normalize the image points before solving P3P. Solving the P3P system results in four possible solutions for R and T. The fourth world point D and its corresponding image point z are then used to find the best solution among the four. The algorithm for solving the problem as well as the complete solution classification for it is given in the 2003 IEEE Transactions on Pattern Analysis and Machine Intelligence paper by Gao, et al.[2] An open source implementation of the P3P can be found in OpenCV's calib3d module in the solvePnP function.[3]

### EPnP

Efficient PnP (EPnP) is a method developed by Lepetit, et al. in their 2008 International Journal of Computer Vision paper[4] that solves the general problem of PnP for n ≥ 3. This method is based on the notion that each of the n points (which are called reference points) can be expressed as a weighted sum of four virtual control points. Thus, the coordinates of these control points become the unknowns of the problem. It is from these control points that the final pose of the camera is solved for.

As an overview of the process, first note that each of the n reference points in the world frame, ${\displaystyle p_{i}^{w}}$, and their corresponding image points, ${\displaystyle p_{i}^{c}}$, are weighted sums of the four controls points, ${\displaystyle c_{j}^{w}}$ and ${\displaystyle c_{j}^{c}}$ respectively, and the weights are normalized per reference point as shown below. All points are expressed in homogeneous form.

${\displaystyle p_{i}^{w}=\sum _{j=1}^{4}{\alpha _{ij}c_{j}^{w}}}$
${\displaystyle p_{i}^{c}=\sum _{j=1}^{4}{\alpha _{ij}c_{j}^{c}}}$
${\displaystyle \sum _{j=1}^{4}{\alpha _{ij}}=1}$

From this, the derivation of the image reference points becomes

${\displaystyle s_{i}\,p_{i}^{c}=K\sum _{j=1}^{4}{\alpha _{ij}c_{j}^{c}}}$.

The homogeneous image control point has the form ${\displaystyle \textstyle c_{j}^{c}={\begin{bmatrix}x_{j}^{c}&y_{j}^{c}&z_{j}^{c}\end{bmatrix}}^{T}}$. Rearranging the image reference point equation yields the following two linear equations for each reference point:

${\displaystyle \sum _{j=1}^{4}{\alpha _{ij}f_{x}x_{j}^{c}+\alpha _{ij}(u_{0}-u_{i})z_{j}^{c}}=0}$
${\displaystyle \sum _{j=1}^{4}{\alpha _{ij}f_{y}y_{j}^{c}+\alpha _{ij}(v_{-}v_{i})z_{j}^{c}}=0}$.

Using these two equations for each of the n reference points, the system ${\displaystyle \textstyle Mx=0}$ can be formed where ${\displaystyle \textstyle x={\begin{bmatrix}c_{1}^{c^{T}}&c_{2}^{c^{T}}&c_{3}^{c^{T}}&c_{4}^{c^{T}}\end{bmatrix}}^{T}}$. The solution for the control points exists in the null space of M and is expressed as

${\displaystyle x=\sum _{i=1}^{N}{\beta _{i}v_{i}}}$

where ${\displaystyle N}$ is the number of null singular values in ${\displaystyle M}$ and each ${\displaystyle v_{i}}$ is the corresponding right singular vector of ${\displaystyle M}$. ${\displaystyle N}$ can range from 1 to 4. After calculating the initial coefficients ${\displaystyle \beta _{i}}$, the Gauss-Newton algorithm is used to refine them. The R and T matrices that minimize the reprojection error of the world reference points, ${\displaystyle p_{i}^{w}}$, and their corresponding actual image points ${\displaystyle p_{i}^{c}}$, are then calculated.

This solution has ${\displaystyle O(n)}$ complexity and works in the general case of PnP for both planar and non-planar control points. Open source software implementations of this method can be found in OpenCV's Camera Calibration and 3D Reconstruction module in the solvePnP function[3] as well as from the code published by Lepetit, et al. at their website, CVLAB at EPFL.[5]

### Using RANSAC

PnP is prone to errors if there are outliers in the set of point correspondences. Thus, RANSAC can be used in conjunction with existing solutions to make the final solution for the camera pose more robust to outliers. An open source implementation of PnP methods with RANSAC can be found in OpenCV's Camera Calibration and 3D Reconstruction module in the solvePnPRansac function[6]