Differential dynamic programming

Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.[3][4]

Finite-horizon discrete-time problems

The dynamics

${\displaystyle \mathbf {x} _{i+1}=\mathbf {f} (\mathbf {x} _{i},\mathbf {u} _{i})}$

(1)

describe the evolution of the state ${\displaystyle \textstyle \mathbf {x} }$ given the control ${\displaystyle \mathbf {u} }$ from time ${\displaystyle i}$ to time ${\displaystyle i+1}$. The total cost ${\displaystyle J_{0}}$ is the sum of running costs ${\displaystyle \textstyle \ell }$ and final cost ${\displaystyle \ell _{f}}$, incurred when starting from state ${\displaystyle \mathbf {x} }$ and applying the control sequence ${\displaystyle \mathbf {U} \equiv \{\mathbf {u} _{0},\mathbf {u} _{1}\dots ,\mathbf {u} _{N-1}\}}$ until the horizon is reached:

${\displaystyle J_{0}(\mathbf {x} ,\mathbf {U} )=\sum _{i=0}^{N-1}\ell (\mathbf {x} _{i},\mathbf {u} _{i})+\ell _{f}(\mathbf {x} _{N}),}$

where ${\displaystyle \mathbf {x} _{0}\equiv \mathbf {x} }$, and the ${\displaystyle \mathbf {x} _{i}}$ for ${\displaystyle i>0}$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence ${\displaystyle \mathbf {U} ^{*}(\mathbf {x} )\equiv \operatorname {argmin} _{\mathbf {U} }J_{0}(\mathbf {x} ,\mathbf {U} ).}$ Trajectory optimization means finding ${\displaystyle \mathbf {U} ^{*}(\mathbf {x} )}$ for a particular ${\displaystyle \mathbf {x} _{0}}$, rather than for all possible initial states.
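As a concrete illustration, the total cost can be evaluated by rolling the dynamics forward and accumulating the running costs. The following Python sketch uses an invented linear system and quadratic costs; the matrices `A`, `B`, `Q_cost`, `R_cost` and the horizon `N` are illustrative assumptions, not from the text:

```python
import numpy as np

# Toy linear dynamics f(x, u) = A x + B u with quadratic costs.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q_cost = np.eye(2)           # running state-cost weight
R_cost = 0.01 * np.eye(1)    # running control-cost weight
N = 10                       # horizon

def f(x, u):
    return A @ x + B @ u

def ell(x, u):               # running cost ℓ(x, u)
    return 0.5 * (x @ Q_cost @ x + u @ R_cost @ u)

def ell_f(x):                # final cost ℓ_f(x)
    return 0.5 * x @ Q_cost @ x

def total_cost(x0, U):
    """J_0(x, U): roll the dynamics forward and accumulate the costs."""
    x, J = x0, 0.0
    for u in U:
        J += ell(x, u)
        x = f(x, u)
    return J + ell_f(x)

x0 = np.array([1.0, 0.0])
U = [np.zeros(1) for _ in range(N)]   # one candidate control sequence
print(total_cost(x0, U))              # 5.5 for this zero-control sequence
```

Trajectory optimization then searches over `U` for the sequence minimizing this quantity from the given `x0`.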

Dynamic programming

Let ${\displaystyle \mathbf {U} _{i}}$ be the partial control sequence ${\displaystyle \mathbf {U} _{i}\equiv \{\mathbf {u} _{i},\mathbf {u} _{i+1}\dots ,\mathbf {u} _{N-1}\}}$ and define the cost-to-go ${\displaystyle J_{i}}$ as the partial sum of costs from ${\displaystyle i}$ to ${\displaystyle N}$:

${\displaystyle J_{i}(\mathbf {x} ,\mathbf {U} _{i})=\sum _{j=i}^{N-1}\ell (\mathbf {x} _{j},\mathbf {u} _{j})+\ell _{f}(\mathbf {x} _{N}).}$

The optimal cost-to-go or value function at time ${\displaystyle i}$ is the cost-to-go given the minimizing control sequence:

${\displaystyle V(\mathbf {x} ,i)\equiv \min _{\mathbf {U} _{i}}J_{i}(\mathbf {x} ,\mathbf {U} _{i}).}$

Setting ${\displaystyle V(\mathbf {x} ,N)\equiv \ell _{f}(\mathbf {x} )}$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

${\displaystyle V(\mathbf {x} ,i)=\min _{\mathbf {u} }[\ell (\mathbf {x} ,\mathbf {u} )+V(\mathbf {f} (\mathbf {x} ,\mathbf {u} ),i+1)].}$

(2)

This is the Bellman equation.
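For intuition, the Bellman recursion in Eq. 2 can be evaluated exactly when the state and control are discretized onto grids. The scalar dynamics, costs, and grid sizes in this Python sketch are invented for the example:

```python
import numpy as np

# Tabular evaluation of the Bellman equation on a discretized scalar problem.
xs = np.linspace(-2, 2, 41)          # discretized states
us = np.linspace(-1, 1, 21)          # discretized controls
N = 20                               # horizon

def f(x, u):                         # simple integrator dynamics, kept on the grid
    return np.clip(x + 0.1 * u, xs[0], xs[-1])

def ell(x, u):                       # running cost
    return 0.1 * (x**2 + 0.1 * u**2)

def ell_f(x):                        # final cost
    return x**2

def nearest(x):                      # index of the grid point closest to x
    return np.abs(xs - x).argmin()

# V[i, k] approximates V(xs[k], i); initialize with the final cost
# and sweep backwards in time, minimizing over a single control per step.
V = np.empty((N + 1, len(xs)))
V[N] = ell_f(xs)
for i in range(N - 1, -1, -1):
    for k, x in enumerate(xs):
        V[i, k] = min(ell(x, u) + V[i + 1, nearest(f(x, u))] for u in us)
```

DDP avoids this exhaustive tabulation, which scales exponentially with the state dimension, by maintaining only a local quadratic model of the value function around a nominal trajectory.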

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

${\displaystyle \ell (\mathbf {x} ,\mathbf {u} )+V(\mathbf {f} (\mathbf {x} ,\mathbf {u} ),i+1)}$

is the argument of the ${\displaystyle \min[]}$ operator in Eq. 2, let ${\displaystyle Q}$ be the variation of this quantity around the ${\displaystyle i}$-th ${\displaystyle (\mathbf {x} ,\mathbf {u} )}$ pair:

${\displaystyle {\begin{aligned}Q(\delta \mathbf {x} ,\delta \mathbf {u} )\equiv &\ell (\mathbf {x} +\delta \mathbf {x} ,\mathbf {u} +\delta \mathbf {u} )&&{}+V(\mathbf {f} (\mathbf {x} +\delta \mathbf {x} ,\mathbf {u} +\delta \mathbf {u} ),i+1)\\-&\ell (\mathbf {x} ,\mathbf {u} )&&{}-V(\mathbf {f} (\mathbf {x} ,\mathbf {u} ),i+1)\end{aligned}}}$

and expand to second order

${\displaystyle \approx {\frac {1}{2}}{\begin{bmatrix}1\\\delta \mathbf {x} \\\delta \mathbf {u} \end{bmatrix}}^{\mathsf {T}}{\begin{bmatrix}0&Q_{\mathbf {x} }^{\mathsf {T}}&Q_{\mathbf {u} }^{\mathsf {T}}\\Q_{\mathbf {x} }&Q_{\mathbf {x} \mathbf {x} }&Q_{\mathbf {x} \mathbf {u} }\\Q_{\mathbf {u} }&Q_{\mathbf {u} \mathbf {x} }&Q_{\mathbf {u} \mathbf {u} }\end{bmatrix}}{\begin{bmatrix}1\\\delta \mathbf {x} \\\delta \mathbf {u} \end{bmatrix}}}$

(3)

The ${\displaystyle Q}$ notation used here is a variant of the notation of Morimoto where subscripts denote differentiation in denominator layout.[5] Dropping the index ${\displaystyle i}$ for readability, primes denoting the next time-step ${\displaystyle V'\equiv V(i+1)}$, the expansion coefficients are

${\displaystyle {\begin{alignedat}{2}Q_{\mathbf {x} }&=\ell _{\mathbf {x} }+\mathbf {f} _{\mathbf {x} }^{\mathsf {T}}V'_{\mathbf {x} }\\Q_{\mathbf {u} }&=\ell _{\mathbf {u} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} }\\Q_{\mathbf {x} \mathbf {x} }&=\ell _{\mathbf {x} \mathbf {x} }+\mathbf {f} _{\mathbf {x} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {x} }+V_{\mathbf {x} }'\cdot \mathbf {f} _{\mathbf {x} \mathbf {x} }\\Q_{\mathbf {u} \mathbf {u} }&=\ell _{\mathbf {u} \mathbf {u} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {u} }+{V'_{\mathbf {x} }}\cdot \mathbf {f} _{\mathbf {u} \mathbf {u} }\\Q_{\mathbf {u} \mathbf {x} }&=\ell _{\mathbf {u} \mathbf {x} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {x} }+{V'_{\mathbf {x} }}\cdot \mathbf {f} _{\mathbf {u} \mathbf {x} }.\end{alignedat}}}$

The last terms in the last three equations denote the contraction of a vector with a rank-three tensor of second derivatives of the dynamics. Minimizing the quadratic approximation (3) with respect to ${\displaystyle \delta \mathbf {u} }$, we have

${\displaystyle {\delta \mathbf {u} }^{*}=\operatorname {argmin} \limits _{\delta \mathbf {u} }Q(\delta \mathbf {x} ,\delta \mathbf {u} )=-Q_{\mathbf {u} \mathbf {u} }^{-1}(Q_{\mathbf {u} }+Q_{\mathbf {u} \mathbf {x} }\delta \mathbf {x} ),}$

(4)

giving an open-loop term ${\displaystyle \mathbf {k} =-Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} }}$ and a feedback gain term ${\displaystyle \mathbf {K} =-Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} \mathbf {x} }}$. Plugging the result back into (3), we now have a quadratic model of the value at time ${\displaystyle i}$:

${\displaystyle {\begin{alignedat}{2}\Delta V(i)&=&{}-{\tfrac {1}{2}}Q_{\mathbf {u} }Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} }\\V_{\mathbf {x} }(i)&=Q_{\mathbf {x} }&{}-Q_{\mathbf {u} }Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} \mathbf {x} }\\V_{\mathbf {x} \mathbf {x} }(i)&=Q_{\mathbf {x} \mathbf {x} }&{}-Q_{\mathbf {x} \mathbf {u} }Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} \mathbf {x} }.\end{alignedat}}}$
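One backward pass might be sketched in Python as follows, assuming a quadratic running cost and toy linear dynamics (so the tensor-contraction terms ${\displaystyle V'_{\mathbf {x} }\cdot \mathbf {f} _{\mathbf {x} \mathbf {x} }}$ etc. vanish, as in the iLQR simplification); all matrices, the horizon, and the nominal trajectory are illustrative assumptions:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # f_x for f(x, u) = A x + B u
B = np.array([[0.0], [0.1]])             # f_u
Qc = np.eye(2)                           # running-cost Hessian in x (ℓ_xx)
Rc = 0.01 * np.eye(1)                    # running-cost Hessian in u (ℓ_uu)
N = 10

def backward_pass(xs, us):
    """Return the open-loop terms k(i) and feedback gains K(i)."""
    Vx = Qc @ xs[-1]                     # ∇ℓ_f at x_N (quadratic final cost)
    Vxx = Qc.copy()
    ks, Ks = [], []
    for i in range(N - 1, -1, -1):
        Qx = Qc @ xs[i] + A.T @ Vx
        Qu = Rc @ us[i] + B.T @ Vx
        Qxx = Qc + A.T @ Vxx @ A
        Quu = Rc + B.T @ Vxx @ B
        Qux = B.T @ Vxx @ A
        k = -np.linalg.solve(Quu, Qu)    # open-loop modification
        K = -np.linalg.solve(Quu, Qux)   # feedback gain
        Vx = Qx + Qux.T @ k              # = Q_x - Q_xu Q_uu^{-1} Q_u
        Vxx = Qxx + Qux.T @ K            # = Q_xx - Q_xu Q_uu^{-1} Q_ux
        ks.append(k)
        Ks.append(K)
    return ks[::-1], Ks[::-1]

# Nominal trajectory: zero controls from x_0 = (1, 0).
us = [np.zeros(1) for _ in range(N)]
xs = [np.array([1.0, 0.0])]
for u in us:
    xs.append(A @ xs[-1] + B @ u)
ks, Ks = backward_pass(xs, us)
```

Solving with `np.linalg.solve` rather than explicitly inverting ${\displaystyle Q_{\mathbf {u} \mathbf {u} }}$ is the usual numerically preferable choice.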

Recursively computing the local quadratic models of ${\displaystyle V(i)}$ and the control modifications ${\displaystyle \{\mathbf {k} (i),\mathbf {K} (i)\}}$, from ${\displaystyle i=N-1}$ down to ${\displaystyle i=0}$, constitutes the backward pass. As above, the value function is initialized with ${\displaystyle V(\mathbf {x} ,N)\equiv \ell _{f}(\mathbf {x} )}$. Once the backward pass is completed, a forward pass computes a new trajectory:

${\displaystyle {\begin{aligned}{\hat {\mathbf {x} }}(0)&=\mathbf {x} (0)\\{\hat {\mathbf {u} }}(i)&=\mathbf {u} (i)+\mathbf {k} (i)+\mathbf {K} (i)({\hat {\mathbf {x} }}(i)-\mathbf {x} (i))\\{\hat {\mathbf {x} }}(i+1)&=\mathbf {f} ({\hat {\mathbf {x} }}(i),{\hat {\mathbf {u} }}(i))\end{aligned}}}$
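A corresponding forward pass can be sketched as below, again with toy linear dynamics; the placeholder zero gains (which simply reproduce the nominal trajectory) stand in for the output of a backward pass:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
N = 10

def f(x, u):
    return A @ x + B @ u

def forward_pass(xs, us, ks, Ks):
    """Roll out the updated controls u + k + K(x̂ - x) from the same x_0."""
    x_new = [xs[0]]
    u_new = []
    for i in range(N):
        u = us[i] + ks[i] + Ks[i] @ (x_new[i] - xs[i])
        u_new.append(u)
        x_new.append(f(x_new[i], u))
    return x_new, u_new

# Nominal trajectory under zero controls; zero gains reproduce it exactly.
us = [np.zeros(1) for _ in range(N)]
xs = [np.array([1.0, 0.0])]
for u in us:
    xs.append(f(xs[-1], u))
ks = [np.zeros(1)] * N
Ks = [np.zeros((1, 2))] * N
x_new, u_new = forward_pass(xs, us, ks, Ks)
```

Note that the feedback term ${\displaystyle \mathbf {K} (i)({\hat {\mathbf {x} }}(i)-\mathbf {x} (i))}$ corrects for the drift between the new rollout and the nominal trajectory as it accumulates.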

The backward passes and forward passes are iterated until convergence.

Regularization and line-search

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[6][7] Regularization in the DDP context means ensuring that the ${\displaystyle Q_{\mathbf {u} \mathbf {u} }}$ matrix in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification ${\displaystyle \mathbf {k} }$ by some ${\displaystyle 0<\alpha <1}$.
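A minimal sketch of both safeguards is given below; `rollout_cost(alpha)` is an assumed caller-supplied function that runs a forward pass with the open-loop term scaled by ${\displaystyle \alpha }$ and returns the total cost, and the constants are illustrative:

```python
import numpy as np

def regularize(Quu, mu0=1e-6):
    """Increase mu until Quu + mu*I admits a Cholesky factorization,
    i.e. until it is positive definite."""
    mu = mu0
    while True:
        try:
            np.linalg.cholesky(Quu + mu * np.eye(Quu.shape[0]))
            return Quu + mu * np.eye(Quu.shape[0])
        except np.linalg.LinAlgError:
            mu *= 10

def line_search(rollout_cost, alphas=(1.0, 0.5, 0.25, 0.125)):
    """Backtracking line-search: try scaled open-loop steps alpha*k and
    keep the first alpha whose rollout reduces the total cost."""
    base = rollout_cost(0.0)          # cost of the unmodified nominal rollout
    for alpha in alphas:
        if rollout_cost(alpha) < base:
            return alpha
    return 0.0   # no improving step found; caller may increase regularization
```

In practice the two mechanisms interact: a larger regularization term biases the update toward the (more conservative) feedback-only direction, while the line-search guards against overshooting when the quadratic model is only locally valid.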