# Montgomery reduction

In arithmetic computation, Montgomery reduction is an algorithm introduced in 1985 by Peter Montgomery [1] [2] that allows modular arithmetic to be performed efficiently when the modulus is large (typically several hundred bits).

A single application of the Montgomery algorithm (henceforth referred to as a "Montgomery step") is faster than a "naive" modular multiplication:

$c \equiv a \times b \pmod n. \,$

Because numbers have to be converted to and from a particular form suitable for performing the Montgomery step, a single modular multiplication performed using a Montgomery step is actually slightly less efficient than a "naive" one. However, modular exponentiation can be implemented as a sequence of Montgomery steps, with conversion only required once at the start and once at the end of the sequence. In this case the greater speed of the Montgomery steps far outweighs the need for the extra conversions.

## Formal statement

Let N be a positive integer, R and T be integers such that $R>N$, $\gcd(N,R)=1$, and $0\leqslant T, and let $R^{-1}$ be the multiplicative inverse modulo N of R. The Montgomery reduction of T modulo N with respect to R is defined as the value

$TR^{-1} \pmod{N}. \,$

A systematic interpretation of Montgomery reduction and the definition of Montgomery multiplication operation is based on the 2nd generalized division algorithm; see Euclidean division#Generalized division algorithms.

The algorithm used to calculate this reduction is much more efficient than the classical method of taking a product over the integers and reducing the result modulo N.

## Use in cryptography

Many important cryptosystems such as RSA and DSA are based on arithmetic operations, such as multiplications, modulo a large number. [3] The classical method of calculating a modular product involves first multiplying the numbers as if they were integers and then taking the modulus of the result. However, modular reduction is very expensive computationally—equivalent to dividing two numbers.

The situation is even worse when the algorithm requires modular exponentiation. Classically, $a^b \pmod {N}$ is calculated by repeatedly multiplying a by itself b times, each time reducing the result modulo N. Note that taking a single modulo at the end of the calculation will result in increasingly larger intermediate products—infeasible if b is very large.

## Rationale

We wish to calculate c such that

$c \equiv a \times b \pmod {N}$.

Rather than working directly with a and b, we define the residue

$\bar a \equiv aR \pmod {N}$
$\bar b \equiv bR \pmod {N}$

The number $R$ is chosen both greater than and relatively prime to N, such that division and remainder operations are easy. A power of two is generally chosen so that these operations become shifts and bitwise masks respectively. The numbers R and N are guaranteed to be relatively prime if N is odd and R is a power of two, as is typical in cryptographic applications.

Addition and subtraction operations are the same:

$xR + yR \equiv zR \pmod{N}$

if and only if

$x + y \equiv z \pmod{N}.$

This is important because converting between natural and residue representations is expensive, and we would prefer to work in one representation as much as possible and minimise conversions.

To define multiplication, define the modular inverse of R, $R^{-1}$ such that

$RR^{-1} \equiv 1 \pmod {N}$

in other words

$RR^{-1} = kN + 1$

where k is a positive integer. Bezout's identity theorem claims such a $R^{-1}$ and k exist and can be calculated by the Extended Euclidean algorithm. Now if

$c \equiv a \times b \pmod {N}$

then

$\bar c \equiv cR \equiv (a \times b)R \equiv (aR \times bR) R^{-1} \equiv (\bar a \times \bar b) R^{-1} \mod{N}$

and then

$c \equiv \bar c R^{-1} \mod{N}$.

It turns out that this is easy to calculate using the following algorithm.

## Description of Algorithm

$\mbox{Redc}(T) = \mbox{Redc}(T + mN)$, for some $m$, since operations modulo $N$ can add integer multiples of $N$

Choose $m$ such that $(T + mN) / R$ has no remainder. This condition can only be enforced for arbitrary $T$ if $N$ and $R$ are relatively prime, hence the requirement stated in the section above.

$T + mN = 0 \pmod {R}$
\begin{align}m &= -TN^{-1}\ \bmod {\ R}\\ &= T(-N^{-1})\ \bmod {\ R}\end{align}, which may be implemented as $(T\ \bmod {\ R})(-N^{-1}\ \bmod {\ R})\ \bmod {\ R}$

The Montgomery reduction algorithm $\mbox{Redc}(T)$ calculates $TR^{-1} \pmod{N}$ as follows:

\begin{align}m &:= T(-N^{-1})\ \bmod {\ R}\\ t &:= (T + mN)/R\end{align}
If $t \ge N$, return $t - N$, else return $t$.

Note that only additions, subtractions, multiplications, integer divisions, and modulos by R are used – all of which are 'cheap' operations. For many operations modulo $N$, the quantity $(-N^{-1}\ \bmod {\ R})$ can be calculated once and applied many times.

To understand why this gives the right answer, consider the following:

• $mN \equiv TkN \pmod{R}$. But by the definition of $R^{-1}$ and $k$, $kN + 1$ is a multiple of $R$, so $TkN \equiv -T \pmod{R}$. Therefore, $(T + mN) \equiv 0 \pmod{R}$; in other words, $(T + mN)$ is exactly divisible by $R$, so $t$ is an integer.
• Furthermore, $tR = (T + mN) \equiv T \pmod {N}$; therefore, $t \equiv TR^{-1} \mod{N}$, as required.
• Assuming $0 \le T < RN$, $t < 2N$ (as $m < R$). Therefore the return value is always less than $N$.

Therefore, we can say that

$\bar c = \mbox{Redc}(\bar a \times \bar b).$

Using this method to calculate $c$ is generally less efficient than a naive multiplication and reduction, as the cost of conversions to and from residue representation (multiplications by $R$ and $R^{-1}$ modulo $N$) outweigh the savings from the reduction step. The advantage of this method becomes apparent when dealing with a sequence of multiplications, as required for modular exponentiation (e.g. exponentiation by squaring).

## Examples

### Modular multiplication 1

Consider the multiplication problem

$43 \times 56\ \bmod {\ 97},$\quad \begin{align}a &= 43\\ b &= 56\\ N &= {\color{Blue}97}\end{align}

Modulo 97 operations are hard. We would prefer to do modulo 100 operations.

$R = 100$

\begin{align}-N^{-1}\ \bmod{\ R} &= -({\color{Blue}97})^{-1}\ \bmod{\ 100} = -33\ \bmod{\ 100}\\ &= {\color{BrickRed}67}\pmod{100}\end{align}

\begin{align}\bar{a} &= aR\ \bmod{\ N} = 43 \times 100\ \bmod{\ 97}\\ &= 32\pmod{97}\\ \bar{b} &= bR\ \bmod{\ N} = 56 \times 100\ \bmod{\ 97}\\ &= 71\pmod{97}\end{align}
\begin{align}\bar{c} = ({\color{OliveGreen}32 \times 71})R^{-1} &= \mbox{Redc}({\color{OliveGreen}2272})\\ &= [{\color{OliveGreen}2272} + ({\color{OliveGreen}2272} \times {\color{BrickRed}67}\ \bmod{\ 100}) \times {\color{Blue}97}] / 100\\ &= [{\color{OliveGreen}2272} + 24 \times {\color{Blue}97}] / 100\\ &= 4600 / 100\\ &= 46\pmod{97}\end{align}
\begin{align}c &= \mbox{Redc}(\bar{c})\\ &= [{\color{OliveGreen}46} + ({\color{OliveGreen}46} \times {\color{BrickRed}67}\ \bmod{\ 100}) \times {\color{Blue}97}] / 100\\ &= 80\pmod{97}\end{align}

Obviously this is more work than

\begin{align}43 \times 56\ \bmod {\ 97} &= 2408\ \bmod{\ 97}\\ &= 80\pmod{97}\end{align}

for a single multiplication, but (after some initial set-up) it reduces modulo 97 operations to only multiplications and modulo 100 operations. For a computer implementation, $R$ would be a power of 2 rather than a power of 10, since division by powers of 2 simply means throwing away bits, which should be zero if implemented correctly.

#### Initial reduction

Continuing the example above, consider

$\bar{R} = R^2\ \bmod{\ N}$

Then

$\mbox{Redc}(x\bar{R}) = \mbox{Redc}(xR^2) = xR^2R^{-1} = xR = \bar{x}$

Calculate $\bar{a}$ and $\bar{b}$ from $a$ and $b$ above.

\begin{align}\bar{R} &= 100^2\ \bmod{\ 97}\\ &= {\color{Magenta}9}\pmod{97}\end{align}
\begin{align}\bar{a} &= \mbox{Redc}(43 \times {\color{Magenta}9})\\ &= [{\color{OliveGreen}387} + ({\color{OliveGreen}387} \times {\color{BrickRed}67}\ \bmod{\ 100}) \times {\color{Blue}97}] / 100\\ &= 32\pmod{97}\\ \bar{b} &= \mbox{Redc}(56 \times {\color{Magenta}9})\\ &= [{\color{OliveGreen}504} + ({\color{OliveGreen}504} \times {\color{BrickRed}67}\ \bmod{\ 100}) \times {\color{Blue}97}] / 100\\ &= 71\pmod{97}\end{align}

This method makes the calculation of $\bar{R}$ the only modulo 97 operation required for the entire problem.

### The Montgomery step

Working with n-digit numbers to base d, a Montgomery step calculates $a \times b \div d^n\pmod r$. The base d is typically 2 for microelectronic applications, 28 for 8-bit firmware,[4] or 232 or 264 for software applications. For the purpose of exposition, we shall illustrate with d = 10 and n = 4.

To calculate 0472 × a ÷ 10000:

1. Zero the accumulator.
2. Starting from the last digit; add 2a to the accumulator.
3. Shift the accumulator one place to the right (thus dividing by 10).
4. Add 7a to the accumulator.
5. Shift the accumulator one place to the right.
6. Add 4a to the accumulator.
7. Shift the accumulator one place to the right.
8. Add 0a to the accumulator.
9. Shift the accumulator one place to the right.

It is easy to see that the result is 0.0472 × a, as required.

To turn this into a modular operation with a modulus r, add, immediately before each shift, whatever multiple of r is needed to make the value in the accumulator a multiple of 10.

The result will be that the final value in the accumulator will be an integer (since only multiples of 10 have ever been divided by 10) and equivalent (modulo r) to 472 × a ÷ 10000.

Finding the appropriate multiple of r is a simple operation of single-digit arithmetic. When working to base 2, it is trivial to calculate: if the value in the accumulator is even, the multiple is 0 (nothing needs to be added); if the value in the accumulator is odd, the multiple is 1 (r needs to be added).

The Montgomery step is faster than the methods of "naive" modular arithmetic because the decision as to what multiple of r to add is taken purely on the basis of the least significant digit of the accumulator. This allows the use of carry-save adders, which are much faster than the conventional kind but are not immediately able to give accurate values for the more significant digits of the result.

### Modular multiplication 2

Consider the following pair of calculations:

24 × 73 = 1752
240000 × 730000 ÷ 10000 = 17520000

It can be seen that if we choose to represent integers by 10000 times themselves (let us temporarily call this a "Montgomery representation") then the result of a Montgomery step on the Montgomery representation of a and the Montgomery representation of b is the Montgomery representation of a × b.

Thus we can use a Montgomery step to perform a modular multiplication by "Montgomeryizing" both operands before the Montgomery step and "de-Montgomeryizing" the result after it.

To "de-Montgomeryize" a number—in other words, to take it from its representation as "12340000" to a conventional representation as "1234"—it suffices to do a single Montgomery step with the number and 1: 12340000×1÷10000=1234.

To "Montgomeryize" a number—in other words, to take it from its conventional representation to a representation as "12340000"—it suffices to do a single Montgomery step with the number and 100000000: 1234×100000000÷10000=12340000.

The value of 100000000 modulo r can be precomputed, since the same modulus r is usually used many times over.

The total budget for a single modular multiplication is thus two Montgomery steps: the first, on $a$ and $b$, yields $ab/R$, and the second, on this product and $R^2$, yields $(ab/R)\cdot R^2/R = ab$.

Usually, this is not a favorable trade-off for a single multiplication, as a conventional modular multiplication is faster than two Montgomery steps. However, Montgomery reduction is easier to make resistant to side-channel attacks, so in some circumstances the Montgomery technique may be preferable.

### Modular exponentiation

Raising a number to a k-bit exponent involves between k and 2k multiplications. In most applications of modular exponentiation the exponent is at least several hundred bits long.

To fix our ideas, suppose that a particular modular exponentiation requires 800 multiplications. In that case 802 Montgomery steps will be needed: one to Montgomeryize the number being exponentiated, 800 to do the exponentiation, and one to de-Montgomeryize the result.

If a Montgomery step is even slightly faster than a conventional modular multiplication, the Montgomery algorithm will produce a faster result than conventional modular exponentiation .[5]

## Side channel attacks

When using it as a part of a cryptographically secure algorithm, unmodified Montgomery reduction is vulnerable to side channel attacks, where the attacker can learn about the inner workings of the algorithm by studying the differences in time, power-consumption or any other parameter affected by the fact that the algorithm performs very different actions depending on the input. However it is simple to modify the algorithm or the hardware to make it resistant to such attacks.[4][6]

## References

1. ^ Peter Montgomery. Modular Multiplication Without Trial Division, Math. Computation, vol. 44, pp. 519–521, 1985.
2. ^ Martin Kochanski, Montgomery Multiplication A colloquial explanation.
3. ^ Chapter 14 of Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. ISBN 0-8493-8523-7.
4. ^ a b Zhe Liu, Johann Großschädl, and Ilya Kizhvatov. "Efficient and Side-Channel Resistant RSA Implementation for 8-bit AVR Microcontrollers".
5. ^ Cetin K. Koc, Tolga Acar, "Analyzing and Comparing Montgomery Multiplication Algorithms" 16 (3). 1996. pp. 26–33. CiteSeerX: 10.1.1.26.3120.
6. ^ Marc Joye and Sung-Ming Yen. "The Montgomery Powering Ladder". 2002.