# Saturation arithmetic

Saturation arithmetic is a version of arithmetic in which all operations, such as addition and multiplication, are limited to a fixed range between a minimum and maximum value.

If the result of an operation is greater than the maximum, it is set ("clamped") to the maximum; if it is below the minimum, it is clamped to the minimum. The name comes from how the value becomes "saturated" once it reaches the extreme values; further additions to a maximum or subtractions from a minimum will not change the result.

For example, if the valid range of values is from −100 to 100, the following saturating arithmetic operations produce the following values:

• 60 + 30 → 90.
• 60 + 43 → 100. (not the expected 103.)
• (60 + 43) − (75 + 25) → 0. (not the expected 3.) (100 − 100 → 0.)
• 10 × 11 → 100. (not the expected 110.)
• 99 × 99 → 100. (not the expected 9801.)
• 30 × (5 − 1) → 100. (not the expected 120.) (30 × 4 → 100.)
• (30 × 5) − (30 × 1) → 70. (not the expected 120. not the previous 100.) (100 − 30 → 70.)

Here is another example for saturating subtraction when the valid range is from 0 to 100 instead:

• 30 - 60 → 0. (not the expected -30.)

As can be seen from these examples, familiar properties like associativity and distributivity may fail in saturation arithmetic.[a] This makes it unpleasant to deal with in abstract mathematics, but it has an important role to play in digital hardware and algorithms where values have maximum and minimum representable ranges.

## Modern use

Typically, general-purpose microprocessors do not implement integer arithmetic operations using saturation arithmetic; instead, they use the easier-to-implement modular arithmetic, in which values exceeding the maximum value "wrap around" to the minimum value, like the hours on a clock passing from 12 to 1. In hardware, modular arithmetic with a minimum of zero and a maximum of rn − 1, where r is the radix, can be implemented by simply discarding all but the lowest n digits. For binary hardware, which the vast majority of modern hardware is, the radix is 2, and the digits are bits.

However, although more difficult to implement, saturation arithmetic has numerous practical advantages. The result is as numerically close to the true answer as possible; for 8-bit binary signed arithmetic, when the correct answer is 130, it is considerably less surprising to get an answer of 127 from saturating arithmetic than to get an answer of −126 from modular arithmetic. Likewise, for 8-bit binary unsigned arithmetic, when the correct answer is 258, it is less surprising to get an answer of 255 from saturating arithmetic than to get an answer of 2 from modular arithmetic.

Saturation arithmetic also enables overflow of additions and multiplications to be detected consistently without an overflow bit or excessive computation, by simple comparison with the maximum or minimum value (provided the datum is not permitted to take on these values).

Additionally, saturation arithmetic enables efficient algorithms for many problems, particularly in digital signal processing. For example, adjusting the volume level of a sound signal can result in overflow, and saturation causes significantly less distortion to the sound than wrap-around. In the words of researchers G. A. Constantinides et al.:[1]

When adding two numbers using two's complement representation, overflow results in a "wrap-around" phenomenon. The result can be a catastrophic loss in signal-to-noise ratio in a DSP system. Signals in DSP designs are therefore usually either scaled appropriately to avoid overflow for all but the most extreme input vectors, or produced using saturation arithmetic components.

## Implementations

Saturation arithmetic operations are available on many modern platforms, and in particular was one of the extensions made by the Intel MMX platform, specifically for such signal-processing applications. This functionality is also available in wider versions in the SSE2 and AVX2 integer instruction sets. It is also available in ARM NEON instruction set.

Saturation arithmetic for integers has also been implemented in software for a number of programming languages including C, C++, such as the GNU Compiler Collection,[2] LLVM IR, and Eiffel. This helps programmers anticipate and understand the effects of overflow better, and in the case of compilers usually pick the optimal solution.

Saturation is challenging to implement efficiently in software on a machine with only modular arithmetic operations, since simple implementations require branches that create huge pipeline delays. However, it is possible to implement saturating addition and subtraction in software without branches, using only modular arithmetic and bitwise logical operations that are available on all modern CPUs and their predecessors, including all x86 CPUs (back to the original Intel 8086) and some popular 8-bit CPUs (some of which, such as the Zilog Z80, are still in production). On the other hand, on simple 8-bit and 16-bit CPUs, a branching algorithm might actually be faster if programmed in assembly, since there are no pipelines to stall, and each instruction always takes multiple clock cycles. On the x86, which provides overflow flags and conditional moves, very simple branch-free code is possible.[3]

Although saturation arithmetic is less popular for integer arithmetic in hardware, the IEEE floating-point standard, the most popular abstraction for dealing with approximate real numbers, uses a form of saturation in which overflow is converted into "infinity" or "negative infinity", and any other operation on this result continues to produce the same value. This has the advantage over simple saturation that later operations decreasing the value will not end up producing a misleadingly "reasonable" result, such as in the computation ${\textstyle {\sqrt {x^{2}-y^{2}}}}$. Alternatively, there may be special states such as "exponent overflow" (and "exponent underflow") that will similarly persist through subsequent operations, or cause immediate termination, or be tested for as in IF ACCUMULATOR OVERFLOW ... as in FORTRAN for the IBM 704 (October 1956).