Estrin's scheme

In numerical analysis, Estrin's scheme (after Gerald Estrin), also known as Estrin's method, is an algorithm for numerical evaluation of polynomials.

Horner's method for evaluation of polynomials is one of the most commonly used algorithms for this purpose, and unlike Estrin's scheme it is optimal in the sense that it minimizes the number of multiplications and additions required to evaluate an arbitrary polynomial. On a modern processor, instructions that do not depend on each other's results may run in parallel. Horner's method contains a series of multiplications and additions that each depend on the previous instruction and so cannot execute in parallel. Estrin's scheme is one method that attempts to overcome this serialization while still being reasonably close to optimal.
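The serial dependency in Horner's method can be seen in a minimal Python sketch. The function name `horner` and the convention that `coeffs[i]` holds C_i are assumptions made for illustration only.

```python
def horner(coeffs, x):
    """Evaluate C_0 + C_1*x + ... + C_n*x**n, where coeffs[i] holds C_i."""
    acc = 0
    # Walk from the highest-degree coefficient down to C_0.
    for c in reversed(coeffs):
        # Each update needs the previous value of acc, so the n
        # multiply-add steps form a single dependency chain.
        acc = acc * x + c
    return acc


print(horner([1, 2, 3], 2))  # 1 + 2*2 + 3*4 → 17
```

Because every iteration reads the `acc` produced by the previous one, no two of the multiply-add steps can overlap.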

Description of the algorithm

Estrin's scheme operates recursively, converting a degree-n polynomial in x (for n≥2) to a degree-⌊n/2⌋ polynomial in x² using ⌈n/2⌉ independent operations (plus one to compute x²).

Given an arbitrary polynomial P(x) = C₀ + C₁x + C₂x² + C₃x³ + ⋯ + Cₙxⁿ, one can group adjacent terms into sub-expressions of the form (A + Bx) and rewrite it as a polynomial in x²: P(x) = (C₀ + C₁x) + (C₂ + C₃x)x² + (C₄ + C₅x)x⁴ + ⋯ = Q(x²).

Each of these sub-expressions, and x², may be computed in parallel. They may also be evaluated using a native multiply–accumulate instruction on some architectures, an advantage that is shared with Horner's method.
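A single level of this grouping might be sketched in Python as follows. The helper names `pair_terms` and `evaluate_directly` are hypothetical, and the list comprehension only models the independent sub-expressions; it does not itself run them in parallel.

```python
def pair_terms(coeffs, x):
    """Return the coefficients D_i = C_{2i} + C_{2i+1}*x of Q."""
    padded = list(coeffs)
    if len(padded) % 2:
        padded.append(0)  # treat the missing odd partner as zero
    # No pair depends on any other pair, so hardware could compute
    # all of them at the same time.
    return [padded[i] + padded[i + 1] * x for i in range(0, len(padded), 2)]


def evaluate_directly(coeffs, x):
    """Reference evaluation by summing every term."""
    return sum(c * x**i for i, c in enumerate(coeffs))


coeffs = [1, 2, 3, 4, 5]  # 1 + 2x + 3x² + 4x³ + 5x⁴
x = 3
q = pair_terms(coeffs, x)
print(q)  # → [7, 15, 5]
print(evaluate_directly(coeffs, x), evaluate_directly(q, x * x))  # → 547 547
```

The two printed values agree, which is the identity P(x) = Q(x²) checked at one point.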

This grouping can then be repeated to get a polynomial in x⁴: P(x) = Q(x²) = ((C₀ + C₁x) + (C₂ + C₃x)x²) + ((C₄ + C₅x) + (C₆ + C₇x)x²)x⁴ + ⋯ = R(x⁴).

Repeating this ⌊log₂n⌋+1 times, one arrives at Estrin's scheme for parallel evaluation of a polynomial:

1. Compute D_i = C_{2i} + C_{2i+1}x for all 0 ≤ i ≤ ⌊n/2⌋. (If n is even, then C_{n+1} = 0 and D_{n/2} = C_n.)
2. If n ≤ 1, the computation is complete and D₀ is the final answer.
3. Otherwise, compute y = x² (in parallel with the computation of D_i).
4. Evaluate Q(y) = D₀ + D₁y + D₂y² + ⋯ + D_{⌊n/2⌋}y^{⌊n/2⌋} using Estrin's scheme.
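The four steps above can be sketched as a short recursive Python function. This is an illustrative sequential version under the assumption that `coeffs[i]` holds C_i; the name `estrin` is hypothetical, and a real implementation would rely on hardware or a vectorizing compiler to overlap the independent operations.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by following the four steps."""
    c = list(coeffs) or [0]  # an empty polynomial evaluates to 0
    if len(c) % 2:
        c.append(0)  # n is even, so C_{n+1} = 0
    # Step 1: D_i = C_{2i} + C_{2i+1} * x (mutually independent).
    d = [c[i] + c[i + 1] * x for i in range(0, len(c), 2)]
    # Step 2: a single remaining term means n <= 1, so D_0 is the answer.
    if len(d) == 1:
        return d[0]
    # Step 3: y = x² does not depend on any D_i.
    y = x * x
    # Step 4: evaluate Q(y) with the same scheme.
    return estrin(d, y)


print(estrin([1, 2, 3, 4, 5], 3))  # 1 + 6 + 27 + 108 + 405 → 547
```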

This performs a total of n multiply–accumulate operations (the same as Horner's method) across all applications of step 1, and an additional ⌊log₂n⌋ squarings in step 3. In exchange for those extra squarings, all of the operations in each level of the scheme are independent and may be computed in parallel; the longest dependency path is ⌊log₂n⌋+1 operations long.
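These counts can be checked by simulating the recursion without evaluating anything. The sketch below assumes n ≥ 1; the function name `estrin_cost` is hypothetical.

```python
def estrin_cost(n):
    """Return (multiply_adds, squarings, depth) for degree n >= 1."""
    multiply_adds = squarings = depth = 0
    terms = n + 1  # number of coefficients at the current level
    while True:
        multiply_adds += terms // 2  # one multiply-add per complete pair
        depth += 1                   # each level adds one step to the longest path
        terms = (terms + 1) // 2     # an unpaired last term is carried over
        if terms == 1:
            return multiply_adds, squarings, depth
        squarings += 1               # the next level needs the squared variable


print(estrin_cost(15))  # → (15, 3, 4)
```

For n = 15 this gives 15 multiply-adds, 3 squarings and a dependency path of 4 operations, matching n, ⌊log₂n⌋ and ⌊log₂n⌋+1.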

Examples

Take Pₙ(x) to mean the polynomial of degree n of the form: Pₙ(x) = C₀ + C₁x + C₂x² + C₃x³ + ⋯ + Cₙxⁿ

Written with Estrin's scheme we have:

P₃(x) = (C₀ + C₁x) + (C₂ + C₃x) x²

P₄(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + C₄x⁴

P₅(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + (C₄ + C₅x) x⁴

P₆(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + ((C₄ + C₅x) + C₆x²)x⁴

P₇(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + ((C₄ + C₅x) + (C₆ + C₇x) x²)x⁴

P₈(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + ((C₄ + C₅x) + (C₆ + C₇x) x²)x⁴ + C₈x⁸

P₉(x) = (C₀ + C₁x) + (C₂ + C₃x) x² + ((C₄ + C₅x) + (C₆ + C₇x) x²)x⁴ + (C₈ + C₉x) x⁸

…

In full detail, consider the evaluation of P₁₅(x):

Inputs: x, C₀, C₁, C₂, C₃, C₄, C₅, C₆, C₇, C₈, C₉, C₁₀, C₁₁, C₁₂, C₁₃, C₁₄, C₁₅

Step 1: x², C₀+C₁x, C₂+C₃x, C₄+C₅x, C₆+C₇x, C₈+C₉x, C₁₀+C₁₁x, C₁₂+C₁₃x, C₁₄+C₁₅x

Step 2: x⁴, (C₀+C₁x) + (C₂+C₃x)x², (C₄+C₅x) + (C₆+C₇x)x², (C₈+C₉x) + (C₁₀+C₁₁x)x², (C₁₂+C₁₃x) + (C₁₄+C₁₅x)x²

Step 3: x⁸, ((C₀+C₁x) + (C₂+C₃x)x²) + ((C₄+C₅x) + (C₆+C₇x)x²)x⁴, ((C₈+C₉x) + (C₁₀+C₁₁x)x²) + ((C₁₂+C₁₃x) + (C₁₄+C₁₅x)x²)x⁴

Step 4: (((C₀+C₁x) + (C₂+C₃x)x²) + ((C₄+C₅x) + (C₆+C₇x)x²)x⁴) + (((C₈+C₉x) + (C₁₀+C₁₁x)x²) + ((C₁₂+C₁₃x) + (C₁₄+C₁₅x)x²)x⁴)x⁸
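The same four steps can be traced in code by recording the partial results produced at each level. In this sketch the function name `estrin_levels` and the sample coefficients C₀, …, C₁₅ = 1, …, 16 are illustrative assumptions only.

```python
def estrin_levels(coeffs, x):
    """Return the list of partial results produced at each step."""
    level = list(coeffs)
    levels = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad an unpaired last term
        # Combine neighbours using the current power of the original x.
        level = [level[i] + level[i + 1] * x for i in range(0, len(level), 2)]
        levels.append(level)
        x = x * x  # x², x⁴, x⁸, ... for the following step
    return levels


coeffs = list(range(1, 17))  # hypothetical C_0..C_15 = 1..16
steps = estrin_levels(coeffs, 2)
print([len(s) for s in steps])  # → [8, 4, 2, 1]
```

The 8, 4, 2 and 1 partial results correspond to the sub-expressions listed in steps 1 to 4, and the single value left after the last step is P₁₅(x).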

References

Estrin, Gerald (May 1960). "Organization of computer systems—The fixed plus variable structure computer" (PDF). Proc. Western Joint Comput. Conf. San Francisco: 33–40. doi:10.1145/1460361.1460365. S2CID 16384320.
Muller, Jean-Michel (2005). Elementary Functions: Algorithms and Implementation (2nd ed.). Birkhäuser. p. 58. ISBN 0-8176-4372-9.
