# Talk:Exponential smoothing

WikiProject Statistics (Rated C-class, Low-importance)

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale and as Low-importance on the importance scale.

WikiProject Mathematics (Rated C-class, Low-priority)

This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Mathematics rating: C-Class, Low-priority. Field: Applied mathematics

## The original version

The whole thing reads like a student essay. --A bit iffy 15:17, 27 April 2006 (UTC)

I do not have access to recent research paper etc., but I can give the following reference: Montgomery, Douglas C "Forecasting and time series analysis" 1976, McGraw-Hill Inc. martin.wagner@nestle.com

It's pretty messy. I've seen worse, though. One of the early formulas seems right, but I haven't looked closely at the text yet. Michael Hardy 20:59, 23 October 2006 (UTC)

For such a widely adopted forecasting method, this is extremely poor. At the very least, standard notation should be adopted, see Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and applications (3rd ed.). New Jersey: John Wiley & Sons. State-space notation would also be a useful addition. Dr Peter Catt 02:29, 15 December 2006 (UTC)

I feel really sorry to see poor work like this on Wiki.

## A complete rewrite

OK, nobody liked this article very much, and it even came up over on this talk page. So I've rewritten it in its entirety. I'll try to lay my hands on a few reference books in the next month or so, so that I can verify standard notation, etc. Additional information about double and triple exponential smoothing should also go in this article, but at least I've made a start. DavidCBryant 00:45, 10 February 2007 (UTC)

## Problems with weighting

Now that the article has been re-written to be much clearer, I see that there are some problems which I had not previously noticed.

1. Currently, the way the average is initialized gives much too much weight to the first observation. If α is 0.05 (corresponding to a moving average of about 20 values), then when you initialize and input a second observation, the average is (1·x1 + 19·x0)/20, which gives the first observation 19 times the weight of the second, when it would be more appropriate to give the first observation 19/20 of the weight of the second observation.
2. No provision is made for the practical problem of missing data. If an observation is not made at some time, or it is made but lost, then what do we do?

Perhaps these difficulties could be addressed, in part, by separately computing a normalization factor which could be done by forming a sum in the same way using always 1 as the data and then dividing that into the sum of the actual observations. JRSpriggs 03:45, 12 February 2007 (UTC)
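The initialization issue and the proposed normalization can be illustrated with a short sketch (Python; the function names are mine, not from the article):

```python
def ema(xs, alpha, s0):
    """Plain exponential smoothing seeded with s0 (the scheme criticized above)."""
    s = s0
    out = []
    for x in xs:
        s = alpha * x + (1 - alpha) * s
        out.append(s)
    return out

def normalized_ema(xs, alpha):
    """The suggestion above: smooth the data and a series of ones in
    parallel, then divide, so early outputs are properly normalized."""
    num = den = 0.0
    out = []
    for x in xs:
        num = alpha * x + (1 - alpha) * num
        den = alpha * 1.0 + (1 - alpha) * den
        out.append(num / den)
    return out
```

With α = 0.05 and seed x0 = 10, a second observation of 20 gives `ema([20], 0.05, 10)[0]` = 10.5, i.e. (1·20 + 19·10)/20, matching the weighting complained about; the normalized version instead returns the first observation itself on the first step.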

Do you see this as a problem with the article itself, or a problem with the statistical technique described in the article? —David Eppstein 08:06, 12 February 2007 (UTC)
Well, that is the question, is it not? I do not know enough about statistics to know whether the technique has been described incorrectly or whether we should point out that the technique has these limitations. JRSpriggs 09:59, 12 February 2007 (UTC)
When I learned about this technique, I think I remember learning that either of the two methods could be used to initialize it (either copy the first data point enough times to fill in the array, or copy the most recent data point enough times). But I have no idea where I learned about this and I have no references on it, so I can't say what is done in practice. The article on moving average also has no discussion of how to initialize the array. It's hardly a limitation on the method because the method is intended for large data sets, not tiny ones. There seem to be some references at moving average like this one. CMummert · talk 13:27, 12 February 2007 (UTC)

Perhaps we should consider merging this article into Moving average#Exponential moving average. Otherwise, I think that the weighting should be more like this:

${\displaystyle s_{t}={\frac {\sum _{k}\exp(-\alpha (t-t_{k}))x_{k}}{\sum _{k}\exp(-\alpha (t-t_{k}))}}}$

where the sum is over observations with ${\displaystyle t_{k}\leq t.\!}$ What do you-all think? JRSpriggs 04:35, 13 February 2007 (UTC)
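A direct implementation of the formula above (a sketch; the function name is mine) also copes with missing or unevenly spaced observations, since each weight depends only on the observation's timestamp:

```python
import math

def time_weighted_ema(times, values, alpha, t):
    """Weighted average with weights exp(-alpha*(t - t_k)) over the
    observations with t_k <= t, per the formula above."""
    num = den = 0.0
    for tk, xk in zip(times, values):
        if tk <= t:
            w = math.exp(-alpha * (t - tk))
            num += w * xk
            den += w
    return num / den
```

Observations later than t are simply excluded, and a gap in the data just means the remaining weights are renormalized by the denominator.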

I agree on both counts, particularly the merge. MisterSheik 18:47, 16 February 2007 (UTC)

## The exponential moving average

The article says: For example, the method of least squares might be used to determine the value of α for which the sum of the quantities (sn − xn)^2 is minimized. Uh? I think such ${\displaystyle \alpha }$ would be one! Albmont 11:14, 28 February 2007 (UTC)

Thanks. It should have been sn-1 instead of sn. I changed it. JRSpriggs 12:25, 28 February 2007 (UTC)
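With the s_{n-1} correction, the least-squares choice of α is no longer trivially 1. A grid-search sketch (names are mine):

```python
def sse(alpha, xs):
    """Sum of squared one-step-ahead errors (s_{n-1} - x_n)**2,
    seeding the smoother with the first observation."""
    s = xs[0]
    total = 0.0
    for x in xs[1:]:
        total += (s - x) ** 2
        s = alpha * x + (1 - alpha) * s
    return total

def best_alpha(xs):
    """Pick alpha from a coarse grid by minimizing the one-step SSE."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda a: sse(a, xs))
```

For α = 1 the error at each step is (x_{n-1} − x_n)², which is zero only for a constant series, so the minimizer is generally an interior point (or the grid edge for strongly trending data).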

## Is ${\displaystyle x_{t}}$ right?

Should it be

{\displaystyle {\begin{aligned}s_{t}&=\alpha x_{t}+(1-\alpha )s_{t-1}\,\end{aligned}}}

or

{\displaystyle {\begin{aligned}s_{t}&=\alpha x_{t-1}+(1-\alpha )s_{t-1}\,\end{aligned}}}?

This indicates the first one, but this PowerPoint presentation implies the second one.

If the formula is used for forecasting purposes, then it looks to me like the second one is the only usable one (and also looks more natural somehow). Or am I missing something? (I'm very, very rusty on this now!). Hope someone can clear this up.--A bit iffy 11:18, 3 March 2007 (UTC)

The choice is a matter of convention (for strictly periodic data), but I think that the first one (which we use here) is more natural, since people would compute the smoothed value as soon as possible and would naturally want to label it with the time that they computed it. JRSpriggs 11:27, 4 March 2007 (UTC)

The textbooks that I use for teaching time series use the second one (with the lagged value of the series). See Moore, McCabe, Duckworth and Sclove, The Practice of Business Statistics. In Minitab, on the other hand, the smoothed value at time t is defined using the first formula, while the fitted value at time t is the smoothed value at time (t-1). EconProf86 16:20, 28 May 2007 (UTC)

In my view it should be X_t and not X_{t-1}. The simple exponential filter is analogous to a single-pole low-pass IIR filter. In both cases you are saying that the current output is formed from the sum of the new input plus a fraction of the old output. — Preceding unsigned comment added by 212.77.61.18 (talk) 08:09, 9 September 2011 (UTC)

Both are correct. As said above, X_t is used when the focus is smoothing a series and X_{t-1} is used when prediction is the focus. As the article mostly follows the NIST handbook we use X_{t-1}.--Muhandes (talk) 16:55, 10 September 2011 (UTC)

Another related point: the following paragraph states "In the limiting case with α = 1 the output series is just the same as the original series". This will only be the case with X_t. Using X_{t-1} means adding a lag in the time series.Eric thiebaut (talk) 17:28, 24 December 2011 (UTC)

Corrected. --Muhandes (talk) 11:45, 25 December 2011 (UTC)

looks like it is back to t-1. I like it better as t. That is how you would program it. — Preceding unsigned comment added by 128.114.150.27 (talk) 22:59, 17 March 2014 (UTC)
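The two conventions discussed above differ only in which observation enters the update. A sketch making the α = 1 limiting case explicit (the function name is mine):

```python
def smooth(xs, alpha, lagged=False):
    """Exponential smoothing. lagged=False uses x_t (smoothing form);
    lagged=True uses x_{t-1} (forecasting form), which delays the
    series by one step."""
    s = xs[0]
    out = [s]
    for t in range(1, len(xs)):
        x = xs[t - 1] if lagged else xs[t]
        s = alpha * x + (1 - alpha) * s
        out.append(s)
    return out
```

With α = 1 the unlagged form reproduces the input exactly, while the lagged form shifts it by one step, which is exactly the point made above about the limiting case.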

## Negative values for smoothing factor α

All references I have looked at suggest that the value of α must be chosen between 0 and 1. However, none offer any reason for this. Although such a range may be "intuitive", I have worked with datasets for which the optimal value for α (in a least-squares sense, as described in the article) is negative. Why would this be wrong? koochak 10:30, 5 March 2008 (UTC)

Look at the meaning of the α. It is a percentage of the smoothed value that should be generated using the previous smoothed value. You cannot have a negative percentage. JLT 15:03, 16 Dec 2009 (CST)

I agree with the original poster, and with the fact that α can take negative values. The explanation that α must be between 0 and 1 because it is a percentage is just circular logic. So long as you can build a linear combination of observed and predicted values, the exponential smoothing formula holds. 07:42, 28 September 2014 (PC)
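One way to see what is at stake: written out, the smoother puts weight α(1−α)^k on the observation k steps in the past. For 0 < α < 1 these weights decay geometrically; for α < 0 the magnitudes grow with k, so the recursion is still well defined but unstable as a long-run filter. A sketch (hypothetical name):

```python
def weights(alpha, n):
    """Weight placed on the observation k steps in the past:
    alpha * (1 - alpha)**k, for k = 0..n-1."""
    return [alpha * (1 - alpha) ** k for k in range(n)]
```

`weights(0.5, 3)` gives [0.5, 0.25, 0.125], while `weights(-0.5, 3)` gives [-0.5, -0.75, -1.125]: older data gets ever more weight in magnitude. That is the usual argument for restricting α to (0, 1), though it does not rule out a negative α fitting a particular finite dataset better in a least-squares sense.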

## Corrected an Error

I removed an inaccuracy that stated that simple exponential smoothing was the same as Brown exponential smoothing. This is not the case; Brown's method is double exponential smoothing. JLT 14:51, 16 Dec 2009 (CST) —Preceding unsigned comment added by 131.10.254.62 (talk)

## Unsatisfying Derivation of Alpha

The statement that there is no simple way to choose α is very unsatisfying.

If one considers the impulse-response of this method, then the time delay of the response (mean) is 1/α data points and the rms width of the response is also on the order of (but not exactly) 1/α data points. Thus the method smooths with a smoothing width of 1/α data points, and this is a perfectly good way to choose an α.

208.252.219.2 (talk) 16:01, 18 August 2010 (UTC)
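The delay claim above can be checked numerically. The mean of the impulse response h_k = α(1−α)^k works out to (1−α)/α, which is approximately 1/α for small α (a sketch; the name is mine):

```python
def impulse_mean_delay(alpha, n=10000):
    """Mean delay of the smoother's impulse response h_k = alpha*(1-alpha)**k,
    computed numerically over n terms; analytically it equals (1 - alpha)/alpha."""
    h = [alpha * (1 - alpha) ** k for k in range(n)]
    return sum(k * w for k, w in enumerate(h)) / sum(h)
```

For α = 0.05 this gives 19 data points, so choosing α as roughly one over the desired smoothing width is a defensible rule of thumb, as the comment suggests.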

WP:SOFIXIT --Muhandes (talk) 10:03, 20 August 2010 (UTC)

## Least squares optimisation of alpha

I do not understand exactly why optimizing alpha using least-squares methods should work. The sum of squares of differences is minimized for alpha=1, where it equals 0. By continuity of the optimization problem, I suppose there are no other non-trivial solutions. Please give some citation/reference. —Preceding unsigned comment added by 149.156.82.207 (talk) 18:25, 15 December 2010 (UTC)

I don't follow. For alpha=1, s_t=x_{t-1}, i.e. the estimate is always the last measurement. This minimizes the sum of squares only when x_{t-1}=x_t, i.e. only for a constant series. If you need a source on this, check section 6.4.3.1 of the NIST/SEMATECH e-Handbook of Statistical Methods, which is the source of most of the article. --Muhandes (talk) 19:15, 15 December 2010 (UTC)

## Double exponential smoothing != double exponential smoothing

After a lot of confusion and searching, I noticed that there are at least three approaches to calculate a double exponential smoothing.

1. possibly the Holt method
This one just calculates a single exponential smoothing; the results ${\displaystyle s_{t}}$ are used as the starting values for the estimation line (i.e. ${\displaystyle F_{t+0}}$). Additionally, the trend ${\displaystyle b_{t}}$ itself is calculated and used as the gradient of the estimation line. As a result, the only difference from single exponential smoothing is that a trend is assumed, calculated and used, with the result of the simple exponential smoothing as the starting point.
Sources:
2. the Brown method
This one first calculates a single exponential smoothing ${\displaystyle S'_{t}}$ over the data and then calculates another exponential smoothing ${\displaystyle S''_{t}}$ over that smoothed line, resulting in a double-smoothing. For both times, the same α is being used. The estimation line has the starting value ${\displaystyle 2\cdot S'_{t}-S''_{t}}$, the line gradient is described as ${\displaystyle {\frac {\alpha }{1-\alpha }}\cdot (S'_{t}-S''_{t})}$
Sources:
3. allegedly the linear exponential smoothing by Holt/Winters (the one talked about in the article)
This one works similarly to the Brown method but instead of just taking the previous result of the single smoothing it takes into account the previously forecasted trend ${\displaystyle S''_{t-1}}$ by adding it to the previously forecasted level ${\displaystyle S'_{t-1}}$. Also, the new variable β is used to adjust the influence of the trend on the forecast.
Sources:

My point is: this must be clarified and explained properly.

I’m sorry for the lack of English resources, I hope you can find better ones than me. --Uncle Pain (talk) 14:29, 23 September 2011 (UTC)

A small addition after some comparison: the methods 1 and 2 are indeed very similar, as the German PDF implies by combining them in a connected row. The only difference seems to be the ${\displaystyle {\frac {\alpha }{1-\alpha }}}$ factor in method 2 which should be ${\displaystyle {\frac {1}{1-\alpha }}}$ to make it match the results of method 1. Both calculate the same gradient of the estimation line. --Uncle Pain (talk) 15:59, 23 September 2011 (UTC)
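For comparison, method 2 (Brown) is short enough to sketch directly; the level and gradient follow the formulas given above (function names are mine, and this is only my reading of the method, not a vetted implementation):

```python
def brown_double(xs, alpha):
    """Brown's double exponential smoothing: smooth the data (s1), smooth
    the smoothed series again with the same alpha (s2), then form the
    level 2*s1 - s2 and gradient alpha/(1-alpha) * (s1 - s2)."""
    s1 = s2 = xs[0]
    levels, trends = [], []
    for x in xs:
        s1 = alpha * x + (1 - alpha) * s1
        s2 = alpha * s1 + (1 - alpha) * s2
        levels.append(2 * s1 - s2)
        trends.append(alpha / (1 - alpha) * (s1 - s2))
    return levels, trends

def brown_forecast(levels, trends, m):
    """m-step-ahead estimate from the last level and gradient."""
    return levels[-1] + m * trends[-1]
```

On a constant series the second smoothing tracks the first exactly, so the trend term is zero and the forecast is flat, which is a quick sanity check on any implementation.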

Very good points. I'll try to work it all out into the article tomorrow, thanks for the resources. --Muhandes (talk) 20:35, 24 September 2011 (UTC)
I added the second method. I never met the first method before, and I'm still trying to figure out if it is different from the second. If you are confident in your analysis, I suggest you add it yourself. --Muhandes (talk) 14:58, 25 September 2011 (UTC)

## Text from Initial value problem

I have removed the following text from the page Initial value problem which is about ODE theory. If somebody with knowledge of the domain thinks it belongs here, please integrate it. --138.38.106.191 (talk) 14:25, 10 May 2013 (UTC)

Exponential smoothing is a general method for removing noise from a data series, or producing a short term forecast of time series data.
Single exponential smoothing is equivalent to computing an exponential moving average. The smoothing parameter is determined automatically, by minimizing the squared difference between the actual and the forecast values. Double exponential smoothing introduces a linear trend, and so has two parameters. There are several methods for estimating the initial value; for example, these two formulas:
${\displaystyle y'_{0}=\left({\frac {\alpha }{1-\alpha }}\right)a_{t}+b_{t}}$
${\displaystyle y''_{0}=\left({\frac {\alpha }{1-\alpha }}\right)a_{t}+2b_{t}}$

## redundant "Exponential moving average"

Two articles present similar content:

Sorry, I do not have enough motivation/time to check further and to manage the potential merge.

Oliver H (talk) 09:06, 6 March 2014 (UTC)

## Division by 0 in triple exponential smoothing

How does the method cope with scenarios in which either c(t) is 0 or s(t) is 0? The problem occurs in the next time iteration:

• if c(t) is 0 then s(t+1) = ... x(t) / 0 ...
• if s(t) is 0 then c(t+1) = ... x(t) / 0 ...
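In practice implementations guard these divisions; a minimal sketch of one possible convention (entirely my own, not from any cited source):

```python
def safe_ratio(numer, denom, fallback=1.0, eps=1e-12):
    """Guarded division for the x(t)/c(t) and x(t)/s(t) updates in
    triple exponential smoothing; falls back to a neutral value when
    the denominator is (numerically) zero."""
    return numer / denom if abs(denom) > eps else fallback
```

Alternatives include skipping the seasonal or level update for that step, or requiring strictly positive data, since the multiplicative Holt-Winters form assumes positive series values anyway.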

## Summary is far from neutral and too specific.

The first paragraph of the summary starts by accusing statistics of having a trivial standard of proof and ends by claiming that the (unvalidated by citation) practice of triple filtering is both standard and numerology. The second paragraph uses the loaded and insulting word "parrots". Having seen the density of bias in the summary, I will not waste time with the body of the article. Further, it is atypical for mathematical articles to establish notation in the summary -- notation is typically established in the Definition section. -- 66.103.116.83 (talk) 01:48, 18 July 2015 (UTC)

For reference, in case someone wants to back it out, here is the (anonymous) edit where the inappropriate sarcasm in the first paragraph, at least, was introduced:

Dr. Hyndman has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

• Confused comment distinguishing "random process" from "an orderly, but noisy, process".

• Notation used corresponding to engineering rather than statistics and econometrics. Hyndman, Koehler, Ord and Snyder (2008) has encouraged a standardization of exponential smoothing notation in econometrics and statistics. It would be better to use this if possible.
• The background on simple and weighted moving averages is unnecessary, and should be in a separate article.
• "FIR and IIR filters" used without explanation.
• An EWMA is not equivalent to an ARIMA(0,1,1) model as stated. Rather, an ARIMA(0,1,1) model will give forecasts that are equivalent to an EWMA. There is a difference.
• Derivation of formula from Newbold and Box out of place in this context. It is also derived backwards. This section should be removed as the derivation is given later in any case.
• The time constant result seems to assume s_0=0.
• The initial smoothed value is assumed to be set by the user. In modern implementations, this is estimated along with the smoothing parameter.
• Optimization section refers to regression in a strange aside.
• The section on "Comparison with moving average" is highly confused. It mixes up smoothing with forecasting, and makes false statements about the distribution of forecast errors. I'm not sure there is anything in this section worth keeping.
• In double exponential smoothing section it is again assumed that s_0 and b_0 are to be determined by the user rather than estimated.
• Triple exponential smoothing is discussed in Holt's 1957 paper. It is NOT due to Winters. Winters' 1960 contribution popularized the method.
• There is a need for a new section (or perhaps a new article) on the link between exponential smoothing and innovations state space models, introduced by Ord, Koehler and Snyder (JASA 1997) and extended in Hyndman et al (2002,2008).

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

Dr. Hyndman has published scholarly research which seems to be relevant to this Wikipedia article:

• Reference : Christoph Bergmeir & Rob J Hyndman & Jose M Benitez, 2014. "Bagging Exponential Smoothing Methods using STL Decomposition and Box-Cox Transformation," Monash Econometrics and Business Statistics Working Papers 11/14, Monash University, Department of Econometrics and Business Statistics.

ExpertIdeasBot (talk) 11:13, 1 June 2016 (UTC)

Dr. Snyder has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

Article Exponential Smoothing

Reviewer Adjunct Associate Professor Ralph D Snyder Affiliation Department of Econometrics and Statistics, Monash University, Clayton, Victoria, Australia 3800 Date July, 2016

## Brief comments on current article

This article contains a traditional perspective of exponential smoothing, being very much captive to versions as they appeared historically in the literature and which have been overtaken by more modern integrated approaches. It places too much emphasis on technical details of methods which have been superseded or substantially simplified in more modern approaches. Some methods are misnamed and a general version of exponential smoothing, of which all the earlier methods are special cases, is ignored.

## Too immersed in technical details

"Exponential smoothing is a rule of thumb technique for smoothing time series data, particularly for recursively applying as many as three low-pass filters with exponential window functions."

The article begins with this sentence which contains the technical terms “low-pass filters” and “exponential window functions”. These terms are taken from an engineering oriented time series literature but would be unknown to most business forecasters and managers. Given that the application of exponential smoothing has been traditionally centred on short-term forecasting of inventory demands, any article written on the subject should recognise that many readers seeking help on this topic are likely to be immediately put off by this orientation. It is curious that such terms are used given that the topic is normally exposited [1] without recourse to them.

This orientation continues throughout the article.

It also introduces a phase shift into the data of half the window length. For example, if the data were all the same except for one high data point, the peak in the "smoothed" data would appear half a window length later than when it actually occurred. Where the phase of the result is important, this can be simply corrected by shifting the resulting series back by half the window length.

This level of detail is unnecessary and unenlightening in an introductory exposition.

## Covers superseded methods

The last paragraph of the section Double exponential smoothing has a focus on Brown’s double exponential smoothing [2] without any explanation for the equations which define it. Brown was an important pioneer of the exponential smoothing methods and has an important place in any historical analysis of their evolution. His approaches all involve only one parameter α, however, and consequentially are less flexible than multi-parameter analogues, usually provide poorer forecasts, and have largely been superseded.

## Mislabeled methods

1. The section Double exponential smoothing would be better labelled Exponential smoothing with local trends. It describes two approaches: one due to Holt[3] and the other due to Brown[2]. The latter is correctly termed double exponential smoothing because it involves two applications of exponential smoothing: one to the original data and the second to the smoothed data; see the equations for s'_t and s''_t. The first method, however, is traditionally referred to as trend-corrected exponential smoothing to distinguish it from the second method. The term trend-corrected is apt because the level term s_(t-1) is augmented with the trend term b_(t-1) in the formula for the revised level s_t. The application of the term double exponential smoothing to the first approach has been justified with a reference to note 12 in the Wikipedia article. However, it appears to me that this reference has not been published in an authoritative refereed journal and is likely to be an unreliable source for terminology.
2. The article has a section on Triple exponential smoothing which has a discussion on seasonal methods. Again, the section header involves a confused use of terminology. The term normally applies to a method devised by Brown[2] to augment double exponential smoothing with yet a third application of simple exponential smoothing, this time to the doubly smoothed series. It has nothing to do with seasonal effects.

## General versions of exponential smoothing are not covered

The above methods are often described as linear versions of exponential smoothing. Placing the terms linear and exponential side by side is necessary, even if a bit confusing. The methods are considered to be linear because they yield forecasts which are linear functions of the series values. They are considered to be exponential because the coefficients of the series values in these linear forecast functions decline exponentially with the age of the data. The most general linear version of exponential smoothing was introduced in a much overlooked section of one of the most influential books on time series analysis: Box and Jenkins[4]. It does not rate a mention in what should be an authoritative overview of exponential smoothing. Moreover, the article ignores the most general version of exponential smoothing, encompassing both linear and nonlinear versions (Ord, Koehler and Snyder[5]). The advantage of both general approaches is that they unify the whole area of exponential smoothing and eliminate much of the complexity associated with traditional special cases.

## The need to consider statistical models underlying the exponential smoothing

1. The methods proposed for seeding the associated recurrence relationships have been ad hoc.
2. Many of the traditional methods proposed for determining the parameters like α and β have been ad hoc, with the exception of those that rely on minimising the sum of squared errors.
3. Although exponential smoothing often yields good point predictions, it has little to say about the measurement of uncertainty surrounding them. What little that has been written on this issue is often based on assumptions which are inconsistent with those which underpin exponential smoothing.

These problems are largely resolved by augmenting exponential smoothing with appropriate statistical models. Then it is possible to bring standard statistical methods to bear on the estimation of seed values and parameters, together with the generation of prediction distributions. There are two main possibilities:

1. Box and Jenkins demonstrated in their seminal book[4] on time series analysis, that any of their ARIMA models has an implied method of linear exponential smoothing and vice versa. Thus, the standard technology associated with the ARIMA framework can be invoked to obtain indirect estimates of exponential smoothing parameters and prediction distributions.
2. Both the linear and nonlinear innovations state space frameworks in Ord, Koehler and Snyder[5] provide a more transparent link with both the linear and nonlinear forms of exponential smoothing. It can be used to directly obtain maximum likelihood estimates of the parameters and seed values associated with exponential smoothing and also derive prediction distributions. By encompassing nonlinear cases, it is more general than the Box and Jenkins framework.

## Links to other modern forecasting methods are not covered

• The Kalman filter[6], a pivotal time series method, is not mentioned. It has close links with exponential smoothing, the former converging to the latter as successive series values are processed in conjunction with invariant linear state space models.
• Bayesian forecasting (Harrison and Stevens[7]), which relies on the Kalman filter, has its conceptual roots in exponential smoothing.
• The structural time series framework (Harvey[8]) also has strong links with exponential smoothing.

## References

Ord, J. Keith & Fildes, Robert (2013), Principles of Business Forecasting, South-Western Cengage Learning, Sections 3.3.1, 3.4.1, 3.5-3.6.

R. G. Brown (1963). Smoothing, Forecasting and Prediction of Discrete Time Series, Englewood Cliffs, New Jersey: Prentice-Hall.

Holt, Charles C (1957) see note 4 in Wikipedia article

Box, G. E. P., Jenkins, G. M. (1970), Time Series Analysis: Forecasting and Control, Englewood Cliffs, NJ: Prentice-Hall. (An appendix to Chapter 5).

Ord J. K., Koehler, A. B. and Snyder, R. D. (1997), Estimation and Prediction for a Class of Dynamic Nonlinear Statistical Models. Journal of the American Statistical Association, 92, 1621-29.

Kalman, R. E and Bucy, R. S. (1961), New Results in Linear Filtering and Prediction Theory. Journal of Basic Engineering, 83, 95-108.

Harrison, P. J. and Stevens, C. F. (1976), Bayesian Forecasting (with discussion). Journal of the Royal Statistical Society, Ser. B, 38, 205-247.

Harvey, A. C. (1991) Forecasting, structural time series models and the Kalman filter, Cambridge: Cambridge University Press.

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Snyder has expertise on the topic of this article, since he has published relevant scholarly research:

• Reference : Keith Ord & Ralph Snyder & Adrian Beaumont, 2010. "Forecasting the Intermittent Demand for Slow-Moving Items," Monash Econometrics and Business Statistics Working Papers 12/10, Monash University, Department of Econometrics and Business Statistics.

ExpertIdeasBot (talk) 17:01, 27 July 2016 (UTC)

## Croston's Method for intermittent demand forecasting

I've never commented on Wikipedia before, so please excuse my ignorance.

I've recently been reviewing methods of intermittent demand forecasting and have applied the Croston Method for applying exponential smoothing based on the ratio of demand/demand intervals - see https://www.researchgate.net/publication/254044245_A_Review_of_Croston's_method_for_intermittent_demand_forecasting.

This method cannot be found on Wikipedia as far as I can tell, and I'm not sure if it belongs here, under the forecasting topic, or as a stand-alone topic. — Preceding unsigned comment added by S&opgeek (talkcontribs) 02:18, 23 September 2016 (UTC)
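For reference, Croston's method smooths the nonzero demand sizes and the intervals between them separately and forecasts their ratio. A rough sketch (my own reading of the cited review; names are mine):

```python
def croston(demand, alpha=0.1):
    """Croston's method: exponential smoothing of nonzero demand sizes (z)
    and of inter-demand intervals (p); the per-period forecast is z/p."""
    z = p = None
    q = 1  # periods since the last nonzero demand
    for d in demand:
        if d > 0:
            if z is None:          # seed with the first nonzero demand
                z, p = float(d), float(q)
            else:
                z = alpha * d + (1 - alpha) * z
                p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return z / p if z is not None else 0.0
```

For a series with demand 4 every second period, the smoothed size converges to 4 and the smoothed interval to 2, giving a per-period forecast of 2.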