Talk:Ward's method

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Statistics (Rated Start-class, Low-importance)
WikiProject icon

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

Start-Class article Start  This article has been rated as Start-Class on the quality scale.
 Low  This article has been rated as Low-importance on the importance scale.

Not minimum variance[edit]

The incorrect label for Ward as "minimum variance" method is widely spread, unfortunately. But in fact this method is "minimum increase of sum-of-squares (of errors)", which is not minimum variance. If it were minimum variance, the decision to link a cluster with a singleton point would always be the same as the decision made by centroid linkage method in that case (because assigning a point to the closest centroid guarantees the new cluster with minimal variance), but that is not the case and the two methods - Ward and centroid - can make different decisions in the above situation. See Podany, J. New combinatorial clustering methods // Vegetatio, vol 81, 1989, where on page 67 he's up in arms against incorrect labels for Ward's method. (talk) 13:48, 1 February 2014 (UTC) There indeed exists true minimum variance method (called MNVAR by Podani) which is not identical to Ward's method.

The objective function of Ward's method is to minimize, at each step, 2*[SS12-(SS1+SS2)], where SS1 and SS2 are the within-cluster errors of clusters 1 and 2 and SS12 is the within-cluster error of their combined cluster.

Why do we multiply by 2?[edit]

The formula of Wards Method is : 2*[SS12-(SS1+SS2)]
or ?
And our aim is minimizing this right?
Does it matter we multiply with 2 or not? I am trying to formulate algo based on SSE BurstPower (talk) 16:38, 2 March 2016 (UTC)

→ Reply to BurstPower

Lance-Williams formula for Ward's method, on a step where clusters 1 and 2 have merged into cluster t, and now the distances between t and every other cluster r are being updated, is:

Dtr = [(Nr+N1)*Dr1+(Nr+N2)*Dr2-Nr*D12] / (Nr+Nt),

where some "Dij" is the distance between clusters i and j, and Ni is the number of points in cluster i.

Now, if the input distances are squared euclidean ones then this formula corresponds to the objective function Dtr = 2*[SS12-(SS1+SS2)].

Clearly, the 2 multiplier can be avoided on the steps (in order to speed up the process). Just divide the input squared distances by 2 before starting the agglomeration. Then the above Lance-Williams formula will correspond to objective function in the form Dtr = SS12-(SS1+SS2). So, you can go either way (with the same result), but the second one is programmically better: no need to multiply by 2 on every step. (talk) 19:09, 15 March 2016 (UTC)