Elo rating system

The Elo rating system is a method for calculating the relative skill levels of players in two-player games such as chess and Go. It is also used as a rating system for competitive multi-player play in a number of computer games. It was originally invented as an improved chess rating system. "Elo" is often written in capital letters (ELO), but it is not an acronym. It is the family name of the system's creator, Árpád Élő (1903-1992), a Hungarian-born American physics professor. Professor Élő spelled his own name "Elo" after he left Hungary, a common anglicization.

A statistical system, not a reward system

Árpád Élő was a master-level chess player and an active participant in the United States Chess Federation (USCF) from its founding in 1939. The USCF used a numerical ratings system, devised by Kenneth Harkness, to allow members to track their individual progress in terms other than tournament wins and losses. The Harkness system was reasonably fair, but in some circumstances gave rise to ratings which many observers considered inaccurate. On behalf of the USCF, Élő devised a new system with a statistical flavor.

It was (and still is) daring to substitute statistical estimation for a system of competitive rewards. Rating systems for many sports award points in accordance with subjective evaluations of the greatness of certain achievements. For example, winning an important golf tournament might be worth five times as many rating points as winning a lesser tournament, and taking third place might be worth half the points of taking first place, etc.

A statistical endeavor, in contrast, postulates a model of some aspect of reality, and seeks to mathematically estimate, based on observation, the variables in that model. Competitors may still feel that they are being rewarded and punished for good and bad results, but the lofty claim of a statistical system is that it estimates real unknowns, and thus mirrors some hidden truth.

Élő's specific assumptions about the nature of real chess performance are open to doubt, but chess fans praise the accuracy of ELO ratings with a fervor unheard of in other sports. For example, professional tennis ratings are purely rewards based on tournament results. (Statistically rating tennis players would be complicated by variables chess doesn't have, particularly the playing surface, but the rating organizations don't even try for predictive accuracy.) As a result, it is routine for tennis fans to consider the higher-rated player an underdog in a given match. In chess the higher-rated player is regarded as the favorite in almost every case.

Élő's rating system model

Élő's central assumption was that the chess "performance" of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Élő assumed that the mean value of the performances of any given player changes only slowly over time. Élő thought of the mean of a player's performance random variable as that player's true skill.

A further assumption is necessary, because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and say, "That performance is 2039." Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, he is assumed to have performed at a higher level than his opponent for that game. Conversely if he loses, he is assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.

Élő waved his hands at several details of his model. For example, he did not specify exactly how close two performances ought to be to result in a draw rather than a decisive result. And while he thought it likely that each player might have a different standard deviation to his performance, he made a simplifying assumption to the contrary.

To simplify computation even further, Élő proposed a straightforward method of estimating the variables in his model —i.e. the true skill of each player. One could calculate relatively easily, from tables, how many games a player is expected to win based on a comparison of his rating to the ratings of his opponents. If a player won more games than he was expected to win, his rating would be adjusted upward, while if he won fewer games than expected his rating would be adjusted downward. Moreover, that adjustment was to be in exact linear proportion to the number of wins by which the player had exceeded or fallen short of his expected number of wins.

From a modern perspective, Élő's simplifying assumptions are not necessary, because computing power is inexpensive and widely available. Moreover, even within the simplified model, more efficient estimation techniques are well known. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. In November 2005, the Xbox Live online gaming service proposed the TrueSkill ranking system that is an extension of Glickman's system to multi-player and multi-team games. On the other hand, the computational simplicity of the ELO system has proved to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what his next officially published rating will be, which helps promote a perception that the ratings are fair.

Implementing Élő's scheme

The USCF implemented Élő's suggestions in 1960, and the system quickly gained recognition as being both more fair and more accurate than the Harkness system. Élő's system was adopted by FIDE in 1970. Élő described his work in some detail in the book The Rating of Chessplayers, Past and Present, published in 1978.

Subsequent statistical tests have shown that chess performance is almost certainly not normally distributed. Weaker players have significantly greater winning chances than Élő's model predicts. Therefore, both the USCF and FIDE have switched to formulas based on the logistic distribution. However, in deference to Élő's contribution, both organizations are still commonly said to use "the ELO system".

Comparative ratings

The phrase "ELO rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage is confusing and often misleading, because Élő's general ideas have been adopted by many different organizations, including the USCF (before FIDE), the Internet Chess Club (ICC), Yahoo! Games, and the now defunct Professional Chess Association (PCA). Each organization has a unique implementation, and none of them precisely follows Élő's original suggestions. It would be more accurate to refer to all of the above ratings as ELO ratings, and none of them as the ELO rating.

Instead one may refer to the organization granting the rating, e.g. "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." It should be noted that the ELO ratings of these various organizations are not always directly comparable. For example, someone with a FIDE rating of 2500 will generally have a USCF rating near 2600 and an ICC rating in the range of 2500 to 3100.

The following analysis of the January 2006 FIDE rating list gives a rough impression of what having a given FIDE rating means:

The highest ever FIDE rating was 2851, which Garry Kasparov had on the July 1999 and January 2000 lists.

In the whole history of the FIDE rating system, only 39 players (through the 1 April 2006 list), called "Super-grandmasters", have achieved a peak rating of 2700 or more.

As of April 2006, the Hydra supercomputer is probably the strongest "over the board" chess "player" in the world; its FIDE-equivalent playing strength is estimated by its creators to be over ELO 3000, and it demonstrated this in its six-game match against Michael Adams in 2005, in which the then seventh-highest-rated player in the world managed only a single draw [1].

Mathematical details

Performance can't be measured absolutely; it can only be inferred from wins and losses. Ratings therefore have meaning only relative to other ratings, so both the average and the spread of ratings can be chosen arbitrarily. Élő suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of approximately 0.75, and the USCF initially aimed for an average club player to have a rating of 1500.

A player's expected score is his probability of winning plus half his probability of drawing. Thus an expected score of 0.75 could represent a 75% chance of winning, 25% chance of losing, and 0% chance of drawing. On the other extreme it could represent a 50% chance of winning, 0% chance of losing, and 50% chance of drawing. The probability of drawing, as opposed to having a decisive result, is not specified in the ELO system. Instead a draw is considered half a win and half a loss.
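
Written as a formula, the definition in the preceding paragraph is simply

E = P(\mathrm{win}) + \tfrac{1}{2}\,P(\mathrm{draw}).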

The above explanation covers ELO in games where draws can occur. ELO ranking for games without the possibility of draws (such as Go and backgammon) is discussed in Go rating with ELO, which also explains why winning chances do not simply accumulate across large ELO differences in those zero-sum, full-information games where the result has a magnitude (a small or large margin) in addition to the win/loss outcome, as in Go.

If Player A has true strength R_A and Player B has true strength R_B, the exact formula (using the logistic curve) for the expected score of Player A is

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.

Similarly the expected score for Player B is

E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}.

Note that E_A + E_B = 1. In practice, since the true strength of each player is unknown, the expected scores are calculated using the player's current ratings.

When a player's actual tournament scores exceed his expected scores, the ELO system takes this as evidence that the player's rating is too low and needs to be adjusted upward. Similarly, when a player's actual tournament scores fall short of his expected scores, that player's rating is adjusted downward. Élő's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player overperformed or underperformed his expected score. The maximum possible adjustment per game (sometimes called the K-value) was set at K = 16 for masters and K = 32 for weaker players.

Supposing Player A was expected to score E_A points but actually scored S_A points, the formula for updating his rating is

R_A' = R_A + K(S_A - E_A).

This update can be performed after each game or each tournament, or after any suitable rating period. An example may help clarify. Suppose Player A has a rating of 1613, and plays in a five-round tournament. He loses to a player rated 1609, draws with a player rated 1477, defeats a player rated 1388, defeats a player rated 1586, and loses to a player rated 1720. His actual score is (0 + 0.5 + 1 + 1 + 0) = 2.5. His expected score, calculated according to the formula above, was (0.506 + 0.686 + 0.785 + 0.539 + 0.351) = 2.867. Therefore his new rating is (1613 + 32*(2.5 - 2.867)) = 1601.

Note that while two wins, two losses, and one draw may seem like a par score, it is worse than expected for Player A because his opponents were lower rated on average. Therefore he is slightly penalized. If he had scored two wins, one loss, and two draws, for a total score of three points, that would have been slightly better than expected, and his new rating would have been (1613 + 32*(3 - 2.867)) = 1617.
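
The arithmetic in the two scenarios above can be checked with a few lines of Python; this is only an illustrative sketch, and the helper names are ours:

def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def updated_rating(rating, opponent_ratings, total_score, k=32):
    """Elo's linear update: old rating plus K times (actual score minus expected score)."""
    expected_total = sum(expected_score(rating, opp) for opp in opponent_ratings)
    return rating + k * (total_score - expected_total)

opponents = [1609, 1477, 1388, 1586, 1720]
print(round(updated_rating(1613, opponents, 2.5)))  # 1601, as in the example above
print(round(updated_rating(1613, opponents, 3.0)))  # 1617, the alternative scenario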

This updating procedure is at the core of the ratings used by FIDE, USCF, Yahoo! Games, the ICC, and FICS. However, each organization has taken a different route to deal with the uncertainty inherent in the ratings, particularly the ratings of newcomers, and to deal with the problem of ratings inflation/deflation. New players are assigned provisional ratings, which are adjusted more drastically than established ratings, and various methods (none completely successful) have been devised to inject points into the rating system so that ratings from different eras are roughly comparable.

The principles used in these rating systems can be used for rating other competitions—for instance, international football matches.

Practical issues

Game activity vs. protecting one's rating

In general the Elo system has increased the competitive climate for chess and inspired players to further study and improve their game. It has also enabled fascinating comparisons of the relative strength of players from completely different generations, such as Capablanca and Kasparov.

However, in some cases ratings can discourage game activity for players who wish to "protect their rating".

Examples:

(1) Players may choose their events or opponents more carefully where possible.

(2) If a player is in a Swiss tournament, and loses a couple of games in a row, they may feel the need to abandon the tournament in order to avoid any further rating "damage".

(3) Junior players, who may have high provisional ratings, and who should really be practicing as much as possible, might play less than they otherwise would because of rating concerns.

In these examples, the rating "agenda" can sometimes conflict with the agenda of promoting chess activity and rated games.

Some of this clash between game activity and rating concerns is also seen on many online servers that have implemented the Elo system. For example, higher-rated players, being much more selective about whom they play, often end up lurking, waiting for "overvalued" opponents to challenge them. Because of rating concerns, such players may feel discouraged from playing significantly lower-rated players at all. This is one possible anti-activity, anti-social aspect of the Elo rating system: the agenda of scoring points can interfere with playing with abandon, just for fun.

Also interesting from the perspective of preserving high Elo ratings versus promoting rated game activity is a recent proposal by British Grandmaster John Nunn regarding qualifiers based on Elo rating for a World Championship model [2]. In the section on "Selection of players", Nunn highlights that players should be selected not only by high Elo rating but also by their rated game activity. He clearly separates this "activity bonus" from the Elo rating, and implies using it only as a tie-breaking mechanism.

When applied to casual online play, the Elo system faces at least two other major practical issues: engine abuse and selective pairing.

Chess engines

The first and most significant issue is players using chess engines to inflate their ratings. This is particularly an issue for correspondence-style chess servers and organizations, where a player can plausibly consult a wide variety of engines within a single game, making any attempt to conclusively prove cheating quite futile. Blitz servers such as the Internet Chess Club attempt to minimize engine use by stating clearly, when players log on, that engines are not allowed.

Selective pairing

A more subtle issue is related to pairing. When players can choose their own opponents, they can choose opponents with minimal risk of losing and maximum reward for winning. The luxury of hand-picking opponents is not present in over-the-board Elo calculations, and this may largely account for ICC Elo ratings that are well over 2800.

Particular examples of 2800+ rated players choosing opponents with minimal risk and maximum possibility of rating gain include: choosing computers that they know they can beat with a certain strategy; choosing opponents that they think are over-rated; or avoiding playing strong players who are rated several hundred points below them but may hold chess titles such as IM or GM. In the category of choosing over-rated opponents, new entrants to the rating system who have played fewer than 50 games are in theory a convenient target, as they may be overrated in their provisional rating. The ICC compensates for this issue by assigning a lower K-factor to the established player if they do win against a new rating entrant. The K-factor is actually a function of the number of rated games played by the new entrant.

Elo ratings must therefore be treated as a bit of fun when applied in the context of online server ratings. Indeed, the ability to choose one's own opponents can also have great entertainment value for spectators watching the very highest-rated players. For example, they can watch very strong GMs like Shirov (nicknamed "Leon") challenge other very strong GMs who are also rated over 3100. The opposition that the highest-rated players online face in order to maintain their ratings is often much stronger than what they would meet in an open tournament run with Swiss pairings. It also helps ensure that the game histories of those with very high ratings consist largely of games against opponents with similarly high ratings.

Online Elo ratings nevertheless still provide a useful measure of strength relative to one's opponents. Their overall credibility, however, needs to be seen in the context of at least the two major issues described above, namely engine abuse and selective pairing of opponents.

The ICC has also recently introduced "auto-pairing" ratings, which are based on random pairings but where each consecutive win guarantees a statistically much harder opponent who has also won that many games in a row. With potentially hundreds of players involved, this creates some of the challenges of a fiercely contested large Swiss event, with round winners meeting round winners. This approach maximizes the rating risk of the higher-rated participants, who may face very stiff opposition from players rated below 3000. Auto-pairing is a separate rating in itself, kept in "1-minute" and "5-minute" categories, and maximum ratings achieved over 2500 are exceptionally rare.

Mathematical issues

There are three main mathematical concerns relating to the original work of Professor Élő, namely the correct distribution curve, the correct K-factor, and the crude calculations of the provisional rating period.

Most accurate distribution model

The first major mathematical concern addressed by both FIDE and the USCF was the use of the normal distribution. They found that it did not accurately represent the actual results achieved, particularly by the lower-rated players. Instead, both switched to a logistic distribution model, which seemed to provide a better fit for the actual results achieved.
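
To illustrate the difference between the two models, here is a small Python sketch comparing the logistic expected-score curve with a normal-distribution alternative. The per-player standard deviation of 200 points is an illustrative assumption for the normal model, not a figure taken from this article:

from math import erf

def expected_logistic(diff):
    """Expected score for a player `diff` rating points stronger (logistic model)."""
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

def expected_normal(diff, sigma=200.0):
    """Expected score under a normal-performance model, i.e. Phi(diff / (sigma * sqrt(2)));
    the per-player sigma of 200 is an assumption for illustration only."""
    return 0.5 * (1.0 + erf(diff / (2.0 * sigma)))

for diff in (0, 100, 200, 400, 600):
    print(diff, round(expected_logistic(diff), 3), round(expected_normal(diff), 3))
# The two curves nearly agree for small rating differences but diverge in the tails:
# the logistic model gives the weaker player noticeably better winning chances.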

Most accurate K-factor

The second major concern is the correct "K-factor" to use. The chess statistician Jeff Sonas reckons that the original K = 10 value (for players rated above 2400) in Professor Élő's work is inaccurate. If the K-factor coefficient is set too large, there will be too much sensitivity to winning, losing or drawing, in terms of the large number of points exchanged. If the K-value is too low, the sensitivity will be minimal, and it will be hard to gain a significant number of points for winning, and so on.

Élő's original K-factor estimate was made without the benefit of huge databases and statistical evidence. Sonas indicates that a K-factor of 24 (for players rated above 2400) may be more accurate both as a predictive tool of future performance and in its sensitivity to performance. A key Sonas article is Jeff Sonas: The Sonas Rating Formula — Better than Elo?

Certain Internet chess sites seem to avoid a three-level K-factor staggered by rating range. For example, the ICC seems to adopt a global K = 32, except when playing against provisionally rated players. The USCF (which uses a logistic distribution as opposed to a normal distribution) staggers the K-factor according to three main rating ranges:

Players below 2100: K-factor of 32 used

Players between 2100 and 2400: K-factor of 24 used

Players above 2400: K-factor of 16 used

FIDE apparently (according to Mark Weeks in his K-factor article) makes use of:

Players with fewer than 30 rated games: K-factor of 25 used

Players rated below 2400: K-factor of 15 used

Players rated 2400 and above with at least 30 rated games: K-factor of 10 used
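
As an illustration, here is a short Python sketch of how a staggered K-factor like the FIDE-style thresholds above might be applied. The function names, and the way the thresholds are combined (in particular, treating the 2400 boundary as a simple lookup rather than a permanent switch), are simplifying assumptions:

def fide_style_k_factor(rating, rated_games):
    """Pick a K-factor from the staggered thresholds listed above (simplified)."""
    if rated_games < 30:
        return 25
    if rating < 2400:
        return 15
    return 10

def update(rating, rated_games, opponent, score):
    """One-game Elo update using the staggered K-factor."""
    expected = 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))
    return rating + fide_style_k_factor(rating, rated_games) * (score - expected)

# A 2450-rated veteran beating a 2400 opponent gains fewer points (K = 10)
# than a 2100 player with 20 games beating a 2050 opponent (K = 25).
print(round(update(2450, 200, 2400, 1), 1))  # about 2454.3
print(round(update(2100, 20, 2050, 1), 1))   # about 2110.7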

Certainly in over-the-board chess, the staggering of the K-factor is important to help ensure minimal inflation at the top end of the rating spectrum. In theory this should apply equally to an online chess server and to a standard over-the-board organisation such as FIDE or the USCF: it would be harder for players to reach much higher ratings if their K-factor sensitivity were lessened, for example from 32 to 16, once they pass a 2400 rating. However, the ICC's help on K-factors (see the ICC K-factor help page) indicates that it may simply be the choosing of opponents that enables 2800+ players to further increase their ratings quite easily. This seems to hold in practice: analysing the games of GM Shirov on the ICC (where he is nicknamed "leon"), one finds a string of games against opponents who are all rated over 3100. In over-the-board chess, only in very high-level all-play-all events, at least a category 15+ FIDE event, could GM Shirov find a steady stream of 2700+ opponents. (A category 10 FIDE event, by contrast, means players restricted to ratings between 2476 and 2500.) If GM Shirov entered normal Swiss-paired open over-the-board tournaments, he would regularly meet many opponents rated below 2500 FIDE, and a single loss or draw against a player under 2500 would knock his real FIDE rating down significantly.

Even if the K-factor were 16, and "leon" (GM Shirov) defeated a 3100+ player several games in a row, his rating would still rise quite significantly in a short period of time, because the speed of blitz games makes it possible to play many games within a few days. The K-factor would arguably only slow down the increase after each win. The evidence given in the ICC K-factor article relates to the auto-pairing system, where the maximum ratings achieved are seen to be only about 2500. So it seems that random pairing, as opposed to selective pairing, is the key to combating rating inflation at the top end of the rating spectrum, with a slightly lower K-factor for players rated above 2400 mattering only to a much lesser extent.

Provisional period crude averaging

Within the provisional period for calculating an Elo rating, some sites use, for some reason, a crude averaging of opponents' ratings over the first 20 games. Apart from the obvious flaw that, under such averaging, beating a much weaker opponent can cost you points while losing to a much stronger one can gain you points, it should also be obvious to players in the UK that a much superior and fully tested performance measurement system is already available, namely the English Chess Federation (ECF) grading system. The ECF grading system already handles the case where you beat someone rated much lower, by simply adding some points to your grade; when you lose to someone rated much higher, you likewise lose some points relative to your grade. Perhaps in a new model of Elo calculations, the tried and tested ECF grading system could be used for the first 20-50 games in order to establish a stable Elo rating, which would then be subject to Elo's main formula based on the probability distribution.
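
For comparison, here is a sketch of how an ECF-style provisional grade might be computed. The specific rules used (each game scores the opponent's grade plus 50 for a win, minus 50 for a loss, unchanged for a draw, with the opponent's grade clamped to within 40 points of one's own) reflect the general ECF approach from memory rather than this article, so treat them as an illustrative assumption:

def ecf_style_grade(own_grade, results):
    """Provisional grade as the average of per-game scores, ECF-style (assumed rules).
    `results` is a list of (opponent_grade, outcome) with outcome 1, 0.5 or 0."""
    game_scores = []
    for opponent_grade, outcome in results:
        # Treat very mismatched opponents as if they were within 40 points.
        clamped = max(own_grade - 40, min(own_grade + 40, opponent_grade))
        game_scores.append(clamped + 50 * (2 * outcome - 1))
    return sum(game_scores) / len(game_scores)

# The win over the much weaker (80-graded) opponent contributes 160 points,
# above the player's own grade of 150, rather than dragging the average down.
print(ecf_style_grade(150, [(80, 1), (160, 0.5), (170, 0)]))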

Jeff Sonas, on his Chessmetrics website, has also described a more sophisticated way of averaging, called a "weighted and padded simultaneous performance rating". More information can be found on the Chessmetrics formulas page.

Elo ratings in other competitions

A spin-off system, not related to chess, has been adopted to rate the relative strength of national football teams in competition, called the Elo football rating.

In other sports, individuals maintain rankings based on the Elo algorithm. For instance, Jeff Sagarin publishes rankings for American college football and basketball, with "Elo chess" being one of the two rankings he presents.

National Scrabble organizations compute normally-distributed Elo ratings except in the United Kingdom, where a different system is used. The North American National Scrabble Association has the largest rated population, numbering over 11,000 as of early 2006.

See also