Measuring Harmony With Algebra: On Players Evaluation

Part I.

The Elo rating system is a system used for the evaluation and comparison of competitors. Up until today it has mostly been applied in the domain of board games: it is best known in chess, but is also used in disciplines such as draughts or go. The Elo system, named after its inventor, Prof. Arpad Elo, who first published it in the 1950s in the US, is capable of producing a reliable score expectation for an encounter between two competitors.

For those who are not familiar with chess or draughts, let's take a look at how the Elo ratings work:

1) In an encounter between two competitors, A and B, assume they have ratings R_a and R_b.

2) There is a function that maps the two ratings to the expected result for each player:

E_a = F(R_a, R_b)

E_b = F(R_b, R_a)

where F is a monotonic non-decreasing function of the player's own rating, bounded between the minimum and maximum possible scores, such as 0 and 1 in chess. An example of such a function would be arctan(x)/π + 0.5, with x being, for instance, the rating difference.

E_a+E_b should be equal to the maximum possible score.

In practice a non-analytical, table-defined function is used that depends only on the difference between R_a and R_b, and not on their actual values. The function can be reliably approximated by the following expression:

E_a = 1 / [ 1 + 10^((R_b-R_a)/400) ]

which works well with ratings that are low 4-digit numbers and rating changes per game in the 0-20 range.

3) After the encounter, when real scores S_a and S_b have been registered, the ratings are adjusted:

R_a1 = R_a + K*(S_a-E_a)

R_b1 = R_b + K*(S_b-E_b)

where K is a volatility coefficient, which is usually higher for participants with a shorter history, but ideally should be equal for both participants. The new ratings are then used to produce the new expected results, and so on.
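The three steps above can be sketched in Python. This is only a minimal illustration, not the official procedure of any federation: it uses the logistic approximation instead of the lookup table, and the function names, the default K = 10 and the example ratings are assumptions made up for this sketch.

```python
def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Logistic approximation of the expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def update(r_a: float, r_b: float, s_a: float, k: float = 10.0):
    """Return the new ratings of A and B after A scored s_a and B scored 1 - s_a.

    The same K is used for both players (the 'ideal' case from the text).
    """
    e_a = expected_score(r_a, r_b)
    e_b = expected_score(r_b, r_a)
    s_b = 1.0 - s_a
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)


if __name__ == "__main__":
    # Hypothetical encounter: a 1600-rated player draws against a 1400-rated one.
    print(expected_score(1600, 1400))
    print(update(1600, 1400, 0.5))
```

With these assumed numbers the favourite is expected to score about 0.76, so a draw costs the favourite roughly 2.6 points, and the underdog gains the same amount because K is equal for both.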

The Elo rating has several highly important properties:

1) It gravitates to the center. As a participant's rating R climbs higher, so does the expected result E, which becomes difficult to maintain, and a failure to maintain it usually results in a bigger drop in the rating.

2) It's approximately distributive. If we gather N performances, average the opponents' ratings as R_av, take the expected average performance as E_av = F(R_a, R_av) and the actual average performance as S_av, then the new rating R_aN' = R_a + N*K*(S_av-E_av) will be relatively close to the rating R_aN obtained by updating R_a directly after each of the N games in turn.

3) It reflects tendencies, but overall performance still trumps them. Take three players with the same rating, each with ten encounters against other players of that same rating, whose performances are (W - win, L - loss):

For player 1: L,L,L,L,L,W,W,W,W,W

For player 2: L,W,L,W,L,W,L,W,L,W

For player 3: W,W,W,W,W,L,L,L,L,L

Player 1 will end up with the highest rating of the three, player 2 will be in the middle, and player 3 will have the lowest one - but not by a very big margin. Only when the streaks become really long may the Elo of a weaker overall performance overtake the Elo of a stronger one.
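Properties 2 and 3 can be checked numerically. The sketch below makes assumptions the text does not fix: a logistic approximation of F, K = 10, a starting rating of 1500 for everyone, and opponents whose ratings do not change. The names `sequential`, `batch`, `MIXED_OPPONENTS` and `MIXED_SCORES` are invented for this illustration.

```python
def expected_score(r_a, r_b):
    """Logistic approximation of the expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def sequential(r_a, opponents, scores, k=10.0):
    """Update the rating after every single game, in order."""
    for r_opp, s in zip(opponents, scores):
        r_a += k * (s - expected_score(r_a, r_opp))
    return r_a


def batch(r_a, opponents, scores, k=10.0):
    """One-shot update against the average opponent (property 2)."""
    n = len(scores)
    r_av = sum(opponents) / n
    s_av = sum(scores) / n
    return r_a + n * k * (s_av - expected_score(r_a, r_av))


START = 1500.0
W, L, D = 1.0, 0.0, 0.5

# Property 3: same overall score (5/10), different order of results.
OPPONENTS = [1500.0] * 10
player1 = [L] * 5 + [W] * 5
player2 = [L, W] * 5
player3 = [W] * 5 + [L] * 5

# Property 2: a hypothetical mixed field of five opponents.
MIXED_OPPONENTS = [1450.0, 1520.0, 1580.0, 1490.0, 1610.0]
MIXED_SCORES = [W, D, L, W, D]

if __name__ == "__main__":
    for name, results in (("player 1", player1),
                          ("player 2", player2),
                          ("player 3", player3)):
        print(name, round(sequential(START, OPPONENTS, results), 2))
    print("sequential:", round(sequential(START, MIXED_OPPONENTS, MIXED_SCORES), 2))
    print("batch:     ", round(batch(START, MIXED_OPPONENTS, MIXED_SCORES), 2))
```

Under these assumptions the three players finish in the order player 1, player 2, player 3, within a few points of each other, and the batch update for the mixed field lands within a fraction of a point of the game-by-game result. A larger K or longer streaks would widen both gaps.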

And how does Elo stack up against the four Brits?

* Goodhart's Law: pass. It measures the same thing it indicates.
* Granger's Causality: pass. It is a consequence of performance by definition, and a prediction of future performance, also by definition.
* Occam's Razor: pass. The ratings revolve around the same parameter they measure.
* Popper's Falsifiability: partial pass. The predictions of Elo sometimes fail, because they are probabilistic. However, the test of time and the wide acceptance indicate that the confidence level holds. Elo was even used for "paleostatistics", when the ratings were calculated backwards to the middle of the 19th century, and the resulting calculations are well received by the community of chess historians.

The only well-known drawback of Elo is that top chess players tend to avoid competing against much weaker opposition, especially when the weaker opponent has White: such a game can be drawn relatively easily by the opponent, and the top player's Elo rating could take a significant hit, resulting in a drop of several places in the rating list.

Now, to the question of the chicken and the egg - where do the initial Elo ratings come from? Well, they can be set to an arbitrary low 4-digit value. Currently a FIDE beginner starts with a rating of 1300. If the newcomer is recognized as being more skilled than a beginner, then a higher rating is assigned based on the rating grades for each skill level, a sort of historical average of the newcomer's peers.

And... What does all this have to do with hockey?

To be continued...

Measuring Harmony With Algebra

Wednesday, December 28, 2016

On Players Evaluation - Part II (Elo)
