Despite having a crazy amount of stuff to do on my
TODO list for the site, I spent a number of hours exploring the
Glicko-2 rating system in detail.
While some of you will have heard of Glicko from The NAF (who relatively recently introduced a Glicko-variant rating for their site), the rating system itself has been around for quite some time. I've considered switching FUMBBL to it multiple times in the past, but never got around to doing a proper deep-dive into the system and playing around with it.
Last night, that changed. I spent a fair bit of time reading through the original Glicko-2 description and wrote an implementation of it. This blog is an attempt at a summary for those of you who are interested. Note, though, that this is relatively heavy in terms of math, but I'll do my best to explain it as clearly as possible. Also be aware that this explanation is glossing over some details and is somewhat simplified.
First some background. Fundamentally, a rating system is a way to try to estimate how strong a player is. The concept is fairly easy to explain: You have some rating on a scale. If you win a game against someone else, your rating will go up and if you lose, your rating will go down.
A trivial example of such a system would be to add one point for a win, and subtract one point for a loss, and the sum would be your rating. This would clearly not be very good since winning against the best player in the world would count as much as winning against someone who never played the game before.
FUMBBL has been using a rating system based on something called Elo (the name being derived from its Hungarian inventor, Arpad Elo; it's not an acronym). The core concept of the Elo system is that you gain more points if you win against a stronger coach (someone with a higher rating) than you do if you win against someone with a lower rating. Conversely, you lose more rating if you lose against a lower-rated coach than you do if you lose against a higher-rated coach.
There are two core formulas in the Elo system:
E = 1 / (1 + 10 ^ (rating difference / 400) )
and
R' = R + K(S - E)
To break these down, let's start with the first one. E is the "expected outcome", or the expected win probability: the estimated probability that a given coach wins against a given opponent. Here, "rating difference" is the opponent's rating minus your own. The logistic (S-shaped) function ensures that the result is always between 0 and 100%, and if the rating difference is zero (the players are equal) the E value will be 50%, as expected.
The second formula is how a rating changes after a game. R' is the new rating, R is the pre-game rating, K is a constant (most often set to 2 for FUMBBL), S is the result for the game (1 for a win, 0.5 for a tie, and 0 for a loss). E is the expected win rate from above.
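To make this concrete, here's a minimal sketch of the two Elo formulas in Python (the function names are mine, and K defaults to the 2 most often used on FUMBBL):

```python
def elo_expected(rating, opp_rating):
    # E = 1 / (1 + 10^((opponent - player) / 400))
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def elo_update(rating, opp_rating, score, k=2.0):
    # R' = R + K(S - E); score is 1 for a win, 0.5 for a tie, 0 for a loss
    return rating + k * (score - elo_expected(rating, opp_rating))
```

An evenly matched game gives E = 0.5, so a win at K = 2 moves the winner up by exactly one point.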
On FUMBBL, and in Blood Bowl in general, games are a bit different from most other rated games: coaches have teams which are fundamentally not equal. Team Value is a (very rough) estimate of this difference in power level, and in order to take it into account, I introduced a variation of the E formula:
E = 1 / (1 + 10 ^ (rating difference / 400 + TV difference / 70k) )
There are many more details in how FUMBBL implements the rating system, but for this blog, I'm going to skip additional details to keep it relatively simple.
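As a sketch of that TV-adjusted E formula (assuming the TV difference is measured in gold pieces and, like the rating difference, taken as opponent minus player):

```python
def elo_expected_tv(rating, opp_rating, tv, opp_tv):
    # E = 1 / (1 + 10^(rating difference / 400 + TV difference / 70k))
    exponent = (opp_rating - rating) / 400.0 + (opp_tv - tv) / 70_000.0
    return 1.0 / (1.0 + 10 ** exponent)
```

With these constants, a 70k TV advantage shifts the expected outcome as much as a 400-point rating advantage would.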
So why look at other alternatives? What's wrong with Elo?
There are a few things that could be improved on from an Elo based system. There are two major things:
Rating inflation and deflation is when a given rating level is worth more or less than it used to historically. This has absolutely happened on FUMBBL: despite the system being designed to place people on a normal distribution around 150.0, the actual ratings on the site have shifted down slightly, to maybe 148 or so. This means that new coaches who join the site, assuming they're average on... average (sorry for the word play here), will be estimated as slightly too strong, and they will introduce more rating into the system than they should. This isn't a huge deal, and it's offset by lower-rated coaches being more likely to stop playing (statistically speaking, this is true), which in turn removes some rating from the system.
The other, more important, problem is that someone new to the site enters with a 150 rating regardless of whether they are actually of average skill (they rarely are). This opens a window of opportunity for more experienced coaches to "cherry pick" games against these misrated newcomers. FUMBBL has introduced certain safeguards against this (primarily a variable K in the formula above for games between coaches in different ranking brackets), which work in general but aren't very elegant.
A third, less major, problem is that once you gain a certain rating, you maintain it forever. This pollutes the top coach list, so the site currently requires a coach to have played a game recently or they'll be removed from the list. But even if they do disappear, they can play a single match and reappear, and there is no real incentive to fight for your top spot once you reach it.
So what's this Glicko thing, and why is it better?
Glicko (again named after the inventor, Mark Glickman) is based on Elo and introduces the core concept of "ratings deviation", or RD. This is a measure of how reliable the rating is, and has fairly fundamental effects on how the rating system works as a whole.
To give an example, a typical starting rating in a Glicko system is 1500 (just think of this like 10x the current FUMBBL CR), and an initial RD of 350. This means that the system thinks there is a 68% chance that a new player is rated between 1150 and 1850 (1500 plus or minus the 350), and a 95% chance that the new player is rated between 800 and 2200 (1500 plus or minus 2x350). For those of you who are mathematically inclined, this is a typical standard deviation measurement (similar to confidence intervals, but not technically that).
In a fundamental sense, winning against a player with a high RD (high level of uncertainty) will give the player less rating than winning against a player with the same rating, but a lower RD (low level of uncertainty). This addresses one of the core problems above.
Another thing that the Glicko system introduces is that RD will increase over time as players are inactive. While a player's rating does not change over time (there is no real decay in play here), the uncertainty will increase and once the player comes back to play again, their rating will move quicker to "catch up" from the time spent inactive. This could go both ways; sometimes someone hasn't been playing for a year, and sometimes they've been playing tabletop tournaments every weekend.
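That inactivity mechanic can be sketched like this (the decay constant c and the 350 cap are illustrative values borrowed from Glickman's worked examples; FUMBBL would need its own tuning):

```python
import math

def inflate_rd(rd, periods, c=34.6, rd_cap=350.0):
    # RD grows with each inactive rating period, but is capped so that
    # uncertainty never exceeds that of a brand-new player
    return min(rd_cap, math.sqrt(rd * rd + c * c * periods))
```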
In an Elo based system, a decay would have to remove rating from coaches, which causes real problems in the overall system (causes deflation). With Glicko, you can get a decay of sorts by sorting coaches by (rating - RD) instead of raw rating without causing deflation on a system level.
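A leaderboard built that way is essentially a one-liner; sorting by rating minus RD means inactive coaches drift down the list as their uncertainty grows, without their actual rating ever being touched:

```python
def leaderboard(coaches):
    # coaches: list of (name, rating, rd) tuples;
    # rank by the conservative estimate rating - RD
    return sorted(coaches, key=lambda c: c[1] - c[2], reverse=True)
```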
So what about the math? How is Glicko different? Again, at its core, there are two relatively similar formulas involved (simplified to make it clearer):
E = 1 / (1 + exp( -RD_opp * (rating difference) ) )
and
R' = R + RD^2 * RD_opp *(S - E)
These two formulas are very similar to the Elo variants; the essential addition is the RD terms. (Strictly speaking, RD_opp above stands for g(RD_opp), a dampening function of the opponent's deviation, and the ratings are expressed on a different internal scale for math reasons.) With some fiddling around with scaling and constants, the Glicko system can fundamentally be made to behave like Elo; the correlation is clear.
I spent some time studying the Glicko-2 PDF from Mark Glickman and implemented the formulas in a way that is inherently compatible with how FUMBBL works, and earlier today spent a few hours figuring out details on how to introduce a TV modification to the formula. In its simplest form, I've adjusted the E formula above to the following:
E = 1 / (1 + exp( -RD_Opp * (rating difference + f(TV difference) ) ) )
f here is a function that converts the TV difference to a number roughly between -1.4 and 1.4. It's slightly affected by the RD value as well, to produce a better fit with statistical win rates pulled from the current set of matches in the C division.
If you're interested in nitty-gritty details, I've transcribed the math to an online site (Desmos) where you can visualize how things function:
Go check it out here.
To understand what is going on there: r is the rating (which doesn't affect the graph in its current state), dR is the rating deviation (the app doesn't allow the name RD), dTV is the difference in TV between the two teams, and the graph shows E as a function of the difference in rating (think 10x FUMBBL's current scale). If you want more detail than that page provides, I recommend reading the Glicko-2 PDF linked above.
Ok, so after this incredibly large wall of text, what's next?
I'm honestly not sure. I have code on the site that can calculate Glicko-2 ratings with this formula, and the step from here to a functional CR system isn't terribly big (it's mostly setting up the necessary database tables and adjusting some of the GUI to show Glicko instead of, or in addition to, Elo).
Given that there are things I will need to adjust in terms of the future bb2020 blackbox, there is a high probability that we'll go Glicko-2 in that process. It's something I've wanted to do for a long time as it is and just never got around to.
If you're still with me at this point (it doesn't count if you skipped! go back and read it all :) ), feel free to post your thoughts on this stuff in the comments below.