Despite having a crazy amount of stuff to do on my
TODO list for the site, I spent a number of hours exploring the
Glicko-2 rating system in detail.
While some of you will have heard of Glicko from The NAF (who relatively recently introduced a Glicko-variant rating for their site), the rating system itself has been around for quite some time. I've considered switching FUMBBL to it multiple times in the past, but never got around to doing a proper deep-dive into the system and playing around with it.
Last night, that changed. I spent a fair bit of time reading through the original Glicko-2 description and wrote an implementation of it. This blog is an attempt at a summary for those of you who are interested. Note, though, that this is relatively heavy in terms of math, but I'll do my best to explain it as clearly as possible. Also be aware that this explanation is glossing over some details and is somewhat simplified.
First some background. Fundamentally, a rating system is a way to try to estimate how strong a player is. The concept is fairly easy to explain: You have some rating on a scale. If you win a game against someone else, your rating will go up and if you lose, your rating will go down.
A trivial example of such a system would be to add one point for a win, and subtract one point for a loss, and the sum would be your rating. This would clearly not be very good since winning against the best player in the world would count as much as winning against someone who never played the game before.
FUMBBL has been using a rating system based on something called Elo (the name being derived from its Hungarian inventor, Arpad Elo; it's not an acronym). The core concept of the Elo system is that you gain more points if you win against a stronger coach (someone with a higher rating) than you do if you win against someone with a lower rating. Conversely, you lose more rating if you lose against a lower-rated coach than you do if you lose against a higher-rated coach.
There are two core formulas in the Elo system:
E = 1 / (1 + 10 ^ (rating difference / 400) )
and
R' = R + K(S - E)
To break these down, let's start with the first one. E is the "expected outcome", or the expected win probability: the estimated probability that a given coach wins against a given opponent. Here, "rating difference" is the opponent's rating minus your own. The logistic (S-shaped) function ensures that the result is always between 0 and 100%, and if the rating difference is zero (the players are equal) the E value will be 50%, as expected.
The second formula is how a rating changes after a game. R' is the new rating, R is the pre-game rating, K is a constant (most often set to 2 for FUMBBL), S is the result for the game (1 for a win, 0.5 for a tie, and 0 for a loss). E is the expected win rate from above.
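To make this concrete, here's a minimal sketch of the two Elo formulas in Python (the function names are mine, and K defaults to the 2 most often used on FUMBBL):

```python
def elo_expected(rating, opp_rating):
    # E = 1 / (1 + 10^((opponent - player) / 400))
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def elo_update(rating, opp_rating, score, k=2.0):
    # R' = R + K(S - E); score is 1 for a win, 0.5 for a tie, 0 for a loss
    return rating + k * (score - elo_expected(rating, opp_rating))
```

An evenly matched game gives E = 0.5, so a win at K = 2 moves the winner up by exactly one point.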
On FUMBBL, and in Blood Bowl in general, games are a bit different from most other rated games: coaches have teams which are fundamentally not equal. Team Value is a (very rough) estimate of this difference in power level, and in order to take it into account, I introduced a variation of the E formula:
E = 1 / (1 + 10 ^ (rating difference / 400 + TV difference / 70k) )
There are many more details in how FUMBBL implements the rating system, but for this blog, I'm going to skip additional details to keep it relatively simple.
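As a sketch of that TV-adjusted E formula (assuming the TV difference is measured in gold pieces and, like the rating difference, taken as opponent minus player):

```python
def elo_expected_tv(rating, opp_rating, tv, opp_tv):
    # E = 1 / (1 + 10^(rating difference / 400 + TV difference / 70k))
    exponent = (opp_rating - rating) / 400.0 + (opp_tv - tv) / 70_000.0
    return 1.0 / (1.0 + 10 ** exponent)
```

With these constants, a 70k TV advantage shifts the expected outcome as much as a 400-point rating advantage would.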
So why look at other alternatives? What's wrong with Elo?
There are a few things that could be improved on from an Elo based system. There are two major things:
Rating inflation and deflation is when a given rating level is worth more or less than it used to historically. This has absolutely happened on FUMBBL: despite the system being designed to place people on a normal distribution around 150.0, the actual ratings on the site have shifted down slightly, to maybe 148 or so. This means that new coaches who join the site, assuming they're average on... average (sorry for the word play here), will be estimated as slightly too strong, and they will introduce more rating into the system than they should. This isn't a huge deal, and it's offset by lower-rated coaches being more likely to stop playing (statistically speaking, this is true), which in turn removes some rating from the system.
The other, more important, problem is that someone new to the site enters with a 150 rating regardless of whether they are actually of average skill (they rarely are). This opens a window of opportunity for more experienced coaches to "cherry pick" games against these misrated newcomers. FUMBBL has introduced certain safeguards against this (primarily a variable K in the formula above for games between coaches in different ranking brackets), which work in general but aren't very elegant.
A third, less major, problem is that once you gain a certain rating, you maintain it forever. This pollutes the top coach list, so the site currently requires a coach to have played a game recently or they'll be removed from the list. But even if they do disappear, they can play a single match and reappear, and there is no real incentive to fight for your top spot once you reach it.
So what's this Glicko thing, and why is it better?
Glicko (again named after the inventor, Mark Glickman) is based on Elo and introduces the core concept of "ratings deviation", or RD. This is a measure of how reliable the rating is, and has fairly fundamental effects on how the rating system works as a whole.
To give an example, a typical starting rating in a Glicko system is 1500 (just think of this like 10x the current FUMBBL CR), and an initial RD of 350. This means that the system thinks there is a 68% chance that a new player is rated between 1150 and 1850 (1500 plus or minus the 350), and a 95% chance that the new player is rated between 800 and 2200 (1500 plus or minus 2x350). For those of you who are mathematically inclined, this is a typical standard deviation measurement (similar to confidence intervals, but not technically that).
In a fundamental sense, winning against a player with a high RD (high level of uncertainty) will give the player less rating than winning against a player with the same rating, but a lower RD (low level of uncertainty). This addresses one of the core problems above.
Another thing that the Glicko system introduces is that RD will increase over time as players are inactive. While a player's rating does not change over time (there is no real decay in play here), the uncertainty will increase and once the player comes back to play again, their rating will move quicker to "catch up" from the time spent inactive. This could go both ways; sometimes someone hasn't been playing for a year, and sometimes they've been playing tabletop tournaments every weekend.
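That inactivity mechanic can be sketched like this (the decay constant c and the 350 cap are illustrative values borrowed from Glickman's worked examples; FUMBBL would need its own tuning):

```python
import math

def inflate_rd(rd, periods, c=34.6, rd_cap=350.0):
    # RD grows with each inactive rating period, but is capped so that
    # uncertainty never exceeds that of a brand-new player
    return min(rd_cap, math.sqrt(rd * rd + c * c * periods))
```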
In an Elo based system, a decay would have to remove rating from coaches, which causes real problems in the overall system (causes deflation). With Glicko, you can get a decay of sorts by sorting coaches by (rating - RD) instead of raw rating without causing deflation on a system level.
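A leaderboard built that way is essentially a one-liner; sorting by rating minus RD means inactive coaches drift down the list as their uncertainty grows, without their actual rating ever being touched:

```python
def leaderboard(coaches):
    # coaches: list of (name, rating, rd) tuples;
    # rank by the conservative estimate rating - RD
    return sorted(coaches, key=lambda c: c[1] - c[2], reverse=True)
```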
So what about the math? How is Glicko different? Again, at its core, there are two relatively similar formulas involved (simplified to make it clearer):
E = 1 / (1 + exp( -RD_opp * (rating difference) ) )
and
R' = R + RD^2 * RD_opp *(S - E)
These two formulas are very similar to the Elo variants; the essential addition is the RD terms. (Strictly speaking, RD_opp above stands for g(RD_opp), a dampening function of the opponent's deviation, and the ratings are expressed on a different internal scale for math reasons.) With some fiddling around with scaling and constants, the Glicko system can fundamentally be made to behave like Elo; the correlation is clear.
I spent some time studying the Glicko-2 PDF from Mark Glickman and implemented the formulas in a way that is inherently compatible with how FUMBBL works, and earlier today spent a few hours figuring out details on how to introduce a TV modification to the formula. In its simplest form, I've adjusted the E formula above to the following:
E = 1 / (1 + exp( -RD_Opp * (rating difference + f(TV difference) ) ) )
f here is a function that converts the TV difference to a number roughly between -1.4 and 1.4. It's slightly affected by the RD value as well, to produce a better fit with statistical win rates pulled from the current set of matches in the C division.
If you're interested in nitty-gritty details, I've transcribed the math to an online site (Desmos) where you can visualize how things function:
Go check it out here.
To understand what is going on there: r is the rating (which doesn't affect the graph in its current state), dR is the rating deviation (the app doesn't allow the name RD), dTV is the difference in TV between the two teams, and the graph shows E as a function of the difference in rating (think 10x FUMBBL's current scale). If you want more detail than that page provides, I recommend reading the Glicko-2 PDF linked above.
Ok, so after this incredibly large wall of text, what's next?
I'm honestly not sure. I have code on the site that can calculate Glicko-2 ratings with this formula, and the step from here to a functional CR system isn't terribly big (it's mostly setting up the necessary database tables and adjusting some of the GUI to show Glicko instead of, or in addition to, Elo).
Given that there are things I will need to adjust in terms of the future bb2020 blackbox, there is a high probability that we'll go Glicko-2 in that process. It's something I've wanted to do for a long time as it is and just never got around to.
If you're still with me at this point (it doesn't count if you skipped! go back and read it all :) ), feel free to post your thoughts on this stuff in the comments below.