With the recent game server instability seemingly resolved, a few people have asked what the cause was, as well as about the remaining residual issues. Instead of repeating the same information over and over again, I figured I'd write a bit of a post-mortem for the issue, including an explanation of my understanding of it, what problems still remain, and what the plan is to resolve those.
First off, let's talk about the problem itself.
The news post explains a bit of the background of the issue with authentication (e.g. wrong password when starting the FFB client), and how I converted the auth process from the Java back-end to a node back-end and finally to direct PHP. None of these things helped.
Picking up where the news post left off, I also moved the game server from the old physical server (Roxanna) to a newly spun-up virtual server (Max). This didn't remove the problem either; it remained.
While I was moving the FFB server over to the new platform, I was also looking through how the FFB Server handles these requests, and added some extra logging to that code. Digging through thousands of lines of very spammy logs eventually led me to an event that caused a behaviour change on the server side: replays. After a request for a replay, the part of the FFB server responsible for talking to external systems (including FUMBBL for authentication and team loading) simply stopped responding. No good.
So, I disabled replays and restarted the FFB Server. No more instability. With a huge amount of relief, I was confident that this was the core problem. The next step was to figure out what was going on, and why it was locking up.
To be able to explain the problem, I need to take a step back here and explain how replays work on a technical level.
Basically, when a match is being played, the log of events (movements, dice rolls, everything) is kept in the FFB database. After the match completes, it goes into a "Finished" state and a request to upload the match to FUMBBL is sent. Once this has completed (note that this also uses the same external communication process as outlined above, which is why matches were completed but didn't upload automatically to the site during the problematic days), the match is set to another state: "Uploaded". The next step is that the log of the match is written to a (compressed) file on the server running the FFB server. This cleans up the database and reduces the "cost" of database backups.
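To make that lifecycle a bit more concrete, here's a rough sketch in Java. The class, state and file names are made up for illustration; they are not the actual FFB Server classes:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Illustrative only; the real FFB Server uses different class and state names.
enum MatchState { IN_PROGRESS, FINISHED, UPLOADED }

class MatchArchiver {
    // Runs after the match has reached the "Uploaded" state.
    void archive(long matchId, String matchLog, Path archiveDir) throws IOException {
        // Write the event log to a compressed file on the game server's disk...
        Path target = archiveDir.resolve(matchId + ".json.gz");
        try (OutputStreamWriter out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)))) {
            out.write(matchLog);
        }
        // ...so the log rows can be removed from the FFB database,
        // which keeps the database (and its backups) small.
        deleteLogFromDatabase(matchId);
    }

    void deleteLogFromDatabase(long matchId) {
        // Placeholder for the actual database cleanup.
    }
}
```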
On the game server itself, these match logs are copied to Amazon S3 for long-term storage as part of a nightly batch. And finally, every now and again, the local copies of old matches are simply deleted (after verifying that the Amazon S3 copy script is still working to make sure they're not completely lost).
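The real nightly job is a script on the game server, but the idea looks roughly like this if sketched in Java with the AWS SDK (the bucket name, file extension and class names are assumptions for illustration, not the actual setup):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

class NightlyReplaySync {
    private final S3Client s3 = S3Client.create();
    private final String bucket = "example-replay-bucket"; // illustrative bucket name

    // Copy every local replay file to S3; the real job runs as a nightly batch.
    void syncToS3(Path archiveDir) throws IOException {
        try (Stream<Path> files = Files.list(archiveDir)) {
            files.filter(p -> p.toString().endsWith(".json.gz")).forEach(p ->
                s3.putObject(PutObjectRequest.builder()
                        .bucket(bucket)
                        .key(p.getFileName().toString())
                        .build(),
                    RequestBody.fromFile(p)));
        }
    }

    // Only delete a local copy once we know the S3 copy actually exists.
    void deleteIfArchived(Path localFile) throws IOException {
        try {
            s3.headObject(HeadObjectRequest.builder()
                    .bucket(bucket)
                    .key(localFile.getFileName().toString())
                    .build());
            Files.delete(localFile);
        } catch (NoSuchKeyException e) {
            // Not in S3 yet; keep the local copy so nothing is lost.
        }
    }
}
```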
Now, when someone on the site requests to view a replay, the FFB server first checks the current matches (you can technically replay a match in progress) for the log. If it's not there, it checks the database for it. If it's not there, the local drive is checked to see if it's available there. And finally, if all else fails, it sends out a request to an external process which is responsible for downloading it from Amazon S3. All of this is done by a separate "replay" service which is part of the FFB server package. Except the Amazon S3 downloader, which is yet another service. This sounds pretty complicated, but is a rather neat infrastructure from a technical perspective.
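In pseudocode-ish Java, the lookup order looks something like this (again, the names are illustrative, not the real FFB code):

```java
import java.util.Optional;

// Illustrative sketch of the replay lookup order.
class ReplayLookup {

    Optional<String> findReplay(long matchId) {
        // 1. A match still in progress has its log in memory.
        Optional<String> live = findInCurrentMatches(matchId);
        if (live.isPresent()) return live;

        // 2. A finished-but-not-yet-archived match still has its log in the database.
        Optional<String> fromDb = findInDatabase(matchId);
        if (fromDb.isPresent()) return fromDb;

        // 3. Archived matches live as compressed files on the local disk.
        Optional<String> fromDisk = findOnLocalDisk(matchId);
        if (fromDisk.isPresent()) return fromDisk;

        // 4. Last resort: ask the separate downloader service to fetch it from Amazon S3.
        return requestFromS3Downloader(matchId);
    }

    Optional<String> findInCurrentMatches(long matchId) { /* ... */ return Optional.empty(); }
    Optional<String> findInDatabase(long matchId) { /* ... */ return Optional.empty(); }
    Optional<String> findOnLocalDisk(long matchId) { /* ... */ return Optional.empty(); }
    Optional<String> requestFromS3Downloader(long matchId) { /* ... */ return Optional.empty(); }
}
```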
And now we can get back to the problem. What happened here was that two individually non-critical problems combined into a huge mess. First off, that last component responsible for Amazon S3 downloading wasn't running. This meant that no old replays could be loaded. By itself, this wasn't a huge deal, since in theory the FFB Server could just accept that no replay was available, produce an error and tell the end user that it wasn't possible to load that particular replay.
This is where the second problem comes in. The FFB Server sends a request to the replay service to ask for the file. Once the replay service realizes the file isn't available locally, it sends a request to the downloader service asking for the file. And here's the crux of it. The technology used for communication between these two services is a "message queue", or MQ, used in a way called "asynchronous" messaging. This means that a request is sent for some data, and then you wait until later to get a response back. It's technically not a direct reply, but a separate message. Think of it like the replay service sending a text message to the downloader service saying it wants the replay file. After some amount of time, the downloader service finds the file and sends it back over. During this waiting time, the first service can theoretically do other things.
The way this is implemented in the FFB Server component, it simply sends the request and waits. And waits. And waits. If there is no downloader service running, the waiting never stops. This wait is "active", meaning the replay service never gets back to the FFB Server, but it also never directly says "no, it doesn't work". It just sits there, keeping the proverbial phone line open and implying the response will come soon.
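I won't claim this is the exact library or code the FFB Server uses, but in generic JMS-style terms the problematic pattern looks roughly like this (the queue names and class are invented for illustration):

```java
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

// JMS-style sketch of the problematic request-and-wait pattern.
class ReplayRequester {

    String fetchFromDownloader(Session session, Queue requestQueue, Queue replyQueue,
                               long matchId) throws JMSException {
        MessageProducer producer = session.createProducer(requestQueue);
        MessageConsumer consumer = session.createConsumer(replyQueue);

        // "Send a text message" to the downloader service asking for the replay file.
        producer.send(session.createTextMessage("replay:" + matchId));

        // And then wait. receive() with no timeout blocks forever, so if the
        // downloader service isn't running this thread never comes back...
        Message reply = consumer.receive();

        // ...and nothing past this point ever runs.
        return ((TextMessage) reply).getText();
    }
}
```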
That endless wait ended up blocking the FFB server's external service connector from processing anything beyond this one replay request.
As you can imagine, this was a very very tough problem to identify.
So with the direct problem identified, the site is left in a bit of a state. To summarize, the changes were:
1. The authentication framework was rewritten into PHP.
2. Replays are disabled.
3. The FFB Server now runs on a new machine.
The first point created a bit of craziness when I made the change, because I had to quickly implement something while at the same time not breaking existing code. That has mostly been solved, I believe (although I may go back to the node-based authentication back-end as it's simply better).
The second point is something I can probably sort out short term by making sure that all services are running and functioning, and then simply allowing replays again. The core problem of the FFB Server locking up if there's a problem needs to be resolved as well somehow; adding a timeout for those requests seems reasonable. This is FFB Server code though, which is something I have limited experience with, but it needs to be done at some point.
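If I do go with a timeout, the change is conceptually small. Continuing the JMS-style sketch from earlier (the 30-second figure and the class are assumptions on my part, not the actual FFB code):

```java
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.TextMessage;

// Same JMS-style sketch as before, but with a bounded wait. If the downloader
// service is down, receive(timeout) returns null and we can fail gracefully
// instead of hanging the external service connector forever.
class ReplayRequesterWithTimeout {

    String fetchWithTimeout(MessageConsumer consumer) throws JMSException {
        Message reply = consumer.receive(30_000); // wait at most 30 seconds (arbitrary)
        if (reply == null) {
            // No downloader around; report an error to the user instead of waiting forever.
            throw new JMSException("Replay could not be downloaded in time");
        }
        return ((TextMessage) reply).getText();
    }
}
```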
The third point is the most complex one. The core gameplay aspect works fine at this point. However, there are some residual effects. The gamefinder and blackbox scheduler components are currently implemented in the old Java back-end. I very much do not want to make changes to that code base, because it's quite frankly a nightmare to work with. The code itself isn't great, and the infrastructure is really, really annoying to deal with. It's hard to test, hard to deploy, and hard to get working in the first place. The problem is that these two subsystems talk to the FFB Server to start matches between people. With the FFB Server having moved from one machine to another, these two subsystems of the site are trying to connect to an address where there is no FFB Server running. Clearly not a good situation.
This is something that needs to be fixed soon, which means I will either have to go back on my unwillingness to edit the Java back-end, or simply go ahead and reimplement the gamefinder and blackbox scheduler in my currently preferred back-end infrastructure. We'll see what I go with.
Until then, the gamefinder and blackbox scheduler work, but you will need to set up game names to connect to the games.
I'll take the weekend to think about it.