It's been far too long since I wrote one of these blog entries, so I figured I'd take some time and talk a bit about the recent downtime and the type of thing I deal with that isn't directly visible to all of you.
To be able to explain what I believe caused the downtime, I'll go through some basics of how the Internet works, and how you are able to access the site.
When you type an address into your web browser (such as https://fumbbl.com), your computer doesn't automatically know where to send the request for the webpage. The Internet doesn't address servers by name, but instead uses a number (an IP number, to be specific). In order for your computer to translate the name of the site you're trying to access into a number, it asks what's called a DNS (Domain Name System) server for the IP number that corresponds to the name. I'm sure that most of you know this part already.
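For the curious, here is a minimal Python sketch of that name-to-number step. It simply hands the question to the operating system, which in turn asks whatever DNS server the machine is configured to use. The function name lookup_ipv4 is my own invention for this example, and I've limited it to IPv4 addresses to keep it short.

```python
import socket


def lookup_ipv4(name: str) -> list[str]:
    """Ask the operating system's resolver for the IPv4 addresses of a name."""
    # getaddrinfo passes the question to the machine's configured DNS server,
    # which is the same path a web browser takes.
    results = socket.getaddrinfo(
        name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each result is (family, type, proto, canonname, sockaddr); for IPv4 the
    # sockaddr is (address, port). Duplicates are collapsed and sorted.
    return sorted({result[4][0] for result in results})
```

Calling lookup_ipv4("fumbbl.com") on a machine with Internet access should return whatever address the site currently has. If you pass in something that already is an IP number, such as "127.0.0.1", it comes straight back without any lookup being needed.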
Ok, so your web browser asks your computer to look up the IP number for a name. Your computer, in turn, asks its configured DNS server (often your router, which in turn asks the DNS server of your ISP) for the number. But, I hear you asking, how does the ISP's DNS server know the IP number? Eventually, it comes down to finding out who is authoritative for the domain. This is essentially stored in the domain registry record (you can do what's called a WHOIS lookup for a domain to see it) which, among other things, contains a list of "authoritative" name servers for the requested domain. These name servers hold the "DNS records" for the domain, including a "Start of Authority" (SOA) record and the records where, for example, "fumbbl.com" is assigned an actual IP number. The DNS system ends up caching (remembering) the IP numbers in order to speed up repeated requests.
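The chain described above can be sketched as a toy simulation in Python. Nothing here touches the network: the registry, the name servers and the address are all invented for illustration (example.com, the dns-host.example servers and 192.0.2.10 are placeholders, not FUMBBL's real setup), and a real DNS server does considerably more work than this.

```python
# All names and addresses below are invented for illustration.
REGISTRY = {
    # domain -> authoritative name servers listed in the registry record
    "example.com": ["ns1.dns-host.example", "ns2.dns-host.example"],
}

AUTHORITATIVE = {
    # name server -> the DNS records it holds
    "ns1.dns-host.example": {"example.com": "192.0.2.10"},
    "ns2.dns-host.example": {"example.com": "192.0.2.10"},
}


def isp_resolve(name, cache):
    """Play the part of the ISP's DNS server for one lookup."""
    if name in cache:                      # remembered from an earlier request
        return cache[name], "cache"
    for server in REGISTRY.get(name, []):  # who is authoritative for the name?
        records = AUTHORITATIVE.get(server, {})
        if name in records:                # ask that server for the record
            cache[name] = records[name]
            return records[name], server
    raise LookupError(f"no answer for {name}")


cache = {}
first = isp_resolve("example.com", cache)
second = isp_resolve("example.com", cache)
print(first)   # answered by an authoritative name server
print(second)  # answered from the cache
```

The first request has to go all the way to an authoritative name server, while the second one is answered from memory.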
Now, these DNS records are stored with a Time To Live (TTL), which essentially tells DNS servers how long they may keep a cached answer before they have to re-request the name. (Each record carries its own TTL; the SOA record for the domain holds a few related timers as well.)
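To make the TTL idea a bit more concrete, here is a small Python sketch of a cache that forgets its answers once their time is up. The class name TtlCache, the one-hour TTL and the address are assumptions made up for the example, and the clock is faked so that the expiry can be shown without actually waiting an hour.

```python
import time


class TtlCache:
    """Remember answers only until their time to live runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be demonstrated
        self._entries = {}    # name -> (address, expiry time)

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached address, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # too old: the name must be re-requested
            return None
        return address


now = [0.0]                          # a fake clock we can move by hand
cache = TtlCache(clock=lambda: now[0])
cache.put("example.com", "192.0.2.10", ttl_seconds=3600)

now[0] = 1800
fresh = cache.get("example.com")     # half an hour in: still cached
now[0] = 3600
stale = cache.get("example.com")     # an hour in: expired, nothing returned
```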
Enough tutoring about Internet infrastructure. What happened during the downtime?
Essentially, I believe that the authoritative name servers got disconnected from the Internet for long enough that the TTL of the cached records expired. When this happens, your ISP's DNS server will try to re-request the IP number (because it may have changed). Now, with the authoritative name servers not being accessible, the name lookup will simply fail, and your ISP's DNS server has no answer to give, so to you it looks as if the domain doesn't exist. Great.
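Here is a rough Python sketch of that failure mode, under the assumption that my theory about the cause is right. All the numbers are made up: a TTL of one hour, an outage window measured in seconds from an arbitrary starting point, and a placeholder address. The point is only to show why an outage can go unnoticed for a while and then suddenly break lookups.

```python
TTL = 3600                # seconds; a made-up value
OUTAGE = (1000, 20000)    # authoritative servers unreachable in this window


def authoritative_lookup(name, now):
    """Stand-in for asking the domain's authoritative name servers."""
    if OUTAGE[0] <= now < OUTAGE[1]:
        raise TimeoutError("authoritative name servers unreachable")
    return "192.0.2.10"


def isp_lookup(name, now, cache):
    """Answer from cache while the TTL lasts, otherwise ask upstream."""
    if name in cache and now < cache[name][1]:
        return cache[name][0]
    try:
        address = authoritative_lookup(name, now)
    except TimeoutError:
        return None       # the visitor sees a failed name lookup
    cache[name] = (address, now + TTL)
    return address


cache = {}
timeline = [(t, isp_lookup("example.com", t, cache)) for t in (0, 1800, 4000, 20000)]
for t, answer in timeline:
    print(t, answer)
```

At 1800 seconds the outage is already under way, but the cached answer hides it. At 4000 seconds the cached answer has expired and there is nobody to ask, so the lookup fails. Once the servers are reachable again (20000 seconds here), lookups start working again.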
Prior to this event, FUMBBL was using a service named "Zoneedit" for name server hosting. In the past, I (or rather Google's Webmaster Tools) noticed that there was intermittent loss of connectivity to the name servers and I ended up activating an extra name server (at a cost). This extended downtime made me take the decision to move name servers from Zoneedit to Amazon's Route 53 offering (part of their cloud computing platform). I also moved the domain registry from Zoneedit's partner (mydomain.com) to Amazon (not because of any problems, but mainly to simplify management and reduce cost; Amazon offers privacy protection for free, which mydomain.com didn't). Once the name server updates propagated out to the various DNS servers around the world, people could once again access the site.
If you now do a WHOIS lookup on fumbbl.com, you'll see that the name servers are various servers across the globe, all with some sort of "AWSDNS" tag (AWS being Amazon Web Services).
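If you'd rather do the WHOIS lookup from code than from a website, a minimal Python sketch could look like the one below. WHOIS is a very simple protocol: connect to port 43, send the domain name followed by a line break, and read until the server hangs up. I'm assuming here that whois.verisign-grs.com is the right server to ask for .com domains and that name servers show up on lines labelled "Name Server:"; other registries may use a different server and a different layout. Both function names are made up for this example.

```python
import socket


def whois_query(domain, server="whois.verisign-grs.com", port=43):
    """Send a plain WHOIS query and return the raw text response."""
    with socket.create_connection((server, port), timeout=10) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def name_servers(whois_text):
    """Pick out the values of all 'Name Server:' lines, in lower case."""
    servers = []
    for line in whois_text.splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "name server" and value.strip():
            servers.append(value.strip().lower())
    return servers
```

With Internet access, name_servers(whois_query("fumbbl.com")) should list the name servers currently registered for the domain.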
I'm pretty confident that Amazon has a relatively high level of attention on their infrastructure and that this particular problem will not happen again with them handling the servers.