On February 1, 2017, it was reported that a tired system administrator had accidentally deleted the production database of GitLab.com, a popular source code hosting site that holds the source code, version history, and associated metadata of many companies and organizations. Restoring the database from backups failed as well: apparently, five different backup methods had all failed, each for a different reason. What does this teach us?
Anyone who has done live maintenance on a critical production system during a long late-night session can sympathize, and I'm sure we are all bleeding inside on their behalf. The authoritative reference for the entire incident is GitLab's own live Twitter feed, where they have, with full transparency, been posting updates on the progress of the attempted restoration. The Register, among others, has also reported on the incident.
Personally, I am relieved that my company hosts neither our own code nor that of our customers on GitLab.com. If we did (and I know that MANY do), this could have been a very disastrous day indeed. Even if it ultimately turns out that at least the code is not lost forever (we hope), I can only imagine how I would have felt if a customer delivery had been scheduled for today and the code hosting provider had suddenly melted down. I would surely have said, without hesitation, that this is the time to change our ways.
So is it? Even if we are not using GitLab.com hosting, we do use other services, which may just as suddenly fail like this. What should we learn from this? Off the top of my head, here are some possible takeaways we COULD choose to extract. But should we?
Don't rely on the services of others?
Like I said, my company does not use GitLab.com. We do, however, use GitLab software, on our own instance that we administer ourselves. Of course, we may accidentally mess it up just as badly as they did, but at least then we will have only ourselves to blame, in full control of our very own disaster. Or would it be better to still use the service, but not come to depend on it alone?
Five layers of backup are not enough. Should we have six? Or more?
Seriously, if five alternative backup methods didn't work, will adding one or two more help? Wouldn't they fail too? Perhaps my disaster recovery friends would have a word or two to say about testing in advance whether the backup solutions actually work at all. Indeed, perhaps even a single backup solution would be enough, as long as you make sure that it actually works.
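The "make sure it actually works" point can be sketched as a small automated check. This is a minimal, hypothetical example (file names and thresholds are made up, and the "dump" here is a fake file, not a real database): after every backup run, verify that the dump exists, is fresh, is non-empty, and at least decompresses cleanly.

```python
import gzip
import os
import tempfile
import time

def verify_backup(path, max_age_seconds=86400):
    """Sanity-check one backup file: it exists, is recent, is not
    trivially empty, and (for gzip dumps) decompresses without error.
    Returns a list of failure messages; an empty list means it passed."""
    if not os.path.exists(path):
        return ["backup missing: " + path]
    failures = []
    stat = os.stat(path)
    if stat.st_size == 0:
        failures.append("backup is zero bytes")
    if time.time() - stat.st_mtime > max_age_seconds:
        failures.append("backup is older than the allowed window")
    try:
        with gzip.open(path, "rb") as f:
            if not f.read(1):
                failures.append("backup decompresses to empty data")
    except (OSError, EOFError) as e:
        failures.append("backup is not readable gzip: %s" % e)
    return failures

# Self-contained demo with a fake dump file (no real database needed).
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "db-dump.sql.gz")
    with gzip.open(good, "wb") as f:
        f.write(b"-- pretend SQL dump\n")
    print(verify_backup(good))  # → []

    broken = os.path.join(d, "empty-dump.sql.gz")
    open(broken, "wb").close()  # simulates a backup job that wrote nothing
    print(verify_backup(broken))  # non-empty list of failure messages
```

A check like this only catches the cheapest failure modes. The stronger test, which is what the incident really argues for, is periodically restoring the backup into a scratch database and querying it.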
Should we create less complicated systems?
The first time I installed a private GitLab instance, I noted what a complicated monster it was. The installation was reduced to a single script that you run, but I watched that script pull in and configure a gazillion different components, and it made me nervous just watching it work. Then I read The Register's description of the live setup of the service, which mentions Postgres 9.2 and 9.6, an Azure NFS server, an Azure DB server, S3 backups, replication setups, and so on. Knowing that all of this glues together a large number of independently developed open source programs patched together with scripts, one wonders whether it is time to go back to the drawing board and design a single-executable web service that can be scaled and configured in a more manageable way.
Just move to a different (more reliable) provider?
Whether or not you have (or had) your code on GitLab.com, once the dust settles, where will you put that code? You may choose to bite the bullet and, for the sake of quality, pay for Visual Studio Team Services. But can you be confident that Microsoft's engineers are better than GitLab's, or might they, too, sometimes get sleepy and make mistakes?
Don't let fallible human beings administer live servers?
Or should this be considered exhibit A in the case for deploying software robotics in place of human system administrators, letting artificial intelligence programs reliably make sure our servers keep running properly?