I am writing this partly to get it off my chest, and partly because I think there is a lot to be learned from incidents like the one I, as an old hand, have just experienced here.
Imagine: you are solely responsible for a complex piece of software. You have to introduce essential new features into the system, and you have rolled them out gradually via "feature switches". Many users are already on the new features. Now comes the time to start switching off the old features, so that all users are more or less forced onto the new ones. This is probably the most difficult moment of a product launch: there is no easy way back. And at exactly this moment, what every developer and system administrator dreads happens: the system dies, with complete data loss spanning several days.
This is exactly what happened with the launch of UlangoTV 2.0.
The whole story is so exemplary and classic that I think many technically minded readers will be interested in how it could come to this, and how one can extricate oneself from such a catastrophe without lasting damage.
A fatal crash with massive data loss – how can it happen?
Most web and app servers run on Linux, because over the years it has proven to be particularly stable, secure, powerful and inexpensive – ours included. Many administrators are proud that their systems have run for six months or more without a reboot, simply because they are that stable. Backups are made regularly – today mostly by the hosting providers, via snapshots at the lowest level of the raw partitions. In addition, regular database backups are made with the DB tools themselves, which guarantee the necessary transaction consistency.
The DB backups become a problem of their own once the databases grow too large – over 50 GB in our case. Importing that much data takes a long time and leads to considerable downtime in the event of a disaster. The only way to achieve high availability is to introduce redundancy. How that got us back on our feet relatively quickly is described below.
Now to the crash in more detail. During normal operation, under increasing system load, some processes suddenly stop finishing their work in time. The load swells, the system begins to swap, until practically everything hangs. You quickly identify the causes of the load and switch services off. But suddenly even that no longer helps, because the system is evidently deadlocked (A waits for B, B waits for C, C waits for A – you know what I mean). Very bad if the blockage evidently sits at the bottom, in the filesystem. Now the moment has come when only a reboot can help. And then it happens: the system cannot be rebooted, because there are inconsistencies in the filesystem. Usually this is not too bad, since a modern filesystem carries enough redundancy to repair itself. Unfortunately, that was not possible in our case either. No chance of bringing the system back to life. So a backup had to be restored – from the previous day, a few hours back. After an hour we realize that the filesystem in this backup is already too broken to be usable. One more backup back – another hour – doesn't work either. Now it is slowly getting critical. In the meantime, we prepare a new system into which we can upload our backup – uploading the compressed data: 22 hours!!
So one more backup back – the weekly backup, five days old. Hurray – it works. I decide to forgo importing a newer DB dump and to cope with the loss of five days of data somehow. Phew.
Restoring data from redundant sources
When it comes to recovering from a data loss as quickly as possible, all kinds of sources can help. This is where the strengths of Ruby on Rails – the web framework we use – came into play: ad-hoc recovery scripts could be written very quickly. In our case we had three sources: 1. central log files, which were redundantly stored on other servers; 2. our external order data at PayPal; and 3. our Riak-based key-value store for channels and streams.
From the log files it was relatively easy to restore the user data – apart from the passwords, of course. With an ad-hoc script the data were imported, and the users were notified by e-mail to reset their passwords.
Restoring the order data proved much more difficult, since the order records themselves were also lost and the payments could not always be matched to users.
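The matching step can be sketched as follows – assuming, purely for illustration, that the exported PayPal transactions carry a payer e-mail that can be compared against the restored user accounts (real PayPal exports contain many more fields, and not every payer e-mail matches an account, which is exactly the difficulty described above):

```ruby
# Hedged sketch: split exported transactions into those we can attribute
# to a restored user (by payer e-mail) and those needing manual review.
# Field names (:payer_email, :id) are illustrative assumptions.
def match_payments(transactions, users_by_email)
  matched, unmatched = transactions.partition do |t|
    users_by_email.key?(t[:payer_email])
  end
  attributed = matched.map do |t|
    t.merge(user_id: users_by_email[t[:payer_email]][:id])
  end
  [attributed, unmatched]
end
```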
Finally, restoring our central database of streams and channels was relatively easy. We had migrated this data some time ago to a so-called key-value store (Riak), in particular to distribute the request load – and with it the redundancy gained – over several servers. This is the key to a virtually unlimited, scalable system: a technology used today in all large systems, and first deployed at scale by Amazon (Dynamo).
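The core idea behind such Dynamo-style stores – and the reason the channel data survived the crash – is that every key, together with several replicas of its value, is mapped onto distinct nodes of a hash ring. The following toy sketch illustrates the principle; Riak's actual ring works with a fixed number of partitions ("vnodes") and is far more sophisticated:

```ruby
require 'digest'

# Toy consistent-hash ring: each node appears at many points on the ring,
# and a key's replicas go to the first `replicas` distinct nodes found
# clockwise from the key's hash position. Illustration only, not Riak's
# real implementation.
class Ring
  def initialize(nodes, replicas: 3)
    @replicas = replicas
    @ring = nodes.flat_map { |n| (0...64).map { |i| [hash_of("#{n}-#{i}"), n] } }
                 .sort_by(&:first)
  end

  # Preference list: the nodes responsible for storing this key.
  def nodes_for(key)
    h = hash_of(key)
    start = @ring.index { |pos, _| pos >= h } || 0  # wrap around the ring
    @ring.rotate(start).map(&:last).uniq.first(@replicas)
  end

  private

  # 32-bit position on the ring derived from SHA-1.
  def hash_of(s)
    Digest::SHA1.hexdigest(s)[0, 8].to_i(16)
  end
end
```

Because each value lives on several nodes, losing one machine – or, as in our case, the main server – does not lose the data: any surviving replica can serve it.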
For us, the main conclusions are the following:
- Reboot the system more frequently, to detect "creeping" data corruption early – before it migrates into the backups.
- Store the DB backups close to the server, so they can be restored as quickly as possible.
- Improve centralized logging (syslog daemon).
- Move even more data – especially user and order data – into distributed KV stores, to minimize "single point of failure" situations.
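For the centralized-logging point, the classic approach is to forward every local syslog message to a dedicated log host. A minimal rsyslog forwarding fragment might look like this (the host name `loghost` is an assumption; `@@` means TCP, a single `@` would mean UDP):

```
# /etc/rsyslog.d/50-forward.conf -- sketch, adjust host/port to taste
*.*  @@loghost:514
```

That way the logs survive even when the originating server's filesystem does not – which is exactly what made our user-data recovery possible.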
So – now that I have gotten it off my chest, I feel much better and am ready to tackle new challenges at UlangoTV!