Earlier this month a high-profile startup, GitLab.com, suffered a major outage and data loss.
Human error triggered the incident, but to err is human: mistakes happen. That is why we need robust procedures to ensure that data is securely backed up and that backups can be restored. GitLab unfortunately did not have such procedures in place. At TrialGrid we have implemented fully automatic backups, including verification that every backup can be successfully restored.
TrialGrid backup process
TrialGrid runs on the widely used PostgreSQL relational database, which provides strong features for ensuring data integrity. Our backup process is started by an automatic scheduler, which triggers this series of actions:
Step 1: Create a backup
The first step in the automated backup pipeline is to create a backup of the live database using standard PostgreSQL backup utilities. This creates a file containing all of the data and schema information from the live database.
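As a minimal sketch of this step, assuming the standard utility in question is `pg_dump` and that a custom-format archive is wanted, the backup could be driven from Python as follows. The function names, the file naming scheme and the `/tmp` default are illustrative assumptions, not TrialGrid's actual code.

```python
import subprocess
from datetime import date


def build_dump_command(dbname: str, outfile: str) -> list[str]:
    """Build a pg_dump command line that writes a custom-format archive."""
    return [
        "pg_dump",
        "--format=custom",  # archive format that pg_restore can read
        f"--file={outfile}",
        f"--dbname={dbname}",
    ]


def create_backup(dbname: str, backup_dir: str = "/tmp") -> str:
    """Run pg_dump and return the path of the backup file it wrote."""
    # Hypothetical naming scheme: <database>-<ISO date>.dump
    outfile = f"{backup_dir}/{dbname}-{date.today().isoformat()}.dump"
    # check=True raises CalledProcessError if pg_dump exits non-zero,
    # so a failed dump stops the pipeline instead of passing silently.
    subprocess.run(build_dump_command(dbname, outfile), check=True)
    return outfile
```

Keeping the command construction separate from the `subprocess` call makes the exact flags easy to inspect without a database being available.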
Step 2: Copy file to Amazon S3
Having created the backup file, we need to store it, encrypted, in a secure location with strong guarantees that it can be retrieved when we need it. Amazon's S3 service provides these features, so we keep each daily backup of the database in S3.
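One way the upload might look, assuming the `boto3` AWS SDK and S3-managed server-side encryption, is sketched below. The bucket name, key layout and helper names are hypothetical; credentials are assumed to come from the environment.

```python
from datetime import date


def backup_key(dbname: str, day: date, prefix: str = "backups") -> str:
    """Return the S3 object key for one daily backup (hypothetical layout)."""
    return f"{prefix}/{dbname}/{day.isoformat()}.dump"


def upload_backup(path: str, bucket: str, key: str) -> None:
    """Upload a backup file to S3 with server-side encryption enabled."""
    # boto3 is a third-party package; it is imported here so that the
    # key helper above can be used without it installed.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        path,
        bucket,
        key,
        ExtraArgs={"ServerSideEncryption": "AES256"},  # encrypt at rest
    )
```

A call such as `upload_backup(path, "example-backup-bucket", backup_key("trialgrid", date.today()))` would then store one object per day under a predictable key.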
Step 3: Restore the backup from S3
Once we have the backup file stored in S3, we need to verify that we can use it to restore the database. If, for example, the backup file had only partially been transferred to S3, we would not be able to restore from it, so we need to find out immediately whether it's a viable backup or not. Again we use standard PostgreSQL utilities to restore the backup file into a temporary database.
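The restore check could be sketched like this, assuming the file has already been downloaded from S3 to a local path and that the utilities are `dropdb`, `createdb` and `pg_restore`. The temporary database name `restore_check` and the function names are assumptions made for illustration.

```python
import subprocess


def build_restore_commands(backup_file: str, temp_db: str) -> list[list[str]]:
    """Return the commands that rebuild a scratch database from a backup."""
    return [
        ["dropdb", "--if-exists", temp_db],  # clear any leftover scratch database
        ["createdb", temp_db],
        [
            "pg_restore",
            "--exit-on-error",  # stop at the first error rather than continuing
            "--no-owner",       # do not try to restore original object ownership
            f"--dbname={temp_db}",
            backup_file,
        ],
    ]


def restore_backup(backup_file: str, temp_db: str = "restore_check") -> None:
    """Restore the backup into a temporary database, raising on any failure."""
    for command in build_restore_commands(backup_file, temp_db):
        # A truncated or corrupt archive makes pg_restore exit non-zero,
        # which check=True turns into a CalledProcessError.
        subprocess.run(command, check=True)
```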
Step 4: Run queries
If the restore process is successful, we verify the restored database further by running some queries against TrialGrid database tables. If these queries run successfully, we know we have a viable backup file which can be restored.
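The post does not say which queries are run, so the following is only a sketch of one plausible check: count the rows in a few key tables and fail if any table is empty. It accepts any Python DB-API connection; for PostgreSQL that would typically come from a driver such as `psycopg2`. The table names and the "empty means failure" rule are assumptions.

```python
import sqlite3  # used only for the demonstration at the bottom


def verify_tables(conn, tables):
    """Return {table: row_count}; raise if a table is empty.

    A missing table surfaces as the database driver's own error.
    """
    counts = {}
    cursor = conn.cursor()
    for table in tables:
        # Table names cannot be passed as query parameters, so accept
        # only plain identifiers before interpolating them into SQL.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        (count,) = cursor.fetchone()
        if count == 0:
            raise RuntimeError(f"table {table} is empty after restore")
        counts[table] = count
    return counts


if __name__ == "__main__":
    # Demonstration against an in-memory SQLite database standing in for
    # the restored PostgreSQL database; "studies" is a hypothetical table.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE studies (id INTEGER)")
    demo.execute("INSERT INTO studies VALUES (1)")
    print(verify_tables(demo, ["studies"]))
```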
Step 5: Post results
Finally, we need to know whether or not the backup pipeline has completed successfully. We do this in two ways. First, the backup pipeline posts the result to a chat room to which everyone at TrialGrid has access.
This gives us quick notification of the backup results, but what if the backup pipeline has failed and no chat message has been posted? We might notice that there isn't a message for today, but to make sure that we know the backup process has run and been successful we have a second notification step, using the DeadMansSnitch service.
DeadMansSnitch is configured to expect a notification from our backup pipeline once per day. This notification is only sent if all steps have been successful, so if there's been a silent failure in the pipeline, no notification will be sent and DeadMansSnitch will alert us by sending emails.
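The two notification paths can be sketched as a small pipeline runner: every outcome is posted to chat, while the heartbeat is sent only when every step has passed. This is an illustration under assumptions, not TrialGrid's code: the chat webhook URL, its JSON payload shape and the check-in URL are placeholders supplied by whichever services are used.

```python
import json
import urllib.request


def post_to_chat(webhook_url: str, text: str) -> None:
    """POST a message to a chat webhook (the payload shape is an assumption)."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)


def check_in(snitch_url: str) -> None:
    """Send the daily heartbeat by requesting the monitor's check-in URL."""
    urllib.request.urlopen(snitch_url, timeout=10)


def run_pipeline(steps, notify, heartbeat) -> bool:
    """Run (name, callable) steps in order and report the outcome.

    notify(text) is called on success and on failure; heartbeat() is
    called only after every step and the success message have gone through.
    """
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"Backup FAILED at step '{name}': {exc}")
            return False  # no heartbeat, so the monitor raises the alarm
    notify("Backup succeeded: all steps passed")
    heartbeat()
    return True
```

Because the heartbeat is the last action, a crash, a hung step or a failed chat post all lead to the same result: no check-in, and an alert from the monitoring service.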
With this automated backup and verification pipeline we take database backups, store them securely and verify that they are valid. All of this happens automatically (no human error!) and we're alerted if anything goes wrong.