
This week at TrialGrid (Mar 3, 2017)
This week Amazon had problems with its Simple Storage Service (S3), which affected a number of prominent services, including the Q&A site Quora and the popular team-communication system Slack.
For full details of what happened, you can read Amazon's disclosure. In summary, a user mistyped a command and took a large part of the infrastructure offline. The system worked as designed, but a typo told it to do the wrong thing.
Ironically, the icons that show the current system status on the S3 Status page were themselves served by the part of the S3 system that went down, so they could not be updated.
Lessons for TrialGrid
We know that nothing on the Internet is going to be 100% reliable, but until this mistake Amazon S3 had a very good uptime and stability record. It was, and continues to be, a trusted part of many companies' infrastructure, including ours.
Anything that can happen will happen, but we want to engineer TrialGrid to be as resilient as possible and to be transparent about our own system uptime. To that end, we're adding to our website a direct link to our public system status page at http://status.trialgrid.io
Learning from Amazon, we don't host this page ourselves; it's managed by a third-party provider, Pingdom.
Subsystem monitoring
We have been using Pingdom to monitor availability since our very first "hello world" push to our beta site. However, this monitoring was only skin-deep: it checked that the website was responsive and nothing more.
Over the last few days we have been working on monitoring of critical sub-systems: database access, cache system and S3 upload to ensure they are also responding.
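To illustrate the idea, here is a minimal sketch of a subsystem health check in Python. It is not our actual implementation: the function names (run_checks, check_database, check_cache, check_s3_upload) are hypothetical, and the three checks are stand-ins for a trivial database query, a cache set/get round trip and a small S3 upload. The assumption is that an external HTTP monitor polls an endpoint that returns this status code and body.

```python
import json
import time
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, str]:
    """Run each subsystem check and return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()  # a check signals failure by raising an exception
            results[name] = {"ok": True}
        except Exception as exc:
            results[name] = {"ok": False, "error": str(exc)}
        # Record how long the check took, in milliseconds.
        results[name]["ms"] = round((time.monotonic() - started) * 1000, 1)

    all_ok = all(result["ok"] for result in results.values())
    # Returning 503 when any subsystem fails lets an HTTP monitor
    # report the whole service as degraded, not just the web front end.
    return (200 if all_ok else 503), json.dumps(results, sort_keys=True)


# Hypothetical stand-ins for real subsystem checks.
def check_database() -> None:
    pass  # e.g. run a trivial query


def check_cache() -> None:
    pass  # e.g. set a key and read it back


def check_s3_upload() -> None:
    # Simulates the kind of failure seen during an S3 outage.
    raise ConnectionError("S3 upload timed out")


status, body = run_checks({
    "database": check_database,
    "cache": check_cache,
    "s3_upload": check_s3_upload,
})
```

With the simulated S3 failure above, status is 503 and the JSON body marks s3_upload as failed while database and cache are reported as healthy. A front-page-only check would have returned success in the same situation.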
TrialGrid was affected by the Amazon issue on the 28th Feb, but unfortunately our public monitoring did not show this. As of today it will, along with the status of our other subsystems.
Though we are now working toward a validated release, TrialGrid is still in Beta and our status page will show occasional downtime, sometimes for an hour or more while we make infrastructure changes. Having this level of transparency is as important to us as we know it is for our customers.