Running a SaaS service at scale

Brian Harry has done a couple of very interesting posts (post 1 and post 2) on the recent outages of the VSTS service. Whether you use VSTS or not they make interesting reading for anyone who is involved in running SaaS based systems, or anything at scale.

From the posts the obvious reading is you cannot under estimate the importance of

in production montoring
having an response plan
doing a proper root cause analysis
and putting steps in place to stop the problem happening again

Well worth a read