Brian Harry has written a couple of very interesting posts (post 1 and post 2) on the recent outages of the VSTS service. Whether you use VSTS or not, they make interesting reading for anyone involved in running SaaS-based systems, or anything at scale.
The obvious lesson from the posts is that you cannot overestimate the importance of
- in-production monitoring
- having a response plan
- doing a proper root cause analysis
- and putting steps in place to stop the problem happening again
Well worth a read.