When looking for an explanation, it’s best to turn to the source. The old adage, “It’s not that you have a problem but rather how you handle it that is most important,” applies here, in a situation that Google would surely rather not repeat. Here are some words from the official Gmail blog:
Gmail’s web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service. Thus, right up front, I’d like to apologize to all of you — today’s outage was a Big Deal, and we’re treating it as such. We’ve already thoroughly investigated what happened, and we’re currently compiling a list of things we intend to fix or improve as a result of the investigation.
The blog then goes on to explain the 5 W’s of the situation in layman’s terms and, in my opinion, provides an appropriate mea culpa while showing that work is under way to ensure this does not happen again to the same degree. What was most interesting was the recognition that the architecture in place at the time of the failure caused a shutdown rather than a slowdown, and that Gmail is opting for slow service over no service in the future. Good choice.
What’s next: We’ve turned our full attention to helping ensure this kind of event doesn’t happen again. Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We’ll be hard at work over the next few weeks implementing these and other Gmail reliability improvements — Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.
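To make the “slower instead of refusing traffic” idea concrete, here is a toy sketch of the two behaviors. It is only an illustration under my own assumptions, not Google’s actual design: the function names, the capacity numbers, and the even split of load across routers are all hypothetical.

```python
def cascade_on_overload(total_load, capacities):
    """Hypothetical model: overloaded routers refuse traffic and shift their load.

    Returns how many routers are still serving once things settle.
    """
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)  # assume load is split evenly
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            return len(alive)  # stable: every remaining router can carry its share
        alive = survivors  # overloaded routers drop out, load moves to the rest
    return 0  # every router has dropped out: a full outage


def slowdown_on_overload(total_load, capacities):
    """Hypothetical model: routers keep accepting traffic and just get slower.

    Returns a slowdown factor per router (1.0 means normal speed).
    """
    share = total_load / len(capacities)
    return [max(1.0, share / c) for c in capacities]


# Four routers, one with less headroom than the others (made-up numbers).
capacities = [100, 100, 100, 80]

print(cascade_on_overload(360, capacities))   # one weak router takes all four down
print(slowdown_on_overload(360, capacities))  # everyone stays up, one runs slower
```

In this made-up scenario a single under-provisioned router is enough to knock out the whole group when load is shifted around, while the “just get slower” policy keeps all four serving with one of them running a bit behind. That is the trade of slow service over no service in miniature.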
For something of this magnitude, I give Google a decent grade for being transparent enough to say ‘Yup, we’re not perfect’ while working to get it right for the future. Today will be a great day for all of the Google haters out there. I, on the other hand, have decided that since I am far from perfect myself, expecting perfection from others is, well, a waste of time. Does that mean I will welcome future outages with open arms? Of course not. Based on what I have seen here, though, I suspect that Google won’t either.