I got some mail yesterday from Google about their recent Google Apps service outage. Here it is, along with my editorial comments.
We’re committed to making Google Apps Premier Edition a service on which your organization can depend. During the first half of August, we didn’t do this as well as we should have. We had three outages – on August 6, August 11, and August 15. The August 11 outage was experienced by nearly all Google Apps Premier users while the August 6 and 15 outages were minor and affected a very small number of Google Apps Premier users. As is typical of things associated with Google, these outages were the subject of much public commentary.
Well-deserved public commentary, at that, mostly focused on the question of why Google thinks that Google Apps is an enterprise-grade service. Three outages in a nine-day period is not confidence-building.
Through this note, we want to assure you that system reliability is a top priority at Google. When outages occur, Google engineers around the world are immediately mobilized to resolve the issue. We made mistakes in August, and we’re sorry. While we’re passionate about excellence, we can’t promise you a future that’s completely free of system interruptions. Instead, we promise you rapid resolution of any production problem; and more importantly, we promise you focused discipline on preventing recurrence of the same problem.
Notice what’s missing here: any commitment to a particular level of availability, or any information about the cause of the outage, or any information about how they applied “focused discipline” to keep it from happening again.
Given the production incidents that occurred in August, we’ll be extending the full SLA credit to all Google Apps Premier customers for the month of August, which represents a 15-day extension of your service. SLA credits will be applied to the new service term for accounts with a renewal order pending. This credit will be applied to your account automatically so there’s no action needed on your part.
So let me get this straight: in exchange for three days of outages (in fairness, not three complete outages), you’re going to give me a credit for $25/user. That’s not a bad start, but I daresay for most Google Apps customers it’s only a small fraction of their lost productivity. Not to mention that I might not want a service credit in the first place.
We’ve also heard your guidance around the need for better communication when outages occur. Here are three things that we’re doing to make things better:
We’re building a dashboard to provide you with system status information. This dashboard, which we aim to make available in a few months, will enable us to share the following information during an outage:
- A description of the problem, with emphasis on user impact. Our belief is during the course of an outage, we should be singularly focused on solving the problem. Solving production problems involves an investigative process that’s iterative. Until the problem is solved, we don’t have accurate information around root cause, much less corrective action, that will be particularly useful to you. Given this practical reality, we believe that informing you that a problem exists and assuring you that we’re working on resolving it is the useful thing to do.
- A continuously updated estimated time-to-resolution. Many of you have told us that it’s important to let you know when the problem will be solved. Once again, the answer is not always immediately known. In this case, we’ll provide regular updates to you as we progress through the troubleshooting process.
Positive steps, but note that there’s no definite delivery date. Note also the weasel language around how “assuring you” is the useful thing to do. No, fixing the problem is the useful thing to do, followed closely by timely and informative status reports. Just look at what Twitter does, then do the opposite. (Actually, for a decent model, check out how the Xbox Live service folks handle outages.)
In cases where your business requires more detailed information, we’ll provide a formal incident report within 48 hours of problem resolution. This incident report will contain the following information:
- business description of the problem, with emphasis on user impact;
- technical description of the problem, with emphasis on root cause;
- actions taken to solve the problem;
- actions taken or to be taken to prevent recurrence of the problem;
- time line of the outage.
This is more like it! However, my business always requires this detailed information. Who says so? I do. I’m betting that Google will closely control this information, and that they will only provide it if they think your business requires such information.
In cases where your business requires an in-depth dialogue about the outage, we’ll support your internal communication process through participation in post-mortem calls with you and your management team.
Translated: “if you take heat for our outages, we’ll be happy to get on the phone and help spin the problem so we don’t lose your account.”
Once again, thanks for your continued support and understanding.
Sincerely, The Google Apps Team