Comparing Service Levels – Do Google and Microsoft really differ?
Recently there have been outages at several cloud services, so I decided to look at the details of the service level agreements (SLAs) for two popular services: Gmail, or Google Mail (in Germany “Gmail” is a protected term under trademark law), and Exchange Online. The first major difference is how easily and quickly you can find the agreements. On the Microsoft site I got slightly lost in descriptions of buying plans, PURs (product use rights) and several kinds of licensing terms that sounded much more like software than service. Finally I found the SLA documents. Google forced me to search right from the beginning but, guess what, it worked. After that experience, and to give Microsoft a fair chance, I opened Bing and searched for their SLAs as well. The result was somewhat disappointing: I got an outdated version and forum entries from people failing to find the Microsoft SLAs on the web.
It raises the question of whether the SLAs are somewhat secret. Second, right at the top of my mental list of questions is whether Microsoft is limiting itself by marrying the new world (cloud computing) with old-world licensing agreements. At least from an agreement perspective, a clean cut would have been easier for everybody, inside and outside of Microsoft, to understand.
Links to SLAs used for this comparison:
What is an SLA?
There are different kinds of SLAs. The major ones you have probably heard about are availability and performance SLAs. An SLA is a method of putting a measure on a service so that you can either quantify or qualify the service you get. You might wonder why I mention qualify, as SLAs are mostly about quantification. There is a reason for that: when you go beyond measuring your service and look at its impact, you start your journey into quality. If you are interested in learning more, I recommend this book: Sense and Respond: The Journey to Customer Purpose
Back to SLAs: apart from being a measurement method, the other key component of an SLA is the consequence, that is, what happens if the provider fails to deliver according to the promise (also called contract or agreement). Again the range is wide, from financial penalties to rights to cancel the contract and many more kinds of penalties. If you were heading for a custom agreement, your creativity could be put to use here. But with cloud services, and especially SaaS offerings, there is no such creativity involved. It is more or less take it or leave it, which makes it all the more important to really understand the service level agreements.
In addition you will probably be discussing KPIs (key performance indicators) with your service provider. These are more or less comparable to SLAs, with the one difference that there are no penalties associated with them.
I will compare the agreements in different categories. Be aware that some of these categories are based on my subjective interpretation. You should always read and understand the agreements yourself and decide based on your business needs. So much for disclaimers; let us head into the details.
The first thing you notice is that the Google SLA is a one-pager whereas the Microsoft SLA has 6 pages. I am not judging whether that is good or bad, but it makes for a different read. Where Microsoft starts to score is that it opens with a notice about the last update to the agreement, so there is a chance to identify the version. There is no such thing in the Google SLA. Given that these agreements are usually concluded online, this really raises the question of how you will be able to determine which version of the SLA was current when you signed. Major advice: print out your contract documents and keep them for the length of the contract.
In addition to stating the last change date Microsoft also confirms how SLA changes are handled during the term of the contract and at renewal. There is no such information in the Google SLA.
In both cases, Microsoft and Google, penalties only apply if the customer actively claims them. This is an important distinction from hosting or outsourcing, where constant monitoring and active SLA reporting are in place, so that your SLA always counts and not only when you notice a breach.
There is a major difference in the kind of penalty Google and Microsoft apply. With Google you’ll get days of service added at the end of the term at no charge. The SLA does mention “or monetary credit equal to the value of days of service for monthly postpay billing customers”, but it lacks details on how this is processed and how these credits are calculated. Microsoft, on the other hand, not only gives its customers a financial service credit but also describes the process of claiming, checking and payout in clear detail. That makes for more reading material, for sure, but your legal department will be much happier knowing the terms. So in this case longer really is better, and we have a first indicator of why the Microsoft agreement is 6 times longer.
Let’s look at the different SLAs now
Let me start off with a positive note: both service providers measure availability on a monthly basis. In case you wonder why that is important, let me explain. The difference is the longest continuous outage that still stays within the agreement. If you measure monthly, that outage can be at most the downtime allowance for one month. If you measure annually, it can be 12 times that. So the business impact in the worst case is 12 smaller downtime windows vs. one huge one. In general the business impact of the smaller ones is much lower.
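To put rough numbers on this, here is a minimal sketch. The 99.9% target and the 30-day month are assumptions chosen purely for illustration, and the function name is made up for this example; neither SLA is being quoted here.

```python
def downtime_budget_minutes(availability_pct: float, period_minutes: int) -> float:
    """Minutes of downtime a provider can accumulate in a period
    while still meeting the availability target."""
    return period_minutes * (1 - availability_pct / 100)

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200, assuming a 30-day month
TARGET = 99.9                      # hypothetical target, for illustration only

monthly = downtime_budget_minutes(TARGET, MINUTES_PER_MONTH)
annual = downtime_budget_minutes(TARGET, 12 * MINUTES_PER_MONTH)

# With monthly measurement the longest single outage that still meets the
# target is the monthly budget; with annual measurement it is 12 times that.
print(f"monthly: {monthly:.1f} min, annual: {annual:.1f} min")
# prints: monthly: 43.2 min, annual: 518.4 min
```

Under these assumptions, a single outage of more than eight hours could still pass an annual measurement, while a monthly measurement would be breached after about 43 minutes.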
So how do Google and Microsoft actually calculate downtime?
In the Google case it says:
““Downtime” means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate”
Sounds good, but the agreement fails to explain what a five percent user error rate is. My guess is that if more than 5% of your users are unable to use the service, you might have a downtime. So with 100 users under service, 6 users offline would count as a downtime, and that is your trigger for claiming a penalty. The measurement itself, though, is based not on your users’ downtime but on the “server side error rate”. Again we are forced to guess, as the server side error rate is not explained. As this error rate is a main component of the uptime percentage calculation, we are more or less in the hands of Google: there is no way to understand or even check it. Google is also in a position to change its definition without us even noticing.
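To make that guess concrete, here is a minimal sketch of my reading of the five percent rule. It treats the “user error rate” as the share of users who cannot use the service, which is my interpretation and not something the Google SLA states; the function name `is_downtime` is made up for this example.

```python
def is_downtime(affected_users: int, total_users: int, threshold_pct: float = 5.0) -> bool:
    """Return True if strictly more than threshold_pct of users are affected.

    Assumption: "user error rate" means the share of affected users.
    The SLA itself does not define the term.
    """
    # Compare without dividing, to avoid floating point rounding at the boundary.
    return affected_users * 100 > threshold_pct * total_users

print(is_downtime(6, 100))  # True: 6% is more than five percent
print(is_downtime(5, 100))  # False: exactly 5% is not "more than" five percent
```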
The way Microsoft defines downtime is different.
““Downtime” means the total minutes in a month during which the aspects of a Service specified in the following table are unavailable, multiplied by the number of affected users, excluding Scheduled Downtime; and unavailability of a Service due to limitations described … below.”
From the table mentioned above: Exchange Online – any period of time when end users are unable to send or receive email with Outlook Web Access
So let’s stick to the example above: you have a hundred users, and the service is offline for 20 minutes for 6 users. That would result in a calculated downtime of 2 hours (20 minutes times 6 users; this is a component of the uptime percentage calculation, so keep it in mind). Please note that Microsoft excludes scheduled downtime (e.g. for upgrades), whereas there is no such thing in Google’s SLA. This is based on a major difference in the technical concept. It is important to make clear that Microsoft does not limit the hours of scheduled downtime, which would allow them, at least theoretically, to have as much downtime as they want as long as they announce it in time.
In both cases, Google and Microsoft, there are exclusions for force majeure. I have not dug into that area, but any customer should have this checked by their legal department or advisor.
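The Microsoft definition can be sketched as a sum of outage minutes times affected users, skipping scheduled downtime. The `Outage` type and the function name are hypothetical helpers for illustration, and the other exclusions mentioned in the SLA are not modeled.

```python
from typing import NamedTuple

class Outage(NamedTuple):
    minutes: int             # length of the outage
    affected_users: int      # users unable to send or receive email
    scheduled: bool = False  # announced maintenance, excluded by the SLA

def downtime_user_minutes(outages):
    """Sum minutes x affected users over all unscheduled outages."""
    return sum(o.minutes * o.affected_users for o in outages if not o.scheduled)

example = [Outage(minutes=20, affected_users=6)]
print(downtime_user_minutes(example))  # 120 user-minutes, i.e. 2 hours
```

Adding a scheduled outage, however long, leaves the result unchanged, which is exactly why an unlimited scheduled downtime allowance matters.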
Now that we have one base component, the downtime, let’s calculate the Monthly Uptime Percentage, which will be used to determine whether a penalty applies or not.
So let’s work through an example. It is different from the one above in order to get into the range where penalties apply. 100 users, 40 of whom experience 180 minutes of non-availability of the service; the total number of minutes in the example month is 24*30*60 = 43200. Here are the calculations with numbers:
So in this example Microsoft would report 99.83% and pay out a 25% service credit; Google would report 99.58% availability, which would result in 3 days of service added. The major difference, without wanting to show you multiple examples, is that with more users affected the Microsoft penalty rises, whereas it would stay the same with Google, unless there is some secret ingredient in the downtime calculation that is not in the SLA document.
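As a sketch of the two calculations for this example: the Microsoft formula below weights everything by users (user-minutes), and the Google formula assumes the full 180 minutes count as downtime because 40% of users is above the five percent threshold. Both formulas are my reading of the agreements, and the function names are made up for illustration.

```python
TOTAL_MINUTES = 24 * 30 * 60   # 43,200 minutes in the example month
TOTAL_USERS = 100
AFFECTED_USERS = 40
OUTAGE_MINUTES = 180

def microsoft_uptime_pct(total_minutes, total_users, outage_minutes, affected_users):
    """Uptime based on user-minutes: downtime = outage minutes x affected users."""
    user_minutes = total_minutes * total_users
    downtime = outage_minutes * affected_users
    return (user_minutes - downtime) / user_minutes * 100

def google_uptime_pct(total_minutes, outage_minutes):
    """Uptime based on plain minutes, independent of how many users are affected
    (assumption: the outage is above the five percent threshold)."""
    return (total_minutes - outage_minutes) / total_minutes * 100

ms = microsoft_uptime_pct(TOTAL_MINUTES, TOTAL_USERS, OUTAGE_MINUTES, AFFECTED_USERS)
goog = google_uptime_pct(TOTAL_MINUTES, OUTAGE_MINUTES)
print(f"Microsoft: {ms:.2f}%  Google: {goog:.2f}%")
# prints: Microsoft: 99.83%  Google: 99.58%
```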
Virus Detection, Spam Effectiveness and False Positives
When you consider an email service, availability is not the only critical component. The protection of your service, the mailboxes and ultimately your users is a top priority. So you would want an SLA for that as well.
Google: no SLA. Google offers a standalone service under different terms at extra cost.
With Microsoft there are two ways to get this service: either as a standalone service at extra cost, which would be comparable to Google, or included in Exchange Online, which is why it is mentioned here.
Microsoft offers an SLA for virus protection – a 25% service credit is available if an infection occurs in a calendar month. Only one credit per month.
Microsoft offers an SLA on spam effectiveness – if effectiveness falls below 98%, different credit levels are available depending on the level reached.
Another SLA offered is the false positive ratio – credits are available on three levels, based on the ratio.
Disasters are excluded from both service level agreements. Therefore it is important to look at what happens when a disaster strikes.
Google: I have not found a statement in the SLA but here is what the Google Enterprise Blog states:
For Google Apps customers, our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that’s also been reflecting your actions.
Actually this could mean that there is no exception from the SLA even if disaster strikes, but the SLA document refers to a “Force Majeure” section of the Agreement. That section only refers to performance or inadequate performance and has no statement with regard to availability. So my assumption is that Google guarantees the SLA even in case of disasters.
Microsoft: In the service descriptions Microsoft explains its position towards disasters (look for the service continuity paragraphs):
Two metrics commonly used in service continuity management to evaluate disaster recovery solutions are a recovery time objective (RTO), which measures the time between a system disaster and the time when the system is again operational, and a recovery point objective (RPO), which is a time representation of the possible data loss that occurred as a result of the recovery from the unexpected event.
Exchange Online has set an RPO and RTO for client messaging services in the event of a disaster:
- Nearly instantaneous RPO: Microsoft protects your Exchange Online data and makes a nearly instantaneous copy of your data.
- 1 hour RTO: Organizations will be able to resume service within 60 minutes after service disruption if a disaster incapacitates a hosting data center.
The important piece next to the times mentioned is that after both RPO and RTO are met, Service Levels are reinstated.
In summary I would say the Microsoft agreement on disaster recovery is contractually clearer and technically state of the art. The Google statement leaves some room for interpretation, but if my understanding is right it is stronger than Microsoft’s statement.
Do they differ?
I believe the conclusion is an obvious one: yes, they differ. While Microsoft has a lengthy but well-structured and clearly defined contract document on the table, the Google approach is more direct. Sadly there are critical areas where it leaves room for interpretation. I do hope someone from Google would be able to explain these blank spots to potential customers, but a self-explanatory document would have been better. In general I am in favor of precise and comprehensive contract documents. I have seen too many outsourcing contracts fail, and spent too much time drafting documents myself, to let a contract document with blank spots slip through. Especially as many smaller customers will be signing these contracts online without legal checks, it is important to have a clear contract document.
I struggle to pass judgment on the Uptime Percentage calculation, as I lack information from Google. In general the Microsoft model of including the number of users in the calculation makes a lot of sense.
I believe neither of the agreements is so self-explanatory to an inexperienced person that it should be signed without a legal check and a thorough questioning of each company’s representative. Make sure you understand what you sign up for!