The Ambiguity of Service Availability

In a previous blog entry, I promised to describe the subtle issue that caused a debate about the meaning of the term “availability.”  The post provided a deceptively simple mathematical definition of availability, and pointers to resources for understanding the topic in more depth.

Here’s why a mathematical approach alone isn’t enough…

[Comic: “Server Attention Span” (xkcd #869): https://xkcd.com/869/]

Sometimes it’s hard to tell whether a system is actually performing useful work!  A system’s users may be unhappy even when your monitoring systems tell you that your site is serving traffic and responding to each request with a successful response (HTTP 200 OK, for example).  Your site might, for instance, be misconfigured to show a default page to all users.

In many cases, this kind of problem can be solved with monitoring.  If the monitoring system tests the content of responses, then we can detect HTTP 200 responses that return a default page instead of useful information.  If our monitoring system tests content for each client type, we can tell when an important constituency (like users with iPhones or Android devices) might be unhappy.
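As a rough illustration, here’s a minimal sketch of such a content-aware probe in Python.  The URL, the marker string, and the per-client user agents are hypothetical placeholders, not part of any particular monitoring product.

```python
import urllib.error
import urllib.request

# Hypothetical values: substitute your own endpoint, marker text, and client types.
URL = "https://example.com/dashboard"
EXPECTED_MARKER = "Account balance"  # text that only the real page contains
CLIENT_AGENTS = {
    "iphone": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
    "android": "Mozilla/5.0 (Linux; Android 13)",
}

def probe(url, user_agent):
    """Return True only if the request succeeds AND the body contains the expected content."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        return False  # connection failure or non-2xx status
    # An HTTP 200 that serves a default page instead of real content fails this check.
    return EXPECTED_MARKER in body

if __name__ == "__main__":
    for client, agent in CLIENT_AGENTS.items():
        status = "healthy" if probe(URL, agent) else "UNHEALTHY"
        print(f"{client}: {status}")
```

The point of the sketch is simply that the probe asserts something about the content and about each client type, rather than trusting the status code alone.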

But things can get even more complicated…

For the last few years, Raymie and I were leaders at a company that sold big data analytics (in the form of Hadoop, Hive, Spark, and similar systems) to other companies.  There were times that our service was functioning as designed, but our customers were still unhappy.  That’s because they were using our service to write their own programs in SQL, Java, Scala, Python, and other languages.  We saw many situations where a bug in a program caused problems not just for the person running the program, but also for other users at the same company.  For example, a bug in a SQL JOIN statement could easily generate enough data to fill up a petabyte-sized file system.

It was this kind of subtlety that was at the heart of the debate about the meaning of “availability.”  How far does a service engineering team need to go to ensure that the users of a system are happy?  To the extent that “customer happiness” can be measured and it’s possible to negotiate a related service level objective (SLO), it makes sense for service engineers to accept the challenge to keep users happy.  For example, we changed our product so that it automatically notified customers to delete data (or to purchase a larger data plan) when they filled up file systems.  With this feature in place, individual users may have been unhappy that their programs could not write data, but their unhappiness was decoupled from the definition of availability.
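The shape of that feature is easy to sketch.  Here’s a hedged Python illustration; the 90% threshold and the notify callback are made-up placeholders, not our actual implementation:

```python
import shutil

# Hypothetical threshold: notify customers when a file system is 90% full.
USAGE_ALERT_THRESHOLD = 0.90

def check_filesystem(path, notify):
    """Warn the customer before writes start failing, so that full file systems
    don't count against the service's availability."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    if fraction_used >= USAGE_ALERT_THRESHOLD:
        notify(
            f"{path} is {fraction_used:.0%} full: "
            "delete data or purchase a larger data plan."
        )

if __name__ == "__main__":
    check_filesystem("/", notify=print)
```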

When running complex services, there are many similar sources of ambiguity in the definition of availability.  Resolving the relationship between customer happiness and availability often requires cooperation between customer success managers, product managers, developers, and service engineers.  The process involves resolving customer problem reports, recognizing patterns of unhappiness, building features that improve the customer experience, and ultimately providing less ambiguous ways to define and to measure availability.

The Calculus of Service Availability

A few weeks ago, a team of colleagues at Pinterest was debating the meaning of the term “availability.”  In a subsequent post to this blog, I’ll describe the subtle issue that caused the debate.  For now, let’s have a look at the mathematical definition of availability.

A deceptively simple formula for availability, the proportion of time that a system is functioning correctly, is MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair the system when it is broken.  In our MIT IAP workshop (6.S188), Raymie and I will go into the details of how to measure and maximize MTTF, and how to minimize MTTR.
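To make the formula concrete, here’s a tiny numerical sketch in Python; the MTTF and MTTR values are invented for illustration:

```python
# Illustrative numbers only: a system that runs 30 days between failures
# and takes 1 hour to repair.
mttf_hours = 30 * 24   # mean time to failure
mttr_hours = 1         # mean time to repair

availability = mttf_hours / (mttf_hours + mttr_hours)          # ≈ 0.99861
downtime_per_year_hours = (1 - availability) * 365 * 24        # ≈ 12.1 hours

print(f"availability: {availability:.5f}")
print(f"expected downtime per year: {downtime_per_year_hours:.1f} hours")
```

Even a system that fails only once a month, and is repaired within an hour, spends roughly half a day per year unavailable.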

I first encountered this definition of availability as an EECS graduate student at MIT.  At the time, the best reference was Chapter 3 of Raj Jain’s book, The Art of Computer Systems Performance Analysis.  Since some of my colleagues at work weren’t alive when this excellent book was published in 1991, I thought it would be best to find a more recent reference.  Not to worry!  A paper titled The Calculus of Service Availability just appeared in the September issue of the Communications of the ACM.  It is a reprint of an article that appeared in ACM Queue in May, and it draws on material from the Google SRE book, Site Reliability Engineering: How Google Runs Production Systems.

The Calculus of Service Availability paper takes the availability formula as a starting point.  It discusses some important observations about availability, defines some helpful terminology, and provides practical strategies for analyzing and improving the availability of distributed systems. Anyone who is (or wants to be) in the business of delivering products on the Internet should read this paper.

Reading this paper led me to start reading the entire Google SRE book, which really resonates with my own experience developing and running distributed systems.  This book is the unofficial textbook for 6.S188, but don’t feel like you need to read it before the start of class.  Raymie and I will provide lots of our own stories that are similar to the ones in the book, teach you relevant techniques, and provide an opportunity to practice in an online laboratory.