A few weeks ago, a team of colleagues at Pinterest were debating the meaning of the term “availability.” In a subsequent post to this blog, I’ll describe the subtle issue that caused the debate. For now, let’s have a look at the mathematical definition of availability.
A deceptively simple formula for availability, the proportion of time that a system is functioning correctly, is MTTF/(MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair the system when it is broken. In our MIT IAP workshop (6.S188), Raymie and I will go into detail on how to measure and maximize MTTF, and how to minimize MTTR.
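To make the formula concrete, here is a minimal Python sketch. The function names, the choice of hours as the unit, and the sample numbers (an MTTF of 999 hours and an MTTR of 1 hour) are assumptions made up for illustration; they do not come from any real system.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return MTTF / (MTTF + MTTR), the fraction of time the system works."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF and MTTR cannot both be zero")
    return mttf_hours / total


def expected_downtime_hours(avail: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the expected hours of downtime over a period at a given availability."""
    return (1.0 - avail) * period_hours


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
    a = availability(mttf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")                                  # 99.900%
    print(f"downtime per year: {expected_downtime_hours(a):.2f} h")  # 8.76 h

    # Halving MTTR (same MTTF) roughly halves the downtime.
    b = availability(mttf_hours=999, mttr_hours=0.5)
    print(f"downtime per year: {expected_downtime_hours(b):.2f} h")
```

The second call hints at why the formula is deceptively simple: availability can be raised either by failing less often (a larger MTTF) or by recovering faster (a smaller MTTR), and the two levers can have very different costs.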
I first encountered this definition of availability as an EECS graduate student at MIT. At the time, the best reference was Chapter 3 of Raj Jain’s book, The Art of Computer Systems Performance Analysis. Since some of my colleagues at work weren’t alive when this excellent book was published in 1991, I thought that it would be best to find a more recent reference. Not to worry! A paper titled The Calculus of Service Availability just appeared in the September issue of Communications of the ACM. This paper is a reprint of an article that appeared in ACM Queue in May, and it includes information from the Google SRE book, Site Reliability Engineering: How Google Runs Production Systems.
The Calculus of Service Availability paper takes the availability formula as a starting point. It discusses some important observations about availability, defines some helpful terminology, and provides practical strategies for analyzing and improving the availability of distributed systems. Anyone who is (or wants to be) in the business of delivering products on the Internet should read this paper.
Reading this paper led me to start reading the entire Google SRE book, which really resonates with my own experience developing and running distributed systems. This book is the unofficial textbook for 6.S188, but don’t feel like you need to read it before the start of class. Raymie and I will provide lots of our own stories that are similar to the ones in the book, teach you relevant techniques, and provide an opportunity to practice in an online laboratory.