Foreword
Preface
Part Ⅰ.Introduction
1. Introduction
The Sysadmin Approach to Service Management
Google's Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
2. The Production Environment at 6oogle, from the Viewpoint of an SRE
Hardware
System Software That "Organizes" the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Part Ⅱ.Principles
3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
4. Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
7. The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
8. Release Engineering
The Role of a Release Engineer
Philosophy
……