Issue #29 - Reliability Testing for Cloud-Backed Applications
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker
Another weekend, another weekend read, this time all about Reliability Testing for Cloud-Backed Applications.
I am writing a book on Thinking in Distributed Systems. 12 chapters, one chapter per month, full of diagrams, illustrations, and examples, all about distributed systems.
In their USENIX paper, Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker, Yinfang Chen et al. discuss the reliability challenges faced by cloud-backed applications.
The paper is based on the premise that cloud applications experience different failure types than traditional applications. Failures are often assumed to be transient, resulting in retries as the default failure handling strategy.
The paper consists of two parts:
(1) systematically understanding the bug patterns, and
(2) building practical tooling to systematically test whether a cloud-backed application can correctly handle the myriad errors that may happen during interactions with the cloud services it depends on.
Personally, I enjoyed the exploration of the different bug patterns. For example, the bug highlighted below: The Azure Storage SDK automatically retries a request that timed out. The first request had already taken effect and invalidated the precondition of the retry, so the retry returns a 409 error, causing all subsequent logs to get lost.
A great reminder that if your app retries, its requests need to be idempotent, that is, they can be retried without causing unintended consequences.
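To make the pattern concrete, here is a minimal sketch in Python. It is not the Azure Storage SDK and not code from the paper: FakeAppendStore, append_with_retry, and the request_id parameter are hypothetical names I made up for illustration. The fake store applies the first write but loses the response, so the retry runs against changed state. I am also assuming a store that deduplicates by request ID, which is only one of several ways to make a write idempotent.

```python
class Timeout(Exception):
    """The client gave up waiting for a response."""


class Conflict(Exception):
    """Stands in for an HTTP 409: the request's precondition no longer holds."""


class FakeAppendStore:
    """In-memory stand-in for a cloud append API (hypothetical, not a real SDK)."""

    def __init__(self):
        self.blocks = []
        self.seen_ids = set()
        self._drop_next_response = True  # simulate one lost response

    def append(self, data, expected_length, request_id=None):
        # Assumed dedup: a request ID we already applied is acknowledged, not re-applied.
        if request_id is not None and request_id in self.seen_ids:
            return len(self.blocks)
        # Precondition: the caller must know the current length of the log.
        if expected_length != len(self.blocks):
            raise Conflict("409: precondition failed")
        self.blocks.append(data)
        if request_id is not None:
            self.seen_ids.add(request_id)
        if self._drop_next_response:
            self._drop_next_response = False
            raise Timeout("write applied, response lost")
        return len(self.blocks)


def append_with_retry(store, data, expected_length, request_id=None, attempts=3):
    """Retry on timeouts only; any other error propagates to the caller."""
    for _ in range(attempts):
        try:
            return store.append(data, expected_length, request_id)
        except Timeout:
            continue
    raise Timeout("gave up after retries")


if __name__ == "__main__":
    # Without a request ID, the retry trips over the effect of the first attempt.
    naive = FakeAppendStore()
    try:
        append_with_retry(naive, "log-1", expected_length=0)
    except Conflict as err:
        print("naive retry failed:", err)

    # With a request ID, the retry is recognized as a duplicate and acknowledged.
    safe = FakeAppendStore()
    print("idempotent retry returned:", append_with_retry(safe, "log-1", 0, request_id="r1"))
```

In both runs the store ends up with exactly one block. The difference is what the caller learns: the naive retry sees a conflict for a write that actually succeeded, while the retry with a request ID sees a normal acknowledgement.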
A fun, thought-provoking read that will have you revisit lots of places in your codebase :)
Happy Reading
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker
Yinfang Chen, Xudong Sun, Suman Nath, Ze Yang, & Tianyin Xu
Modern applications have been emerging towards a cloud-based programming model where applications depend on cloud services for various functionalities. Such “cloud native” practice greatly simplifies application deployment and realizes cloud benefits. Meanwhile, it imposes emerging reliability challenges for addressing fault models of the opaque cloud and less predictable Internet connections.