Another weekend, another weekend read, this time all about Understanding Epochs in Distributed Systems
Epochs are a coordination mechanism in distributed systems: Epochs are a recency token, a monotonically increasing logical timestamp marking a specific point in a system's operations. You can think of epochs as “eras” or “generations” in your distributed system's lifecycle.
In this issue of the Weekend Read we will explore what epochs are, how they are used, and how they actually work
Epoch
An epoch or term is a coordination mechanism: An epoch is a unique, incrementing value that represents the state, in this context also called configuration, of a system at a point in time. However, why do we need a representation of the state at a point in time?
Why Epochs?
A core challenge in distributed systems is that every component only has a limited local view: A component can only observe its own state, not the global state of the system. If a component wants to learn about the global state-the state of any component other than itself-it must communicate i.e. exchange messages.
As a system progresses, the configuration of the system changes, invalidating a component’s knowledge about the configuration of a system.
Epochs are used to coordinate components by enabling a component to make recency decisions i.e. decisions based on the recency of their knowledge.
Example • Leadership in distributed systems
To ensure reliability, systems often deploy components redundantly, with a designated leader making decisions. If the leader fails, or more accurately, the other components suspect the leader has failed, they elect a new leader. What happens if the leader did not actually fail? In that case, two components might each believe they are the leader, issuing conflicting commands, a situation known as split brain.
What can we do?
Epochs prevent the conflict: On startup, a group selects a leader with epoch 0. Every time a new leader is elected, the epoch is incremented. When a leader issues a command, the leader includes its epoch.
If a component sees commands from two “leaders,” the component can fence off the obsolete leader by accepting commands with the highest epoch.
A key insight
An essential aspect of epochs in distributed systems is this: what matters is not who currently holds the highest epoch number but the highest epoch number each component has seen.
For example, if a new leader is elected but has not yet communicated its epoch to all components, they will continue accepting commands from the previous leader until they learn of the higher epoch.
Conclusion
Epochs offer a principled way to handle changing configurations: In the absence of global knowledge, epochs allow the coordination of components by enabling a component to make recency decisions i.e. decisions based on the recency of their knowledge.
Happy Reading