Another weekend, another weekend read, this time all about Simulating Distributed Systems
At Resonate, we're building Distributed Async Await, a programming model where concurrency and distribution are first-class citizens. Concurrency introduces non-deterministic partial order, and distribution introduces non-deterministic partial failure. Any model with these properties at its core needs mechanisms to mitigate both.
But how do we ensure that the protocols designed to manage partial order and partial failure are themselves reliable under such conditions?
A Testing Challenge
Unit testing components such as the Resonate Python SDK and the Resonate Server does not give me confidence that these parts will integrate into a coherent, correct whole.
From the beginning, we have invested in Deterministic Simulation Testing. However, so far our tests were limited to an individual component, they did not span component interactions.
So in this issue of the Weekend Read, let’s take the first steps simulating an entire system.
Simulating Distributed Systems
In 2023, fly.io and Kyle Kingsbury built a collection of distributed systems challenges. The challenges are built on top of a platform called Maelstrom, which in turn, is built on Jepsen. Maelstrom orchestrates processes into a distributed system using a simple JSON protocol on top of stdin
and stdout
.
A Custom Simulation Framework
Inspired by Maelstrom, I decided to explore creating a custom simulation framework that follows a similar approach. The idea? Define a straightforward message-passing protocol where messages are exchanged via stdin
and stdout
.
Components
Components are os processes. The Component
class wraps a process and manages its stdin
, stdout
, and stderr
streams:
A component receives a message via reading from its
stdin
streamA component sends a message via writing to its
stdout
stream
import subprocess, threading, time, queue
class Component:
def __init__(self, name, command):
self.name = name
self.proc = subprocess.Popen(command,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdin=subprocess.PIPE,
text=True)
self.stdout_queue = queue.Queue()
self.stderr_queue = queue.Queue()
# Thread to read stdout and stderr asynchronously
threading.Thread(target=self._read_output,
args=(self.proc.stdout, self.stdout_queue)).start()
threading.Thread(target=self._read_output,
args=(self.proc.stderr, self.stderr_queue)).start()
def _read_output(self, pipe, q):
while True:
line = pipe.readline()
if line:
q.put(line.strip())
else:
break
def send(self, message):
self.proc.stdin.write(message + '\n')
self.proc.stdin.flush()
def get_output(self):
try:
return self.stdout_queue.get_nowait()
except queue.Empty:
return None
def get_error(self):
try:
return self.stderr_queue.get_nowait()
except queue.Empty:
return None
The Simulator
The Simulator manages a collection of components and facilitates their interactions. When a component writes to stdout or stderr, the simulator will print each line prefixed with the component’s identifier.
However, if the line starts with send
, the simulator interprets this as a message and forwards it to the target components stdin
.
send <target> <message>
class Simulator:
def __init__(self):
self.procs = {}
def add(self, component):
self.procs[component.name] = component
def send(self, source, target, message):
if target in self.procs:
self.procs[target].send(message)
def run(self):
while True:
for name, proc in self.procs.items():
output = proc.get_output()
if output:
print(f"{name}: {output}")
if output.startswith('send'):
parts = output.split()
target, msg = parts[1], ' '.join(parts[2:])
self.send(name, target, msg)
error = proc.get_error()
if error:
print(f"{name} (stderr): {error}")
time.sleep(0.1)
Example Usage
These few lines are enough to get started with a simple distributed system simulation:
sim = Simulator()
sim.add(Component("p1",
["python3", "-c", "print('send p2, Hello from p1')"]))
sim.add(Component("p2",
["python3", "-c", "import sys; print(sys.stdin.readline())"]))
sim.run()
Outlook
So far, we are not simulating the adverse effects of concurrency and distribution such as process crashes, message delay, message loss, message duplication, or message reordering. But this is a great starting point for further exploration.
Happy Reading