Distributed systems don’t isolate failure.
They route it.
Failure Follows the Graph
Every distributed system is a graph:
- services
- dependencies
- communication paths
When something fails:
It doesn’t stop.
It travels along connections.
Dependencies Define Propagation Paths
A failure spreads along three kinds of paths:
- to downstream consumers, as errors and latency
- to upstream dependencies, as retry traffic and backpressure
- across shared infrastructure, to otherwise unrelated services
This is the same structure described in external dependencies.
Which means:
The system defines how failure moves.
One Failure Becomes Many Requests
A single failure triggers:
- retries
- fallback calls
- parallel requests
This multiplies load.
And accelerates propagation.
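The arithmetic of this fan-out can be sketched directly. The numbers below are hypothetical, and the model assumes every layer in a call chain retries a failed call independently:

```python
def calls_to_leaf(depth: int, retries_per_call: int) -> int:
    """Total requests hitting a failing leaf service when every
    layer in a call chain retries a failed call independently."""
    attempts = 1 + retries_per_call  # the original call plus its retries
    return attempts ** depth

# One user request, a 3-hop chain, 3 retries at each hop:
# the failing service receives 4**3 = 64 requests instead of 1.
print(calls_to_leaf(depth=3, retries_per_call=3))  # 64
```

The multiplication is exponential in chain depth, which is why a single failing dependency can see its traffic spike precisely when it is least able to serve it.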
Retry Logic Amplifies Failure
Retries are designed for resilience.
Under failure, they create pressure:
- more traffic
- more contention
- more instability
This connects directly to small errors spreading.
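One standard damping technique, not specific to any system described here, is exponential backoff with full jitter: each retry waits a random interval whose upper bound grows exponentially, so retries spread out instead of arriving in synchronized waves. A minimal sketch:

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Exponential backoff with full jitter: wait a random interval in
    [0, min(cap, base * 2**attempt)].  Randomizing the delay prevents
    failed callers from retrying in lockstep and re-creating the spike."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


delays = [backoff_delay(a) for a in range(6)]
assert all(0.0 <= d <= 10.0 for d in delays)
```

Backoff does not remove retry pressure; it stretches it over time so the failing service has a chance to recover between attempts.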
Latency Propagation Is Invisible
Failures are visible.
Latency is not.
It shows up as:
- slow responses
- delayed processing
- increasing wait times
Latency spreads quietly.
But destabilizes the system.
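A toy illustration, with made-up per-hop latencies: for sequential calls, a caller's latency is at least the sum of its dependencies' latencies, so one slow hop is felt at the edge without a single error firing.

```python
# Hypothetical per-hop latencies in milliseconds.
hop_latency_ms = {"auth": 20, "db": 250, "cache": 5}


def chain_latency(path: list[str]) -> int:
    """Edge-observed latency of a sequential call chain: each hop's
    delay is inherited by every caller above it."""
    return sum(hop_latency_ms[hop] for hop in path)


# A 250 ms database slowdown surfaces at the edge as 275 ms.
# No request fails, no alert fires, yet every request feels it.
print(chain_latency(["auth", "db", "cache"]))  # 275
```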
Queues Turn Pressure Into Backlog
Distributed systems rely on queues.
Under failure:
- queues grow
- processing lags
- timeouts increase
Queues convert local issues into global slowdown.
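The backlog dynamic can be sketched with a toy fluid model (the rates are hypothetical): whenever arrivals exceed service capacity, queue depth grows without bound, and every queued item waits longer than the one before it.

```python
def backlog(arrival_rate: float, service_rate: float, seconds: int) -> float:
    """Queue depth after `seconds` when arrivals exceed service capacity.
    While arrival_rate > service_rate, the backlog grows linearly and
    every new item inherits the accumulated wait."""
    depth = 0.0
    for _ in range(seconds):
        depth = max(0.0, depth + arrival_rate - service_rate)
    return depth


# A 10% capacity shortfall (100 req/s in, 90 req/s out) leaves
# 600 requests queued after one minute: a local slowdown made global.
print(backlog(100, 90, 60))  # 600.0
```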
Resource Limits Accelerate Collapse
When systems approach limits:
- CPU saturates
- memory pressure increases
- network delays grow
This connects directly to resource limits.
Because under pressure:
Propagation becomes faster.
Protocols Define Failure Behavior
Propagation is not random.
It depends on:
- retry policies
- timeout strategies
- circuit breakers
- consistency models
As described in protocol complexity.
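As one concrete example of a protocol shaping propagation, here is a minimal circuit-breaker sketch. The thresholds, cooldown, and API are illustrative, not taken from any particular library:

```python
import time


class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens and
    calls fail fast for `cooldown` seconds, cutting off one
    propagation path instead of retrying into it."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.failures = 0  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast is a deliberate trade: the breaker converts repeated slow failures into immediate errors, protecting both the caller and the struggling dependency.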
Interfaces Hide Propagation
From the outside:
- requests fail
- latency increases
But internal propagation remains hidden.
This builds directly on interfaces hiding risks.
Observability Sees Events, Not Flow
Monitoring shows:
- errors
- spikes
- alerts
It does not show:
- propagation paths
- interaction chains
- feedback loops
This is the same limitation described in monitoring vs understanding.
Drift Makes Propagation Unpredictable
When systems drift:
- configurations differ
- behavior changes
- responses vary
This builds on configuration drift.
Which means:
Propagation paths are no longer consistent.
Scaling Increases Propagation Speed
At scale:
- more nodes
- more connections
- more dependencies
This connects directly to why systems break.
Because:
Failure travels faster in larger systems.
Partial Failure Becomes System Failure
Distributed systems rarely fail completely.
They degrade:
- partial outages
- inconsistent behavior
- cascading delays
But left unchecked:
These partial failures converge into full failure.
Infrastructure Is a Shared Risk Surface
Multiple services share:
- networks
- storage
- compute resources
A failure in shared infrastructure affects multiple systems simultaneously.
Which amplifies propagation.
Failure Is a System Behavior
Failures are not anomalies.
They are part of how systems behave under stress.
You Can’t Prevent Failure Propagation
You can:
- slow it
- contain it
- isolate it
But you cannot eliminate it.
Because propagation is built into the system.
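Containment is often implemented as a bulkhead: a hard cap on how many concurrent calls may enter a dependency, so a slow or failing dependency can only tie up a bounded slice of the caller's resources. A minimal sketch, illustrative rather than production-ready:

```python
import threading


class Bulkhead:
    """Cap concurrent calls into a dependency.  When all slots are
    taken, new calls are shed immediately instead of piling up
    behind a dependency that may never answer."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding load")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Shedding load is the point: the failure still happens, but it stays inside its compartment instead of consuming every thread the caller has.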
The Real Problem
The problem is not that something failed.
The problem is:
How the system allows that failure to spread.
Where Systems Actually Collapse
Not at the first failure.
But when propagation
outpaces control.