Distributed Systems
The fundamental challenges: consistency, availability, partition tolerance, and the trade-offs between them.
Overview
Distributed systems are collections of independent computers that communicate over a network and appear as a single system to the user. The fundamental challenges are partial failures (some components fail while others continue), network unreliability, and the impossibility of a globally consistent view of time and state.
Origin
Distributed computing research dates to the late 1970s and early 1980s (Lamport clocks, 1978; the Byzantine Generals problem, 1982). The practical engineering of large-scale distributed systems was shaped by Google's published papers (MapReduce 2004, Bigtable 2006, Chubby 2006, Spanner 2012) and Amazon's Dynamo paper (2007).
Examples
Idempotency key to handle at-least-once delivery
```ruby
class PaymentService
  # Network failures mean the client may retry a charge request.
  # Without idempotency keys, the card could be charged twice.
  def charge(amount:, idempotency_key:)
    # Return the stored response if this request was already processed.
    cached = IdempotencyStore.find(idempotency_key)
    return cached.response if cached&.complete?

    # stripe_client is assumed to be an application-level wrapper around the Stripe API.
    result = stripe_client.charges.create(
      amount: amount,
      idempotency_key: idempotency_key # Stripe deduplicates on its side too
    )
    IdempotencyStore.record(idempotency_key, response: result)
    result
  rescue Stripe::IdempotencyError
    # Stripe reported a conflict for this key; fall back to the stored
    # response if there is one, otherwise re-raise the original error.
    IdempotencyStore.find(idempotency_key)&.response || raise
  end
end
```

Stripe's API accepts idempotency keys natively. For internal services, maintain your own store keyed by a UUID generated by the caller before the first attempt.
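For an internal service, the caller-generated key and the store might look like the minimal sketch below. `InMemoryIdempotencyStore` and `fetch_or_run` are hypothetical names, and the in-process Hash guarded by a Mutex is an assumption that stands in for a shared database table or cache, which a multi-server deployment would need.

```ruby
require "securerandom"

# Hypothetical in-memory store. A real service would use shared storage
# (for example a table with a unique constraint on the key) so that every
# server instance sees the same keys.
class InMemoryIdempotencyStore
  def initialize
    @responses = {}
    @lock = Mutex.new
  end

  # Runs the block at most once per key. Later calls with the same key
  # return the stored response without running the block again.
  # The lock is held while the block runs, which serializes requests;
  # that is acceptable for a sketch but not for production traffic.
  def fetch_or_run(key)
    @lock.synchronize do
      return @responses[key] if @responses.key?(key)

      @responses[key] = yield
    end
  end
end

store = InMemoryIdempotencyStore.new
key = SecureRandom.uuid # generated once by the caller, reused on every retry
attempts = 0
2.times { store.fetch_or_run(key) { attempts += 1 } }
puts attempts # prints 1: the retry was served from the store
```

If the block raises, nothing is recorded, so a later retry with the same key runs the operation again.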
Use Cases
- Any system spanning multiple servers. Even a single web server with a separate database is a distributed system
- Multi-region deployments for latency or disaster recovery
- Systems that must continue operating during partial failures
When Not to Use
- Distribution introduces complexity that is unjustified until a single machine cannot handle the load
- When strong consistency is required and a single node can provide it: distributed systems make strong consistency expensive because replicas must coordinate on each write
Technical Notes
- The eight fallacies of distributed computing (attributed to L. Peter Deutsch and colleagues at Sun Microsystems, 1994, with the eighth added later by James Gosling): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology does not change; there is one administrator; transport cost is zero; the network is homogeneous. None of these is true
- Logical clocks (Lamport) assign timestamps that respect happened-before relationships when wall clocks cannot be trusted across machines; vector clocks can additionally tell when two events are concurrent
- The Two Generals problem shows that guaranteed agreement over an unreliable channel is impossible: the last acknowledgment can always be lost, so you cannot guarantee both sides commit without risk of split-brain
- Design for failure: assume every call to another service will eventually fail. Use timeouts, retries, circuit breakers, and bulkheads
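The logical-clock note above can be made concrete with a minimal Lamport clock sketch. The class and method names are illustrative, not taken from any library. Each process keeps a counter, increments it on every local event or send, and on receive jumps past the timestamp carried by the message.

```ruby
# Minimal Lamport logical clock; names are illustrative.
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  # Local event or message send: advance the counter and return the stamp.
  def tick
    @time += 1
  end

  # Message receive: move past both the local time and the sender's stamp.
  def receive(sender_time)
    @time = [@time, sender_time].max + 1
  end
end

a = LamportClock.new
b = LamportClock.new
sent_at = a.tick                 # A sends a message stamped 1
2.times { b.tick }               # B performs two unrelated local events
received_at = b.receive(sent_at) # max(2, 1) + 1
puts "sent=#{sent_at} received=#{received_at}" # prints sent=1 received=3
```

The send always carries a smaller timestamp than the matching receive. The converse does not hold: a smaller timestamp alone does not prove that one event happened before another, which is the gap vector clocks close.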
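The design-for-failure note can be sketched as retries with exponential backoff wrapped around a simple circuit breaker. Everything here is hypothetical: `with_retries`, `CircuitBreaker`, `CircuitOpenError`, and the thresholds are illustrative choices. Timeouts are assumed to be configured on the underlying client, and only idempotent operations should be retried.

```ruby
# Raised when the breaker refuses a call without trying it.
class CircuitOpenError < StandardError; end

# Hypothetical breaker: opens after N consecutive failures, then allows
# a trial call once the cool-down has elapsed (half-open).
class CircuitBreaker
  def initialize(failure_threshold: 3, reset_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_after = reset_after
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError, "circuit open" if open?

    result = yield
    @failures = 0
    @opened_at = nil
    result
  rescue CircuitOpenError
    raise # do not count refusals as failures
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
    raise
  end

  private

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_after
  end
end

# Retries the block with exponential backoff: base, 2x base, 4x base, ...
# An open circuit is not retried, so a failing dependency gets time to recover.
def with_retries(attempts: 3, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    tries += 1
    yield
  rescue CircuitOpenError
    raise
  rescue StandardError
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

A caller might combine the two as `with_retries { breaker.call { client.get(url) } }`, where `breaker`, `client`, and `url` are placeholders. The clock and sleeper are injected so the behavior can be exercised in tests without waiting.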