Architecture

Distributed Systems

The fundamental challenges, consistency, availability, partition tolerance, and the trade-offs between them.

Overview

Distributed systems are collections of independent computers that communicate over a network and appear as a single system to the user. The fundamental challenges are partial failures (some components fail while others continue), network unreliability, and the impossibility of a globally consistent view of time and state.

Origin

Distributed computing research dates to the 1970s (Lamport clocks, 1978; the Byzantine Generals problem, 1982). The practical engineering of large-scale distributed systems was shaped by Google's published papers (MapReduce 2004, Bigtable 2006, Chubby 2006, Spanner 2012) and Amazon's Dynamo (2007).

Examples

Idempotency key to handle at-least-once delivery

class PaymentService
  # Network failures mean the client may retry a charge request.
  # Without idempotency keys, the card could be charged twice.
  def charge(amount:, idempotency_key:)
    # Check if we already processed this request
    cached = IdempotencyStore.find(idempotency_key)
    return cached.response if cached&.complete?

    result = stripe_client.charges.create(
      amount: amount,
      idempotency_key: idempotency_key,  # Stripe deduplicates on their side too
    )

    IdempotencyStore.record(idempotency_key, response: result)
    result
  rescue Stripe::IdempotencyError
    IdempotencyStore.find(idempotency_key).response
  end
end

Stripe's API accepts idempotency keys natively. For internal services, maintain your own store keyed by a UUID generated by the caller before the first attempt.

Use Cases

01Any system spanning multiple servers, even a single web server with a separate database is a distributed system
02Multi-region deployments for latency or disaster recovery
03Systems that must continue operating during partial failures

When Not to Use

//Distribution introduces complexity that is unjustified until a single machine cannot handle the load
//When strong consistency is required: distributed systems make strong consistency expensive

Technical Notes

The eight fallacies of distributed computing (Peter Deutsch, 1994): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology does not change; there is one administrator; transport cost is zero; the network is homogeneous. None of these are true
Logical clocks (Lamport) and vector clocks establish happened-before relationships when wall clocks cannot be trusted across machines
The two generals problem proves that reliable agreement over an unreliable channel is impossible, you cannot guarantee both sides commit without risk of split-brain
Design for failure: assume every call to another service will eventually fail. Use timeouts, retries, circuit breakers, and bulkheads