Distributed Systems

The fundamental challenges: consistency, availability, partition tolerance, and the trade-offs between them.

Overview

Distributed systems are collections of independent computers that communicate over a network and appear as a single system to the user. The fundamental challenges are partial failures (some components fail while others continue), network unreliability, and the impossibility of a globally consistent view of time and state.

Origin

Distributed computing research dates to the late 1970s and early 1980s (Lamport clocks, 1978; the Byzantine Generals problem, 1982). The practical engineering of large-scale distributed systems was shaped by Google's published papers (MapReduce 2004, Bigtable 2006, Chubby 2006, Spanner 2012) and Amazon's Dynamo (2007).

Examples

Idempotency key to handle at-least-once delivery

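require "stripe"

class PaymentService
  # Network failures mean the client may retry a charge request.
  # Without idempotency keys, the card could be charged twice.
  def charge(amount:, idempotency_key:)
    # Return the stored response if this request was already processed.
    cached = IdempotencyStore.find(idempotency_key)
    return cached.response if cached&.complete?

    # stripe-ruby takes the idempotency key as a per-request option (the
    # second argument), not as a charge parameter. Stripe deduplicates on
    # their side too.
    result = stripe_client.charges.create(
      { amount: amount },
      { idempotency_key: idempotency_key }
    )

    IdempotencyStore.record(idempotency_key, response: result)
    result
  rescue Stripe::IdempotencyError
    # Stripe rejected the key. Fall back to the stored response if one
    # exists; otherwise surface the error instead of failing on nil.
    cached = IdempotencyStore.find(idempotency_key)
    raise unless cached
    cached.response
  end
end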

Stripe's API accepts idempotency keys natively. For internal services, maintain your own store keyed by a UUID generated by the caller before the first attempt.
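
A minimal in-memory sketch of such a store, assuming the `find`, `record`, and `complete?` interface that `PaymentService` calls above. The `Entry` struct and the first-write-wins rule are assumptions made for illustration. A real store would more likely sit in a shared database with a unique constraint on the key, so that every application server sees the same entries.

```ruby
require "securerandom"

# Hypothetical in-memory idempotency store. Method names mirror the calls
# made by PaymentService (find, record, complete?); the rest is assumed.
class IdempotencyStore
  Entry = Struct.new(:response, :complete) do
    def complete?
      complete
    end
  end

  @entries = {}
  @lock = Mutex.new

  class << self
    # Returns the stored Entry for this key, or nil if the key is unknown.
    def find(key)
      @lock.synchronize { @entries[key] }
    end

    # Stores the response for a finished request. The first write wins, so
    # a late duplicate cannot overwrite the original outcome.
    def record(key, response:)
      @lock.synchronize { @entries[key] ||= Entry.new(response, true) }
    end
  end
end

# The caller generates the key once, before the first attempt, and reuses
# it on every retry of the same logical request.
key = SecureRandom.uuid
IdempotencyStore.record(key, response: { status: "succeeded" })
IdempotencyStore.record(key, response: { status: "duplicate" }) # ignored
```

This sketch only records completed requests. It does not stop two concurrent first attempts from both reaching the downstream service; closing that gap would need an "in progress" marker written atomically before the call.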

Use Cases

  • Any system spanning multiple servers; even a single web server with a separate database is a distributed system
  • Multi-region deployments for latency or disaster recovery
  • Systems that must continue operating during partial failures

When Not to Use

  • Distribution introduces complexity that is unjustified until a single machine cannot handle the load
  • When strong consistency is required: distributed systems make strong consistency expensive

Technical Notes

  • The eight fallacies of distributed computing (L Peter Deutsch and colleagues at Sun Microsystems, 1994; James Gosling added the eighth around 1997): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology does not change; there is one administrator; transport cost is zero; the network is homogeneous. None of them holds in practice
  • Logical clocks (Lamport) and vector clocks establish happened-before relationships when wall clocks cannot be trusted across machines
  • The Two Generals problem shows that two parties cannot reach guaranteed agreement over an unreliable channel: whoever sends the last message can never know it arrived, so neither side can commit with certainty that the other will too
  • Design for failure: assume every call to another service will eventually fail. Use timeouts, retries, circuit breakers, and bulkheads
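
The logical-clock note above can be illustrated with a minimal Lamport clock. This is a sketch; the class and method names (`LamportClock`, `tick`, `receive`) are chosen here for illustration and do not come from any library.

```ruby
# Minimal Lamport logical clock. Names are illustrative.
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  # Local event or message send: advance the counter and return the
  # new timestamp.
  def tick
    @time += 1
  end

  # Message receipt: jump past the sender's timestamp so the receive
  # event is always ordered after the send event.
  def receive(sender_time)
    @time = [@time, sender_time].max + 1
  end
end

a = LamportClock.new
b = LamportClock.new

sent = a.tick               # A sends a message stamped 1
b.tick                      # B has an unrelated local event (clock: 1)
b.tick                      # and another (clock: 2)
received = b.receive(sent)  # max(2, 1) + 1 = 3
```

If one event happened before another, its timestamp is smaller, but the converse does not hold: a smaller timestamp alone does not prove causality. Vector clocks keep one counter per node, which is what lets them detect concurrent events.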
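
As a rough sketch of the "design for failure" advice, the following combines a timeout, retries with exponential backoff, and a simple circuit breaker. `CircuitBreaker`, `with_retries`, and all parameter names are hypothetical, and the thresholds are arbitrary. `Timeout.timeout` is used only for brevity; in practice, connect and read timeouts set on the HTTP or database client are usually the safer choice.

```ruby
require "timeout"

# Hypothetical circuit breaker: after `threshold` consecutive failures it
# opens and rejects calls immediately until `cooldown` seconds have passed.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @cooldown = cooldown
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open" if open?

    result = yield
    @failures = 0     # any success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    raise
  end

  private

  def open?
    return false unless @opened_at

    @clock.call - @opened_at < @cooldown
  end
end

# Retries the block with exponential backoff. An open circuit is not
# retried, so a dependency known to be down is not hammered.
def with_retries(attempts: 3, base_delay: 0.1)
  tries = 0
  begin
    tries += 1
    yield
  rescue CircuitBreaker::OpenError
    raise
  rescue StandardError
    raise if tries >= attempts

    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Usage: a stand-in remote call that fails once, then succeeds.
breaker = CircuitBreaker.new(threshold: 2, cooldown: 60)
calls = 0
flaky = lambda do
  calls += 1
  raise IOError, "connection reset" if calls < 2

  "ok"
end

result = with_retries(attempts: 3, base_delay: 0.01) do
  breaker.call { Timeout.timeout(1) { flaky.call } }
end
```

Retrying is only safe when the operation is idempotent, which is why retries pair naturally with idempotency keys. Bulkheads are not shown here; they amount to giving each dependency its own bounded pool of threads or connections so one slow service cannot exhaust the rest.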