Distributed Systems
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works →
Distributed Systems are the "Invisible Engines" that power the modern internet. While a traditional computer is a single box on your desk, a distributed system is a "Team" of hundreds or thousands of computers (Nodes) working together to solve a single problem. Whether you are searching Google, streaming Netflix, or sending a Bitcoin transaction, you are interacting with a distributed system. It is the science of "Scaling Up"—creating a machine that is so large and so resilient that it can serve millions of people at once and keep running even if dozens of its parts fail.
Remembering
- Distributed System — A collection of independent computers that appears to its users as a single coherent system.
- Node — A single computer or process within a distributed system.
- Scalability — The ability of a system to handle more work by adding more resources (e.g., adding more servers).
- Availability — The percentage of time a system is "Up" and responding to requests (e.g., "Five Nines" = 99.999% uptime).
- Fault Tolerance — The ability of a system to continue operating even when one or more components fail.
- Consensus — The process by which all nodes in a system agree on a single value or state (e.g., the Paxos or Raft algorithms).
- Latency — The "Delay" or time it takes for a message to travel from one node to another.
- Throughput — The amount of work (requests) a system can handle per second.
- Load Balancer — A device that acts as a "Traffic Cop," distributing incoming requests to different servers so no one server is overwhelmed.
- CAP Theorem — The fundamental rule that a distributed system can simultaneously guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance.
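The "Traffic Cop" role of a load balancer can be sketched in a few lines of Python. The `RoundRobinBalancer` class and server names below are illustrative inventions, not any real product's API:
<syntaxhighlight lang="python">
from itertools import cycle

class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""
    def __init__(self, servers):
        self._rotation = cycle(servers)

    def route(self, request):
        server = next(self._rotation)
        return f"{server} handles {request}"

lb = RoundRobinBalancer(["S1", "S2", "S3"])
for req in ["req-1", "req-2", "req-3", "req-4"]:
    print(lb.route(req))  # req-4 wraps back around to S1
</syntaxhighlight>
Round-robin is the simplest balancing policy; real load balancers also weigh servers by health and current load.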
Understanding
Distributed systems are understood through Coordination and Trade-offs.
1. The CAP Theorem (The "Choose Two" Rule): This is the most important law in distributed systems.
- Consistency (C): Every node sees the same data at the same time.
- Availability (A): Every request gets a response.
- Partition Tolerance (P): The system keeps working even if the network between nodes breaks.
- Because the network *always* breaks eventually (P), you have to choose between **C** and **A**. Do you want the answer to be "Perfectly Correct" (C) or "Fast and Always There" (A)?
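A toy sketch of that C-vs-A choice during a partition. The `Replica` class and its "CP"/"AP" modes are invented for illustration:
<syntaxhighlight lang="python">
class Replica:
    """One node's copy of a value, with a chosen side of the C-vs-A trade-off."""
    def __init__(self, mode, value):
        self.mode = mode          # "CP" or "AP"
        self.value = value
        self.partitioned = False  # cut off from the other nodes?

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            return "ERROR: unavailable during partition"
        # Availability first: answer with whatever we have, possibly stale.
        return self.value

cp, ap = Replica("CP", "balance=100"), Replica("AP", "balance=100")
cp.partitioned = ap.partitioned = True  # the network breaks: P happens
print(cp.read())  # ERROR: unavailable during partition
print(ap.read())  # balance=100 (fast, but possibly stale)
</syntaxhighlight>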
2. Horizontal vs. Vertical Scaling:
- Vertical (Scale Up): Buying a "Bigger" computer. (Expensive and has a limit).
- Horizontal (Scale Out): Buying "More" cheap computers. (This is how modern systems like Google and AWS work).
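A back-of-the-envelope comparison of the two strategies. The 0.9 coordination-efficiency factor is an illustrative assumption, not a measured constant:
<syntaxhighlight lang="python">
def vertical_capacity(base_rps, speedup):
    # Scale Up: one bigger box; capacity grows until the hardware tops out.
    return base_rps * speedup

def horizontal_capacity(base_rps, n_servers, efficiency=0.9):
    # Scale Out: many cheap boxes; near-linear growth minus coordination cost.
    return base_rps * n_servers * efficiency

print(vertical_capacity(1000, 4))      # 4000 requests/sec, then a hard ceiling
print(horizontal_capacity(1000, 100))  # 90000.0 requests/sec, and you can keep adding
</syntaxhighlight>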
3. The Fallacies of Distributed Computing: New developers often make dangerous assumptions that cause systems to fail:
- They assume the network is "Always Reliable."
- They assume "Latency is Zero."
- They assume "Bandwidth is Infinite."
- In a distributed system, you must "Design for Failure." You assume everything will break and build the software to handle it.
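One standard "Design for Failure" pattern is retrying with exponential backoff. A minimal sketch; the flaky-call helper below is an invented stand-in for a real network call:
<syntaxhighlight lang="python">
import time

def call_with_retries(operation, attempts=3, base_delay=0.1):
    """Assume the network call can break, and retry with exponential
    backoff instead of trusting it to succeed the first time."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s...

def make_flaky_call(fail_times):
    # Illustrative stand-in for a real RPC: fails a few times, then succeeds.
    state = {"left": fail_times}
    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("network dropped the request")
        return "OK"
    return call

print(call_with_retries(make_flaky_call(2)))  # OK, after two silent retries
</syntaxhighlight>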
Eventual Consistency: A compromise used by systems like Amazon or Facebook. If you update your profile, your friend in Australia might not see it for 1 second. The data is "Eventually" consistent, allowing the system to stay incredibly fast.
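A toy model of the profile-update example. The store below is invented for illustration; real eventually consistent systems (Amazon's Dynamo, for example) use far more sophisticated replication:
<syntaxhighlight lang="python">
class EventuallyConsistentStore:
    """Writes land on one replica immediately; the rest catch up later."""
    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # replication lag: updates not yet shipped out

    def write(self, key, value):
        self.replicas[0][key] = value      # fast local write, then return
        self.pending.append((key, value))  # replicate in the background

    def read(self, key, replica):
        return self.replicas[replica].get(key)  # may be stale!

    def sync(self):
        # The "background" replication finally runs.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("profile_pic", "new.jpg")
print(store.read("profile_pic", replica=2))  # None: Australia hasn't seen it yet
store.sync()                                 # ...a moment later
print(store.read("profile_pic", replica=2))  # new.jpg: eventually consistent
</syntaxhighlight>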
Applying
Modeling 'The Consensus Problem' (Agreement in a noisy room): <syntaxhighlight lang="python">
import random

def reach_consensus(nodes, value):
    """
    Simplified 'Majority Rule' consensus: each live node votes for the
    proposed value; the write succeeds only with a strict majority.
    """
    votes = []
    for node in nodes:
        # 10% chance a node is 'Down' and cannot vote
        if random.random() > 0.1:
            votes.append(value)
        else:
            votes.append("ERROR")
    # If more than 50% agree, we have an agreed state!
    if votes.count(value) > (len(nodes) / 2):
        return f"SUCCESS: System agreed on {value} with {votes.count(value)} votes."
    else:
        return "FAILURE: System could not reach consensus (Too many nodes down)."

# A cluster of 5 servers trying to save a database entry
print(reach_consensus(["S1", "S2", "S3", "S4", "S5"], "User_Balance=100"))
</syntaxhighlight>
Distributed Landmarks:
- The 'Google File System' (GFS) Paper (2003) → The document that changed the world by showing how to build a massive storage system using thousands of "Cheap, failing computers" instead of one expensive mainframe.
- Bitcoin (2008) → The first truly "Decentralized" system that solved the consensus problem without needing a central boss (using Proof of Work).
- Netflix Chaos Monkey → A tool used by Netflix that "Randomly Kills" their own servers in production to force their developers to build systems that never crash.
- The AOL Outage (1996) → A famous 19-hour blackout that showed the world how fragile distributed systems can be if they aren't designed for scale.
Analyzing
| Feature | Centralized (Mainframe) | Distributed (Cloud) |
|---|---|---|
| Point of Failure | Single (If it dies, all die) | Multiple (If one dies, others live) |
| Cost | High (Special hardware) | Low (Commodity hardware) |
| Complexity | Low (Easy to manage) | High (Hard to sync) |
| Capacity | Limited by one box | Effectively Infinite |
The Concept of "Sharding": Analyzing how to store a database that is too big for one disk. You "Slice" the data (e.g., users A-M on Server 1, N-Z on Server 2). This allows you to store petabytes of data, but makes "Searching" much more complex.
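The A-M / N-Z split above is range sharding; a common alternative is hash sharding, which spreads keys evenly instead of by alphabet. A minimal sketch:
<syntaxhighlight lang="python">
import hashlib

def shard_for(user_id, n_shards):
    # A stable hash spreads users evenly across shards, avoiding the
    # hot-spot risk of an alphabetical A-M / N-Z split.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# The same user always lands on the same shard, so lookups stay cheap;
# the hard part is queries that must touch every shard (e.g., "search all users").
print(shard_for("alice", 2), shard_for("zoe", 2))
</syntaxhighlight>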
Evaluating
Evaluating distributed systems:
- Complexity: Is the "Overhead" of managing 1,000 servers worth it for a small company? (The "Monolith vs. Microservices" debate).
- Debugging: How do you find a bug that only happens when Server A talks to Server B while Server C is slow? (The "Heisenbug" problem).
- Environment: Do massive data centers use too much electricity? (Some distributed systems now "Follow the Sun"—moving their work to whichever part of the world has the most solar power right now).
- Trust: In a "Public" distributed system (like Blockchain), how do you prevent "Byzantine" nodes from lying to the group?
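On that last question, the classical result (Lamport, Shostak, and Pease, 1982) is that tolerating f Byzantine (lying) nodes requires at least 3f + 1 nodes in total. A quick sanity-check function:
<syntaxhighlight lang="python">
def can_tolerate_byzantine(total_nodes, liars):
    # Classical bound: agreement is possible only if total >= 3 * liars + 1.
    return total_nodes >= 3 * liars + 1

print(can_tolerate_byzantine(4, 1))   # True: 4 honest-majority nodes outvote 1 liar
print(can_tolerate_byzantine(3, 1))   # False: 3 nodes cannot
print(can_tolerate_byzantine(10, 3))  # True
</syntaxhighlight>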
Creating
Future Frontiers:
- Edge Computing: Moving the "Brain" of the internet out of data centers and into your fridge, your car, and your watch to cut latency to nearly zero.
- Quantum Networking: Using "Entanglement" to build ultra-secure links between nodes (entanglement cannot transmit data faster than light, so it promises tamper-proof channels rather than "Instant" sync).
- Self-Healing Systems: AI that "Watches" a system and automatically spawns new nodes or rewrites code when it detects a bottleneck.
- Interplanetary Internet: Designing distributed systems that can handle the up-to-20-minute one-way latency between Earth and Mars.