Distributed Systems
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works →
Distributed Systems are the "Invisible Engines" that power the modern internet. While a traditional computer is a single box on your desk, a distributed system is a "Team" of hundreds or thousands of computers (Nodes) working together to solve a single problem. Whether you are searching Google, streaming Netflix, or sending a Bitcoin transaction, you are interacting with a distributed system. It is the science of "Scaling Up"—creating a machine that is so large and so resilient that it can serve millions of people at once and keep running even if dozens of its parts fail.
Remembering
- Distributed System — A collection of independent computers that appears to its users as a single coherent system.
- Node — A single computer or process within a distributed system.
- Scalability — The ability of a system to handle more work by adding more resources (e.g., adding more servers).
- Availability — The percentage of time a system is "Up" and responding to requests (e.g., "Five Nines" = 99.999% uptime).
- Fault Tolerance — The ability of a system to continue operating even when one or more components fail.
- Consensus — The process by which all nodes in a system agree on a single value or state (e.g., the Paxos or Raft algorithms).
- Latency — The "Delay" or time it takes for a message to travel from one node to another.
- Throughput — The amount of work (requests) a system can handle per second.
- Load Balancer — A device that acts as a "Traffic Cop," distributing incoming requests to different servers so no one server is overwhelmed.
- CAP Theorem — The fundamental rule that a distributed system can simultaneously guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance.
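The "Traffic Cop" role of a load balancer can be sketched in a few lines of Python. The `RoundRobinBalancer` class and server names below are illustrative inventions, not any real product's API:
<syntaxhighlight lang="python">
from itertools import cycle

class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""
    def __init__(self, servers):
        self._rotation = cycle(servers)

    def route(self, request):
        server = next(self._rotation)
        return f"{server} handles {request}"

lb = RoundRobinBalancer(["S1", "S2", "S3"])
for req in ["req-1", "req-2", "req-3", "req-4"]:
    print(lb.route(req))  # req-4 wraps back around to S1
</syntaxhighlight>
Round-robin is the simplest balancing policy; real load balancers also weigh servers by health and current load.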
Understanding
Distributed systems are understood through Coordination and Trade-offs.
1. The CAP Theorem (The "Choose Two" Rule): This is the most important law in distributed systems.
- Consistency (C): Every node sees the same data at the same time.
- Availability (A): Every request gets a response.
- Partition Tolerance (P): The system keeps working even if the network between nodes breaks.
- Because the network *always* breaks eventually (P), you have to choose between **C** and **A**. Do you want the answer to be "Perfectly Correct" (C) or "Fast and Always There" (A)?
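A toy sketch of that C-vs-A choice during a partition. The `Replica` class and its "CP"/"AP" modes are invented for illustration:
<syntaxhighlight lang="python">
class Replica:
    """One node's copy of a value, with a chosen side of the C-vs-A trade-off."""
    def __init__(self, mode, value):
        self.mode = mode          # "CP" or "AP"
        self.value = value
        self.partitioned = False  # cut off from the other nodes?

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            return "ERROR: unavailable during partition"
        # Availability first: answer with whatever we have, possibly stale.
        return self.value

cp, ap = Replica("CP", "balance=100"), Replica("AP", "balance=100")
cp.partitioned = ap.partitioned = True  # the network breaks: P happens
print(cp.read())  # ERROR: unavailable during partition
print(ap.read())  # balance=100 (fast, but possibly stale)
</syntaxhighlight>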
2. Horizontal vs. Vertical Scaling:
- Vertical (Scale Up): Buying a "Bigger" computer. (Expensive and has a limit).
- Horizontal (Scale Out): Buying "More" cheap computers. (This is how modern systems like Google and AWS work).
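A back-of-the-envelope comparison of the two strategies. The 0.9 coordination-efficiency factor is an illustrative assumption, not a measured constant:
<syntaxhighlight lang="python">
def vertical_capacity(base_rps, speedup):
    # Scale Up: one bigger box; capacity grows until the hardware tops out.
    return base_rps * speedup

def horizontal_capacity(base_rps, n_servers, efficiency=0.9):
    # Scale Out: many cheap boxes; near-linear growth minus coordination cost.
    return base_rps * n_servers * efficiency

print(vertical_capacity(1000, 4))      # 4000 requests/sec, then a hard ceiling
print(horizontal_capacity(1000, 100))  # 90000.0 requests/sec, and you can keep adding
</syntaxhighlight>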
3. The Fallacies of Distributed Computing: New developers often make dangerous assumptions that cause systems to fail:
- They assume the network is "Always Reliable."
- They assume "Latency is Zero."
- They assume "Bandwidth is Infinite."
- In a distributed system, you must "Design for Failure." You assume everything will break and build the software to handle it.
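One standard "Design for Failure" pattern is retrying with exponential backoff. A minimal sketch; the flaky-call helper below is an invented stand-in for a real network call:
<syntaxhighlight lang="python">
import time

def call_with_retries(operation, attempts=3, base_delay=0.1):
    """Assume the network call can break, and retry with exponential
    backoff instead of trusting it to succeed the first time."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s...

def make_flaky_call(fail_times):
    # Illustrative stand-in for a real RPC: fails a few times, then succeeds.
    state = {"left": fail_times}
    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("network dropped the request")
        return "OK"
    return call

print(call_with_retries(make_flaky_call(2)))  # OK, after two silent retries
</syntaxhighlight>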
Eventual Consistency: A compromise used by systems like Amazon or Facebook. If you update your profile, your friend in Australia might not see it for 1 second. The data is "Eventually" consistent, allowing the system to stay incredibly fast.
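A toy model of the profile-update example. The store below is invented for illustration; real eventually consistent systems (Amazon's Dynamo, for example) use far more sophisticated replication:
<syntaxhighlight lang="python">
class EventuallyConsistentStore:
    """Writes land on one replica immediately; the rest catch up later."""
    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # replication lag: updates not yet shipped out

    def write(self, key, value):
        self.replicas[0][key] = value      # fast local write, then return
        self.pending.append((key, value))  # replicate in the background

    def read(self, key, replica):
        return self.replicas[replica].get(key)  # may be stale!

    def sync(self):
        # The "background" replication finally runs.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("profile_pic", "new.jpg")
print(store.read("profile_pic", replica=2))  # None: Australia hasn't seen it yet
store.sync()                                 # ...a moment later
print(store.read("profile_pic", replica=2))  # new.jpg: eventually consistent
</syntaxhighlight>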
Applying
Modeling 'The Consensus Problem' (Agreement in a noisy room): <syntaxhighlight lang="python">
import random

def reach_consensus(nodes, value):
    """
    Simplified 'Majority Rule' consensus: each live node votes for the
    proposed value; the write succeeds only with a strict majority.
    """
    votes = []
    for node in nodes:
        # 10% chance a node is 'Down' and cannot vote
        if random.random() > 0.1:
            votes.append(value)
        else:
            votes.append("ERROR")
    # If more than 50% agree, we have an agreed state!
    if votes.count(value) > (len(nodes) / 2):
        return f"SUCCESS: System agreed on {value} with {votes.count(value)} votes."
    else:
        return "FAILURE: System could not reach consensus (Too many nodes down)."

# A cluster of 5 servers trying to save a database entry
print(reach_consensus(["S1", "S2", "S3", "S4", "S5"], "User_Balance=100"))
</syntaxhighlight>
Distributed Landmarks:
- The 'Google File System' (GFS) Paper (2003) → The document that changed the world by showing how to build a massive storage system using thousands of "Cheap, failing computers" instead of one expensive mainframe.
- Bitcoin (2008) → The first truly "Decentralized" system that solved the consensus problem without needing a central boss (using Proof of Work).
- Netflix Chaos Monkey → A tool used by Netflix that "Randomly Kills" their own servers in production to force their developers to build systems that never crash.
- The AOL Outage (1996) → A famous 19-hour blackout that showed the world how fragile distributed systems can be if they aren't designed for scale.
Analyzing
| Feature | Centralized (Mainframe) | Distributed (Cloud) |
|---|---|---|
| Point of Failure | Single (If it dies, all die) | Multiple (If one dies, others live) |
| Cost | High (Special hardware) | Low (Commodity hardware) |
| Complexity | Low (Easy to manage) | High (Hard to sync) |
| Capacity | Limited by one box | Effectively Infinite |
The Concept of "Sharding": Analyzing how to store a database that is too big for one disk. You "Slice" the data (e.g., users A-M on Server 1, N-Z on Server 2). This allows you to store petabytes of data, but makes "Searching" much more complex.
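The A-M / N-Z split above is range sharding; a common alternative is hash sharding, which spreads keys evenly instead of by alphabet. A minimal sketch:
<syntaxhighlight lang="python">
import hashlib

def shard_for(user_id, n_shards):
    # A stable hash spreads users evenly across shards, avoiding the
    # hot-spot risk of an alphabetical A-M / N-Z split.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# The same user always lands on the same shard, so lookups stay cheap;
# the hard part is queries that must touch every shard (e.g., "search all users").
print(shard_for("alice", 2), shard_for("zoe", 2))
</syntaxhighlight>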
Evaluating
Evaluating distributed systems:
- Complexity: Is the "Overhead" of managing 1,000 servers worth it for a small company? (The "Monolith vs. Microservices" debate).
- Debugging: How do you find a bug that only happens when Server A talks to Server B while Server C is slow? (The "Heisenbug" problem).
- Environment: Do massive data centers use too much electricity? (Some distributed systems now "Follow the Sun"—moving their work to whichever part of the world has the most solar power right now).
- Trust: In a "Public" distributed system (like Blockchain), how do you prevent "Byzantine" nodes from lying to the group?
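On that last question, the classical result (Lamport, Shostak, and Pease, 1982) is that tolerating f Byzantine (lying) nodes requires at least 3f + 1 nodes in total. A quick sanity-check function:
<syntaxhighlight lang="python">
def can_tolerate_byzantine(total_nodes, liars):
    # Classical bound: agreement is possible only if total >= 3 * liars + 1.
    return total_nodes >= 3 * liars + 1

print(can_tolerate_byzantine(4, 1))   # True: 4 honest-majority nodes outvote 1 liar
print(can_tolerate_byzantine(3, 1))   # False: 3 nodes cannot
print(can_tolerate_byzantine(10, 3))  # True
</syntaxhighlight>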
Creating
Future Frontiers:
- Edge Computing: Moving the "Brain" of the internet out of data centers and into your fridge, your car, and your watch to cut latency to nearly zero.
- Quantum Networking: Using "Entanglement" to build ultra-secure links between nodes (entanglement cannot transmit data faster than light, so it promises tamper-proof channels rather than "Instant" sync).
- Self-Healing Systems: AI that "Watches" a system and automatically spawns new nodes or rewrites code when it detects a bottleneck.
- Interplanetary Internet: Designing distributed systems that can handle the up-to-20-minute one-way latency between Earth and Mars.