A Crash Course on Distributed Systems

A distributed system is a collection of computers, also known as nodes, that collaborate to perform a specific task or provide a service.

These nodes are physically separate and communicate with each other by passing messages over a network. Distributed systems can span geographical boundaries, enabling them to utilize resources from different locations.

Distributed systems have several characteristics that distinguish them from traditional centralized systems:

Distributed systems are ubiquitous in our daily lives. Examples include large web applications like Google Search, online banking systems, and multiplayer games. These systems leverage the power of multiple computers working together to provide a seamless and responsive user experience.

In this post, we’ll explore the benefits and challenges of distributed systems. We will also discuss common approaches and techniques used to address these challenges and ensure the reliable operation of distributed systems.


Understanding Distributed Systems 

The term “distributed systems” can sometimes confuse developers. 

Two common points of confusion involve decentralized systems and parallel systems.

Let’s understand the meaning of these terms in the context of a distributed system and how they are similar or different.

Decentralized Systems vs. Distributed Systems

Terms like “Decentralized Systems” and “Distributed Systems” are often used interchangeably, but they have a key difference. 

While both types of systems involve multiple components working together, the decision-making process sets them apart. 

In a decentralized system, which is also a type of distributed system, no single component has complete control over the decision-making process. Instead, each component owns a part of the decision but does not possess the complete information required to make an independent decision.

Parallel Systems vs. Distributed Systems

Another term that is closely associated with distributed systems is parallel systems. 

Both distributed and parallel systems aim to scale up computational capabilities, but they achieve this goal using different means.

In parallel computing, multiple processors within a single machine perform multiple tasks simultaneously. These processors often have access to shared memory, allowing them to exchange data and efficiently coordinate their activities.

On the other hand, distributed systems are made up of multiple autonomous machines that do not share memory. These machines communicate and coordinate their actions by passing messages over a network. Each machine operates independently, contributing to the overall computation by performing its assigned tasks.

Key Benefits of Distributed Systems

While designing and building distributed systems can be more complex than traditional centralized systems, their benefits make the effort worthwhile.

Let's explore some of the key advantages of distributed systems:

Challenges of Distributed Systems

Operating distributed systems, however, comes with several challenges of its own.

Knowing about these challenges and the techniques to overcome them is the key to taking advantage of distributed systems.

Let’s explore the main challenges of distributed systems and the techniques to handle them.

Communication

In a distributed system, nodes need to communicate and coordinate with each other over a network to function as a cohesive unit. 

However, this communication is challenging due to the unreliable nature of the underlying network infrastructure.

The Internet Protocol (IP), which is responsible for delivering packets between nodes, only provides a "best effort" service. This means that the network does not guarantee the reliable delivery of packets.

Several issues can arise during packet transmission:

Building reliable communication on top of this unreliable foundation is a significant challenge.

Some key techniques used by distributed systems to handle these issues are as follows:

1 - Reliable Communication with TCP

The Transmission Control Protocol (TCP) provides reliable, in-order delivery of a byte stream between processes, making it a cornerstone of dependable data transmission in distributed systems.

It employs several key mechanisms to overcome the inherent unreliability of the network:

The diagram below shows the TCP 3-way handshake process that establishes the connection between a client and the server.
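In application code, this reliability is usually consumed through the operating system's socket API rather than implemented by hand. As a minimal sketch, the Python example below starts a tiny TCP echo server and connects a client to it; the address and port are placeholder values chosen only for this illustration.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 9000   # placeholder address used only for this sketch

# TCP (SOCK_STREAM) gives a reliable, in-order byte stream; acknowledgments
# and retransmissions are handled by the operating system's TCP stack.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)

def echo_once():
    conn, _ = server.accept()          # completes the TCP 3-way handshake
    with conn:
        conn.sendall(conn.recv(1024))  # echo the received bytes back

threading.Thread(target=echo_once, daemon=True).start()

with socket.create_connection((HOST, PORT), timeout=5) as client:
    client.sendall(b"ping")            # delivered reliably and in order
    print(client.recv(1024))           # b'ping'

server.close()
```

Everything related to acknowledgments, retransmissions, and ordering happens inside the kernel's TCP implementation; the application simply reads and writes a byte stream.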

2 - Securing Communication with TLS

While TCP ensures reliable communication over an unreliable network, it does not address the security aspects of data transmission. This is where the Transport Layer Security (TLS) protocol comes into play.

TLS is a cryptographic protocol that adds encryption, authentication, and integrity to the communication channel established by TCP.

TLS uses several mechanisms to secure the communication between the nodes:
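Application code typically gets these guarantees from a TLS library rather than implementing them directly. As a rough sketch, the example below uses Python's standard ssl module to wrap a TCP connection to a public HTTPS endpoint; the hostname is only an example.

```python
import socket
import ssl

# A minimal TLS client: the ssl module wraps a plain TCP socket and performs
# the TLS handshake (certificate verification, key exchange) on top of it.
hostname = "example.com"   # any public HTTPS endpoint works for this sketch

context = ssl.create_default_context()            # trusted CAs, sane defaults

with socket.create_connection((hostname, 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=hostname) as tls_sock:
        print(tls_sock.version())                 # e.g. 'TLSv1.3'
        print(tls_sock.getpeercert()["subject"])  # authenticated server identity
```

Once the handshake completes, the server's certificate has been verified against trusted certificate authorities, and all subsequent reads and writes on the socket are encrypted.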

3 - Service Discovery with DNS

In a distributed system, nodes need a mechanism to discover and communicate with each other. This is where the Domain Name System (DNS) comes into play, solving the service discovery problem.

DNS acts as the "phone book of the internet," providing a mapping between human-readable domain names and their corresponding IP addresses. It allows nodes to locate and connect to each other using easily memorable names instead of complex numerical IP addresses.

The diagram below shows the service discovery process between a consumer and a provider.

Under the hood, DNS is implemented as a distributed, hierarchical key-value store. It consists of a network of servers that work together to resolve domain names to IP addresses. 

The hierarchical structure of DNS enables efficient and scalable name resolution across the internet.
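From an application's point of view, DNS resolution is usually a single library call. The short sketch below asks the system resolver for the addresses behind a name; the hostname is only an example.

```python
import socket

# Resolve a human-readable name to IP addresses via the system's DNS resolver.
# Each result tuple carries the address family and the socket address.
for family, _, _, _, sockaddr in socket.getaddrinfo("example.com", 443,
                                                    proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr[0])   # e.g. AF_INET followed by an IPv4 address
```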

Coordination

Coordination between nodes is a critical challenge when building distributed systems.

Some key coordination-related problems are as follows:

Let’s look at some key techniques used to solve the coordination-related challenges of distributed systems:

1 - Failure Detection

In a distributed environment, it is impossible to definitively distinguish between a failed process and a process that is simply very slow in responding.

The reason for this ambiguity lies in the fact that network delays, packet loss, or temporary network partitions can cause a process to appear unresponsive, even though it may still be functioning correctly.

To address this challenge, distributed systems employ failure detectors. 

Failure detectors are components that monitor the status of processes and determine their availability. However, failure detectors must make a tradeoff between detection time and the rate of false positives.

If a failure detector is configured to detect failures quickly, it may incorrectly classify slow processes as failed, resulting in a higher false positive rate. On the other hand, if the failure detector is configured to be more conservative and allow more time for processes to respond, it may take longer to detect actual failures, leading to a slower detection time.
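A common way to implement this idea is a heartbeat-based detector: each node periodically announces that it is alive, and a node that stays silent for longer than a timeout is suspected to have failed. The toy Python sketch below illustrates the mechanism; the timeout value is arbitrary and directly encodes the trade-off described above.

```python
import time

# A toy heartbeat-based failure detector. A small timeout detects failures
# quickly but misclassifies slow nodes; a large timeout is safer but slower.
class HeartbeatFailureDetector:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}          # node id -> last heartbeat timestamp

    def record_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def is_suspected(self, node_id):
        last = self.last_heartbeat.get(node_id)
        if last is None:
            return True                   # never heard from this node
        return (time.monotonic() - last) > self.timeout

# Usage sketch: suspect any node that has been silent for more than 3 seconds.
detector = HeartbeatFailureDetector(timeout_seconds=3.0)
detector.record_heartbeat("node-a")
print(detector.is_suspected("node-a"))    # False right after a heartbeat
```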

2 - Event Ordering and Timing

Agreeing on the timing and order of events is a big coordination challenge in distributed systems.

One of the primary reasons for this challenge is the imperfect nature of physical clocks in distributed systems. Each node in the system has its own physical clock, typically based on a quartz crystal oscillator. These physical clocks are subject to drift and can gradually diverge from each other over time. Even small variations in clock frequencies can lead to significant discrepancies in the perceived time across nodes.

Moreover, achieving a total order of events in a distributed system requires coordination among the nodes. In a total order, all nodes agree on the exact sequence of events, regardless of the causal relationships between them. 

To address these challenges, distributed systems often rely on logical clocks and vector clocks to capture the causal ordering of events.

The diagram below shows the difference between logical Lamport clocks and vector clocks.
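To make the idea concrete, here is a minimal Lamport clock in Python. It is a sketch of the basic rules only: increment the counter on local and send events, and take the maximum of the local and received timestamps on receive.

```python
# A minimal Lamport logical clock: timestamps capture the "happened-before"
# relation without relying on (drifting) physical clocks.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        # attach the returned timestamp to the outgoing message
        self.time += 1
        return self.time

    def receive_event(self, message_timestamp):
        # take the max of the local and received time, then advance
        self.time = max(self.time, message_timestamp) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.send_event()          # a sends a message with timestamp 1
print(b.receive_event(ts))   # b's clock jumps to 2, preserving causal order
```

Vector clocks extend the same idea by keeping one counter per node, which makes it possible to detect that two events are concurrent rather than merely ordering them.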

3 - Leader Election

In distributed systems, many coordination tasks, such as holding a lock or committing a transaction, require the presence of a single "leader" process. 

The leader is responsible for managing and coordinating the execution of these tasks.

Raft is a widely adopted consensus algorithm that addresses the challenge of leader election in a distributed environment. It provides a mechanism for electing a leader among a group of processes, ensuring that there is at most one leader per term.

The Raft algorithm divides time into “terms,” which act as a logical clock: each term is a period in which at most one leader is elected and serves. Each term is assigned a unique, monotonically increasing number.

One of the key principles of Raft is that a candidate process can only become the leader if its log is sufficiently up-to-date. Concretely, each process grants its vote only to a candidate whose log is at least as current as its own, which guarantees that the elected leader’s log is at least as up-to-date as that of a majority of the cluster.

The diagram below shows a mapping of state changes (follower, candidate, and leader) that a node can go through as part of the Raft consensus algorithm.
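The “up-to-date” comparison itself is simple: compare the term of the last log entry first, and only fall back to the log length when the terms are equal. The Python sketch below is a simplified version of that vote-granting check, not a full Raft implementation.

```python
# Simplified Raft vote-granting check: a follower votes for a candidate only
# if the candidate's log is at least as up-to-date as its own, comparing the
# term of the last entry first and the log length second.
def candidate_log_is_up_to_date(candidate_last_term, candidate_last_index,
                                voter_last_term, voter_last_index):
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

# Example: the candidate's last entry is from a newer term, so it qualifies
# even though its log is shorter than the voter's.
print(candidate_log_is_up_to_date(candidate_last_term=5, candidate_last_index=10,
                                  voter_last_term=4, voter_last_index=12))  # True
```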

4 - Data Replication and Consistency

Keeping replicated data in sync across multiple nodes is a fundamental coordination challenge in distributed systems. 

Replication is essential for ensuring data availability and fault tolerance, but it introduces the complexity of maintaining consistency among the replicas.

The CAP theorem states that in a distributed system, it is impossible to provide consistency, availability, and partition tolerance simultaneously.

According to the theorem, a distributed system can guarantee at most two of the three properties at the same time. Since network partitions cannot be ruled out in practice, the real trade-off is between consistency and availability: when a partition occurs, a system must either maintain strong consistency at the cost of reduced availability or prioritize availability while accepting weaker consistency guarantees.

Distributed systems also support multiple consistency models that define the guarantees for read and write operations on replicated data.

Scalability

As discussed earlier, scalability is one of the key benefits of adopting distributed systems. A scalable system can increase its capacity to handle more load by adding resources.

However, choosing the right scalability patterns is also important.

Let’s look at a few key patterns for scaling a distributed system.

1 - Functional Decomposition

Functional decomposition means splitting a system along its business capabilities. A common example is breaking down a monolithic application into smaller, independently deployable services, where each service has its own well-defined responsibility and communicates with other services through APIs.

An API Gateway acts as a single entry point for external clients to interact with microservices. It handles request routing and composition by providing a unified interface to the clients.

Another functional decomposition approach is CQRS (Command Query Responsibility Segregation), a pattern that separates the read and write paths of an application.

This optimizes performance by allowing read and write operations to scale independently. CQRS is often used in conjunction with microservices to enable efficient data retrieval and modification.
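As a rough illustration of the idea, the sketch below routes writes through a command handler that updates a write model and a projection, while reads are served from a separate, query-optimized view. All of the names and data structures here are hypothetical.

```python
# A minimal illustration of CQRS: the command path updates the authoritative
# write model and refreshes a denormalized read view; the query path only
# ever touches the read view.
write_store = {}   # authoritative write model (e.g. normalized records)
read_view = {}     # denormalized view optimized for queries

def handle_create_order_command(order_id, items):
    # Command path: validate and persist the change, then update the read side.
    write_store[order_id] = {"items": items, "status": "created"}
    project_order(order_id)

def project_order(order_id):
    # Projection: rebuild the read-optimized view for this order.
    order = write_store[order_id]
    read_view[order_id] = f'{len(order["items"])} item(s), {order["status"]}'

def handle_get_order_query(order_id):
    # Query path: never touches the write model.
    return read_view.get(order_id)

handle_create_order_command("order-1", ["book", "pen"])
print(handle_get_order_query("order-1"))    # "2 item(s), created"
```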

The functional decomposition approach also relies on asynchronous messaging as a communication pattern. 

It helps decouple services, improving resilience and scalability. Services communicate through message queues or publish-subscribe systems, allowing them to process requests at their own pace and avoid overload.

2 - Partitioning

Partitioning is a fundamental technique used in distributed systems to split a large dataset across multiple nodes when it becomes too large to be stored and processed on a single node.

By distributing the data, partitioning enables horizontal scalability and allows the system to handle increasing data volumes.

The diagram below shows the concept of data partitioning where the book table is partitioned based on the category field.

There are several techniques used for partitioning data:
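One widely used technique is hash partitioning, where the partition for a record is derived from a stable hash of its key. The Python sketch below illustrates the idea; the partition count and keys are made-up values.

```python
import hashlib

# Toy hash partitioning: the partition for a record is derived from a stable
# hash of its key, spreading keys roughly evenly across N partitions.
NUM_PARTITIONS = 4   # assumed partition count for this sketch

def partition_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for book_id in ["book-101", "book-102", "book-103"]:
    print(book_id, "-> partition", partition_for(book_id))
```

A simple hash-modulo scheme like this reshuffles most keys whenever the number of partitions changes, which is why production systems often prefer consistent hashing.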

3 - Duplication

Duplication is an important technique used in distributed systems to scale capacity and increase availability by introducing redundant components.

By duplicating servers, data, and other resources, systems can handle higher loads and provide resilience against failures.

Some key techniques that enable duplication are as follows:

The diagram below shows how a load balancer helps distribute traffic between multiple service instances.
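As a minimal illustration, the sketch below implements round-robin routing, one of the simplest load-balancing strategies: each incoming request is sent to the next instance in turn. The instance addresses are made up for the example.

```python
import itertools

# A toy round-robin load balancer: incoming requests are spread across the
# duplicated service instances in turn.
instances = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
next_instance = itertools.cycle(instances)

def route_request(request_id):
    target = next(next_instance)
    return f"request {request_id} -> {target}"

for i in range(5):
    print(route_request(i))
```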

Resiliency

Resiliency is the ability of a system to continue functioning correctly in the face of failures. 

As distributed systems scale in size and complexity, failures become not just possible but inevitable. Some of the most common causes of failures are as follows:

There are two main categories of techniques for managing a component’s resiliency within a distributed system.

1 - Downstream Resiliency 

When services interact through synchronous request/response communication, it’s important to stop faults from propagating from one component or service to another.

Key techniques to make this possible are as follows:

The diagram below shows the circuit breaker pattern and the three states of the circuit breaker.
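To make the pattern concrete, here is a minimal circuit breaker in Python with the three classic states: CLOSED (calls pass through), OPEN (calls fail fast), and HALF_OPEN (a single trial call is allowed after a cool-down). The thresholds and timeouts are arbitrary example values, not a production implementation.

```python
import time

# A minimal circuit breaker. Repeated failures open the circuit so further
# calls fail fast; after a cool-down, one trial call decides whether to close it.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"           # allow a single trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                      # success closes the circuit
        return result
```

A caller would wrap each remote call in breaker.call(...), so that once a downstream dependency starts failing repeatedly, further calls fail immediately instead of piling up and exhausting resources.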

2 - Upstream Resiliency

Resiliency techniques are just as relevant on the service’s own side. There are multiple strategies to protect a service from being overwhelmed by incoming requests.

Let’s look at the most important ones:
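One of the most common strategies is rate limiting. The sketch below shows a small token-bucket rate limiter: each request consumes a token, and tokens refill at a fixed rate, so short bursts are absorbed while sustained overload is rejected. The capacity and refill rate are arbitrary example values.

```python
import time

# A small token-bucket rate limiter: requests consume tokens, and tokens
# refill at a fixed rate up to a maximum capacity.
class TokenBucket:
    def __init__(self, capacity=10, refill_rate_per_sec=5.0):
        self.capacity = capacity
        self.refill_rate = refill_rate_per_sec
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                       # reject (or queue) the request

limiter = TokenBucket()
print(limiter.allow_request())             # True while tokens remain
```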

Summary

In this article, we’ve explored distributed systems in great detail. We’ve understood the meaning of distributed systems, their benefits, and the challenges they create for developers.

Let’s briefly summarize the key takeaways: