
Key Characteristics and Design Principles of Distributed Systems

In use since the 1970s and popularized in the internet era by companies like Google, distributed systems offer scalable, cost-effective, and fault-tolerant architectures built from many low-cost servers. They emphasize horizontal scaling, avoidance of single points of failure, minimal inter-node communication, and stateless services to enable elastic application deployment.

Art of Distributed System Architecture Design

Distributed systems are not new; various distributed systems appeared in the 1970s and 1980s, but it was only in the internet era that they truly shone. Google in particular applied them to the extreme: its entire software stack is built on distributed systems such as Borg, MapReduce, and BigTable, which let it handle high-concurrency request serving and massive data processing. Apache projects such as Hadoop, Spark, and Mesos have since made big-data processing technologies widely accessible, allowing many more enterprises to experience the benefits of distributed systems.

1. Characteristics of Distributed Systems

The biggest characteristic of distributed systems is scalability; they can expand to meet changing demands. Enterprise‑level applications often evolve over time, requiring platforms that can handle increasing traffic, concurrent users, and data volume. Distributed systems achieve this by adding more servers to boost overall processing capacity, supporting high concurrency and massive data handling.

The core idea is to let multiple servers cooperate to accomplish tasks that a single server cannot handle, especially high‑concurrency or large‑data workloads. A distributed system consists of independent servers loosely coupled via a network. Each server is an independent PC, connected by a fast internal network. Because communication between nodes incurs overhead, designs aim to minimize inter‑node communication. Since network latency dominates, the performance of a single node has limited impact on overall system performance, allowing the use of ordinary PCs rather than high‑performance machines. Overall performance is improved through horizontal scaling (adding more servers) rather than vertical scaling (upgrading individual servers).

Distributed systems are cheap and efficient: clusters built from low‑cost PC servers can match or exceed the performance of mainframes at a fraction of the cost. Although PC servers are less reliable hardware‑wise, software provides fault tolerance and ensures high overall reliability.

The greatest benefit is elastic scaling at the application‑service layer. While public‑cloud IaaS providers offer elastic scaling of compute resources, enterprises need elastic scaling of application services themselves—adjusting the number of service instances according to business load. For example, a short‑video sharing app may need thousands of backend instances during peak hours and only dozens during off‑peak; a distributed system can dynamically schedule these instances.
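The scheduling decision described above can be sketched as a simple capacity calculation. This is a hypothetical illustration, not any particular scheduler's API: the function name, per-instance capacity, and the min/max bounds are all assumptions chosen for the example.

```python
import math

def target_instances(requests_per_sec: float,
                     capacity_per_instance: float,
                     min_instances: int = 20,
                     max_instances: int = 2000) -> int:
    """Pick how many service instances to run for the current load,
    clamped to illustrative lower and upper bounds."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))

# Peak hours for the short-video app need thousands of instances;
# off-peak traffic falls back to the configured minimum.
print(target_instances(500_000, capacity_per_instance=300))  # peak
print(target_instances(3_000, capacity_per_instance=300))    # off-peak
```

A real platform (e.g. a Borg-style scheduler or Kubernetes autoscaler) would drive this from measured metrics rather than a single rate, but the core idea is the same: instance count follows business load.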

2. Design Principles of Distributed Systems

Below are several design principles for distributed systems as understood by the author.

1. Low hardware requirements for servers

This principle includes two aspects:

No strict reliability requirements; hardware failures are tolerated and handled by software fault tolerance, making high reliability a software responsibility.

No strict performance requirements; high-frequency CPUs, large memory, or high-performance storage are unnecessary because the bottleneck typically lies in network communication rather than single-machine speed.

Consequently, large data centers of internet companies typically use many inexpensive PC servers instead of a few high‑performance machines to reduce costs. For example, Google’s data‑center design eliminates unnecessary components to minimize power consumption.

2. Emphasis on horizontal scalability (Scale Out)

Horizontal scalability means increasing the number of servers to improve overall cluster performance, whereas vertical scalability means enhancing each server’s performance. Since network overhead is the main bottleneck, vertical scaling quickly reaches its limit, while horizontal scaling can continue to add servers, often achieving near‑linear performance gains. For instance, expanding a 10‑server cluster to 100 similar servers can increase performance roughly tenfold.
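The "near-linear" qualifier above can be made concrete with Amdahl's law: if some fraction of the total work (coordination, network communication) cannot be parallelized, speedup from adding servers falls short of linear. The 0.1% serial fraction below is an illustrative assumption.

```python
def speedup(servers: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup over one server when serial_fraction
    of the work cannot be spread across servers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / servers)

# With only 0.1% unparallelizable overhead, growing from 10 to 100
# servers yields close to (but slightly under) the ideal tenfold gain.
s10 = speedup(10, serial_fraction=0.001)
s100 = speedup(100, serial_fraction=0.001)
print(round(s100 / s10, 2))
```

This is also why the earlier principle of minimizing inter-node communication matters: the larger the serial/communication fraction, the sooner horizontal scaling stops paying off.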

In practice, a large internet data center can scale horizontally to tens of thousands of servers. Google’s basic unit, a CELL, consists of about twenty thousand servers managed by the Borg system, and multiple CELLS compose a data center.

3. No single point of failure (No Single Point of Failure)

A single point of failure occurs when a service runs only one instance on a single server; if that server fails, the service becomes unavailable. Distributed systems assume any server may fail at any time, so each service runs multiple instances on different nodes, and data is replicated across nodes, dramatically reducing the chance that all instances of a service or all copies of data fail simultaneously.
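A back-of-envelope calculation shows why replication reduces risk so dramatically. Assuming failures are independent with probability p per server (an idealization; correlated failures such as rack power loss are ignored here), the chance that all n replicas fail simultaneously is p to the power n:

```python
def all_replicas_fail(p: float, n: int) -> float:
    """Probability that every one of n independent replicas fails,
    given each fails with probability p."""
    return p ** n

# One copy on a server with a 1% failure chance: 1 in 100.
print(all_replicas_fail(0.01, 1))
# Three replicas on independent servers: roughly 1 in a million.
print(all_replicas_fail(0.01, 3))
```

Real systems also re-replicate data after a failure is detected, shrinking the window in which a second or third loss could occur.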

Additionally, servers should not run at full load for extended periods, as this raises failure probability. Distributed systems therefore spread load across many low‑performance PCs to keep individual server utilization moderate and maintain overall stability.

4. Minimize inter‑node communication overhead

Since network transmission is much slower than local memory or disk access, reducing communication overhead significantly improves performance. A typical example is Hadoop MapReduce, which schedules computation on the nodes where the data resides, avoiding costly data transfer over the network.
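The MapReduce programming model mentioned above can be sketched in a single process as a word count, the canonical example. This is only an illustration of the map/shuffle/reduce phases; in a real Hadoop cluster each map task runs on the node that already stores its input split, so only the small intermediate key/value pairs cross the network.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```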

5. Prefer stateless application services

Service state refers to in‑memory data generated while handling requests. Stateless services store necessary data in external storage (e.g., Redis, Memcached) rather than in the process memory, allowing any instance to be restarted without losing state and facilitating fault‑tolerant recovery.

For example, a website can keep user session information in Redis, so backend instances do not retain login state; if an instance crashes, other instances can continue serving users without loss of session data.
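The stateless pattern above can be sketched as follows. A plain dict stands in for the external session store (Redis or Memcached) so the example stays self-contained; the function names and data layout are illustrative assumptions, not a real framework's API.

```python
import uuid

# Stand-in for an external store such as Redis: session_id -> user data.
# In production this lives outside every service instance.
session_store = {}

def login(user: str) -> str:
    """Create a session in the shared store and return its id."""
    session_id = uuid.uuid4().hex
    session_store[session_id] = {"user": user}
    return session_id

def handle_request(session_id: str) -> str:
    """Any instance can serve this request: no login state is kept in
    process memory, only what is read back from the shared store."""
    session = session_store.get(session_id)
    return f"hello {session['user']}" if session else "please log in"

sid = login("alice")
print(handle_request(sid))       # hello alice
print(handle_request("bogus"))   # please log in
```

Because an instance holds no session state of its own, it can crash and be replaced, or the load balancer can route each request to a different instance, without the user noticing.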

In summary, distributed systems are the preferred platform for enterprise applications in the big‑data era, offering excellent scalability—especially horizontal scaling—flexibility to meet diverse business needs, reduced hardware requirements, and true elastic scaling of application services.

Disclaimer: The content of this article is sourced from publicly available internet channels; the author remains neutral and provides it for reference and discussion only. Copyright belongs to the original author or organization; please contact us for removal if any infringement occurs.
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
