Unveiling Complete Data Flow Systems: Architecture, Reliability, and Scalability
This article explains how modern data-intensive applications are built. It walks through a complete data-flow architecture (API requests, caching, database queries, change capture, search indexing, and message queues) and then turns to the core system concerns of reliability, scalability, and maintainability, offering practical insights for architects.
How Complete Data Flow Systems Operate
Modern applications focus on data‑intensive workloads; the main challenges are data scale, complexity, and velocity.
Typical Data Flow Architecture
Clients send API requests; read requests first check cache, returning if hit.
If cache miss, the request queries the database.
A change‑capture service listens to database changes, invalidates cache and builds search indexes.
Search requests retrieve IDs from a full‑text search system and then fetch records from the database.
Event‑driven messages (e.g., logging, notifications) are sent via MQ to asynchronous workers such as email senders.
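The read path above (check the cache first, fall back to the database on a miss, invalidate on change) is the cache-aside pattern. A minimal sketch follows; the in-memory dicts standing in for Redis and the primary database, and names like `get_user`, are illustrative assumptions, not any specific stack.

```python
# Cache-aside read path: check cache, fall back to the database on a miss,
# and let a change-capture hook invalidate stale entries.

cache = {}                                   # stands in for Redis/Memcached
database = {1: {"id": 1, "name": "alice"}}   # stands in for the primary store

def get_user(user_id):
    # 1) A read request first checks the cache.
    if user_id in cache:
        return cache[user_id]        # cache hit: return immediately
    # 2) Cache miss: query the database and populate the cache.
    record = database.get(user_id)
    if record is not None:
        cache[user_id] = record
    return record

def on_database_change(user_id, new_record):
    # 3) Change capture: after the store changes, invalidate the cached
    # copy so the next read repopulates it with fresh data.
    database[user_id] = new_record
    cache.pop(user_id, None)

get_user(1)                                    # miss -> load from DB, fill cache
on_database_change(1, {"id": 1, "name": "alice2"})  # write invalidates cache
```

The same change-capture hook is where a real system would also enqueue an index-update message for the full-text search service, keeping the cache and index eventually consistent with the database.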
Core Functional Components
Database: stores data persistently for later retrieval.
Cache: stores the results of expensive computations to accelerate subsequent reads.
Search Indexes: enable keyword-based lookup of data.
Stream Processing: continuously consumes and processes asynchronous cross-process messages.
Batch Processing: periodically processes large accumulated data sets.
Thinking About Data Systems
To the client, an application is a black box, just as the database is a black box to the application; each layer hides its implementation details behind an interface, and each layer raises different concerns.
System Concern Elements
1) Reliability and Availability
Reliability means the system continues to operate correctly despite hardware, software, or human errors.
Availability focuses on whether the system is up and responsive, often achieved through redundancy.
Example: a calculator service that returns 6 for 2 + 3 is unreliable; fixing the bug restores reliability.
Example: the same service can be reliable yet unavailable, for instance when it computes the correct answer but times out before responding.
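The two calculator examples can be made concrete with a minimal sketch. The function names, the injected bug, and the simulated stall are all illustrative assumptions:

```python
import time

def add_buggy(a, b):
    # Unreliable: the logic is wrong, so 2 + 3 yields 6.
    return a + b + 1

def add_fixed(a, b):
    # Reliable: correct result for every input.
    return a + b

def add_slow(a, b, timeout=0.01):
    # Reliable but unavailable: the logic is correct, yet the caller's
    # deadline expires before an answer arrives.
    start = time.monotonic()
    time.sleep(0.05)  # simulated overload / GC pause / network stall
    if time.monotonic() - start > timeout:
        raise TimeoutError("deadline exceeded")
    return a + b
```

The distinction matters operationally: a reliability bug is fixed in code, while an availability gap is usually closed with redundancy and failover rather than a code change.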
Ensuring data correctness and completeness requires handling hardware failures, software bugs, and configuration errors.
2) Scalability and Extensibility
Extensibility means the software can adapt to changing business requirements; scalability means the deployment can handle increased load by adding resources, either vertically or horizontally.
Load is described with quantitative metrics such as request and response volume, cache hit rate, and domain-specific figures such as ad volume. Performance is evaluated via resource usage (CPU, memory, network, I/O) and response-time percentiles (e.g., P99).
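Percentiles such as P99 can be computed with a simple nearest-rank sketch; the sample response times below are invented for illustration:

```python
def percentile(samples, p):
    # Nearest-rank percentile: sort the samples and pick the value at
    # rank ceil(p/100 * n). Adequate for dashboard-style metrics.
    ranked = sorted(samples)
    k = max(0, -(-len(ranked) * p // 100) - 1)  # ceiling via double negation
    return ranked[int(k)]

# Response times in milliseconds for ten requests; one slow outlier.
rts = [12, 15, 11, 14, 13, 250, 16, 12, 13, 15]

p50 = percentile(rts, 50)   # 13 ms: the typical request
p99 = percentile(rts, 99)   # 250 ms: the tail the slowest requests dominate
```

Note how the median barely moves while P99 is dominated entirely by the outlier, which is why tail percentiles, not averages, are the standard way to track user-visible performance.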
Latency vs. response time: latency is the time a request spends waiting to be handled, while response time is what the client actually observes, including network and queueing delays. The total response time is the sum of the stages a request passes through:
<code>RT = t1 + t2 + t3 + t4 + t5</code>
Scaling methods:
Vertical scaling: move to a more powerful machine.
Horizontal scaling: distribute load across multiple smaller machines (shared‑nothing architecture).
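In a shared-nothing architecture, each machine owns a disjoint slice of the keyspace, typically chosen by hashing the key. A minimal routing sketch, with node names as illustrative assumptions:

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]

def route(key: str) -> str:
    # CRC32 gives a stable hash across processes (Python's built-in
    # hash() is randomized per process). Deterministic routing means the
    # same key always lands on the same node, so machines share no state.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]
```

One caveat: modulo hashing reshuffles most keys whenever the node count changes, so production systems usually prefer consistent hashing or fixed partition maps to keep rebalancing cheap.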
3) Maintainability
Operability : enables operations teams to keep the system running smoothly.
Simplicity : reduces complexity so new engineers can understand the system quickly.
Evolvability : allows future changes to accommodate new use cases, also known as extensibility or modifiability.
Summary
Reliability ensures correct operation despite failures; fault‑tolerance hides failures from end users.
Scalability maintains performance under increased load, using quantitative load and performance metrics.
Maintainability creates better working conditions for engineers and operators through good abstraction, operability, and simplicity.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.