Data Platform vs Backend Architecture: Benefits of Moving Functionality to a Data Platform
The article explains why shifting batch jobs, reporting, and machine‑learning model training from traditional backend services to a dedicated data platform can simplify development, improve fault tolerance, and scale analytics, using real‑world examples from Spotify and best‑practice guidelines.
Modern tech stacks usually include at least a frontend and a backend, but they quickly evolve to require a data platform for analytics, reporting, cron jobs, dashboards, and batch data replication.
Typical data‑platform workloads have relaxed latency requirements (results can arrive up to 24 hours later) and are expressed as batch jobs over large datasets rather than per‑request operations. Examples include nightly transaction imports for accounting systems and periodic retraining of fraud‑detection models.
At Spotify, the data platform started with royalty reports and grew into a nightly pipeline that rebuilds personalized recommendations and retrains core models every few weeks.
1 Why is it complicated?
Using a data platform can simplify product building and delivery roughly tenfold: it relaxes latency concerns, gives control over data flow, enables fault‑tolerant (idempotent) batch processing, offers higher efficiency for large‑scale operations, and allows easy recovery from failures.
For instance, building a global headline service that updates hourly via a data‑platform cron job is far easier than implementing real‑time updates directly in the backend.
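As a sketch of that pattern (all names here are illustrative, not from the article), an hourly cron job could aggregate the last hour of click logs and publish a static headline list that the backend serves verbatim, with no real‑time computation at request time:

```python
import json
from collections import Counter

def build_headlines(click_events: list[dict], top_n: int = 3) -> list[str]:
    # Rank articles by click count over the last hour of events.
    counts = Counter(event["article_id"] for event in click_events)
    return [article_id for article_id, _ in counts.most_common(top_n)]

def publish(headlines: list[str], path: str) -> None:
    # Write a static artifact; the backend serves it as-is until the next run.
    with open(path, "w") as f:
        json.dump({"headlines": headlines}, f)

clicks = [
    {"article_id": "a1"}, {"article_id": "a2"},
    {"article_id": "a1"}, {"article_id": "a3"},
    {"article_id": "a1"}, {"article_id": "a2"},
]
top = build_headlines(clicks)
publish(top, "headlines.json")
print(top)  # ['a1', 'a2', 'a3']
```

If the job fails, the backend keeps serving the previous artifact and the job can simply be rerun, which is exactly the fault tolerance the article describes.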
2 Have you done it with minimal tricks?
Backend architecture best practices (e.g., avoiding shared databases, keeping queries simple, using transactions, extensive unit and integration testing, and breaking monoliths into micro‑services) often become unnecessary when moving functionality to a data platform.
3 Data side: "Wild West"
Typical data pipelines start by shipping backend logs and database dumps into storage, historically Hadoop HDFS and now often cloud data warehouses such as Amazon Redshift.
4 Data latency considerations
Data in a platform is expected to be delayed, often by 24 hours or more. Running cron jobs directly against production databases recreates the integrated‑database anti‑pattern, so delayed batch jobs should be kept strictly separate from real‑time endpoints.
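One common way to enforce that separation (a sketch; the dated‑partition path layout is an assumption, not something the article specifies) is to have batch jobs read an immutable, dated snapshot rather than the live production database:

```python
from datetime import date, timedelta

def snapshot_path(base: str, day: date) -> str:
    # Batch jobs read yesterday's immutable dump, never the live OLTP database.
    return f"{base}/dt={day.isoformat()}/users.json"

run_date = date(2024, 1, 2)            # the day the cron job runs
target = run_date - timedelta(days=1)  # data is expected to be ~24h delayed
print(snapshot_path("s3://warehouse/users", target))
# s3://warehouse/users/dt=2024-01-01/users.json
```

Because the snapshot never changes after it is written, rerunning a failed job against the same date produces the same result.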
5 Integrated databases
While traditional backend architecture discourages services from sharing databases, in the data world it can be acceptable for a single query to combine three distinct datasets: schema changes only require query updates, failed runs can be fixed and rerun, and the queries are read‑only.
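A hedged sketch of why this is tolerable (all dataset and field names are illustrative): the job below joins three datasets read‑only, so a schema change is fixed with a query edit, and a failed run is simply rerun because the inputs are immutable snapshots:

```python
def enrich_transactions(transactions: list[dict], users: dict, fx_rates: dict) -> list[dict]:
    # Read-only join across three datasets. Nothing is mutated, so the job
    # is idempotent and safe to rerun after a failure.
    enriched = []
    for tx in transactions:
        user = users.get(tx["user_id"], {})
        rate = fx_rates.get(tx["currency"], 1.0)
        enriched.append({
            **tx,
            "country": user.get("country", "unknown"),
            "amount_usd": round(tx["amount"] * rate, 2),
        })
    return enriched

txs = [{"user_id": "u1", "currency": "EUR", "amount": 10.0}]
users = {"u1": {"country": "SE"}}
fx = {"EUR": 1.1}
print(enrich_transactions(txs, users, fx))
```

The same three‑way join inside a backend request path would be considered a coupling smell; here, with delayed read‑only batch inputs, it is routine.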
6 Large queries
Backend systems serve low‑latency, low‑throughput queries scoped to a single user, and therefore optimize by avoiding joins, keeping indexes simple, and targeting specific IDs; data platforms instead run large analytical scans (OLAP) across massive tables.
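The contrast can be sketched as two access patterns over the same data (Python stand‑ins, with hypothetical field names, for a keyed OLTP lookup versus an OLAP full scan):

```python
# OLTP-style point query: one row by key, expected to return in milliseconds.
def get_order(orders_by_id: dict, order_id: str) -> dict:
    return orders_by_id[order_id]

# OLAP-style analytical query: scan every row and aggregate;
# latency measured in minutes or hours is acceptable.
def revenue_by_country(orders: list[dict]) -> dict:
    totals: dict = {}
    for order in orders:
        totals[order["country"]] = totals.get(order["country"], 0) + order["amount"]
    return totals

orders = [
    {"id": "o1", "country": "US", "amount": 5},
    {"id": "o2", "country": "SE", "amount": 7},
    {"id": "o3", "country": "US", "amount": 3},
]
print(revenue_by_country(orders))  # {'US': 8, 'SE': 7}
```

A storage engine tuned for the first pattern (row lookups by key) is a poor fit for the second (whole‑table aggregation), which is why the two belong on different systems.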
7 Testing
Testing backend functions is straightforward, but testing data pipelines is hard due to high‑dimensional inputs, nondeterministic ML models, and subjective outputs, leading to low test fidelity and high maintenance cost.
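One pragmatic response (a sketch of common practice, not a prescription from the article) is to keep the deterministic transformation steps of a pipeline pure and test those on tiny fixtures, while leaving model quality to separate offline evaluation:

```python
def normalize_events(raw: list[dict]) -> list[dict]:
    # Pure, deterministic step: no I/O, no randomness, so a tiny fixture
    # with an exact expected output is a meaningful test.
    return [
        {"user": event["user"].strip().lower(), "plays": int(event["plays"])}
        for event in raw
        if event.get("user", "").strip()
    ]

# Fixture test: deterministic input, exact expected output.
fixture = [{"user": "  Alice ", "plays": "3"}, {"user": "", "plays": "9"}]
assert normalize_events(fixture) == [{"user": "alice", "plays": 3}]
print("fixture test passed")
```

The nondeterministic parts (model training, ranking quality) are then judged with held‑out metrics rather than unit assertions, which limits the low‑fidelity tests the article warns about.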
8 Conclusion
Moving as many functions as possible—non‑transactional emails, search index generation, recommendations, reporting, data for business users, and ML model training—to a data platform run as cron jobs reduces backend code complexity by roughly an order of magnitude.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as adapting architectures with internet technologies. Idea‑driven architects who enjoy sharing are welcome to exchange and learn together.