Key Big Data Terminology: Offline vs Real-time Computing, Real-time vs Ad Hoc Queries, OLTP vs OLAP, Row vs Column Storage
This article explains fundamental big‑data concepts by comparing offline (batch) and real‑time (stream) computing, distinguishing real‑time queries from ad‑hoc queries, clarifying OLTP versus OLAP workloads, and outlining the differences between row‑based and column‑based storage architectures.
01 Offline Computing vs Real-time Computing
Offline computing (batch processing) operates on bounded, static data sets and tolerates high latency, which makes it suitable for periodic jobs such as reports; frameworks include MapReduce and Spark SQL. Real-time computing (stream processing) handles continuously arriving data with low latency and is used for streaming ETL and monitoring; frameworks include Spark Streaming (micro-batch) and Flink (event-driven).
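To make the contrast concrete, here is a minimal sketch in plain Python rather than in MapReduce, Spark, or Flink. The sample events and the function names are invented for illustration. The batch function sees the whole data set before it starts, while the stream function updates running state and emits a result after every event.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample events: (user, page) pairs, invented for illustration.
EVENTS: List[Tuple[str, str]] = [
    ("alice", "/home"),
    ("bob", "/cart"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def batch_page_counts(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """Batch style: the full, bounded data set exists before the job starts."""
    return dict(Counter(page for _, page in events))


def stream_page_counts(
    events: Iterable[Tuple[str, str]],
) -> Iterator[Dict[str, int]]:
    """Stream style: update running state per event and emit a snapshot each time."""
    counts: Counter = Counter()
    for _, page in events:
        counts[page] += 1
        yield dict(counts)  # a result is available right after each event


batch_result = batch_page_counts(EVENTS)
stream_snapshots = list(stream_page_counts(iter(EVENTS)))
```

Once the stream has consumed every event, its last snapshot matches the batch result; the difference is that intermediate results were available along the way instead of only at the end of the job.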
02 Real-time Query vs Ad Hoc Query
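A real-time query (online query) returns fresh data with very low latency, often through an API; HBase, for example, provides low-latency access. An ad hoc query is an interactive, SQL-based query written on the spot against a data warehouse, using engines such as Hive, Impala, or Presto. Unlike a real-time query, its access pattern is not known in advance, and it typically tolerates longer response times.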
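The two query styles can be sketched with the Python standard library alone. This is an illustration under stated assumptions: a dictionary stands in for a key-value store such as HBase, an in-memory SQLite database stands in for a warehouse SQL engine, and the order records and function names are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical order records: (order_id, customer, item, amount).
ORDERS: List[Tuple[str, str, str, float]] = [
    ("o1", "alice", "book", 12.5),
    ("o2", "bob", "laptop", 900.0),
    ("o3", "alice", "pen", 2.5),
]

# Real-time (online) query: the access pattern is fixed in advance,
# so the data is organized for direct lookup by key.
orders_by_id = {row[0]: row for row in ORDERS}


def get_order(order_id: str) -> Optional[Tuple[str, str, str, float]]:
    """Point lookup by key, the kind of call an API endpoint would make."""
    return orders_by_id.get(order_id)


# Ad hoc query: the question is not known in advance, so the analyst
# writes SQL on the spot and the engine scans and aggregates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT, customer TEXT, item TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", ORDERS)


def ad_hoc(sql: str) -> list:
    """Run an arbitrary analyst-written SQL statement and return all rows."""
    return conn.execute(sql).fetchall()


spend_per_customer = ad_hoc(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
)
```

The lookup function answers only the one question it was built for, while the SQL function accepts any question that can be expressed over the table.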
03 OLTP vs OLAP
OLTP (Online Transaction Processing) supports frequent, short transactional operations (insert, update, delete) with strong consistency, as is typical for banking or order systems. OLAP (Online Analytical Processing) supports complex analytical queries for decision support and is often implemented with real-time OLAP stores such as Apache Druid or ClickHouse.
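A small sketch can show both workload shapes against the same table. It uses Python's built-in sqlite3 module only as a stand-in; the accounts table, its contents, and the function names are assumptions made for illustration. The transfer is an OLTP-style transaction that either applies both updates or neither, and the report is an OLAP-style aggregate over all rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, region TEXT, "
    "balance INTEGER CHECK (balance >= 0))"
)
# Hypothetical accounts, invented for illustration.
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("a", "east", 100), ("b", "east", 50), ("c", "west", 70)],
)
conn.commit()


def transfer(src: str, dst: str, amount: int) -> bool:
    """OLTP style: a short transaction touching a few rows, all or nothing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back.
        return False


def balance(account_id: str) -> int:
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def balance_by_region() -> list:
    """OLAP style: read-only aggregation across the whole table."""
    return conn.execute(
        "SELECT region, SUM(balance), COUNT(*) FROM accounts "
        "GROUP BY region ORDER BY region"
    ).fetchall()


ok = transfer("a", "b", 30)        # succeeds: a=70, b=80
rejected = transfer("c", "a", 500)  # fails: c would go negative, nothing changes
report = balance_by_region()
```

In the rejected transfer the credit to the destination account runs first and is then undone by the rollback, which is the consistency guarantee the OLTP description refers to.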
04 Row‑based Storage vs Column‑based Storage
Row-based storage (e.g., MySQL, Oracle) stores each complete record together, which favors write performance and OLTP workloads but incurs higher read I/O when a query needs only a few columns. Column-based storage (e.g., the Parquet file format or the Arrow in-memory format) stores each column separately, which optimizes read-heavy OLAP queries through column pruning and compression.
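The effect of the two layouts can be sketched in plain Python. This is a toy model, not how MySQL, Parquet, or Arrow are implemented: the sample rows, the column names, and the "values read" counter are assumptions chosen to illustrate column pruning, and run-length encoding is used as one simple example of a compression scheme that benefits from storing similar values together.

```python
from typing import Dict, List, Tuple

# Hypothetical records: (id, region, amount).
ROWS: List[Tuple[int, str, int]] = [
    (1, "east", 10),
    (2, "east", 20),
    (3, "west", 30),
    (4, "west", 40),
]
COLUMN_NAMES = ("id", "region", "amount")


def to_columns(rows: List[Tuple[int, str, int]]) -> Dict[str, list]:
    """Pivot the row layout into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}


def sum_amount_row_store(rows: List[Tuple[int, str, int]]) -> Tuple[int, int]:
    """Row layout: every field of every record is fetched to sum one column."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # the whole record comes along
        total += row[2]
    return total, values_read


def sum_amount_column_store(columns: Dict[str, list]) -> Tuple[int, int]:
    """Column layout: only the 'amount' column is touched (column pruning)."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    encoded: List[Tuple[object, int]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


columns = to_columns(ROWS)
row_total, row_values_read = sum_amount_row_store(ROWS)
col_total, col_values_read = sum_amount_column_store(columns)
encoded_region = run_length_encode(columns["region"])
```

Both layouts return the same sum, but the row layout reads three values per record while the column layout reads one, and the region column shrinks from four stored values to two runs.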
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies