2023 Big Data Interview Guide: Hadoop, Hive, Doris, Data Warehouse Essentials
This comprehensive 2023 guide covers essential big‑data interview topics, providing detailed explanations and step‑by‑step processes for Hadoop HDFS read/write, YARN, Hive table types and optimizations, Doris architecture and data models, data‑warehouse layers, modeling techniques, quality monitoring, and classic algorithm design questions such as TOP‑K and duplicate detection.
This article presents a thorough interview preparation guide for big‑data engineers, covering Hadoop, Hive, Doris, and data‑warehouse concepts with concrete examples and detailed process flows.
Hadoop
HDFS write flow:
Client sends an upload request via RPC to the NameNode, which checks permissions and path conflicts.
Client splits the file into blocks (128 MB by default) and asks the NameNode for the DataNode list for the first block.
NameNode selects DataNodes based on network topology, rack awareness, and replication policy, returning their addresses.
Client establishes a pipeline with the first DataNode (A), which then connects to B and C, forming a chain.
Client streams the block to A in 64 KB packets; each packet is forwarded through the pipeline, and acknowledgments travel back to the client.
After a block finishes, the client requests a new block from the NameNode, which selects a new set of DataNodes.
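The block and packet arithmetic in the write flow can be sketched in a few lines of Python. This is only an illustration of the sizes quoted above (128 MB blocks, 64 KB packets); the function name `plan_write` is hypothetical, and real HDFS packets also carry headers and checksums, so the usable payload per packet is somewhat smaller.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default block size quoted in the write flow
PACKET_SIZE = 64 * 1024          # packet size quoted in the write flow


def plan_write(file_size, block_size=BLOCK_SIZE, packet_size=PACKET_SIZE):
    """Return a list of (block_index, block_bytes, packet_count) tuples."""
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last block may be shorter than the configured block size.
        block_bytes = min(block_size, file_size - offset)
        packets = math.ceil(block_bytes / packet_size)
        plan.append((index, block_bytes, packets))
        offset += block_bytes
        index += 1
    return plan


if __name__ == "__main__":
    for index, size, packets in plan_write(300 * 1024 * 1024):
        print(f"block {index}: {size} bytes in {packets} packets")
```

For a 300 MB file this yields two full blocks and one 44 MB tail block, and the client would ask the NameNode for a DataNode list once per block.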
HDFS read flow:
Client requests block locations from the NameNode via RPC.
NameNode returns a list of DataNode replicas, sorted by network proximity and node health (STALE nodes are placed later).
Client reads from the nearest DataNode; if the client itself hosts the block, it reads locally (short‑circuit read).
Data is read through a socket stream (FSDataInputStream) until the block is fully consumed.
If more blocks remain, the client repeats the request for the next block list.
Each packet read is verified with a checksum; on error the client retries from another replica.
All blocks are merged to reconstruct the original file.
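The replica ordering and checksum fallback in the read flow can be modeled roughly as follows. This is a simplified sketch, not the HDFS client: the `Replica` class and its `distance` field are hypothetical, and a single `zlib.crc32` over the whole payload stands in for the checksums HDFS verifies on smaller chunks.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Replica:
    node: str
    distance: int   # hypothetical metric: smaller means closer to the client
    stale: bool     # stale nodes are tried last
    data: bytes


def order_replicas(replicas):
    """Healthy nodes first, then nearest first."""
    return sorted(replicas, key=lambda r: (r.stale, r.distance))


def read_block(replicas, expected_crc):
    """Return (node, data) from the first replica whose checksum matches."""
    for replica in order_replicas(replicas):
        if zlib.crc32(replica.data) == expected_crc:
            return replica.node, replica.data
        # Checksum mismatch: fall through and try the next replica.
    raise IOError("all replicas failed checksum verification")


if __name__ == "__main__":
    good = b"hello hdfs"
    replicas = [
        Replica("dn-a", 0, False, b"hellx hdfs"),  # nearest, but corrupted
        Replica("dn-b", 2, False, good),
        Replica("dn-c", 1, True, good),            # stale, so tried last
    ]
    print(read_block(replicas, zlib.crc32(good)))
```

Here the nearest replica fails verification, so the read is served by `dn-b`; the stale node `dn-c` is never contacted even though it is closer.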
Additional Hadoop topics include handling corrupted blocks, DataNode failures, NameNode startup (first‑time formatting, loading fsimage, replaying edits), Secondary NameNode checkpointing, HA architecture with shared storage (QJM), fencing to avoid split‑brain, and small‑file impact on NameNode memory.
Hive
Hive distinguishes internal and external tables. Internal tables store data in the warehouse directory ( hive.metastore.warehouse.dir, default /user/hive/warehouse), while external tables keep data at user‑specified locations. Deleting an internal table removes both metadata and data; deleting an external table only removes metadata.
Hive supported indexes before 3.0, but they were rarely used; bitmap indexes were added in 0.8, and from 3.0 onward indexes were removed in favor of materialized views.
Performance optimizations include using ORC or Parquet storage with Snappy compression, adjusting parallelism, JVM reuse, map/reduce parameters, and disabling speculative execution. Small‑file problems can be mitigated by concatenate (RCFILE/ORC only), CombineHiveInputFormat, reducing the number of reducers, or using Hadoop Archive (HAR).
Doris
Doris is an MPP analytical database with a simple FE/BE architecture. FE stores metadata and plans queries; BE stores physical data with multiple replicas. FE roles include Leader, Follower (HA) and Observer (read‑only scaling).
Data models:
Aggregate – key columns and value columns; supports SUM, REPLACE, MAX, MIN aggregation.
Unique – a special case of Aggregate using REPLACE to enforce primary‑key uniqueness.
Duplicate – stores rows as‑is; key columns only define sorting order.
Example Aggregate table creation (excerpt):
CREATE TABLE IF NOT EXISTS example_db.example_tbl (
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`date` DATE NOT NULL COMMENT "Data load date and time",
`city` VARCHAR(20) COMMENT "User's city",
`age` SMALLINT COMMENT "User's age",
`sex` TINYINT COMMENT "User's gender",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time",
`cost` BIGINT SUM DEFAULT "0" COMMENT "User's total spending",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "User's maximum dwell time",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "User's minimum dwell time"
) AGGREGATE KEY(`user_id`,`date`,`city`,`age`,`sex`)
...;
Rollup creates pre-aggregated tables that are stored independently; in Duplicate models, Rollup only reorders columns for prefix-index optimization. Prefix indexes are inherent to the sorted storage order, so the column order chosen at table creation determines the index.
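The per-column aggregation declared in the example table (REPLACE, SUM, MAX, MIN over rows sharing the same key) can be emulated in plain Python to see what a query would return after two loads. This is only a model of the semantics, not Doris code: the sample rows are hypothetical, and REPLACE is assumed to keep the value from the later load.

```python
# Aggregation function per value column, mirroring the example table.
AGGREGATORS = {
    "last_visit_date": lambda old, new: new,   # REPLACE: later load wins (assumed)
    "cost": lambda old, new: old + new,        # SUM
    "max_dwell_time": max,                     # MAX
    "min_dwell_time": min,                     # MIN
}
KEY_COLUMNS = ("user_id", "date", "city", "age", "sex")


def aggregate(rows):
    """Merge rows that share the same key columns, in load order."""
    table = {}
    for row in rows:
        key = tuple(row[c] for c in KEY_COLUMNS)
        if key not in table:
            table[key] = {c: row[c] for c in AGGREGATORS}
            continue
        current = table[key]
        for column, fn in AGGREGATORS.items():
            current[column] = fn(current[column], row[column])
    return table


if __name__ == "__main__":
    key = {"user_id": 10001, "date": "2017-10-01", "city": "Beijing", "age": 20, "sex": 0}
    rows = [
        {**key, "last_visit_date": "2017-10-01 06:00:00", "cost": 20,
         "max_dwell_time": 10, "min_dwell_time": 10},
        {**key, "last_visit_date": "2017-10-01 07:00:00", "cost": 15,
         "max_dwell_time": 2, "min_dwell_time": 2},
    ]
    print(aggregate(rows))
```

The two input rows collapse into one: the later visit time is kept, the costs are summed, and the dwell times keep their maximum and minimum respectively.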
Materialized views pre‑compute query results, automatically stay in sync with the base table, and are matched during query planning. They support only single‑column aggregation functions and cannot be created on Unique tables for aggregation.
Data Warehouse
The article outlines the typical ODS → DWD → DWS → ADS pipeline. ODS uses Snappy compression and ORC format (≈10% of raw size). DWD performs data cleaning, null removal, sensitive-data masking, and dimension degeneration (folding dimension attributes into fact tables). DWS builds wide tables (60-100 columns) covering 70%+ of business metrics.
Fact table types include transaction facts, periodic snapshot facts, accumulating snapshot facts, and factless fact tables. Dimensional modeling is explained with Star, Snowflake, and Constellation schemas, highlighting their structures, advantages, and trade-offs.
Data drift handling strategies involve using multiple timestamp fields (modified, log, process, extract) and combining forward‑looking and backward‑looking windows to ensure complete data capture.
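One way to read the window idea is sketched below: widen the selection on the extraction timestamp so that rows extracted slightly before or after midnight are still seen, then assign rows to a day by their business timestamp. The field names `extract_time` and `proc_time`, the function `select_partition`, and the 15-minute tolerance are assumptions for illustration, not values given in the article.

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=15)  # assumed window width; tune per pipeline


def select_partition(records, day_start):
    """Return the records that belong to the day starting at day_start."""
    day_end = day_start + timedelta(days=1)
    # Step 1: widen the window on extract_time in both directions.
    candidates = [
        r for r in records
        if day_start - TOLERANCE <= r["extract_time"] < day_end + TOLERANCE
    ]
    # Step 2: keep only rows whose business time falls inside the day.
    return [r for r in candidates if day_start <= r["proc_time"] < day_end]


if __name__ == "__main__":
    records = [
        # Happened just before midnight, extracted just after: drifted row.
        {"id": "A", "proc_time": datetime(2023, 1, 1, 23, 59, 58),
         "extract_time": datetime(2023, 1, 2, 0, 0, 3)},
        # Happened and extracted on the next day.
        {"id": "B", "proc_time": datetime(2023, 1, 2, 0, 0, 1),
         "extract_time": datetime(2023, 1, 2, 0, 0, 4)},
        {"id": "C", "proc_time": datetime(2023, 1, 1, 12, 0, 0),
         "extract_time": datetime(2023, 1, 1, 12, 0, 5)},
    ]
    print([r["id"] for r in select_partition(records, datetime(2023, 1, 1))])
```

Partitioning on `extract_time` alone would push record A into the next day's partition; the widened window followed by the `proc_time` filter keeps it where it belongs.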
Data quality monitoring covers table‑level row counts, null‑value ratios, duplicate detection, and cross‑table volume comparison, with example SQL snippets for each check.
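A minimal sketch of these four checks, using Python's built-in `sqlite3` so it runs anywhere. The tables `ods_orders` and `dwd_orders`, their columns, and the sample rows are hypothetical; in a real warehouse the same queries would be run through Hive or another engine, possibly with dialect changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ods_orders (order_id INTEGER, user_id INTEGER, amount REAL);
CREATE TABLE dwd_orders (order_id INTEGER, user_id INTEGER, amount REAL);
INSERT INTO ods_orders VALUES (1, 10, 9.5), (2, NULL, 3.0), (2, NULL, 3.0), (3, 11, 7.0);
INSERT INTO dwd_orders VALUES (1, 10, 9.5), (3, 11, 7.0);
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Table-level row count.
row_count = scalar("SELECT COUNT(*) FROM ods_orders")

# Null-value ratio for one column.
null_ratio = scalar(
    "SELECT AVG(CASE WHEN user_id IS NULL THEN 1.0 ELSE 0.0 END) FROM ods_orders"
)

# Duplicate detection on the business key.
duplicate_ids = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM ods_orders GROUP BY order_id HAVING COUNT(*) > 1"
    )
]

# Cross-table volume comparison between adjacent layers.
volume_ratio = scalar(
    "SELECT (SELECT COUNT(*) FROM dwd_orders) * 1.0 / (SELECT COUNT(*) FROM ods_orders)"
)

if __name__ == "__main__":
    print(row_count, null_ratio, duplicate_ids, volume_ratio)
```

In practice each value would be compared against a threshold or against the previous day's figure, and an alert raised when it moves outside the expected range.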
Algorithm Design Questions
TOP‑K query frequency – three solutions: (1) hash‑partition queries into 10 files, count with an in‑memory hash map on a 2 GB machine, sort and merge; (2) use a trie or hash map if the distinct query set fits in memory; (3) distribute the hash‑partitioned files across a MapReduce cluster and merge results.
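The first solution can be sketched in memory as below, with lists standing in for the partition files. The function name `top_k` is hypothetical, and `zlib.crc32` is used as the partitioning hash only because it is deterministic across runs; any stable hash would do.

```python
import heapq
import zlib
from collections import Counter


def top_k(queries, k, partitions=10):
    """Return the k most frequent queries as (query, count) pairs."""
    # Step 1: hash-partition so every occurrence of a query lands in one bucket.
    buckets = [[] for _ in range(partitions)]
    for q in queries:
        buckets[zlib.crc32(q.encode("utf-8")) % partitions].append(q)

    # Step 2: count each bucket independently and keep its local top k.
    candidates = []
    for bucket in buckets:
        candidates.extend(Counter(bucket).most_common(k))

    # Step 3: merge the local winners into the global top k.
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


if __name__ == "__main__":
    queries = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
    print(top_k(queries, 2))
```

Because identical queries always hash to the same bucket, each bucket's counts are already final, so keeping only the local top k per bucket cannot drop a global winner.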
Finding non-duplicate integers among 250 million values that cannot fit in memory – two approaches: (1) a 2-bit bitmap (2^32 values × 2 bits = 1 GB for 32-bit integers) marking 00 = absent, 01 = seen once, 10 = seen multiple times, with 11 unused; after scanning, output the numbers with state 01; (2) split the data into smaller files, find unique numbers within each, sort, and merge while removing duplicates.
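The 2-bit bitmap approach can be sketched with a `bytearray` that packs four values into each byte. The function name `find_unique` and the small `universe` used in the demo are illustrative; for 32-bit integers the same layout would need 2^32 × 2 bits, i.e. 2^30 bytes.

```python
def find_unique(numbers, universe):
    """Return the values in [0, universe) that occur exactly once."""
    bitmap = bytearray((universe * 2 + 7) // 8)   # 2 bits per possible value

    def get(n):
        return (bitmap[n >> 2] >> ((n & 3) * 2)) & 0b11

    def put(n, state):
        shift = (n & 3) * 2
        cleared = bitmap[n >> 2] & ~(0b11 << shift) & 0xFF
        bitmap[n >> 2] = cleared | (state << shift)

    for n in numbers:
        state = get(n)
        if state == 0b00:
            put(n, 0b01)      # absent -> seen once
        elif state == 0b01:
            put(n, 0b10)      # seen once -> seen multiple times
        # 0b10 stays unchanged on further occurrences

    return [n for n in range(universe) if get(n) == 0b01]


if __name__ == "__main__":
    print(find_unique([3, 5, 3, 7, 5, 5, 9], 16))
```

The input is scanned once to fill the bitmap and the bitmap is scanned once to emit the answer, so memory use depends only on the size of the value range, not on how many numbers are read.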
Overall, the guide provides concrete command examples, code snippets, and step‑by‑step reasoning to help candidates master big‑data interview topics.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
