Design and Implementation of Log Parsing for a Big Data Offline Task Platform
This article describes a log‑parsing feature for Youzan's big‑data offline platform. The feature captures runtime logs from Hive, Spark, DataX, MapReduce, and HBase jobs; categorizes scheduling types; extracts metrics such as read/write bytes, shuffle volume, and GC time; and processes logs in real time through a Filebeat → Logstash → Kafka → Spark Streaming pipeline, storing results in Redis for monitoring, optimization, and resource‑usage ranking.
The article introduces a log parsing feature attached to the Youzan big data platform (data_platform), which processes runtime logs generated by offline big data tasks such as Hive, Spark, DataX incremental jobs, import/export, MapReduce, and HBase bulk operations.
The main goals are to provide visibility into different scheduling types (normal, test, manual import, batch rerun), to monitor task execution outcomes (success, failure, retries), and to quantify resource consumption (read/write bytes, shuffle volume, GC time).
Design Analysis
1. Task‑type specific log structures are identified. For Yarn‑scheduled tasks, metrics like total read/write volume, shuffle amount, and GC time are collected. Common fields such as start and end times are stored and later aggregated.
2. DataX tasks (Hive↔MySQL, MySQL↔ElasticSearch, etc.) have a uniform log format that records total records read, failure counts, execution time, source/target tables, and total bytes.
3. Scheduling types are categorized into normal, test, manual import, and batch rerun, allowing the system to tag each task’s status accordingly.
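Since DataX tasks share a uniform summary format, the counters can be pulled out with a single regular expression. The sketch below is a minimal illustration in Python; the field layout of the summary line is a placeholder, as the article does not show the actual DataX log text.

```python
import re

# Hypothetical summary-line layout; the real DataX log format may differ.
SUMMARY_PATTERN = re.compile(
    r"total records read:\s*(?P<records>\d+)\s*\|"
    r"\s*failed:\s*(?P<failed>\d+)\s*\|"
    r"\s*total bytes:\s*(?P<bytes>\d+)"
)

def parse_datax_summary(line):
    """Extract record/failure/byte counters from a DataX-style summary line."""
    m = SUMMARY_PATTERN.search(line)
    if m is None:
        return None
    return {k: int(v) for k, v in m.groupdict().items()}

sample = "JobStats | total records read: 120000 | failed: 3 | total bytes: 7340032"
print(parse_datax_summary(sample))
# {'records': 120000, 'failed': 3, 'bytes': 7340032}
```

Fields such as execution time and source/target tables would be captured the same way, with additional named groups.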
Architecture
Because task logs are generated in real time and the platform processes over ten thousand tasks per day, a high‑throughput streaming solution is required. After evaluating options, Spark Streaming was chosen for its fault‑tolerant, high‑throughput capabilities.
The log pipeline is as follows: Filebeat monitors log directories on the scheduler cluster, forwards logs to a Logstash cluster, which writes them to a Kafka topic. Spark Streaming consumes the topic, parses the logs in real time, and stores the results in a Redis cache for downstream statistics and queries.
Feature Implementation
1. Resource statistics are displayed, showing per‑task success, failure, and retry counts for a given day.
2. Failed and retried tasks are highlighted to facilitate task‑level optimization and reduce user effort.
3. Ranking of resource usage (e.g., GC time, shuffle volume) enables users to identify and optimize high‑impact metrics.
Important Considerations
- Spark standalone mode uses a fixed core allocation per task, which is unsuitable for multi‑user environments; therefore, Spark on Yarn with dynamic resource allocation is adopted.
- Yarn supports two submission modes: Client (driver runs on the submitting machine) and Cluster (driver runs inside the Yarn cluster). Cluster mode is recommended for stability and easier log management.
- The number of cores allocated to Spark Streaming must exceed the number of receivers; each receiver occupies one core.
- Spark provides reliable and unreliable receivers. A reliable receiver sends an acknowledgment back to the source once data has been received and stored in Spark, which guarantees no data loss; an unreliable receiver does not acknowledge.
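Putting the considerations above together, a submission command might look like the following. This is a sketch, not the platform's actual command: the jar name is a placeholder, and the `--conf` flags shown are standard Spark settings for Yarn Cluster mode with dynamic resource allocation.

```shell
# Submit the streaming job to Yarn in Cluster mode (driver runs inside the
# cluster), with dynamic allocation and its required external shuffle service.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  log-parsing-streaming.jar
```

Remember that each receiver pins one core, so the executor cores granted must exceed the receiver count or no cores will be left for processing.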
Additional reading links are provided for related practices such as Flink real‑time computation, SparkSQL usage, and HBase write‑throughput analysis.
Youzan Coder
Official Youzan tech channel, delivering technical insights and regular updates from the Youzan tech team.