
Ensuring Doris Stability in HuoLala's Big Data Platform: Practices and Lessons

This article presents HuoLala's practical approach to guaranteeing the stability of the Doris OLAP engine within its large‑scale big data platform, covering background, challenges, case studies, capability building, process standards, and future planning.

DataFunTalk

Background: HuoLala, founded in 2013, operates a large‑scale logistics platform with 680,000 active drivers and 9.5 million active users. Its big data platform follows a multi‑layer architecture spanning data collection, storage, computation, and service layers.

Doris adoption: Since early 2022, Doris has served as the core OLAP engine for both the real‑time and offline data warehouses, supporting A/B testing, user profiling, growth analysis, and visualization platforms.

Stability challenges: The business demands highly reliable service, open‑source Doris does not fully cover production needs, and recurring issues include query overload, import delays, data quality problems, out‑of‑memory (OOM) errors after upgrades, and failures triggered by schema changes.

Stability goals: Maintain core‑link data accuracy above 99.45%. Detect core issues within 5 minutes. Recover P0 issues within 5 minutes and P1 issues within 10 minutes.
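The stability goals above reduce to simple threshold checks. The Python sketch below shows one way to test a reconciliation count and an incident record against them. The `Incident` record, its field names, and the assumption that the recovery clock starts at detection are hypothetical; the article does not describe HuoLala's tooling.

```python
from dataclasses import dataclass

# Targets quoted in the stability goals.
ACCURACY_BASIS_POINTS = 9945            # 99.45%, in basis points (1/10,000)
DETECT_LIMIT_MIN = 5                    # core issues detected within 5 minutes
RECOVER_LIMIT_MIN = {"P0": 5, "P1": 10}  # recovery limits by priority


@dataclass
class Incident:
    """Hypothetical incident record.

    detected_after_min: minutes from onset until the issue was detected.
    recovered_after_min: minutes from detection until service recovered
    (assumption: the article does not say where the recovery clock starts).
    """
    priority: str
    detected_after_min: float
    recovered_after_min: float


def accuracy_ok(matching_rows: int, total_rows: int) -> bool:
    """True if matching_rows / total_rows is strictly above 99.45%.

    Integer arithmetic avoids float rounding exactly at the boundary.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matching_rows * 10_000 > ACCURACY_BASIS_POINTS * total_rows


def incident_violations(incident: Incident) -> list[str]:
    """List which targets (detection, recovery) an incident missed."""
    missed = []
    if incident.detected_after_min > DETECT_LIMIT_MIN:
        missed.append("detection")
    limit = RECOVER_LIMIT_MIN.get(incident.priority)
    if limit is not None and incident.recovered_after_min > limit:
        missed.append("recovery")
    return missed


if __name__ == "__main__":
    print(accuracy_ok(9_950, 10_000))                  # True: 99.50% > 99.45%
    print(incident_violations(Incident("P1", 3, 12)))  # ['recovery']
```

Recording each incident in a structured form like this is what makes the later step of quantifying and reviewing stability targets mechanical rather than anecdotal.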

Case studies: (1) query performance degradation caused by oversized or excessively numerous queries; (2) import delays caused by task timeouts and out‑of‑order tasks; (3) data mismatches introduced by Spark Load imports; (4) OOM errors during a version upgrade; (5) business‑driven schema changes that triggered bugs.
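Case (1) is the kind of problem a query‑interception safeguard targets. Below is a minimal Python sketch of such a gate, assuming the platform can estimate scanned rows before execution and tracks running queries per user. `QueryGate`, its thresholds, and the row estimate are hypothetical; they do not describe HuoLala's implementation or any Doris API.

```python
import threading
from dataclasses import dataclass


@dataclass
class QueryRequest:
    user: str
    sql: str
    est_scanned_rows: int  # hypothetical pre-execution estimate of rows to scan


class QueryGate:
    """Rejects oversized queries and caps concurrent queries per user.

    Both limits are illustrative placeholders, not values from the article.
    """

    def __init__(self, max_scanned_rows: int, max_concurrent_per_user: int) -> None:
        self.max_scanned_rows = max_scanned_rows
        self.max_concurrent_per_user = max_concurrent_per_user
        self._running: dict[str, int] = {}
        self._lock = threading.Lock()

    def try_admit(self, req: QueryRequest) -> tuple[bool, str]:
        # Guard against a single "large" query.
        if req.est_scanned_rows > self.max_scanned_rows:
            return False, "rejected: estimated scan too large"
        with self._lock:
            running = self._running.get(req.user, 0)
            # Guard against "numerous" queries from one caller.
            if running >= self.max_concurrent_per_user:
                return False, "rejected: too many concurrent queries"
            self._running[req.user] = running + 1
        return True, "admitted"

    def release(self, req: QueryRequest) -> None:
        # Call when an admitted query finishes, successfully or not.
        with self._lock:
            self._running[req.user] = max(0, self._running.get(req.user, 0) - 1)
```

A caller would invoke `try_admit` before forwarding SQL to the cluster and `release` in a `finally` block. Rejecting at the gate trades some failed requests for protecting shared cluster resources from a single runaway workload.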

Capability building: issue discovery (monitoring of tables, tasks, and components with Zabbix), capacity planning (resource sizing and threshold alerts), high availability (a three‑node FE deployment, BE data replicas, and load balancing), automation (Conductor + Ansible workflows for deployment, scaling, and upgrades), plus additional safeguards such as query interception, fast fault recovery, multi‑tenant isolation, and role‑based access control (RBAC).
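Capacity planning in this sense combines up‑front sizing with runtime threshold alerts. The sketch below illustrates both under stated assumptions: the sizing rule (replicated data divided by usable disk per BE node at a target utilization), the 70% default target, and the metric names are hypothetical and not taken from the article. In practice the alert rules could live in the monitoring system (Zabbix, in HuoLala's case) rather than in application code.

```python
import math


def required_be_nodes(data_tb: float, replicas: int, disk_per_node_tb: float,
                      target_util: float = 0.7) -> int:
    """Rough BE node count keeping replicated data under target_util of total disk.

    Hypothetical sizing rule; ignores compaction headroom, skew, and growth.
    """
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    usable_per_node_tb = disk_per_node_tb * target_util
    return math.ceil(data_tb * replicas / usable_per_node_tb)


def threshold_alerts(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric (a 0-1 ratio) at or above its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"{name} at {value:.0%} (threshold {limit:.0%})")
    return alerts


if __name__ == "__main__":
    # 101 TB of data, 3 replicas, 10 TB disks, 75% target: 303 / 7.5 = 40.4, so 41 nodes
    print(required_be_nodes(101, 3, 10, target_util=0.75))
    print(threshold_alerts({"be_disk_used": 0.86, "be_mem_used": 0.50},
                           {"be_disk_used": 0.85, "be_mem_used": 0.90}))
```

Sizing against a target utilization below 100% leaves room for the alert thresholds to fire before a node actually runs out of disk.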

Process standards: a Doris admission checklist for onboarding new workloads, a usage best‑practice guide, and a change‑management workflow covering planned change windows, notifications, reviews, and functional and stability acceptance.

Summary & planning: Quantify stability targets, record and review incidents, and enforce the processes. Next steps are to explore multi‑cluster high availability, multi‑tenant isolation, and hot‑cold tiered storage, improve the usability of the OLAP platform, and adopt new Doris features such as high‑concurrency point queries and text search.
