
Evolution of Youzan Search Platform Architecture: From 1.0 to 4.0

The Youzan Search Platform evolved from a simple Elasticsearch cluster in 2015 to a modular, message‑driven architecture with proxy validation, caching, and management tools, and now plans a cloud‑native, Kubernetes‑based 4.0 version that automates data sync, isolates workloads, and scales elastically to support billions of records.

Youzan Coder

Youzan Search Platform is an internal PaaS product that provides search and multi‑dimensional filtering capabilities for a wide range of applications and some NoSQL storage services. It currently supports more than one hundred search services and handles data approaching a hundred billion records.

Beyond traditional search use‑cases, the platform also needs to serve massive data‑filtering scenarios such as product management, order retrieval, and fan segmentation. From an engineering perspective, extending the platform to meet diverse search requirements poses a significant challenge.

The author, the first member of the Youzan search team, designed and developed most of the platform’s features, focusing on performance, scalability, reliability, and reducing both operational and development costs.

Elasticsearch

Elasticsearch is a highly available distributed search engine that is mature, stable, and backed by an active community. It was chosen as the core engine for building the search system.

Architecture 1.0

In 2015 the production environment consisted of a small Elasticsearch cluster running on a few high‑spec virtual machines, handling product and fan indexes. Data was synchronized from the database to Elasticsearch via Canal. The overall architecture is illustrated below:

This approach allowed rapid, low‑cost creation of sync applications for different business indexes during early growth. However, each sync program was a monolithic application tightly coupled to the database address, making it fragile to schema changes, migrations, and sharding. Multiple Canal instances subscribing to the same database also degraded database performance.

The Elasticsearch cluster lacked physical isolation; a large promotional event once caused the fan data volume to exhaust the JVM heap, leading to OOM and a complete outage of all indexes.

Architecture 2.0

To address these issues, a new 2.0 architecture was introduced. The key changes are:

Data changes are first sent to a message queue (MQ) via a data bus. Sync applications consume MQ messages to update the search indexes, decoupling them from the business databases and eliminating the overhead of multiple Canal listeners on the same binlog.
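The consume-and-apply loop can be sketched as follows. This is a minimal illustration, not Youzan's actual sync code: the message shape (`id`, `version`, `op`, `doc`) and the in-memory `index_store` are assumptions standing in for the real MQ payload and the Elasticsearch index. The version check shows one common way to keep an index correct when MQ messages arrive out of order.

```python
import json

def apply_change(index_store, message):
    """Apply one data-bus change message to a search index.

    Compares the record's version against what is already indexed and
    discards stale messages, so replays and MQ reordering cannot
    overwrite newer data. (A production version would also need
    tombstones so a late update cannot resurrect a deleted record.)
    """
    change = json.loads(message)
    doc_id = change["id"]
    current = index_store.get(doc_id)
    if current is not None and current["version"] >= change["version"]:
        return False  # stale message: a newer version is already indexed
    if change["op"] == "delete":
        index_store.pop(doc_id, None)
    else:  # insert and update both become an upsert
        index_store[doc_id] = {"version": change["version"], "doc": change["doc"]}
    return True
```

With this guard, the index converges to the latest version of each record regardless of delivery order.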

Advanced Search

As the business grew, centralized traffic entrances (e.g., distribution channels, curated selections) required finer-grained ranking control than simple bool queries could provide. A middleware layer now intercepts business queries, extracts the necessary conditions, and rewrites them into advanced Elasticsearch queries (e.g., function_score). The architecture is shown below:
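The rewrite step can be sketched as below. This is a hedged example of the general technique, not the platform's actual rewriter: the boost field `sales_30d` and the specific factors are invented for illustration, while the `function_score` / `field_value_factor` structure is standard Elasticsearch query DSL.

```python
def rewrite_to_function_score(bool_query, boost_field="sales_30d", factor=1.2):
    """Wrap a plain bool query in a function_score query so ranking can
    weigh a business metric (here: an assumed recent-sales field) on top
    of the text-relevance score."""
    return {
        "query": {
            "function_score": {
                "query": {"bool": bool_query},
                "functions": [
                    {"field_value_factor": {
                        "field": boost_field,
                        "factor": factor,
                        "modifier": "log1p",   # dampen very large values
                        "missing": 0,          # documents without the field score neutrally
                    }}
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```

The middleware applies transformations like this transparently, so business callers keep sending simple conditions.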

An additional optimization adds a result cache for frequent queries, reducing repeated computation and improving middleware response time.
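A result cache of this kind might look like the following sketch, assuming a simple TTL policy; the class name and parameters are illustrative, not the platform's implementation. Keying on a hash of the normalized query body lets logically identical hot queries hit the cache even if their JSON key order differs.

```python
import hashlib
import json
import time

class QueryResultCache:
    """TTL cache keyed by a hash of the normalized query body, so
    identical hot queries skip Elasticsearch entirely."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, result)

    def _key(self, query):
        # sort_keys makes logically identical queries hash the same
        body = json.dumps(query, sort_keys=True)
        return hashlib.sha1(body.encode()).hexdigest()

    def get(self, query, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(self._key(query))
        if entry is None or now - entry[0] > self.ttl:
            return None  # miss or expired
        return entry[1]

    def put(self, query, result, now=None):
        now = time.time() if now is None else now
        self._store[self._key(query)] = (now, result)
```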

Big Data Integration

Search is tightly coupled with big data. Offline computation of ranking scores and log analysis are essential. In the 2.0 stage, the open‑source es‑hadoop component was used to build a data pipeline between Hive and Elasticsearch:

Flume collects search logs into HDFS for later analysis, and Hive can generate suggestion terms for the search engine.
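The suggestion-term step can be illustrated with a minimal sketch. In practice this aggregation would run in Hive over the HDFS logs; the pure-Python version below only shows the idea, and the thresholds are invented: frequent logged queries become completion candidates, rare one-offs are filtered out.

```python
from collections import Counter

def build_suggestions(query_log_lines, min_count=2, top_n=10):
    """Aggregate raw search-log query strings into suggestion terms:
    normalize case, count occurrences, keep the frequent ones."""
    counts = Counter(
        line.strip().lower() for line in query_log_lines if line.strip()
    )
    return [term for term, n in counts.most_common(top_n) if n >= min_count]
```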

Problems

After more than a year of operation, the architecture revealed several pain points: rising maintenance cost, residual business logic in sync programs, message ordering issues in the MQ, opaque traffic patterns leading to unexpected CPU spikes, and overall difficulty in scaling the Elasticsearch cluster.

Architecture 3.0

Key adjustments in version 3.0 include:

Open APIs for user calls, fully decoupling from business code.

A proxy layer that preprocesses requests, performs flow control, and provides caching.

A management platform that simplifies index changes and cluster administration.

Proxy

The proxy exposes a standardized ESLoader interface compatible with multiple Elasticsearch versions and embeds request validation, caching, and template query modules.

Request validation checks for field mismatches, type errors, syntax errors, or potential slow queries, rejecting or throttling them to protect the cluster.
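A simplified validator along those lines is sketched below; the condition/mapping shapes are assumptions, and a real proxy would parse the full DSL rather than flat field-value pairs. It covers the three checks named above: unknown fields, type mismatches, and a slow-query pattern (leading wildcards).

```python
def validate_request(conditions, mapping):
    """conditions: {field: value}; mapping: {field: expected Python type}.
    Returns a list of error strings; an empty list means the request passes."""
    errors = []
    for field, value in conditions.items():
        expected = mapping.get(field)
        if expected is None:
            errors.append(f"unknown field: {field}")
            continue
        if not isinstance(value, expected):
            errors.append(f"type mismatch on {field}: expected {expected.__name__}")
            continue
        if expected is str and value.startswith("*"):
            # leading wildcards force Elasticsearch to scan the whole term dictionary
            errors.append(f"leading wildcard on {field} is a potential slow query")
    return errors
```

Requests that fail validation are rejected (or throttled) at the proxy before they can reach the cluster.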

Caching is enhanced with a local cache that automatically degrades during traffic spikes to avoid network congestion in the Codis cluster.
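The degradation behavior can be sketched as a two-tier cache with a circuit-breaker flavor. The threshold and cool-down values are invented, and `remote_get` stands in for the Codis client; the point is the mechanism: when remote latency spikes, reads fall back to the local copy for a window instead of adding to the congestion.

```python
import time

class DegradingCache:
    """Two-tier cache: remote (e.g. a Codis cluster) first, local fallback.
    If a remote read exceeds the latency threshold, subsequent reads are
    served locally for a cool-down window."""

    def __init__(self, remote_get, latency_threshold=0.05, cooldown=10.0):
        self.remote_get = remote_get
        self.threshold = latency_threshold
        self.cooldown = cooldown
        self.local = {}
        self.degraded_until = 0.0

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if now < self.degraded_until:
            return self.local.get(key)      # degraded: local only
        start = time.monotonic()
        value = self.remote_get(key)
        if time.monotonic() - start > self.threshold:
            self.degraded_until = now + self.cooldown  # trip the breaker
        if value is not None:
            self.local[key] = value         # keep a local copy warm
        return value
```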

Template queries simplify repetitive DSL structures (e.g., product category or order filters) and allow server‑side performance tuning via default values and optional parameters.
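Template expansion might work roughly as follows; the slot schema here is a guess at the general mechanism rather than the platform's template format. Required slots must be supplied by the caller, optional slots fall back to server-tuned defaults, and unused optional clauses are dropped entirely.

```python
def expand_template(template, params):
    """Expand a server-side query template into an Elasticsearch bool
    filter. Slots declare a caller-facing name, a target field, and
    optionally a default value or a required flag."""
    filters = []
    for slot in template["slots"]:
        value = params.get(slot["name"], slot.get("default"))
        if value is None:
            if slot.get("required"):
                raise ValueError(f"missing required parameter: {slot['name']}")
            continue  # optional slot without a value: omit the clause
        filters.append({"term": {slot["field"]: value}})
    return {"query": {"bool": {"filter": filters}}}
```

Because the expansion happens server-side, defaults and clause structure can be retuned for performance without touching any caller.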

Management Platform

Implemented with Django, the platform provides an approval workflow for index changes and visual management of index metadata, reducing daily maintenance effort.

It also offers a custom visual query component to avoid heavy fielddata loading that can cause heap OOM.

ESWriter

To gain fine‑grained control over offline write traffic, a custom ESWriter plugin was built on top of DataX, allowing per‑second throttling by record count or data volume.
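The throttling logic can be sketched with a fixed one-second window that caps both counters, as below. This is an illustration of the technique, not the actual DataX plugin code (which is Java); the class and parameter names are invented.

```python
class ThrottledWriter:
    """ESWriter-style throttling sketch: caps both records/sec and
    bytes/sec within a fixed one-second window."""

    def __init__(self, max_records_per_sec, max_bytes_per_sec):
        self.max_records = max_records_per_sec
        self.max_bytes = max_bytes_per_sec
        self.window_start = 0.0
        self.records = 0
        self.bytes = 0

    def try_write(self, record_bytes, now):
        if now - self.window_start >= 1.0:      # start a fresh window
            self.window_start, self.records, self.bytes = now, 0, 0
        if (self.records + 1 > self.max_records
                or self.bytes + record_bytes > self.max_bytes):
            return False                         # over budget: caller waits and retries
        self.records += 1
        self.bytes += record_bytes
        return True
```

Throttling offline bulk writes this way keeps large DataX jobs from starving online query traffic of cluster resources.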

Challenges

Platformization and documentation lower the entry barrier, but rapid business growth increases Elasticsearch operational costs. Shared physical clusters host multiple business indexes, leading to stability risks and difficulty in applying heterogeneous production standards.

Elastic scaling is also cumbersome; adding a node requires hardware procurement, environment setup, and software installation, which can be time‑consuming.

Future Architecture 4.0

The next step is to integrate with the internally developed Data Transport Service (DTS) to achieve configuration‑driven, automated synchronization between Elasticsearch and various data sources.

To address shared‑cluster issues, the plan is to evolve the search service into a cloud‑native offering on Kubernetes. Core applications will run on isolated physical clusters, while non‑core workloads will consume Elasticsearch as a managed cloud service defined by application templates, enabling automatic scaling based on runtime metrics.

Conclusion

This article outlined the architectural evolution of the Youzan search system, the motivations behind each change, and the challenges addressed. Detailed technical implementations will be covered in subsequent articles of this series.

Tags: distributed systems, Big Data, Proxy, Elasticsearch, data integration, search architecture
Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
