Designing Scalable and Reliable Backend Services at English Fluently: Architecture, Service Discovery, Monitoring, and Autoscaling
This article shares the engineering team’s experience building a reliable backend for the fast‑growing English Fluently service, covering inter‑service communication with gRPC, service discovery, Docker‑based deployment, health checking, monitoring, autoscaling, Kubernetes orchestration, and multi‑cell availability strategies.
English Fluently’s user base has been growing rapidly, and the engineering team needed to ensure stable and reliable service delivery. This short article outlines the challenges they faced and the solutions they adopted, providing useful references for interested readers.
Interoperability
The internal teams (algorithm, data, backend) use different programming languages. To simplify cross‑team service consumption, the team evaluated Thrift and gRPC, ultimately choosing gRPC for its ability to carry extra metadata, such as trace information, with each call. They piloted it on low‑traffic services before rolling it out to high‑traffic ones, encountering and quickly fixing memory leaks in the Python and Java implementations as well as an incompatibility with Ruby Unicorn's forking model.
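The cross‑language contract lives in the service definition itself; a minimal, hypothetical `.proto` sketch (service and message names are illustrative, not taken from the actual codebase) looks like this:

```protobuf
syntax = "proto3";

package fluently.user;

// Hypothetical user-profile service shared by the algorithm, data,
// and backend teams; each team generates a stub in its own language.
service UserProfile {
  rpc GetProfile (GetProfileRequest) returns (Profile);
}

message GetProfileRequest {
  string user_id = 1;
}

message Profile {
  string user_id = 1;
  string display_name = 2;
  int32 level = 3;
}
```

Each team generates a client or server stub in its own language from the same file, and per‑call metadata such as trace IDs travels in gRPC headers alongside the request.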
Service Discovery
When services scale across many machines, callers need a way to locate service instances. Two approaches were considered: a service registry (dynamic address lookup) and HAProxy‑based traffic forwarding. Because implementing registry clients for every language was costly and the gRPC maintainers had their own roadmap, the team adopted HAProxy‑based forwarding driven by configuration changes, later addressing dynamic IP changes as the number of services grew.
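Under this model, callers talk to a fixed local proxy address and HAProxy fans traffic out to instances. A hypothetical configuration fragment (service names, ports, and addresses are illustrative) might look like:

```haproxy
# Hypothetical HAProxy setup for an internal gRPC service.
# Callers always dial the fixed frontend; instance addresses are
# rewritten by tooling when machines change.
frontend grpc_user_profile
    bind *:50051
    mode tcp
    default_backend user_profile_instances

backend user_profile_instances
    mode tcp
    balance roundrobin
    server user-1 10.0.1.10:50051 check
    server user-2 10.0.1.11:50051 check
```

The `mode tcp` lines reflect the limitation noted below: at this stage the proxy only checked whether the port accepted connections, not whether the application was actually healthy.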
Standardization
Different teams relied on varied environments, making deployment scripts complex and upgrades error‑prone. The team moved to Docker for all services, enabling consistent CI pipelines. Images from the development registry are automatically synchronized to the production registry for release branches, while non‑production branches remain unsynced due to volume.
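A hypothetical Dockerfile for one such service (base image, file names, and port are illustrative) shows how every service can ship the same way, so CI and deploy scripts no longer branch per environment:

```dockerfile
# Hypothetical image for a Python gRPC service.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 50051
CMD ["python", "server.py"]
```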
Health Checks
A growing service count revealed unstable response times that could exhaust caller thread pools. The team introduced circuit breakers, timeouts, and graceful degradation. Because the HAProxy setup only performed TCP‑level health checks, each gRPC service now exposes an application‑level health‑check endpoint that a monitoring service polls; unhealthy instances are automatically restarted via a shared library.
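The caller‑side protection can be sketched in plain Python; this is a minimal circuit breaker with graceful degradation, assuming illustrative thresholds (the shared library mentioned above is not shown in the source):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    allow a probe again after a cooldown, close on success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_with_breaker(breaker, rpc, fallback):
    """Degrade gracefully: serve the fallback while the circuit is open
    or when the call itself fails."""
    if not breaker.allow_request():
        return fallback()
    try:
        result = rpc()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

A real deployment would combine this with per‑call deadlines (gRPC supports them natively) so a slow dependency cannot pin caller threads indefinitely.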
Monitoring, Alerting, and Log Collection
To gain visibility into latency, error rates, and request volume, each service provides a /metrics endpoint collected by Prometheus and visualized in Grafana, with alerts routed to owners. Logs, including alerts and exceptions, are shipped via Fluentd to an Elasticsearch cluster for searchable analysis.
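A `/metrics` endpoint in the Prometheus text format needs nothing exotic; this stdlib‑only sketch (metric names and counter values are invented for illustration; a real service would typically use a Prometheus client library) shows the shape of what the scraper sees:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by a label string.
REQUEST_COUNT = {'status="200"': 128, 'status="500"': 3}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        lines = ["# TYPE http_requests_total counter"]
        for labels, value in sorted(REQUEST_COUNT.items()):
            lines.append("http_requests_total{%s} %d" % (labels, value))
        body = ("\n".join(lines) + "\n").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

def serve_metrics(port=0):
    """Start the endpoint on a background thread; returns the server
    (port 0 asks the OS for a free port)."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Prometheus is then configured to scrape each instance's `/metrics` path on an interval, and Grafana dashboards plus alert rules are built on the resulting time series.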
Elastic Scaling
Traffic spikes required dynamic resource allocation. The team leveraged AWS Auto Scaling Groups, which launch or terminate instances based on CPU and memory thresholds and run custom startup scripts. The earlier move to Docker kept instance startup simple and uniform.
Cluster Scheduling and Deployment
Auto Scaling introduced two problems: resource waste when each group runs a single service, and the operational overhead of manually specifying CPU, memory, and disk for each new service. Adopting Kubernetes solved both issues with built‑in resource management and scheduling, while Spinnaker handled continuous delivery.
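Declaring resources per service is what lets the scheduler bin‑pack many services onto shared nodes instead of dedicating one Auto Scaling Group to each. A hypothetical Deployment fragment (names, image, and numbers are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-profile
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-profile
  template:
    metadata:
      labels:
        app: user-profile
    spec:
      containers:
      - name: user-profile
        image: registry.internal/user-profile:1.0.0
        resources:
          requests:     # what the scheduler reserves for placement
            cpu: "250m"
            memory: "256Mi"
          limits:       # hard ceiling enforced at runtime
            cpu: "500m"
            memory: "512Mi"
```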
Cellular Architecture
Multiple "cells" (independent Kubernetes clusters deployed in separate VPCs) provide fault isolation and keep the architecture clean. Each cell contains several autoscaling groups with heterogeneous resource profiles, allowing services to be placed on appropriate machines via selectors.
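Selector‑based placement can be as simple as labeling the nodes of each autoscaling group and matching on that label; a hypothetical fragment (the label name is illustrative):

```yaml
# Pin a memory-hungry service to the cell's high-memory node group.
spec:
  template:
    spec:
      nodeSelector:
        node-group: high-memory
```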
Availability
Cells are spread across different availability zones within the same city, improving resilience. An internal load balancer distributes traffic among cells, and the design can be extended to multi‑region deployments using intelligent DNS routing. Stateful services (e.g., databases) require careful data‑sync handling, with a primary‑cluster redirect strategy for writes.
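The primary‑cluster redirect for writes can be sketched as a small routing decision; cell names and endpoints here are invented for illustration:

```python
class CellRouter:
    """Reads stay in the local cell for latency; writes are redirected
    to the primary cell so a single cluster owns all mutations."""

    def __init__(self, local_cell, primary_cell, endpoints):
        self.local_cell = local_cell
        self.primary_cell = primary_cell
        self.endpoints = endpoints  # cell name -> database address

    def endpoint_for(self, operation):
        cell = self.primary_cell if operation == "write" else self.local_cell
        return self.endpoints[cell]
```

Replication then flows one way, from the primary cell's datastore to the others, which sidesteps multi‑master conflict resolution at the cost of cross‑cell write latency.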
Robustness
Process failures trigger automatic restarts, node failures cause automatic replacement, and whole‑cluster issues can be mitigated by traffic shifting to healthy clusters, though stateful services are not yet fully covered.
Conclusion
The article summarizes the problems encountered and the architectural changes made so far, acknowledges remaining areas for improvement, and invites interested readers to join the team.
Liulishuo Tech Team
Help everyone become a global citizen!