
How to Build a Highly Available, Stable, and Observable SMS Service

This article explains how to design a high‑availability SMS system by identifying stability bottlenecks, defining reliability goals, implementing failover strategies for Redis, MySQL and external services, establishing a comprehensive observability framework, and measuring key quality metrics to ensure 99.99% uptime.

Inke Technology

Background

SMS is widely used for user registration, password recovery, account changes, payment confirmation, activity verification, and marketing. This article focuses on improving the high availability and observability of the SMS system.

Current Issues

The core SMS workflow depends heavily on external resources such as downstream services and MySQL. A failure in any of them takes the whole service down, which severely hurts stability. The system also relies on multiple third‑party providers, but with no quality evaluation in place, provider anomalies were only detected a day later, causing business loss.

Improvement Goals

Improve SMS service stability: detect faults quickly and keep interface availability above 99.99%.

Enhance observability by establishing quality monitoring and evaluation for multiple providers to detect channel anomalies promptly.

Overall Solution

The SMS system consists of two parts: the SMS service and the SMS metrics observation module.

SMS Service

Provides basic capabilities such as sending verification codes, validating them, handling delivery receipts, and processing upstream (user‑originated) messages. The stability risks are its strong dependencies on a downstream encryption service, Redis, and MySQL.

Service dependency: Downstream encryption service failure leads to phone number encryption failure and terminates the verification flow.

Redis dependency: Redis failure prevents requestID generation, ending the verification flow.

MySQL dependency: MySQL failure blocks query, update, and record storage, ending the verification flow.

Metrics Observation Module

Offers metric calculation, visualization, and alerting. Current observability gaps:

Missing core metrics such as SMS fill‑rate and delivery‑rate, making it impossible to evaluate third‑party provider quality.

Visualization is not user‑friendly and lacks sufficient dimensions.

No alert mechanism for metric anomalies, preventing timely quality awareness.

Optimization Ideas

Replace Redis‑based unique ID generation with a UUID algorithm.

Decouple services using a message queue.

Introduce redundant storage (e.g., Redis) for MySQL disaster recovery.

Refine quality monitoring by defining new metrics and improving data collection.

Design Practices

Eliminate Redis Strong Dependency

Redis is only used to generate a globally unique ID; replace it with UUID.
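
As a minimal sketch of this idea, the function below builds a random (version 4) UUID from the standard library's `crypto/rand`, so ID generation needs no network call. The name `newRequestID` is hypothetical; the article does not say which UUID version or library the team used.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newRequestID returns a random (version 4) UUID string in the RFC 4122
// layout. It is generated locally, so it does not depend on Redis.
// The function name is an assumption made for this sketch.
func newRequestID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}
```

The trade‑off is that random UUIDs are not ordered, so any code that relied on increasing IDs would need another sort key such as a timestamp.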

Eliminate Service Strong Dependency

Use a message queue to decouple the SMS service from downstream encryption services.
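
The decoupling can be sketched as follows. A buffered Go channel stands in for the real message broker, which the article does not name; `encryptTask`, `queue`, `publish`, and `consume` are hypothetical names. The point is only that the send path enqueues and returns, while a separate consumer calls the encryption service.

```go
package main

import "errors"

// encryptTask is a hypothetical message asking a worker to encrypt a phone number.
type encryptTask struct {
	RequestID string
	Phone     string
}

// queue stands in for a real message broker. Using a buffered channel is an
// assumption for illustration only.
type queue chan encryptTask

var errQueueFull = errors.New("queue full")

// publish enqueues without blocking, so the send path never waits on the
// encryption service.
func (q queue) publish(t encryptTask) error {
	select {
	case q <- t:
		return nil
	default:
		return errQueueFull
	}
}

// consume drains the tasks currently queued and calls encrypt for each one.
// Tasks whose encryption fails are returned so the caller can retry them.
func (q queue) consume(encrypt func(phone string) (string, error)) []encryptTask {
	var failed []encryptTask
	for {
		select {
		case t := <-q:
			if _, err := encrypt(t.Phone); err != nil {
				failed = append(failed, t)
			}
		default:
			return failed
		}
	}
}
```

With a real broker, a failed task would usually be redelivered by the broker itself rather than collected in memory as it is here.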

Eliminate MySQL Strong Dependency

Introduce Redis as a redundant storage layer; during MySQL failures, execute equivalent Redis commands for disaster recovery.

MySQL Failure Detection & Recovery

Detect MySQL status via error codes and frequency analysis; recovery relies on manual intervention and alerting.

Redis Failure Detection & Recovery

Detect Redis status by parsing error messages and frequency; recovery also depends on manual control and alerts.

Failover State Object

<code>import "context"

// State is the failover object. Status and the Status* constants are
// defined elsewhere in the package.
type State struct {
    acquireStatus     func() Status               // Get current failover status
    setMySQLFatalFlag func(context.Context) error // Enable MySQL fatal flag
    setRedisFatalFlag func(context.Context) error // Enable Redis fatal flag
    runMaster         func(context.Context) error // Primary path: runs against MySQL
    runBackup         func(context.Context) error // Backup path: runs against Redis while MySQL is down
    recordSQL         func()                      // Log SQL during MySQL failure for later recovery
}

// Run selects the processing flow based on the current status.
func (s *State) Run(ctx context.Context) error {
    var fn func(context.Context) error
    switch s.acquireStatus() {
    case StatusHealthy, StatusRedisFatal:
        // MySQL is usable in both cases, so stay on the primary path.
        fn = s.runMaster
    case StatusMysqlFatal:
        fn = s.runBackup
    default:
        // Unknown status: fall back to the backup path.
        fn = s.runBackup
    }
    return fn(ctx)
}
</code>

SQL Disaster Recovery

Use a state‑pattern failover object to trigger different behaviors based on MySQL/Redis health, avoiding repetitive conditional code.

Failure Record Recovery

During a MySQL failure, the recordSQL function writes the SQL statements to disk; after MySQL recovers, these statements can be replayed to restore the missing data.

Quality Observation System

Ten new metrics were added, including third‑party success rate, receipt rate, delivery rate, and fill‑rate. Each metric has a defined formula and meaning, and together they are used to evaluate provider stability and how effectively users complete verification.
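
The article does not give the formulas, so the sketch below uses assumed definitions purely to show how such ratios could be computed from per‑provider counters. The counter names and every formula here are hypothetical and may differ from the team's actual definitions.

```go
package main

// channelStats holds hypothetical per-provider counters for one reporting period.
type channelStats struct {
	Requested int // send requests submitted to the provider
	Accepted  int // requests the provider accepted
	Receipts  int // delivery receipts returned by the provider
	Delivered int // receipts that report successful delivery
	Verified  int // verification codes that users entered back
}

// ratio returns num/den, or 0 when the denominator is zero.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

// Assumed definitions, labeled as such:
//   success rate  = accepted / requested
//   receipt rate  = receipts / accepted
//   delivery rate = delivered / accepted
//   fill-rate     = verified / delivered
func (s channelStats) successRate() float64  { return ratio(s.Accepted, s.Requested) }
func (s channelStats) receiptRate() float64  { return ratio(s.Receipts, s.Accepted) }
func (s channelStats) deliveryRate() float64 { return ratio(s.Delivered, s.Accepted) }
func (s channelStats) fillRate() float64     { return ratio(s.Verified, s.Delivered) }
```

Computing the same ratios per provider, carrier, and region is what makes side‑by‑side comparison of channels possible.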

Metrics Collection & Visualization Architecture

The collected data feeds daily quality reports, recent trend charts per carrier, and monitoring alerts, providing comprehensive visibility into SMS service quality.

Benefits

The SMS service now maintains 99.99% availability, with no major incidents in the past year.

Established a quality observation system that detects and resolves channel anomalies within 20 minutes.

Future Outlook

Fine‑grained regional SMS operations: coverage is global, but some regions still need quality improvements.

Automation of analysis tools: current post‑degradation analysis is time‑consuming and requires manual effort.
