Surviving Hundred‑Billion Transactions: Real Production Incident Stories

This article recounts a series of real‑world production incidents—including massive concurrency overloads, DDoS attacks, SQL injection breaches, and critical bugs—encountered by an internet finance platform, and shares the concrete technical fixes and lessons learned to improve system resilience.

Efficient Ops

Mar 5, 2017

Preface

The author reflects on years of experience in the internet finance industry, aiming to document both the technical solutions and the hard‑earned lessons from numerous production incidents.

1. Concurrency Over‑booking

During a large promotional campaign, thousands of users attempted to purchase a total of 10 million units of a financial product within seconds. Under that concurrency the system sometimes allocated more than the available quota (over‑booking) and sometimes left quota unallocated (under‑booking).

To prevent over‑booking, an optimistic lock based on memcached CAS/gets was introduced: the total quota is stored in memcached, and each purchase attempt first tries to lock the required amount. If the lock fails, the request is rejected, effectively throttling excessive concurrency at the entry point.
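The gets/CAS pattern described above can be sketched as follows. This is a minimal illustration, not the platform's code: `FakeCasStore` is a hypothetical in-memory stand-in that mimics the semantics of memcached's `gets` and `cas` commands so the example runs without a server, and the key name, amounts, and `max_attempts` are assumptions.

```python
import threading


class FakeCasStore:
    """In-memory stand-in for memcached gets/cas (assumption, not a real client)."""

    def __init__(self):
        self._data = {}                  # key -> (value, version)
        self._mutex = threading.Lock()

    def set(self, key, value):
        with self._mutex:
            _, version = self._data.get(key, (None, 0))
            self._data[key] = (value, version + 1)

    def gets(self, key):
        """Return (value, cas_token), like memcached `gets`."""
        with self._mutex:
            return self._data[key]

    def cas(self, key, value, token):
        """Store value only if the key is unchanged since `token` was read."""
        with self._mutex:
            _, version = self._data[key]
            if version != token:
                return False
            self._data[key] = (value, version + 1)
            return True


def try_lock_quota(store, key, amount, max_attempts=3):
    """Reserve `amount` from the remaining quota; reject instead of waiting."""
    for _ in range(max_attempts):
        remaining, token = store.gets(key)
        if remaining < amount:
            return False                 # not enough quota: reject at the entry point
        if store.cas(key, remaining - amount, token):
            return True                  # reservation succeeded
    return False                         # lost the race too often: reject


store = FakeCasStore()
store.set("quota:product-1", 1000)
results = []


def buy():
    results.append(try_lock_quota(store, "quota:product-1", 100, max_attempts=50))


threads = [threading.Thread(target=buy) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty buyers compete for a quota that covers ten purchases. Because every write goes through `cas`, the quota can never go negative, and the losers are rejected immediately rather than queued.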

However, the optimistic lock introduced under‑booking. When a later step of a purchase failed, the quota it had locked was rolled back, so the product could briefly appear fully booked and then show remaining quota again. The team removed the separate quota‑progress table and displayed progress directly from real‑time queries. They also added a second check on the memcached quota and the product status, so that an inconsistent state is corrected immediately.
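One way to picture the second check is a small reconciliation step that recomputes progress from confirmed orders and compares it with the cached quota. The sketch below is an assumption about how such a check could look, not the team's implementation; the function name, the status strings, and the simplification that no reservations are in flight at check time are all hypothetical.

```python
def reconcile(total_quota, confirmed_amounts, cached_remaining):
    """Recompute progress from confirmed orders and flag cache or status drift."""
    sold = sum(confirmed_amounts)
    true_remaining = total_quota - sold
    status = "SOLD_OUT" if true_remaining <= 0 else "ON_SALE"
    return {
        "progress": sold / total_quota,          # shown to users, no separate table
        "remaining": true_remaining,             # value the cache should hold
        "status": status,
        "cache_corrected": cached_remaining != true_remaining,
    }


# The cache says nothing is left, but only 500 of 1000 units were confirmed:
# a failed purchase had locked quota that was never restored.
report = reconcile(1000, [300, 200], cached_remaining=0)
```

Here `report` indicates that the product should still be on sale with 500 units remaining and that the cached value needs correcting.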

Later optimizations considered using MQ or Redis queues for a fairer allocation mechanism.
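The queue idea can be illustrated with a single consumer that serves requests strictly in arrival order. This is a sketch under assumptions: a `deque` stands in for the MQ or Redis list, and `allocate_fifo` and the request tuples are hypothetical names.

```python
from collections import deque


def allocate_fifo(total_quota, requests):
    """Serve (user_id, amount) requests in arrival order with one consumer."""
    queue = deque(requests)
    remaining = total_quota
    accepted, rejected = [], []
    while queue:
        user_id, amount = queue.popleft()
        if amount <= remaining:
            remaining -= amount          # only the consumer touches the quota
            accepted.append(user_id)
        else:
            rejected.append(user_id)     # request does not fit what is left
    return accepted, rejected, remaining


accepted, rejected, remaining = allocate_fifo(
    1000, [("a", 600), ("b", 500), ("c", 400)]
)
```

Because only one consumer updates the quota, there is no CAS contention, and whoever arrives first is served first instead of whoever wins a race.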

2. Hacker Attacks

In 2015 the platform suffered multiple attacks, including DDoS floods, SQL injection, and social engineering attempts.

A hacker impersonated a customer representative in a group chat, asking for internal backend URLs to find a foothold.

During a DDoS incident the external IP received traffic spikes up to 18 GB/s, forcing the team to switch IPs, separate the corporate network from the shared ISP entry point, and eventually isolate the platform’s network.

Mitigation strategies included:

Hiding the real server IP behind CDN services (e.g., Baidu Cloud Acceleration, 360 Site Shield) and using high‑availability anti‑DDoS appliances.

Purchasing traffic‑scrubbing services from major cloud providers.

Deploying firewall products, though their effectiveness was limited.

3. SQL Injection

The legacy PHP codebase contained injection points that were patched after discovery, but attackers still found vulnerabilities.

PHP was more exposed for two reasons: it typically served the public‑facing web tier, and many older PHP frameworks lacked built‑in protection against injection. Java applications, by contrast, often use persistence frameworks such as MyBatis or Hibernate, which support parameterized queries.
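The difference between string‑built SQL and a parameterized query can be shown in a few lines. The example uses Python's standard `sqlite3` module purely for illustration; the table, user, and payload are made up, and the platform's own code was PHP and Java.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

malicious = "' OR '1'='1"

# Vulnerable: user input is spliced into the SQL text and parsed as SQL.
unsafe_sql = (
    "SELECT name FROM users WHERE name = 'alice' AND password = '%s'" % malicious
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value, so it is only ever treated as data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ? AND password = ?",
    ("alice", malicious),
).fetchall()
```

The spliced query matches the row even though the password is wrong, while the bound query matches nothing.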

Attackers typically start with automated scanners (e.g., Acunetix) and then use tools like sqlmap for deeper exploitation.

4. Bugs

4.1 Duplicate Payouts

Faulty retry logic for third‑party payment interfaces caused duplicate interest payouts to over 70 users, resulting in a loss of more than 60,000 CNY.

The root cause was concurrent execution of payout and settlement jobs, combined with unstable payment APIs that returned ambiguous success/failure responses, leading to unnecessary retries.
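A common defence against this class of bug is to make each payout idempotent and to query the payment provider before retrying an ambiguous result. The sketch below illustrates that general technique and is not the fix the team shipped; `FakeGateway`, its `UNKNOWN` reply, and the dict used as a ledger are hypothetical stand‑ins. A real system would also need a unique constraint or row lock on the ledger so that concurrent jobs cannot both start the same payout.

```python
class FakeGateway:
    """Hypothetical payment gateway whose reply to pay() is always ambiguous."""

    def __init__(self):
        self.transfers = {}              # request_id -> amount actually paid

    def pay(self, request_id, amount):
        # Idempotent on request_id: a repeated id never pays twice.
        self.transfers.setdefault(request_id, amount)
        return "UNKNOWN"                 # simulate a timeout after the money moved

    def query(self, request_id):
        return "SUCCESS" if request_id in self.transfers else "NOT_FOUND"


def pay_interest(gateway, ledger, payout_id, amount):
    """Pay at most once per payout_id, even when replies are ambiguous."""
    if ledger.get(payout_id) == "SUCCESS":
        return "SUCCESS"                 # already settled: do nothing
    ledger[payout_id] = "PENDING"
    result = gateway.pay(payout_id, amount)
    if result == "UNKNOWN":
        result = gateway.query(payout_id)    # ask before deciding to retry
    if result == "SUCCESS":
        ledger[payout_id] = "SUCCESS"
    return result


gateway = FakeGateway()
ledger = {}
first = pay_interest(gateway, ledger, "payout-1", 100)
second = pay_interest(gateway, ledger, "payout-1", 100)   # e.g. a second job
```

Both calls report success, but only one transfer is recorded, because the second call finds the payout already settled in the ledger.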

4.2 Miscellaneous Issues

Other incidents included a missing parenthesis that allowed any password to log in, an unsecured HTTP endpoint that triggered unintended automatic investments, and various performance bottlenecks in MongoDB‑MySQL synchronization and large‑scale data queries.
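The source does not show the offending line, so the following is a purely hypothetical illustration of how one missing pair of parentheses can disable a password check. It relies on operator precedence: `and` binds tighter than `or`. The function and parameter names are invented for the example.

```python
def login_buggy(found_by_email, found_by_phone, password_ok):
    # Without parentheses this is evaluated as:
    # found_by_email or (found_by_phone and password_ok)
    return found_by_email or found_by_phone and password_ok


def login_fixed(found_by_email, found_by_phone, password_ok):
    # The password check now applies to both lookup paths.
    return (found_by_email or found_by_phone) and password_ok


# An account found by email, with a wrong password:
buggy_result = login_buggy(True, False, False)
fixed_result = login_fixed(True, False, False)
```

In the buggy version any account found by email logs in regardless of the password, which is the kind of failure a test with a deliberately wrong password would catch before deployment.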

Each problem prompted a post‑mortem, process improvements, and stricter deployment checks.

Conclusion

Production incidents are powerful training grounds for technical teams, sharpening problem‑solving, stress management, and system design skills. The author emphasizes the three lifecycle stages of an internet platform—initial launch, growth, and maturity—and urges teams to use early‑stage hardships to build robust, scalable architectures for future growth.
