How Hengfeng Bank Built a High‑Availability OpenStack Cloud for Financial Services
This article details Hengfeng Bank's practical experience with OpenStack, covering why the bank chose the open‑source cloud platform, its multi‑site deployment architecture, high‑availability design, management practices, and lessons learned from operating a large‑scale financial cloud environment.
1. OpenStack Status at Hengfeng Bank
Hengfeng Bank, one of China's 13 joint‑stock commercial banks, operates multiple OpenStack clusters across two regions and three data centers, supporting both production and testing environments, multi‑tenant isolation, and more than 200 applications running on over 6,000 virtual machines.
Key features include independent OpenStack instances per network zone, a hyper‑converged architecture with pure‑SSD Ceph storage, and integration with Cisco SDN for dynamic VXLAN binding and port migration.
2. Why Choose OpenStack
The bank prefers OpenStack because it is open source, vendor‑agnostic, and avoids lock‑in; the community offers a large ecosystem, mature codebase, and a robust governance model with thousands of developers worldwide.
OpenStack's licensing model reduces service‑fee costs, and its modular architecture allows the bank to customize and extend components as needed.
3. Deploying OpenStack
3.1 How Hengfeng Deploys OpenStack
The deployment uses separate control and compute nodes, with additional DHCP agents and Ceph monitors placed on compute nodes to avoid Layer 2 network issues. The architecture spans two data centers with symmetric hardware for high availability, employing three control nodes for quorum and a dedicated arbitration node.
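The quorum rationale above reduces to a simple majority rule. The sketch below is illustrative only, not the bank's tooling, and the node counts are assumptions drawn from the description: a strict majority of voters must survive, which is why an even split across two symmetric data centers needs a tie‑breaking arbitration node.

```python
def has_quorum(total_voters: int, alive_voters: int) -> bool:
    """A cluster keeps quorum while a strict majority of voters is reachable."""
    return alive_voters > total_voters // 2

# Three control nodes tolerate the loss of one:
print(has_quorum(3, 2))  # -> True
# Four voters split evenly across two data centers cannot decide on
# their own, which is what a dedicated arbitration node prevents:
print(has_quorum(4, 2))  # -> False
```

With the arbiter counted as a fifth voter, losing an entire data center (two voters) still leaves a three‑of‑five majority.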
3.2 High‑Availability Applications
Control nodes run HAProxy with a primary‑plus‑two‑standby configuration and a virtual IP for API traffic. Database services use Galera clustering across three nodes, providing automatic failover without manual intervention. Memcached also operates in a primary‑standby mode.
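A minimal sketch of how a Galera node's fitness to receive traffic might be judged from its `wsrep_*` status variables. The variable names are genuine Galera status variables; the function and the minimum‑size threshold are illustrative assumptions, not the bank's actual health checks.

```python
def galera_node_healthy(status: dict) -> bool:
    """Decide whether a Galera node should receive traffic.

    `status` maps Galera status variable names to values, as returned by
    `SHOW GLOBAL STATUS LIKE 'wsrep_%'`.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"    # in the primary component
        and status.get("wsrep_ready") == "ON"              # node accepts queries
        and int(status.get("wsrep_cluster_size", 0)) >= 2  # assumed minimum size
    )

healthy = {"wsrep_cluster_status": "Primary",
           "wsrep_ready": "ON",
           "wsrep_cluster_size": "3"}
partitioned = {"wsrep_cluster_status": "non-Primary",
               "wsrep_ready": "OFF",
               "wsrep_cluster_size": "1"}
print(galera_node_healthy(healthy))      # -> True
print(galera_node_healthy(partitioned))  # -> False
```

In practice HAProxy typically drives a check like this through an HTTP health‑check script (the common `clustercheck` pattern) so that unhealthy backends are removed from the pool automatically.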
4. Managing OpenStack
4.1 Management Approach
The bank isolates fault domains to prevent failures in one network zone from affecting others, and limits cluster size to around 1,000 physical machines for performance reasons. A single Keystone service manages multiple OpenStack clusters, and two identical Ceph clusters provide storage redundancy across data centers.
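The "one Keystone, many clusters" pattern usually works by registering each cluster's API endpoints under its own region in the shared service catalog. The sketch below is a pure‑Python illustration of that lookup; the region names and endpoint URLs are invented for the example.

```python
# Illustrative service catalog: one shared identity service, with each
# OpenStack cluster's endpoints registered under its own region.
# All region names and URLs here are hypothetical.
CATALOG = {
    ("identity", "shared"):   "https://keystone.example.internal:5000/v3",
    ("compute",  "dc1-prod"): "https://nova.dc1.example.internal:8774/v2.1",
    ("compute",  "dc2-prod"): "https://nova.dc2.example.internal:8774/v2.1",
    ("volume",   "dc1-prod"): "https://cinder.dc1.example.internal:8776/v3",
}

def endpoint_for(service: str, region: str) -> str:
    """Resolve a service endpoint the way a Keystone catalog lookup would."""
    try:
        return CATALOG[(service, region)]
    except KeyError:
        raise LookupError(f"no {service} endpoint registered for region {region}")

print(endpoint_for("compute", "dc2-prod"))
```

A real client such as openstacksdk performs this resolution automatically once given a `region_name`, so one set of Keystone credentials can drive any of the clusters.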
4.2 Operational Practices
Comprehensive monitoring covers network, compute, and storage layers; smokeping is used to detect latency issues. The team runs simulated banking workloads across the clouds to validate end‑to‑end functionality. Configuration management is handled with Puppet, and all OpenStack code is sourced from the upstream community via GitLab/GitHub, ensuring a consistent baseline.
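Latency monitoring of the kind smokeping performs boils down to collecting round‑trip‑time samples and flagging threshold breaches. The sketch below is illustrative, not the bank's tooling: the regex targets common `ping` reply output, and the 5 ms threshold is an assumption.

```python
import re

def parse_rtt_ms(ping_line: str) -> float:
    """Extract the round-trip time in ms from a ping reply line,
    e.g. '... time=0.482 ms'."""
    match = re.search(r"time[=<]([\d.]+)\s*ms", ping_line)
    if match is None:
        raise ValueError(f"no RTT found in: {ping_line!r}")
    return float(match.group(1))

def latency_breaches(samples_ms, threshold_ms=5.0):
    """Return the RTT samples that exceed the alerting threshold."""
    return [s for s in samples_ms if s > threshold_ms]

line = "64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.482 ms"
print(parse_rtt_ms(line))                      # -> 0.482
print(latency_breaches([0.4, 6.2, 1.1, 9.8]))  # -> [6.2, 9.8]
```

Feeding breaches like these into an alerting pipeline is one simple way to catch the cross‑data‑center latency regressions that tools such as smokeping visualize.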
5. Conclusion
The bank does not rely on any vendor‑specific OpenStack distribution; instead, it customizes the upstream community version, applying patches as needed while contributing back to the ecosystem. This approach avoids lock‑in and maintains control over the cloud stack.
Q & A
Q: How many people are involved in the OpenStack team? A: Only three to five engineers, because the bank maintains a stable, well‑understood feature set with targeted patches rather than tracking the full upstream codebase.
Q: What typical problems have you encountered? A: Frequent live‑migration bugs, patch‑management challenges, and occasional host‑level failures that require rapid remediation.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.