Cassandra Deployment and Optimization at 360 Cloud Storage
This article details how 360 adopted Cassandra for its cloud drive, describing Cassandra’s decentralized architecture, the reasons for its selection over HBase, large‑scale deployment challenges, performance optimizations, reliability improvements, disk utilization techniques, and the evolution of the system from 2010 to present.
In 2010, Dropbox popularized consumer cloud storage overseas, and Chinese companies such as 360, Kingsoft, and Baidu soon launched their own cloud‑disk products. 360’s “360 Cloud Disk” is built on a Cassandra‑based storage system that was deployed at massive scale, eventually exceeding 10,000 physical nodes.
Key topics of the talk include an overview of Cassandra’s characteristics, why Cassandra was chosen at 360, its application scenarios, and the technical evolution of the system.
Cassandra’s characteristics – a fully decentralized (leaderless) design that provides high availability and smooth scalability, a flexible schema, multi‑data‑center replication, range queries, list‑style data structures, and distributed writes. The cited advantages include cost‑effective scaling from three nodes to hundreds, schema flexibility, and fault tolerance.
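The decentralized design rests on a consistent‑hashing token ring: every node owns a token, a row key hashes to a position on the ring, and replicas are the next N distinct nodes clockwise, so any node can coordinate any request. The following is a minimal illustrative sketch of that placement scheme, not Cassandra’s actual implementation; the class and node names are hypothetical.

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy sketch of a Cassandra-style token ring with leaderless placement."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns the token derived from hashing its name.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, row_key):
        """Walk clockwise from the key's token, collecting RF distinct nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, self._token(row_key)) % len(self.ring)
        out, i = [], start
        while len(out) < min(self.rf, len(self.ring)):
            node = self.ring[i][1]
            if node not in out:
                out.append(node)
            i = (i + 1) % len(self.ring)
        return out

ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))  # three distinct replica nodes, no master
```

Because placement is a pure function of the key and the ring, adding a node only shifts ownership of the adjacent token range, which is what makes scaling from 3 to hundreds of nodes smooth.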
Selection rationale – Compared with HBase, Cassandra’s leaderless architecture eliminates single points of failure, and its eventual‑consistency model with tunable per‑request read/write consistency levels keeps the service available through node failures. Because Cassandra writes directly to local disk rather than through a distributed file system layer, its write performance was roughly an order of magnitude higher than HBase on similar hardware.
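The “tunable” part comes down to simple quorum arithmetic: with replication factor RF, a write acknowledged by W replicas and a read served by R replicas are guaranteed to overlap whenever R + W > RF; lower settings trade that guarantee for availability and latency. A small sketch of the rule (the function name is ours, not Cassandra’s API):

```python
def quorum(rf):
    """Smallest majority of RF replicas."""
    return rf // 2 + 1

def is_strongly_consistent(rf, write_cl, read_cl):
    """With R + W > RF, every read set intersects every write set,
    so a read always sees the latest acknowledged write."""
    return write_cl + read_cl > rf

# QUORUM writes + QUORUM reads on RF=3 (W=2, R=2) overlap: strong.
assert is_strongly_consistent(3, quorum(3), quorum(3))
# ONE/ONE on RF=3 is only eventually consistent.
assert not is_strongly_consistent(3, 1, 1)
```

This is why a node failure causes no downtime: with RF=3 and QUORUM on both sides, any two live replicas are enough to serve every request.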
Technical evolution – Started with Cassandra 0.7.3 in 2011, then gradually improved data reliability, implemented non‑stop (rolling) upgrades, adopted erasure coding to cut storage cost by 60 %, automated operations for clusters of up to 15,000 nodes, and integrated with HBase for backup. Along the way, optimizations such as proxy‑check tables for incomplete replica writes, custom repair mechanisms, and range‑based compaction were introduced.
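The 60 % cost reduction from erasure coding is straightforward arithmetic: a (k, m) code stores (k + m)/k raw bytes per logical byte, versus 3.0 for three‑way replication. The talk summary does not state the parameters 360 used; RS(10, 2) below is a hypothetical combination chosen because it reproduces the quoted 60 % figure.

```python
def storage_overhead(data_shards, parity_shards):
    """Raw bytes stored per logical byte under (k, m) erasure coding."""
    return (data_shards + parity_shards) / data_shards

replication = 3.0                 # three full replicas: 3.0x raw storage
ec = storage_overhead(10, 2)      # hypothetical RS(10, 2): 1.2x raw storage
savings = 1 - ec / replication
print(f"{savings:.0%}")           # prints "60%"
```

An RS(10, 2) stripe also tolerates the loss of any two shards, comparable to losing two of three replicas, which is why the replica count can drop without sacrificing data safety.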
Reliability improvements – Addressed issues like insufficient replica writes, disk/sector failures, and incomplete repair mechanisms. Implemented proxy‑check tables to record rows with missing replicas and a RowRepair‑like process to recover data without heavy read‑repair impact on cold data.
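The proxy‑check idea can be pictured as a coordinator‑side log: when a write is acknowledged by fewer replicas than intended, the coordinator records the row key and the replicas that missed it, and a background job later re‑replicates exactly those rows instead of scanning everything via read repair. A minimal sketch under those assumptions; the class and method names are hypothetical, not from 360’s codebase.

```python
class ProxyCheckLog:
    """Hypothetical sketch of a proxy-check table: rows whose write did not
    reach all replicas, consumed later by a targeted row-repair job."""

    def __init__(self):
        self.pending = {}  # row_key -> set of replicas still missing the row

    def record_write(self, row_key, replicas, acked):
        missing = set(replicas) - set(acked)
        if missing:
            self.pending[row_key] = missing

    def rows_needing_repair(self):
        return dict(self.pending)

    def mark_repaired(self, row_key):
        self.pending.pop(row_key, None)

log = ProxyCheckLog()
log.record_write("file:123", ["a", "b", "c"], acked=["a", "b"])  # short one ack
log.record_write("file:456", ["a", "b", "c"], acked=["a", "b", "c"])  # complete
print(log.rows_needing_repair())  # only file:123, and only replica "c"
```

The key property is that repair cost is proportional to the number of under‑replicated rows, not to cluster size, so cold data is never touched.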
Disk utilization and write amplification reduction – Introduced virtual directories and sub‑range compaction to limit SSTable participation in compaction, achieving ~90 % disk utilization. Applied erasure coding (EC) and striping to reduce replica count while maintaining data safety, and used OrderPreservingPartitioner for ordered key storage.
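Sub‑range compaction cuts write amplification because, with OrderPreservingPartitioner keeping row keys sorted, a key sub‑range maps onto just the SSTables whose key intervals overlap it. A compaction can then rewrite a slice of the data set rather than whole column families. A small sketch of that overlap test, with hypothetical names and a toy SSTable representation:

```python
def sstables_for_subrange(sstables, lo, hi):
    """Select only SSTables whose [min_key, max_key] interval overlaps the
    target key sub-range [lo, hi], so compaction touches a fraction of disk."""
    return [s for s in sstables if s["min"] <= hi and s["max"] >= lo]

sstables = [
    {"name": "sst1", "min": "a", "max": "f"},
    {"name": "sst2", "min": "g", "max": "m"},
    {"name": "sst3", "min": "n", "max": "z"},
]
picked = sstables_for_subrange(sstables, "h", "k")
print([s["name"] for s in picked])  # prints ['sst2']
```

Limiting how many SSTables join each compaction also caps the temporary space the rewrite needs, which is what pushes usable disk utilization toward ~90 %.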
Operational enhancements – Added tools for hinted‑handoff timeout handling, optimized MemTable flush selection to reduce CPU overhead, automated cluster configuration loading, and disk health monitoring with automatic offline handling.
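One plausible reading of the flush optimization: when memory pressure forces a flush, pick the MemTable holding the most data in a single scan, freeing the most heap per flush without sorting or repeated heap walks. This is an illustrative sketch of that heuristic only; the function name and the column‑family names are invented.

```python
def pick_memtable_to_flush(memtables):
    """Single pass: flush the MemTable occupying the most memory."""
    return max(memtables, key=lambda m: m["bytes"])

memtables = [
    {"cf": "files", "bytes": 64 << 20},   # 64 MiB
    {"cf": "meta",  "bytes": 512 << 20},  # 512 MiB
    {"cf": "index", "bytes": 8 << 20},    # 8 MiB
]
print(pick_memtable_to_flush(memtables)["cf"])  # prints "meta"
```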
Overall, the talk demonstrates how 360 scaled Cassandra to support a cloud‑disk service serving billions of files, continuously refined the platform for cost efficiency, reliability, and performance, and shared lessons applicable to large‑scale distributed database deployments.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.