BTS (Baidu Table Storage): Architecture and Core Technologies
BTS (Baidu Table Storage) is Baidu Intelligent Cloud’s high‑performance, low‑cost semi‑structured NoSQL service that evolved from single‑table to multi‑model (wide tables, time‑series, soon documents), featuring a three‑layer compute‑storage separation architecture, multi‑level caching, hot‑backup HA, and supporting massive IoT, AI, autonomous‑driving and monitoring workloads.
This article is based on a talk given by Zhu Jie, Chief Database Architect at Baidu Intelligent Cloud, at the "National Database Industry Trends" technical sharing session held on December 16, 2023. With the rapid growth of the Internet and IoT, massive volumes of structured and semi-structured data are being generated, and BTS (Baidu Table Storage) has become a key product for handling semi-structured data within Baidu. As technology evolves and business needs diversify, BTS has grown from a single-table model into a multi-model system that covers wide tables and time series.
BTS is Baidu Intelligent Cloud's semi-structured storage product, supporting Baidu's core businesses internally (search, Apollo, Phoenix Nest, feed, system monitoring, etc.) and providing high-performance, low-cost NoSQL table storage services externally.
BTS serves a wide range of scenarios: horizontal business scenarios (distributed storage, structured data, aggregation, high-performance retrieval), vertical industry scenarios (Internet, advertising, feed, IoT, big data, time series), and integrated solutions (the big data analysis ecosystem, monitoring). It provides multiple APIs, SDKs, and a visual web management platform so that developers can get started quickly. Batch writes, concurrent reads, and multi-level cache acceleration remove performance bottlenecks. Hot-backup replicas, real-time Failover, and a table recycle bin ensure database high availability. BTS also provides enterprise-level security, with service availability of 99.9% and data reliability of 99.99999999% (10 nines).
BTS's technical strengths come from 12 years of accumulated engineering inside Baidu, spanning three generations:
Version 1.0: In 2011, Baidu designed its first-generation distributed Table to meet internal business needs. Version 1.0 was a breakthrough in distributed technology, scaling to hundreds of billions of entries and meeting the need for massive data storage. Before 2014, this version mainly provided customized services for businesses such as search.
Version 2.0: In 2015, Baidu began expanding into commercial advertising and other business scenarios, adding FreeSchema, compute-storage separation, and sparse table capabilities.
Version 3.0: In 2018, Baidu began launching cloud services, and the scenarios expanded further to include feed recommendation and AI, with the number of entries growing to hundreds of trillions. BTS now also covers scenarios such as system monitoring and autonomous driving. Because the single-table model could not meet the needs of these new scenarios, BTS added a time-series engine starting from version 3.0. A multi-model engine refactoring is currently underway as well.
1 BTS System Architecture
This section introduces the overall system architecture of BTS. As shown in the figure, the architecture is divided into three layers: the access layer, the engine layer, and the storage layer.
The storage layer is responsible for data block management. The system as a whole uses a compute-storage separation architecture with a modular design. All data sinks to the storage layer, which offers multiple storage media and compression formats and gives customers data lifecycle management capabilities.
The engine layer provides the core data processing and management capabilities and schedules its modules intelligently based on multi-dimensional strategies. Because the engine layer is stateless, each sub-function is modular, decoupled from the others, and independently scalable. The engine layer also includes a dedicated intelligent scheduling module that automatically splits and merges slices according to their load. We are currently working on feedback-based automatic adjustment, such as predicting business fluctuations from data and scheduling ahead of time.
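Load-based splitting and merging of slices can be illustrated with a minimal sketch. This is not the BTS scheduler: the `Slice` type, the integer key ranges, the thresholds, and the assumption that load divides evenly across the two halves of a split are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    start: int    # inclusive start of the key range (hypothetical integer keys)
    end: int      # exclusive end of the key range
    load: float   # hypothetical load metric, e.g. requests per second


def rebalance(slices, split_above, merge_below):
    """Split hot slices in half and merge adjacent cold slices.

    Assumes `slices` is sorted by key range and that
    merge_below <= split_above / 2, so a freshly split half is never
    merged straight back into a neighbour.
    """
    out = []
    for s in slices:
        if s.load > split_above and s.end - s.start > 1:
            mid = (s.start + s.end) // 2
            # Simplification: assume load is spread evenly over the range.
            out.append(Slice(s.start, mid, s.load / 2))
            out.append(Slice(mid, s.end, s.load / 2))
        elif out and out[-1].end == s.start and out[-1].load + s.load < merge_below:
            prev = out.pop()
            out.append(Slice(prev.start, s.end, prev.load + s.load))
        else:
            out.append(s)
    return out
```

For example, with `split_above=100` and `merge_below=20`, two adjacent slices with loads 10 and 5 collapse into one, while a slice with load 400 is split into two halves of load 200 each.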
Finally, the access layer at the top supports a rich set of interfaces and ecosystem integrations. BTS currently supports HBase and Influx interfaces, and more interfaces are being added.
2 BTS Core Technical Architecture Design
NoSQL grew up alongside Internet businesses and was created to address the shortcomings of relational databases in vertical scenarios. High performance, low cost, high availability, and high scalability are therefore the core capabilities of a NoSQL database.
2.1 Single-machine Engine Read/Write Path
Next, we focus on how BTS implements these capabilities. We start with the read/write path of the single-machine engine, which is the foundation for both performance optimization and high availability.
First, the write path. After the client writes data, the data is appended to the Redo-Log and inserted into the memory table. Data dumped from the memory table is then compressed and stored as unit data, with multi-level compression supported. In short: client write -> write to Redo-Log -> insert into memory table -> dump as compressed unit data.
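The write path can be sketched as follows. This is a toy model under stated assumptions, not BTS code: `MiniEngine`, its in-memory `redo_log` list (standing in for a durable log), the flush threshold, and the use of `zlib` as the compressor are all hypothetical.

```python
import json
import zlib


class MiniEngine:
    """Toy write path: Redo-Log -> memory table -> compressed unit data."""

    def __init__(self, memtable_limit=3):
        self.redo_log = []    # stands in for a durable, append-only Redo-Log
        self.memtable = {}    # in-memory table holding the newest writes
        self.units = []       # dumped, compressed data units, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.redo_log.append((key, value))   # 1. log the write first
        self.memtable[key] = value            # 2. apply it to the memory table
        if len(self.memtable) >= self.memtable_limit:
            self.dump()                       # 3. dump once the table is full

    def dump(self):
        # Serialize the memory table in key order and compress it as one unit.
        payload = json.dumps(sorted(self.memtable.items())).encode()
        self.units.append(zlib.compress(payload))
        self.memtable.clear()
        # The dumped entries no longer need replaying after a crash.
        self.redo_log.clear()
```

Writing to the log before the memory table is what allows the in-memory state to be rebuilt after a crash; a real engine would persist the log and handle partial writes.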
The read path runs in the opposite direction: client read -> query the data block cache first -> on a miss, fall through to the unit data -> return the data to the client. The key to fast reads is merging cached data with the data still in memory.
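A minimal sketch of such a read path is shown below. The `Reader` class, the per-key LRU cache (a real block cache works on blocks, not keys), and the choice to consult the memory table first because it holds the newest values are all assumptions made for illustration.

```python
from collections import OrderedDict


class Reader:
    """Toy read path: memory table and cache first, unit data on a miss."""

    def __init__(self, memtable, units, cache_size=2):
        self.memtable = memtable     # newest data, still in memory
        self.units = units           # list of dicts, oldest first (stand-in for unit data)
        self.cache = OrderedDict()   # small LRU cache, keyed per row for simplicity
        self.cache_size = cache_size
        self.unit_reads = 0          # counts simulated reads of unit data

    def get(self, key):
        if key in self.memtable:              # newest value wins
            return self.memtable[key]
        if key in self.cache:                 # cache hit: no unit data touched
            self.cache.move_to_end(key)
            return self.cache[key]
        for unit in reversed(self.units):     # miss: scan units, newest first
            self.unit_reads += 1
            if key in unit:
                self._cache_put(key, unit[key])
                return unit[key]
        return None

    def _cache_put(self, key, value):
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict the least recently used entry
```

The second lookup of the same key is served from the cache, so `unit_reads` does not grow.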
2.2 High Performance Optimization and Low Cost Management
We first look at performance and cost, the two most critical concerns for a NoSQL database.
In terms of performance, there are several key optimizations:
GroupCommit merges small writes and processes them together, greatly improving write throughput. The effect is especially significant on HDD media.
Multiple Compaction strategies reduce I/O amplification.
Writes are merged into larger I/Os, while reads do the opposite: large I/Os are split into small ones that can run concurrently, so that "concurrent I/O + prefetch" improves read throughput.
A configurable cache provides multiple cache levels, such as memory and SSD.
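The GroupCommit idea from the list above can be illustrated with a minimal sketch. `GroupCommitLog` and its size-based trigger are hypothetical; production implementations typically also flush on a timer so that a lone write is not delayed indefinitely.

```python
class GroupCommitLog:
    """Toy group commit: buffer small records, write them as one I/O."""

    def __init__(self, max_batch=4):
        self.pending = []     # records waiting to be committed
        self.max_batch = max_batch
        self.flushes = []     # each element represents one simulated device write

    def append(self, record: bytes):
        self.pending.append(record)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            # One large sequential write instead of many small ones.
            self.flushes.append(b"".join(self.pending))
            self.pending = []
```

Seven records appended with `max_batch=3`, followed by a final `flush()`, result in three device writes instead of seven, which is where the benefit on seek-bound media such as HDD comes from.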
In terms of cost, the storage layer supports multiple redundancy and compression options, including three replicas, Snappy compression, and erasure coding (EC), to save space. BTS also supports switching between capacity-type and performance-type storage, and will provide automatic cooling of individual tables in the future.
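To see why erasure coding saves space compared with three replicas, the raw-to-logical storage ratio can be computed directly. The talk does not give BTS's EC parameters; the 8 data + 4 parity layout below is only an assumed example.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte with full replication."""
    return float(replicas)


def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte with erasure coding."""
    return (data_shards + parity_shards) / data_shards


three_replica = replication_overhead(3)   # 3.0x
ec_8_4 = ec_overhead(8, 4)                # 1.5x, for the assumed 8+4 layout
saving = 1 - ec_8_4 / three_replica       # fraction of raw space saved vs. 3 replicas
```

Under this assumed layout, EC stores 1.5 bytes per logical byte instead of 3, halving the raw capacity needed, at the price of more expensive reconstruction when a shard is lost.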
2.3 High Availability Architecture Design
Next, we introduce how BTS achieves high availability.
Data reliability is guaranteed by the storage layer. For high availability, BTS has a complete HA framework consisting of two parts, control nodes and working nodes, both of which are deployed redundantly. Working nodes run as hot backups and synchronize data through the Redo-Log. The control node is responsible for HA switching: if the primary working node fails, it switches over quickly, keeping unavailability within hundreds of milliseconds, with the switch itself bounded by the RPC scheduling time.
With this architecture, BTS achieves fast Failover, and the MTTR of a service unit has dropped from seconds to within hundreds of milliseconds. During system upgrades, working nodes proactively switch the primary, so the business does not notice the upgrade.
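The hot-backup mechanism can be sketched roughly as follows. The `Worker` and `Controller` classes are hypothetical, everything runs in one process, and failure detection is reduced to an `alive` flag; real systems need heartbeats, fencing of the old primary, and a durable log.

```python
class Worker:
    """A working node that applies Redo-Log entries to its local state."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}
        self.applied = 0   # number of Redo-Log entries applied so far

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        self.applied += 1


class Controller:
    """A control node that routes writes and promotes the hot standby."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.redo_log = []

    def write(self, key, value):
        entry = (key, value)
        self.redo_log.append(entry)
        self.primary.apply(entry)
        if self.standby.alive:
            self.standby.apply(entry)   # the standby stays hot via the same log

    def check_and_failover(self):
        if not self.primary.alive and self.standby.alive:
            # Replay any entries the standby has not yet applied, then promote it.
            for entry in self.redo_log[self.standby.applied:]:
                self.standby.apply(entry)
            self.primary, self.standby = self.standby, self.primary
            return True
        return False
```

Because the standby has already applied the log, promotion requires no bulk data copy, which is what makes sub-second switching plausible. A planned upgrade can reuse the same promotion step proactively.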
BTS offers comprehensive high-availability capabilities, including multi-level fault tolerance and scheduling, a multi-level elastic multi-tenant isolation mechanism, end-to-end data verification, real-time monitoring, and mature operations and maintenance tooling. High availability is ensured jointly by the kernel HA framework and these enterprise-level capabilities.
2.4 Core Technology Summary
Finally, let's summarize the core technologies of BTS:
The first is the cache accelerator. The cache is configurable and multi-level, spanning memory and SSD. Multi-level acceleration can be enabled for specified tables, which effectively speeds up reads of hot data and meets the needs of multiple scenarios.
The second is the multi-active architecture. The overall architecture is hot-backup and multi-active, with automated failover at the millisecond level and proactive switching during upgrades that the business does not notice. It suits businesses that require high availability.
The third is multi-level data compression in the storage layer. It reduces the I/O spent on data that is not needed and supports operating directly on compressed data, which improves read performance and saves storage space.
The last is key-value separated storage, which is currently in internal testing. Storing keys and values separately allows them to be compressed separately and avoids the I/O cost of reading entire rows during queries, improving read and write performance by 50%.
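The general idea behind key-value separation can be shown with a small sketch. This is a generic illustration, not the BTS design: `KVSeparatedStore`, the append-only `value_log`, and the in-memory index are hypothetical.

```python
class KVSeparatedStore:
    """Toy key-value separation: a small index plus an append-only value log."""

    def __init__(self):
        self.value_log = bytearray()   # values only, appended sequentially
        self.index = {}                # key -> (offset, length) into value_log

    def put(self, key: str, value: bytes):
        offset = len(self.value_log)
        self.value_log += value
        self.index[key] = (offset, len(value))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        # Only the requested value is read, not neighbouring data.
        return bytes(self.value_log[offset:offset + length])

    def keys_with_prefix(self, prefix: str):
        # Key-only scans touch the compact index and never read values.
        return sorted(k for k in self.index if k.startswith(prefix))
```

Because keys and values live in different structures, each can be compressed with a scheme suited to it, and operations that only need keys avoid reading large values at all.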
3 Application Scenarios
With the technical architecture covered, we now look at how BTS is applied in real scenarios.
BTS currently serves tens of trillions of accesses per day on average. It covers wide-table, time-series, and big data processing and analysis scenarios across many fields, including IoT, AI, feed, advertising, health, search, web applications, and autonomous driving.
3.1 Autonomous Driving Scenario
Automated training for Apollo driverless vehicles is one of the important scenarios. It requires model training and simulation, for which multi-dimensional environmental data must be retrieved on demand. Data of this type has the following characteristics:
Vehicle-side data is multi-source, including location data, radar data, image data, infrared data, and so on.
Each vehicle generates TB-level data per day, and this massive volume puts heavy pressure on storage cost.
Real-time requirements for the data are high.
These characteristics show that the autonomous driving scenario is defined by high throughput and large data volume, combining the traits of big data and time series. In such a massive time-series scenario, performance and cost are the key.
In terms of performance, the key is to partition data by vehicle and sensor so that time-series writes are scattered and write bottlenecks are removed. The underlying storage units also split and merge automatically according to size and load, which resolves data hotspots.
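One common way to scatter time-series writes is to put a hash-derived bucket in front of the row key. The sketch below assumes a hypothetical key layout (`bucket#vehicle#sensor#timestamp`) and bucket count; the talk does not describe the actual BTS key format.

```python
import zlib


def row_key(vehicle_id: str, sensor: str, ts_ms: int, buckets: int = 16) -> str:
    """Build a row key that spreads series across buckets but keeps time order.

    The layout and the CRC32-based bucket are illustrative assumptions.
    """
    series = f"{vehicle_id}#{sensor}"
    bucket = zlib.crc32(series.encode()) % buckets
    # Zero-padded fields keep lexicographic order equal to numeric order.
    return f"{bucket:02d}#{series}#{ts_ms:013d}"
```

All points of one vehicle and sensor share a prefix and stay sorted by time, so range reads for a single series remain cheap, while different series land in different key ranges instead of all appending to the same "latest timestamp" hotspot.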
In terms of cost, BTS supports data cooling, quarterly table splitting, and table switching to achieve efficient storage.
With this combined solution, BTS provides low-cost storage for hundreds of PB of data and hundreds of GB of throughput, supporting fine-grained simulation and on-demand reads of dimensional data.
3.2 System Monitoring Scenario
Unlike in the autonomous driving scenario, monitoring data is naturally scattered. Physical servers, virtual machines, containers, and other resources all need monitoring along many dimensions to improve business visibility. However, monitoring data contains many small values, which makes reads, writes, and Compaction expensive. This is another difference from the autonomous driving business.
The biggest pain point of the monitoring business is the mixing of offline and online traffic. Offline traffic analyzes the full data set and then produces business reports, which places a heavy load on the system. Online traffic is mainly real-time monitoring and real-time alerting, which requires data to be processed in real time without being blocked.
BTS manages online and offline traffic in separate tiers, using different access identifiers to distinguish online tasks from offline tasks. Offline tasks are cut into very small chunks that each run for a short time, so high-priority online tasks can be inserted at any moment. When online traffic is low, offline traffic can fully use the resources. When online traffic rises, offline tasks quickly yield resources and do not block online tasks.
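The chunking idea can be sketched with a priority queue. The `Scheduler` class, the two priority levels, and the chunk naming are hypothetical; the point is only that an online task never waits longer than one small offline chunk.

```python
import heapq
import itertools

ONLINE, OFFLINE = 0, 1   # lower number means higher priority


class Scheduler:
    """Toy scheduler: offline work is chunked so online work can cut in."""

    def __init__(self, chunk_size=2):
        self.queue = []
        self.seq = itertools.count()   # tie-breaker: FIFO within one priority
        self.chunk_size = chunk_size

    def submit_online(self, name):
        heapq.heappush(self.queue, (ONLINE, next(self.seq), name))

    def submit_offline(self, name, total_units):
        # Cut the offline task into small chunks that each run briefly.
        for start in range(0, total_units, self.chunk_size):
            end = min(start + self.chunk_size, total_units)
            chunk = f"{name}[{start}:{end}]"
            heapq.heappush(self.queue, (OFFLINE, next(self.seq), chunk))

    def run_next(self):
        """Pop and return the name of the next piece of work, or None if idle."""
        if not self.queue:
            return None
        _, _, name = heapq.heappop(self.queue)
        return name
```

If an alert arrives while an offline report is partway through, the alert runs right after the current chunk finishes, and the remaining chunks resume afterwards. When no online work is queued, offline chunks run back to back and use all available capacity.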
Monitoring also places high demands on cost and performance at the same time. Because the monitoring business mostly reads the latest data, multiple cache strategies keep hot data in high-speed cache as much as possible to ensure performance, while cold data is placed on HDD to reduce cost.
With this approach, the cache hit rate exceeds 80%, delivering high performance while keeping costs under control.
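A two-level cache in front of slow storage, together with a hit-rate counter, can be sketched as below. `TieredCache`, the LRU policy on both tiers, and the demotion of memory evictions to the SSD tier are assumptions for illustration; the talk does not specify the BTS cache policies.

```python
from collections import OrderedDict


class TieredCache:
    """Toy memory + SSD cache in front of a slow backing store (e.g. HDD)."""

    def __init__(self, backing, mem_size, ssd_size):
        self.backing = backing        # dict standing in for cold storage
        self.mem = OrderedDict()      # fastest tier, LRU ordered
        self.ssd = OrderedDict()      # second tier, LRU ordered
        self.mem_size, self.ssd_size = mem_size, ssd_size
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)
            self.hits += 1
            return self.mem[key]
        if key in self.ssd:
            value = self.ssd.pop(key)   # promote from SSD back to memory
            self.hits += 1
        else:
            value = self.backing[key]   # served from cold storage
            self.misses += 1
        self._admit(key, value)
        return value

    def _admit(self, key, value):
        self.mem[key] = value
        if len(self.mem) > self.mem_size:
            old_key, old_value = self.mem.popitem(last=False)
            self.ssd[old_key] = old_value         # demote to the SSD tier
            if len(self.ssd) > self.ssd_size:
                self.ssd.popitem(last=False)      # drop from cache entirely

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With a workload that mostly re-reads recent keys, as monitoring dashboards and alerts do, most lookups are served from the two cache tiers and only the cold tail reaches the backing store.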
With these solutions, the system monitoring scenario within Baidu Group has reduced business costs by more than 50%, supports the collection of tens of trillions of monitoring points per day, keeps offline traffic from having any impact on online traffic, and has achieved zero dropped monitoring points.
4 Future Outlook
Building on its existing support for wide tables and time series, BTS will go on to support further models such as documents and search, and will provide customers with unified cross-model analysis and computing, that is, the ability to analyze data of multiple models in a single task.
For context, BTS is one product in the Baidu Intelligent Cloud database product matrix, which includes RDS, NoSQL, cloud-native database, OLAP, and other products. Compared with other cloud vendors in the industry, Baidu Intelligent Cloud Database has two significant features:
It uses the same architecture in the cloud and on-premises, so customers in both environments get the same product capabilities.
It supports the most complete set of product forms in China, including public cloud, the dedicated cloud ABC Stack, the edge computing node BEC, the local computing cluster LCC, and others, serving customers with a wide variety of needs.
Baidu Geek Talk