Data Processing Technologies in the AI Era: Trends and Integration of Vector and Relational Databases
The talk explores how the rapid growth of multimodal data and large language models is reshaping data processing. It highlights three key trends: online‑offline integration, vector‑relational database convergence, and the fusion of data processing with AI computation. It also presents practical solutions and a vision for a unified data‑AI ecosystem.
At the OceanBase Developer Conference, Prof. Chen Wenguang of Tsinghua University and the Ant Technology Research Institute presented "Data Processing Technologies in the AI Era," emphasizing the explosive growth in data volume, velocity, and modalities such as text, audio, images, and video.
He likened data processing to Maslow's hierarchy, progressing from collection and storage to query handling, and finally to AI‑driven processing that can even generate content.
Three major development trends for future databases were identified:
1. Online‑offline integration – unifying real‑time transaction processing with batch analytics to eliminate data inconsistency between separate online and offline pipelines.
2. Vector database and relational database convergence – combining embedding‑based vector search with traditional relational storage to support large‑language‑model services while maintaining a single source of truth.
3. Data processing and AI computation integration – tightly coupling data cleaning, enrichment, and AI model inference in iterative loops, as illustrated by the CCNet workflow for extracting high‑quality web content.
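To make the second trend concrete, here is a minimal sketch of a hybrid query that applies a relational filter and a vector similarity ranking against the same table. It is an illustration only, not how OceanBase or VSAG implement vector search: the `docs` table, the comma-separated embedding column, and the `hybrid_search` helper are all hypothetical, and a brute-force cosine scan stands in for a real vector index.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(conn, query_vec, category, k=2):
    """Filter rows relationally, then rank the survivors by vector similarity.

    Hypothetical helper: a real engine would use a vector index instead of
    scanning every candidate row.
    """
    rows = conn.execute(
        "SELECT id, title, embedding FROM docs WHERE category = ?", (category,)
    ).fetchall()
    scored = []
    for doc_id, title, emb in rows:
        vec = [float(v) for v in emb.split(",")]
        scored.append((cosine(query_vec, vec), doc_id, title))
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, title) for _, doc_id, title in scored[:k]]


# Structured attributes and embeddings live in one table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, category TEXT, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [
        (1, "HTAP overview", "db", "1.0,0.0,0.0"),
        (2, "Vector index basics", "db", "0.0,1.0,0.0"),
        (3, "GPU kernels", "ai", "0.0,0.9,0.1"),
        (4, "Hybrid search", "db", "0.1,0.9,0.0"),
    ],
)

results = hybrid_search(conn, [0.0, 1.0, 0.0], "db", k=2)
print(results)
```

Because the filter and the ranking read the same rows, there is no second copy of the data to drift out of date, which is the "single source of truth" property the trend refers to.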
The speaker highlighted several challenges: inconsistent data across online and offline paths, the need for HTAP engines, and the difficulty of keeping a graph database (TuGraph DB) synchronized with storage systems. He described a solution that uses binlog‑based synchronization and a unified query language (ISO‑GQL).
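The idea behind log-based synchronization can be sketched as replaying an ordered stream of change events into a second store. The example below is an assumption-laden illustration, not the TuGraph DB or OceanBase mechanism: the event dictionaries (`seq`, `op`, `row`) and the `apply_binlog` function are hypothetical and do not follow any real binlog format.

```python
from collections import defaultdict


def apply_binlog(events, graph=None, last_applied=0):
    """Replay ordered change events into an adjacency-set graph.

    Events at or below `last_applied` are skipped, so replaying the same
    log segment twice leaves the graph unchanged (idempotent apply).
    """
    graph = graph if graph is not None else defaultdict(set)
    for ev in events:
        if ev["seq"] <= last_applied:
            continue  # already applied, e.g. after a retry
        src, dst = ev["row"]["src"], ev["row"]["dst"]
        if ev["op"] == "insert":
            graph[src].add(dst)
        elif ev["op"] == "delete":
            graph[src].discard(dst)
        last_applied = ev["seq"]
    return graph, last_applied


# Hypothetical change events captured from an edge table in the source store.
log = [
    {"seq": 1, "op": "insert", "row": {"src": "alice", "dst": "bob"}},
    {"seq": 2, "op": "insert", "row": {"src": "alice", "dst": "carol"}},
    {"seq": 3, "op": "delete", "row": {"src": "alice", "dst": "bob"}},
]

graph, checkpoint = apply_binlog(log)
# Replaying an old segment after the checkpoint has no effect.
graph, checkpoint = apply_binlog(log[:2], graph, checkpoint)
print(dict(graph), checkpoint)
```

Tracking the last applied sequence number is what lets the consumer resume after a failure without double-applying changes.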
Ant Group's VSAG vector library, built on FAISS, was presented as an optimized, developer‑friendly alternative that integrates vector search into OceanBase via plugins.
To bridge AI and big‑data ecosystems, the talk discussed the limitations of current stacks (Python‑based AI on GPUs vs. Java‑based Spark on CPUs) and the performance gap of PySpark. Optimizations to PySpark were shown to double performance in data deduplication tasks.
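For readers unfamiliar with the deduplication workload mentioned above, the following is a minimal single-machine sketch of hash-based paragraph deduplication. It does not reproduce the PySpark optimizations from the talk, and the normalization rule (lowercasing and collapsing whitespace), the 8-byte digest truncation, and the function names are assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, compared by content hash."""
    seen = set()
    kept = []
    for p in paragraphs:
        # Truncated SHA-1 digest keeps the seen-set small (assumed design choice).
        digest = hashlib.sha1(normalize(p).encode("utf-8")).digest()[:8]
        if digest not in seen:
            seen.add(digest)
            kept.append(p)
    return kept


docs = ["Hello   World", "hello world", "Another paragraph.", "Another paragraph. "]
unique = dedup_paragraphs(docs)
print(unique)
```

In a distributed setting the same per-record hashing would run inside the data engine, which is where the cost of crossing between Python and the JVM becomes relevant.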
Finally, a vision was offered for a unified data‑AI platform that supports both CPU and GPU workloads, enabling "write once, run everywhere" across heterogeneous hardware.
AntTech
Technology is the core driver of Ant's future creation.