Deep Dive into Apache Doris’ Multimodal Capabilities: Architecture and Enterprise Deployments
Apache Doris 4.0 introduces native vector indexes, built-in AI functions, and hybrid search, turning the OLAP engine into an AI-centric analytics hub. This article details the technical design, performance optimizations, and real-world deployments at ByteDance, Squirrel AI, NetEase, and a security vendor, highlighting storage savings, query speedups, and reduced operational complexity.
Background and Motivation
Unstructured data now accounts for more than 80% of enterprise information. Silos arise because structured metrics reside in OLAP systems while text, images, and video are stored in object stores or dedicated vector databases. The AI era demands "one dataset, multiple retrieval modes" (full-text search, semantic vector search, and traditional SQL aggregation), which drives Doris's multimodal evolution.
Technical Architecture in Doris 4.0
Native Vector Index
Doris 4.0 integrates Faiss‑based ANN algorithms (HNSW and IVF_PQ) to support billion‑scale vectors with millisecond‑level top‑N and range queries. Key design points include:
Index algorithms: HNSW (high performance, high memory) and IVF_PQ (balanced performance and cost).
Distance metrics: L2 and Inner Product (equivalent to cosine after normalization).
Hybrid filtering: pre‑filtering with inverted index before ANN to balance recall and latency.
Quantization: SQ/PQ reduces memory usage by 4‑8×.
Optimization: Ann Index Only Scan avoids original column I/O, delivering up to 4× speedup over version 4.0.
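To make the lower end of the quantization figure concrete, here is a minimal Python sketch of 8-bit scalar quantization, assuming a simple per-vector min/max quantizer. It illustrates the general technique only and is not how Doris or Faiss implements SQ; the helper names sq8_encode and sq8_decode are hypothetical.

```python
from array import array

def sq8_encode(vec):
    """Quantize a float vector to 8-bit codes using its own min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = array("B", (round((x - lo) / scale) for x in vec))
    return codes, lo, scale

def sq8_decode(codes, lo, scale):
    """Reconstruct approximate float values from 8-bit codes."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.53, 0.98, 0.0, -1.0, 0.41, 0.77, -0.25]
raw = array("f", vec)                      # float32 storage: 4 bytes per dimension
codes, lo, scale = sq8_encode(vec)

raw_bytes = len(raw) * raw.itemsize        # 8 dims * 4 bytes
code_bytes = len(codes) * codes.itemsize   # 8 dims * 1 byte
max_err = max(abs(a - b) for a, b in zip(vec, sq8_decode(codes, lo, scale)))
print(raw_bytes // code_bytes, max_err)
```

Each dimension shrinks from four bytes to one, which gives the 4× saving, at the cost of a reconstruction error bounded by half a quantization step. Product quantization compresses further by encoding groups of dimensions with a shared codebook.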
Example DDL:
CREATE TABLE document_vectors (
    id BIGINT,
    content TEXT,
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
        "algorithm" = "HNSW",
        "metric_type" = "L2",
        "dim" = "768"
    )
) DISTRIBUTED BY HASH(id) BUCKETS 8;
AI Functions
Doris ships with more than ten AI functions that can be invoked directly from SQL. Core functions include:
AI_QUERY : Calls an LLM to classify, summarize, or perform sentiment analysis on unstructured text, returning structured results.
TEXT_EMBEDDING : Generates embeddings at query time, eliminating the need to pre‑compute vectors in the application.
AI_GENERATE / AI_EXTRACT : Supports content generation and information extraction.
Typical usage – sentiment analysis of customer reviews:
SELECT AI_QUERY('volcengine/Doubao-pro-128k', concat('Classify as positive (1) or negative (0): ', review_txt)) AS sentiment,
       count(*) AS cnt
FROM customer_reviews
GROUP BY sentiment;
Hybrid Search
Doris unifies three retrieval modes in a single engine:
Keyword search : Inverted index with tablet‑level BM25 scoring.
Semantic search : Vector index with ANN nearest‑neighbor lookup.
Tag/Rule search : Traditional SQL WHERE predicates.
Results from the three paths can be merged with UNION ALL and re‑ranked via a Python UDF, removing the need for external search services.
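The source does not say what logic the re-ranking UDF applies. As one possible sketch, the following Python function fuses ranked lists with reciprocal rank fusion (RRF), a common technique swapped in here purely for illustration. The function name rrf_merge, the constant k=60, and the sample document IDs are all assumptions.

```python
from collections import defaultdict

def rrf_merge(channels, k=60, top_n=7):
    """Fuse several ranked result lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranked in channels:
        for rank, doc in enumerate(ranked, start=1):
            # A document earns more credit the higher it ranks in a channel.
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ordered[:top_n]]

# Hypothetical results from the keyword, semantic, and tag/rule channels.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
tag_hits = ["doc_b", "doc_c"]

merged = rrf_merge([keyword_hits, vector_hits, tag_hits])
print(merged)
```

RRF needs only rank positions, so it sidesteps the problem that BM25 scores and vector distances are not on comparable scales. A model-based re-ranker would replace the scoring line with a call to the model.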
MCP Server for Agent Integration
Doris natively supports the Model Context Protocol (MCP), allowing AI agents to issue NL2SQL queries without handling SQL syntax, enabling real‑time data analysis for agents.
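MCP messages are JSON-RPC 2.0, so a tool invocation from an agent can be sketched as a plain JSON payload. The Python snippet below builds a tools/call request, assuming the agent's model has already turned a natural-language question into SQL. The tool name exec_query and the argument key sql are hypothetical; an agent would discover the real tool names through a tools/list request.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool name and argument key; real names come from `tools/list`.
request = build_tool_call(
    1,
    "exec_query",
    {"sql": "SELECT count(*) FROM customer_reviews"},
)
payload = json.dumps(request)
print(payload)
```

The agent framework, not the end user, assembles this message, which is why users of the agent never need to write SQL themselves.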
Enterprise Deployments
ByteDance – DataMind One‑Stop AI‑Data Engine
Project background: At the end of 2024, ByteDance launched DataMind to unify text, audio-video, and structured data for seamless AI model integration.
Chosen for MPP query performance, built‑in inverted index, active community, and simple FE/BE architecture.
Hybrid search engine combines vector and BM25 indexes; Python UDF provides multi‑process re‑ranking.
GraphRAG builds entity graphs from AI‑extracted metadata, enabling similarity search and community reporting.
Quantitative impact: unified architecture replaces ES + vector DB + OLAP; development cycles shortened; AI capabilities reused across teams; storage compression >80%.
Squirrel AI – PB‑Level Multimodal Dataset Management
Project background: Squirrel AI needed a unified platform for petabyte‑scale multimodal training corpora, addressing data inconsistency and low collaboration efficiency.
Adopted an MLOps-style pipeline with Doris as the core data-versioning store.
Key techniques: partitioned tables for millisecond version switching, hot‑SSD / cold‑HDD tiering (30%+ SSD usage reduction), columnar compression (80%+ storage saving), and high‑frequency point‑lookup optimizations for billions of sample IDs.
Performance numbers: 3× write throughput, >2× sync speed, >3× query efficiency, >20% faster model development.
NetEase – Unified Log and Time‑Series Analytics
Scenario: NetEase faced separate Elasticsearch (logs) and InfluxDB (metrics) stacks, leading to high ops cost and data fragmentation.
Replaced ES with Doris inverted index and InfluxDB with Doris time‑partitioned tables.
One‑SQL query now performs both keyword filtering and aggregation.
Quantitative gains: storage cost ↓70% (100 TB → 30 TB), write throughput ↑several‑fold, query speed ↑3‑5×, ops complexity reduced to a single Doris cluster.
Security Vendor – Log Storage & Analysis System
Scenario: Existing StarRocks‑based system could not meet real‑time SLA for security event analysis.
Deployed a 3‑node Doris cluster with Stream Load + Routine Load and asynchronous replica strategy.
Leveraged Doris inverted index for keyword filtering and zone‑based partition pruning.
Results: write speed ↑3×, query speed ↑7×, hardware requirement reduced to three servers, storage cost further lowered by columnar compression.
Technical Practice Highlights
Hybrid Search SQL Example
The following statement combines vector retrieval, BM25 keyword search, and a Python UDF re‑rank in a single query:
-- Channel 1: semantic vector search
WITH channel_vec AS (
    SELECT content
    FROM my_table
    ORDER BY APPROX_COSINE_SIMILARITY(
        TEXT_EMBEDDING('doubao-embedding', 'Doris multimodal'),
        content_vec_col) DESC
    LIMIT 7
),
-- Channel 2: keyword search
channel_bm25 AS (
    SELECT content
    FROM my_table
    WHERE MATCH_ANY(content, 'Doris multimodal')
    ORDER BY BM25() DESC
    LIMIT 7
)
-- Merge both channels and re-rank with the Python UDF
SELECT content
FROM (SELECT content FROM channel_vec
      UNION ALL
      SELECT content FROM channel_bm25) t
ORDER BY py_udf_rerank('Doris multimodal', content) DESC
LIMIT 7;
Vector Index Tuning Recommendations
Memory estimation: dim × 4 bytes × row_count plus ANN overhead.
HNSW defaults: M=16, ef_construction=200; increase for higher recall.
Normalization: L2‑normalize vectors before import if cosine similarity is required; use Inner Product at query time.
Hybrid filtering: when the pre‑filter set is small, Doris automatically falls back to brute‑force computation.
Index building: use asynchronous BUILD INDEX and monitor via SHOW BUILD INDEX.
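The memory formula and the normalization advice above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the estimate covers only the raw float32 payload (ANN graph or codebook overhead is excluded), and the sample vectors are made up.

```python
import math

def raw_vector_bytes(dim, row_count, bytes_per_value=4):
    """Raw float32 payload: dim x 4 bytes x row_count (ANN overhead excluded)."""
    return dim * bytes_per_value * row_count

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# 100 million 768-dimensional vectors, before any index overhead.
gib = raw_vector_bytes(768, 100_000_000) / 2**30
print(f"{gib:.1f} GiB of raw vectors")

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))
```

Because the inner product of unit-length vectors equals their cosine similarity, normalizing once at import time lets every query use the cheaper Inner Product metric.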
Production Considerations
Vector columns must be defined as ARRAY<FLOAT> and declared NOT NULL.
ANN and inverted indexes can coexist on the same table – the standard pattern for RAG workloads.
For datasets >1 billion rows, prefer IVF_PQ to control memory consumption.
AI function calls involve external model services; configure resource pools and timeouts accordingly.
Future Outlook
Native support for multimodal embeddings beyond text (image, audio) in upcoming releases.
Deeper AI Agent integration via MCP, positioning Doris as the real‑time analytics backbone for agents.
Evolution toward Hybrid Search and Analytics Processing (HSAP) architecture.
Separation of storage and compute with elastic vector workloads to reduce resource costs.
Conclusion
Apache Doris's multimodal capabilities are more than an add-on: they are integrated natively across the storage engine, query optimizer, and function framework. A single engine, a single dataset, and a single SQL statement can perform structured analysis, full-text search, and semantic retrieval, which simplifies AI application architectures and cuts operational overhead.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
