Building an Enterprise‑Grade RAG 2.0 System: Architecture, Challenges, and Practices
This article analyzes an enterprise‑grade RAG 2.0 solution. It covers the problems that motivate the design, the layered architecture, the offline and online pipelines, document parsing, multi‑turn query rewriting, hybrid vector‑plus‑BM25 retrieval, ranking with RRF, ColBERT, and a cross‑encoder, knowledge filtering, two‑stage generation with FoRAG, and practical evaluation metrics.
Background
Large language models (LLMs) still suffer from hallucination, stale knowledge, and data‑privacy risks when deployed in real‑world applications. Retrieval‑Augmented Generation (RAG) mitigates these issues by coupling a retriever with a generator, so that answers are grounded in documents retrieved at query time.
Core Technical Architecture
The RAG system is organized into three layers:
Algorithm layer: OCR, multi‑turn query rewriting, text segmentation, table recognition, etc.
Process layer: Offline ingestion (document parsing, tokenization, vector indexing) and online QA (query rewriting, hybrid retrieval, ranking, generation). The underlying storage consists of a vector database, Elasticsearch, and MySQL.
Management layer: Configuration of knowledge bases, models, and dialogue rules.
Figure 1 (below) shows the modular RAG architecture, extending the classic RAG pipeline with pre‑processing (query rewrite, HyDE) and post‑processing (rerank, filtering).
Construction Challenges and Practice
1. Search More Completely
Offline, documents (PDF, Word) are processed through OCR, layout recovery, table extraction, and chunking. Chunk size is tuned within a range such as 128–512 tokens: chunks that are too small lose context, while chunks that are too large dilute the match. The text is then tokenized, embedded with the BGE‑M3 and BCE models, and indexed.
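The chunking step can be illustrated with a minimal sketch. It assumes fixed-size windows with a small overlap, and it uses whitespace splitting as a stand-in for the embedding model's tokenizer. The function names, the overlap parameter, and the default values are illustrative assumptions, not details from the original system.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, chunk_size=256, overlap=32):
    """Chunk raw text; whitespace tokens are an assumption for illustration."""
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]
```

In practice the token count would come from the tokenizer of the embedding model, and chunk boundaries would usually follow the recovered layout (headings, paragraphs, tables) rather than raw offsets.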
Online, user queries undergo multi‑turn rewriting using a TPLinker‑based relation‑extraction model to resolve coreference and fill missing information before retrieval.
Hybrid retrieval combines vector similarity (semantic matching, multilingual support) with BM25 full‑text search (exact keyword matching). The two result sets are merged using Reciprocal Rank Fusion (RRF), which aggregates rankings based on position rather than raw scores.
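A minimal sketch of Reciprocal Rank Fusion follows. Each document receives 1 / (k + rank) from every result list it appears in, and the contributions are summed. The constant k = 60 is the value commonly used in the literature; the article does not state which value the system uses, so treat it as an assumption, along with the function name and the tie-breaking rule.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: iterable of lists, each ordered best-first.
    Returns a list of (doc_id, fused_score), best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters; raw retriever scores are ignored.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


vector_hits = ["d1", "d2", "d3"]   # hypothetical vector-search results
bm25_hits = ["d3", "d1", "d4"]     # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because only ranks are used, the cosine similarities from the vector index and the BM25 scores never need to be normalized onto a common scale.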
2. Rank Better
The top 100 retrieved candidates pass through two ranking stages, followed by a knowledge filter:
Coarse ranking: RRF selects the top 20.
Fine ranking: Interaction‑based models, ColBERT (late‑interaction, token‑level scoring) and a cross‑encoder (full query‑document interaction), select the final top 5.
Knowledge filter: A lightweight binary classifier trained on business‑specific data discards irrelevant chunks from the top 5, offering a cheaper alternative to additional ranking models.
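The late-interaction scoring used by ColBERT can be sketched in plain Python. For each query token embedding, take the maximum cosine similarity over all document token embeddings, then sum these maxima (the MaxSim operator). The embeddings here are supplied by the caller; in a real system they would come from a ColBERT encoder. The function names and the `top_k` default are illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best cosine match among document tokens."""
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in doc_vecs]
    if not d:
        return 0.0
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def rerank(query_vecs, candidates, top_k=5):
    """candidates: dict mapping doc_id -> list of token embeddings."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs))
              for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda item: -item[1])
    return scored[:top_k]
```

Document token embeddings can be computed offline, which is why late interaction is cheaper at query time than a cross-encoder that must process every query-document pair jointly.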
3. Answer More Accurately
Ranked knowledge chunks are formatted (knowledge layout) and inserted into a prompt template with separate knowledge and question fields. To improve answer structure, a two‑stage generation (FoRAG) first produces an outline and then expands it into the final response, ensuring alignment with the query and retrieved context.
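The prompt assembly and the outline-then-expand flow can be sketched as follows. This is only a prompting-level illustration: the template wording, the function names, and the `llm` callable (any function that maps a prompt string to a completion string) are assumptions, and the sketch does not reproduce how FoRAG itself is trained or configured.

```python
from typing import Callable, List

# Hypothetical template with separate knowledge and question fields.
PROMPT_TEMPLATE = (
    "Knowledge:\n{knowledge}\n\n"
    "Question:\n{question}\n\n"
    "{instruction}"
)


def layout_knowledge(chunks: List[str]) -> str:
    """Number the ranked chunks so the model can tell them apart."""
    return "\n".join(f"[{i}] {chunk.strip()}"
                     for i, chunk in enumerate(chunks, start=1))


def build_prompt(chunks: List[str], question: str, instruction: str) -> str:
    return PROMPT_TEMPLATE.format(
        knowledge=layout_knowledge(chunks),
        question=question,
        instruction=instruction,
    )


def two_stage_answer(llm: Callable[[str], str],
                     chunks: List[str], question: str) -> str:
    """Stage 1 asks for an outline; stage 2 expands it into the answer."""
    outline = llm(build_prompt(
        chunks, question,
        "Write a short outline for the answer, using only the knowledge above.",
    ))
    expand = ("Expand the outline below into a full answer grounded in the "
              "knowledge above.\nOutline:\n" + outline)
    return llm(build_prompt(chunks, question, expand))
```

Both stages see the same knowledge and question fields, so the expansion step stays anchored to the retrieved context rather than only to the outline.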
Evaluation and Insights
Key observations from the deployment:
Modular, layered design enables horizontal scaling and plug‑in upgrades.
Hybrid retrieval improves recall while maintaining precision.
RRF provides an efficient coarse‑ranking without model inference.
ColBERT offers a good trade‑off between speed and accuracy for token‑level interaction.
Chunk size directly impacts both retrieval recall and generation fidelity; a balanced size (e.g., 256 tokens) was chosen after empirical testing.
Q&A Highlights
Q1: What metrics determine production readiness?
A1: Manual evaluation of document‑question‑answer triples, bad‑case resolution rate, and overall accuracy across departments.
Q2: How to handle incomplete context in hierarchical documents?
A2: Augment the query with sibling and parent layers based on the document hierarchy, respecting the model’s input length limits.
Q3: Strategies for latency reduction?
A3: Profile bottlenecks; adopt lighter ranking models (e.g., ColBERT) when hardware is constrained.
Q4: Beyond chunk size, what optimizations matter?
A4: Preserve full document structure during parsing and store it in the index for precise retrieval.
Q5: Handling audio/video data?
A5: Currently unsupported; future work includes multimodal extensions.
Q6: Improving QA on large tables?
A6: Currently the whole table is fed to the LLM; fine‑grained table region extraction is a pending improvement.
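One possible reading of the answer to Q2 is sketched below: starting from the chunk that was retrieved, add its parent and then its siblings from the document hierarchy, as long as the total stays within a token budget. The `Node` structure, the parent-first order, and the whitespace token count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    text: str
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


def expand_context(tree: Dict[str, Node], hit_id: str,
                   max_tokens: int) -> List[str]:
    """Return node ids to include: the hit, then parent and siblings that fit."""
    hit = tree[hit_id]
    selected = [hit_id]
    # Whitespace word count stands in for the model's real token count.
    budget = max_tokens - len(hit.text.split())
    candidates: List[str] = []
    if hit.parent is not None:
        parent = tree[hit.parent]
        candidates.append(parent.node_id)
        candidates.extend(c for c in parent.children if c != hit_id)
    for node_id in candidates:
        cost = len(tree[node_id].text.split())
        if cost <= budget:  # skip anything that would exceed the limit
            selected.append(node_id)
            budget -= cost
    return selected
```

This relies on the parser preserving the document structure and storing it alongside the index, as described in A4.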
Conclusion
Building a robust RAG system requires careful attention to every pipeline stage, from document ingestion and query rewriting to hybrid retrieval, multi‑stage ranking, knowledge filtering, and structured generation. The enterprise‑grade design presented here shows how modular components, appropriate choices of models and methods (BGE‑M3 and BCE for embedding, RRF for fusion, ColBERT for reranking), and systematic evaluation can deliver accurate, explainable, and scalable LLM‑augmented QA solutions.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
