Engineering and Algorithm Innovations for RAG Engines in Office Scenarios
The article analyzes the challenges of deploying large language models in enterprise settings and presents a modular Retrieval‑Augmented Generation (RAG) solution that combines document parsing, multi‑turn query rewriting, hybrid vector‑plus‑BM25 retrieval, two‑stage ranking (RRF, ColBERT, cross‑encoder) and knowledge‑filtered prompt engineering to achieve more comprehensive search, better ranking and more accurate answers.
Background
Large language models (LLMs) suffer from hallucinations, stale knowledge, and data‑privacy risks, which hinder their practical adoption in office environments. Retrieval‑Augmented Generation (RAG) addresses these issues by coupling a retrieval system with an LLM, allowing external knowledge to be injected at generation time.
Advantages of RAG
External knowledge can be added or updated without retraining the LLM.
The retrieval step provides observable and explainable evidence, reducing hallucinations.
RAG enables fine‑grained control over the information used for answer generation.
System Components
A RAG system consists of five core modules: data source, data‑processing (format conversion), retriever, ranker, and generator.
Core Architecture
The traditional RAG pipeline creates an index, retrieves documents, and feeds them to the LLM. Advanced RAG adds a pre‑retrieval query‑rewrite stage (e.g., HyDE) and a post‑retrieval processing stage (rerank, filter) before generation. The "Modular RAG" paper describes this evolution.
Our Modular RAG Design
The architecture is layered from bottom to top:
Algorithm layer: OCR, layout analysis, table recognition, multi‑turn query rewrite (TPLinker), tokenization.
Workflow layer: Offline ingestion (document parsing, token/segment splitting, vector and text indexing) and online QA (query rewrite, hybrid retrieval, ranking, generation). Storage back‑ends include a vector DB, Elasticsearch and MySQL.
User‑config layer: Knowledge‑base management, model selection, dialogue rules.
Design benefits include modularity, horizontal scalability, plug‑and‑play algorithms, low cost, and easy maintenance.
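As a rough illustration of the plug‑and‑play idea, the online QA workflow can be modelled as a pipeline of swappable callables. This is a minimal sketch, not the article's implementation: the names Chunk, RagPipeline and demo_pipeline, and the toy keyword retriever, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_id: str
    text: str


# Each stage is a plain callable, so one implementation can be swapped
# for another without touching the rest (all names are illustrative).
Rewriter = Callable[[str, List[str]], str]          # (question, history) -> query
Retriever = Callable[[str], List[Chunk]]            # query -> candidates
Ranker = Callable[[str, List[Chunk]], List[Chunk]]  # (query, candidates) -> ordered
Generator = Callable[[str, List[Chunk]], str]       # (query, chunks) -> answer


@dataclass
class RagPipeline:
    rewrite: Rewriter
    retrievers: List[Retriever]
    rank: Ranker
    generate: Generator
    top_k: int = 5

    def answer(self, question: str, history: List[str]) -> str:
        query = self.rewrite(question, history)
        candidates: List[Chunk] = []
        for retrieve in self.retrievers:  # hybrid retrieval: pool all retrievers
            candidates.extend(retrieve(query))
        ranked = self.rank(query, candidates)[: self.top_k]
        return self.generate(query, ranked)


def demo_pipeline() -> RagPipeline:
    """Wire the pipeline with toy stand-ins for each stage."""
    corpus = [
        Chunk("a", "Expense reports are due on the 5th."),
        Chunk("b", "VPN access requires a hardware token."),
    ]

    def keyword_retriever(query: str) -> List[Chunk]:
        words = set(query.lower().split())
        return [c for c in corpus if words & set(c.text.lower().split())]

    return RagPipeline(
        rewrite=lambda question, history: question.strip(),
        retrievers=[keyword_retriever],
        rank=lambda query, chunks: sorted(chunks, key=lambda c: len(c.text)),
        generate=lambda query, chunks: " | ".join(c.doc_id for c in chunks),
    )
```

Replacing any one stage, for example a different ranker, only means passing a different callable to RagPipeline.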
Retrieval: “Search More Completely”
Offline, PDFs and Word files are parsed (OCR for PDFs, layout recovery, table extraction), split into logical blocks, and indexed both as text and vectors (BGE‑M3 and BCE models were chosen after relevance testing). Online, a multi‑turn query‑rewrite module converts user questions into richer queries before hybrid retrieval, which combines dense vector search with BM25 full‑text search. This ensures a broad candidate set (e.g., 100 retrieved documents).
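The hybrid retrieval step can be sketched in memory as follows. The system described above uses Elasticsearch and a vector database; the pure‑Python BM25 scorer, the cosine search over precomputed vectors, and the function names here are stand‑ins assumed for illustration, and the BM25 parameters k1 and b are common defaults rather than values from the article.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_rank(query_tokens: Sequence[str], docs_tokens: Sequence[Sequence[str]],
              k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices ordered by BM25 score, best first."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_rank(query_vec: Sequence[float],
               doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered by cosine similarity, best first."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)


def hybrid_candidates(bm25_ids: Sequence[int], dense_ids: Sequence[int],
                      per_retriever: int) -> List[int]:
    """Union of the top results of both retrievers, in first-seen order."""
    merged: List[int] = []
    for ids in (bm25_ids[:per_retriever], dense_ids[:per_retriever]):
        for i in ids:
            if i not in merged:
                merged.append(i)
    return merged
```

Pooling both lists is what widens recall: a document that shares no keywords with the query can still enter the candidate set through the dense retriever, and vice versa.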
Ranking: “Rank Better”
A two‑stage ranking pipeline is applied:
Coarse ranking: Reciprocal Rank Fusion (RRF) merges the ranked lists returned by the different retrievers using only each document's rank position, not its absolute score, and selects the top 20 candidates.
Fine ranking: Models such as ColBERT (late‑interaction, token‑level similarity) and a cross‑encoder are used to reorder the top 20, finally picking the top 5. Knowledge filtering (an NLI binary classifier) is added to discard irrelevant passages, offering a cheaper alternative to full‑blown ranking models.
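The two stages above can be sketched as follows. This is a minimal illustration under stated assumptions: the constant k = 60 is a commonly used RRF default and is not given in the article, maxsim shows only the late‑interaction scoring idea behind ColBERT (not the model itself), and score_fn is a hypothetical hook standing in for whichever fine ranker is plugged in.

```python
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def maxsim(query_vecs: Sequence[Sequence[float]],
           doc_vecs: Sequence[Sequence[float]]) -> float:
    """Late interaction: for each query token take its best-matching
    document token (dot product), then sum over query tokens."""
    def dot(u: Sequence[float], v: Sequence[float]) -> float:
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def two_stage_rank(rankings: Sequence[Sequence[str]],
                   score_fn: Callable[[str], float],
                   coarse_k: int = 20, final_k: int = 5) -> List[str]:
    """Coarse stage: RRF keeps coarse_k ids. Fine stage: re-score and keep final_k."""
    coarse = rrf_fuse(rankings)[:coarse_k]
    return sorted(coarse, key=score_fn, reverse=True)[:final_k]
```

Because RRF looks only at rank positions, BM25 scores and cosine similarities never need to be normalized onto a common scale before fusion.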
Tokenization experiments showed that jieba and Baidu LAC split text too finely, TexSmart splits it too coarsely, and the cutword model strikes a balance between the two.
Generation: “Answer More Accurately”
After ranking, the selected chunks are formatted and inserted into a prompt template with separate knowledge and question sections. To improve answer fidelity, a two‑stage FoRAG generation is used: first an outline is generated, then the final answer is expanded based on the outline, the query, and the retrieved knowledge. Chunk size (e.g., 128, 256, or 512 tokens) is a trade‑off: overly short chunks lose surrounding context, while overly long chunks cause information loss.
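A minimal sketch of the prompt assembly and the outline‑then‑answer flow is shown below. The template wording, the section markers, and the llm callable (any function that maps a prompt string to a completion string) are assumptions for illustration; the article does not give its actual prompts.

```python
from typing import Callable, List

# Hypothetical template with separate knowledge and question sections.
PROMPT_TEMPLATE = """Answer the question using only the knowledge below.

[Knowledge]
{knowledge}

[Question]
{question}
"""


def build_prompt(chunks: List[str], question: str) -> str:
    """Number the ranked chunks and place them in the knowledge section."""
    knowledge = "\n".join(f"({i}) {c.strip()}" for i, c in enumerate(chunks, 1))
    return PROMPT_TEMPLATE.format(knowledge=knowledge, question=question.strip())


def outline_then_answer(llm: Callable[[str], str],
                        chunks: List[str], question: str) -> str:
    """Stage 1 asks for an outline; stage 2 expands it into the final answer."""
    base = build_prompt(chunks, question)
    outline = llm(base + "\nWrite a short outline of the answer.")
    return llm(base + "\n[Outline]\n" + outline
               + "\nExpand the outline into the final answer.")
```

Both calls see the same knowledge and question, so the second stage can check the outline against the retrieved chunks rather than relying on the outline alone.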
Additional Considerations
Duplicate removal and knowledge aggregation are performed before prompt assembly.
When the retrieved context is incomplete, hierarchical fallback (same‑level or parent‑level content) is used to supplement it.
Latency can be reduced by selecting lightweight ranking models such as ColBERT when hardware is constrained.
Future work includes multimodal support for images and audio‑video content.
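The duplicate‑removal and hierarchical‑fallback steps listed above might look like the sketch below. The Section structure, the whitespace‑and‑case normalization used to detect duplicates, and the length threshold used to decide that a hit is "incomplete" are all assumptions for illustration; the article does not specify its criteria.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Section:
    sec_id: str
    text: str
    parent_id: Optional[str] = None


def dedupe(chunks: List[Section]) -> List[Section]:
    """Drop chunks whose text matches an earlier one after normalization."""
    seen = set()
    unique: List[Section] = []
    for chunk in chunks:
        key = " ".join(chunk.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def with_fallback(hit: Section, index: Dict[str, Section],
                  min_chars: int = 40) -> List[Section]:
    """If a hit looks too short to stand alone (assumed heuristic), add
    same-level sections first, otherwise fall back to the parent section."""
    if len(hit.text) >= min_chars or hit.parent_id is None:
        return [hit]
    siblings = [s for s in index.values()
                if s.parent_id == hit.parent_id and s.sec_id != hit.sec_id]
    if siblings:
        return [hit] + siblings
    parent = index.get(hit.parent_id)
    return [hit, parent] if parent else [hit]
```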
Q&A Highlights
Evaluation focuses on bad‑case resolution rate and overall accuracy. Incomplete context is handled by hierarchical content completion, and latency is mitigated by choosing lighter ranking models. Document parsing must preserve structure to avoid information loss, and chunk size should be tuned per scenario.
Conclusion & Lessons Learned
Building a production‑grade RAG system requires careful engineering at every stage—retrieval, ranking, and generation. A modular, pluggable design allows teams to iterate on algorithms, balance efficiency with accuracy, and adapt to specific business needs.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.