
MMR for RAG: Low-Cost Chunk Limits Balance Relevance and Diversity

When a long document is split into many highly similar chunks, vector‑based top‑k retrieval tends to return several pieces from the same source, so one document dominates the context. Combining a per‑document chunk limit with Maximal Marginal Relevance (MMR) re‑ranking restores diversity while preserving relevance, which makes it a low‑cost way to improve RAG answer quality.

AI Engineer Programming

May 27, 2026

Problem

When a long document is split into many fine‑grained chunks that are all highly relevant to a query, top‑k vector retrieval often returns multiple chunks from the same source, limiting answer quality.

Why it happens

Long document length : The number of chunks from a long document far exceeds that of other documents, increasing its chance of being selected.

High semantic relevance : The document’s content matches the query closely, so many of its chunks obtain high similarity scores.

Overly fine chunk granularity : Adjacent chunks overlap heavily (e.g., sliding‑window splitting), making them almost identical for the retrieval model.

Winner‑takes‑all retrieval : Standard top‑k similarity ranking orders only by score and does not consider diversity.

In vector space, similar content forms dense clusters; a single top‑k retrieval round will likely allocate all slots to that cluster. KNN‑based or cross‑encoder re‑rankers return the most relevant paragraphs but ignore redundancy among them.
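A small sketch can make the first and third causes concrete. It assumes a synthetic document of 1,000 distinct tokens and an illustrative window of 100 tokens with a stride of 20; both numbers and the helper names are assumptions, not values from any particular system. Token overlap is used here only as a rough proxy for how close the chunk embeddings would be.

```python
def sliding_window_chunks(words, size, stride):
    """Split a token list into windows of `size` tokens, advancing by `stride`."""
    last_start = max(len(words) - size, 0)
    return [words[i:i + size] for i in range(0, last_start + 1, stride)]


def jaccard(a, b):
    """Share of distinct tokens that two chunks have in common."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


# Synthetic long document: 1,000 distinct tokens, so overlap is easy to read off.
long_doc = [f"w{i}" for i in range(1000)]
short_doc = long_doc[:100]

long_chunks = sliding_window_chunks(long_doc, size=100, stride=20)
short_chunks = sliding_window_chunks(short_doc, size=100, stride=20)

print("chunks from long document:", len(long_chunks))
print("chunks from short document:", len(short_chunks))
print("token overlap of adjacent chunks:", round(jaccard(long_chunks[0], long_chunks[1]), 3))
```

With these settings the long document contributes 46 chunks against 1 from the short one, and adjacent chunks share two thirds of their tokens, so the long document occupies many more positions in the index and its neighbors look nearly alike to the retriever.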

Formal definition of MMR

MMR (Maximal Marginal Relevance) was introduced by Carbonell and Goldstein (SIGIR 1998). It re‑ranks retrieved documents while reducing redundancy and preserving relevance.

Q : user query

S : set of already selected documents

R : candidate document set (R \ S is the set of candidates not yet selected)

Sim₁ : similarity between a document and the query (relevance)

Sim₂ : similarity between a candidate document and the already selected documents (redundancy)

λ : balancing coefficient in [0, 1]; λ → 1 favors relevance, λ → 0 favors diversity

MMR algorithm

1. Obtain a candidate set R (top‑N) via vector retrieval.

2. Initialize the selected set S = ∅ and add the most relevant document to S.

3. Compute the MMR score for each unselected document:

score(d) = λ·Sim₁(d, Q) − (1 − λ)·max_{s∈S} Sim₂(d, s)

4. Select the document with the highest score, add it to S, and remove it from R.

5. Repeat steps 3–4 until |S| = k.
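The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes cosine similarity for both Sim₁ and Sim₂, non‑zero embedding vectors, and a candidate set that is already the top‑N from vector retrieval. The function names and the toy vectors are made up for the example.

```python
import numpy as np


def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b (rows must be non-zero)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mmr(query_vec, cand_vecs, k, lam=0.5):
    """Return the indices of up to k candidates chosen by greedy MMR."""
    cand_vecs = np.asarray(cand_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    if k <= 0 or len(cand_vecs) == 0:
        return []

    relevance = cosine_matrix(cand_vecs, query_vec[None, :])[:, 0]  # Sim1(d, Q)
    pairwise = cosine_matrix(cand_vecs, cand_vecs)                   # Sim2(d, s)

    # Step 2: seed S with the most relevant candidate.
    selected = [int(np.argmax(relevance))]
    # Running value of max_{s in S} Sim2(d, s) for every candidate d.
    max_sim_to_selected = pairwise[selected[0]].copy()

    # Steps 3-5: score, pick the best unselected candidate, repeat.
    while len(selected) < min(k, len(cand_vecs)):
        scores = lam * relevance - (1.0 - lam) * max_sim_to_selected
        scores[selected] = -np.inf  # never pick the same candidate twice
        best = int(np.argmax(scores))
        selected.append(best)
        max_sim_to_selected = np.maximum(max_sim_to_selected, pairwise[best])
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 covers another direction.
query = [1.0, 0.0, 0.2]
cands = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.6, 0.0, 0.8]]
print(mmr(query, cands, k=2, lam=1.0))
print(mmr(query, cands, k=2, lam=0.5))
```

On this toy data, λ = 1.0 keeps the two near‑duplicates (indices 0 and 1), while λ = 0.5 swaps the second slot for the less relevant but non‑redundant candidate (index 2). Keeping a running maximum avoids recomputing similarities to every selected item at each step.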

λ initialization and practical settings

Start from λ = 0.5. Factual QA prefers higher λ (more relevance); exploratory or multi‑angle queries prefer lower λ (more diversity). Evaluate each query type separately.

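Candidate set size N : The initial recall sets the upper bound on quality. If N is too small, there is little room for diversity; if N is too large, computation cost grows linearly with it. Typical range: N = 3k to 5k, that is, three to five times the number of chunks finally returned.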

Per‑document hard limit : Limit each document to contribute at most 1–2 chunks; the exact value depends on average document length and chunk granularity.

Two‑stage retrieval : In high‑precision scenarios, first apply a cross‑encoder reranker to the top‑N, then run MMR on the reranker’s subset to add diversity while preserving a relevance lower bound.

λ meaning

λ = 1.0 – pure similarity ranking, no diversity constraint.

λ = 0.5 – balanced relevance and diversity (common default).

λ = 0.0 – pure diversity maximization, ignores relevance.
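A worked example shows how the three settings change the outcome. The relevance and redundancy values below are invented for illustration: candidate A is highly relevant but nearly duplicates a chunk that is already selected, while candidate B is less relevant but covers new ground.

```python
def mmr_score(relevance: float, redundancy: float, lam: float) -> float:
    """score = lam * Sim1 - (1 - lam) * max Sim2, following the MMR scoring rule."""
    return lam * relevance - (1.0 - lam) * redundancy


# (relevance to the query, max similarity to already selected chunks); made-up values.
CANDIDATES = {
    "A": (0.90, 0.95),  # very relevant, but nearly duplicates a selected chunk
    "B": (0.70, 0.20),  # less relevant, but adds new information
}


def winner_for(lam: float) -> str:
    """Name of the candidate with the highest MMR score at this lambda."""
    scores = {name: mmr_score(rel, red, lam) for name, (rel, red) in CANDIDATES.items()}
    return max(scores, key=scores.get)


for lam in (1.0, 0.5, 0.0):
    print(f"lambda={lam}: next pick is {winner_for(lam)}")
```

At λ = 1.0 the scores are 0.90 for A and 0.70 for B, so A wins. At λ = 0.5 they are −0.025 and 0.25, so B wins. At λ = 0.0 only redundancy counts (−0.95 versus −0.20), so B wins regardless of its relevance.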

MMR is especially suitable for complex multi‑facet queries, content summarization, and query disambiguation.

MMR’s role in RAG

LLMs have limited context windows, so input content must be carefully selected. A common approach—picking the highest‑similarity chunks—often yields severe redundancy and excludes informative, diverse content.

MMR is inserted as a post‑retrieval re‑ranking step: after a wide recall (Top‑N), MMR iteratively selects a set of complementary paragraphs to feed the LLM. Each selected chunk adds new information rather than repeating what is already present, which is crucial for queries requiring coverage from multiple angles. MMR can be combined with evaluation frameworks such as RAGAS to quantify improvements in answer quality, context precision, and faithfulness.

Single‑document dominance mitigation

When a knowledge‑base document is densely chunked, its chunk embeddings form a high‑density cluster, causing top‑k retrieval to allocate most slots to that document. MMR’s diversity penalty naturally suppresses this, but in practice a per‑document hard limit (e.g., at most 1–2 chunks per document) is also needed.
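A per‑document hard limit can be applied as a simple filter over any ranked list, whether that list comes straight from vector search or from MMR. The sketch below assumes each chunk is a dict carrying a `doc_id` field; that field name, the function name, and the sample data are hypothetical.

```python
from collections import Counter


def cap_per_document(ranked_chunks, k, max_per_doc=2):
    """Walk a ranked list, keep at most max_per_doc chunks per doc_id, stop at k chunks."""
    kept, counts = [], Counter()
    for chunk in ranked_chunks:
        if len(kept) >= k:
            break
        if counts[chunk["doc_id"]] >= max_per_doc:
            continue  # this document has used up its quota
        kept.append(chunk)
        counts[chunk["doc_id"]] += 1
    return kept


# Ranked best-first; the long "manual" document fills the first four positions.
ranked = [
    {"doc_id": "manual", "text": "chunk 0"},
    {"doc_id": "manual", "text": "chunk 1"},
    {"doc_id": "manual", "text": "chunk 2"},
    {"doc_id": "manual", "text": "chunk 3"},
    {"doc_id": "faq", "text": "chunk 0"},
    {"doc_id": "changelog", "text": "chunk 0"},
]

top4 = cap_per_document(ranked, k=4, max_per_doc=2)
print([c["doc_id"] for c in top4])
```

Without the cap, all four slots go to "manual"; with a limit of two, the remaining slots pass to the next best documents. Because the filter preserves the incoming order, relevance ranking within the allowed chunks is untouched.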

Inherent limitations of MMR

Greedy sub‑optimality : The greedy selection makes locally optimal choices at each step and does not guarantee a globally optimal set, especially when the candidate set is large and semantically complex.

Fixed λ : A static λ cannot adapt dynamically to query type, candidate set characteristics, or user intent. The λ optimal for factual QA may be unsuitable for exploratory QA.

Similarity function sensitivity : Performance depends heavily on the chosen similarity measure. Cosine similarity is common, but anisotropic embedding spaces can bias results.

Scalability bottleneck : Computing pairwise similarities for thousands or tens of thousands of candidates becomes a noticeable performance bottleneck; production systems therefore cap re‑ranking depth at a manageable top‑N.

Evaluation metrics for diversity‑aware retrieval

α‑nDCG (Clarke et al.): discounts relevance contributions from documents already covered by earlier results, directly measuring novelty.

Context Precision / Context Recall : Context Precision measures how much of the retrieved context is actually relevant to the question, with frameworks such as RAGAS also rewarding relevant chunks that rank near the top; Context Recall assesses whether all information needed for a correct answer appears in the retrieved context.

Source diversity entropy : In RAG, the entropy of document sources quantifies the degree of single‑document dominance.
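Source diversity entropy is straightforward to compute from the document ids of the returned chunks. The sketch below uses Shannon entropy in bits and, as an assumption of this example, optionally normalizes by the maximum possible entropy for the number of slots so that results with different k are comparable.

```python
import math
from collections import Counter


def source_entropy(doc_ids, normalize=True):
    """Shannon entropy (bits) of the source distribution in a result list.

    With normalize=True the value is divided by log2(number of slots), so
    0.0 means one document fills every slot and 1.0 means every slot
    comes from a different document.
    """
    total = len(doc_ids)
    if total < 2:
        return 0.0
    counts = Counter(doc_ids)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(total) if normalize else entropy


print(source_entropy(["manual", "manual", "manual", "manual"]))   # single-document dominance
print(source_entropy(["manual", "manual", "faq", "changelog"]))   # after a per-document cap
print(source_entropy(["manual", "faq", "changelog", "guide"]))    # fully diverse
```

The three calls evaluate to 0.0, 0.75, and 1.0 respectively. Tracking this number before and after adding MMR or a per‑document cap gives a quick check on whether dominance has actually been reduced.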

Determinantal Point Process (DPP)

DPP originates from quantum physics and random matrix theory (Macchi, 1975). Its core property is that elements in a set tend to “repel” each other, analogous to the Pauli exclusion principle for fermions.

In retrieval and recommendation, this repulsion models diversity: similar documents receive lower joint probability, while dissimilar ones receive higher probability.

DPP vs. MMR

Theoretical basis : MMR uses a heuristic greedy rule; DPP is a probabilistic model over subsets (a point process).

Diversity modeling : MMR iteratively subtracts the maximum similarity to the selected set (local view); DPP evaluates the determinant of all pairwise similarities, capturing global diversity.

Optimization : MMR employs greedy sequential selection; DPP can be sampled exactly or approximated greedily.

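Global consistency : MMR provides no global optimum guarantee; a DPP assigns a probability to each whole subset, so every pair in the set affects its score, although greedy approximations give up exact optimality as well.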

Result reproducibility : MMR is deterministic; DPP sampling is stochastic, though a greedy MAP version is deterministic.

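Computational complexity : MMR is cheap, needing on the order of N·k similarity evaluations for N candidates and k selections (linear in N for a fixed k); exact DPP sampling requires an eigendecomposition that is cubic in N, but greedy or low‑rank approximations can bring the cost close to linear in N.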

Quality‑diversity trade‑off : MMR uses a linear interpolation via λ; DPP couples quality and diversity multiplicatively.

The key difference: MMR selects based only on the maximum similarity to the already selected set (local view), whereas DPP evaluates the determinant of all pairwise similarities, providing a global view. Two documents far from the selected set may still be penalized by DPP if they are highly similar to each other, which MMR would not capture.
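The difference between the two views can be seen on four hand‑built unit vectors. In this sketch, which assumes equal quality for every item and uses the Gram matrix of the vectors as the DPP kernel, `s` is already selected, `a` and `b` are unrelated to `s` but very similar to each other, and `c` is unrelated to all of them. All names and values are illustrative.

```python
import numpy as np

# Unit vectors: cos(a, b) = 0.95, while s is orthogonal to a, b, and c.
vecs = {
    "s": np.array([1.0, 0.0, 0.0]),
    "a": np.array([0.0, 1.0, 0.0]),
    "b": np.array([0.0, 0.95, np.sqrt(1.0 - 0.95 ** 2)]),
    "c": np.array([0.0, 0.0, 1.0]),
}


def max_sim_to_selected(name, selected):
    """MMR's redundancy term: the highest similarity to anything already selected."""
    return max(float(vecs[name] @ vecs[s]) for s in selected)


def subset_det(names):
    """Unnormalized DPP score of a set: determinant of its similarity (Gram) matrix."""
    v = np.array([vecs[n] for n in names])
    return float(np.linalg.det(v @ v.T))


# Local view: measured against S = {s}, a and b look equally novel.
print(max_sim_to_selected("a", ["s"]), max_sim_to_selected("b", ["s"]))

# Global view: the set containing both a and b scores far lower than the one with c.
print(subset_det(["s", "a", "b"]), subset_det(["s", "a", "c"]))
```

Both candidates have a redundancy of 0 against the selected set, yet the determinant of {s, a, b} is about 0.0975 compared with 1.0 for {s, a, c}. MMR does notice the overlap one step later, once `a` has been selected; the DPP score reflects it as soon as the set is evaluated as a whole.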

Integrating DPP in the RAG pipeline

DPP is typically used as a post‑retrieval re‑ranking layer because its quality score relies on an accurate relevance estimate.
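As a rough sketch of such a re‑ranking layer, the code below builds a kernel from relevance scores and embeddings using a common quality‑times‑similarity construction, L[i, j] = q_i · cos(v_i, v_j) · q_j, and then picks items greedily by log‑determinant gain. This naive loop recomputes a determinant for every candidate at every step and is meant only to show the idea; it is not one of the fast greedy MAP algorithms, and the function names and sample values are assumptions.

```python
import numpy as np


def build_kernel(quality, vecs):
    """L[i, j] = q_i * cos(v_i, v_j) * q_j, with q_i the relevance score of item i."""
    vecs = np.asarray(vecs, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = np.asarray(quality, dtype=float)
    return q[:, None] * (unit @ unit.T) * q[None, :]


def greedy_map_dpp(kernel, k):
    """Greedily add the item that gives the largest log det of the selected submatrix."""
    n = kernel.shape[0]
    selected = []
    while len(selected) < min(k, n):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # every remaining item is numerically a duplicate
            break
        selected.append(best)
    return selected


# Items 0 and 1 are near-duplicates with high relevance; item 2 is distinct but less relevant.
quality = [0.90, 0.88, 0.60]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kernel = build_kernel(quality, vectors)
print(greedy_map_dpp(kernel, k=2))
```

Ranking by relevance alone would return items 0 and 1; the determinant‑based selection returns items 0 and 2, because the near‑duplicate pair spans almost no volume. The quality terms come from the upstream relevance estimate, which is why the accuracy of that estimate matters.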

Suitable scenarios for DPP :

Legal document and legislation retrieval: high homogeneity and overlapping phrasing benefit from DPP’s global redundancy removal.

Multi‑document summarization: selecting representative paragraphs that cover different aspects of a topic.

Result‑page diversification in recommendation: users view an entire page, so global diversity modeling aligns with experience.

Unsuitable scenarios for DPP :

Latency‑sensitive online systems requiring precise output count: exact k‑DPP sampling is computationally heavy at high QPS and needs approximation or pre‑filtering.

Conclusion

Document dominance arises from over‑splitting long documents or using sliding‑window chunking, which creates many near‑identical embeddings. Standard top‑k similarity retrieval then allocates most slots to a single source.

The most direct remedy is to impose a hard limit on the number of chunks returned per document (e.g., 1–2 chunks). Adding MMR‑based diversity re‑ranking then automatically balances relevance and novelty. Mainstream vector stores and frameworks (e.g., LangChain, LlamaIndex) already provide ready‑to‑use implementations.

Long‑term optimization should focus on improving chunking strategies: semantic splitting, reducing overlap, hierarchical chunking, or a “parent‑document + child‑document” retrieval mode to lower redundancy at the indexing stage.

Further measures can disperse information sources: combining vector retrieval with BM25 keyword matching, prompting the LLM to cite multiple documents, or rewriting a complex query into several sub‑queries for multi‑turn retrieval.
