Elegant Solution to Prompt Bloat: Semantic Retrieval of Tools for Efficient LLM Inference

The article explains how the limited context window of large language models causes prompt bloat when many tool descriptions are embedded, and presents the RAG‑MCP architecture that stores tool metadata in a vector database, uses semantic retrieval to select only the most relevant tools, dramatically shortens prompts, and improves inference speed and tool‑call accuracy.

Amazon Cloud Developers

Problem

Large language models (LLMs) have limited context windows. Embedding the full descriptions of many external tools managed via the Model Context Protocol (MCP) consumes a large fraction of the token budget, leaves less room for the model's reasoning, and makes tool selection more error‑prone. This phenomenon is called prompt bloat.
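To make the cost concrete, the following is a back-of-the-envelope sketch rather than a measurement. The 4-characters-per-token ratio, the tool count, the description length, and the context window size are all assumptions chosen for illustration; a real tokenizer would give different numbers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


def tool_budget_share(tool_descriptions: list[str], context_window: int) -> float:
    """Fraction of the context window consumed by the tool descriptions alone."""
    used = sum(estimate_tokens(d) for d in tool_descriptions)
    return used / context_window


# Hypothetical workload: 50 tools, each described in about 400 characters,
# sent to a model with an 8,000-token context window.
descriptions = ["x" * 400] * 50
share = tool_budget_share(descriptions, context_window=8_000)
print(f"Share of the context window spent on tool descriptions: {share:.1%}")
```

Under these assumed numbers, more than half of the window is gone before the user's request is even added.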

RAG‑MCP Architecture

RAG‑MCP combines Retrieval‑Augmented Generation (RAG) with MCP; in this implementation, Amazon Bedrock Knowledge Base serves as the retrieval layer. Tool metadata is stored in a vector database; a semantic search retrieves only the most relevant tool specifications for a user query, and these are then inserted into an augmented prompt sent to the LLM.

Core Concepts

Retrieval‑Augmented Generation (RAG) matches a user query against embeddings in a vector store and injects the top‑k most relevant passages as context, improving answer relevance and reducing token usage.
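The top‑k matching described here can be sketched in a few lines of plain Python. This is a toy illustration, not how a managed vector store implements search: the three-dimensional vectors are made up, and cosine similarity is assumed as the distance measure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the names of the k stored entries most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" (invented values, only to show the ranking step).
store = [
    ("read_file", [0.9, 0.1, 0.0]),
    ("write_file", [0.1, 0.9, 0.0]),
    ("search_files", [0.6, 0.0, 0.8]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

Only the entries returned by `top_k` would be injected into the prompt; everything else stays in the store.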

Model Context Protocol (MCP) standardises tool metadata (name, description, input schema) and separates the MCP Server (exposes tool list and executes calls) from the MCP Client (fetches metadata and forwards calls to the model).

Tool Definition Schema

{
  "name": "string", // unique identifier
  "description": "string", // optional human‑readable description
  "inputSchema": {
    "type": "object",
    "properties": { ... }
  }
}
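As an illustration of the schema, here is a minimal sketch of one concrete tool definition and a shape check. The `read_file` name and description come from the filesystem server listing in this article; the `"type": "string"` for `path` and the `required` list are assumptions, and `validate_tool_definition` is a hypothetical helper, not part of any MCP SDK.

```python
read_file_tool = {
    "name": "read_file",
    "description": "Read the complete contents of a file.",
    "inputSchema": {
        "type": "object",
        # Assumed parameter typing; the listing only names the parameter.
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}


def validate_tool_definition(tool: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    name = tool.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")
    if "description" in tool and not isinstance(tool["description"], str):
        problems.append("description must be a string when present")
    schema = tool.get("inputSchema")
    if not isinstance(schema, dict) or schema.get("type") != "object":
        problems.append('inputSchema must be an object schema ("type": "object")')
    return problems


print(validate_tool_definition(read_file_tool))
```

A check like this could run before tool metadata is written out for indexing, so that malformed entries never reach the vector store.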

Filesystem MCP Server Example

=== All Available Tools (11 tools) ===
1. 🔧 get_file_info
   Description: Retrieve detailed metadata about a file or directory.
   Parameters: path
2. 🔧 write_file
   Description: Create a new file or overwrite an existing file with new content.
   Parameters: path, content
3. 🔧 move_file
   Description: Move or rename files and directories; fails if destination exists.
   Parameters: source, destination
4. 🔧 edit_file
   Description: Line‑based edits to a text file; returns a git‑style diff.
   Parameters: path, edits, dryRun
5. 🔧 read_multiple_files
   Description: Read contents of multiple files simultaneously.
   Parameters: paths
6. 🔧 create_directory
   Description: Create a new directory or ensure it exists.
   Parameters: path
7. 🔧 read_file
   Description: Read the complete contents of a file.
   Parameters: path
8. 🔧 directory_tree
   Description: Recursive JSON view of files and directories.
   Parameters: path
9. 🔧 list_allowed_directories
   Description: List directories the server is allowed to access.
10. 🔧 search_files
    Description: Recursively search for files matching a pattern (case‑insensitive).
    Parameters: path, pattern, excludePatterns
11. 🔧 list_directory
    Description: Detailed listing of files and directories in a path.
    Parameters: path

Benefits of RAG‑MCP

Dynamic tool retrieval: Only the semantically closest tool specs are fetched, dramatically shrinking the prompt.

Context augmentation: Retrieved specs are inserted into an augmented prompt, giving the model precise execution guidance.

Scalability: Large, frequently changing tool sets can be maintained without manual prompt edits.

End‑to‑End Workflow (12 steps)

1. MCP Client reads all enabled MCP Server tools and writes them to a JSONL file.

2. Upload the JSONL file to an Amazon S3 bucket that serves as a Bedrock Knowledge Base data source.

3. Chunk the JSONL file with a custom chunker so that each tool becomes a separate chunk.

4. Generate embeddings for each chunk using an embedding model (e.g., Amazon Titan Text Embeddings V2).

5. Store the embeddings in a vector database (Amazon OpenSearch Serverless or Aurora pgvector).

6. When a user query arrives, encode it with the same embedding model to obtain a query vector.

7. Perform a similarity search in the vector store and retrieve the top‑k most relevant tool chunks.

8. Build an augmented prompt from the retrieved tool specifications.

9. Send the augmented prompt to the LLM; the model decides whether to invoke a tool.

10. If a tool is needed, the LLM triggers the call via MCP.

11. Return the tool execution result to the client.

12. Steps 6‑10 may repeat for multiple tool calls within a single request.
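The prompt-building step of this workflow can be sketched as follows. The prompt wording and the `build_augmented_prompt` helper are hypothetical; the article does not specify the actual template, so treat this as one plausible layout.

```python
def build_augmented_prompt(user_query: str, tools: list[dict]) -> str:
    """Combine the retrieved tool specs and the user query into one prompt."""
    lines = ["You can call the following tools:"]
    for tool in tools:
        properties = tool.get("inputSchema", {}).get("properties", {})
        params = ", ".join(properties) or "none"
        description = tool.get("description", "")
        lines.append(f"- {tool['name']}: {description} (parameters: {params})")
    lines.append("")
    lines.append(f"User request: {user_query}")
    return "\n".join(lines)


# Two specs standing in for the top-k results of the similarity search.
retrieved = [
    {
        "name": "read_file",
        "description": "Read the complete contents of a file.",
        "inputSchema": {"type": "object", "properties": {"path": {}}},
    },
    {
        "name": "list_allowed_directories",
        "description": "List directories the server is allowed to access.",
        "inputSchema": {"type": "object", "properties": {}},
    },
]
print(build_augmented_prompt("Show me the contents of notes.txt", retrieved))
```

Because only the retrieved specs appear in the prompt, its length depends on top‑k rather than on the total number of registered tools.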

Implementation Snippets

Key Python classes illustrate how to interact with an MCP Server and Bedrock Knowledge Base.

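class MCPToolError(Exception):
    """Raised when an MCP tool operation cannot be performed."""


class MCPClient:
    def __init__(self):
        self._session = None  # populated by connect()

    async def __aenter__(self):
        await self.connect()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # drop the session reference; full teardown is omitted in this excerpt
        self._session = None

    async def connect(self):
        # initialise stdio connection and MCP session
        ...

    async def list_tools(self):
        # fail early if connect() has not set up a session yet
        if not self._session:
            raise MCPToolError("MCP session not initialized")
        tools_response = await self._session.list_tools()
        return tools_response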
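Once the tool list is available, it has to be written out as JSONL for indexing. The sketch below assumes the response object exposes a `tools` attribute whose items carry `name`, `description`, and `inputSchema`; `tools_to_jsonl` is a hypothetical helper, and a `SimpleNamespace` stands in for a real server response so the example runs offline.

```python
import json
from types import SimpleNamespace


def tools_to_jsonl(tools_response) -> str:
    """Serialise each tool as one JSON object per line."""
    lines = []
    for tool in tools_response.tools:
        record = {
            "name": tool.name,
            "description": tool.description or "",
            "inputSchema": tool.inputSchema,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Stand-in for a list_tools() response (assumed shape, not a real server reply).
fake_response = SimpleNamespace(tools=[
    SimpleNamespace(
        name="read_file",
        description="Read the complete contents of a file.",
        inputSchema={"type": "object", "properties": {"path": {"type": "string"}}},
    ),
    SimpleNamespace(
        name="list_allowed_directories",
        description=None,
        inputSchema={"type": "object", "properties": {}},
    ),
])
print(tools_to_jsonl(fake_response), end="")
```

Keeping one tool per line makes the later chunking step simple: each line can become its own chunk.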

The query_semantic method shows how to call Bedrock’s retrieve API, parse the JSON results, and return a QueryResult containing the matched tool specifications.

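import json

def query_semantic(self, query_text: str, max_results: int = 10) -> QueryResult:
    # Method excerpt: QueryResult and the enclosing class are defined elsewhere in the project.
    # self.bedrock_client is expected to be a boto3 "bedrock-agent-runtime" client.
    response = self.bedrock_client.retrieve(
        knowledgeBaseId=self.knowledge_base_id,
        retrievalQuery={"text": query_text},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        }
    )
    results = []
    for result in response["retrievalResults"]:
        try:
            # each chunk holds one tool definition serialised as JSON
            content = json.loads(result["content"]["text"])
            results.append(content)
        except json.JSONDecodeError:
            # skip chunks that are not valid JSON instead of failing the whole query
            continue
    return QueryResult(tools=results, total_results=len(results))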
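The parsing step can be exercised without AWS access by feeding it a hand-built response. In this sketch the `sample_response` values are invented, and the `min_score` cutoff is an assumption added for illustration (the original method keeps every parseable result); it relies on each retrieval result carrying a `score` field alongside `content`.

```python
import json


def parse_retrieval_results(response: dict, min_score: float = 0.0) -> list[dict]:
    """Extract tool definitions from a retrieve-style response dict."""
    tools = []
    for result in response.get("retrievalResults", []):
        # Hypothetical relevance cutoff; not part of the original method.
        if result.get("score", 0.0) < min_score:
            continue
        try:
            tools.append(json.loads(result["content"]["text"]))
        except (KeyError, json.JSONDecodeError):
            continue  # skip chunks that are missing text or are not valid JSON
    return tools


# Invented response in the same shape the method above iterates over.
sample_response = {
    "retrievalResults": [
        {"content": {"text": '{"name": "read_file"}'}, "score": 0.82},
        {"content": {"text": "not valid json"}, "score": 0.75},
        {"content": {"text": '{"name": "move_file"}'}, "score": 0.31},
    ]
}
names = [t["name"] for t in parse_retrieval_results(sample_response, min_score=0.5)]
print(names)
```

A score cutoff like this is one way to avoid padding the prompt with weak matches when fewer than top‑k tools are actually relevant.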

Configuration Tips

Vector store: Amazon OpenSearch Serverless for high‑throughput production; Aurora pgvector for cost‑sensitive workloads.

Embedding model: Amazon Titan Text Embeddings V2.

Retrieval top‑k: a value of 5 to 10 results balances prompt length and relevance.

Enable hybrid search (semantic + keyword) for complex queries.

Monitor ingestion jobs with Amazon CloudWatch.
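The top‑k and hybrid search tips can be expressed as a small builder for the `retrievalConfiguration` argument of the retrieve call. This is a sketch under assumptions: `build_retrieval_configuration` is a hypothetical helper, and whether the `HYBRID` search type is accepted depends on the vector store behind the knowledge base, so check the service documentation before relying on it.

```python
def build_retrieval_configuration(top_k: int = 5, hybrid: bool = False) -> dict:
    """Build the retrievalConfiguration dict for a knowledge base query."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    vector_config = {"numberOfResults": top_k}
    if hybrid:
        # Ask for combined semantic + keyword search instead of the default.
        vector_config["overrideSearchType"] = "HYBRID"
    return {"vectorSearchConfiguration": vector_config}


print(build_retrieval_configuration(top_k=5))
print(build_retrieval_configuration(top_k=10, hybrid=True))
```

The returned dict would be passed as `retrievalConfiguration=...` in place of the inline literal used in `query_semantic`.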

References

RAG‑MCP paper: https://arxiv.org/html/2505.03275v1

MCP specification: https://modelcontextprotocol.io/docs/concepts/architecture

Amazon Bedrock Knowledge Base documentation: https://aws.amazon.com/cn/bedrock/knowledge-bases

Retrieval‑Augmented Generation overview: https://aws.amazon.com/cn/what-is/retrieval-augmented-generation

MCP Python SDK: https://github.com/modelcontextprotocol/python-sdk

GitHub repository with full code: https://github.com/memoverflow/rag-mcp
