Artificial Intelligence 16 min read

How Hermes Implements Bounded Memory: Character Limits, Compression, and Snapshots to Prevent Overflow

The article details Hermes' bounded memory system, which uses character limits for persistent files, a three‑stage context compression pipeline, boundary alignment to protect tool calls, snapshot caching, triple redaction, and anti‑thrashing mechanisms, ensuring agents never overflow or lose critical information.

James' Growth Diary

Jun 1, 2026

How Hermes Implements Bounded Memory: Character Limits, Compression, and Snapshots to Prevent Overflow

Hermes stores persistent memory in two disk files, MEMORY.md (2200 characters) and USER.md (1375 characters). Character limits are used instead of token limits to achieve model‑agnosticism; 2200 characters correspond to roughly 800 tokens across more than 200 supported models.

class MemoryStore:
    def __init__(self, memory_char_limit: int = 2200, user_char_limit: int = 1375):
        self.memory_char_limit = memory_char_limit  # ~800 tokens
        self.user_char_limit = user_char_limit      # ~500 tokens

The add method computes the new total length and, if the addition would exceed the limit, returns a failure response that includes the current usage and the full list of existing entries ( current_entries). This allows the agent to decide which entries to prune or merge without an extra read call.

def add(self, target: str, content: str) -> Dict[str, Any]:
    # compute new_total ...
    if new_total > limit:
        current = self._char_count(target)
        return {
            "success": False,
            "error": f"Memory at {current:,}/{limit:,} chars. Adding this entry ({len(content)} chars) would exceed the limit.",
            "current_entries": entries,
            "usage": f"{current:,}/{limit:,}"
        }
    entries.append(content)
    self.save_to_disk(target)
    return {"success": True, "message": "Entry added."}

Dialogue History Compression – Three‑Stage Progressive Pipeline

The ContextCompressor triggers when the context window reaches 50% of its capacity. It protects the first three non‑system messages and the last ~10 % of tokens (the tail_token_budget), and it tracks ineffective compressions to avoid thrashing.

class ContextCompressor(ContextEngine):
    def __init__(self, threshold_percent: float = 0.50,
                 protect_first_n: int = 3,
                 protect_last_n: int = 20,
                 summary_target_ratio: float = 0.20):
        self.threshold_tokens = max(int(context_length * threshold_percent), MINIMUM_CONTEXT_LENGTH)
        self.tail_token_budget = int(self.threshold_tokens * summary_target_ratio)
        self._ineffective_compression_count = 0

    def should_compress(self, prompt_tokens: int = None) -> bool:
        tokens = prompt_tokens or self.last_prompt_tokens
        if tokens < self.threshold_tokens:
            return False
        if self._ineffective_compression_count >= 2:
            return False
        return True

Stage 1 – Tool‑output pruning (zero LLM cost): Large tool outputs are replaced with concise one‑line summaries, e.g.

read_file config.py [12,000 chars] → [read_file] read config.py (12,000 chars)

Stage 2 – Head/Tail protection (zero LLM cost): The system prompt plus the first three non‑system messages are kept, and the last tail_token_budget tokens are protected.

Stage 3 – Structured LLM summarization: The protected middle segment is sent to an auxiliary LLM, which returns a fixed‑format summary with six fields:

## Active Task        ← verbatim copy of the latest request
## Completed Actions  ← concrete operations performed
## Current State      ← working directory / branch / file
## Pending User Asks  ← unanswered user questions
## Key Context        ← specific values / paths (no secrets)
## Remaining Work     ← description of remaining work (not commands)

Iterative updates reuse the previous summary to avoid rebuilding from scratch:

if self._previous_summary:
    prompt = f"""Update summary (not from zero)
PREVIOUS SUMMARY: {self._previous_summary}
NEW TURNS: {content_to_summarize}
→ move completed items from In Progress to Completed Actions"""
# cost: O(new turns), not O(full history)

Boundary Alignment – Prevent Cutting Active Tool Calls

When a compression boundary falls inside a tool_result, the resulting orphan tool_call would cause an API 400 error. The compressor moves the boundary backward to the end of the corresponding tool‑call group.

def _align_boundary_backward(self, boundary_idx: int) -> int:
    """Move the compression boundary backward to the end of a complete tool‑call group"""
    for i in range(boundary_idx, -1, -1):
        msg = self.messages[i]
        if msg["role"] == "tool" and msg.get("tool_call_id"):
            for j in range(i - 1, -1, -1):
                prev = self.messages[j]
                if prev["role"] == "assistant" and prev.get("tool_calls"):
                    if any(tc["id"] == msg["tool_call_id"] for tc in prev["tool_calls"]):
                        return max(0, j - 1)  # align before the tool call
    return boundary_idx + 1  # no match, skip this group

If an orphan is detected, Hermes deletes the stray tool_result and inserts a placeholder result to keep the pair consistent.

Snapshot Mechanism – Prefix‑Cache Optimization

Memory writes are persisted to disk immediately, but the system‑prompt snapshot used for the current turn is frozen until the next turn. This leverages prefix‑cache support in Anthropic, OpenAI, and Google APIs, saving >90 % of token cost.

def load_from_disk(self):
    self.memory_entries = self._read_file(mem_dir / "MEMORY.md")
    self.user_entries = self._read_file(mem_dir / "USER.md")
    # Freeze snapshot for the current turn – no changes affect this turn
    self._system_prompt_snapshot = {
        "memory": self._render_block("memory", self.memory_entries),
        "user": self._render_block("user", self.user_entries),
    }

The agent can verify the write result via the return value of memory add without asking the model what it remembered.

Security Boundary – Triple Redaction

Because memory is injected into the system prompt, Hermes applies three layers of redaction:

Layer 1 – Pre‑write sanitization: Regex scans remove invisible Unicode characters, prompt‑injection phrases (e.g., "ignore previous instructions"), and credential‑leak patterns (e.g., "curl $TOKEN").

Layer 2 – Prompt‑level placeholder: During compression the LLM is instructed to replace any sensitive content with [已脱敏].

Layer 3 – Post‑summary scan: The generated summary is scanned again to ensure no sensitive data slipped through.

_INVISIBLE_CHARS = {'\u200b', '\u200c', '\u200d', '\u2060', '\ufeff', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e'}
_MEMORY_THREAT_PATTERNS = [
    (r'ignore\s+(previous|all|above|prior)\s+instructions', "prompt_injection"),
    (r'you\s+are\s+now\s+', "role_hijack"),
    (r'curl\s+[^
]*\$\{?\w*(KEY|TOKEN|SECRET|PASSWORD)', "exfil_curl"),
]

def _scan_memory_content(content: str) -> Optional[str]:
    for char in _INVISIBLE_CHARS:
        if char in content:
            return f"Blocked: invisible unicode U+{ord(char):04X}"
    for pattern, pid in _MEMORY_THREAT_PATTERNS:
        if re.search(pattern, content, re.IGNORECASE):
            return f"Blocked: threat pattern '{pid}'"
    return None

Anti‑Thrashing and Degradation Fallbacks

If a compression yields less than 10 % token savings, the ineffective‑compression counter increments; after two consecutive failures the compressor pauses (anti‑thrashing). When the LLM summarizer fails, Hermes falls back to a static message; if the summarizer model crashes, it switches back to the main model; persistent failures trigger a 60‑second cooldown.

savings_pct = (saved_estimate / display_tokens) * 100
if savings_pct < 10:
    self._ineffective_compression_count += 1
else:
    self._ineffective_compression_count = 0

if not summary:
    summary = f"{SUMMARY_PREFIX}
Summary unavailable. {n_dropped} messages removed."

if (_is_model_not_found or _is_timeout) and self.summary_model != self.model:
    self.summary_model = ""
    return self._generate_summary(turns_to_summarize)

self._summary_failure_cooldown_until = time.monotonic() + 60

Two‑Layer Compression Relationship

Storage: Persistent memory uses disk files; dialogue history lives in an in‑memory message list.

Size limit: Persistent memory is bounded by characters (2200/1375); dialogue history is bounded by tokens (50 % of the context window).

Overflow strategy: Persistent memory rejects writes and returns the full entry list for agent‑driven cleanup; dialogue history triggers automatic LLM compression.

Content lifespan: Persistent memory persists across sessions; dialogue history is summarized, losing fine‑grained details.

Injection method: Persistent memory is injected as a frozen snapshot into the system prompt; dialogue history is passed to the API in real time.

Update timing: Persistent memory writes to disk immediately but refreshes the snapshot on the next turn; dialogue history is rebuilt on each compression.

Security scan: Persistent memory is scanned with regex redaction on write; dialogue history summary undergoes credential redaction.

Second‑pass compression: Persistent memory replaces old entries; dialogue history iteratively updates the summary.

The core philosophy is that persistent memory holds distilled knowledge (cross‑session permanent), while the context window serves as a fluid workspace that is kept lightweight through the three‑stage compression pipeline.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

memory management Hermes LLM agents context compression bounded memory security redaction snapshot caching

Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.