7‑Level Multi‑Provider Fallback: Keeping the Agent Alive When a Model Fails

Hermes Agent’s auxiliary_client.py implements a seven‑level provider fallback chain so that auxiliary tasks keep running when the main LLM crashes, runs out of credits, or hits rate limits. It tries the user’s primary provider first, then walks a fixed list of alternative providers, and absorbs protocol quirks along the way.

James' Growth Diary

Jun 8, 2026

01 | Separate auxiliary client

Hermes Agent distinguishes Main Call (core reasoning, tool invocation) from Auxiliary Call (context compression, session search, web extraction, vision analysis). Auxiliary work must not be bound to the main provider because the main provider may be unavailable, auxiliary tasks prefer cheap flash‑level models, and different tasks have distinct capability requirements.

The auxiliary_client.py module provides a single provider‑resolution entry point with multi‑level fallback, so that every auxiliary task gets a usable model as long as at least one configured provider is reachable.

02 | Text‑task 7‑level fallback chain

The source file’s top comment defines the resolution order (auto mode):

Text‑task auto‑mode fallback order:
 1. The user's main provider + main model
 2. OpenRouter (OPENROUTER_API_KEY)
 3. Nous Portal (~/.hermes/auth.json active provider)
 4. Custom endpoint (config.yaml model.base_url + OPENAI_API_KEY)
 5. Native Anthropic
 6. Direct API‑key providers (z.ai/GLM, Kimi/Moonshot, MiniMax, MiniMax‑CN)
 7. None (the entire chain failed)

Step 1 always tries the user‑configured provider, regardless of type, ensuring the user’s quota is consumed on the expected provider.

Step 6 iterates over PROVIDER_REGISTRY, attempting each provider that has an api_key configured, so multiple direct‑API providers are tried sequentially.

Step 7 returns None when every provider has failed, allowing the caller to degrade (e.g., skip compression and continue with the original context).
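The walk described above can be sketched as a loop over an ordered list of resolvers. This is a minimal illustration, not the module's actual code: `Resolver`, `resolve_first_available`, and `compress_context` are hypothetical names, and the "summary" is a stand-in for a real LLM call.

```python
from typing import Callable, List, Optional, Tuple

# A resolver returns a (client, model) pair, or None when its provider is
# not configured or not reachable. All names here are hypothetical.
Resolver = Callable[[], Optional[Tuple[object, str]]]
Chain = List[Tuple[str, Resolver]]


def resolve_first_available(chain: Chain) -> Optional[Tuple[str, object, str]]:
    """Walk the chain in its fixed order and return the first usable provider."""
    for name, resolver in chain:
        try:
            resolved = resolver()
        except Exception:
            # A broken provider must not abort the walk; try the next level.
            continue
        if resolved is not None:
            client, model = resolved
            return name, client, model
    return None  # final level: every provider failed


def compress_context(messages: List[str], chain: Chain) -> List[str]:
    """Caller-side degradation: skip compression when nothing resolves."""
    if resolve_first_available(chain) is None:
        return messages  # continue with the original context
    return messages[-2:]  # stand-in for a real LLM-backed summary
```

Because the list order is fixed, the same configuration always resolves to the same provider, which keeps failures reproducible.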

03 | Codex exclusion from the fallback chain

Codex OAuth (ChatGPT‑account auth) is intentionally NOT in either fallback chain: OpenAI gates this endpoint behind an undocumented, shifting model allow‑list, so "just try Codex with a hardcoded model" rots on its own.

In early 2026 the accepted model name changed from gpt-5.3-codex to gpt-5.2-codex and then to gpt-5.4. If Codex were in the automatic chain, an unannounced allow‑list change could make auxiliary calls fail silently, returning HTTP 200 with garbage content.

Codex is therefore used only when (1) the user’s main provider is Codex (Step 1) or (2) the caller explicitly requests provider="openai-codex" with a concrete model name.
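The two entry conditions can be expressed as a small planning function. This is a sketch under assumptions: `AUTO_CHAIN` and `plan_providers` are hypothetical names, and the chain is abbreviated to four labels for readability.

```python
from typing import List, Optional

# Providers walked automatically in auto mode (abbreviated; illustrative labels).
AUTO_CHAIN: List[str] = ["openrouter", "nous", "custom", "anthropic"]


def plan_providers(main_provider: str, requested: str = "auto",
                   model: Optional[str] = None) -> List[str]:
    """Return the ordered list of providers to try for one auxiliary call."""
    if requested == "openai-codex":
        # Explicit opt-in: the caller must name a concrete model.
        if not model:
            raise ValueError("openai-codex requires an explicit model name")
        return ["openai-codex"]
    if requested != "auto":
        return [requested]
    # Auto mode: the main provider goes first (Codex included, as Step 1),
    # but Codex is never appended as a fallback.
    rest = [p for p in AUTO_CHAIN if p != main_provider]
    return [main_provider] + rest
```

With this shape, Codex can only appear at the head of the plan, and only because the user chose it.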

An adapter _CodexCompletionsAdapter translates the Codex Responses API to the unified chat/completions shape.

04 | Vision‑task 6‑level fallback and vision blacklist

Vision‑task auto‑mode fallback order:
 1. Main provider (only if it supports vision)
 2. OpenRouter
 3. Nous Portal
 4. Native Anthropic
 5. Custom endpoint (local vision models: Qwen‑VL, LLaVA, Pixtral, etc.)
 6. None

Step 1 is conditional on the provider supporting vision. The code defines a blacklist:

_PROVIDERS_WITHOUT_VISION = frozenset({"kimi-coding", "kimi-coding-cn"})

If the main provider appears in this set, the fallback skips Step 1 and starts at Step 2, avoiding 404 or request‑rejection errors.

Custom vision endpoints are placed after Anthropic because local models (LLaVA, Pixtral, Qwen‑VL) generally have weaker image understanding than cloud models, serving as a last resort.
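A minimal sketch of how the blacklist shapes the candidate order follows. Only `_PROVIDERS_WITHOUT_VISION` comes from the source; `_VISION_FALLBACKS` and `vision_candidates` are hypothetical names used for illustration.

```python
from typing import List

_PROVIDERS_WITHOUT_VISION = frozenset({"kimi-coding", "kimi-coding-cn"})

# Steps 2 to 5 of the vision chain; the labels are illustrative.
_VISION_FALLBACKS: List[str] = ["openrouter", "nous", "anthropic", "custom"]


def vision_candidates(main_provider: str) -> List[str]:
    """Order in which providers are tried for a vision task."""
    order: List[str] = []
    if main_provider not in _PROVIDERS_WITHOUT_VISION:
        order.append(main_provider)  # Step 1 only for vision-capable providers
    # Append the fixed fallbacks, skipping one that is already first.
    order.extend(p for p in _VISION_FALLBACKS if p not in order)
    return order
```

For a blacklisted main provider the result is exactly the fallback list, so the first request already goes to OpenRouter.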

05 | 402 credit‑exhaustion auto‑switch logic

When a resolved provider returns HTTP 402 or a credit‑related error, call_llm() automatically retries with the next available provider in the auto‑detection chain.

This works together with a credential_pool that rotates among multiple accounts for a provider; only when all accounts are exhausted does the chain move to the next provider.

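from typing import Any, Optional, Tuple

# Conceptual credential‑pool selector (illustrative).
# load_pool() is assumed to be defined elsewhere in the module.
def _select_pool_entry(provider: str) -> Tuple[bool, Optional[Any]]:
    pool = load_pool(provider)
    if not pool or not pool.has_credentials():
        return False, None  # no usable account for this provider
    return True, pool.select()  # automatic rotation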
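How account rotation and the 402 switch interact can be sketched as two nested loops: accounts on the inside, providers on the outside. The names `PaymentRequired`, `call_with_rotation`, and the `send` callback are assumptions made for this example and do not come from the source.

```python
from typing import Callable, Dict, List, Optional


class PaymentRequired(Exception):
    """Stand-in for an HTTP 402 or credit-exhausted error."""


def call_with_rotation(
    providers: List[str],
    pools: Dict[str, List[str]],
    send: Callable[[str, str], str],
) -> Optional[str]:
    """Try every credential of a provider before moving to the next provider."""
    for provider in providers:
        for api_key in pools.get(provider, []):
            try:
                return send(provider, api_key)
            except PaymentRequired:
                continue  # this account is out of credits; rotate to the next
    return None  # every account on every provider is exhausted
```

Only `PaymentRequired` triggers rotation here; any other exception propagates to the caller, so unrelated bugs are not masked as credit problems.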

A cross‑session rate‑limit guard for Nous checks the remaining quota: if a previous session triggered a 429, the current session skips Nous to avoid exhausting the account’s requests‑per‑hour (RPH) limit.
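One way to share a 429 across sessions is a small state file that stores a "skip until" timestamp. The sketch below is an assumption about the mechanism, not the module's implementation: the file format, `record_rate_limit`, and `should_skip` are all hypothetical, and it uses a time‑based cooldown rather than a real quota lookup.

```python
import json
import time
from pathlib import Path
from typing import Optional


def record_rate_limit(state_file: Path, cooldown_s: float,
                      now: Optional[float] = None) -> None:
    """Persist the time until which the provider should be skipped."""
    now = time.time() if now is None else now
    state_file.write_text(json.dumps({"skip_until": now + cooldown_s}))


def should_skip(state_file: Path, now: Optional[float] = None) -> bool:
    """True while a cooldown written by any earlier session is still active."""
    now = time.time() if now is None else now
    try:
        state = json.loads(state_file.read_text())
        return now < float(state["skip_until"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or unreadable state must never block the provider.
        return False
```

A session that receives a 429 would call `record_rate_limit`, and every later session would call `should_skip` before adding the provider to its chain.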

06 | Per‑provider auxiliary model strategy

Each provider is assigned a cheap auxiliary model, stored in the dictionary _API_KEY_PROVIDER_AUX_MODELS_FALLBACK and overridable via ProviderProfile.default_aux_model. The mapping is:

gemini → gemini-3-flash-preview
anthropic → claude-haiku-4-5-20251001
kimi-coding → kimi-k2-turbo-preview
minimax → MiniMax-M2.7
openrouter → google/gemini-2.5-flash
nous portal → google/gemini-3-flash-preview
zai (GLM) → glm-4.5-flash

Special temperature handling rules are encoded. Kimi models omit the temperature parameter because the Kimi gateway decides it based on its “thinking” mode. The helper _fixed_temperature_for_model() centralizes these contracts.

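from typing import Optional

OMIT_TEMPERATURE = object()  # sentinel; assumed stand-in for the module's own constant

def _is_kimi_model(model: Optional[str]) -> bool:
    # Compare on the bare model name, ignoring any "vendor/" prefix.
    bare = (model or "").strip().lower().rsplit("/", 1)[-1]
    return bare.startswith("kimi-") or bare == "kimi"

def _fixed_temperature_for_model(model, base_url=None):
    if _is_kimi_model(model):
        return OMIT_TEMPERATURE  # managed by the service
    # other model‑specific rules omitted for brevity
    return None  # no fixed temperature for this model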

Arcee Trinity Thinking has a fixed temperature of 0.5, illustrating how each model may have its own contractual parameters.
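A caller has to distinguish three outcomes: omit the parameter, force a fixed value, or pass the caller's own value through. The sketch below shows one way to do that. `fixed_temperature` and `build_kwargs` are hypothetical helpers, and the model id `trinity-thinking` is a placeholder, since the source does not give the real identifier.

```python
from typing import Any, Dict, List, Optional

OMIT_TEMPERATURE = object()  # sentinel: do not send temperature at all


def fixed_temperature(model: str) -> Any:
    """Return the sentinel, a fixed float, or None when no contract applies."""
    bare = model.strip().lower().rsplit("/", 1)[-1]
    if bare.startswith("kimi-") or bare == "kimi":
        return OMIT_TEMPERATURE
    if bare == "trinity-thinking":  # placeholder id, not taken from the source
        return 0.5
    return None


def build_kwargs(model: str, messages: List[Dict[str, str]],
                 temperature: Optional[float] = None) -> Dict[str, Any]:
    """Build chat.completions kwargs that respect per-model contracts."""
    kwargs: Dict[str, Any] = {"model": model, "messages": messages}
    fixed = fixed_temperature(model)
    if fixed is OMIT_TEMPERATURE:
        return kwargs  # the service decides; send nothing
    if fixed is not None:
        kwargs["temperature"] = fixed  # the contract overrides the caller
    elif temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs
```

Using a sentinel object rather than None keeps "no rule for this model" and "never send this parameter" apart.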

07 | Protocol‑conversion layer: one interface for all providers

All auxiliary calls use a single client signature:

client.chat.completions.create(**kwargs)
response.choices[0].message.content

Three adapters reconcile differing provider protocols:

_AnthropicCompletionsAdapter: converts chat.completions kwargs to Anthropic Messages requests and back.

_CodexCompletionsAdapter: translates chat.completions into Codex Responses API calls (including message‑to‑instruction conversion and streaming) and wraps the result.

_OpenAIProxy: lazily loads the OpenAI SDK to avoid a 240 ms cold‑start penalty.
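The adapter idea can be illustrated with a small class that exposes `chat.completions.create()` on top of a Messages‑style backend. This is a simplified sketch, not the real `_AnthropicCompletionsAdapter`: the `send` callback is a hypothetical transport that takes a request dict and returns a response dict, and kwargs other than `model`, `messages`, and `max_tokens` are dropped.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


class CompletionsAdapter:
    """Expose client.chat.completions.create() over a Messages-style backend."""

    def __init__(self, send: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self._send = send  # hypothetical transport: request dict -> response dict
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, *, model: str, messages: List[Dict[str, str]],
                max_tokens: int = 1024, **_ignored: Any) -> SimpleNamespace:
        # The Messages format carries the system prompt as a top-level field.
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        request: Dict[str, Any] = {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [m for m in messages if m["role"] != "system"],
        }
        if system:
            request["system"] = system
        reply = self._send(request)
        # Join the text blocks and present them in chat.completions shape.
        text = "".join(b["text"] for b in reply["content"] if b.get("type") == "text")
        message = SimpleNamespace(role="assistant", content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
```

Callers then read `response.choices[0].message.content` exactly as they would with an OpenAI‑compatible client.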

URL rewriting is fine‑grained; for example, Anthropic‑style base URLs are transformed into OpenAI‑compatible forms, and Kimi Coding URLs receive a /v1 suffix.

def _to_openai_base_url(base_url: str) -> str:
    """Convert Anthropic‑wire base_url to OpenAI‑compatible format"""
    # Bind the name used below; dropping a trailing slash lets the suffix checks match.
    url = base_url.rstrip("/")
    if url.endswith("/anthropic"):
        if "open.bigmodel.cn" in url:
            return url[:-len("/anthropic")] + "/paas/v4"
        return url[:-len("/anthropic")] + "/v1"
    if "api.kimi.com" in url and url.endswith("/coding"):
        return url + "/v1"  # Kimi Coding needs a /v1 suffix
    return url

All provider‑specific quirks are encapsulated in this layer, leaving upper‑level code oblivious to protocol differences.

Summary

Hermes’s auxiliary_client.py finds a usable LLM for auxiliary tasks while handling many edge cases: provider outages, credit exhaustion, whitelist changes, protocol incompatibilities, multi‑account rotation, and cross‑session rate‑limit protection. Core principles of the seven‑level fallback chain are:

Prefer the user‑configured provider; do not divert their quota.

Maintain a deterministic fallback order; avoid randomness.

Exclude providers known to silently fail (e.g., Codex).

Flatten protocol differences in the client layer so callers see a uniform API.

Handle credit and rate‑limit issues inside the call layer, keeping business logic clean.

