AI‑Trader: Real‑time Benchmark for Autonomous LLM Agents in Financial Markets
The AI‑Trader benchmark evaluates large language model agents in fully autonomous, real‑time US stock, Chinese A‑share, and cryptocurrency markets, revealing that general intelligence alone does not guarantee profitable trading, while robust risk‑control mechanisms drive cross‑market stability and excess returns.
Background
Large language models (LLMs) have shown great potential as autonomous agents, yet existing static benchmarks (question answering, code completion, single‑turn instruction following) fail to capture the dynamic, continuous, and highly volatile nature of financial markets. Real‑time market environments are ideal for testing agents' planning, information retrieval, numerical reasoning, and decision‑making abilities.
Problem Definition
Current evaluation methods cannot handle the extreme volatility, real‑time constraints, and information uncertainty of financial markets, leading to a gap between simulated performance and real‑world trading capability. An automated, fully autonomous, data‑clean benchmark is needed to systematically assess LLM‑based agents.
Method
AI‑Trader Framework Design – Simulates the complete workflow of a professional analyst: real‑time market research, strategic reasoning, and autonomous trade execution, with no human intervention. The system consists of two decoupled components: a real‑time trading environment and a trading agent.
Real‑time Trading Environment – Covers three major markets: US equities (NASDAQ‑100 constituents), Chinese A‑shares (SSE‑50 constituents), and cryptocurrency (10 pairs tracked by the Bitwise index). Both hourly and daily trading frequencies are supported to capture diverse market behaviors.
Trading Agent
Design principles: all information must be obtained via tools, decisions must stem from autonomous reasoning, and actions must be executable under real‑world constraints.
Reasoning follows the ReAct paradigm (think‑then‑act). The agent may request additional observations or directly output a trade decision, with all intermediate natural‑language reasoning recorded for auditability.
Action space includes three discrete actions per asset – buy (increase position), sell (decrease position), or hold. Actions exceeding available liquidity are rejected, triggering a self‑correction mechanism.
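The think‑then‑act loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's implementation: `Step`, `run_react`, `toy_policy`, the `check_price` stub, and the `max_steps` budget are all hypothetical names, and the "policy" is a hard‑coded function standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str   # natural-language reasoning, kept for auditability
    kind: str      # "observe" (call a tool) or "decide" (emit actions)
    payload: dict  # tool request, or {ticker: "buy" | "sell" | "hold"}


def run_react(policy: Callable[[List[dict]], Step],
              tools: Dict[str, Callable[..., dict]],
              max_steps: int = 5) -> Tuple[dict, List[Step]]:
    """Alternate reasoning and tool calls until the policy emits a decision."""
    trace: List[Step] = []
    observations: List[dict] = []
    for _ in range(max_steps):
        step = policy(observations)
        trace.append(step)  # every intermediate thought is recorded
        if step.kind == "decide":
            return step.payload, trace
        tool = tools[step.payload["tool"]]
        observations.append(tool(**step.payload["args"]))
    return {}, trace  # step budget exhausted: no orders, i.e. hold everything


def toy_policy(observations: List[dict]) -> Step:
    """Hypothetical stand-in for the LLM: look up one price, then decide."""
    if not observations:
        return Step("I need the latest price first.", "observe",
                    {"tool": "check_price", "args": {"ticker": "AAPL"}})
    price = observations[-1]["close"]
    action = "buy" if price < 100 else "hold"
    return Step(f"Close is {price}; choosing {action}.", "decide",
                {"AAPL": action})


# Stub tool returning a made-up price, for illustration only.
tools = {"check_price": lambda ticker: {"ticker": ticker, "close": 95.0}}
decision, trace = run_react(toy_policy, tools)
```

Returning the trace alongside the decision is one simple way to keep the reasoning auditable; a real agent would also feed rejected orders back in as observations so it can self‑correct.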
Toolchain (Model Context Protocol) – Minimal set of tools built on MCP:
Check Price: Retrieves accurate price, volume, and OHLC data for a given ticker, handling market‑specific codes.
Search: Performs time‑restricted web searches for market, company, or macro‑economic information, returning news, announcements, and analyst reports.
News: Provides structured financial news and sentiment signals with timestamps.
Math: Enables basic numerical calculations during reasoning.
Trade: Executes buy/sell orders, updates portfolio holdings and cash balances, and enforces market rules.
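To make the Trade tool's bookkeeping concrete, here is a minimal sketch of order execution with rejection. It assumes that "available liquidity" means cash on hand for buys and shares held for sells. The `Portfolio` class and `trade` function are hypothetical, and fees, lot sizes, and other market‑specific rules are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Portfolio:
    cash: float
    holdings: Dict[str, int] = field(default_factory=dict)


def trade(portfolio: Portfolio, ticker: str, side: str,
          quantity: int, price: float) -> dict:
    """Apply one order, or reject it without touching the portfolio.

    A result dict is returned instead of raising, so an agent could read
    the rejection reason and submit a corrected order.
    """
    if side not in ("buy", "sell") or quantity <= 0 or price <= 0:
        return {"ok": False, "reason": "invalid order"}
    if side == "buy":
        cost = quantity * price
        if cost > portfolio.cash:
            return {"ok": False, "reason": "insufficient cash"}
        portfolio.cash -= cost
        portfolio.holdings[ticker] = portfolio.holdings.get(ticker, 0) + quantity
    else:
        held = portfolio.holdings.get(ticker, 0)
        if quantity > held:
            return {"ok": False, "reason": "insufficient shares"}
        portfolio.cash += quantity * price
        portfolio.holdings[ticker] = held - quantity
    return {"ok": True, "reason": ""}


p = Portfolio(cash=1000.0)
too_big = trade(p, "AAPL", "buy", 20, 100.0)   # needs 2000, only 1000 available
retry = trade(p, "AAPL", "buy", 5, 100.0)      # corrected size fits the budget
oversell = trade(p, "AAPL", "sell", 10, 100.0)  # only 5 shares are held
```

In this toy run the oversized buy and the oversell are both rejected and leave the portfolio unchanged, while the smaller retry succeeds.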
Experiments
Setup – Six mainstream LLM backbones are evaluated under an identical agent configuration: DeepSeek‑v3.1, MiniMax‑M2, Claude‑3.7‑Sonnet, GPT‑5, Qwen3‑max, and Gemini‑2.5‑Flash. Each model receives the same trading objectives and tool access.
Metrics – Cumulative Return (CR), Sortino Ratio (SR), Volatility (Vol), and Maximum Drawdown (MDD).
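The summary does not give the exact formulas behind these four metrics, so the sketch below uses common textbook definitions computed from a series of portfolio values. The annualization factor (`periods_per_year`), the zero target return for the Sortino ratio, and the sign convention (drawdown reported as a negative number, matching figures such as MDD = -4.92%) are assumptions.

```python
import math
from typing import List, Sequence


def cumulative_return(values: Sequence[float]) -> float:
    """Total return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def period_returns(values: Sequence[float]) -> List[float]:
    """Simple return of each period relative to the previous one."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def volatility(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized sample standard deviation of period returns."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return math.sqrt(var * periods_per_year)


def sortino_ratio(returns: Sequence[float], target: float = 0.0,
                  periods_per_year: int = 252) -> float:
    """Mean excess return over downside deviation, annualized."""
    mean_excess = sum(r - target for r in returns) / len(returns)
    downside = math.sqrt(
        sum(min(0.0, r - target) ** 2 for r in returns) / len(returns)
    )
    if downside == 0.0:
        return math.inf  # no below-target periods in the sample
    return mean_excess / downside * math.sqrt(periods_per_year)


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a negative fraction."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1.0)
    return worst


# Made-up portfolio values, for illustration only.
values = [100.0, 110.0, 99.0, 120.0]
rets = period_returns(values)
cr = cumulative_return(values)   # 20% gain overall
mdd = max_drawdown(values)       # drop from 110 to 99
sr = sortino_ratio(rets, periods_per_year=1)
vol = volatility(rets, periods_per_year=1)
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only below‑target returns, which is why an agent with large upside swings but shallow losses can still score well on it.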
Results
General Intelligence vs. Trading Ability – Most agents deliver poor returns and weak risk management. In the US market, GPT‑5 achieves only 1.56% CR (benchmark QQQ: 1.87%) and Qwen3‑max only 0.39%; in A‑shares, both lose about 3.5% with a negative SR. In crypto, Gemini‑2.5‑Flash loses 18.63% during November's market correction, while GPT‑5's lack of liquidity awareness leads to a 16.41% loss.
Risk‑Control Determines Cross‑Market Robustness – MiniMax‑M2 shows the most stable performance: US CR = 9.56% (SR = 4.42, MDD = ‑4.92%), A‑share CR ≈ 1.31% (lowest Vol = 6.72%, MDD = ‑2.15%), and in crypto it maintains the lowest drawdown. DeepSeek‑v3.1 wins in crypto by holding ~41% cash during downturns and buying on dips.
Market Liquidity Effects – In the mature US market, several agents outperform the QQQ benchmark: MiniMax‑M2 by +7.69% excess return, DeepSeek‑v3.1 by +6.52%, and Claude‑3.7‑Sonnet by +1.24%. No agent surpasses the SSE‑50 benchmark in the A‑share market, which suggests that its higher volatility and sentiment‑driven dynamics hinder performance.
Model Generalization Limits – Performance does not transfer across markets. DeepSeek‑v3.1 yields 8.39% CR (SR = 3.73) in US equities but –1.23% CR (SR = ‑0.18) in A‑shares, yet regains its advantage in crypto.
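The US excess returns quoted above are consistent with a simple difference between each agent's cumulative return and the QQQ benchmark's 1.87%. The short check below reproduces that arithmetic from the figures in this summary. The variable names are illustrative, and the simple‑difference definition is inferred from the numbers rather than stated by the source.

```python
QQQ_CR = 1.87  # benchmark cumulative return in percent, as quoted above

# US-market cumulative returns in percent, as quoted in this summary.
us_cr = {
    "MiniMax-M2": 9.56,
    "DeepSeek-v3.1": 8.39,
    "GPT-5": 1.56,
    "Qwen3-max": 0.39,
}

# Excess return = agent CR minus benchmark CR, rounded to two decimals.
excess = {name: round(cr - QQQ_CR, 2) for name, cr in us_cr.items()}

for name, value in excess.items():
    print(f"{name}: {value:+.2f} percentage points vs. QQQ")
```

MiniMax‑M2 and DeepSeek‑v3.1 come out at +7.69 and +6.52, matching the reported excess returns, while GPT‑5 and Qwen3‑max fall below the benchmark.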
Case Studies
On 10 Oct, during a sharp US market pull‑back, DeepSeek‑v3.1 proactively reallocates to consumer staples and utilities, increases its cash buffer, and reduces exposure to trade‑war‑sensitive tech stocks, avoiding large losses.
On 24 Oct, during an A‑share rally, DeepSeek‑v3.1 follows an unverified “structural slow‑bull” news signal and over‑weights the energy and banking sectors, so it misses the rally and under‑performs the SSE‑50 index.
Overall, the study highlights that autonomous LLM agents require strong risk‑control mechanisms and market‑aware strategies to achieve robust, cross‑market trading performance, and it provides a concrete benchmark (AI‑Trader) for future research.
