Analyzing CN‑Buzz2Portfolio: A Chinese Market Dataset for LLM‑Driven Macro and Sector Asset Allocation

This article reviews the CN‑Buzz2Portfolio benchmark, which maps daily Chinese hot‑news streams to macro‑ and industry‑level ETF allocations, introduces a three‑stage CPA pipeline for evaluating large language models as autonomous financial agents, and reports extensive experiments on nine state‑of‑the‑art LLMs across two rolling market periods.

Bighead's Algorithm Notes

Background

Large language models (LLMs) are moving from static NLP tasks toward dynamic decision‑making agents in complex financial environments. Existing evaluation paradigms either rely on live‑trading platforms, which lack reproducibility, or static benchmarks that ignore open‑world information flows, creating a dual evaluation bottleneck: reasoning consistency (semantic → logical) and attribution noise (logical → outcome) in noisy, non‑stationary markets.

The authors introduce CN‑Buzz2Portfolio, a reproducible Chinese‑market benchmark covering a rolling window from 2024 to mid‑2025. It aligns daily top‑20 hot‑news topics from four major Chinese financial platforms with macro‑ and sector‑level asset‑allocation decisions, thereby providing a realistic testbed for LLM‑based financial agents.

Problem Definition

The goal is to design a benchmark that accurately evaluates an LLM’s ability to translate macro narratives into portfolio weights, to overcome the limitations of prior datasets, and to offer a standardized evaluation protocol that promotes sustainable research on autonomous financial agents.

Method

3.1 Task Formalization – At each time step t, the agent observes a tuple <N_t, P_{hist}, T_{hist}, H_t>, where N_t is the unstructured hot‑news stream, P_{hist} and T_{hist} are historical prices and trade records, and H_t is the current portfolio state. The action space consists of the next rebalancing instruction w_{t+1}, constrained to a programmatic interface for reproducibility.
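The observation tuple can be pictured as a small typed record. The sketch below is only an illustration: the field names, the `is_valid_target` helper, and the long-only, no-leverage rules are assumptions, not part of the benchmark's published interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Observation:
    """Agent input at step t, mirroring <N_t, P_hist, T_hist, H_t>."""
    news: List[str]                         # N_t: unstructured hot-news items
    price_history: Dict[str, List[float]]   # P_hist: past prices per asset
    trade_history: List[dict]               # T_hist: previously executed trades
    holdings: Dict[str, float]              # H_t: current holdings per asset


def is_valid_target(weights: Dict[str, float], universe: List[str]) -> bool:
    """Check a candidate w_{t+1} under assumed long-only, no-leverage rules."""
    if any(asset not in universe for asset in weights):
        return False                        # asset outside the pool
    if any(w < 0 for w in weights.values()):
        return False                        # no short positions (assumption)
    # Weights may sum to less than 1; the remainder stays in cash (assumption).
    return sum(weights.values()) <= 1.0 + 1e-9


obs = Observation(
    news=["Central bank announces rate cut"],
    price_history={"equity_etf": [1.00, 1.02], "bond_etf": [1.00, 1.01]},
    trade_history=[],
    holdings={"equity_etf": 0.0, "bond_etf": 0.0},
)
universe = list(obs.price_history)
print(is_valid_target({"equity_etf": 0.6, "bond_etf": 0.3}, universe))
```

Validating the action in code rather than trusting free-form model output is one way to keep runs reproducible, which is the stated reason for the programmatic interface.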

3.2 Data Construction – “Hot” Stream – The dataset aggregates the daily top‑20 topics from four Chinese financial platforms. Strict timestamp filtering ensures that only news released before the market close on day T can influence the allocation for day T, which avoids forward‑looking (look‑ahead) bias.
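A minimal sketch of such a timestamp filter follows. The `NewsItem` record, the `visible_news` function, and the 15:00 cutoff are illustrative assumptions; the article does not state the exact cutoff time or data schema the benchmark uses.

```python
from datetime import date, datetime, time
from typing import List, NamedTuple


class NewsItem(NamedTuple):
    published_at: datetime   # publication time, assumed to be exchange-local
    rank: int                # position on the daily hot list, 1 = hottest
    title: str


MARKET_CLOSE = time(15, 0)   # assumed cutoff: 15:00 local market close


def visible_news(items: List[NewsItem], trade_day: date, top_k: int = 20) -> List[NewsItem]:
    """Return up to top_k items published strictly before the close on trade_day."""
    cutoff = datetime.combine(trade_day, MARKET_CLOSE)
    eligible = [item for item in items if item.published_at < cutoff]
    return sorted(eligible, key=lambda item: item.rank)[:top_k]


day = date(2024, 9, 24)
hot_list = [
    NewsItem(datetime(2024, 9, 24, 15, 30), 1, "published after the close"),
    NewsItem(datetime(2024, 9, 24, 9, 45), 2, "published in the morning"),
    NewsItem(datetime(2024, 9, 24, 14, 59), 3, "published just before the close"),
]
for item in visible_news(hot_list, day):
    print(item.rank, item.title)
```

The item ranked first is dropped because it appeared after the cutoff, even though it is the hottest topic of the day.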

3.3 Asset Scope – Macro and Industry Views – Two ETF‑based asset pools are built:

Task A (Macro & Theme): 11 broad indices covering equities, bonds, gold, and market styles (large‑cap, small‑cap).

Task B (Industry Rotation): 14 sector ETFs representing key Chinese industry nodes such as new energy and TMT.

3.4 Unified Trading Protocol – Three‑Stage CPA Multi‑Agent Framework

Compression: A_{sum} filters the noisy news list N_t into a structured set of finance‑relevant events, improving signal‑to‑noise ratio.

Perception: A_{ana} analyzes the filtered events together with the asset definitions to assess narrative impact on each sector, without relying on price data.

Allocation: A_{trade} combines the qualitative insights from A_{ana} with historical price/transaction data (P_{hist}, T_{hist}) and the current holdings (H_t) to produce concrete rebalancing orders.
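The data flow between the three stages can be sketched as plain function composition. In the benchmark each stage is an LLM call; here the three `toy_*` functions are hypothetical keyword-based stand-ins, and `run_cpa_step` and the order format are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Summarizer = Callable[[List[str]], List[str]]
Analyst = Callable[[List[str], Dict[str, str]], Dict[str, str]]
Trader = Callable[[Dict[str, str], Dict[str, List[float]], List[dict], Dict[str, float]], List[dict]]


def run_cpa_step(news, assets, price_history, trade_history, holdings,
                 a_sum: Summarizer, a_ana: Analyst, a_trade: Trader) -> List[dict]:
    """One decision step: Compression -> Perception -> Allocation."""
    events = a_sum(news)                    # Compression: drop non-financial noise
    views = a_ana(events, assets)           # Perception: sees no price data
    return a_trade(views, price_history, trade_history, holdings)  # Allocation


# Toy stand-ins for the LLM agents (hypothetical, for illustration only).
def toy_sum(news):
    return [n for n in news if "rate" in n.lower() or "policy" in n.lower()]


def toy_ana(events, assets):
    return {name: ("positive" if events else "neutral") for name in assets}


def toy_trade(views, price_history, trade_history, holdings):
    return [{"action": "buy", "asset": a, "budget": 1000.0}
            for a, view in views.items() if view == "positive"]


orders = run_cpa_step(
    news=["Central bank cuts policy rate", "Celebrity wedding trends online"],
    assets={"bond_etf": "government bond index ETF"},
    price_history={"bond_etf": [1.00, 1.01]},
    trade_history=[],
    holdings={"bond_etf": 0.0},
    a_sum=toy_sum, a_ana=toy_ana, a_trade=toy_trade,
)
print(orders)
```

Note that the Perception stage receives only events and asset definitions, which matches the description above: price data enters the pipeline only at the Allocation stage.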

3.5 Execution Layer and Action Constraints – To mitigate arithmetic errors in LLMs, numerical calculations are offloaded to a deterministic execution engine. Actions are expressed as structured commands:

Budget‑based buying (e.g., “allocate ¥5,000 to asset X”).

Proportion‑based selling (e.g., “sell 50 % of asset Y”).
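The two command types can be applied by a small deterministic function, so the LLM never performs the arithmetic itself. This is a sketch under simplifying assumptions that are not taken from the article: fractional units, no fees, no lot sizes, and a hypothetical dictionary format for orders.

```python
from typing import Dict, List, Tuple


def execute(orders: List[dict], cash: float, units: Dict[str, float],
            prices: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Apply structured orders in sequence and return the new (cash, units)."""
    cash = float(cash)
    units = dict(units)                     # do not mutate the caller's state
    for order in orders:
        asset = order["asset"]
        price = prices[asset]
        if order["action"] == "buy":
            # Budget-based buy: spend a cash amount, capped at available cash.
            spend = max(0.0, min(order["budget"], cash))
            units[asset] = units.get(asset, 0.0) + spend / price
            cash -= spend
        elif order["action"] == "sell":
            # Proportion-based sell: sell a fraction of the current position.
            fraction = min(max(order["fraction"], 0.0), 1.0)
            sold = units.get(asset, 0.0) * fraction
            units[asset] = units.get(asset, 0.0) - sold
            cash += sold * price
        else:
            raise ValueError(f"unknown action: {order['action']!r}")
    return cash, units


cash, units = execute(
    orders=[
        {"action": "buy", "asset": "X", "budget": 5000.0},   # allocate 5,000 to X
        {"action": "sell", "asset": "Y", "fraction": 0.5},   # sell 50 % of Y
    ],
    cash=10000.0,
    units={"Y": 100.0},
    prices={"X": 2.5, "Y": 10.0},
)
print(cash, units)
```

Clamping the budget and the fraction inside the engine means a malformed instruction degrades to a smaller trade rather than producing negative cash or a short position.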

Experiments

4.1 Evaluation Periods

Stage 1 (2024 full year): a “bear‑to‑bull” transition with high volatility and dense policy shifts.

Stage 2 (first half of 2025): a “high‑volatility oscillation” phase where the CSI 300 index fluctuates sharply but net returns are modest.

4.2 Model Selection – Nine cutting‑edge LLMs are evaluated and grouped by reasoning paradigm and scale:

Reasoning‑oriented models: DeepSeek‑R1, Qwen‑3‑Max‑Think, Qwen‑3‑32B‑Think (chain‑of‑thought enabled).

General instruction models: GPT‑5, Gemini‑2.5‑Pro, DeepSeek‑V3, GLM‑4.6, Qwen‑3‑Max, Qwen‑3‑32B.

4.3 Evaluation Metrics – Cumulative return, Sharpe ratio, maximum drawdown, and volatility are reported for each model on both tasks.
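For reference, the four metrics can be computed from a daily net-asset-value series as sketched below. The conventions used here (252 trading days per year, sample standard deviation, a zero risk-free rate by default) are common choices and are assumptions; the article does not state which conventions the benchmark applies.

```python
import math
from typing import Dict, List


def evaluate(nav: List[float], periods_per_year: int = 252,
             risk_free: float = 0.0) -> Dict[str, float]:
    """Compute return and risk metrics from a series of portfolio values."""
    if len(nav) < 3:
        raise ValueError("need at least three NAV points")
    returns = [nav[i] / nav[i - 1] - 1.0 for i in range(1, len(nav))]
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)

    # Annualized volatility and Sharpe ratio (risk_free is an annual rate).
    volatility = math.sqrt(var) * math.sqrt(periods_per_year)
    sharpe = ((mean * periods_per_year - risk_free) / volatility
              if volatility > 0 else float("nan"))

    # Maximum drawdown: largest peak-to-trough loss as a fraction of the peak.
    peak, max_drawdown = nav[0], 0.0
    for value in nav:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, 1.0 - value / peak)

    return {
        "cumulative_return": nav[-1] / nav[0] - 1.0,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
        "volatility": volatility,
    }


metrics = evaluate([100.0, 110.0, 99.0, 120.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy series the portfolio ends 20 % up but passes through a 10 % drawdown (110 to 99), which is why cumulative return alone does not capture the risk taken.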

Results and Analysis

Overall Effectiveness – Across both periods, the three‑stage pipeline yields positive absolute returns for most large‑scale models, confirming that the “hot” news stream contains exploitable financial signals.

Beta‑Trap in 2024 Macro Task – Models such as DeepSeek‑V3 and GLM‑4.6 underperform the CSI 300 benchmark (16.20 % return) because their defensive positioning during the bear phase limits upside when policy‑driven rebounds occur.

Structural Alpha in Industry Task – All models surpass the benchmark, indicating strong capability to identify sector‑specific opportunities from news narratives and allocate capital accordingly.

Rolling‑Update Necessity – Performance gaps between 2024 and 2025 highlight the importance of continuously updating the benchmark; on a static dataset, models could have memorized the evaluation period during training, which would reintroduce forward‑looking bias.

Variance Decomposition – In 2024, model variance far exceeds random variance, confirming genuine performance hierarchies. In the trendless oscillation regime of 2025, the variance ratio approaches 1, suggesting that LLM decisions become indistinguishable from random permutations.
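One simple way to read such a variance ratio is to divide the spread of returns across models by the spread across a random reference set. The article does not give the exact decomposition or how the random reference is constructed, so the `variance_ratio` function and all numbers below are assumptions and made-up illustrative inputs, not results from the benchmark.

```python
from statistics import variance
from typing import Sequence


def variance_ratio(model_returns: Sequence[float],
                   random_returns: Sequence[float]) -> float:
    """Sample variance across models divided by variance across random runs."""
    baseline = variance(random_returns)
    if baseline == 0:
        raise ValueError("random baseline has zero variance")
    return variance(model_returns) / baseline


# Hypothetical period returns, chosen only to show the two regimes.
random_runs = [0.02, 0.05, 0.03, 0.06, 0.04]
spread_out_models = [-0.05, 0.10, 0.25, 0.02, 0.18]   # models clearly differ
clustered_models = [0.03, 0.06, 0.02, 0.05, 0.04]     # same spread as random

print(variance_ratio(spread_out_models, random_runs))
print(variance_ratio(clustered_models, random_runs))
```

A ratio well above 1 means the differences between models are larger than chance alone would produce; a ratio near 1 means the ranking of models in that period carries little information.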

Information‑Utility Curve (Ablation) – Best performance is achieved with Top‑5 or Top‑10 news items; extending to Top‑20 introduces non‑financial noise and degrades results.

Top‑0 Paradox – During the 2025 oscillation phase, a price‑only (Top‑0) baseline often outperforms news‑augmented inputs, revealing that in trend‑less markets news can be contradictory and cause hallucinated narratives.

Scaling‑Law Paradox – Larger models dominate in 2024 (knowledge advantage) but can over‑react to noisy signals in 2025, allowing smaller models to achieve better risk‑adjusted returns.

Additional Ablation Studies

The authors further explore the “best‑point vs. filtering failure” phenomenon, the impact of knowledge density on performance, and the “ability trap” where superior model capacity does not guarantee superior financial outcomes under high‑noise conditions.
