Tagged articles
183 articles
Page 1 of 2
Machine Heart
Machine Heart
May 31, 2026 · Artificial Intelligence

Defining a Good Answer in the Agent Era: A Rubrics Survey

This survey examines how rubrics can decompose the vague notion of a "good answer" for large language models into concrete, multi‑dimensional evaluation criteria, detailing their definition, construction methods, applications in training and evaluation, and the open challenges they present.

AI alignmentLarge Language Modelsagentic AI
0 likes · 13 min read
Defining a Good Answer in the Agent Era: A Rubrics Survey
DataFunTalk
DataFunTalk
May 31, 2026 · Artificial Intelligence

The Most Comprehensive Survey of Agent Harness Engineering

This article summarizes the Agent Harness Engineering survey, outlining the evolution from Prompt to Context to Harness engineering, presenting the seven‑layer ETCLOVG framework, benchmark findings, and the shift toward platform‑level observability, governance, and trace‑native evaluation for reliable AI agents.

Agent HarnessContext EngineeringETCLOVG
0 likes · 12 min read
The Most Comprehensive Survey of Agent Harness Engineering
DataFunTalk
DataFunTalk
May 29, 2026 · Artificial Intelligence

From Prompt to Context to Harness: Unpacking the Three Paradigm Shifts in Agent Engineering

The survey "Agent Harness Engineering: A Survey" reveals how agent systems have evolved from prompt engineering to context engineering and now to harness engineering, introduces the seven‑layer ETCLOVG framework, shows benchmark gains from better harnesses, and argues that observability, governance, and trace‑native evaluation are essential for production‑grade AI agents.

AI agentsContext EngineeringGovernance
0 likes · 14 min read
From Prompt to Context to Harness: Unpacking the Three Paradigm Shifts in Agent Engineering
AI Engineer Programming
AI Engineer Programming
May 29, 2026 · Artificial Intelligence

How to Build a Reliable RAG Test Dataset

The article explains why a structured test set is essential for Retrieval‑Augmented Generation systems, outlines failure modes, describes layered evaluation of retrieval and generation, details infrastructure like chunk IDs and manifests, and provides a complete annotation pipeline with cold‑start and adversarial strategies.

LLMRAGadversarial
0 likes · 24 min read
How to Build a Reliable RAG Test Dataset
DataFunTalk
DataFunTalk
May 28, 2026 · Artificial Intelligence

The Most Comprehensive Survey on Agent Harness Engineering Revealed

This article summarizes the 71‑page survey "Agent Harness Engineering: A Survey", detailing the shift from prompt to context to harness engineering, introducing the seven‑layer ETCLOVG framework, benchmark results showing up to 10× gains, and arguing that future competition will focus on the engineering shell surrounding LLM agents rather than model size alone.

AI SystemsAgentFramework
0 likes · 15 min read
The Most Comprehensive Survey on Agent Harness Engineering Revealed
大转转FE
大转转FE
May 21, 2026 · Artificial Intelligence

Why AI Buzzwords Multiply Faster Than My Hair Falls

The article maps three generations of AI engineering—Prompt Engineering, Context Engineering, and Harness Engineering—explaining their core capabilities, key terms like LLM, RAG, Agent, and evaluation methods, while offering practical tips, pitfalls, and a concise three‑question checklist to stay grounded amid the rapid influx of new AI jargon.

AIAgentHarness
0 likes · 19 min read
Why AI Buzzwords Multiply Faster Than My Hair Falls
PaperAgent
PaperAgent
May 19, 2026 · Artificial Intelligence

Why Long-Term Memory Needs Vision: How MemEye Evaluates Multimodal Agent Recall

MemEye is a multimodal memory benchmark that tests agents across eight real‑world scenarios, measuring visual evidence granularity and reasoning depth, and reveals that captions fall short for fine‑grained visual recall, highlighting the need for true visual memory in long‑term AI agents.

AI agentsMemEyebenchmark
0 likes · 4 min read
Why Long-Term Memory Needs Vision: How MemEye Evaluates Multimodal Agent Recall
DataFunTalk
DataFunTalk
May 19, 2026 · Industry Insights

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms

A live discussion dissected the shift from single‑point Copilot assistants to platform‑level Agentic data platforms, exposing hard architectural, security, knowledge‑base, evaluation, stability‑cost, and governance challenges while debating whether the future will favor a super‑agent or a multi‑agent ecosystem.

Big DataData PlatformEnterprise Governance
0 likes · 18 min read
From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms
High Availability Architecture
High Availability Architecture
May 19, 2026 · Artificial Intelligence

5 Essential Tools to Install Before Building an AI Agent

The article outlines five critical setup steps—privacy with direnv and a secret manager, token handling via litellm or portkey, context management using uv and git commits, visibility through mitmproxy, and rigorous evaluation with inspect‑ai—showing how they cut token waste by 68.3%, reduce costs 92.5% and raise evaluation pass rates to 94.2% across 347 runs.

AI agentsDevOpscost optimization
0 likes · 9 min read
5 Essential Tools to Install Before Building an AI Agent
DataFunSummit
DataFunSummit
May 18, 2026 · Artificial Intelligence

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Paths for Data Platforms

A 90‑minute live discussion examined how data platforms must evolve from simple Copilot assistants to fully agentic systems, covering architectural redesign, security guardrails, knowledge‑base integration, evaluation pitfalls, cost management, and whether the future favors a super‑agent or a multi‑agent ecosystem.

Cost ManagementData PlatformFuture Trends
0 likes · 20 min read
From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Paths for Data Platforms
James' Growth Diary
James' Growth Diary
May 11, 2026 · Artificial Intelligence

Mastering RAG Evaluation: Recall@K, MRR, NDCG, and RAGAS Explained

This article breaks down RAG evaluation into a two‑layer framework, explains the four core metrics—Recall@K, MRR, NDCG, and the four RAGAS scores—shows how to implement them with LangChain.js, highlights common pitfalls, and offers scenario‑specific metric combinations for reliable performance monitoring.

LangChainMRRNDCG
0 likes · 20 min read
Mastering RAG Evaluation: Recall@K, MRR, NDCG, and RAGAS Explained
Wuming AI
Wuming AI
May 10, 2026 · Artificial Intelligence

Can Large Models Really Understand 1 M Tokens? Lessons from the RULER Benchmark

The article examines why a model’s advertised context window (e.g., 128 K or 1 M tokens) does not guarantee effective long‑context reasoning, summarizing the RULER framework that breaks long‑context ability into retrieval, interference resistance, multi‑hop tracking, aggregation, and multi‑answer recall, and offering practical guidance for evaluating and using such models.

LLMRULERaggregation
0 likes · 16 min read
Can Large Models Really Understand 1 M Tokens? Lessons from the RULER Benchmark
Machine Heart
Machine Heart
May 10, 2026 · Artificial Intelligence

Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly

The HiLight approach inserts lightweight highlight tags into full-length inputs, training a small Emphasis Actor to score token importance and guide a frozen large language model, improving performance on tasks like recommendation and QA without modifying the solver, while keeping low latency and training cost.

LLMLow latencyevaluation
0 likes · 9 min read
Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly
Architect
Architect
May 4, 2026 · Artificial Intelligence

What Skills Architects Must Master in the Agent Era and Which Will Last Six Months

In the fast‑changing Agent era, architects should focus on durable engineering capabilities—context management, tool design, evaluation, harness, permissions, and cost control—rather than chasing the latest frameworks, ensuring agents remain stable and controllable in production systems.

AI agentsContext ManagementHarness
0 likes · 26 min read
What Skills Architects Must Master in the Agent Era and Which Will Last Six Months
AI Engineering
AI Engineering
May 4, 2026 · Artificial Intelligence

Why the Big‑Model Race Is Over: Where Real Value Lies in AI Infrastructure

The article argues that the competition over which large language model will dominate is outdated, explaining that true value now comes from building multi‑model routing, context engineering, standardized tool protocols, intelligent orchestration, and robust evaluation layers that turn models into reliable AI infrastructure.

AI infrastructureMCPModel routing
0 likes · 6 min read
Why the Big‑Model Race Is Over: Where Real Value Lies in AI Infrastructure
PMTalk Product Manager Community
PMTalk Product Manager Community
May 4, 2026 · Product Management

2026 AI Product Manager: The Essential Capability Model

By 2026, AI product managers must shift from merely using models to delivering stable, valuable results, mastering seven core abilities—demand judgment, evaluation-driven iteration, context design, RAG strategy, agent orchestration, solution planning, and rapid Vibe Coding—to close the loop between business needs and AI capabilities.

AI product managementAgent DesignContext Engineering
0 likes · 13 min read
2026 AI Product Manager: The Essential Capability Model
AgentGuide
AgentGuide
May 3, 2026 · Artificial Intelligence

How to Evaluate an AI Agent Beyond Just Accuracy

Evaluating AI agents requires more than accuracy; you must measure task completion, execution trace, tool usage, latency, cost, error rates, and both explicit and implicit user feedback, using observability, offline smoke‑test and regression suites, and continuous online monitoring to create a closed‑loop improvement process.

AI AgentMetricsObservability
0 likes · 14 min read
How to Evaluate an AI Agent Beyond Just Accuracy
AI Architecture Hub
AI Architecture Hub
May 3, 2026 · Artificial Intelligence

What to Learn, Build, and Skip in AI Agents

The article analyzes the fast‑changing AI‑agent landscape, proposes five concrete criteria for filtering new technologies, outlines essential concepts such as context engineering, tool design, scheduler‑subagent patterns, evaluation frameworks, and recommends a stable 2026 tech stack while warning against hype‑driven tools.

AI agentsContext EngineeringLangGraph
0 likes · 27 min read
What to Learn, Build, and Skip in AI Agents
AI Engineer Programming
AI Engineer Programming
May 2, 2026 · Artificial Intelligence

From Demo to Production: How to Evaluate RAG Effectively

This guide outlines a comprehensive RAG evaluation framework covering failure modes, multi‑layer metrics, test‑set construction, open‑source tools, CI/CD quality gates, production monitoring, and special considerations for agentic RAG to ensure reliable, trustworthy retrieval‑augmented generation systems.

AILLMMetrics
0 likes · 18 min read
From Demo to Production: How to Evaluate RAG Effectively
MaGe Linux Operations
MaGe Linux Operations
Apr 28, 2026 · Artificial Intelligence

Why Your RAG Performance Is Poor: Common Issues and Optimization Strategies

This article systematically analyzes why Retrieval‑Augmented Generation pipelines often underperform—covering embedding model selection, chunking strategies, hybrid retrieval, reranking, context window waste, evaluation metrics, and a detailed troubleshooting checklist—while providing concrete code examples and best‑practice recommendations for engineers.

ChunkingEmbeddingHybrid Retrieval
0 likes · 19 min read
Why Your RAG Performance Is Poor: Common Issues and Optimization Strategies
PaperAgent
PaperAgent
Apr 27, 2026 · Artificial Intelligence

A Comprehensive Review of Modern LLM Agent Memory Frameworks

The article surveys recent LLM‑based agent memory research, presenting a unified framework that breaks memory systems into four components, detailing their design choices, experimental evaluation on LOCOMO and LONGMEMEVAL, key findings, and a new low‑token SOTA architecture.

Agent MemoryLLMMemory Management
0 likes · 8 min read
A Comprehensive Review of Modern LLM Agent Memory Frameworks
AI Engineer Programming
AI Engineer Programming
Apr 23, 2026 · Artificial Intelligence

From Zero to One: A Roadmap for Building Trustworthy AI Agent Evaluations

The article outlines why rigorous, automated evaluation is essential for AI agents, defines core concepts such as tasks, trials, graders, and frameworks, compares code‑based, model‑based and human graders, and presents an eight‑step roadmap—from early testing to open‑source maintenance—to create reliable, scalable agent assessments.

AI agentsBenchmarkingLLM grading
0 likes · 22 min read
From Zero to One: A Roadmap for Building Trustworthy AI Agent Evaluations
MaGe Linux Operations
MaGe Linux Operations
Apr 22, 2026 · Artificial Intelligence

5 Essential Design Principles for Building High‑Quality RAG Systems

This article outlines five critical design principles for constructing high‑quality Retrieval‑Augmented Generation (RAG) systems, covering document chunking strategies, embedding model selection, hybrid retrieval architectures, metadata filtering with multi‑level indexes, and reranking mechanisms, and provides concrete code snippets and evaluation metrics.

EmbeddingHybrid RetrievalRAG
0 likes · 17 min read
5 Essential Design Principles for Building High‑Quality RAG Systems
Su San Talks Tech
Su San Talks Tech
Apr 21, 2026 · Artificial Intelligence

How to Turn Bad Prompts into High‑Scoring AI Prompts: A Step‑by‑Step Guide

This article walks through a complete prompt‑engineering workflow—starting from a weak baseline, building an evaluation pipeline, and applying four concrete techniques (clarity, specificity, XML structuring, and examples) that lift a Claude score from 3.4 to over 9, with code, metrics, and real‑world examples.

AIClaudePrompt Engineering
0 likes · 19 min read
How to Turn Bad Prompts into High‑Scoring AI Prompts: A Step‑by‑Step Guide
FunTester
FunTester
Apr 20, 2026 · Artificial Intelligence

Why Self‑Evaluating Agents Fail and How to Build Reliable Multi‑Agent Systems

The article analyzes why letting the same AI Agent generate and self‑evaluate results in over‑confident but flawed outputs, especially for subjective tasks, and proposes a three‑stage multi‑agent architecture with independent evaluation, concrete standards, and prompt‑based calibration to improve reliability as models evolve.

AIPrompt EngineeringSystem Design
0 likes · 9 min read
Why Self‑Evaluating Agents Fail and How to Build Reliable Multi‑Agent Systems
Java One
Java One
Apr 20, 2026 · Artificial Intelligence

From Bad Prompts to 9.5 Scores: A Step‑by‑Step Prompt Engineering Guide

This article walks through an iterative prompt‑engineering workflow—starting with a weak baseline, applying four concrete techniques (clarity & directness, specificity, XML structuring, and examples), evaluating each change with a PromptEvaluator, and showing how scores jump from 3.4 to over 9.5 using real code snippets and concrete data.

AIClaudePrompt Engineering
0 likes · 20 min read
From Bad Prompts to 9.5 Scores: A Step‑by‑Step Prompt Engineering Guide
Machine Heart
Machine Heart
Apr 17, 2026 · Artificial Intelligence

Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation

The paper introduces OPeRA, a step‑wise online‑shopping dataset capturing observations, personas, rationales, and actions from real users, and uses it to benchmark LLMs on next‑action prediction, revealing that even top models like GPT‑4.1 achieve only about 20 % accuracy on fine‑grained actions, with persona information offering limited benefit while rationales prove crucial.

AILLMdataset
0 likes · 9 min read
Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation
Data Party THU
Data Party THU
Apr 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark

The MME-Emotion benchmark, introduced by researchers from CUHK and Alibaba Tongyi and accepted at ICLR 2026, provides a large‑scale, multimodal evaluation of emotional intelligence in large language models, revealing current models’ limited emotion recognition and reasoning abilities across diverse real‑world scenarios.

AIMME-EmotionMultimodal LLM
0 likes · 10 min read
Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark
Machine Heart
Machine Heart
Apr 10, 2026 · Artificial Intelligence

Why Generalist’s Success Shifts Embodied AI Competition From Models to Infrastructure

The launch of Generalist AI’s GEN‑1 model demonstrates a breakthrough in success rate, speed and resilience, but the article argues that the true competitive frontier has moved from model performance to the underlying data, simulation and evaluation infrastructure that enables continuous learning and scalable testing for embodied intelligence.

AI modelsData InfrastructureEmbodied AI
0 likes · 12 min read
Why Generalist’s Success Shifts Embodied AI Competition From Models to Infrastructure
DataFunSummit
DataFunSummit
Apr 10, 2026 · Artificial Intelligence

How Can AI Agents Truly Remember? A Deep Dive into Long‑Term Memory Engineering

This article examines the shortcomings of current AI assistants, outlines the ideal of long‑term memory engineering, reviews mainstream industry solutions such as hard‑context models and Retrieval‑Augmented Generation, proposes a four‑layer memory loop architecture, and looks ahead to online learning and collective intelligence for future agents.

AIAgentFoundation Model
0 likes · 15 min read
How Can AI Agents Truly Remember? A Deep Dive into Long‑Term Memory Engineering
Data STUDIO
Data STUDIO
Apr 10, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Writing Effective Agent Skill.md Files

This article explains what Agent Skills are, shows the folder layout and SKILL.md format, introduces the progressive‑disclosure design, provides concrete best‑practice tips, testing and evaluation methods, and demonstrates how to package scripts for reliable AI‑assistant automation.

AI AssistantAgent SkillsAutomation
0 likes · 29 min read
Step‑by‑Step Guide to Writing Effective Agent Skill.md Files
AI Step-by-Step
AI Step-by-Step
Apr 8, 2026 · Operations

How to Light Up the Black Box of LLM Agents with Full‑Stack Observability

The article explains why traditional logs are insufficient for LLM agents, outlines five observability dimensions—tracing, metrics, behavioral governance, state & memory, and evaluation—and provides concrete, open‑source‑based steps to instrument, monitor, and act on agent workloads in production.

Behavioral GovernanceLLM AgentsMetrics
0 likes · 11 min read
How to Light Up the Black Box of LLM Agents with Full‑Stack Observability
Machine Heart
Machine Heart
Apr 5, 2026 · Artificial Intelligence

How Imitation Learning Powers Dexterous Manipulation: A 2021‑2025 Technical Roadmap

This survey maps the 2021‑2025 progress of imitation learning for dexterous manipulation, detailing theoretical foundations, datasets, algorithms, hardware platforms, and evaluation protocols, and highlights challenges such as data quality, hardware dependence, and the need for standardized benchmarks to advance embodied AI.

AlgorithmsDexterous Manipulationdatasets
0 likes · 11 min read
How Imitation Learning Powers Dexterous Manipulation: A 2021‑2025 Technical Roadmap
AI Code to Success
AI Code to Success
Apr 3, 2026 · Artificial Intelligence

Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform

The author explores Clawvard, an AI‑agent assessment platform that tests agents across eight dimensions, shares personal test results showing an initial A‑ rating with a critical retrieval weakness, details the customized improvement rules applied, and demonstrates a subsequent A+ rating, while also discussing the platform’s limits and practical use cases.

AI AgentArtificial IntelligencePrompt Engineering
0 likes · 8 min read
Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform
AgentGuide
AgentGuide
Apr 3, 2026 · Artificial Intelligence

How to Evaluate RAG Systems: Key Metrics and the Ragas Framework

The article explains how to assess Retrieval-Augmented Generation (RAG) projects using the Ragas automated evaluation framework, detailing four key dimensions—recall quality, answer faithfulness, answer relevance, and context utilization—and describes the underlying metrics for both retrieval and generation stages.

LLMMetricsRAG
0 likes · 5 min read
How to Evaluate RAG Systems: Key Metrics and the Ragas Framework
Top Architecture Tech Stack
Top Architecture Tech Stack
Mar 28, 2026 · Artificial Intelligence

How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Hours

Anthropic’s engineering recap reveals a GAN‑inspired multi‑agent framework that separates generation, evaluation, and planning to overcome Claude’s context anxiety and self‑evaluation bias, enabling the model to sustain multi‑hour, high‑quality tasks across frontend design, full‑stack apps, and game‑engine projects.

AIClaudeevaluation
0 likes · 19 min read
How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Hours
SuanNi
SuanNi
Mar 26, 2026 · Artificial Intelligence

Unveiling Omni-WorldBench: How 18 AI Video Models Stack Up on 4D Interaction Tests

The Omni-WorldBench framework introduces a comprehensive 4D evaluation suite with 1,068 test cases and three interaction levels, applying novel metrics to assess video quality, controllability, and physical interaction fidelity across 18 state‑of‑the‑art AI video models, revealing strengths, weaknesses, and future research directions.

4D interactionOmni-WorldBenchbenchmark
0 likes · 14 min read
Unveiling Omni-WorldBench: How 18 AI Video Models Stack Up on 4D Interaction Tests
SuanNi
SuanNi
Mar 25, 2026 · Artificial Intelligence

Can Harness Engineering Enable AI Agents to Master Complex Long‑Running Tasks?

This article analyses the concept of Harness engineering introduced by OpenAI and Anthropic, explains how multi‑agent architectures decompose and manage long‑running AI tasks, examines practical experiments such as a retro game maker and a web‑audio workstation, and distills lessons for future AI system design.

AI EngineeringAnthropicClaude
0 likes · 16 min read
Can Harness Engineering Enable AI Agents to Master Complex Long‑Running Tasks?
o-ai.tech
o-ai.tech
Mar 25, 2026 · Artificial Intelligence

From Code Writing to Continuous Development: Anthropic’s Long‑Running Agent Harness Design

Anthropic’s article dissects a three‑role harness—planner, generator, evaluator—for building long‑running AI applications, explaining how structured specs, sprint contracts, iterative evaluation, and context management transform a single model into a reliable software‑engineering pipeline, with concrete front‑end and full‑stack case studies.

AI agentsEvaluatorHarness
0 likes · 23 min read
From Code Writing to Continuous Development: Anthropic’s Long‑Running Agent Harness Design
Frontend AI Walk
Frontend AI Walk
Mar 25, 2026 · Artificial Intelligence

Slow Learning Agents: 7 Cognitive Shifts from Using ChatGPT to Truly Understanding Agents

The article outlines seven essential mindset transitions for building robust LLM agents—recognizing agents as autonomous decision loops, prioritizing harness over model size, layering context, designing tools for agent goals, structuring multi‑layer memory, coordinating multiple agents with isolation and protocols, and aligning evaluation with the real environment.

Context ManagementHarnessLLM Agents
0 likes · 16 min read
Slow Learning Agents: 7 Cognitive Shifts from Using ChatGPT to Truly Understanding Agents
AgentGuide
AgentGuide
Mar 22, 2026 · Artificial Intelligence

How to Design Prompt Engineering in Your Project: A Complete Workflow

The article outlines a systematic Prompt Engineering process that starts with defining task goals and metrics, structures prompts into modular components, uses offline evaluation and bad‑case analysis, incorporates RAG or tools when needed, and continuously monitors accuracy, hallucination, latency and cost.

AI workflowFew-shotLarge Language Model
0 likes · 7 min read
How to Design Prompt Engineering in Your Project: A Complete Workflow
ByteDance SE Lab
ByteDance SE Lab
Mar 20, 2026 · Artificial Intelligence

How to Build a Multi‑Repo Semantic Code Q&A System with OpenViking

This guide explains the challenges of multi‑repository code retrieval, presents an experimental evaluation of OpenViking's semantic search, and provides step‑by‑step instructions for installing, configuring, importing repositories, and integrating the system into AI agents and chatbots.

AI AssistantMulti-repoOpenViking
0 likes · 16 min read
How to Build a Multi‑Repo Semantic Code Q&A System with OpenViking
AI Frontier Lectures
AI Frontier Lectures
Mar 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark

This article presents MME-Emotion, a large‑scale multimodal benchmark that evaluates both emotion recognition and reasoning abilities of multimodal large language models across 27 real‑world scenarios, revealing current models’ significant gaps in emotional intelligence and outlining future research directions.

AIMultimodal LLMbenchmark
0 likes · 9 min read
Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark
Alibaba Cloud Developer
Alibaba Cloud Developer
Mar 16, 2026 · Artificial Intelligence

HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.

AI BenchmarkEmotion AIHumanization
0 likes · 25 min read
HeartBench: Building the First Chinese AI Humanization Benchmark
PaperAgent
PaperAgent
Mar 15, 2026 · Artificial Intelligence

Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench

WildToolBench reveals that existing LLM tool‑calling benchmarks overlook real‑world user behavior, and a comprehensive evaluation of 58 models shows even the strongest agents achieve less than 15% session accuracy, highlighting a huge gap between reported performance and practical usability.

LLMagentic AIbenchmark
0 likes · 10 min read
Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 11, 2026 · Artificial Intelligence

Upgrade All Your Claude Skills Now: Harness the New Skill‑Creator Engine

Anthropic’s updated skill‑creator turns Skills into a core, engineering‑focused capability for Claude, offering a systematic workflow—baseline A/B testing, quantitative assertions, visual evaluation, and iterative description optimization—so developers can rebuild, refine, and reliably trigger their Skills for higher productivity.

AI agentsAnthropicAutomation
0 likes · 13 min read
Upgrade All Your Claude Skills Now: Harness the New Skill‑Creator Engine
PaperAgent
PaperAgent
Mar 9, 2026 · Artificial Intelligence

How SkillNet Turns AI Agent Experience into Reusable Skills

SkillNet proposes a three‑layer infrastructure that extracts, evaluates, and connects over 200,000 AI‑agent skills into a structured graph, dramatically improving performance across benchmark environments while turning transient agent experience into durable, reusable assets.

AI agentsLLMMachine Learning
0 likes · 6 min read
How SkillNet Turns AI Agent Experience into Reusable Skills
AI Tech Publishing
AI Tech Publishing
Mar 7, 2026 · Artificial Intelligence

A Practical Guide to Evaluating Agent Skills

This article explains why many Agent Skills are released without testing, defines measurable success criteria, and presents a lightweight evaluation framework—including prompt set creation, deterministic checks, optional LLM‑based qualitative checks, and best‑practice recommendations—demonstrated by improving a Gemini Interactions API skill from 66.7% to 100% pass rate.

AI agentsAgent SkillsGemini
0 likes · 13 min read
A Practical Guide to Evaluating Agent Skills
Amap Tech
Amap Tech
Mar 5, 2026 · Artificial Intelligence

How MobilityBench Measures the Real Power of AI Route‑Planning Agents

MobilityBench is an open‑source benchmark built from over 100 000 real user queries that evaluates AI route‑planning agents with a deterministic sandbox, multi‑dimensional metrics, and support for ReAct and Plan‑and‑Execute frameworks, revealing performance gaps between open‑source and closed‑source models.

AI agentsMobilityBenchPlan-and-Execute
0 likes · 6 min read
How MobilityBench Measures the Real Power of AI Route‑Planning Agents
Data Party THU
Data Party THU
Feb 18, 2026 · Artificial Intelligence

Why Top AI Agents Fail in Real Work: Inside the Trainee‑Bench Benchmark

The article analyzes the gap between high benchmark scores and poor real‑world performance of AI agents, introduces the Trainee‑Bench workplace simulator, details its three evaluation dimensions, construction steps, and reveals that even state‑of‑the‑art models achieve low success rates, highlighting the need for autonomous learning and zero‑hand‑over.

AI agentsTrainee-Benchcontinuous learning
0 likes · 11 min read
Why Top AI Agents Fail in Real Work: Inside the Trainee‑Bench Benchmark
AI Engineering
AI Engineering
Jan 29, 2026 · Artificial Intelligence

How a Tiny AGENTS.md Change Boosted AI Coding Accuracy from 53% to 100%

A Vercel team experiment shows that replacing the Skills approach with a small 8 KB AGENTS.md file raised AI coding agents' pass rate from 53% to a perfect 100%, revealing the fragility of explicit tool calls and the strength of passive, always‑available context.

AGENTS.mdAI coding agentsNext.js
0 likes · 11 min read
How a Tiny AGENTS.md Change Boosted AI Coding Accuracy from 53% to 100%
JD Tech
JD Tech
Jan 27, 2026 · Artificial Intelligence

How Uni-Layout Unifies Cross‑Task Layout Generation with Human‑Like Evaluation

Uni-Layout introduces a unified framework that integrates a universal layout generator, a human‑feedback‑simulating evaluator, and a dynamic margin preference optimization technique to align generation and evaluation across diverse e‑commerce design tasks, backed by a new 100k human‑annotated dataset.

Human FeedbackMultimodal LLMdynamic margin optimization
0 likes · 11 min read
How Uni-Layout Unifies Cross‑Task Layout Generation with Human‑Like Evaluation
Architect
Architect
Jan 19, 2026 · Artificial Intelligence

How Cursor Scales Autonomous Coding Agents to Hundreds: Architecture Lessons for AI Systems

This article analyzes Cursor's engineering choices for running autonomous coding agents at scale, detailing the long‑running, drift, and evaluation concepts, the Planner‑Worker‑Judge pipeline, concurrency challenges, experimental results, and actionable rules for building robust multi‑agent systems.

System architectureevaluationsoftware engineering
0 likes · 17 min read
How Cursor Scales Autonomous Coding Agents to Hundreds: Architecture Lessons for AI Systems
Old Zhao – Management Systems Only
Old Zhao – Management Systems Only
Jan 15, 2026 · Operations

Why Most Supplier Evaluation Systems Fail and the 4 Metrics That Actually Matter

The article explains why traditional supplier evaluation forms often become meaningless, introduces four decisive metrics—delivery stability, quality consistency, cost transparency, and collaboration willingness—provides concrete scoring formulas for each, and shows how an SRM system can automate and visualize these indicators to help companies decide whether to replace a supplier.

OperationsSRMevaluation
0 likes · 10 min read
Why Most Supplier Evaluation Systems Fail and the 4 Metrics That Actually Matter
JD Cloud Developers
JD Cloud Developers
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Unifying Layout Generation with Human Feedback and Dynamic Alignment

Uni-Layout introduces a unified framework that combines a multimodal large language model‑based generator, a human‑like evaluator trained on the large Layout‑HF100k dataset, and a Dynamic Margin Preference Optimization (DMPO) method to align generation and evaluation, achieving state‑of‑the‑art results across diverse layout tasks.

DMPOHuman FeedbackMultimodal LLM
0 likes · 11 min read
Uni-Layout: Unifying Layout Generation with Human Feedback and Dynamic Alignment
JD Tech Talk
JD Tech Talk
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Harnessing Human Feedback for Unified Layout Generation and Evaluation

Uni-Layout introduces a unified framework that generates layouts across diverse tasks, simulates human evaluation with a novel feedback dataset, and aligns generation and assessment through dynamic margin preference optimization, achieving state‑of‑the‑art performance on multiple benchmarks.

AI designHuman FeedbackMultimodal LLM
0 likes · 11 min read
Uni-Layout: Harnessing Human Feedback for Unified Layout Generation and Evaluation
AI Insight Log
AI Insight Log
Jan 10, 2026 · Artificial Intelligence

Anthropic’s Full Practical Guide to Evaluating AI Agents – Key Insights

The article explains why evaluating AI agents is far more complex than testing deterministic code, outlines Anthropic’s anatomy of a complete evaluation system—including tasks, transcripts, and three grader types—and offers concrete best‑practice recommendations for building reliable agent pipelines.

AI agentsAnthropicLLM testing
0 likes · 9 min read
Anthropic’s Full Practical Guide to Evaluating AI Agents – Key Insights
JD Retail Technology
JD Retail Technology
Jan 8, 2026 · Artificial Intelligence

Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation

Uni-Layout introduces a unified layout generation framework that consolidates diverse design tasks, leverages multimodal large language models for flexible generation, and aligns outputs with human perception through a novel human‑feedback dataset (Layout‑HF100k) and a dynamic margin preference optimization (DMPO) evaluator.

ACM MultimediaHuman FeedbackMultimodal LLM
0 likes · 11 min read
Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation
DataFunSummit
DataFunSummit
Jan 3, 2026 · Artificial Intelligence

What Is Memory Engineering? Unlocking AI’s Long‑Term Recall and Future Potential

A comprehensive dialogue among industry experts explores the concept of memory engineering for AI agents, covering its definition, system‑level challenges from edge to cloud, hybrid technical routes, evaluation metrics, privacy safeguards, audience questions, future directions, and practical advice for developers.

AI agentsHybrid Architectureevaluation
0 likes · 24 min read
What Is Memory Engineering? Unlocking AI’s Long‑Term Recall and Future Potential
AI Product Manager Community
AI Product Manager Community
Dec 27, 2025 · Product Management

Embracing Uncertainty: Redesigning AI Product Requirements

The article explores how product managers must shift from deterministic PRDs to uncertainty‑driven specifications for AI chatbots, replacing exhaustive logic with value‑based constraints, fuzzy‑evaluation metrics, dynamic benchmarks, and sample‑based requirements to better align with probabilistic large‑model behavior.

AIPRDPrompt Engineering
0 likes · 9 min read
Embracing Uncertainty: Redesigning AI Product Requirements
Alibaba Cloud Native
Alibaba Cloud Native
Dec 19, 2025 · Artificial Intelligence

What Enterprises Are Learning from the State of Agent Engineering Report

The recent LangChain "State of Agent Engineering" report, combined with data from the AI‑Native Application Architecture whitepaper, reveals rapid production adoption of AI agents, persistent quality challenges, widespread observability, multi‑model strategies, and evolving evaluation practices across organizations of all sizes.

AI agentsLLMObservability
0 likes · 10 min read
What Enterprises Are Learning from the State of Agent Engineering Report
Model Perspective
Model Perspective
Dec 19, 2025 · Fundamentals

How a Multi‑Dimensional Model Ranks China’s Historical TV Dramas

This study builds a comprehensive evaluation model for Chinese historical drama series, defining four primary and nine secondary indicators, standardizing data, applying weighted calculations and a time‑compensation factor to score 127 candidates and produce a TOP‑100 ranking that highlights the influence of audience reputation, market impact, professional recognition, and historical value.

evaluationhistorical dramamedia
0 likes · 18 min read
How a Multi‑Dimensional Model Ranks China’s Historical TV Dramas
Youzan Coder
Youzan Coder
Nov 21, 2025 · Artificial Intelligence

How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide

This guide walks you through creating AI‑powered test agents, defining success metrics, building evaluation datasets, crafting and refining system prompts with techniques like chain‑of‑thought, XML, few‑shot and concise inputs, and scaling the workflow by splitting agents and managing prompt versions.

AI agentsLLMPrompt Engineering
0 likes · 21 min read
How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide
Wu Shixiong's Large Model Academy
Wu Shixiong's Large Model Academy
Nov 4, 2025 · Artificial Intelligence

Why Financial RAG Fails and How to Solve Its Core Challenges

This article explains why Retrieval‑Augmented Generation (RAG) projects in the financial sector often underperform, highlighting data‑structure complexities, document‑parsing hurdles, chunking strategies, compliance constraints, evaluation metrics, and engineering requirements, and offers practical solutions and code examples.

ChunkingComplianceEngineering
0 likes · 10 min read
Why Financial RAG Fails and How to Solve Its Core Challenges
Open Source Tech Hub
Open Source Tech Hub
Oct 23, 2025 · Backend Development

Boost PHP Performance with CEL-PHP: A Fast, Safe Expression Engine

This guide introduces CEL-PHP, a high‑performance, non‑Turing‑complete expression engine for PHP 8+, showing how to install it, evaluate simple and contextual expressions, handle parsing and optimization, integrate caching, register custom functions, and avoid common pitfalls for robust backend rule evaluation.

CELCachingExpression Language
0 likes · 8 min read
Boost PHP Performance with CEL-PHP: A Fast, Safe Expression Engine
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

KV CacheLLMMQA
0 likes · 9 min read
nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation
Alibaba Cloud Developer
Alibaba Cloud Developer
Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

APILLMPrompt Engineering
0 likes · 28 min read
Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends
HyperAI Super Neural
HyperAI Super Neural
Oct 14, 2025 · Artificial Intelligence

NeurIPS 2025: OCRBench v2 Shows Gemini Leads Chinese OCR Ranking Yet Scores Only Pass

OCRBench v2, introduced at NeurIPS 2025, evaluates 58 multimodal models on 23 OCR‑related tasks in Chinese and English, revealing that even top models like Gemini‑2.5‑Pro barely exceed the passing threshold and that most models struggle with fine‑grained text localization and multilingual performance.

GeminiLarge Language ModelsNeurIPS 2025
0 likes · 8 min read
NeurIPS 2025: OCRBench v2 Shows Gemini Leads Chinese OCR Ranking Yet Scores Only Pass
Old Zhao – Management Systems Only
Old Zhao – Management Systems Only
Oct 13, 2025 · Operations

How to Build a Fail‑Proof Procurement Process with Data‑Driven SRM

This article explains why many procurement processes fail despite formal procedures and provides a step‑by‑step, data‑driven approach—clarifying requirements, using SRM templates, screening suppliers with performance data, scoring comprehensively, ensuring traceability, and conducting post‑award reviews—to select the right suppliers and turn procurement into a strategic advantage.

SRMdata drivenevaluation
0 likes · 8 min read
How to Build a Fail‑Proof Procurement Process with Data‑Driven SRM
Data Thinking Notes
Data Thinking Notes
Sep 10, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Uncovering the Statistical Roots

OpenAI’s latest research reveals that language model hallucinations stem from training and evaluation incentives that favor confident guesses over acknowledging uncertainty, and proposes revised scoring methods that reward modesty, highlighting statistical mechanisms behind false answers and offering pathways to reduce hallucinations.

AI safetyLanguage Modelsevaluation
0 likes · 10 min read
Why Do Language Models Hallucinate? Uncovering the Statistical Roots
Architect
Architect
Sep 9, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Insights from OpenAI’s New Study

This article explains why large language models often produce confident but incorrect answers, detailing statistical inevitability, data scarcity, and model capacity limits, and proposes concrete solutions such as confidence thresholds and allowing abstention to reduce hallucinations.

AI safetyLanguage ModelsPrompt Engineering
0 likes · 8 min read
Why Do Language Models Hallucinate? Insights from OpenAI’s New Study
DataFunSummit
DataFunSummit
Aug 23, 2025 · Artificial Intelligence

Mastering Role‑Playing AI Agents: Challenges, Techniques, and Future Directions

This article surveys the latest research on role‑playing AI agents, covering their definition, core components, application scenarios, three main challenges—role fidelity, long‑term memory, and evaluation—and presents four technical approaches for each challenge along with future research directions and references.

AI agentsLarge Language ModelsMemory
0 likes · 22 min read
Mastering Role‑Playing AI Agents: Challenges, Techniques, and Future Directions
Data Party THU
Data Party THU
Aug 23, 2025 · Artificial Intelligence

How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning

The article presents MiroMind‑M1, an open‑source math‑reasoning language model that combines a 719K high‑quality SFT dataset, a novel CAMPO reinforcement‑learning algorithm, and extensive evaluations on AIME24, AIME25, and MATH‑500, demonstrating state‑of‑the‑art performance while reducing token usage.

CAMPOevaluationmath reasoning
0 likes · 11 min read
How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning
JD Tech Talk
JD Tech Talk
Jul 27, 2025 · Artificial Intelligence

Evaluating JoyAgent‑JDGenie: A Lightweight Multi‑Agent AI Framework in Action

This article presents a thorough evaluation of the open‑source JoyAgent‑JDGenie multi‑agent AI framework, covering its background, test cases for restaurant recommendation and travel planning, deployment steps, performance metrics, and concluding recommendations, highlighting its efficiency, ease of deployment, and result quality.

AIDeploymentagents
0 likes · 8 min read
Evaluating JoyAgent‑JDGenie: A Lightweight Multi‑Agent AI Framework in Action
Zhihu Tech Column
Zhihu Tech Column
Jul 25, 2025 · Artificial Intelligence

Boost Creative Writing with Zhi-Create-Qwen3-32B: Training, Eval & Deployment

This article introduces the open‑source Zhi‑Create‑Qwen3‑32B model, detailing its fine‑tuned training on creative‑writing data, the multi‑domain dataset strategy, curriculum‑learning based SFT, evaluation on WritingBench, and practical deployment options across various hardware and inference frameworks.

DeploymentLarge Language Modelcreative writing
0 likes · 11 min read
Boost Creative Writing with Zhi-Create-Qwen3-32B: Training, Eval & Deployment
ELab Team
ELab Team
Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AISpeculative Decodingcode editing
0 likes · 14 min read
How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding
DataFunTalk
DataFunTalk
Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

LLMModel CompressionMultimodal
0 likes · 15 min read
How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations
DataFunSummit
DataFunSummit
Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges, improve signal detection, root‑cause analysis, and outline future AI‑agent breakthroughs for content platforms.

AI AlgorithmsMultimodal LearningRAG
0 likes · 11 min read
How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights
Tencent Technical Engineering
Tencent Technical Engineering
Jun 16, 2025 · Artificial Intelligence

Mastering RAG and AI Agents: Practical Tips, Code Samples, and Evaluation Strategies

This comprehensive guide walks you through the fundamentals of Retrieval‑Augmented Generation (RAG) and AI agents, explains their inner workings, shares optimization tricks, provides ready‑to‑run code snippets, and demonstrates how to evaluate performance with metrics such as recall, faithfulness, and answer relevance.

AI agentsLLMPrompt Engineering
0 likes · 36 min read
Mastering RAG and AI Agents: Practical Tips, Code Samples, and Evaluation Strategies
Model Perspective
Model Perspective
May 25, 2025 · Fundamentals

Why We Pretend to Win: The Hidden Math Behind Evaluation Bias

The article explores how people manipulate evaluation systems by redefining variables, adjusting weights, and shifting perspectives, turning losses into perceived wins, and reveals the psychological and statistical biases that create this illusion, urging more honest, multi‑dimensional, transparent modeling for genuine assessment.

BiasPsychologydecision-making
0 likes · 9 min read
Why We Pretend to Win: The Hidden Math Behind Evaluation Bias
DataFunSummit
DataFunSummit
May 9, 2025 · Artificial Intelligence

Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product

This article presents a comprehensive overview of Zhihu Direct Answer, describing its AI‑driven search architecture, RAG framework, query understanding, retrieval, chunking, reranking, generation, evaluation mechanisms, engineering optimizations, and the professional edition, while sharing concrete performance‑boosting practices and future development plans.

AIProduct DevelopmentRAG
0 likes · 14 min read
Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product
Architect
Architect
Apr 17, 2025 · Artificial Intelligence

The Second Half of AI: From Model Innovation to Real‑World Utility

The article argues that artificial intelligence has entered a new phase where reinforcement learning finally generalizes, evaluation becomes more important than pure model performance, and researchers must redesign benchmarks and utility‑focused tasks to drive truly transformative progress.

evaluationresearch strategy
0 likes · 16 min read
The Second Half of AI: From Model Innovation to Real‑World Utility
Nightwalker Tech
Nightwalker Tech
Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI AgentAutoGLMLarge Language Model
0 likes · 4 min read
Evaluation of AutoGLM: Features, Architecture, and Practical Test Results
Meituan Technology Team
Meituan Technology Team
Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGCMachine LearningMultimodal
0 likes · 9 min read
Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation
Alibaba Cloud Developer
Alibaba Cloud Developer
Mar 24, 2025 · Artificial Intelligence

Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek

This article analyses the shortcomings of large‑model internet search—such as unverifiable sources, fabricated content, and poor instruction compliance—by comparing Qwen‑max, Doubao‑1.5‑pro‑256k, and DeepSeek‑v3, and proposes prompt engineering, post‑processing, and custom tool improvements to boost reliability.

AILLMevaluation
0 likes · 22 min read
Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek
DaTaobao Tech
DaTaobao Tech
Mar 19, 2025 · Artificial Intelligence

Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques

Retrieval‑augmented generation (RAG) enhances large language models by integrating a preprocessing pipeline—cleaning, chunking, embedding, and vector storage—with a query‑driven retrieval and prompt‑injection workflow, leveraging vector databases, multi‑stage recall, advanced prompting, and comprehensive evaluation metrics to mitigate knowledge cut‑off, hallucinations, and security issues.

LLMRAGRetrieval-Augmented Generation
0 likes · 27 min read
Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques
Efficient Ops
Efficient Ops
Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOpsFinancial IndustryOperations
0 likes · 9 min read
How BizDevOps Is Accelerating Digital Transformation in Finance
AI Algorithm Path
AI Algorithm Path
Feb 20, 2025 · Artificial Intelligence

What Is Perplexity in Large Language Models?

The article explains perplexity as a metric for evaluating large language models, walks through a step‑by‑step probability calculation for a sample sentence, shows how to normalize by sentence length using the geometric mean, and demonstrates that lower perplexity indicates a more accurate and less uncertain model.

AIPerplexityevaluation
0 likes · 6 min read
What Is Perplexity in Large Language Models?
JD Tech
JD Tech
Feb 14, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation

JD’s Merchant Intelligent Assistant leverages a large‑language‑model‑based multi‑agent architecture to provide 24/7 e‑commerce support, detailing its evolution, planning techniques, online inference, evaluation methods, sample generation, and practical insights for scalable AI‑driven operations.

AutomationE-commerce AILLM
0 likes · 22 min read
JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation
JD Retail Technology
JD Retail Technology
Feb 10, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration

The JD Merchant Intelligent Assistant employs a large‑language‑model‑driven multi‑agent architecture with dynamic ReAct planning, enabling merchants to query and execute store operations in under a second with over 90 % decision accuracy, while reducing inference cost, hallucinations, and engineering effort across diverse e‑commerce tasks.

AILLMReAct
0 likes · 25 min read
JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration