How to Evaluate an AI Agent Beyond Just Accuracy

Evaluating AI agents requires more than accuracy. You must also measure task completion, execution traces, tool usage, latency, cost, error rates, and both explicit and implicit user feedback, then combine observability, offline smoke-test and regression suites, and continuous online monitoring into a closed-loop improvement process.

AI Agent · Metrics · Observability

0 likes · 14 min read

DataFunTalk

Sep 30, 2023 · Fundamentals

Different Types of Experiments in Search Scenarios

In this presentation, Tencent PCG data product manager Wang Dongxing introduces A/B testing fundamentals and shares practical experience with several search experiment methods (regular A/B, vocabulary, diffAB, and interleaving), highlighting common pitfalls and offering actionable insights for practitioners.

A/B testing · Data Product · Online Testing

0 likes · 2 min read

DataFunTalk

Oct 22, 2022 · Big Data

Design and Practice of a Risk Control Experiment Platform at Du Xiaoman

This article explains the background, architecture, challenges, and step‑by‑step design of a big‑data‑driven risk control experiment platform used for online and offline strategy testing in internet finance.

Big Data · Experiment Platform · FinTech

0 likes · 12 min read

Baidu Intelligent Testing

Apr 16, 2018 · Operations

Online Load‑Testing Practices for Baidu Nuomi Marketing Activities

This article presents a comprehensive case study of Baidu Nuomi's online load‑testing methodology for high‑traffic marketing events, covering capacity estimation, test planning, execution, anti‑attack measures, platform architecture, and lessons learned to ensure system reliability and performance under peak loads.

Online Testing · capacity planning · load-testing

0 likes · 16 min read
