
Model Testing and Evaluation Metrics for Strategy Projects in the AI Era

This article explains the challenges of testing machine‑learning models for strategy projects, outlines the overall testing workflow, describes key offline and online evaluation methods such as AUC and A/B testing, and summarizes best‑practice procedures for assessing model performance, user experience, and effect differences.

Baidu Waimai Technology Team

In the AI era, as machine learning gains popularity, a new testing approach for machine‑learning systems is emerging. Unlike traditional projects, where testing ends at launch, strategy projects continue after deployment: small‑traffic data must be observed continuously to decide on full‑scale rollout, with the focus on controlling false‑positive and false‑negative rates and confirming that the observed data matches expectations.

1. Difficulties of Model Testing

1) The model itself is hard to test; testers need deep familiarity with its formulas and how it is used.

2) Data pipelines are long, making bugs hard to detect yet having a large impact on results.

3) Model changes affect related strategies and need thorough verification.

4) Metric evaluation is time‑consuming, subjective, and difficult to quantify.

2. Overall Process

3. Four Main Evaluation Indicators

Offline module evaluation indicators:

1) Definition

AUC (Area Under Curve): the area under the ROC curve, ranging from 0 to 1, where 0.5 corresponds to random guessing. Higher values indicate a better classifier.

AUC represents the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample; larger AUC means the model is more likely to place positives before negatives.

It is a more comprehensive and robust measure of ranking quality than accuracy or recall alone.

Reference: http://blog.csdn.net/pzy20062141/article/details/48711355

2) Formula

With all samples ranked by predicted score in ascending order (rank 1 = lowest score), AUC can be computed via the Wilcoxon–Mann–Whitney rank statistic:

AUC = (Σ rank(i) over positive samples i − M(M+1)/2) / (M × N)

where M is the number of positive samples and N the number of negative samples.

3) Graph (ROC curve illustration omitted)

4) AUC Acquisition Process
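As a minimal sketch of the pairwise definition above (the probability that a random positive sample is ranked above a random negative one), the following function computes AUC directly from labels and scores. The function name and the sample data are illustrative, not from the original article; tied scores count as half a win, matching the standard convention.

```python
def auc_pairwise(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    is scored higher than a randomly chosen negative sample."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each positive/negative pair contributes 1 if ranked correctly,
    # 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
print(auc_pairwise(labels, scores))  # → 0.6875
```

This quadratic pairwise form is convenient for spot checks on small samples; in production, an equivalent rank-based computation (or a library routine such as scikit‑learn's `roc_auc_score`) scales better.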

4. Online Conversion Rate – AB Test

1) Create a small‑traffic dataset and split traffic by passuid.

2) After the model goes live, monitor data indicators on the Magic Mirror platform for visualized results.

Detailed charts and metric comparisons facilitate acceptance.
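Splitting traffic by passuid, as step 1 describes, is typically done by hashing the user ID so each user lands in a stable bucket. A minimal sketch, assuming an MD5-based assignment (the function name, experiment salt, and percentage parameter are illustrative, not from the original article):

```python
import hashlib

def bucket(passuid: str, experiment: str, traffic_pct: int) -> str:
    """Deterministically assign a user to treatment or control.
    Hashing passuid together with an experiment-specific salt keeps a
    user's bucket stable over time and independent across experiments."""
    digest = hashlib.md5(f"{experiment}:{passuid}".encode()).hexdigest()
    slot = int(digest, 16) % 100          # uniform slot in [0, 100)
    return "treatment" if slot < traffic_pct else "control"

print(bucket("user_12345", "rank_model_v2", 5))
```

Because the assignment is deterministic, the same user always sees the same variant, which keeps the small‑traffic data clean for the metric comparisons mentioned above.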

5. PM Evaluation of User Experience

1) Evaluate from multiple feature dimensions:

- Delivery time: during the afternoon‑tea period, many merchants with long delivery times still ranked in the top 30.

Reason: few merchants lead to limited training samples, making the model less sensitive to delivery time.

Action: add a delivery‑time de‑weighting strategy for the afternoon tea period.

- Sales volume: if a store with very low sales (and not new) appears in the top 20, it may be a bad case, prompting strategy review.

- Business status: results must be ordered with "open" stores first, then "available for reservation", then "closed".
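Checks like the business‑status rule above are easy to automate against a ranked result list. A minimal sketch, assuming status strings as shown (the function name and status labels are illustrative placeholders):

```python
def status_ordering_ok(ranked_statuses):
    """Verify that every 'open' store precedes all 'reservation' stores,
    which in turn precede all 'closed' stores in the ranked results."""
    rank = {"open": 0, "reservation": 1, "closed": 2}
    codes = [rank[s] for s in ranked_statuses]
    # The sequence of status codes must be non-decreasing.
    return all(a <= b for a, b in zip(codes, codes[1:]))

print(status_ordering_ok(["open", "open", "reservation", "closed"]))  # → True
print(status_ordering_ok(["open", "closed", "reservation"]))          # → False
```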

6. Effect Difference (Diff)

For a specific strategy, compare the top‑ranked results online versus offline, excluding resource slots, to examine ranking behavior in the remaining positions.
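One way to quantify this comparison is a top‑N set diff that first drops the resource slots from the online list. A minimal sketch, assuming resource slots are identified by fixed positions (the function name, slot representation, and sample data are illustrative assumptions, not from the original article):

```python
def top_n_diff(online, offline, resource_slots, n=10):
    """Fraction of the top-N that differs between online and offline
    rankings, after removing resource-slot positions from the online list."""
    organic = [s for i, s in enumerate(online) if i not in resource_slots]
    a, b = set(organic[:n]), set(offline[:n])
    # Symmetric difference, normalized so 0 = identical, 1 = fully disjoint.
    return len(a ^ b) / (2 * n)

online = ["ad1", "s1", "s2", "ad2", "s3", "s4", "s5"]   # ad1/ad2 occupy resource slots
offline = ["s1", "s2", "s3", "s5", "s6"]
print(top_n_diff(online, offline, resource_slots={0, 3}, n=5))  # → 0.2
```

A diff rate near zero is expected when only resource slots changed; a large rate flags a ranking change that needs case‑by‑case review.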

7. Summary

Strategy project effect acceptance testing can be divided into four aspects:

Offline model evaluation – AUC, RMSE, etc., focusing on model‑level data metrics.

Online order conversion rate – business‑level metrics via AB testing.

User experience testing – crowdsourced perception metrics based on user behavior.

Effect diff – detailed testing of changes such as model updates, parameter tuning, feature addition, or ranking adjustments.

Corresponding tests:

AUC spot checks – compute AUC directly and monitor its range and boundary values.

AB test – verify whether business metrics move positively.

Crowd testing – collect user experience feedback.

Detailed diff analysis for each change.

Author Introduction

Siberia, Quality Lead of the Business Intelligence Center, is responsible for quality assurance of traffic and user operations and leads the development of a complete traffic‑guidance, process‑management, and effect‑evaluation system for BI strategy projects.

The BI Center quality team specializes in comprehensive testing of strategy projects, with practical experience in algorithms, models, and MapReduce‑type projects.

AI First: after a strategy project goes live, the real work begins; let’s improve, standardize, and develop it together.
