AI Engineering Efficiency Platform: Architecture, Practices, and Case Studies
This presentation outlines an AI engineering efficiency platform built around three components: algorithm metric evaluation, micro‑service performance testing, and dataset management. It details the business pain points, platform‑wide improvements, technical designs, real‑world demos, and future directions for delivering accurate, fast, and stable AI services.
At the TiD2020 Quality Competitiveness Conference, Zhao Ming, head of AI platform quality and efficiency at TAL, delivered a talk titled “AI Engineering Efficiency Platform Development”. He introduced the three core principles—accuracy (准), speed (快), and stability (稳)—and described the business context of TAL’s AI middle‑platform.
Business Overview
TAL’s AI middle‑platform integrates AI technologies into education, focusing on three model types: speech (ASR, evaluation, emotion), image (OCR, photo search, content moderation), and data mining/NLP (keyword search, classroom analytics, oral proficiency assessment). These models are deployed as micro‑services on a PaaS platform.
Key Pain Points
Algorithm models are tested frequently but without automation, making evaluation slow and labor‑intensive.
No visibility into industry‑leading benchmark results, making it hard to set meaningful KPIs.
Re‑evaluating performance after every model or service optimization is costly.
Data is fragmented across roles and versions, making it difficult to manage.
Improvement Strategy
To address these issues, a platform‑centric solution was built, comprising an algorithm metric and evaluation platform, a micro‑service performance testing platform, and a dataset management platform.
Tool Platform System
The toolchain empowers product, algorithm, development, testing, and operations teams by providing end‑to‑end quality monitoring, DevOps‑based full‑link tracing, and standardized interfaces for AI capabilities.
AI Algorithm Metric & Evaluation Platform
Scenarios & Users: Unlabeled data bad‑case screening, new data accuracy assessment, competitive benchmark evaluation, and detailed metric analysis for annotated data. Primary users are algorithm engineers, testers, and product managers.
Technical Architecture: Consists of a foundational layer (permission, data source, storage, analysis instance, report management), a logical abstraction layer (workflow orchestration, data preprocessing, metric calculation, model registration), and a UI layer for visual composition and reporting.
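The logical abstraction layer's workflow orchestration can be pictured as a chain of composable steps that the UI layer assembles visually. The sketch below is a hypothetical illustration, assuming a simple step‑chaining design; the class and method names are not the platform's real API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical sketch of the logical abstraction layer: a workflow is an
# ordered chain of named steps (e.g. preprocess -> infer -> score), each a
# plain callable, so the UI layer can compose and reorder them visually.

@dataclass
class Workflow:
    steps: list = field(default_factory=list)

    def add_step(self, name: str, fn: Callable[[Any], Any]) -> "Workflow":
        self.steps.append((name, fn))
        return self  # allow fluent chaining

    def run(self, payload: Any) -> Any:
        for _name, fn in self.steps:
            payload = fn(payload)  # each step feeds the next
        return payload

# Example: normalize text, then tokenize it.
wf = (Workflow()
      .add_step("preprocess", lambda s: s.strip().lower())
      .add_step("tokenize", lambda s: s.split()))
print(wf.run("  Hello World  "))  # ['hello', 'world']
```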
Actual Effects: Enables automated bad‑case detection for OCR and ASR, provides detailed precision, recall, F1, WER/CER, and resource usage metrics (CPU, MEM, GPU), and supports KPI‑driven model optimization.
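The WER/CER metrics mentioned above are conventionally computed as edit distance over words (WER) or characters (CER), normalized by reference length. A minimal sketch of that standard computation, not TAL's actual implementation:

```python
# Word Error Rate via Levenshtein distance; CER is the same computation
# applied to characters instead of words.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(wer("the cat sat", "the bat sat"))  # 1 substitution / 3 words
```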
Demos: Bad‑case automated screening for unlabeled data, competitive benchmark evaluation for annotated data, and visual reporting via integrated JMeter reports.
AI Micro‑Service Performance Testing Platform
Scenarios & Users: Shared environment management for algorithm and service testing, automated deployment, pressure testing (TPS/concurrency), and bottleneck analysis for developers, testers, and product managers.
Technical Architecture: Data source layer (Prometheus monitoring, persistent storage), interface layer (automated pressure scripts, remote execution, one‑click login), and UI layer for resource management and result visualization.
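Pulling resource metrics from the Prometheus data source typically means issuing an instant query against its HTTP API and reading the returned vector. A hedged sketch, assuming a Prometheus address of `http://prometheus:9090` and a `pod` label (both are illustrative assumptions; the response shape follows the Prometheus HTTP API):

```python
import json
from urllib.parse import urlencode

PROM_URL = "http://prometheus:9090"  # assumed address, not the real deployment

def instant_query_url(promql: str) -> str:
    """Build a Prometheus /api/v1/query URL for an instant query."""
    return f"{PROM_URL}/api/v1/query?" + urlencode({"query": promql})

def parse_instant_result(body: str) -> dict:
    """Return {pod-label: float value} from a Prometheus instant-query response."""
    data = json.loads(body)
    out = {}
    for sample in data["data"]["result"]:
        pod = sample["metric"].get("pod", "unknown")
        out[pod] = float(sample["value"][1])  # value = [timestamp, "value-as-string"]
    return out

# Example response, shaped like /api/v1/query output for a CPU-usage metric:
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "asr-0"}, "value": [1700000000, "0.52"]},
    ]},
})
print(parse_instant_result(sample))  # {'asr-0': 0.52}
```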
Actual Effects: Unattended automated pressure testing using a binary search algorithm to find maximum TPS, reducing manual test cycles from 600 minutes to about 10 minutes, and generating detailed JMeter reports with CPU, memory, and TPS metrics.
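The binary search over load levels described above can be sketched as follows. Here `passes(tps)` is a hypothetical stand‑in for one automated pressure run that returns True when the service stays within its latency and error budget at that load; the bounds are illustrative.

```python
# Binary search for the maximum sustainable TPS, replacing a manual
# step-by-step ramp. Assumes passes() is monotone: if a load level fails,
# all higher levels fail too.

def max_sustainable_tps(passes, low=1, high=2000):
    """Largest tps in [low, high] for which passes(tps) is True, else 0."""
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if passes(mid):
            best = mid        # mid is sustainable; probe higher load
            low = mid + 1
        else:
            high = mid - 1    # mid overloads the service; back off
    return best

# Example: pretend the service degrades above 850 TPS.
print(max_sustainable_tps(lambda tps: tps <= 850))  # 850
```

With ~11 probes covering a 1–2000 TPS range, each probe being one short pressure run, this is what turns a 600‑minute manual sweep into minutes.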
Demos: Automated pressure testing workflow, real‑time monitoring dashboards, and threshold‑based memory leak detection.
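Threshold‑based memory‑leak detection can be sketched as watching a window of memory samples and flagging steady growth or a breached limit. The window size and thresholds below are illustrative assumptions, not the platform's configured values.

```python
# Hedged sketch of threshold-based leak detection: given recent resident-
# memory samples (MB), flag a suspected leak when usage either crosses a
# hard limit or grows monotonically by a significant amount.

def leak_suspected(samples_mb, limit_mb=8192, min_growth_mb=100):
    """Return True when the sample window looks like a leak."""
    if not samples_mb:
        return False
    growth = samples_mb[-1] - samples_mb[0]
    monotonic = all(b >= a for a, b in zip(samples_mb, samples_mb[1:]))
    over_limit = samples_mb[-1] > limit_mb
    return over_limit or (monotonic and growth >= min_growth_mb)

print(leak_suspected([6100, 6100, 6120, 6090]))  # False: usage is stable
print(leak_suspected([6100, 6400, 6800, 7300]))  # True: steady growth
```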
Dataset Management Platform
Scenarios & Users: Supports training and testing data labeling, versioning, and quality control for algorithm engineers, testers, product managers, and data operators.
Process: Select dataset → automated download → model processing → result verification, with continuous monitoring of accuracy, speed, and stability.
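The four‑step flow above can be sketched as a small pipeline. The callables and dataset ID are illustrative stand‑ins, not the platform's real API:

```python
# Minimal sketch of one dataset evaluation job:
# select dataset -> automated download -> model processing -> result verification.

def run_dataset_job(dataset_id, download, model, verify):
    """Run one evaluation job; return (predictions, verification result)."""
    samples = download(dataset_id)              # automated download
    predictions = [model(s) for s in samples]   # model processing
    passed = verify(samples, predictions)       # result verification
    return predictions, passed

# Example with stand-in callables:
preds, ok = run_dataset_job(
    "ocr-v3",                                  # hypothetical dataset ID
    download=lambda _id: ["a", "bb"],
    model=lambda s: s.upper(),
    verify=lambda xs, ys: len(xs) == len(ys),
)
print(preds, ok)  # ['A', 'BB'] True
```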
Actual Effects: Monitors memory usage (6‑8 GB) and CPU usage (about 50 % on average) to confirm model stability and resource safety.
Future Planning
The platform aims to further improve quality, efficiency, and cost reduction by adding intelligent recommendations for algorithm improvements, automated bottleneck localization, and advanced memory‑leak scanning, ultimately delivering high‑quality AI products and micro‑services with greater agility.
TAL Education Technology
TAL Education is a technology‑driven education company committed to the mission of “making education better through love and technology”. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.