
Practical AI‑Powered Voice Recognition for Game Dialogue Testing: A Step‑by‑Step Case Study

This article presents a detailed case study of using AI speech‑recognition techniques—including acoustic modeling with VGG, pypinyin conversion, feature extraction, and CTC decoding—to automatically verify game dialogue audio against script text, outlining the workflow, challenges, implementation details, and experimental results.

NetEase LeiHuo Testing Center

The article introduces a real‑world AI project aimed at improving the efficiency of testing voice‑over dialogue in the mobile game "Qian Nu" by automatically matching audio files with their corresponding script subtitles.

Product background: Testers previously had to listen to each voice clip in full and manually compare it with the planned subtitles, a time‑consuming process.

Requirement: Use AI to recognize speech and determine whether the spoken content aligns with the textual description.

Solution overview: The script text is converted to tonal pinyin and each unique pinyin token is assigned a numeric ID; a CNN‑based VGG acoustic model with CTC decoding then maps audio to the most likely pinyin sequence. Because the task is verification rather than open transcription, the language‑model stage can be omitted: the acoustic‑model output is compared directly with the expected pinyin.

Implementation steps:

Extract the list of voice files and their expected text from the design documents.

Convert the expected Chinese text to tonal pinyin (e.g., "ni3hao3") using the pypinyin library.

Map each unique pinyin token to a fixed numeric ID.

Transform each audio file into a spectrogram and extract features (mp3 → wav conversion, framing, windowing, spectrogram generation).
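A hedged sketch of this step, assuming the mp3 files have already been converted to 16 kHz mono wav samples (e.g. with ffmpeg) and using common ASR framing defaults (25 ms window, 10 ms hop) rather than the article's exact parameters:

```python
import numpy as np

def spectrogram(samples, frame_len=400, hop=160):
    """Frame the waveform, apply a Hamming window, and take the
    magnitude FFT of each frame. At 16 kHz, frame_len=400 is a 25 ms
    window and hop=160 a 10 ms hop (assumed defaults, not the
    article's)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Magnitude spectrum per frame; shape (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames * window, axis=1))

x = np.random.randn(16000)        # one second of stand-in 16 kHz audio
print(spectrogram(x).shape)       # (98, 201)
```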

Train an AI model (VGG‑based CNN) to map spectrogram features to the numeric pinyin IDs.

Run the trained model on the original voice files to obtain predicted pinyin sequences.

Compare the predicted sequences with the expected ones, allowing for a predefined error set to handle minor oral‑written variations.
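The token‑to‑ID mapping and the final comparison can be sketched as below; the tolerated pairs shown are illustrative placeholders, not the team's actual error set:

```python
def build_vocab(token_seqs):
    """Assign a fixed numeric ID to every unique pinyin token."""
    vocab = {}
    for seq in token_seqs:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

# Pairs of pinyin tokens treated as interchangeable (oral vs. written
# variants). These entries are hypothetical examples only.
TOLERATED = {("a1", "ya1"), ("bo2", "bai2")}

def sequences_match(predicted, expected):
    """Compare predicted and expected pinyin, allowing tolerated swaps."""
    if len(predicted) != len(expected):
        return False
    return all(p == e or (p, e) in TOLERATED or (e, p) in TOLERATED
               for p, e in zip(predicted, expected))

vocab = build_vocab([["ni3", "hao3"], ["shi4", "jie4"]])
print(vocab)  # {'ni3': 0, 'hao3': 1, 'shi4': 2, 'jie4': 3}
print(sequences_match(["ni3", "hao3"], ["ni3", "hao3"]))  # True
```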

Technical challenges addressed:

Linking Unity/FMOD event identifiers to the raw audio files required custom scripting.

Inconsistent references to audio events across multiple design sheets demanded a flexible extraction pipeline.

Oral‑written discrepancies were handled by defining an error‑tolerance dictionary.

Key technologies used:

Acoustic modeling with VGG (a classic CNN) and CTC decoding to collapse repeated symbols and remove blank tokens.

Feature extraction via spectrograms; attempted SpecAugment but ultimately adopted the ASRT feature‑extraction method.

Training data from public Chinese speech corpora (THCHS30 and ST‑CMDS).

Training parameters: batch size 8 (later 16 on external GPU resources), over 250,000 batches, loss convergence observed but accuracy still below target.
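The CTC decoding rule named above (collapse repeated symbols, then drop blank tokens) can be sketched in a few lines; the greedy variant over frame-wise argmax IDs is shown here as an illustration:

```python
def ctc_greedy_collapse(ids, blank=0):
    """Apply the CTC collapse rule to a frame-wise ID sequence:
    merge consecutive repeats, then remove blank tokens."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Example: blank=0; a blank between repeats keeps a genuine doubled token.
print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```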

Results and future work: The prototype achieved modest recognition accuracy; further improvements are needed before production deployment. The team plans to refine the model, experiment with larger batch sizes, and explore additional AI use cases in product testing.

Overall, the case study demonstrates how AI techniques can be integrated into game development pipelines to automate quality‑assurance tasks, reducing manual effort and paving the way for broader AI adoption in testing workflows.

Tags: Python, AI, Speech Recognition, Game Testing, CTC Decoding, pypinyin, VGG
Written by

NetEase LeiHuo Testing Center

LeiHuo Testing Center provides high-quality, efficient QA services, striving to become a leading testing team in China.
