Comparative Evaluation of DeepL and ChatGPT Machine Translation for Game Localization
This article investigates the translation quality of DeepL and ChatGPT on text from the game 'Naraka: Bladepoint'. It compares their outputs against professional human translations across the Chinese‑English, Chinese‑Spanish, and English‑Spanish pairs using BLEU scores and manual assessment, and reveals the strengths and limitations of each system.
Recent advances in artificial intelligence have expanded the scope of AIGC (AI‑generated content), with tools like ChatGPT demonstrating capabilities in natural language processing, code generation, and content creation. In the gaming industry, AIGC can assist with AI‑generated art, voice‑overs, copywriting, and even programming.
This study evaluates whether ChatGPT’s translation function can be applied to real‑world game localization. Four representative text groups were selected from the Chinese game Naraka: Bladepoint: skill descriptions, story background, action descriptions, and literary‑style text. Human translations serve as the reference standard, and the machine translation outputs of DeepL and ChatGPT (version 3.5) are compared using BLEU scores and manual evaluation.
Test preparation
ChatGPT version: 3.5
Machine translation tool: DeepL (chosen for its generally higher accuracy on technical and academic texts)
Test languages: Chinese → English, Chinese → Spanish, English → Spanish
Evaluation metric: BLEU score (the most widely used automatic metric for MT quality)
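For readers unfamiliar with the metric, the standard BLEU definition is reproduced here as background. The article does not state which BLEU configuration was used, so the common choice of N = 4 with uniform weights w_n = 1/N is an assumption. Here p_n is the clipped n-gram precision of the machine output against the reference, c is the length of the machine output, and r is the length of the reference.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```

Scores are usually reported on a 0 to 100 scale, which is the scale the thresholds in this article refer to.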
Test method
The four text groups were translated in three directions (Chinese→English, Chinese→Spanish, English→Spanish) by both DeepL and ChatGPT, giving 12 outputs per system. BLEU scores were calculated for each output against the human reference, and a manual review examined grammar, terminology, idioms, cultural references, and literary quality.
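The scoring step can be illustrated with a minimal sketch of sentence-level BLEU in plain Python. This is not the tool used in the study: the function names `ngrams` and `bleu`, the whitespace tokenization, lowercasing, the single reference, and the absence of smoothing are all assumptions made for illustration, and production scorers typically work at corpus level with more careful tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU on a 0-100 scale, one reference, no smoothing.

    Assumption: both strings are tokenized by lowercasing and splitting on
    whitespace, which is only reasonable for targets such as English or Spanish.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, one n-gram order with no matches zeroes the score.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "Solo and Duos"
    # Unigram-only score: one of four candidate words ("and") matches, so about 25.
    print(bleu("single and double rows", reference, max_n=1))
    # With bigrams included there is no matching bigram, so the score is 0.
    print(bleu("single and double rows", reference, max_n=2))
```

Very short strings such as UI labels tend to score 0 with the default `max_n=4` because they contain no matching 4-grams, which is one reason BLEU is more informative on longer passages or whole corpora than on isolated terms.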
Results and analysis
Overall, both systems achieved only one BLEU score above 40, indicating that current MT quality is still far from professional standards.
DeepL outperformed ChatGPT in 7 out of 12 BLEU evaluations, showing higher similarity to human translations.
English→Spanish translations scored higher than Chinese→Spanish for both tools, likely because English‑Spanish training corpora are larger and the two languages are more closely related.
Grammar was generally acceptable for both systems, but subjective judgments (e.g., correct subject selection in skill descriptions) favored human translators.
Terminology, idioms, cultural references, and mythological allusions were often mistranslated or rendered too literally. For example, "单双排" became "single and double rows" (DeepL) and "single and double formations" (ChatGPT) instead of the correct "Solo and Duos".
Literary passages lost poetic nuance; both DeepL and ChatGPT produced straightforward renderings lacking the original’s aesthetic depth.
The analysis confirms that while machine translation can handle basic grammatical structures, it struggles with domain‑specific terminology, cultural nuances, and literary style. Consequently, human post‑editing remains essential for high‑quality game localization.
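One way a post-editing pass might catch terminology errors like the "单双排" case is a simple glossary check. The sketch below is an illustration only and not part of the study's method: the function name `find_glossary_violations` and the dictionary-based glossary format are hypothetical, and the only data used are the term and outputs quoted in the results above.

```python
def find_glossary_violations(source, translation, glossary):
    """Return (source_term, approved_target) pairs that were not respected.

    A pair is reported when the source term occurs in the source text but the
    approved target term does not occur in the translation (case-insensitive).
    """
    lowered = translation.lower()
    return [
        (term, target)
        for term, target in glossary.items()
        if term in source and target.lower() not in lowered
    ]


# Glossary entry taken from the example discussed in the results.
glossary = {"单双排": "Solo and Duos"}

if __name__ == "__main__":
    outputs = {
        "DeepL": "single and double rows",
        "ChatGPT": "single and double formations",
        "Human": "Solo and Duos",
    }
    for system, text in outputs.items():
        violations = find_glossary_violations("单双排", text, glossary)
        print(system, "needs review" if violations else "ok", violations)
```

A substring check of this kind only flags missing approved terms; it cannot judge idioms, cultural references, or literary quality, which still call for human review.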
Conclusion
At the current stage, DeepL and ChatGPT demonstrate solid grammatical performance but fall short in handling game‑specific terms, idioms, cultural background, and literary expression. The practical workflow should still prioritize human translators, with machine translation serving as an auxiliary tool. As models continue to evolve, ChatGPT’s potential may increase, but reliable, nuanced localization will likely remain a collaborative effort between humans and AI.
References
https://www.letsmt.eu/Bleu.aspx
https://cloud.tencent.com/developer/article/1159767
https://arxiv.org/pdf/2301.08745.pdf
NetEase LeiHuo Testing Center
LeiHuo Testing Center provides high-quality, efficient QA services, striving to become a leading testing team in China.