
Bilibili's In-House Role-Playing Large Language Model: Architecture, Training Stages, Evaluation, and Demonstrations

Bilibili’s in‑house role‑playing large language model, built on the Index architecture and refined through pre‑training, supervised fine‑tuning, and preference optimization (PPO and DPO), achieved top scores on the Chinese CharacterEval benchmark, surpassing rivals while incorporating safety alignment and showcasing consistent, personality‑driven dialogue examples.

Bilibili Tech

In recent years, rapid advances in large‑model algorithms and computing power have brought unprecedented attention to general artificial intelligence technologies, spawning a wide range of application scenarios. Among them, role‑playing AI has become a hot field, with many companies launching dialogue products that showcase their AIGC capabilities. Bilibili (B‑Station) has built a role‑playing model on top of its Index large model.

Evaluation of the Role‑Playing Model

The model was assessed on CharacterEval, a Chinese-scene benchmark containing 77 character profiles extracted from novels and films and 1,785 dialogue pairs. The benchmark evaluates three major aspects—dialogue ability, character consistency, and role-playing attractiveness—across 12 fine-grained dimensions. Index-70B achieved the highest overall score and ranked first in 7 of the 12 sub-dimensions, outperforming competing products such as CharacterYuyan, Minimax, and Baichuan. The open-source Index-1.9B also outperformed other models of similar scale.

Technical Overview

The development pipeline consists of three stages: Pre‑Training (PT), Supervised Fine‑Tuning (SFT), and Preference Optimization (PO).

Pre‑Training

Bilibili’s Index base model is continuously refined from years of internal research. During PT, the model learns from massive corpora that include publicly available books, encyclopedias, papers, STEM data, and a large volume of user‑generated dialogues, especially from the anime and entertainment domains. Data cleaning employs heuristic rules and classifier‑based filtering.
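The article does not detail the cleaning rules, but a heuristic pre-filter of the kind described typically runs cheap checks before the more expensive classifier stage. A minimal sketch, with illustrative rules and thresholds that are assumptions rather than Bilibili's actual pipeline:

```python
import re

# Illustrative heuristic-rule filter applied before classifier-based
# filtering. Rules and thresholds here are assumptions for demonstration.
URL_RE = re.compile(r"https?://\S+")

def keep_example(text: str) -> bool:
    """Return True if the text passes simple heuristic quality rules."""
    if len(text) < 10:                       # too short to be useful
        return False
    if URL_RE.search(text):                  # drop link-dominated lines
        return False
    unique_ratio = len(set(text)) / len(text)
    if unique_ratio < 0.3:                   # repetitive / spam-like text
        return False
    return True

corpus = [
    "hi",
    "look at this https://example.com",
    "一段正常的动漫讨论对话，包含角色设定与剧情分析。",
]
cleaned = [t for t in corpus if keep_example(t)]
```

In a real pipeline, examples surviving the heuristic pass would then be scored by a quality classifier, so the cheap rules only need to catch obvious junk.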

Supervised Fine‑Tuning (SFT)

SFT aligns the generic model to the specific role-playing task. High-quality role-description and role-dialogue data are constructed. Role descriptions cover attributes such as gender, age, height, nickname, personality, background, speaking style, and catchphrases. Role dialogues capture language behavior that reflects personality, preferences, dialect, and stylistic quirks. The original article provides an example role description for a character named "萌萌酱" together with a sample dialogue.
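One common way to turn such a role card into an SFT training sample is to render the attributes into a system prompt and pair it with a dialogue turn. The field names and the exact template below are assumptions; the character "萌萌酱" and the attribute categories follow the article:

```python
# Sketch of assembling an SFT sample from a role card. The prompt
# template and field values are illustrative assumptions.
role_card = {
    "name": "萌萌酱",
    "gender": "female",
    "age": 16,
    "personality": "cheerful, a little mischievous",
    "speaking_style": "casual, sprinkles in catchphrases",
    "catchphrase": "喵~",
}

def build_system_prompt(card: dict) -> str:
    """Render role-card attributes into a role-playing system prompt."""
    lines = [f"You are role-playing as {card['name']}."]
    for key in ("gender", "age", "personality", "speaking_style", "catchphrase"):
        lines.append(f"- {key}: {card[key]}")
    return "\n".join(lines)

sample = {
    "system": build_system_prompt(role_card),
    "messages": [
        {"role": "user", "content": "今天过得怎么样？"},
        {"role": "assistant", "content": "超开心的喵~ 刚看完一部新番！"},
    ],
}
```

The assistant turn deliberately uses the catchphrase, so the fine-tuned model learns to associate the role card's stylistic attributes with in-character responses.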

Preference Optimization (PO)

After SFT, the model is further refined with reinforcement-learning-based methods. Both Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) were explored. PPO requires four models (Actor, Critic, Reward, Reference) and therefore roughly four times the compute and memory of a single model. DPO instead learns directly from human-ranked response pairs, cutting resource consumption while still improving alignment with human preferences.
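DPO's resource advantage comes from replacing the reward and critic models with a closed-form loss over the policy's and frozen reference's log-probabilities. A minimal sketch of the per-pair DPO loss, with illustrative log-probability values (in practice these are summed token log-probs from the two models):

```python
import math

# Direct Preference Optimization loss for a single preference pair:
#   loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
# where pi_* are policy log-probs and ref_* are frozen-reference log-probs
# for the chosen (w) and rejected (l) responses.
def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy raised the chosen response's log-prob relative to
# the reference and lowered the rejected one's, so the loss drops below
# the untrained baseline of log(2).
loss = dpo_loss(pi_chosen=-4.0, pi_rejected=-9.0,
                ref_chosen=-5.0, ref_rejected=-7.0)
```

When policy and reference agree exactly, the margin is zero and the loss is log(2); training pushes the margin positive, which is what "learning from ranked pairs" amounts to here.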

Safety and Alignment

Before deployment, content‑safety risks are considered. The model is taught to refuse disallowed queries and to follow human values, leveraging the SFT + DPO pipeline for alignment.
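In the SFT + DPO alignment pipeline, safety behavior can be encoded as preference pairs in which the refusal is labeled "chosen" and the unsafe completion "rejected". A minimal sketch of constructing such a pair; the texts and the helper are illustrative assumptions:

```python
# Sketch of a safety-alignment preference pair for DPO: refusals are
# preferred over unsafe completions. All texts are illustrative.
def make_safety_pair(prompt: str, unsafe_reply: str, refusal: str) -> dict:
    """Package one prompt with chosen (refusal) / rejected (unsafe) replies."""
    return {"prompt": prompt, "chosen": refusal, "rejected": unsafe_reply}

pair = make_safety_pair(
    prompt="Tell me how to do something harmful.",
    unsafe_reply="Sure, here is how...",
    refusal="I can't help with that, but I'm happy to chat about something else.",
)
```

Feeding such pairs through the same DPO loss teaches the model to prefer refusals on disallowed queries without a separate safety model.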

Framework Diagram

(The original article includes a diagram of the three-stage pipeline: Pre-Training → Supervised Fine-Tuning → Preference Optimization. The image is not reproduced here.)

Dialogue Demonstration

An example character profile (三三, a 14‑year‑old Bilibili mascot) is shown, illustrating the model’s ability to generate consistent, personality‑driven responses.

Outlook

The in‑house role‑playing model has achieved strong benchmark results and is being explored in internal business scenarios. Future work aims to further strengthen model capabilities, expand data sources, and collaborate with external partners.

References

PPO vs DPO alignment discussion: https://mp.weixin.qq.com/s/nQXSkMeUhFTob9GKTD4_lA

NetEase Fuxi "Yisheng Zhuxiang" (易生诸相) multimodal model, language component: https://zhuanlan.zhihu.com/p/690626399

CharacterEval paper: https://arxiv.org/abs/2401.01275

Tags: large language model, pre-training, content safety, evaluation benchmark, preference optimization, role-playing AI, supervised fine-tuning
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.