
How Quwan’s Kaitian Model Tackles Emotional AI for Social Apps – Architecture, Training Tricks, and Safety

Quwan Technology presents its Kaitian social large model, designed for personalized, emotionally rich, multimodal AI interactions, detailing its scene‑specific goals, CPT+SFT+RLHF training pipeline, data desensitization, LoRA fine‑tuning, evaluation methods, pruning, latency trade‑offs, safety mechanisms, and future feedback loops.


With the explosive growth of AI‑driven social demand, users increasingly require personalized interaction, high‑density emotional companionship, real‑time multimodal understanding, and privacy protection. General large models fall short in vertical social scenarios due to poor adaptation, weak emotional value, inefficient multimodal fusion, and compliance risks, prompting the need for a dedicated, controllable social large model.

Quwan Technology proposes the self‑developed "Kaitian" social model, targeting three goals: scene‑specificity, safety & trustworthiness, and multimodal collaboration. The core technical path includes an innovative CPT+SFT+RLHF iterative framework, a mask mechanism, and dynamic sample‑adjustment to boost learning efficiency and quality.
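
The article does not spell out how the iterative framework is wired together. The sketch below is a hypothetical Python skeleton of a CPT → SFT → RLHF loop; every stage function is a placeholder stub standing in for whatever training framework is actually used, not Quwan's pipeline.

```python
# Hypothetical skeleton of an iterative CPT -> SFT -> RLHF loop.
# None of these stage functions come from the article; they are stubs.

def run_cpt(model, domain_corpus):
    """Continued pre-training on desensitized social-domain text."""
    ...  # next-token prediction over the vertical corpus
    return model

def run_sft(model, dialogue_pairs):
    """Supervised fine-tuning on curated emotional-dialogue samples."""
    ...  # masked, weighted cross-entropy (see the sample-adjustment sketch later)
    return model

def run_rlhf(model, preference_data):
    """Preference optimization against a reward model for emotional quality."""
    ...  # e.g. PPO- or DPO-style updates
    return model

def train_iteratively(model, corpus, dialogues, preferences, rounds=3):
    for _ in range(rounds):
        model = run_cpt(model, corpus)
        model = run_sft(model, dialogues)
        model = run_rlhf(model, preferences)
        # fresh high-quality samples harvested online can be folded back
        # into `dialogues` / `preferences` between rounds
    return model
```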

Data and training optimization are achieved through desensitized social data and a two‑stage training process (general pre‑training followed by vertical fine‑tuning). Lightweight LoRA techniques balance high precision with low cost. Model evaluation and inference involve collaboration with domestic universities to build social‑domain evaluation methods and datasets, as well as model pruning and quantization to accelerate online inference and improve concurrency.
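
Neither the pruning nor the quantization stack is named in the article. As one common open-source route, the snippet below loads a causal LM in 4-bit precision with Hugging Face transformers and bitsandbytes; the checkpoint name is a placeholder and this is an illustration, not Quwan's serving setup.

```python
# Hypothetical example: serving a quantized model for lower latency.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to cut memory footprint
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 to preserve quality
)

tokenizer = AutoTokenizer.from_pretrained("my-org/social-llm-7b")   # placeholder name
model = AutoModelForCausalLM.from_pretrained(
    "my-org/social-llm-7b",                # placeholder checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
```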

The Kaitian model has already been deployed in multiple business scenarios (e.g., TT voice), offering 7B, 14B, and 32B parameter versions to meet varying complexity, emotional companionship needs, and latency requirements.

01 Industry Insight & Technical Necessity

DataFun: General models show weak emotional value and inefficient multimodal fusion. How do these shortcomings affect users' deeper needs for emotional companionship?

Ma Jinlong: General models focus on language understanding, problem solving, and logical reasoning, while social scenarios demand human‑like emotional understanding, recognition, and response. The difference lies in answer quality (human‑like & emotionally satisfying), format (concise, colloquial, emoji‑rich), and perspective (role‑based, purpose‑driven). A side‑by‑side test shows the emotional model giving short, empathetic replies, whereas the general model provides lengthy, task‑oriented answers.

02 Core Technical Path & Innovation

DataFun: Explain the CPT+SFT+RLHF iterative technique and the dynamic sample‑adjustment method.

Ma Jinlong: The dynamic sample‑adjustment method masks low‑quality or incoherent dialogue turns during training, dynamically lowering their weight so the model focuses on high‑quality content, thereby improving learning efficiency.
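
Quwan's implementation is not published; the following is a minimal PyTorch sketch of the idea: compute a per-token cross-entropy, then scale it by a per-turn quality weight so that low-quality or incoherent turns contribute little to the gradient.

```python
# Illustrative sketch (not Quwan's code): down-weight low-quality dialogue
# turns so the loss is dominated by high-quality content.
import torch
import torch.nn.functional as F

def weighted_turn_loss(logits, labels, turn_quality):
    """
    logits:       (batch, seq, vocab) model outputs
    labels:       (batch, seq) target token ids, -100 for padding
    turn_quality: (batch, seq) per-token weight in [0, 1],
                  e.g. 0.0 for incoherent turns, 1.0 for clean turns
    """
    vocab = logits.size(-1)
    token_loss = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=-100, reduction="none",
    ).view(labels.shape)

    mask = (labels != -100).float() * turn_quality
    return (token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```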

DataFun: How does desensitized social data combined with LoRA maintain low‑cost training without sacrificing multimodal synergy?

Ma Jinlong: After desensitization, we rewrite content to recover information density, discard low‑quality samples, and use LoRA for lightweight fine‑tuning. While some multimodal capacity is sacrificed, the current focus is on textual performance.
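
As an illustration of the lightweight fine-tuning step, here is a minimal LoRA setup using the open-source peft library; the base checkpoint, rank, and target modules are assumptions, not Quwan's published configuration.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face peft (assumed config).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("my-org/social-llm-7b")  # placeholder

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, typical for LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapters are trainable
```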

DataFun: How are voice, text, and visual signals integrated for consistent emotional output?

Ma Jinlong: We extract tone, emotion, and event tags from voice, align them with textual emotion labels, and use a multimodal model to produce unified understanding tags. For output, a highly human-like emotional TTS generates speech consistent with the generated text.
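
A hypothetical end-to-end sketch of that flow is shown below: audio and text are tagged separately, merged into a unified understanding, and the reply is re-voiced with an emotion-matched TTS. Every helper is a placeholder stub, not a real Quwan API.

```python
# Hypothetical pipeline sketch of the multimodal flow described above.
from dataclasses import dataclass, field

@dataclass
class Understanding:
    emotion: str                     # e.g. "sad", "excited"
    tone: str                        # e.g. "soft", "agitated"
    events: list = field(default_factory=list)

def tags_from_audio(audio_bytes) -> Understanding:
    return Understanding("neutral", "soft")       # stand-in for a speech-emotion model

def tags_from_text(text: str) -> Understanding:
    return Understanding("neutral", "plain")      # stand-in for a text-emotion classifier

def unify(audio: Understanding, text: Understanding) -> Understanding:
    # audio carries tone, text carries events; emotion falls back to the text label on conflict
    emotion = audio.emotion if audio.emotion == text.emotion else text.emotion
    return Understanding(emotion, audio.tone, text.events)

def generate_reply(transcript: str, tags: Understanding) -> str:
    return f"[reply conditioned on emotion={tags.emotion}]"  # stand-in for the LLM call

def emotional_tts(text: str, emotion: str) -> bytes:
    return b""                                    # stand-in for the emotional TTS engine

def respond(audio_bytes, transcript: str):
    tags = unify(tags_from_audio(audio_bytes), tags_from_text(transcript))
    reply_text = generate_reply(transcript, tags)
    return reply_text, emotional_tts(reply_text, tags.emotion)
```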

03 Engineering Deployment & Effect Verification

DataFun: How does the evaluation dataset quantify abstract metrics like "emotional value"?

Ma Jinlong: We built a three‑dimensional evaluation system incorporating emotional satisfaction scores, 8,000 scenario‑specific questions, and psychological metrics (EQ, IQ scales) to assess both emotional and general language abilities.
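
The exact scoring scheme is not described; purely as an illustration, the snippet below folds three normalized dimensions into one weighted composite score. The weights and field names are assumptions, not the scheme used by Quwan and its university partners.

```python
# Illustrative composite scoring over the three evaluation dimensions.
def composite_score(results: dict, weights=(0.4, 0.3, 0.3)) -> float:
    """
    results expects three values normalized to [0, 1]:
      "emotional_satisfaction" - human-rated satisfaction with replies
      "scenario_accuracy"      - pass rate on the scenario question set
      "psych_scales"           - normalized EQ/IQ scale score
    """
    dims = ("emotional_satisfaction", "scenario_accuracy", "psych_scales")
    return sum(w * results[d] for w, d in zip(weights, dims))

print(composite_score({
    "emotional_satisfaction": 0.82,
    "scenario_accuracy": 0.76,
    "psych_scales": 0.71,
}))
```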

DataFun: What are the latency differences between the 32B and 7B models after pruning and quantization?

Ma Jinlong: The 32B model (1500‑token input, 50‑token output) takes ~2 seconds, while the 7B model under the same conditions takes ~0.6 seconds. Although hierarchical inference was considered, we prioritized consistent emotional experience over cost reduction.
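
For context, a latency comparison of this kind is typically measured with a small harness like the one below; the `generate` callable stands in for whichever inference backend serves the 7B and 32B checkpoints, and the stub does not reproduce the quoted end-to-end figures.

```python
# Simple latency harness (illustrative): median wall-clock time over repeated runs.
import time
import statistics

def measure_latency(generate, prompt: str, max_new_tokens: int = 50, runs: int = 10) -> float:
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, max_new_tokens=max_new_tokens)   # placeholder inference call
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)
```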

04 Compliance, Safety & Ecosystem Value

DataFun: Describe the full‑chain content audit platform and its dual‑mechanism approach.

Ma Jinlong: We employ pre‑generation prompt filtering and post‑generation multimodal content detection. The audit model, refined from our existing multimodal safety tech, achieves industry‑leading accuracy and complies with regional evaluation standards.
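
A hypothetical sketch of the dual mechanism, with both classifiers as placeholder stubs standing in for Quwan's audit models:

```python
# Illustrative dual-mechanism audit flow: pre-generation prompt filter plus
# post-generation content check. Both checks are placeholder stubs.
def is_prompt_safe(prompt: str) -> bool:
    return True   # stand-in for the pre-generation prompt filter

def is_output_safe(text: str) -> bool:
    return True   # stand-in for the post-generation content detector

def guarded_generate(prompt: str, generate) -> str:
    if not is_prompt_safe(prompt):
        return "[request declined by pre-generation filter]"
    reply = generate(prompt)
    if not is_output_safe(reply):
        return "[reply withheld by post-generation audit]"
    return reply
```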

DataFun: How does the model mitigate ethical risks in sensitive scenarios like mental‑health support?

Ma Jinlong: We adopt a tiered human‑intervention system triggered by extreme emotional states, combined with knowledge‑base constraints to limit risky suggestions.
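
A toy sketch of such a tiered escalation policy; the risk labels and actions are illustrative assumptions, not the deployed rules:

```python
# Illustrative tiered escalation: map a detected emotional-risk level to an action.
def triage(risk_level: str) -> str:
    actions = {
        "low":      "respond normally, grounded in the vetted knowledge base",
        "elevated": "respond with supportive templates and surface help resources",
        "severe":   "hand off to a human operator and log for review",
    }
    # unknown levels default to the most conservative action
    return actions.get(risk_level, actions["severe"])
```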

05 Future Evolution & Industry Impact

DataFun: Will a user‑model‑platform feedback loop drive continuous emotional understanding?

Ma Jinlong: Yes, we are building a feedback loop in which user behavior data refines high‑quality training datasets, enabling the model to evolve into a "warmer, healthier" social AI.
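
As a rough illustration of what that loop could look like on the data side, the stub below filters conversations by hypothetical behavior signals into new training candidates; the field names and thresholds are assumptions.

```python
# Illustrative filter: turn user behavior signals into training-set candidates.
def select_training_candidates(conversations, min_rating=4, min_turns=6):
    for conv in conversations:
        rating = conv.get("user_rating", 0)       # hypothetical explicit feedback signal
        turns = conv.get("turns", [])
        if rating >= min_rating and len(turns) >= min_turns:
            yield {"dialogue": turns, "label": "high_quality"}
```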

LoRA · large language model · multimodal · RLHF · AI safety · model pruning · social AI
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
