
Large Model Application Engineering: ZhiLight Inference Framework and Zhihu Direct Answer System

The article details Zhihu's technical salon on large‑model engineering, covering the RAG‑based Zhihu Direct Answer system, the open‑source ZhiLight inference framework, prompt engineering, agent research, and future plans for integrating AI into product and community workflows.


On December 7, Zhihu and DataFun hosted a technical salon titled “Large Model Application Engineering: From Lab to Ten‑Million‑User Products,” featuring Zhihu AI algorithm lead Wang Jiewu, machine‑learning platform lead Wang Xin, senior Baichuan large‑model algorithm expert Wang Yulong, and prompt evangelist Li Jigang. The event attracted participants from 15 industries and 130 companies, and Zhihu CTO Sun Bin announced the open‑source release of Zhihu's lightweight, high‑efficiency large‑model inference framework ZhiLight.

Zhihu Direct Answer, launched earlier in 2024, is a retrieval‑augmented generation (RAG) system: it first retrieves relevant knowledge from a database and then generates answers with a large language model. Compared with a continuous pre‑train + post‑train approach, RAG was chosen for its lower cost, lower latency, higher accuracy, and better scalability. The team optimized query understanding, retrieval, and generation to reduce hallucinations and improve answer authority.
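The retrieve‑then‑generate flow can be sketched as follows. This is a minimal toy illustration: the corpus, the term‑overlap scorer, and the `generate` stub are hypothetical stand‑ins, not Zhihu Direct Answer's actual retriever or model.

```python
# Minimal retrieve-then-generate (RAG) sketch.
# All names and the scoring scheme are illustrative stand-ins.

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: builds the context-augmented prompt."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Answer '{query}' using:\n{joined}"

corpus = [
    "RAG retrieves relevant documents before generation",
    "Inference frameworks optimize GPU memory and batching",
    "Prompt engineering shapes model behavior",
]
prompt = generate("what is RAG", retrieve("what is RAG", corpus))
print(prompt)
```

In a production system, the overlap scorer would be replaced by dense or hybrid retrieval and `generate` by an actual model call, but the two‑stage shape stays the same.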

The professional search component of Direct Answer employs multi‑agent collaboration, supporting information retrieval, content analysis, mathematical computation, and more. It leverages large‑model reasoning to deeply analyze user context and queries, achieving breakthroughs in chain‑of‑thought prompting, intelligent document parsing, and dynamic resource scheduling.

Zhihu also showcased the ZhiLight inference framework, which runs large‑model services on a range of NVIDIA GPUs and integrates with open‑source projects such as vLLM and SGLang. ZhiLight focuses on PCIe inter‑card communication optimization, memory management, and concurrent request handling, and incorporates FlashAttention, Marlin, TensorRT, and ExLlama. Benchmarks show a reduction of more than 40% in single‑layer Transformer compute time, and superior time‑to‑first‑token (TTFT) performance at the 70B and 110B model scales compared with other open‑source engines.
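TTFT, the metric cited in the benchmarks, measures the delay from issuing a request until the first generated token arrives, and it is dominated by the prefill (prompt‑processing) phase. A minimal way to measure it against any streaming API, using a simulated model rather than ZhiLight itself:

```python
# TTFT measurement sketch. `fake_stream` simulates a streaming model
# with a prefill delay; it is a stand-in, not a real inference engine.

import time
from typing import Iterator

def fake_stream(prompt: str) -> Iterator[str]:
    """Simulated streaming model: a prefill delay, then tokens."""
    time.sleep(0.05)  # stands in for the prefill phase
    for token in ["Hello", ",", " world"]:
        yield token

def measure_ttft(stream: Iterator[str]) -> tuple[float, str]:
    """Seconds from request start until the first token arrives."""
    start = time.perf_counter()
    first = next(stream)
    return time.perf_counter() - start, first

ttft, first_token = measure_ttft(fake_stream("benchmark prompt"))
print(f"TTFT: {ttft * 1000:.1f} ms, first token: {first_token!r}")
```

Because a Python generator runs nothing until the first `next()`, the timer correctly captures the simulated prefill cost.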

Future plans for Direct Answer include deeper integration with the Zhihu community, enhancements in accuracy, latency, and interaction modes, and further improvements to model reasoning capabilities.

The salon also featured a forward‑looking discussion on agents, AGI, and prompt engineering. Wang Yulong explained the shift from traditional agents to LLM‑driven agents that can plan, remember, and use tools across diverse tasks, while highlighting current challenges such as lack of theoretical guidance and inconsistent results. Li Jigang presented his view on prompt engineering, defining a prompt as "expression" (intent + text + interpretation) and emphasizing concise, resonant prompts that align with large‑model behavior.

Zhihu’s technical team plans to publish detailed technical articles, PPTs, and recordings of the salon in their tech column, continuing to foster high‑quality discussions and collaborations with the broader technology community.

Tags: prompt engineering, Large Language Models, RAG, agent, AI Engineering, Inference Framework
Written by Zhihu Tech Column

Sharing Zhihu tech posts and exploring community technology innovations.