
Large Model Application Engineering: ZhiLight Inference Framework and Zhihu Direct Answer System

The article details Zhihu's technical salon on large‑model engineering, covering the RAG‑based Zhihu Direct Answer system, the open‑source ZhiLight inference framework, prompt engineering, agent research, and future plans for integrating AI into product and community workflows.


On December 7, Zhihu and DataFun hosted a technical salon titled “Large Model Application Engineering: From Lab to Ten‑Million‑User Products,” featuring Zhihu AI algorithm lead Wang Jiewu, machine‑learning platform lead Wang Xin, senior Baichuan large‑model algorithm expert Wang Yulong, and prompt evangelist Li Jigang. The event attracted participants from 15 industries and 130 companies, and Zhihu CTO Sun Bin announced the open‑source release of Zhihu's lightweight, high‑efficiency large‑model inference framework ZhiLight.

Zhihu Direct Answer, launched earlier in 2024, is a retrieval‑augmented generation (RAG) system: it first retrieves relevant knowledge from a database and then generates answers with a large language model. Compared with a continuous pre‑train + post‑train approach, RAG was chosen for its lower cost, lower latency, higher accuracy, and better scalability. The team optimized query understanding, retrieval, and generation to reduce hallucinations and improve answer authority.
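The retrieve‑then‑generate flow can be sketched as follows. This is a minimal toy illustration: the corpus, the term‑overlap scorer, and the `generate` stub are hypothetical stand‑ins, not Zhihu Direct Answer's actual retriever or model.

```python
# Minimal retrieve-then-generate (RAG) sketch.
# All names and the scoring scheme are illustrative stand-ins.

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: builds the context-augmented prompt."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Answer '{query}' using:\n{joined}"

corpus = [
    "RAG retrieves relevant documents before generation",
    "Inference frameworks optimize GPU memory and batching",
    "Prompt engineering shapes model behavior",
]
prompt = generate("what is RAG", retrieve("what is RAG", corpus))
print(prompt)
```

In a production system, the overlap scorer would be replaced by dense or hybrid retrieval and `generate` by an actual model call, but the two‑stage shape stays the same.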

The professional search component of Direct Answer employs multi‑agent collaboration, supporting information retrieval, content analysis, mathematical computation, and more. It leverages large‑model reasoning to deeply analyze user context and queries, achieving breakthroughs in chain‑of‑thought prompting, intelligent document parsing, and dynamic resource scheduling.

Zhihu also showcased the ZhiLight inference framework, which runs large‑model services on a range of NVIDIA GPUs and integrates with open‑source projects such as vLLM and SGLang. ZhiLight focuses on PCIe inter‑card communication optimization, memory management, and concurrent request handling, and incorporates FlashAttention, Marlin, TensorRT, and ExLlama. Benchmarks show a reduction of more than 40% in single‑layer Transformer compute time, and superior time‑to‑first‑token (TTFT) performance at the 70B and 110B model scales compared with other open‑source engines.
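TTFT, the metric cited in the benchmarks, measures the delay from issuing a request until the first generated token arrives, and it is dominated by the prefill (prompt‑processing) phase. A minimal way to measure it against any streaming API, using a simulated model rather than ZhiLight itself:

```python
# TTFT measurement sketch. `fake_stream` simulates a streaming model
# with a prefill delay; it is a stand-in, not a real inference engine.

import time
from typing import Iterator

def fake_stream(prompt: str) -> Iterator[str]:
    """Simulated streaming model: a prefill delay, then tokens."""
    time.sleep(0.05)  # stands in for the prefill phase
    for token in ["Hello", ",", " world"]:
        yield token

def measure_ttft(stream: Iterator[str]) -> tuple[float, str]:
    """Seconds from request start until the first token arrives."""
    start = time.perf_counter()
    first = next(stream)
    return time.perf_counter() - start, first

ttft, first_token = measure_ttft(fake_stream("benchmark prompt"))
print(f"TTFT: {ttft * 1000:.1f} ms, first token: {first_token!r}")
```

Because a Python generator runs nothing until the first `next()`, the timer correctly captures the simulated prefill cost.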

Future plans for Direct Answer include deeper integration with the Zhihu community, enhancements in accuracy, latency, and interaction modes, and further improvements to model reasoning capabilities.

The salon also featured a forward‑looking discussion on agents, AGI, and prompt engineering. Wang Yulong explained the shift from traditional agents to LLM‑driven agents that can plan, remember, and use tools across diverse tasks, while highlighting current challenges such as lack of theoretical guidance and inconsistent results. Li Jigang presented his view on prompt engineering, defining a prompt as "expression" (intent + text + interpretation) and emphasizing concise, resonant prompts that align with large‑model behavior.

Zhihu’s technical team plans to publish detailed technical articles, PPTs, and recordings of the salon in their tech column, continuing to foster high‑quality discussions and collaborations with the broader technology community.

Tags: prompt engineering, Large Language Models, RAG, agent, AI Engineering, Inference Framework
Written by Zhihu Tech Column

Sharing Zhihu tech posts and exploring community technology innovations.