DaTaobao Tech
Jul 12, 2023 · Artificial Intelligence
Optimizing ChatGLM-6B Deployment with MNN: Model Conversion, Quantization, and Edge Inference
This article details a workflow that converts the PyTorch ChatGLM‑6B model to MNN: it splits and compresses the embeddings, applies int4/int8 weight quantization, supports dynamic input shapes, and loads layers across GPU and CPU (or CPU only) to enable low‑memory edge inference on PCs and mobile devices at competitive tokens‑per‑second throughput.
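To make the int4/int8 step concrete, here is a minimal sketch of symmetric weight quantization in NumPy. This is illustrative only: MNN's actual quantization schemes (per-channel scales, block-wise int4 packing) differ in detail, and the function names here are hypothetical.

```python
import numpy as np

np.random.seed(0)

def quantize_weights(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization of float weights to `bits` levels.

    Returns the integer codes and the float scale needed to reconstruct
    the weights. Illustrative sketch, not MNN's implementation.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q8, s8 = quantize_weights(w, bits=8)
q4, s4 = quantize_weights(w, bits=4)

# Rounding error is bounded by half the quantization step, so the
# coarser int4 grid (larger scale) admits larger reconstruction error.
err8 = np.abs(dequantize(q8, s8) - w).max()
err4 = np.abs(dequantize(q4, s4) - w).max()
```

Int4 halves the weight storage again relative to int8, which is what makes a 6B-parameter model fit in edge-device memory, at the cost of coarser weight reconstruction.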
ChatGLM · LLM · MNN
16 min read