Tagged articles

31 articles

Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models from 0.8B to 122B using Unsloth Studio or pure code. It covers text SFT, vision fine‑tuning, MoE models, reinforcement learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.

LLM · Unsloth · fine-tuning

12 min read

Old Zhang's AI Learning

Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to a 5× speed‑up for Qwen3.5‑27B on a single GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlash · Inference Acceleration · SGLang

12 min read

Old Zhang's AI Learning

Apr 10, 2026 · Artificial Intelligence

How a 9B‑Parameter Qwen3.5 Model Achieves Full‑Auto Data Analysis on a Consumer GPU

The open‑source CoPaw‑Flash‑9B‑DataAnalyst‑LoRA model, fine‑tuned via LoRA, can autonomously load, explore, statistically analyze, visualize, and generate structured reports for CSV/Excel/JSON datasets, achieving a 90% success rate with an average of 26 iteration rounds, and it runs on a single consumer‑grade GPU using vLLM and the Data Analyst framework.

Agent · Data Analyst · GPU

10 min read

Node.js Tech Stack

Apr 6, 2026 · Artificial Intelligence

Run Full AI Models Directly in the Browser with Transformers.js v4

Transformers.js v4 rewrites its WebGPU runtime in C++ and compiles to WASM, delivering ten‑fold faster build times, 10% smaller bundles, and up to four‑fold speedups for BERT‑style models, while supporting over 20 new architectures such as Qwen3.5 and enabling offline, privacy‑preserving AI inference directly in the browser.

Transformers.js · Wasm · WebGPU

8 min read

Old Zhang's AI Learning

Apr 3, 2026 · Artificial Intelligence

Qwopus3.5‑v3: From Reason‑Then‑Act to Act‑Then‑Refine – Claude‑Opus Distillation Turns Qwen3.5 into a Tool‑Using Agent

The newly released Qwopus3.5‑v3 model combines higher‑quality reasoning chains, dedicated tool‑calling reinforcement learning, and an act‑then‑refine paradigm, delivering a 5‑point HumanEval boost, a 1.43‑point MMLU‑Pro gain, 31.7% faster inference and 24% lower token cost, while remaining runnable on a 3090 or a 16 GB MacBook, with easy deployment via GGUF, LM Studio, Ollama or llama.cpp.

Claude Opus · HumanEval · MMLU-Pro

12 min read

AI Engineering

Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a vision‑language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI Benchmark · GUI Agent · Holo3

4 min read

Old Zhang's AI Learning

Mar 31, 2026 · Artificial Intelligence

Turning a Bluetooth Speaker into a Smart Assistant with Qwen 3.5‑Omni

The author demonstrates a proof‑of‑concept that combines Qwen 3.5‑Omni's real‑time internet search and audio output with a locally hosted voice‑wake‑up model to transform a Bluetooth speaker into an always‑on smart assistant, while noting latency challenges and the potential of a sub‑10B open‑source alternative.

AI integration · Bluetooth · Large Language Model

2 min read

Old Zhang's AI Learning

Mar 28, 2026 · Artificial Intelligence

Qwen3.5-27B Outperforms the 397B Model in Tool Calling – Q6 Quantization Is Optimal

Using the open‑source ToolCall‑15 benchmark, the author shows that the 27‑billion‑parameter Qwen3.5 model consistently scores full marks while the 397‑billion‑parameter version fails on several tasks, and that the Q6 quantized variant offers the best trade‑off between size and tool‑calling accuracy.

AI · LLM Benchmark · Tool Calling

7 min read

Old Zhang's AI Learning

Mar 25, 2026 · Artificial Intelligence

Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy

The new Claude‑Opus‑4.6 distilled Qwen3.5‑v2 keeps code‑generation accuracy while cutting reasoning length by 24% and boosting per‑token correctness by 31.6%, offering a noticeable speed and cost advantage for local LLM deployment despite a 7.2% drop on MMLU‑Pro.

Claude Opus · distillation · local LLM deployment

7 min read

Old Zhang's AI Learning

Mar 19, 2026 · Artificial Intelligence

Testing the Hot oMLX on Mac: Claude‑Opus‑4.6 Distilled and Qwen3.5‑9B Performance Review

The article evaluates oMLX, a Mac‑only LLM runtime built for Apple Silicon on top of MLX, by walking through installation, UI features, memory usage, single‑request speed, benchmark results for Claude‑Opus‑4.6 and Qwen3.5‑9B, continuous batch processing gains, Claude Code optimizations, multi‑model support, and the failure to run a 27B model.

Apple Silicon · Claude Opus · MLX

9 min read

Old Zhang's AI Learning

Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.

Claude Opus · Docker · LLM inference

5 min read

Old Zhang's AI Learning

Mar 16, 2026 · Artificial Intelligence

Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code

The article evaluates the GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio, detailing model sizes, performance metrics, deployment steps, API integration with Claude Code, and concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.

Claude Opus · GGUF · LM Studio

12 min read

Old Zhang's AI Learning

Mar 9, 2026 · Artificial Intelligence

Deploying Qwen3.5 with vLLM: Full-Precision and Quantized Versions, Concurrency Benchmarks, and Scripts

The article walks through upgrading vLLM to 0.17.0, configuring Docker containers for 4090 GPUs, comparing FP8 and 4‑bit quantization of Qwen3.5 35B and 27B models, and presents detailed performance numbers and script parameters that reveal trade‑offs in memory usage and throughput.

4-bit quantization · Docker · FP8

7 min read

Old Zhang's AI Learning

Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention

12 min read

Old Zhang's AI Learning

Mar 4, 2026 · Artificial Intelligence

How to Turn Thinking Mode On or Off for Qwen3.5 Models in Ollama, LM Studio, llama.cpp, and vLLM

This guide shows step‑by‑step how to enable or disable the thinking mode of Qwen3.5 series large language models across Ollama, LM Studio (GGUF and MLX), llama.cpp, and vLLM/SGLang using command‑line flags, custom model YAML files, and API parameters.

LM Studio · Ollama · Thinking mode

4 min read

Alibaba Cloud Infrastructure

Mar 4, 2026 · Cloud Computing

How to Deploy Qwen 3.5‑Plus with CoPaw on Alibaba Cloud ACK/ACS via Agent Sandbox

This guide walks you through deploying the Qwen 3.5‑Plus model on Alibaba Cloud ACK/ACS using the ACS Agent Sandbox, creating a CoPaw sandbox, configuring model access, integrating with DingTalk, and optionally using the sandbox’s pause‑and‑wake features.

ACK · ACS · Agent Sandbox

13 min read

Alibaba Cloud Native

Mar 3, 2026 · Cloud Native

Deploy Alibaba's Qwen3.5-397B Model in Minutes with Serverless Function Compute

This guide explains how to quickly deploy the new Qwen3.5-397B-A17B open‑source large model using Alibaba Cloud Function Compute's serverless GPU service, covering model features, deployment steps, required commands, and performance benefits.

AI · Cloud Native · Function Compute

5 min read

Old Zhang's AI Learning

Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.

AI models · GGUF · Unsloth

19 min read

AI Engineering

Mar 3, 2026 · Artificial Intelligence

Alibaba Qwen‑3.5 Small Models: 0.8B Parameters Enable Video on Edge Devices

Alibaba released four Qwen‑3.5 models (0.8B‑9B) that use a Gated DeltaNet hybrid‑attention architecture and native multimodal training to achieve 262k‑token contexts, outperform larger rivals on visual, reasoning, and math benchmarks, and run video analysis on phones and laptops, though they still demand significant VRAM.

Edge AI · Gated DeltaNet · benchmark

6 min read

Old Zhang's AI Learning

Mar 2, 2026 · Artificial Intelligence

Qwen3.5 Small Models Unveiled: From 0.8B to 9B with Full Capabilities

The article introduces the newly released Qwen3.5 small model series (0.8B, 2B, 4B, 9B), explains their shared Gated Delta Networks architecture, early multimodal token fusion, 201‑language support and up to 1 million‑token context, and presents benchmark data that show the 9B model rivaling much larger LLMs, followed by practical guidance on model selection and deployment.

Gated Delta Networks · Multimodal · benchmark

10 min read

Old Zhang's AI Learning

Mar 2, 2026 · Artificial Intelligence

Why the Qwen3.5 Series Makes Qwen3.5-27B the No‑Brainer Choice

The author reviews the Qwen3.5 model family, showing that the 27‑billion‑parameter dense Qwen3.5-27B offers the best balance of size, stability, low‑cost local deployment, and comprehensive capabilities, making it the default pick for most users.

AI benchmarking · Large Language Model · RTX 4090

6 min read

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

How to Disable Thinking Output in Qwen3.5 Models Using LM Studio

This guide explains how to turn off the reasoning (thinking) output of Qwen3.5 series large language models in LM Studio by creating a virtual “-no‑thinking” model directory, editing a model.yaml file, and handling common pitfalls and error messages.

AI model configuration · LM Studio · disable thinking

8 min read

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

Large Language Model · MoE · inference

14 min read

Alibaba Cloud Infrastructure

Feb 23, 2026 · Cloud Native

Deploying Qwen 3.5 Multimodal Model on Alibaba Cloud ACK with RoleBasedGroup

This guide details how to deploy the open‑source Qwen 3.5‑397B‑A17B multimodal LLM on Alibaba Cloud ACK using the RoleBasedGroup (RBG) engine, covering model preparation, Kubernetes resources, role‑based orchestration, performance tuning, and benchmark testing.

Cloud Native AI · Kubernetes · RoleBasedGroup

24 min read

SuanNi

Feb 21, 2026 · Artificial Intelligence

How Qwen3.5 Packs 397B Parameters Yet Activates Only 17B – A Deep Dive into Its Multimodal Architecture

Qwen3.5-397B-A17B is an open‑source multimodal model that unifies vision and language through a hybrid architecture and asynchronous RL framework, achieving trillion‑scale performance with only 17 B active parameters, dramatically improving efficiency, language coverage, and benchmark rankings.

AI research · qwen3.5

8 min read

Fun with Large Models

Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 397‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support. The article also outlines its deployment requirements and API pricing.

AI inference · FP8 · Sparse MoE

11 min read

Old Zhang's AI Learning

Feb 17, 2026 · Artificial Intelligence

Running Qwen3.5 Locally: Step‑by‑Step Guide with Unsloth Dynamic Quantization

This article explains how to run the 397B Qwen3.5 model on a Mac by using Unsloth Dynamic 2.0 quantization (2‑bit, 3‑bit, or 4‑bit), outlines hardware requirements, provides compilation and download commands for llama.cpp, shows how to launch inference in thinking and non‑thinking modes, and compares several deployment options such as llama‑server, Transformers, SGLang/vLLM, and MLX.

Dynamic Quantization · GGUF · LLM deployment

14 min read

Alibaba Cloud Big Data AI Platform

Feb 17, 2026 · Artificial Intelligence

Deploy Alibaba’s Qwen3.5‑397B‑A17B Model in One Click with PAI‑Model Gallery

Alibaba's open‑source Qwen3.5‑397B‑A17B model, featuring 397 billion parameters and a hybrid Gated Delta Network/MoE architecture, delivers superior performance and reduced memory usage, and can be deployed instantly through the PAI‑Model Gallery with step‑by‑step guidance and enterprise‑grade security.

AI inference · Alibaba Cloud · Large Language Model

5 min read

Machine Learning Algorithms & Natural Language Processing

Feb 16, 2026 · Artificial Intelligence

Alibaba’s Qwen 3.5‑Plus: 397 B Open‑Source Model Beats Gemini‑3 and GPT‑5.2 at Low Cost

Alibaba released the Qwen 3.5‑Plus open‑source large model (397 B total parameters, 17 B active) that outperforms top closed‑source models such as Gemini‑3‑Pro and GPT‑5.2 on multiple benchmarks, offers native multimodal understanding, supports 201 languages, cuts deployment memory by 60 %, raises inference throughput by up to 19×, and is priced at only 0.8 CNY per million tokens.

AI · Large Language Model · Multimodal

15 min read

Old Zhang's AI Learning

Feb 16, 2026 · Artificial Intelligence

Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article provides a detailed analysis of Qwen3.5, covering its multimodal MoE design, massive inference speedups, extensive benchmark results against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.

FP8 training · Large Language Model · benchmark

13 min read

AI Engineering

Feb 16, 2026 · Artificial Intelligence

Qwen3.5-397B: 397B‑Parameter Multimodal LLM Boosts Inference Speed 8‑19×

Alibaba’s Qwen3.5-397B-A17B, a 397‑billion‑parameter open‑source multimodal LLM, combines mixed linear attention with a sparse MoE architecture to achieve 8.6‑19× higher decoding throughput than Qwen3‑Max, supports 201 languages, and can be deployed via vLLM, Docker, Transformers, or SGLang with various optimization presets.

Inference Optimization · Large Language Model · Sparse MoE

8 min read