
Deploying DeepSeek R1 671B Model Locally with Ollama and Dynamic Quantization

This guide explains how to download, quantize, and run the full‑size 671‑billion‑parameter DeepSeek R1 model on local hardware using Ollama, covering model selection, hardware requirements, step‑by‑step deployment commands, optional web UI setup, performance observations, and practical recommendations.


The article introduces the motivation for running DeepSeek R1 locally, emphasizing that the full 671B MoE model (720 GB) is too large for most users and that Unsloth AI’s dynamic quantization can shrink it to a manageable size (as low as 131 GB) for consumer‑grade hardware.

Model selection: Two quantized variants are tested – DeepSeek‑R1‑UD‑IQ1_M (1.73‑bit dynamic quantization, 158 GB) and DeepSeek‑R1‑Q4_K_M (4‑bit standard quantization, 404 GB). Unsloth AI provides four dynamically quantized versions ranging from 1.58‑bit to 2.51‑bit.

Hardware requirements: A combined RAM + VRAM of ≥200 GB for the 1.73‑bit model and ≥500 GB for the 4‑bit model. Example workstation: 4× RTX 4090 (24 GB VRAM each), 4‑channel DDR5‑5600 RAM (4 × 96 GB), ThreadRipper 7980X CPU (64 cores).
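As a quick sanity check, the minimums above can be encoded in a few lines of Python. The thresholds come from this article; the example RAM/VRAM figures in the calls are hypothetical — substitute your own (e.g. from `free -h` and `nvidia-smi`).

```python
# Minimum combined RAM + VRAM (GB) per quantized variant, as quoted above.
REQUIRED_GB = {
    "DeepSeek-R1-UD-IQ1_M": 200,  # 1.73-bit dynamic quantization, 158 GB file
    "DeepSeek-R1-Q4_K_M": 500,    # 4-bit standard quantization, 404 GB file
}

def fits(variant: str, ram_gb: float, vram_gb: float) -> bool:
    """Return True if combined RAM + VRAM meets the minimum for `variant`."""
    return ram_gb + vram_gb >= REQUIRED_GB[variant]

# Hypothetical machine: 160 GB RAM + 2 GPUs with 24 GB VRAM each.
print(fits("DeepSeek-R1-UD-IQ1_M", ram_gb=160, vram_gb=48))  # True: 208 GB >= 200 GB
print(fits("DeepSeek-R1-Q4_K_M", ram_gb=160, vram_gb=48))    # False: 208 GB < 500 GB
```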

Deployment steps (Linux/macOS/Windows):

1. Download model files

Obtain the .gguf files from Hugging Face (Unsloth AI's repository) and merge the split parts into a single file.
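A minimal sketch of this step, assuming Unsloth AI's `unsloth/DeepSeek-R1-GGUF` Hugging Face repository and illustrative split‑file names (check the repository listing for the actual names and split count); the merge uses llama.cpp's `llama-gguf-split` tool:

```shell
# Download only the 1.73-bit variant's split files (file names are illustrative).
pip install -U huggingface_hub
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir ./DeepSeek-R1-GGUF

# Merge the split parts into a single .gguf (llama-gguf-split ships with llama.cpp;
# pass the first split and the desired output path).
./llama.cpp/llama-gguf-split --merge \
  ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf \
  ./DeepSeek-R1-UD-IQ1_M.gguf
```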

2. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

3. Create a Modelfile

For the 1.73‑bit model (DeepSeek‑R1‑UD‑IQ1_M):

FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"

For the 4‑bit model (DeepSeek‑R1‑Q4_K_M):

FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
PARAMETER num_gpu 8
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"

Adjust the FROM path, num_gpu, and num_ctx according to your hardware.
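If you are unsure where to start with num_gpu, one rough heuristic is to divide the model file size by its layer count and see how many layers fit into total VRAM after reserving headroom. This is not Ollama's own allocation logic, and the ~61‑layer figure for DeepSeek R1 is an assumption — treat the result as a starting point and lower it if you hit out‑of‑memory errors.

```python
import math

def estimate_num_gpu(model_size_gb: float, num_layers: int,
                     total_vram_gb: float, headroom_gb: float = 10.0) -> int:
    """Rough starting point for num_gpu: how many evenly-sized layers fit
    into VRAM after reserving headroom for the KV cache and CUDA overhead."""
    per_layer_gb = model_size_gb / num_layers
    return max(0, math.floor((total_vram_gb - headroom_gb) / per_layer_gb))

# 1.73-bit model (158 GB, ~61 layers assumed) on 4x RTX 4090 (96 GB VRAM total):
print(estimate_num_gpu(158, 61, 96))
```

For the 158 GB model on 96 GB of VRAM this suggests starting around 33 layers; the article's more conservative num_gpu 28 leaves extra headroom for the KV cache.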

4. Create the Ollama model

ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile

Ensure the Ollama model directory has enough space or change its location.
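If the default directory is too small, Ollama honors the OLLAMA_MODELS environment variable. The path below is illustrative, and the variable must be visible to the ollama server process (e.g. set it in the systemd unit on Linux):

```shell
# Relocate Ollama's model store to a larger disk (illustrative path),
# then restart the ollama server so it picks up the new location.
export OLLAMA_MODELS=/mnt/bigdisk/ollama/models
```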

5. Run the model

ollama run DeepSeek-R1-UD-IQ1_M --verbose

The --verbose flag reports generation speed in tokens per second. If memory or CUDA errors occur, go back to steps 3–4, lower num_gpu or num_ctx in the Modelfile, and re-create the model.
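Besides the interactive CLI, a running model can also be queried over Ollama's local REST API (default port 11434); the prompt here is just an example:

```shell
# One-shot generation request against the model created in step 4.
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```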

6. (Optional) Install a web UI

pip install open-webui
open-webui serve

Additional tuning parameters such as num_gpu (the number of model layers offloaded to the GPUs) and num_ctx (the context‑window size) are explained, with typical values for a 4× RTX 4090 setup.

Observations: The 1.73‑bit model delivers faster inference (7–8 tokens/s on short prompts) and lower resource usage than the 4‑bit version, and both outperform the distilled 8B–70B variants. The 4‑bit model is more conservative, refusing risky prompts more often. CPU and memory utilization dominate during inference while the GPUs stay comparatively idle, indicating a memory‑bandwidth bottleneck.
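The bandwidth‑bottleneck observation is consistent with a back‑of‑envelope estimate: each generated token must stream the active expert weights from memory. The figures below are approximations (DeepSeek R1's MoE routing activates roughly 37 B of its 671 B parameters per token), not measurements from this article:

```python
# Upper bound on tokens/s when streaming weights from RAM is the bottleneck.
ACTIVE_PARAMS = 37e9          # ~37B parameters active per token (MoE routing)
BITS_PER_PARAM = 1.73         # dynamic quantization of the IQ1_M variant
bytes_per_token = ACTIVE_PARAMS * BITS_PER_PARAM / 8  # ~8 GB read per token

# 4-channel DDR5-5600: 5600 MT/s * 8 bytes/channel * 4 channels
bandwidth_gbs = 5600e6 * 8 * 4 / 1e9   # ~179 GB/s

print(bandwidth_gbs / (bytes_per_token / 1e9))  # theoretical ceiling in tokens/s
```

A ceiling of roughly 22 tokens/s from ~179 GB/s of DDR5 bandwidth, against the observed 7–8 tokens/s, supports the conclusion that memory bandwidth rather than compute is the limiting factor.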

Conclusion & recommendations: For users unable to fit the entire model into GPU memory, the 1.73‑bit dynamically quantized version offers the best trade‑off of speed, resource consumption, and quality. It is best suited for short‑text generation or single‑turn dialogues on consumer hardware; longer contexts degrade speed to 1–2 tokens/s.

Readers are encouraged to share their deployment experiences and questions in the comments.

Tags: AI, DeepSeek, Large Language Model, Dynamic Quantization, Local Deployment, Ollama
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
