Common AI Model Formats Developers Use: GGUF, PyTorch, Safetensors, and ONNX

Developers encounter several AI model formats: GGUF, PyTorch (.pt/.pth), Safetensors, and ONNX. Each has its own structure, strengths, weaknesses, and hardware support. This article compares their metadata layout, quantization options, security considerations, and suitability for different deployment scenarios.

Smart Era Software Development

Developers who use AI models from Hugging Face face a growing variety of model formats. The four most common are GGUF, PyTorch (.pt/.pth), Safetensors, and ONNX; they are compared here in terms of structure, quantization options, security, and hardware compatibility.

GGUF

GGUF was created for the llama.cpp project (https://github.com/ggml-org/llama.cpp). It is a binary, single‑file format designed for fast loading and saving via mmap(). Models are typically built in PyTorch or other frameworks and then converted to GGUF for use with the GGML library.
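To make the mmap() point concrete, here is a minimal sketch using Python's standard mmap module. It is only an illustration of the idea: llama.cpp itself is written in C/C++, and the temporary file of 1,000 float32 values is a made-up stand-in for a weights file.

```python
import mmap
import os
import struct
import tempfile


def read_float_at(path: str, index: int) -> float:
    """Read one little-endian float32 from a file without loading the rest."""
    with open(path, "rb") as f:
        # Map the file read-only; the OS pages data in only when it is touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from("<f", mm, index * 4)[0]


# Hypothetical stand-in for a weights file: 1,000 float32 values (0.0 .. 999.0).
values = [float(i) for i in range(1000)]
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(struct.pack("<1000f", *values))
tmp.close()

picked = read_float_at(tmp.name, 750)
print(picked)  # 750.0
os.remove(tmp.name)
```

Because the mapping is lazy, the cost of opening the file does not grow with its size; only the pages that are actually read are brought into memory.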

Inference runtimes that support GGUF include llama.cpp, Ollama (https://ollama.com/), and vLLM (https://github.com/vllm-project/vllm). GGUF can also store diffusion models, which stable-diffusion.cpp (https://github.com/leejet/stable-diffusion.cpp) can run, though this is less common.

GGUF files consist of three parts:

Metadata stored as key‑value pairs (architecture, version, hyper‑parameters, etc.).

Tensor metadata describing shape, data type, and name of each tensor.

The raw tensor data.
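The three parts are preceded by a small fixed-size header that says how many metadata entries and tensors follow. The sketch below assumes the layout used by recent GGUF versions (a 4-byte magic "GGUF", a uint32 version, then uint64 tensor and key-value counts, all little-endian). The counts 291 and 24 are invented for the example, and parse_gguf_header is a hypothetical helper, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed header layout: magic, version, tensor_count, metadata_kv_count.
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF byte buffer."""
    magic, version, tensor_count, kv_count = HEADER.unpack_from(buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header bytes; a real file would continue with the key-value
# metadata, the tensor descriptions, and finally the raw tensor data.
fake_file = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(fake_file)
print(header)
```

A reader would use the two counts to know how many key-value pairs and tensor descriptions to decode before reaching the tensor data.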

[Figure: GGUF format diagram]

GGUF and the GGML library provide flexible quantization schemes that reduce storage size while retaining most of the model's accuracy. Common schemes are:

Q4_K_M: most tensors quantized to 4 bits, some to 6 bits (the most common choice).

IQ4_XS: almost all tensors quantized to 4 bits, with an importance matrix used for calibration.

IQ2_M: similar to IQ4_XS but uses 2-bit quantization, suitable for very limited memory.

Q8_0: all tensors quantized to 8 bits, offering near-original precision.
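As a rough illustration of how block quantization works, the sketch below follows the idea behind Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit values. This is a simplified sketch, not ggml's implementation; in particular it keeps the scale as a Python float, whereas the stored format is assumed to use a 2-byte fp16 scale.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values


def quantize_block(block):
    """Return (scale, int8 values) for one block of floats."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0
    # Clamp to guard against floating-point overshoot at the extremes.
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the scale and integer values."""
    return [scale * v for v in q]


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage per block: 32 one-byte values plus an assumed 2-byte fp16 scale.
bits_per_weight = (BLOCK_SIZE * 1 + 2) * 8 / BLOCK_SIZE
print(bits_per_weight)  # 8.5
```

The rounding error per weight is at most half a quantization step (scale / 2), which is why 8-bit schemes stay close to the original precision while lower-bit schemes trade more accuracy for size.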

[Figure: GGUF Llama-3.1 8B model example]

Pros:

Simple single‑file format, easy to share.

Fast loading and saving via mmap().

Efficient storage with flexible quantization.

Portable binary readable without special libraries.

Cons:

Most models must be converted from other formats (e.g., PyTorch, Safetensors).

Not all models are convertible; some are unsupported by llama.cpp.

Modifying or fine‑tuning a GGUF file is difficult.

Typical use cases are production model serving, where fast load times are critical, and community model sharing, where the single-file format keeps distribution simple.

Useful resources: llama.cpp repository (https://github.com/ggml-org/llama.cpp), gguf‑my‑repo HF space (https://hf.co/spaces/ggml-org/gguf-my-repo), Ollama integration (https://ollama.com/).

PyTorch (.pt/.pth)

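The .pt and .pth extensions are the conventional file extensions for PyTorch's default serialization format. A checkpoint can store a model's state dictionary (weights, biases), optimizer state, and training metadata.

The two extensions are interchangeable. What the file contains depends on what is passed to torch.save():

Entire model: saving the model object stores the parameters together with a pickled reference to the model class.

State dictionary only: saving model.state_dict() stores only the parameters and some metadata. This is the approach the PyTorch documentation recommends.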

PyTorch serialization relies on Python's pickle module. The plain-pickle example below shows the mechanism:

import pickle
model_state_dict = {"layer1": "hello", "layer2": "world"}
# The context manager closes the file once the dump completes.
with open("model.pkl", "wb") as f:
    pickle.dump(model_state_dict, f)

Loading the file:

import pickle
with open("model.pkl", "rb") as f:
    model_state_dict = pickle.load(f)
print(model_state_dict)
# Output: {'layer1': 'hello', 'layer2': 'world'}

Limitations:

Security: deserialization can execute arbitrary code, so a malicious checkpoint can carry a back door (see the Snyk article: https://snyk.io/articles/python-pickle-poisoning-and-backdooring-pth-files/). Passing weights_only=True to torch.load() restricts what can be unpickled and reduces this risk.

Efficiency: by default the whole file is deserialized at once, with no lazy or partial loading, which leads to slower load times and higher memory usage for large models.

Portability: tied to Python, making cross-language sharing difficult.
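The security limitation is easy to demonstrate with plain pickle. In the sketch below, the class NotReallyWeights is a made-up stand-in for an object hidden in a checkpoint. Its payload is deliberately harmless (it evaluates "6 * 7"), but a real attack could call any function available to the Python process.

```python
import pickle


class NotReallyWeights:
    """Hypothetical stand-in for a malicious object inside a checkpoint."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # An attacker could name os.system or similar here instead.
        return (eval, ("6 * 7",))


payload = pickle.dumps(NotReallyWeights())

# The loader never asked to run eval; the file's author chose it.
result = pickle.loads(payload)
print(result)  # 42
```

Note that the loader gets back the result of the call rather than a NotReallyWeights instance, and the class does not even need to be importable on the loading side. This is why checkpoints from untrusted sources should not be unpickled.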

When working exclusively in Python/PyTorch, this format may be appropriate, but the community is shifting toward more efficient and secure formats such as GGUF and Safetensors.

Useful resources: PyTorch documentation (https://pytorch.org/docs/stable/generated/torch.save.html) and the ExecuTorch project for converting to .pte for mobile/edge (https://github.com/pytorch/executorch).

Safetensors

Safetensors, developed by Hugging Face (https://hf.co/docs/safetensors/en/index), addresses the security and efficiency problems of traditional Python serialization. It uses a restricted deserialization process that prevents code execution.

A safetensors file contains:

A JSON header describing each tensor's name, shape, data type, and byte offsets, plus optional custom metadata.

The raw tensor data.
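The two-part layout can be sketched with only the standard library. The code below assumes the on-disk layout is an 8-byte little-endian header length, then the JSON header, then the tensor bytes, with data_offsets measured from the start of the data section. build_safetensors and read_tensor_bytes are hypothetical helpers written for this example; real code would normally use the safetensors library instead.

```python
import json
import struct


def build_safetensors(tensors: dict) -> bytes:
    """Pack {name: (dtype, shape, raw_bytes)} into the assumed layout."""
    header, data, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        data += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def read_header(blob: bytes):
    """Return (parsed JSON header, byte position where tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    return header, 8 + header_len


def read_tensor_bytes(blob: bytes, name: str) -> bytes:
    """Slice out one tensor's bytes without touching the others."""
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]


blob = build_safetensors({
    "layer1.weight": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)),
    "layer1.bias": ("F32", [2], struct.pack("<2f", 0.5, -0.5)),
})
bias = struct.unpack("<2f", read_tensor_bytes(blob, "layer1.bias"))
print(bias)  # (0.5, -0.5)
```

Reading involves only JSON parsing and byte slicing, so nothing in the file can trigger code execution, and a single tensor can be fetched by offset without reading the rest.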

[Figure: Safetensors format diagram]

Pros:

Safe: restricted deserialization prevents code‑execution vulnerabilities.

Fast: supports lazy loading and partial loading, often using mmap().

Efficient: supports quantized tensors.

Portable: language‑agnostic, easing cross‑language model sharing.

Cons:

Quantization flexibility is lower than GGUF because PyTorch’s quantization support is limited.

Requires a JSON parser to read metadata, which can be problematic in low‑level languages lacking built‑in JSON support.

Safetensors is the default serialization format for the Transformers library (https://hf.co/docs/transformers/index). New models on Hugging Face, including Llama, Gemma, Phi, Stable Diffusion, and Flux, are generally released as safetensors.

Useful resources: Transformers documentation (https://hf.co/docs/transformers/quicktour), Bitsandbytes quantization guide (https://hf.co/docs/transformers/en/quantization/bitsandbytes), and the MLX community HF space for Apple‑chip compatible models (https://hf.co/mlx-community).

ONNX

Open Neural Network Exchange (ONNX) provides a vendor‑agnostic representation for machine‑learning models and is part of the ONNX ecosystem (https://onnx.ai/), which includes tools for interoperability between frameworks such as PyTorch, TensorFlow, and MXNet.

ONNX models are typically stored in a single .onnx file that contains the tensors, metadata, and computation graph (very large models may keep their weights in separate external data files).
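To show what "storing the computation graph" means, here is a conceptual sketch in plain Python. It does not use the onnx package or its API; the dictionaries only mirror the ideas of nodes with an op_type, named inputs and outputs, and stored weights (initializers). The graph computes Relu(X * W + B) on scalars for brevity.

```python
# Conceptual only: plain Python data, not the onnx package's API.
initializers = {"W": 2.0, "B": 1.0}  # stored weights (scalars for brevity)

nodes = [
    {"op_type": "Mul", "inputs": ["X", "W"], "outputs": ["XW"]},
    {"op_type": "Add", "inputs": ["XW", "B"], "outputs": ["Z"]},
    {"op_type": "Relu", "inputs": ["Z"], "outputs": ["Y"]},
]

# Each runtime supplies its own implementation of the named operators.
OPS = {
    "Mul": lambda a, b: a * b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: max(a, 0.0),
}


def run_graph(x: float) -> float:
    """Evaluate the graph for input X by visiting nodes in listed order."""
    values = dict(initializers, X=x)
    for node in nodes:  # nodes are listed in topological order
        args = [values[name] for name in node["inputs"]]
        values[node["outputs"][0]] = OPS[node["op_type"]](*args)
    return values["Y"]


print(run_graph(3.0))   # 7.0
print(run_graph(-2.0))  # 0.0
```

Because the file describes the operations as data rather than as framework-specific code, any runtime that implements the named operators can execute the model.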

[Figure: ONNX computation graph example]

Pros:

Flexibility: the embedded computation graph makes conversion between frameworks easier.

Portability: the ONNX ecosystem enables deployment on many platforms, including mobile and edge devices.

Cons:

Quantized tensor support is limited; ONNX decomposes quantized tensors into integer tensors plus scale factors, which can degrade quality.

Complex architectures may require operator fallbacks or custom implementations, potentially causing performance loss during conversion.

ONNX is a solid choice for inference on mobile devices or browsers.

Useful resources: onnx‑community HF space (https://hf.co/onnx-community), transformers.js for WebGPU/WebAssembly inference (https://github.com/huggingface/transformers.js), ONNX Runtime high‑performance engine (https://onnxruntime.ai/), and Netron visualizer (https://netron.app/).

Hardware Support Summary

CPU – GGUF (optimal), ONNX (full), PyTorch and Safetensors (partial).

GPU – All formats fully supported.

Mobile deployment – GGUF and ONNX supported; PyTorch via ExecuTorch (partial); Safetensors not supported.

Apple silicon – GGUF, ONNX, and Safetensors (via MLX) supported; PyTorch partial.

Conclusion

GGUF, PyTorch, Safetensors, and ONNX each have distinct advantages and trade‑offs. Selecting the appropriate format depends on the target use case and deployment hardware.
