
Add Step‑Level Diagnostics to PyTorch Training in Three Lines with TraceML

TraceML is a lightweight, step‑level profiler for PyTorch training. It needs only a few code changes (initializing the library and wrapping each training step) to produce real‑time diagnostics and a compact JSON summary, helping engineers quickly identify whether data loading, forward, backward, or optimizer phases dominate execution time.

DeepHub IMBA

May 21, 2026

Problem

Within a single training step, it is often unclear how time is split among data loading, the forward pass, the backward pass, and the optimizer update. Existing tools report GPU utilization (nvidia-smi), loss curves (W&B, MLflow, TensorBoard), or kernel‑level traces (PyTorch Profiler, Nsight Systems), but none of them gives lightweight, step‑granular visibility with minimal configuration.

Existing tooling gap

System monitors show overall GPU usage.

Experiment trackers show loss curves and run history.

Heavyweight profilers expose kernels and timelines after a problem is already suspected.

TraceML approach

TraceML turns a single step boundary into structured diagnostic data. Integration requires only three lines of Python code:

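import traceml

# Initialize TraceML once, before the training loop.
traceml.init(mode="auto")

# model, optimizer, criterion, and dataloader are assumed to be defined earlier in train.py.
for batch in dataloader:
    # Wrap the whole step so it is recorded as one unit.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()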

Run the script with:

traceml run train.py
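To make concrete what a step‑level breakdown measures, the following is a hand‑rolled sketch in plain Python. It is not TraceML's implementation: the `timed` helper, the `phase_totals` dictionary, and the `fake_work` stand‑in are all hypothetical names introduced for illustration, and the sleeps merely simulate the four phases.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical helper, not part of TraceML: accumulated wall-clock seconds per phase.
phase_totals = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the elapsed wall-clock time of the enclosed block to phase_totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start


def fake_work(seconds):
    """Stand-in for real work such as fetching a batch or running a forward pass."""
    time.sleep(seconds)


# Three simulated training steps; the durations are made up for the example.
for _ in range(3):
    with timed("dataloader"):
        fake_work(0.02)
    with timed("forward"):
        fake_work(0.01)
    with timed("backward"):
        fake_work(0.01)
    with timed("optimizer"):
        fake_work(0.005)

# Convert the accumulated seconds into a percentage share per phase.
total = sum(phase_totals.values())
phase_pct = {name: 100.0 * spent / total for name, spent in phase_totals.items()}
dominant = max(phase_pct, key=phase_pct.get)

for name, pct in phase_pct.items():
    print(f"{name:<10} {pct:5.1f}%")
print("dominant phase:", dominant)
```

On a real GPU job, naive wall‑clock timing like this would be misleading because CUDA kernels launch asynchronously; each phase would need a `torch.cuda.synchronize()` call before the timer is read, which is one reason to prefer a purpose‑built tool.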

Runtime behavior

While the script runs, TraceML opens a real‑time terminal view beside the logs and records timestamps, memory usage, rank information, and system signals for each step.

In one example, a single‑GPU PyTorch job is classified as compute‑bound: the backward pass consumes the majority of step time, and the memory panel shows reserved memory rising steadily.
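A steadily rising reserved‑memory reading is the kind of signal that can be checked with very little code. The sketch below is an illustration only and does not reflect how TraceML detects the trend: the function name, the threshold, and the sample readings are all assumptions. In a real job the per‑step readings could come from `torch.cuda.memory_reserved()`.

```python
def is_steadily_increasing(values, min_growth_fraction=0.9):
    """Return True if at least min_growth_fraction of consecutive readings grow.

    Hypothetical heuristic for illustration; the 0.9 threshold is arbitrary.
    """
    if len(values) < 2:
        return False
    rises = sum(1 for prev, cur in zip(values, values[1:]) if cur > prev)
    return rises / (len(values) - 1) >= min_growth_fraction


# Made-up per-step reserved-memory readings in MiB.
growing_mib = [1024, 1040, 1056, 1072, 1090, 1104]
stable_mib = [1024, 1024, 1030, 1024, 1024, 1024]

print("growing run flagged:", is_steadily_increasing(growing_mib))
print("stable run flagged:", is_steadily_increasing(stable_mib))
```

A count of rising steps is deliberately crude. It ignores the size of each increase, so a production check would likely also look at total growth over the window.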

Structured output

At the end of the run TraceML writes a compact final_summary.json file instead of a large trace dump. A simplified excerpt looks like:

{
  "step_time": {
    "diagnosis": "INPUT_BOUND",
    "dataloader_pct": 47.0,
    "forward_pct": 31.0,
    "backward_pct": 18.0,
    "optimizer_pct": 4.0
  },
  "step_memory": {
    "diagnosis": "BALANCED"
  }
}

Each diagnosis maps directly to the next thing to investigate, for example data‑loader throughput, compute dominance, variance across ranks, memory pressure, or an overall balanced profile.
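Because the summary is plain JSON, it can be consumed by a few lines of script. The sketch below parses the simplified excerpt shown above, assuming the real file uses the same field names. The `NEXT_STEP` mapping and its wording are hypothetical, and it covers only the two diagnosis labels that appear in the excerpt.

```python
import json

# The simplified excerpt from this article; a real script would read final_summary.json.
summary_text = """
{
  "step_time": {
    "diagnosis": "INPUT_BOUND",
    "dataloader_pct": 47.0,
    "forward_pct": 31.0,
    "backward_pct": 18.0,
    "optimizer_pct": 4.0
  },
  "step_memory": {
    "diagnosis": "BALANCED"
  }
}
"""

summary = json.loads(summary_text)
step_time = summary["step_time"]

# Collect the per-phase percentages, dropping the "_pct" suffix from each key.
suffix = "_pct"
phases = {k[: -len(suffix)]: v for k, v in step_time.items() if k.endswith(suffix)}
dominant_phase = max(phases, key=phases.get)

# Hypothetical mapping from diagnosis label to a follow-up action.
# Only labels seen in the excerpt are listed; others fall through to a default.
NEXT_STEP = {
    "INPUT_BOUND": "inspect the data pipeline (workers, decoding, storage throughput)",
    "BALANCED": "no single phase dominates; deeper profiling may not be needed",
}

diagnosis = step_time["diagnosis"]
advice = NEXT_STEP.get(diagnosis, "unrecognized label; check the TraceML documentation")

print(f"diagnosis: {diagnosis}")
print(f"dominant phase: {dominant_phase} ({phases[dominant_phase]}%)")
print(f"next step: {advice}")
```

The same pattern could gate a CI job, for instance by failing the build when the dominant phase changes between commits.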

Cross‑run comparison

The JSON can be logged to W&B or MLflow, stored as a CI artifact, and compared across runs using the built‑in workflow:

traceml compare run_a.json run_b.json

When a performance regression occurs, the question shifts from “did throughput drop?” to “where did the time shift?”, and the information captured in final_summary.json supports this analysis.
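The "where did the time shift?" question can be illustrated with a small hand‑written comparison. This is not the implementation of `traceml compare`, whose output format is not described here. The `phase_shift` function is a hypothetical helper, the `run_a` values are made up, and `run_b` reuses the percentages from the excerpt above.

```python
def phase_shift(before, after):
    """Return (phase_key, delta) pairs in percentage points, largest shift first.

    Hypothetical helper for illustration; only keys ending in "_pct" that
    appear in both summaries are compared.
    """
    keys = sorted(k for k in before if k.endswith("_pct") and k in after)
    deltas = {k: round(after[k] - before[k], 2) for k in keys}
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up baseline run.
run_a = {"dataloader_pct": 20.0, "forward_pct": 40.0,
         "backward_pct": 34.0, "optimizer_pct": 6.0}
# Values from the final_summary.json excerpt in this article.
run_b = {"dataloader_pct": 47.0, "forward_pct": 31.0,
         "backward_pct": 18.0, "optimizer_pct": 4.0}

shifts = phase_shift(run_a, run_b)
for key, delta in shifts:
    print(f"{key:<16} {delta:+.1f} pp")
```

Sorting by absolute change puts the phase that moved most at the top, which is usually the first place to look after a regression.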

Scope

TraceML does not replace kernel‑level profilers; tools such as PyTorch Profiler or Nsight Systems are still required for detailed CUDA timelines or NCCL behavior. TraceML acts as a lightweight “tool zero” that quickly classifies a run and indicates whether deeper profiling is warranted.

Installation and availability

pip install traceml-ai
traceml run train.py

Open‑source repository: https://github.com/traceopt-ai/traceml

TraceML currently supports single‑GPU training and single‑node DDP/FSDP; multi‑node support is forthcoming.
