Add Step‑Level Diagnostics to PyTorch Training in Three Lines with TraceML
TraceML is a lightweight, step‑level profiler for PyTorch training. It needs only a few code changes (initializing the library and wrapping each training step) to produce real‑time diagnostics and a compact JSON summary, so engineers can quickly see whether data loading, forward, backward, or optimizer phases dominate execution time.
Problem
During a training step, it is often unclear how time is split among data loading, the forward pass, the backward pass, and the optimizer. Existing tools provide GPU utilization (nvidia-smi), loss curves (W&B, MLflow, TensorBoard), or kernel‑level traces (PyTorch Profiler, Nsight Systems), but none of them gives lightweight, step‑granular visibility with minimal configuration.
Existing tooling gap
System monitors show overall GPU usage.
Experiment trackers show loss curves and run history.
Heavyweight profilers expose kernels and timelines after a problem is already suspected.
TraceML approach
TraceML turns a single step boundary into structured diagnostic data. Integration adds only three lines to an existing training loop: the import, the init call, and the trace_step wrapper.
import traceml
traceml.init(mode="auto")
for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
Run the script with:
traceml run train.py
Runtime behavior
While the script runs, TraceML opens a real‑time terminal view beside the logs and records timestamps, memory usage, rank information, and system signals for each step.
In one example, a single‑GPU PyTorch job is classified as compute‑bound: the backward pass consumes the majority of step time, and the memory panel shows reserved memory rising steadily.
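To make the idea of per‑phase step timing concrete, here is a minimal sketch in plain Python. It is not TraceML's implementation or API: the StepTimer class and its phase and percentages methods are hypothetical names invented for illustration, and the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper, not TraceML API)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Measure the elapsed wall-clock time of the enclosed block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def percentages(self):
        # Convert accumulated seconds into a share of the total measured time.
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: 100.0 * t / total for name, t in self.totals.items()}


timer = StepTimer()
with timer.phase("dataloader"):
    time.sleep(0.05)  # stand-in for fetching a batch
with timer.phase("forward"):
    time.sleep(0.01)  # stand-in for the forward pass
shares = timer.percentages()
print(shares)
```

On a GPU, CUDA kernels launch asynchronously, so host‑side timing like this only reflects device work if the code synchronizes (for example with torch.cuda.synchronize()) or uses CUDA events. That is one reason a dedicated tool is more convenient than hand‑rolled timers.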
Structured output
At the end of the run TraceML writes a compact final_summary.json file instead of a large trace dump. A simplified excerpt looks like:
{
  "step_time": {
    "diagnosis": "INPUT_BOUND",
    "dataloader_pct": 47.0,
    "forward_pct": 31.0,
    "backward_pct": 18.0,
    "optimizer_pct": 4.0
  },
  "step_memory": {
    "diagnosis": "BALANCED"
  }
}
Each diagnosis directly maps to the next investigative step (e.g., checking data‑loader load, computation dominance, rank variance, memory pressure, or overall balance).
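As a rough illustration of consuming this summary in a script, the sketch below parses the simplified excerpt and reports the phase with the largest share. It assumes the real file follows the excerpt's layout, which may not hold for the full schema; dominant_phase is a hypothetical helper, not part of TraceML.

```python
import json

# The simplified excerpt from the article, inlined so the sketch is self-contained.
SUMMARY_TEXT = """
{
  "step_time": {
    "diagnosis": "INPUT_BOUND",
    "dataloader_pct": 47.0,
    "forward_pct": 31.0,
    "backward_pct": 18.0,
    "optimizer_pct": 4.0
  },
  "step_memory": {"diagnosis": "BALANCED"}
}
"""


def dominant_phase(summary):
    """Return (phase_name, percent) for the largest *_pct entry under step_time."""
    step_time = summary["step_time"]
    shares = {
        key[: -len("_pct")]: value
        for key, value in step_time.items()
        if key.endswith("_pct")
    }
    name = max(shares, key=shares.get)
    return name, shares[name]


summary = json.loads(SUMMARY_TEXT)
phase, share = dominant_phase(summary)
diagnosis = summary["step_time"]["diagnosis"]
print(f"{diagnosis}: {phase} takes {share:.1f}% of step time")
```

In a real pipeline the text would come from reading final_summary.json rather than an inline string.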
Cross‑run comparison
The JSON can be logged to W&B or MLflow, stored as a CI artifact, and compared across runs using the built‑in workflow:
traceml compare run_a.json run_b.json
When a performance regression occurs, the question shifts from “did throughput drop?” to “where did the time shift?”, and the information captured in final_summary.json supports that analysis.
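The "where did the time shift?" question can be sketched as a per‑phase difference between two summaries. This is not how traceml compare is implemented, and phase_shift is a hypothetical helper. The run_a values come from the excerpt above, while the run_b values are made up purely for illustration.

```python
def phase_shift(before, after):
    """Percentage-point change per *_pct key present in both step_time dicts."""
    keys = sorted(k for k in before if k.endswith("_pct") and k in after)
    return {k: round(after[k] - before[k], 2) for k in keys}


# run_a: step_time shares from the article's excerpt.
run_a = {"dataloader_pct": 47.0, "forward_pct": 31.0,
         "backward_pct": 18.0, "optimizer_pct": 4.0}
# run_b: invented values, e.g. after a hypothetical data-loading fix.
run_b = {"dataloader_pct": 20.0, "forward_pct": 45.0,
         "backward_pct": 29.0, "optimizer_pct": 6.0}

shift = phase_shift(run_a, run_b)
largest = max(shift, key=lambda k: abs(shift[k]))
print(shift)
print("largest shift:", largest)
```

Because the values are shares of step time, a falling share in one phase necessarily raises the others; absolute step time is still needed to tell whether the run got faster overall.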
Scope
TraceML does not replace kernel‑level profilers; tools such as PyTorch Profiler or Nsight Systems are still required for detailed CUDA timelines or NCCL behavior. TraceML acts as a lightweight “tool zero” that quickly classifies a run and indicates whether deeper profiling is warranted.
Installation and availability
pip install traceml-ai
traceml run train.py
Open‑source repository: https://github.com/traceopt-ai/traceml
Currently supports single‑GPU and single‑node DDP/FSDP; multi‑node support is forthcoming.
DeepHub IMBA