Bringing eBPF Inside GPU Kernels: The bpftime for GPU Breakthrough

The article introduces bpftime for GPU, a tool that extends eBPF's programmable, low‑overhead observation capabilities into GPU kernels, explains its implementation pipeline, compares its performance against Nsight and NVBit, and outlines future enhancements for GPU profiling.

GPU · PTX · Profiling

0 likes · 13 min read

Alibaba Cloud Developer

Sep 8, 2025 · Fundamentals

How to Profile GPU Kernels with PTX Probes: From CUDA Basics to Custom Instrumentation

This article walks through GPU performance analysis, starting with CUDA architecture fundamentals, demonstrating matrix multiplication optimization, explaining PTX assembly, and introducing the Neutrino framework for programmable GPU probes that enable fine‑grained, custom instrumentation and detailed timing measurements of kernel execution.

CUDA · GPU · Neutrino

0 likes · 45 min read

Infra Learning Club

Feb 22, 2025 · Fundamentals

Understanding NVCC Compilation: A Step‑by‑Step Technical Guide

This article walks through the NVCC compilation pipeline, explaining how CUDA source files are transformed into host and device binaries, detailing file extensions, compilation stages, command‑line options, intermediate artifacts, and the role of registration functions such as __nv_cudaEntityRegisterCallback and __sti____cudaRegisterAll.

CUDA · GPU · PTX

0 likes · 12 min read

AI2ML AI to Machine Learning

Feb 8, 2025 · Artificial Intelligence

Analyzing DeepSeek R1 Inference Projects: Source Code, Cold‑Start, and Scaling Techniques

This article examines DeepSeek R1’s three breakthroughs, its low‑cost optimizations that bypass CUDA, and the resulting impact on the AI ecosystem. It then reviews seven open‑source reproductions in technical detail, including Open‑R1, Tiny‑Zero, SimpleScaling‑S1, and simpleRL‑reason, covering their architectures, reinforcement‑learning pipelines, and code implementations.

DeepSeek · Inference Scaling · Open Source

0 likes · 10 min read

Infra Learning Club

Jan 24, 2025 · Fundamentals

Inside NVCC: How CUDA Code Is Compiled and Linked

The article dissects NVCC’s compilation pipeline, showing how internal registration functions from host_runtime.h are injected into the host binary, how a simple CUDA demo is processed with --dryrun, and how the generated fatbin, PTX, and cubin files are linked and registered for GPU execution.

CUDA · FatBinary · GPU

0 likes · 10 min read
