
Performance Comparison and CPU Pinning Techniques for Enterprise‑Level Virtual Machine Instances

This article analyzes the performance instability of shared-type virtual machines, introduces enterprise-level instances with fixed CPU scheduling and NUMA-aware topology, details the key technologies applied (CPU pinning, PCI passthrough, multi-queue NICs), and presents sysbench and STREAM benchmark results demonstrating the superior isolation, stability, and performance of enterprise instances over shared ones.

360 Tech Engineering

Background – In the current internal cloud, shared-type VM instances use a 3-5× oversubscription model in which each vCPU is mapped to whichever physical CPU hyper-thread happens to be idle. This keeps per-instance cost low, but causes performance fluctuations under load.

Enterprise‑level instance characteristics – Enterprise instances adopt a fixed CPU scheduling mode: each vCPU is bound to a dedicated physical CPU hyper‑thread, eliminating inter‑instance CPU contention and providing stable compute performance, at a higher CPU cost.

Key technologies applied

NUMA topology and CPU affinity binding to VMs.

Fixed CPU scheduling (vCPU → specific physical CPU hyper‑thread).

Host‑process core binding to keep host services off the VM CPUs.

Separation of host machines for shared and enterprise instances.

NIC multi‑queue enablement.

PCI Passthrough for local NVMe SSDs.

Different network and cloud‑disk throttling rules per VM size.
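For the NIC multi-queue and PCI passthrough items, the corresponding libvirt domain XML might look like the following excerpt (the bridge name, queue count, and PCI address are illustrative placeholders, not values from the article):

<!-- virtio NIC with multi-queue enabled (8 queues) -->
<interface type='bridge'>
  <source bridge='br-int'/>
  <model type='virtio'/>
  <driver name='vhost' queues='8'/>
</interface>

<!-- PCI passthrough of a local NVMe SSD; the PCI address is a placeholder -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
  </source>
</hostdev>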

CPU pinning technique introduction – The physical servers used provide 80 logical CPUs (2 sockets × 20 cores × 2 hyper-threads) and are equipped with 768 GB of memory and dual-port Mellanox CX-5 NICs. In shared instances, vCPUs float across all 80 hyper-threads, while enterprise instances pin each vCPU to a specific hyper-thread and bind host services (such as nova-compute, neutron-openvswitch, neutron-l3) to reserved cores (e.g., 0, 1, 40, 41) to avoid contention.
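The host-process core binding can be implemented with cgroup cpusets or, per process, via the sched_setaffinity(2) syscall. A minimal Python sketch on Linux (the core IDs 0, 1, 40, 41 follow the article's example; pinning the current process stands in for a host service such as nova-compute):

```python
import os

def pin_process(pid: int, cpus: set[int]) -> set[int]:
    """Restrict `pid` to the given logical CPUs and return its new affinity."""
    os.sched_setaffinity(pid, cpus)   # Linux-only syscall wrapper
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    host_service_cores = {0, 1, 40, 41}   # reserved host cores from the article
    # Intersect with the CPUs actually available, so this also runs on small hosts.
    target = host_service_cores & os.sched_getaffinity(0)
    # pid 0 means "the calling process" for these syscalls.
    print(pin_process(0, target))
```

In production this is typically done once at service start (e.g., via systemd `CPUAffinity=`) rather than from inside the service.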

VM XML comparison

Shared instance XML (excerpt):

<memory unit='KiB'>16777216</memory>
<vcpu placement='static'>8</vcpu>
<cputune>
  <shares>8192</shares>
</cputune>
<cpu mode='host-passthrough' check='none'>
  <topology sockets='8' cores='1' threads='1'/>
</cpu>

Enterprise instance XML (excerpt):

<memory unit='KiB'>16777216</memory>
<vcpu placement='static'>8</vcpu>
<cputune>
  <shares>8192</shares>
  <vcpupin vcpu='0' cpuset='18'/>
  <vcpupin vcpu='1' cpuset='58'/>
  ...
  <emulatorpin cpuset='8,12-13,18,48,52-53,58'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='4' threads='2'/>
  <numa>
    <cell id='0' cpus='0-7' memory='16777216' unit='KiB'/>
  </numa>
</cpu>
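As a sanity check, the pinning map and topology in such XML can be verified programmatically. A small sketch using Python's standard xml.etree (the two vcpupin entries mirror the excerpt above; a real domain would carry all eight):

```python
import xml.etree.ElementTree as ET

XML = """
<domain>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='18'/>
    <vcpupin vcpu='1' cpuset='58'/>
  </cputune>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='4' threads='2'/>
  </cpu>
</domain>
"""

root = ET.fromstring(XML)
# Map each vCPU to the physical hyper-thread it is pinned to.
pins = {int(p.get("vcpu")): p.get("cpuset") for p in root.iter("vcpupin")}
topo = root.find("cpu/topology")
vcpus = (int(topo.get("sockets")) * int(topo.get("cores"))
         * int(topo.get("threads")))

print(pins)    # {0: '18', 1: '58'}
print(vcpus)   # 8, matching <vcpu>8</vcpu>
```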

Performance testing – Two benchmark tools were used:

sysbench cpu --cpu-max-prime=20000 --threads=8 --time=30 run (CPU prime calculation).

STREAM (memory bandwidth: Copy, Scale, Add, Triad).
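STREAM's Triad kernel computes a[i] = b[i] + s*c[i], so each element nominally moves three 8-byte doubles (read b, read c, write a) and bandwidth is 3 × 8 × N / t. A toy pure-Python version, orders of magnitude slower than the real C benchmark and shown only to illustrate the accounting (CPython lists are not contiguous arrays of doubles, so the byte count is nominal):

```python
import time

N = 200_000          # array length; real STREAM uses tens of millions
s = 3.0
b = [1.0] * N
c = [2.0] * N

t0 = time.perf_counter()
a = [b[i] + s * c[i] for i in range(N)]   # Triad: a = b + s*c
t1 = time.perf_counter()

bytes_moved = 3 * 8 * N                   # nominal STREAM accounting
print(f"Triad bandwidth: {bytes_moved / (t1 - t0) / 1e6:.1f} MB/s")
```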

Test environment – Two physical servers, each with 80 logical CPUs, 768 GB RAM, and dual-port 25 Gbps NICs. Server A hosts 9 enterprise instances (8 vCPU / 16 GB each) after reserving 4 physical cores (8 hyper-threads) for host processes; Server B runs 29 shared instances (8 vCPU / 16 GB each) at a ~3× oversell factor.
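The capacity accounting behind these two layouts works out as follows (a quick check of the article's numbers):

```python
logical_cpus = 2 * 20 * 2       # 2 sockets x 20 cores x 2 hyper-threads = 80

# Server A: enterprise, 1:1 pinning
reserved = 8                    # 4 physical cores (8 threads) kept for host services
enterprise_vcpus = 9 * 8        # 9 instances x 8 vCPU = 72 pinned threads
assert enterprise_vcpus == logical_cpus - reserved

# Server B: shared, oversubscribed
shared_vcpus = 29 * 8           # 29 instances x 8 vCPU = 232
oversell = shared_vcpus / logical_cpus
print(f"oversell factor: {oversell:.1f}x")   # ~2.9x, i.e. the stated ~3x
```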

Test scenarios and results

Single‑instance comparison: enterprise instance fully utilizes its 8 pinned cores (100 % CPU) without affecting other cores; shared instance shows fluctuating CPU usage across all cores.

Enterprise vs. shared under no load: CPU performance loss of enterprise vs. bare metal ≈ 16 %; shared vs. bare metal ≈ 1.5 %.

Memory bandwidth (STREAM) – shared instances achieve ~37 % of bare-metal bandwidth and enterprise instances ~25 %, a large gap the authors flag for further analysis.

Stress tests with multiple instances: 9 enterprise instances saturate their 72 pinned hyper-threads (100 %); 9 shared instances also saturate all logical CPUs, but with higher contention and latency.

When 29 shared instances (3× oversell) run alongside 9 enterprise instances, enterprise CPUs remain stable while shared CPUs exhibit significant performance variance.

Emulator thread issue – Multi‑queue NICs bind interrupt queues to separate cores, improving I/O throughput. However, both VCPU threads and emulator threads share the same pinned cores, leading to resource contention under heavy network load; separating them can improve latency but increases CPU overhead.
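One mitigation is to give the emulator/IO threads their own cpuset, disjoint from the vCPU pins, at the cost of dedicating extra host cores. A hypothetical variant of the earlier <cputune> block (core IDs illustrative; 8 and 48 here are reserved host cores, not vCPU pins):

<cputune>
  <vcpupin vcpu='0' cpuset='18'/>
  <vcpupin vcpu='1' cpuset='58'/>
  <!-- emulator threads moved off the vCPU cores onto reserved host cores -->
  <emulatorpin cpuset='8,48'/>
</cputune>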

Future exploration – Leveraging the high isolation and stability of enterprise instances for offline workloads can raise overall resource utilization and reduce costs, while further tuning of CPU pinning and NIC queue allocation may yield additional performance gains.

Tags: cloud computing, performance testing, virtualization, NUMA, CPU pinning, enterprise instances, sysbench
Written by 360 Tech Engineering, the official tech channel of 360, building a professional technology aggregation platform for the brand.
