Analysis of Arm's 2023 Cortex‑X4, A720, and A520 Microarchitectures
Arm’s 2023 processor lineup—Cortex‑X4, A720, and A520—introduces a 15% performance boost, 20‑22% efficiency gains, a 64‑bit‑only Armv9.2 ISA with QARMA3 PAC, larger caches, expanded decode and execution resources, and a DSU120 module supporting up to 14 cores and 32 MiB L3.
In May 2023 Arm released its next‑generation processor lineup: the high‑performance Cortex‑X4, the efficiency‑focused A720, and the small‑core A520. This article reviews the architectural changes of these cores, highlights the new Armv9.2 ISA, and discusses the updated DSU120 system‑level module.
The three cores target different goals. Cortex‑X4 aims for a 15% performance uplift over Cortex‑X3, while A720 and A520 focus on 20% and 22% energy‑efficiency improvements respectively, all on the same TSMC 4 nm process.
Arm also introduced the Armv9.2 instruction set, adding the QARMA3 PAC algorithm, expanded floating‑point capabilities, and PMU enhancements. Notably, all three new cores drop 32‑bit support.
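Because these cores cannot run 32‑bit code, one practical step is to audit which native binaries in a software stack are still AArch32. The sketch below is only an illustration and not a tool from Arm or the original article: it reads the `EI_CLASS` and `e_machine` fields defined by the ELF specification, and the function names `classify_elf` and `classify_file` are hypothetical.

```python
import struct

# Constants from the ELF specification.
ELFCLASS32, ELFCLASS64 = 1, 2
EM_ARM, EM_AARCH64 = 40, 183


def classify_elf(header: bytes) -> str:
    """Classify a binary from the first 20 bytes of its ELF header."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not-elf"
    elf_class = header[4]                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"    # EI_DATA: 1 = little-endian
    (machine,) = struct.unpack_from(endian + "H", header, 18)  # e_machine
    if elf_class == ELFCLASS32 and machine == EM_ARM:
        return "aarch32"
    if elf_class == ELFCLASS64 and machine == EM_AARCH64:
        return "aarch64"
    return "other"


def classify_file(path: str) -> str:
    """Read just enough of a file to classify it."""
    with open(path, "rb") as f:
        return classify_elf(f.read(20))
```

Any file reported as "aarch32" would need a 64‑bit rebuild before it could run on a cluster made up only of these cores.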
The DSU120 module now supports up to 14 cores and up to 32 MiB of L3 cache, improving inter‑core data management.
Cortex‑X4 Microarchitecture
Code‑named Hunter‑ELP, Cortex‑X4 reworks the front end: it removes the L0 MOP cache, increases the number of decoders from 6 to 10, and unifies the pipeline width at 10. Counted from the L1 cache fetch, the pipeline depth is reduced from 11 to 10 stages.
Back‑end changes include an extra branch unit (2 → 3), two additional ALUs (6 → 8), a second full‑width MAC ALU, and a 20% larger reorder buffer (ROB), up from 320 to 384 entries.
The AGU configuration changes to 1 LS AGU, 2 LD AGU, and 1 ST AGU (total 4 AGU). The L1 d‑TLB entries double from 48 to 96. L2 cache capacity doubles from 1 MiB to 2 MiB, which reduces refill and write‑back rates per thousand instructions.
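The "per thousand instructions" metric mentioned above is a simple normalization of an event count by the retired instruction count. A minimal sketch follows. The counter values are made up for illustration and are not measurements from the article, and the helper name `per_kilo_instructions` is hypothetical.

```python
def per_kilo_instructions(events: int, instructions: int) -> float:
    """Return events per 1000 retired instructions (e.g. L2 refills, MPKI)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * events / instructions


# Hypothetical PMU readings for one run of a workload.
l2_refills = 4_200_000
retired_instructions = 1_000_000_000

rate = per_kilo_instructions(l2_refills, retired_instructions)
print(f"L2 refills per kilo-instruction: {rate:.2f}")
```

Comparing this rate for the same workload with a 1 MiB and a 2 MiB L2 is one way to see the effect the article describes.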
Performance figures show a double‑digit increase in SPECint2K7 (≈13‑14%), modest 6‑8% gains in Geekbench, and a more noticeable uplift in the L2‑sensitive Speedometer 2 benchmark.
Key Cortex‑X4 changes:
Removal of L0 MOP cache
Decoders increased to 10
Pipeline width unified to 10
Branch units: 2 → 3
ALU units: 6 → 8
One additional AGU (4 in total)
ROB size: 320 → 384
L1 d‑TLB: 48 → 96 entries
L2 cache: 1 MiB → 2 MiB
No 32‑bit support
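To put the resource changes listed above on a common scale, the following sketch computes the relative increase for each one. The numbers are taken from the list. The dictionary layout and the helper name `pct_increase` are only illustrative.

```python
# (old, new) values taken from the list of Cortex-X4 changes above.
CHANGES = {
    "decoders": (6, 10),
    "branch units": (2, 3),
    "ALUs": (6, 8),
    "ROB entries": (320, 384),
    "L1 d-TLB entries": (48, 96),
    "L2 cache (MiB)": (1, 2),
}


def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


for name, (old, new) in CHANGES.items():
    print(f"{name:18s} {old:>4} -> {new:<4} (+{pct_increase(old, new):.1f}%)")
```

Structural resources grow by 20% to 100%, while the stated performance target is 15%, which is the usual pattern of diminishing returns when widening an out‑of‑order core.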
A720 Microarchitecture
Code‑named Hunter, A720 targets a 20% efficiency gain over A715 while keeping power consumption similar. Front‑end improvements focus on branch‑prediction latency (recovery cycles reduced from 12 to 11) and power‑optimized unconditional/conditional prediction.
Back‑end adds pipelined FDIV/FSQRT units, optimizes data movement between integer and floating‑point units, and refines the issue queue and AGU pathways.
L2 cache latency drops from 10 to 9 cycles, and the maximum L2 size remains 512 KiB.
A new “A720min” variant offers a smaller die comparable to Cortex‑A78, delivering ~10% higher performance than A78 while maintaining similar power characteristics.
Key A720 changes:
Branch‑prediction recovery: 12 → 11 cycles
L2 latency: 10 → 9 cycles
Introduction of A720min (A78‑sized core with ~10% better performance)
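A one‑cycle saving looks small, so it can help to estimate what it means per thousand instructions. The sketch below uses a first‑order model (event rate times penalty) that ignores any overlap of stalls with useful work. The event rates are assumed values chosen for illustration, not figures from Arm or the article, and the function name is hypothetical.

```python
def stall_cycles_per_kilo_instr(events_per_kilo: float, penalty_cycles: int) -> float:
    """First-order stall estimate: event rate multiplied by per-event penalty."""
    return events_per_kilo * penalty_cycles


# Assumed workload characteristics (illustrative only).
branch_mispredicts_per_kilo = 5.0
l2_hits_per_kilo = 20.0

# Cycle counts from the list above: A715 value first, A720 value second.
branch_old = stall_cycles_per_kilo_instr(branch_mispredicts_per_kilo, 12)
branch_new = stall_cycles_per_kilo_instr(branch_mispredicts_per_kilo, 11)
l2_old = stall_cycles_per_kilo_instr(l2_hits_per_kilo, 10)
l2_new = stall_cycles_per_kilo_instr(l2_hits_per_kilo, 9)

print(f"branch recovery: {branch_old:.0f} -> {branch_new:.0f} cycles per 1000 instructions")
print(f"L2 access:       {l2_old:.0f} -> {l2_new:.0f} cycles per 1000 instructions")
```

Under these assumptions the savings scale linearly with how often a workload mispredicts branches or hits in L2, so branch‑heavy and L2‑resident workloads benefit most.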
A520 Microarchitecture
Code‑named Hayes, A520 is a 64‑bit only efficiency core derived from the A510 design. It removes one ALU (3 → 2) and adds the QARMA3 PAC algorithm to keep PAC overhead below 1%.
Arm claims a 22% power reduction at equal performance, or an 8% performance boost at equal power.
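The two figures describe different operating points, so they translate into different performance‑per‑watt gains. A short worked calculation, assuming performance per watt is simply performance divided by power (the helper name `perf_per_watt_gain` is illustrative):

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Fractional gain in performance per watt for new/old ratios."""
    return perf_ratio / power_ratio - 1.0


# Same performance at 22% less power: power ratio is 0.78.
iso_performance = perf_per_watt_gain(perf_ratio=1.0, power_ratio=0.78)

# 8% more performance at the same power.
iso_power = perf_per_watt_gain(perf_ratio=1.08, power_ratio=1.0)

print(f"iso-performance point: +{iso_performance * 100:.1f}% perf/W")
print(f"iso-power point:       +{iso_power * 100:.1f}% perf/W")
```

The iso‑performance point works out to roughly 28% better performance per watt, while the iso‑power point gives 8%, which is consistent with power rising faster than performance as a core is pushed up its voltage and frequency curve.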
Key A520 changes:
ALU count reduced from 3 to 2
QARMA3 PAC algorithm introduced
64‑bit only, no 32‑bit support
Significant energy‑efficiency improvements
DSU120 Module
The updated DSU120 can manage up to 14 cores and up to 32 MiB of L3 cache within a single cluster. It also provides an L3 power‑gating feature to reduce static leakage when large caches are not needed.
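As a rough illustration of these two limits, the sketch below checks a proposed cluster layout against them. It models only the 14‑core and 32 MiB figures stated above. Real DSU120 configurations have further constraints that are not modeled here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Limits as stated in the article.
MAX_CORES = 14
MAX_L3_MIB = 32


@dataclass(frozen=True)
class ClusterConfig:
    x4: int      # number of Cortex-X4 cores
    a720: int    # number of A720 cores
    a520: int    # number of A520 cores
    l3_mib: int  # shared L3 capacity in MiB

    @property
    def total_cores(self) -> int:
        return self.x4 + self.a720 + self.a520


def validate(cfg: ClusterConfig) -> List[str]:
    """Return a list of violated limits (empty if the layout fits)."""
    problems = []
    if cfg.total_cores > MAX_CORES:
        problems.append(f"{cfg.total_cores} cores exceeds {MAX_CORES}")
    if cfg.l3_mib > MAX_L3_MIB:
        problems.append(f"{cfg.l3_mib} MiB L3 exceeds {MAX_L3_MIB} MiB")
    return problems


# A hypothetical 1 + 5 + 2 phone-style layout with 16 MiB of L3.
print(validate(ClusterConfig(x4=1, a720=5, a520=2, l3_mib=16)))
```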
Overall Summary
Arm’s 2023 releases demonstrate a clear trend toward larger, higher‑performance cores (Cortex‑X4) combined with efficiency‑focused cores (A720, A520) that improve performance per watt while dropping legacy 32‑bit support. The Cortex‑X4 refinements (more decoders, a larger ROB, an expanded AGU set, and a bigger L2 cache) translate into measurable performance gains, especially in SPECint2K7 and L2‑sensitive workloads. Developers and system designers should consider these changes when optimizing software stacks and power‑management strategies for next‑generation mobile and embedded devices.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials