
Improving iOS App Launch Speed with Binary Reordering and Page Fault Reduction

By generating an order file that places frequently executed code contiguously, the authors cut iOS cold-launch page faults by about 15% and launch time by roughly 10% without any code changes. The symbol ordering is built with runtime hooking and static analysis.

Tencent Cloud Developer

As product requirements grow, features pile up in an app, the binary gets larger, and performance and user-experience problems multiply. Of these, launch speed is the one users notice most directly.

Traditional launch optimizations focus on removing unnecessary code, lazy loading, task prioritization, and multithreading, all of which aim to reduce main-thread work. These techniques are now commonplace and offer limited further gains. This article approaches launch optimization from the perspective of the memory loading mechanism and achieves an average 10% launch-time improvement without changing a single line of code.

When a user opens an iOS app, dyld maps the binary into memory with mmap. The mapping is lazy: a physical page is loaded only when it is first touched. As the CPU's program counter (PC) jumps between functions, each instruction must be fetched, decoded, and executed, and fetching is the expensive step because the first fetch from a page that is not yet resident triggers a page fault. A page fault occurs when the MMU cannot find a physical page for a virtual address; the resulting interrupt makes the kernel load the required page.

Measurements show that a single page fault can take 0.3–0.6 ms under normal conditions and up to 1 ms in worse cases. In a typical cold launch, more than 2,000 page faults may occur, accounting for over 300 ms of the total launch time. Reducing this overhead can significantly improve launch performance.

On iOS devices, early processors (A7, A8) use 4 KB physical pages, while later ones (A9 and above) use 16 KB pages. By arranging frequently executed code so that it resides contiguously within the same physical pages, the number of page faults can be minimized. This is the essence of binary reordering.
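To make the effect of contiguous placement concrete, here is a minimal sketch that counts how many distinct pages a sequence of function offsets touches. The function name, the constant, and the offsets used with it are illustrative assumptions, not taken from the original article, and the model ignores functions that straddle a page boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* 16 KB page size, as the text states for A9 and later. */
#define PAGE_SIZE_16K 16384u

/* Count the distinct pages touched by a list of function start offsets
 * (byte offsets into the code segment). Each distinct page is a page that
 * may have to be faulted in on a cold launch.
 * Simplification: a function is assumed to lie entirely in the page that
 * contains its first byte. */
size_t count_pages_touched(const uint64_t *offsets, size_t n, uint64_t page_size)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t page = offsets[i] / page_size;
        int seen = 0;
        /* Quadratic scan is fine for an illustration. */
        for (size_t j = 0; j < i; j++) {
            if (offsets[j] / page_size == page) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            distinct++;
        }
    }
    return distinct;
}
```

With four launch-time functions scattered at offsets 0x1000, 0x9000, 0x11000, and 0x19000, the count is 4 pages; packed at 0x0, 0x400, 0x800, and 0xC00, the same four functions share a single page.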

Binary reordering is implemented by passing an order file to the linker (ld's -order_file option, exposed in Xcode as the Order File build setting), which places symbols in the listed order in the final binary. The order file lists one function symbol per line, and the linker lays out the code section accordingly.

To generate the required symbol order, the following techniques are used:

Hooking objc_msgSend and objc_msgSendSuper2 to capture all Objective‑C method calls.

Hooking block invocations by intercepting retain/copy and swapping the original invoke function.

Hooking +load methods via a stub inserted into a writable data segment.

Capturing C++ constructors that run before main using stub instrumentation.

Tracking initialize calls with an Objective‑C stub.

Example of hooking objc_msgSend (assembly):

.text
.align 2
.global _pgoobjcmsgSend
_pgoobjcmsgSend:
// push
stp q6, q7, [sp, #-32]!
// ... (other register saves)
// call stub for monitoring
bl pgoastub_msgSend
mov x9, x0
// pop registers
ldp x0, x1, [sp], #16
ldp x2, x3, [sp], #16
ldp x4, x5, [sp], #16
ldp x6, x7, [sp], #16
ldp x8, lr, [sp], #16
ldp q0, q1, [sp], #32
ldp q2, q3, [sp], #32
ldp q4, q5, [sp], #32
ldp q6, q7, [sp], #32
// call original objc_msgSend
br x9
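The trampoline branches to whatever address the monitoring stub returns in x0, so the stub has two jobs: record the call and hand back the original objc_msgSend. The following is a minimal sketch of such a stub under stated assumptions. All names (trace_entry, trace_stub_msgSend, trace_set_original, and so on) are hypothetical, the receiver class and selector are treated as opaque pointers, and the article's actual stub may work differently.

```c
#include <stdatomic.h>
#include <stddef.h>

#define TRACE_CAPACITY 4096

/* One recorded message send. Both fields are opaque here; a real tool would
 * resolve them to a symbol such as "-[Class selector]" afterwards. */
typedef struct {
    const void *receiver_class;
    const void *selector;
} trace_entry;

static trace_entry g_trace[TRACE_CAPACITY];
static atomic_size_t g_trace_count;   /* number of slots claimed so far */
static void *g_original_msgSend;      /* assumption: saved when the hook is installed */

void trace_set_original(void *original)
{
    g_original_msgSend = original;
}

/* Called from the trampoline. Claims a slot atomically so that concurrent
 * threads never write the same entry, then returns the address the
 * trampoline should branch to. */
void *trace_stub_msgSend(const void *receiver_class, const void *selector)
{
    size_t slot = atomic_fetch_add(&g_trace_count, 1);
    if (slot < TRACE_CAPACITY) {
        g_trace[slot].receiver_class = receiver_class;
        g_trace[slot].selector = selector;
    }
    /* Calls beyond the capacity are dropped, never written out of bounds. */
    return g_original_msgSend;
}

/* Number of entries actually stored. Intended to be read once tracing has
 * stopped, because a slot can be claimed slightly before it is filled. */
size_t trace_recorded(void)
{
    size_t n = atomic_load(&g_trace_count);
    return n < TRACE_CAPACITY ? n : TRACE_CAPACITY;
}

const trace_entry *trace_at(size_t i)
{
    return i < trace_recorded() ? &g_trace[i] : NULL;
}
```

A fixed-size, append-only buffer keeps the stub free of allocation and locking, which matters because it runs on every message send during launch.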

For code that cannot be hooked (e.g., some C/C++ functions), static analysis scans the binary for bl instructions. The scanner recursively follows branches up to a configurable depth:

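#define PGOAINSBL (0x94000000)       // opcode bits of the AArch64 BL instruction
#define PGOAINSBL_FLAG (0xfc000000)  // mask selecting the top 6 opcode bits
// PGOAINSRET / PGOAINSRET_FLAG (the RET opcode and its mask) and the
// pgoainssst instruction-pointer type are defined elsewhere and not shown here.
static void pgoascansubroutiner(intptr_t func, int depth) {
    if (depth <= 0) return;              // recursion limit reached
    pgoainssst p = (pgoainssst)func;
    int i = 0;
    while (i < 2048) {                   // scan at most 2048 instructions per function
        intptr_t vpc = (intptr_t)p;
        int value = *(int *)p;           // current 32-bit instruction word
        if ((value & PGOAINSBL_FLAG) == PGOAINSBL) {
            // ... handle branch ... (elided in the source; presumably resolves
            // the BL target into vpc before it is validated below)
            if (pgoavamainvalid(vpc)) {
                pgoaaddfunc(vpc);                     // record the callee
                pgoascansubroutiner(vpc, depth-1);    // then scan the callee's body
            }
        } else if ((value & PGOAINSRET_FLAG) == PGOAINSRET) {
            return; // return instruction: end of this function
        }
        i++; p++;
    }
}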
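The branch handling is elided in the listing above. As a sketch of what that step involves, the code below decodes a BL instruction's target using the standard AArch64 encoding: the low 26 bits hold a signed word offset relative to the instruction's own address. The function and macro names are illustrative and not taken from the SDK.

```c
#include <stdint.h>

#define INS_BL      0x94000000u   /* BL opcode: top six bits are 100101 */
#define INS_BL_MASK 0xfc000000u   /* selects those top six bits */

/* Returns nonzero if the 32-bit instruction word is a BL. */
int is_bl(uint32_t ins)
{
    return (ins & INS_BL_MASK) == INS_BL;
}

/* Computes the branch target of a BL located at address pc.
 * imm26 is a signed offset counted in 4-byte instructions. */
uint64_t bl_target(uint64_t pc, uint32_t ins)
{
    int64_t imm26 = (int64_t)(ins & 0x03ffffffu);
    if (imm26 & 0x02000000) {
        imm26 -= 0x04000000;      /* sign-extend the 26-bit field */
    }
    return pc + (uint64_t)(imm26 * 4);
}
```

For example, the word 0x94000002 at address 0x1000 is a BL to 0x1008, and 0x97ffffff at the same address branches backwards to 0xffc.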

After collecting the execution order, an order file is generated and supplied to the Xcode build settings. The resulting binary shows a ~10% reduction in cold‑launch time and about a 15% decrease in page‑fault count, as demonstrated by the accompanying performance graphs.
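Turning a recorded call sequence into an order file mostly comes down to keeping each symbol once, at its first occurrence, and writing one symbol per line. The sketch below shows that step. The function name and the sample symbols are hypothetical, and the article does not describe how its own generator is implemented.

```c
#include <stdio.h>
#include <string.h>

/* Writes each symbol once, in first-seen order, one per line.
 * Returns the number of lines written, or -1 on an I/O error.
 * The quadratic duplicate check keeps the sketch short; a hash set would be
 * the natural choice for a large trace. */
int write_order_file(FILE *out, const char *const *symbols, size_t n)
{
    int written = 0;
    for (size_t i = 0; i < n; i++) {
        int duplicate = 0;
        for (size_t j = 0; j < i; j++) {
            if (strcmp(symbols[i], symbols[j]) == 0) {
                duplicate = 1;
                break;
            }
        }
        if (duplicate) {
            continue;   /* only the first call decides the position */
        }
        if (fprintf(out, "%s\n", symbols[i]) < 0) {
            return -1;
        }
        written++;
    }
    return written;
}
```

Given the trace _main, +[Foo load], _main, -[Foo bar], the function writes three lines: _main, +[Foo load], and -[Foo bar].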

A one‑click SDK is also provided. Integrating the SDK requires only a few lines of code and automatically outputs the required symbol list after a single app run:

#if defined(__arm64__) || defined(__aarch64__)
    bool debug = false;
    #ifdef DEBUG
        debug = true;
    #endif
    pgoa_logall(debug, nil);
#endif

Developers can then add the generated order_symbol.txt as the order file in the Xcode project and rebuild the Release package.

The article concludes with a comparison to existing solutions, highlighting the SDK’s ease of integration, support for dynamic C++ constructor tracing, initialize hooking, comprehensive block hooking, and full coverage of bl calls. Remaining challenges include occasional mis‑ordering from static scans, random accesses to constants/global variables causing extra page faults, and translation efficiency of symbols. Future work will address these issues and further refine the approach.
