
Performance Optimization of Floating‑Point Conversion and GC Tuning in Java Backend Services

This article details how replacing Java's native double parsing with the Ryu and FastFloat algorithms, alongside JVM GC tuning, dramatically reduces CPU usage and latency in backend services, offering practical code examples, benchmark results, and actionable optimization strategies.

DeWu Technology

Introduction

In algorithm engineering, stability, cost, effect, and performance are the four core dimensions, and performance is the most direct lever for reducing latency and cost. This article shares practical SRE experience with day-to-day performance optimization, focusing on floating-point conversion and GC issues.

1. Cooling Down Floating-Point Conversion

Feature vectors in ranking systems are typically doubles, so serving incurs massive double-to-string and string-to-double conversion that can cause CPU spikes. Two high-performance algorithms address this:

Ryu Algorithm (float-to-string)

Ryu replaces the traditional BigInt-based conversion with table lookups and fixed-width integer arithmetic, eliminating dynamic memory allocation.

<code>void convertStandard(double d, char *out) {
    // 1. Split the float into sign, exponent, mantissa
    bool sign = (d < 0);
    int exp = extractExponent(d);      // extract binary exponent
    uint64_t mant = extractMantissa(d);
    // 2. Build a big integer: mant * 2^exp (may need allocation)
    BigInt num = BigInt_from_uint64(mant);
    num = BigInt_mul_pow2(num, exp);   // high-cost multi-precision shift
    // 3. Repeatedly divide by 10 to peel off decimal digits
    char buf[32];
    int len = 0;
    while (!BigInt_is_zero(num)) {
        BigInt quot, rem;
        BigInt_divmod(num, 10, &quot, &rem); // slow multi-precision division
        buf[len++] = '0' + BigInt_to_uint32(rem); // remainder is the next digit
        BigInt_free(num);
        num = quot;                    // continue with the quotient
    }
    // 4. Trim zeros, insert decimal point and sign
    formatOutput(sign, buf, len, out);
}
</code>
<code>void convertRyu(double d, char *out) {
    // 1. Split float: sign, real exponent, mantissa (implicit 1)
    bool sign = (d < 0);
    int e2 = extractBiasedExponent(d) - BIAS;
    uint64_t m2 = extractMantissa(d) | IMPLIED_ONE;
    // 2. Table lookup for 5^k and shift
    int k = computeDecimalExponent(e2);
    uint64_t pow5 = POW5_TABLE[k];
    int shift = SHIFT_TABLE[k];
    // 3. Single 64×64 multiplication + shift
    __uint128_t prod = (__uint128_t)m2 * pow5;
    uint64_t v = (uint64_t)(prod >> shift);
    // 4. Bounded digit loop (a double needs at most 17 significant digits)
    char buf[24];
    int len = 0;
    do {
        buf[len++] = '0' + (v % 10);
        v /= 10;
    } while (v);
    // 5. Trim zeros, add decimal point and sign
    formatShort(sign, buf, len, k, out);
}
</code>

The comparison shows that the traditional method incurs dynamic memory allocation, multi‑precision division (hundreds of nanoseconds), and unpredictable loop counts, while Ryu performs a single 64‑bit multiplication and a fixed‑cost loop (≈30‑40 ns), is cache‑friendly, and requires no heap allocation.
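Both variants above start the same way: splitting the IEEE-754 bit pattern into sign, exponent, and mantissa. In Java this first step needs no native code; a minimal sketch using `Double.doubleToLongBits` (constant names are mine, not from the article):

```java
public class DoubleBits {
    static final int MANTISSA_BITS = 52;
    static final int EXPONENT_BIAS = 1075; // 1023 + 52: exponent of the mantissa read as an integer

    public static void main(String[] args) {
        double d = 0.3;
        long bits = Double.doubleToLongBits(d);
        boolean sign = bits < 0;                                  // top bit
        int biasedExp = (int) ((bits >>> MANTISSA_BITS) & 0x7FF); // 11 exponent bits
        long mantissa = bits & ((1L << MANTISSA_BITS) - 1);       // 52 mantissa bits
        // For normal doubles, add the implicit leading 1 bit; subnormals have none.
        long m2 = (biasedExp == 0) ? mantissa : mantissa | (1L << MANTISSA_BITS);
        int e2 = ((biasedExp == 0) ? 1 : biasedExp) - EXPONENT_BIAS;
        // For finite values, d == m2 * 2^e2 exactly (scaling by a power of two is lossless).
        double rebuilt = m2 * Math.pow(2, e2);
        System.out.println(sign + " " + m2 + " " + e2 + " " + (rebuilt == d)); // last token is true
    }
}
```

From (sign, m2, e2), Ryu proceeds with the table lookup and single wide multiplication shown above.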

2. FastFloat Algorithm (string-to-float)

FastFloat parses doubles without BigDecimal or BigInteger, using staged parsing, 64-bit integer shortcuts, precomputed power-of-ten tables, and (in the C++ implementation) SIMD-style optimizations. The conversion flow is:

<code>Input: "123.45e2"
1. Parse digits into integer significand = 12345;
   decimal exponent = 2 (from "e2") - 2 (fraction digits) = 0
2. result = 12345 * 10^0 = 12345.0 (exact 64-bit fast path)
3. Assemble the final bits, e.g. via Double.longBitsToDouble
</code>
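This is not the full algorithm, but the exact fast path it exploits is easy to show: when the significand fits in 53 bits and the decimal exponent magnitude is at most 22, one multiplication or division by an exactly representable power of ten is correctly rounded. A simplified Java sketch (my own illustration; no sign handling or input validation):

```java
public class FastPathParse {
    // Powers of ten up to 10^22 are exactly representable as doubles.
    private static final double[] POW10 = new double[23];
    static {
        double p = 1.0;
        for (int i = 0; i < POW10.length; i++) { POW10[i] = p; p *= 10.0; }
    }

    /** Simplified fast path: digits, optional '.', optional exponent; no sign.
     *  Returns NaN when the input falls outside the exact fast path. */
    public static double parseFast(String s) {
        long sig = 0;      // significand accumulated as an integer
        int exp10 = 0;     // decimal exponent to apply afterwards
        boolean afterDot = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.') { afterDot = true; continue; }
            if (c == 'e' || c == 'E') { exp10 += Integer.parseInt(s.substring(i + 1)); break; }
            sig = sig * 10 + (c - '0');
            if (afterDot) exp10--;     // each fraction digit shifts the exponent down
        }
        // The single multiply/divide is exact only inside these bounds.
        if (sig >= (1L << 53) || exp10 > 22 || exp10 < -22) return Double.NaN;
        return exp10 >= 0 ? sig * POW10[exp10] : sig / POW10[-exp10];
    }

    public static void main(String[] args) {
        System.out.println(parseFast("123.45e2")); // prints 12345.0
    }
}
```

Inputs that miss the fast path (here signalled by NaN) fall back to the slower, fully general parsing stages.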

Benchmarks demonstrate a 4.43× speed‑up over JDK's Double.parseDouble, reducing CPU time from 18 % to 0.19 % (≈98 % improvement) and cutting RT latency by up to 50 %.

3. Removing GC Spikes

A small-heap service showed periodic RT99 spikes caused by frequent minor GC. Raising the promotion-to-old-generation threshold (-XX:GPGCTimeStampPromotionThresholdMS) and lengthening the old-GC interval (-XX:GPGCOldGCIntervalSecs) eliminated the GC-induced latency spikes, stabilizing RT and ending the error bursts.
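For reference, the two flags go on the JVM launch line. A sketch only: the flag names are as given in the text (they belong to the GPGC collector), while the values and jar name are placeholders, since the tuned production settings are not stated:

```shell
# Placeholder values; the right settings depend on heap size and traffic shape.
java \
  -XX:GPGCTimeStampPromotionThresholdMS=<millis> \
  -XX:GPGCOldGCIntervalSecs=<seconds> \
  -jar ranking-service.jar
```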

4. Who Steals RT Time?

A high feature-vector cache miss rate combined with cache expiration (TTL of 60-90 s) produces "page-fault-like" pauses and RT spikes. Even at a 99.9 % cache hit rate, a request that fetches 1,700 ad feature entries still sees occasional miss bursts that dominate its latency.
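The arithmetic behind this is worth making explicit: with 1,700 lookups per request, a 99.9 % per-entry hit rate still leaves most requests hitting at least one miss. A quick check using the numbers from the text (treating lookups as independent, which is a simplifying assumption):

```java
import java.util.Locale;

public class CacheMissMath {
    public static void main(String[] args) {
        double hitRate = 0.999; // 99.9 % per-entry cache hit rate
        int entries = 1700;     // ad feature entries fetched per request
        // P(at least one of the 1700 lookups misses) = 1 - hitRate^entries
        double atLeastOneMiss = 1.0 - Math.pow(hitRate, entries);
        System.out.printf(Locale.ROOT, "P(request sees a miss) = %.1f%%%n",
                atLeastOneMiss * 100); // prints roughly 81.7%
    }
}
```

So a per-entry hit rate that looks excellent still translates into a miss on roughly four out of five requests, which is why the tail latency is dominated by these bursts.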

5. Solutions & Recommendations

• Replace the JDK's double conversion with Ryu and FastFloat.
• Tune JVM GC parameters: raise promotion thresholds and lengthen old-GC intervals.
• Reduce cache TTL where possible and improve cache hit ratios.
• Use wall-time flame graphs and trace tools to pinpoint bottlenecks beyond superficial CPU usage.

Conclusion

Performance optimization is an endless journey. By combining algorithmic improvements, GC tuning, and detailed profiling, engineers can achieve large latency reductions and cost savings, turning micro-optimizations into tangible business value.

Tags: JVM, performance optimization, GC, floating-point, FastFloat, Ryu