Fundamentals 15 min read

Why Java's String.hashCode() Uses 31 as Multiplier: Theory, Experiments, and Visualization

This article explains why Java's String.hashCode() uses the multiplier 31, covering its implementation, mathematical justification, JVM optimization, comparative analysis with other multipliers, and experimental results on hash collision rates and distribution using a large English word dataset.

Java Captain

Sep 30, 2019

Why Java's String.hashCode() Uses 31 as Multiplier: Theory, Experiments, and Visualization

When the author opened the String.hashCode() method in the JDK, they noticed the mysterious constant 31 used as a multiplier and set out to discover its purpose.

The implementation of String.hashCode() is straightforward:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

From this loop we can derive the polynomial formula s[0]*31^(n-1) + s[1]*31^(n-2) + … + s[n-1], where s is the internal character array.

Two main reasons are given for choosing 31:

It is an odd prime of moderate size, which historically reduces hash collisions.

It can be optimized by the JVM because 31 * i == (i << 5) - i, allowing the multiplication to be replaced by a shift and a subtraction.

The classic explanation from Effective Java is quoted:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i . Modern VMs do this sort of optimization automatically.

Another answer notes that a small set of primes (31, 33, 37, 39, 41) produces fewer than seven collisions when hashing over 50,000 English words, explaining why Java often picks 31.

To verify these claims, the author performed experiments on more than 230,000 English words from /usr/share/dict/words. A helper method computes a hash with an arbitrary multiplier:

public static Integer hashCode(String str, Integer multiplier) {
    int hash = 0;
    for (int i = 0; i < str.length(); i++) {
        hash = multiplier * hash + str.charAt(i);
    }
    return hash;
}

public static void calculateConflictRate(Integer multiplier, List<Integer> hashs) {
    Comparator<Integer> cp = (x, y) -> x > y ? 1 : (x < y ? -1 : 0);
    int maxHash = hashs.stream().max(cp).get();
    int minHash = hashs.stream().min(cp).get();
    int uniqueHashNum = (int) hashs.stream().distinct().count();
    int conflictNum = hashs.size() - uniqueHashNum;
    double conflictRate = (conflictNum * 1.0) / hashs.size();
    System.out.println(String.format("multiplier=%4d, minHash=%11d, maxHash=%10d, conflictNum=%6d, conflictRate=%.4f%%",
            multiplier, minHash, maxHash, conflictNum, conflictRate * 100));
}

The results showed that very small primes (e.g., 2) produce a high collision rate (≈55 %), while primes around 31, 37, 41 keep collisions below seven. Larger primes such as 101 and 199 also give low collision rates but can overflow the 32‑bit integer range, which may discard information.

Even numbers performed poorly; multiplier 32 yielded a collision rate over 50 %.

For visualizing hash distribution, the author partitioned the 32‑bit integer space into 64 equal intervals (step = 2^26) and counted how many hash values fell into each interval:

public static Map<Integer, Integer> partition(List<Integer> hashs) {
    final int step = 67108864; // 2^32 / 64
    Map<Integer, Integer> statistics = new LinkedHashMap<>();
    int start = 0;
    for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i += step) {
        final long min = i;
        final long max = min + step;
        int num = (int) hashs.parallelStream()
                .filter(x -> x >= min && x < max).count();
        statistics.put(start++, num);
    }
    return statistics;
}

Scatter‑curve charts for multipliers 2, 3, 17, 31, and 101 revealed that 2 and 3 concentrate almost all hash values in a single bucket, 17 spreads them more evenly, 31 achieves a good balance, and 101 distributes values uniformly but risks overflow.

In conclusion, the combination of being a moderate‑sized odd prime and allowing cheap JVM‑level optimization makes 31 an excellent choice for the String hashCode multiplier, offering low collision rates and good value distribution without overflow concerns.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java algorithm hashcode prime

Written by

Java Captain

Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.