Big Data 10 min read

External Sorting of a 4.6 GB File with 500 Million Integers: Strategies and Implementation

This article explains how to sort a 4.6 GB file containing 500 million random integers using internal quicksort and merge sort attempts, the Unix sort command, a bitmap-based method, and a detailed external sorting strategy with multi‑way merge, discussing performance and resource constraints.

Architecture Digest
Architecture Digest
Architecture Digest
External Sorting of a 4.6 GB File with 500 Million Integers: Strategies and Implementation

The problem presents a 4.6 GB file named bigdata containing 500 million random integers, each on a separate line, and asks how to sort it.

Initial attempts at internal sorting include a three‑way quicksort implementation and a merge sort, both with cutoff thresholds for small sub‑arrays and using insertion sort for those cases. Code snippets illustrate the algorithms.

private final int cutoff = 8;

public
void perform(Comparable
[] a) {
    perform(a,0,a.length - 1);
}

private
int median3(Comparable
[] a,int x,int y,int z) {
    if(lessThan(a[x],a[y])) {
        if(lessThan(a[y],a[z])) {
            return y;
        } else if(lessThan(a[x],a[z])) {
            return z;
        } else {
            return x;
        }
    } else {
        if(lessThan(a[z],a[y])){
            return y;
        } else if(lessThan(a[z],a[x])) {
            return z;
        } else {
            return x;
        }
    }
}

private
void perform(Comparable
[] a,int low,int high) {
    int n = high - low + 1;
    if(n <= cutoff) {
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a,low,high);
    } else if(n <= 100) {
        int m = median3(a,low,low + (n >>> 1),high);
        exchange(a,m,low);
    } else {
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a,low,low + gap,low + (gap << 1));
        int m2 = median3(a,m - gap,m,m + gap);
        int m3 = median3(a,high - (gap << 1),high - gap,high);
        int ninther = median3(a,m1,m2,m3);
        exchange(a,ninther,low);
    }
    if(high <= low) return;
    int lt = low;
    int gt = high;
    Comparable
pivot = a[low];
    int i = low + 1;
    while(i <= gt) {
        if(lessThan(a[i],pivot)) {
            exchange(a,lt++,i++);
        } else if(lessThan(pivot,a[i])) {
            exchange(a,i,gt--);
        } else {
            i++;
        }
    }
    perform(a,low,lt - 1);
    perform(a,gt + 1,high);
}

Running the Unix sort -n bigdata -o bigdata.sorted command completes in about 24 minutes, but the author notes high I/O, memory pressure, and CPU time spent on context switches.

A bitmap‑based sorting method is then described, using a BitSet to record the presence of each integer, reading the large file into memory, setting bits, and then writing out the sorted numbers in ascending or descending order. This approach finishes in roughly 190 seconds on a machine with 4.6 GB/32 GB memory.

private BitSet bits;

public void perform(String largeFileName,int total,String destLargeFileName,Castor
castor,int readerBufferSize,int writerBufferSize,boolean asc) throws IOException {
    System.out.println("BitmapSort Started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart
largeIn = PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart
largeOut = PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();
    Integer data;
    int off = 0;
    while(true) {
        data = largeIn.read();
        if(data == null) break;
        int v = data;
        set(v);
        off++;
    }
    largeIn.close();
    int size = bits.size();
    System.out.println(String.format("lines : %d ,bits : %d", off, size));
    if(asc) {
        for(int i = 0; i < size; i++) {
            if(get(i)) {
                largeOut.write(i);
            }
        }
    } else {
        for(int i = size - 1; i >= 0; i--) {
            if(get(i)) {
                largeOut.write(i);
            }
        }
    }
    largeOut.close();
    long stop = System.currentTimeMillis();
    long elapsed = stop - start;
    System.out.println(String.format("BitmapSort Completed.elapsed : %dms",elapsed));
} 

private void set(int i) { bits.set(i); }
private boolean get(int v) { return bits.get(v); }

Finally, an external sorting strategy is detailed for environments with limited memory. The file is processed in chunks: each chunk is read into a small buffer, internally sorted, and written to temporary sorted part files. Afterwards, a multi‑way merge reads the smallest elements from each part file and writes them to the final sorted output. The complete external sort runs in about 13 minutes.

JavaBig Datamerge sortbitmap sortexternal sort
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.