WebAssembly with Emscripten: High‑Performance MD5 Hashing and Archive Extraction in the Browser
This article demonstrates how to leverage WebAssembly and Emscripten to compile C code for high‑performance MD5 hashing and archive (zip/7z) parsing in the browser, covering library selection, memory management, file I/O via WorkerFS, async processing, and integration of C functions with JavaScript.
In a recent project the author needed to compute file MD5 hashes and parse compressed archive directories (including 7z) directly in the browser. JavaScript libraries such as spark‑md5 were too slow for gigabyte‑size files, so the solution switched to C libraries compiled to WebAssembly.
WebAssembly (Wasm) is a low‑level, binary format that runs in browsers with near‑native performance. By compiling C/C++/Rust code to Wasm, developers can reuse existing high‑performance libraries.
Hello World
A minimal C program is compiled with Emscripten:
#include <stdio.h>
int main() {
    printf("hello world\n");
    return 0;
}
Using the Docker image trzeci/emscripten, the compilation command is:
docker run --rm -v $(pwd):/working trzeci/emscripten \
    emcc /working/main.c -o /working/index.html
Emscripten produces three files: index.wasm (the binary), index.js (JS glue code) and index.html (entry page). The Wasm module can then be loaded by serving these files from any static web server.
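To make the role of the glue code more concrete, the following sketch instantiates a tiny hand-written Wasm binary directly through the standard WebAssembly JavaScript API. The module is not Emscripten output; it is an illustrative stand-in that exports a single add function, and the variable names are chosen only for this example. Emscripten's generated index.js performs a comparable instantiation step, plus runtime setup, on your behalf.

```javascript
// A hand-assembled Wasm module exporting add(i32, i32) -> i32.
// Illustrative only: real Emscripten output is far larger.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0, local.get 1, i32.add
]);

// Synchronous compilation is acceptable for a module this small;
// larger modules should use WebAssembly.instantiateStreaming in the browser.
const wasmModule = new WebAssembly.Module(wasmBytes);
const instance = new WebAssembly.Instance(wasmModule);

const sum = instance.exports.add(2, 3);
console.log(sum);
```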
Loading WebAssembly
To load the compiled module as a reusable UMD module, compile with:
docker run --rm -v $(pwd):/working trzeci/emscripten \
emcc /working/main.c -o /working/cutils.js \
    -s MODULARIZE=1 -s EXPORT_NAME=CUtils
Then include the generated script and instantiate it:
<script src="path/to/cutils.js"></script>
<script>
  const Module = CUtils({
    onRuntimeInitialized: () => {
      // module ready
    }
  });
</script>
JavaScript calling C functions (MD5 example)
The C MD5 implementation provides MD5_Init, MD5_Update and MD5_Final. After compiling with the appropriate EXPORTED_FUNCTIONS and EXTRA_EXPORTED_RUNTIME_METHODS settings (newer Emscripten releases name the latter EXPORTED_RUNTIME_METHODS), the functions are invoked from JavaScript via Module.ccall or Module.cwrap:
const STRUCT_MD5_CTX_SIZE = 152;
const pMd5Ctx = Module.ccall('malloc', 'number', ['number'], [STRUCT_MD5_CTX_SIZE]);
// … call MD5_Init, MD5_Update, MD5_Final via ccall
The full MD5 calculation routine allocates memory for the context, the input buffer, and the result, copies the file data into Wasm memory, runs the C functions, extracts the 16‑byte digest, and frees the allocations.
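The routine described above might look roughly like the sketch below. It is not the author's original code: the function name md5OfBytes and the constant MD5_DIGEST_SIZE are made up for this example, the argument order follows the C calls shown elsewhere in the article, and it assumes malloc, free and the three MD5 functions were exported at compile time so that ccall can reach them.

```javascript
const STRUCT_MD5_CTX_SIZE = 152; // size of the C context struct, as given in the article
const MD5_DIGEST_SIZE = 16;      // an MD5 digest is always 16 bytes

// Hypothetical helper: hashes a Uint8Array using an initialised Emscripten Module.
function md5OfBytes(Module, bytes) {
  const malloc = (size) => Module.ccall('malloc', 'number', ['number'], [size]);
  const free = (ptr) => Module.ccall('free', null, ['number'], [ptr]);

  const pCtx = malloc(STRUCT_MD5_CTX_SIZE);
  const pData = malloc(bytes.length);
  const pDigest = malloc(MD5_DIGEST_SIZE);
  try {
    // Copy the input into Wasm memory so the C code can see it.
    Module.HEAPU8.set(bytes, pData);

    Module.ccall('MD5_Init', null, ['number'], [pCtx]);
    Module.ccall('MD5_Update', null, ['number', 'number', 'number'],
      [pCtx, pData, bytes.length]);
    Module.ccall('MD5_Final', null, ['number', 'number'], [pDigest, pCtx]);

    // Read the digest back out and format it as lowercase hex.
    const digest = Module.HEAPU8.subarray(pDigest, pDigest + MD5_DIGEST_SIZE);
    return Array.from(digest, (b) => b.toString(16).padStart(2, '0')).join('');
  } finally {
    // Wasm memory is not garbage collected; every malloc needs a matching free.
    free(pDigest);
    free(pData);
    free(pCtx);
  }
}
```

Wrapping the calls in try/finally keeps the three allocations from leaking if any step throws. For repeated use, Module.cwrap would avoid re-resolving each function on every call.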
Memory and ArrayBuffer
WebAssembly memory is an ArrayBuffer. Emscripten exposes typed‑array views such as Module.HEAP8 and Module.HEAPU8. Files are read with FileReader (or FileReaderSync in a worker) and copied into Wasm memory via Module.HEAP8.set.
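The relationship between Wasm memory and its typed-array views can be tried out without Emscripten at all. The sketch below uses a plain WebAssembly.Memory object; the address 1024 and the variable names are arbitrary choices for illustration. It also shows a common pitfall: when memory grows, the old ArrayBuffer is detached, so views created earlier must be recreated. This is why it is safer to read Module.HEAPU8 at the point of use than to cache it.

```javascript
// One Wasm page is 64 KiB.
const memory = new WebAssembly.Memory({ initial: 1 });

// Two views over the same underlying buffer, like Module.HEAPU8 and Module.HEAP8.
let heapU8 = new Uint8Array(memory.buffer);
const heap8 = new Int8Array(memory.buffer);

// Pretend these bytes came from a FileReader result.
const fileBytes = new Uint8Array([0xde, 0xad, 0xbe, 0xef]);
const ptr = 1024; // hypothetical address; real code would obtain it from malloc
heapU8.set(fileBytes, ptr);

// The same byte read through the unsigned and the signed view.
const unsignedFirst = heapU8[ptr];
const signedFirst = heap8[ptr];

// Growing the memory detaches the old buffer: existing views become empty.
memory.grow(1);
const staleLength = heapU8.length;

// Recreate the view over the new buffer; the contents are preserved.
heapU8 = new Uint8Array(memory.buffer);
const preserved = heapU8[ptr];

console.log({ unsignedFirst, signedFirst, staleLength, preserved });
```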
WorkerFS for large files
To avoid loading an entire file into memory, Emscripten’s WORKERFS provides a read‑only file system that streams File or Blob objects inside a Web Worker. The C side can then read the file in chunks:
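/* Needs <stdio.h>, <stdlib.h> and the header that declares MD5_CTX and the MD5_* functions. */
void md5(const char *path, unsigned char *md5_result) {
    const size_t CHUNK_SIZE = 16 * 1024 * 1024;  /* read 16 MiB at a time */
    MD5_CTX md5_ctx;                             /* hashing state */
    unsigned char *buff = malloc(CHUNK_SIZE);
    FILE *stream = fopen(path, "rb");            /* binary mode */
    size_t read_size;
    if (buff == NULL || stream == NULL) {
        if (stream != NULL) fclose(stream);
        free(buff);
        return;  /* md5_result is left untouched on failure */
    }
    MD5_Init(&md5_ctx);
    while ((read_size = fread(buff, 1, CHUNK_SIZE, stream)) > 0) {
        MD5_Update(&md5_ctx, buff, read_size);
    }
    MD5_Final(md5_result, &md5_ctx);
    fclose(stream);
    free(buff);
}
In JavaScript the worker mounts the file: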
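Module.FS.mkdir('/working'); // the mount point must exist before mounting
Module.FS.mount(Module.FS.filesystems.WORKERFS, {files: [file]}, '/working');
// the file is then readable from C at '/working/' + file.name
Parallel optimisation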
Reading the next chunk while the current chunk is being hashed reduces total time. An AsyncGenerator (makeBlobIterator) yields file slices ahead of the MD5 update loop, allowing overlapping I/O and computation.
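The article names makeBlobIterator but does not show its body, so the following is only one plausible shape for it. It starts reading the next slice before handing the current one to the consumer, which is what allows I/O and hashing to overlap. The chunkSize parameter and the use of Blob.arrayBuffer() are assumptions of this sketch.

```javascript
// Sketch of a prefetching slice iterator over a Blob or File.
async function* makeBlobIterator(blob, chunkSize) {
  let offset = 0;
  // Start the first read immediately.
  let pending = blob.slice(0, chunkSize).arrayBuffer();

  while (offset < blob.size) {
    const current = await pending;
    offset += current.byteLength;

    // Kick off the next read before yielding, so it proceeds
    // while the consumer is busy hashing the current chunk.
    pending = offset < blob.size
      ? blob.slice(offset, offset + chunkSize).arrayBuffer()
      : null;

    yield new Uint8Array(current);
  }
}

// Usage: the loop body stands in for copying the chunk into Wasm memory
// and calling MD5_Update on it.
async function collectChunkSizes(blob, chunkSize) {
  const sizes = [];
  for await (const chunk of makeBlobIterator(blob, chunkSize)) {
    sizes.push(chunk.length);
  }
  return sizes;
}
```

Only one read is in flight at a time, so at most two chunks are held in memory at once: the one being hashed and the one being read.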
Using third‑party libraries (libarchive)
To add a C library such as libarchive, compile it with Emscripten using emconfigure and emmake:
emconfigure ./configure
emmake make
emmake make install
The resulting Wasm module can be used to extract archives, with callbacks to JavaScript for each entry.
C calling JavaScript callbacks
Emscripten’s addFunction creates a function pointer that C code can invoke. Example:
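// JavaScript
// C passes a pointer to the string, so convert it with UTF8ToString.
function handlePathname(pathPtr) { console.log(Module.UTF8ToString(pathPtr)); }
// 'vi' = returns void, takes one i32 (the pointer); the signature is required when targeting Wasm.
const handlePathnamePtr = Module.addFunction(handlePathname, 'vi');
Module.ccall('extract', null, ['string', 'number'], [path, handlePathnamePtr]);
The C signature expects a function pointer void (*on_pathname)(const char*).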
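Because the callback is invoked from C, its const char* argument arrives in JavaScript as a number (an address in Wasm memory), not as a string. Emscripten's UTF8ToString runtime method performs the conversion. The sketch below imitates that step on a plain Uint8Array to show what happens; readCString and makePathnameCallback are hypothetical names, and the heap here is a stand-in for Module.HEAPU8.

```javascript
// Decode a NUL-terminated UTF-8 string starting at ptr.
// Rough equivalent of what Emscripten's UTF8ToString does.
function readCString(heapU8, ptr) {
  let end = ptr;
  while (end < heapU8.length && heapU8[end] !== 0) end++;
  return new TextDecoder('utf-8').decode(heapU8.subarray(ptr, end));
}

// Wrap a string-taking handler so it can be called with a pointer instead.
function makePathnameCallback(heapU8, onPath) {
  return (pathPtr) => onPath(readCString(heapU8, pathPtr));
}

// Simulate the C side: place "docs/readme.txt\0" in memory and pass its address.
const fakeHeap = new Uint8Array(256);
const pathAddress = 32; // arbitrary address for the demonstration
fakeHeap.set(new TextEncoder().encode('docs/readme.txt'), pathAddress);
// The byte after the string is already 0, which acts as the terminator.

const received = [];
const onPathname = makePathnameCallback(fakeHeap, (path) => received.push(path));
onPathname(pathAddress);
console.log(received);
```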
64‑bit values and JavaScript limits
Standard C long is 32‑bit in Emscripten's 32‑bit Wasm target, preventing offsets >2 GB. libarchive uses 64‑bit values (int64_t) in its callbacks. Since JavaScript numbers lose precision beyond 53 bits, the solution passes pointers to 64‑bit values and reads/writes them directly from Wasm memory using helper functions setInt64 and getInt64.
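The article does not show the bodies of setInt64 and getInt64, so the versions below are an assumed implementation rather than the author's. They use DataView with BigInt accessors, read and write little-endian (as Wasm memory is), and reject values that a JavaScript number cannot represent exactly. In real use the buffer argument would be the Wasm memory buffer (for example Module.HEAPU8.buffer) and ptr an address returned by malloc.

```javascript
// Write a JavaScript integer as a little-endian int64 at byte offset ptr.
function setInt64(buffer, ptr, value) {
  if (!Number.isSafeInteger(value)) {
    throw new RangeError('value is not exactly representable as a Number');
  }
  new DataView(buffer).setBigInt64(ptr, BigInt(value), true);
}

// Read a little-endian int64 at byte offset ptr back as a JavaScript number.
function getInt64(buffer, ptr) {
  const asNumber = Number(new DataView(buffer).getBigInt64(ptr, true));
  if (!Number.isSafeInteger(asNumber)) {
    throw new RangeError('stored value exceeds the safe integer range');
  }
  return asNumber;
}

// Usage: a 5 GiB file offset, which does not fit in 32 bits.
const scratch = new ArrayBuffer(16); // stand-in for Wasm memory
const offsetPtr = 8;                 // arbitrary 8-byte-aligned address
const fiveGiB = 5 * 1024 * 1024 * 1024;
setInt64(scratch, offsetPtr, fiveGiB);
const roundTripped = getInt64(scratch, offsetPtr);
console.log(roundTripped === fiveGiB);
```

File offsets stay well inside the 53-bit safe range, so returning a plain number keeps the rest of the JavaScript code free of BigInt arithmetic.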
Conclusion
The article shows how to compile C code to WebAssembly with Emscripten, handle file I/O efficiently via WorkerFS, manage memory, integrate third‑party libraries, and bridge C‑to‑JavaScript and JavaScript‑to‑C calls, achieving up to 65 % faster MD5 computation and enabling complex tasks such as archive extraction directly in the browser.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.