Boosting LLM Inference on Java: Vector API, Project Panama & TornadoVM Performance
This article evaluates the performance of large language model (LLM) inference on the Java platform, examining llama2.java implementations that leverage Java Streams, the Vector API, Project Panama, and TornadoVM GPU acceleration, and compares them against native C versions across various model sizes.
1. Introduction
Large language models (LLMs) can generate human‑quality text, translate, and answer questions, but their high computational demand creates memory‑bandwidth bottlenecks on most hardware. Recent frameworks such as NVIDIA TensorRT‑LLM, vLLM, OpenLLM and Ray Serve aim to optimise LLM inference on GPUs. This article evaluates the performance prospects of LLM inference on the Java platform, establishing a baseline for future JVM‑focused optimisation. Because many high‑performance approaches rely on native implementations (e.g., llama.cpp), the study emphasises the Java Vector API, which exposes SIMD parallelism, and Project Panama, which provides efficient off‑heap data access. Finally, we demonstrate how TornadoVM can combine with these technologies to achieve seamless GPU acceleration from Java.
2. LLM Inference in Java
Java implementations of LLaMA (e.g., llama2.java, jlama, java‑llama.cpp, llama2j) are emerging. This work focuses on llama2.java, which uses Java Streams, Project Panama and the Vector API to approach native‑like performance. Transformer‑based LLMs are memory‑bound due to many matrix‑vector multiplications on relatively small data batches. Prior optimisation relied on parallel streams or automatic vectorisation from C2/Graal. With the Vector API, developers can write efficient vectorised code suited to LLM workloads. Benchmarks on 15M‑110M‑parameter models show that a Java implementation combining Parallel Streams, Panama and Vector API reaches about 90% of the multithreaded llama2.c performance, and the gap narrows as model size grows because memory limits dominate.
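To make the parallel‑streams baseline concrete, here is a minimal sketch of the matrix‑vector multiplication pattern such implementations build on. This is illustrative only, not the actual llama2.java code; the class and method names are hypothetical.

```java
import java.util.stream.IntStream;

public class MatVec {
    // xout[i] = dot(row i of w, x); each output row is independent,
    // so the outer loop parallelises cleanly across stream workers.
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        IntStream.range(0, d).parallel().forEach(i -> {
            float val = 0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        });
    }
}
```

The Vector API variant replaces the inner scalar loop with explicit SIMD lanes, which is where the near‑native gains reported above come from.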
Further Exploring Performance Limits
The forward pass of these LLMs contains seven matmul operations, all vectorised with the Vector API. Profiling with the IntelliJ Profiler reveals that the last matmul, the classifier projection over the vocabulary, accounts for about 19% of total forward‑pass time, making it a prime optimisation target.
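A rough FLOP count explains why the classifier matmul stands out. The helper below uses hypothetical dimensions loosely modelled on small llama2.c‑style checkpoints (they are assumptions, not figures from the article), and ignores attention score computation to keep the estimate to the matmuls themselves.

```java
public class MatmulShare {
    // Rough share of matmul work taken by the final logits projection
    // in a llama2-style forward pass. Dimensions are hypothetical.
    static double classifierShare(long dim, long hiddenDim, long vocab, long layers) {
        long perLayer = 4 * dim * dim        // wq, wk, wv, wo projections
                      + 3 * dim * hiddenDim; // w1, w2, w3 of the FFN
        long classifier = dim * vocab;       // final dim x vocab_size matmul
        return (double) classifier / (layers * perLayer + classifier);
    }
}
```

With a small configuration (e.g., dim = 288, 6 layers, vocab = 32000) the classifier is over half the matmul work; its relative share shrinks as dim and layer count grow, which is consistent with the ~19% profiled for a larger model.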
3. Using TornadoVM to Boost Performance
All matmul operations are inherently parallel, making them suitable for vectorisation or GPU offloading. However, GPU acceleration must consider data‑transfer overhead from host to device memory. Profiling directs attention to the final matmul, which we offload to the GPU via TornadoVM.
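A back‑of‑envelope helper makes the data‑transfer trade‑off explicit. Assuming fp32 weights and placeholder dimensions (both are assumptions, not the article's measured configuration), the weights dominate the traffic by orders of magnitude, which is why transferring them once matters so much.

```java
public class TransferCost {
    // Bytes moved per generated token for an offloaded classifier matmul, fp32.
    static long perTokenBytes(long dim, long vocab) {
        return 4L * dim     // activation vector to the device
             + 4L * vocab;  // logits vector back to the host
    }

    // Bytes for the read-only weight matrix, paid once up front.
    static long oneTimeWeightBytes(long dim, long vocab) {
        return 4L * dim * vocab;
    }
}
```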
3.1 TornadoVM Overview
TornadoVM is a plugin for OpenJDK and GraalVM that enables Java programs to run on heterogeneous hardware (GPUs, multi‑core CPUs). Version 1.0 adds a new API based on Panama’s MemorySegment for off‑heap object and array allocation, improving memory efficiency.
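To illustrate what "off‑heap allocation" buys, here is a rough plain‑Java analogue using direct ByteBuffers. This is not the TornadoVM FloatArray API; it only shows the idea that the float data lives outside the garbage‑collected heap, where a runtime can hand it to a device without copying through heap objects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class OffHeapFloats {
    private final FloatBuffer data;

    OffHeapFloats(int size) {
        // Direct buffers are allocated outside the GC heap, similar in
        // spirit to the Panama MemorySegments behind TornadoVM 1.0 arrays.
        this.data = ByteBuffer.allocateDirect(size * Float.BYTES)
                              .order(ByteOrder.nativeOrder())
                              .asFloatBuffer();
    }

    void set(int i, float v) { data.put(i, v); }
    float get(int i) { return data.get(i); }
}
```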
3.2 Accelerating llama2.java with TornadoVM
Key modifications include:
Using TornadoVM’s VectorFloat8 API to store the read‑only weights.
<code>// Convert the FloatBuffer to a primitive float array
this.wclsAsPrimitive = new float[wcls.remaining()];
wcls.get(wclsAsPrimitive);
// Convert the read-only weights to TornadoVM data types backed by MemorySegments
this.weightInFloatArray = FloatArray.fromArray(wclsAsPrimitive);
this.weightInVectorFloat8 = createVectorFloat8Array(weightInFloatArray);
</code>
Implementing the matrix‑vector operation with TornadoVM annotations.
<code>static void matrixVectorFloat8(float[] xout, VectorFloat8 x, VectorFloat8 w, int n, int d) {
    // Each output row is independent; @Parallel lets TornadoVM map the loop to device threads
    for (@Parallel int i = 0; i < d; i++) {
        float val = 0f;
        int vectorLaneWidth = x.vectorWidth();
        for (int j = 0; j < n; j += vectorLaneWidth) {
            Float8 xv8 = x.get(j / vectorLaneWidth);
            Float8 wv8 = w.get(i * (n / vectorLaneWidth) + j / vectorLaneWidth);
            val += Float8.dot(wv8, xv8);
        }
        xout[i] = val;
    }
}
</code>
Initialising a TaskGraph in the forward method’s pre‑processing stage.
<code>taskGraph = new TaskGraph("s0")
    .transferToDevice(DataTransferMode.EVERY_EXECUTION, s.xVectorFloat8)
    .transferToDevice(DataTransferMode.FIRST_EXECUTION, w.weightInVectorFloat8)
    .task("t0", MatrixVectorCollection::matrixVectorFloat8, s.logits, s.xVectorFloat8, w.weightInVectorFloat8, dim, transformer.config.vocab_size)
    .transferToHost(DataTransferMode.EVERY_EXECUTION, s.logits);
</code>
By marking w.weightInVectorFloat8 with DataTransferMode.FIRST_EXECUTION, the read‑only weights are transferred to the GPU only once, avoiding repeated I/O, while the activation vector and logits move on every execution. Execution logs confirm GPU usage when the --threadInfo flag is enabled.
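The tiling logic of matrixVectorFloat8 can be sanity‑checked on the CPU with plain arrays standing in for the TornadoVM types. In this hypothetical harness, float[][] blocks of 8 lanes play the role of VectorFloat8 and the inner loop plays the role of Float8.dot; it is a verification sketch, not the kernel that runs on the GPU.

```java
public class BlockedMatVec {
    static final int LANES = 8; // stand-in for the Float8 lane width

    // x holds n/LANES blocks of 8 floats; w holds d * (n/LANES) blocks, row-major.
    static void matvec(float[] xout, float[][] x, float[][] w, int n, int d) {
        int blocks = n / LANES;
        for (int i = 0; i < d; i++) {
            float val = 0f;
            for (int b = 0; b < blocks; b++) {
                float[] xv = x[b];
                float[] wv = w[i * blocks + b];
                for (int l = 0; l < LANES; l++) { // Float8.dot equivalent
                    val += wv[l] * xv[l];
                }
            }
            xout[i] = val;
        }
    }
}
```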
4. Performance Evaluation
Model: TinyLlama (≈1.1 B parameters), mirroring the Llama 2 architecture and trained on roughly 3 trillion tokens.
System: 13th‑gen Intel Core i7‑13700 (24 threads), NVIDIA GeForce RTX 3070, Pop!_OS Linux, OpenJDK 21+35‑2513, TornadoVM 1.0.
4.1 Acceleration of llama2.java
Comparing llama2.java with llama2.tornadovm.java shows that offloading the dominant matmul to the GPU yields an average throughput increase of 13% in tokens per second (range 9‑15%) across the tested model sizes.
4.2 Normalised Speed‑up vs Native C (OpenMP)
When benchmarked against the native C implementation (llama2.c) compiled with OpenMP (24 threads), the TornadoVM‑accelerated Java version achieves up to 98 % of the C performance for larger models, further narrowing the Java‑C performance gap.
5. Future Outlook for Java/JVM in AI
The presented analysis is only the beginning of possible Java‑based LLM optimisations. Combining Parallel Streams, Panama, Vector API and TornadoVM opens many advanced optimisation avenues, especially as model sizes grow and GPU benefits outweigh data‑transfer costs. Ongoing work on quantisation, shared/unified memory in TornadoVM, and AI‑specific compiler optimisations (e.g., Mojo, kernel APIs) promises additional gains. Emerging JVM projects such as Babylon and Valhalla are expected to push performance boundaries further, making Java a strong candidate for AI/ML workloads.
6. Conclusion
Java’s performance capabilities are rapidly evolving. By leveraging existing JVM features (Streams, Panama, Vector API) together with TornadoVM, developers can achieve near‑native C performance for LLM inference and have a promising path toward surpassing it. Integrating compiler technologies under GraalVM further simplifies optimisation across diverse hardware accelerators. Acknowledgements go to Oracle’s Alfonso Peterssen for his contributions to extending the original llama2.java application.
Related links:
https://github.com/mukel/llama2.java
https://github.com/mikepapadim/llama2.tornadovm.java
https://github.com/jzhang38/TinyLlama