Gartner Report: Which Java Runtime is for you? | DOWNLOAD NOW

Support
Blog chevron_right Performance

A Matter of Interpretation: From Bytecodes to Machine Code in the JVM

In the first article of this blog series, we looked at how JVM-based applications use a virtual instruction set in the form of bytecodes to create a platform-neutral application.  The JVM converts these bytecodes at runtime into the instructions for the platform the application is running on.

Before we delve into just-in-time (JIT) compilation, let’s look at how the JVM starts running bytecodes using an interpreter.

The basic idea of the interpreter is simple.  Each method, starting with public static void main(), has a set of bytecodes that define what it will do.  The JVM will interpret the first bytecode and (assuming the instruction does not result in a jump to a different address or call another method) continue with each bytecode in sequence.  Interpreting is the process of converting a bytecode to whatever operating system calls or machine code instructions are required to perform the action of the bytecode.

Stating it like this sounds straightforward, but there’s a considerable amount of work involved.

Let’s return to the sample program from the previous post that sums the numbers from one to ten (we’ll leave out printing the result to keep the code as simple as possible).

public class Sum {
  public static void main(String[] args) {
    int sum = 0;

    for (int i = 1; i <= 10; i++) {
      sum += i;
    }
  }
}

The compiled classfile contains the following bytecodes (produced with javap -c).

Code:
       0: iconst_0
       1: istore_1
       2: iconst_1
       3: istore_2
       4: iload_2
       5: bipush        10
       7: if_icmpgt     20
      10: iload_1
      11: iload_2
      12: iadd
      13: istore_1
      14: iinc          2, 1
      17: goto          4
      20: return

This shows there are 14 bytecodes to be executed (The count of 20 includes the operands for the bytecodes).  Since we have a loop, the nine instructions from positions 4 to 17 will be executed ten times.  Once we include the three extra instructions required to exit the loop (positions 4 to 7), we have a total of 98 bytecodes to interpret.

Here, then is an interesting question.  If we run this application, how many bytecodes are executed?

To determine this, we need a special build of the JVM that provides a greater level of observability than the standard one.  I won’t go into all the details of how to do this, but you need to configure an OpenJDK build environment using the –with-debug-level=slowdebug and –with-native-debug-symbols=external flags.  This provides a range of additional -XX options that can be used when starting the JVM.  I’ll also state that I used jlink to generate a runtime that only included the java.base module, since that’s all we need to run the application.

If we run our application with java -XX:+CountBytecodes Sum, we get the following:

359058 bytecodes executed in 0.4s (0.904MHz)

Seriously!  The expected 98 bytecodes have turned into nearly 360 thousand!

What is even more confusing is that you get three different results if you run the application three times.  So much for the deterministic finite automata I learnt about at university.

What is going on? 

The first thing to look at to figure this out is to run the application with the -verbose:class JVM option.  This will provide information about class loading as the JVM runs.  A simple count of the output reveals that 412 classes are being loaded, 409 of which are loaded before our Sum class.  Although it’s not required that a class is initialised as soon as it is loaded (as defined in the JVM Specification), in this case, they all are.  As you can imagine, many classes require complex initialisation, instantiating objects and calling methods.  All of which adds up to (as we can see) the execution of many thousands of bytecodes. 

It took me a while to figure out why the number of bytecodes executed varies for different runs.  Although the code being executed is not changing (and is therefore deterministic), the path taken through the code can be different.  For example, even something as simple as the value of the current time could result in a different branch being taken in an if statement.

Let’s return to the actual subject of the post, which is how the JVM interprets each of these many bytecodes.  Since there are many possible ways of doing this, we’ll discuss the Hotspot JVM implementation in OpenJDK.

In JDK 1.0, the interpreter, like the rest of the JVM, was written in C and consisted of an infinite while loop.  Inside the loop was a giant switch statement, which included a case for each bytecode instruction.  The while loop starts executing bytecodes for a method, and when the method returns exits the loop via a break.

The C++ Interpreter

The interpreter evolved in JDK 1.2 (I suspect, although I can’t confirm this) with the move to the Hotspot JVM.  Sometimes referred to as the C++Interpreter, this still uses a huge switch statement but makes changes to improve efficiency.  For example, various functions were inlined to take advantage of platform-specific features.

Although the C++Interpreter is still included in the OpenJDK source code, it is not built and included in the JVM by default.  In this case, the Highlander principle does hold, and there can be only one (interpreter).  The only build that makes use of this is the zero-assembler one.  

The Template Interpreter

The interpreter used in OpenJDK by default is the template interpreter (this has been the default since, I believe, JDK 1.4, although, again, I have not confirmed this).  Using hardcoded case statements to implement each bytecode’s functionality limits optimisations that can be exploited for a given platform.  We’ll see this more when we look at the advantages and disadvantages of AOT and JIT compilation later in this series.

The template interpreter builds the interpreter code for each bytecode when the JVM starts up.  As the name suggests, each bytecode has an assembler template.  The JVM can determine the precise details of the microarchitecture it’s running on and use this knowledge to generate optimised code for each bytecode.  These pieces of code are called codelets.  Rather than using a switch statement, the template interpreter uses a dispatch table with entry points for each codelet it generates.

You can get more information about what the template interpreter does using the -XX:+UnlockDiagnosticVMOptions -XX:+PrintInterpreter command-line options.  Running this on my build of JDK 17 with our simple application showed:

Interpreter
code size        =    144K bytes
total space      =    144K bytes
wasted space     =      0K bytes

# of codelets    =    271
avg codelet size =    546 bytes

As you can see, the template interpreter generates 271 codelets, which is more than the number of bytecodes in the JVM instruction set.  A closer examination shows that, in addition to the bytecode instructions, there are a set of codelets to optimise certain operations.  For example, there are codelets to handle things like standard maths functions (sin, cos, tan, etc.)

There are also a number of fast path codelets.  When a field from the current object is loaded via the this reference, it always results in two bytecodes, aload_0 followed by getfield.  This combination of bytecodes results in the value of this being pushed onto the stack and then immediately popped back off again.  In this case, the JVM provides a pseudo-instruction, fast_agetfield, which eliminates the redundant push-pop and reduces the two bytecodes of a very common operation to one.  Other fast path codelets do similar things to improve efficiency of common multiple bytecode operations.

There are also, somewhat bizarrely, four nofast codelets that are related to the use of the class data sharing (CDS) feature.

The average size of the codelets is 546 bytes.  All codelets have a size that is a multiple of 32 bytes, padding with zeros where necessary; many are only 96 bytes.  Some, however, are surprisingly large.  All the branch if int comparison with zero bytecodes (ifeq, ifne, etc.), generate codelets (on my Xeon powered Linux box) of 1376 bytes.  (I haven’t disassembled the code to try and figure out why that would be so big).  Another curious thing is that the codelet for no-op is 96 bytes!

The impact of this template, on-the-fly code generation approach, is substantial.  Volker Simonis made a benchmark comparison of both, and his results showed that the template interpreter was (broadly speaking) twice as fast as the simpler C++ interpreter.

Surely, it would make sense to cache the template generated code to reduce the start-up time of the JVM?  Helpfully, we can see how long it takes to generate all the codelets using the -Xlog:startuptime flag.  For our application, the results on my machine are:

java -Xlog:startuptime Sum
[0.026s][info][startuptime] StubRoutines generation 1, 0.0006406 secs
[0.040s][info][startuptime] Genesis, 0.0139758 secs
[0.044s][info][startuptime] Interpreter generation, 0.0013366 secs
[0.047s][info][startuptime] StubRoutines generation 2, 0.0021575 secs
[0.047s][info][startuptime] MethodHandles adapters generation, 0.0000343 secs
[0.047s][info][startuptime] Start VMThread, 0.0002398 secs
[0.055s][info][startuptime] Initialize java.lang classes, 0.0078322 secs
[0.057s][info][startuptime] Initialize java.lang.invoke classes, 0.0005092 secs
[0.062s][info][startuptime] Initialize module system, 0.0045786 secs
[0.062s][info][startuptime] Create VM, 0.0606478 secs

The interpreter generation only took 1.3ms, so even reducing this to zero would have no noticeable impact on the start-up time.

I remember when Java was first released and started to become popular.  One of the arguments people had against the platform was that it was very slow compared to traditional statically compiled languages like C and C++.  Which, at that time, it was.  As we’ve seen, the interpreter only concerns itself with a single (or possibly two using fast path codelets) bytecode at a time.  Even using templates will not change this fact.  A static compiler analyses larger blocks of code like loops and methods.  It can then optimise the generated code using techniques like loop unrolling and method inlining.  The interpreter will never be able to match this level of performance.

To address this without sacrificing “Write once, run anywhere”, Java adopted the use of adaptive, just-in-time (JIT) compilation. 

In the next post in this series, we’ll look at the basic ideas of JIT compilation and examine how the JVM decides which methods to compile and when to do so.