Lifting The Curse of Apollo For Benchmarking

Here at Azul, a large part of our business is improving the performance of Java applications through improvements to the JVM. The key to determining where improvements can be made is accurate measurement of how a system performs. This is far from simple, due to at least two major factors:

  1. Having repeatable tests so that the effect of changes can be measured accurately. For this we use benchmarks. Much has been written about the merits of benchmarks in different forms: micro, synthetic, kernel and so on, especially in terms of what constitutes a real-world test of an application.
  2. Ensuring that the benchmarking tests we use measure the right thing. This may seem like an obvious requirement, but it is one that is often undermined by subtle effects in a benchmark.

Recently one of my colleagues, Nitsan Wakart, contributed some changes to the Apache Cassandra Stress tests, which he documents in detail here. These changes address an issue in the way the stress tests measure performance.

The need for these changes demonstrates how difficult it is to get benchmarking right. Let’s look at the key points here:

  • Apache Cassandra is a very popular open-source distributed database management system (check the Wikipedia entry on the Greek mythological princess of Troy for how I came up with the blog title). When deploying Cassandra in production, it is important to be able to determine whether it meets its non-functional requirements, part of which is a responsiveness Service Level Agreement (SLA). Such an SLA typically specifies one or both of two things: the maximum number of concurrent users that must be supported and the maximum acceptable time to get a response to a request.
  • As Nitsan points out, the results generated by the stress test on a fully loaded system did not report the data needed to assess real-world responsiveness. This was because the benchmark was measuring service time, not response time. What’s the difference? Well, what an end user is typically interested in is response time: how long it takes between sending an arbitrary request and getting a response. Developers focus on service time (initially, at least) because their goal is to reduce the time it takes to perform the work associated with each request.

Response time = wait time + service time:

    • Wait time: This is how long a request has to wait after it is received before it can start to be processed. If the system is not fully loaded this should (in theory) be zero or very close to it.
    • Service time: The time from when the request starts being processed until the response is sent.
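
To make the decomposition concrete, here is a minimal, self-contained Java sketch; the names and the simulated delays are my own illustration, not internals of the Cassandra stress tool:

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: the names and the simulated delays below are assumptions
// for this sketch, not internals of the Cassandra stress tool.
public class ResponseTimeDecomposition {

    public static void main(String[] args) throws InterruptedException {
        long intendedStart = System.nanoTime(); // when the request was sent

        // Simulate a request sitting in a queue for 5 ms before processing starts.
        TimeUnit.MILLISECONDS.sleep(5);
        long actualStart = System.nanoTime();   // when processing begins

        // Simulate 2 ms of actual work (the "service").
        TimeUnit.MILLISECONDS.sleep(2);
        long end = System.nanoTime();           // when the response is sent

        long waitTime     = actualStart - intendedStart; // time spent queued
        long serviceTime  = end - actualStart;           // time doing the work
        long responseTime = end - intendedStart;         // what the user experiences

        // responseTime == waitTime + serviceTime (up to clock-read overhead)
        System.out.printf("wait=%dus service=%dus response=%dus%n",
                TimeUnit.NANOSECONDS.toMicros(waitTime),
                TimeUnit.NANOSECONDS.toMicros(serviceTime),
                TimeUnit.NANOSECONDS.toMicros(responseTime));
    }
}
```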

On a system that is being swamped with requests, service time should level off once the system is processing as many concurrent requests as it can. If requests arrive faster than the system can process them, however, the response time will continue to increase as the queue of requests waiting to be processed grows. For example, if requests arrive at 1,200 per second but the system can only process 1,000 per second, the queue grows by 200 requests every second, and so does the wait time seen by each newly arriving request.

The way the Cassandra Stress test worked in rate mode is common to many benchmarking frameworks: M threads attempt to make N calls per second between them, and the time taken by each call is measured. The problem with this is that each thread sends its requests synchronously: while a thread is blocked waiting for a response, it cannot send another request. This is not how the vast majority of systems work in real life; users don’t wait for other users to complete their requests before sending their own. Gil Tene, our CTO, has given this type of benchmarking problem its own name: coordinated omission (you can find a lot more detail about this in Gil’s talk “How Not To Measure Latency”).
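
To see why this under-reports bad latency, consider a sketch like the following; flakyCall() and the loop around it are made up for illustration, not code from the stress tool:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: flakyCall() and this loop are made up to show the trap,
// not code from the Cassandra stress tool.
public class CoordinatedOmission {

    // A pretend service: usually ~1 ms, but roughly 1 in 100 calls stalls for 1 s.
    static void flakyCall() throws InterruptedException {
        Thread.sleep(ThreadLocalRandom.current().nextInt(100) == 0 ? 1_000 : 1);
    }

    public static void main(String[] args) throws InterruptedException {
        long worstMs = 0;
        // Each thread issues requests back-to-back and times only the call itself:
        for (int i = 0; i < 1_000; i++) {
            long start = System.nanoTime();
            flakyCall();                                        // blocking request
            long serviceMs = (System.nanoTime() - start) / 1_000_000;
            worstMs = Math.max(worstMs, serviceMs);
            // When flakyCall() stalls for a second, this loop sends nothing during
            // that second: the stall is sampled once, instead of the ~1,000 times
            // that independent users arriving every 1 ms would have felt it.
        }
        System.out.println("worst observed service time: " + worstMs + " ms");
    }
}
```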

To put the problem another way: when testing in rate mode at, say, 1,000 requests per second, what you want is one request submitted every millisecond. If the system under test takes longer than 1 ms to respond to a request, the rate at which requests are actually submitted will clearly fall short of the one intended.
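
The coordinated-omission-free approach is to pace requests against a fixed schedule and to measure each response from its intended send time rather than its actual one. Here is a minimal Java sketch of that idea; the pacing scheme and all names are my own assumptions, not Nitsan’s actual patch:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// A sketch of rate-mode measurement that avoids coordinated omission. The
// pacing scheme and all names here are illustrative, not Nitsan's patch.
public class PacedLoadSketch {

    public static void main(String[] args) throws InterruptedException {
        final long intervalNanos = TimeUnit.MILLISECONDS.toNanos(1); // 1,000 req/s
        long nextIntended = System.nanoTime();

        for (int i = 0; i < 10_000; i++) {
            // Sleep until this request is due; if we are already behind
            // schedule, send immediately.
            long pause = nextIntended - System.nanoTime();
            if (pause > 0) {
                LockSupport.parkNanos(pause);
            }

            doRequest(); // hypothetical blocking call to the system under test

            // Measure from the *intended* send time, so time spent stuck behind
            // a slow response is charged to response time, not silently dropped.
            long responseTimeNanos = System.nanoTime() - nextIntended;
            record(responseTimeNanos);

            nextIntended += intervalNanos; // fixed schedule, independent of responses
        }
    }

    static void doRequest() throws InterruptedException { Thread.sleep(1); }

    static void record(long nanos) { /* e.g. into an HdrHistogram */ }
}
```

The key point is that the schedule advances regardless of how the system responds: if a stall delays later requests, every delayed request is charged the time it spent waiting for its turn.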

In addition to modifying how the Cassandra stress tests apply a constant rate of requests, Nitsan has also added the ability to generate output files suitable for use with HdrHistogram (a High Dynamic Range histogram, which is especially well suited to representing response-time data). They say, “A picture is worth a thousand words”; in this case, an HdrHistogram is worth a thousand lines of code when analysing performance.
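
As a rough illustration of what working with such data looks like, here is a minimal sketch using the HdrHistogram Java library (available as org.hdrhistogram:HdrHistogram); the latencies recorded here are synthetic, purely for show:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

// A minimal sketch of recording latencies into an HdrHistogram and printing
// the percentile distribution. The latencies are synthetic, purely for show.
public class HistogramSketch {

    public static void main(String[] args) {
        // Track values up to one minute with 3 significant decimal digits.
        Histogram histogram = new Histogram(TimeUnit.MINUTES.toNanos(1), 3);

        for (int i = 0; i < 100_000; i++) {
            long fakeLatencyNanos = TimeUnit.MICROSECONDS.toNanos(
                    100 + ThreadLocalRandom.current().nextInt(900));
            histogram.recordValue(fakeLatencyNanos);
        }

        // Print the percentile distribution, with values scaled to milliseconds.
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}
```

The resulting table makes percentile behaviour (median, 99th, 99.9th, maximum) visible at a glance, which is exactly the shape of data a responsiveness SLA is written against.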

The mythological Cassandra’s curse was to never be believed. With Nitsan’s changes, this curse has been lifted from the benchmarks generated by the non-mythological Cassandra load generator.

© Azul Systems, Inc. 2016 All rights reserved.