Comparison of k6 test result visualizations

By Pawel Suwala, LoadImpact CTO. On 2019-11-13

As part of my daily work, I often answer support questions from k6 users and LoadImpact customers. I’ve been doing this for a while, and I’ve noticed that many general questions we get are related to visualizing or interpreting the k6 data. k6 is a versatile tool with many different modes of execution and many different outputs.

I thought it would be a good idea to describe how to visualize, and then interpret k6 results. This will hopefully be helpful to a wide group of users.

Users typically want to visualize k6 data to get an answer to one or more of these questions:

  • Was my test successful? (did it pass or fail?)
  • What is the overall performance of my system? (what’s the average RPS or response time?)
  • Are there any URLs that are crashing under load?
  • What’s the maximum load my system can handle?

You can visualize k6 results in many different ways, and each way gives you a different route to answering these common questions.

This post focuses on comparing the three most popular ways to visualize k6 results:

  • the k6 terminal output
  • a Grafana dashboard (fed by the k6 InfluxDB output)
  • the LoadImpact Cloud Service

Let’s see how well each of these three outputs helps with answering the above-mentioned questions.

Before we dive into the k6 outputs, I think it’s worth noting that no matter how we decide to visualize the data, it’s important to design the performance test well.

How to write a k6 performance test well

It’s much easier to interpret data produced by a well-designed performance test. We are building k6 with developer experience in mind, trying to make test writing a breeze.

We will use https://test-api.loadimpact.com/ as our target system. This is a simple REST API that we will use to demonstrate different ways of visualizing results. If you want to try it yourself, just copy the k6 script available on that URL into a script.js file.

Notice that the test script we are using is designed to take advantage of Checks and Thresholds. With those two features, we are able to specify our expectations for response status codes and acceptable response times.

Checks are like asserts in your functional unit tests, but they don’t abort the test execution. Instead, they record the result, so you can see how many checks have succeeded and failed at the end of your test. It’s often acceptable to have a few failed checks if you have thousands of successful ones.
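
To make this concrete, here is a minimal sketch of a check, assuming the public crocodiles endpoint that the demo script exercises (the check name is arbitrary):

```javascript
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  let res = http.get('https://test-api.loadimpact.com/public/crocodiles/');
  // Records a pass/fail result for this response without aborting the run.
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
}
```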

Thresholds are for specifying general performance expectations of your system. For example:

  • “95% of API calls should return within 0.2s”
  • “number of failed requests must be below 0.5%”

If any of the thresholds fail, the test is considered a failure. Optionally, thresholds can abort the test during execution.
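
As a sketch, here is how those two expectations could translate into script options. Note that failed_requests is a hypothetical custom metric name; k6 only knows about it because the script records it explicitly:

```javascript
import http from 'k6/http';
import { Rate } from 'k6/metrics';

// Hypothetical custom metric tracking the fraction of failed requests.
let failedRequests = new Rate('failed_requests');

export let options = {
  thresholds: {
    // "95% of API calls should return within 0.2s"
    http_req_duration: ['p(95)<200'],
    // "number of failed requests must be below 0.5%";
    // abortOnFail optionally stops the test once the threshold is crossed.
    failed_requests: [{ threshold: 'rate<0.005', abortOnFail: true }],
  },
};

export default function () {
  let res = http.get('https://test-api.loadimpact.com/public/crocodiles/');
  // Record one data point per request: true counts as a failure.
  failedRequests.add(res.status !== 200);
}
```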

With the design of the test in mind, let’s get down to testing!

k6 Terminal Output

The first and simplest output is the k6 terminal summary that is displayed after the test is finished.

Since our test script contains checks and thresholds, the result of the test is nicely visualized in the terminal output, and will immediately answer most of the above questions.

Let’s look at a sample output from k6 for a test run with 10 VUs (virtual users).

We run this script with the k6 run scripts/crocs.js command (you can name your script differently).

k6 Terminal Output


Everything is “green”. All checks and thresholds have passed. There’s probably no need to look at the detailed information about each request, because the test met our expectations.

Tip: once you have a well-designed test, you may want to include it in your CI/CD pipeline to automate the execution of your performance tests. If thresholds are crossed, k6 will exit with a non-zero exit status, causing the CI job to fail.

Now let’s increase the VUs to 70 and see how this system performs.

k6 Terminal Output


Here we see that 282 checks succeeded; however, one check failed. It looks like the system is becoming overloaded.

Also, all our thresholds have failed. The 95th percentile for the response time is about 900ms, far more than the 0.2s we specified in the threshold.

Going beyond 70 VUs crashes this API and makes all checks and thresholds fail.

Let’s see how well this output answers the common questions:

  • Determine the result of the test: Yes. It’s clear when the test has failed or succeeded.
  • See the overall performance: Yes. We can see http_req_duration, which tells us the average/min/max response time, and we see the number of requests per second.
  • See which URLs are crashing under load: No. k6 doesn’t display this data by default (unless you specify a check for it).
  • See the maximum load the system can handle: Not easily. While it’s easy to see whether the test succeeded, it’s not simple to determine the load at which the system becomes overloaded.


Summary

If your tests are well designed and you don’t need many details, the terminal output may be just enough for you. Of course, Grafana offers a little more functionality, so we will explore it next.

Grafana Dashboard

k6 loves Grafana. Both k6 and Grafana are open source and work very well together. FYI: our offices are about 100 meters apart, so we are also friends.

We have already extensively described how to configure Grafana dashboards, so if you are interested in trying it yourself, follow our guide.

While writing this article, I realized that we have not done a great job at building good Grafana dashboards for k6. Several dashboards have been contributed by the community; the most popular is k6 Load Testing Results.

There are many good things about this dashboard, but it hasn’t been updated since 2017, and several important features are missing, such as Thresholds and the percentage of failed checks.

We will work on improving our Grafana dashboards, but for now, let’s run our k6 tests and see how the current dashboards perform.

For the purpose of this comparison, I’ll just show Grafana output for the same two test runs. First, let’s start with 10 VUs.

I executed this test by running k6 run -o influxdb=http://localhost:8086/k6 scripts/crocs.js.

k6 Grafana Dashboard


And here’s a 70 VU test:

k6 Grafana Dashboard


On these dashboards, we get a nice visualization of VUs, RPS, and response time. With this data, we are able to pinpoint the moment when the performance started degrading.

It’s interesting to note that the performance was stable until the test ramped up to 30 VUs. After that point, the response time started to rise and fluctuate. The RPS stopped rising around that time as well, indicating that the system had reached its capacity. We were not able to get this information from the k6 terminal output, so that’s one advantage of the Grafana output.

It’s interesting to see how many checks were executed per second during the run, but we don’t see whether these checks succeeded or failed, which makes this data not very useful.

Grafana makes it easy to zoom in on any particular chart to see more detailed data. This is especially useful if you execute a long, 60-minute test.

This specific dashboard doesn’t provide a URL table, but others do. For example, dashboard 10660 provides a table with min, max, and p95 metrics per URL, so users can see which endpoints are slower than others. This table, however, is missing an error column, so it’s not possible to see how many failures happened.
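
One practical tip for such per-URL tables: if your script hits many distinct URLs (say, one per resource id), every URL becomes its own row. You can group related requests under a single label with k6’s name tag. A small sketch, assuming the test API’s crocodile-detail endpoint:

```javascript
import http from 'k6/http';

export default function () {
  // Pick a random resource id, so each iteration hits a different URL.
  let id = Math.floor(Math.random() * 8) + 1;
  // The "name" tag overrides the URL used for grouping, so all of these
  // requests show up as a single row in per-URL tables.
  http.get(`https://test-api.loadimpact.com/public/crocodiles/${id}/`, {
    tags: { name: 'crocodile-detail' },
  });
}
```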

What’s not so good about the current Grafana dashboards:

  • No thresholds are displayed.
  • It doesn’t show the percentage of failed checks.
  • It doesn’t show how many HTTP errors happened during the execution.

Until these shortcomings are fixed, I think the Grafana dashboard should only be used as a complementary visualization tool. The k6 terminal output still needs to be consulted to get the full overview of the test result.

Note on a single timeline

The biggest drawback I found in Grafana is that all data ends up on a single timeline. Grafana doesn’t have a notion of a discrete “test run”. If you run one test after another, data from both will be displayed on the same chart.

This is a serious shortcoming once you run more than a few tests. For example:

  • You cannot organize the results based on your different tests.
  • You cannot compare two tests on a single dashboard.
  • You cannot easily look at tests from the past.
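
Until that changes, a common workaround is to stamp every metric a run emits with a unique tag and filter on it in Grafana. Here is a minimal sketch; the testid tag name is just a convention, not something k6 treats specially (the same can be done from the command line with k6 run --tag testid=...):

```javascript
import http from 'k6/http';

export let options = {
  // Attached to every metric sample this run emits; filtering on "testid"
  // in Grafana carves one discrete test run out of the shared timeline.
  tags: { testid: 'crocs-70vu-2019-11-13' },
};

export default function () {
  http.get('https://test-api.loadimpact.com/public/crocodiles/');
}
```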

Note on the additional data sources

Grafana is a general visualization tool which has its drawbacks (as mentioned above), but it also has benefits. You can pull data from other sources and correlate it with the load test results. Grafana now offers Loki integration, which allows you to correlate server-side logs with load test data. This can be very useful for debugging.

Now let’s look at our 4 questions to see if the current Grafana dashboards are able to answer them:

  • Determine the result of the test: No. This dashboard doesn’t include thresholds or pass/fail check data, so it’s not easy to determine whether the test was successful.
  • See the overall performance: Yes. The http_req_duration chart clearly displays the overall performance, and the RPS chart displays the number of requests per second.
  • See which URLs are crashing under load: No.
  • See the maximum load the system can handle: Yes. Grafana provides an HTTP response time chart that makes it easy to determine the moment the system becomes overloaded.


Summary

Grafana visualization is a good addition to the k6 terminal output, but for most purposes, it can’t be used on its own until we add thresholds, checks and URL failure rate to the dashboard.

LoadImpact Cloud Service

Now let’s take a look at LoadImpact’s Cloud Service.

From the get-go, I need to note that I’m trying to be objective in this comparison, but I’m obviously a little biased because I work at LoadImpact and try to make this product as good as possible. Nevertheless, I don’t want to hide any shortcomings we have, and I’ll try to be fair in my comparison.

LoadImpact has many other features, such as distributed cloud execution, result storage, load generation from multiple geographical locations, performance trending, alerting, etc. For the purpose of this article, we will focus only on data visualization and analysis, since the other tools don’t offer comparable functionality there.

Let’s run the same test with LoadImpact. Starting with 10 VUs.

I executed this test by running k6 cloud scripts/crocs.js in my terminal.

k6 LoadImpact Results


You can see this test live here: https://app.loadimpact.com/k6/anonymous/adb0bc979f104732a8fd0c461410544f

One thing we try to do at LoadImpact is to minimize the work required to analyze test results. We try to make this interface as clear as possible, so users can get a good idea of the result at first glance.

The first thing you notice in this UI is the test result summary in the middle of the screen. In this case, it’s clearly a positive result, indicated by a green tick. No performance issues have been automatically detected. You also see that all thresholds and checks have passed.

It’s very clear that the load test was successful. You can dive deeper into the results by clicking on thresholds or specific URLs, but that’s not necessary in most cases.

Now let’s run a 70 VU test and see what that interface looks like.

k6 LoadImpact Results


You can see this test live here: https://app.loadimpact.com/k6/anonymous/f51401ed49c5436db309cf0c681a3334

Again, this interface makes it obvious that the test is not successful. We got performance alerts, all 5 thresholds failed, 24 checks failed, and several HTTP calls returned unexpected status codes.

Note on the performance alerts

LoadImpact developed a machine-assisted analysis tool called Performance alerts to automatically detect many different performance problems. These algorithms run in the background, analyzing the test results in real time, and inform the user whenever a problem has been detected. You can read more about this here.

Note on the test comparison functionality

Another useful function in LoadImpact’s UI is test comparison. Users can compare data from two distinct test runs to spot performance improvements or regressions. Additionally, there’s a trending chart visualizing response time across all test executions.

Now let’s see how well LoadImpact answers the 4 questions:

  • Determine the result of the test: Yes. LoadImpact automatically analyzes the test data and determines the result.
  • See the overall performance: Yes. The main chart shows the overall test performance, and a performance summary is displayed once the test finishes.
  • See which URLs are crashing under load: Yes. The LoadImpact UI displays the performance of each individual URL, together with the HTTP status codes and the number of failures.
  • See the maximum load the system can handle: Yes. The performance overview chart makes it easy to see at which point the response time started to go up. It’s also possible to drill down into specific URLs and determine at which point they started returning errors.


Summary

The Cloud Service is a comprehensive tool for visualizing and analyzing k6 data.

We built it specifically to answer the common questions we hear from our users. Cloud Service clearly displays the data, analyzes the results and streamlines the whole load testing process.

Comparison Table

Determining the test result (success or failure):

  • k6 Terminal: Good but basic. Easy when tests are well designed to include checks and thresholds. No automatic performance alerts.
  • k6 Grafana Dashboard: Poor. Difficult with the current dashboard configuration, since checks and thresholds are omitted. No automatic performance alerts.
  • LoadImpact Cloud Service: Very good. The UI is clear and focuses on the most important thing first. In addition, it provides automatic performance analysis.

See the overall performance:

  • k6 Terminal: Yes.
  • k6 Grafana Dashboard: Yes.
  • LoadImpact Cloud Service: Yes.

See which URLs are crashing under load:

  • k6 Terminal: No.
  • k6 Grafana Dashboard: No.
  • LoadImpact Cloud Service: Yes.

See the maximum load the system can handle:

  • k6 Terminal: No.
  • k6 Grafana Dashboard: Yes.
  • LoadImpact Cloud Service: Yes.

Conclusions

We have reviewed three different ways of visualizing k6 test results. Depending on your project and needs, you may choose to use k6 with Grafana, with the LoadImpact Cloud Service, or with nothing at all.

No single solution is best for everyone. If you are running a performance test for a small system, standalone k6 might be good enough for you. The current Grafana dashboard is a good complement for correlating different general metrics during test execution. If you need to manage multiple performance tests and want deeper analysis of your results, LoadImpact’s Cloud product might be a better fit.

I can’t make a general recommendation on how you should use k6. In this article, I have only compared the visualization aspects of running load tests. In real life, you also have to weigh other factors, such as the cost of setting up and maintaining your own infrastructure versus the price of a hosted service, as well as your overall performance testing needs.
