Performance testing guidelines

Written sometime between 2001 and 2009. The original publication date is lost. This post has moved across three blogging platforms during its life. I preserve it here as a snapshot of my thinking about testing at the time I wrote it.

Introduction

This page exists to help you design, run and interpret useful performance tests. These are high-level tips to help you avoid common pitfalls. It’s still largely a draft; feel free to contact me if you have any questions about any of this!

More than other forms of testing, performance testing is prone to bike shed syndrome: everyone has an opinion on what colour to paint the bike shed. In fact, building a meaningful performance test is hard. The bike shed syndrome only makes it harder.

Good reasons for executing performance tests

There are two main reasons for performance testing:

  • Execute the same performance test over several builds while making performance improvements
  • Execute the same performance test to verify that performance remains stable as the software evolves

Deciding what to measure

In short, pick one or two things to measure, but monitor several system statistics.

Let’s say you’ve got to build a test which will measure the time to create a number of entities (let’s call them ‘transactions’, but they could be anything) in your test system. From this you’ll be able to generate two figures: the average number of transactions created per second and the average time to create a single transaction. (See the difference?) In order for these figures to be meaningful, you’ll need to monitor the load on the machine (or machines) during the test. It’s the system statistics that give context to your primary measurements. For example, you may achieve a throughput of 20 transactions per second, but this only becomes meaningful when you also report that the database CPU spends 20% of its time waiting for the disk (iowait) and virtually none of its time in user time. In this case, you’re disk-bound at the database tier; the most likely culprits are inefficient query design or costly database write operations.
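
To make the distinction concrete, here’s a minimal sketch (an illustration of mine, not part of any real test harness) that computes both figures from a list of per-transaction start and end timestamps. With several client threads running concurrently, the two numbers are not simply reciprocals of each other:

def summarise(timings):
    """timings: a list of (start, end) timestamps in seconds, one pair per transaction."""
    count = len(timings)
    wall_clock = max(end for _, end in timings) - min(start for start, _ in timings)
    avg_transaction_time = sum(end - start for start, end in timings) / count
    throughput = count / wall_clock   # transactions created per second
    return throughput, avg_transaction_time

# Four overlapping transactions, each taking 2 seconds, spread over 3.5 seconds:
# roughly 1.1 transactions per second, but 2.0 seconds per transaction.
tps, avg = summarise([(0.0, 2.0), (0.5, 2.5), (1.0, 3.0), (1.5, 3.5)])
print("%.2f transactions/sec, %.2f sec/transaction" % (tps, avg))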

The test complexity trap

Performance tests are difficult to design well; you’ll fall into the test complexity trap if you try to make the test all things to all people. Design your test to do one thing well. Don’t expect it to be an all-singing, all-dancing performance tool. If you do, the output it produces will most likely be worthless and you’ll have wasted a lot of time.

Another factor that increases complexity is shifting goalposts. The goalposts shift when a stakeholder asks you to measure one more parameter, or carry out the test in a slightly different way. Tell them that this is actually a different set of tests. If you change direction after a period of days or weeks, any data you’ve already gathered is rendered meaningless.

Test design

Keep it simple

As far as possible, limit confounding factors and over-complex test scenarios. These are variables that upset the accuracy of your test results.

Confounding factors include a poorly tuned Java VM heap, a poorly tuned connection pool configuration, database misconfiguration, or other system components taking up CPU time. If your database has a single disk (or even a pair of disks in a RAID0 configuration), you can expect performance to drop off slowly if your test runs for several hours. This is caused by ever-increasing disk activity on the database server as the volume of data grows. It seems to be an artifact of Oracle housekeeping, but may apply to other DBMSs as well.

When designing your test, you need to think of the effects of each of the elements that make up the entire stack:

The software stack

  • The software under test
  • The Virtual Machine on which the application software runs
  • Third party libraries (database connectors, connection pools, message queues, protocol stacks…)
  • The operating system

The hardware stack

  • Hardware: CPU, memory, disk, network
  • Networked components such as databases and load balancers

Use the same hardware for all your tests. Guard it with your life. If halfway through your testing you’re forced to switch to different hardware, all of your previous test results are meaningless. The weeks of work you’ve spent gathering and analyzing them are wasted. Make sure everyone understands this.

It’s tempting to try to get ‘maximum performance’ from your system by kicking off several threads and hammering the system repeatedly to get as much throughput as possible; this doesn’t actually give you an accurate reflection of how your application performs because you’re forcing it up against a bottleneck somewhere.

Understanding your data

Complex test scenarios include starting with a pre-populated database. To begin with, do your performance testing on an empty database, with just enough schema and data to start your application. As the data population grows, your data or database configuration will become a confounding factor. An incorrectly designed schema, missing or wrong indexes and inefficient queries can all affect the performance of your application, especially as the database population grows. Poor database disk tuning can also affect performance - this may only become apparent as the data volume grows.

Be aware that Oracle will cause smoothly degrading performance in your application as the size of your database population grows. Avoid executing performance tests on a single-spindle ext2/3 Oracle database. Instead, use a database backed by an OCFS disk array. Multi-disk OCFS databases appear to be immune to this performance degradation.

Note also that apparently simple things like the automatically generated names you give to entities can adversely affect performance. Take for example an entity name, generated from a UNIX timestamp, which is indexed for fast searching. If all of the entity names are almost the same, then the index b-tree will be hugely lop-sided. This will badly affect the efficiency of the index, giving you skewed results for search performance. We’ve found it’s better to hash and then base64-encode entity names in order to generate unique, evenly distributed names. No, they won’t be very human-readable, but they’ll be printable and will index like real names.
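
As a minimal sketch (mine, using an assumed SHA-256 digest; the original system may have used something different), hashing a timestamp-based name and base64-encoding the digest gives printable names that spread evenly through the index:

import base64
import hashlib

def entity_name(seed):
    # Hash the generated seed, then base64-encode the digest so the result is printable.
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

# Two nearly identical timestamp-based seeds produce completely different names,
# so the index b-tree stays balanced.
print(entity_name("1149000000"))
print(entity_name("1149000001"))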

Choosing a tool

You have complete freedom on this one; choose what works for you and allows you to get your results in the shortest time. I favour The Grinder because of its built-in ability to gather statistics. JMeter is another tool which I’m planning to investigate.
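
For reference, a Grinder test script is a short Jython file; here’s a rough sketch of the shape (the test number, URL and port are placeholders, not taken from any real system):

from net.grinder.script import Test
from net.grinder.plugin.http import HTTPRequest

# Wrapping the request in a Test tells The Grinder to record timing statistics for it.
createTransaction = Test(1, "Create transaction")
request = createTransaction.wrap(HTTPRequest())

class TestRunner:
    # The Grinder creates one TestRunner per worker thread and calls it once per run;
    # each call here is a single 'transaction'.
    def __call__(self):
        request.POST("http://appserver:8080/app/transactions")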

Running the test

Here’s something you can be certain of: you’ll have to run your test multiple times. Make sure you include enough time in your project schedule for this.

  • The first few test runs will be necessary to iron out kinks in the test design.
  • The second set of runs will highlight the most obvious confounding factors.
  • The third set of runs will start to give you a meaningful performance baseline. You can start to record statistics from this point.
  • During the fourth set of runs you’ll see what effect tweaking certain variables has on the outcome (and beware of falling into the rabbit warren on this one!)
  • Finally, as you test new builds, you’ll be able to show what effect your attempts to improve performance are having.

Fixing performance issues is hard, so expect several iterations of new builds, test runs and slow, incremental improvements. Schedule accordingly. Expect to find and remove a series of bottlenecks.

Thread tuning

Tuning incoming request handling threads and database connection pool size is a bit of a black art. I’ll attempt to clarify the topic here. There are some specifics in this post on tuning Tomcat incoming connection request threads.

Working in conjunction with developers, I use rules of thumb derived from testing experience to converge on what I consider a sensible default configuration for incoming thread connection tuning and for database connection pool tuning. Performance testing is an integral part of this process. These settings are subsequently further tuned in production.

Note that even if you bump up the number of threads handling incoming connections, you’re still constrained by the number of database connection pool threads. If all of the database connection pool threads are busy, then an incoming request will be accepted by an incoming request thread, but that thread may have to wait for a database connection to come free. Annoyingly, there isn’t a one-to-one mapping between incoming request threads and database connection pool threads.
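
Here’s a toy illustration (mine, not from the post) of that interaction: eight request-handling threads contend for a pool of four connections, so some requests are accepted but their threads then sit waiting for a connection to come free:

import threading
import time
from queue import Queue

POOL_SIZE = 4          # database connection pool size
REQUEST_THREADS = 8    # incoming request handling threads

pool = Queue()
for i in range(POOL_SIZE):
    pool.put("connection-%d" % i)

def handle_request(request_id):
    started = time.time()
    conn = pool.get()              # blocks while every pooled connection is in use
    waited = time.time() - started
    try:
        time.sleep(0.5)            # pretend to run a query
        print("request %d used %s after waiting %.2fs" % (request_id, conn, waited))
    finally:
        pool.put(conn)             # always hand the connection back to the pool

threads = [threading.Thread(target=handle_request, args=(n,)) for n in range(REQUEST_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()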

Diminishing returns

Adding more client threads (or even server threads) beyond a certain point will not improve performance - in fact it’s likely to negatively impact performance. The graph below illustrates this.

The graph plots results for a series of test runs. Two parameters are measured for each test run: response time (average transaction time) and throughput (average transactions per second). The first test run uses just one client thread; client threads are progressively added in each subsequent run, finishing with nine client threads in the last test run.

When the data is graphed, the data points in each set are joined by a line to highlight the differences between test runs. The scales for response time and throughput are both on the left.

As the number of clients increases from one to five, throughput (transactions per second) increases steadily, and response time (transaction time) is barely affected. However, as the number of client threads increases beyond six, response time gets longer and longer, and the number of transactions processed per second barely increases. The system is saturated.

Although this data is faked up to show how a typical system will behave, let’s imagine for a moment that this graph shows the true behaviour of a real system. Let’s say, having never performance tested this system before, that you put together a test that triggers eight client threads. Your result will show that you can push through 70 transactions per second, but at a cost of 40 seconds per transaction. This isn’t the optimal load for the server. The optimal load, which is the best trade-off between the highest throughput and the lowest transaction time, is actually a five-client load.
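
One simple heuristic (my own, applied here to made-up numbers shaped like the description above) is to pick the run with the best ratio of throughput to response time:

# Each run is (client threads, transactions/sec, seconds per transaction); the
# figures are invented to match the shape of the graph described above.
runs = [
    (1, 15, 2.0), (2, 30, 2.1), (3, 45, 2.3), (4, 55, 2.6), (5, 62, 2.8),
    (6, 66, 5.0), (7, 68, 12.0), (8, 70, 40.0), (9, 71, 75.0),
]

# The 'knee' of the curve: the most throughput for the least response-time cost.
best = max(runs, key=lambda run: run[1] / run[2])
print("optimal load: %d client threads (%d tps at %.1f s/transaction)" % best)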

Monitoring the system

After much experimentation, I’ve found that vmstat provides the most useful information. It’s a one-stop-shop for recording CPU and virtual memory behaviour before, during and after your test run.
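
As a sketch (my own wrapper script; the log file name is just an example), you can leave vmstat sampling for the length of the run and file the output alongside your results:

import subprocess

def record_vmstat(output_path, duration_seconds, interval=5):
    # 'vmstat -n <interval> <count>' prints the header once, then one sample per
    # interval; note that the first sample vmstat prints is an average since boot.
    samples = duration_seconds // interval
    with open(output_path, "w") as out:
        subprocess.run(["vmstat", "-n", str(interval), str(samples)],
                       stdout=out, check=True)

record_vmstat("vmstat-run-03.log", duration_seconds=3600)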

For diagnosing heap problems, visualgc is essential.

You may like to use top in batch mode to monitor CPU and memory usage. I’ve found that nine times out of ten, unexpected test results are down to resource contention - specifically, contention for either CPU or disk on the database box. You can quickly discover what’s causing the problem by using top first on your application-tier hosts and then on the database-tier hosts:

top - 12:16:37 up 116 days, 18:03,  2 users,  load average: 14.12, 4.73, 1.67
Tasks: 319 total,  15 running, 304 sleeping,   0 stopped,   0 zombie
Cpu0  : 79.9% us, 19.4% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.6% hi,  0.0% si
Cpu1  : 75.5% us, 24.2% sy,  0.0% ni,  0.0% id,  0.3% wa,  0.0% hi,  0.0% si


Press ‘1’ on your keyboard to see the load on individual CPUs.

The first line shows the load average over the last one, five and fifteen minutes. A perfectly loaded machine has a load average of 1 per CPU; anything above that and your host is struggling.

In the example above, the load average over the last minute is 14.12. That’s off the scale. Now look at the two CPUs: the most important metrics are us (user time), sy (system time), id (idle time) and wa (wait time). First of all, the CPUs are spending zero time in the idle state. The rest of the time is spent in user time (the time the CPU spends running user processes) and system time (the time the CPU spends in operating system calls: running the task scheduler, managing memory, managing I/O and so on). The other useful CPU metric is wait time, which is the time spent waiting for I/O devices - literally waiting for the disk to spin around to read or write the required block.

If the system wait time is high (5% or more), then the machine is disk-bound; if the idle time is low or zero but user and system time are high, then the machine is CPU-bound.
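
Those two rules are easy to encode; here’s a minimal sketch (mine) that applies them to a single CPU sample taken from top or vmstat:

def classify(us, sy, id_, wa):
    # us/sy/id_/wa are the user, system, idle and iowait percentages for one CPU.
    if wa >= 5:
        return "disk-bound"
    if id_ <= 1 and us + sy >= 95:
        return "CPU-bound"
    return "not obviously saturated"

print(classify(us=79.9, sy=19.4, id_=0.0, wa=0.6))   # Cpu0 from the top output above: CPU-bound
print(classify(us=20.0, sy=5.0, id_=50.0, wa=25.0))  # a host stuck waiting on its disks: disk-bound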

Interpreting performance test results

“It’s almost never the network.” If it is, your network is misconfigured and your test run is invalidated.

Are my results even valid?

Well, on this particular hardware, yes, they probably are. Attempting to extrapolate your results to other (perhaps similar) hardware will quickly bring you into the realms of speculation and wishful thinking, no matter how rigorous you think you’re being with your calculations. If your software crawls along on a single disk spindle, it’s pure fantasy to expect it to run twelve times faster on twelve spindles. You can’t even say it’ll run twice as fast. You just don’t know. If you want to know, get a disk array with twelve spindles and try it.

If any part of your hardware stack has changed, all of your previous results are invalid.

If any CPU in your hardware stack is spending more than 20% of its time in the iowait state, then that component is disk-bound and you’re not measuring true throughput.

Tips on writing up your results

Most often, you’ll be writing bug reports. Sometimes you’ll be expected to write up a formal document for internal consumption, or, rarer yet, external consumption. Scott Barber has written an excellent presentation on the correct way to present your results.

Report the CPU, memory and disk details of the hardware you’re using. Note any changes to default configurations that you’ve made on the software under test, the Java VM, the operating system, the database server or any other component.
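
A small helper like this one (my own sketch, Linux-specific, shelling out to lscpu, free and df) saves transcribing those details by hand; attach its output to the report so anyone re-running the test later can confirm they’re on equivalent hardware:

import platform
import subprocess

def hardware_summary():
    # Capture the host, CPU, memory and disk details worth quoting in a report.
    sections = ["host: %s (%s)" % (platform.node(), platform.platform())]
    for label, cmd in [("cpu", ["lscpu"]), ("memory", ["free", "-m"]), ("disks", ["df", "-h"])]:
        output = subprocess.run(cmd, capture_output=True, text=True).stdout
        sections.append("== %s ==\n%s" % (label, output.strip()))
    return "\n".join(sections)

print(hardware_summary())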
