Calculating Percentiles with Tableau

I've recently been working on something that makes a great little example of how a small change in approach can make a big difference to the performance and scalability of the result.  It also happens to be an invaluable weapon in the Performance Engineering arsenal.

Aggregate or Detail?

Performance Engineering involves lots of data analysis: whether it’s interpreting the results of a performance test or mining mountains of production log files.  Often the goal is to summarise and aggregate in order to get the big picture about how a system is performing.  But when a system isn’t performing well, or is unstable under load, the real value comes from being able to drill down to the finest-grained detail possible.

Most of the Performance Testing tools in the marketplace have some level of data analysis and reporting built in.  The capabilities range from rudimentary (even downright wrong) to powerful, but are mainly focussed on summary, aggregated analysis, with much more limited ability to access the fine-grained detail that is so essential for rapid problem diagnosis.  In many cases there is no access at all to the detail.  They also don’t help with all the other data we have to analyse that hasn’t been captured through the tool.

So to round out our kit-bag we use an excellent data visualisation tool: Tableau.  It handles most of our needs very well.  It can access data from text files and spreadsheets through to a wide range of databases, and it copes well with large data sets, largely by off-loading most of the grunt work to the back-end database.  But off-loading that work has one disadvantage: if the analysis needs all of the detailed data and the DBMS can't do the calculation, then Tableau can't do it either.

The most obvious example of this is calculating percentiles.  We use percentiles extensively in our performance analysis, not least because Non-Functional Requirements usually express response time targets in terms of percentiles (and quite right too).  Percentiles are also an invaluable tool for seeing the big-picture distribution of statistical data.

Tableau has no built-in support for percentiles (other than percentiles of already aggregated data, a decidedly dodgy and highly misleading statistic, in my view).  With a bit of SQL knowledge it is possible to use Oracle’s built-in percentile functions, but it’s tricky, and only helps if your data happens to be in Oracle.  Other than that, you’re on your own.

With a lot of head-scratching I did manage to get a percentile calculation going in Tableau’s calculation language after version 6 of Tableau added some new analytical functions last year, as I described on the Tableau forum.  But the resulting calculation was hard to understand, hard to reuse, fragile and didn’t scale to the sorts of large datasets that we are typically dealing with.

There is still a steady trickle of pleas for proper percentile support on the Tableau forum (some of which are from other forum users who share my assessment of my method).  The Tableau response to these requests is (quite validly) that Tableau is constrained by lack of support in the database back-ends.  I also suspect that there just isn’t enough demand to justify the work to do it another way, though with version 7 due later in the year, who knows?

Having to swap to and fro between two different toolsets and data representations to get both the summary and the detailed views of our performance data is a pain for us.  So, prompted by some grumbles from another member of the team about how hard my original method is to use, I've recently had another go at it.

So what’s so hard about calculating percentiles?

A percentile is the value of a variable below which a certain percentage of observations fall.  So to work it out, you need to arrange all of the observations in order, count them, and then find the value that falls the required percentage of the way through.  That sounds easy.  And it is, as long as you can get at all of the values.
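To make that concrete, here is a minimal Python sketch of the "get at all of the values" approach.  It isn't the Tableau calculation itself, and it uses the simple nearest-rank definition of a percentile (other definitions interpolate between neighbouring values, but the idea is the same):

```python
def percentile_naive(values, pct):
    """Naive percentile: sort every observation and index into it.

    Uses the nearest-rank rule and needs every value in memory at once.
    """
    ordered = sorted(values)                        # every single observation
    rank = max(1, -(-len(ordered) * pct // 100))    # ceiling of n * pct / 100
    return ordered[rank - 1]

# Median (50th percentile) of the 7 numbers used in the worked example below
print(percentile_naive([3, 1, 4, 4, 1, 2, 1], 50))  # -> 2
```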

But that is where the problem lies.  Retrieving all of the data rows into Tableau, sorting them and counting through to find the required value just doesn't scale very well, and is ultimately constrained by addressable memory (Tableau is a 32-bit application).  Depending on the data it generally works OK up to tens of thousands or maybe a few hundred thousand rows, but once you are into the millions it simply runs out of memory.

What are the alternatives?

Up until now, when we've needed percentiles we have either used another tool (such as the built-in reporting in our test tools) or pre-calculated the percentiles outside Tableau (typically with a Perl script).  Neither approach is ideal: both are OK for standard reporting, but both lose the benefit of rapid exploration of the data.

A different approach to the Tableau calculation

My original method works by fetching all of the values from the database and sorting them in Tableau, in order to count through to the right value.  But in reality, in a typical data distribution, lots of rows will share the same value.  So all you need to return is a count of each distinct value, rather than returning multiple instances of the repeated values.

An example will make this clear.  Consider how to calculate the median of this set of 7 numbers:

3, 1, 4, 4, 1, 2, 1

My original method returns all the numbers, counts them and then sorts them like this:

1, 1, 1, 2, 3, 4, 4

Finally it counts through to the mid-point, the 4th of the 7 values, which is 2.

The alternative is to return a count of each distinct number, written here as value(count):

1(3), 2(1), 3(1), 4(2).

Then add the counts to get the total: 3 + 1 + 1 + 2 = 7, and again work through to the mid-point: the value 1 accounts for the first 3 positions, so the 4th position falls on the value 2.
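In code form, the count-based calculation looks something like this.  Again it is a minimal Python sketch of the idea (not the actual Tableau table calculation): walk the distinct values in order, accumulating their counts, until the running total reaches the required rank.

```python
def percentile_from_counts(value_counts, pct):
    """Percentile from (value, count) pairs rather than raw observations.

    value_counts: mapping of distinct value -> number of occurrences.
    Walks the values in order, accumulating counts, until the running
    total reaches the required rank (nearest-rank rule, as before).
    """
    total = sum(value_counts.values())         # 3 + 1 + 1 + 2 = 7
    rank = max(1, -(-total * pct // 100))      # ceiling of total * pct / 100
    running = 0
    for value in sorted(value_counts):
        running += value_counts[value]
        if running >= rank:
            return value

# The worked example: 1 occurs 3 times, 2 once, 3 once, 4 twice
print(percentile_from_counts({1: 3, 2: 1, 3: 1, 4: 2}, 50))  # -> 2
```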

With 7 numbers, the alternative sounds more complex.  With millions of numbers, especially with lots of repetition, this method works out to be orders of magnitude quicker.  This is primarily because the bulk of the work of sorting, grouping and counting the numbers is done in the database back-end (and databases are good at that), leaving Tableau to do the final calculation on a much smaller set of results.
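To illustrate that division of labour, here is a hypothetical sketch using Python's built-in sqlite3 module as a stand-in for the real back-end database (the table and column names are made up for the example): the GROUP BY does the sorting, grouping and counting on the database side, and only one small row per distinct value comes back for the final walk.

```python
import sqlite3

# Stand-in for the real back-end database, with hypothetical table/column names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE response_times (elapsed REAL)")
conn.executemany("INSERT INTO response_times VALUES (?)",
                 [(v,) for v in [3, 1, 4, 4, 1, 2, 1]])

# The database does the sorting, grouping and counting...
rows = conn.execute(
    "SELECT elapsed, COUNT(*) AS n FROM response_times "
    "GROUP BY elapsed ORDER BY elapsed"
).fetchall()

# ...and only one row per distinct value comes back to the client,
# ready for the cumulative walk shown in the previous sketch
print(rows)  # -> [(1.0, 3), (2.0, 1), (3.0, 1), (4.0, 2)]
```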

Now there is nothing new or special about that technique - I've seen it done like that in all sorts of contexts and I've done it that way myself for years in Perl scripts.  It's just that when I first tackled this in the new (powerful but hard to understand) Tableau Table Calculations I was so buried in the arcane details of "partitioning" and "addressing" that I forgot to lift my head up and think about the best way to do it.

Making it even quicker

The big gain with this approach comes from returning a small number of values.  But if the values to be analysed are expressed to very high precision and widely distributed, there may still be a large number of distinct values, with relatively small numbers of occurrences.

In this case, instead of working out the answer exactly, you can approximate.  Rounding all of the values to a lower precision will reduce the number of distinct values returned.  This can be parameterised, so the data can initially be explored quickly at low precision; then, when you discover interesting-looking patterns, you can crank up the precision and wait a few seconds longer for the calculation to complete.
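As a rough sketch of that idea (again in Python rather than the Tableau calculation itself, with the precision parameter as a plain argument), rounding before counting collapses near-identical values into a single bucket, trading a little accuracy for far fewer distinct values:

```python
from collections import Counter

def approx_value_counts(values, precision):
    """Round to 'precision' decimal places before counting distinct values.

    Lower precision -> fewer distinct values -> faster percentile
    calculation, at the cost of an approximate (rounded) answer.
    """
    return Counter(round(v, precision) for v in values)

timings = [1.2348, 1.2351, 1.2352, 5.7781, 5.7790]

print(approx_value_counts(timings, 4))  # five distinct values, exact
print(approx_value_counts(timings, 1))  # two buckets: {1.2: 3, 5.8: 2}
```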

Final comments

This approach actually addresses several issues with my previous attempt:

  • It scales much further – essentially only limited by the database (I’ve tested it on a 90 million row Tableau data extract, and this method actually returned the median faster than the Tableau data engine’s built-in median function)
  • It can be made to go further (bigger data) and faster by reducing the required precision
  • It is much simpler to set up and the resulting workbook is much more robust

I’ll describe the nitty-gritty detail on the Tableau forum, but any Tableau users can download the Tableau Public workbook above and explore the calculations.

Above all this was a timely reminder that it is often worth challenging assumptions (both your own and other people’s).

Lastly: a big thanks to Joe Mako (one of the regulars on the Tableau forum) who reviewed my new calculation and made a couple of simplifications which speeded it up by another 30%.
