Calculating Percentiles with Tableau
by Richard Leeke on 09 September 2011
I've recently been doing something which makes a great little example of how a small change in the approach to doing something can make a big difference to the performance and scalability of the result. It also happens to be something which is an invaluable weapon in the Performance Engineering arsenal.
Aggregate or Detail?
Performance Engineering involves lots of data analysis: whether it’s interpreting the results of a performance test or mining mountains of production log files. Often the goal is to summarise and aggregate in order to get the big picture about how a system is performing. But when a system isn’t performing well, or is unstable under load, the real value comes from being able to drill down to the finest-grained detail possible.
Most of the Performance Testing tools in the marketplace have some level of data analysis and reporting built in. The capabilities range from rudimentary (even downright wrong) to powerful, but are mainly focussed on summary, aggregated analysis, with much more limited ability to access the fine-grained detail that is so essential for rapid problem diagnosis. In many cases there is no access at all to the detail. They also don’t help with all the other data we have to analyse that hasn’t been captured through the tool.
So to round out our kit-bag we use an excellent data visualisation tool: Tableau. It handles most of our needs very well. It can access data from text files and spreadsheets through to a wide range of databases and it copes well with large data sets, which it achieves through off-loading most of the grunt work to the back-end database. But off-loading that work has one disadvantage: if the analysis needs to access all of the detailed data and the DBMS can’t do it, then Tableau can’t do it.
The most obvious example of this is calculating percentiles. We use percentiles extensively in our performance analysis, not least because Non Functional Requirements usually state response time requirements in terms of percentiles (and quite right too). Percentiles are also just an invaluable tool for seeing the big-picture distribution of statistical data.
Tableau has no built-in support for percentiles (other than percentiles of already aggregated data, a decidedly dodgy and highly misleading statistic, in my view). With a bit of SQL knowledge it is possible to use Oracle’s built-in percentile functions, but it’s tricky, and only helps if your data happens to be in Oracle. Other than that, you’re on your own.
With a lot of head-scratching I did manage to get a percentile calculation going in Tableau’s calculation language after version 6 of Tableau added some new analytical functions last year, as I described on the Tableau forum. But the resulting calculation was hard to understand, hard to reuse, fragile and didn’t scale to the sorts of large datasets that we are typically dealing with.
There is still a steady trickle of pleas for proper percentile support on the Tableau forum (some of which are from other forum users who share my assessment of my method). The Tableau response to these requests is (quite validly) that Tableau is constrained by lack of support in the database back-ends. I also suspect that there just isn’t enough demand to justify the work to do it another way, though with version 7 due later in the year, who knows?
Having to swap to and fro between two different toolsets and data representations to get both the summary and the detailed view of our performance data is a pain. So, prompted by some grumbles from another member of the team about how hard my original method is to use, I've recently had another go at it.
So what’s so hard about calculating percentiles?
A percentile is the value of a variable below which a certain percentage of observations fall. So to work it out, you need to arrange all of the observations in order; count them and then find the value that falls the required percentage of the way through. That sounds easy. And it is, as long as you can get at all of the values.
But that is where the problem comes. Retrieving all data rows into Tableau, sorting them and counting through to find the required value just doesn’t scale very well and ultimately is constrained by addressable memory (Tableau is a 32 bit application). Depending on the data it generally works OK up to tens of thousands or maybe a few hundred thousand rows – but once you are into the millions it simply runs out of memory.
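The naive approach described above can be sketched in a few lines of Python. This is a minimal illustration of the idea (sort everything, index into it), not Tableau's actual implementation; the function name and the nearest-rank indexing convention are my own choices.

```python
def percentile_naive(values, p):
    """Naive percentile: sort every observation and index into it.

    p is a fraction (0.5 = the median). Memory use grows with the
    full row count, which is exactly what limits this approach on
    large datasets.
    """
    ordered = sorted(values)            # every row held in memory at once
    k = int(p * (len(ordered) - 1))     # nearest-rank style index
    return ordered[k]

print(percentile_naive([3, 1, 4, 4, 1, 2, 1], 0.5))  # prints 2
```

The cost is dominated by holding and sorting all the rows, which is fine for thousands of values and hopeless for tens of millions.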
What are the alternatives?
Up until now, when we’ve needed percentiles we have either used another tool (such as the built-in reporting in our test tools) or pre-calculated the percentiles outside Tableau (typically with a Perl script). Neither approach is ideal: whilst both are adequate for standard reporting, they lose the benefit of rapid exploration of the data.
A different approach to the Tableau calculation
My original method works by fetching all of the values from the database and sorting them in Tableau, in order to count through to the right value. But in reality, in a typical data distribution, lots of rows will share the same value. So all you need to return is a count of each distinct value, rather than returning multiple instances of the repeated values.
An example will make this clear. Consider how to calculate the median of this set of 7 numbers:
3, 1, 4, 4, 1, 2, 1
My original method returns all the numbers and sorts them like this:
1, 1, 1, 2, 3, 4, 4
Finally it counts through to the mid-point value, which is 2.
The alternative is to return each distinct number along with a count of its occurrences:
1 (×3), 2 (×1), 3 (×1), 4 (×2).
Then add the counts: 3 + 1 + 1 + 2 = 7 and again work through to the mid-point.
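The counted-distinct-values approach can be sketched as follows. In the real setup the database does the grouping and counting; here Python's `Counter` stands in for it, and the function name is my own. The key point is that the final walk is over the distinct values only, not over every row.

```python
from collections import Counter

def percentile_from_counts(values, p):
    """Percentile computed from counts of distinct values.

    In the Tableau version the back-end database returns the
    (value, count) pairs; Counter simulates that step here.
    """
    counts = Counter(values)            # e.g. {1: 3, 2: 1, 3: 1, 4: 2}
    total = sum(counts.values())        # 3 + 1 + 1 + 2 = 7
    target = int(p * (total - 1))       # index of the required rank
    cumulative = 0
    for value in sorted(counts):        # walk distinct values in order
        cumulative += counts[value]
        if cumulative > target:         # the rank falls in this bucket
            return value

print(percentile_from_counts([3, 1, 4, 4, 1, 2, 1], 0.5))  # prints 2
```

With heavy repetition the loop runs over a handful of distinct values even when the row count is in the millions, which is where the orders-of-magnitude gain comes from.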
With 7 numbers, the alternative sounds more complex. With millions of numbers, especially with lots of repetition, this method works out to be orders of magnitude quicker. This is primarily because the bulk of the work of sorting, grouping and counting the numbers is done in the database back-end (and databases are good at that), leaving Tableau to do the final calculation on a much smaller set of results.
Now there is nothing new or special about that technique - I've seen it done like that in all sorts of contexts and I've done it that way myself for years in Perl scripts. It's just that when I first tackled this in the new (powerful but hard to understand) Tableau Table Calculations I was so buried in the arcane details of "partitioning" and "addressing" that I forgot to lift my head up and think about the best way to do it.
Making it even quicker
The big gain with this approach comes from returning a small number of values. But if the values to be analysed are expressed to very high precision and widely distributed, there may still be a large number of distinct values, with relatively small numbers of occurrences.
In this case, instead of working out the answer exactly, you can just approximate. Rounding all of the values to a lower precision will reduce the number of distinct values returned. This can be parameterised, so the data can initially be explored quickly with low precision, then when you discover interesting looking patterns, crank up the precision and wait a few seconds for the calculation to complete.
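The effect of rounding can be sketched like this. The data, the function names and the precision values are all illustrative (the article's actual workbook exposes precision as a Tableau parameter); the sketch just shows that rounding shrinks the distinct-value count while keeping the percentile close to exact.

```python
import random
from collections import Counter

def median_from_counts(counts):
    """Median from (value, count) pairs, as in the previous sketch."""
    total = sum(counts.values())
    target = int(0.5 * (total - 1))
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if cumulative > target:
            return value

random.seed(42)
# Hypothetical high-precision response times in seconds.
times = [random.lognormvariate(0, 0.5) for _ in range(100_000)]

exact = median_from_counts(Counter(times))
for places in (3, 1):
    counts = Counter(round(t, places) for t in times)  # lower precision
    approx = median_from_counts(counts)
    print(f"{places} dp: {len(counts)} distinct values, "
          f"median ~ {approx:.3f} (exact {exact:.3f})")
```

Because rounding preserves order, the approximate percentile can never be off by more than half the rounding step, so the precision knob trades a bounded error for a much smaller result set.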
Final comments
This approach actually addresses several issues with my previous attempt:
- It scales much further – essentially only limited by the database (I’ve tested it on a 90 million row Tableau data extract, and this method actually returned the median faster than the Tableau data engine’s built-in median function)
- It can be made to go further (bigger data) and faster by reducing the required precision
- It is much simpler to set up and the resulting workbook is much more robust
I’ll describe the nitty-gritty detail on the Tableau forum, but for any Tableau users, you can download the Tableau Public workbook above and explore the calculations.
Above all this was a timely reminder that it is often worth challenging assumptions (both your own and other people’s).
Lastly: a big thanks to Joe Mako (one of the regulars on the Tableau forum) who reviewed my new calculation and made a couple of simplifications which speeded it up by another 30%.