As described in 'Finding the haystack: troubleshooting a hard to find software performance problem' and in 'Part two: Following the trail to performance problems in front of the webserver', we had learned a lot in a short space of time. The problem space had shrunk from being anywhere inside a large enterprise system to being somewhere between the arrival of the data required to draw the page and actually putting the page on the screen in front of the user. This left three broad areas of investigation: the performance of the Citrix farm itself, the performance of the “last mile” between the Citrix server and the user, and the work done by the browser to render the application’s HTML.
First cab off the rank: Citrix utilisation
The evidence gathered so far raised the question: was the Citrix farm overloaded? A resource-constrained Citrix environment could explain the problems users were seeing, but a problem like that would be expected to appear in other applications, too. It didn’t quite fit.
Citrix is a complicated system requiring specialist knowledge, so we asked the client’s Citrix support vendor to assess whether the farm was overloaded. A number of dimensions were checked, and there were two primary conclusions:
- The Citrix farm is not overloaded from a CPU, memory, network or disk perspective (though some of the XenApp servers showed short bursts of high CPU instruction queue depth)
- An increase in CPU consumption coincided with the new release of the application
This meant that the new release of the application did require more CPU, but not enough to overload the Citrix farm. Like an overloaded Citrix farm, problems in the last mile would also be expected to impact other applications. Could the problem be the result of a combination of factors?
Pictures and mouse clicks
The interaction between a Citrix client and the server is a mysterious one: The configuration guide alone runs to over four hundred pages, and I don’t plan to cover all the variables here. On a very basic level, the Citrix server sends the client a “picture” of what’s on the screen, which the client then displays for the user. When the user moves the mouse or types on the keyboard, the client sends this data back to the Citrix server. The server updates the picture with the result of the user input, and sends it back to the client.
To speed up the transmission of these “pictures”, images are compressed before they are transmitted. Compression works better for some images than others: A pure black image will compress better than a picture of the Grand Canyon. Given the heavy UX changes associated with this release, it was possible that some characteristic of how the application now looked was causing problems for the Citrix compression, resulting in last mile slowdowns which only impact one application.
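The difference in compressibility is easy to see for yourself. The sketch below is only an illustration: it uses Python's general-purpose `zlib` as a stand-in, not whatever codec Citrix actually applies, and the 640x480 frame size and random "noisy" image are assumptions chosen to represent the two extremes.

```python
import os
import zlib

# Hypothetical frame: 640x480 pixels, 3 bytes (RGB) per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 480, 3
FRAME_SIZE = WIDTH * HEIGHT * BYTES_PER_PIXEL


def compressed_ratio(pixels: bytes) -> float:
    """Return compressed size as a fraction of the original size."""
    return len(zlib.compress(pixels)) / len(pixels)


# A pure black frame: every byte is zero.
black_frame = bytes(FRAME_SIZE)

# A stand-in for a visually "busy" frame: random bytes, the worst case
# for a general-purpose compressor.
noisy_frame = os.urandom(FRAME_SIZE)

black_ratio = compressed_ratio(black_frame)
noisy_ratio = compressed_ratio(noisy_frame)

print(f"black frame compresses to {black_ratio:.2%} of its original size")
print(f"noisy frame compresses to {noisy_ratio:.2%} of its original size")
```

The black frame shrinks to a tiny fraction of its original size, while the noisy frame barely shrinks at all. A real photograph sits somewhere between the two, but the principle is the same: the busier the picture, the more data has to travel over the last mile.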
We devised a very simple test which we hoped would separate last mile issues from browser rendering issues:
- Open the application and navigate to a “rich” page
- Take a screenshot of what was on the screen
- Move the real application around the screen and see how it performs
- Move the screenshot of the application around the screen and see if it’s any different
In the process of running this test, we found something very interesting indeed…
There’s the thing you’re looking for, and there’s the thing you find
The test was designed to detect issues with compressing the new UX, but in the process of running it, one thing became abundantly clear: Using Alt+Tab to move between windows, there was definitely a delay when moving into the problematic application. A lag of about half a second to actually get the screen onto the glass. This delay didn’t appear when moving to other applications (Outlook, folder views, etc.). It didn’t occur when Alt+Tabbing to the screenshot we had taken, so it was definitely the live version of the application.
We did a few checks using other web pages, including ones with very rich content, and confirmed that the issue was limited to just this application. Opening Task Manager and watching CPU consumption for the Internet Explorer process, we saw a brief spike of CPU consumption when moving into the application. The spike was relatively small (15% of the box for a short time), but in a shared environment such as CHD, with a large number of users sharing the same compute resources, the effect would be multiplied many times, and exacerbated when the CHD host was under heavy resource load from other users.
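To get a feel for how a small per-user spike adds up on a shared host, here is a back-of-the-envelope sketch. Only the 15% figure comes from our observation; the number of users, how often each one switches into the application, and the half-second spike duration are hypothetical values picked purely for illustration.

```python
def expected_spike_load(users: int,
                        switches_per_minute: float,
                        spike_seconds: float,
                        spike_fraction: float) -> float:
    """Average fraction of host CPU consumed by focus-switch spikes.

    Assumes spikes are spread evenly over time and simply add up,
    which ignores queuing effects on a busy host.
    """
    # Fraction of each minute a single user spends "spiking".
    busy_fraction_per_user = switches_per_minute * spike_seconds / 60.0
    return users * busy_fraction_per_user * spike_fraction


# Hypothetical host: 40 users, each switching into the application
# twice a minute, with each spike lasting half a second (assumed).
load = expected_spike_load(users=40,
                           switches_per_minute=2,
                           spike_seconds=0.5,
                           spike_fraction=0.15)  # observed: 15% of the box

print(f"average CPU consumed by spikes: {load:.0%}")
```

Under these assumed numbers, the spikes alone account for about a tenth of the host's CPU on average, before any real work is done. The average also hides the worst case: when several users switch at the same moment, their spikes stack, and on a host that is already busy each one takes longer to clear.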
Rendering problems specific to this application fit the user reports nicely: they affected only one application, and we surmised that the transient nature of the problem could be due to the overall load on a particular CHD host varying during the day. A quick test on a physical PC confirmed the behaviour wasn’t reproducible there, which was also consistent with the user reports.
It was time to share what we had learned with the wider troubleshooting team and get their input.
The best laid plans
A meeting was booked and the team was assembled to have a look at what we’d found. We used the PC in the room to connect to the Citrix environment and showed everyone the behaviour. So far, so good. Then, to prove it was a problem in the Citrix environment and didn’t happen on a physical PC, I logged in remotely to my workstation (meeting room PCs can’t access test environments). I confidently ran through the same steps as we’d completed in Citrix to prove there was a problem, secure in my belief that the problem wouldn’t appear.
And then it did…
Find out what happened next in the next article in this series 'Part four: Gotcha! Pinpointing a browser performance problem under Citrix'.