In part one of this series 'Finding the haystack: troubleshooting a hard to find software performance problem', we talked about a client with a post-release performance problem and discussed the first few steps involved in the troubleshooting phase. The analysis told us that the problem was not server side and narrowed the problem space to “everything outside the server.” So where does that leave us?
Complex environments, incomplete information
The client has large numbers of staff spread throughout the country. The majority of users connect to a Citrix desktop via a “toaster” terminal on their desk, and most users log on to a Citrix Hosted Desktop (CHD). A Citrix Delivered Desktop (CDD) is available for users with more complex needs, and a small number of power users have physical PCs. There are Citrix farms in two data centres: one co-located with the application, and one remote from it.
At this stage, the only source of information about how the system performed was service desk tickets raised by end users. This is a valuable source of information, but it is difficult to use as a basis for diagnosing a problem: user reports are subjective and self-selecting to the extent that they border on “anecdotal.” However, analysis of the tickets showed one thing very clearly: users on CHD were most affected, followed by CDD users. Physical PCs were largely unaffected.
These reports pointed us squarely at the performance of the Citrix environment, and what we wanted was an objective measure of user experience across all CHD and CDD instances. This would have been very valuable given the apparent transience and non-repeatability of the problem. Unfortunately there was no monitoring in place which could provide those figures, so it was up to us to fill the information gap.
We wanted to see for ourselves how things were performing for users who were logging service desk tickets, so we asked the service desk to put staff in touch with us. Sitting down with the end users, it was clear that the performance they were experiencing was unacceptably poor, and worse than what we’d expect as a result of the known changes in performance under this release. We observed that the performance problems came and went, lasting anything up to half an hour.
Armed with details of problematic sessions, we looked again at the webserver-level response time figures, focussing on users we knew were experiencing problems. In every case, the server response times were consistent throughout the day: from the server’s perspective, the response time was the same when users reported bad performance as it was when performance was good. This supported the theory that the problem was outside the bounds of the webserver.
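A comparison like this can be sketched in a few lines of Python. This is only an illustration, not the tooling used on the engagement: the function names, the shape of the data (timestamped server response times for one user, plus the time windows in which that user reported problems), and the sample values are all assumptions made for the example.

```python
from datetime import datetime
from statistics import median


def split_by_incident(samples, incidents):
    """Split (timestamp, response_ms) samples into those that fall inside
    a reported incident window and those that fall outside every window."""
    inside, outside = [], []
    for ts, ms in samples:
        in_incident = any(start <= ts <= end for start, end in incidents)
        (inside if in_incident else outside).append(ms)
    return inside, outside


def compare_medians(samples, incidents):
    """Return (median during incidents, median otherwise), or None if
    either group is empty and no comparison can be made."""
    inside, outside = split_by_incident(samples, incidents)
    if not inside or not outside:
        return None
    return median(inside), median(outside)


# Made-up data for one user: one reported half-hour incident and four
# server-side response times taken from the webserver logs.
incidents = [(datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 30))]
samples = [
    (datetime(2024, 1, 15, 9, 45), 410.0),
    (datetime(2024, 1, 15, 10, 5), 420.0),
    (datetime(2024, 1, 15, 10, 20), 400.0),
    (datetime(2024, 1, 15, 11, 0), 415.0),
]

result = compare_medians(samples, incidents)
if result is not None:
    during, otherwise = result
    print(f"median during incidents: {during} ms, otherwise: {otherwise} ms")
```

If the two medians are close, as they were for every user we checked, the server is serving requests just as quickly while the user is suffering as when they are not. The median is used here rather than the mean so that a single slow request does not dominate a small sample.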
The next thing we wanted to confirm was that the network-level HTTP response times were close to the HTTP response times recorded by the webservers. If they weren’t, we could be confident we were dealing with a network issue. When the service desk next put us in touch with a user who reported poor performance, we used Wireshark to capture the network traffic between the client and the server. (We did this on CDD to keep the packet capture as noise-free as possible.) Comparing the Wireshark-measured response times to those recorded by the webserver showed a difference of a handful of milliseconds, far too small to suggest the problem was the network between the CDD instance and the webserver.
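The arithmetic behind that comparison is simple enough to sketch. The example below assumes the response times have already been exported from the packet capture and from the webserver log into two dictionaries keyed by some shared request identifier. The identifier, the function names, and the numbers are hypothetical, and matching requests between the two sources is usually the hard part in practice.

```python
from statistics import median


def network_overhead(capture_ms, server_ms):
    """Per-request difference between the response time seen on the wire
    and the time recorded by the webserver, for requests present in both
    sources. Returned sorted, in milliseconds."""
    shared = capture_ms.keys() & server_ms.keys()
    return sorted(capture_ms[key] - server_ms[key] for key in shared)


def summarise(deltas):
    """Summarise the overhead; returns None when nothing could be matched."""
    if not deltas:
        return None
    return {
        "count": len(deltas),
        "median_ms": median(deltas),
        "max_ms": max(deltas),
    }


# Made-up response times in milliseconds, keyed by a hypothetical request id.
capture_ms = {"req-1": 105.0, "req-2": 212.0, "req-3": 50.0}
server_ms = {"req-1": 100.0, "req-2": 210.0}

print(summarise(network_overhead(capture_ms, server_ms)))
```

A median and maximum of a few milliseconds points away from the network; a difference measured in seconds would have pointed straight at it. Requests that appear in only one source (such as `req-3` here) are ignored rather than guessed at.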
It must be Citrix…?
By this stage we were confident we could rule out problems with:
- Webserver performance (encapsulating database and storage layer performance, application server resource consumption limits, etc.)
- The path between the webservers and CHDs (including the network, load balancers, and so on)
This narrows the problem space down to somewhere between receiving the response from the webserver and getting the rendered page on the glass, in front of the user. Still a lot of ground to cover. The obvious thing to look at is whether the Citrix servers are overloaded. But Citrix supports a wide variety of applications, and performance problems are only being reported in one of them. Wouldn’t an overloaded Citrix environment impact all applications?
Look out for the next article in this series 'Part three: Target sighted – is Citrix causing the performance problem?'