A while ago I was asked to help with some performance issues in a customer’s vSphere environment. This article is all about troubleshooting your own environment using esxtop in combination with a tool called PAL (Performance Analysis of Logs), which can generate a report including alerts and graphs.
First of all, you need to decide how much data you need to capture from your ESXi host(s). In my case, the environment got really slow between 8 and 10 AM, so I suggested capturing data from 7 until 11 AM to easily tell the healthy state from the unhealthy one.
You capture this data with esxtop, which can run in batch mode and write everything to a CSV file, or to a compressed file, as the output can grow very large (files of about 200 MB are not impossible). How to set up and run esxtop correctly is described in Duncan Epping’s article about esxtop.
That article also lists all the important counters you should pay attention to while you are troubleshooting.
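As a rough aid for sizing the capture, the sketch below works out how many samples a time window needs and builds the matching batch-mode command line. The `-b` (batch), `-a` (all counters), `-d` (delay in seconds) and `-n` (number of iterations) options are esxtop's own. The 15-second delay, the output path and the function name are assumptions made for illustration only.

```python
import shlex


def esxtop_batch_command(start_hour, end_hour, delay_seconds, output_file, compress=False):
    """Build an esxtop batch-mode command that covers the given capture window."""
    if end_hour <= start_hour:
        raise ValueError("end_hour must be later than start_hour")
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")

    # Number of samples needed to span the whole window at the chosen delay.
    window_seconds = (end_hour - start_hour) * 3600
    iterations = window_seconds // delay_seconds

    command = f"esxtop -b -a -d {delay_seconds} -n {iterations}"
    target = shlex.quote(output_file)  # protect paths containing spaces
    if compress:
        # Pipe through gzip so the capture takes less space on the datastore.
        return f"{command} | gzip -9c > {target}"
    return f"{command} > {target}"


# Hypothetical example: 7 until 11 AM, one sample every 15 seconds.
print(esxtop_batch_command(7, 11, 15, "/tmp/esxtop-capture.csv"))
```

With these example values the window is 14,400 seconds, so the command asks for 960 iterations. A shorter delay gives finer detail at the cost of a larger file.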
Performance Analysis of Logs (PAL)
After you have captured your data, it’s possible to replay it using various tools (also described in Duncan’s article), but PAL is not mentioned there. PAL was suggested to me by one of my colleagues, who uses it to create Microsoft Exchange, Active Directory and IIS health reports. He blogs together with other colleagues at uccexperts.com, if you’re interested in some articles about Microsoft products.
Easy does it: launch the application, browse to your log file(s) and apply a threshold file (more about that further below).
PAL is free and available for download at their CodePlex project page.
PAL can directly read your exported CSV file and apply so-called threshold filters to clean up unnecessary data. There are already over 60 built-in filters for various products like Exchange, SharePoint, Lync and SQL.
As there isn’t a VMware vSphere threshold file yet, I used the metrics and thresholds described in Duncan’s article. I was able to put in the following counters:
- CPU – CoStop
- CPU – Max Limited
- CPU – Ready
- CPU – Swap Wait
- CPU – System
- DISK – Average Driver MilliSec/Command
- DISK – Average Guest MilliSec/Command
- DISK – Average Kernel MilliSec/Command
- DISK – Average Queue MilliSec/Command
- MEM – Memctl Current MBytes
- MEM – Swap MBytes Read/sec
- MEM – Swap MBytes Write/sec
- MEM – Swap Used MBytes
- MEM – Total Compress MBytes
- NETWORK – Outbound Packets Dropped
- NETWORK – Received Packets Dropped
For some reason, I wasn’t able to add MEM – N%L, DISK – ABRTS/s, DISK – RESETS/s and DISK – CONS/s, maybe because the esxtop export I was using didn’t contain these counters.
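To make the idea of a threshold check concrete, here is a minimal Python sketch of what such a filter does conceptually: it scans an esxtop CSV export and reports every sample above a limit. This is not how PAL is implemented. The perfmon-style column names, the sample rows and the limit of 10 for `% Ready` are all assumptions for illustration; take the real limits from Duncan’s article.

```python
import csv
import io


def find_threshold_breaches(csv_file, thresholds):
    """Return (timestamp, column, value) for every sample above its limit.

    thresholds maps a counter name (the last part of a column header) to a limit.
    """
    reader = csv.reader(csv_file)
    header = next(reader)

    # Remember which columns belong to a watched counter, and their limit.
    watched = {}
    for index, name in enumerate(header):
        for counter, limit in thresholds.items():
            if name.endswith("\\" + counter):
                watched[index] = limit

    breaches = []
    for row in reader:
        for index, limit in watched.items():
            try:
                value = float(row[index])
            except (ValueError, IndexError):
                continue  # skip empty or malformed cells
            if value > limit:
                breaches.append((row[0], header[index], value))
    return breaches


# Synthetic sample in the assumed export layout (not real measurements).
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1234:vm01)\% Ready"',
    r'"09/04/2013 07:00:00","2.5"',
    r'"09/04/2013 07:00:15","14.0"',
]) + "\n"

# Illustrative limit only.
breaches = find_threshold_breaches(io.StringIO(SAMPLE), {"% Ready": 10.0})
for timestamp, column, value in breaches:
    print(timestamp, column, value)
```

A counter that is missing from the export simply never matches a column, which would be consistent with some counters not showing up at all.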
To save you some time, the VMware vSphere (snowvm.com) version 1.1 threshold file is publicly available using the URL below:
After downloading the threshold file, place it in the installation folder of PAL (default path is C:\Program Files\PAL\PAL). Be sure to remove any old versions of the threshold file.
Give PAL a spin afterwards and see the new threshold filter appear in the list as seen in the screenshot below. Following the PAL wizard and using this threshold will provide you with a readable report, including alerts based on proven thresholds.
Oh, before I forget: I included some exclusions for idle CPU counters to filter out unnecessary data. Because each ESXi process has its own unique ID, the exclusion is not applied 100% correctly. Therefore, you should edit the PAL.ps1 file using the instructions on this page or simply paste these lines of code:
# Skip any counter instance whose name matches an EXCLUDE pattern
ForEach ($XmlExcludeNode in $XmlDataSource.SelectNodes('./EXCLUDE')) {
    If ($XmlCounterInstanceNode.NAME -match $XmlExcludeNode.INSTANCE) {
        $IsCounterInstanceMatch = $False
    }
}
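To illustrate why a pattern match helps when every process carries its own ID, here is a small Python sketch of the same idea. PowerShell’s `-match` is a case-insensitive regular-expression search, which `re.search` with `re.IGNORECASE` mimics. The `<id>:<name>` instance names are hypothetical and only stand in for whatever your export contains.

```python
import re


def is_excluded(instance_name, exclude_pattern):
    """Mimic PowerShell's -match: a case-insensitive regex search anywhere in the name."""
    return re.search(exclude_pattern, instance_name, re.IGNORECASE) is not None


# Hypothetical instance names, each prefixed with a unique ID.
instances = ["1:idle", "8123:vm01", "2:system"]

# An exact comparison never matches, because of the ID prefix.
print("1:idle" == "idle")

# A pattern match finds "idle" regardless of the ID in front of it.
kept = [name for name in instances if not is_excluded(name, "idle")]
print(kept)
```

Keep the exclusion pattern specific: since it is a search, a short pattern would also drop any virtual machine whose name happens to contain it.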
To give you an idea of how those reports are presented, be sure to check out the screenshots below. The time displayed is the UTC time at which the esxtop batch export ran, so keep that in mind when analyzing the report.
If you ever need to parse a lot of performance data, be sure to check this tool out! Got feedback? Please leave it below in the comments.
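If you want to line up a timestamp from the report with the local time at which users complained, a conversion like the sketch below can help. The `MM/DD/YYYY HH:MM:SS` timestamp layout and the fixed UTC offset are assumptions; adjust both to match your export and your location.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of the timestamps in the esxtop export.
ESXTOP_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def utc_to_local(timestamp, utc_offset_hours):
    """Convert a UTC timestamp string to local time using a fixed offset in hours."""
    utc_time = datetime.strptime(timestamp, ESXTOP_TIME_FORMAT).replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone).strftime(ESXTOP_TIME_FORMAT)


# Hypothetical example: a site two hours ahead of UTC.
print(utc_to_local("09/04/2013 06:00:00", 2))
```

A fixed offset keeps the example short, but it does not follow daylight saving changes, so check the offset that applied on the day of the capture.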