Thursday, May 18, 2017

VM Performance Issue Troubleshooting

Always Remove Unnecessary Hardware on a VM after Doing P2V


One day, a customer of mine called to ask for advice about an application that his team had just converted from physical to virtual (P2V) and that was now having a performance issue. As usual, my response was to understand the issue and the application itself and come up with some suggestions, including asking them to raise a support request (SR) with VMware Support. A couple of days passed with no resolution, so I decided to visit the customer to see whether I could help. This post documents the process I followed, which ended up resolving the issue, and points out one important post-P2V step that tends to be missed.

What's the issue? On what grounds does the user say the application is slow?


This is the first thing I try to understand when facing a performance issue. Can the user really quantify the slowness, or is it just based on feeling? In this case, the performance issue was quite clear. They showed me that running one process took about 10 seconds in the P2V'd application, whereas on the physical server it normally took only 3 seconds. OK, now I knew what to expect. My goal was to get those 3 seconds back.
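
When the slowness has to be put into numbers, a small timing helper can make the comparison repeatable. The following is only a minimal sketch: `median_runtime` and `slowdown_factor` are hypothetical helper names, and the operation being timed is whatever repeatable step the user complains about.

```python
import statistics
import time
from typing import Callable


def median_runtime(operation: Callable[[], object], repeats: int = 5) -> float:
    """Run `operation` several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown_factor(baseline_seconds: float, measured_seconds: float) -> float:
    """Return how many times slower the measured run is than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return measured_seconds / baseline_seconds


if __name__ == "__main__":
    # Figures from this case: 3 s on the physical server, about 10 s after P2V.
    print(f"slowdown: {slowdown_factor(3.0, 10.0):.1f}x")
```

Using the median rather than a single run keeps one unlucky measurement from setting the target.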

What's the application? What's the architecture? How do users access it?


Next is to find out what the application is. We might want to check whether any documented best practices are available for that application. One place to look is the Virtualizing Business Critical Applications page on the VMware website, or just Google the keywords "application name on VMware best practices". If one exists, you can use it later to check whether the VM and application are already configured according to the best practices.


In this case, I found out that it was Microsoft Dynamics GP 2013 running on Windows Server 2008 R2. The application consisted of only one VM, and the database required by the application was installed on the same VM. Users access the application in two ways: through a client app and through a web browser. We tested the application locally to rule out the network as the cause, and the test confirmed that the issue also occurred locally.

By then I had found that there is an official best practices document for MS SQL, but not for MS Dynamics GP. A couple of blogs and forums do note some best practices that might help, but use that information carefully; you need to understand the context in which they configured it that way. I also found an official performance tuning guide for MS Dynamics GP here. It covers configuration best practices and optimisation at the OS and application level, but at that point I did not know whether it would be useful, so I just kept it for later.

How's the VM configured?


Next is to check how the VM hardware is configured compared to common best practices.

  • How many vCPUs are configured for the VM? Is it more than the CPU cores available on the physical ESXi host? Is it more than the cores available in one CPU socket?
  • How much vRAM is configured for the VM? Is it more than the available physical RAM? Will it create a wide VM (one that spans NUMA nodes)?
  • Is it configured with the correct guest OS option and at least the minimum VM hardware version?
  • What network adapter type is being used?
  • What VM SCSI adapter type is being used? How many virtual disks does the VM have? What disk provisioning method is used?

At that point I found out that:
  • The VM was configured with the same vCPU and RAM size as the source physical machine.
    • The vCPU count equals the total physical cores of the source physical machine. The document states that hyperthreading should be turned off, which means we do not need to factor in any hyperthreading performance gain.
  • The VM's vCPU and RAM can be accommodated by a single NUMA node, so this configuration has no performance impact.
  • The configured guest OS (Windows 2008 R2 64-bit) matches the installed guest OS, and the VM hardware version is the latest for the installed vSphere version.
  • The VM network adapter is VMXNET3, which already matches the best practice for Windows Server 2008 R2.
  • The VM SCSI adapter is LSI Logic SAS, the default adapter for Windows Server 2008 R2. Two virtual disks are connected to one SCSI adapter. Both virtual disks are thick provision lazy zeroed.
    • There is some room for improvement here. We could add a SCSI controller, use paravirtual SCSI for the non-boot-disk controller, place each virtual disk on its own SCSI controller, and change the virtual disk type to thick provision eager zeroed. But I kept this finding for later and did not apply any changes.
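
The vCPU and vRAM sizing questions in this section can be sketched as a small check. This is an illustration only: `HostSpec`, `VmSpec`, and `sizing_warnings` are hypothetical names, the host figures in the example are made up, and the sketch assumes one NUMA node per socket with memory split evenly across nodes, which may not hold for every server.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSpec:
    sockets: int
    cores_per_socket: int
    ram_gb: float

    @property
    def total_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def ram_per_numa_node_gb(self) -> float:
        # Assumption: one NUMA node per socket, memory split evenly.
        return self.ram_gb / self.sockets


@dataclass
class VmSpec:
    vcpus: int
    ram_gb: float


def sizing_warnings(vm: VmSpec, host: HostSpec) -> List[str]:
    """Return a warning for each sizing question the VM fails."""
    warnings = []
    if vm.vcpus > host.total_cores:
        warnings.append("vCPU count exceeds the host's physical cores")
    elif vm.vcpus > host.cores_per_socket:
        warnings.append("vCPU count exceeds one socket's cores (wide VM)")
    if vm.ram_gb > host.ram_gb:
        warnings.append("vRAM exceeds the host's physical RAM")
    elif vm.ram_gb > host.ram_per_numa_node_gb:
        warnings.append("vRAM exceeds one NUMA node's memory (wide VM)")
    return warnings


if __name__ == "__main__":
    # Hypothetical host: 2 sockets x 12 cores, 256 GB RAM.
    host = HostSpec(sockets=2, cores_per_socket=12, ram_gb=256)
    print(sizing_warnings(VmSpec(vcpus=8, ram_gb=64), host))
```

An empty list corresponds to the situation described above, where the VM fits inside a single NUMA node.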

Does overprovisioning negatively impact this VM?


Four things I checked for this VM:
  • CPU Contention
  • RAM Contention
  • Disk Latency (for all virtual disks)
  • Network Dropped Packets (Tx and Rx)
This customer has vROps Standard, so it was easy for me to check the performance history. To do that, select the VM, open the All Metrics tab, and plot these metrics:
  • All Metrics|CPU|CPU Contention(%)
  • All Metrics|Memory|Contention(%)
  • All Metrics|Virtual Disks|scsix:y|Total Latency
    • Select the scsix:y entries that correspond to your configuration
  • All Metrics|Network I/O|Transmitted Packets Dropped
  • All Metrics|Network I/O|Received Packets Dropped
In this case, I found that over the couple of days since it had been running as a VM, CPU Contention was below 1%, Memory Contention was flat at 0, disk latency was below 20 ms, and no dropped packets were observed. These values are very healthy and show that the physical layer serves the VM well. So it was clear that the slowness was not caused by running as a VM!
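
The four checks above can be written down as a simple pass/fail test over values read off the vROps charts. This is a minimal sketch, not a vROps API call: `VmMetrics` and `contention_findings` are hypothetical names, and the thresholds are assumptions that simply mirror the healthy values seen in this case rather than official limits.

```python
from dataclasses import dataclass
from typing import List

# Assumed thresholds, taken from the healthy values observed in this case.
CPU_CONTENTION_LIMIT_PCT = 1.0
DISK_LATENCY_LIMIT_MS = 20.0


@dataclass
class VmMetrics:
    cpu_contention_pct: float
    memory_contention_pct: float
    disk_latency_ms: float  # worst value across all virtual disks
    tx_dropped: int
    rx_dropped: int


def contention_findings(m: VmMetrics) -> List[str]:
    """Return one finding per metric that is outside the assumed healthy range."""
    findings = []
    if m.cpu_contention_pct >= CPU_CONTENTION_LIMIT_PCT:
        findings.append("CPU contention")
    if m.memory_contention_pct > 0:
        findings.append("memory contention")
    if m.disk_latency_ms >= DISK_LATENCY_LIMIT_MS:
        findings.append("disk latency")
    if m.tx_dropped > 0 or m.rx_dropped > 0:
        findings.append("dropped packets")
    return findings


if __name__ == "__main__":
    # Hypothetical readings within the ranges reported for this VM.
    print(contention_findings(VmMetrics(0.6, 0.0, 12.0, 0, 0)))
```

An empty result suggests the infrastructure is not the bottleneck and that the investigation should move into the guest OS.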

What's next?


OK, I knew that the issue was not in the virtual infrastructure, but I still needed to find the culprit; otherwise users would keep seeing the issue as caused by running the application as a VM, or worse, think it was because of VMware. 😓 Get this situation a lot? I feel you. So now it was time to look at the guest OS level.

The first thing I check is antivirus. One quick test is whether disabling the antivirus makes the issue go away. In this case it did NOT, but in another case I found that a misconfigured antivirus kept deleting the temporary files used by the application.

To be honest, at this point I was worried that I would need to work through the MS Dynamics GP tuning guide I mentioned above. Luckily, I then remembered one thing: this VM resulted from a P2V process, and I recalled a best practice from the book Mastering VMware vSphere 4 by Scott Lowe about cleaning up unnecessary Windows drivers after a P2V process.

The following steps are taken from his book:

For a Windows-based virtual machine, perform the following steps to view and remove unnecessary drivers:
  1. Log in to the new Windows-based virtual machine using an account with administrative credentials.
  2. Open a command prompt, and enter set devmgr_show_nonpresent_devices=1.
  3. From that same command prompt, enter start %systemroot%\system32\devmgmt.msc. This will open Device Manager. It's important that you do this from the same command prompt; otherwise, Device Manager won't show you all the devices you need to see.
  4. From the View menu, select Show Hidden Devices.
  5. Remove devices that are clearly specific to the physical hardware upon which this operating system instance used to run. For example, if Windows had previously been running on HP hardware, you would remove the HP RAID array driver since it is no longer necessary. The same goes for old network card drivers and old video drivers.
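
Steps 2 and 3 can also be scripted so that the variable and Device Manager always share the same environment. The following is a minimal sketch under stated assumptions: `device_manager_launch` is a hypothetical helper, the `C:\Windows` fallback is an assumption, the console is opened through `mmc.exe` instead of the `start` command quoted above, and the script is assumed to run from an already elevated session on the Windows VM.

```python
import ntpath
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def device_manager_launch() -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment that show non-present devices."""
    env = dict(os.environ)
    # Same variable as in step 2; it must reach the Device Manager process.
    env["devmgr_show_nonpresent_devices"] = "1"
    # Fallback path is an assumption for machines without SystemRoot set.
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    console = ntpath.join(system_root, "system32", "devmgmt.msc")
    return ["mmc.exe", console], env


if __name__ == "__main__":
    command, env = device_manager_launch()
    if sys.platform == "win32":
        # Device Manager inherits the modified environment from this call.
        subprocess.Popen(command, env=env)
    else:
        print("Windows only; would run:", " ".join(command))
```

Steps 4 and 5 still have to be done by hand in Device Manager, since deciding which devices belong to the old hardware needs human judgment.
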
I followed those steps on the problematic VM and found a lot of references to the old (physical) disks, including some logical disks configured with multipathing that pointed to external storage. I removed many devices that were no longer present, and then also removed all the old hardware-related management software (the physical machine was an HP server). After a lot of clicking and restarting, we tested the result. Yup, it solved the performance issue! We ran the process that previously took 10 seconds several times over about half an hour, and after the cleanup it never failed to return the result in 3 seconds.

With that, I concluded my troubleshooting. I shared what I did with the customer, along with the document I found on optimizing MS Dynamics GP in case they want to optimize further. Thanks to Scott Lowe's old but not obsolete great book, which saved the day. And I hope for another happy customer! 🎉

PS. You can also find another article on how to remove old hardware after a P2V conversion here.
