Monday, November 7, 2016

Quickly Identify Whether My Virtual Machines Get All CPU Resources They Need

Background

One of the capabilities that virtualization brings is the ability to run several virtual machines on one physical machine. This ability can lead to something called over-provisioning, where we provision more resources to VMs than we have in the physical layer. For instance, we can create 20 VMs with 4 vCPUs each, for a total of 80 vCPUs provisioned, while the server we use has only 20 CPU cores. Wait, wait... If we only have 20, how can we give 80? How can we give more than what we actually have? The answer is one of the reasons virtualization rose in the first place: most of our servers have, on average, low CPU utilization, and each server experiences its peaks and lows at different times. VMware vSphere manages how VMs take their turn using physical CPU resources in an efficient and fair manner through a component called the CPU Scheduler. In simple words, the CPU Scheduler is like a traffic light. It rules who may go (in this case, who may use the physical CPU resources) and who needs to stop and wait. More about the CPU Scheduler can be found in this CPU Scheduler Technical Whitepaper.
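To make the arithmetic in that example concrete, here is a minimal Python sketch. The function name overcommit_ratio is my own for illustration; it is not something vSphere or vROps provides.

```python
def overcommit_ratio(vm_count: int, vcpus_per_vm: int, physical_cores: int) -> float:
    """Return provisioned vCPUs divided by physical cores (hypothetical helper)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return (vm_count * vcpus_per_vm) / physical_cores


# The example from the text: 20 VMs with 4 vCPUs each on a 20-core server.
ratio = overcommit_ratio(vm_count=20, vcpus_per_vm=4, physical_cores=20)
print(f"{20 * 4} vCPUs on 20 cores -> {ratio:.0f}:1 overcommit")
```

A ratio above 1 only says that queuing is possible, not that it is actually happening. Whether it happens depends on how many VMs want CPU at the same moment.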

Using the traffic light analogy, we know that the number of cars that can go at one time is defined by the number of lanes available. If the road has 4 lanes, then at most 4 cars can pass at the same time; the other cars queue behind the first row. This is also true in virtualization: even though we can over-provision, what the CPU Scheduler can schedule at any one time is limited by how many logical CPUs are available on the physical server. This means that if several VMs, with more total vCPUs than there are logical CPUs, ask for their share at the same time, some of those VMs will need to queue. Having to queue means it takes longer for a VM to finish its job. Now the challenge is how to identify this queue, and furthermore how to keep that queue within an acceptable timeframe. This article tries to answer the first part; the latter will be discussed in a future article.
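The lane analogy can be sketched in a few lines of Python. This is a deliberately simplified model that I made up for illustration: it only counts how many vCPUs cannot run at a given instant, and it ignores everything else the real CPU Scheduler takes into account.

```python
def vcpus_waiting(runnable_vcpus: int, logical_cpus: int) -> int:
    """Count vCPUs that must queue at one instant in this simplified model."""
    if runnable_vcpus < 0 or logical_cpus < 0:
        raise ValueError("counts must not be negative")
    # Only as many vCPUs as there are logical CPUs can run at once.
    return max(0, runnable_vcpus - logical_cpus)


# The road example: 4 lanes, 6 cars arriving together, so 2 cars wait.
print(vcpus_waiting(runnable_vcpus=6, logical_cpus=4))
# With demand at or below capacity, nobody waits.
print(vcpus_waiting(runnable_vcpus=3, logical_cpus=4))
```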

Problem Statement

How to quickly identify whether my Virtual Machines get all the CPU resources they need?

Tools Used

VMware vRealize Operations Manager Standard Edition

Solution

vRealize Operations Manager has a derived metric, called contention, that shows the degree of impact a Virtual Machine experiences due to over-provisioning. This metric is the percentage of time a VM spends waiting to be scheduled on physical resources, compared to the total time. Lower is better. Zero percent means that every time the VM requested resources, they were available.
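As a rough illustration of that definition, the sketch below computes a waiting-time percentage in Python. It follows only the description given here (waiting time compared to total time); the exact formula vROps uses internally is not shown in this article, and the function name and sample numbers are my own assumptions.

```python
def contention_percent(waiting_ms: float, interval_ms: float) -> float:
    """Percentage of an interval spent waiting to be scheduled (illustrative only)."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if not 0 <= waiting_ms <= interval_ms:
        raise ValueError("waiting_ms must be between 0 and interval_ms")
    return 100.0 * waiting_ms / interval_ms


# Hypothetical numbers: a VM waited 1,000 ms during a 20,000 ms interval.
print(contention_percent(waiting_ms=1_000, interval_ms=20_000))  # 5.0
# A VM that never waited has zero contention.
print(contention_percent(waiting_ms=0, interval_ms=20_000))      # 0.0
```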

OK, now the question: how do we quickly monitor this contention metric? Of course we can look at each VM and show its contention metric, but if you have hundreds of VMs, don't you think that would be a hassle? Luckily vROps, even the Standard edition, has it ready for you.



  • From the vROps Home screen, go to Environment (1).
  • Select vSphere Hosts and Clusters (2).
  • Browse to the vSphere Cluster that you want to check (3).
    • In this example I chose the mgmt-core cluster in the msbu-mgmt Datacenter, managed by the vc mgmt vCenter.
  • In the right pane, select the Details tab (4).
  • Make sure you are in the View pane (5).
  • Search for the Virtual Machine CPU Diagnose List view (6).
    • Notice that we can filter the views. Here I filter with the keyword CPU Diagnose.
  • This view lists all VMs in the cluster you selected earlier, along with some information, including each VM's maximum CPU Contention (7) over a certain period.
    • Note that the default period is 7 days. In this example, I have changed it to show only 1 day.
    • To easily identify which VMs experienced the highest CPU contention, I sorted that column to show the highest value at the top.
Here is a screenshot showing all the information provided by this view. Details about the metrics and the other information shown can be found in this vROps documentation.
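If you copy the values from this view into a script, the same "highest contention on top" sorting can be sketched in Python as below. This does not call any vROps API; the VmRow type, the rank_by_contention function, the VM names, and the numbers are all made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class VmRow(NamedTuple):
    """One row of the view: a VM and its maximum CPU contention (hypothetical type)."""
    name: str
    max_cpu_contention_pct: float


def rank_by_contention(rows: Iterable[VmRow], threshold_pct: float = 0.0) -> List[VmRow]:
    """Keep rows at or above the threshold, highest contention first."""
    kept = [r for r in rows if r.max_cpu_contention_pct >= threshold_pct]
    return sorted(kept, key=lambda r: r.max_cpu_contention_pct, reverse=True)


# Made-up sample values, not real measurements.
rows = [VmRow("vm-a", 1.2), VmRow("vm-b", 7.5), VmRow("vm-c", 0.0)]
for row in rank_by_contention(rows, threshold_pct=1.0):
    print(f"{row.name}: {row.max_cpu_contention_pct}%")
```

The threshold is only a placeholder here. Which value is acceptable depends on the limit or SLA you decide to keep.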


With this View, you can quickly identify which VMs experience high CPU contention. But that raises another question: how much CPU contention can we afford? To understand that, please read this article from Iwan Rahabok. I also think it is very useful to refer to another article from e1, which talks about contention here.

After knowing the degree of CPU contention our VMs experience, and the limit or SLA we want to keep, the next question is how to identify what is causing this situation. And once we know the cause, how do we manage it? Is there anything we can do to lower the value? Well, I will need to discuss that in future articles. So stay tuned, and I hope you find this article useful.





3 comments:

  1. Nice article Pak Shiraz. We wait for the next article 'how to manage cpu contention' hehe
