As we begin to go through all of the new features and functionalities of vSphere 6.5, I wanted to take some time to point out some new features and updates to current functionalities of DRS, since I am the Product Manager for Distributed Resource Management.
So let’s jump in!
When you head over to edit cluster settings in vSphere, you may notice an additional row in the settings, ‘Additional Options’. What are additional options and why did we add them?
Expanding Additional Options, you will see 3 options here:
- VM Distribution
- Memory Metric for Load Balancing
- CPU Over-Commitment
These are advanced options that we have bubbled up into their own UI settings. They are among the most commonly used advanced options, so we wanted to make them easier to consume. More on each below.
VM Distribution already occurs with DRS. However, because there is a cost associated with each vMotion, DRS avoids a vMotion unless it sees a significant gain from doing so. (If you want to know more about the costs associated with ANY vMotion, reach out to me on Twitter or in the comments; I'll write up a blog post about it in the near future.) The one potential downside of this default behavior is that, from an availability standpoint, DRS may leave one host loaded with more VMs than any other host in the cluster. If that host fails, you will see more HA restarts than you would have if DRS had spread the VMs around. TryBalanceVmsPerHost is a new advanced option we are introducing to help balance VM count for availability.
How does it work?
When this option is set, each host is given a load-balancing ‘maxVMs’ limit which is the calculated average VMs per host. This limit is applied only to the load-balancing algorithm, meaning that initial placement can violate it.
Before the load-balancing algorithm starts, DRS attempts to fix any load-balancing maxVMs limit violations that exist.
For each host with a load-balancing maxVMs violation, we sort its VMs based on entitlement and attempt to move VMs (small VMs first) to a less-loaded host. A move is only allowed when it improves the imbalance AND the destination host does not violate the load-balancing maxVMs limit.
This advanced option will be enforced on a best-effort basis while maintaining that performance and VM happiness ALWAYS comes first.
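The pass described above can be sketched roughly as follows. This is an illustration only, not the real DRS code: the `max_vms` name, the host/VM dictionary shapes, and the move logic are all assumptions, and the real algorithm also checks that each move improves the cluster imbalance, which is omitted here for brevity.

```python
def balance_vm_count(hosts):
    """Best-effort pass: move small VMs off hosts that exceed the
    load-balancing maxVMs limit (the average VM count per host)."""
    total_vms = sum(len(h["vms"]) for h in hosts)
    max_vms = -(-total_vms // len(hosts))  # ceiling of the average

    moves = []
    for src in hosts:
        # Smallest entitlement first, per the description above.
        src["vms"].sort(key=lambda vm: vm["entitlement"])
        while len(src["vms"]) > max_vms:
            # Try the least-populated host as the destination.
            dst = min(hosts, key=lambda h: len(h["vms"]))
            if dst is src or len(dst["vms"]) + 1 > max_vms:
                break  # no destination that respects the limit
            vm = src["vms"].pop(0)
            dst["vms"].append(vm)
            moves.append((vm["name"], dst["name"]))
    return moves
```

Note that the limit is only consulted by this rebalancing pass; as the text above says, initial placement can still violate it.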
Memory Metric for Load Balancing
DRS by default works to balance hosts using the Active Memory metric of Virtual Machines. This is different from some of the other third-party offerings out there, but we do this for several important reasons:
- Active Memory is a more accurate metric in terms of what is being used.
- Consumed Memory may be inaccurate in terms of what the VM needs.
- The Active Memory metric is also supplemented with a percentage of idle memory (25% by default; this can be changed with an advanced option)
This works well when you are over-committing memory in your hosts, but customers who don't overcommit memory are often just as happy to have DRS balance based on Consumed memory. There is now a simple checkbox that lets customers switch which memory metric DRS uses.
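The difference between the two metrics can be sketched roughly like this. The field names, function names, and the simple arithmetic are illustrative assumptions; the 25% idle share mirrors the default mentioned above, but the real accounting inside ESXi is more involved.

```python
IDLE_SHARE = 0.25  # default share of idle memory added to Active (per the text)

def active_metric(vm):
    """Active memory plus a share of idle (consumed-but-inactive) memory."""
    idle = vm["consumed_mb"] - vm["active_mb"]
    return vm["active_mb"] + IDLE_SHARE * idle

def consumed_metric(vm):
    """Consumed memory, as selected by the new checkbox."""
    return vm["consumed_mb"]

def host_memory_load(vms, use_consumed=False):
    """Host memory load as DRS might see it under either metric."""
    metric = consumed_metric if use_consumed else active_metric
    return sum(metric(vm) for vm in vms)
```

For a VM with 3 GB consumed but only 1 GB active, the two metrics diverge substantially, which is why the choice matters for clusters that never overcommit.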
CPU Over-Commitment

Some customers, especially those running VDI environments or with specific vCPU:pCPU ratio requirements for the applications running in the cluster, tend to use the advanced option MaxVcpusPerClusterPct.
- Any value between 0-99% is under-committing (less than one vCPU per pCPU core)
- 100% is a 1 vCPU:1 pCPU-core ratio
- 500% is a 5 vCPU:1 pCPU-core ratio
As we receive feedback about this setting, we may increase the max value of this advanced option. In general, when this option is used, we rarely see customers limiting this ratio to anything higher than the 5:1 ratio.
*Note: This is based on cluster %, rather than being locked down on a per-host level. This means that it is possible for a single host to be higher than the CPU Over-Commitment as long as the overall cluster percent is below the value placed by the admin. For those who are looking to enforce a value across each ESXi host (set once for the entire cluster, but enforced on each host), see the advanced option MaxVcpusPerCore.
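The cluster-wide arithmetic behind this percentage can be sketched as below. The option name MaxVcpusPerClusterPct comes from the text above, but the function names and the exact enforcement logic here are assumptions for illustration.

```python
def cluster_vcpu_pct(hosts):
    """Percentage of vCPUs to physical CPU cores across the whole cluster."""
    total_vcpus = sum(sum(vm["vcpus"] for vm in h["vms"]) for h in hosts)
    total_cores = sum(h["cores"] for h in hosts)
    return 100 * total_vcpus / total_cores

def power_on_allowed(hosts, new_vm_vcpus, max_pct):
    """Would powering on a VM keep the cluster within MaxVcpusPerClusterPct?
    Note this is a cluster-wide check: as the note above says, an individual
    host can exceed the ratio as long as the cluster as a whole does not."""
    total_vcpus = sum(sum(vm["vcpus"] for vm in h["vms"]) for h in hosts)
    total_cores = sum(h["cores"] for h in hosts)
    return 100 * (total_vcpus + new_vm_vcpus) / total_cores <= max_pct
```

So a two-host cluster with 20 cores total and 40 vCPUs running sits at 200%, comfortably under a 500% (5:1) cap.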
In this release we have added Network-Awareness to DRS, making it aware of what is going on with the ESXi hosts' networking when making load-balancing decisions. What we have found is that in rare circumstances, an ESXi host could have relatively low CPU/Memory utilization compared to the other hosts in the cluster, but very high network throughput. DRS in the past could not see that network saturation and could potentially (once again, in very rare circumstances) begin piling VMs onto this host, because from a CPU/MEM standpoint it appeared to be the best host for placement. In doing so, the applications could suffer performance issues due to uplink saturation.
With Network-Aware DRS implemented, this will no longer happen. DRS will now check for network saturation of hosts before placing workloads on them. How does this work? DRS still makes its decisions based on the CPU/MEM stats. Once a target host has been chosen for placement/load-balancing, DRS then checks whether that host's network is saturated (the default is 80% utilization of connected uplinks, but this can be configured with 'NetworkAwareDrsSaturationThresholdPercent'). If the host is considered saturated, DRS will use a different target host. Think of this as a final check to ensure that placement is optimal.
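The two-step decision described above could be sketched like this. The 0.80 default mirrors the NetworkAwareDrsSaturationThresholdPercent setting from the text, but the host fields, uplink math, and helper names are hypothetical, not vSphere APIs.

```python
SATURATION_THRESHOLD = 0.80  # 80% utilization of connected uplinks (default)

def is_network_saturated(host, threshold=SATURATION_THRESHOLD):
    """Treat a host as saturated when aggregate uplink traffic crosses
    the threshold fraction of aggregate uplink capacity."""
    used = sum(nic["tx_mbps"] + nic["rx_mbps"] for nic in host["uplinks"])
    capacity = sum(nic["speed_mbps"] for nic in host["uplinks"])
    return used / capacity >= threshold

def pick_target(candidates):
    """CPU/MEM decides the ranking first; the network check only vetoes.
    Candidates are assumed pre-sorted by the CPU/MEM score."""
    for host in candidates:
        if not is_network_saturated(host):
            return host
    return None  # every candidate is saturated
```

The key design point is that networking does not drive the ranking; it only disqualifies an otherwise-best host, which matches the "check after choosing" flow described above.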
Resource Utilization Optimization
We continue to streamline our features and enhance them with great additional functionality. At the same time, we’ve done some housekeeping to not only improve performance, but reduce resource utilization. A few numbers on this release:
- Performance increase of more than 2.5x throughput
- 70% resource reduction at scale
- VM power-on latency improved by more than 3x
- DRS Cluster Compatibility Check:
  - More than 21x improvement
  - Less than 2% CPU utilization
  - More than 850 MB resource reduction
As with each release, we have made additional updates and improvements to the DRS algorithm. One of these brings hosts into tighter alignment by ensuring that the deviation between the most-entitled and least-entitled hosts stays within a specific range; DRS will continue to balance the cluster until this target has been reached. Essentially, DRS will be more aggressive in identifying outlier hosts and will work to bring them within a range derived from the cluster's Migration Threshold setting.
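The stopping condition described above might look something like the sketch below. The mapping from the Migration Threshold to an allowed spread is entirely an assumption for illustration; the text only says the range is based on that setting.

```python
def within_target_range(host_loads, migration_threshold):
    """True when the spread between the most- and least-loaded hosts is
    inside the allowed range. Here a more aggressive (higher) threshold
    tolerates a smaller spread; the exact mapping is a made-up example."""
    allowed_spread = 0.5 / migration_threshold  # e.g. threshold 5 -> 0.10
    return max(host_loads) - min(host_loads) <= allowed_spread
```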
We will continue to build upon this feature, which is used by ~97% of our Enterprise customers. As always, receiving feedback is crucial for me to continue bringing value to you and your business. Any additional feedback or comments are always appreciated.
See Parts 2 and 3 for more DRS goodness: