Frequent DAG failovers in a virtualized Microsoft Exchange environment on VMware vSphere

Running Microsoft Exchange in a virtualized environment provides a lot of extra flexibility, and even increased availability when running in an HA configuration. This short article covers some extra tuning that might be necessary in your environment.

The environment I’m describing consists of two virtual Exchange 2013 servers running on VMware vSphere 5.5. Storage is provided by an iSCSI-based array, compute by HP Gen6 Intel blades.

Ever since these servers went into production, the Microsoft Cluster Service has been triggering failovers, moving all the active mailbox databases over to the second Exchange server. The failovers appear to be triggered by a snapshot creation task. As we are using Veeam for backups, we contacted them to ask whether there is a workaround for this issue.

Veeam has released a KB article that explains how to decrease the cluster sensitivity and prevent the failovers from happening. In our case, these settings sadly didn’t solve our issues.

The problem seems to be dropped network packets inside the guest OS. According to a KB article by VMware, the VMXNET3 NIC has some issues on systems with high traffic bursts (like Exchange).

For now, the settings from the VMware article seem to have solved our issue and no further failovers have occurred. If the problem comes back, I will definitely update this article.

Hopefully the possible solutions from Veeam and VMware can help you if you run into the same issue.

Got feedback? Please leave it below!

2 thoughts on “Frequent DAG failovers in a virtualized Microsoft Exchange environment on VMware vSphere”

  1. I’m having similar issues with a customer of mine and I was wondering what values you ended up using for the VMXNET3 driver and which machines you applied them to (Veeam, mailbox servers, etc.). I just made the changes yesterday and so far they seem to have cut down on the number of false positives that we were getting. However, during the backup job there were still issues, including with a DAG node in a different site/location that isn’t even part of the Veeam backup job. I’m hoping with some additional tweaking we’ll get this right. I’m wondering if I should just revert to an E1000 driver for these servers. Also, you didn’t mention if you had a separate heartbeat network for the DAG/cluster. I went the route of not using one; now I’m wondering if I should put one in?

    Thanks for the post.

    • Hi Albert,

      Some time ago, in January, I was investigating this same issue at a customer with a much bigger environment than the one I described in this article. They were experiencing DAG failovers during snapshots and vMotion operations.

      The customer had combined servers (mailbox database and CAS on the same machine) and had already applied a lot of best practices. In the end, the following improvements fixed the issues they encountered:

      – DAG cluster settings changed (tweaks are very, very specific and you should keep the values as low as possible to prevent unnecessary downtime)
      SameSubnetDelay from 1 to 2 seconds,
      SameSubnetThreshold from 5 to 10 heartbeats,
      CrossSubnetDelay from 1 to 4 seconds and
      CrossSubnetThreshold from 5 to 10 heartbeats

      – Configured Multi-NIC vMotion
      The available bandwidth allowed the customer to go to 10 Gbit/s for vMotion traffic, achieved by using Multi-NIC vMotion (using more than one physical NIC, which is not enabled by default).
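
      To get a feel for what the DAG cluster settings above mean in practice: the cluster sends a heartbeat every "delay" interval and only declares a node down after "threshold" heartbeats in a row are missed, so the tolerated outage is roughly delay × threshold. The following is just a small Python sketch of that arithmetic using the before and after values listed above. The class and function names are made up for illustration. On the cluster itself, the delay properties are stored in milliseconds (for example 2000 for 2 seconds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSetting:
    """One delay/threshold pair (hypothetical helper, not a cluster API)."""
    delay_seconds: float  # interval between two heartbeats
    threshold: int        # consecutive missed heartbeats that are tolerated

    def tolerance_seconds(self) -> float:
        # Approximate time a node may stay silent before it is declared down.
        return self.delay_seconds * self.threshold


# (before, after) values taken from the list of changes above.
SETTINGS = {
    "SameSubnet": (HeartbeatSetting(1, 5), HeartbeatSetting(2, 10)),
    "CrossSubnet": (HeartbeatSetting(1, 5), HeartbeatSetting(4, 10)),
}


def tolerance_table(settings):
    """Return {name: (tolerance_before, tolerance_after)} in seconds."""
    return {
        name: (before.tolerance_seconds(), after.tolerance_seconds())
        for name, (before, after) in settings.items()
    }


for name, (before, after) in tolerance_table(SETTINGS).items():
    print(f"{name}: {before:g}s -> {after:g}s")
```

      With these values the tolerated silence goes from 5 to 20 seconds within a subnet and from 5 to 40 seconds across subnets. That is also why the values should stay as low as possible: a real failure takes just as long to be detected.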

      Don’t use the E1000 NIC! The VMXNET3 is a lot more efficient and robust, and it should be used unless the OS doesn’t support it.

      The customer was using a separate heartbeat network, but it wasn’t running on a dedicated vNIC. You definitely need a separate heartbeat network when running a DAG.

      Hope this helps!
