Exchange 2013 DAG quorum lost

Today some maintenance had to be done on an Exchange 2013 mailbox server that was part of a 2-node cluster using a file server share as witness.

The Exchange server in question was first disabled on our load balancer to drain connections. Next, the StartDagServerMaintenance.ps1 script was used to block new sessions and to fail the mailbox databases over to the other Exchange server.

Both steps completed without errors and the server was ready to be shut down for maintenance. After the shutdown, however, the mailbox databases on the second Exchange server were dismounted and could not be mounted anymore. Uh-oh..

The mailbox databases could not be mounted because quorum was lost. I saw the following error when opening the Microsoft Failover Cluster Manager:

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

The strange thing was that the file server hosting the witness share was fine and reachable.
Because the offline Exchange server could not be brought back online within a matter of minutes, I had to override the quorum safety and bring the Cluster Service back online using the ForceQuorum switch:

net start clussvc /fq

I got this command from the following Microsoft TechNet Article: http://technet.microsoft.com/en-us/library/cc770620(v=ws.10).aspx

After running the command, the cluster was back online and the mailbox databases could be mounted again. Just before maintenance on the first Exchange server was completed, and before booting it up again, I disabled the Cluster Service on the secondary server because that server was running in ForceQuorum state. This was done to prevent data loss or corruption.

When the server was booted up again, I started the Cluster Service on both servers and everything returned to normal.

The lost quorum is probably caused by the Cluster Service being configured with “Node Majority”, which isn’t a setting you want with 2 nodes =) With only two voters, a majority requires both of them, so taking one node down is enough to lose quorum.
Tomorrow we will investigate whether “Node and File Share Majority” is a better choice, which it probably is, since we are already using a file server share as witness.
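
The vote arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of the majority rule, not output from the cluster; the function names are made up for the example, and it assumes every node and the witness share each hold exactly one vote.

```python
def votes_needed(total_voters: int) -> int:
    """Strict majority of all configured voters."""
    return total_voters // 2 + 1


def has_quorum(total_voters: int, voters_online: int) -> bool:
    """True when the voters that are still online form a majority."""
    return voters_online >= votes_needed(total_voters)


# Node Majority with 2 nodes: only the two nodes vote.
# One node shut down for maintenance leaves 1 of 2 votes.
print(has_quorum(total_voters=2, voters_online=1))  # False

# Node and File Share Majority with 2 nodes plus the witness share: 3 voters.
# One node shut down, witness reachable, leaves 2 of 3 votes.
print(has_quorum(total_voters=3, voters_online=2))  # True
```

Under these assumptions, the same maintenance shutdown would leave the remaining node with quorum once the witness share actually counts as a vote.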

Merging a 140GB Hyper-V 2008 R2 snapshot

Last week I was notified that one of the production LUNs of a customer using Hyper-V 2008 R2 was filling up and the reason for this was a ‘deleted’ snapshot of a production system.

Deleting snapshots in Hyper-V 2008 R2 requires a shutdown of the VM in order to completely remove the snapshot (AVHD file) from your storage system. Just removing the snapshot/checkpoint using the Virtual Machine Manager is not sufficient: the AVHD file still exists and keeps growing until you shut the VM down. This is a feature according to Microsoft.

This had been going on for a few weeks and the AVHD file had reached a size of 140GB. We roughly estimated that the storage system would sustain at least 15 MB/s, so merging that amount of data would take 2 to 3 hours. That meant 2 to 3 hours of downtime for this particular VM.

Some people on the net were arguing whether extra space was required on the Cluster Shared Volume to merge the snapshot. This is not true.

Just to be sure, I created a backup of the VM right before shutting it down to start the merge. After office hours I shut down the VM and kept an eye on the merge progress using the following PowerShell command:

Get-WmiObject -Namespace "root\virtualization" -Query "select * from Msvm_ConcreteJob" | Where {$_.ElementName -eq 'Merge in Progress'}

The merge started within 5 minutes after shutting the VM down and within 15 minutes it reached about 5 percent. In just 90 minutes the merge was completed and the VM was booted back up to restore functionality!
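
For anyone who wants to redo the estimate with their own numbers, here is a small Python sketch of the arithmetic. It assumes the full 140 GB has to be processed during the merge and that 1 GB equals 1024 MB; the function names are invented for this example.

```python
def merge_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to process size_gb at a sustained throughput (1 GB = 1024 MB)."""
    return size_gb * 1024 / throughput_mb_s / 3600


def effective_throughput(size_gb: float, minutes: float) -> float:
    """Average MB/s implied by processing size_gb in the given number of minutes."""
    return size_gb * 1024 / (minutes * 60)


# Estimate up front: 140 GB at the assumed minimum of 15 MB/s.
print(round(merge_hours(140, 15), 1))           # 2.7

# Looking back: 140 GB merged in 90 minutes.
print(round(effective_throughput(140, 90), 1))  # 26.5
```

If the whole file really was processed, the 90-minute merge works out to an average of roughly 26.5 MB/s, well above the 15 MB/s minimum used for the estimate.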

So, snapshotting in Hyper-V 2008 R2 is still shit. It still requires downtime, but not as much as we calculated. This ‘feature’ is gone in Hyper-V 2012, which automatically cleans up after itself 🙂

Buggy DNS resolution using Microsoft ForeFront TMG 2010

I was experiencing very weird DNS issues with a Windows Server 2008 R2 machine.
While resolving external domain names, it would sometimes come back with a response and sometimes with a timeout.

I tested this with nslookup, using the server command to point it at the Google public DNS server and trying to resolve www.microsoft.com:

nslookup
server 8.8.8.8
www.microsoft.com

> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
*** Request to google-public-dns-a.google.com timed-out
> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to google-public-dns-a.google.com timed-out
> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
Non-authoritative answer:
Name: lb1.www.ms.akadns.net
Address: 65.55.57.27
Aliases: www.microsoft.com
toggle.www.ms.akadns.net
g.www.ms.akadns.net

As you can see, 1 out of 4 requests succeeded. Something was corrupting my DNS query.
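
To put a number on an intermittent failure like this without typing into nslookup over and over, a small probe can help. The Python sketch below is not what was used here; it is a minimal illustration that sends plain A-record queries over UDP and counts the replies. The names build_query and probe are made up for the example, and it treats any reply at all as a success without validating the answer.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, no other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed with its length, ended by a zero byte.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in hostname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe(server: str, hostname: str, attempts: int = 10, timeout: float = 2.0) -> int:
    """Send `attempts` queries over UDP and return how many got any reply."""
    answered = 0
    for i in range(attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(hostname, query_id=i + 1), (server, 53))
            try:
                sock.recvfrom(512)
                answered += 1
            except socket.timeout:
                pass  # count as a lost query and move on
    return answered
```

Calling probe("8.8.8.8", "www.microsoft.com") on the machine behind the firewall and again on the firewall itself gives two reply counts that can be compared directly.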

In this scenario, Microsoft ForeFront Threat Management Gateway 2010 (TMG 2010) was used.
The client, in this case a DNS server, was placed in the internal network and was NAT’d through the external interface of the TMG, which was an interface with public IP addresses.

Somehow, the query was not arriving at the external DNS server.
Running the same queries directly from the TMG showed no issues.

It had to be related to the internal-to-external NAT translation and specific to DNS traffic, because HTTP/S traffic was working without any trouble.

After some investigation, it turned out that NIS (Network Inspection System, part of the Intrusion Prevention System) was interfering with the queries. In our case NIS was dropping them.
We added our DNS server to the NIS exclusion list and the resolution issue was gone!

Since we are already preparing to implement an alternative to TMG, we didn’t see the need to research this issue further.

Hopefully this will help some people resolve DNS issues with their clients behind TMG.

We will add NIS exclusions for all of our internal DNS servers to prevent DNS issues from arising in the future.