Buggy DNS resolution using Microsoft Forefront TMG 2010

I was experiencing very weird DNS issues with a Windows Server 2008 R2 machine.
When resolving external domain names, it would sometimes return a response and sometimes time out.

I tested this with nslookup, using the server command to point it at the Google public DNS server, and tried to resolve www.microsoft.com:

nslookup
server 8.8.8.8
www.microsoft.com

> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
*** Request to google-public-dns-a.google.com timed-out
> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to google-public-dns-a.google.com timed-out
> www.microsoft.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8

DNS request timed out.
timeout was 2 seconds.
Non-authoritative answer:
Name: lb1.www.ms.akadns.net
Address: 65.55.57.27
Aliases: www.microsoft.com
toggle.www.ms.akadns.net
g.www.ms.akadns.net

As you can see, most requests timed out and only the last one returned an answer, and even that one needed a retry. Something was interfering with my DNS queries.
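
A handful of manual nslookup attempts only gives a rough impression of the failure rate. The following is a minimal Python sketch, not part of the original troubleshooting, that sends a series of plain UDP A-record queries to a DNS server and counts how many get a reply. The function names build_query and probe, the number of attempts and the transaction ID scheme are illustrative assumptions; the 2-second timeout simply mirrors the nslookup output above.

```python
import socket
import struct


def build_query(hostname, txid):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (only "recursion desired" set), 1 question, no other records
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(hostname, server, attempts=10, timeout=2.0):
    """Send `attempts` UDP queries and return how many got a matching reply."""
    answered = 0
    for i in range(attempts):
        txid = (0x1000 + i) & 0xFFFF  # arbitrary but distinct ID per attempt
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(hostname, txid), (server, 53))
                reply, _addr = sock.recvfrom(4096)
            except OSError:
                # Covers timeouts as well as send/receive errors
                continue
        # Count the reply only if it echoes the transaction ID we sent
        if len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == txid:
            answered += 1
    return answered
```

Calling probe("www.microsoft.com", "8.8.8.8") once from the internal DNS server and once from the TMG itself should turn the difference between the two paths into a number rather than an impression.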

In this scenario, Microsoft Forefront Threat Management Gateway 2010 (TMG 2010) was in use.
The client, in this case a DNS server, sat in the internal network and was NATed through the external interface of the TMG, which has public IP addresses.

Somehow, the query was not arriving at the external DNS server.
When I ran the same queries directly from the TMG, there were no issues.

So the problem had to be related to the internal-to-external NAT translation, and specifically to DNS traffic, because HTTP/S traffic was working without any trouble.

After some investigation, it turned out that NIS (Network Inspection System, part of the Intrusion Prevention System) was interfering with the queries. In our case NIS was dropping them.
We added our DNS server to the NIS exclusion list and the resolution issue was gone!

Since we are already preparing to implement an alternative to TMG, we didn’t see the need to research this issue further.

Hopefully this will help some people resolve DNS issues with their clients behind TMG.

We will add NIS exclusions for all of our internal DNS servers to prevent DNS issues from arising in the future.

Corrupt Forefront TMG disk cache

While examining the event logs of one of our Forefront TMG servers, I noticed an error stating that the disk cache failed to initialize.

Event ID: 14176
Type: Error
Source: Microsoft Web Proxy
Description:
Disk cache Drive:\urlcache\Dir1.cdat failed to initialize. Some errors were encountered when ISA Server restored specific data cache files. ISA Server will now attempt to recover these files. These errors may have occurred because there was not enough time to complete all necessary shutdown operations, when ISA Server was previously shut down. To avoid this in the future, you can increase the value of the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\WaitToKillServiceTimeout registry key. Identify the reason for cache failure by examining previous recorded events, or the error code. The error code in the Data area of the event properties indicates the cause of the failure (internal code: 503.6333.3.0.1200.166).

No functionality was lost, but the error caught my attention and I found a Microsoft KB article that describes it:

http://support.microsoft.com/?scid=kb;en-us;887311

In my case, McAfee Antivirus was active and, as the KB article describes, the disk cache directory should be excluded from virus scanning. I already had an exclusion for the on-access scanner, but none yet for the on-demand scan. The error occurred about 30 minutes after the on-demand scan was executed.

I have now added the exclusion for the on-demand scan as well, which will hopefully prevent the error from reappearing.
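
The event description also suggests raising WaitToKillServiceTimeout. Before changing it, it can help to check what the value currently is. The sketch below was not part of the original fix; it assumes Python is available on the server and uses the standard winreg module. The helper names parse_timeout_ms and read_timeout_ms are made up for illustration, and the code assumes the value is stored as a string holding a number of milliseconds, which is how Windows normally stores it.

```python
import sys

KEY_PATH = r"SYSTEM\CurrentControlSet\Control"
VALUE_NAME = "WaitToKillServiceTimeout"


def parse_timeout_ms(raw):
    """Convert the registry value (assumed to be a string of milliseconds) to an int."""
    return int(str(raw).strip())


def read_timeout_ms():
    """Read the current timeout from the registry (Windows only).

    Raises FileNotFoundError if the value does not exist.
    """
    import winreg  # imported here because the module only exists on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        raw, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return parse_timeout_ms(raw)


if __name__ == "__main__" and sys.platform == "win32":
    print("WaitToKillServiceTimeout:", read_timeout_ms(), "ms")
```

This only reads the value. Whether raising it is worthwhile depends on whether the cache errors line up with shutdowns rather than with virus scans, which in my case they did not.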