VMFS VAAI ATS issues on a sunny Saturday

My colleague contacted me this morning as he was seeing strange behaviour with a particular VMFS datastore. The single VM which was running on this datastore was ‘gone’ and storage connectivity lost errors were appearing as well. Time for some troubleshooting…

This environment used iSCSI-based storage, with an ESXi cluster talking to the LUNs on the storage array. All other LUNs were fine, but this one wasn't.

After logging into one of the ESXi hosts and cd'ing to the /vmfs/volumes directory, it seemed that the symbolic link for this datastore was broken, as it was shown in red.
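Each datastore name under /vmfs/volumes is a symbolic link to the datastore's UUID directory, so a broken link looks roughly like the sketch below (the file details are made up; the UUID is the one that shows up in the logs further down):

# cd /vmfs/volumes
# ls -l
lrwxr-xr-x    1 root     root     35 May 24 11:54 DATASTORENAME -> 53466fa3-8f3a3c14-cf60-0017a4770010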

Navigating to the datastore was therefore not possible. Next I tried rescanning the HBAs (software iSCSI initiators) and the VMFS volumes to see if anything would clear up, which it didn't. Some time ago I experienced a PDL (Permanent Device Loss) in a different environment, where removing the iSCSI session from the static discovery page helped. Sadly, that didn't improve the situation here either.
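For completeness, the same rescan can also be done from the command line. A minimal sketch, run on the affected host:

Rescan all storage adapters:
# esxcli storage core adapter rescan --all

Re-read the VMFS volumes:
# vmkfstools -V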

Rescanning the HBAs and VMFS volumes did, however, show really strange behaviour for this particular datastore. One rescan displayed the datastore as 7TB in size, while it was actually configured with a different capacity; another rescan displayed 400TB, and so on. Weird stuff =)
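To see what capacity the host currently thinks the datastore has, you can list the mounted filesystems; the Size and Free columns should normally be stable between rescans:

# esxcli storage filesystem list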

Next step for me was checking the VMware logs on one of the hosts (/var/log). The vmkwarning.log indicated an issue with this datastore:

2014-05-24T11:54:43.822Z cpu8:18335099)WARNING: HBX: 1968: Failed to initialize VMFS distributed locking on volume 53466fa3-8f3a3c14-cf60-0017a4770010: Not supported
2014-05-24T11:54:43.822Z cpu8:18335099)WARNING: Fil3: 2492: Failed to reserve volume f530 28 1 53466fa3 8f3a3c14 1700cf60 100077a4 0 0 0 0 0 0 0
2014-05-24T11:54:43.911Z cpu13:18335099)WARNING: FSAts: 1304: Denying reservation access on an ATS-only vol 'DATASTORENAME'
2014-05-24T11:54:43.911Z cpu13:18335099)WARNING: HBX: 1955: ATS-Only VMFS volume 'DATASTORENAME' not mounted. Host does not support ATS or ATS initialization has failed.
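If you want to quickly check whether your other hosts are logging the same ATS warnings, grepping the vmkwarning log is enough:

# grep -i ats /var/log/vmkwarning.log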

I went looking for some kind of chkdsk-like tool to see if there was any corruption on the datastore and came across the VOMA tool, which can check your VMFS datastore metadata (a sketch of how to run it follows below). It did find some errors in my case, but I'm not sure whether this was the actual problem. Next I tried mounting the datastore on a 'fresh' ESXi host in the test environment, one that had never touched this datastore before. This host also showed the datastore as inaccessible, so I knew the problem had to be with the datastore itself.
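For anyone wanting to run the same check: VOMA is pointed at the device extent backing the datastore rather than the mounted volume, and no VMs should be running on the volume while it checks. A sketch, with the naa identifier as a placeholder for your own device:

Find the device and partition backing the datastore:
# esxcli storage vmfs extent list

Check the VMFS metadata on that device (partition 1 in this example):
# voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxx:1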

Eventually I found this VMware KB article, and from my understanding the datastore could not be locked with ATS. It seemed like a lock on the storage array itself, but how do I get to my datastore with my VM data on it?

The article describes how you can disable the ATS locking feature on an ESXi host. I executed these commands on the test ESXi host, which was also able to communicate with the messed-up datastore.

List current setting (should be Int Value: 1):
# esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking

Disable hardware accelerated locking:
# esxcli system settings advanced set -i 0 -o /VMFS3/HardwareAcceleratedLocking

Check setting again (should now be Int Value: 0):
# esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking

After disabling ATS locking, I performed a rescan of the storage adapters and VMFS volumes. The datastore appeared! I could browse it again, and my VM files were still sitting there waiting to be booted. I migrated all VM data off the datastore to a new one and removed the old one, as I didn't know what bad stuff might come from it in the future.
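One thing worth adding: once you are done troubleshooting, re-enable hardware accelerated locking on the test host. It is the same command with the value set back to 1:

# esxcli system settings advanced set -i 1 -o /VMFS3/HardwareAcceleratedLocking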

There is probably a way to reset the ATS lock on the array itself, but I didn't want to take that risk at this time and in this environment.

The VMware KB article states that you should execute these commands in a maintenance window, but if you have a test host like I did, you can try it out there first without harming your production ESXi hosts.

Thanks for reading!

Update (from the comments): these were HP StoreVirtual storage nodes, but I had never seen an issue like this before. As we noticed later, it might have been caused by a snapshot action running from vSphere.
