I have recently been troubleshooting a vMotion issue with a client where one particular VM (a domain controller) would vMotion fine, yet after a few hours would BSOD, while all the other VMs on the same host were fine.
After a couple of days of troubleshooting we worked out that the cause was a faulty bank of RAM, which only came into play once memory use on the host went over 8GB.
This meant that if you closed a few VMs and brought over a new VM, you were fine as long as total utilization stayed under 8GB; however, once a VM took you over 8GB, that VM was the one to suffer!
During the troubleshooting process this particular VM was migrated in various ways (storage then host, then storage and host in one go), cloned, snapshotted and so on. Only once the VM was stable and the RAM had been replaced did the fun with AD start.
The troubleshooting within VMware had caused a little issue with AD.
Here is the main message (amongst a fair few):
Event Type: Error
Event Source: NTDS General
Event Category: Service Control
Event ID: 2103
Date: 26/01/2010
Time: 20:37:18
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DCV2
Description:
The Active Directory database has been restored using an unsupported restoration procedure.
Active Directory will be unable to log on users while this condition persists. As a result, the Net Logon service has paused.
User Action
See previous event logs for details.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
And for those that like pictures :)
MS knows this as a "USN rollback condition" and talks about it endlessly here
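To show why a rolled-back DC is such a problem, here is a deliberately simplified Python sketch of the idea. It is a toy model, not how AD is actually implemented: the `DomainController` and `Partner` classes, the `high_watermark` field and the sample changes are all made up for illustration. The only assumption taken from real AD is that each DC numbers its changes with an increasing USN and that partners only ask for changes above the highest USN they have already seen from that DC.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DomainController:
    """Toy DC: every local change gets the next update sequence number."""
    name: str
    usn: int = 0
    changes: dict = field(default_factory=dict)  # usn -> description

    def write(self, description):
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn


@dataclass
class Partner:
    """Toy replication partner that remembers the highest USN it has seen."""
    high_watermark: int = 0
    received: list = field(default_factory=list)

    def pull_from(self, dc):
        # Only changes numbered above the watermark are requested.
        for usn in sorted(dc.changes):
            if usn > self.high_watermark:
                self.received.append(dc.changes[usn])
        self.high_watermark = max(self.high_watermark, dc.usn)


def simulate_rollback():
    dc = DomainController("DCV2")  # name borrowed from the event log above
    partner = Partner()

    dc.write("create user alice")      # USN 1
    dc.write("create user bob")        # USN 2
    snapshot = copy.deepcopy(dc)       # VM snapshot/clone taken at USN 2

    dc.write("create user carol")      # USN 3
    dc.write("reset password dave")    # USN 4
    partner.pull_from(dc)              # partner has now seen up to USN 4

    dc = copy.deepcopy(snapshot)       # unsupported restore: back to USN 2
    dc.write("create user erin")       # reuses USN 3
    dc.write("disable user frank")     # reuses USN 4
    dc.write("create user grace")      # USN 5
    partner.pull_from(dc)              # partner only asks for USN > 4

    missing = [d for d in dc.changes.values() if d not in partner.received]
    return partner.received, missing


if __name__ == "__main__":
    received, missing = simulate_rollback()
    print("partner received:", received)
    print("never replicated:", missing)
```

In this model the two changes that reused USNs 3 and 4 never reach the partner, and nothing reports an error, so the DCs quietly drift apart. That is roughly the situation the NTDS 2103 event is protecting against when it pauses Net Logon.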
So what was the fix?
Well, the VM did hold some FSMO roles, so after "seizing the roles" I ran this command on the DC that was logging the above event IDs:
repadmin /options DC_Name -disable_inbound_repl -disable_outbound_repl
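The leading - on each option clears it, which re-enables the inbound and outbound replication that AD switched off when it detected the rollback. I then ran dcpromo (to demote the controller), rebooted, and ran dcpromo again (to promote the controller), and all was back to normal.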
Although this issue was not directly related to VMware (it could just as easily have happened with SAN snapshots or Norton Ghost), it is something to look out for when snapshotting, cloning and troubleshooting VM issues where the VM is looking after a tier 2 distributed app.