Wednesday 27 January 2010

VMware and Active Directory Replication

Just thought I would drop a note for those who use VMware for tier 2 server roles (most people) and are probably using snapshots, clones, etc.

I have been troubleshooting a vMotion issue with a client recently, where one particular VM (a domain controller) would vMotion fine, yet after a few hours would BSOD, while all the other VMs on the same host were fine.

After a couple of days of troubleshooting we managed to work out that the issue was a faulty bank of RAM, which only came into play once the host went over 8GB.
This meant that if you shut down a few VMs and brought over a new VM, you were fine as long as total utilization stayed under 8GB; however, once a VM took you over 8GB, that VM was the one to suffer!

During the troubleshooting process this particular VM was migrated in various ways (storage then host, then storage and host in one go), cloned, snapshotted, etc. Only once the VM was stable and the RAM replaced did the fun with AD start.

The troubleshooting within VMware had caused a little issue with AD.
Here is the main message (amongst a fair few):


Event Type: Error
Event Source: NTDS General
Event Category: Service Control
Event ID: 2103
Date: 26/01/2010
Time: 20:37:18
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DCV2
Description:
The Active Directory database has been restored using an unsupported restoration procedure.

Active Directory will be unable to log on users while this condition persists. As a result, the Net Logon service has paused.

User Action
See previous event logs for details.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
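
If you export events as plain text in the layout shown above, a small script can pick this one out from the rest. This is only a minimal Python sketch: it assumes the "Key: Value" header layout of the event above, and the names `parse_event` and `is_usn_rollback_event` are made up for illustration.

```python
# Header fields as they appear in the text-exported event above.
HEADER_KEYS = {
    "Event Type", "Event Source", "Event Category", "Event ID",
    "Date", "Time", "User", "Computer",
}

def parse_event(text):
    """Collect the 'Key: Value' header lines of one exported event into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like 20:37:18 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() in HEADER_KEYS:
            fields[key.strip()] = value.strip()
    return fields

def is_usn_rollback_event(fields):
    """True for the NTDS General / Event ID 2103 combination shown above."""
    return (fields.get("Event Source") == "NTDS General"
            and fields.get("Event ID") == "2103")
```

Feeding the event text above through `parse_event` and then `is_usn_rollback_event` flags it; anything with a different source or ID is ignored.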



MS knows this as a "USN rollback condition" and talks about it endlessly here
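
For anyone wondering what a USN rollback actually is, here is a very simplified Python sketch of the idea. It is a toy model, not how AD implements the check: the `DcState` class and `looks_like_usn_rollback` function are hypothetical names, and the assumption is simply that a partner remembers the highest USN it has seen from a DC's database instance (its invocation ID), and that reverting a snapshot winds the USN back without changing that ID.

```python
from dataclasses import dataclass

@dataclass
class DcState:
    invocation_id: str  # identifies this instance of the AD database
    highest_usn: int    # highest update sequence number committed locally

def looks_like_usn_rollback(partner_recorded, current):
    """Compare what a replication partner last saw with what the DC reports now.

    A supported restore gives the database a new invocation ID, so partners
    treat it as a new instance. A snapshot revert or clone keeps the old ID
    but an older USN, which is the rollback case.
    """
    same_database = partner_recorded.invocation_id == current.invocation_id
    return same_database and current.highest_usn < partner_recorded.highest_usn
```

So a DC that partners last saw at USN 5000 and that comes back at USN 4200 with the same invocation ID would be flagged, whereas the same DC with a fresh invocation ID (a supported restore) would not.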

So what was the fix?
Well, the VM did hold some FSMO roles, so after "seizing the roles" I ran this command on the DC that was logging the above event IDs:

repadmin /options DC_Name -disable_inbound_repl -disable_outbound_repl


I then ran dcpromo (to demote the controller), rebooted, and ran dcpromo again (to promote the controller), and all was back to normal.

Although this issue was not directly related to VMware (it could just as easily have happened with SAN snapshots or Norton Ghost), it is something to look out for when snapshotting, cloning and troubleshooting VM issues where the VM is looking after a tier 2 distributed app.

3 comments:

Ade said...

Now this is a VERY interesting post and I think very useful as well

Thanks for taking the time to pass this on

I see you are a Vyatta fan too - nice!

Roggy said...

Thanks for the feedback Ade, glad you found it useful.

Unknown said...

This can happen anywhere, especially if you have many admins in the kitchen. My best practice when setting up ESX/vSphere VirtualCenter/vCenter where domain controllers are involved is to create a role with no snapshot functions that allows everything else. vCenter instructions: click Home > Roles > right-click Add > name it no-snapshot > highlight All Privileges > then scroll down and uncheck State under Virtual Machine. Then assign that role to the virtual DC on the Permissions tab of the VM. Like Roggy said, this can still happen with a SNAP LUN, but IMO it is very unlikely.