Exchange 2010 - Manually fix a down Exchange Organization that uses DAC mode

In my experience, running Exchange 2010 in a volatile environment exposes you to the Exchange cluster behaving unexpectedly, or just plain failing. If you're running in DAC mode, the price of its benefits is that you have to manually fail over and fix sites to keep services from staying down or, worse, going split-brain.

Because Exchange 2010 relies heavily on Windows Failover Clustering (the Cluster service) and AD (Active Directory), there are many scenarios where these distributed decision-making functions can fail. When all the servers fail in the primary data center, the secondary data center takes over as it should, and when the primary data center comes back online, services do not automatically fail back; this is by design (per Microsoft). I have found that to fail services back, you must do two crucial things, and sometimes two more:

  1. Manually restart the Cluster service on the Exchange servers in the primary data center with this command: "net start clussvc /forcequorum"
  2. Restart the DAG in the primary data center with this command: "Start-DatabaseAvailabilityGroup -Identity '[DAG name]' -ActiveDirectorySite '[site name]'"
  3. If errors prevent the previous two steps from working, use the Failover Cluster Manager GUI to evict the node in the secondary data center
  4. In some cases it may be necessary to forcibly re-mount some databases with this command: "Mount-Database -Identity '[name of database]'"
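
The steps above can be sketched as a single Exchange Management Shell session. Treat this as a rough outline rather than a tested runbook: it assumes an elevated shell on a DAG member in the primary data center, and the names DAG1, PrimarySite, and DB01 are placeholders I made up, not values from this environment.

```powershell
# Placeholders (hypothetical): substitute your own DAG, AD site, and database names.
$dagName  = 'DAG1'
$siteName = 'PrimarySite'
$database = 'DB01'

# Step 1: start the Cluster service even though the node cannot reach quorum.
# Run this on the Exchange servers in the primary data center.
net start clussvc /forcequorum

# Step 2: mark the DAG members in the primary AD site as started again.
Start-DatabaseAvailabilityGroup -Identity $dagName -ActiveDirectorySite $siteName

# Step 3 (evicting the secondary node) is done in the Failover Cluster Manager GUI,
# so it is not shown here.

# Step 4 (only if needed): mount a database that stayed dismounted.
Mount-Database -Identity $database
```

Give each step a minute or two before judging whether it worked, since the results are not visible right away.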

The sites seem to recover after a few minutes, but the changes are not immediately apparent, and the databases take a few minutes to re-mount. The reasons for these commands were not readily obvious to me, but I've come to the conclusion that the following conditions must be considered:

  • Active Directory is the storage repository for the configuration, so many calls go to AD, and the changes are then propagated by AD replication (hence the delay)
  • Direct changes are made over RPC, which in a distributed environment can occasionally cause issues, especially if you use Microsoft's TMG/ISA Firewall products for your VPN tunnels (more on that in other posts)
  • The Cluster service seems to respond slowly to changes, and shuts itself down if it encounters enough errors
  • Don't confuse DAG and DAC; they are totally different things. A DAG (Database Availability Group) is the group of mailbox servers, while DAC (Datacenter Activation Coordination) is a mode you enable on a DAG
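
Given those delays, it helps to watch the state settle instead of re-running commands too early. Here is a minimal sketch of how that could look, assuming the Exchange Management Shell on a DAG member; the server name EXCH01 and the ten one-minute polls are arbitrary choices of mine, not recommendations from Microsoft.

```powershell
# Placeholder (hypothetical): one of the Exchange servers in the primary data center.
$server = 'EXCH01'

# Check whether the Cluster service is still running or has shut itself down.
Get-Service -Name ClusSvc

# Poll the database copy status once a minute for up to ten minutes,
# so slow AD replication and cluster changes have time to show up.
for ($i = 1; $i -le 10; $i++) {
    Get-MailboxDatabaseCopyStatus -Server $server |
        Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
    Start-Sleep -Seconds 60
}
```

If the Cluster service shows as stopped partway through, that matches the behavior described above, and the forced start has to be repeated.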

Also, the Microsoft documentation is decent (not great) on this, and is definitely worth reading: http://technet.microsoft.com/en-us/library/dd351049.aspx 

 

A bit about this environment:

  • 3 servers, 2 in primary data center, 1 in secondary data center
  • Running in DAC mode
  • All servers are members of the same DAG
  • The quorum share (file share witness) for the DAG is on a separate server in the primary data center
  • All running on Server 2008 R2
  • All servers running on VMware vSphere 4 (ESXi 4 U1)
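
To compare your own setup against this one, the DAG's configuration and live state can be dumped from the Exchange Management Shell. This is just a sketch; DAG1 is a placeholder name, not the real name of this DAG.

```powershell
# Placeholder (hypothetical) DAG name. -Status adds live cluster information
# such as OperationalServers to the stored configuration.
Get-DatabaseAvailabilityGroup -Identity 'DAG1' -Status |
    Format-List Name, Servers, DatacenterActivationMode, WitnessServer, WitnessDirectory, StartedMailboxServers, StoppedMailboxServers, OperationalServers
```

A DatacenterActivationMode of DagOnly means DAC mode is enabled, and the StartedMailboxServers and StoppedMailboxServers lists show which members the DAG currently considers started or stopped.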