Exchange 2010, Server 2008 R2, VMware ESXi 4.0.1 and freezing machines

Through some extensive testing I've discovered a fundamental bug in either Exchange 2010, ESXi, or both.

The Issue: My VMs freeze completely at random intervals, usually after running for a few hours, and never recover. No amount of debugging or looking for patterns on the Windows side has revealed a culprit, and I have not found a process or activity that correlates with the failure. From the OS perspective, the hardware just freezes. From the ESXi perspective, the VM goes to maximum CPU utilization and medium memory utilization and stays there permanently. VMware Tools connections fail, and the VMs have to be reset manually.
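
One way to catch this state automatically is to watch for CPU usage that pins at the maximum and never comes back down. Here is a minimal sketch in Python, assuming you already have some way of collecting periodic CPU-usage percentages for the VM. The function name, the 95% threshold, and the five-sample window are hypothetical choices for illustration, not anything ESXi provides.

```python
from typing import Optional, Sequence


def find_freeze_start(
    cpu_samples: Sequence[float],
    threshold: float = 95.0,  # assumed "pinned" level, in percent
    min_run: int = 5,         # assumed number of consecutive pinned samples
) -> Optional[int]:
    """Return the index where CPU usage first reaches the threshold and
    stays at or above it through the final sample, or None if the
    trailing pinned run is shorter than min_run samples."""
    start = None
    for i, value in enumerate(cpu_samples):
        if value >= threshold:
            # Remember where the current pinned run began.
            if start is None:
                start = i
        else:
            # Usage dropped, so any earlier run was just a busy spike.
            start = None
    if start is not None and len(cpu_samples) - start >= min_run:
        return start
    return None


# A VM that idles, then pins and never recovers: the run starts at index 3.
frozen = find_freeze_start([12, 30, 8, 99, 100, 100, 100, 100, 100])

# A short busy spike that recovers is not flagged.
spike = find_freeze_start([12, 99, 100, 20, 15])
```

Only a run that lasts through the most recent sample counts, which matches the symptom above: a frozen VM stays pinned, while a merely busy one eventually drops back.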

My platform is fairly simple:

  • HP DL380 G5 server
  • VMware ESXi 4.0.1 (fully patched)
  • Single VM is Windows Server 2008 R2 (fully patched)
  • Exchange 2010 running all roles except UM

Please note that numerous other VMs running Server 2008 R2 have no problems at all; only the ones running Exchange 2010 freeze.

Troubleshooting steps, none of which made any apparent difference:

  • Isolated this VM by itself on host
  • Tried on DL380 G5 machines with different Intel processors (5460 and 5160)
  • Ran it under VMware Fusion and Workstation, where I don't have problems
  • Network card changes don't affect it (it is definitely not a NIC problem, because it happens even when the machines have no NIC)
  • Tried the video driver change (the fix from before the ESXi 4.0.1 update)
  • Removed all other software such as AV, drivers, etc from VM
  • Removed all non-essential virtual HW
  • Tried disabling/enabling independent disks, doesn't help
  • Tried changing storage types, didn't help (Local SAS, external SCSI, USB, etc)
  • Tried changing number of vCPUs, memory, etc
  • Tried Exchange 2010 with and without DAGs and Failover Cluster Service
  • Disabled IPv6 and associated services
  • Tried on both ESXi 4.0.0 and 4.0.1

The workaround: I loaded up Server 2008 SP2 (non-R2) and everything works perfectly. No freezes, and in fact no problems of any kind. If anything, it feels faster to me in terms of GUI consoles, responsiveness, etc. I even got Forefront and DAGs to work just fine on this platform.

 

What does this mean? I think it means we have identified a serious bug that both VMware and Microsoft should be taking seriously. If Microsoft caused this to promote their Hyper-V product, then that's a definite misstep. If VMware knows this is a problem and isn't officially acknowledging it yet, then that is also a big misstep. On some platforms, Exchange 2010 on Server 2008 R2 simply will not work properly. No change, setting or patch that I have tried helps (vCPU, memory, video, network, storage, etc). So far I've only heard reports from individuals running this on HP DL380 G5 systems; that might point to a culprit, but it is hard to say, because this hardware is one of the most popular platforms on earth. I have confirmed that the issue is identical on Intel 5160 and 5460 processors.

So to summarize: EWE = URFM (ESXi 4.0.1 + Windows Server 2008 R2 + Exchange 2010 = Unusable Randomly Freezing Machines)