Thursday, April 16, 2009

Erratic or Negative Ping Times in Hyper-V Guests

A customer approached me with some puzzling issues. They noticed a number of Userenv 1053 and 1054 errors in the event logs on their virtual machines. The two error messages have very similar wording:

1053 - Windows cannot determine the user or computer name. (<description>). Group Policy processing aborted.
1054 - Windows cannot obtain the domain controller name for your computer network. (). Group Policy processing aborted.

Typically, these are related to DNS. In this instance, however, the customer also presented some other interesting issues--negative ping times or very high ping times (in excess of 5000ms). And, to boot, these erratic ping times were only present on virtual machines with two or more virtual processors.

So, what's the relationship?

It turns out to be a relatively simple explanation.

Before a Group Policy client processes GPOs, the round-trip time (RTT) between the client and the DC handling the logon and Group Policy request is measured. If the average RTT is greater than 10ms for 2048-byte packets, the link is generally considered "slow" by default (another value can be configured). Under "slow" conditions, Group Policy will not process. I’ve seen this issue before in environments where authentication happens over a WAN (where times are greater than 10ms) or where routers drop or fragment large ICMP packets (affectionately known as “blackhole router syndrome”).

So, if a machine is reporting a 5000ms ping, it stands to reason that the OS might think that the link is indeed slow.
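To make that reasoning concrete, here is a minimal sketch of the rule as described above. It is not the actual Windows slow-link algorithm; the function name, the sample values, and the choice to flag negative samples as "invalid" are all assumptions made for illustration.

```python
from statistics import mean

# Threshold quoted in this post; assumed here purely for illustration.
SLOW_LINK_THRESHOLD_MS = 10.0


def classify_link(rtt_samples_ms, threshold_ms=SLOW_LINK_THRESHOLD_MS):
    """Classify a link from ping samples using the simplified rule in this post."""
    # A negative RTT is physically impossible, so treat it as a bad measurement
    # (an assumption of this sketch, not documented Windows behavior).
    if any(sample < 0 for sample in rtt_samples_ms):
        return "invalid"
    return "slow" if mean(rtt_samples_ms) > threshold_ms else "fast"


print(classify_link([1.0, 2.0, 1.5]))     # healthy LAN samples
print(classify_link([1.0, 5000.0, 2.0]))  # one erratic sample drags the average up
print(classify_link([1.0, -300.0, 2.0]))  # sample taken while the clock ran backwards
```

A single 5000ms outlier is enough to push the average far past the threshold, even though the network itself is fine.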

As previously mentioned, this problem occurs on Windows 2003 virtual machines that are configured for multiple virtual processors (VPs). All operating systems use some sort of clock timing mechanism, and frequently they rely on the Time Stamp Counter (TSC), which counts CPU ticks since system start. Each processor has its own TSC, and the TSC for each processor can be different because they’re not necessarily synchronized. This means that if a VM reads the TSC from multiple VPs, the timestamps may actually go backwards or be out of order. This does not happen in a single CPU scenario (physical or virtual), since only one TSC is being used.
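The effect is easy to reproduce with a toy model. The sketch below is a simulation only: the SimulatedCpu class, the tick rate, and the 5ms skew are hypothetical values picked to keep the arithmetic readable, and none of it touches a real TSC.

```python
TICKS_PER_MS = 1_000  # assumed tick rate, chosen for readable numbers


class SimulatedCpu:
    """A virtual processor whose counter is skewed by a fixed offset (hypothetical model)."""

    def __init__(self, offset_ticks):
        self.offset_ticks = offset_ticks

    def read_tsc(self, true_ticks):
        # Each processor reports the true tick count plus its own skew.
        return true_ticks + self.offset_ticks


def elapsed_ms(start_cpu, end_cpu, true_start, true_end):
    """Compute elapsed time the way a naive timer would: end reading minus start reading."""
    start = start_cpu.read_tsc(true_start)
    end = end_cpu.read_tsc(true_end)
    return (end - start) / TICKS_PER_MS


cpu0 = SimulatedCpu(offset_ticks=0)
cpu1 = SimulatedCpu(offset_ticks=-5_000)  # cpu1's counter lags cpu0 by 5 ms

# The real duration is 2 ms (2,000 ticks) in all three cases.
print(elapsed_ms(cpu0, cpu0, 10_000, 12_000))  # both reads on the same processor
print(elapsed_ms(cpu0, cpu1, 10_000, 12_000))  # thread moved to the lagging processor
print(elapsed_ms(cpu1, cpu0, 10_000, 12_000))  # thread moved to the leading processor
```

With both reads on one processor the result is the true 2ms. When the thread migrates between the two reads, the same interval comes out as -3ms or 7ms depending on the direction of the move, which is the same pattern as the negative and inflated ping times.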

The three possible workarounds:
1. Upgrade to Windows 2008. Obviously, this won't work for everyone, so for those people, workarounds two or three should provide some relief.
2. Shut down the VM, change the number of CPUs to 1 in Hyper-V manager and then start the VM.
3. Add the /usepmtimer switch to the boot.ini configuration of each Windows 2003 server using multiple processors. In the physical world, this phenomenon appears to happen only on AMD processors. The VM world is less discriminating about processor type. Windows 2003 SP2 is normally supposed to use the ACPI Power Management Timer (PM Timer), as long as the BIOS check for it succeeds. In the case of Hyper-V, the BIOS check fails, so it falls back to the TSC. Remember, modifying the boot.ini requires a reboot for the change to become effective.
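For workaround three on more than a handful of servers, the edit can be scripted. The following is a rough sketch that works on the text of a boot.ini only. The add_usepmtimer helper and the sample ARC path are made up for illustration, and the sketch does not handle the file's hidden, read-only and system attributes or take a backup, both of which you would want before touching a real server.

```python
def add_usepmtimer(boot_ini_text):
    """Return boot.ini text with /usepmtimer appended to each OS entry that lacks it."""
    result = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Only entries under [operating systems] take boot switches.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped and "/usepmtimer" not in stripped.lower():
            line = line.rstrip() + " /usepmtimer"
        result.append(line)
    return "\n".join(result)


# Hypothetical boot.ini contents; the ARC path on a real server may differ.
SAMPLE = (
    "[boot loader]\n"
    "timeout=30\n"
    "default=multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS\n"
    "[operating systems]\n"
    'multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="Windows Server 2003" /fastdetect\n'
)

print(add_usepmtimer(SAMPLE))
```

Note that this sketch appends the switch to every entry in the [operating systems] section, including things like a Recovery Console entry if one is present, so review the output before writing it back.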

The Win32 API call, QueryPerformanceCounter, uses the TSC by default. Adding the /usepmtimer boot.ini flag tells QueryPerformanceCounter to use the ACPI/PM timer.

Related information:

Wikipedia - Time Stamp Counter

A Windows Server 2003-based server may experience time-stamp counter drift if the server uses dual-core AMD Opteron processors or multiprocessor AMD Opteron processors

Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2000, in Windows Server 2003, and in Windows XP

Explanation for the USEPMTIMER switch in the boot.ini

Windows Server Performance Team Blog : Hyper-V and Multiprocessor VMs

Negative ping times in Windows VM's - what's up?

How a slow link is detected for processing user profiles and Group Policy

How to enable user environment debug logging in retail builds of Windows

Available switch options for the Windows XP and the Windows Server 2003 Boot.ini files
