Geoff Chappell - Software Analyst
The Windows kernel started using the 8-byte compare-exchange instruction (CMPXCHG8B) in version 4.0. At first, the instruction had only a few uses, for more efficient coding of the following functions:
These are all exported functions, though of course they are also used internally. Curiously, the Driver Development Kit (DDK) for Windows NT 4.0 left the first of them undocumented, which is conspicuous because that particular function is nothing but a wrapper to get C-language arguments into appropriate registers for executing the instruction. With successive versions, the instruction found ever more use, both in exported functions, such as ExInterlockedFlushSList (added in version 5.0), and especially internally, including for working with 64-bit page table entries when using Physical Address Extension (PAE).
Note that one of the functions that uses CMPXCHG8B in version 4.0 existed already. All functions that use CMPXCHG8B can be coded without the instruction, but at the price (in a multi-processor coding) of having the caller provide storage for a primitive synchronisation object known as a spin lock. For instance, the single instruction
lock cmpxchg8b qword ptr [esi]
is replaceable with the following sequence
try:
        lock    bts dword ptr [edi],0
        jnb     acquired
wait:
        test    dword ptr [edi],1
        je      try
        pause                           ; if available
        jmp     wait
acquired:
        cmp     eax,[esi]
        jne     fail
        cmp     edx,[esi+4]
        je      exchange
fail:
        mov     eax,[esi]
        mov     edx,[esi+4]
        jmp     done
exchange:
        mov     [esi],ebx
        mov     [esi+4],ecx
done:
        mov     byte ptr [edi],0
provided that the 8 bytes at ESI are never modified without acquiring the spin lock at EDI which is in turn never used for any other purpose than guarding those 8 bytes. Even putting aside the undesirability of depending on all users of those 8 bytes to cooperate regarding the spin lock, there is the problem that the replacement is a lot of code, not just in terms of space but of clock cycles. The CMPXCHG8B instruction is clearly a nice feature to have to hand, and it was a natural addition when Intel’s processors started working with a 64-bit external bus.
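For readers who prefer C to assembly language, the spin-lock replacement can be sketched roughly as follows. This is only an illustration using C11 atomics, not anything taken from the kernel: the type guarded_u64 and the function guarded_cmpxchg64 are hypothetical names, and the sketch carries the same assumption as the assembly, namely that nothing touches the value without holding the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: the 8 bytes and the spin lock that guards them. */
typedef struct {
    atomic_flag lock;   /* plays the role of the spin lock at EDI */
    uint64_t    value;  /* plays the role of the 8 bytes at ESI */
} guarded_u64;

/* Compare *expected with the guarded value. If they are equal, store
   desired and return true. Otherwise copy the current value to *expected
   and return false, much as CMPXCHG8B leaves the current value in EDX:EAX. */
bool guarded_cmpxchg64(guarded_u64 *g, uint64_t *expected, uint64_t desired)
{
    /* Acquire: spin until the flag is found to have been clear. */
    while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire)) {
        /* busy-wait; the assembly polls with TEST and PAUSE here */
    }

    bool exchanged = (g->value == *expected);
    if (exchanged) {
        g->value = desired;
    } else {
        *expected = g->value;
    }

    /* Release: corresponds to "mov byte ptr [edi],0". */
    atomic_flag_clear_explicit(&g->lock, memory_order_release);
    return exchanged;
}
```

Even in C, the caller must supply and initialise the extra lock alongside every guarded value, which is exactly the burden that the single instruction removes.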
In the early days of Windows NT, however, not all the extant processors implemented the CMPXCHG8B instruction. In versions before 5.1, every function that uses the instruction has an alternate coding for processors that do not support the instruction. Very early during its initialisation, the kernel checks whether the boot processor supports the instruction. If the support is missing, the kernel patches JMP instructions at the start of each of those functions to redirect execution to their alternates. Conversely, if the boot processor does support the instruction, and the functions are left unpatched, then the kernel requires all processors to support the instruction, under pain of the bug check MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED (0x3E). Version 5.1 dropped the alternate codings and made it mandatory that the boot processor support the CMPXCHG8B instruction. Without this support, the kernel raises the bug check UNSUPPORTED_PROCESSOR (0x5D).
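The decisions described above can be summarised in a small C sketch. It models only the logic as stated in the text, not the kernel's code: the enumeration, the function names and the parameters are all hypothetical, and the second function applies only to versions that still have the alternate codings.

```c
#include <stdbool.h>

/* Bug check codes as named in the text. */
#define MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED 0x3Eu
#define UNSUPPORTED_PROCESSOR                      0x5Du

/* Hypothetical model of what happens for the boot processor. */
typedef enum {
    BOOT_LEAVE_UNPATCHED,   /* functions keep their CMPXCHG8B codings */
    BOOT_PATCH_ALTERNATES,  /* JMP patches redirect to the alternates */
    BOOT_BUG_CHECK          /* UNSUPPORTED_PROCESSOR, version 5.1 and higher */
} boot_decision;

boot_decision decide_for_boot_processor(bool version_has_alternates,
                                        bool boot_cpu_has_cx8)
{
    if (boot_cpu_has_cx8) {
        return BOOT_LEAVE_UNPATCHED;
    }
    /* Versions before 5.1 fall back to the alternates; later ones stop. */
    return version_has_alternates ? BOOT_PATCH_ALTERNATES : BOOT_BUG_CHECK;
}

/* Versions before 5.1 only: once the functions are left unpatched, every
   processor must support the instruction. A false return corresponds to
   MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED. */
bool processor_is_acceptable(bool functions_were_patched, bool this_cpu_has_cx8)
{
    return functions_were_patched || this_cpu_has_cx8;
}
```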
If reading only Intel’s literature, one might think that testing for the CMPXCHG8B instruction is a simple matter of executing the CPUID instruction with 1 in EAX and testing for the CX8 bit (0x0100) in the feature flags that are returned in EDX. However, there have always been quirks.
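The simple test suggested by Intel's literature might look like the following in C. This is a user-mode sketch, not kernel code: the function names are hypothetical, and the optional query relies on the `__get_cpuid` helper that GCC and Clang provide in `<cpuid.h>` for x86 targets.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_FEATURE_CX8 0x0100u   /* CX8 bit in EDX from CPUID with 1 in EAX */

/* Pure test on feature flags that have already been obtained. */
bool feature_flags_show_cx8(uint32_t edx_feature_flags)
{
    return (edx_feature_flags & CPUID_FEATURE_CX8) != 0;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

/* Ask the current processor, assuming a GCC-compatible compiler on x86. */
bool current_cpu_shows_cx8(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;   /* CPUID with 1 in EAX is not available */
    }
    return feature_flags_show_cx8(edx);
}
#endif
```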
Versions 4.0 and 5.0 test for the CMPXCHG8B instruction twice. The first test applies only to the boot processor. Its purpose is to determine whether functions that use the instruction must be patched, as described above. If the CPUID bit (0x00200000) in the EFLAGS register cannot be changed, then there is no CPUID instruction let alone CMPXCHG8B. Otherwise, the kernel sets the CPUID bit and executes the CPUID instruction with 1 in EAX. If the CX8 bit is set in EDX afterwards, then CMPXCHG8B is supported and no patches are needed. The second test is done for all processors, including the boot processor, as part of a wider examination of processor features. In builds of version 4.0 from before Windows NT 4.0 SP4, this second test recognises the CX8 bit only if the CPUID vendor string is GenuineIntel, AuthenticAMD or CyrixInstead. (The vendor string is the sequence of characters obtained by executing CPUID with 0 in EAX and then storing the values of EBX, EDX and ECX at successive memory locations.)
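As an illustration of how the vendor string is formed and then matched, here is a short C sketch. The function names are hypothetical. The registers are passed in as plain values so that the code does not depend on any particular way of executing CPUID, and the bytes are extracted explicitly so that the result does not depend on the byte order of the machine that runs the sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Store a register's four bytes, lowest first, as an x86 processor would. */
static void put_le32(char *dst, uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        dst[i] = (char)((value >> (8 * i)) & 0xFF);
    }
}

/* Build the 12-character vendor string from EBX, EDX and ECX, in that
   order, as returned by CPUID with 0 in EAX. */
void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    put_le32(out + 0, ebx);
    put_le32(out + 4, edx);
    put_le32(out + 8, ecx);
    out[12] = '\0';
}

/* The three vendors for which builds before Windows NT 4.0 SP4 recognise
   the CX8 bit in the second test, according to the text. */
bool vendor_is_recognised_pre_sp4(const char *vendor)
{
    return strcmp(vendor, "GenuineIntel") == 0
        || strcmp(vendor, "AuthenticAMD") == 0
        || strcmp(vendor, "CyrixInstead") == 0;
}
```

For example, an Intel processor returns 0x756E6547, 0x49656E69 and 0x6C65746E in EBX, EDX and ECX respectively, which spell out GenuineIntel.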
For processors that set the CX8 bit but are not made by Intel, AMD or Cyrix, the two tests do not agree and the processor falls foul of the requirement that if the boot processor supports CMPXCHG8B then all processors must. The result is the bug check MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED even if there is only one processor. In the Knowledge Base article CMPXCHG8B CPUs in Non-Intel/AMD x86 Compatibles Not Supported, Microsoft is at best disingenuous in suggesting that the first test is only a rough guess from the processor’s “type”, which a second test must “verify” by querying for “specific features”: both tests are specifically for the CX8 bit.
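The disagreement between the two tests can be modelled in a few lines of C. Again this is a sketch of the logic as described, with hypothetical function names, and it applies only to builds of version 4.0 from before Windows NT 4.0 SP4.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_FEATURE_CX8 0x0100u

/* First test (boot processor only): looks at the CX8 bit alone. */
bool first_test_sees_cx8(uint32_t edx_feature_flags)
{
    return (edx_feature_flags & CPUID_FEATURE_CX8) != 0;
}

/* Second test (every processor): the same bit, but honoured only for the
   recognised vendors. */
bool second_test_sees_cx8(uint32_t edx_feature_flags, bool vendor_recognised)
{
    return vendor_recognised && (edx_feature_flags & CPUID_FEATURE_CX8) != 0;
}

/* True if a single-processor machine would stop with
   MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED: the first test left the
   functions unpatched but the second test denies the feature. */
bool single_cpu_hits_bug_check(uint32_t edx_feature_flags, bool vendor_recognised)
{
    return first_test_sees_cx8(edx_feature_flags)
        && !second_test_sees_cx8(edx_feature_flags, vendor_recognised);
}
```

The only combination that fails is a set CX8 bit with an unrecognised vendor, which is why clearing the bit is enough to make such a processor usable.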
Despite its acknowledgement of trouble caused by restricting one CPU feature to known vendors, Microsoft took its time to relax the restrictions for other features. Although the version 4.0 from Windows NT 4.0 SP4 removes the vendor restrictions from its test for the CX8 bit, other features of interest to it are recognised only for particular vendors:
Only with version 5.0 did Microsoft stop its routine practice of excluding unknown (or unfavoured) CPU manufacturers from having Intel-compatible features be usable by Windows.
Meanwhile, the CPU manufacturers who were initially excluded from CMPXCHG8B will naturally have wanted to sell their processors even to people who might want to run an early build of Windows NT 4.0. Inevitably, they will have developed ways to turn the CX8 bit off, and perhaps even to have it off by default. After all, if they’re going to sell a CPU for use in computers that might be sold to just about anyone, then to compete equally they need all Windows versions to be at least usable on their processors, even if less than optimally. They might reasonably have hoped that Microsoft would soon build into its kernel some recognition of these processors to turn the CX8 bit back on. Again, Microsoft took its time.
Starting with version 5.1, which is also the first version that won’t start without support for CMPXCHG8B, the Windows kernel makes special cases for processors that may implement the CMPXCHG8B instruction but do not show the CX8 bit in the feature flags. The processors that are catered for are identified by the vendor strings GenuineTMx86, CentaurHauls and RiseRiseRise. The following notes describe the provisions as actually made by the Windows kernel, and are in no way concerned with how well the implementation corresponds with documentation by the vendors.
For GenuineTMx86 processors starting with family 5 model 4 stepping 2, if the CMPXCHG8B instruction is not indicated in the CPUID feature flags, it is enabled by setting the 0x0100 bit in the machine-specific register 0x80860004.
The previous paragraph describes the presumed intention. As actually coded, the model and stepping, taken together, must be at least 4 and 2 even if the family is greater than 5. A hypothetical family 6 model 1 stepping 1 would need to have the CX8 bit set or cleared in advance of booting Windows, depending on which version is to be run.
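The difference between the presumed intention and the actual coding is perhaps easiest to see side by side. The C sketch below is one reading of the description above, not a transcription of the kernel's code: all names are hypothetical, the decoding uses only the base family, model and stepping fields of the signature returned in EAX by CPUID with 1 in EAX, and the requirement that the family be at least 5 in the "as coded" comparison is an assumption drawn from the wording.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    unsigned family;    /* bits 11:8 of the signature */
    unsigned model;     /* bits 7:4 */
    unsigned stepping;  /* bits 3:0 */
} cpu_signature;

cpu_signature decode_signature(uint32_t eax_signature)
{
    cpu_signature s;
    s.family   = (eax_signature >> 8) & 0xF;
    s.model    = (eax_signature >> 4) & 0xF;
    s.stepping = eax_signature & 0xF;
    return s;
}

/* Presumed intention: at least family 5 model 4 stepping 2, comparing
   family first, then model, then stepping. */
bool tm_qualifies_as_intended(cpu_signature s)
{
    if (s.family != 5) return s.family > 5;
    if (s.model != 4)  return s.model > 4;
    return s.stepping >= 2;
}

/* As described for the actual code: model and stepping together must be
   at least 4 and 2, whatever the family (assumed here to be 5 or more). */
bool tm_qualifies_as_coded(cpu_signature s)
{
    return s.family >= 5 && ((s.model << 4) | s.stepping) >= 0x42;
}

/* Enabling CX8 means setting the 0x0100 bit in the value of the
   machine-specific register 0x80860004. */
uint64_t tm_msr_value_with_cx8(uint64_t old_msr_value)
{
    return old_msr_value | 0x0100u;
}
```

For the hypothetical family 6 model 1 stepping 1, the first comparison succeeds but the second fails, so the kernel would leave the machine-specific register alone.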
If the CPUID feature flags for a CentaurHauls processor do not show support for CMPXCHG8B, then the support is enabled by slightly different methods depending on the family:
Clearing the 0x01 bit is omitted in early builds. It begins with the version 5.1 from Windows XP SP2 and the version 5.2 from Windows Server 2003 SP1.
Starting with the version 5.1 from Windows XP SP2 and the version 5.2 from Windows Server 2003 SP1, all RiseRiseRise processors are treated as supporting CMPXCHG8B, even without the CX8 bit in the CPUID feature flags.