Virtualizing CPUID

In the previous chapter, we successfully initiated the boot process of the guest Linux and confirmed that a VM exit occurs due to CPUID. In this chapter, we will virtualize CPUID by presenting appropriate values to the guest when it executes the CPUID instruction.

Important: The source code for this chapter is in the whiz-vmm-cpuid branch.

VM Exit Handler

When the guest attempts to execute the CPUID instruction, a VM exit is triggered unconditionally. At this time, the Basic Reason field under the Basic VM-Exit Information category in the VMCS is set to 0x0A (.cpuid). In the VM exit handler handleExit(), we invoke the corresponding handler when the cause of the VM exit is CPUID:

ymir/arch/x86/vmx/vcpu.zig
const cpuid = @import("cpuid.zig");

fn handleExit(self: *Self, exit_info: vmx.ExitInfo) VmxError!void {
    switch (exit_info.basic_reason) {
        .cpuid => {
            try cpuid.handleCpuidExit(self);
            try self.stepNextInst();
        },
        ...
    }
    ...
}

cpuid.handleCpuidExit() is the dedicated handler for the CPUID instruction. It stores the appropriate values in the guest registers according to the requested CPUID leaf.

Vcpu.stepNextInst() increments the guest RIP so that it points to the next instruction. Since x64 is a CISC architecture, instruction lengths are variable rather than fixed. Therefore, to increment RIP, we need to know the length of the instruction currently pointed to by RIP. Fortunately, the VMCS VM-Exit Information includes the Exit Instruction Length field, which provides the length of the instruction that caused the VM exit. This saves us from having to write our own x64 instruction decoder, which is a great convenience. In stepNextInst(), we read this field and add its value to RIP so that it points to the next instruction:

ymir/arch/x86/vmx/vcpu.zig
fn stepNextInst(_: *Self) VmxError!void {
    const rip = try vmread(vmcs.guest.rip);
    try vmwrite(vmcs.guest.rip, rip + try vmread(vmcs.ro.exit_inst_len));
}
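To make the RIP update concrete, here is a minimal sketch that models it without touching a real VMCS. The FakeVmcs struct and its two fields are hypothetical stand-ins for the vmread()/vmwrite() accesses above. The only architectural fact it relies on is that CPUID is encoded as the two bytes 0F A2, so the Exit Instruction Length is 2 for this exit:

```zig
const std = @import("std");

/// Hypothetical stand-in for the two VMCS fields used by stepNextInst().
const FakeVmcs = struct {
    guest_rip: u64,
    exit_inst_len: u64,
};

/// Advance the guest RIP past the instruction that caused the VM exit.
fn stepNextInst(vmcs: *FakeVmcs) void {
    vmcs.guest_rip += vmcs.exit_inst_len;
}

pub fn main() void {
    // CPUID is encoded as 0F A2, so its length is 2 bytes.
    var vmcs = FakeVmcs{ .guest_rip = 0x10_0000, .exit_inst_len = 2 };
    stepNextInst(&vmcs);
    std.debug.print("next RIP = 0x{X}\n", .{vmcs.guest_rip});
}
```

If the handler forgot this step, the guest would resume at the same CPUID instruction and exit again immediately.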

CPUID Handler

cpuid.handleCpuidExit() is the handler for a CPUID instruction. Its basic structure is as follows:

ymir/arch/x86/vmx/cpuid.zig
const cpuid = arch.cpuid;
const Leaf = cpuid.Leaf;

pub fn handleCpuidExit(vcpu: *Vcpu) VmxError!void {
    const regs = &vcpu.guest_regs;
    switch (Leaf.from(regs.rax)) {
        ...
    }
}

CPUID takes the leaf to query in the RAX register. For certain leaves, a subleaf can also be specified in the RCX register. The combination of leaf and subleaf determines the type of information to retrieve. The Leaf enum, defined in the VMX Root Operation chapter, provides operations related to CPUID leaves. In handleCpuidExit(), we obtain the requested CPUID leaf from the guest registers and use a switch statement to handle it accordingly.

There are a large number of CPUID leaves. The set of supported leaves varies with the CPU generation, and new leaves continue to be added over time, so it is not feasible to support all of them in this switch statement. In Ymir, we choose to explicitly support only specific CPUID leaves. For any unsupported leaf, Ymir falls back to making the CPUID instruction return zero in all of the RAX, RBX, RCX, and RDX registers:

ymir/arch/x86/vmx/cpuid.zig
fn invalid(vcpu: *Vcpu) void {
    const gregs = &vcpu.guest_regs;
    setValue(&gregs.rax, 0);
    setValue(&gregs.rbx, 0);
    setValue(&gregs.rcx, 0);
    setValue(&gregs.rdx, 0);
}

inline fn setValue(reg: *u64, val: u64) void {
    @as(*u32, @ptrCast(reg)).* = @as(u32, @truncate(val));
}
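The effect of setValue() on a 64-bit register can be checked in isolation. The sketch below copies the helper and applies it to a local variable. It assumes a little-endian target such as x64, where the low half of a u64 sits at the lowest address, and the starting value of rax is arbitrary:

```zig
const std = @import("std");

/// Same helper as above: store the low 32 bits of `val` into the low half of `reg`.
inline fn setValue(reg: *u64, val: u64) void {
    @as(*u32, @ptrCast(reg)).* = @as(u32, @truncate(val));
}

pub fn main() void {
    var rax: u64 = 0xFFFF_FFFF_0000_0007;
    setValue(&rax, 0x1_2345_6789);
    // On a little-endian target, only the low 32 bits change.
    std.debug.print("rax = 0x{X:0>16}\n", .{rax});
}
```

The upper half keeps whatever the guest had there before. A real CPUID, by contrast, clears the upper 32 bits of these registers on Intel 64 processors.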

The Leaf enum was not designed to define all possible CPUID leaves in the first place; rather, it was defined as a non-exhaustive enum:

ymir/arch/x86/cpuid.zig
pub const Leaf = enum(u32) {
    ...
    _,
    ...
};

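A non-exhaustive enum allows us to catch all undefined variants in a switch using _ => [1]. For now, before handling each individual Leaf, we will call invalid() for all of them. Because a _ prong requires every named variant to be handled explicitly, the prongs for the named leaves are added in the following sections: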

ymir/arch/x86/vmx/cpuid.zig
switch (Leaf.from(regs.rax)) {
    _ => {
        log.warn("Unhandled CPUID: Leaf=0x{X:0>8}, Sub=0x{X:0>8}", .{ regs.rax, regs.rcx });
        invalid(vcpu);
    },
}
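The behavior of the _ prong can be tried on a reduced enum. In this sketch, Leaf has only two named tags and from() is a hypothetical implementation that truncates RAX to 32 bits; neither is the actual definition in Ymir:

```zig
const std = @import("std");

/// Reduced, hypothetical version of the Leaf enum with only two named tags.
const Leaf = enum(u32) {
    maximum_input = 0x0,
    version_info = 0x1,
    _,

    /// Hypothetical helper: interpret the low 32 bits of RAX as a leaf.
    fn from(rax: u64) Leaf {
        return @enumFromInt(@as(u32, @truncate(rax)));
    }
};

fn describe(leaf: Leaf) []const u8 {
    return switch (leaf) {
        .maximum_input => "maximum_input",
        .version_info => "version_info",
        // Catches every value that has no named tag.
        _ => "unhandled",
    };
}

pub fn main() void {
    std.debug.print("{s}\n", .{describe(Leaf.from(0x1))});
    std.debug.print("{s}\n", .{describe(Leaf.from(0x4000_0000))});
}
```

Removing one of the named prongs makes the compiler reject the switch, which is what distinguishes _ from else.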

Leaf-specific Handling

We will define the values to present to the guest for each leaf. In Ymir, only the minimum necessary leaves are supported. For a complete description of all leaves supported by the CPU, refer to SDM Vol.2A 3.3 CPUID - CPU Identification.

0x0: Basic CPUID Information

Information Returned by CPUID Instruction: Leaf 0x0. SDM Vol.2A Table 3-17.

Leaf 0x0 returns the highest basic leaf number supported by the CPU, along with the CPU vendor ID. On recent CPUs the highest supported leaf is likely 0x20, which is also the value Ymir reports. CPUID may additionally support leaves called extended functions, which start from 0x8000_0000. The value returned by leaf 0x0 is the highest leaf number excluding these extended functions.

The CPU vendor ID is a 12-byte string indicating the CPU manufacturer. For Intel CPUs, this is "GenuineIntel". Linux KVM identifies itself to guests with the signature "KVMKVMKVM", which it exposes through the hypervisor leaf 0x4000_0000 rather than leaf 0x0. Guests can detect that they are running under KVM from this signature and use paravirtualization features if available. In the same spirit, Ymir returns "YmirYmirYmir", here as the vendor ID in leaf 0x0. Of course, since there is no guest OS specifically supporting Ymir, this value won't enable any special behavior on the guest side!

ymir/arch/x86/vmx/cpuid.zig
    .maximum_input => {
        setValue(&regs.rax, 0x20); // Maximum input value for basic CPUID.
        setValue(&regs.rbx, 0x72_69_6D_59); // Ymir
        setValue(&regs.rcx, 0x72_69_6D_59); // Ymir
        setValue(&regs.rdx, 0x72_69_6D_59); // Ymir
    },
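The constant 0x72_69_6D_59 is simply "Ymir" packed so that the first character lands in the lowest byte. The following sketch shows the packing; pack4() is a hypothetical helper, not part of Ymir. CPUID concatenates the vendor string in the order EBX, EDX, ECX, which does not matter here because all three registers hold the same four characters:

```zig
const std = @import("std");

/// Pack four ASCII bytes into a u32 so that the first character lands in the lowest byte.
fn pack4(s: *const [4]u8) u32 {
    return @as(u32, s[0]) |
        (@as(u32, s[1]) << 8) |
        (@as(u32, s[2]) << 16) |
        (@as(u32, s[3]) << 24);
}

pub fn main() void {
    std.debug.print("Ymir = 0x{X:0>8}\n", .{pack4("Ymir")});
    // First four bytes of Intel's "GenuineIntel", for comparison.
    std.debug.print("Genu = 0x{X:0>8}\n", .{pack4("Genu")});
}
```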

0x1: Feature Information

Information Returned by CPUID Instruction: Leaf 0x1. SDM Vol.2A Table 3-17.

Leaf 0x1 returns information about the CPU version and supported features. Since the vendor ID is an unknown value to the guest, the version information has no real meaning. For now, we simply return the version information provided by the host CPU as is:

ymir/arch/x86/vmx/cpuid.zig
    .version_info => {
        const orig = Leaf.query(.version_info, null);
        setValue(&regs.rax, orig.eax); // Version information.
        setValue(&regs.rbx, orig.ebx); // Brand index / CLFLUSH line size / Addressable IDs / Initial APIC ID
        setValue(&regs.rcx, @as(u32, @bitCast(feature_info_ecx)));
        setValue(&regs.rdx, @as(u32, @bitCast(feature_info_edx)));
    },

The feature information stored in ECX and EDX is a pair of bitfields in which each bit indicates whether a particular CPU feature is supported. The definitions are as follows. They are lengthy, so feel free to skim them:

Definition of Feature Information
ymir/arch/x86/cpuid.zig
pub const FeatureInfoEcx = packed struct(u32) {
    /// Streaming SIMD Extensions 3 (SSE3).
    sse3: bool = false,
    /// PCLMULQDQ.
    pclmulqdq: bool = false,
    /// 64-bit DS Area.
    dtes64: bool = false,
    /// MONITOR/MWAIT.
    monitor: bool = false,
    /// CPL Qualified Debug Store.
    ds_cpl: bool = false,
    /// Virtual Machine Extensions.
    vmx: bool = false,
    /// Safer Mode Extensions.
    smx: bool = false,
    /// Enhanced Intel SpeedStep Technology.
    eist: bool = false,
    /// Thermal Monitor 2.
    tm2: bool = false,
    /// SSSE3 extensions.
    ssse3: bool = false,
    /// L1 context ID.
    cnxt_id: bool = false,
    /// IA32_DEBUG_INTERFACE.
    sdbg: bool = false,
    /// FMA extensions using YMM state.
    fma: bool = false,
    /// CMPXCHG16B available.
    cmpxchg16b: bool = false,
    /// xTPR update control.
    xtpr: bool = false,
    /// Perfmon and Debug Capability.
    pdcm: bool = false,
    /// Reserved.
    _reserved_0: bool = false,
    /// Process-context identifiers.
    pcid: bool = false,
    /// Ability to prefetch data from a memory mapped device.
    dca: bool = false,
    /// SSE4.1 extensions.
    sse4_1: bool = false,
    /// SSE4.2 extensions.
    sse4_2: bool = false,
    /// x2APIC support.
    x2apic: bool = false,
    /// MOVBE instruction.
    movbe: bool = false,
    /// POPCNT instruction.
    popcnt: bool = false,
    /// Local APIC timer supports one-shot operation using TSC deadline.
    tsc_deadline: bool = false,
    /// AES instruction.
    aesni: bool = false,
    /// XSAVE/XRSTOR states.
    xsave: bool = false,
    /// OS has enabled XSETBV/XGETBV instructions to access XCR0.
    osxsave: bool = false,
    /// AVX.
    avx: bool = false,
    /// 16-bit floating-point conversion instructions.
    f16c: bool = false,
    /// RDRAND instruction.
    rdrand: bool = false,
    /// Not used.
    hypervisor: bool = false,
};

pub const FeatureInfoEdx = packed struct(u32) {
    /// x87 FPU.
    fpu: bool = false,
    /// Virtual 8086 mode enhancements.
    vme: bool = false,
    /// Debugging extensions.
    de: bool = false,
    /// Page Size Extension.
    pse: bool = false,
    /// Time Stamp Counter.
    tsc: bool = false,
    /// RDMSR and WRMSR instructions.
    msr: bool = false,
    /// Physical Address Extension.
    pae: bool = false,
    /// Machine Check Exception.
    mce: bool = false,
    /// CMPXCHG8B instruction.
    cx8: bool = false,
    /// APIC on-chip.
    apic: bool = false,
    /// Reserved.
    _reserved_0: bool = false,
    /// SYSENTER/SYSEXIT instructions.
    sep: bool = false,
    /// Memory Type Range Registers.
    mtrr: bool = false,
    /// Page Global Bit.
    pge: bool = false,
    /// Machine check architecture.
    mca: bool = false,
    /// Conditional move instructions.
    cmov: bool = false,
    /// Page attribute table.
    pat: bool = false,
    /// 36-bit Page Size Extension.
    pse36: bool = false,
    /// Processor serial number.
    psn: bool = false,
    /// CLFLUSH instruction.
    clfsh: bool = false,
    /// Reserved.
    _reserved_1: bool = false,
    /// Debug store.
    ds: bool = false,
    /// Thermal monitor and software controlled clock facilities.
    acpi: bool = false,
    /// Intel MMX Technology.
    mmx: bool = false,
    /// FXSAVE and FXRSTOR instructions.
    fxsr: bool = false,
    /// SSE extensions.
    sse: bool = false,
    /// SSE2 extensions.
    sse2: bool = false,
    /// Self snoop.
    ss: bool = false,
    /// Max APIC IDs reserved field.
    htt: bool = false,
    /// Thermal monitor.
    tm: bool = false,
    /// Reserved.
    _reserved_2: bool = false,
    /// Pending Break Enable.
    pbe: bool = false,
};
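How such a packed struct maps to the 32-bit register value can be seen on a reduced example. The Flags type below is hypothetical: it keeps only the first four fields of FeatureInfoEdx and pads the rest. In a Zig packed struct the first field occupies bit 0, so @bitCast converts in both directions without any manual shifting:

```zig
const std = @import("std");

/// Reduced, hypothetical bitfield: the first field occupies bit 0.
const Flags = packed struct(u32) {
    fpu: bool = false, // bit 0
    vme: bool = false, // bit 1
    de: bool = false, // bit 2
    pse: bool = false, // bit 3
    _rest: u28 = 0, // bits 4-31
};

/// Struct to raw register value, as done when answering the guest.
fn encode(flags: Flags) u32 {
    return @bitCast(flags);
}

/// Raw register value back to the struct, as a reader of CPUID would do.
fn decode(raw: u32) Flags {
    return @bitCast(raw);
}

pub fn main() void {
    const raw = encode(.{ .fpu = true, .pse = true });
    std.debug.print("raw = 0b{b:0>4}, pse = {}\n", .{ raw, decode(raw).pse });
}
```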

x64 Linux checks whether the required features are supported in verify_cpu(), and aborts initialization if any are missing:

required-features.h
#define REQUIRED_MASK0	(NEED_FPU|NEED_PSE|NEED_MSR|NEED_PAE|\
			 NEED_CX8|NEED_PGE|NEED_FXSR|NEED_CMOV|\
			 NEED_XMM|NEED_XMM2)

Ymir supports the features set to true in the definitions below, which include these mandatory ones.

Conversely, the following feature is explicitly disabled:

  • ACPI: Ymir does not support it at all
ymir/arch/x86/vmx/cpuid.zig
const feature_info_ecx = cpuid.FeatureInfoEcx{
    .pcid = true,
};
const feature_info_edx = cpuid.FeatureInfoEdx{
    .fpu = true,
    .vme = true,
    .de = true,
    .pse = true,
    .msr = true,
    .pae = true,
    .cx8 = true,
    .sep = true,
    .pge = true,
    .cmov = true,
    .pse36 = true,
    .acpi = false,
    .fxsr = true,
    .sse = true,
    .sse2 = true,
};
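As a sanity check, the EDX value above can be compared against Linux's REQUIRED_MASK0. The sketch below is not part of Ymir. The bit positions are those of the FeatureInfoEdx definition, with NEED_XMM and NEED_XMM2 being Linux's names for SSE and SSE2, and the constant 0x0702_A96F is the value of feature_info_edx worked out by hand from the fields set to true:

```zig
const std = @import("std");

/// Bit positions in CPUID.01H:EDX of FPU, PSE, MSR, PAE, CX8, PGE, CMOV, FXSR, SSE, SSE2.
const required_bits = [_]u5{ 0, 3, 5, 6, 8, 13, 15, 24, 25, 26 };

/// Build the mask that corresponds to REQUIRED_MASK0.
fn requiredMask() u32 {
    var mask: u32 = 0;
    for (required_bits) |bit| {
        mask |= @as(u32, 1) << bit;
    }
    return mask;
}

/// Return the required bits that are absent from `edx` (0 means nothing is missing).
fn missingRequired(edx: u32) u32 {
    return requiredMask() & ~edx;
}

pub fn main() void {
    // Value of feature_info_edx from this chapter, worked out by hand.
    const presented: u32 = 0x0702_A96F;
    std.debug.print("missing = 0x{X:0>8}\n", .{missingRequired(presented)});
}
```

A nonzero result would name exactly the bits whose absence makes verify_cpu() give up.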

Disabling PCID in Linux

Recent Intel Core CPUs should support PCID. However, CPUs from Alder Lake (12th Gen) onward have a bug in which the INVLPG instruction does not flush global pages. To work around this, kernels from Linux 6.4 onward apply a patch that disables PCID. If your kernel and CPU are affected, set pcid above to false so that PCID is disabled in the guest as well.

0x6: Thermal and Power Management

Ymir does not support thermal and power management at all:

ymir/arch/x86/vmx/cpuid.zig
    .thermal_power => invalid(vcpu),

0x7: Extended Feature Flags Enumeration

Information Returned by CPUID Instruction: Leaf 0x7. SDM Vol.2A Table 3-17.

Like leaf 0x1, leaf 0x7 returns CPU feature information. Unlike leaf 0x1, it has subleaves, and the number of supported subleaves is given by the value that subleaf 0x0 returns in RAX. Ymir intended to support only subleaf 0, but even when the maximum supported subleaf was reported as 0, Linux still queried subleaves 1 and beyond. Therefore, queries to subleaves 1 and 2 are accepted and answered with invalid(), that is, with all registers zeroed. Queries to subleaf 3 and above are treated as an error and abort the vCPU. Note that the code below reports 1 as the maximum subleaf in RAX:

ymir/arch/x86/vmx/cpuid.zig
    .ext_feature => {
        switch (regs.rcx) {
            0 => {
                setValue(&regs.rax, 1); // Maximum input value for supported leaf 7 sub-leaves.
                setValue(&regs.rbx, @as(u32, @bitCast(ext_feature0_ebx)));
                setValue(&regs.rcx, 0); // Unimplemented.
                setValue(&regs.rdx, 0); // Unimplemented.
            },
            1, 2 => invalid(vcpu),
            else => {
                log.err("Unhandled CPUID: Leaf=0x{X:0>8}, Sub=0x{X:0>8}", .{ regs.rax, regs.rcx });
                vcpu.abort();
            },
        }
    },

The feature flags stored in EBX are as follows. The definition is quite lengthy, so feel free to skim it:

Definition of Extended Feature Flags
ymir/arch/x86/cpuid.zig
pub const ExtFeatureEbx0 = packed struct(u32) {
    fsgsbase: bool = false,
    tsc_adjust: bool = false,
    sgx: bool = false,
    bmi1: bool = false,
    hle: bool = false,
    avx2: bool = false,
    fdp: bool = false,
    smep: bool = false,
    bmi2: bool = false,
    erms: bool = false,
    invpcid: bool = false,
    rtm: bool = false,
    rdtm: bool = false,
    fpucsds: bool = false,
    mpx: bool = false,
    rdta: bool = false,
    avx512f: bool = false,
    avx512dq: bool = false,
    rdseed: bool = false,
    adx: bool = false,
    smap: bool = false,
    avx512ifma: bool = false,
    _reserved1: u1 = 0,
    clflushopt: bool = false,
    clwb: bool = false,
    pt: bool = false,
    avx512pf: bool = false,
    avx512er: bool = false,
    avx512cd: bool = false,
    sha: bool = false,
    avx512bw: bool = false,
    avx512vl: bool = false,
};

Ymir supports the following features:

ymir/arch/x86/vmx/cpuid.zig
const ext_feature0_ebx = cpuid.ExtFeatureEbx0{
    .fsgsbase = false,
    .smep = true,
    .invpcid = true,
    .smap = true,
};

0xD: Processor Extended State Enumeration

Leaf 0xD returns information about the processor extended states. This leaf also has subleaves. Subleaf 1 provides information about support for XSAVE and XRSTOR. Ymir only accepts queries to subleaf 1, which it answers with invalid() rather than with real values, and treats any other query as an error:

ymir/arch/x86/vmx/cpuid.zig
    .ext_enumeration => {
        switch (regs.rcx) {
            1 => invalid(vcpu),
            else => {
                log.err("Unhandled CPUID: Leaf=0x{X:0>8}, Sub=0x{X:0>8}", .{ regs.rax, regs.rcx });
                vcpu.abort();
            },
        }
    },

0x8000_0000: Extended Function Maximum Input

Information Returned by CPUID Instruction: Leaf 0x8000_0000. SDM Vol.2A Table 3-17.

Leaf 0x8000_0000 returns the highest supported extended function leaf number. Linux retrieves this value in verify_cpu() and requires it to be at least 0x8000_0001. Therefore, Ymir will also support up to this level:

ymir/arch/x86/vmx/cpuid.zig
    .ext_func => {
        setValue(&regs.rax, 0x8000_0000 + 1); // Maximum input value for extended function CPUID.
        setValue(&regs.rbx, 0); // Reserved.
        setValue(&regs.rcx, 0); // Reserved.
        setValue(&regs.rdx, 0); // Reserved.
    },
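Together with leaf 0x0, this leaf tells the guest which leaf numbers are worth querying. The sketch below shows that check from the guest's point of view. isAdvertised() is a hypothetical helper, and the two constants are the maximum values Ymir reports in this chapter:

```zig
const std = @import("std");

/// Maximum leaf numbers that Ymir reports in this chapter.
const max_basic: u32 = 0x20;
const max_ext: u32 = 0x8000_0001;

/// Hypothetical guest-side check: is `leaf` within the advertised ranges?
fn isAdvertised(leaf: u32) bool {
    // Extended functions start at 0x8000_0000 and have their own maximum.
    if (leaf >= 0x8000_0000) return leaf <= max_ext;
    return leaf <= max_basic;
}

pub fn main() void {
    std.debug.print("0x8000_0001: {}\n", .{isAdvertised(0x8000_0001)});
    std.debug.print("0x8000_0002: {}\n", .{isAdvertised(0x8000_0002)});
}
```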

0x8000_0001: Extended Function

Leaf 0x8000_0001 returns information about extended features such as syscall and Intel 64. In Ymir, this leaf passes through the host's ECX and EDX values, while RAX and RBX are set to zero:

ymir/arch/x86/vmx/cpuid.zig
    .ext_proc_signature => {
        const orig = Leaf.ext_proc_signature.query(null);
        setValue(&regs.rax, 0); // Extended processor signature and feature bits.
        setValue(&regs.rbx, 0); // Reserved.
        setValue(&regs.rcx, orig.ecx); // LAHF in 64-bit mode / LZCNT / PREFETCHW
        setValue(&regs.rdx, orig.edx); // SYSCALL / XD / 1GB large page / RDTSCP and IA32_TSC_AUX / Intel64
    },
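The EDX bits named in the comment above can be tested individually. The sketch below is illustrative only: the ExtEdxBit enum and has() are hypothetical names, the bit positions follow the SDM's description of CPUID.80000001H:EDX, and the sample value in main() is made up rather than read from a real CPU:

```zig
const std = @import("std");

/// Bit positions in CPUID.80000001H:EDX.
const ExtEdxBit = enum(u5) {
    syscall = 11, // SYSCALL/SYSRET
    xd = 20, // Execute Disable
    page_1gb = 26, // 1-GByte pages
    rdtscp = 27, // RDTSCP and IA32_TSC_AUX
    intel64 = 29, // Intel 64 architecture
};

/// Check whether the given feature bit is set in `edx`.
fn has(edx: u32, bit: ExtEdxBit) bool {
    return ((edx >> @intFromEnum(bit)) & 1) == 1;
}

pub fn main() void {
    // Made-up sample with only SYSCALL and Intel 64 set.
    const edx: u32 = (1 << 11) | (1 << 29);
    std.debug.print("syscall = {}, xd = {}, intel64 = {}\n", .{
        has(edx, .syscall),
        has(edx, .xd),
        has(edx, .intel64),
    });
}
```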

Summary

This completes the VM exit handler for CPUID. In the end, only seven CPUID leaves are handled. In this series, no additional CPUID leaves need to be supported for Linux to finish booting; virtualizing just these seven is sufficient.

Finally, let's run the guest with CPUID virtualization enabled:

[INFO ] main    | Entered VMX root operation.
[INFO ] vmx     | Guest memory region: 0x0000000000000000 - 0x0000000006400000
[INFO ] vmx     | Guest kernel code offset: 0x0000000000005000
[DEBUG] ept     | EPT Level4 Table @ FFFF88800000A000
[INFO ] vmx     | Guest memory is mapped: HVA=0xFFFF888000A00000 (size=0x6400000)
[INFO ] main    | Setup guest memory.
[INFO ] main    | Starting the virtual machine...
[ERROR] vcpu    | Unhandled VM-exit: reason=arch.x86.vmx.common.ExitReason.rdmsr

Unlike last time, the VM exit is no longer caused by CPUID. Instead, a VM exit due to RDMSR has occurred. In the next chapter, we will implement handlers for RDMSR and WRMSR and configure the VMCS to properly save and restore the MSRs for both the guest and host.

[1] Exhaustive enums do not allow the use of _, but you can catch all fields not explicitly handled using else.