Virtualizing MSR

This chapter covers MSR virtualization. The VMM can present arbitrary MSR values to the guest and, conversely, intercept and adjust the values the guest attempts to write to MSRs. Additionally, MSR values are properly saved and restored during VM Entry and VM Exit.

important

The source code for this chapter is in the whiz-vmm-msr branch.

VM Exit Handler

When the guest tries to execute RDMSR or WRMSR instructions, a VM Exit may occur. Whether a VM Exit happens is controlled by the MSR Bitmaps in the VMCS Execution Control category. MSR Bitmaps are bitmaps mapped to MSR addresses, and if an RDMSR or WRMSR is executed on an MSR with a bit set to 1, a VM Exit occurs. Operations on MSRs with bits set to 0 do not trigger a VM Exit. Additionally, disabling MSR Bitmaps causes all RDMSR and WRMSR instructions on any MSR to trigger a VM Exit.
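
Although Ymir never enables MSR Bitmaps, the bitmap layout itself is simple. The following sketch (not part of Ymir; the function name is mine and the layout follows the SDM's description of the 4-KiB MSR-bitmap region) shows how one would set the bit that makes RDMSR on a given MSR cause a VM Exit:

zig
/// Sketch: mark `msr` so that RDMSR on it causes a VM Exit.
/// The 4-KiB region holds four 1-KiB bitmaps: reads of MSRs 0x0000_0000-0x0000_1FFF,
/// reads of MSRs 0xC000_0000-0xC000_1FFF, then the two corresponding write bitmaps.
fn setRdmsrExit(bitmap: *[4096]u8, msr: u32) void {
    const base: usize = if (msr < 0x2000) 0 else 1024; // low vs. high MSR range
    const index: usize = msr & 0x1FFF;
    bitmap[base + index / 8] |= @as(u8, 1) << @as(u3, @intCast(index % 8));
}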

In this series, MSR Bitmaps are disabled so that all RDMSR and WRMSR instructions on all MSRs trigger a VM Exit:

ymir/arch/x86/vmx/vcpu.zig
fn setupExecCtrls(vcpu: *Vcpu, _: Allocator) VmxError!void {
    ...
    ppb_exec_ctrl.use_msr_bitmap = false;
    ...
}

RDMSR causes a VM Exit with Exit Reason 31, and WRMSR with 32. Let's add the switch case for each instruction:

ymir/arch/x86/vmx/vcpu.zig
fn handleExit(self: *Self, exit_info: vmx.ExitInfo) VmxError!void {
    switch (exit_info.basic_reason) {
        ...
        .rdmsr => {
            try msr.handleRdmsrExit(self);
            try self.stepNextInst();
        },
        .wrmsr => {
            try msr.handleWrmsrExit(self);
            try self.stepNextInst();
        },
    }
    ...
}

Saving and Restoring MSR

Before implementing the VM Exit handlers for MSR access, let's make sure that MSR values are saved and restored across VM Entry and VM Exit. Currently, apart from a handful of MSRs, the guest and host share the same MSR values.

MSRs Automatically Saved and Restored

The following guest MSRs are automatically loaded from their respective locations during VM Entry:

| MSR | Condition | Source of Load |
| --- | --- | --- |
| IA32_DEBUGCTL | "load debug controls" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_SYSENTER_CS | (Unconditional) | VMCS Guest-State |
| IA32_SYSENTER_ESP | (Unconditional) | VMCS Guest-State |
| IA32_SYSENTER_EIP | (Unconditional) | VMCS Guest-State |
| IA32_FS_BASE | (Unconditional) | FS.Base in VMCS Guest-State |
| IA32_GS_BASE | (Unconditional) | GS.Base in VMCS Guest-State |
| IA32_PERF_GLOBAL_CTRL | "load IA32_PERF_GLOBAL_CTRL" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_PAT | "load IA32_PAT" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_EFER | "load IA32_EFER" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_BNDCFGS | "load IA32_BNDCFGS" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_RTIT_CTL | "load IA32_RTIT_CTL" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_S_CET | "load CET" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_INTERRUPT_SSP_TABLE_ADDR | "load CET" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_LBR_CTRL | "load IA32_LBR_CTRL" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |
| IA32_PKRS | "load PKRS" in VMCS VM-Entry Controls is enabled | VMCS Guest-State |

The following host MSRs are automatically loaded from their respective locations during VM Exit:

| MSR | Condition | Source of Load |
| --- | --- | --- |
| IA32_DEBUGCTL | (Unconditional) | Cleared to 0 |
| IA32_SYSENTER_CS | (Unconditional) | VMCS Host-State |
| IA32_SYSENTER_ESP | (Unconditional) | VMCS Host-State |
| IA32_SYSENTER_EIP | (Unconditional) | VMCS Host-State |
| IA32_FS_BASE | (Unconditional) | FS.Base in VMCS Host-State |
| IA32_GS_BASE | (Unconditional) | GS.Base in VMCS Host-State |
| IA32_PERF_GLOBAL_CTRL | "load IA32_PERF_GLOBAL_CTRL" in VMCS VM-Exit Controls is enabled | VMCS Host-State |
| IA32_PAT | "load IA32_PAT" in VMCS VM-Exit Controls is enabled | VMCS Host-State |
| IA32_EFER | "load IA32_EFER" in VMCS VM-Exit Controls is enabled | VMCS Host-State |
| IA32_BNDCFGS | "clear IA32_BNDCFGS" in VMCS VM-Exit Controls is enabled | Cleared to 0 |
| IA32_RTIT_CTL | "clear IA32_RTIT_CTL" in VMCS VM-Exit Controls is enabled | Cleared to 0 |
| IA32_S_CET | "load CET" in VMCS VM-Exit Controls is enabled | VMCS Host-State |
| IA32_INTERRUPT_SSP_TABLE_ADDR | "load CET" in VMCS VM-Exit Controls is enabled | VMCS Host-State |
| IA32_PKRS | "load PKRS" in VMCS VM-Exit Controls is enabled | VMCS Host-State |

The following guest MSRs are automatically saved to their respective locations during VM Exit:

| MSR | Condition | Destination of Save |
| --- | --- | --- |
| IA32_DEBUGCTL | "save debug controls" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_PAT | "save IA32_PAT" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_EFER | "save IA32_EFER" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_BNDCFGS | "load IA32_BNDCFGS" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_RTIT_CTL | "load IA32_RTIT_CTL" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_S_CET | "load CET" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_INTERRUPT_SSP_TABLE_ADDR | "load CET" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_LBR_CTRL | "load IA32_LBR_CTRL" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_PKRS | "load PKRS" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |
| IA32_PERF_GLOBAL_CTRL | "save IA32_PERF_GLOBAL_CTRL" in VMCS VM-Exit Controls is enabled | VMCS Guest-State |

These MSRs are automatically saved and loaded during VM Entry and VM Exit. Since their values are stored in the VMCS, there is no risk of them being shared between host and guest. Note that some of them require the corresponding VM-Entry/VM-Exit Control bit to be enabled. In Ymir, automatic loading is enabled for the following two MSRs. The other MSRs in the tables above are not used by Ymir, so they do not need to be virtualized (the guest's values simply remain in those MSRs while the host is running):

  • IA32_PAT
  • IA32_EFER
ymir/arch/x86/vmx/vcpu.zig
fn setupExitCtrls(_: *Vcpu) VmxError!void {
    ...
    exit_ctrl.load_ia32_efer = true;
    exit_ctrl.save_ia32_efer = true;
    exit_ctrl.load_ia32_pat = true;
    exit_ctrl.save_ia32_pat = true;
    ...
}

fn setupEntryCtrls(_: *Vcpu) VmxError!void {
    ...
    entry_ctrl.load_ia32_efer = true;
    entry_ctrl.load_ia32_pat = true;
    ...
}

MSR Area

MSRs other than those listed above are not saved or restored during VM Entry/VM Exit unless explicitly configured. In Ymir, the following MSRs will be additionally saved and restored:

  • IA32_TSC_AUX
  • IA32_STAR
  • IA32_LSTAR
  • IA32_CSTAR
  • IA32_FMASK
  • IA32_KERNEL_GS_BASE

MSRs loaded and saved during VM Exit/Entry are stored in an area called the MSR Area. The MSR Area is an array of 128-bit entries called MSR entries. Each entry has the following structure and holds the data of the MSR specified by index:

Format of an MSR Entry. SDM Vol.3C Table 25-15.

There are three categories of MSR Area:

  • VM-Entry MSR-Load Area: Area for loading guest MSRs during VM Entry
  • VM-Exit MSR-Store Area: Area for saving guest MSRs during VM Exit
  • VM-Exit MSR-Load Area: Area for loading host MSRs during VM Exit

There is no area for saving host MSRs during VM Entry. Presumably (and understandably), since the host is virtualization-aware, it is expected to save them manually before VM Entry.

Define a structure representing the MSR Area:

ymir/arch/x86/vmx/msr.zig
pub const ShadowMsr = struct {
    /// Maximum number of MSR entries.
    const max_num_ents = 512;

    /// MSR entries.
    ents: []SavedMsr,
    /// Number of registered MSR entries.
    num_ents: usize = 0,
    /// MSR Entry.
    pub const SavedMsr = packed struct(u128) {
        index: u32,
        reserved: u32 = 0,
        data: u64,
    };

    /// Initialize saved MSR page.
    pub fn init(allocator: Allocator) !ShadowMsr {
        const ents = try allocator.alloc(SavedMsr, max_num_ents);
        @memset(ents, std.mem.zeroes(SavedMsr));

        return ShadowMsr{
            .ents = ents,
        };
    }

    /// Register or update MSR entry.
    pub fn set(self: *ShadowMsr, index: am.Msr, data: u64) void {
        return self.setByIndex(@intFromEnum(index), data);
    }

    /// Register or update MSR entry indexed by `index`.
    pub fn setByIndex(self: *ShadowMsr, index: u32, data: u64) void {
        for (0..self.num_ents) |i| {
            if (self.ents[i].index == index) {
                self.ents[i].data = data;
                return;
            }
        }
        // Reject registration before it would overflow the area.
        if (self.num_ents >= max_num_ents) {
            @panic("Too many MSR entries registered.");
        }
        self.ents[self.num_ents] = SavedMsr{ .index = index, .data = data };
        self.num_ents += 1;
    }

    /// Get the saved MSRs.
    pub fn savedEnts(self: *ShadowMsr) []SavedMsr {
        return self.ents[0..self.num_ents];
    }

    /// Find the saved MSR entry.
    pub fn find(self: *ShadowMsr, index: am.Msr) ?*SavedMsr {
        const index_num = @intFromEnum(index);
        for (0..self.num_ents) |i| {
            if (self.ents[i].index == index_num) {
                return &self.ents[i];
            }
        }
        return null;
    }

    /// Get the host physical address of the MSR page.
    pub fn phys(self: *ShadowMsr) u64 {
        return mem.virt2phys(self.ents.ptr);
    }
};

ShadowMsr holds an array of MSR entries and provides an API to manipulate the registered MSRs. Of the three MSR Areas, member variables representing the host area (Load) and the guest area (Store + Load) are added to Vcpu:

ymir/arch/x86/vmx/vcpu.zig
pub const Vcpu = struct {
    host_msr: msr.ShadowMsr = undefined,
    guest_msr: msr.ShadowMsr = undefined,
    ...
}

During VMCS initialization (setupVmcs()), initialize the guest and host MSR Areas. The physical addresses of the MSR Areas are set in the VM-Exit Controls and VM-Entry Controls as MSR-load address and MSR-store address. The number of MSRs registered in the MSR Areas is set in MSR-load count and MSR-store count. Only the first count entries of the registered MSR Area are loaded and saved during VM Exit and VM Entry. The host MSRs are registered with their current values as is. The guest MSRs are initialized to all zeros:

ymir/arch/x86/vmx/vcpu.zig
fn registerMsrs(vcpu: *Vcpu, allocator: Allocator) !void {
    vcpu.host_msr = try msr.ShadowMsr.init(allocator);
    vcpu.guest_msr = try msr.ShadowMsr.init(allocator);

    const hm = &vcpu.host_msr;
    const gm = &vcpu.guest_msr;

    // Host MSRs.
    hm.set(.tsc_aux, am.readMsr(.tsc_aux));
    hm.set(.star, am.readMsr(.star));
    hm.set(.lstar, am.readMsr(.lstar));
    hm.set(.cstar, am.readMsr(.cstar));
    hm.set(.fmask, am.readMsr(.fmask));
    hm.set(.kernel_gs_base, am.readMsr(.kernel_gs_base));

    // Guest MSRs.
    gm.set(.tsc_aux, 0);
    gm.set(.star, 0);
    gm.set(.lstar, 0);
    gm.set(.cstar, 0);
    gm.set(.fmask, 0);
    gm.set(.kernel_gs_base, 0);

    // Init MSR data in VMCS.
    try vmwrite(vmcs.ctrl.exit_msr_load_address, hm.phys());
    try vmwrite(vmcs.ctrl.exit_msr_store_address, gm.phys());
    try vmwrite(vmcs.ctrl.entry_msr_load_address, gm.phys());
}

The VM-Exit MSR-Load Area (the area whose values are loaded into the host's MSRs during VM Exit) must be updated before every VM Entry; otherwise, the values captured at initialization would be used forever. Call the following function at the beginning of each iteration of the VM Entry loop, loop():

ymir/arch/x86/vmx/vcpu.zig
fn updateMsrs(vcpu: *Vcpu) VmxError!void {
    // Save host MSRs.
    for (vcpu.host_msr.savedEnts()) |ent| {
        vcpu.host_msr.setByIndex(ent.index, am.readMsr(@enumFromInt(ent.index)));
    }
    // Update MSR counts.
    try vmwrite(vmcs.ctrl.exit_msr_load_count, vcpu.host_msr.num_ents);
    try vmwrite(vmcs.ctrl.exit_msr_store_count, vcpu.guest_msr.num_ents);
    try vmwrite(vmcs.ctrl.entry_msr_load_count, vcpu.guest_msr.num_ents);
}

In this series, the number of MSRs registered in the MSR Area does not change. As will be covered later, if the guest attempts a WRMSR on an MSR not registered in the MSR Area, it will cause an abort. Therefore, there is no need to update the MSR counts in practice. This design is intended to accommodate potential future support for dynamically adding MSRs to the MSR Area.
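
If such support were added, the registration side could be as small as the following sketch (a hypothetical helper, not part of Ymir); updateMsrs() would then publish the grown count to the VMCS before the next VM Entry:

zig
/// Hypothetical: register an additional MSR for the guest at runtime.
/// Because updateMsrs() rewrites the MSR counts on every iteration of loop(),
/// the new entry is saved and loaded starting from the next VM Entry.
fn addGuestMsr(vcpu: *Vcpu, index: u32, initial_value: u64) void {
    vcpu.guest_msr.setByIndex(index, initial_value);
}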

With this, the setup of MSRs registered in the MSR Area and those automatically saved and restored is complete. The remaining task is to implement the handling of guest RDMSR and WRMSR instructions to read and write the values of MSRs registered in the MSR Area.

RDMSR Handler

Implement the handler for RDMSR. First, prepare a helper function to store the RDMSR result into the guest registers. The upper 32 bits of the result go into RDX and the lower 32 bits into RAX. There are two patterns for presenting an MSR value to the guest:

  • Return a value stored in the VMCS: for MSRs that are automatically saved and loaded
  • Return a value registered in the MSR Area: for all other MSRs

Use setRetVal() for the former, and shadowRead() for the latter:

ymir/arch/x86/vmx/msr.zig
/// Concatenate two 32-bit values into a 64-bit value.
fn concat(r1: u64, r2: u64) u64 {
    return ((r1 & 0xFFFF_FFFF) << 32) | (r2 & 0xFFFF_FFFF);
}

/// Set the 64-bit return value to the guest registers.
fn setRetVal(vcpu: *Vcpu, val: u64) void {
    const regs = &vcpu.guest_regs;
    @as(*u32, @ptrCast(&regs.rdx)).* = @as(u32, @truncate(val >> 32));
    @as(*u32, @ptrCast(&regs.rax)).* = @as(u32, @truncate(val));
}

/// Read from the MSR Area.
fn shadowRead(vcpu: *Vcpu, msr_kind: am.Msr) void {
    if (vcpu.guest_msr.find(msr_kind)) |msr| {
        setRetVal(vcpu, msr.data);
    } else {
        log.err("RDMSR: MSR is not registered: {s}", .{@tagName(msr_kind)});
        vcpu.abort();
    }
}
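
As a quick sanity check of the EDX:EAX convention, a test along these lines could sit next to concat() (a sketch; std is already imported in msr.zig):

zig
test "concat joins EDX:EAX into a 64-bit MSR value" {
    // The low 32 bits come from RAX, the high 32 bits from RDX.
    try std.testing.expectEqual(@as(u64, 0x0000_0001_8000_0000), concat(0x1, 0x8000_0000));
    // The upper halves of the registers are ignored.
    try std.testing.expectEqual(
        @as(u64, 0xDEAD_BEEF_CAFE_BABE),
        concat(0xAAAA_AAAA_DEAD_BEEF, 0xBBBB_BBBB_CAFE_BABE),
    );
}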

With this, let's implement the RDMSR handler:

ymir/arch/x86/vmx/msr.zig
pub fn handleRdmsrExit(vcpu: *Vcpu) VmxError!void {
    const guest_regs = &vcpu.guest_regs;
    const msr_kind: am.Msr = @enumFromInt(guest_regs.rcx);

    switch (msr_kind) {
        .apic_base => setRetVal(vcpu, std.math.maxInt(u64)), // Disabled.
        .efer => setRetVal(vcpu, try vmx.vmread(vmcs.guest.efer)),
        .fs_base => setRetVal(vcpu, try vmx.vmread(vmcs.guest.fs_base)),
        .gs_base => setRetVal(vcpu, try vmx.vmread(vmcs.guest.gs_base)),
        .kernel_gs_base => shadowRead(vcpu, msr_kind),
        else => {
            log.err("Unhandled RDMSR: {?}", .{msr_kind});
            vcpu.abort();
        },
    }
}

RDMSR on an unsupported MSR (else) causes an abort. The set of supported MSRs has been determined empirically. Initially, a switch with only else was used to run the guest, and MSRs were added one by one until Linux successfully booted. It's surprising how few MSRs are actually required for Linux to run. Speaking of surprises, as I write this section, it's November—the season when chestnuts are in peak flavor.

WRMSR Handler

Let's create a helper function as we did for RDMSR:

ymir/arch/x86/vmx/msr.zig
fn shadowWrite(vcpu: *Vcpu, msr_kind: am.Msr) void {
    const regs = &vcpu.guest_regs;
    if (vcpu.guest_msr.find(msr_kind)) |_| {
        vcpu.guest_msr.set(msr_kind, concat(regs.rdx, regs.rax));
    } else {
        log.err("WRMSR: MSR is not registered: {s}", .{@tagName(msr_kind)});
        vcpu.abort();
    }
}

Here's a WRMSR handler:

ymir/arch/x86/vmx/msr.zig
pub fn handleWrmsrExit(vcpu: *Vcpu) VmxError!void {
    const regs = &vcpu.guest_regs;
    const value = concat(regs.rdx, regs.rax);
    const msr_kind: am.Msr = @enumFromInt(regs.rcx);

    switch (msr_kind) {
        .star,
        .lstar,
        .cstar,
        .tsc_aux,
        .fmask,
        .kernel_gs_base,
        => shadowWrite(vcpu, msr_kind),
        .sysenter_cs => try vmx.vmwrite(vmcs.guest.sysenter_cs, value),
        .sysenter_eip => try vmx.vmwrite(vmcs.guest.sysenter_eip, value),
        .sysenter_esp => try vmx.vmwrite(vmcs.guest.sysenter_esp, value),
        .efer => try vmx.vmwrite(vmcs.guest.efer, value),
        .gs_base => try vmx.vmwrite(vmcs.guest.gs_base, value),
        .fs_base => try vmx.vmwrite(vmcs.guest.fs_base, value),
        else => {
            log.err("Unhandled WRMSR: {?}", .{msr_kind});
            vcpu.abort();
        },
    }
}

There are more MSRs that need to be supported for WRMSR compared to RDMSR. This makes sense—MSRs like STAR, LSTAR, and CSTAR (used as syscall entry points) are typically written to but not read.

Summary

In this chapter, we configured the MSR Area to ensure that guest and host MSRs are properly saved and restored during VM Entry and VM Exit. This separates the MSR space between the host and guest. We also implemented RDMSR and WRMSR handlers to read and write values registered in the VMCS or MSR Area. With this, MSR virtualization is complete.

As has become customary, let's run the guest at the end:

txt
[INFO ] main    | Entered VMX root operation.
[INFO ] vmx     | Guest memory region: 0x0000000000000000 - 0x0000000006400000
[INFO ] vmx     | Guest kernel code offset: 0x0000000000005000
[DEBUG] ept     | EPT Level4 Table @ FFFF88800000E000
[INFO ] vmx     | Guest memory is mapped: HVA=0xFFFF888000A00000 (size=0x6400000)
[INFO ] main    | Setup guest memory.
[INFO ] main    | Starting the virtual machine...
No EFI environment detected.
early console in extract_kernel
input_data: 0x0000000002d582b9
input_len: 0x0000000000c7032c
output: 0x0000000001000000
output_len: 0x000000000297e75c
kernel_total_size: 0x0000000002630000
needed_size: 0x0000000002a00000
trampoline_32bit: 0x0000000000000000


KASLR disabled: 'nokaslr' on cmdline.


Decompressing Linux... Parsing ELF... No relocation needed... done.
Booting the kernel (entry_offset: 0x0000000000000000).
[ERROR] vcpu    | Unhandled VM-exit: reason=arch.x86.vmx.common.ExitReason.triple_fault
[ERROR] vcpu    | === vCPU Information ===
[ERROR] vcpu    | [Guest State]
[ERROR] vcpu    | RIP: 0xFFFFFFFF8102E0B9
[ERROR] vcpu    | RSP: 0x0000000002A03F58
[ERROR] vcpu    | RAX: 0x00000000032C8000
[ERROR] vcpu    | RBX: 0x0000000000000800
[ERROR] vcpu    | RCX: 0x0000000000000030
[ERROR] vcpu    | RDX: 0x0000000000001060
[ERROR] vcpu    | RSI: 0x00000000000001E3
[ERROR] vcpu    | RDI: 0x000000000000001C
[ERROR] vcpu    | RBP: 0x0000000001000000
[ERROR] vcpu    | R8 : 0x000000000000001C
[ERROR] vcpu    | R9 : 0x0000000000000008
[ERROR] vcpu    | R10: 0x00000000032CB000
[ERROR] vcpu    | R11: 0x000000000000001B
[ERROR] vcpu    | R12: 0x0000000000000000
[ERROR] vcpu    | R13: 0x0000000000000000
[ERROR] vcpu    | R14: 0x0000000000000000
[ERROR] vcpu    | R15: 0x0000000000010000
[ERROR] vcpu    | CR0: 0x0000000080050033
[ERROR] vcpu    | CR3: 0x00000000032C8000
[ERROR] vcpu    | CR4: 0x0000000000002020
[ERROR] vcpu    | EFER:0x0000000000000500
[ERROR] vcpu    | CS : 0x0010 0x0000000000000000 0xFFFFFFFF

Incredible! The guest has finally started printing logs! Although the main kernel hasn't booted yet, logs are being printed because we passed earlyprintk=serial on the command line back in the Linux Boot Protocol chapter1. As you can see from the message 'nokaslr' on cmdline, the command line specified in BootParams is correctly passed to the guest.

The message Decompressing Linux... is printed from extract_kernel(). This function is called from relocated() in head_64.S. It decompresses the compressed kernel, places it in memory, and prepares to transfer control. The decompressed kernel is placed at the address specified by BootParams (0x100_0000, matching output: in the log above). Immediately after extract_kernel() returns, control jumps to this address, transferring execution to startup_64() in the other head_64.S (not the one under compressed/).

The triple fault that eventually occurs is triggered when attempting to set the PSE bit in CR4:

arch/x86/kernel/head_64.S
ffffffff8102e0a8 <common_startup_64>:
ffffffff8102e0a8:       ba 20 10 00 00          mov    edx,0x1020
ffffffff8102e0ad:       83 ca 40                or     edx,0x40
ffffffff8102e0b0:       0f 20 e1                mov    rcx,cr4
ffffffff8102e0b3:       21 d1                   and    ecx,edx
ffffffff8102e0b5:       0f ba e9 04             bts    ecx,0x4
ffffffff8102e0b9:       0f 22 e1                mov    cr4,rcx
ffffffff8102e0bc:       0f ba e9 07             bts    ecx,0x7

This MOV to CR4 clears the CR4.VMXE bit. If a MOV to CR4 does not trigger a VM Exit but results in a value that violates the constraints defined by IA32_VMX_CR4_FIXED0 or IA32_VMX_CR4_FIXED1, the guest will receive a #GP exception (not a VM Exit)2. Since the guest has not yet set up an interrupt handler, this #GP leads directly to a triple fault. So next time, we'll implement proper handling of CR accesses made by the guest.
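
For reference, the constraint itself is easy to express. A minimal sketch (the Msr enum member names vmx_cr4_fixed0 and vmx_cr4_fixed1 are assumed here and may differ in Ymir):

zig
/// Sketch: a CR4 value is legal in VMX operation only if every bit set in
/// IA32_VMX_CR4_FIXED0 is 1 and every bit cleared in IA32_VMX_CR4_FIXED1 is 0.
fn isCr4ValueAllowed(cr4: u64) bool {
    const fixed0 = am.readMsr(.vmx_cr4_fixed0); // bits that must be 1
    const fixed1 = am.readMsr(.vmx_cr4_fixed1); // bits that may be 1
    return (cr4 & fixed0) == fixed0 and (cr4 & ~fixed1) == 0;
}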

1

Since serial console virtualization is not yet implemented, the guest is accessing the serial port directly. For now, let's allow this behavior.

2

SDM Vol.3C 26.3 CHANGES TO INSTRUCTION BEHAVIOR IN VMX NON-ROOT OPERATION