Virtualizing Control Registers

In this chapter, we will virtualize access to control registers. Control registers are critical for managing CPU behavior and must be handled properly when running a guest. We'll cover both reading from and writing to CRs, ensuring consistency with other VMCS fields throughout the handling process.

Important: The source code for this chapter is in the whiz-vmm-cr branch.

CR Read Shadows / Masks
Exit Qualification
MOV from CR Handler
- MOV from CR0 / CR4
- MOV from CR3
MOV to CR Handler
- MOV to CR0 / CR4
  - Pass-through
  - IA-32e Mode
- MOV to CR3
  - PCID
  - Combined Mappings
Enabling INVPCID
Summary

CR Read Shadows / Masks

Whether access to CR0 and CR4 triggers a VM Exit depends on the settings in the VM-Execution Controls of the VMCS. Control over specific bits in a CR—whether it belongs to the guest or the host—is determined by the Guest/Host Masks. These masks are bitfields with the same size as the CR registers, which is 64 bits on Intel 64 architecture. When a bit in the Guest/Host Mask is set to 1, the corresponding bit in the CR is considered "owned by the host". Conversely, if the bit is 0, the corresponding bit is "owned by the guest".

When the guest performs a read from a CR, if the bit is owned by the guest, the value is read directly from the CR. Similarly, when the guest writes to a CR, if the bit is owned by the guest, the value is written directly to the CR.

Access to bits owned by the host is managed through the Read Shadows (the CR0 and CR4 read shadow fields of the VMCS). These are also 64-bit bitfields on Intel 64, with each bit corresponding to a bit in the CR. When the guest reads from a CR and the bit is owned by the host, the value is read from the corresponding bit in the Read Shadows. When the guest writes to a CR, a VM Exit occurs if it attempts to write a value to a host-owned bit that differs from the corresponding bit in the Read Shadows.

mermaid

---
title: Read/Write CR's N-th bit
---
flowchart TD
    A(CR Read) --> B{Masked by Guest/Host Masks?}
    B -- Yes --> C(N-th bit of Read Shadows)
    B -- No --> D(N-th bit of CR)

    E(CR Write) --> F{Masked by Guest/Host Masks?}
    F -- Yes --> G{Differs from Read Shadows?}
        G -- Yes --> H(VM Exit)
        G -- No --> I(Nothing happens)
    F -- No --> J(Write to CR)
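
The per-bit rules in the diagram can be expressed as two small bitwise formulas. The following is only a sketch to make the rules concrete: the helper names guestVisibleCr() and writeCausesExit() and the sample register values are assumptions for illustration and are not part of Ymir.

```zig
const std = @import("std");

/// Value the guest observes when reading CR0/CR4:
/// host-owned bits (mask = 1) come from the read shadow,
/// guest-owned bits (mask = 0) come from the real register.
fn guestVisibleCr(cr: u64, mask: u64, shadow: u64) u64 {
    return (cr & ~mask) | (shadow & mask);
}

/// A MOV to CR0/CR4 causes a VM Exit when any host-owned bit of the
/// written value differs from the corresponding read-shadow bit.
fn writeCausesExit(value: u64, mask: u64, shadow: u64) bool {
    return ((value ^ shadow) & mask) != 0;
}

pub fn main() void {
    const all_host: u64 = std.math.maxInt(u64); // every bit owned by the host
    const shadow: u64 = 0x11; // hypothetical read shadow
    const real_cr0: u64 = 0x8000_0031; // hypothetical real CR0

    // With every bit host-owned, the guest only ever sees the shadow.
    std.debug.print("read  -> 0x{x}\n", .{guestVisibleCr(real_cr0, all_host, shadow)});
    // Writing back the shadow value is silent; changing any bit exits.
    std.debug.print("exit? -> {}\n", .{writeCausesExit(shadow, all_host, shadow)});
    std.debug.print("exit? -> {}\n", .{writeCausesExit(shadow | 0x8000_0000, all_host, shadow)});
}
```

With a partial mask, only the masked bits take part in the comparison, so a guest can freely change the bits it owns without exiting.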

In Ymir, all bits in the Guest/Host Masks are set. This means that every bit in CR0 and CR4 is owned by the host.

ymir/arch/x86/vmx/vcpu.zig

fn setupExecCtrls(vcpu: *Vcpu, allocator: Allocator) VmxError!void {
    ...
    try vmwrite(vmcs.ctrl.cr0_mask, std.math.maxInt(u64));
    try vmwrite(vmcs.ctrl.cr4_mask, std.math.maxInt(u64));
    ...
}

As shown in the previous diagram, reads from bits that are set in the masks return the value from the Read Shadows. Since we are setting all bits in the masks, reads from CRs will always return the values in the Read Shadows. Therefore, the Read Shadows must be kept in sync with the actual values of the CRs. We'll implement the logic to maintain this synchronization in the VM Exit handler later, but for now, we'll initialize the Read Shadows with the current values of CR0 and CR4:

ymir/arch/x86/vmx/vcpu.zig

fn setupGuestState(vcpu: *Vcpu) VmxError!void {
    ...
    try vmwrite(vmcs.ctrl.cr0_read_shadow, cr0);
    try vmwrite(vmcs.ctrl.cr4_read_shadow, cr4);
    ...
}

Exit Qualification

Since all bits in the Guest/Host Masks are set, any attempt by the guest to write a value to CR0 or CR4 that differs from the corresponding Read Shadows will cause a VM Exit. In such cases, additional information about which register was accessed and which operand was used is stored in the Exit Qualification field of the VMCS. Note that Exit Qualification is not provided for all VM Exits, only for specific ones, including CR access. Other exit reasons that provide a qualification field include I/O access and EPT violations.

Exit Qualification is a 64-bit value whose meaning depends on the Exit Reason. For CR access exits, the Exit Qualification has the following format:

Exit Qualification for Control-Register Accesses. SDM Vol.3C Table 28-3.

Since Exit Qualification will be extended to support reasons other than CR access in the future, we'll define a dedicated qual namespace to organize them. As the first entry, we'll add Exit Qualification for CR access:

ymir/arch/x86/vmx/common.zig

pub const qual = struct {
    pub const QualCr = packed struct(u64) {
        index: u4,
        access_type: AccessType,
        lmsw_type: LmswOperandType,
        _reserved1: u1,
        reg: Register,
        _reserved2: u4,
        lmsw_source: u16,
        _reserved3: u32,

        const AccessType = enum(u2) {
            mov_to = 0,
            mov_from = 1,
            clts = 2,
            lmsw = 3,
        };
        const LmswOperandType = enum(u1) {
            reg = 0,
            mem = 1,
        };
        const Register = enum(u4) {
            rax = 0,
            rcx = 1,
            rdx = 2,
            rbx = 3,
            rsp = 4,
            rbp = 5,
            rsi = 6,
            rdi = 7,
            r8 = 8,
            r9 = 9,
            r10 = 10,
            r11 = 11,
            r12 = 12,
            r13 = 13,
            r14 = 14,
            r15 = 15,
        };
    };
};

VM Exits caused by CR access can occur not only due to MOV (from/to) instructions but also in two other ways:

CLTS: Clears TS flag in CR0
LMSW: Load into [15:0] of CR0

However, these instructions are not used by Linux during boot. Therefore, Ymir does not support VM Exits caused by these two instructions. Since implementing support for them is straightforward, feel free to try it yourself if you're interested.

Let's also provide a function to retrieve the Exit Qualification. Since its contents vary depending on the Exit Reason, the caller is responsible for deciding how to interpret the retrieved value:

ymir/arch/x86/vmx/vcpu.zig

fn getExitQual(comptime T: type) VmxError!T {
    // The caller picks the layout; T must be a 64-bit packed struct such as qual.QualCr.
    return @bitCast(@as(u64, try vmread(vmcs.ro.exit_qual)));
}
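
To see what the @bitCast does with a concrete value, here is a minimal, standalone sketch. It uses a reduced copy of the QualCr layout (the register field is kept as a plain u4), and the raw value 0x304 is a made-up example rather than something captured from a real VM Exit.

```zig
const std = @import("std");

// Reduced copy of the layout above, for illustration only.
const AccessType = enum(u2) { mov_to = 0, mov_from = 1, clts = 2, lmsw = 3 };

const QualCr = packed struct(u64) {
    index: u4, // bits 3:0, control register number
    access_type: AccessType, // bits 5:4
    lmsw_type: u1, // bit 6
    _reserved1: u1, // bit 7
    reg: u4, // bits 11:8, general-purpose register number
    _reserved2: u4, // bits 15:12
    lmsw_source: u16, // bits 31:16
    _reserved3: u32, // bits 63:32
};

/// Reinterpret the raw 64-bit field as the packed struct.
/// The first struct field occupies the least significant bits.
fn decode(raw: u64) QualCr {
    return @bitCast(raw);
}

pub fn main() void {
    // Hypothetical raw value: MOV to CR4 from RBX (register number 3).
    const q = decode(0x304);
    std.debug.print("cr{d} {s} reg={d}\n", .{ q.index, @tagName(q.access_type), q.reg });
}
```

Because the packed struct is exactly 64 bits wide, no shifting or masking code is needed; the field declarations themselves document the bit positions.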

MOV from CR Handler

Add a handler for CR access in handleExit(). There is only a single Exit Reason for both MOV to and MOV from instructions. To determine whether it is MOV to or from, you need to check the Exit Qualification:

ymir/arch/x86/vmx/vcpu.zig

fn handleExit(self: *Self, exit_info: vmx.ExitInfo) VmxError!void {
    switch (exit_info.basic_reason) {
        .cr => {
            const q = try getExitQual(qual.QualCr);
            try cr.handleAccessCr(self, q);
            try self.stepNextInst();
        },
        ...
    }
}

When a VM Exit caused by CR access occurs, use the previously defined function to retrieve the Exit Qualification and pass it to the dedicated handler cr.handleAccessCr(). This function will be implemented in cr.zig:

ymir/arch/x86/vmx/cr.zig

pub fn handleAccessCr(vcpu: *Vcpu, qual: QualCr) VmxError!void {
    switch (qual.access_type) {
        .mov_to => ...
        .mov_from => ...
        else => {
            log.err("Unimplemented CR access: {?}", .{qual});
            vcpu.abort();
        },
    }
}

Determine whether the operation is MOV to or MOV from by checking .access_type in the Exit Qualification, and implement the corresponding handling for each case. For other types, such as CLTS or LMSW, which are not supported, output an error log and abort.

MOV from CR0 / CR4

Read access to CR0 and CR4 does not cause a VM Exit. Since all bits in the masks are set, reads from CR0 and CR4 always return the values stored in the Read Shadows.

MOV from CR3

Reads from CR3 will expose the actual value of CR3 directly to the guest. To facilitate passing CR values through to the guest, we'll prepare a helper function:

ymir/arch/x86/vmx/cr.zig

fn passthroughRead(vcpu: *Vcpu, qual: QualCr) VmxError!void {
    const value = switch (qual.index) {
        3 => try vmx.vmread(vmcs.guest.cr3),
        else => vcpu.abort(),
    };
    try setValue(vcpu, qual, value);
}

The .index field in the Exit Qualification contains the number of the CR being read. Since read exits only occur for CR3 here, we handle only the case where .index is 3. The value of CR3 is stored in the VMCS Guest-State, so we'll use setValue() to set this value into the guest register. setValue() is a helper function that sets a value to the guest register specified by the Exit Qualification. Its implementation, together with its counterpart getValue(), is as follows:

Implementation of setValue() and getValue()

ymir/arch/x86/vmx/cr.zig

fn setValue(vcpu: *Vcpu, qual: QualCr, value: u64) VmxError!void {
    const gregs = &vcpu.guest_regs;
    switch (qual.reg) {
        .rax => gregs.rax = value,
        .rcx => gregs.rcx = value,
        .rdx => gregs.rdx = value,
        .rbx => gregs.rbx = value,
        .rbp => gregs.rbp = value,
        .rsi => gregs.rsi = value,
        .rdi => gregs.rdi = value,
        .r8 => gregs.r8 = value,
        .r9 => gregs.r9 = value,
        .r10 => gregs.r10 = value,
        .r11 => gregs.r11 = value,
        .r12 => gregs.r12 = value,
        .r13 => gregs.r13 = value,
        .r14 => gregs.r14 = value,
        .r15 => gregs.r15 = value,
        .rsp => try vmx.vmwrite(vmcs.guest.rsp, value),
    }
}

fn getValue(vcpu: *Vcpu, qual: QualCr) VmxError!u64 {
    const gregs = &vcpu.guest_regs;
    return switch (qual.reg) {
        .rax => gregs.rax,
        .rcx => gregs.rcx,
        .rdx => gregs.rdx,
        .rbx => gregs.rbx,
        .rbp => gregs.rbp,
        .rsi => gregs.rsi,
        .rdi => gregs.rdi,
        .r8 => gregs.r8,
        .r9 => gregs.r9,
        .r10 => gregs.r10,
        .r11 => gregs.r11,
        .r12 => gregs.r12,
        .r13 => gregs.r13,
        .r14 => gregs.r14,
        .r15 => gregs.r15,
        .rsp => try vmx.vmread(vmcs.guest.rsp),
    };
}

Let's call the read handler in handleAccessCr():

ymir/arch/x86/vmx/cr.zig

switch (qual.access_type) {
    .mov_from => try passthroughRead(vcpu, qual),
    ...
}

MOV to CR Handler

MOV to CR0 / CR4

Pass-through

Writes to CR0 and CR4 are generally passed through directly to the CRs. The qualifier "generally" is needed because certain bits in CR0 and CR4 have fixed allowed values during VMX operation. Recall that in the VMX Root Operation chapter, when setting the host's CR0/CR4 values, we adjusted them using the IA32_VMX_CR{0,4}_FIXED0 and IA32_VMX_CR{0,4}_FIXED1 MSRs. Bits set to 1 in FIXED0 must always be set to 1. Conversely, bits cleared to 0 in FIXED1 must always be cleared to 0. This rule also applies to the guest's CRs. When the guest writes to CR0/CR4, you must check these MSRs and adjust the written value accordingly. We'll prepare a helper function to handle passing through and adjusting the CR values:

ymir/arch/x86/vmx/cr.zig

fn passthroughWrite(vcpu: *Vcpu, qual: QualCr) VmxError!void {
    const value = try getValue(vcpu, qual);
    switch (qual.index) {
        0 => {
            try vmx.vmwrite(vmcs.guest.cr0, adjustCr0(value));
            try vmx.vmwrite(vmcs.ctrl.cr0_read_shadow, value);
        },
        4 => {
            try vmx.vmwrite(vmcs.guest.cr4, adjustCr4(value));
            try vmx.vmwrite(vmcs.ctrl.cr4_read_shadow, value);
        },
        else => vcpu.abort(),
    }
}

adjustCr0() and adjustCr4() are helper functions that adjust the values of CR0 and CR4 respectively according to the MSRs. The implementation is the same as before; it is repeated here for review:

Implementation of adjustCr0() and adjustCr4()

ymir/arch/x86/vmx/cr.zig

fn adjustCr0(value: u64) u64 {
    var ret: u64 = @bitCast(value);
    const vmx_cr0_fixed0: u32 = @truncate(am.readMsr(.vmx_cr0_fixed0));
    const vmx_cr0_fixed1: u32 = @truncate(am.readMsr(.vmx_cr0_fixed1));

    ret |= vmx_cr0_fixed0;
    ret &= vmx_cr0_fixed1;

    return ret;
}

fn adjustCr4(value: u64) u64 {
    var ret: u64 = @bitCast(value);
    const vmx_cr4_fixed0: u32 = @truncate(am.readMsr(.vmx_cr4_fixed0));
    const vmx_cr4_fixed1: u32 = @truncate(am.readMsr(.vmx_cr4_fixed1));

    ret |= vmx_cr4_fixed0;
    ret &= vmx_cr4_fixed1;

    return ret;
}

Values are written to the Read Shadows because their values must always stay in sync with the actual CR values. In Ymir, since all bits in the Guest/Host Masks are set, reading from a CR effectively means reading from the Read Shadows. To ensure the guest sees the correct value, the Read Shadows must be updated accordingly.
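
As a worked example of the adjustment and of why the two VMCS fields end up holding different values, consider the sketch below. The helper adjust() and the FIXED0/FIXED1 constants are assumptions chosen for illustration; the real values differ between CPUs and have to be read from the MSRs.

```zig
const std = @import("std");

/// Force the bits required by FIXED0 to 1 and the bits forbidden by FIXED1 to 0.
fn adjust(value: u64, fixed0: u64, fixed1: u64) u64 {
    return (value | fixed0) & fixed1;
}

pub fn main() void {
    // Hypothetical MSR contents; real values must be read from the CPU.
    const fixed0: u64 = 0x8000_0021; // PE, NE and PG must be 1
    const fixed1: u64 = 0xFFFF_FFFF; // every bit in [31:0] may be 1
    const requested: u64 = 0x11; // guest asks for PE | ET only

    const actual = adjust(requested, fixed0, fixed1);
    // `actual` would go to the Guest-State CR0, `requested` to the Read Shadow.
    std.debug.print("guest cr0 = 0x{x}, read shadow = 0x{x}\n", .{ actual, requested });
}
```

The guest later reads back exactly what it wrote, because the Read Shadow keeps the requested value, while the hardware runs with the adjusted one.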

IA-32e Mode

There is one more thing to be careful about when updating CR0/CR4: the state of IA-32e mode. IA-32e mode has two sub-modes: 64-bit mode and compatibility mode (32-bit mode). Outside VMX operation, IA-32e mode becomes active when all of the following conditions are met (in practice, they must be established in this order)¹:

CR4.PAE is 1
IA32_EFER.LME is 1
CR0.PG is 1
IA32_EFER.LMA is 1

However, in VMX Non-Root Operation, whether the guest is in IA-32e mode is determined by the VM-Entry Controls setting (.ia32e_mode_guest) at VM Entry (it can be changed during VMX Non-Root Operation). Conversely, the register settings mentioned above must be consistent with the VM-Entry Controls settings. If there is any mismatch between them, VM Entry will fail with an invalid guest error.

When writing to CR0/CR4, you need to determine whether IA-32e mode is enabled based on the CR0/CR4 values, and update both the EFER register and the VM-Entry Controls settings accordingly:

ymir/arch/x86/vmx/cr.zig

fn updateIa32e(vcpu: *Vcpu) VmxError!void {
    const cr0: am.Cr0 = @bitCast(try vmx.vmread(vmcs.guest.cr0));
    const cr4: am.Cr4 = @bitCast(try vmx.vmread(vmcs.guest.cr4));
    const ia32e_enabled = cr0.pg and cr4.pae;

    vcpu.ia32e_enabled = ia32e_enabled;

    var entry_ctrl = try vmcs.EntryCtrl.store();
    entry_ctrl.ia32e_mode_guest = ia32e_enabled;
    try entry_ctrl.load();

    var efer: am.Efer = @bitCast(try vmx.vmread(vmcs.guest.efer));
    efer.lma = vcpu.ia32e_enabled;
    efer.lme = if (cr0.pg) efer.lma else efer.lme;
    try vmx.vmwrite(vmcs.guest.efer, efer);
}
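
The decision made by updateIa32e() can be restated as a pure function over a few booleans, which makes the individual cases easier to check. This is only a sketch: EferBits and nextEfer() are hypothetical names, and the VMCS accesses are left out.

```zig
const std = @import("std");

const EferBits = struct { lme: bool, lma: bool };

/// Pure restatement of the decision made in updateIa32e():
/// IA-32e mode is treated as active when both CR0.PG and CR4.PAE are set.
fn nextEfer(pg: bool, pae: bool, old: EferBits) EferBits {
    const ia32e = pg and pae;
    return EferBits{
        .lma = ia32e,
        // With paging on, LME follows LMA; with paging off, it is left alone.
        .lme = if (pg) ia32e else old.lme,
    };
}

pub fn main() void {
    const before = EferBits{ .lme = true, .lma = false };
    const after = nextEfer(true, true, before);
    std.debug.print("lme={} lma={}\n", .{ after.lme, after.lma });
}
```

The same boolean that ends up in LMA is the one written to .ia32e_mode_guest in the VM-Entry Controls, which keeps the two settings consistent at the next VM Entry.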

In handleAccessCr(), perform the passthrough and update the IA-32e mode accordingly:

ymir/arch/x86/vmx/cr.zig

switch (qual.access_type) {
    .mov_to => {
        switch (qual.index) {
            0, 4 => {
                try passthroughWrite(vcpu, qual);
                try updateIa32e(vcpu);
            },
            ...
        }
    },
    ...
}

MOV to CR3

Writes to CR3 are also generally passed through directly to the actual CR3. However, there are two caveats associated with this "generally":

PCID

PCID (Process-Context Identifier) is a feature that assigns an ID to CR3 to distinguish TLB entries. TLB entries differentiated by PCID can be selectively flushed using INVPCID, which targets entries with a specific PCID. In Ymir, PCID usage is permitted for guests based on the values exposed via CPUID. PCID becomes effective when CR4.PCIDE is set to 1.

When PCID is enabled, the most significant bit of CR3 (the 63rd bit) has a special meaning². If PCID is disabled, any MOV to CR3 flushes all TLB entries. When PCID is enabled, if CR3[63] is 0, all TLB entries associated with the new CR3’s PCID are flushed³. If CR3[63] is 1, no TLB entries are flushed⁴.

However, bit 63 is never actually stored in CR3. Even outside VMX operation, it only selects the flush behavior of the MOV and is cleared to 0 when the value is written to CR3. The same holds for the guest: the CR3 field in the Guest-State must always have bit 63 cleared, otherwise VM Entry fails with an invalid guest error. Therefore, if the guest attempts to write a value with the 63rd bit set, it must be cleared before writing:

ymir/arch/x86/vmx/cr.zig

switch (qual.access_type) {
    .mov_to => {
        switch (qual.index) {
            3 => {
                const val = try getValue(vcpu, qual);
                try vmx.vmwrite(vmcs.guest.cr3, val & ~@as(u64, (1 << 63)));
                ...
            },
            ...
        }
    },
    ...
}
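
To make the role of the individual CR3 bits more concrete, the following sketch splits a MOV source operand into the parts discussed above. Cr3Write and splitCr3() are hypothetical names for illustration, and the sample value is made up. The PCID occupies bits 11:0 of CR3 and is only meaningful while CR4.PCIDE is 1.

```zig
const std = @import("std");

const no_flush_bit: u64 = 1 << 63;
const pcid_mask: u64 = 0xFFF;

const Cr3Write = struct {
    value: u64, // what may be stored in the Guest-State CR3
    pcid: u12, // bits 11:0, meaningful only when CR4.PCIDE = 1
    no_flush: bool, // bit 63 of the MOV source operand
};

/// Separate the flush-control bit from the value that is actually stored.
fn splitCr3(src: u64) Cr3Write {
    return Cr3Write{
        .value = src & ~no_flush_bit,
        .pcid = @truncate(src & pcid_mask),
        .no_flush = (src & no_flush_bit) != 0,
    };
}

pub fn main() void {
    // Hypothetical operand: no-flush requested, PCID 5.
    const w = splitCr3(0x8000_0000_0123_4005);
    std.debug.print("value=0x{x} pcid={d} no_flush={}\n", .{ w.value, w.pcid, w.no_flush });
}
```

Only the value field is suitable for the VMCS; the no_flush flag is a request about TLB behavior and has no place in the stored register.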

Combined Mappings

Do you remember when we introduced EPT and listed three types of information cached in the TLB during VMX Operation? Just to recap, here is the information cached in the TLB:

Linear Mappings: The results of GVA to GPA (= HPA) translations, along with the page table entries used for those translations.
Guest-Physical Mappings: The results of GPA to HPA translations, along with the page table entries used for those translations.
Combined Mappings: The results of GVA to HPA translations, along with the page table entries used for those translations.

Each mapping is used depending on whether the guest’s memory access uses a physical address or a virtual address, as follows:

Virtual address: Combined mappings tagged with VPID, PCID, and EPTRTA are used.
Physical address: Guest-physical mappings tagged with EPTRTA are used.

Among these, linear mappings are flushed by MOV to CR3, but guest-physical mappings and combined mappings are not. Both are used for translation to HPA, so it seems that a MOV to CR3 from the guest does not invalidate them. The author is not entirely sure why, but since they are not flushed, we have to deal with it ourselves. If left as is, the CPU keeps using stale combined mappings cached earlier for GVA to HPA translation. Because the result of GVA to GPA translation changes with MOV to CR3, the combined mappings used for GVA to HPA translation must also be invalidated⁵.

There are several ways to invalidate these mappings. The first is to use INVEPT instruction. INVEPT flushes EPT entries associated with the specified EPTP. By specifying the guest’s EPTP, you can flush all mappings linked to the guest. However, this has the drawback of also flushing guest-physical mappings that do not necessarily need to be flushed. The second method is to use INVVPID instruction. This flushes TLB entries tagged with the VPID associated with the vCPU. Unlike INVEPT, INVVPID only flushes combined mappings and does not flush guest-physical mappings⁶. Since we want to flush only combined mappings in this case, we will use INVVPID.

There are four types of INVVPID instructions, each targeting different scopes for TLB invalidation. For our case, we use the single context type, which flushes all combined mappings associated with the specified VPID.

ymir/arch/x86/asm.zig

const InvvpidType = enum(u64) {
    individual_address = 0,
    single_context = 1,
    all_context = 2,
    single_global = 3,
};

pub inline fn invvpid(comptime inv_type: InvvpidType, vpid: u16) void {
    const descriptor: packed struct(u128) {
        vpid: u16,
        _reserved: u48 = 0,
        linear_addr: u64 = 0,
    } align(128) = .{ .vpid = vpid };
    asm volatile (
        \\invvpid (%[descriptor]), %[inv_type]
        :
        : [inv_type] "r" (@intFromEnum(inv_type)),
          [descriptor] "r" (&descriptor),
        : "memory"
    );
}
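
The memory operand of INVVPID is a 128-bit descriptor. The sketch below only checks the layout of such a descriptor in ordinary user-space code; it does not execute INVVPID. InvvpidDescriptor and raw() are hypothetical names that mirror the anonymous struct used above.

```zig
const std = @import("std");

const InvvpidDescriptor = packed struct(u128) {
    vpid: u16, // bits 15:0
    _reserved: u48 = 0, // bits 63:16, must be zero
    linear_addr: u64 = 0, // bits 127:64, used only by the individual-address type
};

/// View the descriptor as the raw 128-bit value the CPU would read from memory.
fn raw(desc: InvvpidDescriptor) u128 {
    return @bitCast(desc);
}

pub fn main() void {
    const desc = InvvpidDescriptor{ .vpid = 1 };
    std.debug.print("size = {d} bytes, raw = 0x{x}\n", .{ @sizeOf(InvvpidDescriptor), raw(desc) });
}
```

For the single context type used in this chapter, only the vpid field matters, which is why the other fields can stay at their zero defaults.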

At the end of MOV to CR3, use INVVPID to flush the combined mappings.

ymir/arch/x86/vmx/cr.zig

    .mov_to => {
        switch (qual.index) {
            3 => {
                ...
                am.invvpid(.single_context, vcpu.vpid);
            },
            ...
        }
    },

Enabling INVPCID

This completes the handling of accesses to control registers. Now, let's see how far Linux boots. Let's run the guest.

txt

[INFO ] main    | Starting the virtual machine...
No EFI environment detected.
early console in extract_kernel
input_data: 0x0000000002d582b9
input_len: 0x0000000000c702ff
output: 0x0000000001000000
output_len: 0x000000000297e75c
kernel_total_size: 0x0000000002630000
needed_size: 0x0000000002a00000
trampoline_32bit: 0x0000000000000000


KASLR disabled: 'nokaslr' on cmdline.


Decompressing Linux... Parsing ELF... No relocation needed... done.
Booting the kernel (entry_offset: 0x0000000000000000).
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x0000000F, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x0000000F, Sub=0x00000001
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x0000000F, Sub=0x00000001
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x0000000F, Sub=0x00000001
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000001
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000002
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000010, Sub=0x00000003
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000012, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000012, Sub=0x00000000
[WARN ] vmcpuid | Unhandled CPUID: Leaf=0x00000012, Sub=0x00000000
[    0.000000] Linux version 6.9.0 (smallkirby@bel) (gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #184 SMP PREEMPT_DYNAMIC Mon Nov 11 21:10:38 JST 2024
[    0.000000] Command line: console=ttyS0 earlyprintk=serial nokaslr
[    0.000000] CPU: vendor_id 'YmirYmirYmir' unknown, using generic init.
[    0.000000] CPU: Your system may be unstable.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000063fffff] usable
[    0.000000] printk: legacy bootconsole [earlyser0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] APIC: Static calls initialized
[    0.000000] DMI not present or invalid.
[    0.000000] last_pfn = 0x6400 max_arch_pfn = 0x400000000
[    0.000000] MTRRs disabled (not available)
[    0.000000] x86/PAT: PAT not supported by the CPU.
[    0.000000] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
[    0.000000] Kernel does not support x2APIC, please recompile with CONFIG_X86_X2APIC.
[    0.000000] Disabling APIC, expect reduced performance and functionality.
[    0.000000] Using GB pages for direct mapping
PANIC: early exception 0x06 IP 10:ffffffff8107930e error 0 cr2 0xffff888003600000
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.9.0 #184
[    0.000000] RIP: 0010:native_flush_tlb_global+0x3e/0xa0

Linux has started booting! Yay. The logs following Linux version... are from the decompressed kernel. There are a few things we can learn from the boot log:

The vendor_id is set to YmirYmirYmir. This is the value specified in CPUID.
The Command line shows the exact string specified in BootParams.
BIOS-e820 is displayed, which corresponds to the E820 map specified in BootParams.
x86/PAT: PAT not supported by the CPU. is displayed because PAT was reported as unsupported via CPUID.

This log clearly shows that everything we've done in the previous chapters is correctly reflected.

Now, the system stopped with PANIC: early exception 0x06. This exception number corresponds to #UD: Invalid Opcode, which occurs when an unsupported instruction is executed. At this point, the kernel hasn’t set up a proper interrupt handler yet, so it simply panics. Checking the displayed RIP with addr2line reveals that it points to __invpcid().

Now, within the VM Execution Controls category, the Secondary Processor-Based Execution Controls has a field that determines whether to allow the guest to execute INVPCID. If this field is disabled, the guest executing INVPCID triggers a #UD: Invalid Opcode exception. We accidentally forgot to enable this field. By enabling it in setupExecCtrls(), this exception will no longer occur:

ymir/arch/x86/vmx/vcpu.zig

fn setupExecCtrls(vcpu: *Vcpu, _: Allocator) VmxError!void {
    ...
    ppb_exec_ctrl2.enable_invpcid = true;
    ...
}

Summary

In this chapter, we virtualized access to the control registers. Although the ownership of each bit in CR0 and CR4 is determined by Guest/Host Masks, Ymir is configured so that the host owns all bits. When a VM Exit occurs due to CR access, we expose the actual values to the guest by passing them through with slight adjustments. However, special handling was necessary for IA-32e mode regarding CR0/CR4 and for combined mappings regarding CR3. Additionally, we also enabled the guest to execute the INVPCID instruction.

After enabling INVPCID, the boot process should progress even further. Let's run it and see. From here on, we'll skip the initial part of the Linux boot log (it's a bit sad to leave out those logs we were so happy to see, but they're quite lengthy).

txt

...
[    0.128997] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.128997] RPC: Registered named UNIX socket transport module.
[    0.129997] RPC: Registered udp transport module.
[    0.129997] RPC: Registered tcp transport module.
[    0.129997] RPC: Registered tcp-with-tls transport module.
[    0.130997] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.132997] pci_bus 0000:00: resource 4 [io  0x0000-0xffff]
[    0.132997] pci_bus 0000:00: resource 5 [mem 0x00000000-0xfffffffff]
[    0.133996] pci 0000:00:01.0: PIIX3: Enabling Passive Release
[    0.133996] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.134996] PCI: CLS 0 bytes, default 64
[    0.136996] no MSR PMU driver.
[    0.136996] platform rtc_cmos: registered platform RTC device (no PNP device found)
[    0.146994] Initialise system trusted keyrings
[    0.146994] workingset: timestamp_bits=56 max_order=14 bucket_order=0
[    0.147994] NFS: Registering the id_resolver key type
[    0.147994] Key type id_resolver registered
[    0.147994] Key type id_legacy registered
[    0.148994] 9p: Installing v9fs 9p2000 file system support
[    0.153993] Key type asymmetric registered
[    0.154993] Asymmetric key parser 'x509' registered
[    0.154993] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.155993] io scheduler mq-deadline registered
[    0.155993] io scheduler kyber registered
[    0.155993] kworker/R-acpi_ (38) used greatest stack depth: 15744 bytes left
[    0.156993] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.415954] serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    0.416953] Non-volatile memory driver v1.3
[    0.416953] Linux agpgart interface v0.103
[    0.417953] loop: module loaded
[    0.417953] scsi host0: ata_piix
[    0.418953] scsi host1: ata_piix
[    0.418953] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc040 irq 14 lpm-pol 0
[    0.418953] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc048 irq 15 lpm-pol 0
[    0.419953] e100: Intel(R) PRO/100 Network Driver
[    0.419953] e100: Copyright(c) 1999-2006 Intel Corporation
[    0.420953] e1000: Intel(R) PRO/1000 Network Driver
[    0.420953] e1000: Copyright (c) 1999-2006 Intel Corporation.
[ERROR] vcpu    | Unhandled VM-exit: reason=arch.x86.vmx.common.ExitReason.ept
[ERROR] vcpu    | === vCPU Information ===
[ERROR] vcpu    | [Guest State]
[ERROR] vcpu    | RIP: 0xFFFFFFFF819B8E1B
...

It made a huge leap forward. It's even trying to access and initialize the serial port—which we haven't virtualized yet—entirely on its own. Bold move.

Eventually, we hit a VM Exit due to an EPT violation. As discussed in the EPT chapter, Ymir pre-maps all physical memory available to the guest before booting, so ideally, EPT violations should not occur. The fact that one did means the guest is accessing memory it shouldn't touch—this is because we haven't virtualized I/O yet. So next, we'll tackle I/O virtualization.

¹ SDM Vol.3A 10.8.5. Initializing IA-32e Mode

² SDM Vol.3A 4.10.4.1. Operations that Invalidate TLBs and Paging-Structure Caches

³ Global pages are not flushed.

⁴ Strictly speaking, no TLB entries are required to be flushed. MOV to CR3, INVEPT, INVPCID, and INVVPID are all guaranteed to flush the entries that are the explicit target of the operation. For entries that are not the target, however, the SDM clearly states that they may or may not be flushed.

⁵ Guest-physical mappings do not need to be invalidated. These mappings are used for accesses via physical addresses, and are not affected by the guest's paging structures or the value of CR3. As a result, they do not become stale due to a MOV to CR3.

⁶ SDM Vol.3C 29.4.3.3. Guidelines for Use of the INVVPID Instruction

Writing Hypervisor in Zig
