Basics of VMCS

This chapter provides an overview of VMCS, the most critical data structure in VMX. Since VMCS controls a vast number of settings, this chapter and the series as a whole will not cover all of them. We will focus only on the fields necessary for Ymir to boot Linux.

Important: The source code for this chapter is in the whiz-vmm-vmcs branch.

Overview of VMCS

The VMCS (Virtual-Machine Control Data Structure) is a data structure that manages the behavior of the vCPU during VMX Non-root Operation, as well as the settings related to VM Exit and VM Entry events.

VMCS fields are classified into the following six categories:

Category               Description
Guest-State            It holds the guest processor's state. This state is loaded on VM Entry and saved on VM Exit.
Host-State             It holds the host processor's state, which is loaded during VM Exit.
VM-Execution Control   It controls the behavior of VMX Non-root Operation.
VM-Exit Control        It controls the behavior of VM Exit.
VM-Entry Control       It controls the behavior of VM Entry.
VM-Exit Information    It holds information related to VM Exit.

Guest-State

This category holds the guest processor's state. These states are automatically loaded at VM Entry, allowing the guest to resume execution from the exact state it was in just before VM Exit. The processor states saved and restored include the following:

  • Control Registers
  • RSP, RIP, RFLAGS
  • Some fields of segment registers
  • Some MSRs

Note that general-purpose registers like RAX, RBX, etc., are not saved in the VMCS. The VMM must save and restore these registers in software.
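As a rough illustration of what "in software" could mean, the VMM might keep a plain structure per vCPU and copy the registers into and out of it around VM Entry and VM Exit. The sketch below is only an assumption about how such a container could look; the name GuestRegisters and its field order are hypothetical and not taken from Ymir.

```zig
const std = @import("std");

/// Hypothetical container for the guest general-purpose registers that the
/// VMCS does not hold. RSP and RIP are absent because they live in the
/// Guest-State area of the VMCS.
pub const GuestRegisters = extern struct {
    rax: u64 = 0,
    rcx: u64 = 0,
    rdx: u64 = 0,
    rbx: u64 = 0,
    rbp: u64 = 0,
    rsi: u64 = 0,
    rdi: u64 = 0,
    r8: u64 = 0,
    r9: u64 = 0,
    r10: u64 = 0,
    r11: u64 = 0,
    r12: u64 = 0,
    r13: u64 = 0,
    r14: u64 = 0,
    r15: u64 = 0,
};

test "fixed layout: 15 registers of 8 bytes each" {
    try std.testing.expectEqual(@as(usize, 15 * 8), @sizeOf(GuestRegisters));
    try std.testing.expectEqual(@as(usize, 8), @offsetOf(GuestRegisters, "rcx"));
}
```

An extern struct is used here so that the field offsets are fixed, which would matter if assembly code stored registers at known offsets.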

Also, only a limited set of MSRs is saved. RDMSR and WRMSR can be configured to cause a VM Exit, much like privileged instructions. In this case, when the guest executes RDMSR/WRMSR, a VM Exit occurs, allowing the VMM to control the values that are actually read or written.

Host-State

This category holds the VMM processor state. These values are loaded during VM Exit, allowing the VMM to resume execution from the state just before VM Entry. The processor states restored include the following:

  • Control Registers
  • RSP, RIP
  • Segment selectors
  • Base part of FS / GS / TR / GDTR / IDTR
  • Some MSRs

Unlike Guest-State, Host-State is not automatically saved during VM Entry. Therefore, before VM Entry, the VMM must set these values appropriately in software. These values are automatically loaded during VM Exit.

VM-Execution Control

This category controls the processor behavior during VMX Non-root Operation. It likely contains the most configuration fields among the six categories. Therefore, we will provide an overview of only the representative fields here and explain others as needed. This category includes the following fields:

  • Pin-Based VM-Execution Controls: It controls asynchronous events such as external interrupts and NMIs
  • Processor-Based VM-Execution Controls: It mainly controls events generated by special instructions
    • "Special instructions" refer to instructions such as RDTSCP, HLT, INVLPG, and MOV to control registers (CR).
  • Exception Bitmap: It controls which exceptions trigger VM Exit
  • I/O-Bitmap: It controls which I/O port accesses trigger VM Exit.
  • MSR-Bitmap: It controls which MSR accesses trigger VM Exit.
  • EPTP: Extended-Page-Table Pointer: Pointer to the level 4 table of EPT. EPT will be covered in detail in a separate chapter.
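To make the bitmap idea concrete: the Exception Bitmap is a 32-bit field in which bit n corresponds to exception vector n, and a set bit means that exception causes a VM Exit. The following is a minimal sketch of manipulating such a value; the helper names are hypothetical, and page faults (which are additionally filtered by other fields) are deliberately left out.

```zig
const std = @import("std");

/// A few x86 exception vectors, used only for this illustration.
const Vector = enum(u5) {
    breakpoint = 3, // #BP
    invalid_opcode = 6, // #UD
    general_protection = 13, // #GP
};

/// Returns `bitmap` with the bit for `vec` set, so that the exception
/// would cause a VM Exit.
fn interceptException(bitmap: u32, vec: Vector) u32 {
    return bitmap | (@as(u32, 1) << @intFromEnum(vec));
}

/// Reports whether the bit for `vec` is set in `bitmap`.
fn isIntercepted(bitmap: u32, vec: Vector) bool {
    return ((bitmap >> @intFromEnum(vec)) & 1) == 1;
}

test "setting the #UD bit" {
    const bitmap = interceptException(0, .invalid_opcode);
    try std.testing.expectEqual(@as(u32, 1 << 6), bitmap);
    try std.testing.expect(isIntercepted(bitmap, .invalid_opcode));
    try std.testing.expect(!isIntercepted(bitmap, .breakpoint));
}
```

The resulting value would then be written to the Exception Bitmap field of the VMCS.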

VM-Exit Control

This category controls the behavior during VM Exit. The control items include the following:

  • Whether DR (Debug Registers) are saved
  • Whether the processor mode immediately after VM Exit is 64-bit mode.
  • MSR Area: The list of host MSRs to be loaded during VM Exit

Compared to other categories, this one has fewer fields and features, and Ymir doesn’t use many of them. It’s the “comfort zone” of VMCS.

VM-Entry Control

This category controls the behavior during VM Entry. The control items include the following:

  • Whether DR (Debug Registers) are loaded
  • Whether the processor is in IA-32e mode immediately after VM Entry
  • Event Injection: It can inject an arbitrary interrupt or exception into the guest on VM Entry

Like the VM-Exit Control category, this one is also a “comfort zone.”

VM-Exit Information

This category stores information related to VM Exit. It is the only read-only category. The stored information includes the following:

  • Basic VM-Exit Information: The cause of the VM Exit
    • Provides a general reason for VM Exit, such as I/O access, CPUID, MSR access, or EPT violation.
  • VM-Instruction Error Field: When an error code is available from a VMX extended instruction, it is stored here.
    • This field is unrelated to VM Exit itself.

As it is read-only, this category does not require any configuration by a VMM.

VMCS Region

A template for setting up VMCS will be added to Vcpu:

ymir/arch/x86/vmx/vcpu.zig
pub const Vcpu = struct {
    ...
    vmcs_region: *VmcsRegion = undefined,
    ...

    pub fn setupVmcs(self: *Self, allocator: Allocator) VmxError!void {
        ...

        // Initialize VMCS fields.
        try setupExecCtrls(self, allocator);
        try setupExitCtrls(self);
        try setupEntryCtrls(self);
        try setupHostState(self);
        try setupGuestState(self);
    }
}

fn setupExecCtrls(vcpu: *Vcpu, allocator: Allocator) VmxError!void {}
fn setupExitCtrls(vcpu: *Vcpu) VmxError!void {}
fn setupEntryCtrls(vcpu: *Vcpu) VmxError!void {}
fn setupHostState(vcpu: *Vcpu) VmxError!void {}
fn setupGuestState(vcpu: *Vcpu) VmxError!void {}

The VMCS is put in an area called the VMCS region. It has a structure similar to the VMXON region:

[Figure: Format of the VMCS Region. SDM Vol.3C 25.2 Table 25-1.]

The VMCS Revision Identifier is the same value set in the VMXON region and represents the VMCS version number. This value can be obtained from the IA32_VMX_BASIC MSR. The VMX-Abort Indicator is not used in this series. Following this are the VMCS configuration fields. The layout of each field is entirely implementation-dependent, and the VMM neither needs to know nor can determine it. How to write to a data structure with an unknown layout will be explained later.
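For reference, IA32_VMX_BASIC reports the revision identifier in bits 30:0 and the size of the VMXON/VMCS region in bits 44:32. The sketch below decodes a raw 64-bit MSR value under that layout. The names VmxBasic and decodeVmxBasic are hypothetical, the input used in the test is a made-up value, and actually reading the MSR (for example via the readMsrVmxBasic() helper from the previous chapter) is outside this sketch.

```zig
const std = @import("std");

/// Subset of IA32_VMX_BASIC needed to prepare a VMCS region.
const VmxBasic = struct {
    revision_id: u31,
    region_size: u13,
};

/// Decodes a raw IA32_VMX_BASIC value.
/// Bits 30:0 hold the revision identifier, bits 44:32 the region size in bytes.
fn decodeVmxBasic(raw: u64) VmxBasic {
    const revision_id: u31 = @truncate(raw);
    const region_size: u13 = @truncate(raw >> 32);
    return .{ .revision_id = revision_id, .region_size = region_size };
}

test "decode a made-up MSR value" {
    const raw = (@as(u64, 0x1000) << 32) | 0x12;
    const basic = decodeVmxBasic(raw);
    try std.testing.expectEqual(@as(u31, 0x12), basic.revision_id);
    try std.testing.expectEqual(@as(u13, 0x1000), basic.region_size);
}
```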

Define a VMCS region as below:

ymir/arch/x86/vmx/vcpu.zig
const VmcsRegion = packed struct {
    vmcs_revision_id: u31,
    zero: u1 = 0,
    abort_indicator: u32,

    pub fn new(page_allocator: Allocator) VmxError!*align(mem.page_size) VmcsRegion {
        const size = am.readMsrVmxBasic().vmxon_region_size;
        const page = try page_allocator.alloc(u8, size);
        if (@intFromPtr(page.ptr) % mem.page_size != 0) {
            return error.OutOfMemory;
        }
        @memset(page, 0);
        return @alignCast(@ptrCast(page.ptr));
    }
};

Next, we set the allocated VMCS region. The VMCS has states represented by the following state transition diagram:

[Figure: State of VMCS X. SDM Vol.3C 25.1 Figure 25-1.]

Active and Inactive represent the VMCS state. Among active VMCSs, a logical core can have at most one Current VMCS. When a current VMCS is set, the logical core references that VMCS to determine its behavior. Clear and Launched indicate whether the VMCS has been used. When a core with the VMCS as "current" performs VM Entry via the VMLAUNCH instruction, the VMCS transitions to the launched state. A launched VMCS stays launched until VMCLEAR is executed on it. To perform VM Entry with a launched VMCS, the VMRESUME instruction must be used. There is no direct way to determine whether a VMCS is clear or launched [1]; therefore, this state must be tracked in software.
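Tracking the launch state in software can be as small as one boolean per VMCS. The following is a sketch under that assumption; LaunchTracker and its methods are hypothetical names and not part of Ymir.

```zig
const std = @import("std");

/// Which instruction must be used for the next VM Entry.
const EntryInstruction = enum { vmlaunch, vmresume };

/// Hypothetical software-side record of a VMCS's launch state.
const LaunchTracker = struct {
    launched: bool = false,

    /// A clear VMCS needs VMLAUNCH, a launched one needs VMRESUME.
    fn nextEntry(self: LaunchTracker) EntryInstruction {
        return if (self.launched) .vmresume else .vmlaunch;
    }

    /// To be called after a successful VMLAUNCH.
    fn markLaunched(self: *LaunchTracker) void {
        self.launched = true;
    }

    /// To be called after VMCLEAR, which puts the VMCS into the clear state.
    fn markCleared(self: *LaunchTracker) void {
        self.launched = false;
    }
};

test "launch state decides the entry instruction" {
    var tracker = LaunchTracker{};
    try std.testing.expectEqual(EntryInstruction.vmlaunch, tracker.nextEntry());
    tracker.markLaunched();
    try std.testing.expectEqual(EntryInstruction.vmresume, tracker.nextEntry());
    tracker.markCleared();
    try std.testing.expectEqual(EntryInstruction.vmlaunch, tracker.nextEntry());
}
```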

Setting the VMCS region means putting the VMCS into the Active + Current + Clear state. This involves using the following two instructions:

  • VMCLEAR: Set the state to Inactive + Not Current + Clear.
  • VMPTRLD: Set the state to Active + Current + Clear
ymir/arch/x86/vmx/vcpu.zig
fn resetVmcs(vmcs_region: *VmcsRegion) VmxError!void {
    try am.vmclear(mem.virt2phys(vmcs_region));
    try am.vmptrld(mem.virt2phys(vmcs_region));
}
ymir/arch/x86/asm.zig
pub inline fn vmclear(vmcs_region: mem.Phys) VmxError!void {
    var rflags: u64 = undefined;
    asm volatile (
        \\vmclear (%[vmcs_phys])
        \\pushf
        \\popq %[rflags]
        : [rflags] "=r" (rflags),
        : [vmcs_phys] "r" (&vmcs_region),
        : "cc", "memory"
    );
    try vmxerr(rflags);
}

pub inline fn vmptrld(vmcs_region: mem.Phys) VmxError!void {
    var rflags: u64 = undefined;
    asm volatile (
        \\vmptrld (%[vmcs_phys])
        \\pushf
        \\popq %[rflags]
        : [rflags] "=r" (rflags),
        : [vmcs_phys] "r" (&vmcs_region),
        : "cc", "memory"
    );
    try vmxerr(rflags);
}

Both instructions take the physical address of the VMCS region as an argument. vmxerr() is a function implemented in the previous chapter for handling errors from VMX extended instructions.
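As a reminder of what such an error check involves: VMX instructions report failure through RFLAGS, with CF set when the instruction fails without an error code and ZF set when it fails with an error code available. The sketch below shows one way this check could be written. The function name checkVmxFlags and the error name VmxStatusUnavailable are assumptions for illustration; only VmxStatusAvailable appears in this chapter's output.

```zig
const std = @import("std");

const VmxError = error{
    /// CF = 1: the instruction failed and no error code is available.
    VmxStatusUnavailable,
    /// ZF = 1: the instruction failed and an error code is available.
    VmxStatusAvailable,
};

/// Interprets the RFLAGS value captured right after a VMX instruction.
fn checkVmxFlags(rflags: u64) VmxError!void {
    const cf = (rflags & (1 << 0)) != 0; // carry flag is bit 0
    const zf = (rflags & (1 << 6)) != 0; // zero flag is bit 6
    if (cf) return error.VmxStatusUnavailable;
    if (zf) return error.VmxStatusAvailable;
}

test "flag combinations" {
    try checkVmxFlags(0x2); // neither CF nor ZF: success
    try std.testing.expectError(error.VmxStatusUnavailable, checkVmxFlags(0x1));
    try std.testing.expectError(error.VmxStatusAvailable, checkVmxFlags(0x40));
}
```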

Based on the above, the allocation and initialization of the VMCS region is as follows:

ymir/arch/x86/vmx/vcpu.zig
pub fn setupVmcs(self: *Self, allocator: Allocator) VmxError!void {
    const vmcs_region = try VmcsRegion.new(allocator);
    vmcs_region.vmcs_revision_id = getVmcsRevisionId();
    self.vmcs_region = vmcs_region;
    try resetVmcs(self.vmcs_region);
    ...
}

Make sure to call setupVmcs() in Vm.init():

ymir/vmx.zig
pub fn init(self: *Self, allocator: Allocator) VmxError!void {
    ...
    try self.vcpu.setupVmcs(allocator);
}

Access to Fields

VMCS-field Encoding

The VMCS field layout is implementation-dependent. It is even possible that the values stored are compressed or encrypted rather than immediate values. A VMM neither needs to be aware of this nor has any way to know.

To read from or write to VMCS fields with unknown layouts, we use the VMX instructions VMREAD and VMWRITE. These instructions specify the target VMCS field using a 32-bit encoded value called the VMCS-field Encoding.

Therefore, to access VMCS fields, you need to define the encoding for each field. The 32-bit encoding is structured as follows:

Structure of VMCS Component Encoding Structure of VMCS Component Encoding. SDM Vol.3C 25.11.2 Table 25-21.

Width refers to the size of the VMCS field. Each VMCS field has one of four sizes: 16-bit, 32-bit, 64-bit, or natural width.

Access Type specifies the access width for the field. Among the four possible Width types, only 64-bit fields can be accessed with 32-bit width. To access the upper 32 bits of such a field, 1:high is specified for the access type. For all other cases, 0:full is used.

Index is the number assigned to each field. This value is unique within fields that share the same access type, type, and width. Conversely, fields with different Type values may share the same Index.

Type refers to the field category. Among the six categories, VM-Execution Control, VM-Exit Control, and VM-Entry Control are collectively classified under the control type.
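As a worked example, the encoding can be assembled by hand with shifts: access type occupies bit 0, index bits 9:1, type bits 11:10, and width bits 14:13. The sketch below is only an illustration of the arithmetic (encodeManually is a hypothetical name). With the SDM's numbering (type 2 = guest state, type 1 = VM-exit information, width 3 = natural, width 2 = 32-bit), guest RIP at index 15 comes out as 0x681E and the VM-instruction error field at index 0 as 0x4400, matching Appendix B.

```zig
const std = @import("std");

/// Builds a VMCS field encoding from its four components using plain shifts.
fn encodeManually(access_type: u1, index: u9, field_type: u2, width: u2) u32 {
    return @as(u32, access_type) |
        (@as(u32, index) << 1) |
        (@as(u32, field_type) << 10) |
        (@as(u32, width) << 13);
}

test "encodings match SDM Appendix B" {
    // Guest RIP: full access, index 15, guest-state type, natural width.
    try std.testing.expectEqual(@as(u32, 0x681E), encodeManually(0, 15, 2, 3));
    // VM-instruction error: full access, index 0, VM-exit information type, 32-bit.
    try std.testing.expectEqual(@as(u32, 0x4400), encodeManually(0, 0, 1, 2));
}
```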

The encoding details for each field are listed in SDM Appendix B FIELD ENCODING IN VMCS. Based on this list, we define a helper function to calculate the encoding for each field [2]:

ymir/arch/x86/vmx/vmcs.zig
fn encode(
    comptime field_type: FieldType,
    comptime index: u9,
    comptime access_type: AccessType,
    comptime width: Width,
) u32 {
    return @bitCast(ComponentEncoding{
        .access_type = access_type,
        .index = index,
        .field_type = field_type,
        .width = width,
    });
}

/// Encodes a VMCS field for the guest state area.
fn eg(
    comptime index: u9,
    comptime access_type: AccessType,
    comptime width: Width,
) u32 { return encode(.guest_state, index, access_type, width); }
/// Encodes a VMCS field for the host state area.
fn eh(
    comptime index: u9,
    comptime access_type: AccessType,
    comptime width: Width,
) u32 { return encode(.host_state, index, access_type, width); }
/// Encodes a VMCS field for the control area.
fn ec(
    comptime index: u9,
    comptime access_type: AccessType,
    comptime width: Width,
) u32 { return encode(.control, index, access_type, width); }
/// Encodes a VMCS field for the read-only area.
fn er(
    comptime index: u9,
    comptime access_type: AccessType,
    comptime width: Width,
) u32 { return encode(.vmexit, index, access_type, width); }

const AccessType = enum(u1) {
    full = 0,
    high = 1,
};
const Width = enum(u2) {
    word = 0,
    qword = 1,
    dword = 2,
    natural = 3,
};
const FieldType = enum(u2) {
    control = 0,
    vmexit = 1,
    guest_state = 2,
    host_state = 3,
};
const ComponentEncoding = packed struct(u32) {
    access_type: AccessType,
    index: u9,
    field_type: FieldType,
    _reserved1: u1 = 0,
    width: Width,
    _reserved2: u17 = 0,
};

eg(), eh(), ec(), and er() are functions that calculate the encoding for each respective Type. Using these helper functions, you can define the encoding for VMCS fields. However, there are many fields - around 200 in total. Therefore, we will not include the full snippet here. Please refer to the Ymir repository for the complete encoding definitions.

(For the rare reader who cannot access GitHub, here is an excerpt containing only the encoding definitions for the Guest-State type.)
ymir/arch/x86/vmx/vmcs.zig
pub const guest = enum(u32) {
    // Natural-width fields.
    cr0 = eg(0, .full, .natural),
    cr3 = eg(1, .full, .natural),
    cr4 = eg(2, .full, .natural),
    es_base = eg(3, .full, .natural),
    cs_base = eg(4, .full, .natural),
    ss_base = eg(5, .full, .natural),
    ds_base = eg(6, .full, .natural),
    fs_base = eg(7, .full, .natural),
    gs_base = eg(8, .full, .natural),
    ldtr_base = eg(9, .full, .natural),
    tr_base = eg(10, .full, .natural),
    gdtr_base = eg(11, .full, .natural),
    idtr_base = eg(12, .full, .natural),
    dr7 = eg(13, .full, .natural),
    rsp = eg(14, .full, .natural),
    rip = eg(15, .full, .natural),
    rflags = eg(16, .full, .natural),
    pending_debug_exceptions = eg(17, .full, .natural),
    sysenter_esp = eg(18, .full, .natural),
    sysenter_eip = eg(19, .full, .natural),
    s_cet = eg(20, .full, .natural),
    ssp = eg(21, .full, .natural),
    intr_ssp_table_addr = eg(22, .full, .natural),
    // 16-bit fields.
    es_sel = eg(0, .full, .word),
    cs_sel = eg(1, .full, .word),
    ss_sel = eg(2, .full, .word),
    ds_sel = eg(3, .full, .word),
    fs_sel = eg(4, .full, .word),
    gs_sel = eg(5, .full, .word),
    ldtr_sel = eg(6, .full, .word),
    tr_sel = eg(7, .full, .word),
    intr_status = eg(8, .full, .word),
    pml_index = eg(9, .full, .word),
    uinv = eg(10, .full, .word),
    // 32-bit fields.
    es_limit = eg(0, .full, .dword),
    cs_limit = eg(1, .full, .dword),
    ss_limit = eg(2, .full, .dword),
    ds_limit = eg(3, .full, .dword),
    fs_limit = eg(4, .full, .dword),
    gs_limit = eg(5, .full, .dword),
    ldtr_limit = eg(6, .full, .dword),
    tr_limit = eg(7, .full, .dword),
    gdtr_limit = eg(8, .full, .dword),
    idtr_limit = eg(9, .full, .dword),
    es_rights = eg(10, .full, .dword),
    cs_rights = eg(11, .full, .dword),
    ss_rights = eg(12, .full, .dword),
    ds_rights = eg(13, .full, .dword),
    fs_rights = eg(14, .full, .dword),
    gs_rights = eg(15, .full, .dword),
    ldtr_rights = eg(16, .full, .dword),
    tr_rights = eg(17, .full, .dword),
    interruptibility_state = eg(18, .full, .dword),
    activity_state = eg(19, .full, .dword),
    smbase = eg(20, .full, .dword),
    sysenter_cs = eg(21, .full, .dword),
    preemp_timer = eg(23, .full, .dword), // encoding 0x482E; index 22 (0x482C) is unused
    // 64-bit fields.
    vmcs_link_pointer = eg(0, .full, .qword),
    dbgctl = eg(1, .full, .qword),
    pat = eg(2, .full, .qword),
    efer = eg(3, .full, .qword),
    perf_global_ctrl = eg(4, .full, .qword),
    pdpte0 = eg(5, .full, .qword),
    pdpte1 = eg(6, .full, .qword),
    pdpte2 = eg(7, .full, .qword),
    pdpte3 = eg(8, .full, .qword),
    bndcfgs = eg(9, .full, .qword),
    rtit_ctl = eg(10, .full, .qword),
    lbr_ctl = eg(11, .full, .qword),
    pkrs = eg(12, .full, .qword),
};

It’s definitely some hellish code.

VMREAD / VMWRITE

With the VMCS encoding defined, let's create functions to access VMCS fields. First, the VMREAD instruction. It takes an encoding as an argument and returns the value of the read field. Ideally, the return type should match the field's Width, but for simplicity, we uniformly use 64-bit for all fields. Also, since VMREAD is a VMX extended instruction, error handling is done using RFLAGS:

ymir/arch/x86/vmx/common.zig
pub fn vmread(field: anytype) VmxError!u64 {
    var rflags: u64 = undefined;
    const ret = asm volatile (
        \\vmread %[field], %[ret]
        \\pushf
        \\popq %[rflags]
        : [ret] "={rax}" (-> u64),
          [rflags] "=r" (rflags),
        : [field] "r" (@as(u64, @intFromEnum(field))),
    );
    try vmxtry(rflags);
    return ret;
}

Next is VMWRITE. VMWRITE takes an encoding and the value to write as arguments, but the value is typed as anytype. This is because there are many cases where you want to pass a packed struct (e.g., packed struct (u64)) directly to VMWRITE, avoiding the need to call @bitCast() every time.

ymir/arch/x86/vmx/common.zig
pub fn vmwrite(field: anytype, value: anytype) VmxError!void {
    const value_int = switch (@typeInfo(@TypeOf(value))) {
        .Int, .ComptimeInt => @as(u64, value),
        .Struct => switch (@sizeOf(@TypeOf(value))) {
            1 => @as(u8, @bitCast(value)),
            2 => @as(u16, @bitCast(value)),
            4 => @as(u32, @bitCast(value)),
            8 => @as(u64, @bitCast(value)),
            else => @compileError("Unsupported structure size for vmwrite"),
        },
        .Pointer => @as(u64, @intFromPtr(value)),
        else => @compileError("Unsupported type for vmwrite"),
    };

    const rflags = asm volatile (
        \\vmwrite %[value], %[field]
        \\pushf
        \\popq %[rflags]
        : [rflags] "=r" (-> u64),
        : [field] "r" (@as(u64, @intFromEnum(field))),
          [value] "r" (@as(u64, value_int)),
    );
    try vmxtry(rflags);
}
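To see what the struct branch of that conversion produces, here is a small standalone sketch. ExampleCtrl is a hypothetical layout invented for illustration (it is not a real VMCS field), and toU64 mirrors only the size-based @bitCast step, not the VMWRITE itself.

```zig
const std = @import("std");

/// Hypothetical 32-bit control layout, for illustration only.
const ExampleCtrl = packed struct(u32) {
    flag_a: bool = false, // bit 0
    flag_b: bool = false, // bit 1
    _reserved: u30 = 0,
};

/// Converts a 4-byte or 8-byte packed struct to the u64 handed to VMWRITE.
/// Only the branch matching the comptime-known size is analyzed.
fn toU64(value: anytype) u64 {
    return switch (@sizeOf(@TypeOf(value))) {
        4 => @as(u32, @bitCast(value)),
        8 => @as(u64, @bitCast(value)),
        else => @compileError("Unsupported structure size"),
    };
}

test "the first packed field is the least significant bit" {
    try std.testing.expectEqual(@as(u64, 0b10), toU64(ExampleCtrl{ .flag_b = true }));
    try std.testing.expectEqual(@as(u64, 0b11), toU64(ExampleCtrl{ .flag_a = true, .flag_b = true }));
}
```

With this conversion inside vmwrite(), a caller can pass a value such as ExampleCtrl{ .flag_b = true } directly instead of bit-casting at every call site.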

VMX Instruction Error

Now, although we've defined the VMCS encoding and set up the VMCS, its contents are still empty. But being able to set the VMCS is already something to be happy about. If we don’t celebrate these small wins, life can get tough. I highly recommend making a fuss over these little moments of joy. Happy! Fun! So, since we're feeling good, let's go ahead and try VMLAUNCH for now. It will probably fail anyway, but hey, we have the VMCS set. Let's see what kind of complaints it throws at us:

ymir/arch/x86/vmx/vcpu.tmp.zig
...
pub fn setupVmcs(self: *Self, allocator: Allocator) VmxError!void {
    ...
    try resetVmcs(self.vmcs_region);
    asm volatile("vmlaunch");

If you run it as is... nothing happens. No errors or messages.

This is because, as covered in the previous chapter, VMLAUNCH, a VMX extended instruction, returns errors in a special way instead of throwing exceptions when it fails. Instead of just calling VMLAUNCH blindly, let's properly check for these errors:

ymir/arch/x86/vmx/vcpu.tmp.zig
const rflags = asm volatile (
    \\vmlaunch
    \\pushf
    \\popq %[rflags]
    : [rflags] "=r" (-> u64),
);
vmx.vmxtry(rflags) catch |err| {
    log.err("VMLAUNCH: {?}", .{err});
};

The output will look like the following:

[ERROR] vcpu    | VMLAUNCH: error.VmxStatusAvailable

This means RFLAGS.ZF is set, indicating that an error code is available. The error code is stored in the VM-Instruction Error Field within the VMCS's VM-Exit Information. Let's define an enum for the error codes and a helper function to retrieve the error code from the VM-Instruction Error Field. For a list and description of the error codes, refer to SDM Vol.3C 31.4 VM INSTRUCTION ERROR NUMBERS:

ymir/arch/x86/vmx/common.zig
pub const InstructionError = enum(u32) {
    error_not_available = 0,
    vmcall_in_vmxroot = 1,
    vmclear_invalid_phys = 2,
    vmclear_vmxonptr = 3,
    vmlaunch_nonclear_vmcs = 4,
    vmresume_nonlaunched_vmcs = 5,
    vmresume_after_vmxoff = 6,
    vmentry_invalid_ctrl = 7,
    vmentry_invalid_host_state = 8,
    vmptrld_invalid_phys = 9,
    vmptrld_vmxonp = 10,
    vmptrld_incorrect_rev = 11,
    vmrw_unsupported_component = 12,
    vmw_ro_component = 13,
    vmxon_in_vmxroot = 15,
    vmentry_invalid_exec_ctrl = 16,
    vmentry_nonlaunched_exec_ctrl = 17,
    vmentry_exec_vmcsptr = 18,
    vmcall_nonclear_vmcs = 19,
    vmcall_invalid_exitctl = 20,
    vmcall_incorrect_msgrev = 22,
    vmxoff_dualmonitor = 23,
    vmcall_invalid_smm = 24,
    vmentry_invalid_execctrl = 25,
    vmentry_events_blocked = 26,
    invalid_invept = 28,

    /// Get an instruction error number from VMCS.
    pub fn load() VmxError!InstructionError {
        return @enumFromInt(@as(u32, @truncate(try vmread(vmcs.ro.vminstruction_error))));
    }
};

load() is the function that retrieves the error code from the VMCS. This is the first time we use VMREAD - how cute.

Definition of the ro enum:
ymir/arch/x86/vmx/vmcs.zig
pub const ro = enum(u32) {
    // Natural-width fields.
    exit_qual = er(0, .full, .natural),
    io_rcx = er(1, .full, .natural),
    io_rsi = er(2, .full, .natural),
    io_rdi = er(3, .full, .natural),
    io_rip = er(4, .full, .natural),
    guest_linear_address = er(5, .full, .natural),
    // 32-bit fields.
    vminstruction_error = er(0, .full, .dword),
    vmexit_reason = er(1, .full, .dword),
    exit_intr_info = er(2, .full, .dword),
    exit_intr_ec = er(3, .full, .dword),
    idt_vectoring_info = er(4, .full, .dword),
    idt_vectoring_ec = er(5, .full, .dword),
    exit_inst_len = er(6, .full, .dword),
    exit_inst_info = er(7, .full, .dword),
    // 64-bit fields.
    guest_physical_address = er(0, .full, .qword),
};

Now, let's try retrieving the error code after VMLAUNCH:

ymir/arch/x86/vmx/vcpu.tmp.zig
vmx.vmxtry(rflags) catch |err| {
    log.err("VMLAUNCH: {?}", .{err});
    log.err("VM-instruction error number: {s}", .{@tagName(try vmx.InstructionError.load())});
};

The output will look like the following:

[ERROR] vcpu    | VMLAUNCH: error.VmxStatusAvailable
[ERROR] vcpu    | VM-instruction error number: vmentry_invalid_ctrl

Error number 7: VM entry with invalid control field(s). This error will likely be seen at least 800 billion times during development. It indicates that an invalid value is set in one of the VMCS's control fields (VM-Execution, VM-Exit, or VM-Entry Control). Before VM Entry, VMLAUNCH and VMRESUME check whether all categories in the VMCS are properly configured. If even one invalid value is present, VM Entry will fail [3]. Furthermore, the error does not provide any details about what exactly is wrong. Since there are over 100 validation checks [4], pinpointing the issue is extremely difficult. The author has implemented each check one by one and runs them only in debug builds. Please try to write your code so that this error does not occur in the first place. Well, even so, you will probably still encounter this error. When that happens, you must suffer through it properly.
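To give a feel for what one such check looks like, here is a sketch of a common one: validating a 32-bit control field against its VMX capability MSR. In those MSRs, bits 31:0 are the allowed 0-settings (a bit set there must be 1 in the control) and bits 63:32 are the allowed 1-settings (a bit clear there must be 0 in the control). The function name isCtrlValid and the MSR value used in the test are made up for illustration; this is not the author's implementation.

```zig
const std = @import("std");

/// Checks a 32-bit control value against a VMX capability MSR value.
fn isCtrlValid(ctrl: u32, capability_msr: u64) bool {
    const must_be_one: u32 = @truncate(capability_msr); // allowed 0-settings
    const may_be_one: u32 = @truncate(capability_msr >> 32); // allowed 1-settings
    const missing_ones = must_be_one & ~ctrl; // required bits left at 0
    const forbidden_ones = ctrl & ~may_be_one; // unsupported bits set to 1
    return missing_ones == 0 and forbidden_ones == 0;
}

test "made-up capability value" {
    // Bits 1, 2, and 4 must be 1; only bits 7:0 may be 1.
    const msr: u64 = 0x000000FF_00000016;
    try std.testing.expect(isCtrlValid(0x16, msr));
    try std.testing.expect(isCtrlValid(0x1E, msr));
    try std.testing.expect(!isCtrlValid(0x00, msr)); // required bits missing
    try std.testing.expect(!isCtrlValid(0x116, msr)); // bit 8 is not allowed
}
```

A debug-only validator could run a series of checks like this before VMLAUNCH and log which one failed, since the processor itself only reports error number 7.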

Summary

In this chapter, we described the six categories of VMCS and outlined the role of each. We confirmed that the layout of VMCS is implementation-dependent and that an encoding must be specified to access the fields, and defined the encodings and the VMREAD/VMWRITE helper functions. We also looked at the states a VMCS can be in and made the VMCS current on the logical core. Although VMCS is still in a fresh state, we tried to VMLAUNCH it and confirmed that an error occurs. In the process, we also succeeded in actually reading fields from VMCS.

Although we have given a disturbing explanation of the error at the end, from the next chapter onward, we will actually set values in VMCS. For now, our goal is to transition to VMX Non-root Operation so that we can execute one instruction in a guest.

[1] You can try executing the VMLAUNCH instruction and conclude that the VMCS is launched if an error occurs.

[2] All these function calls are evaluated at comptime. There is no runtime overhead.

[3] If a VM Entry fails for this reason, it is treated as a simple instruction failure rather than a VM Exit because the VM Entry itself is not executed.

[4] SDM Vol.3C Chapter 27, VM Entries.