VMLAUNCH: Launching Restricted Guest
In the previous chapter, we set the VMCS on the current core and confirmed that executing VMLAUNCH with an uninitialized VMCS results in an error. In this chapter, we will properly configure the VMCS and execute VMLAUNCH. Our goal is to execute one instruction in VMX non-root operation.
important
The source code for this chapter is in the whiz-vmm-vmlaunch branch.
Table of Contents
- Overview
- VM-Execution Control
- Host-State
- Guest-State
- VM-Entry Control
- VM-Exit Control
- VMLAUNCH
- Summary
Overview
The ultimate goal of this series is to boot Linux and run a shell. However, configuring everything at once is quite difficult. Therefore, in this chapter, our goal is simply to transition into the guest. To achieve this, we need to minimally configure five of the six VMCS categories, excluding the read-only VM-Exit Information category.
First, we define the function that will be executed as the guest in this chapter:
export fn blobGuest() callconv(.Naked) noreturn {
while (true) asm volatile ("hlt");
}
The calling convention is set to .Naked. This is because we are not setting a valid RSP for the guest in this chapter, and any prologue that attempts to push to the stack would trigger a fault. This function simply enters an infinite HLT loop. While not particularly interesting, it serves to verify whether we can successfully transition into VMX non-root operation.
For paging and segmentation, the guest we handle in this chapter will operate with the following configuration. Conceptually, it's less like running a separate guest and more like transitioning Ymir itself directly into VMX non-root operation:
- Restricted Guest: A mode in which paging is required
- IA-32e 64bit mode (Long Mode)
- GDT and page tables are shared with the host
- Other important registers and related state are also shared with the host
VM-Execution Control
First, we configure the VM-Execution Control category. This category controls the processor’s behavior in VMX non-root operation. In this chapter, we will set two of its fields.
Pin-Based Controls
Pin-Based VM-Execution Controls [1] (hereafter Pin-Based Controls) is a 32-bit data structure that manages asynchronous events such as interrupts:
pub const PinExecCtrl = packed struct(u32) {
const Self = @This();
external_interrupt: bool,
_reserved1: u2,
nmi: bool,
_reserved2: u1,
virtual_nmi: bool,
activate_vmx_preemption_timer: bool,
process_posted_interrupts: bool,
_reserved3: u24,
pub fn new() Self {
return std.mem.zeroes(Self);
}
pub fn load(self: Self) VmxError!void {
const val: u32 = @bitCast(self);
try vmx.vmwrite(ctrl.pin_exec_ctrl, val);
}
pub fn store() VmxError!Self {
const val: u32 = @truncate(try vmx.vmread(ctrl.pin_exec_ctrl));
return @bitCast(val);
}
};
The meaning of each field will be explained when we actually use it. This structure defines load() and store() methods: load() writes the value to the VMCS, and store() reads the current value back from the VMCS. For the ctrl enum, please refer to GitHub.
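As a rough illustration of why a packed struct(u32) is convenient here, the following standalone sketch copies the PinExecCtrl layout (without the VMCS accessors) and checks that each field lands on the expected bit of the raw 32-bit value. The main() function and the sample values are only for illustration and are not part of Ymir.

```zig
const std = @import("std");

// Standalone copy of the PinExecCtrl layout, without load()/store().
const PinExecCtrl = packed struct(u32) {
    external_interrupt: bool, // bit 0
    _reserved1: u2, // bits 1-2
    nmi: bool, // bit 3
    _reserved2: u1, // bit 4
    virtual_nmi: bool, // bit 5
    activate_vmx_preemption_timer: bool, // bit 6
    process_posted_interrupts: bool, // bit 7
    _reserved3: u24, // bits 8-31
};

pub fn main() void {
    // Struct -> raw value: setting fields sets the matching bits.
    var c = std.mem.zeroes(PinExecCtrl);
    c.external_interrupt = true;
    c.nmi = true;
    const raw: u32 = @bitCast(c);
    std.debug.assert(raw == 0b1001);

    // Raw value -> struct: bit 6 corresponds to the preemption timer.
    const back: PinExecCtrl = @bitCast(@as(u32, 0x40));
    std.debug.assert(back.activate_vmx_preemption_timer);
    std.debug.assert(!back.nmi);

    std.debug.print("raw = 0x{x}\n", .{raw});
}
```

Because the struct is exactly 32 bits wide, @bitCast converts in both directions without any manual shifting or masking.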
For reference, the definition of the ctrl enum is also shown here:
pub const ctrl = enum(u32) {
// Natural-width fields.
cr0_mask = ec(0, .full, .natural),
cr4_mask = ec(1, .full, .natural),
cr0_read_shadow = ec(2, .full, .natural),
cr4_read_shadow = ec(3, .full, .natural),
cr3_target0 = ec(4, .full, .natural),
cr3_target1 = ec(5, .full, .natural),
cr3_target2 = ec(6, .full, .natural),
cr3_target3 = ec(7, .full, .natural),
// 16-bit fields.
vpid = ec(0, .full, .word),
posted_intr_notif_vector = ec(1, .full, .word),
eptp_index = ec(2, .full, .word),
hlat_prefix_size = ec(3, .full, .word),
pid_pointer_index = ec(4, .full, .word),
// 32-bit fields.
pin_exec_ctrl = ec(0, .full, .dword),
proc_exec_ctrl = ec(1, .full, .dword),
exception_bitmap = ec(2, .full, .dword),
pf_ec_mask = ec(3, .full, .dword),
pf_ec_match = ec(4, .full, .dword),
cr3_target_count = ec(5, .full, .dword),
primary_exit_ctrl = ec(6, .full, .dword),
exit_msr_store_count = ec(7, .full, .dword),
vexit_msr_load_count = ec(8, .full, .dword),
entry_ctrl = ec(9, .full, .dword),
entry_msr_load_count = ec(10, .full, .dword),
entry_intr_info = ec(11, .full, .dword),
entry_exception_ec = ec(12, .full, .dword),
entry_inst_len = ec(13, .full, .dword),
tpr_threshold = ec(14, .full, .dword),
secondary_proc_exec_ctrl = ec(15, .full, .dword),
ple_gap = ec(16, .full, .dword),
ple_window = ec(17, .full, .dword),
instruction_timeouts = ec(18, .full, .dword),
// 64-bit fields.
io_bitmap_a = ec(0, .full, .qword),
io_bitmap_b = ec(1, .full, .qword),
msr_bitmap = ec(2, .full, .qword),
exit_msr_store_address = ec(3, .full, .qword),
exit_msr_load_address = ec(4, .full, .qword),
entry_msr_load_address = ec(5, .full, .qword),
executive_vmcs_pointer = ec(6, .full, .qword),
pml_address = ec(7, .full, .qword),
tsc_offset = ec(8, .full, .qword),
virtual_apic_address = ec(9, .full, .qword),
apic_access_address = ec(10, .full, .qword),
posted_intr_desc_addr = ec(11, .full, .qword),
vm_function_controls = ec(12, .full, .qword),
eptp = ec(13, .full, .qword),
eoi_exit_bitmap0 = ec(14, .full, .qword),
eoi_exit_bitmap1 = ec(15, .full, .qword),
eoi_exit_bitmap2 = ec(16, .full, .qword),
eoi_exit_bitmap3 = ec(17, .full, .qword),
eptp_list_address = ec(18, .full, .qword),
vmread_bitmap = ec(19, .full, .qword),
vmwrite_bitmap = ec(20, .full, .qword),
vexception_information_address = ec(21, .full, .qword),
xss_exiting_bitmap = ec(22, .full, .qword),
encls_exiting_bitmap = ec(23, .full, .qword),
sub_page_permission_table_pointer = ec(24, .full, .qword),
tsc_multiplier = ec(25, .full, .qword),
tertiary_proc_exec_ctrl = ec(26, .full, .qword),
enclv_exiting_bitmap = ec(27, .full, .qword),
low_pasid_directory = ec(28, .full, .qword),
high_pasid_directory = ec(29, .full, .qword),
shared_eptp = ec(30, .full, .qword),
pconfig_exiting_bitmap = ec(31, .full, .qword),
hlatp = ec(32, .full, .qword),
pid_pointer_table = ec(33, .full, .qword),
secondary_exit_ctrl = ec(34, .full, .qword),
spec_ctrl_mask = ec(37, .full, .qword),
spec_ctrl_shadow = ec(38, .full, .qword),
};
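The ctrl enum relies on a helper, ec(), whose definition is not shown in this chapter. Assuming it follows the VMCS field encoding described in the SDM (access type in bit 0, index in bits 9:1, field type in bits 11:10, width in bits 14:13), a minimal sketch could look like the following. The names Access, FieldType, Width, and encode() are hypothetical and may differ from the actual Ymir source.

```zig
const std = @import("std");

// Hypothetical helper types; the numeric values follow the SDM field encoding.
const Access = enum(u1) { full = 0, high = 1 };
const FieldType = enum(u2) { control = 0, vmexit = 1, guest_state = 2, host_state = 3 };
const Width = enum(u2) { word = 0, qword = 1, dword = 2, natural = 3 };

/// Compose a 32-bit VMCS field encoding from its components.
fn encode(field_type: FieldType, index: u9, access: Access, width: Width) u32 {
    return @as(u32, @intFromEnum(access)) |
        (@as(u32, index) << 1) |
        (@as(u32, @intFromEnum(field_type)) << 10) |
        (@as(u32, @intFromEnum(width)) << 13);
}

/// Encoding for a field in the control category (assumed shape of ec()).
fn ec(index: u9, access: Access, width: Width) u32 {
    return encode(.control, index, access, width);
}

/// Encoding for a field in the host-state category (assumed shape of eh()).
fn eh(index: u9, access: Access, width: Width) u32 {
    return encode(.host_state, index, access, width);
}

pub fn main() void {
    // Pin-based and primary processor-based controls.
    std.debug.assert(ec(0, .full, .dword) == 0x4000);
    std.debug.assert(ec(1, .full, .dword) == 0x4002);
    // Host RIP (index 11, natural width).
    std.debug.assert(eh(11, .full, .natural) == 0x6C16);
    std.debug.print("pin_exec_ctrl = 0x{x}\n", .{ec(0, .full, .dword)});
}
```

The asserted values match the field encodings listed in the SDM for these three fields, which is a quick way to sanity-check such a helper.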
In the function that sets the Execution Control, we configure the Pin-Based Controls:
fn setupExecCtrls(_: *Vcpu, _: Allocator) VmxError!void {
const basic_msr = am.readMsrVmxBasic();
// Pin-based VM-Execution control.
const pin_exec_ctrl = try vmcs.PinExecCtrl.store();
try adjustRegMandatoryBits(
pin_exec_ctrl,
if (basic_msr.true_control) am.readMsr(.vmx_true_pinbased_ctls) else am.readMsr(.vmx_pinbased_ctls),
).load();
...
}
Since this chapter does not handle asynchronous events yet, the Pin-Based Controls will use their default values.
Many values written to the VMCS contain reserved bits. These reserved bits cannot simply be cleared to zero. Instead, you must consult the appropriate MSR for each field and use its value to set the reserved bits properly. For Pin-Based Controls, depending on bit 55 (.true_control) of the IA32_VMX_BASIC MSR, you use the value from either IA32_VMX_TRUE_PINBASED_CTLS (address 0x048D, when the bit is set) or IA32_VMX_PINBASED_CTLS (address 0x0481, otherwise). These MSRs impose the following constraints on Pin-Based Controls:
- [31:0]: Allowed 0-settings: If a bit in the MSR is set to 1, the corresponding bit in the VMCS field must also be set to 1 (Mandatory 1).
- [63:32]: Allowed 1-settings: If a bit in the MSR is 0, the corresponding bit in the VMCS field must be 0 (Mandatory 0).
Since Allowed 0 and 1-settings will frequently appear going forward, we provide a helper function to apply these settings to VMCS fields:
fn adjustRegMandatoryBits(control: anytype, mask: u64) @TypeOf(control) {
var ret: u32 = @bitCast(control);
ret |= @as(u32, @truncate(mask)); // Mandatory 1
ret &= @as(u32, @truncate(mask >> 32)); // Mandatory 0
return @bitCast(ret);
}
In setupExecCtrls(), this helper function is used to apply the constraints imposed by IA32_VMX_PINBASED_CTLS or IA32_VMX_TRUE_PINBASED_CTLS to the Pin-Based Controls.
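To make the two masking steps concrete, here is a small worked example that uses the same logic as adjustRegMandatoryBits(), specialized to a plain u32. The MSR value is made up for illustration and was not read from real hardware.

```zig
const std = @import("std");

/// Same logic as adjustRegMandatoryBits(), specialized to a plain u32.
fn adjust(control: u32, msr: u64) u32 {
    var ret = control;
    ret |= @as(u32, @truncate(msr)); // low half: bits that must be 1
    ret &= @as(u32, @truncate(msr >> 32)); // high half: bits that may be 1
    return ret;
}

pub fn main() void {
    // Hypothetical capability MSR value:
    //   low  32 bits = 0x16 -> bits 1, 2, and 4 are mandatory 1
    //   high 32 bits = 0xFF -> only bits 0-7 are allowed to be 1
    const msr: u64 = 0x0000_00FF_0000_0016;

    // We request bits 0 and 8; bit 8 is not allowed and gets cleared.
    const requested: u32 = 0x0000_0101;
    const adjusted = adjust(requested, msr);

    // 0x101 | 0x16 = 0x117, then 0x117 & 0xFF = 0x17.
    std.debug.assert(adjusted == 0x17);
    std.debug.print("adjusted = 0x{x}\n", .{adjusted});
}
```

Note that the order matters only if the MSR were inconsistent (a mandatory-1 bit that is not allowed to be 1), which a real processor does not report.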
Primary Processor-Based Controls
Processor-Based VM-Execution Controls [2] (hereafter Processor-Based Controls) is a data structure that manages synchronous events, such as the execution of specific instructions. It includes the Primary Processor-Based Controls (32 bits) and the Secondary Processor-Based Controls (32 bits). In this chapter, we will only configure the Primary Controls:
pub const PrimaryProcExecCtrl = packed struct(u32) {
const Self = @This();
_reserved1: u2,
interrupt_window: bool,
tsc_offsetting: bool,
_reserved2: u3,
hlt: bool,
_reserved3: u1,
invlpg: bool,
mwait: bool,
rdpmc: bool,
rdtsc: bool,
_reserved4: u2,
cr3load: bool,
cr3store: bool,
activate_teritary_controls: bool,
_reserved: u1,
cr8load: bool,
cr8store: bool,
use_tpr_shadow: bool,
nmi_window: bool,
mov_dr: bool,
unconditional_io: bool,
use_io_bitmap: bool,
_reserved5: u1,
monitor_trap: bool,
use_msr_bitmap: bool,
monitor: bool,
pause: bool,
activate_secondary_controls: bool,
pub fn load(self: Self) VmxError!void {
const val: u32 = @bitCast(self);
try vmx.vmwrite(ctrl.proc_exec_ctrl, val);
}
pub fn store() VmxError!Self {
const val: u32 = @truncate(try vmx.vmread(ctrl.proc_exec_ctrl));
return @bitCast(val);
}
};
Similarly, we configure the Primary Processor-Based Controls within setupExecCtrls():
fn setupExecCtrls(_: *Vcpu, _: Allocator) VmxError!void {
...
var ppb_exec_ctrl = try vmcs.PrimaryProcExecCtrl.store();
ppb_exec_ctrl.hlt = false;
ppb_exec_ctrl.activate_secondary_controls = false;
try adjustRegMandatoryBits(
ppb_exec_ctrl,
if (basic_msr.true_control) am.readMsr(.vmx_true_procbased_ctls) else am.readMsr(.vmx_procbased_ctls),
).load();
}
The .hlt field controls whether a VM Exit occurs on a HLT instruction. Since blobGuest() runs a HLT loop, we set this to false so that the guest can execute HLT without exiting to the host. The .activate_secondary_controls field determines whether to enable the Secondary Processor-Based Controls. Because we only want to use the Primary Processor-Based Controls in this chapter, this is set to false.
Similar to Pin-Based Controls, reserved bits must be set according to the values in the MSRs. The MSR used is either IA32_VMX_PROCBASED_CTLS (address 0x0482) or IA32_VMX_TRUE_PROCBASED_CTLS (address 0x048E).
Host-State
Next, we configure the Host-State category. This category controls the host state when a VM exit occurs.
Control Registers
The Control Registers specify the values of CR0, CR3, and CR4 upon VM exit. In this series, we want the host state after VM exit to be the same as it was just before VMLAUNCH, so we set these registers to the current host state:
fn setupHostState(_: *Vcpu) VmxError!void {
// Control registers.
try vmwrite(vmcs.host.cr0, am.readCr0());
try vmwrite(vmcs.host.cr3, am.readCr3());
try vmwrite(vmcs.host.cr4, am.readCr4());
...
}
RIP / RSP
These two fields are loaded into the VMM's registers immediately after VM exit to restore the execution context. Since our goal for now is simply to get the guest running, we set temporary values.
// RSP / RIP
try vmwrite(vmcs.host.rip, &vmexitBootstrapHandler);
try vmwrite(vmcs.host.rsp, @intFromPtr(&temp_stack) + temp_stack_size);
vmexitBootstrapHandler() is a simple VM Exit handler. For now, it just logs the VM Exit Reason and enters an HLT loop. Note that the VMM has not yet restored the registers. At the point this function is called, RBP and other general-purpose registers are not set at all. Therefore, this function uses the .Naked calling convention to omit the function prologue:
const temp_stack_size: usize = mem.page_size;
var temp_stack: [temp_stack_size + 0x10]u8 align(0x10) = [_]u8{0} ** (temp_stack_size + 0x10);
fn vmexitBootstrapHandler() callconv(.Naked) noreturn {
asm volatile (
\\call vmexitHandler
);
}
export fn vmexitHandler() noreturn {
log.debug("[VMEXIT handler]", .{});
const reason = vmx.ExitInfo.load() catch unreachable;
log.debug(" VMEXIT reason: {?}", .{reason});
while (true) asm volatile ("hlt");
}
ExitInfo is a struct representing the VM Exit Reason. When a VM Exit occurs, the cause is stored in the Exit Reason field, which is part of the Basic VM-Exit Information in the VMCS VM-Exit Information category. By checking this value, you can identify the general reason for the VM Exit. The load() method retrieves the value from this field. If you're interested in the implementation, the definitions are shown below:
VM Exit Reason and Host
pub const ExitInfo = packed struct(u32) {
basic_reason: ExitReason,
_zero: u1 = 0,
_reserved1: u10 = 0,
_one: u1 = 1,
pending_mtf: u1 = 0,
exit_vmxroot: bool,
_reserved2: u1 = 0,
entry_failure: bool,
pub fn load() VmxError!ExitInfo {
return @bitCast(@as(u32, @truncate(try vmx.vmread(ro.vmexit_reason))));
}
};
pub const ExitReason = enum(u16) {
exception_nmi = 0,
extintr = 1,
triple_fault = 2,
init = 3,
sipi = 4,
io_intr = 5,
other_smi = 6,
intr_window = 7,
nmi_window = 8,
task_switch = 9,
cpuid = 10,
getsec = 11,
hlt = 12,
invd = 13,
invlpg = 14,
rdpmc = 15,
rdtsc = 16,
rsm = 17,
vmcall = 18,
vmclear = 19,
vmlaunch = 20,
vmptrld = 21,
vmptrst = 22,
vmread = 23,
vmresume = 24,
vmwrite = 25,
vmxoff = 26,
vmxon = 27,
cr = 28,
dr = 29,
io = 30,
rdmsr = 31,
wrmsr = 32,
entry_fail_guest = 33,
entry_fail_msr = 34,
mwait = 36,
monitor_trap = 37,
monitor = 39,
pause = 40,
entry_fail_mce = 41,
tpr_threshold = 43,
apic = 44,
veoi = 45,
gdtr_idtr = 46,
ldtr_tr = 47,
ept = 48,
ept_misconfig = 49,
invept = 50,
rdtscp = 51,
preemption_timer = 52,
invvpid = 53,
wbinvd_wbnoinvd = 54,
xsetbv = 55,
apic_write = 56,
rdrand = 57,
invpcid = 58,
vmfunc = 59,
encls = 60,
rdseed = 61,
page_log_full = 62,
xsaves = 63,
xrstors = 64,
pconfig = 65,
spp = 66,
umwait = 67,
tpause = 68,
loadiwkey = 69,
enclv = 70,
enqcmd_pasid_fail = 72,
enqcmds_pasid_fail = 73,
bus_lock = 74,
timeout = 75,
seamcall = 76,
tdcall = 77,
};
pub const host = enum(u32) {
// Natural-width fields.
cr0 = eh(0, .full, .natural),
cr3 = eh(1, .full, .natural),
cr4 = eh(2, .full, .natural),
fs_base = eh(3, .full, .natural),
gs_base = eh(4, .full, .natural),
tr_base = eh(5, .full, .natural),
gdtr_base = eh(6, .full, .natural),
idtr_base = eh(7, .full, .natural),
sysenter_esp = eh(8, .full, .natural),
sysenter_eip = eh(9, .full, .natural),
rsp = eh(10, .full, .natural),
rip = eh(11, .full, .natural),
s_cet = eh(12, .full, .natural),
ssp = eh(13, .full, .natural),
intr_ssp_table_addr = eh(14, .full, .natural),
// 16-bit fields.
es_sel = eh(0, .full, .word),
cs_sel = eh(1, .full, .word),
ss_sel = eh(2, .full, .word),
ds_sel = eh(3, .full, .word),
fs_sel = eh(4, .full, .word),
gs_sel = eh(5, .full, .word),
tr_sel = eh(6, .full, .word),
// 32-bit fields.
sysenter_cs = eh(0, .full, .dword),
// 64-bit fields.
pat = eh(0, .full, .qword),
efer = eh(1, .full, .qword),
perf_global_ctrl = eh(2, .full, .qword),
pkrs = eh(3, .full, .qword),
};
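As a quick check of the ExitInfo layout, the sketch below decodes two raw exit-reason values by hand. It uses a cut-down copy of ExitInfo and a non-exhaustive ExitReason with only two named tags; both are simplified stand-ins for the definitions above, and the raw values are chosen for illustration rather than captured from a real VM Exit.

```zig
const std = @import("std");

// Cut-down stand-in for ExitReason: only the tags used below are named.
const ExitReason = enum(u16) {
    hlt = 12,
    entry_fail_guest = 33,
    _,
};

// Same bit layout as ExitInfo above, without defaults and load().
const ExitInfo = packed struct(u32) {
    basic_reason: ExitReason, // bits 0-15
    _zero: u1, // bit 16
    _reserved1: u10, // bits 17-26
    _one: u1, // bit 27
    pending_mtf: u1, // bit 28
    exit_vmxroot: bool, // bit 29
    _reserved2: u1, // bit 30
    entry_failure: bool, // bit 31
};

pub fn main() void {
    // A plain HLT exit: only the basic reason is set.
    const hlt: ExitInfo = @bitCast(@as(u32, 12));
    std.debug.assert(hlt.basic_reason == .hlt);
    std.debug.assert(!hlt.entry_failure);

    // Bit 31 set: the VM Entry itself failed (here, invalid guest state).
    const failed: ExitInfo = @bitCast(@as(u32, 0x8000_0021));
    std.debug.assert(failed.basic_reason == .entry_fail_guest);
    std.debug.assert(failed.entry_failure);

    std.debug.print("basic reason = {d}\n", .{@intFromEnum(failed.basic_reason)});
}
```

Checking entry_failure first is useful in a handler, because a failed VM Entry means the guest never ran and the basic reason describes the failure instead of a guest action.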
Segment Registers
We configure the following two types of segment registers:
- Segment selector of CS / SS / DS / ES / FS / GS / TR
- Base of FS / GS / TR / GDTR / IDTR (no LDTR)
Note that some segment registers only specify the selector, while others require the base address to be set as well. As explained in the GDT chapter, segment registers that are not used for address translation (the former) only need the selector set, whereas those actually involved in address translation require the base to be specified too. Since GDTR and IDTR only have a base and no selector, the selector cannot be set for them:
fn setupHostState(_: *Vcpu) VmxError!void {
...
// Segment registers.
try vmwrite(vmcs.host.cs_sel, am.readSegSelector(.cs));
try vmwrite(vmcs.host.ss_sel, am.readSegSelector(.ss));
try vmwrite(vmcs.host.ds_sel, am.readSegSelector(.ds));
try vmwrite(vmcs.host.es_sel, am.readSegSelector(.es));
try vmwrite(vmcs.host.fs_sel, am.readSegSelector(.fs));
try vmwrite(vmcs.host.gs_sel, am.readSegSelector(.gs));
try vmwrite(vmcs.host.tr_sel, am.readSegSelector(.tr));
try vmwrite(vmcs.host.fs_base, am.readMsr(.fs_base));
try vmwrite(vmcs.host.gs_base, am.readMsr(.gs_base));
try vmwrite(vmcs.host.tr_base, 0); // Not used in Ymir.
try vmwrite(vmcs.host.gdtr_base, am.sgdt().base);
try vmwrite(vmcs.host.idtr_base, am.sidt().base);
...
}
The selectors of the segment registers are obtained using the following assembly function:
const Segment = enum {
cs,
ss,
ds,
es,
fs,
gs,
tr,
ldtr,
};
pub fn readSegSelector(segment: Segment) u16 {
return switch (segment) {
.cs => asm volatile ("mov %%cs, %[ret]"
: [ret] "=r" (-> u16),
),
.ss => asm volatile ("mov %%ss, %[ret]"
: [ret] "=r" (-> u16),
),
.ds => asm volatile ("mov %%ds, %[ret]"
: [ret] "=r" (-> u16),
),
.es => asm volatile ("mov %%es, %[ret]"
: [ret] "=r" (-> u16),
),
.fs => asm volatile ("mov %%fs, %[ret]"
: [ret] "=r" (-> u16),
),
.gs => asm volatile ("mov %%gs, %[ret]"
: [ret] "=r" (-> u16),
),
.tr => asm volatile ("str %[ret]"
: [ret] "=r" (-> u16),
),
.ldtr => asm volatile ("sldt %[ret]"
: [ret] "=r" (-> u16),
),
};
}
All segment selectors except TR and LDTR can be obtained directly using the MOV instruction [3]. TR and LDTR are retrieved using their dedicated instructions: STR and SLDT, respectively.
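For readers who want to interpret the 16-bit values returned by readSegSelector(), the following sketch decodes a selector into its three components. The Selector type is hypothetical and introduced only for this example; the sample value 0x0010 is the CS value that appears in the QEMU register dump later in this chapter.

```zig
const std = @import("std");

/// Layout of a 16-bit x86 segment selector (hypothetical helper type).
const Selector = packed struct(u16) {
    rpl: u2, // requested privilege level
    ti: u1, // table indicator: 0 = GDT, 1 = LDT
    index: u13, // index into the descriptor table
};

pub fn main() void {
    const cs: Selector = @bitCast(@as(u16, 0x0010));

    // 0x0010 = index 2 in the GDT, requested privilege level 0.
    std.debug.assert(cs.index == 2);
    std.debug.assert(cs.ti == 0);
    std.debug.assert(cs.rpl == 0);

    std.debug.print("index={d} ti={d} rpl={d}\n", .{ cs.index, cs.ti, cs.rpl });
}
```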
The bases of FS and GS are mapped in hardware to the MSRs IA32_FS_BASE and IA32_GS_BASE, respectively. Therefore, their base addresses can be obtained by reading these MSRs. The bases of GDTR and IDTR can be retrieved using the SGDT and SIDT instructions, respectively.
Implementation of SIDT/SGDT
const SgdtRet = packed struct {
limit: u16,
base: u64,
};
pub inline fn sgdt() SgdtRet {
var gdtr: SgdtRet = undefined;
asm volatile (
\\sgdt %[ret]
: [ret] "=m" (gdtr),
);
return gdtr;
}
const SidtRet = packed struct {
limit: u16,
base: u64,
};
pub inline fn sidt() SidtRet {
var idtr: SidtRet = undefined;
asm volatile (
\\sidt %[ret]
: [ret] "=m" (idtr),
);
return idtr;
}
MSR
Some MSRs can be loaded by hardware from the Host-State during a VM Exit. These MSRs include (but are not limited to) the following:
- IA32_SYSENTER_CS / IA32_SYSENTER_ESP / IA32_SYSENTER_EIP
- IA32_EFER
- IA32_PAT
Since this series does not implement system calls, the SYSENTER-related MSRs do not need to be restored. IA32_PAT, which defines page caching attributes, is also not used in this series. The IA32_EFER MSR (address: 0xC0000080) is essential for enabling 64-bit mode, so this is the only MSR we set:
fn setupHostState(_: *Vcpu) VmxError!void {
...
// MSR.
try vmwrite(vmcs.host.efer, am.readMsr(.efer));
}
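To see why IA32_EFER matters for 64-bit mode, the sketch below decodes the EFER value that appears in the QEMU register dump later in this chapter (0xD00). The Efer struct is a hypothetical helper written for this example; only the bits relevant here are named.

```zig
const std = @import("std");

/// Partial layout of IA32_EFER (hypothetical helper type).
const Efer = packed struct(u64) {
    sce: bool, // bit 0: SYSCALL enable
    _reserved1: u7, // bits 1-7
    lme: bool, // bit 8: Long Mode Enable
    _reserved2: u1, // bit 9
    lma: bool, // bit 10: Long Mode Active
    nxe: bool, // bit 11: No-Execute Enable
    _reserved3: u52, // bits 12-63
};

pub fn main() void {
    const efer: Efer = @bitCast(@as(u64, 0xD00));

    // 0xD00 = bits 8, 10, and 11: long mode is enabled and active.
    std.debug.assert(efer.lme);
    std.debug.assert(efer.lma);
    std.debug.assert(efer.nxe);
    std.debug.assert(!efer.sce);

    std.debug.print("lme={} lma={} nxe={}\n", .{ efer.lme, efer.lma, efer.nxe });
}
```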
Guest-State
Next, we configure the Guest-State category. This category controls the state of the guest at VM entry.
Control Registers
The Control Registers specify the values of CR0, CR3, and CR4 of the guest on a VM Entry. In this chapter, these values will be shared with the host:
fn setupGuestState(_: *Vcpu) VmxError!void {
// Control registers.
try vmwrite(vmcs.guest.cr0, am.readCr0());
try vmwrite(vmcs.guest.cr3, am.readCr3());
try vmwrite(vmcs.guest.cr4, am.readCr4());
...
}
Segment Registers
For the guest's segment registers, you need to set the Selector, Base, Limit, and Access Rights individually. This process is quite tedious.
First, set the Base values. Since none of the segments actually use the Base, we simply set it to 0. For LDTR, we set it to 0xDEAD00. Although this value is not actually used, it serves as a marker to distinguish whether the currently running context is the VMM or the guest.
try vmwrite(vmcs.guest.cs_base, 0);
try vmwrite(vmcs.guest.ss_base, 0);
try vmwrite(vmcs.guest.ds_base, 0);
try vmwrite(vmcs.guest.es_base, 0);
try vmwrite(vmcs.guest.fs_base, 0);
try vmwrite(vmcs.guest.gs_base, 0);
try vmwrite(vmcs.guest.tr_base, 0);
try vmwrite(vmcs.guest.gdtr_base, 0);
try vmwrite(vmcs.guest.idtr_base, 0);
try vmwrite(vmcs.guest.ldtr_base, 0xDEAD00); // Marker to indicate the guest.
The Limit field is also not used in practice, so for now we simply set it to the maximum possible value for CS, SS, DS, ES, FS, and GS, and to 0 for the remaining registers:
try vmwrite(vmcs.guest.cs_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.ss_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.ds_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.es_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.fs_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.gs_limit, @as(u64, std.math.maxInt(u32)));
try vmwrite(vmcs.guest.tr_limit, 0);
try vmwrite(vmcs.guest.ldtr_limit, 0);
try vmwrite(vmcs.guest.idtr_limit, 0);
try vmwrite(vmcs.guest.gdtr_limit, 0);
Next, we configure the Selectors. The guest function used in this chapter, blobGuest(), has no function prologue and does not use data segments. Only the CS segment is actually used. Therefore, we set the CS selector to the same value as the host, and leave the others unused.
try vmwrite(vmcs.guest.cs_sel, am.readSegSelector(.cs));
try vmwrite(vmcs.guest.ss_sel, 0);
try vmwrite(vmcs.guest.ds_sel, 0);
try vmwrite(vmcs.guest.es_sel, 0);
try vmwrite(vmcs.guest.fs_sel, 0);
try vmwrite(vmcs.guest.gs_sel, 0);
try vmwrite(vmcs.guest.tr_sel, 0);
try vmwrite(vmcs.guest.ldtr_sel, 0);
Finally, we set the Access Rights. These hold nearly the same information as the GDT entries discussed in the GDT chapter. However, since the format is slightly different, we define it again specifically for use in VMCS. For the meaning of each field, please refer to the explanation in the GDT chapter.
pub const SegmentRights = packed struct(u32) {
const gdt = @import("../gdt.zig");
accessed: bool = true,
rw: bool,
dc: bool,
executable: bool,
desc_type: gdt.DescriptorType,
dpl: u2,
present: bool = true,
_reserved1: u4 = 0,
avl: bool = false,
long: bool = false,
db: u1,
granularity: gdt.Granularity,
unusable: bool = false,
_reserved2: u15 = 0,
};
Honestly, as long as CS is correctly configured, that's sufficient for this chapter. However, since we're already here, we’ll go ahead and set up the other segments as well.
const cs_right = vmx.SegmentRights{
.rw = true,
.dc = false,
.executable = true,
.desc_type = .code_data,
.dpl = 0,
.granularity = .kbyte,
.long = true,
.db = 0,
};
const ds_right = vmx.SegmentRights{
.rw = true,
.dc = false,
.executable = false,
.desc_type = .code_data,
.dpl = 0,
.granularity = .kbyte,
.long = false,
.db = 1,
};
const tr_right = vmx.SegmentRights{
.rw = true,
.dc = false,
.executable = true,
.desc_type = .system,
.dpl = 0,
.granularity = .byte,
.long = false,
.db = 0,
};
const ldtr_right = vmx.SegmentRights{
.accessed = false,
.rw = true,
.dc = false,
.executable = false,
.desc_type = .system,
.dpl = 0,
.granularity = .byte,
.long = false,
.db = 0,
};
try vmwrite(vmcs.guest.cs_rights, cs_right);
try vmwrite(vmcs.guest.ss_rights, ds_right);
try vmwrite(vmcs.guest.ds_rights, ds_right);
try vmwrite(vmcs.guest.es_rights, ds_right);
try vmwrite(vmcs.guest.fs_rights, ds_right);
try vmwrite(vmcs.guest.gs_rights, ds_right);
try vmwrite(vmcs.guest.tr_rights, tr_right);
try vmwrite(vmcs.guest.ldtr_rights, ldtr_right);
CS and DS are set to the same values used by the host. Although Ymir doesn't use TR and LDTR at all, we still need to configure them; otherwise, checks on a VM Entry will fail. That said, the values to assign to these two are essentially fixed, so just accept them as-is for now.
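To double-check the Access Rights values, the following sketch computes the raw 32-bit encodings of the four structures above. The DescriptorType and Granularity enums are stand-ins for the ones in gdt.zig, assuming they encode the descriptor S bit and G bit directly; the resulting values can be compared against the segment flags in the QEMU register dump later in this chapter (for example, 00a09b00 for CS corresponds to 0xA09B in the VMCS format).

```zig
const std = @import("std");

// Stand-ins for the enums from gdt.zig (assumed encodings).
const DescriptorType = enum(u1) { system = 0, code_data = 1 };
const Granularity = enum(u1) { byte = 0, kbyte = 1 };

const SegmentRights = packed struct(u32) {
    accessed: bool = true,
    rw: bool,
    dc: bool,
    executable: bool,
    desc_type: DescriptorType,
    dpl: u2,
    present: bool = true,
    _reserved1: u4 = 0,
    avl: bool = false,
    long: bool = false,
    db: u1,
    granularity: Granularity,
    unusable: bool = false,
    _reserved2: u15 = 0,
};

/// Raw value as it would be written to the VMCS.
fn raw(r: SegmentRights) u32 {
    return @bitCast(r);
}

pub fn main() void {
    const cs = SegmentRights{ .rw = true, .dc = false, .executable = true, .desc_type = .code_data, .dpl = 0, .granularity = .kbyte, .long = true, .db = 0 };
    const ds = SegmentRights{ .rw = true, .dc = false, .executable = false, .desc_type = .code_data, .dpl = 0, .granularity = .kbyte, .long = false, .db = 1 };
    const tr = SegmentRights{ .rw = true, .dc = false, .executable = true, .desc_type = .system, .dpl = 0, .granularity = .byte, .long = false, .db = 0 };
    const ldtr = SegmentRights{ .accessed = false, .rw = true, .dc = false, .executable = false, .desc_type = .system, .dpl = 0, .granularity = .byte, .long = false, .db = 0 };

    std.debug.assert(raw(cs) == 0xA09B); // 64-bit code segment
    std.debug.assert(raw(ds) == 0xC093); // read/write data segment
    std.debug.assert(raw(tr) == 0x008B); // busy 64-bit TSS
    std.debug.assert(raw(ldtr) == 0x0082); // LDT

    std.debug.print("cs=0x{x} ds=0x{x} tr=0x{x} ldtr=0x{x}\n", .{ raw(cs), raw(ds), raw(tr), raw(ldtr) });
}
```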
RIP / RSP / MSR, RFLAGS and so on
Since the guest function blobGuest() does not use the RSP register, there's no need to configure it. RIP should be set to the address of blobGuest(). We also need to initialize RFLAGS. Additionally, certain MSRs can be configured via VMCS fields. Among them, we will set only IA32_EFER in this chapter, as this MSR is required to enable 64-bit mode.
try vmwrite(vmcs.guest.rip, &blobGuest);
try vmwrite(vmcs.guest.efer, am.readMsr(.efer));
try vmwrite(vmcs.guest.rflags, am.FlagsRegister.new());
Finally, we configure the VMCS Link Pointer. This field is used only when VMCS shadowing is enabled. Since we do not use VMCS shadowing in this series, we follow the convention of setting it to 0xFFFF_FFFF_FFFF_FFFF.
try vmwrite(vmcs.guest.vmcs_link_pointer, std.math.maxInt(u64));
VM-Entry Control
This category controls the processor behavior during VM Entry [4]. Fortunately, it requires only a few settings, making it a relatively light and straightforward part of the setup.
pub const EntryCtrl = packed struct(u32) {
pub const Self = @This();
_reserved1: u2,
load_debug_controls: bool,
_reserved2: u6,
ia32e_mode_guest: bool,
entry_smm: bool,
deactivate_dualmonitor: bool,
_reserved3: u1,
load_perf_global_ctrl: bool,
load_ia32_pat: bool,
load_ia32_efer: bool,
load_ia32_bndcfgs: bool,
conceal_vmx_from_pt: bool,
load_rtit_ctl: bool,
load_uinv: bool,
load_cet_state: bool,
load_guest_lbr_ctl: bool,
load_pkrs: bool,
_reserved4: u9,
pub fn load(self: Self) VmxError!void {
const val: u32 = @bitCast(self);
try vmx.vmwrite(ctrl.entry_ctrl, val);
}
pub fn store() VmxError!Self {
const val: u32 = @truncate(try vmx.vmread(ctrl.entry_ctrl));
return @bitCast(val);
}
};
Among these, we set IA-32e Mode Guest (.ia32e_mode_guest). This field indicates that the guest should operate in IA-32e mode after VM Entry. When this is enabled, the IA32_EFER.LMA (Long Mode Active) bit will be set during VM Entry, allowing the processor to run the guest in 64-bit mode.
fn setupEntryCtrls(_: *Vcpu) VmxError!void {
const basic_msr = am.readMsrVmxBasic();
var entry_ctrl = try vmcs.EntryCtrl.store();
entry_ctrl.ia32e_mode_guest = true;
try adjustRegMandatoryBits(
entry_ctrl,
if (basic_msr.true_control) am.readMsr(.vmx_true_entry_ctls) else am.readMsr(.vmx_entry_ctls),
).load();
}
As with the other control fields, the reserved bits here must be configured according to the value of IA32_VMX_ENTRY_CTLS (address: 0x0484) or IA32_VMX_TRUE_ENTRY_CTLS (address: 0x0490), depending on the .true_control bit in IA32_VMX_BASIC.
VM-Exit Control
This category controls the processor behavior during VM Exit [5]. There are two types: Primary and Secondary. However, the Secondary has only one setting, and we only use the Primary in this series. This is another short and simple part of the setup.
pub const PrimaryExitCtrl = packed struct(u32) {
const Self = @This();
_reserved1: u2,
save_debug: bool,
_reserved2: u6,
host_addr_space_size: bool,
_reserved3: u2,
load_perf_global_ctrl: bool,
_reserved4: u2,
ack_interrupt_onexit: bool,
_reserved5: u2,
save_ia32_pat: bool,
load_ia32_pat: bool,
save_ia32_efer: bool,
load_ia32_efer: bool,
save_vmx_preemption_timer: bool,
clear_ia32_bndcfgs: bool,
conceal_vmx_from_pt: bool,
clear_ia32_rtit_ctl: bool,
clear_ia32_lbr_ctl: bool,
clear_uinv: bool,
load_cet_state: bool,
load_pkrs: bool,
save_perf_global_ctl: bool,
activate_secondary_controls: bool,
pub fn load(self: Self) VmxError!void {
const val: u32 = @bitCast(self);
try vmx.vmwrite(ctrl.primary_exit_ctrl, val);
}
pub fn store() VmxError!Self {
const val: u32 = @truncate(try vmx.vmread(ctrl.primary_exit_ctrl));
return @bitCast(val);
}
};
Here, we configure the Host Address-Space Size (.host_addr_space_size). This field indicates that the host operates in 64-bit mode after a VM Exit. When enabled, the IA32_EFER.LME (Long Mode Enable) and IA32_EFER.LMA (Long Mode Active) bits are set after VM Exit, allowing the host to run in 64-bit mode. We also set .load_ia32_efer so that the IA32_EFER value stored in the Host-State is loaded on VM Exit.
fn setupExitCtrls(_: *Vcpu) VmxError!void {
const basic_msr = am.readMsrVmxBasic();
var exit_ctrl = try vmcs.PrimaryExitCtrl.store();
exit_ctrl.host_addr_space_size = true;
exit_ctrl.load_ia32_efer = true;
try adjustRegMandatoryBits(
exit_ctrl,
if (basic_msr.true_control) am.readMsr(.vmx_true_exit_ctls) else am.readMsr(.vmx_exit_ctls),
).load();
}
Here as well, the reserved bits are set by referring to the value of IA32_VMX_EXIT_CTLS (address: 0x0483) or IA32_VMX_TRUE_EXIT_CTLS (address: 0x048F).
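The four control fields configured in this chapter all follow the same pattern for choosing their capability MSR. The sketch below summarizes it in one place. The Msr enum lists the architectural MSR addresses; the Control enum and capabilityMsr() are hypothetical names introduced for this example, and the IA32_VMX_BASIC values passed in main() are synthetic.

```zig
const std = @import("std");

/// Architectural addresses of the VMX capability MSRs used in this chapter.
const Msr = enum(u32) {
    vmx_basic = 0x480,
    vmx_pinbased_ctls = 0x481,
    vmx_procbased_ctls = 0x482,
    vmx_exit_ctls = 0x483,
    vmx_entry_ctls = 0x484,
    vmx_true_pinbased_ctls = 0x48D,
    vmx_true_procbased_ctls = 0x48E,
    vmx_true_exit_ctls = 0x48F,
    vmx_true_entry_ctls = 0x490,
};

/// Hypothetical selector for the control field being configured.
const Control = enum { pin, proc, exit, entry };

/// Pick the capability MSR for a control field, given the IA32_VMX_BASIC value.
fn capabilityMsr(control: Control, vmx_basic: u64) Msr {
    // Bit 55 of IA32_VMX_BASIC: the TRUE_* MSRs are supported and should be used.
    const true_control = ((vmx_basic >> 55) & 1) == 1;
    return switch (control) {
        .pin => if (true_control) Msr.vmx_true_pinbased_ctls else Msr.vmx_pinbased_ctls,
        .proc => if (true_control) Msr.vmx_true_procbased_ctls else Msr.vmx_procbased_ctls,
        .exit => if (true_control) Msr.vmx_true_exit_ctls else Msr.vmx_exit_ctls,
        .entry => if (true_control) Msr.vmx_true_entry_ctls else Msr.vmx_entry_ctls,
    };
}

pub fn main() void {
    const with_true: u64 = 1 << 55; // synthetic value with only bit 55 set
    std.debug.assert(capabilityMsr(.pin, with_true) == .vmx_true_pinbased_ctls);
    std.debug.assert(capabilityMsr(.entry, with_true) == .vmx_true_entry_ctls);
    std.debug.assert(capabilityMsr(.exit, 0) == .vmx_exit_ctls);
    std.debug.print("pin MSR = 0x{x}\n", .{@intFromEnum(capabilityMsr(.pin, with_true))});
}
```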
VMLAUNCH
With this, the VMCS setup is complete. Finally, execute the VMLAUNCH instruction to transition into VMX Non-root Operation.
pub fn loop(_: *Self) VmxError!void {
const rflags = asm volatile (
\\vmlaunch
\\pushf
\\popq %[rflags]
: [rflags] "=r" (-> u64),
);
vmx.vmxtry(rflags) catch |err| {
log.err("VMLAUNCH: {?}", .{err});
log.err("VM-instruction error number: {s}", .{@tagName(try vmx.InstructionError.load())});
};
}
This function first executes the VMLAUNCH instruction. VM Entry can fail in two ways:
- VMLAUNCH itself fails
  - As with other VMX instructions that fail, it reports a VM-instruction error.
  - Execution resumes immediately after the VMLAUNCH instruction inside the loop() function.
- VMLAUNCH succeeds but is immediately followed by a VM Exit
  - This is the case where VMLAUNCH itself succeeds but VM Entry fails.
  - A VM Exit occurs, and execution transfers to the RIP set in the VMCS Host-State. In this case, vmexitBootstrapHandler() is called.
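The first failure case is detected from the RFLAGS value captured right after VMLAUNCH. The sketch below shows how such a check could be written, based on the SDM convention that a VMX instruction sets CF on failure without an error code (VMfailInvalid) and ZF on failure with an error code (VMfailValid). The error names and describe() are assumptions for this example; the actual vmx.vmxtry() from the earlier chapter may differ.

```zig
const std = @import("std");

// Hypothetical error set; the names in Ymir may differ.
const VmxError = error{
    VmxStatusUnavailable, // VMfailInvalid: no error number available
    VmxStatusAvailable, // VMfailValid: see the VM-instruction error field
};

/// Interpret the RFLAGS value captured right after a VMX instruction.
fn vmxtry(rflags: u64) VmxError!void {
    const cf = (rflags & (1 << 0)) != 0; // carry flag
    const zf = (rflags & (1 << 6)) != 0; // zero flag
    if (cf) return VmxError.VmxStatusUnavailable;
    if (zf) return VmxError.VmxStatusAvailable;
}

/// Turn the result into a printable string for the demo below.
fn describe(rflags: u64) []const u8 {
    vmxtry(rflags) catch |err| return @errorName(err);
    return "ok";
}

pub fn main() void {
    // Bit 1 of RFLAGS is always 1, so 0x2 means "no flags set".
    std.debug.print("0x02 -> {s}\n", .{describe(0x02)});
    std.debug.print("0x03 -> {s}\n", .{describe(0x03)}); // CF set
    std.debug.print("0x42 -> {s}\n", .{describe(0x42)}); // ZF set
}
```

Only in the VmxStatusAvailable case does the VM-instruction error field in the VMCS hold a meaningful error number.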
Add a function to Vmx that calls Vcpu.loop(). This will be covered in a later chapter, but as a principle, once Ymir starts running the VM, it disables interrupts on the host side.
pub fn loop(self: *Self) Error!void {
arch.disableIntr();
try self.vcpu.loop();
}
Call it in kernelMain():
// Launch
log.info("Starting the virtual machine...", .{});
try vm.loop();
The output will look like the following:
[INFO ] main | Entered VMX root operation.
[INFO ] main | Starting the virtual machine...
It seems to be stuck in an infinite HLT loop. In this state, let's check the register values using the QEMU monitor.
[INFO ] main | Starting the virtual machine...
QEMU 8.2.2 monitor - type 'help' for more information
(qemu) info registers
CPU#0
RAX=000000000000000a RBX=ffffffff8010c300 RCX=0000000000000000 RDX=00000000000003f8
RSI=0000000000000000 RDI=000000000000000a RBP=ffffffff80514018 RSP=0000000000000000
R8 =0000000000001000 R9 =0000000000000001 R10=0000000000000000 R11=00000000000001fe
R12=0000000080000000 R13=ffffffff8010c300 R14=0000000000001000 R15=0000000000009000
RIP=ffffffff8010a6b1 RFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0000 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0000 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 0000000000dead00 00000000 00008200 DPL=0 LDT
TR =0000 0000000000000000 00000000 00008b00 DPL=0 TSS64-busy
GDT= 0000000000000000 00000000
IDT= 0000000000000000 00000000
CR0=80010033 CR2=0000000000000000 CR3=0000000000001000 CR4=00002668
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d00
There is no direct way to determine whether the processor is in VMX Root Operation or VMX Non-root Operation. The key indicator is the LDT Base. We set this value as a marker (0xDEAD00) in the VMCS Guest-State. Since the current LDT Base is 0xDEAD00, it confirms that the processor has successfully transitioned into VMX Non-root Operation.
Next, let's use addr2line to check which part of the code corresponds to the RIP value 0xFFFFFFFF8010A6B1.
> addr2line -e ./zig-out/bin/ymir.elf 0xFFFFFFFF8010A6B1
/home/lysithea/ymir/ymir/arch/x86/vmx/vcpu.zig:390
> sed -n '390,392p' /home/lysithea/ymir/ymir/arch/x86/vmx/vcpu.zig
asm volatile (
\\hlt
);
We can confirm that it is indeed stuck in the HLT loop. In other words, we've successfully transitioned into VMX Non-root Operation and started running the guest.
As another experiment, let's set the .hlt field to true in the Primary Processor-Based Controls of the Execution Control category. This will cause a VM Exit whenever the guest executes the HLT instruction.
var ppb_exec_ctrl = try vmcs.PrimaryProcExecCtrl.store();
- ppb_exec_ctrl.hlt = false;
+ ppb_exec_ctrl.hlt = true;
ppb_exec_ctrl.activate_secondary_controls = false;
You'll see the following output:
[INFO ] main | Starting the virtual machine...
[DEBUG] vcpu | [VMEXIT handler]
[DEBUG] vcpu | VMEXIT reason: arch.x86.vmx.vmcs.ExitInfo{ .basic_reason = arch.x86.vmcs.ExitReason.hlt, ._zero = 0, ._reserved1 = 0, ._one = 0, .pending_mtf = 0, .exit_vmxroot = false, ._reserved2 = 0, .entry_failure = false }
When the guest executes HLT, a VM Exit occurs and control transfers to the RIP set in the Host-State. The RIP points to vmexitBootstrapHandler(), where the VM Exit Reason is retrieved and displayed. As expected, this time the reason is hlt. This confirms that the VM Exit handler is correctly called.
Summary
In this chapter, we configured the VMCS and transitioned into VMX Non-root Operation. We ran a simple function in the guest that loops on HLT, and confirmed successful VM Entry by checking the marker we placed in the LDTR base. We also set up a VM Exit handler and verified that the VM Exit Reason can be retrieved correctly.
At last, we have successfully run the guest. No matter how small, the guest is still virtualized. At this point, Ymir could arguably call itself a "hypervisor". Or maybe not.
So far, the guest runs while inheriting almost all of the host’s register state. Conversely, the host also inherits the register state as it was at the time of the VM Exit. In the next chapter, we will implement proper saving of the guest and host states, including registers, during VM Entry and VM Exit.
[1] SDM Vol.3C 25.6.1 Pin-Based VM-Execution Controls
[2] SDM Vol.3C 25.6.2 Processor-Based VM-Execution Controls
[3] When accessing segment registers directly using the MOV instruction, the hidden part of the register is cleared to zero.
[4] SDM Vol.3C 25.8.1 VM-Entry Control Fields
[5] SDM Vol.3C 25.7.1 VM-Exit Control Fields