Virtualizing I/O
In the previous chapter, the guest caused an EPT violation. This happened because I/O access was not virtualized, allowing the guest to directly interact with host devices and receive information about addresses it should not have access to. In this chapter, we will start virtualizing I/O accesses. To boot Linux, we only need to virtualize a small set of devices—others can simply be passed through or disabled.
important
The source code for this chapter is in the whiz-vmm-io branch.
Boilerplate
First, let's write a skeleton for the VM Exit handler triggered by I/O access. Similar to CR access, VM Exits caused by I/O access also provide an Exit Qualification field.
Exit Qualification for I/O Instructions. SDM Vol.3C Table 28-5.
The qualification field contains information such as access length, access direction, and port number. Let's define a structure that represents the Exit Qualification for I/O access:
pub const qual = struct {
pub const QualIo = packed struct(u64) {
/// Size of access.
size: Size,
/// Direction of the attempted access.
direction: Direction,
/// String instruction.
string: bool,
/// Rep prefix.
rep: bool,
/// Operand encoding.
operand_encoding: OperandEncoding,
/// Not used.
_reserved2: u9,
/// Port number.
port: u16,
/// Not used.
_reserved3: u32,
const Size = enum(u3) {
/// Byte.
byte = 0,
/// Word.
word = 1,
/// Dword.
dword = 3,
};
const Direction = enum(u1) {
out = 0,
in = 1,
};
const OperandEncoding = enum(u1) {
/// I/O instruction uses DX register as port number.
dx = 0,
/// I/O instruction uses immediate value as port number.
imm = 1,
};
};
...
}
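To make the bit layout concrete, here is a minimal sketch that reinterprets a raw 64-bit value as the structure above using @bitCast. The struct is a trimmed copy of the chapter's QualIo. decodeQual and the two sample values are hypothetical, chosen only for illustration, and are not part of Ymir.

```zig
const std = @import("std");

/// Trimmed copy of the chapter's QualIo, kept only for this illustration.
const QualIo = packed struct(u64) {
    size: Size, // bits 2:0
    direction: Direction, // bit 3
    string: bool, // bit 4
    rep: bool, // bit 5
    operand_encoding: OperandEncoding, // bit 6
    _reserved2: u9, // bits 15:7
    port: u16, // bits 31:16
    _reserved3: u32, // bits 63:32

    const Size = enum(u3) { byte = 0, word = 1, dword = 3 };
    const Direction = enum(u1) { out = 0, in = 1 };
    const OperandEncoding = enum(u1) { dx = 0, imm = 1 };
};

/// Hypothetical helper: reinterpret a raw Exit Qualification as QualIo.
fn decodeQual(raw: u64) QualIo {
    return @bitCast(raw);
}

/// Raw value for `out dx, al` with DX = 0x3F8: only the port bits are set.
const sample_out: u64 = 0x03F8_0000;
/// Raw value for `in al, dx` with DX = 0x3FD: port bits plus the direction bit.
const sample_in: u64 = 0x03FD_0008;
```

Because the fields of a packed struct are laid out starting from the least significant bit, the field order in the definition mirrors the bit positions in the SDM table directly.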
Ymir will only support OUT and IN instructions. Instructions like OUTS and INS, indicated by the .string field in the qualification, will not be supported. However, aside from the fact that these instructions use memory instead of registers as the source or destination, they are not particularly difficult to handle, so feel free to try implementing them if you're interested.
Next, we'll define the handler for I/O access. The handler will be separated based on the direction of access (IN or OUT), and each handler will process the access depending on the port number. For now, since we haven't defined handlers for any specific port, we'll catch all port accesses using an else branch and abort:
pub fn handleIo(vcpu: *Vcpu, qual: QualIo) VmxError!void {
return switch (qual.direction) {
.in => try handleIoIn(vcpu, qual),
.out => try handleIoOut(vcpu, qual),
};
}
fn handleIoIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
else => {
log.err("Unhandled I/O-in port: 0x{X}", .{qual.port});
vcpu.abort();
},
}
}
fn handleIoOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
else => {
log.err("Unhandled I/O-out port: 0x{X}", .{qual.port});
vcpu.abort();
},
}
}
In the VM Exit handler Vcpu.handleExit(), when the Exit Reason is .io, retrieve the Exit Qualification. Passing the obtained qualification to the previously defined handler completes the basic skeleton.
fn handleExit(self: *Self, exit_info: vmx.ExitInfo) VmxError!void {
switch (exit_info.basic_reason) {
.io => {
const q = try getExitQual(qual.QualIo);
try io.handleIo(self, q);
try self.stepNextInst();
},
...
}
}
To cause I/O access to trigger VM Exit, you need to set the .unconditional_io field in the Primary Processor-Based VM-Execution Controls category of the VMCS Execution Controls.
fn setupExecCtrls(vcpu: *Vcpu, _: Allocator) VmxError!void {
...
ppb_exec_ctrl.unconditional_io = true;
...
}
This causes all I/O accesses to trigger VM Exit. To decide per port number whether a VM Exit occurs, you can use the I/O bitmap instead. Each bit of the I/O bitmap corresponds to one port number, and setting a bit to 1 means I/O access to that port triggers a VM Exit. You can enable the I/O bitmap by setting the .use_io_bitmap field in Primary Processor-Based VM-Execution Controls. However, in this series, we will use the unconditional I/O exit and will not use the I/O bitmap.
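Although this series does not use it, the port-to-bit mapping of the I/O bitmap can be sketched as follows. VMX uses two 4 KiB bitmaps: bitmap A covers ports 0x0000-0x7FFF and bitmap B covers ports 0x8000-0xFFFF. IoBitmap, setExit, and exits are hypothetical names for this illustration. Writing the bitmap addresses into the VMCS is left out.

```zig
const std = @import("std");

/// Hypothetical model of the two I/O bitmaps. On real hardware each bitmap
/// must sit in its own 4 KiB-aligned physical page referenced from the VMCS.
const IoBitmap = struct {
    /// Bitmap A covers ports 0x0000-0x7FFF.
    a: [4096]u8 = [_]u8{0} ** 4096,
    /// Bitmap B covers ports 0x8000-0xFFFF.
    b: [4096]u8 = [_]u8{0} ** 4096,

    /// Make accesses to `port` cause (or stop causing) a VM Exit.
    fn setExit(self: *IoBitmap, port: u16, exit: bool) void {
        const page: *[4096]u8 = if (port < 0x8000) &self.a else &self.b;
        const index: usize = (port & 0x7FFF) / 8;
        const mask = @as(u8, 1) << @as(u3, @truncate(port));
        if (exit) {
            page[index] |= mask;
        } else {
            page[index] &= ~mask;
        }
    }

    /// Report whether an access to `port` would cause a VM Exit.
    fn exits(self: *const IoBitmap, port: u16) bool {
        const page: *const [4096]u8 = if (port < 0x8000) &self.a else &self.b;
        const index: usize = (port & 0x7FFF) / 8;
        const mask = @as(u8, 1) << @as(u3, @truncate(port));
        return (page[index] & mask) != 0;
    }
};
```

With this layout, intercepting only COM1's transmit register would mean calling setExit(0x3F8, true) and leaving every other bit clear.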
PCI (Unsupported)
The first I/O port Linux accesses after booting is the PCI configuration address located at 0x0CF8. PCI is a standard used for communication with peripheral devices, and by accessing the configuration address and the PCI configuration data at 0x0CFC, the system can enumerate PCI devices.
Each device has its own space described by a BAR: Base Address Register, and Linux accesses the detected device's BAR to retrieve and configure device information. Until now, PCI was not virtualized, and the host's PCI space was directly exposed to the guest. The BAR addresses given to the guest are not present within the guest physical address space set up by EPT. As a result, when the guest tries to access a BAR, an EPT violation occurs. This explains the EPT violation abort seen at the end of the previous chapter.
Unfortunately, Ymir does not support PCI virtualization. Serial input and output are sufficient for booting and running a shell, so PCI is not necessary for this purpose. That said, enabling PCI would allow the use of external storage, external keyboards, and NICs, which makes things much more interesting. If you're interested, I encourage you to give it a try.
Disabling PCI is as simple as returning 0 in RAX. Before using PCI, Linux performs a process called probing, and by always setting RAX to 0, probing will fail. When probing fails, Linux stops using PCI from that point onward.
fn handleIoIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.port) {
0x0CF8...0x0CFF => regs.rax = 0, // PCI. Unimplemented.
0xC000...0xCFFF => {}, // Old PCI. Ignore.
...
}
}
fn handleIoOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
0x0CF8...0x0CFF => {}, // PCI. Unimplemented.
0xC000...0xCFFF => {}, // Old PCI. Ignore.
...
}
}
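For reference, the value a guest writes to 0x0CF8 follows the standard PCI configuration mechanism #1 layout. The sketch below models that layout. ConfigAddress and encode are hypothetical names for this illustration. Ymir never decodes this value, since it simply ignores the write.

```zig
const std = @import("std");

/// Hypothetical model of the 32-bit value written to port 0x0CF8.
const ConfigAddress = packed struct(u32) {
    _zero: u2 = 0, // bits 1:0, always zero (registers are dword aligned)
    register: u6, // bits 7:2, dword index into the configuration space
    function: u3, // bits 10:8
    device: u5, // bits 15:11
    bus: u8, // bits 23:16
    _reserved: u7 = 0, // bits 30:24
    enable: bool, // bit 31, must be set for 0x0CFC to reach the device
};

/// Pack the address into the raw value the guest would write with OUT.
fn encode(addr: ConfigAddress) u32 {
    return @bitCast(addr);
}
```

For example, selecting register 0 of bus 0, device 1, function 0 encodes to 0x80000800.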
Serial
First and foremost, serial output. It's a universal truth unchanged since the Earth was formed 4.6 billion years ago, and of course, it hasn't changed since I was born 7 years ago. Let's start by virtualizing the serial port. Up until now, the guest Linux has been using the serial port to output logs. Since the serial port wasn't virtualized, the guest was directly accessing the host's serial port. Honestly, this setup isn't necessarily problematic. Although there might be some issues related to interrupts, it will probably work as is. However, since we have the opportunity, Ymir will virtualize the serial port as well.
As we covered in the Serial Output chapter, its I/O ports look as follows:
Ports | Description |
---|---|
0x02E8 - 0x02EF | COM4 |
0x02F8 - 0x02FF | COM2 |
0x03E8 - 0x03EF | COM3 |
0x03F8 - 0x03FF | COM1 |
Ymir supports only COM1 port.
As covered in the Serial Output chapter again, the serial port maps 8 I/O ports to 12 registers. Among these, we will virtualize the following registers. Note that registers listed side-by-side share the same port but are mapped differently depending on the value of DLAB and the access direction (read/write):
- TX / (DLL): RX is passed through. For TX, Ymir will handle the write on behalf of the guest.
- IER / DLH: Save the value separately.
- MCR: Save the value separately.
- LCR: Always 0 for read / Write is ignored
- SR: Always 0 for read / Write is ignored
- FCR: Write is ignored
All the following registers are passed through:
- RX
- IIR
- LSR: Read-Only
- MSR: Read-Only
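The way 8 ports fan out to 12 registers can be sketched as a small lookup. Reg, Dir, and decodeReg are hypothetical names for this illustration, and offset is the port number minus the COM base (0x3F8 for COM1). Ymir's handlers do not track DLAB, so this only shows the hardware mapping.

```zig
const std = @import("std");

/// The 12 registers of the serial port.
const Reg = enum { rx, tx, dll, ier, dlh, iir, fcr, lcr, mcr, lsr, msr, sr };

/// Direction of the access as seen by the guest.
const Dir = enum { read, write };

/// Hypothetical helper: which register an access at `offset` reaches.
fn decodeReg(offset: u3, dlab: bool, dir: Dir) Reg {
    return switch (offset) {
        // Offset 0: divisor latch low when DLAB=1, otherwise RX or TX.
        0 => if (dlab) Reg.dll else if (dir == .read) Reg.rx else Reg.tx,
        // Offset 1: divisor latch high when DLAB=1, otherwise IER.
        1 => if (dlab) Reg.dlh else Reg.ier,
        // Offset 2: IIR on read, FCR on write.
        2 => if (dir == .read) Reg.iir else Reg.fcr,
        3 => Reg.lcr,
        4 => Reg.mcr,
        5 => Reg.lsr,
        6 => Reg.msr,
        7 => Reg.sr,
    };
}
```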
For the two registers among the virtualized registers whose values need to be stored separately from the actual serial registers (IER and MCR), define a structure to hold these values and include it in the Vcpu structure.
pub const Serial = struct {
/// Interrupt Enable Register.
ier: u8 = 0,
/// Modem Control Register.
mcr: u8 = 0,
pub fn new() Serial {
return Serial{};
}
};
First, let's virtualize read accesses. The registers to be read are RX, DLL, IER, DLH, IIR, LCR, MCR, LSR, MSR, and SR.
fn handleSerialIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.port) {
// Receive buffer.
0x3F8 => regs.rax = am.inb(qual.port), // pass-through
// Interrupt Enable Register (DLAB=0) / Divisor Latch High Register (DLAB=1).
0x3F9 => regs.rax = vcpu.serial.ier,
// Interrupt Identification Register.
0x3FA => regs.rax = am.inb(qual.port), // pass-through
// Line Control Register (MSB is DLAB).
0x3FB => regs.rax = 0x00,
// Modem Control Register.
0x3FC => regs.rax = vcpu.serial.mcr,
// Line Status Register.
0x3FD => regs.rax = am.inb(qual.port), // pass-through
// Modem Status Register.
0x3FE => regs.rax = am.inb(qual.port), // pass-through
// Scratch Register.
0x3FF => regs.rax = 0, // 8250
else => {
log.err("Unsupported I/O-in to the first serial port: 0x{X}", .{qual.port});
vcpu.abort();
},
}
}
The handling for each register corresponds to the list provided earlier. For registers that are passed through, the IN instruction is used to read the value directly from the actual serial register and then passed to the guest as-is. For registers whose values are maintained separately, the stored values are passed to the guest.
Next, we define the handler for write accesses. Except for writes to TX, no other writes are passed through. This prevents the guest's output from affecting the host. The registers involved in write operations are TX, DLL, IER, DLH, FCR, LCR, MCR, and SR. Among the registers listed, LSR and MSR are read-only, so writes to them do not occur. Therefore, the number of branches is reduced.
const sr = arch.serial;
fn handleSerialOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.port) {
// Transmit buffer.
0x3F8 => sr.writeByte(@truncate(regs.rax), .com1),
// Interrupt Enable Register.
0x3F9 => vcpu.serial.ier = @truncate(regs.rax),
// FIFO control registers.
0x3FA => {}, // ignore
// Line Control Register (MSB is DLAB).
0x3FB => {}, // ignore
// Modem Control Register.
0x3FC => vcpu.serial.mcr = @truncate(regs.rax),
// Scratch Register.
0x3FF => {}, // ignore
else => {
log.err("Unsupported I/O-out to the first serial port: 0x{X}", .{qual.port});
vcpu.abort();
},
}
}
Writes to TX are handled by Ymir using Serial.writeByte() implemented in the Serial Output chapter.
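The combined effect of the two handlers can be modeled without any hardware. The sketch below is a hypothetical stand-in (VirtSerial, write, read, and last_tx are not part of Ymir) that records what would be forwarded to the physical port and what is only kept in software. Unlike the chapter's handler, it silently drops writes to the read-only registers instead of aborting.

```zig
const std = @import("std");

/// Hypothetical model of the virtual COM1 state.
const VirtSerial = struct {
    ier: u8 = 0,
    mcr: u8 = 0,
    /// Last byte that would have been sent to the physical TX register.
    last_tx: ?u8 = null,

    /// Guest OUT to COM1. `offset` is the port number minus 0x3F8.
    fn write(self: *VirtSerial, offset: u3, value: u8) void {
        switch (offset) {
            0 => self.last_tx = value, // Ymir would call its own writeByte here
            1 => self.ier = value, // stored, never reaches the hardware
            4 => self.mcr = value, // stored, never reaches the hardware
            else => {}, // FCR, LCR, SR: dropped (this sketch also drops LSR/MSR)
        }
    }

    /// Guest IN from COM1. Returns null when the real register is read instead.
    fn read(self: *const VirtSerial, offset: u3) ?u8 {
        return switch (offset) {
            1 => self.ier,
            4 => self.mcr,
            3, 7 => 0, // LCR and SR always read as zero
            else => null, // RX, IIR, LSR, MSR are passed through
        };
    }
};
```

The point of keeping IER and MCR in software is that the guest reads back exactly what it wrote, while the host's own serial configuration stays untouched.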
tip
By setting the I/O bitmap, you can control whether a VM Exit occurs on a per-port basis. In serial virtualization, some registers can simply be passed through. For those registers, you can disable VM Exits on their ports using the I/O bitmap, reducing the overhead caused by VM Exits.
Add the implemented serial handler to the previously defined I/O handler. As mentioned earlier, only the COM1 port is supported; all other ports should simply be ignored.
fn handleIoIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.port) {
0x02E8...0x02EF => {}, // Fourth serial port. Ignore.
0x02F8...0x02FF => {}, // Second serial port. Ignore.
0x03E8...0x03EF => {}, // Third serial port. Ignore.
0x03F8...0x03FF => try handleSerialIn(vcpu, qual),
...
}
}
fn handleIoOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
0x02E8...0x02EF => {}, // Fourth serial port. Ignore.
0x02F8...0x02FF => {}, // Second serial port. Ignore.
0x03E8...0x03EF => {}, // Third serial port. Ignore.
0x03F8...0x03FF => try handleSerialOut(vcpu, qual),
...
}
}
When running the guest up to this point, the boot process should reach a fairly advanced stage, though not quite as far as in the previous chapter. Most importantly, just like before, you will see serial logs being output. Visually, nothing looks different, but unlike before, the serial output is now virtualized by Ymir. After some progress, you will likely see an abort accompanied by an error message like Unhandled I/O-out port: 0x43. This port corresponds to the PIT. Next, let's virtualize this device.
PIT
PIT: Programmable Interval Timer is a device that generates periodic interrupts. It is connected to IRQ 0, allowing the OS to receive interrupts from the PIT to measure time necessary for scheduling and other tasks.
The Ymir kernel does not have a scheduler, nor does it have any other components that require a timer. Therefore, PIT is never used. As a result, PIT will be passed through to the guest without virtualization:
fn handlePitIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.size) {
.byte => regs.rax = @as(u64, am.inb(qual.port)),
.word => regs.rax = @as(u64, am.inw(qual.port)),
.dword => regs.rax = @as(u64, am.inl(qual.port)),
}
}
fn handlePitOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.size) {
.byte => am.outb(@truncate(vcpu.guest_regs.rax), qual.port),
.word => am.outw(@truncate(vcpu.guest_regs.rax), qual.port),
.dword => am.outl(@truncate(vcpu.guest_regs.rax), qual.port),
}
}
This can be implemented more easily and efficiently by using the previously mentioned I/O bitmap. If you're interested, please refer to the official Ymir repository while trying your own implementation. This time, we only branch based on the access size and pass the value through. The handlers will be called on access to the PIT:
fn handleIoIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
0x0040...0x0047 => try handlePitIn(vcpu, qual),
...
}
}
fn handleIoOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
0x0040...0x0047 => try handlePitOut(vcpu, qual),
...
}
}
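One architectural detail is worth keeping in mind with these pass-through handlers. On real hardware, IN into AL or AX leaves the upper bits of RAX untouched, while IN into EAX zero-extends into RAX. The handlers above overwrite all of RAX, which evidently works for the boot flow in this chapter. If you want to mirror the architectural behavior, a hypothetical helper could look like the following sketch. mergeIn is not part of Ymir, and Size is a local copy of the chapter's enum.

```zig
const std = @import("std");

/// Local copy of the access size enum from the chapter.
const Size = enum(u3) { byte = 0, word = 1, dword = 3 };

/// Hypothetical helper: compute the new RAX after an IN of the given size.
fn mergeIn(rax: u64, value: u32, size: Size) u64 {
    return switch (size) {
        // AL is replaced, bits 63:8 are preserved.
        .byte => (rax & ~@as(u64, 0xFF)) | @as(u64, value & 0xFF),
        // AX is replaced, bits 63:16 are preserved.
        .word => (rax & ~@as(u64, 0xFFFF)) | @as(u64, value & 0xFFFF),
        // Writing EAX clears bits 63:32.
        .dword => @as(u64, value),
    };
}
```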
PIC
Now, the main focus: virtualizing the PIC. Like other components we virtualize, the goal is to have it work for now, not to fully implement every feature. The PIC has the following I/O ports:
Port | Description |
---|---|
0x20 | Primary PIC Command |
0x21 | Primary PIC Data |
0xA0 | Secondary PIC Command |
0xA1 | Secondary PIC Data |
As covered in the Ymir kernel's PIC initialization, PIC initialization is performed step-by-step using a set of commands called ICW. To recap, the initialization proceeds as follows:
- ICW1: Starts the initialization.
- ICW2: Sets the offset of the interrupt vector.
- ICW3: Configures the secondary PIC.
- ICW4: Sets some modes.
After initialization, interrupt configuration and EOIs are issued using a set of commands called OCW.
On the host side, the following information will be stored. For other settings, Ymir will use the configurations already applied to the PIC, and requests from the guest will be ignored:
- Initialization phase
- Interrupt mask
- Interrupt vector
Each piece of information needs to be stored separately for both primary and secondary PICs. Define a structure representing the virtualized PIC and include it within the Vcpu structure:
pub const Pic = struct {
/// Mask of the primary PIC.
primary_mask: u8,
/// Mask of the secondary PIC.
secondary_mask: u8,
/// Initialization phase of the primary PIC.
primary_phase: InitPhase = .uninitialized,
/// Initialization phase of the secondary PIC.
secondary_phase: InitPhase = .uninitialized,
/// Vector offset of the primary PIC.
primary_base: u8 = 0,
/// Vector offset of the secondary PIC.
secondary_base: u8 = 0,
const InitPhase = enum {
uninitialized, // Before ICW1 is sent
phase1, // After ICW1 is sent
phase2, // After ICW2 is sent
phase3, // After ICW3 is sent
inited, // ICW4 is sent and initialization is complete
};
pub fn new() Pic {
return Pic{
.primary_mask = 0xFF,
.secondary_mask = 0xFF,
};
}
};
First, define the handler for IN instructions from the PIC. Assume that reads only happen from the data port. Also, reads occur only before or after PIC initialization, never during it. Although the value read from the data port normally depends on the most recent OCW type, in this series we always return the interrupt mask. This is not strictly correct, but Linux works fine with this simplification. If you have time, try saving the last OCW command and returning values based on it for a more accurate implementation:
fn handlePicIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
const pic = &vcpu.pic;
switch (qual.port) {
// Primary PIC data.
0x21 => switch (pic.primary_phase) {
.uninitialized, .inited => regs.rax = pic.primary_mask,
else => vcpu.abort(),
},
// Secondary PIC data.
0xA1 => switch (pic.secondary_phase) {
.uninitialized, .inited => regs.rax = pic.secondary_mask,
else => vcpu.abort(),
},
else => vcpu.abort(),
}
}
Branch based on whether the accessed port corresponds to primary or secondary. Since reads are assumed to occur only before or after initialization, abort if a read happens during it. The implementation is straightforward: just return the stored interrupt mask.
Next, let's define the handler for OUT instruction:
fn handlePicOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
const pic = &vcpu.pic;
const dx: u8 = @truncate(regs.rax);
switch (qual.port) {
// Primary PIC command.
0x20 => switch (dx) {
0x11 => pic.primary_phase = .phase1,
// Specific-EOI.
// It's Ymir's responsibility to send EOI, so guests are not allowed to send EOI.
0x60...0x67 => {},
else => vcpu.abort(),
},
// Primary PIC data.
0x21 => switch (pic.primary_phase) {
.uninitialized, .inited => pic.primary_mask = dx,
.phase1 => {
log.info("Primary PIC vector offset: 0x{X}", .{dx});
pic.primary_base = dx;
pic.primary_phase = .phase2;
},
// ICW3 for the primary: bitmask of the IRQ line the secondary is attached to.
.phase2 => if (dx != (1 << 2)) {
vcpu.abort();
} else {
pic.primary_phase = .phase3;
},
.phase3 => pic.primary_phase = .inited,
},
// Secondary PIC command.
0xA0 => switch (dx) {
0x11 => pic.secondary_phase = .phase1,
// Specific-EOI.
// It's Ymir's responsibility to send EOI, so guests are not allowed to send EOI.
0x60...0x67 => {},
else => vcpu.abort(),
},
// Secondary PIC data.
0xA1 => switch (pic.secondary_phase) {
.uninitialized, .inited => pic.secondary_mask = dx,
.phase1 => {
log.info("Secondary PIC vector offset: 0x{X}", .{dx});
pic.secondary_base = dx;
pic.secondary_phase = .phase2;
},
// ICW3 for the secondary: its cascade identity (the IRQ number on the primary).
.phase2 => if (dx != 2) {
vcpu.abort();
} else {
pic.secondary_phase = .phase3;
},
.phase3 => pic.secondary_phase = .inited,
},
else => vcpu.abort(),
}
}
First, only two types of values are expected to be written to the command port. The first is 0x11, which corresponds to ICW1. This starts the initialization, so .phase1 is set on the virtual PIC. The second is the range from 0x60 to 0x67, representing EOIs. The offset between each value and 0x60 corresponds to the IRQ number for the EOI. As covered in the next chapter, in this series, sending EOIs to the PIC is Ymir's responsibility. Guests are never allowed to send EOIs directly. Therefore, EOIs sent by the guest are simply ignored.
Writes to the data port are handled differently depending on the initialization phase. Before and after initialization, all writes are treated as writes to the interrupt mask (IMR) and stored in the virtual PIC's .{primary,secondary}_mask. This mask is later used when injecting interrupts into the guest. Writes during .phase1 correspond to ICW2, which sets the interrupt vector offset. Writes during .phase2 correspond to ICW3, setting the secondary PIC's cascade connection. Ymir assumes the secondary PIC is cascaded to primary PIC's IRQ2. Writes during .phase3 correspond to ICW4, which configures the modes. Since Ymir reuses the mode values it set directly on the PIC, the request by the guest is ignored.
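The relationship between a specific-EOI command byte and its IRQ number can be sketched as below. specificEoiIrq is a hypothetical helper, not part of Ymir. It relies on the OCW2 encoding in which bits 7:5 are 0b011 for a specific EOI, bits 4:3 are zero, and bits 2:0 carry the IRQ level.

```zig
const std = @import("std");

/// Hypothetical helper: return the IRQ line targeted by a specific-EOI
/// command written to a PIC command port, or null for any other command.
fn specificEoiIrq(cmd: u8) ?u3 {
    // 0x60...0x67: the upper five bits are fixed, the lower three vary.
    if ((cmd & 0xF8) != 0x60) return null;
    return @as(u3, @truncate(cmd));
}
```

For instance, 0x60 acknowledges IRQ 0 (the timer on the primary PIC), while 0x11 (ICW1) yields null.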
Other Ports
For other ports, either ignore or abort. Ports accessed by Linux before boot completion will be ignored; all other accesses will cause an abort. The ports accessed by Linux are as follows:
Ports | Description |
---|---|
[0x0060, 0x0064] | PS/2 |
[0x0070, 0x0071] | RTC |
[0x0080, 0x008F] | DMA |
[0x03B0, 0x03DF] | VGA |
Finally, the I/O exit handler looks as follows:
fn handleIoIn(vcpu: *Vcpu, qual: QualIo) VmxError!void {
const regs = &vcpu.guest_regs;
switch (qual.port) {
0x0020...0x0021 => try handlePicIn(vcpu, qual),
0x0040...0x0047 => try handlePitIn(vcpu, qual),
0x0060...0x0064 => regs.rax = 0, // PS/2. Unimplemented.
0x0070...0x0071 => regs.rax = 0, // RTC. Unimplemented.
0x0080...0x008F => {}, // DMA. Unimplemented.
0x00A0...0x00A1 => try handlePicIn(vcpu, qual),
0x02E8...0x02EF => {}, // Fourth serial port. Ignore.
0x02F8...0x02FF => {}, // Second serial port. Ignore.
0x03B0...0x03DF => regs.rax = 0, // VGA. Unimplemented.
0x03E8...0x03EF => {}, // Third serial port. Ignore.
0x03F8...0x03FF => try handleSerialIn(vcpu, qual),
0x0CF8...0x0CFF => regs.rax = 0, // PCI. Unimplemented.
0xC000...0xCFFF => {}, // Old PCI. Ignore.
else => vcpu.abort(),
}
}
fn handleIoOut(vcpu: *Vcpu, qual: QualIo) VmxError!void {
switch (qual.port) {
0x0020...0x0021 => try handlePicOut(vcpu, qual),
0x0040...0x0047 => try handlePitOut(vcpu, qual),
0x0060...0x0064 => {}, // PS/2. Unimplemented.
0x0070...0x0071 => {}, // RTC. Unimplemented.
0x0080...0x008F => {}, // DMA. Unimplemented.
0x00A0...0x00A1 => try handlePicOut(vcpu, qual),
0x02E8...0x02EF => {}, // Fourth serial port. Ignore.
0x02F8...0x02FF => {}, // Second serial port. Ignore.
0x03B0...0x03DF => {}, // VGA. Unimplemented.
0x03E8...0x03EF => {}, // Third serial port. Ignore.
0x03F8...0x03FF => try handleSerialOut(vcpu, qual),
0x0CF8...0x0CFF => {}, // PCI. Unimplemented.
0xC000...0xCFFF => {}, // Old PCI. Ignore.
else => vcpu.abort(),
}
}
Summary
In this chapter, we virtualized I/O port accesses for the serial port and PIC. For other ports, we ignored only those necessary and aborted on the rest. The virtualization of the serial port and PIC is not fully complete, but it is sufficient to run Linux for now.
Now, let's try running the guest up to this point:
...
[ 0.000000] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.000000] Fallback order for Node 0: 0
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 24944
[ 0.000000] Policy zone: DMA32
[ 0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[ 0.000000] Memory: 59616K/102012K available (18432K kernel code, 2792K rwdata, 6704K rodata, 2704K init, 1292K bss, 42140K reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] Kernel/User page tables isolation: enabled
Poking KASLR using i8254...
[ 0.000000] Dynamic Preempt: voluntary
[ 0.000000] rcu: Preemptible hierarchical RCU implementation.
[ 0.000000] rcu: RCU event tracing is enabled.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=1.
[ 0.000000] Trampoline variant of Tasks RCU enabled.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[ 0.000000] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[ 0.000000] NR_IRQS: 4352, nr_irqs: 32, preallocated irqs: 16
[INFO ] vmio | Primary PIC vector offset: 0x30
[INFO ] vmio | Secondary PIC vector offset: 0x38
[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.000000] Console: colour dummy device 80x25
[ 0.000000] printk: legacy console [ttyS0] enabled
[ 0.000000] printk: legacy console [ttyS0] enabled
[ 0.000000] printk: legacy bootconsole [earlyser0] disabled
[ 0.000000] printk: legacy bootconsole [earlyser0] disabled
[ 0.000000] APIC disabled via kernel command line
[ 0.000000] APIC: Keep in PIC mode(8259)
As in the previous chapter, the first part of the log is omitted. The boot process has progressed quite far. From the intermediate log output, you can see that the vector offsets for the primary and secondary PICs are set to 0x30 and 0x38, respectively. These match the values used by Ymir, which makes sense since Ymir was configured to match Linux.
At the end, the guest neither triggers a VM Exit nor aborts, but stops making progress. When opening the QEMU monitor or attaching GDB to check RIP, it shows 0xFFFFFFFF81002246. Using addr2line to find the corresponding source reveals that it points to the function calibrate_delay_converge().
/* wait for "start of" clock tick */
ticks = jiffies;
while (ticks == jiffies)
; /* nothing */
/* Go .. */
ticks = jiffies;
This function waits endlessly until jiffies changes. jiffies is aliased to jiffies_64 in vmlinux.lds.S. jiffies_64 is incremented in do_timer(). This function is registered as the handler for IRQ 0 in hpet_time_init(). From this, we can deduce that jiffies should increment when IRQ 0 sends an interrupt; since it doesn't, the code never exits the while loop and freezes indefinitely.
Why isn't the guest receiving timer interrupts? First, since the PIT is passed through to the guest, there should be no problem with the PIT configuration. Also, .external_interrupt is not set in the Pin-Based VM-Execution Controls of the VMCS, so external interrupts do not cause VM Exits and are delivered directly to the guest. Try to think about why timer interrupts are not being delivered. Well, to give away the answer quickly: the issue is that EOIs are ignored in the PIC we virtualized this time. As discussed in the PIC chapter, the register called IRR holds the list of pending IRQs. When the CPU acknowledges an IRQ, the IRR bit is cleared and the IRQ is set in the ISR instead. While an IRQ is set in the ISR, the PIC stops sending interrupts of the same or lower priority to the CPU. The IRQ set in the ISR is cleared by sending an EOI. If EOIs are ignored, the ISR bit stays set forever. Because the timer is IRQ 0, which has the highest priority, its stuck ISR bit blocks every subsequent interrupt, including the timer itself.
This is the reason why timer interrupts do not occur and the infinite loop happens. Note that this behavior is intentional. In Ymir, sending EOIs is the responsibility of Ymir itself; the guest is not allowed to send EOIs directly. This setup allows not only the guest but also Ymir to receive interrupts (although Ymir doesn't actually do anything upon receiving external interrupts...). In the next chapter, we will handle interrupts properly on the host side, sending EOIs to the PIC and injecting interrupts into the guest.
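The IRR/ISR behavior described above can be reproduced with a small software model. This is a simplified sketch of a single PIC in its default priority mode, where IRQ 0 has the highest priority. Pic8259, raise, ack, and eoi are hypothetical names for this illustration and are unrelated to Ymir's Pic structure.

```zig
const std = @import("std");

/// Simplified model of one PIC: pending (IRR), in-service (ISR), mask (IMR).
const Pic8259 = struct {
    irr: u8 = 0,
    isr: u8 = 0,
    imr: u8 = 0,

    /// A device asserts its IRQ line.
    fn raise(self: *Pic8259, irq: u3) void {
        self.irr |= @as(u8, 1) << irq;
    }

    /// The CPU acknowledges an interrupt. Returns the delivered IRQ, or null
    /// when nothing can be delivered.
    fn ack(self: *Pic8259) ?u3 {
        const pending = self.irr & ~self.imr;
        var irq: u8 = 0;
        while (irq < 8) : (irq += 1) {
            const bit = @as(u8, 1) << @as(u3, @intCast(irq));
            // An in-service IRQ blocks itself and everything of lower priority.
            if ((self.isr & bit) != 0) return null;
            if ((pending & bit) != 0) {
                self.irr &= ~bit;
                self.isr |= bit;
                return @as(u3, @intCast(irq));
            }
        }
        return null;
    }

    /// Specific EOI: clear the in-service bit for `irq`.
    fn eoi(self: *Pic8259, irq: u3) void {
        self.isr &= ~(@as(u8, 1) << irq);
    }
};
```

In this model, calling raise(0) and ack() delivers the first timer tick. A second raise(0) followed by ack() returns null until eoi(0) is called, which is the situation the guest is stuck in at the end of this chapter.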