Interrupt Injection
In the previous chapter, timer interrupts stopped arriving because the EOI was never properly delivered to the PIC, which caused the guest to freeze. In this chapter, we will establish a mechanism to share interrupts by dividing roles between the guest and the host. As part of this, the host will inject interrupts into the guest at VM Entry.
Important: The source code for this chapter is in the whiz-vmm-intr_injection branch.
Table of Contents
- Sharing Interrupts
- VM Exit by Interrupts
- Subscriber
- Pending IRQs
- IRQ Injection
- Accepting Interrupts
- HLT
- Summary
Sharing Interrupts
The content covered in this chapter is somewhat special. Therefore, I will start by summarizing the purpose of what we want to achieve before diving into the details.
In this series, both Ymir and the guest will be able to receive interrupts. When an interrupt occurs during guest execution, a VM Exit is triggered, and Ymir receives the interrupt first and calls its own handler. Ymir's interrupt handler sends an EOI so that the PIC can generate further interrupts. However, the EOI causes the PIC to clear the ISR, so the guest can no longer see the interrupt. Therefore, Ymir injects the interrupt into the guest in place of the PIC. The guest, upon receiving the interrupt from Ymir, calls its interrupt handler as usual and attempts to send an EOI to the PIC. That EOI is received by the virtualized PIC implemented in the previous chapter and is discarded without being forwarded to the real PIC. This prevents redundant EOIs while allowing both Ymir and the guest to receive interrupts. Interrupt injection to the guest is achieved using VT-x's VM-Exit Interruption-Information and VM-Entry Interruption-Information fields.
The interrupt mechanism for Ymir was already implemented in the Interrupt and Exception chapter. It has been confirmed that timer interrupts are received and the interrupt handler is called. To maintain modularity, Ymir does not want to directly intervene in these interrupt mechanisms for virtualization purposes. Therefore, we provide a system that allows subscribing to arbitrary interrupts. The Ymir kernel handles interrupts normally as usual, but additionally, a subscriber registered for the VMM is invoked. This enables the kernel to process interrupts without being aware of VMM functionality, while the VMM can transparently intervene in any interrupt as needed.
Overview of IRQ sharing between Ymir and Guest OS.
VM Exit by Interrupts
Whether a VM Exit occurs on external interrupts is controlled by the VMCS field Pin-Based VM-Execution Controls. Currently, this field is not set, so external interrupts do not cause a VM Exit and are delivered directly to the guest. First, we will enable this field to make VM Exit occur on external interrupts:
fn setupExecCtrls(vcpu: *Vcpu, _: Allocator) VmxError!void {
    ...
    var pin_exec_ctrl = try vmcs.PinExecCtrl.store();
    pin_exec_ctrl.external_interrupt = true;
    ...
}
When running the guest, it becomes clear that interrupts occur quite early (before the kernel itself is fully loaded). This is due to the timer interrupt on IRQ 0. Since Ymir has already cleared the timer interrupt mask on the PIC, interrupts occur during guest execution:
No EFI environment detected.
early console in extract_kernel
input_data: 0x0000000002d582b9
input_len: 0x0000000000c702ff
output: 0x0000000001000000
output_len: 0x000000000297e75c
kernel_total_size: 0x0000000002630000
needed_size: 0x0000000002a00000
trampoline_32bit: 0x0000000000000000
KASLR disabled: 'nokaslr' on cmdline.
Decompressing Linux... [ERROR] vcpu | Unhandled VM-exit: reason=extintr
[ERROR] vcpu | === vCPU Information ===
[ERROR] vcpu | [Guest State]
[ERROR] vcpu | RIP: 0x00000000039C9022
Unlike the I/O bitmaps, interrupts cannot be configured to cause a VM Exit on a per-vector basis.
Subscriber
Next, we will provide a mechanism to insert arbitrary code into Ymir kernel's interrupt handler. The mechanism itself is very simple: in addition to the normal interrupt handler, you register functions you want to call, and the interrupt handler invokes them. The entities that subscribe to interrupts will be called subscribers. The subscriber interface is defined as follows:
pub const Subscriber = struct {
    /// Context of the subscriber.
    self: *anyopaque,
    /// Callback invoked when an interrupt occurs.
    callback: Callback,

    pub const Callback = *const fn (*anyopaque, *Context) void;
};
Subscriber holds a handler called callback. callback() receives the same context information as a normal interrupt handler, including the register state and the interrupt vector. In addition, callback() receives self, a pointer to the subscriber instance. You might recall that when implementing the Allocator instance in General Allocator, the allocate() function also took self as an argument. By casting self within callback() to the subscriber's concrete type, the subscriber can access its own context information.
Subscribers are managed as global variables and can be registered using the following function:
const max_subscribers = 10;
var subscribers: [max_subscribers]?Subscriber = [_]?Subscriber{null} ** max_subscribers;

pub fn subscribe(ctx: *anyopaque, callback: Subscriber.Callback) !void {
    for (subscribers, 0..) |sub, i| {
        if (sub == null) {
            subscribers[i] = Subscriber{
                .callback = callback,
                .self = ctx,
            };
            return;
        }
    }
    return error.SubscriberFull;
}
Registered subscribers are called from the interrupt handler. For our current purpose, it would be sufficient to invoke them only for IRQ interrupts (vectors in [0x20, 0x30)), but in this series subscribers are called for all interrupts. It is the subscriber's responsibility to determine whether the interrupt is actually of interest.
pub fn dispatch(context: *Context) void {
    const vector = context.vector;
    // Notify subscribers.
    for (subscribers) |subscriber| {
        if (subscriber) |s| s.callback(s.self, context);
    }
    // Call the handler.
    handlers[vector](context);
}
With this, the mechanism to intervene in arbitrary interrupts is complete. As a test, let's register a sample subscriber. Prepare a sample type A and a handler blobSubscriber(). Inside the handler, output a log to verify that self is correctly obtained, then immediately trigger a panic:
const A = struct { value: u64 };
var something: A = .{ .value = 0xDEAD };
try arch.intr.subscribe(&something, blobSubscriber);

fn blobSubscriber(p: *anyopaque, _: *arch.intr.Context) void {
    const self: *A = @alignCast(@ptrCast(p));
    log.debug("self: value = {X}", .{self.value});
    @panic("Subscriber is HERE!!!");
}
When executed, it behaves as follows:
[DEBUG] main | self: value = DEAD
[ERROR] panic | Subscriber is HERE!!!
A timer interrupt occurs, confirming that the subscriber is properly invoked. The code used here is for testing purposes only and can be safely removed afterward.
Pending IRQs
When an interrupt occurs while the guest is running, a VM Exit happens and the host's interrupt handler processes the interrupt. The subscriber needs to record the interrupt details to inject them into the guest later. To maintain a list of IRQs waiting to be injected into the guest, add a variable to Vcpu:
pub const Vcpu = struct {
    ...
    pending_irq: u16 = 0,
    ...
};
.pending_irq is a bitmap corresponding to the 16 IRQ lines. When IRQ N occurs, the corresponding bit in .pending_irq is set. Once the injection to the guest is completed, that bit is cleared.
Interrupt injection to the guest is not guaranteed to succeed. For example, if the guest has cleared RFLAGS.IF, the interrupt cannot be injected. Also, if the corresponding bit in the PIC's interrupt mask register (IMR) is set, that IRQ should not be delivered to the guest. Therefore, even if a VM Exit occurs due to an interrupt, the interrupt is not necessarily delivered to the guest immediately on the subsequent VM Entry. Interrupts that occur while the guest's RFLAGS.IF is cleared will accumulate in .pending_irq. This behavior mirrors non-virtualized environments. In a non-virtualized environment, IRQs occurring while RFLAGS.IF is cleared accumulate in the IRR, and when RFLAGS.IF is set, the IRQ with the highest priority is moved to the ISR and notified to the CPU.
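The bitmap bookkeeping described above can be sketched in isolation. This is a minimal standalone illustration rather than Ymir's code: the PendingIrqs type and its set, clear, and isPending helpers are hypothetical names introduced here, standing in for the pending_irq field and the bits library.

```zig
const std = @import("std");

/// Hypothetical stand-in for the `pending_irq` field of `Vcpu`.
const PendingIrqs = struct {
    bitmap: u16 = 0,

    /// Record that IRQ `irq` fired (the job of the subscriber).
    fn set(self: *PendingIrqs, irq: u4) void {
        self.bitmap |= @as(u16, 1) << irq;
    }

    /// Forget IRQ `irq` once it has been injected into the guest.
    fn clear(self: *PendingIrqs, irq: u4) void {
        self.bitmap &= ~(@as(u16, 1) << irq);
    }

    /// Whether IRQ `irq` is still waiting to be injected.
    fn isPending(self: PendingIrqs, irq: u4) bool {
        return self.bitmap & (@as(u16, 1) << irq) != 0;
    }
};

pub fn main() void {
    var pending = PendingIrqs{};
    pending.set(0); // timer
    pending.set(4); // another line, chosen arbitrarily for the example
    pending.set(0); // the same IRQ again while blocked: still a single bit
    std.debug.print("after set:   0x{X:0>4}\n", .{pending.bitmap});
    pending.clear(0); // IRQ 0 was injected
    std.debug.print("after clear: 0x{X:0>4}\n", .{pending.bitmap});
}
```

One consequence of using a bitmap is visible in the example: if the same IRQ fires several times while the guest is blocking interrupts, only a single occurrence is remembered, much like a latched bit in the IRR.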
When an interrupt occurs, the subscriber sets the corresponding bit in .pending_irq. Since Ymir remaps IRQ interrupts to vectors between 0x20 and 0x2F, set the IRQ bit only when interrupts within this range occur:
fn intrSubscriberCallback(self_: *anyopaque, ctx: *isr.Context) void {
    const self: *Self = @alignCast(@ptrCast(self_));
    const vector = ctx.vector;
    if (0x20 <= vector and vector < 0x20 + 16) {
        self.pending_irq |= bits.tobit(u16, vector - 0x20);
    }
}
Let's register the subscriber at the beginning of loop(). You can register it anywhere as long as the registration runs only once:
pub fn loop(self: *Self) VmxError!void {
    intr.subscribe(self, intrSubscriberCallback) catch return error.InterruptFull;
    ...
}
IRQ Injection
Define a function to inject IRQs into the guest. As mentioned earlier, the guest is not always ready to accept interrupts (excluding NMIs, of course). The conditions under which interrupts cannot be injected are as follows:
- There are no interrupts to inject
- The PIC is not initialized
- The RFLAGS.IF bit is cleared
- The IRQ is masked by the PIC's IMR
First, we will handle cases 1 through 3. The function injectExtIntr() sets the appropriate VMCS fields to inject an interrupt into the guest. It returns true if the injection setup succeeds:
fn injectExtIntr(self: *Self) VmxError!bool {
    const pending = self.pending_irq;

    // 1. No interrupts to inject.
    if (pending == 0) return false;
    // 2. PIC is not initialized.
    if (self.pic.primary_phase != .inited) return false;
    // 3. Guest is blocking interrupts.
    const eflags: am.FlagsRegister = @bitCast(try vmread(vmcs.guest.rflags));
    if (!eflags.ief) return false;

    ...
    return false;
}
If the above checks pass, we can inject an interrupt into the guest. However, only one interrupt can be injected at a time. Therefore, we need to select which IRQ to inject. Normally, this selection is handled by the PIC, which delivers interrupts in order of priority. In general, lower IRQ numbers have higher priority, but the PIC allows the priority order to be changed. Since the virtual PIC we implemented in the PIC virtualization chapter does not support priority rotation, Ymir will inject interrupts starting from the lowest IRQ number.
Check IRQs from 0 to 15 in order to determine whether each is eligible for injection. An IRQ is eligible if:
- The corresponding bit in .pending_irq is set
- The IRQ is not masked by the IMR
Note that if IRQ N belongs to the secondary PIC, you must ensure that both IRQ 2 and IRQ N are not masked in the IMR. The bits library used below was implemented in the Bitwise Operation and Testing chapter. It is especially useful for working with bitmaps like .pending_irq:
fn injectExtIntr(self: *Self) VmxError!bool {
    ...
    const is_secondary_masked = bits.isset(self.pic.primary_mask, IrqLine.secondary);
    // Zig ranges exclude the upper bound, so 0..16 covers IRQ 0 through 15.
    for (0..16) |i| {
        if (is_secondary_masked and i >= 8) break;

        const irq: IrqLine = @enumFromInt(i);
        const irq_bit = bits.tobit(u16, irq);
        // The IRQ is not pending.
        if (pending & irq_bit == 0) continue;

        // Check if the IRQ is masked.
        const is_masked = if (irq.isPrimary()) b: {
            break :b bits.isset(self.pic.primary_mask, irq.delta());
        } else b: {
            const is_irq_masked = bits.isset(self.pic.secondary_mask, irq.delta());
            break :b is_secondary_masked or is_irq_masked;
        };
        if (is_masked) continue;
        ...
    }
    ...
}
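The eligibility rule can also be expressed without a loop, which makes it easy to test on its own. The following is a sketch under stated assumptions, not the code Ymir uses: selectIrq and its parameters are hypothetical, both masks follow the IMR convention that a set bit means "masked", and priority is fixed so that the lowest IRQ number wins.

```zig
const std = @import("std");

/// Return the lowest-numbered IRQ that is pending and not masked, or null.
/// `primary_mask` and `secondary_mask` mirror the IMRs of the two PICs
/// (bit set = masked). IRQ 2 on the primary PIC is the cascade line.
fn selectIrq(pending: u16, primary_mask: u8, secondary_mask: u8) ?u4 {
    // Combine both IMRs: bits 0-7 are the primary, bits 8-15 the secondary.
    var masked: u16 = @as(u16, primary_mask) | (@as(u16, secondary_mask) << 8);
    // If the cascade line is masked, every secondary IRQ is blocked as well.
    if (primary_mask & (1 << 2) != 0) masked |= 0xFF00;

    const eligible = pending & ~masked;
    if (eligible == 0) return null;

    // The lowest set bit is the lowest IRQ number, i.e. the highest priority.
    const irq: u4 = @intCast(@ctz(eligible));
    return irq;
}

pub fn main() void {
    // IRQ 0 and IRQ 12 are pending, but IRQ 0 is masked on the primary PIC.
    const selected = selectIrq(0b0001_0000_0000_0001, 0b0000_0001, 0x00);
    if (selected) |irq| {
        std.debug.print("inject IRQ {d}\n", .{irq});
    } else {
        std.debug.print("nothing to inject\n", .{});
    }
}
```

The loop form in the chapter is easier to extend if priority rotation is ever supported, while the bitmask form makes the cascade rule for IRQ 2 explicit in a single line.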
If an IRQ is eligible for injection, configure the VMCS to specify the interrupt to be injected. Interrupt injection into the guest is performed using a 32-bit VMCS field called VM-Entry Interruption-Information. The structure of the Interruption-Information field is as follows:
Format of VM-Entry Interruption-Information. SDM Vol.3C Table 25-17.
Vector represents the interrupt or exception vector to be injected into the guest. Type indicates the kind of interrupt. Deliver error code specifies whether an error code should be delivered. As discussed in the Interrupts and Exceptions chapter, some exceptions push an error code onto the stack to provide more detailed information. If deliver error code is set, the value from the VMCS VM-Entry Exception Error Code field will be delivered to the guest.
Define the Interruption-Information struct as follows:
pub const EntryIntrInfo = packed struct(u32) {
    vector: u8,
    type: Type,
    ec_available: bool,
    _notused: u19 = 0,
    valid: bool,

    const Type = enum(u3) {
        external = 0,
        _unused1 = 1,
        nmi = 2,
        hw = 3,
        _unused2 = 4,
        priviledged_sw = 5,
        exception = 6,
        _unused3 = 7,
    };

    const Kind = enum {
        entry,
        exit,
    };
};
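To see how this layout maps onto the raw 32-bit field, the following standalone sketch bit-casts two values to u32. It uses a trimmed local copy of the struct so that it compiles on its own; the chosen vectors are only examples.

```zig
const std = @import("std");

/// Trimmed local copy of the layout above, for illustration only.
const EntryIntrInfo = packed struct(u32) {
    vector: u8, // bits 7:0
    type: Type, // bits 10:8
    ec_available: bool, // bit 11
    _notused: u19 = 0, // bits 30:12
    valid: bool, // bit 31

    const Type = enum(u3) { external = 0, nmi = 2, hw = 3, exception = 6, _ };
};

pub fn main() void {
    // An external interrupt with vector 0x20 and no error code.
    const extintr = EntryIntrInfo{
        .vector = 0x20,
        .type = .external,
        .ec_available = false,
        .valid = true,
    };
    // A hardware exception (vector 14) that delivers an error code.
    const fault = EntryIntrInfo{
        .vector = 14,
        .type = .hw,
        .ec_available = true,
        .valid = true,
    };
    std.debug.print("external: 0x{X:0>8}\n", .{@as(u32, @bitCast(extintr))});
    std.debug.print("fault:    0x{X:0>8}\n", .{@as(u32, @bitCast(fault))});
}
```

Because fields of a packed struct are laid out starting from the least significant bit, the vector lands in the low byte and the valid flag in bit 31, so the first value is 0x80000020 and the second is 0x80000B0E.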
Inject the IRQ as shown below. Note that the injected vector must be calculated by taking into account the IRQ remapping. Since the mapping of guest IRQs is recorded in Vcpu.pic, compute the injected vector by adding the IRQ number to this base value:
fn injectExtIntr(self: *Self) VmxError!bool {
    ...
    // Zig ranges exclude the upper bound, so 0..16 covers IRQ 0 through 15.
    for (0..16) |i| {
        const intr_info = vmx.EntryIntrInfo{
            .vector = irq.delta() + if (irq.isPrimary()) self.pic.primary_base else self.pic.secondary_base,
            .type = .external,
            .ec_available = false,
            .valid = true,
        };
        try vmwrite(vmcs.ctrl.entry_intr_info, intr_info);

        // Clear the pending IRQ.
        self.pending_irq &= ~irq_bit;
        return true;
    }
    ...
}
Once the interrupt configuration is complete, clear the corresponding bit in .pending_irq, which plays the role of the IRR. Since we've already confirmed that the guest is ready to handle the interrupt, the guest's interrupt handler will be invoked immediately after VM Entry.
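The vector calculation is worth checking by hand, because the guest's remapping is independent of the one Ymir uses on the host. A minimal sketch follows; guestVector is a hypothetical helper, and the base values 0x30 and 0x38 are example numbers rather than values taken from Ymir.

```zig
const std = @import("std");

/// Translate a PIC IRQ line (0-15) into the vector the guest expects,
/// given the vector bases the guest programmed into its virtual PICs.
/// Hypothetical helper for illustration only.
fn guestVector(irq: u4, primary_base: u8, secondary_base: u8) u8 {
    // IRQ 0-7 belong to the primary PIC, IRQ 8-15 to the secondary.
    return if (irq < 8)
        primary_base + irq
    else
        secondary_base + (irq - 8);
}

pub fn main() void {
    // Example bases only: primary at 0x30, secondary at 0x38.
    const primary_base: u8 = 0x30;
    const secondary_base: u8 = 0x38;
    std.debug.print("IRQ 0  -> vector 0x{X}\n", .{guestVector(0, primary_base, secondary_base)});
    std.debug.print("IRQ 12 -> vector 0x{X}\n", .{guestVector(12, primary_base, secondary_base)});
}
```

With these example bases, a timer interrupt that reaches Ymir as host vector 0x20 would be injected into the guest as vector 0x30.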
Accepting Interrupts
Define a handler for VM Exits caused by interrupts, and call injectExtIntr() from there. One important point here is that Ymir generally operates with interrupts disabled from just before launching the guest. This means that even if a VM Exit occurs and control returns to the host, the host will not be able to notice any incoming interrupts in this state. As a result, subscribers won't be triggered either.
Therefore, when a VM Exit caused by an interrupt occurs, it is necessary to temporarily enable interrupts in Ymir to allow it to accept and handle them. Add the interrupt handling case to the switch statement in handleExit():
.extintr => {
    // Consume the interrupt by Ymir.
    // At the same time, interrupt subscriber sets the pending IRQ.
    asm volatile (
        \\sti
        \\nop
        \\cli
    );
    // Give the external interrupt to guest.
    _ = try self.injectExtIntr();
},
Use the STI instruction to set RFLAGS.IF and enable interrupts. Note that interrupts remain disabled until the instruction boundary immediately following STI; that is, there is a one-instruction delay before interrupts are actually enabled. This design allows functions that disable interrupts to safely re-enable them right before returning, for example with sti; ret;. Taking this delay into account, place a NOP immediately after the STI. This ensures a window between the NOP and the following CLI in which interrupts can be accepted. During this window, if any interrupts are pending, the CPU invokes the interrupt handler, and the subscriber sets .pending_irq. After accepting interrupts, disable them again with CLI.
HLT
As a bonus stage, we'll implement VM Exit on the HLT instruction. HLT halts the CPU until an interrupt occurs. Whether a VM Exit occurs when the guest executes HLT is controlled by Primary Processor-Based VM-Execution Controls in the VMCS. Since it is not set currently, VM Exit on HLT does not happen. However, this can cause issues. Suppose the guest executes HLT expecting an interrupt while .pending_irq already contains pending interrupts. In this case, the expected interrupt has actually already occurred, and the host is just waiting for the right timing to inject it. Currently, interrupts are only injected on VM Exit caused by interrupts, so if no interrupt occurs, no VM Exit happens after HLT, and the interrupt cannot be injected.
There are several possible solutions. The first is to set the VMX-Preemption Timer [1]. When this value is set in the VMCS, it counts down while in VMX Non-root Operation. Once the count reaches zero, a VM Exit occurs. This allows VM Exits to happen periodically regardless of whether interrupts occur. The second approach is to trigger a VM Exit after every single instruction while .pending_irq contains pending IRQs. By setting the Monitor Trap Flag [2], a VM Exit occurs (basically) after each instruction, enabling a kind of single-step execution.
Ymir adopts another approach: when HLT is executed, we inject an interrupt if .pending_irq contains IRQs. When the guest runs HLT, the host executes HLT on its behalf. At that time, the host enables interrupts by executing STI before running HLT, then waits until .pending_irq is set by the subscribers:
.hlt => {
    // Wait until the external interrupt is generated.
    while (!try self.injectExtIntr()) {
        asm volatile (
            \\sti
            \\hlt
            \\cli
        );
    }

    try vmwrite(vmcs.guest.activity_state, 0);
    try vmwrite(vmcs.guest.interruptibility_state, 0);
    try self.stepNextInst();
},
Finally, set Primary Processor-Based VM-Execution Controls to enable VM Exit on HLT:
fn setupExecCtrls(vcpu: *Vcpu, _: Allocator) VmxError!void {
    ...
    ppb_exec_ctrl.hlt = true;
    ...
}
Summary
In this chapter, we implemented a mechanism for both Ymir and the guest to receive interrupts. We introduced subscribers to keep track of interrupts to be injected into the guest. The recorded interrupts are injected by setting the VM-Entry Interruption-Information field once the guest is ready to receive them.
In the previous chapter, the boot process froze because timer interrupts never arrived, but with the current implementation, how far can it advance? Let's run the guest and find out:
...
[ 0.328952] sched_clock: Marking stable (328952424, 0)->(329000000, -47576)
[ 0.328952] registered taskstats version 1
[ 0.328952] Loading compiled-in X.509 certificates
[ 0.329952] PM: Magic number: 0:110:269243
[ 0.329952] printk: legacy console [netcon0] enabled
[ 0.329952] netconsole: network logging started
[ 0.329952] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 0.329952] kworker/u4:1 (40) used greatest stack depth: 14480 bytes left
[ 0.329952] Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[ 0.329952] Loaded X.509 cert 'wens: 61c038651aabdcf94bd0ac7ff06c7248db18c600'
[ 0.329952] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 0.329952] cfg80211: failed to load regulatory.db
[ 0.329952] ALSA device list:
[ 0.329952] No soundcards found.
[ 0.329952] md: Waiting for all devices to be available before autodetect
[ 0.329952] md: If you don't use raid, use raid=noautodetect
[ 0.329952] md: Autodetecting RAID arrays.
[ 0.329952] md: autorun ...
[ 0.329952] md: ... autorun DONE.
[ 0.329952] /dev/root: Can't open blockdev
[ 0.329952] VFS: Cannot open root device "" or unknown-block(0,0): error -6
[ 0.329952] Please append a correct "root=" boot option; here are the available partitions:
[ 0.329952] List of all bdev filesystems:
[ 0.329952] ext3
[ 0.329952] ext2
[ 0.329952] ext4
[ 0.329952] vfat
[ 0.329952] msdos
[ 0.329952] iso9660
[ 0.329952]
[ 0.329952] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
...
It looks like the jiffies loop was successfully passed. Initialization proceeded normally and finally aborted with the message /dev/root: Can't open blockdev. This log indicates that the guest tried to load initramfs as the filesystem but couldn't find it. The boot process has now reached the point of loading the filesystem. Next, we just need to load initramfs into memory and start the programs inside the filesystem to launch the PID 1 process. We're getting close to the finish line. In the next chapter, we will implement loading initramfs.
[1] SDM Vol.3C 26.5.1 VMX-Preemption Timer
[2] SDM Vol.3C 26.5.2 Monitor Trap Flag