Linux Boot Protocol

By enabling EPT and configuring the guest as an unrestricted guest, we made it possible to run the guest in a memory space separate from the host. This means we are now ready to boot the Linux kernel, which is the goal of this series. Of course, many parts are still missing, such as the unimplemented Exit Handler, but we will add those as needed while actually running Linux. In this chapter, we will follow the Linux x86 boot protocol to load Linux and trace the process until control is handed over to the guest kernel.

[!IMPORTANT] The source code for this chapter is in the whiz-vmm-linux_boot branch.

Boot Process

When booting the Linux kernel, the usual steps are as follows:

1. Transfer of Control to Bootloader

After the legacy BIOS initializes the system, it loads the bootloader from the MBR (Master Boot Record) and transfers control to it [1]. In the case of UEFI, the bootloader starts as a UEFI application.

Well-known bootloaders executed directly by BIOS/UEFI include GRUB2 and coreboot. In this series, Ymir will serve as the bootloader for the guest. Therefore, we can skip the step of loading the bootloader itself.

2. Loading Kernel

The bootloader loaded by BIOS/UEFI loads the kernel according to the boot protocol defined by Linux. This protocol is specified for each architecture. Since Ymir only supports x64, it follows the x86 boot protocol. Starting from x86 boot protocol v2.02, the memory layout is as follows:

Figure: x86 Linux Kernel Memory Layout. https://www.kernel.org/doc/html/v5.6/x86/boot.html

An important data structure in the x86 boot protocol is struct boot_params, also known as the zero page. This structure is placed at the beginning of the bzImage and is used by a bootloader to pass system information and data necessary for loading Linux. The bootloader fills this structure with the appropriate information, and the kernel uses it during initialization. More precisely, the kernel uses a structure called setup headers inside boot_params. We will take a closer look at these structures later.

3. Boot Process of Kernel

After loading the kernel into memory, the bootloader transfers control to the kernel's entry point. For x64, the entry point is startup_32(). At the time of this transition, the CPU is in protected mode with paging disabled. Inside startup_32(), paging is enabled and the switch to long mode occurs. Then, in startup_64(), the compressed kernel image is decompressed. Once decompression is complete, control passes to the kernel entry point, which is also named startup_64(). Although the function names are the same, the former is defined in arch/x86/boot/compressed/head_64.S and the latter in arch/x86/kernel/head_64.S. Finally, execution reaches the C function start_kernel().

Normally, these processes are handled entirely within the guest, and the VMM does not need to be aware of the detailed flow. However, during Ymir's development, there will likely be many situations where tracing Linux's execution flow is necessary to identify the cause of issues. Having a basic understanding of the boot sequence may prove useful later on.

Building Linux Kernel

[!TIP] For those who prefer not to build the kernel themselves, a kernel image built using the steps below is available for download here.

You can’t load or boot Linux without a kernel image. So let's start by building the Linux kernel. The source code can be cloned from git://git.kernel.org. Run the following command to clone the repository:

bash
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux

You can configure the kernel interactively using make menuconfig. For debugging purposes, it's generally best to disable unnecessary features and minimize the kernel size. However, for the sake of generality in this series, we'll build the kernel with the default configuration. Feel free to adjust these settings according to your preferences:

bash
make defconfig
make -j$(nproc)

Once the build completes, two files will be generated: /vmlinux and /arch/x86/boot/bzImage. vmlinux is an executable ELF file containing the full kernel, while bzImage is a compressed version of vmlinux using zlib or LZMA [2]. Although bzImage is the one actually used as the kernel image, for debugging with GDB you should use vmlinux, since it is uncompressed and includes symbol information.

During Ymir development, you'll often want to insert log messages into the Linux kernel for debugging purposes. In such cases, having compile_commands.json makes it easier to use code editor features. You can generate it with the following command. For how to load compile_commands.json into your editor, refer to your editor's documentation [3]:

bash
python3 ./scripts/clang-tools/gen_compile_commands.py

Finally, copy the built kernel image into the Ymir directory. Like the Ymir kernel, we will place bzImage in the root directory of the FAT file system:

bash
cp ./arch/x86/boot/bzImage <Ymir Directory>/zig-out/img/bzImage

Loading bzImage

First, we load bzImage from the FAT file system into memory. Since it's just a file, there's no need to worry about its layout at this point. To access the FAT file system, we use the Simple File System Protocol provided by UEFI Boot Services.

Since Boot Services become unavailable once Ymir is running, we do that within Surtr. Just like we did when parsing the kernel, we first open the file and retrieve its size:

surtr/boot.zig
const guest = openFile(root_dir, "bzImage") catch return .Aborted;

const guest_info_buffer_size: usize = @sizeOf(uefi.FileInfo) + 0x100;
var guest_info_actual_size = guest_info_buffer_size;
var guest_info_buffer: [guest_info_buffer_size]u8 align(@alignOf(uefi.FileInfo)) = undefined;

status = guest.getInfo(&uefi.FileInfo.guid, &guest_info_actual_size, &guest_info_buffer);
if (status != .Success) return status;
const guest_info: *const uefi.FileInfo = @alignCast(@ptrCast(&guest_info_buffer));

Next, allocate enough pages to hold the entire file. For the memory type, specify .LoaderData. As implemented in the Page Allocator chapter, Ymir's page allocator treats memory regions marked as .ConventionalMemory or .BootServicesCode in the UEFI memory map as usable. However, the file data we're about to load must not be reclaimed once the allocator is initialized, because it still needs to be copied into guest memory. For this reason, we place it in .LoaderData, which the page allocator does not claim.

surtr/boot.zig
var guest_start: u64 align(page_size) = undefined;
const guest_size_pages = (guest_info.file_size + (page_size - 1)) / page_size;
status = boot_service.allocatePages(.AllocateAnyPages, .LoaderData, guest_size_pages, @ptrCast(&guest_start));
if (status != .Success) return status;
var guest_size = guest_info.file_size;
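The page count computed above is a round-up division. As a minimal standalone sketch of that arithmetic, assuming 4 KiB pages (pagesFor and the page_size constant below are hypothetical names for illustration, not part of Surtr):

```zig
const std = @import("std");

/// Assumed page size for this sketch (4 KiB).
const page_size: usize = 4096;

/// Hypothetical helper: number of pages needed to hold `size` bytes.
/// Adding `page_size - 1` before dividing rounds any partial page up.
fn pagesFor(size: usize) usize {
    return (size + (page_size - 1)) / page_size;
}

test "a partial page still needs a whole page" {
    try std.testing.expect(pagesFor(0) == 0);
    try std.testing.expect(pagesFor(1) == 1);
    try std.testing.expect(pagesFor(page_size) == 1);
    try std.testing.expect(pagesFor(page_size + 1) == 2);
}
```

Note that guest_size keeps the exact file size in bytes, so only the allocation is rounded up, not the amount of data read.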

Finally, let's load the bzImage into the allocated pages:

surtr/boot.zig
status = guest.read(&guest_size, @ptrFromInt(guest_start));
if (status != .Success) return status;
log.info("Loaded guest kernel image @ 0x{X:0>16} ~ 0x{X:0>16}", .{ guest_start, guest_start + guest_size });

Now that bzImage is loaded into memory, we add information about its location to BootInfo, the shared data structure between Surtr and Ymir, so that Ymir knows where the kernel image is loaded:

surtr/defs.zig
pub const BootInfo = extern struct {
    ...
    guest_info: GuestInfo,
};

pub const GuestInfo = extern struct {
    /// Physical address where the guest image is loaded.
    guest_image: [*]u8,
    /// Size in bytes of the guest image.
    guest_size: usize,
};

Add GuestInfo to the arguments for Ymir:

surtr/boot.zig
const boot_info = defs.BootInfo{
    ...
    .guest_info = .{
        .guest_image = @ptrFromInt(guest_start),
        .guest_size = guest_size,
    },
};

With this, bzImage has been loaded into memory and passed to Ymir. Note that BootInfo.guest_info.guest_image holds a physical address. After Ymir reconstructs the memory map, you will no longer be able to access the image directly through this address. Make sure to convert it to a virtual address using ymir.mem.phys2virt() before accessing it.

Boot Parameters

Defining Structure

Define the struct boot_params passed to Linux as BootParams. Since not all fields are used, unused fields are prefixed with _ to indicate they are ignored:

ymir/linux.zig
pub const BootParams = extern struct {
    /// Maximum number of entries in the E820 map.
    const e820max = 128;

    _screen_info: [0x40]u8 align(1),
    _apm_bios_info: [0x14]u8 align(1),
    _pad2: [4]u8 align(1),
    tboot_addr: u64 align(1),
    ist_info: [0x10]u8 align(1),
    _pad3: [0x10]u8 align(1),
    hd0_info: [0x10]u8 align(1),
    hd1_info: [0x10]u8 align(1),
    _sys_desc_table: [0x10]u8 align(1),
    _olpc_ofw_header: [0x10]u8 align(1),
    _pad4: [0x80]u8 align(1),
    _edid_info: [0x80]u8 align(1),
    _efi_info: [0x20]u8 align(1),
    alt_mem_k: u32 align(1),
    scratch: u32 align(1),
    /// Number of entries in the E820 map.
    e820_entries: u8 align(1),
    eddbuf_entries: u8 align(1),
    edd_mbr_sig_buf_entries: u8 align(1),
    kbd_status: u8 align(1),
    _pad6: [5]u8 align(1),
    /// Setup header.
    hdr: SetupHeader,
    _pad7: [0x290 - SetupHeader.header_offset - @sizeOf(SetupHeader)]u8 align(1),
    _edd_mbr_sig_buffer: [0x10]u32 align(1),
    /// System memory map that can be retrieved by INT 15, E820h.
    e820_map: [e820max]E820Entry align(1),
    _unimplemented: [0x330]u8 align(1),

    /// Instantiate boot params from bzImage.
    pub fn from(bytes: []u8) @This() {
        return std.mem.bytesToValue(
            @This(),
            bytes[0..@sizeOf(@This())],
        );
    }
};

The important fields are the setup headers .hdr and the E820 map .e820_map. The E820 map will be covered later. BootParams is located uncompressed at the beginning of bzImage. The from() method extracts BootParams from the binary data of bzImage.

Among the fields in BootParams, the main ones the bootloader (Ymir in this case) needs to set are the setup headers. The setup headers are defined as follows:

ymir/linux.zig
pub const SetupHeader = extern struct {
    /// RO. The number of setup sectors.
    setup_sects: u8 align(1),
    root_flags: u16 align(1),
    syssize: u32 align(1),
    ram_size: u16 align(1),
    vid_mode: u16 align(1),
    root_dev: u16 align(1),
    boot_flag: u16 align(1),
    jump: u16 align(1),
    header: u32 align(1),
    /// RO. Boot protocol version supported.
    version: u16 align(1),
    realmode_switch: u32 align(1),
    start_sys_seg: u16 align(1),
    kernel_version: u16 align(1),
    /// M. The type of loader. Specify 0xFF if no ID is assigned.
    type_of_loader: u8 align(1),
    /// M. Bitmask.
    loadflags: LoadflagBitfield align(1),
    setup_move_size: u16 align(1),
    code32_start: u32 align(1),
    /// M. The 32-bit linear address of initial ramdisk or ramfs.
    /// Specify 0 if there is no ramdisk or ramfs.
    ramdisk_image: u32 align(1),
    /// M. The size of the initial ramdisk or ramfs.
    ramdisk_size: u32 align(1),
    bootsect_kludge: u32 align(1),
    /// W. Offset of the end of the setup/heap minus 0x200.
    heap_end_ptr: u16 align(1),
    /// W(opt). Extension of the loader ID.
    ext_loader_ver: u8 align(1),
    ext_loader_type: u8 align(1),
    /// W. The 32-bit linear address of the kernel command line.
    cmd_line_ptr: u32 align(1),
    /// R. Highest address that can be used for initrd.
    initrd_addr_max: u32 align(1),
    kernel_alignment: u32 align(1),
    relocatable_kernel: u8 align(1),
    min_alignment: u8 align(1),
    xloadflags: u16 align(1),
    /// R. Maximum size of the cmdline.
    cmdline_size: u32 align(1),
    hardware_subarch: u32 align(1),
    hardware_subarch_data: u64 align(1),
    payload_offset: u32 align(1),
    payload_length: u32 align(1),
    setup_data: u64 align(1),
    pref_address: u64 align(1),
    init_size: u32 align(1),
    handover_offset: u32 align(1),
    kernel_info_offset: u32 align(1),

    /// Bitfield for loadflags.
    const LoadflagBitfield = packed struct(u8) {
        /// If true, the protected-mode code is loaded at 0x100000.
        loaded_high: bool = false,
        /// If true, KASLR enabled.
        kaslr_flag: bool = false,
        /// Unused.
        _unused: u3 = 0,
        /// If false, print early messages.
        quiet_flag: bool = false,
        /// If false, reload the segment registers in the 32 bit entry point.
        keep_segments: bool = false,
        /// Set true to indicate that the value entered in the `heap_end_ptr` is valid.
        can_use_heap: bool = false,

        /// Convert to u8.
        pub fn to_u8(self: @This()) u8 {
            return @bitCast(self);
        }
    };

    /// The offset where the header starts in the bzImage.
    pub const header_offset = 0x1F1;

    /// Instantiate a header from bzImage.
    pub fn from(bytes: []u8) @This() {
        var hdr = std.mem.bytesToValue(
            @This(),
            bytes[header_offset .. header_offset + @sizeOf(@This())],
        );
        if (hdr.setup_sects == 0) {
            hdr.setup_sects = 4;
        }

        return hdr;
    }

    /// Get the offset of the protected-mode kernel code.
    /// Real-mode code consists of the boot sector (1 sector == 512 bytes)
    /// plus the setup code (`setup_sects` sectors).
    pub fn getProtectedCodeOffset(self: @This()) usize {
        return (@as(usize, self.setup_sects) + 1) * 512;
    }
};

There are many fields here as well, but Ymir only uses a small subset. For the full meaning of all fields, please refer to the documentation. The from() method extracts the SetupHeader from the bzImage binary image. The setup headers are located at a fixed offset of 0x1F1 bytes from the start of bzImage.
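Because the header sits at a fixed offset, a loader can cheaply sanity-check an image before trusting its fields. The boot protocol documents two magic values near the header: boot_flag at offset 0x1FE should be 0xAA55, and the header field at offset 0x202 should contain the bytes "HdrS". The sketch below checks both. looksLikeBzImage is a hypothetical helper that Ymir does not define; it is shown only to illustrate the fixed offsets.

```zig
const std = @import("std");

/// Offsets within bzImage, taken from the x86 boot protocol.
const boot_flag_offset = 0x1FE; // u16, expected to be 0xAA55
const header_magic_offset = 0x202; // 4 bytes, expected to be "HdrS"

/// Hypothetical helper: returns true if `image` carries the two magic
/// values that mark a bzImage with a setup header.
fn looksLikeBzImage(image: []const u8) bool {
    if (image.len < header_magic_offset + 4) return false;
    // The field is little-endian; assemble the u16 by hand so the
    // check does not depend on the host byte order.
    const boot_flag = @as(u16, image[boot_flag_offset]) |
        (@as(u16, image[boot_flag_offset + 1]) << 8);
    const magic = image[header_magic_offset .. header_magic_offset + 4];
    return boot_flag == 0xAA55 and std.mem.eql(u8, magic, "HdrS");
}

test "synthetic image with the two magic values" {
    var image = [_]u8{0} ** 0x210;
    try std.testing.expect(!looksLikeBzImage(&image));
    image[0x1FE] = 0x55;
    image[0x1FF] = 0xAA;
    @memcpy(image[0x202..0x206], "HdrS");
    try std.testing.expect(looksLikeBzImage(&image));
}
```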

The .setup_sects field represents the size of the setup code in 512-byte sectors. For example, if this value is 4, the setup code size is \(4 \times 512 = 2048\) bytes. Note that if this value is 0, the specification dictates it should be treated as if it were 4.

getProtectedCodeOffset() returns the offset where the kernel's protected mode code is located. The kernel's real mode code consists of the boot sector (1 sector) and the setup code (setup_sects sectors). Since the protected mode code is placed immediately after the real mode code, it is located at the offset of 1 + setup_sects sectors from the start of bzImage. This offset will be used later when loading the kernel.
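As a worked example of this calculation, the sketch below restates it as a free function that also applies the "0 means 4" rule. protectedCodeOffset is a hypothetical name used only here. A setup_sects value of 39 yields 0x5000, which is the code offset that appears in the boot log later in this chapter.

```zig
const std = @import("std");

const sector_size: usize = 512;

/// Hypothetical free-function version of the offset calculation.
/// A raw `setup_sects` of 0 is treated as 4, as the boot protocol requires.
fn protectedCodeOffset(raw_setup_sects: u8) usize {
    const setup_sects: usize = if (raw_setup_sects == 0) 4 else raw_setup_sects;
    // One boot sector plus `setup_sects` setup sectors precede the
    // protected-mode code.
    return (setup_sects + 1) * sector_size;
}

test "offsets for a few setup_sects values" {
    try std.testing.expect(protectedCodeOffset(4) == 0xA00);
    try std.testing.expect(protectedCodeOffset(0) == 0xA00);
    try std.testing.expect(protectedCodeOffset(39) == 0x5000);
}
```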

E820 Map

E820 is the memory map provided by BIOS. By passing it in BootParams, the kernel can be informed of the memory layout. The name originates from the real mode method of setting the AX register to 0xE820 and then executing the INT 15h instruction to retrieve this information. BootParams can hold up to 128 E820 entries. The bootloader sets the number of valid E820 entries in .e820_entries before passing it to the kernel.

E820 entries are defined as follows. Each entry describes the memory region that starts at addr and spans size bytes, and the type field indicates what kind of region it is:

ymir/linux.zig
pub const E820Entry = extern struct {
    addr: u64 align(1),
    size: u64 align(1),
    type: Type align(1),

    pub const Type = enum(u32) {
        /// RAM.
        ram = 1,
        /// Reserved.
        reserved = 2,
        /// ACPI reclaimable memory.
        acpi = 3,
        /// ACPI NVS memory.
        nvs = 4,
        /// Unusable memory region.
        unusable = 5,
    };
};

Provide a method to add E820 entries to BootParams:

ymir/linux.zig
    pub fn addE820entry(
        self: *@This(),
        addr: u64,
        size: u64,
        type_: E820Entry.Type,
    ) void {
        self.e820_map[self.e820_entries].addr = addr;
        self.e820_map[self.e820_entries].size = size;
        self.e820_map[self.e820_entries].type = type_;
        self.e820_entries += 1;
    }
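addE820entry() indexes the array without checking whether the map is already full. The following sketch shows one possible way to add that check and to confirm the 20-byte entry size at compile time. Entry, Table, and the TableFull error are hypothetical stand-ins so that the example is self-contained; they are not Ymir's definitions.

```zig
const std = @import("std");

/// Same shape as an E820 entry: 8 + 8 + 4 = 20 bytes with no padding.
const Entry = extern struct {
    addr: u64 align(1),
    size: u64 align(1),
    kind: u32 align(1),
};

comptime {
    // With natural alignment the struct would be padded to 24 bytes.
    std.debug.assert(@sizeOf(Entry) == 20);
}

/// Type value for usable RAM.
const ram: u32 = 1;

/// Hypothetical stand-in for the two E820 fields of BootParams.
const Table = struct {
    entries: [128]Entry = undefined,
    count: u8 = 0,

    /// Like addE820entry(), but reports a full table instead of
    /// indexing past the end of the array.
    fn add(self: *Table, addr: u64, size: u64, kind: u32) error{TableFull}!void {
        if (self.count >= self.entries.len) return error.TableFull;
        self.entries[self.count] = .{ .addr = addr, .size = size, .kind = kind };
        self.count += 1;
    }
};

test "two adjacent RAM regions" {
    var table = Table{};
    try table.add(0, 0x10_0000, ram);
    try table.add(0x10_0000, 0x640_0000 - 0x10_0000, ram);
    try std.testing.expect(table.count == 2);
    try std.testing.expect(table.entries[1].addr == 0x10_0000);
}
```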

Memory Layout

Ymir loads the guest kernel in the following physical memory layout:

Figure: Memory Layout for Guest Linux

Although initrd is not covered in this chapter, it holds the filesystem image passed to the kernel. cmdline contains the kernel command-line options. BootParams stores the previously defined BootParams structure, which the kernel uses during initialization.

The addresses of initrd and cmdline are specified in fields within BootParams. The address of the protected-mode kernel is set in the guest RIP field of the VMCS. According to the specification, the address of BootParams is passed to the guest in the RSI register when control is transferred.

Define the layout in linux.zig:

ymir/linux.zig
pub const layout = struct {
    /// Where the kernel boot parameters are loaded, known as "zero page".
    /// Must be initialized with zeros.
    pub const bootparam = 0x0001_0000;
    /// Where the kernel cmdline is located.
    pub const cmdline = 0x0002_0000;
    /// Where the protected-mode kernel code is loaded
    pub const kernel_base = 0x0010_0000;
    /// Where the initrd is loaded.
    pub const initrd = 0x0600_0000;
};
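A layout like this can be checked at compile time so that a later change to one address does not silently overlap another region. The sketch below copies the addresses from the layout above. The region sizes (a 4 KiB zero page and a 256-byte command line) and the overlaps helper are assumptions made for this example.

```zig
const std = @import("std");

// Addresses copied from the layout above.
const bootparam: usize = 0x0001_0000;
const cmdline: usize = 0x0002_0000;
const kernel_base: usize = 0x0010_0000;
const initrd: usize = 0x0600_0000;

// Sizes assumed for this sketch.
const bootparam_size: usize = 0x1000;
const cmdline_max: usize = 256;

/// Hypothetical helper: true if [a, a + a_len) and [b, b + b_len) intersect.
fn overlaps(a: usize, a_len: usize, b: usize, b_len: usize) bool {
    return a < b + b_len and b < a + a_len;
}

comptime {
    std.debug.assert(!overlaps(bootparam, bootparam_size, cmdline, cmdline_max));
    std.debug.assert(cmdline + cmdline_max <= kernel_base);
    std.debug.assert(kernel_base < initrd);
    // heap_end_ptr is a u16, so bootparam - 0x200 must fit in 16 bits.
    std.debug.assert(bootparam - 0x200 <= std.math.maxInt(u16));
}

test "heap_end_ptr value derived from the layout" {
    const heap_end: u16 = @intCast(bootparam - 0x200);
    try std.testing.expect(heap_end == 0xFE00);
}
```

If any assertion fails, compilation stops, so a bad layout never reaches the guest.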

Loading Kernel

Let's add a function to Vm that configures the above information:

ymir/vmx.zig
fn loadKernel(self: *Self, kernel: []u8) Error!void {
    const guest_mem = self.guest_mem;

    var bp = BootParams.from(kernel);
    bp.e820_entries = 0;

    // Setup necessary fields
    bp.hdr.type_of_loader = 0xFF;
    bp.hdr.ext_loader_ver = 0;
    bp.hdr.loadflags.loaded_high = true; // load kernel at 0x10_0000
    bp.hdr.loadflags.can_use_heap = true; // use memory 0..BOOTPARAM as heap
    bp.hdr.heap_end_ptr = linux.layout.bootparam - 0x200;
    bp.hdr.loadflags.keep_segments = true; // we set CS/DS/SS/ES to flat segments with a base of 0.
    bp.hdr.cmd_line_ptr = linux.layout.cmdline;
    bp.hdr.vid_mode = 0xFFFF; // VGA (normal)

    // Setup E820 map
    bp.addE820entry(0, linux.layout.kernel_base, .ram);
    bp.addE820entry(
        linux.layout.kernel_base,
        guest_mem.len - linux.layout.kernel_base,
        .ram,
    );
    ...
}

This function receives the bzImage data loaded into memory by Surtr. The setup headers fields being set here have the following meanings:

| Field | Description |
| --- | --- |
| type_of_loader | The type of bootloader. Since Ymir of course does not have an assigned type, 0xFF (undefined) is specified. |
| ext_loader_ver | Version of the bootloader. Not used. |
| loaded_high | Protected-mode code is placed at 0x100000. If false, it is loaded at 0x10000. |
| can_use_heap | Indicates that the heap region specified by heap_end_ptr is available. If false, some features will be disabled. |
| heap_end_ptr | The end address of the real-mode heap and stack (offset from the start of the real-mode code), minus 0x200 as the protocol specifies. It is likely unused. |
| keep_segments | Whether to keep the segment registers as they are at the 32-bit entry point. If false, they will be reloaded. |
| cmd_line_ptr | Address of the kernel command line. |
| vid_mode | Video mode. 0xFFFF means VGA (normal). |

Next, the E820 map is configured. This time, only two memory regions are set: up to the protected mode kernel and the area beyond it. Since these two regions are adjacent, combining them into one entry would likely be fine, but here they are separated to make it easier to verify that the E820 map is passed correctly.

Next, set up the command line. Place the command line at the address specified by cmd_line_ptr in the setup headers. In Ymir, for easier debugging, we set the command line to console=ttyS0 earlyprintk=serial nokaslr. This disables KASLR and directs the logs to the serial port. Since the serial output is redirected to stdout by QEMU, Linux logs will appear on QEMU’s standard output:

ymir/vmx.zig
    const cmdline_max_size = if (bp.hdr.cmdline_size < 256) bp.hdr.cmdline_size else 256;
    const cmdline = guest_mem[linux.layout.cmdline .. linux.layout.cmdline + cmdline_max_size];
    const cmdline_val = "console=ttyS0 earlyprintk=serial nokaslr";
    @memset(cmdline, 0);
    @memcpy(cmdline[0..cmdline_val.len], cmdline_val);
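The snippet above assumes the command line fits in the window. The boot protocol requires the command line to be a NUL-terminated string, so a more defensive version could reject a string that leaves no room for the terminator. The following is a sketch of that idea; writeCmdline and CmdlineTooLong are hypothetical names, and dst stands in for the slice of guest memory.

```zig
const std = @import("std");

/// Hypothetical helper: copy `val` into `dst` as a NUL-terminated string.
fn writeCmdline(dst: []u8, val: []const u8) error{CmdlineTooLong}!void {
    // Require at least one spare byte so the string ends with a NUL.
    if (val.len >= dst.len) return error.CmdlineTooLong;
    @memset(dst, 0);
    @memcpy(dst[0..val.len], val);
}

test "command line is copied and NUL-terminated" {
    const val = "console=ttyS0 earlyprintk=serial nokaslr";
    var window = [_]u8{0xFF} ** 64;
    try writeCmdline(&window, val);
    try std.testing.expect(std.mem.startsWith(u8, &window, val));
    try std.testing.expect(window[val.len] == 0);
}
```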

Next, load the configured BootParams and the kernel's protected-mode code into guest memory. Prepare a helper function to place data into the guest memory:

ymir/vmx.zig
fn loadImage(memory: []u8, image: []u8, addr: usize) !void {
    if (memory.len < addr + image.len) {
        return Error.OutOfMemory;
    }
    @memcpy(memory[addr .. addr + image.len], image);
}
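In loadImage(), the sum addr + image.len could itself overflow for a very large addr. One way to guard against that is std.math.add, which returns an error instead of overflowing. The sketch below is a hypothetical variant with its own error set, not a replacement that Ymir uses.

```zig
const std = @import("std");

/// Hypothetical variant of loadImage() whose bounds check also
/// handles the case where `addr + image.len` does not fit in usize.
fn loadImageChecked(
    memory: []u8,
    image: []const u8,
    addr: usize,
) error{OutOfMemory}!void {
    const end = std.math.add(usize, addr, image.len) catch return error.OutOfMemory;
    if (end > memory.len) return error.OutOfMemory;
    @memcpy(memory[addr..end], image);
}

test "image lands at the requested offset" {
    var memory = [_]u8{0} ** 16;
    try loadImageChecked(&memory, "abcd", 4);
    try std.testing.expect(std.mem.eql(u8, memory[4..8], "abcd"));
}
```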

Let's load the BootParams with this:

ymir/vmx.zig
    try loadImage(guest_mem, std.mem.asBytes(&bp), linux.layout.bootparam);

Finally, load the kernel. The offset of the kernel code in bzImage can be obtained using the previously implemented SetupHeader.getProtectedCodeOffset(). The size of the code to load is the total size of bzImage minus that offset, that is, minus the size of the real-mode code that precedes it:

ymir/vmx.zig
    const code_offset = bp.hdr.getProtectedCodeOffset();
    const code_size = kernel.len - code_offset;
    try loadImage(
        guest_mem,
        kernel[code_offset .. code_offset + code_size],
        linux.layout.kernel_base,
    );

This completes the setup of BootParams and loading of the kernel image. Modify Vm.setupGuestMemory() to call these functions:

ymir/vmx.zig.diff
         ) orelse return Error.OutOfMemory;

+        // Load kernel
+        try self.loadKernel(guest_image);
+
         // Create simple EPT mapping.
         const eptp = try impl.mapGuest(self.guest_mem, allocator);

Also, modify kernelMain() to pass the kernel image received from Surtr to setupGuestMemory(). Note that the kernel image address provided by Surtr is a physical address, so it must be converted to a virtual address:

ymir/main.zig
// Copy boot_info into Ymir's stack since it becomes inaccessible after memory mapping is reconstructed.
const guest_info = boot_info.guest_info;

...

// (After entering VMX Operation)
const guest_kernel = b: {
    const ptr: [*]u8 = @ptrFromInt(ymir.mem.phys2virt(guest_info.guest_image));
    break :b ptr[0..guest_info.guest_size];
};
try vm.setupGuestMemory(
    guest_kernel,
    mem.general_allocator,
    &mem.page_allocator_instance,
);

Configuring VMCS

Since the Linux kernel has been loaded into the guest, configure the VMCS to transfer control to Linux. First, set RIP and RSI according to the boot protocol. Set RIP to 0x100000, the entry point of the protected-mode kernel. Set RSI to the address of BootParams:

ymir/arch/x86/vmx/vcpu.zig
fn setupGuestState(vcpu: *Vcpu) VmxError!void {
    ...
    try vmwrite(vmcs.guest.rip, linux.layout.kernel_base);
    vcpu.guest_regs.rsi = linux.layout.bootparam;
}

Next, make a minor adjustment to the guest's CS. When control is handed over to the kernel, it expects to be in 32-bit protected mode. Therefore, CS must describe a 32-bit code segment, not a 64-bit one. Also, since BootParams.hdr.loadflags.keep_segments is set to true, all segment selectors need to be set to 0:

ymir/arch/x86/vmx/vcpu.zig
fn setupGuestState(vcpu: *Vcpu) VmxError!void {
    ...
    const cs_right = vmx.SegmentRights{
        .rw = true,
        .dc = false,
        .executable = true,
        .desc_type = .code_data,
        .dpl = 0,
        .granularity = .kbyte,
        .long = false,
        .db = 1,
    };
    ...
    try vmwrite(vmcs.guest.cs_sel, 0);
    ...
}
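To make the effect of .long and .db concrete, the sketch below encodes the VMCS segment access-rights field as a packed struct and compares a 32-bit code segment with a 64-bit one. The bit positions follow the access-rights format in the Intel SDM. The type and field names, and the defaults for accessed and present, are assumptions for this example and differ from Ymir's SegmentRights.

```zig
const std = @import("std");

/// Illustrative layout of the VMCS segment access-rights field.
const AccessRights = packed struct(u32) {
    accessed: bool = true,
    rw: bool,
    dc: bool,
    executable: bool,
    /// 1 = code or data segment, 0 = system segment.
    s: u1 = 1,
    dpl: u2,
    present: bool = true,
    _reserved1: u4 = 0,
    avl: bool = false,
    /// L bit: 64-bit code segment.
    long: bool,
    /// D/B bit: 1 = 32-bit default operation size.
    db: u1,
    /// G bit: 1 = limit counted in 4 KiB units.
    granularity: u1,
    unusable: bool = false,
    _reserved2: u15 = 0,
};

/// Hypothetical helper: access rights of a flat ring-0 code segment.
fn codeSegmentRights(long: bool) u32 {
    const rights = AccessRights{
        .rw = true,
        .dc = false,
        .executable = true,
        .dpl = 0,
        .long = long,
        // L and D/B must not both be set for a code segment.
        .db = @intFromBool(!long),
        .granularity = 1,
    };
    return @bitCast(rights);
}

test "32-bit and 64-bit code segments differ only in L and D/B" {
    try std.testing.expect(codeSegmentRights(false) == 0xC09B);
    try std.testing.expect(codeSegmentRights(true) == 0xA09B);
}
```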

Next, configure EFER. It contains settings related to long mode. Since long mode should be disabled, set all fields to 0:

ymir/arch/x86/vmx/vcpu.zig
fn setupGuestState(vcpu: *Vcpu) VmxError!void {
    ...
    try vmwrite(vmcs.guest.efer, 0);
    ...
}

This completes the VMCS configuration. Since blobGuest() used in previous chapters is no longer needed, you can safely remove it.

Now, let's finally try booting Linux as a guest:

txt
[INFO ] main    | Entered VMX root operation.
[INFO ] vmx     | Guest memory region: 0x0000000000000000 - 0x0000000006400000
[INFO ] vmx     | Guest kernel code offset: 0x0000000000005000
[DEBUG] ept     | EPT Level4 Table @ FFFF88800000A000
[INFO ] vmx     | Guest memory is mapped: HVA=0xFFFF888000A00000 (size=0x6400000)
[INFO ] main    | Setup guest memory.
[INFO ] main    | Starting the virtual machine...
[ERROR] vcpu    | Unhandled VM-exit: reason=arch.x86.vmx.common.ExitReason.cpuid

A VM Exit caused by CPUID means success. This happens because verify_cpu(), called from startup_32(), executes CPUID. Since an exit handler for CPUID hasn’t been implemented yet, this error occurs. Regardless, the fact that control reaches the kernel and the execution of CPUID confirms the guest has started running. You have successfully started running Linux as a guest.

You can also verify that the guest is running using GDB. After Ymir starts and before the guest boots, connect with GDB using target remote :1234. Then, set a hardware breakpoint at 0x100000 (e.g., hbreak *0x100000). When you run continue, execution should stop at the breakpoint:

gdb
$rax   : 0x0000000000000000
$rbx   : 0x0000000000000000
$rcx   : 0x0000000000000000
$rdx   : 0x0000000000000000
$rsp   : 0x0000000000000000
$rbp   : 0x0000000000000000
$rsi   : 0x0000000000010000 <exception_stacks+0x4000>  ->  0x0000000000000000 <fixed_percpu_data>
$rdi   : 0x0000000000000000
$rip   : 0x0000000000100000  ->  0xffff80567000b848
$r8    : 0x0000000000000000
$r9    : 0x0000000000000000
$r10   : 0x0000000000000000
$r11   : 0x0000000000000000
$r12   : 0x0000000000000000
$r13   : 0x0000000000000000
$r14   : 0x0000000000000000
$r15   : 0x0000000000000000
$eflags: 0x2 [ident align vx86 resume nested overflow direction interrupt trap sign zero adjust parity carry] [Ring=0]
$cs: 0x00 $ss: 0x00 $ds: 0x00 $es: 0x00 $fs: 0x00 $gs: 0x00
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- code:x86:64 ----
    0xffffd 11ff               <NO_SYMBOL>   adc    edi, edi
    0xfffff 90                 <NO_SYMBOL>   nop
 -> 0x100000 48b800705680ff..   <NO_SYMBOL>   movabs rax, 0xffffffff80567000
    0x10000a 4883e810           <NO_SYMBOL>   sub    rax, 0x10
    0x10000e 488945f8           <NO_SYMBOL>   mov    QWORD PTR [rbp - 0x8], rax
    0x100012 0f92c0             <NO_SYMBOL>   setb   al
    0x100015 7202               <NO_SYMBOL>   jb     0x100019
    0x100017 eb27               <NO_SYMBOL>   jmp    0x100040
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- stack ----
[!] Cannot access memory at address 0x0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- threads ----
[Thread Id:1] stopped 0x100000 <NO_SYMBOL> in unknown_frame, reason: BREAKPOINT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- trace ----
[#0] 0x0000000000100000 <NO_SYMBOL>
[#1] 0x0000000000000000 <fixed_percpu_data>

Note that the displayed instructions are incorrect. This is likely because GDB interprets them as long mode instructions. In reality, the guest is running in 32-bit protected-mode, which causes this discrepancy. When you type si at this point, RIP advances to 0x100001 instead of 0x10000a (since the first instruction is actually a 1-byte CLD). Be aware that when debugging a virtualized guest with GDB, there are various other incorrect displays and inconveniences to consider.

Summary

In this chapter, Surtr and Ymir worked together to load the Linux kernel image bzImage. The loaded data was placed into guest memory mapped by EPT. We also configured BootParams and the command line as required by the x86 boot protocol. Finally, we set up the VMCS and successfully started the Linux boot process in unrestricted guest mode.

The kernel, having started the boot process, stopped at a CPUID instruction. CPUID is used to query the CPU features supported, and the host can freely control which features are exposed to the guest vCPU. This allows selective feature visibility. In the next chapter, we will handle VM Exits caused by CPUID to allow Linux's boot sequence to progress further.

References

[1] Since the MBR is only 512 bytes, the initial bootloader loaded from the MBR is typically used solely to load a larger bootloader. The bootloader in the MBR is called the first stage bootloader, and the subsequently loaded bootloader is referred to as the second stage bootloader. In the case of UEFI, there is no size limitation, so the entire bootloader can be loaded directly.

[2] The compression algorithm for bzImage can be changed via configuration.