Booting Kernel

Now that we've completed all the cleanup required while still in the UEFI environment, we're finally ready to launch the kernel. In this chapter, we'll prepare the arguments to pass to the Ymir kernel and jump to it. After the jump, we'll pivot the stack to switch to the kernel's stack and then transfer control to the kernel's main function.

important

Source code for this chapter is in the whiz-surtr-jump_to_ymir branch.

Preparing Arguments

Surtr needs to pass some essential information to Ymir, most notably the memory map obtained from UEFI. Once Boot Services are exited, there is no longer a way to retrieve the memory map, so Surtr must pass the previously acquired memory map to Ymir.

Define the data structures shared between Surtr and Ymir in surtr/defs.zig:

surtr/defs.zig
pub const magic: usize = 0xDEADBEEF_CAFEBABE;

pub const BootInfo = extern struct {
    /// Magic number to check if the boot info is valid.
    magic: usize = magic,
    /// UEFI memory map.
    memory_map: MemoryMap,
};

magic is a magic number used to verify that the arguments have been correctly passed to Ymir. memory_map holds the current memory map obtained from Boot Services. Based on this map, Ymir will free unnecessary UEFI memory and construct its own memory allocator.
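Because BootInfo is an extern struct, its fields keep their declaration order, so magic is guaranteed to sit at offset 0. The following minimal sketch checks this at compile time. Note that MemoryMap here is a hypothetical placeholder with made-up fields, not the real definition, and a 64-bit target is assumed:

```zig
const std = @import("std");

pub const magic: usize = 0xDEADBEEF_CAFEBABE;

/// Hypothetical placeholder; NOT the real MemoryMap definition.
const MemoryMap = extern struct {
    buffer_size: usize,
    map_size: usize,
};

pub const BootInfo = extern struct {
    /// Magic number to check if the boot info is valid.
    magic: usize = magic,
    /// UEFI memory map.
    memory_map: MemoryMap,
};

comptime {
    // extern structs keep declaration order, so magic is the first field.
    std.debug.assert(@offsetOf(BootInfo, "magic") == 0);
    // All fields are usize here, so no padding is inserted between them.
    std.debug.assert(@sizeOf(BootInfo) == @sizeOf(usize) + @sizeOf(MemoryMap));
}
```

This is why the magic number can later be found at the very start of the structure in a raw memory dump.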

After completing all cleanup in boot.zig, we construct the BootInfo:

surtr/boot.zig
const boot_info = defs.BootInfo{
    .magic = defs.magic,
    .memory_map = map,
};

Note that since we've already exited Boot Services, we won’t be able to use log output for debugging purposes anymore.

Jump to Kernel

Finally, we are ready to jump to the kernel. This jump can be performed just like a regular function call. The kernel entry point is a function that takes the previously prepared BootInfo as an argument. Since UEFI uses the Windows (Microsoft x64) calling convention, we specify callconv(.Win64):

surtr/boot.zig
const KernelEntryType = fn (defs.BootInfo) callconv(.Win64) noreturn;
const kernel_entry: *KernelEntryType = @ptrFromInt(elf_header.entry);

The address of the entry point is stored in the entry field of the ELF header. We cast this value to a function pointer of type *KernelEntryType using @ptrFromInt().
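The same cast can be tried in an ordinary user-space program. The sketch below is only an illustration: fakeEntry, EntryType, and entryFromAddress are hypothetical names, and the default calling convention is used instead of Win64 so that it runs on any host. It converts a function's address to an integer and back, then calls through the resulting pointer:

```zig
const std = @import("std");

/// Hypothetical stand-in for the kernel entry point.
fn fakeEntry(x: u64) u64 {
    return x + 1;
}

const EntryType = fn (u64) u64;

/// Turn a raw address into a callable function pointer.
fn entryFromAddress(addr: usize) *const EntryType {
    return @ptrFromInt(addr);
}

test "call through an address" {
    // In Surtr, this address comes from the ELF header instead.
    const addr: usize = @intFromPtr(&fakeEntry);
    const entry = entryFromAddress(addr);
    try std.testing.expectEqual(@as(u64, 42), entry(41));
}
```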

All that remains is to call this function:

surtr/boot.zig
kernel_entry(boot_info);
unreachable;

Once control has been transferred to Ymir, it will never return to Surtr. Therefore, we specify unreachable to inform the compiler that this point should never be reached.

Now, let's actually run it and confirm that Ymir starts running. Currently, the Ymir entry point kernelEntry() is a function that simply halts in an infinite loop. Run QEMU and verify that it stops in this loop. While in this state, open the QEMU monitor and use info registers to check the values of the registers:

txt
(qemu) info registers

CPU#0
RAX=deadbeefcafebabe RBX=000000001fe93750 RCX=000000001fe91f78 RDX=0000000000000000
RSI=0000000000000030 RDI=000000001fe91ef8 RBP=000000001fe908a0 RSP=000000001fe8fff8
R8 =000000001fe8ff8c R9 =000000001f9ec018 R10=000000001fae6880 R11=0000000089f90beb
R12=000000001feaff40 R13=000000001fe93720 R14=00000000feffc000 R15=00000000ff000000
RIP=ffffffff80100001 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT=     000000001f9dc000 00000047
IDT=     000000001f537018 00000fff
CR0=80010033 CR2=0000000000000000 CR3=000000001e4d9000 CR4=00000668

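The RIP shows 0xFFFFFFFF80100001, which lies inside the .text section that Ymir's linker script places at 0xFFFFFFFF80100000. It is one byte past the start because the CPU has just executed the one-byte hlt instruction at the entry point. This confirms that control has been successfully transferred to Ymir!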

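In the Windows calling convention, the first four integer or pointer arguments are passed in RCX, RDX, R8, and R9, in that order. A struct that does not fit in a single register, such as BootInfo, is passed by reference. Since BootInfo is the only argument this time, its address should be in RCX: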

txt
(qemu) x/4gx 0x000000001fe91f78
000000001fe91f78: 0xdeadbeefcafebabe 0x0000000000004000
000000001fe91f88: 0x000000001fe91fb0 0x0000000000001770

At the address pointed to by RCX, you can confirm the value of the first field in BootInfo, which is magic with the value 0xDEADBEEFCAFEBABE. This shows that the argument passing is working correctly.

Linker Script and Stack

Although the kernel has started, it is still relying on many resources provided by UEFI, such as the page tables, IDT, and GDT. Among these, the stack is one of the most critical components to set up first.

When UEFI starts Surtr, it prepares a stack for it. However, this stack is allocated in the boot-time memory region of type BootServicesData. Later, Ymir will free this region when it initializes its own memory allocator, so the stack must be switched to a kernel memory region before that happens. To that end, we will prepare a stack segment for Ymir to use as its stack space [2].

Layout

In the Loading Kernel chapter, we briefly set up Ymir’s memory layout using the linker script. Here, we will configure it a bit more properly. Modify ymir/linker.ld as follows:

ymir/linker.ld
STACK_SIZE = 0x5000;

SECTIONS {
    . = KERNEL_VADDR_TEXT;

    .text ALIGN(4K) : AT (ADDR(.text) - KERNEL_VADDR_BASE) {
        *(.text)
        *(.ltext)
    } :text

    .rodata ALIGN(4K) : AT (ADDR(.rodata) - KERNEL_VADDR_BASE) {
        *(.rodata)
    } :rodata

    .data ALIGN(4K) : AT (ADDR(.data) - KERNEL_VADDR_BASE) {
        *(.data)
        *(.ldata)
    } :data

    .bss ALIGN(4K) : AT (ADDR(.bss) - KERNEL_VADDR_BASE) {
        *(COMMON)
        *(.bss)
        *(.lbss)
    } :bss

    __stackguard_upper ALIGN(4K) (NOLOAD) : AT (ADDR(__stackguard_upper) - KERNEL_VADDR_BASE) {
        . += 4K;
    } :__stackguard_upper

    __stack ALIGN(4K) (NOLOAD) : AT (ADDR(__stack) - KERNEL_VADDR_BASE) {
        . += STACK_SIZE;
    } :__stack

    __stackguard_lower ALIGN(4K) (NOLOAD) : AT (ADDR(__stackguard_lower) - KERNEL_VADDR_BASE) {
        __stackguard_lower = .;
        . += 4K;
    } :__stackguard_lower
}

.text / .rodata / .data / .bss sections remain unchanged. They are composed by collecting respective sections from each object file to form the final sections. The AT directive specifies the physical address. Since ADDR(.text) is the virtual address of the .text section, subtracting KERNEL_VADDR_BASE from it sets the physical address for the section. In other words, the physical address is set as the offset from the base virtual address.
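The address calculation performed by the AT directive can be sketched in Zig as follows. The value of kernel_vaddr_base is an assumption: it is inferred from the difference between VirtAddr (0xffffffff80100000) and PhysAddr (0x100000) in the readelf output later in this chapter, not taken from the linker script itself. loadAddress is a hypothetical helper name:

```zig
const std = @import("std");

/// Assumed value, inferred from VirtAddr - PhysAddr in the readelf output.
const kernel_vaddr_base: u64 = 0xffffffff80000000;

/// Mirrors `AT (ADDR(section) - KERNEL_VADDR_BASE)` from the linker script.
fn loadAddress(vaddr: u64) u64 {
    return vaddr - kernel_vaddr_base;
}

test "virtual to physical" {
    // .text: virtual 0xffffffff80100000 is loaded at physical 0x100000.
    try std.testing.expectEqual(@as(u64, 0x100000), loadAddress(0xffffffff80100000));
    // __stack: virtual 0xffffffff80102000 is loaded at physical 0x102000.
    try std.testing.expectEqual(@as(u64, 0x102000), loadAddress(0xffffffff80102000));
}
```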

A new section, __stack, is added. The stack size is set to 5 pages (STACK_SIZE = 0x5000), which is sufficient for now and can be increased later if needed. Unlike the other sections, it is marked NOLOAD: the section still occupies address space at runtime, but no contents are stored for it in the ELF file. Since the stack area does not require initial values, there is nothing to store. As a result, STACK_SIZE determines the section's size in memory without affecting the size of the ELF file itself.

The __stackguard_upper and __stackguard_lower sections placed on either side of the stack act as stack guard pages. Because these pages are read-only, a write that runs past either end of the stack (a stack overflow or underflow) generates a page fault instead of silently overwriting adjacent memory regions.

warning

When the stack overflows and a write hits a guard page, a page fault is triggered. If the fault is then handled on that same stack, pushing the exception frame onto the guard page faults again. This escalates to a double fault and finally to a triple fault, which resets the CPU and brings about the end of the world...

To prevent this, the page fault handler must switch to a dedicated stack. The interrupt handler stack can be specified using the Task State Segment (TSS). By properly configuring the TSS, GDT, and IDT, you can switch to a custom stack for page faults. In this series, however, we do not use the TSS and simply use the stack that is active at the moment the interrupt occurs. If you want to avoid triple faults, you can implement the stack switch yourself.

The :segment suffix at the end of each section definition assigns that section to the segment named segment. The segments are defined as follows:

ymir/linker.ld
PHDRS {
    text PT_LOAD;
    rodata PT_LOAD;
    data PT_LOAD;
    bss PT_LOAD;

    __stackguard_upper PT_LOAD FLAGS(4);
    __stack PT_LOAD FLAGS(6);
    __stackguard_lower PT_LOAD FLAGS(4);
}

The PT_LOAD specified for each segment indicates that the segment should be loaded into memory. The NOLOAD attribute applies to sections, whereas PT_LOAD applies to segments. The FLAGS specify the segment’s permissions, using values 4, 2, and 1 from left to right for Read, Write, and Execute respectively. For the text, rodata, data, and bss segments, no explicit FLAGS are specified, so the section attributes are used as-is. The __stack segment must be RW (read-write, no execute), so it uses FLAGS(6). The guard pages are set as read-only, hence FLAGS(4).
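As a small illustration of how these FLAGS values decode, the following sketch renders a flags value in the same style readelf uses for program headers. The constant names follow the usual ELF naming (PF_R, PF_W, PF_X); flagString is a hypothetical helper written for this example:

```zig
const std = @import("std");

// ELF segment permission bits, as used by FLAGS(...) in PHDRS.
const PF_X: u32 = 1;
const PF_W: u32 = 2;
const PF_R: u32 = 4;

/// Render flags in readelf style: 'R', 'W', 'E', or a space when unset.
fn flagString(flags: u32) [3]u8 {
    return [3]u8{
        if ((flags & PF_R) != 0) @as(u8, 'R') else ' ',
        if ((flags & PF_W) != 0) @as(u8, 'W') else ' ',
        if ((flags & PF_X) != 0) @as(u8, 'E') else ' ',
    };
}

test "decode FLAGS values" {
    const stack = flagString(6); // __stack
    const guard = flagString(4); // guard pages
    try std.testing.expectEqualStrings("RW ", &stack);
    try std.testing.expectEqualStrings("R  ", &guard);
}
```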

Checking Sections and Segments

After setting up Ymir’s memory layout including the stack, let's verify that the layout matches our expectations. After building Ymir with zig build install, use readelf to display the section and segment information:

bash
> readelf --segment --sections ./zig-out/bin/ymir.elf

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         ffffffff80100000  00001000
       0000000000000003  0000000000000000 AXl       0     0     16
  [ 2] .rodata           PROGBITS         ffffffff80101000  00001003
       0000000000000000  0000000000000000   A       0     0     1
  [ 3] .data             PROGBITS         ffffffff80101000  00001003
       0000000000000000  0000000000000000   A       0     0     1
  [ 4] .bss              NOBITS           ffffffff80101000  00001003
       0000000000000000  0000000000000000  WA       0     0     1
  [ 5] __stackguard[...] NOBITS           ffffffff80101000  00002000
       0000000000001000  0000000000000000  WA       0     0     1
  [ 6] __stack           NOBITS           ffffffff80102000  00002000
       0000000000005000  0000000000000000  WA       0     0     1
  [ 7] __stackguard[...] NOBITS           ffffffff80107000  00002000
       0000000000001000  0000000000000000  WA       0     0     1
...
  [16] .symtab           SYMTAB           0000000000000000  00003250
       00000000000000a8  0000000000000018          18     2     8
  [17] .shstrtab         STRTAB           0000000000000000  000032f8
       00000000000000c9  0000000000000000           0     0     1
  [18] .strtab           STRTAB           0000000000000000  000033c1
       0000000000000058  0000000000000000           0     0     1

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000001000 0xffffffff80100000 0x0000000000100000
                 0x0000000000000003 0x0000000000000003  R E    0x1000
  LOAD           0x0000000000002000 0xffffffff80101000 0x0000000000101000
                 0x0000000000000000 0x0000000000001000  R      0x1000
  LOAD           0x0000000000002000 0xffffffff80102000 0x0000000000102000
                 0x0000000000000000 0x0000000000005000  RW     0x1000
  LOAD           0x0000000000002000 0xffffffff80107000 0x0000000000107000
                 0x0000000000000000 0x0000000000001000  R      0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .text
   01     .bss __stackguard_upper
   02     __stack
   03     __stackguard_lower

You can see:

  • Sections for __stack and the guard pages:
    • The Size matches the size you specified (0x5000 for the stack, 0x1000 for each guard page).
    • The Address corresponds to the specified virtual address.
    • The Offset (the section’s position within the ELF file) is 0x2000 for all three. They share the same offset because they occupy no bytes in the file, i.e. they are not included in the ELF binary.
  • Segments for __stack and the guard pages:
    • The FileSiz is 0, which also indicates that no data is included within the ELF file.
    • The MemSiz matches the specified size.
    • The VirtAddr and PhysAddr match the specified virtual and physical addresses.
  • The stack is read-write.
  • The guard pages are read-only.
  • The .bss and __stackguard_upper sections are within the same segment. This is because Ymir currently has no variables placed in the .bss section.

note

As a side note, it is perfectly fine if the attributes of segments and sections deviate from their usual meanings. This is because Surtr, the loader that parses them, is custom-made in this series, so how these values are interpreted is entirely up to us.

Stack Trampoline

Now that the stack is prepared, we switch from the stack provided by UEFI to the kernel's own stack.

Modify Ymir’s entry point kernelEntry() as follows:

ymir/main.zig
extern const __stackguard_lower: [*]const u8;

export fn kernelEntry() callconv(.Naked) noreturn {
    asm volatile (
        \\movq %[new_stack], %%rsp
        \\call kernelTrampoline
        :
        : [new_stack] "r" (@intFromPtr(&__stackguard_lower) - 0x10),
    );
}

The symbol __stackguard_lower is located at the start address of the __stackguard_lower section defined earlier in the linker script, which is also the end of the __stack section. Since only the address of this symbol is used, its declared type does not matter. Because the stack grows toward lower addresses, the inline assembly sets the stack pointer to 0x10 bytes below this address, i.e. just inside the top of the __stack section. This effectively switches the stack to the prepared kernel stack.
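To make the arithmetic concrete, the sketch below recomputes the stack layout from the addresses shown in the readelf output above. The constant names are made up for this example; only the numeric values come from the linker script and the readelf output:

```zig
const std = @import("std");

const page_size: u64 = 0x1000;
const stack_size: u64 = 0x5000; // STACK_SIZE in linker.ld

// Start of the __stackguard_upper section, taken from the readelf output.
const guard_upper_start: u64 = 0xffffffff80101000;

// The stack section begins right after the first guard page.
const stack_bottom: u64 = guard_upper_start + page_size;
// End of __stack, which is where the __stackguard_lower symbol points.
const stack_top: u64 = stack_bottom + stack_size;
// Value loaded into RSP by kernelEntry().
const initial_rsp: u64 = stack_top - 0x10;

test "stack layout matches readelf" {
    try std.testing.expectEqual(@as(u64, 0xffffffff80102000), stack_bottom);
    try std.testing.expectEqual(@as(u64, 0xffffffff80107000), stack_top);
    try std.testing.expectEqual(@as(u64, 0xffffffff80106ff0), initial_rsp);
}
```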

kernelTrampoline() is a trampoline that jumps to a function with Zig’s standard calling convention.

ymir/main.zig
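export fn kernelTrampoline(boot_info: surtr.BootInfo) callconv(.Win64) noreturn {
    // There is no logging yet, so the error value itself is discarded for now.
    kernelMain(boot_info) catch {
        @panic("Exiting...");
    };

    unreachable;
}

fn kernelMain(boot_info: surtr.BootInfo) !void {
    // Unused for now; remove this line once boot_info is actually used.
    _ = boot_info;

    while (true) asm volatile ("hlt");
}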

kernelEntry() must not have a function prologue. This is because the prologue might push something onto the stack before Ymir switches to the kernel’s own stack. Therefore, callconv(.Naked) is specified.

Ymir's main function uses the normal Zig calling convention and can return errors, allowing the use of convenient keywords like try. However, a function marked with callconv(.Naked) cannot call other functions using ordinary Zig syntax; it has to use inline assembly. kernelTrampoline() is therefore introduced as an intermediary that is called from assembly, receives the argument, and then calls kernelMain() as a regular Zig function.

At the start of kernelEntry(), the argument passed from Surtr follows the UEFI calling convention, with the BootInfo argument placed in RCX. Therefore, kernelTrampoline() is specified with callconv(.Win64). Since functions with callconv(.Win64) can normally call functions with other calling conventions, this setup allows calling kernelMain() in the regular Zig way.

info

By adding the export keyword to a function, the function can be referenced by the exact name it was defined with. Since kernelEntry() and kernelTrampoline() are referenced by their symbol names (kernelTrampoline() from the inline assembly above), they are marked with export. Without the keyword, the function names would include the file or module name, such as main.kernelTrampoline.

Verification of BootInfo

Surtr’s role has ended, and Ymir has taken control. To conclude this chapter, let’s perform a sanity check on the BootInfo argument passed by Surtr.

First, to allow Ymir to reference the information defined by Surtr, create a Surtr module and add it to Ymir. Add the following to build.zig:

build.zig
// Modules
const surtr_module = b.createModule(.{
    .root_source_file = b.path("surtr/defs.zig"),
});
...
ymir.root_module.addImport("surtr", surtr_module);

Now, @import("surtr") allows referencing surtr/defs.zig. Let’s verify the BootInfo in kernelMain():

ymir/main.zig
// Validate the boot info.
validateBootInfo(boot_info) catch {
    // Ideally we would log the error here, but that will have to wait until next time.
    return error.InvalidBootInfo;
};

fn validateBootInfo(boot_info: surtr.BootInfo) !void {
    if (boot_info.magic != surtr.magic) {
        return error.InvalidMagic;
    }
}

At the beginning of BootInfo, Surtr should have stored a magic number. By checking whether this value is correctly set, we can verify that Surtr has passed the argument properly.

If magic is incorrect, validateBootInfo() returns error.InvalidMagic, which kernelMain() converts into error.InvalidBootInfo. Ideally, we would output an error message here, but since Ymir does not yet have a logging system, we simply return the error silently for now.
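The check itself can be exercised outside the kernel. The following is a minimal, self-contained sketch: BootInfo is reduced to its magic field (the memory map is omitted), so it is a simplified stand-in for surtr.BootInfo, and a 64-bit target is assumed:

```zig
const std = @import("std");

const magic: usize = 0xDEADBEEF_CAFEBABE;

/// Simplified stand-in for surtr.BootInfo; memory_map is omitted here.
const BootInfo = extern struct {
    magic: usize = magic,
};

fn validateBootInfo(boot_info: BootInfo) !void {
    if (boot_info.magic != magic) {
        return error.InvalidMagic;
    }
}

test "magic is checked" {
    // A BootInfo built with the default magic passes validation.
    try validateBootInfo(.{});
    // Any other value is rejected.
    try std.testing.expectError(error.InvalidMagic, validateBootInfo(.{ .magic = 0 }));
}
```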

Summary

In this chapter, we prepared the arguments to pass from Surtr to Ymir and jumped to the kernel's entry point. At the kernel entry point, we switched to the kernel's dedicated stack and transferred control to the kernel's main function while also switching the calling convention.

This chapter wraps up the core implementation of Surtr. We'll still need to extend it later to support loading a guest Linux kernel, so be sure to treat it kindly when the time comes. In the next chapter, we'll turn our attention to the Ymir kernel. The very first thing we'll implement there is a logging system.

[2]

Ideally, we should dynamically allocate memory for the stack and map virtual addresses to that region. However, to keep things simple in this series, we keep using this statically reserved region as the stack permanently.

[4]

Guard pages are usually left unmapped altogether. However, to keep things simple this time, we'll protect the guard pages by making them read-only instead.