Booting Kernel
Now that we've completed all the cleanup required while still in the UEFI environment, we're finally ready to launch the kernel. In this chapter, we'll prepare the arguments to pass to the Ymir kernel and jump to it. After the jump, we'll pivot the stack to switch to the kernel's stack and then transfer control to the kernel's main function.
important
Source code for this chapter is in the whiz-surtr-jump_to_ymir branch.
Table of Contents
- Preparing Arguments
- Jump to Kernel
- Linker Script and Stack
- Stack Trampoline
- Verification of BootInfo
- Summary
Preparing Arguments
Surtr needs to pass some essential information to Ymir, most notably the memory map obtained from UEFI. Once Boot Services are exited, there is no longer a way to retrieve the memory map, so Surtr must pass the previously acquired memory map to Ymir.
Define the data structures shared between Surtr and Ymir in surtr/defs.zig
:
pub const magic: usize = 0xDEADBEEF_CAFEBABE;

pub const BootInfo = extern struct {
    /// Magic number to check if the boot info is valid.
    magic: usize = magic,
    /// UEFI memory map.
    memory_map: MemoryMap,
};
magic
is a magic number used to verify that the arguments have been correctly passed to Ymir. memory_map
holds the current memory map obtained from Boot Services. Based on this map, Ymir will free unnecessary UEFI memory and construct its own memory allocator.
After completing all cleanup in boot.zig
, we construct the BootInfo
:
const boot_info = defs.BootInfo{
    .magic = defs.magic,
    .memory_map = map,
};
Note that since we've already exited Boot Services, we won’t be able to use log output for debugging purposes anymore.
Jump to Kernel
Finally, we are ready to jump to the kernel. This jump can be performed just like a regular function call. The kernel entry point is a function that takes the previously prepared BootInfo as an argument. Since UEFI uses the Windows calling convention, we specify callconv(.Win64):
const KernelEntryType = fn (defs.BootInfo) callconv(.Win64) noreturn;
const kernel_entry: *KernelEntryType = @ptrFromInt(elf_header.entry);
The address of the entry point is stored in the entry
field of the ELF header. We cast this value to a function pointer of type *KernelEntryType
using @ptrFromInt()
.
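The integer-to-function-pointer cast can be tried out in an ordinary user-space program. The following is a minimal sketch, not Surtr's actual code: fakeEntry, EntryType, and entryFromAddress are hypothetical names, and the Win64 convention and noreturn are left out so that the call can return and be observed.

```zig
const std = @import("std");

/// Hypothetical stand-in for a kernel entry point.
/// In Surtr the address comes from `elf_header.entry` instead.
fn fakeEntry(value: u64) u64 {
    return value + 1;
}

/// Same idea as `KernelEntryType`, but with the default calling
/// convention and a normal return type so the result can be printed.
const EntryType = fn (u64) u64;

/// Turn a raw address back into a callable function pointer.
fn entryFromAddress(addr: usize) *const EntryType {
    return @ptrFromInt(addr);
}

pub fn main() void {
    // Take the address of a real function as a plain integer...
    const addr: usize = @intFromPtr(&fakeEntry);
    // ...and cast it back, just as Surtr does with the ELF entry address.
    const entry = entryFromAddress(addr);
    std.debug.print("entry(41) = {d}\n", .{entry(41)});
}
```

The cast itself performs no validation of the target: the program is responsible for making sure that code with the expected signature really lives at that address.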
All that remains is to call this function:
kernel_entry(boot_info);
unreachable;
Once control has been transferred to Ymir, it will never return to Surtr. Therefore, we specify unreachable
to inform the compiler that this point should never be reached.
Now, let's actually run it and confirm that Ymir starts running. Currently, the Ymir entry point kernelEntry() is a function that simply halts in an infinite loop. Run QEMU and verify that it stops in this infinite loop. While in this state, open the QEMU monitor and use info registers to check the values of the registers:
(qemu) info registers
CPU#0
RAX=deadbeefcafebabe RBX=000000001fe93750 RCX=000000001fe91f78 RDX=0000000000000000
RSI=0000000000000030 RDI=000000001fe91ef8 RBP=000000001fe908a0 RSP=000000001fe8fff8
R8 =000000001fe8ff8c R9 =000000001f9ec018 R10=000000001fae6880 R11=0000000089f90beb
R12=000000001feaff40 R13=000000001fe93720 R14=00000000feffc000 R15=00000000ff000000
RIP=ffffffff80100001 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= 000000001f9dc000 00000047
IDT= 000000001f537018 00000fff
CR0=80010033 CR2=0000000000000000 CR3=000000001e4d9000 CR4=00000668
The RIP shows 0xFFFFFFFF80100001, which lies inside the .text section that Ymir's linker script places at 0xFFFFFFFF80100000. This confirms that control has been successfully transferred to Ymir!
In the Windows calling convention, the first four integer or pointer arguments are passed in RCX, RDX, R8, and R9, in that order. A struct larger than 8 bytes, such as BootInfo, is passed by reference, so RCX should hold the address of the BootInfo:
(qemu) x/4gx 0x000000001fe91f78
000000001fe91f78: 0xdeadbeefcafebabe 0x0000000000004000
000000001fe91f88: 0x000000001fe91fb0 0x0000000000001770
At the address pointed to by RCX, you can confirm the value of the first field in BootInfo
, which is magic
with the value 0xDEADBEEFCAFEBABE
. This shows that the argument passing is working correctly.
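The magic number shows up first because BootInfo is an extern struct, which uses the C ABI layout and keeps fields in declaration order. The sketch below checks this on a 64-bit target. The MemoryMap here is a hypothetical two-field stand-in, not the real definition from Surtr, and firstWord is a helper introduced only for illustration.

```zig
const std = @import("std");

const magic: usize = 0xDEADBEEF_CAFEBABE;

/// Hypothetical stand-in; the real `MemoryMap` is defined elsewhere in Surtr.
const MemoryMap = extern struct {
    buffer_size: usize,
    descriptors: usize,
};

const BootInfo = extern struct {
    magic: usize = magic,
    memory_map: MemoryMap,
};

/// Read the first machine word stored at the address of a `BootInfo`,
/// similar to what `x/4gx` displays in the QEMU monitor.
fn firstWord(info: *const BootInfo) usize {
    const words: [*]const usize = @ptrCast(info);
    return words[0];
}

pub fn main() void {
    const info = BootInfo{
        .memory_map = .{ .buffer_size = 0x4000, .descriptors = 0 },
    };
    std.debug.print("offset of magic = {d}\n", .{@offsetOf(BootInfo, "magic")});
    std.debug.print("first word = 0x{x}\n", .{firstWord(&info)});
}
```

A regular (non-extern) Zig struct gives no such ordering guarantee, which is one reason the structure shared between Surtr and Ymir is declared extern.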
Linker Script and Stack
Although the kernel has started, it is still relying on many resources provided by UEFI, such as the page tables, IDT, and GDT. Among these, the stack is one of the most critical components to set up first.
When UEFI starts Surtr, it prepares a stack for it. However, this stack is allocated in the boot-time memory region called BootServicesData. Later, Ymir will free this region when it initializes its own memory allocator. Therefore, it is necessary to switch the stack to a kernel memory region first. In this case, we will prepare a stack segment for Ymir to use as its stack space [2].
Layout
In the Loading Kernel chapter, we briefly set up Ymir’s memory layout using the linker script. Here, we will configure it a bit more properly. Modify ymir/linker.ld as follows [3]:
STACK_SIZE = 0x5000;

SECTIONS {
    . = KERNEL_VADDR_TEXT;

    .text ALIGN(4K) : AT (ADDR(.text) - KERNEL_VADDR_BASE) {
        *(.text)
        *(.ltext)
    } :text

    .rodata ALIGN(4K) : AT (ADDR(.rodata) - KERNEL_VADDR_BASE) {
        *(.rodata)
    } :rodata

    .data ALIGN(4K) : AT (ADDR(.data) - KERNEL_VADDR_BASE) {
        *(.data)
        *(.ldata)
    } :data

    .bss ALIGN(4K) : AT (ADDR(.bss) - KERNEL_VADDR_BASE) {
        *(COMMON)
        *(.bss)
        *(.lbss)
    } :bss

    __stackguard_upper ALIGN(4K) (NOLOAD) : AT (ADDR(__stackguard_upper) - KERNEL_VADDR_BASE) {
        . += 4K;
    } :__stackguard_upper

    __stack ALIGN(4K) (NOLOAD) : AT (ADDR(__stack) - KERNEL_VADDR_BASE) {
        . += STACK_SIZE;
    } :__stack

    __stackguard_lower ALIGN(4K) (NOLOAD) : AT (ADDR(__stackguard_lower) - KERNEL_VADDR_BASE) {
        __stackguard_lower = .;
        . += 4K;
    } :__stackguard_lower
}
The .text, .rodata, .data, and .bss sections remain unchanged. Each one is formed by collecting the corresponding input sections from every object file. The AT directive specifies the load (physical) address. Since ADDR(.text) is the virtual address of the .text section, subtracting KERNEL_VADDR_BASE from it yields the physical address for the section. In other words, the physical address is the section's offset from the base virtual address.
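The arithmetic behind the AT expressions can be sketched as a small function. Note that the value of KERNEL_VADDR_BASE below is an assumption: it is inferred from the readelf output in this chapter, where virtual address 0xffffffff80100000 maps to physical address 0x100000. The name loadAddress is hypothetical.

```zig
const std = @import("std");

/// Assumed value, inferred from the readelf output in this chapter
/// (VirtAddr 0xffffffff80100000 -> PhysAddr 0x100000).
const KERNEL_VADDR_BASE: u64 = 0xFFFF_FFFF_8000_0000;

/// Mirrors `AT (ADDR(section) - KERNEL_VADDR_BASE)` from the linker script:
/// the physical address is the offset from the base virtual address.
fn loadAddress(vaddr: u64) u64 {
    return vaddr - KERNEL_VADDR_BASE;
}

pub fn main() void {
    const text_vaddr: u64 = 0xFFFF_FFFF_8010_0000;
    const stack_vaddr: u64 = 0xFFFF_FFFF_8010_2000;
    std.debug.print(".text   -> 0x{x}\n", .{loadAddress(text_vaddr)});
    std.debug.print("__stack -> 0x{x}\n", .{loadAddress(stack_vaddr)});
}
```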
A new section __stack is added. The stack size is set to 5 pages (0x5000 bytes), which is sufficient for now and can be increased later if needed. Unlike the other sections, it is marked NOLOAD, which means no contents are loaded for the section from the file. Since the stack area does not require initial values, it does not need to be included in the ELF file. Although STACK_SIZE defines the section size in memory, it does not affect the size of the ELF file itself.
The __stackguard_upper and __stackguard_lower sections placed on both sides of the stack act as stack guard pages. Because these pages are read-only, a stack overflow or underflow that writes into them generates a page fault. This mechanism prevents accidental overwriting of adjacent memory regions caused by unnoticed stack overflows.
warning
When the stack overflows and a write hits a guard page, a page fault is triggered. If the fault handler then tries to use that same guard page as its stack, another fault occurs, which escalates to a double fault. The double fault handler fails in the same way, and the result is a triple fault, causing the CPU to reset and bring about the end of the world...
To prevent this, the page fault handler must switch to a dedicated stack. The interrupt handler stack can be specified using the Task State Segment (TSS). By properly configuring the TSS, GDT, and IDT, you can switch to a custom stack for page faults. In this series, however, we do not use the TSS and simply use the stack that is active at the moment the interrupt occurs. If you want to avoid triple faults, you can implement the stack switch yourself.
The :segment at the end of each section definition assigns that section to the segment named segment. The segments are defined as follows:
PHDRS {
    text PT_LOAD;
    rodata PT_LOAD;
    data PT_LOAD;
    bss PT_LOAD;

    __stackguard_upper PT_LOAD FLAGS(4);
    __stack PT_LOAD FLAGS(6);
    __stackguard_lower PT_LOAD FLAGS(4);
}
The PT_LOAD specified for each segment indicates that the segment should be loaded into memory. The NOLOAD attribute applies to sections, whereas PT_LOAD applies to segments. FLAGS specifies the segment’s permissions as a bitmask: 4 for Read, 2 for Write, and 1 for Execute. For the text, rodata, data, and bss segments, no explicit FLAGS are specified, so the permissions are derived from the attributes of the sections they contain. The __stack segment must be RW (read-write, no execute), so it uses FLAGS(6). The guard pages are set as read-only, hence FLAGS(4).
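To make the bit values concrete, here is a small sketch that decodes a FLAGS value into a permission string. The constant names follow the usual ELF names for these bits (PF_R, PF_W, PF_X); describeFlags is a hypothetical helper and not part of Surtr or Ymir.

```zig
const std = @import("std");

// ELF program header permission bits.
const PF_X: u32 = 1; // Execute
const PF_W: u32 = 2; // Write
const PF_R: u32 = 4; // Read

/// Decode a FLAGS value into a three-character string such as "RW-".
fn describeFlags(flags: u32) [3]u8 {
    const r: u8 = if ((flags & PF_R) != 0) 'R' else '-';
    const w: u8 = if ((flags & PF_W) != 0) 'W' else '-';
    const x: u8 = if ((flags & PF_X) != 0) 'X' else '-';
    return [3]u8{ r, w, x };
}

pub fn main() void {
    const stack = describeFlags(6); // FLAGS(6): the __stack segment
    const guard = describeFlags(4); // FLAGS(4): the guard pages
    std.debug.print("FLAGS(6) = {s}\n", .{&stack});
    std.debug.print("FLAGS(4) = {s}\n", .{&guard});
}
```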
Checking Sections and Segments
After setting up Ymir’s memory layout including the stack, let's verify that the layout matches our expectations. After building Ymir with zig build install
, use readelf
to display the section and segment information:
> readelf --segments --sections ./zig-out/bin/ymir.elf
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS ffffffff80100000 00001000
0000000000000003 0000000000000000 AXl 0 0 16
[ 2] .rodata PROGBITS ffffffff80101000 00001003
0000000000000000 0000000000000000 A 0 0 1
[ 3] .data PROGBITS ffffffff80101000 00001003
0000000000000000 0000000000000000 A 0 0 1
[ 4] .bss NOBITS ffffffff80101000 00001003
0000000000000000 0000000000000000 WA 0 0 1
[ 5] __stackguard[...] NOBITS ffffffff80101000 00002000
0000000000001000 0000000000000000 WA 0 0 1
[ 6] __stack NOBITS ffffffff80102000 00002000
0000000000005000 0000000000000000 WA 0 0 1
[ 7] __stackguard[...] NOBITS ffffffff80107000 00002000
0000000000001000 0000000000000000 WA 0 0 1
...
[16] .symtab SYMTAB 0000000000000000 00003250
00000000000000a8 0000000000000018 18 2 8
[17] .shstrtab STRTAB 0000000000000000 000032f8
00000000000000c9 0000000000000000 0 0 1
[18] .strtab STRTAB 0000000000000000 000033c1
0000000000000058 0000000000000000 0 0 1
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000001000 0xffffffff80100000 0x0000000000100000
0x0000000000000003 0x0000000000000003 R E 0x1000
LOAD 0x0000000000002000 0xffffffff80101000 0x0000000000101000
0x0000000000000000 0x0000000000001000 R 0x1000
LOAD 0x0000000000002000 0xffffffff80102000 0x0000000000102000
0x0000000000000000 0x0000000000005000 RW 0x1000
LOAD 0x0000000000002000 0xffffffff80107000 0x0000000000107000
0x0000000000000000 0x0000000000001000 R 0x1000
Section to Segment mapping:
Segment Sections...
00 .text
01 .bss __stackguard_upper
02 __stack
03 __stackguard_lower
You can see:
- Sections for __stack and the guard pages:
  - The Size matches the size you specified.
  - The Address corresponds to the specified virtual address.
  - The Offset (the section’s position within the ELF file) is 0x2000 for all three. This indicates that the sections occupy no space in the ELF binary.
- Segments for __stack and the guard pages:
  - The FileSiz is 0, which also indicates that no data is included within the ELF file.
  - The MemSiz matches the specified size.
  - The VirtAddr and PhysAddr match the specified virtual and physical addresses.
  - The stack is read-write.
  - The guard pages are read-only.
- The .bss and __stackguard_upper sections are within the same segment. This is because Ymir currently has no variables placed in the .bss section.
note
As a side note, it is perfectly fine if the attributes of segments and sections deviate from their usual meanings. This is because Surtr, the loader that parses them, is custom-made in this series, so how these values are interpreted is entirely up to us.
Stack Trampoline
Now that the stack is prepared, we switch from the stack provided by UEFI to the kernel's own stack.
Modify Ymir’s entry point kernelEntry()
as follows:
extern const __stackguard_lower: [*]const u8;

export fn kernelEntry() callconv(.Naked) noreturn {
    asm volatile (
        \\movq %[new_stack], %%rsp
        \\call kernelTrampoline
        :
        : [new_stack] "r" (@intFromPtr(&__stackguard_lower) - 0x10),
    );
}
__stackguard_lower is a symbol located at the start address of the __stackguard_lower section defined earlier in the linker script, which is also the end (highest address) of the __stack section. Since only the address of this variable is used, its type does not matter. The inline assembly sets the stack pointer to 0x10 bytes below that address. Because the stack grows downward, this places the initial stack pointer just inside the top of the prepared kernel stack.
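The resulting stack pointer can be checked with a bit of arithmetic. This sketch uses the addresses reported by readelf earlier in the chapter; the constant and function names are hypothetical and exist only for this illustration.

```zig
const std = @import("std");

// Values taken from the readelf output earlier in this chapter.
const stack_start: u64 = 0xFFFF_FFFF_8010_2000; // start of the __stack section
const stack_size: u64 = 0x5000; // STACK_SIZE in the linker script
const stackguard_lower: u64 = stack_start + stack_size; // 0xffffffff80107000

/// Same computation as the inline assembly operand:
/// `@intFromPtr(&__stackguard_lower) - 0x10`.
fn initialRsp(guard_lower: u64) u64 {
    return guard_lower - 0x10;
}

/// True if `addr` lies inside the __stack section.
fn isInsideStack(addr: u64) bool {
    return addr >= stack_start and addr < stackguard_lower;
}

pub fn main() void {
    const rsp = initialRsp(stackguard_lower);
    std.debug.print("initial RSP   = 0x{x}\n", .{rsp});
    std.debug.print("inside stack  = {}\n", .{isInsideStack(rsp)});
    std.debug.print("16-byte align = {}\n", .{rsp % 16 == 0});
}
```

With these values the initial RSP lands inside the __stack section and on a 16-byte boundary, so the first push made by the call instruction writes to stack memory rather than to the guard page.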
kernelTrampoline()
is a trampoline that jumps to a function with Zig’s standard calling convention.
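export fn kernelTrampoline(boot_info: surtr.BootInfo) callconv(.Win64) noreturn {
    // No error capture here: Zig rejects an unused capture, and there is no logger yet.
    kernelMain(boot_info) catch {
        @panic("Exiting...");
    };

    unreachable;
}

fn kernelMain(boot_info: surtr.BootInfo) !void {
    // Not used yet. Remove this line once boot_info is validated later in this chapter.
    _ = boot_info;

    while (true) asm volatile ("hlt");
}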
kernelEntry()
must not have a function prologue. This is because the prologue might push something onto the stack before Ymir switches to the kernel’s own stack. Therefore, callconv(.Naked)
is specified.
Ymir's main function uses the normal Zig calling convention and can return errors, which lets us use convenient keywords like try. However, a function marked callconv(.Naked) cannot call other functions with ordinary Zig syntax; it has to use inline assembly. kernelTrampoline() bridges this gap: it is called from assembly, receives the argument, and then calls kernelMain() as regular Zig code.
At the start of kernelEntry()
, the argument passed from Surtr follows the UEFI calling convention, with the BootInfo
argument placed in RCX. Therefore, kernelTrampoline()
is specified with callconv(.Win64)
. Since functions with callconv(.Win64)
can normally call functions with other calling conventions, this setup allows calling kernelMain()
in the regular Zig way.
info
By adding the export keyword to a function, the function can be referenced by the exact name it was defined with. Since kernelTrampoline() is called by name from inline assembly, it is marked with export, and kernelEntry() is exported in the same way. Without the keyword, the function names would include the file or module name, such as main.kernelTrampoline.
Verification of BootInfo
Surtr’s role has ended, and Ymir has taken control. To conclude this chapter, let’s perform a sanity check on the BootInfo
argument passed by Surtr.
First, to allow Ymir to reference the information defined by Surtr, create a Surtr module and add it to Ymir. Add the following to build.zig
:
// Modules
const surtr_module = b.createModule(.{
    .root_source_file = b.path("surtr/defs.zig"),
});
...
ymir.root_module.addImport("surtr", surtr_module);
Now, @import("surtr")
allows referencing surtr/defs.zig
. Let’s verify the BootInfo
in kernelMain()
:
// Validate the boot info.
validateBootInfo(boot_info) catch {
    // We'd like to log the error here, but that will have to wait until the next chapter.
    return error.InvalidBootInfo;
};

fn validateBootInfo(boot_info: surtr.BootInfo) !void {
    if (boot_info.magic != surtr.magic) {
        return error.InvalidMagic;
    }
}
At the beginning of BootInfo
, Surtr should have stored a magic number. By checking whether this value is correctly set, we can verify that Surtr has passed the argument properly.
If magic is incorrect, validateBootInfo() returns error.InvalidMagic, which kernelMain() turns into error.InvalidBootInfo. Ideally, we would output an error message here, but since Ymir does not yet have a logging system, we simply return the error silently for now.
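Since Ymir cannot print anything yet, one way to gain confidence in the check is to exercise the same logic in a user-space program. The following is a minimal sketch under that assumption: MemoryMap is a hypothetical one-field stand-in, and BootInfo is redefined locally instead of being imported from the surtr module.

```zig
const std = @import("std");

const magic: usize = 0xDEADBEEF_CAFEBABE;

/// Hypothetical stand-in; the real `MemoryMap` is defined in Surtr.
const MemoryMap = extern struct {
    placeholder: usize,
};

const BootInfo = extern struct {
    magic: usize = magic,
    memory_map: MemoryMap,
};

/// Same check as in the chapter, written against the local stand-in types.
fn validateBootInfo(boot_info: BootInfo) !void {
    if (boot_info.magic != magic) {
        return error.InvalidMagic;
    }
}

pub fn main() void {
    // The default field value fills in the correct magic number.
    const good = BootInfo{ .memory_map = .{ .placeholder = 0 } };
    // A corrupted magic number must be rejected.
    const bad = BootInfo{ .magic = 0, .memory_map = .{ .placeholder = 0 } };

    if (validateBootInfo(good)) |_| {
        std.debug.print("good: accepted\n", .{});
    } else |err| {
        std.debug.print("good: rejected with {s}\n", .{@errorName(err)});
    }

    if (validateBootInfo(bad)) |_| {
        std.debug.print("bad: accepted\n", .{});
    } else |err| {
        std.debug.print("bad: rejected with {s}\n", .{@errorName(err)});
    }
}
```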
Summary
In this chapter, we prepared the arguments to pass from Surtr to Ymir and jumped to the kernel's entry point. At the kernel entry point, we switched to the kernel's dedicated stack and transferred control to the kernel's main function while also switching the calling convention.
This chapter wraps up the core implementation of Surtr. We'll still need to extend it later to support loading a guest Linux kernel, so be sure to treat it kindly when the time comes. In the next chapter, we'll turn our attention to the Ymir kernel. The very first thing we'll implement there is a logging system.
[2] Ideally, we should dynamically allocate memory for the stack and map virtual addresses to that region. However, to keep things simple in this series, we'll use this region as a stack forever.
[3] Guard pages are usually left unmapped altogether. However, to keep things simple this time, we'll protect the guard pages by making them read-only instead.