From c4c3d3c1a1249b86adfef8948b85b6adb200dd94 Mon Sep 17 00:00:00 2001 From: Alexander Kuleshov Date: Sun, 21 Jun 2026 19:11:58 +0500 Subject: [PATCH] init-2: adjust part to the modern kernel versions Signed-off-by: Alexander Kuleshov --- Initialization/images/direct-mapping.svg | 72 ++ .../images/early-top-pgt-entries.svg | 41 + Initialization/images/idt-descriptor.svg | 25 + Initialization/images/idt-gate-descriptor.svg | 69 ++ .../images/interrupt-stack-frame-full.svg | 118 +++ .../images/interrupt-stack-frame.svg | 69 ++ Initialization/linux-initialization-2.md | 857 +++++++++--------- 7 files changed, 800 insertions(+), 451 deletions(-) create mode 100644 Initialization/images/direct-mapping.svg create mode 100644 Initialization/images/early-top-pgt-entries.svg create mode 100644 Initialization/images/idt-descriptor.svg create mode 100644 Initialization/images/idt-gate-descriptor.svg create mode 100644 Initialization/images/interrupt-stack-frame-full.svg create mode 100644 Initialization/images/interrupt-stack-frame.svg diff --git a/Initialization/images/direct-mapping.svg b/Initialization/images/direct-mapping.svg new file mode 100644 index 00000000..93520d79 --- /dev/null +++ b/Initialization/images/direct-mapping.svg @@ -0,0 +1,72 @@ + + + + + + + + + + + + Virtual address space + Physical memory + + + + + + + + + direct mapping of + all physical memory + offset + + + faulting address + 0xffff888001000000 + + + __PAGE_OFFSET + 0xffff888000000000 + + + + + + + same offset + + + physical address + 0x0000000001000000 + + + physical address 0 + 0x0000000000000000 + + + + + maps to + + + maps to + diff --git a/Initialization/images/early-top-pgt-entries.svg b/Initialization/images/early-top-pgt-entries.svg new file mode 100644 index 00000000..d47d0f43 --- /dev/null +++ b/Initialization/images/early-top-pgt-entries.svg @@ -0,0 +1,41 @@ + + + + + + + + + + + early_top_pgt + + + + + + + + Entry 0 + ... 
+ Entry 510 + Entry 511 + + + empty or identity mapping + + + + + next page table + (maps the kernel image) + diff --git a/Initialization/images/idt-descriptor.svg b/Initialization/images/idt-descriptor.svg new file mode 100644 index 00000000..9a656be1 --- /dev/null +++ b/Initialization/images/idt-descriptor.svg @@ -0,0 +1,25 @@ + + + + + + 79 + 16 + 15 + 0 + + + + + Base Address (64 bits) + Limit + diff --git a/Initialization/images/idt-gate-descriptor.svg b/Initialization/images/idt-gate-descriptor.svg new file mode 100644 index 00000000..a9dd07d3 --- /dev/null +++ b/Initialization/images/idt-gate-descriptor.svg @@ -0,0 +1,69 @@ + + + + + + 127 + 96 + + Reserved + + + 95 + 64 + + Offset 63..32 + + + 63 + 48 + 47 + 46 + 44 + 43 + 39 + 36 + 35 + 34 + 32 + + + + + + + + + + + + Offset 31..16 + P + DPL + 0 + Type + 0 0 0 + 0 + 0 + IST + + + 31 + 16 + 15 + 0 + + + Segment Selector + Offset 15..0 + diff --git a/Initialization/images/interrupt-stack-frame-full.svg b/Initialization/images/interrupt-stack-frame-full.svg new file mode 100644 index 00000000..af8fdf19 --- /dev/null +++ b/Initialization/images/interrupt-stack-frame-full.svg @@ -0,0 +1,118 @@ + + + + + + + + + + + + higher addresses + lower addresses + + + offset + + + + + + + + + + + + + + + + + + + + + + + + + + %ss + %rsp + %rflags + %cs + %rip + error code + %rdi + %rsi + %rdx + %rcx + %rax + %r8 + %r9 + %r10 + %r11 + %rbx + %rbp + %r12 + %r13 + %r14 + %r15 + + + +160 + +152 + +144 + +136 + +128 + +120 + +112 + +104 + +96 + +88 + +80 + +72 + +64 + +56 + +48 + +40 + +32 + +24 + +16 + +8 + +0 + + + old stack segment + old stack pointer + flags register + old code segment + return instruction pointer + error code (0 if none) + + + + + + general-purpose registers + pushed by the handler + + + %rsp + + diff --git a/Initialization/images/interrupt-stack-frame.svg b/Initialization/images/interrupt-stack-frame.svg new file mode 100644 index 00000000..1547492b --- /dev/null +++ 
b/Initialization/images/interrupt-stack-frame.svg @@ -0,0 +1,69 @@ + + + + + + + + + + + + higher addresses + lower addresses + + + offset + + + + + + + + + + + + + %ss + %rsp + %rflags + %cs + %rip + error code + + + +40 + +32 + +24 + +16 + +8 + +0 + + + %rsp + + + + old stack segment + old stack pointer + flags register + old code segment + return instruction pointer + optional, only for some exceptions + + diff --git a/Initialization/linux-initialization-2.md b/Initialization/linux-initialization-2.md index 7fb993df..461411d8 100644 --- a/Initialization/linux-initialization-2.md +++ b/Initialization/linux-initialization-2.md @@ -1,269 +1,365 @@ -Kernel initialization. Part 2. -================================================================================ +# Linux kernel initialization - Part 2 -Early interrupt and exception handling --------------------------------------------------------------------------------- +In the previous [part](linux-initialization-1.md), we saw the first assembly instructions of the Linux kernel code. The kernel started the initialization process and performed the following first steps: -In the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) we stopped before setting up early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code starts to work. +- Early stack setup +- Loading of the kernel Global Descriptor Table +- Initialization of the kernel page tables -We already started this preparation in the previous ([first](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)) part of this [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization). We continue in this part and will learn more about interrupt and exception handling. 
+After these steps, we can finally leave the assembly code, at least for a while, and switch to C code. Last time, we stopped at the call to the `x86_64_start_kernel` function from [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). This is where we will continue in this chapter. -Remember that we stopped before following function: +At this point, the kernel has already loaded or re-initialized a few important structures, but most of the system is still not ready. One of the next structures the Linux kernel has to prepare is the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). The Interrupt Descriptor Table, or IDT, stores the addresses of interrupt and exception handlers. In this chapter, we will see how this structure is built and how the kernel handles early interrupts and exceptions. +Now that we have a rough plan for what comes next, let's continue our dive into the Linux kernel internals. + +## First steps in the C code + +The assembly code is now behind us, and we are back in C. The kernel is still far from its normal working state. We have not even reached the generic kernel code yet, because we are still in the early architecture-specific setup. At this stage, [maskable](https://en.wikipedia.org/wiki/Interrupt#Masking) hardware interrupts are disabled, so no device interrupt will arrive during this part of boot. However, hardware interrupts are not the only events that can be triggered in the system. For example, CPU exceptions, such as a [page fault](https://en.wikipedia.org/wiki/Page_fault), are different. They are synchronous events raised by the processor itself. For this reason, even at this early initialization stage, the kernel needs to know how to handle exceptions. This becomes especially important after the kernel starts removing temporary identity mappings from the page tables. 
One of the main goals of the `x86_64_start_kernel` function is to finish this early preparation so the kernel can move on to its main initialization. + +But before we reach the initialization of the Interrupt Descriptor Table, the kernel has a few smaller tasks to finish. The first C code starts with build-time sanity checks: + + ```C - idt_setup_early_handler(); + BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map); + BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE); + BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE); + BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0); + BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0); + BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL)); + MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == + (__START_KERNEL & PGDIR_MASK))); + BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END); ``` -from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we start to sort out this function, we need to understand interrupts and handlers. +The `BUILD_BUG_ON` macro validates its condition at compile time. If the condition passed to this macro is true, the kernel build fails. Using this macro, the kernel verifies the layout of its virtual address space. For example, it checks that the area reserved for kernel modules does not overlap the kernel image. + +The next step after these sanity checks is quite interesting. Did you know that accessing a CPU register can be more expensive than accessing memory? If you have not spent much time with system programming or reading Intel manuals from cover to cover, this statement may sound surprising. The next function after the sanity checks is a good example of such a case: + + +```C + cr4_init_shadow(); +``` + +We have already met the [`cr4` control register](https://en.wikipedia.org/wiki/Control_register) in the previous parts. 
This register contains flags that enable or disable certain processor features, among others: + +- [Physical address extension](https://en.wikipedia.org/wiki/Physical_Address_Extension) +- [Page Size Extension](https://en.wikipedia.org/wiki/Page_Size_Extension) + +The kernel preserves the value of this register because it is used quite often. We will see many examples later. Reading and writing this register is an expensive operation. [Intel® 64 and IA-32 Architectures Software Developer's Manual](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html) says: + +> MOV CR* instructions, except for MOV CR8, are serializing instructions + +And: + +> The Intel 64 and IA-32 architectures define several serializing instructions. These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed + +To avoid paying extra CPU cycles, the Linux kernel saves the value of the `cr4` control register in memory. From this point, the kernel changes bits of the `cr4` register only using special helpers like `cr4_set_bits` and `cr4_clear_bits`, which update the shadow copy and write the new value to the actual register only if it differs from the stored one. + +## Preparing the kernel memory layout + +Before the kernel can move on to the generic initialization, it has to bring its memory into a known and consistent state. So far the kernel runs on top of the page tables and the memory layout that were prepared just enough to get the C code running. Some of these early structures are temporary and have to be cleaned up, while others have to be initialized for the first time. 
+ +In the next few steps we will see how the kernel: + +- [Gets rid of the leftover identity mapping in the early page tables](#resetting-the-early-page-tables) +- [Clears the memory regions that must start zeroed, such as the `BSS` section](#clearing-the-initial-memory-state) +- [Prepares the top-level page table that the kernel will use after the early boot](#preparing-the-final-top-level-page-table) +- [Flushes the global TLB](#flushing-the-global-tlb) + +Let's go through these steps one by one. -Some theory --------------------------------------------------------------------------------- +### Resetting the early page tables -An interrupt is an event caused by the software or hardware to the CPU. For example a user has pressed a key on the keyboard. On the interrupt, CPU stops the current task and transfers control to a special routine called [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). An interrupt handler handles an interrupt and transfers control back to the previously stopped task. We can split interrupts on three types: +One of the previous kernel steps was to set up new page tables. At this point, they still contain identity mappings left over from the earliest page-table setup. If you read the previous part, you may remember that these mappings were temporary. They existed only so the processor could switch to the new page tables without causing a page fault. -* Software interrupts - when a software signals CPU that it needs kernel attention. These interrupts are generally used for system calls; -* Hardware interrupts - when a hardware event happens, for example button is pressed on a keyboard; -* Exceptions - interrupts generated by CPU, when the CPU detects an error, for example a division by zero or accessing a memory page which is not in RAM. +Since the kernel has switched to running from its high virtual addresses, this identity mapping is no longer needed. 
At this stage, the top-level page table is referenced by the `early_top_pgt` symbol. The entries of this page table look like this: -Every interrupt and exception is assigned a unique number called a `vector number`. `Vector number` can be any number from `0` to `255`. A common practice is to use the first `32` vector numbers for exceptions, and vector numbers from `32` to `255` are used for user-defined interrupts. +![early_top_pgt entries](./images/early-top-pgt-entries.svg) -CPU uses vector the number as an index in the `Interrupt Descriptor Table` (we will see a description of it soon). CPU catches interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through its pins. The following table shows `0-31` exceptions: +The top-level page table contains `PTRS_PER_PGD` entries, which is `512` on `x86_64`. Only the last entry points to the next page table that maps the kernel image. All other entries are either empty or belong to the identity-mapped range. 
The `reset_early_page_tables` function wipes all of these first `511` entries: + +```C +static void __init reset_early_page_tables(void) +{ + memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1)); + next_early_pgt = 0; + write_cr3(__sme_pa_nodebug(early_top_pgt)); +} ``` ----------------------------------------------------------------------------------------------- -|Vector|Mnemonic|Description |Type |Error Code|Source | ----------------------------------------------------------------------------------------------- -|0 | #DE |Divide Error |Fault|NO |DIV and IDIV | -|--------------------------------------------------------------------------------------------- -|1 | #DB |Reserved |F/T |NO | | -|--------------------------------------------------------------------------------------------- -|2 | --- |NMI |INT |NO |external NMI | -|--------------------------------------------------------------------------------------------- -|3 | #BP |Breakpoint |Trap |NO |INT 3 | -|--------------------------------------------------------------------------------------------- -|4 | #OF |Overflow |Trap |NO |INTO instruction | -|--------------------------------------------------------------------------------------------- -|5 | #BR |Bound Range Exceeded|Fault|NO |BOUND instruction | -|--------------------------------------------------------------------------------------------- -|6 | #UD |Invalid Opcode |Fault|NO |UD2 instruction | -|--------------------------------------------------------------------------------------------- -|7 | #NM |Device Not Available|Fault|NO |Floating point or [F]WAIT | -|--------------------------------------------------------------------------------------------- -|8 | #DF |Double Fault |Abort|YES |An instruction which can generate NMI | -|--------------------------------------------------------------------------------------------- -|9 | --- |Reserved |Fault|NO | | -|--------------------------------------------------------------------------------------------- -|10 
| #TS |Invalid TSS |Fault|YES |Task switch or TSS access | -|--------------------------------------------------------------------------------------------- -|11 | #NP |Segment Not Present |Fault|NO |Accessing segment register | -|--------------------------------------------------------------------------------------------- -|12 | #SS |Stack-Segment Fault |Fault|YES |Stack operations | -|--------------------------------------------------------------------------------------------- -|13 | #GP |General Protection |Fault|YES |Memory reference | -|--------------------------------------------------------------------------------------------- -|14 | #PF |Page fault |Fault|YES |Memory reference | -|--------------------------------------------------------------------------------------------- -|15 | --- |Reserved | |NO | | -|--------------------------------------------------------------------------------------------- -|16 | #MF |x87 FPU fp error |Fault|NO |Floating point or [F]Wait | -|--------------------------------------------------------------------------------------------- -|17 | #AC |Alignment Check |Fault|YES |Data reference | -|--------------------------------------------------------------------------------------------- -|18 | #MC |Machine Check |Abort|NO | | -|--------------------------------------------------------------------------------------------- -|19 | #XM |SIMD fp exception |Fault|NO |SSE[2,3] instructions | -|--------------------------------------------------------------------------------------------- -|20 | #VE |Virtualization exc. |Fault|NO |EPT violations | -|--------------------------------------------------------------------------------------------- -|21-31 | --- |Reserved |INT |NO |External interrupts | ----------------------------------------------------------------------------------------------- + +After clearing these entries, the function resets `next_early_pgt` to `0`. 
This variable is an index into `early_dynamic_pgts`, which is a small pool of reserved page table buffers. We will meet it again later in this part, when the page fault handler builds new page tables on demand. + +Finally, the function reloads the `cr3` control register with the physical address of `early_top_pgt`. The `cr3` register holds the physical address of the top-level page table, so writing to it makes the processor use the updated tables and flushes non-global `TLB` entries. + +Starting from this point on, only the kernel high mapping is left in `early_top_pgt`. Any access that depends on the removed identity mappings, or on mappings that have not been built yet, will trigger a page fault. For example, the `boot_params` structure prepared by the boot loader lives in low physical memory and is reached through the direct mapping of all physical memory. The page fault handler that we will see later in this part will build the missing page tables on demand. + +### Clearing the initial memory state + +The next thing to clear is the kernel's [BSS](https://en.wikipedia.org/wiki/.bss) section. As the name of this function suggests, the `clear_bss` function zeroes it: + + +```C +void __init clear_bss(void) +{ + memset(__bss_start, 0, + (unsigned long) __bss_stop - (unsigned long) __bss_start); + memset(__brk_base, 0, + (unsigned long) __brk_limit - (unsigned long) __brk_base); +} ``` -To react upon the interrupt CPU uses a special structure - Interrupt Descriptor Table or IDT. IDT is an array of 8-byte descriptors just like the Global Descriptor Table, but IDT entries are called `gates`. CPU multiplies vector number by 8 to find the IDT entry. However in the 64-bit mode IDT is an array of 16-byte descriptors and CPU multiplies vector number by 16 to find the entry in the IDT. 
We remember from the previous part that CPU uses special `GDTR` register to locate the Global Descriptor Table, so CPU uses special register `IDTR` for Interrupt Descriptor Table and `lidt` instruction for loading base address of the table into this register. +Function names are helpful, of course, but sometimes they do not tell the whole story. As we can see, `clear_bss` clears two memory areas. The first `memset` zeroes the `BSS` section, where global and static variables that must start as zero are stored. The second `memset` clears the `brk` area, which the early kernel uses as a primitive allocator before the real memory allocators are available. + +We already met the `BSS` section in previous chapters. We can check the symbols related to it using the following simple command: + +```bash +$ nm -n vmlinux | awk '/ __bss_start$/,/ __bss_stop$/ { if (n++ < 11 || / __bss_stop$/) print }' +``` -64-bit mode IDT entry has following structure: +The output should be something like this: ``` -127 96 - -------------------------------------------------------------------------------- -| | -| Reserved | -| | - -------------------------------------------------------------------------------- -95 64 - -------------------------------------------------------------------------------- -| | -| Offset 63..32 | -| | - -------------------------------------------------------------------------------- -63 48 47 46 44 42 39 34 32 - -------------------------------------------------------------------------------- -| | | D | | | | | | | -| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST | -| | | L | | | | | | | - -------------------------------------------------------------------------------- -31 16 15 0 - -------------------------------------------------------------------------------- -| | | -| Segment Selector | Offset 15..0 | -| | | - -------------------------------------------------------------------------------- +ffffffff82f6b000 B __bss_start +ffffffff82f6b000 b idt_table 
+ffffffff82f6b000 D __nosave_end +ffffffff82f6c000 b espfix_pud_page +ffffffff82f6d000 b bm_pte +ffffffff82f6e000 B empty_zero_page +ffffffff82f6f000 B initcall_debug +ffffffff82f6f004 B reset_devices +ffffffff82f6f008 b initcall_calltime +ffffffff82f6f010 b panic_param +ffffffff82f6f018 b panic_later +ffffffff8309a000 B __bss_stop ``` -Where: +Both of these regions are reserved in the kernel's [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S), so all the kernel needs to do here is set their contents to zero. + +### Preparing the final top-level page table + +The next structure that has to be cleared is the final top-level page table that the Linux kernel will switch to for normal operation: -* `Offset` - is the offset to entry point of an interrupt handler; -* `DPL` - Descriptor Privilege Level; -* `P` - Segment Present flag; -* `Segment selector` - a code segment selector in GDT or LDT (actually in Linux, it must point to a valid descriptor in your GDT.) + ```C -#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8) // 0000 0000 0001 0000 -#define GDT_ENTRY_KERNEL_CS 2 + /* + * This needs to happen *before* kasan_early_init() because latter maps stuff + * into that page. + */ + clear_page(init_top_pgt); ``` -* `IST` - provides ability to switch to a new stack for interrupts handling. -And the last `Type` field describes type of the `IDT` entry. There are three different kinds of gates for interrupts: +For now, the kernel is still using the `early_top_pgt` page table, and it will continue to use it during the early initialization stage. But as the comment above the call says, `init_top_pgt` must be cleared before the next initialization steps map anything into it. We will see later how the kernel finishes filling this table and switches to it. 
-* Task gate -* Interrupt gate -* Trap gate +This page must be cleared before `kasan_early_init()` runs, because KASAN will install its early [shadow-memory](https://docs.kernel.org/dev-tools/kasan.html#shadow-memory) mappings into it. [KASAN](https://docs.kernel.org/dev-tools/kasan.html) uses shadow memory to track accesses to kernel memory. Since `kasan_early_init()` runs in one of the next steps, `init_top_pgt` must already be ready for KASAN to populate. Clearing `init_top_pgt` gives KASAN an empty page table page to fill, without stale entries from earlier boot code. -Interrupt and trap gates contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how CPU handles the `IF` flag. If an interrupt handler was accessed through the interrupt gate, CPU clears the `IF` flag to prevent other interrupts while current interrupt handler executes. After the current interrupt handler executes, CPU sets the `IF` flag again with `iret` instruction. +### Flushing the global TLB -Other bits in the interrupt descriptor are reserved and must be 0. Now let's look how a CPU handles interrupts: +The last memory-cleanup step before the kernel turns to interrupt handling is to flush the global [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) entries. The `TLB`, or Translation Lookaside Buffer, is a cache that the processor uses to speed up the translation of virtual addresses to physical ones. Whenever the kernel changes the page tables, the entries cached in the `TLB` may become stale and must be invalidated. -* CPU saves flags register, `CS`, and instruction pointer on the stack. -* If an interrupt causes an error code (for example `#PF`), CPU saves an error on the stack after instruction pointer; -* After interrupt handler executes, `iret` instruction will be used to return from it. +The early page tables we saw above had two kinds of mappings: -Now let's go back to code.
+- the high kernel mapping +- the identity mapping -Fill and load IDT --------------------------------------------------------------------------------- +This identity mapping was needed during the switch to long mode and to the high kernel mapping, but the `reset_early_page_tables` function has already removed it. The problem is that these identity mappings are global, which means that the processor may keep them in the `TLB` even across a reload of the `cr3` register. Usually writing to the `cr3` register flushes the `TLB`, but global entries are intentionally excluded from this flush. The [Intel® 64 and IA-32 Architectures Software Developer's Manual](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html) describes this behavior: -We stopped at the following function: +> MOV to CR3. The behavior of the instruction depends on the value of CR4.PCIDE: +> +> If CR4.PCIDE = 0, the instruction invalidates all TLB entries associated with PCID 000H except those for global pages. It also invalidates all entries in all paging-structure caches associated with PCID 000H. +So even after the identity mapping is gone from the page tables, stale translations for it might still be cached. To get rid of them, the kernel forces a flush of the global entries with the `__native_tlb_flush_global` function: + + ```C - idt_setup_early_handler(); + __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4)); ``` -`idt_setup_early_handler` is defined in the [arch/x86/kernel/idt.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/idt.c) as following: +An additional reason to flush the `TLB` is the so-called `trampoline page table`. This is a separate page table that establishes the same kind of global identity mappings. Secondary processors use it during their early bring-up path, before they switch to the normal kernel page tables. 
We will meet it later when we talk about [`SMP`](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) initialization. For now it is enough to know that the boot processor itself was running on the early page tables we discussed above, and the goal of this step is to drop any stale global translation of the identity mapping from the `TLB`. -```C -void __init idt_setup_early_handler(void) -{ - int i; +With this, the early preparation of the kernel memory layout is finished. The kernel can now move on to setting up the handlers for interrupts and exceptions. - for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) - set_intr_gate(i, early_idt_handler_array[i]); +## Early interrupt and exception handling - load_idt(&idt_descr); -} -``` +Now we have reached the main goal of this chapter, which is the initialization of the Interrupt Descriptor Table. But before we jump directly to the code, we need to know what an interrupt is and why this table is used by the Linux kernel. -where `NUM_EXCEPTION_VECTORS` expands to `32`. As we can see, We're filling only first 32 `IDT` entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up an interrupt handlers for vectors greater than `32`. Here we call `set_intr_gate` in the loop, which takes two parameters: +### Interrupt Descriptor Table -* Number of an interrupt or `vector number`; -* Address of the idt handler. +An interrupt is a signal sent to the CPU by software or hardware. For example, a keyboard controller can signal that a user pressed a key. For our purposes, we can split these events into three types: -and inserts an interrupt gate to the `IDT` table represented by the `&idt_descr` array. +- Software interrupts - signals triggered by software to request a service from the kernel. Historically, these interrupts were often used for [system calls](https://en.wikipedia.org/wiki/System_call). For example, a program may need to read a file. 
+- Hardware interrupts - signals sent by hardware to report that an event occurred. For example, a network card may signal that a packet has arrived. +- Exceptions - processor-generated events raised while executing an instruction. For example, division by zero raises an exception. -and inserts an interrupt gate to the `IDT` table represented by the `&idt_descr` array. +When an interrupt or exception is triggered, the CPU stops the current execution flow and transfers control to an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). The handler deals with the event and then returns control to the interrupted code. The CPU finds this handler through an entry, traditionally called a gate, in a special table called the Interrupt Descriptor Table, or IDT. -The `early_idt_handler_array` array is declared in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file and contains addresses of the first `32` exception handlers: -```C -#define EARLY_IDT_HANDLER_SIZE 9 -#define NUM_EXCEPTION_VECTORS 32 +Every interrupt and exception is assigned a unique number called a `vector number`. A vector number can be any value from `0` to `255`.
The first `32` (starting from zero) numbers are reserved for CPU exceptions, like divide error, page fault and so on: -extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE]; -``` +| Vector | Mnemonic | Description | Type | Error Code | Source | +|--------|----------|--------------------------------|-------|------------|--------------------------------| +| 0 | #DE | Divide Error | Fault | NO | DIV and IDIV | +| 1 | #DB | Debug | F/T | NO | Debug conditions | +| 2 | --- | Non-maskable Interrupt | INT | NO | External NMI | +| 3 | #BP | Breakpoint | Trap | NO | INT3 | +| 4 | #OF | Overflow | Trap | NO | INTO instruction | +| 5 | #BR | Bound Range Exceeded | Fault | NO | BOUND instruction | +| 6 | #UD | Invalid Opcode | Fault | NO | UD2 or invalid instruction | +| 7 | #NM | Device Not Available | Fault | NO | Floating point or [F]WAIT | +| 8 | #DF | Double Fault | Abort | YES | Exception while handling fault | +| 9 | --- | Reserved | | NO | | +| 10 | #TS | Invalid TSS | Fault | YES | TSS access | +| 11 | #NP | Segment Not Present | Fault | YES | Segment load or access | +| 12 | #SS | Stack-Segment Fault | Fault | YES | Stack operations | +| 13 | #GP | General Protection | Fault | YES | Protection violation | +| 14 | #PF | Page Fault | Fault | YES | Memory reference | +| 15 | --- | Reserved | | NO | | +| 16 | #MF | x87 FPU Floating-Point Error | Fault | NO | x87 floating-point operation | +| 17 | #AC | Alignment Check | Fault | YES | Unaligned data reference | +| 18 | #MC | Machine Check | Abort | NO | Hardware error | +| 19 | #XM/#XF | SIMD Floating-Point Exception | Fault | NO | SSE/SIMD operation | +| 20 | #VE | Virtualization Exception | Fault | NO | EPT violation | +| 21 | #CP | Control Protection Exception | Fault | YES | CET protection violation | +| 22-28 | --- | Reserved | | NO | | +| 29 | #VC | VMM Communication Exception | Fault | YES | SEV-ES | +| 30-31 | --- | Reserved | | NO | | -The `early_idt_handler_array` is a `288` 
bytes array containing addresses of exception entry points every nine bytes. Every nine bytes of this array consist of two optional bytes for the instruction for pushing dummy error code if an exception does not provide it, two bytes instruction for pushing vector number to the stack and five bytes of `jump` to the common exception handler code. You will see more detail in the next paragraph. +> [!NOTE] +> Some vectors in this range are vendor-specific. For example, Linux defines vector `29` as `#VC`, which is AMD-specific and used by [SEV-ES](https://www.amd.com/en/developer/sev.html) guests. -The `set_intr_gate` function is defined in the [arch/x86/kernel/idt.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/idt.c) source file and looks as follows: +The vector numbers from `32` to `255` are not reserved for processor exceptions. The operating system can use them for external interrupts and other IDT entries, such as inter-processor interrupts or legacy software interrupt entry points. -```C -static void set_intr_gate(unsigned int n, const void *addr) -{ - struct idt_data data; +When an interrupt or exception occurs, the CPU uses the vector number as an index into the `Interrupt Descriptor Table`. Each entry is a descriptor that contains a pointer to the interrupt or exception handler. The base address of the `Interrupt Descriptor Table` is stored in a special register called `IDTR`. This register is loaded with the `LIDT` instruction, which takes a pointer to a descriptor holding the base address and size limit of the `IDT`. 
- BUG_ON(n > 0xFF); +The structure of a single Interrupt Descriptor Table entry (a gate descriptor) on x86_64 is: - memset(&data, 0, sizeof(data)); - data.vector = n; - data.addr = addr; - data.segment = __KERNEL_CS; - data.bits.type = GATE_INTERRUPT; - data.bits.p = 1; +![IDT gate descriptor](./images/idt-gate-descriptor.svg) - idt_setup_from_table(idt_table, &data, 1, false); -} + +Here: + +- `Offset` - the 64-bit virtual address of the interrupt or exception handler +- `Segment Selector` - a code segment selector that the processor loads into the `cs` register before it jumps to the handler. It must point to a valid code segment in the Global Descriptor Table. In the Linux kernel, it points to the kernel code segment `__KERNEL_CS`. +- `IST` - the Interrupt Stack Table index. It lets the processor run the handler on a dedicated, reserved stack instead of the stack that was in use when the interrupt happened. This matters for a few critical handlers that must work even if the current stack is broken, such as a double fault. When this field is zero, the handler just runs on the normal kernel stack. +- `Type` - the kind of gate. In 64-bit mode the `IDT` may hold two kinds of gates: + - `Interrupt gate` - when the processor enters the handler through it, it clears the `IF` interrupt flag. This flag tells the processor whether it is allowed to deliver maskable hardware interrupts. Clearing it prevents other maskable hardware interrupts from interrupting the handler while it runs. + - `Trap gate` - works like an interrupt gate, but the processor leaves the `IF` flag unchanged, so the handler can still be interrupted by hardware interrupts. +- `DPL` - the Descriptor Privilege Level. It is the minimum privilege level a task must have to invoke this gate with a software instruction like [`int n`](https://en.wikipedia.org/wiki/INT_(x86_instruction)). Hardware interrupts and processor-generated exceptions ignore this field. +- `P` - the present flag. It must be set for a valid descriptor.
A reference to a gate whose `P` flag is clear raises a segment-not-present (`#NP`) exception. + +The remaining bits, including the topmost `Reserved` part, must be zero. + +The structure of the descriptor pointing to the Interrupt Descriptor Table is: + +![IDT descriptor](./images/idt-descriptor.svg) + +The processor uses this descriptor to find the `IDT` in memory. The `Limit` field holds the size of the table in bytes minus one, and the `Base Address` field holds the virtual address of the first entry of the table. This is exactly the descriptor that the `LIDT` instruction loads into the `IDTR` register. + +### Handling of interrupts on x86_64 + +Knowing how the Interrupt Descriptor Table is structured, we can take a short look at how an interrupt or exception is handled by the processor in 64-bit mode. + +> [!NOTE] +> If you are interested in more details, the exact algorithm is described in the [Intel® 64 and IA-32 Architectures Software Developer's Manual](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html), Volume 3A, in the following sections: +> +> - `7.12 Exception and Interrupt Handling` +> - `7.14 Exception and Interrupt Handling in 64-bit Mode` + +When an interrupt or exception occurs, the processor takes the vector number and multiplies it by `16` to get the offset of the gate inside the `IDT`. The multiplication by `16` is needed because each IDT entry is 16 bytes. The processor reads the gate at this offset and checks that it is an interrupt or trap gate that points to a 64-bit code segment. Then it decides which stack the handler will run on, following the rules we have already seen for the `IST` field. It can be a dedicated stack from the `IST` or the current stack. After the stack is chosen, the processor pushes a so-called interrupt frame. The interrupted state is now on the stack, so the code can be resumed later. 
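
The offset calculation described above can be sketched in a few lines of C. This is only an illustration and not kernel code: `GATE_SIZE`, `gate_offset` and `gate_within_limit` are hypothetical names introduced here, and the limit check assumes that a gate is usable only when all 16 of its bytes lie within the `Limit` value loaded into `IDTR`.

```c
#include <stdint.h>

/* Size of one IDT gate descriptor in 64-bit mode, in bytes. */
#define GATE_SIZE 16u

/*
 * Hypothetical helper: byte offset of the gate for `vector`
 * counted from the base address of the IDT.
 */
static inline uint32_t gate_offset(uint8_t vector)
{
	return (uint32_t)vector * GATE_SIZE;
}

/*
 * Hypothetical helper: returns 1 if the whole gate fits inside the
 * table. `limit` is the IDTR limit, i.e. the table size in bytes
 * minus one, so the last byte of the gate must not exceed it.
 */
static inline int gate_within_limit(uint8_t vector, uint16_t limit)
{
	return gate_offset(vector) + GATE_SIZE - 1 <= limit;
}
```

For example, the page fault vector `14` gives `gate_offset(14) == 224`, and a full table of `256` gates has a limit of `4095`, so `gate_within_limit(255, 4095)` holds.
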
+ +The interrupt frame consists of the following registers, from higher to lower addresses: + +![Interrupt stack frame](./images/interrupt-stack-frame.svg) + +After the state is saved, the processor loads the handler's code segment selector and offset from the gate into the `cs` and `rip` registers and switches to the execution of the handler. + +When the handler finishes its job, it returns with the special `iretq` instruction. This instruction pops the saved state, restores the saved flags and resumes the interrupted code from the point where it stopped. + +### Set up early Interrupt Descriptor Table + +With the theory behind us, let's return to the kernel code. We stopped in the `x86_64_start_kernel` function, right before the call of: + + +```C + idt_setup_early_handler(); ``` -First of all it checks that vector number passed to it is not greater than `255` with `BUG_ON` macro. We need to do this because we are limited up to `256` interrupts. After this, we fill the idt data with given arguments and others, which will be passed to `idt_setup_from_table`. The `idt_setup_from_table` function is defined in the same file as the `set_intr_gate` function as follows: +At this early stage the kernel does not need a complete `IDT` yet. Interrupts are still disabled, so no hardware interrupt is going to arrive. Exceptions can still happen though. A page fault is the most important example for this part. So the kernel needs at least a minimal `IDT` that can catch processor exceptions. This is exactly what the `idt_setup_early_handler` function does. 
It is defined in [arch/x86/kernel/idt.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/idt.c) and looks like this: + ```C -static void -idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys) +void __init idt_setup_early_handler(void) { - gate_desc desc; - - for (; size > 0; t++, size--) { - desc.offset_low = (u16) t->addr; - desc.segment = (u16) t->segment - desc.bits = t->bits; - desc.offset_middle = (u16) (t->addr >> 16); - desc.offset_high = (u32) (t->addr >> 32); - desc.reserved = 0; - memcpy(&idt[t->vector], &desc, sizeof(desc)); - if (sys) - set_bit(t->vector, system_vectors); - } + int i; + + for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) + set_intr_gate(i, early_idt_handler_array[i]); +#ifdef CONFIG_X86_32 + for ( ; i < NR_VECTORS; i++) + set_intr_gate(i, early_ignore_irq); +#endif + load_idt(&idt_descr); } ``` -that fills a temporary idt descriptor with the given arguments and others. And then we just copy it to the certain element of the `idt_table` array. `idt_table` is an array of idt entries: +The number of exception vectors specified by `NUM_EXCEPTION_VECTORS` is `32`. The kernel iterates over these vectors and calls the `set_intr_gate` function, which initializes the given gate descriptor with the vector number, handler address and flags: + ```C -gate_desc idt_table[IDT_ENTRIES] __page_aligned_bss; -``` +static __init void set_intr_gate(unsigned int n, const void *addr) +{ + struct idt_data data; -Now we are moving back to main loop code. 
After main loop finishes, we can load `Interrupt Descriptor table` with the call to the: + init_idt_data(&data, n, addr); -```C - load_idt((const struct desc_ptr *)&idt_descr); + idt_setup_from_table(idt_table, &data, 1, false); +} ``` -where `idt_descr` is: +The `idt_data` structure is defined in [arch/x86/include/asm/desc_defs.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc_defs.h) and contains the following fields: + ```C -struct desc_ptr idt_descr __ro_after_init = { - .size = (IDT_ENTRIES * 2 * sizeof(unsigned long)) - 1, - .address = (unsigned long) idt_table, +struct idt_data { + unsigned int vector; + unsigned int segment; + struct idt_bits bits; + const void *addr; }; ``` -and `load_idt` just executes `lidt` instruction: +The Interrupt Descriptor Table itself is represented as an array of the following structures, defined in the same [arch/x86/include/asm/desc_defs.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc_defs.h): + ```C - asm volatile("lidt %0"::"m" (idt_descr)); +struct gate_struct { + u16 offset_low; + u16 segment; + struct idt_bits bits; + u16 offset_middle; +#ifdef CONFIG_X86_64 + u32 offset_high; + u32 reserved; +#endif +} __attribute__((packed)); ``` -Okay, now after we have filled and loaded the `Interrupt Descriptor Table`, we know how the CPU acts during an interrupt. So now it's time to deal with interrupt handlers. +After all the entries are initialized and copied to the Interrupt Descriptor Table, the `load_idt` function executes the `lidt` instruction to load the address of the newly built table. -Early interrupt handlers --------------------------------------------------------------------------------- +### Common interrupt handlers -As you can read above, we filled `IDT` with the address of the `early_idt_handler_array`. In this section, we are going to look into it in detail. 
We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file: +Starting from this point, the `IDT` is initialized and loaded, so the kernel can handle the early exceptions it cares about. But which handlers does it actually have now? The answer is in `early_idt_handler_array`. This array is defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): + ```assembly -ENTRY(early_idt_handler_array) +SYM_CODE_START(early_idt_handler_array) i = 0 .rept NUM_EXCEPTION_VECTORS .if ((EXCEPTION_ERRCODE_MASK >> i) & 1) == 0 UNWIND_HINT_IRET_REGS + ENDBR pushq $0 # Dummy error code, to make stack frame uniform .else UNWIND_HINT_IRET_REGS offset=8 + ENDBR .endif pushq $i # 72(%rsp) Vector number jmp early_idt_handler_common @@ -271,142 +367,93 @@ ENTRY(early_idt_handler_array) i = i + 1 .fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc .endr - UNWIND_HINT_IRET_REGS offset=16 -END(early_idt_handler_array) +SYM_CODE_END(early_idt_handler_array) ``` -As we can see above, interrupt handlers generation is done for the first `32` exceptions. We check here, if the exception has an error code and then we do nothing. If an exception, however, does not return an error code, we push a zero to the stack. We do it so that the stack is uniform. After that we push `vector number` on the stack and jump to the `early_idt_handler_common` - a generic interrupt handler for the time being. After all, every nine bytes of the `early_idt_handler_array` array consist of an optional push of an error code, push of `vector number` and jump instruction to `early_idt_handler_common`. We can see it in the output of the `objdump` util: +This code can look scary at first glance, but do not worry. Let's go through it and try to understand what it does. -``` -$ objdump -D vmlinux -... -... -... 
-ffffffff81fe5000 : -ffffffff81fe5000: 6a 00 pushq $0x0 -ffffffff81fe5002: 6a 00 pushq $0x0 -ffffffff81fe5004: e9 17 01 00 00 jmpq ffffffff81fe5120 -ffffffff81fe5009: 6a 00 pushq $0x0 -ffffffff81fe500b: 6a 01 pushq $0x1 -ffffffff81fe500d: e9 0e 01 00 00 jmpq ffffffff81fe5120 -ffffffff81fe5012: 6a 00 pushq $0x0 -ffffffff81fe5014: 6a 02 pushq $0x2 -... -... -... -``` +The `early_idt_handler_array` definition expands to a contiguous block of executable code containing `32` fixed-size exception entry stubs. The [`.rept`](https://sourceware.org/binutils/docs/as/Rept.html) directive is a simple loop that repeats the stub body `32` times. For exceptions where the CPU does not push an error code, the generated stub pushes a dummy zero, so all early exception handlers see the same stack layout. Then the stub pushes the vector number and jumps to the `early_idt_handler_common` label. At the end of each generated stub, the assembler fills the remaining bytes with `0xcc` until the stub has exactly `EARLY_IDT_HANDLER_SIZE` bytes. -As we may know, CPU pushes flag registers, `CS` and `RIP` on the stack before calling the interrupt handler. So before `early_idt_handler_common` will be executed, stack will contain the following data: - -``` -|--------------------| -| %rflags | -| %cs | -| %rip | -| error code | -| vector number |<-- %rsp -|--------------------| -``` +> [!NOTE] +> There is one interesting detail about this padding. `0xcc` is the opcode for the [INT3](https://en.wikipedia.org/wiki/INT_(x86_instruction)#INT3) instruction, so if the padding is accidentally executed, it will cause a breakpoint exception rather than running random bytes. -Now let's look at the `early_idt_handler_common` implementation. It is located in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file.
First of all we increment `early_recursion_flag` to prevent recursion in the `early_idt_handler_common`: +If we inspect the kernel image with [`objdump`](https://man7.org/linux/man-pages/man1/objdump.1.html), we can see these generated instructions: -```assembly - incl early_recursion_flag(%rip) +```bash +objdump -d vmlinux | grep '<early_idt_handler_array>:' -A 24 ``` -The `early_recursion_flag` is defined in the same assembly file as the `early_idt_handler_common` symbol as follows: +The output should look similar to this: -```assembly - early_recursion_flag: - .long 0 ``` - -Next we save general registers on the stack: - -```assembly - pushq %rsi - movq 8(%rsp), %rsi - movq %rdi, 8(%rsp) - pushq %rdx - pushq %rcx - pushq %rax - pushq %r8 - pushq %r9 - pushq %r10 - pushq %r11 - pushq %rbx - pushq %rbp - pushq %r12 - pushq %r13 - pushq %r14 - pushq %r15 - UNWIND_HINT_REGS +ffffffff83d3fd10 <early_idt_handler_array>: +ffffffff83d3fd10: f3 0f 1e fa endbr64 +ffffffff83d3fd14: 6a 00 push $0x0 +ffffffff83d3fd16: 6a 00 push $0x0 +ffffffff83d3fd18: e9 93 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> +ffffffff83d3fd1d: f3 0f 1e fa endbr64 +ffffffff83d3fd21: 6a 00 push $0x0 +ffffffff83d3fd23: 6a 01 push $0x1 +ffffffff83d3fd25: e9 86 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> +ffffffff83d3fd2a: f3 0f 1e fa endbr64 +ffffffff83d3fd2e: 6a 00 push $0x0 +ffffffff83d3fd30: 6a 02 push $0x2 +ffffffff83d3fd32: e9 79 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> +ffffffff83d3fd37: f3 0f 1e fa endbr64 +ffffffff83d3fd3b: 6a 00 push $0x0 +ffffffff83d3fd3d: 6a 03 push $0x3 +ffffffff83d3fd3f: e9 6c 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> +ffffffff83d3fd44: f3 0f 1e fa endbr64 +ffffffff83d3fd48: 6a 00 push $0x0 +ffffffff83d3fd4a: 6a 04 push $0x4 +ffffffff83d3fd4c: e9 5f 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> +ffffffff83d3fd51: f3 0f 1e fa endbr64 +ffffffff83d3fd55: 6a 00 push $0x0 +ffffffff83d3fd57: 6a 05 push $0x5 +ffffffff83d3fd59: e9 52 01 00 00 jmp ffffffff83d3feb0 <early_idt_handler_common> ``` -Okay, now the stack contains following data: -``` -High |-------------------------| - | %rflags | - | %cs | - | %rip | - | error
code | - | %rdi | - | %rsi | - | %rdx | - | %rax | - | %r8 | - | %r9 | - | %r10 | - | %r11 | - | %rbx | - | %rbp | - | %r12 | - | %r13 | - | %r14 | - | %r15 |<-- %rsp -Low |-------------------------| -``` +All of these stubs jump to the common `early_idt_handler_common` routine. Before doing anything else, it saves all general purpose registers on the stack so they can be restored when the kernel returns from the exception. After all the registers are saved, the stack looks like this: -We need to do it to prevent wrong values of registers when we return from the interrupt handler. After this we check the vector number, and if it is `#PF` or a [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (we'll see it soon): +![Stack frame after saving general purpose registers](./images/interrupt-stack-frame-full.svg) -```assembly - cmpq $14,%rsi /* Page fault? */ - jnz 10f - GET_CR2_INTO(%rdi) - call early_make_pgtable - andl %eax,%eax /* It is more efficient, the opcode is shorter than movl 1, %eax, only 2 bytes. */ - jz 20f /* All good */ -``` +With this stack frame prepared, the kernel calls the `do_early_exception` function. This function first handles a few special early exceptions by vector number, and then falls back to the Linux kernel exception table: -otherwise we call `early_fixup_exception` function by passing kernel stack pointer: + +```C +void __init do_early_exception(struct pt_regs *regs, int trapnr) +{ + if (trapnr == X86_TRAP_PF && + early_make_pgtable(native_read_cr2())) + return; -```assembly -10: - movq %rsp,%rdi - call early_fixup_exception -``` + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT) && + trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs)) + return; -We'll see the implementation of the `early_fixup_exception` function later. 
+ if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs)) + return; -```assembly -20: - decl early_recursion_flag(%rip) - jmp restore_regs_and_return_to_kernel + early_fixup_exception(regs, trapnr); +} ``` -After we decrement the `early_recursion_flag`, we restore registers that we saved before on the stack and return from the handler with `iretq`. +> [!NOTE] +> We skip the virtualization-related exceptions in this chapter, since they are not the main topic here. The `#VC` path is used for AMD memory-encrypted guests, and the `#VE` path is used for TDX guests. For now, we will focus on the page fault path and the generic exception-table fallback. -That is the end of the interrupt handler. We will examine the page fault handling and the other exception handling in order. +### Page fault exception handler -Page fault handling --------------------------------------------------------------------------------- +A page fault is an exception that the processor raises whenever a program, or in our case the kernel, tries to access a virtual memory address that the processor cannot translate into a physical address. This can happen for different reasons. The most common one is that there is no page table entry that maps the address. When this happens, the processor performs the following actions: -In the previous paragraph we saw the early interrupt handler that checks if the vector number is a page fault and calls `early_make_pgtable` for building new page tables if it is. We need to have `#PF` handler in this step because there are plans to add an ability to load kernels above `4G` addresses and allow accesses to `boot_params` structure above the 4G addressing limit. 
+- stores the faulting address in the `cr2` control register +- pushes an error code that describes the reason for the fault +- transfers control to the page fault handler -You can find the implementation of the `early_make_pgtable` in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) that takes one parameter - the value of the `cr2` register, containing the address causing page fault. Let's look at it: +During this early boot phase, the Linux kernel does not install the full page fault handler that will be used later during normal kernel execution. The IDT entry for page faults points to one of the generic early stubs, and the real work happens in `do_early_exception`. This function checks whether the exception is a page fault and, if so, calls `early_make_pgtable`, passing the faulting address from the `cr2` register. The `early_make_pgtable` function translates the faulting virtual address to a physical address and computes the PMD entry for it: + ```C -int __init early_make_pgtable(unsigned long address) +static bool __init early_make_pgtable(unsigned long address) { unsigned long physaddr = address - __PAGE_OFFSET; pmdval_t pmd; @@ -417,212 +464,120 @@ int __init early_make_pgtable(unsigned long address) } ``` -`__PAGE_OFFSET` is defined in the [arch/x86/include/asm/page_64_types.h](https://elixir.bootlin.com/linux/v3.10-rc1/source/arch/x86/include/asm/page_64_types.h#L33) header file, and the suffix `UL` forces the page offset to be an unsigned long data type. - -```C -#define __PAGE_OFFSET _AC(0xffff880000000000, UL) -``` - -And the `_AC` macro is defined in the [include/uapi/linux/const.h](https://elixir.bootlin.com/linux/v3.10-rc1/source/include/uapi/linux/const.h#L16) header file: - -```C -/* Some constant macros are used in both assembler and - * C code. Therefore we cannot annotate them always with - * 'UL' and other type specifiers unilaterally. We - * use the following macros to deal with this.
- * - * Similarly, _AT() will cast an expression with a type in C, but - * leave it unchanged in asm. - */ - -#ifdef __ASSEMBLY__ -#define _AC(X,Y) X -#else -#define __AC(X,Y) (X##Y) -#define _AC(X,Y) __AC(X,Y) -#endif -``` -Where `__PAGE_OFFSET` expands to `0xffff888000000000`. But, why is it possible to translate a virtual address to a physical address by subtracting `__PAGE_OFFSET`? The answer is in the [Documentation/x86/x86_64/mm.rst](https://elixir.bootlin.com/linux/v5.10-rc5/source/Documentation/x86/x86_64/mm.rst#L45): +For the 4-level x86_64 layout shown in the [kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/arch/x86/x86_64/mm.rst), the `__PAGE_OFFSET` macro expands to the `0xffff888000000000` address. This is the virtual base address Linux uses to access physical memory through the direct mapping. + ``` -... -ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) -... + ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) ``` -As explained above, the virtual address space `ffff888000000000-ffffc87fffffffff` is direct mapping of all physical memory. When the kernel wants to access all physical memory, it uses direct mapping. +Since `0xffff888000000000` maps to physical address `0` in this layout, subtracting it from the faulting virtual address gives us the corresponding physical address. -Okay, let's get back to discussing `early_make_pgtable`. We initialize `pmd` and pass it to the `__early_make_pgtable` function along with an `address`. 
The `__early_make_pgtable` function is defined in the same file as the `early_make_pgtable` function as follows: +![Direct mapping of physical memory in the virtual address space](./images/direct-mapping.svg) -```C -int __init __early_make_pgtable(unsigned long address, pmdval_t pmd) -{ - unsigned long physaddr = address - __PAGE_OFFSET; - pgdval_t pgd, *pgd_p; - p4dval_t p4d, *p4d_p; - pudval_t pud, *pud_p; - pmdval_t *pmd_p; - ... - ... - ... -} -``` +If the faulting address belongs to the valid direct-mapping range, the kernel can build the missing mapping. This work is done by the `__early_make_pgtable` function. The process itself is very similar to what we have already seen a couple of times while mapping new pages. It looks like this: -It starts from the definition of some variables having `*val_t` types. All of these types are declared as an alias of `unsigned long` using `typedef`. +1. Start from the top-level page table and find the entry that covers the address that caused the fault +2. If the current entry points to the next-level table, continue walking down +3. If the next-level table is missing, allocate one from the early page-table pool and install a new entry -After performing the check for invalid addresses, we're getting the address of the Page Global Directory entry containing base address of the Page Upper Directory and put its value into the `pgd` variable: +This process is repeated until the kernel reaches the level it needs for this early mapping. In this case, it installs a PMD entry. As soon as the entry is created, the general purpose registers are restored and control returns to the faulting instruction. This time, the address translation succeeds. -```C -again: - pgd_p = &early_top_pgt[pgd_index(address)].pgd; - pgd = *pgd_p; -``` +### Exception handling through the exception table -And we check if `pgd` is present. 
If it is, we assign the base address of the page upper directory table to `pud_p`: +In the previous [section](#page-fault-exception-handler) we saw how the early page fault handler recovers from a page fault by building the missing page tables on demand. But a page fault is not the only exception that may happen. If we look back at the `do_early_exception` function, we see one last step after the special early handlers, the call to `early_fixup_exception`. This is the path that handles all remaining early exceptions. -```C - pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base); -``` +Unlike a page fault, such an exception usually cannot be resolved by simply mapping a new page. The kernel can recover only if the faulting instruction is one it knows about in advance. For this purpose, the Linux kernel maintains a table called the `exception table`. -where `PTE_PFN_MASK` is a macro that masks lower `12` bits of `(pte|pmd|pud|pgd)val_t`. +As we already saw, the `do_early_exception` function selects a handler for some exceptions based on the vector number. The exception table works differently. Instead of using the vector number, it relies on the address of the instruction that caused the exception. The kernel collects such known-risk instructions at build time into a special table. In this table, each instruction is associated with fixup metadata that tells the kernel where execution should continue if the instruction faults. So the kernel looks up the faulting instruction in this table and, if it is found, transfers control to the matching fixup path. If no matching entry is found, the early exception handler has no safe recovery path. It prints a panic message, dumps the registers and halts the CPU. -If `pgd` is not present, we check if `next_early_pgt` is not greater than `EARLY_DYNAMIC_PAGE_TABLES` which is `64` and present a fixed number of buffers to set up new page tables on demand. 
If `next_early_pgt` is greater than `EARLY_DYNAMIC_PAGE_TABLES` we reset page tables and start again from `again` label. If `next_early_pgt` is less than `EARLY_DYNAMIC_PAGE_TABLES`, we assign the next entry of `early_dynamic_pgts` to `pud_p` and fill whole entry of the page upper directory with `0`, then fill the page global directory entry with the base address and some access rights: +The table itself is built during the kernel build and consists of a contiguous set of the following structures: + ```C - if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) { - reset_early_page_tables(); - goto again; - } - - pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++]; - memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD); - *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE; +struct exception_table_entry { + int insn, fixup, data; +}; ``` -And we fix `pud_p` to point to correct entry and assign its value to `pud` with the following: - -```C - pud_p += pud_index(address); - pud = *pud_p; -``` +Where: -And then we do the same routine as above, but to the page middle directory. +- `insn` - location of the instruction that may fault, stored as a 32-bit offset relative to the field itself +- `fixup` - location where execution should continue after the fault, stored as a relative offset in the same way +- `data` - fixup type and additional exception metadata -In the end we assign the given `pmd` which is passed by the `early_make_pgtable` function to the certain entry of page middle directory which maps kernel text+data virtual addresses: +The table is populated with entries using `_ASM_EXTABLE_TYPE` and similar macros from the same family: + ```C - pmd_p[pmd_index(address)] = pmd; +# define _ASM_EXTABLE_TYPE(from, to, type) \ + .pushsection "__ex_table", "aM", @progbits, EXTABLE_SIZE ; \ + .balign 4 ; \ + .long (from) - . ; \ + .long (to) - . ; \ + .long type ; \ + .popsection ``` -After page fault handler finished its work, as a result, `early_top_pgt` contains entries which point to the valid addresses.
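
The `.long (from) - .` and `.long (to) - .` lines of `_ASM_EXTABLE_TYPE` store each location as a 32-bit offset relative to the field that holds it. The sketch below shows, under simplified assumptions, how such offsets can be written, turned back into addresses and searched. All names in it (`extable_entry`, `set_rel`, `rel_to_abs`, `find_entry`) are hypothetical and exist only for this illustration. The linear scan is also a simplification: the kernel itself keeps the table sorted and searches it more efficiently.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the three-field entry layout shown above. */
struct extable_entry {
	int insn, fixup, data;
};

/*
 * Store `target` as an offset relative to the address of `field`,
 * mirroring what ".long (from) - ." does at build time.
 * Assumption: `target` lies within +/- 2 GiB of `field`.
 */
static inline void set_rel(int *field, uintptr_t target)
{
	*field = (int)(target - (uintptr_t)field);
}

/* Turn a stored relative offset back into an absolute address. */
static inline uintptr_t rel_to_abs(const int *field)
{
	return (uintptr_t)field + (uintptr_t)(intptr_t)*field;
}

/*
 * Find the entry whose `insn` refers to `fault_ip`.
 * Returns NULL when the faulting address is not in the table.
 */
static inline const struct extable_entry *
find_entry(const struct extable_entry *tbl, size_t n, uintptr_t fault_ip)
{
	for (size_t i = 0; i < n; i++)
		if (rel_to_abs(&tbl[i].insn) == fault_ip)
			return &tbl[i];
	return NULL;
}
```

One consequence of the relative encoding is that an entry stays valid no matter where the image is loaded, as long as the table and the code it refers to move together.
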
-
-Other exception handling
---------------------------------------------------------------------------------
-
-In the early interrupt phase, exceptions other than the page fault are handled by `early_fixup_exception` function defined in [arch/x86/mm/extable.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/extable.c) taking two parameters - a pointer to the kernel stack that consists of saved registers and a vector number:
+For example, one of the next steps during kernel initialization is loading CPU microcode in the `load_ucode_bsp` function. This function uses the `rdmsr` instruction to check the AMD patch level:
+
 ```C
-void __init early_fixup_exception(struct pt_regs *regs, int trapnr)
+static __always_inline u64 __rdmsr(u32 msr)
 {
-	...
-	...
-	...
-}
-```
-
-First of all, we need to make some checks as following:
-
-```C
-	if (trapnr == X86_TRAP_NMI)
-		return;
-
-	if (early_recursion_flag > 2)
-		goto halt_loop;
+	EAX_EDX_DECLARE_ARGS(val, low, high);
-	if (!xen_pv_domain() && regs->cs != __KERNEL_CS)
-		goto fail;
-```
-
-Here we just ignore [NMI](https://en.wikipedia.org/wiki/Non-maskable_interrupt) and make sure that we are not in recursive situation.
-
-After that, we get into:
+	asm volatile("1: rdmsr\n"
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_RDMSR)
+		     : EAX_EDX_RET(val, low, high) : "c" (msr));
-```C
-	if (fixup_exception(regs, trapnr))
-		return;
+	return EAX_EDX_VAL(val, low, high);
+}
 ```
-The `fixup_exception` function finds the actual handler and calls it. It is defined in the same file as `early_fixup_exception` function as follows:
-
-```C
-int fixup_exception(struct pt_regs *regs, int trapnr)
-{
-	const struct exception_table_entry *e;
-	ex_handler_t handler;
-
-	e = search_exception_tables(regs->ip);
-	if (!e)
-		return 0;
+The `rdmsr` instruction reads a [model-specific register](https://en.wikipedia.org/wiki/Model-specific_register) whose number is passed in the `ecx` register. If such a register does not exist on the current processor, the instruction raises a [general protection](https://en.wikipedia.org/wiki/General_protection_fault) exception. Instead of crashing the kernel, the entry registered by the macro allows it to skip over the faulting instruction and continue execution.
-	handler = ex_fixup_handler(e);
-	return handler(e, regs, trapnr);
-}
-```
+The `_ASM_EXTABLE_TYPE` macro is used here to tell the kernel that the `rdmsr` instruction at the label `1` may fault, and if it does, execution should be resumed at the label `2`, which is just the next instruction in this case. The fault itself should be treated as an exception of the `EX_TYPE_RDMSR` type.
-The `ex_handler_t` is a type of function pointer, which is defined like:
+As a result, for every instruction wrapped in such a macro, the kernel gets one `struct exception_table_entry` in the `__ex_table` section. All these entries together form the exception table that the early exception handler, and later the generic kernel code, search through to recover from faults.
-```C
-typedef bool (*ex_handler_t)(const struct exception_table_entry *,
-                            struct pt_regs *, int)
-```
+Now we know how the Linux kernel can recover from selected exceptions without choosing a handler only by vector number. When an exception reaches this path, the kernel searches the exception table for an entry whose `insn` field matches the saved `rip` value. If an entry is found, the kernel runs the fixup handler encoded in the entry's `data` field. For the failed MSR read shown above, the handler clears the `ax` and `dx` registers so the caller does not see garbage. After that, execution jumps to the `fixup` address, which in this case is just the next instruction after `rdmsr`.
-The `search_exception_tables` function looks up the given address in the exception table (i.e. the contents of the ELF section, `__ex_table`). After that, we get the actual address by `ex_fixup_handler` function. At last we call the actual handler. For more information about the exception table, you can refer to [Documentation/x86/exception-tables.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/exception-tables.txt).
+## Jump to the generic kernel entry point
-Let's get back to the `early_fixup_exception` function, the next step is:
+We have finished the main goal of this chapter, setting up the Interrupt Descriptor Table. Only a few architecture-specific steps remain before we reach the generic kernel entry point from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c):
+
 ```C
-	if (fixup_bug(regs, trapnr))
-		return;
+asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
+void start_kernel(void)
 ```
-The `fixup_bug` function is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). Let's have a look at its implementation:
+Right after the early Interrupt Descriptor Table is loaded, the `x86_64_start_kernel` function performs the last few architecture-specific steps before it hands control over to the generic kernel code.
-```C
-int fixup_bug(struct pt_regs *regs, int trapnr)
-{
-	if (trapnr != X86_TRAP_UD)
-		return 0;
-
-	switch (report_bug(regs->ip, regs)) {
-	case BUG_TRAP_TYPE_NONE:
-	case BUG_TRAP_TYPE_BUG:
-		break;
+First, the kernel copies the boot data that the boot loader prepared for it. The boot loader fills the `boot_params` structure together with the kernel command line and leaves them in the real-mode data area, which is usually located at low addresses. The `copy_bootdata` function copies this data into the kernel's own structures in the kernel address space, so the rest of the kernel no longer needs to care where exactly the boot loader placed it. This is also a good moment to recall the page fault handler we saw earlier in this part. The boot data is reached through the direct mapping of physical memory, but after the kernel has reset all top-level page-table entries except the one that maps the kernel image, only the kernel high mapping is left in `early_top_pgt`. The direct mapping is not present there at all. So the first access to the boot data triggers a page fault, and the handler quietly builds the missing mapping on demand.
-	case BUG_TRAP_TYPE_WARN:
-		regs->ip += LEN_UD2;
-		return 1;
-	}
+After the boot data is in place, the kernel loads a [microcode](https://en.wikipedia.org/wiki/Microcode) update for the boot processor, if one is available. This is the same path that uses the `rdmsr` instruction and the exception table machinery we just saw in the previous section. If the processor does not have the model-specific register the code asks for, the exception table entry lets the kernel recover instead of crashing in the middle of early boot.
-	return 0;
-}
-```
+Finally, the kernel copies the last entry of the early top-level page table into the final top-level page table, `init_top_pgt`, which it zeroed near the beginning of this part. It also applies a couple of platform-specific quirks and, depending on the hardware, runs some additional early setup. After this, the kernel leaves the early architecture-specific code and jumps to the generic, architecture-independent entry point, the `start_kernel` function, which we will see in the next part.
-All what this function does is to return `1` if the exception is generated because `#UD` (or [Invalid Opcode](https://wiki.osdev.org/Exceptions#Invalid_Opcode) occurred and the `report_bug` function returns `BUG_TRAP_TYPE_WARN`), otherwise it returns `0`.
+## Conclusion
-Conclusion
---------------------------------------------------------------------------------
+This is the end of the second part about the initialization process of the Linux kernel. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
-This is the end of the second part about Linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see all the steps before kernel entry point - `start_kernel` function.
+In the next part, we will continue this process and see the first non-architecture-specific initialization in the kernel's generic entry point, `start_kernel`.
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+## Links
-Links
---------------------------------------------------------------------------------
+Here is the list of the links that you may find useful when reading this chapter:
-* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
-* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
-* [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
-* [Page table](https://en.wikipedia.org/wiki/Page_table)
-* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
-* [Page Fault](https://en.wikipedia.org/wiki/Page_fault),
-* [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)
+- [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
+- [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+- [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
+- [Page table](https://en.wikipedia.org/wiki/Page_table)
+- [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
+- [Page Fault](https://en.wikipedia.org/wiki/Page_fault)
+- [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
+- [Microcode](https://en.wikipedia.org/wiki/Microcode)
+- [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)