Linux Kernel Booting Process (2) - For NLKB


Upload: shimosawa

Post on 24-May-2015


Category:

Engineering



DESCRIPTION

Describes the bootstrapping part of Linux and the related architectural mechanisms and technologies. This is part two of the slides; the succeeding slides may contain errata for this one.

TRANSCRIPT

  • 1. Booting Process (2). Taku Shimosawa. For the new Linux kernel book.

2. Materials
http://www.slideshare.net/shimosawa/

3. Agenda
Virtual memory, from the architectural view. Unfortunately, this presentation again does not enter the main part of the kernel!
Appendices: a source-code-level overview of the bootstrapping process, linker scripts, and inline assembly.
Spaces, tabs, blank lines, and comments are (implicitly) omitted from the quoted source code. Omitted effective lines are denoted by "..." or "[...]".

4. Scope of the last presentation: x86
Real mode (16-bit): boot sector, setup_header, and 16-bit entry point; the C-language main function; retrieving memory information; transition to protected mode.
Protected mode (32-bit): 32-bit (/64-bit) entry point, preparing for decompression, calling the decompression code.
(EFI stub): efi_main (entry point from UEFI); EFI call functions.
Protected mode / long mode: the beginning of the main kernel.
Files:
    arch/x86/boot/: header.S, main.c, memory.c, pm.c, pmjump.S
    arch/x86/boot/compressed/: head_32.S, head_64.S, eboot.c, efi_stub_32.S, efi_stub_64.S
    arch/x86/kernel/: head_32.S, head_64.S
* The _32.S files are used only in the 32-bit kernel, and the _64.S files only in the 64-bit kernel.

5. Scope of the last presentation: ARM
Compressed: entry point; decompressing function (the actual decompression algorithm is in lib/decompress_*.c); building an FDT from ATAGS for compatibility (CONFIG_ARM_ATAG_DTB_COMPAT).
Decompressed: the beginning of the main kernel.
Files:
    arch/arm/boot/compressed/: head.S, decompress.c, atags_to_fdt.c
    arch/arm/kernel/: head.S

6. Follow-ups for the last presentation: x86 assembly language
What about instructions with three operands (e.g. imul)? Example: multiply EBX by 19 (0x13) and store the result in EAX.
                     AT&T                        Intel
    Operand order    Source, Destination         Destination, Source
    Example          imul $0x13, %ebx, %eax      IMUL EAX, EBX, 13h
Therefore, in general:
                     AT&T                        Intel
    Operand order    [[Op4,] Op3,] Op2, Op1      Op1, Op2 [, Op3 [, Op4]]

7. Follow-ups for the last presentation: Multiple Relocations?
The conclusion: at most once (on the x86 architecture). However, ELF relocation may follow the decompression, so in this sense the kernel may be relocated twice. See the relocation part of this presentation.

8. x86 Architecture: Segmentation (Errata)
Six segment registers (16-bit registers): code segment register CS; data segment registers DS, ES, FS, GS; stack segment register SS.
Real mode: 20-bit address space. Linear address = physical address. The size of each segment is 64K (16-bit offsets). The segment register holds the upper 16 bits of the segment's base in the 20-bit address space.
Protected mode: 32-bit/36-bit physical address space. Virtual (Paging) -> Linear (Segmentation) -> Physical; erratum: this should read Logical (Segmentation) -> Linear (Paging) -> Physical. The offset and limit are stored in the descriptor table, and the segment registers point to entries in the table.
Long mode: 48-bit physical address space. For CS, DS, ES, and SS, the offset is always 0 and the limit is ignored. For FS and GS, the offset can be set by the descriptor or through an MSR (for > 32-bit addresses).

9. So what? (p.32) (Errata?)
Build flow from vmlinux to bzImage:
    (1a) Strip symbols: vmlinux -> boot/compressed/vmlinux.bin
    (1b) Make relocation information: vmlinux.relocs
    (2a) Concatenate and compress (gzip, bzip2, lzma, lzo, lz4): vmlinux.bin and vmlinux.relocs -> vmlinux.bin.xz
    (2b) Append the original size info (except for gzip)
    (3) mkpiggy (piggy-back): make an object, piggy.o, that contains the compressed image
    (4) Link piggy.o with the other objects (*.o) in boot/compressed (the decompressing code) -> boot/compressed/vmlinux
    (5) Transform it into a simple binary -> boot/vmlinux.bin
    (6) Concatenate boot/vmlinux.bin with the real-mode setup code and headers (boot/setup.bin) and a CRC32 -> boot/bzImage

10. 4. Virtual Memory: Segmentation and Paging

11. Virtual Memory
The address visible to a task is virtualized, i.e. translated by hardware to a certain physical address when it is actually accessed. The hardware mechanism that translates the address is called the MMU (memory management unit).
Aim / Benefit:
Using a larger memory area than the machine is actually equipped with: memory swapping, sparse memory areas.
Isolating tasks' memory areas so that different applications cannot touch (read or write) each other's memory; not only between user tasks but also between the kernel and tasks.
Abstracting the memory resources: providing a contiguous memory area even if no physically contiguous area is available. User programs can run at certain addresses regardless of the physical addresses where they actually reside.

12. Two ways to virtual memory
Paging: dividing the memory area into chunks (pages) of a certain small size, and defining a map from each chunk to its physical location. A different task may have a different map of the memory. There is some overhead (both in speed and memory) to translate addresses and hold the map.
Segmentation: the address is considered to be an offset inside a certain segment of memory. Less overhead (just adding an offset), but swapping is impossible to achieve.

13. Illustrated
[Figure: virtual memory mapped onto physical memory in two ways: paging, through a page table (VA -> PA), and segmentation, through a segment descriptor (segment, start, end).]

14. Architecture and VM Capability
x86: capable of paging. The 16-bit and 32-bit modes have the segmentation feature. The 64-bit mode has a very limited segmentation feature, because almost no one is using segmentation effectively! (See the flat model described in a later slide.)
ARM: some CPU series have an MMU and are capable of paging (A series). Some CPU series have only an MPU (memory protection unit) (R series). No MMU: M series (the MPU is optional).

15. Focusing on paging: how does it work?
A memory instruction is issued with a virtual address. The CPU (MMU) looks for the virtual address in the TLB (Translation Lookaside Buffer). Does it exist? Yes: use the physical address in the TLB entry. No: TLB miss!
Software TLB: call the handler, and ask it to fill in a TLB entry corresponding to the virtual address. Hardware TLB: traverse the page table to find the physical address for the virtual address. Present? Yes: use the physical address and (possibly) remember it in the TLB. No: page fault! Call the handler. The handlers are the kernel's role.

16. How far should hardware go?
TLB (Translation Lookaside Buffer): a cache of virtual-to-physical mappings, with a limited number of entries.
Hardware-controlled TLB: when a TLB miss occurs, the CPU traverses the page tables. The format of the page table is defined by the architecture. Examples: x86 and ARM.
Software-controlled TLB: when a TLB miss occurs, the software (typically the OS kernel) traverses the page tables and reports the result (the translated physical address) by filling in an entry in the TLB. Any type of page table may be used (a hash-based page table, for example), but Linux uses almost the same format for this type of architecture. Example: PowerPC.

17. Multilevel Page Table (tree-like)
The typical structure of a page table. The first-level page table consists of entries that point to next-level page tables. The index is taken from the most significant bits of the virtual address. Of course, the next page table's address is physical. The entries in the leaf page table denote the physical addresses.
[Figure: a first-level page table whose entries point to second-level page tables, which point to third-level page tables holding physical addresses.]

18. x86-64 example
Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 (binary):
CR3 -> PML4 table (entries 0 to 511), entry 0 -> Page Directory Pointer Table, entry 16 -> Page Directory Table, entry 256 -> Page Table, entry 49 = 0x1234567000 -> physical address 0x12345670a5. Each entry is 64 bits.

19. x86-64
Currently, only 48 bits of a linear address are effective. A 64-bit address is the sign extension of the 48-bit address.
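The x86-64 example above can be checked with a few shifts and masks. The following is a minimal sketch, assuming 4 KB pages and 48-bit linear addresses as stated on the slides; the helper names (pt_index, is_canonical48, phys_addr, and so on) are hypothetical and are not taken from the kernel source.

```c
#include <stdint.h>
#include <stdbool.h>

/* Index into a 512-entry table: 9 bits of the virtual address starting at 'shift'. */
static inline unsigned pt_index(uint64_t va, unsigned shift)
{
    return (unsigned)((va >> shift) & 0x1ff);
}

/* Hypothetical helper names; the bit positions follow the slides. */
static inline unsigned pml4_index(uint64_t va)  { return pt_index(va, 39); } /* VA[47:39] */
static inline unsigned pdpt_index(uint64_t va)  { return pt_index(va, 30); } /* VA[38:30] */
static inline unsigned pd_index(uint64_t va)    { return pt_index(va, 21); } /* VA[29:21] */
static inline unsigned pte_index(uint64_t va)   { return pt_index(va, 12); } /* VA[20:12] */
static inline unsigned page_offset(uint64_t va) { return (unsigned)(va & 0xfff); }

/* A 48-bit address is canonical when bits 63..47 are all equal (sign extension). */
static inline bool is_canonical48(uint64_t va)
{
    uint64_t top = va >> 47;          /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffff;
}

/* Combine the 4 KB frame base found in the leaf entry with the page offset. */
static inline uint64_t phys_addr(uint64_t frame_base, uint64_t va)
{
    return frame_base | (va & 0xfff);
}
```

For 0x00000004200310a5 these helpers give the indices 0, 16, 256, and 49 and the offset 0x0a5, matching the walk on the slide; combining the leaf value 0x1234567000 with the offset gives 0x12345670a5.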
Up to 52 bits are supported for physical addresses.
The %cr3 register holds the physical address of the current PML4 table; mov ~~, %cr3 switches the page table (flushing the TLB).
Four levels: one entry in the PML4 table corresponds to 512 GB of virtual memory, an entry in the PDP table to 1 GB, and so on.
Each entry is 8 bytes and each table has 512 entries; thus, each table is 4 KB = 1 page.

20. Large Pages
One page occupies one entry in the TLB. If one process uses 1 GB of memory, it uses 256K pages (of 4 KB each), i.e. if the TLB does not have 256K entries (and usually it doesn't), TLB misses are inevitable.
x86_64 supports three page sizes: 4 KB (normal), 2 MB, and 1 GB (!). With a large page, an entry in a higher-level page table directly contains a physical address.
The disadvantage is that a larger page requires contiguous physical memory of the same size as the page.

21. x86-64 example (2MB page)
Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 (binary):
CR3 -> PML4 table (entries 0 to 511), entry 0 -> Page Directory Pointer Table, entry 16 -> Page Directory Table, entry 256 = 0x1234400000 (a 2 MB page) -> physical address 0x12344310a5. Each entry is 64 bits.

22. Linux kernel usage of large pages
The kernel mapping: the kernel creates a straight mapping of physical memory in the kernel virtual address area. This area is created during boot and never changes after that. 1 GB and 2 MB pages are used.
Hugetlbfs: explicit use from user applications.
Transparent Huge Pages: implicit (transparent) use of large pages for user applications.

23. ARM
Two memory architectures: VMSA (Virtual Memory System Architecture), with an MMU, and PMSA (Protected Memory System Architecture), with an MPU.
VMSA has two page table formats.
Short-descriptor table: up to two-level lookup; 32-bit PA (* with supersections, 40-bit addresses can be output).
Long-descriptor table: up to three-level lookup; 40-bit PA; fixed size of page tables.

24.
Names in Linux
Linux uses several architecture-independent type names for page table entries: pgd_t, pud_t, pmd_t, and pte_t. Each type represents an entry in a table of the corresponding level.
    Architecture (& Config)   Lv   pgd_t           pud_t   pmd_t           pte_t
    x86_64                    4    PML4E           PDPTE   PDE             PTE
    i386 (PAE)                3    PDPTE           -       PDE             PTE
    i386                      2    PDE             -       -               PTE
    ARM (LPAE)                3    1st-lv. Desc.   -       2nd-lv. Desc.   3rd-lv. Desc.
    ARM                       2    1st-lv. Desc.   -       -               2nd-lv. Desc.
    ARM64 (64KB page)         2    1st-lv. Desc.   -       -               2nd-lv. Desc.
    ARM64                     3    1st-lv. Desc.   -       2nd-lv. Desc.   3rd-lv. Desc.
(*) AArch64 supports four-level page tables, and thus a 48-bit VA.

25. Notes
PAE (i386): Physical Address Extension. For those who want to enjoy > 4 GB of memory in 32-bit mode. The virtual address remains 32-bit, but can map to any physical address (< 64-bit). The size of each entry is extended to 64 bits. CONFIG_X86_PAE.
LPAE: Large Physical Address Extension. Almost the same as PAE in i386. The current implementation limits the output address range to 40 bits. Each entry is extended to 64 bits (long-descriptor translation table format). CONFIG_ARM_LPAE.

26. ARM example (Short-descriptor)
Resolving 0x200310a5 = 00100000 00000011 00010000 10100101 (binary):
TTBR0 -> 1st-level table (entries 0 to 4095), entry 512 -> 2nd-level table (entries 0 to 255), entry 49 = 0x12345000 -> physical address 0x123450a5. Each entry is 32 bits.

27. Quick Chart
For each level (1st, 2nd, 3rd, 4th): the VA range used as the index, the table size (entry size x n), and the size represented by each entry.
    Intel 64-bit: [47:39], [38:30], [29:21], [20:12]; every table is 4 KB (64 bit x 512); 512 GB, 1 GB, 2 MB, 4 KB per entry.
    Intel PAE: [31:30], [29:21], [20:12]; 32 B (64 bit x 4) for the 1st level, 4 KB (64 bit x 512) for the others; 1 GB, 2 MB, 4 KB per entry.
    Intel 32-bit: [31:22], [21:12]; every table is 4 KB (32 bit x 1024); 4 MB, 4 KB per entry.
    ARM LPAE: [31:30], [29:21], [20:12]; 32 B (64 bit x 4) for the 1st level, 4 KB (64 bit x 512) for the others; 1 GB, 2 MB, 4 KB per entry.
    ARM 32-bit: [31:20], [19:12]; 16 KB (32 bit x 4096) for the 1st level, 1 KB (32 bit x 256) for the 2nd; 1 MB, 4 KB per entry.
    ARM64 (4KB granule): [38:30], [29:21], [20:12]; every table is 4 KB (64 bit x 512); 1 GB, 2 MB, 4 KB per entry.

28.
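The ARM short-descriptor example and the "size per entry" column of the quick chart follow from the bit ranges alone. Below is a minimal sketch under that assumption; the function names are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

/* ARM short-descriptor format: VA[31:20] indexes the 4096-entry first-level table. */
static inline unsigned arm_l1_index(uint32_t va) { return va >> 20; }

/* VA[19:12] indexes the 256-entry second-level table. */
static inline unsigned arm_l2_index(uint32_t va) { return (va >> 12) & 0xff; }

/* VA[11:0] is the offset within a 4 KB page. */
static inline unsigned arm_page_offset(uint32_t va) { return va & 0xfff; }

/*
 * Bytes of virtual address space covered by one entry of a table whose
 * index field starts at bit 'low_bit' (e.g. 39 for the x86-64 PML4).
 */
static inline uint64_t bytes_per_entry(unsigned low_bit)
{
    return 1ULL << low_bit;
}
```

For 0x200310a5 this yields first-level index 512, second-level index 49, and offset 0x0a5, as in the ARM example; bytes_per_entry(39) is 512 GB and bytes_per_entry(20) is 1 MB, as in the chart.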
Page size supported (by HW)
    x86_64        1 GB, 2 MB, 4 KB
    i386 (PAE)    2 MB, 4 KB
    i386          4 MB, 4 KB
    ARM           16 MB(*), 1 MB, 64 KB, 4 KB
    ARM (LPAE)    1 GB, 2 MB, 4 KB
    ARM64         1 GB, 2 MB, 4 KB (for 4KB translation granule); 32 MB, 16 KB (for 16KB translation granule); 512 MB, 64 KB (for 64KB translation granule)
(*) Depends on implementation

29. Page Attributes
Pages can have attributes, used for memory protection, demand paging, and COW (copy-on-write). Attributes: read/write, user/privileged. But where are they stored? In the page table entry corresponding to a page. However, a page table entry is basically a physical pointer, i.e. a 32-bit entry is occupied by a 32-bit physical pointer.

30. Page Attributes
The lower bits of page table entries are used: the start address of a page or page table is aligned, so its lower bits are always zero.
x86_64 entry: bit 63 XD; bits [62:52] ignored; bits [51:12] physical address; bits [11:9] ignored; bit 8 G; bit 7 PAT; bit 6 D; bit 5 A; bit 4 PCD; bit 3 PWT; bit 2 US; bit 1 RW; bit 0 P.
ARM (short descriptor) entry: bits [31:12] physical address; then nG, S, AP2, TEX, AP, C, B, 1, XN down to bit 0.

31. Page Attributes Comparison
                               x86_64                  ARM (short)
    Enabled?                   Present (P)             Desc type (bits 1 & 0)
    RO or RW?                  Read/Write (RW)         AP[2:1] or AP[2:0]
    Privileged only or any?    User/Supervisor (US)    AP[2:1] or AP[2:0]
    Write-through?             PWT                     TEX[2:0], B, C
    Cacheable?                 PCD                     TEX[2:0], B, C
    Accessed?                  Accessed (A)            AP[0] (*configurable)
    Dirty?                     Dirty (D)               N/A
    Memory type                PAT                     TEX[0], B, C (*configurable)
    Global                     Global (G)              Not Global (nG)
    Executable?                Execute-Disable (XD)    Execute-Never (XN)
    Sharable?                  (PAT)                   Sharable (S)

32.
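Because the low bits of an aligned physical address are always zero, attributes and address can share one entry. The sketch below packs and unpacks an x86_64 4 KB page table entry; the bit positions are the architectural ones named on the slides, while the macro and function names are hypothetical (the kernel uses its own _PAGE_* definitions). It looks at a single entry only; the effective permissions on real hardware also depend on the entries at the upper levels.

```c
#include <stdint.h>
#include <stdbool.h>

/* Attribute bits of an x86_64 page table entry (names are illustrative). */
#define PTE_P    (1ULL << 0)   /* Present */
#define PTE_RW   (1ULL << 1)   /* Read/Write */
#define PTE_US   (1ULL << 2)   /* User/Supervisor */
#define PTE_PWT  (1ULL << 3)   /* Write-through */
#define PTE_PCD  (1ULL << 4)   /* Cache disable */
#define PTE_A    (1ULL << 5)   /* Accessed */
#define PTE_D    (1ULL << 6)   /* Dirty */
#define PTE_PAT  (1ULL << 7)   /* Memory type (PAT) */
#define PTE_G    (1ULL << 8)   /* Global */
#define PTE_XD   (1ULL << 63)  /* Execute-Disable */

/* Physical address field, bits [51:12]. */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Build an entry from a 4 KB-aligned physical address and attribute bits. */
static inline uint64_t pte_make(uint64_t phys, uint64_t flags)
{
    return (phys & PTE_ADDR_MASK) | flags;
}

/* Extract the physical address of the mapped page. */
static inline uint64_t pte_phys(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* True when this entry alone allows user-mode writes. */
static inline bool pte_user_writable(uint64_t pte)
{
    const uint64_t need = PTE_P | PTE_RW | PTE_US;
    return (pte & need) == need;
}

/* True when this entry is present and not marked execute-disable. */
static inline bool pte_executable(uint64_t pte)
{
    return (pte & PTE_P) && !(pte & PTE_XD);
}
```

For example, pte_make(0x1234567000, PTE_P | PTE_RW | PTE_XD) describes a present, writable, kernel-only, non-executable page whose physical address is recovered unchanged by pte_phys().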
PowerPC Example [PowerPC 440]
The TLB is filled by software: search (tlbsx instruction) and read/write (tlbre and tlbwe instructions).
TLB entry fields: Effective Page Number [0:21], V, TS, SIZE, TPAR, TID; Real Page Number [0:21], PAR1, ERPN; PAR2, U0-U3, W, I, M, G, E, UX, UW, UR, SX, SW, SR; the remaining bits are reserved.
Attributes:
    V: Valid
    SIZE: Page size (4^n KB, where n in {0,1,2,3,4,5,7,9,10})
    U: User-defined storage attribute
    W: Write-through
    I: Caching inhibited
    M: Memory coherency required
    G: Guarded
    E: Endian
    UX, UW, UR: User executable, writable, readable
    SX, SW, SR: Supervisor executable, writable, readable
    TPAR, PAR1, PAR2: Parity

33. Before the kernel starts
x86 (32-bit): paging is disabled. kernel/head_32.S creates a page table and turns on paging.
x86 (64-bit): compressed/head_64.S creates an identity (virtual = physical) page table for the first 4 GB, because long mode requires paging to be enabled. kernel/head_64.S creates a better page table.
ARM: kernel/head.S creates a page table and turns on paging.

34. Virtual memory mapping
[Figure: physical memory mapped into the kernel virtual address space. i386: LOWMEM (up to ~896 MB) is mapped at PAGE_OFFSET (0xC0000000). x86_64: PAGE_OFFSET (0xFFFF880000000000) and __START_KERNEL_map (0xFFFFFFFF80000000).]

35. A. Booting in x86: by looking into the source code

36. A-1. Real Mode: plenty of assembler code, LD script, and inline assembly language

37. Real mode kernel (from p.45)
header.S: boot sector code, which is no longer used; contains setup_header; prepares the stack and BSS to run C programs; jumps into the C program (main.c).
main.c: copies setup_header into the zeropage; sets up the early console; initializes the heap; checks the CPU (64-bit capable for a 64-bit kernel?); collects HW information by querying the BIOS and stores the results in the zeropage; finally transitions to protected mode and jumps into the protected-mode kernel.

38.
Boot sector (Useless)
File: arch/x86/boot/header.S
        .global bootsect_start
    bootsect_start:
    #ifdef CONFIG_EFI_STUB
        # "MZ", MS-DOS header
        .byte 0x4d
        .byte 0x5a
    #endif
        # Normalize the start address
        ljmp $BOOTSEG, $start2
    start2:
        movw %cs, %ax
        movw %ax, %ds
        movw %ax, %es
        movw %ax, %ss
        xorw %sp, %sp
        sti
        cld
        movw $bugger_off_msg, %si
        jmp msg_loop
Notes:
    ljmp: normalizes CS to BOOTSEG (0x7c0); "movw %ds, %cs" is not allowed.
    xorw %sp, %sp: the stack starts at 0x17c00.
    sti: enables interrupts (cf. cli).
    cld: resets the direction for string instructions (clears the DF flag; cf. std).
    msg_loop: shows the message "Direct floppy boot is not supported."

39. Wait, how is the header code placed at the beginning of the kernel?
The linker concatenates multiple object files. The positions in the resulting binary are not guaranteed unless the linker is told the order. The linker script (.ld/.lds/.lds.S) tells the linker the positions! As you are quite likely to want the C preprocessor in a linker script, files with the extension .lds.S are first processed by the preprocessor and then passed to the linker. Passing a linker script with -T overrides the default linker script. The default linker script can be displayed with "ld --verbose".

40. LD script (1)
File: arch/x86/boot/setup.ld (other linker scripts: arch/x86/boot/compressed/vmlinux.lds.S, arch/x86/kernel/vmlinux.lds.S)
    /*
     * setup.ld
     *
     * Linker script for the i386 setup code
     */
    OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
    OUTPUT_ARCH(i386)
    ENTRY(_start)

    SECTIONS
    {
        . = 0;
        .bstext : { *(.bstext) }
        .bsdata : { *(.bsdata) }

        . = 495;
        .header : { *(.header) }
        .entrytext : { *(.entrytext) }
        .inittext : { *(.inittext) }
        .initdata : { *(.initdata) }
        __end_init = .;
Notes:
    OUTPUT_FORMAT(default, big, little): specifies the output format (identical to the --oformat option).
    OUTPUT_ARCH: specifies the output architecture.
    ENTRY: specifies the entry point symbol (identical to the -e option).

41.
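The boot-sector notes earlier on this page (CS normalized to 0x7c0, stack starting at 0x17c00) follow from real-mode address arithmetic, where the linear address is the segment value shifted left by four bits plus the 16-bit offset. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and address wraparound at 1 MB (the A20 gate) is deliberately ignored.

```c
#include <stdint.h>

/* Real-mode translation: linear = segment * 16 + offset (A20 wrap not modeled). */
static inline uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * With %sp = 0, the first 16-bit push wraps %sp to 0xfffe, so the stack
 * grows down from the top of the 64K segment: segment base + 0x10000.
 */
static inline uint32_t stack_top_for_zero_sp(uint16_t ss)
{
    return ((uint32_t)ss << 4) + 0x10000u;
}
```

Both 0x0000:0x7c00 and 0x07c0:0x0000 name linear address 0x7c00, which is why the far jump can normalize CS without moving the code, and stack_top_for_zero_sp(0x07c0) gives 0x17c00.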
LD script (2)
(The same setup.ld listing as on the previous slide.)
SECTIONS specifies how the sections are output. "." means the current position, and assigning to "." sets the current position. ".bstext : { *(.bstext) }" puts the .bstext section at the current position, i.e. at address 0. ".bsdata : { *(.bsdata) }" puts the .bsdata section after the .bstext section.

42. bstext section?
File: arch/x86/boot/header.S
        .code16
        .section ".bstext", "ax"
        .global bootsect_start
    bootsect_start:
    #ifdef CONFIG_EFI_STUB
        # "MZ", MS-DOS header
        .byte 0x4d
        .byte 0x5a
    #endif
        # Normalize the start address
        ljmp $BOOTSEG, $start2
    start2:
        movw %cs, %ax
        movw %ax, %ds
        movw %ax, %es
        movw %ax, %ss
        xorw %sp, %sp
        sti
        cld
        movw $bugger_off_msg, %si
        jmp msg_loop
Here it is!
[Notes]
    .code16: assembles the following code as 16-bit code.
    .section name[, flags]: starts the section. Flags (excerpt): a = allocatable (loaded into memory when executed), w = writable, x = executable.
    .globl/.global symbol: makes the symbol global (visible from other objects).

43. LD script (3)
    /*
     * setup.ld
     *
     * Linker script for the i386 setup code
     */
    OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
    OUTPUT_ARCH(i386)
    ENTRY(_start)

    SECTIONS
    {
        . = 0;
        .bstext : { *(.bstext) }
        .bsdata : { *(.bsdata) }

        .
= 495;
        .header : { *(.header) }
        .entrytext : { *(.entrytext) }
        .inittext : { *(.inittext) }
        .initdata : { *(.initdata) }
        __end_init = .;
SECTIONS specifies how the sections are output. ". = 495;" sets the current position to 495. ".header : { *(.header) }" places the .header section at address 495. "__end_init = .;" declares a symbol __end_init that refers to the current position (the end of the .initdata section).

44. LD script (4)
(The same setup.ld listing.) Resulting layout:
    0       .bstext, .bsdata
    495     .header, .entrytext, .inittext, .initdata
    xxxx    __end_init

45. LD script (5)
To be precise, the statement
    .bstext : { *(.bstext) }
outputs a section named .bstext, and that output section contains all of the input sections named .bstext. The input and output need not be one-to-one:
    .text : {
        _text = .; /* Text */
        *(.text)
        *(.text.*)
        _etext = . ;
    }
The output section .text contains all of the input sections named .text, followed by all of the sections whose names start with ".text.". The statement also creates the new symbols _text and _etext, which denote the beginning and the end of the output section .text, respectively.

46. LD script (6)
    . = ALIGN(16);
    .data : { *(.data*) }
    .signature : {
        setup_sig = .;
        LONG(0x5a5aaa55)
    }
    ...
    /DISCARD/ : { *(.note*) }
    /*
     * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
     */
    .
= ASSERT(_end
[...]
assume %sp is reasonably set
        # Invalid %ss, make up a new stack
        movw $_end, %dx
        testb $CAN_USE_HEAP, loadflags
        jz 1f
        movw heap_end_ptr, %dx
    1:  addw $STACK_SIZE, %dx
        jnc 2f
        xorw %dx, %dx       # Prevent wraparound
    2:  # Now %dx should point to the end of our stack space
        andw $~3, %dx       # dword align (might as well...)
        jnz 3f
        movw $0xfffc, %dx   # Make sure we're not zero
    3:  movw %ax, %ss
        movzwl %dx, %esp    # Clear upper half of %esp
If %ds == %ss, %sp is assumed to be properly set by the loader. If not, a new stack is set up. Its address is _end + STACK_SIZE (512 bytes), or heap_end_ptr + STACK_SIZE if CAN_USE_HEAP is set.

54. In other words
Set the stack segment to the same value as %ds, and allocate 512 bytes for the stack.
    unsigned short stack;
    if (%ds != %ss) {
        if (hdr.loadflags & CAN_USE_HEAP) {
            stack = hdr.heap_end_ptr + STACK_SIZE;
        } else {
            stack = _end + STACK_SIZE;
        }
        if (carried over) { /* stack >= 0x10000 */
            stack = 0;
        }
    }
    /* Align to 4-byte */
    stack &= ~3;
    if (stack == 0)
        stack = 0xfffc; /* i.e. -4 in 16 bits */
    %ss = %ds;
    %esp = stack;

55. Get prepared to C (CS fix and BSS clear)
File: arch/x86/boot/header.S
        sti     # Now we should have a working stack
        # We will have entered with %cs = %ds+0x20, normalize %cs so
        # it is on par with the other segments.
        pushw %ds
        pushw $6f
        lretw
    6:
        # Check signature at end of setup
        cmpl $0x5a5aaa55, setup_sig
        jne setup_bad
        # Zero the bss
        movw $__bss_start, %di
        movw $_end+3, %cx
        xorl %eax, %eax
        subw %di, %cx
        shrw $2, %cx
        rep; stosl
        # Jump to C code (should not return)
        calll main
Notes:
    $6f is the address of the label 6 (searched forward), i.e. its offset from the boot sector.
    cmpl/jne: signature check.
    rep; stosl fills the bss with zero: this string instruction fills %cx DWORDs of memory starting at %es:%di with %eax.

56.
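The stack-pointer computation above (the branch taken when %ss has to be replaced) can be written as a small pure function, which makes the carry and alignment corner cases easy to check. This is a sketch only: setup_stack_top is a hypothetical name, its parameters stand in for _end, heap_end_ptr, and the CAN_USE_HEAP flag, and STACK_SIZE is taken as the 512 bytes quoted on the slide.

```c
#include <stdint.h>
#include <stdbool.h>

#define STACK_SIZE 512  /* value quoted on the slide */

/*
 * Mirror of the assembly: pick the base, add STACK_SIZE with 16-bit
 * carry detection, align down to 4 bytes, and never return zero.
 */
static uint16_t setup_stack_top(uint16_t end, uint16_t heap_end_ptr,
                                bool can_use_heap)
{
    uint16_t base = can_use_heap ? heap_end_ptr : end;
    uint32_t sum = (uint32_t)base + STACK_SIZE;

    /* addw / jnc / xorw: a carry out of 16 bits resets the value to 0. */
    uint16_t sp = (sum > 0xffff) ? 0 : (uint16_t)sum;

    /* andw $~3: dword alignment. */
    sp &= (uint16_t)~3u;

    /* jnz / movw $0xfffc: avoid a zero stack pointer. */
    if (sp == 0)
        sp = 0xfffc;

    return sp;
}
```

For instance, a base of 0x4001 gives 0x4200 after alignment, while a base of 0xff00 overflows 16 bits and ends up as 0xfffc.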
[Column] Calling conventions
16 bit (name unknown): arguments in %ax, %dx, %cx; return value in %ax.
32 bit (cdecl): arguments are pushed on the stack (in the reverse order of the arguments); caller-saved: %eax, %ecx, and %edx; callee-saved: the others; return value in %eax (for int).
64 bit (amd64): arguments in %rdi, %rsi, %rdx, %rcx, %r8, %r9; callee-saved: %rbp, %rbx, %r12 to %r15; caller-saved: all the others; return value in %rax (%eax for int).
Example (32 bit): for f(2, 5, 9, 11); the stack holds, from higher to lower addresses: 11, 9, 5, 2, (return address).

57. Real mode kernel (p.45)
header.S: boot sector code, which is no longer used; contains setup_header; prepares the stack and BSS to run C programs; jumps into the C program (main.c).
main.c: copies setup_header into the zeropage; sets up the early console; initializes the heap; checks the CPU (64-bit capable for a 64-bit kernel?); collects HW information by querying the BIOS and stores the results in the zeropage; finally transitions to protected mode and jumps into the protected-mode kernel.

58. main
File: arch/x86/boot/main.c
    void main(void)
    {
        /* First, copy the boot header into the "zeropage" */
        copy_boot_params();
        /* Initialize the early-boot console */
        console_init();
        ...
        /* End of heap check */
        init_heap();
        /* Make sure we have all the proper CPU support */
        if (validate_cpu()) {
            ...
        }
        set_bios_mode();
        detect_memory();
        keyboard_init();
        query_mca();
        query_ist();
        ...
        /* Set the video mode */
        set_video();
        /* Do the last things and invoke protected mode */
        go_to_protected_mode();
    }

59. Copy to zeropage
Very simple. The omitted part is for compatibility with the old command-line parameter protocol (located at a certain fixed address).
File: arch/x86/boot/main.c
    struct boot_params boot_params __attribute__((aligned(16)));
    ...
    static void copy_boot_params(void)
    {
        ...
        BUILD_BUG_ON(sizeof boot_params != 4096);
        memcpy(&boot_params.hdr, &hdr, sizeof hdr);
        ...
    }

60. Set up the serial console
Parse the command line parameters in a very ad-hoc way, and find the serial configuration:
    Find "earlyprintk" and check whether it has one of the following formats: serial,0x3f8,115200; serial,ttyS0,115200; ttyS0,115200.
    Find "console" and look for "uart8250,io," or "uart,io,".
If any serial configuration is found, set it up using I/O ports.
File: arch/x86/boot/early_serial_console.c
    void console_init(void)
    {
        parse_earlyprintk();
        if (!early_serial_base)
            parse_console_uart8250();
    }

61. Puts and putchar
By BIOS call and serial I/O ports.
File: arch/x86/boot/tty.c
    void __attribute__((section(".inittext"))) putchar(int ch)
    {
        if (ch == '\n')
            putchar('\r'); /* \n -> \r\n */
        bios_putchar(ch);
        if (early_serial_base != 0)
            serial_putchar(ch);
    }

    void __attribute__((section(".inittext"))) puts(const char *str)
    {
        while (*str)
            putchar(*str++);
    }
[Notes] GCC extension __attribute__: section(section) locates the function/variable in the specified section.

62.
Serial and BIOS putchar
File: arch/x86/boot/tty.c
    static void __attribute__((section(".inittext"))) serial_putchar(int ch)
    {
        unsigned timeout = 0xffff;
        while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
            cpu_relax();
        outb(ch, early_serial_base + TXR);
    }

    static void __attribute__((section(".inittext"))) bios_putchar(int ch)
    {
        struct biosregs ireg;
        initregs(&ireg);
        ireg.bx = 0x0007;
        ireg.cx = 0x0001;
        ireg.ah = 0x0e;
        ireg.al = ch;
        intcall(0x10, &ireg, NULL);
    }
serial_putchar puts a char on a serial line by using I/O ports (IN and OUT instructions). bios_putchar puts a char on VGA by a BIOS call (INT 0x10, AH = 0x0e).

63. BIOS Call
A BIOS call is invoked by using an INT instruction, so it requires assembly language support. Parameters and return values are passed in a certain set of registers. The INT instruction only takes an immediate for the interrupt number.
C prototype:
    void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg);
struct biosregs has all the general registers, the data segment registers, and the flag register.
    void initregs(struct biosregs *reg)
    {
        memset(reg, 0, sizeof *reg);
        reg->eflags |= X86_EFLAGS_CF;
        reg->ds = ds();
        reg->es = ds();
        reg->fs = fs();
        reg->gs = gs();
    }

64. BIOS Call Impl. (1)
File: arch/x86/boot/bioscall.S
        .code16
        .section ".inittext","ax"
        .globl intcall
        ...
    intcall:
        cmpb %al, 3f
        je 1f
        movb %al, 3f
        jmp 1f      /* Synchronize pipeline */
    1:
        ...
        /* Actual INT */
        .byte 0xcd  /* INT opcode */
    3:  .byte 0
        ...
The arguments of intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg) arrive in %ax, %dx, and %cx, respectively. The code checks the current operand of the INT instruction, and rewrites (self-modifies) the interrupt number if it is different.

65. BIOS Call Impl.
(2)
    1:
        /* Save state */
        pushfl
        pushw %fs
        pushw %gs
        pushal
        /* Copy input state to stack frame */
        subw $44, %sp
        movw %dx, %si
        movw %sp, %di
        movw $11, %cx
        rep; movsd
        /* Pop full state from the stack */
        popal
        popw %gs
        popw %fs
        popw %es
        popw %ds
        popfl
        /* Actual INT */
        .byte 0xcd  /* INT opcode */
    3:  .byte 0
[Figure: the stack holds the saved registers (EFLAGS, FS, GS, EAX, ECX, ..., EDI) and, below them, a copy of struct biosregs *ireg (44 bytes: EFLAGS, FS, GS, DS, ES, EAX, ..., EDI), which is then popped into the registers.]

66. BIOS Call Impl. (3)
        /* Push full state to the stack */
        pushfl
        pushw %ds
        pushw %es
        pushw %fs
        pushw %gs
        pushal
        ... (Restore %ds, %sp, etc.) ...
        /* Copy output state from stack frame */
        movw 68(%esp), %di  /* Original %cx == 3rd argument */
        andw %di, %di
        jz 4f
        movw %sp, %si
        movw $11, %cx
        rep; movsd
        /* Restore state and return */
        popal
        popw %gs
        popw %fs
        popfl
        retl
[Figure: the registers after the INT are pushed onto the stack (EFLAGS, FS, GS, DS, ES, EAX, ..., EDI) and copied to *oreg; the saved registers (EFLAGS, FS, GS, EAX, ECX, ..., EDI) are then restored.]

67. Inline assembly
A quick way to use assembly language inside C source code. For example, when you want to disable interrupts, put
    asm("cli");
into your C code. GCC's extended inline assembly language enables far more features (and is more complicated); it is described twenty or so slides later!
    static inline void outb(u8 v, u16 port)
    {
        asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
    }

68. Initialize the heap
File: arch/x86/boot/main.c
    char *HEAP = _end;
    char *heap_end = _end; /* Default end of heap = no heap */
    ...
    static void init_heap(void)
    {
        char *stack_end;

        if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
            asm("leal %P1(%%esp),%0"
                : "=r" (stack_end) : "i" (-STACK_SIZE));
            heap_end = (char *)
                ((size_t)boot_params.hdr.heap_end_ptr + 0x200);
            if (heap_end > stack_end)
                heap_end = stack_end;
        } else {
            /* Boot protocol 2.00 only, no heap available */
            puts("WARNING: Ancient bootloader, some functionality "
                 "may be limited!\n");
        }
    }
The asm statement assigns %esp - STACK_SIZE to stack_end. heap_end is then limited so that it does not exceed stack_end.

69. When is the heap used?
The heap allocation function is very simple, and the calls to GET_HEAP exist only in the video code files.
    static inline char *__get_heap(size_t s, size_t a, size_t n)
    {
        char *tmp;
        HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
        tmp = HEAP;
        HEAP += s*n;
        return tmp;
    }
    #define GET_HEAP(type, n) \
        ((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
Example (boot/video.c):
    saved.data = GET_HEAP(u16, saved.x*saved.y);

70. Retrieving memory info.
As described in the last presentation, detect_memory tries three methods.
File: arch/x86/boot/memory.c
    int detect_memory(void)
    {
        ...
        if (detect_memory_e820() > 0)
            err = 0;
        if (!detect_memory_e801())
            err = 0;
        if (!detect_memory_88())
            err = 0;
        return err;
    }

71.
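The allocator behind GET_HEAP is a bump allocator: round the current position up to the requested alignment, hand it out, and advance. The sketch below reproduces that idea in a self-contained form. The arena, heap_off, align_up, and bump_alloc names are hypothetical, and unlike __get_heap as quoted above, this version adds a bounds check and returns NULL when the arena is exhausted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed arena standing in for the area between _end and heap_end. */
static _Alignas(16) unsigned char arena[256];
static size_t heap_off = 0;   /* current allocation position within the arena */

/* Round x up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
    return (x + (a - 1)) & ~(a - 1);
}

/* Allocate n objects of 'size' bytes each, aligned to 'align' (up to 16). */
static void *bump_alloc(size_t size, size_t align, size_t n)
{
    size_t start = align_up(heap_off, align);

    /* Added safety checks (not present in the quoted __get_heap). */
    if (n != 0 && size > SIZE_MAX / n)
        return NULL;
    if (start > sizeof arena || size * n > sizeof arena - start)
        return NULL;

    heap_off = start + size * n;
    return &arena[start];
}
```

A macro in the style of GET_HEAP could wrap bump_alloc(sizeof(type), _Alignof(type), n); nothing is ever freed individually, which matches the simple usage pattern described on the slide.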
Memory Information [from p.48]
AX = 0xe820, INT 0x15 [detect_memory_e820()]
INPUT: AX = 0xe820; CX = size of the buffer; EDX = SMAP (0x534d4150, signature); EBX = continuation value; ES:DI = address of the buffer.
OUTPUT: CF = 0 if successful, 1 otherwise; CX = number of bytes returned; EBX = continuation value.
Each call returns information for one range. To get information for the next range, pass the continuation value returned by the previous call. The range information is returned in the following structure and stored in boot_params.e820_map (struct e820entry[128]).
(arch/x86/include/uapi/asm/e820.h)
    struct e820entry {
        __u64 addr; /* start of memory segment */
        __u64 size; /* size of memory segment */
        __u32 type; /* type of memory segment */
    } __attribute__((packed));

72. E820
File: arch/x86/boot/memory.c
    static int detect_memory_e820(void)
    {
        int count = 0;
        struct biosregs ireg, oreg;
        struct e820entry *desc = boot_params.e820_map;
        static struct e820entry buf; /* static so it is zeroed */

        initregs(&ireg);
        ireg.ax = 0xe820;
        ireg.cx = sizeof buf;
        ireg.edx = SMAP;
        ireg.di = (size_t)&buf;

        do {
            intcall(0x15, &ireg, &oreg);
            ireg.ebx = oreg.ebx; /* for next iteration... */
            if (oreg.eflags & X86_EFLAGS_CF)
                break;
            ...
            *desc++ = buf;
            count++;
        } while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map));

        return boot_params.e820_entries = count;
    }

73. Video
Smells like chaos.

74. Go To Protected Mode
    void go_to_protected_mode(void)
    {
        /* Hook before leaving real mode, also disables interrupts */
        realmode_switch_hook();
        /* Enable the A20 gate */
        if (enable_a20()) {
            puts("A20 gate not responding, unable to boot...\n");
            die();
        }
        /* Reset coprocessor (IGNNE#) */
        reset_coprocessor();
        /* Mask all interrupts in the PIC */
        mask_all_interrupts();
        /* Actual transition to protected mode...
*/ setup_idt(); setup_gdt(); protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds()