The e820 trap of Linux kernel hibernation
Aug, 2015, COSCUP 2015, Taipei
Joey Lee, SUSE Labs Taipei
Agenda
Fundamental
  Hibernation (suspend to disk)
  e820, EFI memmap
e820 shift
  Platform vs. Shutdown
  memory size changing
EFI memmap shift
  setup_data and nosave regions
  EFI runtime services broken after S4
Challenges
Q&A
Fundamental
[Figure: physical memory (pfn = 0 to pfn = max_pfn) alongside runtime memory (0 to max_pfn)]
Hibernation (suspend to disk)
Create snapshot image of runtime memory.
Store snapshot image to swap partition or file.
Restore snapshot image to memory.
Hibernation (restore)
[Figure: the snapshot image is restored into memory, 0 to max_pfn ("Memory restored")]
[Figure: physical memory (pfn = 0 to pfn = max_pfn) alongside the BIOS memory map (0 to max_pfn) reported at each boot]
e820
Wikipedia: e820 is shorthand to refer to the facility by which the BIOS of x86-based computer systems reports the memory map to the operating system or boot loader.
It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. It reports which memory address ranges are usable and which are reserved for use by the BIOS.
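Each basic record returned by the int 15h, AX=E820h call is 20 bytes: a 64-bit base address, a 64-bit length and a 32-bit type. A minimal Python sketch of decoding such records, assuming a raw little-endian buffer (the name parse_e820 and the sample record are made up for illustration):

```python
import struct

# One basic e820 record: u64 base, u64 length, u32 type (little-endian, packed).
E820_ENTRY = struct.Struct("<QQI")

def parse_e820(buf):
    """Decode raw e820 records into (start, end_inclusive, type) tuples."""
    entries = []
    for off in range(0, len(buf) - E820_ENTRY.size + 1, E820_ENTRY.size):
        base, length, typ = E820_ENTRY.unpack_from(buf, off)
        entries.append((base, base + length - 1, typ))
    return entries

# Hypothetical sample: one usable (type 1) range starting at 1 MiB.
sample = E820_ENTRY.pack(0x100000, 0x7FF00000, 1)
entries = parse_e820(sample)
```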
e820 entry type
Type    Kernel define   String in dmesg                       Description
Type 1  E820_RAM        usable, System RAM                    Usable (normal) RAM
Type 2  E820_RESERVED   reserved, reserved                    Reserved - unusable
Type 3  E820_ACPI       ACPI data, ACPI Tables                ACPI reclaimable memory
Type 4  E820_NVS*       ACPI NVS, ACPI Non-volatile Storage   ACPI NVS memory, ACPI Non-Volatile-Sleeping Memory (NVS)
Type 5  E820_UNUSABLE   Unusable, Unusable memory             Area containing bad memory
* drivers/acpi/nvs.c::suspend_nvs_*() handle ACPI NVS for S4
[Figure: the BIOS memory map at boot and the runtime memory, both 0 to max_pfn. The map is split into usable, reserved, ACPI data and ACPI NVS regions; the OS runs in the usable regions.]
EFI memory map
EFI spec v2.5
EFI_BOOT_SERVICES.GetMemoryMap(): Returns the current memory map.
6.2 Memory Allocation Services
Table 25. Memory Type Usage before ExitBootServices()
Table 26. Memory Type Usage after ExitBootServices()
e820 entry type vs. EFI memory region type
E820 type   E820 entry type   EFI memory region type
Type 1      E820_RAM          EFI_LOADER_CODE (type 1)
                              EFI_LOADER_DATA (type 2)
                              EFI_BOOT_SERVICES_CODE (type 3)
                              EFI_BOOT_SERVICES_DATA (type 4)
                              EFI_CONVENTIONAL_MEMORY (type 7)
Type 2      E820_RESERVED     EFI_RESERVED_TYPE (type 0)
                              EFI_RUNTIME_SERVICES_CODE (type 5)
                              EFI_RUNTIME_SERVICES_DATA (type 6)
                              EFI_MEMORY_MAPPED_IO (type 11)
                              EFI_MEMORY_MAPPED_IO_PORT_SPACE (type 12)
                              EFI_PAL_CODE (type 13)
Type 3      E820_ACPI         EFI_ACPI_RECLAIM_MEMORY (type 9)
Type 4      E820_NVS          EFI_ACPI_MEMORY_NVS (type 10)
Type 5      E820_UNUSABLE     EFI_UNUSABLE_MEMORY (type 8)
New*        E820_PMEM         EFI_PERSISTENT_MEMORY (type 14)
* v4.2-rc4
arch/x86/boot/compressed/eboot.c::setup_e820()
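The table above can be read as a plain lookup from EFI memory region type to e820 entry type. A small Python sketch of that lookup follows; it only transcribes the table and is not the code of setup_e820(). The names EFI_TO_E820 and e820_type_for are made up for illustration.

```python
# EFI memory region type number -> e820 entry type, transcribed from the table.
EFI_TO_E820 = {
    0: "E820_RESERVED",   # EFI_RESERVED_TYPE
    1: "E820_RAM",        # EFI_LOADER_CODE
    2: "E820_RAM",        # EFI_LOADER_DATA
    3: "E820_RAM",        # EFI_BOOT_SERVICES_CODE
    4: "E820_RAM",        # EFI_BOOT_SERVICES_DATA
    5: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_CODE
    6: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_DATA
    7: "E820_RAM",        # EFI_CONVENTIONAL_MEMORY
    8: "E820_UNUSABLE",   # EFI_UNUSABLE_MEMORY
    9: "E820_ACPI",       # EFI_ACPI_RECLAIM_MEMORY
    10: "E820_NVS",       # EFI_ACPI_MEMORY_NVS
    11: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO
    12: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO_PORT_SPACE
    13: "E820_RESERVED",  # EFI_PAL_CODE
    14: "E820_PMEM",      # EFI_PERSISTENT_MEMORY (new in v4.2-rc4)
}

def e820_type_for(efi_type):
    """Return the e820 entry type name for an EFI memory region type number."""
    return EFI_TO_E820[efi_type]
```

Note that boot services code and data end up as E820_RAM, while runtime services code and data end up as E820_RESERVED.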
e820 shift
e820 shift (1)
[Figure: e820 memory maps of Boot 1 and Boot 2]
e820 shift (2)
Boot:
[ 0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable
Resume Boot:
[ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
[ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
[ 17.410733] PM: Image loading progress: 0%
[ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
[ 17.933469] IP: [] load_image_lzo+0x810/0xe40
The page fault address lies in a usable e820 entry on the normal boot, but in a reserved e820 entry on the resume boot.
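The overlap can be checked by hand from the log lines. A short Python sketch, assuming the faulting virtual address is a direct-map address with base 0xffff880000000000 (an assumption about x86_64 kernels of that era without randomization; the constant names are made up):

```python
# Assumption: x86_64 direct-mapping base for kernels of that era.
PAGE_OFFSET = 0xFFFF880000000000

# Ranges copied from the BIOS-e820 lines of the two boots.
BOOT_USABLE = (0x68F45000, 0x69D4FFFF)      # "usable" on the normal boot
RESUME_RESERVED = (0x69D4F000, 0x69E12FFF)  # "reserved" on the resume boot

# Address from the "unable to handle kernel paging request" line.
FAULT_VA = 0xFFFF880069D4F000

def in_range(addr, rng):
    """True if addr lies inside the inclusive range (start, end)."""
    return rng[0] <= addr <= rng[1]

fault_pa = FAULT_VA - PAGE_OFFSET
usable_at_boot = in_range(fault_pa, BOOT_USABLE)
reserved_at_resume = in_range(fault_pa, RESUME_RESERVED)
```

Under that assumption the physical address is 0x69d4f000, the last page of the usable range and the first page of the reserved range.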
e820 shift (3)
[Figure: memory maps of Boot and Resume Boot, both 0 to max_pfn, each split into usable, reserved, ACPI data and ACPI NVS regions. An address that is usable on Boot falls in a reserved region on Resume Boot.]
Checking e820 shift:
Lee, Chun-Yi [PATCH] PM / hibernate: avoid unsafe pages in e820 reserved regions
84c91b7ae commit in v3.17-rc1
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=84c91b7ae07c62cf6dee7fde3277f4be21331f85
Reverted by f82daee49 commit in v4.0
Waiting for Yinghai Lu [PATCH] x86: Kill E820_RESERVED_KERN
Lee, Chun-Yi [PATCH] Hibernate: save e820 table to snapshot header for comparison
https://lkml.org/lkml/2014/8/11/166
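The idea of saving the e820 table for comparison can be sketched as follows: compute a digest of the table at hibernate time, keep it with the image, and compare it on the resume boot. This is a Python illustration only, not the patch itself; the choice of SHA-256 and the names e820_digest and e820_changed are assumptions.

```python
import hashlib
import struct

def e820_digest(entries):
    """Digest over (start, end_inclusive, type) entries, in table order."""
    h = hashlib.sha256()
    for start, end, typ in entries:
        h.update(struct.pack("<QQI", start, end, typ))
    return h.hexdigest()

def e820_changed(saved_digest, current_entries):
    """True if the table seen on resume differs from the one that was saved."""
    return saved_digest != e820_digest(current_entries)

# The two entries quoted in "e820 shift (2)".
boot_map = [(0x68F45000, 0x69D4FFFF, 1)]    # usable
resume_map = [(0x69D4F000, 0x69E12FFF, 2)]  # reserved

saved = e820_digest(boot_map)
```

With a check like this, a resume boot with a shifted map can be refused before any page of the image is written.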
Platform vs. Shutdown (1)
Different modes of hibernation:
cat /sys/power/disk
[platform] shutdown reboot suspend
Platform mode depends on \_S4 support by BIOS:
[ 1.080004] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S4_] (20130725/hwxface-571)
ACPI spec 6.0: Table 7-234 BIOS-Supplied Control Methods for System-Level Functions
\_S4: Package that defines system \_S4 state mode.
16.3.2 BIOS Initialization of Memory (since ACPI v1.0):
Note: The memory information returned from the system address map reporting interfaces should be the same before and after an S4 sleep.
OSPM will invoke E820 interfaces on IA-PC-based legacy systems or the GetMemoryMap() interface on UEFI-enabled systems
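In the /sys/power/disk output, the mode in square brackets is the one currently selected. A small Python sketch that picks it out of the text (the function name parse_disk_modes is made up for illustration):

```python
def parse_disk_modes(text):
    """Return (selected_mode, all_modes) from the content of /sys/power/disk."""
    selected = None
    modes = []
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]   # the bracketed entry is the active mode
            selected = word
        modes.append(word)
    return selected, modes

selected, modes = parse_disk_modes("[platform] shutdown reboot suspend")
```

On a running Linux system the same function could be fed the content read from /sys/power/disk.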
Platform vs. Shutdown (2)
Documentation/power/swsusp.txt in kernel
Q: What is the difference between "platform" and "shutdown"?
A: "platform" is actually right thing to do where supported, but "shutdown" is most reliable (except on ACPI systems).
Linux Kernel bug #77571:
https://bugzilla.kernel.org/show_bug.cgi?id=77571
The same page fault occurs when writing the snapshot image to the page buffer.
The bug reporter used shutdown rather than platform. After switching to platform, the reporter could no longer reproduce the issue.
It is better to use platform when the BIOS supports \_S4. Users should be aware that using shutdown carries a risk.
Memory size mismatch (1)
PM: Loading and decompressing image data (495448 pages)...
[ 3.834831] PM: Image mismatch: memory size
[ 3.834851] PM: Read 1981792 kbytes in 0.01 seconds (198179.20 MB/s)
[ 3.836147] PM: Error -1 resuming
[ 3.836162] PM: Failed to load hibernation image, recovering.
Normally: On node 0 totalpages: 4177255
When issue happened: On node 0 totalpages: 4177256
 num_physpages != get_num_physpages())
	reason = "memory size";
if (reason) {
	printk(KERN_ERR "PM: Image mismatch: %s\n", reason);
	return -EPERM;
}
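The check behind this message compares the page count recorded in the image header with the page count of the running kernel. A Python sketch of the same comparison, assuming the two totalpages values above stand in for those counts (check_header here is an illustration, not the kernel function):

```python
def check_header(image_num_physpages, current_num_physpages):
    """Return the mismatch message, or None if the image may be loaded."""
    reason = None
    if image_num_physpages != current_num_physpages:
        reason = "memory size"
    if reason:
        return "PM: Image mismatch: %s" % reason
    return None

# One page of difference between the two boots is enough to reject the image.
result = check_header(4177255, 4177256)
```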
Memory size mismatch (2)
[Figure: memory map of Boot]
Memory size mismatch (3)
[Figure: memory map of Resume Boot]
EFI memmap shift
Misidentification of nosave region (1)
[Figure: a 1-page nosave region inside usable memory, next to an EFI_LOADER_DATA region that is not page aligned]
setup_data and E820_RESERVED_KERN
setup_data: a linked list that carries data along with boot_params to later boot stages.
Allocated in the EFI stub, reserved via memblock and e820.
Yinghai Lu [PATCH] x86, boot: clean up setup_data handling
https://lkml.org/lkml/2015/2/28/272
SETUP_E820_EXT, SETUP_EFI, SETUP_DTB, SETUP_PCI, SETUP_KASLR
These setup_data chunks are not page aligned when they are allocated. That leaves holes between e820 entries, which the kernel then registers as 1-page nosave regions.

trampoline_pgd:
We map EFI runtime services in the aforementioned PGD in the virtual range of 64Gb (arbitrarily set, can be raised if needed)
0xffffffef00000000 - 0xffffffff00000000
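How an unaligned chunk turns into a 1-page nosave region can be sketched in Python. The loop below is loosely modelled on how the x86 e820 code marks nosave regions (round each entry start up and each entry end down to a page frame, and register any gap); it is a simplified illustration under that assumption, and the sample entries are made up.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
E820_RAM = 1
E820_RESERVED_KERN = 128

def pfn_up(addr):
    """Page frame number of addr, rounded up."""
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT

def pfn_down(addr):
    """Page frame number of addr, rounded down."""
    return addr >> PAGE_SHIFT

def nosave_regions(entries):
    """entries: (addr, size, type) sorted by addr. Returns [(start_pfn, end_pfn)]."""
    regions = []
    pfn = 0
    for addr, size, typ in entries:
        if pfn < pfn_up(addr):
            # Gap between the previous entry's end and this entry's start.
            regions.append((pfn, pfn_up(addr)))
        pfn = pfn_down(addr + size)
        if typ not in (E820_RAM, E820_RESERVED_KERN):
            regions.append((pfn_up(addr), pfn))
    return regions

# Hypothetical map: a setup_data chunk at 0x2018 splits RAM in the middle of a page.
entries = [
    (0x0000, 0x2018, E820_RAM),
    (0x2018, 0x1FE8, E820_RESERVED_KERN),
]
holes = nosave_regions(entries)
```

The first entry ends inside page 2 and the second starts inside the same page, so page 2 is reported as a nosave region even though both neighbours are kept.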
Memory mapping of EFI runtime services (2)
Virtual memory map x86_64 of runtime service trampoline_pgd
[Figure: virtual address space from 0x0000000000000000 to 0xffffffffffffffff. The 64 G range 0xffffffef00000000 - 0xffffffff00000000 ends 4 G below the top. Labels: Runtime Code, Runtime Data, Boot Data, Boot Code, 1:1 mapping workaround; addresses 0x00000000bb385000 and 0x00000000bb3e5000.]
arch/x86/platform/efi/efi_64.c::efi_map_region()
Memory mapping of EFI runtime services (3)
In -4G area:
[Figure: the 64 G range from 0xffffffff00000000 down to 0xffffffef00000000, holding Runtime Code, Runtime Data, Boot Data and Boot Code regions, 2M-aligned]
arch/x86/platform/efi/efi_64.c::efi_map_region()
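Because regions are placed top-down in this range, the virtual address of a runtime region depends on everything mapped before it. The Python sketch below illustrates that with a simplified top-down allocator that keeps each region's offset within a 2M page; it is an assumption-laden model of the behaviour, not the code of efi_map_region(), and the region sizes and the Boot Data address are made up.

```python
PMD_SIZE = 2 << 20                       # 2M
PMD_MASK = 0xFFFFFFFFFFFFFFFF & ~(PMD_SIZE - 1)
EFI_VA_START = 0xFFFFFFFF00000000        # top of the 64 G range
EFI_VA_END = 0xFFFFFFEF00000000          # bottom of the 64 G range

def map_regions(regions):
    """regions: list of (name, phys_addr, size). Returns {name: virt_addr}."""
    efi_va = EFI_VA_START
    mapped = {}
    for name, pa, size in regions:
        efi_va -= size
        pa_offset = pa & (PMD_SIZE - 1)
        if pa_offset == 0:
            efi_va &= PMD_MASK           # PA is 2M-aligned: align the VA too
        else:
            prev_va = efi_va
            efi_va = (efi_va & PMD_MASK) + pa_offset
            if efi_va > prev_va:         # must not move back up into used space
                efi_va -= PMD_SIZE
        if efi_va < EFI_VA_END:
            raise ValueError("EFI VA range exhausted")
        mapped[name] = efi_va
    return mapped

# Same runtime region (same PA), but Boot Data has a different size on each boot.
boot = map_regions([("Boot Data", 0x10000000, 0x200000),
                    ("Runtime Code", 0xBB385000, 0x60000)])
resume = map_regions([("Boot Data", 0x10000000, 0x600000),
                      ("Runtime Code", 0xBB385000, 0x60000)])
```

In this model the Runtime Code region gets a different virtual address on the two boots although its physical address is unchanged.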
Should fix runtime services address after S4
Lee, Chun-Yi [PATCH] x86_64/efi: Mapping Boot and Runtime EFI memory regions to different starting virtual address
The VA of EFI runtime services may change across hibernation, but that is fine as long as the PA does not change.
The EFI page table should be checked in more detail during hibernation recovery.
Challenges
Hibernation's Challenge
KASLR (Kernel address space layout randomization)
  Exclusive with hibernation
Intel Rapid Start
  A replacement of kernel hibernation
  May also conflict with KASLR
NVDIMM
  Do not need hibernation anymore
Theory
Mathematics
Q&A
15/08/15