linux insides


Upload: smahendar

Post on 11-Sep-2015


  • Linux Inside

    Table of Contents

    1. Introduction
    2. Booting
        i. From bootloader to kernel
        ii. First steps in the kernel setup code
        iii. Video mode initialization and transition to protected mode
        iv. Transition to 64-bit mode
        v. Kernel decompression
    3. Initialization
        i. First steps in the kernel
        ii. Early interrupts handler
        iii. Last preparations before the kernel entry point
        iv. Kernel entry point
        v. Continue architecture-specific boot-time initializations
        vi. Architecture-specific initializations, again...
        vii. End of the architecture-specific initializations, almost...
        viii. Scheduler initialization
        ix. RCU initialization
    4. Memory management
        i. Memblock
        ii. Fixmaps and ioremap
    5. Interrupts
    6. vsyscalls and vdso
    7. SMP
    8. Concepts
        i. Per-CPU variables
        ii. Cpumasks
    9. Data Structures in the Linux Kernel
        i. Doubly linked list
        ii. Radix tree
    10. Theory
        i. Paging
        ii. Elf64
        iii. CPUID
        iv. MSR
    11. Initial ram disk
        i. initrd
    12. Misc
        i. Kernel building and installation
        ii. Write and Submit your first Linux kernel Patch
        iii. Data types in the kernel
    13. Useful links
    14. Contributors

  • linux-internals

    A series of posts about the Linux kernel and its insides.

    The goal is simple: to share my modest knowledge about the internals of the Linux kernel and help people who are interested in the Linux kernel internals and other low-level subject matter.

    Questions/Suggestions: Feel free to ask questions or make suggestions by pinging me on twitter @0xAX, adding an issue or just dropping me an email.

    Support

    If you like linux-insides you can support me with:

    Contributions

    Feel free to create issues or pull requests if you find any issues or if my English is poor.

    Please read CONTRIBUTING.md before pushing any changes.

    Author

    @0xAX

  • Kernel boot process

    This chapter describes the Linux kernel boot process. You will see here a couple of posts which describe the full cycle of the kernel loading process:

    From the bootloader to kernel - describes all stages from turning on the computer to before the first instruction of the kernel;
    First steps in the kernel setup code - describes the first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST, etc.
    Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and the transition to protected mode.
    Transition to 64-bit mode - describes preparation for the transition into 64-bit mode and the transition itself.
    Kernel Decompression - describes preparation before kernel decompression and the decompression itself.

  • Kernel booting process. Part 1.

    From the bootloader to kernel

    If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. It is very interesting for me to understand how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level and many, many other things. I decided to write yet another series of posts about the Linux kernel for x86_64.

    Note that I'm not a professional kernel hacker, and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions or remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides, and if you find something wrong with my English or the post content, feel free to send a pull request.

    Note that this isn't official documentation, just learning and sharing knowledge.

    Required knowledge

    Understanding C code
    Understanding assembly code (AT&T syntax)

    Anyway, if you have just started to learn some of these tools, I will try to explain some parts during this and the following posts. Ok, the little introduction is finished and now we can start to dive into the kernel and low-level stuff.

    All code is current as of kernel 3.18. If there are changes, I will update the posts.

    Magic power button, what's next?

    Although this is a series of posts about the Linux kernel, we will not start from kernel code (at least not in this paragraph). Ok, you pressed the magic power button on your laptop or desktop computer and it started to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for every register.

    80386 and later CPUs define the following predefined data in CPU registers after the computer resets:

        IP          0xfff0
        CS selector 0xf000
        CS base     0xffff0000

    The processor starts working in real mode now, and we need to step back a little to understand memory segmentation in this mode. Real mode is supported in all x86-compatible processors, from the 8086 to modern Intel 64-bit CPUs. The 8086 processor had a 20-bit address bus, which means that it could work with a 0-2^20 byte address space (1 megabyte). But it only had 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation was used to make use of all of the address space. All memory was divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory beyond 64 KB with 16-bit registers, another method was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:

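  • PhysicalAddress = Segment * 16 + Offset

    For example, if CS:IP is 0x2000:0x0010, the corresponding physical address will be:

        >>> hex((0x2000 << 4) + 0x0010)
        '0x20010'

    With the largest segment and offset values, 0xffff:0xffff, it will be:

        >>> hex((0xffff << 4) + 0xffff)
        '0x10ffef'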
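    The real-mode address formula above can be sketched as a small Python helper. The function name real_mode_address and the a20_enabled parameter are assumptions made for illustration only; the 20-bit wrap-around models a CPU with just 20 address lines, such as the 8086.

```python
def real_mode_address(segment, offset, a20_enabled=True):
    """Return the physical address for a real-mode segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must fit in 16 bits")
    address = (segment << 4) + offset   # same as segment * 16 + offset
    if not a20_enabled:
        address &= 0xFFFFF              # only 20 address lines: wrap at 1 MB
    return address


print(hex(real_mode_address(0x2000, 0x0010)))
print(hex(real_mode_address(0xFFFF, 0xFFFF)))
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=False)))
```

    Shifting left by 4 is the same as multiplying by 16, which is why both spellings appear in this post.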

  • Now the BIOS has started to work. After initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (512 bytes). The final two bytes of the first sector are 0x55 and 0xaa, which signal to the BIOS that the device is bootable. For example:

        ;
        ; Note: this example is written with Intel syntax
        ;
        [BITS 16]
        [ORG  0x7c00]

        boot:
            mov al, '!'
            mov ah, 0x0e
            mov bh, 0x00
            mov bl, 0x07

            int 0x10
            jmp $

        times 510-($-$$) db 0

        db 0x55
        db 0xaa

    Build and run it with:

        nasm -f bin boot.nasm && qemu-system-x86_64 boot

    This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record of a disk image.

  • In this example we can see that this code will be executed in 16-bit real mode and will start at 0x7c00 in memory. After it starts, it calls the 0x10 interrupt, which just prints the ! symbol. It fills the rest of the 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
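    The layout of such a boot sector can be illustrated with a short Python sketch. The helper names make_boot_sector and is_bootable are hypothetical, and the payload is just the two-byte machine code for `jmp $` rather than the full example above.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = bytes([0x55, 0xAA])


def make_boot_sector(code):
    """Pad `code` with zeros to 510 bytes and append the boot signature."""
    if len(code) > SECTOR_SIZE - 2:
        raise ValueError("code does not fit into one sector")
    return code + bytes(SECTOR_SIZE - 2 - len(code)) + BOOT_SIGNATURE


def is_bootable(sector):
    """Check the two magic bytes at offsets 510 and 511."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


# 0xeb 0xfe is a short jump to itself ("jmp $"), used as a placeholder payload.
sector = make_boot_sector(bytes([0xEB, 0xFE]))
print(len(sector), is_bootable(sector))
```

    This mirrors what `times 510-($-$$) db 0` followed by `db 0x55` and `db 0xaa` does in the NASM source.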

    You can also see a binary dump of it with the objdump utility:

        nasm -f bin boot.nasm
        objdump -D -b binary -mi386 -Maddr16,data16,intel boot

    A real-world boot sector has code for continuing the boot process and the partition table... instead of a bunch of 0's and an exclamation point :) Ok, so, from this moment the BIOS has handed control to the bootloader and we can go ahead.

    NOTE: as you can read above, the CPU is in real mode. In real mode, the physical address in memory is calculated as follows:

        PhysicalAddress = Segment * 16 + Offset

    as I wrote above. But we have only 16-bit general purpose registers. The maximum value of a 16-bit register is 0xffff, so if we take the biggest values, it will be:

        >>> hex((0xffff * 16) + 0xffff)
        '0x10ffef'

    where 0x10ffef is equal to 1 MB + 64 KB - 16 bytes. But the 8086 processor, which was the first processor with real mode, had a 20-bit address line, and 2^20 = 1048576 bytes is 1 MB, so the actually available amount of memory is 1 MB.

    The general real mode memory map is:

        0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
        0x00000400 - 0x000004FF - BIOS Data Area
        0x00000500 - 0x00007BFF - Unused
        0x00007C00 - 0x00007DFF - Our Bootloader
        0x00007E00 - 0x0009FFFF - Unused
        0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
        0x000B0000 - 0x000B7FFF - Monochrome Video Memory
        0x000B8000 - 0x000BFFFF - Color Video Memory
        0x000C0000 - 0x000C7FFF - Video ROM BIOS
        0x000C8000 - 0x000EFFFF - BIOS Shadow Area
        0x000F0000 - 0x000FFFFF - System BIOS

    But stop, at the beginning of the post I wrote that the first instruction executed by the CPU is located at address 0xfffffff0, which is much bigger than 0xfffff (1 MB). How can the CPU access it in real mode? The answer can be found in the coreboot documentation:

        0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space

    At the start of execution the BIOS is not in RAM, it is located in ROM.

    Bootloader

    There are a number of bootloaders which can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.

    Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer that it uses to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes grub_main.

    grub_main initializes the console, gets the base address for modules, sets the root device, loads and parses the grub configuration file, loads modules, etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the last preparations and shows a menu for selecting an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, which executes the grub boot command. It starts to boot the operating system.

    As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. The kernel header in arch/x86/boot/header.S starts from:

            .globl hdr
        hdr:
            setup_sects: .byte 0
            root_flags:  .word ROOT_RDONLY
            syssize:     .long 0
            ram_size:    .word 0
            vid_mode:    .word SVGA_MODE
            root_dev:    .word 0
            boot_flag:   .word 0xAA55

    The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not go through a description and explanation of all fields of the kernel setup header here; we will get back to them when the kernel uses them. Anyway, you can find a description of any field in the boot protocol.
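    To make the header layout more concrete, here is a minimal Python sketch that reads the seven fields listed above from offset 0x1f1. The function name read_setup_header is hypothetical, and the "image" is a synthetic 512-byte buffer with made-up field values, not a real kernel.

```python
import struct

HDR_OFFSET = 0x1F1
# setup_sects (byte), root_flags (word), syssize (long), ram_size (word),
# vid_mode (word), root_dev (word), boot_flag (word); little-endian, unpadded.
HDR_FORMAT = "<BHIHHHH"
HDR_FIELDS = ("setup_sects", "root_flags", "syssize", "ram_size",
              "vid_mode", "root_dev", "boot_flag")


def read_setup_header(image):
    """Unpack the first fields of the setup header from a kernel image."""
    values = struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET)
    return dict(zip(HDR_FIELDS, values))


# Synthetic buffer standing in for the first sector of a kernel image.
image = bytearray(512)
struct.pack_into(HDR_FORMAT, image, HDR_OFFSET, 4, 1, 0, 0, 0xFFFF, 0, 0xAA55)

header = read_setup_header(image)
print(hex(header["boot_flag"]))
```

    The seven fields take 15 bytes, so boot_flag ends exactly at offset 0x200, which puts the 0x55 0xaa bytes at the end of the first 512-byte sector.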

    As we can see in the kernel boot protocol, the memory map will be the following after kernel loading:

                 | Protected-mode kernel  |
        100000   +------------------------+
                 | I/O memory hole        |
        0A0000   +------------------------+
                 | Reserved for BIOS      | Leave as much as possible unused
                 ~                        ~
                 | Command line           | (Can also be below the X+10000 mark)
        X+10000  +------------------------+
                 | Stack/heap             | For use by the kernel real-mode code.
        X+08000  +------------------------+
                 | Kernel setup           | The kernel real-mode code.
                 | Kernel boot sector     | The kernel legacy boot sector.
              X  +------------------------+
                 | Boot loader            |

    So after the bootloader has transferred control to the kernel, it starts somewhere at:

        0x1000 + X + sizeof(KernelBootSector) + 1

    where X is the address at which the kernel boot sector is loaded. In my case X is 0x10000.

  • Ok, the bootloader has loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.

    Start of kernel setup

    Finally we are in the kernel. Technically the kernel hasn't run yet; first of all we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It looks a little strange at first, since there are many instructions before it.

    A long time ago Linux had its own bootloader, but now if you run for example:

        qemu-system-x86_64 vmlinuz-3.18-generic

    you will only see an error message printed by the old boot sector code.

    Actually header.S starts with MZ, followed by the error message printing code and the PE header:

        #ifdef CONFIG_EFI_STUB
        # "MZ", MS-DOS header
        .byte 0x4d
        .byte 0x5a
        #endif
        ...
        ...
        ...
        pe_header:
            .ascii "PE"
            .word 0

    It needs this for loading the operating system with UEFI. We will not see how it works here (we will look into it in the next parts).

    So the actual kernel setup entry point is:

        // header.S line 292
        .globl _start
        _start:

    Bootloaders (grub2 and others) know about this point (at offset 0x200 from MZ) and jump directly to it, despite the fact that header.S starts from the .bstext section, which prints an error message:

        //
        // arch/x86/boot/setup.ld
        //
        . = 0;                    // current position
        .bstext : { *(.bstext) }  // put .bstext section to position 0
        .bsdata : { *(.bsdata) }

    So the kernel setup entry point is:

        .globl _start
        _start:
            .byte 0xeb
            .byte start_of_setup-1f
        1:
            //
            // rest of the header
            //

    Here we can see the jmp instruction opcode, 0xeb, followed by the start_of_setup-1f displacement. The Nf notation means the following: 2f, for example, refers to the next local label 2:. In our case it is label 1:, which goes right after the jump and contains the rest of the setup header. Right after the setup header we can see the .entrytext section, which starts at the start_of_setup label.
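    The two bytes at _start encode a short jump whose displacement is counted from the end of the instruction, which is exactly where label 1: sits. A small Python sketch of that decoding follows; the function name and the displacement value 0x66 are made up for illustration and are not taken from a real kernel image.

```python
def short_jump_target(code, address):
    """Decode a 2-byte short jump (opcode 0xeb) located at `address`."""
    opcode, rel8 = code[0], code[1]
    if opcode != 0xEB:
        raise ValueError("not a short jump")
    if rel8 >= 0x80:              # the displacement is a signed byte
        rel8 -= 0x100
    return address + 2 + rel8     # relative to the next instruction


# Hypothetical displacement 0x66, with _start at offset 0x200.
print(hex(short_jump_target(bytes([0xEB, 0x66]), 0x200)))
```

    Because the displacement is relative to the byte after the jump, `start_of_setup - 1f` is precisely the value the assembler needs to emit.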

    Actually this is the first code that runs (aside from the previous jump instruction, of course). After the kernel setup gets control from the bootloader, the first jmp instruction is located at offset 0x200 (right after the first 512 bytes) from the start of the kernel real-mode code. We can read this in the Linux kernel boot protocol and also see it in the grub2 source code:

        state.gs = state.fs = state.es = state.ds = state.ss = segment;
        state.cs = segment + 0x20;

    It means that the segment registers will have the following values after kernel setup starts to work:

        gs = fs = es = ds = ss = 0x1000
        cs = 0x1020

    in my case, when the kernel is loaded at 0x10000.

    After the jump to start_of_setup, the code needs to do the following things:

    Be sure that all values of all segment registers are equal
    Set up a correct stack if needed
    Set up bss
    Jump to C code at main.c

    Let's look at the implementation.

    Segment registers align

    First of all it ensures that the ds and es segment registers point to the same address and clears the direction flag with the cld instruction:

        movw %ds, %ax
        movw %ax, %es
        cld

    As I wrote above, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020 because execution doesn't start from the start of the file, but from:

        _start:
            .byte 0xeb
            .byte start_of_setup-1f

    the jump, which is at a 512-byte offset from 4d 5a. cs also needs to be aligned from 0x10200 to 0x10000, like all the other segment registers, before the stack is set up. This is done with:

        pushw %ds
        pushw $6f
        lretw

    This pushes the ds value onto the stack, followed by the address of label 6, and executes the lretw instruction. When lretw runs, it loads the address of label 6 into the instruction pointer register and loads cs with the value of ds. After this, ds and cs have the same values.
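    The effect of the push/push/lretw sequence can be modelled with a Python list used as a stack. This is only a sketch: far_return is a hypothetical helper, and the offset used for label 6 is an invented value.

```python
def far_return(stack):
    """Pop a 16-bit far return frame: first the new ip, then the new cs."""
    ip = stack.pop()
    cs = stack.pop()
    return cs, ip


ds = 0x1000            # value of ds, as in the example above
label_6 = 0x0042       # hypothetical offset of label 6 inside the segment

stack = []
stack.append(ds)       # pushw %ds
stack.append(label_6)  # pushw $6f
cs, ip = far_return(stack)  # lretw
print(hex(cs), hex(ip))
```

    The order matters: the offset is pushed last so that the far return pops it first into ip and then pops the segment into cs.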

    Stack setup

    Actually, almost all of the setup code is preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:

        movw %ss, %dx
        cmpw %ax, %dx
        movw %sp, %dx
        je   2f

    Generally, there can be 3 different cases:

    ss has the valid value 0x1000 (as do all other segment registers besides cs)
    ss is invalid and the CAN_USE_HEAP flag is set (see below)
    ss is invalid and the CAN_USE_HEAP flag is not set (see below)

    Let's look at all of these cases:

    1. ss has a correct address (0x1000). In this case we go to label 2:

        2:  andw   $~3, %dx
            jnz    3f
            movw   $0xfffc, %dx
        3:  movw   %ax, %ss
            movzwl %dx, %esp
            sti

    Here we can see dx (which contains the sp given by the bootloader) being aligned to 4 bytes and checked for zero. If it is zero, we put 0xfffc (the last 4-byte aligned address within the 64 KB maximum segment size) into dx. If it is not zero, we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value, which holds the correct segment address 0x1000, into ss and set up a correct sp. We now have a correct stack.
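    The alignment and zero check on dx can be expressed in a few lines of Python. The function name initial_sp is an assumption for illustration; the logic follows the three instructions at label 2 above.

```python
def initial_sp(sp):
    """Mirror of `andw $~3, %dx` / `jnz 3f` / `movw $0xfffc, %dx`."""
    dx = sp & ~3 & 0xFFFF   # clear the two low bits: align down to 4 bytes
    if dx == 0:
        dx = 0xFFFC         # last 4-byte aligned address in a 64 KB segment
    return dx


print(hex(initial_sp(0xF7F4)), hex(initial_sp(0x0002)), hex(initial_sp(0x1237)))
```

    An already aligned value such as 0xf7f4 passes through unchanged, while a value that becomes zero after masking falls back to 0xfffc.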

    2. In the second case (ss != ds), first of all we put the _end value (the address of the end of the setup code) into dx. Then we check the loadflags header field with the testb instruction to see whether we can use the heap or not. loadflags is a bitmask header field which is defined as:

        #define LOADED_HIGH (1

  • 3. In the last case, when CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.

    Bss setup

    The last two steps that need to happen before we can jump to the main C code are setting up the bss area and checking the "magic" signature. Firstly, signature checking:

        cmpl $0x5a5aaa55, setup_sig
        jne  setup_bad

    This simply consists of comparing setup_sig against the magic number 0x5a5aaa55; if they are not equal, a fatal error is reported.

    But if the magic number matches, knowing we have a set of correct segment registers and a stack, we need only set up the bss section before jumping into the C code.

    The bss section is used for storing statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:

        movw $__bss_start, %di
        movw $_end+3, %cx
        xorl %eax, %eax
        subw %di, %cx
        shrw $2, %cx
        rep; stosl

  • First of all the __bss_start address is moved into di, and the _end + 3 address (+3 aligns to 4 bytes) is moved into cx. The eax register is cleared (using an xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then cx is divided by four (the size of a doubleword), and the stosl instruction is used repeatedly, storing the value of eax (zero) at the address pointed to by di and automatically increasing di by four (this occurs until cx reaches zero). The net effect of this code is that zeros are written through all doublewords in memory from __bss_start to _end.
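    The same loop can be simulated in Python on a bytearray. The function name clear_bss and the addresses used here (a 10-byte bss inside a 32-byte buffer) are invented for the example; only the count calculation and the 4-byte stores follow the assembly above.

```python
def clear_bss(memory, bss_start, end):
    """Zero memory from bss_start in 4-byte steps, like `rep; stosl`."""
    count = ((end + 3) - bss_start) >> 2          # number of 4-byte stores (cx)
    di = bss_start
    for _ in range(count):
        memory[di:di + 4] = b"\x00\x00\x00\x00"   # stosl: store eax (zero)
        di += 4
    return count


# Hypothetical layout: bss occupies bytes 8..17 of a 32-byte memory image.
memory = bytearray(b"\xff" * 32)
stores = clear_bss(memory, bss_start=8, end=18)
print(stores, memory.hex())
```

    Because of the +3 rounding, the last store may clear a few bytes past _end, so in this sketch bytes 18 and 19 are zeroed as well.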

    Jump to main

    That's all, we have the stack and bss, and now we can jump to the main C function:

        calll main

    which is in arch/x86/boot/main.c. What will be there? We will see it in the next part.

    Conclusion

    This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and many more.

    Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

    Links

    Intel 80386 programmer's reference manual 1986
    Minimal Boot Loader for Intel Architecture
    8086
    80386
    Reset vector
    Real mode
    Linux kernel boot protocol
    CoreBoot developer manual
    Ralf Brown's Interrupt List
    Power supply
    Power good signal

  • Kernel booting process. Part 2.

    First steps in the kernel setup

    We started to dive into Linux kernel internals in the previous part and saw the initial part of the kernel setup code. We stopped at the first call of the main function (which is the first function written in C) from arch/x86/boot/main.c. Here we will continue to research the kernel setup code and see what protected mode is, some preparation for the transition into it, the heap and console initialization, memory detection and much, much more. So... let's go ahead.

    Protected mode

    Before we can move to the native Intel64 Long mode, the kernel must switch the CPU into protected mode. What is protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the 80286 processor until Intel 64 and long mode. The main reason to move away from real mode is that there is very limited access to RAM. As you may remember from the previous part, there are only 2^20 bytes or 1 megabyte, sometimes even only 640 kilobytes.

    Protected mode brought many changes, but the main one is different memory management. The 24-bit address bus was replaced with a 32-bit address bus. It allows access to 4 gigabytes of physical address space. Paging support was also added, which you can read about in the next sections.

    Memory management in protected mode is divided into two almost independent parts:

    Segmentation
    Paging

    Here we will only look at segmentation. As you can read in the previous part, addresses consist of two parts in real mode:

    Base address of the segment
    Offset from the segment base

    And we can get the physical address if we know these two parts:

        PhysicalAddress = Segment * 16 + Offset

    Memory segmentation was completely redone in protected mode. There are no 64 kilobyte fixed-size segments. All memory segments are described by the Global Descriptor Table (GDT) instead of segment registers. The GDT is a structure which resides in memory. There is no fixed place for it in memory, but its address is stored in the special GDTR register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it, something like:

        lgdt gdt

    where the lgdt instruction loads the base address and limit of the global descriptor table into the GDTR register. GDTR is a 48-bit register and consists of two parts:

    size (16 bits) of the global descriptor table;
    address (32 bits) of the global descriptor table.
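    As an illustration of the 48-bit GDTR layout described above, the following Python sketch packs and unpacks the 6-byte value that lgdt reads from memory. The helper names and the example table (3 descriptors at address 0x1000) are assumptions; the size field is stored as a limit, which by x86 convention is the table size in bytes minus one.

```python
import struct


def pack_gdtr(base, limit):
    """Pack the 6-byte operand of lgdt: 16-bit limit, then 32-bit base."""
    return struct.pack("<HI", limit, base)


def unpack_gdtr(raw):
    """Return (base, limit) from a 6-byte GDTR image."""
    limit, base = struct.unpack("<HI", raw)
    return base, limit


# Hypothetical GDT with 3 descriptors of 8 bytes each at address 0x1000.
raw = pack_gdtr(base=0x1000, limit=3 * 8 - 1)
print(len(raw) * 8, raw.hex())
```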

  • The global descriptor table contains descriptors which describe memory segments. Every descriptor is 64 bits. The general scheme of a descriptor is:

         31          24         19      16              7            0
        ------------------------------------------------------------
        |             | |B| |A|       | |   | |0|E|W|A|            |
        | BASE 31..24 |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23:16 | 4
        |             | |D| |L| 19..16| |   | |1|C|R|A|            |
        ------------------------------------------------------------
        |                             |                            |
        |        BASE 15..0           |       LIMIT 15..0          | 0
        |                             |                            |
        ------------------------------------------------------------

    Don't worry, I know it looks a little scary after real mode, but it's easy. Let's look at it closer:

    1. Limit (20 bits, at bits 0-15 and 48-51) defines length_of_segment - 1. It depends on the G (Granularity) bit.

    if G (bit 55) is 0 and the segment limit is 0, the size of the segment is 1 byte
    if G is 1 and the segment limit is 0, the size of the segment is 4096 bytes
    if G is 0 and the segment limit is 0xfffff, the size of the segment is 1 megabyte
    if G is 1 and the segment limit is 0xfffff, the size of the segment is 4 gigabytes
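    The four cases above all follow from one rule: the size is (limit + 1) units, where a unit is 1 byte when G is 0 and 4096 bytes when G is 1. A short Python sketch (segment_size is a name chosen here for illustration):

```python
def segment_size(limit, granularity):
    """Size in bytes of a segment with a 20-bit limit and the given G bit."""
    if not 0 <= limit <= 0xFFFFF:
        raise ValueError("limit must fit in 20 bits")
    unit = 4096 if granularity else 1
    return (limit + 1) * unit


for limit in (0, 0xFFFFF):
    for g in (0, 1):
        print(hex(limit), g, segment_size(limit, g))
```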

    2. Base (bits 16-31, 32-39 and 56-63) defines the physical address of the segment's start.

    3. Type (bits 40-43) defines the type of segment and the kinds of access to it. Next, the S flag (bit 44) specifies the descriptor type. If S is 0 then this segment is a system segment, whereas if S is 1 then this is a code or data segment (stack segments are data segments which must be read/write segments). If the segment is a code or data segment, it can have one of the following access types:

        |           Type Field        | Descriptor Type | Description
        |-----------------------------|-----------------|------------------
        | Decimal                     |                 |
        |             0    E    W   A |                 |
        | 0           0    0    0   0 | Data            | Read-Only
        | 1           0    0    0   1 | Data            | Read-Only, accessed
        | 2           0    0    1   0 | Data            | Read/Write
        | 3           0    0    1   1 | Data            | Read/Write, accessed
        | 4           0    1    0   0 | Data            | Read-Only, expand-down
        | 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
        | 6           0    1    1   0 | Data            | Read/Write, expand-down
        | 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
        |                  C    R   A |                 |
        | 8           1    0    0   0 | Code            | Execute-Only
        | 9           1    0    0   1 | Code            | Execute-Only, accessed
        | 10          1    0    1   0 | Code            | Execute/Read
        | 11          1    0    1   1 | Code            | Execute/Read, accessed
        | 12          1    1    0   0 | Code            | Execute-Only, conforming
        | 13          1    1    0   1 | Code            | Execute-Only, conforming, accessed
        | 14          1    1    1   0 | Code            | Execute/Read, conforming
        | 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed

    As we can see, the first (most significant) bit of the type field is 0 for a data segment and 1 for a code segment. For data segments, the next three bits EWA are expansion direction (an expand-down segment grows down), write enable and accessed. For code segments, the CRA bits are conforming (a transfer of execution into a more-privileged conforming segment allows execution to continue at the current privilege level), read enable and accessed.

    4. DPL (descriptor privilege level, bits 45-46) defines the privilege level of the segment. It can be 0-3, where 0 is the most privileged.

    5. P flag (bit 47) - indicates whether the segment is present in memory or not.

    6. AVL flag (bit 52) - available and reserved bits.

    7. L flag (bit 53) - indicates whether a code segment contains native 64-bit code. If 1, then the code segment executes in 64-bit mode.

    8. B/D flag (bit 54) - default operation size/default stack pointer size and/or upper bound.
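    Putting the field list together, a 64-bit descriptor can be decoded with plain shifts and masks. The sketch below is illustrative: decode_descriptor is a hypothetical name, and the sample value 0x00cf9a000000ffff is a commonly used flat 4 GB code segment descriptor, chosen here as an example rather than taken from the kernel sources discussed in this part.

```python
def decode_descriptor(desc):
    """Split a 64-bit segment descriptor into its fields."""
    return {
        "limit": (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16),
        "base": ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24),
        "type": (desc >> 40) & 0xF,   # bits 40-43
        "s": (desc >> 44) & 1,        # 1 = code or data segment
        "dpl": (desc >> 45) & 3,      # privilege level 0-3
        "p": (desc >> 47) & 1,        # present
        "avl": (desc >> 52) & 1,
        "l": (desc >> 53) & 1,        # 64-bit code segment
        "db": (desc >> 54) & 1,       # default operation size
        "g": (desc >> 55) & 1,        # granularity
    }


flat_code = decode_descriptor(0x00CF9A000000FFFF)
for name, value in flat_code.items():
    print(name, hex(value))
```

    For this sample value the type field is 10 (Execute/Read in the table above), with base 0, limit 0xfffff and G set, which gives a 4 gigabyte segment.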

    Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure, the segment selector. A selector is a 16-bit structure:

        -----------------------------
        |       Index    | TI | RPL |
        -----------------------------

    where Index is the index number of the descriptor in the descriptor table. TI shows where to search for the descriptor: in the global descriptor table or the local one. And RPL is the requested privilege level.
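    A selector can be taken apart in the same way. In the x86 layout, RPL occupies bits 0-1, TI is bit 2 and Index is bits 3-15; the function name and the sample selector values below are chosen only for illustration.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into Index, TI and RPL."""
    return {
        "index": selector >> 3,       # descriptor number in the table
        "ti": (selector >> 2) & 1,    # 0 = GDT, 1 = LDT
        "rpl": selector & 3,          # requested privilege level
    }


print(decode_selector(0x10))
print(decode_selector(0x2B))
```

    Since each descriptor is 8 bytes, masking off the low three bits of a selector gives the byte offset of its descriptor within the table.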

    Every segment register has a visible and a hidden part. When a selector is loaded into one of the segment registers, it is stored in the visible part. The hidden part contains the base address, limit and access information of the descriptor that the selector points to. The following steps are needed to get the physical address in protected mode:

    The segment selector must be loaded into one of the segment registers;
    The CPU tries to find the descriptor (by GDT address + Index from the selector) and loads it into the hidden part of the segment register;
    The base address (from the segment descriptor) + offset gives the linear address, which is the physical address if paging is disabled.

  • The algorithm for the transition from real mode into protected mode is:

    Disable interrupts;
    Describe and load the GDT with the lgdt instruction;
    Set the PE (Protection Enable) bit in CR0 (Control Register 0);
    Jump to protected mode code;

    We will see the transition to protected mode in the Linux kernel in the next part, but before we can move to protected mode, we need to do some preparations.

    Let's look at arch/x86/boot/main.c. We can see some routines there which perform keyboard initialization, heap initialization, etc. Let's take a look.

    Copying boot parameters into the "zeropage"

    We will start from the main routine in "main.c". The first function which is called in main is copy_boot_params. It copies the kernel setup header into the field of the boot_params structure which is defined in arch/x86/include/uapi/asm/bootparam.h.

    The boot_params structure contains the struct setup_header hdr field. This structure contains the same fields as defined in the Linux boot protocol and is filled by the bootloader and also at kernel compile/build time. copy_boot_params does two things: it copies hdr from header.S to the setup_header field of the boot_params structure, and it updates the pointer to the kernel command line if the kernel was loaded with the old command line protocol.

  • Note that it copies hdr with the memcpy function, which is defined in the copy.S source file. Let's have a look inside:

        GLOBAL(memcpy)
            pushw   %si
            pushw   %di
            movw    %ax, %di
            movw    %dx, %si
            pushw   %cx
            shrw    $2, %cx
            rep; movsl
            popw    %cx
            andw    $3, %cx
            rep; movsb
            popw    %di
            popw    %si
            retl
        ENDPROC(memcpy)

    Yeah, we just moved to C code and now assembly again :) First of all we can see that memcpy and the other routines defined here start and end with two macros: GLOBAL and ENDPROC. GLOBAL is described in arch/x86/include/asm/linkage.h; it defines the globl directive and the label for it. ENDPROC is described in include/linux/linkage.h; it marks the name symbol as a function name and ends with the size of the name symbol.

    The implementation of memcpy is easy. At first, it pushes the values of the si and di registers onto the stack to preserve them, because they will change during the memcpy. memcpy (and the other functions in copy.S) use the fastcall calling convention, so it gets its incoming parameters from the ax, dx and cx registers. Calling memcpy looks like this:

        memcpy(&boot_params.hdr, &hdr, sizeof hdr);

    So ax will contain the address of boot_params.hdr, dx will contain the address of hdr and cx will contain the size of hdr in bytes. memcpy puts the address of boot_params.hdr into di and the address of hdr into si, and saves the size on the stack. After this it shifts the size right by 2 (that is, divides it by 4) and copies from si to di 4 bytes at a time. Then it restores the size of hdr again, keeps only the remainder of the division by 4 and copies the remaining bytes from si to di byte by byte (if there are any). Finally it restores the si and di values from the stack, and the copying is finished.
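    The two-phase copy (4 bytes at a time, then the leftover bytes) can be sketched in Python on a bytearray. The name setup_memcpy and the buffer contents are invented for this example; the comments point at the corresponding instructions in copy.S.

```python
def setup_memcpy(memory, dst, src, count):
    """Copy count // 4 dwords (movsl), then count % 4 single bytes (movsb)."""
    di, si = dst, src
    for _ in range(count >> 2):      # shrw $2, %cx ; rep; movsl
        memory[di:di + 4] = memory[si:si + 4]
        di += 4
        si += 4
    for _ in range(count & 3):       # andw $3, %cx ; rep; movsb
        memory[di] = memory[si]
        di += 1
        si += 1
    return dst


# 16 bytes of source data followed by a zeroed 16-byte destination.
memory = bytearray(range(16)) + bytearray(16)
setup_memcpy(memory, dst=16, src=0, count=11)
print(memory[16:].hex())
```

    With count = 11 the first loop runs twice (8 bytes) and the second loop copies the remaining 3 bytes.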

    Console initialization

    After hdr is copied into boot_params.hdr, the next step is console initialization by calling the console_init function, which is defined in arch/x86/boot/early_serial_console.c.

    It tries to find the earlyprintk option in the command line and, if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of the earlyprintk command line option can be one of:

    * serial,0x3f8,115200
    * serial,ttyS0,115200
    * ttyS0,115200

    After serial port initialization we can see the first output:

        if (cmdline_find_option_bool("debug"))
            puts("early console in setup code\n");

    The definition of puts is in tty.c. As we can see, it prints character by character in a loop by calling the putchar function.

  • Let's look into the putchar implementation:

        void __attribute__((section(".inittext"))) putchar(int ch)
        {
            if (ch == '\n')
                putchar('\r');

            bios_putchar(ch);

            if (early_serial_base != 0)
                serial_putchar(ch);
        }

    __attribute__((section(".inittext"))) means that this code will be in the .inittext section. We can find it in the linker file setup.ld.

    First of all, putchar checks for the \n symbol and, if it is found, prints \r before it. After that it outputs the character on the VGA screen by calling the BIOS with the 0x10 interrupt call:

        static void __attribute__((section(".inittext"))) bios_putchar(int ch)
        {
            struct biosregs ireg;

            initregs(&ireg);
            ireg.bx = 0x0007;
            ireg.cx = 0x0001;
            ireg.ah = 0x0e;
            ireg.al = ch;
            intcall(0x10, &ireg, NULL);
        }

    Here initregs takes the biosregs structure, first fills it with zeros using the memset function and then fills it with register values:

        memset(reg, 0, sizeof *reg);
        reg->eflags |= X86_EFLAGS_CF;
        reg->ds = ds();
        reg->es = ds();
        reg->fs = fs();
        reg->gs = gs();

    Let's look at the memset implementation:

        GLOBAL(memset)
            pushw   %di
            movw    %ax, %di
            movzbl  %dl, %eax
            imull   $0x01010101, %eax
            pushw   %cx
            shrw    $2, %cx
            rep; stosl
            popw    %cx
            andw    $3, %cx
            rep; stosb
            popw    %di
            retl
        ENDPROC(memset)

    As you can read above, it uses the fastcall calling convention like the memcpy function, which means that the function gets its parameters from the ax, dx and cx registers.

  • Generally memset is like the memcpy implementation. It saves the value of the di register on the stack and puts the ax value, which is the address of the biosregs structure, into di. Next is the movzbl instruction, which copies the dl value to the lowest byte of the eax register. The remaining 3 high bytes of eax will be filled with zeros.

    The next instruction multiplies eax by 0x01010101. This is needed because memset will copy 4 bytes at a time. For example, suppose we need to fill a structure with 0x7 using memset. eax will contain the value 0x00000007 in this case. If we multiply eax by 0x01010101, we get 0x07070707, and now we can copy these 4 bytes into the structure. memset uses the rep; stosl instructions to copy eax into es:di.
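    The byte replication trick is easy to check in Python. The function name fill_pattern is an assumption; the two lines correspond to the movzbl and imull instructions shown in the memset listing.

```python
def fill_pattern(value):
    """Replicate the low byte of `value` into all four bytes of a dword."""
    byte = value & 0xFF                       # movzbl %dl, %eax
    return (byte * 0x01010101) & 0xFFFFFFFF   # imull $0x01010101, %eax


print(hex(fill_pattern(0x07)), hex(fill_pattern(0xAB)))
```

    Multiplying by 0x01010101 adds the byte to itself at shifts of 0, 8, 16 and 24 bits, and since a single byte never overflows into its neighbour, the result is the byte repeated four times.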

    The rest of the memset function does almost the same as memcpy.

    After the biosregs structure is filled with memset, bios_putchar calls the 0x10 interrupt, which prints a character. Afterwards it checks whether the serial port was initialized and, if so, writes the character there with serial_putchar and the inb/outb instructions.

    Heap initialization

    After the stack and bss section were prepared in header.S (see previous part), the kernel needs to initialize the heap with the init_heap function.

    First of all init_heap checks the CAN_USE_HEAP flag from loadflags in the kernel setup header and calculates the end of the stack if this flag was set:

        char *stack_end;

        if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
            asm("leal %P1(%%esp),%0"
                : "=r" (stack_end) : "i" (-STACK_SIZE));

    or in other words stack_end = esp - STACK_SIZE.

    Then there is the heap_end calculation, which is heap_end_ptr or _end + 512, and a check: if heap_end is greater than stack_end, heap_end is set equal to stack_end.
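    The two calculations can be summarised in a short Python sketch. It is written under assumptions: STACK_SIZE is taken to be 512 bytes, heap_end is taken to be the heap_end_ptr header value plus 512 (0x200), and the register and header values in the example calls are invented.

```python
STACK_SIZE = 512  # assumed value of the STACK_SIZE constant


def compute_heap_end(esp, heap_end_ptr, stack_size=STACK_SIZE):
    """Return the heap end, clamped so that it never overlaps the stack."""
    stack_end = esp - stack_size        # leal -STACK_SIZE(%esp), stack_end
    heap_end = heap_end_ptr + 0x200     # assumed: header value plus 512
    return min(heap_end, stack_end)     # if heap_end > stack_end, use stack_end


print(hex(compute_heap_end(esp=0xFFF0, heap_end_ptr=0xE000)))
print(hex(compute_heap_end(esp=0xF000, heap_end_ptr=0xF000)))
```

    In the second call the heap would run into the stack area, so the result is clamped to stack_end.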

    From this moment we can use the heap in the kernel setup code. We will see how to use it and how the API for it is implemented in the next posts.

    CPU validation

    The next step, as we can see, is CPU validation by validate_cpu from arch/x86/boot/cpu.c.

    It calls the check_cpu function, passes the CPU level and the required CPU level to it, and checks that the kernel is launched on the right CPU. It checks the CPU's flags and the presence of long mode for x86_64 (which we will see in more detail in the next parts), checks the processor's vendor and makes preparations for certain vendors, like turning on SSE+SSE2 for AMD if they are missing, etc.

    Memory detection

    The next step is memory detection by the detect_memory function. It uses different programming interfaces for memory detection, like 0xe820, 0xe801 and 0x88. We will see only the implementation of 0xE820 here. Let's look into the detect_memory_e820 implementation from the arch/x86/boot/memory.c source file. First of all, the detect_memory_e820 function initializes the biosregs structure as we saw above and fills the registers with special values for the 0xe820 call:

        initregs(&ireg);
        ireg.ax  = 0xe820;
        ireg.cx  = sizeof buf;
        ireg.edx = SMAP;
        ireg.di  = (size_t)&buf;

    The ax register must contain the number of the function (0xe820 in our case), the cx register contains the size of the buffer which will contain data about memory, edx must contain the SMAP magic number, es:di must contain the address of the buffer which will contain memory data and ebx has to be zero.

    Next is a loop where data about the memory will be collected. It starts with a call to the 0x15 BIOS interrupt, which writes one line from the address allocation table. To get the next line we need to call this interrupt again (which we do in the loop). Before the next call, ebx must contain the value returned previously:

        intcall(0x15, &ireg, &oreg);
        ireg.ebx = oreg.ebx;

    Ultimately, it iterates in the loop to collect data from the address allocation table and writes this data into the e820entry array:

    start of memory segment
    size of memory segment
    type of memory segment (which can be reserved, usable, etc.).
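    To show how such entries turn into the ranges printed by the kernel, here is a Python sketch that parses a buffer of e820 entries. It assumes the usual packed 20-byte layout (64-bit start, 64-bit size, 32-bit type) and maps only types 1 and 2 to names; the function name and the two sample entries are made up for the example.

```python
import struct

# Assumed entry layout: u64 start address, u64 size, u32 type (20 bytes, packed).
E820_ENTRY = struct.Struct("<QQI")
E820_TYPES = {1: "usable", 2: "reserved"}


def parse_e820(buffer):
    """Return a list of (first_byte, last_byte, type_name) tuples."""
    entries = []
    for offset in range(0, len(buffer) - E820_ENTRY.size + 1, E820_ENTRY.size):
        addr, size, kind = E820_ENTRY.unpack_from(buffer, offset)
        name = E820_TYPES.get(kind, "type %d" % kind)
        entries.append((addr, addr + size - 1, name))
    return entries


# Two synthetic entries packed back to back.
raw = E820_ENTRY.pack(0x0, 0x9FC00, 1) + E820_ENTRY.pack(0x9FC00, 0x400, 2)
for start, end, kind in parse_e820(raw):
    print("[mem %#018x-%#018x] %s" % (start, end, kind))
```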

    Youcanseetheresultofthisinthedmesgoutput,somethinglike:

    [0.000000]e820:BIOS-providedphysicalRAMmap:[0.000000]BIOS-e820:[mem0x0000000000000000-0x000000000009fbff]usable[0.000000]BIOS-e820:[mem0x000000000009fc00-0x000000000009ffff]reserved[0.000000]BIOS-e820:[mem0x00000000000f0000-0x00000000000fffff]reserved[0.000000]BIOS-e820:[mem0x0000000000100000-0x000000003ffdffff]usable[0.000000]BIOS-e820:[mem0x000000003ffe0000-0x000000003fffffff]reserved[0.000000]BIOS-e820:[mem0x00000000fffc0000-0x00000000ffffffff]reserved

    The next step is the initialization of the keyboard with a call to the keyboard_init function. First, keyboard_init initializes the registers using the initregs function and calls the 0x16 interrupt to get the keyboard status. After this it calls 0x16 again to set the repeat rate and delay.

    The next couple of steps are queries for different parameters. We will not dive into the details of these queries here, but will get back to them in later parts. Let's take a short look at these functions:

    The query_mca routine calls the 0x15 BIOS interrupt to get the machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes:

        int query_mca(void)
        {
            struct biosregs ireg, oreg;
            u16 len;

            initregs(&ireg);
            ireg.ah = 0xc0;
            intcall(0x15, &ireg, &oreg);

            if (oreg.eflags & X86_EFLAGS_CF)
                return -1;  /* No MCA present */

            set_fs(oreg.es);
            len = rdfs16(oreg.bx);

            if (len > sizeof(boot_params.sys_desc_table))
                len = sizeof(boot_params.sys_desc_table);

            copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
            return 0;
        }

    Keyboard initialization

    Querying

    It fills the ah register with 0xc0 and calls the 0x15 BIOS interrupt. After the interrupt returns, it checks the carry flag: if it is set to 1, the BIOS doesn't support MCA. If the carry flag is 0, ES:BX contains a pointer to the system information table, which looks like this:

        Offset  Size      Description
         00h    WORD      number of bytes following
         02h    BYTE      model (see #00515)
         03h    BYTE      submodel (see #00515)
         04h    BYTE      BIOS revision: 0 for first release, 1 for 2nd, etc.
         05h    BYTE      feature byte 1 (see #00510)
         06h    BYTE      feature byte 2 (see #00511)
         07h    BYTE      feature byte 3 (see #00512)
         08h    BYTE      feature byte 4 (see #00513)
         09h    BYTE      feature byte 5 (see #00514)
        ---AWARD BIOS---
         0Ah    N BYTEs   AWARD copyright notice
        ---Phoenix BIOS---
         0Ah    BYTE      ??? (00h)
         0Bh    BYTE      major version
         0Ch    BYTE      minor version (BCD)
         0Dh    4 BYTEs   ASCIZ string "PTL" (Phoenix Technologies Ltd)
        ---Quadram Quad386---
         0Ah    17 BYTEs  ASCII signature string "Quadram Quad386XT"
        ---Toshiba (Satellite Pro 435CDS at least)---
         0Ah    7 BYTEs   signature "TOSHIBA"
         11h    BYTE      ??? (8h)
         12h    BYTE      ??? (E7h) product ID??? (guess)
         13h    3 BYTEs   "JPN"

    Next we call the set_fs routine and pass the value of the es register to it. The implementation of set_fs is pretty simple:

        static inline void set_fs(u16 seg)
        {
            asm volatile("movw %0,%%fs" : : "rm" (seg));
        }

    This is inline assembly which takes the value of the seg parameter and puts it into the fs register. There are many functions like set_fs in boot.h, for example set_gs, and fs and gs for reading the value of the corresponding register.

    At the end of query_mca, it just copies the table pointed to by es:bx to boot_params.sys_desc_table.

    The next step is getting Intel SpeedStep information by calling the query_ist function. First of all it checks the CPU level and, if it is correct, calls 0x15 to get the information and saves the result to boot_params.

    The following query_apm_bios function gets Advanced Power Management information from the BIOS. query_apm_bios calls the 0x15 BIOS interrupt too, but with ah = 0x53 to check for APM installation. After the 0x15 call, query_apm_bios checks the PM signature (it must be 0x504d), the carry flag (it must be 0 if APM is supported) and the value of the cx register (if it's 0x02, the protected mode interface is supported).

    Next it calls 0x15 again, but with ax = 0x5304, to disconnect the APM interface and connect the 32-bit protected mode interface. In the end it fills boot_params.apm_bios_info with the values obtained from the BIOS.

    Note that query_apm_bios is executed only if CONFIG_APM or CONFIG_APM_MODULE was set in the configuration file:

        #if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
            query_apm_bios();
        #endif

    The last is the query_edd function, which queries Enhanced Disk Drive information from the BIOS. Let's look at the query_edd implementation.

    First of all it reads the edd option from the kernel's command line, and if it was set to off, query_edd just returns.

    If EDD is enabled, query_edd goes over the BIOS-supported hard disks and queries EDD information in the following loop:

    for(devno=0x80;devnoext_ramdisk_imagehdr.ramdisk_image;initrd_size=(u64)real_mode->ext_ramdisk_sizehdr.ramdisk_size;mem_avoid[1].start=initrd_start;mem_avoid[1].size=initrd_size;

    Here we can see the calculation of the initrd start address and size. ext_ramdisk_image is the high 32 bits of the ramdisk_image field from the boot header, and ext_ramdisk_size is the high 32 bits of the ramdisk_size field from the boot protocol:

        Offset  Proto   Name            Meaning
        /Size
        ...
        ...
        ...
        0218/4  2.00+   ramdisk_image   initrd load address (set by boot loader)
        021C/4  2.00+   ramdisk_size    initrd size (set by boot loader)
        ...

    You can find ext_ramdisk_image and ext_ramdisk_size in Documentation/x86/zero-page.txt:

        Offset  Proto   Name                Meaning
        /Size
        ...
        ...
        ...
        0C0/004 ALL     ext_ramdisk_image   ramdisk_image high 32bits
        0C4/004 ALL     ext_ramdisk_size    ramdisk_size high 32bits
        ...

    So we take ext_ramdisk_image and ext_ramdisk_size, shift them left by 32 bits (so that they occupy the high 32 bits) and combine them with the low 32 bits from ramdisk_image and ramdisk_size to get the start address and the size of the initrd. After this we store these values in the mem_avoid array, which is defined as:

        #define MEM_AVOID_MAX 5
        static struct mem_vector mem_avoid[MEM_AVOID_MAX];

    where the mem_vector structure is:

        struct mem_vector {
            unsigned long start;
            unsigned long size;
        };
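The way a 64-bit initrd address is assembled from its two 32-bit halves can be sketched as follows. This is only an illustration of the shift-and-or step described above; the helper name `sketch_join_hi_lo` is hypothetical and does not exist in the kernel.

```c
#include <stdint.h>

/*
 * Join the high and low 32-bit halves of a 64-bit value, the way
 * ext_ramdisk_image (high half) and ramdisk_image (low half) are combined.
 */
static uint64_t sketch_join_hi_lo(uint32_t hi, uint32_t lo)
{
    /* Widen before shifting, otherwise the high bits would be lost. */
    return ((uint64_t)hi << 32) | lo;
}
```

With a high half of zero the result is just the low half, which is the common case for an initrd loaded below 4 GB.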

    The next step, after we have collected all unsafe memory regions in the mem_avoid array, is to search with the find_random_addr function for a random address which does not overlap the unsafe regions.

    First of all we can see the alignment of the output address in the find_random_addr function:

        minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

    You may remember the CONFIG_PHYSICAL_ALIGN configuration option from the previous part. This option provides the value to which the kernel should be aligned, and it is 0x200000 by default. Once we have the aligned output address, we go through the memory and collect regions which are suitable for the decompressed kernel image:

        for (i = 0; i < real_mode->e820_entries; i++) {
            process_e820_entry(&real_mode->e820_map[i], minimum, size);
        }

    You may remember that we collected e820_entries in the second part of the kernel booting process.

    First of all, the process_e820_entry function checks that the e820 memory region is RAM, that the start address of the memory region is not bigger than the maximum allowed ASLR offset, and that the memory region is not less than the value of the kernel alignment:

        struct mem_vector region, img;

        if (entry->type != E820_RAM)
            return;

        if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
            return;

    if(entry->addr+entry->sizeaddr;region.size=entry->size;

    After storing these values, we align region.start as we did in the find_random_addr function and check that we didn't get an address that is beyond the original memory region:

        region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);

        if (region.start > entry->addr + entry->size)
            return;

    Next we subtract the difference between the aligned and the original start address from the region size. If the last address in the memory region is bigger than CONFIG_RANDOMIZE_BASE_MAX_OFFSET, we also reduce the memory region size so that the end of the kernel image stays below the maximum ASLR offset:

        region.size -= region.start - entry->addr;

        if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
            region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
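The align-then-trim steps above can be tried out in user space with a small sketch. Everything prefixed `sk_` is a hypothetical name introduced for this example, the alignment is fixed at the 0x200000 default quoted in the text, and the maximum offset is passed as a parameter instead of coming from the kernel configuration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PHYSICAL_ALIGN 0x200000ULL  /* default alignment quoted in the text */

struct sk_region { uint64_t start; uint64_t size; };

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t sk_align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/*
 * Align the start of [addr, addr + size) and trim the region so that it
 * ends at or below max_offset. The caller is assumed to have rejected
 * regions that start at or above max_offset, as the earlier check does.
 */
static bool sk_trim_region(uint64_t addr, uint64_t size, uint64_t max_offset,
                           struct sk_region *out)
{
    uint64_t start = sk_align_up(addr, SK_PHYSICAL_ALIGN);

    if (start > addr + size)        /* alignment pushed us past the end */
        return false;

    size -= start - addr;           /* drop the bytes skipped by alignment */
    if (start + size > max_offset)  /* clamp to the maximum allowed offset */
        size = max_offset - start;

    out->start = start;
    out->size = size;
    return true;
}
```

For the large usable range from the e820 sample earlier in the book (start 0x100000), the aligned start becomes 0x200000 and the size shrinks by the 1 MB that was skipped.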

    In the end we go through all the unsafe memory regions and check that this region does not overlap unsafe areas such as the kernel command line, the initrd, etc.:

        for (img.start = region.start, img.size = image_size ;
             mem_contains(&region, &img) ;
             img.start += CONFIG_PHYSICAL_ALIGN) {
            if (mem_avoid_overlap(&img))
                continue;
            slots_append(img.start);
        }

    If the memory region does not overlap the unsafe regions, we call the slots_append function with the start address of the region. slots_append just collects the start addresses of memory regions into the slots array:

        slots[slot_max++] = addr;

    which is defined as:

        static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
                                   CONFIG_PHYSICAL_ALIGN];
        static unsigned long slot_max;
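The slot collection loop can be modelled in a few lines of ordinary C. This is a simplified sketch rather than the kernel's implementation: `sk_overlaps` and `sk_collect_slots` are hypothetical names, the unsafe regions are passed in as a plain array, and the alignment step is a parameter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sk_vec { uint64_t start; uint64_t size; };

/* True when the half-open ranges [start, start + size) intersect. */
static bool sk_overlaps(const struct sk_vec *a, const struct sk_vec *b)
{
    return a->start < b->start + b->size && b->start < a->start + a->size;
}

/*
 * Walk the region in steps of `align` and record every start address at
 * which an image of `image_size` bytes fits without touching an unsafe
 * area. Returns the number of slots written to `slots`.
 */
static size_t sk_collect_slots(struct sk_vec region, uint64_t image_size,
                               uint64_t align,
                               const struct sk_vec *avoid, size_t n_avoid,
                               uint64_t *slots, size_t max_slots)
{
    size_t count = 0;
    struct sk_vec img = { region.start, image_size };

    for (; img.start + img.size <= region.start + region.size;
         img.start += align) {
        bool unsafe = false;

        for (size_t i = 0; i < n_avoid; i++)
            if (sk_overlaps(&img, &avoid[i]))
                unsafe = true;

        if (!unsafe && count < max_slots)
            slots[count++] = img.start;
    }
    return count;
}
```

Picking one of the collected slots at random, as slots_fetch_random does, then amounts to indexing the array with a random number modulo the returned count.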

    After process_e820_entry has been executed, we have an array of addresses which are safe for the decompressed kernel. Next we call the slots_fetch_random function to get a random item from this array:

        if (slot_max == 0)
            return 0;

        return slots[get_random_long() % slot_max];

    where the get_random_long function checks different CPU flags such as X86_FEATURE_RDRAND or X86_FEATURE_TSC and chooses a method for getting a random number (it can be obtained with the RDRAND instruction, the time stamp counter, the programmable interval timer, etc.). Once we have the random address, the execution of choose_kernel_location is finished.

    Now let's get back to misc.c. After we have the address for the kernel image, some checks are needed to be sure that the random address is correctly aligned and that the address is not wrong.

    After all these checks we will see the familiar message:

        Decompressing Linux...

    and call the decompress function, which will decompress the kernel. The decompress function depends on which decompression algorithm was chosen during kernel compilation:

        #ifdef CONFIG_KERNEL_GZIP
        #include "../../../../lib/decompress_inflate.c"
        #endif

        #ifdef CONFIG_KERNEL_BZIP2
        #include "../../../../lib/decompress_bunzip2.c"
        #endif

        #ifdef CONFIG_KERNEL_LZMA
        #include "../../../../lib/decompress_unlzma.c"
        #endif

        #ifdef CONFIG_KERNEL_XZ
        #include "../../../../lib/decompress_unxz.c"
        #endif

        #ifdef CONFIG_KERNEL_LZO
        #include "../../../../lib/decompress_unlzo.c"
        #endif

        #ifdef CONFIG_KERNEL_LZ4
        #include "../../../../lib/decompress_unlz4.c"
        #endif

    After the kernel has been decompressed, the last function, handle_relocations, relocates the kernel to the address that we got from choose_kernel_location. Once the kernel is relocated, we return from decompress_kernel to head_64.S. The address of the kernel will be in the rax register and we jump to it:

        jmp *%rax

    That's all. Now we are in the kernel!

    Conclusion

    This is the end of the fifth and last part about the Linux kernel booting process. We will not see more posts about kernel booting (maybe only updates to this and previous posts), but there will be many posts about other kernel internals.

    The next chapter will be about kernel initialization, and we will see the first steps in the Linux kernel initialization code.

    If you have any questions or suggestions, write me a comment or ping me on twitter.

    Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

    Links

        address space layout randomization
        initrd
        long mode
        bzip2
        RdRand instruction
        Time Stamp Counter
        Programmable Interval Timers
        Previous part


    Kernel initialization process

    You will find here a couple of posts which describe the full cycle of kernel initialization, from its first steps after the kernel has been decompressed to the start of the first process run by the kernel itself.

    Note that there will not be a description of all the kernel initialization steps. Here will be only the generic kernel part, without interrupt handling, ACPI, and many other parts. All parts which I miss will be described in other chapters.

        First steps after kernel decompression - describes the first steps in the kernel.
        Early interrupt and exception handling - describes early interrupt initialization and the early page fault handler.
        Last preparations before the kernel entry point - describes the last preparations before the call of start_kernel.
        Kernel entry point - describes the first steps in the kernel generic code.
        Continue of architecture-specific initializations - describes architecture-specific initialization.
        Architecture-specific initializations, again... - describes the continuation of the architecture-specific initialization process.
        The End of the architecture-specific initializations, almost... - describes the end of the setup_arch related stuff.
        Scheduler initialization - describes preparation before scheduler initialization and the initialization of it.
        RCU initialization - describes the initialization of the RCU.


    In the previous post (Kernel booting process. Part 5. - Kernel decompression) we stopped at the jump to the decompressed kernel:

        jmp *%rax

    and now we are in the kernel. There are many things to do before the kernel starts the first init process. Hopefully we will see all of the preparations before the kernel starts in this big chapter. We will start from the kernel entry point, which is in arch/x86/kernel/head_64.S. We will see the first preparations, like early page table initialization, the switch to a new descriptor in kernel space and many more, before we see the start_kernel function from init/main.c being called.

    So let's start.

    Okay, we got the address of the kernel from the decompress_kernel function into the rax register and just jumped there. The decompressed kernel code starts in arch/x86/kernel/head_64.S:

            __HEAD
            .code64
            .globl startup_64
        startup_64:
            ...
            ...
            ...

    We can see the definition of the startup_64 routine, and it is defined in the __HEAD section, which is just:

        #define __HEAD .section ".head.text","ax"

    We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:

        .text : AT(ADDR(.text) - LOAD_OFFSET) {
            _text = .;
            ...
            ...
            ...
        } :text = 0x9090

    We can understand the default virtual and physical addresses from the linker script. Note that the address of _text is the location counter, which is defined as:

        . = __START_KERNEL;

    for x86_64. We can find the definition of the __START_KERNEL macro in arch/x86/include/asm/page_types.h:

    Kernel initialization. Part 1.

    First steps in the kernel code

    First steps in the kernel


        #define __START_KERNEL      (__START_KERNEL_map + __PHYSICAL_START)

        #define __PHYSICAL_START    ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

    Here we can see that __START_KERNEL is the sum of __START_KERNEL_map (which is 0xffffffff80000000, see the post about paging) and __PHYSICAL_START, where __PHYSICAL_START is the aligned value of CONFIG_PHYSICAL_START. So if you do not use kASLR and do not change CONFIG_PHYSICAL_START in the configuration, you will have the following addresses:

        Physical address - 0x1000000;
        Virtual address - 0xffffffff81000000.

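    Now we know the default physical and virtual addresses of the startup_64 routine, but to know the actual addresses we must calculate them with the following code:

        leaq    _text(%rip), %rbp
        subq    $_text - __START_KERNEL_map, %rbp

    Here we put the rip-relative address of _text into the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that the compiled address of _text is 0xffffffff81000000 and that __START_KERNEL_map is 0xffffffff80000000, so the subtracted constant is 0x1000000. After this calculation rbp holds the difference between the physical address the kernel was actually loaded at and the default 0x1000000; it is zero when the kernel runs at the default address. We need to calculate it because the kernel may be run at an address other than the default one, and with this value we know where it actually is.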
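The arithmetic of these two instructions can be checked with a short sketch. It assumes the default values given in the text (0xffffffff81000000 for the compiled address of _text and 0xffffffff80000000 for __START_KERNEL_map); the `SK_` constants and the function name are hypothetical and stand in for what the assembler computes at build time.

```c
#include <stdint.h>

#define SK_START_KERNEL_MAP 0xffffffff80000000ULL
#define SK_TEXT_VADDR       0xffffffff81000000ULL  /* compiled address of _text */

/*
 * Mirrors:  leaq _text(%rip), %rbp
 *           subq $_text - __START_KERNEL_map, %rbp
 * runtime_text_addr is the address at which _text really sits in memory.
 */
static uint64_t sk_load_delta(uint64_t runtime_text_addr)
{
    return runtime_text_addr - (SK_TEXT_VADDR - SK_START_KERNEL_MAP);
}
```

A kernel loaded at the default 0x1000000 yields 0, while one loaded 32 MB higher at 0x3000000 yields 0x2000000.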

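    In the next step we check that this value is aligned with:

        movq    %rbp, %rax
        andl    $~PMD_PAGE_MASK, %eax
        testl   %eax, %eax
        jnz     bad_address

    Here we copy the value into rax and test its low bits: if any bit below the PMD page boundary is set, the address is not properly aligned and we jump to bad_address. PMD_PAGE_MASK is the mask for the page middle directory (read the paging post about it) and is defined as:

        #define PMD_PAGE_MASK       (~(PMD_PAGE_SIZE-1))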

    #definePMD_PAGE_SIZE(_AC(1,UL)
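The alignment test performed by the andl/testl pair can be sketched in C as below. The sketch assumes a PMD page of 2 MB (a shift of 21, the PMD_SHIFT value listed later in this part); the `SK_` names are introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PMD_SHIFT     21                          /* assumed: 2 MB pages */
#define SK_PMD_PAGE_SIZE (1ULL << SK_PMD_SHIFT)
#define SK_PMD_PAGE_MASK (~(SK_PMD_PAGE_SIZE - 1))

/* An address is PMD-aligned when all bits below the page size are zero. */
static bool sk_is_pmd_aligned(uint64_t addr)
{
    return (addr & ~SK_PMD_PAGE_MASK) == 0;
}
```

Under this assumption 0x1000000 (16 MB) passes the test, while 0x1001000 would take the bad_address branch.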

    The first step, before we start to set up identity paging, is to correct the following addresses:

        addq    %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
        addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
        addq    %rbp, level3_kernel_pgt + (511*8)(%rip)
        addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)

    Here we need to correct early_level4_pgt and the other page table directory addresses because, as I wrote above, the kernel may not be running at the default 0x1000000 address. The rbp register contains the offset of the actual load address from the default one, so we add it to the entries of early_level4_pgt, level3_kernel_pgt and level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definitions:

        NEXT_PAGE(early_level4_pgt)
            .fill   511,8,0
            .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

        NEXT_PAGE(level3_kernel_pgt)
            .fill   L3_START_KERNEL,8,0
            .quad   level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
            .quad   level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

        NEXT_PAGE(level2_kernel_pgt)
            PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
                KERNEL_IMAGE_SIZE/PMD_SIZE)

        NEXT_PAGE(level2_fixmap_pgt)
            .fill   506,8,0
            .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
            .fill   5,8,0

        NEXT_PAGE(level1_fixmap_pgt)
            .fill   512,8,0

    It looks hard, but it isn't.

    First of all let's look at early_level4_pgt. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first 511 entries of early_level4_pgt. After this we can see the level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map from it and add _PAGE_TABLE. As we know, __START_KERNEL_map is the base virtual address of the kernel text, so if we subtract __START_KERNEL_map we get the physical address of level3_kernel_pgt. Now let's look at _PAGE_TABLE; it is just the page entry access rights:

        #define _PAGE_TABLE     (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                                 _PAGE_ACCESSED | _PAGE_DIRTY)

    You can read more about it in the paging post.
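How a table entry is put together from a physical address and the access-rights mask can be shown with a small sketch. The bit positions are the architectural x86 page-table flag bits; the `SK_` names and the helper are hypothetical, and the example address 0x2000 is arbitrary.

```c
#include <stdint.h>

/* x86 page-table entry flag bits that make up the mask. */
#define SK_PAGE_PRESENT  (1ULL << 0)
#define SK_PAGE_RW       (1ULL << 1)
#define SK_PAGE_USER     (1ULL << 2)
#define SK_PAGE_ACCESSED (1ULL << 5)
#define SK_PAGE_DIRTY    (1ULL << 6)

#define SK_PAGE_TABLE (SK_PAGE_PRESENT | SK_PAGE_RW | SK_PAGE_USER | \
                       SK_PAGE_ACCESSED | SK_PAGE_DIRTY)

/*
 * A table entry is the page-aligned physical address of the next-level
 * table with the flag bits stored in its low 12 bits.
 */
static uint64_t sk_make_entry(uint64_t table_phys, uint64_t flags)
{
    return table_phys | flags;
}
```

Because the tables are page aligned, the low 12 bits of the address are always zero, which is why adding the flags (as the .quad expressions do) and or-ing them give the same result.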

    level3_kernel_pgt stores the entries which map kernel space. At the start of its definition we can see that it is filled with zeros L3_START_KERNEL times. Here L3_START_KERNEL is the index in the page upper directory which contains the __START_KERNEL_map address, and it equals 510. After it we can see the definition of two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. The first is simple: it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has the

        #define _KERNPG_TABLE   (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                                 _PAGE_DIRTY)

    access rights. The second, level2_fixmap_pgt, covers virtual addresses which can refer to any physical addresses, even under kernel space.

    Fix base addresses of page tables

    Next, level2_kernel_pgt calls the PMDS macro, which creates 512 megabytes from __START_KERNEL_map for the kernel text (after these 512 megabytes comes the module memory space).

    Now that we know what these labels mean, let's get back to the code at the beginning of the section. Remember that rbp contains the offset between the address where the kernel actually runs and the address it was compiled for. We just add this offset to the base addresses stored in the page tables so that they have correct addresses:

        addq    %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
        addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
        addq    %rbp, level3_kernel_pgt + (511*8)(%rip)
        addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)

    The first line adds rbp to the early_level4_pgt entry that points to level3_kernel_pgt, the second line adds it to the level3_kernel_pgt entry that points to level2_kernel_pgt, the third line adds it to the level3_kernel_pgt entry that points to level2_fixmap_pgt, and the last line adds it to the level2_fixmap_pgt entry that points to level1_fixmap_pgt.

    After all of this we will have:

        early_level4_pgt[511]  -> level3_kernel_pgt[0]
        level3_kernel_pgt[510] -> level2_kernel_pgt[0]
        level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
        level2_kernel_pgt[0]   -> 512 MB kernel mapping
        level2_fixmap_pgt[506] -> level1_fixmap_pgt
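The indices 511 and 510 in this layout follow directly from the bits of __START_KERNEL_map. The sketch below assumes 4-level paging with 512 entries per table and the shift values 39, 30 and 21 listed later in this part; `sk_index` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define SK_PGDIR_SHIFT 39
#define SK_PUD_SHIFT   30
#define SK_PMD_SHIFT   21
#define SK_PTRS        512  /* entries per table */

/* Extract the 9-bit table index that starts at bit `shift` of vaddr. */
static unsigned sk_index(uint64_t vaddr, unsigned shift)
{
    return (unsigned)((vaddr >> shift) & (SK_PTRS - 1));
}
```

For 0xffffffff80000000 the page global directory index is 511, the page upper directory index is 510 and the page middle directory index is 0, which matches the three kernel-mapping lines of the layout above.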

    Now that we have corrected the base addresses of the page tables, we can start to build them.

    Now we can see the setup of the identity-mapped early page tables. Identity mapped paging means virtual addresses which are mapped to physical addresses that have the same value, 1:1. Let's look at it in detail. First of all we get the rip-relative addresses of _text and early_level4_pgt and put them into the rdi and rbx registers:

        leaq    _text(%rip), %rdi
        leaq    early_level4_pgt(%rip), %rbx

    After this we store the physical address of _text in rax and get the index of the page global directory entry which stores the _text address, by shifting the _text address right by PGDIR_SHIFT:

        movq    %rdi, %rax
        shrq    $PGDIR_SHIFT, %rax

        leaq    (4096 + _KERNPG_TABLE)(%rbx), %rdx
        movq    %rdx, 0(%rbx,%rax,8)
        movq    %rdx, 8(%rbx,%rax,8)

    where PGDIR_SHIFT is 39. PGDIR_SHIFT indicates the position of the page global directory bits in a virtual address. There are macros for all types of page directories:

        #define PGDIR_SHIFT     39
        #define PUD_SHIFT       30
        #define PMD_SHIFT       21

    Identity mapping setup


    After this we put the address of the first level3_kernel_pgt into rdx with the _KERNPG_TABLE access rights (see above) and fill early_level4_pgt with the two level3_kernel_pgt entries.

    After this we add 4096 (the size of early_level4_pgt) to rdx (it now contains the address of the first entry of level3_kernel_pgt) and put rdi (it contains the physical address of _text) into rax. After this we write the addresses of the two page upper directory entries to level3_kernel_pgt:

        addq    $4096, %rdx
        movq    %rdi, %rax
        shrq    $PUD_SHIFT, %rax
        andl    $(PTRS_PER_PUD-1), %eax
        movq    %rdx, 4096(%rbx,%rax,8)
        incl    %eax
        andl    $(PTRS_PER_PUD-1), %eax
        movq    %rdx, 4096(%rbx,%rax,8)

    In the next step we write the addresses of the page middle directory entries to level2_kernel_pgt, and the last step is correcting the kernel text+data virtual addresses:

            leaq    level2_kernel_pgt(%rip), %rdi
            leaq    4096(%rdi), %r8
        1:  testq   $1, 0(%rdi)
            jz      2f
            addq    %rbp, 0(%rdi)
        2:  addq    $8, %rdi
            cmp     %r8, %rdi
            jne     1b

    Here we put the address of level2_kernel_pgt into rdi and the address of the end of the table (4096 bytes further) into the r8 register. Next we check the present bit of the current level2_kernel_pgt entry: if it is set we add rbp to the entry, and if it is zero we leave the entry alone. In both cases we move to the next entry by adding 8 bytes to rdi. After this we compare rdi with r8 and either go back to label 1 or move forward.

    In the next step we correct the phys_base physical address with rbp, put the physical address of early_level4_pgt into rax and jump to label 1:

        addq    %rbp, phys_base(%rip)
        movq    $(early_level4_pgt - __START_KERNEL_map), %rax
        jmp     1f

    where phys_base matches the first entry of level2_kernel_pgt, which is the 512 MB kernel mapping.

    After we have jumped to label 1 we enable PAE and PGE (Page Global Enable), add the value of phys_base (see above) to the rax register and fill the cr3 register with it:

        1:
            movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
            movq    %rcx, %cr4

            addq    phys_base(%rip), %rax
            movq    %rax, %cr3

    Last preparations


    In the next step we check that the CPU supports the NX bit with:

        movl    $0x80000001, %eax
        cpuid
        movl    %edx, %edi

    We put the 0x80000001 value into eax and execute the cpuid instruction to get the extended processor info and feature bits. The result will be in the edx register, which we put into edi.

    Now we put 0xc0000080, or MSR_EFER, into ecx and call the rdmsr instruction to read the model specific register.

        movl    $MSR_EFER, %ecx
        rdmsr

    The result will be in edx:eax. The general view of EFER is the following:

        63                                                                             32
         --------------------------------------------------------------------------------
        |                                                                               |
        |                                Reserved MBZ                                   |
        |                                                                               |
         --------------------------------------------------------------------------------
        31                            16  15      14      13   12  11   10  9  8 7  1   0
         --------------------------------------------------------------------------------
        |                              | T |       |       |    |   |   |   |   |   |   |
        | Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
        |                              | E |       |       |    |   |   |   |   |   |   |
         --------------------------------------------------------------------------------

    We will not see all the fields in detail here, but we will learn about this and other MSRs in a special part about them. After reading EFER into edx:eax, we set the _EFER_SCE bit (bit zero, System Call Extensions) with the btsl instruction. Setting the SCE bit enables the SYSCALL and SYSRET instructions. In the next step we check the 20th bit in edi; remember that this register stores the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump straight to wrmsr and write only EFER_SCE to the model specific register:

            btsl    $_EFER_SCE, %eax
            btl     $20,%edi
            jnc     1f
            btsl    $_EFER_NX, %eax
            btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
        1:  wrmsr

    If the NX bit is supported we also set _EFER_NX and write it too with the wrmsr instruction.
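The decision taken by this sequence can be expressed as a small pure function. This is only a model of the bit manipulation (it does not touch any MSR); the bit numbers are taken from the EFER layout above and from the cpuid check, and the `SK_` names and the function are hypothetical.

```c
#include <stdint.h>

#define SK_EFER_SCE      0   /* System Call Extensions */
#define SK_EFER_NX      11   /* No-Execute Enable */
#define SK_CPUID_NX_BIT 20   /* NX feature bit in cpuid 0x80000001 edx */

/* Compute the new low half of EFER from the old one and the cpuid result. */
static uint32_t sk_efer_low(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << SK_EFER_SCE;              /* always enable SYSCALL/SYSRET */
    if (cpuid_edx & (1u << SK_CPUID_NX_BIT))    /* NX only if the CPU has it */
        efer_low |= 1u << SK_EFER_NX;
    return efer_low;
}
```

On a CPU that reports NX the result has bits 0 and 11 set; without NX only bit 0 is added and the other bits of the old value are preserved.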

    In the next step we need to update the Global Descriptor Table with the lgdt instruction:

        lgdt    early_gdt_descr(%rip)

    where the Global Descriptor Table is defined as:

        early_gdt_descr:
            .word   GDT_ENTRIES*8-1
        early_gdt_descr_base:
            .quad   INIT_PER_CPU_VAR(gdt_page)


    We need to reload the Global Descriptor Table because the kernel currently works at user space addresses, but soon it will work in its own space. Now let's look at the early_gdt_descr definition. The Global Descriptor Table contains 32 entries:

        #define GDT_ENTRIES 32

    for kernel code, data, thread local storage segments, etc. It's simple. Now let's look at early_gdt_descr_base. First of all, gdt_page is defined as:

        struct gdt_page {
            struct desc_struct gdt[GDT_ENTRIES];
        } __attribute__((aligned(PAGE_SIZE)));

    in arch/x86/include/asm/desc.h. It contains one field, gdt, which is an array of desc_struct structures, defined as:

        struct desc_struct {
            union {
                struct {
                    unsigned int a;
                    unsigned int b;
                };
                struct {
                    u16 limit0;
                    u16 base0;
                    unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                    unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
                };
            };
        } __attribute__((packed));

    and represents the GDT descriptor already familiar to us. Also note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes. This means that gdt occupies one page. Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro defined in arch/x86/include/asm/percpu.h which just concatenates init_per_cpu__ with the given parameter:

        #define INIT_PER_CPU_VAR(var) init_per_cpu__##var

    After this we have init_per_cpu__gdt_page. We can see in the linker script:

        #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
        INIT_PER_CPU(gdt_page);

    As we got init_per_cpu__gdt_page from INIT_PER_CPU_VAR, and the INIT_PER_CPU macro from the linker script is expanded, we get the offset from __per_cpu_load. After these calculations we have the correct base address of the new GDT.
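The token pasting done by INIT_PER_CPU_VAR can be demonstrated with an ordinary C program. The macro below has the same shape as the kernel's, but the `sk_` prefix, the variable and its value are made up for this sketch; in the kernel the pasted symbol is provided by the linker script rather than by a C definition.

```c
/* Same shape as INIT_PER_CPU_VAR: paste a fixed prefix onto the argument. */
#define SK_INIT_PER_CPU_VAR(var) sk_init_per_cpu__##var

/* Stand-in for the symbol that the linker script would define. */
static int sk_init_per_cpu__gdt_page = 7;

/* The macro expands to the identifier sk_init_per_cpu__gdt_page. */
static int sk_read_gdt_page(void)
{
    return SK_INIT_PER_CPU_VAR(gdt_page);
}
```

The preprocessor performs the concatenation before compilation, so `SK_INIT_PER_CPU_VAR(gdt_page)` and `sk_init_per_cpu__gdt_page` are the same name as far as the compiler is concerned.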

    Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU has its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type, for example there are no locks, because each CPU works with its own copy of the variable. So every core on a multiprocessor has its own GDT table, and every entry in the table represents a memory segment which can be accessed from the thread running on that core. You can read about per-CPU variables in detail in the Theory/per-cpu post.

    As we have loaded the new Global Descriptor Table, we reload the segments as we did every time:

        xorl    %eax,%eax
        movl    %eax,%ds
        movl    %eax,%ss
        movl    %eax,%es
        movl    %eax,%fs
        movl    %eax,%gs

    After all of these steps we set up the gs register so that it points to the irqstack (we will see information about it in the next parts):

        movl    $MSR_GS_BASE,%ecx
        movl    initial_gs(%rip),%eax
        movl    initial_gs+4(%rip),%edx
        wrmsr

    where MSR_GS_BASE is:

        #define MSR_GS_BASE     0xc0000101

    We need to put MSR_GS_BASE into the ecx register and, with the wrmsr instruction, write the data from eax and edx (which hold the low and high halves of initial_gs). We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs), and this part contains a descriptor which is mapped to model specific registers. So above we can see that 0xc0000101 is the gs.base MSR address.
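Since wrmsr takes its 64-bit operand split across edx:eax, the two movl instructions load the low and the high 32 bits of initial_gs separately. The split can be sketched as follows; the struct and function names are hypothetical and the sample value in the usage note is arbitrary.

```c
#include <stdint.h>

/* The two register halves that wrmsr consumes. */
struct sk_msr_halves {
    uint32_t eax;  /* low 32 bits  */
    uint32_t edx;  /* high 32 bits */
};

/* Split a 64-bit MSR value the way the two movl instructions do. */
static struct sk_msr_halves sk_split_msr_value(uint64_t value)
{
    struct sk_msr_halves h;

    h.eax = (uint32_t)value;
    h.edx = (uint32_t)(value >> 32);
    return h;
}

/* Inverse operation: what the CPU sees as the full MSR value. */
static uint64_t sk_join_msr_value(struct sk_msr_halves h)
{
    return ((uint64_t)h.edx << 32) | h.eax;
}
```

For a value such as 0xffff880012345678 the low half is 0x12345678 and the high half is 0xffff8800, and joining them gives the original value back.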

    In the next step we put the address of the real mode bootparam structure into rdi (remember that rsi has held a pointer to this structure from the start) and jump to the C code with:

        movq    initial_code(%rip),%rax
        pushq   $0
        pushq   $__KERNEL_CS
        pushq   %rax
        lretq

    Here we put the address of initial_code into rax and push a fake address, __KERNEL_CS and the address of initial_code onto the stack. After this we can see the lretq instruction, which means that the return address will be popped from the stack (it is now the address of initial_code) and we jump there. initial_code is defined in the same source file and looks like this:

            __REFDATA
            .balign 8
            GLOBAL(initial_code)
            .quad   x86_64_start_kernel
            ...
            ...
            ...

    As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:

        asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
            ...
            ...
            ...
        }

    It has one argument, real_mode_data (remember that we passed the address of the real mode data in the rdi register previously).

    This is the first C code in the kernel!

    We need to see the last preparations before we can see the "kernel entry point", the start_kernel function from init/main.c.

    First of all we can see some checks in the x86_64_start_kernel function:

    BUILD_BUG_ON(MODULES_VADDR__START_KERNEL));BUILD_BUG_ON(!(((MODULES_END-1)&PGDIR_MASK)==(__START_KERNEL&PGDIR_MASK)));BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses)

    After this we clear the .bss section from __bss_start to __bss_stop with clear_bss, and the next step will be the setup of the early IDT handlers, but that is a big topic, so we will see it in the next part.

    Conclusion

    This is the end of the first part about Linux kernel initialization.

    If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

    In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and much more.

    Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

    Links

        Model Specific Register
        Paging
        Previous part - Kernel decompression
        NX
        ASLR


    In the previous part we stopped before the setting of the early interrupt handlers. We continue in this part and will learn more about interrupt and exception handling.

    Remember that we stopped before the following loop:

    for(i=0;i0xFF);\_set_gate(n,GATE_INTERRUPT,(void*)addr,0,0,\__KERNEL_CS);\_trace_set_gate(n,GATE_INTERRUPT,(void*)trace_##addr,\0,0,__KERNEL_CS);\}while(0)

    First of all it checks with the BUG_ON macro that the passed interrupt number is not greater than 255. We need this check because we can have only 256 interrupts. After this it calls _set_gate, which writes the address of an interrupt gate to the IDT:

        static inline void _set_gate(int gate, unsigned type, void *addr,
                                     unsigned dpl, unsigned ist, unsigned seg)
        {
            gate_desc s;
            pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
            write_idt_entry(idt_table, gate, &s);
            write_trace_idt_entry(gate, &s);
        }

    At the start of the _set_gate function we can see a call to the pack_gate function, which fills the gate_desc structure with the given values:

        static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                                     unsigned dpl, unsigned ist, unsigned seg)
        {
            gate->offset_low    = PTR_LOW(func);
            gate->segment       = __KERNEL_CS;
            gate->ist           = ist;
            gate->p             = 1;
            gate->dpl           = dpl;
            gate->zero0         = 0;
            gate->zero1         = 0;
            gate->type          = type;
            gate->offset_middle = PTR_MIDDLE(func);
            gate->offset_high   = PTR_HIGH(func);
        }

    As mentioned above, we fill the gate descriptor in this function. We fill the three parts of the interrupt handler address with the address which we got in the main loop (the address of the interrupt handler entry point). We use the following three macros to split the address into three parts:

        #define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
        #define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
        #define PTR_HIGH(x) ((unsigned long long)(x) >> 32)

    With the first macro, PTR_LOW, we get the first 2 bytes of the address, with the second, PTR_MIDDLE, we get the second 2 bytes of the address, and with the third, PTR_HIGH, we get the last 4 bytes of the address. Next we set up the segment selector for the interrupt handler; it will be our kernel code segment, __KERNEL_CS. In the next step we fill the Interrupt Stack Table and the Descriptor Privilege Level (highest privilege level) with zeros. And we set the GATE_INTERRUPT type in the end.
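The effect of the three macros can be checked with a short round-trip sketch: split a 64-bit handler address into the three gate fields and put it back together. The struct is a reduced stand-in for the offset fields of gate_desc; its name, the function names and the sample address in the usage note are assumptions for illustration.

```c
#include <stdint.h>

/* Only the three offset fields of an IDT gate, for illustration. */
struct sk_gate_offsets {
    uint16_t offset_low;     /* bits 0..15  */
    uint16_t offset_middle;  /* bits 16..31 */
    uint32_t offset_high;    /* bits 32..63 */
};

/* Split a handler address the way PTR_LOW/PTR_MIDDLE/PTR_HIGH do. */
static struct sk_gate_offsets sk_split_handler(uint64_t func)
{
    struct sk_gate_offsets g;

    g.offset_low    = (uint16_t)(func & 0xFFFF);
    g.offset_middle = (uint16_t)((func >> 16) & 0xFFFF);
    g.offset_high   = (uint32_t)(func >> 32);
    return g;
}

/* Reassemble the address, as the CPU does when it dispatches the gate. */
static uint64_t sk_join_handler(const struct sk_gate_offsets *g)
{
    return ((uint64_t)g->offset_high << 32) |
           ((uint64_t)g->offset_middle << 16) |
           g->offset_low;
}
```

For an address such as 0xffffffff81234567 the three fields are 0x4567, 0x8123 and 0xffffffff, and joining them returns the original address.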

    Now we have a filled IDT entry and we can call the native_write_idt_entry function, which just copies the filled IDT entry to the IDT:

        static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
        {
            memcpy(&idt[entry], gate, sizeof(*gate));
        }


    After the main loop finishes, we have a filled idt_table array of gate_desc structures and we can load the IDT with:

        load_idt((const struct desc_ptr *)&idt_descr);

    where idt_descr is:

        struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };

    and load_idt just executes the lidt instruction:

        asm volatile("lidt %0"::"m" (*dtr));

    You may note that there are calls to the _trace_* functions in _set_gate and other functions. These functions fill IDT gates in the same manner as _set_gate, with one difference: they use the trace_idt_table Interrupt Descriptor Table instead of idt_table, for tracepoints (we will cover this theme in another part).

    Okay, now we have filled and loaded the Interrupt Descriptor Table, and we know how the CPU acts during an interrupt. So now it is time to deal with the interrupt handlers.

    As you can read above, we filled the IDT with the addresses of the early_idt_handlers. We can find them in arch/x86/kernel/head_64.S:

            .globl early_idt_handlers
        early_idt_handlers:
            i = 0
            .rept NUM_EXCEPTION_VECTORS
            .if (EXCEPTION_ERRCODE_MASK >> i) & 1
            ASM_NOP2
            .else
            pushq $0
            .endif
            pushq $i
            jmp early_idt_handler
            i = i + 1
            .endr

    Here we can see the generation of interrupt handlers for the first 32 exceptions. We check whether the exception has an error code: if it does, we do nothing; if the exception does not return an error code, we push zero onto the stack. We do this so that the stack layout is uniform. After that we push the exception number onto the stack and jump to early_idt_handler, which is the generic interrupt handler for now. As I wrote above, the CPU pushes the flags register, CS and RIP onto the stack. So before early_idt_handler is executed, the stack will contain the following data:

        |--------------------|
        | %rflags            |
        | %cs                |
        | %rip               |
        | rsp --> error code |
        |--------------------|
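The `.if (EXCEPTION_ERRCODE_MASK >> i) & 1` test selects, per vector, whether the stub has to push a dummy zero. The sketch below models that test in C. The mask value 0x00027d00 is an assumption of this example (it marks vectors 8, 10 to 14 and 17, the exceptions for which x86 CPUs push an error code); it is not quoted anywhere in this text, and the `SK_` names are hypothetical.

```c
#include <stdbool.h>

#define SK_NUM_EXCEPTION_VECTORS 32
/* Assumed mask: bits 8, 10, 11, 12, 13, 14 and 17 are set. */
#define SK_EXCEPTION_ERRCODE_MASK 0x00027d00u

/* True when the CPU itself pushes an error code for this vector. */
static bool sk_cpu_pushes_error_code(unsigned vector)
{
    return vector < SK_NUM_EXCEPTION_VECTORS &&
           ((SK_EXCEPTION_ERRCODE_MASK >> vector) & 1u) != 0;
}

/* True when the generated stub must push a dummy zero instead. */
static bool sk_stub_pushes_zero(unsigned vector)
{
    return vector < SK_NUM_EXCEPTION_VECTORS &&
           !sk_cpu_pushes_error_code(vector);
}
```

Under this assumption the page fault vector (14) arrives with a real error code, while the breakpoint vector (3) gets the dummy zero, so both reach early_idt_handler with the same stack layout.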

    Early interrupts handlers

    Now let's look at the early_idt_handler implementation. It is located in the same arch/x86/kernel/head_64.S. First of all we can see a check for NMI; we don't need to handle it, so we just ignore it in early_idt_handler:

        cmpl $2,(%rsp)
        je is_nmi

    where is_nmi is:

        is_nmi:
            addq $16,%rsp
            INTERRUPT_RETURN

    Here we drop the error code and the vector number from the stack and call INTERRUPT_RETURN, which is just iretq. Once we have checked the vector number and it is not NMI, we check early_recursion_flag to prevent recursion in early_idt_handler, and if everything is correct we save the general registers on the stack:

    pushq%raxpushq%rcxpushq%rdxpushq%rsipushq%rdipushq%r8pushq%r9pushq%r10pushq%r11

    weneedtodoittopreventwrongvaluesinitwhenwereturnfromtheinterrupthandler.Afterthiswechecksegmentselectorinthestack:

    cmpl$__KERNEL_CS,96(%rsp)jne11f

    itmustbeequaltothekernelcodesegmentandifitisnotwejumponlabel11whichprintsPANICmessageandmakesstackdump.

    Aftercodesegmentwaschecked,wecheckthevectornumber,andifitis#PF,weputvaluefromthecr2totherdiregisterandcallearly_make_pgtable(wellseeitsoon):

    cmpl$14,72(%rsp)jnz10fGET_CR2_INTO(%rdi)callearly_make_pgtableandl%eax,%eaxjz20f

    If early_make_pgtable returns zero, the jz 20f branch above is taken and we restore the general purpose registers from the stack:

        popq %r11
        popq %r10
        popq %r9
        popq %r8
        popq %rdi
        popq %rsi
        popq %rdx
        popq %rcx
        popq %rax

    and exit from the handler with iretq.

    This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so for now it handles only the page fault. We will see handlers for the other interrupts later, but now let's look at the page fault handler.
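    The per-vector stubs described earlier decide whether to push a dummy error code by testing one bit of EXCEPTION_ERRCODE_MASK. The sketch below models that test in C. The mask value 0x00027d00 is an assumption (it is not quoted in this text), and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: bit i is set when exception vector i pushes an error code. */
#define DEMO_ERRCODE_MASK 0x00027d00u

/* True if the CPU itself pushes an error code for this vector. */
static inline bool demo_cpu_pushes_error_code(unsigned vector)
{
    return (DEMO_ERRCODE_MASK >> vector) & 1u;
}

/*
 * Number of 8-byte words the stub pushes so that every vector ends up
 * with the same layout: a dummy error code (only when the CPU pushed
 * none) plus the vector number.
 */
static inline unsigned demo_stub_pushes(unsigned vector)
{
    return demo_cpu_pushes_error_code(vector) ? 1u : 2u;
}
```

    Under this assumed mask the page fault (vector 14) carries a real error code, while NMI (vector 2) gets the dummy zero, so in both cases the handler finds the vector number and an error code in the same two slots.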

    Page fault handling

    In the previous paragraph we saw the first early interrupt handler, which checks whether the interrupt number is a page fault and, if it is, calls early_make_pgtable to build new page tables. We need a #PF handler at this step because there are plans to add the ability to load the kernel above 4G and to access the boot_params structure above 4G.

    The implementation of early_make_pgtable is in arch/x86/kernel/head64.c. It takes one parameter: the address from the cr2 register, which caused the page fault. Let's look at it:

        int __init early_make_pgtable(unsigned long address)
        {
            unsigned long physaddr = address - __PAGE_OFFSET;
            unsigned long i;
            pgdval_t pgd, *pgd_p;
            pudval_t pud, *pud_p;
            pmdval_t pmd, *pmd_p;
            ...
            ...
            ...
        }

    It starts with the definition of some variables which have *val_t types. All of these types are just:

        typedef unsigned long pgdval_t;

    We will also work with the *_t (not val) types, for example pgd_t and so on. All of these types are defined in arch/x86/include/asm/pgtable_types.h and are structures like this:

        typedef struct { pgdval_t pgd; } pgd_t;

    For example,

        extern pgd_t early_level4_pgt[PTRS_PER_PGD];

    Here early_level4_pgt is the early top-level page table directory, which consists of an array of pgd_t entries, and pgd points to the lower-level page entries.

    After checking that the address is valid, we get the address of the Page Global Directory entry which covers the #PF address and put its value into the pgd variable:

        pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
        pgd = *pgd_p;

    In the next step we check pgd. If it contains a valid page global directory entry, we take the physical address stored in that entry, convert it to a virtual address and put it into pud_p with:

        pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);

    where PTE_PFN_MASK is a macro:

        #define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)

    which expands to:

        (~(PAGE_SIZE-1)) & ((1
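    The lookup early_level4_pgt[pgd_index(address)] shown earlier selects one entry of the top-level table from the bits of the faulting address. Here is a minimal sketch of that index arithmetic, assuming 4-level x86_64 paging in which every level has 512 entries (9 address bits) and the top level starts at bit 39. The demo_ names and shift constants are written out for illustration and are not copied from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PGDIR_SHIFT 39   /* assumed: top-level index starts at bit 39 */
#define DEMO_PUD_SHIFT   30   /* assumed: next level starts at bit 30 */
#define DEMO_PMD_SHIFT   21   /* assumed: next level starts at bit 21 */
#define DEMO_PTRS        512  /* assumed: entries per table */

/* Index into the top-level table (the early_level4_pgt role). */
static inline unsigned demo_pgd_index(uint64_t address)
{
    return (unsigned)((address >> DEMO_PGDIR_SHIFT) & (DEMO_PTRS - 1));
}

/* Index into the page upper directory. */
static inline unsigned demo_pud_index(uint64_t address)
{
    return (unsigned)((address >> DEMO_PUD_SHIFT) & (DEMO_PTRS - 1));
}

/* Index into the page middle directory. */
static inline unsigned demo_pmd_index(uint64_t address)
{
    return (unsigned)((address >> DEMO_PMD_SHIFT) & (DEMO_PTRS - 1));
}
```

    With these assumptions an address in the kernel high mapping, such as 0xffffffff80000000, lands in the last top-level entry (index 511), which matches the init_level4_pgt[511] entry used later in this chapter.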

    This is the end of the second part about Linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all the steps before the kernel entry point, the start_kernel function.

    Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

    Links

    - GNU assembly .rept
    - APIC
    - NMI
    - Previous part


    Kernel initialization. Part 3.

    Last preparations before the kernel entry point

    This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and in the current part we will continue to dive into the Linux kernel initialization process. Our next point is the 'kernel entry point', the start_kernel function from the init/main.c source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we see the call of the start_kernel function, we must do some preparations. So let's continue.

    boot_params again

    In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the IDTR register. The next step after this is a call of the copy_bootdata function:

        copy_bootdata(__va(real_mode_data));

    This function takes one argument: the virtual address of real_mode_data. Remember that we passed the address of the boot_params structure from arch/x86/include/uapi/asm/bootparam.h to the x86_64_start_kernel function as the first argument in arch/x86/kernel/head_64.S:

        /* rsi is pointer to real mode structure with interesting info.
           pass it to C */
        movq %rsi, %rdi

    Now let's look at the __va macro. This macro is defined in arch/x86/include/asm/page.h:

        #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

    where PAGE_OFFSET is __PAGE_OFFSET, which is 0xffff880000000000, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the boot_params structure and pass it to the copy_bootdata function, where we copy real_mode_data to boot_params, which is declared in arch/x86/include/asm/setup.h:

        extern struct boot_params boot_params;

    Let's look at the copy_bootdata implementation:

        static void __init copy_bootdata(char *real_mode_data)
        {
            char *command_line;
            unsigned long cmd_line_ptr;

            memcpy(&boot_params, real_mode_data, sizeof boot_params);
            sanitize_boot_params(&boot_params);
            cmd_line_ptr = get_cmd_line_ptr();
            if (cmd_line_ptr) {
                command_line = __va(cmd_line_ptr);
                memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
            }
        }

    First of all, note that this function is declared with the __init prefix. It means that this function is used only during initialization and that the memory it occupies will be freed afterwards.

    We can see the declaration of two variables for the kernel command line and the copying of real_mode_data to boot_params with the memcpy function. Next comes the call of the sanitize_boot_params function, which resets some fields of the boot_params structure, such as ext_ramdisk_image, to cope with boot loaders which fail to initialize unknown fields in boot_params to zero. After this we get the address of the command line with a call of the get_cmd_line_ptr function:

        unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
        cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr
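    The two address computations used by copy_bootdata can be tried out in user space. This is only a sketch: demo_va mirrors the __va arithmetic with the PAGE_OFFSET value quoted above, and demo_cmd_line_ptr assumes that ext_cmd_line_ptr supplies the upper 32 bits of the command line address (the listing above is cut off before that step). All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* PAGE_OFFSET value quoted in the text for this kernel version. */
#define DEMO_PAGE_OFFSET 0xffff880000000000ULL

/* Direct-mapping translation: virtual address = physical + PAGE_OFFSET. */
static inline uint64_t demo_va(uint64_t phys)
{
    return phys + DEMO_PAGE_OFFSET;
}

/*
 * Assumption: the 32-bit header field holds the low half of the address
 * and the extended field holds the high half.
 */
static inline uint64_t demo_cmd_line_ptr(uint32_t low, uint32_t ext_high)
{
    uint64_t ptr = low;

    ptr |= (uint64_t)ext_high << 32;
    return ptr;
}
```

    For a boot loader that places the command line below 4G the extended field is zero, and the result is simply the 32-bit pointer from the header.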

        .p2align 4
        .Lloop:
            decl %ecx
        #define PUT(x) movq %rax,x*8(%rdi)
            movq %rax,(%rdi)
            PUT(1)
            PUT(2)
            PUT(3)
            PUT(4)
            PUT(5)
            PUT(6)
            PUT(7)
            leaq 64(%rdi),%rdi
            jnz .Lloop
            nop
            ret
            CFI_ENDPROC
        .Lclear_page_end:
        ENDPROC(clear_page)

    As you can understand from the function name, it clears the page tables, that is, fills them with zeros. First of all note that this function starts with CFI_STARTPROC and ends with CFI_ENDPROC, which expand to GNU assembly directives:

        #define CFI_STARTPROC .cfi_startproc
        #define CFI_ENDPROC .cfi_endproc

    and are used for debugging. After the CFI_STARTPROC macro we zero out the eax register and put 64 into ecx (it will be a counter). Next we can see the loop which starts with the .Lloop label and begins with the ecx decrement. After that we store the zero from the rax register at the address in rdi, which now contains the base address of init_level4_pgt, and repeat the same store seven more times, each time at an offset 8 bytes larger. After this the first 64 bytes of init_level4_pgt are filled with zeros. In the next step we advance rdi by 64 bytes and repeat all operations while ecx is not zero. In the end init_level4_pgt is filled with zeros.
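    The loop structure described above can be expressed in C as follows. This is a sketch of the same idea rather than the kernel's code: 64 iterations, each storing eight 8-byte zeros, cover one 4096-byte page. The name demo_clear_page is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096

/* Zero one page: 64 iterations, each writing 8 * 8 = 64 bytes. */
static inline void demo_clear_page(uint64_t *page)
{
    unsigned count = DEMO_PAGE_SIZE / 64;   /* plays the role of %ecx */

    while (count--) {
        /* the movq to (%rdi) followed by PUT(1) ... PUT(7) */
        for (int i = 0; i < 8; i++)
            page[i] = 0;
        page += 8;                          /* leaq 64(%rdi),%rdi */
    }
}
```

    Unrolling eight stores per iteration, as the assembly does by hand, keeps the loop overhead (decrement, pointer update, branch) to one per 64 bytes.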

    Now that init_level4_pgt is filled with zeros, we set the last init_level4_pgt entry to the kernel high mapping with:

        init_level4_pgt[511] = early_level4_pgt[511];

    Remember that we dropped all early_level4_pgt entries in the reset_early_page_table function and kept only the kernel high mapping there.

    The last step in the x86_64_start_kernel function is the call of:

        x86_64_start_reservations(real_mode_data);

    with real_mode_data as the argument. The x86_64_start_reservations function is defined in the same source code file as the x86_64_start_kernel function and looks like this:

        void __init x86_64_start_reservations(char *real_mode_data)
        {
            if (!boot_params.hdr.version)
                copy_bootdata(__va(real_mode_data));

            reserve_ebda_region();

            start_kernel();
        }


    You can see that it is the last function before we reach the kernel entry point, the start_kernel function. Let's look at what it does and how it works.

    First of all we can see in the x86_64_start_reservations function a check for boot_params.hdr.version:

        if (!boot_params.hdr.version)
            copy_bootdata(__va(real_mode_data));

    If it is zero, we call the copy_bootdata function again with the virtual address of real_mode_data (read about its implementation above).

    In the next step we can see the call of the reserve_ebda_region function, which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the EBDA, or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.

    Let's look at the reserve_ebda_region function. It starts by checking whether paravirtualization is enabled:

        if (paravirt_enabled())
            return;

    We exit from the reserve_ebda_region function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of low memory:

        lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
        lowmem

        }

    only with one difference: we pass an argument of type phys_addr_t, which depends on CONFIG_PHYS_ADDR_T_64BIT:

        #ifdef CONFIG_PHYS_ADDR_T_64BIT
        typedef u64 phys_addr_t;
        #else
        typedef u32 phys_addr_t;
        #endif

    On x86_64 this configuration option is always enabled. After we get the virtual address of the segment which stores the base address of the extended BIOS data area, we shift it left by 4 and return. After this the ebda_addr variable contains the base address of the extended BIOS data area.
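    The "shift it left by 4" step is the usual real-mode conversion from a 16-bit segment to a physical address (segment * 16). A minimal sketch follows, assuming the segment word lives at offset 0x40E of low memory; that offset is an assumption here, since the corresponding listing is missing from this text. The buffer and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed location of the EBDA segment word in the BIOS data area. */
#define DEMO_EBDA_SEGMENT_PTR 0x40E

/*
 * Read the little-endian 16-bit segment from a copy of low memory and
 * convert it to a physical address: address = segment << 4.
 */
static inline uint32_t demo_ebda_address(const uint8_t *low_memory)
{
    uint16_t segment = (uint16_t)(low_memory[DEMO_EBDA_SEGMENT_PTR] |
                                  (low_memory[DEMO_EBDA_SEGMENT_PTR + 1] << 8));

    return (uint32_t)segment << 4;
}
```

    For example, a stored segment of 0x9FC0 gives the address 0x9FC00, which is just below the 640K boundary where the EBDA is typically found.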

    In the next step we check that the addresses of the extended BIOS data area and of low memory are not less than the INSANE_CUTOFF macro:

        if (ebda_addr ...

        ... regions[0].size == 0) {
            WARN_ON(type->cnt != 1 || type->total_size);
            type->regions[0].base = base;
            type->regions[0].size = size;
            type->regions[0].flags = flags;
            memblock_set_region_node(&type->regions[0], nid);
            type->total_size = size;
            return 0;
        }

    After we have filled our region, we can see the call of the memblock_set_region_node function with two parameters:

    - address of the filled memory region;
    - NUMA node id.

    where our regions are represented by the memblock_region structure:

        struct memblock_region {
            phys_addr_t base;
            phys_addr_t size;
            unsigned long flags;
        #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
            int nid;
        #endif
        };

    The NUMA node id depends on the MAX_NUMNODES macro, which is defined in include/linux/numa.h:

        #define MAX_NUMNODES (1

    The memblock_set_region_node function just fills the nid field of memblock_region with the given value:

        static inline void memblock_set_region_node(struct memblock_region *r, int nid)
        {
            r->nid = nid;
        }

    After this we have the first reserved memblock, for the extended BIOS data area, in the .meminit.data section. The reserve_ebda_region function has finished its work at this step and we can go back to arch/x86/kernel/head64.c.
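    The "first region" case shown above can be modelled in a few lines of user-space C. This is a simplified sketch and not the kernel's memblock code: the demo_ types are hypothetical, the region array has a fixed small size, and every case other than an empty first slot is simply rejected.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_phys_addr_t;

struct demo_region {
    demo_phys_addr_t base;
    demo_phys_addr_t size;
    unsigned long flags;
    int nid;                      /* NUMA node id */
};

struct demo_type {
    unsigned long cnt;            /* an empty type still has cnt == 1 */
    demo_phys_addr_t total_size;
    struct demo_region regions[4];
};

/*
 * Store the very first region. Returns 0 on success, or -1 when the
 * type already holds a region (the real code would merge or insert).
 */
static inline int demo_add_first_region(struct demo_type *type,
                                        demo_phys_addr_t base,
                                        demo_phys_addr_t size,
                                        int nid, unsigned long flags)
{
    if (type->regions[0].size != 0)
        return -1;

    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    type->regions[0].nid = nid;
    type->total_size = size;
    return 0;
}
```

    The empty state is encoded as a single zero-sized region, which is why the kernel snippet warns when cnt is not 1 or total_size is not zero at that point.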

    We have finished all preparations before the kernel entry point! The last step in the x86_64_start_reservations function is the call of the:

        start_kernel()

    function from the init/main.c file.

    That's all for this part.

    Conclusion

    This is the end of the third part about Linux kernel internals. In the next part we will see the first initialization steps in the kernel entry point, the start_kernel function. It will be the first step before we see the launch of the first init process.

    If you have any questions or suggestions, write me a comment or ping me on twitter.

    Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

    Links

    - BIOS data area
    - What is in the extended BIOS data area on a PC?
    - Previous part


    Kernel initialization. Part 4.

    Kernel entry point

    If you have read the previous part, Last preparations before the kernel entry point, you can remember that we finished all pre-initialization stuff and stopped right before the call to the start_kernel function from init/main.c. The start_kernel function is the entry of the generic and architecture-independent kernel code, although we will return to the arch/ folder many times. If you look inside the start_kernel function, you will see that it is very big. At the moment it contains about 86 function calls. Yes, it's very big, and of course this part will not cover all the processes that occur in this function. In the current part we will only start on it. This part and all the following ones in the Kernel initialization process chapter will cover it.

    The main purpose of start_kernel is to finish the kernel initialization process and launch the first init process. Before the first process is started, start_kernel must do many things, such as: enable the lock validator, initialize the processor id, enable the early cgroups subsystem, set up per-cpu areas, initialize different caches in vfs, initialize the memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first init process in the last part of this chapter. So much kernel code awaits us, let's start.

    NOTE: All parts of this big chapter, Linux Kernel initialization process, will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.

    A little about function attributes

    As I wrote above, the start_kernel function is defined in init/main.c. This function is defined with the __init attribute and, as you may already know from other parts, all functions which are defined with this attribute are needed only during kernel initialization.

        #define __init      __section(.init.text) __cold notrace

    After the initialization process is finished, the kernel releases these sections with a call to the free_initmem function. Note also that __init is defined with two attributes: __cold and notrace. The purpose of the first, cold, attribute is to mark the function as rarely used, so the compiler must optimize it for size. The second, notrace, is defined as:

        #define notrace __attribute__((no_instrument_function))

    where no_instrument_function tells the compiler not to generate profiling function calls.

    In the definition of the start_kernel function you can also see the __visible attribute, which expands to:

        #define __visible __attribute__((externally_visible))

    where externally_visible tells the compiler that something uses this function or variable, to prevent it from being marked as unused. You can find the definition of this and other macro attributes in include/linux/init.h.

    First steps in the start_kernel

    At the beginning of start_kernel you can see the definition of these two variables:

        char *command_line;
        char *after_dashes;

    The first is a pointer to the kernel command line, and the second will contain the result of the parse_args function, which parses an input string with parameters in the form name=value, looking for specific keywords and invoking the right handlers. We will not go into the details of these two variables at this time, but will see them in the next parts. In the next step we can see a call to the:

        lockdep_init();

    function. lockdep_init initializes the lock validator. Its implementation is pretty simple: it just initializes two list_head hashes and sets the lockdep_initialized global variable to 1. The lock validator detects circular lock dependencies and is called when any spinlock or mutex is acquired.

    The next function is set_task_stack_end_magic, which takes the address of init_task and sets STACK_END_MAGIC (0x57AC6E9D) as a canary for it. init_task represents the initial task structure:

        struct task_struct init_task = INIT_TASK(init_task);

    where task_struct stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in include/linux/sched.h. At this moment task_struct contains more than 100 fields! Although you will not see an explanation of task_struct in this book, we will use it very often, since it is the fundamental structure which describes a process in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.

    You can see that init_task is defined and initialized by the INIT_TASK macro. This macro is from include/linux/init_task.h and it just fills init_task with the values for the first process. For example it sets:

    - the init process state to zero, or runnable. A runnable process is one which is waiting only for a CPU to run on;
    - the init process flags: PF_KTHREAD, which means kernel thread;
    - a list of runnable tasks;
    - the process address space;
    - the init process stack to &init_thread_info, which is init_thread_union.thread_info. init_thread_union has the type thread_union, which contains thread_info and the process stack:

        union thread_union {
            struct thread_info thread_info;
            unsigned long stack[THREAD_SIZE/sizeof(long)];
        };

    Every process has its own stack, and it is 16 kilobytes, or 4 page frames, on x86_64. We can note that it is defined as an array of unsigned long. The other field of thread_union is thread_info, defined as:

        struct thread_info {
            struct task_struct      *task;
            struct exec_domain      *exec_domain;
            __u32                   flags;
            __u32                   status;
            __u32                   cpu;
            int                     saved_preempt_count;
            mm_segment_t            addr_limit;
            struct restart_block    restart_block;
            void __user             *sysenter_return;
            unsigned int            sig_on_uaccess_error:1;
            unsigned int            uaccess_err:1;
        };

    and occupies 52 bytes. The thread_info structure contains architecture-specific information on the thread. We know that on x86_64 the stack grows down, and thread_union.thread_info is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and thread_info is at the bottom. The remaining space for the stack will be 16 kilobytes - 52 bytes = 16332 bytes. Note that thread_union is represented as a union and not a structure, which means that thread_info and the stack share the memory space.

    Schematically it can be represented as follows:

        +-----------------------+
        |                       |
        |                       |
        |        stack          |
        |                       |
        |_______________________|
        |          |            |
        |          |            |
        |          |            |
        |__________↓____________|             +--------------------+
        |                       |             |                    |
        |      thread_info      |<----------->|     task_struct    |
        |                       |             |                    |
        +-----------------------+             +--------------------+

    http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct

    So the INIT_TASK macro fills these task_struct fields and many many more. As I already wrote above, I will not describe all the fields and values in the INIT_TASK macro, but we will see them soon.
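    The effect of using a union can be checked with a small user-space model. This is a sketch under stated assumptions: demo_thread_info is a hypothetical, much smaller stand-in for the real thread_info, and DEMO_THREAD_SIZE is set to the 16 kilobytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_THREAD_SIZE (16 * 1024)

/* Hypothetical stand-in for the real, larger thread_info. */
struct demo_thread_info {
    void *task;
    unsigned int flags;
    unsigned int cpu;
};

/* Both members start at the same address and share the same storage. */
union demo_thread_union {
    struct demo_thread_info thread_info;
    unsigned long stack[DEMO_THREAD_SIZE / sizeof(long)];
};

/* Bytes left for the stack once thread_info has claimed the low end. */
static inline size_t demo_usable_stack(void)
{
    return sizeof(union demo_thread_union) - sizeof(struct demo_thread_info);
}
```

    Because the members overlap, the whole object is still exactly DEMO_THREAD_SIZE bytes. A struct with the same two members would instead be larger than the stack array by the size of thread_info.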

    Now let's go back to the set_task_stack_end_magic function. This function is defined in kernel/fork.c and sets a canary at the end of the init process stack to detect stack overflow.

        void set_task_stack_end_magic(struct task_struct *tsk)
        {
            unsigned long *stackend;
            stackend = end_of_stack(tsk);
            *stackend = STACK_END_MAGIC; /* for overflow detection */
        }

    Its implementation is simple. set_task_stack_end_magic gets the end of the stack for the given task_struct with the end_of_stack function. The end of a process stack depends on the CONFIG_STACK_GROWSUP configuration option. As we have learned, on the x86_64 architecture the stack grows down. So the end of the process stack will be:

        (unsigned long *)(task_thread_info(p) + 1);

    where task_thread_info just returns the stack which we filled with the INIT_TASK macro:

        #define task_thread_info(task) ((struct thread_info *)(task)->stack)


    As we have got the end of the init process stack, we write STACK_END_MAGIC there. After the canary is set, we can check it like this:

        if (*end_of_stack(task) != STACK_END_MAGIC) {
            //
            // handle stack overflow here
            //
        }
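    Setting and checking the canary can be tried out in user space as well. The following is a sketch, not the kernel's code: demo_ti and demo_stack_union are hypothetical stand-ins for thread_info and thread_union, and the end of the stack is taken to be the first word above thread_info, as in the expression task_thread_info(p) + 1 above.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_STACK_END_MAGIC 0x57AC6E9DUL
#define DEMO_STACK_BYTES     (16 * 1024)

/* Hypothetical stand-in for thread_info. */
struct demo_ti {
    void *task;
    unsigned int flags;
};

union demo_stack_union {
    struct demo_ti thread_info;
    unsigned long stack[DEMO_STACK_BYTES / sizeof(long)];
};

/* First word above thread_info: the last word a downward-growing stack reaches. */
static inline unsigned long *demo_end_of_stack(union demo_stack_union *u)
{
    return (unsigned long *)(&u->thread_info + 1);
}

static inline void demo_set_stack_end_magic(union demo_stack_union *u)
{
    *demo_end_of_stack(u) = DEMO_STACK_END_MAGIC;
}

/* True once something has overwritten the canary word. */
static inline bool demo_stack_overflowed(union demo_stack_union *u)
{
    return *demo_end_of_stack(u) != DEMO_STACK_END_MAGIC;
}
```

    The check only detects an overflow after the fact: a stack that has grown down far enough to touch the canary word has already come within one word of thread_info.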

    The next function after set_task_stack_end_magic is smp_setup_processor_id. This function has an empty body for x86_64:

        void __init __weak smp_setup_processor_id(void)
        {
        }

    as it is not implemented for all architectures, only for some such as s390 and arm64.

    The next function in start_kernel is debug_objects_early_init. The implementation of this function is almost the same as lockdep_init, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.

    After the debug_objects_early_init function we can see the call of the boot_init_stack_canary function, which fills the stack_canary field of task_struct with the canary value for the -fstack-protector gcc feature. This function depends on the CONFIG_CC_STACKPROTECTOR configuration option. If this option is disabled, boot_init_stack_canary does nothing; otherwise it generates a random number based on the random pool and the TSC:

        get_random_bytes(&canary, sizeof(canary));
        tsc = __native_read_tsc();
        canary += tsc + (tsc ...
        ... stack_canary = canary;

    and writes this value to the top of the IRQ stack with:

        this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write

    Again, we will not dive into details here; we will cover it in the part about IRQs. As the canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for the current CPU) with the local_irq_disable macro, which expands to the call of the arch_local_irq_disable function from arch/x86/include/asm/irqflags.h:

        static inline notrace void arch_local_irq_disable(void)
        {
            native_irq_disable();
        }

    where native_irq_disable is the cli instruction on x86_64. As interrupts are disabled, we can register the current CPU with the given ID in the CPU bitmap.


    The first processor activation

    The current function from start_kernel is boot_cpu_init. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to:

        int cpu = smp_processor_id();

    For now it is just zero. If the CONFIG_DEBUG_PREEMPT configuration option is disabled, smp_processor_id just expands to the call of raw_smp_processor_id, which expands to:

        #define raw_smp_processor_id() (this_cpu_read(cpu_number))

    this_cpu_read, like many other functions of this kind (this_cpu_write, this_cpu_add and so on), is defined in include/linux/percpu-defs.h and represents a this_cpu operation. These operations provide a way of optimizing access to the per-cpu variables which are associated with the current processor. In our case it is this_cpu_read:

        __pcpu_size_call_return(this_cpu_read_, pcp)

    Remember that we have passed cpu_number as pcp to this_cpu_read from raw_smp_processor_id. Now let's look at the __pcpu_size_call_return implementation:

        #define __pcpu_size_call_return(stem, variable)                 \
        ({                                                              \
            typeof(variable) pscr_ret__;                                \
            __verify_pcpu_ptr(&(variable));                             \
            switch(sizeof(variable)) {                                  \
            case 1: pscr_ret__ = stem##1(variable); break;              \
            case 2: pscr_ret__ = stem##2(variable); break;              \
            case 4: pscr_ret__ = stem##4(variable); break;              \
            case 8: pscr_ret__ = stem##8(variable); break;              \
            default:                                                    \
                __bad_size_call_parameter(); break;                     \
            }                                                           \
            pscr_ret__;                                                 \
        })

    Yes, it looks a little strange, but it's easy. First of all we can see the definition of the pscr_ret__ variable with the int type. Why int? Because variable is cpu_number, which was declared as a per-cpu int variable:

        DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

    In the next step we call __verify_pcpu_ptr with the address of cpu_number. __verify_pcpu_ptr is used to verify that the given parameter is a per-cpu pointer. After that we set the pscr_ret__ value, which depends on the size of the variable. Our cpu_number variable is an int, so it is 4 bytes in size. It means that pscr_ret__ gets the result of this_cpu_read_4(cpu_number). At the end, the whole __pcpu_size_call_return expression evaluates to pscr_ret__. this_cpu_read_4 is a macro:

        #define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)

    which calls percpu_from_op and passes the mov instruction and the per-cpu variable to it. percpu_from_op expands to the inline assembly call:

        asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))

    Let's try to understand how it works and what it does. The gs segment register contains the base of the per-cpu area. Here we just copy cpu_number, which is in memory, to pfo_ret__ with the movl instruction. In other words:

        this_cpu_read(cpu_number)

    is the same as:

        movl %gs:$cpu_number, $pfo_ret__

    As we have not set up the per-cpu area yet, there is only one, for the currently running CPU, and we get zero as the result of smp_processor_id.
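    The size-based dispatch performed by __pcpu_size_call_return can be imitated in portable C. The sketch below is an illustration of the technique only: the kernel macro uses a GNU statement expression to yield a value, while this version stores into a caller-supplied result, and the demo_read_N helpers are hypothetical plain memory reads with no per-cpu semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One reader per operand size; each returns the value widened to 64 bits. */
static inline uint64_t demo_read_1(const void *p) { uint8_t v;  memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_2(const void *p) { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_4(const void *p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_8(const void *p) { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }

/*
 * Paste the size suffix onto the stem, so that the reader matching
 * sizeof(variable) is the one that runs.
 */
#define demo_size_call_return(stem, variable, result)             \
    do {                                                          \
        switch (sizeof(variable)) {                               \
        case 1: (result) = stem##1(&(variable)); break;           \
        case 2: (result) = stem##2(&(variable)); break;           \
        case 4: (result) = stem##4(&(variable)); break;           \
        case 8: (result) = stem##8(&(variable)); break;           \
        default: (result) = 0; break;                             \
        }                                                         \
    } while (0)
```

    Since sizeof is a compile-time constant, the compiler can discard the branches that do not apply, so a 4-byte variable such as cpu_number ends up as a single call to the _4 variant.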

    As we have got the current processor id, boot_cpu_init sets the given CPU online, active, present and possible with:

        set_cpu_online(cpu, true);
        set_cpu_active(cpu, true);
        set_cpu_present(cpu, true);
        set_cpu_possible(cpu, true);

    All of these functions use a concept called cpumask. cpu_possible is the set of CPU IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online is a subset of cpu_present and indicates the CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option; if this option is disabled, possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise cpumask_clear_cpu.
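    The set-or-clear pattern shared by these functions can be sketched with a small bitmap of our own. This is an illustration with hypothetical demo_ names and a made-up limit of 8 CPUs, not the kernel's cpumask implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) \
    (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* One bit position per CPU number, stored in an array of unsigned long. */
struct demo_cpumask {
    unsigned long bits[DEMO_BITS_TO_LONGS(DEMO_NR_CPUS)];
};

/* Set the CPU's bit when 'on' is true, clear it otherwise. */
static inline void demo_set_cpu_state(struct demo_cpumask *mask,
                                      unsigned cpu, bool on)
{
    unsigned long bit = 1UL << (cpu % DEMO_BITS_PER_LONG);

    if (on)
        mask->bits[cpu / DEMO_BITS_PER_LONG] |= bit;
    else
        mask->bits[cpu / DEMO_BITS_PER_LONG] &= ~bit;
}

static inline bool demo_cpu_isset(const struct demo_cpumask *mask, unsigned cpu)
{
    return (mask->bits[cpu / DEMO_BITS_PER_LONG] >>
            (cpu % DEMO_BITS_PER_LONG)) & 1UL;
}
```

    In this model, marking the bootstrap processor online would be a call such as demo_set_cpu_state(&online, 0, true) on a mask that starts out all zero.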

    For example let's look at set_cpu_possible. As we passed true as the second parameter, the:

        cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));

    will be called. First of all let's try to understand the to_cpumask macro. This macro casts a bitmap to a struct cpumask *. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the cpumask structure:

        typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

    which is just a bitmap declared with the DECLARE_BITMAP macro:

        #define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

    As we can see from its definition, the DECLARE_BITMAP macro expands to an array of unsigned long. Now let's look at how the to_cpumask macro is implemented:

        #define to_cpumask(bitmap)                                      \
            ((struct cpumask *)(1 ? (bitmap)                            \
                    : (void *)sizeof(__check_is_bitmap(bitmap))))

    I don't know about you, but it looked really weird to me at first. We can see a ternary operator here which is true every time, but why is __check_is_bitmap here? It's simple, let's look at it:

        static inline int __check_is_bitmap(const unsigned long *bitmap)
        {
            return 1;
        }

    Yeah, it just returns 1 every time. Actually we need it here only for one purpose: at compile time it checks that the given bitmap is a bitmap, or in other words it checks that the given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro to convert the array of unsigned long to a struct cpumask *. Now we can call the cpumask_set_cpu function with the cpu - 0 an