sql server on linux, will it perform? - qcon london … · hosted windows apis sql server ... •...
TRANSCRIPT
• ModifiedWindowsKerneltoruninusermode,akaLibraryOSorLibOS
• DesignedforrunningonWindowsandleveragesPico-processfeature
• Pico-processisaNTprocesswithemptyaddressspace
IntrotoDrawbridge:Acontainertechnologytoachieveisolation,securityanddensityinthecloud
NT processshared
address space
user32gdi32
ntdll
host
O
S ntoskrnl
win32k
400+NT calls
800+Win32 calls
Picoprocesspicoprocess
isolated address space
ABIboundaryPAL
host
O
S security monitor
ntoskrnl
45calls
• All1200+systemcallsblockedfromuser-mode(NTOSandwin32k)
• Enforcedby35-linechangetoKiSystemServiceHandler
• Noperfimpacttootherprocesses—leverages“slowpath”usedbyUMS
• 45newsystemcallsaddedtoprocess(Drawbridgesystemcalls)
• Evenhard-codedtrapscan’tbreakout
NT UM
LibOS:AusermoderuntimelibraryexposingsemanticsofWindowskernel
Network Stack
I/O
Object Manager
Process Manager
DRTL Simple HeapUnion FS
AFD
Wait Pool
Threads Memory Manager
Loader
PEB/TEB
PAL ABI Handler
Sync Objects ThreadsStreams Memory Manager
APC
• StorageManager
• AsynchronousI/OsubmittedtothehostandregisteredwithWaitPool threads
• OncompletionWaitPool threadsdeliverI/Os totheoriginalthreadthroughAPC
• OriginalthreadsdeliverI/Os totheirfinaldestination
• NetworkManager
• CustomversionofAFD(WinSocksemantics)withathreadpool
• AFDAsynchronousI/OsubmittedtothehostandregisteredwithWaitPoolthreads
• OncompletionWaitPool threadsdeliverI/Os toAFDthreadsthroughAPC
• threadsdelivernetworkrequeststooriginalthreadsinitiatedI/OthroughAPC
• OriginalthreadsdeliverI/Os totheirfinaldestination
• I/OGeneral
• NopropersupportforScatter/Gather
• MemoryManager
• GlobalVirtualAddressDescriptor(VAD)list
• GlobalHeap
• ObjectManager
• GlobalDirectory
• ProcessManager
• Perprocessruntimelibraries– noimagesharing
• Threads
• APCs“injection”throughpolling
SQLOS(SOS):Ausermoderuntimelibraryprovidingperformance,scalabilityanddiagnostic
foundationforSQLServer
Memory Node
CPU Node
…
Network Manager
Scheduler
Storage Manager
Scheduler
Storage Manager
CPU Node
…
Network Manager
Scheduler
Storage Manager
Scheduler
Storage Manager
…
• NetworkManager
• I/Ocompletionport/threadperCPUNode
• Asynchronousdelivery
• StorageManager
• I/Oqueueperscheduler
• Synchronousdeliverythroughperiodicpolling
• MemoryManager/ObjectManager/SchedulingManager
• NUMAawareness
• Partitionedheaps
• Non-preemptivescheduling&UserModeThreads
• Synchronizationprimitives
SQLServerOnTopOfPALs
Ring3
SQLServer
SQLOS
Win32
LibOS
PAL
Ring0 LinuxKernel
Drawbridge
Technologies SQL LibOS Host Extension
Object Management ✔ ✔ ✔
MemoryManagement ✔ ✔ ✔
Threading/Scheduling ✔ ✔ ✔
Synchronization ✔ ✔ ✔
I/O(Disk,Network) ✔ ✔ ✔
• Kernelaio
• PumpthreadsvsWaitPool threads
• FastI/O
// We can do Fast I/O if and only if it follows rules employed by SQL Server// disk I/O: which is delivered nonpreemptively through polling an overlapped// data structure// - I/O is asynchronous// - No user mode APC required// - No I/O completion port specified// - Contains an event to be signaled (leveraged by SQL Server to wake up idle scheduler// - Disk I/O//
canDoFastIO = WaitForCompletion == FALSE;
canDoFastIO = canDoFastIO && (ApcRoutine == NULL && FileObject != NULL);
canDoFastIO = canDoFastIO && (Args->SkipCompletionPort ||NtpGetCompletionPortObject(FileObject,
&CompletionKey) == NULL);
canDoFastIO = canDoFastIO && (Args->EventObject != NULL && IoStatusBlock != NULL);
canDoFastIO = canDoFastIO && (NtpGetObjectType(Args->Object) == NTUM_FILE &&NtpIsIoAsynchronous(Args->Object));
canDoFastIO = canDoFastIO && ((FileObject->Type & NtpSeekableFile) &&(Type == NTUM_IO_READ ||Type == NTUM_IO_WRITE ||Type == NTUM_IO_WRITE_GATHER ||Type == NTUM_IO_READ_SCATTER));
// If it is Gather/Scatter I/O then length can't exceed DK_UIO_MAXIOV supported by the Host//canDoFastIO = canDoFastIO && (!(Type == NTUM_IO_WRITE_GATHER ||
Type == NTUM_IO_READ_SCATTER) ||Length <= DK_UIO_MAXIOV);
• PumpthreadsvsWaitPool
• FastI/O~AFDpassthrough
• SQLOScompletionthreadsarepumpthreads~nocontextswitchoncompletion
// Complete I/Os received via the the IOPort are submitted to the I/O// completion port queueStatus = NtpTryToProcessIoCompletion(IoCompletionPort,
IoCompletionInformation);
// Process any APCs or interruptions for this thread. //NtpProcessKernelApc(threadObject);
Request.IOPort = IoCompletionPort->IOPort;Request.PendingIOs = &PendingIOs;
Status = DrtlReadStreamSync(IoCompletionPort->Stream,0,0,(PVOID)&Request,NULL);
while (PendingIOs != NULL){
//// Remember I/O to complete and move to the next I/O before// we complete the current one since by the time we return from// completion routine the completed I/O will be freed//CompletedIO = PendingIOs;PendingIOs = (PDK_ASYNC_RESULTS_LINKED)PendingIOs->Next;
//// Complete I/O//NtpCompleteNetworkIoRequest((PNTUM_IO_REQUEST)CompletedIO->Request);
}
• MultipleHeaps
• I/ORequestfreelistperthread
• PerprocessVirtualAddressSpaceManager
• NUMAsupport
• ProcessorAffinity
PVOIDDrtlAllocate(
__in ULONG Flags,__in SIZE_T Size,__in ULONG Tag)
{ULONG heapIdx;
//// Early boot we might not have a thread//heapIdx = DrtlGetCurrentThreadId() % g_DrtlNumberHeaps;
return DrtlpAllocate(&g_DrtlHeaps[heapIdx], Flags, Size, Tag);}
NtpAllocateIORequestRaw(__in NTUM_IO_TYPE Type)
{
// Use cache if we have i/O request//LocalRequest = (PNTUM_IO_REQUEST)ExpInterlockedPopEntrySList(
&RequestingThread->IORequestsCache);
// If the cache was empty allocate a new request structure.//if (LocalRequest == NULL){
LocalRequest = (PNTUM_IO_REQUEST)ExAllocatePoolWithTag(PagedPool,sizeof(*LocalRequest),' PRI');
}
HardwareConfigurationPowerSettings:OSControlpoweroption,HighPerformanceinOS,HTOFF,TurboboostOFFNetwork:1x10GBNetworkconnectionpermachineMachine configuration(serverandclient):Gen3systemsModel/Processors:[email protected](2S/16C),Memory:128GBStorage:4x447.13GBSSDs.AllSSDsarestripedtogetherandmountedas1volume.Bothdata
andlogarestoredonthisvolume.
HardwareConfigurationPowerSettings:OSControlpoweroption,HighPerformanceinOS,HTOFF,TurboboostOFFNetwork:1x10GBNetworkconnectionpermachineMachine configuration(serverandclient):4Ssystems(forTPCCtest)Model/Processors:[email protected](4S/40C),Memory:768GBDataStorage:2x1.46TBGBFusionIOdisk.Alldisksarestripedtogetherandmountedas1volume.LogStorage:1x5.54TBHDD
IntroducingSQLPAL
Principles:
• Removeredundancy
• OptimizePerformancecriticalpaths(I/O)
• Shrinkcodepath-lengthLibOSandWin32
Technologies SQL SOSv2 Host Extension
Object Management ❌ ✔ ❌
MemoryManagement ❌ ✔ ✔Hosttranslation(jemalloc)
Threading/Scheduling ❌ ✔ ✔Hosttranslation(pthreads)
Synchronization ❌ ✔ ✔Hosttranslation(conditionvariables)
I/O(Disk,Network) ❌ ✔ ✔Hosttranslation(kaio)
Ring3
SQLServerWin32
SOSv2Lib-OS
HostExtension
Ring0 LinuxKernel
SQLPAL
SQLPALandSOSv2Architecture
HostExtensionandIntegration HEDebuggerBridge
SOSv2(Memory,Scheduling,Synchronization)
StorageManager
NetworkManager
ResourceManager
ProcessManager
SecurityManager
AvailabilityManager
NTUserMode
ConfigManager
PALDebuggerExtension
HostedWindowsAPIs
SQLServer
SOSDirectAPIs
LinuxProcessLayout
•HostExtensionisnativeLinuxprocess
• TheHostExtensionloadstheSQLPALnativeWindowslibrary
• SQLPALloadsSQLServerintoavirtualWindowsProcess.
SQLServer(WindowsBinary)
LibOS(WindowsKernelinUserMode)
HostExtension(LinuxorOSX)
Win32Calls(1200+)
ABICalls(50)
LinuxorOSXOSCalls
LinuxorOSXOS
LLDB
Deb
ugger
Debugger
• DebuggerbridgeforWindbg
• FormostscenariosdebuggingisidenticaltoWindows
• LiveDebugging
• StartSQLonLinuxunderdebuggerbridge
• AttachwithWindbg
• Dscripts etc.worksameasagainstWindows
• CrashDump
• Rundebuggerbridgepassingincrashdumpfile
• AttachwithWindbg andit’sthesameasWindows
• ExtractWindowsdumpfromLinuxCoredump
• AbletoextractaWindowsdumpfromLinuxcoredump
• LosesLinuxinformation
• LinuxEnlightenment
• ThedebuggerextensionalsoaddscommandstodebugLinuxpartsofthePAL
• CommandsmirrornormalWindbg commands
• Examples:• ‘k’showsWindowsstack• ‘!k’showsLinuxstack• Samefordv(!dv),dt (!dt),etc.• Sourcecanbelistedandsourcesteppingworks
• VTune isacrossplatformperformancetool
• Process
• CaptureonLinuxandresolveonLinux
• CopytheprojecttoWindows
• Resolvesymbolsandrerunanalysis
• ThisaddstheWindowsinformationtotheproject
• Afterprocessingallthecodeisavailableforanalysis:Linuxcode,sqlpal.dll,Win32,andSQL