LSRC6 Ops for Developers

  • 7/31/2019 LSRC6 Ops for Developers

    1/68

    Ops for Developers

    Or: How I Learned To Stop Worrying And Love The Shell

    Ben [email protected]

    spkr8.com/t/13191

    Prologue

    Introductions

    Who Am I?

    Ben Klang

    @bklang Github/[email protected]

    What are my passions?

    Telephony Applications
    Information Security
    Performance and Availability Design
    Open Source

    What do I do?

    Today I write code and run Mojo Lingo

    But Yesterday...

    This was my world

    Ops Culture

    I am allergic to downtime

    It's About Risk

    If something breaks, it will be my pager that goes off at 2am
    New software == New ways to break
    If I can't see it, I can't manage it or monitor it, and it will break

    Agenda

    9:00 - 10:30     Operating Systems & Hardware; All About Bootup
    10:30 - 11:00    Break
    11:00 - 12:30    Observing a Running System; Optimization/Tuning
    12:30 - 1:30     Lunch
    1:30 - 3:00

    Part I

    Operating Systems & Hardware

    OS History Lesson

    BSD, System V, Linux and Windows

    UNICS (Sep. 1969), soon renamed Unix Time Sharing System Version 1
    UNIX Time Sharing System Version 5 (Jun. 1974)
    1BSD (Mar. 1978)
    UNIX Sys III (Nov. 1981)
    UNIX Sys V (Jan. 1983)
    4.3BSD (Jun. 1986)

    Hardware Components

    Common Architectures

    Intel x86 (i386, x86_64)
    SPARC
    POWER
    ARM

    But none of this really matters anymore

    CPU Configurations

    Individual CPU
    SMP: Symmetric Multi-Processing
    Multiple Cores
    Hyperthreading/Virtual Cores

    (Virtual) Memory

    RAM + Swap = Available Memory
    Swapping strategies vary across OSes
    What your code sees is a complete virtualization of this
    x86/32-bit processes can only see 3GB of RAM from a 4GB address space
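
The 4GB and 3GB figures come from simple pointer arithmetic. The sketch below assumes the common Linux layout in which 1GB of each 32-bit address space is reserved for the kernel; the slide does not name the split, and the function names are made up for illustration.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def address_space_bytes(pointer_bits: int) -> int:
    """Number of distinct byte addresses a pointer of this width can name."""
    return 2 ** pointer_bits


def user_space_bytes(pointer_bits: int, kernel_reserved_bytes: int) -> int:
    """Addresses left for the process after the kernel's share (assumed split)."""
    return address_space_bytes(pointer_bits) - kernel_reserved_bytes


if __name__ == "__main__":
    print(address_space_bytes(32) // GIB, "GiB total")
    print(user_space_bytes(32, 1 * GIB) // GIB, "GiB visible to the process")
```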

    Storage Types

    Local Storage (SATA, SAS, USB, Firewire)
    Network Storage (NFS, SMB, iSCSI, AOE)
    Storage Network (FibreChannel, Fabrics)

    Networking

    LAN (100Mb still common; 1Gbit standard; 10Gb and 100Gb on horizon)
    WAN (T-1, Frame Relay, ATM, MetroE)
    Important Characteristics
        Throughput
        Loss
        Delay

    Part II

    All About Bootup

    Phases

    BIOS
    Kernel Bootstrap
    Hardware Detection
    Init System

    System Services

    Varies by OS
    Common: SysV Init Scripts; /etc/inittab; rc.local
    Solaris: SMF
    Ubuntu: Upstart
    Debian: SysV default; Upstart optional
    OSX: launchd
    RedHat/CentOS: SysV Init Scripts

    SysV Init Scripts

    Created in /etc/init.d; symlinked into runlevel directories
    Symlinks prefixed with special characters to control startup/shutdown order
        Prefixed with S or K to start or stop service in each level
        Numeric prefix determines order

    /etc/rc3.d/S10sshd ->
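
The naming convention above can be sketched in a few lines of Python. This is only an illustration of how the S/K prefix and the number are read out of a symlink name; the helper names are hypothetical, and sorting by number is an assumption that matches plain filename order when the numbers are two digits wide.

```python
import re
from typing import List, NamedTuple


class RcLink(NamedTuple):
    action: str   # "start" for an S link, "stop" for a K link
    order: int    # numeric prefix; lower numbers run first
    service: str  # name of the script in /etc/init.d


_RC_NAME = re.compile(r"^([SK])(\d+)(.+)$")


def parse_rc_link(name: str) -> RcLink:
    """Split a runlevel symlink name such as 'S10sshd' into its parts."""
    match = _RC_NAME.match(name)
    if match is None:
        raise ValueError(f"not an S/K runlevel link: {name!r}")
    prefix, order, service = match.groups()
    return RcLink("start" if prefix == "S" else "stop", int(order), service)


def startup_order(names: List[str]) -> List[str]:
    """Services that would be started in this runlevel, in order."""
    links = [parse_rc_link(name) for name in names]
    starts = [link for link in links if link.action == "start"]
    return [link.service for link in sorted(starts, key=lambda l: (l.order, l.service))]
```

For example, `startup_order(["S55sshd", "S10network", "K20httpd"])` lists `network` before `sshd` and leaves out `httpd`, which is marked to be stopped.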

    rc.local

    Single dumb startup script
    Run at end of system startup
    Quick/dirty mechanism to start something at bootup

    /etc/inittab

    The original process supervisor
    Not (easily) scriptable
    Starts a process in a given runlevel
    Restarts the process when it dies
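
The "restart it when it dies" behaviour is the core of every process supervisor. Below is a minimal sketch of that loop, not how init itself is implemented: the function name, the restart cap and the delay are assumptions added so the example terminates, whereas a real supervisor would keep going.

```python
import subprocess
import sys
import time
from typing import List


def supervise(argv: List[str], max_restarts: int = 3, delay: float = 1.0) -> List[int]:
    """Run argv and restart it each time it exits.

    Stops after max_restarts restarts and returns every exit code seen.
    """
    exit_codes = []
    while True:
        # subprocess.run blocks until the child process exits.
        exit_codes.append(subprocess.run(argv).returncode)
        if len(exit_codes) > max_restarts:
            return exit_codes
        time.sleep(delay)  # brief pause so a crash loop does not spin the CPU


if __name__ == "__main__":
    # A child that always fails, to show the restart behaviour.
    crashing_child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(crashing_child, max_restarts=2, delay=0.1))
```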

    Supervisor Processes

    Solaris SMF
    Ubuntu Upstart
    OSX launchd
    daemontools

    Ruby Integrations

    Supervisor Processes
        Bluepill
        God
    Startup Script Generator
        Foreman

    Choosing a Boot Mechanism

    Is automatic recovery desirable? (Hint: sometimes it's not)
    Does it integrate with monitoring?
    Is it a one-off that will get forgotten?
    Does it integrate into OS startup/shutdown?
    How much work to integrate with your app?

    Part III

    Observing a Running System

    Common Tools

    top
    free
    vmstat
    netstat
    fuser
    ps
    sar (not always installed by default)

    Power Tools

    lsof
    iostat
    iftop
    pstree
    Tracing tools
        strace
        tcpdump/wireshark

    Observing CPU

    Go-to tools: top, ps
    CPU is not just about computation
    Most Important: %user, %system, %nice, %idle, %wait
    Other: hardware/software interrupts, stolen time (especially on EC2)

    The Mystical Load Avg.

    Broken into 1, 5 and 15 minute averages
    Gives a coarse view of overall system load
    Based on # processes waiting for CPU time
    Rule of thumb: stay below the number of CPUs in a system (e.g. a 4 CPU host should be below a 4.00 load average)
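
The rule of thumb is easy to check from a script. A minimal sketch, assuming a Unix-like host (`os.getloadavg` is not available on Windows); the helper names are made up for illustration. Note that `os.cpu_count()` counts logical CPUs, so hyperthreads are included.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_overloaded(load_avg: float, cpu_count: int) -> bool:
    """Rule of thumb from the slide: load should stay below the CPU count."""
    return load_per_cpu(load_avg, cpu_count) > 1.0


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # 1, 5 and 15 minute averages
    cpus = os.cpu_count() or 1            # logical CPUs, hyperthreads included
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        state = "over" if is_overloaded(value, cpus) else "ok"
        print(f"{label}: {value:.2f} on {cpus} CPUs -> {state}")
```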

    When am I CPU bound?

    15 minute load average exceeding the number of non-HT processors
    %user + %system consistently above 90%
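
Tools like top derive %user and %system from kernel tick counters. On Linux those counters are exposed on the first line of /proc/stat, so the 90% check can be sketched as below. This is Linux-specific and an assumption beyond the slides; the function names are hypothetical.

```python
from typing import List, Sequence


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters.

    Field order on Linux: user, nice, system, idle, iowait, irq, softirq, steal, ...
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def user_plus_system_percent(before: Sequence[int], after: Sequence[int]) -> float:
    """%user + %system over the interval between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters are summed; guest time is already part of user.
    total = sum(deltas[:8])
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (deltas[0] + deltas[2]) / total


def read_sample(path: str = "/proc/stat") -> List[int]:
    """Read the current counters (Linux only)."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())
```

Take two samples a few seconds apart with `read_sample()` and pass them to `user_plus_system_percent`; a value that stays above 90 matches the second criterion above.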

    Observing RAM

    Go-to tools: top, vmstat
    Available memory isn't just Free
    Buffers + Cache fill to consume available RAM (this is a good thing!)

    RAM vs. Swap

    RAM is the amount of physical memory
    Swap is disk used to augment RAM
    Swap is orders of magnitude slower
    Some VM types have no meaningful swap
    Rule of thumb: pretend swap doesn't exist

    Paging Strategies

    Solaris: Page in advance
    Linux: Page on demand (last resort)
    Windows: Craziness

    When am I memory bound?

    Free + buffers + cache < 15% of RAM
    Swap utilization above 10% avail. swap (Linux only)
    Check for high disk utilization to confirm thrashing
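
The first criterion can be computed from /proc/meminfo on Linux. The sketch below assumes the standard `MemTotal`, `MemFree`, `Buffers` and `Cached` fields; the helper names and the way the 15% threshold is passed in are illustrative choices, not something the slides prescribe.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo style text into {field: value} (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def reclaimable_fraction(meminfo: Dict[str, int]) -> float:
    """(Free + buffers + cache) as a fraction of total RAM."""
    usable = meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
    return usable / meminfo["MemTotal"]


def looks_memory_bound(meminfo: Dict[str, int], threshold: float = 0.15) -> bool:
    """True when less than `threshold` of RAM is free or easily reclaimed."""
    return reclaimable_fraction(meminfo) < threshold


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:  # Linux only
        info = parse_meminfo(handle.read())
    print(f"{reclaimable_fraction(info):.0%} free + buffers + cache")
```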

    Observing Disk

    Go-to tools: iostat, top
    Disk is usually the hardest thing to observe
    Better in recent Linux kernels (>2.6.20)

    RAID

    Redundant Array of Inexpensive Drives
    Different strategies have different performance/durability tradeoffs
        RAID-0
        RAID-1
        RAID-10
        RAID-5

    When am I disk bound?

    %wait is consistently above 10% to 20%
        ... though %wait can be network too
    SCSI and FC command queues are long
    Known failure mode: disk more than 85% full causes tremendous VFS overhead
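
The 85% failure mode is cheap to watch for. A minimal sketch using the standard library; the function names are hypothetical, and the figure may differ slightly from what `df` reports because `df` accounts for blocks reserved for root.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Used space as a percentage of the filesystem's total size."""
    return 100.0 * used / total


def is_too_full(path: str = "/", threshold: float = 85.0) -> bool:
    """True when the filesystem holding `path` is past the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) > threshold


if __name__ == "__main__":
    print("root filesystem past 85%:", is_too_full("/"))
```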

    Observing Network

    Go-to tools: netstat, iftop, wireshark
    Be wary of choke-points
        Switch interconnects
        WAN links
        Firewalls

    Link Optimization

    Use Jumbo Frames for Gbit+ links
    Port aggregation for throughput:
        Best: many-to-many
        Good: one-to-many
        Useless: one-to-one
        ... but still useful for HA
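
Why one-to-one gains nothing: aggregated links typically pick a member link per conversation by hashing the endpoints, so a single pair of hosts always lands on the same link. The sketch below illustrates the idea with CRC32 as a stand-in hash; real hash policies vary by switch and bonding driver, and the function names are made up.

```python
import zlib
from typing import Iterable, Set, Tuple


def pick_link(src: str, dst: str, link_count: int) -> int:
    """Choose a member link for a conversation (stand-in for a real hash policy)."""
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % link_count


def links_used(flows: Iterable[Tuple[str, str]], link_count: int) -> Set[int]:
    """Set of member links that carry at least one of the given flows."""
    return {pick_link(src, dst, link_count) for src, dst in flows}


if __name__ == "__main__":
    one_to_one = [("10.0.0.1", "10.0.0.2")] * 100
    many_to_many = [(f"10.0.0.{a}", f"10.0.1.{b}") for a in range(1, 17) for b in range(1, 17)]
    print("one-to-one uses links:", links_used(one_to_one, 4))
    print("many-to-many uses links:", links_used(many_to_many, 4))
```

However many packets the single pair exchanges, they share one link's bandwidth; the second link only helps that pair when the first one fails, which is the HA case.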

    When am I network bound?

    This one is easy: 99% of the time this is link saturation
    Gotchas: which link?
    Addendum: loss/delay (especially for TCP) can wreak havoc on throughput
        ... but usually only a problem across WAN
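
The delay part of the addendum follows from how TCP works: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is capped at window size divided by RTT. A minimal sketch of that arithmetic; the 64KB window and the RTT values are example inputs, not measurements.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on a single TCP connection's throughput, in bits per second."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    window = 65535  # bytes; the classic maximum without window scaling
    for label, rtt in (("LAN, 1 ms", 0.001), ("WAN, 100 ms", 0.1)):
        mbit = max_tcp_throughput_bps(window, rtt) / 1_000_000
        print(f"{label}: at most {mbit:.1f} Mbit/s per connection")
```

The same window that allows hundreds of Mbit/s on a LAN allows only a few Mbit/s at WAN latencies, regardless of how fast the link is.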

    Part IV

    Optimization & Performance Tuning

    Hardware Options

    A.K.A. Throw hardware at it
    Not the first thing to try
    Are the services tuned? SQL queries, application behavior, caching options
    Is something broken, causing performance degradation?

    Hardware Options

    RAM is usually the single biggest performance win (cost/benefit tradeoff)
    Faster disk is next best
    Then look at CPU and/or Network
    ...but do the work to figure out why your performance is limited in the first place

    Kernel Tunables

    Not as necessary as in the old days
    Almost all settings can be adjusted at runtime on Linux, Solaris
    Most valuable settings are buffer limits or counters/timers
    There be dragons! Read carefully before twisting these knobs

    Environment Settings

    ulimits
        max files
        stack size
        memory limits
        core dumps
        others
    Still subject to system-wide (kernel) limits
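
A process can inspect its own ulimits at runtime. The sketch below uses Python's `resource` module, which is Unix-only; the labels in `LIMITS` and the helper names are illustrative choices.

```python
import resource
from typing import Dict, Tuple

# Human-readable label -> resource constant for the limits named on the slide.
LIMITS = {
    "max open files": resource.RLIMIT_NOFILE,
    "stack size (bytes)": resource.RLIMIT_STACK,
    "address space (bytes)": resource.RLIMIT_AS,
    "core dump size (bytes)": resource.RLIMIT_CORE,
}


def describe(value: int) -> str:
    """Render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> Dict[str, Tuple[int, int]]:
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

The soft limit is what the process is held to; it can be raised up to the hard limit without extra privileges, and both sit below whatever the kernel allows system-wide.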

    Application Tunables

    There are not many for C-Ruby
    JVM has many
        Mostly related to how RAM is allocated and garbage collected
        Very dependent on application
    Any time an xVM is involved, there is probably a tunable (JVM, CLR)
    But we are developers!
    Tune/profile your app before

    Performance Management Tools

    sysstat (sar)
    SNMP (and related tools like Cacti)
    Integrated Monitoring + Trending tools
        Zabbix
        OpenNMS
    and a plethora of commercial tools

    Part V

    Putting It All Together

    Autopsy of a single HTTP request, end-to-end

    Live Demo/Whiteboard

    Part VI

    Pulling It All Apart

    Anticipating Murphy and his Law

    Most Common Pitfalls

    Disk Full
    DNS Unavailable/Slow
    Insufficient RAM
    Suboptimal Service Configuration
    Firewall misconfiguration
    Archaic: Network mismatch (Full/Half Duplex)

    DNS and Performance

    Possibly most-overlooked perf. impact
    Everything uses DNS
    If you make nothing else redundant, make this redundant!
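
Because every outbound connection starts with a lookup, resolver latency is worth measuring directly. A minimal sketch that times one lookup through the system resolver; the function name is hypothetical, and results depend on local caching, so treat a single number as indicative only.

```python
import socket
import time
from typing import Tuple


def time_lookup(hostname: str, port: int = 80) -> Tuple[bool, float]:
    """Resolve hostname once; return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(hostname, port)
        succeeded = True
    except socket.gaierror:
        # Resolution failed (unknown name or resolver unavailable).
        succeeded = False
    return succeeded, time.perf_counter() - start


if __name__ == "__main__":
    ok, elapsed = time_lookup("localhost")
    print(f"resolved={ok} in {elapsed * 1000:.1f} ms")
```

A lookup that takes hundreds of milliseconds, or only fails after a multi-second timeout, is added to every request that opens a new connection.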

    Part VII

    Scaling Up

    Horizontal or Vertical?

    Vertical: Making one server/instance go faster
    Horizontal: Parallelizing requests to get more things done in the same amount of time

    Clustering

    Parallelizing requests to increase overall throughput: horizontal scaling
    Techniques to make information more available:
        Caching (memcache, file-based caching)
        Distribute data sets
        Replication

    Distributing Data

    Replication
    Split Reads (One writer/master; multiple slaves/readers)
    Multiple Masters (dangerous!)
    Sharding (must consider HA)
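
Split reads usually come down to a small routing decision in the application or its database layer. The sketch below is a deliberately naive illustration: the class name and the "starts with SELECT" test are assumptions, and it ignores transactions, `SELECT ... FOR UPDATE` and replication lag, all of which a real setup has to handle.

```python
import itertools
from typing import List


class SplitReadRouter:
    """Send writes to the single writer and spread reads across the readers."""

    def __init__(self, writer: str, readers: List[str]):
        self.writer = writer
        # Fall back to the writer when no readers are configured.
        self._readers = itertools.cycle(readers or [writer])

    def route(self, sql: str) -> str:
        """Return the name of the server that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._readers)  # round-robin over the readers
        return self.writer
```

For example, a router built with `SplitReadRouter("db-writer", ["db-reader-1", "db-reader-2"])` alternates `SELECT` statements between the two readers and sends every `INSERT` or `UPDATE` to `db-writer`; the server names are placeholders.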

    Failover/HA

    Consistency requires concept of Quorum
    Losing partition gets killed: STONITH
    Multi-master systems ignore this at the cost of potential non-determinism
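
Quorum is simply a strict majority of the cluster. A minimal sketch of the check each partition would make after a network split; the function name is made up for illustration.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True when this partition can see a strict majority of the cluster."""
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


if __name__ == "__main__":
    print("3 of 5:", has_quorum(3, 5))  # majority side keeps running
    print("2 of 5:", has_quorum(2, 5))  # minority side must stop
    print("2 of 4:", has_quorum(2, 4))  # an even split leaves no winner
```

At most one partition can hold a strict majority, which is why the losing side can safely be fenced off. The even-split case is the usual argument for clusters with an odd number of nodes.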

    Tuning Services

    Some VM types (especially JVM or CLR) have tunables for memory consumption
    Databases usually have memory settings
    These can make dramatic differences
    Very workload dependent
    Deep troubleshooting: strace, wireshark

    Part VIII

    Deploying Applications

    12 Factor Application

    Deployability starts with application design
    Clear line between configuration and logic
    Permit easy horizontal scaling
    Are OS-agnostic (yay Ruby!)
    Minimize differences between dev and prod
    http://12factor.net - by Heroku cofounder
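
The "clear line between configuration and logic" usually means reading settings from the environment instead of hard-coding them. A minimal sketch of that pattern; the variable names (`DATABASE_URL`, `WEB_CONCURRENCY`, `APP_DEBUG`) and the defaults are hypothetical examples, not something the slides specify.

```python
import os
from typing import Any, Dict, Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build the app's settings from environment variables, with dev defaults."""
    env = os.environ if env is None else env
    return {
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "workers": int(env.get("WEB_CONCURRENCY", "2")),
        "debug": env.get("APP_DEBUG", "0") == "1",
    }


if __name__ == "__main__":
    # The same code runs in dev and prod; only the environment differs.
    print(load_config())
```

Passing the environment in as a parameter keeps the function easy to test and makes it obvious that nothing else decides how the app is configured.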

    Deployment Tools

    Capistrano
    The de facto standard
    Requires effort to set up, test
    Requires integration with system startup
    Most flexible

    Deployment Tools

    Move it to the cloud
    Heroku
    Cloud Foundry