7/31/2019 LSRC6 Ops for Developers
Ops for Developers
Or: How I Learned To Stop Worrying And Love The Shell
spkr8.com/t/13191
-
Prologue
Introductions
-
Who Am I?
Ben Klang
@bklang Github/[email protected]
-
What are my passions?
Telephony Applications
Information Security
Performance and Availability Design
Open Source
-
What do I do?
Today I write code and run Mojo Lingo
But Yesterday...
-
This was my world
-
Ops Culture
-
I am allergic to downtime
-
It's About Risk
If something breaks, it will be my pager that goes off at 2am
New software == New ways to break
If I can't see it, I can't manage it or monitor it, and it will break
-
Agenda
9:00 - 10:30: Operating Systems & Hardware; All About Bootup
10:30 - 11:00: Break
11:00 - 12:30: Observing a Running System; Optimization/Tuning
12:30 - 1:30: Lunch
1:30 - 3:00
-
Part I
Operating Systems & Hardware
-
OS History Lesson
BSD, System V, Linux and Windows
-
UNICS (Sep. 1969), soon renamed Unix Time Sharing System Version 1
UNIX Time Sharing System Version 5 (Jun. 1974)
1BSD (Mar. 1978)
UNIX Sys III (Nov. 1981)
UNIX Sys V (Jan. 1983)
4.3BSD (Jun. 1986)
-
Hardware
Components
-
Common Architectures
Intel x86 (i386, x86_64)
SPARC
POWER
ARM
But none of this really matters anymore
-
CPU Configurations
Individual CPU
SMP: Symmetric Multi-Processing
Multiple Cores
Hyperthreading/Virtual Cores
-
(Virtual) Memory
RAM + Swap = Available Memory
Swapping strategies vary across OSes
What your code sees is a complete virtualization of this
x86/32-bit processes can only see 3GB of RAM from a 4GB address space
-
Storage Types
Local Storage (SATA, SAS, USB, Firewire)
Network Storage (NFS, SMB, iSCSI, AOE)
Storage Network (FibreChannel, Fabrics)
-
Networking
LAN (100Mb still common; 1Gbit standard; 10Gb and 100Gb on horizon)
WAN (T-1, Frame Relay, ATM, MetroE)
Important Characteristics
Throughput
Loss
Delay
-
Part II
All About Bootup
-
Phases
BIOS
Kernel Bootstrap
Hardware Detection
Init System
-
System Services
Varies by OS
Common: SysV Init Scripts; /etc/inittab; rc.local
Solaris: SMF
Ubuntu: Upstart
Debian: SysV default; Upstart optional
OSX: launchd
RedHat/CentOS: SysV Init Scripts
-
SysV Init Scripts
Created in /etc/init.d; symlinked into runlevel directories
Symlinks prefixed with special characters to control startup/shutdown order
Prefixed with S or K to start or stop service in each level
Numeric prefix determines order
/etc/rc3.d/S10sshd ->
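
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names and the sample directory listing are made up, and ties between equal numeric prefixes are broken by service name as an assumption.

```python
import re

# Hypothetical helper: split a runlevel symlink name such as "S10sshd"
# into (action, order, service), following the S/K + number convention.
LINK_NAME = re.compile(r"^([SK])(\d+)(.+)$")

def parse_rc_link(name):
    match = LINK_NAME.match(name)
    if match is None:
        return None  # not a start/kill symlink
    action, order, service = match.groups()
    return ("start" if action == "S" else "stop", int(order), service)

def start_order(link_names):
    """Return service names in the order their S links would be run."""
    parsed = [p for p in map(parse_rc_link, link_names) if p]
    starts = [p for p in parsed if p[0] == "start"]
    return [service for _, _, service in sorted(starts, key=lambda p: (p[1], p[2]))]

# Illustrative listing, not taken from a real system.
print(start_order(["S55cron", "K20nfs", "S10sshd", "README"]))
```

Lower numbers start first, so sshd would come up before cron in this made-up runlevel.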
-
rc.local
Single dumb startup script
Run at end of system startup
Quick/dirty mechanism to start something at bootup
-
/etc/inittab
The original process supervisor
Not (easily) scriptable
Starts a process in a given runlevel
Restarts the process when it dies
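
The "restart it when it dies" idea can be sketched as a small Python loop. This is not how init itself is implemented; the function name, the restart cap and the delay are assumptions added so the example terminates.

```python
import subprocess
import sys
import time

def supervise(command, max_restarts=3, delay=0.1):
    """Run command, starting it again each time it exits.

    A real supervisor keeps going indefinitely; max_restarts is an
    assumption added here so the sketch ends.
    Returns the list of exit codes observed.
    """
    exit_codes = []
    for _ in range(max_restarts):
        proc = subprocess.Popen(command)
        exit_codes.append(proc.wait())  # block until the child dies
        time.sleep(delay)               # avoid a tight respawn loop
    return exit_codes

if __name__ == "__main__":
    # A child that exits at once with status 1, standing in for a crashing service.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

The short sleep between restarts matters in practice: a service that crashes on startup would otherwise be respawned as fast as the machine allows.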
-
Supervisor Processes
Solaris SMF
Ubuntu Upstart
OSX launchd
daemontools
-
Ruby Integrations
Supervisor Processes
Bluepill
God
Startup Script Generator
Foreman
-
Choosing a Boot Mechanism
Is automatic recovery desirable? (Hint: sometimes it's not)
Does it integrate with monitoring?
Is it a one-off that will get forgotten?
Does it integrate into OS startup/shutdown?
How much work to integrate with your app?
-
Part III
Observing a Running System
-
Common Tools
top
free
vmstat
netstat
fuser
ps
sar (not always installed by default)
-
Power Tools
lsof
iostat
iftop
pstree
Tracing tools
strace
tcpdump/wireshark
-
Observing CPU
Go-to tools: top, ps
CPU is not just about computation
Most Important: %user, %system, %nice, %idle, %wait
Other: hardware/software interrupts, stolen time (especially on EC2)
-
The Mystical Load Avg.
Broken into 1, 5 and 15 minute averages
Gives a coarse view of overall system load
Based on # processes waiting for CPU time
Rule of thumb: stay below the number of CPUs in a system (e.g. a 4 CPU host should be below a 4.00 load average)
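
The rule of thumb can be checked with a short Python sketch. The function names are made up for illustration; `os.getloadavg()` is Unix-only, and `os.cpu_count()` counts logical CPUs, so hyperthreaded siblings are included.

```python
import os

def load_per_cpu(load_average, cpu_count):
    """Load average divided by CPU count; below 1.0 satisfies the rule of thumb."""
    return load_average / cpu_count

def within_rule_of_thumb(load_average, cpu_count):
    return load_per_cpu(load_average, cpu_count) < 1.0

if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1            # logical CPUs, including HT siblings
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        print(label, round(load_per_cpu(value, cpus), 2), within_rule_of_thumb(value, cpus))
```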
-
When am I CPU bound?
15 minute load average exceeding the number of non-HT processors
%user + %system consistently above 90%
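
As a rough sketch of the %user + %system test, the percentages can be derived from two snapshots of the aggregate `cpu` line in Linux `/proc/stat`. The helper names and the counter values below are made up, and one interval is only a single sample; "consistently" means repeating this over time.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Turn a 'cpu ...' line into a dict of cumulative tick counters."""
    values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))

def busy_percent(before, after):
    """Percent of elapsed ticks spent in user + system between two snapshots."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0
    return 100.0 * (delta["user"] + delta["system"]) / total

def looks_cpu_bound(before, after, threshold=90.0):
    return busy_percent(before, after) > threshold

# Made-up counter values for illustration.
first = parse_cpu_line("cpu 1000 0 500 8000 500 0 0 0 0 0")
second = parse_cpu_line("cpu 1800 0 650 8040 510 0 0 0 0 0")
print(busy_percent(first, second), looks_cpu_bound(first, second))
```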
-
Observing RAM
Go-to tools: top, vmstat
Available memory isn't just "Free"
Buffers + Cache fill to consume available RAM (this is a good thing!)
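
A minimal Python sketch of that arithmetic, assuming the Linux `/proc/meminfo` layout. The sample text is made up and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers (kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])
    return info

def reclaimable_kb(info):
    """Free plus buffers plus page cache: memory that can be handed to processes."""
    return info["MemFree"] + info["Buffers"] + info["Cached"]

# Made-up sample in /proc/meminfo format.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          400000 kB
Buffers:          200000 kB
Cached:          3400000 kB
"""

info = parse_meminfo(SAMPLE)
print(reclaimable_kb(info), "kB of", info["MemTotal"], "kB")
```

On a real Linux host, the same functions could be fed the contents of `/proc/meminfo` instead of the sample.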
-
RAM vs. Swap
RAM is the amount of physical memory
Swap is disk used to augment RAM
Swap is orders of magnitude slower
Some VM types have no meaningful swap
Rule of thumb: pretend swap doesn't exist
-
Paging Strategies
Solaris: Page in advance
Linux: Page on demand (last resort)
Windows: Craziness
-
When am I memory bound?
Free + buffers + cache < 15% of RAM
Swap utilization above 10% of available swap (Linux only)
Check for high disk utilization to confirm thrashing
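
The two thresholds can be written as one small Python function. Treating either signal alone as enough is an assumption of this sketch, as is the function name.

```python
def memory_bound(free, buffers, cache, total_ram, swap_used, swap_total):
    """Apply the two thresholds; all arguments share one unit (e.g. kB)."""
    low_ram = (free + buffers + cache) < 0.15 * total_ram
    # A host with no swap configured cannot trip the swap test.
    heavy_swap = swap_total > 0 and swap_used > 0.10 * swap_total
    return low_ram or heavy_swap

# Made-up figures: plenty of reclaimable RAM, but 20% of swap in use.
print(memory_bound(2000, 500, 500, 8000, 200, 1000))
```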
-
Observing Disk
Go-to tools: iostat, top
Disk is usually the hardest thing to observe
Better in recent Linux kernels (>2.6.20)
-
RAID
Redundant Array of Inexpensive Drives
Different strategies have different performance/durability tradeoffs
RAID-0
RAID-1
RAID-10
RAID-5
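
One side of the tradeoff, usable capacity, is simple arithmetic. The sketch below assumes equal-size drives and the standard layouts (RAID-1 as a full mirror across all drives, RAID-10 as striped two-way mirrors); the function name is made up.

```python
def usable_capacity(level, drives, drive_size):
    """Usable capacity for equal-size drives under the standard RAID layouts."""
    if level == "RAID-0":      # striping, no redundancy
        return drives * drive_size
    if level == "RAID-1":      # every drive holds a full mirror copy
        return drive_size
    if level == "RAID-10":     # stripe across two-way mirrored pairs
        if drives < 4 or drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, at least 4")
        return (drives // 2) * drive_size
    if level == "RAID-5":      # one drive's worth of space goes to parity
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_size
    raise ValueError(f"unknown level: {level}")

# Four drives of 2 (say, TB) each.
for level in ("RAID-0", "RAID-1", "RAID-10", "RAID-5"):
    print(level, usable_capacity(level, 4, 2))
```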
-
When am I disk bound?
%wait is consistently above 10% to 20%
... though %wait can be network too
SCSI and FC command queues are long
Known failure mode: disk more than 85% full causes tremendous VFS overhead
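
The 85% figure is easy to watch for from Python's standard library. A minimal sketch with made-up function names; note that the percentage computed here can differ slightly from what `df` prints because of filesystem reserved blocks.

```python
import shutil

def percent_full(path="/"):
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total

def over_threshold(percent, threshold=85.0):
    return percent > threshold

print(round(percent_full("/"), 1), over_threshold(percent_full("/")))
```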
-
Observing Network
Go-to tools: netstat, iftop, wireshark
Be wary of choke-points
Switch interconnects
WAN links
Firewalls
-
Link Optimization
Use Jumbo Frames for Gbit+ links
Port aggregation for throughput:
Best: many-to-many
Good: one-to-many
Useless: one-to-one
... but still useful for HA
-
When am I network bound?
This one is easy: 99% of the time this is link saturation
Gotchas: which link?
Addendum: loss/delay (especially for TCP) can wreak havoc on throughput
... but usually only a problem across WAN
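
Why delay hurts TCP mostly across the WAN can be shown with one formula: a single connection can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. The window size and round-trip times below are assumptions chosen for illustration, and packet loss is not modeled.

```python
def max_tcp_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound for one TCP connection: one window in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

WINDOW = 64 * 1024  # assumed window: 64 KiB, roughly the largest without window scaling
for label, rtt in (("LAN, 0.5 ms", 0.0005), ("WAN, 80 ms", 0.080)):
    print(label, round(max_tcp_throughput_mbps(WINDOW, rtt), 1), "Mbit/s")
```

With the same window, the LAN case can fill a gigabit link while the WAN case is capped at a few Mbit/s regardless of how fast the link is.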
-
Part IV
Optimization & Performance Tuning
-
Hardware Options
A.K.A. Throw hardware at it
Not the first thing to try
Are the services tuned? SQL queries, application behavior, caching options
Is something broken, causing performance degradation?
-
Hardware Options
RAM is usually the single biggest performance win (cost/benefit tradeoff)
Faster disk is next best
Then look at CPU and/or Network
...but do the work to figure out why your performance is limited in the first place
-
Kernel Tunables
Not as necessary as in the old days
Almost all settings can be adjusted at runtime on Linux, Solaris
Most valuable settings are buffer limits or counters/timers
There be dragons! Read carefully before twisting these knobs
-
Environment Settings
ulimits
max files
stack size
memory limits
core dumps
others
Still subject to system-wide (kernel) limits
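
The limits listed above can be inspected from inside a process. A small sketch using Python's standard `resource` module (Unix only); the labels and helper names are made up, and mapping "memory limits" to `RLIMIT_AS` is an assumption.

```python
import resource  # Unix only

LIMITS = {
    "max files": resource.RLIMIT_NOFILE,
    "stack size": resource.RLIMIT_STACK,
    "memory (address space)": resource.RLIMIT_AS,
    "core dumps": resource.RLIMIT_CORE,
}

def describe(value):
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    """Map each label to its (soft, hard) limit for this process."""
    return {label: resource.getrlimit(which) for label, which in LIMITS.items()}

for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

The soft limit is what the process currently gets; it can be raised up to the hard limit without extra privileges.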
-
Application Tunables
There are not many for C-Ruby
JVM has many
Mostly related to how RAM is allocated and garbage collected
Very dependent on application
Any time an xVM is involved, there is probably a tunable (JVM, CLR)
But we are developers!
Tune/profile your app before
-
Performance Management Tools
sysstat (sar)
SNMP (and related tools like Cacti)
Integrated Monitoring + Trending tools
Zabbix
OpenNMS
and a plethora of commercial tools
-
Part V
Putting It All Together
Autopsy of a single HTTP request, end-to-end
-
Live
Demo/Whiteboard
-
Part VI
Pulling It All Apart
Anticipating Murphy and his Law
-
Most Common Pitfalls
Disk Full
DNS Unavailable/Slow
Insufficient RAM
Suboptimal Service Configuration
Firewall misconfiguration
Archaic: Network mismatch (Full/Half Duplex)
-
DNS and Performance
Possibly most-overlooked perf. impact
Everything uses DNS
If you make nothing else redundant, make this redundant!
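
One way to see the DNS cost is to time a resolver call directly. A minimal Python sketch; the function name is made up, and "localhost" is used only so the example runs without network access. Substitute a hostname your application actually depends on.

```python
import socket
import time

def time_lookup(hostname, port=80):
    """Return (seconds, address_count) for one resolver call; count is 0 on failure."""
    start = time.perf_counter()
    try:
        results = socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        results = []
    return time.perf_counter() - start, len(results)

elapsed, count = time_lookup("localhost")
print(f"localhost: {count} result(s) in {elapsed * 1000:.2f} ms")
```

A lookup that takes seconds, or fails only after a timeout, is added to every request that needs it.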
-
Part VII
Scaling Up
-
Horizontal or Vertical?
Vertical: Making one server/instance go faster
Horizontal: Parallelizing requests to get more things done in the same amount of time
-
Clustering
Parallelizing requests to increase overall throughput: horizontal scaling
Techniques to make information more available:
Caching (memcache, file-based caching)
Distribute data sets
Replication
-
Distributing Data
Replication
Split Reads (One writer/master; multiple slaves/readers)
Multiple Masters (dangerous!)
Sharding (must consider HA)
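
Two of these strategies can be sketched in Python. The class, the function and the host names are hypothetical; round-robin reads and hash-modulo shard placement are simple choices made for illustration, not the only options.

```python
import hashlib
import itertools

class SplitReadRouter:
    """Send writes to the single master and rotate reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def for_write(self):
        return self.master

    def for_read(self):
        return next(self._replicas)

def shard_for(key, shards):
    """Pick a shard from a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

# Hypothetical host names.
router = SplitReadRouter("db-master", ["db-replica-1", "db-replica-2"])
print(router.for_write(), router.for_read(), router.for_read(), router.for_read())
print(shard_for("user:42", ["shard-a", "shard-b", "shard-c"]))
```

Each shard holds data nobody else has, which is why sharding has to be paired with an HA plan per shard.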
-
Failover/HA
Consistency requires concept of Quorum
Losing partition gets killed: STONITH
Multi-master systems ignore this at the cost of potential non-determinism
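
Quorum is usually defined as a strict majority of the cluster, which the following Python sketch assumes. The function names are made up for illustration.

```python
def quorum_size(cluster_size):
    """Smallest strict majority of the cluster."""
    return cluster_size // 2 + 1

def has_quorum(reachable_nodes, cluster_size):
    return reachable_nodes >= quorum_size(cluster_size)

# A 5-node cluster split 3/2: only the 3-node side may keep serving.
print(has_quorum(3, 5), has_quorum(2, 5))
# A 4-node cluster split 2/2: neither side has a majority.
print(has_quorum(2, 4))
```

This is why clusters are commonly built with an odd number of nodes: an even split leaves no side able to continue.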
-
Tuning Services
Some VM types (especially JVM or CLR) have tunables for memory consumption
Databases usually have memory settings
These can make dramatic differences
Very workload dependent
Deep troubleshooting: strace, wireshark
-
Part VIII
Deploying Applications
-
12 Factor Application
Deployability starts with application design
Clear line between configuration and logic
Permit easy horizontal scaling
Are OS-agnostic (yay Ruby!)
Minimize differences between dev and prod
http://12factor.net - by Heroku cofounder
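
The "clear line between configuration and logic" point is often implemented by reading settings from environment variables. A minimal Python sketch; the variable names and defaults are hypothetical.

```python
import os

def load_config(env=os.environ):
    """Build settings from environment variables, with development defaults."""
    return {
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "workers": int(env.get("WEB_CONCURRENCY", "1")),
        "debug": env.get("APP_DEBUG", "false").lower() == "true",
    }

# Passing a plain dict stands in for a production environment here.
print(load_config({"DATABASE_URL": "postgres://db.internal/app", "WEB_CONCURRENCY": "4"}))
```

The same code then runs unchanged in dev and prod; only the environment differs.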
-
Deployment Tools
Capistrano
The de facto standard
Requires effort to set up, test
Requires integration with system startup
Most flexible
-
Deployment Tools
Move it to the cloud
Heroku
Cloud Foundry