Ops for Developers

Posted on 18-Dec-2014

Category: Technology

DESCRIPTION

Ops for Developers, a training session presented by Ben Klang at Lone Star Ruby Conference 6, 2012

TRANSCRIPT

Ops for Developers
Or: How I Learned To Stop Worrying And Love The Shell

Ben Klang
bklang@mojolingo.com

spkr8.com/t/13191

Friday, August 10, 12

Prologue

Introductions

Who Am I?

Ben Klang
@bklang on GitHub/Twitter
bklang@mojolingo.com

What are my passions?

• Telephony Applications

• Information Security

• Performance and Availability Design

• Open Source

What do I do?

• Today I write code and run Mojo Lingo

• But Yesterday...

This was my world

Ops Culture

“I am allergic to downtime”

It’s About Risk

• If something breaks, it will be my pager that goes off at 2am

• New software == New ways to break

• If I can’t see it, I can’t manage it or monitor it and it will break

Agenda

• 9:00 - 10:30

• Operating Systems & Hardware

• All About Bootup

• 10:30 - 11:00: Break

• 11:00 - 12:30

• Observing a Running System

• Optimization/Tuning

• 12:30 - 1:30: Lunch

• 1:30 - 3:00

• Autopsy of an HTTP Request

• Dealing with Murphy

• 3:00 - 3:30: Break

• 3:30 - 5:00

• Scaling Up

• Deploying Apps

• Audience Requests

Part I

Operating Systems & Hardware

OS History Lesson
BSD, System V, Linux and Windows

UNICS (Sep. 1969), soon renamed “Unix Time Sharing System Version 1”

UNIX Time Sharing System Version 5 (Jun. 1974)

1BSD (Mar. 1978)

UNIX Sys III (Nov. 1981)

UNIX Sys V (Jan. 1983)

4.3BSD (Jun. 1986)

Hardware Components

Common Architectures

• Intel x86 (i386, x86_64)

• SPARC

• POWER

• ARM

• But none of this really matters anymore

CPU Configurations

• Individual CPU

• SMP: Symmetric Multi-Processing

• Multiple Cores

• Hyperthreading/Virtual Cores

(Virtual) Memory

• RAM + Swap = Available Memory

• Swapping strategies vary across OSes

• What your code sees is a complete virtualization of this

• x86/32-bit processes can only “see” 3GB of RAM from a 4GB address space
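
This is easy to confirm from a shell. A quick check, assuming a Linux host with glibc's getconf:

```shell
# Is this a 32- or 64-bit userland, and how big are VM pages?
getconf LONG_BIT     # 32 or 64
getconf PAGE_SIZE    # virtual-memory page size in bytes
```

On a stock 32-bit x86 Linux kernel, it is the 3GB/1GB user/kernel split of the 4GB address space that caps each process at roughly 3GB, regardless of installed RAM.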

Storage Types

• Local Storage (SATA, SAS, USB, Firewire)

• Network Storage (NFS, SMB, iSCSI, AOE)

• Storage Network (FibreChannel, Fabrics)

Networking

• LAN (100Mbit still common; 1Gbit standard; 10Gbit and 100Gbit on the horizon)

• WAN (T-1, Frame Relay, ATM, MetroE)

• Important Characteristics

• Throughput

• Loss

• Delay

Part II

All About Bootup

Phases

• BIOS

• Kernel Bootstrap

• Hardware Detection

• Init System

System Services

• Varies by OS

• Common: SysV Init Scripts; /etc/inittab; rc.local

• Solaris: SMF

• Ubuntu: Upstart

• Debian: SysV default; Upstart optional

• OSX: launchd

• RedHat/CentOS: SysV Init Scripts

SysV Init Scripts

• Created in /etc/init.d; Symlinked into runlevel directories

• Symlinks prefixed with special characters to control startup/shutdown order

• Prefixed with “S” or “K” to start or stop service in each level

• Numeric prefix determines order

• /etc/rc3.d/S10sshd -> /etc/init.d/sshd
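
A minimal sketch of the symlink scheme, using a temporary directory in place of /etc (the paths and the "sshd" service are illustrative):

```shell
# Stand-in for /etc/init.d and /etc/rc3.d
root=$(mktemp -d)
mkdir -p "$root/init.d" "$root/rc3.d"

# A trivial init script that understands start/stop
printf '#!/bin/sh\ncase "$1" in start) echo starting;; stop) echo stopping;; esac\n' \
  > "$root/init.d/sshd"
chmod +x "$root/init.d/sshd"

# S = start in this runlevel, 10 = ordering prefix (lower runs earlier);
# a K prefix would mean "stop this service when entering the runlevel"
ln -s "$root/init.d/sshd" "$root/rc3.d/S10sshd"

# At boot, rc walks the S* links in lexical order and invokes each with "start"
"$root/rc3.d/S10sshd" start    # prints "starting"
```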

rc.local

• Single “dumb” startup script

• Run at end of system startup

• Quick/dirty mechanism to start something at bootup
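
A minimal rc.local sketch (the marker-file path and daemon path are made up for illustration):

```shell
#!/bin/sh
# /etc/rc.local (illustrative): runs once at the end of multi-user startup.
# Anything launched here is unsupervised -- if it dies, nothing restarts it.

date > /tmp/last-boot                  # hypothetical boot marker
# /usr/local/bin/some-daemon &         # hypothetical one-off daemon

# Note: many distros require the file to end with "exit 0"
```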

/etc/inittab

• The original process supervisor

• Not (easily) scriptable

• Starts a process in a given runlevel

• Restarts the process when it dies
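
A typical respawn entry looks like this (the id, service name, and path are illustrative):

```
# /etc/inittab format -- id:runlevels:action:command
# "respawn" restarts the command whenever it exits
ap1:2345:respawn:/usr/local/bin/myapp --foreground
```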

Supervisor Processes

• Solaris SMF

• Ubuntu Upstart

• OSX launchd

• daemontools

Ruby Integrations

• Supervisor Processes

• Bluepill

• God

• Startup Script Generator

• Foreman
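
For example, Foreman reads a Procfile and can either run the processes in the foreground (`foreman start`) or generate Upstart/init scripts (`foreman export`). The commands below are illustrative:

```
# Procfile
web: bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
```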

Choosing a Boot Mechanism

• Is automatic recovery desirable?(Hint: sometimes it’s not)

• Does it integrate with monitoring?

• Is it a one-off that will get forgotten?

• Does it integrate into OS startup/shutdown?

• How much work to integrate with your app?

Part III

Observing a Running System

Common Tools

• top

• free

• vmstat

• netstat

• fuser

• ps

• sar (not always installed by default)

Power Tools

• lsof

• iostat

• iftop

• pstree

• Tracing tools

• strace

• tcpdump/wireshark

Observing CPU

• Go-to tools: top, ps

• CPU is not just about computation

• Most Important: %user, %system, %nice, %idle, %wait

• Other: hardware/software interrupts, “stolen” time (especially on EC2)
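
On Linux these counters come from /proc/stat. A sketch of how tools like top derive the percentages (the figures are cumulative since boot, so real tools sample twice and diff):

```shell
# Fields of the aggregate "cpu" line:
# user nice system idle iowait irq softirq steal ...
awk '/^cpu / {
  total = $2+$3+$4+$5+$6+$7+$8+$9
  printf "user %.1f%%  system %.1f%%  idle %.1f%%  iowait %.1f%%  steal %.1f%%\n",
         100*$2/total, 100*$4/total, 100*$5/total, 100*$6/total, 100*$9/total
}' /proc/stat
```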

The Mystical Load Avg.

• Broken into 1, 5 and 15 minute averages

• Gives a coarse view of overall system load

• Based on # processes waiting for CPU time

• Rule of thumb: stay below the number of CPUs in the system (e.g. a 4-CPU host should stay below a 4.00 load average)
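
The rule of thumb is easy to script on Linux, where /proc/loadavg holds the three averages:

```shell
cpus=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)   # 1-minute average
echo "load: $load1, cpus: $cpus"

# load averages are floats, so compare in awk
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l < c) }'; then
  echo "within the rule of thumb"
else
  echo "load exceeds CPU count"
fi
```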

When am I CPU bound?

• 15 minute load average exceeding the number of non-HT processors

• %user + %system consistently above 90%

Observing RAM

• Go-to tools: top, vmstat

• Available memory isn’t just “Free”

• Buffers + Cache fill to consume available RAM (this is a good thing!)

RAM vs. Swap

• RAM is the amount of physical memory

• Swap is disk used to augment RAM

• Swap is orders of magnitude slower

• Some VM types have no meaningful swap

• Rule of thumb: pretend swap doesn’t exist

Paging Strategies

• Solaris: Page in advance

• Linux: Page on demand (last resort)

• Windows: Craziness

When am I memory bound?

• Free + buffers + cache < 15% of RAM

• Swap utilization above 10% avail. swap (Linux only)

• Check for high disk utilization to confirm “thrashing”
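
Both heuristics can be read straight out of /proc/meminfo on Linux (values in kB). A sketch:

```shell
awk '/^MemTotal:/  { mt = $2 }
     /^MemFree:|^Buffers:|^Cached:/ { avail += $2 }
     /^SwapTotal:/ { st = $2 }
     /^SwapFree:/  { sf = $2 }
     END {
       printf "free+buffers+cache: %.1f%% of RAM\n", 100 * avail / mt
       if (100 * avail / mt < 15)           print "warning: possibly memory bound"
       if (st > 0 && 100*(st-sf)/st > 10)   print "warning: swap use above 10%"
     }' /proc/meminfo
```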

Observing Disk

• Go-to tools: iostat, top

• Disk is usually hardest thing to observe

• Better in recent Linux kernels (> 2.6.20)

RAID

• Redundant Array of Inexpensive Drives

• Different strategies have different performance/durability tradeoffs

• RAID-0

• RAID-1

• RAID-10

• RAID-5

• RAID-6

When am I disk bound?

• %wait is consistently above 10% to 20%

• ... though %wait can be network too

• SCSI and FC command queues are long

• Known failure mode: disk more than 85% full causes tremendous VFS overhead
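
A quick check for that last failure mode, flagging any filesystem over the 85% threshold (df -P guarantees one portable-format line per filesystem):

```shell
df -P | awk 'NR > 1 && $5 != "-" {
  sub(/%/, "", $5)                        # strip "%" from the Capacity column
  if ($5 + 0 > 85) print $6 " is " $5 "% full"
}'
```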

Observing Network

• Go-to tools: netstat, iftop, wireshark

• Be wary of choke-points

• Switch interconnects

• WAN links

• Firewalls
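
Tools like iftop build their numbers from per-interface counters, which Linux exposes in /proc/net/dev; sampling them twice and diffing gives throughput. A sketch of reading the raw counters:

```shell
# After the two header lines: one row per interface;
# first data column is rx bytes, tenth is tx bytes
awk 'NR > 2 {
  gsub(/:/, " ")
  printf "%-8s rx=%d bytes  tx=%d bytes\n", $1, $2, $10
}' /proc/net/dev
```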

Link Optimization

• Use Jumbo Frames for Gbit+ links

• Port aggregation for throughput:

• Best: many-to-many

• Good: one-to-many

• Useless: one-to-one

• ... but still useful for HA

When am I network bound?

• This one is easy: 99% of the time this is link saturation

• Gotchas: which link?

• Addendum: loss/delay (especially for TCP) can wreak havoc on throughput

• ... but usually only a problem across WAN

Part IV

Optimization & Performance Tuning

Hardware Options

• A.K.A. “Throw hardware at it”

• Not the first thing to try

• Are the services tuned? SQL queries, application behavior, caching options

• Is something broken, causing performance degradation?

Hardware Options

• RAM is usually the single biggest performance win (cost/benefit tradeoff)

• Faster disk is next best

• Then look at CPU and/or Network

• ...but do the work to figure out why your performance is limited in the first place

Kernel Tunables

• Not as necessary as in the “old days”

• Almost all settings can be adjusted at runtime on Linux, Solaris

• Most valuable settings are buffer limits or counters/timers

• There be dragons! Read carefully before twisting these knobs
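
On Linux the tunables live under /proc/sys (sysctl(8) is a front end to the same tree). Reading is harmless; writing requires root and a careful read of the docs first:

```shell
cat /proc/sys/fs/file-max          # system-wide open-file limit
cat /proc/sys/net/core/somaxconn   # ceiling on listen(2) backlogs
# equivalent reads:          sysctl fs.file-max net.core.somaxconn
# runtime change (as root):  sysctl -w net.core.somaxconn=1024
```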

Environment Settings

• ulimits

• max files

• stack size

• memory limits

• core dumps

• others

• Still subject to system-wide (kernel) limits
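
ulimit is a shell builtin; -S shows soft limits and -H hard limits:

```shell
echo "open files (soft): $(ulimit -S -n)"
echo "open files (hard): $(ulimit -H -n)"
echo "stack size:        $(ulimit -s)"   # kB
echo "core dump size:    $(ulimit -c)"   # blocks; 0 disables core files
```

An unprivileged process may lower its limits freely, but can raise the soft limit only up to the hard limit.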

Environment limits

• Hard limits cannot be raised by unprivileged users

• PAM configuration may also be in effect
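
On PAM-based Linux systems, per-user limits are typically set in /etc/security/limits.conf and applied by pam_limits at login (the `deploy` user and values below are illustrative):

```
# <domain>  <type>  <item>   <value>
deploy      soft    nofile   8192
deploy      hard    nofile   16384
```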

Application Tunables

• There are not many for C-Ruby

• JVM has many

• Mostly related to how RAM is allocated and garbage collected

• Very dependent on application

• Any time an “xVM” is involved, there is probably a tunable (JVM, CLR)

• But we are developers! Tune/profile your app before looking to the environment

Performance Management Tools

• sysstat (sar)

• SNMP (and related tools like Cacti)

• Integrated Monitoring + Trending tools

• Zabbix

• OpenNMS

• and a plethora of commercial tools

Part V

Putting It All Together
Autopsy of a single HTTP request, end-to-end

Live Demo/Whiteboard

Part VI

Pulling It All Apart
Anticipating Murphy and his Law

Most Common Pitfalls

• Disk Full

• DNS Unavailable/Slow

• Insufficient RAM

• Suboptimal Service Configuration

• Firewall misconfiguration

• Archaic: Network mismatch (Full/Half Duplex)

DNS and Performance

• Possibly most-overlooked perf. impact

• Everything uses DNS

• If you make nothing else redundant, make this redundant!
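
Concretely: give every host at least two resolvers in /etc/resolv.conf (the addresses below are from the documentation range, purely illustrative):

```
nameserver 192.0.2.10
nameserver 192.0.2.11
options timeout:2 attempts:2
```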

Part VII

Scaling Up

Horizontal or Vertical?

• Vertical: Making one server/instance go faster

• Horizontal: Parallelizing requests to get more things done in the same amount of time

Clustering

• Parallelizing requests to increase overall throughput: horizontal scaling

• Techniques to make information more available:

• Caching (memcache, file-based caching)

• Distribute data sets

• Replication

Distributing Data

• Replication

• Split Reads (One writer/master; multiple slaves/readers)

• Multiple Masters (dangerous!)

• Sharding (must consider HA)

Failover/HA

• Consistency requires concept of Quorum

• Losing partition gets killed: STONITH

• Multi-master systems ignore this at the cost of potential non-determinism

Tuning Services

• Some VM types (especially JVM or CLR) have tunables for memory consumption

• Databases usually have memory settings

• These can make dramatic differences

• Very workload dependent

• Deep troubleshooting: strace, wireshark

Part VIII

Deploying Applications

12 Factor Application

• Deployability starts with application design

• Clear line between configuration and logic

• Permit easy horizontal scaling

• Are OS-agnostic (yay Ruby!)

• Minimize differences between dev and prod

• http://12factor.net - by Heroku cofounder

Deployment Tools

• Capistrano

• The de facto standard

• Requires effort to set up, test

• Requires integration with system startup

• Most flexible

Deployment Tools

• “Move it to the cloud”

• Heroku

• Cloud Foundry
