Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability


Upload: nagios

Post on 28-Nov-2014


DESCRIPTION

Jeremy Rust's presentation on Avoiding Downtime Using Linux High Availability. The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference

TRANSCRIPT

Page 1: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Avoiding Downtime Using Linux High Availability

Jeremy Rust

[email protected] @linbit

@nerdhacker

Page 2: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Introduction & Agenda

• Downtime is not cheap
• What is High Availability? (Not a backup!)
• RAID, or RAID over the network (DRBD)
• SANs and clustered applications
• The Linux cluster stack
• Cluster management with Pacemaker
• Disaster Recovery / linking sites
• DRBD and the Cloud

Page 3: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD HA and DR

Page 4: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Downtime = $$$

• Lost revenue
• Lost reputation
• Almost every business these days has a critical database or file system that it could not do without.
• HP estimates $31,705 per hour, 3.8 hours a year, totaling $481,900/year
• 40% of internet traffic stops when Google goes down

Survey: “Critical IT-Systems in the medium-sized business”, 2013, Techconsult, on behalf of HP Germany. Basis: 300 medium-sized companies from Germany
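The HP figures on this slide can be checked with simple arithmetic. The sketch below uses only the three numbers quoted above; the function names are illustrative. Taken at face value, 3.8 hours at the hourly rate comes to about $120,000, while the quoted yearly total corresponds to roughly 15.2 hours, which would fit about four outages of 3.8 hours each. That reading is an inference from the numbers, not something the slide states.

```python
# Figures as quoted on the slide (HP / Techconsult survey).
COST_PER_HOUR_USD = 31_705
HOURS_QUOTED = 3.8
ANNUAL_TOTAL_USD = 481_900


def downtime_cost(hours: float, cost_per_hour: float) -> float:
    """Cost of an outage of the given length at a flat hourly rate."""
    return hours * cost_per_hour


def implied_hours(total: float, cost_per_hour: float) -> float:
    """Hours of downtime that a yearly total implies at a flat hourly rate."""
    return total / cost_per_hour


if __name__ == "__main__":
    cost = downtime_cost(HOURS_QUOTED, COST_PER_HOUR_USD)
    hours = implied_hours(ANNUAL_TOTAL_USD, COST_PER_HOUR_USD)
    print(f"3.8 h at the hourly rate: ${cost:,.0f}")
    print(f"Hours implied by the yearly total: {hours:.1f}")
```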

Page 5: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Downtime = $$$

“YOU LOST THE DATABASE?!?!”

• “Ummm, can you ping ____?”
• “I can’t seem to reach our inventory system.”
• “Can you try pulling up this record?”

Page 6: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Devotion to Duty - xkcd

Page 7: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Why Monitor?

• Hardware dies
• DDoS attacks
• Set-it-and-forget-it mentality
• Internet connection
• Security programs

Page 8: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Hosting / XaaS

• Reliability
• Security
• Multi-tenant architecture
• Scalability
• Uptime

Page 9: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

The Pillars of IT Security

Integrity  

Page 10: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Types of Clustering Solutions

• Hardware redundancy
• SAN solutions
• NAS boxes
• External hard drives or JBODs
• Software solutions

Page 11: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Recovery Time/Point Objectives
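The slide itself carries only the title. As a reminder of the standard definitions: the Recovery Point Objective (RPO) bounds how much recent data may be lost, and the Recovery Time Objective (RTO) bounds how long the service may stay down. The sketch below checks a single incident against both; the function names and the example timestamps are illustrative assumptions, not taken from the talk.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy: datetime, failure: datetime) -> timedelta:
    """Data written in this window is lost; it should stay within the RPO."""
    return failure - last_good_copy


def outage_duration(failure: datetime, service_restored: datetime) -> timedelta:
    """Time the service was unavailable; it should stay within the RTO."""
    return service_restored - failure


def meets_objectives(last_good_copy: datetime, failure: datetime,
                     service_restored: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data-loss window and the outage are in bounds."""
    return (data_loss_window(last_good_copy, failure) <= rpo
            and outage_duration(failure, service_restored) <= rto)


if __name__ == "__main__":
    # Hypothetical incident: nightly backup at 02:00, failure at 14:30,
    # service back at 16:00, with a 1-hour RPO and a 4-hour RTO.
    backup = datetime(2014, 10, 14, 2, 0)
    failure = datetime(2014, 10, 14, 14, 30)
    restored = datetime(2014, 10, 14, 16, 0)
    print(data_loss_window(backup, failure))   # 12:30:00 of lost writes
    print(meets_objectives(backup, failure, restored,
                           rpo=timedelta(hours=1), rto=timedelta(hours=4)))
```

In this made-up incident the outage itself is within the RTO, but the nightly backup leaves a 12.5-hour data-loss window, so the RPO is missed.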

Page 12: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

What is RAID? Is it enough?

Page 13: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

RAID

Microsoft Library http://msdn.microsoft.com/en-us/library/aa226166(v=sql.70).aspx#sql7perftune_moreinfo

Page 14: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

What Could Go Wrong

• Your shiny new hardware will fail
• Single points of failure are dangerous
• Dropped alerts
• Internet outage
• Power outage

Page 15: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

SAN/NAS

• Easy to implement - high cost per TB
• Large SLAs - quality of technicians
• Management via GUI
• Scalable - with the right packages
• SAN maintenance - learning curve
• Off-site replication is expensive

Page 16: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Single Point of Failure

SAN

NFS

MySQL

VMs

Page 17: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Pitfalls

• High initial and ongoing costs
• Vendor lock-in is required
• Ongoing worry of voiding the warranty
• Maintenance is tricky and ongoing
• It is a black box, typically Solaris based
• Cannot add or remove features
• It is still a single point of failure
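The single-point-of-failure concern can be put into numbers with the usual availability formulas: components that must all work multiply their availabilities, while redundant components fail only if every copy fails. The sketch below assumes independent failures, which is optimistic in practice, and the 99% figure is purely illustrative.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series(*availabilities: float) -> float:
    """All components are required: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: down only if every one of them is down."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single_box = 0.99                          # hypothetical storage box
    mirrored = parallel(single_box, single_box)
    print(f"single:   {downtime_hours_per_year(single_box):.1f} h/year")
    print(f"mirrored: {downtime_hours_per_year(mirrored):.2f} h/year")
```

With these assumed numbers a single box is expected to be down about 87.6 hours a year, and an independent mirrored pair under one hour.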

Page 18: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Software-Only Solutions

Things to look for:
• Synchronous or asynchronous replication
• Stability / maturity
• Time to recovery
• Chance of data loss
• Onsite / offsite
• Is it real time (live) or snapshots?

Page 19: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Asynchronous Architecture

Page 20: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Synchronous Architecture

Secondary
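The difference between the two architectures comes down to when a write is acknowledged. A toy model, not DRBD's actual implementation: in synchronous mode the write reaches the secondary before it is acknowledged, while in asynchronous mode it is acknowledged first and shipped later. The class and method names are hypothetical.

```python
class ReplicatedDisk:
    """Toy two-node replica that tracks which writes each side holds."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []      # blocks written on the primary
        self.secondary = []    # blocks applied on the secondary
        self.in_flight = []    # acknowledged but not yet replicated

    def write(self, block: str) -> str:
        self.primary.append(block)
        if self.synchronous:
            # Wait for the peer before acknowledging the application.
            self.secondary.append(block)
        else:
            # Acknowledge immediately; replication happens later.
            self.in_flight.append(block)
        return "ack"

    def drain(self) -> None:
        """Deliver queued writes to the secondary (asynchronous mode)."""
        self.secondary.extend(self.in_flight)
        self.in_flight.clear()

    def lost_if_primary_dies_now(self) -> list:
        """Writes that exist only on the primary at this moment."""
        # The secondary always holds a prefix of the primary's writes.
        return self.primary[len(self.secondary):]


if __name__ == "__main__":
    for mode in (True, False):
        disk = ReplicatedDisk(synchronous=mode)
        for block in ("a", "b", "c"):
            disk.write(block)
        print("sync " if mode else "async", disk.lost_if_primary_dies_now())
```

The trade-off is that synchronous writes pay the round trip to the peer on every write, whereas asynchronous writes risk losing whatever is still in flight.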

Page 21: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Layer Cake of Replication

• Virtualization
• Application
• File system
• Object store
• Block layer

http://images.pinkcakebox.com/cake696.jpg

Page 22: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Cluster Cake Fail

Page 23: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Common Issues / Pitfalls

• File locking
• Network congestion
• Data consistency / data corruption
• High overhead and/or additional CPU cycles
• Asynchronous or even backup-based
• Require ongoing licensing and royalties

Page 24: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD

• Completely hardware and application agnostic
• German engineering
• In development since 2001
• Created by LINBIT founder and CEO Philipp Reisner
• Built into the mainline Linux kernel as of 2.6.33
• Ships in all major Linux distributions
• Does not void RHEL or Oracle support

Page 25: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD Users

Page 26: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

A DRBD Cluster Stack

LAN

Server

RAID

High Speed RAID-Controller

High Speed NIC

Replication Network

shared nothing

Storage

Page 27: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Fully Redundant System

Storage 1   Storage 2

Active   Passive

MySQL.com  

Page 28: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Fully Redundant System

Storage 1   Storage 2

Active   Passive

Passive   Active

MySQL.com   Ausweb.com

Page 29: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Heartbeat/Corosync: The Comm Layer

• These are the communication tools of the cluster
• “Are you dead?”
• “Are you alive?”
• Heartbeat is seasoned and stable (reliability = HA)
• Corosync is newer and under development
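The "are you alive?" question is, at its simplest, a timeout on the last message received from a peer. The sketch below shows only that idea; it is not how Heartbeat or Corosync are implemented, and real cluster messaging layers also handle membership and quorum. The class name and the timeout value are assumptions, and time is passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Declare a node dead when no heartbeat arrived within `dead_after` seconds."""

    def __init__(self, dead_after: float):
        self.dead_after = dead_after
        self.last_seen = {}   # node name -> timestamp of last heartbeat

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node never heard from, or silent for too long, counts as dead."""
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.dead_after


if __name__ == "__main__":
    monitor = HeartbeatMonitor(dead_after=10.0)   # hypothetical timeout
    monitor.beat("node-b", now=100.0)
    print(monitor.is_alive("node-b", now=105.0))  # heartbeat is recent
    print(monitor.is_alive("node-b", now=115.0))  # silent past the timeout
```

A timeout alone cannot tell a dead peer from a broken network link, which is one reason the decision about who becomes primary is left to the cluster manager rather than to the messaging layer.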

Page 30: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Pacemaker

The Linux Cluster Resource Manager
• The powerful and bossy cluster manager
• Manages all aspects of the system
• Decides who is alive and primary
• Well known
• Widely deployed
• Does not require applications to have specific plugins

Page 31: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Pacemaker: Sleep All Night

• It lets you sleep through the night even if there’s a failure.
• Highly configurable
• Used with a number of clustering tools / file systems
• Very powerful if done well, disastrous if done wrong

Page 32: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Linux HA Stack

Page 33: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Disaster Recovery / Offsite Replication

• True Disaster Recovery happens live
• Interval-based snapshots no longer meet today’s SLA requirements
• DRBD does real-time replication on-site and off-site
• The DRBD Proxy tool mitigates throughput constraints and latency; it is highly configurable

Page 34: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Real-time Disaster Recovery

Page 35: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Scaling DRBD

• DRBD Proxy is typically deployed in 3-node configurations.
• Extremely configurable
• Proxy mitigates bandwidth constraints and latency
• Can replicate across 4 machines, even across distances
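One way to reason about bandwidth constraints on a long-distance link is to treat the proxy as a buffer between the sites: while writes arrive faster than the link can carry them, a backlog builds up, and it drains once the write rate drops below the link rate. The sketch below is that back-of-the-envelope model only. It is not DRBD Proxy's sizing method, and all names and numbers are illustrative assumptions.

```python
def backlog_mb(write_rate_mb_s: float, link_rate_mb_s: float,
               burst_seconds: float) -> float:
    """Data (MB) queued during a burst when writes outpace the link."""
    return max(0.0, (write_rate_mb_s - link_rate_mb_s) * burst_seconds)


def drain_seconds(backlog: float, link_rate_mb_s: float,
                  steady_write_mb_s: float) -> float:
    """Time to empty the backlog once writes fall back to a steady rate."""
    spare = link_rate_mb_s - steady_write_mb_s
    if spare <= 0:
        return float("inf")   # the link never catches up
    return backlog / spare


if __name__ == "__main__":
    # Hypothetical: a 60 s burst at 80 MB/s over a 20 MB/s link,
    # followed by a steady 5 MB/s of writes.
    queued = backlog_mb(80, 20, 60)
    print(f"backlog: {queued:.0f} MB")
    print(f"drain:   {drain_seconds(queued, 20, 5):.0f} s")
```

The backlog is also a rough measure of how far the remote site lags behind during the burst, which ties back to the recovery point objective.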

Page 36: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

3-node HA / DR

Location A Live Site

Location B DR Site

Proxy Proxy Proxy

Page 37: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

4-node DR + Active-Active HA

Location A Live Site

Location B DR Site

Proxy Proxy Proxy Proxy

Page 38: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Dedicated Proxy - Many Resources

Location A Live Site

Location B DR Site

Proxy Proxy

Dedicated Server Dedicated Server

Page 39: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

How to apply this in your cloud

http://www.gamesparks.com/wp-content/uploads/2013/07/the-cloud.jpg

DRBD works in the cloud and AWS VPC

DRBD can be used as backing storage for iSCSI

On native bare hardware or as part of your hardware or software appliance

Page 40: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

HA with Nagios!

• Filesystem (which has many symlinks in it)
• MySQL
• PostgreSQL
• crond
• ndo2db
• The Nagios application itself
• A Virtual IP
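The resources listed above have to come up in a sensible order on whichever node takes over, and stop in the reverse order. In a Pacemaker cluster that ordering is expressed in the cluster configuration; the sketch below only illustrates the idea with a dependency graph. The specific dependencies are assumptions made for illustration, not the configuration shown in the talk.

```python
from graphlib import TopologicalSorter   # Python 3.9+

# resource -> resources that must already be running (assumed, illustrative)
DEPENDS_ON = {
    "filesystem": set(),
    "mysql": {"filesystem"},
    "postgresql": {"filesystem"},
    "crond": {"filesystem"},
    "ndo2db": {"mysql"},
    "nagios": {"filesystem", "ndo2db"},
    "virtual_ip": {"nagios"},
}


def start_order(depends_on: dict) -> list:
    """Order in which resources can start so dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def stop_order(depends_on: dict) -> list:
    """Resources stop in the reverse of the start order."""
    return list(reversed(start_order(depends_on)))


if __name__ == "__main__":
    print("start:", " -> ".join(start_order(DEPENDS_ON)))
    print("stop: ", " -> ".join(stop_order(DEPENDS_ON)))
```

Bringing the virtual IP up last, as assumed here, means clients only reach the node once Nagios is running; other orderings are possible.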

Page 41: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Q+A Jeremy Rust

[email protected] @NerdHacker

877-DRBD247

www.linkedin.com/in/RustJeremy

DRBD.org Linbit.com

Linux-HA.org

Page 42: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD 9: the future

Page 43: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD 8 Branch build structure

Page 44: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

DRBD 9 Branch build structure

Page 45: Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

2 Fully redundant systems

Storage 1   Storage 2

Active   Passive

Passive   Active

MySQL.com   Ausweb.com