Posted on 16-May-2015
Make Sure Your Applications Crash
Moshe Zadka
True story
Python doesn't crash
Memory managed, no direct pointer arithmetic
...except it does
C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlocks, inconsistent
resident state
Recovery is important
"[S]ystem failure can usually be considered to be the result of two program errors[...] the
second, in the recovery routine[...]"
Crashes and inconsistent data
A crash results in data from an arbitrary program state.
Avoid storage
Caches are better than master copies.
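The point about caches deserves a concrete sketch: because a cache can always be regenerated from the master data, a crash that corrupts or loses it costs only time, never correctness. A minimal sketch in modern Python (the names `CACHE_PATH`, `expensive_lookup`, and `rebuild_cache` are hypothetical):

```python
import json
import os

CACHE_PATH = "lookup.cache.json"   # hypothetical file name

def expensive_lookup(key):
    # Stand-in for the real master-copy computation
    return key.upper()

def rebuild_cache(keys):
    # The cache can always be regenerated from the master data,
    # so losing it in a crash costs time, not correctness.
    return {k: expensive_lookup(k) for k in keys}

def load_cache(keys):
    try:
        with open(CACHE_PATH) as fp:
            return json.load(fp)
    except (OSError, ValueError):
        # Missing, or corrupt because a crash interrupted a
        # write: just rebuild from the master copy.
        return rebuild_cache(keys)
```

A crash that leaves a half-written cache file is handled the same way as no cache at all.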
Databases
Transactions maintain consistency
Databases can crash too!
Atomic operations
File rename
Example: Counting
def update_counter():
    fp = file("counter.txt")
    s = fp.read()
    counter = int(s.strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = file("counter.txt.tmp", 'w')
    print >>fp, counter
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified.
    # The following is an atomic operation.
    os.rename("counter.txt.tmp", "counter.txt")
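For reference, a modern Python 3 equivalent of the slide's counter (a sketch, not from the talk): `os.replace` performs an atomic overwrite on POSIX and on Windows, where `os.rename` fails if the target exists. The missing-file fallback is an added convenience.

```python
import os

def update_counter(path="counter.txt"):
    # Modern sketch of the slide's technique: os.replace is an
    # atomic overwrite on both POSIX and Windows.
    try:
        with open(path) as fp:
            counter = int(fp.read().strip())
    except FileNotFoundError:
        counter = 0   # added convenience, not in the slide
    counter += 1
    # A crash before this point changes nothing.
    tmp = path + ".tmp"
    with open(tmp, "w") as fp:
        fp.write("%d\n" % counter)
    # A crash before this point only leaves a temp file behind.
    os.replace(tmp, path)  # the atomic step
```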
Efficient caches, reliable masters
Mark inconsistency of cache
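One way to mark cache inconsistency is a "dirty" marker file written before the cache is touched and removed only afterwards; a marker that survives a crash says the cache cannot be trusted. A sketch (the file name `cache.dirty` is hypothetical):

```python
import os

DIRTY_MARKER = "cache.dirty"   # hypothetical marker file

def update_cache(write):
    # Flag the cache as inconsistent *before* touching it and
    # clear the flag only after the update completes; a marker
    # that survives a crash means "rebuild, do not trust".
    open(DIRTY_MARKER, "w").close()
    write()
    os.remove(DIRTY_MARKER)

def cache_is_trustworthy():
    return not os.path.exists(DIRTY_MARKER)
```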
No shutdown
Crash in testing
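Crashing in testing can be made routine: run the worker in a child process and kill it with SIGKILL, so no shutdown handlers run, then assert that the on-disk state is still consistent. A POSIX-only sketch (the helper name `crash_test` is hypothetical):

```python
import os
import signal
import time

def crash_test(worker, grace=0.2):
    # Run `worker` in a child process, then kill it with SIGKILL
    # so that no shutdown handlers run -- just like a real crash.
    # The caller then asserts the on-disk state is consistent.
    pid = os.fork()
    if pid == 0:
        worker()       # child: runs until killed
        os._exit(0)
    time.sleep(grace)  # parent: let the child do some work
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
```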
Availability
If data is consistent, just restart!
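If crashed state is consistent, remediation really is just restart. A minimal restart-until-success loop (hypothetical helper; production deployments would use systemd, supervisord, or similar):

```python
import subprocess
import sys
import time

def keep_running(argv, delay=1):
    # Remediation for a service with consistent on-disk state is
    # just "run it again": retry until the process exits cleanly.
    while True:
        code = subprocess.call(argv)
        if code == 0:
            return
        time.sleep(delay)
```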
Improving availability
Limit impact
Fast detection
Fast start-up
Vertical splitting
Different execution paths, different processes
Horizontal splitting
Different code bases, different processes
Watchdog
Monitor -> Flag -> Remediate
Watchdog principles
Keep it simple, keep it safe!
Watchdog: Heartbeats
## In a Twisted process
def beat():
    file('beats/my-name', 'a').close()

task.LoopingCall(beat).start(30)
Watchdog: Get time-outs
def getTimeout():
    # Each file in hearts/ holds the allowed interval in seconds;
    # the cutoff is "now minus that interval".
    timeout = dict()
    now = time.time()
    for heart in glob.glob('hearts/*'):
        beat = int(file(heart).read().strip())
        timeout[os.path.basename(heart)] = now - beat
    return timeout
Watchdog: Mark problems
def markProblems():
    timeout = getTimeout()
    for heart in glob.glob('beats/*'):
        mtime = os.path.getmtime(heart)
        name = os.path.basename(heart)
        problem = 'problems/' + name
        if (mtime < timeout[name] and
                not os.path.isfile(problem)):
            fp = file(problem, 'w')
            fp.write('watchdog')
            fp.close()
Watchdog: check solutions
def checkSolutions():
    now = time.time()
    problemTimeout = now - 30
    for problem in glob.glob('problems/*'):
        mtime = os.path.getmtime(problem)
        if mtime < problemTimeout:
            subprocess.call(['restart-system'])
Watchdog: Loop
## Watchdog
while True:
    markProblems()
    checkSolutions()
    time.sleep(1)
Watchdog: accuracy
Custom checkers can manufacture problems
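A custom checker "manufactures" a problem by writing a file into the same problems/ directory the watchdog already sweeps. A sketch of a disk-space checker (the name `check_disk_space` and the threshold are hypothetical, and `os.statvfs` is Unix-only):

```python
import os

def check_disk_space(path="/", min_free=1 << 30,
                     problems_dir="problems"):
    # If free space drops below min_free bytes, manufacture a
    # problem file for the watchdog loop to pick up.
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    if free < min_free:
        fname = os.path.join(problems_dir, "disk-space")
        with open(fname, "w") as fp:
            fp.write("checker: only %d bytes free" % free)
```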
Watchdog: reliability
Use cron for main loop
Watchdog: reliability
Use software/hardware watchdogs
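One common software watchdog is systemd's: with `WatchdogSec=` set in the unit file, the service must ping systemd periodically or be restarted. A sketch of that ping (the helper name is hypothetical; the `WATCHDOG=1` datagram to `$NOTIFY_SOCKET` is systemd's sd_notify protocol):

```python
import os
import socket

def sd_watchdog_ping():
    # Tell systemd "I am alive".  With WatchdogSec= set, missed
    # pings make systemd restart the service -- a software
    # watchdog maintained outside the application itself.
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):      # abstract-namespace socket
        addr = "\0" + addr[1:]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"WATCHDOG=1", addr)
    finally:
        sock.close()
    return True
```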
Conclusions
Everything crashes -- plan for it
Questions?
Welcome to the back-up slides
Extra! Extra!
Example: Counting on Windows
def update_counter():
    fp = file("counter.txt")
    s = fp.read()
    counter = int(s.strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = file("counter.txt.tmp", 'w')
    print >>fp, counter
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified.
    os.remove("counter.txt")
    # At this point, the state is inconsistent*
    # The following is an atomic operation.
    os.rename("counter.txt.tmp", "counter.txt")
Example: Counting on Windows (Recovery)
def recover():
    if not os.path.exists("counter.txt"):
        # The permanent file has been removed.
        # Therefore, the temp file is valid.
        os.rename("counter.txt.tmp", "counter.txt")
Example: Counting with versions
def update_counter():
    files = [int(name.split('.')[-1])
             for name in os.listdir('.')
             if name.startswith('counter.')]
    last = max(files)
    counter = int(file('counter.%s' % last).read().strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = file("tmp.counter", 'w')
    print >>fp, counter
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified.
    os.rename('tmp.counter', 'counter.%s' % (last + 1))
    os.remove('counter.%s' % last)
Example: Counting with versions (cleanup)
# This is not a recovery routine, but a cleanup
# routine.
# Even in its absence, the state is consistent.
def cleanup():
    files = [int(name.split('.')[-1])
             for name in os.listdir('.')
             if name.startswith('counter.')]
    files.sort()
    files.pop()
    for n in files:
        os.remove('counter.%d' % n)
    if os.path.exists('tmp.counter'):
        os.remove('tmp.counter')
Correct ordering
def activate_due():
    scheduled = rs.smembers('scheduled')
    now = time.time()
    for el in scheduled:
        due = int(rs.get(el + ':due'))
        if now < due:
            continue
        rs.sadd('activated', el)
        rs.delete(el + ':due')
        rs.srem('scheduled', el)
Correct ordering (recovery)
def recover():
    inconsistent = rs.sinter('activated', 'scheduled')
    for el in inconsistent:
        rs.delete(el + ':due')  # *
        rs.srem('scheduled', el)
Example: Key/value stores
0.log:
    ['add', 'key-0', 'value-0']
    ['add', 'key-1', 'value-1']
    ['add', 'key-0', 'value-2']
    ['remove', 'key-1']
    ...
1.log:
    ...
2.log:
    ...
Example: Key/value stores (utility functions)
## Get the level of a file
def getLevel(s):
    return int(s.split('.')[0])

## Get all files of a given type
# (`files` is supplied by the caller's scope; see relevant below)
def getType(tp):
    return [(getLevel(s), s)
            for s in files
            if s.endswith(tp)]
Example: Key/value stores (classifying files)
## Get all relevant files
def relevant(d):
    files = os.listdir(d)
    mlevel, master = max(getType('.master'))
    logs = getType('.log')
    logs.sort()
    return [master] + [log for llevel, log in logs
                       if llevel > mlevel]
Example: Key/value stores (reading)
## Read in a single file
def update(result, fp):
    for line in fp:
        val = json.loads(line)
        if val[0] == 'add':
            result[val[1]] = val[2]
        else:
            del result[val[1]]

## Read in several files
def read(files):
    result = dict()
    for fname in files:
        try:
            update(result, file(fname))
        except ValueError:
            # A torn final line (crash mid-write) is ignored
            pass
    return result
Example: Key/value stores (writer class)
class Writer(object):
    def __init__(self, level):
        self.level = level
        self.fp = None
        self._next()

    def _next(self):
        self.level += 1
        if self.fp:
            self.fp.close()
        name = '%3d.log' % self.level
        self.fp = file(name, 'w')
        self.rows = 0

    def write(self, value):
        print >>self.fp, json.dumps(value)
        self.fp.flush()
        self.rows += 1
        if self.rows > 200:
            self._next()
Example: Key/value stores (storage class)
## The actual data store abstraction.
class Store(object):
    def __init__(self, d):
        files = relevant(d)
        self.result = read(files)
        level = getLevel(files[-1])
        self.writer = Writer(level)

    def get(self, key):
        return self.result[key]

    def add(self, key, value):
        self.result[key] = value
        self.writer.write(['add', key, value])

    def remove(self, key):
        del self.result[key]
        self.writer.write(['remove', key])
Example: Key/value stores (compression code)
## This should be run periodically
# from a different thread
def compress(d):
    files = relevant(d)[:-1]
    if len(files) < 2:
        return
    result = read(files)
    master = getLevel(files[-1]) + 1
    fp = file('%3d.master.tmp' % master, 'w')
    for key, value in result.iteritems():
        towrite = ['add', key, value]
        print >>fp, json.dumps(towrite)
    fp.close()
    # Atomically publish the new master
    os.rename('%3d.master.tmp' % master, '%3d.master' % master)
Vertical splitting: Example
def forking_server():
    s = socket.socket()
    s.bind(('', 8080))
    s.listen(5)
    while True:
        client, address = s.accept()
        newpid = os.fork()
        if newpid == 0:
            # Child: serve this client, then exit -- a crash
            # here takes down one connection, not the server.
            f = client.makefile('w')
            f.write("Sunday, May 22, 1983 "
                    "18:45:59-PST")
            f.close()
            os._exit(0)
        client.close()
Horizontal splitting: front-end
## Process one
class SchedulerResource(resource.Resource):
    isLeaf = True

    def __init__(self, filepath):
        resource.Resource.__init__(self)
        self.filepath = filepath

    def render_PUT(self, request):
        uuid, = request.postpath
        content = request.content.read()
        child = self.filepath.child(uuid)
        child.setContent(content)
        return ''

fp = filepath.FilePath("things")
r = SchedulerResource(fp)
s = server.Site(r)
reactor.listenTCP(8080, s)
Horizontal splitting: scheduler
## Process two
rs = redis.Redis(host='localhost', port=6379, db=9)
while True:
    for uuid in os.listdir("things"):
        fname = os.path.join("things", uuid)
        when = int(file(fname).read().strip())
        rs.set(uuid + ':due', when)
        rs.sadd('scheduled', uuid)
        os.remove(fname)
    time.sleep(1)
Horizontal splitting: runner
## Process three
rs = redis.Redis(host='localhost', port=6379, db=9)
recover()
while True:
    activate_due()
    time.sleep(1)
Horizontal splitting: message queues
No direct dependencies
Horizontal splitting: message queues: sender
## Process four
rs = redis.Redis(host='localhost', port=6379, db=9)
params = pika.ConnectionParameters('localhost')
conn = pika.BlockingConnection(params)
channel = conn.channel()
channel.queue_declare(queue='active')
while True:
    activated = rs.smembers('activated')
    finished = set(rs.smembers('finished'))
    for el in activated:
        if el in finished:
            continue
        channel.basic_publish(exchange='',
                              routing_key='active',
                              body=el)
        rs.sadd('finished', el)
Horizontal splitting: message queues: receiver
## Process five
# It is possible to get "dups" of bodies.
# Application logic should deal with that.
params = pika.ConnectionParameters('localhost')
conn = pika.BlockingConnection(params)
channel = conn.channel()
channel.queue_declare(queue='active')

def callback(ch, method, properties, el):
    syslog.syslog('Activated %s' % el)

channel.basic_consume(callback, queue='active',
                      no_ack=True)
channel.start_consuming()
Horizontal splitting: point-to-point
Use HTTP (preferably, REST)
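HTTP keeps point-to-point links crash-friendly: either side can restart and the other sees nothing worse than one failed request to retry. A self-contained sketch using only the standard library (the handler and port are hypothetical):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    # A tiny REST-ish endpoint: GET returns the service status.
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def serve(port=8081):
    # Each process exposes its API over plain HTTP; a crashed
    # peer just means one failed request, retried after restart.
    HTTPServer(("localhost", port), StatusHandler).serve_forever()
```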