© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Large Scale Identification of Race Conditions
How we find race conditions in
Joe Gordon
Sean Dague
May 21st, 2014
Development Scale
Development Principles
● Never break trunk
– Master branch is always green
– Developers are never blocked on broken trunk
– Support continuous deployment
● Transparency
● Automate everything
● Egalitarian
● Be strict. Reduce burden on reviewers
What Happens When You Submit Code
[Diagram: Proposed Change feeding Pep8, Unit Tests, Devstack / Tempest, and Devstack / Grenade jobs; ~180 Guests]
WAT?
1 Proposed Change generates …
● 5 – 10 Devstacks
● ~10K integration tests
● ~1000 2nd Level Guests
● ~1 GB of Log Data (uncompressed)
And these add up...
● 1 week = 250-500 changes merged
● 1 week = 1500-3000 change revisions (including updates to existing changes)
● 10,000 new first time changes proposed every 42 days
– 42 days between gerrit 70k – 80k and 80k – 90k
Statistics of Large Numbers
● Factors
– Chance of Events - P(E)
– Number of Events / run – N(E/R)
– Number of runs - N(R)
● Ex: Github is down 0.05% of the time
– 0.0005 * 20 clones/run * 1500 runs/week = 15
– 15 test failures every week (on average) because of github
– We no longer clone from github
P(E) x N(E/R) x N(R) = Failure Rate
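The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the per-run probability assumes the events are independent, which the slide does not state.

```python
def expected_failures(p_event, events_per_run, runs):
    """Expected failure count: P(E) x N(E/R) x N(R)."""
    return p_event * events_per_run * runs


def p_run_fails(p_event, events_per_run):
    """Chance that at least one event hits a single run.

    Assumes independent events (an assumption, not from the slide).
    """
    return 1.0 - (1.0 - p_event) ** events_per_run


# The GitHub example: 0.05% downtime, 20 clones per run, 1500 runs per week.
weekly = expected_failures(0.0005, 20, 1500)
per_run = p_run_fails(0.0005, 20)
print("expected failures per week: %.1f" % weekly)
print("chance a single run is hit: %.4f" % per_run)
```

For small P(E) the product form and the exact per-run probability nearly agree, which is why the simple multiplication on the slide is a good working estimate.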
Where do these failures come from?
How we did this in Grizzly
● Someone's change fails
– They run recheck, it passes
– No one ever knew about the issue
● Someone has a large patch series (15 patches)
– 1/3 of patches fail
– Different 1/3 of patches fail next time around
– “Hey, have you seen this failure → URL”
● My brain is a poor big data solution
● … and then we turned on parallel testing – KABOOM!
“Have you seen this recently?”
Elastic Recheck
Elastic Recheck Flow
[Diagram labels:
– logs.openstack.org: All artifacts
– logstash.openstack.org: Select LOGs at INFO+
– Gerrit: Test Completes, Results
– recheckbot: Known Patterns
– irc.freenode.net
– Steps 1 to 4; Report < 15 minutes after fail
– er data scripts: Known Patterns, Every 30 mins
– status.openstack.org/elastic-recheck]
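The "Known Patterns" step in the flow can be illustrated with a toy matcher. This is a sketch under stated assumptions, not the elastic-recheck implementation: the real flow searches indexed logs on logstash.openstack.org, while this example runs regular expressions over in-memory log lines, and the bug ids, patterns, and function names are all hypothetical.

```python
import re

# Hypothetical pattern table: ids and regexes are invented for illustration.
KNOWN_PATTERNS = {
    "bug-0001": re.compile(r"DBDeadlock"),
    "bug-0002": re.compile(r"Timed out waiting for .* to become ACTIVE"),
}


def classify(log_lines, patterns=KNOWN_PATTERNS):
    """Return the sorted ids of known patterns that match any log line."""
    hits = set()
    for line in log_lines:
        for bug_id, regex in patterns.items():
            if regex.search(line):
                hits.add(bug_id)
    return sorted(hits)


def report(change, log_lines):
    """Build the message a bot could post back for a failed test run."""
    hits = classify(log_lines)
    if hits:
        return "Change %s: failure matches known issue(s): %s" % (
            change, ", ".join(hits))
    return "Change %s: unrecognized failure, needs a human" % change


print(report("12345", ["ERROR nova.api DBDeadlock: lock wait timeout"]))
print(report("12346", ["ERROR something nobody has seen before"]))
```

Keeping the patterns in a plain table is what makes it cheap for contributors to add a new fingerprint when they spot a recurring failure.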
We expected...
● 6 – 10 major bugs
● Frequency rates > 1%
– Human detection rates for patterns
We found...
Upstream Service Breaks
Examples:
- pypi bad cert
- github outages
- iaas dns blacklisting
- iaas provider network
Fixes:
Assume touching the network is poison; cache or bring resources local
Infra Breaks
Examples:
- bad nodepool images
- service outages
- mirrors broken
Fixes:
Make infra more resilient and self healing
Bugs in OpenStack
Examples:
- state corruption
- races w/ async messaging
- races w/ multiple workers
- db deadlocks
Fixes:
Ferret out races in the code
● Currently tracking ~100 unique bugs in the system - seen in last 2 weeks
● Most at < 0.1% occurrence rate
Bugs in Tests
Examples:
- Unsafe global state expectations
- Comparing timestamps
Fixes:
Fix the tests
Bugs in Dependencies
Examples:
- kernel nbd vs. ovswitch
- libvirt wedging
Fixes:
Get bug reported upstream, try to provide a workaround for buggy versions in OpenStack
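The "Comparing timestamps" item under Bugs in Tests is easy to picture with a small example. The sketch below is hypothetical (none of these names come from OpenStack): a check that reads the clock twice and expects equal values passes only when both reads land on the same tick, while injecting a fixed clock makes the same check deterministic.

```python
import datetime


def utc_now():
    """Read the real wall clock as a timezone-aware UTC datetime."""
    return datetime.datetime.now(datetime.timezone.utc)


def make_record(name, clock=utc_now):
    """Build a record stamped with the time from an injectable clock."""
    return {"name": name, "created_at": clock()}


def racy_check():
    # Fragile: compares two separate clock reads, so the result depends on
    # timing and clock resolution rather than on the code under test.
    record = make_record("server-1")
    return record["created_at"] == utc_now()


def deterministic_check():
    # Robust: the test controls the clock, so the comparison cannot race.
    fixed = datetime.datetime(2014, 5, 21, 12, 0, 0,
                              tzinfo=datetime.timezone.utc)
    record = make_record("server-1", clock=lambda: fixed)
    return record["created_at"] == fixed


print("deterministic check passes:", deterministic_check())
```

A test like `racy_check` may pass for weeks on a fast machine and then fail at a low rate once thousands of runs per week execute on loaded guests, which fits the sub-1% failure rates described above.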
Contributing Patterns
Keeping up with categorization
Next Steps
● Deprecating old /rechecks/ page
● Finding patterns in the patterns
– Is this only some providers?
– Is this only some configurations?
● Converting from frequency to percentages
– frequency graphs are cool, but misleading at times
– add error bars!
● Packaging up for easier consumption
● Optimizations on data collection
– We hit Elastic Search really hard
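One way to turn raw hit counts into percentages with error bars is a binomial confidence interval. The sketch below uses the Wilson score interval; that choice, the function name, and the 95% default are assumptions for illustration, not something the slides specify.

```python
import math


def failure_rate_with_interval(failures, runs, z=1.96):
    """Return (rate, low, high) using the Wilson score interval.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Same number of hits, very different certainty about the underlying rate.
for failures, runs in [(15, 1500), (15, 150)]:
    rate, low, high = failure_rate_with_interval(failures, runs)
    print("%d/%d = %.2f%% (%.2f%% to %.2f%%)"
          % (failures, runs, rate * 100, low * 100, high * 100))
```

Reporting the interval alongside the rate shows when a pattern has too few samples to support a conclusion, which a bare frequency graph hides.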
Thank You!
Elastic Recheck's Valiant Contributors
Joe Gordon, Sean Dague, Matt Riedemann, Matthew Treinish, Clark Boylan, Salvatore Orlando, James E. Blair, Peter Portante, Davanum Srinivas, Sergey Lukjanov, Attila Fazekas,
Masayuki Igawa, Jeremy Stanley, Dolph Mathews, Brant Knudson, Anita Kuno, Michael Still, Allison Randal, Russell Bryant, Jerry Zhao, Christopher Yeoh, Thierry Carrez,
Akihiro Motoki, Adam Gandelman, Mark McLoughlin, Sean M. Collins, Michael Krotscheck, Alexis Lee, Dean Troyer, Ken'ichi Ohmichi, Andrew Laski, Mohammed Naser, Sahid Orentino Ferdjaoui