Killer Bugs From Outer Space
DESCRIPTION
Working with software means working with bugs. Bugs in software, bugs in hardware; bugs in Open Source code, bugs in proprietary code. If software is eating the world, bugs might end up taking the first bite. We will present a few typical bugs, some of them famous, some of them infamous (including bugs that actually killed people). Since one can never be too well-prepared to fend off the next infestation, we will give tools, tips, and best practices to fix bugs in Open Source software. We will give real-world examples of Really Mysterious Bugs (sometimes nicknamed "Heisenbugs" because they tend to disappear when you try to observe them), and how they were fixed, in Node.js, Docker, and the Linux Kernel.
TRANSCRIPT
Killer Bugs From Outer Space
Jérôme Petazzoni — @jpetazzo
LinuxCon — Chicago — 2014
Why this talk?
Codito, ergo erro
I code, therefore I make mistakes
Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
I like bullet points!
● And I carry a pager.
Introduction(s)
A pager is a device that wakes you up, or tells you to stop whatever you’re doing, so you can fix other people’s bugs.
WE HATESSS THEMSS.
What about you?
● Do you write code?
● Does it sometimes have bugs?
● Do you fix them?
● Do you fix other people’s code too?
● Do you carry a pager?
● Do you love it?
Outline
● Let’s talk about some really nasty bugs
● How they were found, how they were fixed
● How to be prepared next time
● This is not about testing, TDD, etc.
(when the bugs are there, it’s too late anyway)
Outline
● Node.js
● Harmless hardware bugs
● Docker
● Harmful hardware bugs
● Linux
Node.js
Context
● Hipache* is a reverse proxy written in Node.js
● Handles a bit of traffic
○ >100 req/s
○ >10K virtual hosts
○ >10K different containers
● Vhosts and containers change all the time
(more than once per minute)
*Hipache is Hipster’s Apache. Sorry.
The bug
It all starts with an angry customer.
“Sometimes, our application will crash, because this 700 KB JSON file is truncated by Hipache!”
What about Content-Length?
The client code should scream, but it doesn’t.
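As a hedged illustration (the function name and shape are made up, not taken from Hipache or the customer’s client), here is the kind of check that would make the client scream: compare the body actually received against the Content-Length header the server promised.

```javascript
// Hypothetical sanity check: did we receive as many bytes as the
// server's Content-Length header promised?
function isTruncated(headers, body) {
  const expected = Number(headers['content-length']);
  return Number.isFinite(expected) && body.length < expected;
}

console.log(isTruncated({ 'content-length': '700000' }, Buffer.alloc(123)));      // true
console.log(isTruncated({ 'content-length': '11' }, Buffer.from('hello world'))); // false
```

A check like this turns a silent truncation into a loud client-side error.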
Let’s sniff some packets
Log into the load balancer (running Hipache)...
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrycustomer.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...
Too much traffic, not enough visibility!
# tcpdump -peni any -s0 -w dump tcp port 80
(Wait a bit)
^C
Transfer dump file
DEMO TIME!
What did we find out?
● Truncated files happen because a chunk (probably exactly one) gets dropped.
But:
● Impossible to reproduce locally.
● Only the customer sees the problem.
TONIGHT, WE DINE IN CODE!
This is Node.js.
I have no idea what I’m doing.
● Warm up the debuggers!
● … but Node.js is asynchronous, callback-driven, spaghetti code
● Hmmmm, spaghetti
● Plan B: PRINT ALL THE THINGS
You need a phrasebook!
● How do you say “printf” in your language?
● How do you find where a function comes from?
● How do you trace the standard library?
Shotgun debugging
● Add console.log() statements everywhere:
○ in Hipache
○ in node-http-proxy
○ in node/lib/http.js
● For the last one (part of the std lib), we need to:
○ replace require(‘http’) with require(‘_http’)
○ add our own _http.js to our node_modules
○ do the same to net.js (in “our” _http.js)
● Now analyze a big stream of obscure events!
● Let There Be Light
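The shim trick boils down to wrapping the real code so calls get logged before being forwarded. Here is a minimal sketch of that wrap-and-log pattern; the `traced` helper is invented for illustration, and tracing JSON.parse is just a stand-in for tracing node’s http internals:

```javascript
// Wrap a function so every call is logged, then forwarded unchanged.
function traced(name, fn) {
  return function (...args) {
    console.log(name, 'called with', args);
    return fn.apply(this, args);
  };
}

// Stand-in target: trace JSON.parse instead of node/lib/http.js.
const realParse = JSON.parse;
JSON.parse = traced('JSON.parse', realParse);
const doc = JSON.parse('{"status": "ok"}');  // logged, then parsed as usual
JSON.parse = realParse;                      // undo the shim
```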
Interlude about pauses
● With Node.js, you can pause a TCP stream.
(Node.js will stop reading from the socket.)
● Then, whenever you are ready to continue, you are supposed to send a resume event.
● Hipache does that: when a client is too slow, it will pause the socket to the backend.
SO FAR, SO GOOD
What really happens
● There are two layers in Node: tcp and http.
● When the tcp layer reads the last chunk, the backend closes the socket (it’s done).
● The tcp layer notices that the socket is now closed, and emits an end event.
● The end event bubbles up to the http layer.
● The http layer finishes what it was doing, without sending a resume.
● Node never reads the chunks in the kernel buffers. They are lost, forever alone.
How do we fix this?
Pester Node.js folks
Catch that end event, and when it happens, send a resume to the stream to drain it.
(Implementation detail: you only have the http socket, and you need to listen for an event on the tcp socket, so you need to do slightly dirty things with the http socket. But eh, it works!)
What did we learn?
When you can’t reproduce a bug at will, record it in action (tcpdump) and dissect it (wireshark).
Spraying code with print statements helps.(But it’s better to use the logging framework!)
You don’t have to know Node.js to fix Node.js!
Harmless hardware bugs
Intel Pentium
(insert appropriate ©™ where required)
● Pentium FDIV bug (1994)
○ errors at the 4th decimal place
○ fixed by replacing CPUs
○ cost (for Intel): $475,000,000
○ cost (for users): approx. $0
● Pentium F00F bug (1997)
○ executing a specific invalid instruction hangs the machine
○ fixed in software
○ cost: ???
ATA ribbon cables
● Touch or move those cables: the transfer speed changes
● SATA was introduced in 2003, and (mostly) addresses the issue
● Vibration is still an issue, though
Docker
(because even when it’s not about Docker, it’s still about Docker)
Bug: it never works the first time
# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a: fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
Strace to the rescue!
Steps:
1. Boot the machine.
2. Find the pid of the process to analyze. (ps | grep, pidof docker...)
3. strace -o log -f -p $PID
4. docker run -t -i ubuntu echo hello world
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.
Note: you can also strace a command directly, e.g. “strace ls”.
Let’s compare the log files
● Thousands and thousands of lines.
● Look for the error message in file A. (e.g. “operation not permitted”)
● If lucky: it will reveal the issue.
● Otherwise, look at what happens in file B.
● Other approach: start from the beginning or the end, and try to find the point where things started to diverge.
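The divergence approach can be automated a little. This hedged sketch (the helper is invented for illustration; real strace logs would also need pids, fds, and addresses normalized first) finds the first line where two logs stop matching:

```javascript
// Return the index of the first line where two logs differ,
// or -1 if they are identical.
function firstDivergence(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : n;
}

const good = ['setsid()', 'dup2(14, 0)', 'ioctl(0, TIOCSCTTY) = 0'];
const bad  = ['setsid()', 'dup2(10, 0)', 'ioctl(0, TIOCSCTTY) = -1 EPERM'];
console.log(firstDivergence(good, bad));  // 1
```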
Specialized hardware helps
● Now you have a good reason to ask your CFO about that dual 30” monitor setup!
Investigation results
First time
[pid 1331] setsid() = 1331
[pid 1331] dup2(10, 0) = 0
[pid 1331] dup2(10, 1) = 1
[pid 1331] dup2(10, 2) = 2
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
[pid 1331] write(12, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1331] _exit(253) = ?
Second time (and every following attempt)
[pid 1414] setsid() = 1414
[pid 1414] dup2(14, 0) = 0
[pid 1414] dup2(14, 1) = 1
[pid 1414] dup2(14, 2) = 2
[pid 1414] ioctl(0, TIOCSCTTY) = 0
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
What does that mean?
● For some reason, the code wants file descriptor 0 (stdin) to be a terminal.
● The first time we run, it fails; but in the process, we acquire a controlling terminal.
(UNIX 101: when you don’t have a controlling terminal and open a file which is a terminal, it becomes your controlling terminal, unless you open the file with the O_NOCTTY flag.)
● Next attempts are therefore successful.
… Really?
To confirm that this is indeed the bug:
● reproduce the issue (start the process with “setsid”, to detach it from the controlling terminal)
● check the output of “ps” (it shows controlling terminals)
# before
23083 ?     Sl+ 0:12 ./docker -d -b br0
# after
23083 pts/6 Sl+ 0:12 ./docker -d -b br0
V I C T O R Y
What did we learn?
● You can attach to running processes.
● strace is awesome. It traces syscalls.
● ltrace is awesome too. It traces library calls.
● gdb is your friend. (A very peculiar friend, but a friend nonetheless.)
Harmful hardware bugs
“Errare humanum est,perseverare autem diabolicum”
“To err is human,but to really foul things up, you need a computer”
Really nasty (and sad) bug: the Therac-25
● Radiotherapy machine (shoots beams to cure cancer)
● Two modes:
○ low energy (direct exposure)
○ high energy (beam hits a special target/filter first)
The problem
● In older versions of the machine, a hardware interlock prevented the high energy beam from shooting if the filter was not in place.
● On the Therac-25, that interlock is implemented in software.
● What could possibly go wrong?
What went wrong
● 6 people got radiation burns
● 3 people died
● … over the course of 3 years (1985 to 1987)
Konami Code of Death
On the keyboard, press (in less than 8 seconds):
X ↑ E [ENTER] B
...And the high energy beam shoots, unfiltered!
How could it happen?
● Race condition in the software.
● Never happened during tests:
○ the tests did not include “unusual sequences” (which were not that unusual after all)
○ test operators were slower than real operators
Aggravating details
● Many engineering and institutional issues
○ No code review
○ No evaluation of possible failures
○ Undocumented error codes
○ No sensor feedback
● The machine had tons of “normal errors”
○ And operators learned to ignore them
● So the “real errors” were ignored too
○ Just hit retry: same player, shoot again!
Let’s get back to weird Linux Kernel bugs
Linux Kernel
(and spinlocks and Xen and ...)
Random crashes on EC2
● Pool of ~50 identical instances
● Same role (run 100s of containers)
● Sometimes, one of them would crash
○ Total crash: no SSH, no ping, no log, no nothing
● EC2 console won’t show anything
● Impossible to reproduce
Try a million things...
● Different kernel versions
● Different filesystem tunings
● Different security settings (GRSEC)
● Different memory settings (overcommit, OOM)
● Different instance sizes
● Different EBS volumes
● Different differences
● Nothing changed
And one fine day...
● One machine crashes very often (every few days, sometimes every few hours)
CLONE IT!
ONE MILLION TIMES!
A New Hope!
● Change everything (again!)
● Find nothing (again!)
● Do something crazy: contact AWS support
● Repeat tests on the “official” image (AMI)
(this required porting our stuff from Ubuntu 10.04 to 12.04)
Happy ending
● Re-ran tests with the official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image
“oh yeah it’s a known issue, see that link.”
U SERIOUS?
● The bug only happens:
○ on workloads using spinlocks intensively
○ only on Xen VMs with many CPUs
● Spinlocks = actively spinning the CPU
● On VMs, you don’t want to hold the CPU
● Xen has a special implementation of spinlocks
When waking up CPUs waiting on a spinlock, the code would only wake up the first one, even if there were multiple CPUs waiting.
I can explain!
The patch (priceless)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..67bc7ba 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
if (per_cpu(lock_spinners, cpu) == xl) {
ADD_STATS(released_slow_kicked, 1);
xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
- break;
}
}
}
--
What did we learn?
We didn’t try all the combinations. (Trying on HVM machines would have helped!)
AWS support can be helpful sometimes.(This one was a surprise.)
Trying to debug a kernel issue without console output is like trying to learn to read in the dark.(Compare to local VM with serial output…)
Overall Conclusions
When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark, strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!
One last thing...
● Get all the help you can get!
● Your developers will rarely reproduce bugs (Ain’t nobody got time for that)
● Your support team will (They talk to your customers all the time)
● Help your support team to help your devs
● Bonus points if your support team fixes bugs
Thank you! Questions?