Copyright © 2005
EEM202A/CSM213A - Fall 2005
Ram Kumar & Roy Shea
UCLA - NESL
{ram@ee,roy@cs}.ucla.edu
http://nesl.ee.ucla.edu
Lecture #12: Reliable Embedded Software

Reading List for this Lecture
• “The Model Checker Spin”, IEEE Trans. on Software Engineering, May 1997.
• D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.
• G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
• Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering, 1981.
• Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. “Efficient Software-Based Fault Isolation”. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles (SOSP-93).
– http://citeseer.ist.psu.edu/wahbe93efficient.html
• Niall Murphy. “Watchdog Timers”. Embedded Systems Programming.
– http://www.embedded.com/2000/0011/0011feat4.htm

Outline
• Overview of design process
• Static analysis
– Concurrency
– Memory usage
• Runtime monitoring
– Detection
– Isolation
• Hardware support
• Conclusions
[Figure: design process stages: Specification, Implementation (static analysis), Testing, Deployment (runtime monitoring)]

Overview of Software Design Process
• Specification
– Understand task and constraints
– Develop formal models for protocols
– “The Model Checker Spin”, IEEE Trans. on Software Engineering, May 1997.
• Testing
– Feed inputs
– Stress test
– Long test
• Implementation*
– Coding standards
– Code reviews and pair programming
– Static analysis
• Deployment*
– Fault detection
– Isolation
– Feedback

What and Why of Static Analysis
• “Testing and verification of a system without running the code”
• Specification may not be implemented correctly
• Not all errors appear during test runs
– Concurrency problems with timing dependence
– Faults under specific system loads
• Complements other techniques
• Detects errors early, as type checking does

Techniques
• Create abstract model of the program
– Direct reasoning about code is hard
– Basic blocks or AST
– G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
• Examine the model
– Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering, 1981.
– Dataflow to track state through a program
#include <stdlib.h>
#include <stdio.h>

int main() {
    int x;
    int y;
    x = rand() % 10;
    y = rand() % 9;
    if (x > y) {
        x = x * x / 2;
    } else {
        x = y / 2;
        y = y * x;
    }
    printf("X+Y = %d", x + y);
    return 0;
}

Example: Concurrency
• Problem
– Shared data can be corrupted by concurrent accesses
– Concurrency is a problem even without threading (why?)
• Solution
– Annotate atomic code blocks
– Infer what must be protected
– Verify protection by looking at code base
• D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.

Example: Memory Management
• Problem
– Dynamic memory in embedded applications can result in difficult-to-understand bugs and strange errors
– Dangling pointers, memory leaks, data corruption
• Important benefits of dynamic memory
– Significantly simplify code base
– Dynamic Memory Allocation in Embedded Apps?
– http://ask.slashdot.org/article.pl?sid=05/11/16/2236235&tid=156&tid=201&tid=4
int *p = malloc (sizeof(int)*num);
int *q = malloc (sizeof(int)*num*2);
int *r = p;                        /* r aliases p */
...
free(r);                           /* p now dangles */
...
if (p[0] == 0) launchMissile();    /* use after free */
Model for Memory
Formalized by Shane Markstrum

Implementation
• Convert module into an AST
• Use data flow to track annotated data
– __attribute__((sos_claim))
– __attribute__((sos_release))
• Must either:
– persistently store data once
– free data
– release data to ownership of another module
• Must not create any persistent references to data before call
• Must treat data as dead after the call
[Figure labels: caller, callee]

Outline
• Run Time Techniques
– Operate during the execution of system
– Access to more information than the static analysis tools
– Introduce performance overheads
• Fault Isolation
– Localize the impact of the fault
– Specifically looking at memory corruption faults
• Fault Tolerance
– Detect and recover from a fault
  • Restore to a known good state
  • Re-initialize the state
– Specifically looking at hardware/architecture based techniques

Memory Corruption Fault Within Single Address Space
• A program is free to access the entire address space
• Memory Corruption Fault
– Very easy for a program to corrupt the state of other programs
• Desktop/Server CPUs have MMU
– No MMU in Embedded Processors (esp. micro-controllers)
– Ruled out by power, performance, and cost constraints
[Figure: program memory (applications, middleware, operating system) and data memory (run-time stack, global data and heap), each a single continuous address space]

Software Fault Isolation (SFI)
• Rewrite the program to perform fault isolation in software
– Simple but a very powerful concept
– Useful even in servers/desktops for high performance application extensions, kernel extensions etc.
• Trade slower instrumented code for more protection
– No need for a hardware protection boundary
• Slogan - You can still shoot yourself in the foot, but you can’t shoot the other guy in the foot
Ack. Prof. Aiken UCB

Overview
• Maintain two invariants for isolated code
– Any jumps stay within the isolated code
– Any writes are to data belonging to the isolated code
• Idea: Divide the address-space into segments
– Segment addresses have unique high-order bits
• Protection subdomains are defined by segments
– Every write must be within the segment
– Every jump must be within the segment

Fault Domain
[Figure: program (PROG) and data (DATA) memory divided into fault domains for the sampling application, middleware, and operating system, with the run-time stack]
• No jumps outside fault domain
• No writes outside fault domain

Implementation - Segment Matching
• Replace each store by the sequence:
dedicated-reg ← target address
– Move target address into dedicated register
scratch-reg ← (dedicated-reg >> shift-reg)
– Right shift address to get segment identifier
– Shift-reg is dedicated
compare scratch-reg and segment-reg
– Compare segment identifier with current segment
– Segment-reg is dedicated
trap if not equal
– Trap if store address is outside of the segment
store through dedicated-reg
– Guaranteed to store at the correct address

Comments
• Segment matching overhead
– 4 instructions for EVERY store instruction in the program
• Requires three dedicated registers
– Dedicated-reg holds the address being computed
– Segment-reg holds current valid segment
– Shift-reg holds the size of the shift to perform
– These three registers will not be used in the program
• Why dedicated registers?
– What will happen if a jump instruction bypasses all checks?
– What will happen if a jump lands in the middle of the checks?

Sandboxing - Faster Approach
• Idea
– Don’t test the segment bits
– Just overwrite segment bits with correct segment
dedicated-reg ← (target-reg & and-mask-reg)
– Use dedicated register and-mask-reg to clear segment identifier bits
dedicated-reg ← (dedicated-reg | segment-reg)
– Use dedicated register segment-reg to set segment identifier bits
• This is much faster
– Only two instructions per instrumentation point
• Loses information about errors
– Program may keep running with incorrect instructions and data

Implementation Details
• Optimizations
– Traditional compiler optimizations
  • Move sandboxing out of the loop
– Don’t instrument statically verifiable writes and jumps
• Binary instrumentation
– Most portable & easily deployed
– Also the hairiest option
– Need to verify the binary
  • No use of dedicated registers
• Modified compiler
– Less easy to adopt
– But easier to implement

Things to ponder about …
• How will the applications residing in their respective fault domains communicate with one another?
• How will the data be shared amongst the fault domains?
• How will SFI be implemented on micro-controllers with less than 1 KB of memory?

Embedded Systems In Real World
• Used in inaccessible places
– Controllers for space vehicles - Mars Pathfinder
– Closer to home … sensor networks in dense forests
• Used for critical applications
– Brake-by-wire systems
– Medical Instruments
• Unexpected faults
– Cosmic rays may flip on-chip bits
• Hard or even impossible to produce perfect firmware
– Strive to design our systems to cleanly handle failures

Watchdog Timer Hardware
• Hardware counter that is set to an initial value
• Continually counts down to zero
• Software is responsible for resetting the count to its original value
• When the counter reaches zero, the software is assumed to have failed
• Perform any suitable recovery
– Typically reset the CPU
• Visual Metaphor
– “If the man stops kicking the dog, the dog will take advantage of the hesitation and bite the man.”

Failures detected by watchdog
• Catch events that hang the system
• Transient Failures
– Power glitches may corrupt program counter, stack pointer or even the data in RAM
• Software Bugs
– Infinite loops
– Accidental jump out of code memory
– Deadlock conditions (Incorrect design)
• Watchdog guarantees that none of the bugs will hang the system indefinitely

Watchdog Design Considerations
• First Aid - Recovery from watchdog bite
• Maintain a count of number of resets
– Shut down a persistently errant application
• Use watchdog for sanity checks
– Verify the control flow through a piece of code
– Record failure reports in non-volatile storage
– Diagnostic information is very useful
• Choosing watchdog timeout interval
– Need to understand the timing characteristics of the program
– Large interval - Slow response
– Small interval - Frequent resets, difficult to diagnose
• Space Shuttle’s main engine controller
– WDT timeout 18 ms
– Switchover to a backup computer

Watchdog Self Test
• What if WDT fails in a way that it never bites?
• Would be discovered only if a failure hangs the system
• WDT failure can happen very easily
– WDT can be disabled in software
– HW misconfiguration - Jumper of reset line pulled out
• Startup self-test
– Allow WDT to time out and reset the processor
– Flag to distinguish power on reset from WDT reset

Grenade Timer
• Idea - Build a counter that cannot be reloaded once it is running
– Grenade whose pin has been pulled will have to explode
• Guaranteed reboot is a “useful feature” in some applications
– Purges all bad state and re-initializes the system
[Figure: Grenade Timer HW Interface]

Taxonomy
• FAILURE
– Event that occurs when the delivered service deviates from the correct service
– Failure is the effect that is observed
– E.g. - “Your iPod Nano stops responding.”
• FAULT
– Fault is the cause of an error
– An error may lead to failure
– E.g. - “Memory corruption fault led to the failure of the iPod”