Optimizing SystemC for Higher Speed and Coverage

Dogan Fennibay

Why?
  SystemC is becoming the de facto system-level design language
  SystemC emulates parallelism via scheduling
    An additional element affecting the result
    Make up for this hole in coverage
  We want faster simulations
    To do more executions / cheaper executions
    SystemC's flexibility adds up to slowness

Outline
  Automatic generation of schedulings for higher coverage
    Introduction
    Related work
    Definitions
    Algorithms
    Evaluation
  Scoot
    Introduction
    Related work
    Idea
    Evaluation
  Conclusion

SystemC
  Do you know SystemC? (No / Yes)

Introduction
  3 different schedulings => 3 different results
    a; b; a; te; b; a => "Ok"
    a; b; a; te; a; b => "Ko"
    b; a; te; b => deadlock (lost notification)
  Include process C
    30 different schedulings => same 3 different results
    Equivalence classes

    void top::A() {
      wait(e);
      wait(20, SC_NS);
      if (x) cout << "Ok\n";
      else cout << "Ko\n";
    }

    void top::B() {
      e.notify();
      x = 0;
      wait(20, SC_NS);
      x = 1;
    }

    void top::C() {
      sc_time T(20, SC_NS);
      wait(T);
    }
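The three schedulings of A and B can be replayed without a SystemC kernel. The following is a minimal sketch in plain C++, not the SystemC scheduler: the function name `replay`, the step strings, and the initial value of `x` are assumptions made for illustration, and `te` is modelled simply as "all pending 20 ns waits expire".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Replays one scheduling of processes A and B from the slide.
// Steps: "a" = run A until its next wait, "b" = same for B,
// "te" = time elapses (here: pending 20 ns waits expire).
// Returns what A prints ("Ok"/"Ko"), "deadlock" if A never prints,
// or "invalid" if a step selects a process that is not runnable.
std::string replay(const std::vector<std::string>& scheduling) {
    int pcA = 0, pcB = 0;                 // next transition of each process
    bool runA = true, runB = true;        // runnable flags
    bool a_waits_e = false;               // A is blocked in wait(e)
    bool timedA = false, timedB = false;  // blocked in wait(20, SC_NS)
    int x = 0;                            // assumed initial value of x
    std::string printed;

    for (const std::string& step : scheduling) {
        if (step == "a") {
            if (!runA) return "invalid";
            runA = false;
            if (pcA == 0) a_waits_e = true;        // wait(e)
            else if (pcA == 1) timedA = true;      // wait(20, SC_NS)
            else printed = x ? "Ok" : "Ko";        // if (x) ...
            ++pcA;
        } else if (step == "b") {
            if (!runB) return "invalid";
            runB = false;
            if (pcB == 0) {
                // e.notify() is immediate: lost if nobody waits on e yet
                if (a_waits_e) { a_waits_e = false; runA = true; }
                x = 0;
                timedB = true;                     // wait(20, SC_NS)
            } else {
                x = 1;
            }
            ++pcB;
        } else if (step == "te") {
            if (timedA) { timedA = false; runA = true; }
            if (timedB) { timedB = false; runB = true; }
        } else {
            return "invalid";
        }
        assert(pcA <= 3 && pcB <= 2);  // A has 3 transitions, B has 2
    }
    return printed.empty() ? "deadlock" : printed;
}
```

Calling `replay({"a", "b", "a", "te", "b", "a"})` follows the first scheduling on the slide; swapping the last two steps gives the second one, and `{"b", "a", "te", "b"}` loses the notification.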

Introduction
  Scheduling also affects the results
    Not just input data
  We have to test all possible schedulings
    Impossible schedulings: do not test them
      Due to synchronization constraints
    Equivalent schedulings: test only one
      e.g. two reads from a shared variable
  Focus is on scheduling
    Input data generation is not considered

Introduction: Dynamic Dependency Graph

    void top::P() {
      wait(e);
      wait(20, SC_NS);
      if (x) cout << "Ok\n";
      else cout << "Ko\n";
    }

    void top::Q() {
      e.notify();
      x = 0;
      wait(20, SC_NS);
      x = 1;
    }

  Graph legend: green: non-permutable, red: non-commutative

Related work
  Formal model
    Extract a formal model from the SystemC model
    Combine with a formal model of the non-deterministic scheduler
    => Model checking
    State space explosion
  Partial order reduction
    Dynamic extension is new
    Used by model checkers, but no non-abstract uses
  Test case generation & output checker
    Assertion-based verification is promising

Definitions
  SUTD: System Under Test + one test data
    Assume: independent test case generator
    Generator always independent of scheduling
  Process: event or thread
    p, q, r, …
  Transition: one execution of a process in a scheduling
    a, b, c, … or p1, p2, q1, r1, p3, r2, …
  Scheduling
    String of transitions & new cycles (delta or te)
  Full state
    Full memory dump incl. PC of processes

    p:
      a = x;
      wait(e);
      printf("%d\n", a);
      wait(e2);
      a = x * 2;

    q:
      e.notify();
      b = x;
      wait(20, SC_NS);
      x = b * 2;

  Transitions in the figure: p1, p2, p3 (process p); q1, q2 (process q)

Definitions
  Permutation
    Modify a scheduling
    Change the order of a and b
    Other transitions may come in between
  Equivalence
    Two different schedulings lead to the same full state

    p: a = x
    q: b = x
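Equivalence can be checked by executing both orders and comparing the resulting states. The sketch below is only an illustration: `State`, `Transition`, `run` and `equivalent` are hypothetical names, and the state is simplified to the variables `x`, `a` and `b` (the full state on the slide also includes the program counters of the processes).

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Simplified "full state": shared variable x plus the locals of p and q.
struct State {
    int x = 0, a = 0, b = 0;
    bool operator==(const State& o) const {
        return x == o.x && a == o.a && b == o.b;
    }
};

// A transition is modelled as a function that updates the state.
using Transition = std::function<void(State&)>;

// Executes the transitions in the given order and returns the final state.
State run(State s, const std::vector<Transition>& scheduling) {
    for (const Transition& t : scheduling) {
        assert(t);  // every transition must be callable
        t(s);
    }
    return s;
}

// Two schedulings are equivalent if they lead to the same state.
bool equivalent(const State& init,
                const std::vector<Transition>& s1,
                const std::vector<Transition>& s2) {
    return run(init, s1) == run(init, s2);
}
```

With `p` as `a = x` and `q` as `b = x`, both orders give the same state, because both transitions only read `x`. Replacing `q` with a transition that writes `x` makes the two orders distinguishable.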

Definitions
  Permutability: a and b in a scheduling can be exchanged
    An equivalent scheduling with a & b consecutive is available
    a and b can be exchanged in this equivalent scheduling
  Commutativity: which permutations are useful?
    Exchanging a and b produces an equivalent scheduling
    Non-commutative permutations are interesting

    void t1() { … wait(e1); v2 = 2; }

    void t2() { … v2 = 1; e1.notify(); wait(); }

    void t3() { … printf("%d\n", v1); wait(); }

    void t4() { … printf("%d\n", v1 + 1); wait(); }

++v1

Definitions
  Dependency
    Boolean: NOT permutable OR (permutable AND NOT commutative)
    a must come before b, otherwise (1) the scheduling is impossible or (2) a different result will be produced
  Causal order
    Permutable transitions w.r.t. dependency
    Equivalent schedulings have the same causal order

    void t1() { … wait(e1); v2 = 2; }

    void t2() { … v2 = 1; e1.notify(); wait(); }

    void t3() { … printf("%d\n", v1); wait(); }

    void t4() { … printf("%d\n", ++v1); wait(); }

Algorithms: Computing commutativity
  Non-commutative actions
    Shared variables
      Read, then modifying write
      Modifying write, then read
      Write, then modifying write
    Events
      Notification, then wait
      Wait, then notification
      Caught notification, then notification
  All other actions do not harm commutativity
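The three shared-variable rules can be written down as a small predicate. This is a sketch under assumptions: the `Access` record and the function name are hypothetical, "modifying" is taken to mean that the write changes the stored value, and the event rules are left out.

```cpp
#include <cassert>
#include <string>

enum class Kind { Read, Write };

// One access to a shared variable performed by a transition.
struct Access {
    Kind kind;
    std::string var;  // name of the shared variable
    bool modifying;   // for writes: the write changes the stored value
};

// True if `first` followed by `second` matches one of the three
// non-commutative shared-variable patterns from the slide.
bool non_commutative(const Access& first, const Access& second) {
    // Reads never modify the variable.
    assert(first.kind == Kind::Write || !first.modifying);
    assert(second.kind == Kind::Write || !second.modifying);

    if (first.var != second.var) return false;  // different variables

    const bool first_mod = first.kind == Kind::Write && first.modifying;
    const bool second_mod = second.kind == Kind::Write && second.modifying;

    if (first.kind == Kind::Read) return second_mod;  // read, then modifying write
    if (second.kind == Kind::Read) return first_mod;  // modifying write, then read
    return second_mod;                                // write, then modifying write
}
```

Two reads of the same variable fall through to `false`, which matches the earlier remark that equivalent schedulings such as two reads from a shared variable need to be tested only once.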

Algorithms: Causal Partial Order
  Computed step by step
  Start with empty scheduling
  Choose candidates a, b, where
    a or b are new cycles (delta or te)
    a and b are from the same process
    b is woken up by a
  Extend CPO set
    Add (a, b)
    Add non-commutative transitions of b
    Compute & add transitive closure of calculated relations
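The last step, the transitive closure, can be computed with Warshall's algorithm. The sketch below assumes the relation is stored as a square boolean matrix indexed by transition number; that representation and the names are choices made here, not something the slides specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// r[i][j] == true means "transition i is ordered before transition j".
using Relation = std::vector<std::vector<bool>>;

// Returns the transitive closure of r (Warshall's algorithm).
Relation transitive_closure(Relation r) {
    const std::size_t n = r.size();
    for (const auto& row : r) assert(row.size() == n);  // must be square

    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            if (r[i][k])
                for (std::size_t j = 0; j < n; ++j)
                    if (r[k][j]) r[i][j] = true;  // i -> k -> j implies i -> j
    return r;
}
```

For example, if (0, 1) and (1, 2) are in the relation, the closure also contains (0, 2).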

Algorithms: Generating schedulings
  Generating one alternative scheduling
    Choose two non-commutative transitions: a and b
    Execute the scheduling until a
    Execute additional transitions not causally ordered to a
    Execute b, then a
    Execute the rest
  Generating all schedulings
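One way to read the steps for a single alternative scheduling is as a reordering of the recorded transition string. The sketch below is an interpretation, not the authors' implementation: it assumes that "not causally ordered to a" means "not causally after a", that the causal order is given as a transitively closed set of pairs, and that transition names are unique. In the real tool the new scheduling is executed on the kernel rather than built purely as a string.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Scheduling = std::vector<std::string>;
// (x, y) in the set means: x is causally before y (transitively closed).
using CausalOrder = std::set<std::pair<std::string, std::string>>;

// Builds a scheduling in which b runs immediately before a.
Scheduling alternative_scheduling(const Scheduling& s, const std::string& a,
                                  const std::string& b, const CausalOrder& cpo) {
    auto ia = std::find(s.begin(), s.end(), a);
    auto ib = std::find(s.begin(), s.end(), b);
    assert(ia != s.end() && ib != s.end() && ia < ib);  // a precedes b in s
    assert(cpo.count({a, b}) == 0);                     // a, b must be permutable

    Scheduling out(s.begin(), ia);  // the scheduling until a
    Scheduling after_a;             // transitions that causally depend on a
    for (auto it = ia + 1; it != ib; ++it) {
        if (cpo.count({a, *it})) after_a.push_back(*it);
        else out.push_back(*it);    // not causally after a: run it early
    }
    out.push_back(b);               // b, then a
    out.push_back(a);
    out.insert(out.end(), after_a.begin(), after_a.end());  // the rest
    out.insert(out.end(), ib + 1, s.end());
    return out;
}
```

For the hypothetical scheduling p1; q1; r1; p2; q2 with p1 before p2 and q1 before q2 in the causal order, exchanging p1 and q2 yields q1; r1; q2; p1; p2.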

Evaluation: prototype
  Model and kernel instrumented
  Checker
    Get the scheduling, generate a new one, feed it to the patched kernel
    Until no more schedulings are available

Evaluation: experiments
  Is the overhead of calculating schedulings worth it?
    V·T vs. G·T + O
  3 examples
    Indexer
      Small, V calculable
    MPEG decoder
      50 KLOC, 4 processes
    Full SoC
      250 KLOC, 57 processes

Indexer
  128-element array for hash table
  n components, each with 2 threads, each writing 4 elements
  G << V
    n = 2, V = 3.35e11
    n = 3, V = 2.43e25

Evaluation: experiments
  MPEG decoder
    Overhead is insignificant
      G·T = 50 s, O = 18 s
    Special structures in code not recognized
      Persistent events
  Complete SoC
    Scalability
      Not tested fully because of manual instrumentation
      Expectation: ok up to 200 transitions
  Observation: more detailed models produce more constrained schedulings => longer schedulings testable

Scoot
  Helmstetter et al. explore all schedulings
    Too much time spent
  Let's go in the opposite direction
    Make SystemC less flexible to get it faster
    Blanc et al.

Introduction
  Faster execution (up to 5 times!)
  Use verification back-ends
    CBMC, SATABS
  => Get a plain C++ model from SystemC
  => Use C++ frontend to support more language constructs

Related work
  Work on HW synthesis via model extraction
    Kostaras & Vergos and Castillo et al.
    Only for a small subset of C++
  Savoiu et al.
    Speedup via Petri-net reductions
    Only 1.5 times
  Pérez et al.
    Static scheduling
    Only event processes considered

Idea
  SystemC is very flexible
    Dynamic run-time binding of ports
      Via polymorphism
    Sensitivity lists
    Module hierarchy
    => Consolidate hierarchy
  Scheduler's inefficiencies
    Run-time memory allocations
    Processes triggered via function pointers
    => Convert to static schedule
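The contrast between processes triggered via function pointers and a static schedule can be illustrated in plain C++. This is only a sketch of the general idea and not the code Scoot generates: the `Model` struct, its two processes, and both run functions are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model with two processes sharing `value`.
struct Model {
    int value = 0;
    int sum = 0;
    void produce() { ++value; }      // first process
    void consume() { sum += value; } // second process
};

// Dynamic style: processes are registered at run time and
// triggered indirectly through member-function pointers.
using Process = void (Model::*)();

void run_dynamic(Model& m, const std::vector<Process>& processes, int cycles) {
    assert(cycles >= 0);
    for (int c = 0; c < cycles; ++c)
        for (Process p : processes)
            (m.*p)();  // indirect call, resolved at run time
}

// Static style: the order is fixed at compile time, so the
// compiler sees direct calls and can inline them.
void run_static(Model& m, int cycles) {
    assert(cycles >= 0);
    for (int c = 0; c < cycles; ++c) {
        m.produce();
        m.consume();
    }
}
```

Both variants compute the same result when the registered order is produce, then consume; the static one avoids the run-time process list and the indirect calls.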

Evaluation
  AES encryption/decryption
  Speedup achieved up to 5.3 times

Conclusion
  Helmstetter et al.
    Eliminated the effect of scheduling
    At a reasonable overhead
    Problems with scalability
  Scoot
    Significant speedup achieved
    Most structures of C++ supported
    Preparation for model checking

Further discussion

Why equivalence classes among schedulings? Shouldn’t all schedulings produce the same result?

Why not use Helmstetter’s algorithm for regular software to catch races?

Uses for Scoot?

Extras

SystemC Primer
  A system-level design language
  Used for HW/SW codesign
  Based on native C++
  Different abstraction levels: TLM to RTL

    SC_MODULE(nand2) {
      sc_in<bool> A, B;
      sc_out<bool> F;

      void do_nand2() {  // a C++ function
        F.write(!(A.read() && B.read()));
      }

      SC_CTOR(nand2) {
        SC_METHOD(do_nand2);
        sensitive << A << B;
      }
    };

SystemC Primer: Concepts
  Modules
    Containers for other SystemC elements incl. modules
  Channels
    Communication means of modules
  Ports
    Connection point of modules to channels
  Interfaces
    Connection point of channels to modules
  Method processes
    Non-blocking code parts triggered on events they're sensitive to
  Thread processes
    Independent flows of execution
    May call wait
  Events
    Basic means of synchronization
  Shared variables
    Same as C++

SystemC Primer: Scheduler
  Non-deterministic specification: unspecified order
  Non-preemptive
  Delta cycles used to imitate concurrency
  True parallelism on the real system

Properties of relations
  Reflexivity
    aRa
  Symmetry
    aRb => bRa
    e.g. "is equal to"
  Totality
    aRb or bRa
  Transitivity
    aRb and bRc => aRc
    e.g. "is ancestor of"
  Transitive closure
    e.g. all ancestors in a community

References

Blanc, N., Kroening, D. and Sharygina, N., 2008, “Scoot: A Tool for the Analysis of SystemC Models”, TACAS, 2008, 467-470.

Helmstetter, C., Maraninchi, F., Maillet-Contoz, L. and Moy, M., 2006, “Automatic Generation of Schedulings for Improving the Test Coverage of Systems-on-a-Chip”, Verimag Research Report, TR-2006-06.

Helmstetter, C., Maraninchi, F. and Maillet-Contoz, L., 2007, “Test Coverage for Loose Timing Annotations”, Formal Methods: Applications and Technology, 4346/2007, 100-115.
