

Page 1: Multi-Core Simulation of Transaction Level Models using ... › ~doemer › publications › DTSI-2010-06-0044.R1.pdf · Multi-Core Simulation of Transaction Level Models using the


Multi-Core Simulation of Transaction Level Models using the System-on-Chip Environment

Weiwei Chen, Xu Han, Rainer Dömer
Center for Embedded Computer Systems

University of California, Irvine, USA
[email protected], [email protected], [email protected]

Abstract—Effective Electronic System-Level (ESL) design frameworks transform and refine high-level designs into various transaction level models described in C-based System-level Description Languages (SLDLs) and rely on simulation for validation. However, the traditional cooperative Discrete Event (DE) simulation of SLDLs is not effective and cannot utilize any existing parallelism in today's multi-core CPU hosts.

In this work, we present our extension of the DE simulation engine in the SpecC-based ESL framework, System-on-Chip Environment (SCE), to support multi-core parallel simulation. With the proposed communication protection and optimized implementation, our parallel simulator significantly reduces the simulation time of large transaction-level models at several abstraction levels. We demonstrate the benefits of the parallel simulation engine using a case study on an H.264 video decoder and a JPEG encoder application.

I. INTRODUCTION

Modern embedded system platforms often integrate various types of processing elements into the system, including general-purpose CPUs, application-specific instruction-set processors (ASIPs), digital signal processors (DSPs), as well as dedicated hardware accelerators. The large size and great complexity of these systems pose enormous challenges to design and validation using traditional design flows. System designers are forced to move to higher levels of abstraction to cope with the many problems, including a large number of heterogeneous components, complex interconnect, sophisticated functionality, and slow simulation.

At the so-called Electronic System Level (ESL), system design and verification aim at a systematic top-down design methodology which successively transforms a given high-level specification model into a detailed implementation. As one example, the System-on-Chip Environment (SCE) is a refinement-based framework for heterogeneous MPSoC design [5]. SCE starts with a system specification model described in the SpecC language [7] and implements a top-down ESL design flow based on the specify-explore-refine methodology. SCE automatically generates a set of Transaction Level Models (TLM) with increasing amount of implementation detail through step-wise refinement, resulting at the end in a pin- and cycle-accurate system implementation.

Note that each TLM in the design flow explicitly specifies key ESL concepts, including behavioral and structural hierarchy, potential for parallelism and pipelining, communication channels, and constraints on timing. Having these intrinsic features of the application explicit in the model enables efficient design space exploration and automatic refinement by computer-aided design (CAD) tools.

In this work, we exploit the existing explicit parallelism in ESL models to speed up the simulation needed for validation of TLM and virtual prototype system models. So far, model validation is based on traditional discrete event (DE) simulation. The original SCE simulator implements the existing parallelism in the design model in the form of concurrent user-level threads within a single process. The multi-threading model used is cooperative (i.e. non-preemptive), which greatly simplifies communication through events and variables in shared memory. Unfortunately, however, this threading model cannot utilize any available parallelism in multi-core host CPUs which nowadays are common and readily available in regular PCs¹. In Section II, we show an extension of the DE simulation engine in SCE that overcomes this shortcoming by issuing multiple properly synchronized threads in parallel, leading to significant speedup of simulation time.

A. Related Work

Most ESL frameworks today rely on regular DE-based simulators which issue only a single thread at any time to avoid complex synchronization of the concurrent threads. As such, the simulator kernel becomes an obstacle in improving simulation performance on multi-core host machines [8].

A well-studied solution to this problem is Parallel Discrete Event Simulation (PDES) [2], [6], [13]. [10], [12] discuss PDES for Hardware Description Languages, VHDL and Verilog. To apply PDES solutions to today's SLDLs and actually allow parallel execution on multi-core processors, the simulator kernel needs to be modified to issue and properly synchronize multiple OS kernel threads in each scheduling step. [4] and [14] have extended the SystemC simulator kernel accordingly. Clusters with single-core nodes are targeted in [4], which uses multiple schedulers on different processing nodes and defines a master node for time synchronization. A parallelized SystemC kernel for fast simulation on SMP machines is presented in [14], which issues multiple runnable OS kernel threads in each simulation cycle. Our scheduling approach described in Section II is very similar. However, we propose a detailed synchronization protection mechanism which we automatically generate for any user-defined and hierarchical

¹We focus on accelerating a single simulation instance (not on parallel simulation of multiple models or multiple test vectors).


channels, and discuss an implementation optimization as well. Moreover, instead of synthetic benchmarks, we provide results for an actual H.264 video decoder and a JPEG encoder.

In comparison to our previous work [3], we have significantly extended the discussion of proper channel protection in Section II-C3. This paper also contributes a new optimization technique in Section II-D that reduces the number of context switches in the simulator and results in higher simulation performance, as demonstrated in Section III-C. Finally, we show new results for a complete set of TLMs generated by SCE at different levels of abstraction.

II. MULTI-CORE PARALLEL SIMULATION

Design models with explicitly specified parallelism make it promising to increase simulation performance by parallel execution on the available hardware resources of a multi-core host. However, care must be taken to properly synchronize parallel threads.

In this section, we will first review the scheduling scheme in a traditional simulation kernel that issues only a single thread at a time. We will then present our improved scheduling algorithm with true multi-threading capability on symmetric multiprocessing (multi-core) machines and discuss the necessary synchronization mechanisms for safe parallel execution. Without loss of generality, we assume the use of SpecC here (i.e. our technique is equally applicable to SystemC).

A. Traditional Discrete Event Simulation

In both SystemC and SpecC, a traditional DE simulator is used. Threads are created for the explicit parallelism described in the models (e.g. par{} and pipe{} statements in SpecC, and SC_METHODs and SC_THREADs in SystemC). These threads communicate via events and advance simulation time using wait-for-time constructs.

To describe the simulation algorithm², we define the following data structures and operations:

1) Definition of queues of threads th in the simulator:

• QUEUES = {READY, RUN, WAIT, WAITFOR, COMPLETE}
• READY = {th | th is ready to run}
• RUN = {th | th is currently running}
• WAIT = {th | th is waiting for some events}
• WAITFOR = {th | th is waiting for time advance}
• COMPLETE = {th | th has completed its execution}

2) Simulation invariants: Let THREADS = set of all threads which currently exist. Then, at any time, the following conditions hold:

• THREADS = READY ∪ RUN ∪ WAIT ∪ WAITFOR ∪ COMPLETE.
• ∀ A, B ∈ QUEUES, A ≠ B : A ∩ B = ∅.

3) Operations on threads th:

• Go(th): let thread th acquire a CPU and begin execution.
• Stop(th): stop execution of thread th and release the CPU.
• Switch(th1, th2): switch the CPU from the execution of thread th1 to thread th2.

4) Operations on threads with set manipulations: Suppose th is a thread in one of the queues, and A and B are queues ∈ QUEUES.

• th = Create(): create a new thread th and put it in set READY.
• Delete(th): kill thread th and remove it from set COMPLETE.
• th = Pick(A, B): pick one thread th from set A (formal selection/matching rules can be found in [11], Section 4.2.3) and put it into set B.
• Move(th, A, B): move thread th from set A to B.

5) Initial state at beginning of simulation:

• THREADS = {th_root}.
• RUN = {th_root}.
• READY = WAIT = WAITFOR = COMPLETE = ∅.
• time = 0.

²The formal definition of execution semantics can be found in [11].

DE simulation is driven by events and simulation time advances. Whenever events are delivered or time increases, the scheduler is called to move the simulation forward. As shown in Fig. 1a, at any time, the traditional scheduler runs a single thread which is picked from the READY queue. Within a delta-cycle, the choice of the next thread to run is non-deterministic (by definition). If the READY queue is empty, the scheduler will fill the queue again by waking threads who have received events they were waiting for. These are taken out of the WAIT queue and a new delta-cycle begins.

If the READY queue is still empty after event delivery, the scheduler advances the simulation time, moves all threads with the earliest timestamp from the WAITFOR queue into the READY queue, and resumes execution. At any time, there is only one thread actively executing in the traditional simulation.
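The delta- and timed-cycle loop above can be sketched in plain C. This is a deliberately simplified, hypothetical model (event delivery via the WAIT queue is omitted, and each thread body simply runs to completion), meant only to illustrate how the scheduler drains READY before advancing time:

```c
#include <assert.h>
#include <stddef.h>

/* Thread states mirror the queues defined in the text. */
enum state { READY, RUN, WAIT, WAITFOR, COMPLETE };

struct sim_thread {
    enum state st;
    unsigned long wake_time;   /* valid while st == WAITFOR */
};

/* Run the delta/timed-cycle loop of Fig. 1a over an array of threads.
 * Returns the final simulation time. */
unsigned long simulate(struct sim_thread *th, size_t n)
{
    unsigned long now = 0;
    for (;;) {
        /* delta-cycle: pick one READY thread at a time and run it */
        int progressed = 1;
        while (progressed) {
            progressed = 0;
            for (size_t i = 0; i < n; i++) {
                if (th[i].st == READY) {
                    th[i].st = RUN;        /* Pick(READY, RUN) */
                    th[i].st = COMPLETE;   /* toy body: runs to completion */
                    progressed = 1;
                }
            }
        }
        /* READY empty: advance time, wake earliest WAITFOR threads */
        unsigned long earliest = (unsigned long)-1;
        for (size_t i = 0; i < n; i++)
            if (th[i].st == WAITFOR && th[i].wake_time < earliest)
                earliest = th[i].wake_time;
        if (earliest == (unsigned long)-1)
            return now;                    /* no thread left: simulation ends */
        now = earliest;
        for (size_t i = 0; i < n; i++)
            if (th[i].st == WAITFOR && th[i].wake_time == now)
                th[i].st = READY;          /* Move(th, WAITFOR, READY) */
    }
}
```

Note that, exactly as in the text, only one thread is ever in RUN at a time here; the multi-core extension in the next section changes precisely this point.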

B. Multi-Core Discrete Event Simulation

The scheduler for multi-core parallel simulation works the same way as the traditional scheduler, with one exception: in each cycle, it picks multiple OS kernel threads from the READY queue and runs them in parallel on the available cores. In particular, it fills the RUN set with multiple threads, up to the number of CPU cores available. In other words, it keeps as many cores busy as possible.

Fig. 1b shows the extended control flow of the multi-core scheduler. Note the extra loop at the left which issues OS kernel threads as long as CPU cores are available and the READY queue is not empty.
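The issuing loop can be sketched with POSIX threads. The fragment below is a hypothetical simplification (a fixed NCORES of 2, trivial thread bodies, and joining each batch before the next cycle); it only illustrates filling the RUN set with up to one thread per core:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NCORES 2               /* assumed core count for this sketch */

static atomic_int work_done;

static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&work_done, 1);   /* stand-in for a thread body */
    return NULL;
}

/* Issue `ready` runnable threads, at most NCORES in parallel, and wait
 * for each batch to finish (one simulation cycle per batch). */
int run_cycles(int ready)
{
    atomic_store(&work_done, 0);
    while (ready > 0) {
        pthread_t tid[NCORES];
        int batch = ready < NCORES ? ready : NCORES;
        for (int i = 0; i < batch; i++)      /* fill the RUN set */
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < batch; i++)
            pthread_join(tid[i], NULL);
        ready -= batch;
    }
    return atomic_load(&work_done);
}
```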

C. Synchronization for Multi-Core Simulation

The benefit of running more than a single thread at the same time comes at a price: explicit synchronization becomes necessary. In particular, shared data structures in the simulation engine, including the thread queues and event lists, and shared variables in communication channels of the application model need to be properly protected by locks for mutual exclusive access by the concurrent threads.


[Flowcharts omitted: each scheduler loops through delta-cycles (deliver notified events and pick threads from the READY queue) and timed-cycles (advance the simulation time and move the earliest threads from WAITFOR to READY) until the READY queue remains empty; the multi-core variants additionally issue threads while CPU cores are available and synchronize via the central lock and the condition variables.]

(a) Traditional DE scheduler. (b) Multi-core DE scheduler. (c) Optimized multi-core DE scheduler.

Fig. 1: Discrete Event (DE) simulation algorithms.

1) Protecting Scheduling Resources: To protect all central scheduling resources, we run the scheduler in its own thread and introduce locks and condition variables for proper synchronization. More specifically, we use

• one central lock L to protect the scheduling resources,
• a condition variable Cond_s for the scheduler, and
• a condition variable Cond_th for each working thread.

When a working thread executes a wait or waitfor instruction, we switch execution to the scheduling thread by waking the scheduler (signal(Cond_s)) and putting the working thread to sleep (wait(Cond_th, L)). The scheduler then uses the same mechanism to resume the next working thread.
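A minimal pthreads version of this handoff might look as follows. The names mirror the text (L, Cond_s, Cond_th); the turn flag and the single exchange are simplifications of our own, not the full simulator:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;  /* central lock */
static pthread_cond_t Cond_s  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t Cond_th = PTHREAD_COND_INITIALIZER;
static int turn;    /* 0: worker may run, 1: scheduler may run */
static int steps;

/* Worker side of a wait/waitfor: wake the scheduler, sleep on Cond_th. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&L);
    turn = 1;
    pthread_cond_signal(&Cond_s);      /* signal(Cond_s): wake scheduler */
    while (turn != 0)
        pthread_cond_wait(&Cond_th, &L);  /* wait(Cond_th, L): sleep */
    steps++;                           /* resumed by the scheduler */
    pthread_mutex_unlock(&L);
    return NULL;
}

/* Scheduler side: sleep until woken, then resume the working thread. */
int handoff_demo(void)
{
    pthread_t t;
    steps = 0;
    turn = 0;
    pthread_mutex_lock(&L);
    pthread_create(&t, NULL, worker, NULL);
    while (turn != 1)
        pthread_cond_wait(&Cond_s, &L);    /* wait(Cond_s, L) */
    turn = 0;
    pthread_cond_signal(&Cond_th);         /* signal(Cond_th): resume */
    pthread_mutex_unlock(&L);
    pthread_join(t, NULL);
    return steps;
}
```

The while loops around each pthread_cond_wait guard against spurious wakeups, which is why the turn flag accompanies the condition variables.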

2) Protecting Communication: Communication between threads also needs to be explicitly protected as SpecC channels are defined to act as monitors³. That is, only one thread at a time may execute code wrapped in a specific channel instance.

To ensure this, we introduce a lock ch→Lock for each channel instance which is acquired at entry and released upon leaving any method of the channel. Fig. 2a shows this for the example of a simple circular buffer with fixed size.

3) Channel Locking Scheme: In general, however, channels can be more complex, with multiple nested channels and methods calling each other. For such user-defined channels, we automatically generate a proper locking mechanism as follows. Fig. 2b shows an example of a channel ChnlS wrapped in another channel ChnlP. ChnlP has a method ChnlP::f() that calls another channel method ChnlP::g(). The inner channel ChnlS has a method ChnlS::f() that calls back the outer method ChnlP::g().

In order to avoid duplicate lock acquisition and early release in channel methods, we introduce a counter th→ch→lockCount for each channel instance in each thread th, and a list th→list of acquired channel locks for each working thread. Instead of manipulating ch→Lock in every channel method, th→ch→lockCount is incremented at the entry and decremented before leaving the method, and ch→Lock is only acquired or released when th→ch→lockCount is zero. ch→Lock is added into curTh→list of the current thread when it is first acquired, and removed from the list when the last

³The SpecC Language Reference Manual V2.0 explicitly states in Section 2.3.2.j that channel methods are protected for mutual exclusive execution.

    send(d)
    {
      Lock(this->Lock);
      while (n >= size) {      // buffer full: wait for space
        ws++;
        wait(eSend);
        ws--;
      }
      buffer.store(d);
      if (wr) {                // wake a waiting receiver
        notify(eRecv);
      }
      unLock(this->Lock);
    }

    receive(d)
    {
      Lock(this->Lock);
      while (!n) {             // buffer empty: wait for data
        wr++;
        wait(eRecv);
        wr--;
      }
      buffer.load(d);
      if (ws) {                // wake a waiting sender
        notify(eSend);
      }
      unLock(this->Lock);
    }

(a) Queue channel implementation for multi-core simulation.

[Block diagram omitted: channel ChnlS nested inside channel ChnlP, with method calls between ChnlP::f(), ChnlP::g(), and ChnlS::f().]

(b) Example of user-defined hierarchical channels.

    channel chnlS; channel chnlP;
    interface itfs
    {
      void f(chnlP* chp);
      void g();
    };

    channel chnlP() implements itfs
    {
      chnlS chs;
      void f(chnlP* chp) {
        chs.f(this);   // (a) inner channel function call
        g();           // (b) call another member function
      }
      void g() {}
    };

    channel chnlS() implements itfs
    {
      void f(chnlP* chp) { chp->g(); }
      // (c) outer channel function call back
    };

(c) SpecC description of the channels in (b).

Fig. 2: Protecting synchronization in channels for multi-core parallel simulation.

method of ch in the call stack exits. Note that, in order to avoid the possibility of deadlock, channel locks are stored in the order of acquiring. The working thread always releases them in the reverse order and keeps the original order when re-acquiring them. Fig. 3 illustrates the refined control flow in a channel method with the proper locking scheme.


[Flowchart omitted: a channel method calls lockLocal(), performs its channel tasks, and, whenever scheduling is needed, releases its acquired channel locks in reverse order of acquiring, acquires the central lock L, runs the scheduler, releases L, and re-acquires the channel locks in their original order; on exit it calls unlockLocal().]

    ch->lockLocal()
    {
      if (curTh->ch->lockCount == 0) {
        Lock(ch->Lock);
        curTh->list.Append(ch->Lock);
      }
      curTh->ch->lockCount++;
    }

    ch->unlockLocal()
    {
      curTh->ch->lockCount--;
      if (curTh->ch->lockCount == 0) {
        curTh->list.Remove(ch->Lock);
        unLock(ch->Lock);
      }
    }

Fig. 3: Synchronization protection for the member functions of communication channels.
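The effect of the lockCount mechanism can be exercised in a small single-threaded C model. The struct fields and function names below are hypothetical stand-ins for curTh→ch→lockCount and ch→Lock; the two nested lock_local() calls model ChnlP::f() being re-entered via ChnlS::f() calling back ChnlP::g():

```c
#include <assert.h>

/* Toy model of the per-thread lock counter: the real channel lock is
 * only taken/released when the count crosses zero. */
struct chan {
    int lock_count;   /* nesting depth of this channel in the thread */
    int lock_held;    /* models ch->Lock being held */
    int acquisitions; /* how often the real lock was actually taken */
};

void lock_local(struct chan *ch)
{
    if (ch->lock_count == 0) {   /* outermost entry: take the real lock */
        ch->lock_held = 1;
        ch->acquisitions++;
    }
    ch->lock_count++;
}

void unlock_local(struct chan *ch)
{
    ch->lock_count--;
    if (ch->lock_count == 0)     /* outermost exit: release the real lock */
        ch->lock_held = 0;
}

/* ChnlP::f() enters; ChnlS::f() (its own lock not modeled here) calls
 * back into ChnlP::g(), re-entering the same channel instance. */
void nested_calls(struct chan *chnlP)
{
    lock_local(chnlP);     /* enter ChnlP::f(): count 0 -> 1, acquire   */
    lock_local(chnlP);     /* enter ChnlP::g(): count 1 -> 2, no re-acquire */
    unlock_local(chnlP);   /* leave ChnlP::g() */
    unlock_local(chnlP);   /* leave ChnlP::f(): count 1 -> 0, release   */
}
```

Without the counter, the inner entry would either deadlock on a non-recursive lock or release it too early; with it, the lock is taken exactly once per outermost call.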

The combination of a central scheduling lock and individual locks for channel and signal instances with a proper locking scheme and well-defined ordering ensures safe synchronization among many parallel working threads. Note that our careful ordering and the natural order of nested channels ensure that locks are always acquired in an acyclic fashion and, thus, no deadlock can occur. Fig. 4 summarizes the detailed use of all locks and the thread switching mechanism for the life-cycle of a working thread.

D. Implementation Optimization for Multi-Core Simulation

As shown in Fig. 1b and Fig. 4, the threads in the simulator switch due to event handling and time advancement. Context switches occur when the owner thread of a CPU core changes. A dedicated scheduling thread (as discussed in Section II-B and used in Section II-C) introduces context switch overhead whenever a working thread needs scheduling. This overhead can be eliminated by letting the current working thread perform the scheduling task itself. In other words, instead of using a dedicated scheduler thread, we define a function schedule() and call it in each working thread when scheduling is needed.

Then, the condition variable Cond_s is no longer needed. The "scheduler thread" (which is now the current working thread curTh itself) sleeps by waiting on Cond_curTh. In the left loop of Fig. 1b, if the thread picked from the READY

[Flowchart omitted: the life-cycle of a working thread. A par statement creates its child threads with Create(), incrementing NumAliveChildren, and moves the parent from RUN to JOINING until the last child completes; wait moves the thread from RUN to WAIT, and notify appends the notified event e to the events' list N; waitfor moves the thread from RUN to WAITFOR. On termination, the thread moves from RUN to COMPLETE, decrements its parent's NumAliveChildren, and, if it was the last alive child, moves the parent from JOINING back to READY. Every transition acquires the central lock L, signals Cond_s to run the scheduler, and sleeps on the thread's own condition variable.]

Fig. 4: Life-cycle of a thread in the multi-core simulator.

queue is the same as curTh, the sleep step is skipped and curTh continues. After the sleep step, the current working thread will continue its own work rather than entering into another scheduling iteration. In Fig. 4, Go(schedule) is replaced by the function call schedule() and the sleep step is no longer needed. Fig. 1c shows the refined multi-core scheduler with this optimization applied.

Our experiments at the end show that this optimization achieves about a 7.7% reduction in simulation time.
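An accounting sketch of this optimization, with assumptions of our own: we charge a dedicated scheduler thread two context switches per scheduling request (worker to scheduler, scheduler to next worker), while in-thread scheduling costs at most one switch and none when the picked thread is the caller itself:

```c
#include <assert.h>

/* Hypothetical switch-counting model; real costs vary by OS and load. */

/* Dedicated scheduler thread: every request goes via the scheduler. */
int switches_dedicated(const int *picked, const int *caller, int n)
{
    (void)picked; (void)caller;
    return 2 * n;
}

/* In-thread schedule(): switch only when another thread is picked. */
int switches_optimized(const int *picked, const int *caller, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        if (picked[i] != caller[i])   /* otherwise the sleep step is skipped */
            s++;
    return s;
}
```

This rough model is consistent with Table II, where the optimized simulator performs roughly half as many context switches.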

III. CASE STUDY ON AN H.264 VIDEO DECODER

To demonstrate the improved simulation time of our multi-core simulator, we use an H.264 video decoder application.

A. H.264 Advanced Video Coding (AVC) Standard

The H.264 AVC standard [15] is widely used in video applications, such as internet streaming, disc storage, and television services. H.264 AVC provides high-quality video at less than half the bit rate compared to its predecessors H.263 and H.262. At the same time, it requires more computing resources for both video encoding and decoding. In order to implement the standard on resource-limited embedded systems, it is highly desirable to exploit available parallelism in its algorithm.

The H.264 decoder takes as input a video stream consisting of a sequence of encoded video frames. A frame can be further split into one or more slices during H.264 encoding, as illustrated in the upper right part of Fig. 5. Notably, slices are independent of each other in the sense that decoding one slice will not require any data from the other slices (though it may need data from previously decoded reference frames). For this reason, parallelism exists at the slice level and parallel slice decoders can be used to decode multiple slices in a frame simultaneously.


B. H.264 Decoder Model with Parallel Slice Decoding

We have specified an H.264 decoder model based on the H.264/AVC JM reference software [9]. In the reference code, a global data structure (img) is used to store the input stream and all intermediate data during decoding. In order to parallelize the slice decoding, we have duplicated this data structure and other global variables so that each slice decoder has its own copy of the input stream data and can decode its own slice locally. As an exception, the output of each slice decoder is still written to a global data structure (dec_picture). This is valid because the macro-blocks produced by different slice decoders do not overlap.

Fig. 5: Parallelized H.264 decoder model.

Fig. 5 shows the block diagram of our model. The decoding of a frame begins with reading new slices from the input stream. These are then dispatched into four parallel slice decoders. Finally, a synchronizer block completes the decoding by applying a deblocking filter to the decoded frame. All the blocks communicate via FIFO channels. Internally, each slice decoder consists of the regular H.264 decoder functions, such as entropy decoding, inverse quantization and transformation, motion compensation, and intra-prediction.

Using SCE, we partition the above H.264 decoder model as follows: the four slice decoders are mapped onto four custom hardware units; the synchronizer is mapped onto an ARM7TDMI processor at 100 MHz which also implements the overall control tasks and the cooperation with the surrounding testbench. We choose Round-Robin scheduling for the tasks in the processor and allocate an AMBA AHB for communication between the processor and the hardware units.

C. Experimental Results

For our first experiment, we use the same stream "Harbour" of 299 video frames, each with 4 slices of equal size. As shown in [3], 68.4% of the total computation time is spent in the slice decoding, which we have parallelized in our decoder model.

As a reference point, we calculate the maximum possible performance gain as follows:

    MaxSpeedup = 1 / (ParallelPart / NumOfCores + SerialPart)

For 4 parallel cores, the maximum speedup is

    MaxSpeedup4 = 1 / (0.684 / 4 + (1 − 0.684)) = 2.05

The maximum speedup for 2 cores is accordingly MaxSpeedup2 = 1.52.
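This bound is Amdahl's law applied to the 68.4% parallelized portion; it can be cross-checked with a small C helper (function name is ours):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's-law bound: the serial part limits the achievable speedup
 * regardless of how many cores execute the parallel part. */
double max_speedup(double parallel_part, int num_cores)
{
    return 1.0 / (parallel_part / num_cores + (1.0 - parallel_part));
}
```

Evaluating max_speedup(0.684, 4) gives approximately 2.05 and max_speedup(0.684, 2) approximately 1.52, matching the figures above.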

Table I lists the simulation results for several TLMs generated with SCE when using our multi-core simulator on a Fedora Core 12 host PC with a 4-core CPU (Intel(R) Core(TM)2 Quad) at 3.0 GHz, compiled with optimization (-O2) enabled. We compare the elapsed simulation time against the single-core reference simulator (the table also includes the CPU load reported by the Linux OS). Although simulation performance decreases when issuing only one parallel thread, due to the additional mutexes for safe synchronization in each channel and the scheduler, our multi-core parallel simulation is very effective in reducing the simulation time for all the models when multiple cores in the simulation host are used.

Table I also lists the measured speedup and the maximum theoretical speedup for the models that we have created following the SCE design flow. The more threads are issued in each scheduling step, the more speedup we gain. The #delta cycles column shows the total number of delta cycles executed when simulating each model. This number increases when the design is refined and is the reason why we gain less speedup at lower abstraction levels: more communication overhead is introduced, and the increasing need for scheduling reduces the parallelism. However, the measured speedups are somewhat lower than the maximum, which is reasonable given the overhead introduced by parallelizing and synchronizing the slice decoders. The comparatively lower performance gain for the comm model in simulation with 4 threads can be explained by the inefficient cache utilization in our Intel(R) Core(TM)2 Quad machine⁴.

Table II compares the simulation times of our previous implementation [3] and the optimized one (Section II-D). The number of context switches in the optimized simulation is reduced to about half, which results in an average performance gain of about 7.8%.

TABLE II: Comparison with optimized implementation (4 parallel threads issued).

    Model   Regular                        Optimized                      Gain
            sim. time  #context switches   sim. time  #context switches
    spec    13.18s        161956           11.96s         81999          10%
    arch    13.62s        161943           12.05s         82000          13%
    sched   13.44s        175065           12.98s         85747           4%
    net     13.52s        178742           13.04s         88263           4%
    tlm     15.26s        292316           13.99s        140544           9%
    comm    27.41s       1222183           25.57s        777114           7%

Using a video stream with 4 slices in each frame is ideal for our model with 4 hardware decoders. However, we even achieve simulation speedup in less ideal cases. Table III shows the results when the test stream contains a different number of slices. We also create a test stream file with 4 slices per frame where the sizes of the slices are imbalanced (the percentage of MBs in each slice is 31%, 31%, 31%, 7%). Here, the speedup of our multi-core simulator versus the reference one is 0.98 when issuing 1 thread, 1.28 for 2 threads, and 1.58 for 4 threads. As expected, the speedup decreases when the available parallel work load is imbalanced.

⁴The Intel(R) Core(TM)2 Quad implements a two-pairs-of-two-cores architecture with Intel Advanced Smart Cache technology for each core pair (http://www.intel.com/products/processor/core2quad/prod brief.pdf).


TABLE I: Simulation results of H.264 Decoder ("Harbour", 299 frames, 4 slices each, 30 fps).

            Reference      Multi-Core (1 thread)   Multi-Core (2 threads)  Multi-Core (4 threads)
  Models    sim. time      sim. time     speedup   sim. time     speedup   sim. time     speedup   #delta cycles  #threads
  spec      20.80s (99%)   21.12s (99%)   0.98     14.57s (146%)  1.43     11.96s (193%)  1.74      76280         15
  arch      21.27s (97%)   21.50s (97%)   0.99     14.90s (142%)  1.43     12.05s (188%)  1.76      76280         15
  sched     21.43s (97%)   21.72s (97%)   0.99     15.26s (141%)  1.40     12.98s (182%)  1.65      82431         16
  net       21.37s (97%)   21.49s (99%)   0.99     15.58s (138%)  1.37     13.04s (181%)  1.64      82713         16
  tlm       21.64s (98%)   22.12s (98%)   0.98     16.06s (137%)  1.35     13.99s (175%)  1.55     115564         63
  comm      26.32s (96%)   26.25s (97%)   1.00     19.50s (133%)  1.35     25.57s (138%)  1.03     205010         75
  maximum speedup  1.00                   1.00                    1.52                    2.05      n/a            n/a

TABLE III: Simulation speedup with different H.264 streams (spec model).

                     Reference   Multi-Core
  Slices per frame   n/a         1 thread   2 threads   4 threads
  1                  1.00        0.98       0.98        0.95
  2                  1.00        0.98       1.40        1.35
  3                  1.00        0.99       1.26        1.72
  4                  1.00        0.98       1.43        1.74
  5                  1.00        0.99       1.27        1.53
  6                  1.00        0.99       1.41        1.68
  7                  1.00        0.98       1.30        1.55
  8                  1.00        0.98       1.39        1.59

IV. CASE STUDY ON A JPEG ENCODER

As an example of another application, Table IV shows the simulation speedup for a JPEG Encoder example [1] which performs the DCT, Quantization, and Zigzag modules for the 3 color components in parallel, followed by a sequential Huffman encoder at the end. Significant speedup is gained by our multi-core parallel simulator for the higher-level models (spec, arch). Simulation performance decreases for the models at the lower abstraction levels (sched, net) due to the high number of bus transactions and arbitrations, which are not parallelized and introduce large overhead due to the necessary synchronization protection.
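The encoder's task structure can be sketched as follows. This is a toy Python illustration of the task graph only, not the SpecC model or the SCE simulator; the stage functions are hypothetical placeholders standing in for the real DCT/Quantization/Zigzag/Huffman modules.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages (assumptions, not the real JPEG algorithms):
def dct(c):      return [x * 2 for x in c]    # stand-in transform
def quantize(c): return [x // 3 for x in c]   # stand-in quantizer
def zigzag(c):   return list(reversed(c))     # stand-in reordering

def encode_component(component):
    # Each color component runs DCT -> Quantization -> Zigzag independently.
    return zigzag(quantize(dct(component)))

def huffman(streams):
    # Sequential final stage: merges all component streams in order.
    return [x for s in streams for x in s]

def encode(y, cb, cr):
    # The three color components are independent, so they run in parallel...
    with ThreadPoolExecutor(max_workers=3) as pool:
        streams = list(pool.map(encode_component, [y, cb, cr]))
    # ...but the Huffman encoder is sequential and caps the overall speedup.
    return huffman(streams)
```

The structure makes the results in Table IV plausible: only the three component pipelines can run concurrently, so the attainable speedup is limited by the sequential Huffman stage and, at lower abstraction levels, by the serialized bus transactions.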

TABLE IV: Simulation results of JPEG Encoder.

            Reference      Multi-Core (1 thread)   Multi-Core (2 threads)  Multi-Core (4 threads)
  Models    sim. time      sim. time     speedup   sim. time     speedup   sim. time     speedup
  spec       5.54s (99%)    5.97s (99%)   0.93      4.22s (135%)  1.31      3.12s (187%)  1.78
  arch       5.52s (99%)    6.07s (99%)   0.91      4.28s (135%)  1.29      3.15s (188%)  1.75
  sched      5.89s (99%)    6.38s (99%)   0.92      5.48s (108%)  1.07      5.47s (113%)  1.08
  net       11.56s (99%)   49.30s (99%)   0.23     40.63s (131%)  0.28     37.97s (128%)  0.30

V. SUMMARY AND CONCLUSION

The System-on-Chip Environment (SCE) is an example of an effective ESL framework that provides an automated top-down synthesis flow. Starting from an abstract specification model described in the SpecC SLDL, SCE allows the system designer to explore different design alternatives at various levels of abstraction, and to refine the design model step-by-step down to a pin- and cycle-accurate implementation.

In this work, we have presented an extension of the SCE simulator kernel to support parallel simulation on multi-core hosts. We have introduced a complete channel locking scheme for proper synchronization protection, and have developed an optimized scheduler in the SLDL simulator. By issuing multiple simulation threads simultaneously while ensuring safe synchronization, our optimized simulator allows the fast validation of large MPSoC designs. Using case studies on a H.264 video decoder and a JPEG Encoder application, our experimental results demonstrate a significant reduction in simulation time for TLMs at various levels of abstraction.

In future work, we plan to optimize our model analysis and multi-core simulator further by better grouping and partitioning of issued threads and by utilizing CPU affinity to reduce communication cost and cache misses. We also plan to avoid mutex insertion in special situations where it is not necessary.

ACKNOWLEDGMENT

This work has been supported in part by funding from the National Science Foundation (NSF) under research grant NSF Award #0747523. The authors thank the NSF for the valuable support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We would also like to thank the reviewers and editors for valuable suggestions to improve this article.

REFERENCES

[1] L. Cai, J. Peng, C. Chang, A. Gerstlauer, H. Li, A. Selka, C. Siska, L. Sun, S. Zhao, and D. D. Gajski. Design of a JPEG encoding system. Technical Report ICS-TR-99-54, Information and Computer Science, University of California, Irvine, November 1999.

[2] K. Chandy and J. Misra. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Transactions on Software Engineering, SE-5(5):440–452, September 1979.

[3] W. Chen, X. Han, and R. Dömer. ESL Design and Multi-Core Validation using the System-on-Chip Environment. In HLDVT'10: Proceedings of the 15th IEEE International High Level Design Validation and Test Workshop, 2010.

[4] B. Chopard, P. Combes, and J. Zory. A Conservative Approach to SystemC Parallelization. In V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors, International Conference on Computational Science (4), volume 3994 of Lecture Notes in Computer Science, pages 653–660. Springer, 2006.


[5] R. Dömer, A. Gerstlauer, J. Peng, D. Shin, L. Cai, H. Yu, S. Abdi, and D. Gajski. System-on-Chip Environment: A SpecC-based Framework for Heterogeneous MPSoC Design. EURASIP Journal on Embedded Systems, 2008(647953):13 pages, 2008.

[6] R. Fujimoto. Parallel Discrete Event Simulation. Communications of the ACM, 33(10):30–53, October 1990.

[7] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Design Methodology. Kluwer, 2000.

[8] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele. Scalably Distributed SystemC Simulation for Embedded Applications. In International Symposium on Industrial Embedded Systems (SIES 2008), pages 271–274, June 2008.

[9] H.264/AVC JM Reference Software. http://iphome.hhi.de/suehring/tml/.

[10] T. Li, Y. Guo, and S.-K. Li. Design and Implementation of a Parallel Verilog Simulator: PVSim. In International Conference on VLSI Design, pages 329–334, 2004.

[11] W. Mueller, R. Dömer, and A. Gerstlauer. The formal execution semantics of SpecC. In Proceedings of the International Symposium on System Synthesis, Kyoto, Japan, October 2002.

[12] E. Naroska. Parallel VHDL simulation. In DATE '98: Proceedings of the Conference on Design, Automation and Test in Europe, pages 159–165, 1998.

[13] D. Nicol and P. Heidelberger. Parallel Execution for Serial Simulators. ACM Transactions on Modeling and Computer Simulation, 6(3):210–242, July 1996.

[14] E. P, P. Chandran, J. Chandra, B. P. Simon, and D. Ravi. Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines. In PADS '09: Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation, pages 80–87, Washington, DC, USA, 2009. IEEE Computer Society.

[15] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, July 2003.