
Process Synchronization in Multi-core Systems Using On-Chip Memories

Arun Joseph, Nagu Dhanwada [email protected], [email protected] & Technology Group, IBM



SUMMARY

We present a novel process synchronization mechanism and the application of on-chip memories for process synchronization in multi-core systems.

A multi-core processor architecture and a signaling scheme that support the novel process synchronization mechanism are presented.

The validity of the proposed synchronization mechanism is demonstrated by experiments on a virtual prototyping platform.

Comparison against external memory based schemes shows that the proposed use of on-chip memories in multi-core process synchronization is an effective solution to reduce synchronization overheads.


INTRODUCTION

• Multi-core applications need to synchronize the computations in the different processor cores, so that the computations can proceed with integrity.

• A wide range of working solutions is available: lock-based and lock-free.

• Lock-based techniques lock a shared variable to get exclusive access to the data. Another process that needs the shared variable remains in a busy-wait state, frequently checking whether the lock has become free, and competes for the lock once it is released. [1]

• Lock-free techniques allow multiple threads to concurrently read and write shared data without corrupting it. [2]

• These techniques make use of atomic operations, provided by the processor architecture, which allow a single process to test whether the lock is free and, if it is, acquire the lock in a single atomic operation. [3, 4]


INTRODUCTION

• We introduce a multi-core process synchronization mechanism based on a novel signaling scheme that needs neither atomic operations nor the disabling of interrupts.

• The performance overhead of synchronization operations depends on the number of remote accesses required and on the latency of each remote access.

• A significant amount of on-chip memory is available in recent multi-core architectures like the Cell BE, and on-chip memory is known to improve overall system performance by reducing access time significantly.

• We present a first-of-its-kind approach to exploit the available on-chip memory for efficient process synchronization.


PREVIOUS RELATED WORK

• Commonly used lock-based schemes are semaphores and condition variables. Non-blocking synchronization algorithms are designed in such a way that a critical section is not required. Their implementation requires specific atomic operations like compare-and-swap (CAS). Herlihy shows that any lock-free mechanism can be implemented using the CAS atomic primitive and other primitive operations [2].

• The proposed mechanism is based on an on-chip memory that is non-caching and shared by all the processor cores. It provides a memory region for each processor core with exclusive write access, while all the cores have read access.

• To our knowledge, the proposed signaling scheme is fundamentally different from prior approaches: it requires neither atomic instructions nor the disabling of interrupts.

• Though on-chip memories have been used for a wide range of applications, including speed-up [9, 10], to our knowledge this is the first work to study the use of on-chip, shared, non-cached memories to reduce multi-core process synchronization overheads.


COMPONENTS OF THE SCHEME

• The main components of the proposed process synchronization mechanism are:

• (a) n-core multi-core processor, with an On-chip, Shared, Non-caching (OSN) memory.

• (b) A novel signaling scheme.

• The OSN memory is not essential to the proposed scheme, and in its absence an External, Shared, Non-Caching (ESN) memory can be used for the same purpose, with a penalty in performance.


THE MULTI-CORE PROCESSOR

• Similar architectures have been explored in processors like the Cell [6], and in other academic work [10].

• Efficient usage of on-chip memory is important [9].

• OSN memory is used for building a signaling scheme and hence only a small amount of the memory is required.

• While all processor cores have read access to the OSN memory, each core has dedicated regions in the OSN memory, where it has exclusive write access.

[Figure: block diagram showing Core 0, Core 1, ..., Core n-1, each with a cache, together with the on-chip memory, the external memory and the system bus.]

Figure 1. Multi-core processor with on-chip memory.


PROCESS SYNCHRONIZATION MECHANISM: Signaling Scheme

• The proposed signaling scheme provides a two-state system to support a continuous signaling mechanism.

• Processor cores are logically numbered from 0 to (n-1), where n is the total number of processor cores.

• The basic signaling mechanism from a signal generator to a signal receiver is based on the proposed concepts of a signal location and two value locations.

• The signal location is a specific location in the on-chip memory for which only the signal generator has write access and all others have read access.

• Of the two value locations, one location is managed by the signal generator in a location on the on-chip memory for which it has write access. The second value location is managed by the receiving side in a location on the on-chip memory for which the receiver has write access.


PROCESS SYNCHRONIZATION MECHANISM: Signaling Scheme

• The value locations have only two states other than the initial/reset state, which is a zero value state.

• The two value location states, other than the zero state, are two values which can be set by the core managing the location.

• For example, these states can be 0xfe and 0xff, and a state toggle can be obtained by an 'exclusive or' operation with 0x01.

• A signal is set by the generator to the receiver when the signal location and receiver value locations have the same value.

• After setting the signal, the signal generator toggles its generator value location, and the receiver after receiving the signal, toggles its receiver value location so that a new state is formed for a new signal.
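The toggle-and-compare behaviour described above can be sketched as a small behavioral model. This is only an illustration in Python, not the implementation used in the experiments: the class `SignalChannel` and its method names are hypothetical, the exclusive-write rule of the memory is a convention in the comments rather than something the code enforces, and the guard that ignores the zero reset state is an assumption added so that an uninitialized channel does not look like a pending signal.

```python
# Behavioral model of one generator-to-receiver channel (hypothetical names).
RESET = 0x00        # initial/reset state of every location
FIRST_STATE = 0xFE  # example state from the slides; the other state is 0xFF


class SignalChannel:
    def __init__(self):
        self.signal = RESET      # written only by the generator
        self.gen_value = RESET   # written only by the generator
        self.recv_value = RESET  # written only by the receiver

    def init_generator(self):
        self.gen_value = FIRST_STATE

    def init_receiver(self):
        self.recv_value = FIRST_STATE

    def set_signal(self):
        """Generator: publish the current value, then toggle for the next signal."""
        self.signal = self.gen_value
        self.gen_value ^= 0x01

    def poll(self):
        """Receiver: non-blocking check that consumes the signal if present."""
        # Assumption: the zero reset state never counts as a signal.
        if self.recv_value != RESET and self.signal == self.recv_value:
            self.recv_value ^= 0x01
            return True
        return False


ch = SignalChannel()
ch.init_generator()
ch.init_receiver()
print(ch.poll())  # nothing has been signalled yet
ch.set_signal()
print(ch.poll())  # the signal is seen and consumed
print(ch.poll())  # already consumed, so nothing is pending
ch.set_signal()
print(ch.poll())  # the next signal uses the other state
```

Because the generator overwrites the signal location on every `set_signal`, a second signal sent before the receiver has consumed the first would not match the receiver's expected value, which is one reason the full scheme pairs each signal with a reply.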


PROCESS SYNCHRONIZATION MECHANISM: Signaling Sequence and System

• At the end of the signaling phase, a new state is formed and the signaling process can continue.

• An acknowledgment can be obtained by a reply signal. With this signal mechanism the full implementation of the signaling system can be built.

• The signaling system in a processor with 'n' cores is implemented using one ‘nxn’ Signal Location Matrix and two ‘nxn’ Value Location Matrices.

• These three matrices are maintained in the OSN memory (or the ESN memory, if the external memory scheme is used).

Initialization Phase:

Step 1: Initial / Reset State
Signal Location: 0x00, Generator Value Location: 0x00, Receiver Value Location: 0x00

Step 2: Cores Set Value Locations
Signal Location: 0x00, Generator Value Location: 0xfe, Receiver Value Location: 0xfe

Signaling Phase:

Step 3: Generator Sets Signals
Signal Location: 0xfe, Generator Value Location: 0xfe, Receiver Value Location: 0xfe

Step 4: Generator Toggles Value Locations
Signal Location: 0xfe, Generator Value Location: 0xff, Receiver Value Location: 0xfe

Step 5: Receiver Receives Signal and Toggles its Value Location
Signal Location: 0xfe, Generator Value Location: 0xff, Receiver Value Location: 0xff

Figure 2. Signaling Sequence.


PROCESS SYNCHRONIZATION MECHANISM: Signaling System

• We refer to the signal location matrix as 'S' and the two value location matrices as 'G' and 'R'.

• While G holds the value for setting the signal on the generator side, R holds the expected value on the receiver side.

• Each row of the S matrix holds the signal locations of one of the n processor cores. In other words, the ith row vector of Matrix S corresponds to core-i, and its elements are locations in the on-chip memory for which core-i has write access.

• Rows of G and R are also placed in the on-chip memory. The jth location in the ith row vector of S is the signal location that core-i uses to set a signal for core-j. To set that signal, core-i uses the value in the jth location of the ith row vector of Matrix G.

• In a similar way, core-j looking for a signal from core-i looks at the jth location of the ith row of S for a value equal to the ith location of the jth row of Matrix R.
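The indexing of the three matrices can be made concrete with a short behavioral sketch. It assumes plain Python lists as a stand-in for the OSN memory, and the class `SignalMatrices` with its methods is a hypothetical name introduced only for this illustration. The point to notice is that every write made on behalf of core k lands in row k of some matrix.

```python
# Hypothetical model of the S, G and R matrices for an n-core system.
class SignalMatrices:
    def __init__(self, n):
        self.n = n
        self.S = [[0x00] * n for _ in range(n)]  # row i written only by core i
        self.G = [[0x00] * n for _ in range(n)]  # row i written only by core i
        self.R = [[0x00] * n for _ in range(n)]  # row j written only by core j

    def init_core(self, k):
        """Core k moves every value location in its own rows to the first state."""
        for other in range(self.n):
            self.G[k][other] = 0xFE
            self.R[k][other] = 0xFE

    def set_signal(self, i, j):
        """Core i signals core j: the writes touch only row i of S and G."""
        self.S[i][j] = self.G[i][j]
        self.G[i][j] ^= 0x01

    def check_signal(self, j, i):
        """Core j checks for a signal from core i: the only write is to row j of R."""
        expected = self.R[j][i]
        # Assumption: the zero reset state never counts as a signal.
        if expected != 0x00 and self.S[i][j] == expected:
            self.R[j][i] ^= 0x01
            return True
        return False


m = SignalMatrices(4)
for core in range(4):
    m.init_core(core)
m.set_signal(1, 3)
print(m.check_signal(3, 1))  # core 3 sees the signal from core 1
print(m.check_signal(3, 2))  # core 2 has not signalled core 3
```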


PROCESS SYNCHRONIZATION MECHANISM: Signaling process from core i to core j

[Figures 3 to 6: each panel shows the ith row of the S, G and R matrices (Core i) and the jth row of the S, G and R matrices (Core j). The values below are those in the jth location of the ith row of S and G, and in the ith location of the jth row of R.]

Figure 3. Core i to j, current state: S = 0xff, G = 0xfe, R = 0xfe

Figure 4. Core i to j, i sets its signal to j: S = 0xfe, G = 0xfe, R = 0xfe

Figure 5. Core i to j, i toggles its G location: S = 0xfe, G = 0xff, R = 0xfe

Figure 6. Core i to j, j identifies and receives signal and toggles its R location: S = 0xfe, G = 0xff, R = 0xff


PROCESS SYNCHRONIZATION MECHANISM

• The process synchronization between two cores, say 'Core-i' and 'Core-j', is implemented as follows:

– Core-i sets signal to Core-j.

– Core-i waits for signal from Core-j.

– Core-j waits for signal from Core-i.

– Core-j gets the signal from Core-i.

– Core-j sets reply signal to Core-i.

– Core-i gets the reply signal from Core-j.

• The basic synchronization scheme is built on three matrices of order n×n, where n is the number of cores. Hence, for example, assuming one byte per location, the scheme for a 1000-core system needs 3 × 1000 × 1000 bytes, that is, 3 MB of on-chip memory.

• The scheme has the potential to be extended for multiple types of signals and inter core communication, which requires extra memory to implement.
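The signal-and-reply sequence listed above can be tried out with two threads standing in for Core-i and Core-j. This is a rough sketch under stated assumptions: Python lists replace the OSN memory, value locations start already initialized to 0xfe, and `run_handshake`, `set_signal` and `wait_signal` are hypothetical names chosen for this example.

```python
import threading
import time


def run_handshake(rounds):
    """Synchronize two threads (core 0 as Core-i, core 1 as Core-j) several times."""
    n = 2
    S = [[0x00] * n for _ in range(n)]  # row k written only by core k
    G = [[0xFE] * n for _ in range(n)]  # row k written only by core k
    R = [[0xFE] * n for _ in range(n)]  # row k written only by core k
    log = []

    def set_signal(src, dst):
        S[src][dst] = G[src][dst]
        G[src][dst] ^= 0x01

    def wait_signal(me, src):
        while S[src][me] != R[me][src]:
            time.sleep(0)  # busy-wait, yielding to the other thread
        R[me][src] ^= 0x01

    def core_i():
        for r in range(rounds):
            log.append(f"i: reached sync point {r}")
            set_signal(0, 1)   # Core-i sets signal to Core-j
            wait_signal(0, 1)  # Core-i waits for the reply from Core-j
            log.append(f"i: passed sync point {r}")

    def core_j():
        for r in range(rounds):
            wait_signal(1, 0)  # Core-j waits for signal from Core-i
            log.append(f"j: got signal {r}")
            set_signal(1, 0)   # Core-j sets reply signal to Core-i

    threads = [threading.Thread(target=core_j), threading.Thread(target=core_i)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


for line in run_handshake(2):
    print(line)
```

No lock or atomic read-modify-write is used: each thread writes only to its own rows, and the reply signal prevents Core-i from issuing its next signal before Core-j has consumed the current one.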


EXPERIMENTAL SETUP

• The mechanism was simulated on a multi-core System-On-Chip (SOC) virtual prototyping platform, as shown in Figure 11.

• The platform also provides a mechanism for plugging in user-defined modules to support abstraction of additionally defined hardware components.

• The CoreConnect-based [16] SOC has 8 processor cores and 1 MB of OSN memory, in addition to several other peripherals and bus components.

Figure 11. Virtual Multi-core SOC.


EXPERIMENTAL SETUP

• Different experiments were carried out while running a parallel multiplication of two 16x16 matrices.

• In scenario 1: Process synchronization was achieved using the proposed OSN-based synchronization technique.

• In scenario 2: Process synchronization was achieved using the proposed ESN-based synchronization technique.

• In scenario 3: Process synchronization was achieved using an external memory based semaphore, BetaSemaphore, which was implemented using an atomic Test and Set operation, as defined in [14].

• Performance comparisons between scenarios 1 and 2 indicate that even for applications like matrix multiplication, where the number of synchronization operations is small, the impact of the OSN memory on reducing synchronization overhead is reasonably significant, especially as the number of cores increases.


EXPERIMENTAL RESULTS

• The proposed process synchronization scheme is not expected to reduce synchronization overheads unless used with the OSN memory.

• For an 8-core SOC, a speed-up of 7.5 was seen with the OSN-based technique vs. 5.5 with the ESN-based technique.

• The performance of scenarios 2 and 3 is more comparable. The delta between the two can potentially be attributed to the differences in the approach used to implement them.

Scenario 1: Proposed Scheme using OSN Memory

Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205753               54662            1.0
2                      2632011               92797            2.0
4                      1338610               171432           3.9
8                      695988                344361           7.5

Scenario 2: Proposed Scheme using ESN Memory

Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205733               54662            1.0
2                      2632086               106798           2.0
4                      1462357               332871           3.6
8                      820631                469267           5.5

Scenario 3: Semaphore using ESN Memory

Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205733               54662            1.0
2                      2632094               106802           2.0
4                      1607896               213437           3.2
8                      912465                450986           5.7

Figure 13. Scenarios 1-3


EXPERIMENTAL SETUP

• In another study, a micro-benchmark was created to force 10000 synchronization operations, and the time taken for those 10000 operations was extracted using selective profiling functions provided in the virtual prototyping platform.

• The study was done on the same SOC as before, but with 2 and 4 processor cores, in 3 different scenarios.

• In scenario 4 the OSN-based proposed synchronization scheme was used.

• In scenario 5, the OSN-based BetaSemaphore implementation was used.

• In scenario 6 the ESN-based BetaSemaphore implementation was used.


EXPERIMENTAL RESULTS

• Synchronization overheads in scenarios 4 and 5 are in a comparable range, and again the overhead in scenario 4 was lower than in scenario 5.

• With 4 cores, the overhead of scenario 4 was approximately 1/4th the synchronization overhead of scenario 6 (about 2/5 with 2 cores), and the gap is expected to widen further as the number of cores increases.

• This strongly suggests that irrespective of the process synchronization scheme used, the OSN memory significantly reduces the process synchronization overheads, especially as the number of processor cores increases.

Synchronization Overhead for 10000 synchronizations (in usec)

No. of cores   Scenario 4: Proposed Scheme (OSN)   Scenario 5: BetaSemaphore (OSN)   Scenario 6: BetaSemaphore (ESN)
2              1500723                             1862737                           3751627
4              2775356                             3184996                           12185251

Figure 14. Scenarios 4-6
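The comparisons drawn from Figure 14 can be checked with a few lines of arithmetic. The sketch below copies the overheads from the table; the dictionary layout and the helper `ratio` are hypothetical names used only for this illustration.

```python
# Overheads in microseconds for 10000 synchronizations, copied from Figure 14.
overhead_us = {
    2: {"scenario4": 1500723, "scenario5": 1862737, "scenario6": 3751627},
    4: {"scenario4": 2775356, "scenario5": 3184996, "scenario6": 12185251},
}


def ratio(cores, a, b):
    """Overhead of scenario a divided by the overhead of scenario b."""
    row = overhead_us[cores]
    return row[a] / row[b]


for cores in sorted(overhead_us):
    vs_esn = ratio(cores, "scenario4", "scenario6")
    vs_osn = ratio(cores, "scenario4", "scenario5")
    print(f"{cores} cores: scenario 4 / scenario 6 = {vs_esn:.2f}, "
          f"scenario 4 / scenario 5 = {vs_osn:.2f}")
```

The ratio of scenario 4 to scenario 6 comes out at about 0.40 for 2 cores and about 0.23 for 4 cores, while the ratio of scenario 4 to scenario 5 stays between 0.8 and 0.9.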


CONCLUSION & FUTURE WORK

• A novel multi-core signaling scheme and a process synchronization mechanism are presented.

• We have also presented the notion of using on-chip, shared, non-cached memories to reduce the process synchronization overheads in multi-core systems.

• The basic signaling scheme presented is a two-state mechanism. However, the scheme can be extended further as a signaling system with multiple states.

• We are investigating how multiple types of signals can be implemented by providing a specified number of locations, maintained by the generator and read by the receiver, to further classify the signal.

• The scheme can be extended to enable inter-processor communication.


REFERENCES

1. J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. on Computer Systems, 9(1), February 1991.

2. M. P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.

3. Intel Corp. Intel Itanium 2 processor reference manual.

4. C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of Processors, 2nd edition. Morgan Kaufmann, May 1994.

5. Zhen Fang, Lixin Zhang, John B. Carter, Liqun Cheng, and Michael Parker. Fast synchronization on shared-memory multiprocessors: An architectural approach. J. Parallel Distrib. Comput. 65, 10 (October 2005), 1158-1170.

6. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev. 49, 4/5 (July 2005), 589-604.

7. L. A. Polka et al. Intel Technology Journal, vol. 11, 197 (2007).

8. A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts, 7th ed. John Wiley & Sons, Inc., 2005.

9. Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems. ACM Trans. Des. Autom. Electron. Syst. 5, 3 (July 2000), 682-704.

10. C. Villavieja, I. Gelado, A. Ramirez, and N. Navarro. Memory Management on Chip-MultiProcessors with on-chip Memories. Proc. workshop on the Interaction between Operating Systems and Computer Architecture, 2008.

11. N. R. Dhanwada, R. A. Bergamaschi, W. W. Dungan, I. Nair, P. Gramann, W. E. Dougherty, and I. Lin. Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems. Design Autom. for Emb. Sys., 2005, pp. 105-125.

12. Meet the PowerPC 405 Evaluation Kit, 2005.

13. The Open SystemC Initiative. http://www.systemc.org.

14. L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. Legacy SystemC Co-Simulation of Multi-Processor Systems-on-Chip. In Proceedings 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), IEEE, 494, 2002.

15. PowerPC User Instruction Set Architecture Book I Version 2.02.

16. The CoreConnect™ Bus Architecture, 1999.


PROCESS SYNCHRONIZATION MECHANISM: Appendix A

• Though the proposed scheme is a blocking scheme, it need not be so, if the program logic permits.

– Core-i can set the signal to Core-j and continue rather than waiting, until it needs the acknowledgment or before the next signal. In a similar way, Core-j need not wait for a signal from Core-i. If the program logic permits, it can check for the signal, continue if the signal is not available, and wait for the signal only when it is really needed.

– If there is a synchronization point but it is not clear which core should initiate the signal, a convention can be adopted that the lower numbered core sets the signal, and the other waits for and acknowledges the signal.

• In a similar way, synchronization of a group of cores, or a barrier point, can be implemented.

– The highest numbered core scans for signals from all the lower numbered cores, while all the lower numbered cores set signals to the highest numbered core and wait for an acknowledgment from it. When the highest numbered core has received signals from all other cores, it sets an acknowledgment to all other cores.

– Since a core can check for a signal without blocking, signals from a number of cores arriving in a random sequence can be handled by searching in a cyclic manner. It can also be seen that the synchronization is built on a scheme in which each core writes only to locations for which it has exclusive write access. Hence, servicing of interrupts has no adverse effects on the synchronization scheme.
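The barrier described above can be sketched with threads standing in for cores. As before, this is a behavioral illustration under assumptions rather than the implementation evaluated in the experiments: Python lists replace the OSN memory, value locations start already initialized to 0xfe, and `run_barrier` and its inner helpers are hypothetical names.

```python
import threading
import time


def run_barrier(n):
    """Run one barrier across n threads standing in for cores; return the event log."""
    S = [[0x00] * n for _ in range(n)]  # row k written only by core k
    G = [[0xFE] * n for _ in range(n)]  # row k written only by core k
    R = [[0xFE] * n for _ in range(n)]  # row k written only by core k
    events = []
    top = n - 1  # the highest numbered core collects and acknowledges

    def set_signal(src, dst):
        S[src][dst] = G[src][dst]
        G[src][dst] ^= 0x01

    def check_signal(me, src):
        # Non-blocking check; toggles the receiver value when a signal is consumed.
        if S[src][me] == R[me][src]:
            R[me][src] ^= 0x01
            return True
        return False

    def barrier(core):
        if core != top:
            set_signal(core, top)
            while not check_signal(core, top):  # wait for the acknowledgment
                time.sleep(0)
            return
        pending = list(range(top))
        while pending:  # cyclic search over cores that have not yet signalled
            pending = [src for src in pending if not check_signal(top, src)]
            time.sleep(0)
        for dst in range(top):
            set_signal(top, dst)

    def worker(core):
        time.sleep(0.005 * ((core * 3) % n))  # scramble the arrival order
        events.append(("arrive", core))
        barrier(core)
        events.append(("release", core))

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_barrier(4)
last_arrival = max(i for i, e in enumerate(events) if e[0] == "arrive")
first_release = min(i for i, e in enumerate(events) if e[0] == "release")
print(last_arrival < first_release)  # no core leaves before all have arrived
```

The collecting core never blocks on one particular sender: it keeps cycling over the cores still pending, so signals may arrive in any order.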