High Performance Computing
(Lecture Notes)
Department of Computer Science
School of Mathematical & Physical Sciences
Central University of Kerala
Module 1: Architectures and Models of Computation
LEARNING OBJECTIVES
Shared and Distributed Memory Machines, PRAM Model,
Interconnection Networks: Crossbar, Bus, Mesh, Tree, Butterfly and
MINs, Hypercube, Shuffle Exchange, etc.; Evaluation based on
Diameter, Bisection Bandwidth, Number of Edges, etc., Embeddings:
Mesh, Tree, Hypercube; Gray Codes, Flynn's Taxonomy
1. What is parallel computing? Why parallel computing? How does it
differ from concurrency? What are its application areas?
Parallel Computing: What is it?
Parallel computing is the use of a parallel computer to reduce the time needed to
solve a single computational problem. Parallel computers are computer systems
consisting of multiple processing units connected via some interconnection
network plus the software needed to make the processing units work together.
The processing units can communicate and interact with each other using either
shared memory or message passing methods. Parallel computing is now
considered a standard way for computational scientists and engineers to solve
computational problems that demand high performance computing power.
Parallel Computing: Why is it needed?
Sequential computing systems have been with us for more than six decades since
John von Neumann introduced digital computing in the 1950s. The traditional
logical view of a sequential computer consists of a memory connected to a
processor via a datapath. In sequential computing, all three components
(processor, memory, and datapath) present bottlenecks to the overall processing
rate of a computer system. To speed up the execution, one would need to either
increase the clock rate or to improve the memory performance by reducing its
latency or increasing the bandwidth. A number of architectural innovations like
multiplicity (in processing units, datapaths and memory units), cache memory,
pipelining, superscalar execution, multithreading, prefetching, etc., over the
years have been exploited to address these performance bottlenecks. Although
these architectural innovations brought about an average 50% performance
improvement per year during the period 1986 to 2002, the performance gain
slowed after 2002, primarily due to the fundamental architectural limitations
of sequential computing, and the computing industry came to the realization
that uniprocessor architectures cannot sustain the rate of realizable
performance increments in the future. This realization led the computing
industry to focus more on parallel computing for achieving sustained,
realizable performance improvement. And the idea of a single-processor
computer is fast becoming outdated and old-fashioned.
Parallelism vs. Concurrency
In many fields, the words parallel and concurrent are synonyms; not so in
programming, where they are used to describe fundamentally different concepts.
A parallel program is one that uses a multiplicity of computational hardware
(e.g., several processor cores) to perform a computation more quickly. The aim is
to arrive at the answer earlier, by delegating different parts of the computation
to different processors that execute at the same time.
By contrast, concurrency is a program-structuring technique in which there are
multiple threads of control, which may be executed in parallel on multiple
physical processors or in interleaved fashion on a single processor. Whether they
actually execute in parallel or not is therefore an implementation detail.
While parallel programming is concerned only with efficiency, concurrent
programming is concerned with structuring a program that needs to interact
with multiple independent external agents (for example, the user, a database
server, and some external clients). Concurrency allows such programs to be
modular. In the absence of concurrency, such programs have to be written with
event loops and callbacks, which are typically more cumbersome and lack the
modularity that threads offer.
Parallel Computing: Advantages.
The main argument for using multiprocessors is to create powerful computers by
simply connecting multiple processors. A multiprocessor is expected to reach
faster speed than the fastest single-processor system. In addition, a
multiprocessor consisting of a number of single processors is expected to be more
cost-effective than building a high-performance single processor. Another
advantage of a multiprocessor is fault tolerance. If a processor fails, the
remaining processors should be able to provide continued service, although with
degraded performance.
Parallel Computing: The Limits.
A theoretical result known as Amdahl's law says that the amount of performance
improvement that parallelism provides is limited by the amount of sequential
processing in your application. This may, at first, seem counterintuitive.
Amdahl's law says that no matter how many cores you have, the maximum speed-up
you can ever achieve is 1 / (fraction of time spent in sequential processing).
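As a quick illustration, the following C sketch evaluates Amdahl's law for an
assumed sequential fraction of 10%; the core counts are arbitrary assumptions
chosen only to show the trend.

#include <stdio.h>

/* Amdahl's law: speedup on p cores when a fraction f of the work
 * is inherently sequential. As p grows, the speedup approaches 1/f. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.1;                       /* assumed sequential fraction */
    int cores[] = { 1, 2, 4, 8, 1024 };
    for (int i = 0; i < 5; i++)
        printf("p = %4d: speedup = %.2f\n", cores[i],
               amdahl_speedup(f, cores[i]));
    printf("upper bound (p -> infinity): %.2f\n", 1.0 / f);
    return 0;
}

Even with 1024 cores, the speedup stays below 10, the 1/f bound for f = 0.1.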
Parallel Computing: Application Areas
Parallel computing is a fundamental and irreplaceable technique used in today's science and technology, as well as manufacturing and service industries. Its
applications cover a wide range of disciplines:
- Basic science research, including biochemistry for decoding human genetic
  information, as well as theoretical physics for understanding the
  interactions of quarks and the possible unification of all four forces.
- Mechanical, electrical, and materials engineering for producing better
  materials such as solar cells, LCD displays, LED lighting, etc.
- Service industry, including telecommunications and the financial industry.
- Manufacturing, such as the design and operation of aircraft and bullet trains.
Its broad applications in oil exploration, weather forecasting, communication,
transportation, and aerospace make it a unique technique for the national
economy and defence. It is precisely this uniqueness and its lasting impact
that defines its role in today's rapidly growing technological society.
2. Development of parallel software has traditionally been thought of as time and effort intensive. Justify the statement.
Traditionally, computer software has been written for serial computation. To
solve a problem, an algorithm is constructed and implemented as a serial stream
of instructions. These instructions are executed on a central processing unit
one after another.
Parallel computing, on the other hand, uses multiple processing elements
simultaneously to solve a problem. This is accomplished by breaking the problem
into independent tasks so that each processing element can execute its part of
the algorithm simultaneously with the others. The processing elements can be
diverse and include resources such as a single computer with multiple
processors, several networked computers, specialized hardware, or any
combination of the above.
However, development of parallel software is traditionally considered as a time
and effort intensive activity due to the following reasons:
- complexity in specifying and coordinating concurrent tasks,
- lack of portable parallel algorithms,
- lack of standardized parallel environments, and
- lack of parallel software development toolkits.
Complexity in specifying and coordinating concurrent tasks: Concurrent
computing involves overlapping of the execution of several computations over
one or more processors. Concurrent computing often requires complex
interactions between the processes. These interactions often take the form of
communication via message passing, which may be synchronous or asynchronous,
or of access to shared resources. The main challenges in designing concurrent
programs are concurrency control: ensuring the correct sequencing of the
interactions or communications between different computational executions, and
coordinating access to resources that are shared among executions. Problems
that may occur include non-determinism (from race conditions), deadlock, and
resource starvation.
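A minimal POSIX-threads sketch of such a race condition (the counter, the
iteration count and the two-thread setup are illustrative assumptions): without
the mutex, the two unsynchronized increments interleave and the final value
varies from run to run; the lock restores correct sequencing.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                             /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                   /* concurrency control: */
        counter++;                                   /* without the lock the */
        pthread_mutex_unlock(&lock);                 /* result is a race     */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}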
Lack of portable parallel algorithms: Because the interconnection scheme
among processors (or between processors and memory) significantly affects the
running time, efficient parallel algorithms must take the interconnection
scheme into account. Because of this, most of the existing parallel algorithms
for real world applications suffer from a major limitation: they have been
designed with a specific underlying parallel architecture in mind and are not
portable to a different parallel architecture.
Lack of standardized parallel environments: The lack of standards in
parallel programming languages makes parallel programs difficult to port across
parallel computers.
Lack of standardized parallel software development toolkits: Unlike
sequential programming tools, the parallel programming tools available are
highly dependent both on the characteristics of the problem and on the
parallel programming environment opted for. The lack of standard programming
tools makes parallel programming difficult and the resultant programs are not
portable across parallel computers.
However, in the last few decades, researchers have made considerable progress
in designing efficient and cost-effective parallel architectures and parallel
algorithms. Together with this, factors such as
- the reduction in the turnaround time required for the development of
  microprocessor-based parallel machines, and
- the standardization of parallel programming environments and parallel
  programming tools to ensure a longer life-cycle for parallel applications
have made parallel computing today less time and effort intensive.
3. Briefly explain some of the compelling arguments in favour of
parallel computing platforms
Though considerable progress has been made in the microprocessor technology in
the past few decades, the industry came to the realization that the implicit
parallel architecture alone cannot provide sustained realizable performance
increments. Together with this, factors such as
- the reduction in the turnaround time required for the development of
  microprocessor-based parallel machines, and
- the standardization of parallel programming environments and parallel
  programming tools to ensure a longer life-cycle for parallel applications
present compelling arguments in favour of parallel computing platforms.
The major fascinating arguments in favour of parallel computing platforms
include:
a) The Computational Power Argument.
b) The Memory/Disk Speed Argument.
c) The Data Communication Argument.
The Computational Power Argument: Due to the sustained development in
the microprocessor technology, the computational power of systems has been
doubling roughly every 18 months (a popular statement of Moore's law). This
sustained development in microprocessor technology favours parallel computing
platforms.
Memory/Disk Speed Argument: The overall speed of a system is determined
not just by the speed of the processor, but also by the ability of the memory
system to feed data to it. Considering the 40% annual increase in clock speed
coupled with the increases in instructions executed per clock cycle, the small 10%
annual improvement in memory access time (memory latency) has resulted in a
performance bottleneck. This growing mismatch between processor speed and
DRAM latency can be bridged to a certain level by introducing cache memory
that relies on locality of data reference. Besides memory latency, the effective
memory bandwidth also influences the sustained improvements in computation
speed. Compared to uniprocessor systems, parallel platforms typically provide
better memory system performance because they provide (a) larger aggregate
caches, and (b) higher aggregate memory bandwidth. Besides, design of parallel
algorithms that can exploit the locality of data reference can also improve the
memory and disk latencies.
The Data Communication Argument: Many of the modern real world
applications in quantum chemistry, statistical mechanics, cosmology,
astrophysics, computational fluid dynamics and turbulence, superconductivity,
biology, pharmacology, genome sequencing, genetic engineering, protein folding,
enzyme activity, cell modelling, medicine, modelling of human organs and bones,
global weather and environmental modelling, data mining, etc., are massively
parallel and demand large scale, wide area, distributed, heterogeneous
parallel/distributed computing environments.
4. Describe the classical von Neumann architecture of computing
systems.
The classical von Neumann architecture consists of main memory, a central
processing unit (also known as CPU or processor or core) and an
interconnection between the memory and the CPU. Main memory consists of a
collection of locations, each of which is capable of storing both instructions and
data. Every location has an address, which is used to access the
instructions or data stored in the location. The classical von Neumann
architecture is depicted below:
The central processing unit is divided into a control unit and an arithmetic
and logic unit (ALU). The control unit is responsible for deciding which
instructions in a program should be executed, and the ALU is responsible for
executing the actual instructions. Data in the CPU and information about the
state of an executing program are stored in special, very fast storage called
registers. The control unit has a special register called the program counter.
It stores the address of the next instruction to be executed.
Instructions and data are transferred between the CPU and memory via the
interconnect called bus. A bus consists of a collection of parallel wires and some
hardware controlling access to the wires. A von Neumann machine executes a
single instruction at a time, and each instruction operates on only a few pieces of
data.
The process of transferring data or instructions from memory to the CPU is
referred to as data or instructions fetch or memory read operation. The process
of transferring data from the CPU to memory is referred to as memory write.
Fig: The Classical von Neumann Architecture
The separation of memory and CPU is often called the von Neumann
bottleneck, since the interconnect determines the rate at which instructions
and data can be accessed. CPUs are capable of executing instructions more than
one hundred times faster than they can fetch items from main memory.
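The fetch-execute cycle of this architecture can be caricatured in a few lines
of C. The two-opcode "machine" below is a toy assumption, not a real
instruction set, but it shows the program counter walking a single memory that
holds both instructions and data.

#include <stdio.h>

enum { LOAD = 0, ADD = 1, HALT = 2 };     /* toy opcodes (assumed) */

int main(void)
{
    /* One memory holds instructions and data alike, as in the von
     * Neumann model; each instruction is an {opcode, address} pair. */
    int memory[12] = { LOAD, 10, ADD, 11, HALT, 0, 0, 0, 0, 0, 5, 7 };
    int pc = 0;                           /* program counter register */
    int acc = 0;                          /* accumulator register */

    for (;;) {
        int opcode  = memory[pc];         /* fetch (a memory read)       */
        int operand = memory[pc + 1];
        pc += 2;                          /* advance to next instruction */
        if (opcode == LOAD)      acc = memory[operand];
        else if (opcode == ADD)  acc += memory[operand];
        else break;                       /* HALT */
    }
    printf("acc = %d\n", acc);            /* prints 12 (5 + 7) */
    return 0;
}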
5. Explain the terms processes, multitasking, and threads.
Process
A process is an instance of a computer program that is being executed. When a
user runs a program, the operating system creates a process. A process consists
of several entities:
- The executable machine language program.
- A block of memory, which will include the executable code, a call stack that
  keeps track of active functions, a heap, and some other memory locations.
- Descriptors of resources that the operating system has allocated to the
  process, for example, file descriptors.
- Security information, for example, information specifying which hardware and
  software resources the process can access.
- Information about the state of the process, such as whether the process is
  ready to run or is waiting on some resource, the content of the registers,
  and information about the process's memory.
Multitasking
A task is a unit of execution. In some operating systems, a task is synonymous
with a process, in others with a thread. An operating system is called
multitasking if it can execute multiple tasks. Most modern operating systems are
multitasking. This means that the operating system provides support for the
simultaneous execution of multiple programs. This is possible even on a system
with a single core, since each process runs for a time slice (typically a few
milliseconds). After one running program has executed for a time slice, the
operating system can run a different program. A multitasking OS may change
the running process many times a minute, even though changing the running
process can results overheads. In a multitasking OS, if a process needs to wait
for a resource (for example, it needs to read data from external storage) the OS
will block the process and schedule another ready process to run. For example,
an airline reservation system that is blocked waiting for a seat map for one user
could provide a list of available flights to another user. Multitasking does not
imply parallelism but it involves concurrency.
Threads
A thread of execution is the smallest unit of a program that can be managed
independently by an operating system scheduler. Threading provides a
mechanism for programmers to divide their programs into more or less
independent tasks with the property that when one thread is blocked another
thread can be run. Furthermore, context switching among threads is much
faster compared to context switching among processes. This is because threads
are lighter weight than processes. Threads are contained within processes, so
they can use the same executable, and they usually share the same memory and
the same I/O devices. In fact, two threads belonging to one process can share
most of the process resources. Different threads of a process need only keep a
record of their own program counters and call stacks so that they can execute
independently of each other.
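A small POSIX-threads sketch of this sharing (the names are illustrative
assumptions): both threads update the same global array of the process, while
the variable id lives on each thread's own private stack.

#include <pthread.h>
#include <stdio.h>

static int shared[2];              /* visible to every thread in the process */

static void *body(void *arg)
{
    int id = *(int *)arg;          /* id lives on this thread's own stack */
    shared[id] = (id + 1) * 100;   /* both threads write the shared array */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, body, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared = {%d, %d}\n", shared[0], shared[1]);   /* {100, 200} */
    return 0;
}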
6. Describe the architectural innovations employed to overcome the
von Neumann bottleneck
The classical von Neumann architecture consists of main memory, a processor
and an interconnection between the memory and the processor. This separation
of memory and processor is called the von Neumann bottleneck, since the
interconnect determines the rate at which instructions and data can be accessed.
This has resulted in creating a large speed mismatch (of the order of 100 times or
more) between the processor and memory. Several architectural innovations
have been exploited as an extension to the classical von Neumann architecture
for hiding this speed mismatch and improving the overall system performance.
The prominent architectural innovations for hiding the von Neumann bottleneck
include:
1. Caching,
2. Virtual memory, and
3. Low-level parallelism: instruction-level and thread-level parallelism.
Caching
Cache is a smaller and faster memory between the processor and the DRAM,
which stores copies of the data from frequently used main memory locations.
Cache acts as a low-latency high-bandwidth storage (improves both memory
latency and bandwidth).
Cache works by the principle of locality of reference, which states that programs
tend to use data and instructions that are physically close to recently used data
and instructions.
The data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache, thereby
reducing the effective memory latency.
In order to exploit the principle of locality, memory access to the cache
operates on blocks (called cache blocks or cache lines) of data and
instructions instead of individual instructions and individual data items
(cache lines typically range from 8 to 16 words). A cache line of 16 words
means that 16 memory words can be accessed in 115 ns (assuming 100 ns memory
latency) instead of 1600 ns (if accessed one word at a time), thereby
increasing the memory bandwidth from 5 MWords/sec to 70 MWords/sec. Blocked
access can also reduce the memory latency: with a cache line of 16 words, the
next 15 accesses to memory can be served from the cache (if the program
exhibits strong locality of reference), thereby hiding the effective
memory latency.
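To see the principle of locality in ordinary code, the following C sketch (the
array size is an arbitrary assumption) compares two traversal orders of the
same array. The row-major loop touches consecutive words, so each fetched
cache line is fully reused; the column-major loop strides a whole row apart on
every access and wastes most of each line.

#include <stdio.h>

#define N 1024
static double a[N][N];                 /* C lays out each row contiguously */

int main(void)
{
    double sum = 0.0;

    /* Row-major traversal: consecutive iterations touch consecutive
     * words, so every word of a fetched cache line gets used.        */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal: consecutive accesses are N*8 bytes
     * apart, so on large arrays nearly every access misses the cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}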
Rather than implementing CPU cache as a single monolithic structure, in
practice, the cache is usually divided into levels: the first level (L1) is the
smallest and the fastest, and higher levels (L2, L3, . . . ) are larger and slower.
Most systems currently have at least two levels. Caches usually store copies of
information in slower memory. For example, a variable stored in a level 1 cache
will also be stored in level 2. However, some multilevel caches don't
duplicate information that's available in another level. For these caches, a
variable in a level 1 cache might not be stored in any other level of the
cache, but it would be stored in main memory.
When the CPU needs to access an instruction or data, it works its way down the
cache hierarchy: First it checks the level 1 cache, then the level 2, and so on.
Finally, if the information needed isn't in any of the caches, it accesses main memory.
When a cache is checked for information and the information is available, it is
called a cache hit. If the information is not available, it is called a cache miss.
Hit or miss is often modified by the level. For example, when the CPU attempts
to access a variable, it might have an L1 miss and an L2 hit.
When the CPU writes data to a cache, the value in the cache and the value in
main memory are different or inconsistent. There are two basic approaches to
dealing with the inconsistency. In write-through caches, the cache line is
written to main memory when it is written to the cache. In write-back caches,
the data is not written immediately. Rather, the updated data in the cache is
marked dirty, and when the cache line is replaced by a new cache line from
memory, the dirty line is written to memory.
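As a rough sketch of the write-back policy just described (a single
direct-mapped line and a tiny RAM are deliberate simplifying assumptions; real
caches have many lines, address-derived tags, and read paths as well):

#include <stdbool.h>
#include <stdio.h>

static int ram[256];                    /* backing main memory */

struct line {                           /* a single cache line */
    bool valid, dirty;
    int  tag;                           /* which address it currently holds */
    int  data;
};

static struct line c = { false, false, 0, 0 };

static void cache_write(int addr, int value)
{
    if (c.valid && c.tag != addr && c.dirty)
        ram[c.tag] = c.data;            /* eviction: flush the dirty line */
    c.valid = true;
    c.tag   = addr;
    c.data  = value;
    c.dirty = true;                     /* main memory is now stale */
}

int main(void)
{
    cache_write(7, 42);                 /* ram[7] stays 0: write-back defers */
    cache_write(9, 13);                 /* evicts line 7 and flushes it      */
    printf("ram[7] = %d, ram[9] = %d\n", ram[7], ram[9]);   /* 42, 0 */
    return 0;
}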
Virtual Memory
Caches make it possible for the CPU to quickly access instructions and data that
are in main memory. However, if we run a very large program or a program that
accesses very large data sets, all of the instructions and data may not fit into
main memory. This is especially true with multitasking operating systems. In
order to switch between programs and create the illusion that multiple programs
are running simultaneously, the instructions and data that will be used during
the next time slice should be in main memory. Thus, in a multitasking system,
even if the main memory is very large, many running programs must share the
available main memory.
Virtual memory was developed so that main memory can function as a cache for
secondary storage. It exploits the principle of locality by keeping in main memory
only the active parts of the many running programs. Those parts that are idle
are kept in a block of secondary storage called swap space. Like CPU caches,
virtual memory operates on blocks of data and instructions. These blocks are
commonly called pages (sizes typically range from 4 to 16 kilobytes).
When a program is compiled, its pages are assigned virtual page numbers. When
the program is loaded into memory, a Page Map Table (PMT) is created that
maps the virtual page numbers to physical addresses. The virtual address
references made by a running program are translated into corresponding
physical addresses by using this PMT.
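The translation itself is simple arithmetic. The C sketch below assumes a 4 KB
page size and a four-entry PMT with made-up frame numbers; real page-table
entries also carry valid and protection bits.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                 /* assumed 4 KB pages */

/* Toy PMT: virtual page number -> physical frame number (made up). */
static uint32_t pmt[4] = { 7, 3, 0, 9 };

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;         /* virtual page number  */
    uint32_t offset = vaddr % PAGE_SIZE;         /* position inside page */
    return pmt[vpn] * PAGE_SIZE + offset;        /* physical address     */
}

int main(void)
{
    uint32_t v = 1 * PAGE_SIZE + 123;            /* page 1, offset 123 */
    printf("virtual %u -> physical %u\n", v, translate(v));  /* 4219 -> 12411 */
    return 0;
}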
The drawback of storing PMT in main memory is that a virtual address
reference made by the running program requires two memory accesses: one to
get the appropriate page table entry of the virtual page to find its location in
main memory, and one to actually access the desired memory. In order to avoid
this problem, CPUs have a special page table cache called the translation
look-aside buffer (TLB) that caches a small number of entries (typically 16 to
512) from the page table in very fast memory. Using the principle of locality,
one can
expect that most of the memory references will be to pages whose physical
address is stored in the TLB, and the number of memory references that require
accesses to the page table in main memory will be substantially reduced.
If the running process attempts to access a page that is not in memory, that is,
the page table does not have a valid physical address for the page and the page is
only stored on disk, then the attempted access is called a page fault. In such
a case, the running process will be blocked until the faulted page is brought
into memory and the corresponding entry is made in the PMT.
When the running program looks up an address and the virtual page number is in
the TLB, it is called a TLB hit. If it is not in the TLB, it is called a TLB miss.
Due to the relative slowness of disk accesses, virtual memory always uses a
write-back scheme for handling write accesses. This can be handled by keeping a
bit on each page in memory that indicates whether the page has been updated. If
it has been updated, when it is evicted from main memory, it will be written to
disk.
Low-Level Parallelism: Instruction-Level Parallelism
Low-level parallelism is parallelism that is not visible to the programmer
(i.e., the programmer has no control over it). Two forms of low-level
parallelism are instruction-level parallelism and thread-level parallelism.
Instruction-level parallelism, or ILP, attempts to improve processor performance
by having multiple processor components or functional units simultaneously
executing instructions. There are two main approaches to ILP: pipelining, in
which functional units are arranged in stages with the output of one being the
input to the next, and multiple issue, in which the functional units are
replicated so that multiple instructions can be issued simultaneously.
Pipelining: Pipelining is a technique used to increase the instruction
throughput of the processor. The main idea is to divide the basic instruction cycle
into a series of independent steps of micro-operations. Rather than processing
each instruction sequentially, the independent micro-operations of different
instructions are executed concurrently (by different functional units) in parallel.
Pipelining enables faster execution by overlapping various stages in instruction
execution (fetch, schedule, decode, operand fetch, execute, store, among others).
Multiple Issue (Superscalar Execution): Superscalar execution is an
advanced form of pipelined instruction-level parallelism that allows dispatching
of multiple instructions to multiple pipelines, processed concurrently by
redundant functional units on the processor. Superscalar execution can provide
better cost-effective performance as it improves the degree of overlapping of
parallel concurrent execution of multiple instructions.
Low-Level Parallelism: Thread-Level Parallelism
Thread-level parallelism, or TLP, attempts to provide parallelism through the
simultaneous execution of different threads. Thread-level parallelism splits a
program into independent threads and runs them concurrently. TLP is considered
as a coarser-grained parallelism than ILP, that is, the program units that are
being simultaneously executed (threads in TLP) are larger or coarser than the
finer-grained units (individual instructions in ILP).
Multithreading provides a means for systems to continue doing useful work by
switching the execution to another thread when the thread being currently
executed has stalled (for example, if the current task has to wait for data to be
loaded from memory). There are different ways to implement multithreading.
In fine-grained multithreading, the processor switches between threads after
each instruction, skipping threads that are stalled. While this approach has the
potential to avoid wasted machine time due to stalls, it has the drawback that a
thread that's ready to execute a long sequence of instructions may have to wait to execute every instruction.
Coarse-grained multithreading attempts to avoid this problem by only
switching threads that are stalled waiting for a time-consuming operation to
complete (e.g., a load from main memory).
Simultaneous multithreading is a variation on fine-grained multithreading.
It attempts to exploit superscalar processors by allowing multiple threads to
make use of the multiple functional units.
7. Explain how implicit parallelisms like pipelining and superscalar
execution result in better cost-effective performance gains
Microprocessor technology has recorded an average 50% annual performance
improvement over the last few decades. This development has also uncovered
several performance bottlenecks in achieving the sustained realizable
performance improvement. To alleviate these bottlenecks, microprocessor
designers have explored a number of alternate architectural innovations to
achieve cost-effective performance gains. One of the most important
innovations is multiplicity in processing units, datapaths, and memory
units. This multiplicity is either entirely hidden from the programmer or
exposed to the
programmer in different forms.
Implicit parallelism is an approach to provide multiplicity at the level of
instruction execution for achieving the cost-effective performance gain. In
implicit parallelism, parallelism is exploited by the compiler and/or the runtime
system and this type of parallelism is transparent to the programmer.
Two common approaches for the implicit parallelism are (a) Instruction
Pipelining and (b) Superscalar Execution.
Instruction Pipelining: Instruction pipelining is a technique used in the
design of microprocessors to increase their instruction throughput. The main
idea is to divide the basic instruction cycle into a series of independent steps of
micro-operations. Rather than processing each instruction sequentially, the
independent micro-operations of different instructions are executed concurrently
(by different functional units) in parallel. Pipelining enables faster execution by
overlapping various stages in instruction execution (fetch, schedule, decode,
operand fetch, execute, store, among others).
To illustrate how instruction pipelining enables faster execution, consider the
execution of the following code fragment:
load R1, @1000
load R2, @1008
add R1, @1004
add R2, @100C
add R1, R2
store R1, @2000
Sequential Execution
Instruction          Sequential Execution Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  E
store R1, @2000      IF  ID  WB
Clock cycles: 1 to 20 (each instruction completes all of its stages before the
next instruction begins)
Pipelined Execution
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  NOP  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 9 (each instruction enters the pipeline one cycle after the
previous one)
As seen in the example, the pipelined execution requires only 9 clock cycles, which
is a significant improvement over the 20 clock cycles needed in sequential
execution. However, the speed of a single pipeline is always limited by the
largest atomic task (its slowest stage). Also, the pipeline performance is
always dependent on the efficiency of the dynamic branch prediction mechanism
employed.
Superscalar Execution: Superscalar execution is an advanced form of
pipelined instruction-level parallelism that allows dispatching of multiple
instructions to multiple pipelines, processed concurrently by redundant
functional units on the processor. Superscalar execution can provide better cost-
effective performance as it improves the degree of overlapping of parallel
concurrent execution of multiple instructions.
To illustrate how the superscalar execution results in better performance gain,
consider the execution of the previous code fragment on a processor with two
pipelines and the ability to simultaneously issue two instructions.
Superscalar Execution
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  NOP  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 7 (two instructions are issued in each cycle)
With the superscalar execution, the execution of the same code fragment takes
only 7 clock cycles instead of 9.
These examples illustrate that implicit parallelisms like pipelining and
superscalar execution can result in better cost-effective performance gains.
8. Explain the concepts of pipelining and superscalar execution with
suitable examples. Also explain their individual merits and demerits.
Pipelining and superscalar execution are two forms of instruction-level implicit
parallelism inherent in the design of modern microprocessors to increase their
instruction throughput.
Pipelining: The main idea of the instruction pipelining is to divide the
instruction cycle into a series of independent steps of micro-operations. Rather
than processing each instruction sequentially, the independent micro-operations
of different instructions are executed concurrently (by different circuitry) in
parallel. Pipelining enables faster execution by overlapping various stages in
instruction execution (fetch, schedule, decode, operand fetch, execute, store,
among others).
To illustrate how instruction pipelining executes instructions, consider the
execution of the following code fragment using pipelining:
load R1, @1000
load R2, @1008
add R1, @1004
add R2, @100C
add R1, R2
store R1, @2000
Pipelined Execution
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  NOP  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 9
As seen in the example, pipelining results in the overlapping in execution of the
various stages of different instructions.
Advantage of Pipelining:
- The cycle time of the processor is reduced by overlapping the different
  execution stages of various instructions, thereby increasing the overall
  instruction throughput.
Disadvantages of Pipelining:
- Design complexity: pipelining involves adding hardware to the chip.
- Inability to continuously run the pipeline at full speed because of pipeline
  hazards, such as data dependency, resource dependency and branch dependency,
  which disrupt the smooth execution of the pipeline.
Superscalar Execution: Superscalar execution is an advanced form of
pipelined instruction-level parallelism that allows dispatching of multiple
instructions to multiple pipelines, processed concurrently by redundant
functional units on the processor. Superscalar execution can provide better cost-
effective performance as it improves the degree of overlapping of parallel
concurrent execution of multiple instructions.
To illustrate how the superscalar pipelining executes instructions, consider the
execution of the previous code fragment on a processor with two pipelines and
the ability to simultaneously issue two instructions.
Superscalar Execution
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  NOP  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 7 (two instructions are issued in each cycle)
Advantage of Superscalar Pipelining:
- Since the processor accepts multiple instructions per clock cycle,
  superscalar execution results in better performance as compared to a single
  pipeline.
Disadvantages of Superscalar Pipelining:
- Design complexity: the design of superscalar processors is still more complex as compared to a single pipeline design.
- Inability to continuously run the pipeline at full speed because of
  pipeline hazards, such as data dependency, resource dependency and branch
  dependency, which disrupt the smooth execution of the pipeline.
- The performance of superscalar architectures is limited by the available
  instruction level parallelism and the ability of a processor to detect and
  schedule concurrent instructions.
9. Though superscalar execution seems to be simple and natural, there are a number of issues to be resolved. Elaborate on the issues that need to be resolved.
Superscalar execution is an advanced form of pipelined instruction-level
parallelism that allows dispatching of multiple instructions to multiple pipelines,
processed concurrently by redundant functional units on the processor.
Superscalar execution can provide better cost-effective performance as it
improves the degree of overlapping of parallel concurrent execution of multiple
instructions. Since superscalar execution exploits multiple instruction pipelines,
it seems to be a simple and natural means for improving the performance.
However, it needs to resolve the following issues for achieving the expected
performance improvement.
a. Pipeline Hazards
b. Out of Order Execution
c. Available Instruction Level Parallelism
Pipeline Hazards: A pipeline hazard is the inability to continuously run the
pipeline at full speed because of various pipeline dependencies such as data
dependency (called Data Hazard), resource dependency (called Structural
Hazard) and branch dependency (called Control or Branch Hazard).
Data Hazards: Data hazards occur when instructions that exhibit data
dependence modify data in different stages of a pipeline. Ignoring potential data
hazards can result in race conditions. There are three situations in which a data
hazard can occur:
- read after write (RAW), called true dependency
- write after read (WAR), called anti-dependency
- write after write (WAW), called output dependency
As an example, consider the superscalar execution of the following two
instructions i1 and i2, with i1 occurring before i2 in program order.
True dependency: i2 tries to read R2 before i1 writes to it
i1. R2 ← R1 + R3
i2. R4 ← R2 + R3
Anti-dependency: i2 tries to write R5 before it is read by i1
i1. R4 ← R1 + R5
i2. R5 ← R1 + R2
Output dependency: i2 tries to write R2 before it is written by i1
i1. R2 ← R4 + R7
i2. R2 ← R1 + R3
Structural Hazards: A structural hazard occurs when a part of the processor's
hardware is needed by two or more instructions at the same time. A popular
example is a single memory unit that is accessed both in the fetch stage where
an instruction is retrieved from memory, and the memory stage where data is
written to and/or read from memory.
Control hazards: Branching hazards (also known as control hazards) occur with
branches. In many instruction pipelines, the processor will not know the
outcome of the branch when it needs to insert a new instruction into the
pipeline (normally during the fetch stage).
Dependencies of the above types must be resolved before simultaneous issue of
instructions. Pipeline Bubbling (also known as a pipeline break or a pipeline
stall) is the general strategy to prevent all the three kinds of hazards. As
instructions are fetched, control logic determines whether a hazard will occur. If
this is true, then the control logic inserts NOPs into the pipeline. Thus, before
the next instruction (which would cause the hazard) is executed, the previous
one will have had sufficient time to complete and prevent the hazard.
A variety of specific strategies are also available for handling the different
pipeline hazards. Examples include branch prediction for handling control
hazards and out-of-order execution for handling data hazards.
There are two implications to the pipeline hazards handling. First, since the
resolution is done at runtime, it must be supported in hardware; the complexity
of this hardware can be high. Second, the amount of instruction level parallelism
in a program is often limited and is a function of coding technique.
Out of Order Execution: The ability of a processor to detect and schedule
concurrent instructions is critical to superscalar performance. As an example,
consider the execution of the following code fragment on a processor with two
pipelines and the ability to simultaneously issue two instructions.
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
In the above code fragment, there is a data dependency between the first two
instructions
load R1, @1000 and add R1, @1004.
Therefore, these instructions cannot be issued simultaneously. However, if the
processor has the ability to look ahead, it will realize that it is possible to
schedule the third instruction
load R2, @1008
with the first instruction
load R1, @1000.
In the next issue cycle, instructions two and four
add R1, @1004 and add R2, @100C
can be scheduled, and so on.
However, the processor needs the ability to issue instructions out-of-order to
accomplish the desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example. Most current
microprocessors are capable of out-of-order issue and completion.
Available Instruction Level Parallelism: The performance of superscalar
architectures is also limited by the available instruction level parallelism. As an
example, consider the execution of the following code fragment on a processor
with two pipelines and the ability to simultaneously issue two instructions.
load R1, @1000
add R1, @1004
add R1, @1008
add R1, @100C
store R1, @2000
Superscalar Execution
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R1, @1008        IF  ID  OF  E
add R1, @100C        IF  ID  OF  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 7
For simplicity of discussion, let us ignore the pipelining aspects of the example
and focus on the execution aspects of the program. Assuming two execution units
(multiply-add units), the following figure illustrates that there are several zero-
issue cycles (cycles in which the floating point unit is idle).
Clock Cycle    Exe. Unit 1    Exe. Unit 2
4              idle           idle            (vertical waste)
5              E              E               (full issue slot)
6              E              idle            (horizontal waste)
7              idle           idle            (vertical waste)
These are essentially wasted cycles from the point of view of the execution unit.
If, during a particular cycle, no instructions are issued on the execution units, it
is referred to as vertical waste; if only part of the execution units are used during
a cycle, it is termed horizontal waste. In the example, we have two cycles of
vertical waste and one cycle with horizontal waste. In all, only three of the eight
available cycles are used for computation. This implies that the code fragment
will yield no more than three eighths of the peak rated FLOP count of the
processor.
In short, though the superscalar execution seems to be a simple and natural
means for improving the performance, due to limited parallelism, resource
dependencies, or the inability of a processor to extract parallelism, the resources
of superscalar processors are heavily under-utilized.
10. The ability of a processor to detect and schedule concurrent
instructions is critical to superscalar performance. Justify the
statement with an example
Superscalar execution is an advanced form of pipelined instruction-level
parallelism that allows dispatching of multiple instructions to multiple pipelines,
processed concurrently by redundant functional units on the processor.
Superscalar execution can provide better cost-effective performance as it
improves the degree of overlapping of parallel concurrent execution of multiple
instructions. Since superscalar execution exploits multiple instruction pipelines,
it seems to be a simple and natural means for improving the performance.
However, the ability of a processor to detect and schedule concurrent
instructions is critical to superscalar performance.
To illustrate this point, consider the execution of the following two different code
fragments for adding four numbers on a processor with two pipelines and the
ability to simultaneously issue two instructions.
Code Fragment 1:
1. load R1, @1000
2. load R2, @1008
3. add R1, @1004
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
Consider the execution of the above code fragment for adding four numbers. The
first and second instructions are independent and therefore can be issued
concurrently. This is illustrated in the simultaneous issue of the instructions
load R1, @1000 and load R2, @1008
at t = 1. The instructions are fetched, decoded, and the operands are fetched.
These instructions terminate at t = 3. The next two instructions,
add R1, @1004 and add R2, @100C
are also mutually independent, although they must be executed after the first
two instructions. Consequently, they can be issued concurrently at t = 2 since the
processors are pipelined. These instructions terminate at t = 5. The next two
instructions,
add R1, R2 and store R1, @2000
cannot be executed concurrently since the result of the former (contents of
register R1) is used by the latter. Therefore, only the add instruction is issued at
t = 3 and the store instruction at t = 4. Note that the instruction
add R1, R2
can be executed only after the previous two instructions have been executed. The
instruction schedule is illustrated below.
Superscalar Execution of Code Fragment 1
Instruction          Pipeline Stages
load R1, @1000       IF  ID  OF
load R2, @1008       IF  ID  OF
add R1, @1004        IF  ID  OF  E
add R2, @100C        IF  ID  OF  E
add R1, R2           IF  ID  NOP  E
store R1, @2000      IF  ID  NOP  WB
Clock cycles: 1 to 7 (two instructions are issued in each cycle)
Code Fragment 2:
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
This code fragment is exactly equivalent to code fragment 1 and it computes the
sum of four numbers. In this code fragment, there is a data dependency between
the first two instructions
load R1, @1000 and add R1, @1004
Therefore, these instructions cannot be issued simultaneously. However, if the
processor has the ability to look ahead, it will realize that it is possible to
schedule the third instruction
load R2, @1008
with the first instruction
load R1, @1000.
In the next issue cycle, instructions two and four
add R1, @1004 and add R2, @100C
can be scheduled, and so on.
However, the processor needs the ability to issue instructions out-of-order to
accomplish the desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example. Most current
microprocessors are capable of out-of-order issue and completion.
11. Explain how VLIW processors can achieve cost-effective
performance gains over uniprocessors. What are their merits and
demerits?
Microprocessor technology has recorded an unprecedented growth over the past
few decades. This growth has also unveiled various bottlenecks in achieving
sustained performance gain. To alleviate these performance bottlenecks,
microprocessor designers have explored a number of alternate architectural
innovations involving implicit instruction-level parallelisms like pipelining,
superscalar architectures and out-of-order execution. All these implicit
instruction-level parallelism approaches have the demerit that they involve
increased hardware complexity (higher cost, larger circuits, higher power
consumption), because the processor must inherently make all of the decisions
internally for these approaches to work (for example, the scheduling of
instructions and the determination of interdependencies).
Another alternate architectural innovation to cost-effective performance gains is
the Very Long Instruction Word (VLIW) processors. VLIW is one particular style
of processor design that tries to achieve high levels of explicit instruction level
parallelism by executing long instruction words composed of multiple operations.
The long instruction word called a MultiOp consists of multiple arithmetic, logic
and control operations. The VLIW processor concurrently executes the set of
operations within a MultiOp thereby achieving instruction level parallelism.
A VLIW processor allows programs to explicitly specify instructions to be
executed in parallel. That is, a VLIW processor depends on the programs
themselves for providing all the decisions regarding which instructions are to be
executed simultaneously and how conflicts are to be resolved. A VLIW processor
relies on the compiler to resolve the scheduling and interdependencies at compile
time. Instructions that can be executed concurrently are packed into groups and
are passed to the processor as a single long instruction word (thus the name) to
be executed on multiple functional units at the same time. This means that the
compiler becomes much more complex, but the hardware is simpler than many
other approaches to parallelism.
Advantages:
- Since VLIW processors depend on compilers for resolving scheduling and
  interdependencies, the decoding and instruction issue mechanisms are simpler
  in VLIW processors.
- Since scheduling and interdependencies are resolved at compilation time,
  instruction level parallelism can be exploited to the maximum, as the
  compiler has a larger-scale view of the program compared to the
  instruction-level view of a superscalar processor when selecting parallel
  instructions. Further, compilers can also use a variety of transformations
  to optimize parallelism when compared to a hardware issue unit.
- The VLIW approach executes operations in parallel based on a fixed schedule
  determined when programs are compiled. Since determining the order of
  execution of operations (including which operations can execute
  simultaneously) is handled by the compiler, the processor does not need the
  scheduling hardware. As a result, VLIW CPUs offer significant computational
  power with less hardware complexity.
Disadvantages:
- VLIW programs only work correctly when executed on a processor with the
  same number of execution units and the same instruction latencies as the
  processor they were compiled for, which makes it virtually impossible to
  maintain compatibility between generations of a processor family. For
  example, if the number of execution units in a processor increases between
  generations, the new processor will try to combine operations from multiple
  instructions in each cycle, potentially causing dependent instructions to
  execute in the same cycle. Similarly, changing instruction latencies between
  generations of a processor family can cause operations to execute before
  their inputs are ready or after their inputs have been overwritten,
  resulting in incorrect behaviour.
- Since the scheduling and interdependencies are resolved at compilation
  time, the compiler lacks dynamic program state, such as a branch history
  buffer, that helps in making scheduling decisions. Since the static
  prediction mechanism employed by the compiler may not be as effective as a
  dynamic one, the branch and memory predictions made by the compiler may not
  be accurate. Moreover, some runtime situations, such as stalls on data
  fetch because of cache misses, are extremely difficult to predict
  accurately. This limits the scope and performance of static compiler-based
  scheduling.
12. With an example, illustrate how the memory latency can be a
bottleneck in achieving the peak processor performance. Also
illustrate how the cache memory can reduce this performance
bottleneck.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. There are two figures that are often used to describe the
performance of a memory system: the latency and the bandwidth.
The memory latency is the time that elapses between the memory beginning to
transmit the data and the processor starting to receive the first byte. The
memory bandwidth is the rate at which the processor receives data after it has
started to receive the first byte. So if the latency of a memory system is l seconds
and the bandwidth is b bytes per second, then the time it takes to transmit a
message of n bytes is l + n/b.
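The formula t = l + n/b can be transcribed directly into C. The bandwidth
figure below is an assumption chosen only for illustration, while the 100 ns
latency matches the DRAM example that follows.

#include <stdio.h>

/* Time to move n bytes through a channel with latency l (seconds)
 * and bandwidth b (bytes per second): t = l + n / b.              */
static double transfer_time(double l, double b, double n)
{
    return l + n / b;
}

int main(void)
{
    double l = 100e-9;      /* 100 ns latency, as in the DRAM example */
    double b = 1e9;         /* assumed bandwidth of 1 GB/s            */
    printf("8 bytes:  %.0f ns\n", transfer_time(l, b, 8)  * 1e9);   /* 108 */
    printf("64 bytes: %.0f ns\n", transfer_time(l, b, 64) * 1e9);   /* 164 */
    return 0;
}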
To illustrate the effect of memory system latency on system performance,
consider a processor operating at 1 GHz (1/10^9 s = 10^-9 s = 1 ns clock)
connected to a DRAM with a latency of 100 ns (no caches). Assume that the size
of the memory block is 1 word per block. Also assume that the processor has
two multiply-add units and is capable of executing four instructions in each
cycle of 1 ns. The peak processor rating is therefore 4 GFLOPS (10^9 clock
cycles per second × 4 FLOPs per clock cycle = 4 × 10^9 FLOPS = 4 GFLOPS).
Since the memory latency is equal to 100 cycles and the block size is one
word, every time a memory request is made, the processor must wait 100 cycles
before it can process the data. That is, the processor is limited to one
floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small
fraction of the peak processor rating.
This example highlights how a long memory latency (and hence a large speed
mismatch between memory and CPU) can be a bottleneck in achieving the peak
processor performance.
One of the architectural innovations in memory system design for reducing the
mismatch in processor and memory speeds is the introduction of a smaller and
faster cache memory between the processor and the memory. The cache acts as
low-latency high-bandwidth storage.
The data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache.
To illustrate the impact of caches on memory latency and system performance,
consider a 1 GHz processor with a 100 ns latency DRAM. Assume that the size of
the memory block is 1 word per block and that a cache memory of size 32 KB
with a latency of 1 ns is available. Assume that this setup is used to multiply two
matrices A and B of dimensions 32 × 32. Fetching the two matrices into the
cache from memory corresponds to fetching 2K words (one matrix = 32 × 32 words
= 2^5 × 2^5 = 2^10 = 1K words, i.e., 2 matrices = 2K words), which takes
approximately 200 µs (memory latency = 100 ns; memory latency for 2K words =
2 × 10^3 × 100 ns = 200,000 ns = 200 µs). Multiplying two n × n matrices takes
2n^3 operations. For our problem, this corresponds to 64K operations (2 × 32^3
= 2 × (2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles (≈ 16 µs) at
four instructions per cycle (64K / 4 = 16K cycles ≈ 16 µs). The total time for
the computation is therefore approximately the sum of the time for load/store
operations and the time for the computation itself, i.e., 200 + 16 = 216 µs.
This corresponds to a peak computation rate of 64K FLOPs / 216 µs ≈ 303
MFLOPS.
Note that this is a thirty-fold improvement over the previous example, although
it is still less than 10% of the peak processor performance. This example
illustrates that placing a small cache memory improves the processor
utilization considerably.
13. With a suitable example, illustrate the effect of memory bandwidth on
improving processor performance gain
Memory bandwidth refers to the rate at which data can be moved between the
processor and memory. It is determined by the bandwidth of the memory bus as
well as the memory units. The memory bandwidth of a system decides the rate
at which data can be pumped to the processor, and it has a large impact on the
realizable peak system performance.
One commonly used technique to improve memory bandwidth is to increase the
size of the memory blocks. Since bigger blocks can effectively utilize the special
locality, increasing the block size results in hiding memory latency.
To illustrate the effect of block size on hiding memory latency (and thus improving system
performance), consider a 1 GHz processor with a 100 ns latency DRAM. Assume
that the memory block size (cache line) is 1 word. Assume that this setup is used to
find the dot product of two vectors. Since the block size is one word, the
processor takes 100 cycles to fetch each word. For each pair of words, the dot
product performs one multiply-add, i.e., two FLOPs in 200 cycles. Therefore, the
algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS.
Now let us consider what happens if the block size is increased to four words, i.e.,
the processor can fetch a four-word cache line every 100 cycles. For each pair of
four-word lines, the dot product performs eight FLOPs in 200 cycles. This
corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS. Note that
increasing the block size from one to four words did not increase the latency of
the memory system. However, it increased the bandwidth four-fold.
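The loop in question is the ordinary dot product; because a[i] and b[i] are stored contiguously, a single four-word line fetch serves four consecutive iterations. A minimal C sketch (function and parameter names are illustrative):

    /* Dot product of two n-element vectors. Consecutive elements share
       cache lines, so with a 4-word line one DRAM access supplies data
       for four iterations (spatial locality). */
    double dot(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];   /* one multiply-add = 2 FLOPs per pair */
        return s;
    }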
The above example assumed a wide data bus equivalent to the size of the cache
line. In practice, such wide buses are expensive to construct. In a more practical
system, consecutive words are sent on the memory bus on subsequent bus cycles
after the first word is retrieved. For example, with a 32 bit data bus, the first
word is put on the bus after 100 ns (the associated latency) and one word is put
on each subsequent bus cycle. This changes our calculations above slightly since
the entire cache line becomes available only after 100 ns plus 3 further bus cycles
(103 ns with a 1 ns bus cycle). However, this does not change the execution rate significantly.
The above examples clearly illustrate how increased bandwidth results in higher
peak computation rates.
14. Data reuse is critical to cache performance. Justify the statement with an example.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. Two figures of merit are often used to describe the
performance of a memory system: latency and bandwidth. Memory
latency plays the larger role in the speed mismatch between processor
and memory. One of the architectural innovations in memory system design for
reducing the mismatch in processor and memory speeds is the introduction of a
smaller and faster cache memory between the processor and the memory. The
data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache. The fraction of data references
satisfied by the cache is called the cache hit ratio of the computation on the
system.
Data reuse, measured in terms of the cache hit ratio, is critical for cache
performance because if each data item is used only once, it still has to be
fetched once per use from the DRAM, and therefore the DRAM latency is
paid for each operation.
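A convenient way to quantify this effect (a standard formula, not stated explicitly in these notes) is the effective memory access time t_eff = h × t_cache + (1 − h) × t_DRAM, where h is the hit ratio. With t_cache = 1 ns and t_DRAM = 100 ns, a hit ratio of h = 0.97 gives t_eff ≈ 0.97 × 1 + 0.03 × 100 ≈ 4 ns, whereas h = 0 (no reuse) means the full 100 ns is paid on every access.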
To illustrate this, consider a 1 GHz processor with a 100 ns latency DRAM with
a memory block size of 1 word. Assume that a cache memory of size 32 KB with a
latency of 1 ns is available. Also assume that the processor has two multiply-add
units and is capable of executing four instructions in each cycle of 1 ns. Assume
that this setup is used to multiply two matrices A and B of dimensions 32 × 32.
Fetching the two matrices into the cache from memory corresponds to fetching
2K words (one matrix = 32 × 32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2 matrices =
2K words). Multiplying two n × n matrices takes 2n^3 operations (this indicates data
reuse, because the 2K data words are used in 64K operations). For our problem, this
corresponds to 64K operations (2 × 32^3 = 2 × (2^5)^3 = 2^16 = 64K). This results in a cache
hit ratio:
Hit ratio = (data references served by the cache) / (total data references)
i.e., Hit ratio = 62K/64K ≈ 0.97, or about 97%
(Of the 64K data references made by the matrix operations, only 2K require a memory
access and the remaining 62K are served from the cache.)
A higher hit ratio results in a lower effective memory latency and higher system
performance.
For example, in the earlier example of matrix multiplication, the peak
performance of the system would be 4 GFLOPS at the rate of 4 FLOPs
per clock cycle (for a total of 1 GHz = 10^9 clock cycles per second). However, due to
the memory latency of 100 ns, in the absence of cache memory the realizable
peak performance is only 4×10^9/100 = 40×10^6 = 40 MFLOPS. In
the presence of a cache memory of size 32 KB with a latency of 1 ns, the increase
in the realizable peak performance can be illustrated with the matrix
multiplication example. Fetching the two 32 × 32 matrices into the cache from
memory corresponds to fetching 2K words (one matrix = 32 × 32 words = 2^5 × 2^5 =
2^10 = 1K words, i.e., 2 matrices = 2K words), which takes approximately 200 μs (2K words = 2×10^3 words × 10^2 ns = 2×10^5 ns = 200 μs). Multiplying two n × n matrices takes 2n^3 operations. For our problem, this corresponds to 64K
operations (2 × 32^3 = 2 × (2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles or
16 μs at four instructions per cycle (number of cycles = 64K/4 = 16K cycles = 16K × 1 ns = 16 μs). The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself,
i.e., 200 + 16 = 216 μs. This corresponds to a peak computation rate of 64K FLOPs / 216 μs, or about 303 MFLOPS, nearly an eight-fold improvement over the model with
no cache. This performance improvement is due to data reuse: the 2K
data words are used in 64K operations, about 32 uses per word on average.
If each data item is used exactly once, then every data reference must be
served from memory, and none is found in the cache.
In that case the hit ratio becomes
Hit ratio = (data references served by the cache) / (total data references)
i.e., Hit ratio = 0/2K = 0
A lower hit ratio results in a higher effective memory latency and lower system
performance.
For example, consider finding the dot product of the previous two matrices
(instead of multiplication). Now the total number of operations is 32 × 32 × 2 = 2^11 = 2K (one
multiply and one add for each element), which can be performed in 0.5K cycles or
0.5 μs at four instructions per cycle (number of cycles = 2K/4 = 0.5K cycles =
0.5K × 1 ns = 0.5 μs). The total time for the computation is therefore 200 + 0.5 = 200.5 μs. This corresponds to a peak computation rate of 2K/200.5 ≈ 9.97 MFLOPS. This
performance reduction is due to the absence of data reuse, as the 2K data words are
used in only 2K operations.
The above examples illustrate that the data reuse measured in terms of cache hit
ratio is critical for cache performance.
15. The performance of a memory-bound program is critically impacted by the cache hit ratio. Justify the statement with an example.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. Two figures of merit are often used to describe the
performance of a memory system: latency and bandwidth. Memory
latency plays the larger role in the speed mismatch between processor
and memory. One of the architectural innovations in memory system design for
reducing the mismatch in processor and memory speeds is the introduction of a
smaller and faster cache memory between the processor and the memory. The
data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache. The fraction of data references
satisfied by the cache is called the cache hit ratio of the computation on the
system. The effective computation rate of many applications is bounded not by
the processing rate of the CPU, but by the rate at which data can be pumped into
the CPU. Such computations are referred to as being memory bound. The
performance of memory-bound programs is critically impacted by the cache hit
ratio.
To illustrate this, consider a 1 GHz processor with a 100 ns latency DRAM.
Assume that a cache memory of size 32 KB with a latency of 1 ns is available.
Assume that this setup is used to multiply two matrices A and B of dimensions
32 × 32. Fetching the two matrices into the cache from memory corresponds to
fetching 2K words (one matrix = 32 × 32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2
matrices = 2K words). Multiplying two n × n matrices takes 2n^3 operations
(this indicates data reuse, because the 2K data words are used in 64K operations). For our
problem, this corresponds to 64K operations (2 × 32^3 = 2 × (2^5)^3 = 2^16 = 64K). This
results in a cache hit ratio:
Hit ratio = (data references served by the cache) / (total data references)
i.e., Hit ratio = 62K/64K ≈ 0.97, or about 97% (only 2K of the 64K references go to memory)
A higher hit ratio results in a lower effective memory latency and higher system
performance.
If each data item is used exactly once, then every data reference must be
served from memory, and none is found in the cache.
In that case the hit ratio becomes
Hit ratio = (data references served by the cache) / (total data references)
i.e., Hit ratio = 0/2K = 0
A lower hit ratio results in a higher effective memory latency and lower system
performance.
The above examples illustrate that the performance of a memory bound program
is critically impacted by the cache hit ratio.
16. How can the locality of reference influence the performance gain of
a processor?
Locality of reference (also known as the principle of locality) is the tendency of
a program to access the same or related memory locations frequently.
Two types of locality of reference have been observed:
temporal locality and spatial locality.
Temporal locality is the tendency for a program to reference the same memory
location or a cluster several times during brief intervals of time. Temporal
locality is exhibited by program loops, subroutines, stacks and variables used for
counting and totalling.
Spatial locality is the tendency of a program to reference clustered locations in
preference to randomly distributed locations. Spatial locality suggests that once
a location is referenced, it is highly likely that nearby locations will be referenced
in the near future. Spatial locality is exhibited by array traversals, sequential code
execution, the tendency to reference stack locations in the vicinity of the stack
pointer, etc.
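A classic C illustration of both kinds of locality is the order in which a two-dimensional array is traversed. C stores arrays row by row, so the two loop orders below (a minimal sketch; names and the size N are illustrative) touch the same data with very different locality:

    #include <stdio.h>
    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* Row-major traversal: consecutive a[i][j] are adjacent in memory,
           so each fetched cache line serves several iterations (good
           spatial locality); the accumulator sum shows temporal locality. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major traversal: successive accesses are N*sizeof(double)
           bytes apart, so almost every access touches a new cache line
           (poor spatial locality). */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }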
Locality of reference is a form of predictable program behaviour. Programs that
exhibit strong locality of reference are good candidates for performance
optimization through techniques such as caching and instruction prefetching,
which can improve the effective memory bandwidth and hide memory latency.
Effect of Locality of Reference in Hiding Memory Latency Using a Cache:
Both the spatial and the temporal locality of reference exhibited by a program can
improve the cache hit ratio, which hides memory latency and hence
improves system performance.
To illustrate how the locality of reference can improve the system performance
by hiding memory latency through the use of a cache, consider the following
example:
Consider a processor operating at 1 GHz (clock period 1/10^9 s = 1 ns) connected to a
DRAM with a latency of 100 ns (no caches). Assume that the size of the memory
block is 1 word per block. Also assume that the processor has two multiply-add
units and is capable of executing four instructions in each cycle of 1 ns. The peak
processor rating is therefore 4 GFLOPS (10^9 clock cycles × 4 FLOPs per
clock cycle = 4×10^9 FLOPs per second). Since the memory latency is equal to 100 cycles and
the block size is one word, every time a memory request is made, the processor must
wait 100 cycles before it can process the data. That is, the processor
is limited to one floating point operation every 100 ns, or a speed of 10
MFLOPS, a very small fraction of the peak processor rating.
The performance of the above processor can be improved roughly thirty-fold by
incorporating a cache memory of size 32 KB, as illustrated below:
Assume that the size of the memory block is 1 word per block and that a cache
memory of size 32 KB with a latency of 1 ns is available. Assume that this setup
is used to multiply two matrices A and B of dimensions 32 × 32. Fetching the two
matrices into the cache from memory corresponds to fetching 2K words (one
matrix = 32 × 32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2 matrices = 2K words),
which takes approximately 200 μs (memory latency = 100 ns per word, so the latency for 2K words = 2×10^3 × 100 ns = 200,000 ns = 200 μs). Multiplying two n × n matrices takes 2n^3 operations. For our problem, this corresponds to 64K
operations (2 × 32^3 = 2 × (2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles (or
16 μs) at four instructions per cycle (64K/4 = 16K cycles = 16,000 ns = 16 μs). The total time for the computation is therefore approximately the sum of the time for
load/store operations and the time for the computation itself, i.e., 200 + 16 = 216 μs. This corresponds to a peak computation rate of 64K FLOPs / 216 μs, or about 303 MFLOPS (a thirty-fold
improvement).
Effect of Locality of Reference in Improving the Memory Bandwidth:
The locality of reference also has an effect on improving the memory bandwidth
by allowing blocks of larger size to be brought into the cache, as illustrated
in the following example:
Consider again a memory system with a single-cycle cache and 100-cycle latency
DRAM with the processor operating at 1 GHz. If the block size is one word, the
processor takes 100 cycles to fetch each word. If we use this setup to find the dot
product of two vectors, then for each pair of words the dot product performs one
multiply-add, i.e., two FLOPs. Therefore, the algorithm performs one FLOP
every 100 cycles, for a peak speed of 10 MFLOPS. Now let us consider what
happens if the block size is increased to four words, i.e., the processor can fetch a
happens if the block size is increased to four words, i.e., the processor can fetch a
four-word cache line every 100 cycles. Assuming that the vectors are laid out
linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200
cycles. This is because a single memory access fetches four consecutive words in
the vector. Therefore, two accesses can fetch four elements of each of the vectors.
This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS. Note
that increasing the block size from one to four words did not change the latency
of the memory system. However, it increased the bandwidth four-fold. In this
case, the increased bandwidth of the memory system enabled us to accelerate
system performance.
Effect of Locality of Reference in Hiding Memory Latency Using
Prefetching:
Prefetching is the process of bringing data or instructions from memory into the
cache, in anticipation, before they are needed. Prefetching is found to be
successful with programs exhibiting good locality of reference, and it increases
system performance by hiding memory latency.
The following example illustrates how prefetching can hide memory latency:
Consider the problem of adding two vectors a and b using a single for loop. In the
first iteration of the loop, the processor requests a[0] and b[0]. Since these are
not in the cache, the processor must pay the memory latency. While these
requests are being serviced, the processor also requests the subsequent elements
a[1] and b[1], a[2] and b[2], etc., and puts them in the cache in advance. Assuming
that each request is generated in one cycle (1 ns) and memory requests are
satisfied in 100 ns, after 100 such requests the first set of data items is returned
by the memory system. Subsequently, one pair of vector components will be
returned every cycle. In this way, in each subsequent cycle, one addition can be
performed and processor cycles are not wasted (results in latency hiding).
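On real hardware this advancing of loads is done by the hardware prefetcher or the compiler, but it can also be expressed explicitly. The sketch below uses the __builtin_prefetch intrinsic available in GCC and Clang; the prefetch distance of 16 is an illustrative tuning assumption, not a figure from these notes:

    /* Vector addition with explicit software prefetching (GCC/Clang).
       PREFETCH_DIST is a hypothetical tuning parameter: it must be
       large enough that a line arrives before it is needed. Prefetching
       a little past the end of the arrays is harmless: the intrinsic
       never faults on an invalid address. */
    #define PREFETCH_DIST 16

    void vadd(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            /* Request future elements now, so the ~100-cycle DRAM latency
               overlaps with useful work on the current elements. */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
            __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 1);
            c[i] = a[i] + b[i];
        }
    }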
17. Explain the terms prefetching and multithreading. How can
prefetching and multithreading result in processor performance gain?
The effective performance of a program on a computer
has traditionally been limited by two memory-related performance factors:
latency and bandwidth. Several techniques have been proposed for handling this
problem, including cache memory, prefetching and multithreading. Here, the
latter two approaches are discussed.
Prefetching
Prefetching is the process of bringing data or instructions from memory into the
cache, in anticipation, before they are actually needed. Prefetching works
well with programs exhibiting good locality of reference, thereby hiding the
memory latency.
In a typical program, a data item is loaded and used by a processor in a small
time window. If the load results in a cache miss, then the program stalls. A
simple solution to this problem is to advance the load operation so that even if
there is a cache miss, the data is likely to have arrived by the time it is used.
To illustrate the effect of prefetching on memory latency hiding, consider the
problem of adding two vectors a and b using a single for loop. In the first
iteration of the loop, the processor requests a[0] and b[0]. Since these are not in
the cache, the processor must pay the memory latency. While these requests are
being serviced, the processor also requests the subsequent elements a[1] and
b[1], a[2] and b[2], etc., and puts them in the cache in advance. Assuming that each
request is generated in one cycle (1 ns) and memory requests are satisfied in 100
ns, after 100 such requests the first set of data items is returned by the memory
system. Subsequently, one pair of vector components will be returned every
cycle. In this way, in each subsequent cycle, one addition can be performed and
processor cycles are not wasted (results in latency hiding).
Multithreading
Multithreading is the ability of an operating system to execute different parts of
a program by maintaining multiple threads of execution at a time. The different
threads of control of a program can be executed concurrently with other threads of
the same program. Since multiple threads of the same program are concurrently
available, execution control can be switched between processor-resident threads
on cache misses. The programmer must carefully design the program in such a
way that all the threads can run at the same time without interfering with each
other.
To illustrate the effect of threading on hiding memory latency, consider the
following code segment for multiplying an n × n matrix a by a vector b to get
vector c, one dot product per row:

    for (i = 0; i < n; i++)
        c[i] = dot_product(get_row(a, i), b);

Each of the n dot products is independent of the others, so each can be executed
as a separate thread; when one thread stalls on a memory access, the processor
can switch to another thread that already has its data, so that cycles are not wasted.
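The same idea can be written out concretely with POSIX threads, which these notes take up in a later module. The following is a minimal sketch under stated assumptions: the one-thread-per-row decomposition, the task_t helper struct and all names are illustrative choices, not from the notes:

    #include <pthread.h>
    #include <stddef.h>

    /* One task per row of the matrix: thread i computes c[i] = a[i][.] . b */
    typedef struct { const double *row, *b; double *out; int n; } task_t;

    static void *row_dot(void *arg) {
        task_t *t = arg;
        double s = 0.0;
        for (int j = 0; j < t->n; j++)
            s += t->row[j] * t->b[j];   /* one multiply-add per element */
        *t->out = s;
        return NULL;
    }

    /* c = a * b for an n x n matrix a stored row-major in a flat array.
       While one thread stalls on a DRAM access, the OS can schedule
       another thread whose data is already in cache, hiding part of
       the latency. */
    void matvec(const double *a, const double *b, double *c, int n,
                pthread_t *tid, task_t *task) {
        for (int i = 0; i < n; i++) {
            task[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, row_dot, &task[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
    }

In practice one would create far fewer threads than rows, but the sketch shows the mechanism: a thread blocked on memory costs nothing while another runs.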
Several techniques have been proposed for handling the memory performance
problem, including cache memory, pre-fetching and multithreading. Contrary to
the general belief that pre-fetching and multithreading with a supportive cache
can solve all the problems related to memory system performance, they are
critically impacted by the memory bandwidth.
To illustrate the impact of bandwidth on multithreaded programs, consider a
computation running on a machine with a 1 GHz clock, 4-word cache line, single-
cycle access to the cache, and 100 ns latency to DRAM. Assume that the
computation has a cache hit ratio of 25% at 1 KB and of 90% at 32 KB. Consider
two cases: first, a single-threaded execution in which the entire cache is available
to the serial context, and second, a multithreaded execution with 32 threads
where each thread has a cache residency of 1 KB. If the computation makes one
data request in every cycle of 1 ns, then in the first case the bandwidth requirement to
DRAM is one word every 10 ns, since the other words come from the cache (90%
cache hit ratio). This corresponds to a bandwidth of 400 MB/s. In the second case,
the bandwidth requirement to DRAM increases to three words every four cycles
of each thread (25% cache hit ratio). Assuming that all threads exhibit similar
cache behaviour, this corresponds to 0.75 words/ns, or 3 GB/s.
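(As a worked check, assuming 4-byte words as elsewhere in these notes: a 10% miss rate at one request per ns gives 0.1 words/ns × 4 B = 0.4 GB/s = 400 MB/s, while a 75% miss rate gives 0.75 words/ns × 4 B = 3 GB/s.)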
In the above example, while a sustained DRAM bandwidth of 400 MB/s is
reasonable (case 1), 3.0 GB/s (case 2) is more than most systems currently offer.
At this point, multithreaded systems become bandwidth bound instead of latency
bound, because the bandwidth requirement is now more severe than the
memory latency. It is important to realize that multithreading and prefetching
only address the latency problem and may often exacerbate the bandwidth
problem.
19. Describe Flynn's Taxonomy of Computers
The most popular taxonomy of computer architecture is Flynn's taxonomy, which was defined by Flynn in 1966. Flynn's classification model is based on the concept of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is
defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, a computer's hardware may support a single instruction stream or multiple instruction