TRANSCRIPT
Source: …bal/college12/class2-hardware.pdf
PARALLEL PROGRAMMING
MANY-CORE COMPUTING:
HARDWARE (2/5)
Rob van Nieuwpoort
Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware, low-level optimizations
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
Multi-core CPUs
General Purpose Processors
Architecture
Few fat cores
Vectorization
Streaming SIMD Extensions (SSE)
Advanced Vector Extensions (AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Multi-threading
OS Scheduler
Coarse-grained parallelism
Intel
AMD Magny-Cours
Two 6-core dies in a single package
Up to four of these chips in a single compute node
48 cores in total
Non-uniform memory access (NUMA)
Per-core cache
Per-chip cache
Local memory
Remote memory (HyperTransport)
AWARI on the Magny-Cours
DAS-2 (1999)
51 hours
72 machines / 144 cores
72 GB RAM in total
1.4 TB disk in total
Magny-Cours (2011)
45 hours
1 machine, 48 cores
128 GB RAM in 1 machine
4.5 TB disk in 1 machine
Less than 12 hours with new algorithm (needs more RAM)
Multi-core CPU programming
Threads
Pthreads
Java threads
OpenMP
…
Message passing (MPI)
Vectorization
MMX, SSE, AVX, AltiVec, …
OpenCL
Supports threads and vectors
Vectorization on x86 architectures
Since  Name                                            Bits     Single-precision  Double-precision
                                                                vector size       vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only      integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 floats          2 doubles
2011   Advanced Vector Extensions (AVX)                256 bit  8 floats          4 doubles
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 floats         8 doubles
Vectorizing with SSE
Assembly instructions
16 vector registers
C or C++: intrinsics
Declare vector variables
Intrinsics are named after the instructions
Work on variables, not registers
Vectorizing with SSE examples
float data[1024] __attribute__((aligned(16))); // _mm_load_ps requires 16-byte alignment
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);
// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();
// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);
// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data+4);
[Figure: vector contents after these operations, element 0 first:
 myVector0 = (0.0, 0.0, 0.0, 0.0)
 myVector1 = (0.0, 1.0, 2.0, 3.0)
 myVector2 = (4.0, 5.0, 6.0, 7.0)]
Vectorizing with SSE examples
// Add vectors 1 and 2; instruction performs 4 FLOPs.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);
// Multiply vectors 1 and 2; instruction performs 4 FLOPs.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);
// _mm_shuffle_ps(a, b, _MM_SHUFFLE(w,x,y,z)) selects elements z and y
// from a and elements x and w from b: result = (a[z], a[y], b[x], b[w]).
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
_MM_SHUFFLE(2, 3, 0, 1));
[Figure: with myVector1 = (0.0, 1.0, 2.0, 3.0) and myVector2 = (4.0, 5.0, 6.0, 7.0):
 myVector3 (add)      = (4.0, 6.0, 8.0, 10.0)
 myVector4 (multiply) = (0.0, 5.0, 12.0, 21.0)
 myVector5 (shuffle)  = (1.0, 0.0, 7.0, 6.0)]
Vector add
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i++) {
c[i] = a[i] + b[i];
}
}
Vector add with SSE: unroll loop
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i += 4) { // assumes size is a multiple of 4
c[i+0] = a[i+0] + b[i+0];
c[i+1] = a[i+1] + b[i+1];
c[i+2] = a[i+2] + b[i+2];
c[i+3] = a[i+3] + b[i+3];
}
}
Vector add with SSE: vectorize loop
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i += 4) {
__m128 vecA = _mm_load_ps(a + i); // load 4 elts from a (must be 16-byte aligned)
__m128 vecB = _mm_load_ps(b + i); // load 4 elts from b
__m128 vecC = _mm_add_ps(vecA, vecB); // add four elts
_mm_store_ps(c + i, vecC); // store four elts
}
}
The Cell Broadband Engine
Cell/B.E.
Architecture
Heterogeneous
1 PowerPC (PPE)
8 vector-processors (SPEs)
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Cell/B.E. memory
“Normal” main memory
PPE: normal read / write
SPEs: Asynchronous manual transfers
Direct Memory Access (DMA)
Per-core fast memory: the Local Store (LS)
Application-managed cache
256 KB
128 x 128 bit vector registers
Roadrunner (IBM)
Los Alamos National Laboratory
#1 of top500 June 2008 – November 2009
Now #19
122,400 cores, 1.4 petaflops
First petaflop system
PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz
The Cell’s vector instructions
Differences from SSE
SPEs execute only vector instructions
More advanced shuffling
Not 16, but 128 registers!
Fused Multiply Add support
FMA instruction
Multiply-Add (MAD): D = A * B + C
The product A * B is rounded (digits truncated) before C is added
Fused Multiply-Add (FMA): D = A * B + C
The full product A * B is retained while C is added in a single step (no loss of precision)
Cell programming models
IBM Cell SDK
C + MPI
OpenCL
Many models from academia...
Cell SDK
Threads, but only on the PPE
Distributed memory
Local stores = application-managed cache!
DMA transfers
Signaling and mailboxes
Vectorization
Direct Memory Access (DMA)
Start an asynchronous DMA:
mfc_get(local store address, main memory address, #bytes, tag);
Wait for the DMA to finish:
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();
DMA lists
Overlap communication with useful work
Double buffering
Vector sum
float vectorSum(int size, float* vector) {
float result = 0.0;
for(int i=0; i<size; i++) {
result += vector[i];
}
return result;
}
Parallelization strategy
Partition problem into 8 pieces
(Assuming a chunk fits in the Local Store)
PPE starts 8 SPE threads
Each SPE processes 1 piece
Has to load data from PPE with DMA
PPE adds the 8 sub-results
Vector sum SPE code (1)
float vectorSum(int size, float* PPEVector) {
float result = 0.0;
int chunkSize = size / NR_SPES; // Partition the data.
float localBuffer[chunkSize]; // Allocate a buffer in
// my private local store.
int tag = 7; // DMA tag-group IDs must be in the range 0..31
// Points to my chunk in PPE memory.
float* myRemoteChunk = PPEVector + chunkSize * MY_SPE_NUMBER;
Vector sum SPE code (2)
// Copy the input data from the PPE.
mfc_get(localBuffer, myRemoteChunk, chunkSize * sizeof(float), tag);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();
// The real work.
for(int i=0; i<chunkSize; i++) {
result += localBuffer[i];
}
return result;
}
Can we optimize this strategy?
Vectorization
Overlap communication and computation
Double buffering
Strategy:
Split in more chunks than SPEs
Let each SPE download the next chunk while processing the current chunk
DMA double buffering example (1)
float vectorSum(float* PPEVector, int size, int nrChunks) {
float result = 0.0;
int chunkSize = size / nrChunks;
int chunksPerSPE = nrChunks / NR_SPES;
int firstChunk = MY_SPE_NUMBER * chunksPerSPE;
int lastChunk = firstChunk + chunksPerSPE;
// Allocate two buffers in my private local store.
float localBuffer[2][chunkSize];
int currentBuffer = 0;
// Start asynchronous DMA of first chunk; the buffer index doubles as DMA tag.
float* myRemoteChunk = PPEVector + firstChunk * chunkSize;
mfc_get(localBuffer[currentBuffer], myRemoteChunk,
chunkSize * sizeof(float), currentBuffer);
DMA double buffering example (2)
for (int chunk = firstChunk; chunk < lastChunk; chunk++) {
// Prefetch next chunk asynchronously.
if(chunk != lastChunk - 1) {
float* nextRemoteChunk = PPEVector + (chunk+1) * chunkSize;
mfc_get(localBuffer[!currentBuffer], nextRemoteChunk,
chunkSize * sizeof(float), !currentBuffer);
}
// Wait for the current buffer's DMA to finish.
mfc_write_tag_mask(1 << currentBuffer);
mfc_read_tag_status_all();
// The real work.
for(int i=0; i<chunkSize; i++)
result += localBuffer[currentBuffer][i];
currentBuffer = !currentBuffer;
}
return result;
}
Double and triple buffering
Read-only data
Double buffering
Read-write data
Triple buffering!
Work buffer
Prefetch buffer, asynchronous download
Finished buffer, asynchronous upload
General technique
On-chip networks
GPUs (PCI-e)
MPI (cluster)
…
Intel’s many-core platforms
Intel Single-chip Cloud Computer
Architecture
Tile-based many-core (48 cores)
A tile is a dual-core
Stand-alone
Memory
Per-core and per-tile
Shared off-chip
Programming
Multi-processing with message passing
User-controlled mapping/scheduling
Gain performance …
Coarse-grain parallelism
Multi-application workloads (cluster-like)
Intel SCC Tile
2 cores
16 KB L1 cache per core
256 KB L2 cache per core
8 KB message passing buffer
On-chip network router
Intel's Larrabee
GPU based on x86 architecture
Hardware multithreading
Wide SIMD
Achieved 1 TFLOP sustained application performance (SC09)
Canceled in Dec 2009, re-targeted to HPC market
Intel Xeon Phi
Larrabee + 80-core research chip + SCC → MIC architecture
Brand name now Xeon Phi
First product: Knights Corner
GPU-like accelerator
60+ Pentium 1-like cores
512-bit SIMD
At least 8GB of GDDR5
1 teraflop double precision
Programming is x86-compatible
OpenMP, OpenCL, Cilk, parallel libraries