GPU Functional Simulator
Yi Yang
CDA 6938 Term Project
Orlando, April 20, 2008
University of Central Florida
Outline

- Motivation and background
- Software design
- Implementation
- Test cases
- Future work
Motivation and background

Motivation
- Better understanding of the GPU
- Improving the GPU architecture

Background
- Two GPU manufacturers: NVIDIA and ATI
- Similar programming models: block vs. group, shared memory vs. LDS
- ATI uses VLIW
- We want to work on both
Software design

Programming Model Layer (PML)
- Platform independent
- Defines the abstract parts: most of the ISA, registers
- Implements the common resources: group, wavefront, ...

Hardware Implementation Layer (HIL)
- Implements the abstract parts of the PML for each platform: ATI, NVIDIA
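The two-layer split can be sketched as follows. This is a minimal illustration, not the simulator's real code: the class and opcode names (`AtiMovImm`, etc.) are hypothetical, and a real HIL would supply one such set of classes per platform.

```python
from abc import ABC, abstractmethod

# PML: platform-independent abstract instruction.
class Instruction(ABC):
    @abstractmethod
    def execute(self, thread):
        """Execute this instruction on one thread."""

# PML: a thread is reduced to a register file for this sketch.
class Thread:
    def __init__(self):
        self.registers = {}

# HIL: a hypothetical ATI implementation of the abstract part.
# An NVIDIA HIL would provide its own subclasses of Instruction.
class AtiMovImm(Instruction):
    def __init__(self, dst, value):
        self.dst, self.value = dst, value

    def execute(self, thread):
        thread.registers[self.dst] = self.value

t = Thread()
AtiMovImm("R0.x", 5).execute(t)
# t.registers["R0.x"] == 5
```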
Programming Model Layer

- The code parser builds the instruction list
- Resources are allocated according to the configuration file: groups, threads, shared memory, memory, wavefront schedule
- The input stream is loaded from a text file
- The wavefront schedule executes the instruction list on the wavefronts
- When an instruction executes on a thread, it updates the resources: the thread's registers, the group's shared memory, the GPUProgram's texture (global) memory
- The output memory is saved to a text file
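The flow above can be sketched end to end. Everything here is a toy stand-in: the one-opcode assembly format, the configuration dictionary, and the closure-based instructions are assumptions for illustration, not the simulator's real parser or configuration format.

```python
def parse(asm):
    """Toy parser: 'INC <cell>' becomes a closure that updates memory."""
    insts = []
    for line in asm:
        op, cell = line.split()
        assert op == "INC"
        # default arg binds the cell index at parse time
        insts.append(lambda tid, mem, c=int(cell): mem.__setitem__(c, mem[c] + 1))
    return insts

def run(asm, config, input_values):
    instructions = parse(asm)              # code parser
    wavefronts = config["wavefronts"]      # from the configuration file
    memory = list(input_values)            # input stream (from txt file)
    for inst in instructions:              # wavefront schedule: one instruction,
        for wf in wavefronts:              # executed by all wavefronts...
            for tid in wf:                 # ...and every thread in each
                inst(tid, memory)          # instruction updates the resources
    return memory                          # output memory (saved to txt file)

# Two single-thread wavefronts both increment cell 0.
out = run(["INC 0"], {"wavefronts": [[0], [1]]}, [5, 6])
# out == [7, 6]
```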
Code Parser (HIL)

Reads the assembly and parses it into instructions:
- INST LABEL NO: unique number of the instruction
- Stream core label: one of x, y, z, w, t
- INST: opcode and operands
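A line of this form can be split into its fields with a small parser. The regular expression below is an assumption inferred from the example instructions shown in these slides; the real parser's grammar may differ.

```python
import re

# "<inst no> <slot>: <OPCODE> <operand>, <operand>, ..."
LINE_RE = re.compile(r"^\s*(\d+)\s+([xyzwt]):\s+(\S+)\s*(.*)$")

def parse_line(line):
    """Return (instruction number, stream core label, opcode, operands)."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized instruction line: {line!r}")
    number, slot, opcode, rest = m.groups()
    operands = [op.strip() for op in rest.split(",")] if rest else []
    return int(number), slot, opcode, operands

fields = parse_line("4 t: MULLO_UINT R1.z, 1, PS3")
# fields == (4, 't', 'MULLO_UINT', ['R1.z', '1', 'PS3'])
```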
Operand (HIL)

General Purpose Register:
  0 y: ADD ____, R0.x, -0.5

Previous Vector (x, y, z, w) and Previous Scalar (t):
  3 t: F_TO_I ____, R0.x
  4 t: MULLO_UINT R1.z, 1, PS3

Temporary Register:
  3 t: RCP_UINT T0.x, R1.x

Constant Register:
  1 z: AND_INT ____, R0.x, (0x0000003F, 8.828180325e-44f).x
Instruction (HIL)

- Format: Opcode dst, src1, src2, ... (e.g. ADD_INT R0.x, R1.x, R2.x); dst, src1, and src2 are Operands
- GPUProgram holds the instruction lists; Instruction implements the execution
- An instruction receives the thread as a parameter and executes on that thread

Example: for ADD_INT R0.x, R1.x, R2.x, the instruction reads the values of R1.x and R2.x from the thread, then writes R1.x + R2.x back to the thread as R0.x
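The ADD_INT walkthrough above can be sketched directly. The thread is modeled as a plain dict register file here, which is a simplification of the simulator's Data Unit mapping.

```python
class AddInt:
    """Toy ADD_INT dst, src1, src2 executing on one thread."""
    def __init__(self, dst, src1, src2):
        self.dst, self.src1, self.src2 = dst, src1, src2

    def execute(self, thread):
        a = thread[self.src1]        # get value of R1.x from the thread
        b = thread[self.src2]        # get value of R2.x from the thread
        thread[self.dst] = a + b     # save R1.x + R2.x to the thread as R0.x

thread = {"R1.x": 7, "R2.x": 35}
AddInt("R0.x", "R1.x", "R2.x").execute(thread)
# thread["R0.x"] == 42
```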
Memory Handle (HIL)

Texture Memory:
  0 SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW)
  EXP_DONE: PIX0, R0
  Cache support (future work)

Global Memory:
  6 RD_SCATTER R3, DWORD_PTR[0+R2.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0)
  03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x].x___, R0, ELEM_SIZE(3)
  Coalescing support, handled by the first thread (future work)

Text files are used as input and output.
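The text-file memory path might look like the following round trip. The whitespace-separated-integer file format and the method names are assumptions for illustration; the slides do not specify the actual format.

```python
import os
import tempfile

class GPUProgram:
    """Toy sketch: global memory loaded from / saved to a text file."""
    def __init__(self, input_path):
        with open(input_path) as f:
            self.memory = [int(x) for x in f.read().split()]

    def save(self, output_path):
        with open(output_path, "w") as f:
            f.write(" ".join(str(x) for x in self.memory))

# Round trip: load the input stream, let an instruction write, save.
with tempfile.TemporaryDirectory() as d:
    inp, outp = os.path.join(d, "in.txt"), os.path.join(d, "out.txt")
    with open(inp, "w") as f:
        f.write("1 2 3 4")
    prog = GPUProgram(inp)
    prog.memory[2] = 99          # a global-memory write would land here
    prog.save(outp)
    with open(outp) as f:
        result = f.read()
# result == "1 2 99 4"
```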
Thread (PML)

- Belongs to a Group
- Holds Data Units (HIL): 128 bits (x, y, z, w) + 32 bits (t)
- Most resources are 4-component (e.g. registers); one thread processor is five-way and produces 5 outputs (x, y, z, w, t)
- Holds the mapping table from registers (GPR, CR, TR) to Data Units
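The Data Unit and the register-to-Data-Unit mapping could be sketched like this. Components are stored as Python ints rather than packed 32-bit words, and the method names are illustrative.

```python
class DataUnit:
    """HIL: 128-bit vector (x, y, z, w) plus 32-bit scalar (t)."""
    COMPONENTS = ("x", "y", "z", "w", "t")

    def __init__(self):
        self.values = dict.fromkeys(self.COMPONENTS, 0)

class Thread:
    """PML: maps register names (GPR, CR, TR) to Data Units."""
    def __init__(self):
        self.units = {}                        # e.g. "R1" -> DataUnit

    def _unit(self, name):
        return self.units.setdefault(name, DataUnit())

    def read(self, reg):                       # reg like "R1.z"
        name, comp = reg.split(".")
        return self._unit(name).values[comp]

    def write(self, reg, value):
        name, comp = reg.split(".")
        self._unit(name).values[comp] = value

t = Thread()
t.write("R1.z", 9)
# t.read("R1.z") == 9; untouched components read as 0
```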
Wavefront (PML)

- Holds the program counter
- Holds the thread ID list
- Belongs to a Group
Group (PML)

- Holds the threads
- Holds the wavefronts
- Belongs to the GPUProgram
- Holds the shared memory (PML); instructions access the shared memory through the Group

Instruction (HIL) examples:
  12 LOCAL_DS_WRITE (8) R0, STRIDE(16) SIMD_REL
  17 LOCAL_DS_READ R2, R2.xy WATERFALL
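The "instructions reach shared memory through the Group" idea can be sketched as below. The addressing, operand handling, and class names are simplified assumptions; real LOCAL_DS_WRITE/LOCAL_DS_READ take register operands, strides, and flags as shown above.

```python
class Group:
    """PML: owns the threads, wavefronts, and shared memory (LDS)."""
    def __init__(self, shared_size):
        self.shared = [0] * shared_size
        self.threads = []
        self.wavefronts = []

class LocalDsWrite:
    """Toy LOCAL_DS_WRITE: store a value at a shared-memory address."""
    def __init__(self, addr, value):
        self.addr, self.value = addr, value

    def execute(self, thread, group):
        group.shared[self.addr] = self.value   # via the Group, not the thread

class LocalDsRead:
    """Toy LOCAL_DS_READ: load a value from a shared-memory address."""
    def __init__(self, addr):
        self.addr = addr

    def execute(self, thread, group):
        return group.shared[self.addr]

g = Group(shared_size=16)
LocalDsWrite(3, 123).execute(thread=None, group=g)
loaded = LocalDsRead(3).execute(thread=None, group=g)
# loaded == 123
```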
Wavefront Schedule (PML)

Current version (functional simulator):
- Pick one instruction and let all wavefronts execute it

For the timing simulator, scheduling will be:
- Decided by the hardware capacity and software requests
- Decided by the static instruction list
- Decided by the execution results
GPUProgram (PML)

- The code parser produces the instruction list
- The input stream is loaded from a text file into memory
- Resources are allocated according to the configuration file: groups, threads, shared memory, memory, wavefront schedule
- The wavefront schedule executes the instruction list on the wavefronts
- When an instruction executes on a thread, it updates the resources: the thread's registers, the group's shared memory, the GPUProgram's texture (global) memory
- The output memory is saved to a text file
Test cases

Sum, division, subtraction, multiplication:
- Supports texture memory
- Supports different data types (int, float, uint, int1, int4, ...)
- Supports fundamental ALU operations (+, -, *, /, shift, and, compare, cast)

domain_sum:
- Supports global memory read and write

sum_share_memory:
- Supports shared memory read and write
- Supports group and wavefront

Branch and loop (to be done):
- Support for the constant buffer
- Loop operations
Future work

- Currently supports 30 of the 200 ATI instructions
- Support NVIDIA; optimize the two-layer design
- Timing simulator