Team Programming Project
Byunghyun (Byung) Jang
Ph.D. student, Northeastern University
Jul. 26, 2009
CRA-W/CDC Careers in High Performance Systems (CHiPS) Mentoring Workshop
July 25-27, 2009
National Center for Supercomputing Applications (NCSA) at University of Illinois at Urbana-Champaign (UIUC)
CHiPS - Team Programming Project
Some words about me
▪4th year Ph.D student
▪Born and raised in South Korea
▪34 years old (never too late to learn)
▪B.S. in mechanical engineering and M.S. in computer science
▪Full time engineer at Samsung Electronics for 3 years
▪GPGPU
▪Internship at AMD and fellowship from AMD
▪Happy
Goals
▪Understand General Purpose Computing on GPU (a.k.a. GPGPU)
▪Experience CUDA GPU programming
▪Understand how massively multi-threaded parallel programming works
▪Think about solving a problem in a parallel fashion
▪Experience the tremendous computational power of GPU
▪Experience the challenges in efficient parallel programming
Outline
▪Application 1: Image Rotation
  ▪Introduction and Design (15 min)
  ▪Preparation (5 min)
    ▪Installing the skeleton code, compile test, image view test
  ▪Hands-on Programming (30 min)
    ▪Replace ??? with your own CUDA code
▪Application 2: Histogram
  ▪Introduction and Design (15 min)
  ▪Preparation (5 min)
    ▪Installing the skeleton code, compile test
  ▪Hands-on Programming (40 min)
    ▪Replace ??? with your own CUDA code
▪Conclusion
Application 1: Image Rotation - Introduction -
[Figure: original input image and rotated output image]
▪Rotate an image by a given angle
▪A basic feature in image processing applications
Application 1: Image Rotation - Introduction -
▪What the application does:
Step 1. Compute a new location according to the rotation angle (trigonometric computation)
Step 2. Read the pixel value at the original location
Step 3. Write the pixel value to the new location computed in Step 1
▪Create as many threads as there are pixels
▪Each thread takes care of moving one pixel
▪Our goals are
▪To understand how to use the GPU for data parallelism
▪To know how to map threads to data
Application 1: Image Rotation - Design -
[Figure: thread mapping. A 512 × 512 image is covered by a 64 × 64 grid of thread blocks, ThreadBlock(0, 0) through ThreadBlock(63, 63), each containing 8 × 8 threads.]
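The thread-to-pixel mapping in this design comes down to a little index arithmetic. The sketch below is plain C that replaces CUDA's `blockIdx` and `threadIdx` built-ins with ordinary parameters; the names `IMAGE_DIM`, `BLOCK_DIM`, `GRID_DIM`, and `thread_to_pixel`, the row-major pixel layout, and reading the first block coordinate as the x direction are assumptions rather than the skeleton's code.

```c
#include <assert.h>

enum {
    IMAGE_DIM = 512,                    /* image is 512 x 512 pixels */
    BLOCK_DIM = 8,                      /* each thread block is 8 x 8 threads */
    GRID_DIM  = IMAGE_DIM / BLOCK_DIM   /* 64 blocks per side */
};

/* Hypothetical helper: returns the row-major pixel index handled by thread
 * (tx, ty) of block (bx, by). In a CUDA kernel the four inputs would come
 * from blockIdx.x/.y and threadIdx.x/.y. */
int thread_to_pixel(int bx, int by, int tx, int ty)
{
    assert(bx >= 0 && bx < GRID_DIM && by >= 0 && by < GRID_DIM);
    assert(tx >= 0 && tx < BLOCK_DIM && ty >= 0 && ty < BLOCK_DIM);

    int x = bx * BLOCK_DIM + tx;        /* global column */
    int y = by * BLOCK_DIM + ty;        /* global row */
    return y * IMAGE_DIM + x;
}
```

With these sizes, 64 × 64 blocks of 8 × 8 threads give exactly 512 × 512 threads, one per pixel, so no bounds check against the image size is needed.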
Application 1: Image Rotation - Preparation -
1. Deploy the skeleton code in the proper directory
[..@ac ~]$ cp /tmp/projects.tar ./
[..@ac ~]$ cp /tmp/cuda.pdf ./
[..@ac ~]$ tar -xf projects.tar
2. Request a cluster node for interactive use for 2 hours
[..@ac ~]$ qsub -I -l walltime=02:00:00
3. Compile
[..@ac ~]$ cd PROJECTS/projects/ImageRotation
[..@ac ~]$ make clean
[..@ac ~]$ make
To use printf() to debug, use “make emu=1” instead of “make”
4. Execute
[..@ac ~]$ ./ImageRotation
5. Convert image from “pgm” to “jpg” format
[..@ac ~]$ convert data/lena_out.pgm data/lena_out.jpg
6. Download “lena_out.jpg” to your laptop to view it
Download for your future reference
Application 1: Image Rotation - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 2:30 pm
▪ Help others if you finish early
Application 2: Histogram - Introduction -
[Figure: input image and its output histogram. x-axis: intensity, from 0 (black) to 255 (white); y-axis: number of pixels]
▪Shows the frequency of occurrence of the intensity value of each pixel
▪A commonly used analysis tool in image processing and data mining applications
Application 2: Histogram - Introduction -
▪A serial implementation looks like this:
data[DATA_COUNT];       // input data
histogram[BIN_COUNT];   // histogram data
for (int i = 0; i < BIN_COUNT; i++)
    histogram[i] = 0;           // initialization
for (int i = 0; i < DATA_COUNT; i++)
    histogram[data[i]]++;       // update the corresponding bin
▪Access to data[] is sequential, but access to histogram[] is random because the bin index depends on each data value
▪We will therefore store a per-block sub-histogram (s_hist[]) in fast shared memory, which handles random memory access much more efficiently than global memory does
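The per-block sub-histogram idea can be tried on a CPU first. In the plain C sketch below, the input is split into chunks, each chunk (standing in for one thread block) fills its own small `s_hist[]`, and the sub-histograms are then added into the final result. The function names, the `chunk` parameter, the element types, and the requirement that every value in `data[]` is already a valid bin index below `BIN_COUNT` are illustrative assumptions, not taken from the skeleton code.

```c
#include <assert.h>
#include <stddef.h>

#define BIN_COUNT 64   /* value given on the design slide */

/* Serial reference, following the slide's loop. */
void histogram_serial(const unsigned char *data, size_t n,
                      unsigned int *histogram)
{
    for (int i = 0; i < BIN_COUNT; i++)
        histogram[i] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(data[i] < BIN_COUNT);    /* assumption: values are bin indices */
        histogram[data[i]]++;
    }
}

/* Hypothetical blocked version: each chunk of `chunk` elements plays the
 * role of one thread block with its own sub-histogram. */
void histogram_blocked(const unsigned char *data, size_t n, size_t chunk,
                       unsigned int *histogram)
{
    assert(chunk > 0);
    for (int i = 0; i < BIN_COUNT; i++)
        histogram[i] = 0;

    for (size_t start = 0; start < n; start += chunk) {
        unsigned int s_hist[BIN_COUNT] = {0};   /* stands in for shared memory */
        size_t end = (start + chunk < n) ? start + chunk : n;

        for (size_t i = start; i < end; i++) {  /* random access stays local */
            assert(data[i] < BIN_COUNT);
            s_hist[data[i]]++;
        }
        for (int b = 0; b < BIN_COUNT; b++)     /* sequential merge */
            histogram[b] += s_hist[b];
    }
}
```

Both functions should produce identical counts for any chunk size; only the place where the random-access updates happen differs.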
Application 2: Histogram - Design -
▪The structure of shared memory would look like the following
▪Notice that shared memory is per thread block and limited
[Figure: layout of data[DATA_COUNT] and the shared memory array s_hist[], with segments labeled "64 data elements"]
Application 2: Histogram - Design -
▪Merging per-thread histograms into a per-block histogram
[Figure: s_hist[] in shared memory (per block), d_result[] (number of thread blocks × BIN_COUNT), and the final histogram (BIN_COUNT bins)]
BIN_COUNT = 64
THREAD_N = 192
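The merge of per-thread histograms into a per-block histogram could look like the following CPU-only C sketch. It assumes that each of the `THREAD_N` threads in a block first counts into its own `BIN_COUNT`-entry slice of `s_hist[]`; the thread-major layout, the strided assignment of data to threads, the counter type, and the function name `block_histogram` are assumptions, not the skeleton's code. On a GPU the loops over `t` would run in parallel, with a barrier such as `__syncthreads()` between the two phases.

```c
#include <assert.h>
#include <stddef.h>

#define BIN_COUNT 64
#define THREAD_N  192

/* Hypothetical emulation of one thread block: builds THREAD_N private
 * sub-histograms, then merges them into block_hist[BIN_COUNT]. */
void block_histogram(const unsigned char *data, size_t n,
                     unsigned int *block_hist)
{
    /* One private sub-histogram per thread: s_hist[t * BIN_COUNT + bin]. */
    static unsigned int s_hist[THREAD_N * BIN_COUNT];
    for (size_t i = 0; i < (size_t)THREAD_N * BIN_COUNT; i++)
        s_hist[i] = 0;

    /* Phase 1: thread t counts elements t, t + THREAD_N, t + 2*THREAD_N, ... */
    for (size_t t = 0; t < THREAD_N; t++) {
        for (size_t i = t; i < n; i += THREAD_N) {
            assert(data[i] < BIN_COUNT);  /* assumption: values are bin indices */
            s_hist[t * BIN_COUNT + data[i]]++;
        }
    }

    /* Phase 2 (merge): for each bin, add up the THREAD_N per-thread counters. */
    for (size_t bin = 0; bin < BIN_COUNT; bin++) {
        unsigned int sum = 0;
        for (size_t t = 0; t < THREAD_N; t++)
            sum += s_hist[t * BIN_COUNT + bin];
        block_hist[bin] = sum;
    }
}
```

Giving every thread a private slice means no two threads ever update the same counter in phase 1, at the cost of `THREAD_N` times more counter storage per block.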
Application 2: Histogram - Preparation -
1. Compile
[..@ac ~]$ cd PROJECTS/projects/Histogram
[..@ac ~]$ make clean
[..@ac ~]$ make
To use printf() to debug, use “make emu=1” instead of “make”
2. Execute
[..@ac ~]$ ./Histogram
3. Check output message
“*** TEST FAILED”: something wrong
“*** TEST PASSED”: you got it
Application 2: Histogram - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 3:30 pm
▪ Help others if you finish early
Conclusions
▪What we’ve learned throughout the two projects
▪Understood massively parallel computing on the GPU
▪Experienced what CUDA programming looks like
▪Understood how to explicitly program hardware resources
▪Understood the importance and challenges of parallel programming
▪Experienced solving a problem in a massively parallel fashion
▪The GPU is the platform of choice for data-parallel, computationally intensive applications
▪In a few years, we are likely to see many people buying a new graphics card to increase their desktop’s computing performance, not its 3D game performance
Thank you!