cours gpgpu

Upload: abdoulaye-aw

Post on 01-Jun-2018


  • 8/9/2019 Cours Gpgpu

    General-Purpose Programming on the GPU

    M1 Info 2014/2015

    [email protected]

    12 h of lectures (8 sessions)

    18 h of lab sessions (12 sessions); assessment by a mini-project (last lab sessions)

    Final written exam (documents allowed)
    Refs

    "OpenCL Programming Guide"

    "Heterogeneous Computing with OpenCL": http://www.heterogeneouscompute.org/ http://univ-limoges.cyberlibris.fr/book/88809645

    Hands On OpenCL: http://handsonopencl.github.io/

    Programmation des systèmes parallèles hétérogènes: http://www.techniques-ingenieur.fr/base-documentaire/technologies-de-l-information-th9/langages-de-programmation-42304210/programmation-des-systemes-paralleles-heterogenes-h3160/

    "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison-Wesley, Jul 2010 (pdf)
    Schedule

    1. Introduction to parallel programming, GPU architectures
    3. Basic OpenCL examples
    4.

    OpenCL?

    The Open Computing Language (OpenCL) is a heterogeneous programming framework managed by the nonprofit consortium Khronos Group.

    It supports a wide range of levels of parallelism and efficiently maps to homogeneous or heterogeneous, single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices.

    OpenCL 1.0, 1.1, 1.2, ...

    OpenCL was initially developed by Apple Inc.

    OpenCL 1.0 (2008), released with Mac OS X Snow Leopard: the version installed in the lab room
    OpenCL 1.1 (2010): most examples presented in this course
    OpenCL 1.2 (2011)
    OpenCL 2.0 (2013): supports an Android extension

    OpenCL vs CUDA?

    Parallel Programming

    Threads and Shared Memory

    Message Passing

    Communication

    Data Sharing and Synchronization

    Different Grains of Parallelism...

    Many-Core Future

    Many cores running at lower frequencies are fundamentally more power-efficient

    Heterogeneous Platforms... are already here!

    General Programming on the GPU

    Traditionally, modules are explicitly tied to the components in the heterogeneous platform. For example, graphics software runs on the GPU. ...

    Conceptual Foundations of OpenCL

    1. Discover the components that make up the heterogeneous system.

    2. Probe the characteristics of these components so that the software can adapt to the specific features of different hardware elements.

    3. Create the blocks of instructions (kernels) that will run on the platform.

    4. Set up and manipulate memory objects involved in the computation.

    5. Execute the kernels in the right order and on the right components of the system.

    6. Collect the final results.

    Platform Model

    A device can be a CPU, a GPU, ...

    How a Kernel Executes on an OpenCL Device

    1. A kernel is defined on the host.

    2. The host program issues a command that submits the kernel for execution on an OpenCL device.

    3. When this command is issued by the host, the OpenCL runtime system creates an integer index space. An instance of the kernel executes for each point in this index space and is called a work-item. Its coordinates in the index space are the global ID of the work-item.

    4. Work-items are organized into work-groups, which exactly span the global index space. Work-items are assigned a unique local ID within a work-group, so that a single work-item can be uniquely identified by its global ID or by a combination of its local ID and work-group ID.

    5. OpenCL only assures that the work-items within a work-group execute concurrently on the processing elements of a single compute unit (and share processor resources on the device).

    NDRange

    The index space spans an N-dimensioned range of values and thus is called an NDRange (N can be 1, 2 or 3).

    Inside an OpenCL program, an NDRange is defined by an integer array of length N specifying the size of the index space in each dimension.

    Work-items and work-groups

    Global ID (gx, gy) = (6, 5)
    Work-group ID (wx, wy) = (1, 1)
    Local ID (lx, ly) = (2, 1)
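    The three IDs above are related by simple integer arithmetic. The sketch below is plain C, not OpenCL; it assumes a zero global offset and a work-group size of 4 in each dimension (an assumption that is consistent with the values on the slide, which does not state the size). The names `work_item_pos`, `split_global_id` and `make_global_id` are illustrative only.

```c
#include <stddef.h>

/* Position of one work-item along a single NDRange dimension. */
typedef struct {
    size_t group_id; /* index of the work-group */
    size_t local_id; /* index inside the work-group */
} work_item_pos;

/* Split a global ID into (work-group ID, local ID), assuming a zero
   global offset and a work-group size that divides the global size. */
static work_item_pos split_global_id(size_t global_id, size_t group_size)
{
    work_item_pos pos;
    pos.group_id = global_id / group_size;
    pos.local_id = global_id % group_size;
    return pos;
}

/* Inverse mapping: rebuild the global ID from the pair. */
static size_t make_global_id(work_item_pos pos, size_t group_size)
{
    return pos.group_id * group_size + pos.local_id;
}
```

    With a group size of 4, `split_global_id(6, 4)` gives work-group 1 and local ID 2, and `split_global_id(5, 4)` gives work-group 1 and local ID 1, matching (gx, gy) = (6, 5).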

    Context

    Devices: the collection of OpenCL devices to be used by the host

    Kernels: the OpenCL functions that run on OpenCL devices

    Program objects: the program source code and executables that implement the kernels

    Memory objects: a set of objects in memory that are visible to OpenCL devices and contain values that can be operated on by instances of a kernel

    Command-Queues

    The interaction between the host and the OpenCL devices occurs through commands posted by the host to the command-queue.

    These commands wait in the command-queue until they execute on the OpenCL device.

    A command-queue is created by the host and attached to a single OpenCL device after the context has been defined.

    Kernel execution commands execute a kernel on the processing elements of an OpenCL device.

    Memory commands transfer data between the host and different memory objects, move data between memory objects, or map and unmap memory objects from the host address space.

    Synchronization commands put constraints on the order in which commands execute.

    Memory Model

    Host memory: visible only to the host.

    Global memory: permits read/write access to all work-items in all work-groups. Reads and writes to global memory may be cached depending on the capabilities of the device.

    Constant memory: remains constant during the execution of a kernel (read-only access).

    Local memory: local to a work-group. It can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device.

    Private memory: this region of memory is private to a work-item. Variables defined in one work-item's private memory are not visible to other work-items.

    Summary

    Remember this?

    1. Discover the components that make up the heterogeneous system.

    2. Probe the characteristics of these components so that the software can adapt to the specific features of different hardware elements.

    3. ...

    platformAndDevices.c

    #include <CL/cl.h>
    #include <...>

    #define CL_CHECK(_expr) \
        do { \
            cl_int _err = _expr; \
            if (_err == CL_SUCCESS) ...

    ...
    CL_CHECK(clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, 10240, buffer, NULL));
    printf(" NAME = %s\n", buffer);
    CL_CHECK(clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, 10240, buffer, NULL));
    ...
    CL_CHECK(clGetPlatformInfo(platforms[i], CL_PLATFORM_EXTENSIONS, 10240, buffer, NULL));
    ...

    platformAndDevices.c (continued)

    // Report the number of devices (devices_n), not the number of platforms
    printf("=== %d OpenCL device(s) found on platform:\n", devices_n);

    for (int i = 0; i < devices_n; i++) {
        char buffer[10240];
        cl_uint buf_uint;
        cl_ulong buf_ulong;
        printf(" -- %d --\n", i);
        CL_CHECK(clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL));
        printf(" DEVICE_NAME = %s\n", buffer);
        CL_CHECK(clGetDeviceInfo(devices[i], CL_DEVICE_VENDOR, sizeof(buffer), buffer, NULL));
        ...

    Compiling and running (finally!)

    $> g++ platformAndDevices.c -lOpenCL
    $> ...
    DEVICE_NAME = GeForce 8600 GT
    DEVICE_VENDOR = NVIDIA Corporation
    DEVICE_VERSION = OpenCL 1.0 CUDA
    DRIVER_VERSION = 319.17
    DEVICE_MAX_COMPUTE_UNITS = ...
    DEVICE_MAX_CLOCK_FREQUENCY = ...
    DEVICE_GLOBAL_MEM_SIZE = 536150016
    ...

    With AMD: g++ platformAndDevices.c -I...

    clGetPlatformIDs

    cl_int clGetPlatformIDs (cl_uint num_entries, cl_platform_id * platforms, cl_uint * num_platforms)

    This command obtains the list of available platforms. In the case that the argument platforms is NULL, it returns the number of available platforms.

    Another example:

    errNum = clGetPlatformIDs(0, NULL, &numPlatforms);
    platformIds = (cl_platform_id *)alloca(sizeof(cl_platform_id) * numPlatforms);
    errNum = clGetPlatformIDs(numPlatforms, platformIds, NULL);

    clGetPlatformInfo

    cl_int clGetPlatformInfo (cl_platform_id platform, cl_platform_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret)

    This command returns specific information about the OpenCL platform: profile, version, name, ...

    Another example:

    // First call: ask for the size of the string; second call: fetch it into name
    err = clGetPlatformInfo(id, CL_PLATFORM_NAME, 0, NULL, &size);
    char * name = (char *)alloca(sizeof(char) * size);
    err = clGetPlatformInfo(id, CL_PLATFORM_NAME, size, name, NULL);

    clGetDeviceIDs

    cl_int clGetDeviceIDs (cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices, cl_uint *num_devices)

    This command obtains the list of available OpenCL devices associated with platform.

    device_type:

    CL_DEVICE_TYPE_CPU: OpenCL device that is the host processor.
    CL_DEVICE_TYPE_GPU: OpenCL device that is a GPU.
    ...

    clGetDeviceInfo

    cl_int clGetDeviceInfo (cl_device_id device, cl_device_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret)

    This command returns specific information about the OpenCL device.

    param_name: CL_DEVICE_TYPE, CL_DEVICE_VENDOR_ID, ...

    http://my.safaribooksonline.com/book/programming/9780132488006/platforms-contexts-and-devices/ch03lev1sec2

    STEP 6: Write host data to device buffers

    STEP 7: Create and compile the program

    STEP 8: Create the kernel

    ////////////////////////// STEP 6 //////////////////////////
    // ...
    // to the device buffer bufferA
    status = clEnqueueWriteBuffer(cmdQueue, bufferA, CL_FALSE, 0, datasize, A, 0, NULL, NULL);
    ...

    STEP 10: Configure the work-item structure

    STEP 11: Enqueue the kernel for execution

    ////////////////////////// STEP 9 //////////////////////////
    // Associate the input and output buffers with the
    // kernel using clSetKernelArg()
    status  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA);
    status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferB);
    status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufferC);

    ////////////////////////// STEP 10 //////////////////////////
    // Define an index space (global work size) of work-items
    // for execution. A work-group size (local work size)
    // is not required, but can be used.
    size_t globalWorkSize[1];
    // There are 'elements' work-items
    globalWorkSize[0] = elements;

    ////////////////////////// STEP 11 //////////////////////////
    // Execute the kernel using clEnqueueNDRangeKernel().
    // 'globalWorkSize' is the 1D dimension of the work-items
    status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

    Memory mapping

    Each executing work-item needs to know which individual elements from arrays a and b need to be summed.

    This must be a unique value for each work-item and should be derived from the N-D domain specified when queuing the kernel for execution.

    get_global_id(0) returns the one-dimensional global ID for each work-item.
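    To make the mapping concrete, here is a plain C emulation of a 1D launch (not OpenCL itself). The function `vecadd_work_item` stands for the kernel body and its `gid` parameter plays the role of `get_global_id(0)`; both names, and `vecadd_launch`, are illustrative assumptions rather than code from the course.

```c
#include <stddef.h>

/* Body of the vector-addition kernel for ONE work-item.
   `gid` plays the role of get_global_id(0). */
static void vecadd_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for a 1D NDRange launch of `global_size` work-items.
   On a device the iterations would run concurrently and in no set order. */
static void vecadd_launch(size_t global_size, const float *a, const float *b, float *c)
{
    for (size_t gid = 0; gid < global_size; gid++)
        vecadd_work_item(gid, a, b, c);
}
```

    Because each work-item touches only index `gid`, no two work-items write the same element, which is why no synchronization is needed for this kernel.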

    How to check for errors in our kernel: clGetProgramBuildInfo

    Should take place after clBuildProgram (step 7)

    Get the length of the log string:

    size_t len;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &len);

    Get the log itself:

    char *buffer = (char *)malloc(len);
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, len, buffer, NULL);

    Typical results:

    expected ';' after expression
    use of undeclared identifier 'globalId'
    error: expected '}' ...

    Exercises

    Test the OpenCL environment in the lab room and on your own machine

    Measure the performance of the program that adds two vectors:

    by comparing it with a classic sequential implementation

    by varying the volume of data (scalability)

    Simple Matrix Multiplication (sequential code)

    // Iterate over the rows of Matrix A
    for (int i = 0; i < heightA; i++) {
        // Iterate over the columns of Matrix B
        for (int j = 0; j < widthB; j++) {
            C[i][j] = 0;
            // Multiply and accumulate the values in the current row
            // of A and column of B
            for (int k = 0; k < widthA; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
    }

    How can we retrieve row and column indices from get_global_id(0)?

    Parallel Matrix Multiplication Kernel

    // widthA = heightB for valid matrix multiplication
    __kernel void simpleMultiply(__global float* outputC,
                                 int widthA, int heightA,
                                 int widthB, int heightB,
                                 __global float* inputA,
                                 __global float* inputB)
    {
        // Get global position in Y direction
        int row = get_global_id(1);
        // Get global position in X direction
        int col = get_global_id(0);

        float sum = 0.0f;
        // Calculate result of one element of Matrix C
        for (int i = 0; i < widthA; i++)
            sum += inputA[row*widthA+i] * inputB[i*widthB+col];
        outputC[row*widthB+col] = sum;
    }
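    The kernel's indexing can be checked on the host without a device. The following plain C sketch (an emulation, not OpenCL) runs the same per-work-item body over a 2D range of widthB x heightA points; `col` and `row` stand for `get_global_id(0)` and `get_global_id(1)`. The function names are illustrative assumptions.

```c
#include <stddef.h>

/* One work-item of the matrix-multiplication kernel: `col` and `row`
   stand for get_global_id(0) and get_global_id(1). Matrices are row-major. */
static void matmul_work_item(size_t col, size_t row, float *c,
                             size_t width_a, size_t width_b,
                             const float *a, const float *b)
{
    float sum = 0.0f;
    for (size_t i = 0; i < width_a; i++)
        sum += a[row * width_a + i] * b[i * width_b + col];
    c[row * width_b + col] = sum;
}

/* Sequential stand-in for a 2D NDRange of width_b x height_a work-items. */
static void matmul_launch(float *c, size_t width_a, size_t height_a,
                          size_t width_b, const float *a, const float *b)
{
    for (size_t row = 0; row < height_a; row++)
        for (size_t col = 0; col < width_b; col++)
            matmul_work_item(col, row, c, width_a, width_b, a, b);
}
```

    Dimension 0 is mapped to columns and dimension 1 to rows, so the global work size would be {widthB, heightA}.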

    Remember this?

    Global ID (gx, gy) = (6, 5)
    Work-group ID (wx, wy) = (1, 1)
    Local ID (lx, ly) = (2, 1)

    Memory mapping

    size_t globalWorkSize[2];

    globalWorkSize[0] = WB;
    globalWorkSize[1] = HA;

    errcode = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, ...

    clEnqueueNDRangeKernel

    cl_int clEnqueueNDRangeKernel (command_queue, kernel, work_dim, global_work_offset, global_work_size, local_work_size, num_events_in_wait_list, event_wait_list, event)

    work_dim: the number of dimensions used to specify the global work-items and the work-items in the work-group

    global_work_offset: must currently be NULL ...

    Image Rotation Kernel

    __kernel void img_rotate(__global float* dest_data, __global float* src_data,
                             int W, int H,                    // Image dimensions
                             float sinTheta, float cosTheta)  // Rotation parameters
    {
    ...
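    The body of the kernel is not shown in this transcript. As a hedged illustration only, the plain C function below rotates an image with a common inverse-mapping formulation: for every destination pixel it computes which source pixel lands there. This is an assumption about the approach, not the course's kernel; the name `rotate_image` and the bounds policy (out-of-range pixels are left unchanged) are choices made for this sketch.

```c
/* Rotate a w x h row-major image about its centre. Each destination pixel
   (ix, iy) looks up the source pixel that rotates onto it; destination
   pixels whose source falls outside the image are left unchanged. */
static void rotate_image(float *dest, const float *src, int w, int h,
                         float sin_theta, float cos_theta)
{
    float x0 = w / 2.0f, y0 = h / 2.0f;
    for (int iy = 0; iy < h; iy++) {
        for (int ix = 0; ix < w; ix++) {
            float x_off = ix - x0, y_off = iy - y0;
            /* (int) truncates toward zero */
            int xpos = (int)(x_off * cos_theta + y_off * sin_theta + x0);
            int ypos = (int)(y_off * cos_theta - x_off * sin_theta + y0);
            if (xpos >= 0 && xpos < w && ypos >= 0 && ypos < h)
                dest[iy * w + ix] = src[ypos * w + xpos];
        }
    }
}
```

    In an OpenCL version the two loops would disappear: each work-item would take (ix, iy) from `get_global_id(0)` and `get_global_id(1)` and execute only the loop body.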

    Image Rotation Example

    See the full example to discover how to:

    Load a BMP image file into a buffer (http://fr.wikipedia.org/wiki/Windows_bitmap)

    Store OpenCL kernels in separate files and load them at runtime

    Code with OpenCL in C++!

    Exercises

    Write a classic sequential implementation, then an OpenCL implementation, of:

    the multiplication of two matrices

    image rotation (using the C API)

    Measure the performance while varying the volume of data

    Function Qualifiers

    __kernel or kernel

    The following rules apply to kernel functions:

    The return type must be void. If the return type is not void, it will result in a compilation error.

    The function can be executed on a device by enqueuing a command to execute the kernel from the host.

    The function behaves as a regular function if it is called from a kernel function. The only restriction is that a kernel function with variables declared inside the function with the local qualifier cannot be called from another kernel function.

    Bad / Good

    Bad:

    kernel void my_func_a(global float *src, global float *dst)
    {
        local float l_var[32];
        ...
    }

    kernel void my_func_b(global float *src, global float *dst)
    {
        // implementation-defined behavior
        my_func_a(src, dst);
    }

    Good:

    kernel void my_func_a(global float *src, global float *dst, local float *l_var)
    {
        ...
    }

    kernel void my_func_b(global float *src, global float *dst, local float *l_var)
    {
        my_func_a(src, dst, l_var);
    }

    Address Space Qualifiers

    The type qualifier can be global (or __global), local (or __local), constant (or __constant), or private (or __private).

    If the type of an object is qualified by an address space name, the object is allocated in the specified address space (if not specified, then the object is allocated in the private address space).

    Pointers to the global address space are allowed as arguments to functions (including kernel functions) and as variables declared inside functions. Variables declared inside a function cannot be allocated in the global address space.

    Legal and illegal

    kernel void my_func(int *p)          // illegal because the generic address space
                                         // name for p is private
    kernel void my_func(private int *p)  // illegal because the memory pointed to by p
                                         // is allocated in private

    void my_func(int *p)                 // generic address space name for p is private:
                                         // legal as my_func is not a kernel function
    void my_func(private int *p)         // legal as my_func is not a kernel function

    void my_func(global float4 *vA, global float4 *vB)
    {
        global float4 *p;  // legal
        global float4 a;   // illegal
    }

    Constant Address Space

    constant float wtsA[] = { 0, 1, 2, . . . };  // program scope

    // illegal - program scope variables can be allocated only
    // in the constant address space
    global float wtsB[] = { 0, 1, 2, . . . };

    kernel void my_func(constant float4 *vA, constant float4 *vB)
    {
        constant float4 *p = vA;   // legal
        constant float a;          // illegal - not initialized
        constant float b = 2.0f;   // legal - initialized with a
                                   // compile-time constant
        p[0] = (float4)(1.0f);     // illegal - p cannot be modified
        ...

    Local Address Space

    A good analogy for local memory is a user-managed cache. It is preferable to read the required data from global memory (which is an order of magnitude slower) once into local memory and then have the work-items read multiple times from local memory.

    kernel void my_func(global float4 *vA, local float4 *l)
    {
        local float4 *p;               // legal
        local float4 a;                // legal
        a = 1;
        local float4 b = (float4)(0);  // illegal - b cannot be initialized
        if (...) {
            local float c;             // illegal - must be allocated at kernel function scope
            ...
        }
    }

    Image Convolution Example

    Sequential convolution
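    The sequential code for this slide did not survive in the transcript. As a hedged stand-in, here is a minimal plain C sketch of a sequential 2D convolution. It assumes a square filter of odd width applied without flipping (as is common in image-processing code) and leaves border pixels untouched; the function name `convolve` and these policies are assumptions, not the course's code.

```c
/* Sequential 2D convolution of a rows x cols row-major image with a
   square filter of odd width. Border pixels, where the filter would
   hang over the edge, are left untouched in `out`. */
static void convolve(const float *in, float *out, int rows, int cols,
                     const float *filter, int filter_width)
{
    int radius = filter_width / 2;
    for (int r = radius; r < rows - radius; r++) {
        for (int c = radius; c < cols - radius; c++) {
            float sum = 0.0f;
            int idx = 0;
            /* Walk the filter window centred on (r, c) */
            for (int i = -radius; i <= radius; i++)
                for (int j = -radius; j <= radius; j++)
                    sum += in[(r + i) * cols + (c + j)] * filter[idx++];
            out[r * cols + c] = sum;
        }
    }
}
```

    The two outer loops are independent across pixels, which is what makes the problem a natural fit for one work-item per output pixel.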

    A specific way to handle images in OpenCL 1.0

    Convolution kernel

    __kernel void convolution(__read_only image2d_t sourceImage,
                              __write_only image2d_t outputImage,
                              int rows, int cols,
                              __constant float* filter,
                              int filterWidth,
                              sampler_t sampler)
    {
    ...

    Results with a 3x3 filter

    [Figure: filtered images for Iteration 1, Iteration 2, Iteration 3, ...]

    Synchronization with OpenCL

    Example of synchronization inside a kernel (local)

    // Host code
    . . .
    cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, 10*sizeof(float), 0, 0);
    cl_mem intermediate = clCreateBuffer(context, CL_MEM_READ_ONLY, 10*sizeof(float), 0, 0);
    cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 10*sizeof(float), 0, 0);

    clEnqueueWriteBuffer(queue, input, CL_TRUE, ...
    ...
    clSetKernelArg(kernel, 2, 2*sizeof(float), 0);

    size_t localws[1] = {2};
    size_t globalws[1] = {10};

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, ...
    ...
        ... otherAddress = (get_local_id(0) + 1) % get_local_size(0);
        ...[get_global_id(0)] = ldata[get_local_id(0)] + ldata[otherAddress];
    }
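    The part of the kernel that loads `ldata` and the barrier itself are missing from the transcript. The plain C emulation below is a sketch under stated assumptions: each work-item first copies its input element into the local array, a `barrier(CLK_LOCAL_MEM_FENCE)` separates that phase from the reads, and the output buffer is called `b`. The names `neighbour_sum_group`, `neighbour_sum_launch` and `MAX_GROUP` are hypothetical.

```c
#include <stddef.h>

#define MAX_GROUP 64 /* capacity of the emulated local array (assumption) */

/* Emulates one work-group in two phases separated by the barrier:
   (1) every work-item stores its input into local memory,
   (2) every work-item adds its own value and its neighbour's. */
static void neighbour_sum_group(const float *a, float *b,
                                size_t group_id, size_t local_size)
{
    float ldata[MAX_GROUP];
    size_t base = group_id * local_size;

    for (size_t lid = 0; lid < local_size; lid++)   /* phase 1 */
        ldata[lid] = a[base + lid];
    /* barrier(CLK_LOCAL_MEM_FENCE) sits between the two phases */
    for (size_t lid = 0; lid < local_size; lid++) { /* phase 2 */
        size_t other = (lid + 1) % local_size;
        b[base + lid] = ldata[lid] + ldata[other];
    }
}

/* Runs every work-group of a 1D launch, one after the other. */
static void neighbour_sum_launch(const float *a, float *b,
                                 size_t global_size, size_t local_size)
{
    for (size_t g = 0; g < global_size / local_size; g++)
        neighbour_sum_group(a, b, g, local_size);
}
```

    Without the barrier, a work-item could read `ldata[other]` before its neighbour has written it; the two separate loops model the guarantee that the barrier provides. With the slide's sizes (global 10, local 2) the local array needs room for 2 floats, which matches the `2*sizeof(float)` passed to clSetKernelArg.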

    Example of global synchronization

    // Perform setup of platform, context and create buffers
    . . .
    // Create queue leaving parameters as default so queue is in-order
    queue = clCreateCommandQueue(context, devices[0], 0, 0);
    . . .
    clEnqueueWriteBuffer(queue, bufferA, CL_TRUE, ...

    Events provide a gateway to a command's history: they contain information detailing when the command was placed in the queue, when it was submitted to the device, and when it started and ended execution.

    cl_int clGetEventProfilingInfo (cl_event event, cl_profiling_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

    Profiling is enabled when creating a command queue by setting the CL_QUEUE_PROFILING_ENABLE flag.

    If no local dimension is given to clEnqueueNDRangeKernel, then OpenCL decides for the programmer

    Otherwise it's safer to choose a power of 2 (in OpenCL 1.x the local size must also divide the global size evenly)

    Code and test how performance is affected when we use this feature for vector addition or simple matrix multiplication

    Optimizing matrix multiplication...

    With the previous implementation and order-1000 matrices, one work-item per matrix element results in a million work-items (approx. 511 MFLOPS)

    In the next version of the program, each work-item will compute a row of the matrix

    The NDRange is changed from a 2D range set to match the dimensions of the C matrix to a 1D range set to the number of rows in the C matrix.

    My device has four compute units. Hence for an order-1000 matrix we can set the work-group size to 250 and create four work-groups to cover the full size of the problem.

    // Optimized matrix multiplication kernel
    // Version 1

    __kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                       __global float* A, __global float* B, __global float* C)
    {
        int k, j;
        int i = get_global_id(0);
        float tmp;
        if (i < Ndim) {
            for (j = 0; j < Mdim; j++) {
                tmp = 0.0;
                for (k = 0; k < Pdim; k++)
                    tmp += A[i*Ndim+k] * B[k*Pdim+j];
                C[i*Ndim+j] = tmp;
            }
        }
    }

    In the host code, launching the kernel becomes:

    size_t globalWorkSize[1], localWorkSize[1];

    localWorkSize[0] = HA/4;
    globalWorkSize[0] = HA;
    clEnqueueNDRangeKernel(clCommandQue, clKernel, 1, NULL, ...
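    The launch geometry can be tried out on the host first. This plain C sketch (an emulation, not OpenCL) runs the "one row per work-item" body over a 1D range split into a given number of work-groups. It assumes square matrices of order n, for which the kernel's indexing is valid, and a number of groups that divides n; the function names are illustrative.

```c
#include <stddef.h>

/* Version 1 work-item: computes row `i` of C = A * B for square
   matrices of order n stored row-major (i stands for get_global_id(0)). */
static void mmul_row_work_item(size_t i, size_t n,
                               const float *a, const float *b, float *c)
{
    if (i >= n) return; /* guard against padded launches */
    for (size_t j = 0; j < n; j++) {
        float tmp = 0.0f;
        for (size_t k = 0; k < n; k++)
            tmp += a[i * n + k] * b[k * n + j];
        c[i * n + j] = tmp;
    }
}

/* Sequential stand-in for a 1D launch: `num_groups` work-groups of
   n / num_groups work-items each. Returns the number of work-items run. */
static size_t mmul_row_launch(size_t n, size_t num_groups,
                              const float *a, const float *b, float *c)
{
    size_t group_size = n / num_groups; /* assumes num_groups divides n */
    size_t launched = 0;
    for (size_t g = 0; g < num_groups; g++)
        for (size_t l = 0; l < group_size; l++, launched++)
            mmul_row_work_item(g * group_size + l, n, a, b, c);
    return launched;
}
```

    For the order-1000 case on a four-compute-unit device, `num_groups = 4` gives a work-group size of 250 and 1000 work-items in total, instead of the million used by the one-element-per-work-item version.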

    Our matrix multiplication kernels up to this point have left all three matrices in global memory. This means the computation streams rows and columns through the memory hierarchy (global to private) repeatedly for each dot product.

    We can reduce this memory traffic by recognizing that each work-item reuses the same row of A for each element of the row of C that it computes.

    // Optimized matrix multiplication kernel
    // Version 2

    __kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                       __global float* A, __global float* B, __global float* C)
    {
        int k, j;
        int i = get_global_id(0);
        // Beware, we cannot use preprocessing constants defined in the host code!
        float Awrk[1024];
        float tmp;
        if (i < Ndim) {
            // Copy values into private memory
            for (k = 0; k < Pdim; k++)
                Awrk[k] = A[i*Ndim+k];
            for (j = 0; j < Mdim; j++) {
                tmp = 0.0;
                for (k = 0; k < Pdim; k++)
                    tmp += Awrk[k] * B[k*Pdim+j];
                C[i*Ndim+j] = tmp;
            }
        }
    }

    The use of private memory has a dramatic impact on performance: 873 MFLOPS

    But a careful consideration shows that while each work-item reuses its own unique row of A, all the work-items in a group repeatedly stream the same columns of B

    We can reduce the overhead of moving data from global memory if the work-items in a work-group copy the columns of the matrix B into local memory before they start updating their rows of C.

    // Optimized matrix multiplication kernel
    // Version 3

    __kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                       __global float* A, __global float* B, __global float* C,
                       __local float* Bwrk)
    {
        int k, j;
        int i = get_global_id(0);
        int iloc = get_local_id(0), nloc = get_local_size(0);
        float Awrk[1024];
        float tmp;
        if (i < Ndim) {
            for (k = 0; k < Pdim; k++)
                Awrk[k] = A[i*Ndim+k];
            for (j = 0; j < Mdim; j++) {
                for (k = iloc; k < Pdim; k = k+nloc)
                    Bwrk[k] = B[k*Pdim+j];
                barrier(CLK_LOCAL_MEM_FENCE);
                tmp = 0.0;
                for (k = 0; k < Pdim; k++)
                    tmp += Awrk[k] * Bwrk[k];
                C[i*Ndim+j] = tmp;
            }
        }
    }

    Before launching the kernel:

    // The size is relative to local space; the NULL pointer only reserves it
    clSetKernelArg(*kernel, 6, sizeof(float)*Pdim, NULL);
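    The loop `for (k = iloc; k < Pdim; k = k+nloc)` spreads the copy of one column of B over the work-items of a group. The plain C sketch below emulates that strided sharing for a single work-group and counts the copies, to show that every element is written exactly once. The function name and parameter names are illustrative assumptions, and the work-items are run one after the other rather than concurrently.

```c
#include <stddef.h>

/* Emulates how the work-items of one work-group share the copy of column j
   of B (row-major, pdim rows, `ncols` columns) into the scratch array bwrk.
   Work-item `iloc` handles elements iloc, iloc + nloc, iloc + 2*nloc, ...
   Returns how many elements were copied in total. */
static size_t copy_column_cooperatively(const float *b, size_t pdim, size_t ncols,
                                        size_t j, size_t nloc, float *bwrk)
{
    size_t copied = 0;
    for (size_t iloc = 0; iloc < nloc; iloc++)        /* each work-item */
        for (size_t k = iloc; k < pdim; k += nloc) {  /* its strided share */
            bwrk[k] = b[k * ncols + j];
            copied++;
        }
    /* On a device, barrier(CLK_LOCAL_MEM_FENCE) would go here so that no
       work-item reads bwrk before every share has been written. */
    return copied;
}
```

    With 250 work-items per group and a column of 1000 elements, each work-item copies 4 elements, and the whole group then reads the column from local memory instead of global memory.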

    The goal is to maximize the amount of work per kernel and optimize memory movement

    Optimizing image convolution...

    Image support in OpenCL (clCreateImage2D, etc.) provides automatic caching and data access transformations that improve memory system performance, especially on GPUs.

    An optimized convolution kernel can be naturally divided into three sections:

    1. The caching of input data from global to local memory
    2. Performing the convolution
    3. The writing of output data back to global memory

    Selecting workgroup sizes and caching data

    In OpenCL, work-item creation and algorithm design must be considered simultaneously, especially when local memory is used

    The first approach is to create the same number of work-items as there are data elements to be cached in local memory

    Each work-item would simply copy one pixel from global to local memory... and then the work-items representing the border pixels would sit idle during the convolution

    Consequently, large filter sizes will not allow many output elements to be computed per workgroup

    The second approach is to create fewer work-items than pixels to be cached, so some work-items will have to copy multiple elements and none will sit idle during the convolution

    Selecting an efficient workgroup size requires consideration of the underlying memory architecture:

    For the AMD 6970 GPU ...

    Computing ND-Range

    For an image with dimensions imageWidth and imageHeight, only (imageWidth-paddingPixels) x (imageHeight-paddingPixels) work-items are needed

    Because the image will likely not be an exact multiple of the workgroup size, additional workgroups must be created: they will not be fully utilized, and this must be accounted for in the kernel

    // This function takes a positive integer and rounds it up to
    // the nearest multiple of another provided integer
    unsigned int roundUp(unsigned int value, unsigned int multiple)
    {
        // Determine how far past the nearest multiple the value is
        unsigned int remainder = value % multiple;
        // Add the difference to make the value a multiple
        if (remainder != 0)
            value += (multiple-remainder);
        return value;
    }
    ...
    int filterWidth = 7, paddingPixels = (int)(filterWidth ...
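    The rounding step can be exercised on its own in plain C. In the sketch below, `round_up` mirrors the function above; `work_items_for` is a hypothetical helper that combines it with the padding rule. The expression for the padding is cut off in the transcript, so `padding = 2 * (filter_width / 2)` is an assumption, as is the work-group width of 128 used in the note that follows.

```c
/* Round `value` up to the nearest multiple of `multiple` (multiple > 0). */
static unsigned int round_up(unsigned int value, unsigned int multiple)
{
    unsigned int remainder = value % multiple;
    return remainder ? value + (multiple - remainder) : value;
}

/* Number of work-items to launch along one image dimension: the pixels
   that need an output (size minus padding), rounded up to a whole number
   of work-groups. Assumes padding = 2 * (filter_width / 2). */
static unsigned int work_items_for(unsigned int image_size,
                                   unsigned int filter_width,
                                   unsigned int group_size)
{
    unsigned int padding = (filter_width / 2) * 2;
    return round_up(image_size - padding, group_size);
}
```

    For a 600-pixel-wide image and a 7-wide filter, 594 columns need an output; with a work-group width of 128 the launch is rounded up to 640 work-items, so the last 46 must be masked out by a bounds test in the kernel.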

    Caching Data to Local Memory

    The process of copying data from global memory to local memory is often the most error-prone operation when writing a kernel

    The work-items first need to determine where in global memory to copy from and then ensure that they do not access a region that is outside of their working area or out of bounds for the image.

    __kernel void convolution(__global float* imageIn,
                              __global float* imageOut,
                              __constant float* filter,
                              int rows, int cols, int filterWidth,
                              __local float* localImage,
                              int localHeight, int localWidth)
    {
        // Determine the amount of padding for this filter
        int filterRadius = (filterWidth ...

    Performance on both NVIDIA and AMD GPUs ...

    The only requirement is to pad the input data with extra columns so that its width becomes a multiple of the X-dimension of the workgroup

    But manually padding a data array on the host can be complicated, time-consuming, and sometimes infeasible

    To avoid such tedious data fixup, OpenCL has a command called clEnqueueWriteBufferRect to copy a host array into the middle of a larger device buffer

    Other improvements include using vector reads: for example, reading float4 data allows us to come closer to achieving peak memory bandwidth than reading float data

    On the AMD Radeon 6970, a significant performance gain is achieved by using vector reads... but a slight performance degradation was seen on NVIDIA GPUs

    // Perform the convolution
    if (globalRow < rows-padding && globalCol < cols-padding) {
        // Each work item will filter around its start
        // location (from the filter radius left and up)
        float sum = 0.0f;
        int filterIdx = 0;

        // Not unrolled
        for (int i = localRow; i < localRow+filterWidth; i++) {
            int offset = i*localWidth;
            for (int j = localCol; j < localCol+filterWidth; j++)
                sum += localImage[offset+j] * filter[filterIdx++];
        }

        // Write the data out
        imageOut[(globalRow+filterRadius)*cols + (globalCol+filterRadius)] = sum;
    }
    return;

    // Inner loop unrolled
    for (int i = localRow; i < localRow+filterWidth; i++) {
        int offset = i*localWidth+localCol;
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
        sum += localImage[offset++] * filter[filterIdx++];
    }

    On an AMD Radeon 6970, with a 7x7 filter and a 600x400 image, unrolling the innermost loop provides a 2.4x speedup. In general, this produces a substantial speedup on both AMD and NVIDIA devices.
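    Unrolling changes only the control flow, not the arithmetic. The plain C sketch below isolates the inner loop for one row of a 7-wide filter and writes it both ways; the function names are illustrative, and the width is fixed at 7 because manual unrolling requires a size known when the code is written.

```c
/* Dot product of one 7-pixel image row with one 7-tap filter row,
   written as an ordinary loop. */
static float row_dot_rolled(const float *pixels, const float *taps)
{
    float sum = 0.0f;
    for (int j = 0; j < 7; j++)
        sum += pixels[j] * taps[j];
    return sum;
}

/* Same computation with the loop unrolled by hand: no loop counter,
   no branch, and the accumulation order is unchanged. */
static float row_dot_unrolled(const float *pixels, const float *taps)
{
    float sum = 0.0f;
    sum += pixels[0] * taps[0];
    sum += pixels[1] * taps[1];
    sum += pixels[2] * taps[2];
    sum += pixels[3] * taps[3];
    sum += pixels[4] * taps[4];
    sum += pixels[5] * taps[5];
    sum += pixels[6] * taps[6];
    return sum;
}
```

    The price of the unrolled form is that it only works for the filter width it was written for.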

    Parallel data reduction

    A reduction is an algorithm that converts a large data set into a smaller data set using an operator on each element

    A simple reduction example is to compute the sum of the elements in an array, but it could also be min, max, or keep only positive elements, etc.

    float sum_array(float * a, int No_of_elements)
    {
        float sum = 0.0f;
        for (int i = 0; i < No_of_elements; i++)
            sum += a[i];
        return sum;
    }

    With OpenCL, the common way to parallelize a reduction is to divide the input data set between different workgroups on a GPU ...


    Within a workgroup, the reduction is performed over multiple stages. At each stage, work-items sum an element and its neighbor that is one stride away.
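    The staged scheme can be sketched on the host as a sequential emulation of one workgroup. This is only an illustration: the name `tree_reduce_sum` is hypothetical, the element count is assumed to be a power of two and at least 1, and the inner loop stands in for the work-items that would run in parallel on the device.

```c
#include <stddef.h>

/* Sequential emulation of one workgroup's tree reduction.
 * Assumes n is a power of two and n >= 1; data is overwritten.
 * Each pass of the outer loop is one stage; the inner loop plays the
 * role of the work-items that are active during that stage. */
float tree_reduce_sum(float *data, size_t n)
{
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        for (size_t id = 0; id < stride; id++)
            data[id] += data[id + stride];
        /* On a GPU, a barrier would separate the stages here. */
    }
    return data[0];
}
```

    For 8 elements the strides are 4, 2 and 1, so the result is available after three stages.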

    Reduction kernel


    // A simple reduction tree kernel where each workgroup reduces a set
    // of elements to a single value in local memory and writes the
    // resultant value to global memory.
    __kernel void reduction_kernel(
        unsigned int N,            // number of elements to reduce
        __global float* input,
        __global float* output,
        __local float* sdata)
    {
        // Get index into local data array and global array
        unsigned int localId = get_local_id(0), globalId = get_global_id(0);
        unsigned int groupId = get_group_id(0), wgSize = get_local_size(0);

        // Read in data if within bounds
        sdata[localId] = (globalId < N) ? input[globalId] : 0;

        // Synchronize since all data needs to be in local memory
        // and visible to all work-items
        barrier(CLK_LOCAL_MEM_FENCE);

        // Each work-item adds two elements in parallel.
        // As the offset shrinks, more work-items remain idle.
        for (int offset = wgSize; offset > 0; offset >>= 1) {
            if (localId < offset && localId + offset < wgSize)
                sdata[localId] += sdata[localId + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // Only one work-item needs to write out the result
        // of the workgroup's reduction
        barrier(CLK_LOCAL_MEM_FENCE);
        if (localId == 0) output[groupId] = sdata[0];
    }

    Improving reduction performance (see "OpenCL Optimization Case Study: Simple Reductions")


    At each step of the reduction tree, the active work-items get sparser and sparser. This leads to poor SIMD efficiency: only about 30% of the work-items are active, on average.

    Improving reduction performance (see "OpenCL Optimization Case Study: Simple Reductions")


    Reductions using atomics: operations such as atom_add() can reduce the partial results from each local reduction. But they are limited to the operators and data types supported by the platform.

    Two-stage reduction: the input is divided up into chunks large enough to keep all of the processors busy. The final local reduction is performed sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction.
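    One way to picture the two-stage idea is the host-side sketch below, written for a min reduction. It is an illustration under stated assumptions, not the kernel from the case study: `two_stage_min` and `NUM_WORKERS` are hypothetical names, the "workers" are emulated by a sequential loop, and `FLT_MAX` is used as the identity element for min over finite floats.

```c
#include <stddef.h>
#include <float.h>

#define NUM_WORKERS 4  /* hypothetical number of work-items */

float two_stage_min(const float *in, size_t n)
{
    float partial[NUM_WORKERS];

    /* Stage 1: each worker walks the input with a stride of NUM_WORKERS
     * and keeps a private running minimum (sequential work per worker,
     * so every worker stays busy). */
    for (size_t w = 0; w < NUM_WORKERS; w++) {
        float m = FLT_MAX;
        for (size_t i = w; i < n; i += NUM_WORKERS)
            if (in[i] < m)
                m = in[i];
        partial[w] = m;
    }

    /* Stage 2: combine the few partial results. */
    float result = partial[0];
    for (size_t w = 1; w < NUM_WORKERS; w++)
        if (partial[w] < result)
            result = partial[w];
    return result;
}
```

    Because the second stage only sees `NUM_WORKERS` values, its cost is small compared with the first stage when the input is large.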

    CPU reduction (min) kernel


    The histogram of an image provides a frequency distribution of pixel values in the image.

    We have either a single histogram, if the luminosity is used as the pixel value, or three histograms, if the R, G, and B color channel values are used.

    The principle of the histogram algorithm is to perform an operation over each pixel of the image:

    for (many input values)
        histogram[value]++;

    Sequential histogram


    [Slide figure not recovered; surviving label: "bits per channel"]
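    A minimal sequential version of the histogram loop could look like the sketch below. It assumes a single channel of 8-bit pixel values, hence 256 bins; `histogram_seq` and `NUM_BINS` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <string.h>

#define NUM_BINS 256  /* assumes 8-bit pixel values */

/* Count how many times each pixel value occurs. */
void histogram_seq(const unsigned char *pixels, size_t n,
                   unsigned int bins[NUM_BINS])
{
    memset(bins, 0, NUM_BINS * sizeof bins[0]);  /* clear all bins */
    for (size_t i = 0; i < n; i++)
        bins[pixels[i]]++;                       /* one increment per pixel */
}
```

    For an RGB image, the same loop would be run once per channel, giving three histograms.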


    Atomic operations


    Since many work-items execute in parallel in a workgroup, we cannot guarantee the ordering of read-after-write dependencies on our local histogram bins.

    We could replicate the histogram bins multiple times, but this would require a copy of each bin for each work-item in the group.

    The alternative solution is to use hardware atomics: any time two threads operate on a shared variable concurrently, and one of those operations performs a write, both threads must use atomic operations.
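    The same rule can be illustrated on the host with C11 atomics. This is a CPU-side analogue, not OpenCL code: `shared_bins` and `histogram_chunk` are hypothetical names, 8-bit pixel values are assumed, and thread creation is left out so the sketch stays short.

```c
#include <stdatomic.h>
#include <stddef.h>

#define BINS 256  /* assumes 8-bit pixel values */

/* Shared bins; zero-initialized because they have static storage duration. */
static atomic_uint shared_bins[BINS];

/* Accumulate one chunk of pixels into the shared histogram.
 * atomic_fetch_add makes each read-modify-write indivisible, so several
 * threads may call this function on different chunks at the same time
 * without losing increments. */
void histogram_chunk(const unsigned char *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        atomic_fetch_add(&shared_bins[pixels[i]], 1u);
}
```

    With a plain `bins[value]++` instead, two threads hitting the same bin could both read the old count and one increment would be lost.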


    kernel void histogram_image_rgba(image2d_t img,
                                     int num_pixels_per_workitem,
                                     global uint *histogram)
    {
        int local_size = get_local_size(0) * get_local_size(1);
        int image_width = get_image_width(img);
        int image_height = get_image_height(img);
        int group_indx = 256 * 3 * (get_group_id(1) * get_num_groups(0) + get_group_id(0));
        int x = get_global_id(0);
        int y = get_global_id(1);

        local uint tmp_histogram[256 * 3];

        int tid = get_local_id(1) * get_local_size(0) + get_local_id(0);
        int j = 256 * 3;
        int indx = 0;
        int i, idx;
        ...


    Thrust
