TORQUE Tutorial A Beginner's Guide Kenneth Nielson September 16, 2009


  • TORQUE Tutorial: A Beginner's Guide

    Kenneth Nielson, September 16, 2009

  • 2

    TORQUE Resource Manager

    ● What is TORQUE
    ● TORQUE's Role
    ● TORQUE Components
    ● Installation
    ● Configuration
    ● Job Administration
    ● Diagnostics
    ● MPI
    ● Multi-mom and Any-mom
    ● Roadmap
    ● Q&A

  • 3

    What is TORQUE?

    ● Terascale Open-Source Resource and QUEue Manager

    ● TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading edge HPC organizations.

    ● PBS – Portable Batch System

  • 4

    What is TORQUE

    The Portable Batch System, PBS, is a batch job and computer system resource management package. It was developed with the intent to be conformant with the POSIX 1003.2d Batch Environment Standard. As such, it will accept batch jobs, a shell script and control attributes, preserve and protect the job until it is run, run the job, and deliver output back to the submitter. PBS may be installed and configured to support jobs run on a single system, or many systems grouped together. Because of the flexibility of PBS, the systems may be grouped in many fashions.

  • 5

    TORQUE's Role

    ● Provides a job queuing facility
    ● Monitors resource configuration, utilization, and health
    ● Provides remote job execution and job management facilities
    ● Reports information to the cluster scheduler
    ● Receives direction from the cluster scheduler
    ● Handles user client requests

  • 6

    TORQUE Components

    Commands

    Job Server

    Job Executor

    Job Scheduler

  • 7

    TORQUE Components

    Commands

    ● Three classes of commands
        ○ User – any authorized user can execute
        ○ Operator – special access privileges required
        ○ Manager – special access privileges required

    ● User commands
        ○ qsub, qstat, pbsnodes, qdel

    ● Operator and manager commands

  • 8

    TORQUE Components

    Job Server

    ● pbs_server
    ● Central focus of TORQUE
    ● All commands and other daemons communicate with pbs_server via TCP/IP and UDP/IP
    ● Provides basic batch services
        ○ Job creation
        ○ Job modification
        ○ Job protection
        ○ Job execution

  • 9

    TORQUE Components

    Job Executor

    ● pbs_mom
        ○ Daemon called MOM – Machine-Oriented Miniserver
        ○ Receives copy of jobs from pbs_server
        ○ Places jobs into execution
        ○ Creates new session similar to user login session
        ○ For parallel jobs, a Mother Superior manages a group of sister nodes
        ○ Returns output to pbs_server or Mother Superior

  • 10

    TORQUE Components

    Job Scheduler

    ● Controls site policy
    ● TORQUE supports multiple schedulers
        ○ pbs_sched
            ■ Not supported by Adaptive Computing
        ○ Maui
            ■ Open source
            ■ User Group support only
        ○ Moab
            ■ TORQUE support included
            ■ For what Moab can do that Maui cannot, go to
              http://www.clusterresources.com/products/maui/docs/a.kmoabcomp.shtml

  • 11

    TORQUE Installation

    Where to get it

    ● svn (subversion)
        svn://svn.clusterresources.com/torque
        ○ /trunk – currently 2.4 beta
        ○ /branches/2.3-fixes – snapshot build with latest fixes
        ○ /branches/2.3-multimom – allows multiple moms on a single node

    ● www.clusterresources.com
        http://www.clusterresources.com/downloads/torque/
        ○ torque-2.3.7.tar.gz is the latest released version

  • 12

    TORQUE Installation

    Extract and build the distribution on the machine that will act as the TORQUE server.

    > tar -xzvf torqueXXX.tar.gz
    > cd torqueXXX
    > ./configure
    > make
    > make install

  • 13

    TORQUE Installation

    TORQUE Install Directory

    ● Default location /usr/local/
        ○ bin
            ■ Contains client commands – qstat, pbsnodes, qsub, etc.
            ■ Needed on server and login/submission hosts
        ○ sbin
            ■ Contains server and node daemons – pbs_server, pbs_mom, pbs_demux, pbs_sched, momctl
        ○ lib
            ■ Contains TORQUE libraries – libtorque.so.x

  • 14

    TORQUE Installation

    Initial TORQUE Startup

    pbs_server

    ● As root, type
        pbs_server -t create
      or
        torque.setup <user>

    ● Stop pbs_server before running in production
        qterm

  • 15

    TORQUE Installation

    root@ken-linuxBox:/usr/local/sbin# pbs_server -t create

    Qmgr: p s
    #
    # Set server attributes.
    #
    set server acl_hosts = ken-linuxBox
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server node_check_rate = 150
    set server tcp_timeout = 6

  • 16

    TORQUE Installation

    ken@ken-linuxBox:~/dev/torque/2.3-fixes$ sudo ./torque.setup ken

    create queue batch
    #
    set queue batch queue_type = Execution
    set queue batch resources_default.nodes = 1
    set queue batch resources_default.walltime = 01:00:00
    set queue batch enabled = True
    set queue batch started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server acl_hosts = ken-linuxBox
    set server default_queue = batch
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server node_check_rate = 150
    set server tcp_timeout = 6
    set server mom_job_sync = True
    set server keep_completed = 300

  • 17

    TORQUE Configuration

    TORQUE Home Directory

    ● Default /var/spool/torque ($TORQUE_HOME, $PBS_HOME, etc.)
        ○ server_name – name of host where pbs_server resides; can have multiple host names for high availability
        ○ server_priv
            ■ jobs
            ■ nodes
        ○ server_logs
            ■ files of the form yyyymmdd (e.g. 20090916)
        ○ mom_priv
            ■ jobs
            ■ config
        ○ mom_logs
            ■ files of the form yyyymmdd (e.g. 20090916)

  • 18

    TORQUE Configuration

    pbs_server Configuration -- nodes file

    ● server_priv/nodes
        ○ Contains list of mom host names and attributes
            ■ Attributes
                ● np – number of processors
                ● note – administrator note
                ● properties – administrator's choice

    ● nodes file syntax
        ○ host np=X note=string property1 property2 ... propertyn
        ○ Example:
            ■ host1 np=4 note=new intel_i7 data
            ■ host2 np=4 x86
            ■ host3 np=8 amd_64
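The nodes file syntax above lends itself to simple scripting. The following is a minimal sketch, not part of TORQUE itself: `nodes_line` and `total_np` are hypothetical helper names, and the sample file is written to a temporary path rather than to the real server_priv/nodes.

```shell
#!/bin/sh
# Hypothetical helpers for working with a TORQUE nodes file.

# nodes_line HOST NP NOTE [PROPERTY...]
# Prints one nodes-file entry; pass an empty NOTE ("") to omit the note.
nodes_line() {
    host=$1; np=$2; note=$3
    shift 3
    line="$host np=$np"
    if [ -n "$note" ]; then
        line="$line note=$note"
    fi
    for prop in "$@"; do
        line="$line $prop"
    done
    printf '%s\n' "$line"
}

# total_np FILE
# Sums the np= values across all entries in FILE.
total_np() {
    awk '{
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^np=[0-9]+$/) { sub(/^np=/, "", $i); sum += $i }
         }
         END { print sum + 0 }' "$1"
}

# Demo: rebuild the three example entries from the slide in a scratch file.
nodes_file=$(mktemp)
{
    nodes_line host1 4 new intel_i7 data
    nodes_line host2 4 "" x86
    nodes_line host3 8 "" amd_64
} > "$nodes_file"

cat "$nodes_file"
echo "total np: $(total_np "$nodes_file")"
```

Keeping the generator separate from the real nodes file makes it possible to review the result before copying it into place and restarting pbs_server.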

  • 19

    TORQUE Configuration

    pbs_server node configuration

    ● Restart pbs_server
    ● Run pbsnodes

    host1
        state = down
        np = 4
        properties = intel_i7,data
        ntype = cluster
        note = new

    host2
        state = down
        np = 4

  • 20

    TORQUE Configuration

    pbs_server node configuration

    ● Dynamic node configuration
        > qmgr -c "create node node003"

    ● Manually edit the nodes file
        ■ $TORQUE_HOME/server_priv/nodes
        ■ Restart pbs_server daemon after change

  • 21

    TORQUE Configuration

    ● pbs_server queue configuration
        ○ Attributes
            ■ queue_type
                ● execution, route
            ■ resources_default
                ● Default resource requirements for jobs (walltime, nodes)
            ■ enabled
                ● Specifies whether queue accepts new jobs (default False)
            ■ started
                ● Specifies whether jobs in queue are allowed to execute (default False)

  • 22

    TORQUE Configuration

    ● pbs_server queue configuration
        ○ Default queue: batch
        ○ Create new queue
            ■ qmgr
                ● create queue reg
                ● set queue reg queue_type=Execution
                ● set queue reg resources_default.nodes=1
                ● set queue reg resources_default.walltime=01:00:00
                ● set queue reg enabled=True
                ● set queue reg started=True
        ○ Setting default queue
            ■ qmgr -c "set server default_queue=reg"

    Note: A queue is called a class in Moab
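The queue-creation steps above can be collected into a reusable script. This is a sketch under assumptions: `queue_commands` is a hypothetical helper that only prints qmgr directives and never contacts pbs_server.

```shell
#!/bin/sh
# queue_commands NAME WALLTIME
# Prints the qmgr directives that create and start an execution queue,
# mirroring the steps listed on the slide. Nothing is sent to pbs_server.
queue_commands() {
    name=$1; walltime=$2
    cat <<EOF
create queue $name
set queue $name queue_type = Execution
set queue $name resources_default.nodes = 1
set queue $name resources_default.walltime = $walltime
set queue $name enabled = True
set queue $name started = True
EOF
}

# Print the directives for the "reg" queue used in the example.
queue_commands reg 01:00:00

# On a TORQUE server, a manager could apply them through qmgr's
# standard input:
#   queue_commands reg 01:00:00 | qmgr
```

Printing the directives first gives an administrator the chance to inspect them before anything changes on the server.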

  • 23

    TORQUE Configuration

    pbs_mom Configuration

    ● As root, run pbs_mom
        ○ No special configuration needed to start
        ○ Use mom_priv/config for options

    ● mom_priv/config
        ○ Allows custom configuration of mom node
        ○ Syntax
            ■ $<option> value
            ■ Example
                $loglevel 3
                $usecp *.fte.com:/data /usr/local/data

  • 24

    TORQUE Configuration

    ● For shared filesystems, use the $usecp parameter in the mom_priv/config file

        $usecp *.fte.com:/data /usr/local/data

    ● For local, non-shared filesystems, rcp or scp must be configured to allow direct copy without prompting for passwords (key authentication, etc.)

        http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml
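A mom_priv/config file with the two options shown above could be generated as follows. This is only a sketch: `write_mom_config` is a hypothetical helper, and it writes under a scratch directory rather than the real TORQUE home.

```shell
#!/bin/sh
# write_mom_config DIR LOGLEVEL USECP_SPEC
# Writes DIR/mom_priv/config with a $loglevel and a $usecp line.
# The "$" that starts each option name is escaped so the shell does not
# treat it as a variable reference inside the here-document.
write_mom_config() {
    dir=$1; loglevel=$2; usecp=$3
    mkdir -p "$dir/mom_priv"
    cat > "$dir/mom_priv/config" <<EOF
\$loglevel $loglevel
\$usecp $usecp
EOF
}

# Demo with the values from the slide, written to a temporary directory.
scratch=$(mktemp -d)
write_mom_config "$scratch" 3 '*.fte.com:/data /usr/local/data'
cat "$scratch/mom_priv/config"
```

The `$usecp` value is passed in single quotes so that the leading `*` reaches the file unchanged instead of being expanded by the shell.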

  • 25

    TORQUE Configuration

    Scheduler Configuration

    ● Follow directions for scheduler of choice
    ● Moab configuration
        ○ http://www.clusterresources.com/products/mwm/docs/2.0installation.shtml

  • 26

    Advanced Configuration

    Customizing the Install

    Most recommended configure options have been selected as default. Some often used options:

    --with-debug – for use with gdb
    --prefix=<DIR> – change install directory
    --exec-prefix=<DIR> – change only executable install directory
    --disable-gcc-warnings – use with care

    ./configure --help will give all options

  • 27

    Advanced Configuration

    ● Configuring Job Submission Hosts
        ○ Use acl_hosts
        ○ Use torque.cfg submithosts, allowcomputehosts
        ○ /etc/hosts.equiv

    ● Configuring TORQUE on a Multi-Homed Server
    ● Specifying Non-Root Administrators

        > qmgr

        Qmgr: set server managers += josh@*.fsc.com
        Qmgr: set server operators += josh@*.fsc.com
        Qmgr: set server log_level=3

  • 28

    Job Administration

    Job Flow

    ● pbs_server receives new job
    ● Informs the scheduler
    ● When nodes are available, scheduler sends instructions and nodes list to pbs_server
    ● pbs_server sends job to the first node in the node list
    ● The first node, or Mother Superior, launches the job and passes it to the rest of the nodes in the node list, or the Sister moms

  • 29

    Job Administration

    qsub

    ● Batch and Interactive
    ● Requesting Resources
    ● Examples
        ○ To ask for 2 processors on each of four nodes:
            qsub -l nodes=4:ppn=2
        ○ The following job will wait until node01 is free with 200 MB of available memory:
            qsub -l nodes=node01,mem=200mb /home/user/script.sh
    ● Directives can be embedded into job script
        ○ Example on next page

  • 30

    Job Administration

    #!/bin/sh

    #PBS -N ds14FeedbackDefaults
    #PBS -q testqueue
    #PBS -l nodes=1:ppn=2,walltime=240:00:00
    #PBS -M user@mydomain.com

    # Load the user's shell settings ("." is the portable form of "source")
    . ~/.bashrc

    # Show the nodes assigned to the job, then the job ID
    cat "$PBS_NODEFILE"
    echo "$PBS_O_JOBID"
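Embedded directives make it easy to produce job scripts from a template. The sketch below assumes a hypothetical helper, `make_job_script`, and uses illustrative values (job name `demo`, queue `batch`); it only writes a file and does not submit anything.

```shell
#!/bin/sh
# make_job_script NAME QUEUE NODES PPN WALLTIME COMMAND
# Prints a job script whose #PBS lines carry the same options that could
# otherwise be given on the qsub command line.
make_job_script() {
    cat <<EOF
#!/bin/sh
#PBS -N $1
#PBS -q $2
#PBS -l nodes=$3:ppn=$4,walltime=$5

# The node list file is provided by TORQUE at run time.
cat "\$PBS_NODEFILE"
$6
EOF
}

# Demo: write a small job script to a temporary file and show it.
job=$(mktemp)
make_job_script demo batch 1 2 00:10:00 'echo "job finished"' > "$job"
cat "$job"

# On a submission host the file could then be passed to qsub:
#   qsub "$job"
```

Options given on the qsub command line can still be added at submission time, so a template like this only needs to carry the site's common defaults.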

  • 31

    Job Administration

    Manually Administrating Jobs

    > qsub scatter

    4807.ken-linuxbox

    > qstat

    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    4807             scatter          user01           12:56:34 Q batch

  • 32

    Job Administration

    Manually Administrating Jobs

    > qrun 4807

    > qstat

    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    4807             scatter          user01           12:56:34 R batch

    > qstat

    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    4807             scatter          user01           12:56:34 C batch
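The S column in the listings above moves from Q to R to C as the job is queued, run, and completed. A script can read that column directly. The sketch below uses two hypothetical helpers, `state_name` and `job_state`, and assumes the default qstat layout shown on the slides, where the state is the fifth whitespace-separated field.

```shell
#!/bin/sh
# state_name LETTER
# Maps the single-letter code in the qstat "S" column to a description.
state_name() {
    case $1 in
        Q) echo "queued" ;;
        R) echo "running" ;;
        C) echo "completed" ;;
        E) echo "exiting" ;;
        H) echo "held" ;;
        *) echo "other" ;;
    esac
}

# job_state JOBID
# Reads qstat output on stdin and prints the state letter of JOBID.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample line copied from the slide; a live system would pipe qstat instead.
sample='4807 scatter user01 12:56:34 R batch'
letter=$(printf '%s\n' "$sample" | job_state 4807)
echo "job 4807 is $(state_name "$letter")"
```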

  • 33

    Job Administration

    Canceling Jobs

    qdel

    -W delay
        Specify the delay between the sending of the SIGTERM and SIGKILL signals.

    -p purge
        Forcibly purge the job from the server. This option is only available to a batch operator or the batch administrator.

    -m message
        Specify a comment to be included in the email. The argument message specifies the comment to send. This option is only available to a batch operator or the batch administrator.

    [all|ALL]
        Delete all jobs in the queue

  • 34

    Job Administration

    Automating Job Administration

    ● Integrate with an external scheduler
        ○ Moab Workload Manager
    ● Job Arrays
        ○ Submit multiple jobs at once
    ● Submit Filters
    ● Job Preemption

  • 35

    Job Administration

    ● Job Arrays
        ○ TORQUE 2.3 and later
        ○ Allows single line submission of multiple jobs for a single script
        ○ Jobs can be monitored as a group

    Example

    > qsub -t 0-3 scatter
    33.hostname
    > qstat

    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    33-0             scatter-0        user01           12:56:34 R batch
    33-1             scatter-1        user01           12:56:34 R batch
    33-2             scatter-2        user01           12:56:34 R batch
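Within an array job, each sub-job typically needs to select its own piece of work. The sketch below assumes that TORQUE exposes the sub-job index in the `PBS_ARRAYID` environment variable; `pick_input` is a hypothetical helper, and the file names are made up for the demo.

```shell
#!/bin/sh
# pick_input LIST_FILE
# Prints the line of LIST_FILE selected by PBS_ARRAYID (0-based), so each
# sub-job of "qsub -t 0-3" can work on a different input.
pick_input() {
    index=${PBS_ARRAYID:-0}
    sed -n "$((index + 1))p" "$1"
}

# Demo outside TORQUE: set the variable by hand for each sub-job index.
inputs=$(mktemp)
printf '%s\n' a.dat b.dat c.dat d.dat > "$inputs"
for PBS_ARRAYID in 0 1 2 3; do
    export PBS_ARRAYID
    echo "sub-job $PBS_ARRAYID -> $(pick_input "$inputs")"
done
```

Keeping the list of inputs in a plain file means the same job script can serve any array size; only the `-t` range changes.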

  • 36

    Job Administration

    Submit Filters

    When submit filters exist, TORQUE sends the command file to the script/executable, which modifies the request based on site policies.

    ● Submit filter designated in torque.cfg
        ○ Found in /var/spool/torque
        ○ Keyword SUBMITFILTER

    Example torque.cfg

        SUBMITFILTER /home/user/submit_filter

  • 37

    Job Administration

    Submit Filter Examples

    /home/user/submit_filter

    #!/bin/sh

    # Add default memory constraints and an e-mail notification address to all requests
    # that did not specify them in the user's script or on the command line.

    echo "#PBS -l mem=16MB"
    echo "#PBS -M ken@adaptivecomputing.com"

    # Pass the submitted script through unchanged.
    while read i
    do
        echo "$i"
    done
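The filter above prints its defaults for every submission. A variation that adds the memory default only when the job script does not already request memory might look like the sketch below. It is an illustration, not the author's filter: `filter_job` is a hypothetical name, the check only looks at `#PBS -l` lines in the script (not at command-line options), and the script is assumed to arrive on standard input as in the example above.

```shell
#!/bin/sh
# filter_job
# Reads a job script on stdin and writes it to stdout, prepending a default
# memory request unless a "#PBS -l" line already mentions mem=.
filter_job() {
    script=$(cat)    # buffer the whole job script
    if ! printf '%s\n' "$script" | grep -q '^#PBS .*-l .*mem='; then
        echo "#PBS -l mem=16MB"
    fi
    printf '%s\n' "$script"
}

# Demo: one script without a memory request, one with.
printf '%s\n' 'ls -alR /' | filter_job
echo "---"
printf '%s\n' '#PBS -l mem=64MB' 'ls -alR /' | filter_job
```

Buffering the input costs a little memory but lets the filter inspect the whole script before deciding what to prepend.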

  • 38

    Job Administration

    Submit Filter Examples

    listtest.sh

    #!/bin/sh
    ls -alR /

    > qsub listtest.sh
    10.kmn.cridomain

    > cat /var/spool/torque/server_priv/jobs/10.kmn.cridomain.SC

    #PBS -l mem=16MB
    #PBS -M ken@adaptivecomputing.com
    ls -alR /

  • 39

    TORQUE Administration

    Job Preemption

    TORQUE has three basic tools:

    ● Cancel – qdel
    ● Requeue – qrerun
    ● Checkpoint

    The scheduler uses the basic tools to enable job preemption. See Moab for more information:

    http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml

  • 40

    TORQUE Administration

    Monitoring Resources

    TORQUE reports a number of attributes broken into 3 major categories:

    ● Configuration
        ○ Includes both detected hardware configuration and specified batch attributes
        ○ Can report static ‘generic resources’ via specification in the mom config file

    ● Utilization
        ○ Includes information regarding the amount of node resources currently available (in use) as well as information about who or what is consuming it
        ○ Can report dynamic ‘generic resources’ via specification of a ‘monitor script’ in the mom config file

    ● State
        ○ Includes administrative status, general node health information, and general usage status

  • 41

    TORQUE Administration

    Monitoring Resources

    > pbsnodes

    ken-linuxBox
        state = free
        np = 2
        properties = bldg1,intel_i7
        ntype = cluster
        status = opsys=linux,uname=Linux ken-linuxBox 2.6.24-23-generic #1 SMP Wed Apr 1 21:47:28 UTC 2009 i686,sessions=4983 5873 6220 6331 6335 6360 6369 6402 6456 6460 6489 6582,nsessions=12,nusers=2,idletime=1,totmem=8123824kb,availmem=7584648kb,physmem=2067360kb,ncpus=2,loadave=0.05,netload=36957532,state=free,jobs=,varattr=,rectime=1252467787
        note = backed_up
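The status attribute packs many values into one comma-separated line, which is awkward to read by eye. The sketch below pulls out a single value; `status_value` is a hypothetical helper, and it assumes the key=value layout of the sample output above (values that themselves contain "=" or "," would need more careful parsing).

```shell
#!/bin/sh
# status_value KEY
# Reads a pbsnodes "status = ..." line on stdin and prints the value of KEY.
status_value() {
    sed -n 's/^ *status = //p' |
        tr ',' '\n' |
        awk -F= -v key="$1" '$1 == key { print $2 }'
}

# Shortened sample taken from the slide; a live system would pipe pbsnodes.
sample='    status = opsys=linux,nusers=2,totmem=8123824kb,availmem=7584648kb,ncpus=2,loadave=0.05'
printf '%s\n' "$sample" | status_value availmem
printf '%s\n' "$sample" | status_value loadave
```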

  • 42

    TORQUE Administration

    Node States

    ● States
        ○ down (down)
        ○ offline (drained)
        ○ job-exclusive (busy)
        ○ free (idle/running)
        ○ reserve
        ○ job-sharing
        ○ busy
        ○ time-shared
        ○ state-unknown

    ● Changing node state
        ○ Offline
            pbsnodes -o <nodename>
        ○ Online
            pbsnodes -c <nodename>

    ● Viewing nodes of a particular state
        pbsnodes -l
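For a quick overview of a whole cluster, the state lines in pbsnodes output can be tallied. This is a sketch under assumptions: `count_states` is a hypothetical helper, the input is expected in the "state = value" form shown in the earlier pbsnodes listings, and the three-node sample (including host3) is made up for the demo.

```shell
#!/bin/sh
# count_states
# Reads pbsnodes output on stdin and prints "<state> <count>" per state.
# Only lines of the form "state = <value>" are counted, so the state=
# entry embedded in the status attribute is ignored.
count_states() {
    awk '$1 == "state" && $2 == "=" { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Made-up sample in the layout of the pbsnodes listings above.
sample='host1
     state = down
     np = 4
host2
     state = down
     np = 4
host3
     state = free
     np = 8'
printf '%s\n' "$sample" | count_states
```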

  • 43

    TORQUE Administration

    Node Properties

    ● Node Property Attributes
        ○ Can apply multiple properties per node
        ○ Properties are ‘opaque’
        ○ Each property can be applied to multiple nodes
        ○ Properties cannot be consumed

    ● Dynamically with qmgr
        > qmgr -c "set node node001 properties = bigmem"
        > qmgr -c "set node node001 properties += dualcore"

    ● Manually edit server_priv/nodes file
        ○ Always restart pbs_server after modifying nodes file

  • 44

    TORQUE Administration

    Accounting Records

    ● TORQUE maintains accounting records of jobs in server_priv/accounting
    ● Files of the form yyyymmdd

    Record Marker   Record Type   Description
    D               delete        Job was deleted
    E               exit          Job has exited (successfully or unsuccessfully)
    Q               queue         Job has been submitted/queued
    S               start         An attempt to start the job has been made (if the job fails to
                                  properly start, it may have multiple job start records)

    ● 09/08/2009 22:15:58;Q;9.ken-linuxbox;queue=batch
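Because each accounting record is a single semicolon-separated line, standard text tools are enough for simple reports. The sketch below counts records by marker. `count_records` is a hypothetical helper; the first demo line is the record from the slide, and the remaining lines are made-up records in the same layout.

```shell
#!/bin/sh
# count_records FILE
# Counts accounting records by marker (D, E, Q, S), assuming the
# semicolon-separated layout "timestamp;marker;jobid;details".
count_records() {
    awk -F';' '{ n[$2]++ } END { for (t in n) print t, n[t] }' "$1" | sort
}

# Demo file written to a temporary path, not to server_priv/accounting.
acct=$(mktemp)
cat > "$acct" <<'EOF'
09/08/2009 22:15:58;Q;9.ken-linuxbox;queue=batch
09/08/2009 22:16:03;S;9.ken-linuxbox;user=user01
09/08/2009 22:20:41;E;9.ken-linuxbox;user=user01
09/08/2009 22:21:10;Q;10.ken-linuxbox;queue=batch
EOF
count_records "$acct"
```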

  • 45

    Diagnostics

    Log Files

    ● pbs_server log files
        ○ /var/spool/torque/server_logs
        ○ qmgr: set server log_level=x

    ● pbs_mom log files
        ○ /var/spool/torque/mom_logs
        ○ /var/spool/torque/mom_priv/config
            $loglevel x
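Since log files are named by date in yyyymmdd form, the path of the current log can be computed rather than looked up. In this sketch `todays_log` is a hypothetical helper, and reading the home directory from a `TORQUE_HOME` environment variable is an assumption; it falls back to the default /var/spool/torque.

```shell
#!/bin/sh
# todays_log KIND
# Prints the path of today's log file for KIND ("server" or "mom").
todays_log() {
    home=${TORQUE_HOME:-/var/spool/torque}
    printf '%s/%s_logs/%s\n' "$home" "$1" "$(date +%Y%m%d)"
}

todays_log server
todays_log mom

# On a TORQUE host the file could then be followed with, for example:
#   tail -f "$(todays_log mom)"
```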

  • 46

    Diagnostics

    MOM Diagnostics

    ● momctl
        ○ Diagnoses mom configuration and communication with server
        ○ -d3 option
        ○ Output on next slide

  • 47

    Diagnostics

    Host: ken-linuxBox/ken-linuxbox   Version: 2.3.8   PID: 12792
    Server[0]: ken-linuxBox (127.0.1.1:15001)
      Init Msgs Received:     0 hellos/1 cluster-addrs
      Init Msgs Sent:         1 hellos
      Last Msg From Server:   8 seconds (StatusJob)
      Last Msg To Server:     15 seconds
    HomeDirectory:          /var/spool/torque/mom_priv
    stdout/stderr spool directory: '/var/spool/torque/spool/' (110542371 blocks available)
    NOTE:  syslog enabled
    MOM active:             153 seconds
    Check Poll Time:        45 seconds
    Server Update Interval: 45 seconds
    LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model:    RPP
    MemLocked:              TRUE (mlock)
    TCP Timeout:            20 seconds
    Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
    Alarm Time:             0 of 10 seconds
    Trusted Client List:    127.0.1.1,127.0.0.1
    Copy Command:           /usr/bin/scp -rpB
    job[12.ken-linuxbox]  state=RUNNING  sidlist=12830
    Assigned CPU Count:     1

    diagnostics complete

  • 48

    MPI

    MPI (Message Passing Interface)

    ● Used for parallel jobs
    ● Augments communication between tasks distributed across cluster
    ● TORQUE can run with any MPI library
    ● TORQUE provides limited integration with some MPI libraries
    ● MPI packages
        ○ MPICH – Argonne National Lab
        ○ MPICH-VMI – NCSA
        ○ Open MPI

  • 49

    MPI

    MPIExec Overview

    ● Replacement for mpirun script
    ● Initializes a parallel job within a PBS batch or interactive environment
    ● Uses task manager library of PBS to spawn copies of executable on nodes
    ● TM interface faster than invoking separate rsh (mpirun)
    ● Resources used by spawned processes accounted correctly with mpiexec
    ● Tasks that exceed assigned limits (walltime, memory, disk space) are killed
    ● mpiexec can enforce a security policy. Obviates use of rsh or ssh

    See mpiexec home page for more information.
    http://www.osc.edu/~djohnson/mpiexec/index.php

  • 50

    Multi-Mom

    ● Multiple pbs_mom daemons on a single node
    ● Intended to enhance testing but possible to use in production
    ● Moms uniquely identified by name and ports
    ● Default pbs_mom ports
        ○ 15002
        ○ 15003
    ● Use alias in /etc/hosts
        ○ 192.168.0.10 myhost myhost1 myhost2
        ○ max alias names?

  • 51

    Multi-Mom

    Invoking multi-mom

    ● Syntax – pbs_mom -m -M 30002 -R 30003
    ● Modify nodes file
        ○ node1 np=2
        ○ node2 np=2 mom_service_port=30002 mom_manager_port=30003
    ● Stopping multi-mom
        ○ momctl -s -p 30003
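When several moms share one host, each nodes file entry needs its own pair of ports. The sketch below generates such entries for a list of host aliases. `multimom_entries` is a hypothetical helper, and numbering the ports consecutively from a base port is an illustrative choice rather than a TORQUE requirement.

```shell
#!/bin/sh
# multimom_entries NP BASE_PORT NAME...
# Prints one nodes-file entry per NAME, giving each mom its own pair of
# ports (service port, then manager port) starting at BASE_PORT.
multimom_entries() {
    np=$1; port=$2
    shift 2
    for name in "$@"; do
        echo "$name np=$np mom_service_port=$port mom_manager_port=$((port + 1))"
        port=$((port + 2))
    done
}

# Demo with the alias names from the /etc/hosts example.
multimom_entries 2 30002 myhost1 myhost2
```

Each generated entry corresponds to a mom started with matching `-M` (service port) and `-R` (manager port) values, as in the syntax line above.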

  • 52

    Any-mom

    ● Enables any mom node to join a cluster without having an entry in the server_priv/nodes file.
    ● Syntax
        ○ pbs_server -e
    ● Can dynamically add moms to cluster without restarting pbs_server
    ● Creates security issues
        ○ Cannot control who joins the cluster
        ○ Need outside security policy

  • 53

    TORQUE Roadmap

    TORQUE 2.3.8
    ● Bug fixes only

    TORQUE 2.4
    ● Complete 2.3-fixes merge
    ● CPU affinity (very basic implementation)
    ● Multi-mom
    ● Any mom

    TORQUE 2.5
    ● TORQUE testing framework
    ● Eliminate need for privileged ports
    ● CPUsets improvements
    ● Improve TORQUE HA

    TORQUE 3.0
    ● Alternate communication model between pbs_server, MOMs and sisters
    ● Scalability for super large systems with large MPI jobs (10,000+ nodes)

  • 54

    TORQUE Q&A
