Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Post on 15-Jan-2015


DESCRIPTION

A guide for users of glideinWMS-based HTCondor pools on how to monitor the system, and troubleshoot the most common problems.

TRANSCRIPT

CERN, Dec 2012 glideinWMS monitoring 1

glideinWMS for users

Monitoring and troubleshooting a glideinWMS-based HTCondor pool

by Igor Sfiligoi (UCSD)


Scope of this talk

This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it.

The reader is expected to already have a basic understanding of HTCondor and glideinWMS.


HTCondor Architecture

● As a reminder

[Diagram: HTCondor pool architecture. Labels: Central manager (Negotiator), Submit nodes (Schedd), Execute nodes, Grid, G.F., VO FE.]


Typical user questions addressed in this talk

● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?


Where is/was my job running?


Job progress monitoring

● HTCondor provides two basic means to monitor job progress
  ● Querying the system for current status
    – Using the cmdline condor_q/condor_history
  ● Parsing the job event log
    – Either plain text or XML formatted
    – Starting with 7.9.1, condor_history can be used to extract the last known state


Job status

● Each Job has a status associated with it
  ● An integer attribute called JobStatus
    – But has well known semantics associated with each value
● Jobs start in the Idle state
  ● Become Running if everything works fine
  ● Completed when they terminate
● If anything goes wrong, a Job will go into Hold
● If removed before completion, will be Removed
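As a quick reference, the integer values of JobStatus can be translated to state names as in the sketch below. The numeric codes are HTCondor's standard ones; the helper name describe_job_status is made up for this illustration.

```python
# Standard HTCondor JobStatus codes and the state each one stands for.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def describe_job_status(code):
    """Return the state name for a JobStatus integer (hypothetical helper)."""
    return JOB_STATUS_NAMES.get(code, "Unknown (%d)" % code)


if __name__ == "__main__":
    # Print the three states a user meets most often while a job is queued.
    for code in (1, 2, 5):
        print(code, describe_job_status(code))
```

The integer itself is what condor_q and condor_history return when you ask for the JobStatus attribute.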


Monitoring the Job Status

● Idle/Running/Held jobs can be polled with condor_q
  ● Will query the Schedd daemon
● Once they terminate, or are removed, they leave the Schedd queue
  ● Are put into a file on disk
  ● Can use condor_history to retrieve the last ClassAd
● The job event log has all the state transitions (of course)

One exception: If a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.


So, where is the job running?

● Easy to get the machine name and/or IP
  ● Standard HTCondor attributes RemoteHost & StartdIpAddr
● But they may not necessarily make sense
  ● Do you recognize all network domains?
  ● And they could be on a private network!


Getting glidein attributes

● Glideins have many more attributes that describe them
  ● e.g. symbolic site name GLIDEIN_CMSSite
● However, by default, you do not get this info in the Job ClassAd
● But easy to add
  ● <my attr> = $$(<glidein attr>:Unknown)
    – Will get the info in MATCH_EXP_<my attr>
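To make the $$() substitution concrete, here is a small simulation of what happens to such an attribute when a job is matched to a glidein. It is an illustration only: HTCondor performs the real substitution itself, and the function name expand_match_attrs and the sample machine ad are assumptions made for this sketch.

```python
import re

# Matches $$(Attr) or $$(Attr:default).
_SUBST = re.compile(r"\$\$\(([A-Za-z_][A-Za-z0-9_]*)(?::([^)]*))?\)")


def expand_match_attrs(value, machine_ad):
    """Mimic $$(attr:default) expansion against a matched machine ad.

    machine_ad is a plain dict standing in for the glidein ClassAd.
    """
    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in machine_ad:
            return str(machine_ad[name])
        if default is not None:
            return default
        # This sketch simply raises when there is no value and no default.
        raise KeyError(name)

    return _SUBST.sub(_replace, value)


if __name__ == "__main__":
    glidein = {"GLIDEIN_CMSSite": "T2_FR_IPHC"}  # sample glidein attributes
    print(expand_match_attrs("$$(GLIDEIN_CMSSite:Unknown)", glidein))
    print(expand_match_attrs("$$(GLIDEIN_Site:Unknown)", glidein))
```

The ":Unknown" default is what keeps the attribute usable when a glidein does not advertise the requested value.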


Standard attributes

● Standard glideinWMS attributes
  ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
  ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
  ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
  ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
  ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
  ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
  ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
● Standard CMS glideinWMS attribute
  ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"

Useful for in-depth debugging

Configured by the HTCondor admin, no need for the user to do anything
    SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...


Getting them in the event log

● You (or the admins) can also propagate the attributes into the event log
    job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
● As a result you get “Job Ad” events

    ...
    001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
    ...
    028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
    TriggerEventTypeNumber = 1
    Cluster = 20327
    EventTypeNumber = 28
    ExecuteHost = "<193.48.85.94:38749>"
    JOB_CMSSite = "T2_FR_IPHC"
    EventTime = "2012-12-03T00:46:33"
    TriggerEventTypeName = "ULOG_EXECUTE"
    Proc = 2
    Subproc = 0
    CurrentTime = time()
    MyType = "ExecuteEvent"
    ...
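The “Job Ad” events are easy to mine with a short script. The sketch below pulls the attributes of every 028 event out of a plain-text event log, assuming the layout shown in the excerpt (one event per block, each block terminated by a line of three dots). The function name parse_job_ad_events is an assumption, and the sample text is the excerpt from the slide.

```python
import re

# First line of an event: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message"
_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+ \S+) (.*)$")

SAMPLE_LOG = """\
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
"""


def parse_job_ad_events(log_text):
    """Return one dict of attributes per 028 (job ad information) event."""
    events = []
    current = None  # dict being filled while inside a 028 event
    for line in log_text.splitlines():
        line = line.strip()
        if line == "...":  # event terminator
            if current is not None:
                events.append(current)
            current = None
            continue
        header = _HEADER.match(line)
        if header:
            current = {} if header.group(1) == "028" else None
            continue
        if current is not None and " = " in line:
            name, value = line.split(" = ", 1)
            current[name] = value.strip('"')  # values are kept as strings
    if current is not None:
        events.append(current)
    return events


if __name__ == "__main__":
    for event in parse_job_ad_events(SAMPLE_LOG):
        print(event["Cluster"], event["Proc"], event.get("JOB_CMSSite"))
```

All values are kept as strings; convert Cluster and Proc to integers if you need to join them with condor_q output.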


Why is my job not starting?


Troubleshooting process

● First question
  ● Do my jobs match any (logical) resource?
● Once you are sure of that
  ● Are there jobs from higher priority users?
  ● Are desired sites just too busy?
  ● Are there problems at desired site(s)?
● If nothing gives a satisfying answer
  ● It may be a glideinWMS misconfiguration, seek help from VO FE admins


How do I know if my jobs match?

● Good question!
  ● Unfortunately, the answer is not trivial
● The FE matching policy is not “public”
  ● And, of course, there are no tools to probe for it
● You will have to rely on the FE admins to “explain” the policy
  ● Hopefully in a human readable format
  ● Hopefully without conversion errors!


An example FE policy

● See the CMS FE talk for an actual high level view
● The actual FE policy is a python expression

    (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and
    ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))

● And then there is the matching HTCondor one

    (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
    ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))

A simple example – could be much more complex
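If the FE admins hand you the Python form of the policy, you can try it against your own job attributes before submitting. The sketch below wraps the example expression from this slide in a function; the name fe_match and the sample job and glidein dictionaries are made up for illustration and do not come from a real frontend.

```python
def fe_match(job, glidein):
    """Evaluate the example FE policy for one job and one glidein entry."""
    return (
        (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(","))
        and ((glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True")
             == (job.get("DESIRES_HTPC") == 1))
    )


# Sample inputs (illustrative values only).
job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}
glidein_plain = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}}
glidein_htpc = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC",
                          "GLIDEIN_Is_HTPC": "True"}}
glidein_elsewhere = {"attrs": {"GLIDEIN_CMSSite": "T2_IT_Pisa"}}

if __name__ == "__main__":
    for name, glidein in (("plain", glidein_plain),
                          ("htpc", glidein_htpc),
                          ("elsewhere", glidein_elsewhere)):
        print(name, fe_match(job, glidein))
```

Note that a job which does not ask for HTPC is rejected by an HTPC entry even at a desired site, because the policy requires both sides to agree on it.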


A word about HTCondor matching

● Once glideins start, you can probe their policy
    condor_status -format '%s' START
● But no tools to help you understand the matchmaking (M.M.)
  ● The closest is condor_q -analyze
    – But only looks at Job requirements
    – So, not really helping when all/most of the policy is in glideins

    $ condor_status -format '%s\n' START
    ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
    ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
    ...


User priorities

● So, jobs should be matching, but are not starting
  ● And there are plenty of matching glideins in the system
● Likely there are other higher-priority jobs in the system
  ● Possibly from a different user
      condor_userprio
  ● Possibly on a different schedd
      condor_status -submitters
● No tools to give you the easy answer
  ● If you need the answer, you will have to investigate

Warning: Slow!


Unclaimed glideins

● If you see plenty of Unclaimed glideins, but no matching jobs from other users
  ● You have either reached the schedd limit MAX_JOBS_RUNNING
  ● Or something bad is going on!
● You can only ask your FE admin for help
  ● But first double check that your jobs should indeed be matching, at least on paper


Supported Sites

● What should you do if there are no (new) glideins coming from an expected site?
● First off, see if the site is even supported by the glideinWMS instance!
● Each Entry has a ClassAd
    condor_status -any -const 'MyType=="glideresource"'
● Look for the attributes your FE is matching on, e.g. GLIDEIN_CMSSite

Site not there? Notify your FE admin!


Is the FE even asking for them?

● You are sure that your jobs should be matching?
  ● But what if you are wrong?
● Check it out
    … -format '%i\n' GlideFactoryMonitorRequestedIdle

But remember it is not just your jobs.


Maybe the site is just busy?

● Glideins have to compete with other Grid jobs at most sites
  ● Maybe the site is just busy?
● Check if glideinWMS has put any glideins in the Grid queues
    … -format '%i\n' GlideFactoryMonitorStatusPending

If you find zeros, notify your FE admin!


Site problems?

● The glideins will validate the worker node before talking to the central manager (C.M.)
  ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it
● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
    … -format '%i\n' GlideFactoryMonitorStatusRunning

If you find a discrepancy, notify your FE admin!
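The three glideresource counters from the last slides can be read together as a small decision procedure. The sketch below is one possible reading of that advice, not an official tool: the function name diagnose_entry, the order of the checks and the wording of the hints are all assumptions, and the inputs are the integers you obtained yourself for one entry plus the number of its glideins you actually see in the pool.

```python
def diagnose_entry(requested_idle, status_pending, status_running, seen_in_pool):
    """Suggest a next step for one entry (hypothetical helper).

    requested_idle : value of GlideFactoryMonitorRequestedIdle
    status_pending : value of GlideFactoryMonitorStatusPending
    status_running : value of GlideFactoryMonitorStatusRunning
    seen_in_pool   : glideins from this entry visible in the central manager
    """
    if requested_idle == 0:
        # The FE is not asking for glideins, so no job matches this entry.
        return "not requested: double check that your jobs match this entry"
    if status_running > seen_in_pool:
        # Glideins run at the site but never reach the pool.
        return "discrepancy: possible site problem, notify your FE admin"
    if status_pending == 0:
        # Requested, but nothing is waiting in the Grid queues.
        return "nothing pending: notify your FE admin"
    return "glideins are queued: the site is probably just busy"


if __name__ == "__main__":
    print(diagnose_entry(0, 0, 0, 0))
    print(diagnose_entry(10, 0, 0, 0))
    print(diagnose_entry(10, 25, 0, 0))
    print(diagnose_entry(10, 5, 40, 2))
```

Remember that a non-zero request may be driven by other users' jobs, and that a few glideins always lag behind while they start up, so treat the hints as a starting point only.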


Still no clue?

● If all your detective work fails
  ● Notify your VO FE admin
● They have access to information you don't


Why do my jobs take forever to finish?


My jobs are running, but...

● Great, your jobs are happily running
  ● But you are getting no results back!
  ● i.e. the jobs are not finishing in the expected time
● Two main likely reasons
  ● They are being restarted
  ● You miscalculated the needed time


Jobs re-starting

● HTCondor tries to be user friendly
  ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
  ● And will not ever give up! (by default)
● You can easily check how many times it started
    condor_q -format '%i\n' NumJobStarts
● You may want to cap the number with periodic_hold/remove

http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
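When you have many jobs, it helps to summarize the restart counts instead of reading them one by one. The sketch below takes the text printed by the condor_q command above (one integer per line) and counts the jobs that started more often than a threshold; the function name count_restarted and the default threshold of 3 are arbitrary choices for this illustration.

```python
def count_restarted(condor_q_output, max_starts=3):
    """Count jobs whose NumJobStarts is larger than max_starts."""
    # Input is the output of:
    #   condor_q -format '%i\n' NumJobStarts
    # i.e. one integer per line, one line per job that has the attribute.
    starts = [int(token) for token in condor_q_output.split()]
    return sum(1 for n in starts if n > max_starts)


if __name__ == "__main__":
    sample_output = "0\n1\n7\n4\n"  # made-up sample, four jobs
    print(count_restarted(sample_output))
```

A non-zero result is a hint that a periodic_hold or periodic_remove expression on NumJobStarts may be worth adding to your submit file.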


Why is it restarting?

● OK, I now know it is restarting... but why?
  ● Most likely, the glidein was killed
● Was it due to your job “misbehaving”?
● Most Grid sites have limits on resource use
  ● Including CPU, memory and disk
  ● If you exceed them, the glidein (and you) will be killed
● Glideins should be configured to detect and hold/remove your job if you “misbehave”
  ● Thus you would not be re-started
  ● If you see many restarts, notify your FE admin

Likely there is a policy rule missing


What is my job doing?

● What if it is not restarting... just running forever (or until hitting the time limit)
● HTCondor allows for peeking at a running job
  ● A cmdline tool called condor_ssh_to_job
  ● Unfortunately, needs implicit permission from site
    – And about half of the sites don't allow it


The End


Pointers

● glideinWMS Home Page
    http://tinyurl.com/glideinWMS
● HTCondor Home Page
    http://research.cs.wisc.edu/htcondor/
● HTCondor support
    htcondor-users@cs.wisc.edu
    htcondor-admin@cs.wisc.edu
● glideinWMS support
    glideinwms-support@fnal.gov


Acknowledgments

● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system
