linux clusters institute: hpc user support · hpc user expectations, categorization and...
TRANSCRIPT
1
Linux Clusters Institute:HPC User Support
Fang (Cherry) Liu, PhD [email protected]
Wesley Emeneker, PhD [email protected]
Mehmet Belgin, PhD [email protected]
A Partnership for an Advanced Computing Environment (PACE)
OIT/ART, Georgia Tech
4-8 August 2014 2
Targets for this session
Target audience:
IT professionals with little or no experience with supporting HPC users
Points of interest:
• differences between HPC and conventional IT
• HPC user categories and differences in their support
• human aspect of HPC support (i.e. politics, conflicts)
• problems common to most institutions/centers supporting HPC
• exchange different approaches/solutions to common problems
• HPC education and training
• lessons we learned at GT
2
4-8 August 2014 3
Overview
Part I: HPC user expectations, categorization and commonalities
Part II: Policies, Politics, Conflicts and Personality Management
Part III: Outreach and Education
Part IV: Lessons learned at GT
4-8 August 2014 4
Differences of HPC from conventional IT
• application performance as the primary target
• usually relies on conventional IT services (by a separate team)
• more focus on supporting end-users than services
• uses common IT technologies in uncommon ways
• requires specific middleware and software layers
• requires code compilations using complicated mechanisms
• may require specific knowledge about the application/science
• has irregular usage patterns (maybe not so different than IT?)
3
4-8 August 2014 5
PART I
HPC user expectations, categorization and commonalities
4-8 August 2014 6
HPC user expectations
Faculty (a.k.a PI) (owner of resources, but not active users):
• Their students and collaborators have everything they need to get the work done (and on time)
• Maximum availability of resources
• Minimum communication with HPC support staff
• Regular status reports
4
4-8 August 2014 7
HPC user expectations
Students/Collaborators (or computationally active PIs) :
• ultra-fast learning curve
• simple and instant solutions to complex problems
• maximum communication with HPC support staff
• simulations running faster than their laptops (not always possible!)
• help with diagnosing problems that are NOT related to systems
• an "insider friend" in the HPC support staff
• answers that match their level of knowledge
4-8 August 2014 8
HPC User Categories
• Three coarse categories:• Novice
• Intermediate
• Advanced
• Difficult to identify a user's category without any prior interaction
• The language used in requests is a good indicator
• Replies to follow-up questions also reveal their level of proficiency
• In case of uncertainty, assume "novice"
5
4-8 August 2014 9
Category 1: Novice Users
Common Points:
• 75-80% of the support requests
• no/little Linux skills
• no/little experience with running the domain specific packages
• no/little understanding of the scientific fundamentals behind the packages
• mostly identical or similar requests with straightforward solutions
• usually not aware of the standard help channels
• may ask the impossible
• may type the examples in the help documents literally
• may feel insecure or apologetic when seeking for help
4-8 August 2014 10
Category 1: Novice Users
Common Needs:
• Cluster orientation
• Linux 101
• an email list
• an easy text editor (nano?)
• help with configuring their MS Windows/OSX systems
• location of existing software
• installation of new software
• help with tools to move data in/out
• help with the very first job submission script
6
4-8 August 2014 11
Category 1: Novice Users
Common approaches for effective support:
• do everything to build mutual trust
• hold regular orientation sessions and help desks
• maintain up-to-date FAQ/help with screenshots
• provide links to existing help locations
• suggest proper web search terms
• make them feel better about their simple (or sometimes stupid) questions
• explain all the steps for resolution in simple, replicable terms
• prefer exact list of commands to general/conceptual answers
• be very patient and polite!
4-8 August 2014 12
Category 2: Intermediate Users
Common Points:
• 10-25% of the support requests
• largest portion of the compute activity on the cluster
• experience with clusters in the same or other institutions
• first to notice and report system problems
• a hybrid mix of straightforward and complex questions
• advanced and multi-step scientific workflows
• aware of the standard help channels
• suggest solutions to their own problems and may not like what you did
• act as the local technical expert and often train novice users in their group
7
4-8 August 2014 13
Category 2: Intermediate Users
Common Needs:
• advanced (and group-specific) information sessions
• well-explained effective solutions
• more performance/efficiency from already running codes
• specific modules/patches/versions for existing software
• higher level of control on their jobs
• access to specialized computational resources
• configurations that may conflict with system defaults
• code development/debugging/profiling support
• data/statistics for the resolution of conflicts with other users
4-8 August 2014 14
Category 2: Intermediate Users
Common approaches for effective support:
• do everything to build mutual trust
• hold advanced classes to "teach how to fish"
• schedule one-on-one meetings
• add exceptional/advanced cases to existing FAQ/help pages
• present solid data/evidence instead of speculation
• admit to speculation if it is inevitable
• show complete transparency; they can separate 'excuses' from 'facts'
• get help from vendor support and user forums, keeping users CC'ed
• be very patient and polite!
8
4-8 August 2014 15
Category 3: Advanced Users
Common Points:
• experience with and access to multiple clusters
• only a small fraction of support requests
• inclination for bypassing the ticket system
• usually complex problems with long resolution time
• try to fix problems themselves, and see HPC support as a last resort (i.e. when it's too late)
• usually on the extremes; either hostile or extremely collaborative
• too busy or advanced to act as the local expert for their group
• have complex to incomprehensible workflows
• usually acknowledge challenging problems, open to workarounds
• suggest improvements on the systems (hardware and software) and provide useful feedback
• open to experimentation with new systems and software
• find bugs in libraries and applications
4-8 August 2014 16
Category 3: Advanced Users
Common Needs:
• VIP treatment
• direct and open communication channels
• social contact
• acknowledgement of their level of knowledge and intelligence
• high-level and direct vendor/developer support
• lots of exceptions, even though they require violation of existing policies
• almost everything else listed under "common intermediate users needs"
• root password (the answer is still no)
9
4-8 August 2014 17
Category 3: Advanced Users
Common approaches for effective support:
• do everything to build mutual trust
• schedule one-on-one meetings
• try to learn more about their research, deadlines and aspirations
• be very careful saying that something is "impossible"
• make small exceptions as long as it does not impact other users
• avoid speculation as much as possible (as with all users)
• be completely transparent, they can easily separate 'excuses' from 'facts'.
• encourage them to contact vendor support or user forums
• Be very patient and polite!
4-8 August 2014 18
How to Build TrustRef: http://www.myspeedoftrust.com/How-The-Speed-of-Trust-works/book
According to Stephen Covey, building trust requires character and competency.
• Character:
• Integrity
• Intent
• Competency:
• Capabilities
• Results
We agree.
10
4-8 August 2014 19
PART II
Policies, Politics, Conflicts and Personality Management
4-8 August 2014 20
Policies
• clear policies help keep user demands under control
• publish policies in places easy to find (online)
• be prepared to explain the reasoning behind each policy item
• make policies as strict as possible, be open to exceptions when necessary
• encourage users to openly discuss and criticize the policies
• don't hesitate updating policies frequently to stay relevant
• build trust and effective communication with decision makers
• seek delegation privileges to speed things up
• don't make policies for resources you don't own, but "influence" them
11
4-8 August 2014 21
Politics and Conflicts
• Tricky but inevitable
• No magic formula, needs case-specific creative solutions
• Biggest challenge: conflicts due to limited resources• configure systems to exactly match policies
• collect and store data for past and present usage
• provide users with tools to browse data/statistics for their accounts
• run regular audits to defuse problems before they explode
4-8 August 2014 22
Tiers of Conflict
• Internal to a group/department: usually easier to solve with communication and gentlemen's agreement
• Between groups/department: can get messy quick. Use "wall of shame"
• Between users and HPC support staff: Have clear policies handy as a basis for declining impossible requests, and keep solid statistics/data as evidence
12
4-8 August 2014 23
Personality Management
• some users are difficult than others; why they behave that way is irrelevant
• do not take anything personal; report any harassment you may receive and do not retaliate
• in most cases users do not mean bad, but they are extremely frustrated
• if your mistake caused frustration, take responsibility and offer an apology
• show empathy and demonstrate sincere intention for resolution
• acknowledge that:
• you understand the problem
• you are aware of its particular impact on the user
• be aware of, and show tolerance for cultural differences and language difficulties
• humor is powerful only when used appropriately, avoid being awkward or insulting
• do not wait until having a resolution, respond immediately to inform that you started working on the problem, and provide frequent updates
4-8 August 2014 24
PART III
Outreach and Education
13
4-8 August 2014 25
Trainings and Tutorials
• Novice Users:• how to ask for help
• usage limitations and best practices
• basic Linux usage
• basic cluster access, concepts, scheduling system, software
• troubleshooting job related problems
• Intermediate Users• debugging/optimization of codes (incl. parallel)
• system architecture specific details
• advanced use of common tools (scientific python, parallel matlab)
4-8 August 2014 26
Group Consultations
• mini-orientations for newly joining groups
• departmental meetings to provide feedback for resolution of internal conflicts
• resolution of technical problems that are specific to a group
• technical feedback to assist in policy making and system purchases
• introduction of services to new groups with interest in getting resources
14
4-8 August 2014 27
Grant Writing Help
Level 1: Answer questions (e.g. hardware specs, software licenses)
Level 2: Contribute facilities document, budget, letters of supports
Level 3: Writing and revising portions of the grant
Level 4: Initiate new grants to get more resources
4-8 August 2014 28
Collaborations with Researchers and Vendors
• research scientists helping research scientists
• crucial for staying relevant
• collaborative grant writing
• collaborative projects/papers (in acknowledgements or as authors)
• support for classes and workshops
• developer/vendor collaborations• Bug tracking and fixes
• Hardware/software feedback, evaluation of new systems and technology
• Pilot studies
15
4-8 August 2014 29
PART IV
Lessons learned at GT
4-8 August 2014 30
PACE: A Partnership for an Advanced Computing Environment
• A High Performance Computing Center at Georgia Tech
• provides the necessary bridge between fast computers and brilliant humans that enables a powerful combination of the two
• is a partnership between Georgia Tech faculty and the Office of Information Technology focused on HPC
• provides HPC environment for GaTech students, researchers and faculty
• total computing capacity includes more than 1,200 compute nodes with total about 30k cores
• total user community has more than 1,200 active users
• 11 staff members and 5 student assistants
16
4-8 August 2014 31
User support in PACE
• User interactions: user request system, monthly portable helpdesk event, online self-support system (in progress)
• Administrative: quarterly maintenance, software upgrades, queue changes, etc.
• Static information: website FAQ, User Forum, polices(software install, storage, new user account, etc.)
• Outreach: HPC entry level classes, Summer classes, orientation, hardware grant proposals
4-8 August 2014 32
Virtual Teams
• Team organization helps on dividing the tasks and use of the students to offload some of daily routine tasks
• Storage team is in charge of storage hardware to provide reliable access through head/compute nodes. Lots of problems may raise from storage issues
• Software team takes care of all system scientific libraries, user applications, software modules, etc. Besides installing the software, knowledge on how to use software is needed
• Operations team covers all hardware installation, maintenance and repair
• User support team acts as an interface between user community and PACE center, takes care of support request triage, scientific consultation, etc.
17
4-8 August 2014 33
Virtual Team Organization
Operations
Storage Team
Software Team
User Support Team
Request Tracking System
4-8 August 2014 34
Use a request tracking system
• Provides an incident tracking system for HPC and other customers on campus• Easy-to-use, streamlined interface
• Regular reminders for unresolved incidents
• Time tracking and reporting functions
• Customizable email notices
• Excellent customer communications tool
• identification of emerging problem patterns
• used as a knowledge base
• Improves user satisfaction and increases productivity to help PACE deliver more value to education and research
18
4-8 August 2014 35
User Support Team
• Acts as entry point for incoming support requests
• Answer the straight-forward questions
• Manually distribute remaining requests to subject matter experts
• Categorize the requests
• Keep track of request assignments
• Internal coordinating/information collection to communicate with user community
• User usage analysis, gathering user experience
• User education and classes
Ticket Categories
• Queue Issues (Q)
• provide access to the resources
• cancel jobs stuck in the queue
• add nodes online/offline
• queue scheduler not responding
• long waiting in the queue
• exam the jobs
• Applications (A)
• compiling, using, debugging, optimizing the user codes
• system level software installation, create modules
• give access to the restricted applications (e.g. VASP and Gaussian)
• system library/user application usage
• system wide script support
4-8 August 2014 36
19
4-8 August 2014 37
Ticket Categories (Cont.)
• General Usage (G)
• VPN issues
• login issue, slow response
• scheduling meetings
• file permission issues
• system issue reporting
• software license issue
• FAQ (F)
• PBS scripts
• modules usage
• resource checking/use
• general usage on job submission, cancellation, status check, etc.
4-8 August 2014 38
Ticket Categories (Cont.)
• Storage (S)
• inability to access the storage
• file moving between local workstations and PACE
• retrieving the accidentally deleted files
• modify the disk quota and permission
• Account (C)
• user account creation/retiring
• Hardware Repairs (H)
• disk/memory/fans/networks, etc.
20
Ticket Analysis (July 2013 to June 2014)
• Manually recording the user request since July 2013
• Total 223 days recorded
• Total number of request is 1834
4-8 August 2014 39
4-8 August 2014 40
Total Monthly Ticket Trend (July 2013 - June 2014)
0
50
100
150
200
250
July 2013 to June 2014 Monthly Ticket Trend
21
4-8 August 2014 41
Tickets Type Trends (July 2013 - June 2014)
Queue18%
Storage15%
General30%
Application18%
Hardware7%
Account creation7%
FAQ5%
Ticket Catagory Percentage
4-8 August 2014 42
Daily Ticket trends (July 2013 - June 2014)
0
5
10
15
20
25
30
35
7/17/13 8/17/13 9/17/13 10/17/13 11/17/13 12/17/13 1/17/14 2/17/14 3/17/14 4/17/14 5/17/14 6/17/14
July 2013 to June 2014 Daily Ticket Trend
Ticket Counts Average Ticket Counts per month
22
4-8 August 2014 43
Quarterly Maintenance
• Set regular, strict dates, advance announcement
• Specify primary and bonus goals, announce them beforehand
• Predefined "worst case" downtime
• Provide a summary of completed tasks after maintenance
• Plan ahead in details:• Team member / task associations• Estimated task duration• Critical paths and B plans
• Prepare to have unforeseen problems during and after the maintenance days
• Show best effort for minimal impact• configure the scheduler to have no running jobs• Disable user access to resources during the maintenance activities
4-8 August 2014 44
PACE Training and Tutorials
Currently offered classes:• Cluster orientation (once every 2 weeks)
• Linux 101 (on demand, but will be a regular class soon)
• Introduction to Parallel Programming with MPI and OpenMP (summer course)
• Introduction to Parallel Application Debugging and Profiling (summer course)
• A Quick Introduction To Python (summer course)
• Python For Scientific Computing (summer course)
• Python For Data Analysis and Visualization (summer course)
• Data Analysis and Visualization with MATLAB (by vendors, on-demand)
• Handling Large Data Sets in MATLAB (by vendors, on-demand)
23
4-8 August 2014 45
Portable Help Desks
• 2-hour help-desks in common circulation areas in departments
• regular (~once monthly) and also on-demand
• usually 2-3 of the team members
• usually 5-10 users stop by
• improves visibility of HPC support services
• good for socializing
• attract shy, novice users and encourage them to ask their simple questions
• attract prospective researchers interested in joining the cluster
• defuse frustration caused by unreported problems
4-8 August 2014 46
Questions and Feedback