ORNL is managed by UT-Battelle
for the US Department of Energy
Robinhood
Operational Preparation for Large-Scale Deployment
Overview of OLCF / NCCS
• National Center for Computational Sciences
– Focus on at-scale HPC challenges
– Support for projects like SNS, NCRC
• Oak Ridge Leadership Computing Facility
– Largest project of NCCS
– Home of Titan/Atlas. Future home of Summit/Alpine
Overview of Robinhood
• Policy Engine for POSIX file systems
• Extra hooks for Lustre
• Allows for near-real-time file system information
Making Robinhood fit OLCF Production Standards
Reproducible Builds
• Using GitLab Runners
– Binary that can execute builds as part of a CI pipeline
– Settings -> General -> Enable pipelines
– Settings -> Pipelines
Building Lustre
• Current setup:
– GitLab runner runs as bot build user on storage-util1 node
– Build script checks out copy of Lustre repo
– Uses current build system to create Lustre RPMs
– Stores them in staging area for manual signing/approval
– Only for Robinhood testing currently
• Future setup:
– “Bring your own build host”
– Using runner “tags”
Building Robinhood
• Similar setup to Lustre
– Lustre RPMs are installed manually
– Kick off pipeline build
– Robinhood RPMs are built against installed Lustre client
– RPMs are placed in staging area for testing/signing/deployment/installation
Puppet Setup
• NCCS uses Puppet’s role and profile design workflow
• https://docs.puppet.com/pe/2017.2/r_n_p_full_example.html
• No current module on Puppet Forge
• WIP robinhood module
Puppet Robinhood Module
Basic 1-to-1 setup between Robinhood config options and Puppet parameters
Testing Environment
Testing Setup
• Tested against older hardware
• Used AtlasTDS file system
• Partition of NetApp E5500 with 48x 900GB 10k SAS drives, over 6G SAS
Testing Hardware
• Storage-util1 Node:
– Dell PowerEdge R620
– 2x Intel® Xeon® CPU E5-2640 @ 2.50GHz
– 16x 16GB DIMM DDR3 1333 MHz
– Hyperthreading Disabled
– Diskless provisioning
MariaDB tuning
• Mostly the same settings as recommended by Robinhood’s starting page
• innodb_additional_mem_pool_size setting is not used in 10.3
• For stock RHEL installs, the log_slow_queries and associated tunings (long_query_time and log-queries-not-using-indexes) can show if the database is a bottleneck
Robinhood Tuning
• Set nb_threads to twice the number of physical cores
• Changed max_pending_operations from 10000 to 200000
• Set nb_threads_scan to twice the number of physical cores. This may be too many
• Changed queue_max_size to 10000 (from 1000) and queue_max_age from 5s to 10s
• Trade-off between consistency/recovery-time and speed
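The tuning values above can be collected into a small helper. This is a minimal sketch, not the presenters' tooling: the function name `suggest_robinhood_tuning` and the example core count of 12 are assumptions made for illustration, while the parameter names and values come from the bullets above.

```python
def suggest_robinhood_tuning(physical_cores: int) -> dict:
    """Return the tuning values listed on the slide for a given core count."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return {
        # Twice the number of physical cores, as on the slide.
        "nb_threads": 2 * physical_cores,
        # Same rule for scan threads; the slide notes this may be too many.
        "nb_threads_scan": 2 * physical_cores,
        "max_pending_operations": 200000,
        "queue_max_size": 10000,
        "queue_max_age": "10s",
    }


if __name__ == "__main__":
    # 12 is a hypothetical core count used only for this example.
    for name, value in suggest_robinhood_tuning(12).items():
        print(f"{name} = {value}")
```

Larger queue sizes and ages improve throughput but leave more unapplied work to recover after a failure, which is the trade-off the last bullet refers to.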
Disk Utilization
Bottlenecks?
• File system backend limited metadata performance
• Under certain metadata-intensive workloads:
– Not really an easy solution
– Mentioned in https://jira.hpdd.intel.com/browse/LU-8047
Issues with RHEL7
• Stock mariadb
• Systemd ulimit settings
Testing Summary
• Current testing hardware can only process so quickly – we appear to have hit this limit
• Moved the bottleneck towards Lustre
• GET_FID is typically the highest-latency command
• Bursts of metadata traffic cause spikes of “Wait”-state commands; in our testing, shifts between GET_INFO_DB, DB_APPLY and CHGLOG_CLR
Daemon vs. One-shot
• Split use-case
• Daemon:
– File system scanning
– Changelog consumption
– RBH_OPT="--readlog --scan"
• One-shot (“manual” process / cronjob):
– Policy application (e.g., purging)
Comparison to Existing Tools
PCircle
• Suite of file system tools for parallel data copying, checksumming, and profiling
• Currently used for ~weekly file system profiling
• Includes directory count, sym/hard link counts, file count, average file size, max files within a directory, among other statistics
• Reports file size histograms, and top files (by size)
• https://github.com/olcf/pcircle
fprof
• Able to reproduce fprof-like reporting by setting up fileclass buckets
• Built-in reports like ‘top x’ files/directories provide similar functionality
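The idea behind size "buckets" can be illustrated outside Robinhood. The sketch below is a plain Python histogram, not Robinhood fileclass syntax; the bucket boundaries (4 KiB, 1 MiB, 1 GiB) and all names are hypothetical choices for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries in bytes: 4 KiB, 1 MiB, 1 GiB.
BUCKET_EDGES = [4 * 1024, 1024 ** 2, 1024 ** 3]
BUCKET_LABELS = ["<4KiB", "4KiB-1MiB", "1MiB-1GiB", ">=1GiB"]


def bucket_for(size_bytes: int) -> str:
    """Map a file size to its bucket label; a size equal to an edge goes up."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, size_bytes)]


def size_histogram(sizes):
    """Count files per bucket, keeping empty buckets in the result."""
    counts = Counter(bucket_for(size) for size in sizes)
    return {label: counts.get(label, 0) for label in BUCKET_LABELS}


if __name__ == "__main__":
    print(size_histogram([0, 4096, 5 * 1024 ** 3]))
```

Each bucket would correspond to one fileclass defined by a size range, so a per-class report yields the same kind of histogram.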
Output: rbh-report --class-info
LustreDU
• Provides directory-level usage for users/projects
• Populated by:
– Parsing Lester output
– Contacting inode query daemons running on OSS nodes
– Populating/updating MySQL database
• Only updated daily
• Issues running as privileged user
rbh-du output
• Provides a quick du option
• Potentially provide a smart wrapper for users that use du vs rbh-du based on file path
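A wrapper of this kind could decide by path prefix. The following is a minimal sketch under stated assumptions: the mount point `/lustre/atlas` and the names `ROBINHOOD_ROOTS` and `choose_du_command` are hypothetical, and the function only picks the command name rather than running it.

```python
from pathlib import PurePosixPath

# Hypothetical mount points whose contents are tracked by Robinhood.
ROBINHOOD_ROOTS = ("/lustre/atlas",)


def choose_du_command(path: str, roots=ROBINHOOD_ROOTS) -> str:
    """Return "rbh-du" for paths under a tracked root, otherwise "du".

    The path is compared component by component, so "/lustre/atlas2" does
    not match the root "/lustre/atlas". Relative paths are not resolved.
    """
    target = PurePosixPath(path)
    for root in roots:
        root_path = PurePosixPath(root)
        if target == root_path or root_path in target.parents:
            return "rbh-du"
    return "du"


if __name__ == "__main__":
    for example in ("/lustre/atlas/proj/data", "/home/user"):
        print(example, "->", choose_du_command(example))
```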
Purging Policies
• Non-Robinhood workflow:
– User submits request
– RUC approval
– UAO team member enters exemption into RATS
– Purge config is generated using those exemptions
• Robinhood pieces still WIP
• Example:
Purging – Integration with Robinhood
• Want to keep same workflow for users and other groups
• Current thoughts:
– Pull list of purge exemptions
– Generate purge configuration file using multiple “tree” statements in a cleanup rule
– Run Robinhood with --once with that policy
– Log and remove configuration
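The generation step might look like the sketch below. It is an assumption-laden illustration: the `cleanup_rules` and `ignore` block layout only approximates Robinhood's configuration syntax and should be checked against the documentation for the installed version, and the function name and example paths are hypothetical.

```python
def render_purge_config(exempt_paths):
    """Render a rules block with one tree statement per exempted path.

    The output layout approximates Robinhood configuration syntax; it is
    an assumption and has not been validated against a real installation.
    """
    paths = sorted(set(exempt_paths))  # de-duplicate, stable order for diffs
    for path in paths:
        if not path.startswith("/") or '"' in path:
            raise ValueError(f"unsupported exemption path: {path!r}")
    if not paths:
        return "cleanup_rules {\n}\n"
    trees = "\n        or ".join(f'tree == "{p}"' for p in paths)
    return (
        "cleanup_rules {\n"
        "    ignore {\n"
        f"        {trees}\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical exemption paths, as they might be pulled from RATS.
    print(render_purge_config(["/lustre/atlas/proj/b", "/lustre/atlas/proj/a"]))
```

Sorting and de-duplicating the paths keeps the generated file deterministic, which makes the logged copy of each run easy to compare with the previous one.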
Future Work
Hardware upgrades
• Transition to using similar setup to current MDS nodes
• Single socket, faster clock speed
• SSD / NVMe storage target
Clustering
• Move processes to multiple nodes
– Multiple physical nodes vs namespaced mounts / VMs
• Set up a MariaDB/MySQL cluster
– Millions of SQL statements per second
– https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/
CEA’s Lustre Changelogs Aggregate & Publish (lcap) integration
• Ability to support multiple changelog readers
• Redirect a copy of the changelog to our Kafka instances while still using a single reader
Lustre Jobstats
Jobstats Integration
Jobstats Integration - continued
• Database schema changes
– Add new columns to database: creation_job, last_access_job, last_mod_job, and last_mdchange_job
– Parse job_id (semi-support exists currently) to populate these fields
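The schema change can be sketched as an idempotent migration. This example uses Python's built-in `sqlite3` purely as a self-contained stand-in for MariaDB/MySQL; the table name `ENTRIES`, the `TEXT` column type, and the function names are assumptions, and only the four column names come from the slide.

```python
import sqlite3

# Column names taken from the slide.
JOB_COLUMNS = ("creation_job", "last_access_job", "last_mod_job", "last_mdchange_job")


def add_job_columns(conn, table="ENTRIES"):
    """Add any missing job columns to the table; safe to run repeatedly."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for column in JOB_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")
    conn.commit()


def demo_columns():
    """Create a stand-in table, migrate it twice, and return its columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ENTRIES (id TEXT PRIMARY KEY)")
    add_job_columns(conn)
    add_job_columns(conn)  # second call finds the columns and changes nothing
    columns = [row[1] for row in conn.execute("PRAGMA table_info(ENTRIES)")]
    conn.close()
    return columns


if __name__ == "__main__":
    print(demo_columns())
```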
Jobstats Integration – potential wins
• File system usage heuristics
• Security triggers / auditing
• File-level history
References
• https://github.com/cea-hpc/robinhood/wiki/Documenation
• https://dev.mysql.com/doc
• https://mariadb.com
• https://github.com/fwang2/ioutils
• https://cug.org/proceedings/cug2014_proceedings/includes/files/pap157.pdf
• https://gitlab.com/gitlab-org/gitlab-ci-multi-runner
• http://wiki.lustre.org/images/0/02/LUG-2011-Aurelien_Degremont-Robinhood_Quick_Tour.pdf
• https://github.com/cea-hpc/lcap
• http://syst.univ-brest.fr/per3s/wp-content/uploads/2017/02/robinhood-Per3S.pdf
• https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml