A Puppet Infrastructure at CERN
Steve Traylen, CERN IT Department
Puppet Camp, Geneva, CH.
11 July 2012
Outline
• CERN and Computing for High Energy Physics
• Today’s CERN IT Deployment
  – Why and what’s changing
• Adoption of Puppet, Foreman, …
  – Progress, integration
  – Difficulties
  – Future
Puppet Camp Geneva - CERN
CERN
• Conseil Européen pour la Recherche Nucléaire
• Also known as the European Laboratory for Particle Physics
• Facilities for fundamental research
• Between Geneva and the Jura mountains, straddling the Swiss-French border
• Founded in 1954
The Large Hadron Collider
• Accelerator colliding protons against protons
  – 14 TeV collision energy
• By far the world’s most powerful accelerator
• Tunnel of 27 km circumference, 4 m diameter, 50…150 m below ground
• Detectors at four collision points
The LHC Computing Challenge
• Data volume
  → 15 petabytes of new data each year
• Global compute power
  → 250k CPU cores
  → 100 PB of disk storage
• Worldwide analysis & funding
• Distributed computing infrastructure to provide the production and analysis environments for the LHC experiments
• Managed and operated by a worldwide collaboration between the experiments and the participating computer centres
• Distributed for funding and sociological reasons
Motivation to Change Tools
• CERN data centre is reaching its limits:
  – IT staff numbers remain fixed
  – More computing capacity is needed
• Inefficiencies exist, but the root cause cannot easily be identified
  – Tools are becoming increasingly brittle and difficult to adapt
    • E.g. porting the tools to IPv6 would need a development project
  – Some core components cannot be scaled up
Second CERN Data Centre
• Wigner Institute in Budapest, Hungary
• Hands-off facility, hardware support only
• Deploying 2012 to 2014
Infrastructure Tools Evolution
• We had to develop our own toolset in 2002
  – “Extremely Large Fabric Management System” or http://cern.ch/ELFms
  – Included Quattor for configuration
• Nowadays,
  – CERN compute capacity is no longer leading edge
  – Many options are available for open source fabric management
  – We need to scale to meet the upcoming capacity increase
• If there is a requirement which is not available through an open source tool, we should question the need
  – If we are the first to need it, contribute it back to the open source tool
Infrastructure as a Service
• Goals
  – Improve repair processes with virtualisation
  – More efficient use of our hardware
  – Better tracking of usage
  – Enable remote management for the new data centre
  – Support potential new use cases, e.g. Cloud
  – Sustainable support model
• At scale for 2015
  – 15,000 servers
  – 90% of hardware virtualized
  – 300,000 VMs needed
• Plan = OpenStack Adoption
Chose Puppet for Configuration
• The tool space has exploded in the last few years
  – In configuration management and ops
  – Large, shared ‘tool forges’, and lots of experience
• Puppet and Chef are the clear leaders for the ‘core’ tool
• Many large-scale enterprises use Puppet
  – Its declarative approach fits better with what we are used to in Quattor
  – Large installations: friendly, wide-base community and commercial support and training
  – You can buy books on it
  – You can employ people who know Puppet better than you do
Deployed System
Starting with Puppet
• Puppet was and is trivial to set up:
  – Anyone can do it in a day
• Configuring something with Puppet is easy
• What’s hard:
  – Deciding module scope and how modules interact with one another
    • Three modules editing grub.conf, or one?
  – We started in early 2012 with very little plan in the area of module organization
Downloading Puppet Modules
• Expectation at start: all done for us
  – ssh, iptables, sysctl, apache, mysql all done
  – example42 or similar can do everything
• Reality
  – Modules are often not quite correct
    • Too simple
      – e.g. I want my sshd_config to be different in two places
    • Too much abstraction
      – I want to use Puppet, not some abstraction of hundreds of variables covering every possible case
        » e.g. Puppet with(out) Passenger; I only want one
  – Parameterized classes and Foreman don’t really work together
    • Resulting modules are not shareable
  – ENC globals vs params
Sharing and Fixing Modules
• Not as easy as it should be:
  – Our modules are littered with CERNisms
    • NTP servers, subnets, authorization systems, ..
    • Adaptation to work with Foreman
    • All of us learning Puppet and doing things quickly (badly)
• Hiera is being used now:
  – Provides the code vs data separation we had with Quattor
  – Dozens of ways to set up and (ab)use Hiera
  – Little experience with this anywhere yet
  – Hiera should make modules more shareable across sites
    • Looking forward to it becoming the standard thing that modules use and everyone benefits from
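The code vs data separation mentioned above can be illustrated with a small hierarchical lookup. This is a minimal Python sketch of the idea, loosely modelled on a first-match lookup through an ordered hierarchy; it is not Hiera itself, and the hierarchy levels, keys and hostnames are hypothetical examples rather than CERN's actual data.

```python
# Minimal sketch of hierarchical data lookup: module code asks for a key,
# and site-specific values live in data files outside the module.
# All level names, keys and hostnames below are hypothetical.

from typing import Any, Mapping, Sequence

SITE_DATA: dict[str, dict[str, Any]] = {
    "hostgroup/batch": {"ntp_servers": ["ntp-batch.example.org"]},
    "site/example": {
        "ntp_servers": ["ntp1.example.org", "ntp2.example.org"],
        "ssh_permit_root": False,
    },
    "common": {"ntp_servers": ["pool.ntp.org"], "ssh_permit_root": True},
}


def lookup(key: str,
           hierarchy: Sequence[str],
           data: Mapping[str, Mapping[str, Any]] = SITE_DATA,
           default: Any = None) -> Any:
    """Return the value from the first hierarchy level that defines key."""
    for level in hierarchy:
        level_data = data.get(level, {})
        if key in level_data:
            return level_data[key]
    return default


if __name__ == "__main__":
    batch_node = ["hostgroup/batch", "site/example", "common"]
    other_node = ["site/example", "common"]
    print(lookup("ntp_servers", batch_node))      # most specific level wins
    print(lookup("ntp_servers", other_node))      # falls back to site level
    print(lookup("ssh_permit_root", batch_node))  # defined at site level
```

A module written against `lookup("ntp_servers", ...)` carries no site-specific values, which is what would make it shareable across sites.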
Sharing Modules With All
• A big aim is to share our modules as much as possible with everyone, but in particular:
  – CERN IT is not the only Puppet deployment at CERN
    • ATLAS Point 1 farm at CERN runs Puppet
  – ATLAS analysis in the cloud has used Puppet
  – International HEP labs use or are switching to Puppet
  – Puppet was the “winner” at the recent CHEP fabric session
    • Presentations from CERN, BNL, PIC, ATLAS
• We will share here, but it’s early days:
  – http://github.com/cernops
Organizing Modules On Disk
• Started with all modules in one directory in git:
  – Obviously wrong, great confusion for newcomers
• Current situation, two directories in git:
  – Modules: reusable items, e.g. firewall, apache, sysctl, ..
  – Manifests: top-level services, e.g. batch machine, public login machine
• Future plans:
  – Split up modules into local and downloaded
    • Modules like puppetlabs-firewall are mixed with our own junk
    • Will allow us to track and contribute to upstream better
  – In line with Puppet’s upcoming vendor path
Configuration Complexity
• We have many service configurations
  – Puppet handles this diversity well
• We have many administrators, >= 300
  – These admins change and are on different continents
  – Less obvious what to do with Puppet
150 clusters ranging from 1 to 3000 hosts.
Trust Amongst SysAdmins
[Diagram: one Git repository; Puppet master(s) for SysAdmin Team A and for SysAdmin Team B; Team A’s nodes and Team B’s nodes.]
• All share one git repository
  – Rely on code review, git branches and environments
• Teams use their own Puppet masters
  – hiera-gpg key for each team
  – Host ACL on Puppet masters
• The full implications of this lack of trust between admins are unclear
  – Interested to hear what others have done
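The host ACL idea above, where each team's Puppet masters only serve that team's nodes, can be sketched as a simple pattern check. This is a conceptual Python illustration only: it is not Puppet's own ACL configuration syntax, and the team names and hostname patterns are hypothetical.

```python
# Conceptual sketch of a per-team host ACL: a team's Puppet master(s)
# should only answer nodes whose hostname matches that team's patterns.
# Team names and patterns are hypothetical, not CERN's real layout.

from fnmatch import fnmatchcase
from typing import Mapping, Sequence

TEAM_HOST_ACLS: dict[str, list[str]] = {
    "team_a": ["batch*.example.org"],
    "team_b": ["login*.example.org", "db-?.example.org"],
}


def master_may_serve(team: str,
                     hostname: str,
                     acls: Mapping[str, Sequence[str]] = TEAM_HOST_ACLS) -> bool:
    """True if the given team's masters are allowed to serve this host."""
    host = hostname.lower()
    return any(fnmatchcase(host, pattern) for pattern in acls.get(team, []))


if __name__ == "__main__":
    print(master_may_serve("team_a", "batch001.example.org"))
    print(master_may_serve("team_a", "login01.example.org"))
    print(master_may_serve("team_b", "db-1.example.org"))
```

An unknown team has no patterns and is therefore denied by default, which seems the safer choice when admins do not fully trust each other.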
Change Control, Dev Cycle
• Core team maintaining OS and basics:
  – Hardware monitoring, NTP configuration, accounts, ..
• Specialized teams maintaining services on top:
  – They are ultimately responsible for service stability
  – We don’t want NTP configured 150 different ways
• Requirements:
  – Some services will follow core updates
  – Some services will choose when to take core updates
  – Parts of services may follow the latest updates
  – LHC has physical shutdowns for doing timely updates
Change Control, Dev Cycle
• Puppet environments map to git branches:
  – Nodes in Production, Testing and Devel branches
  – Big new configurations being tested in feature branches
    • A few nodes in these feature branches
  – Some services live isolated in their own branch
    • Risk of divergence
• Current process:
  – A blind weekly devel -> production merge
• Next process:
  – Use Atlassian’s Crucible and Fisheye products to code review Puppet configuration
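The mapping from git branches to Puppet environments described above needs some rule for turning a branch name into an environment name. The following is a minimal sketch of one such rule, assuming environment names are restricted to lowercase letters, digits and underscores; the function name and the example branch names are hypothetical, not the tooling actually used at CERN.

```python
# Sketch: derive a Puppet environment name from a git branch name.
# Assumption: environment names may only contain lowercase letters,
# digits and underscores, so anything else is replaced by "_".

import re


def branch_to_environment(branch: str) -> str:
    """Map a git branch name to a safe environment name."""
    name = re.sub(r"[^a-z0-9_]", "_", branch.strip().lower())
    if not name:
        raise ValueError("branch name is empty")
    return name


if __name__ == "__main__":
    for branch in ["production", "testing", "devel", "feature/new-batch"]:
        print(f"{branch} -> {branch_to_environment(branch)}")
```

Two different branches can collapse to the same environment name under this rule (for example `a-b` and `a_b`), so a real deployment would probably want to detect such collisions.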
Crucible Reviewing Manifest
• Atlassian themselves use Puppet and do this
  – http://blogs.atlassian.com/2011/09/puppet_change_management_for_devops/
Hardware Provisioning
• Up to now a homegrown tool has been in use:
  – Has strong similarities to Puppet Labs’ new Razor
    • Razor is being followed and tracked for the moment
  – Final step of the tool adds the host to Foreman
• We are using Foreman and are happy with it:
  – Kickstart templating is great
  – Organising hosts into hostgroups is great
  – We will now invest time to integrate Foreman with CERN services:
    • CERN network database, our master for switches, DNS, …
    • AIMS, Kerberos-managed TFTP server
    • CERN CA
      – We have our own CA, also used by other services
      – We will use this for Puppet also
Virtual Machine Provisioning
• Existing Microsoft Hyper-V infrastructure:
  – 3000 virtual machines, of which 70 are Puppet managed
  – VMs pre-seeded into a Foreman hostgroup
  – VMs being kickstarted onto Puppet and Foreman
• Puppet-managed OpenStack Nova
  – Today aiming at 200 hypervisors with up to 4000 Puppet-managed VMs
  – Machine images created with Oz
  – Machines NOT pre-seeded in Foreman or Puppet
    • Register at boot time
  – amiconfig and cloud-init for contextualizing
    • Pass Puppet server and Foreman hostgroup to the image
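Passing the Puppet server and Foreman hostgroup to an image at boot amounts to handing the VM a small piece of user data that it reads before registering. The sketch below shows that round trip in Python. The JSON layout and the key names `puppet_server` and `foreman_hostgroup` are assumptions made for illustration; they are not the actual amiconfig or cloud-init formats, and the hostnames are placeholders.

```python
# Sketch of boot-time contextualization: the provisioning side builds a
# user-data payload, the VM side parses it before registering itself.
# The JSON layout and key names are hypothetical, chosen for illustration.

import json

REQUIRED_KEYS = {"puppet_server", "foreman_hostgroup"}


def build_user_data(puppet_server: str, foreman_hostgroup: str) -> str:
    """Serialize the two settings the image needs at first boot."""
    if not puppet_server or not foreman_hostgroup:
        raise ValueError("both puppet_server and foreman_hostgroup are required")
    return json.dumps(
        {"puppet_server": puppet_server, "foreman_hostgroup": foreman_hostgroup},
        sort_keys=True,
    )


def read_user_data(payload: str) -> dict[str, str]:
    """Parse the payload inside the VM and check nothing is missing."""
    settings = json.loads(payload)
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise ValueError(f"user data is missing: {sorted(missing)}")
    return settings


if __name__ == "__main__":
    payload = build_user_data("puppet.example.org", "batch/worker")
    print(payload)
    print(read_user_data(payload)["foreman_hostgroup"])
```

Because the VM is not pre-seeded in Foreman or Puppet, these two values are all it needs to find its master and land in the right hostgroup when it registers.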
Next Steps till End of Year
• Migrate to PuppetDB
  – (300,000 nodes => 300 GB RAM)
• Look at Puppet Dashboard
• Use MCollective for something:
  – Necessary as the number of nodes increases
  – Currently set up but not particularly used
• Check Foreman’s integration with OpenStack
• Migrate more services from Quattor to Puppet
• Decide on a scheme for secure blob delivery:
  – hiera-gpg or ACL’ed Puppet fileserver
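The PuppetDB figure above (300,000 nodes => 300 GB RAM) corresponds to about 1 MB of RAM per node. The sketch below makes that arithmetic explicit and scales it to other node counts. It assumes memory grows linearly with the number of nodes and uses decimal units (1 GB = 1000 MB); both are simplifications for illustration, not measured PuppetDB behaviour.

```python
# Rough PuppetDB memory estimate derived from the slide's single data point.
# Assumptions (not from the slide): linear scaling, 1 GB = 1000 MB.

SLIDE_NODES = 300_000
SLIDE_RAM_GB = 300


def ram_per_node_mb(nodes: int = SLIDE_NODES, ram_gb: int = SLIDE_RAM_GB) -> float:
    """RAM per node, in MB, implied by the slide's figure."""
    return ram_gb * 1000 / nodes


def estimated_ram_gb(nodes: int) -> float:
    """Linear extrapolation of RAM needed for a given node count."""
    return nodes * ram_per_node_mb() / 1000


if __name__ == "__main__":
    print(f"RAM per node: {ram_per_node_mb():.1f} MB")
    print(f"RAM for 4000 nodes: {estimated_ram_gb(4000):.1f} GB")
```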
Conclusions
• Migrating to Puppet
  – Largest change in our deployment for 5 years
• Has all been fairly painless
• Difficulties:
  – Forced to integrate with existing stuff sometimes
  – Doing things wrong the first time
    • Lack of in-house experience
• 300,000 VMs in 2015?
  – Puppet is easy to scale; more hardware can be added
  – We expect to dedicate up to 100 of cores to Puppet
• It’s a joy to work with an active community