
A Puppet Infrastructure at CERN

Steve Traylen CERN IT Department [email protected]

Puppet Camp, Geneva, CH.

11 July 2012

Outline

•  CERN and Computing for High Energy Physics

•  Today’s CERN IT Deployment
   –  Why and what’s changing

•  Adoption of Puppet, Foreman, …
   –  Progress, Integration
   –  Difficulties
   –  Future


CERN

•  Conseil Européen pour la Recherche Nucléaire

•  aka European Laboratory for Particle Physics

•  Facilities for fundamental research

•  Between Geneva and the Jura mountains, straddling the Swiss-French border

•  Founded in 1954

The Large Hadron Collider

•  Accelerator for protons against protons, 14 TeV collision energy

•  By far the world’s most powerful accelerator

•  Tunnel of 27 km circumference, 4 m diameter, 50…150 m below ground

•  Detectors at four collision points

The LHC Computing Challenge

•  Data volume
   –  15 PetaBytes of new data each year

•  Global compute power
   –  250k CPU cores
   –  100 PB of disk storage

•  Worldwide analysis & funding

•  Distributed computing infrastructure to provide the production and analysis environments for the LHC experiments

•  Managed and operated by a worldwide collaboration between the experiments and the participating computer centres

•  Distributed for funding and sociological reasons

Motivation to Change Tools

•  CERN data centre is reaching its limits:
   –  IT staff numbers remain fixed
   –  More computing capacity is needed

•  Inefficiencies exist but the root cause cannot be easily identified
   –  Tools becoming increasingly brittle and difficult to adapt
      •  E.g. porting of tools to IPv6 would need a development project
   –  Some core components cannot be scaled up


Second CERN Data Centre

•  Wigner Institute in Budapest, Hungary
•  Hands-off facility, hardware support only
•  Deploying 2012 to 2014


Infrastructure Tools Evolution

•  We had to develop our own toolset in 2002
   –  “Extremely Large Fabric Management System” or http://cern.ch/ELFms
   –  Included Quattor for configuration

•  Nowadays,
   –  CERN compute capacity is no longer leading edge
   –  Many options available for open source fabric management
   –  We need to scale to meet the upcoming capacity increase

•  If there is a requirement which is not available through an open source tool, we should question the need
   –  If we are the first to need it, contribute it back to the open source tool


Infrastructure as a Service

•  Goals
   –  Improve repair processes with virtualisation
   –  More efficient use of our hardware
   –  Better tracking of usage
   –  Enable remote management for new data centre
   –  Support potential new use cases, e.g. cloud
   –  Sustainable support model

•  At scale for 2015
   –  15,000 servers
   –  90% of hardware virtualized
   –  300,000 VMs needed

•  Plan = OpenStack Adoption
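
The 2015 targets above imply a VM density per hypervisor. The following is a back-of-envelope sketch only; it assumes that "90% of hardware virtualized" means 90% of the 15,000 servers each run as one hypervisor, which the slide does not state explicitly.

```python
# Figures quoted on the slide.
SERVERS = 15_000
VIRTUALISED_FRACTION = 0.90
TARGET_VMS = 300_000


def vms_per_hypervisor(servers: int, fraction: float, vms: int) -> float:
    """Average number of VMs each hypervisor would need to host.

    Assumption (not from the slide): every virtualised server is exactly
    one hypervisor, and VMs are spread evenly.
    """
    hypervisors = round(servers * fraction)
    return vms / hypervisors


density = vms_per_hypervisor(SERVERS, VIRTUALISED_FRACTION, TARGET_VMS)
print(f"about {density:.1f} VMs per hypervisor")
```

Under that assumption the target works out to a little over 22 VMs per hypervisor.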


Chose Puppet for Configuration

•  The tool space has exploded in the last few years
   –  In configuration management and ops
   –  Large, shared ‘tool forges’, and lots of experience

•  Puppet and Chef are the clear leaders for the ‘core’ tool

•  Many large-scale enterprises use Puppet
   –  Its declarative approach fits better with what we are used to in Quattor
   –  Large installations; friendly, wide-based community; commercial support and training
   –  You can buy books on it
   –  You can employ people who know puppet better than you do


Deployed System

Starting with Puppet

•  Puppet was and is trivial to set up:
   –  Anyone can do it in a day

•  Configuring something with puppet is easy

•  What’s hard:
   –  Deciding module scope and how modules interact with one another
      •  Three modules editing grub.conf, or one?
   –  We started early 2012 with very little plan in the area of module organization


Downloading Puppet Modules

•  Expectation at start: all done for us
   –  ssh, iptables, sysctl, apache, mysql all done
   –  example42 or similar can do everything

•  Reality
   –  Modules often not quite correct
      •  Too simple
         –  e.g. I want my sshd_config to be different in two places
      •  Too much abstraction
         –  I want to use puppet and not some abstraction of 100s of variables covering every possible case
            »  e.g. puppet with(out) passenger. I only want one
   –  Parameterized classes and Foreman don’t really work
      •  Resulting modules are not shareable: ENC globals vs params


Sharing and Fixing Modules

•  Not as easy as it should be:
   –  Our modules are littered with CERNisms
      •  NTP servers, subnets, authorization systems, ...
      •  Adaptation to work with Foreman
      •  All of us learning puppet and doing things quickly (badly)

•  Hiera is being used now:
   –  Provides the code vs data separation we had with Quattor
   –  Dozens of ways to set up and (ab)use hiera
   –  Little experience with this anywhere yet
   –  Hiera should make modules more sharable across sites
      •  Looking forward to it becoming the normal, standard thing that modules use and everyone benefits from
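
The code vs data separation mentioned above can be illustrated with a first-match lookup through an ordered hierarchy. This is a minimal Python sketch of the idea, not Hiera's own API; the layer names, keys, and host names are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical data layers. In a Hiera-style setup these would live in data
# files kept outside the module code, so modules stay free of site specifics.
DATA: Dict[str, Dict[str, Any]] = {
    "host/lxbatch001": {"ssh::permit_root_login": "no"},
    "hostgroup/batch": {"ntp::servers": ["ntp-batch.example.org"]},
    "common": {
        "ntp::servers": ["ntp1.example.org", "ntp2.example.org"],
        "ssh::permit_root_login": "without-password",
    },
}


def lookup(key: str, hierarchy: List[str],
           data: Dict[str, Dict[str, Any]] = DATA) -> Any:
    """Return the value from the first layer in `hierarchy` that defines `key`."""
    for layer in hierarchy:
        values = data.get(layer, {})
        if key in values:
            return values[key]
    raise KeyError(key)


# Most specific layer first, site-wide defaults last.
hierarchy = ["host/lxbatch001", "hostgroup/batch", "common"]
print(lookup("ntp::servers", hierarchy))
print(lookup("ssh::permit_root_login", hierarchy))
```

A module written against such a lookup carries no site-specific values itself, which is the property that should make it easier to share across sites.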


Sharing Modules With All

•  A big aim is to share our modules as much as possible with everyone, but in particular:
   –  CERN IT is not the only puppet deployment at CERN
      •  ATLAS Point 1 farm at CERN runs puppet
   –  ATLAS analysis in the cloud has used puppet
   –  International HEP labs use or are switching to puppet
   –  Puppet was the “winner” at the recent CHEP fabric session
      •  Presentations from CERN, BNL, PIC, ATLAS

•  We will share here, but it’s early days:
   –  http://github.com/cernops


Organizing Modules On Disk

•  Started with all modules in one directory in git:
   –  Obviously wrong, great confusion for newcomers

•  Current situation, two directories in git:
   –  Modules: reusable items, e.g. firewall, apache, sysctl, ...
   –  Manifests: top-level services, e.g. batch machine, public login machine

•  Future plans:
   –  Split up modules into local and downloaded
      •  Modules like puppetlabs-firewall are mixed with our own junk
      •  Will allow us to track and contribute to upstream better
   –  In line with puppet’s upcoming vendor path
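
One way to picture the planned split is an ordered search over two directories, where a local module shadows a downloaded one of the same name. The sketch below is an illustration under that assumption; the directory names `local` and `downloaded` are hypothetical, and the function is not part of puppet.

```python
import os
import tempfile
from typing import List, Optional


def find_module(name: str, module_path: List[str]) -> Optional[str]:
    """Return the first directory on `module_path` that contains module `name`."""
    for base in module_path:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate
    return None


# Build a throwaway tree: "firewall" exists in both places, "apache" only
# in the downloaded directory.
root = tempfile.mkdtemp()
local_dir = os.path.join(root, "local")
vendor_dir = os.path.join(root, "downloaded")
for parts in (("local", "firewall"), ("downloaded", "firewall"), ("downloaded", "apache")):
    os.makedirs(os.path.join(root, *parts))

# Local modules are searched first, so they take precedence.
search_order = [local_dir, vendor_dir]
print(find_module("firewall", search_order))
print(find_module("apache", search_order))
```

Keeping downloaded modules in their own directory also makes it easy to diff them against upstream, since nothing site-specific is mixed in.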


Configuration Complexity

•  We have many configurations of service
   –  Puppet handles this diversity well

•  We have many administrators, >= 300
   –  These admins change, are on different continents
   –  Less obvious what to do with Puppet

150 clusters ranging from 1 to 3000 hosts.


Trust Amongst SysAdmins

[Diagram: one Git repository feeds the Puppet master(s) for SysAdmin Team A and the Puppet master(s) for SysAdmin Team B; each set of masters serves its own team's nodes.]

•  All share one git repository
   –  Rely on code review
   –  Git branches and environments

•  Teams use their own puppet masters
   –  hiera-gpg key for each team
   –  Host ACL on puppet masters

•  The full implications of this lack of trust between admins are unclear
   –  Interested to hear what others have done

Change Control, Dev Cycle

•  Core team maintaining OS and basics:
   –  Hardware monitoring, ntp configuration, accounts, ...

•  Specialized teams maintaining services on top:
   –  They are ultimately responsible for service stability
   –  We don’t want NTP configured 150 different ways

•  Requirements:
   –  Some services will follow core updates
   –  Some services will choose when to take core updates
   –  Parts of services may follow latest updates
   –  LHC has physical shutdowns for doing timely updates


Change Control, Dev Cycle

•  Puppet environments map to git branches:
   –  Nodes in Production, Testing and Devel branches
   –  Big new configurations being tested in feature branches
      •  A few nodes in these feature branches
   –  Some services live isolated in their own branch
      •  Risk of divergence

•  Current process:
   –  A blind weekly devel -> production merge

•  Next process:
   –  Use Atlassian’s Crucible and Fisheye products to code review puppet configuration
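
The branch-to-environment mapping described above can be sketched as a small naming function. This is an illustration only: it assumes environment names are restricted to lowercase letters, digits and underscores, and the branch name `feature/new-batch` is hypothetical.

```python
import re
from typing import Dict, List


def branch_to_environment(branch: str) -> str:
    """Map a git branch name to a name usable as an environment.

    Assumption: only lowercase letters, digits and underscores are allowed,
    so every other character is replaced by an underscore.
    """
    return re.sub(r"[^a-z0-9_]", "_", branch.lower())


def environments_for(branches: List[str]) -> Dict[str, str]:
    """Return {branch: environment}, refusing two branches that collide."""
    result: Dict[str, str] = {}
    seen: Dict[str, str] = {}
    for branch in branches:
        env = branch_to_environment(branch)
        if env in seen:
            raise ValueError(f"{branch!r} and {seen[env]!r} both map to {env!r}")
        seen[env] = branch
        result[branch] = env
    return result


mapping = environments_for(["production", "testing", "devel", "feature/new-batch"])
print(mapping)
```

Rejecting collisions up front avoids two feature branches silently sharing one environment, which would make the divergence risk noted above harder to spot.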


Crucible Reviewing Manifest

•  Atlassian themselves use puppet and do this
   –  http://blogs.atlassian.com/2011/09/puppet_change_management_for_devops/


Hardware Provisioning

•  Up to now a homegrown tool in use:
   –  Has strong similarities to Puppet Labs’ new Razor
      •  Razor is being followed, tracked for the moment
   –  Final step of tool adds host to foreman

•  We are using foreman and are happy with it:
   –  Kickstart templating is great
   –  Organising hosts into hostgroups is great
   –  We will now invest time to integrate foreman with CERN services:
      •  CERN network database, our master for switches, DNS, …
      •  AIMS, a kerberos managed tftp server
      •  CERN CA: we have our own CA, used by other services also
         –  We will use this for puppet also


Virtual Machine Provisioning

•  Existing Microsoft Hyper-V infrastructure:
   –  3000 virtual machines, of which 70 are puppet managed
   –  VMs pre-seeded into a foreman hostgroup
   –  VMs being kickstarted onto puppet and foreman

•  Puppet managed OpenStack Nova
   –  Today aiming at 200 hypervisors with up to 4000 puppet managed VMs
   –  Machine images created with Oz
   –  Machines NOT pre-seeded in foreman or puppet
      •  Register at boot time
   –  amiconfig and cloud-init for contextualizing
      •  Pass puppet server and foreman hostgroup to image
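
Passing the puppet server and foreman hostgroup to an image at boot can be pictured as a small payload handed to the VM and parsed inside it. The sketch below uses a made-up key=value format purely for illustration; it is not the amiconfig or cloud-init format, and the key names and host name are hypothetical.

```python
from typing import Dict


def build_user_data(puppet_server: str, foreman_hostgroup: str) -> str:
    """Render a small key=value payload to hand to a VM at creation time."""
    return (
        f"puppet_server={puppet_server}\n"
        f"foreman_hostgroup={foreman_hostgroup}\n"
    )


def parse_user_data(text: str) -> Dict[str, str]:
    """Parse the payload inside the VM, ignoring blank lines and # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


payload = build_user_data("puppet.example.org", "batch/worker")
settings = parse_user_data(payload)
print(settings)
```

Because the image itself carries no server or hostgroup, the same image can register with different puppet masters depending on what is passed at boot.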


Next Steps till End of Year

•  Migrate to PuppetDB
   –  (300,000 nodes => 300 GB RAM)

•  Look at puppet dashboard

•  Use mcollective for something:
   –  Necessary as node number increases
   –  Currently set up but not being used particularly

•  Check Foreman’s integration with OpenStack

•  Migrate more services from Quattor to Puppet

•  Decide a scheme for secure blob delivery:
   –  hiera-gpg or ACL’ed puppet fileserver
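
The PuppetDB figure above (300,000 nodes => 300 GB RAM) works out to roughly 1 MB per node. The sketch below spells out that arithmetic; it assumes decimal units and a purely linear relationship between node count and memory, neither of which the slide states.

```python
# Figures quoted on the slide.
NODES = 300_000
RAM_GB = 300


def ram_per_node_mb(nodes: int, ram_gb: float) -> float:
    """RAM per node in MB, using decimal units (1 GB = 1000 MB)."""
    return ram_gb * 1000 / nodes


def ram_for_nodes_gb(nodes: int, per_node_mb: float) -> float:
    """Linear extrapolation; real memory use may not scale linearly."""
    return nodes * per_node_mb / 1000


per_node = ram_per_node_mb(NODES, RAM_GB)
print(f"{per_node} MB per node")
print(f"{ram_for_nodes_gb(4000, per_node)} GB for 4000 nodes")
```

On the same linear assumption, the 4000 puppet managed VMs targeted today would need about 4 GB.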


Conclusions

•  Migrating to Puppet
   –  Largest change in our deployment for 5 years

•  Has all been fairly painless. Difficulties:
   –  Forced to integrate with existing stuff sometimes
   –  Doing things wrong first time
      •  Lack of in-house experience

•  300,000 VMs in 2015?
   –  Puppet is easy to scale, more hardware can be added
   –  We expect to dedicate up to 100 of cores to puppet

•  It’s a joy to work with an active community

