The CERN Agile Infrastructure Project: Configuration and Operations Tools

Helge Meinhard / CERN-IT (replacing Manuel Guijarro)
HEPiX Spring 2012, 24 April 2012, Praha
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/

Page 1: The CERN Agile Infrastructure Project: Configuration and Operations Tools

CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

The CERN Agile Infrastructure Project:
Configuration and Operations Tools

Helge Meinhard / CERN-IT (replacing Manuel Guijarro)

HEPiX Spring 2012
24 April 2012, Praha

Page 2: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Configuration and Operations Tools

https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/

Agile Infrastructure - Configuration and Operation Tools

Page 3: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Project Scope

The project is reviewing the entire CERN computer-centre management toolset
– What happens from the bare metal up
– Asset management, inventory
– Sysadmin tools and maintenance workflows
– Service management and configuration tools
– Dynamic configuration for ‘virtual’ hosts
– Operations monitoring
– Workflow automation and continuous deployment
– …

Page 4: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Configuration and Operations Tools

Page 5: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Why?

Current production system built around the Quattor toolset is successfully managing O(10k) servers
– (CERN) Quattor + many CERN components

Why are we changing the toolset?

Page 6: The CERN Agile Infrastructure Project: Configuration and Operations Tools

What are the Issues (1)

Uncompressible technical debt
– The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
– Small community (less funding) and general support problem. At CERN, we’ve fallen into the “sticky hands” support model

We need better automation and integration between the sub-components
– Lack of automated workflow: everything is a ticket
  emailScript™: your added value in the process is often your CERN password
– The 15-min “CDB commit walk”: context switch cost

Page 7: The CERN Agile Infrastructure Project: Configuration and Operations Tools

What are the Issues (2)

Transferrable skills and training
– Learning curve for our tools is steep and remains high
– It’s easier to hire people who have skills in a widely-used tool than your internal tools
  Depending on where you look

Page 8: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Jobs Adverts – indeed.com

Index of millions of worldwide job posts across thousands of job sites

[Figure: indeed.com job adverts, Puppet compared with Quattor]

These are the sort of posts our departing staff will be applying for.

Page 9: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Integration is Hard

IPv6, virtualisation, Windows Server all need a solution
– We could leverage lots of open source tools
  But piecemeal integration of these requires high investment due to our complex system
  Years of organic growth have made the system way too ‘hairy’
  It’s often easier to reinvent rather than integrate
– Lack of ‘dynamic-ness’ in the infrastructure
  We hack the config system for dynamic VMs

It’s critical to look at the system as a whole

Page 10: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Use Puppet for the Core

The tool space has exploded in the last few years
– In configuration management and ops
– Large, shared ‘tool forges’, and lots of experience

Puppet and Chef are the clear leaders for the ‘core’ tool
– other tools in our ‘scope’ try to integrate with those

Many large-scale enterprises use Puppet
– Its declarative approach fits better with what we are used to
– Large installations: friendly, wide-base community and commercial support and training
– You can buy books on it

Page 11: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Scaling Challenges: Nodes

Currently we have O(10k) physical nodes

IaaS approach:
– Moving to virtual machines
– More (smaller, load-balanced) service nodes
– VMs for raw compute (batch or pilot jobs)
– Homogeneous: compute + storage on the same node

Add another computer centre, 24/48 SMT cores per node, you get 100k – 300k virtual nodes to be managed
– 99.6% (1) node update success-rate means 1200 manual interventions to “fix it”

(1) in a recent intervention on lxbatch
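
The 1200 figure corresponds to the upper bound of 300k nodes: a 99.6% success rate leaves 0.4% of nodes needing a manual fix. A minimal Python sketch of that arithmetic (the function name is illustrative and not from the slides):

```python
def expected_manual_fixes(nodes: int, success_rate: float) -> int:
    """Expected number of nodes whose update fails and needs a manual fix."""
    return round(nodes * (1.0 - success_rate))


# Today's O(10k) nodes, then the lower and upper bounds quoted on the slide.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, expected_manual_fixes(nodes, 0.996))
```

At today's scale the same rate gives about 40 interventions, which is why a rate that looks acceptable now stops being so at 100k to 300k nodes.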

Page 12: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Scaling Challenges: People

Many, diverse applications (“clusters”) managed by different teams
…and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system

Page 13: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Agile Infrastructure 1st Try (1)

First started investigating tools in September 2011 using ‘part-time’ resources from several IT groups
– Trying iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint review, visible
– Take first, best-guess at architecture and tool selection, iterate

Mixed success with this agile style
– What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management.
– What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources
– Now more staff available, but still mostly part-time efforts

Page 14: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Agile Infrastructure 1st Try (2)

We’re currently running:
– OpenStack as cloud software for virtual machines, image management, bulk storage
  See later presentation
– Puppet for the configuration management core
– …with Foreman as a dashboard

Page 15: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Foreman Dashboard

Page 16: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Agile Infrastructure 1st Try (2)

We’re currently running:
– OpenStack as cloud software for virtual machines, image management, bulk storage
  See later presentation
– Puppet for the configuration management core
– …with Foreman as a dashboard

None of the tools are “perfect” out-of-the-box
– …but we’d rather submit patches to a good open source tool than re-implement it
– We’ve experienced very good community support: RFCs and patches are quickly accepted
– Very active community: often problems are fixed and missing features implemented before you even report them

Page 17: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Agile Infrastructure 1st Try (3)

We’re currently running:
– yum for software distribution (replacing spma)
– git for template management: why git?
  Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
  Many of the tools we can benefit from also assume git
  We should not be different from the rest of the community

Page 18: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Puppet

Client/server architecture
– “puppetmaster”: horizontally scalable Rails application
– X509 cert authenticated nodes: integrate with CERN CA

Page 19: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Puppet

Puppet runs on the client, applying the configuration changes
– It detects the current state and only runs if there’s something to do

It runs every few minutes
– new configuration will be ~immediately applied (“fail-fast”).
– This is a change from CDB where ‘latent’ changes can be stacked up

Normal mode is client-side compile (“assume success”)
– No more CDB commit waits
– Change from CDB: the compilation fails later

Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
– The Foreman tool can collect these for you

Page 20: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Puppet Language

Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes
– With Ruby fall-back for hard stuff (we’ve only needed this once)

Being declarative rather than procedural, there are quirks
– Takes a bit of practice to ‘get it’
– There are books, online docs, online cook-books, and a large community to help

It dispenses with the need for ncm components
– All the work is done by puppet on the node itself; you just provide the template part to assert what you want done
– Less software -> easier to move to new OS versions
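
To illustrate what "declarative rather than procedural" means in practice, here is a minimal Python sketch of the idea, not Puppet's own language or implementation: you state the desired end state, and the tool works out which changes, if any, are needed. The `Resource` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    kind: str   # e.g. "package" or "service"
    name: str
    state: str  # the desired state, e.g. "installed" or "running"


def plan_changes(desired, current):
    """Return only the resources whose current state differs from the desired one."""
    return [r for r in desired if current.get((r.kind, r.name)) != r.state]


def converge(desired, current):
    """Bring `current` in line with `desired`; return the changes that were made."""
    changes = plan_changes(desired, current)
    for r in changes:
        # Stand-in for the real action (installing a package, starting a daemon).
        current[(r.kind, r.name)] = r.state
    return changes


desired = [
    Resource("package", "httpd", "installed"),
    Resource("service", "httpd", "running"),
]
node = {("package", "httpd"): "installed"}  # the package is already there

first_run = converge(desired, node)   # only the service needs changing
second_run = converge(desired, node)  # nothing left to do
```

Running the same description twice is safe: the second run finds the node already in the desired state and makes no changes.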

Page 21: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Externals

Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
– Node function + hardware
– Moving a host between clusters is a DB update

Your configuration can use variables the node detects itself
– e.g. reconfigure daemons based on where a newly live-migrated VM has found itself

Query the compiled configuration of other hosts
– e.g. Open my firewall to the lxadm nodes
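
The firewall example can be sketched in Python as a query against a host inventory. This is an illustration of the idea only: the inventory layout, the host names other than the lxadm and lxbatch cluster names, the addresses (taken from the documentation range 192.0.2.0/24) and the rule text are all assumptions, and the rule strings are descriptive rather than real firewall syntax.

```python
# Hypothetical inventory standing in for the external DB: hostname -> facts.
INVENTORY = {
    "lxadm01": {"cluster": "lxadm", "ip": "192.0.2.11"},
    "lxadm02": {"cluster": "lxadm", "ip": "192.0.2.12"},
    "lxbatch001": {"cluster": "lxbatch", "ip": "192.0.2.101"},
}


def hosts_in_cluster(inventory, cluster):
    """Names of all hosts currently assigned to `cluster`, sorted for stable output."""
    return sorted(name for name, facts in inventory.items()
                  if facts["cluster"] == cluster)


def firewall_rules(inventory, cluster, port):
    """One descriptive allow rule per host in `cluster`."""
    return [f"allow tcp from {inventory[host]['ip']} to port {port}"
            for host in hosts_in_cluster(inventory, cluster)]


rules_before = firewall_rules(INVENTORY, "lxadm", 22)

# Moving a host between clusters is a single update to the inventory...
INVENTORY["lxbatch001"]["cluster"] = "lxadm"

# ...and the derived rules pick it up the next time they are generated.
rules_after = firewall_rules(INVENTORY, "lxadm", 22)
```

The point of the design is that the firewall configuration never lists hosts by hand; it is derived from the same data that defines cluster membership.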

Page 22: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Moving towards PaaS

Parametrisable recipes
– Just fill in the blanks

The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template
– e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box
– …with these parameters

Moving us in the PaaS direction
– Ultimately, it would be better if you never even needed to log into this node (J2EE public service, IT web hosting service, MySQL service)
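
A "fill in the blanks" recipe can be pictured as a small parameter object with sensible defaults. The Python sketch below is an assumption-laden illustration, not an actual CERN recipe: the class, its field names, the default values and the package list are all hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class DjangoServerRecipe:
    """Hypothetical pre-canned recipe; the fields are the blanks a user fills in."""
    app_name: str
    server_name: str
    sso_enabled: bool = True   # assumed default
    wsgi_processes: int = 4    # assumed default

    def render(self) -> dict:
        """Validate the parameters and return the full configuration as a dict."""
        if self.wsgi_processes < 1:
            raise ValueError("wsgi_processes must be at least 1")
        config = asdict(self)
        # Illustrative package list that the recipe would add on the user's behalf.
        config["packages"] = ["httpd", "mod_wsgi"]
        return config


# The user supplies only the two required values; the rest comes from the recipe.
config = DjangoServerRecipe(app_name="myapp",
                            server_name="myapp.example.org").render()
```

The user never sees the template behind the recipe, only the handful of parameters it exposes.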

Page 23: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Standard Workflow

[Diagram: three workflows compared]

CDB on lxadm
– check out from CDB
– update templates
– CDB commit
– run and check on test node
– notify with nc-client
– n minutes
– Iterate

Puppet on lxadm
– check out from git
– update templates
– git commit and push
– run and check on test node
– notify with mcollective
– 1 minute
– Iterate

Puppet-apply on test node
– check out from git on the test node
– update templates
– run puppet-apply
– check on test node
– notify with mcollective
– Iterate

Other boxes in the diagram: check on foreman, check on node(s), check on foreman, git commit and push

Page 24: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Modernising our Processes (1)

Our software processes for the computer centre are fairly limited
– fire-and-forget broadcasts to project-elfms

…and rather manual
– The manual test/ -> preprod/ -> prod/ template dance
– Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand

Add standard continuous integration (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
– …then automate the testing
– e.g. suitably tagged RPMs are automatically deployed to /test nodes.
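
One way to picture automating the test -> preprod -> prod advancement is a gate that promotes a package only when its tests have passed. The Python sketch below is a hypothetical illustration; the slides name the three stages but describe no such function.

```python
# Stage names as given on the slide, in promotion order.
STAGES = ("test", "preprod", "prod")


def next_stage(current: str, tests_passed: bool) -> str:
    """Return the stage a package should be in after a test run.

    A package advances by one stage only when its tests passed;
    otherwise it stays where it is. "prod" is the final stage.
    """
    position = STAGES.index(current)  # raises ValueError for unknown stages
    if not tests_passed or position == len(STAGES) - 1:
        return current
    return STAGES[position + 1]


stage = "test"
stage = next_stage(stage, tests_passed=True)   # advances to "preprod"
stage = next_stage(stage, tests_passed=False)  # stays in "preprod"
```

Making this gate the only route into production is what replaces the manual template dance.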

Page 25: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Modernising our Processes (2)

We’re working out which of the many puppet / git models suits us
– code review, sign-off and automated notification for changes that will affect multiple clusters
– How to automate the test/preprod/prod advancement

Pre-req is flexible monitoring and alarming
– you need to trust that an automation failure will be signaled to you

Script-generated emails are banned
– Need good monitoring to hang these notifications on

Integrate components rather than use emailScript™
– Script-generated tickets (where your value in the process is your password), are banned

Page 26: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Current Tool Snapshot (Liable to Change)

[Diagram: tools by function]

Jenkins
Koji, Mock
Puppet, Foreman
AIMS/PXE, Foreman
Yum repo, Pulp
Puppet stored config DB
mcollective, yum
JIRA
Lemon
git, SVN
Openstack Nova
Hardware database

Page 27: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Preliminary Timelines

Year / What / Actions

2011
– Agree overall principles

2012
– Prepare formal project plan
– Establish IaaS in CERN CC
– Production Agile Infrastructure
– Monitoring Implementation as per WG
– Migrate lxcloud
– Early adopters to Agile Infrastructure

2013 (LSD 1, New Data Centre)
– Extend IaaS to remote CC
– Business Continuity
– Support Experiment App re-work
– Migrate CVI
– General migration to Agile with SLC6 and Windows 8

2014 (LSD 1, to November)
– Phase out Quattor/CDB/…

Aggressive schedule if we are to make it for new data centre

Page 28: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Initial Steps

Decided on tools

Integrating them to make a production setup
– We can still change… But we’re starting to commit…

Looking for early adopters
– In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
  e.g. PES/OIS services: batch/VMs, JIRA, Drupal
  https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
– Help with integration / coding
– Help with ideas
– Help with building the task list

Page 29: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Summary

IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components
– Puppet for the core configuration tool
– Better integration between components
– Use of more modern software processes to aid deployment
– Better monitoring
– Engage with the community rather than re-implement

Overall project scope is wider (see following presentations)
– Improved monitoring
– Cloud and virtualisation

Actively seeking wide involvement from CERN-IT and feedback from the community

https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

Page 30: The CERN Agile Infrastructure Project: Configuration and Operations Tools

Acknowledgements

• Many colleagues at CERN-IT, including
– Tim Bell
– Ian Bird
– Bernd Panzer-Steindel
– Gavin McCance
– Manuel Guijarro

Agile Infrastructure Making IT operations better since 2013

Jenkins

Openstack

Koji

ActiveMQ

Foreman

Puppet

mcollective

git