OPS stack:
● ~80 servers, heavy usage of LXC containers (~1000)
● 99.9% Debian, 1 Ubuntu host :)
● Nginx / Apache2, 2k reqs per sec
● 200 million page views monthly
● 700 Mbps peak traffic
● Python is dominant
About Brainly
World’s largest homework help social network, connecting over 40 million users monthly
DEV stack:
● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version
● Erlang
- 55k concurrent users
- 22k events per sec
● Native Apps
- iOS
- Android
● Puppet was not feasible for us
- *lots* of dependencies, which make containers bigger/heavier
- problems with Puppet's declarative language - it seemed incoherent
- lacking integration of orchestration
- steep learning curve
- YMMV
● "packaging as automation" as an intermediate solution
- dependency hell: installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules files
- LOTS of custom bash and PHP scripts, usually very hard to reuse and not standardized
- this was a dead end :(
● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later simplified and helped with lots of things
Starting point
● we decided to move forward with Ansible and use it for setting up machines as well
● first project was the Nagios monitoring plugins setup
● turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt), and a small footprint
- "configured" Python modules are transferred directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand: tasks/playbooks map directly to the actions an ops/devops person would have performed by hand
- compatible with "automation by packages" - we were able to migrate from the old system in small steps
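A setup like the one described above could be sketched as a short Ansible play; the package, plugin, and path names below are illustrative, not Brainly's actual role:

```yaml
# Hypothetical sketch: push pre-configured monitoring plugins
# straight onto hosts/containers, no local repository needed.
- hosts: all
  tasks:
    - name: Install the minimal dependencies the plugins need
      apt:
        name: "{{ item }}"
        state: present
      with_items:
        - python-apt

    - name: Copy plugin scripts directly to the machine
      copy:
        src: "plugins/{{ item }}"
        dest: /usr/lib/nagios/plugins/
        mode: "0755"
      with_items:
        - check_disk_usage
        - check_lxc_count
```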
First steps with Ansible
● all policies, rules, and good practices are written down in the automation repo's main directory
● helps with introducing new people to the team and with the devops approach
- newbies are able to start committing to the repo quickly
- what's in GUIDELINES.md is law, and changing it requires wider consensus
- gives examples of how to deal with certain problems in a standardized way
● a few examples:
- limit the number of tags; each of them should be self-contained with no cross-dependencies
- do not include roles/tasks inside other roles; this creates hard-to-follow dependencies
- NEVER subset the list of hosts inside a role, do it in site.yml. Otherwise debugging roles/hosts will become difficult
- think twice before adding a new role and especially new groups. As the infrastructure grows, it becomes hard to manage and/or creates "dead" code/roles
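The host-subsetting rule could look like this in practice (group and role names are illustrative):

```yaml
# site.yml - the ONLY place where the host list is narrowed.
# The roles themselves apply to whatever hosts they are given,
# with no "when: inventory_hostname in groups[...]" tricks inside them.
- hosts: monitoring_servers
  roles:
    - icinga_server

- hosts: webservers
  roles:
    - apache2_base
```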
Avoiding regressions
● one of the policies introduced was storing one-off scripts in a separate directory in our automation repo
● most of them are Ansible playbooks used for just one particular task (e.g. the Squeeze->Wheezy migration)
● version-control everything!
● this turned out to be very useful - some of the scripts proved useful enough to be rewritten into a proper role or tool
Ugly-hacks reusability
● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941
● "base" role:
- is reused across the 8 different production roles we have ATM
- contains basic monitoring, log rotation, package installation, etc.
- includes PHP setup in the mod_php/prefork configuration
- PHP disabled-functions control
- basic security setup
- does not include any site-specific stuff
● "site" role:
- contains all site-specific stuff and dependencies (vhosts, additional packages, etc.)
- usually very simple
- more than one site role is possible, but only one base role
● this is an example of how we make our roles reusable
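The base/site split might be wired together like this in site.yml (role, group, and variable names are made up for this sketch):

```yaml
# One shared "base" role per host, any number of "site" roles on top.
- hosts: market_us
  roles:
    - apache2_base                                    # monitoring, logrotate, mod_php, security
    - { role: apache2_site, site_domain: brainly.com }
    - { role: apache2_site, site_domain: api.brainly.com }
```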
Apache2 automation
● automatically sets up monitoring based on the inventory and host groups
● implements the devops approach - if a dev has root on a machine, they also have access to all monitoring stuff related to that system
● automatic host dependencies based on host groups
● provisioning new hosts is no longer so painful ("auto-discovery")
● all service configuration is stored as YAML files and used in templates
● the role uses DNS data directly from the inventory in order to make monitoring independent of DNS failures
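Generating host definitions from the inventory could be done with a Jinja2 template along these lines; the `icinga_parent` variable is hypothetical, only `ansible_ssh_host` comes from the standard inventory:

```jinja
{# Sketch: one Icinga host object per inventory host.
   The address comes from the inventory, not from DNS. #}
{% for host in groups['all'] %}
define host {
    use         generic-host
    host_name   {{ host }}
    address     {{ hostvars[host]['ansible_ssh_host'] }}
{% if 'icinga_parent' in hostvars[host] %}
    parents     {{ hostvars[host]['icinga_parent'] }}
{% endif %}
}
{% endfor %}
```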
Icinga
DNS migration
● at the beginning:
- dozens of authoritative name servers, each with a customized configuration, running ~100 zones, all created by hand
- the main reason for this was using DNS for switching between primary/secondary servers/services
● three phases:
- slurping the configuration into Ansible
- normalizing the configuration
- improving the setup
● a Python script uses the Ansible API to fetch the normalized zone configuration from each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using the named-checkconf tool
● the slurped configuration is used to re-generate all configs, this time using only the data available to Ansible
● "push-button" migration, after all recipes were ready :)
● secure: all zone transfers are signed with individual keys, ACLs are tight
● playbooks use DNS data directly from the inventory
● changing/migrating slaves/masters is easy, NS records are auto-generated
● updates to zones automatically bump the serial, while still preserving the YYYYMMDDxx format
● CRM records are auto-generated as well
* see next slide about CRM automation
● DNS entries are always up-to-date thanks to some custom action modules
- ansible_ssh_host variables are harvested and processed into zones
- only custom entries and zone primary/secondary server names are now stored in YAML
- new hosts are automatically added to zones, decommissioned ones removed
- auto-generation of reverse zones
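Bumping a serial while preserving the YYYYMMDDxx format is a small piece of logic; a minimal sketch (function name and exact behavior are assumptions, not Brainly's code):

```python
from datetime import date


def bump_serial(current, today=None):
    """Return the next zone serial in YYYYMMDDxx format.

    If the current serial was already issued today, increment the
    two-digit counter; otherwise restart today's counter at 00.
    (The two-digit counter caps at 99 updates per day.)
    """
    today = today or date.today().strftime("%Y%m%d")
    cur = str(current)
    if cur[:8] == today:
        return int(today + "%02d" % (int(cur[8:]) + 1))
    return int(today + "00")
```

For example, `bump_serial(2015060100, today="20150601")` yields `2015060101`, while a serial from a previous day rolls over to today's date with counter `00`.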
DNS automation
● we have ~130 CRM clusters
● setting them up by hand would be "difficult" at best, impossible at worst
● available on Ansible Galaxy:
- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979
● follows the pattern from apache2_base
- "base" role suitable for manually set up clusters
- "cluster" role provides the service on top of base, with a few reusable snippets and the possibility of more complex configurations
● automatic membership based on the Ansible inventory (no multicast!)
● the most difficult part was providing synchronous handlers
● a few simple configurations are provided, like single service-single VIP
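Inventory-driven membership can be sketched as a templated corosync nodelist; the group name is illustrative and the syntax assumes corosync 2.x-style unicast configuration:

```jinja
{# Sketch: cluster members come straight from the inventory group,
   so no multicast discovery is needed. #}
nodelist {
{% for host in groups['crm_cluster_01'] %}
    node {
        ring0_addr: {{ hostvars[host]['ansible_ssh_host'] }}
        nodeid: {{ loop.index }}
    }
{% endfor %}
}
```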
Corosync & Pacemaker
● initially we had neither the time nor the resources to set up full-fledged LDAP
● we needed:
- users should be able to log in even during a network outage
- adding/removing users, ssh keys, custom settings, etc. all had to be supported
- it had to be reusable/accessible in other roles (i.e. Icinga/monitoring)
- different privileges for dev, production and other environments
- UID/GID unification
● turned out to be simpler than we thought - users are managed using a few simple tasks and group_vars data; the rest is handled via variable precedence
● migration/standardization required some effort though
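The group_vars-plus-precedence approach could look like this; the `managed_users` structure and file layout are assumptions for the sketch:

```yaml
# group_vars/all.yml - the default user set; group_vars/production.yml
# can override it thanks to Ansible variable precedence.
managed_users:
  - name: alice
    uid: 2001
    groups: sudo
    ssh_key: "ssh-rsa AAAA... alice@laptop"

# tasks in the role: local accounts survive network outages,
# UIDs stay unified across all hosts.
- name: Create managed users with unified UIDs
  user:
    name: "{{ item.name }}"
    uid: "{{ item.uid }}"
    groups: "{{ item.groups | default(omit) }}"
  with_items: "{{ managed_users }}"

- name: Install SSH keys for the managed users
  authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_key }}"
  with_items: "{{ managed_users }}"
```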
User management automation
● standard Ansible inventory management becomes a bit cumbersome with hundreds of hosts:
- each host has to have ansible_ssh_host defined
- adding/removing a large number of hosts/groups required editing lots of files and/or one-off scripts
- IP address management using Google Docs does not scale ;)
● Ansible has a well-defined dynamic inventory API, with scripts available for AWS, Cobbler, Rackspace, Docker, and many others
● we wrote our own, based on a YAML file version-controlled with git:
- a Python API that makes manipulating the inventory easy
- logic and syntax checking of the inventory
● available as open source: https://github.com/brainly/inventory_tool
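The core of a dynamic inventory script is small: read the version-controlled file, validate it, and print the JSON structure Ansible expects from `--list`. A minimal sketch (the input format is hypothetical; see the inventory_tool repo for the real thing):

```python
#!/usr/bin/env python
"""Sketch of a YAML-backed dynamic inventory script for Ansible."""
import json
import sys


def build_inventory(data):
    """Turn {'hosts': {name: ip}, 'groups': {group: [names]}} into
    Ansible's dynamic-inventory JSON, with basic logic checking."""
    inventory = {"_meta": {"hostvars": {}}}
    hosts = data.get("hosts", {})
    for host, ip in hosts.items():
        inventory["_meta"]["hostvars"][host] = {"ansible_ssh_host": ip}
    for group, members in data.get("groups", {}).items():
        unknown = set(members) - set(hosts)
        if unknown:
            raise ValueError("group %s references unknown hosts: %s"
                             % (group, sorted(unknown)))
        inventory[group] = {"hosts": members}
    return inventory


if __name__ == "__main__":
    # In practice the file would be parsed with PyYAML (yaml.safe_load);
    # path and format are illustrative.
    import yaml
    if len(sys.argv) > 1 and sys.argv[1] == "--list":
        with open("inventory.yml") as fh:
            print(json.dumps(build_inventory(yaml.safe_load(fh))))
```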
Inventory management
● we lease our servers from Hetzner, with no direct Layer 2 connectivity
● all tunnel setups are done using Ansible; a new server is automatically added to our network
● firewalls are set up by Ansible as well:
- OPS contribute the base firewall, DEVs can open the ports of interest for their application
- ferm at its base, for easy rule making and for keeping the in-kernel firewall in sync with the on-disk rules
- rules are auto-generated based on the inventory; adding/removing hosts automatically reconfigures the firewall
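Inventory-generated ferm rules could be templated like this; the group selection and port are illustrative assumptions:

```jinja
{# Sketch: every other host in the inventory may reach this one
   on the (hypothetical) tunnel port; everything else is dropped. #}
table filter chain INPUT {
    policy DROP;
    mod state state (ESTABLISHED RELATED) ACCEPT;
{% for host in groups['all'] if host != inventory_hostname %}
    saddr {{ hostvars[host]['ansible_ssh_host'] }} proto udp dport 1194 ACCEPT;
{% endfor %}
}
```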
Networking
● based on Bareos, an open-source Bacula fork
● new hosts are automatically set up for backup, and extending storage space is no longer a problem
● authentication using certificates - a PITA without Ansible
Backups
● deployment is done by a Python script calling the Ansible API
● simple tasks are implemented as Ansible playbooks
● complex logic is implemented in Python
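One common shape for such a wrapper is Python assembling and driving `ansible-playbook` runs while keeping ordering/rollback logic on the Python side; the playbook and variable names below are made up, only the CLI flags are standard:

```python
import subprocess


def build_deploy_command(playbook, version, hosts, extra_tags=None):
    """Assemble an ansible-playbook invocation for one deploy step.
    The app_version variable name is an assumption for this sketch."""
    cmd = ["ansible-playbook", playbook,
           "--limit", ",".join(hosts),
           "--extra-vars", "app_version=%s" % version]
    if extra_tags:
        cmd += ["--tags", ",".join(extra_tags)]
    return cmd


def deploy(playbook, version, hosts):
    """Complex logic (batching, canary checks, rollback decisions) lives
    here in Python; the simple per-host work stays in the playbook."""
    return subprocess.call(build_deploy_command(playbook, version, hosts))
```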
Deployments
● Jinja2 template error messages are "difficult" to interpret
● templates sometimes grow to huge complexity
● Jinja2 is designed for speed, but with tradeoffs - some Python operators are missing, and creating custom plugins/filters poses some problems
● multiple inheritance, problems with 2-headed trees
● speed - improved with "pipelining=True", containerization in the long run
● some useful functionality requires a paid subscription (Ansible Tower)
- RESTful API, useful if you want to push new application versions to production via e.g. Jenkins
- schedules - currently we need to push the changes ourselves
Not everything is perfect
● developers by default have RO access to the repo, RW on a case-by-case basis
● changes to systems owned by developers are done by developers; OPS only provide the platform and tools
● all non-trivial changes require a Pull Request and a review from Ops
● mission-critical data is encrypted with Ansible Vault and pushed directly to the repo
- *strong* encryption
- available to Ansible without the need for prior decryption (a password is still required though)
- all security-sensitive stuff can be skipped by developers with the "--skip-tags" option to ansible-playbook
Dev,DevOps,Ops
● some of the things we mentioned can be found on our GitHub account
● we are working on open-sourcing more stuff
https://github.com/brainly
Opensource! Opensource! Opensource!
● time needed to deploy new markets dropped considerably
● increased productivity
● better cooperation with developers
● more capacity - Devs are no longer blocked so much, and we can push tasks to them
● infrastructure as code
● versioning
● code reuse, less copy-pasting
Conclusions