ship happens: a better firefox build and release pipeline
TRANSCRIPT
Kim Moir (kmoir), Mozilla Release Engineering
Ship Happens: A better Firefox build & release pipeline
“I am notorious for making impassioned speeches
about things nobody cares about.”
― Mindy Kaling, Why Not Me?
Today’s agenda
● Faster pipelines and what they mean for you
● How to try it yourself!
● Lessons learned and what’s next
Mozilla Releng live here
Release times
● 2013 - 11 hours
● 2017 - 4-5 hours
Continuous integration
● Land code
● Unit tests
● Decision graph
● Builds x N platforms
● Performance tests
● Sign Builds
Nightlies
● Land code
● Unit tests
● Decision graph
● Builds x N platforms
● Performance tests
● Sign Builds
● Generate updates
● L10n
Release process using release promotion
● Use existing build artifacts
● Generate updates
● L10n
● Unit tests
● Decision graph
● Sign Builds
● Performance tests
● Repackage Builds
+
● Move artifacts
● Refresh update db rules
● Update websites with release
About:Taskcluster
● Taskcluster is a task execution framework that supports Mozilla’s continuous
integration farm + release pipeline
It is a set of components that manages task queuing, scheduling, execution and
provisioning of resources.
Why: In-tree and Decision Graph
● Build and test configs are all in tree
○ Good news: Developer autonomy
○ Bad news: Developer autonomy
● Decision graph upon push identifies failures more quickly
● Changes can be tested locally and on try
Testing the graph locally
● Generates the full taskgraph
○ ./mach taskgraph full > full.txt
● Generates an optimized taskgraph
○ ./mach taskgraph optimized > optimized.txt
● Generates a target taskgraph
○ ./mach taskgraph target -p parameters.yml > target.txt
● Generates a target taskgraph as JSON to inspect the content of the graph
○ ./mach taskgraph target --json -p parameters.yml > target.json
● Taskcluster config files are under taskcluster/ in tree
○ Example: taskcluster/ci/build/macosx.yml defines mac builds (which
actually run on Linux)
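The full, target, and optimized graphs can be thought of as successive narrowing steps. Here is a toy sketch of that idea in Python. It is not the in-tree implementation: the task names, the target_filter predicate, and the already_done set are all made up for illustration, and the real optimization step is more involved than dropping finished tasks.

```python
# Toy model of task-graph narrowing. All task names and helpers here are
# hypothetical; the real logic lives under taskcluster/ in the tree.

# Full graph: every known task, mapped to the tasks it depends on.
FULL_GRAPH = {
    "build-linux64": [],
    "build-macosx64": [],
    "sign-linux64": ["build-linux64"],
    "test-linux64-mochitest": ["build-linux64"],
    "test-macosx64-mochitest": ["build-macosx64"],
}


def with_dependencies(graph, wanted):
    """Return the wanted tasks plus everything they transitively depend on."""
    result = set()
    stack = list(wanted)
    while stack:
        label = stack.pop()
        if label not in result:
            result.add(label)
            stack.extend(graph[label])
    return result


def target_graph(graph, target_filter):
    """Keep the tasks selected by the filter, plus their dependencies."""
    wanted = [label for label in graph if target_filter(label)]
    keep = with_dependencies(graph, wanted)
    return {label: deps for label, deps in graph.items() if label in keep}


def optimized_graph(graph, already_done):
    """Drop tasks whose results already exist, and edges pointing at them."""
    return {
        label: [d for d in deps if d not in already_done]
        for label, deps in graph.items()
        if label not in already_done
    }


# Target only Linux tests; the Linux build is pulled in as a dependency.
target = target_graph(FULL_GRAPH, lambda label: label.startswith("test-linux64"))
# Pretend an earlier push already produced the Linux build.
optimized = optimized_graph(target, already_done={"build-linux64"})

print(sorted(target))
print(sorted(optimized))
```

Diffing the output of two such steps is a quick way to see which tasks a config change adds or removes before pushing to try.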
Changing tests
● YAML files in taskcluster/ci/test/ define test groups by suite name, e.g.
mochitest, reftest, talos, etc.
Why: Docker Containers
● Docker containers for test and build images (not all platforms)
○ Consistent environment to debug build and test failures via one click loaners
○ More self-serve developer loaners
Why: More autoscaling
● Moved more platforms to AWS to enable autoscaling in response to bursty load
○ Moved Mac builds to Linux cross-compiles on AWS
○ Moved many Windows builds/tests to AWS
Why: More security
● Better security - Chain of Trust (CoT) between artifacts as they are built,
signed and moved to AWS S3/CDNs for download on releases/nightlies
● CoT is the security model for releases
● Task execution is restricted by Taskcluster scopes, but that is only one type of
authentication
● CoT allows us to trace requests back to the tree and verify each previous task
in the chain.
● If CoT fails, the task is marked as invalid
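To make the "trace back and verify each previous task" idea concrete, here is a minimal sketch in Python. It is an assumption-heavy illustration, not the real Chain of Trust code: the record fields (upstream, upstream_sha256), the in-memory dictionaries, and the task names are hypothetical, and it only checks artifact hashes. It does not model signing or Taskcluster scopes.

```python
import hashlib

# Hypothetical in-memory stand-ins for task records and artifact storage.
# Field names are illustrative only.


def sha256(data: bytes) -> str:
    """Hex digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_chain(task_id, tasks, artifacts, trusted_root):
    """Walk from task_id back to the root, checking every recorded hash.

    Returns True only if each upstream artifact matches the hash recorded
    by the task that consumed it, and the walk ends at trusted_root.
    """
    current = task_id
    seen = set()
    while True:
        if current in seen:  # guard against cycles in bad data
            return False
        seen.add(current)
        record = tasks.get(current)
        if record is None:
            return False
        upstream = record["upstream"]
        if upstream is None:
            # Reached a task with no parent: it must be the trusted root.
            return current == trusted_root
        blob = artifacts.get(upstream)
        if blob is None or sha256(blob) != record["upstream_sha256"]:
            return False
        current = upstream


artifacts = {
    "decision": b"task graph",
    "build": b"firefox.tar.bz2",
    "sign": b"firefox.tar.bz2.asc",
}
tasks = {
    "decision": {"upstream": None, "upstream_sha256": None},
    "build": {"upstream": "decision",
              "upstream_sha256": sha256(artifacts["decision"])},
    "sign": {"upstream": "build",
             "upstream_sha256": sha256(artifacts["build"])},
}

ok_before = verify_chain("sign", tasks, artifacts, trusted_root="decision")
artifacts["build"] = b"tampered"  # simulate a modified build artifact
ok_after = verify_chain("sign", tasks, artifacts, trusted_root="decision")
print(ok_before, ok_after)
```

A failed check here corresponds to the case where the task would be marked as invalid.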
Why+?
● Team learned new things - Docker, transforms, migration strategies,
microservices, monitoring
● Future efficiencies - allow us to continue to scale
● Migrate off technologies that did not scale to our needs
● Re-evaluate existing jobs: Are they still needed? Could they be improved?
Timeline for migration
● Jan 20 - Linux Desktop and Android Firefox nightly builds from Taskcluster
● Mar 13 - Mobile beta in Taskcluster
● Jul 2 - Mac Nightlies in Taskcluster
● Aug 30 - Windows nightlies in Taskcluster
● Nov 14 - Shipped Firefox Quantum in Taskcluster
Approach to migration
● Incremental portions of pool
● Communication
● Checklist
● Monitor capacity and wait times
● Monitor state after migration
● Rollback plan
● Decommission old
● Migrate more
Strangler Application - Martin Fowler
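One way to move "incremental portions of pool" with a cheap rollback is a percentage dial that routes each job deterministically. The Python sketch below shows that general pattern only; it is not how Mozilla's migration was implemented, and the job names and the choose_backend helper are hypothetical.

```python
import hashlib


def bucket(job_name: str) -> int:
    """Map a job name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(job_name: str, percent_on_new: int) -> str:
    """Send roughly percent_on_new percent of jobs to the new system."""
    return "new" if bucket(job_name) < percent_on_new else "old"


jobs = [f"job-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    on_new = sum(choose_backend(j, percent) == "new" for j in jobs)
    print(f"{percent}% dial -> {on_new} of {len(jobs)} jobs on new system")
```

Because the bucket depends only on the job name, raising the dial never moves a job back to the old system, and setting it to 0 is the rollback.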
56 was a rough release
● We had many automation changes
○ New compression format for updates
○ Watersheds for win32->win64 migration for people on 64-bit hardware
○ Win32/Win64 on Taskcluster
Operation: Don’t F*ck up 57
● Implement missing release automation
● Fix our staging environment
● Smooth our merge day process
● Train team members on merges and staging releases
● Run staging releases and merges to iron out any issues
before 57 releases
● Write tests to validate update rules for 57
● Spreadsheet to coordinate update rules with relman
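Tests for update rules can be as simple as asserting which release each client version is offered. The Python sketch below assumes a made-up rule format and treats 56 as a watershed purely for illustration. It is not the schema of any real update service, and the actual rules for 57 are not given in these slides.

```python
# Hypothetical rule format: the highest-priority rule whose condition matches
# the client wins. This is not the schema of any real update service.
RULES = [
    # Watershed (assumed for this example): clients older than 56 get 56 first.
    {"priority": 100, "max_version_exclusive": 56, "serve": "56.0"},
    # Everyone else is offered 57.
    {"priority": 90, "max_version_exclusive": None, "serve": "57.0"},
]


def major(version: str) -> int:
    """Major version number, e.g. '56.0.2' -> 56."""
    return int(version.split(".")[0])


def pick_update(client_version: str, rules=RULES):
    """Return the version a client should be offered, or None if up to date."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        limit = rule["max_version_exclusive"]
        if limit is None or major(client_version) < limit:
            offered = rule["serve"]
            return offered if major(offered) > major(client_version) else None
    return None


def test_watershed():
    # Clients older than 56 must stop at 56 first.
    assert pick_update("52.0") == "56.0"
    assert pick_update("55.0.3") == "56.0"
    # Clients already on 56 go straight to 57.
    assert pick_update("56.0.2") == "57.0"
    # Clients on 57 are offered nothing.
    assert pick_update("57.0") is None


test_watershed()
print("update rule checks passed")
```

Running checks like these against staging before release day catches a misordered or missing rule while it is still cheap to fix.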
What have we learned?
● Incrementalism - change one thing, evaluate, then change
another
● Expectations change. The faster we build, the faster other
groups expect to be able to ship
● Staging environment is important to test new automation
● Communication
● Organizational changes
● Consider the operational side, not just landing code
Upcoming work
● In-tree release promotion for beta and release builds
● Release process optimizations: measure our end-to-end release
times and common failure points, with the aim of
providing more predictable and stable releases
● Staging releases on try
● More incremental fixes to make things faster
I embrace mistakes, they make you who you are
―Beyoncé
Questions?
Additional Reading
● Justin Wood’s (Callek’s) talks on transforms
https://gitpitch.com/Callek/slideshows/transforms_2017
● All your nightlies are belong to Taskcluster
https://atlee.ca/blog/posts/migration-status.html
● Nightly builds from Taskcluster
https://atlee.ca/blog/posts/nightly-builds-from-taskcluster.html
● 2016 retrospective https://atlee.ca/blog/posts/2016-releng-retrospective.html
● What's So Special About "In-Tree?"
http://code.v.igoro.us/posts/2016/08/whats-so-special-about-in-tree.html
● Chris Cooper Nightlies in Taskcluster
http://coopcoopbware.tumblr.com/post/156133487075/nightlies-in-taskcluster-go-team
● Chris Cooper Mobile Betas in TC
http://coopcoopbware.tumblr.com/post/158362146735/shameless-self-release-promotion-firefox-530b1
● So you want to rewrite that - Camille Fournier, GOTO conference, Chicago,
2014 https://www.youtube.com/watch?v=PhYUvtifJXk