TRANSCRIPT
Google, Quality & You
Agenda
1. Why we care about test automation
2. Test sizes and test hermeticity
3. Deflake strategies
4. A tale of large tests
Codebase - as of Jan 2015

Number of files:         1 billion
Number of source files:  9 million
Lines of code:           2 billion
Depth of history:        35 million commits
Size of contents:        86 terabytes
Commits per workday:     45 thousand
● 15 million lines of code in 250 thousand files are changed per week by humans - roughly the same number of lines as in the entire Linux kernel
● ⅔ of these 45k daily commits are made by robots
● 800 thousand read requests per second at daily peak

Source: https://www.youtube.com/watch?v=W71BTkUbdqE
Build System

● Most engineers build large parts of the codebase many times a day
● Everything is built from head

http://www.bazel.io/docs/be/c-cpp.html
cc_library(
    name = 'search',
    hdrs = ['search.h'],
    srcs = ['search.cc'],
    deps = ['//index:query'],
)

cc_test(
    name = 'search_test',
    srcs = ['search_test.cc'],
    deps = [':search'],
)
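(Assuming this BUILD file lives in a search/ directory, the test can then be run with a command like "bazel test //search:search_test"; Bazel uses the declared deps to rebuild and re-run only the targets affected by a change.)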
Runtime dependencies

● Different push cycles
● Frequent push cycles

[Diagram, built up over several slides: MyBinary links in Library and calls OtherBinary over RPC]
What do you call a test that tests your application through its UI?
UI test
Integration test
Functional test
Regression test
Black box test
Selenium/Webdriver test
E2E test
Release test
Validation test
...
Just what, exactly, is an integration test? A unit test? How do we name these things?
Different tests have (very) different properties, and it is important to have a common language to talk about them (for example, when deciding how to schedule test runs)
At Google, we like to make decisions based on data, rather than just relying on gut instinct or something that can’t be measured and assessed
Over time we’ve come to agree on a set of data-driven naming conventions for our tests. We call them “Small”, “Medium” and “Large” tests
Small test
A unit test. Tests a class or a function
Specific logic conditions; heavy use of mocks, stubs and fakes
(Blazingly) fast - the test runner is expected to kill tests that exceed a short time limit
Opens no external ports
No sleep statements
Single threaded, no async flows, no race conditions
Run frequently - while editing code
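To make this concrete (a sketch, not from the talk; Searcher, Index and FakeIndex are hypothetical names), a small test written with googletest might look like:

#include <string>
#include "gtest/gtest.h"

// Hypothetical dependency interface, replaced by a fake in the test.
class Index {
 public:
  virtual ~Index() = default;
  virtual int DocCount(const std::string& term) const = 0;
};

class FakeIndex : public Index {
 public:
  int DocCount(const std::string& /*term*/) const override { return 3; }
};

// Hypothetical class under test: depends only on the Index interface.
class Searcher {
 public:
  explicit Searcher(const Index* index) : index_(index) {}
  bool HasResults(const std::string& term) const {
    return index_->DocCount(term) > 0;
  }
 private:
  const Index* index_;
};

TEST(SearcherTest, ReportsResultsWhenIndexHasDocs) {
  FakeIndex index;        // no network, no sleeps, single-threaded
  Searcher searcher(&index);
  EXPECT_TRUE(searcher.HasResults("quality"));
}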
Medium test
Interaction of one or more modules on a single machine
Less mocking
Slower (the test runner allows a longer time limit)
Limits network service to the localhost
Permits sleep statements
Uses lightweight tools such as in-memory databases to improve performance
Multiple threads, async flows
(Aimed to be) run before submitting code, along with small tests
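As an illustration of the in-memory-database point (a sketch, not from the talk; the users-table scenario is hypothetical), a medium test might exercise real SQL code against SQLite's ":memory:" database:

#include <sqlite3.h>
#include "gtest/gtest.h"

TEST(UserStoreTest, InsertThenCount) {
  sqlite3* db = nullptr;
  // ":memory:" keeps the whole database in RAM - fast and hermetic.
  ASSERT_EQ(sqlite3_open(":memory:", &db), SQLITE_OK);
  ASSERT_EQ(sqlite3_exec(db, "CREATE TABLE users (name TEXT);",
                         nullptr, nullptr, nullptr), SQLITE_OK);
  ASSERT_EQ(sqlite3_exec(db, "INSERT INTO users VALUES ('ada');",
                         nullptr, nullptr, nullptr), SQLITE_OK);

  sqlite3_stmt* stmt = nullptr;
  ASSERT_EQ(sqlite3_prepare_v2(db, "SELECT COUNT(*) FROM users;",
                               -1, &stmt, nullptr), SQLITE_OK);
  ASSERT_EQ(sqlite3_step(stmt), SQLITE_ROW);
  EXPECT_EQ(sqlite3_column_int(stmt, 0), 1);

  sqlite3_finalize(stmt);
  sqlite3_close(db);
}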
Large test
A system test, integration test, or end-to-end test that verifies that a complete application works and accounts for the behavior of external subsystems
Exercises any or all application subsystems and may make use of external resources such as databases, file systems, and network services
Slooooooooooooooooooooooooooooooow
External dependencies
Multiple threads, multiple processes, even multiple machines
Run as frequently as possible, but definitely cannot run in a presubmit queue
Requirements common to all sizes
Each test must be independent from other tests; tests must be runnable in any order
Tests must not have any persistent side effects. They must leave their environment as it was before they started
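One common way to meet both requirements (a sketch, not from the talk; ScratchDirTest is a hypothetical name) is a googletest fixture that creates a fresh scratch environment in SetUp and removes it in TearDown:

#include <filesystem>
#include <fstream>
#include "gtest/gtest.h"

class ScratchDirTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Fresh directory per test, so tests can run in any order.
    dir_ = std::filesystem::temp_directory_path() / "scratch_dir_test";
    std::filesystem::remove_all(dir_);  // defend against stale leftovers
    std::filesystem::create_directories(dir_);
  }
  void TearDown() override {
    // Leave the environment exactly as we found it.
    std::filesystem::remove_all(dir_);
  }
  std::filesystem::path dir_;
};

TEST_F(ScratchDirTest, WritesLandInScratchDir) {
  std::ofstream(dir_ / "out.txt") << "hello";
  EXPECT_TRUE(std::filesystem::exists(dir_ / "out.txt"));
}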
Test hermeticity
Hermetic tests can be run if you unplug the network cable
Non-hermetic tests are inherently flaky - don't try to fix these
Larger tests tend to be less hermetic
You should not think about small tests the same way you think about large tests
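A classic way to make a test more hermetic (a sketch, not from the talk; Clock, FakeClock and IsExpired are hypothetical names) is to inject non-hermetic dependencies such as the system clock:

#include <cstdint>
#include "gtest/gtest.h"

class Clock {
 public:
  virtual ~Clock() = default;
  virtual int64_t NowSeconds() const = 0;
};

class FakeClock : public Clock {
 public:
  explicit FakeClock(int64_t now) : now_(now) {}
  int64_t NowSeconds() const override { return now_; }
 private:
  int64_t now_;
};

// Code under test asks the injected clock for the time instead of the OS,
// so the test passes with the network cable unplugged, at any wall time.
bool IsExpired(const Clock& clock, int64_t deadline_seconds) {
  return clock.NowSeconds() > deadline_seconds;
}

TEST(ExpiryTest, DeterministicWithFakeClock) {
  FakeClock clock(1000);
  EXPECT_TRUE(IsExpired(clock, 999));
  EXPECT_FALSE(IsExpired(clock, 1000));
}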
The Test Pyramid
Small tests
Write a lot of them
Maximize code coverage
Run as much as possible
Do invest in faking out dependencies
Sharding and parallelization
Block submit on failures
Do not allow flakes in
Flakiness @Google
We heavily rely on (small) tests
Literally, without these we wouldn’t be able to scale up with a single monolithic repository
We define a "flaky" test result as a test that exhibits both a passing and a failing result with the same code (@ the same version)
Root causes: concurrency, relying on non-deterministic or undefined behaviors, flaky third party code, infrastructure problems, rendering, gpu and animations
Across our entire corpus of tests, we see a continual rate of about 1.5% of all test runs reporting a "flaky" result
Flakiness @Google
Our continuous build systems understand when a test has transitioned from a passing state to failure
If there were no flakes, we could automatically, efficiently, and reliably find the culprit change by binary search (see the sketch below)
Once found, the culprit can be rolled back automatically
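A sketch of that culprit search (hypothetical; assumes flake-free, deterministic test results):

#include <functional>

// Invariant: the test passes at commit `lo` and fails at commit `hi`.
// With flake-free results, the first failing commit is found in
// O(log n) test runs.
int FindCulprit(int lo, int hi, const std::function<bool(int)>& passes_at) {
  while (hi - lo > 1) {
    int mid = lo + (hi - lo) / 2;
    if (passes_at(mid)) {
      lo = mid;  // still green here; culprit is later
    } else {
      hi = mid;  // already red here; culprit is here or earlier
    }
  }
  return hi;  // first commit at which the test fails
}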
Deflake strategies
Mark a test as flaky (it is OK if it passes 1 out of 3 runs)
Each time a test fails, it is executed again in the background; if the rerun passes, the original run is designated a flake (see the sketch after this list)
The flake probability is the number of flaky failures divided by the total number of passes for that test - the historic ratio of a test's flakes to its passes
We use this metric to automatically quarantine flaky tests
Another tool detects changes in the flakiness level of tests and works to identify the change that caused that shift
Other approach: new tests get quarantined by default
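A sketch of the rerun-on-failure classification and the flake-probability metric described above (hypothetical names, not Google's actual implementation):

#include <functional>

enum class Verdict { kPassed, kFailed, kFlaky };

// Rerun-on-failure: if a failing test passes on an immediate retry at
// the same code version, count the original failure as a flake.
Verdict ClassifyRun(bool first_run_passed, const std::function<bool()>& rerun) {
  if (first_run_passed) return Verdict::kPassed;
  return rerun() ? Verdict::kFlaky : Verdict::kFailed;
}

// Historic flake probability: a test's flakes over its passes.
double FlakeProbability(int flake_count, int pass_count) {
  if (pass_count == 0) return 0.0;
  return static_cast<double>(flake_count) / pass_count;
}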
A tale of large tests

[Diagram, built up over several slides: MyBinary links in Library and calls OtherBinary over RPC. A Test is then added that exercises MyBinary end-to-end.]

[Next: instead of the Production OtherBinary, the Test talks to a Local OtherBinary started just for the test.]

[Next: the Test talks to the Real OtherBinary, but in a Pre-Production environment rather than in Production.]

[Next: the Test runs per release candidate - a MyBinary Candidate with its Library, talking over RPC to the Real OtherBinary in Pre-Production.]
A tale of large tests (last slide)

[Diagram: MyBinary Candidate with Library, talking over RPC to the Real OtherBinary]

● Runs continuously against different configurations
● Statistical approach (probers)
● Distributed geographically
● Alerts with respect to configuration importance
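As an illustration of the "statistical approach (probers)" bullet (a sketch, not from the talk; all names are hypothetical): instead of alerting on any single failed probe, alert only when the failure rate over a sliding window crosses a threshold:

#include <cstddef>
#include <deque>

class Prober {
 public:
  Prober(size_t window, double alert_threshold)
      : window_(window), alert_threshold_(alert_threshold) {}

  // Record one probe result; returns true if we should alert.
  bool Record(bool probe_passed) {
    results_.push_back(probe_passed);
    if (results_.size() > window_) results_.pop_front();
    size_t failures = 0;
    for (bool ok : results_) {
      if (!ok) ++failures;
    }
    double failure_rate = static_cast<double>(failures) / results_.size();
    // Only alert once the window is full, to avoid noisy early alerts.
    return results_.size() == window_ && failure_rate > alert_threshold_;
  }

 private:
  const size_t window_;
  const double alert_threshold_;
  std::deque<bool> results_;
};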