TRANSCRIPT
Google, Quality & You
Agenda
1. Why we care about test automation
2. Test sizes and test hermeticity
3. Deflake strategies
4. A tale of large tests
Codebase - as of Jan 2015

Number of files:         1 billion
Number of source files:  9 million
Lines of code:           2 billion
Depth of history:        35 million commits
Size of contents:        86 terabytes
Commits per workday:     45 thousand
● 15 million lines of code in 250 thousand files are changed per week by humans - roughly the same number of lines as in the entire Linux kernel
● ⅔ of these 45k daily commits are made by robots
● 800 thousand read requests per second at daily peak

Source: https://www.youtube.com/watch?v=W71BTkUbdqE
Build System

● Most engineers build large parts of the codebase many times a day
● Everything is built from head

http://www.bazel.io/docs/be/c-cpp.html
cc_library(
    name = 'search',
    hdrs = ['search.h'],
    srcs = ['search.cc'],
    deps = ['//index:query'],
)

cc_test(
    name = 'search_test',
    srcs = ['search_test.cc'],
    deps = [':search'],
)
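(Assuming this BUILD file lives in a search/ directory, the test can then be run with a command like "bazel test //search:search_test"; Bazel uses the declared deps to rebuild and re-run only the targets affected by a change.)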
Runtime dependencies

● Different push cycles
● Frequent push cycles

[Diagram, built up over several slides: MyBinary links in Library and calls OtherBinary over RPC]
What do you call a test that tests your application through its UI?
UI test
Integration test
Functional test
Regression test
Black box test
Selenium/Webdriver test
E2E test
Release test
Validation test
...
Just what, exactly, is an integration test? A unit test? How do we name these things?
Different tests have (very) different properties, and it is important to have a common language to talk about them (for example, when deciding how to schedule test runs)
At Google, we like to make decisions based on data, rather than just relying on gut instinct or something that can’t be measured and assessed
Over time we’ve come to agree on a set of data-driven naming conventions for our tests. We call them “Small”, “Medium” and “Large” tests
Small test
A unit test. Tests a class or a function
Specific logic conditions; heavy use of mocks, stubs and fakes
(Blazingly) fast - the test runner is expected to kill tests that exceed a short time limit
Opens no external ports
No sleep statements
Single threaded, no async flows, no race conditions
Run frequently - while editing code
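To make this concrete (a sketch, not from the talk; Searcher, Index and FakeIndex are hypothetical names), a small test written with googletest might look like:

#include <string>
#include "gtest/gtest.h"

// Hypothetical dependency interface, replaced by a fake in the test.
class Index {
 public:
  virtual ~Index() = default;
  virtual int DocCount(const std::string& term) const = 0;
};

class FakeIndex : public Index {
 public:
  int DocCount(const std::string& /*term*/) const override { return 3; }
};

// Hypothetical class under test: depends only on the Index interface.
class Searcher {
 public:
  explicit Searcher(const Index* index) : index_(index) {}
  bool HasResults(const std::string& term) const {
    return index_->DocCount(term) > 0;
  }
 private:
  const Index* index_;
};

TEST(SearcherTest, ReportsResultsWhenIndexHasDocs) {
  FakeIndex index;        // no network, no sleeps, single-threaded
  Searcher searcher(&index);
  EXPECT_TRUE(searcher.HasResults("quality"));
}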
Medium test
Interaction of one or more modules on a single machine
Less mocking
Slower (the test runner allows a longer time limit)
Limits network service to the localhost
Permits sleep statements
Uses lightweight tools such as in-memory databases to improve performance
Multiple threads, async flows
(Aimed to be) run before submitting code, along with small tests
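As an illustration of the in-memory-database point (a sketch, not from the talk; the users-table scenario is hypothetical), a medium test might exercise real SQL code against SQLite's ":memory:" database:

#include <sqlite3.h>
#include "gtest/gtest.h"

TEST(UserStoreTest, InsertThenCount) {
  sqlite3* db = nullptr;
  // ":memory:" keeps the whole database in RAM - fast and hermetic.
  ASSERT_EQ(sqlite3_open(":memory:", &db), SQLITE_OK);
  ASSERT_EQ(sqlite3_exec(db, "CREATE TABLE users (name TEXT);",
                         nullptr, nullptr, nullptr), SQLITE_OK);
  ASSERT_EQ(sqlite3_exec(db, "INSERT INTO users VALUES ('ada');",
                         nullptr, nullptr, nullptr), SQLITE_OK);

  sqlite3_stmt* stmt = nullptr;
  ASSERT_EQ(sqlite3_prepare_v2(db, "SELECT COUNT(*) FROM users;",
                               -1, &stmt, nullptr), SQLITE_OK);
  ASSERT_EQ(sqlite3_step(stmt), SQLITE_ROW);
  EXPECT_EQ(sqlite3_column_int(stmt, 0), 1);

  sqlite3_finalize(stmt);
  sqlite3_close(db);
}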
Large test
A system test, integration test, or end-to-end test that verifies that a complete application works and accounts for the behavior of external subsystems
Exercises any or all application subsystems and may make use of external resources such as databases, file systems, and network services
Slooooooooooooooooooooooooooooooow
External dependencies
Multiple threads, multiple processes, even multiple machines
Run as frequently as possible, but definitely cannot run in a presubmit queue
Requirements common to all sizes
Each test must be independent from other tests; tests must be runnable in any order
Tests must not have any persistent side effects. They must leave their environment as it was before they started
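One common way to meet both requirements (a sketch, not from the talk; ScratchDirTest is a hypothetical name) is a googletest fixture that creates a fresh scratch environment in SetUp and removes it in TearDown:

#include <filesystem>
#include <fstream>
#include "gtest/gtest.h"

class ScratchDirTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Fresh directory per test, so tests can run in any order.
    dir_ = std::filesystem::temp_directory_path() / "scratch_dir_test";
    std::filesystem::remove_all(dir_);  // defend against stale leftovers
    std::filesystem::create_directories(dir_);
  }
  void TearDown() override {
    // Leave the environment exactly as we found it.
    std::filesystem::remove_all(dir_);
  }
  std::filesystem::path dir_;
};

TEST_F(ScratchDirTest, WritesLandInScratchDir) {
  std::ofstream(dir_ / "out.txt") << "hello";
  EXPECT_TRUE(std::filesystem::exists(dir_ / "out.txt"));
}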
Test hermeticity
Hermetic tests can be run if you unplug the network cable
Non-hermetic tests are inherently flaky - don't try to fix these
Larger tests tend to be less hermetic
You should not think about small tests the same way you think about large tests
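A classic way to make a test more hermetic (a sketch, not from the talk; Clock, FakeClock and IsExpired are hypothetical names) is to inject non-hermetic dependencies such as the system clock:

#include <cstdint>
#include "gtest/gtest.h"

class Clock {
 public:
  virtual ~Clock() = default;
  virtual int64_t NowSeconds() const = 0;
};

class FakeClock : public Clock {
 public:
  explicit FakeClock(int64_t now) : now_(now) {}
  int64_t NowSeconds() const override { return now_; }
 private:
  int64_t now_;
};

// Code under test asks the injected clock for the time instead of the OS,
// so the test passes with the network cable unplugged, at any wall time.
bool IsExpired(const Clock& clock, int64_t deadline_seconds) {
  return clock.NowSeconds() > deadline_seconds;
}

TEST(ExpiryTest, DeterministicWithFakeClock) {
  FakeClock clock(1000);
  EXPECT_TRUE(IsExpired(clock, 999));
  EXPECT_FALSE(IsExpired(clock, 1000));
}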
The Test Pyramid
Small tests
Write a lot of them
Maximize code coverage
Run as much as possible
Do invest in faking out dependencies
Sharding and parallelization
Block submit on failures
Do not allow flakes in
Flakiness @Google
We heavily rely on (small) tests
Literally, without these we wouldn’t be able to scale up with a single monolithic repository
We define a "flaky" test result as a test that exhibits both a passing and a failing result with the same code (@ the same version)
Root causes: concurrency, relying on non-deterministic or undefined behaviors, flaky third party code, infrastructure problems, rendering, gpu and animations
Across our entire corpus of tests, we see a continual rate of about 1.5% of all test runs reporting a "flaky" result
Flakiness @Google
Our continuous build systems understand when a test has transitioned from a passing state to failure
If there were no flakes, we could automatically, efficiently, and reliably find the culprit change by binary search (see the sketch below)
Once found, the culprit can be rolled back automatically
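A sketch of that culprit search (hypothetical; assumes flake-free, deterministic test results):

#include <functional>

// Invariant: the test passes at commit `lo` and fails at commit `hi`.
// With flake-free results, the first failing commit is found in
// O(log n) test runs.
int FindCulprit(int lo, int hi, const std::function<bool(int)>& passes_at) {
  while (hi - lo > 1) {
    int mid = lo + (hi - lo) / 2;
    if (passes_at(mid)) {
      lo = mid;  // still green here; culprit is later
    } else {
      hi = mid;  // already red here; culprit is here or earlier
    }
  }
  return hi;  // first commit at which the test fails
}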
Deflake strategies
Mark a test as flaky (it is OK if it passes 1 out of 3 runs)
Each time a test fails, it is executed again in the background; if the rerun passes, the original run is designated a flake (see the sketch after this list)
The flake probability is the number of flaky failures divided by the total number of passes for that test - the historic ratio of a test's flakes to its passes
We use this metric to automatically quarantine flaky tests
Another tool detects changes in the flakiness level of tests and works to identify the change that caused that shift
Other approach: new tests get quarantined by default
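A sketch of the rerun-on-failure classification and the flake-probability metric described above (hypothetical names, not Google's actual implementation):

#include <functional>

enum class Verdict { kPassed, kFailed, kFlaky };

// Rerun-on-failure: if a failing test passes on an immediate retry at
// the same code version, count the original failure as a flake.
Verdict ClassifyRun(bool first_run_passed, const std::function<bool()>& rerun) {
  if (first_run_passed) return Verdict::kPassed;
  return rerun() ? Verdict::kFlaky : Verdict::kFailed;
}

// Historic flake probability: a test's flakes over its passes.
double FlakeProbability(int flake_count, int pass_count) {
  if (pass_count == 0) return 0.0;
  return static_cast<double>(flake_count) / pass_count;
}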
A tale of large tests

[Diagram, built up over several slides: MyBinary links in Library and calls OtherBinary over RPC. A Test is then added that exercises MyBinary end-to-end.]

[Next: instead of the Production OtherBinary, the Test talks to a Local OtherBinary started just for the test.]

[Next: the Test talks to the Real OtherBinary, but in a Pre-Production environment rather than in Production.]

[Next: the Test runs per release candidate - a MyBinary Candidate with its Library, talking over RPC to the Real OtherBinary in Pre-Production.]
A tale of large tests (last slide)

[Diagram: MyBinary Candidate with Library, talking over RPC to the Real OtherBinary]

● Runs continuously against different configurations
● Statistical approach (probers)
● Distributed geographically
● Alerts with respect to configuration importance
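As an illustration of the "statistical approach (probers)" bullet (a sketch, not from the talk; all names are hypothetical): instead of alerting on any single failed probe, alert only when the failure rate over a sliding window crosses a threshold:

#include <cstddef>
#include <deque>

class Prober {
 public:
  Prober(size_t window, double alert_threshold)
      : window_(window), alert_threshold_(alert_threshold) {}

  // Record one probe result; returns true if we should alert.
  bool Record(bool probe_passed) {
    results_.push_back(probe_passed);
    if (results_.size() > window_) results_.pop_front();
    size_t failures = 0;
    for (bool ok : results_) {
      if (!ok) ++failures;
    }
    double failure_rate = static_cast<double>(failures) / results_.size();
    // Only alert once the window is full, to avoid noisy early alerts.
    return results_.size() == window_ && failure_rate > alert_threshold_;
  }

 private:
  const size_t window_;
  const double alert_threshold_;
  std::deque<bool> results_;
};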