Data Flow in the Data Center

Adam Cataldo (@djscrooge), Wealthfront (wealthfront.com)
November 7, 2013



wealthfront.com |

Wealthfront & Me

• Wealthfront is the largest and fastest-growing software-based financial advisor

• We manage the first $10,000 for free; the rest costs only 0.25% a year

• Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000

• I’ve been working on the data platform we use for website optimization, investment research, business analytics, and operations

2


Why the Ptolemy conference?

• This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems

• This is a talk about the design of a data analytics system

• It turns out many of the patterns are the same in both fields

3


MapReduce & Hadoop

4


Hadoop at a Glance

• Scales well for large data sets

• Industry standard for data processing

• Optimized for high-throughput batch processing

• High latency

• Overkill for small data sets
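
The MapReduce model behind these trade-offs can be sketched in plain Java with no Hadoop dependency — a hypothetical in-memory word count (input strings invented for illustration), with the map, shuffle, and reduce phases made explicit:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Word count as map + shuffle + reduce over in-memory "documents"
    static Map<String, Integer> wordCount(List<String> docs) {
        // Map phase: each input record emits (word, 1) pairs
        List<Map.Entry<String, Integer>> pairs = docs.stream()
            .flatMap(doc -> Arrays.stream(doc.split(" ")))
            .map(word -> Map.entry(word, 1))
            .collect(Collectors.toList());

        // Shuffle phase: the framework groups all emitted pairs by key
        Map<String, List<Integer>> grouped = pairs.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: sum the values collected for each key
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hadoop scales", "hadoop batches")));
        // {batches=1, hadoop=2, scales=1}
    }
}
```

On a real cluster each phase runs across many machines and spills to disk, which is where both the throughput and the long per-job latency come from.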

5


Cascading

6


Why Cascading?

• Most real problems require multiple MapReduce jobs

• Provides a data-flow abstraction to specify data transformations

• Builds on standard database concepts: joins, groups, and so on

• Provides decent testing capabilities, which we’ve extended

7


From SQL to Cascading

select name from users join mails on users.email=mails.to

8

Pipe joined = new CoGroup(users, "email", mails, "to");

Pipe name = new Retain(joined, "name");
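
To make the semantics of those two pipes concrete, here is a hypothetical plain-Java equivalent (field and record names invented): CoGroup performs an inner join on users.email = mails.to, and Retain projects each joined row down to the name field:

```java
import java.util.*;

public class JoinSemantics {
    record User(String name, String email) {}
    record Mail(String to, String subject) {}

    // CoGroup: inner join on users.email = mails.to;
    // Retain: keep only the "name" field of each joined row
    static List<String> namesWithMail(List<User> users, List<Mail> mails) {
        List<String> names = new ArrayList<>();
        for (User u : users)
            for (Mail m : mails)
                if (u.email().equals(m.to()))
                    names.add(u.name());   // one output row per matching pair
        return names;
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("Ann", "ann@example.com"),
                                   new User("Bob", "bob@example.com"));
        List<Mail> mails = List.of(new Mail("ann@example.com", "welcome"));
        System.out.println(namesWithMail(users, mails)); // [Ann]
    }
}
```

On a cluster, Cascading of course does not run this nested loop; it compiles the join into grouped MapReduce steps, as the next slide shows.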


Cascading to Hadoop

9

mails ─→ mails mappers ──┐
                         ├─→ join reducers ─→ result
users ─→ users mappers ──┘
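
The reduce-side join in that picture can be sketched as follows — a hypothetical simulation (keys and payloads invented) in which each table's mappers tag their records with the source table and emit them under the join key, the shuffle groups the tagged records by key, and the reducers pair up records from the two sources:

```java
import java.util.*;

public class ReduceSideJoin {
    // Mapper output: the join key, a tag naming the source table, and the payload
    record Tagged(String key, String source, String payload) {}

    static List<String> join(List<Tagged> mapped) {
        // Shuffle: Hadoop groups all mapper output by key
        Map<String, List<Tagged>> byKey = new HashMap<>();
        for (Tagged t : mapped)
            byKey.computeIfAbsent(t.key(), k -> new ArrayList<>()).add(t);

        // Reduce: within each key's group, pair records from the two sources
        List<String> joined = new ArrayList<>();
        for (List<Tagged> group : byKey.values())
            for (Tagged u : group)
                if (u.source().equals("users"))
                    for (Tagged m : group)
                        if (m.source().equals("mails"))
                            joined.add(u.payload() + " got: " + m.payload());
        return joined;
    }

    public static void main(String[] args) {
        // "Mappers": each table's mapper emits records keyed by the join field
        List<Tagged> mapped = List.of(
            new Tagged("ann@example.com", "users", "Ann"),
            new Tagged("bob@example.com", "users", "Bob"),
            new Tagged("ann@example.com", "mails", "welcome"));
        System.out.println(join(mapped)); // [Ann got: welcome]
    }
}
```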


Getting data ready for Cascading

10

Production MySQL DBs ──(extract, transform)──→ Avro files ──(load)──→ Amazon Simple Storage Service


Why Avro?

• A compact data format, capable of storing large data sets

• We compress with Google Snappy

• Compressed files are still splittable into 128 MB chunks

• De-facto file format for Hadoop

11


Running Cascading Jobs

12

Production MySQL DBs ─→ Amazon Simple Storage Service ─→ Elastic MapReduce ─┬─→ Online Systems
                                                                            └─→ Redshift data warehouse


What do we do with the data?

• We use it to track how well the investment product is performing

• We use it to track how well the business is performing

• We use it to monitor our production systems

• We use it to test how well new features perform on the website

13


Bandit Testing

• When rolling new features out, we expose the new version to some users and the old version to the rest

• We monitor what percent of users “convert”: sign up, fund account, etc.

• We gradually send more traffic to the winning variant of the experiment

• Similar to A/B testing, but way faster

14

Does anyone know where the name bandit testing comes from?


Thompson Sampling

1. Estimate the probability for each variant of the experiment that it performs best, using Bayesian inference

2. Weight the percentage of traffic sent to each variant according to this probability

3. End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1%

4. In 2012, Kaufmann et al. proved the asymptotic optimality of Thompson sampling
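
A minimal simulation of steps 1 and 2, assuming Bernoulli conversions with a Beta(1,1) prior per variant (the true rates, seed, and visitor count are invented for illustration). Since the Java standard library has no Beta sampler, this sketch builds one from a Marsaglia–Tsang Gamma sampler:

```java
import java.util.*;

public class ThompsonDemo {
    static final Random RNG = new Random(7);

    // Gamma(shape a >= 1, scale 1) via the Marsaglia-Tsang (2000) method
    static double gamma(double a) {
        double d = a - 1.0 / 3.0, c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = RNG.nextGaussian();
            double v = 1.0 + c * x;
            if (v <= 0) continue;
            v = v * v * v;
            double u = RNG.nextDouble();
            if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) return d * v;
        }
    }

    // Beta(a, b) sampled as Gamma(a) / (Gamma(a) + Gamma(b))
    static double beta(double a, double b) {
        double x = gamma(a);
        return x / (x + gamma(b));
    }

    public static void main(String[] args) {
        double[] trueRate = {0.04, 0.05};       // hidden conversion rates (invented)
        int[] wins = new int[2], losses = new int[2];

        for (int visitor = 0; visitor < 50_000; visitor++) {
            // Steps 1-2: draw one sample from each variant's posterior and
            // route the visitor to the argmax; over many visitors this sends
            // traffic in proportion to each variant's probability of being best
            int arm = beta(wins[1] + 1, losses[1] + 1)
                    > beta(wins[0] + 1, losses[0] + 1) ? 1 : 0;
            if (RNG.nextDouble() < trueRate[arm]) wins[arm]++; else losses[arm]++;
        }
        double share1 = (wins[1] + losses[1]) / 50_000.0;
        System.out.println("traffic share of variant 1: " + share1);
    }
}
```

As evidence accumulates, the posteriors separate and the better variant soaks up most of the traffic, which is why this converges faster than a fixed 50/50 A/B split.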

16


What’s Redshift?

• Amazon’s cloud-based data warehouse database

• To support ad hoc analysis, we copy all raw and computed data into Redshift

• It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes

17


What are the technical challenges?

• Testing complicated analytics computations is non-trivial

- We ended up writing a small library to make testing Cascading jobs simpler

• Running multiple Hadoop jobs on large datasets takes a long time

- We use Spark for prototyping, to get a speedup

• Your assumptions about the constraints on the data are always wrong

18


Where’s this heading?

• We have a unique collection of consumer web data and financial data

• There are many ways we can combine this data to make our product better

• Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns

19


How is this relevant?

• We use data flow as the primary model of computation

• While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases

• We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads

20


Disclosure

21


Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.