App Engine Batch Data Processing

View live notes and ask questions about this session on Google Wave http://bit.ly/ds6t9F

Upload: atif-gulzar

Post on 07-Apr-2018

TRANSCRIPT

Page 1: App Engine Batch Data Processing

8/6/2019 App Engine Batch Data Processing

http://slidepdf.com/reader/full/app-engine-batch-data-processing 1/48

View live notes and ask questions about this session on Google Wave

http://bit.ly/ds6t9F

Page 3: App Engine Batch Data Processing

Batch Data Processing with Google App Engine

Mike Aizatsky
20 May 2010

Page 5: App Engine Batch Data Processing

Agenda

The challenge

Early batch processing in App Engine

Batch processing with Task Queues

Batch processing at Google

App Engine approach

Page 6: App Engine Batch Data Processing

Batch Data Processing

Processing thousands of entities is hard on App Engine

Some examples:

schema migration

data export

report generation

Page 7: App Engine Batch Data Processing

What Makes It Hard?

App Engine imposes certain restrictions

Restrictions guarantee automatic scalability

Page 8: App Engine Batch Data Processing

What Makes It Hard?

30s request limit

Transient errors in the system

Datastore latency and timeouts

Changing dataset

Page 9: App Engine Batch Data Processing

Early Batch Processing

Page 10: App Engine Batch Data Processing

Early Batch Processing

Define batch web handler:

    class BatchHandler(...):
        def get(self):
            self.processNextBatch()

Use a web page with auto-refresh or use curl:

    while true; do curl http://batch_url; done

Trivia: Was actually done by the App Engine team on the day of release

Page 11: App Engine Batch Data Processing

Challenges

Need an external "driver" computer (may fail too)

Difficult error handling and recovery

Slow, inefficient

Complex state management

Page 12: App Engine Batch Data Processing

Possible Improvements

Communicate state with the driver 

Sharding/parallel execution

Use remote api for complex scenarios

Typical Example: bulkloader from App Engine SDK

Page 13: App Engine Batch Data Processing

Batch Processing With Task Queues

Page 14: App Engine Batch Data Processing

Batch Processing with Task Queues

Task Queues released a year ago

Can perform work outside of a user request

Reliable, high-performance system

Still 30s limit

Page 15: App Engine Batch Data Processing

Task Chaining

Simple technique of overcoming 30s limit

Task enqueues its continuation when it's close to the 30s limit

Task 1 -> Task 2 -> Task 3 -> Task 4

Page 16: App Engine Batch Data Processing

Task Chaining

Define batch handler:

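    class BatchHandler(...):
        def post(self):  # Task Queue delivers tasks via POST by default
            # Process one batch and remember where the next one starts.
            next_starting_point = self.performNextBatch(
                self.request.get("starting_point"))
            # Enqueue the continuation; url must be passed by keyword.
            taskqueue.Task(
                url="/batch",
                params={"starting_point": next_starting_point}
            ).add("batch_queue")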

To start the process, simply enqueue the first task
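The chaining idea can be tried without App Engine at all. The following is a minimal sketch that replaces the task queue with an in-memory deque; BATCH_SIZE, perform_next_batch and run_chain are hypothetical names made up for this illustration and are not part of any App Engine API.

```python
from collections import deque

BATCH_SIZE = 3  # hypothetical: number of items one task may handle


def perform_next_batch(data, start, processed):
    """Process one batch; return the next starting point, or None when done."""
    end = min(start + BATCH_SIZE, len(data))
    processed.extend(item.upper() for item in data[start:end])
    return end if end < len(data) else None


def run_chain(data):
    """Drain an in-memory queue where each task enqueues its own continuation."""
    queue = deque([0])  # enqueue the first task to start the process
    processed, tasks_run = [], 0
    while queue:
        start = queue.popleft()
        tasks_run += 1
        next_start = perform_next_batch(data, start, processed)
        if next_start is not None:
            queue.append(next_start)  # the continuation task
    return processed, tasks_run
```

In this sketch a batch ends after a fixed number of items; a real handler would more likely stop when it gets close to the request deadline.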

Page 17: App Engine Batch Data Processing

Nice Task Queue Properties

Guarantees eventual task execution

No need for external drivers

Repeats task execution in case of unhandled failures

Can limit execution rate (both manually and automatically)

Page 18: App Engine Batch Data Processing

Batch Processing At Google

Page 19: App Engine Batch Data Processing

Batch Processing At Google

MapReduce has been used successfully for years to do batch processing at Google scale

Created to help developers work with unreliable distributed systems

Widely adopted

Page 20: App Engine Batch Data Processing

MapReduce

Developer defines only 2 functions:

    map(entity) -> [(key, value)]
    reduce(key, [value]) -> [value]
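To make the two signatures concrete, here is a small single-process sketch that counts words. It only illustrates the shape of map and reduce; run_mapreduce, map_fn and reduce_fn are hypothetical names, and nothing here reflects how Google's MapReduce or the App Engine library is implemented.

```python
from collections import defaultdict


def map_fn(entity):
    """map(entity) -> [(key, value)]: emit one pair per word."""
    return [(word, 1) for word in entity.split()]


def reduce_fn(key, values):
    """reduce(key, [value]) -> [value]: sum the counts for one key."""
    return [sum(values)]


def run_mapreduce(entities, map_fn, reduce_fn):
    """Apply map to every entity, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for entity in entities:
        for key, value in map_fn(entity):
            groups[key].append(value)  # the "shuffle" step: group by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

For example, run_mapreduce(["a b a", "b c"], map_fn, reduce_fn) groups the emitted pairs by word before summing them.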

Page 21: App Engine Batch Data Processing

MapReduce Special Cases

Schema migration: empty reduce, update in map

Report generation: reduce generates new entities

Page 22: App Engine Batch Data Processing

App Engine & Google's MapReduce

We want you to use MapReduce too!

There are some unique challenges

Page 23: App Engine Batch Data Processing

App Engine & Google's MapReduce

Additional scaling dimension:

Lots and lots of applications

Many of them will run MapReduce at the same time

Isolation: one application shouldn't influence the performance of another

Page 24: App Engine Batch Data Processing

App Engine & Google's MapReduce

Rate limiting: you don't want to burn a whole day's resources in 15 minutes and kill your online traffic

Very slow execution: free apps want to go really slow, staying under their resource limit

Protection: from malicious App Engine users

Page 25: App Engine Batch Data Processing

App Engine Approach

Page 26: App Engine Batch Data Processing

App Engine Approach

We already have a system that solves most of these problems: Task Queue

Decided to build MapReduce on top of Task Queue

Some additional services will have to be developed

Page 27: App Engine Batch Data Processing

Mapper Library for App Engine

Early experimental release

Reliable, fast and efficient way to iterate over datastore or blob files

Part 1 of the MapReduce story

You can start playing with it while we're working on the full MapReduce

http://mapreduce.appspot.com/

Page 28: App Engine Batch Data Processing

Mapper Library Features

Completely user-space. Just pull into your project.

OSS (Apache 2.0). Hack, modify, play around. Patches welcome!

Python today & Java soon

API is very familiar to Hadoop/Dumbo users

Page 29: App Engine Batch Data Processing

Mapper Library Features

Automatic sharding for faster execution

Automatic rate limiting for slow execution

Status pages

Counters

Parameterized mappers

Batching datastore operations

Iterating over blob data

Page 30: App Engine Batch Data Processing

Demo

Page 31: App Engine Batch Data Processing

Adding Mapper Library To Your Project

Checkout library from svn into your project folder 

Add 1 handler to app.yaml:

    - url: /mapreduce(/.*)?
      script: <script_location>/main.py
      login: admin

Page 32: App Engine Batch Data Processing

Defining Mapper 

Define map function:

    def process(entity):
        doSomethingWithEntity(entity)

Register in mapreduce.yaml:

    mapreduce:
    - name: Test Mapper
      mapper:
        input_reader: mapreduce.DatastoreInputReader
        handler: model.Entity

Open http://<your_app>/mapreduce and start your mapper 

Python

Page 33: App Engine Batch Data Processing

Defining Mapper 

    public class ExampleMapper extends
            AppEngineMapper<Key, Entity> {
        @Override
        public void map(Key k, Entity e, Context ctx) {
            processEntity(e);
        }
    }

Java

Page 34: App Engine Batch Data Processing

Example: Datastore Operations

    def user_ages(user):
        migrate_user(user)
        yield op.db.Put(user)

Page 35: App Engine Batch Data Processing

Example: Counters

    def user_ages(user):
        yield op.counters.Increment("age-%d" % user.age)
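The mapper above does not update a counter itself; it yields an operation object and leaves it to the framework to apply it. The sketch below imitates that pattern in plain Python. The Increment class and run_mapper function are hypothetical stand-ins written for this illustration, not the library's own classes, and users are plain dicts rather than datastore entities.

```python
from collections import Counter


class Increment:
    """Hypothetical stand-in for a counter operation yielded by a mapper."""

    def __init__(self, name, delta=1):
        self.name = name
        self.delta = delta


def user_ages(user):
    """Mapper: describe the counter update instead of performing it."""
    yield Increment("age-%d" % user["age"])


def run_mapper(mapper, entities):
    """Call the mapper on each entity and apply every yielded operation."""
    counters = Counter()
    for entity in entities:
        for operation in mapper(entity):
            counters[operation.name] += operation.delta
    return counters
```

Because the mapper only describes work, the caller is free to batch or rate-limit the yielded operations however it likes.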

Page 36: App Engine Batch Data Processing

Example: More Complex Reports

    def orders_total(customer):
        orders = Order.by_customer(customer)
        total = sum_orders(orders)
        report_line = ReportLine(
            report_id, customer, total)
        yield op.db.Put(report_line)

Page 37: App Engine Batch Data Processing

Some Implementation Details

Uses task queue chaining

2 types of flows: controller flow & worker flow

Uses datastore for state storage and communication

Page 38: App Engine Batch Data Processing

Important point

We handle most of the task chaining complexity

You need to handle only one thing yourself: idempotence

Page 39: App Engine Batch Data Processing

Idempotence

f(f(x)) = f(x)

Means: your batch handler should be ready to process the same entity twice

The most important property of batch operation

You should always think about idempotence

Page 40: App Engine Batch Data Processing

Practical Idempotence: Data Migration

Not Idempotent:

    def migrate(entity):
        yield op.counters.Increment("updated")
        entity.property = compute_property(entity)
        yield op.db.Put(entity)

Page 41: App Engine Batch Data Processing

Practical Idempotence: Data Migration

Idempotent:

    def migrate(entity):
        if entity.property_updated:
            return
        yield op.counters.Increment("updated")
        entity.property = compute_property(entity)
        entity.property_updated = True
        yield op.db.Put(entity)
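The effect of the guard flag can be checked with a plain-Python sketch. Entities are dicts here, and compute_property is a hypothetical unit conversion chosen because running it twice would visibly corrupt the value; neither reflects the real datastore model.

```python
def compute_property(entity):
    """Hypothetical conversion: running it twice would multiply by 100 twice."""
    return entity["property"] * 100


def migrate(entity, counters):
    """Idempotent migration: a second call on the same entity changes nothing."""
    if entity.get("property_updated"):
        return
    counters["updated"] = counters.get("updated", 0) + 1
    entity["property"] = compute_property(entity)
    entity["property_updated"] = True


entity = {"property": 5}
counters = {}
migrate(entity, counters)
migrate(entity, counters)  # simulated retry of the same task
```

After the simulated retry the value has been converted once and the counter has been incremented once.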

Page 42: App Engine Batch Data Processing

Practical Idempotence: Reports

Not Idempotent:

    def report(customer):
        # .....
        report_line = ReportLine(
            report_id, customer, total)
        yield op.db.Put(report_line)

Page 43: App Engine Batch Data Processing

Practical Idempotence: Reports

Idempotent:

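    def report(customer):
        # .....
        # Deterministic key: the same report and customer always map to one entity.
        key_name = "%s-%s" % (report_id, customer.id)
        if ReportLine.get_by_key_name(key_name):
            return
        # Keyword arguments must follow positional ones.
        report_line = ReportLine(
            report_id, customer, total,
            key_name=key_name)
        yield op.db.Put(report_line)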
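The same trick can be sketched with a dict standing in for the datastore: the key is derived from the report and customer, so a retried task finds the existing line instead of adding a second one. The put_report_line helper and the dict store are assumptions made for this illustration only.

```python
def put_report_line(store, report_id, customer_id, total):
    """Write one report line under a key derived from its inputs."""
    key_name = "%s-%s" % (report_id, customer_id)
    if key_name in store:
        return False  # already written by an earlier attempt
    store[key_name] = {
        "report_id": report_id,
        "customer": customer_id,
        "total": total,
    }
    return True


store = {}
first = put_report_line(store, "r1", 42, 99.5)
retry = put_report_line(store, "r1", 42, 99.5)  # simulated task retry
```

Even without the existence check, writing under a deterministic key would overwrite the earlier line rather than duplicate it; the check only saves the redundant write.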

Page 45: App Engine Batch Data Processing

Practical Idempotence: Counters

No easy way to achieve

Arguably not needed

Page 46: App Engine Batch Data Processing

Idempotence And Changing Data

Most reports over live data are approximate

Approximate reports are OK for most cases

Margin of error should be quite small due to the way the mapper library is implemented

Page 47: App Engine Batch Data Processing

Summary

Mapper library available today

Reliable, fast and efficient way to iterate over datastore or blob files

Java & Python

Fully OSS

http://mapreduce.appspot.com/
