Intro to PySpark
Jason White, Shopify


DESCRIPTION

A presentation I gave at the Python User Group meetup on 28 May 2015, at Shopify's Ottawa headquarters.

TRANSCRIPT

  • Intro to PySpark (Jason White, Shopify)

  • What is PySpark? Python interface to Apache Spark

    Map/Reduce style distributed computing

    Spark itself is written natively in Scala

    Interfaces to Python and R are well-maintained

    Uses Py4J to bridge Python to the Java/Scala side
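
    The demos below assume an existing SparkContext named sc: the interactive pyspark shell creates one automatically, while a standalone script constructs its own, as sketched a little further on.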

  • PySpark Basics

    Distributed computing's basic premise:

    Data is big

    Program to process data is relatively small

    Send program to where the data lives

    Leave data on multiple nodes, scale horizontally

  • PySpark Basics

    Driver: e.g. my laptop

    Cluster Manager: YARN, Mesos, etc

    Workers: Containers spun up by Cluster Manager
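
    A rough sketch of how the driver wires these pieces together from Python (the settings here are illustrative, not from the talk): build a SparkConf that names the cluster manager and the resources to request, then construct a SparkContext.

    from pyspark import SparkConf, SparkContext

    # Illustrative configuration: 'yarn-client' asks YARN to manage the workers;
    # 'local[*]' would instead run everything inside the driver process.
    conf = (SparkConf()
            .setAppName('intro-to-pyspark')
            .setMaster('yarn-client')
            .set('spark.executor.instances', '4')   # worker containers to request
            .set('spark.executor.memory', '2g'))    # memory per worker
    sc = SparkContext(conf=conf)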

  • PySpark Basics

    RDD: Resilient Distributed Dataset

    Interface to parallelized data in the cluster

    Map, Filter, Reduce functions sent from driver, executed by workers on chunks of data in parallel
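
    A minimal illustration of that model (not from the deck): the lambdas below are defined on the driver, shipped to the workers, and applied to each worker's chunk of the parallelized list.

    numbers = sc.parallelize(range(1, 11))          # data split across the workers
    evens = numbers.filter(lambda n: n % 2 == 0)    # each worker filters its chunk
    doubled = evens.map(lambda n: n * 2)            # each worker maps its chunk
    total = doubled.reduce(lambda a, b: a + b)      # partial sums combined: 60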

  • PySpark: Hello World

    Classic Word Count problem

    How many times does each word appear in a given text?

    Approach: Each worker computes word counts independently, then results aggregated together

  • PySpark: Hello World (diagram)

    Input partitions:  "The brown dog"  |  "brown dog"

    Map:      (The, 1) (brown, 1) (dog, 1)  |  (brown, 1) (dog, 1)

    Shuffle:  (The, 1) (dog, 1) (dog, 1)  |  (brown, 1) (brown, 1)

    Reduce:   (The, 1) (dog, 2)  |  (brown, 2)

    Collect:  (The, 1) (dog, 2) (brown, 2)

  • Demo

    # example 1
    text = "the brown dog jumped over the other brown dog"
    text_rdd = sc.parallelize(text.split(' '))
    text_rdd.map(lambda word: (word, 1)) \
        .reduceByKey(lambda left, right: left + right) \
        .collect()
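
    For this input, collect() brings the reduced pairs back to the driver: six (word, count) tuples, ('the', 2), ('brown', 2), ('dog', 2), ('jumped', 1), ('over', 1) and ('other', 1), in no guaranteed order.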

    # example 2
    import string
    # note: string.letters is Python 2 only; on Python 3 use string.ascii_letters
    time_machine = sc.textFile('/user/jasonwhite/time_machine')
    time_machine_tuples = time_machine.flatMap(lambda line: line.lower().split(' ')) \
        .map(lambda word: ''.join(ch for ch in word if ch in string.letters)) \
        .filter(lambda word: word != '') \
        .map(lambda word: (word, 1))

    word_counts = time_machine_tuples.reduceByKey(lambda left, right: left + right)

  • Monoids

    Monoids are combinations of:

    a set of data; and

    an associative, commutative function

    Very efficient in M/R, strongly preferred

    Examples:

    addition of integers

    min/max of records by timestamp
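
    Both listed examples fit a one-line reduce. A small sketch (not from the deck), using records shaped like those in the later demos:

    import operator
    from datetime import date

    sc.parallelize([1, 2, 3, 4]).reduce(operator.add)   # addition of integers: 10

    records = sc.parallelize([
        {'id': 1, 'timestamp': date(1978, 3, 2)},
        {'id': 2, 'timestamp': date(1984, 3, 24)},
    ])
    # min/max of records by timestamp: both operations are associative and commutative
    earliest = records.reduce(lambda a, b: a if a['timestamp'] < b['timestamp'] else b)
    latest = records.reduce(lambda a, b: a if a['timestamp'] > b['timestamp'] else b)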

  • Demo

    # example 3
    dataset = sc.parallelize([
        {'id': 1, 'value': 1},
        {'id': 2, 'value': 2},
        {'id': 2, 'value': 6}
    ])

    def add_tuples(left, right):
        left_sum, left_count = left
        right_sum, right_count = right
        return (left_sum + right_sum, left_count + right_count)

    # overall (sum, count) across the whole dataset
    averages = dataset.map(lambda d: (d['value'], 1)) \
        .reduce(add_tuples)

    # per-key average (tuple unpacking in lambdas is Python 2 syntax)
    averages_by_key = dataset.map(lambda d: (d['id'], (d['value'], 1))) \
        .reduceByKey(add_tuples) \
        .map(lambda (key, (sum, count)): (key, sum * 1.0 / count))

  • Demo

    # example 4
    from datetime import date
    dataset = sc.parallelize([
        {'id': 1, 'group_id': 10, 'timestamp': date(1978, 3, 2)},
        {'id': 2, 'group_id': 10, 'timestamp': date(1984, 3, 24)},
        {'id': 3, 'group_id': 10, 'timestamp': date(1986, 5, 19)},
        {'id': 4, 'group_id': 11, 'timestamp': date(1956, 6, 5)},
        {'id': 5, 'group_id': 11, 'timestamp': date(1953, 2, 21)},
    ])

    def calculate_age(d):
        # timedelta.days is an attribute, not a method
        d['age'] = (date.today() - d['timestamp']).days
        return d

    def calculate_group_stats(left, right):
        earliest = min(left['earliest'], right['earliest'])
        latest = max(left['latest'], right['latest'])
        total_age = left['total_age'] + right['total_age']
        count = left['count'] + right['count']
        return {'earliest': earliest, 'latest': latest,
                'total_age': total_age, 'count': count}

    group_stats = dataset.map(calculate_age) \
        .map(lambda d: (d['group_id'], {'earliest': d['timestamp'],
                                        'latest': d['timestamp'],
                                        'total_age': d['age'],
                                        'count': 1})) \
        .reduceByKey(calculate_group_stats)

  • Joining RDDs

    Like many RDD operations, works on (k, v) pairs

    Each side shuffled using common keys

    Each node builds its part of the joined dataset

  • Joining RDDs (diagram)

    Left dataset:   {id: 1, field1: foo}  {id: 2, field1: bar}
    Right dataset:  {id: 1, field2: baz}  {id: 2, field2: baz}

    Keyed left:     (1, {id: 1, field1: foo})  (2, {id: 2, field1: bar})
    Keyed right:    (1, {id: 1, field2: baz})  (2, {id: 2, field2: baz})

    Joined:         (1, ({id: 1, field1: foo}, {id: 1, field2: baz}))
                    (2, ({id: 2, field1: bar}, {id: 2, field2: baz}))

  • Demo

    # example 5
    first_dataset = sc.parallelize([
        {'id': 1, 'field1': 'foo'},
        {'id': 2, 'field1': 'bar'},
        {'id': 2, 'field1': 'baz'},
        {'id': 3, 'field1': 'foo'}
    ])
    first_dataset = first_dataset.map(lambda d: (d['id'], d))

    second_dataset = sc.parallelize([
        {'id': 1, 'field2': 'abc'},
        {'id': 2, 'field2': 'def'}
    ])
    second_dataset = second_dataset.map(lambda d: (d['id'], d))

    output = first_dataset.join(second_dataset)

  • Key Skew

    Achilles heel of M/R: key skew

    Shuffle phase distributes like keys to like nodes

    If billions of rows are shuffled to the same node, that node can run out of memory

  • Joining RDDs w/ Skew

    When joining to a small RDD, an alternative is to broadcast it

    Instead of shuffling, the entire small RDD is sent to each worker

    Now each worker has all data needed

    Each join is now just a map. No shuffle needed!

  • Demo

    # example 6
    first_dataset = sc.parallelize([
        {'id': 1, 'field1': 'foo'},
        {'id': 2, 'field1': 'bar'},
        {'id': 2, 'field1': 'baz'},
        {'id': 3, 'field1': 'foo'}
    ])
    first_dataset = first_dataset.map(lambda d: (d['id'], d))

    second_dataset = sc.parallelize([
        {'id': 1, 'field2': 'abc'},
        {'id': 2, 'field2': 'def'}
    ])
    second_dataset = second_dataset.map(lambda d: (d['id'], d))
    # send the small RDD to every worker as a plain dict
    second_dict = sc.broadcast(second_dataset.collectAsMap())

    # tuple unpacking in the def signature is Python 2 syntax
    def join_records((key, record)):
        if key in second_dict.value:
            yield (key, (record, second_dict.value[key]))

    output = first_dataset.flatMap(join_records)

  • Ordering

    Row order isn't guaranteed unless you explicitly sort the RDD

    But: sometimes you need to process events in order!

    Solution: repartitionAndSortWithinPartitions (see the sketch after the diagram below)

  • Ordering (diagram)

    Input partitions:
      {id: 1, value: 10} {id: 2, value: 10} {id: 3, value: 20}
      {id: 1, value: 12} {id: 1, value: 5}  {id: 2, value: 15}

    Shuffle & Sort:
      {id: 1, value: 5} {id: 1, value: 10} {id: 1, value: 12} {id: 3, value: 20}
      {id: 2, value: 10} {id: 2, value: 15}

    MapPartitions:
      {id: 1, interval: 5} {id: 1, interval: 2} {id: 2, interval: 5}
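
    A sketch of that pipeline (not from the deck): key each record by (id, value), partition on the id alone, sort within each partition, then walk each sorted partition to emit the interval between consecutive values for the same id.

    def compute_intervals(records):
        # records arrive sorted by (id, value) within this partition
        previous = {}
        for (record_id, value), _ in records:
            if record_id in previous:
                yield {'id': record_id, 'interval': value - previous[record_id]}
            previous[record_id] = value

    events = sc.parallelize([
        {'id': 1, 'value': 10}, {'id': 2, 'value': 10}, {'id': 3, 'value': 20},
        {'id': 1, 'value': 12}, {'id': 1, 'value': 5},  {'id': 2, 'value': 15},
    ])

    intervals = events.map(lambda d: ((d['id'], d['value']), None)) \
        .repartitionAndSortWithinPartitions(numPartitions=2,
                                            partitionFunc=lambda key: key[0]) \
        .mapPartitions(compute_intervals)
    # yields {id: 1, interval: 5}, {id: 1, interval: 2} and {id: 2, interval: 5}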

  • Thanks!