data engineering 101: building your first data product by jonathan dinu pydata sv 2014

121
Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex @ZipfianAcademy Data Engineering 101: Building your first data product May 4th, 2014

Upload: pydata

Post on 27-Jan-2015

108 views

Category:

Technology


5 download

DESCRIPTION

Often times there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data driven companies/applications, it is more prescient now than ever to be able to automate your analyses to personalize your users experiences. LinkedIn's People you May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company. As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general purpose performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing it's first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, cover some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be. https://github.com/Jay-Oh-eN/pydatasv2014

TRANSCRIPT

Page 1: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

@ZipfianAcademy

Data Engineering 101: Building your first data product

May 4th, 2014

Page 2: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 3: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Formerly

Questions? tweet @zipfianacademy #pydata

Page 4: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Formerly

Questions? tweet @zipfianacademy #pydata

Page 5: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Currently

Questions? tweet @zipfianacademy #pydata

Page 6: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today Disclaimer:

All characters appearing in this presentation are

fictitious. Any resemblance to real persons, living

or dead, is purely coincidental.

Questions? tweet @zipfianacademy #pydata

Page 7: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today Disclaimer:

This presentation contains strong opinions that

you may or may not agree with. All thoughts are

my own.

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

Questions? tweet @zipfianacademy #pydata

Page 8: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Creating Value for Users

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 9: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

nwsrdr (News Reader)

Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png

OR

nwsrdr+ nwrsrdr

+ nwrsrdr

+ nwrsrdr

nwsrdr

getnews.com/bookmarklet

When browsing the web simply click the +nwsrdr to save any page to nwsrdr

Get nwsrdr on your desktop

Questions? tweet @zipfianacademy #pydata

Page 10: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

nwsrdr

• Auto-categorize Articles

• Find Similar Articles

• Recommend articles

• Suggest Feeds to Follow

• No Ads!

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

Questions? tweet @zipfianacademy #pydata

Page 11: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

nwsrdr

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model!

Questions? tweet @zipfianacademy #pydata

Page 12: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 13: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Product Built on Data(that you sell)

Questions? tweet @zipfianacademy #pydata

Page 14: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Product that Generates Data

Questions? tweet @zipfianacademy #pydata

Page 15: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Product that Generates Data(that you sell)

Questions? tweet @zipfianacademy #pydata

Page 16: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Product that Generates Data(that you sell)

i.e. Facebook

Questions? tweet @zipfianacademy #pydata

Page 17: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Questions? tweet @zipfianacademy #pydata Source: http://gifgif.media.mit.edu/

Page 18: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Products

Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata

Page 19: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Generating Products

Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata

Page 20: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Products that enhance a users’ experience the more “data” a user

provides

Data Generating Products

Ex: Recommender Systems

Questions? tweet @zipfianacademy #pydata

Page 21: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 22: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Science

Questions? tweet @zipfianacademy #pydata

Page 23: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

i.e. solve more problems than you create

Data Science

Questions? tweet @zipfianacademy #pydata

Page 24: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg

But.... How?!?!?!!?

Data Science

Questions? tweet @zipfianacademy #pydata

Page 25: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Data Engineering

Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG

Questions? tweet @zipfianacademy #pydata

Page 26: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Data Engineering

Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG

!

Questions? tweet @zipfianacademy #pydata

Page 27: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

OR

Data Engineering

Questions? tweet @zipfianacademy #pydata

Page 28: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Prepared Data

Test Set

Training Set Train

ModelSampling

EvaluateCross

Validation

Data Science

Questions? tweet @zipfianacademy #pydata

Page 29: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Raw Data

Cleaned Data

Scrubbing

Prepared DataVectorization

New Data

Test Set

Training Set Train

ModelSampling

EvaluateCross

Validation

Cleaned Data

Prepared DataVectorizationScrubbing

Predict

Labels/Classes

Data Engineering

Questions? tweet @zipfianacademy #pydata

Page 30: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 31: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

What

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model

Questions? tweet @zipfianacademy #pydata

Page 32: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

nwsrdr

• Auto-categorize Articles

• Find Similar Articles

• Recommend articles

• No Ads!

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

Questions? tweet @zipfianacademy #pydata

Page 33: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

nwsrdr

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model!

Questions? tweet @zipfianacademy #pydata

Page 34: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg

Abstraction (Cake)

How

(ABK)

Questions? tweet @zipfianacademy #pydata

Page 35: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 36: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 37: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 38: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

At Scale Locally

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

Flask

yHat

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 39: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 40: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Pipeline

Iteration 0:

• Find out how much data

• Run locally

• Experiment

Questions? tweet @zipfianacademy #pydata

Page 41: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Acquire

Page 42: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Retrieve Meta-data for ALL NYT articles

Questions? tweet @zipfianacademy #pydata

Acquire

Page 43: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 44: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

# parse resulting JSON and insert into a mongoDB collection!for content in api.json()['response']['docs']:! if not collection.find_one(content):! collection.insert(content)! !!# only returns 10 per page!"There are only %i docuemtns returned 0_o" % \!! len(api.json()[‘response']['docs'])!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 45: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

# there are many more than 10 articles however!total_art = articles_left = api.json()['response']['meta']['hits']!!!print "There are currently %s articles in the NYT archive" % total_art!!!#=> There are currently 15277775 articles in the NYT archive

Questions? tweet @zipfianacademy #pydata

Acquire

Page 46: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Gotchas!

• Rate Limiting

• Page Limiting

Questions? tweet @zipfianacademy #pydata

Acquire

Page 47: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Iterate

Iteration 1:

• (Meaningful) Sample of Data

• Prototype — “Close the Loop”

Questions? tweet @zipfianacademy #pydata

Page 48: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Retrieve Meta-data for ALL NYT articles

Questions? tweet @zipfianacademy #pydata

Acquire

(take 2)

Page 49: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

# let us loop (and hopefully not hit our rate limit)!

while articles_left > 0 and page_count < max_pages:!

more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))!

print "Inserting page " + str(page)!

# make sure it was successful!

if more_articles.status_code == 200:!

for content in more_articles.json()['response']['docs']:!

latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")!

if not collection.find_one(content) and content['document_type'] == 'article':!

print "No dups"!

try:!

print "Inserting article " + str(content['headline'])!

collection.insert(content)!

except errors.DuplicateKeyError:!

print "Duplicates"!

continue!

else:!

print "In collection already”!

! ! …

Iteration 0.5

Questions? tweet @zipfianacademy #pydata

Acquire

Page 50: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

articles_left -= 10! page += 1! page_count += 1! cursor_count += 1! final_page = max(final_page, page)! else:! if more_articles.status_code == 403:! print "Sleepy..."! # account for rate limiting! time.sleep(2)! elif cursor_count > 100:! print "Adjusting date”!! ! ! ! # account for page limiting! cursor_count = 0! page = 0! last_date = latest_article! else:! print "ERRORS: " + str(more_articles.status_code)! cursor_count = 0! page = 0! last_date = latest_article!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 51: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Download HTML content of

articles from NYT.com

Questions? tweet @zipfianacademy #pydata

Acquire

(and store in MongoDB™)

Page 52: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquire# now we can get some content!!#limit = 100!limit = 10000!!for article in collection.find({'html' : {'$exists' : False} }):! if limit and limit > 0:! if not article.has_key('html') and article['document_type'] == 'article':! limit -= 1! print article['web_url']! html = requests.get(article['web_url'] + "?smid=tw-nytimes")! ! if html.status_code == 200:! soup = BeautifulSoup(html.text)! ! # serialize html! collection.update({ '_id' : article['_id'] }, { '$set' : !! ! ! ! ! ! ! ! ! ! ! ! ! { 'html' : unicode(soup), 'content' : [] } !! ! ! ! ! ! ! ! ! ! ! ! } )! ! for p in soup.find_all('div', class_='articleBody'):! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() !! ! ! ! ! ! ! ! ! ! ! ! ! } })!

Questions? tweet @zipfianacademy #pydata

Page 53: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Parse

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Flask

yHat

Locally

Questions? tweet @zipfianacademy #pydata

Page 54: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Parse HTML with BeautifulSoup

and Extract the article Body

Questions? tweet @zipfianacademy #pydata

(and store in MongoDB™)

Parse

Page 55: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

# parse HTML content of articles!

for article in collection.find({'html' : {'$exists' : True} }):!

print article['web_url']!

soup = BeautifulSoup(article['html'], 'html.parser')!

arts = soup.find_all('div', class_='articleBody')!

!

if len(arts) == 0:!

arts = soup.find_all('p', class_=‘story-body-text')!

!! ! …

Questions? tweet @zipfianacademy #pydata

Parse

Page 56: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Store

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 57: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

for p in arts:! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() } !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! })!

Questions? tweet @zipfianacademy #pydata

Store

Page 58: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Explore

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 59: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Exploratory Data Analysis with pandas

Questions? tweet @zipfianacademy #pydata

Explore

Page 60: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

articles.describe()!# ! ! text section!# count 1405 1405!# unique 1397 10!!fig = plt.figure()!# histogram of section counts!articles['section'].value_counts().plot(kind='bar')

Questions? tweet @zipfianacademy #pydata

Explore

Page 61: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Questions? tweet @zipfianacademy #pydata

Explore

Page 62: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

error with NYT API

Questions? tweet @zipfianacademy #pydata

Explore

Page 63: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!

Questions? tweet @zipfianacademy #pydata

Explore

error with NYT API

Page 64: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Vectorize

Page 65: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Tokenize article text and

create feature vectors with NLTK

Questions? tweet @zipfianacademy #pydata

Vectorize

Page 66: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Vectorize

wnl = nltk.WordNetLemmatizer()!!def tokenize_and_normalize(chunks):! words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]! flatten = [ inner for sublist in words for inner in sublist ]! stripped = [] !! for word in flatten: ! if word not in stopwords.words('english'):! try:! stripped.append(word.encode('latin-1').decode('utf8').lower())! except:! print "Cannot encode: " + word! ! no_punks = [ word for word in stripped if len(word) > 1 ] ! return [wnl.lemmatize(t) for t in no_punks]!

Questions? tweet @zipfianacademy #pydata

Page 67: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Train

Page 68: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Train and score a model with scikit-learn

Questions? tweet @zipfianacademy #pydata

Train

Page 69: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

# cross validate!from sklearn.cross_validation import train_test_split!!xtrain, xtest, ytrain, ytest = !! ! ! ! ! ! ! train_test_split(X, labels, test_size=0.3)!!# train a model!alpha = 1!multi_bayes = MultinomialNB(alpha=alpha)!!multi_bayes.fit(xtrain, ytrain)!multi_bayes.score(xtest, ytest)

Questions? tweet @zipfianacademy #pydata

Train

Page 70: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Gotchas!

• Model only exists locally on Laptop

• Not Automated for realtime prediction

Questions? tweet @zipfianacademy #pydata

Train

Page 71: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Exposé

Questions? tweet @zipfianacademy #pydata

Page 72: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Iteration 2:

• Expose your model

• Automate your processes

Questions? tweet @zipfianacademy #pydata

Exposé

Page 73: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Getting that model off your lap(top)

Questions? tweet @zipfianacademy #pydata

Exposé

Page 74: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg

Questions? tweet @zipfianacademy #pydata

Exposé

Page 75: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

A model is just a function

Questions? tweet @zipfianacademy #pydata

Exposé

Page 76: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Inputs...

Questions? tweet @zipfianacademy #pydata

Exposé

Page 77: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Outputs...

Questions? tweet @zipfianacademy #pydata

Exposé

Page 78: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Serialize your model with pickle (or cPickle or joblib)

Questions? tweet @zipfianacademy #pydata

Persistence

Page 79: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0

Persistence

Questions? tweet @zipfianacademy #pydata

Page 80: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Persistence

SerDes

• Disk

• Database

• Memory

Questions? tweet @zipfianacademy #pydata

Page 81: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Exposé

Page 82: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Deploy your Model to yHat

Questions? tweet @zipfianacademy #pydata

Exposé

Page 83: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

class DocumentClassifier(YhatModel):! @preprocess(in_type=dict, out_type=dict)! def execute(self, data):! featureBody = vectorizer.transform([data['content']])! result = multi_bayes.predict(featureBody)! list_res = result.tolist()! return {"section_name": list_res}!!clf = DocumentClassifier()!yh = Yhat("[email protected]", “xxxxxx",!! ! ! ! ! ! ! ! ! ! ! ! ! "http://cloud.yhathq.com/")!yh.deploy("documentClassifier", DocumentClassifier, globals())

Questions? tweet @zipfianacademy #pydata

Exposé

Page 84: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask (on Heroku)

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Present

Page 85: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Create a Flask application to expose your model on the web

Questions? tweet @zipfianacademy #pydata

Present

Page 86: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")[email protected]('/')def index(): return app.send_static_file('index.html')[email protected]('/predict', methods=['POST'])def predict(): article = request.form['article'] results = yh.predict("documentClf", { 'content': article }) return jsonify({"results": results})

Questions? tweet @zipfianacademy #pydata

Present

Page 87: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Pipeline

Only Data should Flow

Questions? tweet @zipfianacademy #pydata

Page 88: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

DataRemember to Remember

(Lineage)

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 89: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Immutable append only set of Raw Data

Computation is a view on data

*Lambda Architecture by Nathan MarzQuestions? tweet @zipfianacademy #pydata

Pipeline

Page 90: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Functional Data Science

• Modularity

• Define interfaces

• Separate data from computation

• Data Lineage

Functional

Questions? tweet @zipfianacademy #pydata

Page 91: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Need Robust and Flexible Pipeline!

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 92: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Whatever you do, DO NOT cross the streams

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 93: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

NYT API

MongoDB

BeautifulSoup

Feature Matrixscikit-learn

Web App

ModelDeploy

yHat

HerokuPOST

Predict

Predicted Section

Where we are

NLTK

scikit-learn

Questions? tweet @zipfianacademy #pydata

Page 94: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Gotchas!

• Only have a static subset of articles

• Pipeline not automated for re-training

Questions? tweet @zipfianacademy #pydata

Gotchas

Page 95: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 96: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Iteration 3:

Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.pngQuestions? tweet @zipfianacademy #pydata

Iterate

Page 97: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

NYT API

MongoDB

cron

Feature Matrixscikit-learn

Web App

ModelDeploy

yHat

HerokuPOST

Predict

Predicted Section

Where we are

NLTK

scikit-learn

Questions? tweet @zipfianacademy #pydata

Amazon EC2

Page 98: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

testing

Start small (data) and fast

(development)

testing

Increase size of data set

Optimize and productionize

PROFIT!

$$$

Questions? tweet @zipfianacademy #pydata

How to Scale

Page 99: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

How to Scale

testing

Develop locally

testing

Distribute computation

(run on cluster)

Tune parameters

PROFIT!

$$$

Questions? tweet @zipfianacademy #pydata

Can also use a streaming algorithm or

single machine disk based “medium data”

technologies (i.e. database or memory

mapped files)

Page 100: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Products

If you build it...

Questions? tweet @zipfianacademy #pydata

Page 101: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg

Products

Questions? tweet @zipfianacademy #pydata

Page 102: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 103: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Q & A

Q&AQuestions? tweet @zipfianacademy #pydata

Page 104: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Zipfian Academy

@ZipfianAcademy

Data Science & Data Engineering 12-week Bootcamp (May 12th & Sep 8th)

Weekend Workshops

http://zipfianacademy.com/apply

http://zipfianacademy.com/workshops

Next: Interactive Visualizations w/ d3.js (June 7th)

Questions? tweet @zipfianacademy #pydata

Page 105: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Thank You!

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

@ZipfianAcademy

http://zipfianacademy.com

Questions? tweet @zipfianacademy #pydata

Page 106: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Appendix

Questions? tweet @zipfianacademy #pydata

Page 107: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Data Sources

Obtain(ranked by ease of use)

1. DaaS -- Data as a service

2. Bulk Download

3. APIs

4. Web Scraping

Questions? tweet @zipfianacademy #pydata

Page 108: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

DaaS(Data as a Service)

• Time Series/Numeric: Quandl

• Financial Modeling: Quantopian

• Email Contextualization: Rapleaf

• Location and POI: Factual

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 109: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Bulk Download(just like the good ol’ days)

• File Transfer Protocol (FTP): CDC

• Amazon Web Services: Public Datasets

• Infochimps: Data Marketplace

• Academia: UCI Machine Learning Repository

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 110: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

APIs(if it’s not RESTed, I’m not buying)

• Geographic: Foursquare

• Social: Facebook

• Audio: Rdio

• Content: Tumblr

• Realtime: Twitter

• Hidden: Yahoo Finance

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 111: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Web Scraping

1. wget and curl

2. Web Spider/Crawler

3. API scraping

4. Manual Download

(DIY for life)

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 112: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• Delimited Values

• TSV

• CSV

• WSV

• JSON

• XML

• Ad Hoc Formats (avoid these if you can)

Data Formats

Questions? tweet @zipfianacademy #pydata

Page 113: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• JSON is made up of hash tables and arrays • Hash tables: { “foo” : 1, “bar” : 2, baz : “3” } • Arrays: [1, 2, 3] • Arrays of arrays: [[1, 2, 3], [‘foo’, ‘bar’, ‘baz’]] • Array of hashes: [{‘foo’:1, ‘bar’:2}, {‘baz’:3}] • Hashes of hashes: {‘foo’: {‘bar’: 2, ‘baz’: 3}}

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 114: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

{"widget": {! "debug": "on",! "window": {! "title": "Sample Konfabulator Widget",! "name": "main_window",! "width": 500,! "height": 500! },! "image": { ! "src": "Images/Sun.png",! "name": "sun1",! "hOffset": 250,! "vOffset": 250,! "alignment": "center"! },! "text": {! "data": "Click Here",! "size": 36,! "style": "bold",! "name": "text1",! "hOffset": 250,! "vOffset": 100,! "alignment": "center",! "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"! }!}} !

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 115: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• XML is a recursive self-describing container <container>

<item>Foo</item> <item>Bar</item>

<container> <item attr=”SomethingAboutBaz”>Baz</item>

</container> </item>

<container>

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 116: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

<widget>! <debug>on</debug>! <window title="Sample Konfabulator Widget">! <name>main_window</name>! <width>500</width>! <height>500</height>! </window>! <image src="Images/Sun.png" name="sun1">! <hOffset>250</hOffset>! <vOffset>250</vOffset>! <alignment>center</alignment>! </image>! <text data="Click Here" size="36" style="bold">! <name>text1</name>! <hOffset>250</hOffset>! <vOffset>100</vOffset>! <alignment>center</alignment>! <onMouseUp>! sun1.opacity = (sun1.opacity / 100) * 90;! </onMouseUp>! </text>!</widget>!

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 117: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• Ad hoc data formats • Fixed-width (Census data) • Graph Edgelists • Voting records • etc.

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 118: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• 7-5-5 format •Sam foo bar!•Roger baz 6!•Jane 314 99

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 119: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• Directed Graph Format 1 2!

1 3!

1 4!

2 3!

4 4

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 120: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

• Directed Graph Format 1 2!

1 3!

1 4!

2 3!

4 4

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 121: Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Programming languages like Python, Ruby, and R have built in parsers for data formats such as

JSON and CSV. For other esoteric formats you will

probably have to write your own

Questions? tweet @zipfianacademy #pydata

Data Formats