data engineering 101: building your first data product
DESCRIPTION
Often times there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data driven companies/applications, it is more prescient now than ever to be able to automate your analyses to personalize your users experiences. LinkedIn's People you May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company. As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general purpose performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing it's first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, cover some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be.TRANSCRIPT
![Page 1: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/1.jpg)
Jonathan DinuCo-Founder, Zipfian [email protected]
@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first data product
May 4th, 2014
![Page 2: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/2.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 3: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/3.jpg)
Formerly
Questions? tweet @zipfianacademy #pydata
![Page 4: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/4.jpg)
Formerly
Questions? tweet @zipfianacademy #pydata
![Page 5: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/5.jpg)
Currently
Questions? tweet @zipfianacademy #pydata
![Page 6: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/6.jpg)
Today Disclaimer:
All characters appearing in this presentation are
fictitious. Any resemblance to real persons, living
or dead, is purely coincidental.
Questions? tweet @zipfianacademy #pydata
![Page 7: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/7.jpg)
Today Disclaimer:
This presentation contains strong opinions that
you may or may not agree with. All thoughts are
my own.
Jonathan DinuCo-Founder, Zipfian [email protected]
@clearspandex
Questions? tweet @zipfianacademy #pydata
![Page 8: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/8.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 9: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/9.jpg)
nwsrdr (News Reader)
Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
OR
nwsrdr+ nwrsrdr
+ nwrsrdr
+ nwrsrdr
nwsrdr
getnews.com/bookmarklet
When browsing the web simply click the +nwsrdr to save any page to nwsrdr
Get nwsrdr on your desktop
Questions? tweet @zipfianacademy #pydata
![Page 10: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/10.jpg)
nwsrdr
• Auto-categorize Articles
• Find Similar Articles
• Recommend articles
• Suggest Feeds to Follow
• No Ads!
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
Questions? tweet @zipfianacademy #pydata
![Page 11: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/11.jpg)
nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
Questions? tweet @zipfianacademy #pydata
![Page 12: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/12.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 13: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/13.jpg)
OR
Data Products
Product Built on Data(that you sell)
Questions? tweet @zipfianacademy #pydata
![Page 14: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/14.jpg)
OR
Data Products
Product that Generates Data
Questions? tweet @zipfianacademy #pydata
![Page 15: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/15.jpg)
OR
Data Products
Product that Generates Data(that you sell)
Questions? tweet @zipfianacademy #pydata
![Page 16: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/16.jpg)
OR
Data Products
Product that Generates Data(that you sell)
i.e. Facebook
Questions? tweet @zipfianacademy #pydata
![Page 17: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/17.jpg)
OR
Data Products
Questions? tweet @zipfianacademy #pydata Source: http://gifgif.media.mit.edu/
![Page 18: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/18.jpg)
OR
Data Products
Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
![Page 19: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/19.jpg)
OR
Data Generating Products
Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
![Page 20: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/20.jpg)
Products that enhance a users’ experience the more “data” a user
provides
Data Generating Products
Ex: Recommender Systems
Questions? tweet @zipfianacademy #pydata
![Page 21: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/21.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 22: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/22.jpg)
OR
Data Science
Questions? tweet @zipfianacademy #pydata
![Page 23: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/23.jpg)
i.e. solve more problems than you create
Data Science
Questions? tweet @zipfianacademy #pydata
![Page 24: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/24.jpg)
Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg
But.... How?!?!?!!?
Data Science
Questions? tweet @zipfianacademy #pydata
![Page 25: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/25.jpg)
Data Engineering
Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG
Questions? tweet @zipfianacademy #pydata
![Page 26: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/26.jpg)
Data Engineering
Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG
!
Questions? tweet @zipfianacademy #pydata
![Page 27: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/27.jpg)
OR
Data Engineering
Questions? tweet @zipfianacademy #pydata
![Page 28: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/28.jpg)
Prepared Data
Test Set
Training Set Train
ModelSampling
EvaluateCross
Validation
Data Science
Questions? tweet @zipfianacademy #pydata
![Page 29: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/29.jpg)
Raw Data
Cleaned Data
Scrubbing
Prepared DataVectorization
New Data
Test Set
Training Set Train
ModelSampling
EvaluateCross
Validation
Cleaned Data
Prepared DataVectorizationScrubbing
Predict
Labels/Classes
Data Engineering
Questions? tweet @zipfianacademy #pydata
![Page 30: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/30.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 31: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/31.jpg)
What
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model
Questions? tweet @zipfianacademy #pydata
![Page 32: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/32.jpg)
nwsrdr
• Auto-categorize Articles
• Find Similar Articles
• Recommend articles
• No Ads!
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
Questions? tweet @zipfianacademy #pydata
![Page 33: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/33.jpg)
nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
Questions? tweet @zipfianacademy #pydata
![Page 34: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/34.jpg)
Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg
Abstraction (Cake)
How
(ABK)
Questions? tweet @zipfianacademy #pydata
![Page 35: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/35.jpg)
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
requests
BeautifulSoup4
pandas
pymongo
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 36: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/36.jpg)
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
requests
BeautifulSoup4
pandas
pymongo
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 37: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/37.jpg)
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 38: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/38.jpg)
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
At Scale Locally
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
Flask
yHat
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 39: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/39.jpg)
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 40: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/40.jpg)
Pipeline
Iteration 0:
• Find out how much data
• Run locally
• Experiment
Questions? tweet @zipfianacademy #pydata
![Page 41: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/41.jpg)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 42: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/42.jpg)
Retrieve Meta-data for ALL NYT articles
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 43: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/43.jpg)
api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 44: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/44.jpg)
# parse resulting JSON and insert into a mongoDB collection!for content in api.json()['response']['docs']:! if not collection.find_one(content):! collection.insert(content)! !!# only returns 10 per page!"There are only %i docuemtns returned 0_o" % \!! len(api.json()[‘response']['docs'])!
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 45: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/45.jpg)
# there are many more than 10 articles however!total_art = articles_left = api.json()['response']['meta']['hits']!!!print "There are currently %s articles in the NYT archive" % total_art!!!#=> There are currently 15277775 articles in the NYT archive
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 46: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/46.jpg)
Gotchas!
• Rate Limiting
• Page Limiting
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 47: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/47.jpg)
Iterate
Iteration 1:
• (Meaningful) Sample of Data
• Prototype — “Close the Loop”
Questions? tweet @zipfianacademy #pydata
![Page 48: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/48.jpg)
Retrieve Meta-data for ALL NYT articles
Questions? tweet @zipfianacademy #pydata
Acquire
(take 2)
![Page 49: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/49.jpg)
# let us loop (and hopefully not hit our rate limit)!
while articles_left > 0 and page_count < max_pages:!
more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))!
print "Inserting page " + str(page)!
# make sure it was successful!
if more_articles.status_code == 200:!
for content in more_articles.json()['response']['docs']:!
latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")!
if not collection.find_one(content) and content['document_type'] == 'article':!
print "No dups"!
try:!
print "Inserting article " + str(content['headline'])!
collection.insert(content)!
except errors.DuplicateKeyError:!
print "Duplicates"!
continue!
else:!
print "In collection already”!
! ! …
Iteration 0.5
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 50: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/50.jpg)
articles_left -= 10! page += 1! page_count += 1! cursor_count += 1! final_page = max(final_page, page)! else:! if more_articles.status_code == 403:! print "Sleepy..."! # account for rate limiting! time.sleep(2)! elif cursor_count > 100:! print "Adjusting date”!! ! ! ! # account for page limiting! cursor_count = 0! page = 0! last_date = latest_article! else:! print "ERRORS: " + str(more_articles.status_code)! cursor_count = 0! page = 0! last_date = latest_article!
Questions? tweet @zipfianacademy #pydata
Acquire
![Page 51: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/51.jpg)
Download HTML content of
articles from NYT.com
Questions? tweet @zipfianacademy #pydata
Acquire
(and store in MongoDB™)
![Page 52: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/52.jpg)
Acquire# now we can get some content!!#limit = 100!limit = 10000!!for article in collection.find({'html' : {'$exists' : False} }):! if limit and limit > 0:! if not article.has_key('html') and article['document_type'] == 'article':! limit -= 1! print article['web_url']! html = requests.get(article['web_url'] + "?smid=tw-nytimes")! ! if html.status_code == 200:! soup = BeautifulSoup(html.text)! ! # serialize html! collection.update({ '_id' : article['_id'] }, { '$set' : !! ! ! ! ! ! ! ! ! ! ! ! ! { 'html' : unicode(soup), 'content' : [] } !! ! ! ! ! ! ! ! ! ! ! ! } )! ! for p in soup.find_all('div', class_='articleBody'):! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() !! ! ! ! ! ! ! ! ! ! ! ! ! } })!
Questions? tweet @zipfianacademy #pydata
![Page 53: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/53.jpg)
Parse
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
scikit-learn/NLTK
Flask
yHat
Locally
Questions? tweet @zipfianacademy #pydata
![Page 54: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/54.jpg)
Parse HTML with BeautifulSoup
and Extract the article Body
Questions? tweet @zipfianacademy #pydata
(and store in MongoDB™)
Parse
![Page 55: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/55.jpg)
# parse HTML content of articles!
for article in collection.find({'html' : {'$exists' : True} }):!
print article['web_url']!
soup = BeautifulSoup(article['html'], 'html.parser')!
arts = soup.find_all('div', class_='articleBody')!
!
if len(arts) == 0:!
arts = soup.find_all('p', class_=‘story-body-text')!
!! ! …
Questions? tweet @zipfianacademy #pydata
Parse
![Page 56: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/56.jpg)
Store
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 57: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/57.jpg)
for p in arts:! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() } !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! })!
Questions? tweet @zipfianacademy #pydata
Store
![Page 58: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/58.jpg)
Explore
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
![Page 59: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/59.jpg)
Exploratory Data Analysis with pandas
Questions? tweet @zipfianacademy #pydata
Explore
![Page 60: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/60.jpg)
articles.describe()!# ! ! text section!# count 1405 1405!# unique 1397 10!!fig = plt.figure()!# histogram of section counts!articles['section'].value_counts().plot(kind='bar')
Questions? tweet @zipfianacademy #pydata
Explore
![Page 61: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/61.jpg)
Questions? tweet @zipfianacademy #pydata
Explore
![Page 62: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/62.jpg)
error with NYT API
Questions? tweet @zipfianacademy #pydata
Explore
![Page 63: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/63.jpg)
api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!
Questions? tweet @zipfianacademy #pydata
Explore
error with NYT API
![Page 64: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/64.jpg)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
Vectorize
![Page 65: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/65.jpg)
Tokenize article text and
create feature vectors with NLTK
Questions? tweet @zipfianacademy #pydata
Vectorize
![Page 66: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/66.jpg)
Vectorize
wnl = nltk.WordNetLemmatizer()!!def tokenize_and_normalize(chunks):! words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]! flatten = [ inner for sublist in words for inner in sublist ]! stripped = [] !! for word in flatten: ! if word not in stopwords.words('english'):! try:! stripped.append(word.encode('latin-1').decode('utf8').lower())! except:! print "Cannot encode: " + word! ! no_punks = [ word for word in stripped if len(word) > 1 ] ! return [wnl.lemmatize(t) for t in no_punks]!
Questions? tweet @zipfianacademy #pydata
![Page 67: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/67.jpg)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
Train
![Page 68: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/68.jpg)
Train and score a model with scikit-learn
Questions? tweet @zipfianacademy #pydata
Train
![Page 69: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/69.jpg)
# cross validate!from sklearn.cross_validation import train_test_split!!xtrain, xtest, ytrain, ytest = !! ! ! ! ! ! ! train_test_split(X, labels, test_size=0.3)!!# train a model!alpha = 1!multi_bayes = MultinomialNB(alpha=alpha)!!multi_bayes.fit(xtrain, ytrain)!multi_bayes.score(xtest, ytest)
Questions? tweet @zipfianacademy #pydata
Train
![Page 70: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/70.jpg)
Gotchas!
• Model only exists locally on Laptop
• Not Automated for realtime prediction
Questions? tweet @zipfianacademy #pydata
Train
![Page 71: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/71.jpg)
Exposé
Questions? tweet @zipfianacademy #pydata
![Page 72: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/72.jpg)
Iteration 2:
• Expose your model
• Automate your processes
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 73: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/73.jpg)
Getting that model off your lap(top)
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 74: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/74.jpg)
Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 75: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/75.jpg)
A model is just a function
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 76: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/76.jpg)
Inputs...
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 77: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/77.jpg)
Outputs...
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 78: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/78.jpg)
Serialize your model with pickle (or cPickle or joblib)
Questions? tweet @zipfianacademy #pydata
Persistence
![Page 79: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/79.jpg)
Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0
Persistence
Questions? tweet @zipfianacademy #pydata
![Page 80: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/80.jpg)
Persistence
SerDes
• Disk
• Database
• Memory
Questions? tweet @zipfianacademy #pydata
![Page 81: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/81.jpg)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 82: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/82.jpg)
Deploy your Model to yHat
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 83: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/83.jpg)
class DocumentClassifier(YhatModel):! @preprocess(in_type=dict, out_type=dict)! def execute(self, data):! featureBody = vectorizer.transform([data['content']])! result = multi_bayes.predict(featureBody)! list_res = result.tolist()! return {"section_name": list_res}!!clf = DocumentClassifier()!yh = Yhat("[email protected]", “xxxxxx",!! ! ! ! ! ! ! ! ! ! ! ! ! "http://cloud.yhathq.com/")!yh.deploy("documentClassifier", DocumentClassifier, globals())
Questions? tweet @zipfianacademy #pydata
Exposé
![Page 84: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/84.jpg)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
At Scale
Flask
yHat
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
requests
BeautifulSoup4
pandas
pymongo
Flask (on Heroku)
yHat
Locally
scikit-learn/NLTK
Questions? tweet @zipfianacademy #pydata
Present
![Page 85: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/85.jpg)
Create a Flask application to expose your model on the web
Questions? tweet @zipfianacademy #pydata
Present
![Page 86: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/86.jpg)
yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")[email protected]('/')def index(): return app.send_static_file('index.html')[email protected]('/predict', methods=['POST'])def predict(): article = request.form['article'] results = yh.predict("documentClf", { 'content': article }) return jsonify({"results": results})
Questions? tweet @zipfianacademy #pydata
Present
![Page 87: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/87.jpg)
Pipeline
Only Data should Flow
Questions? tweet @zipfianacademy #pydata
![Page 88: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/88.jpg)
DataRemember to Remember
(Lineage)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
Questions? tweet @zipfianacademy #pydata
Pipeline
![Page 89: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/89.jpg)
Immutable append only set of Raw Data
Computation is a view on data
*Lambda Architecture by Nathan MarzQuestions? tweet @zipfianacademy #pydata
Pipeline
![Page 90: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/90.jpg)
Functional Data Science
• Modularity
• Define interfaces
• Separate data from computation
• Data Lineage
Functional
Questions? tweet @zipfianacademy #pydata
![Page 91: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/91.jpg)
Need Robust and Flexible Pipeline!
Questions? tweet @zipfianacademy #pydata
Pipeline
![Page 92: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/92.jpg)
Whatever you do, DO NOT cross the streams
Questions? tweet @zipfianacademy #pydata
Pipeline
![Page 93: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/93.jpg)
NYT API
MongoDB
BeautifulSoup
Feature Matrixscikit-learn
Web App
ModelDeploy
yHat
HerokuPOST
Predict
Predicted Section
Where we are
NLTK
scikit-learn
Questions? tweet @zipfianacademy #pydata
![Page 94: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/94.jpg)
Gotchas!
• Only have a static subset of articles
• Pipeline not automated for re-training
Questions? tweet @zipfianacademy #pydata
Gotchas
![Page 95: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/95.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 96: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/96.jpg)
Iteration 3:
Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.pngQuestions? tweet @zipfianacademy #pydata
Iterate
![Page 97: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/97.jpg)
NYT API
MongoDB
cron
Feature Matrixscikit-learn
Web App
ModelDeploy
yHat
HerokuPOST
Predict
Predicted Section
Where we are
NLTK
scikit-learn
Questions? tweet @zipfianacademy #pydata
Amazon EC2
![Page 98: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/98.jpg)
testing
Start small (data) and fast
(development)
testing
Increase size of data set
Optimize and productionize
PROFIT!
$$$
Questions? tweet @zipfianacademy #pydata
How to Scale
![Page 99: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/99.jpg)
How to Scale
testing
Develop locally
testing
Distribute computation
(run on cluster)
Tune parameters
PROFIT!
$$$
Questions? tweet @zipfianacademy #pydata
Can also use a streaming algorithm or
single machine disk based “medium data”
technologies (i.e. database or memory
mapped files)
![Page 100: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/100.jpg)
Products
If you build it...
Questions? tweet @zipfianacademy #pydata
![Page 101: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/101.jpg)
Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg
Products
Questions? tweet @zipfianacademy #pydata
![Page 102: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/102.jpg)
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
![Page 103: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/103.jpg)
Q & A
Q&AQuestions? tweet @zipfianacademy #pydata
![Page 104: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/104.jpg)
Zipfian Academy
@ZipfianAcademy
Data Science & Data Engineering 12-week Bootcamp (May 12th & Sep 8th)
Weekend Workshops
http://zipfianacademy.com/apply
http://zipfianacademy.com/workshops
Next: Interactive Visualizations w/ d3.js (June 7th)
Questions? tweet @zipfianacademy #pydata
![Page 105: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/105.jpg)
Thank You!
Jonathan DinuCo-Founder, Zipfian [email protected]
@clearspandex
@ZipfianAcademy
http://zipfianacademy.com
Questions? tweet @zipfianacademy #pydata
![Page 106: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/106.jpg)
Appendix
Questions? tweet @zipfianacademy #pydata
![Page 107: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/107.jpg)
Data Sources
Obtain(ranked by ease of use)
1. DaaS -- Data as a service
2. Bulk Download
3. APIs
4. Web Scraping
Questions? tweet @zipfianacademy #pydata
![Page 108: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/108.jpg)
DaaS(Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
Data Sources
Questions? tweet @zipfianacademy #pydata
![Page 109: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/109.jpg)
Bulk Download(just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
Data Sources
Questions? tweet @zipfianacademy #pydata
![Page 110: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/110.jpg)
APIs(if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
Data Sources
Questions? tweet @zipfianacademy #pydata
![Page 111: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/111.jpg)
Web Scraping
1. wget and curl
2. Web Spider/Crawler
3. API scraping
4. Manual Download
(DIY for life)
Data Sources
Questions? tweet @zipfianacademy #pydata
![Page 112: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/112.jpg)
• Delimited Values
• TSV
• CSV
• WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
Data Formats
Questions? tweet @zipfianacademy #pydata
![Page 113: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/113.jpg)
• JSON is made up of hash tables and arrays • Hash tables: { “foo” : 1, “bar” : 2, baz : “3” } • Arrays: [1, 2, 3] • Arrays of arrays: [[1, 2, 3], [‘foo’, ‘bar’, ‘baz’]] • Array of hashes: [{‘foo’:1, ‘bar’:2}, {‘baz’:3}] • Hashes of hashes: {‘foo’: {‘bar’: 2, ‘baz’: 3}}
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 114: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/114.jpg)
{"widget": {! "debug": "on",! "window": {! "title": "Sample Konfabulator Widget",! "name": "main_window",! "width": 500,! "height": 500! },! "image": { ! "src": "Images/Sun.png",! "name": "sun1",! "hOffset": 250,! "vOffset": 250,! "alignment": "center"! },! "text": {! "data": "Click Here",! "size": 36,! "style": "bold",! "name": "text1",! "hOffset": 250,! "vOffset": 100,! "alignment": "center",! "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"! }!}} !
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 115: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/115.jpg)
• XML is a recursive self-describing container <container>
<item>Foo</item> <item>Bar</item>
<container> <item attr=”SomethingAboutBaz”>Baz</item>
</container> </item>
<container>
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 116: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/116.jpg)
<widget>! <debug>on</debug>! <window title="Sample Konfabulator Widget">! <name>main_window</name>! <width>500</width>! <height>500</height>! </window>! <image src="Images/Sun.png" name="sun1">! <hOffset>250</hOffset>! <vOffset>250</vOffset>! <alignment>center</alignment>! </image>! <text data="Click Here" size="36" style="bold">! <name>text1</name>! <hOffset>250</hOffset>! <vOffset>100</vOffset>! <alignment>center</alignment>! <onMouseUp>! sun1.opacity = (sun1.opacity / 100) * 90;! </onMouseUp>! </text>!</widget>!
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 117: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/117.jpg)
• Ad hoc data formats • Fixed-width (Census data) • Graph Edgelists • Voting records • etc.
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 118: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/118.jpg)
• 7-5-5 format •Sam foo bar!•Roger baz 6!•Jane 314 99
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 119: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/119.jpg)
• Directed Graph Format 1 2!
1 3!
1 4!
2 3!
4 4
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 120: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/120.jpg)
• Directed Graph Format 1 2!
1 3!
1 4!
2 3!
4 4
Questions? tweet @zipfianacademy #pydata
Data Formats
![Page 121: Data Engineering 101: Building your first data product](https://reader033.vdocument.in/reader033/viewer/2022060109/5552bff5b4c905920f8b4793/html5/thumbnails/121.jpg)
Programming languages like Python, Ruby, and R have built in parsers for data formats such as
JSON and CSV. For other esoteric formats you will
probably have to write your own
Questions? tweet @zipfianacademy #pydata
Data Formats