Hadoop at Meebo: Lessons in the Real World
TRANSCRIPT
Hadoop at Meebo
Lessons learned in the real world

Vikram Oberoi
August 2010
Hadoop Day, Seattle
About me
• SDE Intern at Amazon, ’07
  – R&D on item-to-item similarities
• Data Engineer Intern at Meebo, ’08
  – Built an A/B testing system
• CS at Stanford, ’09
  – Senior project: Ext3 and XFS under Hadoop MapReduce workloads
• Data Engineer at Meebo, ’09–present
  – Data infrastructure, analytics
About Meebo
• Products
  – Browser-based IM client (www.meebo.com)
  – Mobile chat clients
  – Social widgets (the Meebo Bar)
• Company
  – Founded 2005
  – Over 100 employees, 30 engineers
• Engineering
  – Strong engineering culture
  – Contributions to CouchDB, Lounge, Hadoop components
The Problem
• Hadoop is powerful technology
  – Meets today’s demand for big data
• But it’s still a young platform
  – Evolving components and best practices
• With many challenges in real-world usage
  – Day-to-day operational headaches
  – Missing ecosystem features (e.g., recurring jobs)
  – Lots of re-inventing the wheel to solve these
Purpose of this talk
1. Discuss some real problems we’ve seen
2. Explain our solutions
3. Propose best practices so you can avoid these problems
What will I talk about?
Background:
• Meebo’s data processing needs
• Meebo’s pre- and post-Hadoop data pipelines

Lessons:
• Better workflow management
  – Scheduling, reporting, monitoring, etc.
  – A look at Azkaban
• Get wiser about data serialization
  – Protocol Buffers (or Avro, or Thrift)
Meebo’s Data Processing Needs
What do we use Hadoop for?
• ETL
• Analytics
• Behavioral targeting
• Ad hoc data analysis, research
• Data produced helps power:
  – internal/external dashboards
  – our ad server
What kind of data do we have?
• Log data from all our products
  – The Meebo Bar
  – Meebo Messenger (www.meebo.com)
  – Android/iPhone/Mobile Web clients
  – Rooms
  – Meebo Me
  – Meebo Notifier
  – Firefox extension
How much data?
• 150MM uniques/month from the Meebo Bar
• Around 200 GB of uncompressed daily logs
• We process a subset of our logs
Meebo’s Data Pipeline
Pre and Post Hadoop
A data pipeline in general
1. Data Collection

2. Data Processing

3. Data Storage
4. Workflow Management
Our data pipeline, pre-Hadoop

Servers → Python/shell scripts pull log data → Python/shell scripts process data → MySQL, CouchDB, flat files

Cron and wrapper shell scripts glue everything together.
Our data pipeline, post-Hadoop

Servers → push logs to HDFS → Pig scripts process data → MySQL, CouchDB, flat files

Azkaban, a workflow management system, glues everything together.
Our transition to using Hadoop
• Deployed early ’09
  – Motivation: processing data took aaaages!
  – Catalyst: Hadoop Summit
• Turbulent, time consuming
  – New tools, new paradigms, pitfalls
• Totally worth it
  – A day’s logs went from 24 hours to under an hour to process
  – Leap in ability to analyze our data
  – Basis for new core product features
Workflow Management
What is workflow management?
It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.

• Most people use scripts and cron
• But they end up spending too much time managing their pipelines
• We need a better way
Workflow management consists of a system that:

• Executes jobs with arbitrarily complex dependency chains
Split up your jobs into discrete chunks with dependencies
• Minimize impact when chunks fail
• Allow engineers to work on chunks separately
• Monolithic scripts are no fun
[Example flow: clean up data from log A; process data from log B; join the data and train a classifier; then post-processing: archive the output, export to a DB somewhere]
Workflow management consists of a system that:

• Executes jobs with arbitrarily complex dependency chains
• Schedules recurring jobs to run at a given time
• Monitors job progress
• Reports when jobs fail and how long jobs take
• Logs job execution and exposes logs so that engineers can deal with failures swiftly
• Provides resource management capabilities
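The first few duties above can be pictured as a small per-job wrapper. A minimal sketch, assuming a hypothetical `run_job` helper; the logging-based "email" hook is illustrative, not Azkaban's API:

```python
import logging
import subprocess
import time

logging.basicConfig(level=logging.INFO)

def run_job(name, command, failure_emails=()):
    """Run one job, log how long it takes, and report failures.

    A toy stand-in for the per-job duties listed above; a real
    workflow manager also persists logs and handles retries.
    """
    start = time.time()
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    elapsed = time.time() - start
    if result.returncode != 0:
        logging.error("job %r failed after %.1fs: %s", name, elapsed, result.stderr.strip())
        for addr in failure_emails:
            # Placeholder for real alerting (e.g., sending mail).
            logging.info("would email %s about %r", addr, name)
        return False
    logging.info("job %r succeeded in %.1fs", name, elapsed)
    return True
```

Multiply this by dozens of jobs, dependencies, and schedules, and the appeal of a dedicated framework becomes clear.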
[Diagram: five “Export to DB somewhere” jobs all hitting one DB at the same time]

Don’t DoS yourself
[Diagram: the same five export jobs, now gated by a permit manager; jobs holding a permit run, the rest wait their turn]
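The permit-manager idea can be sketched with a counting semaphore. This is an illustration of the concept, not Azkaban's implementation, and the permit count of 2 is arbitrary:

```python
import threading
import time

# Gate concurrent DB exports behind 2 permits (a counting semaphore).
db_permits = threading.BoundedSemaphore(2)
lock = threading.Lock()
active = 0
max_concurrent = 0

def export_to_db(job_id):
    """Pretend export: only runs while holding a DB permit."""
    global active, max_concurrent
    with db_permits:              # blocks until a permit is free
        with lock:
            active += 1
            max_concurrent = max(max_concurrent, active)
        time.sleep(0.05)          # simulate the export
        with lock:
            active -= 1

threads = [threading.Thread(target=export_to_db, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# max_concurrent never exceeds 2: the other jobs waited for a permit.
```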
Don’t roll your own scheduler!
• Building a good scheduling framework is hard
  – Myriad of small requirements, precise bookkeeping with many edge cases
• Many roll their own
  – It’s usually inadequate
  – So much repeated effort!
• Mold an existing framework to your requirements and contribute
Two emerging frameworks
• Oozie
  – Built at Yahoo
  – Open-sourced at Hadoop Summit ’10
  – Used in production for [don’t know]
  – Packaged by Cloudera
• Azkaban
  – Built at LinkedIn
  – Open-sourced in March ’10
  – Used in production for over nine months as of March ’10
  – Now in use at Meebo
Azkaban
Azkaban jobs are bundles of configuration and code
Configuring a job
process_log_data.job:
type=command
command=python process_logs.py
failure.emails=datateam@whereiwork.com

process_logs.py:
import os
import sys
# Do useful things…
Deploying a job
Step 1: Shove your config and code into a zip archive.

process_log_data.zip (contains the .job and .py files)
Deploying a job
Step 2: Upload to Azkaban.

process_log_data.zip (contains the .job and .py files)
Scheduling a job
The Azkaban front-end:
What about dependencies?
get_users_widgets
process_widgets.job process_users.job
join_users_widgets.job
export_to_db.job
get_users_widgets

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com
get_users_widgets

join_users_widgets.job:
type=command
command=python join_users_widgets.py
failure.emails=datateam@whereiwork.com
dependencies=process_widgets,process_users

export_to_db.job:
type=command
command=python export_to_db.py
failure.emails=datateam@whereiwork.com
dependencies=join_users_widgets
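The dependencies= lines declare a DAG, and the scheduler has to turn it into an execution order. How that resolution works can be sketched with Python's standard-library topological sorter (the concept, not Azkaban's actual internals):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The get_users_widgets flow, as declared by the dependencies=
# lines in the .job files above (job -> jobs it depends on).
deps = {
    "process_widgets": set(),
    "process_users": set(),
    "join_users_widgets": {"process_widgets", "process_users"},
    "export_to_db": {"join_users_widgets"},
}

# A valid execution order: every job runs after its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Independent jobs like process_widgets and process_users have no ordering constraint between them, so a scheduler is free to run them in parallel.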
get_users_widgets

get_users_widgets.zip (contains all four .job files and their .py scripts)
You deploy and schedule a job flow as you would a single job.
Hierarchical configuration

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com

This is silly. Can’t I specify failure.emails globally?
azkaban-job-dir/
  system.properties
  get_users_widgets/
    process_widgets.job
    process_users.job
    join_users_widgets.job
    export_to_db.job
  some-other-job/
    …
Hierarchical configuration
system.properties:
failure.emails=datateam@whereiwork.com
…=foo.whereiwork.com
archive.dir=/var/whereiwork/archive
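The inheritance rule amounts to a dictionary merge: job-level keys win, everything else comes from system.properties. A sketch assuming Azkaban's simple key=value property format; the merge rule is the point, not Azkaban's loader:

```python
def load_props(text):
    """Parse simple key=value lines (the .job / .properties format)."""
    props = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        if key:
            props[key.strip()] = value.strip()
    return props

system_props = load_props("""
failure.emails=datateam@whereiwork.com
archive.dir=/var/whereiwork/archive
""")

job_props = load_props("""
type=command
command=python process_users.py
""")

# Hierarchical configuration: job-level keys override,
# everything else is inherited from system.properties.
effective = {**system_props, **job_props}
```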
What is type=command?
• Azkaban supports a few ways to execute jobs
  – command
    • Unix command in a separate process
  – javaprocess
    • Wrapper to kick off Java programs
  – java
    • Wrapper to kick off Runnable Java classes
    • Can hook into Azkaban in useful ways
  – pig
    • Wrapper to run Pig scripts through Grunt
What’s missing?
• Scheduling and executing multiple instances of the same job at the same time.
[Diagram: hourly job FOO; the 3:00 PM run overlaps the 4:00 PM run]

• Runs hourly
• 3:00 PM took longer than expected
[Diagram: hourly job FOO; the restarted 3:00 PM run overlaps the 5:00 PM run]

• Runs hourly
• 3:00 PM failed, restarted at 4:25 PM
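A common stopgap for this gap is a per-job lock, so a slow or restarted run can't overlap the next scheduled one. A hypothetical sketch using a lock file, not an Azkaban feature:

```python
import os
import tempfile

class JobLock:
    """Crude mutual exclusion via an O_EXCL lock file.

    If an earlier run of the job still holds the lock, a new
    instance backs off instead of running concurrently.
    """
    def __init__(self, name):
        self.path = os.path.join(tempfile.gettempdir(), name + ".lock")

    def acquire(self):
        try:
            # O_EXCL makes creation atomic: exactly one caller wins.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            return False

    def release(self):
        os.remove(self.path)

lock = JobLock("foo_hourly_demo")
if os.path.exists(lock.path):
    os.remove(lock.path)   # clear a stale lock left by a crashed run

first = lock.acquire()     # the 3:00 PM run takes the lock
second = lock.acquire()    # the 4:00 PM run finds it held and backs off
lock.release()
```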
What’s missing?
• Scheduling and executing multiple instances of the same job at the same time.
  – AZK-49, AZK-47
  – Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
• Passing arguments between jobs.
  – Write a library used by your jobs
  – Put your arguments anywhere you want
What did we get out of it?
• No more monolithic wrapper scripts
• Massively reduced job setup time
  – It’s configuration, not code!
• More code reuse, less hair pulling
• Still porting over jobs
  – It’s time consuming
Data Serialization
What’s the problem?
• Serializing data in simple formats is convenient
  – CSV, XML, etc.
• Problems arise when data changes
  – You need backwards compatibility

Does this really matter? Let’s discuss.
v1
[Mockup: clickabutton.com, a Username/Password “Go!” form with a single red button]
“Click a Button” Analytics PRD
• We want to know the number of unique users who clicked on the button
  – Over an arbitrary range of time
  – Broken down by whether they’re logged in or not
  – With hour granularity
“I KNOW!”
Every hour, process logs with Pig and dump lines that look like this to HDFS:

unique_id,logged_in,clicked
“I KNOW!”
-- 'clicked' and 'logged_in' are either 0 or 1
data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    clicked:int
);
-- Munge data according to the PRD…
v2
[Mockup: clickabutton.com, now with red and green buttons]
“Click a Button” Analytics PRD
Break users down by which button they clicked, too.
“I KNOW!”
Every hour, process logs with Pig and dump lines that look like this to HDFS:

unique_id,logged_in,red_click,green_click
“I KNOW!”
-- 'logged_in', 'red_clicked' and 'green_clicked' are either 0 or 1
data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD…
v3
[Mockup: clickabutton.com with the red button removed]
“Hmm.”
Bad Solution 1
Remove red_click:

unique_id,logged_in,red_click,green_click
↓
unique_id,logged_in,green_click
Why it’s bad
data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD…
Your script thinks green clicks are red clicks.
Why it’s bad
data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);
-- Munge data according to the PRD…
Now your script won’t work for all the data you’ve collected so far.
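The failure modes above are easy to reproduce outside Pig. A small Python illustration of why positional formats break when a column is removed (field names and the sample id are from the slides):

```python
# v2 log lines: unique_id,logged_in,red_click,green_click
# v3 log lines: unique_id,logged_in,green_click   (red button removed)

def parse_v2(line):
    """Positional parse against the v2 schema."""
    unique_id, logged_in, red, green = line.split(",")
    return {"unique_id": unique_id, "logged_in": int(logged_in),
            "red_click": int(red), "green_click": int(green)}

row = parse_v2("bak49jsn,1,0,1")   # v2 data parses fine

# v3 data fed to the v2 parser: the field count no longer matches,
# so the script that worked on old data now fails outright (or, with
# naive indexing, silently reads green clicks as red clicks).
try:
    parse_v2("bak49jsn,1,1")
    v3_parsed = True
except ValueError:
    v3_parsed = False
```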
“I’ll keep multiple scripts lying around”
data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);

data = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    orange_clicked:int
);
My data has three fields. Which one do I use?
Bad Solution 2
Assign a sentinel value to red_click when it should be ignored, e.g. -1.
unique_id,logged_in,red_click,green_click
Why it’s bad
It’s a waste of space.
Why it’s bad
Sticking logic in your data is iffy.
The Preferable Solution
Serialize your data using backwards-compatible data structures!
Protocol Buffers and Elephant Bird
Protocol Buffers
• Serialization system
  – Alternatives: Avro, Thrift
• Compiles interfaces to language modules
  – Construct a data structure
  – Access it (in a backwards-compatible way)
  – Ser/deser the data structure in a standard, compact, binary format
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}

uniqueuser.proto (compiles to .java, .py, .h/.cc modules)
Elephant Bird
• Generates protobuf-based Pig load/store functions + lots more
• Developed at Twitter
• Blog post: http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
• Available at: http://www.github.com/kevinweil/elephant-bird
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}

uniqueuser.proto

Generated load/store functions:
*.pig.load.UniqueUserLzoProtobufB64LinePigLoader
*.pig.store.UniqueUserLzoProtobufB64LinePigStorage
LzoProtobufB64?
LzoProtobufB64 Serialization

(bak49jsn, 0, 1)
→ Protobuf binary blob
→ Base64-encoded protobuf binary blob
→ LZO-compressed, Base64-encoded protobuf binary blob
LzoProtobufB64 Deserialization

LZO-compressed, Base64-encoded protobuf binary blob
→ Base64-encoded protobuf binary blob
→ Protobuf binary blob
→ (bak49jsn, 0, 1)
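Setting the LZO layer aside, the Base64-per-line round trip is easy to sketch in plain Python. The byte string below is a stand-in for a serialized UniqueUser message, not real protobuf output:

```python
import base64

# Stand-in for a serialized protobuf record like (bak49jsn, 0, 1).
blob = b"\x0a\x08bak49jsn\x10\x00\x18\x01"

# Serialization: binary blob -> one newline-safe ASCII line.
line = base64.b64encode(blob).decode("ascii")

# Deserialization: the line decodes back to the identical blob,
# so records can be stored one per line in HDFS text files.
decoded = base64.b64decode(line)
```

The Base64 step matters because raw protobuf bytes can contain newlines and delimiters, which would break line-oriented tools.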
Setting it up
• Prereqs
  – Protocol Buffers 2.3+
  – LZO codec for Hadoop
• Check out the docs: http://www.github.com/kevinweil/elephant-bird
Time to revisit
v1
[Mockup: clickabutton.com, a Username/Password “Go!” form with a single red button]
Every hour, process logs and dump lines to HDFS that use this protobuf interface:

message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}

uniqueuser.proto
-- 'logged_in' and 'red_clicked' are either 0 or 1
data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int
);
-- Munge data according to the PRD…
v2
[Mockup: clickabutton.com, now with red and green buttons]
Every hour, process logs and dump lines to HDFS that use this protobuf interface:

message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    optional int32 green_clicked = 4;
}

uniqueuser.proto
-- 'logged_in', 'red_clicked' and 'green_clicked' are either 0 or 1
data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD…
v3
[Mockup: clickabutton.com with the red button removed]
No need to change your scripts.
They’ll work on old and new data!
Bonus!
http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
Conclusion
• Workflow management
  – Use Azkaban, Oozie, or another framework.
  – Don’t use shell scripts and cron.
  – Do this from day one! Transitioning is expensive.
• Data serialization
  – Use Protocol Buffers, Avro, or Thrift. Anything!
  – Do this from day one, before it bites you.