hadoop demystified + automation smackdown! austin jug june 24 2014
![Page 1: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/1.jpg)
Hadoop Demystified 101 ETL + Automation Smackdown
Learn Big Data: Learn manually, then ask “Which approach makes me the most valuable as a developer?”
Slides, code, YouTube, resources at end
![Page 2: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/2.jpg)
Q&A at end
![Page 3: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/3.jpg)
Bio - Pete Carapetyan
• Java dev last 15 years, dev 20 years
• Grew up automating in a different industry
• Almost involuntary obsession with systems & automation
• Since 2000 as dataFundamentals, now a 2 man shop
Contact info at last slide
![Page 4: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/4.jpg)
Special Skills - Special Snowflakes
• Let me show you these Hadoop basics.
• Then we code for special snowflakes (data sets).
• Thus we are more valuable, and can up our bill rates!
• This is Approach #1: Special Snowflake (manual)
![Page 5: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/5.jpg)
My 2013 Manual Hadoop Benchmark
• 15 ETL jobs [Partial scope]
• Brilliant, ninja level team
• 1 year of competitive NIH* copy paste spaghetti coding - AKA special snowflake approach
• Is this the best I can do?
*NIH: Not Invented Here
![Page 6: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/6.jpg)
Hadoop sidebar: which Serialization Protocol?
• [text - native via SequenceFile]
• Binary protocols include
  • Thrift (Facebook, Evernote)
  • Protocol Buffers (Google)
  • Avro (Hadoop author, Cloudera)
• What about character based?
  • XML
  • JSON
  • etc
Less than complimentary view of Avro: http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro
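To make the binary versus character-based trade-off concrete, here is a minimal Python sketch that encodes one record both ways. It uses only the standard library: `struct` stands in for a binary protocol and is not Thrift, Protocol Buffers, or Avro. The record, its field names, and the fixed layout are assumptions made up for illustration.

```python
import json
import struct

# Hypothetical record; the field names and values are illustrative only.
RECORD = {"id": 42, "timestamp": 1403568000, "amount": 19.99}

# Assumed fixed layout: little-endian int32, int64, float64 (no padding).
FORMAT = "<iqd"


def encode_text(record):
    """Character-based encoding: field names travel with every record."""
    return json.dumps(record).encode("utf-8")


def encode_binary(record):
    """Binary encoding: only values are written; the layout lives in FORMAT."""
    return struct.pack(FORMAT, record["id"], record["timestamp"], record["amount"])


def decode_binary(payload):
    """Reverse of encode_binary; the reader must know the same layout."""
    record_id, timestamp, amount = struct.unpack(FORMAT, payload)
    return {"id": record_id, "timestamp": timestamp, "amount": amount}


if __name__ == "__main__":
    print(len(encode_text(RECORD)), "bytes as JSON")
    print(len(encode_binary(RECORD)), "bytes as packed binary")
```

The binary form is smaller because the schema is agreed on out of band, which is roughly the role a schema or IDL file plays for the real binary protocols listed above.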
![Page 7: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/7.jpg)
Transform to Avro
• Not detailed in this talk
• Demo’d here as a binary
• Code listed at end of talk
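As a rough idea of what the first step of such a transform might involve, the Python sketch below derives an Avro record schema (which is plain JSON) from the header line of a delimited file. It is not the code from the talk. The function name, the pipe delimiter, and the choice to type every column as a nullable string are assumptions for illustration.

```python
import json
import re


def avro_schema_from_header(header_line, record_name, delimiter="|"):
    """Build an Avro record schema dict from a delimited header line.

    Every column is typed as a nullable string (an assumption; real
    data sets usually need per-column types).
    """
    fields = []
    for raw in header_line.strip().split(delimiter):
        # Avro names may contain only letters, digits, and underscores.
        name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
        # Avro names must not be empty or start with a digit.
        if not name or name[0].isdigit():
            name = "_" + name
        fields.append({"name": name, "type": ["null", "string"], "default": None})
    return {"type": "record", "name": record_name, "fields": fields}


if __name__ == "__main__":
    schema = avro_schema_from_header("id|first name|2nd_addr", "Customer")
    print(json.dumps(schema, indent=2))
```

The resulting JSON could be saved as an `.avsc` file and handed to an Avro library to write the binary records.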
![Page 8: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/8.jpg)
[Demo Basics of Hadoop ETL Job]
![Page 9: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/9.jpg)
Whoops - [lots of moving parts!]
• What if I make a misteak?
• Dig through log files
• Obtuse messages
• Scripts for logs are critical
• Budget lots of time
• Error UI
![Page 10: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/10.jpg)
I don’t get it. What makes Hadoop so cool?
• Expands to thousands of machines
• Placement of my data across those machines (uses HDFS)
• Moves program to data, not data to program
• Tooling/ecosystem
  • Much of which is now usable outside Hadoop
  • Examples:
    • Hive
    • Pig
    • Zookeeper
![Page 11: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/11.jpg)
Map Reduce 101?
• Makes more sense as “MapShuffleReduce”
• API for handing program to the data.
• Primary feature is the two-pass heuristic for dealing with data on clusters
• You can avoid understanding MapReduce if Hive is all you use :(
• Yes, MapReduce runs on systems other than Hadoop!
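The “MapShuffleReduce” reading can be sketched in a few lines of plain Python. This is a single-process illustration of the three phases using word count as the example; it does not use the Hadoop API, and the function names are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) pair per word in a single input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a cluster, the map calls run next to the data blocks and the shuffle moves grouped keys across the network to the reducers; here everything happens in one list and one dictionary.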
![Page 12: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/12.jpg)
Special Snowflake Approach: Human drama!
What limitations of this manual, special-skills, special-snowflake approach do we observe?
![Page 13: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/13.jpg)
How To Un-Pack Either Approach?
What if we remove the human drama?
![Page 14: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/14.jpg)
![Page 15: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/15.jpg)
![Page 16: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/16.jpg)
![Page 17: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/17.jpg)
![Page 18: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/18.jpg)
![Page 19: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/19.jpg)
![Page 20: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/20.jpg)
Now, what happens if we automate?
Automated Approach
![Page 21: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/21.jpg)
Carrie
Our own internal project for automating big data.
Name inspired by the horror film…
![Page 22: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/22.jpg)
![Page 23: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/23.jpg)
How to drive focus? The Phoenix Project
• Results, not drama
• Focus only on bottleneck
• Brent as bottleneck
![Page 24: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/24.jpg)
Brent: The bad guy?
• Brent is a team’s best asset! Brent is a ninja. Brent is not the bad guy.
• Brent is a bottleneck only when every situation is treated like a special snowflake.
• Brent enjoys attention???
• Brent is not the drama queen; others bring the drama to him.
• Often victim of his own success.
Brent?
![Page 25: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/25.jpg)
Automation Basics
1. Brent spends time on clean design, PIE*, not NIH*
  • Uses [Camel] - Integration Server
2. Brent automates the rule, codes the exception
  • Apply metadata to templates
  • Infrastructure as code: servers (DevOps)
* NIH: Not Invented Here, as opposed to PIE: “Proudly Invented Elsewhere”
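“Automate the rule, code the exception” can be sketched as a template that covers most data sets plus a small table of hand-written overrides. The Python below is an illustration only, not Carrie's code: the template text, property names, and the `legacy_orders` data set are all hypothetical.

```python
from string import Template

# Hypothetical template for "the rule": what most data sets need.
DEFAULT_TEMPLATE = Template(
    "job.name=$name\n"
    "input.dir=/etl/in/$name\n"
    "delimiter=$delimiter\n"
)


def add_header_skip(text):
    """Hand-coded exception: this data set ships with two header lines."""
    return text + "skip.header.lines=2\n"


# Hypothetical hand-coded exceptions, keyed by data set name.
EXCEPTIONS = {"legacy_orders": add_header_skip}


def generate(metadata):
    """Apply metadata to the template, then any exception for that data set."""
    text = DEFAULT_TEMPLATE.substitute(metadata)
    override = EXCEPTIONS.get(metadata["name"])
    return override(text) if override else text


if __name__ == "__main__":
    print(generate({"name": "customers", "delimiter": "|"}))
    print(generate({"name": "legacy_orders", "delimiter": ","}))
```

Adding a new ordinary data set then costs one metadata entry, and only the unusual ones need code.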
![Page 26: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/26.jpg)
Demo Integration Server
• Raw Linux OS (CentOS)
• Java
• Maven
• Ruby
• Networking
• Maven repo - binaries
• [created with Vagrant]
youtube link
![Page 27: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/27.jpg)
Test your Chef and Vagrant knowledge:
1. What is Vagrant?
2. Name 4 other tools like Chef.
3. Dev, Test, Prod all identical ?????
4. In Chef, box as ‘run list’ of ?????
5. Idempotent in Chef defined as ????
6. Extra credit: VirtualBox is to VM as Docker is to ?????
![Page 28: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/28.jpg)
Chef and Vagrant Basic Answers
1. Vagrant is a command-line front end for creating VMs
2. Chef is 1 of 5: Chef, Puppet, Ansible, Salt, CFEngine
3. Dev, Test, Prod all identical ‘code’
4. Box as ‘run list’ of features or recipes
5. Idempotent: the same code creates or updates
6. VirtualBox is to VM as Docker is to container
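Answer 5 is easier to see in code. The Python sketch below models idempotence the way configuration tools tend to use the word: the same function either creates or updates, and a second run changes nothing. It is not Chef's API; the `converge` function and the package names are hypothetical.

```python
def converge(current, desired):
    """Return (new_state, actions) with every desired item at its desired version.

    Items in `current` that are not mentioned in `desired` are left alone.
    """
    new_state = dict(current)
    actions = []
    for name, version in desired.items():
        if name not in new_state:
            actions.append(f"create {name} {version}")
        elif new_state[name] != version:
            actions.append(f"update {name} {new_state[name]} -> {version}")
        else:
            continue  # Already as desired: nothing to do.
        new_state[name] = version
    return new_state, actions


if __name__ == "__main__":
    desired = {"java": "7", "maven": "3.2"}
    state, actions = converge({"java": "6"}, desired)
    print(actions)  # First run: one update, one create.
    state, actions = converge(state, desired)
    print(actions)  # Second run: no actions.
```

Because the second run is a no-op, the same run list can be applied to a fresh box or an existing one without special-casing either.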
![Page 29: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/29.jpg)
Demo Metadata Collection
• Simple properties
• Collected using a cheesy UI
• UI and code generation both written in Ruby
youtube link
![Page 30: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/30.jpg)
Demo Generated Code
• Camel ETL binary
• OSGi, versioned, modular jar
• Only 3 primary outputs!
• simple
• clean
• well designed (?)
• JUnit/integration tested
• Supporting scripting
• messy
youtube link
![Page 31: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/31.jpg)
Demo Server Deploy
• One line deploy/run command
• Compiles on server with Maven
• Also runnable as jar
youtube link
![Page 32: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/32.jpg)
Does it work?
• Make custom file
• Drop into ETL folder
• Inspect
youtube link
![Page 33: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/33.jpg)
Demo - Review
• Schema created
• DDL run
• Avro binary (JSON) transform
• Data Migration
• FTP to server
• Into HDFS partition
• Alter Table: Date Partition
youtube link
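The DDL and “Alter Table: Date Partition” steps reviewed above can be pictured as generated Hive statements. The Python sketch below only builds the statement text and does not talk to Hive. The table name, columns, paths, and the `dt` partition column are hypothetical, and `STORED AS AVRO` assumes Hive 0.14 or later (earlier versions declare the Avro SerDe explicitly).

```python
def create_table_ddl(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement partitioned by a dt string."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (dt STRING)\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}'"
    )


def add_partition_ddl(table, dt, location):
    """Build the ALTER TABLE statement that registers one date partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{location}/dt={dt}'"
    )


if __name__ == "__main__":
    columns = [("id", "STRING"), ("name", "STRING")]
    print(create_table_ddl("customers", columns, "/data/customers"))
    print(add_partition_ddl("customers", "2014-06-24", "/data/customers"))
```

Both statements use `IF NOT EXISTS`, so rerunning the job for the same date does not fail on the second pass.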
![Page 34: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/34.jpg)
Takeaways
• Brent codes the exception manually and the rule by template.
• Brent has time to focus on design & exceptions.
• Brent may lose some personal attention and status.
• Resulting code is
  • clean
  • consistent, easy to maintain
• But is there a Home Run?
  • defined as anything not possible via special snowflake approach
![Page 35: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/35.jpg)
Home Run 1: Instant, identical, dev/test/prod
![Page 36: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/36.jpg)
Home Run 2: Big Data, Beyond Hadoop!
1. Pick your provider
  • Hadoop
  • Cassandra
  • Couchbase
  • any of hundreds…
2. Adapt your templates, VMs, etc
3. Even stick with Avro…
![Page 37: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/37.jpg)
Home Run 3: Effort as Idempotent
• Idempotent effort? No penalty for discontinuous development.
• Walkup - The 10 minute test
• Walkaway - Requirements
• Features
• Testing, technical debt, already in place for code
• VMs and recipes for dev, test, prod
• OSGi etc modularity for binaries
• Does what we see here pass this test?
![Page 38: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/38.jpg)
What to leave with
• De-mystify: how to Avro/Hadoop a delimited file
• Review motives for automating this process
• Code automation basics
• Infrastructure automation basics
• Code for above
![Page 39: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/39.jpg)
Further Hadoop Tutorial Resources
• Hortonworks
  • best free stuff? Except networking vas
• Cloudera
  • Lots but appear to prefer to get paid
• Apache Hadoop
  • haven’t tried but it is Apache
![Page 40: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/40.jpg)
Further Camel Resources
• Gerald Cantor of this group, Mark of this group (AMD)
• Camel In Action Book
• Camel mail list
• Red Hat support (Fuse)
![Page 41: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/41.jpg)
6 week deep dives: Candidates
• Apache Camel
• Serialization choices, formats, code, tools
• Big Data: NoSQL and newSQL variants and choices
• Test Driven Development
• Jenkins, CI, etc
• Hive, Impala, HAWQ, other Hadoop SQL engines
• Pig, and MapReduce for Hadoop
• Hadoop clustering
• OSGi, Felix
• Maven, Gradle etc
• bash
• Chef or Puppet, Salt, Ansible, CFEngine
• Devops, Phoenix Project
![Page 42: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/42.jpg)
Wish To See More?
• In office demos
• Your sample data
![Page 43: Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014](https://reader030.vdocument.in/reader030/viewer/2022020116/55bebd12bb61ebac1d8b47dd/html5/thumbnails/43.jpg)
Code, Content, Contacts
• This Slide Deck: http://www.slideshare.net/datafundamentals/hadoop-big-data-35762308
  • or just remember slideshare.net/datafundamentals it may be the only one there
• Youtube - 11 minute slide-less version of code demo - https://www.youtube.com/playlist?list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
• Dev Code
  • Carrie (ruby UI and generator) https://github.com/datafundamentals/df_ui_carrie
  • Avro from delimited https://bitbucket.org/datafundamentals/avro_from_delimited
  • Camel-Avro https://bitbucket.org/datafundamentals/camel-avro-etl
• Ops Code - cookbook recipes
  • https://github.com/datafundamentals
• Contact
  • [email protected], [email protected]
Be careful! It’s a competitive world out there!