Hadoop at Datasift
DESCRIPTION
Slides from the presentation at Hadoop UK User group meetup in London as part of BigDataWeek.
TRANSCRIPT
![Page 1: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/1.jpg)
Hadoop At
Datasift
![Page 2: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/2.jpg)
About me
Jairam Chandar
Big Data Engineer
Datasift
@jairamc
http://about.me/jairam
http://blog.jairam.me
![Page 3: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/3.jpg)
Outline
What is Datasift?
Where do we use Hadoop?
– The Numbers
– The Use-cases
– The Lessons
![Page 4: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/4.jpg)
!! Sales Pitch Alert !!
![Page 5: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/5.jpg)
What is Datasift?
![Page 6: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/6.jpg)
What is Datasift?
![Page 7: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/7.jpg)
What is Datasift?
![Page 8: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/8.jpg)
What is Datasift?
![Page 9: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/9.jpg)
What is Datasift?
![Page 10: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/10.jpg)
What is Datasift?
![Page 11: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/11.jpg)
What is Datasift?
![Page 12: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/12.jpg)
What is Datasift?
![Page 13: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/13.jpg)
What is Datasift?
![Page 14: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/14.jpg)
What is Datasift?
![Page 15: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/15.jpg)
The Numbers
Machines
– 60 machines
  ● Datanode
  ● Tasktracker
  ● RegionServer
– 2 machines
  ● Namenode
– 2 machines
  ● HBase Master
– In the process of doubling our capacity
![Page 16: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/16.jpg)
The Numbers
Machines
– 2 * Intel Xeon E5620 @ 2.40GHz (16 core total)
– 24GB RAM
– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)
– 1 Gigabit network links
![Page 17: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/17.jpg)
The Numbers
Data
– Avg load of 3500 interactions/second
– Peak load of 6000 interactions/second
– Highest during the Superbowl – 12000 interactions/second
– Avg size of interaction 2 KB; that's about 2 TB a day with replication (RF = 3)
– And that's not it!
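As a rough check on the daily volume quoted on this slide, the figures can be multiplied out. This is only a back-of-the-envelope sketch: it assumes "2 KB" means 2,000 bytes and that the average rate holds for the whole day. The constant names are made up for illustration. Under those assumptions the result comes to roughly 1.8 TB a day after replication, consistent with the "2 TB" on the slide.

```python
# Back-of-the-envelope check of the daily data volume.
AVG_INTERACTIONS_PER_SECOND = 3500      # average load from the slide
AVG_INTERACTION_BYTES = 2 * 1000        # assumption: "2 KB" read as 2,000 bytes
REPLICATION_FACTOR = 3                  # RF = 3 from the slide
SECONDS_PER_DAY = 24 * 60 * 60

interactions_per_day = AVG_INTERACTIONS_PER_SECOND * SECONDS_PER_DAY
raw_tb_per_day = interactions_per_day * AVG_INTERACTION_BYTES / 1e12
replicated_tb_per_day = raw_tb_per_day * REPLICATION_FACTOR

print(f"{interactions_per_day:,} interactions/day")
print(f"{raw_tb_per_day:.2f} TB/day raw, {replicated_tb_per_day:.2f} TB/day replicated")
```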
![Page 18: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/18.jpg)
The Use Cases
HBase
– Recordings
– Archive/Ultrahose
Map/Reduce
– Exports
– Historics
![Page 19: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/19.jpg)
The Use Cases
Recordings
– User-defined streams
– Stored in HBase for later retrieval
– Export to multiple output formats and stores
– Row key: <recording-id><interaction-uuid>
  ● Recording-id is a SHA-1 hash
  ● Allows recordings to be distributed by their key without generating hot-spots
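A minimal sketch of how a key of the form `<recording-id><interaction-uuid>` might be assembled. The slides do not say what is hashed to produce the recording-id or how the two parts are encoded, so the `recording_name` input, the raw-bytes encoding, and the function name `recording_row_key` are all assumptions made for illustration.

```python
import hashlib
import uuid


def recording_row_key(recording_name: str, interaction_uuid: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as raw bytes (hypothetical layout)."""
    # Assumption: the recording-id is the SHA-1 digest of some stable
    # identifier of the recording. SHA-1 digests are 20 bytes long.
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()
    # A UUID is 16 bytes, so the whole key is 36 bytes.
    return recording_id + interaction_uuid.bytes


first = recording_row_key("stream-a", uuid.UUID("12345678-1234-5678-1234-567812345678"))
second = recording_row_key("stream-a", uuid.UUID("87654321-4321-8765-4321-876543218765"))
other = recording_row_key("stream-b", uuid.UUID("12345678-1234-5678-1234-567812345678"))

# Rows of one recording share a 20-byte prefix and stay contiguous,
# while different recordings land at unrelated points in the key space.
print(first[:20] == second[:20], first[:20] == other[:20])
```

Because the hash output is effectively uniform, recordings created one after another do not end up next to each other in the table, which is the hot-spot avoidance the slide describes.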
![Page 20: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/20.jpg)
The Use Cases
Recordings continued ...
![Page 21: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/21.jpg)
The Use Cases
Exporter
– Export data from HBase for customers
– Export files 5–10 GB or 3–6 million records
– MR over HBase using TableInputFormat
– But the data needs to be sorted
  ● TotalOrderPartitioner
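The idea behind a total-order partitioner can be illustrated outside Hadoop. The sketch below is not the Hadoop `TotalOrderPartitioner` class and not the exporter's code. It only shows the principle: each key is routed to a partition by comparing it against sorted split points, so that the per-partition sorted outputs, concatenated in partition order, form one globally sorted result. The function names and the sample split points are hypothetical.

```python
import bisect
from typing import Dict, List, Sequence


def partition_for(key: bytes, split_points: Sequence[bytes]) -> int:
    """Return the partition index for key, given sorted split points.

    With N split points there are N + 1 partitions; keys below the first
    split point go to partition 0, keys at or above the last go to N.
    """
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys: Sequence[bytes], split_points: Sequence[bytes]) -> List[bytes]:
    """Simulate partition -> per-partition sort -> concatenate in order."""
    partitions: Dict[int, List[bytes]] = {}
    for key in keys:
        partitions.setdefault(partition_for(key, split_points), []).append(key)
    output: List[bytes] = []
    for index in sorted(partitions):
        # Each "reducer" sorts only its own partition.
        output.extend(sorted(partitions[index]))
    return output


split_points = [b"g", b"n", b"t"]  # illustrative; normally sampled from the data
keys = [b"zebra", b"kiwi", b"apple", b"pear", b"grape"]
result = total_order_sort(keys, split_points)
print(result == sorted(keys))
```

In a real job the split points are usually derived by sampling the input so that partitions come out roughly equal in size.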
![Page 22: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/22.jpg)
The Use Cases
Exporter Continued
![Page 23: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/23.jpg)
!! Sales Pitch Alert !!
![Page 24: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/24.jpg)
Historics
![Page 25: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/25.jpg)
The Use Cases
Archive/Ultrahose
– Not just the Firehose but the Ultrahose
– Stored in HBase as well
– HBase architecture (BigTable) creates hotspots with time-series data
  ● Leading randomizing bit (see HBaseWD)
  ● Pre-split regions
  ● Concurrent writes
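One common way to realise the randomizing-prefix and pre-split ideas is to prepend a small bucket number, derived from a hash of the key, to each time-ordered key, and to create one region per bucket up front. The sketch below shows that general technique only. It does not use the HBaseWD API, and the bucket count, key layout, and function names are assumptions chosen for illustration.

```python
import hashlib
from typing import List

NUM_BUCKETS = 16  # hypothetical; in practice tied to the number of regions/servers


def salted_key(timestamp_ms: int, interaction_id: str) -> bytes:
    """Prefix a time-ordered key with a one-byte bucket derived from its hash."""
    original = timestamp_ms.to_bytes(8, "big") + interaction_id.encode("utf-8")
    # Deterministic bucket, so a reader can recompute it for point lookups.
    bucket = hashlib.md5(original).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + original


def presplit_points(num_buckets: int) -> List[bytes]:
    """Region boundaries that give each bucket prefix its own region."""
    return [bytes([bucket]) for bucket in range(1, num_buckets)]


# Consecutive timestamps no longer map to adjacent keys.
keys = [salted_key(1_335_000_000_000 + i, f"id-{i}") for i in range(1000)]
buckets_used = {key[0] for key in keys}
print(len(buckets_used), len(presplit_points(NUM_BUCKETS)))
```

The trade-off is on the read side: a time-range scan has to be issued once per bucket and the results merged, since rows for one time window are now spread across all prefixes.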
![Page 26: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/26.jpg)
The Use Cases
Archive continued …
2 years of Tweets
– 11 TB compressed
– <Number of tweets we got>
![Page 27: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/27.jpg)
The Use Cases
Historics
– Export archive data
– Slightly different from Exporter
  ● Much larger time lines (1–3 months)
  ● Unfiltered input data
  ● Therefore longer processing time
  ● Hence more optimizations required
![Page 28: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/28.jpg)
The Use Cases
Historics continued ...
![Page 29: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/29.jpg)
The Lessons - HBase
Tune Tune Tune (Default == BAD)
Tune based on use case:
– Heap
– Block Size
– Memstore size
Keep number of column families low
Be aware of hot-spotting issue when writing time-series data
Use compression (e.g. Snappy)
![Page 30: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/30.jpg)
The Lessons - HBase
Ops need intimate understanding of system
Monitor metrics (GC, CPU, Compaction, I/O)
Don't be afraid to fiddle with HBase code
Using a distribution is advisable
![Page 31: Hadoop at datasift](https://reader033.vdocument.in/reader033/viewer/2022052823/555281d0b4c905115b8b4e5e/html5/thumbnails/31.jpg)
Questions?