Hadoop/HBase POC Framework
Hadoop/HBase POC v1 Review
A framework for Hadoop/HBase POC
POC
• Proof Of Concept, usually in competition with another product.
• Given use case:
– Performance: critical path (speed); most benchmarks measure read performance, shard for write performance
– Cost: hardware + administrative cost
– Compare HBase+Hadoop vs. MongoDB
HBase
• Transactional store: 70k messages/sec at 1.5 KB/message; that saturates 1 Gb Ethernet speeds
• What is HBase? Sources
Cloudera HBase Training Materials
• Exercises: http://async.pbworks.com/w/file/55596671/HBase_exercise_instructions.pdf
• Training Slides: http://async.pbworks.com/w/file/54915308/Cloudera_hbase_training.pdf
• Training VM: 2 GB; put it somewhere else.
System design based on working components
HDFS vs. HBase
• Replication and a distributed FS; think NFS, not just replicas. Metadata lives on a central NameNode, a single point of failure; the Secondary NameNode checkpoints it (not a true hot standby). Failure- and recovery-protocol testing is not part of the POC.
• Blocks: larger is better. Blocks are replicated, not cells.
• HDFS is write-once; it was modified to support append for HBase.
• MapR is HDFS-compatible:
– Fast adoption with HBase; snapshots
– Cross-data-center mirroring, consistent mirroring
– Star replication vs. chain replication
– FileServer vs. TaskTracker, Warden vs. NameNode; no single point of failure
RegionServer + DataNode on the same machine
HBase Memory (book)
HBase Disks (book)
• No RAID on slaves (master is OK); size for IOPS
HBase Networking (book)
Transactional Write Perf.
• Factor the network, multiple clients, and any disk seeks out of the test program
• Create test packets in memory only
• Write performance is a function of instance memory and packet size
HBase Write Path
Run on Amazon AWS first
• Instances:
– Small instance: 1.7 GB
– Large instance: 7.5 GB
– High-memory XL: 17 GB, 34 GB, 68 GB
– SSD drives!
Write performance, 300k messages of 1500-byte synthetic data
[Chart: write throughput (Series 2) by instance memory: 1.7, 7.5, 17, 34, 68 GB; y-axis 0–3500]
Dell Notes:
• MapR says 16 GB / Cloudera 24 GB
• Plot heap size instead
• Dell: is this slowing down performance?
• Take out a DIMM?
• Reproduce results first?
HBase write perf, 1 MB/s
• http://www.slideshare.net/jdcryans/performance-bof12: 100k–40k packets/second with 10-byte packets
Write test code
• No network, no disk accesses; runs on the local node
HBase AWS Packet Size, 16–1500 bytes
• http://async.pbworks.com/w/file/55320973/AWSHBasePerf16_1500bytepacket.xlsx
HBase Write Perf, 1500-byte packets
• Single thread, single node; should be much higher with more threads or an async client
• 16 bytes: 11235 p/s
• 40 bytes: 8064 p/s
• 80 bytes: 5263 p/s
• 1500 bytes: 3311 p/s
• 8 GB heap, big regions (optimizations in file names), etc. 12–20 tweaks tried, 4 made a difference
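To sanity-check numbers like these, it helps to convert packets/sec into bandwidth. A minimal sketch (the measured values come from the bullets above; everything else is illustration):

```java
public class WriteThroughput {
    // Packets/sec at a given packet size, expressed as MB/s.
    static double mbPerSec(int packetsPerSec, int packetBytes) {
        return packetsPerSec * (double) packetBytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        // {packet size in bytes, measured packets/sec} from the slide above.
        int[][] measured = { {16, 11235}, {40, 8064}, {80, 5263}, {1500, 3311} };
        for (int[] m : measured) {
            System.out.printf("%4d-byte packets: %5d p/s = %6.2f MB/s%n",
                              m[0], m[1], mbPerSec(m[1], m[0]));
        }
        // Small packets give more p/s but far less useful bandwidth, so
        // per-row overhead (RPC, WAL entry) dominates at small sizes.
    }
}
```

At 1500 bytes, 3311 p/s works out to roughly 4.7 MB/s, which lines up with the ~5.4 MB/s batched figure on the next slide.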
AWS Reduce #RPC
• Batch mode: 1000 inserts = 1000 RPCs; reduced to 1 RPC with batching, 3610 p/s (5.4 MB/s, passes the error check, m2.2xlarge instance). Note: compare with Mongo.
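The batching win can be sketched with a toy cost model; the 0.2 ms round trip and 0.05 ms per-row cost below are assumed numbers for illustration, not measurements:

```java
public class BatchRpcModel {
    // Toy model: every RPC pays a fixed round-trip cost, every row pays a
    // small server-side cost. Both constants are made up for illustration.
    static double totalMillis(int rows, int batchSize, double rttMs, double perRowMs) {
        int rpcs = (rows + batchSize - 1) / batchSize; // ceil(rows / batchSize)
        return rpcs * rttMs + rows * perRowMs;
    }

    public static void main(String[] args) {
        int rows = 1000;
        double rttMs = 0.2, perRowMs = 0.05; // assumed values
        System.out.printf("one put per RPC: %.1f ms%n", totalMillis(rows, 1, rttMs, perRowMs));
        System.out.printf("one batched RPC: %.1f ms%n", totalMillis(rows, rows, rttMs, perRowMs));
        // Batching amortizes the round trip: 1000 RPCs become 1.
    }
}
```

In the real client this corresponds to one table.put(List&lt;Put&gt;) call instead of 1000 individual table.put(Put) calls.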
Dell H/W perf (default config) is worse: 2262 p/s vs. 3311 p/s (AWS)
http://async.pbworks.com/w/file/55225682/graphdell1500bytepacket8gb.txt
Dell, WAL off: 2262 → 2688 p/s (+18.8%)
[Chart: per-sample write rate (Series1), samples 1–298; y-axis 0–1]
Dell, WAL disabled, big heap, big regions (needs more time): 2262 → 3557 p/s, a 57% increase
[Chart: per-sample write rate (Series1), samples 1–298; y-axis 0–6]
AWS SSD (3267 p/s) vs. EBS (4042 p/s), no compaction; red = m2.large. Maybe AWS EBS is already on SSD?
[Chart: per-sample write rate, SSD (Series1) vs. EBS (Series2), samples 1–298; y-axis 0–5]
AWS (3500–4k packets/sec) vs. Dell
• AWS: 3–4k p/s with the default configuration, no optimization.
• Dell (3557 p/s) is slower than AWS (3610 optimized on m2.2xlarge; 4240 p/s on m2.large).
• Faster hardware instances on AWS make a difference. Lesson (4210 p/s): controlling regions and compactions has an impact on performance; fast I/O. Spend time on this later.
• User error with the Dell hardware somewhere. It can't be that slow!
• Could run a benchmark on m2.2xlarge over a 24 h period to see variability in perf; not worth the time investment.
Dell Tuning
• Ext3/4: 5% difference in benchmarks, no difference in p/s performance.
• RAID levels? JBOD not available.
• Maybe the m2.2xlarge high-performance AWS drives are SSD? Seems odd given the pricing structure.
• noatime, 256 KB block sizes.
• Goal: 4k p/s?
Bulk Load (worth time investment?)
• Quick error check
• Take an existing table, export it, bulk load it. Command line; very rough.
• Should redo with a Java program. WAL off is an approximation.
Write Clients for NoSQL
• HBase, Mongo, and Cassandra have threads behind them; you need a threaded or async client to get full performance.
• Needs more time; higher priority than distributed mode, and needed in distributed mode.
• Lock timeout behavior: insert 1 row.
• Need a threaded or async client. Most get the threaded design wrong?
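A minimal sketch of the threaded-client shape; the Sink interface is a hypothetical stand-in for the real HBase/Mongo/Cassandra client, and the thread and row counts are arbitrary:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadedWriter {
    // Hypothetical stand-in for the real client handle.
    interface Sink { void put(byte[] row, byte[] value); }

    // Fan rows out across a fixed thread pool; returns rows written.
    static long writeAll(Sink sink, int threads, int rowsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong written = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    sink.put(("row-" + id + "-" + i).getBytes(), new byte[1500]);
                    written.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }

    public static void main(String[] args) {
        Sink noop = (row, value) -> { }; // a real run would point this at the DB
        System.out.println(writeAll(noop, 4, 1000)); // 4000
    }
}
```

A common way to get the threaded design wrong is sharing one non-thread-safe client handle across threads; in the HBase client of this era, HTable instances were not thread-safe, so each thread needed its own.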
Write Load Tool (multiple clients)
• 300k rows, single thread, single client: 14430 ms, 2079 p/s; about right…
• 300k rows, 3 threads: 22804 ms
• M/R, 30 mappers: 24289 ms
• M/R is better when you need to combine or process the input data. The M/R vs. threads comparison is about right; threads should increase performance… OK, writing my own…
Application Level Perf
• Not transactional…
• Simulate a reporting store: writes concurrent with web-page reads.
• Compare with SQL Server and MongoDB, which have column indexes.
• You may not need column indexes if the schema is designed correctly. ESN is not the key; we will need consecutive keys to split into balanced regions.
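One common way to get balanced, scannable regions is to compose the row key from the fields you query by. The ESN-plus-timestamp layout below is a hypothetical illustration of the pattern, not the POC's actual schema:

```java
public class RowKeys {
    // Zero-padded fields so lexicographic byte order matches numeric order;
    // ESN first groups a device's rows, timestamp second orders them by time.
    static String rowKey(long esn, long timestampMs) {
        return String.format("%016x-%013d", esn, timestampMs);
    }

    public static void main(String[] args) {
        String a = rowKey(0xABCDL, 1000L);
        String b = rowKey(0xABCDL, 2000L);
        System.out.println(a);                  // 000000000000abcd-0000000001000
        System.out.println(a.compareTo(b) < 0); // true: same ESN sorts by time
        // A time-range scan for one device is then a contiguous row range
        // (Scan with start/stop rows) instead of a filtered full-table scan.
    }
}
```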
Web GUI
• Demo: web page plus writes into the DB. Test MS SQL Server packets/sec using the same page.
• Do a LIKE '%asdf%' with no matching data to see if there is a timeout.
Read Performance
• Index search through the web page with a concurrent writer is fast: 50–100 ms, under 10–20 ms if in cache
• Don't do full table scans, e.g. count 'tablename' in the HBase shell or COUNT(*) FROM table
• Pig/Hive are faster on top of HBase because they store metadata
• Full table scan: 10 rows: 18 ms; 100 rows: 11–166 ms; 1000 rows: 638 ms; 10k rows: 4.3 s; 100k rows: 38 s (not printed)
• Use filters for search: exact match, regex, substring, and more
Read Path/SCAN/Filters
SingleColumnValueFilter
• Search for a specific value: constant, regex, prefix. Did not try others.
• Same queries as before, searching for specific values, testing 100k–1M rows.
• Without filters, use an iterator to hold the result set, iterate through each result, and test each value, like DB drivers do. A filter reduces the result-set size from all rows to only the rows that meet the condition.
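The difference is where rows get dropped: with client-side checking every row crosses the wire, while a server-side filter ships only the matches. A small runnable model of that (names and data are purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FilterModel {
    // Client-side filtering: the server ships every row; the client tests each.
    static int shippedWithoutFilter(List<String> rows) {
        return rows.size();
    }

    // Server-side filter: only rows matching the predicate cross the wire.
    static int shippedWithFilter(List<String> rows, Pattern p) {
        int shipped = 0;
        for (String r : rows) if (p.matcher(r).find()) shipped++;
        return shipped;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 100000; i++) rows.add(i % 1000 == 0 ? "zebra" : "val" + i);
        System.out.println(shippedWithoutFilter(rows));                     // 100000
        System.out.println(shippedWithFilter(rows, Pattern.compile("^z"))); // 100
    }
}
```

The server still reads every row either way; the filter saves RPC payload and client-side work, which is why small result sets come back so much faster.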
Column Value
• Filter f = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("Key5"), CompareOp.EQUAL, Bytes.toBytes("bob"));
• Filter f = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"), CompareOp.EQUAL, new RegexStringComparator("z*"));
• 565 ms for 200k rows with a 115-row result set returned (printed); small result sets are faster.
Column Value Searches
• 100k-row table
– Returning .1% of results (10): 5 s
– Returning 1% of results (100): 11.29 s
• 1M-row table
– 1% of results (10k): 212 s
– .1% of results (1k): 204.057 s
Compose row key w/values or index tables
• Add a second table whose row keys are composed partially of the values.
• Secondary-table consistency: don't need it for a reporting system? Consistent on inserts or bulk import.
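A toy model of the second-table pattern, with TreeMap standing in for an HBase table's sorted row keys (illustration only; a real implementation also has to keep the two tables consistent, which is the cost the bullet above alludes to):

```java
import java.util.Collection;
import java.util.TreeMap;

public class IndexTable {
    // The "index" table's row key embeds the column value, so a value lookup
    // becomes a prefix scan over sorted keys. TreeMap is the stand-in here.
    static final TreeMap<String, String> mainTable = new TreeMap<>();  // rowKey -> value
    static final TreeMap<String, String> indexTable = new TreeMap<>(); // value + "|" + rowKey -> rowKey

    static void put(String rowKey, String value) {
        mainTable.put(rowKey, value);
        indexTable.put(value + "|" + rowKey, rowKey); // second write: the consistency cost
    }

    static Collection<String> findRowsByValue(String value) {
        // Prefix scan, analogous to an HBase Scan with start/stop rows.
        return indexTable.subMap(value + "|", value + "|\uffff").values();
    }

    public static void main(String[] args) {
        put("row1", "bob");
        put("row2", "alice");
        put("row3", "bob");
        System.out.println(findRowsByValue("bob")); // [row1, row3]
    }
}
```

The lookup touches only the matching index rows instead of scanning and filtering the whole main table.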
Build Environment
• Ready for CI (Jenkins)
• Ubuntu-specific process for changing code: make all, make deb, make apt, then install with apt-get install hadoop\* hbase\*
• Need to start over with yum for CentOS
• Demo
• Also ready for the command line without a GUI: hbase org.apache.hadoop.hbase.PerformanceEvaluation xx xx
Distributed mode
• Set up the build environment
• Distributed-mode setup. ZooKeeper error message:
• Disable IPv6? Debugging
Docs:
• Bigtop / updated version of CDH
• Installation:
• Build docs: Ubuntu/deb; the big change is to RPMs, which takes time to document and debug. Can do both; takes time.
• Distributed mode:
• NXServer/NXClient:
• Screen: