Scaling Apache Storm - Hadoop Summit 2014
Transcript
- Scaling Apache Storm P. Taylor Goetz, Hortonworks @ptgoetz
- About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Storm Committer / PPMC Member / Release Mgr. @ Apache Volunteer Firefighter since 2004
- 1M+ messages / sec. on a 10-15 node cluster How do you get there?
- How do you fight fire?
- Put the wet stuff on the red stuff. Water, and lots of it.
- When you're dealing with big fire, you need big water.
- Water Sources Lakes Streams Reservoirs, Pools, Ponds
- Data Hydrant You heard it here first.
- How does this relate to Storm?
- Little's Law L = λW The long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: L = λW. http://en.wikipedia.org/wiki/Little's_law
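As a rough worked example of Little's Law applied to a topology: take the arrival rate λ to be the 1M messages/sec target mentioned earlier, and assume (purely for illustration, this is not a figure from the talk) that each tuple spends W = 50 ms in the system.

```latex
\begin{align*}
L &= \lambda W \\
  &= 10^{6}\ \text{messages/s} \times 0.05\ \text{s} \\
  &= 50{,}000\ \text{messages in flight}
\end{align*}
```

Under these assumed numbers, if W doubles while the source keeps delivering at the same rate, L doubles as well, and those extra in-flight messages have to sit in a buffer somewhere.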
- Batch vs. Streaming
- Batch Processing Typically operates on data at rest Velocity is a function of performance Poor performance costs you time
- Stream Processing At the mercy of your data source Velocity fluctuates over time Poor performance.
- Poor performance bursts the pipes. Buffers fill up and eat memory Timeouts / Replays Sink systems overwhelmed
- What can developers do?
- Keep tuple processing code tight
    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class MyBolt extends BaseRichBolt {
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // initialize task (called once per task)
        }
        public void execute(Tuple input) {
            // process input QUICKLY! (called for every tuple)
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output
        }
    }
  Worry about this: execute(). Not this: prepare() and declareOutputFields().
- Know your latencies
    L1 cache reference                          0.5 ns
    Branch mispredict                             5 ns
    L2 cache reference                            7 ns               14x L1 cache
    Mutex lock/unlock                            25 ns
    Main memory reference                       100 ns               20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy              3,000 ns
    Send 1K bytes over 1 Gbps network        10,000 ns    0.01 ms
    Read 4K randomly from SSD*              150,000 ns    0.15 ms
    Read 1 MB sequentially from memory      250,000 ns    0.25 ms
    Round trip within same datacenter       500,000 ns    0.5 ms
    Read 1 MB sequentially from SSD*      1,000,000 ns    1 ms       4X memory
    Disk seek                            10,000,000 ns    10 ms      20x datacenter roundtrip
    Read 1 MB sequentially from disk     20,000,000 ns    20 ms      80x memory, 20X SSD
    Send packet CA->Netherlands->CA     150,000,000 ns    150 ms
  https://gist.github.com/jboner/2841832
- Use a Cache Guava is your friend.
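To make the caching idea concrete, here is a minimal sketch of a per-task cache that a bolt might build in prepare() and consult in execute(). It deliberately uses only the JDK (a LinkedHashMap in access order) rather than Guava, so it is a stand-in for the idea, not the library the slide recommends; Guava's CacheBuilder covers the same ground with size and time-based eviction built in. The class name LruCache and the loader function are hypothetical, and the sketch assumes each bolt task owns its own instance, so it does no locking.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Storm or Guava: a small LRU cache that
// remembers the results of a slow lookup so execute() can skip repeat trips.
final class LruCache<K, V> {
    private final Function<K, V> loader; // the slow lookup being cached
    private final Map<K, V> entries;
    private int loads = 0;               // gauge: how often the slow path ran

    LruCache(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cache is over capacity.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key); // cache miss: take the slow path once
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    int loads() { return loads; }

    int size() { return entries.size(); }
}
```

The loads() counter is the kind of gauge worth exposing: a miss rate that climbs under load suggests the cache is too small for the working set.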
- DevOps will appreciate it. Expose your knobs and gauges.
- What can DevOps do?
- How big is your hose?
- Find out!
- Performance testing is essential!
- How to deal with small pipes? (i.e. When your output is more like a garden hose.)
- Parallelize Slow sinks
- Parallelism == Manifold Take input from one big pipe and distribute it to many smaller pipes The bigger the size difference, the more parallelism you will need
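The manifold idea can be put into rough numbers: divide the rate of the big pipe by what one small pipe can carry, and round up. The sketch below does only that arithmetic; the class and method names are hypothetical, and the rates in main are made-up figures you would replace with measurements from your own performance tests.

```java
// Hypothetical helper: how many parallel writers a slow sink needs
// so that their combined throughput covers the incoming rate.
final class SinkParallelism {
    static long tasksNeeded(long inputPerSec, long perTaskPerSec) {
        if (inputPerSec < 0 || perTaskPerSec <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        // Ceiling division: a partial task's worth of load still needs a task.
        return (inputPerSec + perTaskPerSec - 1) / perTaskPerSec;
    }

    public static void main(String[] args) {
        // Assumed figures: 1M messages/sec in, 40K writes/sec per sink task.
        System.out.println(tasksNeeded(1_000_000, 40_000)); // prints 25
    }
}
```

This is a lower bound under ideal, evenly distributed load; in practice you would leave headroom above it.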
- Sizeup Initial assessment
- Every fire is different.
- Every Storm use case is different.
- Sizeup Fire What are my water sources? What GPM can they support? How many lines (hoses) will I need? How much water will I need to flow to put this fire out?
- Sizeup Storm What are my input sources? At what rate do they deliver messages? What size are the messages? What's my slowest data sink?
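The Storm sizeup questions lend themselves to back-of-the-envelope arithmetic. The sketch below turns a message rate and message size into network bandwidth, and picks the slowest sink from a set of measured rates. All names and figures are hypothetical placeholders, not numbers from the talk.

```java
import java.util.Map;

// Hypothetical sizeup helper: answers "how much is coming in?" and
// "which sink is my bottleneck?" from numbers you have measured.
final class Sizeup {
    // Input bandwidth in gigabits per second.
    static double inputGbps(long messagesPerSec, long bytesPerMessage) {
        return messagesPerSec * bytesPerMessage * 8 / 1e9;
    }

    // Name of the sink with the lowest sustained write rate, or null if empty.
    static String slowestSink(Map<String, Long> writesPerSecBySink) {
        String slowest = null;
        long min = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : writesPerSecBySink.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }
}
```

For instance, an assumed 1M messages/sec at 200 bytes each works out to 1.6 Gbps, which is more than a single 1 Gbps link can carry.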
- There is no magic bullet.
- But there are good starting points.
- Numbers Where to start.
- 1 Worker / Machine / Topology Keep unnecessary network transfer to a minimum
- 1 Acker / Worker Default in Storm 0.9.x
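In code, the worker and acker starting points might look like the sketch below. It assumes the Storm 0.9.x jars (the backtype.storm packages) are on the classpath; the class name StartingPointConfig and its method are hypothetical, and the setNumAckers call only makes the stated 0.9.x default explicit.

```java
import backtype.storm.Config;

// Hypothetical helper: builds a topology config from the rule-of-thumb
// starting points, given the number of worker machines in the cluster.
final class StartingPointConfig {
    static Config forCluster(int workerNodes) {
        Config conf = new Config();
        conf.setNumWorkers(workerNodes); // 1 worker per machine for this topology
        conf.setNumAckers(workerNodes);  // 1 acker per worker
        return conf;
    }
}
```

The resulting Config would then be passed along when the topology is submitted.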
- 1 Executor / CPU Core Optimize Thread/CPU usage
- 1 Executor / CPU Core (for CPU-bound use cases)
- 1 Executor / CPU Core Multiply by 10x-100x for I/O bound use cases
- Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 Parallelism Units available Subtract # Ackers: 160 - 10 = 150 Units.
- Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 Parallelism Units available (* 10-100 if I/O bound) Distribute this among tasks in topology. Higher for slow tasks, lower for fast tasks.
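The arithmetic in this example, plus one possible way of distributing the units, can be sketched as follows. The proportional split by relative weight is an assumption made here for illustration (the talk only says higher for slow tasks, lower for fast ones), and the class, method, and component names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the starting-point numbers in the example.
final class ParallelismBudget {
    // (nodes * cores) - ackers, times 10-100 for I/O-bound work (1 if CPU-bound).
    static int units(int nodes, int coresPerNode, int ackers, int ioMultiplier) {
        return (nodes * coresPerNode - ackers) * ioMultiplier;
    }

    // Split the budget in proportion to relative weights: give slow
    // components a higher weight, fast components a lower one.
    static Map<String, Integer> distribute(int units, Map<String, Integer> weights) {
        int total = 0;
        for (int w : weights.values()) {
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            // Integer division rounds down, so a few units may go unassigned;
            // every component still gets at least one executor.
            result.put(e.getKey(), Math.max(1, units * e.getValue() / total));
        }
        return result;
    }
}
```

With the example's 150 units and assumed weights of 1 for a spout, 1 for a parsing bolt, and 3 for a slow writing bolt, the split comes out to 30, 30, and 90. Treat the result as a first guess to test and measure, not a final setting.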
- This is just a starting point. Test, test, test. Measure, measure, measure.
- Internal Messaging Handling backpressure.
- Internal Messaging (Intra-worker)
- Turn knobs slowly, one at a time.
- Don't mess with settings you don't understand.
- Storm ships with sane defaults Override only as necessary
- Hardware Considerations
- Minimum Hardware Requirements
- CPU Cores More is usually better The more you have the more threads you can support (i.e. parallelism) Storm potentially uses a LOT of threads
- Memory Highly use-case specific How many workers (JVMs) per node? Are you caching and/or holding in-memory state? Tests/metrics are your friends
- Network Use bonded NICs if necessary Keep nodes close
- Other performance considerations
- Don't Pancake! Separate concerns.
- Keep this guy happy. He has big boots and a shovel. He will hurt you if you piss him off.
- Shameless Plug http://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book
- Thanks! Questions? Storm BoF Session 3:30 Room 230A