STARFISH: A SELF-TUNING SYSTEM FOR BIG DATA ANALYTICS
SEMINAR BY
Y.SAI PRAMODA
10191A0511
CONTENTS
• Introduction to Big data
• Hadoop
• Tuning problems
• Starfish Architecture
• Usage of Starfish
• Conclusion
INTRODUCTION TO BIG DATA
Big data is the term for data sets so large and complex that they become difficult to process using traditional data management tools or processing applications.
What are the tools of Big data?
Features of Big data Analytics
BIG DATA PRACTITIONERS
• Data analysts: report generation, data mining, ad optimization
• Computational scientists: computational biology, economics, journalism
• Statisticians and machine-learning researchers
• Systems researchers, developers, and testers: distributed systems, networking, security, …
Practitioners want a MAD system: Hadoop
Hadoop is MAD as it is!
• Magnetism: "attracts" or welcomes all sources of data, regardless of structure, values, etc.
• Agility: adaptive; remains in sync with rapid data evolution and modification
• Depth: more than just typical analytics; supports complex operations like statistical analysis and machine learning
MADDER
• Data-lifecycle Awareness: do more than just queries; optimize the movement, storage, and processing of big data
• Elasticity: dynamically adjust resource usage to the workload and user requirements
• Robustness: provide storage and querying services even in the event of some failures
Tuning Challenges
• Heavy use of programming languages for MapReduce programs
• Data loaded/accessed as opaque files
• Large space of tuning choices
• Elasticity is wonderful, but hard to achieve
• Terabyte-scale data cycles.
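To make the "large space of tuning choices" concrete, here is a minimal sketch that enumerates a small grid of job-level settings. The parameter names are Hadoop 1.x configuration properties; the candidate values and the helper functions are assumptions chosen only for illustration, not recommendations.

```python
from itertools import product

# Hadoop 1.x job-level parameter names.
# The candidate values below are illustrative assumptions, not tuning advice.
SEARCH_SPACE = {
    "io.sort.mb": [50, 100, 200, 400],
    "io.sort.factor": [10, 50, 100],
    "io.sort.spill.percent": [0.6, 0.8, 0.95],
    "mapred.reduce.tasks": [1, 8, 32, 64, 128],
    "mapred.compress.map.output": [False, True],
}


def count_configs(space):
    """Number of distinct configurations in the grid."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total


def enumerate_configs(space):
    """Yield every configuration in the grid as a dict of name -> value."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == "__main__":
    print(count_configs(SEARCH_SPACE), "configurations from only 5 parameters")
```

Even this coarse grid over five parameters produces hundreds of combinations, and each one would need a full job run to evaluate by trial and error.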
Tuning Problems
• Job-level MapReduce configuration
• Workload management
• Data layout tuning
• Cluster sizing
• Workflow optimization
[Figure: jobs J1, J2, J3, J4]
Starfish’s Core Approach to Tuning
• Profiler: collects concise summaries of execution
• What-if Engine: estimates impact of hypothetical changes on execution
• Optimizers: search through space of tuning choices (job, workflow, workload, data layout, cluster)
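The interplay of the three components can be sketched roughly as follows. This is a toy model under stated assumptions: the `JobProfile` fields, the cost formula in `what_if`, and the exhaustive search in `optimize` are all hypothetical and far simpler than what Starfish actually does. The point is only the division of labor: a profile is collected once, the what-if function predicts runtime without running the job, and the optimizer searches over candidate settings using those predictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Concise summary of one observed job run (hypothetical fields)."""
    input_mb: float            # total input size in MB
    map_cost_per_mb: float     # seconds of map work per MB
    reduce_cost_per_mb: float  # seconds of reduce work per MB
    task_startup_s: float      # fixed overhead per reduce task


def what_if(profile, reduce_tasks):
    """Estimate runtime (seconds) for a hypothetical number of reduce tasks.

    Toy cost model: reduce work is split evenly across tasks,
    while every extra task adds a fixed startup overhead.
    """
    map_time = profile.input_mb * profile.map_cost_per_mb
    reduce_time = profile.input_mb * profile.reduce_cost_per_mb / reduce_tasks
    overhead = profile.task_startup_s * reduce_tasks
    return map_time + reduce_time + overhead


def optimize(profile, reduce_task_options):
    """Search the candidate settings and return (best_setting, estimate)."""
    best = min(reduce_task_options, key=lambda r: what_if(profile, r))
    return best, what_if(profile, best)


if __name__ == "__main__":
    profile = JobProfile(input_mb=10_000, map_cost_per_mb=0.01,
                         reduce_cost_per_mb=0.1, task_startup_s=2.0)
    setting, estimate = optimize(profile, [1, 8, 16, 32, 64])
    print(f"reduce tasks = {setting}, estimated runtime = {estimate:.1f} s")
```

The optimizer never executes the job; it only queries the model, which is why a cheap what-if estimate makes searching a large space of choices practical.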
THE STARFISH PHILOSOPHY
• Goal: A high-performance MAD system
• Build on Hadoop’s strengths
• How can users get good performance automatically?
STARFISH ARCHITECTURE
VISUALIZE WITH STARFISH
• See how MapReduce apps are working
• Understand Bottlenecks in Hadoop
• Find Misconfigured Hadoop Parameters
• Learn to develop MapReduce apps
OPTIMIZE WITH STARFISH
• Tune Hadoop easily
• Find optimal parameter settings for MapReduce applications
STRATEGIZE WITH STARFISH
• Make intelligent resource allocation choices for Hadoop.
• Find Instances for Workloads.
• Meet time and cost budgets with ease.
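As a rough illustration of choosing resources under time and cost budgets, the sketch below picks among candidate clusters given a runtime estimate for each. Everything here is an assumption for the example: the `ClusterOption` type, the option names, the prices, and the runtime estimates are made-up placeholders, not real instance types or measured numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterOption:
    name: str                   # hypothetical label, not a real instance type
    nodes: int
    price_per_node_hour: float  # placeholder price
    est_runtime_hours: float    # placeholder for a what-if style estimate

    @property
    def cost(self):
        """Total cost of running the workload on this cluster."""
        return self.nodes * self.price_per_node_hour * self.est_runtime_hours


def cheapest_within_deadline(options, deadline_hours):
    """Cheapest option whose estimated runtime meets the deadline, or None."""
    feasible = [o for o in options if o.est_runtime_hours <= deadline_hours]
    return min(feasible, key=lambda o: o.cost) if feasible else None


def fastest_within_budget(options, budget):
    """Fastest option whose estimated cost fits the budget, or None."""
    feasible = [o for o in options if o.cost <= budget]
    return min(feasible, key=lambda o: o.est_runtime_hours) if feasible else None


# Made-up candidates for illustration only.
OPTIONS = [
    ClusterOption("small", nodes=10, price_per_node_hour=0.1, est_runtime_hours=4.0),
    ClusterOption("medium", nodes=20, price_per_node_hour=0.1, est_runtime_hours=2.5),
    ClusterOption("large", nodes=40, price_per_node_hour=0.1, est_runtime_hours=1.5),
]

if __name__ == "__main__":
    print(cheapest_within_deadline(OPTIONS, deadline_hours=3.0))
    print(fastest_within_budget(OPTIONS, budget=4.5))
```

In this toy setup, adding nodes shortens the run but raises total cost, so the deadline and the budget pull the choice in opposite directions.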
STEPS TO USE STARFISH
• First Step: collect the profiling data from your Hadoop cluster.
• Second Step: import the profiling data into the profile store.
• Third Step: fire up the graphical or command-line interface to invoke the visualize, optimize, and strategize features.
CONCLUSION
Hadoop is now a viable competitor to existing systems for big data analytics.
Starfish fills a different void by enabling Hadoop users and applications to get good performance automatically throughout the data lifecycle in analytics.
REFERENCES
• Herodotou, Herodotos, et al. "Starfish: A self-tuning system for big data analytics." Proc. of the Fifth CIDR Conf. 2011.
• Dong, Fei. Extending Starfish to Support the Growing Hadoop Ecosystem. Diss. Duke University, 2012.
• Herodotou, Herodotos, Fei Dong, and Shivnath Babu. "MapReduce programming and cost-based optimization? Crossing this chasm with Starfish." Proceedings of the VLDB Endowment 4.12 (2011).
• http://www.cs.duke.edu/starfish/
• http://www.youtube.com/watch?v=Upxe2dzE1uk