Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
September 2016 @ Strata & Hadoop World 2016
Haoyuan (HY) Li, Gene Pang
AGENDA
• Alluxio Open Source Status and History
• Alluxio Overview
• Alluxio Use Cases and Demos
• What’s Next?
HISTORY
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Open sourced in 2013
• Apache License 2.0
• Latest stable release: Alluxio 1.2.0
• Next release (Alluxio 1.3.0) in two weeks
• Rebranded as Alluxio in 2016
OPEN SOURCE ALLUXIO
• One of the fastest growing open-source projects in the big data ecosystem
• Currently over 300 contributors from over 100 organizations
• You are welcome to join our community!
[Chart: Popular Open Source Projects’ Growth, Year 1 to Year 3; series: Spark, Kafka, Cassandra, HDFS, Alluxio]
About Us
• Team members from Google, Palantir, Uber, Yahoo with years of distributed systems development experience
• Graduated from Stanford University, UC Berkeley, CMU, Peking University, and Tsinghua, with CS masters or PhDs
• Top 9 committers of the Alluxio open source project
Team
• Haoyuan Li, CEO & Founder: co-creator of the Alluxio project while working towards a Ph.D. at UC Berkeley AMPLab
• Gene Pang, Software Engineer, Alluxio Maintainer: Ph.D. from UC Berkeley AMPLab, previously on the Google F1 team
Investors
• Andreessen Horowitz
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Enabling any application to access data from any storage system at memory-speed
[Diagram: interfaces exposed to applications: FUSE Compatible File System, Hadoop Compatible File System, Native Key-Value Interface, Native File System; interfaces to underlying storage: GlusterFS Interface, Amazon S3 Interface, Swift Interface, HDFS Interface]
• Memory is getting faster, larger, and cheaper
• Memory price is halving every 18 months
• Disk throughput is increasing slowly
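The "halving every 18 months" trend can be made concrete with a minimal sketch. It assumes an idealized exponential model in which the price halves exactly every 18 months; the function name `projected_price` and the starting price of 8.0 per GB are hypothetical and are not figures from the talk.

```python
def projected_price(initial_price: float, months: float,
                    halving_months: float = 18.0) -> float:
    """Price after `months`, assuming it halves every `halving_months`."""
    return initial_price * 0.5 ** (months / halving_months)


# Hypothetical starting price of 8.0 per GB (illustrative only).
for months in (0, 18, 36, 54):
    print(months, projected_price(8.0, months))
```

Under this model the price falls by a factor of 8 in 54 months, which is the kind of gap relative to slowly improving disk throughput that the slide points to.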
TECHNOLOGY TRENDS
Top left chart: https://lazure2.wordpress.com/2013/07/02/20-years-of-samsung-new-management-as-manifested-by-the-latest-june-20th-galaxy-ativ-innovations/
Top right chart: people.eecs.berkeley.edu/~istoica/classes/cs294/15/notes/02-TechnologyTrends.ppt
Bottom chart: jcmit.com/
[Chart: DDR performance over time, 2005 to 2014, in GB/second; series: DDR2, DDR3, DDR4]
ATTRIBUTES
Memory-Speed Virtual Distributed Storage
• File System API
• Software Only
• Scale out architecture
• Virtualized across different storage types under a unified namespace
• Memory-speed access to data
ALLUXIO SOLUTION DEPLOYMENT
[Diagram: Server A, Server B, Server C, and Server Z, each with Applications and Alluxio; Storage A, Storage B, Storage C, and Storage Z]
BENEFITS
Unification
New workflows across any data in any storage system
Performance
High performance data access
Flexibility
Work with the compute and storage frameworks of your choice
Cost
Grow compute and storage systems independently
USE CASE 1 – Accelerate I/O to/from Remote Storage
• Compute and Storage Separation
• Advantages
  • Meet different compute and storage hardware requirements efficiently
  • Scale compute and storage independently
  • Store data in traditional filers/SANs and object stores cost effectively
  • Compute on data in existing storage via Big Data computational frameworks
• Disadvantage
  • Accessing data requires remote I/O
Use Case without Alluxio
[Diagram: Spark accessing Storage; labels: "High latency, network throughput" and "Low latency, memory throughput"]
CASE STUDY
Baidu File System
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Shaoshan Liu, Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• The Alluxio cluster runs stably, providing over 50TB of RAM space
• With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds
Accelerate Access to Remote Storage
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
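The speedup figures in this case study can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the helper name `speedup` is hypothetical, and pairing the lower and upper ends of each quoted range is an assumption made for illustration.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old running time to the new running time."""
    return before_seconds / after_seconds


# Quoted query times: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
print(speedup(100, 10), speedup(150, 15))

# Batch queries: over 15 minutes before, under 30 seconds after,
# so this ratio is a lower bound on the improvement.
print(speedup(15 * 60, 30))
```

The quoted query times correspond to roughly a 10x improvement, while the 15-minute to 30-second batch case gives at least 30x, which matches the 30x figure in the results.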
USE CASE 2 – Share Data Across Jobs at Memory Speed
• Architectures Requiring Shared Data
  • Pipelines: output of one job is input of the next job
  • Different applications, jobs, or contexts read the same data
• Disadvantage
  • Sharing data requires I/O
Use Case without Alluxio
[Diagram: Spark, MapReduce, and Spark jobs sharing data through Storage via network I/O and disk I/O; label: "I/O slows down sharing"]
CASE STUDY
Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Henry Powell, Barclays
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Relational Database
Share Data Across Jobs at Memory-Speed
• 6 node deployment
• 1TB of storage
• Memory only
USE CASE 3 - Transparently Manage Data Across Storage Systems
• Reasons
  • Most enterprises have multiple storage systems
  • New (better, faster, cheaper) storage systems arise
• Disadvantage
  • Managing data across systems can be difficult
Use Case Explained
[Diagram: Spark, MapReduce, and Spark above Alluxio, with three Storage systems below; labels: "Flexible, simple" and "No application changes, new mount point"]
CASE STUDY
We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times. - Xueyan Li, Qunar
RESULTS
• Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages various storage resources including memory, SSD and disk
Transparently Manage Data Across Different Storage Systems
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
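The idea of a unified namespace where a new storage system is just a new mount point can be sketched as a toy mount table. This is a minimal illustration of the concept, not Alluxio's API: the class `MountTable`, its methods, and the storage URIs are all hypothetical.

```python
class MountTable:
    """Toy unified namespace: maps logical path prefixes to storage URIs."""

    def __init__(self):
        self._mounts = {}

    def mount(self, logical_prefix, storage_uri):
        # Normalize trailing slashes so prefixes compare consistently.
        key = logical_prefix.rstrip("/") or "/"
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, logical_path):
        # The longest matching mount prefix wins.
        matches = [
            prefix for prefix in self._mounts
            if logical_path == prefix
            or logical_path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no mount point covers {logical_path}")
        prefix = max(matches, key=len)
        suffix = logical_path[len(prefix.rstrip("/")):]
        return self._mounts[prefix] + suffix


# Hypothetical storage systems, for illustration only.
table = MountTable()
table.mount("/hdfs", "hdfs://namenode:8020/warehouse")
table.mount("/s3", "s3://example-bucket/logs")
print(table.resolve("/hdfs/events/part-0"))
print(table.resolve("/s3/2016/09/app.log"))
```

Applications in this sketch only ever see logical paths such as `/s3/2016/09/app.log`, so pointing `/s3` at a different backend changes the mount table rather than the application, which is the property the "no application changes, new mount point" label describes.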
USE CASE 4 - Compute on Data in Different Storage with Compliance Requirements
© 2016 Alluxio Confidential
• Motivation
  • Compliance with local laws restricts data storage location
  • Global analytics on this data is not possible
Use Case Explained
[Diagram: Spark, MapReduce, and Spark above Alluxio, with three Storage systems below; labels: "Flexible, simple" and "No application changes, new mount point"]
CASE STUDY
RESULTS
• Alluxio’s unified namespace enables any compute cluster to access data from storage systems at different data centers
• Enables global analytics that were not possible before
• No local persistent storage of data
Compute on Data in Different Storage with Compliance Requirement
• 500+ nodes deployment
• Memory + SSD
A Global Fortune 500 Enterprise
• Contact: {haoyuan, gene}@alluxio.com or [email protected]
• Twitter: @Alluxio
• Websites: www.alluxio.com and www.alluxio.org
Thank you!