Alluxio Presentation at Strata San Jose 2016
![Page 1: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/1.jpg)
Alluxio (formerly Tachyon): Unified Namespace and Tiered Storage
Calvin Jia, Jiri Simsa
![Page 2: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/2.jpg)
One of the Things to Watch at Strata
TechCrunch article:
“… An interesting item that made the top terms list is ‘alluxio,’ which is the recently renamed Tachyon project. Alluxio is a virtual distributed storage system, and it has a memory-centric architecture that enables data sharing across clusters at memory speed. …”
![Page 3: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/3.jpg)
Who Are We?
• Calvin Jia
– SWE @ Alluxio, Inc.
– #1 Alluxio contributor
– Twitter: @JiaCalvin
• Jiri Simsa
– SWE @ Alluxio, Inc.
– CMU Ph.D. & Google
– Twitter: @jsimsa
![Page 4: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/4.jpg)
Alluxio Inc.
• Founded by Alluxio creators and top committers
• Formerly Tachyon Nexus, Inc.
• $7.5 million Series A by Andreessen Horowitz
• Committed to the Alluxio Open Source Project
• Company Website: http://www.alluxio.com
• We are hiring!
![Page 5: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/5.jpg)
Outline
• Alluxio Introduction
• Tiered Storage
• Unified Namespace
![Page 6: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/6.jpg)
ALLUXIO:
Open Source Memory Speed
Virtual Distributed Storage
![Page 7: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/7.jpg)
Memory Speed
• Memory-centric architecture designed for memory I/O
Virtual
• Abstracts persistent storage from applications
Distributed
• Designed to scale with nothing but commodity hardware
Open Source
• One of the fastest growing project communities
![Page 8: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/8.jpg)
Contributor Growth
• Over 200 Contributors
– 3x growth over the last year
![Page 9: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/9.jpg)
Organizations
• Over 50 Organizations
![Page 10: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/10.jpg)
Alluxio Ecosystem
![Page 11: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/11.jpg)
Memory is Getting Faster
![Page 12: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/12.jpg)
Memory is Getting Cheaper
![Page 13: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/13.jpg)
Simple Examples
• Data sharing between frameworks
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
![Page 14: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/14.jpg)
Data Sharing Between Frameworks
[Diagram: a Spark job holds blocks 1 and 3 in Spark memory (storage engine and execution engine in the same process); a Hadoop MR job runs on YARN; HDFS / Amazon S3 holds blocks 1 to 4.]
Inter-process sharing slowed down by network and/or disk I/O
![Page 15: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/15.jpg)
Data Sharing Between Frameworks
[Diagram: a Spark job (Spark memory; storage engine and execution engine in the same process) and a Hadoop MR job on YARN sit on top of Alluxio, which holds blocks 1, 3 and 4 in memory; HDFS / Amazon S3 holds blocks 1 to 4 on disk.]
Inter-process sharing can happen at memory speed
![Page 16: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/16.jpg)
Data Resilience during Crashes
[Diagram: a Spark task keeps blocks 1 and 3 in the Spark memory block manager (storage engine and execution engine in the same process); HDFS / Amazon S3 holds blocks 1 to 4.]
Process crash requires network and/or disk I/O to re-read the data
![Page 17: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/17.jpg)
Data Resilience during Crashes
[Diagram: the Spark task crashes while blocks 1 and 3 sit in the Spark memory block manager (storage engine and execution engine in the same process); HDFS / Amazon S3 holds blocks 1 to 4.]
Process crash requires network and/or disk I/O to re-read the data
![Page 18: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/18.jpg)
Data Resilience during Crashes
[Diagram: after the crash of the process that combined storage engine and execution engine, only the copies of blocks 1 to 4 in HDFS / Amazon S3 remain.]
Process crash requires network and/or disk I/O to re-read the data
![Page 19: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/19.jpg)
Data Resilience during Crashes
[Diagram: a Spark task with its Spark memory block manager (storage engine and execution engine in the same process) runs on top of Alluxio, which holds blocks 1, 3 and 4 in memory; HDFS holds blocks 1 to 4 on disk.]
Process crash only needs memory I/O to re-read the data
![Page 20: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/20.jpg)
Data Resilience during Crashes
[Diagram: the Spark process (storage engine and execution engine in the same process) crashes; Alluxio still holds blocks 1, 3 and 4 in memory, and HDFS holds blocks 1 to 4 on disk.]
Process crash only needs memory I/O to re-read the data
![Page 21: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/21.jpg)
Consolidating Memory
[Diagram: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in Spark memory (storage engine and execution engine in the same process); HDFS / Amazon S3 holds blocks 1 to 4.]
Data duplicated at memory-level
![Page 22: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/22.jpg)
Consolidating Memory
[Diagram: Spark Job1 and Spark Job2 (each with its own Spark memory; storage engine and execution engine in the same process) share Alluxio, which holds a single in-memory copy of blocks 1, 3 and 4; HDFS / Amazon S3 holds blocks 1 to 4 on disk.]
Data not duplicated at memory-level
![Page 23: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/23.jpg)
Case Study: Barclays
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds
• Application: SparkSQL + Spark RDDs
• Alluxio Storage Layer: MEM
• Backend Storage: None
• Result: Speeding up Spark jobs from hours to seconds
![Page 24: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/24.jpg)
Common Questions
• Memory speed sharing among distributed applications
– HDFS interface compatible
• GC overhead introduced by in-memory caching
– Off-heap memory management
• Data set could be larger than available memory
– Tiered storage
![Page 25: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/25.jpg)
Outline
• Alluxio Introduction
• Tiered Storage
• Unified Namespace
![Page 26: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/26.jpg)
Motivation
• Memory resources are still constrained
• Alluxio data management logic is not limited to memory
• Storage resources available on compute clusters
![Page 27: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/27.jpg)
Tiered Storage
MEM
SSD
HDD
![Page 28: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/28.jpg)
Tiered Storage
• Extends Alluxio with support for SSD and/or HDD storage
• Different tiers have different characteristics
– Keep hot data in fast but limited storage
– Keep warm data in slower but abundant storage
• Workers manage their own storage
• Data allocation and eviction are driven by application access
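The allocation and eviction behaviour listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Alluxio's implementation: the `Tier` and `TieredStore` classes, the block names and the capacities are all hypothetical, and least-recently-used eviction is assumed as the policy. New blocks are allocated in the top tier; when a tier is full, its least recently read block moves one tier down.

```python
from collections import OrderedDict


class Tier:
    """One storage tier (for example MEM, SSD or HDD) with a fixed capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> size, least recently used first

    def used(self):
        return sum(self.blocks.values())


class TieredStore:
    """Hypothetical worker-side store; tiers are ordered fastest first."""

    def __init__(self, tiers):
        self.tiers = tiers

    def put(self, block_id, size, level=0):
        """Allocate a block in the given tier, evicting downwards if needed."""
        if level >= len(self.tiers):
            return  # fell below the lowest tier: no longer held by the store
        tier = self.tiers[level]
        if size > tier.capacity:
            self.put(block_id, size, level + 1)  # can never fit here
            return
        while tier.used() + size > tier.capacity:
            victim, victim_size = tier.blocks.popitem(last=False)  # LRU block
            self.put(victim, victim_size, level + 1)
        tier.blocks[block_id] = size

    def read(self, block_id):
        """Return the tier name holding the block and mark it recently used."""
        for tier in self.tiers:
            if block_id in tier.blocks:
                tier.blocks.move_to_end(block_id)
                return tier.name
        return None


store = TieredStore([Tier("MEM", 2), Tier("SSD", 2), Tier("HDD", 4)])
store.put("block 1", 1)
store.put("block 2", 1)
store.read("block 1")    # application access keeps block 1 "hot"
store.put("block 3", 1)  # MEM is full, so the least recently used block moves down
```

In this run the access to block 1 is what decides that block 2, not block 1, leaves memory, which is the sense in which eviction is driven by application access.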
![Page 29: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/29.jpg)
Tiered Storage Architecture
• Machine Type 1
– Compute Client
– Alluxio Master
– Memory, SSD, HDD
• Machine Type 2
– Compute Client
– Alluxio Worker
– Memory, SSD, HDD
![Page 30: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/30.jpg)
Tiered Storage Architecture
Machine Type 2
• Compute Client
– Alluxio Client
• Alluxio Worker
– Tiered Block Store
– Evictor
– Allocator
• Memory, SSD, HDD
![Page 31: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/31.jpg)
Automatic Data Migration
• Data can be evicted to lower layers if it is “cooling down”
• Data can be promoted to upper layers if it is “warming up”
Evict stale data to lower tier
Promote hot data to upper tier
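One way to picture "cooling down" and "warming up" is a periodic pass that compares recent access counts against thresholds. The sketch below is an assumption made for illustration: the `migrate` function, the thresholds and the use of access counts as a temperature signal are hypothetical and not taken from the talk, and tier capacity is ignored to keep the example short.

```python
TIERS = ["MEM", "SSD", "HDD"]  # fastest tier first


def migrate(placement, access_counts, hot_threshold=3, cold_threshold=1):
    """Return a new placement in which each block moves at most one tier.

    placement maps block id -> tier name; access_counts maps block id ->
    number of reads in the last observation window (hypothetical signal).
    """
    new_placement = {}
    for block, tier in placement.items():
        level = TIERS.index(tier)
        hits = access_counts.get(block, 0)
        if hits >= hot_threshold and level > 0:
            level -= 1  # warming up: promote to the upper tier
        elif hits < cold_threshold and level < len(TIERS) - 1:
            level += 1  # cooling down: evict to the lower tier
        new_placement[block] = TIERS[level]
    return new_placement


before = {"block 1": "SSD", "block 2": "MEM", "block 3": "HDD"}
recent_reads = {"block 1": 5, "block 2": 0, "block 3": 2}
after = migrate(before, recent_reads)
```

Here block 1 is promoted to MEM, block 2 is evicted to SSD, and block 3 stays on HDD because it is neither hot nor cold under these thresholds.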
![Page 32: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/32.jpg)
Pluggable Policies
• Policies can be customized to suit workloads
• Defaults provided for general scenarios
• Advanced users can optimize with additional knowledge
– For example: Optimize for iterations
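To illustrate why a pluggable policy matters for iterative workloads, here is a minimal sketch. The Python `Evictor` base class, the two policies and the `run` helper are hypothetical stand-ins rather than Alluxio's own interfaces. The example assumes a job that scans the same three blocks repeatedly with room for only two: least-recently-used eviction always discards the block that is needed next, while evicting the most recently used block keeps part of the working set cached.

```python
from abc import ABC, abstractmethod


class Evictor(ABC):
    """Hypothetical policy interface: tracks accesses and picks a victim."""

    def __init__(self):
        self._order = []  # cached block ids, least recently used first

    def on_access(self, block_id):
        if block_id in self._order:
            self._order.remove(block_id)
        self._order.append(block_id)

    def on_remove(self, block_id):
        self._order.remove(block_id)

    @abstractmethod
    def pick_victim(self):
        """Return the id of the block to evict."""


class LRUEvictor(Evictor):
    def pick_victim(self):
        return self._order[0]   # least recently used


class MRUEvictor(Evictor):
    def pick_victim(self):
        return self._order[-1]  # most recently used


def run(evictor, capacity, accesses):
    """Replay an access sequence against a cache and count the hits."""
    cached = set()
    hits = 0
    for block in accesses:
        if block in cached:
            hits += 1
        else:
            if len(cached) >= capacity:
                victim = evictor.pick_victim()
                cached.discard(victim)
                evictor.on_remove(victim)
            cached.add(block)
        evictor.on_access(block)
    return hits


iterations = [1, 2, 3] * 3  # three passes over the same three blocks
lru_hits = run(LRUEvictor(), capacity=2, accesses=iterations)
mru_hits = run(MRUEvictor(), capacity=2, accesses=iterations)
```

With this access pattern the LRU policy gets no hits at all, while the MRU policy gets three, so swapping the policy changes the outcome without touching the rest of the store.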
![Page 33: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/33.jpg)
Case Study: Baidu
Baidu Queries Data 30 Times Faster with Alluxio
• Application: Spark
• Alluxio Storage: MEM + HDD
• Backend Storage: Baidu’s File System
• 200+ nodes deployment, 2PB+ managed space
• Result: Speeding up data querying by 30x
![Page 34: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/34.jpg)
Outline
• About Alluxio
• Tiered Storage
• Unified Namespace
![Page 35: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/35.jpg)
Big Data Ecosystem
![Page 36: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/36.jpg)
Big Data Ecosystem
![Page 37: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/37.jpg)
Big Data Ecosystem
![Page 38: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/38.jpg)
Motivation
• At large organizations, data spans many storage systems (object storage, network / distributed file systems, DBs)
• Application logic needs to integrate with different types of storage systems
• Data needs to be moved around to work around application limitations
• In-house storage layers are built to address limitations of legacy storage systems
![Page 39: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/39.jpg)
Transparent Naming
• Applications can transparently and efficiently interact with remote storage through Alluxio.
• Applications do not need to use different APIs for interacting with different storage systems.
[Diagram: the Alluxio namespace at alluxio://host:port/ mirrors the storage system at s3n://bucket/directory; both contain data (reports, sales) and users (alice, bob).]
![Page 40: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/40.jpg)
Single Namespace
• Applications can read and write different storage systems.
• Decouples data location from application
[Diagram: the Alluxio namespace at alluxio://host:port/ contains data (reports, sales) and users (alice, bob); users (alice, bob) lives in Storage System A at hdfs://host:port/, and reports and sales live in Storage System B at s3n://bucket/directory.]
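The single namespace can be pictured as a mount table that maps Alluxio paths to locations in the underlying storage systems. The following is a sketch under assumptions: the `Namespace` class and its methods are hypothetical, and the mount points are read off the slide's example tree (data backed by the S3 bucket, users backed by HDFS).

```python
class Namespace:
    """Hypothetical mount table: Alluxio path prefix -> under-storage URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, alluxio_path, ufs_uri):
        key = alluxio_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path):
        """Translate an Alluxio path; the longest matching mount point wins."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if alluxio_path == mount_point or alluxio_path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {alluxio_path!r}")
        rest = alluxio_path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


ns = Namespace()
ns.mount("/data", "s3n://bucket/directory")
ns.mount("/users", "hdfs://host:port/users")

reports_location = ns.resolve("/data/reports")
alice_location = ns.resolve("/users/alice")
```

The application only ever sees `/data/reports` and `/users/alice`; which storage system holds each path is a property of the mount table, which is what decoupling data location from the application amounts to in this sketch.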
![Page 41: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/41.jpg)
Architecture
[Diagram: ALLUXIO sits between the Alluxio Interface and the UFS Interface; under the UFS Interface, an HDFS adapter, an S3 adapter and a Swift adapter connect to HDFS, S3, Swift, …]
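The adapter layer in this architecture follows a familiar pattern: one common interface, one implementation per storage system, and a lookup by URI scheme. The sketch below assumes that pattern and is not Alluxio's code; `UnderStorageAdapter`, `DictBackedAdapter` and `AdapterRegistry` are hypothetical names, and the adapters keep data in a dictionary instead of talking to real HDFS, S3 or Swift.

```python
from abc import ABC, abstractmethod


class UnderStorageAdapter(ABC):
    """Hypothetical common interface that every storage adapter implements."""

    @abstractmethod
    def read(self, path):
        """Return the bytes stored at path."""

    @abstractmethod
    def write(self, path, data):
        """Store bytes at path."""


class DictBackedAdapter(UnderStorageAdapter):
    """Stand-in for a real HDFS, S3 or Swift adapter; keeps bytes in a dict."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = data


class AdapterRegistry:
    """Picks the adapter for a URI from its scheme."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, scheme, adapter):
        self._by_scheme[scheme] = adapter

    def for_uri(self, uri):
        scheme, separator, _ = uri.partition("://")
        if not separator or scheme not in self._by_scheme:
            raise ValueError(f"no adapter registered for {uri!r}")
        return self._by_scheme[scheme]


registry = AdapterRegistry()
registry.register("hdfs", DictBackedAdapter())
registry.register("s3n", DictBackedAdapter())
registry.register("swift", DictBackedAdapter())

uri = "s3n://bucket/directory/reports"
registry.for_uri(uri).write(uri, b"quarterly numbers")
```

Supporting another storage system in this sketch means writing one more adapter and registering its scheme; the code that calls `for_uri` does not change.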
![Page 42: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/42.jpg)
Alluxio Benefits
• Enable new workloads across storage systems
• Work with the framework of your choice
• Scale storage and compute independently
![Page 43: Alluxio Presentation at Strata San Jose 2016](https://reader034.vdocument.in/reader034/viewer/2022042907/5875bb6f1a28ab33128b4669/html5/thumbnails/43.jpg)
Resources
• Alluxio Project: http://www.alluxio.org
• Development: https://github.com/Alluxio/alluxio
• Meet Friends: http://www.meetup.com/Alluxio
• Alluxio Inc: http://www.alluxio.com
• Contact us: [email protected]