2012 Storage Developer Conference. © 2012 Cleversafe, Inc. All Rights Reserved.
Bridging POSIX-like APIs and Cloud Storage
Wesley Leggette Cleversafe, Inc.
Presentation Agenda
- From POSIX to Cloud Storage/Object Storage
- User-mountable file systems
- Bridging file systems and object storage
- Mapping semantics
- Performance considerations
  - Addressing write performance
  - Write/Discard caching model
  - Performance results
- Other considerations
What did we do?
- POSIX/WIN32 APIs still important for customers
- Wanted to develop a better POSIX/Cloud driver
  - Allows high performance
  - Minimizes hardware costs, allows scale-out
- Addressed one side of the performance issue
  - Write improvement using a new caching model
- Next steps are read and metadata improvements
POSIX Historical Origins
- Designed on top of block devices
- General-purpose application I/O to the hard drive
  - Everything from documents to databases
  - Random read and write access, overwrite in place
- Also encapsulates access over the LAN (NAS)
  - "Medium latency" applications
POSIX and Performance
- API not designed for high-latency storage (WAN)
  - Typically small buffers for all operations
  - Operations expected to return quickly
  - Traditional API methods are synchronous
  - Multiple requests needed for simple operations
    - Create file: open, setattr, write
- Latency handling has improved over time
  - Async I/O, atomic open, larger transfer sizes, multiplexing, buffering
  - Improvements are "behind the scenes"; the legacy API must still be implemented
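The "multiple requests for simple operations" point can be made concrete with the actual POSIX call sequence. This is our sketch, not code from the talk; `os.fchmod` stands in for the generic setattr step. Against a local disk these calls are cheap, but against high-latency storage each one is a separate round trip.

```python
import os
import tempfile

def posix_create(path, mode, data):
    """Create a file with given permissions and initial content:
    one logical operation, four separate requests."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)  # request 1: open/create
    os.fchmod(fd, mode)                                        # request 2: set attributes
    os.write(fd, data)                                         # request 3: write content
    os.close(fd)                                               # request 4: release the handle

demo_path = os.path.join(tempfile.mkdtemp(), "example.txt")
posix_create(demo_path, 0o640, b"hello")
```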
Cloud Storage
- Cloud Storage: a model for storage as a service
  - Separates storage usage from provisioning
- Public cloud
  - Consolidates customers under large providers
  - Storage over the Internet
- Private cloud
  - Works the same technically as public cloud
- Cloud Storage APIs: CDMI, S3, OpenStack

Cloud Storage is meant for the Internet.
Object Storage
- Cloud Storage provides Object Storage interfaces
- Object Storage is for unstructured data
- Examples: books, documents, spreadsheets, presentations, audio, video, e-mail, web pages
- What do these have in common?
  - Most use pre-compressed formats
  - Written and read in streaming fashion
  - Almost never written at random offsets
REST and Object Storage
- REST (representational state transfer)
  - Roy Fielding's principles for "quality" scalable API design
- Main principle is stateless architecture
  - One storage request = one stored object
  - No session between client and server
  - Reduces round trips over the Internet
  - No sessions means easier load balancing and scalability

Used by modern Cloud Storage APIs: it works well on the Internet.
Broader Definition of Object Storage
- Objects are used in several "non-cloud" systems
  - Random write/update not available or not commonly used
  - Perhaps better categorized as "streaming storage"
- WebDAV, FTP
- HDFS
  - File system for Hadoop
  - Typically deployed to local clusters
  - Same semantic limitations: no random write, weak append support
User-mountable file storage
- The mountable "file system"
  - From local DAS to sharable NAS and SAN
- Still a very common use case
  - Works with most existing applications
  - Familiarity, ease of migration
  - Well-established APIs: WIN32, POSIX

Object Storage is for applications; file systems are for humans!
What do we want to do?
- Allow users and applications to use object storage as if it were a local file system
- Achieve the use-case advantages of NAS, plus the advantages of object storage
- File system APIs: POSIX, WIN32
- Object Storage APIs: S3, Atmos, CDMI, HDFS, OpenStack, Mezeo, Cleversafe, Nirvanix
- We will specifically discuss HDFS
HDFS Architecture
[Architecture diagram not reproduced in the transcript.]
Considering Cloud/POSIX translation
- Read model
  - Semantic models are similar
  - Positional reads usually allowed
  - Performance differences are the only major concern
  - Techniques involve prefetching, buffering, caching
- Write model
  - Semantic model is different, with reduced functionality
  - Performance implications
  - Techniques mostly involve better caching
Object Storage Write Semantics
- Limits writes to "streaming"
  - One byte in front of the other
  - Appends sometimes allowed
  - No random I/O in the middle of a file
- Better WAN performance
  - Per-call transfer sizes much larger than 4K blocks
Mapping POSIX semantics
- POSIX procedure:
  - open(): obtain a file handle, perhaps creating the file
  - write(): single call per I/O buffer chunk
    - Buffers are usually extremely small (4K-16K typical)
    - Depends on write() calls being low latency
    - Each I/O indicates an offset, which may be completely random
  - close(): closes a reference to a file handle
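The random-offset property is the crux of the mismatch, so here is a minimal illustration (ours, not from the talk): POSIX lets a write land anywhere in the file, which a streaming object-store API cannot express.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "random.bin")
fd = os.open(path, os.O_RDWR | os.O_CREAT)

# Sequential phase: many write() calls, each on a small buffer.
for _ in range(4):
    os.write(fd, b"x" * 4096)        # 4K buffers, as noted above

# Random phase: overwrite bytes in the middle of the file.  This is the
# operation that streaming object-store write APIs cannot map to a stream.
os.pwrite(fd, b"PATCH", 4096)
os.close(fd)
```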
Mapping POSIX semantics (cont.)
- HDFS procedure:
  - create(): obtain an output stream, perhaps creating the file
    - Same as open() with O_TRUNC on POSIX; replaces file content
    - append() is also available
  - Writes append to the stream; the stream is closed when finished
  - On HDFS, content is visible during writes, though the timing is not guaranteed
  - On other systems, the file is invisible until the stream is closed
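To contrast with the POSIX handle, here is a toy model of a create()-style output stream (our names, not the real HDFS client API): bytes only ever land at the tail, there is no positioning, and — as on the stricter stores mentioned above — nothing is visible until close().

```python
class StreamingOutput:
    """Toy append-only output stream, modeling create()/append() semantics."""

    def __init__(self):
        self._buf = bytearray()
        self._visible = b""          # what a reader of the store would see
        self.closed = False

    def write(self, data):
        if self.closed:
            raise ValueError("stream already closed")
        self._buf.extend(data)       # always appends at the tail; no offset argument

    def seek(self, offset):
        # The defining limitation: no random positioning within the stream.
        raise OSError("random-offset writes are not supported")

    def close(self):
        # Model a store where content becomes visible only at close time.
        self._visible = bytes(self._buf)
        self.closed = True

    def read_back(self):
        return self._visible
```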
Traditional Caching: Write/Copy
- All drivers must cache to allow random writes
- Most caches use a two-phase procedure:
  - open(): create a cache entry
  - write(): write data to the cache (write phase)
  - close(): copy from the cache to storage (copy phase)
    - Actually on release(), when all fd handles are closed
    - Cache eviction is interesting, but out of scope
- Copy is done either blocking or in the background
  - Blocking allows error reporting
  - Background is attractive when files are large
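The two phases can be sketched as follows. This is our illustration in a toy setting, not the driver's code: `DictRemote` stands in for any object store with a single whole-object put.

```python
import io

class DictRemote:
    """Toy object store: one put() call stores one whole object."""
    def __init__(self):
        self.objects = {}
    def put(self, name, data):
        self.objects[name] = data

class WriteCopyFile:
    """Sketch of the traditional write/copy cache."""
    def __init__(self, name, remote):
        self.name, self.remote = name, remote
        self.cache = io.BytesIO()        # open(): create the cache entry

    def write(self, offset, data):
        self.cache.seek(offset)          # write phase: random writes hit the
        self.cache.write(data)           # local cache, never the remote

    def close(self):
        # Copy phase: the whole cached file is copied to storage.
        # (Blocking here; a real driver might copy in the background.)
        self.remote.put(self.name, self.cache.getvalue())
```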
Performance Issues with Write/Copy
- Hard drives become the performance bottleneck
  - Network performance is increasing with Moore's Law
  - Data access speed is not trending as well
  - Disks aren't keeping up with CPU, network, memory
  - SSDs also have poor random I/O performance
- Gigabit Ethernet delivers ~100 MB/s in practice
  - Taxes the sequential performance of many drives
  - 10G Ethernet makes the problem even more pronounced
- Write caches exacerbate the issue
  - Cache I/O is mixed read/write, so performance is even worse
Random Write Issue in Practice
- A cacheless system is ideal, but impractical
  - The driver could fail random write() operations
  - But applications don't degrade gracefully when write() fails
- Most applications don't use random writes anyway
  - Remember, unstructured data is compressed
  - Compressed data can't trivially be updated at random offsets
  - Small text files aren't compressed... but they are quite small
New Technique: Write/Discard
- The driver can detect when writes are streaming
- Updated write procedure:
  - open(): create a cache entry and open a remote stream
    - Handle starts in "discard" mode
  - write(): duplex to cache and remote (write phase)
    - If a write is out of position, close the stream and go to "copy" mode
  - close(): in discard mode, clear the cache (discard phase)
    - If in copy mode, copy from the cache to a remote stream
- The cache should be write-only for most workloads
  - Used only to support the copy fallback for random writes
  - Copy mode is better for heavy random-write workloads
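The write/discard state machine can be sketched in the same toy setting (class and method names are ours, not the driver's; `DictRemote` stands in for the object store, with a streaming create() and a whole-object put() for the copy fallback):

```python
import io

class DictRemote:
    """Toy object store with a streaming create() and whole-object put()."""
    def __init__(self):
        self.objects = {}
    def put(self, name, data):
        self.objects[name] = data
    def create(self, name):
        store = self
        class Stream:                     # append-only remote stream
            def __init__(self):
                self.buf = bytearray()
            def write(self, data):
                self.buf.extend(data)
            def abort(self):
                pass                      # drop the partial remote object
            def close(self):
                store.objects[name] = bytes(self.buf)
        return Stream()

class WriteDiscardFile:
    def __init__(self, name, remote):
        self.name, self.remote = name, remote
        self.cache = io.BytesIO()         # open(): cache entry...
        self.stream = remote.create(name) # ...plus a remote stream
        self.mode = "discard"             # handle starts in discard mode
        self.pos = 0                      # next offset a streaming write expects

    def write(self, offset, data):
        if self.mode == "discard" and offset != self.pos:
            self.stream.abort()           # out of position: stop streaming,
            self.mode = "copy"            # fall back to write/copy behavior
        self.cache.seek(offset)           # duplex: the cache always gets the bytes
        self.cache.write(data)
        if self.mode == "discard":
            self.stream.write(data)       # streaming path: forward to the remote
            self.pos = offset + len(data)

    def close(self):
        if self.mode == "discard":
            self.stream.close()           # discard phase: cache is simply dropped
        else:
            # Copy fallback: behave like write/copy for random-write workloads.
            self.remote.put(self.name, self.cache.getvalue())
```

A purely sequential writer never touches copy mode, so the cache is write-only as noted above; the first out-of-position write flips the handle into copy mode for the rest of its life.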
Performance Results
- Implemented the write/discard cache for a mountable HDFS driver
  - Driver written with FUSE, in Java
- Tested on a 1-master, 3-slave HDFS configuration
- Used the iozone load tool
- Simulates concurrent reader threads in copy mode
- Write/Copy degradation is caused by the mixed workload

[Chart: "Simulated Baseline — Local HDD, Aggregate Throughput vs. Concurrent Writers". Aggregate throughput (MB/s, 0-70) vs. concurrent writers (0-8); series: Write/Discard, Write/Copy (Active).]
- Apparent performance omits the copy phase (background copy)
- Actual performance includes the copy phase (inline copy)
- Apparent performance trends with the baseline
- Copy-phase degradation seemed to highlight a bias toward writes over reads
  - To avoid cache exhaustion, write throttling would be needed to fix this

[Chart: "Write/Copy Cache — Active System, Aggregate Throughput vs. Concurrent Writers". Aggregate throughput (MB/s, 0-70) vs. concurrent writers (0-8); series: Baseline Write/Copy, HDFS Write/Copy Apparent, HDFS Write/Copy Actual.]
- The driver hits the disk bottleneck at 60 MB/s
- Aligns closely with the baseline write-only simulation

[Chart: "Write/Discard Cache, Aggregate Throughput vs. Concurrent Writers". Aggregate throughput (MB/s, 0-70) vs. concurrent writers (0-8); series: Baseline Write/Discard, HDFS Write/Discard.]
- Write/Discard consistently performs better for a streaming write workload

[Chart: "HDFS Cache Techniques, Aggregate Throughput vs. Concurrent Writers". Aggregate throughput (MB/s, 0-70) vs. concurrent writers (0-8); series: HDFS Write/Discard, HDFS Write/Copy (Active).]
Conclusions
- Driver performance is a single-system issue
  - Important for performance with fast networks
  - Addresses the limitations of cheap hardware
- Write/Discard is a good write cache model
  - Better than Write/Copy for streaming workloads
Other considerations?
- Cloud drivers additionally need to consider reads
  - Prefetching and read buffering
  - Mixed read/write in POSIX, and visibility
  - Write cache eviction policies and read performance
- Reducing the impact of retransmission on failure
- Upload techniques for higher throughput