smart_open at data science london meetup
TRANSCRIPT
smart_open: Streaming large files with a simple Pythonic API to and from S3, HDFS, WebHDFS, even zip and local files
Lev Konstantinovskiy
What?
smart_open is
a Python 2 and 3 library
for efficient streaming of very large files
with a simple Pythonic API
in 600 lines of code.
Easily switch storage backends by changing just the path, for example when data moves from your laptop to S3.
smart_open.smart_open('./foo.txt')
smart_open.smart_open('./foo.txt.gz')
smart_open.smart_open('s3://mybucket/mykey.txt')
smart_open.smart_open('hdfs://user/hadoop/my_file.txt')
smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt')
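To illustrate the idea behind that uniform API, here is a minimal stdlib-only sketch (the `open_stream` helper is hypothetical, not part of smart_open): dispatch on the path alone and hand back an ordinary file-like object, so calling code never changes when the storage does.

```python
import gzip
import os
import tempfile

def open_stream(path, mode='rb'):
    """Toy dispatcher: pick an opener from the path alone.
    smart_open does the same for s3://, hdfs://, webhdfs://, etc."""
    if path.endswith('.gz'):
        return gzip.open(path, mode)
    return open(path, mode)

# The calling code is identical for plain and compressed files:
tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, 'foo.txt')
packed = os.path.join(tmp, 'foo.txt.gz')

with open_stream(plain, 'wb') as f:
    f.write(b'hello\n')
with open_stream(packed, 'wb') as f:
    f.write(b'hello\n')

for p in (plain, packed):
    with open_stream(p) as f:
        assert f.read() == b'hello\n'
```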
Who?
Open-source, MIT License. Maintained by RaRe Technologies, headed by Radim Rehurek aka piskvorky.
Why?
- Originally part of gensim, an out-of-core open-source text processing library (word2vec, LDA, etc.). smart_open is used there for streaming large text corpora.
Why? Boto is not Pythonic :(
- Study 15 pages of the boto book before using S3.
Solution:
smart_open is Pythonised boto
What is “Pythonic”?
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
Write more than 5 GB to S3: multipart-ing in Boto

>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> # Use a chunk size of 50 MiB
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
>>> # Send the file parts, using FileChunkIO to create a file-like object
>>> # that points to a certain byte range within the original file. We
>>> # set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
>>> # Finish the upload
>>> mp.complete_upload()

# Note: if you forget to call either mp.complete_upload() or mp.cancel_upload(), you will be left with an incomplete upload and charged for the storage consumed by the uploaded parts. A call to bucket.get_all_multipart_uploads() can help show lost multipart upload parts.
Write more than 5 GB to S3: multipart-ing in smart_open

>>> # stream content *into* S3 (write mode, multipart-ing behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
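Behind the scenes, write mode buffers what you write and flushes a part to S3 each time enough bytes accumulate. A rough stdlib-only sketch of that buffering logic (the `MultipartWriter` class and its names are hypothetical, with the actual upload stubbed out as a callable):

```python
class MultipartWriter:
    """Toy model of multipart buffering: accumulate bytes,
    flush a 'part' once min_part_size is reached."""
    def __init__(self, upload_part, min_part_size=50 * 1024 * 1024):
        self.upload_part = upload_part   # callable(part_num, data)
        self.min_part_size = min_part_size
        self.buf = bytearray()
        self.part_num = 0

    def write(self, data):
        self.buf.extend(data)
        if len(self.buf) >= self.min_part_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.part_num += 1
            self.upload_part(self.part_num, bytes(self.buf))
            self.buf = bytearray()

    def close(self):
        self.flush()  # skipping the final flush is exactly the boto gotcha above

# Collect (part_num, size) pairs instead of really uploading:
parts = []
w = MultipartWriter(lambda n, b: parts.append((n, len(b))), min_part_size=10)
for line in [b'first line\n', b'second line\n', b'third line\n']:
    w.write(line)
w.close()
print(parts)  # each flushed part holds at least min_part_size bytes, except possibly the last
```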
Write more than 5 GB to S3: multipart-ing

Boto:
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> # Use a chunk size of 50 MiB (feel free to change this)
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
>>> # Send the file parts, using FileChunkIO to create a file-like object
>>> # that points to a certain byte range within the original file. We
>>> # set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
>>> # Finish the upload
>>> mp.complete_upload()

smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
From S3 to memory
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # Creates a StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
MemoryError
>>> # Workaround for the memory error: write to local disk first. Needs a large local disk!

smart_open:
>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     # bonus: the stream is seekable
...     fin.seek(0)           # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes
From large iterator to S3
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(list(my_iterator))
Traceback (most recent call last):
MemoryError
>>> # Workaround: write to local disk first. Needs a large local disk!

smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
>>> # Streamed input is uploaded in chunks, as soon as `min_part_size` bytes are accumulated
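The key difference from `set_contents_from_string(list(my_iterator))` is that the iterator is never materialised: memory stays bounded no matter how long it is. A stdlib-only sketch of consuming an iterator in fixed-size pieces (the `stream_in_chunks` helper is hypothetical, standing in for smart_open's write path):

```python
import itertools

def stream_in_chunks(iterator, chunk_size=3):
    """Consume an iterator piece by piece instead of list()-ing it,
    so memory is bounded by chunk_size rather than the full input."""
    it = iter(iterator)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk  # a real writer would upload this chunk here

my_iterator = ('line %d\n' % i for i in range(7))
chunks = [len(c) for c in stream_in_chunks(my_iterator)]
print(chunks)  # → [3, 3, 1]
```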
Un/Zipping line by line
>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
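For local `.gz` paths, this is essentially what smart_open does with the standard library's gzip module: compress on write, decompress line by line on read, never holding the whole file in memory. A small self-contained round trip (file path chosen here for illustration):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'foo.txt.gz')

# write compressed content
with gzip.open(path, 'wb') as fout:
    fout.write(b'some content\n')
    fout.write(b'more content\n')

# stream it back line by line, without decompressing to disk first
with gzip.open(path, 'rb') as fin:
    lines = [line for line in fin]

assert lines == [b'some content\n', b'more content\n']
```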
Summary of Why?
Working with large S3 files using Amazon's default Python library, boto, is a pain:
- limited by RAM: its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded into RAM, no streaming);
- nasty hidden gotchas in boto's multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that.
It builds on boto but offers a cleaner API.
The result is less code for you to write and fewer bugs to make.
- Bonus: a gzip ContextManager even on Python 2.5 and 2.6.
Streaming out-of-core read and write for:
- S3
- HDFS
- WebHDFS (no need to use the requests library yourself!)
- local files
- local compressed files
smart_open is not just for S3!
Thanks!
Lev Konstantinovskiy
github.com/tmylk
@teagermylk