smart_open at Data Science London meetup


TRANSCRIPT

Page 1: smart_open at Data Science London meetup

smart_open: Streaming large files with a simple Pythonic API to and from S3, HDFS, WebHDFS, even zip and local files

Lev Konstantinovskiy

Page 2: smart_open at Data Science London meetup

What?

smart_open is

a Python 2 and 3 library

for efficient streaming of very large files

with a simple Pythonic API

in 600 lines of code.

Page 3: smart_open at Data Science London meetup

Easily switch backends by changing just the path when the data move, for example from your laptop to S3:

smart_open.smart_open('./foo.txt')

smart_open.smart_open('./foo.txt.gz')

smart_open.smart_open('s3://mybucket/mykey.txt')

smart_open.smart_open('hdfs://user/hadoop/my_file.txt')

smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt')
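The calls above differ only in the URI; a minimal sketch of the idea (bucket and file names are placeholders) is that the same read loop works over any of them:

>>> import smart_open
>>> # same loop, different backends -- only the URI changes (placeholder paths)
>>> for uri in ['./foo.txt', './foo.txt.gz', 's3://mybucket/mykey.txt']:
...     for line in smart_open.smart_open(uri):
...         print line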

Page 4: smart_open at Data Science London meetup

Who?

Open-source MIT License. Maintained by RaRe Technologies. Headed by Radim Rehurek aka piskvorky.

Page 5: smart_open at Data Science London meetup

Why?

- Originally part of gensim, an out-of-core open-source text processing library (word2vec, LDA, etc.). smart_open is used there for streaming large text corpora.

Page 6: smart_open at Data Science London meetup

Why? Boto is not Pythonic :(

- Study 15 pages of the boto book before using S3.

Solution:

smart_open is Pythonised boto

Page 7: smart_open at Data Science London meetup

What is “Pythonic”?

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

PEP 20, The Zen of Python

Page 8: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing in Boto

>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> # Use a chunk size of 50 MiB
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
>>> # Send the file parts, using FileChunkIO to create a file-like object
>>> # that points to a certain byte range within the original file. We
>>> # set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
>>> # Finish the upload
>>> mp.complete_upload()

Note that if you forget to call either mp.complete_upload() or mp.cancel_upload(), you will be left with an incomplete upload and charged for the storage consumed by the uploaded parts. A call to bucket.get_all_multipart_uploads() can help to show lost multipart upload parts.
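Those lost parts can also be listed and cleaned up with boto itself; a minimal sketch, assuming a boto 2 connection and an existing bucket named 'mybucket':

>>> import boto
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> # list multipart uploads that were started but never completed
>>> for mp in b.get_all_multipart_uploads():
...     print mp.key_name, mp.initiated
...     mp.cancel_upload()  # free the orphaned parts so they stop costing money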

Page 9: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing in smart_open

>>> # stream content *into* S3 (write mode, multiparting behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

Page 10: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing

Boto: the same multipart boilerplate as on the previous slides.

smart_open:

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

PEP 20, The Zen of Python

Page 11: smart_open at Data Science London meetup

From S3 to memory

Boto:

>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # creates a StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
  ...
MemoryError
>>> # Workaround for the memory error: write to local disk first.
>>> # Needs a large local disk!

smart_open:

>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     # bonus:
...     fin.seek(0)           # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes
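Because only one line is in memory at a time, aggregating over an object far larger than RAM is straightforward; a minimal sketch, using the same placeholder key as above:

>>> import smart_open
>>> # count lines and bytes of a huge S3 object without ever holding it in RAM
>>> n_lines, n_bytes = 0, 0
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         n_lines += 1
...         n_bytes += len(line)
>>> print n_lines, n_bytes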

Page 12: smart_open at Data Science London meetup

From large iterator to S3

Boto:

>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(''.join(my_iterator))  # materialises everything in RAM
Traceback (most recent call last):
  ...
MemoryError
>>> # Workaround: write to local disk first. Needs a large local disk!

smart_open:

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
>>> # Streamed input is uploaded in chunks, as soon as
>>> # `min_part_size` bytes are accumulated.
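The chunking threshold mentioned in the comment is configurable; a minimal sketch, assuming `min_part_size` is accepted as a keyword argument by smart_open.smart_open and using a hypothetical generator in place of a real corpus:

>>> import smart_open
>>> # hypothetical stand-in for a large iterator
>>> def my_lines():
...     for i in range(10 ** 6):
...         yield 'line number %d' % i
>>> # upload a part as soon as ~50 MB have accumulated (assumed keyword)
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb',
...                            min_part_size=50 * 1024 * 1024) as fout:
...     for line in my_lines():
...         fout.write(line + '\n')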

Page 13: smart_open at Data Science London meetup

Un/Zipping line by line

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
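Read and write compression can also be combined; a minimal sketch (hypothetical local file names) that re-compresses a gzip file into bzip2 one line at a time:

>>> import smart_open
>>> # stream-recompress: gzip in, bzip2 out, constant memory
>>> with smart_open.smart_open('./foo.txt.bz2', 'wb') as fout:
...     for line in smart_open.smart_open('./foo.txt.gz'):
...         fout.write(line)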

Page 14: smart_open at Data Science London meetup

Summary of Why?

Working with large S3 files using Amazon's default Python library, boto, is a pain:

- Limited by RAM: its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded entirely into RAM, no streaming).

- There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that.

It builds on boto but offers a cleaner API.

The result is less code for you to write and fewer bugs to make.

- Bonus: gzip works as a ContextManager even on Python 2.5 and 2.6.

Page 15: smart_open at Data Science London meetup

Streaming out-of-core read and write for:

- S3

- HDFS

- WebHDFS (no need to use the requests library!)

- local files

- local compressed files

smart_open is not just for S3!
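Since every backend exposes the same file-like interface, data can be piped from one store to another; a minimal sketch (placeholder paths) that streams a file from HDFS straight into S3:

>>> import smart_open
>>> # no temporary file on local disk, no full copy in RAM
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...         fout.write(line)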

Page 16: smart_open at Data Science London meetup

Thanks!

Lev Konstantinovskiy

github.com/tmylk

@teagermylk