smart_open at Data Science London meetup


TRANSCRIPT

Page 1: smart_open at Data Science London meetup

smart_open: Streaming large files with a simple Pythonic API to and from S3, HDFS, WebHDFS, even zip and local files

Lev Konstantinovskiy

Page 2: smart_open at Data Science London meetup

What?

smart_open is

a Python 2 and 3 library

for efficient streaming of very large files

with a simple Pythonic API

in 600 lines of code.

Page 3: smart_open at Data Science London meetup

Easily switch backends by changing just the path when the data move, for example from your laptop to S3:

smart_open.smart_open('./foo.txt')

smart_open.smart_open('./foo.txt.gz')

smart_open.smart_open('s3://mybucket/mykey.txt')

smart_open.smart_open('hdfs://user/hadoop/my_file.txt')

smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt')
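The calls above differ only in the URI; a minimal sketch of the idea (bucket and file names are placeholders) is that the same read loop works over any of them:

>>> import smart_open
>>> # same loop, different backends -- only the URI changes (placeholder paths)
>>> for uri in ['./foo.txt', './foo.txt.gz', 's3://mybucket/mykey.txt']:
...     for line in smart_open.smart_open(uri):
...         print line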

Page 4: smart_open at Data Science London meetup

Who?

Open-source MIT License. Maintained by RaRe Technologies. Headed by Radim Rehurek aka piskvorky.

Page 5: smart_open at Data Science London meetup

Why?

- Originally part of gensim, an out-of-core open-source text processing library (word2vec, LDA, etc.). smart_open is used there for streaming large text corpora.

Page 6: smart_open at Data Science London meetup

Why? Boto is not Pythonic :(

- Study 15 pages of the boto book before using S3.

Solution:

smart_open is Pythonised boto

Page 7: smart_open at Data Science London meetup

What is “Pythonic”?

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

PEP 20, The Zen of Python

Page 8: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing in Boto

>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> # Use a chunk size of 50 MiB
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
>>> # Send the file parts, using FileChunkIO to create a file-like object
>>> # that points to a certain byte range within the original file. We
>>> # set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
>>> # Finish the upload
>>> mp.complete_upload()

Note that if you forget to call either mp.complete_upload() or mp.cancel_upload(), you will be left with an incomplete upload and charged for the storage consumed by the uploaded parts. A call to bucket.get_all_multipart_uploads() can help to show lost multipart upload parts.
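Those lost parts can also be listed and cleaned up with boto itself; a minimal sketch, assuming a boto 2 connection and an existing bucket named 'mybucket':

>>> import boto
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> # list multipart uploads that were started but never completed
>>> for mp in b.get_all_multipart_uploads():
...     print mp.key_name, mp.initiated
...     mp.cancel_upload()  # free the orphaned parts so they stop costing money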

Page 9: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing in smart_open

>>> # stream content *into* S3 (write mode, multiparting behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

Page 10: smart_open at Data Science London meetup

Write more than 5GB to S3: multipart-ing

Boto: the same multipart boilerplate as on the previous slides.

smart_open:

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

PEP 20, The Zen of Python

Page 11: smart_open at Data Science London meetup

From S3 to memory

Boto:

>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # creates a StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
  ...
MemoryError
>>> # Workaround for the memory error: write to local disk first.
>>> # Needs a large local disk!

smart_open:

>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     # bonus:
...     fin.seek(0)           # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes
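Because only one line is in memory at a time, aggregating over an object far larger than RAM is straightforward; a minimal sketch, using the same placeholder key as above:

>>> import smart_open
>>> # count lines and bytes of a huge S3 object without ever holding it in RAM
>>> n_lines, n_bytes = 0, 0
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         n_lines += 1
...         n_bytes += len(line)
>>> print n_lines, n_bytes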

Page 12: smart_open at Data Science London meetup

From large iterator to S3

Boto:

>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(''.join(my_iterator))  # materialises everything in RAM
Traceback (most recent call last):
  ...
MemoryError
>>> # Workaround: write to local disk first. Needs a large local disk!

smart_open:

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
>>> # Streamed input is uploaded in chunks, as soon as
>>> # `min_part_size` bytes are accumulated.
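The chunking threshold mentioned in the comment is configurable; a minimal sketch, assuming `min_part_size` is accepted as a keyword argument by smart_open.smart_open and using a hypothetical generator in place of a real corpus:

>>> import smart_open
>>> # hypothetical stand-in for a large iterator
>>> def my_lines():
...     for i in range(10 ** 6):
...         yield 'line number %d' % i
>>> # upload a part as soon as ~50 MB have accumulated (assumed keyword)
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb',
...                            min_part_size=50 * 1024 * 1024) as fout:
...     for line in my_lines():
...         fout.write(line + '\n')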

Page 13: smart_open at Data Science London meetup

Un/Zipping line by line

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
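Read and write compression can also be combined; a minimal sketch (hypothetical local file names) that re-compresses a gzip file into bzip2 one line at a time:

>>> import smart_open
>>> # stream-recompress: gzip in, bzip2 out, constant memory
>>> with smart_open.smart_open('./foo.txt.bz2', 'wb') as fout:
...     for line in smart_open.smart_open('./foo.txt.gz'):
...         fout.write(line)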

Page 14: smart_open at Data Science London meetup

Summary of Why?

Working with large S3 files using Amazon's default Python library, boto, is a pain:

- Limited by RAM: its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded entirely into RAM, no streaming).

- There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that.

It builds on boto but offers a cleaner API.

The result is less code for you to write and fewer bugs to make.

- Bonus: gzip works as a ContextManager even on Python 2.5 and 2.6.

Page 15: smart_open at Data Science London meetup

Streaming out-of-core read and write for:

- S3

- HDFS

- WebHDFS (no need to use the requests library!)

- local files

- local compressed files

smart_open is not just for S3!
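Since every backend exposes the same file-like interface, data can be piped from one store to another; a minimal sketch (placeholder paths) that streams a file from HDFS straight into S3:

>>> import smart_open
>>> # no temporary file on local disk, no full copy in RAM
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...         fout.write(line)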

Page 16: smart_open at Data Science London meetup

Thanks!

Lev Konstantinovskiy

github.com/tmylk

@teagermylk