Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)


Upload: wesm

Post on 27-Jan-2015


DESCRIPTION

by Wes McKinney (@wesmckinn) at PyData NYC 2013

TRANSCRIPT

Page 1: Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

PyData NYC 2013

Practical Medium Data Analytics with Python

Page 2: Practical Medium Data Analytics with Python

PyData NYC 2013

10 Things I Hate About pandas

Page 3: Wes McKinney

www.datapad.io

• Former quant and MIT math dude

• Creator of the pandas project for Python

• Author of Python for Data Analysis (O’Reilly)

• Founder and CEO of DataPad

@wesmckinn

Page 4

• > 20k copies since Oct 2012

• Bringing many new people to Python and data analysis with code

Page 5

• http://datapad.io

• Founded in 2013, located in SF

• In private beta, join us!

• Hiring for engineering

Page 6: Why hate on pandas?

Page 8: pandas rocks!

Page 10: So, pandas

• Easy-to-use, fast in-memory data wrangling and analytics library

• Enabled loads of complex data work to be done by mere mortals in Python

• Might have kept R from taking over the world (hehe)

Page 12: pandas, the project

• 170 distinct contributors

• Over 5400 issues and pull requests on GitHub

• Upcoming 0.13 release

Page 13: But.

• pandas’s broad applicability is also a liability

• Only game in town for a lot of things

• pandas is being used in some unplanned ways

Page 14: Some things to love

• No more structured dtype drudgery!

• Easy IO!

• Data alignment!

• Hierarchical indexing!

• Time series analytics!
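To make two of these bullets concrete, here is a minimal sketch of data alignment and hierarchical indexing. The labels and values are invented for illustration and are not from the talk.

```python
import pandas as pd

# Data alignment: arithmetic matches on index labels, not on position.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
aligned = s1 + s2  # "a" has no partner in s2, so its result is NaN

# Hierarchical indexing: a two-level index (hypothetical state/candidate labels).
idx = pd.MultiIndex.from_tuples(
    [("CA", "A"), ("NY", "A"), ("NY", "B")], names=["state", "cand"]
)
amounts = pd.Series([25.0, 100.0, 50.0], index=idx)

# Selecting on the outer level returns the inner level as the new index.
ny = amounts.loc["NY"]
```

Neither operation needs an explicit join or loop, which is the convenience the slide is pointing at.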

Page 15: More things to love

• Table reshaping

• Missing data handling

• pandas.merge, pandas.concat

• Expressive groupby machinery
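A short sketch that touches each bullet above in turn, using two small invented tables (the column names are assumptions chosen for the example):

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
contribs = pd.DataFrame({
    "cand_id": [1, 1, 2, 2, 2],
    "state": ["NY", "CA", "NY", "NY", "TX"],
    "amount": [100.0, 250.0, 50.0, None, 75.0],
})
cands = pd.DataFrame({"cand_id": [1, 2], "name": ["A", "B"]})

# pandas.merge: SQL-style join on a shared key.
joined = pd.merge(contribs, cands, on="cand_id", how="left")

# groupby + missing data handling: sum() skips the NaN amount by default.
totals = joined.groupby(["name", "state"])["amount"].sum()

# Table reshaping: pivot the inner index level (state) into columns.
wide = totals.unstack("state")
```

Combinations that never occur in the data, such as candidate A in TX, come out as NaN in the reshaped table.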

Page 16: Some pandas use cases

• General data wrangling

• ETL jobs

• Business analytics (incl. BI uses)

• Time series analysis, statistical modeling

Page 17: pandas does many things that are tedious, slow, or difficult to do correctly without it

Page 18: Unfortunately, pandas is not a database

Page 19: #1 Slightly too far from the metal

• DataFrame’s internal structure intended to make row-oriented ops fast on numerical data

• Python objects can be used as data, indices (a feature, not a bug)

Page 20: #2 No support (yet) for memory maps

• Many analytics ops require a small portion of the data

• Many ways to “materialize” the full data set in memory by accident

• Axis indexes wouldn’t necessarily make sense on out-of-core data sets

Page 21: #2 No support (yet) for memory maps

• N.B. HDF5/PyTables support is a partial solution
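The sketch below is not memory mapping. It shows a common workaround for the "small portion of the data" point: read only the columns a query needs, in bounded chunks, and combine partial aggregates. The CSV content is invented, and an in-memory string stands in for a file too large to load at once.

```python
import io
import pandas as pd

# Hypothetical CSV; in practice this would be a path to a large file.
csv_text = (
    "cand_nm,contbr_st,contb_receipt_amt\n"
    "A,NY,100\n"
    "B,CA,50\n"
    "A,CA,25\n"
    "B,NY,10\n"
)

running = None
# usecols limits the columns parsed; chunksize bounds the rows held in memory.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["cand_nm", "contb_receipt_amt"],
    chunksize=2,
)
for chunk in reader:
    part = chunk.groupby("cand_nm")["contb_receipt_amt"].sum()
    # Combine partial sums; fill_value covers groups missing from one side.
    running = part if running is None else running.add(part, fill_value=0)
```

This only works for aggregates that can be combined from partial results (sums, counts, min/max), which is part of why it is a workaround and not a general answer.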

Page 22: #3 No tight database integration

• Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

• Inadequacy of pandas/NumPy data type systems

Page 23: #3 No tight database integration

• Jobs with heavy SQL reading are slow and use tons of memory

• TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays
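One way to limit the memory cost of SQL-heavy jobs, sketched here with the standard-library sqlite3 driver and an invented table, is to push the aggregation into the database so that only the small result set is converted to a DataFrame. This is an illustration of the problem, not the ODBC integration proposed in the TODO.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical miniature "fec" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
con.executemany(
    "INSERT INTO fec VALUES (?, ?)",
    [("A", 100.0), ("B", 50.0), ("A", 25.0)],
)
con.commit()

# The database does the GROUP BY; pandas only receives one row per candidate.
totals = pd.read_sql_query(
    "SELECT cand_nm, SUM(contb_receipt_amt) AS total "
    "FROM fec GROUP BY cand_nm ORDER BY cand_nm",
    con,
)
con.close()
```

Rows still pass through Python objects on their way into the DataFrame, which is the overhead the slide's TODO aims to remove.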

Page 24: #4 Best-efforts NA representation

• Inconsistent representation of missing data

• No Boolean or Integer NA values

• NA needs to be a first class citizen in analytics operations
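A minimal illustration of the "no Boolean or Integer NA" point with NumPy-backed dtypes: introducing a missing value silently changes the column's type. The data is invented.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                      # dtype int64
# Reindexing to a label that does not exist introduces a missing value;
# int64 cannot hold NaN, so the whole column is upcast to float64.
with_gap = ints.reindex([0, 1, 2, 3])

flags = pd.Series([True, False])                 # dtype bool
# bool cannot hold NaN either, so the column falls back to generic objects.
flags_with_gap = flags.reindex([0, 1, 2])
```

The float upcast loses the guarantee that values are integers, and the object fallback gives up the compact boolean storage.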

Page 25: #5 RAM management

• Difficult to understand the memory footprint of a pandas object

• Ample data copying throughout library

• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk
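Later pandas releases added `DataFrame.memory_usage`, which makes part of the footprint question answerable. The sketch below assumes such a release and a 64-bit platform, and uses invented data. It shows why object columns are easy to underestimate: the shallow figure counts only the pointers.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": pd.Series(range(1000), dtype="float64"),
    # dtype=object forces classic Python-string storage for the comparison.
    "name": pd.Series(
        ["candidate-%d" % (i % 10) for i in range(1000)], dtype=object
    ),
})

# Shallow: 8 bytes per element for both columns (values or pointers).
shallow = df.memory_usage(deep=False)
# Deep: also counts the Python string objects the pointers refer to.
deep = df.memory_usage(deep=True)
```

Comparing `shallow["name"]` with `deep["name"]` shows how much memory sits outside the array itself.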

Page 26: #6 Weak support for categorical data

• Makes pandas not quite a fully-fledged R replacement

• GroupBy and Joins slower than they could be
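The idea behind categorical data is dictionary encoding: store each distinct value once and keep small integer codes per row. A sketch with invented data follows. `pd.factorize` exposes the encoding directly, and the `category` dtype used at the end was added to pandas after this talk.

```python
import pandas as pd

states = pd.Series(["NY", "CA", "NY", "TX", "CA", "NY"])

# Integer codes in order of first appearance, plus the distinct values.
codes, uniques = pd.factorize(states)

# Categorical dtype (later pandas releases) keeps codes and categories together;
# categories are sorted when inferred from the data.
cat = states.astype("category")
```

Grouping or joining on integer codes avoids repeated string comparisons and hashing, which is the speedup the second bullet alludes to.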

Page 27: #7 Complex GroupBy operations get messy

• Must write custom functions to pass to .apply(...)

• Easy to run up against DRY problems and general Python syntax limitations
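As an example of the pattern the slide describes, a per-group weighted mean has no built-in aggregation and needs a hand-written function passed to `.apply`. The column names and values here are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "cand_nm": ["A", "A", "B"],
    "amount": [100.0, 200.0, 50.0],
    "weight": [1.0, 3.0, 2.0],
})

def weighted_mean(group):
    """Weighted mean of 'amount' using 'weight', for one group's rows."""
    return (group["amount"] * group["weight"]).sum() / group["weight"].sum()

# Select the needed columns explicitly, then apply the custom function per group.
result = df.groupby("cand_nm")[["amount", "weight"]].apply(weighted_mean)
```

Each new variation (a different weight column, a second statistic) tends to mean another near-copy of the function, which is the DRY problem in the second bullet.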

Page 28: #8 Appending data slow and tedious

• DataFrame not intended as a database table

• Makes streaming data use a challenge

• B+ tree tables interesting?
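Because appending to a DataFrame copies the existing rows, a common workaround is to buffer incoming pieces and concatenate once. A minimal sketch with invented batches:

```python
import pandas as pd

# Hypothetical stream of small incoming batches.
batches = [pd.DataFrame({"t": [i], "value": [i * 1.5]}) for i in range(5)]

# Slow pattern (avoid): calling pd.concat([table, batch]) inside the loop
# re-copies every earlier row on each iteration.

# Cheaper pattern: keep the pieces in a list and concatenate a single time.
table = pd.concat(batches, ignore_index=True)
```

This helps batch jobs but not true streaming use, where the table must stay queryable while rows arrive.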

Page 29: #9 Limited type system, column metadata

• Currencies, units

• Time zones

• Geographic data

• Composite data types

Page 30: #10 No true query processing layer

• Filter: WHERE, HAVING

• Group: GROUP BY

• Join: JOIN

• Aggregate: SUM, MEAN, ...

• Limit/TopK: LIMIT

• Sorting: ORDER BY
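For reference, each of these SQL operations has a pandas counterpart, but they run as separate eager steps. The sketch below uses invented tables and current pandas method names (`sort_values` postdates the talk).

```python
import pandas as pd

# Hypothetical toy tables.
fec = pd.DataFrame({
    "cand_id": [1, 1, 2, 2, 3],
    "amount": [100.0, 10.0, 60.0, 70.0, 30.0],
})
cands = pd.DataFrame({"cand_id": [1, 2, 3], "name": ["A", "B", "C"]})

filtered = fec[fec["amount"] > 20]                    # WHERE
joined = filtered.merge(cands, on="cand_id")          # JOIN
totals = joined.groupby("name")["amount"].sum()       # GROUP BY + SUM
big = totals[totals > 100]                            # HAVING
top = totals.sort_values(ascending=False).head(2)     # ORDER BY + LIMIT
```

Every line materializes a full intermediate result. Nothing plans the pipeline as a whole, for example by fusing the filter into the aggregation, which is what a query processing layer would do.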

Page 31: #11 “Slow”: no multicore / distributed algos

• Hampered by use of Python data structures / GIL interactions

• Object internals not designed for concurrent use

Page 32: Oh no what do we do

Page 33: Stop believing in the “one tool to rule them all”

Page 34: “Real Artists Ship” - Steve Jobs

Page 36: Focus on results

• I am heavily biased by a focus on business analytics/BI use cases

• Need production-ready software to ship in a relatively short time frame

Page 37: A new project

• In internal development at DataPad

• Code named “badger”

• pandas-ish syntax: designed for data processing and analytical queries

Page 38: Badger in a nutshell

• Consistent data type system

• Compressed columnar binary storage

• High-performance analytical query processor

• Data preparation/cleaning tools

Page 39: Badger in a nutshell

• Time series analytics

• Immutable array data, little copying

• Analytics kernels: written in C with no dependencies

• Caching of useful intermediates

Page 40: Some benchmarks

• Data set: 2012 Election data (FEC)

• 5.3 million records, 7 columns

• Tools

  • pandas

  • badger

  • R: data.table

  • SQL: PostgreSQL, SQLite

Page 41: Query 1

• Total contributions by candidate

SELECT cand_nm, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm
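The slides do not show the pandas code that was benchmarked. One plausible pandas equivalent of Query 1 is sketched here on a tiny invented `fec` frame that reuses the column names from the SQL.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 5.3 million row FEC table.
fec = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "B"],
    "contb_receipt_amt": [100.0, 50.0, 25.0, 10.0],
})

# GROUP BY cand_nm with SUM; as_index=False keeps cand_nm as a regular column,
# and rename mirrors the SQL alias "AS total".
total = (
    fec.groupby("cand_nm", as_index=False)["contb_receipt_amt"]
    .sum()
    .rename(columns={"contb_receipt_amt": "total"})
)
```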

Page 42: Query 1

• Total contributions by candidate

badger (in-memory)    :   19ms  (1x)
badger (from-disk)    :  131ms  (6.9x)
pandas (in-memory)    :  273ms  (14.3x)
R data.table 1.8.10   :  382ms  (20x)
PostgreSQL            :   4.7s  (247x)
SQLite                :    72s  (3800x)

Page 43: Query 2

• Total contributions by candidate and state

SELECT cand_nm, contbr_st, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st
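A possible pandas rendering of Query 2, again on invented data with the SQL's column names. Grouping by two keys yields a hierarchically indexed result, which can be flattened to match the shape of the SQL output.

```python
import pandas as pd

# Hypothetical miniature stand-in for the FEC table.
fec = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "A"],
    "contbr_st": ["NY", "CA", "CA", "NY"],
    "contb_receipt_amt": [100.0, 50.0, 25.0, 10.0],
})

# Two grouping keys give a Series with a (cand_nm, contbr_st) MultiIndex.
total = fec.groupby(["cand_nm", "contbr_st"])["contb_receipt_amt"].sum()

# Flatten back to one row per (candidate, state) pair, like the SQL result.
flat = total.reset_index(name="total")
```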

Page 44: Query 2

• Total contributions by candidate and state

badger (in-memory)    :  269ms  (1x)
badger (from-disk)    :  391ms  (1.5x)
R data.table 1.8.10   :  500ms  (1.8x)
pandas (in-memory)    :  770ms  (2.9x)
PostgreSQL            :  5.96s  (23x)

Page 45: Query 3

• Total contributions by candidate with 2 filter predicates

SELECT cand_nm, sum(contb_receipt_amt) AS total
FROM fec
WHERE contb_receipt_dt BETWEEN '2012-05-01' AND '2012-11-05'
  AND contb_receipt_amt BETWEEN 0 AND 2500
GROUP BY cand_nm
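Query 3 could be written in pandas roughly as follows. The rows are invented, and the sketch assumes `contb_receipt_dt` has already been parsed to datetimes. `Series.between` is inclusive on both ends by default, matching SQL `BETWEEN`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the FEC table.
fec = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B", "B"],
    "contb_receipt_amt": [100.0, 5000.0, 50.0, 2500.0, 75.0],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-06-01", "2012-06-02", "2012-04-30", "2012-11-05", "2012-07-04"]
    ),
})

# The two WHERE predicates as boolean masks.
in_window = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_range = fec["contb_receipt_amt"].between(0, 2500)

# Filter first, then aggregate per candidate.
total = fec[in_window & in_range].groupby("cand_nm")["contb_receipt_amt"].sum()
```

Here the filter builds two full-length masks and a filtered copy before any aggregation starts.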

Page 46: Query 3

• Total contributions by candidate with 2 filter predicates

badger (in-memory)    :   96ms  (1x)
badger (from-disk)    :  275ms  (2.9x)
pandas (in-memory)    :  946ms  (9.8x)
PostgreSQL            :   6.2s  (65x)

Page 47: Badger, the future

• Distributed in-memory analytics

• Multicore algorithms

• ETL job-building tools

• Open source in some form someday

• Looking for algorithms hackers to help

Page 48: Thank you!