Post on 15-Nov-2019


Page 1: Introduction to Pandas - BI Consultingbiconsulting.hu/...introduction_to_pandas_and_time_series_analysis.pdf · Introduction to Pandas and Time Series Analysis Alexander C. S. Hendorf

Introduction to Pandas and Time Series Analysis

Alexander C. S. Hendorf @hendorf


Alexander C. S. Hendorf

Königsweg GmbH

Königsweg connects high-tech startups and industry

EuroPython organizer + program chair

MongoDB Master 2016, MUG leader

Speaker at MongoDB Days, EuroPython, PyData…

@hendorf


Origin and Goals

-Open Source Python Library

-practical real-world data analysis - fast, efficient & easy

-seamless workflow (no switching to e.g. R)

-started in 2008 by Wes McKinney, now part of the PyData stack at Continuum Analytics ("Anaconda")

-very stable project with regular updates

-https://github.com/pydata/pandas


Main Features

-Support for CSV, Excel, JSON, SQL, SAS, clipboard, HDF5,…

-Data cleansing

-Re-shape & merge data (joins & merge) & pivoting

-Data Visualisation

-Well integrated into Jupyter (IPython) notebooks

-Database-like operations

-Performant


Today

Part 1: Basic functionality of Pandas

Part 2: Time series analysis with Pandas

Git featuring this presentation's code examples: https://github.com/Koenigsweg/data-timeseries-analysis-with-pandas


2014-08-10T05:00:00,14
2014-08-21T22:50:00,12.0
2014-08-17T13:20:00,16.0
2014-08-06T01:20:00,14.0
2014-09-27T06:50:00,11.0
2014-08-25T21:50:00,13.0
2014-08-14T05:20:00,13.0
2014-09-14T05:20:00,16.0
2014-08-03T02:50:00,21.0
2014-09-29T03:00:00,13
2014-09-06T08:20:00,16.0
2014-08-19T07:20:00,13.0
2014-09-27T22:50:00,10.0
2014-08-28T08:20:00,12.0
2014-08-17T01:00:00,14
2014-09-27T14:00:00,17
2014-09-10T18:00:00,18
2014-09-22T23:00:00,8
2014-09-20T03:00:00,9
2014-08-29T09:50:00,16.0
2014-08-16T01:50:00,13.0
2014-08-28T22:00:00,14
2014-08-03T08:50:00,23.0


I/O and viewing data

-convention: import pandas as pd

-example: pd.read_csv()

-very flexible, ~40 optional parameters (delimiter, header, dtype, parse_dates,…)

-preview data with .head(n) and .tail(n), where n is the number of rows
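A minimal sketch of the read-and-preview workflow, assuming a headerless file with "timestamp,value" rows like the sample data shown earlier. The column names `timestamp` and `temperature` are assumptions made for illustration; the slides do not name the columns.

```python
import io
import pandas as pd

# A few rows in the "timestamp,value" layout of the sample data.
# io.StringIO stands in for a file path so the sketch is self-contained.
raw = io.StringIO(
    "2014-08-10T05:00:00,14\n"
    "2014-08-21T22:50:00,12.0\n"
    "2014-08-17T13:20:00,16.0\n"
    "2014-08-06T01:20:00,14.0\n"
)

df = pd.read_csv(
    raw,
    header=None,                          # the data has no header line
    names=["timestamp", "temperature"],   # hypothetical column names
    parse_dates=["timestamp"],            # convert the first column to datetimes
    dtype={"temperature": float},         # read "14" and "12.0" alike as floats
)

print(df.head(2))   # first two rows
print(df.tail(1))   # last row
```

The same call works with a file name or URL in place of the `StringIO` object.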


ax = df[:100].plot()

ax.axhline(16, color='r', linestyle='-')

df.plot(kind='bar')


Visualisation

-matplotlib (http://matplotlib.org) integrated, .plot()

-customizable and extendable, plot() returns a matplotlib Axes object (ax)

-bar, area, scatter, box plots and others

-Alternatives:

Bokeh (http://bokeh.pydata.org/en/latest/)

Seaborn (https://stanford.edu/~mwaskom/software/seaborn/index.html)
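A short sketch of how the Axes object returned by .plot() can be extended with plain matplotlib calls. The DataFrame and its column name `value` are made up for illustration, and the "Agg" backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import numpy as np
import pandas as pd

# Hypothetical data: 200 readings in a single column
df = pd.DataFrame({"value": np.arange(200) % 24})

ax = df[:100].plot()                        # line plot of the first 100 rows
ax.axhline(16, color="r", linestyle="-")    # add a horizontal reference line at y=16
ax.set_ylabel("value")                      # any other matplotlib Axes method works too

bar_ax = df[:10].plot(kind="bar")           # bar plot of the first 10 rows, new figure
```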


Structure

[Figure, built up over four slides: a column of Data (values 1 to 9); an Index column next to the Data forms a pd.Series; several Data columns sharing one Index form a pd.DataFrame]


Structure: Series

-one-dimensional, labeled array (pd.Series); may contain any data type

-the labels of the series are usually called the index

-index automatically created if not given

-one data type per series; the data type can be set explicitly or converted dynamically in a Pythonic fashion
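The variants described above (data type and index inferred or given) could look like the following sketch. The values and labels are arbitrary examples, not taken from the slides.

```python
import pandas as pd

# data type and index both created automatically
s_auto = pd.Series([1, 2, 3])

# data type set explicitly, index created automatically
s_float = pd.Series([1, 2, 3], dtype="float64")

# data type set, numerical index given
s_num = pd.Series([1, 2, 3], dtype="float64", index=[10, 20, 30])

# data type set, text labels given as index
s_text = pd.Series([1, 2, 3], dtype="float64", index=["a", "b", "c"])

# converting the data type returns a new Series; s_text keeps float64
s_int = s_text.astype("int64")
```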


simple series, auto: data type auto, index auto

simple series, auto: data type auto, index auto

simple series, auto: data type set, index auto

simple series, auto: data type set, numerical index given


simple series, auto: data type set, text-label index given

access via index / label

access via index / position

access multiple via index / label

access multiple via index / position range

access multiple via index / multiple positions

access via boolean index / lambda function


.loc[] selects by index label

.iloc[] selects by index position

.ix[] guesses: label first, position as fallback (deprecated in pandas 0.20, removed in 1.0)
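The difference between label and position matters most when the labels are themselves integers. A small sketch with made-up values; .ix is left out because it no longer exists in current pandas releases.

```python
import pandas as pd

# Integer labels deliberately in reverse order, so label != position
s = pd.Series([100, 200, 300], index=[2, 1, 0])

by_label = s.loc[0]       # the element labeled 0 -> 300
by_position = s.iloc[0]   # the first element     -> 100
```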


.name (column) names

.sample() sampling data set


Selecting Data

-Slicing

-Boolean indexing

series[x], series[[x, y]]

series[2], series[[2, 3]], series[2:3]

series.loc[] / series.iloc[] (series.ix[] only in older pandas versions)

series.sample()
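The selection styles listed above, collected in one sketch. The Series content and its labels are illustrative assumptions.

```python
import pandas as pd

series = pd.Series([14.0, 12.0, 16.0, 21.0, 11.0], index=["a", "b", "c", "d", "e"])

one = series["b"]                      # single label
several = series[["a", "d"]]           # list of labels
by_pos = series.iloc[[2, 3]]           # multiple positions
pos_range = series.iloc[2:4]           # position range, end exclusive
label_range = series.loc["b":"d"]      # label slice, end inclusive
warm = series[series > 13]             # boolean indexing with a mask
warm_fn = series[lambda x: x > 13]     # same selection via a callable
picked = series.sample(n=2, random_state=0)   # random sample, seeded for repeatability
```

Note that label slices include the end label while position slices exclude the end position.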


Structure: DataFrame

-Two-dimensional, labeled data structure built from e.g.

-Series

-2-D numpy.ndarray

-other DataFrames

-index automatically created if not given
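A sketch of the three construction routes named above. Column names and values are invented for the example, and pd.concat is just one of several ways to build a DataFrame from other DataFrames.

```python
import numpy as np
import pandas as pd

# From Series: the Series are aligned on their index labels
temp = pd.Series([14.0, 12.0, 16.0], index=["a", "b", "c"])
hum = pd.Series([60, 55], index=["a", "b"])      # no value for label "c"
df_from_series = pd.DataFrame({"temp": temp, "hum": hum})   # "c"/"hum" becomes NaN

# From a 2-D numpy.ndarray: column names given, index created automatically
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From other DataFrames: here stacked row-wise with a fresh index
df_combined = pd.concat([df_from_array, df_from_array], ignore_index=True)
```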


Structure: Index

-Index

-automatically created if not given

-can be reset or replaced

-types: position, timestamp, time range, labels,…

-one or more dimensions

-may contain a value more than once (NOT UNIQUE!)
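The index properties above can be tried out with a few lines. This is a sketch on made-up data; the column names `city`, `day` and `temp` are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Berlin", "Berlin", "Munich"], "temp": [14.0, 12.0, 16.0]}
)

by_city = df.set_index("city")         # replace the automatic index with labels
berlin = by_city.loc["Berlin"]         # non-unique label: both Berlin rows come back
is_unique = by_city.index.is_unique    # False, "Berlin" appears twice
restored = by_city.reset_index()       # back to an automatic positional index

# More than one dimension: a MultiIndex built from two columns
df["day"] = [1, 2, 1]
multi = df.set_index(["city", "day"])
value = multi.loc[("Berlin", 2), "temp"]   # temperature for Berlin on day 2
```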


Examples

-work with series / calculation

-create and add a new series

-how to deal with null (NaN) values

-method calls directly from Series/ DataFrames
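The slides showing these examples did not survive as text, so the following is a stand-in sketch on invented data rather than the presenter's code. It touches each point in the list: a calculation on a series, a new series added to a DataFrame, NaN behaviour, and direct method calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"celsius": [14.0, np.nan, 16.0, 21.0]})

# Calculation on a whole Series, stored as a new column
df["fahrenheit"] = df["celsius"] * 9 / 5 + 32

# NaN propagates through arithmetic but is skipped by aggregations
mean_c = df["celsius"].mean()             # mean of 14, 16 and 21
n_missing = df["celsius"].isna().sum()    # number of NaN entries

# Methods are called directly on the Series or DataFrame
max_f = df["fahrenheit"].max()
```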


Modifying Series/DataFrames

-Methods applied to a Series or DataFrame do not change it, but return the result as a new Series or DataFrame

-With the parameter inplace=True the result is written directly into the Series / DataFrame
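A small sketch of the difference, using sort_values as one example of a method that accepts inplace. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"b": [2, 1, 3]})

sorted_copy = df.sort_values("b")       # returns a new, sorted DataFrame
unchanged = df["b"].tolist()            # df itself still holds the original order

result = df.sort_values("b", inplace=True)   # modifies df and returns None
changed = df["b"].tolist()              # df is now sorted
```

Reassigning the returned object (`df = df.sort_values("b")`) achieves the same effect without inplace.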


NaN Values & Replacing

-NaN is the representation of null values

-series.describe() ignores NaN

-NaNs:

-remove with dropna()

-replace with a default value

-forward- or backward-fill, interpolate

-Series (columns) can be removed from a DataFrame with drop()
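The options for handling NaN values, sketched on a tiny made-up Series. Note that removing NaN entries uses dropna(), while drop() removes rows or columns by label.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()          # remove NaN entries
filled = s.fillna(0.0)        # replace NaN with a default value
ffilled = s.ffill()           # forward-fill: carry the last valid value onward
bfilled = s.bfill()           # backward-fill: take the next valid value
interp = s.interpolate()      # linear interpolation between valid neighbours

count = s.describe()["count"]   # NaN entries are not counted

df = pd.DataFrame({"a": s, "b": filled})
without_b = df.drop(columns=["b"])   # remove a column (Series) from the DataFrame
```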


Data Aggregation

-describe()

-groupby()

-groupby([]) & unstack()

-mean(), sum(), median(),…
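A compact sketch of these aggregation calls on invented data. The column names `city`, `month` and `temp` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Berlin", "Berlin", "Munich", "Munich"],
        "month": ["Aug", "Sep", "Aug", "Sep"],
        "temp": [14.0, 12.0, 16.0, 10.0],
    }
)

summary = df["temp"].describe()                 # count, mean, std, min, quartiles, max
per_city = df.groupby("city")["temp"].mean()    # one mean per city

# Grouping by a list of columns yields a result with a two-level index ...
per_city_month = df.groupby(["city", "month"])["temp"].mean()
# ... and unstack() pivots the inner level (month) into columns
table = per_city_month.unstack()
```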


End Part 1

-Series & DataFrame

-I/O

-Data analysis & aggregation

-Indexes

-Visualisation

-Interacting with the data


Year

[Figure, built up over three slides: a year divided into 12 months of unequal length (seven months with 31 days, four with 30, one with 28); February is about 90% of March]


Roman year used to start in March and had 10 months

for 2 months there was "no" month

solar | tropical year

quick & funny explanation: https://www.youtube.com/watch?v=AgKaHTh-_Gs


TimeSeries

-time series index (pd.DatetimeIndex)

-pd.to_datetime(): caution, ambiguous dates are read US-style (month first) by default

-Data Aggregation examples
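A sketch of the month-first default and two ways around it, followed by a small time-indexed Series. The date strings and values are illustrative.

```python
import pandas as pd

us_style = pd.to_datetime("03/04/2016")                   # month first: 4 March 2016
eu_style = pd.to_datetime("03/04/2016", dayfirst=True)    # day first: 3 April 2016
explicit = pd.to_datetime("03.04.2016", format="%d.%m.%Y")   # explicit format, no guessing

# A Series with a DatetimeIndex supports date-based selection and grouping
idx = pd.to_datetime(["2014-08-03 02:50", "2014-08-10 05:00", "2014-09-27 06:50"])
ts = pd.Series([21.0, 14.0, 11.0], index=idx)

august = ts.loc["2014-08"]                          # all entries in August 2014
monthly_mean = ts.groupby(ts.index.month).mean()    # mean per calendar month
```

Passing an explicit `format` is the safest choice when the input layout is known.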


Resampling

- H hourly frequency
- T minutely frequency
- S secondly frequency
- L milliseconds
- U microseconds
- N nanoseconds

- D calendar day frequency
- W weekly frequency
- M month end frequency
- Q quarter end frequency
- A year end frequency

- B business day frequency
- C custom business day frequency (experimental)
- BM business month end frequency
- CBM custom business month end frequency
- MS month start frequency
- BMS business month start frequency
- CBMS custom business month start frequency
- BQ business quarter end frequency
- QS quarter start frequency
- BQS business quarter start frequency
- BA business year end frequency
- AS year start frequency
- BAS business year start frequency
- BH business hour frequency
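A minimal resampling sketch on made-up readings taken every six hours. It uses only the aliases "D" and "W" plus the lowercase "h", since several of the uppercase aliases listed above (for example H, T, M, A) have been renamed in newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical readings every 6 hours over two days (Friday and Saturday)
idx = pd.date_range("2014-08-01", periods=8, freq="6h")
ts = pd.Series(np.arange(8, dtype=float), index=idx)

daily_mean = ts.resample("D").mean()   # one mean per calendar day
daily_max = ts.resample("D").max()     # one maximum per calendar day
weekly_sum = ts.resample("W").sum()    # one sum per week, weeks end on Sunday
```

resample() only defines the time bins; the aggregation method called afterwards decides how the values in each bin are combined.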


Bonus: statsmodels

statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests


Some sales data of a single product


Attributions

Panda Picture By Ailuropoda at en.wikipedia (Transferred from en.wikipedia) [GFDL (http://www.gnu.org/copyleft/fdl.html), CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/) or CC BY-SA 2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/2.5-2.0-1.0)], from Wikimedia Commons


Alexander C. S. Hendorf

[email protected] @hendorf

Code-Examples https://github.com/Koenigsweg/data-timeseries-analysis-with-pandas