Python and pandas as back end to real-time data driven applications by Giovanni Lanzani PyData...

Post on 27-Jan-2015

DESCRIPTION

For data, and data science, to be the fuel of the 21st century, data driven applications should not be confined to dashboards and static analyses. Instead they should be the driver of the organizations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, why we chose Python and pandas and what that meant for real-time data analysis (and agile development). Important points in the talk will be, among others, the handling of geographical data, the access to hundreds of millions of records as well as the real-time analysis of millions of data points.

TRANSCRIPT

Page 1: Python and pandas as back end to real-time data driven applications by Giovanni Lanzani PyData Berlin 2014

GoDataDriven PROUDLY PART OF THE XEBIA GROUP

Real time data driven applications

Giovanni Lanzani Data Whisperer

Using Python + pandas as back end

Page 3:

Who am I?

2008-2012: PhD Theoretical Physics

2012-2013: KPMG

2013-Now: GoDataDriven

Page 4:

Feedback

@gglanzani

Page 5:

Real-time, data driven app?

• No store and retrieve;

• Store, {transform, enrich, analyse} and retrieve;

• Real-time: retrieve is not a batch process;

• App: something your mother could use:

SELECT attendees
FROM pydataberlin2014
WHERE password = '1234';

Page 6:

Get insight about event impact

Page 12:

Is it Big Data?

• Raw logs are in the order of 40TB;

• We use Hadoop for storing, enriching and pre-processing;

• (10 nodes, 24TB per node)

Page 13:

Challenges

1. Privacy;

2. Huge pile of data;

3. Real-time retrieval;

4. Some real-time analysis.

Page 15:

1. Privacy

Page 16:

3. Real-time retrieval

• Harder than it looks;

• Large data;

• Retrieval is by date, center location and radius.

Page 17:

4. (Some) real-time analysis

Page 18:

Architecture

[Diagram: AngularJS front-end and Python back-end (app.py, helper.py) exchanging JSON over REST]

Page 19:

JS - 1

Page 20:

JS - 2

Page 21:

Flask

from flask import Flask
app = Flask(__name__)

@app.route("/hello")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Page 23:

app.py example

@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])
@app.route('/api/<postcode>/<date>', methods=['GET'])
def datapoints(postcode, date, radius=1.0):
    ...
    stats, timeline, points = helper.get_json(postcode, date, radius)
    return …  # returns a JSON object for AngularJS

Page 24:

data example

date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2

Page 25:

helper.py example

def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)

    data = get_data(postcodes, dates)

    stats = get_statistics(data, sbi)
    timeline = get_timeline(data, sbi)

    return stats, timeline, data.to_json(orient='records')
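As a small illustration of the last line of get_json, the sketch below shows what data.to_json(orient='records') produces for a DataFrame shaped like the talk's data example. The two rows and the reduced column set are illustrative only, and the round trip through json.loads is there just to make the structure visible.

```python
import json

import pandas as pd

# Two rows in the shape of the talk's data example (reduced column set).
data = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-08"],
        "postcode": ["1234AB", "1234AB"],
        "hits": [35, 45],
    }
)

# orient="records" serialises every row as one JSON object in a list.
points = data.to_json(orient="records")

# Parse it back to show the structure a front-end would receive.
parsed = json.loads(points)
print(parsed)
```

A list of flat objects like this is convenient for a JavaScript front-end, since each element maps directly onto one point.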

Page 27:

helper.py example

def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]  # filter by sbi

    hits = sbi_df.hits.sum()  # sum the hits
    delta_hits = sbi_df.delta.sum()  # sum the delta hits

    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0

    return {"sbi": sbi, "total": hits, "percentage": percentage}
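The function above can be tried out on the four rows of the data example. The following is a minimal sketch, not the production code: it assumes Python 3 division, and it casts the sums to plain Python numbers because the standard json module does not accept NumPy integer scalars. The json.dumps call stands in for whatever serialisation the back end actually uses.

```python
import json

import pandas as pd

# The four rows from the talk's data example.
data = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
        "hour": [12, 12, 11, 11],
        "id_activity": [1234, 1234, 2345, 2345],
        "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
        "hits": [35, 45, 2, 55],
        "delta": [22, 35, 1, 2],
        "sbi": [1, 1, 2, 2],
    }
)


def get_statistics(data, sbi):
    sbi_df = data[data["sbi"] == sbi]  # keep only the requested sbi

    # Cast to plain ints so the result is JSON-serialisable.
    hits = int(sbi_df["hits"].sum())
    delta_hits = int(sbi_df["delta"].sum())

    # Guard against division by zero when there are no delta hits.
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0.0

    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, 1)
payload = json.dumps(stats)
print(payload)
```

For sbi 1 the hits add up to 80 and the delta hits to 57, so the ratio is 23/57.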

Page 29:

helper.py example

def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    return df_sbi
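To make the grouping concrete, here is a hedged sketch of the same idea on a tiny DataFrame. The rows are illustrative (postcode 1234AC is made up so that two rows share a date and hour), and restricting the sum to the hits and delta columns is an assumption made here to keep string columns out of the aggregation.

```python
import pandas as pd

# Illustrative rows; two of them fall in the same (date, hour, sbi) group.
data = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-01", "2013-01-08"],
        "hour": [12, 12, 12],
        "postcode": ["1234AB", "1234AC", "1234AB"],
        "hits": [35, 5, 45],
        "delta": [22, 3, 35],
        "sbi": [1, 1, 1],
    }
)


def get_timeline(data):
    # Sum the numeric measures per (date, hour, sbi) bucket.
    grouped = data.groupby(["date", "hour", "sbi"])[["hits", "delta"]].sum()
    # Turn the group keys back into ordinary columns for serialisation.
    return grouped.reset_index()


timeline = get_timeline(data)
print(timeline)
```

Calling reset_index keeps the keys as columns, which makes a later to_json(orient='records') emit them alongside the sums.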

Page 31:

helper.py example

def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')

Page 34:

Who has my data?

• First iteration was a (pre-)POC with less data (3GB vs 500GB);

• Time constraints;

• Oops:

import pandas as pd
...
source_data = pd.read_csv("data.csv", …)
...
def get_data(postcodes, dates):
    result = filter_data(source_data, postcodes, dates)
    return result
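The slides do not show filter_data. One plausible shape for it, sketched under the assumption that the frame has postcode and date columns, is a boolean mask built with isin. The in-memory DataFrame below stands in for the CSV loaded at start-up.

```python
import pandas as pd

# Stand-in for the frame read from data.csv at start-up.
source_data = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
        "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
        "hits": [35, 45, 2, 55],
    }
)


def filter_data(source_data, postcodes, dates):
    # Hypothetical implementation: keep rows matching both lists.
    mask = source_data["postcode"].isin(postcodes) & source_data["date"].isin(dates)
    return source_data[mask]


def get_data(postcodes, dates):
    return filter_data(source_data, postcodes, dates)


result = get_data(["1234AB"], ["2013-01-08"])
print(result)
```

A mask like this scans the whole frame on every request, which is fine for a few gigabytes held in memory but is part of why the approach stops scaling.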

Page 36:

Advantage of “everything is a df”

Pro:

• Fast!

• Use what you know

• NO DBAs!

• We all love CSVs!

Contra:

• Doesn’t scale;

• Huge startup time;

• NO DBAs!

• We all hate CSVs!

Page 37:

If you want to go down this path

• Set the dataframe index wisely;

• Align the data to the index:

source_data.sort_index(inplace=True)

• Beware of modifications of the original dataframe!
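The three points above can be sketched together as follows. This is one possible reading rather than the talk's actual code: it assumes the date column is the chosen index, uses label slicing on the sorted index to narrow down the rows, and returns a copy so that callers cannot alter the shared frame. The first_date and last_date parameters are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
        "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
        "hits": [45, 35, 55, 2],
    }
)

# Set the index and align the data to it once, at start-up.
source_data = raw.set_index("date").sort_index()


def get_data(postcodes, first_date, last_date):
    # Label slicing on a sorted index selects the date window.
    window = source_data.loc[first_date:last_date]
    selected = window[window["postcode"].isin(postcodes)]
    # Hand out a copy so later edits never touch source_data.
    return selected.copy()


result = get_data(["1234AB"], "2013-01-08", "2013-01-08")
result["hits"] = 0  # modifies the copy only
print(source_data["hits"].sum())
```

Because every request shares the same module-level frame, returning a copy is a cheap way to keep one request from changing what the next one sees.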

Page 41:

If you want to go down this path

The reason pandas is faster is because I came up with a better algorithm

Page 42:

If you don’t…

data = get_data(postcodes, dates)

[Diagram: AngularJS front-end and Python back-end (app.py, helper.py) exchanging JSON over REST]

Page 43:

If you don’t…

data = get_data(postcodes, dates)

[Diagram: AngularJS front-end and Python back-end (app.py, helper.py) exchanging JSON over REST; database.py connects the back-end to the data through psycopg2]

Page 44:

If you don’t…

data = get_data(db, postcodes, dates)

[Diagram: AngularJS front-end and Python back-end (app.py, helper.py) exchanging JSON over REST; database.py connects the back-end to the data through psycopg2]

Page 45:

Handling geo-data

def get_json(postcode, date, radius):
    """
    ...
    """
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
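The slides leave get_lat_lon and get_postcodes undefined. A minimal sketch of how they might work is given below, assuming a lookup table of postcode centroids and a radius in kilometres. The postcodes and coordinates in the table are made up for illustration, and the great-circle (haversine) distance is a stand-in for whatever the real application uses.

```python
import math

# Hypothetical postcode centroids: (latitude, longitude) in degrees.
CENTROIDS = {
    "1012AB": (52.3731, 4.8922),
    "1017CD": (52.3640, 4.8900),
    "3511EF": (52.0907, 5.1214),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def get_lat_lon(postcode):
    return CENTROIDS[postcode]


def get_postcodes(postcode, radius):
    # All postcodes whose centroid lies within `radius` km of the given one.
    lat, lon = get_lat_lon(postcode)
    return sorted(
        code
        for code, (clat, clon) in CENTROIDS.items()
        if haversine_km(lat, lon, clat, clon) <= radius
    )


nearby = get_postcodes("1012AB", 2.0)
print(nearby)
```

Expanding a centre and radius into an explicit list of postcodes keeps the rest of the code unchanged, at the price of very long lists for large radii in dense areas.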

Page 46:

Issues?!

• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:

SELECT * FROM datapoints
WHERE
    date IN date_array
    AND
    postcode IN postcode_array;

• There is an index on date and postcode, but single queries still run for more than 20 minutes.

Page 47:

Postgres + PostGIS (2.x)

PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:

SELECT *
FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
-- (schematic: the real ST_DWithin takes two geometry or geography arguments and a distance)

Page 48:

Other db’s?

Page 49:

Steps to solve it

1. Align data on disk by date;

2. Use the temporary table trick:

CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp
    JOIN datapoints d
    ON d.postcode = tmp.postcodes
    WHERE
    d.dt IN dates_array;

3. Lose precision: 1234AB→1234

4. (Compression)
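The temporary table trick from step 2 can be tried end to end with the sqlite3 module from the standard library. This is only a sketch of the pattern: the talk's back end is reached through psycopg2, so SQLite here is a stand-in, and the table layout, the TEXT column type and the get_data signature are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [
        ("2013-01-01", "1234AB", 35),
        ("2013-01-08", "1234AB", 45),
        ("2013-01-01", "5555ZB", 2),
        ("2013-01-08", "5555ZB", 55),
    ],
)


def get_data(conn, postcodes, dates):
    # Load the (possibly very long) postcode list into a temporary table
    # and join against it instead of passing a huge IN (...) list.
    conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    try:
        conn.executemany(
            "INSERT INTO tmp (postcodes) VALUES (?)", [(p,) for p in postcodes]
        )
        placeholders = ", ".join("?" for _ in dates)
        query = (
            "SELECT d.dt, d.postcode, d.hits FROM tmp "
            "JOIN datapoints d ON d.postcode = tmp.postcodes "
            "WHERE d.dt IN (" + placeholders + ") "
            "ORDER BY d.dt, d.postcode"
        )
        return conn.execute(query, list(dates)).fetchall()
    finally:
        # Drop the table so the next request can create it again.
        conn.execute("DROP TABLE tmp")


rows = get_data(conn, ["1234AB"], ["2013-01-01", "2013-01-08"])
print(rows)
```

Step 3 would fit the same pattern by storing and joining on the first four characters of the postcode (for example postcode[:4]), which shrinks the temporary table considerably.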

Page 50:

We’re hiring / Questions? / Thank you!

@gglanzani [email protected]

Giovanni Lanzani Data Whisperer