Python and pandas as back end to real-time data driven applications by Giovanni Lanzani pydata...
DESCRIPTION
For data, and data science, to be the fuel of the 21st century, data driven applications should not be confined to dashboards and static analyses. Instead they should be the driver of the organizations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, why we chose Python and pandas, and what that meant for real-time data analysis (and agile development). Important points in the talk will be, among others, the handling of geographical data, the access to hundreds of millions of records, as well as the real-time analysis of millions of data points.
TRANSCRIPT
GoDataDriven, proudly part of the Xebia Group
Real time data driven applications
Giovanni Lanzani Data Whisperer
Using Python + pandas as back end
Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
Feedback
@gglanzani
Real-time, data driven app?
• No store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
    SELECT attendees
    FROM pydataberlin2014
    WHERE password = '1234';
Get insight about event impact
Is it Big Data?
• Raw logs are on the order of 40 TB;
• We use Hadoop for storing, enriching and pre-processing;
• (10 nodes, 24 TB per node)
Challenges
1. Privacy;
2. Huge pile of data;
3. Real-time retrieval;
4. Some real-time analysis.
1. Privacy
3. Real-time retrieval
• Harder than it looks;
• Large data;
• Retrieval is by date, center location, and radius.
4. (Some) real-time analysis
Architecture
[Diagram: Front-end (AngularJS) <-> REST / JSON <-> Back-end (app.py, helper.py)]
JS - 1
JS - 2
Flask
from flask import Flask
app = Flask(__name__)

@app.route("/hello")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
app.py example
@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])
@app.route('/api/<postcode>/<date>', methods=['GET'])
def datapoints(postcode, date, radius=1.0):
    ...
    stats, timeline, points = helper.get_json(postcode, date, radius)
    return …  # returns a JSON object for AngularJS
data example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
helper.py example
def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)

    data = get_data(postcodes, dates)

    stats = get_statistics(data, sbi)
    timeline = get_timeline(data, sbi)

    return stats, timeline, data.to_json(orient='records')
helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]  # filter by sbi

    hits = sbi_df.hits.sum()  # sum the hits
    delta_hits = sbi_df.delta.sum()  # sum the delta hits

    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0

    return {"sbi": sbi, "total": hits, "percentage": percentage}
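As a worked example, get_statistics can be applied to the four sample rows from the "data example" slide. This is only a sketch: the DataFrame is built by hand, and the int() casts are an addition that is not on the slide, assumed here so that the result serializes cleanly to JSON.

```python
import pandas as pd

# The four sample rows from the "data example" slide.
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})


def get_statistics(data, sbi):
    sbi_df = data[data["sbi"] == sbi]          # filter by sbi
    hits = int(sbi_df["hits"].sum())           # sum the hits
    delta_hits = int(sbi_df["delta"].sum())    # sum the delta hits
    # Relative difference between hits and delta hits; 0 when delta is 0.
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0.0
    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, sbi=1)
# For sbi 1: hits = 35 + 45 = 80, delta = 22 + 35 = 57, percentage = 23 / 57
```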
helper.py example
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    return df_sbi
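To make the grouping concrete, here is a small sketch of the same idea. It departs from the slide in two labeled ways: the unused second argument is dropped, and only the numeric columns hits and delta are summed, so that text columns such as postcode are not aggregated. The second row is a hypothetical extra postcode added so that one group contains more than one row.

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 12],
    "postcode": ["1234AB", "1234AC", "1234AB"],  # 1234AC is hypothetical
    "hits": [35, 5, 45],
    "delta": [22, 3, 35],
    "sbi": [1, 1, 1],
})


def get_timeline(data):
    # One row per (date, hour, sbi), summing only the numeric measures.
    return (
        data.groupby(["date", "hour", "sbi"])[["hits", "delta"]]
        .sum()
        .reset_index()
    )


timeline = get_timeline(data)
timeline_json = timeline.to_json(orient="records")  # ready for the front-end
```

Calling reset_index() turns the group keys back into ordinary columns, which keeps them in the JSON records sent to the front-end.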
helper.py example
def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
Who has my data?
• First iteration was a (pre-)POC with less data (3 GB vs 500 GB);
• Time constraints;
• Oops:
import pandas as pd
...
source_data = pd.read_csv("data.csv", …)
...
def get_data(postcodes, dates):
    result = filter_data(source_data, postcodes, dates)
    return result
Advantage of “everything is a df”
Pro:
• Fast!
• Use what you know
• NO DBAs!
• We all love CSVs!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBAs!
• We all hate CSVs!
If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:
    source_data.sort_index(inplace=True)
• Beware of modifications of the original dataframe!
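The three points above can be sketched as follows. This is not the code from the talk: the choice of date as the index, the body of get_data, and the tiny hand-built frame are all assumptions made for illustration. The frame is indexed on the retrieval key and sorted once at startup, lookups use label slices on the sorted index, and the function returns a copy so that callers cannot alter the shared source_data.

```python
import pandas as pd

raw = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [45, 35, 55, 2],
})

# Index on the column used for retrieval, then sort once at startup.
source_data = raw.set_index("date").sort_index()


def get_data(postcodes, dates):
    if not dates:
        return source_data.iloc[0:0].copy()
    # Label slices on a sorted index are resolved by binary search,
    # so each date lookup avoids scanning the whole frame.
    by_date = pd.concat([source_data.loc[d:d] for d in dates])
    # .copy() hands out a new frame; source_data stays untouched.
    return by_date[by_date["postcode"].isin(postcodes)].copy()


result = get_data(["1234AB"], ["2013-01-01"])
```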
The reason pandas is faster is because I came up with a better algorithm
If you don’t…
data = get_data(postcodes, dates)
[Diagram: Front-end (AngularJS) <-> REST / JSON <-> Back-end (app.py, helper.py)]
data = get_data(db, postcodes, dates)
[Diagram: Front-end (AngularJS) <-> REST / JSON <-> Back-end (app.py, helper.py, database.py) <-> psycopg2 <-> Data]
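The slides only show the new signature, get_data(db, postcodes, dates), not its body. A minimal sketch of what such a function could look like is given below, under stated assumptions: the table name datapoints is taken from a later slide, the column names from the data example, and the standard-library sqlite3 module stands in for psycopg2 so the example runs without a server. Both follow the Python DB-API, but psycopg2 uses %s placeholders where sqlite3 uses ?.

```python
import sqlite3


def get_data(db, postcodes, dates):
    """Fetch matching rows from an open DB-API connection as dicts."""
    if not postcodes or not dates:
        return []
    # One placeholder per value; values are passed separately, never
    # formatted into the SQL string.
    pc_marks = ", ".join("?" for _ in postcodes)
    dt_marks = ", ".join("?" for _ in dates)
    query = (
        "SELECT date, hour, postcode, hits, delta, sbi FROM datapoints "
        f"WHERE date IN ({dt_marks}) AND postcode IN ({pc_marks})"
    )
    cursor = db.cursor()
    cursor.execute(query, list(dates) + list(postcodes))
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory stand-in for the real database, filled with the sample rows.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE datapoints (date TEXT, hour INTEGER, postcode TEXT, "
    "hits INTEGER, delta INTEGER, sbi INTEGER)"
)
db.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2013-01-01", 12, "1234AB", 35, 22, 1),
        ("2013-01-08", 12, "1234AB", 45, 35, 1),
        ("2013-01-01", 11, "5555ZB", 2, 1, 2),
        ("2013-01-08", 11, "5555ZB", 55, 2, 2),
    ],
)

rows = get_data(db, ["1234AB"], ["2013-01-01", "2013-01-08"])
```

The list of dicts can be handed to pd.DataFrame(rows), so the rest of helper.py can keep working on a dataframe.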
Handling geo-data
def get_json(postcode, date, radius):
    """
    ...
    """
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
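The slides do not show how get_lat_lon and get_postcodes work. One plausible approach, sketched here purely as an assumption, is to keep a table of postcode centroids and select those within the radius (in km) using the haversine great-circle distance. The postcodes and coordinates below are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Hypothetical postcode centroids as (lat, lon); illustrative values only.
CENTROIDS = {
    "1011AB": (52.3740, 4.9000),
    "1012CD": (52.3720, 4.8950),
    "3511EF": (52.0900, 5.1200),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def get_lat_lon(postcode):
    return CENTROIDS[postcode]


def get_postcodes(postcode, radius):
    """Return all postcodes whose centroid lies within radius km."""
    lat, lon = get_lat_lon(postcode)
    return sorted(
        pc
        for pc, (plat, plon) in CENTROIDS.items()
        if haversine_km(lat, lon, plat, plon) <= radius
    )


nearby = get_postcodes("1011AB", 1.0)
```

A linear scan like this is fine for a sketch; with the full set of postcodes a spatial index or a database with geographic support would be the more natural choice.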
Issues?!
• With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
    SELECT * FROM datapoints
    WHERE
        date IN date_array
        AND
        postcode IN postcode_array;
• There is an index on date and postcode, but single queries still run for more than 20 minutes.
Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:
    SELECT *
    FROM datapoints
    WHERE ST_DWithin(lon, lat, 1500)
    AND dates IN ('2013-02-30', '2013-02-31');
    -- every point within 1.5km
    -- from (lat, lon) on imaginary dates
Other DBs?
Steps to solve it
1. Align data on disk by date;
2. Use the temporary table trick:
    CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);
    INSERT INTO tmp (postcodes) VALUES postcode_array;

    SELECT * FROM tmp
      JOIN datapoints d
      ON d.postcode = tmp.postcodes
      WHERE
      d.dt IN dates_array;
3. Lose precision: 1234AB→1234;
4. (Compression)
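The temporary table trick from step 2 can be sketched from Python as follows. The SQL on the slide is schematic (postcode_array and dates_array are placeholders), so this version fills the temporary table with executemany and joins against it. As an assumption for the sake of a runnable example, sqlite3 from the standard library stands in for the real database; the reduced datapoints table and the column type TEXT are likewise illustrative.

```python
import sqlite3


def get_data(db, postcodes, dates):
    """Join datapoints against a temporary table of postcodes."""
    cur = db.cursor()
    cur.execute(
        "CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)"
    )
    try:
        # set() removes duplicates that would violate the primary key.
        cur.executemany(
            "INSERT INTO tmp (postcodes) VALUES (?)",
            [(pc,) for pc in set(postcodes)],
        )
        date_marks = ", ".join("?" for _ in dates)
        cur.execute(
            "SELECT d.dt, d.postcode, d.hits FROM tmp "
            "JOIN datapoints d ON d.postcode = tmp.postcodes "
            f"WHERE d.dt IN ({date_marks}) ORDER BY d.dt",
            list(dates),
        )
        return cur.fetchall()
    finally:
        # Drop the table so the next request can create it again.
        cur.execute("DROP TABLE tmp")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
db.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [
        ("2013-01-01", "1234AB", 35),
        ("2013-01-08", "1234AB", 45),
        ("2013-01-01", "5555ZB", 2),
        ("2013-01-08", "5555ZB", 55),
    ],
)

rows = get_data(db, ["1234AB"], ["2013-01-01", "2013-01-08"])
```

The join replaces a very long `postcode IN (...)` list with a lookup against a small keyed table, which is the point of the trick; the list of dates stays short enough to remain an IN clause.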
We’re hiring / Questions? / Thank you!
@gglanzani [email protected]
Giovanni Lanzani Data Whisperer