Learn how to run Python on Redshift
TRANSCRIPT
How Bellhops Leverages Amazon Redshift UDFs for Massively Parallel Data Science
Ian Eaves, Bellhops
May 12th, 2016
Today's Speakers
Chartio: AJ Welch, Chartio.com
Bellhops: Ian Eaves, GetBellhops.com
AWS: Brandon Chavis, aws.amazon.com
The recording will be sent to all webinar participants after the event.
Questions? Type them in the chat box and we will answer at the end.
Posting to social? Use #AWSandChartio
Housekeeping
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/Year; starts at $0.25/hour
Amazon RedshiftWhat is Amazon Redshift?
For those unfamiliar with Amazon Redshift, it is a fast, fully managed, petabyte-scale data warehouse for less than $1000 per terabyte per year.
Fast, cost-effective, and easy to use: launch a cluster in a few minutes, scale with the push of a button.
Amazon Redshift is easy to useProvision in minutes
Monitor query performance
Point and click resize
Built in security
Automatic backups
Redshift is not only cheaper but also easy to use. Provisioning takes 15 minutes.
Amazon Redshift System Architecture
[Diagram: SQL clients/BI tools connect to a Leader Node over JDBC/ODBC; the Leader Node distributes work over 10 GigE (HPC) to Compute Nodes, each with 128GB RAM, 16TB disk, and 16 cores; ingestion, backup, and restore flow between the compute nodes and S3 / EMR / DynamoDB / SSH.]
The Amazon Redshift view of data warehousing
10x cheaper: easy to provision, higher DBA productivity
10x faster: no programming; easily leverage BI tools, Hadoop, machine learning, streaming; analysis in-line with process flows
Pay as you go, grow as you need; managed availability & DR
For enterprise, big data, and SaaS workloads
The legacy view of data warehousing...
Global 2,000 companies
Sell to central IT
Multi-year commitments
Multi-year deployments
Multi-million dollar deals
This is a narrow view, and it leads to dark data.
Small companies also have big data (mobile, social, gaming, adtech, IoT).
Long cycles, high costs, and administrative complexity all stifle innovation.
New SQL Functions
We add SQL functions regularly to expand Amazon Redshift's query capabilities.
Added 25+ window and aggregate functions since launch, including:
LISTAGG
[APPROXIMATE] COUNT
DROP IF EXISTS, CREATE IF NOT EXISTS
REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
PERCENTILE_CONT, _DISC, MEDIAN
PERCENT_RANK, RATIO_TO_REPORT
We'll continue iterating, but we also want to enable you to write your own.
Scalar User Defined Functions
You can write UDFs using Python 2.7
Syntax is largely identical to PostgreSQL UDF syntax
System and network calls within UDFs are prohibited
Comes with Pandas, NumPy, and SciPy pre-installed
You'll also be able to import your own libraries for even more flexibility
Scalar UDF Example

CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS VARCHAR
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

Rather than using complex regular expressions, you can import standard Python URL parsing libraries and use them in your SQL.
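Because the UDF body is plain standard-library Python, it can be tried locally before deploying. A minimal sketch (note: Redshift's UDF runtime is Python 2.7, where the module is `urlparse`; in Python 3 the same parser lives in `urllib.parse`):

```python
# Local sketch of the f_hostname UDF body.
# Redshift UDFs run Python 2.7 ("import urlparse"); in Python 3
# the equivalent function is urllib.parse.urlparse.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7, as in the UDF

def f_hostname(url):
    """Return the hostname portion of a URL, or None if absent."""
    return urlparse(url).hostname

print(f_hostname("https://www.chartio.com/blog/post?id=7"))  # www.chartio.com
```

The same call inside the UDF replaces what would otherwise be a fragile regular expression over the URL string.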
Analytics for Everyone
The best platform for everyone to explore and visualize data
The Smartest Companies Use Chartio
Legacy BI
Expensive to set up
Expensive to maintain
Requires technical skills
Creates a bottleneck
Limits your ability to make decisions
Modern BI
Faster time to value
Easier to maintain
Modes for both technical and non-technical users
Alleviates bottlenecks
Enhances your ability to make decisions
Chartio's Modern Architecture
The Chartio Schema Editor
Team-specific schemas
Rename tables/columns
Hide tables/columns
Define custom tables/columns
Define data types and foreign keys
Schema Editor Live Demo
UDFs (A Brave New World)
Using UDFs in the Real World
Ian Eaves - Data Scientist
The Land Between SQL and Analysis
While SQL is a phenomenal tool for data extraction, it's either painful or impossible to work with for analysis.
General-purpose programming languages like Python, on the other hand, are better suited to analysis and visualization but more difficult to use for pure extraction.
Into this gap, services like Chartio have emerged, providing extended visualization and analysis options usually accomplished with those more traditional programming tools.
The Land Between SQL and Scripts
UDF
UDFs begin to bridge this gap by providing limited Python functionality within the scope of your standard SQL toolbox.
A Little About Bellhops
On-demand moving and labor company
Self-scheduling capacity (a la Uber)
Located in 83 markets
Because our supply of labor (lovingly referred to as Bellhops) is free to set their own schedules, understanding the health of a market is extremely important.
Too many Bellhops chasing too little work yields high churn and inexperienced laborers.
On the other hand, having only a handful of Bellhops might be sufficient to service demand in small or growing markets. However, this dynamic is unstable: what happens if a Bellhop decides to take a month off? Or how will the market respond to sudden spikes in demand, as happens during the summer?
One of the measures we use to determine when a market has entered an unstable dynamic like this is the Herfindahl Index.
Herfindahl Index (H)
H = \sum_{i=1}^{N} s_i^2
N = number of market actors
s_i = market share of the i-th actor
There's no need to linger on this, but the Herfindahl Index is the sum of the squared market shares of the actors in a market.
So for example, if we were looking at the soda industry, the actors in that market would be Coca-Cola, Pepsi, Fanta, etc.
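The soda example can be sketched in a few lines of Python; the market shares below are made-up numbers for illustration only:

```python
# Herfindahl Index: the sum of the squared market shares.
def herfindahl(shares):
    """shares: iterable of market shares that sum to 1.0."""
    return sum(s * s for s in shares)

# Hypothetical soda-market shares (not real figures):
soda = [0.45, 0.35, 0.10, 0.10]
print(round(herfindahl(soda), 4))  # prints 0.345
```

A monopoly gives H = 1.0, while N equally sized actors give H = 1/N, so higher values mean more concentration.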
More important is how it's used. (next slide)
Herfindahl Index
It's found common usage in economics to determine whether an industry has become monopolistic, and to what degree that might be the case.
We at Bellhops use a similar idea to measure the concentration of work amongst our labor supply in each market.
A combination of this metric and UDFs allows us to provide real time feedback to the organization which would otherwise require separate extraction, processing, and analysis steps.
Take our index as an example. We are interested in knowing if there have been any sudden changes in indexed concentration for any of our markets.
This can be used as a call to action for our market health team. So how do we do that?
Market Health Feedback Process
[Diagram: monthly market statistics flow from the data warehouse, through UDFs, to users.]
First, we calculate the current state of each market every day, week, month, etc., as part of our ETL process.
These values are then fed from our warehouse into psql, Chartio, or any other tool of your choice. In our case, our end users (the Market Health Team) interact with data primarily through Chartio, so that's where it will sit.
We next feed these values into a Python UDF, which in this case is an implementation of Student's t-test.
A t-test effectively allows you to determine whether a value differs significantly from its historical distribution, and the degree to which it differs. It's an especially important distribution when the number of samples being used is small.
In this case, we will be determining whether a market's concentration differs significantly from the historical (say, past 6 months) observations.
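The shape of that check can be sketched with only the standard library, assuming a 6-month trailing window (so 5 degrees of freedom); 1.476 is the two-sided critical value for alpha = 0.2 at df = 5, whereas scipy.stats.t computes an exact p-value instead:

```python
# Stdlib-only sketch of the significance check: compare this month's
# index against the mean/stddev of the trailing window.
# 1.476 is the two-sided t critical value for alpha = 0.2, df = 5
# (a hard-coded stand-in for scipy's exact p-value computation).
T_CRIT = 1.476

def significant_change(val, mean, stddev, t_crit=T_CRIT):
    """Return True if val differs significantly from the window mean."""
    tval = (mean - val) / stddev
    return abs(tval) > t_crit

print(significant_change(0.40, 0.25, 0.05))  # True: three stddevs away
print(significant_change(0.26, 0.25, 0.05))  # False: well within noise
```

With SciPy available (as it is inside Redshift UDFs), the hard-coded critical value would be replaced by a proper p-value for the chosen alpha.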
Finally, these significance warnings are surfaced directly to relevant users through pre-made Chartio dashboards, so they can take action when necessary.
T-test UDF

create function f_t_test (val float, mean float, stddev float, n_samps float, alpha float)
returns varchar
stable
as $$
    from scipy.stats import t
    df = n_samps - 1
    tval = (mean - val) / stddev
    p = t.sf(abs(tval), df) * 2  # two-sided
    if p < alpha:
        return 'Better' if tval > 0 else 'Worse'
    else:
        return 'No Change'
$$ language plpythonu;
Finally, here is our actual UDF. This is an implementation of a two-sided t-test with a couple of notable features.
We were able to make use of prebuilt Python functionality like SciPy. So long as our UDF exclusively uses data available immediately within its scope (unfortunately meaning no disk or network access), we have all the power of Python at our fingertips. That means things like complicated conditional logic can be trivially implemented, bypassing otherwise clumsy SQL.
Example Table Schema

market_month_dimension: id, date, herfindahl_index, ...
market_fact: id, month_key, market_name, ...
Let's just take a toy model of two tables in our data warehouse. The first is a fact table containing the market name and a foreign key (month_key) to the market_month_dimension table.
The market_month_dimension table contains a variety of statistics calculated monthly for each market; one of which is the herfindahl_index.
Our Query

WITH market_stats AS (
    SELECT market_name,
           date,
           mmd.herfindahl_index,
           avg(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS avg,
           stddev_samp(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS stddev
    FROM market_fact
    LEFT JOIN (
        SELECT herfindahl_index, month_key AS join_key, date
        FROM market_month_dimension
    ) AS mmd ON join_key = market_fact.month_key
    GROUP BY market_name, date, month_key, mmd.herfindahl_index
    ORDER BY date
)
SELECT market_name, date, herfindahl_index, avg, stddev,
       f_t_test(herfindahl_index, avg, stddev, 6, .2)
FROM market_stats
WHERE market_name = 'Atlanta'
ORDER BY market_name, date;
With our UDF and schema in hand we can now execute a query!
In this case, we are using our t-test UDF to determine in which months Atlanta's Herfindahl index changed dramatically as compared to the past six months.
As you can see, actually using the UDF is extremely simple; it behaves as if it were any other function. The majority of the hard work lies in constructing our temporary table containing the six-month moving average and standard deviation.
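The windowed statistics the query computes can be sanity-checked locally. A sketch using only the standard library, with made-up index values, mirroring the frame ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING:

```python
# Mirror of the SQL window frame: for each row, the mean and sample
# stddev of up to the six *prior* rows (the current row is excluded).
# statistics.stdev matches Redshift's stddev_samp (sample stddev).
from statistics import mean, stdev

def trailing_stats(values, window=6):
    """Per position, (mean, stddev) of up to `window` preceding values."""
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        if len(prior) >= 2:
            out.append((mean(prior), stdev(prior)))
        else:
            out.append((None, None))  # not enough history yet
    return out

# Hypothetical monthly Herfindahl values ending in a spike:
h = [0.21, 0.22, 0.20, 0.23, 0.22, 0.21, 0.35]
print(trailing_stats(h)[-1])  # stats over the six months before the spike
```

The final row's statistics deliberately exclude the 0.35 spike, which is what lets the t-test flag it as a significant departure.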
Sample Result
General Thoughts
UDF Use Cases
By their scalar nature, UDFs are in some sense reflective rather than prescriptive. We found that reflective nature to be most useful in support of the analytics being performed by our BI team.
They are additionally useful when cumbersome SQL expressions might be simplified by an equivalent Python library or representation. Things like (slide)
Complicated Conditional Logic
Text Processing
Text processing, especially when the equivalent regular expression is complicated or contains numerous edge cases (URLs, emails, etc.)
Basic Statistical Analysis
and doing basic statistical analysis.
Thank You
Questions?
Chartio: AJ Welch, [email protected]
Bellhops: Ian Eaves, [email protected]
AWS: Brandon Chavis, [email protected], aws.amazon.com