Page 1
🐍 Python in the Land of Serverless
David Przybilla
dav009
Page 2
" " last time in here was ~9 years ago
Page 3
NLP Data Engineering
Page 4
Gov spending dataset
Page 5
How to access the datasets
Page 6
ColombiaDev
Encuesta de Salarios (salary survey)
Page 7
..Ops + Golang + Python..
Page 10
cool problems to solve
Page 12
..a bit of context…
Page 13
Roughly 1.5 years ago…
Page 15
first week..started looking at projects..
Page 16
Github repos…
Page 17
New project I had to work on…
Page 18
turns out some projects were built on “serverless”
Page 19
so in this talk I will guide you through:
Page 20
some of the tools
Page 21
some of the use cases
Page 22
some common mistakes
Page 24
my first task: log stream processing
Page 27
If you have been building data pipelines, you would probably go for:
Page 28
If you have been doing data pipelines, probably you
would go for:
Page 29
If you have been doing data pipelines, probably you
would go for:
Page 30
a lot to configure, a lot to manage
Page 31
zookeeper..yarn cluster..
driver instance.. JVM..
HDFS..
Page 33
you have to make sure all those parts are running smoothly
Page 37
a coworker: an ops engineer
Page 38
He loves Serverless
Page 39
And he has good reasons for it…
Page 40
he does not want to do any server management
Page 41
So Jeff suggested using serverless to solve our
stream processing problem
Page 42
I raised my eyebrow in doubt:
Page 44
quite a controversial & confusing term..
Page 45
“Serverless”
1. Fully managed services, run on your behalf:
databases (DynamoDB).. storage (S3)..
queues (Kafka as a service, SQS…).. Heroku…
Page 46
“Serverless”
2. Function as a Service (FaaS)
AWS Lambda.. Azure Functions.. Kubeless..
Google Cloud Functions.. Nuclio..
OpenWhisk… many more…
Page 47
Function as a service
Page 48
Upload your code and it will run…
Page 49
you don’t care what it is running on top of..
Page 50
you only focus on your business logic
Page 51
(i) zero administration - Focus on a single function
Page 52
- managed by the provider
Page 53
- you gain peace of mind
Page 54
- cost: more integration with your vendor (vendor lock-in)
Page 55
(ii) You are billed by number of invocations
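As a rough illustration of that billing model (the prices below are illustrative, in the ballpark of AWS Lambda's public pricing: a fee per million requests plus a fee per GB-second of compute; they are not from the talk):

```python
# illustrative pricing: $0.20 per 1M requests, ~$0.0000167 per GB-second
def monthly_cost(invocations, avg_seconds, memory_gb,
                 per_million=0.20, per_gb_second=0.0000167):
    # requests component: pay per call
    request_cost = invocations / 1_000_000 * per_million
    # compute component: pay for time * memory actually used
    compute_cost = invocations * avg_seconds * memory_gb * per_gb_second
    return request_cost + compute_cost

# e.g. 10M invocations a month, 200ms each, 128MB of memory
print(round(monthly_cost(10_000_000, 0.2, 0.125), 2))
```

if nothing is invoked, the bill is zero — which is the whole point.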
Page 56
so what does this look like in python?
Page 57
def handler(event, context):
    # do something
    pass
Page 58
def handler(event, context):
    url = event['url']
    scrape(url)
Page 59
def handler(event, context):
    database = ...  # magic
    username = event['username']
    database.find(username)
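Since a handler is just a function taking a dict-like event, it can be exercised locally by passing a plain dict — a minimal runnable sketch (the field names here are illustrative, not a provider contract):

```python
def handler(event, context):
    # `event` carries the trigger payload; `context` carries runtime info
    name = event.get("username", "anonymous")
    return {"status": 200, "message": "hello " + name}

# locally, the event is just a dict and context can be None
print(handler({"username": "dav009"}, None))
```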
Page 60
so how do I deploy my function?
Page 61
in order to run your project on FaaS:
0. define your function
1. package your function
2. upload your package
3. call your function
tools address those steps
Page 62
(tools) 0. define function (infrastructure / glue)
Page 63
- runtime (python, golang, js..)
Page 64
- memory (128mb, 500?..)
Page 65
- access to resources
Page 66
mouse & clicking
Page 69
1. package your function
Page 70
your function code
Page 71
+ dependencies
Page 72
“deploying it”
2. upload package
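At their core, steps 1–2 come down to zipping the function code together with its dependencies and shipping the archive to the provider — a rough sketch with the stdlib (the directory layout is hypothetical):

```python
import pathlib
import zipfile

def package_function(src_dir, deps_dir, out_path):
    # bundle your function code and its installed dependencies into one zip,
    # with paths relative to each root so imports resolve at the archive top level
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for base in (src_dir, deps_dir):
            for f in sorted(pathlib.Path(base).rglob("*")):
                if f.is_file():
                    zf.write(f, f.relative_to(base))
```

the tools below automate exactly this kind of packaging (plus the upload) for you.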
Page 73
3. call your function
Page 74
“what triggers it?”
Page 75
the function is called whenever:
Page 76
a url gets hit.. an object is uploaded to s3.. a record is added to your database.. a queue is not empty..
Page 77
deeply entangled with your cloud provider’s services
Page 78
(infrastructure / glue)
mouse & clicking
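For example, when the trigger is "an object is uploaded to s3", the provider calls your handler with an event describing that object — a sketch using the AWS-style S3 notification shape (the bucket and key values are made up):

```python
def handler(event, context):
    # pull bucket/key out of an S3-style notification event
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    return "s3://{}/{}".format(bucket, key)

# a trimmed-down sample payload, locally just a dict
sample_event = {'Records': [{'s3': {'bucket': {'name': 'my-bucket'},
                                    'object': {'key': 'data/file.txt'}}}]}
print(handler(sample_event, None))
```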
Page 79
tooling addresses those steps
Page 80
deploying: infrastructure + code
Page 81
back to my story..
Page 82
first week.. started looking at repos..
Page 83
One of them is an API
Page 84
something.com/endpointA something.com/endpointB something.com/endpointC
Page 85
my team was an early adopter of this technology
Page 86
they stitched projects with the tools available back then
Page 89
request to urlA -> def a(..)
Page 90
request to urlB -> def b(..)
Page 91
request to urlC -> def c(..)
Page 92
deploy: makefiles
Page 93
glue: terraform
Page 94
3 functions to deploy
Page 95
3 functions to package
Page 96
lots of glue:
- many pieces to move
- much to worry about
- hard to test
Page 97
changing this kind of trigger (url) in terraform is painful
Page 98
running terraform is scary: you might destroy other infrastructure
Page 99
if for whatever reason you want to move to non-serverless.. it is harder..
Page 100
why?
Page 101
because of all the glue that triggers the functions:
you have to implement that glue in a different way
Page 102
3 different lambdas also push people to structure their
project as 3 independent projects..
Page 103
with tools available today you can avoid this..
Page 104
any request to
something.com
Page 105
wsgi wrapper (like a flask app)
Page 106
wsgi wrapper?
Page 107
it translates requests arriving at def handler(..) into wsgi requests
Page 108
it means we can use tools like flask, django..
Page 109
you can use all the tools already available and everything that comes with them
testing is easier
Page 110
you can make your api with the same tools, and get serverless “for free”
Page 112
a minimal implementation of that wrapper (python3) is available at:
https://github.com/slank/awsgi
Page 113
so if you have a dummy flask app
Page 114
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/endpointB')
def endpoint_b():
    # …
    return jsonify(status=200, message='OK')
Page 115
to make it serverless you just need to add this function:
Page 116
import awsgi

def handler(event, context):
    # `app` is the wsgi app defined above
    return awsgi.response(app, event, context)
Page 118
- can run it locally
Page 119
- easy to test routing
Page 120
- less glue
Page 121
API building serverless tools:
Page 122
serverless.. chalice.. zappa..
… and many more..
Page 123
https://github.com/serverless/serverless https://serverless.com/
Serverless
Page 125
Good:
- many plugins
- got funding
- particularly good when you are building apis
- it is not sluggish (for the given use case)
- provides an easy way to handle environments (stg, prd)
- great community
serverless
Page 126
serverless
Bad: no “plan” step:
it does not tell you what will change before deploying
deploying: infrastructure + code
Page 129
Bad:
serverless
changing something in the config file can leak infrastructure
it is dangerous to leave infrastructure leaking behind
Page 130
How to use it?
serverless
Page 131
1. define the glue in a .yml file
Page 132
2. run `sls deploy`
Page 133
serverless
.yml file
Page 134
.yml file:
• I want a function with this memory..
• I want a function with this name…
• I want it to get called when X and Y happen
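Such a .yml definition might look like this (a minimal sketch; the service name, memory size, and HTTP trigger are illustrative, not taken from the talk):

```yaml
# serverless.yml (sketch)
service: my-sample-service

provider:
  name: aws
  runtime: python3

functions:
  hello:
    handler: handler.handler   # module.function
    memorySize: 128            # "a function with this memory"
    events:
      - http:                  # "call it when this url gets hit"
          path: /endpointB
          method: get
```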
Page 135
serverless
`sls deploy`:
• package + upload
Page 136
serverless
serverless has a lot of plugins that you can add to your .yml file.
don’t use it to manage infrastructure.
Page 137
serverless
serverless “applications”
> serverless install --url <service-github-url>
> sls deploy
code + glue + infrastructure, e.g. a serverless service that gives you a slack bot via FaaS
Page 138
Chalice
https://github.com/aws/chalice
- comes with a “wsgi wrapper”
- purely focused on AWS
- aimed at the API use case
> chalice new-project my_sample_project
> chalice deploy
Page 139
How to use it?
Zappa
1. define glue in .json file
Page 140
2. some of the glue code is defined as python decorators
Page 141
@task
def make_pie():
    """This takes a long time!"""
    ingredients = get_ingredients()
    pie = bake(ingredients)
    deliver(pie)

@task
def make_soup():
    ingredients = get_ingredients()
    soup = bake(ingredients)
    deliver(soup)
Page 142
Zappa
it has some cool decorators; it lacks plugins/addons
good if you are building APIs or if you have lots of cron/async calls
Page 143
Another use case..
Page 144
you have a lot of data to process in batch
Page 145
text.. numbers..
Page 146
map & reduce..right?
Page 147
get me a spark cluster
use pyspark.. done
Page 148
compute on TB datasets across hundreds of functions
(diagram: functions for cleaning wikipedia.. etl.. data preprocessing.. counting frequencies)
Page 149
this is a great use case for FaaS
Page 150
pywren
https://github.com/pywren/pywren
Page 151
pywren
benchmark:
- 80 GB/sec read
- 60 GB/sec write
Page 152
No knowledge of AWS required
pywren
Page 153
no large (expensive) cluster to keep up
Page 154
using vanilla python
Page 155
> pywren-setup
pywren
Page 156
essentially:
1. takes a python function (creates a FaaS function)
Page 157
2. takes your data and uploads it to s3
Page 158
3. runs your python function in parallel on the data uploaded to s3
Page 159
def add_one(x):
    return x + 1
pywren
Page 160
[0, 1…9]
Page 161
def clean_text(text):
    # clean text
    return cleaned_text
pywren
Page 162
wikidump (100G)
Page 163
pywren
def add_one(x):
    return x + 1
Page 164
creates a lambda
Page 166
uploads the data to s3: part1.. part2.. part3..
Page 170
invokes the function on each item in parallel: add_one(0).. add_one(1).. add_one(2)..
Page 176
pywren
what does it look like?

import pywren
import numpy as np

xlist = np.arange(10)  # data

# pywren magic
wrenexec = pywren.default_executor()
futures = wrenexec.map(add_one, xlist)

# f.result() blocks until the s3 result file is available
print([f.result() for f in futures])

> python sample.py
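pywren's executor-and-futures interface resembles the stdlib `concurrent.futures` pattern, so the same shape can be tried locally with threads instead of lambdas — a sketch for comparison (no AWS involved):

```python
from concurrent.futures import ThreadPoolExecutor

def add_one(x):
    return x + 1

xlist = range(10)  # data

# local stand-in for pywren's executor: threads instead of lambda functions
with ThreadPoolExecutor() as ex:
    futures = [ex.submit(add_one, x) for x in xlist]

# result() blocks until each future completes
results = [f.result() for f in futures]
print(results)
```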
Page 180
pywren
good for:
- ETL tasks..
- scraping..
- data crunching in general..
Page 181
similar to pywren
https://github.com/qubole/spark-on-lambda
spark-on-lambda
Page 182
looks experimental though
Page 183
same as spark.. just the executors are lambda functions
Page 184
scanning 1 TB of data with 1000 lambda executors took 47s; the cost turns out to be $1.18
spark-on-lambda
Page 185
on regular spark: 50 r3.xlarge instances.. 2 or 3 mins just to set up + start
Page 187
caveat: no multiprocessing module (it does not work on lambda)
Page 189
structure your project as any other python project
Page 190
don’t think of it as FaaS
Page 191
if your project has many FaaS
Page 192
it is a single project, with different entry points
Page 194
FaaS is built on top of containers
Page 195
containers for the same function sometimes get reused..
Page 196
db_connection = connect(something)

def handler(a, b):
    db_connection.query(something(a))
Page 197
1st run: the module-level line `db_connection = connect(something)` executes, then `handler` runs
Page 199
2nd run: only `handler` runs; the module state (and the connection) is reused
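Because a warm container keeps module state alive, a common pattern is to create the connection lazily and cache it at module level — a sketch where `connect()` is a stand-in that just counts how often it is called:

```python
calls = {"connect": 0}

def connect():
    # stand-in for an expensive database connection
    calls["connect"] += 1
    return object()

_connection = None  # module-level: survives warm invocations

def get_connection():
    global _connection
    if _connection is None:        # only pay the cost on a cold start
        _connection = connect()
    return _connection

def handler(event, context):
    conn = get_connection()
    return calls["connect"]

# two "warm" invocations reuse the same connection
handler({}, None)
handler({}, None)
print(calls["connect"])
```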
Page 200
as in any project
Page 201
mutable globals are not desirable
Page 202
list_of_users = ['admin']

def handler(a, b):
    # mutates shared module-level state: it leaks across invocations
    list_of_users.append(a['user'])
    do_something(list_of_users)
Page 203
Page 204
1st invocation with user1: intended input to do_something: [‘admin’, user1]; actual input: [‘admin’, user1]
Page 206
2nd invocation with user2: intended input to do_something: [‘admin’, user2]; actual input: [‘admin’, user1, user2]
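A safer version keeps the global read-only and builds a fresh list on every invocation — a sketch (`do_something` here is a stand-in):

```python
ADMINS = ['admin']  # treated as read-only

def do_something(users):
    pass  # stand-in for the real work

def handler(a, b):
    # build a new list each invocation instead of mutating shared state
    list_of_users = ADMINS + [a['user']]
    do_something(list_of_users)
    return list_of_users

print(handler({'user': 'user1'}, None))  # ['admin', 'user1']
print(handler({'user': 'user2'}, None))  # ['admin', 'user2']
```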
Page 210
Dependencies.. please update them
Page 211
delete old functions
Page 212
you don’t pay if you don’t use them so you don’t get reminded in your bill
Page 213
if you don’t use them, delete them
Page 214
sounds easy.. but.. in a large organisation:
- you have no idea who the owner is
Page 215
- is it safe to delete?
Page 216
tighten up your permissions:
only grant the permissions that are needed
Page 217
DDoS attacks become attacks on your wallet
Page 219
lots of new tools are yet to come…
Page 220
serverless provides peace of mind: it will be running, it won’t be down
but you agree to go all-in on your cloud provider’s features
Page 221
There is a lot of glue.. lots of events here and
there….
Page 222
serverless is [in my opinion] cheaper, not simpler
Page 223
data crunching : yes handling small events: yes
Page 224
total complexity of your system might grow
Page 226
this is your extra complexity: glue, wiring
Page 228
glue is hard to test
Page 229
this is the hard part.. integration testing
Page 230
does the event trigger the expected behaviour?
Page 231
it is getting more popular..
Page 232
many surveys seem to indicate FaaS adoption is
as fast as that of containers
Page 233
I guess new tools will help tame this
complexity
Page 234
I guess now you are wondering:
should I use serverless or not?
Page 235
If you find yourself in any of these situations:
Page 236
you have a developer complaining about having to spin up infrastructure before they can get something done
Page 237
maybe you are a data scientist annoyed by how difficult it is to
run your experiment
Page 238
then maybe it is worth trying
Page 239
by the way, yes..
Page 240
we built our stream processing system on top of
FaaS
Page 241
Gracias! (Thanks!)
dav009