Page 1
AWS LAMBDA
from the TRENCHES
what you should know before you go to production
Page 3
hi, I’m Yan Cui
AWS user since 2009
Page 11
hidden complexities and dependencies
low utilisation to leave room for traffic spikes
EC2 scaling is slow, so scale earlier
lots of cost for unused resources
up to 30 mins for deployment
deployment required downtime
Page 12
“lead time to someone saying thank you is the only reputation
metric that matters.”
- Dan North
Page 14
“what would good
look like for us?”
Page 15
DEPLOYMENTS SHOULD...
be small
be fast
have zero downtime
have no lock-step
Page 16
FEATURES SHOULD...
be deployable independently
be loosely-coupled
Page 17
WE WANT TO...
minimise cost for unused resources
minimise ops effort
reduce tech mess
deliver visible improvements faster
Page 19
170 Lambda functions in prod
1.2 GB deployment packages in prod
95% cost saving vs EC2
15x no. of prod releases per month
Page 20
time
is a good fit
Page 21
1st function in prod!
time
is a good fit
Page 22
?
time
is a good fit
1st function in prod!
Page 23
Principles: what is good?
Practices: how to make it good?
Tools: with what?
Page 24
Principles outlast Tools
Page 25
ALERTING
CI / CD
TESTING
LOGGING
MONITORING
Page 26
170 functions
WOOF!
? ?
time
is a good fit
1st function in prod!
Page 27
SECURITY
DISTRIBUTED TRACING
CONFIG MANAGEMENT
Page 28
evolving the PLATFORM
Page 30
Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch
Page 31
Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch
Amazon API Gateway, AWS Lambda
Page 32
new analytics pipeline
Page 33
Legacy Monolith, Amazon Kinesis, AWS Lambda, Google BigQuery
Page 34
Legacy Monolith, Amazon Kinesis, AWS Lambda, Google BigQuery
1 developer, 2 days from design to production
(his 1st serverless project)
Page 35
Legacy Monolith, Amazon Kinesis, AWS Lambda, Google BigQuery
“nothing ever got done this fast at Skype!”
- Chris Twamley
Page 36
“lead time to someone saying thank you is the only reputation
metric that matters.”
- Dan North
Page 37
Rebuilt with Lambda
Page 47
grapheneDB
BigQuery
Page 50
getting PRODUCTION READY
Page 51
CHOOSE A DEPLOYMENT FRAMEWORK
Page 52
http://serverless.com
Page 53
https://github.com/awslabs/serverless-application-model
Page 55
https://apex.github.io/up
Page 56
https://github.com/claudiajs/claudia
Page 57
https://github.com/Miserlou/Zappa
Page 58
http://gosparta.io/
Page 61
Level of Testing
1. Unit
do our objects do the right thing?
are they easy to work with?
Page 63
Level of Testing
1. Unit
2. Integration
does our code work against code we can’t change?
Page 65
handler
test by invoking the handler
Page 66
Level of Testing
1. Unit
2. Integration
3. Acceptance
does the whole system work?
Page 67
Level of Testing
unit
integration
acceptance
feedback
confidence
Page 68
“…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise.
The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…”
Don’t Mock Types You Can’t Change
Page 69
“…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do…
Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…”
Don’t Mock Types You Can’t Change
Page 70
Don’t Mock Services You Can’t Change
Page 71
Paul Johnston
The serverless approach to testing is different and may
actually be easier.
http://bit.ly/2t5viwK
Page 72
API Gateway, Lambda, DynamoDB
Page 73
API Gateway, Lambda, DynamoDB
Unit Tests
Page 74
API Gateway, Lambda, DynamoDB
Unit Tests
Mock/Stub
Page 75
what unit tests will not tell you…
API Gateway, Lambda, DynamoDB
is our request correct?
is the request mapping set up correctly?
are the API resources configured correctly?
are we assuming the correct schema?
is Lambda proxy configured correctly?
is the IAM policy set up correctly?
is the table created?
Page 77
observation: most Lambda functions are simple and have a single purpose, so the risk of
shipping broken software has largely shifted to how they integrate with
external services
Page 79
But it slows down my feedback loop…
IT’S NOT ABOUT YOU!
Page 81
IT’S CHINA. NOT SCHINA.
Page 88
Your users shouldn’t be the ones to pay the price for your
faster feedback loop. Optimise for working software.
Test your software end-to-end.
- me
Page 89
“…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code.
An end-to-end test interacts with the system only from the outside: through its interface…”
Testing End-to-End
Page 90
Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch
Amazon API Gateway, AWS Lambda
Page 91
Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch
Amazon API Gateway, AWS Lambda
Test Input
Page 92
Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch
Amazon API Gateway, AWS Lambda
Test Input
Validate
Page 93
integration tests exercise the system’s Integration with its
external dependencies
Page 94
acceptance tests exercise the system End-to-End from
the outside
Page 95
observation: integration tests differ from acceptance tests only in HOW the
Lambda functions are invoked
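One way to read this observation: the same test case can run in both modes, with only the invocation step swapped. The sketch below is illustrative only. The `handler`, the `/search` route, the `baseUrl` option, and the injected `httpGet` helper are all hypothetical names, not taken from the talk.

```javascript
// Hypothetical handler under test; it stands in for a real Lambda function.
const handler = async (event) => ({
  statusCode: 200,
  body: JSON.stringify({ query: event.queryStringParameters.q }),
});

// Invoke the function the way the chosen test mode requires.
// - "integration": call the handler in-process with a stubbed event
// - "acceptance": call the deployed HTTP endpoint via an injected httpGet
async function invoke(mode, query, { httpGet, baseUrl } = {}) {
  if (mode === "integration") {
    return handler({ queryStringParameters: { q: query } });
  }
  if (mode === "acceptance") {
    return httpGet(`${baseUrl}/search?q=${encodeURIComponent(query)}`);
  }
  throw new Error(`unknown test mode: ${mode}`);
}

// The same check runs unchanged in both modes.
async function searchEchoesQuery(mode, deps) {
  const res = await invoke(mode, "lambda", deps);
  return res.statusCode === 200 && JSON.parse(res.body).query === "lambda";
}
```

In this shape the test body never knows which mode it is in; a test runner would pick the mode from configuration and supply a real HTTP client for acceptance runs.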
Page 100
“the earlier you consider CI + CD, the more time you save in the long run”
- me
Page 101
“…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed…
This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…”
Testing End-to-End
Page 102
“deployment scripts that only live on the CI
box are a disaster waiting to happen”
- me
Page 103
Jenkins build config deploys and tests
unit + integration tests
deploy
acceptance tests
Page 104
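# usage() is assumed to be defined earlier in build.sh
if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
  STAGE=$2
  REGION=$3
  PROFILE=$4

  # install dependencies, then deploy with the project-local Serverless CLI
  npm install
  AWS_PROFILE="$PROFILE" 'node_modules/.bin/sls' deploy -s "$STAGE" -r "$REGION"
elif [ "$1" = "int-test" ] && [ $# -eq 4 ]; then
  STAGE=$2
  REGION=$3
  PROFILE=$4

  # run the integration tests for the given stage
  npm install
  AWS_PROFILE="$PROFILE" npm run "int-$STAGE"
elif [ "$1" = "acceptance-test" ] && [ $# -eq 4 ]; then
  STAGE=$2
  REGION=$3
  PROFILE=$4

  # run the acceptance tests for the given stage
  npm install
  AWS_PROFILE="$PROFILE" npm run "acceptance-$STAGE"
else
  usage
  exit 1
fi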
Page 105
build.sh allows repeatable builds on both local & CI
Page 107
Auto Auto Manual
Page 110
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
Page 111
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
fields: UTC Timestamp, API Gateway Request Id, your log message
Page 112
function name
date
function version
Page 115
CENTRALISE LOGS
MAKE THEM EASILY SEARCHABLE
Page 116
the ELK stack (Elasticsearch + Logstash + Kibana)
Page 118
CloudWatch Logs AWS Lambda ELK stack
Page 119
CloudWatch Events
Page 121
http://bit.ly/2f3zxQG
Page 122
DISTRIBUTED TRACING
Page 124
“my followers didn’t receive my new post!”
- a user
Page 125
where could the problem be?
Page 126
correlation IDs*
* e.g. request-id, user-id, yubl-id, etc.
Page 127
ROLL YOUR OWN CLIENTS
Page 128
kinesis client
http client
sns client
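A minimal sketch of what a hand-rolled client could do with correlation IDs: capture them from the incoming request, then attach them to every outgoing call. The `x-correlation-` header prefix, both function names, and the injected `httpRequest` function are assumptions made for illustration; the talk does not specify them.

```javascript
// Assumed convention: correlation IDs travel as headers with this prefix.
const PREFIX = "x-correlation-";

// Collect correlation IDs from incoming HTTP headers. If the caller did
// not send a request ID, fall back to the Lambda invocation's request ID.
function captureCorrelationIds(headers, awsRequestId) {
  const ids = {};
  for (const [key, value] of Object.entries(headers || {})) {
    const lower = key.toLowerCase();
    if (lower.startsWith(PREFIX)) ids[lower] = value;
  }
  if (!ids[`${PREFIX}id`]) ids[`${PREFIX}id`] = awsRequestId;
  return ids;
}

// Wrap an HTTP client function so every outgoing request forwards the IDs.
function withCorrelationIds(httpRequest, ids) {
  return (options) =>
    httpRequest({ ...options, headers: { ...(options.headers || {}), ...ids } });
}
```

The same wrapping idea would apply to a Kinesis or SNS client, where the IDs would travel in the record payload or message attributes rather than in HTTP headers.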
Page 129
http://bit.ly/2k93hAj
Page 130
ROLL YOUR OWN CLIENTS
X-RAY
Page 133
traces do not span over API Gateway
Page 134
http://bit.ly/2s9yxmA
Page 135
MONITORING + ALERTING
Page 136
“where do I install monitoring agents?”
Page 138
• invocation Count
• error Count
• latency
• throttling
• granular to the minute
• support custom metrics
Page 139
• same metrics as CW
• better dashboard
• support custom metrics
https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
Page 141
“how do I batch up and send logs in the
background?”
Page 142
you can’t (kinda)
Page 143
logs:
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
metrics:
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
metric format: MONITORING|timestamp|metric value|metric type|metric name
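The metric lines above follow a pipe-delimited layout, so writing and reading them takes only a few lines of code. This is a sketch under that assumption; the function names `formatMetric` and `parseMetric` are hypothetical.

```javascript
// Build a metric log line: MONITORING|timestamp|value|type|name
function formatMetric(name, type, value, epochSeconds) {
  return ["MONITORING", epochSeconds, value, type, name].join("|");
}

// Parse a log line back into a metric. Returns null for ordinary log
// messages, so a log-shipping function can route the two kinds separately.
function parseMetric(line) {
  const parts = line.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}

// Emitting a metric is then just a log statement:
console.log(formatMetric("user-api-latency", "latency", 27.4, 1489795335));
```

Because the function only writes to stdout, publishing the metric happens outside the invocation and adds no latency to the request itself.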
Page 144
CloudWatch Logs, AWS Lambda
logs: ELK stack
metrics: CloudWatch
Page 145
http://bit.ly/2gGredx
Page 147
DASHBOARDS
SET ALARMS
Page 148
DASHBOARDS
SET ALARMS
TRACK APP-LEVEL METRICS
Page 149
Not Only CloudWatch
Page 151
“you really don't want your monitoring
system to fail at the same time as the
system it monitors”
- me
Page 152
CONFIG MANAGEMENT
Page 153
easily and quickly propagate config changes
Page 155
CENTRALISED CONFIG SERVICE
Page 156
config service goes here
Page 160
SSM Parameter Store
Page 161
sensitive data should be encrypted in-flight, and at rest
(credentials, connection string, etc.)
Page 162
role-based access
Page 163
SSM Parameter Store
HTTPS
role-based access
encrypted in-flight
Page 164
SSM Parameter Store
encrypt
role-based access
Page 165
SSM Parameter Store
encrypted at-rest
Page 166
HTTPS
role-based access
SSM Parameter Store
encrypted in-flight
Page 167
CENTRALISED CONFIG SERVICE
CLIENT LIBRARY
Page 168
fetch & cache at Cold Start
Page 169
invalidate at interval + signal
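A rough sketch of a client library with this behaviour: fetch once when first needed, serve from memory afterwards, and refetch after a time interval or an explicit signal. `createConfigClient`, `fetchConfig`, `ttlMs`, and the injectable `now` clock are hypothetical names; in practice `fetchConfig` might wrap a call to a parameter store.

```javascript
// Build a caching config client. The instance is meant to live at module
// scope (outside the handler) so the cache survives across invocations
// that reuse the same container.
function createConfigClient(fetchConfig, ttlMs, now = Date.now) {
  let cached = null;
  let expiresAt = 0;

  return {
    // Return the cached config, refetching on first use or after expiry.
    async get() {
      if (cached === null || now() >= expiresAt) {
        cached = await fetchConfig();
        expiresAt = now() + ttlMs;
      }
      return cached;
    },
    // Signal-driven invalidation: force the next get() to refetch.
    invalidate() {
      cached = null;
    },
  };
}
```

The interval bounds how stale a value can get, while `invalidate()` gives a way to push an urgent change through without waiting for the interval to elapse.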
Page 170
http://bit.ly/2yLUjwd
Page 172
max 75 GB total deployment package size*
* limit is per AWS region
Page 174
Janitor Lambda
http://bit.ly/2xzVu4a
Page 175
disable versionFunctions in
Page 176
install Serverless framework as dev dependency at project level
dev dependencies are excluded since 1.16.0
Page 177
http://bit.ly/2vzBqhC
Page 178
http://amzn.to/2vtUkDU
Page 179
UNDERSTAND COLD STARTS
Page 180
AWS X-Ray
1st invocation
2nd invocation
cold start
Page 181
source: http://bit.ly/2oBEbw2
Page 182
EMBRACE NODE.JS & PYTHON
Page 183
http://bit.ly/2rtCCBz
Page 184
C#
http://bit.ly/2rtCCBz
Page 185
Java
http://bit.ly/2rtCCBz
Page 186
Node.js, Python
http://bit.ly/2rtCCBz
Page 187
what about type safety?
Page 189
complexity ceiling of a Node.js app
complexity
Page 190
complexity ceiling of a Node.js app
complexity
referential transparency
immutability as default
type inference
option types
union types
…
Page 191
for managing complexity
complexity ceiling of a Node.js app
complexity
referential transparency
immutability as default
type inference
option types
union types
…
Page 192
complexity ceiling of a Node.js app
complexity
complexity ceiling of a Node.js Lambda function
Page 193
if you can limit the complexity of your solution, maybe you
won’t need the tools for managing that complexity.
- me
Page 194
AVOID HARD ASSUMPTIONS
ABOUT FUNCTION LIFETIME
Page 195
USE STATE FOR
OPTIMISATION
Page 197
CloudWatch Event AWS Lambda
Page 198
CloudWatch Event AWS Lambda
ping
ping
ping
ping
Page 200
CloudWatch Event AWS Lambda
ping
ping
ping
ping
HEALTH CHECKS?
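One reading of these slides is that a scheduled CloudWatch Event pings the function regularly, and the function answers pings without doing real work. A sketch under that assumption follows; `isPing`, `doWork`, and the response shapes are hypothetical, while the `source` and `detail-type` values are those carried by CloudWatch Events scheduled events.

```javascript
// True for CloudWatch Events scheduled invocations, which carry
// source "aws.events" and detail-type "Scheduled Event".
function isPing(event) {
  return Boolean(event) &&
    event.source === "aws.events" &&
    event["detail-type"] === "Scheduled Event";
}

// Hypothetical business logic, standing in for the real function body.
async function doWork(event) {
  return { statusCode: 200, body: JSON.stringify({ path: event.path }) };
}

// Answer pings cheaply; only do real work for genuine requests.
async function handler(event) {
  if (isPing(event)) {
    return { ping: "ok" };
  }
  return doWork(event);
}
```

The ping branch is also a natural place for a lightweight health check, as long as the function still makes no assumption about how long its container will live.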
Page 201
max 5 mins execution time
Page 202
USE RECURSION FOR LONG
RUNNING TASKS
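A minimal sketch of the recursion idea: work through a list until the remaining execution time runs low, then hand the rest to a fresh invocation. `runTask`, `processItem`, `invokeSelf`, the event shape, and `safetyMarginMs` are assumptions for illustration; `context.getRemainingTimeInMillis()` is the method on the Lambda Node.js context object.

```javascript
// Process items until time runs low, then re-invoke with a checkpoint.
// processItem and invokeSelf are injected; invokeSelf would typically
// trigger an asynchronous invocation of this same function.
async function runTask(event, context, { processItem, invokeSelf, safetyMarginMs = 10000 }) {
  let position = event.position || 0;
  for (const item of event.items.slice(position)) {
    if (context.getRemainingTimeInMillis() < safetyMarginMs) {
      // Not enough time left: pass the checkpoint to the next invocation.
      await invokeSelf({ items: event.items, position });
      return { done: false, position };
    }
    await processItem(item);
    position += 1;
  }
  return { done: true, position };
}
```

Carrying the checkpoint in the event keeps the function stateless, so each recursive invocation can land on any container.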
Page 203
CONSIDER PARTIAL
FAILURES
Page 204
“AWS Lambda polls your stream and invokes your Lambda function. Therefore, if
a Lambda function fails, AWS Lambda attempts to process the erring batch of
records until the time the data expires…”
http://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
Page 205
should function fail on partial/any failures?
Page 206
SNS
Kinesis
SQS
after 3 attempts
share processing logic
events are processed in chronological order
failed events are retried out of sequence
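The trade-off above can be sketched as follows: retry each record a few times inside the function, then divert persistent failures to a fallback queue rather than failing the whole batch. `processBatch`, `processRecord`, `sendToFallback`, and `maxAttempts` are hypothetical names; the talk does not prescribe this exact shape.

```javascript
// Process each record independently. Records that still fail after
// maxAttempts are handed to a fallback queue so the rest of the batch
// (and the stream behind it) is not blocked.
async function processBatch(records, { processRecord, sendToFallback, maxAttempts = 3 }) {
  const diverted = [];
  for (const record of records) {
    let succeeded = false;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt += 1) {
      try {
        await processRecord(record);
        succeeded = true;
      } catch (err) {
        // Swallow and retry; a real function would log the error here.
      }
    }
    if (!succeeded) {
      await sendToFallback(record);
      diverted.push(record);
    }
  }
  return { processed: records.length - diverted.length, diverted };
}
```

As the slide notes, diverted records are retried later and therefore out of sequence, so this shape only suits consumers that can tolerate reordering.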
Page 207
PROCESS SQS WITH RECURSIVE
FUNCTIONS
Page 208
http://bit.ly/2npomX6
Page 209
AVOID HOT KINESIS
STREAMS
Page 210
“Each shard can support up to 5 transactions per second for reads, up to a maximum total data
read rate of 2 MB per second.”
http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
Page 211
“If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each
Lambda function processes events on a shard in the order that they arrive.”
http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
Page 212
when no. of processors goes up…
Page 213
ReadProvisionedThroughputExceeded
can have too many Kinesis read operations…
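The arithmetic behind this can be sketched with the quoted limit of 5 read transactions per second per shard. The polling rate is an assumption: the sketch defaults to one poll per second per subscribed function per shard, which the slides do not state. Both function names are hypothetical.

```javascript
// Per-shard read limit from the quoted Kinesis documentation.
const MAX_READS_PER_SECOND_PER_SHARD = 5;

// Read calls per second hitting ONE shard.
// pollsPerSecond is an assumed polling rate per subscribed function.
function shardReadRate(subscribedFunctions, pollsPerSecond = 1) {
  return subscribedFunctions * pollsPerSecond;
}

// True when the combined polling would exceed the per-shard limit.
function exceedsReadLimit(subscribedFunctions, pollsPerSecond = 1) {
  return shardReadRate(subscribedFunctions, pollsPerSecond) > MAX_READS_PER_SECOND_PER_SHARD;
}

console.log(exceedsReadLimit(5)); // false: 5 reads/s is at the limit
console.log(exceedsReadLimit(6)); // true: a 6th subscriber tips it over
```

Under that assumed rate, adding shards does not help, because every subscribed function polls every shard; only reducing the number of subscribers per stream lowers the per-shard read rate.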
Page 214
GetRecords.IteratorAgeMilliseconds
unpredictable spikes in read ‘latency’…
Page 215
can kinda work around…
Page 216
http://bit.ly/2uv5LsH
Page 217
clever, but costly
Page 218
for subsystems that don’t have to be realtime, or are task-based
(i.e. order doesn’t matter), consider other
triggers such as S3 or SNS.
- me
Page 219
@theburningmonk
theburningmonk.com
github.com/theburningmonk
Page 220
@theburningmonk
theburningmonk.com
github.com/theburningmonk
http://bit.ly/2yQZj1H
all my blog posts on Lambda
Page 221
sign up here: http://bit.ly/2xIO23O