nginx conference 2015
Move Over IBM WebSeal and F5 BigIP, Here Comes NGINX
09/23/2015
#nginx #nginxconf2
Bart Warmerdam
Advisory IT Specialist at ING Bank N.V.
Who is ING globally
3
Who is ING in the Netherlands
4
• Bank with diverse software and hardware landscape
• Cost-driven IT
• Traditional software development: design, build, test, implement
• Software strategy: buy before build
• Middleware strategy: buy
• Hardware strategy: appliance
History up to 2.5 years ago within ING
5
• Bank with diverse software and hardware landscape
• IT and time-to-market are important
• 60 scrum teams internally working on software
• Software strategy: build before buy (a lot of the time)
• Middleware strategy: buy, but…
• Hardware strategy: standard scalable stacks
From 2.5 years ago up to now
6
Complex IT landscape
Task: simplify IT
Add missing functionality
7
• Internet-facing reverse proxies (IBM TAM WebSeal)
 – Authenticating proxy
 – Content caching and compression
 – Cookie jar functionality
• Multiple layers of load balancers (F5 BigIP)
 – Over data centers
 – Over nodes in different network zones
For all internet-facing domains of domestic banking in the Netherlands
Infrastructure to replace
8
• Investigate open source software: NGINX or Apache vs. IBM WebSeal / F5
• Perform a proof of concept with NGINX for Authentication and Event Publishing
• Write a report for the deciding architects, which concluded after the proof of concept:
 – Replace IBM TAM WebSeal with NGINX using custom modules
 – Integrate the layers of F5 BigIPs with NGINX
The result: “GO!” Now we are more in control than ever.
The Plan to Simplify
9
Starting with
10
[Diagram: Tier 1 (DMZ) contains F5 load balancers in front of IBM WebSeal nodes; Tier 2 contains an F5 load balancer in front of the applications, the External Authentication Interface and the Policy Manager / LDAP; an Inter Connectivity Cloud connects the data centers]
Working towards
11
[Diagram: Tier 1 (DMZ) contains an F5 load balancer in front of NGINX nodes; Tier 2 uses NGINX load balancing in front of the applications and the External Authentication Interface; an Inter Connectivity Cloud connects the data centers]
Control in…
12–20

• Integrate the Authentication and Event Publishing module from the PoC
• Add missing cookie jar functionality
• Add load balancing persistency over data centers
• Add dynamic service discovery so teams can self-service endpoints
• Integrate the existing (Java) Continuous Delivery pipeline
• Monitor system resource usage and errors to Graphite
• Add Grafana dashboards and mobile alerts for team dashboards
• Monitor and report upstream errors to Tivoli Omnibus (MCR)
• Make performance data and reports available to all scrum teams

Functionality
Time-to-Market
Operational Monitoring
Control
• First step: Integrate into the Continuous Delivery pipeline
 – From Git to production
• Second step: Add additional functionality to NGINX
• Future roadmap of the NGINX authenticating proxy environment
Roll-out planning
21
• Using standard open source tools like: Git, Jenkins, Maven, Nexus, Docker, Valgrind, Python
• And closed source tools like: Nolio (deployments), Fortify (static source code analysis)
First step: integrate in continuous delivery pipeline
22
23
GIT repository
24
Commits on “develop” trigger a build in Jenkins
Using an Apache Maven build profile
25
Which builds the project modules
26
By packaging all of our own modules
And adding the nginx.org source from our Nexus repository
And 3rd-party source modules from our Nexus repository
As a tar.gz file
27
And add the RedHat .spec file
28
To start a Docker build in a CentOS image
Which results in an RPM
29
If all Python tests succeed on the binary
30
If all integration test scripts ran successfully
And all product acceptance scripts ran successfully
31
And all module tests succeed as well
32
Using a Python test framework
To easily create test cases for the binary and modules
33
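The deck does not show the framework itself; as a rough, hedged illustration of the pattern (class and function names are assumptions, and a stub HTTP handler stands in for the NGINX binary under test), a minimal test case could look like:

```python
# Hypothetical sketch of a small Python test helper in the spirit of the
# framework described: assert on the status and headers returned by the
# proxy binary. EchoHandler is only a stand-in for the real binary.
import http.client
import http.server
import threading

class EchoHandler(http.server.BaseHTTPRequestHandler):
    # Stand-in for the NGINX binary under test.
    def do_GET(self):
        self.send_response(200)
        self.send_header("X-Proxy", "nginx-test")
        self.end_headers()

    def log_message(self, *args):  # silence request logging during tests
        pass

def run_case(port, path, expect_status, expect_headers):
    """One test case: GET a path and assert on status and headers."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path)
    resp = conn.getresponse()
    assert resp.status == expect_status, resp.status
    for name, value in expect_headers.items():
        assert resp.getheader(name) == value, name
    conn.close()

if __name__ == "__main__":
    server = http.server.HTTPServer(("127.0.0.1", 0), EchoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    run_case(port, "/status", 200, {"X-Proxy": "nginx-test"})
    server.shutdown()
    print("all cases passed")
```

In the real pipeline the server side would be the freshly built RPM's binary rather than a Python stub.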
The RPMs and test results are uploaded to a Nexus repository
Together with Nolio deployment scripts
After which Jenkins triggers an automatic Nolio deployment in LCM
34
Each commit in “develop” also starts a Jenkins job that
Triggers the Valgrind tests on all modules
And emails the results on failures
35
Each commit in “develop” also schedules a nightly Jenkins job that
Starts a Fortify scan for static source code analysis
On all our own modules, the NGINX code and all 3rd-party modules used
36
Releases on “master” trigger a build in Jenkins
Using the Apache Maven release profile
Where versioned artifacts are uploaded to Nexus
37
Configuration releases on “master” trigger a build in Jenkins
Where the correct nginx.conf and site information are created
38
And SQL is used to create a list of URL endpoints
And their module directives
39
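To make the step concrete, here is a hedged sketch of turning such endpoint rows into nginx `location` blocks. The row shape and the `cookie_jar` directive are illustrative assumptions, not ING's actual schema or directive names:

```python
def render_locations(endpoints):
    """Render nginx location blocks from endpoint rows, as a SQL query
    might return them. Each row carries a URI, a list of module
    directives (hypothetical names) and an upstream name."""
    blocks = []
    for ep in endpoints:
        lines = ["location %s {" % ep["uri"]]
        lines += ["    %s;" % d for d in ep["directives"]]
        lines.append("    proxy_pass http://%s;" % ep["upstream"])
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

if __name__ == "__main__":
    rows = [{"uri": "/payments",
             "directives": ["cookie_jar on"],   # hypothetical directive
             "upstream": "payments_pool"}]
    print(render_locations(rows))
```

The real setup wraps this kind of generation in a Maven plugin, as the next slide notes.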
Using a maven plugin to create the correct configuration files
40
Using Docker to build an RPM and test all generated configurations
41
So it can be automatically deployed in Nolio in LCM by Jenkins
• LCM DEV + TST environment for internal team tests
• DEV + TST for integration tests for all other teams
• ACC for pre-production tests
 – Daily load tests using LoadRunner and performance reports using Python, LaTeX and gnuplot
 – Weekly resilience tests
 – Unplanned Simian Army tests
 – “perf” runs for NGINX profiling (if a change requires it)
 – Penetration and security tests
• Multiple PRD environments in different data centers
 – Replaced all IBM WebSeal reverse proxies with NGINX
 – Starting to replace all F5 BigIP internal load balancers with the NGINX load balancer module
The result…
42
• Using “perf”, we analyzed the binary under a load of ~500 URIs/sec
Optimizing the result
43
Numbers 1, 3, 8 and 11 are GZIP compression
Number 2 is memset => hard to pinpoint due to generic use
Number 4 is the network driver => cannot change
Number 5 is cookie header parsing, triggered by our code
Number 6 is the OS
Number 7 is Kafka CRC32 code
Number 9 is memcpy => hard to pinpoint due to generic use
Number 10 is caused by the audit system => cannot change
Number 20 is the first of our own methods listed
• GZIP is expensive on the CPU, use optimized libraries when possible
• Use static linking when replacing the patched library cannot be done on the target machine
• Two patches available, from Intel and Cloudflare
 – Compression level 5
Source: https://www.snellman.net/blog/archive/2014-08-04-comparison-of-intel-and-cloudflare-zlib-patches.html
Include optimized libraries
44
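As a quick, hedged illustration of why the compression level matters so much on the CPU: the snippet below uses Python's stock zlib (not the Intel or Cloudflare patched builds the slide refers to), and the sample payload is invented; absolute numbers vary by machine, but the size/time trade-off across levels is the point.

```python
import time
import zlib

# Moderately compressible sample payload (JSON-ish lines with counters),
# standing in for typical proxy response bodies.
data = "".join(
    '{"user":%d,"path":"/payments/%d"}' % (i, i) for i in range(20000)
).encode()

# Stock zlib for illustration; the Intel/Cloudflare patches speed up the
# individual levels but do not change the shape of this trade-off.
for level in (1, 5, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    ms = (time.perf_counter() - t0) * 1e3
    print("level %d: %7d -> %7d bytes in %.1f ms" % (level, len(data), len(out), ms))
```

Level 5 is the usual middle ground: most of the size win of level 9 at a fraction of the CPU cost.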
• Some libraries are not available on the target machine (Kafka, MaxMind, Protobuf)
• Some libraries are too old on the target machine (PCRE3 – for JIT)
• CPU optimized versions are added in the Docker image and statically linked
Patching libraries for performance
45
• Our five most important home-made modules
 Cookie jar module – store Set-Cookie operations in the reverse proxy
 WebSeal module – authentication module based on the Extended Authentication Interface (EAI)
 Kafka module – send event messages from the proxy layer to other systems
 Load balancing module – rule-based upstream use, allows dynamic service discovery
 Monitoring module – monitor application use and system resource usage
Second step: Add additional functionality to NGINX
46
• Uses two levels of RB trees to store state
• Highly configurable
• Uses timers for automatic expiration and cleanup
• Uses shared memory to share state between workers
Cookie jar module
47
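A toy model of the idea, in Python rather than the C module the talk describes: plain dicts stand in for the two levels of shared-memory RB trees (outer key = session, inner key = cookie name), and the class and method names are illustrative assumptions.

```python
import time

class CookieJar:
    """Toy model of the cookie-jar idea: the proxy absorbs Set-Cookie
    operations per session and replays the live cookies upstream, so
    clients never see them. Expiration here is checked lazily on read,
    standing in for the module's timer-driven cleanup."""

    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self.sessions = {}  # session_id -> {cookie_name: (value, expires_at)}

    def store(self, session_id, name, value, now=None):
        now = now if now is not None else time.time()
        self.sessions.setdefault(session_id, {})[name] = (value, now + self.ttl)

    def cookies_for(self, session_id, now=None):
        now = now if now is not None else time.time()
        jar = self.sessions.get(session_id, {})
        live = {n: v for n, (v, exp) in jar.items() if exp > now}
        self.sessions[session_id] = {n: jar[n] for n in live}  # prune expired
        return live
```

The real module needs RB trees, timers and shared memory precisely because many worker processes must share this state with bounded lookup cost.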
• Uses an RB tree to store session state
• Allows access based on different policies (fine- or coarse-grained)
• Uses timers for automatic expiration and cleanup
• Uses shared memory to share state between workers
• Implements the EAI interface to allow gradual migration
WebSeal module
48
• Publishes events for monitoring and error analysis
• Highly configurable using a separate JSON config file
• Fast and asynchronous to avoid processing overhead
Event Publishing (Kafka) module
49
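The "fast and asynchronous" point can be sketched as follows. This is a hedged Python stand-in for the C module, not ING's code: `send_batch` is a placeholder for a real Kafka producer, and the class name and batch size are assumptions.

```python
import json
import queue
import threading

class EventPublisher:
    """Sketch of fire-and-forget event publishing: the request path only
    enqueues; a background worker drains the queue and ships batches, so
    the hot path never blocks on the broker."""

    def __init__(self, send_batch, batch_size=100):
        self.q = queue.Queue()
        self.send_batch = send_batch  # stand-in for a Kafka producer call
        self.batch_size = batch_size
        threading.Thread(target=self._drain, daemon=True).start()

    def publish(self, event):
        # Called on the request path: just serialize and enqueue, O(1).
        self.q.put(json.dumps(event))

    def _drain(self):
        while True:
            batch = [self.q.get()]  # block until at least one event
            while len(batch) < self.batch_size:
                try:
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break
            self.send_batch(batch)  # broker latency is paid here, off-path
```

A production version would additionally bound the queue and handle broker errors; the point here is only the decoupling of request handling from delivery.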
• Use specific upstream servers based on rules (e.g. confidence test)
• Allow static load balancing over data centers for stateful applications
• Allow TCP connection re-use, using pools
• Integration with monitoring module to allow monitoring via MCR
Load balancing module
50
• Read variables from other modules to monitor
• Create and expose variables with system resources to monitor
• Use UDP or TCP to transfer monitor data to Graphite
• Integration with Tivoli Omnibus to allow monitoring via MCR
Monitoring module
51
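The Graphite transfer in the third bullet is simple to show, since Graphite's plaintext protocol really is one line per sample: `<metric.path> <value> <unix_timestamp>\n`. The metric path and addresses below are placeholders.

```python
import socket
import time

def send_metric(sock, addr, path, value, ts=None):
    """Ship one sample to Graphite over UDP. UDP keeps the proxy's hot
    path non-blocking: a lost sample is cheaper than a stalled worker."""
    line = "%s %s %d\n" % (path, value, ts or int(time.time()))
    sock.sendto(line.encode(), addr)

if __name__ == "__main__":
    # Loopback receiver standing in for a Graphite/carbon daemon.
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(("127.0.0.1", 0))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_metric(sock, recv.getsockname(), "nginx.dc1.requests", 512, ts=1443000000)
    print(recv.recv(1024).decode().strip())
    # prints: nginx.dc1.requests 512 1443000000
```

In the module, the same line format is emitted from C; TCP is an option when delivery matters more than latency.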
Monitoring example
52
• Add WAF modules
• Fully implement dynamic service discovery to add/remove URIs and upstream servers on the fly
• Implement cross datacenter persistency for cookie jar
Future roadmap of the NGINX authenticating proxy environment
53
• Remove manual work in development and testing ASAP
• NGINX has a lot of configuration optimization possibilities:
 – TCP socket/TCP options, caching, connection re-use, JIT, threads, upstream zone, buffer settings, timeouts
• In our own modules:
 – Use shared memory for session state (if needed), RB trees, thread pools, timers and the event queue
 – Use atomic reference counters over shared mutex locks if possible
 – Use variables to pass data between modules
• In NGINX modules:
 – Compression of content is CPU-expensive!
 – Cookie lookups in modules are potentially CPU-expensive
 – CRC32 is potentially CPU-expensive
 – If using symmetric crypto, use types supported by the CPU (AES-NI), like AES GCM/CTR
Lessons learned so far…
54
• Older stacks require more work to fully use all configuration options:
 – Recompiled a new GCC C compiler for strong stack protector and CPU optimization options
 – Recompiled libz and statically linked the latest version, adding the Intel performance patches
 – Recompiled libpcre and statically linked the latest version for JIT, using CPU-optimized flags
 – Recompiled other libs which are not present in RHEL, using CPU-optimized flags
• Make monitoring highly configurable per site and fine-tune over time
• Use good monitoring dashboards:
 – The combination of Graphite and Grafana works very well
 – Test which log data in error.log is required for good root-cause analysis if an error occurs
• Take enough time to test:
 – Performance tests under stress load with tools like “perf” give a lot of insight
 – Invest enough time in resilience tests and in deciding what key data is needed to monitor your system
 – All code which involves shared memory, locks, timers and configuration reloads takes more time to get right
Lessons learned so far…
55
And… NGINX is very fast, very efficiently coded and extremely fun to program for!
Lessons learned so far…
56
Questions?
E-mail: bart.warmerdam@ing.nl
And...
57
The opinions expressed in this publication are based on information gathered by ING and on sources that ING deems reliable. This data has been processed with care in our analyses. Neither ING nor employees of the bank can be held liable for any inaccuracies in this publication. No rights can be derived from the information given. ING accepts no liability whatsoever for the content of the publication or for information offered on or via the sites. Author rights and data protection rights apply to this publication. Nothing in this publication may be reproduced, distributed or published without explicit mention of ING as the source of this information. The user of this information is obliged to abide by ING's instructions relating to the use of this information. Dutch law applies.
www.ing.com
Disclaimer
58