Ecetera uses Splunk to facilitate DevOps in forex
TRANSCRIPT
The Customer: The trading division of one of Australia’s big 4 banks
Objectives & Deliverables:
● Reduce IT overheads by integrating the FX & FI platforms.
● Provide an e-trading platform for internal and external use, with a browser-based, custom user interface (UI).
● A unified view for Business & IT Operations providing real-time BI & actionable business analytics.
● Establish real-time monitoring of trading activity & underlying technology, supporting issue resolution & analysis.
Monitoring & Analysis Targets
Business Transactions: Functions | Actors | Flows
Technology: Apps | App Infra | Integration | Servers/Storage | Comms
Customer Project
Monitoring & Alerting
Status Dashboards & Query
Historical Analysis
Support & Incident Management
Investigation & Resolution
Project for Business: Client & Channel Monitoring, Business Function & Flow Monitoring, User Support Case Mgmt, Event & Transaction Investigation, Business Performance Analysis
Project for Technology: Application & Integration Monitoring, Infrastructure Status Monitoring, Technology Incident Case Mgmt, Technology Investigation, Fix/Test Support, Technical Performance Analysis
Functions
Inputs
Business Benefits
● Identify “stuck” trades, each worth millions of dollars
● Identify potential system impacts to trades
● Quickly identify all the involved parties and details of a trade
Enablement of BizOps and DevOps
● Business Operations can see into IT systems
● IT Operations can see business impacts
Faster feedback on development and testing
● Bugs identified in SIT and Staging environments
Splunk as a Solution for Client Project:
Example: A BUY order for $5,000,000 AUD/HKD at a rate of 7.20354 has taken more than 5 seconds to clear the booking system.
Flag as RED and drill into the transaction.
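The example above can be sketched as a simple threshold check. This is only an illustration in Python, not the Splunk search used on the project; the function name, the argument names and the "OK" label are assumptions, since the slide only names the RED flag and the 5 second limit.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the slide: more than 5 seconds to clear booking is RED.
CLEAR_THRESHOLD = timedelta(seconds=5)


def booking_flag(submitted_at: datetime,
                 cleared_at: Optional[datetime],
                 now: datetime) -> str:
    """Return "RED" when a trade has taken more than 5 seconds to clear.

    A trade that has not cleared yet is measured up to `now`.
    "OK" is a hypothetical label for everything that is not RED.
    """
    end = cleared_at if cleared_at is not None else now
    return "RED" if end - submitted_at > CLEAR_THRESHOLD else "OK"
```

A trade submitted at 09:00:00 that is still uncleared at 09:00:06 would be flagged RED and become a candidate for drill-down.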
Who Uses Splunk? Everyone!
Who >
For What >
Challenges: Constraints for the Solution
● Had to use Simple XML; the solution needed to be accessible to bank developers and business operations
● Few moving parts (initially no Nagios or other products)
● Performance: needed as little page reloading as possible
● Initially a very small deployment to test out the technology
Requirements for Splunk
● Real-time views and alerting
● Environment-aware Service Model
Business Ops and IT Ops
Business Flows
● Login
● Credit Check
● Deal Capture
● Reference Data
● Price Distribution
IT Components
● Apache WebServer
● Apache Tomcat
● WebStreaming
● FX Trading Core
● Integration Server
● Credit
● Rates Adaptor
● Cache
● DB
● RedHat Linux
● Network/Storage
Business Process Status Flows
Business flows relate directly to system components
Client/User Login Processes
Pricing / Reference Data
Deal Capture / STP
Credit Check
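The idea that business flows relate directly to system components can be sketched as a lookup table. The rows below are purely illustrative: the slides do not state which component serves which flow, so `SERVICE_MODEL` is a hypothetical mapping, and deriving a flow's status as the worst status of its components is an assumption rather than the project's documented rule.

```python
# Hypothetical lookup rows (business_flow, component).
# The real mapping is not given in the slides.
SERVICE_MODEL = [
    ("Login", "Apache WebServer"),
    ("Login", "Apache Tomcat"),
    ("Credit Check", "Credit"),
    ("Deal Capture", "FX Trading Core"),
    ("Deal Capture", "Integration Server"),
    ("Price Distribution", "Rates Adaptor"),
    ("Price Distribution", "WebStreaming"),
]

# Higher number means worse status.
SEVERITY = {"Up": 0, "Degraded": 1, "Down": 2}


def flow_status(flow: str, component_status: dict) -> str:
    """Return the worst status among the components a flow depends on."""
    components = [c for f, c in SERVICE_MODEL if f == flow]
    if not components:
        raise KeyError(f"unknown business flow: {flow}")
    statuses = [component_status[c] for c in components]
    return max(statuses, key=SEVERITY.__getitem__)
```

Keeping the mapping in a flat table mirrors how a lookup file would hold it, which makes it straightforward to edit without touching the logic.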
Business Ops & Support Dashboard
Trade Search | Process Status | In-Flight Trades | Rate Updates | In-Flight Trade Detail | Trade Detail/Search Results
Trade Search
• Trade search allows you to search for any trade booked
• Search period will be limited by data capture vs. storage space
• Estimated to be 4+ years based on testing estimates (1.37 GB per day compressed to 400 MB on a 500 GB index)
• You can search for trades on:
– ID, e.g. XXX300614-0926474596
– Price (All-in Rate), e.g. 0.94500
– Currency Pair, e.g. AUDUSD or AUD/USD (drop-down selection pre-populated with the last 30 days' worth of values)
– Client WID (Legal Entity), e.g. 5100230
– Search Period (default Today → driven by server location → London Time)
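The retention figures quoted for trade search can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 1000 MB) and that the whole 500 GB index is available for this data. The slide does not say which day count was used, so both calendar days and an assumed 260 trading days per year are shown.

```python
INDEX_CAPACITY_MB = 500 * 1000   # 500 GB index, assuming 1 GB = 1000 MB
RAW_MB_PER_DAY = 1370            # 1.37 GB of raw data per day
COMPRESSED_MB_PER_DAY = 400      # on-disk size after compression


def retention_days(capacity_mb: float, daily_mb: float) -> float:
    """Number of days of data that fit in the index."""
    return capacity_mb / daily_mb


days = retention_days(INDEX_CAPACITY_MB, COMPRESSED_MB_PER_DAY)
calendar_years = days / 365      # data written every calendar day
trading_years = days / 260       # assumption: data written on ~260 trading days a year
compression_ratio = COMPRESSED_MB_PER_DAY / RAW_MB_PER_DAY

print(f"{days:.0f} days of data")
print(f"{calendar_years:.1f} years at 365 days/year")
print(f"{trading_years:.1f} years at 260 days/year")
print(f"compressed size is {compression_ratio:.0%} of raw")
```

Under these assumptions the index holds 1250 days of data, which is about 3.4 years if data arrives every calendar day and about 4.8 years if it arrives only on trading days.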
Component Status
• Provides a high-level overview of the health of all StarXchange components:
– Apache HTTPD [3 node cluster]
– Tomcat [3 node cluster]
– Frontend (Web streaming) [3 node cluster]
– Core [3 node cluster]
– Backend (Integration) [3 node cluster]
– Credit [Single instance failover across 3 servers]
– Rates Adaptor [Single instance failover across 3 servers]
– Cache [2 node cluster]
– DB
• Clicking on any of the processes will take you to the Tech Dashboard, which provides more details about the process status
• Status
– Up = all nodes for a given process are up and running
– Degraded = 1 or 2 of the nodes are down for a given process (except ESB → Degraded only if 1 process is down)
– Down = none of the nodes are running
– Exceptions are Credit and Pricing Adaptor → these are single node so will only show Up or Down status
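The status rules above can be sketched as a small function. This is an illustration in Python rather than the Splunk logic used on the project; the function and argument names are assumptions, and the ESB exception is deliberately not modelled because the slide does not say what status applies when more than one ESB process is down.

```python
def component_status(nodes_up: int, nodes_total: int) -> str:
    """Map node counts for one process to Up, Degraded or Down.

    Single-instance components (nodes_total == 1) can only ever
    return Up or Down, matching the exception noted on the slide.
    """
    if nodes_total <= 0:
        raise ValueError("nodes_total must be at least 1")
    if not 0 <= nodes_up <= nodes_total:
        raise ValueError("nodes_up must be between 0 and nodes_total")
    if nodes_up == nodes_total:
        return "Up"
    if nodes_up == 0:
        return "Down"
    return "Degraded"
```

For a 3 node cluster, losing one or two nodes gives Degraded; for the 2 node Cache cluster, losing one node gives Degraded.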
Rate Updates
• FIX logs are consumed from the application by Splunk; these logs contain a message for every rate update
• Rate updates are ordered by default by the seconds since the last rate change
• All currency pairs are shown by default
• Rate updates captured denote if a rate is dealable or non-dealable
• Green = Last rate update 15 seconds ago or less
• Orange = Last rate update >15 seconds and <30 seconds ago
• Red = Last rate update detected >30 seconds ago
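The colour thresholds above can be sketched as follows. The function name is an assumption, and so is the handling of exactly 30 seconds: the slide leaves that boundary unspecified, and this sketch treats it as Orange.

```python
def rate_colour(seconds_since_update: float) -> str:
    """Colour for a currency pair based on the age of its last rate update."""
    if seconds_since_update < 0:
        raise ValueError("age of the last update cannot be negative")
    if seconds_since_update <= 15:
        return "Green"
    if seconds_since_update <= 30:   # assumption: exactly 30 s counts as Orange
        return "Orange"
    return "Red"
```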
Inflight Trades
• Real-time search that displays all booked trades that have not received an Execution Report back
• In theory this panel should be empty at all times
• Any trades that appear within this view should be manually checked to ensure STP of Risk Capture
• Possible reasons why a trade might appear within this view:
– Queues in between are down
– Integration Backend is down
– Booking systems are down
– Deal might have been captured in the Dealing system but the Execution Report was not received by the system to confirm booking
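The in-flight check described above amounts to finding booked trades with no matching Execution Report. A minimal Python sketch follows; the event shape (`trade_id` and `type` keys, with `"BOOKED"` and `"EXECUTION_REPORT"` values) is hypothetical and does not reflect the actual FIX fields or Splunk search used on the project.

```python
from typing import Iterable, List


def in_flight_trades(events: Iterable[dict]) -> List[str]:
    """Return IDs of booked trades that have no Execution Report yet.

    IDs are returned in the order the bookings were first seen.
    """
    booked = {}          # trade_id -> first booking event, in arrival order
    confirmed = set()    # trade_ids that received an Execution Report
    for event in events:
        if event["type"] == "BOOKED":
            booked.setdefault(event["trade_id"], event)
        elif event["type"] == "EXECUTION_REPORT":
            confirmed.add(event["trade_id"])
    return [trade_id for trade_id in booked if trade_id not in confirmed]
```

An empty result corresponds to the empty panel the slide describes as the normal state.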
Tech Dashboard
Active Connections Count from respective component
JVM Status for respective process
Process Status (same as eTeam Dashboard)
Disk Space Usage per server being monitored
CPU Utilisation per server being monitored
Data Extraction for Status & Events
Business Events: FX transactions
Service Events: Infra and application notifications
Service Status: Polled status of system components
Splunk JMX Agent
Splunk Unix Pack
Splunk Forwarder
Log monitoring
Network: Network quality and performance for ingress and egress connections.
Hardware: Physical machine health and performance metrics of servers and storage.
OS: Metrics and events such as CPU, memory and storage.
Process: Status of the OS process running the monitored component.
Framework and Runtime (JVM): Events and metrics of the Java Virtual Machine such as threads and garbage collection.
Application: Events and metrics which relate directly to processing of business transactions.
Monitoring Layer: This graphic shows the Splunk coverage over the monitored layers of system components and supporting infrastructure. (For M1 only)
There are 3 types of messages sent to Splunk:
1) Business Events
2) Service Events
3) Service Status
These messages are generated by:
1) Consumed log files
2) Process and OS monitoring
3) JMX agent monitoring
Splunk Stream
Learnings
Splunk
● Creation of a lookup-based service model (this will be moved to a CMDB)
● Developed a small AngularJS app to expose some widgets
Splunk Extensions:
● Java agent customized to run standalone with a plugin system, used to scrape JMX
● Lookup Editor used to easily edit business alerts
Adoption and Integration into DevOps
● Automated deployment of Splunk forwarders and Splunk servers via Chef
● Splunk apps are fully managed in a git repo and binaries are distributed via Artifactory
● Main Splunk app is packaged with a Vagrantfile, eventgen samples and development settings, and can fully replicate production
Future: Expansion to other use cases, broaden scope
Troy Bebee is a Managing Consultant at Ecetera, and was the lead consultant on this engagement. With over 12 years' experience working directly with Telco and Banking IT teams, Troy is a highly regarded Application Performance Management & DevOps specialist.
@trizow
Our mission is to rid the world of badly behaving applications and sites.
We measure and monitor the performance and availability of enterprise applications.
We diagnose the source of performance issues and provide solutions to improve application functionality.