Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
aperkins decides Red Hat is her favorite distro
aperkins FINALLY joins Red Hat!
CEO JIM WHITEHURST
#1 OPEN SOURCE LEADER
What We Do
We offer a range of mission-critical software and services covering:
• Flexibility
• Faster technology innovation
• Better quality
• Better price/performance
• Long-term deployment
• Better security-assurance
• Shared development: Accelerated innovation
• Open collaboration: Products that meet customer needs
About Red Hat IT
Who we are:
• Global team of ~290 associates
What we do:
• Partner with teams across Red Hat
• Strive to be corporate leaders and “Customer One”
• Provide value to both our internal and external customers
About Red Hat IT
Our Vision: Our Mission:
To be a service-driven information technology organization and a trusted business partner, delivering flexible, effective solutions for our customers.
To be a world-class information technology organization and a beacon for the implementation of open source and cloud solutions.
We invest in open source!
We strive to be Customer One
About Me
Alison Perkins
Senior Systems Engineer, Red Hat IT
IT Enablement Tower - responsible for designing, deploying, and ensuring availability and performance of both customer-facing and internal platforms
Life Before Splunk
• Insight gathering was very manual and took a long time
• To get information, people had to SSH into boxes to grep logs
• Time to resolution of issues measured in days or weeks
• No single place to access and visualize machine data
• Correlation across disparate data sources was complex
devopsreactions.tumblr.com :)
Life Before Splunk
ProdOps Engineer says:
"You have not truly experienced production-support horror until you have to find the single error experienced by a single (angry) customer from one of many possible logs...on each of many load-balanced machines...tracing a customer's transaction through the layers of a SOA architecture... all the way through to the backend business database."
“The memories of pre-Splunk are forever burned into my brain.. Instead of PTSD, maybe I should call it PSD, for Pre-Splunk Debugging!”
Splunk at Red Hat, v.1.0
Initial deployment in June 2012
Splunk 4.3.2
Scope was limited, particular environments and use cases
Just a few apps: Search, SoS, Cisco, *nix
IT Operations teams had access
ZOMG! I Can Haz Splunk!
• Splunk became very popular :)
  - Started with about 20 users in 2012
• Gradual expansion of:
  - Hosts
  - Data sources
  - Sourcetypes
  - Users
• Started with syslog data, web logs, network device logs
• Expanded to include more sources, more Splunk Apps
Splunk at Red Hat
Over 400 people have Splunk access – not just Operations!
Who uses Splunk?
• Platform Operations
• InfoSec
• Enterprise Architecture
• Systems Engineering
• IT Engineering
• Identity & Access Management
• Global Support Services Developers
• IT Management
• …even some groups outside of IT :)
Operational Insights
• Incident troubleshooting
• Anomaly detection in production environments
• Correlate data from numerous systems – Nagios, Apache, NetApp, LDAP, JBoss, Sendmail
Production Support Engineer says:
“Dump all the logs into Splunk, and it starts looking like One Big System, instead of a bazillion teeny ones that hate each other.”
Transactions allow him to find what he needs in minutes, not hours.
Operational Insights
index=nagios sourcetype=nagios "SERVICE NOTIFICATION"
    NOT notification_state=ACKNOWLEDGEMENT*
    notification_dest=opsteam-alerts
| transaction notification_host,notification_type
    startswith=(notification_state=CRITICAL OR notification_state=WARNING)
    endswith=(notification_state=OK)
| chart count by notification_type | sort -count | head 25
Top 25 Alert Types (last 7 days)
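For readers less familiar with the transaction command, the grouping idea behind the search above can be sketched in plain Python. This is a simplified model, not Splunk's actual transaction semantics: it assumes an incident for a (host, type) pair opens on a CRITICAL or WARNING notification and closes on the next OK for the same pair, and it counts only closed incidents. The function name, sample hosts, and sample alert types are hypothetical; only the field names come from the search.

```python
from collections import Counter


def count_resolved_alerts(events, limit=25):
    """Count incidents that opened and later recovered, per notification type.

    Simplified model (an assumption, not Splunk's exact behavior): an incident
    for a (host, type) pair opens on CRITICAL or WARNING and closes on the
    next OK for the same pair.
    """
    open_incidents = set()
    counts = Counter()
    for event in sorted(events, key=lambda e: e["_time"]):
        key = (event["notification_host"], event["notification_type"])
        state = event["notification_state"]
        if state in ("CRITICAL", "WARNING"):
            # Repeated notifications while an incident is open are merged.
            open_incidents.add(key)
        elif state == "OK" and key in open_incidents:
            open_incidents.remove(key)
            counts[event["notification_type"]] += 1
    # Equivalent in spirit to: chart count by type | sort -count | head 25
    return counts.most_common(limit)


def _event(time, host, alert_type, state):
    """Build one hypothetical notification event."""
    return {
        "_time": time,
        "notification_host": host,
        "notification_type": alert_type,
        "notification_state": state,
    }


# Hypothetical sample events, not real Nagios data.
sample_events = [
    _event(1, "web01", "PING", "CRITICAL"),
    _event(2, "web01", "PING", "OK"),
    _event(3, "db01", "DISK", "WARNING"),
    _event(4, "db01", "DISK", "CRITICAL"),
    _event(5, "db01", "DISK", "OK"),
    _event(6, "web02", "PING", "CRITICAL"),
    _event(7, "web02", "PING", "OK"),
    _event(8, "web01", "PING", "CRITICAL"),  # still open, so not counted
    _event(9, "app01", "LOAD", "OK"),        # recovery with no open incident
]

top_alerts = count_resolved_alerts(sample_events)
print(top_alerts)
```

The point of grouping first is that a flapping check produces one row per incident rather than one row per notification, which is what makes the "top alert types" ranking meaningful.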
Security Insights
• Threat and anomaly detection in production environments
• Correlate data from numerous systems
Development Insights
• Real-time dashboards show error rate in production and impact of pushing new builds
• Developers can search and visualize web logs and Java logs, without production access
• Alerts let developers know as soon as a problem arises
Development Insights
Development manager says:
<chris> I can tell you about our FY14 goal!
<chris> which we WAY EXCEEDED!
Chris's team was able to reduce their application's error rate by 2 orders of magnitude in weeks, not months (just 2 sprints!)
Developers Say:
“We recently caught an exception in upstream code as soon as it merged with our code, using one of our standing ‘search plus email’ alerts.”
“We check our Splunk dashboard every morning. At a glance, we can see response times, response codes, whether all our hosts are pulling their weight, and which customer application calls are taking longest.”
Development Insights
Developer wrote instrumentation to log client-side JavaScript and jQuery errors. His ClientLogger code unlocked the potential; then he used Splunk to show the impact. From his blog:
“In the first ~24 hours of operation we had 330,000+ ERROR events logged!
..in some cases we can fix one line of JS code and do away 30,000 errors.
...after just a few days of work, we have reduced the daily error total by about 1/3.”
Open Source in Action
Two Red Hat IT developers wrote browser plugins to work with Splunk
Pop-up display of beautifully indented, syntax-highlighted JSON
https://github.com/mwcz/splunk-json-formatter
Upgrade to Splunk v.6
• Upgrade to Splunk 6, February 2014
  – 2 search heads, 5 indexers
  – 1 admin server (Deployment Server, License Manager)
  – 1 utility server (Hydra modular inputs)
Cloud Platform Visibility
• CIO goal to move 70% of applications to the cloud in the next 18 months*
• Open Hybrid Cloud – both public and private
• Many teams in process of re-tooling applications to support both traditional on-premise deployments and cloud-based deployments
• Using Splunk to increase visibility into cloud environments’ price/performance
“Cloud has become the default choice for most of Red Hat’s new applications.”
~ Lee Congdon, Red Hat CIO
*source: http://diginomica.com/2014/02/11/leveraging-cloud-extend-service-management-business/
Cloud Platform Visibility
• Cloud platforms challenge us to answer questions that go:
  – across our infrastructure and organizational structures
  – through the stack, with drill down by ownership and function
• IT Operations Teams and IT Management use Splunk App for AWS to support cloud efforts
  – AWS Billing data
  – Performance and Security data from CloudTrail
  – Also using Splunk for traditional machine data from instances
IaaS Monitoring
Splunk App for AWS Billing and CloudTrail data
Example images from apps.splunk.com
Splunk App for AWS
Presents the actual data – not just projections – across all subaccounts
Validate – or maybe challenge – our assumptions
The boss loves it :)
AWS App Challenges
• Some challenges at the outset
• Not as simple as “just install this tarball”
• New AWS services/setup required to get started
• We figured it out :) (thanks to helpful suggestions in the code!)
• Reached out to Splunk with feedback on improvements
• Splunk was receptive, changes incorporated in new version of app
Cloud Visibility
During the launch of the new redhat.com, Splunk helped:
• Detect a problem that spanned multiple layers of the application stack and cloud infrastructure
• Track the problem down and determine root cause
• Help developers identify a temporary fix/workaround
• Confirm the permanent fix once it was applied
Web Developer says:
“Being able to sculpt my own dashboard of reports and share it with others has been incredibly helpful in empowering my team to troubleshoot problems.”
Cloud Visibility
Fun with Pretty Graphs
index=rh_apache host=i-*vary* source=*error_log* ServicePhase=Prod ServiceName=Cms
| rex field=_raw "(?i)^[^\]]*\]\s+\[(?P<msg_level>[^\]]+)"
| timechart span=1h usenull=f count by msg_level
| eval acceptable=50
| eval elevated=250
| eval BAT_SIGNAL=1000
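To see what the rex step in the search above pulls out of each event, the same regular expression can be tried in Python, whose re module accepts the same (?P<name>...) named-group syntax. This is only an illustrative sketch: the sample log lines are hypothetical and assume the classic Apache error_log layout of a bracketed timestamp followed by a bracketed level.

```python
import re
from collections import Counter

# Same pattern as the rex command: skip the first [...] block (the timestamp),
# then capture the contents of the second [...] block as msg_level.
MSG_LEVEL = re.compile(r"(?i)^[^\]]*\]\s+\[(?P<msg_level>[^\]]+)")


def extract_msg_level(line):
    """Return the captured msg_level, or None when the line does not match."""
    match = MSG_LEVEL.match(line)
    return match.group("msg_level") if match else None


# Hypothetical error_log lines, not taken from the presentation.
sample_lines = [
    "[Wed Oct 15 14:32:52 2014] [error] [client 192.0.2.10] File does not exist: /var/www/html/favicon.ico",
    "[Wed Oct 15 14:33:10 2014] [warn] child process 1234 still did not exit",
    "[Wed Oct 15 14:35:01 2014] [error] [client 192.0.2.11] request failed",
    "a line with no bracketed fields",
]

levels = [extract_msg_level(line) for line in sample_lines]
level_counts = Counter(level for level in levels if level is not None)
print(level_counts)
```

In the search itself, timechart then counts these levels per hour, and the three eval statements add constant-valued fields (50, 250, 1000) to every row so they can be drawn as horizontal threshold lines on the chart.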
Fun with Pretty Graphs
<panel>
  <chart>
    <title>CMS Apache Errors by Type (past 3 days)</title>
    …
    <option name="charting.chart.overlayFields">acceptable,elevated,BAT_SIGNAL</option>
    <option name="charting.fieldColors">{"BAT_SIGNAL":0xFF0000,"elevated":0xFFFF00,"acceptable":0x73A550}</option>
    …
  </chart>
</panel>
Cloud Visibility – A Drama in IRC
😏
😫
😠 😤
😧
😒
<mr_cowboy> what the..
<mr_cowboy> somebody’s messing with the security group settings again!!
<da_boss> what do you mean, “somebody?”
<da_boss> not one of you?
<mr_cowboy> i dunno, but it’s a real mess
<da_boss> can’t you find out? doesn’t cloudtrail keep track of that?
<mr_cowboy> i don’t have time for that right now, i’ve just got to figure out how bad it is..
** miz_data goes to the splunk app for AWS!
<miz_data> hey mr_cowboy, all the recent activity is associated with your userid: http://my.splnk/shared_srch1001
😡
Cloud Visibility – A Drama in IRC
😯😰
😳
😆😎
<mr_cowboy> what?!?!
<mr_cowboy> did somebody hack into my account??!!
<miz_data> everything i see is coming from your usual IP, in your city: http://my.splnk/shared_srch1002
<da_boss> ohhh, cowboy, you got some ‘splaining to do?
<mr_cowboy> well, uh.. what about this SG ID? s-1203987234
** miz_data searches..
<miz_data> okay, i got exactly one result: http://my.splnk/shared_srch1003
<mr_cowboy> sonofa…
<mr_cowboy> sorry folks, my bad.. i’ll get right on a fix.
<da_boss> lol, thanks miz_data
<miz_data> np :)
😌
😜
😊
😪
😕
Growing Demand
Over the past 6 months…
               March 2014     September 2014
Users          322            417
Roles          23             31
Forwarders     608            1000+
Data volume    ~250 GB/day    ~350 GB/day
Plus, we have eight additional teams interested in Splunking new data sources!
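The six-month growth can be put in percentage terms with a little arithmetic. The sketch below assumes the smaller set of figures (322 users, 23 roles, 608 forwarders, ~250 GB/day) is March 2014 and the larger set is September 2014, treats "1000+" forwarders as a lower bound of 1000, and takes the approximate daily volumes at face value. The dictionary keys and function name are illustrative only.

```python
# Figures from the slide; "1000+" is treated as a lower bound of 1000 and the
# GB/day values are approximate, so the results are rough estimates.
march_2014 = {"users": 322, "roles": 23, "forwarders": 608, "gb_per_day": 250}
september_2014 = {"users": 417, "roles": 31, "forwarders": 1000, "gb_per_day": 350}


def percent_growth(before, after):
    """Percentage increase for each metric, rounded to one decimal place."""
    return {
        name: round(100 * (after[name] - before[name]) / before[name], 1)
        for name in before
    }


growth = percent_growth(march_2014, september_2014)
for name, pct in growth.items():
    print(f"{name}: +{pct}%")
```

On these assumptions, forwarders grew fastest (at least 64.5%), followed by daily indexing volume (about 40%), roles (about 34.8%), and users (about 29.5%).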
Looking Ahead with Splunk
• Splunk for pre-production environments
• Splunk to support Continuous Integration and Continuous Deployment efforts
• Pull performance data from Splunk and combine with other sources via Splunk REST API
• Building Splunk Apps with the Splunk Web Framework Toolkit
• Exciting custom visualizations with D3.js
Tiered Storage Approach
Storage tiering enables longer data retention at lower cost
Longer data retention is important, because it allows us to:
1. Answer our customers’ long-term business trending questions
2. Enable pattern-matching across longer time windows
3. Search strategically, not just tactically
We want to scale our compute costs (indexers) independently from our storage capacity costs
Independent scaling is important, because it allows us to:
1. Invest in performance for the most-recent data
2. Gracefully handle unexpected indexing growth
3. Develop a roadmap for handling growth without forklifting
Indexer Storage Options
We survived for over two years on direct-attached storage only..
Then, we added NFS-based external storage for our cold buckets..
Next, we plan to use Red Hat Storage to house our archived frozen data. (Maybe cold, as well!)
Red Hat Storage for Cold?
With Red Hat Storage, we can:
• Continue to use best-performing costly/limited DAS for hot/warm data
• Use good-performing affordable storage for cold/frozen data
• Simplify capacity planning: growth of cold data managed via a single RHS volume
• Expand cold storage in a non-disruptive and transparent way
Independent benchmark results with SplunkIt show comparable performance to other NAS platforms, at 10% of the cost!
Lab results with Splunk’s SBK show Red Hat Storage performs as well as or better than local DAS, particularly for long-tail “rare” searches!
Red Hat Storage
Get the whitepaper!
“When we compared these SplunkIt results to published results of an eight-node EMC Isilon X400 storage solution, we found that Red Hat Storage achieved comparable performance in terms of both throughput and search time, running on just two IBM x-series servers, costing significantly less.”
SPLUNK ENTERPRISE ON RED HAT STORAGE SERVER 2.1, MAY 2014
A PRINCIPLED TECHNOLOGIES TEST REPORT
Commissioned by Red Hat, Inc.
A Little Admin-to-Admin Advice
• The Admin App Trifecta:
  – SoS, Deployment Monitor, Fire Brigade
• Deployment Server / Forwarder Manager <3
• Seeing is believing! Be willing to give demonstrations
• Educate your users about efficient searches
• Think outside the ‘timechart’
• Talk to your developers about logging best practices
  – Splunk is magical in many ways, but not a mind-reader :)
Things I Wish I Knew Then
• http://wiki.splunk.com/Things_I_wish_I_knew_then
• Plan your data retention strategy; revisit as your needs change
• Pay attention to your indexing growth, bucket policy
• Unexpected increase in rolls to frozen buckets == Λ
• Splunk is wonderful and magical in many ways, but not a TARDIS
Splunk> Not a TARDIS. Yet.
Image credit: Steve Gibson
Love for Splunk
“Make this line go down!”
“Splunk is a technology that only gives me good surprises.”
“It's not my fault!”
“Not for long, anyway!”
Sysadmins love Splunk: Trusted Troubleshooting Tool
Engineers love Splunk: Stop the Blame Game
Managers love Splunk: Visualization == Motivation
Results with Splunk
Visibility into Cloud Deployments
• Proactively monitor costs, enabling better budget planning
• Gain insights into performance and reliability of workloads moved to the cloud
• Enable detailed security audits
Improved Code Quality
• Quickly validate code pushes to production
• Ensure changes don’t negatively impact performance or UX
• Engineers have access to real-time production data
Reduced Alert Noise
• Reduce the number of spurious pages from monitoring systems
• Combat alert fatigue among sysadmins
• Well-rested (happy?) sysadmins have fewer “oops” moments