system insight without interference
DESCRIPTION
Talk at Wordnik HQ about how to monitor application performance and business goals without intrusive engineering work on your core product.
TRANSCRIPT
Insight without Interference
Monitoring with Scala, Swagger, MongoDB and Wordnik OSS
Tony Tam
@fehguy
Nagios Dashboard
Monitoring?
IT Ops 101
Host Checks
System
Load
Disk Space
Network
Monitoring?
Necessary (but insufficient)
Why Insufficient?
•What about Services?
• Database running?
• HTTP traffic?
•Install Munin Node!
• Some (good) service-level insight
Your boss LOVES charts
“OH pretty colors!”
“up and to the right!”
“it MUST be important!”
Good vs. Bad?
•Database calls avg 1ms?
• Great! DB working well
• But called 1M times per page load/user?
•Most tools are for system, not your app
•By the time you know, it’s too late
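To make the "1ms average" trap concrete, here is a minimal arithmetic sketch using the two numbers from the slide. The object and method names are hypothetical and exist only for illustration.

```scala
// Hypothetical helper: total database time spent per page load.
object PerCallVsPerPage {
  def totalMillis(avgCallMillis: Double, callsPerPage: Long): Double =
    avgCallMillis * callsPerPage

  def main(args: Array[String]): Unit = {
    // 1 ms per call looks healthy, but 1M calls per page load adds up.
    val total = totalMillis(1.0, 1000000L)
    println(s"DB time per page load: ${total / 1000} s")
  }
}
```

A per-call average says nothing about how often the call is made, which is why the counts matter as much as the timings.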
Need business metrics monitoring!
Enter APM
•Application Performance Monitoring
•Many flavors, degrees of integration
• Heavy: transaction monitoring, code performance, heap, memory analysis
• Medium: home-grown profiling
• Light: digest your logs (failure forensics)
•What you need depends on architecture, business + technology stage
APM @ Wordnik
•Micro Services make the System
Monolithic application
API Calls are the unit of work!
Monitoring API Calls
•Every API must be profiled
•Other logic as needed
• Database calls
• Connection manager
• etc...
•Anything that might matter!
How?
•Wordnik-OSS Profiler for Scala
• Apache 2.0 License, available in Maven Central
•Profiling an arbitrary code block:
import com.wordnik.util.perf.Profile
Profile("create a cat", {/* do something */})
•Profiling an API call:
Profile("/store/purchase", {/* do something */})
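For readers who have not seen this style of profiler, the following is a minimal sketch of what a call such as Profile("name", { ... }) might do internally: time the block, aggregate by name, and return the block's result. MiniProfile and Stats are hypothetical names; this is not the Wordnik implementation, only an illustration of the idea.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a profiler; not the Wordnik OSS code.
object MiniProfile {
  // Aggregated stats per name: number of calls and total elapsed nanoseconds.
  final case class Stats(count: Long, totalNanos: Long) {
    def avgMillis: Double = if (count == 0) 0.0 else totalNanos / 1e6 / count
  }

  private val stats = mutable.Map.empty[String, Stats]

  // Runs the block, records its duration under `name`, and returns its result.
  def apply[T](name: String, block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsed = System.nanoTime() - start
      stats.synchronized {
        val s = stats.getOrElse(name, Stats(0L, 0L))
        stats(name) = Stats(s.count + 1, s.totalNanos + elapsed)
      }
    }
  }

  // Immutable copy of the current aggregates.
  def snapshot: Map[String, Stats] = stats.synchronized(stats.toMap)
}
```

Because the block is a by-name parameter, wrapping existing code only requires adding the name and the braces; the wrapped expression keeps its original type.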
Profiler gives you…
•Nearly free*** tracking
•Simple aggregation
•Trigger mechanism
• Actions on time spent “doing things”:
Profile.triggers += new Function1[ProfileCounter, Unit] {
  def apply(counter: ProfileCounter): Unit = {
    // Raise the alarm when a "getDb" call exceeds 5000 (assumed to be milliseconds)
    if (counter.name == "getDb" && counter.duration > 5000)
      wakeUpSysAdminAndEveryoneWhoCanFixShit(Urgency.NOW)
  }
}
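The trigger mechanism is essentially an observer list that is consulted after each profiled call. A minimal, dependency-free sketch follows; Counter, TriggerRegistry and TriggerDemo are hypothetical names, and the millisecond unit is an assumption.

```scala
import scala.collection.mutable

// Hypothetical types; they only illustrate the trigger idea.
final case class Counter(name: String, durationMillis: Long)

class TriggerRegistry {
  private val triggers = mutable.ListBuffer.empty[Counter => Unit]

  def register(t: Counter => Unit): Unit = { triggers += t; () }

  // Called after each profiled block completes.
  def fire(c: Counter): Unit = triggers.foreach(t => t(c))
}

object TriggerDemo {
  // Registers one slow-call trigger, fires three counters, returns the alerts raised.
  def run(): List[String] = {
    val alerts = mutable.ListBuffer.empty[String]
    val registry = new TriggerRegistry
    registry.register { c =>
      if (c.name == "getDb" && c.durationMillis > 5000) {
        alerts += s"${c.name} took ${c.durationMillis} ms"
      }
    }
    registry.fire(Counter("getDb", 6000L)) // slow: raises an alert
    registry.fire(Counter("getDb", 10L))   // fast: ignored
    registry.fire(Counter("other", 9000L)) // different name: ignored
    alerts.toList
  }
}
```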
This is intrusive on your codebase
Accessing Profile Data
•Easy to get in code
ProfileScreenPrinter.dump
•Output where you want
logger.info(ProfileScreenPrinter.toString)
•Send to logs, email, etc.
Accessing Profile Data
•Easier to get via API with Swagger-JAXRS
import com.wordnik.resource.util
@Path("/activity.json")
@Api("/activity")
@Produces(Array("application/json"))
class ProfileResource extends ProfileTrait
Inspect without bugging devs!
Is Aggregate Data Enough?
•Probably not
•Not Actionable
• Have calls increased? Decreased?
• Faster response? Slower?
Make it Actionable
•“In a 3 hour window, I expect 300,000 views per server”
• Poll & persist the counters
• Example: Log page views, every min
{
  "_id" : "web1-word-page-view-20120625151812",
  "host" : "web1",
  "count" : 627172,
  "timestamp" : NumberLong("1340637492247")
},
{
  "_id" : "web1-word-page-view-20120625151912",
  "host" : "web1",
  "count" : 627372,
  "timestamp" : NumberLong("1340637552778")
}
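Once cumulative counters are persisted, the number of events in any window is the difference between two samples. The sketch below works that out for the two documents shown above. CounterSample and CounterDelta are hypothetical names, and treating "count" as a monotonically increasing total is an assumption.

```scala
// Field names mirror the persisted documents; the delta logic is an assumption.
final case class CounterSample(host: String, count: Long, timestamp: Long)

object CounterDelta {
  // Events observed between two samples of a cumulative counter.
  def delta(prev: CounterSample, next: CounterSample): Long =
    next.count - prev.count

  // Events per minute over the interval between the two samples.
  def perMinute(prev: CounterSample, next: CounterSample): Double = {
    val elapsedMillis = next.timestamp - prev.timestamp
    require(elapsedMillis > 0, "samples must be in chronological order")
    delta(prev, next) * 60000.0 / elapsedMillis
  }

  // The two samples from the slide.
  val first: CounterSample  = CounterSample("web1", 627172L, 1340637492247L)
  val second: CounterSample = CounterSample("web1", 627372L, 1340637552778L)
}
```

For the two samples above, the counter grew by 200 page views over roughly one minute (60,531 ms between timestamps).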
Make it Actionable
Your boss LOVES charts
That’s not Actionable!
•But it’s pretty
What’s missing?
APIs to track?
Low + High Watermarks
Custom Time window
Too much custom Engineering
Call to Action!
Make it Actionable
•Swagger + a tiny bit of engineering
• Let your *product* people create monitors, set goals
•A Check: specific API call mapped to a service function
{
  "name": "word-page-view",
  "path": "/word/*/wordView (post)",
  "checkInterval": 60,
  "healthSpan": 300,
  "minCount": 300,
  "maxCount": 100000
}
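A check like this could be evaluated against persisted counter samples roughly as follows. This is a sketch under stated assumptions: healthSpan is read as a window in seconds, minCount and maxCount as the low and high watermarks for calls inside that window, and Sample and CheckEvaluator are hypothetical names not taken from the talk.

```scala
// Fields come from the check definition; their interpretation here is an assumption.
final case class Check(name: String, path: String, checkInterval: Int,
                       healthSpan: Int, minCount: Long, maxCount: Long)

// One persisted reading of a cumulative call counter.
final case class Sample(count: Long, timestampMillis: Long)

object CheckEvaluator {
  // Calls seen inside the health window, or None if there are fewer than two samples.
  def countInWindow(check: Check, samples: Seq[Sample], nowMillis: Long): Option[Long] = {
    val windowStart = nowMillis - check.healthSpan * 1000L
    val inWindow = samples
      .filter(s => s.timestampMillis >= windowStart && s.timestampMillis <= nowMillis)
      .sortBy(_.timestampMillis)
    if (inWindow.size < 2) None
    else Some(inWindow.last.count - inWindow.head.count)
  }

  // Healthy only when the windowed count sits between the two watermarks.
  def isHealthy(check: Check, samples: Seq[Sample], nowMillis: Long): Boolean =
    countInWindow(check, samples, nowMillis)
      .exists(c => c >= check.minCount && c <= check.maxCount)
}
```

Treating "not enough data" as unhealthy is a deliberate choice in this sketch: a host that has stopped reporting should not pass its checks.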
Make it Actionable
•A Service Type: a collection of checks which make a functional unit
{
  "name": "www-api",
  "checks": [
    "word-of-the-day",
    "word-page-view",
    "word-definitions",
    "user-login",
    "api-account-signup",
    "api-account-activated"
  ]
}
Make it Actionable
•A Host: “directions” to get to the checks
{
  "host": "ip-10-132-43-114",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
},
{
  "host": "ip-10-130-134-82",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
}
Make it Actionable
•And finally, a simple GUI
Make it Actionable
•Point Nagios at this!
serviceHealth.json/status/www-api?explodeOnFailure=true
•Get a 500, get an alert
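The "get a 500, get an alert" behaviour can be sketched as a pure function from check results to an HTTP status code. ServiceStatus and its methods are hypothetical names; only the explodeOnFailure flag and the 500 response come from the slide, and returning 200 in all other cases is an assumption.

```scala
object ServiceStatus {
  // checkResults maps a check name to whether it is currently healthy.
  def httpStatus(checkResults: Map[String, Boolean], explodeOnFailure: Boolean): Int = {
    val allHealthy = checkResults.values.forall(identity)
    if (!allHealthy && explodeOnFailure) 500 else 200
  }

  // Names of the failing checks, sorted for stable output in an alert body.
  def failing(checkResults: Map[String, Boolean]): List[String] =
    checkResults.collect { case (name, false) => name }.toList.sorted
}
```

Returning a plain status code keeps the Nagios side trivial: a standard HTTP check is enough, with no custom plugin.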
Metrics from Product
Based on YOUR app
Treat like system failure
Is this Enough?
System monitoring
Aggregate monitoring
Windowed monitoring
Object monitoring?
• Action on a specific event/object
Why!?
Object-level Actions
•Any back-end engineer can build this
• But shouldn’t
•ETL to a cube?
•Run BI queries against production?
•Best way to “siphon” data from production w/o intrusive engineering?
Avoiding Code Invasion
•We use MongoDB everywhere
•We use > 1 server wherever we use MongoDB
•We have an opLog record against everything we do
What is the OpLog?
•All participating members have one
•Capped collection of all write ops
[Diagram: oplog entries on the primary and two replicas over time (t1, t2, t3)]
So What?
•It’s a “pseudo-durable global topic message bus” (PDGTMB)
• WTF?
•All DB transactions in there
•It’s persistent (cyclic collection)
•It’s fast (as fast as your writes)
•It’s non-blocking
•It’s easily accessible
More about this
{
  "ts" : { "t" : 1340948921000, "i" : 1 },
  "h" : NumberLong("5674919573577531409"),
  "op" : "i",
  "ns" : "test.animals",
  "o" : { "_id" : "fred", "type" : "cat" }
},
{
  "ts" : { "t" : 1340948935000, "i" : 1 },
  "h" : NumberLong("7701120461899338740"),
  "op" : "i",
  "ns" : "test.animals",
  "o" : { "_id" : "bill", "type" : "rat" }
}
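The fields that matter for routing are "op" (the operation, "i" for insert), "ns" (database.collection) and "o" (the document itself). A dependency-free sketch of filtering on them is shown below. OplogEntry and OplogFilter are hypothetical plain-Scala stand-ins, not MongoDB driver types.

```scala
// Plain-Scala model of the three fields used for routing; not a driver type.
final case class OplogEntry(op: String, ns: String, o: Map[String, Any])

object OplogFilter {
  // Keep only the documents inserted ("i") into the given namespace.
  def insertsInto(ns: String, entries: Seq[OplogEntry]): Seq[Map[String, Any]] =
    entries.collect { case OplogEntry("i", `ns`, doc) => doc }
}
```

Applied to the two entries above with the namespace "test.animals", this yields the documents for "fred" and "bill" in order.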
Tapping into the Oplog
•Made easy for you!
https://github.com/wordnik/wordnik-oss
Snapshots
Replication
Incremental Backup
Same Technique!
Tapping into the Oplog
•Create an OpLogProcessor
class OpLogReader extends OplogRecordProcessor {
  val recordTriggers = new HashSet[Function1[BasicDBObject, Unit]]

  @throws(classOf[Exception])
  def processRecord(dbo: BasicDBObject) = {
    recordTriggers.foreach(t => t(dbo))
  }

  @throws(classOf[IOException])
  def close(string: String) = {}
}
Tapping into the Oplog
•Attach it to an OpLogTailThread
val util = new OpLogReader
val coll: DBCollection =
(MongoDBConnectionManager.getOplog("oplog",
"localhost", None, None)).get
val tailThread = new OplogTailThread(util, coll)
tailThread.start
Tapping into the Oplog
•Add some observer functions
util.recordTriggers += new Function1[BasicDBObject, Unit] {
  def apply(e: BasicDBObject): Unit = Profile("inspectObject", {
    totalExamined += 1
    /* do something here */
  })
}
/* do something here */
•Like?
•Convert to business objects and act!
• OpLog to domain object is EASY
• Just process the ns that you care about
"ns" : "test.animals"
•How?
Converting OpLog to Object
•Jackson makes this trivial
case class User(username: String, email: String, createdAt: Date)
val user = jacksonMapper.convertValue(
  dbo.get("o").asInstanceOf[DBObject], classOf[User])
•Reuse your DAOs? Bonus points!
•Got your objects!
Now what?
“o” is for “Object”
Use Case 1: Alert on Action
•New account!
obj match {
  case newAccount: UserAccount => { /* ring the bell! */ }
  case _ => { /* ignore it */ }
}
Use case 2: What’s Trending?
•Real-time activity
case o: VisitLog =>
Profile("ActivityMonitor:processVisit", {
wordTracker.add(o.word)
})
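The slide only shows wordTracker.add(o.word), so what the tracker does internally is not stated. One plausible shape, sketched here as an assumption, is an in-memory counter with a "top N" query; the WordTracker class and its top method are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical tracker; only the add(word) call appears in the talk.
class WordTracker {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Record one visit to a word.
  def add(word: String): Unit = counts(word) += 1

  // The n most-visited words, highest count first, ties broken alphabetically.
  def top(n: Int): List[(String, Long)] =
    counts.toList.sortBy { case (word, count) => (-count, word) }.take(n)
}
```

A real trending view would probably also expire old visits, for example by keeping one tracker per time bucket; that is left out to keep the sketch short.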
Use case 3: External Analytics
case o: UserProfile => {
getSqlDatabase().executeSql(
"insert into user_profile values(?,?,?)",
o.username, o.email, o.createdAt)
}
Don’t mix runtime & OLAP!
Your Data pushes to Relational!
Use case 4: Cloud analysis
case o: NewUserAccount => {
getSalesforceConnector().create(
Lead(Account.ID, o.firstName, o.lastName,
o.company, o.email, o.phone))
}
We didn’t interrupt core engineering!
Pushed directly to Salesforce!
Examples
•Polling profile APIs cross cluster
•Siphoning hashtags from opLog
•Page view activity from opLog
•Health check w/o engineering
Summary
•Don’t mix up monitoring servers & your application
•Leave core engineering alone
•Make a tiny engineering investment now
•Let your product folks set metrics
•FOSS tools are available (and well tested!)
•The opLog is incredibly powerful
• Hack it!
Find out more
•Wordnik: developer.wordnik.com
•Swagger: swagger.wordnik.com
•Wordnik OSS: github.com/wordnik/wordnik-oss
•Atmosphere: github.com/Atmosphere/atmosphere
•MongoDB: www.mongodb.org