monitoring platform requirements

Upload: anon660557039

Post on 08-Jul-2018


  • 8/19/2019 Monitoring Platform Requirements

    1/23

    MONITORING PLATFORM ASSESSMENT

    ABSTRACT:
    This document gives a brief comparison and assessment for the Monitoring Platform (from now on called MP), with the goal of a clearer understanding of possible future requirements and improvements.

    The MP Project is an interesting challenge, and as a Team we should provide an easy and accessible solution, fully configurable from the WebUI. Integration with OpenStack is also an important factor. The MP should cover the different Monitoring branches:

    • Availability Monitoring

    • Performance Metric Monitoring

    • Security Event Monitoring

    • Log Analysis Monitoring

    The different Monitoring technologies should be brought together behind a unified Web Interface, the MP Dashboard. Starting from the Dashboard, the User should be guided to the resources she/he wants to monitor. This offers a balance between the usage and priority of the different Monitoring branches. Focusing on only one branch and then building the other branches as add-ons should be avoided, as we don't know the final user's priorities among the different technologies.

    SELF-ASSESSMENT:

    Different Strategic aspects should be defined, as they can strongly influence the decision. Common strategic questions are:

    − Should we develop a Platform to be proposed and hopefully included in OpenStack?

    − Should we install the fastest Plug & Play OSS solution?

    − What's the short, medium and long term strategic vision for our MP?

    − Are we looking for near real-time data processing, or is a delay in data processing and alerting acceptable?

    If we want to develop our own solution from scratch or keep developing the existing one, is our intention to release the software as OSS?

    GENERAL CONSIDERATIONS:
    The analysis was made with the aim of giving some useful feedback in just one week. For this reason the information can be inaccurate, and some tools may have been excluded for a number of reasons like:

    − Fully developed in a programming language not used by the Team

    − Inactive community

    − Closed Source Project

    − Product developed only by one or two people

    − Strong influence from one single Company.

    Some ratings from http://www.ohloh.net were taken into consideration as a reference and source of information.

    SWIFT REQUIREMENT:

    System level stats
    The obvious stuff here: CPU load, memory usage, disk space free. Notification of disk failure is particularly important; we'll have a lot of disks and we want failures replaced in good time. Any standard system (collectd?) will do the job here.

    Swift system stats
    Swift Recon produces a collection of statistics about the Swift instance that are particularly important to measure.

    Swift Recon works on a pull model; the monitoring system needs to query Swift Recon to get the latest stats.
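As a rough sketch of the pull model described above, a poller could fetch a small JSON document from each storage node and turn it into a metric. The `/recon/async` path, port 6000 and the `async_pending` key are assumptions based on a typical Swift object-server setup, and the function names are hypothetical; adjust them to the actual deployment.

```python
import json
from urllib.request import urlopen


def recon_url(host, check, port=6000):
    """Build the URL for one recon check on one storage node (assumed layout)."""
    return f"http://{host}:{port}/recon/{check}"


def parse_async_pending(body):
    """Extract the async-pending count from a recon JSON body."""
    return int(json.loads(body)["async_pending"])


def poll_async_pending(hosts, fetch=None, timeout=5):
    """Pull the async-pending count from every host; None marks a failed poll."""
    def default_fetch(url):
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")

    fetch = fetch or default_fetch
    results = {}
    for host in hosts:
        try:
            results[host] = parse_async_pending(fetch(recon_url(host, "async")))
        except (OSError, ValueError, KeyError):
            # Network failure, invalid JSON or a missing key all count as "no data".
            results[host] = None
    return results
```

The `fetch` parameter exists so the poller can be exercised without a live cluster; in production the default HTTP fetch would be used.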

    Swift instance stats
    The Swift code is heavily instrumented with a lot of different statsd metrics. We want all these to show up in the monitoring platform. Using statsd in some way is also a requirement; we don't want to alter the Swift code to use a different system, so the monitoring platform will need to work with statsd.
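For orientation, statsd metrics are plain text lines sent over UDP, so any platform that can receive them can work with the Swift instrumentation unchanged. The sketch below formats and sends such lines; the metric names in the usage note are hypothetical, and port 8125 is the conventional statsd default rather than something stated in this document.

```python
import socket


def statsd_line(name, value, kind, sample_rate=None):
    """Format one metric in the statsd text protocol: name:value|type[|@rate]."""
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one statsd line over UDP without waiting for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `statsd_line("proxy-server.errors", 1, "c")` produces a counter increment and `statsd_line("object-server.GET.timing", 320, "ms")` a timing sample.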

    What we want from the monitoring platform

    Stats handling and storage
    We want all the metrics to be available in the monitoring platform, preferably without any manual steps. We'll also want some storage of metrics to use when diagnosing problems. They must also be available for graphing and alerting.

    Alerting
    We need to know when something has gone wrong. For example, disk or node failures should trigger an alert in the form of an email, a red box on a web page, or both. We'll also want to trigger alerts based on trends, such as a rise in async-pending operations or high load averages.
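A trend-based alert of the kind mentioned above could be sketched as a comparison between two adjacent time windows. The window size and the 20% threshold are illustrative assumptions, not values taken from the requirements.

```python
from statistics import mean


def rising_trend(samples, window=3, min_increase=0.2):
    """Return True when the mean of the newest window exceeds the mean of the
    previous window by more than min_increase (0.2 means 20%)."""
    if len(samples) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    if previous == 0:
        return recent > 0
    return (recent - previous) / previous > min_increase
```

Fed with successive async-pending counts or load averages, this flags a sustained rise while ignoring a single noisy sample.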

    GENERAL REQUIREMENTS:

    Graphing
    A dashboard showing key current statistics about the Swift cluster could be very useful for the operations teams. Data such as request timings, CPU load etc. could provide a useful "at-a-glance" health-check for the system. We'll also want some way to graph other metrics for use when diagnosing problems in the system itself.

    − Licensing:

    − The main core component of the MP should be released under an Open Source License (LGPL, BSD, MIT, APA

    available resources against the current usage resources.

    − Additional plugins/checks should be developed with Python, Ruby, Java or C.

    − Smartphone APP should be already available or easy to build (using API)

    − Components should be independent. This means if we want to use a different collector or a different technology

    − Alert triggering per average and/or current absolute values

    − Log Analysis should be performed as an additional source of information

    − Possibly Compatible / Mentioned in OpenStack

    − Extensive Documentation

    − The simplest possible Architecture

    − Active Community: forum and ticketing systems should be available.

    − Trigger alerts using different thresholds from different metrics (which can be different services or different nodes/clusters).

    − Ability to query data based on: regexp on metric name, metric value, hostname, etc.

    − Ability to compute metrics from different source metrics

    − Alerting per deployment failure

    − Ability to specify our own aggregation method (let's say avg, low, high, current or custom function), AKA Interpolated data representation.

    − Exact data representation

    − There should be a big public repository for plugins, templates and new checks
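The query and aggregation requirements in the list above can be illustrated with a small in-memory sketch. The data layout (a dict keyed by `(hostname, metric_name)`) and the function names are assumptions made for the example, not part of any evaluated tool.

```python
import re
from statistics import mean

# Named aggregators from the requirement list; "current" is the newest sample.
AGGREGATORS = {
    "avg": mean,
    "low": min,
    "high": max,
    "current": lambda values: values[-1],
}


def query(series, metric_pattern=".*", host_pattern=".*"):
    """Select series whose hostname and metric name match the given regexes."""
    metric_re = re.compile(metric_pattern)
    host_re = re.compile(host_pattern)
    return {
        key: values
        for key, values in series.items()
        if host_re.search(key[0]) and metric_re.search(key[1])
    }


def aggregate(values, how="avg"):
    """Reduce values with a named aggregator or any custom callable."""
    func = AGGREGATORS[how] if isinstance(how, str) else how
    return func(values)
```

Passing a callable as `how` covers the "custom function" case, for example `aggregate(values, lambda v: sum(v) / 2)`.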

    − INFRASTRUCTURE ARCHITECTURE:

    − Every component should be deployed on a dedicated node

    − Every component on a dedicated node should be redundant

    − Distributed Platform: The sharding of monitored hosts groups should be automatic. This can be arranged per

    − Minimum Resources to Monitor (needs to be extended or reduced):

    − OS:

    − memory usage details

    − CPU, any single state and per core monitoring (usr, sys, idl, wai, hiq, siq)

    − Context Switch, Interrupt

    − Page I/O

    − Procs: Runnable, Blocked, New

    − Disks: Total R/W

    − Most expensive memory process

    − Most expensive cpu process

    − System Load 1, 5, 15

    − aio stats (asynchronous I/O)

    − filesystem stats (open files, inodes)

    − ipc stats (message queue, semaphores, shared memory)

    − file lock stats (posix, flock, read, write)

    − raw stats (raw sockets)

    − vm stats (hard pagefaults, soft pagefaults, allocated, free)

    − most expensive block I/O process

    − process using the most CPU time

    − process with highest total latency

    − process with the highest average latency

    − Alerting dependencies mapping (to avoid being flooded by alerts)

    − Puppet integration should be easy
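The alerting dependencies item in the list above can be sketched as a filter that keeps only root-cause alerts. The dependency map format (item name to a list of the items it depends on) and all names are assumptions for illustration.

```python
def suppress_dependent_alerts(failing, depends_on):
    """Keep only root-cause alerts: drop any failing item that depends,
    directly or transitively, on another failing item."""
    def has_failing_ancestor(item, seen):
        for parent in depends_on.get(item, ()):
            if parent in seen:
                continue  # already visited; also protects against cycles
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return sorted(
        item for item in failing if not has_failing_ancestor(item, {item})
    )
```

With a map such as `{"swift-proxy": ["node1"], "node1": ["switch-a"]}`, a failed switch produces a single alert instead of one per dependent service.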

    − NETWORK:

    − Network stats: per interface stats such as throughput, errors, per protocol stats, connection details, number of connections and per IP connections (important for Country traffic profiling and Business Intelligence)

    − socket stats (total, tcp, udp, raw, ip-fragments)

    − tcp stats (listen, established, syn, time_wait, close)

    − udp stats (listen, active)

    − unix stats (datagram, stream, listen, active)
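As one possible way to gather the tcp stats listed above on Linux, the sketch below counts connection states from text in the `/proc/net/tcp` format. Reading from `/proc` is an assumption about the collector's environment; the function itself only parses text, so it can be tested anywhere.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "established", "02": "syn_sent", "03": "syn_recv",
    "04": "fin_wait1", "05": "fin_wait2", "06": "time_wait",
    "07": "close", "08": "close_wait", "09": "last_ack",
    "0A": "listen", "0B": "closing",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3].upper(), "unknown")] += 1
    return counts
```

On a Linux host the input would typically come from `open("/proc/net/tcp").read()`.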

    − SECURITY:

    − At some point, we'll have to meet the PCI-DSS standard requirement.

    − File (System) Integrity Checking (PCI-DSS sections 11.5, 10.5.5)

    − Log Monitoring (Security perspective, i.e. Brute Forcing detection, PCI-DSS section 10 as a whole)

    − Rootkit Detection

    − Policy Enforcement Checking (weak password detection)
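File integrity checking, as listed above, usually comes down to comparing content hashes against a stored baseline. The following is a minimal sketch under that assumption; the choice of SHA-256 and the function names are illustrative and not taken from any specific tool in this assessment.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each existing file path to the SHA-256 digest of its contents."""
    return {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in paths
        if Path(p).is_file()
    }


def integrity_changes(baseline, current):
    """Compare two fingerprints and report modified, removed and added files."""
    return {
        "modified": sorted(
            p for p in baseline if p in current and baseline[p] != current[p]
        ),
        "removed": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }
```

A scheduled job would store the first `fingerprint()` result as the baseline and alert whenever `integrity_changes()` returns a non-empty list.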

    OPENSTACK SWIFT:

    − Disk utilization

    − Monitor how much space is available from Swift's perspective; this is distinct from a system level view

    − Un-mounted drives

    − Monitors drive failures; Swift unmounts a drive when it has a problem.

    − Async-pending

    − An async-pending happens when a container listing update fails. If these levels are high then the cluster is degraded.

    − vm stats (hard pagefaults, soft pagefaults, allocated, free)

    − per disk transactions per second (tps) stats

    − per disk utilization in percentage

    − per disk utilization in megabytes

    − per filesystem disk usage

    − Swift Object Server: ensure all the servers in the cluster have the same copy of the object ring

    − Swift Dispersion: dispersion analysis and check that all copies of objects are OK

    − Swift S check: upload, download and delete a file in a Swift Container to check that it works correctly
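The upload, download and delete check at the end of the list above could look roughly like this. The sketch assumes a connection object exposing `put_object`, `get_object` and `delete_object` in the style of python-swiftclient's `Connection`; the probe naming and return format are hypothetical.

```python
import uuid


def swift_roundtrip_check(conn, container, payload=b"monitoring-probe"):
    """Upload, download and delete a probe object; return (ok, message)."""
    name = f"probe-{uuid.uuid4().hex}"  # unique name so probes never collide
    try:
        conn.put_object(container, name, payload)
        _headers, body = conn.get_object(container, name)
        conn.delete_object(container, name)
    except Exception as exc:  # any client or transport error fails the check
        return False, f"CRITICAL: {exc}"
    if body != payload:
        return False, "CRITICAL: downloaded content differs from upload"
    return True, "OK: upload, download and delete succeeded"
```

Because the function only relies on three methods, it can be exercised against a fake connection before being pointed at a real cluster.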

    − OPENSTACK NOVA:

    − Check for one or more flavors

    − Possibility to list servers

    − One or more images available

    − One or more security_groups available

    − OPENSTACK KEYSTONE (Identity Service)

    − Check if it is possible to get a Token and check if there's a public URL declared for that service.

    − OPENSTACK GLANCE:

    − Check that Glance holds at least the desired minimum number of images and the desired image names.
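A Glance check like the one above could be written as a small plugin-style function. The sketch assumes the list of image names has already been fetched from Glance by other means, and it uses the conventional Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name and messages are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # conventional Nagios plugin exit codes


def check_images(image_names, min_count=1, required=()):
    """Return (exit_code, message) for a list of image names found in Glance."""
    if len(image_names) < min_count:
        return CRITICAL, (
            f"CRITICAL: {len(image_names)} images found, "
            f"expected at least {min_count}"
        )
    missing = sorted(set(required) - set(image_names))
    if missing:
        return CRITICAL, "CRITICAL: missing images: " + ", ".join(missing)
    return OK, f"OK: {len(image_names)} images available"
```

The same shape would fit the Nova checks (flavors, servers, security groups): fetch a list, compare it with the expected minimum, and return a status code plus a one-line message.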

    TECHNOLOGIES COMPARISON:

    − Nagios:

    − Intro:

    − Is the standard Open Source Software for monitoring.

    − Pros:

    − Excellent engine for Availability Monitoring.

    − Very fast and performant (developed in C)

    − Mentioned in the OpenStack documentation

    − Good know-how in the Team

    − Can be integrated with Ganglia

    − Can easily monitor OpenStack resources

    − Can monitor every resource in RabbitMQ

    − Cons:

    − No

    − To date no NoSQL DB is natively supported (MySQL, PSQL, Oracle)

    − Not developed in one of the Team's preferred programming languages

    − RRDTool data storage (however, it can be integrated with Graphite for exact data representation)

    − What can be taken?

    − Live Search

    − Compound commands

    − LDAP/AD Interface

    − What can be taken?

    − Pretty much everything

    − It needs to be verified whether Ganglia can be used as a collecting method. Probably yes, as it can be used as a Nagios check.

    − Probably no specific Team experience, but as the solution is Nagios based, relevant know-how should be available

    − Graphite

    − Intro:

    − The most used Open Source software for exact data representation and graphic rendering

    − Pros:

    − Real-Time Graphing with exact data representation

    − Developed in Python

    − Components can be deployed in a distributed way (scalability and flexibility)

    − Django Web Framework

    − AMQP can be used for application data routing, but no documentation is available.

    − Very I/O Efficient

    − Integrated with LDAP

    − Very customizable Dashboard

    − Advanced graphing (uses cairo as rendering engine)

    − Data can be displayed/exported in different formats, including JSON

    − Easily integrable with other monitoring platforms (https://graphite.readthedocs.org/en/latest/tools.html)

    − Cons:

    − Data collection must be done by external add-ons (there are a lot)

    − Not sure how to use different DB technologies for data storing (from the doc only whisper is supported)

    − No native alerting support

    − What can be taken?

    − Graphite is a great added value as a performance graphing component. I'd define Graphite as part of the MP solution rather than the complete MP solution itself.
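Since Graphite can export data as JSON but has no native alerting, an external component could read its render output and apply thresholds itself. The sketch below assumes the render endpoint's JSON shape of a list of series, each with a `target` and `[value, timestamp]` datapoints; the base URL, metric name and threshold are hypothetical.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, since="-10min"):
    """Build a render URL that asks Graphite for JSON output."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def latest_values(render_json):
    """Return {target: most recent non-null value} from a render JSON body."""
    latest = {}
    for series in json.loads(render_json):
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:
            latest[series["target"]] = values[-1]
    return latest


def breaches(render_json, threshold):
    """Return the targets whose most recent value is above the threshold."""
    return {t: v for t, v in latest_values(render_json).items() if v > threshold}
```

A scheduler would fetch `render_url(...)`, pass the body to `breaches()`, and hand any result to whichever alerting component the MP ends up using.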

    − 5(. B,".le

    − Intro:

    − The Current MP used in the R&D Cloud

    − Pros:

    − Ability to query current value of metrics

    − Filter criteria include pattern matching and string manipulation on host, metric, cluster and value

    − Ability to do comparisons between metric values / times etc. using operators J K JK LK

    − String manipulation and formatting to generate output of live query data

    − Ability to generate report tables (using live query) which may be permalinked and saved for later use

    − Ability to view graphs of historic data of any persisted metric returned by live query

    − Alerting notification, filtering, and hiding of alerts

    − Persistence for up to 1 year

    − Very efficient and scalable System Monitor Tool. It can also be used as a single component for data collection.

    − Pros:

    − Used in the current production monitoring solution

    − Data exported via XML (over TCP)

    − Supports UDP Multicast and Unicast for external data representation

    − Distributed Federation Model (scalability and redundancy)

    − It can be easily extended by adding new plugins

    − Can be integrated easily with

    − Developed in P

    − Pros:

    − Very easy to install and configure

    − Can track changes made to specified files (like /etc/passwd) and send alerts

    − Flexibility on data collection method

    − Very flexible alerting (Agent, SNMP, IPMI, JMX, SS

    − Very Scalable. It can collect 1.2M data points in 5 minutes via SNMP.

    − Can handle Syslog Messages

    − Developed in Java

    − Multiple data collection methods (SNMP,

    − NetXMS:

    − Intro:

    − Enterprise multi-platform network management and monitoring system

    − Pros:

    − Advanced Security features: Access Control, Encryption, Different Authentication methods

    − Native Java and API

    − Nagios Plugin Compatibility

    − Built-in Interface with

    − External plugin to store the data points in OpenTSDB (API to

    − Back-end support

    − Net-SNMP

    − Intro:

    − Data Collection Component

    − Pros:

    − Standard for SNMP monitoring

    − Very portable, can be installed on any OS

    − Developed in C, very efficient

    − Supports Extensions, so everything can be monitored by SNMP

    − SNMP is the most supported standard for metric collection

    − Most of the technologies in the market use net-snmp as embedded daemon

    − Cons:

    −

    − What can be taken?

    − A serious MP should be able to collect data using the SNMP protocol, so net-snmp should be included as the snmp daemon.

    − OSSEC

    − Intro:

    − OSSEC is a full platform to monitor and control your systems. It mixes together all the aspects of

    − What can be taken?

    − SaaS Business Model

    − API and WebUI Full Control

    − Auto-scaling Concept

    − OpenTSDB

    − Intro:

    − OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of

    Quality Monitoring solution at its excellence. There is deep know-how and a strong will to achieve this.

    As a general approach, we should get the best from our past experiences, learn from past errors and take the most useful components we can find from the Open Source community, then innovate, integrate and improve them.

    There's no Open Source software product that can satisfy all our requirements. According to our strategy this can be an excellent opportunity for Dell. Additional time and effort are also required to work on and extend an Open Source solution to meet the requirements, with the final goal of a unique, world class quality Monitoring Platform.

    The MP should support at least 3 ways to collect data (i.e. Ganglia, Statsd, Net-SNMP). The same approach can be used for the API (at least 2 APIs) and for the DB (MySQL, NoSQL,

    The Web User Interface is read only and it lacks very basic reporting functionalities. If the Team wants to keep going with the od development, there are a number of things that really should be improved, as the Team spends too many resources on implementing very basic things like reporting and configuration, just as an example, plain

    Develop a new solution based on an existing Open Source software:

    In this context the following considerations should probably be taken into account when choosing the Software:

    − We should avoid developing software that serves the interests of a single Company. This matters even more if the main contributor of the software is a Competitor Company

    − Time and effort need to be spent on an available Open Source solution to implement capabilities that match the Team requirements

    − The improvements should be re-released to the Open Source Community. In this case Dell's image and perception will be positively fortified. We'll also have the Open Source work force to implement more capabilities, reducing Resource costs.