promcon2016

Hadoop, Fluentd cluster monitoring with Prometheus and Grafana 2016/08/26 Wataru Yukawa(@wyukawa) #promcon2016

Upload: wyukawa

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Promcon2016

Hadoop, Fluentd cluster monitoring with Prometheus and Grafana

2016/08/26 Wataru Yukawa (@wyukawa)

#promcon2016

Page 2: Promcon2016

Who am I?

• Data Engineer at LINE
• LINE makes a messaging application of the same name, in addition to other related services
• It is the most popular messaging platform in Japan

Page 3: Promcon2016

LINE

Page 4: Promcon2016

Who am I?

• First time in Germany!
• Maintain an on-premises log analysis platform on top of Hadoop/Hive/Fluentd.
• Unofficial Prometheus Evangelist in Japan
  – Organized a meetup in Tokyo on June 14, 2016 (more than 100 attendees)
  – http://developers.linecorp.com/blog/?p=3908

Page 5: Promcon2016

Agenda

• Background of LINE's development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana

Page 6: Promcon2016

Before Prometheus

• I have experience with other monitoring tools like Ganglia and Nagios
• I found Prometheus
  – Monitoring and alerting are unified
  – There is a query feature that allows ad-hoc queries
• max disk usage: max by (instance) (100 - (node_filesystem_free{...} / node_filesystem_size{...}) * 100)
• I want to use Prometheus
• How do we adjust Prometheus to our environment?
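To make the arithmetic in the max disk usage query concrete, here is a minimal Python sketch of what it computes for each instance. The host names and byte counts are hypothetical sample data; in Prometheus the inputs would come from the node_filesystem_free and node_filesystem_size series.

```python
from collections import defaultdict


def max_disk_usage_percent(samples):
    """Return the highest used-disk percentage per instance.

    `samples` is an iterable of (instance, free_bytes, size_bytes) tuples,
    one per filesystem. This mirrors
    max by (instance) (100 - (free / size) * 100).
    """
    usage = defaultdict(float)
    for instance, free_bytes, size_bytes in samples:
        used_percent = 100 - (free_bytes / size_bytes) * 100
        usage[instance] = max(usage[instance], used_percent)
    return dict(usage)


# Hypothetical samples: two filesystems on host-a, one on host-b.
samples = [
    ("host-a:9100", 512, 1024),
    ("host-a:9100", 128, 1024),
    ("host-b:9100", 768, 1024),
]
result = max_disk_usage_percent(samples)
```

For host-a the fuller of the two filesystems wins, which is what the max aggregation does.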

Page 7: Promcon2016

LINE's development environment

• We rarely use cloud services like AWS because we run on-premises
  – Host information doesn't change frequently
• That's why we currently don't use any service discovery system (like Consul)
  – Therefore, we need to use static configuration for Prometheus
• We wanted to manage servers through a browser
• So we created a tool to manage the server list, called promgen (https://github.com/line/promgen)

Page 8: Promcon2016

Agenda

• Background of LINE's development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana

Page 9: Promcon2016

About promgen

• Simple web app written in Ruby which
  – Generates the server list/rules and reloads (POST /-/reload) Prometheus
  – Controls alert management
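As an illustration of the reload step, the following Python sketch builds and sends the POST /-/reload request. Promgen itself is written in Ruby, so this is not its implementation; the base URL is an assumed local Prometheus address, and newer Prometheus versions only expose this endpoint when started with --web.enable-lifecycle.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request that asks Prometheus to reload its configuration."""
    url = base_url.rstrip("/") + "/-/reload"
    # The endpoint takes no payload, so an empty body is sent.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Assumed address of a local Prometheus; adjust to your setup.
request = build_reload_request("http://localhost:9090")
```

Calling reload_prometheus("http://localhost:9090") after rewriting the rule and target files would trigger the reload against a running server.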

Page 10: Promcon2016

Promgen data model

[Diagram: a Service contains Projects; each Project has a Farm; each Farm contains Hosts]

Page 11: Promcon2016

Service of promgen

This service has 3 projects (blog-admin, blog-batch, blog-web)

Page 12: Promcon2016

Project of promgen

[Screenshot callouts: alert notification, exporters, hosts]

Page 13: Promcon2016

Host screen

Page 14: Promcon2016

Exporter of promgen

Job name becomes a Prometheus label

Page 15: Promcon2016

prometheus.yaml

    rule_files:
      - "/tmp/prom.rule"

    scrape_configs:
      - job_name: 'dummy'
        file_sd_configs:
          - files:
              - "/tmp/prom.json"

[Diagram: Promgen updates /tmp/prom.rule and /tmp/prom.json]

Promgen doesn't change prometheus.yaml directly, but instead updates prom.rule/prom.json. Users should run Promgen and Prometheus on the same machine.

Page 16: Promcon2016

prom.json example

    [
      {
        "targets": [
          "blog.admin1.localhost:9100",
          "blog.admin2.localhost:9100",
          ...
        ],
        "labels": {
          "service": "blog",
          "project": "blog-admin",
          "farm": "blog-admin-RELEASE",
          "job": "node"
        }
      },
      ...
    ]

service/project/farm/job become Prometheus labels. We use these Prometheus labels in Grafana templates.
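The mapping from the service/project/farm/host hierarchy to file_sd target groups can be sketched in Python as follows. This is only an illustration under assumed names: build_target_groups and its parameters are hypothetical, and Promgen's real generator is a Ruby application whose internals are not shown in the slides.

```python
import json


def build_target_groups(service, projects, port=9100, job="node"):
    """Turn a service -> project -> farm -> hosts mapping into target groups.

    `projects` maps a project name to a (farm_name, [hostnames]) pair.
    Each project becomes one target group carrying the four labels.
    """
    groups = []
    for project, (farm, hosts) in projects.items():
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {
                "service": service,
                "project": project,
                "farm": farm,
                "job": job,
            },
        })
    return groups


# Host names taken from the prom.json example.
projects = {
    "blog-admin": (
        "blog-admin-RELEASE",
        ["blog.admin1.localhost", "blog.admin2.localhost"],
    ),
}
groups = build_target_groups("blog", projects)
prom_json = json.dumps(groups, indent=2)
```

Writing prom_json to the file listed under file_sd_configs would make these targets visible to Prometheus.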

Page 17: Promcon2016
Page 18: Promcon2016

How to use labels in templating

• Templating is useful because dashboards can be reused
• Labels correspond to templates in Grafana
• In this example, we use the farm label
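Conceptually, a template variable is substituted into each panel query before it is sent to Prometheus. The Python sketch below imitates that substitution for the farm label; the panel query is a hypothetical example (node_load1 is a node_exporter metric), and in Grafana the variable's values would typically be populated with a query such as label_values(farm).

```python
from string import Template

# Hypothetical dashboard query with a $farm placeholder.
PANEL_QUERY = Template('node_load1{farm="$farm"}')


def render_query(farm):
    """Substitute the selected farm into the panel query."""
    return PANEL_QUERY.substitute(farm=farm)


rendered = render_query("blog-admin-RELEASE")
```

Selecting a different farm changes only the substituted value, which is why one dashboard can serve every farm.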

Page 19: Promcon2016

Prometheus and Grafana

• Prometheus pulls metrics from exporters
• Grafana's data source is Prometheus
• Prometheus and Grafana are a perfect combination
• I really appreciate Grafana's Prometheus plugin

[Diagram: Exporters, Prometheus, Grafana]

Page 20: Promcon2016

About promgen

• Simple web app written in Ruby which
  – Generates the server list/rules and reloads (POST /-/reload) Prometheus
  – Controls alert management

Page 21: Promcon2016

Alertmanager

• Alertmanager is powerful because users can avoid a flood of alert notifications
• Deduplication and silences are useful
• Alertmanager can avoid alert fatigue
• We want to manage alert notification rules and settings easily
  – For example, we want to add a HipChat room and mail address through the browser.
• That's why we implemented a webhook in promgen

Page 22: Promcon2016

Rule control screen

Page 23: Promcon2016

HipChat and Mail

• Users can set a HipChat room and mail address to receive alerts

Page 24: Promcon2016

How to notify alerts

• Promgen has a webhook feature to send alerts to both HipChat and Mail
• If an alert occurs, users receive it through Alertmanager and Promgen

[Diagram: Prometheus → Alertmanager → Promgen → HipChat / Mail]
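To show what a webhook receiver has to do with an incoming notification, here is a small Python sketch that turns an Alertmanager webhook payload into one message line per alert. It is not Promgen's code (Promgen is a Ruby app); the alert name, instance, and summary in the sample payload are hypothetical, and only the commonly used payload fields (alerts, status, labels, annotations) are read.

```python
import json


def format_alerts(payload):
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        instance = labels.get("instance", "unknown instance")
        line = f"[{status}] {name} on {instance}"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return lines


# Hypothetical payload, shaped like an Alertmanager webhook notification.
sample = json.loads("""
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "NameNodeDown", "instance": "namenode:50070"},
      "annotations": {"summary": "NameNode is not responding"}
    }
  ]
}
""")
messages = format_alerts(sample)
```

A receiver would then forward each line to the chat room and mail address configured for the matching project.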

Page 25: Promcon2016

Agenda

• Background of LINE's development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana

Page 26: Promcon2016

Log analysis platform

• Access logs are sent to HDFS by Fluentd. There are more than 400 Fluentd processes and 150k msg/sec during peak times.
• Fluentd is an OSS log collector, like Logstash and Flume, written in Ruby
• Our Hadoop cluster is medium-sized, consisting of 40 units.

[Diagram labels: access log, Hive, MRv2/Tez/HDFS, HDP 2.4.0]

Page 27: Promcon2016

Monitoring of Hadoop/Hive cluster

• Developers normally use jmx_exporter to monitor Java middleware
• But I wanted to create an exporter, so I implemented namenode/resourcemanager/jstat exporters
• namenode_exporter uses http://namenode:50070/jmx
• resourcemanager_exporter uses http://resourcemanager:8088/ws/v1/cluster/metrics
• jstat_exporter uses the jstat command
  – Honestly, the current jstat_exporter implementation is not so good, because the jstat command is executed every time Prometheus pulls metrics
  – A cache may be necessary
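One way the cache mentioned above could work is a simple time-to-live wrapper, so that the command runs at most once per interval no matter how often Prometheus scrapes. The Python sketch below is an assumption about a possible design, not the jstat_exporter's code; CachedCommand and run_jstat are hypothetical names, and the demo uses a fake runner and a fake clock so it does not need a JVM.

```python
import subprocess
import time


class CachedCommand:
    """Run a command at most once per `ttl` seconds and reuse its output."""

    def __init__(self, run, ttl=10.0, clock=time.monotonic):
        self._run = run            # callable returning the command output
        self._ttl = ttl
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._run()
            self._expires_at = now + self._ttl
        return self._value


def run_jstat(pid):
    """Call `jstat -gc <pid>`; requires a JDK on PATH."""
    completed = subprocess.run(
        ["jstat", "-gc", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


# Demo with a fake runner and clock instead of a real JVM.
calls = []


def fake_runner():
    calls.append(1)
    return "fake jstat output"


fake_time = [0.0]
cache = CachedCommand(fake_runner, ttl=10.0, clock=lambda: fake_time[0])
first = cache.get()
second = cache.get()        # served from the cache
fake_time[0] = 11.0
third = cache.get()         # TTL expired, so the runner is called again
```

In a real exporter the runner would be something like lambda: run_jstat(pid), with the TTL chosen to be no longer than the scrape interval.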

Page 28: Promcon2016

Namenode FilesTotal monitoring by using namenode_exporter

NameNode Down!

Alerts are also Prometheus metrics, so Grafana can show alerts as annotations

Page 29: Promcon2016

Resourcemanager job monitoring by using resourcemanager_exporter
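The ResourceManager endpoint /ws/v1/cluster/metrics returns a JSON document with a clusterMetrics object, which an exporter has to translate into the Prometheus text format. The Python sketch below shows one possible translation; the resourcemanager_ prefix and the resulting metric names are assumptions for illustration and may differ from what resourcemanager_exporter actually emits. The sample values are made up.

```python
import json


def cluster_metrics_to_exposition(body, prefix="resourcemanager_"):
    """Convert the cluster metrics JSON into Prometheus exposition lines."""
    metrics = json.loads(body)["clusterMetrics"]
    lines = []
    for key in sorted(metrics):
        value = metrics[key]
        # Skip anything that is not a plain number.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}{key} {value}")
    return "\n".join(lines) + "\n"


# Made-up response body with three of the fields the API reports.
sample_body = (
    '{"clusterMetrics": '
    '{"appsRunning": 3, "appsPending": 1, "activeNodes": 40}}'
)
text = cluster_metrics_to_exposition(sample_body)
```

Serving this text on an HTTP endpoint is what lets Prometheus scrape job counts such as running and pending applications.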

Page 30: Promcon2016

Hiveserver2 JVM monitoring by using jstat_exporter

https://issues.apache.org/jira/browse/HIVE-13374

Page 31: Promcon2016

Fluentd buffer monitoring

• Fluentd has a buffer mechanism to retry if the destination is unstable
• fluent-plugin-prometheus enables buffer monitoring
• fluent-plugin-prometheus is a Fluentd plugin and uses the Prometheus Ruby client

Page 32: Promcon2016

Access log count

• fluent-plugin-prometheus enables us to count access logs, but we need sampling because of high CPU usage
• One Fluentd process can't handle high traffic

Page 33: Promcon2016

HTTP status count

Although the 4xx/5xx count is not 0, it may appear as 0 because of sampling. So we will switch to Flink.

Page 34: Promcon2016

HTTP status percentage

sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job)
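The query divides each per-status rate by the total rate of the same job; ignoring(status) group_left is what lets many status series match one per-job total. The Python sketch below reproduces that calculation on hypothetical rates to make the matching explicit. Note that the result is a ratio between 0 and 1; multiplying by 100 gives a percentage.

```python
from collections import defaultdict


def status_share(rates):
    """Compute each status's share of the per-job request rate.

    `rates` maps (job, status) to a per-second rate, i.e. the result of
    sum(rate(...)) by (status, job).
    """
    totals = defaultdict(float)
    for (job, _status), rate in rates.items():
        totals[job] += rate            # sum(rate(...)) by (job)
    return {
        (job, status): rate / totals[job]
        for (job, status), rate in rates.items()
        if totals[job] > 0
    }


# Hypothetical per-second rates for one job.
rates = {("web", "200"): 6.0, ("web", "404"): 1.5, ("web", "500"): 0.5}
shares = status_share(rates)
```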

Page 35: Promcon2016

fluentd_exporter

• Fluentd often needs to run as multiple processes because of the GVL
• I implemented fluentd_exporter to monitor Fluentd CPU usage per process
• fluentd_exporter can handle multiple Fluentd processes

Page 36: Promcon2016

My feeling

• Prometheus's query language is really powerful
  – sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job)
• Prometheus and Grafana are a perfect combination
• We created promgen to improve host management and alert notification settings

Page 37: Promcon2016

References

• http://developers.linecorp.com/blog/?p=3908
• https://github.com/line/promgen
• http://www.fluentd.org/
• https://github.com/wyukawa/hadoop_exporter
• https://github.com/wyukawa/jstat_exporter
• https://github.com/wyukawa/fluentd_exporter
• https://github.com/kazegusuri/fluent-plugin-prometheus