promcon2016
Hadoop, Fluentd cluster monitoring with Prometheus and Grafana
2016/08/26 Wataru Yukawa (@wyukawa)
#promcon2016
Who am I?
• Data Engineer at LINE
• LINE makes a messaging application of the same name, in addition to other related services
• It is the most popular messaging platform in Japan
LINE
Who am I?
• First time in Germany!
• Maintain an on-premises log analysis platform on top of Hadoop/Hive/Fluentd.
• Unofficial Prometheus Evangelist in Japan
– Organized a meetup in Tokyo on June 14, 2016 (more than 100 attendees)
– http://developers.linecorp.com/blog/?p=3908
Agenda
• Background of LINE’s development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana
Before Prometheus
• I have experience with other monitoring tools like Ganglia and Nagios
• I found Prometheus
– Monitoring and alerting are unified
– There is a query feature that allows ad-hoc queries
• max disk usage: max by(instance)(100 - (node_filesystem_free{...} / node_filesystem_size{...}) * 100)
• I want to use Prometheus
• How do we adjust Prometheus to our environment?
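To make the max disk usage query above concrete, here is a minimal Python sketch of the same arithmetic. The sample values, the tuple layout, and the function names are made up for illustration; they are not from the talk.

```python
from typing import Dict, Iterable, Tuple

# (instance, mountpoint, free_bytes, size_bytes): a hypothetical sample shape.
Sample = Tuple[str, str, float, float]


def used_percent(free_bytes: float, size_bytes: float) -> float:
    """Mirror the inner expression: 100 - (free / size) * 100."""
    return 100.0 - (free_bytes / size_bytes) * 100.0


def max_disk_usage_by_instance(samples: Iterable[Sample]) -> Dict[str, float]:
    """Mirror max by(instance)(...): keep the fullest filesystem per instance."""
    result: Dict[str, float] = {}
    for instance, _mountpoint, free_bytes, size_bytes in samples:
        usage = used_percent(free_bytes, size_bytes)
        result[instance] = max(result.get(instance, usage), usage)
    return result


# Illustrative numbers only.
samples = [
    ("host1:9100", "/", 50.0, 100.0),
    ("host1:9100", "/data", 25.0, 100.0),
    ("host2:9100", "/", 75.0, 100.0),
]
usage = max_disk_usage_by_instance(samples)
```

Each instance ends up with a single value, the usage of its fullest filesystem, which is the number worth alerting on.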
LINE’s development environment
• We rarely use cloud services like AWS because we run in an on-premises environment
– Host information doesn’t change frequently
• That’s why we currently don’t use any service discovery system (like Consul)
– Therefore, we need to use static configuration for Prometheus
• We wanted to manage servers through a browser
• So, we created a tool to manage the server list called promgen (https://github.com/line/promgen)
Agenda
• Background of LINE’s development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana
About promgen
• Simple web app written in Ruby which
– Generates the server list/rules and reloads (POST /-/reload) Prometheus
– Controls alert management
Promgen data model
[Diagram: Service → Project → Farm → Host. One Service contains three Projects, each with a Farm of three Hosts.]
Service of promgen
This service has 3 projects (blog-admin, blog-batch, blog-web)
Project of promgen
[Screenshot labels: alert notification, exporters, hosts]
Host screen
Exporter of promgen
Job name becomes a Prometheus label
prometheus.yaml
rule_files:
  - "/tmp/prom.rule"
scrape_configs:
  - job_name: 'dummy'
    file_sd_configs:
      - files:
          - "/tmp/prom.json"
[Diagram: Promgen updates prom.rule and prom.json]
Promgen doesn’t change prometheus.yaml directly, but instead updates prom.rule/prom.json. Users should run Promgen and Prometheus on the same machine.
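The update flow described above could look roughly like the following Python sketch. Only the POST /-/reload endpoint and the /tmp/prom.json path come from the slides; the atomic-write approach, the helper names, and the Prometheus address http://localhost:9090 are assumptions for illustration, and promgen itself may work differently.

```python
import json
import os
import tempfile
import urllib.request


def write_json_atomically(path: str, data) -> None:
    """Write to a temp file in the same directory, then rename it over the
    target, so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build (but do not send) the POST /-/reload request."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status
```

A caller would run write_json_atomically("/tmp/prom.json", groups) and then reload_prometheus(). Because both files live on local disk, this also shows why the two processes need to share a machine.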
prom.json example
[
  {
    "targets": ["blog.admin1.localhost:9100", "blog.admin2.localhost:9100", ...],
    "labels": {
      "service": "blog",
      "project": "blog-admin",
      "farm": "blog-admin-RELEASE",
      "job": "node"
    }
  },
  ...
]
service/project/farm/job become Prometheus labels.
We use these Prometheus labels in Grafana templates.
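As a rough illustration of how a Service/Project/Farm/Host model could be turned into the target groups shown in prom.json, here is a minimal Python sketch. The class and function names are hypothetical and do not reflect promgen’s actual (Ruby) implementation; the label keys and host names follow the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Farm:
    name: str
    hosts: List[str] = field(default_factory=list)


@dataclass
class Project:
    name: str
    farm: Farm
    # Exporter job name -> port, e.g. {"node": 9100} (assumed shape).
    exporters: Dict[str, int] = field(default_factory=dict)


@dataclass
class Service:
    name: str
    projects: List[Project] = field(default_factory=list)


def build_target_groups(service: Service) -> List[dict]:
    """Emit one file_sd target group per (project, exporter job)."""
    groups = []
    for project in service.projects:
        for job, port in sorted(project.exporters.items()):
            groups.append({
                "targets": [f"{host}:{port}" for host in project.farm.hosts],
                "labels": {
                    "service": service.name,
                    "project": project.name,
                    "farm": project.farm.name,
                    "job": job,
                },
            })
    return groups


blog = Service("blog", [
    Project(
        "blog-admin",
        Farm("blog-admin-RELEASE", ["blog.admin1.localhost", "blog.admin2.localhost"]),
        {"node": 9100},
    ),
])
groups = build_target_groups(blog)
```

Because every target group carries all four labels, a Grafana template variable can filter on any of them.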
How to use labels in templating
• Templating is useful because dashboards can be reused
• Labels correspond to templates in Grafana
• In this example, we use the farm label
Prometheus and Grafana
• Prometheus pulls metrics from exporters
• Grafana’s data source is Prometheus
• Prometheus and Grafana are a perfect combination
• I really appreciate Grafana’s Prometheus plugin
[Diagram: Prometheus scrapes Exporters; Grafana queries Prometheus]
About promgen
• Simple web app written in Ruby which
– Generates the server list/rules and reloads (POST /-/reload) Prometheus
– Controls alert management
Alertmanager
• Alertmanager is powerful because users can avoid a flood of alert notifications
• Deduplication and silences are useful
• Alertmanager helps avoid alert fatigue
• We want to manage alert notification rules and settings easily
– For example, we want to add a HipChat room and mail address through a browser.
• That’s why we implemented a webhook in promgen
Rule control screen
HipChat and Mail
• Users can set a HipChat room and mail address to receive alerts
How to notify alerts
• Promgen has a webhook feature to send alerts to both HipChat and Mail
• If an alert occurs, users receive it through Alertmanager and Promgen
[Diagram: Prometheus → Alertmanager → Promgen → HipChat]
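The notification path above can be sketched as a function that takes an Alertmanager webhook payload and decides what to send where. This is a hedged Python illustration, not promgen’s code: the settings structure, the room name, the mail address, and the function names are hypothetical, and only the payload fields read here (alerts, status, labels, annotations) are assumed from Alertmanager’s webhook JSON.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical per-project notification settings, as a user might enter them
# in the browser. The room name and the address are placeholders.
SETTINGS: Dict[str, Dict[str, List[str]]] = {
    "hadoop": {"hipchat": ["hadoop-alerts"], "mail": ["hadoop-admin@example.com"]},
}


def format_alert(alert: dict) -> str:
    """Build a one-line message from a single alert in the payload."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    line = f"[{status}] {labels.get('alertname', '(no alertname)')}"
    if labels.get("instance"):
        line += f" on {labels['instance']}"
    if annotations.get("summary"):
        line += f": {annotations['summary']}"
    return line


def plan_notifications(payload: dict, settings=SETTINGS) -> List[Tuple[str, str, str]]:
    """Return (channel, destination, message) for every alert in the payload,
    looked up by the alert's project label."""
    planned = []
    for alert in payload.get("alerts", []):
        project = alert.get("labels", {}).get("project", "")
        message = format_alert(alert)
        for channel, destinations in sorted(settings.get(project, {}).items()):
            for destination in destinations:
                planned.append((channel, destination, message))
    return planned


# Illustrative webhook body; the alert itself is made up.
raw = json.dumps({
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "NameNodeDown", "instance": "namenode:50070", "project": "hadoop"},
        "annotations": {"summary": "NameNode is down"},
    }],
})
planned = plan_notifications(json.loads(raw))
```

Routing on the project label is one possible design; it reuses the labels that promgen already attaches to every target.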
Agenda
• Background of LINE’s development environment
• Promgen introduction
• Hadoop/Fluentd cluster monitoring with Prometheus and Grafana
Log analysis platform
• Access logs are sent to HDFS by Fluentd. There are more than 400 Fluentd processes and 150k msg/sec during peak times.
• Fluentd is an OSS log collector written in Ruby, similar to Logstash and Flume
• Our Hadoop cluster is medium-sized, consisting of 40 units.
[Diagram: access logs flow into the Hadoop cluster (HDP 2.4.0: Hive on MRv2/Tez/HDFS)]
Monitoring of Hadoop/Hive cluster
• Developers normally use jmx_exporter to monitor Java middleware
• But I wanted to create an exporter, so I implemented namenode/resourcemanager/jstat exporters
• namenode_exporter uses http://namenode:50070/jmx
• resourcemanager_exporter uses http://resourcemanager:8088/ws/v1/cluster/metrics
• jstat_exporter uses the jstat command
– Honestly, the current jstat_exporter implementation is not so good because the jstat command is executed every time Prometheus pulls metrics
– A cache may be necessary
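The cache mentioned above could be as small as a time-to-live wrapper around the command. The following Python sketch shows the idea; it is not how jstat_exporter is implemented, and the class name, the 10-second TTL, and the process ID in the usage comment are all assumptions.

```python
import subprocess
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Call fetch() at most once per ttl_seconds; otherwise reuse the last value."""

    def __init__(self, fetch: Callable[[], T], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._has_value = False
        self._value: Optional[T] = None
        self._fetched_at = 0.0

    def get(self) -> T:
        now = self._clock()
        if not self._has_value or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
            self._has_value = True
        return self._value


def run_jstat_gc(pid: str) -> str:
    """Run `jstat -gc <pid>` and return its raw text output."""
    completed = subprocess.run(
        ["jstat", "-gc", pid], capture_output=True, text=True, check=True
    )
    return completed.stdout


# Usage sketch (hypothetical PID): scrapes within 10 seconds share one jstat run.
# cache = TTLCache(lambda: run_jstat_gc("12345"), ttl_seconds=10)
# output = cache.get()
```

The trade-off is staleness: values can be up to one TTL old, so the TTL should stay below the scrape interval.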
NameNode FilesTotal monitoring by using namenode_exporter
NameNode Down!
Alerts are also Prometheus metrics, so Grafana can show alerts as annotations
ResourceManager job monitoring by using resourcemanager_exporter
HiveServer2 JVM monitoring by using jstat_exporter
https://issues.apache.org/jira/browse/HIVE-13374
Fluentd buffer monitoring
• Fluentd has a buffer mechanism to retry if the destination is unstable
• fluent-plugin-prometheus enables buffer monitoring
• fluent-plugin-prometheus is a Fluentd plugin and uses the Prometheus Ruby client
Access log count
• fluent-plugin-prometheus enables us to count access logs, but we need sampling because of high CPU usage
• One Fluentd process can’t handle high traffic
HTTP status count
Although the real 4xx/5xx count is not 0, it may appear as 0 because of sampling. So we will switch to Flink.
HTTP status percentage
sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job)
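To show what the division in the query above computes, here is a small Python sketch of the same calculation on made-up numbers. The left-hand side is a rate per (status, job), the right-hand side is the total rate per job, and ignoring(status) group_left lets every status series be divided by its job’s total. The function name and sample rates are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Key: (status, job) -> per-second rate, standing in for the left-hand side.
Rates = Dict[Tuple[str, str], float]


def status_share(rates: Rates) -> Rates:
    """Divide each (status, job) rate by the total rate of the same job."""
    totals: Dict[str, float] = defaultdict(float)
    for (_status, job), rate in rates.items():
        totals[job] += rate  # right-hand side: sum(...) by (job)
    return {
        (status, job): rate / totals[job]
        for (status, job), rate in rates.items()
        if totals[job] > 0
    }


# Illustrative rates only.
rates = {
    ("200", "fluentd"): 90.0,
    ("404", "fluentd"): 8.0,
    ("500", "fluentd"): 2.0,
}
shares = status_share(rates)
```

As written, the query yields a fraction between 0 and 1 per status; multiplying by 100 gives a percentage.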
fluentd_exporter
• Fluentd often has to run as multiple processes because of the GVL (Ruby’s Global VM Lock)
• I implemented fluentd_exporter to monitor Fluentd CPU usage per process
• fluentd_exporter can handle multiple Fluentd processes
My feeling
• Prometheus’s query language is really powerful
– sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job)
• Prometheus and Grafana are a perfect combination
• We created promgen to improve host management and alert notification settings
References
• http://developers.linecorp.com/blog/?p=3908
• https://github.com/line/promgen
• http://www.fluentd.org/
• https://github.com/wyukawa/hadoop_exporter
• https://github.com/wyukawa/jstat_exporter
• https://github.com/wyukawa/fluentd_exporter
• https://github.com/kazegusuri/fluent-plugin-prometheus