Apache Eagle, Dublin Hadoop Summit 2016
Apache Eagle: Monitor Hadoop in Real Time
Yong Zhang | Senior Architect | [email protected]
Manoharan | Senior Product Manager | @lycos_86
Big Data @ eBay
800M Listings*
159M Global Active Buyers *
*Q3 2015 data
7 Hadoop Clusters*
800M HDFS operations (single cluster)*
120 PB Data*
Hadoop @ eBay
HADOOP SECURITY
Authorization & Access Control
Perimeter Security
Data Classification
Activity Monitoring
SecurityMDR
• Perimeter Security
• Authorization & Access Control
• Discovery
• Activity Monitoring
Security for Hadoop
Who is accessing the data?
What data are they accessing?
Is someone trying to access data that they don’t have access to?
Are there any anomalous access patterns?
Is there a security threat?
How to monitor and get notified during or prior to an anomalous event occurring?
Motivation
Apache Eagle
Apache Eagle: Monitor Hadoop in Real Time
Apache Eagle is an open source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks and malicious activities, and block access in real time.
In conjunction with components such as Ranger, Sentry, Knox, DgSecure and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop.
Apache Eagle Composition
Apache Eagle
Integrations
1 Data Activity Monitoring
• HDFS AUDIT
• HIVE QUERY
• HBASE AUDIT
• CASSANDRA AUDIT
• MapR AUDIT
2 HADOOP Performance Metric
• Namenode JMX Metrics
• Datanode JMX Metrics
• System Metrics
• RM JMX Metrics
3 M/R Job Performance Metric
• History Job Metrics
• Running Job Metrics
4 Spark Job Performance Metric
• Spark Job Metrics
• Queue Metrics
Alert Engine
1 Policy Store
2 Metadata API
3 Scalability
4 Extensibility
[Domains] [Applications]
More Integrations
• Cassandra
• MapR
• Mongo DB
• Job
• Queue
Extensibility
Ranger
• As remediation engine
• As generic data source
DgSecure
• Source of truth for data classification
Splunk
• Syslog format output
• EAGLE alert output is the 1st abstraction of analytics and Splunk is the 2nd abstraction
Eagle Architecture
Highlights
1. Turn-key integration: after installation, the user only defines rules.
2. Comprehensive rules on a high volume of data: Eagle solves some problems that are unique to Hadoop.
3. Hot-deployed rules: Eagle does not provide a lot of charts; instead it lets the user write an ad-hoc rule and hot deploy it.
4. Metadata driven: here metadata includes policy, event schema, UI components, etc.
5. Extensibility: Eagle can't succeed alone; it has to be integrated with other systems, for example for data classification and policy enforcement.
6. Monolithic Storm topology: application pre-processing runs together with the alert engine.
Example 1: Integration with HDFS AUDIT log
• Ingestion: KafkaLog4jAppender+Kafka, Logstash+Kafka
• Partition: by user
• Pre-processing: sensitivity join, command re-assembler
[Diagram: Namenode -> Kafka Partition_1, Partition_2 … Partition_N -> Storm KafkaSpout -> Alert Executor_1, Alert Executor_2 … Alert Executor_K; events of User1 and User2 are each routed to a single alert executor]
Data Classification - HDFS
• Browse HDFS file system
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI
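The sensitivity metadata collected this way is what the pre-processing "sensitivity join" attaches to each audit event. A minimal sketch in Python, assuming the metadata is a path-to-label mapping matched by longest path prefix; the `SENSITIVITY` table, the function names, the prefix rule and the "NA" default are illustrative assumptions, not Eagle's implementation:

```python
from typing import Dict, Optional

# Hypothetical sensitivity metadata: HDFS path -> sensitivity type.
SENSITIVITY: Dict[str, str] = {
    "/tmp/private": "PRIVATE",
    "/data/finance": "FINANCE",
}


def lookup_sensitivity(src: str, metadata: Dict[str, str]) -> Optional[str]:
    """Return the label of the longest marked path that equals or contains src."""
    best = None
    for path, label in metadata.items():
        if src == path or src.startswith(path.rstrip("/") + "/"):
            if best is None or len(path) > len(best[0]):
                best = (path, label)
    return best[1] if best else None


def sensitivity_join(event: Dict[str, str], metadata: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the audit event enriched with a sensitivityType field."""
    enriched = dict(event)
    # "NA" for unmarked paths is an assumption made for this sketch.
    enriched["sensitivityType"] = lookup_sensitivity(event["src"], metadata) or "NA"
    return enriched
```

With this table, an event whose `src` is `/tmp/private` is tagged `PRIVATE`, while `/tmp/private2` is left unmarked because the match is done on whole path components.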
• One user command generates multiple HDFS audit events
• Eagle does reverse engineering to figure out the original user command
Example:
COPYFROMLOCAL_PATTERN =
    "every a = eventStream[cmd=='getfileinfo'] " +
    "-> b = eventStream[cmd=='getfileinfo' and user==a.user and src==str:concat(a.src,'._COPYING_')] " +
    "-> c = eventStream[cmd=='create' and user==a.user and src==b.src] " +
    "-> d = eventStream[cmd=='getfileinfo' and user==a.user and src==b.src] " +
    "-> e = eventStream[cmd=='delete' and user==a.user and src==a.src] " +
    "-> f = eventStream[cmd=='rename' and user==a.user and src==b.src and dst==a.src]"
2015-11-20 00:06:47,090 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private dst=null perm=null proto=rpc
2015-11-20 00:06:47,185 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private._COPYING_ dst=null perm=null proto=rpc
2015-11-20 00:06:47,254 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=create src=/tmp/private._COPYING_ dst=null perm=root:hdfs:rw-r--r-- proto=rpc
2015-11-20 00:06:47,289 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private._COPYING_ dst=null perm=null proto=rpc
2015-11-20 00:06:47,609 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=delete src=/tmp/private dst=null perm=null proto=rpc
2015-11-20 00:06:47,624 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=rename src=/tmp/private._COPYING_ dst=/tmp/private perm=root:hdfs:rw-r--r-- proto=rpc
User Command Re-assembly
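To make the re-assembly idea concrete, here is a rough Python sketch of the six-step copyFromLocal sequence as a small state machine. It is not how Eagle does it (Eagle expresses the pattern in Siddhi); the function name, the event dictionaries and the output record are hypothetical, and unlike a real CEP pattern this sketch never expires unfinished matches.

```python
from typing import Dict, List, Optional

SUFFIX = "._COPYING_"


def reassemble_copy_from_local(events: List[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Detect copyFromLocal commands in a time-ordered list of audit events.

    Steps after the first getfileinfo (a): getfileinfo(tmp), create(tmp),
    getfileinfo(tmp), delete(target), rename(tmp -> target).
    """
    partial = []   # in-progress matches: {"user", "src", "step"}
    commands = []  # completed user commands
    for ev in events:
        cmd, user, src, dst = ev["cmd"], ev["user"], ev["src"], ev.get("dst")
        still_open = []
        for m in partial:
            tmp = m["src"] + SUFFIX
            expected = [
                ("getfileinfo", tmp),   # step b
                ("create", tmp),        # step c
                ("getfileinfo", tmp),   # step d
                ("delete", m["src"]),   # step e
                ("rename", tmp),        # step f
            ][m["step"]]
            hit = user == m["user"] and (cmd, src) == expected
            if hit and m["step"] == 4:
                hit = dst == m["src"]   # the rename must target the original path
            if hit:
                m["step"] += 1
            if m["step"] == 5:
                commands.append({"user": m["user"], "cmd": "copyFromLocal", "dst": m["src"]})
            else:
                still_open.append(m)
        partial = still_open
        if cmd == "getfileinfo":        # step a: "every" opens a new candidate
            partial.append({"user": user, "src": src, "step": 0})
    return commands
```

Fed the six audit events of the sample log (user `root`, target `/tmp/private`), the function returns a single `copyFromLocal` record for `/tmp/private`.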
• Policy evaluation is stateful (one user's data has to go to one physical bolt)
• Partition by user all the way (hash)
• Users are not balanced at all
• Greedy algorithm: https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm
Data Skew Problem
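The greedy algorithm referenced for the skew problem can be sketched as follows: sort users by weight, heaviest first, and always give the next user to the currently lightest bolt. The per-user weights (for example, observed event counts) and the function name are assumptions for illustration; the slides do not say how Eagle measures user weight.

```python
import heapq
from typing import Dict, List


def greedy_partition(user_weights: Dict[str, int], num_bolts: int) -> List[List[str]]:
    """Assign users to bolts: heaviest user first, always to the lightest bolt."""
    heap = [(0, bolt) for bolt in range(num_bolts)]  # (current load, bolt index)
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_bolts)]
    # Sort by descending weight; the user name breaks ties deterministically.
    for user, weight in sorted(user_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, bolt = heapq.heappop(heap)
        assignment[bolt].append(user)
        heapq.heappush(heap, (load + weight, bolt))
    return assignment
```

For weights `{"u1": 8, "u2": 7, "u3": 6, "u4": 5, "u5": 4}` and two bolts this yields loads of 17 and 13. The greedy result is not guaranteed to be optimal, but each user still maps to exactly one bolt, which stateful evaluation requires.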
Policy weight is not even
• Regex policy is CPU intensive
• Window based policy is memory intensive
Computation Skew Problem
Example 2: Integration with Hive
• Ingestion: Yarn API
• Partition: user
• Pre-processing: sensitivity join, Hive SQL parser
Data Classification - Hive
• Browse Hive databases/tables/columns
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI
Eagle Alert Engine Overview
1 Runs CEP engine on Apache Storm
• Use CEP engine as library (Siddhi CEP)
• Evaluate policy on streamed data
• Rule is hot deployable
2 Inject policy dynamically
• API
• Intuitive UI
3 Scalability
• Computation: # of policies (policy placement)
• Storage: # of events (event partition)
4 Extensibility for policy enforcement
• Post-alert processing with plugin
Run CEP Engine on Storm
[Diagram: each Storm bolt hosts several CEPWorkers; a policy check thread reads the policy store and event schema through the Metadata API; policies 1-6 are split across bolts (policy1,2,3 in one bolt, policy4,5,6 in another), and event1 is delivered to every worker that holds a policy for it]
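The hot-deploy behavior in this picture can be approximated in a few lines of Python: a worker keeps its own copy of the policies and a refresh step (standing in for the policy check thread) re-syncs it with the policy store. The `CEPWorker` class below, the dictionary used as a policy store and the use of plain Python predicates instead of Siddhi queries are all simplifying assumptions.

```python
from typing import Callable, Dict, List

Policy = Callable[[dict], bool]


class CEPWorker:
    """Holds the policies assigned to one worker and evaluates events against them."""

    def __init__(self) -> None:
        self.policies: Dict[str, Policy] = {}

    def refresh(self, store: Dict[str, Policy]) -> None:
        """Sync with the policy store: pick up new or changed policies, drop deleted ones."""
        self.policies = dict(store)

    def evaluate(self, event: dict) -> List[str]:
        """Return the sorted ids of the policies that the event triggers."""
        return sorted(pid for pid, policy in self.policies.items() if policy(event))
```

A policy added to the store takes effect on the next `refresh` call, without restarting the worker; in a real deployment the refresh would run periodically on its own thread.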
Primitives – event, policy, alert
Raw Event
2015-10-11 01:00:00,014 INFO FSNamesystem.audit: allowed=true [email protected] (auth:KERBEROS) ip=/10.0.0.1 cmd=getfileinfo src=/tmp/private dst=null perm=null
Alert Event
timestamp, cmd, src, dst, ugi, sensitivityType, securityZone
Policy
viewPrivate: from hdfsAuditLogEventStream[(cmd=='getfileinfo') and (src=='/tmp/private')]
Alert
2015-10-11 01:00:09[UTC] hdfsAuditLog viewPrivate user_tom /10.0.0.1 The Policy "viewPrivate" has been detected with the below information: timestamp="1445993770932" allowed="true" cmd="getfileinfo" host="/10.0.0.1" sensitivityType="PRIVATE" securityZone="NA" src="/tmp/private" dst="NA" user="user_tom"
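As an illustration of how a raw audit line becomes a structured event that a policy can test, the following Python sketch parses the key=value fields and applies the viewPrivate condition. The parser is a simplification written for this example (it would break on paths containing spaces) and is not Eagle's parser.

```python
import re
from typing import Dict

# key=value pairs; a value runs until the next whitespace. Tokens without
# an "=" sign, such as "(auth:SIMPLE)", are ignored.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line: str) -> Dict[str, str]:
    """Split an 'FSNamesystem.audit:' line into a timestamp and key=value fields."""
    head, _, body = line.partition("FSNamesystem.audit:")
    event = {"timestamp": head.strip().rsplit(" ", 1)[0]}  # drop the log level
    event.update(PAIR.findall(body))
    return event


def view_private(event: Dict[str, str]) -> bool:
    """Python equivalent of the viewPrivate filter condition."""
    return event.get("cmd") == "getfileinfo" and event.get("src") == "/tmp/private"
```

Applied to a line such as `2015-11-20 00:06:47,090 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private dst=null perm=null proto=rpc`, the parser yields `cmd`, `src`, `ugi` and the other fields as dictionary entries, and `view_private` returns true.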
Event Schema
• Modeling event
Policy Capabilities
1 Single event evaluation
• threshold check with various conditions
2 Event window based evaluation
• various window semantics (time/length, sliding/batch window)
• comprehensive aggregation support
3 Correlation for multiple event streams
• SQL-like join
4 Pattern match and sequence
• a happens followed by b
Powered by Siddhi 3.0.5, but Eagle provides dynamic capabilities and intuitive API/UI
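To give a feel for window-based evaluation, here is a small Python sketch of a sliding time window with a count threshold. It only mimics the semantics for in-order timestamps; the class name and parameters are hypothetical and this is not how Siddhi implements windows.

```python
from collections import deque
from typing import Deque


class SlidingTimeWindowCount:
    """Fire when more than `threshold` events fall within the last `window_ms`."""

    def __init__(self, window_ms: int, threshold: int) -> None:
        self.window_ms = window_ms
        self.threshold = threshold
        self.timestamps: Deque[int] = deque()

    def on_event(self, timestamp_ms: int) -> bool:
        """Record one event (timestamps must be non-decreasing) and test the threshold."""
        self.timestamps.append(timestamp_ms)
        # Expire events that fell out of the window ending at the newest event.
        while self.timestamps[0] <= timestamp_ms - self.window_ms:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

The memory held by the deque grows with the event rate inside the window, which is one way to see why window-based policies are memory intensive.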
Some policy examples
1 Namenode master/slave lag
from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.journaltransaction.lastappliedorwrittentxid"]
  -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host != a.host and (max(convert(a.value, "long")) + 100) <= max(convert(value, "long"))]
  within 5 min
select a.host as hostA, a.value as transactIdA, b.host as hostB, b.value as transactIdB
insert into tmp;
2 Namenode last checkpoint time
from hadoopJmxMetricEventStream[metric == "hadoop.namenode.dfs.lastcheckpointtime" and (convert(value, "long") + 18000000) < timestamp]
select metric, host, value, timestamp, component, site
insert into tmp;
3 Namenode HA state change
from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.hastate.active.count"]
  -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and (convert(a.value, "long") != convert(value, "long"))]
  within 10 min
select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site
insert into tmp;
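The constant in the last-checkpoint-time policy is easier to read once converted: 18000000 ms is 5 hours. The Python sketch below repeats the comparison from the policy filter, assuming (as the expression implies) that both the metric value and the event timestamp are epoch milliseconds; the function and constant names are illustrative.

```python
CHECKPOINT_LAG_MS = 18_000_000  # constant taken from the policy expression

# 18,000,000 ms / (1000 ms * 60 s * 60 min) = 5 hours
CHECKPOINT_LAG_HOURS = CHECKPOINT_LAG_MS / (1000 * 60 * 60)


def checkpoint_is_stale(last_checkpoint_ms: int, event_timestamp_ms: int) -> bool:
    """Same comparison as the policy filter: (value + 18000000) < timestamp."""
    return last_checkpoint_ms + CHECKPOINT_LAG_MS < event_timestamp_ms
```

So the policy fires when the metric is reported more than 5 hours after the last checkpoint it records.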
Define policy in UI and API
1 Create policy using API
curl -u ${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD} -X POST -H 'Content-Type:application/json' \
  "http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
  -d '
  [
    {
      "prefix": "alertdef",
      "tags": {
        "site": "sandbox",
        "application": "hadoopJmxMetricDataSource",
        "policyId": "capacityUsedPolicy",
        "alertExecutorId": "hadoopJmxMetricAlertExecutor",
        "policyType": "siddhiCEPEngine"
      },
      "description": "jmx metric ",
      "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp; \",\"type\":\"siddhiCEPEngine\"}",
      "enabled": true,
      "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
      "notificationDef": "[{\"sender\":\"[email protected]\",\"recipients\":\"[email protected]\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
    }
  ]
  '
2 Create policy using UI
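The REST call for creating a policy nests JSON inside JSON strings (`policyDef`, `dedupeDef`, `notificationDef`), which is easy to get wrong when escaping by hand. A hedged Python sketch that builds the same payload with `json.dumps` and posts it with the standard library; the endpoint path and field names are copied from the curl example, while the function names and parameters are assumptions made here:

```python
import json
import urllib.request
from base64 import b64encode


def build_policy_payload(policy_id: str, expression: str, sender: str, recipients: str) -> bytes:
    """Build the AlertDefinitionService entity list; nested definitions are JSON strings."""
    entity = {
        "prefix": "alertdef",
        "tags": {
            "site": "sandbox",
            "application": "hadoopJmxMetricDataSource",
            "policyId": policy_id,
            "alertExecutorId": "hadoopJmxMetricAlertExecutor",
            "policyType": "siddhiCEPEngine",
        },
        "description": "jmx metric ",
        "policyDef": json.dumps({"expression": expression, "type": "siddhiCEPEngine"}),
        "enabled": True,
        "dedupeDef": json.dumps({"alertDedupIntervalMin": 10, "emailDedupIntervalMin": 10}),
        "notificationDef": json.dumps([{
            "sender": sender, "recipients": recipients,
            "subject": "missing block found.", "flavor": "email",
            "id": "email_1", "tplFileName": "",
        }]),
    }
    return json.dumps([entity]).encode("utf-8")


def create_policy(host: str, port: int, user: str, password: str, payload: bytes) -> bytes:
    """POST the payload with HTTP basic auth, like `curl -u user:password -X POST`."""
    url = f"http://{host}:{port}/eagle-service/rest/entities?serviceName=AlertDefinitionService"
    token = b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url, data=payload, method="POST",
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Letting `json.dumps` produce the inner strings removes the need for the triple-backslash escaping seen in the shell version.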
Scalability
• Scale with # of events
• Scale with # of policies
Eagle Service
As of 0.3.0, Eagle stores metadata and statistics in HBASE and supports Druid as a metric store.
1 Data to be stored
Metadata
• Policy
• Event schema
• Site/Application/UI Features
Statistics
• # of events evaluated per second
• audit for policy change
Raw data
• Druid for metric
• HBASE for M/R job/task etc.
• ES for log (future)
2 Storage
HBASE
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase Coprocessor
Druid
• Consume data from Kafka
3 API/UI
HBASE
• filter, groupby, sort, top
Druid
• Druid query API
• Dashboard in Eagle
Alert Engine Limitations in Eagle 0.3
1 High cost for integrating
• Coding for onboarding new data source
• Monolithic topology for pre-processing and alert
Even if it is a simple data source, you have to write a Storm topology and then deploy it
2 Not multi-tenant
• Alert engine is embedded into application
• Many separate Storm topologies
One Storm topology even for one trivial data source
3 Policy capability restricted by event partition
• Can't do ad-hoc group-by policy expressions, for example changing from group-by user to group-by cmd
If traffic is partitioned by user, a policy only supports user-based group-by expressions
4 Correlation is not declarative
• Coding for correlating existing data sources
Can't declare correlations for multiple metrics
5 Stateful policy evaluation
• Fail over when bolt is down
How to replay one week of history data when a node is down?
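The group-by restriction can be shown with a toy example in Python: when events are routed by user, each bolt sees all events of its users, so per-user counts are complete, but per-cmd counts are split across bolts. The `toy_hash` partitioner and the sample events are made up for the demonstration.

```python
from collections import Counter
from typing import Dict, List


def toy_hash(user: str) -> int:
    """Stable stand-in for the partitioner's hash (illustrative only)."""
    return sum(user.encode("utf-8"))


def local_group_counts(events: List[Dict[str, str]], num_bolts: int, field: str) -> List[Counter]:
    """Route events by user, then count per `field` value independently inside each bolt."""
    bolts = [Counter() for _ in range(num_bolts)]
    for event in events:
        bolts[toy_hash(event["user"]) % num_bolts][event[field]] += 1
    return bolts


EVENTS = [
    {"user": "alice", "cmd": "delete"},
    {"user": "alice", "cmd": "delete"},
    {"user": "alice", "cmd": "open"},
    {"user": "bob", "cmd": "delete"},
    {"user": "bob", "cmd": "delete"},
]
```

With two bolts, `local_group_counts(EVENTS, 2, "user")` gives complete per-user counts, whereas grouping by `"cmd"` shows 2 deletes in each bolt: a policy such as "more than 3 deletes" would never fire even though 4 deletes occurred overall.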
Eagle Next Releases
Eagle 0.4
• Improve user experience
  • Remote start Storm topology
  • Metadata stored in RDBMS
Eagle 0.5
• Alert Engine as Platform
  • No monolithic topology
  • Declarative data source onboarding
  • Easy correlation
  • Support policies with any field group-by
  • Elastic capacity management
USER PROFILE ALGORITHMS… Eigen Value Decomposition
• Compute mean and variance
• Compute Eigen Vectors and determine Principal Components
• Normal data points lie near first few principal components
• Abnormal data points lie further from first few principal components and closer to later components
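The steps listed for eigenvalue decomposition can be sketched in plain Python for the simplest case: standardize with mean and variance, find the first principal component by power iteration, and score a point by its distance from that component. This is a teaching sketch under assumptions (a single component, made-up two-feature user activity vectors, a start vector that is not orthogonal to the dominant eigenvector), not the algorithm as implemented in Eagle.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_and_std(rows: Sequence[Sequence[float]]) -> Tuple[Vector, Vector]:
    """Per-feature mean and standard deviation (population variance)."""
    n, dims = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dims)]
    var = [sum((r[j] - mean[j]) ** 2 for r in rows) / n for j in range(dims)]
    return mean, [math.sqrt(v) or 1.0 for v in var]  # guard constant features


def first_principal_component(z: Sequence[Vector], iterations: int = 200) -> Vector:
    """Power iteration on the covariance matrix of the standardized rows z."""
    n, dims = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(dims)] for i in range(dims)]
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit(rows: Sequence[Sequence[float]]) -> Tuple[Vector, Vector, Vector]:
    """Learn mean, std and the first principal component from normal activity."""
    mean, std = mean_and_std(rows)
    z = [[(x - m) / s for x, m, s in zip(r, mean, std)] for r in rows]
    return mean, std, first_principal_component(z)


def residual(point: Sequence[float], mean: Vector, std: Vector, pc: Vector) -> float:
    """Distance of a standardized point from the line spanned by the component."""
    z = [(p - m) / s for p, m, s in zip(point, mean, std)]
    proj = sum(a * b for a, b in zip(z, pc))
    return math.sqrt(sum((a - proj * b) ** 2 for a, b in zip(z, pc)))
```

A point that follows the learned pattern has a residual near zero, while a point that breaks the correlation between features has a large residual and would be flagged once it exceeds a chosen threshold.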
USER PROFILE ARCHITECTURE
http://eagle.incubator.apache.org
GitHub: https://github.com/apache/incubator-eagle
Welcome Contributors in Apache Eagle
Dev Mail List
Twitter: @TheApacheEagle
Q & A