32
Apache Eagle: Secure Hadoop in Real Time http://eagle.incubator.apache.org/ Hao Chen (@haozch) | Ralph Su (@raphaelsu)

Apache Eagle: Secure Hadoop in Real Time

Embed Size (px)

Citation preview

Page 1: Apache Eagle: Secure Hadoop in Real Time

Apache Eagle: Secure Hadoop in Real Timehttp://eagle.incubator.apache.org/

Hao Chen (@haozch) | Ralph Su (@raphaelsu)

Page 2: Apache Eagle: Secure Hadoop in Real Time

About Us

Apache Eagle Co-creator,PMC & [email protected]

Sr. Software Engineer at [email protected]

Apache Eagle PMC & [email protected]

MTS. Software Engineer at [email protected]

Ralph Su / 苏良飞

Hao Chen / 陈浩

Page 3: Apache Eagle: Secure Hadoop in Real Time

Agenda• Introduction• Architecture• Case Study• Q & A

Page 4: Apache Eagle: Secure Hadoop in Real Time

is a distributed real-time monitoring and alerting engine for hadoop from eBay

Open sourced as Apache Incubator Project on Oct 26th 2015

Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks/ malicious activity and block access in real time.

See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle

Apache Eagle

Page 5: Apache Eagle: Secure Hadoop in Real Time

Eagle  was  initialized  by  end  of  2013  for  hadoop ecosystem  monitoring  as  any  existing  tool  like  zabbix,  ganglia  can  not  handle  the  huge  volume  of  metrics/logs  generated  by  hadoop system  in  eBay.

2014/201510,000  nodes150,000+  cores200  PB2000+  user

3000+  nodes10,000+  cores50+  PB2012

20111000+  nodes10,000+  cores10+  PB

100+  nodes1000  +  cores1  PB2010

200950+  nodes

20071-­‐10  nodes

Hadoop  Data  • Security• ActivityHadoop  Platform  • Heath• Availability• Performance

Hadoop  @  eBay  Inc

Initiative  – Why  build  Eagle?

7+   CLUSTERS10000+ NODES200+  PB DATA

10  B+   EVENTS  /  DAY500+   METRIC  TYPES50,000+   JOBS  /  DAY50,000,000+   TASKS  /  DAY

Page 6: Apache Eagle: Secure Hadoop in Real Time

Initiative  – Core  ChallengesLarge  Scale  Processing  and  Storage• Scale  IO  Complexity  (Stream)• Scale  Computation  Complexity  (Policy)• Scale  Storage  &  Query  (Event,  Log,  Metric)

1

Real-­‐Time  Alerting• Real-­‐time  Data  Collection• Real-­‐time  Stream  Processing• Real-­‐time  Alerting

2

Expressive Correlation  Model• Complex  policy  model• Stream  GroupBy,  Join,  Window• Machine  Learning

3

Hadoop  Ecosystem  Integration• Data  Source  Integration• Anomaly  Detection  Model

4

Page 7: Apache Eagle: Secure Hadoop in Real Time

1 Eagle  FrameworkDistributed   real-­‐time  framework  for  efficiently  developing   highly  scalable  monitoring   applications

Eagle  Ecosystem

2 Eagle  AppsSecurity/  Hadoop/  Operational  Intelligence  /  …

3 Eagle  InterfaceREST  Service  /  Management  UI  /  Customizable  Analytics  Visualization

4 Eagle  IntegrationAmbari /  Docker /  Ranger  /  Dataguise

Appsà Securityà Hadoopà Cloudà Database

Interfaceà Web Portalà REST Servicesà Analytics Visualization

Integrationà Ambarià Dockerà Rangerà Dataguise Eagle

Framework

Open  SourceCommunity-­‐driven   and  Cross-­‐community  cooperation5

Page 8: Apache Eagle: Secure Hadoop in Real Time

Eagle  @  eBay  Inc.7+   CLUSTERS10000+ NODES200+  PB DATA

10  B+   EVENTS  /  DAY500+   METRIC  TYPES50,000+   JOBS  /  DAY50,000,000+   TASKS  /  DAY

Eagle  Deployment  at  eBay  Production• 100+  security  policies• 8  nodes• 30  worker  process• 64  kafka partitionEagle  Performance• Avg Latency:   ~  50  ms• Max  Throughput/Cluster:  300  k  /s

Page 9: Apache Eagle: Secure Hadoop in Real Time

Eagle  Architecture

Page 10: Apache Eagle: Secure Hadoop in Real Time

Distributed  Policy  Engine

METADATA  MANAGER

AlertExecutor_{1}

AlertExecutor_{2}

AlertExecutor_{N}

Real  Time  Alerts

Alerts

Policy  Management

Policy

Dynamical   Policy   Deployment

Real-­‐time  Event  Stream

Stream_{1}

Stream_{*}

Dynamical   Stream  Schema

Stream  Processing

• Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus)• Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy• Linear  Scalability:  Data  volume scale +  Computation  scale• Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle

Page 11: Apache Eagle: Secure Hadoop in Real Time

METADATA  MANAGER

AlertExecutor_{1}

AlertExecutor_{2}

AlertExecutor_{N}

Real  Time  Alerts

Alerts

Policy  Management

Policy

Dynamical   Policy   Deployment

Real-­‐time  Event  Stream

Stream_{1}

Stream_{*}

Dynamical   Stream  Schema

Stream  Processing

from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;

• Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus)• Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy• Linear  Scalability:  Data  volume scale +  Computation  scale• Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle

Distributed  Policy  Engine

Page 12: Apache Eagle: Secure Hadoop in Real Time

Declarative  Policy

• Filter• Join• Aggregation:  Avg,  Sum  ,  Min,  Max,  etc• Group by• Having• Stream handlers  for  window:  TimeWindow,  Batch  Window,  

Length  Window  • Conditions  and  Expressions:  and,  or,  not,  ==,!=,  >=,  >,  <=,  <,  

and  arithmetic  operations• Pattern  Processing• Sequence  processing• Event  Tables:  intergrate historical  data  in  realtime processing• SQL-­‐Like  Query:  Query,  Stream  Definition   and  Query  Plan  

compilation

Distributed  SQL  on  Streaming  :  Siddhi  CEP  +  Storm  by  default

from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;

Page 13: Apache Eagle: Secure Hadoop in Real Time

Declarative  Policy  -­‐ Examples

from hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9] select metric, host, value, timestamp, component, site insert into alertStream;

Example  1:  Alert  if  hadoop namenode capacity  usage  exceed  90  percentages

from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and a.value != value)] within 10 min select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site insert into alertStream;

Example  2:  Alert  if  hadoop namenode HA  switches

Page 14: Apache Eagle: Secure Hadoop in Real Time

1

Distributed  Policy  Engine  -­‐ Scalability

2

Distributed   Streaming   Cluster  Environment

AlertExecutor_{1}

AlertExecutor_{2}

AlertExecutor_{N}

Stream_{1}

Stream_{*}

Stream  Processing

Dynamic  policy  partition  by  {event}  *  {policy}

• N  Users  with  3  partitions,  M  policies  with  2  partitions,  then  3*2  physical  tasks• Physical  partition  +  policy-­‐level  partition

Linear  Scalability Principle

Page 15: Apache Eagle: Secure Hadoop in Real Time

3

Algorithm Weights  of  Executors   By  Partition   User

Random 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024

Greedy 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837

Stream  Partition  Skew  (15:1)

Distributed  Policy  Engine – OptimizationStream  Partition  Problem    https://en.wikipedia.org/wiki/Partition_problem

Page 16: Apache Eagle: Secure Hadoop in Real Time

Distributed  Real-­time    Policy  Engine

Siddhi  CEP  Policy  Evaluator

Machine  Learning  Policy  Evaluator• Support  WSO2  Siddhi  CEP  as  first  class

• Extensible  Policy  Engine  Implementation• Extensible  Policy  Lifecycle  Management• Metadata-­‐based  Module  Management

Extensible   Policy  Evaluator

public interface PolicyEvaluatorServiceProvider {public String getPolicyType(); // literal string to identify one type of policypublic Class getPolicyEvaluator(); // get policy evaluator implementationpublic List getBindingModules(); // policy text with json format to object mapping

}

METADATA  MANAGER

Policy/Metadata

Distributed  Policy  Engine – ExtensibilityPolicy  Engine  Extensibility

Page 17: Apache Eagle: Secure Hadoop in Real Time

Stream  Processing  Framework

Optimizer

1.  Development 2.  Optimization 3.  Compile   to  native  app

Use  eagle  alert  framework  as  library

Page 18: Apache Eagle: Secure Hadoop in Real Time

• Light-­‐weight  ORM  Framework  for  HBase/RDMBS• Full-­‐function   SQL-­‐Like  REST  Query  • Optimized  Rowkey design  for  time-­‐series  data• Native  HBase Coprocessor• Secondary  Index  Support

@Table("alertdef")@ColumnFamily("f")@Prefix("alertdef")@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)@JsonIgnoreProperties(ignoreUnknown = true)@TimeSeries(false)@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})@Indexes({

@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),})public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{

@Column("a")private String desc;@Column("b")private String policyDef;@Column("c")private String dedupeDef;

Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}

Large  Scale  Storage  and  Query

Page 19: Apache Eagle: Secure Hadoop in Real Time

Uniform  HBase rowkey design

• Metric

• Entity

• Log

Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …

Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …

Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …

Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …Rowvalue ::= Log Content

Large  Scale  Storage  and  Query

http://opentsdb.net

Page 20: Apache Eagle: Secure Hadoop in Real Time

Multi-­‐Tenants  – Topology  Scheduler• Dynamical  Topology  Management• No-­‐downtime  Topology  Maintenance• Topology  High  Availability  &  Balance• Resource  Scheduling  &  Isolation:  Runtime,Woker,  Topology  or  Cluster

Page 21: Apache Eagle: Secure Hadoop in Real Time

Multi-­‐Tenants  -­‐ Dynamical  Correlation

• Dynamical  Correlation  on  Runtime:• Sort,  Groupby,  Join,  Window• Hot  Deploy  Logic• Policy  Management

• Multi-­‐Correlation  on  the  Single  Stream• Group  by  different  fields  of  same  stream• Resort  same  stream  by  different  order• Join  certain  stream  in  different  way

• Multi  Correlation  on  Multi  Streams• Cross  Streams  Join• Real-­‐time  &  Historical  Stream  Join

Page 22: Apache Eagle: Secure Hadoop in Real Time

Multi-­‐Tenants  -­‐ Dynamical  Correlation

Page 23: Apache Eagle: Secure Hadoop in Real Time

Eagle  Use  Cases

Data  Activity  MonitoringSecure  Hadoop  in  Realtime a  data  activity  monitoring  solution  to   instantly  identify  access  to  sensitive  data,    recognize  attacks/  malicious  activity  and  block  access  in  real  time.

Job  Performance  MonitoringHadoop,  Spark  Job  Profiling  &  Performance  Monitoring,  Cluster  Health  Anomaly  Detection

1

2

4eBay  Unified  Monitoring  PlatformProvide  unified  monitoring-­‐as-­‐service  for  everything  around  infrastructure  or  business.

3

Page 24: Apache Eagle: Secure Hadoop in Real Time

Eagle  Data  Activity  MonitoringData  Loss  PreventionGet  alerted  and  stop  a  malicious  user  trying  to  copy,  delete,  move  sensitive  data  from  the  Hadoop  cluster.

Malicious  LoginsDetect  login  when  malicious  user  tries  to  guess  password.  Eagle  creates  user  profiles  using  machine   learning  algorithm  to  detect  anomalies

Unauthorized  accessDetect  and  stop  a  malicious  user  trying  to  access  classified  data  without  privilege.  

Malicious  user  operationDetect  and  stop  a  malicious  user  trying  to  delete  large  amount  of  data.  Operation  type  is  one  parameter  of  Eagle  user  profiles.  Eagle  supports  multiple  native  operation  types.

User

Privileges

Common  Data  Sets

Patterns

CommandsZones

Query

Columns

Page 25: Apache Eagle: Secure Hadoop in Real Time

§ HDFS  Policies§ Access  to  Sensitive  files§ HDFS  Commands  used  (read,  write,  update…)§ Client  host  § Destination§ Security  Zones

§ Hive  Policies§ Access  to  tables  with  PII  Data§ SQL  Query  Profiles§ Client  Host§ Security  Zone

Eagle  Data  Activity  MonitoringHadoop  Data  Security:  Detect  anomalies  in  accessing  HDFS  and  Hive

Page 26: Apache Eagle: Secure Hadoop in Real Time

Offline:  Determine  bandwidth   from  training  dataset  the  kernel  density  function   parameters  (KDE)Online:  If  a  test  data  point   lies  outside   the  trained  bandwidth,   it  is  anomaly  (Policy)

PCs(Principle  Components)  in  EVD(Eigenvalue  Value  Decomposition)

Kernel  Density  Function

Eagle  Machine  Learning  User  Profile

Page 27: Apache Eagle: Secure Hadoop in Real Time

Use  Case Detect  node  anomaly  by  analyzing   task  failure  ratio  across  all  nodesAssumption Task  failure  ratio  for  every  node  should  be  approximately  equalAlgorithm Node  by  node  compare  (symmetry  violation)  and  per  node  trend

Eagle  Job  Performance  Monitoring

Page 28: Apache Eagle: Secure Hadoop in Real Time

Alerting:  Anomaly  Detection  Alerting

Insight:  Task  failure  drill-­‐down

Insight:  Task  failure  drill-­‐down

Task  Failure  based  Anomaly  Host  Detection  

Page 29: Apache Eagle: Secure Hadoop in Real Time

Counters  &  Features

Use  Case Detect  data  skew  by  statistics  and  distributions   for  attempt  execution  durations  and  countersAssumption      Duration  and  counters  should  be  in  normal  distribution

mapDurationreduceDurationmapInputRecordsreduceInputRecordscombineInputRecordsmapSpilledRecordsreduceShuffleRecordsmapLocalFileBytesReadreduceLocalFileBytesReadmapHDFSBytesReadreduceHDFSBytesRead

Modeling  &  Statistics

AvgMinMax  DistributionsMax  z-­scoreTop-­NCorrelation

Threshold  &  Detection

Counters

Correlation  >  0.9  &  Max(Z-­Score)  >  90%  

Hadoop  Job  Data  Skew  Detection

Page 30: Apache Eagle: Secure Hadoop in Real Time

Counters  &  Features

Use  Case Detect  data  skew  by  statistics  and  distributions  for  attempt  execution  durations  and  countersAssumption      Duration  and  counters  should  be  in  normal  distribution

mapDurationreduceDurationmapInputRecordsreduceInputRecordscombineInputRecordsmapSpilledRecordsreduceShuffleRecordsmapLocalFileBytesReadreduceLocalFileBytesReadmapHDFSBytesReadreduceHDFSBytesRead

Modeling  &  Statistics

AvgMinMax  DistributionsMax  z-­scoreTop-­NCorrelation

Threshold  &  Detection

Counters

Correlation  >  0.9  &  Max(Z-­Score)  >  90%  

Hadoop  Job  Data  Skew  Detection

Page 31: Apache Eagle: Secure Hadoop in Real Time

Learn  More  about  Apache  EagleCommunity• Website:  http://eagle.incubator.apache.org• Github:  http://github.com/apache/incubator-­‐eagle• Mailing  list:  [email protected]

Resources• Documentation:  http://eagle.incubator.apache.org/docs/• Docker images:  https://hub.docker.com/r/apacheeagle/sandbox/

Publications  &  Patents• EAGLE:  USER  PROFILE-­‐BASED  ANOMALY  DETECTION  IN  HADOOP  CLUSTER  (IEEE)• EAGLE:  DISTRIBUTED  REALTIME  MONITORING  FRAMEWORK  FOR  HADOOP  CLUSTER

Page 32: Apache Eagle: Secure Hadoop in Real Time

Q  &  A

http://eagle.incubator.apache.org

The  slide  is  licensed  under  Creative  Commons  Attribution  4.0  International  license.