Apache NiFi Crash Course Intro Rafael Coss - @racoss Hadoop Summit – Tokyo Oct 2016

Hadoop Summit Tokyo Apache NiFi Crash Course


Page 1: Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi Crash Course Intro
Rafael Coss - @racoss
Hadoop Summit – Tokyo, Oct 2016

Page 2: Hadoop Summit Tokyo Apache NiFi Crash Course

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

Data Flow & Streaming Fundamentals

What is dataflow and what are the challenges?

Apache NiFi

Architecture

Lab

Page 3: Hadoop Summit Tokyo Apache NiFi Crash Course

Data Flow & Streaming Fundamentals

Page 4: Hadoop Summit Tokyo Apache NiFi Crash Course

Connected Data World

Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables

User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo

44ZB in 2020

Page 5: Hadoop Summit Tokyo Apache NiFi Crash Course

Let’s Connect A to B

Producers A.K.A Things: Anything AND Everything

Internet!

Consumers
• User
• Storage
• System
• …More Things

Page 6: Hadoop Summit Tokyo Apache NiFi Crash Course

What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
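To make the batch versus stream contrast concrete, here is a minimal Python sketch. It is not NiFi code; the function names and sample readings are made up for illustration. The batch function evaluates a stored data set once, on request, while the streaming function keeps updating its result as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(stored_readings: List[float]) -> float:
    """Batch: evaluate the whole data set at rest, once, on request."""
    return sum(stored_readings) / len(stored_readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream: update the result continuously as each event arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an answer is available after every event


# Hypothetical sensor readings, used for both styles of processing.
readings = [10.0, 20.0, 30.0]
batch_result = batch_average(readings)
stream_results = list(streaming_average(iter(readings)))
```

Both approaches agree once all data has been seen; the streaming version simply has an answer available after every event instead of only at the end.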

Page 7: Hadoop Summit Tokyo Apache NiFi Crash Course

Modern Data Applications: Custom or Off the Shelf

• Real-Time Cyber Security protects systems with superior threat detection
• Smart Manufacturing dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars drive themselves and improve road safety
• Future Farming optimizes soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines match products to preferences in milliseconds

DATA AT REST

DATA IN MOTION

ACTIONABLE INTELLIGENCE

Modern Data Applications

Hortonworks DataFlow

Hortonworks Data Platform

Page 8: Hadoop Summit Tokyo Apache NiFi Crash Course

Store Data

Process and Analyze Data

Acquire Data

Simplistic View of DataFlows: Easy, Definitive

Dataflow

Page 9: Hadoop Summit Tokyo Apache NiFi Crash Course

The Unassuming Line: A Case Study
We’ve seen a few lines show up in the wild thus far

Internet! Inter- & Intra- connections in our global courier enterprise

Spotlight: Arthur Lacôte, https://thenounproject.com/turo/

Page 10: Hadoop Summit Tokyo Apache NiFi Crash Course

Dataflow Line Anatomy 101
Let’s dissect what this line typically represents

Fig 1. Lineus Worldwidewebus. Common Name: Internet!

Script or Application

Script or Application

Data Data

Disparate Transport Mechanisms

Page 11: Hadoop Summit Tokyo Apache NiFi Crash Course

Dataflow Line Anatomy 201
Sometimes that transport is just more lines

Fig 1. Lineus Worldwidewebus. Common Name: Internet!

Script or Application

Script or Application

Line Inception

Data Data

Page 12: Hadoop Summit Tokyo Apache NiFi Crash Course

Realistic View of Dataflows: Complex, Convoluted

[Diagram: one Process and Analyze Data stage linked by dataflows to multiple Acquire Data and multiple Store Data stages]

Page 13: Hadoop Summit Tokyo Apache NiFi Crash Course

Streaming Architecture

Ingestion

Simple Event Processing

Engine

Stream Processing

Destination

Data Bus

Page 14: Hadoop Summit Tokyo Apache NiFi Crash Course

High-Level Overview

IoT Edge (single node)

IoT Edge (single node)

IoT Devices

IoT Devices

NiFi Hub Data Broker

Column DB

Data Store

Live Dashboard

Data Center (on premises/cloud)

HDFS/S3 HBase/Cassandra

Page 15: Hadoop Summit Tokyo Apache NiFi Crash Course

Agenda

What is dataflow and what are the challenges?

Apache NiFi

Architecture

Live Demo

Community

Page 16: Hadoop Summit Tokyo Apache NiFi Crash Course

Moving data effectively is hard

Standards: http://xkcd.com/927/

Page 17: Hadoop Summit Tokyo Apache NiFi Crash Course

Why is moving data effectively hard?

• Standards
• Formats
• “Exactly Once” Delivery
• Protocols
• Veracity of Information
• Validity of Information
• Ensuring Security
• Overcoming Security
• Compliance
• Schemas
• Consumers Change
• Credential Management
• “That [person|team|group]”
• Network
• “Exactly Once” Delivery

Page 18: Hadoop Summit Tokyo Apache NiFi Crash Course

Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Let’s consider the needs of a courier service

Physical Store

Gateway Server

Mobile Devices

Registers

Server Cluster

Distribution Center Core Data Center at HQ

Server Cluster

On Delivery Routes

Trucks Deliverers

Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Page 19: Hadoop Summit Tokyo Apache NiFi Crash Course

Great! I am collecting all this data! Let’s use it!
Finding our needles in the haystack

Physical Store

Gateway Server

Mobile Devices

Registers

Server Cluster

Distribution Center

Kafka

Core Data Center at HQ

Server Cluster

Others

Storm / Spark / Flink / Apex

Kafka

Storm / Spark / Flink / Apex

On Delivery Routes

Trucks Deliverers

Page 20: Hadoop Summit Tokyo Apache NiFi Crash Course

Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Oh, that courier service is global

Page 21: Hadoop Summit Tokyo Apache NiFi Crash Course

Agenda

What is dataflow and what are the challenges?

Apache NiFi

Architecture

Live Demo

Community

Page 22: Hadoop Summit Tokyo Apache NiFi Crash Course

Page 23: Hadoop Summit Tokyo Apache NiFi Crash Course

Capabilities/Gaps

Use cases collected from the field since last release (HDF 1.2)

Major business drivers behind the use case

Problems, challenges and major pain points

How does NiFi help solve the problems

What are the remaining gaps

Use Cases

Page 24: Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi High Level Capabilities

Web-based user interface
– Design, control, feedback & monitoring

Highly configurable
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flow can be modified at runtime
– Back pressure

Data provenance
– Track dataflow from beginning to end

Designed for extension
– Build your own processors

Secure
– SSL, SSH, HTTPS, etc.

Page 25: Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi Key Features

• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
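Two of the features above, back pressure and prioritized queuing, can be illustrated with a toy bounded queue. This is a simplified Python sketch and not NiFi's implementation: the `Connection` class, its `backpressure_threshold` parameter, and the priority numbers are all assumptions made for the example. In this sketch a full queue simply refuses new items, which signals the upstream side to pause.

```python
import heapq
from typing import List, Tuple


class Connection:
    """Hypothetical bounded, prioritized queue between two processors."""

    def __init__(self, backpressure_threshold: int) -> None:
        self.backpressure_threshold = backpressure_threshold
        self._heap: List[Tuple[int, int, str]] = []
        self._sequence = 0  # tie-breaker: keeps FIFO order within a priority

    def is_full(self) -> bool:
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item: str, priority: int) -> bool:
        """Enqueue unless back pressure applies (lower number = more urgent)."""
        if self.is_full():
            return False  # upstream should stop producing for now
        heapq.heappush(self._heap, (priority, self._sequence, item))
        self._sequence += 1
        return True

    def poll(self) -> str:
        """Dequeue the most urgent item."""
        return heapq.heappop(self._heap)[2]


conn = Connection(backpressure_threshold=2)
accepted = [
    conn.offer("routine", priority=5),
    conn.offer("urgent", priority=1),
    conn.offer("overflow", priority=1),  # refused: queue is at its threshold
]
first_out = conn.poll()  # the urgent item jumps ahead of the routine one
accepted_after_drain = conn.offer("overflow", priority=1)
```

Once the consumer drains an item, the queue drops below its threshold and accepts data again, which is the pressure-and-release cycle the feature list refers to.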

Page 26: Hadoop Summit Tokyo Apache NiFi Crash Course

Deeper Ecosystem Integration: 170+ Processors

[Word cloud of processor names: HTTP, Syslog, Email, HTML, Image, Hash, Encrypt, Extract, Tail, Merge, Evaluate, Duplicate, Execute, Scan, GeoEnrich, Replace, Convert, Split, Translate, HL7, FTP, UDP, XML, SFTP, Route Content, Route Context, Route Text, Control Rate, Distribute Load, AMQP]

Page 27: Hadoop Summit Tokyo Apache NiFi Crash Course

Revisit: Courier service from the perspective of NiFi

Physical Store

Gateway Server

Mobile Devices

Registers

Server Cluster

Distribution Center Core Data Center at HQ

Server Cluster

Trucks Deliverers

NiFi NiFi NiFi NiFi NiFi NiFi

On Delivery Routes

Page 28: Hadoop Summit Tokyo Apache NiFi Crash Course

Courier service from the perspective of NiFi & MiNiFi

Physical Store

Gateway Server

Mobile Devices

Registers

Server Cluster

Distribution Center Core Data Center at HQ

Server Cluster

Trucks Deliverers

Client Libraries

Client Libraries

MiNiFi

MiNiFi NiFi NiFi NiFi NiFi NiFi NiFi

Client Libraries

On Delivery Routes

Page 29: Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi Subproject: MiNiFi

Let me get the key parts of NiFi close to where data begins and provide bi-directional communication

NiFi lives in the data center. Give it an enterprise server or a cluster of them.

MiNiFi lives as close as possible to where data is born and is a guest on that device or system.

Page 30: Hadoop Summit Tokyo Apache NiFi Crash Course

Visual Command and Control

vs.

Design and Deploy

Page 31: Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi Managed Dataflow

SOURCES

REGIONAL INFRASTRUCTURE

CORE INFRASTRUCTURE

Page 32: Hadoop Summit Tokyo Apache NiFi Crash Course

NiFi is based on Flow Based Programming (FBP)

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of its components.
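The FBP vocabulary above maps onto a few small data structures. The following Python sketch is a toy model and not the NiFi API: `FlowFile`, `to_upper`, `tag_size`, and `run_flow` are hypothetical names chosen for illustration. Processors are plain functions (black boxes), connections are queues (bounded buffers, left unbounded here for brevity), and `run_flow` plays the role of a very naive scheduler.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class FlowFile:
    """Information packet: attributes (metadata) plus content (bytes)."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


Processor = Callable[[FlowFile], FlowFile]  # the "black box"


def to_upper(flowfile: FlowFile) -> FlowFile:
    """Transforms content and leaves attributes unchanged."""
    return FlowFile(flowfile.content.upper(), dict(flowfile.attributes))


def tag_size(flowfile: FlowFile) -> FlowFile:
    """Adds an attribute and leaves content unchanged."""
    attributes = dict(flowfile.attributes, size=str(len(flowfile.content)))
    return FlowFile(flowfile.content, attributes)


def run_flow(source: Deque[FlowFile], processors: List[Processor]) -> Deque[FlowFile]:
    """Toy scheduler: drains each connection (queue) into the next processor."""
    queue = source
    for processor in processors:
        next_queue: Deque[FlowFile] = deque()
        while queue:
            next_queue.append(processor(queue.popleft()))
        queue = next_queue
    return queue


inbound = deque([FlowFile(b"hello", {"filename": "a.txt"})])
outbound = run_flow(inbound, [to_upper, tag_size])
result = outbound[0]
```

A process group would correspond to wrapping a list of processors like `[to_upper, tag_size]` behind a single callable, so the composition can be reused as one component.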

Page 33: Hadoop Summit Tokyo Apache NiFi Crash Course

FlowFiles & Data Agnosticism

NiFi is data agnostic! But, NiFi was designed understanding that users can care about specifics and provides tooling to interact with specific formats, protocols, etc.

ISO 8601 - http://xkcd.com/1179/

Robustness principle

“Be conservative in what you do, be liberal in what you accept from others”

Page 34: Hadoop Summit Tokyo Apache NiFi Crash Course

FlowFiles are like HTTP data

HTTP Data

Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html

Content:
Hello world!

FlowFile

Header:
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'

Content:
Binary Content *
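The header/content analogy can be shown by splitting a raw HTTP response into the same two parts a FlowFile carries: a key/value map and an opaque byte payload. This is an illustrative Python sketch only; `split_http_message` is a hypothetical helper, the `status-line` key is an invention of this example, and the sample response is shortened from the one on the slide.

```python
from typing import Dict, Tuple


def split_http_message(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a raw HTTP response into header fields and body bytes."""
    head, _, body = raw.partition(b"\r\n\r\n")  # blank line ends the header
    lines = head.decode("iso-8859-1").split("\r\n")
    headers: Dict[str, str] = {"status-line": lines[0]}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 13\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world!\n"
)

# "attributes" plays the role of the FlowFile attribute map,
# "content" the role of the FlowFile's binary content.
attributes, content = split_http_message(raw_response)
```

As with a FlowFile, routing decisions can be made from the attribute map alone, without ever parsing the payload bytes.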

Page 35: Hadoop Summit Tokyo Apache NiFi Crash Course

The need for data provenance

For Operators
• Traceability, lineage
• Recovery and replay

For Compliance
• Audit trail
• Remediation

For Business / Mission
• Value sources
• Value IT investment

BEGIN

END

LINEAGE

Page 36: Hadoop Summit Tokyo Apache NiFi Crash Course

Data Provenance – Improved Navigation and Clearer Interaction

• Tracks data at each point as it flows through the system

• Records, indexes, and makes events available for display

• Handles fan-in/fan-out, i.e. merging and splitting data

• View attributes and content at given points in time
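Lineage across fan-out and fan-in can be pictured as walking a log of recorded events backwards. The Python sketch below is a toy model, not NiFi's provenance repository: `ProvenanceEvent`, `ancestors`, the FlowFile ids, and the event-type labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str             # illustrative labels, e.g. "RECEIVE", "FORK", "JOIN"
    parents: Tuple[str, ...]    # ids of the input FlowFiles
    children: Tuple[str, ...]   # ids of the output FlowFiles


def ancestors(flowfile_id: str, events: List[ProvenanceEvent]) -> Set[str]:
    """Walk recorded events backwards to find every upstream FlowFile id."""
    parent_index: Dict[str, Set[str]] = {}
    for event in events:
        for child in event.children:
            parent_index.setdefault(child, set()).update(event.parents)
    found: Set[str] = set()
    stack = [flowfile_id]
    while stack:
        for parent in parent_index.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


events = [
    ProvenanceEvent("RECEIVE", (), ("A",)),
    ProvenanceEvent("FORK", ("A",), ("A1", "A2")),  # fan-out: split
    ProvenanceEvent("JOIN", ("A1", "A2"), ("M",)),  # fan-in: merge
]
lineage_of_merged = ancestors("M", events)
```

Because every event records both its inputs and outputs, the merged item `M` can be traced back through both split branches to the original item `A`.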

Page 37: Hadoop Summit Tokyo Apache NiFi Crash Course

Agenda

What is dataflow and what are the challenges?

Apache NiFi

Architecture

Live Demo

Community

Page 38: Hadoop Summit Tokyo Apache NiFi Crash Course

Zero-master Clustering

Framework

Page 39: Hadoop Summit Tokyo Apache NiFi Crash Course

NiFi vs MiNiFi Java Processes

NiFi Framework

Components

MiNiFi

NiFi Framework

User Interface

Components

NiFi

Page 40: Hadoop Summit Tokyo Apache NiFi Crash Course

MiNiFi

Java agent

Java implementation

Availability
– GA HDF 2.0 (built from scratch, ~ 10MB)

Native agent

C++ implementation

Availability
– TP HDF 2.0
– GA post HDF 2.0

Resource efficient (focus on memory and disk)

Near term (HDF 2.0)

Design & deploy
– Push updates
– Config file driven/REST API (MiNiFi API – post configurations and receive information, etc.) access

Long term

Centralized command and control

MiNiFi Agent MiNiFi Management

Page 41: Hadoop Summit Tokyo Apache NiFi Crash Course

Why NiFi?

Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes
– Think of our courier example and organizations like it: inter vs intra, domestically, internationally

Provide common tooling and extensions that are commonly needed but be flexible for extension
– Leverage existing libraries and expansive Java ecosystem for functionality
– Allow organizations to integrate with their existing infrastructure

Empower folks managing your infrastructure to make changes and reason about issues that are occurring
– Data Provenance to show context and data’s journey
– User Interface/Experience a key component

Page 42: Hadoop Summit Tokyo Apache NiFi Crash Course

NiFi Traffic Patterns Demo

Page 43: Hadoop Summit Tokyo Apache NiFi Crash Course

NiFi Traffic Patterns Lab

Page 44: Hadoop Summit Tokyo Apache NiFi Crash Course

Smart Cities: Traffic Congestion

Monitor:
• Public transportation vehicles
• Pedestrian levels

Optimize public transit duration and walking routes

Page 45: Hadoop Summit Tokyo Apache NiFi Crash Course

Our Lab for Today

We will be exploring some examples to work through creating a dataflow with Apache NiFi

Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for this real-time data integration.

Labs are available at http://tinyurl.com/nificrashcourse

Page 46: Hadoop Summit Tokyo Apache NiFi Crash Course

Getting Started Resources

Page 47: Hadoop Summit Tokyo Apache NiFi Crash Course

Connected Data Architecture with HDC for AWS

CLOUD

Ideal Use Cases:
• Data Science and Exploration (Spark, Zeppelin)
• ETL and Data Preparation (Hive, Spark)
• Analytics and Reporting (Hive2 w/LLAP, Zeppelin)

Cloud Data Processing (HDC for AWS)

Technical Preview

hortonworks.github.io/hdp-aws

Page 48: Hadoop Summit Tokyo Apache NiFi Crash Course

Learn more and join us!

Apache NiFi site: http://nifi.apache.org

Subproject MiNiFi site: http://nifi.apache.org/minifi/

Subscribe to and collaborate [email protected]@nifi.apache.org

Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI

Follow us on Twitter: @apachenifi

Page 49: Hadoop Summit Tokyo Apache NiFi Crash Course

Big Data Tutorials

Get Started
– hortonworks.com/tutorials
– Apache Hadoop & Ecosystem
  • tinyurl.com/hello-hdp
– Apache Spark
  • tinyurl.com/hwx-spark-intro
– Apache NiFi
  • tinyurl.com/nifi-intro
– Use Case
  • IoT
  • Social Media

Page 50: Hadoop Summit Tokyo Apache NiFi Crash Course

Hortonworks Nourishes the Community

HORTONWORKS COMMUNITY CONNECTION

HORTONWORKS PARTNERWORKS

Page 51: Hadoop Summit Tokyo Apache NiFi Crash Course

Want to continue the technical Introduction?

Hadoop Summit Crash Courses
– Replays
– Free

hadoopsummit.org/san-jose/agenda
– Apache Hadoop
– Apache Spark
– Apache NiFi
– IoT & Streaming
– Data Science

Page 52: Hadoop Summit Tokyo Apache NiFi Crash Course

[email protected]

@racoss

Page 53: Hadoop Summit Tokyo Apache NiFi Crash Course

Thank you!

Page 54: Hadoop Summit Tokyo Apache NiFi Crash Course

Agenda

What is dataflow and what are the challenges?

Apache NiFi

Architecture

Demo

Community

Page 55: Hadoop Summit Tokyo Apache NiFi Crash Course

Matured at NSA 2006-2014

Brief history of the Apache NiFi Community

• Contributors from Government and several commercial industries

• Releases on a 6-8 week schedule

Timeline:
2006: Code developed at NSA
November 2014: Code available open source (ASL v2)
July 2015: Achieved TLP status in just 7 months
Today

Page 56: Hadoop Summit Tokyo Apache NiFi Crash Course

MiNiFi Prospective Plans - Centralized Command and Control

Design at a centralized place, deploy on the edge
– Flow deployment
– NAR deployment
– Agent deployment

Version control of flows
Agent status monitoring
Bi-directional command and control

Centralized management console with a UI