How graphs became just another big data primitive Ted Willke Cloud Platforms Group / Big Data Solutions

Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF

Abstract: How graphs became just another big data primitive

Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. A growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.

Page 1:

How graphs became just another big data primitive Ted Willke

Cloud Platforms Group / Big Data Solutions

Page 2:

Why graphs are cool: DEMO

Page 3:

So, how did graphs become just another useful big data primitive?

They DIDN’T.

Page 4:

Set off in the right direction

Reduce the tool drag for graph analytics

-- Vision (early 2012)

Page 5:

A complete graph analytics solution

-- July 2013

Page 6:

[Figure: User Interest across Wide on Analytics, E2E on Graph, Deep on Graph, Wide on Analytics]

Page 7:

Learning #1: Don’t ignore what’s popular!

Page 8:

Popular Big Data (Structure) Primitives

Which one is best? It depends… and it’s probably not just one.

Key-Value Document Graph Column Tabular

Page 9:

Popular Big Data (Structure) Primitives: Key-Value

Basic dictionary. Very fast. Very easy.

No/minimal structure.

Java, PIQL, Lua, XML, XQuery,…

Page 10:

Popular Big Data (Structure) Primitives: Document

Key(s), metadata, hierarchy, document structure

XML, BSON, JSON…

Java, C, C++, REST, Clojure, Scala…

Page 11:

Popular Big Data (Structure) Primitives: Column

Key:col_val, Key:col_val…

Great for “do this to everything in this column”

Not so much for multiple columns, specific keys

Hadoop, Zookeeper, Java, Python,…
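The column trade-off can be illustrated with a minimal sketch in plain Python. The data, the `column_mean` helper, and the `fetch_record` helper are all hypothetical; a real column store adds compression, partitioning, and indexing that are not modeled here.

```python
# Row-oriented view: one record per key (made-up data).
rows = [
    {"key": "u1", "age": 34, "spend": 120.0},
    {"key": "u2", "age": 29, "spend": 80.0},
    {"key": "u3", "age": 41, "spend": 100.0},
]

# Column-oriented view: one list per column, aligned by position.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def column_mean(cols, name):
    """'Do this to everything in this column': touches a single list."""
    values = cols[name]
    return sum(values) / len(values)


def fetch_record(cols, key):
    """Specific key, multiple columns: must visit every column list."""
    i = cols["key"].index(key)
    return {name: values[i] for name, values in cols.items()}


print(column_mean(columns, "spend"))
print(fetch_record(columns, "u2"))
```

Aggregating one column reads one contiguous list, while rebuilding a single record has to reach into every column, which is the asymmetry the slide points at.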

Page 12:

Popular Big Data (Structure) Primitives: Tabular

Old-school RDBMS

Collection of tables + relations that join them

*SQL*

Page 13:

Popular Big Data (Structure) Primitives: Graph

Nodes, edges, properties of nodes and edges

Java, Clojure, Lisp, Ruby, C, C++, Scala, REST,…
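As a rough illustration of "nodes, edges, properties of nodes and edges", here is a minimal in-memory property graph in plain Python. The `PropertyGraph` class, its methods, and the sample data are hypothetical and are not the API of any graph product named in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Tiny property graph: nodes and edges both carry key/value properties."""
    nodes: dict = field(default_factory=dict)  # node_id -> properties
    edges: list = field(default_factory=list)  # (src, label, dst, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst, **props):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((src, label, dst, props))

    def neighbors(self, node_id, label=None):
        """Ids reachable over outgoing edges, optionally filtered by edge label."""
        return [dst for src, lbl, dst, _ in self.edges
                if src == node_id and (label is None or lbl == label)]


# Made-up example data.
g = PropertyGraph()
g.add_node("alice", kind="user", age=34)
g.add_node("bob", kind="user", age=29)
g.add_node("book-1", kind="product", title="Graph Basics")
g.add_edge("alice", "follows", "bob", since=2013)
g.add_edge("alice", "bought", "book-1", rating=5)
g.add_edge("bob", "bought", "book-1", rating=4)

print(g.neighbors("alice"))
print(g.neighbors("alice", label="bought"))
```

Storing edges as a flat list keeps the sketch short; a real system would index edges by source and label.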

Page 14:

How we use the primitives

Model: Key-Value, Document, Graph, SQL, Column

Access: Off-line (Queue), Async (Bus), Sync (I/O)

Implementation: API (Remote), LIB (Local)

Page 15:

How are these primitives put to use?

Page 16:

Data workflow example

Ingest & Clean

Engineer Features

Structure Model

Train Model

Query & Analyze

Learn

Visualize

Page 17:

Data Representation

Personal Learning Knowledge Graph

Node types and example properties:

• Task Level: name: "10th Grade", value: 10

• Learning Task: name: "Matrix Multiplication", task_id: 101, description: "Demonstrate how to multiply two matrices", type: "homework"

• Subject: name: "Linear Algebra", subject_id: 100

• Task Outcome: score: 0.8, num_correct: 8, num_attempts: 2

• Learning Plan: plan_id: 1, num_tasks: 5, expected_time: 5h

• Learning Goal: goal_id: 9, description: "Achieve above average proficiency in all Linear Algebra course tasks"

• Proficiency: name: "Above Average"

Edge labels: has_associated (appears twice), has_result, contains, implemented_by, evaluated_by, summarized_by, has_prerequisite

Graph? Columnar? Tabular??

Page 18:

• Engineer features (avg, ratios)

• Input from another model (segment/cluster)

• Build graph w/ features from frame

• Run a graph-based classifier (e.g. LBP)

• Pull results back to frame to get model perf stats
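The frame-to-graph-to-frame round trip on this slide can be sketched in plain Python. Everything here is an assumption made for illustration: the data, the column names, the similarity rule used to build edges, and the classifier itself, which is a simple majority-vote label propagation standing in for loopy belief propagation (LBP), not an implementation of it.

```python
from collections import Counter

# "Frame": one dict per row. `label` is known only for seed rows;
# `truth` is held out and used only for scoring. All values are made up.
frame = [
    {"id": "a", "spend": 90.0, "visits": 10, "label": "good", "truth": "good"},
    {"id": "b", "spend": 80.0, "visits": 10, "label": None, "truth": "good"},
    {"id": "c", "spend": 70.0, "visits": 10, "label": None, "truth": "good"},
    {"id": "d", "spend": 20.0, "visits": 10, "label": "bad", "truth": "bad"},
    {"id": "e", "spend": 10.0, "visits": 10, "label": None, "truth": "bad"},
]

# 1. Engineer a feature in the frame (a ratio).
for row in frame:
    row["spend_per_visit"] = row["spend"] / row["visits"]


# 2. Build a graph from the frame: connect rows with similar feature values.
def build_graph(rows, feature, max_gap):
    adj = {r["id"]: set() for r in rows}
    for i, u in enumerate(rows):
        for v in rows[i + 1:]:
            if abs(u[feature] - v[feature]) <= max_gap:
                adj[u["id"]].add(v["id"])
                adj[v["id"]].add(u["id"])
    return adj


# 3. Run a graph-based classifier (majority-vote label propagation).
def propagate_labels(adj, seeds, max_iters=10):
    labels = dict(seeds)
    for _ in range(max_iters):
        updates = {}
        for node, nbrs in adj.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


adj = build_graph(frame, "spend_per_visit", max_gap=1.5)
seeds = {r["id"]: r["label"] for r in frame if r["label"] is not None}
labels = propagate_labels(adj, seeds)

# 4. Pull results back into the frame and compute a model performance stat.
for row in frame:
    row["predicted"] = labels.get(row["id"])
scored = [r for r in frame if r["label"] is None]
accuracy = sum(r["predicted"] == r["truth"] for r in scored) / len(scored)
print(accuracy)
```

The point of the sketch is the shape of the workflow: the same rows are viewed as a table for feature engineering and scoring, and as a graph only for the classification step.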

Page 19:

Learning #2: The primitives are not used in isolation.

Page 20:

Tooling mash-up!

Workflow stages: Ingest & Clean; Engineer Features; Structure Model; Train Model; Query & Analyze; Learn; Visualize

Tools shown across the stages:

• Pig/MR, PySpark

• ETL Tools?

• Pig/MR, PySpark

• Java, Scala

• Giraph, GraphX (Java, Scala…)

• Mahout, MLlib

• ??

• *SQL*, BI tools

• PySpark…

Page 21:

Tools are not used in isolation either.

How can we cope with this?

Page 22:

Direction #1: Unify primitives and processing on a workflow-oriented engine

Page 23:

Unification with Apache Spark

• In-memory structures (RDDs) support both table and graph abstractions

• Batch processing and Spark streaming

[Figure: Spark stack. Image Source: Databricks]

• Spark: RDDs, Transformations, and Actions

• Spark Streaming (real-time): DStreams, streams of RDDs

• Spark SQL: SchemaRDDs

• MLlib (machine learning): RDD-based matrices

• GraphX (graph processing / machine learning): RDD-based graphs

Page 24:

GraphX API for Spark

• Graph processing engine on Spark

• Supports Pregel-style vertex programming

• View same data as either graphs or collections

Image Source: GraphX project
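To make "Pregel-style vertex programming" concrete, here is a single-machine sketch in plain Python. It is not the GraphX API: the `pregel` function, its parameter names, and the connected-components example are assumptions chosen to show the superstep pattern (vertices receive merged messages, update their value, then message their neighbors until no messages remain).

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_supersteps=20):
    """vertices: dict id -> value; edges: list of (src, dst), treated as undirected."""
    # Superstep 0: every vertex processes the initial message.
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst in edges:
            for a, b in ((src, dst), (dst, src)):
                if a not in active:
                    continue  # only vertices updated last superstep send
                msg = send_msg(values[a], values[b])
                if msg is not None:
                    inbox[b] = msg if b not in inbox else merge_msg(inbox[b], msg)
        if not inbox:
            break  # no messages in flight: the computation has converged
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)
    return values


# Connected components: each vertex converges to the smallest id in its component.
vertices = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
edges = [(1, 2), (2, 3), (4, 5)]

components = pregel(
    vertices,
    edges,
    initial_msg=float("inf"),
    vprog=lambda vid, value, msg: min(value, msg),
    send_msg=lambda src_val, dst_val: src_val if src_val < dst_val else None,
    merge_msg=min,
)
print(components)
```

The vertex program only ever sees its own value and its merged inbox, which is what lets an engine distribute the same logic across partitions.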

Page 25:

Python bindings for Spark (GraphX)

[Figure: client/server architecture. Labels: Client, Server, Python, JVM, Py4J, Files, Akka, Python Worker, Pipes, Serialized Python Functions, Results, “Transformations”, “Actions”, “Operations”]

Page 26:

Python bindings for Spark GraphX

Page 27:

Python bindings for Spark GraphX

Coming soon to Apache!

Vertex

• Transformations: filter, mapValues, diff

• Actions: aggregateUsingIndex

• Join Operations: innerJoin, leftJoin

Edge

• Transformations: filter, mapValues, reverse

• Join Operations: innerJoin

Graph

• Property Operators: mapVertices, mapEdges, mapTriplets

• Structural Operators: subgraph, reverse, mask, groupEdges

• Join Operations: joinVertices, outerJoinVertices

• Neighborhood Aggregation: mapReduceTriplets

• Analytics: ALS, SVDPlusPlus, TriangleCount, PageRank, ConnectedComponents, ShortestPaths

Page 28:

Direction #1: Spark

• Feature engineering

• Model training

• Limited language binding (Python, R getting better)

• Lacks transactions and model serving

Page 29:

Lacks transactions and model serving... or does it?

Extending BDAS with Velox: A UC Berkeley AMPLab project (sponsored in part by Intel)

Image Source: Crankshaw, D., et al., “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox,” Cornell University Library Archive, retrieved November 2014

Page 30:

Direction #2: Unify primitives and processing in a relational database

Page 31:

Unification within the In-Memory Database (IMDB)

• Index data structure for graph traversal

• Prototyped in SAP HANA distributed columnar IMDB

• Lays foundation for complex graph query and algorithms

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 32:

Graph Traversal

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 33:

Graph Indexing

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 34:

Graph Traversal Results

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 35:

Graph Analytics in Relational Databases?

• Store graph as a set of nodes and a set of edges

• Relational algebra captures all basic graph operations

• Iterative algorithms captured as driver program that calls stored procedures

Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
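The three points above can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions, not the approach from the cited blog: the table names, the sample edges, and the `bfs` driver are hypothetical, and because SQLite has no stored procedures, a parameterized SQL statement stands in for the stored procedure that the driver program would call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Store the graph as a set of nodes and a set of edges.
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    CREATE TABLE dist  (id INTEGER PRIMARY KEY, hops INTEGER);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")])
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (1, 3)])

# One frontier expansion is a join (frontier x edges) plus an anti-join
# against the nodes already reached: plain relational algebra.
EXPAND = """
    INSERT INTO dist (id, hops)
    SELECT DISTINCT e.dst, ?
    FROM dist d JOIN edges e ON e.src = d.id
    WHERE d.hops = ? AND e.dst NOT IN (SELECT id FROM dist)
"""


def bfs(conn, source):
    """Driver program: repeat the SQL step until the frontier is empty."""
    conn.execute("DELETE FROM dist")
    conn.execute("INSERT INTO dist VALUES (?, 0)", (source,))
    hops = 0
    while True:
        conn.execute(EXPAND, (hops + 1, hops))
        added = conn.execute("SELECT COUNT(*) FROM dist WHERE hops = ?",
                             (hops + 1,)).fetchone()[0]
        if added == 0:
            break
        hops += 1
    return dict(conn.execute("SELECT id, hops FROM dist ORDER BY id"))


print(bfs(conn, 1))
```

The iteration lives in the host language while each step stays a set-oriented query, so the database's optimizer and storage do the heavy lifting.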

Page 36:

Graph Analytics in Relational Databases?

Relational and graphical analysis – better together!

Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog

Page 37:

Expressing Graph in SQL

Source: ISTC for Big Data, Alekh Jindal

Page 38:

Future Vision – BigDAWG

[Figure: BigDAWG architecture. Labels: Applications, Visualization, Languages; BQL – BigDAWG Query Language & Compiler; Real Time Database; Historical / Analytics Databases; Stream; Spill; Analytics Libraries; Hardware Platforms]

“Narrow waist” provides portability

Page 39:

Future Vision – BigDAWG

• Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT

• Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching

• Languages, e.g., Julia, R, MLbase, GraphLab

• BQL – BigDAWG Query Language & Compiler

• Real Time DBMSs and Historical / Analytics DBMSs (labels: Stream, Spill); systems shown: SciDB, TupleWare, TileDB, S-Store, MyriaX

• Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages

• Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon

“Narrow waist” provides portability

Page 40:

Direction #2: Relational DB

• Feature engineering

• Transactions and model serving

• Performant model training?

• Just another Spark behind *QL?

Page 41:

Which direction do you favor?

Will the lines blur?

Page 42:

Takeaway from both: Do all of the parallel distributed processing in one place and work with it through one UI!

Page 43:

Pressing forward with the Intel Analytics Toolkit

[Figure: Intel Analytics Toolkit stack]

Stack labels: HW PLATFORM; FILESYSTEMS AND NOSQL STORAGE; APACHE HADOOP, APACHE SPARK; DATA WRANGLING; MACHINE LEARNING AND STATISTICS; Graphical Algorithms; Classical Algorithms; Graph Construction Tools; Useful String Manipulation; Useful Math Operators; “DATA SCIENCE” REST API

Client labels: Python Libraries; 3rd Party GUIs/SDKs; Viz Tools; Future Libraries; BI Connectors; Query Interfaces; ...

Callouts:

• Unified UIs across the workflow

• Easier feature & model creation

• End-to-end graph pipeline

• Fully scalable throughout

• Multiple data primitives

• Optimized for IA

Page 44:

Analyzing the Semantic Web

Reputations: Neutral, Good, Bad, Suspect

Page 45:

Unified programming environment: DEMO

Page 46:

PROGRESS TOWARD VISION

Page 47:

If we are successful... graph will become just another big data primitive!
