Abstract: Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. A growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.
How graphs became just another big data primitive
Ted Willke
Cloud Platforms Group / Big Data Solutions
Why graphs are cool: DEMO
2
So, how did graphs become just another useful big data primitive?
They DIDN’T.
Reduce the tool drag for graph analytics
-- Vision (early 2012)
Set off in the right direction
4
A complete graph analytics solution
5
-- July 2013
6
Chart: User Interest for Wide on Analytics, E2E on Graph, and Deep on Graph
Learning #1: Don’t ignore what’s popular!
7
Popular Big Data (Structure) Primitives
Which one is best? It depends… and it’s probably not just one.
• Key-Value: Basic dictionary. Very fast. Very easy. No/minimal structure. Java, PIQL, Lua, XML, XQuery,…
• Document: Key(s), metadata, hierarchy, document structure. XML, BSON, JSON… Java, C, C++, REST, Clojure, Scala…
• Column: Key:col_val, Key:col_val… Great for “do this to everything in this column”. Not so much for multiple columns, specific keys. Hadoop, Zookeeper, Java, Python,…
• Tabular: Old-school RDBMS. Collection of tables + relations that join them. *SQL*
• Graph: Nodes, edges, properties of nodes and edges. Java, Clojure, Lisp, Ruby, C, C++, Scala, REST,…
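To make the five shapes concrete, here is a minimal sketch that holds the same toy data in each of them using only Python built-ins. The records (two users and one "follows" relation) and every variable name are made up for illustration; real systems add indexing, distribution, and query languages on top of these shapes.

```python
import json

# Hypothetical records, used only to illustrate the five shapes.

# Key-value: an opaque value looked up by a single key.
key_value = {"user:1": "alice", "user:2": "bob"}

# Document: a self-describing, nested record.
document = json.loads(
    '{"id": 1, "name": "alice", "follows": [{"id": 2, "name": "bob"}]}'
)

# Column: one sequence per attribute, cheap to scan a whole column.
columns = {"id": [1, 2], "name": ["alice", "bob"], "age": [34, 27]}
mean_age = sum(columns["age"]) / len(columns["age"])

# Tabular: rows in tables, related to each other through keys and joins.
users = [(1, "alice"), (2, "bob")]
follows = [(1, 2)]  # (follower_id, followee_id)
names = dict(users)
joined = [(names[a], names[b]) for a, b in follows]

# Graph: nodes, edges, and properties on both.
nodes = {1: {"name": "alice"}, 2: {"name": "bob"}}
edges = [(1, 2, {"label": "follows"})]
neighbors = {n: [dst for src, dst, _ in edges if src == n] for n in nodes}
```

The join in the tabular version and the neighbor lookup in the graph version answer the same question, which is one way to read "it's probably not just one".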
• Model: Key-Value, Document, Graph, SQL, Column
• Access: Off-line (Queue), Async (Bus), Sync (I/O)
• Implementation: API (Remote), LIB (Local)
14
How we use the primitives
How are these primitives put to use?
15
Data workflow example
Stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize
16
Data Representation
Personal Learning Knowledge Graph
Node types and properties:
• Task Level: name: "10th Grade", value: 10
• Learning Task: name: "Matrix Multiplication", task_id: 101, description: "Demonstrate how to multiply two matrices", type: "homework"
• Subject: name: "Linear Algebra", subject_id: 100
• Task Outcome: score: 0.8, num_correct: 8, num_attempts: 2
• Learning Plan: plan_id: 1, num_tasks: 5, expected_time: 5h
• Learning Goal: goal_id: 9, description: "Achieve above average proficiency in all Linear Algebra course tasks"
• Proficiency: name: "Above Average"
Edge labels: has_associated, has_result, contains, implemented_by, evaluated_by, summarized_by, has_prerequisite
Graph? Columnar? Tabular??
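As a rough sketch of how this knowledge graph could be held as a property graph and then viewed as tables or columns, the snippet below uses plain Python dictionaries. The node properties are copied from the slide. The node ids, the `label` key, and above all the edge endpoints are assumptions: the slide text preserves the edge labels but not which nodes they connect.

```python
# Node properties come from the slide; ids and the "label" key are made up.
nodes = {
    "level": {"label": "Task Level", "name": "10th Grade", "value": 10},
    "task": {
        "label": "Learning Task",
        "name": "Matrix Multiplication",
        "task_id": 101,
        "description": "Demonstrate how to multiply two matrices",
        "type": "homework",
    },
    "subject": {"label": "Subject", "name": "Linear Algebra", "subject_id": 100},
    "outcome": {"label": "Task Outcome", "score": 0.8, "num_correct": 8,
                "num_attempts": 2},
    "plan": {"label": "Learning Plan", "plan_id": 1, "num_tasks": 5,
             "expected_time": "5h"},
    "goal": {
        "label": "Learning Goal",
        "goal_id": 9,
        "description": "Achieve above average proficiency in all "
                       "Linear Algebra course tasks",
    },
    "proficiency": {"label": "Proficiency", "name": "Above Average"},
}

# ASSUMPTION: these endpoints are guesses for illustration only.
edges = [
    ("plan", "contains", "task"),
    ("task", "has_result", "outcome"),
    ("subject", "has_associated", "task"),
]


def out_neighbors(node_id, edge_label):
    """Graph view: follow edges with a given label out of one node."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == edge_label]


# Tabular view: the same data as a node table and an edge table.
node_rows = [(node_id, props["label"]) for node_id, props in nodes.items()]
edge_rows = list(edges)

# Columnar view: one property pulled out across all nodes that have it.
score_column = {nid: p["score"] for nid, p in nodes.items() if "score" in p}
```

The three views at the bottom are the same data, which is the point of the slide's question: the choice of shape depends on the operation, not on the data.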
17
18
• Engineer features (avg, ratios)
• Input from another model (segment/cluster)
• Build graph w/ features from frame
• Run a graph-based classifier (e.g. LBP)
• Pull results back to frame to get model perf stats
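The round trip between frame and graph can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the column names, the toy rows, the "similar average" rule for creating edges, and the classifier. In particular, the classifier is a simple neighbor-majority label propagation, used as a stand-in; it is not loopy belief propagation (LBP).

```python
from collections import Counter

# Hypothetical frame: "seed" rows have a known label, the rest are held out.
frame = [
    {"id": "a", "total": 90.0, "count": 10, "truth": "high", "seed": True},
    {"id": "b", "total": 80.0, "count": 10, "truth": "high", "seed": False},
    {"id": "c", "total": 20.0, "count": 10, "truth": "low", "seed": True},
    {"id": "d", "total": 30.0, "count": 10, "truth": "low", "seed": False},
]

# 1. Engineer a feature in the frame (an average).
for row in frame:
    row["avg"] = row["total"] / row["count"]

# 2. Build a graph from the frame: link rows whose averages are close.
edges = [
    (r["id"], s["id"])
    for i, r in enumerate(frame)
    for s in frame[i + 1:]
    if abs(r["avg"] - s["avg"]) <= 2.0
]
adjacency = {row["id"]: set() for row in frame}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


# 3. Run a graph-based classifier (stand-in: majority vote of neighbors).
def propagate(adjacency, labels, rounds=5):
    labels = dict(labels)
    for _ in range(rounds):
        updates = {}
        for node, nbrs in adjacency.items():
            if labels[node] is not None:
                continue
            votes = Counter(labels[n] for n in nbrs if labels[n] is not None)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


seed_labels = {r["id"]: (r["truth"] if r["seed"] else None) for r in frame}
predicted = propagate(adjacency, seed_labels)

# 4. Pull results back to the frame and compute a model performance stat.
for row in frame:
    row["predicted"] = predicted[row["id"]]
held_out = [r for r in frame if not r["seed"]]
accuracy = sum(r["predicted"] == r["truth"] for r in held_out) / len(held_out)
```

Steps 1 and 4 are table operations and steps 2 and 3 are graph operations, all on the same records.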
Learning #2: The primitives are not used in isolation.
19
Tooling mash-up!
Stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize
Tools in play across the stages: Pig/MR, PySpark, ETL tools?; Pig/MR, PySpark; Java, Scala; Giraph, GraphX (Java, Scala…); Mahout, MLlib; ??; *SQL*, BI tools; PySpark…
20
Tools are not used in isolation either.
How can we cope with this?
21
Direction #1: Unify primitives and processing on a workflow-oriented engine
22
Unification with Apache Spark
Image Source: Databricks
• In-memory structures (RDDs) support both table and graph abstractions
• Batch processing and Spark streaming
Spark: RDDs, Transformations, and Actions
• Spark Streaming (real-time): DStreams, streams of RDDs
• Spark SQL: SchemaRDDs
• MLlib (machine learning): RDD-based matrices
• GraphX (graph processing/machine learning): RDD-based graphs
23
Image Source: GraphX project
• Graph processing engine on Spark
• Supports Pregel-style vertex programming
• View same data as either graphs or collections
GraphX API for Spark
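For readers unfamiliar with Pregel-style vertex programming, the following is a minimal single-machine sketch in plain Python. It is not the GraphX API: the function name `pregel_min_label` and its parameters are hypothetical, and it only illustrates the pattern of synchronous supersteps in which each vertex reads its messages, updates its state, and sends messages to its neighbors. The example labels each vertex with the smallest vertex id in its connected component.

```python
def pregel_min_label(vertices, edges, max_supersteps=20):
    """Label every vertex with the smallest vertex id reachable from it."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # treat edges as undirected

    state = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(state[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        active = False
        for v in vertices:
            # Vertex program: keep the smallest id seen so far.
            if inbox[v] and min(inbox[v]) < state[v]:
                state[v] = min(inbox[v])
                for n in neighbors[v]:  # tell neighbors about the change
                    outbox[n].append(state[v])
                active = True
        if not active:  # no vertex changed, so the computation has converged
            break
        inbox = outbox  # messages are delivered at the next superstep
    return state


components = pregel_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

The result is an ordinary dictionary keyed by vertex id, so it can be joined back against any other collection keyed the same way, which is the "graphs or collections" idea in miniature.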
24
Python bindings for Spark (GraphX)
25
Architecture diagram labels: Client, Server; Python, Py4J, JVM; Files, Akka; Python Worker, Pipes; Serialized Python Functions, Results; “Transformations”, “Actions”, “Operations”
Python bindings for Spark GraphX
26
Python bindings for Spark GraphX
Coming soon to Apache!
Vertex
• Transformations: filter, mapValues, diff
• Actions: aggregateUsingIndex
• Join Operations: innerJoin, leftJoin
Edge
• Transformations: filter, mapValues, reverse
• Join Operations: innerJoin
Graph
• Property Operators: mapVertices, mapEdges, mapTriplets
• Structural Operators: subgraph, reverse, mask, groupEdges
• Join Operations: joinVertices, outerJoinVertices
• Neighborhood Aggregation: mapReduceTriplets
• Analytics: ALS, SVDPlusPlus, TriangleCount, PageRank, ConnectedComponents, ShortestPaths
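To give a feel for what neighborhood aggregation over triplets means, here is a plain-Python sketch loosely modeled on the idea behind mapReduceTriplets. It is not the Python binding described on this slide, whose actual signatures are not shown in the source; the function `map_reduce_triplets`, its arguments, and the toy graph are all hypothetical. A triplet is an edge together with the attributes of its two endpoints; the map function emits messages addressed to vertices, and the reduce function combines messages that land on the same vertex.

```python
def map_reduce_triplets(vertex_attrs, edges, map_fn, reduce_fn):
    """Aggregate per-vertex messages produced from each edge triplet."""
    messages = {}
    for src, dst, edge_attr in edges:
        triplet = (src, vertex_attrs[src], dst, vertex_attrs[dst], edge_attr)
        for target, value in map_fn(triplet):
            if target in messages:
                messages[target] = reduce_fn(messages[target], value)
            else:
                messages[target] = value
    return messages


# Hypothetical toy graph: vertex id -> name, edges carry a weight.
vertex_attrs = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, 1.0), (3, 2, 2.0), (2, 1, 0.5)]

# In-degree: each edge sends a 1 to its destination, and messages are summed.
in_degree = map_reduce_triplets(
    vertex_attrs, edges,
    map_fn=lambda t: [(t[2], 1)],
    reduce_fn=lambda a, b: a + b,
)

# Total incoming weight: same pattern, but the message is the edge attribute.
weight_in = map_reduce_triplets(
    vertex_attrs, edges,
    map_fn=lambda t: [(t[2], t[4])],
    reduce_fn=lambda a, b: a + b,
)
```

Vertices that receive no messages (vertex 3 here) are simply absent from the result, so a caller would typically join the result back onto the vertex collection with a default value.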
27
Direction #1: Spark
28
• Feature engineering
• Model training
• Limited language binding (Python, R getting better)
• Lacks transactions and model serving
Lacks transactions and model serving... or does it?
Image Source: Crankshaw, D., et al., “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox,” Cornell University Library Archive, retrieved November 2014
Extending BDAS with Velox: A UC Berkeley AMPLab project (sponsored in part by Intel)
29
Direction #2:
Unify primitives and processing in a relational database
30
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Unification within the In-Memory Database (IMDB)
• Index data structure for graph traversal
• Prototyped in SAP HANA distributed columnar IMDB
• Lays foundation for complex graph query and algorithms
31
Graph Traversal
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Graph Indexing
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Graph Traversal Results
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
• Store graph as a set of nodes and a set of edges
• Relational algebra captures all basic graph operations
• Iterative algorithms captured as driver program that calls stored procedures
Graph Analytics in Relational Databases?
Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
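The three bullets above can be sketched with Python's built-in sqlite3 module: a node table, an edge table, one relational step (a join plus an update) per iteration, and a driver loop that repeats the step until nothing changes. This is an illustration under stated assumptions, not the approach from the cited blog: the schema, the toy data, and the function `hop_distances` are made up, and because SQLite has no stored procedures the driver issues a plain SQL statement where a stored procedure call would go.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, dist INTEGER);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO nodes (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
""")


def hop_distances(conn, source):
    """Driver program: breadth-first hop counts, one SQL step per level."""
    conn.execute("UPDATE nodes SET dist = NULL")
    conn.execute("UPDATE nodes SET dist = 0 WHERE id = ?", (source,))
    hop = 0
    while True:
        # One relational step: join the current frontier to its out-edges
        # and label every not-yet-visited destination with hop + 1.
        cur = conn.execute(
            """
            UPDATE nodes SET dist = ?
            WHERE dist IS NULL
              AND id IN (SELECT e.dst
                         FROM edges e JOIN nodes n ON n.id = e.src
                         WHERE n.dist = ?)
            """,
            (hop + 1, hop),
        )
        if cur.rowcount == 0:  # frontier is empty, so the traversal is done
            break
        hop += 1
    return dict(conn.execute("SELECT id, dist FROM nodes ORDER BY id"))


distances = hop_distances(conn, 1)
```

Unreachable nodes keep a NULL distance, which comes back to Python as `None`. The same driver pattern extends to other iterative algorithms by swapping the SQL step.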
Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
Graph Analytics in Relational Databases?
Relational and graphical analysis – better together!
36
Source: ISTC for Big Data, Alekh Jindal
Expressing Graph in SQL
37
• Applications, Visualization, Languages
• BQL – BigDAWG Query Language & Compiler (“narrow waist” provides portability)
• Real Time Database
• Historical / Analytics Databases
• Analytics Libraries
• Hardware Platforms
• Data flow labels: Spill, Stream
Future Vision – BigDAWG
38
Future Vision – BigDAWG
• Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
• Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
• Languages, e.g., Julia, R, MLbase, GraphLab
• BQL – BigDAWG Query Language & Compiler (“narrow waist” provides portability)
• Real Time DBMSs
• Historical / Analytics DBMSs
• Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
• Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
• Systems shown: SciDB, TupleWare, TileDB, S-Store, MyriaX
• Data flow labels: Spill, Stream
39
Direction #2: Relational DB
40
• Feature engineering
• Transactions and model serving
• Performant model training?
• Just another Spark behind *QL?
Which direction do you favor?
41
Will the lines blur?
42
Takeaway from both: Do all of the parallel distributed processing in one place and work with it through one UI!
43
Intel Analytics Toolkit
• FILESYSTEMS AND NOSQL STORAGE
• HW PLATFORM
• APACHE HADOOP, APACHE SPARK
• DATA WRANGLING
• MACHINE LEARNING AND STATISTICS: Graphical Algorithms, Classical Algorithms
• Graph Construction Tools, Useful String Manipulation, Useful Math Operators
• “DATA SCIENCE” REST API
• Python Libraries, 3rd Party GUIs/SDKs, Viz Tools, Future Libraries, BI Connectors, Query Interfaces, ...
Callouts:
• Unified UIs across the workflow
• Easier feature & model creation
• End-to-end graph pipeline
• Fully scalable throughout
• Multiple data primitives
• Optimized for IA
Pressing forward with the Intel Analytics Toolkit
Analyzing the Semantic Web
Reputations: Neutral, Good, Bad, Suspect
44
Unified programming environment: DEMO
45
46
PROGRESS TOWARD VISION
47
If we are successful... graph will become just another big data primitive!
49