Scientific Workflows for Big Data
Prof. Shiyong Lu
Big Data Research Laboratory
Department of Computer Science
Wayne State University
Today’s data-intensive science
Jim Gray: Turing Award laureate
Looking for needle in haystack
Looking into haystack
Big Data Challenges
Ian Foster: Father of Grid Computing
For Big Data, data management and movement is a frequent challenge:
• between facilities, archives, and researchers
• many files, large data volumes
• with security, reliability, and performance requirements
Big Data Challenges
Capture, Curation, Storage, Search, Sharing, Analysis, Visualization
Big Data Science
Large Hadron Collider (LHC): 15 PB/year, 173 TB/day, 500 MB/sec
Higgs discovery is “only possible because of the extraordinary achievements of … grid computing” - Rolf Heuer, CERN DG
Data management challenges
Figure: Argonne data flows in TB/day (estimates) among external sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, long-term storage, and data analysis.
Data flows at Argonne National Lab
Credit: Ian Foster
Big Data demands new CS research
For example, existing clustering algorithms are typically cubic in N, and when N is too big, they do not work! - Jim Gray
What is Big Data?
•Definition of Big Data:
“…refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”
from nsf.gov website
Big Data Challenges
•Challenges of Big Data:
“national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.”
from nsf.gov website
Big Data demands big workflows
Reminiscent of
And thousands of parallel executions
Managing big workflows and large-scale parallel execution is a big CS challenge!
Outline
1. Introduction
2. VIEW: A Prototypical SWFMS
3. A Scientific Workflow Composition Model
4. A Collectional Data Model
5. Conclusions and Future Work
Introduction
Data Intensive Science
• From computation intensive to data intensive.
• A new research cycle: from data capture and data curation to data analysis and data visualization.
• “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” (“Beyond the Data Deluge”, Science, Vol. 323, no. 5919, pp. 1297-1298, 2009.)
Introduction
Scientific Workflow
A formal specification of a scientific process.
Represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.
Applications: Bioinformatics, Oceanography, Neuroinformatics, Astronomy, etc.
Introduction
Scientific Workflow Management System (SWFMS)
Supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow.
Existing SWFMSs:
• Taverna
• Kepler
• Pegasus
• VisTrails
• VIEW
• …
Our VIEW System
• Enables scientists to design workflows.
• Provides a runtime system to execute workflows:
  • on a dedicated VIEW server
  • in a Cloud computing environment
• Supports efficient collection, storage, querying, and visualization of workflow provenance.
• Is currently used in several bioinformatics applications, including genomic recombination and gene conversion data analysis.
An Example Workflow in VIEW
VIEW 1-2-3
Step 1: Drag and drop inputs, outputs, and computational modules.
Step 2: Link them into a scientific workflow.
Step 3: Click the run button to get the result!
Kids Play VIEW
An Example Workflow in VIEW
FiberFlow: transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical intelligence in neural diseases.
VIEW: A Prototypical SWFMS
• Minimal complexity for users, with the heavy machinery kept backstage.
  • Provides a clear and simple abstraction for manipulating and coordinating resources.
• Service-oriented architecture.
• Intuitive, user-friendly GUI.
A Reference Architecture for SWFMSs
Service-oriented architecture of VIEW
Other advantages of VIEW:
• VIEW workflows can be executed in other systems (specifications are not tied to a particular SWFMS).
• The use of open standards (Web Services, XML) promotes collaboration, interoperability, and extensibility of the system.
• The workflow and data models implemented in VIEW are specifically geared towards heavy scientific data.
VIEW: A Prototypical SWFMS
A typical scientific workflow execution diagram.
Workflow Engine
The Workflow Engine is the heart of the system:
• Workflow orchestration.
• Workflow execution.
• Coordination of other subsystems.
The Workflow Engine in VIEW:
• Dataflow based.
• Pure workflow composition.
• Workflow constructs.
SWL
Example of our proposed scientific workflow specification language (SWL).
Primitive Workflow Specification
Example SWL specification of a primitive workflow.
Workflow Execution
Workflow execution covers:
• Primitive workflows.
• Unary-construct-based workflows.
• Graph-based workflows.
  • A workflow graph is a composition of workflows by binary constructs.
  • Optimistic scheduling.
Workflow Database Schema
Data Product Manager
The Data Product Manager provides:
• Solid data model.
• Scalable data storage.
• Convenient data access.
• Data independence.
The Data Product Manager is based on the collectional data model.
DPM Architecture
Architecture of the Data Product Manager.
Figure labels: Data Product Manager; Data Access Layer; Data Mapping Layer; Data Storage Layer; Main Server; Master; Node Database; Relational Databases; File Repositories; Data Set 1; Data Set 2.
DPL
Example of the XML description of a collectional data product.
Data Storage
VIEW supports two ways of storage:
• A collection can be stored in a table containing a set of its key/value pairs, whose values are references to existing collections.
• A collection can be expanded and stored in two tables.
  • The Group By operator.
  • The Compress operator.
Data Typing
A Data Product is a Collection, a List, or an Empty.
The List type:
• Introduced in the workflow engine.
• Each element is a data product.
• Heterogeneous.
Collectional Data Querying
Operators are implemented as primitive workflows:
• Arithmetic operators.
• Boolean operators.
• Collectional operators.
• List operators.
Queries are implemented as workflow compositions.
Example
Given a table Reference < Student, Company, GradTime >, find the total number of students with offers at each company in each graduation year; sort the result in descending GradTime and ascending Company order.
SQL query:
SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC;
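The same query can be sketched in Python as a group-by step, a distinct count, and a sort. This is only an illustration of the query's semantics, not how VIEW evaluates it; the sample rows and the name count_offers are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows of Reference<Student, Company, GradTime>.
reference = [
    ("Ann", "Acme", 2012),
    ("Bob", "Acme", 2012),
    ("Ann", "Acme", 2012),  # duplicate row: counted once (DISTINCT)
    ("Cid", "Bolt", 2012),
    ("Dee", "Acme", 2011),
]


def count_offers(rows):
    """GROUP BY (Company, GradTime), COUNT(DISTINCT Student), then sort."""
    groups = defaultdict(set)
    for student, company, grad_time in rows:
        groups[(company, grad_time)].add(student)
    result = [(c, g, len(students)) for (c, g), students in groups.items()]
    # ORDER BY GradTime DESC, Company ASC
    result.sort(key=lambda row: (-row[1], row[0]))
    return result


offers = count_offers(reference)
print(offers)
```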
Example of Query Workflow
Query Workflow.
Key Requirements for Workflow Modeling
R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.
A Scientific Workflow Model
Workflows are the basic and the only operands for workflow composition.
Task components (e.g. Web services) are wrapped into primitive workflows (a.k.a. tasks), which are the basic building blocks of scientific workflows.
Figure: workflows W1, W2, and W3, each with input ports (i1, ..., ik) and an output port (o1).
A Scientific Workflow Model
A workflow construct is a mapping from a set of workflows to a workflow:
• Unary workflow constructs
• Binary workflow constructs
• …
A construct C takes a set of workflows W1, ..., Wn as input and composes them into Wc as the output workflow.
A Scientific Workflow Model
Our proposed scientific workflow model consists of the following two layers:
• The logical layer contains the workflow interface, which models the input ports and output ports of a workflow.
• The physical layer contains the workflow body, which models the physical implementation of the workflow:
  • Primitive workflows.
  • Graph-based workflows.
  • Unary-construct-based workflows.
Unary Workflow Constructs
Dataflow-based Unary Workflow Constructs
The Map Construct
The Map construct enables the parallel processing of a collection of data products based on a workflow that can only process a single data product.
Example:
Figure: W2 = Map(W1) applied to the input [[1,2],[3,6],[4,7]]; three instances of W1 run in parallel, one per element, and produce 2, 18, and 28.
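A minimal Python sketch of the Map construct, assuming workflows are modeled as plain callables. The slide does not name W1; multiplication is assumed here because it reproduces the outputs 2, 18, and 28. The names map_construct and multiply are hypothetical, and a thread pool stands in for VIEW's parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply(a, b):
    """Stand-in for W1 (assumed to multiply; the slide does not name it)."""
    return a * b


def map_construct(workflow):
    """Unary construct: lift a single-item workflow to one over a list."""
    def mapped(items):
        # Each element of `items` holds the arguments for one run of `workflow`.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda args: workflow(*args), items))
    return mapped


w2 = map_construct(multiply)
result = w2([[1, 2], [3, 6], [4, 7]])
print(result)
```

The construct returns a new callable, so the result is itself a workflow and can be passed to another construct.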
The Reduce Construct
The Reduce construct enables the aggregation of a list of data products to a single data product based on a workflow that aggregates a limited (two or more) number of input data products.
Example:
Figure: W3 = Reduce(Add) applied to the list [3,5,9] with initial value 0; the chained Add instances produce 3, 8, and finally 17.
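The Reduce example can be sketched as a left fold, again assuming workflows are plain Python callables. The names reduce_construct and add are hypothetical, and the initial value 0 is taken from the example.

```python
def add(a, b):
    """Binary aggregation workflow used in the example."""
    return a + b


def reduce_construct(workflow, initial):
    """Unary construct: fold a list into one data product, left to right."""
    def reduced(items):
        accumulator = initial
        for item in items:
            # The output of each run becomes the first input of the next run.
            accumulator = workflow(accumulator, item)
        return accumulator
    return reduced


w3 = reduce_construct(add, 0)
total = w3([3, 5, 9])  # 0+3 = 3, 3+5 = 8, 8+9 = 17
print(total)
```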
The Tree Construct
The Tree construct:
• Enables parallel aggregation of a collection of data products.
• Aggregates a collection pairwise, as a binary tree, until a single aggregated product is generated.
The Tree construct can be applied to associative workflows.
Example:
Figure: W4 = Tree(Add) applied to [0,3,5,9]; Add(0,3) = 3 and Add(5,9) = 14 are computed at the first level, and Add(3,14) = 17 at the root.
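A sketch of the pairwise aggregation, assuming an associative binary workflow. The pairs within one level are independent of each other, which is what makes parallel execution possible; for brevity this sketch evaluates them sequentially. The name tree_construct and the handling of an odd leftover element are assumptions.

```python
def add(a, b):
    """Associative binary workflow used in the example."""
    return a + b


def tree_construct(workflow):
    """Unary construct: aggregate a list pairwise, level by level."""
    def aggregated(items):
        level = list(items)
        if not level:
            raise ValueError("Tree needs at least one data product")
        while len(level) > 1:
            # Combine neighbours (0,1), (2,3), ...; each pair is independent.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                next_level.append(level[-1])  # odd element moves up unchanged
            level = next_level
        return level[0]
    return aggregated


w4 = tree_construct(add)
print(w4([0, 3, 5, 9]))  # levels: [3, 14] then [17]
```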
The Conditional Construct
The Conditional construct enables the conditional execution of a workflow based on a condition on one of the inputs.
Example:
Figure: the Conditional construct (C) applied to a Projection workflow with input ports i1 and i2 and predicate inputs PI1 and PI2. Labels recoverable from the figure: p = (PI1 < PI2), p = true; p = (PI1 >= PI2), p = false; Fail.
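A sketch of a conditional workflow under stated assumptions: the predicate inputs (PI1, PI2) are passed separately from the workflow inputs, the wrapped workflow runs only when the predicate is true, and a false predicate is reported as a failure. The slides do not spell out these details, so the names conditional_construct and projection, the 1-based projection semantics, and the input values are all hypothetical.

```python
def projection(position, items):
    """Return the element at a 1-based position (assumed semantics)."""
    return items[position - 1]


def conditional_construct(workflow, predicate):
    """Unary construct: run `workflow` only when `predicate` holds."""
    def guarded(predicate_inputs, workflow_inputs):
        if not predicate(*predicate_inputs):
            # Assumption: a false predicate makes the run fail.
            raise RuntimeError("Fail: predicate evaluated to false")
        return workflow(*workflow_inputs)
    return guarded


w = conditional_construct(projection, lambda pi1, pi2: pi1 < pi2)
ok = w((1, 2), (2, [2, 3]))  # predicate is true, so Projection runs
print(ok)
```

Calling `w((2, 1), (2, [2, 3]))` raises RuntimeError in this sketch, because the predicate evaluates to false.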
The Loop Construct
The Loop construct enables cyclic executions of a workflow.
The output of the workflow will be repetitively returned (fed back) to a specified input port until the predicate evaluates to true.
Example:
Figure: the Loop construct (L) applied to Add with predicate p = (PI1 > 100). Starting from 0, each iteration adds 1 and feeds the output (1, 2, ...) back to the input while p = false; the loop stops with output 101, when p = true.
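The Loop example can be sketched as follows, assuming the output is fed back to the first input port and the predicate is checked on each output. The name loop_construct and the max_iterations safety bound are hypothetical additions.

```python
def add(a, b):
    """Workflow executed on every iteration."""
    return a + b


def loop_construct(workflow, predicate, max_iterations=10_000):
    """Unary construct: feed the output back to the first input port."""
    def looped(seed, *fixed_inputs):
        value = seed
        for _ in range(max_iterations):
            value = workflow(value, *fixed_inputs)
            if predicate(value):  # stop as soon as the predicate is true
                return value
        raise RuntimeError("predicate never became true")
    return looped


w = loop_construct(add, lambda output: output > 100)
final = w(0, 1)  # start at 0, add 1 on each iteration
print(final)
```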
The Curry Construct
The Curry construct allows users to fix one of the input ports with a specified argument and thus reduce the number of input ports.
By applying multiple Curry constructs, a workflow that takes multiple arguments can be translated into a chain of workflows each with a single argument.
Example:
Figure: W8 is obtained by applying the Curry construct to Add with input port i2 fixed to 1; for input 4, W8 outputs 5.
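A sketch of the Curry construct, assuming input ports are modeled as positional arguments. The name curry_construct and the 0-based port index are hypothetical; Python's functools.partial would serve a similar purpose for leading arguments.

```python
def add(a, b):
    """Two-input workflow used in the example."""
    return a + b


def curry_construct(workflow, port, value):
    """Unary construct: fix the input at 0-based position `port`."""
    def curried(*remaining_inputs):
        inputs = list(remaining_inputs)
        inputs.insert(port, value)  # put the fixed argument back in place
        return workflow(*inputs)
    return curried


w8 = curry_construct(add, 1, 1)  # fix the second input (i2) to 1
print(w8(4))
```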
Workflow Composition
Example of the composition of Map and Map constructs: a workflow that increases all the numbers in a nested list by 1.
Figure: W9, built by applying the Map construct (M) twice to Add with the increment 1; for input [[1,2,3],[4,5,6]], six Add instances produce 2, 3, 4, 5, 6, and 7.
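A sketch of this composition, assuming the increment is fixed with a Curry-style step before Map is applied twice. The helper names are hypothetical, and the inner runs are evaluated sequentially here although they are independent of each other.

```python
def add(a, b):
    return a + b


def curry_construct(workflow, value):
    """Fix the second input of a two-input workflow."""
    return lambda x: workflow(x, value)


def map_construct(workflow):
    """Lift a single-item workflow to one over a list."""
    return lambda items: [workflow(item) for item in items]


# Map(Map(Curry(Add, 1))): the outer Map walks rows, the inner Map walks cells.
w9 = map_construct(map_construct(curry_construct(add, 1)))
incremented = w9([[1, 2, 3], [4, 5, 6]])
print(incremented)
```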
Workflow Composition
Example of the composition of Map and Reduce constructs: a workflow for parallel summation of each row in a matrix.
Figure: W11, built by applying Reduce (R) and then Map (M) to Addition with initial value 0; for input [[1,2,3],[4,5,6]], the row sums are 6 and 15.
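A sketch of the row-summation workflow, assuming Reduce is applied to Addition first and Map is then applied to the resulting workflow. The helper names are hypothetical; functools.reduce plays the role of the sequential fold.

```python
from functools import reduce


def addition(a, b):
    return a + b


def reduce_construct(workflow, initial):
    """Fold a list into a single data product."""
    return lambda items: reduce(workflow, items, initial)


def map_construct(workflow):
    """Apply a workflow to each element; rows are independent of each other."""
    return lambda items: [workflow(item) for item in items]


# Map(Reduce(Addition, 0)): one Reduce run per matrix row.
w11 = map_construct(reduce_construct(addition, 0))
row_sums = w11([[1, 2, 3], [4, 5, 6]])
print(row_sums)
```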
Workflow Composition
Example of complicated workflow composition.
A workflow to calculate the greatest common divisor.
Figure: the greatest-common-divisor workflow, composed from Modulus, Split, Merge, and Projection tasks through nested workflows W13 to W17, with predicate p = (PI(2) == 0).
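One way to read this workflow is as Euclid's algorithm expressed with a loop: each iteration maps the pair (a, b) to (b, a mod b), the predicate tests whether the second component is 0, and a final projection extracts the first component. The sketch below follows that reading; it is an interpretation of the figure, not the exact wiring of W13 to W17, and all names are hypothetical.

```python
def modulus_step(pair):
    """One iteration: (a, b) -> (b, a mod b)."""
    a, b = pair
    return (b, a % b)


def loop_construct(workflow, predicate, max_iterations=1000):
    """Feed the output back as the next input until the predicate is true."""
    def looped(value):
        for _ in range(max_iterations):
            value = workflow(value)
            if predicate(value):
                return value
        raise RuntimeError("predicate never became true")
    return looped


def gcd_workflow(a, b):
    if b == 0:
        return a  # nothing to iterate; avoids a modulus by zero
    # pair[1] is the second component, i.e. PI(2) in the slide's notation.
    loop = loop_construct(modulus_step, lambda pair: pair[1] == 0)
    return loop((a, b))[0]  # projection of the first component


print(gcd_workflow(48, 18))
```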
A Collectional Data Model
A collectional data model should:
• Support collection-oriented datasets.
  • Scientists often work with collection-oriented datasets, such as arrays, lists, tables, or file collections.
  • A collection-oriented data model enables data parallelism in scientific workflows.
• Support nested data structures.
  • Scientific data is often hierarchically organized.
  • Scientific workflow tasks often produce collections of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections.
• Provide well-defined operators and their arbitrary compositions to manipulate and query scientific data collections.
A Collectional Data Model
A relation is a pair < R, r > where R is a schema of the relation and r is an instance of that schema.
A relation schema can be defined as an unordered tuple < c1 : d1, c2 : d2, …, cn : dn > where c1, c2, …, cn are column names and d1, d2, …, dn are domain names.
A relation instance is a table with rows (called tuples) and named columns (called attributes).
A Collectional Data Model
A collection schema is a pair < K, V >.
• K, the key, is a pair k : d where k is the key name and d is the domain name.
• V, the value, is either a relation schema or a collection schema.
A collection instance is a set of key-value pairs (pi, qi) (i ∈ {1,…,m}).
• Each pi is a scalar value.
• Each qi is either a relation instance or a collection instance.
A Collectional Data Model
An example: Parameters< Model : String, Experiments : Integer, < Concentration : Double, Degree : Integer >>.
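A minimal sketch of an instance of the Parameters schema, assuming nested Python dicts for collection instances and lists of row dicts for relation instances. The values are hypothetical, and flatten is an illustrative helper rather than one of the model's operators.

```python
# Collection instance: key -> nested collection instance or relation instance.
# Model (String) -> Experiments (Integer) -> relation<Concentration, Degree>.
parameters = {
    "m1": {
        1: [{"Concentration": 7.0, "Degree": 15}],
        2: [{"Concentration": 7.0, "Degree": 30}],
    },
    "m2": {
        1: [{"Concentration": 7.1, "Degree": 15}],
        2: [{"Concentration": 7.1, "Degree": 30}],
    },
}


def flatten(collection, keys=()):
    """Yield (key path, row) for every tuple stored in a nested collection."""
    for key, value in collection.items():
        if isinstance(value, dict):  # nested collection instance
            yield from flatten(value, keys + (key,))
        else:  # relation instance: a list of rows
            for row in value:
                yield keys + (key,), row


rows = list(flatten(parameters))
for path, row in rows:
    print(path, row)
```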
The Collectional Operators
We extend the relational operators to collectional operators, whose only operands are collections.
• Six primitive operators: union, set difference, selection, projection, Cartesian product, and renaming.
• The set of collections is closed under these operators.
A relation can be defined as a collection whose height and cardinality are equal to 1. The collectional operators then reduce to the relational operators.
The Collectional Operators
The union and the set difference operators can only be applied on union-compatible collections.
Figure: two union-compatible collections keyed by Model. The first maps m1 to Result 26 and m2 to Result 32; the second maps m2 to Result 32 and m3 to Result 31.
The Collectional Operators
Example of the union operator and the set difference operator.
Figure: the union contains m1 (Result 26), m2 (Result 32), and m3 (Result 31); in the set difference, m1 keeps Result 26 and m2 has an empty Result.
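One plausible reading of the union and set difference on nested collections is sketched below: the operator is applied recursively to the values of shared keys and falls back to the relational operator at the leaves. This recursive semantics is an assumption, not a definition taken from the slides. Collections are modeled as dicts, relations as sets of tuples, and the data mirrors the Model/Result example.

```python
def union(a, b):
    """Union of two union-compatible collections (assumed recursive)."""
    if isinstance(a, dict):
        merged = {}
        for key in a.keys() | b.keys():
            if key in a and key in b:
                merged[key] = union(a[key], b[key])
            else:
                merged[key] = a[key] if key in a else b[key]
        return merged
    return a | b  # relational union at the leaves


def difference(a, b):
    """Set difference; keys of the left operand are kept (assumption)."""
    if isinstance(a, dict):
        return {
            key: difference(value, b[key]) if key in b else value
            for key, value in a.items()
        }
    return a - b  # relational set difference at the leaves


A = {"m1": {(26,)}, "m2": {(32,)}}
B = {"m2": {(32,)}, "m3": {(31,)}}
print(union(A, B))
print(difference(A, B))
```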
The Collectional Operators
Example of the Cartesian product Operator and the Renaming Operator.
Figure: the Cartesian product of two collections renamed M1 and M2, keyed by M1.model and M2.model (each ranging over m1 and m2); the four resulting (M1.Result, M2.Result) pairs are (26, 32), (26, 31), (32, 32), and (32, 31).
The Collectional Operators
Example of the selection operator.
Figure: selection result with Model m2, Experiment 1, and the tuple (Concentration 7.1, Degree 15, ...).
The Collectional Operators
Example of the projection operator.
Figure: projection result keyed by Experiment (1, 2, ...), each key holding a relation over (Concentration, Degree, ...): one with the tuples (7.0, 15, ...) and (7.1, 15, ...), the other with (7.0, 30, ...) and (7.1, 30, ...).
Key Features of VIEW
F1: VIEW features the first uniform workflow model, in which workflows are the only building blocks. In VIEW, tasks are primitive workflows and all workflow constructs do not discriminate workflows from tasks. Such a model greatly simplifies workflow design, in which a workflow designer only needs to compose complex workflows from simpler ones without the need to first encapsulate workflows to tasks or vice versa during the composition process.
F2: VIEW offers powerful workflow composition: workflow constructs can be composed with one another to arbitrary nesting levels. This often results in VIEW workflows that are more concise and more efficient to execute, and that can be hard to model in other workflow systems.
F3: VIEW features a pure dataflow-based workflow language SWL, including the dataflow counterparts of controlflow-style constructs, such as conditional and loop. Existing workflow languages often require both controlflow and dataflow constructs, resulting in complex or even obscure semantics and non-trivial workflow design.
F4: VIEW supports the cloud MapReduce programming model not only at the job level, but also at the workflow level. Therefore, one can apply the Map and Reduce constructs to an arbitrary workflow an arbitrary number of times. As a result, VIEW can process nested lists of data products in parallel using multiple runs of a workflow.
F5: VIEW features a collectional data model that supports not only traditional primitive data types, such as integer, float, double, boolean, char, string, but also files, relations, hierarchical collections (hierarchical key-value pairs) to support parallel processing of data collections.
F6: VIEW supports a high-level graph-based provenance query language OPQL. In most cases, users can formulate lineage queries easily without the need of writing recursive queries or knowing the underlying database schema.
F7: VIEW features the first service-oriented architecture that conforms to the reference architecture for scientific workflow management systems (SWFMSs). This architecture greatly facilitates interoperability and subsystem reusability in the community. This architecture also provides a generic infrastructure upon which a domain-specific scientific workflow application system (SWFAS) can be easily developed with custom interface for various platforms and devices.
Conclusions and Future Work
• A scientific workflow composition model.
• A collectional data model.
• A prototypical SWFMS.
Future work:
• Formalization of the scientific workflow algebra and collectional algebra.
  • Completeness.
  • Integration.
• Collaborative scientific workflow composition.
  • Concurrent design and composition.
  • Concurrent execution.
VIEW application
Fiber tract analysis for Epilepsy.
VIEW application
Computational detection of MARS in genome.
VIEW application
DNA analysis for the bacterium E. coli.
VIEW application
Simulation of Nereis succinea mate search behavior.
Big Data is a Pyramid
Can you contribute a piece too?
Big Data Research Laboratory
Wayne State University
viewsystem.org