MongoDB & Spark

@blimpyacht

Level Setting

TROUGH OF DISILLUSIONMENT

(a phase of the Gartner hype cycle)

[Diagram: the Hadoop stack, built up layer by layer]

HDFS: distributed data
YARN: distributed resources
MapReduce: distributed processing

Hive and Pig: domain-specific languages layered on top of MapReduce.

What Spark adds over MapReduce: an interactive shell and easy(-er) caching.
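A minimal sketch of the caching point, not from the original deck: the app name and the sample data are illustrative. Once an RDD is cached, repeated actions reuse the in-memory partitions instead of recomputing them.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachingExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("caching-demo").setMaster("local"));

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        numbers.cache();                      // keep partitions in memory once computed
        System.out.println(numbers.count()); // first action computes and caches
        System.out.println(numbers.count()); // second action reads from the cache
        sc.stop();
    }
}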

[Diagram: the same stack rebuilt around Spark, layer by layer]

Storage: HDFS
Cluster managers: Stand Alone, YARN, Mesos
Processing engines: Spark, alongside Hadoop MapReduce
Tools above the engines: Hive, Pig, Spark SQL, Spark Shell, Spark Streaming

[Diagram: Spark runtime. The Driver dispatches tasks to an executor on each Worker Node.]
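A minimal sketch of how a driver program comes up; the master URL and app name are illustrative, not from the slides.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverExample {
    public static void main(String[] args) {
        // The driver builds a SparkConf; the resulting context asks the
        // cluster manager (stand alone, YARN, or Mesos) for executors.
        SparkConf conf = new SparkConf()
            .setAppName("mongodb-spark-demo")
            .setMaster("local[*]");
        JavaSparkContext context = new JavaSparkContext(conf);
        // ... build and run RDDs against `context` here ...
        context.stop();
    }
}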

Resilient Distributed Datasets

Parallelization
    parallelize = x

Transformations
    t(x) = x′
    t(x′) = x″
    map( func ), filter( func ), union( otherRDD ), intersection( otherRDD ), distinct()

Actions
    f(x″) = y
    collect(), count(), first(), take( n ), reduce( func )
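Putting the three pieces together, a sketch of one parallelize, transform, transform, action chain in the deck's Spark 1.x Java style; the data and functions are made up. Transformations are lazy: nothing executes until the action at the end.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class RddExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("rdd-demo").setMaster("local"));

        // parallelize = x
        JavaRDD<Integer> x = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // t(x) = x'
        JavaRDD<Integer> xPrime = x.map(new Function<Integer, Integer>() {
            public Integer call(Integer i) { return i * 10; }
        });

        // t(x') = x''
        JavaRDD<Integer> xDoublePrime = xPrime.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer i) { return i > 20; }
        });

        // f(x'') = y : the action triggers the whole lineage
        int y = xDoublePrime.reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        });
        System.out.println(y); // 30 + 40 + 50 + 60 = 180
        sc.stop();
    }
}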

Lineage

    parallelize = x → t(x) = x′ → t(x′) = x″ → f(x″) = y

[Diagram: the Parallelize → Transform → Transform → Action chain, repeated across partitions. The recorded lineage is what lets Spark recompute a lost partition from its source.]

https://github.com/mongodb/mongo-hadoop

{"_id" : ObjectId("4f16fc97d1e2d32371003e27"),"body" : "the scrimmage is still up in the air.

"subFolder" : "notes_inbox","mailbox" : "bass-e","filename" : "450.","headers" : {

"X-cc" : "","From" : "michael.simmons@enron.com","Subject" : "Re: Plays and other information","X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\

Notes inbox","Content-Transfer-Encoding" : "7bit","X-bcc" : "","To" : "eric.bass@enron.com","X-Origin" : "Bass-E","X-FileName" : "ebass.nsf","X-From" : "Michael Simmons","Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)","X-To" : "Eric Bass","Message-ID" :

"<6884142.1075854677416.JavaMail.evans@thyme>","Content-Type" : "text/plain; charset=us-ascii","Mime-Version" : "1.0"

}}

{"_id" : ObjectId("4f16fc97d1e2d32371003e27"),"body" : "the scrimmage is still up in the air.

"subFolder" : "notes_inbox","lfpwoojjf0wig=-i1qf=q0qif0=i38 \-00\ 1-8" : "bass-e","filename" : "450.","headers" : {

"X-cc" : "",

"From" : "michael.simmons@enron.com",

"Subject" : "Re: Plays and other information","X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\

Notes inbox","Content-Transfer-Encoding" : "7bit","X-bcc" : "",

"To" : "eric.bass@enron.com","X-Origin" : "Bass-E","X-FileName" : "ebass.nsf","X-From" : "Michael Simmons","Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)","X-To" : "Eric Bass","Message-ID" :

"<6884142.1075854677416.JavaMail.evans@thyme>","Content-Type" : "text/plain; charset=us-ascii","Mime-Version" : "1.0"

}}

{ _id : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com", value : 2}{ _id : "kmccomb@austin-mccomb.com|brian@enron.com", value : 2}{ _id : "sally.beck@enron.com|sandy.stone@enron.com", value : 2 }

Eratosthenes

Democritus

Hypatia

Shemp

Euripides

Spark Configuration

Configuration conf = new Configuration();
conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
conf.set("mongo.input.uri", "mongodb://localhost:27017/db.collection");

Spark Context

JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD(
    conf,
    MongoInputFormat.class,
    Object.class,
    BSONObject.class
);
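The connector writes back to MongoDB the same way. A sketch patterned on the mongo-hadoop examples, assuming results is a JavaPairRDD<Object, BSONObject> to be saved; the output URI is illustrative.

import org.apache.hadoop.conf.Configuration;
import com.mongodb.hadoop.MongoOutputFormat;
import org.bson.BSONObject;

Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/db.results");

// The path argument is required by the Hadoop API but unused by the
// connector, which routes writes through mongo.output.uri.
results.saveAsNewAPIHadoopFile(
    "file:///this-is-unused",
    Object.class,
    BSONObject.class,
    MongoOutputFormat.class,
    outputConfig
);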

[Diagram: Spark executors reading and writing through mongos routers, the Data Services tier.]

Deployment Artifacts

Hadoop Connector Jar
Fat Jar
Java Driver Jar

Spark Submit

/usr/local/spark-1.5.1/bin/spark-submit \
    --class com.mongodb.spark.examples.DataframeExample \
    --master local \
    Examples-1.0-SNAPSHOT.jar


JavaRDD<Message> messages = documents.map (

new Function<Tuple2<Object, BSONObject>, Message>() {

public Message call(Tuple2<Object, BSONObject> tuple) { BSONObject header = (BSONObject)tuple._2.get("headers");

Message m = new Message(); m.setTo( (String) header.get("To") ); m.setX_From( (String) header.get("From") ); m.setMessage_ID( (String) header.get( "Message-ID" ) ); m.setBody( (String) tuple._2.get( "body" ) );

return m; } });
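Given the DataframeExample class named in the spark-submit line, the messages RDD plausibly feeds Spark SQL. A sketch assuming Message is a plain JavaBean, reusing context and messages from above; the query itself is illustrative.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(context);
DataFrame frame = sqlContext.createDataFrame(messages, Message.class);
frame.registerTempTable("messages");

// Illustrative query over the mapped Enron messages
DataFrame toEric = sqlContext.sql(
    "SELECT x_From, message_ID FROM messages WHERE `to` = 'eric.bass@enron.com'");
toEric.show();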

MognoDB & Spackcode demo

THE FUTURE AND BEYOND THE INFINITE


MongoDB + Spark

THANKS!

{
    name: 'Bryan Reinero',
    role: 'Developer Advocate',
    twitter: '@blimpyacht',
    email: 'bryan@mongodb.com'
}
