Spark Illustrated and Illuminated
Tony Duarte
Spark and Hadoop Training
www.sparkInternals.com
(650) 223-3397
Copyright 2014 Tony Duarte
newRDD = myRDD.map(myfunc)
What is an RDD?

[Diagram: myRDD : RDD holds an Array of references to Partition objects; each Partition is held in memory on a node of the cluster]

Some RDD Characteristics
• RDDs hold references to Partition objects
• Each Partition object references a subset of your data
• Partitions are assigned to nodes on your cluster
• Each partition/split will be in RAM (by default)
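The characteristics above can be sketched with a small mock. This is an illustrative model, not the real Spark API: the class names, the chunking scheme, and the fields are assumptions made for this sketch only. The point is the shape of the structure: the "RDD" itself is little more than an array of Partition references, and each Partition covers a subset (a split) of the data.

```python
class Partition:
    """One split of the data. A real Spark partition lives in the RAM
    of whichever cluster node it is assigned to."""
    def __init__(self, index, data):
        self.index = index   # position of this split within the RDD
        self.data = data     # the subset of records this split holds

class RDD:
    """Holds only an array of references to Partition objects."""
    def __init__(self, data, num_partitions):
        # Simple fixed-size chunking, chosen just for this illustration
        chunk = (len(data) + num_partitions - 1) // num_partitions
        self.partitions = [
            Partition(i, data[i * chunk:(i + 1) * chunk])
            for i in range(num_partitions)
        ]

myRDD = RDD(list(range(10)), num_partitions=4)
print(len(myRDD.partitions))     # 4
print(myRDD.partitions[0].data)  # [0, 1, 2]
```

In real Spark the splits are distributed across cluster nodes rather than held in one Python list, but the object graph — one RDD, an array of partition references, each covering a slice of the data — is the same.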
What happens? newRDD = myRDD.map(myfunc)

[Diagram: calling map() on myRDD : RDD constructs new MappedRDD(myRDD, myfunc); the resulting newRDD : MappedRDD holds a dependency on myRDD, and its compute() stores the operation map(myfunc) — so myfunc() runs only when the RDD is actually computed]
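A minimal sketch of what map() does internally. This is a mock, not Spark itself (the real MappedRDD was a Scala class in Spark's RDD package); the method bodies here are assumptions made to illustrate the mechanism: map() does no work, it just wraps the parent RDD and the function in a new object.

```python
class RDD:
    def __init__(self, data):
        self.data = data

    def map(self, f):
        # No computation happens here: just record parent + function
        return MappedRDD(self, f)

    def compute(self):
        return self.data

class MappedRDD(RDD):
    def __init__(self, parent, f):
        self.parent = parent  # dependency on myRDD
        self.f = f            # stored operation: map(myfunc)

    def compute(self):
        # Only when compute() is called does myfunc touch the data
        return [self.f(x) for x in self.parent.compute()]

myRDD = RDD([1, 2, 3])
newRDD = myRDD.map(lambda x: x * 10)  # returns instantly; nothing computed
print(newRDD.compute())               # [10, 20, 30]
```

Note that constructing newRDD is cheap regardless of how big the data is — the cost is deferred until compute() is called.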
After Executing: newRDD = myRDD.map(myfunc)

[Diagram: newRDD : MappedRDD holds an Array of Partition references, a dependency on myRDD, and the stored operation map(myfunc); myRDD : RDD still holds its own in-memory Partitions]

This architecture enables:
• You can chain operations on RDDs and Spark will keep generating new RDDs
• Job scheduling can be lazy - since a dependency chain of operations can be submitted
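Both bullets can be seen in one small sketch. Again this is a mock, not the Spark API — the single RDD class and its fields are assumptions for illustration. Chaining map() calls just links new RDD objects to their parents; walking the parent references recovers the full dependency chain, which is what lets a scheduler submit the whole pipeline lazily as one job.

```python
class RDD:
    def __init__(self, data=None, parent=None, f=None):
        self.data = data      # only the base RDD holds data
        self.parent = parent  # dependency on the previous RDD
        self.f = f            # stored operation for this link

    def map(self, f):
        # Each map() generates a new RDD; no computation yet
        return RDD(parent=self, f=f)

    def compute(self):
        if self.parent is None:      # base RDD: data already in memory
            return self.data
        # Recursively compute the dependency chain, then apply f
        return [self.f(x) for x in self.parent.compute()]

base = RDD(data=[1, 2, 3])
chained = base.map(lambda x: x + 1).map(lambda x: x * 2)

# The dependency chain exists before anything is computed:
depth, node = 0, chained
while node.parent is not None:
    depth, node = depth + 1, node.parent
print(depth)              # 2 (two chained map operations)
print(chained.compute())  # [4, 6, 8]
```

Real Spark's scheduler does exactly this kind of walk over RDD dependencies (grouping them into stages) when an action finally triggers a job.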