4.HDFS & IO

  • 8/9/2019 4.HDFS & IO

    http://www.excelonlineclasses.co.nr/

    Excel Online Classes offers the following services:
    • Online Training
    • Development
    • Testing
    • Job support
    • Technical Guidance
    • Job Consultancy
    • Any needs of IT Sector

    MapReduce Anatomy

    Nagaruna !

    AGENDA

    • Anatomy of MapReduce
    • MR work flow
    • Hadoop data types
    • Mapper
    • Reducer
    • Partitioner
    • Combiner
    • Input Split vs Block Size

    Anatomy of MR

    [Figure: INPUT DATA is read by Map tasks on three nodes, each producing interim data; after partitioning and shuffling, Reduce tasks on three nodes write the output.]

    Hadoop data types

    MR defines the types that keys and values must have so that they can move across the cluster.

    • Values: Writable
    • Keys: WritableComparable<T> (WritableComparable = Writable + Comparable<T>)

    Frequently used key/value types

    Hadoop type        Wrapper for Java type
    BooleanWritable    Boolean
    ByteWritable       Byte
    DoubleWritable     Double
    IntWritable        Integer
    LongWritable       Long
    Text               String
    NullWritable       Placeholder when key/value is not needed

    Custom Writable

    For any class to be a value, it has to implement org.apache.hadoop.io.Writable:

    • write(DataOutput out)
    • readFields(DataInput in)
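As a rough illustration of the two methods above, here is a minimal sketch of a value type that serializes its fields and reads them back in the same order. The class name PageVisit, its fields, and the roundTrip helper are hypothetical. To keep the sketch runnable with the JDK alone, the class does not actually declare "implements org.apache.hadoop.io.Writable"; in a real job it would.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type. In a real job this class would also declare
// "implements org.apache.hadoop.io.Writable"; that is left out here so the
// sketch compiles without Hadoop on the classpath.
class PageVisit {
    String url = "";
    long hits;

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(hits);
    }

    // Read the fields back in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        hits = in.readLong();
    }

    // Demo helper (not part of any Hadoop API): write to a byte buffer,
    // then read the bytes into a fresh object.
    static PageVisit roundTrip(PageVisit original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PageVisit copy = new PageVisit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PageVisit.roundTrip(v) on an object should give back a copy with the same url and hits, which is the property the framework relies on when it moves values between nodes.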

    Custom Key

    For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T>, which adds one method to those of Writable:

    • compareTo(T o)
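The ordering part of a key can be sketched with plain java.lang.Comparable. The class YearKey and the helper sortedYears are hypothetical names; a real key would implement WritableComparable and therefore also provide write and readFields, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical key type. A real key would implement
// org.apache.hadoop.io.WritableComparable<YearKey>; only the compareTo
// part is shown, using the JDK's Comparable.
class YearKey implements Comparable<YearKey> {
    final int year;

    YearKey(int year) {
        this.year = year;
    }

    // Defines the order in which keys would be sorted before the reduce phase.
    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);
    }

    // Demo helper: sort some keys and return their years in key order.
    static List<Integer> sortedYears(int... years) {
        List<YearKey> keys = new ArrayList<>();
        for (int y : years) {
            keys.add(new YearKey(y));
        }
        Collections.sort(keys);
        List<Integer> result = new ArrayList<>();
        for (YearKey k : keys) {
            result.add(k.year);
        }
        return result;
    }
}
```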

    Check out Writables

    • Check out a few of the Writables and WritableComparables
    • Time to write your own Writables

    MapReduce libraries

    Two libraries in Hadoop:
    • org.apache.hadoop.mapred.*
    • org.apache.hadoop.mapreduce.*

    Mapper

    Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>

    • void configure(JobConf job)
        • All the parameters specified in the XMLs are available here.
        • Any parameters set explicitly are also available.
        • Called before data processing starts.
    • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
        • Data processing starts.
    • void close()
        • Should close any files, DB connections, etc.
    • Reporter provides extra information about the mapper to the TT (TaskTracker)

    Mappers (default)

    Mapper              Functionality
    IdentityMapper      Implements Mapper<K,V,K,V>. Whatever input it gets, it passes to the output.
    InverseMapper       Implements Mapper<K,V,V,K>. Swaps the key and value from the input to the output.
    TokenCountMapper    Implements Mapper<K,Text,Text,LongWritable>. Tokenizes the input value and generates (token, 1) for each token.

    Reducer

    Should implement org.apache.hadoop.mapred.Reducer

    • Sorts the incoming data based on key and groups together all the values for a key
    • The reduce function is called for every key, in sorted order:
        • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
    • Reporter provides extra information about the reducer to the TT (TaskTracker)

    Reducer (default)

    Reducer                Functionality
    IdentityReducer<K,V>   Implements Reducer<K,V,K,V>. Maps inputs directly to outputs.
    LongSumReducer<K>      Implements Reducer<K,LongWritable,K,LongWritable>. Determines the sum of all values corresponding to the given key.
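To see what a token-counting mapper combined with a long-summing reducer produces, the following sketch imitates the pair in plain Java. It is not the Hadoop implementation: the class TokenCountSketch and the method countTokens are hypothetical, and the map, sort, and reduce steps are collapsed into one in-memory loop.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory imitation of a (token, 1) mapper followed by a per-key sum.
class TokenCountSketch {
    static Map<String, Long> countTokens(String... inputValues) {
        // TreeMap keeps keys sorted, similar to the sorted order in which
        // reduce is called for each key.
        Map<String, Long> sums = new TreeMap<>();
        for (String value : inputValues) {
            StringTokenizer tokens = new StringTokenizer(value);
            while (tokens.hasMoreTokens()) {
                // Map side: emit (token, 1).
                // Reduce side: add the 1 to the running sum for that token.
                sums.merge(tokens.nextToken(), 1L, Long::sum);
            }
        }
        return sums;
    }
}
```

For the two input values "a b a" and "b c", the resulting map holds a=2, b=2, c=1.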

    Partitioner

    • implements Partitioner<K,V>
    • configure()
    • int getPartition(…)
        • 0 <= return value < number of reducers
    • Generally, implement a Partitioner so that the same keys go to one reducer
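One common way to satisfy both rules (result within range, same key always to the same reducer) is to derive the partition from the key's hash code. The sketch below shows that idea in plain Java; the class name HashPartitionSketch and the two-argument signature are simplifications chosen for illustration, not the Hadoop interface.

```java
// Hash-based partitioning sketch: equal keys always map to the same
// partition, and the result stays in the range 0 <= partition < numReducers.
class HashPartitionSketch {
    static int getPartition(Object key, int numReducers) {
        // Clear the sign bit so a negative hashCode cannot produce a
        // negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the result depends only on the key's hashCode and the reducer count, every occurrence of a key lands on the same reducer, provided hashCode is stable across JVMs.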

    Reading and Writing

    Generally two kinds of files in Hadoop:
    • Text (plain, XML, HTML, …)
    • Binary (Sequence)
        • A Hadoop-specific compressed binary file format.
        • Optimized to transfer output from one MR job to another MR job.

    We can customize.

    Input Format

    • HDFS block size
    • Input splits

    Blocks in HDFS

    • A big file is divided into multiple blocks and stored in HDFS
    • This is a physical division of data
    • dfs.block.size (64 MB default)

    [Figure: a LARGE FILE divided into four BLOCKs]

    Input Splits and Records

    • Input split: a chunk of data processed by a mapper
    • Further divided into records
    • Map processes these records
        • Record = key + value
    • How to correlate to a DB table
        • Group of rows → split
        • Row → record

    [Figure: a table of records with Key and Value columns, labeled LOGICAL DIVISION]

    InputSplit

    public interface InputSplit extends Writable {
        long getLength() throws IOException;
        String[] getLocations() throws IOException;
    }

    • It doesn't contain the data, only the locations where the data is present
    • Helps the JobTracker to arrange TaskTrackers (data locality)
    • getLength: splits with greater length are executed first

    InputFormat

    How the data gets to the mapper: the InputSplits, and how the splits are divided into records, are taken care of by the InputFormat.

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
    }

    InputFormat

    • Mapper: getRecordReader() is called to get the RecordReader
    • Once the record reader is obtained:
        • the map method is called repeatedly until the end of the split

    RecordReader

    K key = reader.createKey(); … output, reporter
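The way a record reader drives the map method can be sketched as a simple loop: create reusable key and value holders once, then call next until the split is exhausted. The class LineReaderSketch below is a toy stand-in written for this illustration; its method names createKey, createValue, and next follow the old-API RecordReader, but the holder types (a long[] and a StringBuilder) and the runMap helper are assumptions, and byte offsets are only correct for single-byte characters.

```java
import java.util.ArrayList;
import java.util.List;

// Toy record reader over an in-memory "split": each line is a record,
// keyed by the offset at which the line starts.
class LineReaderSketch {
    private final String[] lines;
    private int index = 0;
    private long offset = 0;

    LineReaderSketch(String splitText) {
        lines = splitText.split("\n");
    }

    // Reusable holders, filled again on every call to next().
    long[] createKey() {
        return new long[1];
    }

    StringBuilder createValue() {
        return new StringBuilder();
    }

    // Fills key and value with the next record; returns false at end of split.
    boolean next(long[] key, StringBuilder value) {
        if (index >= lines.length) {
            return false;
        }
        key[0] = offset;
        value.setLength(0);
        value.append(lines[index]);
        offset += lines[index].length() + 1; // +1 for the newline character
        index++;
        return true;
    }

    // Driver loop: the "map" step here is an identity function that just
    // records each key/value pair as "key<TAB>value".
    static List<String> runMap(String splitText) {
        LineReaderSketch reader = new LineReaderSketch(splitText);
        long[] key = reader.createKey();
        StringBuilder value = reader.createValue();
        List<String> output = new ArrayList<>();
        while (reader.next(key, value)) {
            output.add(key[0] + "\t" + value);
        }
        return output;
    }
}
```

The loop is iterative rather than recursive: the same key and value objects are refilled for every record, which is why a mapper should copy them if it needs to keep a record beyond one call.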

    Job Submission: retrospection

    • JobClient running the job
        • Gets input splits by calling getSplits() in InputFormat
        • Determines data locations for the splits
        • Sends these locations to the JobTracker
    • JobTracker assigns mappers appropriately
        • Data locality

    In-built InputFormats

    FileInputFormat

    • Base class for all implementations of InputFormat that use files as input
    • Defines:
        • which files to include for the job
        • the implementation for generating splits

    FileInputFormat

    • Set of files → converted to a number of splits
    • Splits only large files. How large? Larger than the block size
    • Can we control it?

    Property                 Description                                          Default value
    mapred.min.split.size    The smallest valid size in bytes for a file split.   1
    mapred.max.split.size    The largest valid size in bytes for a file split.    Long.MAX_VALUE
    dfs.block.size           The size of a block in HDFS in bytes.                64 MB

    Calculating Split Size

    mapred.min.split.size    mapred.max.split.size    dfs.block.size    split size
    1                        Long.MAX_VALUE           64 MB             64 MB
    1                        Long.MAX_VALUE           128 MB            128 MB
    128 MB                   Long.MAX_VALUE           64 MB             128 MB
    1                        32 MB                    64 MB             32 MB

    • An application may impose a minimum split size greater than the block size.
        • There is no good reason to do that
        • Data locality is lost

    FileInputFormat

    • Min split size: we might set it larger than the block size
        • But the concept of data locality may be lost to some extent
    • Split size is calculated by the formula
        • max(minimumSize, min(maximumSize, blockSize))
    • By default
        • minimumSize < blockSize < maximumSize
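The split-size formula above translates directly into code. The sketch below uses a hypothetical class SplitSizeSketch with all sizes in bytes; it only evaluates the formula and does not read any Hadoop configuration.

```java
// Direct transcription of: max(minimumSize, min(maximumSize, blockSize)).
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }
}
```

With a minimum of 1 byte, a maximum of Long.MAX_VALUE, and a 64 MB block, the split size equals the block size. Raising the minimum to 128 MB forces 128 MB splits that span two blocks, and lowering the maximum to 32 MB yields 32 MB splits inside each block.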

    File Information in the mapper

    configure(JobConf job)

    Property name       Description
    map.input.file      The path of the input file being processed
    map.input.start     The byte offset of the start of the split
    map.input.length    The length of the split in bytes

    TextInputFormat

    • Default FileInputFormat
    • Each line is a value
    • Byte offset is a key
    • Example: run the identity mapper program

    Input Splits and HDFS Blocks

    • Logical records defined by FileInputFormat don't usually fit neatly into HDFS blocks.
    • Every file is written as a sequence of bytes.
    • 64 MB reached? Then start a new block.
    • When 64 MB is reached, a logical record may be only half written.
    • So the other half of the logical record goes into the next HDFS block.

    Input Splits and HDFS Blocks

    • So even with data locality some remote reading is done: a slight overhead.
    • Splits give logical record boundaries
    • Blocks give physical boundaries (size)

    Small Files

    • Files which are very small are inefficient in the mapper phase
    • Imagine 1 GB of data
        • 64 MB blocks → 16 files → 16 mappers
        • 100 KB files → about 10,000 files → about 10,000 mappers
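The mapper counts in the small-files comparison can be reproduced with a little arithmetic. The sketch below is a simplification under stated assumptions: one mapper per split, each file split independently, a split size equal to a 64 MB block, and no allowance for the slack a real FileInputFormat may apply to the last split. The class and method names are hypothetical.

```java
// Rough estimate of how many mappers a set of equally sized files needs.
class SmallFilesSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    // Splits for one file: ceiling of fileSize / splitSize, at least 1,
    // because a file is never merged with another file into one split here.
    static long splitsForFile(long fileSize, long splitSize) {
        return Math.max(1L, (fileSize + splitSize - 1) / splitSize);
    }

    // One mapper per split, summed over all files.
    static long totalMappers(long fileCount, long fileSize, long splitSize) {
        return fileCount * splitsForFile(fileSize, splitSize);
    }
}
```

A single 1 GB file with 64 MB splits gives 16 mappers, while 10,000 files of 100 KB each give 10,000 mappers for roughly the same amount of data, so most of the job's time goes into task startup rather than reading.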

    CombineFileInputFormat

    • Packs many files into a single split
    • Data locality is taken into consideration
    • MR performs best when it operates at the disk transfer rate, not at the seek rate
    • This helps in processing large files also

    NLineInputFormat

    • Same as TextInputFormat
    • Each split is guaranteed to have N lines
    • mapred.line.input.format.linespermap

    KeyValueTextInputFormat

    • Each line in the text file is a record
    • The first separator character divides key and value
    • Default is '\t'
    • Controlling property: key.value.separator.in.input.line
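The rule that only the first separator divides key from value can be sketched as follows. The class KeyValueLineSketch is hypothetical, and the handling of a line without any separator (whole line as key, empty value) is an assumption of this sketch.

```java
// Splits a text line into key and value at the FIRST separator only;
// any later separators remain part of the value.
class KeyValueLineSketch {
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: with no separator, the whole line is the key
            // and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

For the line "k1<TAB>v1<TAB>more" with a tab separator, the key is "k1" and the value is "v1<TAB>more".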

    SequenceFileInputFormat<K,V>

    • InputFormat for reading sequence files
    • User-defined key K, user-defined value V
    • They are splittable files, well suited for MR
    • They support compression
    • They can store arbitrary types

    OutputFormat

    TextOutputFormat

    • Key and value are stored '\t'-separated by default
    • mapred.textoutputformat.separator: the parameter that controls the separator
    • Counterpart of KeyValueTextInputFormat
    • Can suppress the key or value by using NullWritable