1-dplyr-tidyr

  • Upload
    rafik

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

  • 8/10/2019 1-dplyr-tidyr

    1/29

    The Grammar and Graphics of Data Science

    #RStudio

  • 8/10/2019 1-dplyr-tidyr

    2/29

    The next webinar in the series:

    Reproducible Reporting Live!

    Wednesday, August 13th, 11am Eastern Time US

    Master R Developer Workshop,

    Monday, September 8 Tuesday, September 9, 2014.

    New York City, NY.

    R Day@StrataNYC + Hadoop World,

    October 15th.

    Javits Center, New York City, NY.

  • 8/10/2019 1-dplyr-tidyr

    3/29

    Hadley Wickham@hadleywickhamChief Scientist, RStudio

    Grammars ofdata science

    July 2014

    http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/http://creativecommons.org/licenses/by-nc/3.0/
  • 8/10/2019 1-dplyr-tidyr

    4/29

    What is datascience?

  • 8/10/2019 1-dplyr-tidyr

    5/29

    Tidy

    Compose

    Collect

    Analyse

    Communicate

  • 8/10/2019 1-dplyr-tidyr

    6/29

    Collect Tidy

    Compose

    Analyse

    Communicate

  • 8/10/2019 1-dplyr-tidyr

    7/29

    GatherRefine

    Visualise

    Transform

    Model

    Collect

    Deploy

    Explain

    Document

    Tidy

    Compose

    Analyse

    Communicate

  • 8/10/2019 1-dplyr-tidyr

    8/29

    Visualise

    Transform

    Model

    Tidy

  • 8/10/2019 1-dplyr-tidyr

    9/29

    Visualise

    Transform

    Model

    dplyr

    ggvisTidytidyr

  • 8/10/2019 1-dplyr-tidyr

    10/29

    tidyr

  • 8/10/2019 1-dplyr-tidyr

    11/29

    What is tidy data?

    # Data thats easy to transform, visualiseand model

    # Key idea: store variables in a consistentway, always as columns

    # tidyr provides useful tools to tidy messy

    data. Three most important are: gather,spreadand separate.

    # Google tidy data for more details.

  • 8/10/2019 1-dplyr-tidyr

    12/29

    dplyr

  • 8/10/2019 1-dplyr-tidyr

    13/29

    Think it Do itDescribe it

    Cognitive

    Computational

    (precisely)

  • 8/10/2019 1-dplyr-tidyr

    14/29

    # filter: keep rows matching

    criteria

    # select: pick columns by name

    # arrange: reorder rows

    # mutate: add new variables

    # summarise: reduce variables

    to values

    +groupby

  • 8/10/2019 1-dplyr-tidyr

    15/29

    Studio

    nycflights13

    # flights[336,776 x 16]. Every flight

    departing NYC in 2013.

    # weather[8,719 x 14]. Hourly weather

    data.

    # planes[3,322 x 9]. Plane metadata.

    # airports[1,397 x 7]. Airport metadata.

  • 8/10/2019 1-dplyr-tidyr

    16/29

    Studio

    library(nycflights13)

    library(dplyr)

    flights

    #> Source: local data frame [336,776 x 16]

    #>

    #> year month day dep_time dep_delay arr_time arr_delay carrier tailnum

    #> 1 2013 1 1 517 2 830 11 UA N14228

    #> 2 2013 1 1 533 4 850 20 UA N24211#> 3 2013 1 1 542 2 923 33 AA N619AA

    #> 4 2013 1 1 544 -1 1004 -18 B6 N804JB

    #> 5 2013 1 1 554 -6 812 -25 DL N668DN

    #> 6 2013 1 1 554 -4 740 12 UA N39463

    #> 7 2013 1 1 555 -5 913 19 B6 N516JB

    #> 8 2013 1 1 557 -3 709 -14 EV N829AS

    #> 9 2013 1 1 557 -3 838 -8 B6 N593JB

    #> 10 2013 1 1 558 -2 753 8 AA N3ALAA

    #> .. ... ... ... ... ... ... ... ... ...

    #> Variables not shown: flight (int), origin (chr), dest (chr),

    #> air_time (dbl), distance (dbl), hour (dbl), minute (dbl)

  • 8/10/2019 1-dplyr-tidyr

    17/29

    Demo

  • 8/10/2019 1-dplyr-tidyr

    18/29

    Pipelines

  • 8/10/2019 1-dplyr-tidyr

    19/29

    Think it Do itDescribe it

    Cognitive

    Computational

    (precisely)

  • 8/10/2019 1-dplyr-tidyr

    20/29

    hourly_delay 10

    )

  • 8/10/2019 1-dplyr-tidyr

    21/29

    %>%tidyr, dplyr, ggvis, ...

  • 8/10/2019 1-dplyr-tidyr

    22/29

    # x %>% f(y) -> f(x, y)

    hourly_delay %

    filter(!is.na(dep_delay)) %>%

    group_by(date, hour) %>%

    summarise(

    delay = mean(dep_delay),

    n = n()

    ) %>%

    filter(n > 10)

  • 8/10/2019 1-dplyr-tidyr

    23/29

    Remote

    sources

  • 8/10/2019 1-dplyr-tidyr

    24/29

    Think it Do it(remote)

    Describe it

    Cognitive

    Computational

    (precisely)

    S di

  • 8/10/2019 1-dplyr-tidyr

    25/29

    Studio

    Other data sources

    # PostgreSQL, redshift

    # MySQL, MariaDB

    # SQLite

    #MonetDB, BigQuery

    # Oracle, SQL Server,

    Greenplum, ImpalaDB

  • 8/10/2019 1-dplyr-tidyr

    26/29

    flights %>%

    filter(!is.na(dep_delay)) %>%

    group_by(date, hour) %>%

    summarise(delay = mean(dep_delay), n = n()) %>%

    filter(n > 10)

    # SELECT "date", "hour", "delay", "n"

    # FROM (# SELECT "date", "hour",

    # AVG("dep_delay") AS "delay",

    # COUNT() AS "n"

    # FROM "flights"# WHERE NOT("dep_delay" IS NULL)

    # GROUP BY "date", "hour"

    # ) AS "_W1"

    # WHERE "n" > 10.0

  • 8/10/2019 1-dplyr-tidyr

    27/29

    translate_sql(month > 1, flights)

    # "Month" > 1.0translate_sql(month > 1L, flights)

    # "Month" > 1

    translate_sql(dest == "IAD" || dest == "DCA",hflights)

    # "dest" = 'IAD' OR "dest" = 'DCA'

    dc

  • 8/10/2019 1-dplyr-tidyr

    28/29

    Learn more

    Studio

  • 8/10/2019 1-dplyr-tidyr

    29/29

    Studio

    # Built-in vignettes

    browseVignettes(package = "dplyr")

    # Translate plyr to dplyrhttp://jimhester.github.io/plyrToDplyr/

    # Common questions & answers

    http://stackoverflow.com/questions/tagged/dplyr?sort=frequent

    http://stackoverflow.com/questions/tagged/dplyr?sort=frequenthttp://jimhester.github.io/plyrToDplyr/