‰‹‰‹•™½  R è‍言資–™ˆ†‍¯¦‹™/¼µ¯“€«&陳¨



  • demo

    data and code

    Op1: http://goo.gl/bGLgbQ

    Op2: https://goo.gl/zyUclg

    Op3: https://goo.gl/YRFpwC

  • session_00_install_R_packages.R

    RStudio

    Ex. https://www.facebook.com/events/677036392444252

  • Pkg installr

    # installing/loading the package:
    if (!require(installr)) { install.packages("installr"); require(installr) }

    # using the package:
    updateR()

  • Session A

    Data Collection


  • download.file session_A_DataCollection.R


  • # Article urlurl

  • (Figure: example document tree with a Title, a Paragraph, and list items Item 1 and Item 2)

  • http://search.appledaily.com.tw/charity/projlist/

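  • XPath

    /   select from the root node, or step from a parent to its children
    //  select matching nodes at any depth below the current node
    @   select an attribute
    *   wildcard: match any element node
    |   OR: union of two node sets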
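
The XPath operators above can be tried on a small inline document. This is only a sketch: the HTML fragment, its ids, and its values are hypothetical and do not reproduce the real page; only the xml2 calls (read_html, xml_find_all, xml_text, xml_attr) are real.

```r
library(xml2)

# Hypothetical page fragment; the real page structure may differ.
html <- '<html><body>
  <div id="inquiry3">
    <table>
      <tr><th>aid</th><th>title</th></tr>
      <tr><td>A001</td><td><a href="/detail/1">First case</a></td></tr>
      <tr><td>A002</td><td><a href="/detail/2">Second case</a></td></tr>
    </table>
  </div>
</body></html>'
doc <- read_html(html)

# "//" searches at any depth, "*" matches any element, "@" tests an attribute
cells <- xml_find_all(doc, '//*[@id="inquiry3"]/table//tr[2]/td[1]')
first_aid <- xml_text(cells)

# "/" steps from parent to child; xml_attr() then reads the href attribute
links <- xml_attr(xml_find_all(doc, "//td/a"), "href")

# "|" takes the union of two node sets (header cells and data cells here)
n_cells <- length(xml_find_all(doc, "//th | //td"))
```

Positions in XPath are 1-based, so tr[2] is the first data row of this hypothetical table.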

  • XPath = //*[@id="inquiry3"]/table//tr[4]/td[1]

    http://search.appledaily.com.tw/charity/projlist/

  • (pkg xml2)

    library(xml2)

    # set your target urldoc

  • (pkg xml2)

    Read a document: read_html; read_xml

    Find nodes: xml_find_all; xml_find_one (deprecated in favor of xml_find_first in later xml2 releases)

    Extract values: xml_text; xml_attrs

  • (pkg xmlview)

    # open the document to test your xpath
    xml_view(doc, add_filter = TRUE)

  • A-01 (8 mins)

    XPath 1-1:

    1-2:

    1-3:

    bonus:


    session_A_ex01.R

  • A-02 (10 mins)

    csv # npage

    aid
    case.closed
    date.published
    donation
    title
    url.article
    url.detail

    df_article_raw.csv

    session_A_ex02.R
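
A minimal sketch of the page loop for this exercise, assuming each listing page holds one table row per article. The inline pages, the column positions, and the helper parse_page() are hypothetical; a real crawl would call read_html() on each page URL and use the XPath found earlier.

```r
library(xml2)

# Hypothetical stand-ins for the listing pages.
pages <- c(
  '<table><tr><td>A001</td><td>First case</td><td>1000</td></tr></table>',
  '<table><tr><td>A002</td><td>Second case</td><td>2500</td></tr>
          <tr><td>A003</td><td>Third case</td><td>300</td></tr></table>'
)

# parse_page() is an illustrative helper, not part of the workshop code.
parse_page <- function(html) {
  rows <- xml_find_all(read_html(html), "//tr")
  data.frame(
    aid      = xml_text(xml_find_first(rows, "./td[1]")),
    title    = xml_text(xml_find_first(rows, "./td[2]")),
    donation = as.numeric(xml_text(xml_find_first(rows, "./td[3]"))),
    stringsAsFactors = FALSE
  )
}

# loop over the pages, then stack the per-page data frames
df_article_raw <- do.call(rbind, lapply(pages, parse_page))

# write the result as UTF-8 csv (to a temporary directory in this sketch)
out <- file.path(tempdir(), "df_article_raw.csv")
write.csv(df_article_raw, out, row.names = FALSE, fileEncoding = "UTF-8")
```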

  • A-03 (15 mins)

    Outcome: df_article_raw.csv

    .txt

    .txt

    session_A_ex03.R

  • df_article.csv

    aid
    case.closed
    date.published
    donation
    title
    url.article
    url.detail
    donor
    date.funded
    journalist
    n.fb.comment
    n.fb.like
    n.fb.share
    n.fb.total
    n.image
    n.word

  • Data Manipulation

  • df_article.csv

    df_article_raw.csv

  • df_donation.csv

    In db_donation_txt.rar

  • A-04 (Homework)

    crawl df_donation.csv

  • Next session starts at 11:00 AM

    Stay tuned. We'll be back soon!

  • Character encoding problem

    (Mac) read.csv: check the locale with Sys.getlocale(); to change it, run one of:
    system("defaults write org.R-project.R force.LANG en_US.UTF-8")
    system("defaults write org.R-project.R force.LANG zh_TW.UTF-8")

    read.csv parameter: fileEncoding = "UTF-8"
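
A small round trip showing the fileEncoding parameter, as a sketch only. The file name and the sample titles are made up for illustration; how non-ASCII text prints afterwards still depends on the locale that Sys.getlocale() reports.

```r
# Write a small csv as UTF-8, then read it back with an explicit encoding.
path <- file.path(tempdir(), "encoding_demo.csv")   # hypothetical file name
demo <- data.frame(aid   = c("A001", "A002"),
                   title = c("caf\u00e9", "\u53f0\u5317"),   # made-up titles
                   stringsAsFactors = FALSE)
write.csv(demo, path, row.names = FALSE, fileEncoding = "UTF-8")

Sys.getlocale()   # shows the locale R is currently using
d <- read.csv(path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
```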

  • Session B

    Exploratory Data Analysis

  • What is EDA?

    EDA (exploratory data analysis)

    outliers

    EDA - B-01

  • EDA

    Box plot

    Histogram

    Scatter plot, line chart

  • Summary Functions in R

    Function Name      Description
    names()            Get or set the names of an object
    head(), tail()     Return the first or last parts of a vector, matrix, table, data frame or function
    str()              Compactly display the internal structure of an R object
    summary()          Produce result summaries
    dim()              Retrieve or set the dimension of an object
    length()           Get or set the length of vectors
    complete.cases()   Return a logical vector indicating which cases are complete, i.e., have no missing values
    as.Date()          Convert between character representations and objects of class "Date" representing calendar dates
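
The functions in the table can be tried on a toy data frame. The values below are hypothetical; only the column names follow df_article.csv.

```r
# Hypothetical toy data standing in for the article table
d <- data.frame(
  aid            = c("A001", "A002", "A003", "A004"),
  date.published = c("2015-01-05", "2015-02-10", "2015-02-20", "2015-03-01"),
  donation       = c(1000, 2500, NA, 300),
  stringsAsFactors = FALSE
)

names(d)             # column names
dim(d)               # number of rows and columns
head(d, 2)           # first two rows
str(d)               # compact structure of the object
summary(d$donation)  # five-number summary, mean, and the NA count
length(d$donation)   # number of elements in one column
complete.cases(d)    # FALSE for the row with the missing donation

# convert the character column to class "Date"
d$date.published <- as.Date(d$date.published)
class(d$date.published)
```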

  • Visualization Functions in R

    Function Name   Description
    plot()          Generic function for plotting R objects
    boxplot()       Produce box-and-whisker plot(s) of the given (grouped) values
    hist()          Compute a histogram of the given data values
    barplot()       Create a bar plot with vertical or horizontal bars
    arrows()        Draw arrows between pairs of points
    abline()        Add a straight line; a, b: the intercept and slope (single values), y = a + bx
    lines()         Join the corresponding points with line segments

    Function name and parameter abbreviations:
    http://jeromyanglim.blogspot.tw/2010/05/abbreviations-of-r-commands-explained.html

  • session_B_eda.R

    # load in apple daily article
    > d <- ...
    > dim(d)
    [1] 3784 17

    # check the column names
    > names(d)
     [1] "aid"            "case.closed"    "circulation"
     [4] "date.funded"    "date.published" "donation"
     [7] "donor"          "journalist"     "n.fb.comment"
    [10] "n.fb.like"      "n.fb.share"     "n.fb.total"
    [13] "n.image"        "n.word"         "title"
    [16] "url.article"    "url.detail"

  • # use str() to have a brief data summary
    > str(d)

  • > d$date.published <- ...
    > d$title <- ...
  • > summary(d)

    summary() also reports NA counts

  • Missing values (NA)

    1. which() + is.na()
    2. !complete.cases()
    3. summary()

    na.omit()
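
A short sketch of the three ways to find NAs and of na.omit(), on made-up values for two of the columns. The numbers are hypothetical.

```r
# hypothetical values with one NA in each column
d <- data.frame(n.word = c(350, NA, 410, 290), n.image = c(2, 1, NA, 3))

which(is.na(d$n.word))         # 1. positions of the NAs in one column
na_rows <- !complete.cases(d)  # 2. rows with at least one NA
summary(d)                     # 3. per-column summaries include NA counts

d_clean <- na.omit(d)          # drop every row that contains an NA
m <- mean(d$n.word, na.rm = TRUE)  # or skip NAs inside a single computation
```

na.omit() removes whole rows, so a single missing n.image also discards that row's n.word; na.rm = TRUE keeps more data when only one column matters.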

  • # use hist() to check donation distribution
    > hist(d$donation, breaks = 100)

  • > plot(d$donor, d$donation, pch = ".", cex = 2)
    > abline(lm(d$donation ~ d$donor), col = "red")
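
The scatter plot with a fitted line can be reproduced on simulated data. This is a sketch: the donor and donation values and the slope of 800 are invented for illustration.

```r
set.seed(1)
# hypothetical data: donation grows roughly linearly with the number of donors
donor    <- round(runif(200, 10, 500))
donation <- 800 * donor + rnorm(200, sd = 20000)

fit <- lm(donation ~ donor)   # formula is response ~ predictor

pdf(NULL)                     # null device; omit this line to see the plot
plot(donor, donation, pch = ".", cex = 2)
abline(fit, col = "red")      # abline() takes intercept and slope from the fit
dev.off()

coef(fit)                     # estimated intercept and slope
```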

  • > n <- ...
    > b <- ...
    > abline(h = mean(d$donation), lty = 2, cex = 2)
    > text(1:n, (b$stats[3, ] + b$stats[4, ]) / 2, b$n, cex = 0.8)

    boxplot()
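
A sketch of the labelled box plot on simulated data. The definitions of n and b are not visible in the slide text, so the ones below (b as the value returned by boxplot(), n as the number of boxes) are assumptions, as are the data.

```r
set.seed(2)
# hypothetical data: three journalists with different numbers of articles
d <- data.frame(
  journalist = rep(c("A", "B", "C"), times = c(30, 50, 20)),
  donation   = rexp(100, rate = 1 / 50000)
)

pdf(NULL)                                  # null device; omit to see the plot
b <- boxplot(donation ~ journalist, data = d)   # assumed definition of b
n <- length(b$n)                           # assumed definition: number of boxes
abline(h = mean(d$donation), lty = 2)      # overall mean as a dashed line
# rows of b$stats: lower whisker, lower hinge, median, upper hinge, upper whisker;
# print each group's size halfway between the median and the upper hinge
text(1:n, (b$stats[3, ] + b$stats[4, ]) / 2, b$n, cex = 0.8)
dev.off()
```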

  • B-01 (10 mins)

    NA n.word NA

    n.image NA

    hist()

    journalist plot(), boxplot()


  • A FINDING!

    FB

    EDA B-02

  • detrending

    LOWESS: locally weighted scatterplot smoothing

  • # lm() and lowess(): donation vs date.published
    > plot(d$date.published, d$donation, pch = ".", cex = 3)
    > lines(lowess(d$date.published, d$donation), col = "red")
    > abline(lm(d$donation ~ d$date.published), col = "blue")

  • # detrending with lowess()
    > l <- ...
    > d$donation.de <- ...
    > plot(d$date.published, d$donation.de, pch = ".", cex = 3)
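
One possible way to detrend with lowess(), sketched on a simulated series. The slide text does not show how donation.de is computed, so subtracting the smooth is an assumption here (dividing by it would be another option), and x, y, and y.de are illustrative names.

```r
set.seed(3)
# hypothetical series: a slow upward trend plus noise
x <- 1:300
y <- 1000 + 5 * x + rnorm(300, sd = 100)

l <- lowess(x, y)        # list with sorted x and the smoothed y
# x is already sorted here, so l$y lines up with y; for unsorted x,
# reorder the data first or interpolate with approx()
y.de <- y - l$y          # assumed detrending: subtract the local trend

pdf(NULL)                # null device; omit to see the plot
plot(x, y, pch = ".", cex = 3)
lines(l, col = "red")                # local trend
abline(lm(y ~ x), col = "blue")      # global linear trend
dev.off()

cor(x, y.de)             # close to zero once the trend is removed
```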

  • > n <- ...
    > b <- ...
    > abline(h = mean(d$donation.de), lty = 2, cex = 2)
    > title("journalist vs. donation.de", ylab = "donation.de")
    > text(1:n, 0, b$n, cex = 0.8, col = "blue")

    journalist

  • Log Transformation

    1/x, x^2, sqrt(x)

  • # log()
    > plot(d$n.fb.total, d$donation.de, col = "blue")

  • > plot(log(d$n.fb.total + 1), log(d$donation.de))
    > plot(d$n.fb.total + 1, d$donation.de, log = "xy")
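
A sketch of why 1 is added before taking logs, and of the two plotting routes shown above. The simulated counts and donations are hypothetical.

```r
set.seed(4)
# hypothetical right-skewed counts, including zeros
n.fb.total <- rpois(500, lambda = rlnorm(500, meanlog = 3, sdlog = 1.5))
donation   <- 50000 * (n.fb.total + 1)^0.3 * rlnorm(500, sdlog = 0.3)

log(0)                       # -Inf, which is why 1 is added before logging
x <- log(n.fb.total + 1)     # log1p(n.fb.total) computes the same quantity

pdf(NULL)                    # null device; omit to see the plots
plot(x, log(donation))                        # transform the data
plot(n.fb.total + 1, donation, log = "xy")    # or only the axes
dev.off()

cor(x, log(donation))        # the relation is close to linear on the log scale
```

Transforming the data changes the values that later models see; log = "xy" changes only the axes and keeps the tick labels in original units.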

  • ...

    EDA B-03

  • > head(d$title)
    [1] "" ""
    [3] "" ""
    [5] "" ""

    > d$tle.n.word <- ...

  • > i = grep(, d$title)
    > d$tle.cancer <- ...
    > d[i, ]$tle.cancer <- ...
    > library(dplyr)
    > tmp = group_by(d, tle.cancer)
    > tmp = summarize(tmp, se = sd(donation.de) / sqrt(n()), m = mean(donation.de))

    > b <- ...
    > arrows(b, tmp$m + tmp$se * 1.96, b, tmp$m - tmp$se * 1.96, angle = 90, code = 3)
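
A sketch of the keyword flag, group means, and 95% error bars on simulated data. Several parts are assumptions: the titles and the pattern "cancer" are made up, tapply() stands in for the dplyr group_by() and summarize() calls so the sketch runs on base R alone, and b is assumed to be the value returned by barplot().

```r
set.seed(5)
# hypothetical titles and detrended donations
title <- sample(c("cancer fund", "fire relief", "cancer ward", "school aid"),
                200, replace = TRUE)
d <- data.frame(title = title, donation.de = rnorm(200), stringsAsFactors = FALSE)

d$tle.cancer <- 0
i <- grep("cancer", d$title)   # row indices whose title matches the pattern
d$tle.cancer[i] <- 1

# group means and standard errors (base R stand-in for group_by + summarize)
m  <- tapply(d$donation.de, d$tle.cancer, mean)
se <- tapply(d$donation.de, d$tle.cancer, function(v) sd(v) / sqrt(length(v)))

pdf(NULL)                      # null device; omit to see the plot
b <- barplot(m, ylim = range(c(m - 1.96 * se, m + 1.96 * se, 0)))
# barplot() returns the bar midpoints, used as x positions for the error bars
arrows(b, m + 1.96 * se, b, m - 1.96 * se, angle = 90, code = 3, length = 0.1)
dev.off()
```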

  • > i <- ...
    > d$tle.male <- ...
    > d[i, ]$tle.male <- ...
    > i <- ...
    > d$tle.female <- ...
    > d[i, ]$tle.female <- ...
    > boxplot(donation.de ~ tle.male + tle.female + tle.cancer, data = d, col = "orange")
    > title("male + female + cancer")

  • > class(d$date.published)
    [1] "Date"

    > d$month <- ...
    > boxplot(d$donor.de ~ d$month, col = "orange")

    %y = year (two digits; %Y = four digits)
    %m = month
    %w = weekday (0-6, Sunday = 0)
    %d = day of month
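
The format codes can be checked with format() on a few dates. The dates are hypothetical, and deriving the month with format(..., "%m") is an assumption, since the slide text does not show how d$month is built.

```r
# hypothetical publication dates
date.published <- as.Date(c("2015-01-05", "2015-02-10", "2015-12-31"))

format(date.published, "%Y")            # four-digit year
format(date.published, "%y")            # two-digit year
month <- format(date.published, "%m")   # month as a two-character string
format(date.published, "%d")            # day of month
format(date.published, "%w")            # weekday number, 0 = Sunday
format(date.published, "%U")            # week of the year, weeks start on Sunday
```

format() returns character strings, so month sorts correctly as "01" to "12" when used as a grouping variable in boxplot().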

  • = donation / donor

    FB: total # of likes, shares, and comments, in logarithm