Computational Communication Research in Big Data Era

Jonathan Zhu (祝建華)

National Taiwan Normal University, October 5, 2016




Outline

Big Data: Technical Features

Computational Communication Research (CCR): Basic Methods

o Data Collection

o Data Preprocessing

o Data Analysis

o Data Visualization

Discussions: New Bottle of Old Wine?

o Examples of Studies on 5W

o When to use and when not to use CCR


Where Does Big Data Come from?

Web 1.0

Web 2.0/Social Media

Mobile Web

Internet of Things


Current Focus


Technical Features of Big Data

| Feature | Offline/Analog Data | Online/Digital Data |
|---|---|---|
| Data size | hundreds to thousands | millions to billions |
| Format | analog text, audio, visual | digitized text, audio, visual |
| Measures | self-reported attitudes | digital records of behavior |
| Time scale | one to several points on a monthly or yearly scale | unlimited points on a millisecond scale |
| Spatial info | usually unavailable | often available |


A Sample of Server Log


Data Size: Small, Mid, and Big

Anderson (2013) Forget big data, think mid data. http://blog.odintext.com/?p=300

[Figure: value/return plotted against data size (size/cost/time), spanning small (qualitative), mid, and big (quantitative) data, with a marked "sweet point".]


Self-reported vs. Behavioral Measures

[Figure: two diagrams. Self-reported measures (Ajzen & Fishbein, 1977): paths among attitudes, behavioral intention, and behavior, labeled 0.1 - 0.2, 0.3 - 0.5, and 0.5 - 0.7. Behavioral measures: behavior predicting attitudes, demographics, and personality.]


What Is Computational Social Science (CSS)?

Wikipedia: Computational social science (CSS) is the integrated, interdisciplinary pursuit of social inquiry with emphasis on information processing and through the medium of advanced computation. The main computational social science areas are automated information extraction systems, social network analysis, social geographic information systems (GIS), complexity modeling, and social simulation models. (wikipedia.org)

Lazer et al. (2009):

An emerging field that leverages the capacity to collect and analyze data at an unprecedented breadth, depth and scale that may reveal patterns of individual and group behaviors. (Science, 323)


The Multidisciplinary Coauthorship of Lazer et al. (2009)

1. D. Lazer (political science)

2. A. Pentland (computer science)

3. L. Adamic (computer science)

4. S. Aral (business)

5. A. L. Barabasi (physics)

6. D. D. Brewer (sociology)

7. N. A. Christakis (public health)

8. N. Contractor (communication)

9. J. H. Fowler (political science)

10. M. P. Gutman (history)

11. T. Jebara (computer science)

12. G. King (political science)

13. M. Macy (sociology)

14. D. Roy (cognitive science)

15. M. Van Alstyne (business)


CSS = Domain Knowledge + Statistical Tools + Computer Programming

Social Science Theory

Statistical Analysis

Computer Programming


What Is Computational Communication Research (CCR)?

CCR is not a new discipline, but an emerging paradigm for empirical research in communication

CCR applies CSS methods, tools, and algorithms to communication research

CCR is carried out by scholars from communication, computer science and other disciplines

Most studies of CCR focus on 5W questions (who says what to whom through which channel with what effects)


Traditional vs. Computational Methods in Communication Research

Traditional methods:

Survey

Experiment

Content analysis

Statistical analysis

Research ethics

...

Computational methods:

User logs analytics

A/B test

Text mining

Statistical/network analysis

Privacy & other ethical issues

...


Classifying CCR based on 5Ws

Harold Lasswell (1948): communication is a process of who says what through which channel to whom with what effects.


| | Who | What | Channel | Whom | Effects |
|---|---|---|---|---|---|
| Study Subject | Communicator | Message | Medium | Audience | Effects |
| Traditional Method | Content analysis/interviews | Content analysis | Experiment | Survey | Survey/experiment |
| CCR Method | Text mining | Text mining | A/B test | Log data analytics | Log data analytics, A/B test |


Exemplar Studies of CCR

| Research Domain | Author (Year) | Method |
|---|---|---|
| Who/Communicator | Wu et al. (2011) | Text mining of Twitter users and posts |
| What/Content | Zhao et al. (2011) | Text mining of Twitter posts |
| Which Channel | Petrovic et al. (2013) | Text mining between newswire and Twitter |
| Whom/Audience | Benevenuto et al. (2009) | Clickstream analysis of Orkut visit logs |
| What Effects | Bond et al. (2012) | Experiment on Facebook effects |


ICA Interest Group on Computational Methods (IGCM)

Existing divisions and interest groups:

o …

o Communication and Technology

o Information Systems

o International Communication

o Mass Communication and Society

o Political Communication

o …

Newly formed interest groups:

o Communication Science and Biology

o Computational Methods

o Public Diplomacy

Website: http://ica-cm.org


Basic Procedure and Methods of CCR


Data Collection: Sampling, Crawling, Archiving

Data Preprocessing: Parsing, Classifying, Clustering

Data Analysis: Trends, Association, Mapping

Data Visualization: Charts, Interactive visuals, Animated videos


Efforts/Resources Involved in the Steps

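[Figure: efforts/resources involved at each step (data collection, data processing, data analysis, data visualization) for three method pairs: survey vs. log analytics, experiment vs. A/B test, and content analysis vs. text mining.]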


i. Data Collection

Sampling: decide which webpages to download

o Breadth-First Search (BFS)

o Random Walk (RW) Sampling

o Uniform Sampling (Zhu et al., 2012)
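To make the contrast between the first two sampling strategies concrete, here is a minimal sketch on a toy link graph. The graph, the function names, and the restart-at-seed rule for dead ends are all assumptions made for illustration; uniform sampling as in Zhu et al. (2012) is not implemented here.

```python
import random
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "e"],
    "d": ["b", "f"],
    "e": ["c"],
    "f": ["d"],
}

def bfs_sample(graph, seed, k):
    """Collect up to k pages in breadth-first order, starting from seed."""
    sampled, seen, queue = [seed], {seed}, deque([seed])
    while queue and len(sampled) < k:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                sampled.append(nbr)
                queue.append(nbr)
                if len(sampled) == k:
                    break
    return sampled

def random_walk_sample(graph, seed, k, max_steps=10_000, rng=None):
    """Collect up to k distinct pages by following one random link per step."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    sampled, seen, current = [seed], {seed}, seed
    for _ in range(max_steps):
        if len(sampled) >= k:
            break
        nbrs = graph.get(current, [])
        if not nbrs:
            current = seed  # assumption: restart at the seed on a dead end
            continue
        current = rng.choice(nbrs)
        if current not in seen:
            seen.add(current)
            sampled.append(current)
    return sampled
```

BFS exhausts the neighborhood of the seed before moving outward, so small samples stay close to the starting page. A random walk can travel farther, but on real graphs it tends to visit well-connected pages more often.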

Crawling: download the chosen webpages

o N of visits to a website per time unit

o Elements to download (text, images, audio, video, etc.)

o Use of application program interface (API)
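The "N of visits per time unit" setting can be sketched as a simple rate-limited loop. This is only an illustration: the `crawl` function, its parameters, and the default of 30 requests per minute are assumptions, and the actual download step is passed in as a `fetch` callable rather than tied to any particular HTTP library.

```python
import time

def crawl(urls, fetch, max_per_minute=30, sleep=time.sleep, clock=time.monotonic):
    """Download each URL with fetch(url), spacing requests to respect a rate cap.

    fetch, sleep, and clock are injected so the loop can be tried without
    touching the network (all names here are illustrative assumptions).
    """
    min_interval = 60.0 / max_per_minute  # seconds between two requests
    pages = {}
    last_request = None
    for url in urls:
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)  # pause so the site is not visited too often
        last_request = clock()
        pages[url] = fetch(url)
    return pages
```

In practice `fetch` would wrap an HTTP client or an API call, and the cap would follow the site's terms of use or the API's documented limits.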


Web Scraping

[Figure: two routes from the web to a data file (rows of cases with columns ID, V1, V2, V3, ...): web scraping of web pages, and retrieving from a web database through an API.]
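The two routes to a data file can be sketched with the Python standard library alone. The HTML layout (`<p class="post">`), the JSON shape (`{"posts": [...]}`), and the function names are hypothetical; real pages and APIs will differ.

```python
import json
from html.parser import HTMLParser

class PostParser(HTMLParser):
    """Collect the text of every <p class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "post") in attrs:
            self.in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post:
            self.posts[-1] += data

def scrape(html):
    """Route 1: web scraping, turning page markup into rows of a data file."""
    parser = PostParser()
    parser.feed(html)
    return [{"id": i + 1, "text": text.strip()} for i, text in enumerate(parser.posts)]

def from_api(payload):
    """Route 2: retrieving via an API, keeping selected fields of a JSON reply."""
    return [{"id": item["id"], "text": item["text"]} for item in json.loads(payload)["posts"]]
```

Scraping has to recover structure from markup designed for display, whereas an API returns records that are already structured, which is why the second route usually needs less cleaning.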


ii. (Text) Data Preprocessing

Parsing: extract targeted words from the text

o Keywords

o Phrases

o Sentences

o Paragraphs

o etc.

Classification: group the parsed words

o Named entities (persons, organizations, locations, events, etc.)

o Topics/themes

o Time (holidays, anniversaries, seasons, etc.)

o Emotions (adjectives, colors, mental states, etc.)
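A minimal sketch of parsing followed by dictionary-based classification is given below. The lexicon is a tiny hypothetical one made up for illustration; actual studies would rely on validated dictionaries or trained classifiers.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: category -> words that signal it.
LEXICON = {
    "emotion": {"happy", "angry", "sad"},
    "time": {"christmas", "summer", "anniversary"},
    "location": {"taipei", "london"},
}

def parse(text):
    """Parsing: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def classify(tokens):
    """Classification: count how many tokens fall into each lexicon category."""
    counts = Counter()
    for token in tokens:
        for category, words in LEXICON.items():
            if token in words:
                counts[category] += 1
    return counts
```

For example, `classify(parse("Happy Christmas from Taipei! Not sad at all."))` counts two emotion words, one time word, and one location word. Note that a bare word count like this ignores negation ("not sad"), one reason dictionary methods need validation.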


From Texts to Topics

• Tweets

• Webpages

• Posts

• ……

Word Extraction

• Words

• Phrases

• n-grams

Topic Extraction

• Topics

• Issues

• Themes

• ……

Ready for Analysis
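The word-extraction step can be illustrated with a short n-gram sketch. The function names and sample texts are assumptions; the topic-extraction step itself (for example, a topic model) is not shown.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(texts, n=2, k=3):
    """Count n-grams over a collection of texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)
```

Frequent n-grams such as "big data" are candidate phrases that a later topic-extraction step can group into topics, issues, or themes.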


iii. Data Analysis

Longitudinal trends: changes in themes, entities, emotions, etc. over time

Cross-sectional associations: correlational/causal relationships between themes, entities, emotions, time, etc.

Spatial mapping: geographic relations among themes, entities, etc.

Network mapping: social relations among named entities

Semantic mapping: co-occurrence of keywords, themes, emotions, etc.
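Semantic mapping, as described in the last item, starts from co-occurrence counts. The sketch below counts how often two keywords appear in the same document; the function name and sample documents are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for every keyword pair, the number of documents containing both.

    docs is a list of token lists; each pair is stored in alphabetical order.
    """
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts
```

The resulting pair counts can be read as weighted edges of a keyword network. The same counting applied to named entities gives a starting point for the network mapping item.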


From Data to Analysis

Data Characteristics:

1. Massive size

2. Limited variables

3. Behavioral data

4. Time stamped

5. Location stamped

6. Multi-level

7. Networked

Primary Analysis Tools:

Exploratory analysis

Time series analysis

Network analysis (by Dr. Fujio Toriumi)
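Because the data are time stamped, a first step toward time series analysis is to bin events into regular intervals. A minimal sketch, assuming ISO 8601 timestamps and hourly bins (both are assumptions for illustration):

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_series(timestamps):
    """Turn ISO 8601 timestamps into a gap-free list of (hour, count) pairs."""
    if not timestamps:
        return []
    hours = [
        datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        for ts in timestamps
    ]
    counts = Counter(hours)
    series = []
    current, end = min(hours), max(hours)
    while current <= end:
        series.append((current, counts.get(current, 0)))  # empty hours count as 0
        current += timedelta(hours=1)
    return series
```

Filling empty hours with zeros matters: most time series methods expect equally spaced observations, and silent periods are themselves behavioral data.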


iv. Data Visualization

Basic Charts:

o Bar charts (for categories)

o Pie charts (for shares)

o Line charts (for trends)

Interactive charts:

o for drill-down in multiple layers

Animated/video charts:

o continuous display of multiple charts


Visual Analysis of Topic Competition on Social Media

[Figure: system diagram with the components collection of tweets, text search, time series (stream of tweets) by topic and user group, topic competition modeling, topic transition analysis, timeline visualization, raw tweets list, and word cloud.]

Source: Xu, P. P. et al. (2013). IEEE Transactions on Visualization and Computer Graphics, 19(12), 2012-2020.


Strengths and Weaknesses of CCR Methods

Strengths:

o Fast access to big data at low/no cost

o Unobtrusive measurement of user behavior (minimizing the social desirability bias associated with self-reported data)

o Discovery of new patterns/regularities from the bottom up

Weaknesses:

o Requires technical expertise and resources to obtain and process data

o Difficult to link observed behavior to underlying motivations

o Difficult to falsify bottom-up discoveries

o Legal and ethical concerns over privacy and security


Is CCR a New Bottle of Old Wine?

| Bottle (Methods Used) | Wine (Media Studied): Old | Wine (Media Studied): New |
|---|---|---|
| Traditional | OBOW (e.g., content analysis of national images in NYT news) | OBNW (e.g., survey of FB users to study self-disclosure) |
| Computational | NBOW (e.g., text mining of national images in NYT news) | NBNW (e.g., text mining of FB posts to study self-disclosure) |


What’s Wine

Phenomenon (What)

o Existence or change of a communication process

o Existence or change of a communication effect

Causal Mechanism (Why)

o Causes and working mechanism of the process

o Causes and working mechanism of the effect

Theory = Interpretation of the above


Classification of Scientific Knowledge

| Has the phenomenon been observed? | Predicted/explained: Yes | Predicted/explained: No |
|---|---|---|
| Yes | 1. Replicating existing knowledge in a new setting (OLD WINE) | 3. Offering a new theory for a known phenomenon (OLD or NEW WINE?) |
| No | 2. Offering new evidence for an existing theory (OLD or NEW WINE?) | 4. Offering brand new knowledge (NEW WINE) |


Exemplar Studies of CCR

Cited:

Wu et al. (2011). Who says what to whom on Twitter (communicator)

Benevenuto et al. (2009). User behavior in OSNs (user)

Zhao et al. (2011). Topic models between Twitter & NYT (content)

Petrovic et al. (2013). Twitter vs. newswire for breaking news (channel)

Bond et al. (2012). Experiment of 60m users on Facebook (effects)

Additional:

Watts & Strogatz (1998). Small-world networks

Barabasi (2005). Origin of bursts

Ugander et al. (2012). Structural diversity of social contagion

Goel et al. (2015). Structural virality of 1b events on Twitter


Wu et al. (2011). Who says what to whom on Twitter

Although ordinary users produced the largest number of tweets, each posted only 6 tweets on average, far fewer than media users (1,000) or other elite users.

[Figure: # of users per category: 5,000; 5,000; 5,000; 5,000; and 40,000,000.]


Benevenuto et al. (2009). Characterizing user behavior on online social networks

Degree of interaction increases by an order of magnitude when incorporating silent interactions

85% of the active users showed only silent interactions!


Zhao et al. (2011). Comparing Twitter and traditional media using topic models


Petrovic et al. (2013). Can Twitter replace newswire for breaking news?

“+” means tweets lead newswire in time. As shown in the table, Twitter leads 8 times, trails 15 times, and ties 4 times.


Bond et al. (2012). A 61m-person experiment in social influence and political mobilization

An effect of 0.3% amounts to 180,000 voters. Is it statistically significant but practically trivial?
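The question on this slide can be made concrete with a little arithmetic. The sketch below first scales a 0.3% effect to 61 million people, then applies a standard two-proportion z-test to the same 0.3-point difference at two sample sizes. The baseline rate of 50% and the group sizes are hypothetical numbers chosen for illustration; they are not taken from Bond et al. (2012).

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 0.3% of 61 million people: roughly the 180,000 quoted on the slide.
extra_voters = round(0.003 * 61_000_000)

# Hypothetical rates: 50.3% vs. 50.0%, i.e. the same 0.3-point gap.
z_large = two_proportion_z(0.503, 1_000_000, 0.500, 1_000_000)  # about 4.2
z_small = two_proportion_z(0.503, 10_000, 0.500, 10_000)        # about 0.4
```

With a million people per group the difference clears the conventional 1.96 threshold easily, while with ten thousand per group it does not. Statistical significance at big-data scale therefore says little about practical importance, which has to be judged from the effect size itself.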


Watts & Strogatz (1998). Collective dynamics of 'small-world' networks


Barabasi (2005). Origin of bursts and heavy tails in human dynamics


Ugander et al. (2012). Structural diversity in social contagion


Goel et al. (2015). Structural virality of online diffusion


Classification of Scientific Knowledge

| Has the phenomenon been observed? | Predicted/explained: Yes | Predicted/explained: No |
|---|---|---|
| Yes | Old Wine (e.g., Wu et al.) | New Wine (Watts & Strogatz; Barabasi) |
| No | New Wine (Bond et al.; Petrovic et al.) | Brand New Wine (Benevenuto et al.; Ugander et al.; Goel et al.) |


Concluding Remarks

CCR applies computational methods, tools, and algorithms to both existing questions and unexplored questions at different units of analysis and different time points

Although many current CCR studies replicate existing knowledge, an increasing number provide new evidence for existing hypotheses or new explanations for known facts, and a small number even offer brand new facts with explanations.

CCR is not yet a scientific paradigm, but it has all the necessary ingredients for a new one. The current bottleneck lies in the shortage of computational knowledge and skills among communication scholars.
