Computational Communication Research in the Big Data Era
Jonathan Zhu 祝建華
National Taiwan Normal University, October 5, 2016
Outline
Big Data: Technical Features
Computational Communication Research (CCR): Basic Methods
o Data Collection
o Data Preprocessing
o Data Analysis
o Data Visualization
Discussions: New Bottle of Old Wine?
o Examples of Studies on 5W
o When to use and when not to use CCR
Where Does Big Data Come from?
Web 1.0
Web 2.0/Social Media
Mobile Web
Internet of Things
Current Focus
Technical Features of Big Data
Feature | Offline/Analog Data | Online/Digital Data
Data size | hundreds to thousands | millions to billions
Format | analog text, audio, visual | digitized text, audio, visual
Measures | self-reported attitudes | digital records of behavior
Time-scale | one to several points on a monthly or yearly scale | unlimited points on a millisecond scale
Spatial info | usually unavailable | often available
A Sample of Server Log
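To give a sense of what one line of a server log contains, the sketch below parses a single line in the Common Log Format. The line is the hypothetical example used in the Apache HTTP Server documentation, not a record from the log shown in the talk, and the field names are assumptions chosen for readability.

```python
import re
from datetime import datetime

# Hypothetical log line (Apache documentation example), not data from the talk.
LINE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Common Log Format: host ident user [time] "request" status bytes
PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp into a timezone-aware datetime.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

record = parse_line(LINE)
```

Each parsed record is a time-stamped trace of behavior (who requested what, and when), which is the kind of measure the table above contrasts with self-reported attitudes.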
Data Size: Small, Mid, and Big
Anderson (2013) Forget big data, think mid data. http://blog.odintext.com/?p=300
[Figure: Return/Value plotted against Data Size (Size/Cost/Time) for Small, Mid Data, and Big, spanning Qualitative to Quantitative research, with a Sweet Point marked.]
Self-reported vs. Behavioral Measures
[Figure: Self-reported measures: paths among Attitudes, Behavioral Intention, and Behavior, labeled 0.1 - 0.2, 0.3 - 0.5, and 0.5 - 0.7 (Ajzen & Fishbein, 1977). Behavioral measures: Behavior predicting Attitudes, Demographics, and Personality.]
What Is Computational Social Science (CSS)?
Wikipedia: Computational social science (CSS) is the integrated, interdisciplinary pursuit of social inquiry with emphasis on information processing and through the medium of advanced computation. The main computational social science areas are automated information extraction systems, social network analysis, social geographic information systems (GIS), complexity modeling, and social simulation models. (wikipedia.org)
Lazer et al. (2009):
An emerging field that leverages the capacity to collect and analyze data at an unprecedented breadth, depth and scale that may reveal patterns of individual and group behaviors. (Science, 323)
The Multidisciplinary Coauthorship of Lazer et al. (2009)
1. D. Lazer (political science)
2. A. Pentland (computer science)
3. L. Adamic (computer science)
4. S. Aral (business)
5. A. L. Barabasi (physics)
6. D. D. Brewer (sociology)
7. N. A. Christakis (public health)
8. N. Contractor (communication)
9. J. H. Fowler (political science)
10. M. P. Gutman (history)
11. T. Jebara (computer science)
12. G. King (political science)
13. M. Macy (sociology)
14. D. Roy (cognitive science)
15. M. Van Alstyne (business)
CSS = Domain Knowledge + Statistical Tools + Computer Programming
Social Science Theory
Statistical Analysis
Computer Programming
What Is Computational Communication Research (CCR)?
CCR is not a new discipline, but an emerging paradigm for empirical research in communication
CCR applies CSS methods, tools, and algorithms to communication research
CCR is carried out by scholars from communication, computer science and other disciplines
Most studies of CCR focus on 5W questions (who says what to whom through which channel with what effects)
Traditional vs. Computational Methods in Communication Research
Traditional methods:
Survey
Experiment
Content analysis
Statistical analysis
Research ethics
...
Computational methods:
User logs analytics
A/B test
Text mining
Statistical/network analysis
Privacy & other ethical issues
...
Classifying CCR based on 5Ws
Harold Lasswell (1948): communication is a process of who says what through which channel to whom with what effects.
5W | Who | What | Channel | Whom | Effects
Study Subject | Communicator | Message | Medium | Audience | Effects
Traditional Method | Content analysis/interviews | Content analysis | Experiment | Survey | Survey/experiment
CCR Method | Text mining | Text mining | A/B test | Log data analytics | Log data analytics, A/B test
Exemplar Studies of CCR
Research Domain | Author (Year) | Method
Who/Communicator | Wu et al. (2011) | Text mining of Twitter users and posts
What/Content | Zhao et al. (2011) | Text mining of Twitter posts
Which Channel | Petrovic et al. (2013) | Text mining between newswire and Twitter
Whom/Audience | Benevenuto et al. (2009) | Clickstream analysis of Orkut visit logs
What Effects | Bond et al. (2012) | Experiment of Facebook effects
ICA Interest Group on Computational Methods (IGCM)
Existing divisions and interest groups:
o …
o Communication and Technology
o Information Systems
o International Communication
o Mass Communication and Society
o Political Communication
o …
Newly formed interest groups:
o Communication Science and Biology
o Computational Methods
o Public Diplomacy
Website: http://ica-cm.org
Basic Procedure and Methods of CCR
Data Collection: Sampling, Crawling, Archiving
Data Preprocessing: Parsing, Classifying, Clustering
Data Analysis: Trends, Association, Mapping
Data Visualization: Charts, Interactive visuals, Animated videos
Efforts/Resources Involved in the Steps
[Figure: effort/resources required at each step (Data Collection, Data Processing, Data Analysis, Data Visualization), compared for Survey vs. Log Analytics, Experiment vs. A/B Test, and Content Analysis vs. Text Mining.]
i. Data Collection
Sampling: decide which webpages to download
o Breadth-First Search (BFS)
o Random Walk (RW) Sampling
o Uniform Sampling (Zhu et al., 2012)
Crawling: download the chosen webpages
o N of visits to a website per time unit
o Elements to download (text, images, audio, video, etc.)
o Use of application program interface (API)
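The first two sampling strategies can be sketched on a toy link graph. This is a minimal illustration, not the procedure from any cited study: the graph, the function names, and the seed page are all hypothetical, and uniform sampling is not shown.

```python
import random
from collections import deque

# Hypothetical link graph: page -> pages it links to (illustration only).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs_sample(graph, seed, n):
    """Collect up to n pages in breadth-first order starting from seed."""
    visited, queue = [seed], deque([seed])
    while queue and len(visited) < n:
        page = queue.popleft()
        for neighbor in graph[page]:
            if neighbor not in visited and len(visited) < n:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

def random_walk_sample(graph, seed, steps, rng):
    """Follow one randomly chosen link per step; pages may repeat."""
    page, path = seed, [seed]
    for _ in range(steps):
        page = rng.choice(graph[page])
        path.append(page)
    return path

bfs_pages = bfs_sample(GRAPH, "A", 4)
walk_pages = random_walk_sample(GRAPH, "A", 10, random.Random(0))
```

BFS returns the pages closest to the seed, while the random walk revisits well-connected pages more often. Both therefore tend to over-represent highly linked pages, which is one motivation for looking at uniform sampling.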
Web Scraping
[Figure: Web pages and a web database are turned into a rectangular data file (ID, V1, V2, V3, ...) either by web scraping the pages or by retrieving records through an API.]
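The scraping path (web page to rectangular data file) can be sketched with the Python standard library alone. The page content, the PostParser class, and the meaning assigned to V1 (title) and V2 (link) are assumptions made for illustration; a real page would need its own extraction rules.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content (illustration only).
PAGE = """
<html><body>
  <div class="post"><h2>First post</h2><a href="/p/1">read</a></div>
  <div class="post"><h2>Second post</h2><a href="/p/2">read</a></div>
</body></html>
"""

class PostParser(HTMLParser):
    """Collect one row per post: the <h2> title and the link that follows it."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_title = False
        self._title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_title = True
            self._title = ""
        elif tag == "a":
            href = dict(attrs).get("href", "")
            self.rows.append({"ID": len(self.rows) + 1, "V1": self._title, "V2": href})

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title += data

parser = PostParser()
parser.feed(PAGE)

# Write the rows as a rectangular data file (CSV kept in memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "V1", "V2"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
```

When a site offers an API, the same rows usually arrive already structured, so the parsing step is replaced by a request for records.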
ii. (Text) Data Preprocessing
Parsing: extract targeted words from the text
o Keywords
o Phrases
o Sentences
o Paragraphs
o etc.
Classification: group the parsed words
o Named entities (persons, organizations, locations, events, etc.)
o Topics/themes
o Time (holidays, anniversaries, seasons, etc.)
o Emotions (adjectives, colors, mental states, etc.)
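Parsing and classification can be sketched as a dictionary lookup. The category word lists below are hypothetical and far too small for real use; an actual study would rely on validated lexicons or a trained classifier.

```python
import re
from collections import defaultdict

# Hypothetical category dictionaries; real studies would use validated lexicons.
CATEGORIES = {
    "emotion": {"happy", "angry", "sad"},
    "time": {"christmas", "anniversary", "summer"},
    "location": {"taipei", "london"},
}

def parse_words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify(words):
    """Group parsed words under every category whose dictionary contains them."""
    groups = defaultdict(list)
    for word in words:
        for category, vocabulary in CATEGORIES.items():
            if word in vocabulary:
                groups[category].append(word)
    return dict(groups)

text = "Happy Christmas from Taipei! Nobody is sad this summer."
groups = classify(parse_words(text))
```

Note that the lookup counts "sad" as an emotion even though the sentence negates it, which shows why dictionary-based classification usually needs further checks.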
From Texts to Topics
Texts (Tweets, Webpages, Posts, ...) -> Word Extraction (Words, Phrases, nGrams) -> Topic Extraction (Topics, Issues, Themes, ...) -> Ready for Analysis
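The word-extraction step can be illustrated with n-grams, that is, runs of n consecutive words. The posts below are hypothetical, and splitting on whitespace is a simplifying assumption that would not work for unsegmented languages such as Chinese.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical posts (illustration only).
posts = [
    "big data is big",
    "big data needs new methods",
]

bigram_counts = Counter()
for post in posts:
    bigram_counts.update(ngrams(post.split(), 2))

top_bigram, top_count = bigram_counts.most_common(1)[0]
```

Frequent n-grams such as ("big", "data") are candidate phrases that a later topic-extraction step can group into topics, issues, or themes.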
iii. Data Analysis
Longitudinal trends: changes of theme, entities, emotions, etc. over time
Cross-sectional associations: correlational/causal relationships between themes, entities, emotions, time, etc.
Spatial mapping: geographic relations among themes, entities, etc.
Network mapping: social relations among named entities
Semantic mapping: co-occurrence of keywords, themes, emotions, etc.
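Semantic mapping starts from co-occurrence counts, which can be sketched as follows. The keyword sets are hypothetical, and counting a pair once per document is one of several reasonable conventions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document (illustration only).
documents = [
    {"election", "economy", "anger"},
    {"election", "economy"},
    {"economy", "trade"},
]

cooccurrence = Counter()
for keywords in documents:
    # Sort so that each unordered pair is always counted under the same key.
    for pair in combinations(sorted(keywords), 2):
        cooccurrence[pair] += 1

# Strongest pairs first; ties are broken alphabetically.
edges = sorted(cooccurrence.items(), key=lambda item: (-item[1], item[0]))
```

The resulting weighted pairs can be read as the edges of a semantic network, with keywords as nodes.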
From Data to Analysis
Data Characteristics:
1. Massive size
2. Limited variables
3. Behavioral data
4. Time stamped
5. Location stamped
6. Multi-level
7. Networked
Primary Analysis Tools:
Exploratory analysis
Time series analysis
Network analysis (by Dr. Fujio Toriumi)
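Because the data are time stamped, a common first step before time series analysis is to aggregate events into fixed intervals. The sketch below counts events per hour; the timestamps are hypothetical and the hourly interval is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime

# Hypothetical time-stamped events (illustration only).
events = [
    "2016-10-05T09:15:02",
    "2016-10-05T09:47:30",
    "2016-10-05T10:05:11",
    "2016-10-05T12:30:00",
]

def hourly_counts(timestamps):
    """Count events per clock hour, returned in chronological order."""
    counts = Counter()
    for stamp in timestamps:
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return dict(sorted(counts.items()))

series = hourly_counts(events)
```

Hours with no events (11:00 here) are simply absent from the result, so they would need to be filled with zeros before fitting a time series model.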
iv. Data Visualization
Basic Charts:
o Bar charts (for categories)
o Pie charts (for shares)
o Line charts (for trends)
Interactive charts:
o for drill-down in multiple layers
Animated/video charts:
o continuous display of multiple charts
Visual Analysis of Topic Competition on Social Media
[Figure: pipeline from Collection of Tweets and Text Search to Time Series (stream of tweets by Topic and User group), Topic Competition Modeling, and Topic Transition Analysis; views include Timeline Visualization, Raw Tweets List, and Word Cloud.]
Source: Xu, P. P. et al. (2013). IEEE Transactions on Visualization and Computer Graphics, 19(12), 2012-2020.
Strengths and Weaknesses of CCR Methods
Strengths:
o Fast access to big data at low or no cost
o Unobtrusive measurement of user behavior (minimizing the social desirability bias associated with self-reported data)
o Discovery of new patterns/regularities from the bottom up
Weaknesses:
o Requires technical expertise and resources to obtain and process data
o Difficult to link observed behavior to underlying motivations
o Difficult to falsify bottom-up discoveries
o Legal and ethical concerns over privacy and security
Is CCR a New Bottle of Old Wine?
Bottle (Methods Used) | Old Wine (Media Studied) | New Wine (Media Studied)
Traditional | OBOW (e.g., content analysis of national images in NYT news) | OBNW (e.g., survey of FB users to study self-disclosure)
Computational | NBOW (e.g., text mining of national images in NYT news) | NBNW (e.g., text mining of FB posts to study self-disclosure)
What’s Wine
Phenomenon (What)
o Existence or change of a communication process
o Existence or change of a communication effect
Causal Mechanism (Why)
o Causes and working mechanism of the process
o Causes and working mechanism of the effect
Theory = Interpretation of the above
Classification of Scientific Knowledge
Has the phenomenon been observed? | Predicted/explained: Yes | Predicted/explained: No
Yes | 1. Replicating existing knowledge in a new setting (OLD WINE) | 3. Offering a new theory for a known phenomenon (OLD or NEW WINE?)
No | 2. Offering new evidence for an existing theory (OLD or NEW WINE?) | 4. Offering brand new knowledge (NEW WINE)
Exemplar Studies of CCR
Cited:
Wu et al. (2011). Who says what to whom on Twitter (communicator)
Benevenuto et al. (2009). User behavior in OSNs (user)
Zhao et al. (2011). Topic models between Twitter & NYT (content)
Petrovic et al. (2013). Twitter vs. newswire for breaking news (channel)
Bond et al. (2012). Experiment of 60m users on Facebook (effects)
Additional:
Watts et al. (1998). Small-world networks
Barabasi (2005). Origin of bursts
Ugander et al. (2012). Structural diversity of social contagion
Goel et al. (2015). Structural virality of 1b events on Twitter
Wu et al. (2011). Who says what to whom on Twitter
Although ordinary users produced the largest number of tweets, each of them posted only 6 tweets on average, far fewer than media accounts (1,000) or other elite users.
[Figure: # of users by category: 5,000; 5,000; 5,000; 5,000; 40,000,000]
Benevenuto et al. (2009). Characterizing user behavior on online social networks
Degree of interaction increases by an order of magnitude when incorporating silent interactions
85% of the active users showed only silent interactions!
Zhao et al. (2011). Comparing twitter and traditional media using topic models
Petrovic et al. (2013). Can Twitter replace newswire for breaking news?
“+” means tweets lead newswire in time. As shown in the table, tweets lead 8 times, trail 15 times, and tie 4 times.
Bond et al. (2012). A 61m-person experiment in social influence and political mobilization
An effect of 0.3% amounts to 180,000 voters. Is it statistically significant but practically trivial?
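The arithmetic behind the 180,000 figure can be checked directly. The sketch assumes a base of 60 million users, the rounded sample size quoted for this study in the list of exemplar studies; with the 61 million in the paper's title the result would be slightly higher.

```python
# Assumption: a base of 60 million users (rounded sample size quoted for this study).
users = 60_000_000
effect_per_thousand = 3  # 0.3% expressed as 3 per 1,000 to keep integer arithmetic

additional_voters = users * effect_per_thousand // 1000
```

A tiny percentage effect applied to a very large population still yields a large absolute count, which is why the practical-significance question is not settled by the percentage alone.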
Watts et al. (1998). Collective dynamics of 'small-world' networks
Barabasi (2005). Origin of bursts and heavy tails in human dynamics
Ugander et al. (2012). Structural diversity of social contagion
Goel et al. (2015). Structural virality of online diffusion
Classification of Scientific Knowledge
Has the phenomenon been observed? | Predicted/explained: Yes | Predicted/explained: No
Yes | Old Wine (e.g., Wu et al.) | New Wine (Watts et al.; Barabasi)
No | New Wine (Bond et al.; Petrovic et al.) | Brand New Wine (Benevenuto et al.; Ugander et al.; Goel et al.)
Concluding Remarks
CCR applies computational methods, tools, and algorithms to both existing questions and unexplored questions at different units of analysis and different time points
Although many current CCR studies replicate existing knowledge, an increasing number provide new evidence for existing hypotheses or new explanations for known facts, and a small number even offer brand new facts with explanations.
CCR is not a scientific paradigm yet, but it has all the necessary ingredients for a new paradigm. The current bottleneck lies in a shortage of computational knowledge and skills among communication scholars.