Analysis & Knowledge Extraction of Online User Behaviour and Visual Content
for Art and Culture Events
Marco Brambilla Tahereh Arabghalizi Behnam Rahdari
Contacts: Marco Brambilla, @marcobrambi, [email protected], http://datascience.deib.polimi.it
UNIVERSITY OF PITTSBURGH
Agenda
Context
Method
• Pre-processing
• Topic analysis
• User clustering
• Multimedia: Images
o concepts vs. text extraction
o color schema and the main color pattern(s)
• Prediction of interests
Challenges & Conclusions
Context
• Role of social media in our life
• Social media for cultural and artistic events
• Behaviour and content
• Multi-disciplinary collaboration on social media analysis and cultural heritage
• Collaboration: Politecnico di Milano, Musei di Brescia, University of Pittsburgh
Research Questions
Topics of interest of visitors?
Categorization of users?
Demographics of visitors?
Engagement and online participation?
Relation between photos, time, location, text and the event?
Approach
Domain-specific pipeline to profile social media users
and content in cultural or art events
Case Study
The Floating Piers by Christo and Jeanne-Claude
Iseo Lake, Italy
June 2016
• $17 million
• 220,000 floating blocks
• 1.5 million visitors in 16 days
Pre-processing
Data Extraction
• Using Instagram and Twitter APIs
• Extract relevant tweets/posts during the event
• Extract all relevant users
o who tweet/post directly
o who like, comment, retweet, etc.
• Extract all properties
o Textual: bio, tweet/post text, hashtag, etc.
o Quantitative: #followers, #followings, etc.
o Media: photos, metadata (geotag, …)
Collected Data (from June 10th to July 30th)
            Twitter          Instagram
Items       14,062 tweets    30,256 posts
Users       23,916           94,666
Authors     7,724            16,681
Reacting    16,197           77,985
Preprocessing
• Text normalization (NLP)
• Language identification and translation
• Gender detection
• Data cleansing
• Store clean and transformed data
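As a rough illustration of the text normalization and data cleansing steps, the sketch below lowercases a post, collects its hashtags, and strips URLs and mentions using only the Python standard library. The function name `normalize_post` and the sample text are hypothetical; the slides do not describe the actual implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def normalize_post(text):
    """Return (tokens, hashtags) for one tweet or post (illustrative only)."""
    text = text.lower()
    hashtags = HASHTAG_RE.findall(text)   # keep hashtags as a separate property
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @mentions
    text = text.replace("#", " ")         # keep hashtag words as plain tokens
    tokens = re.findall(r"\w+", text)     # split on non-word characters
    return tokens, hashtags

# Hypothetical sample post, not taken from the collected data.
tokens, hashtags = normalize_post(
    "Walking on water! #FloatingPiers #art @someone https://example.com/pic"
)
```

Language identification, translation and gender detection would normally rely on dedicated tools and are left out of this sketch.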
Time Distribution (Twitter)
Time series – Instagram vs. Twitter
Instagram Likes and Comments
Geographical Distribution (Instagram): Italy, Lombardy Region, Iseo Lake
Data Analysis Process
1. Document Term Matrix (DTM)
2. Topic Extraction
3. Dimension Reduction
4. Cluster Analysis and Validation
5. Prediction
6. Media Analysis
7. Content Network Analysis
Topics
Document-term Matrix
A matrix that describes the frequency of terms that
occur in a collection of documents
Documents (rows) vs. Terms (columns)
          Art   Travel   Italy   Design   …
Post 1    0     1        1       0
Post 2    1     2        0       1
Post 3    0     0        1       0
Post 4    1     1        3       1
…
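A document-term matrix of this kind can be built by counting terms per document. The sketch below is a minimal standard-library version; the function name `document_term_matrix` and the toy token lists are hypothetical, chosen only so that the counts match the four example posts on the slide.

```python
from collections import Counter

def document_term_matrix(docs, terms):
    """Return a matrix where row i, column j is the count of terms[j] in docs[i]."""
    matrix = []
    for doc in docs:
        counts = Counter(doc)                       # term frequencies for one document
        matrix.append([counts[t] for t in terms])   # Counter returns 0 for absent terms
    return matrix

terms = ["art", "travel", "italy", "design"]

# Hypothetical tokenized posts that reproduce the counts shown in the table.
docs = [
    ["travel", "italy"],
    ["art", "travel", "travel", "design"],
    ["italy"],
    ["art", "travel", "italy", "italy", "italy", "design"],
]

dtm = document_term_matrix(docs, terms)
```

In practice the vocabulary would be derived from the whole corpus rather than fixed by hand.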
Topic Extraction
Latent Dirichlet Allocation (LDA):
documents as mixtures of topics (with probability)
Input: Document Term Matrix
Outputs: Topics, Topic Probabilities Matrix
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 …
Post 1 0.19 0.16 0.27 0.14 0.11 0.13
Post 2 0.31 0.18 0.21 0.08 0.10 0.12
Post 3 0.25 0.24 0.20 0.17 0.09 0.05
Post 4 0.19 0.32 0.22 0.10 0.07 0.10
…
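Each row of the topic probabilities matrix can be read as a distribution over topics for one post. The sketch below copies the four example rows from the table, checks that each sums to 1, and picks the most probable topic per post. Treating the six listed columns as the full distribution is an assumption, since the table is truncated; `dominant_topic` is a hypothetical helper name.

```python
# Example rows copied from the table above (Topic 1 .. Topic 6).
topic_probs = {
    "Post 1": [0.19, 0.16, 0.27, 0.14, 0.11, 0.13],
    "Post 2": [0.31, 0.18, 0.21, 0.08, 0.10, 0.12],
    "Post 3": [0.25, 0.24, 0.20, 0.17, 0.09, 0.05],
    "Post 4": [0.19, 0.32, 0.22, 0.10, 0.07, 0.10],
}

def dominant_topic(probs):
    """Return the 1-based index of the most probable topic."""
    return max(range(len(probs)), key=lambda i: probs[i]) + 1

# Most probable topic for each post.
dominant = {post: dominant_topic(p) for post, p in topic_probs.items()}

# Each row should sum to 1 (up to floating-point rounding).
row_sums = {post: sum(p) for post, p in topic_probs.items()}
```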
Dimensionality Reduction
• Hundreds of topics extracted with LDA
• Using Principal Component Analysis (PCA) to extract a smaller set of linearly uncorrelated components
Figure: variance share and cumulative variance share per component (cumulative share > 0.95)
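One common way to use a cumulative variance threshold is to keep the smallest number of leading components whose cumulative variance share exceeds it. The sketch below assumes that reading of the 0.95 figure; the function name `components_to_keep` and the variance shares are hypothetical and not taken from the slides.

```python
from itertools import accumulate

def components_to_keep(variance_share, threshold=0.95):
    """Smallest number of leading components whose cumulative share exceeds threshold."""
    for k, cumulative in enumerate(accumulate(variance_share), start=1):
        if cumulative > threshold:
            return k
    return len(variance_share)  # threshold never exceeded: keep everything

# Hypothetical variance shares, sorted in decreasing order.
shares = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
k = components_to_keep(shares)
```

With these made-up shares the cumulative values are roughly 0.40, 0.65, 0.80, 0.90, 0.96, so five components are kept.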
User Clustering
Cluster Analysis
• Apply clustering algorithms over Topic Probabilities
Matrix to cluster users
• Multiple data slices
• Multiple algorithms
o K-means
o Hierarchical
o DBSCAN
Figure axes: Topic 1, Topic 2, Topic 3
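To make the clustering step concrete, here is a minimal k-means sketch (Lloyd's algorithm) over per-user topic probabilities. The slides do not say which implementation was used; the fixed initialization with the first k points, the function name `kmeans`, and the six user vectors are simplifying assumptions for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical users described by probabilities over three topics.
users = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
labels, centroids = kmeans(users, k=3)
```

A production run would typically use random restarts or a smarter initialization, and would compare the result against the hierarchical and DBSCAN alternatives listed above.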
Cluster Validity
• How to evaluate the “goodness” of the resulting clusters?
• Validation Measures
– Internal: e.g. Silhouette Coefficient, Dunn’s Index, Calinski-Harabasz index, etc.
– External: e.g. Entropy, Purity, Rand index, etc.
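As an example of an internal measure, the silhouette coefficient compares, for each point, the mean distance to its own cluster (a) with the smallest mean distance to any other cluster (b), giving (b - a) / max(a, b). The sketch below is a small standard-library version that assumes Euclidean distance and at least two clusters; the sample points are hypothetical.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so dividing by len - 1 is correct).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # mixed-up grouping
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest points assigned to the wrong cluster.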
User Clustering
Clusters: Travel Lovers, Art Lovers, Internet & Tech Lovers
Users’ Biography Word Clouds
Cluster Labeling
Word Network for Clusters: Travel Lovers, Art Lovers, Tech Lovers
Hierarchical Clustering
Impact of Demographics: Language, Gender
Prediction
Predict the category or the interest area of potential new users for
similar cultural or art events in the future
Decision Trees
o Prepare Required Data
o Grow Decision Tree
o Extract rules from the tree
o Predict using test data
o Evaluate
Extracted Rules
Rule 1: if (0.36 < Bio_score < 0.37 OR Bio_score < 0.35) then Travel Lover
Rule 2: if (0.35 < Bio_score < 0.36 AND Status_count > 14.5) OR (Bio_score > 0.37 AND Language != Italian) then Art Lover
Rule 3: if (Bio_score > 0.37 AND Language = Italian) then Tech Lover
Otherwise: Not Interested
Accuracy = 62%
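The extracted rules translate directly into a small function. The sketch below applies them in the order given; the function name `predict_category` and the representation of language as a plain string are assumptions. Because the rules use strict inequalities, boundary values such as a Bio_score of exactly 0.35 fall through to "Not Interested".

```python
def predict_category(bio_score, status_count, language):
    """Apply the three extracted rules in order; otherwise 'Not Interested'."""
    # Rule 1
    if 0.36 < bio_score < 0.37 or bio_score < 0.35:
        return "Travel Lover"
    # Rule 2
    if (0.35 < bio_score < 0.36 and status_count > 14.5) or (
        bio_score > 0.37 and language != "Italian"
    ):
        return "Art Lover"
    # Rule 3
    if bio_score > 0.37 and language == "Italian":
        return "Tech Lover"
    return "Not Interested"

# Hypothetical new users.
examples = [
    predict_category(0.30, 3, "Italian"),    # Rule 1
    predict_category(0.355, 20, "Italian"),  # Rule 2, first branch
    predict_category(0.40, 5, "English"),    # Rule 2, second branch
    predict_category(0.40, 5, "Italian"),    # Rule 3
    predict_category(0.355, 10, "Italian"),  # no rule matches
]
```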
Prediction rules
Decision Tree
Image Analysis
Data: Instagram only (30,256 posts; 94,666 users, of which 16,681 authors and 77,985 reacting), from June 10th to July 30th
Used Instagram Filters
Visitor Analytics: People in Pictures
• Age
• Sex: 50.4% female, 49.6% male
• Race
Bias of the medium?
Image content analysis
• Concept extraction (DNN-based third-party service)
• Comparison with hashtags / text
• Image low-level feature analysis
Concepts in Pictures vs. Hashtags
Users tend not to report the actual content of the photos in their textual descriptions / hashtags
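One simple way to quantify how far hashtags diverge from the concepts detected in a photo is a set-overlap score. The slides do not state which measure was used; the sketch below uses Jaccard similarity as an illustrative choice, with hypothetical concept and hashtag sets.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical concepts (from an image service) and hashtags for one photo.
concepts = {"water", "people", "bridge", "lake"}
hashtags = {"floatingpiers", "christo", "lake"}

overlap = jaccard(concepts, hashtags)  # 1 shared item out of 6 distinct items
```

A low score across many photos would be consistent with the observation that captions and hashtags rarely describe what is actually in the picture.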
Object Extraction from Pictures
Main color shades among all photos
Color Detection for Subject Identification
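A basic way to find the main color shades is to quantize each RGB pixel into a coarse bin and count the bins. The sketch below assumes this histogram approach, which may differ from the method used in the study; `main_color_shades` and the pixel values are hypothetical, and `levels` is assumed to divide 256 evenly.

```python
from collections import Counter

def main_color_shades(pixels, levels=4, top=3):
    """Quantize RGB pixels into levels**3 bins and return the most common bins."""
    step = 256 // levels

    def quantize(rgb):
        # Map each channel value to the centre of its bin.
        return tuple((c // step) * step + step // 2 for c in rgb)

    counts = Counter(quantize(p) for p in pixels)
    return counts.most_common(top)

# Hypothetical pixels: mostly orange-yellow tones plus a few blue ones.
pixels = [(250, 170, 20)] * 5 + [(245, 160, 30)] * 3 + [(20, 60, 200)] * 2
shades = main_color_shades(pixels)
```

Here the two slightly different orange tones land in the same bin, so the dominant shade covers 8 of the 10 pixels.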
Confusion Matrix
Simple techniques “good enough”?
Objects or Colors?
Ongoing Challenges
Future Challenges of KE
Determining exact positioning based on perspective
Future Challenges of KE
Network structures and their temporal evolution
Max graph perturbation
Daily graph variations
Future Challenges
Real cross-disciplinarity (cultural heritage, humanities, social science)
No visitors for the cultural part of the event (the exhibition at the museum)!
Conclusions
• (Sometimes) Simple methods work just fine
• Interesting profiling and behaviour detection
• Still far from cross-disciplinary approaches
Contacts: Marco Brambilla, @marcobrambi, [email protected]
http://datascience.deib.polimi.it
http://www.marco-brambilla.com