Preparing Social Data for Advanced Analytics, by Jason Xue, Director of Engineering

Preparing Social Media Data for Advanced Analytics


DESCRIPTION

Social media data is largely unstructured and very large in volume. To gather insights from advanced analytics, the data must first be preprocessed. The major preparation work for social media data includes:
• Filtering duplicates and spam, and applying blacklists and whitelists
• Detecting author language and country
• Analyzing sentiment by content tone and brand references
• Measuring author influence
• Indexing content and metadata


Page 1: Preparing Social Media Data for Advanced Analytics


Preparing Social Data for Advanced Analytics

Jason Xue, Director of Engineering

Page 2: Preparing Social Media Data for Advanced Analytics

To measure what matters to your business, we need to preprocess social media data before it can be used to gather insights.

Page 3: Preparing Social Media Data for Advanced Analytics

Filter duplicates and spam; apply blacklists and whitelists

Detect author language

Detect author country

Analyze sentiment – content tone and brand references

Measure author influences

Index content and metadata

Store content in distributed systems

Here’s How We Do It


Page 4: Preparing Social Media Data for Advanced Analytics

Because we collect social media data from multiple sources, we need to identify duplicate content by permalink and remove it.

Filter out spam to save computing resources. Spam can be detected by URL, title and content.

Define blacklists and whitelists to filter content.

Filter Content
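
The three filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline used by the author: the post fields (`permalink`, `title`, `content`), the term list and the domain lists are all hypothetical, and letting whitelisted domains bypass the spam check is an assumed policy.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; a real deployment would maintain its own.
SPAM_TERMS = {"free followers", "click here", "work from home"}
BLACKLIST_DOMAINS = {"spam.example"}
WHITELIST_DOMAINS = {"news.example"}


def domain_of(permalink):
    """Return the lower-cased host name of a permalink ('' if missing)."""
    return (urlsplit(permalink).hostname or "").lower()


def looks_like_spam(post):
    """Check URL, title and content against the spam term list."""
    text = " ".join([post["permalink"], post["title"], post["content"]]).lower()
    return any(term in text for term in SPAM_TERMS)


def filter_posts(posts):
    """Drop duplicates (by permalink), blacklisted domains and spam."""
    seen = set()
    kept = []
    for post in posts:
        link = post["permalink"]
        if link in seen:
            continue  # same permalink already delivered by another source
        seen.add(link)
        host = domain_of(link)
        if host in WHITELIST_DOMAINS:
            kept.append(post)  # assumed policy: whitelist bypasses other filters
            continue
        if host in BLACKLIST_DOMAINS or looks_like_spam(post):
            continue
        kept.append(post)
    return kept
```

Deduplicating before the spam check means each permalink is inspected only once, which fits the stated goal of saving computing resources.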


Page 5: Preparing Social Media Data for Advanced Analytics

Approaches (see details at http://www.slideshare.net/edma2/evaluation-of-language-identification-methods):

Common words and unique letter combinations

N-gram approach by Cavnar

Statistical approach by Ted Dunning

Compression based approach with PPM by Teahan

Language identification and character sets by Kikui

Software

Google's Compact Language Detector from Chrome (Library)

Google’s Translate APIs

SDL BeGlobal APIs

Microsoft language detection

Detect Language
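
As an illustration of the simplest approach listed above (common words), the sketch below counts how many tokens of a text appear in a short list of frequent words for each language. The word lists are tiny, hand-picked assumptions; production systems such as the libraries named above rely on much larger statistical profiles.

```python
import re

# Tiny illustrative word lists (assumptions); real profiles are far larger.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "es": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "fr": {"le", "la", "et", "est", "de", "que", "les", "des"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}


def detect_language(text):
    """Return the language whose common words occur most often, or None."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Short posts often contain too few common words to decide reliably, which is one reason the n-gram and statistical approaches in the list tend to be preferred for social content.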


Page 6: Preparing Social Media Data for Advanced Analytics

If a user has provided location information, we can use it to detect the user's country.

Detect Author Country – by User-Provided Location


Page 7: Preparing Social Media Data for Advanced Analytics

Data inaccuracy

Data specificity

Location level

Multiple places with the same name

Alternative spelling or abbreviation

The Challenges with User-provided Location
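
A minimal sketch of how free-text locations might be normalized while still respecting the challenges listed above. The alias table, the ambiguous-place table and the function name are hypothetical; a real system would draw on a gazetteer. The function returns a list of candidate countries so that ambiguity stays visible rather than being silently resolved.

```python
# Hypothetical alias table: normalized text -> ISO 3166-1 alpha-2 country code.
COUNTRY_ALIASES = {
    "usa": "US", "u.s.": "US", "united states": "US", "nyc": "US",
    "uk": "GB", "united kingdom": "GB",
    "deutschland": "DE", "germany": "DE",
}

# Hypothetical table of place names that exist in more than one country.
AMBIGUOUS_PLACES = {
    "paris": ["FR", "US"],
    "cambridge": ["GB", "US"],
}


def candidate_countries(raw_location):
    """Return candidate country codes for a user-provided location string."""
    parts = [p.strip().lower() for p in (raw_location or "").split(",") if p.strip()]
    # The most general part usually comes last ("City, State, Country").
    for part in reversed(parts):
        if part in COUNTRY_ALIASES:
            return [COUNTRY_ALIASES[part]]
    for part in reversed(parts):
        if part in AMBIGUOUS_PLACES:
            return list(AMBIGUOUS_PLACES[part])
    return []  # inaccurate or non-geographic text gives no signal
```

An empty or multi-element result signals that other evidence, such as the URL or the detected language, is needed to decide.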


Page 8: Preparing Social Media Data for Advanced Analytics

We can sometimes use URL domains and sub-domains to detect the author's country.

Challenges with URL

Improperly used country code domains

Domain hacks

Detect Author Country – By URL
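
The URL signal and its two challenges can be sketched as below. The country-code table is a small illustrative subset, and the set of top-level domains treated as "generic use" (a rough stand-in for domain hacks and improperly used country codes) is an assumption, not a list from the slides.

```python
from urllib.parse import urlsplit

# Small illustrative subset of country-code TLDs -> ISO country codes.
CC_TLDS = {"de": "DE", "fr": "FR", "jp": "JP", "uk": "GB", "br": "BR", "ca": "CA"}

# Country-code TLDs often used for branding or domain hacks (e.g. bit.ly),
# treated here as carrying no country signal (an assumption).
GENERIC_USE_TLDS = {"tv", "co", "me", "io", "ly", "fm"}


def country_from_url(url):
    """Guess a country code from the host name of a URL, or return None."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return None
    tld = labels[-1]
    if tld in GENERIC_USE_TLDS:
        return None
    if tld in CC_TLDS:
        return CC_TLDS[tld]
    # Fall back to a country-like sub-domain such as "fr.example.com".
    for label in labels[:-2]:
        if label in CC_TLDS:
            return CC_TLDS[label]
    return None
```

Returning `None` for the generic-use domains keeps a weak signal from overriding stronger ones such as a user-provided location.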


Page 9: Preparing Social Media Data for Advanced Analytics

When country information is absent, we can use the result of language detection as a signal for the author's country.

Challenges with Language

Ideally, the language detection result also indicates geography.

If not, we make a “best guess” based on a list of default countries per language.

Detect Author Country – By Language
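
A minimal sketch of this fallback, assuming the language detector returns a tag such as `pt-BR` or plain `pt`. The default-country table is hypothetical; the right defaults depend on the audience being measured.

```python
# Hypothetical defaults used only when the tag carries no region.
DEFAULT_COUNTRY_BY_LANGUAGE = {
    "en": "US", "es": "ES", "pt": "BR", "fr": "FR", "de": "DE", "ja": "JP",
}


def country_from_language(tag):
    """Return a country code from a language tag, or a best guess, or None."""
    if not tag:
        return None
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    # A two-letter alphabetic subtag after the language is taken as the region.
    for part in parts[1:]:
        if len(part) == 2 and part.isalpha():
            return part.upper()
    # No geography in the detection result: fall back to the default list.
    return DEFAULT_COUNTRY_BY_LANGUAGE.get(language)
```

For example, `country_from_language("en_GB")` returns `"GB"` from the tag itself, while `country_from_language("pt")` falls back to the assumed default `"BR"`.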


Page 10: Preparing Social Media Data for Advanced Analytics

Author influence, or authority ranking, can be measured by the following factors:

Facebook – friends, profiles, likes, replies

Twitter – followers, retweets, mentions, replies

YouTube – view counts

Flickr – view counts

Identi.ca – subscribers

Wikipedia – rankings

Platforms/websites for measuring influence:

http://klout.com

http://traackr.com/

Measure Author Influences
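
The slides list the factors but not how they are combined. One possible combination, shown purely as an assumption, is a weighted sum of log-scaled counts, so that an account with millions of followers does not completely dominate the score. The weights and the function name below are hypothetical.

```python
import math

# Hypothetical weights for the Twitter factors; not taken from the slides.
TWITTER_WEIGHTS = {"followers": 1.0, "retweets": 2.0, "mentions": 1.5, "replies": 1.0}


def influence_score(counts, weights):
    """Weighted sum of log10(1 + count) over the factors named in `weights`."""
    return sum(
        weight * math.log10(1 + counts.get(factor, 0))
        for factor, weight in weights.items()
    )
```

Scores from different platforms are not directly comparable under this scheme, since each platform would need its own weight table.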


Page 11: Preparing Social Media Data for Advanced Analytics

Content tone defines overall sentiment of a conversation.

Calculate Content Tone


Page 12: Preparing Social Media Data for Advanced Analytics

Content tone can be measured against predefined keywords of positive and negative emotion (Posemo, Negemo).

Content tone can be calculated as the difference between the number of positive and negative words, divided by the total number of words in a conversation, and then converted to a Likert scale.

Content tone calculation can be improved by machine learning.

How to Calculate Content Tone
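
The calculation described above can be sketched as follows. The keyword sets are tiny placeholders for real Posemo and Negemo dictionaries, and the cut points used to map the score onto a 5-point Likert scale are assumptions, since the slides do not specify them.

```python
import re

# Placeholder keyword sets (assumptions); real dictionaries are much larger.
POSEMO = {"good", "great", "love", "happy", "excellent"}
NEGEMO = {"bad", "terrible", "hate", "sad", "awful"}


def tone_score(text):
    """(positive words - negative words) / total words, in [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    positive = sum(token in POSEMO for token in tokens)
    negative = sum(token in NEGEMO for token in tokens)
    return (positive - negative) / len(tokens)


def to_likert(score):
    """Map a tone score to 1 (very negative) .. 5 (very positive).

    The thresholds are illustrative assumptions, not values from the slides.
    """
    if score <= -0.10:
        return 1
    if score < -0.02:
        return 2
    if score <= 0.02:
        return 3
    if score < 0.10:
        return 4
    return 5
```

For "I love this great phone", two of five words are positive, so the score is 0.4 and the Likert value under these thresholds is 5.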


Page 13: Preparing Social Media Data for Advanced Analytics

Brand References analyzes the positive and negative words surrounding a brand keyword within a conversation. Results are scored either Negative, Neutral or Positive.

Calculate Brand References


Page 14: Preparing Social Media Data for Advanced Analytics

It considers the proximity of a Posemo or Negemo keyword to the brand keyword queried. This will identify the phrase as a positive or negative sentiment.

It considers any Negating keyword close to the brand keyword and will invert the sentiment of the phrase to its opposite.

An overall label of Positive or Negative is applied depending on which phrases have the larger count in the content.

If no positive or negative phrases are found, or if there are the same number of each, then the content is given a label of Neutral.

How to Measure Brand References
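
A minimal sketch of the rules above. It assumes a single-word brand keyword, treats the tokens within a fixed window around each brand mention as one "phrase", and uses placeholder keyword sets; the window size of 3 and the function name are assumptions.

```python
import re
from collections import Counter

# Placeholder keyword sets (assumptions); real dictionaries are much larger.
POSEMO = {"good", "great", "love", "happy", "excellent"}
NEGEMO = {"bad", "terrible", "hate", "sad", "awful"}
NEGATORS = {"not", "never", "no", "hardly"}


def brand_reference(text, brand, window=3):
    """Label content as Positive, Negative or Neutral toward `brand`."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    brand = brand.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != brand:
            continue
        # Tokens within `window` positions of the brand mention form a phrase.
        nearby = tokens[max(0, i - window): i + window + 1]
        positive = sum(t in POSEMO for t in nearby)
        negative = sum(t in NEGEMO for t in nearby)
        if positive == negative:
            continue  # no clear sentiment near this mention
        label = "Positive" if positive > negative else "Negative"
        if any(t in NEGATORS for t in nearby):
            # A negating keyword inverts the sentiment of the phrase.
            label = "Negative" if label == "Positive" else "Positive"
        counts[label] += 1
    if counts["Positive"] > counts["Negative"]:
        return "Positive"
    if counts["Negative"] > counts["Positive"]:
        return "Negative"
    return "Neutral"
```

With these placeholder sets, "I do not love my Acme phone" is labeled Negative for the brand "Acme", because the negator falls inside the window and inverts the positive phrase.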


Page 15: Preparing Social Media Data for Advanced Analytics

Social media data needs to be indexed so that it can be searched and analyzed.

Preparation: Convert conversations to searchable terms by removing stop words

Stop words are defined for different languages

Index: Conversations are indexed by publication date and content type

Terms and metadata are mapped to document IDs (permalinks), which are in turn mapped to shard locations on machine nodes

Shard location is chosen by hashing the document ID

Each conversation (document), identified by its permalink, is stored on a primary shard and, optionally, on one or more replica shards

Index Social Media Data
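
The preparation and sharding steps above can be sketched with an in-memory inverted index. The stop word lists, the shard count, the document fields and the use of SHA-256 as the hash are illustrative assumptions; replica shards and the date and content-type metadata are left out for brevity.

```python
import hashlib
import re
from collections import defaultdict

# Illustrative per-language stop word lists (assumptions).
STOP_WORDS = {
    "en": {"the", "a", "is", "and", "of", "to"},
    "es": {"el", "la", "y", "de", "es"},
}
NUM_SHARDS = 4  # assumed number of primary shards


def to_terms(text, language):
    """Lower-case, tokenize and drop the stop words of the given language."""
    stop = STOP_WORDS.get(language, set())
    return [t for t in re.findall(r"[^\W_]+", text.lower()) if t not in stop]


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Choose a shard by hashing the document ID (the permalink)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_index(documents, num_shards=NUM_SHARDS):
    """Return one inverted index (term -> set of permalinks) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc in documents:
        index = shards[shard_for(doc["permalink"], num_shards)]
        for term in to_terms(doc["content"], doc["language"]):
            index[term].add(doc["permalink"])
    return shards
```

Because the shard is derived only from the permalink, any node can compute where a document lives without consulting a central lookup table.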


Page 16: Preparing Social Media Data for Advanced Analytics

Social media data is largely unstructured and very large in volume. It needs to be stored in distributed systems to be scalable and computable.

HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. We can use permalinks as an HBase key for social content.

Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports running advanced social analytics on large clusters of commodity hardware.

Storing Social Media Data
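
To make "sparse, multidimensional, sorted map" concrete, the sketch below models the idea in memory with the permalink as the row key. It is a stand-in for the data model only: the class and its methods are hypothetical and are not the HBase client API, and nothing here is distributed or persistent.

```python
class SortedSparseTable:
    """In-memory illustration of row key -> column -> timestamp -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp):
        """Store one cell version; rows only hold the columns they use (sparse)."""
        self._rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, prefix=""):
        """Yield row keys in sorted order, optionally limited to a key prefix."""
        for key in sorted(self._rows):
            if key.startswith(prefix):
                yield key
```

Because rows are kept in key order, permalinks that share a prefix, such as all posts from one site, sit next to each other and can be read with a single prefix scan.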
