sdl-social-intelligence
DESCRIPTION
Social media data, like many other data sets, is largely unstructured and very large. Before advanced analytics can produce insights, the data needs to be preprocessed. The major preparation work for social media data includes:
• Filtering duplicates and spam, and applying blacklists and whitelists
• Detecting author language and country
• Analyzing sentiment by content tone and brand references
• Measuring author influence
• Indexing content and metadata
1
Preparing Social Data for Advanced Analytics
Jason Xue, Director of Engineering
To measure what matters to your business, we need to pre-process social media data before it can be used to gather insights.
Filter duplicates and spam; apply blacklists and whitelists
Detect author language
Detect author country
Analyze sentiment – content tone and brand references
Measure author influence
Index content and metadata
Store content in distributed systems
Here’s How We Do It
3
Because we collect social media data from multiple data sources, we need to identify duplicate content by permalink and remove it.
Filter out spam to save computing resources. Spam can be detected by URL, title and content.
Define blacklists and whitelists to filter content.
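The three filtering steps can be sketched as follows. This is a minimal illustration, not the production pipeline: the post fields (permalink, title, content), the domain lists, the spam terms, and the rule that whitelisted sources bypass the spam check are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical lists; real ones would be maintained per customer or data source.
BLACKLIST_DOMAINS = {"spam.example"}
WHITELIST_DOMAINS = {"trusted.example"}
SPAM_TERMS = {"free money", "click here"}


def domain_of(permalink: str) -> str:
    """Return the lower-cased host name of a permalink."""
    return (urlsplit(permalink).hostname or "").lower()


def looks_like_spam(post: dict) -> bool:
    """Toy spam check over URL, title and content (simple substring match)."""
    text = " ".join(
        [post["permalink"], post.get("title", ""), post.get("content", "")]
    ).lower()
    return any(term in text for term in SPAM_TERMS)


def filter_posts(posts):
    """Yield posts that survive de-duplication, list checks and spam filtering."""
    seen = set()
    for post in posts:
        permalink = post["permalink"]
        if permalink in seen:
            continue  # same permalink already delivered by another data source
        seen.add(permalink)
        domain = domain_of(permalink)
        if domain in WHITELIST_DOMAINS:
            yield post  # assumption: whitelisted sources bypass the other checks
            continue
        if domain in BLACKLIST_DOMAINS or looks_like_spam(post):
            continue
        yield post
```

De-duplication runs first so that the more expensive spam check is never applied twice to the same permalink.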
Filter Content
4
Approaches (see details at http://www.slideshare.net/edma2/evaluation-of-language-identification-methods):
Common words and unique letter combinations
N-gram approach by Cavnar
Statistical approach by Ted Dunning
Compression based approach with PPM by Teahan
Language identification and character sets by Kikui
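As an illustration of the N-gram approach, the sketch below builds ranked character n-gram profiles and compares them with the rank-order ("out-of-place") distance described by Cavnar and Trenkle. The training samples, the profile size and the function names are assumptions for this example; a usable detector needs far larger training corpora per language.

```python
from collections import Counter

PROFILE_SIZE = 300  # number of top n-grams kept per profile (assumed value)

# Tiny illustrative samples; a real system would train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs away",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro se va",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund weg",
}


def ngram_profile(text, max_n=3, size=PROFILE_SIZE):
    """Return the most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))
```

Short texts such as tweets produce sparse profiles, which is one reason the library and API options listed under Software are usually preferable to a hand-rolled detector.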
Software
Google's Compact Language Detector from Chrome (Library)
Google’s Translate APIs
SDL BeGlobal APIs
Microsoft language detection
Detect Language
5
When a user provides location information, we can use it to detect the user’s country.
Detect Author Country – by User-Provided Location
6
Data inaccuracy
Data specificity
Location level
Multiple places with the same name
Alternative spelling or abbreviation
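A minimal sketch of how some of these challenges might be handled, assuming a small hand-built gazetteer (the table contents and function names are hypothetical). Alternative spellings are handled as extra table entries, and a place name that maps to more than one country is treated as unresolved rather than guessed.

```python
# Hypothetical gazetteer: normalized place name -> candidate ISO country codes.
GAZETTEER = {
    "usa": ["US"],
    "united states": ["US"],
    "u.s.": ["US"],
    "deutschland": ["DE"],
    "germany": ["DE"],
    "paris": ["FR", "US"],   # Paris, France and Paris, Texas
    "london": ["GB", "CA"],  # London, England and London, Ontario
}


def normalize(part: str) -> str:
    """Lower-case and collapse whitespace so spelling variants share one key."""
    return " ".join(part.strip().lower().split())


def country_from_location(raw: str):
    """Return an ISO country code, or None when the location is unknown or ambiguous."""
    parts = [normalize(p) for p in raw.split(",") if p.strip()]
    # Check the most general part first ("Paris, USA" -> "usa").
    for part in reversed(parts):
        candidates = GAZETTEER.get(part)
        if candidates and len(candidates) == 1:
            return candidates[0]
    return None
```

Returning None for ambiguous or free-form input leaves room for the other signals (URL and language) to decide.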
The Challenges with User-provided Location
7
We can sometimes use URL domains and sub-domains to detect an author’s country.
Challenges with URL
Improperly used country code domains
Domain hacks
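The URL signal and its two challenges can be sketched as below. All tables are hypothetical and heavily abbreviated: one maps country-code domains to countries, one lists country-code domains that are commonly used without any link to the country, and one lists known domain hacks.

```python
from urllib.parse import urlsplit

# Hypothetical tables for illustration only.
CCTLD_TO_COUNTRY = {"de": "DE", "fr": "FR", "jp": "JP", "uk": "GB", "ly": "LY", "tv": "TV"}
# Country-code TLDs widely used with no link to the country (e.g. .tv for video sites).
GENERIC_USE_CCTLDS = {"tv", "co", "me", "io"}
# Known domain hacks, where the TLD only completes a word.
DOMAIN_HACKS = {"bit.ly", "del.icio.us"}
# Hypothetical country-like sub-domain labels, as in "uk.example.com".
SUBDOMAIN_TO_COUNTRY = {"de": "DE", "fr": "FR", "jp": "JP", "uk": "GB"}


def country_from_url(url: str):
    """Return an ISO country code inferred from the host name, or None."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2 or host in DOMAIN_HACKS:
        return None
    tld = labels[-1]
    if tld in CCTLD_TO_COUNTRY and tld not in GENERIC_USE_CCTLDS:
        return CCTLD_TO_COUNTRY[tld]
    # Fall back to a country-like sub-domain.
    first = labels[0]
    if len(labels) > 2 and first in SUBDOMAIN_TO_COUNTRY:
        return SUBDOMAIN_TO_COUNTRY[first]
    return None
```

A sub-domain such as "de" may indicate a language rather than a country, so this fallback is best treated as a weak signal.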
Detect Author Country – By URL
8
When country information is absent, we can use the result of language detection as a signal for the author’s country.
Challenges with Language
Ideally, the language detection result includes geographic information (a regional variant of the language).
If not, we make a “best guess” based on a list of default countries per language.
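A minimal sketch of this fallback, assuming the detector returns a language tag such as "pt-PT" or "ja". The default-country table is hypothetical, and the tag parsing is deliberately simplified: a two-letter second subtag is treated as a region code.

```python
# Hypothetical default country per language; the real list is a business decision.
DEFAULT_COUNTRY_BY_LANGUAGE = {"ja": "JP", "de": "DE", "pt": "BR", "en": "US"}


def country_from_language(tag: str):
    """Map a language tag to (country, is_guess)."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    # Simplified handling: a two-letter second subtag is taken as the region.
    if len(parts) > 1 and len(parts[1]) == 2 and parts[1].isalpha():
        return parts[1].upper(), False
    # No region in the tag: fall back to the default country for the language.
    return DEFAULT_COUNTRY_BY_LANGUAGE.get(language), True
```

The is_guess flag lets downstream analytics distinguish a detected region from a defaulted one.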
Detect Author Country – By Language
9
Author influence, or authority ranking, can be measured by the following factors:
Facebook – friends, profiles, likes, replies
Twitter – followers, retweets, mentions, replies
YouTube – view counts
Flickr – view counts
Identi.ca – subscribers
Wikipedia – rankings
Platforms and websites for measuring influence:
http://klout.com
http://traackr.com/
Measure Author Influence
10
Content tone describes the overall sentiment of a conversation.
Calculate Content Tone
11
Content tone can be measured against predefined keywords for positive and negative emotions (Posemo, Negemo).
Content tone can be calculated as the difference between the number of positive and negative words, divided by the total number of words in a conversation, and then converted to a Likert scale.
Content tone calculation can be improved by machine learning.
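The calculation above amounts to score = (positive − negative) / total. The sketch below implements it with tiny stand-in word lists. The conversion to a Likert scale is not specified in the slides, so the linear mapping onto 1 to 5 used here is an assumption.

```python
import re

# Tiny hypothetical word lists standing in for the Posemo and Negemo dictionaries.
POSEMO = {"love", "great", "good", "happy"}
NEGEMO = {"hate", "bad", "terrible", "sad"}


def tone_score(text: str) -> float:
    """(positive - negative) / total words, in the range [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSEMO for w in words)
    neg = sum(w in NEGEMO for w in words)
    return (pos - neg) / len(words)


def to_likert(score: float) -> int:
    """Map a score in [-1, 1] linearly onto a 1..5 scale (assumed mapping)."""
    return max(1, min(5, round((score + 1) * 2) + 1))
```

For example, "I love this great phone" has two positive words out of five, giving a score of 0.4 and a Likert value of 4. Because emotion words are sparse in real text, a production mapping would likely need calibrated thresholds rather than a linear scale.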
How to Calculate Content Tone
12
Brand References analyzes the positive and negative words surrounding a brand keyword within a conversation. Each result is scored as Negative, Neutral or Positive.
Calculate Brand References
13
It considers the proximity of a Posemo or Negemo keyword to the brand keyword queried. This will identify the phrase as a positive or negative sentiment.
It considers any Negating keyword close to the brand keyword and will invert the sentiment of the phrase to its opposite.
An overall label of Positive or Negative is applied depending on which phrases have the larger count in the content.
If no positive or negative phrases are found, or if there are the same number of each, then the content is given a label of Neutral.
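These rules can be sketched as follows. The word lists, the three-word proximity window and the single-word brand keyword are assumptions made for the example; the slides do not specify how "close" is defined.

```python
import re

POSEMO = {"love", "great", "good"}
NEGEMO = {"hate", "bad", "terrible"}
NEGATORS = {"not", "never", "no", "don't"}
WINDOW = 3  # assumed proximity, in words, on each side of the brand keyword


def brand_reference(text: str, brand: str) -> str:
    """Label content Positive, Negative or Neutral with respect to a brand keyword."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = negative = 0
    for i, word in enumerate(words):
        if word != brand.lower():
            continue
        nearby = words[max(0, i - WINDOW): i + WINDOW + 1]
        pos = sum(w in POSEMO for w in nearby)
        neg = sum(w in NEGEMO for w in nearby)
        if pos == neg:
            continue  # no clear sentiment around this mention
        phrase_is_positive = pos > neg
        if any(w in NEGATORS for w in nearby):
            phrase_is_positive = not phrase_is_positive  # negation inverts the phrase
        if phrase_is_positive:
            positive += 1
        else:
            negative += 1
    if positive > negative:
        return "Positive"
    if negative > positive:
        return "Negative"
    return "Neutral"
```

With these lists, "I do not love Acme at all" is labeled Negative because the negator inverts the positive phrase around the brand keyword.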
How to Measure Brand References
14
Social media data needs to be indexed so that it can be searched and analyzed.
Preparation: Convert conversations to searchable terms by removing stop words
Stop words are defined for different languages
Index: Conversations are indexed by publication date and content type
Terms and metadata are mapped to document IDs (permalinks) and then to shard locations on machine nodes
Shard location is chosen by hashing the document ID
The permalink of a conversation (document) is stored on a primary shard, and optionally one or more replica shards
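A minimal in-memory sketch of the preparation and sharding steps, assuming abbreviated stop-word lists and four primary shards. It does not cover replica shards or indexing by publication date and content type, and the class and function names are hypothetical. SHA-256 is used only as a stable hash, because Python's built-in hash() for strings varies between processes.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical per-language stop-word lists (heavily abbreviated).
STOP_WORDS = {
    "en": {"the", "a", "is", "and", "of", "to"},
    "de": {"der", "die", "das", "und", "ist"},
}
NUM_SHARDS = 4  # assumed number of primary shards


def to_terms(text: str, language: str):
    """Lower-case, tokenize and drop stop words for the given language."""
    stop = STOP_WORDS.get(language, set())
    return [w for w in re.findall(r"\w+", text.lower()) if w not in stop]


def shard_for(permalink: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the document ID (the permalink)."""
    digest = hashlib.sha256(permalink.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedIndex:
    def __init__(self, num_shards: int = NUM_SHARDS):
        # One inverted index (term -> set of permalinks) per shard.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, permalink: str, text: str, language: str):
        shard = self.shards[shard_for(permalink, len(self.shards))]
        for term in to_terms(text, language):
            shard[term].add(permalink)

    def search(self, term: str):
        """Query every shard and merge the matching permalinks."""
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term.lower(), set())
        return hits
```

Hashing the permalink spreads documents evenly across shards, at the cost of having to query every shard for a term search.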
Index Social Media Data
15
Social media data is unstructured and very large. It needs to be stored in distributed systems so that storage and computation can scale.
HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. We can use permalinks as HBase keys for social content.
Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports running advanced social analytics on large clusters of commodity hardware.
Storing Social Media Data
16
To learn more, visit: sdl.com/si
17