1
Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler
Scattertext: A Tool for Visualizing Differences in Language
2
Word frequency
• Women and men tend to use different terms on Facebook.
  • As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the presidential debate.
• Reveal differences in
  • content,
  • perceived strengths and weaknesses,
  • communication style
• These are often obvious after being surfaced
3
Outline
• Previous work
  • Ways of visualizing word association
• Scattertext
  • Open-source Python/D3 framework for visualizing these differences
  • Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell cars.
  • We’re hiring senior data scientists + devs in Austin and Seattle.
4
OKCupid: an online dating site
hobos
almond butter
100 Years of Solitude
Bikram yoga
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Which words and phrases statistically distinguish ethnic groups and genders?
5
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
High distance: white men ignore k-pop
Low distance: white men disproportionately mention Phish
The smaller the distance from the top left, the higher the association with white men.
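A minimal sketch of the distance idea above, not Rudder's actual method. It assumes the x-axis is a term's frequency rank among everyone else and the y-axis is its rank within the group, both scaled to [0, 1]; the function names and the toy counts are made up for illustration.

```python
import math
from collections import Counter


def percentile_ranks(counts, vocab):
    """Map each term to a rank in [0, 1]; 1.0 is the most frequent term."""
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(vocab, key=lambda term: (counts.get(term, 0), term))
    denom = max(len(ordered) - 1, 1)
    return {term: i / denom for i, term in enumerate(ordered)}


def top_left_distances(group_counts, other_counts):
    """Euclidean distance of each term from the top-left corner (x=0, y=1)."""
    vocab = set(group_counts) | set(other_counts)
    group_rank = percentile_ranks(group_counts, vocab)   # y-axis (assumed)
    other_rank = percentile_ranks(other_counts, vocab)   # x-axis (assumed)
    return {
        term: math.hypot(other_rank[term], 1.0 - group_rank[term])
        for term in vocab
    }


# Toy counts, invented for this example only.
group = Counter({"phish": 50, "hiking": 40, "k-pop": 1, "music": 60})
others = Counter({"phish": 2, "hiking": 35, "k-pop": 45, "music": 70})

distances = top_left_distances(group, others)
closest = min(distances, key=distances.get)    # most associated with the group
farthest = max(distances, key=distances.get)   # least associated with the group
```

With these toy counts, the term used often by the group and rarely by everyone else ends up closest to the corner.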
6
Source: Christian Rudder. Dataclysm. 2014.
my blue eyes
7
Scattertext: Democrats vs Republicans: 2012 Convention Speeches
8
Word Use Reflecting Gender and Personality in Facebook Statuses
• Objective:
  • Find words, phrases, and topics that correlate with
    • gender, and
    • Big 5 personality type
• Data source:
  • My Personality App
  • 75k voluntary participants in a Facebook-based survey, >300 million words
  • Participants agreed to give researchers access to their statuses.
• Scoring algorithm:
  • Linear regression weights, 2000 LDA topics.
Lyle Ungar, 2013 AAAI Tutorial
Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.
9
Lyle Ungar, 2013 AAAI Tutorial
The good:
• Word clouds force you to hunt for the most impactful terms
• You end up examining the long tail in the process
• Compactly represent a lot of phrases and topics
10
Lyle Ungar, 2013 AAAI Tutorial
The bad:
• “Mullets of the Internet” --Jeffrey Zeldman, 2005
• Longer phrases are more prominent.
• Ranking is unclear
• Does size indicate higher frequency?
11
Lyle Ungar, 2013 AAAI Tutorial
12
Lyle Ungar, 2013 AAAI Tutorial
13
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
NYT: 2012 Political Convention Word Use by Party
14
Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
15
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin'words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
[Figure axis label: Difference in z-scores of log-odds w/ prior]
Log-odds for word w, categories A, B:
\[ \log\frac{\#(w,A)}{|A| - \#(w,A)} - \log\frac{\#(w,B)}{|B| - \#(w,B)} \]
Log-odds w/ Dirichlet prior, given background corpus C:
\[ \log\frac{\#(w,A) + \#(w,C)}{|A| + |C| - \#(w,A) - \#(w,C)} - \dots \]
• Difference in z-score accounts for variation in word frequencies.
• Words with differences < 1.96 are greyed out.
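A rough sketch of the scoring on this slide, under stated assumptions rather than a reference implementation. The slide elides the second term of the prior-smoothed formula; the code assumes it mirrors the first term with B in place of A. The variance approximation 1/(#(w,A) + #(w,C)) + 1/(#(w,B) + #(w,C)) follows Monroe et al.; the function name and toy counts are invented for illustration.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, counts_c):
    """z-scores of the log-odds difference, using corpus C as a Dirichlet prior."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n_c = sum(counts_c.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        prior = counts_c.get(word, 0)
        if prior == 0:
            # No prior mass for this word: skip it rather than risk log(0)
            # or a division by zero when a category count is also zero.
            continue
        y_a = counts_a.get(word, 0) + prior
        y_b = counts_b.get(word, 0) + prior
        log_odds_a = math.log(y_a / (n_a + n_c - y_a))
        log_odds_b = math.log(y_b / (n_b + n_c - y_b))
        # Approximate variance of the difference (Monroe et al.).
        variance = 1.0 / y_a + 1.0 / y_b
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


# Toy counts, invented for this example only.
category_a = Counter({"health": 30, "jobs": 20, "the": 100})
category_b = Counter({"health": 5, "jobs": 25, "the": 100, "freedom": 20})
background = Counter({"health": 10, "jobs": 10, "the": 200, "freedom": 10})

scores = log_odds_z_scores(category_a, category_b, background)
# Terms whose |z| falls below 1.96 would be the ones greyed out on the chart.
significant = {w for w, z in scores.items() if abs(z) >= 1.96}
```

Positive scores lean toward category A, negative toward category B; a word used equally in both lands near zero.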
16
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin'words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
[Figure axis label: Difference in z-scores of log-odds w/ prior]
• Pros:
  • Popular among major CL researchers (3rd edition of J+M)
  • Favors words which appear less frequently in the background.
  • Natural linear word listing
• Cons:
  • You have to pick a representative, large background corpus.
  • If the corpus is small, divide-by-0 issue
  • Probably only practical for unigrams
  • Inefficient use of space on chart
17
Repo: https://github.com/JasonKessler/scattertext
$ pip install scattertext
Why the plots look the way they do: http://bit.ly/scattertextdevelopment
Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates
Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
Hands-on Tutorial
18
CDK Global: Finding Words that Sell Cars
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Conversion rate of everyone who read review: 15/20 = 75%
Median conversion rate: 22%
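The arithmetic for the example review, as a small sketch. The reader and visitor counts and the 22% median come from the slide; the function and variable names are illustrative only.

```python
def conversion_rate(readers, visitors):
    """Fraction of a review's readers who went on to visit a dealer's website."""
    if readers <= 0:
        raise ValueError("a review needs at least one reader")
    return visitors / readers


MEDIAN_CONVERSION_RATE = 0.22  # median across reviews, from the slide

cruze_rate = conversion_rate(readers=20, visitors=15)
beats_median = cruze_rate > MEDIAN_CONVERSION_RATE
print(f"Cruze review: {cruze_rate:.0%} vs. median {MEDIAN_CONVERSION_RATE:.0%}")
```

With only 20 readers, a single review's rate is noisy, which is one reason to look at words across many reviews rather than at individual reviews.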
19
CDK Global: Finding Words that Sell Cars
5 star review words
• Love
• Comfortable
• Features
• Solid
• Amazing
<3 star review words
• Transmission
• Problem
• Issue
• Dealership
• Times
20
CDK Global: Finding Words that Sell Cars
5 star review words     High conversion words
Love                    Comfortable
Comfortable             Front [Seats]
Features                Acceleration
Solid                   Free [Car Wash, Oil Change]
Amazing                 Quiet
<3 star review words
• Transmission
• Problem
• Issue
• Dealership
• Times
21
CDK Global: Finding Words that Sell Cars
5 star review words     High conversion words
Love                    Comfortable
Comfortable             Front [Seats]
Features                Acceleration
Solid                   Free [Car Wash, Oil Change]
Amazing                 Quiet
<3 star review words    Low conversion words
Transmission            Money [spend my, save]
Problem                 Features
Issue                   Dealership
Dealership              Amazing
Times                   Build Quality [typically positive]
22
CDK Global: Finding Words that Sell Cars (SUV Specific)
5 star review words     High conversion words
Love                    Comfortable
Comfortable             Front [Seats]
Features                Acceleration
Solid                   Free [Car Wash, Oil Change]
Amazing                 Quiet
<3 star review words    Low conversion words
Transmission            Money [spend my, save]
Problem                 Features
Issue                   Dealership
Dealership              Amazing
Times                   Build Quality [typically positive]
The worst thing you can say about an SUV may be:
I saved money and got all these amazing features!
23
Thank you.
[first].[last]@gmail.com
Please see https://github.com/JasonKessler/scattertext for more info on this project.
We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more.
https://jobs.cdkglobal.com/