Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler

Scattertext: A Tool for Visualizing Differences in Language


Page 1: Scattertext: A Tool for Visualizing Differences in Language

1

Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler

Scattertext: A Tool for Visualizing Differences in Language

Page 2: Scattertext: A Tool for Visualizing Differences in Language

2

Word frequency

• Women and men tend to use different terms on Facebook.
• As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the presidential debate.
• Reveal differences in
  • content,
  • perceived strengths and weaknesses
  • communication style
• These are often obvious after being surfaced

Page 3: Scattertext: A Tool for Visualizing Differences in Language

3

Outline

• Previous work
  • Ways of visualizing word association
• Scattertext
  • Open-source Python/D3 framework for visualizing these differences
  • Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell cars.
  • We’re hiring senior data scientists + devs in Austin and Seattle.

Page 4: Scattertext: A Tool for Visualizing Differences in Language

4

OKCupid: an online dating site

hobos

almond butter

100 Years of Solitude

Bikram yoga

Christian Rudder: http://blog.okcupid.com/index.php/page/7/

Which words and phrases statistically distinguish ethnic groups and genders?

Page 5: Scattertext: A Tool for Visualizing Differences in Language

5

Source: Christian Rudder. Dataclysm. 2014.

Ranking with everyone else

High distance: white men ignore k-pop

Low distance: white men disproportionately mention Phish

The smaller the distance from the top left, the higher the association with white men.
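
One way to read the plot described above as a scoring rule is sketched here. It assumes each term gets a frequency rank scaled to [0, 1] within the group of interest (y-axis) and within everyone else (x-axis), so that group-specific terms sit near the top-left corner. The axis assignment, the rank scaling, the function names, and the counts are all illustrative assumptions, not Rudder's data or code.

```python
import math

def percentile_ranks(counts):
    """Map each term to a rank in [0, 1]; 1.0 is the most frequent term."""
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda term: (counts[term], term))
    last = max(len(ordered) - 1, 1)
    return {term: i / last for i, term in enumerate(ordered)}

def distance_from_top_left(group_counts, other_counts):
    """Distance from the (x=0, y=1) corner; smaller means more group-specific."""
    terms = set(group_counts) | set(other_counts)
    group_rank = percentile_ranks({t: group_counts.get(t, 0) for t in terms})
    other_rank = percentile_ranks({t: other_counts.get(t, 0) for t in terms})
    return {
        t: math.hypot(other_rank[t], 1.0 - group_rank[t])
        for t in terms
    }

# Made-up counts, purely for illustration.
group = {"phish": 50, "k-pop": 1, "pizza": 40, "movies": 45}
others = {"phish": 2, "k-pop": 60, "pizza": 41, "movies": 50}

scores = distance_from_top_left(group, others)
ranked = sorted(scores, key=scores.get)  # most group-specific term first
print(ranked)
```

Using ranks rather than raw counts keeps very common words from dominating, which is presumably why the original plot ranks terms on both axes.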

Page 6: Scattertext: A Tool for Visualizing Differences in Language

6

Source: Christian Rudder. Dataclysm. 2014.

my blue eyes

Page 7: Scattertext: A Tool for Visualizing Differences in Language

7

Scattertext: Democrats vs Republicans: 2012 Convention Speeches

Page 8: Scattertext: A Tool for Visualizing Differences in Language

8

Word Use Reflecting Gender and Personality in Facebook Statuses

• Objective:
  • Find words, phrases, and topics that correlate to
    • gender, and
    • Big 5 personality type
• Data source:
  • My Personality App
  • 75k voluntary participants in Facebook-based survey, >300mm words
  • Agreed to give researchers access to statuses.
• Scoring algorithm
  • Linear regression weights, 2000 LDA topics.

Lyle Ungar, 2013 AAAI Tutorial

Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.

Page 9: Scattertext: A Tool for Visualizing Differences in Language

9

Lyle Ungar, 2013 AAAI Tutorial

The good:

• Word clouds force you to hunt for the most impactful terms
• You end up examining the long tail in the process
• Compactly represent a lot of phrases and topics

Page 10: Scattertext: A Tool for Visualizing Differences in Language

10

Lyle Ungar, 2013 AAAI Tutorial

The bad:

• “Mullets of the Internet” --Jeffrey Zeldman, 2005

• Longer phrases are more prominent.

• Ranking is unclear

• Does size indicate higher frequency?

Page 11: Scattertext: A Tool for Visualizing Differences in Language

11

Lyle Ungar, 2013 AAAI Tutorial

Page 12: Scattertext: A Tool for Visualizing Differences in Language

12

Lyle Ungar, 2013 AAAI Tutorial

Page 13: Scattertext: A Tool for Visualizing Differences in Language

13

Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html

NYT: 2012 Political Convention Word Use by Party

Page 14: Scattertext: A Tool for Visualizing Differences in Language

14

Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html

Page 15: Scattertext: A Tool for Visualizing Differences in Language

15

Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.

Plot axis: Difference in z-scores of log-odds w/ prior

Log-odds for word w, categories A, B:

$$\log\frac{\#(w,A)}{|A| - \#(w,A)} - \log\frac{\#(w,B)}{|B| - \#(w,B)}$$

Log-odds w/ Dirichlet prior, given background corpus C:

$$\log\frac{\#(w,A) + \#(w,C)}{|A| + |C| - \#(w,A) - \#(w,C)} - \dots$$

• Difference in z-score accounts for variation in word frequencies.

• Words whose z-score difference is below 1.96 in absolute value (not significant at the 5% level) are greyed out.
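
The scoring above can be sketched in a few lines of Python. This is a minimal sketch, assuming the background corpus counts #(w, C) serve directly as the Dirichlet prior and using the variance approximation from Monroe et al. (2008), 1/(#(w,A) + #(w,C)) + 1/(#(w,B) + #(w,C)). The function name and the toy counts are illustrative assumptions, and words missing from the background corpus are skipped because they would have no prior mass.

```python
import math

def log_odds_z_scores(counts_a, counts_b, counts_bg):
    """z-scores of log-odds differences, with background counts as the prior."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n_bg = sum(counts_bg.values())
    z = {}
    for word, prior in counts_bg.items():
        if prior == 0:
            continue  # no prior mass: the variance term would divide by zero
        y_a = counts_a.get(word, 0) + prior  # #(w,A) + #(w,C)
        y_b = counts_b.get(word, 0) + prior  # #(w,B) + #(w,C)
        log_odds_a = math.log(y_a / (n_a + n_bg - y_a))
        log_odds_b = math.log(y_b / (n_b + n_bg - y_b))
        variance = 1.0 / y_a + 1.0 / y_b
        z[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z

# Toy counts, made up for illustration only.
counts_a = {"jobs": 30, "freedom": 5, "the": 100}
counts_b = {"jobs": 5, "freedom": 30, "the": 100}
counts_bg = {"jobs": 10, "freedom": 10, "the": 1000, "of": 500}

z = log_odds_z_scores(counts_a, counts_b, counts_bg)
significant = sorted(w for w, score in z.items() if abs(score) >= 1.96)
print(significant)
```

In this toy data "the" is equally common in both categories, so its score is zero and it would be greyed out, while "jobs" and "freedom" clear the 1.96 threshold in opposite directions.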

Page 16: Scattertext: A Tool for Visualizing Differences in Language

16

Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.

Plot axis: Difference in z-scores of log-odds w/ prior

• Pros:
  • Popular among major CL researchers (3rd edition of J+M)
  • Favors words which appear less frequently in the background.
  • Natural linear word listing
• Cons:
  • You have to pick a large, representative background corpus.
  • If the corpus is small, there is a divide-by-zero issue
  • Probably only practical for unigrams
  • Inefficient use of space on chart

Page 17: Scattertext: A Tool for Visualizing Differences in Language

17

Repo: https://github.com/JasonKessler/scattertext

$ pip install scattertext

Why the plots look the way they do: http://bit.ly/scattertextdevelopment

Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates

Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie

Hands-on Tutorial

Page 18: Scattertext: A Tool for Visualizing Differences in Language

18

CDK Global: Finding Words that Sell Cars

Example Review Appearing on a 3rd Party Automotive Site

Car Reviewed: Chevy Cruze

Rating: 4.4/5 Stars

Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…

# of users who read review: 20

# who went on to visit a Chevy dealer’s website: 15

Conversion rate of everyone who read review: 15/20 = 75%

Median conversion rate: 22%
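
The conversion-rate arithmetic on this slide can be written out as a short Python sketch. Only the 20 readers and 15 dealer-site visitors come from the slide; the other reviews, the function name, and the variable names are hypothetical and exist only to show how a per-review rate compares against a median.

```python
from statistics import median

def conversion_rate(readers, converters):
    """Fraction of a review's readers who went on to visit a dealer's website."""
    if readers == 0:
        raise ValueError("review has no readers")
    return converters / readers

# The slide's example: 20 readers, 15 of whom visited a Chevy dealer's website.
cruze_rate = conversion_rate(readers=20, converters=15)
print(f"Cruze review: {cruze_rate:.0%}")

# Hypothetical (readers, converters) pairs for other reviews, illustration only.
other_reviews = [(50, 10), (40, 10), (30, 6), (10, 4)]
rates = [conversion_rate(r, c) for r, c in other_reviews] + [cruze_rate]

# A review counts as high-converting here if it beats the median rate.
is_high_converting = cruze_rate > median(rates)
print(is_high_converting)
```

Comparing against the median rather than the mean keeps a handful of reviews with very few readers from skewing the baseline.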

Page 19: Scattertext: A Tool for Visualizing Differences in Language

19

CDK Global: Finding Words that Sell Cars

5 star review words
Love
Comfortable
Features
Solid
Amazing

<3 star review words
Transmission
Problem
Issue
Dealership
Times

Page 20: Scattertext: A Tool for Visualizing Differences in Language

20

CDK Global: Finding Words that Sell Cars

5 star review words    High conversion words
Love                   Comfortable
Comfortable            Front [Seats]
Features               Acceleration
Solid                  Free [Car Wash, Oil Change]
Amazing                Quiet

<3 star review words
Transmission
Problem
Issue
Dealership
Times

Page 21: Scattertext: A Tool for Visualizing Differences in Language

21

CDK Global: Finding Words that Sell Cars

5 star review words    High conversion words
Love                   Comfortable
Comfortable            Front [Seats]
Features               Acceleration
Solid                  Free [Car Wash, Oil Change]
Amazing                Quiet

<3 star review words   Low conversion words
Transmission           Money [spend my, save]
Problem                Features
Issue                  Dealership
Dealership             Amazing
Times                  Build Quality [typically positive]

Page 22: Scattertext: A Tool for Visualizing Differences in Language

22

CDK Global: Finding Words that Sell Cars (SUV Specific)

5 star review words    High conversion words
Love                   Comfortable
Comfortable            Front [Seats]
Features               Acceleration
Solid                  Free [Car Wash, Oil Change]
Amazing                Quiet

<3 star review words   Low conversion words
Transmission           Money [spend my, save]
Problem                Features
Issue                  Dealership
Dealership             Amazing
Times                  Build Quality [typically positive]

The worst thing you can say about an SUV may be:

I saved money and got all these amazing features!

Page 23: Scattertext: A Tool for Visualizing Differences in Language

23

Thank you.

[first].[last]@gmail.com

Please see https://github.com/JasonKessler/scattertext for more info on this project.

We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more.

https://jobs.cdkglobal.com/