
Mining the social web ch1



2011.10.15

Ch1. Introduction: Hacking on Twitter Data

chois79

Thursday, October 20, 2011


Installing Python Development Tools

✤ python
  ✤ http://www.python.org/download
✤ python package manager tools
  ✤ allow you to effortlessly install Python packages
  ✤ easy_install
    ✤ http://pypi.python.org/pypi/setuptools
  ✤ pip
    ✤ http://www.pip-installer.org/en/latest/installing.html
✤ networkx
  ✤ creating and manipulating graphs and networks
  ✤ ex) easy_install networkx or pip install networkx


Collecting and Manipulating Twitter Data


Tinkering with Twitter’s API(1/2)

✤ Setup
  ✤ easy_install twitter
  ✤ note: Twitter’s API has since been updated
    ✤ http://github.com/sixohsix/twitter/issues/56
  ✤ The Minimalist Twitter API for Python is a Python API for Twitter
✤ Equivalent REST query
  ✤ http://search.twitter.com/trends.json


Tinkering with Twitter’s API(2/2)

✤ Retrieving Twitter search trends

✤ Paging through Twitter search results

# ex.3
import twitter

twitter_api = twitter.Twitter()
WORLD_WOE_ID = 1  # The Yahoo! Where On Earth ID for the entire world
world_trends = twitter_api.trends._(WORLD_WOE_ID)  # get back a callable
# [ trend["name"] for trend in world_trends()[0]['trends'] ]  # call the callable
for trend in world_trends()[0]['trends']:  # call the callable
    print trend["name"]

# ex.4
search_results = []
for page in range(1, 6):
    search_results.append(twitter_api.search(q="Dennis Ritchie", rpp=20, page=page))
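The paged results come back as dictionaries with a "results" key; a minimal sketch of flattening them into a list of tweet texts, using hypothetical sample data in place of the live API responses collected in ex.4:

```python
# Flatten paged search results into a single list of tweet texts.
# The nested structure below is hypothetical sample data standing in
# for real Twitter search responses.
search_results = [
    {"results": [{"text": "RT @SocialWebMining Justin Bieber is on SNL 2nite."},
                 {"text": "Dennis Ritchie, creator of C, has died."}]},
    {"results": [{"text": "Justin Bieber is on SNL 2nite. (via @SocialWebMining)"}]},
]

tweets = [tweet["text"] for page in search_results for tweet in page["results"]]
print(len(tweets))  # → 3
```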


Frequency Analysis and Lexical Diversity(1/5)

✤ Lexical diversity

✤ One of the most intuitive measurements that can be applied to unstructured text

✤ Expressed as the number of unique tokens in the text divided by the total number of tokens

✤ Each tweet carries about 20 percent unique information

>>> words = []
>>> for t in tweets:
...     words += [ w for w in t.split() ]
>>> len(words) # total words
7238
>>> len(set(words)) # unique words
1636
>>> 1.0*len(set(words))/len(words) # lexical diversity
0.22602928985907708
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets) # avg words per tweet
14.476000000000001
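The interpreter session above can be folded into a small reusable helper; a minimal sketch, using a toy tweet list rather than the real SNL data:

```python
def lexical_diversity(tweets):
    # Ratio of unique tokens to total tokens across a list of tweets.
    words = [w for t in tweets for w in t.split()]
    return 1.0 * len(set(words)) / len(words)

# Hypothetical two-tweet corpus: 10 tokens total, 8 of them unique.
tweets = ["justin bieber is on snl tonight", "gonna watch snl tonight"]
print(lexical_diversity(tweets))  # → 0.8
```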


Frequency Analysis and Lexical Diversity(2/5)

✤ Frequency analysis: use NLTK's FreqDist or collections.Counter
✤ Very simple, powerful tool

✤ Frequent tokens refer to entities such as people, times, activities
✤ Infrequent terms amount to mostly noise

>>> import nltk
>>> import cPickle
>>> words = cPickle.load(open("myData.pickle"))
>>> freq_dist = nltk.FreqDist(words)
>>> freq_dist.keys()[:50] # 50 most frequent tokens
[u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber',
 u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you',
 u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:',
 u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)',
 u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that',
 u'what', u'but', u'tonight!', u':d', u'2', u'will']

>>> freq_dist.keys()[-50:] # 50 least frequent tokens
[u'what?!', u'whens', u'where', u'while', u'white', u'whoever', u'whoooo!!!!',
 u'whose', u'wiating', u'wii', u'wiig', u'win...', u'wink.', u'wknd.', u'wohh',
 u'won', u'wonder', u'wondering', u'wootwoot!', u'worked', u'worth', u'xo.',
 u'xx', u'ya', u'ya<3miranda', u'yay', u'yay!', u'ya\u2665', u'yea', u'yea.',
 u'yeaa', u'yeah!', u'yeah.', u'yeahhh.', u'yes,', u'yes;)', u'yess', u'yess,',
 u'you!!!!!', u"you'll", u'you+snl=', u'you,', u'youll', u'youtube??',
 u'youu<3', u'youuuuu', u'yum', u'yumyum', u'~', u'\xac\xac']
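As the slide notes, the standard library's collections.Counter gives the same frequency counts without installing NLTK; a minimal sketch over a toy word list:

```python
from collections import Counter

# Toy word list standing in for the pickled tweet words above.
words = ["snl", "on", "rt", "snl", "watch", "snl", "on"]

freq_dist = Counter(words)
print(freq_dist.most_common(2))  # → [('snl', 3), ('on', 2)]
```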


Frequency Analysis and Lexical Diversity(3/5)

✤ Extracting relationships from the tweets
  ✤ The social web is foremost the linkages between people
  ✤ One highly convenient format for storing social web data is a graph
✤ Using regular expressions to find retweets
  ✤ RT followed by a username
  ✤ via followed by a username

>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
>>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
...                   "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
>>> for t in example_tweets:
...     rt_patterns.findall(t)
...
[('RT', ' @SocialWebMining')]
[('via', ' @SocialWebMining')]


Frequency Analysis and Lexical Diversity(4/5)

>>> import networkx as nx
>>> import re
>>> g = nx.DiGraph()
>>>
>>> all_tweets = [ tweet
...                for page in search_results
...                for tweet in page["results"] ]
>>> def get_rt_sources(tweet):
...     rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
...     return [ source.strip()
...              for tuple in rt_patterns.findall(tweet)
...              for source in tuple
...              if source not in ("RT", "via") ]
>>> for tweet in all_tweets:
...     rt_sources = get_rt_sources(tweet["text"])
...     if not rt_sources: continue
...     for rt_source in rt_sources:
...         g.add_edge(rt_source, tweet["from_user"], {"tweet_id" : tweet["id"]})
>>> g.number_of_nodes()
160
>>> g.number_of_edges()
125
>>> g.edges(data=True)[0]
(u'@ericastolte', u'bonitasworld', {'tweet_id': 11965974697L})
>>> len(nx.connected_components(g.to_undirected()))
37
>>> sorted(nx.degree(g))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 9, 37]


Frequency Analysis and Lexical Diversity(5/5)

✤ Analysis
  ✤ 500 tweets
  ✤ 160: number of nodes, i.e. users involved in retweet relationships with one another
  ✤ 125: number of edges connecting them
  ✤ 1.28 (160/125): some nodes are connected to more than one node
  ✤ 37: the graph consists of 37 subgraphs and is not fully connected
  ✤ The output of degree shows how well connected the individual nodes are


Visualizing Tweet Graphs(1/3)

✤ DOT language
  ✤ Text graph description language
  ✤ Supports a simple way of describing graphs that both humans and computer programs can use
✤ Graphviz
  ✤ install from source: http://www.graphviz.org/
✤ pygraphviz
  ✤ easy_install pygraphviz
  ✤ setup.py: set library_path and include_path


Visualizing Tweet Graphs(2/3)

✤ Generating DOT language output

✤ Output

OUT = "snl_search_results.dot"
try:
    nx.drawing.write_dot(g, OUT)
except ImportError, e:
    # Help for Windows users:
    # Not a general-purpose method, but representative of
    # the same output write_dot would provide for this graph
    # if installed and easy to implement
    dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id']) \
           for n1, n2 in g.edges()]
    f = open(OUT, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
    f.close()

strict digraph {
"@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
"@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
"@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
"@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
}


Visualizing Tweet Graphs(3/3)

✤ Convert
  ✤ $ circo -Tpng -Osnl_search_results snl_search_results.dot


Closing Remarks

✤ Illustrated how easy it is to use Python’s interactive interpreter to explore and visualize Twitter data
✤ Feel comfortable with your Python development environment
✤ Spend some time with the Twitter APIs and Graphviz

✤ Canviz project
  ✤ Draws Graphviz graphs on a web browser <canvas> element
