Sunday, August 29, 2010
I'm Ted Han.
Nice to meet you!
If you would like a copy of these slides, they are here:
http://cl.ly/6233b0f56bb686e57b74
(or at http://twitter.com/knowtheory)
Labor Rights
• Eight Hours for Work
• Eight Hours for Rest
• Eight Hours for What We Will!
[Chart: a 24-hour day split into Work 8, Rest 8, What We Will 8]
This may not be a pattern that hackers are all that familiar with.
We trade our time and expertise for money for 8+ hours a day at work.
But now the 8 hours of our free time are just as valuable
to companies as our work time.
Who collects your data?
Do you know what data they collect?
What do you get in return?
What do you get for your Data?
• Google: Gmail, Search
• Apple: iTunes Genius
• Amazon: Recommendations
• Last.fm: Recommendations & Neighbors
• Facebook: ??? (Your friends’ families’ crazy rants)
Companies benefit from our data and can ask and answer questions about our behavior.
We benefit indirectly, but why can’t we benefit
directly as well?
We can, if we know where and how to look.
Ruby can help!
Basic Data Mining
• Data Collection
• Data Querying & Manipulation
• Data Analysis
DataMapper will help with these things!
It would be nice to analyze our search histories, but...
Google doesn’t provide an API.
But, we can search our Google Chrome histories!
~/Library/Application Support/Google/Chrome/Default/History
(make a copy of your History. sqlite3 dbs are easy to corrupt)
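Making that copy can itself be done from Ruby. The following is a minimal sketch using only the standard library; the helper name `copy_history` and the choice of the system temp directory as the destination are assumptions for illustration, not something from the slides.

```ruby
require "fileutils"
require "tmpdir"

# Copy a browser History file into a scratch directory and return the
# path of the copy. The original file is only read, never written.
# `dest_dir` defaults to the system temp directory (an assumption).
def copy_history(source, dest_dir = Dir.tmpdir)
  dest = File.join(dest_dir, "History.copy")
  FileUtils.cp(source, dest)
  dest
end

# Usage, with the path from the slide (Mac OS X):
#   source = File.expand_path(
#     "~/Library/Application Support/Google/Chrome/Default/History")
#   working_copy = copy_history(source)
```

All later queries can then point at the working copy, so a mistake cannot damage the browser's own file.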
Once we have a data source, we need to answer yes to
at least one of three questions about the format of our source.
• Does a DataMapper Adapter already exist?
• Can you write an adapter?
• Can you write a scraper to import your data?
Does a DataMapper Adapter already exist?
Yep! Google Chrome’s History is an sqlite3 database!
Urls Table
CREATE TABLE urls(
  id INTEGER PRIMARY KEY,
  url LONGVARCHAR,
  title LONGVARCHAR,
  visit_count INTEGER DEFAULT 0 NOT NULL,
  typed_count INTEGER DEFAULT 0 NOT NULL,
  last_visit_time INTEGER NOT NULL,
  hidden INTEGER DEFAULT 0 NOT NULL,
  favicon_id INTEGER DEFAULT 0 NOT NULL
);
Querying requires us to map data out of our source. To do this we have to tell DataMapper what the source schema is.
Url model (naive)
class Url
  include DataMapper::Resource

  property :id, Serial # Integer, :key=>true
  property :url, String
  property :title, String
  property :visit_count, Integer, :default => 0
  property :typed_count, Integer, :default => 0
  property :last_visit_time, Integer, :required => true
  property :hidden, Integer, :default => 0
  property :favicon_id, Integer, :default => 0

  has n, :segments
  has n, :visits, :through => :segments
end
Inline Validations: the :default and :required options in the Url model (naive).
Database Constraints: the DEFAULT and NOT NULL clauses in the Urls Table.
Sanity Check
The Schemata Match! Now let's test.
>> Url.first(:url => "http://rubykaigi.org/")
=> #<Url @id=1294 @url="http://rubykaigi.org/" @title="RubyKaigi 2010, August 27-29" @visit_count=8 ... >
>> Url.count
=> 47007
>> Url.count("visit_count.lt" => 1)
=> 20
>> # wat.
Url model (w/ Sanity)
class Url
  include DataMapper::Resource

  property :id, Serial
  property :url, String, :format => :url
  property :title, String
  property :visit_count, Integer, :min => 1
  property :typed_count, Integer, :default => 0
  property :last_visit_time, Integer, :required => true
  property :hidden, Integer, :default => 0
  property :favicon_id, Integer, :default => 0

  has n, :segments
  has n, :visits, :through => :segments
end
Let's add some business rule validations.
Data Manipulation
require 'dm-types'

class Url
  include DataMapper::Resource

  property :id, Serial
  property :url, URI, :format => :url
  property :title, String
  property :visit_count, Integer, :min => 1
  property :typed_count, Integer, :default => 0
  property :last_visit_time, Integer, :required => true
  property :hidden, Integer, :default => 0
  property :favicon_id, Integer, :default => 0

  has n, :segments
  has n, :visits, :through => :segments
end
Data Manipulation
>> u = Url.first("url.like" => "%rubykaigi%")
=> #<Url @id=1294 @url=#<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/ @title="RubyKaigi 2010, August 27-29" @last_visit_time=12927095498867853 ...>
>> u.url
=> #<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/>
>> u.url.host
=> "rubykaigi.com" # oops, .org is canonical
>> u.url.host = "rubykaigi.org"; u.url
=> #<Addressable::URI:0x81ccfdf4 URI:http://rubykaigi.org/>
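The same kind of manipulation can be tried without a database. This sketch uses Ruby's standard-library URI as a stand-in for the Addressable::URI objects shown in the session above; the two libraries are different classes, so treat it as an illustration of the idea only.

```ruby
require "uri"

# Parse a stored string into a URI object so its parts can be
# inspected and changed individually.
url = URI.parse("http://rubykaigi.com/")
original_host = url.host

# Rewrite only the host, then serialize back to a string.
url.host = "rubykaigi.org"
canonical = url.to_s
```

Working on the parsed object avoids string surgery on the whole URL when only one component needs fixing.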
Data Manipulation
>> u = Url.first("url.like" => "%rubykaigi%")
=> #<Url @id=1294 @url=#<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/ @title="RubyKaigi 2010, August 27-29" @last_visit_time=12927095498867853 ...>
>> u.last_visit_time
=> 12927095498867853 # wtf is this?
Urls Table (same schema as before; the column in question):
  last_visit_time INTEGER NOT NULL
Not a lot of clues here... Okay, it’s an integer time, but it’s also freaking huge: 12927095498867853?
chromium/src/base/time.h
// Time represents an absolute point in time, internally represented as
// microseconds (s/1,000,000) since a platform-dependent epoch. Each
// platform's epoch, along with other system-dependent clock interface
// routines, is defined in time_PLATFORM.cc.
chromium/src/base/time_mac.cc
// Core Foundation uses a double second count since 2001-01-01 00:00:00 UTC.
// The UNIX epoch is 1970-01-01 00:00:00 UTC.
// Windows uses a Gregorian epoch of 1601. We need to match this internally
// so that our time representations match across all platforms. See bug 14734.
// irb(main):010:0> Time.at(0).getutc()
// => Thu Jan 01 00:00:00 UTC 1970
// irb(main):011:0> Time.at(-11644473600).getutc()
// => Mon Jan 01 00:00:00 UTC 1601
Examples already in Ruby? Nice.
Url model v2 (lib types)
class Url
  include DataMapper::Resource

  property :id, Serial
  property :url, URI, :format => :url
  property :title, String
  property :visit_count, Integer, :min => 1
  property :typed_count, Integer, :default => 0
  property :last_visit_time, ChromeEpochTime, :required => true
  property :hidden, Integer, :default => 0
  property :favicon_id, Integer, :default => 0

  has n, :segments
  has n, :visits, :through => :segments
end
write ChromeEpochTime
chrome_epoch_time.rb
module DataMapper
  class Property
    class ChromeEpochTime < Integer
      def load(value)
        return value unless value.respond_to?(:to_i)
        ::Time.at((value/10**6)-11644473600)
      end

      def dump(value)
        case value
        when ::Integer, ::Time then (value.to_i + 11644473600) * 10**6
        when ::DateTime then (value.to_time.to_i + 11644473600) * 10**6
        end
      end
    end # class ChromeEpochTime
  end # class Property
end # module DataMapper
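The arithmetic inside load and dump can be checked on its own, without DataMapper. The sketch below is a standalone version; the method and constant names are made up for illustration, and unlike the slide's load it keeps the microsecond remainder instead of discarding it.

```ruby
# Seconds between 1601-01-01 00:00:00 UTC and the UNIX epoch,
# the same offset quoted in time_mac.cc.
WINDOWS_TO_UNIX_OFFSET = 11_644_473_600
MICROSECONDS = 10**6

# Chrome stores microseconds since 1601; Ruby's Time.at wants
# seconds since 1970, plus an optional microsecond part.
def chrome_to_time(chrome_usec)
  seconds, usec = chrome_usec.divmod(MICROSECONDS)
  Time.at(seconds - WINDOWS_TO_UNIX_OFFSET, usec)
end

# The inverse: a Ruby Time back to Chrome's integer representation.
def time_to_chrome(time)
  (time.to_i + WINDOWS_TO_UNIX_OFFSET) * MICROSECONDS + time.usec
end

t = chrome_to_time(12_927_095_498_867_853)
t.getutc  # 2010-08-24 03:51:38 UTC, which is 12:51:38 at +0900
```

Worked by hand: 12927095498867853 / 10^6 gives 12927095498 seconds since 1601, and subtracting 11644473600 leaves 1282621898 seconds since 1970.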
Data Manipulation
>> u = Url.first("url.like" => "%rubykaigi.com%")
=> #<Url @id=42846 @url=#<Addressable::URI:0x81e232f0 URI:http://rubykaigi.com/ @title="RubyKaigi 2010, August 27-29" @last_visit_time=Tue Aug 24 12:51:38 +0900 2010 ...>
>> u.last_visit_time
=> Tue Aug 24 12:51:38 +0900 2010
Histograms, yay! (Analysis)
hour_histogram = Hash.new(0)
Visit.all.each do |v|
  hour_histogram[v.visit_time.hour] += 1
end
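The counting step needs nothing beyond a Hash with a default of zero. Here is a self-contained sketch of it; the Struct named `Visit` is a stand-in for the real model (an assumption, since the Visit model is not shown in the slides), and the `render` helper for a text chart is hypothetical.

```ruby
# Stand-in for the Visit model: anything responding to #visit_time works.
Visit = Struct.new(:visit_time)

# Count visits per hour of day (0..23).
def hour_histogram(visits)
  histogram = Hash.new(0)
  visits.each { |v| histogram[v.visit_time.hour] += 1 }
  histogram
end

# Render one text line per hour, scaling the busiest hour to `width` marks.
# Expects a histogram built with a default of 0, as above.
def render(histogram, width = 40)
  peak = histogram.values.max || 1
  (0..23).map do |hour|
    bar = "#" * (histogram[hour] * width / peak)
    format("%02d:00 %s %d", hour, bar, histogram[hour])
  end
end

# Usage:
#   puts render(hour_histogram(visits))
```

Hash.new(0) is what lets `+= 1` work for an hour that has not been seen yet.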
Over what span of time?
>> Visit.first.visit_time
=> Fri May 28 17:04:39 +0900 2010
>> Visit.last.visit_time
=> Thu Aug 26 01:51:32 +0900 2010
[Figure: Aggregate Browsing by Hour. Visit counts (axis 0 to 8000) by hour of day, Midnight through 9pm.]
More Histograms, yay!
ruby_doc = Url.all("url.like" => "%ruby-doc%")
hour_histogram = Hash.new(0)
ruby_doc.visits.each do |v|
  hour_histogram[v.visit_time.hour] += 1
end
[Figure: Aggregate Browsing for ruby-doc.org by Hour. Visit counts (axis 0 to 50) by hour of day, Midnight through 9pm.]
But what happens when we have a data source
which isn’t well behaved?
"Does Edge have an anti-PS3 bias?"
http://arstechnica.com/civis/viewtopic.php?f=22&t=62024
Last year, a thread on Ars Technica titled "Does Edge have an anti-PS3 bias?" turned into a flame war between PS3 fans and Xbox 360 fans over whether the PS3 was receiving unfair treatment, judged in particular against each game's score on metacritic.com.
Helpfully, the thread title is a testable hypothesis
Are a review outlet’s aggregate game scores (dis)similar to the aggregate
Metascore for those same games?
Unfortunately, Metacritic also has no API.
Time for the Poor Man’s API: HTML scraping :(
Save me Nokogiri!
Yeah, that’s not pretty.
require 'open-uri' # lets open() fetch URLs (URI.open on Ruby 3+)
require 'nokogiri'

def scores_for(game)
  # Accept either a URL string or an already-parsed Nokogiri document.
  game_page = case
    when (game.is_a? String)
      begin
        Nokogiri::HTML(open(game))
      rescue
        puts "[FAIL] Failed to open #{game}"
        return # was `break`, which is not valid outside a loop
      end
    when (game.is_a? Nokogiri::HTML::Document)
      game
    else
      raise StandardError, "you need to provide either a url, or a nokogiri document"
  end

  # The page title carries the game title, platform and year.
  page_title = game_page.css('title').text
  junk, title, platform, year = page_title.match(/^(.+)\s*\((#{PLATFORMS.join("|")}): (\d+)\): Reviews$/).to_a
  title.strip!
  metascore = game_page.css('table#scoretable img').select{ |i| /Metascore:/ =~ i.attributes['alt'] }.first.attributes['alt'].to_s.split.last
  puts "[WIN] #{title} on the #{platform} (#{year}) has a score of #{metascore}"

  #review_count = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last
  reviews = game_page.css('div.scoreandreview')
  review_count = reviews.size

  # Check the scraped review count against the count the page claims.
  checksum = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last.to_i
  checksum_message = "Number of Reviews on the page not equal to the claimed number of reviews"
  raise StandardError, checksum_message unless review_count == checksum

  scores = reviews.map do |review|
    score = review.css('div.criticscore').text
    pub = review.css('span.publication').text
    [score,pub]
  end

  return {
    :title => title.strip,
    :metascore => metascore,
    :platform => platform,
    :publish_year => year,
    :reviews => scores
  }
end
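The title-parsing step is the easiest part of the scraper to pull out and test by itself. This is a hedged sketch of that one step: the PLATFORMS list here is hypothetical (the slide's constant is not shown), the helper name `parse_page_title` is invented, and the pattern is a slightly stricter variant of the one above rather than the original.

```ruby
# Hypothetical platform list; the real PLATFORMS constant is not shown.
PLATFORMS = %w[ps3 xbox360 wii pc].freeze

# "<title> (<platform>: <year>): Reviews", anchored to the whole string.
# The lazy (.+?) leaves the space before the parenthesis out of the title.
TITLE_PATTERN = /\A(.+?)\s*\((#{PLATFORMS.join("|")}): (\d+)\): Reviews\z/

# Returns a hash of the parsed parts, or nil when the title does not
# look like a game review page.
def parse_page_title(page_title)
  match = TITLE_PATTERN.match(page_title)
  return nil unless match
  { :title => match[1], :platform => match[2], :publish_year => match[3].to_i }
end
```

Returning nil on a mismatch gives the caller a clear signal to skip the page, instead of failing later on a nil title.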
But it works! <3 Nokogiri
Models
class Game
  include DataMapper::Resource

  property :id, Serial
  property :title, String, :length=>255
  property :platform, String
  property :release_date, DateTime
  property :esrb_rating, String
  property :metascore, Float
  property :review_count, Integer
  property :created_at, DateTime
  property :updated_at, DateTime

  class Review
    include DataMapper::Resource

    property :game_id, Integer, :key => true
    property :review_publisher_id, Integer, :key => true
    property :score, Integer

    belongs_to :review_publisher
    belongs_to :game
  end

  class Developer
    include DataMapper::Resource

    property :id, Serial
    property :name, String, :length => 255

    has n, :games
  end
end

class ReviewPublisher
  include DataMapper::Resource

  property :id, Serial
  property :name, String, :length => 255

  has n, :reviews, :model => "Game::Review"
  has n, :games, :through => :reviews
end
Student’s T-Test (Analysis!)
def t_value(prop1, collection1, prop2, collection2)
  c1_std = collection1.std(prop1)
  c1_avg = collection1.avg(prop1)
  c1_count = collection1.count

  c2_std = collection2.std(prop2)
  c2_avg = collection2.avg(prop2)
  c2_count = collection2.count

  (c1_avg - c2_avg) / Math.sqrt( (c1_std**2 / c1_count)+(c2_std**2 / c2_count))
end
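The formula in t_value divides the difference of the two means by the square root of the summed variance-over-count terms. A plain-Ruby version over arrays makes it easy to check by hand. This is a sketch under assumptions: the helper names are invented, and it uses the sample variance (dividing by n - 1), whereas the slides do not say which definition the std aggregate uses.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample variance (n - 1 in the denominator); an assumption, see above.
def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

# t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)
# The sign follows argument order: positive when the first mean is larger.
def t_statistic(a, b)
  standard_error = Math.sqrt(sample_variance(a) / a.size +
                             sample_variance(b) / b.size)
  (mean(a) - mean(b)) / standard_error
end
```

For a = [1, 2, 3, 4, 5] and b = [2, 4, 6, 8, 10] the means are 3 and 6, the variances 2.5 and 10, so t = -3 / sqrt(2.5/5 + 10/5) = -3 / sqrt(2.5).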
PS3 Reviewers vs Metascore
outlets = ReviewPublisher.all("games.platform"=>"ps3")
t_scores = outlets.map do |outlet|
  t_value(:metascore, outlet.games(:platform=>"ps3"),
          :score, outlet.reviews("game.platform"=>"ps3"))
end # .size => 140

significant = t_scores.select do |t|
  (t > 1.96 or t < -1.96) and not t.infinite?
end

low = significant.select{ |s| s < -1.96} # .size => 20
high = significant.select{ |s| s > 1.96} # .size => 10
Xbox360 Reviewers vs Metascore
outlets = ReviewPublisher.all("games.platform"=>"xbox360")
t_scores = outlets.map do |outlet|
  t_value(:metascore, outlet.games(:platform=>"xbox360"),
          :score, outlet.reviews("game.platform"=>"xbox360"))
end # .size => 169

significant = t_scores.select do |t|
  (t > 1.96 or t < -1.96) and not t.infinite?
end

low = significant.select{ |s| s < -1.96} # .size => 37
high = significant.select{ |s| s > 1.96} # .size => 29
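The 1.96 cutoff is the two-tailed 5% critical value of the standard normal distribution, which the t distribution approaches for large samples. The selection logic used for both platforms can be sketched as one helper; the name `classify` and the returned hash layout are assumptions for illustration.

```ruby
CRITICAL = 1.96 # two-tailed 5% critical value (large-sample approximation)

# Split t scores into significantly low, significantly high, and the rest.
# Infinite and NaN scores (e.g. from a zero standard error) are discarded.
def classify(t_scores, critical = CRITICAL)
  usable = t_scores.select(&:finite?)
  {
    :low => usable.select { |t| t < -critical },
    :high => usable.select { |t| t > critical },
    :not_significant => usable.select { |t| t.abs <= critical }
  }
end
```

Filtering with finite? up front also drops NaN, which the comparison-based select in the slides only excludes implicitly.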
What about Edge Magazine?
>> outlet = ReviewPublisher.first("name.like"=>"%Edge%")
=> #<ReviewPublisher @id=36 @name="Edge Magazine">
>> t = t_value(:metascore, outlet.games(:platform=>"ps3"), :score, outlet.reviews("game.platform"=>"ps3"))
=> 5.10786212293491
>> t > 1.96
=> true # Edge has a PRO PS3 bias, not Anti!
There are lots of other possibilities!
What would you like to learn?
Learn about DataMapper perhaps?
http://www.datamapper.org
irc://irc.freenode.net#datamapper
Thanks! Thank you!
@knowtheory