Sidi Chang Insight Data Science Data Engineering Fellow
Jul 2016
JustBid
Sealed/blind second-price auction
Item
Bidder
• Demo
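A minimal sketch of the sealed second-price rule named on this slide: the highest bidder wins but pays the second-highest bid. The function name `second_price_winner` and the sample bidders are hypothetical, made up for illustration. They are not taken from the JustBid code.

```python
from typing import Dict, Tuple


def second_price_winner(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winning bidder, price paid) for a sealed second-price auction."""
    if len(bids) < 2:
        raise ValueError("need at least two bids to define a second price")
    # Rank bidders by bid amount, highest first.
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # the winner pays the runner-up's bid
    return winner, price


# Hypothetical sealed bids for a single item.
print(second_price_winner({"alice": 120.0, "bob": 95.0, "carol": 110.0}))
# ('alice', 110.0)
```

Ties are broken here by insertion order, since Python's sort is stable. A real auction would need an explicit tie-breaking rule.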
Data Pipeline
Simulated Data
Data
• 10K bidders
• Nearly 15 million bids
Recommendation—Jaccard Similarity
Jaccard Similarity:
D_i = user_i
C_i = items(user_i)
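With C_i defined as the set of items for user_i, the standard Jaccard similarity of two users is J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|. The sketch below computes it directly on Python sets. The item sets are made-up sample data, and returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(c_i: set, c_j: set) -> float:
    """Jaccard similarity |C_i ∩ C_j| / |C_i ∪ C_j| of two item sets."""
    union = c_i | c_j
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(c_i & c_j) / len(union)


# Hypothetical item sets for two users.
items_a = {1, 4}
items_b = {1, 3, 4}
print(jaccard(items_a, items_b))  # 2 shared items out of 3 distinct items
```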
Recommendation
For N = 10 million, it takes more than a year (AWS m4.large cluster)…
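To see why the exact approach does not scale, count the user pairs an all-pairs Jaccard comparison has to visit. This is only the pair count. It makes no claim about per-pair cost or about the cluster timing quoted above.

```python
from math import comb

n_users = 10_000_000
pairs = comb(n_users, 2)  # N * (N - 1) / 2 unordered user pairs
print(f"{pairs:,}")  # 49,999,995,000,000
```

That is roughly 5 × 10^13 set comparisons, each of which also has to touch both users' item lists.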
Then we will need to use the MinHash algorithm, which can be easily distributed…
Do an unbiased estimation using Chernoff bounds and Markov's inequality: the expected error is
MinHash Example
Item  Row  User 1  User 2  User 3  User 4  x+1 mod 5  3x+1 mod 5
1     0    1       0       0       1       1          1
2     1    0       0       1       0       2          4
3     2    0       1       0       1       3          2
4     3    1       0       1       1       4          0
5     4    0       0       1       0       0          3

        U1  U2  U3  U4
Hash 1  1   3   0   1
Hash 2  0   2   0   0
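The signature in the example can be reproduced with a short single-machine sketch. The matrix and the two hash functions, (x + 1) mod 5 and (3x + 1) mod 5, come from the slide. The function names are hypothetical, and this is not the author's distributed implementation.

```python
# Characteristic matrix from the example: rows 0..4 (items 1..5), columns U1..U4.
matrix = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
hash_funcs = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signatures(matrix, hash_funcs):
    """For each hash function and user, keep the minimum hash over rows holding a 1."""
    n_users = len(matrix[0])
    sig = [[float("inf")] * n_users for _ in hash_funcs]
    for row, bits in enumerate(matrix):
        hashed = [h(row) for h in hash_funcs]
        for user, bit in enumerate(bits):
            if bit:
                for k, value in enumerate(hashed):
                    sig[k][user] = min(sig[k][user], value)
    return sig


def estimated_similarity(sig, a, b):
    """Fraction of hash functions on which users a and b have the same minimum."""
    return sum(1 for row in sig if row[a] == row[b]) / len(sig)


signatures = minhash_signatures(matrix, hash_funcs)
print(signatures)  # [[1, 3, 0, 1], [0, 2, 0, 0]]
print(estimated_similarity(signatures, 0, 3))  # 1.0 for U1 vs U4
```

With only two hash functions the estimate is coarse: U1 and U4 get 1.0, while their exact Jaccard similarity from the matrix is 2/3. More hash functions tighten the estimate.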
Performance
Challenges
• MinHash algorithm implemented in a distributed system
• Jaccard similarity tested in a distributed system
• Use the right data structures for faster computation
• Use both Scala and Python
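One reason MinHash distributes well is that a signature is an element-wise minimum, and min is associative and commutative. Partial signatures computed on separate partitions can therefore be merged in any order. The sketch below simulates this in plain Python with the two hash functions from the example. The partitioning and the function names are hypothetical, and no particular distributed framework is assumed.

```python
from functools import reduce

# The two hash functions from the MinHash example.
HASHES = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]


def partial_signature(rows):
    """Signature of one user computed from a subset of that user's item rows."""
    return [min((h(r) for r in rows), default=float("inf")) for h in HASHES]


def merge(sig_a, sig_b):
    """Combine two partial signatures with an element-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


# User 3's item rows (1, 3, 4) split across two hypothetical partitions.
partitions = [[1], [3, 4]]
merged = reduce(merge, (partial_signature(p) for p in partitions))
print(merged)  # [0, 0], the U3 column of the signature
```

The merged result equals the signature computed over all of the user's rows at once, which is what makes a per-user reduce step safe.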
About me
• MS in CS and Operations Research