Sidi Chang Demo


Sidi Chang

Insight Data Science, Data Engineering Fellow

Jul 2016

JustBid

Sealed/blind second-price auction

Item

Bidder

• Demo
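As a rough illustration of the sealed second-price rule named above, the sketch below picks a winner from a set of hidden bids. The function name `second_price_winner`, the bidder names, and the single-bidder fallback are assumptions for illustration only, not the JustBid implementation.

```python
from typing import Dict, Optional, Tuple


def second_price_winner(bids: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (winner, price) for a sealed-bid second-price auction.

    The highest bidder wins but pays the second-highest bid.
    With a single bidder the price falls back to that bidder's own bid
    (an assumption; a real system might apply a reserve price instead).
    """
    if not bids:
        return None
    # Rank bidders by amount, highest first; ties are broken by bidder name.
    ranked = sorted(bids.items(), key=lambda kv: (-kv[1], kv[0]))
    winner, top_bid = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else top_bid
    return winner, price


# Hypothetical bids on one item; no bidder sees the others' amounts.
result = second_price_winner({"alice": 120.0, "bob": 95.0, "carol": 110.0})
print(result)
```

Here the highest bidder wins and pays the runner-up's amount rather than their own bid.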

Data Pipeline

Simulated Data

Data

• 10K bidders

• Nearly 15 million bids

Recommendation—Jaccard Similarity

Jaccard Similarity:

D_i = user_i

C_i = items(user_i)

J(D_i, D_j) = |C_i ∩ C_j| / |C_i ∪ C_j|
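A minimal sketch of the Jaccard similarity between two users, treating each user as the set of items they bid on. The item names and the `items` dictionary are made up for illustration; returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two item sets."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical item sets C_i = items(user_i).
items: Dict[str, Set[str]] = {
    "user_1": {"watch", "camera", "bike"},
    "user_2": {"camera", "bike", "lamp"},
}

similarity = jaccard(items["user_1"], items["user_2"])
print(similarity)  # 2 shared items out of 4 distinct items -> 0.5
```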

Recommendation

For N = 10 million, it takes more than a year (AWS m4.large cluster)…

Then we will need to use the MinHash algorithm, which can be easily distributed…
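The scaling problem comes from comparing every pair. The sketch below only counts pairs, assuming N is the number of users compared against each other; it makes no claim about the cluster throughput behind the runtime figure above.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 10K matches the simulated bidder count; 10 million is the N quoted above.
for n_users in (10_000, 10_000_000):
    print(f"{n_users:>12,} users -> {pair_count(n_users):,} pairs")
```

Growing N by a factor of 1,000 grows the number of exact comparisons by roughly a factor of 1,000,000.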

Do an unbiased estimation using Chernoff bounds and Markov's inequality: the expected error is

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

MinHash signatures (x = row number)

        U1  U2  U3  U4
Hash 1  1   3   0   1
Hash 2  0   2   0   0
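The signature values in the example can be reproduced with a short single-machine sketch: for each user and each hash function, take the minimum hash value over the rows where that user has a 1. The variable and function names are illustrative, and this is not the distributed implementation from the project.

```python
from typing import Callable, Dict, List, Set

# Rows (0-4) in which each user has a 1, copied from the example table.
user_rows: Dict[str, Set[int]] = {
    "U1": {0, 3},
    "U2": {2},
    "U3": {1, 3, 4},
    "U4": {0, 2, 3},
}

# The two hash functions from the example, applied to the row number x.
hash_funcs: List[Callable[[int], int]] = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signature(rows: Set[int], funcs: List[Callable[[int], int]]) -> List[int]:
    """One minimum per hash function, taken over the user's non-empty row set."""
    return [min(h(r) for r in rows) for h in funcs]


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which two users agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


signatures = {u: minhash_signature(rows, hash_funcs) for u, rows in user_rows.items()}
for user, sig in signatures.items():
    print(user, sig)

print("U1 vs U4 estimate:", estimated_similarity(signatures["U1"], signatures["U4"]))
print("U1 vs U4 exact:   ", jaccard(user_rows["U1"], user_rows["U4"]))
```

With only two hash functions the estimate is coarse: U1 and U4 get identical signatures (estimate 1.0) although their exact Jaccard similarity is 2/3. More hash functions tighten the estimate.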

Performance

Challenges

• Implementing the MinHash algorithm in a distributed system

• Testing Jaccard similarity in a distributed system

• Using the right data structures for faster computation

• Using both Scala and Python

About me

• MS in CS and Operations Research