Sidi chang demo

Sidi Chang, Insight Data Science Data Engineering Fellow
Jul 2016
JustBid

Sealed/blind second-price auction
(Diagram labels: Item, Bidder)
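
As a rough illustration of the auction format named on this slide: in a sealed second-price auction the highest bidder wins but pays the second-highest bid. The function name, the bidder names, and the tie-break rule below are hypothetical and are not taken from JustBid.

```python
from typing import Dict, Tuple


def second_price_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winner, price) for one sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest bid.
    Ties are broken by bidder name, which is an arbitrary choice made here.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    # Sort by bid (descending), then by name for a deterministic tie-break.
    ranked = sorted(bids.items(), key=lambda kv: (-kv[1], kv[0]))
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical bids on a single item.
result = second_price_auction({"alice": 120.0, "bob": 95.0, "carol": 110.0})
print(result)  # ('alice', 110.0)
```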

Data Pipeline
(Diagram label: Simulated Data)

Data
• 10K bidders
• Nearly 15 million bids

Recommendation—Jaccard Similarity
Jaccard Similarity:
D_i = user_i
C_i = items(user_i)
J(D_i, D_j) = |C_i ∩ C_j| / |C_i ∪ C_j|
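
A minimal sketch of the measure on two users' item sets, using the standard definition (size of the intersection divided by size of the union). The item IDs and the empty-set convention are assumptions made for illustration.

```python
from typing import Set


def jaccard(items_a: Set[int], items_b: Set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two item sets."""
    union = items_a | items_b
    if not union:
        # Convention chosen here: two empty sets have similarity 0.
        return 0.0
    return len(items_a & items_b) / len(union)


# Hypothetical item sets C_i and C_j for two users.
c_i = {1, 4}
c_j = {1, 3, 4}
print(jaccard(c_i, c_j))  # 2 shared items out of 3 distinct items
```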

Recommendation
For N = 10 million, it takes more than a year (AWS m4.large cluster)…
Then we will need to use the MinHash algorithm, which can be easily distributed…
Do an unbiased estimation by Chernoff bounds and Markov inequality: The expected error is
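
To see why the exact approach stops scaling, it helps to count the pairs involved. This sketch assumes the estimate above refers to comparing every pair of users; `pair_count` is a hypothetical helper, and the running time per pair is not modeled.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(pair_count(10_000))      # 49995000 pairs for 10K bidders
print(pair_count(10_000_000))  # 49999995000000 pairs for N = 10 million
```

The pair count grows quadratically, so going from 10K to 10 million users multiplies the work by roughly a million.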

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

Resulting signatures:

        U1  U2  U3  U4
Hash 1  1   3   0   1
Hash 2  0   2   0   0
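
The example can be reproduced with a few lines of single-machine Python. This is only a sketch of the technique, not the distributed implementation used in the project; the row sets are read off the table's User columns, and the variable and function names are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Rows (0-based) in which each user has a 1 in the example table.
user_rows: Dict[str, Set[int]] = {
    "U1": {0, 3},
    "U2": {2},
    "U3": {1, 3, 4},
    "U4": {0, 2, 3},
}

# The two hash functions from the example, applied to the row number x.
hash_funcs: List[Callable[[int], int]] = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signature(rows: Set[int], funcs: List[Callable[[int], int]]) -> List[int]:
    """For each hash function, keep the minimum hash over the user's rows."""
    return [min(h(r) for r in rows) for h in funcs]


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which two users agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


signatures = {u: minhash_signature(rows, hash_funcs) for u, rows in user_rows.items()}
for user, sig in signatures.items():
    print(user, sig)  # U1 [1, 0], U2 [3, 2], U3 [0, 0], U4 [1, 0]

# Estimate for U1 vs U4 from only two hash functions.
print(estimated_similarity(signatures["U1"], signatures["U4"]))  # 1.0
```

With only two hash functions the estimate is coarse: U1 and U4 get 1.0 although their exact Jaccard similarity is 2/3. More hash functions tighten the estimate.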

Performance

Challenges
• Implementing the MinHash algorithm in a distributed system
• Testing Jaccard similarity in a distributed system
• Using the right data structures for faster computation
• Using both Scala and Python

About me
• MS in CS and Operations Research