Sidi chang demo

Sidi Chang, Insight Data Science Data Engineering Fellow, Jul 2016, JustBid

Transcript of Sidi chang demo

Page 1: Sidi chang demo

Sidi Chang Insight Data Science Data Engineering Fellow

Jul 2016

JustBid

Page 2: Sidi chang demo

Sealed/blind second-price auction

Item

Bidder
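
As a rough illustration of the sealed second-price rule named on this slide, the sketch below picks the highest bidder and charges the second-highest bid. The function name, the bidder names, and the single-bid fallback are assumptions made for the example, not details taken from JustBid.

```python
from typing import Dict, Optional, Tuple


def second_price_auction(bids: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (winner, price) for one item in a sealed second-price auction.

    The highest bidder wins and pays the second-highest bid.
    Assumption of this sketch: with a single bid, the winner pays their own bid.
    """
    if not bids:
        return None
    # Rank bidders from highest to lowest bid; ties keep insertion order.
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price


# Hypothetical sealed bids for one item.
result = second_price_auction({"alice": 120.0, "bob": 95.0, "carol": 110.0})
print(result)
```

Because the price is set by the runner-up, a bidder gains nothing by bidding below their true valuation, which is the usual motivation for this auction format.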

Page 3: Sidi chang demo

• Demo

Page 4: Sidi chang demo

Data Pipeline

Simulated

Data

Page 5: Sidi chang demo

Data

• 10K bidders

• Nearly 15 million bids

Page 6: Sidi chang demo

Recommendation—Jaccard Similarity

Jaccard Similarity:

D_i = user_i

C_i = items(user_i)

J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|
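
A minimal sketch of the Jaccard similarity between two users' item sets, following the C_i = items(user_i) notation above. The user names and item names are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


# Hypothetical items each user has bid on.
items: Dict[str, Set[str]] = {
    "user_1": {"watch", "camera", "bike"},
    "user_2": {"camera", "bike", "lamp"},
}

similarity = jaccard(items["user_1"], items["user_2"])
print(similarity)  # 2 shared items out of 4 distinct items
```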

Page 7: Sidi chang demo

Recommendation

For N = 10 million, it takes more than a year (AWS m4.large cluster)…

Then we will need to use the MinHash algorithm, which can be easily distributed…
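
To see why exact all-pairs comparison becomes impractical, the sketch below counts the distinct pairs to compare. It assumes N is the number of users whose item sets are compared pairwise; the running time on any particular cluster is not modeled here.

```python
def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


small = pair_count(10_000)        # the 10K simulated bidders
large = pair_count(10_000_000)    # N = 10 million

print(small)
print(large)
```

The count grows quadratically, so going from 10 thousand to 10 million users multiplies the work by roughly a million.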

Do an unbiased estimation using Chernoff bounds and Markov's inequality: the expected error is

Page 8: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1
Hash 2

Page 11: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1  1
Hash 2

Page 12: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1  1   3   0   1
Hash 2  0   2   0   0
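
The signature values on this slide can be reproduced with a short sketch: for each user, take the minimum hash value over the rows where that user has a 1. The variable and function names are illustrative assumptions; the two hash functions and the row sets are read off the example table.

```python
from typing import Callable, Dict, List, Set

# Rows (0-4) in which each user has a 1, read off the example table.
user_rows: Dict[str, Set[int]] = {
    "U1": {0, 3},
    "U2": {2},
    "U3": {1, 3, 4},
    "U4": {0, 2, 3},
}

# The two hash functions from the slide, applied to the row number x.
hash_funcs: List[Callable[[int], int]] = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signature(rows: Set[int], funcs: List[Callable[[int], int]]) -> List[int]:
    """For each hash function, keep the smallest hash over the user's rows."""
    return [min(h(r) for r in rows) for h in funcs]


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions where two users agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


signatures = {u: minhash_signature(rows, hash_funcs) for u, rows in user_rows.items()}
for user, sig in signatures.items():
    print(user, sig)
```

With only two hash functions the estimate is coarse: U1 and U4 get identical signatures even though their exact Jaccard similarity is 2/3. Using more hash functions tightens the estimate.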

Page 13: Sidi chang demo

Performance

Page 14: Sidi chang demo

Challenges

• Implementing the MinHash algorithm in a distributed system

• Testing Jaccard similarity in a distributed system

• Using the right data structures for faster computation

• Using both Scala and Python

Page 15: Sidi chang demo

About me

• MS in CS and Operations Research