Sam zhang demo

Improvement for StackOverflow.com Chentao Zhang Insight Data Engineering SV

Transcript of Sam zhang demo

Page 1: Sam zhang demo

Improvement for StackOverflow.com

Chentao Zhang Insight Data Engineering SV

Page 2: Sam zhang demo

Motivations

Page 3: Sam zhang demo

Motivations

1. Is the question tagged correctly?
2. When will it be answered?

~Help users tag their questions on stackoverflow.com more accurately

DEMO:

www.stackoverflowtags.tech

Page 4: Sam zhang demo

Data Pipeline

Historical data (60 GB)

Streaming data

Page 5: Sam zhang demo

Input Data

Question: {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, "parent_id": 0, "title": "Java Exception", "body": "....", "tags": "java;algorithm", "user_id": 782, …}

Answer: {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, "parent_id": 67172, "title": "", "body": "You should....", "tags": "", "user_id": 1982, …}
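A minimal sketch of how the answer time for a question could be derived from the two records above. It assumes that post_date is laid out as month-day-year-hour-minute-second (the slides do not say so) and that an answer points to its question through parent_id. The function name answer_time_seconds is hypothetical.

```python
from datetime import datetime

# Assumed layout of post_date: month-day-year-hour-minute-second.
POST_DATE_FORMAT = "%m-%d-%Y-%H-%M-%S"

# Only the fields needed here are copied from the sample records.
question = {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, "parent_id": 0}
answer = {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, "parent_id": 67172}


def answer_time_seconds(question, answer):
    """Seconds between a question being posted and the given answer."""
    asked = datetime.strptime(question["post_date"], POST_DATE_FORMAT)
    answered = datetime.strptime(answer["post_date"], POST_DATE_FORMAT)
    return int((answered - asked).total_seconds())


print(answer_time_seconds(question, answer))  # 21
```

With the sample records the answer arrives 21 seconds after the question, which would count as answered within 10 minutes.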

Page 6: Sam zhang demo

Query #1

• Probability that a question labeled with a specific tag (such as tag A) will be answered within 10 minutes

= (number of questions tagged with A and answered within 10 minutes) / (total number of questions tagged with A)

Page 7: Sam zhang demo

question_id  tags   answer_time(sec)  posted_at
231          Java   3010              2016_01_02_21_20_01
290          spark  7381              2016_01_02_22_09_01
341          Java   5611              2016_01_10_01_02_05

Query #1

• Probability that a question labeled with a specific tag (such as tag A) will be answered within 10 minutes

= (number of questions tagged with A and answered within 10 minutes) / (total number of questions tagged with A)
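The ratio above can be sketched in a few lines of Python. This is only an illustration of the definition, not the pipeline used in the demo: the first three rows come from the table on this slide, the last two are made up so that the result is not zero, and the function name prob_answered_in_time is hypothetical. Unanswered questions are not modelled here.

```python
from collections import defaultdict

TEN_MINUTES = 600  # seconds

# (question_id, tag, answer_time in seconds)
rows = [
    (231, "Java", 3010),   # from the slide
    (290, "spark", 7381),  # from the slide
    (341, "Java", 5611),   # from the slide
    (900, "Java", 420),    # hypothetical row
    (901, "Java", 95),     # hypothetical row
]


def prob_answered_in_time(rows, limit=TEN_MINUTES):
    """Per tag: questions answered within `limit` / all questions with that tag."""
    total = defaultdict(int)
    in_time = defaultdict(int)
    for _question_id, tag, answer_time in rows:
        total[tag] += 1
        if answer_time <= limit:
            in_time[tag] += 1
    return {tag: in_time[tag] / total[tag] for tag in total}


print(prob_answered_in_time(rows))  # {'Java': 0.5, 'spark': 0.0}
```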

Page 9: Sam zhang demo

[Figure: a Request fans out to four Shards.]

Query #1

Page 10: Sam zhang demo

[Figure: a Request fans out to four Shards; each Shard returns its own result.]

Shard results (Prob.|sample size): 0.5|2000, 0.3|2000, 0.1|1660, 0.2|2000

Query #1

Page 11: Sam zhang demo

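[Figure: a Request fans out to four Shards; each Shard returns its own result.]

Shard results (Prob.|sample size): 0.5|2000, 0.3|2000, 0.1|1660, 0.2|2000

Combined result, weighted by sample size:

(1660*0.1 + 2000*0.2 + 2000*0.5 + 2000*0.3) / (2000*3 + 1660) ≈ 0.28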

Query #1
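The merge step on this slide is a sample-size-weighted average of the per-shard probabilities. A small sketch, with the hypothetical helper name combine_shard_results, reproduces the number on the slide:

```python
def combine_shard_results(results):
    """Sample-size-weighted average of (probability, sample_size) pairs."""
    total_samples = sum(size for _prob, size in results)
    if total_samples == 0:
        raise ValueError("no samples to combine")
    weighted_sum = sum(prob * size for prob, size in results)
    return weighted_sum / total_samples


# (Prob., sample size) as returned by the four shards on the slide.
shard_results = [(0.5, 2000), (0.3, 2000), (0.1, 1660), (0.2, 2000)]
print(round(combine_shard_results(shard_results), 2))  # 0.28
```

Weighting by sample size keeps a shard with fewer matching questions (1660 here) from counting as much as the larger ones.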

Page 12: Sam zhang demo

Query #2

• Recommend a list of tags that are similar to a specific tag, sorted by their similarity to that tag

Page 13: Sam zhang demo

spark/tags_graph: {"id": 261, "tags": ["kdc:lat", "java", "hadoop:java", "apache", "linux:OS"]}

* Vertex: a tag
* Weight of a vertex: number of people who have answered questions labeled with this tag
* Weight of an edge: number of people who have answered questions for both tags
* Similarity of A to B = W_AB / W_A

Query #2
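One way to read the definitions on this slide is sketched below. It assumes that the input is, per person, the set of tags that person has answered questions for; the sample data, the names answered_tags, build_tag_graph and similarity, and the in-memory Counter representation are all hypothetical and are not taken from the demo.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: person id -> tags that person has answered questions for.
answered_tags = {
    261: {"java", "hadoop", "apache"},
    262: {"java", "spark"},
    263: {"java", "hadoop"},
}


def build_tag_graph(answered_tags):
    """Count people per tag (vertex weight) and per tag pair (edge weight)."""
    vertex_weight = Counter()
    edge_weight = Counter()
    for tags in answered_tags.values():
        vertex_weight.update(tags)
        # Sort so that each undirected edge has a single canonical key.
        for a, b in combinations(sorted(tags), 2):
            edge_weight[(a, b)] += 1
    return vertex_weight, edge_weight


def similarity(a, b, vertex_weight, edge_weight):
    """Similarity of a to b = W_ab / W_a (a must be a known tag)."""
    return edge_weight[tuple(sorted((a, b)))] / vertex_weight[a]


vertex_weight, edge_weight = build_tag_graph(answered_tags)
print(similarity("hadoop", "java", vertex_weight, edge_weight))  # 1.0
print(similarity("java", "hadoop", vertex_weight, edge_weight))  # 0.666...
```

The measure is not symmetric: everyone who answered hadoop questions also answered java questions, but only two of the three java answerers touched hadoop.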

Page 14: Sam zhang demo

• User enters tag A

Query #2

[Figure: tag graph. Vertex weights: tag A = 7, tag B = 10, tag C = 5, tag D = 3. Edge weights: A-B = 5, A-C = 3, A-D = 2.]

Page 15: Sam zhang demo

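• User enters tag A

• Search all neighbors of tag A and compute their similarity to A.

Query #2

[Figure: tag graph. Vertex weights: tag A = 7, tag B = 10, tag C = 5, tag D = 3. Edge weights: A-B = 5, A-C = 3, A-D = 2.]

S_{C→A} = 3/5 = 0.6

S_{D→A} = 2/3 = 0.67

S_{B→A} = 5/10 = 0.5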

Page 16: Sam zhang demo

• User enters tag A

• Search all neighbors of tag A and compute their similarity to A.

• Sort B, C, D by their similarity to A

Query #2

[Figure: tag graph. Vertex weights: tag A = 7, tag B = 10, tag C = 5, tag D = 3. Edge weights: A-B = 5, A-C = 3, A-D = 2.]

S_{C→A} = 3/5 = 0.6

S_{D→A} = 2/3 = 0.67

S_{B→A} = 5/10 = 0.5

Page 17: Sam zhang demo

• User enters tag A

• Search all neighbors of tag A and compute their similarity to A.

• Sort B, C, D by their similarity to A

• Return the result to the user

Query #2

[Figure: tag graph. Vertex weights: tag A = 7, tag B = 10, tag C = 5, tag D = 3. Edge weights: A-B = 5, A-C = 3, A-D = 2.]

S_{C→A} = 3/5 = 0.6

S_{D→A} = 2/3 = 0.67

S_{B→A} = 5/10 = 0.5
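The four steps of Query #2 can be sketched as follows. The weights are read off the example graph, and their assignment to vertices and edges is inferred from the three similarity values, so treat it as an assumption. The names neighbors and recommend are hypothetical, and the plain dictionaries stand in for whatever graph store the demo actually used.

```python
# Weights from the example graph (assignment inferred from the similarity values).
vertex_weight = {"A": 7, "B": 10, "C": 5, "D": 3}
edge_weight = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 2}


def neighbors(tag, edge_weight):
    """Yield (neighbor, edge weight) for every edge touching `tag`."""
    for (u, v), weight in edge_weight.items():
        if u == tag:
            yield v, weight
        elif v == tag:
            yield u, weight


def recommend(tag, vertex_weight, edge_weight):
    """Neighbors of `tag`, most similar first, scored as W_edge / W_neighbor."""
    scored = [(n, w / vertex_weight[n]) for n, w in neighbors(tag, edge_weight)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, score in recommend("A", vertex_weight, edge_weight):
    print(name, round(score, 2))
# D 0.67
# C 0.6
# B 0.5
```

For tag A the recommended order is therefore D, C, B.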

Page 18: Sam zhang demo

Challenges and Future Considerations

• Stream processing to keep the information up to date

• Processing big data

• Scaling up the performance of sorting in the graph

Page 19: Sam zhang demo

About Me

• Chentao (Sam) Zhang

• MS in Electrical & Computer Engineering from University of Delaware

• Passionate about learning and trying new things

Page 20: Sam zhang demo

Query #1

tags   Prob. of being answered in 10 mins  Avg time(sec)
Java   0.32                                1200
spark  0.013                               31000

Page 22: Sam zhang demo

Challenges

~Process big data

~Performance of updating batches of data

Page 23: Sam zhang demo

tags   Prob. of being answered in 10 mins  Avg time(sec)
Java   0.32                                1200
spark  0.013                               31000

question_id  tags   answer_time(sec)  posted_at
231          Java   3010              2016_01_02_21_20_01
290          spark  7381              2016_01_02_22_09_01
341          Java   5611              2016_01_10_01_02_05

Query #1

Page 25: Sam zhang demo

Data Modelling

~Good at searching

~Graph engine

Index: spark

Type: tags_time

Type: tags_graph

Page 26: Sam zhang demo

Query #1

[Figure: Elasticsearch cluster with four Shards.]

Page 27: Sam zhang demo

spark/tags: {"tags": "kdc:latitude-longitude", "aws_in_time": 2985, "aws_no_in_time": 3023, "num_aws": 29795, "tol_time": 3234324, "num_no_aws": 796}

~Efficiency

~Data update

* Response time: the time from a question being posted to it getting its first answer
* Average response time = tol_time / num_aws
* Prob. of being answered in 10 mins = aws_in_time / (num_aws + num_no_aws)
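As a quick check of the two formulas, the sketch below applies them to the sample document on this slide. It uses the field names exactly as they appear in that document (the total-time field is spelled tol_time there); the function names average_response_time and prob_in_10_mins are hypothetical.

```python
# Sample document from the slide, copied field by field.
doc = {
    "tags": "kdc:latitude-longitude",
    "aws_in_time": 2985,
    "aws_no_in_time": 3023,
    "num_aws": 29795,
    "tol_time": 3234324,
    "num_no_aws": 796,
}


def average_response_time(doc):
    """Total response time divided by the number of answered questions."""
    return doc["tol_time"] / doc["num_aws"]


def prob_in_10_mins(doc):
    """Questions answered in time divided by all questions with this tag."""
    return doc["aws_in_time"] / (doc["num_aws"] + doc["num_no_aws"])


print(round(average_response_time(doc), 2))  # 108.55
print(round(prob_in_10_mins(doc), 3))        # 0.098
```

Storing running counters per tag means both values can be read with one lookup and two divisions, and a new answer only has to increment a few fields.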