Sam zhang demo

Post on 12-Apr-2017

Improvement for StackOverflow.com

Chentao Zhang Insight Data Engineering SV

Motivations

1. Is the question tagged correctly?
2. When will it be answered?

Goal: help users tag their questions on stackoverflow.com more appropriately.

DEMO:

www.stackoverflowtags.tech

Data Pipeline

Historical data (60 GB)

Streaming data

Input Data

Question: {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, "parent_id": 0, "title": "Java Exception", "body": "....", "tags": "java;algorithm", "user_id": 782, …}

Answer: {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, "parent_id": 67172, "title": "", "body": "You should....", "tags": "", "user_id": 1982, …}
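As a rough sketch of how such records could be read, the Python below parses one question and one answer and derives the response time between them. The helper name parse_post and the timestamp layout (month-day-year-hour-minute-second) are assumptions; the slides do not say whether "6-10-2015" means June 10 or October 6. The elided fields ("…") are simply left out of the sample strings.

```python
import json
from datetime import datetime

# Assumed layout: month-day-year-hour-minute-second (not stated on the slide).
POST_DATE_FORMAT = "%m-%d-%Y-%H-%M-%S"


def parse_post(raw: str) -> dict:
    """Parse one JSON post and normalise the fields used by the queries."""
    post = json.loads(raw)
    return {
        "post_id": post["post_id"],
        # In the sample records, type 0 is a question and type 1 an answer.
        "is_question": post["type"] == 0,
        "parent_id": post["parent_id"],
        "posted_at": datetime.strptime(post["post_date"], POST_DATE_FORMAT),
        # Tags arrive as a ';'-separated string; answers carry an empty string.
        "tags": [t for t in post["tags"].split(";") if t],
    }


question = parse_post(
    '{"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, '
    '"parent_id": 0, "title": "Java Exception", "body": "....", '
    '"tags": "java;algorithm", "user_id": 782}'
)
answer = parse_post(
    '{"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, '
    '"parent_id": 67172, "title": "", "body": "You should....", '
    '"tags": "", "user_id": 1982}'
)

# Response time: seconds between the question and this answer.
response_seconds = (answer["posted_at"] - question["posted_at"]).total_seconds()
```

An answer is matched to its question through parent_id, which is why the answer record carries no tags of its own.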

Query #1

• Probability that a question labeled with a specific tag (say, tag A) is answered within 10 minutes

= (number of questions tagged A and answered within 10 minutes) / (total number of questions tagged A)

question_id  tags   answer_time(sec)  posted_at
231          Java   3010              2016_01_02_21_20_01
290          spark  7381              2016_01_02_22_09_01
341          Java   5611              2016_01_10_01_02_05
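The ratio defined for Query #1 can be sketched in a few lines of Python over the three sample rows. The function name prob_answered_within and the limit_sec parameter are hypothetical; this is an in-memory illustration, not the pipeline's actual batch job.

```python
from collections import defaultdict

# Sample rows from the table: (question_id, tag, answer_time_sec).
ROWS = [
    (231, "Java", 3010),
    (290, "spark", 7381),
    (341, "Java", 5611),
]


def prob_answered_within(rows, limit_sec=600):
    """Per tag: questions answered within limit_sec / all questions with that tag."""
    total = defaultdict(int)
    fast = defaultdict(int)
    for _question_id, tag, answer_time in rows:
        total[tag] += 1
        if answer_time <= limit_sec:
            fast[tag] += 1
    return {tag: fast[tag] / total[tag] for tag in total}


probs_10_min = prob_answered_within(ROWS)                 # 10-minute window
probs_1_hour = prob_answered_within(ROWS, limit_sec=3600)  # wider window for contrast
```

None of the three sample questions was answered within 600 seconds, so the 10-minute probability is 0 for both tags here; the one-hour window is included only to show a non-trivial ratio.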

(Figure: a request for Query #1 fans out to four shards; each shard returns its local result as probability | sample size.)

Shard: 0.5 | 2000
Shard: 0.3 | 2000
Shard: 0.1 | 1660
Shard: 0.2 | 2000

The shard results are combined into a sample-size-weighted average:

(1660*0.1 + 2000*0.2 + 2000*0.5 + 2000*0.3) / (2000*3 + 1660) = 0.28
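The weighted-average step for the four shard results can be checked with a short Python sketch. The function name combine_shards is hypothetical, and the code only illustrates the arithmetic; it makes no claim about how the cluster merges shard responses internally.

```python
def combine_shards(results):
    """Combine per-shard (probability, sample_size) pairs into one probability.

    Each shard's probability is weighted by the number of samples behind it.
    """
    total_samples = sum(size for _prob, size in results)
    if total_samples == 0:
        return 0.0
    return sum(prob * size for prob, size in results) / total_samples


# The four shard results shown on the slide.
shard_results = [(0.5, 2000), (0.3, 2000), (0.1, 1660), (0.2, 2000)]
overall = combine_shards(shard_results)
print(round(overall, 2))  # 0.28
```

Weighting by sample size matters because the shards do not hold equal numbers of questions; a plain mean of the four probabilities would give 0.275 instead.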

Query #2

• Recommend a list of tags similar to a specific tag, sorted by their similarity to that tag

spark/tags_graph: {"id": 261, "tags": ["kdc:lat", "java", "hadoop:java", "apache", "linux:OS"]}

* Vertex: a tag
* Weight of a vertex: number of people who have answered questions labeled with this tag
* Weight of an edge: number of people who have answered questions for both tags
* Similarity of A to B = W_AB / W_A
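To make the vertex and edge weights concrete, here is a minimal Python sketch that builds them from per-user records. It assumes each tags_graph document lists the tags one person has answered questions for, which the slides suggest but do not state; the sample documents, build_tag_graph, and similarity are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_tag_graph(user_docs):
    """Count, per tag and per tag pair, how many people answered for them."""
    vertex_weight = Counter()
    edge_weight = Counter()
    for doc in user_docs:
        tags = sorted(set(doc["tags"]))  # count each person once per tag
        vertex_weight.update(tags)
        # An undirected edge is keyed by the unordered pair of tags.
        edge_weight.update(frozenset(pair) for pair in combinations(tags, 2))
    return vertex_weight, edge_weight


def similarity(a, b, vertex_weight, edge_weight):
    """Similarity of a to b = W_ab / W_a, following the slide's definition."""
    w_a = vertex_weight[a]
    return edge_weight[frozenset((a, b))] / w_a if w_a else 0.0


# Hypothetical documents, one per person.
docs = [
    {"id": 1, "tags": ["java", "hadoop"]},
    {"id": 2, "tags": ["java", "spark"]},
    {"id": 3, "tags": ["java", "hadoop", "spark"]},
]
vertex_weight, edge_weight = build_tag_graph(docs)
```

Note that the measure is asymmetric: in this toy data every hadoop answerer also answered java questions, so the similarity of hadoop to java is 1.0, while the similarity of java to hadoop is only 2/3.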

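• The user enters tag A.

• Find all neighbors of tag A and compute their similarity to A.

• Sort B, C, and D by their similarity to A.

• Return the result to the user.

(Figure: tag graph with vertex weights A = 7, B = 10, C = 5, D = 3 and edge weights A-B = 5, A-C = 3, A-D = 2.)

S(C->A) = 3/5 = 0.6

S(D->A) = 2/3 = 0.67

S(B->A) = 5/10 = 0.5

Sorted by similarity to A: D (0.67), C (0.6), B (0.5).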
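The four steps can be sketched in Python using the numbers from the example graph. The assignment of weights to vertices and edges is read off the three similarity values (the weight 7 for tag A is inferred as the remaining number in the figure), and rank_similar is a hypothetical name; a real deployment would query the graph store rather than a dictionary.

```python
# Weights from the example graph (tag A's weight of 7 is inferred).
VERTEX_WEIGHT = {"A": 7, "B": 10, "C": 5, "D": 3}
EDGE_WEIGHT = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 2}


def rank_similar(tag):
    """Return (neighbor, similarity) pairs for tag, most similar first.

    Similarity of a neighbor X to the tag is W_X,tag / W_X.
    """
    scores = []
    for (u, v), weight in EDGE_WEIGHT.items():
        if tag in (u, v):
            neighbor = v if u == tag else u
            scores.append((neighbor, weight / VERTEX_WEIGHT[neighbor]))
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_similar("A")
for neighbor, score in ranking:
    print(neighbor, round(score, 2))
```

With these numbers the order is D, C, B. The scan over all edges is linear in the number of edges, which is the part the slides flag as needing to scale.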

Challenges and Future Considerations

• Stream processing to keep the information up to date

• Processing big data

• Scaling the performance of sorting in the graph

About Me

• Chentao (Sam) Zhang

• MS in Electrical & Computer Engineering from the University of Delaware

• Passionate about learning and trying new things

Query #1

tags   Prob. of being answered in 10 mins  Avg time(sec)
Java   0.32                                1200
spark  0.013                               31000

Challenges

• Processing big data

• Performance of updating a batch of data

Data Modelling

• Good at searching

• Graph engine

Elasticsearch cluster

Index: spark

Type: tags_time
Type: tags_graph

(Figure: the index is spread across four shards.)

Query #1

spark/tags: {"tags": "kdc:latitude-longitude", "aws_in_time": 2985, "aws_no_in_time": 3023, "num_aws": 29795, "tot_time": 3234324, "num_no_aws": 796}

• Efficiency

• Data update

* Response time: the time from a question being posted to its first answer
* Average response time = tot_time / num_aws
* Prob. of being answered in 10 mins = aws_in_time / (num_aws + num_no_aws)
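Because each tag document stores running counters, both metrics reduce to two divisions at query time. The Python sketch below applies the two formulas to the sample document; the function name tag_metrics is hypothetical, and the total-time field is read as tot_time, the name used in the formula.

```python
# Sample per-tag document with the counters shown on the slide.
doc = {
    "tags": "kdc:latitude-longitude",
    "aws_in_time": 2985,
    "aws_no_in_time": 3023,
    "num_aws": 29795,
    "tot_time": 3234324,
    "num_no_aws": 796,
}


def tag_metrics(doc):
    """Derive the Query #1 metrics from the stored counters."""
    answered = doc["num_aws"]
    total = answered + doc["num_no_aws"]  # answered plus unanswered questions
    return {
        # Average response time = tot_time / num_aws
        "avg_response_sec": doc["tot_time"] / answered if answered else None,
        # Prob. of being answered in 10 mins = aws_in_time / (num_aws + num_no_aws)
        "prob_in_10_min": doc["aws_in_time"] / total if total else None,
    }


metrics = tag_metrics(doc)
```

Keeping counters rather than precomputed ratios is what makes updates cheap: a new answer only increments a few integers, and the ratios are recomputed on read.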