Sam zhang demo

Improvement for StackOverflow.com
Chentao Zhang Insight Data Engineering SV

Motivations
1. Is the question tagged correctly?
2. When will it be answered?
~Help users tag their questions on stackoverflow.com more appropriately
DEMO: www.stackoverflowtags.tech

Data Pipeline
Historical data (60 GB)
Streaming data

Input Data
Question: {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, "parent_id": 0, "title": "Java Exception", "body": "....", "tags": "java;algorithm", "user_id": 782, …}
Answer: {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, "parent_id": 67172, "title": "", "body": "You should....", "tags": "", "user_id": 1982, …}
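
To illustrate how a question record and an answer record relate, here is a minimal Python sketch that pairs the two and measures the seconds between them. The slides do not state the field order of post_date, so month-day-year is an assumption here, and the helper names (parse_post_date, split_tags, response_seconds) are hypothetical.

```python
from datetime import datetime

# Assumption: post_date is month-day-year-hour-minute-second.
POST_DATE_FORMAT = "%m-%d-%Y-%H-%M-%S"


def parse_post_date(text):
    """Turn a post_date string such as '6-10-2015-00-01-02' into a datetime."""
    return datetime.strptime(text, POST_DATE_FORMAT)


def split_tags(tags):
    """Split the semicolon-separated tags field; an empty field gives []."""
    return [tag for tag in tags.split(";") if tag]


def response_seconds(question, answer):
    """Seconds between a question and one of its answers."""
    if answer["type"] != 1 or answer["parent_id"] != question["post_id"]:
        raise ValueError("record is not an answer to this question")
    delta = parse_post_date(answer["post_date"]) - parse_post_date(question["post_date"])
    return delta.total_seconds()


# Only the fields needed for the calculation are copied from the sample records.
question = {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0,
            "parent_id": 0, "tags": "java;algorithm", "user_id": 782}
answer = {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1,
          "parent_id": 67172, "tags": "", "user_id": 1982}

print(split_tags(question["tags"]))
print(response_seconds(question, answer))
```

For the sample records the two timestamps fall on the same day, so the result does not depend on whether the date is read as month-day or day-month.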

Query #1
• Probability that a question labeled with a specific tag (such as tag A) will be answered within 10 minutes
  = (number of questions tagged with A and answered within 10 minutes) / (total number of questions tagged with A)

question_id   tags    answer_time(sec)   posted_at
231           Java    3010               2016_01_02_21_20_01
290           spark   7381               2016_01_02_22_09_01
341           Java    5611               2016_01_10_01_02_05
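
The ratio above can be sketched in a few lines of Python. This is only an illustration over in-memory rows, not the pipeline used in the project; the function name and the choice to count exactly 600 seconds as "within 10 minutes" are assumptions.

```python
from collections import defaultdict

TEN_MINUTES = 600  # seconds; treating exactly 600 s as "within" is an assumption


def answered_within_probability(rows, limit=TEN_MINUTES):
    """Per tag: share of questions whose answer_time is at most `limit` seconds."""
    total = defaultdict(int)
    fast = defaultdict(int)
    for _question_id, tag, answer_time, _posted_at in rows:
        total[tag] += 1
        if answer_time <= limit:
            fast[tag] += 1
    return {tag: fast[tag] / total[tag] for tag in total}


# The three sample rows from the table.
rows = [
    (231, "Java", 3010, "2016_01_02_21_20_01"),
    (290, "spark", 7381, "2016_01_02_22_09_01"),
    (341, "Java", 5611, "2016_01_10_01_02_05"),
]
probs = answered_within_probability(rows)
print(probs)
```

None of the three sample rows was answered in under 600 seconds, so both tags come out at 0.0 for this tiny sample.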

Query #1
[Figure: a request for Query #1 sent to four shards, each returning Prob.|sample size]
Shard results (Prob.|sample size): 0.5|2000, 0.3|2000, 0.1|1660, 0.2|2000
Combined: (1660*0.1 + 2000*0.2 + 2000*0.5 + 2000*0.3) / (2000*3 + 1660) ≈ 0.28
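
The combination step is a sample-size-weighted average of the per-shard probabilities. A minimal Python sketch of that arithmetic, with a hypothetical function name and no claim about how the cluster itself merges results:

```python
def combine_shard_results(results):
    """Weighted average of (probability, sample_size) pairs, weighted by sample size."""
    total = sum(size for _prob, size in results)
    if total == 0:
        raise ValueError("no samples to combine")
    return sum(prob * size for prob, size in results) / total


# The four shard results shown above.
shards = [(0.5, 2000), (0.3, 2000), (0.1, 1660), (0.2, 2000)]
combined = combine_shard_results(shards)
print(round(combined, 2))
```

Weighting by sample size matters here because one shard holds 1660 samples rather than 2000; a plain mean of the four probabilities would give 0.275 instead.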

Query #2
• Recommend a list of tags that are similar to a specific tag, sorted by their similarity to that tag

spark/tags_graph: {"id": 261, "tags": ["kdc:lat", "java", "hadoop:java", "apache", "linux:OS"]}
* Vertex: a tag
* Weight of a vertex: number of people who have answered questions labeled with this tag
* Weight of an edge: number of people who have answered questions for both tags
* Similarity of A to B = W_AB / W_A
Query #2
• User enters tag A
• Search all neighbors of tag A and compute their similarity to A
• Sort B, C, D by their similarity to A
• Return the result to the user

[Figure: tag graph with vertices tag A, tag B, tag C, tag D; vertex and edge weights shown: 3, 2, 5, 5, 10, 3, 7]
S(C->A) = 3/5 = 0.6
S(D->A) = 2/3 = 0.67
S(B->A) = 5/10 = 0.5
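
The four steps can be sketched in Python as follows. The weights are read off the three similarity fractions (edge weight over the neighbor's vertex weight); the dictionary layout and function name are hypothetical, and the project itself stores the graph in Elasticsearch rather than in memory.

```python
def rank_similar_tags(tag, vertex_weight, edge_weight):
    """Return (neighbor, similarity) pairs for `tag`, most similar first.

    Similarity of a neighbor N to `tag` is W_N,tag / W_N.
    """
    scores = {}
    for (a, b), weight in edge_weight.items():
        if tag not in (a, b):
            continue  # edge does not touch the requested tag
        neighbor = b if a == tag else a
        if neighbor == tag:
            continue  # ignore self-loops
        scores[neighbor] = weight / vertex_weight[neighbor]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Weights implied by the fractions 3/5, 2/3 and 5/10.
vertex_weight = {"B": 10, "C": 5, "D": 3}
edge_weight = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 2}

ranked = rank_similar_tags("A", vertex_weight, edge_weight)
for neighbor, score in ranked:
    print(neighbor, round(score, 2))
```

With these weights the order is D, C, B, matching the similarity values listed above.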

Challenges and Future Considerations
• Stream processing to keep the information up to date
• Processing big data
• Scaling the performance of sorting in the graph

About Me
• Chentao (Sam) Zhang
• MS in Electrical & Computer Engineering from University of Delaware
• Passionate about learning and trying new things

Query #1
tags    Prob. of being answered in 10 mins   Avg time(sec)
Java    0.32                                 1200
spark   0.013                                31000


Challenges
~Processing big data
~Performance of updating a batch of data


Data Modelling
~Good at searching
~Graph engine
Index: spark
Types: tags_time, tags_graph

[Figure: Query #1 served by an Elasticsearch cluster of four shards]

spark/tags: {"tags": "kdc:latitude-longitude", "aws_in_time": 2985, "aws_no_in_time": 3023, "num_aws": 29795, "tot_time": 3234324, "num_no_aws": 796}
~Efficiency
~Data update
* Response time: the time from a question being posted to its first answer
* Average response time = tot_time / num_aws
* Prob. of being answered in 10 mins = aws_in_time / (num_aws + num_no_aws)
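
As a worked example, the two formulas can be applied to the counters in the sample document. This is a minimal Python sketch with a hypothetical function name; the total-time counter is called tot_time here, following the formula, and returning None when a denominator is zero is an assumption.

```python
def summarize_tag(aws_in_time, num_aws, num_no_aws, tot_time):
    """Return (average response time in seconds, prob. of an answer within 10 mins)."""
    asked = num_aws + num_no_aws
    avg_time = tot_time / num_aws if num_aws else None
    prob_in_time = aws_in_time / asked if asked else None
    return avg_time, prob_in_time


# Counter values copied from the sample document.
avg_time, prob_in_time = summarize_tag(
    aws_in_time=2985, num_aws=29795, num_no_aws=796, tot_time=3234324
)
print(avg_time, prob_in_time)
```

Because only running counters are stored per tag, a new answer updates a handful of integers, and both metrics are derived at query time.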