BIOTECA: Restructuring Wikipedia Jenny Yuen Serdar Balci Erdong Chen Alvin Raj.


BIOTECA: Restructuring Wikipedia

Jenny Yuen

Serdar Balci

Erdong Chen

Alvin Raj


Problem definition

Wikipedia
  Collaborative editing
  3.8 million edits per month
  38 edits per article
  Various titles conveying the same meaning

BIOTECA: Restructuring Wikipedia
  A better way for Wikipedia users to access, analyze, and use biography data on Wikipedia.

June 28, 2007


Problem formulation

Document set: D = {d_1, d_2, …, d_m}

Sentences in document d: S_d = {s_1, s_2, …, s_{n(d)}}

Segment set: Ŝ = {ŝ_1, ŝ_2, …, ŝ_p}, where p is unknown

Segmentation: Seg_d(s_i): {s_1, s_2, …, s_{n(d)}} -> {ŝ'_1, ŝ'_2, …, ŝ'_{m(d)}} (adjacent sentence constraint)

Alignment: Ali_d(ŝ'_i): {ŝ'_1, ŝ'_2, …, ŝ'_{m(d)}} -> {ŝ_1, ŝ_2, …, ŝ_p} (some segments may be empty)

Goal: better alignment with reasonable segmentations (not too fine or coarse)
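
The two mappings in this formulation can be illustrated with a minimal Python sketch. It assumes that the adjacent sentence constraint means each local segment is a contiguous run of sentences, which the slide does not spell out. The function names `segment` and `align`, the boundary encoding, and the toy document are hypothetical and only serve the illustration.

```python
from typing import Dict, List


def segment(sentences: List[str], boundaries: List[int]) -> List[List[str]]:
    """Split one document into contiguous local segments.

    `boundaries` holds the start index of every segment after the first,
    so each segment is a run of adjacent sentences (the assumed constraint).
    """
    cuts = [0] + sorted(boundaries) + [len(sentences)]
    return [sentences[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


def align(local_segments: List[List[str]], assignment: List[int], p: int) -> Dict[int, List[str]]:
    """Map each local segment to one of p global segments.

    Global segments that receive no local segment stay empty.
    """
    if len(assignment) != len(local_segments):
        raise ValueError("need exactly one global index per local segment")
    aligned: Dict[int, List[str]] = {k: [] for k in range(p)}
    for seg, k in zip(local_segments, assignment):
        aligned[k].extend(seg)
    return aligned


# Toy document with five sentences, cut into three local segments.
doc = ["s1", "s2", "s3", "s4", "s5"]
local = segment(doc, boundaries=[2, 4])
# Align the three local segments to a global set of p = 4 segments.
result = align(local, assignment=[0, 2, 2], p=4)
```

In this toy run, global segments 1 and 3 remain empty, which mirrors the note that some segments may be empty for a given document.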


Barack Obama (Wikipedia article)


Barack Obama is a Democratic politician from Illinois. He is currently running for the United States Senate, which would be the highest elected office he has held thus far.

Biography

Obama's father is Kenyan; his mother is from Kansas. He himself was born in Hawaii, where his mother and father met at the University of Hawaii. Obama's father left his family early on, and Obama was raised in Hawaii by his mother.

Created in 2004 (5 sentences)


5907 revisions up to 2007 (>400 sentences)

Barack Obama (Wikipedia article)


Early Life (Section Title)

"Early life, education, and family"
"Early years, education, military"
"Personal life and education"
"Early Life and Education"
"Early years"
"Personal life and family"
"Personal life and career"
"Childhood and Education"
"Early life and childhood"
"Childhood"
"Early life, education, and early career"
"Early years and education"
"Early life"
"Early biography"
"Childhood and education"
"Earlier life"
"Youth"
"Early Life & Family"
"Early years and family"
"Family and education"
"Family and early life"
"Family Life"
"Career after football"
"Curriculum vitae"
"Family and Personal Life"
"Upbringing"
"Early life and family"
"Early Years"
"Early and private life"
"Early career"
"The Early Years"
"Birth and education"
"Early and personal life"
"Background and early life"
"Education and Family"
"Early life and education"
"Family and Education"
"Early Life"
"Early Life and Family"
"Background and family"
"Personal and family life"
"Family and childhood"


Title distribution


118,626 articles / 257,341 sections
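
A title distribution of this kind can be computed by counting section titles across the corpus. The sketch below is an assumption about how such a count might be done, not the authors' code: the function name `title_distribution` is hypothetical, and the normalization (lowercasing and collapsing whitespace) is an illustrative choice. The sample titles are taken from the "Early Life" slide.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def title_distribution(titles: Iterable[str]) -> List[Tuple[str, int]]:
    """Count section titles after light normalization, most frequent first."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    counts = Counter(" ".join(t.lower().split()) for t in titles)
    return counts.most_common()


sample = ["Early life", "Early Life", "Early years", "Early Years", "Childhood"]
dist = title_distribution(sample)
```

Case folding merges only the trivially different variants; titles such as "Early years" and "Childhood" stay distinct, which is the gap the later clustering step addresses.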


Architecture


Data Collection & Cleaning

Corpus statistics
  118,626 articles
  257,341 sections

Data cleaning
  Diagrams, tables, and links are removed
  Documents are parsed into sentences
  Sub-section titles are kept
  Paragraph structure is kept
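
The cleaning steps listed here can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's pipeline: it assumes the input is raw wikitext, that removing a link keeps its visible label, and that a naive punctuation rule is good enough for sentence splitting. The function name `clean`, the regular expressions, and the sample text are all hypothetical.

```python
import re
from typing import List

TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)              # {| ... |} wiki tables
FILE_LINK = re.compile(r"\[\[(?:File|Image):[^\]]*\]\]")   # embedded images and diagrams
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")       # [[target|label]] -> label
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")                # naive sentence boundary


def clean(wikitext: str) -> List[List[str]]:
    """Return paragraphs as lists of sentences; titles stay as one-item paragraphs."""
    text = TABLE.sub("", wikitext)
    text = FILE_LINK.sub("", text)
    text = LINK.sub(r"\1", text)
    # Blank lines separate paragraphs, so paragraph structure is preserved.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for p in paragraphs:
        if p.startswith("=="):
            result.append([p])                      # keep sub-section titles as-is
        else:
            result.append(SENTENCE_END.split(p))    # parse into sentences
    return result


# Made-up wikitext for illustration only.
raw = (
    "== Early life ==\n\n"
    "She was born in [[Kansas]]. She later studied [[Law school|law]].\n\n"
    '{| class="wikitable"\n| a || b\n|}\n\n'
    "She then entered politics."
)
cleaned = clean(raw)
```

A real corpus would need a proper wikitext parser and sentence splitter; nested links and abbreviations such as "U.S." would defeat these patterns.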


Data Integration

Hidden Markov Topical Model
  HMM
  Distributional similarity among titles
  Gibbs sampling

Category: politician
  Number of articles: 1,928
  Number of paragraphs: 26,367
  Number of sections: 9,692
  Number of distinct titles: 3,330
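
One plausible reading of "distributional similarity among titles" is that two titles are similar when the sections they head use similar words. The sketch below follows that reading; the slides do not define the measure, so the bag-of-words profiles, the cosine similarity, the function names, and the toy sections are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def title_profiles(sections: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Pool the words of all sections that share a title into one word-count profile."""
    profiles: Dict[str, Counter] = defaultdict(Counter)
    for title, text in sections:
        profiles[title].update(text.lower().split())
    return profiles


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy (title, text) pairs; the texts are invented for illustration.
sections = [
    ("Early life", "born raised school family"),
    ("Early years", "born family school"),
    ("Political career", "elected senate campaign"),
]
profiles = title_profiles(sections)
sim_close = cosine(profiles["Early life"], profiles["Early years"])
sim_far = cosine(profiles["Early life"], profiles["Political career"])
```

Under this measure the two "early" titles score high and the career title scores zero against them, which is the kind of signal a clustering step could use to merge title variants.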


Graphical model

z: topic
y: section titles
w: section texts
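
The graphical model itself is not reproduced in this transcript, so the following generative sketch rests on assumptions: topics z form a first-order Markov chain over the sections of an article, and each section's title y and words w are drawn given its topic z. The function name `sample_article` and all probability tables are hypothetical toy values. Inference, which the deck attributes to Gibbs sampling, is not shown.

```python
import random
from typing import Dict, List, Tuple

Dist = Dict[str, float]


def sample_article(
    start: Dist,
    transition: Dict[str, Dist],
    title_given_topic: Dict[str, Dist],
    word_given_topic: Dict[str, Dist],
    n_sections: int,
    words_per_section: int,
    rng: random.Random,
) -> List[Tuple[str, str, List[str]]]:
    """Sample (z, y, w) for each section under the assumed dependencies."""

    def draw(dist: Dist) -> str:
        items = list(dist)
        return rng.choices(items, weights=[dist[i] for i in items], k=1)[0]

    article = []
    z = draw(start)
    for _ in range(n_sections):
        y = draw(title_given_topic[z])                                   # title depends on topic
        w = [draw(word_given_topic[z]) for _ in range(words_per_section)]  # words depend on topic
        article.append((z, y, w))
        z = draw(transition[z])                                          # next topic depends on current
    return article


# Toy parameters, invented for illustration.
start = {"early_life": 1.0}
transition = {"early_life": {"career": 1.0}, "career": {"career": 1.0}}
title_given_topic = {
    "early_life": {"Early life": 0.5, "Childhood": 0.5},
    "career": {"Political career": 1.0},
}
word_given_topic = {
    "early_life": {"born": 0.5, "school": 0.5},
    "career": {"elected": 0.5, "senate": 0.5},
}
article = sample_article(start, transition, title_given_topic, word_given_topic,
                         n_sections=3, words_per_section=4, rng=random.Random(0))
```

Because several titles can be emitted by the same topic, a model of this shape gives one way to group titles such as "Early life" and "Childhood" under a shared latent topic.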


Full Topic Graph


Experiments

Statistics
  245 section titles appear at least 3 times
  3,331 section titles in total

4 clusters: manually labeled accuracy 91.5%

5 clusters: manually labeled accuracy 86.5%


4 & 5 Clusters


User Interface


Wikipedia Adventure
