BIOTECA: Restructuring Wikipedia. Jenny Yuen, Serdar Balci, Erdong Chen, Alvin Raj.

Post on 16-Jan-2016

BIOTECA: Restructuring Wikipedia

Jenny Yuen

Serdar Balci

Erdong Chen

Alvin Raj

Problem definition

Wikipedia

Collaborative editing: 3.8 million edits per month, 38 edits per article

Various titles conveying the same meaning

BIOTECA: Restructuring Wikipedia

A better way for Wikipedia users to access, analyze, and use biography data on Wikipedia.

June 28, 2007 2

Problem formulation

Document set: D = {d_1, d_2, …, d_m}

Sentences in document d: S_d = {s_1, s_2, …, s_{n(d)}}

Segment set: Ŝ = {ŝ_1, ŝ_2, …, ŝ_p}, where p is unknown

Segmentation: Seg_d(s_i): {s_1, s_2, …, s_{n(d)}} -> {ŝ'_1, ŝ'_2, …, ŝ'_{m(d)}}, subject to the adjacent sentence constraint

Alignment: Ali_d(ŝ'_i): {ŝ'_1, ŝ'_2, …, ŝ'_{m(d)}} -> {ŝ_1, ŝ_2, …, ŝ_p}, where some segments may be empty

Goal: better alignment with reasonable segmentations (not too fine or coarse)
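To make the two mappings concrete, here is a minimal Python sketch of a segmentation step that respects the adjacent sentence constraint, followed by an alignment step into p global segments. The function names `segment` and `align`, the boundary-index representation, and the choice to let several document segments share one global segment are illustrative assumptions, not part of the original slides.

```python
from typing import List, Sequence


def segment(sentences: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Seg_d sketch: cut a document into contiguous runs of sentences.

    `boundaries` holds the start index of every segment after the first
    (a hypothetical representation). Because segments are slices, each one
    contains only adjacent sentences.
    """
    n = len(sentences)
    for b in boundaries:
        if not 0 < b < n:
            raise ValueError(f"boundary {b} is outside 1..{n - 1}")
    cuts = [0, *sorted(set(boundaries)), n]
    return [list(sentences[a:b]) for a, b in zip(cuts, cuts[1:])]


def align(doc_segments: Sequence[List[str]], labels: Sequence[int], p: int) -> List[List[str]]:
    """Ali_d sketch: place each document segment into one of p global segments.

    Global segments that receive nothing from this document stay empty.
    """
    if len(doc_segments) != len(labels):
        raise ValueError("one label per document segment is required")
    slots: List[List[str]] = [[] for _ in range(p)]
    for seg, k in zip(doc_segments, labels):
        slots[k].extend(seg)
    return slots


sentences = ["s1", "s2", "s3", "s4", "s5"]
doc_segments = segment(sentences, boundaries=[2, 4])
aligned = align(doc_segments, labels=[0, 2, 2], p=4)
```

In this toy run the document is cut into three segments, and two of the four global segments remain empty, which is the situation the "some segments may be empty" note describes.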

Barack Obama (Wikipedia article)

Barack Obama is a Democratic politician from Illinois. He is currently running for the United States Senate, which would be the highest elected office he has held thus far.

Biography

Obama's father is Kenyan; his mother is from Kansas. He himself was born in Hawaii, where his mother and father met at the University of Hawaii. Obama's father left his family early on, and Obama was raised in Hawaii by his mother.

Created in 2004 (5 sentences)

5907 revisions up to 2007 (>400 sentences)

Barack Obama (Wikipedia article)

Early Life (Section Title)

"Early life, education, and family", "Early years, education, military", "Personal life and education", "Early Life and Education", "Early years", "Personal life and family", "Personal life and career", "Childhood and Education", "Early life and childhood", "Childhood", "Early life, education, and early career", "Early years and education", "Early life", "Early biography", "Childhood and education", "Earlier life", "Youth", "Early Life & Family", "Early years and family", "Family and education", "Family and early life", "Family Life", "Career after football", "Curriculum vitae", "Family and Personal Life", "Upbringing", "Early life and family", "Early Years", "Early and private life", "Early career", "The Early Years", "Birth and education", "Early and personal life", "Background and early life", "Education and Family", "Early life and education", "Family and Education", "Early Life", "Early Life and Family", "Background and family", "Personal and family life", "Family and childhood"

Title distribution

118,626 articles / 257,341 sections

Architecture

Data Collection & Cleaning

Data Collection & Cleaning

Corpus statistics: 118,626 articles, 257,341 sections

Data cleaning:

Diagrams, tables, and links are removed

Documents are parsed into sentences

Sub-section titles are kept

Paragraph structure is kept
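The cleaning steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the pipeline the authors used: it assumes the input is raw MediaWiki markup, that "links are removed" means the link markup is dropped while the visible label is kept, and that a simple regular expression is good enough for sentence splitting. The function name `clean_article` and the output layout are hypothetical.

```python
import re

TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)          # {| ... |} wiki tables
MEDIA = re.compile(r"\[\[(?:File|Image):[^\]]*\]\]")   # embedded images and diagrams
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # [[target|label]] -> label
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1$")        # == Section title ==
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")   # naive sentence boundary


def clean_article(wikitext):
    """Return a list of sections, each with a title and paragraphs of sentences."""
    text = TABLE.sub("", wikitext)
    text = MEDIA.sub("", text)       # does not handle links nested in captions
    text = LINK.sub(r"\1", text)     # assumption: keep the visible link label

    sections = [{"title": None, "paragraphs": []}]  # lead text has no title
    paragraph = []

    def flush():
        # Close the current paragraph and store it as a list of sentences.
        if paragraph:
            joined = " ".join(paragraph)
            sections[-1]["paragraphs"].append(SENTENCE_END.split(joined))
            paragraph.clear()

    for line in text.splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            flush()
            sections.append({"title": heading.group(2), "paragraphs": []})
        elif not line:
            flush()                  # a blank line ends a paragraph
        else:
            paragraph.append(line)
    flush()
    return sections


sample = (
    "Barack Obama is a [[Democratic Party (United States)|Democratic]] "
    "politician from [[Illinois]]. He was born in Hawaii.\n\n"
    "== Biography ==\n"
    "[[File:Obama.jpg|thumb]]\n"
    "Obama's father is Kenyan. His mother is from Kansas.\n"
    "{|\n| a table\n|}\n"
)
cleaned = clean_article(sample)
```

Keeping the section title and the paragraph grouping in the output mirrors the last two cleaning rules: later stages can still see which sentences belonged to which titled section.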

Data Integration

Data Integration

Hidden Markov Topical Model: HMM, distributional similarity among titles, Gibbs sampling

Category: politician

# of articles: 1928

# of paragraphs: 26367

# of sections: 9692

# of distinct titles: 3330
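One way to read "distributional similarity among titles" is that two section titles are similar when the text found under them uses similar words. The Python sketch below illustrates that idea with bag-of-words profiles and cosine similarity. The slides do not say which similarity measure was used, so the measure, the lowercasing of titles, and the names `title_profiles` and `cosine` are all assumptions made for illustration; the HMM and Gibbs sampling components are not covered here.

```python
import math
from collections import Counter, defaultdict


def title_profiles(sections):
    """Build one word-count profile per (lowercased) section title.

    `sections` is an iterable of (title, text) pairs.
    """
    profiles = defaultdict(Counter)
    for title, text in sections:
        profiles[title.lower()].update(text.lower().split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


toy_sections = [
    ("Early life", "born in hawaii raised by mother"),
    ("Childhood", "born in kansas raised by mother"),
    ("Political career", "elected to the senate"),
]
profiles = title_profiles(toy_sections)
sim_close = cosine(profiles["early life"], profiles["childhood"])
sim_far = cosine(profiles["early life"], profiles["political career"])
```

On this toy data "Early life" and "Childhood" come out far more similar to each other than either is to "Political career", which is the kind of signal that would let differently worded titles be grouped under one topic.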

Graphical model

z: topic

y: section titles

w: section texts

Full Topic Graph

Experiments

Statistics: 245 section titles appear at least 3 times; 3331 section titles in total

4 clusters: manually labeled accuracy 91.5%

5 clusters: manually labeled accuracy 86.5%
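The slides do not state how accuracy against the manual labels was computed. As one plausible reading, the Python sketch below scores a clustering by assigning each cluster its most common manual label and counting the titles that agree with it (often called purity). The function name `cluster_accuracy` and the toy titles and labels are hypothetical and are not taken from the experiments.

```python
from collections import Counter, defaultdict


def cluster_accuracy(assignments, gold):
    """Fraction of titles whose manual label matches their cluster's majority label.

    `assignments` maps title -> cluster id; `gold` maps title -> manual label.
    """
    labels_by_cluster = defaultdict(list)
    for title, cluster in assignments.items():
        labels_by_cluster[cluster].append(gold[title])
    # For each cluster, count how many members carry its most common label.
    correct = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return correct / len(assignments)


toy_assignments = {
    "Early life": 0,
    "Childhood": 0,
    "Political career": 1,
    "Senate career": 1,
    "Personal life": 1,
}
toy_gold = {
    "Early life": "early",
    "Childhood": "early",
    "Political career": "career",
    "Senate career": "career",
    "Personal life": "personal",
}
toy_score = cluster_accuracy(toy_assignments, toy_gold)
```

A score of this kind tends to rise as the number of clusters grows, so it is only meaningful when compared at a fixed cluster count, as the 4-cluster and 5-cluster figures are reported separately above.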

4 & 5 Clusters

User Interface

User Interface

Wikipedia Adventure

Wikipedia Adventure
