BIOTECA: Restructuring Wikipedia
Jenny Yuen
Serdar Balci
Erdong Chen
Alvin Raj
Problem definition
Wikipedia
Collaborative editing: 3.8 million edits per month, 38 edits per article
Various titles conveying the same meaning
BIOTECA: Restructuring Wikipedia
A better way for Wikipedia users to access, analyze, and use biography data on Wikipedia.
June 28, 2007 2
Problem formulation
Document set: D = {d_1, d_2, …, d_m}
Sentences in document d: S_d = {s_1, s_2, …, s_n(d)}
Segment set: Ŝ = {ŝ_1, ŝ_2, …, ŝ_p}, where p is unknown
Segmentation: Seg_d: {s_1, s_2, …, s_n(d)} -> {ŝ'_1, ŝ'_2, …, ŝ'_m(d)} (adjacent sentence constraint)
Alignment: Ali_d: {ŝ'_1, ŝ'_2, …, ŝ'_m(d)} -> {ŝ_1, ŝ_2, …, ŝ_p} (some segments may be empty)
Goal: better alignment with reasonable segmentations (neither too fine nor too coarse)
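The segmentation and alignment mappings can be illustrated with a small Python sketch. It assumes that the adjacent sentence constraint means every segment is a contiguous run of sentences, and that a segmentation can be described by the indices where new segments start. The function names `segment` and `align` and the label list are hypothetical and are not taken from the slides.

```python
from typing import Dict, List


def segment(boundaries: List[int], n_sentences: int) -> List[range]:
    """Split sentence indices 0..n_sentences-1 into contiguous segments.

    `boundaries` holds the index of the first sentence of every segment
    after the first one, so each segment contains only adjacent sentences.
    """
    if len(set(boundaries)) != len(boundaries):
        raise ValueError("boundaries must be distinct")
    if any(b <= 0 or b >= n_sentences for b in boundaries):
        raise ValueError("boundaries must fall strictly inside the document")
    starts = [0] + sorted(boundaries)
    ends = starts[1:] + [n_sentences]
    return [range(a, b) for a, b in zip(starts, ends)]


def align(local_segments: List[range], labels: List[int], p: int) -> Dict[int, List[int]]:
    """Assign each document-level segment to one of p global segments.

    Global segments that receive no local segment stay empty.
    """
    if len(local_segments) != len(labels):
        raise ValueError("one label per local segment is required")
    aligned: Dict[int, List[int]] = {k: [] for k in range(p)}
    for seg, k in zip(local_segments, labels):
        aligned[k].extend(seg)
    return aligned


# A 7-sentence document cut into 3 segments, aligned to 4 global segments.
segments = segment([2, 5], 7)
aligned = align(segments, [0, 2, 3], p=4)
print(aligned)
```

In this toy run, global segment 1 receives no sentences, which corresponds to the note that some segments may be empty.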
Barack Obama (Wikipedia article)
Barack Obama is a Democratic politician from Illinois. He is currently running for the United States Senate, which would be the highest elected office he has held thus far.
Biography
Obama's father is Kenyan; his mother is from Kansas. He himself was born in Hawaii, where his mother and father met at the University of Hawaii. Obama's father left his family early on, and Obama was raised in Hawaii by his mother.
Created in 2004 (5 sentences)
5907 revisions up to 2007 (>400 sentences)
Early Life (Section Title)
"Early life, education, and family", "Early years, education, military", "Personal life and education",
"Early Life and Education", "Early years", "Personal life and family", "Personal life and career",
"Childhood and Education", "Early life and childhood", "Childhood", "Early life, education, and early career",
"Early years and education", "Early life", "Early biography", "Childhood and education", "Earlier life",
"Youth", "Early Life & Family", "Early years and family", "Family and education", "Family and early life",
"Family Life", "Career after football", "Curriculum vitae", "Family and Personal Life", "Upbringing",
"Early life and family", "Early Years", "Early and private life", "Early career", "The Early Years",
"Birth and education", "Early and personal life", "Background and early life", "Education and Family",
"Early life and education", "Family and Education", "Early Life", "Early Life and Family",
"Background and family", "Personal and family life", "Family and childhood"
Title distribution
118,626 articles / 257,341 sections
Architecture
Data Collection & Cleaning
Corpus statistics
118,626 articles
257,341 sections
Data cleaning
Diagrams, tables, and links are removed
Documents are parsed into sentences
Sub-section titles are kept
Paragraph structure is kept
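The cleaning steps listed above could look roughly like the following Python sketch. It assumes the input is raw MediaWiki wikitext, that removing a link means dropping the markup while keeping the visible text, and that a naive punctuation-based splitter is good enough for sentences. The regular expressions and the `clean` and `parse` helpers are illustrative assumptions, not the pipeline used in the project.

```python
import re

TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)                  # {| ... |} tables
MEDIA = re.compile(r"\[\[(?:File|Image):[^\[\]]*\]\]", re.I)   # embedded images
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")      # [[target|label]]
HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1$")                # == Title ==
SENT_END = re.compile(r"(?<=[.!?])\s+")                        # naive sentence end


def clean(wikitext):
    """Drop tables and images, and reduce links to their visible text."""
    text = TABLE.sub("", wikitext)
    text = MEDIA.sub("", text)
    return LINK.sub(r"\1", text)


def parse(wikitext):
    """Return ("title", str) and ("paragraph", [sentences]) items in order."""
    items, buffer = [], []

    def flush():
        # Close the current paragraph and split it into sentences.
        if buffer:
            paragraph = " ".join(" ".join(buffer).split())
            items.append(("paragraph", SENT_END.split(paragraph)))
            buffer.clear()

    for line in clean(wikitext).splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            flush()
            items.append(("title", heading.group(2)))
        elif not line:
            flush()
        else:
            buffer.append(line)
    flush()
    return items


sample = """She was born in [[Hawaii]]. She studied [[Law school|law]].
{| class="wikitable"
| a || b
|}

=== Early life ===
[[Image:photo.jpg|thumb]]She moved often. She read a lot."""

parsed = parse(sample)
for kind, value in parsed:
    print(kind, value)
```

Keeping titles and paragraphs as separate items preserves the sub-section and paragraph structure that later stages rely on.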
Data Integration
Hidden Markov Topical Model
HMM
Distributional similarity among titles
Gibbs sampling
Category: politician
Number of articles: 1,928
Number of paragraphs: 26,367
Number of sections: 9,692
Number of distinct titles: 3,330
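The slides give no formula for distributional similarity among titles. One common reading, assumed here, is that two section titles are similar when the text that appears under them uses similar words. The sketch below builds a bag-of-words profile per title and compares profiles with cosine similarity; the helper names and the toy sections are hypothetical.

```python
import math
from collections import Counter, defaultdict


def title_profiles(sections):
    """Map each lower-cased title to word counts of all text found under it."""
    profiles = defaultdict(Counter)
    for title, text in sections:
        profiles[title.strip().lower()].update(text.lower().split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy data, invented for illustration only.
sections = [
    ("Early life", "born in a small town and attended school"),
    ("Childhood", "born and raised in a town near the school"),
    ("Political career", "elected to the senate and served two terms"),
]
profiles = title_profiles(sections)
sim_close = cosine(profiles["early life"], profiles["childhood"])
sim_far = cosine(profiles["early life"], profiles["political career"])
print(sim_close > sim_far)
```

Lower-casing the titles already merges variants that differ only in capitalization; the similarity score is what would let differently worded titles be grouped.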
Graphical model
z: topic
y: section titles
w: section texts
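The diagram itself is not preserved in this transcript. One factorization consistent with the three variables, under the assumption (not stated on the slide) that the sections of an article form a first-order Markov chain over topics and that each section's title and words depend only on that section's topic, would be the following. The indices t, i, T, and N_t and the start state z_0 are notation introduced here for illustration.

```latex
p(z, y, w) \;=\; \prod_{t=1}^{T} p(z_t \mid z_{t-1}) \; p(y_t \mid z_t) \prod_{i=1}^{N_t} p(w_{t,i} \mid z_t)
```

Here T is the number of sections in an article, N_t is the number of words in section t, and z_0 is a fixed start state.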
Full Topic Graph
Experiments
Statistics
245 section titles appear at least 3 times
3331 section titles in total
4 clusters: manually labeled accuracy 91.5%
5 clusters: manually labeled accuracy 86.5%
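The slides do not say how the manually labeled accuracy was computed. A minimal sketch of one plausible measure, assuming each cluster is credited with its most frequent manual label and accuracy is the fraction of titles that agree with their cluster's label, is shown below. The function name and the toy labels are hypothetical and do not reproduce the reported figures.

```python
from collections import Counter, defaultdict


def cluster_accuracy(assignments, manual_labels):
    """Fraction of titles whose manual label matches their cluster's majority label."""
    if not assignments:
        raise ValueError("no titles to evaluate")
    by_cluster = defaultdict(list)
    for title, cluster in assignments.items():
        by_cluster[cluster].append(manual_labels[title])
    # Count, per cluster, how many titles carry the most common manual label.
    correct = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return correct / len(assignments)


# Toy data, invented for illustration only.
assignments = {
    "Early life": 0,
    "Childhood": 0,
    "Early career": 0,
    "Political career": 1,
    "Death": 2,
}
manual_labels = {
    "Early life": "early",
    "Childhood": "early",
    "Early career": "career",
    "Political career": "career",
    "Death": "death",
}
accuracy = cluster_accuracy(assignments, manual_labels)
print(accuracy)
```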
4 & 5 Clusters
User Interface
Wikipedia Adventure