A Two Tier Framework for Context-Aware Service Organization & Discovery
I2R-NUS-MSRA at TAC 2011: Entity Linking
Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1
1 National University of Singapore
2 Institute for Infocomm Research
3 Microsoft Research Asia
Text Analysis Conference, November 14-15, 2011
Outline
I2R-NUS team at TAC: incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
  Acronym Expansion
  Semantic Features
  Instance Selection
Investigate three algorithms for NIL query clustering
  Spectral Graph Partitioning (SGP)
  Hierarchical Agglomerative Clustering (HAC)
  Latent Dirichlet Allocation (LDA)
Combination system
  Offline combination with the system of the MSRA team at the KB linking step
Acronym Expansion - Motivation
Expanding an acronym from its context reduces the ambiguity of a name.
E.g., TSE refers to 33 entries in Wikipedia, whereas Tokyo Stock Exchange is unambiguous.
Step 1 – Find Expansion Candidates
Identifying Candidate Expansions (e.g. for ACM)
Step 2 – Candidate Expansions Ranking
An SVM classifier ranks the candidate expansions, using features such as:
  Number of common characters between the acronym and the leading characters of the expansion
  Sentence distance between the acronym and the expansion
Our SVM-based acronym expansion
  can link acronyms and full strings that appear in different sentences of the article
  can handle acronyms with swapped letters, e.g. Communist Party of China vs. CCP
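The two ranking features named on this slide can be sketched as follows. This is a minimal illustration, not the authors' feature extractor: the function names, the rule of taking initials only from capitalised words, and the regular-expression acronym match are all assumptions made for the example.

```python
import re
from collections import Counter


def common_leading_chars(acronym, expansion):
    """Count acronym letters that also occur as word-initial capitals of the expansion.

    Counting is order-insensitive (multiset intersection), so an acronym with
    swapped letters still matches fully.
    """
    initials = Counter(w[0] for w in expansion.split() if w[0].isupper())
    letters = Counter(acronym.upper())
    return sum((initials & letters).values())


def sentence_distance(sentences, acronym, expansion):
    """Smallest sentence-index gap between an acronym mention and an expansion mention.

    Returns None when either string is missing from the article.
    """
    acronym_pat = re.compile(r"\b" + re.escape(acronym) + r"\b")
    acr_idx = [i for i, s in enumerate(sentences) if acronym_pat.search(s)]
    exp_idx = [i for i, s in enumerate(sentences) if expansion in s]
    if not acr_idx or not exp_idx:
        return None
    return min(abs(a - e) for a in acr_idx for e in exp_idx)


# Hypothetical three-sentence article, written for this example only.
article = [
    "The Communist Party of China met today.",
    "Observers were surprised.",
    "The CCP issued a statement.",
]
print(common_leading_chars("CCP", "Communist Party of China"))
print(sentence_distance(article, "CCP", "Communist Party of China"))
```

In this toy article the swapped-letter acronym CCP still shares all three letters with the initials C, P, C, and the two mentions sit two sentences apart. Both values would then be passed to the classifier alongside any other features.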
Related Work on Context Similarity
The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection
Term matching (Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010)
However, contexts that describe the same entity may share no terms:
1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
2) Michael Jordan is currently a full professor at the University of California, Berkeley.
3) Michael Jordan (born February, 1963) is a former American professional basketball player.
4) Michael Jordan wins NBA MVP of 91-92 season.
No term match: apart from the name itself, 1) and 2) share no content terms, and neither do 3) and 4).
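The lack of term overlap can be checked directly on the four example sentences. The sketch below is an illustration only: the stop list and the tokenisation rule are assumptions, and the query name is excluded because it occurs in every context.

```python
import re

# Assumed stop list for this illustration; the query name is excluded
# because it appears in every context by construction.
STOPWORDS = {"a", "and", "at", "in", "is", "of", "the", "michael", "jordan"}


def content_terms(text):
    """Lower-case alphanumeric tokens of the text, minus the stop list."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def shared_terms(a, b):
    """Content terms that two contexts have in common."""
    return content_terms(a) & content_terms(b)


contexts = [
    "Michael Jordan is a leading researcher in machine learning and artificial intelligence.",
    "Michael Jordan is currently a full professor at the University of California, Berkeley.",
    "Michael Jordan (born February, 1963) is a former American professional basketball player.",
    "Michael Jordan wins NBA MVP of 91-92 season.",
]
print(shared_terms(contexts[0], contexts[1]))  # the two researcher contexts
print(shared_terms(contexts[2], contexts[3]))  # the two basketball contexts
```

Both intersections are empty, so a purely term-based similarity cannot tell that 1) and 2), or 3) and 4), refer to the same person.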
Our System - A Wikipedia-LDA model
Topic: Science
1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
2) Michael Jordan is currently a full professor at the University of California, Berkeley.
Topic: Basketball
3) Michael Jordan (born February, 1963) is a former American professional basketball player.
4) Michael Jordan wins NBA MVP of 91-92 season.
Wikipedia – LDA Model
[Figure: Wikipedia-LDA model relating documents to P(word i | category j) and P(category i | document j)]
Wikipedia – LDA Model
1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
2) Michael Jordan is currently a full professor at the University of California, Berkeley.
3) Michael Jordan (born February, 1963) is a former American professional basketball player.
4) Michael Jordan wins NBA MVP of 91-92 season.
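Once each context is represented by a distribution over categories, contexts can be compared in topic space instead of term space. The sketch below illustrates the idea only: the P(category | document) values are made up for the four sentences, the two categories are assumed to be Science and Basketball, and cosine similarity is an assumed choice of comparison.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Hypothetical P(category | document) over (Science, Basketball);
# these numbers are illustrative, not output of a trained model.
topic_dist = {
    1: [0.90, 0.10],
    2: [0.80, 0.20],
    3: [0.10, 0.90],
    4: [0.05, 0.95],
}

same_entity = cosine(topic_dist[1], topic_dist[2])
different_entity = cosine(topic_dist[1], topic_dist[3])
print(same_entity > different_entity)
```

With these illustrative values, sentences 1) and 2) are far closer to each other than 1) is to 3), even though the term-level overlap is empty in both cases.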
Related Work
Vector Space Model
  Difficult to combine bag of words (BOW) with other features
  Performance needs to be improved
Supervised Approaches
  Using manually annotated training instances (Dredze et al., 2010; Zheng et al., 2010)
  Using automatically generated training instances (Zhang et al., 2010)
Related Work
Auto-generating training instances (Zhang et al., 2010)
(News Article) Obama Campaign Drops The George W. Bush Talking Point …
Related Work
From “George W. Bush” articles
  No positive instances are generated for “George H. W. Bush”, “George P. Bush” and “George Washington Bush”
  No negative instances are generated for “George W. Bush”
Such positive/negative training instance distributions may differ from those of the original ambiguous cases in the raw text collection
The distribution of the unambiguous mentions may also differ from that of the test data
The Approach in Our System
An instance selection approach
  Select an informative, representative, and diverse subset from the auto-generated data set
  Reduce the effect of the distribution differences
Instance Selection
[Figure: instance selection loop, with a 2-D data set illustration of the SVM hyperplane]
1) Train an SVM classifier on a small initial data set
2) Test the classifier on the auto-generated data set
3) Select informative, representative and diverse instances
4) Add these selected instances to the initial data set
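One round of the selection step might look like the sketch below. It is a simplified illustration under stated assumptions, not the authors' criterion: informativeness is approximated by closeness to the SVM hyperplane, diversity by a cosine-similarity cap between selected instances, and representativeness is not modelled. The function names and the `max_sim` threshold are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_batch(scores, vectors, batch_size, max_sim=0.9):
    """Pick up to batch_size instance indices from the auto-generated set.

    scores[i]  : signed distance of instance i to the SVM hyperplane
                 (assumed to come from the current classifier)
    vectors[i] : feature vector of instance i
    Instances nearest the hyperplane are tried first (informative); an
    instance is skipped if it is too similar to one already chosen (diverse).
    """
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    chosen = []
    for i in order:
        if len(chosen) == batch_size:
            break
        if all(cosine(vectors[i], vectors[j]) < max_sim for j in chosen):
            chosen.append(i)
    return chosen


# Toy 2-D data: instances 0 and 1 are near-duplicates close to the hyperplane.
scores = [0.05, -0.06, 1.5, -0.4]
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
print(select_batch(scores, vectors, batch_size=2))
```

In the toy data, instance 1 is skipped as a near-duplicate of instance 0, so the batch contains instances 0 and 3. The selected instances would then be added to the initial data set before the classifier is retrained.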
Spectral Clustering
Advantages over other clustering techniques
  Globally optimized results
  Efficient in time and space
  Generally produces better results
Success in many areas
  Image segmentation
  Gene expression clustering
Spectral Clustering
$A = Q \Lambda Q^{-1}$
Eigen decomposition on the graph Laplacian
Dimensionality reduction (Luxburg, 2006)
[Figure: example with George W. Bush and George H.W. Bush]
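For orientation, the usual unnormalized form of spectral clustering can be written out as below. The symbols $W$ (document similarity matrix), $D$ (degree matrix), $L$ and $k$ are assumptions introduced here; the slides do not say which Laplacian variant or which value of $k$ was used.

```latex
\begin{aligned}
D_{ii} &= \sum_{j} W_{ij}, \qquad L = D - W, \\
L &= Q \Lambda Q^{-1}, \qquad
  \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n), \quad
  \lambda_1 \le \lambda_2 \le \dots \le \lambda_n, \\
U &= [\, q_1 \; q_2 \; \cdots \; q_k \,] \in \mathbb{R}^{n \times k}, \qquad
  y_i = \text{row } i \text{ of } U .
\end{aligned}
```

Because $L$ is symmetric, $Q$ can be chosen orthogonal, so $Q^{-1} = Q^{\top}$. Each document $i$ is then represented by the $k$-dimensional point $y_i$, which is the dimensionality reduction step, and the points are grouped with a standard algorithm such as k-means.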
Hierarchical Agglomerative Clustering
Convert each doc into a feature vector: Wikipedia concepts, bag-of-words and named entities.
Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010):
  This model shows good performance in Web People Search
  In our work, the original query name, its Wikipedia redirected names and its coreference chain mentions are all considered appearances of the query name in the text
Similarity scores: cosine similarity and overlap similarity.
Hierarchical Agglomerative Clustering
Docs referring to the same entity are clustered according to pairwise doc similarity scores.
Start with singletons: each doc is a cluster
If there are two docs D and D' in clusters Ci and Cj respectively, the two clusters Ci and Cj are merged to form a new cluster Cij if Sim(D, D') > γ, with γ = 0.25
Calculate the similarity between the new cluster Cij and all remaining clusters
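The merging rule above amounts to single-link agglomerative clustering with a similarity threshold. A minimal sketch follows, assuming the pairwise similarities are already available as a symmetric matrix; the function name and the toy matrix are hypothetical, and the most similar pair is merged first as one reasonable ordering.

```python
def hac_single_link(sim, gamma=0.25):
    """Threshold-based agglomerative clustering over a similarity matrix.

    sim[i][j] is the similarity between docs i and j (symmetric).
    Two clusters are merged when some pair of docs, one from each cluster,
    has similarity above gamma; the most similar pair is merged first.
    """
    clusters = [{i} for i in range(len(sim))]  # start with singletons

    def cluster_sim(a, b):
        # Single link: best similarity over all cross-cluster doc pairs.
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, pair = gamma, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if pair is None:  # no remaining pair exceeds gamma
            break
        x, y = pair
        clusters[x] |= clusters[y]  # form the merged cluster Cij
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy similarities for four docs (illustrative values only).
toy_sim = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(hac_single_link(toy_sim, gamma=0.25))
```

In the toy matrix, docs 0 and 1 merge first, doc 2 then joins through its link to doc 1, and doc 3 stays a singleton because none of its similarities exceeds γ.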
Latent Dirichlet Allocation (LDA)
LDA has been applied to many NLP tasks such as summarization and text classification
In our approach, the learned topics can represent the underlying entities of the ambiguous names
Generative story:
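For reference, the generative process of standard LDA can be sketched as below. Whether the authors adapted it for NIL query clustering is not stated here, so this should be read as the textbook model only. The vocabulary, the per-topic word probabilities and the function names are made up for illustration, with each topic standing for one underlying entity.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Standard LDA generative story for a single document.

    1. Draw topic proportions theta ~ Dirichlet(alpha).
    2. For each word position, draw a topic z ~ Multinomial(theta),
       then draw a word w ~ Multinomial(topic_word[z]).
    """
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words


# Hypothetical vocabulary and P(word | topic); topic 0 and topic 1 stand for
# two different entities sharing the same ambiguous name.
vocab = ["president", "texas", "election", "professor", "research"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]
rng = random.Random(0)
theta, words = generate_document(20, [0.5, 0.5], topic_word, vocab, rng)
print(theta, words)
```

Inference runs this story in reverse: given the observed words of each document, it estimates the topic proportions, and the dominant topic of a document can then be read as the entity that the ambiguous name refers to.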
Three Clustering Systems Combination
  A three-class SVM classifier decides which system to trust
  Features: scores given by the three systems
Combine with the system of the MSRA team at the KB linking step
  A binary SVM classifier decides which system to trust
  Features: scores given by the two systems
Experiment for Three Clustering Algorithms
Algorithms     Eval 09   Eval 10   Eval 10+
SGP            0.745     0.954     0.809
HAC            0.666     0.950     0.789
LDA            0.782     0.981     0.841
Combination    0.795     0.982     0.852
Submissions
Systems    Acc.    Precision   Recall   F1
Full       0.863   0.815       0.849    0.831
Partial    0.844   0.797       0.829    0.813
Highest    -       -           -        0.846
Median     -       -           -        0.716
Conclusion
Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
  Acronym Expansion
  Semantic Features
  Instance Selection
Investigate three algorithms for NIL query clustering
  Spectral Graph Partitioning (SGP)
  Hierarchical Agglomerative Clustering (HAC)
  Latent Dirichlet Allocation (LDA)