Page 1

A Two Tier Framework for Context-Aware Service Organization & Discovery

I2R-NUS-MSRA at TAC 2011: Entity Linking

Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1

1 National University of Singapore
2 Institute for Infocomm Research
3 Microsoft Research Asia

Text Analysis Conference, November 14-15, 2011

Page 2

Outline

I2R-NUS team at TAC
  Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
    Acronym Expansion
    Semantic Features
    Instance Selection
  Investigate three algorithms for NIL query clustering
    Spectral Graph Partitioning (SGP)
    Hierarchical Agglomerative Clustering (HAC)
    Latent Dirichlet Allocation (LDA)

Combination system
  Offline combination with the system of the MSRA team at the KB linking step

Page 4

Acronym Expansion - Motivation

Expanding an acronym from its context reduces the ambiguity of a name.
E.g. TSE in Wikipedia refers to 33 entries, whereas Tokyo Stock Exchange is unambiguous.

Page 5

Step 1 – Find Expansion Candidates

Identifying Candidate Expansions (e.g. for ACM)

Page 6

Step 2 – Candidate Expansions Ranking

Using an SVM classifier to rank the candidates
  Number of common characters between the acronym and the leading characters of the expansion
  Sentence distance between the acronym and the expansion

Our SVM-based acronym expansion
  can link acronyms and full strings that appear in different sentences of the article
  can handle acronyms with swapped letters, e.g. Communist Party of China vs. CCP
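The two ranking features named on this slide can be sketched as below. This is a minimal illustration, not the authors' feature extractor: the function names, the whitespace tokenization and the order-insensitive letter counting (chosen so that a swapped-letter acronym such as CCP still matches) are all assumptions.

```python
from collections import Counter


def common_leading_chars(acronym: str, expansion: str) -> int:
    """Count acronym letters that also occur among the initial letters of the
    expansion's words. Counting is order-insensitive (an assumption), so
    swapped letters still match."""
    initials = Counter(w[0].upper() for w in expansion.split() if w[0].isalpha())
    letters = Counter(c.upper() for c in acronym if c.isalpha())
    # Multiset intersection keeps the smaller count for each letter.
    return sum((initials & letters).values())


def sentence_distance(acronym_sentence_idx: int, expansion_sentence_idx: int) -> int:
    """Number of sentences between the acronym and the candidate expansion."""
    return abs(acronym_sentence_idx - expansion_sentence_idx)


# A feature vector for one (acronym, candidate) pair, to be fed to a ranker.
features = [
    common_leading_chars("CCP", "Communist Party of China"),
    sentence_distance(4, 1),
]
```

A real system would feed such vectors, together with other features, to the SVM ranker; the slide does not list the full feature set.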

Page 8

Related Work on Context Similarity

The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection

Term matching (Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010)

However, apart from the name itself, the following sentences share no terms (no term match):

1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

2) Michael Jordan is currently a full professor at the University of California, Berkeley.

3) Michael Jordan (born February, 1963) is a former American professional basketball player.

4) Michael Jordan wins NBA MVP of 91-92 season.
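The failure of term matching on these sentences can be checked directly. The sketch below is only an illustration: the tokenizer, the tiny stop word list and the function name `content_terms` are assumptions, not part of the cited systems.

```python
import re

# Hypothetical stop word list, kept tiny for the example.
STOP_WORDS = {"is", "a", "in", "and", "at", "the", "of"}


def content_terms(sentence: str, query_name: str) -> set:
    """Lower-cased tokens of the sentence, without stop words and without
    the tokens of the query name itself."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    name_tokens = set(re.findall(r"[a-z0-9]+", query_name.lower()))
    return {t for t in tokens if t not in STOP_WORDS and t not in name_tokens}


s1 = "Michael Jordan is a leading researcher in machine learning and artificial intelligence."
s2 = "Michael Jordan is currently a full professor at the University of California, Berkeley."
s3 = "Michael Jordan (born February, 1963) is a former American professional basketball player."
s4 = "Michael Jordan wins NBA MVP of 91-92 season."

name = "Michael Jordan"
overlap_12 = content_terms(s1, name) & content_terms(s2, name)
overlap_34 = content_terms(s3, name) & content_terms(s4, name)
```

Both overlaps are empty, so a term-matching similarity cannot tell that sentences 1 and 2 describe one person and sentences 3 and 4 another.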

Page 9

Our System - A Wikipedia-LDA model

1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

2) Michael Jordan is currently a full professor at the University of California, Berkeley.

Topic: Science

3) Michael Jordan (born February, 1963) is a former American professional basketball player.

4) Michael Jordan wins NBA MVP of 91-92 season.

Topic: Basketball

Page 10

Wikipedia – LDA Model

[Figure: the Wikipedia-LDA model, with P(word i | category j) for each category and P(category i | document j) for each document]
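How the two distributions allow a topic-level match can be sketched with a toy example. Everything here is an assumption made for illustration: the two categories, the invented P(word | category) values, and the simple EM-style estimate of P(category | document) with the word distributions held fixed, which stands in for proper LDA inference.

```python
import math

# Toy P(word | category) table; the numbers are invented for illustration.
P_WORD_GIVEN_CAT = {
    "Science": {"researcher": 0.3, "learning": 0.2, "professor": 0.3, "university": 0.2},
    "Basketball": {"basketball": 0.4, "nba": 0.3, "mvp": 0.2, "player": 0.1},
}


def category_mixture(words, p_word_given_cat, n_iter=20, eps=1e-6):
    """Estimate P(category | document) by EM with P(word | category) fixed."""
    cats = sorted(p_word_given_cat)
    theta = {c: 1.0 / len(cats) for c in cats}
    # Words unknown to every category carry no evidence and are skipped.
    known = [w for w in words if any(w in p_word_given_cat[c] for c in cats)]
    if not known:
        return theta
    for _ in range(n_iter):
        counts = dict.fromkeys(cats, 0.0)
        for w in known:
            resp = {c: theta[c] * p_word_given_cat[c].get(w, eps) for c in cats}
            z = sum(resp.values())
            for c in cats:
                counts[c] += resp[c] / z
        theta = {c: counts[c] / len(known) for c in cats}
    return theta


def cosine(p, q):
    """Cosine similarity between two category distributions."""
    dot = sum(p[c] * q[c] for c in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


t1 = category_mixture("leading researcher machine learning artificial intelligence".split(), P_WORD_GIVEN_CAT)
t2 = category_mixture("currently full professor university california berkeley".split(), P_WORD_GIVEN_CAT)
t3 = category_mixture("former american professional basketball player".split(), P_WORD_GIVEN_CAT)
```

Sentences 1 and 2 share no terms, yet their category distributions are nearly identical, while sentence 3 lands in a different category.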

Page 11

Wikipedia – LDA Model

1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

2) Michael Jordan is currently a full professor at the University of California, Berkeley.

3) Michael Jordan (born February, 1963) is a former American professional basketball player.

4) Michael Jordan wins NBA MVP of 91-92 season.

Page 13

Related Work

Vector Space Model
  Difficult to combine bag of words (BOW) with other features
  Performance needs to be improved

Supervised Approaches
  Using manually annotated training instances (Dredze et al., 2010; Zheng et al., 2010)
  Using automatically generated training instances (Zhang et al., 2010)

Page 14

Related Work

Auto-generated training instances (Zhang et al., 2010)

(News Article) Obama Campaign Drops The George W. Bush Talking Point …

Page 15

Related Work

From “George W. Bush” articles
  No positive instances are generated for “George H. W. Bush”, “George P. Bush” and “George Washington Bush”
  No negative instances are generated for “George W. Bush”

Such positive/negative training instance distributions may differ from those of the original ambiguous cases in the raw text collection.

The distribution of the unambiguous mentions may also differ from that of the test data.
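The skew described on this slide can be illustrated with a small sketch. It assumes, as a simplification that the slides do not spell out, that auto-generation rewrites an unambiguous mention into the ambiguous name and labels every candidate entity against it. The function name, the candidate list and the ambiguous name "Bush" are hypothetical.

```python
def auto_generate_instances(unambiguous_name, ambiguous_name, candidates, doc_text):
    """Return (text, candidate, label) triples for one document.

    Assumption: the unambiguous mention is replaced by the ambiguous name,
    the entity it named becomes the positive instance (label 1), and all
    other candidates become negative instances (label 0)."""
    if unambiguous_name not in doc_text:
        return []
    rewritten = doc_text.replace(unambiguous_name, ambiguous_name)
    return [(rewritten, cand, int(cand == unambiguous_name)) for cand in candidates]


candidates = [
    "George W. Bush",
    "George H. W. Bush",
    "George P. Bush",
    "George Washington Bush",
]
doc = "Obama Campaign Drops The George W. Bush Talking Point"
instances = auto_generate_instances("George W. Bush", "Bush", candidates, doc)
labels = {cand: label for _, cand, label in instances}
```

Under this assumption, a “George W. Bush” article only ever yields a positive instance for “George W. Bush” and negative instances for the other three entities, which is the distribution problem the slide points out.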

Page 16

The Approach in Our System

An instance selection approach
  Select an informative, representative, and diverse subset from the auto-generated data set
  Reduce the effect of the distribution differences

Page 17

Instance Selection

1) Train an SVM classifier on a small initial data set
2) Test the classifier on the auto-generated data set
3) Select informative, representative and diverse instances
4) Add these selected instances to the initial data set

[Figure: 2-D data set illustration with the SVM hyperplane]
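One selection round can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the paper: informativeness is approximated by closeness to the hyperplane, diversity by a cosine similarity cap, representativeness is left out, and the names `select_batch`, `decision_value` and `max_sim` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_batch(pool, decision_value, batch_size, max_sim=0.95):
    """Pick up to batch_size instances from the auto-generated pool.

    Informative: instances with a small |decision value| come first.
    Diverse: an instance is skipped if it is too similar to one already chosen."""
    ranked = sorted(pool, key=lambda x: abs(decision_value(x)))
    chosen = []
    for x in ranked:
        if len(chosen) >= batch_size:
            break
        if all(cosine(x, c) < max_sim for c in chosen):
            chosen.append(x)
    return chosen


# Toy 2-D pool; the stand-in classifier has the hyperplane x0 = 0.
pool = [(0.1, 1.0), (0.2, 2.0), (-0.3, -1.0), (3.0, 0.5), (5.0, 5.0)]
batch = select_batch(pool, decision_value=lambda x: x[0], batch_size=2)
```

The second point lies in the same direction as the first and is skipped, so the batch contains the first and third points; these would then be added to the initial data set before the classifier is retrained.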

Page 19

Spectral Clustering

Advantages over other clustering techniques
  Globally optimized results
  Efficient in time and space
  Generally produces a better result

Success in many areas
  Image segmentation
  Gene expression clustering

Page 20

Spectral Clustering

Eigen decomposition on the graph Laplacian: $A = Q \Lambda Q^{-1}$
Dimensionality reduction (Luxburg, 2006)

[Figure: clusters labelled George W. Bush and George H.W. Bush]
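A two-way spectral partition can be sketched without any linear algebra library. This is an illustrative simplification, not the team's SGP system: it uses the unnormalized Laplacian L = D - W, finds only the second-smallest eigenvector by power iteration instead of a full eigen decomposition, and splits by sign. The function name and the toy similarity graph are assumptions.

```python
import math


def fiedler_partition(W, n_iter=200):
    """Split a weighted similarity graph in two using the eigenvector of the
    second-smallest eigenvalue of L = D - W (the Fiedler vector)."""
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2 * max(deg)  # upper bound on the eigenvalues of L

    def apply_shifted(v):
        # (c*I - L) v: the smallest eigenvalues of L become the largest ones.
        return [c * v[i] - (deg[i] * v[i] - sum(W[i][j] * v[j] for j in range(n)))
                for i in range(n)]

    v = [math.sin(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(n_iter):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant eigenvector (eigenvalue 0)
        v = apply_shifted(v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [1 if x >= 0 else 0 for x in v]


# Toy graph: two tightly linked groups of three docs joined by one weak edge.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = fiedler_partition(W)
```

Which group receives label 0 or 1 depends on the sign of the eigenvector, so only the grouping itself is meaningful.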

Page 21

Hierarchical Agglomerative Clustering

Convert a doc into a feature vector: Wikipedia concepts, bag-of-words and named entities.

Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010):
  This model shows good performance in Web People Search.
  In our work, the original query name, its Wikipedia redirected names and its coreference chain mentions are all considered as appearances of the query name in the text.

Similarity scores: cosine similarity and overlap similarity.
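The two similarity scores can be sketched over sparse weighted feature vectors. The slide does not define overlap similarity; the version below assumes the common overlap coefficient (shared features divided by the size of the smaller feature set). The example vectors and weights are invented.

```python
import math


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def overlap_similarity(u: dict, v: dict) -> float:
    """Overlap coefficient on feature sets (an assumed definition)."""
    if not u or not v:
        return 0.0
    return len(set(u) & set(v)) / min(len(u), len(v))


doc_a = {"nba": 1.0, "mvp": 1.0}
doc_b = {"nba": 1.0, "season": 1.0}
cos_ab = cosine_similarity(doc_a, doc_b)
ovl_ab = overlap_similarity(doc_a, doc_b)
```

In the described system the weights would come from the Query Relevance Weighting Model rather than being set to 1.0.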

Page 22

Hierarchical Agglomerative Clustering

Docs that refer to the same entity are clustered according to pair-wise doc similarity scores.
  Start with singletons: each doc is a cluster.
  If there are two docs D and D' in clusters Ci and Cj respectively, the two clusters Ci and Cj are merged to form a new cluster Cij if Sim(D, D') > γ.
  Calculate the similarity between the new cluster Cij and all remaining clusters.

γ = 0.25
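The merging rule above can be sketched as a threshold-based agglomerative loop. This is a minimal reading of the slide, assuming that one doc pair above γ is enough to merge two clusters; the Jaccard similarity and the toy docs are stand-ins for the weighted cosine and overlap scores of the actual system.

```python
def jaccard(a: set, b: set) -> float:
    """Stand-in similarity for the example (not the system's own score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_hac(docs, sim, gamma=0.25):
    """Start from singletons and merge two clusters whenever some doc pair
    across them has similarity above gamma. Returns lists of doc indices."""
    clusters = [[i] for i in range(len(docs))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(sim(docs[i], docs[j]) > gamma
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a] = clusters[a] + clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break  # rescan from the start after every merge
    return clusters


docs = [
    {"nba", "mvp"},
    {"nba", "season"},
    {"professor", "berkeley"},
    {"professor", "learning"},
]
clusters = threshold_hac(docs, jaccard, gamma=0.25)
```

Rescanning after each merge plays the role of recalculating the similarity between the new cluster and all remaining clusters.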

Page 23

Latent Dirichlet Allocation (LDA)

LDA has been applied to many NLP tasks, such as summarization and text classification.
In our approach, the learned topics can represent the underlying entities of the ambiguous names.
Generative story:
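The generative story shown on the slide is not preserved in this transcript. As a point of reference, the standard LDA generative process can be sketched as below; it may differ from the variant the authors used. The two topics stand for entities, and their word probabilities, the symmetric prior `alpha` and the function names are invented for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Standard LDA story: draw the doc's topic mixture theta, then for each
    word draw a topic z from theta and a word w from that topic."""
    topics = sorted(topic_word)
    theta = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]
        vocab = sorted(topic_word[z])
        w = rng.choices(vocab, weights=[topic_word[z][t] for t in vocab])[0]
        words.append((z, w))
    return theta, words


# Toy topics, one per underlying entity of the ambiguous name.
TOPIC_WORD = {
    "George W. Bush": {"iraq": 0.5, "texas": 0.3, "2004": 0.2},
    "George H. W. Bush": {"gulf": 0.5, "cia": 0.3, "1988": 0.2},
}
theta, words = generate_document(TOPIC_WORD, alpha=1.0, n_words=8, rng=random.Random(0))
```

Clustering with such a model amounts to inverting this story: inferring, for each doc that mentions the ambiguous name, which topic (entity) most likely produced it.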

Page 24

Three Clustering Systems Combination
  A three-class SVM classifier decides which system is to be trusted
  Features: scores given by the three systems

Combine with the system of the MSRA team at the KB linking step
  A binary SVM classifier decides which system is to be trusted
  Features: scores given by the two systems

Page 25

Experiment for Three Clustering Algorithms

Algorithms    Eval 09   Eval 10   Eval 10+
SGP           0.745     0.954     0.809
HAC           0.666     0.950     0.789
LDA           0.782     0.981     0.841
Combination   0.795     0.982     0.852

Page 26

Submissions

Systems   Acc.    Precision   Recall   F1
Full      0.863   0.815       0.849    0.831
Partial   0.844   0.797       0.829    0.813
Highest   -       -           -        0.846
Median    -       -           -        0.716
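As a quick consistency check, the F1 column can be recomputed from the precision and recall columns, assuming F1 is the usual harmonic mean of the two. Because the inputs are already rounded to three decimals, agreement is only expected to within about 0.001.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the table above.
rows = {
    "Full": (0.815, 0.849, 0.831),
    "Partial": (0.797, 0.829, 0.813),
}
diffs = {name: abs(f1_score(p, r) - reported) for name, (p, r, reported) in rows.items()}
```

Both recomputed values fall within 0.001 of the reported F1 scores.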

Page 27

Conclusion

Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
  Acronym Expansion
  Semantic Features
  Instance Selection

Investigate three algorithms for NIL query clustering
  Spectral Graph Partitioning (SGP)
  Hierarchical Agglomerative Clustering (HAC)
  Latent Dirichlet Allocation (LDA)