ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015


Page 1

Latent Domain Word Alignment for Heterogeneous Corpora

Hoang Cuong

Joint work with Khalil Sima’an, appearing at NAACL 2015. ILLC, University of Amsterdam.

Page 2

1. An Introduction

Bitext word alignment

Alignment task: identifying translation relationships among the words in parallel sentences.

Proposed by [Brown et al. (1993)], it has turned out to be one of the most important tasks in Natural Language Processing.

Page 3

Bitext word alignment

Figure: Statistical Alignment Framework (a), cf. [Brown et al. (1993)]: Bilingual Data → Alignment Model → Viterbi Decoding.

Page 4

SMT with Mix-of-Domains Haystack

We have Big DATA to train SMT systems.

Thanks to Europarl, UN, Common Crawl, ...

Data come from very different domains.

How does this affect the alignment accuracy?

Bigger data ≠ better alignment quality

This is in fact not so surprising!

In domain adaptation, [Moore and Lewis (2010), Axelrod et al. (2011), Cuong and Sima’an (2014)] show that bigger data does not mean better translation!

Page 5

Word Alignment with Mix-of-Domains Haystack

Why? Haystack = too many different translations!

maestra → master (computer); maestra → teacher (education); maestra → dean (education); maestra → crack (other); maestra → ...

Suboptimal alignment quality has been repeatedly observed [Gao et al. (2011), Bach et al. (2008), Banerjee et al. (2012)].

Page 6

How to overcome this problem?

Page 7

Disentangling the Subdomains

Figure: Statistical Alignment Framework (a) vs. Statistical Latent Domain Alignment Framework (b). In (a): Bilingual Data → Model → Viterbi Decoding. In (b): Bilingual Data → Model_1 ... Model_i ... Model_K, one per Domain_1 ... Domain_i ... Domain_K → Viterbi Decoding.

Page 8

Disentangling the Subdomains

Technical contributions

“Splitting” alignment statistics P(f, a | e) into different domain-sensitive alignment statistics P(f, a | e, D) with latent variable D

Combining domain-sensitive alignment statistics

Page 9

“Splitting” alignment statistics

Figure: HMM alignment model (a), with an observed layer (source words f_{j−1}, f_j, f_{j+1}) and a latent alignment layer (alignments a_{j−1}, a_j, a_{j+1} to target words).
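The HMM model above can be made concrete with a small Viterbi decoder. This is only a toy sketch of an HMM alignment model of this shape, not the authors' implementation; the translation and jump-probability tables below are invented for illustration.

```python
import numpy as np

def viterbi_align(src, tgt, t_prob, jump_prob):
    """Return the most probable alignment a_1..a_J (indices into tgt).

    t_prob[f][e] : emission probability P(f | e)
    jump_prob[d] : transition probability for a jump of size d between
                   consecutive alignment positions (tiny value if absent).
    """
    J, I = len(src), len(tgt)
    delta = np.full((J, I), -np.inf)   # delta[j, i]: best log-score with a_j = i
    back = np.zeros((J, I), dtype=int)
    for i in range(I):
        delta[0, i] = np.log(t_prob[src[0]][tgt[i]])
    for j in range(1, J):
        for i in range(I):
            scores = [delta[j - 1, k] + np.log(jump_prob.get(i - k, 1e-12))
                      for k in range(I)]
            back[j, i] = int(np.argmax(scores))
            delta[j, i] = scores[back[j, i]] + np.log(t_prob[src[j]][tgt[i]])
    # Backtrace from the best final state
    a = [int(np.argmax(delta[J - 1]))]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return a[::-1]

# Toy example echoing the "maestra" slide: "la maestra" -> "the teacher"
t_prob = {"la": {"the": 0.9, "teacher": 0.1},
          "maestra": {"the": 0.1, "teacher": 0.9}}
jump_prob = {-1: 0.2, 0: 0.2, 1: 0.6}
alignment = viterbi_align(["la", "maestra"], ["the", "teacher"], t_prob, jump_prob)
```

With these toy tables the decoder picks the monotone alignment [0, 1].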

Page 10

“Splitting” alignment statistics

Figure: Latent domain HMM alignment model: the HMM model above, extended with an additional latent domain layer D, which is conditioned on by both of the other two layers (the observed source words f_{j−1}, f_j, f_{j+1} and the latent alignments a_{j−1}, a_j, a_{j+1}).

Page 11

Likelihood

Likelihood: L ∝ ∑_{⟨f, e⟩} ∑_D P(D) ( P(f | e, D) P(e | D) + P(e | f, D) P(f | D) )

A joint model between language models and translation models.

Too complex to train, unfortunately (we cannot learn it from scratch for now!).

Deep Neural Networks might help (as suggested by the speaker in the talk)!
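To see the structure of this likelihood, here is a toy computation of the per-sentence-pair term; the domain names and all probability values are made up purely for illustration.

```python
# Per-sentence-pair term of the likelihood:
#   sum_D P(D) * ( P(f|e,D) P(e|D) + P(e|f,D) P(f|D) )
# One translation-model and one language-model factor per direction,
# mixed over latent domains D.
P_D          = {"legal": 0.3, "pharmacy": 0.2, "out": 0.5}   # domain priors
P_f_given_eD = {"legal": 1e-4, "pharmacy": 5e-4, "out": 2e-4}
P_e_given_D  = {"legal": 1e-3, "pharmacy": 8e-4, "out": 9e-4}
P_e_given_fD = {"legal": 2e-4, "pharmacy": 6e-4, "out": 1e-4}
P_f_given_D  = {"legal": 1e-3, "pharmacy": 7e-4, "out": 8e-4}

term = sum(P_D[d] * (P_f_given_eD[d] * P_e_given_D[d]
                     + P_e_given_fD[d] * P_f_given_D[d])
           for d in P_D)
```

Each domain contributes both a translation-model factor and a language-model factor, which is why the slide calls this a joint model of the two.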

Page 12

Learning

Our temporary solution: EM with Partial Supervision

Number of Domains: The value of D ∈ [1..(N + 1)] depends on the N available seed samples whose domain we know in advance, plus the so-called “out-domain”.

Parameter Constraints: We keep the domain prior parameters fixed for all sentence pairs that belong to seed samples.
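The two constraints above can be sketched as a tiny EM loop. This is a hypothetical setup, not the paper's implementation: each sentence pair is summarized by a stand-in likelihood vector, and partial supervision is realized by clamping the domain posteriors of the seed pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 6, 3                      # 6 sentence pairs, 3 domains (2 seeded + out-domain)
lik = rng.random((S, K)) + 0.1   # stand-in for P(pair_s | D = d) under current models
seed_domain = {0: 0, 1: 1}       # pairs 0 and 1 have a known domain

prior = np.full(K, 1.0 / K)      # P(D), initialized uniform
for _ in range(20):              # EM iterations
    # E-step: posterior over domains for each sentence pair
    post = lik * prior
    post /= post.sum(axis=1, keepdims=True)
    # Partial supervision: clamp the posteriors of the seed pairs
    for s, d in seed_domain.items():
        post[s] = 0.0
        post[s, d] = 1.0
    # M-step: re-estimate the domain prior from the (clamped) posteriors
    prior = post.sum(axis=0) / S
```

In the real model the M-step would also re-estimate the domain-sensitive translation and transition tables; only the prior update is shown here.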

Page 13

Combining domain-sensitive alignment statistics

â = argmax_a ∑_D P(f, a, D | e)
  = argmax_a ∑_D P(f, a | e, D) P(D | e)
  = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Unfortunately, the decoding problem is NP-hard (see [DeNero and Macherey (2011), Chang et al. (2014)]).

Page 14

Combining domain-sensitive alignment statistics

â = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Two potential solutions:

Lagrangian relaxation-based decoder (ack, I don’t want to implement this!)

Defining an approximate objective function, e.g., its lower bound (this work!):

â = argmax_a ∏_D P(f, a | e, D) P(e | D) P(D)
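In log-space the product over domains becomes a sum of per-domain scores, which makes the lower-bound objective easy to sketch. The toy decoder below enumerates alignments exhaustively for a tiny sentence pair (a real decoder would use dynamic programming), and the per-domain scoring function is invented for illustration.

```python
from itertools import product

def combined_decode(src_len, tgt_len, num_domains, log_score):
    """Maximize sum_D log_score(a, D), i.e. the product over domains."""
    candidates = product(range(tgt_len), repeat=src_len)
    return max(candidates,
               key=lambda a: sum(log_score(a, d) for d in range(num_domains)))

def log_score(a, d):
    """Toy stand-in for log P(f, a | e, D) + log P(e | D) + log P(D)."""
    if d == 0:                        # domain 0 prefers monotone alignments
        return -sum(abs(j - a_j) for j, a_j in enumerate(a))
    return -0.1 * len(a)              # domain 1 is indifferent

best = combined_decode(src_len=3, tgt_len=3, num_domains=2, log_score=log_score)
```

Because the combination is a sum of log-scores, every domain-sensitive model gets a vote on each candidate alignment; here domain 0 dominates and the monotone alignment (0, 1, 2) wins.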

Page 15

Data Preparation

C_mix: Legal, Pharmacy, and Hardware subsets, plus the rest (3.7M).

Training the latent domain alignment model with the prior knowledge derived from the domain information of the three subsets, comparing alignment accuracy to the baseline.

Page 16

Alignment results

1 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 66.95   61.29  36.00
Latent    Pharmacy (100K)   67.85   61.72  35.36
Latent    Legal (100K)      67.57   62.29  35.17
Latent    Hardware (100K)   69.41   63.58  33.63
Latent    ALL (300K)        69.64   63.30  33.68

Page 17

Alignment results

2 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 68.34   61.58  35.22
Latent    Pharmacy (100K)   68.85   62.58  34.43
Latent    Legal (100K)      69.98   64.01  33.13
Latent    Hardware (100K)   69.45   63.23  33.81
Latent    ALL (300K)        71.51   63.87  32.53

Page 18

Alignment results

4 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 69.37   64.30  33.26
Latent    Pharmacy (100K)   69.69   62.80  33.94
Latent    Legal (100K)      70.51   63.94  32.93
Latent    Hardware (100K)   71.75   64.44  32.10
Latent    ALL (300K)        72.16   64.30  31.99

Page 19

Discussion

Word alignment should involve latent concepts representing domains of data.

We present the benefits: with the latent domain, the more we know about the data, the better we can improve the performance.

We strongly believe this should be applicable to any statistical model, not limited to alignment models only.

Challenge: Can we learn the latent domain (alignment) models from scratch?

Page 20

Bibliography

Bibliography I

Amittai Axelrod, Xiaodong He, and Jianfeng Gao.

Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP, 2011.

Nguyen Bach, Qin Gao, and Stephan Vogel.

Improving word alignment with language model based confidence scores. In Proceedings of the Third Workshop on Statistical Machine Translation, 2008.

Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith.

Translation quality-based supplementary data selection by incremental update of translation models. In Martin Kay and Christian Boitet, editors, COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, pages 149–166. Indian Institute of Technology Bombay, 2012.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer.

The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.

Yin-Wen Chang, Alexander M. Rush, John DeNero, and Michael Collins.

A constrained Viterbi relaxation for bidirectional word alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-1139.

Page 21

Bibliography II

Hoang Cuong and Khalil Sima’an.

Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING, 2014.

John DeNero and Klaus Macherey.

Model-based aligner combination using dual decomposition. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011. URL http://dl.acm.org/citation.cfm?id=2002472.2002526.

Qin Gao, Will Lewis, Chris Quirk, and Mei-Yuh Hwang.

Incremental training and intentional over-fitting of word alignment. In Proceedings of MT Summit XIII. Asia-Pacific Association for Machine Translation, September 2011. URL http://research.microsoft.com/apps/pubs/default.aspx?id=153368.

Robert C. Moore and William Lewis.

Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858842.1858883.
