ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015


Page 1

Latent Domain Word Alignment for Heterogeneous Corpora

Hoang Cuong

Joint work with Khalil Sima’an, appearing at NAACL 2015. ILLC, University of Amsterdam.

Page 2

1. An Introduction

Bitext word alignment

Alignment task: identifying translation relationships among the words in parallel sentences.

Proposed by [Brown et al. (1993)], it has turned out to be one of the most important tasks in Natural Language Processing.

Page 3

Bitext word alignment

Figure: Statistical Alignment Framework (a), cf. [Brown et al. (1993)]: Bilingual Data → Alignment Model → Viterbi Decoding.

Page 4

SMT with Mix-of-Domains Haystack

We have Big DATA to train SMT systems.

Thanks to Europarl, UN, Common Crawl, ...

Data come from very different domains.

How does this affect the alignment accuracy?

Bigger data ≠ better alignment quality

This is in fact not so surprising!

In domain adaptation, [Moore and Lewis (2010), Axelrod et al. (2011), Cuong and Sima’an (2014)] show that bigger data does not mean better translation!

Page 5

Word Alignment with Mix-of-Domains Haystack

Why? Haystack = too many different translations!

maestra → master (computer); maestra → teacher (education); maestra → dean (education); maestra → crack (other); maestra → ...

Suboptimal alignment quality has been repeatedly observed [Gao et al. (2011), Bach et al. (2008), Banerjee et al. (2012)].

Page 6

How to overcome this problem?

Page 7

Disentangling the Subdomains

Figure: Statistical Alignment Framework (a) vs. Statistical Latent Domain Alignment Framework (b). In (a): Bilingual Data → Model → Viterbi Decoding. In (b): Bilingual Data → Model_1 ... Model_i ... Model_K, one per Domain_1 ... Domain_i ... Domain_K → Viterbi Decoding.

Page 8

Disentangling the Subdomains

Technical contributions

“Splitting” alignment statistics P(f, a | e) into different domain-sensitive alignment statistics P(f, a | e, D) with latent variable D

Combining domain-sensitive alignment statistics

Page 9

“Splitting” alignment statistics

Figure: HMM alignment model (a), with an observed layer (source words f_{j−1}, f_j, f_{j+1}) and a latent alignment layer (alignments a_{j−1}, a_j, a_{j+1} to target words).
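The HMM model above can be made concrete with a small Viterbi decoder. This is only a toy sketch of an HMM alignment model of this shape, not the authors' implementation; the translation and jump-probability tables below are invented for illustration.

```python
import numpy as np

def viterbi_align(src, tgt, t_prob, jump_prob):
    """Return the most probable alignment a_1..a_J (indices into tgt).

    t_prob[f][e] : emission probability P(f | e)
    jump_prob[d] : transition probability for a jump of size d between
                   consecutive alignment positions (tiny value if absent).
    """
    J, I = len(src), len(tgt)
    delta = np.full((J, I), -np.inf)   # delta[j, i]: best log-score with a_j = i
    back = np.zeros((J, I), dtype=int)
    for i in range(I):
        delta[0, i] = np.log(t_prob[src[0]][tgt[i]])
    for j in range(1, J):
        for i in range(I):
            scores = [delta[j - 1, k] + np.log(jump_prob.get(i - k, 1e-12))
                      for k in range(I)]
            back[j, i] = int(np.argmax(scores))
            delta[j, i] = scores[back[j, i]] + np.log(t_prob[src[j]][tgt[i]])
    # Backtrace from the best final state
    a = [int(np.argmax(delta[J - 1]))]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return a[::-1]

# Toy example echoing the "maestra" slide: "la maestra" -> "the teacher"
t_prob = {"la": {"the": 0.9, "teacher": 0.1},
          "maestra": {"the": 0.1, "teacher": 0.9}}
jump_prob = {-1: 0.2, 0: 0.2, 1: 0.6}
alignment = viterbi_align(["la", "maestra"], ["the", "teacher"], t_prob, jump_prob)
```

With these toy tables the decoder picks the monotone alignment [0, 1].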

Page 10

“Splitting” alignment statistics

Figure: Latent domain HMM alignment model: the HMM model above, extended with an additional latent domain layer D, which is conditioned on by both of the other two layers (the observed source words f_{j−1}, f_j, f_{j+1} and the latent alignments a_{j−1}, a_j, a_{j+1}).

Page 11

Likelihood

Likelihood: L ∝ ∑_{⟨f, e⟩} ∑_D P(D) ( P(f | e, D) P(e | D) + P(e | f, D) P(f | D) )

A joint model between language models and translation models.

Too complex to train, unfortunately (we cannot learn it from scratch for now!).

Deep Neural Networks might help (as suggested by the speaker in the talk)!
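To see the structure of this likelihood, here is a toy computation of the per-sentence-pair term; the domain names and all probability values are made up purely for illustration.

```python
# Per-sentence-pair term of the likelihood:
#   sum_D P(D) * ( P(f|e,D) P(e|D) + P(e|f,D) P(f|D) )
# One translation-model and one language-model factor per direction,
# mixed over latent domains D.
P_D          = {"legal": 0.3, "pharmacy": 0.2, "out": 0.5}   # domain priors
P_f_given_eD = {"legal": 1e-4, "pharmacy": 5e-4, "out": 2e-4}
P_e_given_D  = {"legal": 1e-3, "pharmacy": 8e-4, "out": 9e-4}
P_e_given_fD = {"legal": 2e-4, "pharmacy": 6e-4, "out": 1e-4}
P_f_given_D  = {"legal": 1e-3, "pharmacy": 7e-4, "out": 8e-4}

term = sum(P_D[d] * (P_f_given_eD[d] * P_e_given_D[d]
                     + P_e_given_fD[d] * P_f_given_D[d])
           for d in P_D)
```

Each domain contributes both a translation-model factor and a language-model factor, which is why the slide calls this a joint model of the two.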

Page 12

Learning

Our temporary solution: EM with Partial Supervision

Number of Domains: The value of D ∈ [1..(N + 1)] depends on the N available seed samples whose domain we know in advance, plus the so-called “out-domain”.

Parameter Constraints: We keep the domain prior parameters fixed for all sentence pairs that belong to seed samples.
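The two constraints above can be sketched as a tiny EM loop. This is a hypothetical setup, not the paper's implementation: each sentence pair is summarized by a stand-in likelihood vector, and partial supervision is realized by clamping the domain posteriors of the seed pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 6, 3                      # 6 sentence pairs, 3 domains (2 seeded + out-domain)
lik = rng.random((S, K)) + 0.1   # stand-in for P(pair_s | D = d) under current models
seed_domain = {0: 0, 1: 1}       # pairs 0 and 1 have a known domain

prior = np.full(K, 1.0 / K)      # P(D), initialized uniform
for _ in range(20):              # EM iterations
    # E-step: posterior over domains for each sentence pair
    post = lik * prior
    post /= post.sum(axis=1, keepdims=True)
    # Partial supervision: clamp the posteriors of the seed pairs
    for s, d in seed_domain.items():
        post[s] = 0.0
        post[s, d] = 1.0
    # M-step: re-estimate the domain prior from the (clamped) posteriors
    prior = post.sum(axis=0) / S
```

In the real model the M-step would also re-estimate the domain-sensitive translation and transition tables; only the prior update is shown here.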

Page 13

Combining domain-sensitive alignment statistics

â = argmax_a ∑_D P(f, a, D | e)
  = argmax_a ∑_D P(f, a | e, D) P(D | e)
  = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Unfortunately, the decoding problem is NP-hard (see [DeNero and Macherey (2011), Chang et al. (2014)]).

Page 14

Combining domain-sensitive alignment statistics

â = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Two potential solutions:

Lagrangian relaxation-based decoder (ack, I don’t want to implement this!)

Defining an approximate objective function, e.g., its lower bound (this work!):

â = argmax_a ∏_D P(f, a | e, D) P(e | D) P(D)
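In log-space the product over domains becomes a sum of per-domain scores, which makes the lower-bound objective easy to sketch. The toy decoder below enumerates alignments exhaustively for a tiny sentence pair (a real decoder would use dynamic programming), and the per-domain scoring function is invented for illustration.

```python
from itertools import product

def combined_decode(src_len, tgt_len, num_domains, log_score):
    """Maximize sum_D log_score(a, D), i.e. the product over domains."""
    candidates = product(range(tgt_len), repeat=src_len)
    return max(candidates,
               key=lambda a: sum(log_score(a, d) for d in range(num_domains)))

def log_score(a, d):
    """Toy stand-in for log P(f, a | e, D) + log P(e | D) + log P(D)."""
    if d == 0:                        # domain 0 prefers monotone alignments
        return -sum(abs(j - a_j) for j, a_j in enumerate(a))
    return -0.1 * len(a)              # domain 1 is indifferent

best = combined_decode(src_len=3, tgt_len=3, num_domains=2, log_score=log_score)
```

Because the combination is a sum of log-scores, every domain-sensitive model gets a vote on each candidate alignment; here domain 0 dominates and the monotone alignment (0, 1, 2) wins.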

Page 15

Data Preparation

C_mix: Legal, Pharmacy, and Hardware subsets, plus the rest (3.7M).

Training the latent domain alignment model with the prior knowledge derived from the domain information of the three subsets, comparing alignment accuracy to the baseline.

Page 16

Alignment results

1 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 66.95   61.29  36.00
Latent    Pharmacy (100K)   67.85   61.72  35.36
Latent    Legal (100K)      67.57   62.29  35.17
Latent    Hardware (100K)   69.41   63.58  33.63
Latent    ALL (300K)        69.64   63.30  33.68

Page 17

Alignment results

2 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 68.34   61.58  35.22
Latent    Pharmacy (100K)   68.85   62.58  34.43
Latent    Legal (100K)      69.98   64.01  33.13
Latent    Hardware (100K)   69.45   63.23  33.81
Latent    ALL (300K)        71.51   63.87  32.53

Page 18

Alignment results

4 Million
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 69.37   64.30  33.26
Latent    Pharmacy (100K)   69.69   62.80  33.94
Latent    Legal (100K)      70.51   63.94  32.93
Latent    Hardware (100K)   71.75   64.44  32.10
Latent    ALL (300K)        72.16   64.30  31.99

Page 19

Discussion

Word alignment should involve latent concepts representing domains of data.

We present the benefits: with the latent domain, the more we know about the data, the better we can improve the performance.

We strongly believe this should be applicable to any statistical model, not limited to alignment models only.

Challenge: Can we learn the latent domain (alignment) models from scratch?

Page 20

Bibliography

Bibliography I

Amittai Axelrod, Xiaodong He, and Jianfeng Gao.

Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP, 2011.

Nguyen Bach, Qin Gao, and Stephan Vogel.

Improving word alignment with language model based confidence scores. In Proceedings of the Third Workshop on Statistical Machine Translation, 2008.

Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith.

Translation quality-based supplementary data selection by incremental update of translation models. In Martin Kay and Christian Boitet, editors, COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, pages 149–166. Indian Institute of Technology Bombay, 2012.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer.

The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.

Yin-Wen Chang, Alexander M. Rush, John DeNero, and Michael Collins.

A constrained Viterbi relaxation for bidirectional word alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-1139.

Page 21

Bibliography II

Hoang Cuong and Khalil Sima’an.

Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING, 2014.

John DeNero and Klaus Macherey.

Model-based aligner combination using dual decomposition. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011. URL http://dl.acm.org/citation.cfm?id=2002472.2002526.

Qin Gao, Will Lewis, Chris Quirk, and Mei-Yuh Hwang.

Incremental training and intentional over-fitting of word alignment. In Proceedings of MT Summit XIII. Asia-Pacific Association for Machine Translation, September 2011. URL http://research.microsoft.com/apps/pubs/default.aspx?id=153368.

Robert C. Moore and William Lewis.

Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858842.1858883.
