ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
Latent Domain Word Alignment for Heterogeneous Corpora

Hoang Cuong
Joint work with Khalil Sima’an. Appearing at NAACL 2015. ILLC, University of Amsterdam.
1. An Introduction
Bitext word alignment
Alignment task: identifying translation relationships among the words in parallel sentences.
Proposed by [Brown et al., 1993], it has turned out to be one of the most important tasks in Natural Language Processing.
Bitext word alignment
Figure: Statistical Alignment Framework (a): Bilingual Data → Alignment Model → Viterbi Decoding, cf. [Brown et al., 1993].
SMT with Mix-of-Domains Haystack
We have Big DATA to train SMT systems.
Thanks to Europarl, UN, Common Crawl, ...
Data come from very different domains.
How does this affect the alignment accuracy?
Bigger data ≠ better alignment quality.
This is in fact not so surprising!
In domain adaptation, [Moore and Lewis, 2010; Axelrod et al., 2011; Cuong and Sima’an, 2014] show that bigger data does not mean better translation!
Word Alignment with Mix-of-Domains Haystack
Why? Haystack = too many different translations!
maestra → master (computer); maestra → teacher (education); maestra → dean (education); maestra → crack (other); maestra → ...
Suboptimal alignment quality has been repeatedly observed [Bach et al., 2008; Gao et al., 2011; Banerjee et al., 2012].
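The "haystack" intuition above can be made concrete with a toy sketch. The probability values below are invented for illustration only: a single translation table estimated on the mixed-domain corpus spreads its mass over many translations of "maestra", while hypothetical domain-conditioned tables P(f | e, D) concentrate it.

```python
import math

# Hypothetical lexical translation probabilities for "maestra".
# All numbers are made up to illustrate the slide's point.

# One table estimated on the mixed-domain haystack is diffuse:
p_mixed = {"teacher": 0.40, "master": 0.30, "dean": 0.20, "crack": 0.10}

# Domain-conditioned tables P(f | e, D) are much sharper:
p_by_domain = {
    "computer":  {"master": 0.85, "teacher": 0.10, "dean": 0.03, "crack": 0.02},
    "education": {"teacher": 0.60, "dean": 0.35, "master": 0.04, "crack": 0.01},
}

def entropy(dist):
    """Shannon entropy (bits) of a translation distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# The mixed table is more uncertain than either domain-specific table.
print(entropy(p_mixed) > entropy(p_by_domain["computer"]))   # True
print(entropy(p_mixed) > entropy(p_by_domain["education"]))  # True
```

Lower entropy per domain is exactly why disentangling the subdomains can sharpen alignment statistics.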
How to overcome this problem?
Disentangling the Subdomains
Figure: Statistical Alignment Framework (a) vs. Statistical Latent Domain Alignment Framework (b).
(a): Bilingual Data → Model → Viterbi Decoding.
(b): Bilingual Data → Model_1 ... Model_i ... Model_K → Viterbi Decoding → Domain_1 ... Domain_i ... Domain_K.
Disentangling the Subdomains
Technical contributions:
“Splitting” alignment statistics P(f, a | e) into different domain-sensitive alignment statistics P(f, a | e, D) with latent variable D.
Combining domain-sensitive alignment statistics.
“Splitting” alignment statistics
Figure: HMM alignment model (a): an observed layer of source words f_{j−1}, f_j, f_{j+1} and a latent alignment layer a_{j−1}, a_j, a_{j+1} (target words).
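The joint probability this figure depicts can be sketched directly. A minimal sketch, assuming toy probability tables in place of learned parameters (the table names and example words are ours, not from the talk):

```python
import math

def hmm_alignment_logprob(f, e, a, transition, emission):
    """log P(f, a | e) under an HMM alignment model:
        P(f, a | e) = prod_j P(a_j | a_{j-1}) * P(f_j | e_{a_j}).
    `transition` maps (prev_position, position) pairs to probabilities
    (prev_position is None at the first source word); `emission` maps
    (source_word, target_word) pairs to lexical probabilities."""
    logp, prev = 0.0, None
    for fj, aj in zip(f, a):
        logp += math.log(transition[(prev, aj)])   # jump probability
        logp += math.log(emission[(fj, e[aj])])    # lexical probability
        prev = aj
    return logp

# Toy example: align "the teacher" to "la maestra" with a = (0, 1).
transition = {(None, 0): 0.5, (0, 1): 0.5}
emission = {("the", "la"): 0.9, ("teacher", "maestra"): 0.8}
logp = hmm_alignment_logprob(["the", "teacher"], ["la", "maestra"],
                             [0, 1], transition, emission)
```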
“Splitting” alignment statistics
Figure: Latent domain HMM alignment model: the observed source-word layer f_{j−1}, f_j, f_{j+1}, the latent alignment layer a_{j−1}, a_j, a_{j+1} (target words), and an additional latent domain layer D. The new latent layer representing domains is conditioned on by both of the other two layers.
Likelihood
Likelihood: L ∝ ∑_{⟨f, e⟩} ∑_D P(D) [ P(f | e, D) P(e | D) + P(e | f, D) P(f | D) ]
A joint model between language models and translation models.
Too complex to train, unfortunately (we cannot learn it from scratch for now!).
Deep Neural Networks might help (as suggested during the talk)!
Learning
Our temporary solution: EM with Partial Supervision.
Number of Domains: D ranges over [1..(N + 1)], where N is the number of available seed samples whose domain we know in advance, plus the so-called “out-domain”.
Parameter Constraints: We keep the domain prior parameters fixed for all sentence pairs that belong to seed samples.
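The parameter constraint can be illustrated with a sketch of the E-step for a single sentence pair. This is our own minimal rendering of the idea, not the paper's implementation; all names are assumptions:

```python
def e_step_domain_posterior(likelihoods, prior, seed_domain=None):
    """E-step for one sentence pair: posterior P(D | f, e) is
    proportional to P(f, e | D) * P(D).  For seed samples whose domain
    is known in advance, the constraint above clamps the prior to that
    domain, so the posterior stays one-hot during training."""
    K = len(likelihoods)
    if seed_domain is not None:
        # Parameter constraint: domain prior fixed for seed samples.
        post = [0.0] * K
        post[seed_domain] = 1.0
        return post
    unnorm = [likelihoods[d] * prior[d] for d in range(K)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Unlabeled pair: posterior follows the data; seed pair: clamped.
print(e_step_domain_posterior([0.2, 0.1], [0.5, 0.5]))
print(e_step_domain_posterior([0.2, 0.1], [0.5, 0.5], seed_domain=1))
```

Only the unlabeled sentence pairs move the domain posteriors during EM; the seed pairs anchor each latent domain to its intended interpretation.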
Combining domain-sensitive alignment statistics
â = argmax_a ∑_D P(f, a, D | e)
  = argmax_a ∑_D P(f, a | e, D) P(D | e)
  = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Unfortunately, the decoding problem is NP-hard (see [DeNero and Macherey, 2011; Chang et al., 2014]).
Combining domain-sensitive alignment statistics
â = argmax_a ∑_D P(f, a | e, D) P(e | D) P(D).

Two potential solutions:
A Lagrangian relaxation-based decoder (ack, ack, I don’t want to implement this!!!).
Defining an approximate objective function, e.g., its lower bound (this work!):

â = argmax_a ∏_D P(f, a | e, D) P(e | D) P(D).
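Replacing the sum over domains with a product turns the objective into a sum of log-probabilities per candidate, which makes the argmax tractable. A minimal sketch over an enumerable candidate set, with all per-domain quantities assumed precomputed (the toy scores and names are ours):

```python
import math

def decode_product_bound(candidates, domain_scores, p_e_given_d, p_d):
    """Approximate decoder: instead of the NP-hard sum over domains,
    score each candidate alignment a by the product
        prod_D P(f, a | e, D) * P(e | D) * P(D)
    — i.e. a sum of log-probabilities — and return the argmax.
    `domain_scores[d][a]` stands in for P(f, a | e, D = d)."""
    K = len(domain_scores)
    def log_score(a):
        return sum(math.log(domain_scores[d][a])
                   + math.log(p_e_given_d[d])
                   + math.log(p_d[d]) for d in range(K))
    return max(candidates, key=log_score)

# Toy example: two candidate alignments, two domains.
scores = [{"a1": 0.6, "a2": 0.4}, {"a1": 0.7, "a2": 0.3}]
best = decode_product_bound(["a1", "a2"], scores, [0.5, 0.5], [0.5, 0.5])
```

In practice the argmax is not over an explicit candidate list but over the HMM lattice, where the per-domain log-scores can be combined inside Viterbi; the enumeration here only illustrates the objective.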
Data Preparation
Data: C_mix = Legal + Pharmacy + Hardware + The rest (3.7M).
Training the latent domain alignment model with prior knowledge derived from the domain information of the three subsets, and comparing alignment accuracy to the baseline.
Alignment results
1 Million
Model     Prior            Prec.↑  Rec.↑  AER↓
Baseline  -                66.95   61.29  36.00
Latent    Pharmacy (100K)  67.85   61.72  35.36
Latent    Legal (100K)     67.57   62.29  35.17
Latent    Hardware (100K)  69.41   63.58  33.63
Latent    ALL (300K)       69.64   63.30  33.68
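For reference, the three columns in these tables can be computed as follows. A minimal sketch using the standard definitions of precision, recall, and Alignment Error Rate over sure (S) and possible (P) gold links; the example links are invented:

```python
def alignment_metrics(A, S, P):
    """Precision, recall and Alignment Error Rate (AER), with
    S = sure gold links, P = possible gold links (S ⊆ P), and
    A = predicted links, each link a (src_pos, tgt_pos) pair:
      Prec = |A ∩ P| / |A|,  Rec = |A ∩ S| / |S|,
      AER  = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)."""
    A, S, P = set(A), set(S), set(P)
    prec = len(A & P) / len(A)
    rec = len(A & S) / len(S)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return prec, rec, aer

# Invented example: 3 predicted links, 2 sure links, 3 possible links.
prec, rec, aer = alignment_metrics(
    A={(0, 0), (1, 1), (2, 3)},
    S={(0, 0), (1, 1)},
    P={(0, 0), (1, 1), (2, 2)},
)
```

Lower AER is better, which is why the ↓ arrow marks that column.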
Alignment results
2 Million
Model     Prior            Prec.↑  Rec.↑  AER↓
Baseline  -                68.34   61.58  35.22
Latent    Pharmacy (100K)  68.85   62.58  34.43
Latent    Legal (100K)     69.98   64.01  33.13
Latent    Hardware (100K)  69.45   63.23  33.81
Latent    ALL (300K)       71.51   63.87  32.53
Alignment results
4 Million
Model     Prior            Prec.↑  Rec.↑  AER↓
Baseline  -                69.37   64.30  33.26
Latent    Pharmacy (100K)  69.69   62.80  33.94
Latent    Legal (100K)     70.51   63.94  32.93
Latent    Hardware (100K)  71.75   64.44  32.10
Latent    ALL (300K)       72.16   64.30  31.99
Discussion
Word alignment should involve latent concepts representing the domains of the data.
We presented the benefits: with latent domains, the more we know about the data, the better we can improve performance.
We strongly believe this should be applicable to any statistical model, not limited to alignment models only.
Challenge: Can we learn the latent domain (alignment) models from scratch?
Bibliography
Bibliography I
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP, 2011.
Nguyen Bach, Qin Gao, and Stephan Vogel. Improving word alignment with language model based confidence scores. In Proceedings of the Third Workshop on Statistical Machine Translation, 2008.
Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith. Translation quality-based supplementary data selection by incremental update of translation models. In Proceedings of COLING 2012, pages 149–166, Mumbai, India, 2012.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
Yin-Wen Chang, Alexander M. Rush, John DeNero, and Michael Collins. A constrained Viterbi relaxation for bidirectional word alignment. In Proceedings of ACL, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-1139.
Bibliography II
Hoang Cuong and Khalil Sima’an. Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING, 2014.
John DeNero and Klaus Macherey. Model-based aligner combination using dual decomposition. In Proceedings of ACL-HLT, 2011. URL http://dl.acm.org/citation.cfm?id=2002472.2002526.
Qin Gao, Will Lewis, Chris Quirk, and Mei-Yuh Hwang. Incremental training and intentional over-fitting of word alignment. In Proceedings of MT Summit XIII, September 2011. URL http://research.microsoft.com/apps/pubs/default.aspx?id=153368.
Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, 2010. URL http://dl.acm.org/citation.cfm?id=1858842.1858883.