arXiv:2003.07568v2 [cs.CL] 8 Apr 2020



  • XPersona: Evaluating Multilingual Personalized Chatbot

    Zhaojiang Lin∗, Zihan Liu∗, Genta Indra Winata∗, Samuel Cahyawijaya∗, Andrea Madotto∗,Yejin Bang, Etsuko Ishii, Pascale Fung

    Center for Artificial Intelligence Research (CAiRE)Department of Electronic and Computer Engineering

    The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong{zlinao,zliucr,giwinata,scahyawijaya,amadotto}@connect.ust.hk,

    [email protected]

    Abstract

    Personalized dialogue systems are an essential step toward better human-machine interaction. Existing personalized dialogue agents rely on properly designed conversational datasets, which are mostly monolingual (e.g., English); this greatly limits the usage of conversational agents in other languages. In this paper, we propose a multilingual extension of Persona-Chat (Zhang et al., 2018), namely XPersona. Our dataset includes persona conversations in six languages other than English for building and evaluating multilingual personalized agents. We experiment with both multilingual and cross-lingual trained baselines and evaluate them against monolingual and translation-pipeline models using both automatic and human evaluation. Experimental results show that the multilingual trained models outperform the translation pipeline and are on par with the monolingual models, with the advantage of having a single model across multiple languages. On the other hand, the state-of-the-art cross-lingual trained models achieve inferior performance to the other models, showing that cross-lingual conversation modeling is a challenging task. We hope that our dataset and baselines1 will accelerate research in multilingual dialogue systems.

    1 Introduction

    Personalized dialogue agents have been shown to be effective at conducting human-like conversation. This progress has been catalyzed by existing conversational datasets such as Persona-chat (Zhang et al., 2018; Dinan et al., 2019a). However, the training data are provided in a single language (e.g.,

    * Equal contributions. Listing order is random.
    1 Datasets and all the baselines are available at https://github.com/HLTCHKUST/Xpersona

    English), and thus the resulting systems can perform conversations only in the training language. For wide adoption, commercial dialogue systems are required to handle a large number of languages, since the smart-home device market is increasingly international (Etherington, 2019). Therefore, creating multilingual conversational benchmarks is essential, yet challenging, since it is costly to perform human annotation of data in all languages.

    A possible solution is to use translation systems before and after model inference: a two-step translation from any language to English and from English back to that language. This comes with three major problems: 1) amplification of translation errors, since current dialogue systems are far from perfect, especially with noisy input; 2) the three-stage pipeline is significantly slower in terms of inference speed; and 3) high translation costs, since the current state-of-the-art translation models, especially for low-resource languages, are only available through costly APIs.

    In this paper, we analyze two possible workarounds to alleviate the aforementioned challenges. The first is to build a cross-lingual transferable system by aligning cross-lingual representations, as in Conneau et al. (2018), in which the system is trained on one language and zero-shot transferred to another. The second is to learn a multilingual system directly from noisy multilingual data (e.g., translated data), thus removing the dependence on a translation system at inference time.

    To evaluate the aforementioned systems, we propose a dataset called Multilingual Persona-Chat, or XPersona, by extending the Persona-Chat corpora (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. In XPersona, the training sets are automatically translated using translation APIs with several human-in-the-loop passes of mistake correction. In contrast, the validation and test sets


  • Persona: I helped design the game starcraft. I am a famous twitch streamer.

    En
    Sys: do you know the game starcraft ? i helped designing it !
    Usr: that s awesome , i m sure my ex boyfriend would like that game
    Sys: i usually stream the game on my twitch channel
    Usr: that s great , i also love twitch as well

    It
    Sys: conosci il gioco starcraft? ho aiutato a progettarlo!
    Usr: è fantastico, sono sicuro che al mio ex ragazzo gli piacerebbe
    Sys: di solito faccio streaming mentre gioco sul mio canale Twitch

    Fr
    Sys: connaissez-vous le jeu starcraft? j’ai aidé à le concevoir!
    Usr: c’est génial, je suis sûr que mon ex petit ami aimerait ce jeu
    Sys: Je diffuse généralement le jeu sur ma chaîne Twitch

    Id
    Sys: apakah anda tahu game starcraft? saya yang mendesainnya!
    Usr: itu luar biasa, saya yakin mantan pacar saya suka game itu.
    Sys: saya biasanya memainkan game itu di channel twitch saya.

    Zh
    Sys: 你知道游戏《星际争霸》吗?我帮忙设计了它!
    Usr: 好厉害,我觉得我的前男友会喜欢那个游戏
    Sys: 我经常在我的直播频道上直播游戏

    Ko
    Sys: 너 게임 스타크래프트를 아니? 나는 그것을 디자인하는 것을 도왔어!
    Usr: 멋진데, 내 전 남자친구가 그 게임을 좋아할 거라고 확신해.
    Sys: 나는 보통 내 트위치 채널로 그 게임을 스트리밍해.

    Jp
    Sys: ゲームのスタークラフトを知っていますか?私はそれを設計するのを助けました!
    Usr: それはすごいです、私は私の元彼がそのゲームを好きになると確信しています
    Sys: 私は通常、twitchチャンネルでゲームをストリーミングします

    Table 1: Multi-turn annotated dialogue samples from the test set in seven languages. For simplicity, we only show three turns for each dialogue, and the persona in English.

    are annotated by human experts to facilitate both automatic and human evaluations in multiple languages.

    Furthermore, we propose competitive baselines in two training settings, namely cross-lingual and multilingual, and compare them with translation-pipeline models. Our baselines leverage pre-trained cross-lingual (Chi et al., 2019) and multilingual (Devlin et al., 2018) models.

    An extensive automatic and human evaluation (Li et al., 2019) of our models shows that a multilingual system is able to outperform strong translation-based models and is on par with, or even improves on, the monolingual models. The cross-lingual performance is still lower than that of the other models, which indicates that cross-lingual conversation modeling is very challenging. The main contributions of this paper are summarized as follows:

    • We present the first multilingual non-goal-oriented dialogue benchmark for evaluating multilingual generative chatbots.

    • We provide both cross-lingual and multilingual baselines and discuss their limitations to inspire future research.

    • We show the potential of multilingual systems to understand mixed-language dialogue context and generate coherent responses.

    2 Related Work

    Dialogue Systems are categorized as goal-oriented (Williams and Young, 2007; Young et al., 2013) and chit-chat (Serban et al., 2016; Vinyals and Le, 2015). Interested readers may refer to Gao et al. (2018) for a general overview. In this paper, we focus on the latter, for which, in recent years, several tasks and datasets have been proposed to ground the conversation on knowledge (Dinan et al., 2019b; Gopalakrishnan et al., 2019; Shuster et al., 2018; Fan et al., 2019; Reddy et al., 2019; Choi et al., 2018; Moon et al., 2019) such as Wiki-Articles, Reddit-Post, and CNN-Article. In this work, we focus on personalized dialogue agents, where the dialogues are grounded on persona information.

    Li et al. (2016a) was the first to introduce a persona-grounded dialogue dataset for improving response consistency. Later on, Zhang et al. (2018) and Dinan et al. (2019a) introduced Persona-chat, a multi-turn conversational dataset, where two speakers are paired, and a persona description (4–5 sentences) is randomly assigned to each of them. By

    Lang   Valid.                           Test
           #Dial.  #Utt.   Edit    BLEU    #Dial.  #Utt.   Edit    BLEU
    Fr     248     3868    21.23   94.45   249     3900    24.29   94.19
    It     140     2160    83.01   80.45   140     2192    81.93   80.08
    Id     484     7562    157.58  60.46   484     7540    156.19  60.66
    Jp     275     4278    71.41   53.66   275     4322    75.83   49.56
    Ko     299     4684    74.04   61.25   300     4678    70.96   62.49
    Zh     222     3440    30.33   59.89   222     3458    33.07   64.61

    Table 2: The statistics of the collected dataset. We report the number of dialogues (#Dial.) and utterances (#Utt.) of the validation and test set in six languages. Edit distance per dialogue (Edit) and BLEU score are computed to show the difference between the human-annotated dataset and the auto-translated dataset. (Training-set statistics are reported in Appendix A.)
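The Edit column above can be approximated with a standard Levenshtein distance between each auto-translated utterance and its human-annotated revision, summed per dialogue. The sketch below assumes character-level distance over plain-string utterances; the paper does not specify the exact granularity, so this is illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_per_dialogue(auto_dialogues, human_dialogues):
    """Mean edit distance per dialogue; a dialogue is a list of utterance strings."""
    totals = [sum(levenshtein(a, h) for a, h in zip(auto, human))
              for auto, human in zip(auto_dialogues, human_dialogues)]
    return sum(totals) / len(totals)
```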

    conditioning the response generation on the persona descriptions, a chit-chat model is able to produce a more persona-consistent dialogue (Zhang et al., 2018). Several works have improved on the initial baselines with various methodologies (Kulikov et al., 2018; Yavuz et al., 2019; Hancock et al., 2019; Madotto et al., 2019; Joshi et al., 2017; Zemlyanskiy and Sha, 2018), especially using large pre-trained models (Wolf et al., 2019; Zhang et al., 2019).

    Multilingual Extensive approaches have been introduced to construct multilingual systems, for example, multilingual semantic role labeling (Akbik et al., 2015; He et al., 2019), multilingual machine translation (Johnson et al., 2017), multilingual automatic speech recognition (Toshniwal et al., 2018; Yue et al., 2019; Nakayama et al., 2019; Winata et al., 2019c), and named entity recognition (Winata et al., 2019a,b). Multilingual deep contextualized models such as Multilingual BERT (M-BERT) (Devlin et al., 2018) have been commonly used to represent multiple languages and elevate the performance in many NLP applications, such as classification tasks (Pires et al., 2019), textual entailment, named entity recognition (K et al., 2020), and natural language understanding (Liu et al., 2019c). Multilingual datasets have also been created for a number of NLP tasks, such as named entity recognition or linking (Sang, 2002; Sang and De Meulder, 2003; Pan et al., 2017; Aguilar et al., 2018), question answering (Liu et al., 2019a; Lewis et al., 2019), semantic role labeling (Hajic et al., 2009), part-of-speech tagging (Nivre et al., 2017), dialogue state tracking (Mrkšić et al., 2017), and natural language understanding (Schuster et al., 2019a). However, none of these datasets includes the multilingual chit-chat task.

    Cross-lingual Cross-lingual adaptation learns the inter-connections among languages and circumvents the requirement of extensive training data in target languages (Wisniewski et al., 2014; Zhang et al., 2016; Liu et al., 2019b). Cross-lingual transfer learning methods have been applied to multiple NLP tasks, such as named entity recognition (Ni et al., 2017; Xie et al., 2018), natural language understanding (Liu et al., 2019c), dialogue state tracking (Chen et al., 2018), part-of-speech tagging (Wisniewski et al., 2014; Zhang et al., 2016; Kim et al., 2017), and dependency parsing (Ahmad et al., 2019; Schuster et al., 2019b). Meanwhile, Lample and Conneau (2019) and Conneau et al. (2019) proposed pre-trained cross-lingual language models to align multiple language representations, achieving state-of-the-art results in many cross-lingual classification tasks. The aforementioned tasks focused on classification and sequence labeling, while instead, Chi et al. (2019) proposed to pre-train both the encoder and decoder of a sequence-to-sequence model (XNLG) to conduct cross-lingual generation tasks, namely, question generation and abstractive summarization. The latter is the closest to our task since it focuses on language generation; however, cross-lingual dialogue generation has not yet been explored.

    3 Data Collection

    The proposed XPersona dataset is an extension of the Persona-Chat dataset (Zhang et al., 2018; Dinan et al., 2019a). Specifically, we extend ConvAI2 (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. Since the test set of ConvAI2 is hidden, we split the original validation set into a new validation set and test set. Then, we first automatically translate the training, validation, and

    [Figure 1: diagram of the two architectures. Both share word, positional, and segmentation/language embeddings over the persona, user, and system segments; panel (a) feeds an M-Encoder and M-Decoder, panel (b) an M-Causal Decoder, to generate the response.]

    Figure 1: (a) Multilingual Encoder-Decoder model. (b) Multilingual Causal Decoder model. (A detailed illustration is reported in Appendix B.)

    test set using APIs (PapaGo2 for Korean, Google Translate3 for the other languages). For each language, we hired native-speaker annotators with a fluent level of English and asked them to revise the machine-translated dialogues and persona sentences in the validation set and test set according to the original English dialogues. The main goal of the human annotation is to ensure that the revised conversations are coherent and fluent in the target language despite the cultural discrepancies between languages. Therefore, annotators are not restricted to literally translating the English dialogues; they are also allowed to customize dialogues and persona sentences. The annotated dialogues can deviate from the original translation while retaining persona and conversation consistency. The full annotation instructions are reported in Appendix A.

    Compared to collecting new persona sentences and dialogues in each language, human-annotating the dialogues by leveraging translation APIs has multiple advantages. First, it increases the data distribution similarity across languages (Conneau et al., 2018), which can better examine the system's cross-lingual transferability. Second, revising the machine-translated dialogues based on the original English dialogues improves the data construction efficiency. Third, it leverages the well-constructed English persona conversations as a reference to ensure the dialogue quality without the need for training a new pool of workers to generate new samples (Conneau et al., 2018).

    On the other hand, human-translating the entire training set (∼130K utterances) in six languages is expensive. Therefore, we propose an iterative method to improve the quality of the automatically

    2 https://papago.naver.com
    3 https://translate.google.com

    translated training set. We first sample 200 dialogues from the training set (∼2600 utterances) in each language, and we assign human annotators to list all frequent translation mistakes in the given dialogues. For example, colloquial English expressions such as “cool”, “I see”, and “lol” are usually translated literally. After that, we use simple string matching to revise the inappropriate translations in the whole training set and return a revision log, which records all the revised utterances. Then, we assign human annotators to check all the revised utterances and list translation mistakes again. We repeat this process at least twice for each language. Finally, we summarize the statistics of the collected dataset in Table 2.
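The string-matching revision pass described above can be sketched as follows. The revision table here is a hypothetical example; in the actual process, the wrong/right pairs were compiled by native-speaker annotators in each round.

```python
def revise_corpus(utterances, revision_table):
    """Apply annotator-collected string replacements to every utterance.

    Returns the revised corpus plus a revision log of (index, before, after)
    entries, which is handed back to annotators for the next checking round.
    """
    revised, log = [], []
    for i, utt in enumerate(utterances):
        new = utt
        for wrong, right in revision_table.items():
            new = new.replace(wrong, right)
        if new != utt:
            log.append((i, utt, new))
        revised.append(new)
    return revised, log
```

Iterating this function with a growing revision table, and re-checking only the logged utterances, mirrors the at-least-twice-per-language loop described in the text.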

    4 Multilingual Personalized Conversational Models

    Let us define a dialogue D = {U1, S1, U2, S2, . . . , Un, Sn} as an alternating set of utterances from two speakers, where U and S represent the user and the system, respectively. Each speaker has a corresponding persona description that consists of a set of sentences P = {P1, . . . , Pm}. Given the system persona sentences Ps and the dialogue history Dt = {U1, S1, U2, . . . , St−1, Ut}, we are interested in predicting the system utterance St.
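Under this notation, each dialogue yields one training example per system turn: predict St from (Ps, Dt). A minimal sketch of this pairing, with utterances as plain strings:

```python
def make_examples(dialogue, persona_sys):
    """Turn a dialogue of (user, system) utterance pairs into training examples.

    For each system turn t, the model sees the system persona P_s and the
    history D_t = [U_1, S_1, ..., U_t], and must predict the target S_t.
    """
    examples, history = [], []
    for user, system in dialogue:
        history.append(user)
        examples.append((persona_sys, list(history), system))
        history.append(system)
    return examples
```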

    4.1 Model Architecture

    We explore both encoder-decoder and causal decoder architectures, and we leverage existing pre-trained contextualized multilingual language models as weight initialization. Hence, we first define the multilingual embedding layer and then the two multilingual models used in our experiments.

    Embedding We define three embedding matrices: word embedding EW ∈ R^{|V|×d}, positional embedding EP ∈ R^{M×d}, and segmentation embedding ES ∈ R^{|S|×d}, where |·| denotes set cardinality, d is the embedding size, V denotes the vocabulary, M denotes the maximum sequence length, and S denotes the set of segmentation tokens. Segmentation embedding (Wolf et al., 2019) is used to indicate whether the current token is part of i) persona sentences, ii) system (Sys.) utterances, iii) user utterances, or iv) the response in language lid. The language embedding lid is used to inform the model which language to generate. Hence, given a sequence of tokens X, the embedding function E is defined as:

    E(X) = EW(X) ⊕ EP(Xpos) ⊕ ES(Xseg), (1)

    where ⊕ denotes the position-wise sum, Xpos = {1, . . . , |X|}, and Xseg is the sequence of segmentation tokens, as in Wolf et al. (2019). Figure 1 shows a visual representation of the embedding process. A more detailed illustration is reported in Appendix B.
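Equation (1) can be illustrated with toy lookup tables; here the embeddings are plain lists, and positions are 1-based to match Xpos = {1, . . . , |X|}. A real model would use learned matrices in a deep-learning framework rather than dictionaries.

```python
def embed(tokens, seg_tokens, EW, EP, ES):
    """E(X) = EW(X) + EP(X_pos) + ES(X_seg), summed position-wise (Eq. 1).

    EW/EP/ES are toy lookup tables mapping a token, a 1-based position, or a
    segmentation id to a d-dimensional list.
    """
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, seg_tokens), 1):
        out.append([w + p + s for w, p, s in zip(EW[tok], EP[pos], ES[seg])])
    return out
```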

    Encoder-Decoder To model the response generation, we use a Transformer-based (Vaswani et al., 2017) encoder-decoder (Vinyals and Le, 2015). As illustrated in Figure 1, we concatenate4 the system persona Ps with the dialogue history Dt. Then we apply the embedding layer E and pass the result to the encoder. In short, we have:

    H = Encoder(E([Ps;Dt])), (2)

    where H ∈ R^{L×dmodel} is the hidden representation computed by the encoder, and L denotes the input sequence length. Then, the decoder attends to H and generates the system response St token by token. In the decoder, the segmentation embedding is the language-ID embedding (e.g., we look up the embedding for Italian to decode Italian). Thus:

    St = Decoder(H, lid). (3)
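The token-by-token decoding in Eq. (3) can be sketched as a greedy loop. Here `step` is a stand-in for one decoder forward pass (attending to the encoder states H and conditioned on the language id) that returns next-token scores; it is a placeholder, not an API from the paper.

```python
def greedy_decode(step, H, lang_id, bos="<bos>", eos="<eos>", max_len=20):
    """Autoregressively decode S_t: repeatedly pick the highest-scoring next
    token given the encoder states, the language id, and the prefix so far."""
    out = [bos]
    for _ in range(max_len):
        scores = step(H, lang_id, out)   # dict: candidate token -> score
        nxt = max(scores, key=scores.get)
        if nxt == eos:
            break
        out.append(nxt)
    return out[1:]
```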

    Causal Decoder As an alternative to encoder-decoders, causal decoders (Radford et al., 2018, 2019; He et al., 2018) have been used to model conversational responses (Wolf et al., 2019; Zhang et al., 2019) by giving the dialogue history as a prefix. In our model, we concatenate the persona Ps

    4 We use the notation [a; b] for concatenating the vectors a and b.

    and the dialogue history Dt as the language-model prefix, and autoregressively decode the system response St based on the language embedding (i.e., lid):

    St = Decoder(E([Ps;Dt]), lid). (4)

    Figure 1 shows the conceptual differences between the encoder-decoder and the causal decoder. Note that in both multilingual models, the dialogue history encoding process is language-agnostic, while the decoding language is controlled by the language embedding. This design allows the model to understand mixed-language dialogue contexts and to respond in the desired language (details in Section 5.3.2).
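The language-agnostic encoding can be sketched as the assembly of the parallel token and segmentation-id sequences for [Ps; Dt]. The segment-id values below are arbitrary placeholders for illustration, not the actual vocabulary indices used in the paper.

```python
def build_prefix(persona, history, lang_id):
    """Flatten [P_s; D_t] into one token sequence with a parallel segment-id
    sequence: persona tokens first, then alternating user/system turns.
    The target language id is returned separately and only steers decoding."""
    SEG = {"persona": 0, "usr": 1, "sys": 2}   # placeholder ids
    tokens, segs = [], []
    for sent in persona:
        tokens += sent
        segs += [SEG["persona"]] * len(sent)
    for i, turn in enumerate(history):          # D_t starts with a user turn
        role = "usr" if i % 2 == 0 else "sys"
        tokens += turn
        segs += [SEG[role]] * len(turn)
    return tokens, segs, lang_id
```

Because the history tokens carry no language marker, the same prefix-building code serves a mixed-language context; only `lang_id` decides the output language.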

    4.2 Training Strategy

    We consider two training strategies to learn a multilingual conversational model: multilingual training and cross-lingual training.

    Multilingual Training jointly learns to perform personalized conversations in multiple languages. We follow a transfer-learning approach (Wolf et al., 2019; See et al., 2019) by initializing our models with the weights of the large multilingual pre-trained model M-BERT (Pires et al., 2019). For the causal decoder, we add a causal mask to the self-attention layers to convert the M-BERT encoder into a decoder. For the encoder-decoder model, we randomly initialize the cross encoder-decoder attention (Rothe et al., 2019). Then, we train both models on the combined training set in all 7 languages using the cross-entropy loss.
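The causal mask that turns a bidirectional (BERT-style) self-attention layer into a decoder layer is simply lower-triangular: position i may attend only to positions j ≤ i. A minimal sketch:

```python
def causal_mask(n):
    """n x n lower-triangular attention mask: entry [i][j] is 1 iff position i
    is allowed to attend to position j (i.e., j <= i). Adding this mask to the
    attention scores blocks each token from seeing future tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
```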

    Cross-lingual Training transfers knowledge from the source-language data to the target languages. In this setting, the model is trained on English (source language) conversational samples and evaluated on the other 6 languages. Following the methodology proposed by Chi et al. (2019), we align the embedded representations of different languages into the same embedding space by applying cross-lingual pre-training to the encoder-decoder model. The pre-training procedure consists of two stages:

    • pre-training the encoder and the decoder independently utilizing masked language modeling, as in Lample and Conneau (2019);

    • jointly pre-training the encoder-decoder by using two objective functions: Cross-Lingual

    Lang   Bert2Bert       M-Bert2Bert     CausalBert      M-CausalBert    XNLG
           ppl.    BLEU    ppl.    BLEU    ppl.    BLEU    ppl.    BLEU    ppl.     BLEU
    En     21.99   1.53    25.99   0.57    16.08   1.79    15.62   1.97    54.74*   2.25*
    Zh     21.35   3.36    13.24   1.25    8.69    5.51    9.27    5.7     3482.27  2.16
    It     50.36   0.6     24.16   0.31    18.41   1.32    15.12   1.3     917.63   0.41
    Jp     10.09   5.23    10.64   0.79    11.00   6.74    7.13    4.53    999.81   0.0
    Ko     12.81   0.24    34.31   0.00    9.66    1.06    9.56    1.08    -        -
    Id     21.37   0.11    22.83   0.22    14.77   2.1     14.61   1.92    844.98   0.15
    Fr     13.22   0.35    15.58   0.50    10.39   1.97    10.59   2.17    640.33   0.09

    Table 3: Results of the automatic evaluation on the test set in seven languages. We compute the BLEU score and perplexity (ppl.) for the monolingual, multilingual, and cross-lingual models.

    Auto-Encoding (XAE) and Denoising Auto-Encoding (DAE) (Chi et al., 2019).

    For instance, DAE adds perturbations to the input sentence of the encoder and tries to reconstruct the original sentence using the decoder, whereas XAE uses parallel translation data to pre-train both the encoder and decoder with a machine-translation objective. As in the multilingual models, the language IDs are fed into the decoder to control the language of the generated sentences. Both pre-training stages require both parallel and non-parallel data in the target language.

    After the two stages of pre-training, the model is fine-tuned using just the source-language samples (i.e., English) with the same cross-entropy loss as for the multilingual training. However, as suggested in Chi et al. (2019), only the encoder parameters are updated with back-propagation, and both the decoder and the word embedding layer remain frozen. This retains the decoder's ability to generate multilingual output while still being able to learn new tasks using only the source language.
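This selective update can be sketched as filtering the parameter set before building the optimizer. The parameter-name prefixes below are hypothetical; the actual names depend on the framework and checkpoint used.

```python
def trainable_parameters(params):
    """During cross-lingual fine-tuning, keep only encoder parameters for
    gradient updates; decoder and word-embedding weights stay frozen
    (Chi et al., 2019). `params` maps parameter names to tensors."""
    frozen_prefixes = ("decoder.", "embeddings.word")   # hypothetical names
    return {name: p for name, p in params.items()
            if not name.startswith(frozen_prefixes)}
```

An optimizer is then constructed only over the returned subset, so back-propagation never touches the frozen multilingual decoder.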

    5 Experiments

    5.1 Evaluation Metrics

    Evaluating open-domain chit-chat models is challenging, especially in multiple languages and at the dialogue level. Hence, we evaluate our models using both automatic and human evaluation. In both cases, human-annotated dialogues are used, which shows the importance of the provided dataset.

    Automatic For each language, we evaluate the responses generated by the models using perplexity (ppl.) and BLEU (Papineni et al., 2002) with reference to the human-annotated responses. Although these automatic measures are not perfect (Liu et al., 2016), they help to roughly estimate the performance of different models under the same test set. More recently, Adiwardana et al. (2020) have shown a correlation between perplexity and human judgment in open-domain chit-chat models.
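Perplexity here is the exponential of the mean per-token negative log-likelihood over the test set, which can be computed as:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative log-likelihood
    (equivalently, exp of the average cross-entropy loss in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```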

    Human Asking humans to evaluate the quality of a dialogue model is challenging, especially when multiple models have to be compared. The Likert score (i.e., 1-to-5 scoring) has been widely used to evaluate the interactive experience with conversational models (Venkatesh et al., 2018; See et al., 2019; Zhang et al., 2018; Dinan et al., 2019a). In such an evaluation, a human interacts with the systems for several turns and then assigns a score from 1 to 5 based on three questions (Zhang et al., 2018) about fluency, engagingness, and consistency. This evaluation is both expensive to conduct and requires many samples to achieve statistically significant results (Li et al., 2019). To cope with these issues, Li et al. (2019) proposed ACUTE-EVAL, an A/B-test evaluation for dialogue systems. The authors proposed two modes: human-model chats and self-chat (Li et al., 2016b; Ghandeharioun et al., 2019). In this work, we opt for the latter since it is cheaper to conduct and achieves similar results (Li et al., 2019) to the former. Another advantage of this method is the ability to evaluate multi-turn conversations instead of single-turn responses.

    Following ACUTE-EVAL, the annotator is provided with two full dialogues made by self-chat or human dialogue. The annotator is asked to choose which of the two dialogues is better in terms of engagingness, interestingness, and humanness. For each comparison, we sample 60–100 conversations from both models. In Appendix C, we report the exact questions and instructions given to the annotators, and the user interface used in the evaluation. We hired native-speaker annotators for all six considered languages. The annotators were different from the dataset-collection annotators to avoid any possible bias.

    Multi wins %
    Lang   Engagingness            Interestingness         Humanness
           Human   Mono    Poly    Human   Mono    Poly    Human   Mono    Poly
    En     23.33   68.57   36.36   23.33   64.29   32.73   30.00   62.86   42.73
    Fr     32.00   55.17   42.86   16.00   53.45   48.21   28.00   50.00   44.64
    Id     21.67   51.67   65.45   23.33   46.67   55.45   25.00   46.67   65.45
    It     35.00   48.33   56.36   30.00   48.33   53.64   30.00   40.00   57.27
    Jp     18.33   50.00   61.82   13.33   43.33   45.45   18.33   51.67   59.09
    Ko     30.00   52.46   62.39   26.67   50.82   59.63   28.33   52.46   64.22
    Zh     36.67   55.00   65.45   36.67   60.00   61.82   36.67   55.00   70.91

    Table 4: Results of the ACUTE-EVAL human evaluation. Tests are conducted pairwise between M-CausalBert (Multi.) and the other models (Human, Poly-encoder (Poly), monolingual CausalBert (Mono)). Numbers indicate the winning rate of Multi. Numbers in bold are statistically significant (p < 0.05).
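The winning rates in Table 4, and their significance against the 50% null hypothesis of the pairwise A/B comparison, can be computed with an exact two-sided binomial test; a minimal sketch:

```python
from math import comb

def win_rate(wins, total):
    """Winning percentage of one model over the other in pairwise comparisons."""
    return 100.0 * wins / total

def binom_p_two_sided(wins, total):
    """Exact two-sided binomial test against p = 0.5: sums the probability of
    outcomes at least as extreme as the observed win count and doubles it."""
    k = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)
```

For example, 36 wins out of 60 comparisons gives a 60% win rate, while an even 5-of-10 split is, as expected, not significant.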

    5.2 Implementation Details

    Multilingual Models We use the "BERT-Base, Multilingual Cased" checkpoint, and we denote the multilingual encoder-decoder model as M-Bert2Bert (∼220M parameters) and the causal decoder model as M-CausalBert (∼110M parameters). We fine-tune both models on the combined training set (English from Persona-chat (Zhang et al., 2018), six languages from XPersona) for five epochs with the AdamW5 optimizer and a learning rate of 6.25e-5.

    Monolingual Models To verify whether the multilingual agent under-performs a monolingual agent on the monolingual conversational task, we build a monolingual encoder-decoder model and a causal decoder model for each language. For a fair comparison, we initialize the monolingual models with a pre-trained monolingual BERT6 (Devlin et al., 2018; Cui et al., 2019; Martin et al., 2019). We denote the monolingual encoder-decoder model as Bert2Bert (∼220M parameters) and the causal decoder model as CausalBert (∼110M parameters). We then fine-tune each model on each language independently for the same number of epochs and with the same optimizer as the multilingual models.

    Translation-based Models Another strong baseline we compare with is the Poly-encoder (Humeau et al., 2019), a large-scale pre-trained retrieval model that has shown state-of-the-art performance on the English Persona-chat dataset (Li et al., 2019).

    5 AdamW: the Adam algorithm with weight decay.
    6 The monolingual BERT pre-trained models are available at https://github.com/huggingface/transformers

    We adapt this model to the other languages by using the Google Translate API to translate the target-language (e.g., Chinese) query into English as the input to the model, and then translate the English response back to the target language. Thus, the response-generation flow is: target query → English query → English response → target response. We denote this model as Poly.
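The three-stage flow above can be sketched with stand-in functions for the translation API and the English-only retrieval model; both function names here are placeholders, not real APIs:

```python
def pipeline_response(query, src_lang, translate, english_bot):
    """Translation-pipeline baseline: target-language query -> English query
    -> English response -> target-language response. `translate` stands in
    for an MT API and `english_bot` for the English-only dialogue model."""
    en_query = translate(query, src=src_lang, tgt="en")
    en_response = english_bot(en_query)
    return translate(en_response, src="en", tgt=src_lang)
```

The sketch makes the three failure modes discussed in the introduction concrete: errors compound across the two `translate` calls, every response pays for two extra API round-trips, and both calls are billed.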

    Cross-lingual Models In the first pre-training stage, we use the pre-trained weights from XLMR-base (Conneau et al., 2019). Then, we follow the second pre-training stage of XNLG (Chi et al., 2019) to pre-train Italian, Japanese, Korean, and Indonesian cross-lingual transferable models. For Chinese and French, we directly apply the pre-trained XNLG (Chi et al., 2019) weights7. The pre-trained models are then fine-tuned on the English PersonaChat training set, with early stopping based on the perplexity on the target-language validation set.

    5.3 Results and Discussion

    5.3.1 Quantitative Analysis

    Table 3 compares the monolingual, multilingual, and cross-lingual models in terms of BLEU and perplexity on the human-translated test set. On both evaluation metrics, the causal decoder models outperform the encoder-decoder models. We observe that the encoder-decoder model tends to overlook the dialogue context and generate digressive responses (generated samples are available in Appendix D). We hypothesize that this is because the one-to-many problem (Zhao et al., 2017) in open-domain conversation weakens the relation between encoder and decoder; thus the well pre-trained decoder (BERT) easily converges to a local optimum and

    7 Available at https://github.com/CZWin32768/XNLG

    learns to ignore the dialogue context from the encoder, generating responses as an unconditional language model. We leave the investigation of this problem to future work. On the other hand, M-CausalBert achieves comparable or slightly better performance than CausalBert, which suggests that M-CausalBert leverages the data from other languages. As expected, we observe a significant gap between the cross-lingual model and the other models, which indicates that cross-lingual zero-shot conversation modeling is very challenging.
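The perplexity reported in Table 3 is the exponential of the mean per-token negative log-likelihood; a minimal computation from token probabilities (the probability values are illustrative):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A uniform distribution over 4 tokens gives perplexity 4.0.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model assigns higher probability to the reference responses.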

    Table 4 shows the human evaluation results comparing M-CausalBert (Multi) against humans, the translation-based Poly-encoder (Poly), and the monolingual CausalBert (Mono). The results illustrate that Multi outperforms Mono in English and Chinese, and is on par with Mono in the other languages. On the other hand, Poly shows strong performance in English, as it was pre-trained on a large-scale English conversation corpus. In contrast, the performance of Poly drops in the other languages, which indicates that imperfect translation hurts translation-based systems. We also conduct a human evaluation of M-CausalBert (Multi) against XNLG (Cross), and Multi achieves a nearly 100 percent winning rate.

    5.3.2 Qualitative Analysis and Discussion

    We randomly sample 7 self-chat dialogues for each baseline model in the seven languages and report them in Appendix D. We summarize the generations of each model as follows:

    Poly The Poly-encoder, pre-trained on 174 million Reddit examples, can accurately retrieve coherent and diverse responses in English. However, in the other six languages, some of the retrieved responses are digressive due to translation errors.

    Monolingual & Multilingual We observe that both the monolingual and multilingual models can generate fluent responses. Compared to Bert2Bert and M-Bert2Bert, CausalBert and M-CausalBert generate more on-topic responses but sometimes repeat themselves across turns. CausalBert and M-CausalBert are on par with each other on monolingual conversational tasks, while M-CausalBert shows the advantage of handling a mixed-language context. For multilingual speakers, a conversation may involve multiple languages. Therefore, we experiment on M-CausalBert with two settings: 1) many-to-one, in which users converse with the

    System Persona:
      My favorite activity is rock climbing.
      I'm a mechanical engineer.
      I like to go hiking in the spring.
      I live in colorado.

    Dialogue:
      Usr: studio inglese oggi e tu? (I am studying English today, you?)
      Sys: just watching netflix with friends
      Usr: 我喜欢攀岩,你呢? (I like rock climbing, what about you?)
      Sys: nice, I enjoy it while in the summer. what are you up to?
      Usr: saya ingin melakukan panjat tebing (I want to go to hike)
      Sys: I'm going hiking the most time of year
      Usr: スポーツがお好きなようですね (It seems you like sports)
      Sys: I love the outdoors too, do you have any hobbies?
      Usr: 나는 티비 보는 것을 좋아해. (I like to watch tv)
      Sys: i really like hiking and listening to music

    Table 5: Many-to-one: the model understands mixed-language dialogue context in multiple languages and generates responses in one language.

    model in 6 languages and the model generates responses in English, and 2) one-to-many, in which users converse with the model in English and the model generates responses in 6 languages using a language embedding and the corresponding persona sentences. Tables 5 and 6 illustrate generation examples under these settings (more examples are reported in Appendix C.1). Most of the time, M-CausalBert can understand the mixed-language context and decode coherent responses in different languages. Understanding a mixed-language dialogue context is a desirable skill for end-to-end chit-chat systems, and a systematic study of this research question is needed in future work.
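The one-to-many control signal can be sketched as a language-embedding id attached to every input position, so the same English context is decoded into the requested language. The id mapping and the (token, language-id) layout are illustrative assumptions; the real model adds a learned language embedding vector to the token representation.

```python
# Sketch: tagging the decoder input with a target-language id so the
# model conditions its response on the requested language.

LANG2ID = {"En": 0, "Zh": 1, "Fr": 2, "It": 3, "Jp": 4, "Id": 5, "Ko": 6}

def build_decoder_input(context_tokens, target_lang):
    """Pair every context token with the target-language id; an
    embedding layer would map that id to a language vector."""
    lang_id = LANG2ID[target_lang]
    return [(tok, lang_id) for tok in context_tokens]

inp = build_decoder_input(["Hi", ",", "I", "am", "a", "student"], "Ko")
```

Switching `target_lang` while keeping `context_tokens` fixed is exactly the one-to-many setting of Table 6; fixing the output language while mixing context languages is the many-to-one setting of Table 5.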

    Cross-lingual The current state-of-the-art cross-lingual generation approach, XNLG (Chi et al., 2019), shows inferior performance on multi-turn dialogue tasks and generates repetitive responses. Although cross-lingual dialogue generation is challenging, it reduces the human effort for data annotation in different languages. Therefore, cross-lingual transfer is an important direction to investigate.

    System Persona:
      I love to drink fancy tea.
      I have a big library at home.
      I'm a museum tour guide.
      I'm partly deaf.

    Dialogue:
      Usr: Hi, I am a computer science student, you?
      En: I'm a bookkeeper for the local museum of art.
      Zh: 你好,我是一名博物馆老师。 (I am a teacher in a museum)
      Fr: bonjour, je suis juste un séjour à la maison maman de mon immense bibliothèque. (hello, I'm just a stay at home my huge library.)
      It: Sono un bibliotecario, ma ho bisogno di rilassarmi. (I am a bookkeeper, but I need to relax)
      Jp: こんにちは。私は大学にいます、あなたの専攻は何ですか? (Hello. I am in college, what is your major?)
      Id: Saya tidak tahu, tetapi saya tuli. (I don't know, but I am deaf)
      Ko: 안녕, 나는 선생님이야. 너는 무엇을 공부하고 있니? (Hello, I am a teacher. What are you studying?)

    Table 6: One-to-many: responses to one dialogue context in 7 different languages.

    6 Conclusion

    In this paper, we studied both cross-lingual and multilingual approaches to end-to-end personalized dialogue modeling. We presented the XPersona dataset, a multilingual extension of Persona-Chat, for evaluating multilingual personalized chatbots. We further provided both cross-lingual and multilingual baselines and compared them with the monolingual approach and the two-stage translation approach. Extensive automatic and human evaluations were conducted to examine the models' performance. The experimental results showed that multilingual trained models, with a single model across multiple languages, can outperform the two-stage translation approach and are on par with monolingual models. On the other hand, the current state-of-the-art cross-lingual approach, XNLG, achieved lower performance than the other baselines. In future work, we plan to research more advanced cross-lingual generation approaches and construct a mixed-language conversational benchmark for evaluating multilingual systems.

    References

    Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

    Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147.

    Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2440–2452.

    Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2015. Generating high quality proposition banks for multilingual semantic role labeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397–407.

    Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. 2018. XL-NBT: A cross-lingual neural belief tracking framework. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 414–424.

    Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2019. Cross-lingual natural language generation via pre-training. arXiv preprint arXiv:1909.10481.

    Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184.

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.

    Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

    Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019a. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.

    Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

    Darrell Etherington. 2019. Amazon launches multilingual mode for using Alexa in multiple languages at once.

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190.

    Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM.

    Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

    Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895.

    Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, M Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18.

    Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.

    Shexia He, Zuchao Li, and Hai Zhao. 2019. Syntax-aware multilingual semantic role labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5353–5362.

    Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pages 7944–7954.

    Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.

    Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

    Chaitanya K Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503.

    Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

    Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838.

    Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

    Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

    Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

    Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 994–1003.


    Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

    Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.

    Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132. Association for Computational Linguistics.

    Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019a. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368.

    Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale Fung. 2019b. Zero-shot cross-lingual dialogue systems with transferable latent variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1297–1303.

    Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2019c. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. arXiv preprint arXiv:1911.09273.

    Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459.

    Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

    Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854.

    Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.

    Sahoko Nakayama, Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2019. Zero-shot code-switching ASR and TTS with multilingual machine speech chain. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 964–971. IEEE.

    Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470–1480.

    Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague.

    Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1946–1958.

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

    Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

    Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

    Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.

    Erik F Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. arXiv preprint cs/0209010.

    Erik F Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.


    Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805.

    Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613.

    Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654.

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

    Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.

    Shubham Toshniwal, Tara N Sainath, Ron J Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao. 2018. Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908. IEEE.

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

    Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing conversational agents. arXiv preprint arXiv:1801.03625, 4:60–68.

    Oriol Vinyals and Quoc V Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

    Jason D Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

    Genta Indra Winata, Zhaojiang Lin, and Pascale Fung. 2019a. Learning multilingual meta-embeddings for code-switching named entity recognition. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 181–186.

    Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, and Pascale Fung. 2019b. Hierarchical meta-embeddings for code-switching named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3532–3538.

    Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019c. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280.

    Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, and François Yvon. 2014. Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1779–1785.

    Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

    Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369–379.

    Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tur. 2019. DeepCopy: Grounded response generation with hierarchical pointer networks. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 122–132.

    Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

    Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng, and Haizhou Li. 2019. End-to-end code-switching ASR for low-resourced language pairs. arXiv preprint arXiv:1909.12681.

    Yury Zemlyanskiy and Fei Sha. 2018. Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, page 551.

    Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics.


    Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

    Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag – multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1307–1317.

    Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.

    A Dataset Collection

    A.1 Annotation Instructions

    In this section, we show the instructions for French annotation:

    • There are two existing columns of conversations: the first column (en) contains the original conversations in English; the second column (fr) contains the conversations translated by an automatic system (e.g., Google Translate).

    • You should copy the conversation from the second column (the translated conversations) into the third column (named fr_annotation). In that column, you should then revise the incorrect or inappropriate translations.

    • The goal of the revision is to make the conversations more coherent and fluent in the target language (French). Hence you can customize dialogues and persona sentences to make them fluent and coherent in the target language, including by deviating from the original translation. However, you should retain persona and conversation consistency.

    A.2 Training Set Statistics

    We report the statistics of our iteratively revised training set in Table 7.

    B Model Details

    Figures 3 and 4 illustrate the details of the multilingual causal decoder and the multilingual encoder-decoder models.

    Lang   # Dial.   # Utt.   Edit   BLEU
    Fr     16878     248244   0.06   99.98
    It     16878     248244   1.09   99.8
    Id     16878     248244   0.18   99.94
    Jp     16878     248244   0.38   99.17
    Ko     16878     248244   0.97   99.51
    Zh     16878     248244   0.52   98.1

    Table 7: The number of dialogues (# Dial.) and utterances (# Utt.) of the training set in six languages. Edit distance per dialogue and BLEU score are computed to show the difference between the iteratively revised dataset and the auto-translated dataset.

    Figure 2: Human evaluation interface, modified from ACUTE-EVAL (Li et al., 2019)

    C Human Evaluation

    As illustrated in Figure 2, the annotator is provided with two full dialogues, each produced either by a model self-chat or by humans. Then the annotators are asked the following questions:

    • Who would you talk to for a long conversation?

    • If you had to say one of these speakers is interesting and one is boring, who would you say is more interesting?

    • Which speaker sounds more human?

    D Generated Samples

    D.1 Mixed-language Samples

    We report additional mixed-language samples generated by M-CausalBert in Tables 8 and 9.

    [Figure: token (X), positional (Xpos), and segment (Xseg) embeddings, with persona (Per), system (Sys), user (Usr), and language (En) tags, summed to form the model input.]

    Figure 3: Multilingual Causal Decoder model.

    Figure 4: Multilingual Encoder-Decoder model.
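The input scheme of Figures 3 and 4 can be reconstructed as three parallel sequences: tokens, positions, and segment tags marking persona, system, and user spans. The exact tag vocabulary and position numbering here are an illustrative reconstruction of the figures, not a verbatim specification.

```python
# Sketch: building the X / Xpos / Xseg sequences that are embedded
# and summed before entering the (M-)CausalBert decoder.

def build_inputs(persona, system_turn, user_turn):
    tokens, segments = [], []
    for tok in persona:
        tokens.append(tok); segments.append("Per")
    for tok in system_turn:
        tokens.append(tok); segments.append("Sys")
    for tok in user_turn:
        tokens.append(tok); segments.append("Usr")
    positions = list(range(1, len(tokens) + 1))
    return tokens, positions, segments

X, Xpos, Xseg = build_inputs(["I", "love", "cats"], ["Hi", "!"],
                             ["Hi", "how", "are", "you", "?"])
```

In the multilingual models, a language tag (e.g., En) is embedded and added on top of these three sequences in the same way.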

    D.2 Model Comparison Samples

    We randomly sample one self-chat dialogue example for each model in each language and report them in Figures 5–37.

    System Persona:
      i have two dogs.
      i like to work on vintage cars.
      my favorite music is country.
      i own two vintage mustangs.

    Dialogue:
      Usr: do you have any hobbies?
      En: i like music, country is my favorite. you?
      Zh: 我喜欢听乡村音乐。你呢 (I like country music, you?)
      Fr: je possède deux voitures anciennes et j'aime faire de l'artisanat (I have two vintage cars and I like to make crafts)
      It: Ciao. mi piace lavorare su auto d'epoca e ascoltare musica country. (Hello. I love working on vintage cars and listening to country music.)
      Jp: こんにちは。私は大学にいます、(Hi, i really like it)
      Id: Saya suka mendengarkan musik country, dan Anda? (I like listening to country music, and you?)
      Ko: 안녕, 나는 선생님이야. (I have two vintage mustangs.)

    System Persona:
      i love disneyland and mickey mouse.
      i love to spend time with my family.
      i'm a baby delivery nurse.
      i walk three miles every day.

    Dialogue:
      Usr: what do you do in your free time?
      En: i love to play with my family. how about you?
      Zh: 我主要是在家里 (I just stay at home)
      Fr: j'aime passer du temps avec ma famille et mon mari. (i love spending time with my family and my husband.)
      It: Ciao. Sono in pensione da quando ero un'infermiera. (Hello. I have been retired since I was a nurse.)
      Jp: こんにちは、私は家族と時間を過ごすのが大好きです (Hi, i love spending time with my family)
      Id: Saya suka berjalan 3 mil setiap hari dan Anda? (I love walking 3 miles every day and you?)
      Ko: 안녕, 나는 가족과 시간을 보내는 것을 즐긴다. (Hi, I enjoy spending time with my family.)

    Table 8: One-to-many by M-CausalBert

    System Personai am a cancer survivor.

    my parents were both very athletic.i got a new job just yesterday to be a life coach.

    i am a clean eater.Dialogue

    Usrこんにちは(Hello)

    Sys hey there how are you?

    Usr我很好,我是一名学生,你呢?(I am good, I am a student, what about you?)

    Sys i am a full time student and have no job yet.

    Usrcosa Studi?(What do you study?)

    Sys oh i see. what are you studying?

    Usrl’informatique(computer science)

    Sys computer stuff. what is your favorite subject?Usr i like machine learning.Sys i like to read and make fun of things.

    UsrApa pekerjaan orang tuamu?(what do your parents do?)

    Sys i’m from cali, my parents were both athletic.

    System Personamy brother is older than me.

    i ride the bus to school.i’m on the soccer team.

    i am 13 years old.Dialogue

    Usr안녕

    (Hello)Sys hello, how are you?

    Usr我很好,刚刚下班回家(I am good, just came home from work)

    Sys oh, cool. what do you do for work?

    UsrSono un insegnante di inglese(I am English teacher.)

    Sys oh, that’s awesome! i hope to get a job soon.

Usr Quel âge avez-vous? (how old are you?)

Sys oh ok. i’m only 13, but my brother is older than me

Usr Apakah Anda pelajar? (are you a student?)

Sys do you play soccer? i am a football player

Usr no, i like to read

Sys i like to ride the bus and play soccer

    Table 9: Many-to-one by M-CausalBert

  • i have a car that i like to use. do you have any hobbies?

    i do gardening, i love to garden

    i've five sons. they're my best friends.

    i have a husband, but he is trying to get me in shape

    how long have you been married?

i have been married for over 6 years, but i am not a huge fan of clowns

    oh i see. do you have kids?

no, i work in a gas station, i work with my hands and have lots of friends

    oh. i like to coach little league baseball.

i love to garden, i'm afraid of clowns due to childhood trauma

    Figure 5: English CausalBert

    nice, what are you studying in?

    i'm in school in my twenties. what about you?

    i'm a third grade student in college

    that's great! i love math. do you work?

    no i just sing songs from the movie frozen

    that sounds fun. i just got back from school for chemistry.

i like to sing in the way that i love singing songs from the movie frozen

    what kind of songs do you sing?

i sing with mickey mouse every year, it is my favorite character

    that is cool. i have never heard of that song.

    Figure 6: English M-CausalBert

    not really , art is really the only thing i like doing .

    do you hike or surf ? i do both with my border collies .

other than my saturday routines on gettin my nails done , i love hiking

    nice ! i am getting ready for a big move in a couple months, going to california .

    i love cali , was just there a month ago with my family

in san diego ? i am going to be moving there in a few months .

    yes ! i have been there a few times , nice place

    it really is ! san diego is expensive , but it looks great .

it is little expensive for some , but still cheaper than in some major cities

    Figure 7: English PolyEncoder

    i don't have any pets. i've a cat named toto.

    i am a student. what do you do?

    hi! i'm doing well. just got done shopping.

    i'm not sure. i'm too old for that.

    i'm a graphic designer. i want to be an artist one day.

    i am a girl. what do you do for work?

    what do you do for work? i'm a computer programmer

    i like to eat junk food

    i am more of a beach type person.

    i'm a scientist.

    Figure 8: English M-Bert2Bert

• 哦,我也喜欢上学,我喜欢去学校 (Oh, I also like school; I like going to school)

    我喜欢上大学,但我辞职了,所以我不再上学。(I liked going to college, but I quit, so I am not in school anymore.)

    我也在上大学,我很寂寞 (I am also in college; I am lonely)

    你为什么寂寞?(Why are you lonely?)

    我在云上,所以我不会做很多事情 (I am on the clouds, so I cannot do many things)

    我明白了。我今天是佛教徒,所以我可以回到佛教堂。(I see. I am a Buddhist today, so I can go back to the Buddhist temple.)

    我可以去佛教堂做些瑜伽,但是我的心情不好 (I can go do some yoga at the Buddhist temple, but I am in a bad mood)

    我明白。我只是想放松一下。你呢 (I understand. I just want to relax. What about you?)

    我不喜欢去上班,所以我可以花很多时间去上班 (I do not like going to work, so I can spend a lot of time going to work)

    我明白。我辞职了。(I understand. I quit my job.)

    Figure 9: Chinese CausalBert

我喜欢唱歌,但是我不是很好 (I like singing, but I am not very good)

    你在唱歌什么样的歌?(What kind of songs do you sing?)

    我的前妻离开了我,我真的很喜欢 (My ex-wife left me, and I really liked [her])

    听到这个消息我很抱歉。(I am sorry to hear that.)

    我仍然希望我的前妻会有这样的。(I still hope my ex-wife will have something like that.)

    我希望你也是。我已经退休了。(I hope you too. I am already retired.)

    我不是很乐意退休。我目前为我的工作做生意。(I am not very happy to retire. I currently do business for my work.)

    我是一名体育老师。你喜欢旅行吗?(I am a PE teacher. Do you like traveling?)

    我是录音工程师,但我最喜欢的地方是在加利福尼亚。(I am a recording engineer, but my favorite place is in California.)

    你喜欢看电影吗?(Do you like watching movies?)

    Figure 10: Chinese M-CausalBert

自大学以来,我有轻度的ocd病例 (Since college, I have had a mild case of ocd)

    不好了。那一定很难。我来自墨西哥。你到过那里吗?(Oh no. That must be hard. I am from Mexico. Have you been there?)

    我希望我不那么害怕,我在社交场合陷入僵局 (I wish I were not so scared; I freeze up in social situations)

    哈哈。知道了。你喜欢烘焙食品吗?我烤一吨!(Haha. Got it. Do you like baked goods? I bake a ton!)

    我确实喜欢烘烤。。。主要是甜点 (I do like baking... mostly desserts)

    在开始教书之前我是厨师。(I was a chef before I started teaching.)

    很酷。我没有时间。我做自由会计 (Cool. I have no time. I do freelance accounting)

    啊。比记账更有趣。(Ah. More fun than bookkeeping.)

    哦耶。我虽然有很多休假时间,所以我经常旅行 (Oh yeah. I do have a lot of vacation time though, so I travel often)

    我特别喜欢去墨西哥旅行。(I especially like traveling to Mexico.)

    Figure 11: Chinese PolyEncoder

我喜欢动物,但我讨厌动物。(I like animals, but I hate animals.)

    我喜欢去健身房和看足球 (I like going to the gym and watching soccer)

    你应该。我不想让他们的孩子微笑。(You should. I do not want to make their children smile.)

    我喜欢一个人,但是我不知道该怎么办。(I like someone, but I do not know what to do.)

    我喜欢在湖边游泳 (I like swimming by the lake)

    我很好。只是看一些动漫。你呢 (I am fine. Just watching some anime. You?)

    你以什么为生?(What do you do for a living?)

    我在一家小型工程公司工作 (I work at a small engineering company)

    你以什么为生?(What do you do for a living?)

    我很好。你好吗?(I am fine. How are you?)

    Figure 12: Chinese M-Bert2Bert

• 我很喜欢在冬天向南移动. (I really like moving south in the winter.)

    我很好,我是一个女人.我很好. (I am fine, I am a woman. I am fine.)

    我很好,我很好.我很喜欢在户外烧毁. (I am fine, I am fine. I really like burning outdoors.)

    我很喜欢食物主义者.我很喜欢食素食主义者. (I really like foodists. I really like vegetarians.)

    我很好,我也是一位女孩子. (I am fine, I am also a girl.)

    我很好,我也是一位女人.我很喜欢的颜色是蓝色的. (I am fine, I am also a woman. My favorite color is blue.)

    我很喜欢在户外烧毁.我很喜欢食素食主义者. (I really like burning outdoors. I really like vegetarians.)

    我很好,我也是一位女孩子.我也很好. (I am fine, I am also a girl. I am also fine.)

    Figure 13: Chinese CrossLingual

amo la tv, e tu? (I love tv, and you?)

    non mi piace la tv. mi piace fare passeggiate (I do not like tv. I like taking walks)

    anche a me, mi piace andare al parco (me too, I like going to the park)

    mi piace anche andare al parco (I also like going to the park)

    qual'e la tua tv preferita? (what is your favorite tv?)

    il mio cibo preferito e la pizza (my favorite food is pizza)

    mi piace molto la pizza. (I really like pizza.)

    qual e il tuo cibo preferito? (what is your favorite food?)

    amo la pizza! e tu? (I love pizza! and you?)

    amo gli hamburger e le patatine fritte. (I love hamburgers and french fries.)

    Figure 14: Italian CausalBert

gioco con i miei amici, quindi mi piace guardare film (I play with my friends, so I like watching movies)

    sembra divertente. non ho tempo per gli hobby. (sounds fun. I have no time for hobbies.)

    perché? mi piace cantare le canzoni del film (why? I like singing the songs from the movie)

    è una buona cosa. mi piace anche la musica country. (that is a good thing. I also like country music.)

    qual'è il tuo preferito da cantare? (what is your favorite to sing?)

    mi piace anche la musica country. (I also like country music.)

    mi piace anche la musica country! mi piace il paese ogni anno (I also like country music! I like the country every year)

    mi piace anche la musica country. (I also like country music.)

    hai mai sentito parlare di Disney? (have you ever heard of Disney?)

    no, non l'ho fatto. (no, I have not.)

    Figure 15: Italian M-CausalBert

bere birra e mangiare cibo, anche calcio? (drinking beer and eating food, soccer too?)

    ho fatto 4 viaggi in messico l'anno scorso, ma non mi piace il cibo (I took 4 trips to mexico last year, but I do not like the food)

    non sono mai stato in messico (I have never been to mexico)

    non mi interessa il cibo messicano. mia moglie ce la fa molto, perché non le piaccio più. (I do not care for mexican food. my wife makes it a lot, because she does not like me anymore.)

    non la biasimo (I do not blame her)

    è divertente sciare, ti piace sciare? (skiing is fun, do you like skiing?)

    i miei amici vogliono farlo (my friends want to do it)

    freddo . cosa ti piace fare tu e i tuoi amici? (cold. what do you and your friends like to do?)

    posso guardare il calcio (I can watch soccer)

    il calcio mi sembra pericoloso. (soccer seems dangerous to me.)

    Figure 16: Italian PolyEncoder

• ciao, sei un grande fan del cibo? (hi, are you a big fan of food?)

    i miei genitori sono ancora vivi. (my parents are still alive.)

    sto bene. solo ascoltando musica. (I am fine. just listening to music.)

    Ciao! Cosa fa per vivere? (Hi! What do you do for a living?)

    hmmm. ho una sorella gemella. e tu? (hmmm. I have a twin sister. and you?)

    i miei figli adorano il rosso e il viola (my children love red and purple)

    i miei genitori sono morti quando avevo sei anni. (my parents died when I was six.)

    oh no, devi amare i tuoi gatti? (oh no, you must love your cats?)

    no, non sono un grande fan del baseball. (no, I am not a big fan of baseball.)

    goodfellas. ne ho sentito parlare davvero. ho letto molto e sono in giro (goodfellas. I have really heard of it. I have read a lot and am around)

    Figure 17: Italian M-Bert2Bert

my car is a lot of dreamers , ma i suoi film si acquisto spesso scarpe e orologi . attualmente lavoro per un rivenditore di orologi .

    i have a children who frequenta la scuola media . i film horror sono preferiti , but my father worked as a deposito . i suoi film sono in casa papa . i live in un tempo e spero di diventare un avvocato . i 'm a children who frequenta la scuola media . i film horror sono i suoi film preferiti .

    i 'm gonna have a lot of children , but i suoi film si e ' stato un fantasma di scarpe e orologi . attualmente lavoro per un rivenditore di orologi . attualmente la famiglia is not supporta le mie scelte di carriera . i film sono a children who frequenta la scuola media . i film horror sono i suoi film preferiti .

    i 've been a casa of a children , i suoi film sono preferiti . my father worked in casa a casa . i suoi film sono in casa di papi .

    Figure 18: Italian CrossLingual

c'est bien je suis une personne cool (that is good, I am a cool person)

    je suis sûr que vous aimez la musique classique? (I am sure you like classical music?)

    je le fais quand je peux (I do when I can)

    la musique classique est aussi mon préféré. (classical music is also my favorite.)

    j'aime le classique aussi (I like classical too)

    oui, j'aime aussi le rouge. ma soeur est un danseur de pole. (yes, I also like red. my sister is a pole dancer.)

    j'aime aussi le rouge (I also like red)

    j'aime aussi le rouge. ma sœur a une danseuse de pole. (I also like red. my sister has a pole dancer.)

    j'ai aussi une soeur (I also have a sister)

    aimez - vous écouter de la musique classique? (do you like listening to classical music?)

Figure 19: French CausalBert

j'aime les sports mais je préfère regarder les sports à l'extérieur (I like sports but I prefer watching sports outside)

    avez - vous des animaux domestiques? (do you have pets?)

    j'ai un chat et un chat. et vous? (I have a cat and a cat. and you?)

    pas d'animaux de compagnie. je ne veux pas le faire. (no pets. I do not want to.)

    Oh non. je pense que ce serait difficile. (Oh no. I think that would be hard.)

    je ne suis pas sûr. j'aime les sports de course. (I am not sure. I like racing sports.)

    les voitures sont amusantes. j'ai une Honda Civic Civic 2016. (cars are fun. I have a 2016 Honda Civic Civic.)

    agréable. je ne peux pas en avoir un mais je conduis une Toyota Corolla. (nice. I cannot have one but I drive a Toyota Corolla.)

    c'est une bonne voiture. avez - vous des animaux domestiques? (that is a good car. do you have pets?)

Figure 20: French M-CausalBert

• cool ! je regarde un film avec ma petite fille (cool! I am watching a movie with my little girl)

    cool ! je regarde un film avec ma petite fille (cool! I am watching a movie with my little girl)

    je ne peux pas attendre halloween pour que je puisse m'habiller (I cannot wait for halloween so I can dress up)

    as-tu un groupe préféré? les mines sont en tête de ligne (do you have a favorite band? mine are at the head of the line)

    je les aime tous, mais dernièrement mes filles m'ont fait du mal (I love them all, but lately my daughters have hurt me)

    agréable ! j'ai ma propre agence de conseil en marketing, bien que petite (nice! I have my own marketing consulting agency, although small)

    ukr! depuis la fin de la dernière saison, je vais chasser pour tuer le temps maintenant (ukr! since the end of last season, I go hunting to kill time now)

    ukr! depuis la fin de la dernière saison, je vais chasser pour tuer le temps maintenant (ukr! since the end of last season, I go hunting to kill time now)

    même ! mes filles sont trop jeunes pour aller à la chasse. (same! my daughters are too young to go hunting.)

Figure 21: French PolyEncoder

iya nih! j'ai des cheveux bruns, mais je suis un peu (iya nih! I have brown hair, but I am a bit)

    j'aime la couleur bleue et le bleu. (I like the color blue and blue.)

    Je ne sais pas. j'aime les chiens, mais je ne les ai pas, (I do not know. I like dogs, but I do not have them,)

    haha! je n'ai pas encore été touché par l'ouragan (haha! I have not been hit by the hurricane yet)

    iya nih. je suis végétalien donc je mange beaucoup de viande. (iya nih. I am vegan so I eat a lot of meat.)

    la musique country est cool. aimez - vous écouter de la musique? (country music is cool. do you like listening to music?)

    i'm en train de regarder les yeux verts, j'adore le vert (i'm looking at green eyes, I love green)

    c'est super. je suis un étudiant à temps plein (that is great. I am a full-time student)

    stephen king est un grand auteur. j'ai un cheval, il est mon meilleur ami. (stephen king is a great author. I have a horse, he is my best friend.)

    iya nih. je suis en train de me réveiller et de me détend (iya nih. I am waking up and relaxing)

Figure 22: French M-Bert2Bert

    i 'm a country . i 'm a my mari adopt.

    oh , i 'm a country . i 'm a country ?

    oh , i 'm a country . i 'm a my dog de berger .

    oh , i 'm a country . i 'm a country ?

Figure 23: French CrossLingual

私はあなたがすることを望みます!私はバーガーキングで働いています。(I hope you do! I work at Burger King.)

    ペットのカメがいますか (Do you have a pet turtle?)

    いいえ、しません。あなたは?(No, I do not. You?)

    私は4歳です (I am 4 years old)

    いいね!私は19歳で、両親の家に引っ越すのを待ちきれません。(Nice! I am 19 and cannot wait to move to my parents' house.)

    私の親友は隣に住んでいます (My best friend lives next door)

    私はちょうどバーガーキングでアルバイトをしています。(I just work part-time at Burger King.)

    私はペットのカメの人です (I am a pet turtle person)

    私は19歳です。ペットのカメを飼うことはできません。(I am 19. I cannot keep a pet turtle.)

    ああ、私はあなたが何歳かを持っていると思います (Ah, I think I have how old you are)

    Figure 24: Japanese CausalBert

• 私はプロのオリンピック選手です。(I am a professional Olympic athlete.)

    すごい!私は自分の食べ物のフライドポテトを食べています。(Wow! I am eating the french fries of my own food.)

    それは面白いです、私は彼らが私を助けることができると思います。(That is interesting, I think they can help me.)

    ええ、私はあなたが私のバーガーとフライドポテトを食べるのが好きです!(Yeah, I like you eating my burger and fries!)

    私はバーガーが大好きです。他の食べ物が好きですか?(I love burgers. Do you like other foods?)

    ええ、私は家にいて、散歩を楽しんでいます。(Yeah, I stay at home and enjoy walks.)

    私も。私はスポーツが好きです。特にオリンピックの旅行。(Me too. I like sports. Especially Olympic trips.)

    あなたの好きな色は何ですか?(What is your favorite color?)

    私の好きな色は紫です。あなたのものは何ですか?(My favorite color is purple. What is yours?)

    Figure 25: Japanese M-CausalBert

コミックをオンラインにし始めました。(I started putting my comics online.)

    コミックをオンラインにし始めました。(I started putting my comics online.)

    出版社を紹介するために自分で絵を描き始めました。(I started drawing by myself to introduce publishers.)

    すごいです 。コンドミニアムを青く塗りたい。(Amazing. I want to paint my condominium blue.)

    すごいです 。コンドミニアムを青く塗りたい。(Amazing. I want to paint my condominium blue.)

    カナダに住んでいますか?私はそこへ行ったことがありません 。(Do you live in Canada? I have never been there.)

    番号 。私はカナダにいます。(Number. I am in Canada.)

    好きな食べ物はありますか?私はステーキです。(Do you have a favorite food? Mine is steak.)

    はい、ステーキは私のお気に入りです。(Yes, steak is my favorite.)

    なぜあなたはカナダがそんなに好きですか?(Why do you like Canada so much?)

    Figure 26: Japanese PolyEncoder

我々は彼らが大きなスペースを持っていると思います。(We think they have a big space.)

    我々はあなたがそれを好むかもしれません。私はそれを (We, you may like it. I it)

    我々はたくさんのお金を稼ぎます (We make a lot of money)

    iyanih。私は夫がいます。彼は私の親友です (iya nih. I have a husband. He is my best friend)

    你在学校学校吗?(Are you at school school?)

    jeez、私の犬は私の犬です。(jeez, my dog is my dog.)

    ooohhhhhhhhhhhhhhhhhh

    jessgreenbergは私の好きな歌手です。(jessgreenberg is my favorite singer.)

    哦。私はすべての種類の音楽が好きです。(Oh. I like all kinds of music.)

    我々はあなたにとってとても素敵です。(We are very nice for you.)

    Figure 27: Japanese M-Bert2Bert

その両親は電気技術のためには日本のオーハーを弾きます.

    その両親は日本のオーハーにも運動していた. その他のタイヤンは日本のタイナニンクの大好きになった. その日は日本の両親は日本の競技にも新しいものを作った. その両親は日本のオーハーをもう

    その両親は電気技術のためには日本のオーハーを弾きます. その両親は日本のオーハートにも活動していた. その両親は日本のオーハーニンクの大好きてもいた. その他のタイヤンは日本のタイナニン

    その両親は日本のタイヤンは運動していた. その両親は日本のオーハーソンとの大好きな. その両親は日本のオーハーチコーチにも日本のオーハーになる. その両親は日本のオーハーにも活動していた.

    Figure 28: Japanese CrossLingual

• 나는 아마도 내가 직장에서 일하는 것처럼 그것을 시도해봐야 할 것이다. (I should probably try it like I work at my job.)

    그것은 정말 멋질 것이다. 나는 내가 지금까지 살고 있었으면 좋겠다. (That would be really cool. I wish I had been living until now.)

    나도 그래. 하지만 난 그냥 운동도 할 수 없어. (Me too. But I just cannot even exercise.)

    그것은 매우 재미있을 것 같다. (That sounds very fun.)

    응, 난 그게 좋아. 그게 내가 하는 일이야. (Yeah, I like that. That is what I do.)

    너는 어떤 종류의 음악을 좋아하니? (What kind of music do you like?)

    나는 힙합이 더 좋아. 너는? (I prefer hip hop. You?)

    나는 랩을 좋아한다. (I like rap.)

    너는 어떤 종류의 음악을 좋아하니? (What kind of music do you like?)

    나는 랩을 좋아하지만, 랩은 너무 좋아. (I like rap, but rap is so good.)

    Figure 29: Korean CausalBert

나는 주말에 춤을 춘다. (I dance on weekends.)

    그것은 멋지다. 나는 언젠가 큰 부모를 찾고 싶다. (That is cool. I want to find big parents someday.)

    나의 학생들도 그래. 너는 그것을 좋아하니? (My students do too. Do you like it?)

    그것은 내가 그것을 할 수 있을 때 좋다. (It is good when I can do it.)

    나는 그것에 대해 들어본 적이 없다. 나는 또한 그것을 싫어한다. (I have never heard of it. I also hate it.)

    나는 음악을 좋아한다. 나는 또한 민속음악을 좋아한다. (I like music. I also like folk music.)

    나는 주로 락음악을 듣는다. (I mostly listen to rock music.)

    나도! 나는 피아노 치는 것을 좋아해. 너는? (Me too! I like playing the piano. You?)

    나는 피아노 치는 것을 좋아한다. 그것은 멋진 취미다. (I like playing the piano. It is a nice hobby.)

    너는 어떤 악기를 연주하니? (What instrument do you play?)

    Figure 30: Korean M-CausalBert

좋아, 나는 매운 음식을 먹고있다 (Good, I am eating spicy food)

    !! 나는 비디오 게임을하고 있었다. 당신은 무엇을 좋아합니까? (!! I was playing video games. What do you like?)

    노아 나는 카메라 절약 (noah, I save cameras)

    나는 비디오 게임을한다. 32에서 나는 일하고 게임을하지 않아야합니다. (I play video games. At 32 I should be working and not gaming.)

    지난 4 년 동안 나는 영화관에서 일했다. (For the last 4 years I worked at a movie theater.)

    일반적으로 어떤 종류의 것들을 촬영합니까? (What kinds of things do you usually shoot?)

    밤에는 정기적으로 운행합니다. 나도 긴 머리를 묶어야 해 (It runs regularly at night. I also have to tie up my long hair)

    그것은 많은 일처럼 들린다. 포니 테일로 입어? (That sounds like a lot of work. Do you wear it in a ponytail?)

    허리 길이는 길지만 직장에서 포니 테일로 유지합니다. (It is waist-length, but I keep it in a ponytail at work.)

    나는 일하고 싶지 않다. 나는 엄마 아빠와 함께 사는 것이 좋습니다. (I do not want to work. I like living with my mom and dad.)

    Figure 31: Korean PolyEncoder

tidak, saya tinggal di kota kecil dan bepergian ke luar negeri. (no, I live in a small town and travel abroad.)

    hai aku baik - baik saja bagaimana kabarmu (hi, I am fine, how are you)

    tidak, saya tidak pernah pergi ke pantai (no, I have never gone to the beach)

    hai! saya bekerja di kantor saya sendiri. Apa yang kamu kerjakan? (hi! I work in my own office. What do you do?)

    hai! saya juga! apakah kamu tinggal di rumah bersama anak - anak? (hi! me too! do you stay at home with the kids?)

    iya nih! Saya suka mendengarkan musik. (yeah! I like listening to music.)

    hai disana, apakah kamu seorang perawat? (hi there, are you a nurse?)

    iya nih. Saya suka pergi ke konser. musik adalah satu - satunya teman saya (yeah. I like going to concerts. music is my only friend)

    iya nih! Saya suka musik rock dan saya suka bernyanyi di kamar (yeah! I like rock music and I like singing in my room)

    Figure 32: Korean M-Bert2Bert

• saya menontonnya juga. Saya menonton sepakbola untuk perguruan tinggi (I watch it too. I watch college football)

    Saya suka menontonnya. Saya menghabiskan satu dekade bekerja di depot rumah (I like watching it. I spent a decade working at the home depot)

    itu keren. apakah kamu suka olahraga? (that is cool. do you like sports?)

    saya lakukan, saya suka menonton sepak bola. (I do, I like watching soccer.)

    saya tidak menonton banyak olahraga. (I do not watch many sports.)

    Saya suka menonton horor, apakah Anda suka film horor? (I like watching horror, do you like horror movies?)

    tidak juga. saya lebih suka menonton sepak bola. (not really. I prefer watching soccer.)

    bagus, apa film kesukaanmu? (great, what is your favorite movie?)

    saya suka sci fi. saya suka menontonnya. (I like sci fi. I like watching it.)

    Saya suka film horor, favorit saya adalah horor (I like horror movies, my favorite is horror)

    Figure 33: Indonesian CausalBert

saya suka berbelanja di sana. saya tidak pernah mendengar tentang mobil. (I like shopping there. I have never heard about cars.)

    mobil apa? (what car?)

    saya tidak punya mobil. saya akan membelinya di sana. (I do not have a car. I will buy one there.)

    apakah kamu punya hobi lain? (do you have other hobbies?)

    hanya mendengarkan musik. (just listening to music.)

    saya melihat. apakah kamu suka musik? (I see. do you like music?)

    ya saya suka musik rap. (yes, I like rap music.)

    keren. apakah kamu punya hobi? (cool. do you have hobbies?)

    saya suka menghabiskan waktu bersama anak - anak saya. kamu? (I like spending time with my children. you?)

    Saya suka mengerjakan game Starcraft. (I like working on the game Starcraft.)

    Figure 34: Indonesian M-CausalBert

oh tidak ! saya punya kucing, namanya mater. (oh no! I have a cat, his name is mater.)

    nama kucing saya mater, dia kucing! (my cat's name is mater, he is a cat!)

    oh keren kamu punya hewan peliharaan! apakah kamu suka petir mcqueen? saya lakukan. (oh cool, you have a pet! do you like lightning mcqueen? I do.)

    saya belum melihat yang satu itu. seperti apa petir mcqueen? (I have not seen that one. what is lightning mcqueen like?)

    kucing saya bernama mater. setelah kaki mater dari mcqueen petir! film favorit saya (my cat is named mater. after mater's leg from lightning mcqueen! my favorite movie)

    kucing saya bernama mater. setelah kaki mater dari mcqueen petir! film favorit saya (my cat is named mater. after mater's leg from lightning mcqueen! my favorite movie)

    nama kucing saya mater, dia kucing! (my cat's name is mater, he is a cat!)

    saya memiliki hasrat untuk membaca petualangan ruang sci fi. apakah anda suka membaca ? (I have a passion for reading space sci fi adventures. do you like reading?)

    membaca itu bagus. saya juga suka menonton bola basket untuk bersantai. apakah kamu menyukai olahraga ? (reading is great. I also like watching basketball to relax. do you like sports?)

    Figure 35: Indonesian PolyEncoder

apakah kamu bermain olahraga? (do you play sports?)

    well, saya mengerti apa yang Anda maksud, apakah Anda suka musik (well, I understand what you mean, do you like music)

    hei kamu suka makanan? (hey, do you like food?)

    Saya suka bepergian dan mengunjungi taman nasional (I like traveling and visiting national parks)

    iya nih! saya tidak akan pernah bisa melakukannya! (yeah! I will never be able to do it!)

    iya nih. saya tidak tahu bagaimana itu bisa terjadi. (yeah. I do not know how that could happen.)

    oh, saya ingin menjadi dokter hewan. (oh, I want to be a veterinarian.)

    baik, saya harus menonton filmnya (well, I have to watch the movie)

    oh, apakah kamu memasak untuknya? (oh, do you cook for him?)

    hai, saya seorang vegetarian dan saya mencintai binatang. (hi, I am a vegetarian and I love animals.)

    Figure 36: Indonesian M-Bert2Bert

• saya juga bertukar untuk menjadi perawat sukarang dan mata biru .

    di waktu luang i pernah melakukan pekerjaan sukarela .

    i am a lot to have a sukarawat . i have a lot of sukarang and i have been doing a sukaren with my own friends . i love to live at the night . i have been working in a sukarela at night to pay for a sukarela .

    di waktu luang i pernah melakukan pekerjaan sukarela . i have been menikmati di sekitar orang sukarela . i have been a lot of sukarela . i have been in the place to have a lot of sukarela and i have been a sukarela at the night .

    Figure 37: Indonesian CrossLingual