The SAWA Corpus - A parallel Corpus English - Swahili

Language Technologies for African Languages – AfLaT 2009. The SAWA Corpus: A Parallel Corpus English - Swahili, by Guy De Pauw, Peter Waiganjo Wagacha and Gilles-Maurice de Schryver

Page 1: The SAWA Corpus - A parallel Corpus English - Swahili

Language Technologies for African Languages – AfLaT 2009

The SAWA Corpus
A Parallel Corpus English - Swahili

Guy De Pauw, Peter Waiganjo Wagacha, Gilles-Maurice de Schryver

Page 2: The SAWA Corpus - A parallel Corpus English - Swahili

Resource-scarceness

• Language technology vs the digital divide
• Digital data increasingly important for African languages (web, mobile phone, …)
• But: most research on African languages is rooted in the knowledge-based paradigm (↔ LT for Indo-European languages):
  - Hand-crafted expert systems
  - Typically high accuracy for the domain
  - Limited portability to other languages and subdomains
  - Costly development phase
  - Limited resources (linguistic, expertise, financial, …)
• Need for a cheaper and faster (language-independent) alternative for developing African language technology

Page 3: The SAWA Corpus - A parallel Corpus English - Swahili

Data-driven approaches

• For Indo-European and Asian languages: the data-driven, corpus-based approach has been the dominant paradigm since the 1990s
• Basic methodology: automatically extract linguistic knowledge from annotated text material (corpus) and bootstrap the development of language technology components
• Advantages:
  - Language independence: portability (!)
  - Knowledge-acquisition bottleneck → data-acquisition bottleneck
  - Robustness
• AfLaT team: explore application of the data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)

Page 4: The SAWA Corpus - A parallel Corpus English - Swahili

Machine Translation

3 paradigms:
- Rule-based MT
- Statistical MT (data-driven)
- Example-based MT (data-driven)

Learn translation from examples: parallel corpus!

Page 5: The SAWA Corpus - A parallel Corpus English - Swahili

Parallel Corpus

Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level

SAWA Corpus: parallel corpus English - Swahili

Page 6: The SAWA Corpus - A parallel Corpus English - Swahili

Example

Universal Declaration of Human Rights

Preamble

Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."

UMOJA WA MATAIFA OFISI YA IDARA YA HABARI
TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU

UTANGULIZI

Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,

Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,

Page 7: The SAWA Corpus - A parallel Corpus English - Swahili

3 phases

• Data-collection: finding parallel texts
• Data-constitution: aligning the parallel texts on word level
• Data-exploitation:
  - Statistical Machine Translation
  - Bootstrapping linguistic annotation

Page 8: The SAWA Corpus - A parallel Corpus English - Swahili

Data Collection

• Limited availability of parallel texts English – Kiswahili:
  - Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
  - Bible, Quran, secular literature
  - New translations
• "There is no data like more data"

Page 9: The SAWA Corpus - A parallel Corpus English - Swahili

Data Collection

• Even if the source data is digitally available beforehand, we are often faced with tough alignment problems during data constitution.

e.g. paragraph alignment

Page 10: The SAWA Corpus - A parallel Corpus English - Swahili

Universal Declaration of Human Rights

Preamble

Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."

UMOJA WA MATAIFA OFISI YA IDARA YA HABARI
TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU

UTANGULIZI

Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,

Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,

Page 11: The SAWA Corpus - A parallel Corpus English - Swahili

e.g. sentence alignment

Article 12

No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation.
Everyone has the right to the protection of the law against such interference or attacks.

Kifungu cha 12

Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake.
Wala asivunjiwe heshima na sifa yake.
Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
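
The Article 12 example can be written down as alignment "beads": groups of sentence indices on each side that translate each other. The sketch below is only one possible representation, not the storage format of the SAWA Corpus. The `Bead` class and the helper names are hypothetical, and the grouping (English sentence 1 with Kiswahili sentences 1 and 2, English sentence 2 with Kiswahili sentence 3) is a reading of the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bead:
    """One alignment unit: sentence indices that translate each other."""
    en: tuple
    sw: tuple


english = [
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and reputation.",
    "Everyone has the right to the protection of the law against such "
    "interference or attacks.",
]
kiswahili = [
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, "
    "ya jamaa yake, ya nyumbani mwake au ya barua zake.",
    "Wala asivunjiwe heshima na sifa yake.",
    "Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi "
    "au mambo kama hayo.",
]

# Assumed reading of the slide: a 1-2 bead followed by a 1-1 bead.
beads = [Bead(en=(0,), sw=(0, 1)), Bead(en=(1,), sw=(2,))]


def bead_types(beads):
    """Return the 'n-m' pattern of each bead, e.g. '1-2'."""
    return [f"{len(b.en)}-{len(b.sw)}" for b in beads]


def covers_all(beads, n_en, n_sw):
    """Check that every sentence on both sides is used exactly once, in order."""
    en_used = [i for b in beads for i in b.en]
    sw_used = [j for b in beads for j in b.sw]
    return en_used == list(range(n_en)) and sw_used == list(range(n_sw))


print(bead_types(beads))
print(covers_all(beads, len(english), len(kiswahili)))
```

A sentence count mismatch like this one (2 English sentences, 3 Kiswahili sentences) is why a plain line-by-line pairing is not enough.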

Page 12: The SAWA Corpus - A parallel Corpus English - Swahili

Available data in SAWA Corpus

                     English     Kiswahili   English   Kiswahili
                     Sentences   Sentences   Words     Words
New Testament        16.4k       16.3k       189.2k    151.1k
Quran                14.3k       14.5k       165.5k    124.3k
Declaration of HR     0.2k        0.2k         1.8k      1.8k
Kamusi.org            5.6k        5.6k        35.5k     26.7k
Movie Subtitles       9.0k        9.0k        72.2k     58.4k
Investment Reports    3.2k        3.1k        52.9k     54.9k
Local Translator      1.5k        1.6k        25.0k     25.7k
Total                50.2k       50.3k       542.1k    442.9k

All manually sentence aligned!
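
As a quick consistency check on the table above, the per-source counts can be summed and compared with the reported totals. This is a minimal sketch: the figures are copied from the slide (in thousands), the function names are illustrative, and it assumes that where the slide gives a single sentence count for a source, that count applies to both languages.

```python
# Counts in thousands: (English sentences, Kiswahili sentences,
#                       English words, Kiswahili words).
# Assumption: a single sentence count on the slide applies to both languages.
rows = {
    "New Testament":      (16.4, 16.3, 189.2, 151.1),
    "Quran":              (14.3, 14.5, 165.5, 124.3),
    "Declaration of HR":  (0.2, 0.2, 1.8, 1.8),
    "Kamusi.org":         (5.6, 5.6, 35.5, 26.7),
    "Movie Subtitles":    (9.0, 9.0, 72.2, 58.4),
    "Investment Reports": (3.2, 3.1, 52.9, 54.9),
    "Local Translator":   (1.5, 1.6, 25.0, 25.7),
}
reported_total = (50.2, 50.3, 542.1, 442.9)


def column_sums(rows):
    """Sum each column over all sources, rounded to one decimal like the slide."""
    return tuple(round(sum(col), 1) for col in zip(*rows.values()))


def words_per_sentence(row):
    """Average sentence length (in words) for the English and Kiswahili side."""
    en_sents, sw_sents, en_words, sw_words = row
    return en_words / en_sents, sw_words / sw_sents


print(column_sums(rows) == reported_total)
print(words_per_sentence(reported_total))
```

Under that assumption the columns add up to the reported totals, and the English side uses more words than the Kiswahili side for roughly the same number of sentences.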

Thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)

Thanks to Dr. James Omboga Zaja, University of Nairobi

Page 21: The SAWA Corpus - A parallel Corpus English - Swahili

Word alignment

Most difficult task: relate words between languages

No , she 's , uh , up north
La , yuko , aa , juu kaskazini

Page 22: The SAWA Corpus - A parallel Corpus English - Swahili

Word alignment

You caught me skiving , I 'm afraid .

Samahani , umenidaka nikihepa .

Page 23: The SAWA Corpus - A parallel Corpus English - Swahili

Word alignment

• Can be done automatically using established tools (GIZA++)
• Provide manual reference to evaluate automatic word alignment tools (5000 words)

Page 24: The SAWA Corpus - A parallel Corpus English - Swahili

Current results

Still a lot of room for improvement

Precision   Recall   F(β=1)
39.4%       44.5%    41.79%
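
Scores like these are commonly obtained by comparing the links proposed by the automatic aligner with the links in the manual reference. The sketch below shows that computation under the assumption that a link is a (source index, target index) pair. The function names and the toy links are hypothetical, and the source does not state the exact scoring script that was used.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_recall_f(predicted, reference, beta=1.0):
    """Score predicted alignment links against a manual reference.

    Each link is a (source_index, target_index) pair.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall, beta)


# Hypothetical toy links, for illustration only (not taken from the corpus).
reference = {(0, 0), (1, 1), (2, 1), (3, 2)}
predicted = {(0, 0), (1, 1), (3, 3)}
print(precision_recall_f(predicted, reference))

# Recomputing F from the precision and recall reported on the slide.
print(100 * f_score(0.394, 0.445))
```

With beta set to 1, precision and recall are weighted equally, which matches the F(β=1) column.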

Page 25: The SAWA Corpus - A parallel Corpus English - Swahili

Word alignment

Some alignment patterns are easy

No , she 's , uh , up north
La , yuko , aa , juu kaskazini

Page 26: The SAWA Corpus - A parallel Corpus English - Swahili

Alignment problems

nimemkatalia

I have turned him down

Page 27: The SAWA Corpus - A parallel Corpus English - Swahili

Morphological decomposition

I have turned him down

ni+ me+ m+ katalia
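
The decomposition above can be imitated with a toy prefix stripper. This is a minimal sketch, assuming a tiny hand-written list of markers per slot (subject, tense, object). The lists and the function name `segment_verb` are illustrative assumptions, and the source does not say which tool produced the segmentation on the slide.

```python
# Hypothetical, deliberately tiny prefix inventories; real Swahili has many more.
SUBJECT = ["ni", "u", "a", "tu", "m", "wa"]
TENSE = ["me", "na", "li", "ta"]
OBJECT = ["ni", "ku", "m", "tu", "wa"]


def segment_verb(word, min_stem=3):
    """Greedily strip a subject, a tense and (optionally) an object prefix."""
    morphemes = []
    rest = word
    for slot, optional in ((SUBJECT, False), (TENSE, False), (OBJECT, True)):
        # Try longer prefixes first, and always leave at least min_stem letters.
        candidates = sorted(slot, key=len, reverse=True)
        match = next(
            (p for p in candidates
             if rest.startswith(p) and len(rest) - len(p) >= min_stem),
            None,
        )
        if match is None:
            if optional:
                continue
            return [word]  # give up: leave the word unsegmented
        morphemes.append(match + "+")
        rest = rest[len(match):]
    return morphemes + [rest]


print(" ".join(segment_verb("nimemkatalia")))
```

Greedy stripping of this kind will also cut into stems that happen to start with a marker, so it only illustrates the idea of aligning English words to morphemes rather than to whole Swahili word forms.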

Page 28: The SAWA Corpus - A parallel Corpus English - Swahili

Current results

Morpheme/Word alignment

Better alignment, but more complicated decoding

Precision   Recall   F(β=1)
50.2%       64.5%    55.8%

Page 31: The SAWA Corpus - A parallel Corpus English - Swahili

Future work

• Projection of Annotation
• Refine GIZA++ alignment
• Part-of-speech tagger
• No data like more data: web-mining & comparable corpora
• Example-based MT (OmegaT)
• Statistical MT (Moses)

Page 32: The SAWA Corpus - A parallel Corpus English - Swahili

Conclusion

• Modest, but workable parallel corpus English – Swahili

• Bi-directional Machine Translation is now in the cards

• Modest, but encouraging word alignment scores

• Data-driven approach is viable for African languages