Available for Download ✅

⚠️ Always check the license of the data source before using the data ⚠️

Brief Description

Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.

Tatoeba provides a tool for you to see examples of how words are used in the context of a sentence. You specify words that interest you, and it returns sentences containing these words with their translations in the desired languages. The name "Tatoeba" means "for example" in Japanese, which captures this concept.

Other Notes

Getting a parallel Irish-English corpus involves downloading and joining several files: the Irish sentences file, the English sentences file, a links file that maps each sentence to its translations, and a user languages file that gives the self-reported skill level of the person who added the translation.

  • Lines of text: 1,973
  • GA Word count: 10,352

Word Count Distribution

Let's take a quick peek at the word count distribution for Irish. It turns out to be mostly very short sentences.
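
The plot itself is not reproduced here. As a rough sketch, the same distribution can be tabulated by counting whitespace-separated tokens per sentence, which is how the `ga_len` column is built further down. The sample sentences are taken from the data shown on this page; the helper name `word_count_distribution` is made up for illustration.

```python
import pandas as pd

def word_count_distribution(sentences):
    """Return a Series mapping word count -> number of sentences."""
    lengths = pd.Series(sentences, dtype="object").str.split().str.len()
    return lengths.value_counts().sort_index()

sample = [
    "Cá bhfuil críochfort na mbus?",   # 5 words
    "Táim i ngrá leat.",               # 4 words
    "Tá grá agam duit.",               # 4 words
    "Déanta.",                         # 1 word
]
dist = word_count_distribution(sample)
print(dist)
```

On the full file, passing `df.ga` instead of `sample` should give the counts behind the histogram.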

Code to Extract to a Pandas DataFrame

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('tatoeba/gle_sentences_detailed.tsv', sep='\t', header=None)
df.columns = ['id', 'lang', 'ga', 'username', 'date_added', 'date_modified']
df['ga_len'] = df.ga.str.split().str.len()
df.head()
id lang ga username date_added date_modified ga_len
0 557291 gle Cá bhfuil críochfort na mbus? niq 2010-10-10 13:17:41 2010-10-10 13:17:41 5
1 557299 gle Nuair a dhúisigh mé, bhí brón orm. niq 2010-10-10 13:20:49 2010-10-10 13:20:49 7
2 557533 gle Tosaíonn an t-oideachas sa bhaile. niq 2010-10-10 14:39:19 2010-10-10 14:39:19 5
3 557579 gle Táim i ngrá leat. niq 2010-10-10 14:48:53 2010-10-25 12:30:14 4
4 557591 gle Glanaimid ár rang tar éis scoile. niq 2010-10-10 14:52:42 2010-10-10 15:10:57 6

Load remaining necessary files. All these files can be downloaded from the Tatoeba downloads page: https://tatoeba.org/eng/downloads

# english sentences
en_df = pd.read_csv('tatoeba/eng_sentences_detailed.tsv', sep='\t', header=None)
en_df.columns = ['en_id', 'lang', 'en', 'username', 'date_added', 'date_modified']

# translation links files
l_df = pd.read_csv('tatoeba/links.csv', sep='\t', header=None)
l_df.columns = ['id1', 'id2']

# tags file - Not super helpful for Irish as not many tags
# t_df = pd.read_csv('tatoeba/tags.csv', sep='\t', header=None)
# t_df.columns = ['id', 'tag']

# User languages and self-reported skill level
u_df = pd.read_csv('tatoeba/user_languages.csv', sep='\t', header=None)
u_df.columns = ['user_lang', 'skill', 'user', 'details']
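u_df = u_df.query('user_lang == "gle"').copy()   # keep Irish (gle) rows only; copy avoids SettingWithCopyWarning
u_df.loc[u_df.skill == '\\N', 'skill'] = -1       # '\N' means no skill level reported; encode it as -1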
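
The code above handles the literal `\N` after loading. As an alternative sketch (not what this page does), `read_csv` can treat `\N` as missing through its `na_values` parameter, which keeps `skill` numeric from the start. The inline rows are invented sample data in the same four-column layout, not real Tatoeba users.

```python
import io
import pandas as pd

# Invented sample rows: language, skill, user, details (tab-separated).
raw = "gle\t4\talice\tIrish teacher\ngle\t\\N\tbob\t\\N\neng\t5\tcarol\t\\N\n"

u_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["user_lang", "skill", "user", "details"],
    na_values=["\\N"],   # parse the literal \N as NaN
)
u_df = u_df[u_df.user_lang == "gle"].copy()            # keep Irish only
u_df["skill"] = u_df["skill"].fillna(-1).astype(int)   # -1 marks "not reported"
print(u_df)
```

One side effect of this variant is that `\N` in the `details` column also becomes `NaN`.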

Merge

Merge all the files into our Irish DataFrame.

# ga to translation links
df = df.merge(l_df, left_on='id', right_on='id1')
# merge english
df = df.merge(en_df[['en_id','en']], left_on='id2', right_on='en_id')
# merge tags
#df = df.merge(t_df, left_on='id', right_on='id', how='left')
# merge users and skill level
df = df.merge(u_df, left_on='username', right_on='user', how='left')
df.loc[df.skill.isna(), 'skill'] = -1
df = df[['id', 'en_id', 'lang', 'ga', 'en', 'ga_len','skill', 'details']]
df.head()
id en_id lang ga en ga_len skill details
0 557291 35406 gle Cá bhfuil críochfort na mbus? Where is the bus terminal? 5 -1 NaN
1 557299 1361 gle Nuair a dhúisigh mé, bhí brón orm. When I woke up, I was sad. 7 -1 NaN
2 557533 19122 gle Tosaíonn an t-oideachas sa bhaile. Education starts at home. 5 -1 NaN
3 557579 1434 gle Táim i ngrá leat. I love you. 4 -1 NaN
4 934942 1434 gle Tá grá agam duit. I love you. 4 -1 NaN
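
Rows 3 and 4 above share `en_id` 1434: the merge produces one row per translation pair, so an English sentence repeats whenever several Irish sentences link to it. The toy example below sketches that behaviour, assuming the same column names as above. The third link, to sentence id 42, is a hypothetical non-English translation added to show that the inner merge drops it.

```python
import pandas as pd

ga = pd.DataFrame({
    "id": [557579, 934942, 5599832],
    "ga": ["Táim i ngrá leat.", "Tá grá agam duit.", "Déanta."],
})
links = pd.DataFrame({
    "id1": [557579, 934942, 5599832],
    "id2": [1434, 1434, 42],          # 42: hypothetical id, not in the English file
})
en = pd.DataFrame({"en_id": [1434], "en": ["I love you."]})

pairs = (
    ga.merge(links, left_on="id", right_on="id1")
      # validate raises MergeError if en_id is not unique on the right side
      .merge(en, left_on="id2", right_on="en_id", validate="many_to_one")
)
print(pairs[["id", "en_id", "ga", "en"]])
```

Both Irish sentences end up paired with "I love you.", and the sentence with no English link is dropped.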

The self-reported skill distribution shows that most sentences come from contributors who haven't reported their Irish skill level (shown as -1).

# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df.skill.astype(int), discrete=True)
plt.title('Self-reported skill distribution');

Save the file and we're done!

df.to_csv('processed_data/tatoeba_en-ga_20200612.csv')
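
By default `to_csv` writes the DataFrame index as an unnamed first column, so reading the file back needs `index_col=0` to avoid an extra `Unnamed: 0` column. A small round-trip sketch, using a temporary directory and a made-up file name rather than the real `processed_data` path:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"ga": ["Tá tart orm."], "en": ["I'm thirsty."], "skill": [-1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tatoeba_en-ga_sample.csv")   # hypothetical file name
    df.to_csv(path)                              # index goes into the first column
    restored = pd.read_csv(path, index_col=0)    # read that column back as the index

print(restored)
```

Passing `index=False` to `to_csv` is the other option if the row index is not needed.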

A few more samples

df.sample(50)
id en_id lang ga en ga_len skill details
1016 3602540 703243 gle Táim ag labhairt le mo mhac léinn. I'm speaking with my student. 7 -1 NaN
1590 5599832 1357603 gle Déanta. Done. 1 -1 NaN
1702 6319314 2014783 gle Cén fáth a mbeimis ag iarraidh pionós a chur ort? Why would we want to punish you? 10 4 NaN
1055 3603017 3603008 gle Tá gabhlóg anseo. There is a fork here. 3 -1 NaN
141 871635 5152872 gle Chonaic sé seanchara an tseachtain seo caite n... Last week he saw an old friend whom he hadn't ... 12 3 NaN
397 2610940 2604279 gle Céard is ábhar taighde don tSoivéideolaí? What does a Sovietologist study? 6 -1 NaN
649 3128067 1079842 gle Tá mé ag léamh an nuachtán. I'm reading the newspaper. 6 -1 NaN
409 2712366 2705597 gle Ní chanaim. I do not sing. 2 -1 NaN
363 2150800 1476581 gle Tá sé an-dorcha. It's very dark. 3 -1 NaN
1671 6319282 2014752 gle Dúirt Tom go raibh comhluadar uaidh. Tom said he wanted some company. 6 4 NaN
911 3601095 1784975 gle Níl a fhios agam cén fáth. I don't know why. 6 -1 NaN
243 873503 873502 gle Sin í an bhean a bhfanann siad léi. That is the woman they stay with. 8 3 NaN
1265 3944944 3868719 gle Cén teanga atá á labhairt aige? What language is he speaking? 6 -1 NaN
1313 3961159 463294 gle Is duine é seo. This is a person. 4 -1 NaN
1934 8239950 273600 gle Go raibh maith agat roimh ré. Thanks in advance. 6 3 NaN
1753 7075957 989164 gle Léim an leabhar. I read the book. 3 0 NaN
1687 6319298 2014768 gle Nílimid ag iarraidh ach rudaí a dhíol leat. We just want to sell you things. 8 4 NaN
407 2712330 2684430 gle An bhfuil mé do chara? Am I your friend? 5 -1 NaN
362 2150798 1615217 gle Tá sé an-tirim. It's very dry. 3 -1 NaN
299 874891 874890 gle Nach rabhthas sásta? Weren't they satisfied? 3 -1 NaN
543 5599829 1053192 gle Bígí cúramach! Careful! 2 -1 NaN
987 3602422 2361385 gle Níl teileafón agam. I don't have a telephone. 3 -1 NaN
1929 8239940 772806 gle Níl a fhios ag aon duine cá bhfuil sé. Nobody knows where it is. 9 3 NaN
948 3601175 2549673 gle Tháinig sibh ar ais. You came back. 4 -1 NaN
1539 5516952 5516950 gle Sin bealach amháin le breathnú air is docha. That's one way of looking at it, I suppose. 8 1 NaN
1215 3896264 2700686 gle Níl aon fhadhb ann. There is no problem. 4 -1 NaN
410 2712386 4969010 gle Tá an leabhar ar an sheilf. The book is on the shelf. 6 -1 NaN
681 3233354 2002544 gle Tá an seomra dorcha. The room is dark. 4 -1 NaN
198 871759 871758 gle Conas mar a rinne tú é? How did you do it? 6 -1 NaN
523 2715102 2659060 gle Scríobh sí litir. She wrote a letter. 3 -1 NaN
1363 4445009 2474700 gle Bhí mé díomách sin. I was so disappointed. 4 3 Níl Gaeilge líofa agam, ach tá a fhios agam a ...
1722 6319355 1126729 gle Nuair a thugaim cuairt ar mo gharmhac, tugaim ... When I go to see my grandson, I always give hi... 14 4 NaN
662 3128092 2297248 gle Tá sé ag scríobh leabhair. He's writing a book. 5 -1 NaN
1895 7804899 7926273 gle Amach leat! Out you go! 2 3 Caint as Cúige Uladh
855 3599611 3599609 gle Tá Spáinnis aici. She knows Spanish. 3 -1 NaN
1090 3603161 2363944 gle Is cailín mé. I am a girl. 3 -1 NaN
1456 4773304 5127613 gle Caithfidh mé péire bróg nua a cheannach. I must buy a new pair of shoes. 7 4 Irish teacher for 20+ years.
1827 7801402 429220 gle Ádh mór! Good luck! 2 3 Caint as Cúige Uladh
1688 6319299 2014769 gle Ba mhaith linn rud a phlé le Tom. We want to have a word with Tom. 8 4 NaN
1760 7290675 7290677 gle Gheofá bainne a bhaint as na ba. You could get milk from the cows. 7 5 NaN
1587 5599801 393357 gle Mas é bhur dtoil é. Please. 5 -1 NaN
700 3335788 16255 gle Cad tá uait? What are you looking for? 3 -1 NaN
1824 7801398 348091 gle Tar isteach. Come in. 2 3 Caint as Cúige Uladh
1953 8290368 1192601 gle Tá ríomhaire uaim. I want a computer. 3 3 Caint as Cúige Uladh
1126 3604289 60147 gle Is liomsa an teach seo. This house is mine. 5 -1 NaN
949 3601177 3419582 gle Tá mo dheartháir níos láidre ná mé. My brother is stronger than me. 7 -1 NaN
1580 5599784 39996 gle Tá brón orm... Sorry... 3 -1 NaN
451 2714581 1814 gle Tá tart orm. I'm thirsty. 3 -1 NaN
845 3599522 3591019 gle Ní ach socrú sealadach é. It's only a temporary fix. 5 -1 NaN
138 871633 5152871 gle Tá fear ag an doras atá ag iarraidh caint leat. There's a man at the door who's asking to spea... 10 3 NaN