Jekyll2020-11-08T08:29:33-06:00https://www.nlp.irish/feed.xmlnlp.IrishCurated Irish language datasets for NLP researchTatoeba2020-06-12T00:00:00-05:002020-06-12T00:00:00-05:00https://www.nlp.irish/tatoeba/translation/nmt/mt/2020/06/12/Tatoeba<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-12-Tatoeba.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Available-for-Download--✅">Available for Download ✅<a class="anchor-link" href="#Available-for-Download--✅"> </a></h3><p>⚠️ Always check the license of the data source before using the data ⚠️</p>
<ul>
<li>Main page: <a href="https://tatoeba.org/eng">https://tatoeba.org/eng</a></li>
<li>Data Browse Link: <a href="https://tatoeba.org/eng/downloads">https://tatoeba.org/eng/downloads</a></li>
<li>Kaggle notebook showing how to Download: <a href="https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba">https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba</a></li>
<li>Github: <a href="https://github.com/Tatoeba/tatoeba2">https://github.com/Tatoeba/tatoeba2</a></li>
<li>Format: <strong>.tsv</strong> and <strong>.csv</strong></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Brief-Description">Brief Description<a class="anchor-link" href="#Brief-Description"> </a></h3><p>Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.</p>
<p>Tatoeba provides a tool for you to see examples of how words are used in the context of a sentence. You specify words that interest you, and it returns sentences containing these words with their translations in the desired languages. The name Tatoeba, which means "for example" in Japanese, captures this concept.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Notes">Other Notes<a class="anchor-link" href="#Other-Notes"> </a></h3><p>Getting a parallel Irish-English corpus involves downloading and joining several files: the Irish sentences file, the English sentences file, a Links file that maps one to the other, and a Users file that provides the self-reported skill level of the person who added the translation.</p>
<ul>
<li>Lines of text: 1,973</li>
<li>GA Word count: 10,352</li>
</ul>
</div>
</div>
</div>
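<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the role of the Links file concrete, here is a minimal sketch of the join on a tiny in-memory sample. The two sentence pairs are taken from the tables in this post; the id 999999 is a made-up placeholder for a translation in some other language, and the frame names <code>ga</code>, <code>en</code>, <code>links</code> and <code>pairs</code> are illustrative only.</p>
</div>
</div>
</div>

```python
import pandas as pd

# Two Irish sentences and their English translations, copied from the
# tables shown in this post.
ga = pd.DataFrame({
    "id": [557291, 557299],
    "ga": ["Cá bhfuil críochfort na mbus?", "Nuair a dhúisigh mé, bhí brón orm."],
})
en = pd.DataFrame({
    "en_id": [35406, 1361],
    "en": ["Where is the bus terminal?", "When I woke up, I was sad."],
})

# The links table pairs sentence ids across every language.
# 999999 is a made-up id standing in for a non-English translation.
links = pd.DataFrame({
    "id1": [557291, 557291, 557299],
    "id2": [35406, 999999, 1361],
})

# Step 1: attach every linked id to each Irish sentence.
pairs = ga.merge(links, left_on="id", right_on="id1")
# Step 2: the inner join keeps only the links that point at English sentences.
pairs = pairs.merge(en, left_on="id2", right_on="en_id")

print(pairs[["id", "en_id", "ga", "en"]])
```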
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Word-Count-Distribution">Word Count Distribution<a class="anchor-link" href="#Word-Count-Distribution"> </a></h3><p>Let's take a quick look at the word count distribution for Irish. Most sentences turn out to be very short.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAXcAAAEXCAYAAABWNASkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAZK0lEQVR4nO3dfZRdZWHv8e9PXlVQ3gYMSWhQUy26auSOwBVaKVgLVBusYAEvBqSNXQvu0lWtovd6wVVptUvFxbqW2yhIoPImgkRLq7koIL01MGB4CcESMSRDUjLIuyiV8Lt/7GfKYTgz52TOTGbmye+z1qyz97Ofs/fzzEl+Z89z9tmPbBMREXV5yVQ3ICIiJl7CPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3mLYknS3pH6a6Hb2QdIqkm1vWn5L06gna9yclfbUsz5NkSdtP0L73K23dbiL2F1tfwj2iS5IOlzTYyz5s72L7/ok4ju2/tv2nvbSn5ZhrJb29Zd/rSls3T8T+Y+tLuMeUUyP/FrfARJ2hR73yH2obIelAST+W9KSkb0i6QtJnyrbdJX1H0pCkR8vynFH2c6qkb7esr5F0Zcv6ekkLyvJbJd0q6fHy+NaWejdIOkfSvwBPA6+WtL+kG0sblwN7dejTQkkrJT0h6aeSjirl+0paJumR0r4/a3nORcP9LusvOEsuZ7AflXRnafcVknaW9HLgn4B9y3DFU5L2bdOmPcuxn5B0C/CaEdst6bVl+RhJ95T+PliO2/Y4ZYjqKkn/IOkJ4JRRhq0+IGmDpI2SPtJNvyVdAuwHfLsc72Mjh3k6/E7PlnSlpItLX1ZJ6h/rtYvJl3DfBkjaEbgGuAjYA7gMeHdLlZcAXwN+g+Y/+S+B/z3K7m4EfkfSSyTNAnYADi3HeTWwC3CnpD2AfwTOA/YEvgj8o6Q9W/Z1MrAY2BV4ALgUuI0m1P8KWDRGnw4CLgb+EtgN+F1gbdl8GTAI7AscB/y1pCNH21cb7wWOAvYHfhs4xfYvgKOBDWW4YhfbG9o898vAr4BZwAfKz2guAD5oe1fgjcD3OxxnIXBV6e/XR9nn7wHzgXcAZ7YOtYzG9snAOuBd5Xh/26Zap9/pHwGXl7YtY/R/P7GVJNy3DYcA2wPn2f617auBW4Y32v657W/aftr2k8A5wNva7aiMFz8JLCh1vgs8KOn1Zf2Htp8D/hC4z/Yltp+1fRlwL/Cult1dZHuV7WdpwvAtwKdsP2P7JuDbjO404ELby20/Z/tB2/dKmgscBnzc9q9srwS+SvNG0q3zbG+w/Uhpw4JunlQ+fHwP8L9s/8L23cDSMZ7ya+AASa+w/ajt2zsc4l9tf6v095ej1Pl0OfZdNG/YJ3bT9rF0+Tu92fZ1ZYz+EuBNvR43epNw3zbsCzzoF94lbv3wgqSXSfp7SQ+UP/lvAnbT6FdK3AgcTnO2fCNwA02wv62sDx/zgRHPewCY3a4Npf6j5cy1tf5o5gI/bVO+L/BIeZMa7bid/HvL8tM0f410o4/mTbS1X2P14T3AMcADZTjqv3bY//oO20fWeYDm99Grbn6nI39nO+dzgamVcN82bARmS1JL2dyW5Y8ArwMOtv0KmtAGaK3fajjcf6cs38iLw30DzTBPq/2AB1vWW99sNgK7lzHn1vqjWc+I8eyW4+4haddRjvsL4GUt2141xjFG6nQL1SHgWV74ux21D7Zvtb0Q2Bv4FjD82cVox+nmFq4jjz08pNOp32Ptu9PvNKahhPu24V+BzcAZkraXtBA4qGX7rjTj7I+VsfKzOuzvRpqx3ZfaHgR+SDNGvSfw41LnOuA3JZ1UjvknwAHAd9rt0PYDwADwaUk7SjqMFw7hjHQBcKqkI8v4/2xJr7e9Hvh/wN+UD0J/m2YIZ3iMeiVwjKQ9JL0K+HCHvrZ6CNhT0itH6cNm4Grg7PLX
0AGM8rlB6eP7JL3S9q+BJ2heo47H6eBT5dhvAE4Frijlnfr9END2+vsufqcxDSXctwG2/wP4Y5r/kI8B/40mZJ8pVb4EvBR4GPgR8M8d9vdvwFM0oY7tJ4D7gX8Zvi7a9s+Bd9L8VfBz4GPAO20/PMauTwIOBh6heYO5eIw23EITXucCj9O84Qz/pXAiMI/mjPMa4Czby8u2S4A7aD58/R7Ph19Htu+l+WDxfkmPtbtaBjiDZhjn32k+wP7aGLs8GVhbhsL+nOZ16fY4o7kRWANcD3ze9vdKead+/w3wP8vxPtpmv2P9TmMaUibr2DZJWgH8H9tjhU9EzFA5c99GSHqbpFeVIZJFNJf4jXmGHhEzVz7N3na8juYDu11orjI5zvbGqW1SREyWDMtERFQowzIRERWaFsMye+21l+fNmzfVzYiImFFuu+22h233tds2LcJ93rx5DAwMTHUzIiJmFEmjfgM6wzIRERVKuEdEVCjhHhFRoYR7RESFEu4RERVKuEdEVKjrcJe0nZo5OL9T1veXtELSfWrmmdyxlO9U1teU7fMmp+kRETGaLTlz/xCwumX9c8C5tucDj9LcTpby+Kjt19LcjvVzE9HQiIjoXlfhLmkOzZyYXy3rAo6gmawXmnkijy3LC3l+3sirgCNHzAAUERGTrNtvqH6JZrKF4Wm29gQeKxMbQzMr+vB8irMp8zjaflbS46X+CyZpkLSYZuZ79ttvrNnUpq9LV6xrW37SwTOzPxFRj45n7pLeCWyyfVtrcZuq7mLb8wX2Etv9tvv7+treGiEiIsapmzP3Q4E/knQMsDPwCpoz+d0kbV/O3ufw/ES8gzST9A6W2c9fSTNtWkREbCUdz9xtf8L2HNvzgBOA79t+H/AD4LhSbRFwbVlexvOTAh9X6uem8RERW1Ev17l/HPgLSWtoxtQvKOUX0Mzcvgb4C+DM3poYERFbaotu+Wv7BuCGsnw/cFCbOr8Cjp+AtkVExDjlG6oRERVKuEdEVCjhHhFRoYR7RESFEu4RERVKuEdEVCjhHhFRoYR7RESFEu4RERVKuEdEVCjhHhFRoYR7RESFEu4RERVKuEdEVCjhHhFRoYR7RESFEu4RERXqGO6SdpZ0i6Q7JK2S9OlSfpGkn0laWX4WlHJJOk/SGkl3SjpwsjsREREv1M00e88AR9h+StIOwM2S/qls+0vbV42ofzQwv/wcDJxfHqe9S1esa1t+0sH7beWWRET0puOZuxtPldUdyo/HeMpC4OLyvB8Bu0ma1XtTIyKiW12NuUvaTtJKYBOw3PaKsumcMvRyrqSdStlsYH3L0wdL2ch9LpY0IGlgaGiohy5ERMRIXYW77c22FwBzgIMkvRH4BPB64C3AHsDHS3W120WbfS6x3W+7v6+vb1yNj4iI9rboahnbjwE3AEfZ3liGXp4BvgYcVKoNAnNbnjYH2DABbY2IiC51c7VMn6TdyvJLgbcD9w6Po0sScCxwd3nKMuD95aqZQ4DHbW+clNZHRERb3VwtMwtYKmk7mjeDK21/R9L3JfXRDMOsBP681L8OOAZYAzwNnDrxzZ6ZcjVORGwtHcPd9p3Am9uUHzFKfQOn9960iIgYr3xDNSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQt3MobqzpFsk3SFplaRPl/L9Ja2QdJ+kKyTtWMp3KutryvZ5k9uFiIgYqZsz92eAI2y/CVgAHFUmvv4ccK7t+cCjwGml/mnAo7ZfC5xb6kVExFbUMdzdeKqs7lB+DBwBXFXKlwLHluWFZZ2y/UhJmrAWR0RER12NuUvaTtJKYBOwHPgp8JjtZ0uVQWB2WZ4NrAco2x8H9myzz8WSBiQNDA0N9daLiIh4ga7C3fZm2wuAOcBBwG+1q1Ye252l+0UF9hLb/bb7+/r6um1v
RER0YYuulrH9GHADcAiwm6Tty6Y5wIayPAjMBSjbXwk8MhGNjYiI7nRztUyfpN3K8kuBtwOrgR8Ax5Vqi4Bry/Kysk7Z/n3bLzpzj4iIybN95yrMApZK2o7mzeBK29+RdA9wuaTPAD8GLij1LwAukbSG5oz9hElod0REjKFjuNu+E3hzm/L7acbfR5b/Cjh+QloXERHjkm+oRkRUKOEeEVGhhHtERIUS7hERFUq4R0RUKOEeEVGhhHtERIUS7hERFUq4R0RUKOEeEVGhhHtERIUS7hERFUq4R0RUKOEeEVGhhHtERIUS7hERFUq4R0RUqJs5VOdK+oGk1ZJWSfpQKT9b0oOSVpafY1qe8wlJayT9RNIfTGYHIiLixbqZQ/VZ4CO2b5e0K3CbpOVl27m2P99aWdIBNPOmvgHYF/i/kn7T9uaJbHhERIyu45m77Y22by/LTwKrgdljPGUhcLntZ2z/DFhDm7lWIyJi8mzRmLukeTSTZa8oRWdIulPShZJ2L2WzgfUtTxukzZuBpMWSBiQNDA0NbXHDIyJidN0MywAgaRfgm8CHbT8h6XzgrwCXxy8AHwDU5ul+UYG9BFgC0N/f/6LtAZeuWNe2/KSD99vKLYmImaarM3dJO9AE+9dtXw1g+yHbm20/B3yF54deBoG5LU+fA2yYuCZHREQn3VwtI+ACYLXtL7aUz2qp9m7g7rK8DDhB0k6S9gfmA7dMXJMjIqKTboZlDgVOBu6StLKUfRI4UdICmiGXtcAHAWyvknQlcA/NlTan50qZiIitq2O4276Z9uPo143xnHOAc3poV0RE9CDfUI2IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIirUzTR7cyX9QNJqSaskfaiU7yFpuaT7yuPupVySzpO0RtKdkg6c7E5ERMQLdXPm/izwEdu/BRwCnC7pAOBM4Hrb84HryzrA0TTzps4HFgPnT3irIyJiTB3D3fZG27eX5SeB1cBsYCGwtFRbChxblhcCF7vxI2C3EZNpR0TEJNuiMXdJ84A3AyuAfWxvhOYNANi7VJsNrG952mApi4iIraTrcJe0C/BN4MO2nxirapsyt9nfYkkDkgaGhoa6bUZERHShq3CXtANNsH/d9tWl+KHh4ZbyuKmUDwJzW54+B9gwcp+2l9jut93f19c33vZHREQb3VwtI+ACYLXtL7ZsWgYsKsuLgGtbyt9frpo5BHh8ePgmIiK2ju27qHMocDJwl6SVpeyTwGeBKyWdBqwDji/brgOOAdYATwOnTmiLIyKio47hbvtm2o+jAxzZpr6B03tsV0RE9CDfUI2IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIirUzTdUY4a4dMW6tuUnHbzfVm5JREy1nLlHRFQo4R4RUaGEe0REhRLuEREVSrhHRFQo4R4RUaGEe0REhRLuEREV6mYO1QslbZJ0d0vZ2ZIelLSy/BzTsu0TktZI+omkP5ishkdExOi6OXO/CDiqTfm5theUn+sAJB0AnAC8oTzn7yRtN1GNjYiI7nQMd9s3AY90ub+FwOW2n7H9M5pJsg/qoX0RETEOvYy5nyHpzjJss3spmw2sb6kzWMpeRNJiSQOSBoaGhnpoRkREjDTecD8feA2wANgIfKGUq01dt9uB7SW2+2339/X1jbMZERHRzrjC3fZDtjfbfg74Cs8PvQwCc1uqzgE29NbEiIjYUuMKd0mzWlbfDQxfSbMMOEHSTpL2B+YDt/TWxIiI2FId7+cu6TLgcGAvSYPAWcDhkhbQDLmsBT4IYHuVpCuBe4BngdNtb56cpkdExGg6hrvtE9sUXzBG/XOAc3ppVERE9CbfUI2IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKiQgn3iIgKJdwjIiqUcI+IqFDCPSKi
Qgn3iIgKdby3TMx8l65YN+q2kw7ebyu2JCK2lpy5R0RUKOEeEVGhhHtERIUS7hERFar6A9XRPkjMh4gRUbuOZ+6SLpS0SdLdLWV7SFou6b7yuHspl6TzJK2RdKekAyez8RER0V43wzIXAUeNKDsTuN72fOD6sg5wNM2k2POBxcD5E9PMiIjYEh3D3fZNwCMjihcCS8vyUuDYlvKL3fgRsJukWRPV2IiI6M54P1Ddx/ZGgPK4dymfDaxvqTdYyl5E0mJJA5IGhoaGxtmMiIhoZ6KvllGbMreraHuJ7X7b/X19fRPcjIiIbdt4w/2h4eGW8riplA8Cc1vqzQE2jL95ERExHuMN92XAorK8CLi2pfz95aqZQ4DHh4dvIiJi6+l4nbuky4DDgb0kDQJnAZ8FrpR0GrAOOL5Uvw44BlgDPA2cOgltjoiIDjqGu+0TR9l0ZJu6Bk7vtVEREdGb3H4gIqJCCfeIiAol3CMiKlT1jcNi/HLTtYiZLWfuEREVSrhHRFQo4R4RUaGEe0REhRLuEREVSrhHRFQo4R4RUaGEe0REhRLuEREVSrhHRFQo4R4RUaGEe0REhRLuEREV6umukJLWAk8Cm4FnbfdL2gO4ApgHrAXea/vR3poZERFbYiLO3H/P9gLb/WX9TOB62/OB68t6RERsRZMxLLMQWFqWlwLHTsIxIiJiDL2Gu4HvSbpN0uJSto/tjQDlce92T5S0WNKApIGhoaEemxEREa16nYnpUNsbJO0NLJd0b7dPtL0EWALQ39/vHtsREREtejpzt72hPG4CrgEOAh6SNAugPG7qtZEREbFlxh3ukl4uadfhZeAdwN3AMmBRqbYIuLbXRkZExJbpZVhmH+AaScP7udT2P0u6FbhS0mnAOuD43psZ010m1I6YXsYd7rbvB97UpvznwJG9NCoiInqTb6hGRFQo4R4RUaFeL4WccqON9UZEbMty5h4RUaGEe0REhRLuEREVSrhHRFQo4R4RUaGEe0REhWb8pZAxM+V2BRGTK2fuEREVSrhHRFQo4R4RUaGMuUeVMqYf27qcuUdEVChn7jEj5Ew8Yssk3COKvIFETSZtWEbSUZJ+ImmNpDMn6zgREfFikxLukrYDvgwcDRwAnCjpgMk4VkREvNhkDcscBKwp86wi6XJgIXDPJB0vYtJs6XBNrfXHes5UmeyhtIn8XWztYT/ZnvidSscBR9n+07J+MnCw7TNa6iwGFpfV1wE/adnFXsDDE96w6aX2PqZ/M1/tfayhf79hu6/dhsk6c1ebshe8i9heAixp+2RpwHb/ZDRsuqi9j+nfzFd7H2vv32R9oDoIzG1ZnwNsmKRjRUTECJMV7rcC8yXtL2lH4ARg2SQdKyIiRpiUYRnbz0o6A/gusB1woe1VW7CLtsM1lam9j+nfzFd7H6vu36R8oBoREVMr95aJiKhQwj0iokLTLtxrv22BpLWS7pK0UtLAVLdnIki6UNImSXe3lO0habmk+8rj7lPZxl6M0r+zJT1YXseVko6Zyjb2QtJcST+QtFrSKkkfKuVVvIZj9K+a17CdaTXmXm5b8G/A79NcTnkrcKLtar7ZKmkt0G97pn954j9J+l3gKeBi228sZX8LPGL7s+VNenfbH5/Kdo7XKP07G3jK9uensm0TQdIsYJbt2yXtCtwGHAucQgWv4Rj9ey+VvIbtTLcz9/+8bYHt/wCGb1sQ05jtm4BHRhQvBJaW5aU0/5lmpFH6Vw3bG23fXpafBFYDs6nkNRyjf1WbbuE+G1jfsj5IfS+Cge9Juq3cgqFW+9jeCM1/LmDvKW7PZDhD0p1l2GZGDlmMJGke8GZgBRW+hiP6BxW+hsOmW7h3vG1BBQ61fSDNHTNPL3/yx8xzPvAaYAGwEfjC1Dand5J2Ab4JfNj2E1PdnonWpn/VvYatplu4V3/bAtsbyuMm4BqaoagaPVTGOofHPDdNcXsmlO2HbG+2/RzwFWb46yhpB5rg+7rtq0txNa9hu/7V
9hqONN3CverbFkh6eflAB0kvB94B3D32s2asZcCisrwIuHYK2zLhhkOveDcz+HWUJOACYLXtL7ZsquI1HK1/Nb2G7Uyrq2UAyuVIX+L52xacM8VNmjCSXk1ztg7NrR8uraF/ki4DDqe5hepDwFnAt4Argf2AdcDxtmfkh5Kj9O9wmj/nDawFPjg8Pj3TSDoM+CFwF/BcKf4kzbj0jH8Nx+jfiVTyGrYz7cI9IiJ6N92GZSIiYgIk3CMiKpRwj4ioUMI9IqJCCfeIiAol3CMiKpRwj+hCuT3sR6e6HRHdSrhHRFRoUibIjphuJH0KeB/NXUcfprmn9+PAYmBHYA1wsu2nu9jXa4AvA33A08Cf2b5X0kXAE0A/8CrgY7avmvjeRHSWM/eonqR+4D00t3r9Y5rwBbja9ltsv4nmHt+ndbnLJcB/t/1fgI8Cf9eybRZwGPBO4LMT0PyIccmZe2wLDgOutf1LAEnfLuVvlPQZYDdgF+C7nXZUbhv7VuAbzf2oANippcq3yl0G75G0zwS1P2KLJdxjW9BungCAi4Bjbd8h6RSam4F18hLgMdsLRtn+TBfHjZh0GZaJbcHNwLsk7VzOvP+wlO8KbCz3+n5fNzsqkzz8TNLx0NxOVtKbJqPREb1IuEf1bN9Kc2/yO4CrgQGaD1M/RXNb2+XAvVuwy/cBp0m6A1hF5vmNaSi3/I1tgqRdbD8l6WXATcDi4UmTI2qUMffYViyRdACwM7A0wR61y5l7RAtJ/wM4fkTxN2qYMSu2LQn3iIgK5QPViIgKJdwjIiqUcI+IqFDCPSKiQv8fwOf7FlEdVp8AAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
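<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The code that drew this plot is not shown in the post. As a rough sketch of what "word count" means here, assuming the same whitespace splitting used for the <code>ga_len</code> column later on, the counts for the first five Irish sentences can be reproduced in plain Python and printed as a small text histogram.</p>
</div>
</div>
</div>

```python
from collections import Counter

# The first five Irish sentences from the table in this post.
sentences = [
    "Cá bhfuil críochfort na mbus?",
    "Nuair a dhúisigh mé, bhí brón orm.",
    "Tosaíonn an t-oideachas sa bhaile.",
    "Táim i ngrá leat.",
    "Glanaimid ár rang tar éis scoile.",
]

# Whitespace tokenisation: punctuation stays attached and a hyphenated
# form such as "t-oideachas" counts as a single word.
lengths = [len(s.split()) for s in sentences]

# Map each sentence length to the number of sentences with that length.
distribution = Counter(lengths)

# One row per sentence length, one '#' per sentence.
for n_words in sorted(distribution):
    print(n_words, "#" * distribution[n_words])
```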
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Extract-to-a-Pandas-DataFrame">Code to Extract to a Pandas DataFrame<a class="anchor-link" href="#Code-to-Extract-to-a-Pandas-DataFrame"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'tatoeba/gle_sentences_detailed.tsv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'id'</span><span class="p">,</span> <span class="s1">'lang'</span><span class="p">,</span> <span class="s1">'ga'</span><span class="p">,</span> <span class="s1">'username'</span><span class="p">,</span> <span class="s1">'date_added'</span><span class="p">,</span> <span class="s1">'date_modified'</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'ga_len'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">ga</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">split</span><span class="p">()</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>lang</th>
<th>ga</th>
<th>username</th>
<th>date_added</th>
<th>date_modified</th>
<th>ga_len</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>557291</td>
<td>gle</td>
<td>Cá bhfuil críochfort na mbus?</td>
<td>niq</td>
<td>2010-10-10 13:17:41</td>
<td>2010-10-10 13:17:41</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>557299</td>
<td>gle</td>
<td>Nuair a dhúisigh mé, bhí brón orm.</td>
<td>niq</td>
<td>2010-10-10 13:20:49</td>
<td>2010-10-10 13:20:49</td>
<td>7</td>
</tr>
<tr>
<th>2</th>
<td>557533</td>
<td>gle</td>
<td>Tosaíonn an t-oideachas sa bhaile.</td>
<td>niq</td>
<td>2010-10-10 14:39:19</td>
<td>2010-10-10 14:39:19</td>
<td>5</td>
</tr>
<tr>
<th>3</th>
<td>557579</td>
<td>gle</td>
<td>Táim i ngrá leat.</td>
<td>niq</td>
<td>2010-10-10 14:48:53</td>
<td>2010-10-25 12:30:14</td>
<td>4</td>
</tr>
<tr>
<th>4</th>
<td>557591</td>
<td>gle</td>
<td>Glanaimid ár rang tar éis scoile.</td>
<td>niq</td>
<td>2010-10-10 14:52:42</td>
<td>2010-10-10 15:10:57</td>
<td>6</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Load the remaining files needed for the join. All of them can be downloaded from the Tatoeba downloads page: <a href="https://tatoeba.org/eng/downloads">https://tatoeba.org/eng/downloads</a></p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># english sentences</span>
<span class="n">en_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'tatoeba/eng_sentences_detailed.tsv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">en_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'en_id'</span><span class="p">,</span> <span class="s1">'lang'</span><span class="p">,</span> <span class="s1">'en'</span><span class="p">,</span> <span class="s1">'username'</span><span class="p">,</span> <span class="s1">'date_added'</span><span class="p">,</span> <span class="s1">'date_modified'</span><span class="p">]</span>
<span class="c1"># translation links files</span>
<span class="n">l_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'tatoeba/links.csv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">l_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'id1'</span><span class="p">,</span> <span class="s1">'id2'</span><span class="p">]</span>
<span class="c1"># tags file - Not super helpful for Irish as not many tags</span>
<span class="c1"># t_df = pd.read_csv('tatoeba/tags.csv', sep='\t', header=None)</span>
<span class="c1"># t_df.columns = ['id', 'tag']</span>
<span class="c1"># User languages and self-reported skill level</span>
<span class="n">u_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'tatoeba/user_languages.csv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">u_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_lang'</span><span class="p">,</span> <span class="s1">'skill'</span><span class="p">,</span> <span class="s1">'user'</span><span class="p">,</span> <span class="s1">'details'</span><span class="p">]</span>
<span class="n">u_df</span> <span class="o">=</span> <span class="n">u_df</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s1">'user_lang == "gle"'</span><span class="p">)</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span> <span class="c1"># keep Irish (gle) rows only; copy so the assignment below does not touch a view</span>
<span class="n">u_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">u_df</span><span class="o">.</span><span class="n">skill</span><span class="o">==</span><span class="s1">'</span><span class="se">\\</span><span class="s1">N'</span><span class="p">,</span> <span class="s1">'skill'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># '\N' marks an unreported skill level; encode it as -1</span>
</pre></div>
</div>
</div>
</div>
</div>
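<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One possible alternative for the skill column, sketched here on made-up rows rather than the real file: pandas can be asked to treat the literal <code>\N</code> marker as missing through the <code>na_values</code> argument of <code>read_csv</code>, so the column parses as numbers and the -1 placeholder can be filled in afterwards. The usernames <code>user_a</code> and <code>user_b</code> and the name <code>u_gle</code> are hypothetical.</p>
</div>
</div>
</div>

```python
import io

import pandas as pd

# Made-up rows in the same four-column, tab-separated layout as
# user_languages.csv; user_a and user_b are hypothetical usernames.
raw = "gle\t5\tuser_a\t\ngle\t\\N\tuser_b\t\nfra\t3\tuser_a\t\n"

u_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["user_lang", "skill", "user", "details"],
    na_values=["\\N"],  # treat the literal \N marker as missing
)

# Keep Irish only; .copy() avoids modifying a view of the full table.
u_gle = u_df[u_df["user_lang"] == "gle"].copy()

# Encode "not reported" as -1, matching the convention used in this post.
u_gle["skill"] = u_gle["skill"].fillna(-1).astype(int)

print(u_gle[["user", "skill"]])
```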
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Merge">Merge<a class="anchor-link" href="#Merge"> </a></h2><p>Merge all the files into our Irish DataFrame.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># ga to translation links</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">l_df</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'id'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s1">'id1'</span><span class="p">)</span>
<span class="c1"># merge english</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">en_df</span><span class="p">[[</span><span class="s1">'en_id'</span><span class="p">,</span><span class="s1">'en'</span><span class="p">]],</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'id2'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s1">'en_id'</span><span class="p">)</span>
<span class="c1"># merge tags</span>
<span class="c1">#df = df.merge(t_df, left_on='id', right_on='id', how='left')</span>
<span class="c1"># merge users and skill level</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">u_df</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'username'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s1">'user'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">skill</span><span class="o">.</span><span class="n">isna</span><span class="p">(),</span> <span class="s1">'skill'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">'id'</span><span class="p">,</span> <span class="s1">'en_id'</span><span class="p">,</span> <span class="s1">'lang'</span><span class="p">,</span> <span class="s1">'ga'</span><span class="p">,</span> <span class="s1">'en'</span><span class="p">,</span> <span class="s1">'ga_len'</span><span class="p">,</span><span class="s1">'skill'</span><span class="p">,</span> <span class="s1">'details'</span><span class="p">]]</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>en_id</th>
<th>lang</th>
<th>ga</th>
<th>en</th>
<th>ga_len</th>
<th>skill</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>557291</td>
<td>35406</td>
<td>gle</td>
<td>Cá bhfuil críochfort na mbus?</td>
<td>Where is the bus terminal?</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>557299</td>
<td>1361</td>
<td>gle</td>
<td>Nuair a dhúisigh mé, bhí brón orm.</td>
<td>When I woke up, I was sad.</td>
<td>7</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>557533</td>
<td>19122</td>
<td>gle</td>
<td>Tosaíonn an t-oideachas sa bhaile.</td>
<td>Education starts at home.</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>557579</td>
<td>1434</td>
<td>gle</td>
<td>Táim i ngrá leat.</td>
<td>I love you.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>934942</td>
<td>1434</td>
<td>gle</td>
<td>Tá grá agam duit.</td>
<td>I love you.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
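<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Note that the preview above pairs the English sentence "I love you." with two different Irish sentences, since one sentence can have several translations. The following is a small sketch, run on just the five rows shown above, of how one might count sentence pairs, Irish words, and English sentences that appear in more than one pair. It is an illustration only and not necessarily how the totals under Other Notes were produced; the names <code>n_pairs</code>, <code>n_ga_words</code> and <code>shared_en</code> are made up for this example.</p>
</div>
</div>
</div>

```python
import pandas as pd

# The five merged rows shown in the preview table (ids and word counts only).
df = pd.DataFrame({
    "id": [557291, 557299, 557533, 557579, 934942],
    "en_id": [35406, 1361, 19122, 1434, 1434],
    "ga_len": [5, 7, 5, 4, 4],
})

# Number of Irish-English sentence pairs.
n_pairs = len(df)

# Total number of whitespace-separated Irish words.
n_ga_words = int(df["ga_len"].sum())

# English sentences that are paired with more than one Irish sentence.
shared_en = df.loc[df["en_id"].duplicated(keep=False), "en_id"].unique()

print(n_pairs, n_ga_words, shared_en.tolist())
```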
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The self-reported skill distribution shows that most contributors haven't reported their Irish skill level (encoded here as -1).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sns</span><span class="o">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">skill</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Self-reported skill distribution'</span><span class="p">);</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAX0AAAEWCAYAAACKSkfIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAX/UlEQVR4nO3de5SdVZ3m8e/ThJswCEJESIKhR1pFelSMgLduFMcGdMBxpOUyGm0ctAdndHSWonaLjjpqTy9vq70sWhAUkXa8LFBoMaMirTbR4AXBoGR5IemgRLlLe4n+5o+zS49FJZXUqdRJan8/a51V77v3ft+937cqz3nPPu85SVUhSerDH4x7AJKkuWPoS1JHDH1J6oihL0kdMfQlqSOGviR1xNDXvSSpJA9qy7sn+WSSO5L833GPbVtI8twkX5zBdlcmef4m6v4xyfKp9j98fregj9cmubAtH5Tk7iQ7be1YN7Hv9yb567Z8dJJ1s7Hftr8nJPnObO1Ps8fQn6eSPD7Jl1tY35rkS0kePYNdPRPYH9i3qk6a5WHOis2F77hU1XFVdcEs7/Omqtqzqn69uXZb+iRWVS+sqtfPxtgmP5FV1T9V1YNnY9+aXQvGPQDNviR7AZ8C/hL4CLAL8ATgFzPY3QOB71bVxi3se8GWth1VkgCZi77mmyQ7TffkofnJK/356Y8AqurDVfXrqvrXqvpMVV070SDJXyRZneS2JFckeeDknSR5HfAa4FltWuH0qTprV3lnJrkRuLGVPSTJivYq4ztJ/nyo/fltamFFkruSfGG4/ySPTfLV9irlq0keO1R3ZZI3JvkScA/wQQZPaH/Xxvh3W9D/vkkuTXJnkq8A/3ZTJzLJbkkuTPLTJLe38ew/RbsDklyb5H8OjXOrX30kObidj7uSrAD2G6pb2s71grb+3CTfa22/n+S0JA8F3gs8pp2P24fO+XuSXJ7kZ8ATW9kbJvX/qiQ/SfKDJKdNOu/PH1r/7auJJFe14m+2Pp81ebooyUPbPm5Pcn2SE4bqzk/yriSXtWNZmWSTvxONqKp8zLMHsBfwU+AC4Dhgn0n1TwfWAA9l8Grvr4AvD9UX8KC2/Frgwmn6K2AFcD9gd2APYC3wvLb/w4GfAA9r7c8H7gL+BNgVeAfwxVZ3P+A24Nlt21Pa+r6t/krgJuBhrX7nVvb8ofFM1//FDF4B7QEcBvzLRP9THNsLgE8C9wF2Ah4F7DU0lucDS4HvAmcMbffbMQHPHd7/8Pmdor9/Bt7azsuftPN0Yatb2rZd0MZ+J/DgVnfA0PH9Xn9D5/wO4HEMLvZ2a2VvaPVHAxuH+v5T4GdD+598jjd7TG1/69ryzgz+3l7F4FXnk9pxPXhobLcCR7Rj+xBw8bj/Hc3Xh1f681BV3Qk8nsE/xL8HNrQr24kr1BcAb6qq1TWYivnfwCOmutrfCm+qqlur6l+BpwE/qKr3V9XGqvoa8DEG7w9MuKyqrqqqXwCvZnBlugR4KnBjVX2wbfth4AbgPwxte35VXd/qfzXFWDbZfwZvgv4n4DVV9bOquo7Bk+Om/ArYl0Gg/bqqrmnnd8KhDALx7Ko6Z0tP1lSSHAQ8GvjrqvpFVV3F4AlnU34DHJZk96q6uaqun6aLS6rqS1X1m6r6+SbaTPT9BeAy4M830W5rHAXsCby5qn5ZVZ9jMP14ylCbj1fVV9rf44eAR8xCv5qCoT9PtUB/blUtZnA1eyDw9lb9QOAd7aX27QyusgIsmm6/7aX53e3xhKGqtUPLDwSOnNh/6+M04AFTta+qu9sYDmyPH07q9oeTxraWzdtc/wsZXE0O72Nyf8M+CFwBXJxkfZK/SbLzUP1pDF4pfHSaMW2JA4Hbqupn042ttXkW8ELg5jY18pBp9j/deZuq7wOn2WZLHAisrarfTNr38O/0R0PL9zB4ktA2YOh3oKpuYPAS+rBW
tBZ4QVXtPfTYvaq+vAX7elgN7iDZs6r+abhqaHkt8IVJ+9+zqv5yqM2SiYUkezKY1lnfHpNfcRzEIFin6muq9c31v4HBNMaSofYHbeZ4f1VVr6uqQ4HHMngV8ZyhJq9lMHV0UUa/lfJmYJ8ke2zh2K6oqn/PYGrnBgav6uDe54NpyidM1ff6tvwzBlNcE4afwKezHliSZDhvJv9ONUcM/XmovYn5siSL2/oSBi+lr25N3gu8MsnDWv19k8zm7ZifAv4oybOT7Nwej25vMk44PoPbSncBXg+srKq1wOVt21OTLEjyLAZTKJ/aTH8/Bv5wS/qvwR0rHwdem+Q+SQ4Flm9qx0memOSPW6DfyWC6Z/iul18BJzGYY//gpGDbKlX1Q2AV8LokuyR5PL8/rTU8rv2TnNBC+hfA3UPj+jGwuJ3brTXR9xMYPMFNfDbjG8Az2jl7EDD5Tf3Jv4NhKxk8aby8/S6Obsd18QzGpxEZ+vPTXcCRwMp2p8bVwHXAywCq6hPAWxhMWdzZ6o6brc6r6i7gKcDJDK7yftT623Wo2UXA2QymdR7FYJqEqvopg7B5GYM3o18OPK2qfrKZLt/BYL7+tiTv3IL+X8Rg+uBHDF4BvX8z+34Ag6mbO4HVwBeACycd7y+BZwD3B84bJfiBUxn87m5lcH4+sIl2f8DgHK1vbf8U+K+t7nPA9cCPkmzuvE32IwZvmq9nMK/+wvYqEeBtwC8ZhPsFrX7Ya4EL2nTa770P0M7PCQz+xn4CvBt4ztC+NYdS5X+iormV5HwGd3b81bjHIvXGK31J6oihL0kdcXpHkjrilb4kdWS7/sK1/fbbr5YuXTruYUjSDuWaa675SVUtnKpuuw79pUuXsmrVqnEPQ5J2KEk2+Slzp3ckqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakj2/Unckd10cqbxtLvqUdu8n+4k6Sx8kpfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1JFpQz/JeUluSXLdUNn/SXJDkmuTfCLJ3kN1r0yyJsl3kvzZUPmxrWxNkrNm/1AkSdPZkiv984FjJ5WtAA6rqn8HfBd4JUCSQ4GTgYe1bd6dZKckOwHvAo4DDgVOaW0lSXNo2tCvqquAWyeVfaaqNrbVq4HFbflE4OKq+kVVfR9YAxzRHmuq6ntV9Uvg4tZWkjSHZmNO/y+Af2zLi4C1Q3XrWtmmyu8lyRlJViVZtWHDhlkYniRpwkihn+TVwEbgQxNFUzSrzZTfu7DqnKpaVlXLFi5cOMrwJEmTzPhbNpMsB54GHFNVEwG+Dlgy1GwxsL4tb6pckjRHZnSln+RY4BXACVV1z1DVpcDJSXZNcjBwCPAV4KvAIUkOTrILgzd7Lx1t6JKkrTXtlX6SDwNHA/slWQeczeBunV2BFUkArq6qF1bV9Uk+AnybwbTPmVX167afFwFXADsB51XV9dvgeCRJmzFt6FfVKVMUn7uZ9m8E3jhF+eXA5Vs1OknSrPITuZLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1ZNrQT3JekluSXDdUdr8kK5Lc2H7u08qT5J1J1iS5NsnhQ9ssb+1vTLJ82xyOJGlztuRK/3zg2EllZwGfrapDgM+2dYDjgEPa4wzgPTB4kgDOBo4EjgDOnniikCTNnWlDv6quAm6dVHwicEFbvgB4+lD5B2rgamDvJAcAfwasqKpbq+o2YAX3fiKRJG1jM53T37+qbgZoP+/fyhcBa4farWtlmyqXJM2h2X4jN1OU1WbK772D5Iwkq5Ks2rBhw6wOTpJ6N9PQ/3GbtqH9vKWVrwOWDLVbDKzfTPm9VNU5VbWsqpYtXLhwhsOTJE1lpqF/KTBx
B85y4JKh8ue0u3iOAu5o0z9XAE9Jsk97A/cprUySNIcWTNcgyYeBo4H9kqxjcBfOm4GPJDkduAk4qTW/HDgeWAPcAzwPoKpuTfJ64Kut3f+qqslvDkuStrFpQ7+qTtlE1TFTtC3gzE3s5zzgvK0anSRpVvmJXEnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkdGCv0k/yPJ9UmuS/LhJLslOTjJyiQ3JvmHJLu0tru29TWtfulsHIAkacvNOPSTLAL+O7Csqg4DdgJOBt4CvK2qDgFuA05vm5wO3FZVDwLe1tpJkubQqNM7C4DdkywA7gPcDDwJ+GirvwB4els+sa3T6o9JkhH7lyRthRmHflX9C/C3wE0Mwv4O4Brg9qra2JqtAxa15UXA2rbtxtZ+38n7TXJGklVJVm3YsGGmw5MkTWGU6Z19GFy9HwwcCOwBHDdF05rYZDN1vyuoOqeqllXVsoULF850eJKkKYwyvfNk4PtVtaGqfgV8HHgssHeb7gFYDKxvy+uAJQCt/r7ArSP0L0naSqOE/k3AUUnu0+bmjwG+DXweeGZrsxy4pC1f2tZp9Z+rqntd6UuStp1R5vRXMnhD9mvAt9q+zgFeAbw0yRoGc/bntk3OBfZt5S8Fzhph3JKkGVgwfZNNq6qzgbMnFX8POGKKtj8HThqlP0nSaPxEriR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdGSn0k+yd5KNJbkiyOsljktwvyYokN7af+7S2SfLOJGuSXJvk8Nk5BEnSlhr1Sv8dwKer6iHAw4HVwFnAZ6vqEOCzbR3gOOCQ9jgDeM+IfUuSttKMQz/JXsCfAOcCVNUvq+p24ETggtbsAuDpbflE4AM1cDWwd5IDZjxySdJWG+VK/w+BDcD7k3w9yfuS7AHsX1U3A7Sf92/tFwFrh7Zf18okSXNklNBfABwOvKeqHgn8jN9N5UwlU5TVvRolZyRZlWTVhg0bRhieJGmyUUJ/HbCuqla29Y8yeBL48cS0Tft5y1D7JUPbLwbWT95pVZ1TVcuqatnChQtHGJ4kabIZh35V/QhYm+TBregY4NvApcDyVrYcuKQtXwo8p93FcxRwx8Q0kCRpbiwYcfv/BnwoyS7A94DnMXgi+UiS04GbgJNa28uB44E1wD2trSRpDo0U+lX1DWDZFFXHTNG2gDNH6U+SNBo/kStJHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SerIyKGfZKckX0/yqbZ+cJKVSW5M8g9Jdmnlu7b1Na1+6ah9S5K2zmxc6b8YWD20/hbgbVV1CHAbcHorPx24raoeBLyttZMkzaGRQj/JYuCpwPvaeoAnAR9tTS4Ant6WT2zrtPpjWntJ0hwZ9Ur/7cDLgd+09X2B26tqY1tfByxqy4uAtQCt/o7W/vckOSPJqiSrNmzYMOLwJEnDZhz6SZ4G3FJV1wwXT9G0tqDudwVV51TVsqpatnDhwpkOT5I0hQUjbPs44IQkxwO7AXsxuPLfO8mCdjW/GFjf2q8DlgDrkiwA7gvcOkL/kqStNOMr/ap6ZVUtrqqlwMnA56rqNODzwDNbs+XAJW350rZOq/9cVd3rSl+StO1si/v0XwG8NMkaBnP257byc4F9W/lLgbO2Qd+SpM0YZXrnt6rqSuDKtvw94Igp2vwcOGk2
+pMkzYyfyJWkjszKlb6kuXPRypvG1vepRx40tr41O7zSl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SerIjEM/yZIkn0+yOsn1SV7cyu+XZEWSG9vPfVp5krwzyZok1yY5fLYOQpK0ZUa50t8IvKyqHgocBZyZ5FDgLOCzVXUI8Nm2DnAccEh7nAG8Z4S+JUkzsGCmG1bVzcDNbfmuJKuBRcCJwNGt2QXAlcArWvkHqqqAq5PsneSAth9J2qSLVt40tr5PPfKgsfW9LczKnH6SpcAjgZXA/hNB3n7evzVbBKwd2mxdK5u8rzOSrEqyasOGDbMxPElSM3LoJ9kT+Bjwkqq6c3NNpyirexVUnVNVy6pq2cKFC0cdniRpyEihn2RnBoH/oar6eCv+cZIDWv0BwC2tfB2wZGjzxcD6UfqXJG2dUe7eCXAusLqq3jpUdSmwvC0vBy4ZKn9Ou4vnKOAO5/MlaW7N+I1c4HHAs4FvJflGK3sV8GbgI0lOB24CTmp1lwPHA2uAe4DnjdC3JGkGRrl754tMPU8PcMwU7Qs4c6b9SZJG5ydyJakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdGeV/zpKkee+ilTeNpd9Tjzxom+zXK31J6oihL0kdMfQlqSOGviR1xDdypRka1xt80ii80pekjhj6ktSROZ/eSXIs8A5gJ+B9VfXmuR7DtjbOl/3b6t7e6cy3e5ml+WpOr/ST7AS8CzgOOBQ4JcmhczkGSerZXE/vHAGsqarvVdUvgYuBE+d4DJLUrbme3lkErB1aXwccOdwgyRnAGW317iTfGaG//YCfjLD99mKLj+O0bTyQWTCrv5MxH+98+fuCLTyW3v6+xum00Y7lgZuqmOvQzxRl9XsrVecA58xKZ8mqqlo2G/sap/lyHOCxbK/my7HMl+OAbXcscz29sw5YMrS+GFg/x2OQpG7Ndeh/FTgkycFJdgFOBi6d4zFIUrfmdHqnqjYmeRFwBYNbNs+rquu3YZezMk20HZgvxwEey/ZqvhzLfDkO2EbHkqqavpUkaV7wE7mS1BFDX5I6Mq9DP8lJSa5P8pskO+RtXEmOTfKdJGuSnDXu8cxUkvOS3JLkunGPZRRJliT5fJLV7W/rxeMe00wl2S3JV5J8sx3L68Y9plEl2SnJ15N8atxjGUWSHyT5VpJvJFk1m/ue16EPXAc8A7hq3AOZiXn2tRXnA8eOexCzYCPwsqp6KHAUcOYO/Dv5BfCkqno48Ajg2CRHjXlMo3oxsHrcg5glT6yqR8z2vfrzOvSranVVjfKJ3nGbN19bUVVXAbeOexyjqqqbq+prbfkuBgGzaLyjmpkauLut7tweO+ydHUkWA08F3jfusWzP5nXozwNTfW3FDhkw81GSpcAjgZXjHcnMtemQbwC3ACuqaoc9FuDtwMuB34x7ILOggM8kuaZ9Nc2s2eH/56wk/w94wBRVr66qS+Z6PLNs2q+t0Hgk2RP4GPCSqrpz3OOZqar6NfCIJHsDn0hyWFXtcO+7JHkacEtVXZPk6HGPZxY8rqrWJ7k/sCLJDe3V8sh2+NCvqiePewzbkF9bsR1KsjODwP9QVX183OOZDVV1e5IrGbzvssOFPvA44IQkxwO7AXslubCq/vOYxzUjVbW+/bwlyScYTPXOSug7vbN982srtjNJApwLrK6qt457PKNIsrBd4ZNkd+DJwA3jHdXMVNUrq2pxVS1l8O/kcztq4CfZI8m/mVgGnsIsPhHP69BP8h+TrAMeA1yW5Ipxj2lrVNVGYOJrK1YDH9nGX1uxzST5MPDPwIOTrEty+rjHNEOPA54NPKndTveNdnW5IzoA+HySaxlc
YKyoqh36Vsd5Yn/gi0m+CXwFuKyqPj1bO/drGCSpI/P6Sl+S9PsMfUnqiKEvSR0x9CWpI4a+JHXE0Je2UPvmw/2mKP9y+7l04ltEkxy9o3/To+YnQ18aUVU9dtxjkLaUoS9NoX0q8rL2XfPXJXnWUN3uST6d5L+09bs3vSdp+2LoS1M7FlhfVQ+vqsOAiU9E7gl8Erioqv5+bKOTZsjQl6b2LeDJSd6S5AlVdUcrvwR4f1V9YIxjk2bM0JemUFXfBR7FIPzflOQ1repLwHHti9ekHY6hL00hyYHAPVV1IfC3wOGt6jXAT4F3j2ts0igMfWlqfwx8pf2vUq8G3jBU9xJgtyR/M5aRSSPwWzYlqSNe6UtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1JH/D7CXDF6PuaSTAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Save the file and we're done!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'processed_data/tatoeba_en-ga_20200612.csv'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
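<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Note that <code>to_csv</code> writes the row index as an extra unnamed column by default, which comes back as <code>Unnamed: 0</code> when the file is reloaded. A minimal sketch of saving without the index, assuming the index is not needed downstream; the two-row frame and the in-memory buffer are illustrative stand-ins, not the real dataset or path:</p>

```python
import io

import pandas as pd

# Toy stand-in for the processed Tatoeba frame (illustrative rows only).
df = pd.DataFrame({
    "ga": ["Tá tart orm.", "Is cailín mé."],
    "en": ["I'm thirsty.", "I am a girl."],
})

# Write to an in-memory buffer instead of disk so the sketch has no side effects.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False skips the row-index column

# Reload and check that the round trip kept the columns and values.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
print(reloaded["ga"].tolist() == df["ga"].tolist())
```

<p>If the original row labels matter, keeping the default and reloading with <code>index_col=0</code> is the other option.</p>
</div>
</div>
</div>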
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A few more samples, drawn at random from the processed data:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>en_id</th>
<th>lang</th>
<th>ga</th>
<th>en</th>
<th>ga_len</th>
<th>skill</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<th>1016</th>
<td>3602540</td>
<td>703243</td>
<td>gle</td>
<td>Táim ag labhairt le mo mhac léinn.</td>
<td>I'm speaking with my student.</td>
<td>7</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1590</th>
<td>5599832</td>
<td>1357603</td>
<td>gle</td>
<td>Déanta.</td>
<td>Done.</td>
<td>1</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1702</th>
<td>6319314</td>
<td>2014783</td>
<td>gle</td>
<td>Cén fáth a mbeimis ag iarraidh pionós a chur ort?</td>
<td>Why would we want to punish you?</td>
<td>10</td>
<td>4</td>
<td>NaN</td>
</tr>
<tr>
<th>1055</th>
<td>3603017</td>
<td>3603008</td>
<td>gle</td>
<td>Tá gabhlóg anseo.</td>
<td>There is a fork here.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>141</th>
<td>871635</td>
<td>5152872</td>
<td>gle</td>
<td>Chonaic sé seanchara an tseachtain seo caite n...</td>
<td>Last week he saw an old friend whom he hadn't ...</td>
<td>12</td>
<td>3</td>
<td>NaN</td>
</tr>
<tr>
<th>397</th>
<td>2610940</td>
<td>2604279</td>
<td>gle</td>
<td>Céard is ábhar taighde don tSoivéideolaí?</td>
<td>What does a Sovietologist study?</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>649</th>
<td>3128067</td>
<td>1079842</td>
<td>gle</td>
<td>Tá mé ag léamh an nuachtán.</td>
<td>I'm reading the newspaper.</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>409</th>
<td>2712366</td>
<td>2705597</td>
<td>gle</td>
<td>Ní chanaim.</td>
<td>I do not sing.</td>
<td>2</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>363</th>
<td>2150800</td>
<td>1476581</td>
<td>gle</td>
<td>Tá sé an-dorcha.</td>
<td>It's very dark.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1671</th>
<td>6319282</td>
<td>2014752</td>
<td>gle</td>
<td>Dúirt Tom go raibh comhluadar uaidh.</td>
<td>Tom said he wanted some company.</td>
<td>6</td>
<td>4</td>
<td>NaN</td>
</tr>
<tr>
<th>911</th>
<td>3601095</td>
<td>1784975</td>
<td>gle</td>
<td>Níl a fhios agam cén fáth.</td>
<td>I don't know why.</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>243</th>
<td>873503</td>
<td>873502</td>
<td>gle</td>
<td>Sin í an bhean a bhfanann siad léi.</td>
<td>That is the woman they stay with.</td>
<td>8</td>
<td>3</td>
<td>NaN</td>
</tr>
<tr>
<th>1265</th>
<td>3944944</td>
<td>3868719</td>
<td>gle</td>
<td>Cén teanga atá á labhairt aige?</td>
<td>What language is he speaking?</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1313</th>
<td>3961159</td>
<td>463294</td>
<td>gle</td>
<td>Is duine é seo.</td>
<td>This is a person.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1934</th>
<td>8239950</td>
<td>273600</td>
<td>gle</td>
<td>Go raibh maith agat roimh ré.</td>
<td>Thanks in advance.</td>
<td>6</td>
<td>3</td>
<td>NaN</td>
</tr>
<tr>
<th>1753</th>
<td>7075957</td>
<td>989164</td>
<td>gle</td>
<td>Léim an leabhar.</td>
<td>I read the book.</td>
<td>3</td>
<td>0</td>
<td>NaN</td>
</tr>
<tr>
<th>1687</th>
<td>6319298</td>
<td>2014768</td>
<td>gle</td>
<td>Nílimid ag iarraidh ach rudaí a dhíol leat.</td>
<td>We just want to sell you things.</td>
<td>8</td>
<td>4</td>
<td>NaN</td>
</tr>
<tr>
<th>407</th>
<td>2712330</td>
<td>2684430</td>
<td>gle</td>
<td>An bhfuil mé do chara?</td>
<td>Am I your friend?</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>362</th>
<td>2150798</td>
<td>1615217</td>
<td>gle</td>
<td>Tá sé an-tirim.</td>
<td>It's very dry.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>299</th>
<td>874891</td>
<td>874890</td>
<td>gle</td>
<td>Nach rabhthas sásta?</td>
<td>Weren't they satisfied?</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>543</th>
<td>5599829</td>
<td>1053192</td>
<td>gle</td>
<td>Bígí cúramach!</td>
<td>Careful!</td>
<td>2</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>987</th>
<td>3602422</td>
<td>2361385</td>
<td>gle</td>
<td>Níl teileafón agam.</td>
<td>I don't have a telephone.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1929</th>
<td>8239940</td>
<td>772806</td>
<td>gle</td>
<td>Níl a fhios ag aon duine cá bhfuil sé.</td>
<td>Nobody knows where it is.</td>
<td>9</td>
<td>3</td>
<td>NaN</td>
</tr>
<tr>
<th>948</th>
<td>3601175</td>
<td>2549673</td>
<td>gle</td>
<td>Tháinig sibh ar ais.</td>
<td>You came back.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1539</th>
<td>5516952</td>
<td>5516950</td>
<td>gle</td>
<td>Sin bealach amháin le breathnú air is docha.</td>
<td>That's one way of looking at it, I suppose.</td>
<td>8</td>
<td>1</td>
<td>NaN</td>
</tr>
<tr>
<th>1215</th>
<td>3896264</td>
<td>2700686</td>
<td>gle</td>
<td>Níl aon fhadhb ann.</td>
<td>There is no problem.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>410</th>
<td>2712386</td>
<td>4969010</td>
<td>gle</td>
<td>Tá an leabhar ar an sheilf.</td>
<td>The book is on the shelf.</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>681</th>
<td>3233354</td>
<td>2002544</td>
<td>gle</td>
<td>Tá an seomra dorcha.</td>
<td>The room is dark.</td>
<td>4</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>198</th>
<td>871759</td>
<td>871758</td>
<td>gle</td>
<td>Conas mar a rinne tú é?</td>
<td>How did you do it?</td>
<td>6</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>523</th>
<td>2715102</td>
<td>2659060</td>
<td>gle</td>
<td>Scríobh sí litir.</td>
<td>She wrote a letter.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1363</th>
<td>4445009</td>
<td>2474700</td>
<td>gle</td>
<td>Bhí mé díomách sin.</td>
<td>I was so disappointed.</td>
<td>4</td>
<td>3</td>
<td>Níl Gaeilge líofa agam, ach tá a fhios agam a ...</td>
</tr>
<tr>
<th>1722</th>
<td>6319355</td>
<td>1126729</td>
<td>gle</td>
<td>Nuair a thugaim cuairt ar mo gharmhac, tugaim ...</td>
<td>When I go to see my grandson, I always give hi...</td>
<td>14</td>
<td>4</td>
<td>NaN</td>
</tr>
<tr>
<th>662</th>
<td>3128092</td>
<td>2297248</td>
<td>gle</td>
<td>Tá sé ag scríobh leabhair.</td>
<td>He's writing a book.</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1895</th>
<td>7804899</td>
<td>7926273</td>
<td>gle</td>
<td>Amach leat!</td>
<td>Out you go!</td>
<td>2</td>
<td>3</td>
<td>Caint as Cúige Uladh</td>
</tr>
<tr>
<th>855</th>
<td>3599611</td>
<td>3599609</td>
<td>gle</td>
<td>Tá Spáinnis aici.</td>
<td>She knows Spanish.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1090</th>
<td>3603161</td>
<td>2363944</td>
<td>gle</td>
<td>Is cailín mé.</td>
<td>I am a girl.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1456</th>
<td>4773304</td>
<td>5127613</td>
<td>gle</td>
<td>Caithfidh mé péire bróg nua a cheannach.</td>
<td>I must buy a new pair of shoes.</td>
<td>7</td>
<td>4</td>
<td>Irish teacher for 20+ years.</td>
</tr>
<tr>
<th>1827</th>
<td>7801402</td>
<td>429220</td>
<td>gle</td>
<td>Ádh mór!</td>
<td>Good luck!</td>
<td>2</td>
<td>3</td>
<td>Caint as Cúige Uladh</td>
</tr>
<tr>
<th>1688</th>
<td>6319299</td>
<td>2014769</td>
<td>gle</td>
<td>Ba mhaith linn rud a phlé le Tom.</td>
<td>We want to have a word with Tom.</td>
<td>8</td>
<td>4</td>
<td>NaN</td>
</tr>
<tr>
<th>1760</th>
<td>7290675</td>
<td>7290677</td>
<td>gle</td>
<td>Gheofá bainne a bhaint as na ba.</td>
<td>You could get milk from the cows.</td>
<td>7</td>
<td>5</td>
<td>NaN</td>
</tr>
<tr>
<th>1587</th>
<td>5599801</td>
<td>393357</td>
<td>gle</td>
<td>Mas é bhur dtoil é.</td>
<td>Please.</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>700</th>
<td>3335788</td>
<td>16255</td>
<td>gle</td>
<td>Cad tá uait?</td>
<td>What are you looking for?</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1824</th>
<td>7801398</td>
<td>348091</td>
<td>gle</td>
<td>Tar isteach.</td>
<td>Come in.</td>
<td>2</td>
<td>3</td>
<td>Caint as Cúige Uladh</td>
</tr>
<tr>
<th>1953</th>
<td>8290368</td>
<td>1192601</td>
<td>gle</td>
<td>Tá ríomhaire uaim.</td>
<td>I want a computer.</td>
<td>3</td>
<td>3</td>
<td>Caint as Cúige Uladh</td>
</tr>
<tr>
<th>1126</th>
<td>3604289</td>
<td>60147</td>
<td>gle</td>
<td>Is liomsa an teach seo.</td>
<td>This house is mine.</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>949</th>
<td>3601177</td>
<td>3419582</td>
<td>gle</td>
<td>Tá mo dheartháir níos láidre ná mé.</td>
<td>My brother is stronger than me.</td>
<td>7</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>1580</th>
<td>5599784</td>
<td>39996</td>
<td>gle</td>
<td>Tá brón orm...</td>
<td>Sorry...</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>451</th>
<td>2714581</td>
<td>1814</td>
<td>gle</td>
<td>Tá tart orm.</td>
<td>I'm thirsty.</td>
<td>3</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>845</th>
<td>3599522</td>
<td>3591019</td>
<td>gle</td>
<td>Ní ach socrú sealadach é.</td>
<td>It's only a temporary fix.</td>
<td>5</td>
<td>-1</td>
<td>NaN</td>
</tr>
<tr>
<th>138</th>
<td>871633</td>
<td>5152871</td>
<td>gle</td>
<td>Tá fear ag an doras atá ag iarraidh caint leat.</td>
<td>There's a man at the door who's asking to spea...</td>
<td>10</td>
<td>3</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
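<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Since <code>df.sample</code> draws different rows on every run, the table above will not be reproduced exactly if the notebook is re-executed. A small sketch of pinning the draw with the <code>random_state</code> parameter; the toy frame and the seed value are arbitrary choices for illustration:</p>

```python
import pandas as pd

# Toy frame standing in for the processed data (values are illustrative).
df = pd.DataFrame({"ga_len": range(100)})

# Same seed, same rows: random_state pins the pseudo-random draw.
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
print(first.index.tolist() == second.index.tolist())
```
</div>
</div>
</div>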
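</div>
<p>Morgan McGuire</p>
<p>ELRC, European Language Resource Coordination</p>
<p>2020-06-11T00:00:00-05:00</p>
<p><a href="https://www.nlp.irish/elrc/translation/nmt/mt/2020/06/11/ELRC_European_Language_Resource_Coordination">https://www.nlp.irish/elrc/translation/nmt/mt/2020/06/11/ELRC_European_Language_Resource_Coordination</a></p>
<!--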
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-11-ELRC_European_Language_Resource_Coordination.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Available-for-Download--✅">Available for Download ✅<a class="anchor-link" href="#Available-for-Download--✅"> </a></h3><p>⚠️ Always check the license of the data source before using the data ⚠️</p>
<ul>
<li>Main page: <a href="https://elrc-share.eu/">https://elrc-share.eu/</a></li>
<li>Data Browse Link: <a href="https://elrc-share.eu/repository/search/">https://elrc-share.eu/repository/search/</a></li>
<li>Format: <strong>.tmx</strong></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Brief-Description">Brief Description<a class="anchor-link" href="#Brief-Description"> </a></h3><p>The ELRC-SHARE repository is used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination and considered useful for feeding the CEF Automated Translation (CEF.AT) platform.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Notes">Other Notes<a class="anchor-link" href="#Other-Notes"> </a></h3><p>Each file is hosted separately on ELRC-SHARE, so the files have to be downloaded one at a time, which takes a little patience. Let us know if there is a more efficient way to download them!</p>
<ul>
<li>No. source documents: 33</li>
<li>Lines of text: 23,946 </li>
<li>GA Word count: 485,570</li>
</ul>
</div>
</div>
</div>
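<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a quick back-of-the-envelope check, the two totals above give the average length of a line. The sketch uses only the figures quoted in the list:</p>

```python
# Totals quoted in the notes above.
lines_of_text = 23_946
ga_word_count = 485_570

# Mean number of Irish words per line of text.
mean_words_per_line = ga_word_count / lines_of_text
print(f"{mean_words_per_line:.1f} words per line on average")
```

<p>That works out at roughly 20 words per line.</p>
</div>
</div>
</div>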
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Word-Count-Distribution">Word Count Distribution<a class="anchor-link" href="#Word-Count-Distribution"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAX0AAAEXCAYAAABBFpRtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAay0lEQVR4nO3dfZRcdZ3n8fcHwoMaJA80GJIwCZpVo0ditk0y4ihDPCFENMwIuxEWIpOd7APO0bOyCOM6QZEZnbMrDmcVN0siAeUhRpDIMGKf8DTODkk6EgIhYFoekiaRNHYSQBQNfPeP+2u4aaq6qtPdVU1+n9c5fere3/3Vvd97O/nU7d+9VaWIwMzM8nBIswswM7PGceibmWXEoW9mlhGHvplZRhz6ZmYZceibmWXEoW9vOJIuk/S9ZtcxEJI+LelnpfkXJJ04SOv+a0nXpOlJkkLSiEFa9wmp1kMHY33WeA59swGSdIqkzoGsIyJGRsTjg7GdiPjbiPiPA6mntM0nJX20tO5tqdaXB2P91ngOfRu2VPC/0X4YrDN6O3j5P1TmJE2X9ICk5yX9QNLNkr6alo2WdLukLkm70/SEKuu5QNKPS/MdklaW5rdLmpamPyhpvaS96fGDpX73SLpC0r8ALwInSpos6d5UYxtwTI19mi9po6TnJP1S0tzUfryk1ZK6U31/WXrOtT37neb3O6tOZ7wXSdqU6r5Z0pGS3gL8E3B8GvZ4QdLxFWoam7b9nKR1wNt7LQ9J70jT8yQ9kvb36bTdittJQ12rJH1P0nPAp6sMf/2FpB2Sdkr6fD37Lel64ATgx2l7F/ceLqpxTC+TtFLSdWlfNktq7et3Z0PPoZ8xSYcDtwLXAmOAG4E/K3U5BPgu8EcU//l/C/zvKqu7F/gTSYdIGgccBpyctnMiMBLYJGkM8I/AVcBY4BvAP0oaW1rXecBi4CjgKeAGYANF2F8OLOxjn2YA1wH/HRgFfBh4Mi2+EegEjgfOAv5W0uxq66rg3wFzgcnA+4BPR8RvgNOBHWnYY2RE7Kjw3G8BvwPGAX+RfqpZBvyniDgKeC9wV43tzAdWpf39fpV1/ikwBZgDXFIesqkmIs4DtgEfT9v7+wrdah3TTwA3pdpWU/3fjzWIQz9vs4ARwFUR8YeIuAVY17MwIn4dET+MiBcj4nngCuAjlVaUxqOfB6alPncCT0t6V5r/54h4BfgYsDUiro+IfRFxI/Ao8PHS6q6NiM0RsY8iJD8AfCkiXoqI+4AfU90iYHlEtEXEKxHxdEQ8Kmki8CHgCxHxu4jYCFxD8QJTr6siYkdEdKcaptXzpHTR85PA30TEbyLiYWBFH0/5AzBV0lsjYndE/LzGJv41In6U9ve3Vfp8OW37IYoX8k/VU3tf6jymP4uIO9I1gOuBkwa6XRsYh37ejgeejv0/dW97z4SkN0v6P5KeSkMH9wGjVP3OjXuBUyjOru8F7qEI/I+k+Z5tPtXreU8B4yvVkPrvTme65f7VTAR+WaH9eKA7vXhV224tvypNv0jx10s9WiheXMv71dc+fBKYBzyVhrX+uMb6t9dY3rvPUxTHY6DqOaa9j9mRvu7QXA79vO0ExktSqW1iafrzwDuBmRHxVoowByj3L+sJ/T9J0/fy+tDfQTFcVHYC8HRpvvwitBMYnca0y/2r2U6v8fLSdsdIOqrKdn8DvLm07G19bKO3Wh9V2wXsY/9jW3UfImJ9RMwHjgV+BPRcG6m2nXo+Krf3tnuGhmrtd1/rrnVMbRhy6OftX4GXgc9IGiFpPjCjtPwoinH8PWksfkmN9d1LMXb8pojoBP6ZYgx8LPBA6nMH8G8knZO2+e+BqcDtlVYYEU8B7cCXJR0u6UPsPxTU2zLgAkmz0/WF8ZLeFRHbgf8H/F26APs+iqGgnjHwjcA8SWMkvQ34XI19LXsGGCvp6Cr78DJwC3BZ
+utpKlWuS6R9PFfS0RHxB+A5it9Rze3U8KW07fcAFwA3p/Za+/0MUPH9A3UcUxuGHPoZi4jfA39O8R91D/AfKML3pdTlm8CbgGeB+4Gf1FjfL4AXKMKeiHgOeBz4l577uiPi18AZFH9F/Bq4GDgjIp7tY9XnADOBbooXnuv6qGEdRahdCeyleCHq+cviU8AkijPUW4ElEdGWll0PPEhx0fenvBaKNUXEoxQXNB+XtKfS3TvAZyiGg35FceH8u32s8jzgyTSk9p8pfi/1bqeae4EOYA3wPyPip6m91n7/HfA/0vYuqrDevo6pDUPyl6hYmaS1wHcioq9QMrM3KJ/pZ07SRyS9LQ21LKS4FbHPM3oze+PyVXR7J8WFwpEUd72cFRE7m1uSmQ0VD++YmWXEwztmZhkZ1sM7xxxzTEyaNKnZZZiZvaFs2LDh2YhoqbRsWIf+pEmTaG9vb3YZZmZvKJKqvuPbwztmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhkZ1u/IHagb1m6r2H7OzL6+bc/M7ODlM30zs4w49M3MMuLQNzPLSF2hL2mUpFWSHpW0RdIfSxojqU3S1vQ4OvWVpKskdUjaJGl6aT0LU/+t6av5zMysgeo90/8H4CcR8S7gJGALcAmwJiKmAGvSPMDpwJT0sxi4GkDSGGAJMBOYASzpeaEwM7PGqBn6kt4KfBhYBhARv4+IPcB8YEXqtgI4M03PB66Lwv3AKEnjgNOAtojojojdQBswd1D3xszM+lTPmf6JQBfwXUkPSLpG0luA43q+QDs9Hpv6jwe2l57fmdqqtZuZWYPUE/ojgOnA1RHxfuA3vDaUU4kqtEUf7fs/WVosqV1Se1dXVx3lmZlZveoJ/U6gMyLWpvlVFC8Cz6RhG9LjrlL/iaXnTwB29NG+n4hYGhGtEdHa0lLxKx7NzOwA1Qz9iPgVsF3SO1PTbOARYDXQcwfOQuC2NL0aOD/dxTML2JuGf+4E5kganS7gzkltZmbWIPV+DMNfAd+XdDjwOHABxQvGSkmLgG3A2anvHcA8oAN4MfUlIrolXQ6sT/2+EhHdg7IXZmZWl7pCPyI2Aq0VFs2u0DeAC6usZzmwvD8FmpnZ4PE7cs3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMlLv1yUeVG5Yu61i+zkzT2hwJWZmjeUzfTOzjGR5pl+N/wIws4Odz/TNzDLi0Dczy4hD38wsIw59M7OMOPTNzDJSV+hLelLSQ5I2SmpPbWMktUnamh5Hp3ZJukpSh6RNkqaX1rMw9d8qaeHQ7JKZmVXTnzP9P42IaRHRmuYvAdZExBRgTZoHOB2Ykn4WA1dD8SIBLAFmAjOAJT0vFGZm1hgDGd6ZD6xI0yuAM0vt10XhfmCUpHHAaUBbRHRHxG6gDZg7gO2bmVk/1Rv6AfxU0gZJi1PbcRGxEyA9HpvaxwPbS8/tTG3V2vcjabGkdkntXV1d9e+JmZnVVO87ck+OiB2SjgXaJD3aR19VaIs+2vdviFgKLAVobW193XIzMztwdZ3pR8SO9LgLuJViTP6ZNGxDetyVuncCE0tPnwDs6KPdzMwapGboS3qLpKN6poE5wMPAaqDnDpyFwG1pejVwfrqLZxawNw3/3AnMkTQ6XcCdk9rMzKxB6hneOQ64VVJP/xsi4ieS1gMrJS0CtgFnp/53APOADuBF4AKAiOiWdDmwPvX7SkR0D9qemJlZTTVDPyIeB06q0P5rYHaF9gAurLKu5cDy/pdpZmaDwe/INTPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMuLQNzPLiEPf
zCwjDn0zs4w49M3MMuLQNzPLiEPfzCwjDn0zs4w49M3MMlJ36Es6VNIDkm5P85MlrZW0VdLNkg5P7Uek+Y60fFJpHZem9scknTbYO2NmZn3rz5n+Z4EtpfmvA1dGxBRgN7AotS8CdkfEO4ArUz8kTQUWAO8B5gLflnTowMo3M7P+qCv0JU0APgZck+YFnAqsSl1WAGem6flpnrR8duo/H7gpIl6KiCeADmDGYOyEmZnVp94z/W8CFwOvpPmxwJ6I2JfmO4HxaXo8sB0gLd+b+r/aXuE5r5K0WFK7pPaurq5+7IqZmdVSM/QlnQHsiogN5eYKXaPGsr6e81pDxNKIaI2I1paWllrlmZlZP4yoo8/JwCckzQOOBN5KceY/StKIdDY/AdiR+ncCE4FOSSOAo4HuUnuP8nPMzKwBap7pR8SlETEhIiZRXIi9KyLOBe4GzkrdFgK3penVaZ60/K6IiNS+IN3dMxmYAqwbtD0xM7Oa6jnTr+YLwE2Svgo8ACxL7cuA6yV1UJzhLwCIiM2SVgKPAPuACyPi5QFs38zM+qlfoR8R9wD3pOnHqXD3TUT8Dji7yvOvAK7ob5FmZjY4/I5cM7OMOPTNzDLi0Dczy4hD38wsIwO5eycbN6zdVrH9nJknNLgSM7OB8Zm+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZaRm6Es6UtI6SQ9K2izpy6l9sqS1krZKulnS4an9iDTfkZZPKq3r0tT+mKTThmqnzMyssnrO9F8CTo2Ik4BpwFxJs4CvA1dGxBRgN7Ao9V8E7I6IdwBXpn5ImgosAN4DzAW+LenQwdwZMzPrW83Qj8ILafaw9BPAqcCq1L4CODNNz0/zpOWzJSm13xQRL0XEE0AHMGNQ9sLMzOpS15i+pEMlbQR2AW3AL4E9EbEvdekExqfp8cB2gLR8LzC23F7hOeVtLZbULqm9q6ur/3tkZmZV1RX6EfFyREwDJlCcnb+7Urf0qCrLqrX33tbSiGiNiNaWlpZ6yjMzszr16+6diNgD3APMAkZJGpEWTQB2pOlOYCJAWn400F1ur/AcMzNrgHru3mmRNCpNvwn4KLAFuBs4K3VbCNyWplenedLyuyIiUvuCdHfPZGAKsG6wdsTMzGobUbsL44AV6U6bQ4CVEXG7pEeAmyR9FXgAWJb6LwOul9RBcYa/ACAiNktaCTwC7AMujIiXB3d3zMysLzVDPyI2Ae+v0P44Fe6+iYjfAWdXWdcVwBX9L9PMzAaD35FrZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llpGboS5oo6W5JWyRtlvTZ1D5GUpukrelxdGqXpKskdUjaJGl6aV0LU/+tkhYO3W6ZmVkl9Zzp7wM+HxHvBmYBF0qaClwCrImIKcCaNA9wOjAl/SwGrobiRQJYAswEZgBLel4ozMysMWqGfkTsjIifp+nngS3AeGA+sCJ1WwGcmabnA9dF4X5glKRxwGlAW0R0R8RuoA2YO6h7Y2ZmferXmL6kScD7gbXAcRGxE4oXBuDY1G08sL30tM7UVq299zYWS2qX1N7V1dWf8szMrIa6Q1/SSOCHwOci4rm+ulZoiz7a92+IWBoRrRHR2tLSUm95ZmZWhxH1dJJ0GEXgfz8ibknNz0gaFxE70/DNrtTeCUwsPX0CsCO1n9Kr/Z4DL735bli7rWL7OTNPaHAlZmb1qefuHQHLgC0R8Y3SotVAzx04C4HbSu3np7t4ZgF70/DPncAcSaPTBdw5qc3MzBqknjP9k4Hz
gIckbUxtfw18DVgpaRGwDTg7LbsDmAd0AC8CFwBERLeky4H1qd9XIqJ7UPbCzMzqUjP0I+JnVB6PB5hdoX8AF1ZZ13JgeX8KNDOzweN35JqZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhmpGfqSlkvaJenhUtsYSW2StqbH0aldkq6S1CFpk6TppecsTP23Slo4NLtjZmZ9qedM/1pgbq+2S4A1ETEFWJPmAU4HpqSfxcDVULxIAEuAmcAMYEnPC4WZmTVOzdCPiPuA7l7N84EVaXoFcGap/boo3A+MkjQOOA1oi4juiNgNtPH6FxIzMxtiBzqmf1xE7ARIj8em9vHA9lK/ztRWrd3MzBposC/kqkJb9NH++hVIiyW1S2rv6uoa1OLMzHJ3oKH/TBq2IT3uSu2dwMRSvwnAjj7aXycilkZEa0S0trS0HGB5ZmZWyYGG/mqg5w6chcBtpfbz0108s4C9afjnTmCOpNHpAu6c1GZmZg00olYHSTcCpwDHSOqkuAvna8BKSYuAbcDZqfsdwDygA3gRuAAgIrolXQ6sT/2+EhG9Lw6bmdkQqxn6EfGpKotmV+gbwIVV1rMcWN6v6szMbFD5HblmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhmp+Y5c678b1m6r2H7OzBMaXImZ2f58pm9mlhGHvplZRhz6ZmYZceibmWXEoW9mlhGHvplZRhz6ZmYZ8X36DeT7982s2Xymb2aWEYe+mVlGHPpmZhlx6JuZZcQXcocBX+A1s0bxmb6ZWUZ8pj+M+S8AMxtsDQ99SXOBfwAOBa6JiK81uoY3Or8YmNmBaujwjqRDgW8BpwNTgU9JmtrIGszMctboM/0ZQEdEPA4g6SZgPvBIg+s4KFX7C6CZ/NeH2fDS6NAfD2wvzXcCM8sdJC0GFqfZFyQ9doDbOgZ49gCfO1Syq+ncA39qdsfqAA3HmmB41pVTTX9UbUGjQ18V2mK/mYilwNIBb0hqj4jWga5nMLmm+g3HulxT/YZjXa6p0OhbNjuBiaX5CcCOBtdgZpatRof+emCKpMmSDgcWAKsbXIOZWbYaOrwTEfskfQa4k+KWzeURsXmINjfgIaIh4JrqNxzrck31G451uSZAEVG7l5mZHRT8MQxmZhlx6JuZZeSgC31JcyU9JqlD0iVNruVJSQ9J2iipPbWNkdQmaWt6HD3ENSyXtEvSw6W2ijWocFU6dpskTW9gTZdJejodq42S5pWWXZpqekzSaUNU00RJd0vaImmzpM+m9mYfq2p1Ne14STpS0jpJD6aavpzaJ0tam47VzelmDSQdkeY70vJJDazpWklPlI7TtNTekN9f2tahkh6QdHuab9pxAiAiDpofiovDvwROBA4HHgSmNrGeJ4FjerX9PXBJmr4E+PoQ1/BhYDrwcK0agHnAP1G8n2IWsLaBNV0GXFSh79T0ezwCmJx+v4cOQU3jgOlp+ijgF2nbzT5W1epq2vFK+zwyTR8GrE3HYCWwILV/B/gvafq/At9J0wuAm4fgOFWr6VrgrAr9G/L7S9v6b8ANwO1pvmnHKSIOujP9Vz/mISJ+D/R8zMNwMh9YkaZXAGcO5cYi4j6gu84a5gPXReF+YJSkcQ2qqZr5wE0R8VJEPAF0UPyeB7umnRHx8zT9PLCF4h3kzT5W1eqqZsiPV9rnF9LsYekngFOBVam997HqOYargNmSKr1RcyhqqqYhvz9JE4CPAdekedHE4wQH3/BOpY956Os/yFAL4KeSNqj4eAmA4yJiJxT/oYFjm1BXtRqaffw+k/7UXl4a9mp4TenP6vdTnC0Om2PVqy5o4vFKQxYbgV1AG8VfFHsi
Yl+F7b5aU1q+Fxg71DVFRM9xuiIdpyslHdG7pgr1DqZvAhcDr6T5sTT5OB1soV/zYx4a7OSImE7xqaIXSvpwE2upRzOP39XA24FpwE7gfzWjJkkjgR8Cn4uI5/rqWqGtkXU19XhFxMsRMY3iXfUzgHf3sd2m1CTpvcClwLuADwBjgC80qiZJZwC7ImJDubmP7TbkOB1soT+sPuYhInakx13ArRT/OZ7p+TMyPe5qQmnVamja8YuIZ9J/2leA/8trQxINq0nSYRTB+v2IuCU1N/1YVaprOByvVMce4B6KcfFRknre8Fne7qs1peVHU//w3kBqmpuGxyIiXgK+S2OP08nAJyQ9STHUfCrFmX9Tj9PBFvrD5mMeJL1F0lE908Ac4OFUz8LUbSFwWxPKq1bDauD8dGfDLGBvz9DGUOs1nvpnFMeqp6YF6c6GycAUYN0QbF/AMmBLRHyjtKipx6paXc08XpJaJI1K028CPkpxreFu4KzUrfex6jmGZwF3RbpaOcQ1PVp6wRbF2Hn5OA3p7y8iLo2ICRExiSKL7oqIc2niceop7KD6obgq/wuKMcYvNrGOEynuongQ2NxTC8UY3Rpga3ocM8R13Ejx5/8fKM4kFlWrgeLPy2+lY/cQ0NrAmq5P29xE8Y9/XKn/F1NNjwGnD1FNH6L4U3oTsDH9zBsGx6paXU07XsD7gAfSth8G/qb0b34dxcXjHwBHpPYj03xHWn5iA2u6Kx2nh4Hv8dodPg35/ZXqO4XX7t5p2nGKCH8Mg5lZTg624R0zM+uDQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfbMBUPERxxc1uw6zejn0zcwy0tAvRjcbbiR9CTiX4tMNnwU2UHy64WKK72ToAM6LiBfrWNfbKd7l2QK8CPxlRDwq6VrgOaAVeBtwcUSsqroisyHkM33LlqRW4JMUH1f85xShDHBLRHwgIk6i+EyZRXWucinwVxHxb4GLgG+Xlo2j+EiFM4CvDUL5ZgfEZ/qWsw8Bt0XEbwEk/Ti1v1fSV4FRwEjgzlorSh99/EHgB6XvvTii1OVHUXwi5iOSjhuk+s36zaFvOav2rUTXAmdGxIOSPk3xYVm1HELx5RjTqix/qY7tmg05D+9Yzn4GfFzFl2qPpPhaOyi+i3Zn+hz7c+tZURRfbPKEpLPh1S/ePmkoijYbCIe+ZSsi1lN8LPGDwC1AO8VF3C9RfCVhG/BoP1Z5LrBIUs/HaQ+372c280crW94kjYyIFyS9GbgPWBzpi8jNDkYe07fcLZU0leILLFY48O1g5zN9szpI+iJwdq/mH0TEFc2ox+xAOfTNzDLiC7lmZhlx6JuZZcShb2aWEYe+mVlG/j82C3R1nz5ysAAAAABJRU5ErkJggg==
" />
</div>
</div>
</div>
</div>
</div>
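<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The code behind the plot is not shown in the post. As a rough sketch of how such a distribution can be computed, assuming a word is simply a whitespace-separated token (the notebook's actual tokenisation may differ), and using a handful of illustrative sentences rather than the real corpus:</p>

```python
from collections import Counter

# Illustrative Irish sentences; the real input would be the GA side of the dataset.
sentences = [
    "maidir le faisnéis do shaoránaigh",
    "Tá tart orm.",
    "Is cailín mé.",
    "Déanta.",
]

# Word count per sentence, using plain whitespace tokenisation (an assumption).
word_counts = [len(sentence.split()) for sentence in sentences]

# Map each word count to the number of sentences with that count.
distribution = Counter(word_counts)
for count in sorted(distribution):
    print(f"{count:2d} words: {'#' * distribution[count]}")
```
</div>
</div>
</div>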
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Extract-to-a-Pandas-DataFrame">Code to Extract to a Pandas DataFrame<a class="anchor-link" href="#Code-to-Extract-to-a-Pandas-DataFrame"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">metadata</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">tmx2dataframe</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s1">'elrc/citizens_information_en-ga.tmx'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>10297
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>en</td>
<td>about Citizens Information</td>
<td>ga</td>
<td>maidir le faisnéis do shaoránaigh</td>
</tr>
<tr>
<th>1</th>
<td>en</td>
<td>the Citizens Information Board is the statutor...</td>
<td>ga</td>
<td>is é an Bord um fhaisnéis do shaoránaigh ( BFS...</td>
</tr>
<tr>
<th>2</th>
<td>en</td>
<td>it provides the Citizens Information website ,...</td>
<td>ga</td>
<td>cuireann sé an láithreán gréasáin um fhaisnéis...</td>
</tr>
<tr>
<th>3</th>
<td>en</td>
<td>it also funds and supports the Money Advice an...</td>
<td>ga</td>
<td>cuireann sé maoiniú agus tacaíocht ar fáil fre...</td>
</tr>
<tr>
<th>4</th>
<td>en</td>
<td>Citizensinformation.ie provides comprehensive ...</td>
<td>ga</td>
<td>cuireann citizensinformation.ie faisnéis chuim...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
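<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><code>tmx2dataframe</code> hides the file format, so here is a rough illustration of what sits underneath. A TMX file is XML in which each <code>tu</code> (translation unit) holds one <code>tuv</code> per language, with the text in a <code>seg</code> child. The sketch below uses only the standard library and assumes that layout; <code>build_sample_tmx</code> and <code>read_tmx_pairs</code> are hypothetical helper names, the tiny document is built in memory for illustration, and none of this is how <code>tmx2dataframe</code> itself is implemented:</p>

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_sample_tmx():
    """Build a tiny in-memory TMX document (illustrative content only)."""
    root = ET.Element("tmx", version="1.4")
    body = ET.SubElement(root, "body")
    unit = ET.SubElement(body, "tu")
    for code, text in [("en", "about Citizens Information"),
                       ("ga", "maidir le faisnéis do shaoránaigh")]:
        variant = ET.SubElement(unit, "tuv", {XML_LANG: code})
        ET.SubElement(variant, "seg").text = text
    return ET.tostring(root, encoding="unicode")


def read_tmx_pairs(tmx_text):
    """Return one {language_code: segment_text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                code = variant.get(XML_LANG, "").lower()
                segments[code] = "".join(seg.itertext())
        pairs.append(segments)
    return pairs


pairs = read_tmx_pairs(build_sample_tmx())
print(pairs)
```

<p>Lower-casing the language code mirrors the normalisation the extraction loop applies to <code>target_language</code>.</p>
</div>
</div>
</div>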
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Interate-and-Extract-all-.tmx-files-downloaded">Code to Iterate and Extract All Downloaded <code>.tmx</code> Files<a class="anchor-link" href="#Code-to-Interate-and-Extract-all-.tmx-files-downloaded"> </a></h3>
</div>
</div>
</div>
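<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>When looping over many downloaded files, it can help to record which ones failed to parse rather than skipping them silently. A minimal sketch of that pattern using only the standard library; <code>extract_all</code> and <code>toy_reader</code> are hypothetical names, and <code>toy_reader</code> is a stand-in for a real TMX reader such as <code>tmx2dataframe.read</code>:</p>

```python
import tempfile
from pathlib import Path


def extract_all(dir_path, read_fn, suffix=".tmx"):
    """Apply read_fn to every matching file; return (results, failures)."""
    results, failures = {}, {}
    for path in sorted(Path(dir_path).iterdir()):
        if path.suffix.lower() != suffix:
            continue  # ignore anything that is not a .tmx file
        try:
            results[path.name] = read_fn(path)
        except Exception as err:  # keep going, but remember what failed
            failures[path.name] = repr(err)
    return results, failures


def toy_reader(path):
    """Hypothetical stand-in for a real TMX reader: counts non-empty lines."""
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError("empty file")
    return len([line for line in text.splitlines() if line.strip()])


# Demonstrate on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "good.tmx").write_text("line one\nline two\n", encoding="utf-8")
    (folder / "empty.tmx").write_text("", encoding="utf-8")
    (folder / "notes.txt").write_text("not a TMX file", encoding="utf-8")
    results, failures = extract_all(folder, toy_reader)

print(results)
print(failures)
```

<p>Returning the failures alongside the results makes it easy to see afterwards which downloads need a second look.</p>
</div>
</div>
</div>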
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Assumed imports: the notebook's import cell is not shown in this post</span>
<span class="kn">import</span> <span class="nn">gc</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">fastprogress.fastprogress</span> <span class="kn">import</span> <span class="n">progress_bar</span>
<span class="kn">from</span> <span class="nn">tmx2dataframe</span> <span class="kn">import</span> <span class="n">tmx2dataframe</span>

<span class="n">lang</span><span class="o">=</span><span class="s1">'ga'</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">'elrc'</span><span class="p">)</span>
<span class="n">samp_count</span><span class="o">=</span><span class="mi">0</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">dir_path</span><span class="o">.</span><span class="n">iterdir</span><span class="p">())):</span>
    <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">suffix</span> <span class="o">==</span> <span class="s1">'.tmx'</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">_</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">tmx2dataframe</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">f</span><span class="p">))</span>
            <span class="c1"># Keep only rows whose target_language contains the language string (like 'ga')</span>
            <span class="n">df</span><span class="o">.</span><span class="n">target_language</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">target_language</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">target_language</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">lang</span><span class="p">)])</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">ga_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">target_language</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">lang</span><span class="p">)]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
                <span class="n">ga_df</span><span class="p">[</span><span class="s1">'filepath'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
            <span class="k">pass</span>  <span class="c1"># skip files that fail to parse; uncomment the next line to log them</span>
            <span class="c1">#print(f"Couldn't open {f}")</span>
        <span class="c1"># ga_df only exists if the current file yielded matching rows</span>
        <span class="n">var_exists</span> <span class="o">=</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">locals</span><span class="p">()</span> <span class="ow">or</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">globals</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">var_exists</span><span class="p">:</span>
            <span class="c1">#print(f'{len(ga_df)} samples found in {f}')</span>
            <span class="n">samp_count</span><span class="o">+=</span><span class="nb">len</span><span class="p">(</span><span class="n">ga_df</span><span class="p">)</span>
            <span class="n">ga_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
            <span class="n">ga_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="si">}</span><span class="s1">.csv'</span><span class="p">)</span>
            <span class="k">del</span> <span class="n">ga_df</span>
            <span class="n">gc</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
        <span class="c1">#else: print(f'No {lang} text found in {f} ?')</span>
        <span class="c1">#print()</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">samp_count</span><span class="si">}</span><span class="s1"> total text samples extracted'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
.progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
background: #F44336;
}
</style>
<progress value='77' class='' max='77' style='width:300px; height:20px; vertical-align: middle;'></progress>
100.00% [77/77 00:14&lt;00:00]
</div>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>34235 total text samples extracted
</pre>
</div>
</div>
</div>
</div>
</div>
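<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The loop above keeps the rows whose <code>target_language</code> contains the language code. As a minimal sketch of that filter in isolation, assuming the column names shown in this post: the helper name <code>filter_by_language</code> and the toy rows are hypothetical, and the mask is built once with plain substring matching rather than a regular expression.</p>

```python
import pandas as pd


def filter_by_language(df, lang):
    """Return a copy of the rows whose target_language contains `lang` (case-insensitive)."""
    # Build the boolean mask once; regex=False treats `lang` as a plain substring
    # and na=False drops rows where the language tag is missing.
    mask = df["target_language"].str.lower().str.contains(lang.lower(), regex=False, na=False)
    return df[mask].copy()


# Toy rows for illustration only (not taken from the corpus).
toy = pd.DataFrame({
    "source_language": ["en", "en", "en"],
    "source_sentence": ["Hello", "Thank you", "Good night"],
    "target_language": ["GA", "fr", "ga-IE"],
    "target_sentence": ["Dia duit", "Merci", "Oíche mhaith"],
})
ga_rows = filter_by_language(toy, "ga")
print(len(ga_rows))
```

<p>Because the match is a substring test, a code such as <code>ga</code> also matches longer tags like <code>ga-IE</code>. Returning an empty DataFrame when nothing matches may also remove the need to check whether <code>ga_df</code> exists in <code>locals()</code>.</p>
</div>
</div>
</div>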
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Compile-Saved-CSVs">Compile Saved CSVs<a class="anchor-link" href="#Compile-Saved-CSVs"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">lang</span><span class="o">=</span><span class="s1">'ga'</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">'elrc'</span><span class="p">)</span>
<span class="n">f_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="n">dir_path</span><span class="o">.</span><span class="n">iterdir</span><span class="p">()):</span>
    <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">suffix</span> <span class="o">==</span> <span class="s1">'.csv'</span><span class="p">:</span> <span class="n">f_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">f</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">progress_bar</span><span class="p">(</span><span class="n">f_list</span><span class="p">)):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># NOTE: the first file is read here and again just below, so its rows are concatenated twice</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="n">ga_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">tmp</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">ga_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">ga_df</span><span class="p">,</span> <span class="n">tmp</span><span class="p">])</span>
    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Error with opening </span><span class="si">{</span><span class="n">f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ga_df</span><span class="p">))</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'elrc_en-ga_compiled_2020-06-11.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
.progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
background: #F44336;
}
</style>
<progress value='34' class='' max='34' style='width:300px; height:20px; vertical-align: middle;'></progress>
100.00% [34/34 00:00&lt;00:00]
</div>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>34243
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
<th>filepath</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>en</td>
<td>Press release31 March 2020Brussels</td>
<td>ga</td>
<td>Preaseisiúint March 31, 2020An Bhruiséil</td>
<td>elrc/covid19_eu_presscorner_en-ga.tmx</td>
</tr>
<tr>
<th>1</th>
<td>en</td>
<td>State aid: Coronavirus: Irish Repayable Advanc...</td>
<td>ga</td>
<td>Státchabhair: An coróinvíreas: Scéim Réamhíoca...</td>
<td>elrc/covid19_eu_presscorner_en-ga.tmx</td>
</tr>
<tr>
<th>2</th>
<td>en</td>
<td>(i) Direct grants, selective tax advantages an...</td>
<td>ga</td>
<td>(i) an deontas díreach, buntáistí cánach roghn...</td>
<td>elrc/covid19_eu_presscorner_en-ga.tmx</td>
</tr>
<tr>
<th>3</th>
<td>en</td>
<td>(i) Direct grants, equity injections, selectiv...</td>
<td>ga</td>
<td>(i) Deontais dhíreacha, instealltaí cothromais...</td>
<td>elrc/covid19_eu_presscorner_en-ga.tmx</td>
</tr>
<tr>
<th>4</th>
<td>en</td>
<td>State aid_coronavirus_IrelandThe European Comm...</td>
<td>ga</td>
<td>Bearta tacaíochta na hÉireann</td>
<td>elrc/covid19_eu_presscorner_en-ga.tmx</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
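<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In the compile loop above, the first file is read into both <code>ga_df</code> and <code>tmp</code> on the first iteration, so its rows end up in the result twice. A minimal sketch of an alternative that reads each CSV once and concatenates in a single call, assuming the CSVs carry their index in the first column as written by the extraction step. The function name <code>compile_csvs</code> and the demo files are hypothetical.</p>

```python
import tempfile
from pathlib import Path

import pandas as pd


def compile_csvs(dir_path):
    """Read every .csv in dir_path once and stack them into one DataFrame."""
    frames = []
    for f in sorted(Path(dir_path).iterdir()):
        if f.suffix != ".csv":
            continue
        try:
            frames.append(pd.read_csv(f, index_col=0))
        except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as e:
            print(f"Error with opening {f}: {e}")
    if not frames:
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1, replacing a separate reset_index call.
    return pd.concat(frames, ignore_index=True)


# Demo on two throwaway files (hypothetical data, not the ELRC corpus).
with tempfile.TemporaryDirectory() as tmp:
    for name, rows in [("a.tmx.csv", ["Dia duit"]), ("b.tmx.csv", ["Slán", "Go raibh maith agat"])]:
        pd.DataFrame({"target_sentence": rows, "filepath": name}).to_csv(Path(tmp) / name)
    compiled = compile_csvs(tmp)
print(len(compiled))
```

<p>Collecting the frames in a list and calling <code>pd.concat</code> once also avoids re-copying the growing DataFrame on every iteration.</p>
</div>
</div>
</div>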
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Number of source documents:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>33</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Number of lines per source document:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
<tr>
<th>filepath</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>elrc/citizens_information_en-ga.tmx</th>
<td>10297</td>
</tr>
<tr>
<th>elrc/Tuarascalaca_Bliantula_na_Roinne_Leanai_agus_Gnothai_Oige_en_ga_clean.tmx</th>
<td>2954</td>
</tr>
<tr>
<th>elrc/Tuarascail_Bhliantuil_Chomhairle_Chontae_Longfoirt_2017_en_ga_clean.tmx</th>
<td>2646</td>
</tr>
<tr>
<th>elrc/medical_domain_en-ga.tmx</th>
<td>1289</td>
</tr>
<tr>
<th>elrc/website_parallel_corpus_2259.en-ga.tmx</th>
<td>1134</td>
</tr>
<tr>
<th>elrc/Programme_for_Government_Annual_Report_2013_en_ga_clean.tmx</th>
<td>1020</td>
</tr>
<tr>
<th>elrc/Preasraitis_Gaois_Fiontar_Scoil_na_Gaeilge_DCU_1_en_ga_clean.tmx</th>
<td>975</td>
</tr>
<tr>
<th>elrc/Raitis_Airgeadais_Ollscoil_Mha_Nuad_2017-2018_en_ga_clean.tmx</th>
<td>677</td>
</tr>
<tr>
<th>elrc/Raitis_Airgeadais_Oifig_an_Choimisineara_Teanga_en_ga_clean.tmx</th>
<td>487</td>
</tr>
<tr>
<th>elrc/eu_vacination_portal_en-ga.tmx</th>
<td>359</td>
</tr>
<tr>
<th>elrc/Press_Releases_from_Department_of_Children_January-May_2019_en_ga_clean.tmx</th>
<td>353</td>
</tr>
<tr>
<th>elrc/coimisineir_teanga_web_corpus.tmx</th>
<td>321</td>
</tr>
<tr>
<th>elrc/Oifigi_Ombudsman_in_Eirinn_en_ga_clean.tmx</th>
<td>249</td>
</tr>
<tr>
<th>elrc/Leabhran_dAonad_Altranais_Pobail_Teach Uí_Riada_en_ga_clean.tmx</th>
<td>220</td>
</tr>
<tr>
<th>elrc/Tearmaiocht_agus_aistriucha_in_a_bhaineann_le_fograi_poist_foluntais_abhair_chomortha_1916_agus_eolas_ginearalta_ar_Oifig_na_Gaeilge_en_ga_clean.tmx</th>
<td>188</td>
</tr>
<tr>
<th>elrc/Polsasi_ar_Fheiniulacht_agus_Leiriu_Inscne_Ollscoil_Mha_Nuad 2019 _en_ga_clean.tmx</th>
<td>177</td>
</tr>
<tr>
<th>elrc/Preasraitis_Gaois_Fiontar_Scoil_na_Gaeilge_DCU_2_en_ga_clean.tmx</th>
<td>162</td>
</tr>
<tr>
<th>elrc/Preasraitis_Oifig_an_Choimisinéara_Teanga_en_ga_clean.tmx</th>
<td>91</td>
</tr>
<tr>
<th>elrc/Preasraitis_Ollscoil_Mha_Nuad_Earrach_2019_en_ga_clean.tmx</th>
<td>71</td>
</tr>
<tr>
<th>elrc/Tuairisc_a_thug_Maire_Nic_Shiubhlaigh_en_ga_clean.tmx</th>
<td>50</td>
</tr>
<tr>
<th>elrc/Preasraiteas_Mi_Iuil_en_ga_clean.tmx</th>
<td>40</td>
</tr>
<tr>
<th>elrc/Preasraitis_Ollscoil_Mha_Nuad_Samhradh 2019_en_ga_clean.tmx</th>
<td>38</td>
</tr>
<tr>
<th>elrc/Preasraiteas_faoi_foirgneamh_nua_scoile_en_ga_clean.tmx</th>
<td>22</td>
</tr>
<tr>
<th>elrc/Litir_ó_Oifig_an_Choimisinéara_Teanga_en_ga_clean.tmx</th>
<td>22</td>
</tr>
<tr>
<th>elrc/Faisnéis faoi IDS_en_ga_clean.tmx</th>
<td>20</td>
</tr>
<tr>
<th>elrc/Toiliu_don_Scagthastail_Scoile_um_Amhairc_Eisteachta_en_ga_clean.tmx</th>
<td>19</td>
</tr>
<tr>
<th>elrc/covid19_eu_presscorner_en-ga.tmx</th>
<td>16</td>
</tr>
<tr>
<th>elrc/covid19_europarl_v1_en-ga.tmx</th>
<td>13</td>
</tr>
<tr>
<th>elrc/Pleananna_ITBAC_le_comóradh_a_dheanamh_ar_1916_en_ga_clean.tmx</th>
<td>10</td>
</tr>
<tr>
<th>elrc/Foirm FSS Iarratais Duine ar a Shonraí _en_ga_clean.tmx</th>
<td>10</td>
</tr>
<tr>
<th>elrc/Postaer_faoi_scoil_ag_claru_en_ga_clean.tmx</th>
<td>6</td>
</tr>
<tr>
<th>elrc/Preasraiteas_faoi_Uachtarán_nua_en_ga_clean.tmx</th>
<td>5</td>
</tr>
<tr>
<th>elrc/covid19_europarl_v2_en-ga.tmx</th>
<td>5</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
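<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The input cells for the two summaries above are not shown. One way such numbers could be produced, assuming the compiled DataFrame has the <code>filepath</code> column shown earlier, is sketched here on a toy DataFrame with hypothetical rows.</p>

```python
import pandas as pd

# Toy stand-in for the compiled DataFrame (hypothetical rows).
ga_df = pd.DataFrame({
    "target_sentence": ["a", "b", "c", "d"],
    "filepath": ["elrc/x.tmx", "elrc/x.tmx", "elrc/y.tmx", "elrc/x.tmx"],
})

# Number of distinct source documents.
n_docs = ga_df["filepath"].nunique()

# Number of lines per source document, largest first.
lines_per_doc = ga_df.groupby("filepath").size().sort_values(ascending=False)

print(n_docs)
print(lines_per_doc)
```
</div>
</div>
</div>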
</div>
Morgan McGuire
DGT-TM, DGT-Translation Memory
2020-06-11T00:00:00-05:00
https://www.nlp.irish/dgt-tm/dgt-tm3/translation/nmt/mt/2020/06/11/DGT-Translation-Memories-DGT-TM
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-11-DGT-Translation Memories-DGT-TM.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Available-for-Download--✅">Available for Download ✅<a class="anchor-link" href="#Available-for-Download--✅"> </a></h3><p>⚠️ Always check the license of the data source before using the data ⚠️</p>
<ul>
<li>Link: <a href="https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory">https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory</a></li>
<li>Format: <strong>.tmx</strong></li>
<li>NOTE:<ul>
<li>There are <strong>no Irish translations</strong> in:<ul>
<li>DGT-TM Version 1 (Released in 2007) </li>
<li>DGT-TM-release 2011 </li>
</ul>
</li>
<li>"DGT-TM-release 2012" is the first release with Irish translations</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Brief-Description">Brief Description<a class="anchor-link" href="#Brief-Description"> </a></h3><p>A parallel multilingual corpus of the European Union’s legislative documents (Acquis Communautaire) in 24 EU languages. The aligned translation units have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in EURAMIS (European advanced multilingual information system). This memory contains most, although not all, of the documents which make up the Acquis Communautaire, as well as some other documents which are not part of the Acquis.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Notes">Other Notes<a class="anchor-link" href="#Other-Notes"> </a></h3><p>See the section titled "How to produce bilingual extractions" on the EU site for a Java-based alternative for extracting the TMX files.</p>
<ul>
<li>Lines of text: 190,500</li>
<li>GA Word count: 4,852,515</li>
</ul>
</div>
</div>
</div>
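<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>How the line and word totals above were computed is not stated. A minimal sketch of one plausible approach, assuming whitespace tokenisation with <code>.split()</code> and a <code>target_sentence</code> column holding the Irish text. The toy rows are hypothetical.</p>

```python
import pandas as pd

# Toy rows for illustration; the last one simulates a missing sentence.
df = pd.DataFrame({"target_sentence": ["Dia duit", "Go raibh maith agat", None]})

sentences = df["target_sentence"].dropna()
n_lines = len(sentences)

# Whitespace tokenisation: split() with no arguments splits on any run of whitespace.
word_counts = sentences.str.split().str.len()
total_words = int(word_counts.sum())

print(n_lines, total_words)
```

<p>Whitespace splitting counts punctuation attached to a word as part of that word, so totals from a proper tokeniser may differ.</p>
</div>
</div>
</div>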
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Word-Count-Distribution">Word Count Distribution<a class="anchor-link" href="#Word-Count-Distribution"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYMAAAEXCAYAAABPkyhHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAXoElEQVR4nO3dfbRddX3n8ffH8GhBw0NUSKABzbRFlyKmwKhVKw4G1IZWmEEZjZZpZhyZpWvpKNaxapVWu2a0wxofhpGUQFVARIkWB7NUsHYUCPIYELmiQEgK0fAoioLf+WP/rj3enHPvydN9yvu11ll379/+nb1/v7PvvZ+zf3uffVJVSJJ2bk+Y6gZIkqaeYSBJMgwkSYaBJAnDQJKEYSBJwjDQLJLkfUn+fqrbsS2SvCHJt3rmH05y6HZa958n+VSbXpikkuyyndZ9cGvrnO2xPk0+w0DaQZK8JMm6bVlHVe1VVbdvj+1U1V9V1X/Ylvb0bPNHSV7Ws+47W1sf3x7r1+QzDDTjpOPv7hbYXkcAmr38g1JfSY5Icm2Sh5J8LskFST7Ylu2T5MtJNia5r00vGLCeNyb5Us/8SJILe+bvSnJ4m35+kquTPNB+Pr+n3uVJzkjyT8AjwKFJDklyRWvjamD/Cfq0NMl1SR5M8oMkS1r5gUlWJdnU2vdnPc85Z7Tfbf433oW3d8hvT3JDa/cFSfZI8lvAV4AD2/DJw0kO7NOm/dq2H0xyFfD0McsryTPa9PFJbm79vbttt+922pDZRUn+PsmDwBsGDKP9aZL1STYkedsw/U5yHnAw8KW2vXeMHXaa4DV9X5ILk5zb+rI2yeLx9p12PMNAm0myG/AF4BxgX+CzwB/3VHkC8HfAb9P9U/gZ8L8GrO4K4A+SPCHJAcCuwAvadg4F9gJuSLIv8A/AmcB+wEeAf0iyX8+6XgcsB/YG7gA+A1xDFwIfAJaN06cjgXOB/wrMBV4E/Kgt/iywDjgQOBH4qyTHDFpXH/8WWAIcAjwbeENV/RQ4Dljfhk/2qqr1fZ77MeDnwAHAn7bHIGcD/7Gq9gaeBXx9gu0sBS5q/f30gHX+IbAIOBY4vXfoZ5Cqeh1wJ/Cqtr2/6VNtotf0j4DzW9tWMfj3R5PEMFA/RwO7AGdW1S+r6mLgqtGFVfWTqvp8VT1SVQ8BZwAv7reiNt79EHB4q3MZcHeS323z/1hVvwJeAdxWVedV1WNV9Vnge8CrelZ3TlWtrarH6P55/j7wnqp6tKq+CXyJwU4FVlTV6qr6VVXdXVXfS3IQ8ELgnVX186q6DvgUXfAM68yqWl9Vm1obDh/mSe1k66uBv6iqn1bVTcDKcZ7yS+CwJE+qqvuq6rsTbOLbVfXF1t+fDajz/rbtG+kC/jXDtH08Q76m36qqS9s5hvOA52zrdrVtDAP1cyBwd/3mXQzvGp1I8sQk/zvJHW0I4pvA3Ay+kuQK4CV078avAC6nC4IXt/nRbd4x5nl3APP7taHVv6+9M+6tP8hBwA/6lB8IbGqhNmi7E/nnnulH6I52hjGPLnR7+zVeH14NHA/c0YbH/vUE679rguVj69xB93psq2Fe07Gv2R6e15hahoH62QDMT5KesoN6pt8G/A5wVFU9ie6fPEBv/V6jYfAHbfoKNg+D9XTDTr0OBu7ume8Npw3APm3MvLf+IHcxZjy+Z7v7Jtl7wHZ/CjyxZ9nTxtnGWBPdEngj8Bi/+doO7ENVXV1VS4GnAF8ERs+9DNrOMLckHrvt0SGmifo93ronek01DRkG6ufbwOPAaUl2SbIUOLJn+d505wnub2P9751gfVfQjU3vWVXrgH+kG2PfD7i21bkU+FdJXtu2+e+Aw4Av91thVd0BrAHen2S3JC/kN4eUxjobeGOSY9r5i/lJfreq7gL+H/DX7cTvs+mGlEbH2K8Djk+yb5KnAW+doK+97gH2S/Lk
AX14HLgYeF872jqMAec9Wh9PSfLkqvol8CDdPppwOxN4T9v2M4E3Ahe08on6fQ/Q9/MPQ7ymmoYMA22mqn4B/AndH/D9wL+n+6f8aKvyt8CewI+B7wD/d4L1fR94mC4EqKoHgduBfxq9Lr2qfgK8ku6o4yfAO4BXVtWPx1n1a4GjgE10gXTuOG24iu6f3UeBB+gCavRI5DXAQrp3tF8A3ltVq9uy84Dr6U42f5V/+Wc5oar6Ht2J1NuT3N/vaiLgNLphpX+mO2H/d+Os8nXAj9rQ3H+i2y/DbmeQK4AR4GvAf6+qr7byifr918B/a9t7e5/1jveaahqKX26jYSS5EvhkVY33z0rSDOWRgfpK8uIkT2tDNsvoLpkc9whA0szl2XsN8jt0Jyj3orsK58Sq2jC1TZK0ozhMJElymEiSNIOHifbff/9auHDhVDdDkmaMa6655sdVNa/fshkbBgsXLmTNmjVT3QxJmjGSDPyEu8NEkiTDQJJkGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEliBn8CeVt85so7+5a/9qjxvjVRkmYvjwwkSYaBJMkwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJYgvCIMmcJNcm+XKbPyTJlUluS3JBkt1a+e5tfqQtX9izjne18luTvLynfEkrG0ly+vbrniRpGFtyZPAW4Jae+Q8DH62qRcB9wKmt/FTgvqp6BvDRVo8khwEnA88ElgAfbwEzB/gYcBxwGPCaVleSNEmGCoMkC4BXAJ9q8wFeClzUqqwETmjTS9s8bfkxrf5S4PyqerSqfgiMAEe2x0hV3V5VvwDOb3UlSZNk2CODvwXeAfyqze8H3F9Vj7X5dcD8Nj0fuAugLX+g1f91+ZjnDCrfTJLlSdYkWbNx48Yhmy5JmsiEYZDklcC9VXVNb3GfqjXBsi0t37yw6qyqWlxVi+fNmzdOqyVJW2KXIeq8APijJMcDewBPojtSmJtkl/bufwGwvtVfBxwErEuyC/BkYFNP+aje5wwqlyRNggmPDKrqXVW1oKoW0p0A/npVnQJ8AzixVVsGXNKmV7V52vKvV1W18pPb1UaHAIuAq4CrgUXt6qTd2jZWbZfeSZKGMsyRwSDvBM5P8kHgWuDsVn42cF6SEbojgpMBqmptkguBm4HHgDdX1eMASU4DLgPmACuqau02tEuStIW2KAyq6nLg8jZ9O92VQGPr/Bw4acDzzwDO6FN+KXDplrRFkrT9+AlkSZJhIEkyDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kSQ4RBkj2SXJXk+iRrk7y/lR+S5MoktyW5IMlurXz3Nj/Sli/sWde7WvmtSV7eU76klY0kOX37d1OSNJ5hjgweBV5aVc8BDgeWJDka+DDw0apaBNwHnNrqnwrcV1XPAD7a6pHkMOBk4JnAEuDjSeYkmQN8DDgOOAx4TasrSZokE4ZBdR5us7u2RwEvBS5q5SuBE9r00jZPW35MkrTy86vq0ar6ITACHNkeI1V1e1X9Aji/1ZUkTZKhzhm0d/DXAfcCq4EfAPdX1WOtyjpgfpueD9wF0JY/AOzXWz7mOYPK+7VjeZI1SdZs3LhxmKZLkoYwVBhU1eNVdTiwgO6d/O/1q9Z+ZsCyLS3v146zqmpxVS2eN2/exA2XJA1li64mqqr7gcuBo4G5SXZpixYA69v0OuAggLb8ycCm3vIxzxlULkmaJMNcTTQvydw2vSfwMuAW4BvAia3aMuCSNr2qzdOWf72qqpWf3K42OgRYBFwFXA0salcn7UZ3knnV9uicJGk4u0xc
hQOAle2qnycAF1bVl5PcDJyf5IPAtcDZrf7ZwHlJRuiOCE4GqKq1SS4EbgYeA95cVY8DJDkNuAyYA6yoqrXbrYeSpAlNGAZVdQPw3D7lt9OdPxhb/nPgpAHrOgM4o0/5pcClQ7RXkrQD+AlkSZJhIEkyDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKwy1Q3YDr5zJV39i1/7VEHT3JLJGlyeWQgSTIMJEmGgSQJw0CShGEgScIwkCRhGEiSGCIMkhyU5BtJbkmyNslbWvm+SVYnua393KeVJ8mZSUaS3JDkiJ51LWv1b0uyrKf8eUlubM85M0l2RGclSf0N86Gzx4C3VdV3k+wNXJNkNfAG4GtV9aEkpwOnA+8EjgMWtcdRwCeAo5LsC7wXWAxUW8+qqrqv1VkOfAe4FFgCfGX7dXPH8ENqkmaLCY8MqmpDVX23TT8E3ALMB5YCK1u1lcAJbXopcG51vgPMTXIA8HJgdVVtagGwGljSlj2pqr5dVQWc27MuSdIk2KJzBkkWAs8FrgSeWlUboAsM4Cmt2nzgrp6nrWtl45Wv61MuSZokQ9+bKMlewOeBt1bVg+MM6/dbUFtR3q8Ny+mGkzj44Mkbihk0HCRJs8VQRwZJdqULgk9X1cWt+J42xEP7eW8rXwcc1PP0BcD6CcoX9CnfTFWdVVWLq2rxvHnzhmm6JGkIw1xNFOBs4Jaq+kjPolXA6BVBy4BLespf364qOhp4oA0jXQYcm2SfduXRscBlbdlDSY5u23p9z7okSZNgmGGiFwCvA25Mcl0r+3PgQ8CFSU4F7gROassuBY4HRoBHgDcCVNWmJB8Arm71/rKqNrXpNwHnAHvSXUU07a8kkqTZZMIwqKpv0X9cH+CYPvULePOAda0AVvQpXwM8a6K2SJJ2DD+BLEkyDCRJfu3lDuEnkyXNNB4ZSJIMA0mSYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSXgL60nlra0lTVceGUiSDANJkmEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAk4Y3qpgVvYCdpqnlkIEkyDCRJhoEkCcNAkoRhIEnCMJAkMUQYJFmR5N4kN/WU7ZtkdZLb2s99WnmSnJlkJMkNSY7oec6yVv+2JMt6yp+X5Mb2nDOTZHt3UpI0vmGODM4BlowpOx34WlUtAr7W5gGOAxa1x3LgE9CFB/Be4CjgSOC9owHS6izved7YbUmSdrAJw6CqvglsGlO8FFjZplcCJ/SUn1ud7wBzkxwAvBxYXVWbquo+YDWwpC17UlV9u6oKOLdnXZKkSbK15wyeWlUbANrPp7Ty+cBdPfXWtbLxytf1KZckTaLtfQK533h/bUV5/5Uny5OsSbJm48aNW9lESdJYW3tvonuSHFBVG9pQz72tfB1wUE+9BcD6Vv6SMeWXt/IFfer3VVVnAWcBLF68eGBozBbes0jSZNnaI4NVwOgVQcuAS3rKX9+uKjoaeKANI10GHJtkn3bi+FjgsrbsoSRHt6uIXt+zLknSJJnwyCDJZ+ne1e+fZB3dVUEfAi5McipwJ3BSq34pcDwwAjwCvBGgqjYl+QBwdav3l1U1elL6TXRXLO0JfKU9JEmTaMIwqKrXDFh0TJ+6Bbx5wHpWACv6lK8BnjVROyRJO46fQJYkGQaSJMNAkoRhIEnCMJAkYRhIktj6TyBrCg36ZDL46WRJW8cjA0mSYSBJMgwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAk4SeQZx2/N1nS1vDIQJJkGEiSDANJEoaBJAnDQJKEYSBJwktLdxpecippPB4ZSJIMA0mSYSBJwjCQJOEJ5J2eJ5YlgUcGkiQMA0kS
DhNpAIePpJ2LRwaSJMNAkuQwkbaQw0fS7OSRgSTJIwNtHx4xSDObYaAdypCQZoZpEwZJlgD/E5gDfKqqPjTFTdIOZEhI08u0CIMkc4CPAf8GWAdcnWRVVd08tS3TZDMkpKkxLcIAOBIYqarbAZKcDywFDAMBg0NiezJwtDObLmEwH7irZ34dcNTYSkmWA8vb7MNJbt2Kbe0P/HgrnjcdzZa+TIt+nLJ9VjMt+rKdzJa+zJZ+wLb35bcHLZguYZA+ZbVZQdVZwFnbtKFkTVUt3pZ1TBezpS+zpR9gX6aj2dIP2LF9mS6fM1gHHNQzvwBYP0VtkaSdznQJg6uBRUkOSbIbcDKwaorbJEk7jWkxTFRVjyU5DbiM7tLSFVW1dgdtbpuGmaaZ2dKX2dIPsC/T0WzpB+zAvqRqs6F5SdJOZroME0mSppBhIEnaecIgyZIktyYZSXL6VLdnSyX5UZIbk1yXZE0r2zfJ6iS3tZ/7THU7+0myIsm9SW7qKevb9nTObPvphiRHTF3LNzegL+9LcnfbN9clOb5n2btaX25N8vKpafXmkhyU5BtJbkmyNslbWvmM2y/j9GUm7pc9klyV5PrWl/e38kOSXNn2ywXtQhuS7N7mR9ryhVu98aqa9Q+6k9I/AA4FdgOuBw6b6nZtYR9+BOw/puxvgNPb9OnAh6e6nQPa/iLgCOCmidoOHA98he6zJ0cDV051+4foy/uAt/epe1j7XdsdOKT9Ds6Z6j60th0AHNGm9wa+39o74/bLOH2ZifslwF5telfgyvZ6Xwic3Mo/CbypTf9n4JNt+mTggq3d9s5yZPDr211U1S+A0dtdzHRLgZVteiVwwhS2ZaCq+iawaUzxoLYvBc6tzneAuUkOmJyWTmxAXwZZCpxfVY9W1Q+BEbrfxSlXVRuq6rtt+iHgFro7Acy4/TJOXwaZzvulqurhNrtrexTwUuCiVj52v4zur4uAY5L0+xDvhHaWMOh3u4vxflmmowK+muSadlsOgKdW1Qbo/iCAp0xZ67bcoLbP1H11Whs+WdEzXDcj+tKGFp5L9y50Ru+XMX2BGbhfksxJch1wL7Ca7sjl/qp6rFXpbe+v+9KWPwDstzXb3VnCYKjbXUxzL6iqI4DjgDcnedFUN2gHmYn76hPA04HDgQ3A/2jl074vSfYCPg+8taoeHK9qn7Lp3pcZuV+q6vGqOpzuTgxHAr/Xr1r7ud36srOEwYy/3UVVrW8/7wW+QPdLcs/ooXr7ee/UtXCLDWr7jNtXVXVP+wP+FfB/+Jchh2ndlyS70v3z/HRVXdyKZ+R+6deXmbpfRlXV/cDldOcM5iYZ/ZBwb3t/3Ze2/MkMP4z5G3aWMJjRt7tI8ltJ9h6dBo4FbqLrw7JWbRlwydS0cKsMavsq4PXt6pWjgQdGhy2mqzFj539Mt2+g68vJ7YqPQ4BFwFWT3b5+2rjy2cAtVfWRnkUzbr8M6ssM3S/zksxt03sCL6M7B/IN4MRWbex+Gd1fJwJfr3Y2eYtN9dnzyXrQXQ3xfbrxt3dPdXu2sO2H0l39cD2wdrT9dGODXwNuaz/3neq2Dmj/Z+kO039J907m1EFtpzvs/VjbTzcCi6e6/UP05bzW1hvaH+cBPfXf3fpyK3DcVLe/p10vpBtOuAG4rj2On4n7ZZy+zMT98mzg2tbmm4C/aOWH0gXWCPA5YPdWvkebH2nLD93abXs7CknSTjNMJEkah2EgSTIMJEmGgSQJw0CShGEgScIwkHaIdvvkt091O6RhGQaSJHaZuIq080nyHuAUujtC/hi4hu6OkMvpvhNjBHhdVT0yxLqeTvfp3XnAI8CfVdX3kpwDPAgsBp4GvKOqLhq4ImkH8shAGiPJYuDVdLdC/hO6f9YAF1fV71fVc+juF3PqkKs8C/gvVfU84O3Ax3uWHUB3O4VXAh/aDs2XtopHBtLmXghcUlU/A0jypVb+rCQfBOYCewGXTbSidlvl5wOf6/nOkd17qnyxurtq
3pzkqdup/dIWMwykzQ36pqhzgBOq6vokbwBeMsS6nkD3xSSHD1j+6BDblXY4h4mkzX0LeFX7cvK9gFe08r2BDe3e+acMs6LqvmTlh0lOgl9/sfxzdkSjpW1hGEhjVNXVdLc8vh64GFhDd/L4PXRfp7ga+N4WrPIU4NQko7cgnw3fv61ZxltYS30k2auqHk7yROCbwPJqX7ouzUaeM5D6OyvJYXRfHrLSINBs55GBtA2SvBs4aUzx56rqjKloj7S1DANJkieQJUmGgSQJw0CShGEgSQL+P3BK85qH6YiYAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
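<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The code behind the plot above is not shown. As a sketch under that caveat, a per-sentence word count distribution can be binned with NumPy. The variable name <code>ga_len</code>, the bin edges and the toy sentences are assumptions made for illustration.</p>

```python
import numpy as np
import pandas as pd

# Toy sentences (hypothetical); in practice this would be the Irish sentence column.
sentences = pd.Series(["a b", "a b c", "a b c d e f g"])
ga_len = sentences.str.split().str.len()

# Two bins of five words each: [0, 5) and [5, 10].
counts, edges = np.histogram(ga_len, bins=[0, 5, 10])
print(counts.tolist(), edges.tolist())
```

<p>With matplotlib installed, <code>ga_len.plot.hist(bins=...)</code> would draw the same distribution as a chart.</p>
</div>
</div>
</div>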
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Extract-TMX-to-DataFrame">Code to Extract TMX to DataFrame<a class="anchor-link" href="#Code-to-Extract-TMX-to-DataFrame"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Install the <code>tmx2dataframe</code> package (<a href="https://github.com/jaderabbit/tmx2dataframe">source on GitHub</a>) with pip:</p>
<blockquote><p><code>pip install tmx2dataframe</code></p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Extract-from-a-single-TMX-file">Extract from a single TMX file<a class="anchor-link" href="#Extract-from-a-single-TMX-file"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tmx2dataframe</span> <span class="kn">import</span> <span class="n">tmx2dataframe</span>
<span class="n">metadata</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">tmx2dataframe</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s1">'Volume_1/22003D0033.tmx'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>EN-GB</td>
<td>Decision of the EEA Joint Committee No 33/2003...</td>
<td>ET-01</td>
<td>EMP Ühiskomitee otsus nr 33/2003, 14. märts 20...</td>
</tr>
<tr>
<th>1</th>
<td>EN-GB</td>
<td>THE EEA JOINT COMMITTEE,</td>
<td>ET-01</td>
<td>EMP ÜHISKOMITEE,</td>
</tr>
<tr>
<th>2</th>
<td>EN-GB</td>
<td>Having regard to the Agreement on the European...</td>
<td>ET-01</td>
<td>võttes arvesse Euroopa Majanduspiirkonna lepin...</td>
</tr>
<tr>
<th>3</th>
<td>EN-GB</td>
<td>HAS DECIDED AS FOLLOWS:</td>
<td>ET-01</td>
<td>ON VASTU VÕTNUD JÄRGMISE OTSUSE:</td>
</tr>
<tr>
<th>4</th>
<td>EN-GB</td>
<td>Article 1</td>
<td>ET-01</td>
<td>Artikkel 1</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">source_sentence</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">df</span><span class="o">.</span><span class="n">target_sentence</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>('Decision of the EEA Joint Committee No 33/2003 of 14 March 2003 amending Annex XIII (Transport) to the EEA Agreement',
'EMP Ühiskomitee otsus nr 33/2003, 14. märts 2003, millega muudetakse EMP lepingu XIII lisa (transport)')</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">target_sentence</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(35, 2)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The metadata is also included:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">metadata</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'creationtool': 'tmexport2 2.32 27-03-2007',
'adminlang': 'EN-US',
'srclang': 'EN-GB'}</pre>
</div>
</div>
</div>
</div>
</div>
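<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To illustrate where the metadata and the sentence pairs come from, here is a minimal sketch that parses a tiny TMX document with the Python standard library. It is not the implementation of <code>tmx2dataframe</code>. The function name <code>read_tmx</code> and the sample document are hypothetical, and the sketch assumes that the first two <code>tuv</code> elements of each translation unit are the source and the target.</p>

```python
import xml.etree.ElementTree as ET

import pandas as pd

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-language TMX document for illustration.
TMX = """<tmx version="1.4">
  <header creationtool="demo" adminlang="EN-US" srclang="EN-GB"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="GA-IE"><seg>Airteagal 1</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return (header attributes, DataFrame of sentence pairs) for a TMX string."""
    root = ET.fromstring(text)
    metadata = dict(root.find("header").attrib)
    rows = []
    for tu in root.iter("tu"):
        tuvs = tu.findall("tuv")
        if len(tuvs) >= 2:
            # Assumption: first tuv is the source, second is the target.
            src, tgt = tuvs[0], tuvs[1]
            rows.append({
                "source_language": src.get(XML_LANG),
                "source_sentence": src.findtext("seg"),
                "target_language": tgt.get(XML_LANG),
                "target_sentence": tgt.findtext("seg"),
            })
    return metadata, pd.DataFrame(rows)


metadata, df = read_tmx(TMX)
print(metadata["srclang"], len(df))
```

<p>Translation units with more than two language variants would need extra handling, since this sketch only looks at the first two.</p>
</div>
</div>
</div>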
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Extract-language-specific-sentences-from-multiple-volumes-and-TMX-files:">Extract language-specific sentences from multiple volumes and TMX files:<a class="anchor-link" href="#Extract-language-specific-sentences-from-multiple-volumes-and-TMX-files:"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">lang</span><span class="o">=</span><span class="s1">'GA'</span>
<span class="c1">#yr='2013'</span>
<span class="n">yr_list</span><span class="o">=</span><span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">yr_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s1">'201</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># For each release year</span>
<span class="k">for</span> <span class="n">yr</span> <span class="ow">in</span> <span class="n">yr_list</span><span class="p">:</span>
    <span class="n">dir_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">yr</span><span class="si">}</span><span class="s1">_release'</span><span class="p">)</span>
    <span class="n">dir_list</span><span class="o">=</span><span class="p">[]</span>
    <span class="k">for</span> <span class="n">dd</span> <span class="ow">in</span> <span class="n">dir_path</span><span class="o">.</span><span class="n">iterdir</span><span class="p">():</span>
        <span class="k">if</span> <span class="n">dd</span><span class="o">.</span><span class="n">is_dir</span><span class="p">():</span> <span class="n">dir_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">dd</span><span class="p">)</span>
    <span class="n">mb</span> <span class="o">=</span> <span class="n">master_bar</span><span class="p">(</span><span class="n">dir_list</span><span class="p">)</span>
    <span class="c1"># For each directory in a specific release year</span>
    <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">mb</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">d</span><span class="o">.</span><span class="n">is_dir</span><span class="p">()</span> <span class="ow">and</span> <span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">suffix</span> <span class="o">!=</span> <span class="s1">'.zip'</span><span class="p">):</span>
            <span class="c1"># For each file in a specific directory</span>
            <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">iterdir</span><span class="p">()),</span> <span class="n">parent</span><span class="o">=</span><span class="n">mb</span><span class="p">):</span>
                <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">suffix</span> <span class="o">==</span> <span class="s1">'.tmx'</span><span class="p">:</span>
                    <span class="k">try</span><span class="p">:</span>
                        <span class="n">_</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">tmx2dataframe</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">f</span><span class="p">))</span>
                        <span class="c1"># If target_language in the dataframe contains the language string (like 'GA')</span>
                        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">target_language</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">lang</span><span class="p">)])</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
                            <span class="n">tmp</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">target_language</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">lang</span><span class="p">)]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
                            <span class="n">tmp</span><span class="p">[</span><span class="s1">'filepath'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
                            <span class="n">var_exists</span> <span class="o">=</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">locals</span><span class="p">()</span> <span class="ow">or</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">globals</span><span class="p">()</span>
                            <span class="k">if</span> <span class="n">var_exists</span><span class="p">:</span> <span class="n">ga_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">ga_df</span><span class="p">,</span> <span class="n">tmp</span><span class="p">])</span>
                            <span class="k">else</span><span class="p">:</span> <span class="n">ga_df</span> <span class="o">=</span> <span class="n">tmp</span>
                    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
                        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Couldn't open </span><span class="si">{</span><span class="n">f</span><span class="si">}</span><span class="s2"> in </span><span class="si">{</span><span class="n">d</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">yr</span><span class="si">}</span><span class="s1"> DONE!'</span><span class="p">)</span>
    <span class="n">var_exists</span> <span class="o">=</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">locals</span><span class="p">()</span> <span class="ow">or</span> <span class="s1">'ga_df'</span> <span class="ow">in</span> <span class="nb">globals</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">var_exists</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ga_df</span><span class="p">)</span><span class="si">}</span><span class="s1"> samples found in </span><span class="si">{</span><span class="n">yr</span><span class="si">}</span><span class="s1"> release'</span><span class="p">)</span>
        <span class="n">ga_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="n">ga_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="sa">f</span><span class="s1">'dgt_tm_</span><span class="si">{</span><span class="n">yr</span><span class="si">}</span><span class="s1">_release_en-ga.csv'</span><span class="p">)</span>
        <span class="c1"># Free the year's dataframe so the next year starts from scratch</span>
        <span class="k">del</span> <span class="n">ga_df</span>
        <span class="n">gc</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'No </span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s1"> text found in </span><span class="si">{</span><span class="n">yr</span><span class="si">}</span><span class="s1"> release'</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">()</span>
<span class="c1">#ga_df.head()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2012 DONE!
2848 samples found in 2012 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2013 DONE!
No GA text found in 2013 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2014 DONE!
41461 samples found in 2014 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2015 DONE!
7673 samples found in 2015 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2016 DONE!
9127 samples found in 2016 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2017 DONE!
37181 samples found in 2017 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2018 DONE!
30014 samples found in 2018 release
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2019 DONE!
53652 samples found in 2019 release
</pre>
</div>
</div>
</div>
</div>
</div>
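<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The language filter above relies on <code>tmx2dataframe</code>. For readers who want to see what that step amounts to, here is a minimal standard-library sketch. It is not the library's implementation. It assumes each <code>tu</code> element holds one <code>tuv</code> per language with the text in a <code>seg</code> child, and that the language code sits in either a <code>lang</code> or an <code>xml:lang</code> attribute. The names <code>extract_pairs</code> and <code>make_tu</code> and the second sample unit are hypothetical, and codes are matched by prefix rather than by substring as in the cell above.</p>

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def extract_pairs(root, source='EN', target='GA'):
    """Return (source_sentence, target_sentence) pairs from a parsed TMX tree.

    Only translation units that contain both languages are kept.
    Codes are matched by prefix, so 'GA' also matches 'GA-IE'.
    """
    pairs = []
    for tu in root.iter('tu'):
        segs = {}
        for tuv in tu.iter('tuv'):
            code = (tuv.get('lang') or tuv.get(XML_LANG) or '').upper()
            seg = tuv.find('seg')
            if seg is not None:
                segs[code] = ''.join(seg.itertext()).strip()
        src = next((s for c, s in segs.items() if c.startswith(source)), None)
        tgt = next((s for c, s in segs.items() if c.startswith(target)), None)
        if src is not None and tgt is not None:
            pairs.append((src, tgt))
    return pairs

def make_tu(body, segments):
    """Hypothetical helper: append one translation unit to a TMX body."""
    tu = ET.SubElement(body, 'tu')
    for code, text in segments.items():
        tuv = ET.SubElement(tu, 'tuv', {'lang': code})
        ET.SubElement(tuv, 'seg').text = text

# Build a tiny in-memory TMX tree: only the first unit has an Irish variant
root = ET.Element('tmx')
body = ET.SubElement(root, 'body')
make_tu(body, {'EN-GB': 'of 16 November 2011', 'GA-IE': 'an 16 Samhain 2011'})
make_tu(body, {'EN-GB': 'Article 1', 'FR-FR': 'Article premier'})

pairs = extract_pairs(root)
print(pairs)
```

<p>For a file on disk, <code>extract_pairs(ET.parse(path).getroot())</code> would be the equivalent call, assuming the file follows the layout described above.</p>
</div>
</div>
</div>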
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Compile-all-release-years">Compile all release years<a class="anchor-link" href="#Compile-all-release-years"> </a></h4>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Read each release year exactly once; years without a CSV are reported and skipped</span>
<span class="n">year_dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">tmp</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="sa">f</span><span class="s1">'dgt_tm_201</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s1">_release_en-ga.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">year_dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Error with opening dgt_tm_201</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s1">_release_en-ga.csv'</span><span class="p">)</span>
<span class="c1"># Concatenate all years in one call</span>
<span class="n">ga_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">year_dfs</span><span class="p">)</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ga_df</span><span class="p">))</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'dgt_tm_2012-2019_releases_en-ga.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ga_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>190500
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
<th>filepath</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>EN-GB</td>
<td>Regulation (EU) No 1174/2011 of the European P...</td>
<td>GA-IE</td>
<td>Rialachán (AE) Uimh. 1174/2011 ó Pharlaimint n...</td>
<td>2012_release/Vol_2011_4/32011R1174.tmx</td>
</tr>
<tr>
<th>1</th>
<td>EN-GB</td>
<td>of 16 November 2011</td>
<td>GA-IE</td>
<td>an 16 Samhain 2011</td>
<td>2012_release/Vol_2011_4/32011R1174.tmx</td>
</tr>
<tr>
<th>2</th>
<td>EN-GB</td>
<td>on enforcement measures to correct excessive m...</td>
<td>GA-IE</td>
<td>maidir le bearta forfheidhmiúcháin chun míchot...</td>
<td>2012_release/Vol_2011_4/32011R1174.tmx</td>
</tr>
<tr>
<th>3</th>
<td>EN-GB</td>
<td>THE EUROPEAN PARLIAMENT AND THE COUNCIL OF THE...</td>
<td>GA-IE</td>
<td>TÁ PARLAIMINT NA hEORPA AGUS COMHAIRLE AN AONT...</td>
<td>2012_release/Vol_2011_4/32011R1174.tmx</td>
</tr>
<tr>
<th>4</th>
<td>EN-GB</td>
<td>Having regard to the Treaty on the Functioning...</td>
<td>GA-IE</td>
<td>Ag féachaint don Chonradh ar Fheidhmiú an Aont...</td>
<td>2012_release/Vol_2011_4/32011R1174.tmx</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
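<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One way to sanity-check the combined file is to count rows per release using the <code>filepath</code> column and compare the result with the per-year totals printed during extraction. The sketch below is an illustration rather than part of the original notebook. It assumes the first path component is the release folder and that paths use forward slashes, as in the table above. The function name <code>rows_per_release</code> and the third sample path are hypothetical.</p>

```python
from collections import Counter
from pathlib import PurePosixPath

def rows_per_release(filepaths):
    """Count rows by the first path component, e.g. '2012_release'."""
    return Counter(PurePosixPath(p).parts[0] for p in filepaths)

# Sample values in the style of the filepath column; the last one is made up
sample_paths = [
    '2012_release/Vol_2011_4/32011R1174.tmx',
    '2012_release/Vol_2011_4/32011R1174.tmx',
    '2014_release/Vol_2013_1/example.tmx',
]

counts = rows_per_release(sample_paths)
print(counts)
```

<p>On the real data, the call would be <code>rows_per_release(ga_df.filepath)</code>. A release whose count differs from its extraction total would suggest that a yearly CSV was read more than once or that a stale file was picked up.</p>
</div>
</div>
</div>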
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Files">Other Files<a class="anchor-link" href="#Other-Files"> </a></h3><h4 id="file_list.txt">file_list.txt<a class="anchor-link" href="#file_list.txt"> </a></h4><p>Each .zip file also contains a <code>file_list.txt</code> that lists every original .tmx filename followed by the languages it is available in:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_ls</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'Volume_1/file_list.txt'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">df_ls</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'fst'</span><span class="p">]</span>
<span class="n">df_ls</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>fst</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>21970A0720(01).tmx EN:54 BG:34 CS:35 ET:44 FR...</td>
</tr>
<tr>
<th>1</th>
<td>21970A1123(01).tmx EN:631 BG:569 ET:547 FR:55...</td>
</tr>
<tr>
<th>2</th>
<td>21972A0722(03).tmx EN:4674 BG:1436 HU:1629 LT...</td>
</tr>
<tr>
<th>3</th>
<td>21972A0722(04).tmx EN:29 BG:16 FR:9 HU:28 MT:...</td>
</tr>
<tr>
<th>4</th>
<td>21972A0722(05).tmx EN:251 BG:214 FR:219</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_ls</span><span class="o">.</span><span class="n">fst</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>'21970A0720(01).tmx EN:54 BG:34 CS:35 ET:44 FR:41 HU:39 LV:41 MT:42 PL:40 SK:35 SL:41 '</pre>
</div>
</div>
</div>
</div>
</div>
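<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Each row shown above can be split into a filename and a per-language mapping, which makes it possible to find the files that contain Irish without opening every .tmx file. This is a minimal sketch that assumes the fields are whitespace-separated and that every language entry has the form <code>CODE:number</code>. The post does not say what the number measures, so it is kept as a plain integer. The function name <code>parse_file_list_line</code> is hypothetical.</p>

```python
def parse_file_list_line(line):
    """Split one file_list.txt row into (filename, {language_code: number})."""
    filename, *entries = line.split()
    langs = {}
    for entry in entries:
        code, _, number = entry.partition(':')
        langs[code] = int(number)
    return filename, langs

# The first row of Volume_1/file_list.txt as printed above
row = '21970A0720(01).tmx EN:54 BG:34 CS:35 ET:44 FR:41 HU:39 LV:41 MT:42 PL:40 SK:35 SL:41 '

filename, langs = parse_file_list_line(row)
print(filename, 'GA' in langs)
```

<p>Applied to the dataframe, <code>df_ls.fst.map(lambda r: 'GA' in parse_file_list_line(r)[1])</code> would give a boolean mask of the files that list an Irish version, under the same format assumption.</p>
</div>
</div>
</div>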
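</div>
<p>Author: Morgan McGuire</p>
<p><strong>ParaCrawl</strong> (published 2020-06-11): <a href="https://www.nlp.irish/paracrawl/translation/nmt/mt/2020/06/11/ParaCrawl">https://www.nlp.irish/paracrawl/translation/nmt/mt/2020/06/11/ParaCrawl</a></p>
<!--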
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-11-ParaCrawl.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Available-for-Download--✅">Available for Download ✅<a class="anchor-link" href="#Available-for-Download--✅"> </a></h3><p>⚠️ Always check the license of the data source before using the data ⚠️</p>
<ul>
<li>Main page: <a href="https://paracrawl.eu/">https://paracrawl.eu/</a></li>
<li>Data Browse Link: <a href="https://paracrawl.eu/v6">https://paracrawl.eu/v6</a></li>
<li>Format: <strong>.tmx</strong> or <strong>.txt</strong></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Brief-Description">Brief Description<a class="anchor-link" href="#Brief-Description"> </a></h3><p>ParaCrawl is a project co-financed by the European Union that provides open-source tools to crawl, align and clean bilingual data, together with open-source datasets of parallel text that has been crawled from the web and cleaned.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Notes">Other Notes<a class="anchor-link" href="#Other-Notes"> </a></h3><ul>
<li>Lines of text: 1,366,628</li>
<li>GA Word count: 32,824,533</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Word-Count-Distribution">Word Count Distribution<a class="anchor-link" href="#Word-Count-Distribution"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYoAAAEXCAYAAACzhgONAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAe+0lEQVR4nO3df7xVdZ3v8dc7TLPUQD0aAg7knJrIR530jHCzGstJgamw0gbsChp3qC7eRz2mblFTV6d0xube8ubjFl1KBmhSJH8kNRjxYBqd5ipyTFLxx3A0hCMMHAWVoqGgz/1jfU8tDnt/9z5nH/Y5cN7Px2M/9tqf9f1+13ctt+fD+q6111cRgZmZWTUvGewOmJnZ0OZEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFDYsSLpa0j8Mdj8aIelyST8pff6FpFcPUNuflfSttDxeUkg6aoDaPj31dcRAtGfN50RhNggknSepq5E2IuK4iHhqILYTEX8TEf+lkf6UtrlJ0p+W2t6c+rp/INq35nOisCOKCv5e98FAnTnYkcv/Q1mfSTpL0oOSdkv6rqRbJF2T1o2S9ANJ3ZJ2peWxVdq5QtL3S587JS0vfd4iqS0tv1nSOkkvpPc3l8r9s6RrJf0rsAd4taQJku5OfVwNnFxjn6ZLWi/pRUlPSpqS4qdJWiFpZ+rfX5TqLO7Z7/T5gH+9p39Zf1LSQ6nft0h6maRXAHcBp6UhmV9IOq1Cn05K235R0v3AGb3Wh6Q/TMvTJD2a9veZtN2K20nDcLdK+gdJLwKXVxma+5CkrZK2SfpEPfst6dvA6cD30/Y+1Xsoq8YxvVrScklL075skNSe+29nh54ThfWJpKOBO4DFwInAzcB7S0VeAvw98AcUfzB+BfyfKs3dDbxV0kskjQZeCpybtvNq4DjgIUknAv8I3ACcBHwF+EdJJ5XaugyYCxwPPA3cBDxAkSC+CMzO7NM5wFLgvwMjgbcBm9Lqm4Eu4DTgYuBvJJ1fra0KPgBMASYAbwAuj4hfAlOBrWlI5riI2Fqh7teA/wBGAx9Kr2puBD4cEccDZwL/VGM704Fb0/5+p0qbbwdagQuA+eXhpGoi4jJgM/DutL2/q1Cs1jF9D7As9W0F1b8/1iROFNZXk4GjgBsi4jcRcTtwf8/KiHguIm6LiD0RsRu4FviTSg2l8fXdQFsqswp4RtIfpc//EhG/Bf4M2BgR346IfRFxM/A48O5Sc4sjYkNE7KP4w/rHwOcjYm9E3AN8n+rmAIsiYnVE/DYinomIxyWNA94CfDoi/iMi1gPfokhK9bohIrZGxM7Uh7Z6KqULv+8H/kdE/DIiHgGWZKr8Bpgo6YSI2BURP62xiXsj4ntpf39Vpcxfp20/TJH8Z9bT95w6j+lPImJluqbxbeCNjW7XGuNEYX11GvBMHPg0yS09C5JeLun/Sno6DWvcA4xU9Tte7gbOo/hX/N3AP1MkiT9Jn3u2+XSvek8DYyr1IZXflf5FXS5fzTjgyQrx04CdKeFV224t/15a3kNxllSPFoqEXN6v3D68H5gGPJ2G3P5Tjfa31Fjfu8zTFMejUfUc097H7GW+jjK4nCisr7YBYySpFBtXWv4E8FpgUkScQJEAAMrly3oSxVvT8t0cnCi2UgxllZ0OPFP6XE5c24BRaYy+XL6aLfQa/y9t90RJx1fZ7i+Bl5fWvSqzjd5qPba5G9jHgce26j5ExLqImA6cAnwP6LnWU2079Tw2uve2e4atau13ru1ax9SGICcK66t7gf3AlZKOkjQdOKe0/niK6xLPp2sLV9Vo726KsfBjI6IL+BeKMf2TgAdTmZXAayRdmrb558BE4AeVGoyIp4EO4K8lHS3pLRw4TNXbjcAVks5P10vGSPqjiNgC/D/gb9NF6DdQDFP1jOmvB6ZJOlHSq4CP19jXsu3ASZJeWWUf
9gO3A1ens7SJVLnOkvbxg5JeGRG/AV6k+G9Uczs1fD5t+/XAFcAtKV5rv7cDFX/fUccxtSHIicL6JCJ+DbyP4n/u54H/TPEHe28q8r+BY4FngfuAH9Zo79+AX1AkCCLiReAp4F977ruPiOeAd1GcrTwHfAp4V0Q8m2n6UmASsJMiWS3N9OF+ij+E1wMvUCSvnjOYmcB4in8J3wFcFRGr07pvAz+juPD9I37/h7SmiHic4qLuU5Ker3TXE3AlxVDVv1PcPPD3mSYvAzal4b6PUPx3qXc71dwNdAJrgP8VET9K8Vr7/bfA59L2Plmh3dwxtSFInrjIGiVpLfCNiMj9ITOzw5TPKKzPJP2JpFelYaDZFLd9Zs8czOzw5TsJrD9eS3Gx9DiKu4Uujohtg9slMztUPPRkZmZZHnoyM7OsI27o6eSTT47x48cPdjfMzA4rDzzwwLMR0VJp3RGXKMaPH09HR8dgd8PM7LAiqeov/z30ZGZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZNX+ZnSZDX0ox3eFvgYUR8dU0e9ktFBOQbAI+EBG70hSZX6WYv3cPcHnPRO/pkdSfS01fExFLUvxsiolZjqWYzexjERHVttHwXg+Qm9Zurrru0km5mTfNzA4f9ZxR7AM+ERGvAyYD89K0jPOBNRHRSjED1vxUfirQml5zgQUApWkxJ1FMnXmVpFGpzoJUtqfelBSvtg0zM2uSmokiIrb1nBFExG7gMWAMMB1YkootAS5Ky9OBpVG4DxgpaTRwIbA6Inams4LVwJS07oSIuDeKZ54v7dVWpW2YmVmT9OkahaTxwJuAtcCpPZPVpPdTUrExwJZSta4Uy8W7KsTJbKN3v+ZK6pDU0d3d3ZddMjOzGupOFJKOA24DPh4RL+aKVohFP+J1i4iFEdEeEe0tLRWfkmtmZv1UV6KQ9FKKJPGdiLg9hbenYSPS+44U7wLGlaqPBbbWiI+tEM9tw8zMmqRmokh3Md0IPBYRXymtWgHMTsuzgTtL8VkqTAZeSMNGq4ALJI1KF7EvAFaldbslTU7bmtWrrUrbMDOzJqln4qJzgcuAhyWtT7HPAtcByyXNATYDl6R1Kyluje2kuD32CoCI2Cnpi8C6VO4LEbEzLX+U398ee1d6kdmGmZk1Sc1EERE/ofJ1BIDzK5QPYF6VthYBiyrEO4AzK8Sfq7QNMzNrHv8y28zMspwozMwsy4nCzMyy6rmYPezlnulkZnak8xmFmZllOVGYmVmWE4WZmWU5UZiZWZYThZmZZTlRmJlZlhOFmZllOVGYmVmWE4WZmWU5UZiZWZYThZmZZTlRmJlZVj1ToS6StEPSI6XYLZLWp9emnpnvJI2X9KvSum+U6pwt6WFJnZJuSNOeIulESaslbUzvo1JcqVynpIcknTXwu29mZrXUc0axGJhSDkTEn0dEW0S0AbcBt5dWP9mzLiI+UoovAOYCrenV0+Z8YE1EtAJr0meAqaWyc1N9MzNrspqJIiLuAXZWWpfOCj4A3JxrQ9Jo4ISIuDdNlboUuCitng4sSctLesWXRuE+YGRqx8zMmqjRaxRvBbZHxMZSbIKkByXdLemtKTYG6CqV6UoxgFMjYhtAej+lVGdLlToHkDRXUoekju7u7sb2yMzMDtBoopjJgWcT24DTI+JNwF8CN0k6AVCFulGj7brrRMTCiGiPiPaWlpY6um1mZvXq9wx3ko4C3gec3ROLiL3A3rT8gKQngddQnA2MLVUfC2xNy9sljY6IbWloaUeKdwHjqtQZ8qrNinfppNOb3BMzs8Y0ckbxp8DjEfG7ISVJLZJGpOVXU1yIfioNKe2WNDld15gF3JmqrQBmp+XZveKz0t1Pk4EXeoaozMyseeq5PfZm4F7gtZK6JM1Jq2Zw8EXstwEPSfoZcCvwkYjouRD+UeBbQCfwJHBXil8HvFPSRuCd6TPASuCpVP6bwH/t++6Z
mVmjag49RcTMKvHLK8Ruo7hdtlL5DuDMCvHngPMrxAOYV6t/ZmZ2aPmX2WZmluVEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZThRmZpZVz1SoiyTtkPRIKXa1pGckrU+vaaV1n5HUKekJSReW4lNSrFPS/FJ8gqS1kjZKukXS0Sl+TPrcmdaPH6idNjOz+tVzRrEYmFIhfn1EtKXXSgBJEynm0n59qvN1SSMkjQC+BkwFJgIzU1mAL6W2WoFdQM+c3HOAXRHxh8D1qZyZmTVZzUQREfcAO+tsbzqwLCL2RsTPgU7gnPTqjIinIuLXwDJguiQB7wBuTfWXABeV2lqSlm8Fzk/lzcysiRq5RnGlpIfS0NSoFBsDbCmV6UqxavGTgOcjYl+v+AFtpfUvpPIHkTRXUoekju7u7gZ2yczMeutvolgAnAG0AduAL6d4pX/xRz/iubYODkYsjIj2iGhvaWnJ9dvMzPqoX4kiIrZHxP6I+C3wTYqhJSjOCMaVio4FtmbizwIjJR3VK35AW2n9K6l/CMzMzAZIvxKFpNGlj+8Feu6IWgHMSHcsTQBagfuBdUBrusPpaIoL3isiIoAfAxen+rOBO0ttzU7LFwP/lMqbmVkTHVWrgKSbgfOAkyV1AVcB50lqoxgK2gR8GCAiNkhaDjwK7APmRcT+1M6VwCpgBLAoIjakTXwaWCbpGuBB4MYUvxH4tqROijOJGQ3vrZmZ9ZmOtH+kt7e3R0dHx4C2edPazQPW1qWTTh+wtszMBoqkByKivdI6/zLbzMyynCjMzCzLicLMzLKcKMzMLKvmXU82sKpdGPdFbjMbqnxGYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZlk1E4WkRZJ2SHqkFPufkh6X9JCkOySNTPHxkn4laX16faNU52xJD0vqlHSDJKX4iZJWS9qY3keluFK5zrSdswZ+983MrJZ6zigWA1N6xVYDZ0bEG4B/Az5TWvdkRLSl10dK8QXAXIp5tFtLbc4H1kREK7AmfQaYWio7N9U3M7Mmq5koIuIeijmry7EfRcS+9PE+YGyuDUmjgRMi4t4o5l5dClyUVk8HlqTlJb3iS6NwHzAytWNmZk00ENcoPgTcVfo8QdKDku6W9NYUGwN0lcp0pRjAqRGxDSC9n1Kqs6VKnQNImiupQ1JHd3d3Y3tjZmYHaChRSPorYB/wnRTaBpweEW8C/hK4SdIJgCpUj1rN11snIhZGRHtEtLe0tNTXeTMzq0u/Jy6SNBt4F3B+Gk4iIvYCe9PyA5KeBF5DcTZQHp4aC2xNy9sljY6IbWloaUeKdwHjqtQxM7Mm6dcZhaQpwKeB90TEnlK8RdKItPxqigvRT6Uhpd2SJqe7nWYBd6ZqK4DZaXl2r/isdPfTZOCFniEqMzNrnppnFJJuBs4DTpbUBVxFcZfTMcDqdJfrfekOp7cBX5C0D9gPfCQiei6Ef5TiDqpjKa5p9FzXuA5YLmkOsBm4JMVXAtOATmAPcEUjO2pmZv1TM1FExMwK4RurlL0NuK3Kug7gzArx54DzK8QDmFerf2Zmdmj5l9lmZpblRGFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZln9ftaTDayb1m6uGL900ulN7omZ2YF8RmFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZThRmZpblRGFmZllOFGZmllVXopC0SNIOSY+UYidKWi1pY3ofleKSdIOkTkkPSTqrVGd2Kr9R0uxS/GxJD6c6N6R5tatuw8zMmqfeM4rFwJResfnAmohoBdakzwBTgdb0mgssgOKPPsV825OAc4CrSn/4F6SyPfWm1NiGmZk1SV2JIiLuAXb2Ck8HlqTlJcBF
pfjSKNwHjJQ0GrgQWB0ROyNiF7AamJLWnRAR96Z5spf2aqvSNszMrEkauUZxakRsA0jvp6T4GGBLqVxXiuXiXRXiuW0cQNJcSR2SOrq7uxvYJTMz6+1QXMxWhVj0I163iFgYEe0R0d7S0tKXqmZmVkMjiWJ7GjYive9I8S5gXKncWGBrjfjYCvHcNszMrEkaSRQrgJ47l2YDd5bis9LdT5OBF9Kw0SrgAkmj0kXsC4BVad1uSZPT3U6zerVVaRtmZtYkdc1HIelm4DzgZEldFHcvXQcslzQH2AxckoqvBKYBncAe4AqAiNgp6YvAulTuCxHRc4H8oxR3Vh0L3JVeZLZhZmZNUleiiIiZVVadX6FsAPOqtLMIWFQh3gGcWSH+XKVtmJlZ83iGu5Jqs8yZmQ1nfoSHmZllOVGYmVmWE4WZmWU5UZiZWZYThZmZZfmupyGu2p1Yl046vck9MbPhymcUZmaW5URhZmZZHno6THlIysyaxWcUZmaW5URhZmZZThRmZpblRGFmZllOFGZmluVEYWZmWf1OFJJeK2l96fWipI9LulrSM6X4tFKdz0jqlPSEpAtL8Skp1ilpfik+QdJaSRsl3SLp6P7vqpmZ9Ue/E0VEPBERbRHRBpxNMe3pHWn19T3rImIlgKSJwAzg9cAU4OuSRkgaAXwNmApMBGamsgBfSm21AruAOf3tr5mZ9c9ADT2dDzwZEU9nykwHlkXE3oj4OcWc2uekV2dEPBURvwaWAdMlCXgHcGuqvwS4aID6a2ZmdRqoRDEDuLn0+UpJD0laJGlUio0BtpTKdKVYtfhJwPMRsa9X/CCS5krqkNTR3d3d+N6YmdnvNJwo0nWD9wDfTaEFwBlAG7AN+HJP0QrVox/xg4MRCyOiPSLaW1pa+tB7MzOrZSCe9TQV+GlEbAfoeQeQ9E3gB+ljFzCuVG8ssDUtV4o/C4yUdFQ6qyiXNzOzJhmIoaeZlIadJI0urXsv8EhaXgHMkHSMpAlAK3A/sA5oTXc4HU0xjLUiIgL4MXBxqj8buHMA+mtmZn3Q0BmFpJcD7wQ+XAr/naQ2imGiTT3rImKDpOXAo8A+YF5E7E/tXAmsAkYAiyJiQ2rr08AySdcADwI3NtJfMzPru4YSRUTsobjoXI5dlil/LXBthfhKYGWF+FMUd0WZmdkg8S+zzcwsy4nCzMyynCjMzCzLicLMzLKcKMzMLMuJwszMspwozMwsy4nCzMyynCjMzCzLicLMzLKcKMzMLMuJwszMspwozMwsy4nCzMyynCjMzCzLicLMzLKcKMzMLKuhGe4AJG0CdgP7gX0R0S7pROAWYDzFdKgfiIhdkgR8FZgG7AEuj4ifpnZmA59LzV4TEUtS/GxgMXAsxSx4H0vzaVsFN63dXDF+6aTTm9wTMztSDNQZxdsjoi0i2tPn+cCaiGgF1qTPAFOB1vSaCywASInlKmASxdSnV0kaleosSGV76k0ZoD6bmVkdDtXQ03RgSVpeAlxUii+Nwn3ASEmjgQuB1RGxMyJ2AauBKWndCRFxbzqLWFpqy8zMmmAgEkUAP5L0gKS5KXZqRGwDSO+npPgYYEupbleK5eJdFeIHkDRXUoekju7u7gHYJTMz69HwNQrg3IjYKukUYLWkxzNlVSEW/YgfGIhYCCwEaG9v9/ULM7MB1PAZRURsTe87gDsorjFsT8NGpPcdqXgXMK5UfSywtUZ8bIW4mZk1SUOJQtIrJB3fswxcADwCrABmp2KzgTvT8gpglgqTgRfS0NQq4AJJo9JF7AuAVWndbkmT0x1Ts0ptmZlZEzQ69HQqcEfxN5yjgJsi4oeS1gHLJc0BNgOXpPIrKW6N7aS4PfYKgIjYKemLwLpU7gsRsTMtf5Tf3x57V3qZmVmTNJQoIuIp4I0V4s8B51eIBzCvSluLgEUV4h3AmY3008zM+s+/zDYzsywnCjMzyxqI22PtMOBHe5hZf/mMwszMspwozMwsy4nCzMyynCjMzCzL
icLMzLKcKMzMLMuJwszMspwozMwsyz+4G+b8Qzwzq8VnFGZmluVEYWZmWU4UZmaW5URhZmZZ/U4UksZJ+rGkxyRtkPSxFL9a0jOS1qfXtFKdz0jqlPSEpAtL8Skp1ilpfik+QdJaSRsl3SLp6P7218zM+qeRM4p9wCci4nXAZGCepIlp3fUR0ZZeKwHSuhnA64EpwNcljZA0AvgaMBWYCMwstfOl1FYrsAuY00B/zcysH/p9e2xEbAO2peXdkh4DxmSqTAeWRcRe4OeSOoFz0rrONK0qkpYB01N77wAuTWWWAFcDC/rbZ6ufb5s1sx4Dco1C0njgTcDaFLpS0kOSFkkalWJjgC2lal0pVi1+EvB8ROzrFa+0/bmSOiR1dHd3D8AemZlZj4YThaTjgNuAj0fEixT/4j8DaKM44/hyT9EK1aMf8YODEQsjoj0i2ltaWvq4B2ZmltPQL7MlvZQiSXwnIm4HiIjtpfXfBH6QPnYB40rVxwJb03Kl+LPASElHpbOKcnkzM2uSRu56EnAj8FhEfKUUH10q9l7gkbS8Apgh6RhJE4BW4H5gHdCa7nA6muKC94qICODHwMWp/mzgzv7218zM+qeRM4pzgcuAhyWtT7HPUty11EYxTLQJ+DBARGyQtBx4lOKOqXkRsR9A0pXAKmAEsCgiNqT2Pg0sk3QN8CBFYrJB5IvcZsNPI3c9/YTK1xFWZupcC1xbIb6yUr10J9Q5veNmZtY8/mW2mZllOVGYmVmW56OwAVHt2kWOr2uYHR58RmFmZllOFGZmluVEYWZmWU4UZmaW5URhZmZZvuvJBo1/5W12ePAZhZmZZfmMwoYcn2mYDS1OFHbYcAIxGxxOFHbYcwIxO7R8jcLMzLKcKMzMLMtDT3bE8pCU2cBworBhp69PunViseFuyCcKSVOAr1JMk/qtiLhukLtkw4wTiw13QzpRSBoBfA14J9AFrJO0IiIeHdyemVXXn7k5BoqTlB0KQzpRUMyX3ZnmzkbSMmA64ERhVsFgJSknqCPbUE8UY4Atpc9dwKTehSTNBeamj7+Q9EQ/tnUy8Gw/6g0nPkb1GXbH6YN9rzLsjlE/NfM4/UG1FUM9UahCLA4KRCwEFja0IakjItobaeNI52NUHx+n2nyM6jNUjtNQ/x1FFzCu9HkssHWQ+mJmNiwN9USxDmiVNEHS0cAMYMUg98nMbFgZ0kNPEbFP0pXAKorbYxdFxIZDtLmGhq6GCR+j+vg41eZjVJ8hcZwUcdCQv5mZ2e8M9aEnMzMbZE4UZmaWNewThaQpkp6Q1Clp/mD3ZyiRtEnSw5LWS+pIsRMlrZa0Mb2PGux+NpukRZJ2SHqkFKt4XFS4IX2/HpJ01uD1vHmqHKOrJT2Tvk/rJU0rrftMOkZPSLpwcHrdXJLGSfqxpMckbZD0sRQfct+lYZ0oSo8ImQpMBGZKmji4vRpy3h4RbaV7uecDayKiFViTPg83i4EpvWLVjstUoDW95gILmtTHwbaYg48RwPXp+9QWESsB0v9zM4DXpzpfT/9vHun2AZ+IiNcBk4F56VgMue/SsE4UlB4REhG/BnoeEWLVTQeWpOUlwEWD2JdBERH3ADt7hasdl+nA0ijcB4yUNLo5PR08VY5RNdOBZRGxNyJ+DnRS/L95RIuIbRHx07S8G3iM4mkUQ+67NNwTRaVHhIwZpL4MRQH8SNID6TEpAKdGxDYovujAKYPWu6Gl2nHxd+xAV6Zhk0WlYcthf4wkjQfeBKxlCH6XhnuiqOsRIcPYuRFxFsUp7zxJbxvsDh2G/B37vQXAGUAbsA34cooP62Mk6TjgNuDjEfFirmiFWFOO03BPFH5ESEZEbE3vO4A7KIYDtvec7qb3HYPXwyGl2nHxdyyJiO0RsT8ifgt8k98PLw3bYyTppRRJ4jsRcXsKD7nv0nBPFH5ESBWSXiHp+J5l4ALgEYrjMzsVmw3cOTg9HHKqHZcVwKx0x8pk4IWeYYXhptd4+nspvk9Q
HKMZko6RNIHiYu39ze5fs0kScCPwWER8pbRqyH2XhvQjPA61Jj8i5HBzKnBH8V3mKOCmiPihpHXAcklzgM3AJYPYx0Eh6WbgPOBkSV3AVcB1VD4uK4FpFBdo9wBXNL3Dg6DKMTpPUhvFcMkm4MMAEbFB0nKKeWb2AfMiYv9g9LvJzgUuAx6WtD7FPssQ/C75ER5mZpY13IeezMysBicKMzPLcqIwM7MsJwozM8tyojAzsywnCjMzy3KiMGuy9LjtTw52P8zq5URhZmZZw/qX2Wb9IenzwAcpnuT5LPAA8ALFHAFHU/xy9rKI2FNHW2dQzInSQvFr27+IiMclLQZeBNqBVwGfiohbB35vzGrzGYVZH0hqB95P8Ujo91H8IQe4PSL+OCLeSDGvwJw6m1wI/LeIOBv4JPD10rrRwFuAd1E81sFsUPiMwqxv3gLcGRG/ApD0/RQ/U9I1wEjgOIrnh2Wlx0u/GfhueqYWwDGlIt9LT1p9VNKpA9R/sz5zojDrm0pzAkAx9edFEfEzSZdTPBCvlpcAz0dEW5X1e+vYrtkh56Ens775CfBuSS9LZwR/luLHA9vS/AIfrKehNEnNzyVdAsVjpyW98VB02qwRThRmfRAR6yjmBfgZcDvQQXEh+/MU01iuBh7vQ5MfBOZI+hmwAc/ZbkOQHzNu1keSjouIX0h6OXAPMDcifjrY/TI7VHyNwqzvFkqaCLwMWOIkYUc6n1GYHSKS/oqDZwD8bkRcOxj9MesvJwozM8vyxWwzM8tyojAzsywnCjMzy3KiMDOzrP8Pdi+2YylLajwAAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
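<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The code behind the plot above is not shown in the post. As a rough illustration of how such a distribution could be computed, the sketch below counts whitespace-separated tokens per sentence and groups the counts into fixed-width bins. Whitespace tokenization is an assumption and may not match how the word count reported above was obtained. The names <code>word_counts</code> and <code>binned</code> are hypothetical, and the two sample sentences are Irish strings that appear in the tables on this page.</p>

```python
from collections import Counter

def word_counts(sentences):
    """Number of whitespace-separated tokens in each sentence."""
    return [len(str(s).split()) for s in sentences]

def binned(counts, width=10):
    """Histogram as {bin_start: number_of_sentences} with fixed-width bins."""
    bins = Counter((c // width) * width for c in counts)
    return dict(sorted(bins.items()))

# Two short Irish sentences taken from the tables on this page
sample = [
    'thiocfaidh an madra go dona agus neirbhíseach.',
    'an 16 Samhain 2011',
]

counts = word_counts(sample)
print(counts, sum(counts), binned(counts, width=5))
```

<p>With the ParaCrawl dataframe loaded, <code>word_counts(df.target_sentence)</code> would give the per-sentence counts for the Irish side, and their sum would be a whitespace-based total word count.</p>
</div>
</div>
</div>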
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Extract-to-a-Pandas-DataFrame">Code to Extract to a Pandas DataFrame<a class="anchor-link" href="#Code-to-Extract-to-a-Pandas-DataFrame"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">tmx2dataframe</span> <span class="kn">import</span> <span class="n">tmx2dataframe</span>
<span class="n">metadata</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">tmx2dataframe</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s1">'processed_data/paracrawl_v6_en-ga.tmx'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'processed_data/paracrawl_v6_en-ga.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>1366628
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>en</td>
<td>Moncler Jackets Womens,Moncler a clothing sens...</td>
<td>ga</td>
<td>Moncler Dúin Jackets mBan Gheimhridh Zip Hoode...</td>
</tr>
<tr>
<th>1</th>
<td>en</td>
<td>Speaking at the public seminar, Director of Co...</td>
<td>ga</td>
<td>Ag labhairt dó ag an seimineár poiblí, dúirt S...</td>
</tr>
<tr>
<th>2</th>
<td>en</td>
<td>Home / 2018-B / sample happy new year message 34+</td>
<td>ga</td>
<td>Baile / 2018-B / sampla teachtaireacht bhliain...</td>
</tr>
<tr>
<th>3</th>
<td>en</td>
<td>No. Quartz countertops are a durable and heat-...</td>
<td>ga</td>
<td>Uimh. Is iad na countertops Grianchloch buan a...</td>
</tr>
<tr>
<th>4</th>
<td>en</td>
<td>the dog becomes naughty and nervous.</td>
<td>ga</td>
<td>thiocfaidh an madra go dona agus neirbhíseach.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>source_language</th>
<th>source_sentence</th>
<th>target_language</th>
<th>target_sentence</th>
</tr>
</thead>
<tbody>
<tr>
<th>198426</th>
<td>en</td>
<td>On 20th January, 1993, Garda Lawless was trave...</td>
<td>ga</td>
<td>Ar an 20 Eanáir 1993, bhí Garda Lawless ag tai...</td>
</tr>
<tr>
<th>741199</th>
<td>en</td>
<td>21:8 Then, after setting out the next day, we ...</td>
<td>ga</td>
<td>21:8 Ansin, tar éis a leagan amach an chéad lá...</td>
</tr>
<tr>
<th>924975</th>
<td>en</td>
<td>Known for launching terminal battery life high...</td>
<td>ga</td>
<td>Is eol do sheoladh saol ceallraí críochfoirt a...</td>
</tr>
<tr>
<th>1310412</th>
<td>en</td>
<td>Engineers completing reports on planning appli...</td>
<td>ga</td>
<td>Ba chóir d'innealtóirí a ullmhaíonn tuairiscí ...</td>
</tr>
<tr>
<th>1241786</th>
<td>en</td>
<td>Individual family members can react differentl...</td>
<td>ga</td>
<td>D’fhéadfadh scaradh den tsórt seo dul i bhfeid...</td>
</tr>
<tr>
<th>634275</th>
<td>en</td>
<td>The last thing one wants to find on oneself af...</td>
<td>ga</td>
<td>Is é an rud deireanach mian le duine a fháil a...</td>
</tr>
<tr>
<th>157572</th>
<td>en</td>
<td>The LENA Worxx tractor is suitable for childre...</td>
<td>ga</td>
<td>Tá an tarracóir LENA Worxx oiriúnach do leanaí...</td>
</tr>
<tr>
<th>465291</th>
<td>en</td>
<td>2. —This Act may be cited as the Infanticide A...</td>
<td>ga</td>
<td>2. —Féadfar an tAcht um Naíonmharú, 1949, a gh...</td>
</tr>
<tr>
<th>176118</th>
<td>en</td>
<td>The hiring of civilians with the right skills ...</td>
<td>ga</td>
<td>Earcófar sibhialtaigh a bhfuil na scileanna ce...</td>
</tr>
<tr>
<th>933352</th>
<td>en</td>
<td>(d) a demolition order made under this Part of...</td>
<td>ga</td>
<td>(d) ordú leagtha do rinneadh fén gCuid seo den...</td>
</tr>
<tr>
<th>888709</th>
<td>en</td>
<td>[GA] (2) The Garda Síochána Acts, 1923 to 1958...</td>
<td>ga</td>
<td>[EN] (2) Féadfar Achtanna n Gharda Síochána, 1...</td>
</tr>
<tr>
<th>164022</th>
<td>en</td>
<td>(b) the amount of the interest on the company'...</td>
<td>ga</td>
<td>[EN] (b) méid an úis ar bhintiúir agus iasacht...</td>
</tr>
<tr>
<th>499459</th>
<td>en</td>
<td>FÁS - Recognition of Qualifications</td>
<td>ga</td>
<td>FÁS - Aitheantas ar Cháilíochtaí</td>
</tr>
<tr>
<th>752711</th>
<td>en</td>
<td>(2) A person who fails to comply with subsecti...</td>
<td>ga</td>
<td>(2) Duine a mhainneoidh déanamh de réir fho-al...</td>
</tr>
<tr>
<th>474477</th>
<td>en</td>
<td>19. Regulations made with the concurrence of t...</td>
<td>ga</td>
<td>19. Féadfaidh rialacháin a déanfar le comhthoi...</td>
</tr>
<tr>
<th>374173</th>
<td>en</td>
<td>Wooden MDF Box For Wine Supplier and Factory -...</td>
<td>ga</td>
<td>Soláthróir Cás Taispeáin Humidifier Chigire Hu...</td>
</tr>
<tr>
<th>1159993</th>
<td>en</td>
<td>[GA] Application of section 14 (3) of the Agri...</td>
<td>ga</td>
<td>[EN] Alt 14 (3) den Acht Talmhaíochta (An Fora...</td>
</tr>
<tr>
<th>1069310</th>
<td>en</td>
<td>This entry was posted on July 19, 2009, 2:59 p...</td>
<td>ga</td>
<td>cuireadh i bpost ar an iontráil seo Iúil 19, 2...</td>
</tr>
<tr>
<th>539502</th>
<td>en</td>
<td>This helicopter will also be fittedwith a Bamb...</td>
<td>ga</td>
<td>Feisteofar an héileacaptar sin le buicéad Bamb...</td>
</tr>
<tr>
<th>775145</th>
<td>en</td>
<td>(2) Any person who is aggrieved by the decisio...</td>
<td>ga</td>
<td>(2) Nuair a bheidh éileamh á chinneadh ag na C...</td>
</tr>
<tr>
<th>1009110</th>
<td>en</td>
<td>rolex masterpiece automatic full gold diamond ...</td>
<td>ga</td>
<td>masterpiece Rolex Diamond uathoibríoch marcála...</td>
</tr>
<tr>
<th>531551</th>
<td>en</td>
<td>Sponsors: County Donegal Heritage Office, Done...</td>
<td>ga</td>
<td>Urraithe: Oifig Oidhreachta Chontae Dhún na nG...</td>
</tr>
<tr>
<th>678310</th>
<td>en</td>
<td>The IT Section has also supported other depart...</td>
<td>ga</td>
<td>Thug an Rannóg TF tacaíocht do ranna eile nuai...</td>
</tr>
<tr>
<th>879087</th>
<td>en</td>
<td>Online play: flash games castle defense, castl...</td>
<td>ga</td>
<td>Imirt ar líne: cluichí flash caisleán cosanta,...</td>
</tr>
<tr>
<th>437555</th>
<td>en</td>
<td>-Bands linked together by Threaded screws like...</td>
<td>ga</td>
<td>Bannaí nasctha le chéile ag scriúnna snáithith...</td>
</tr>
<tr>
<th>1144512</th>
<td>en</td>
<td>Shopping in the EU</td>
<td>ga</td>
<td>Siopadóireacht san AE</td>
</tr>
<tr>
<th>733110</th>
<td>en</td>
<td>24 and 25 December 2013 Homily Christmas →</td>
<td>ga</td>
<td>24 agus 25 Nollaig 2013 Aitheasc Nollag →</td>
</tr>
<tr>
<th>326519</th>
<td>en</td>
<td>Home Business and Economy Dublin Economic Monitor</td>
<td>ga</td>
<td>Baile Gnó agus Eacnamaíocht Monatóir Geilleagr...</td>
</tr>
<tr>
<th>1184048</th>
<td>en</td>
<td>First Previous (THIRD SCHEDULE. COMMITTEES TO ...</td>
<td>ga</td>
<td>An Chéad Lch. Lch. Roimhe Seo (TRIU SCEIDEAL. ...</td>
</tr>
<tr>
<th>1241711</th>
<td>en</td>
<td>Click on “On-line Hack” button below and you w...</td>
<td>ga</td>
<td>Cliceáil ar “Ar-líne Hack” cnaipe thíos agus b...</td>
</tr>
<tr>
<th>1354113</th>
<td>en</td>
<td>Now you're probably thinking That a few gaps o...</td>
<td>ga</td>
<td>Anois tá tú ag smaoineamh is dócha Nach mbeidh...</td>
</tr>
<tr>
<th>804927</th>
<td>en</td>
<td>There is no restriction on earnings or the num...</td>
<td>ga</td>
<td>Níl aon teorainn leis an tuilleamh ná leis an ...</td>
</tr>
<tr>
<th>665197</th>
<td>en</td>
<td>3. Helps prevent hair loss and graying hair.</td>
<td>ga</td>
<td>3. Cuidíonn sé le cosc a chur ar chaillteanas ...</td>
</tr>
<tr>
<th>1185713</th>
<td>en</td>
<td>FÁS - Supply Chain Logistics Administrator Tra...</td>
<td>ga</td>
<td>FÁS - Cúrsa Oiliúna do Riarthóir Lóistíochta S...</td>
</tr>
<tr>
<th>1254702</th>
<td>en</td>
<td>36. Complaints about registered retail pharmac...</td>
<td>ga</td>
<td>Gearáin faoi chógaiseoirí cláraithe. 36 .</td>
</tr>
<tr>
<th>705917</th>
<td>en</td>
<td>Date Display Weekday Month Display Calendar Di...</td>
<td>ga</td>
<td>Taispeáin Dáta I rith na seachtaine Mí Taispeá...</td>
</tr>
<tr>
<th>713111</th>
<td>en</td>
<td>(d) there shall be entered into the Register s...</td>
<td>ga</td>
<td>(d) taifeadfar sa Chlár cibé sonraí eile (más ...</td>
</tr>
<tr>
<th>578491</th>
<td>en</td>
<td>The county of Meath, except the parts thereof ...</td>
<td>ga</td>
<td>Contae na Mí, ach amháin an chuid sin de atá i...</td>
</tr>
<tr>
<th>1359946</th>
<td>en</td>
<td>The French shores are washed by the North and ...</td>
<td>ga</td>
<td>Níonn na Muirir Thuaidh agus na Meánmhara, Bá ...</td>
</tr>
<tr>
<th>281613</th>
<td>en</td>
<td>Pei is a medium size dog, attentive, careful.</td>
<td>ga</td>
<td>Is madra meánmhéide é Pei, aireach, cúramach.</td>
</tr>
<tr>
<th>760395</th>
<td>en</td>
<td>Nagoya University - Higher Education abroad in...</td>
<td>ga</td>
<td>Ollscoil Nagoya - Ard-Oideachas thar lear san ...</td>
</tr>
<tr>
<th>1043028</th>
<td>en</td>
<td>Interest on and redemption of land bonds.</td>
<td>ga</td>
<td>An t-ús a bheidh ar bhannaí talmhan agus fuasc...</td>
</tr>
<tr>
<th>1285606</th>
<td>en</td>
<td>shall not be subject to the provisions of the ...</td>
<td>ga</td>
<td>ní bheidh sé nó sí faoi réir fhorálacha an Ach...</td>
</tr>
<tr>
<th>446548</th>
<td>en</td>
<td>Clothing changes a person both externally and ...</td>
<td>ga</td>
<td>Athruithe Éadaí duine seachtrach agus go hinmh...</td>
</tr>
<tr>
<th>983378</th>
<td>en</td>
<td>Thanks to the exquisite design and high qualit...</td>
<td>ga</td>
<td>Buíochas le dearadh fíorúil agus dea-chaighdeá...</td>
</tr>
<tr>
<th>744588</th>
<td>en</td>
<td>4:9 All the things that you have learned and a...</td>
<td>ga</td>
<td>4:9 Gach na rudaí a d'fhoghlaim tú agus glacad...</td>
</tr>
<tr>
<th>384018</th>
<td>en</td>
<td>Referendum Commission Official Languages Act 2003</td>
<td>ga</td>
<td>An Coimisiún Reifrinn Acha na dTeangacha Oifug...</td>
</tr>
<tr>
<th>1118395</th>
<td>en</td>
<td>(If anyone here speaks Hmong, please contact us)</td>
<td>ga</td>
<td>(Má labhraíonn aon duine anseo Hmong, déan tea...</td>
</tr>
<tr>
<th>99410</th>
<td>en</td>
<td>Anabolic steroids helped me to achieve amazing...</td>
<td>ga</td>
<td>stéaróidigh Anabolic chabhraigh liom chun tort...</td>
</tr>
<tr>
<th>306414</th>
<td>en</td>
<td>Some people may feel that it is a good time to...</td>
<td>ga</td>
<td>D'fhéadfadh roinnt daoine a bhraitheann go bhf...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
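<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Several rows in the sample above start with a literal <code>[GA]</code> or <code>[EN]</code> marker. Below is a minimal sketch of one way to strip such markers, assuming they only ever appear at the start of a sentence. The helper name <code>strip_lang_tag</code> is hypothetical, and the example rows are shortened from the table. Whether the markers should be removed at all depends on the task.</p>

```python
import re

# Matches a leading "[EN]" or "[GA]" marker plus any whitespace after it.
LANG_TAG = re.compile(r"^\[(?:EN|GA)\]\s*")


def strip_lang_tag(sentence):
    """Remove a leading language marker; other text is left untouched."""
    return LANG_TAG.sub("", sentence)


# Shortened examples based on rows in the sample above.
rows = [
    ("[GA] Application of section 14 (3)", "[EN] Alt 14 (3) den Acht Talmhaíochta"),
    ("Shopping in the EU", "Siopadóireacht san AE"),
]
cleaned = [(strip_lang_tag(en), strip_lang_tag(ga)) for en, ga in rows]
for en, ga in cleaned:
    print(en, "|", ga)
```

<p>The same function could be applied to the <code>source_sentence</code> and <code>target_sentence</code> columns of the DataFrame.</p>
</div>
</div>
</div>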
</div>
Morgan McGuire
DCEP, Digital Corpus of the European Parliament
2020-06-11T00:00:00-05:00
https://www.nlp.irish/dcep/translation/nmt/mt/2020/06/11/DCEP-Digital-Corpus-of-the-European-Parliament
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-11-DCEP-Digital-Corpus-of-the-European-Parliament.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Available-for-Download--✅">Available for Download ✅<a class="anchor-link" href="#Available-for-Download--✅"> </a></h3><p>⚠️ Always check the license of the data source before using the data ⚠️</p>
<ul>
<li>Main page: <a href="https://ec.europa.eu/jrc/en/language-technologies/dcep">https://ec.europa.eu/jrc/en/language-technologies/dcep</a></li>
<li>Download Link: <a href="https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html">https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html</a></li>
<li>Extraction Instructions: <a href="https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-extract-README.html">https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-extract-README.html</a></li>
<li>Format: <strong>Sentence-aligned data is in plain text</strong></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Brief-Description">Brief Description<a class="anchor-link" href="#Brief-Description"> </a></h3><p>DCEP contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents produced between 2001 and 2012.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Other-Notes">Other Notes<a class="anchor-link" href="#Other-Notes"> </a></h3><ul>
<li>Lines of text: 46,146</li>
<li>GA Word count: 1,029,348</li>
</ul>
</div>
</div>
</div>
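<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The tokenisation behind the figures above is not stated. Below is a minimal sketch of how such counts could be computed, assuming a "word" is any whitespace-delimited token. The function name <code>count_lines_and_words</code> is hypothetical, and the three sample sentences stand in for the full GA column.</p>

```python
def count_lines_and_words(sentences):
    """Return (number of sentences, total whitespace-delimited words)."""
    n_lines = 0
    n_words = 0
    for sentence in sentences:
        n_lines += 1
        n_words += len(sentence.split())
    return n_lines, n_words


# Tiny stand-in for the GA side of the corpus.
ga_sample = ["RIALACHA NÓS IMEACHTA", "7ú téarma parlaiminteach", "Iúil 2009"]
lines, words = count_lines_and_words(ga_sample)
print(f"Lines of text: {lines:,}")
print(f"GA word count: {words:,}")
```

<p>A different tokeniser (for example, one that splits on punctuation) would give a different total, so treat the count as approximate.</p>
</div>
</div>
</div>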
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Word-Count-Distribution">Word Count Distribution<a class="anchor-link" href="#Word-Count-Distribution"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYAAAAEHCAYAAACncpHfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAXuklEQVR4nO3de7BdZ3nf8e8PG0O4Ssay60hyJYJCYzKDcU5tNyRMioh8CUFui1uBJ1appmpnnBSaMokpkyoFPAO9hNYzYEaNVWQGMMaBWk1JjGqgDH/YWL7ia3WwwVakWAdkm1AHJyJP/9jvgS1xLnsfHZ19jtb3M3NmrfWsd639rDVbevZ61y1VhSSpe5436gQkSaNhAZCkjrIASFJHWQAkqaMsAJLUUSePOoGZnHbaabVmzZpRpyFJS8qdd975napaMVu7RV0A1qxZw549e0adhiQtKUm+PUg7u4AkqaMsAJLUURYASeooC4AkdZQFQJI6ygIgSR1lAZCkjrIASFJHWQAkqaMW9Z3Ax+pTtz8+Zfzt55+1wJlI0uLjEYAkdZQFQJI6ygIgSR1lAZCkjrIASFJHWQAkqaMGKgBJ/nWSB5Lcn+TTSV6YZG2S25PsTfKZJKe0ti9o0+Nt/pq+9bynxR9JcuHx2SRJ0iBmLQBJVgL/Chirqp8HTgI2AR8CPlxV64CngC1tkS3AU1X1KuDDrR1Jzm7LvQa4CPhokpPmd3MkSYMatAvoZOCnkpwMvAg4ALwRuKnN3wlc2sY3tmna/PVJ0uI3VNVzVfUYMA6cd+ybIEmai1kLQFX9GfCfgMfp/cf/DHAn8HRVHW7N9gEr2/hK4Im27OHW/hX98SmW+ZEkW5PsSbJnYmJiLtskSRrAIF1Ay+n9el8L/DTwYuDiKZrW5CLTzJsufmSgantVjVXV2IoVs77UXpI0R4N0Ab0JeKyqJqrqr4HPAb8ILGtdQgCrgP1tfB+wGqDNfzlwqD8+xTKSpAU2SAF4HLggyYtaX/564EHgy8BbW5vNwM1tfFebps3/UlVVi29qVwmtBdYBX5+fzZAkDWvWp4FW1e1JbgLuAg4DdwPbgf8F3JDkAy12XVvkOuATScbp/fLf1NbzQJIb6RWPw8CVVfXDed4eSdKABnocdFVtA7YdFX6UKa7iqaofAJdNs56rgauHzFGSdBx4J7AkdZQFQJI6ygIgSR1lAZCkjrIASFJHWQAkqaMsAJLUURYASeooC4AkdZQFQJI6ygIgSR1lAZCkjrIASFJHWQAkqaMsAJLUUYO8E/jVSe7p+/teknclOTXJ7iR723B5a58k1yQZT3JfknP71rW5td+bZPP0nypJOt5mLQBV9UhVnVNV5wC/ADwLfB64Cri1qtYBt7Zp6L0wfl372wpcC5DkVHovlTmf3otktk0WDUnSwhu2C2g98M2q+jawEdjZ4juBS9v4RuD66rmN3svjzwQuBHZX1aGqegrYDVx0zFsgSZqTYQvAJuDTbfyMqjoA0Iant/hK4Im+Zfa12HRxSdIIDFwAkpwCvAX47GxNp4jVDPGjP2drkj1J9kxMTAyaniRpSMMcAVwM3FVVT7bpJ1vXDm14sMX3Aav7llsF7J8hfoSq2l5VY1U1tmLFiiHSkyQNY5gC8DZ+3P0DsAuYvJJnM3BzX/yKdjXQBcAzrYvoFmBDkuXt5O+GFpMkjcDJgzRK8iLgV4F/0Rf+IHBjki3A48BlLf4F4BJgnN4VQ+8AqKpDSd4P3NHava+qDh3zFkiS5mSgAlBVzwKvOCr2XXpXBR3dtoArp1nPDmDH8GlKkuabdwJLUkdZACSpoywAktRRFgBJ6igLgCR1lAVAkjrKAiBJHWUBkKSOsgBIUkdZACSpoywAktRRFgBJ6igLgCR1lAVAkjrKAiBJHWUBkKSOGqgAJFmW5KYkDyd5KMnfS3Jqkt1J9rbh8tY2Sa5J
Mp7kviTn9q1nc2u/N8nm6T9RknS8DXoE8F+BP62qvwO8FngIuAq4tarWAbe2aei9PH5d+9sKXAuQ5FRgG3A+cB6wbbJoSJIW3qwFIMnLgDcA1wFU1V9V1dPARmBna7YTuLSNbwSur57bgGVJzgQuBHZX1aGqegrYDVw0r1sjSRrYIEcArwQmgP+e5O4kf5jkxcAZVXUAoA1Pb+1XAk/0Lb+vxaaLHyHJ1iR7kuyZmJgYeoMkSYMZpACcDJwLXFtVrwP+Hz/u7plKpojVDPEjA1Xbq2qsqsZWrFgxQHqSpLkYpADsA/ZV1e1t+iZ6BeHJ1rVDGx7sa7+6b/lVwP4Z4pKkEZi1AFTVnwNPJHl1C60HHgR2AZNX8mwGbm7ju4Ar2tVAFwDPtC6iW4ANSZa3k78bWkySNAInD9jut4BPJjkFeBR4B73icWOSLcDjwGWt7ReAS4Bx4NnWlqo6lOT9wB2t3fuq6tC8bIUkaWgDFYCqugcYm2LW+inaFnDlNOvZAewYJkFJ0vHhncCS1FEWAEnqKAuAJHWUBUCSOsoCIEkdZQGQpI6yAEhSR1kAJKmjLACS1FEWAEnqKAuAJHWUBUCSOsoCIEkdZQGQpI6yAEhSR1kAJKmjBioASb6V5BtJ7kmyp8VOTbI7yd42XN7iSXJNkvEk9yU5t289m1v7vUk2T/d5kqTjb5gjgL9fVedU1eSbwa4Cbq2qdcCtbRrgYmBd+9sKXAu9ggFsA84HzgO2TRYNSdLCO5YuoI3Azja+E7i0L3599dwGLEtyJnAhsLuqDlXVU8Bu4KJj+HxJ0jEYtAAU8MUkdybZ2mJnVNUBgDY8vcVXAk/0LbuvxaaLHyHJ1iR7kuyZmJgYfEskSUMZ6KXwwOuran+S04HdSR6eoW2miNUM8SMDVduB7QBjY2M/MV+SND8GOgKoqv1teBD4PL0+/Cdb1w5teLA13wes7lt8FbB/hrgkaQRmLQBJXpzkpZPjwAbgfmAXMHklz2bg5ja+C7iiXQ10AfBM6yK6BdiQZHk7+buhxSRJIzBIF9AZwOeTTLb/VFX9aZI7gBuTbAEeBy5r7b8AXAKMA88C7wCoqkNJ3g/c0dq9r6oOzduWSJKGMmsBqKpHgddOEf8usH6KeAFXTrOuHcCO4dOUJM037wSWpI6yAEhSR1kAJKmjBr0PoBM+dfvjU8bffv5ZC5yJJB1/HgFIUkdZACSpoywAktRRFgBJ6igLgCR1lAVAkjrKAiBJHWUBkKSOsgBIUkd5J/Bx4B3FkpYCjwAkqaMGLgBJTkpyd5I/btNrk9yeZG+SzyQ5pcVf0KbH2/w1fet4T4s/kuTC+d4YSdLghjkCeCfwUN/0h4APV9U64ClgS4tvAZ6qqlcBH27tSHI2sAl4DXAR8NEkJx1b+pKkuRqoACRZBfwa8IdtOsAbgZtak53ApW18Y5umzV/f2m8Ebqiq56rqMXqvjDxvPjZCkjS8QY8A/gvwO8DftOlXAE9X1eE2vQ9Y2cZXAk8AtPnPtPY/ik+xzI8k2ZpkT5I9ExMTQ2yKJGkYsxaAJG8GDlbVnf3hKZrWLPNmWubHgartVTVWVWMrVqyYLT1J0hwNchno64G3JLkEeCHwMnpHBMuSnNx+5a8C9rf2+4DVwL4kJwMvBw71xSf1LyNJWmCzHgFU1XuqalVVraF3EvdLVXU58GXgra3ZZuDmNr6rTdPmf6mqqsU3tauE1gLrgK/P25ZIkoZyLDeC/S5wQ5IPAHcD17X4dcAnkozT++W/CaCqHkhyI/AgcBi4sqp+eAyfL0k6BkMVgKr6CvCVNv4oU1zFU1U/AC6bZvmrgauHTVKSNP+8E1iSOsoCIEkdZQGQpI6yAEhSR3XycdDTPa5ZkrrEIwBJ6igLgCR1lAVAkjrKAiBJHWUBkKSOsgBIUkdZACSpoywAktRRnbwRbFjT3Tj29vPPWuBMJGn+eAQgSR1lAZCkjhrkpfAvTPL1JPcmeSDJv2/xtUluT7I3
yWeSnNLiL2jT423+mr51vafFH0ly4fHaKEnS7AY5B/Ac8Maq+n6S5wNfS/InwG8DH66qG5J8DNgCXNuGT1XVq5JsAj4E/JMkZ9N7PeRrgJ8G/neSn13Kr4X0oXKSlrJBXgpfVfX9Nvn89lfAG4GbWnwncGkb39imafPXJ0mL31BVz1XVY8A4U7xSUpK0MAa6CijJScCdwKuAjwDfBJ6uqsOtyT5gZRtfCTwBUFWHkzwDvKLFb+tbbf8y/Z+1FdgKcNZZ3b7KxquPJB1PA50ErqofVtU5wCp6v9p/bqpmbZhp5k0XP/qztlfVWFWNrVixYpD0JElzMNRVQFX1NPAV4AJgWZLJI4hVwP42vg9YDdDmvxw41B+fYhlJ0gIb5CqgFUmWtfGfAt4EPAR8GXhra7YZuLmN72rTtPlfqqpq8U3tKqG1wDrg6/O1IZKk4QxyDuBMYGc7D/A84Maq+uMkDwI3JPkAcDdwXWt/HfCJJOP0fvlvAqiqB5LcCDwIHAauXMpXAEnSUjdrAaiq+4DXTRF/lCmu4qmqHwCXTbOuq4Grh09TkjTfvBNYkjrKAiBJHWUBkKSOsgBIUkf5PoAF5J29khYTjwAkqaMsAJLUURYASeooC4AkdZQFQJI6ygIgSR3lZaCLgK+WlDQKHgFIUkdZACSpoywAktRRg7wRbHWSLyd5KMkDSd7Z4qcm2Z1kbxsub/EkuSbJeJL7kpzbt67Nrf3eJJun+0xJ0vE3yBHAYeDfVNXP0XsX8JVJzgauAm6tqnXArW0a4GJ6r3tcB2wFroVewQC2AefTe5HMtsmiIUlaeLMWgKo6UFV3tfG/oPc+4JXARmBna7YTuLSNbwSur57b6L08/kzgQmB3VR2qqqeA3cBF87o1kqSBDXUOIMkaeq+HvB04o6oOQK9IAKe3ZiuBJ/oW29di08UlSSMwcAFI8hLgj4B3VdX3Zmo6RaxmiB/9OVuT7EmyZ2JiYtD0JElDGqgAJHk+vf/8P1lVn2vhJ1vXDm14sMX3Aav7Fl8F7J8hfoSq2l5VY1U1tmLFimG2RZI0hEGuAgpwHfBQVf1B36xdwOSVPJuBm/viV7SrgS4AnmldRLcAG5Isbyd/N7SYJGkEBnkUxOuB3wC+keSeFvu3wAeBG5NsAR4HLmvzvgBcAowDzwLvAKiqQ0neD9zR2r2vqg7Ny1Z0jG8WkzQfZi0AVfU1pu6/B1g/RfsCrpxmXTuAHcMkKEk6PrwTWJI6yqeBdoTdRpKO5hGAJHWUBUCSOsouoBOIL5aRNAyPACSpoywAktRRFgBJ6igLgCR1lAVAkjrKAiBJHWUBkKSO8j4ATclHR0gnPo8AJKmjLACS1FEWAEnqqEFeCbkjycEk9/fFTk2yO8neNlze4klyTZLxJPclObdvmc2t/d4km6f6LEnSwhnkCODjwEVHxa4Cbq2qdcCtbRrgYmBd+9sKXAu9ggFsA84HzgO2TRYNSdJozFoAquqrwNHv7t0I7GzjO4FL++LXV89twLIkZwIXArur6lBVPQXs5ieLiiRpAc31HMAZVXUAoA1Pb/GVwBN97fa12HTxn5Bka5I9SfZMTEzMMT1J0mzm+z6AqV4eXzPEfzJYtR3YDjA2NjZlG80f3yEgdddcjwCebF07tOHBFt8HrO5rtwrYP0NckjQicy0Au4DJK3k2Azf3xa9oVwNdADzTuohuATYkWd5O/m5oMUnSiMzaBZTk08CvAKcl2Ufvap4PAjcm2QI8DlzWmn8BuAQYB54F3gFQVYeSvB+4o7V7X1UdfWJZkrSAZi0AVfW2aWatn6JtAVdOs54dwI6hspMkHTfeCSxJHWUBkKSOsgBIUkf5PgANxfcESCcOjwAkqaMsAJLUUXYB6biyy0havCwAmhc+U0haeuwCkqSOsgBIUkdZACSpozwHoEXFk8bSwrEAaCQ8aSyNnl1AktRRHgFoSRj2iMEuI2l2FgCdkObSxWTRUNcseBdQkouSPJJkPMlVC/35
kqSeBT0CSHIS8BHgV+m9KP6OJLuq6sGFzEOaynydmJ7uSMIrnLTYLHQX0HnAeFU9CpDkBmAjYAHQCWPYQnK8C4OFR9NZ6AKwEniib3ofcH5/gyRbga1t8vtJHhli/acB3zmmDBfeUswZlmbeSyrny3uD45bz5cdjpT1Laj/3WYp5T5fz3x5k4YUuAJkiVkdMVG0Hts9p5cmeqhqby7KjshRzhqWZtzkvjKWYMyzNvI8154U+CbwPWN03vQrYv8A5SJJY+AJwB7AuydokpwCbgF0LnIMkiQXuAqqqw0l+E7gFOAnYUVUPzONHzKnraMSWYs6wNPM254WxFHOGpZn3MeWcqpq9lSTphOOzgCSpoywAktRRJ0QBWCqPl0iyOsmXkzyU5IEk72zx30/yZ0nuaX+XjDrXfkm+leQbLbc9LXZqkt1J9rbh8lHnOSnJq/v25T1JvpfkXYtxPyfZkeRgkvv7YlPu2/Rc077n9yU5dxHl/B+TPNzy+nySZS2+Jslf9u3zjy2inKf9PiR5T9vPjyS5cBQ5tzymyvszfTl/K8k9LT78vq6qJf1H72TyN4FXAqcA9wJnjzqvaXI9Ezi3jb8U+L/A2cDvA+8edX4z5P0t4LSjYv8BuKqNXwV8aNR5zvD9+HN6N8Ysuv0MvAE4F7h/tn0LXAL8Cb37aS4Abl9EOW8ATm7jH+rLeU1/u0W2n6f8PrR/k/cCLwDWtv9fTloseR81/z8D/26u+/pEOAL40eMlquqvgMnHSyw6VXWgqu5q438BPETv7uilaCOws43vBC4dYS4zWQ98s6q+PepEplJVXwUOHRWebt9uBK6vntuAZUnOXJhMf2yqnKvqi1V1uE3eRu8en0Vjmv08nY3ADVX1XFU9BozT+39mwc2Ud5IA/xj49FzXfyIUgKkeL7Ho/1NNsgZ4HXB7C/1mO3zesZi6U5oCvpjkzvaoDoAzquoA9AobcPrIspvZJo78B7KY9/Ok6fbtUvmu/zN6RyqT1ia5O8n/SfLLo0pqGlN9H5bKfv5l4Mmq2tsXG2pfnwgFYNbHSyw2SV4C/BHwrqr6HnAt8DPAOcABeod1i8nrq+pc4GLgyiRvGHVCg2g3G74F+GwLLfb9PJtF/11P8l7gMPDJFjoAnFVVrwN+G/hUkpeNKr+jTPd9WPT7uXkbR/64GXpfnwgFYEk9XiLJ8+n95//JqvocQFU9WVU/rKq/Af4bIzrcnE5V7W/Dg8Dn6eX35GT3QxseHF2G07oYuKuqnoTFv5/7TLdvF/V3Pclm4M3A5dU6pVs3ynfb+J30+tN/dnRZ/tgM34dFvZ8BkpwM/EPgM5OxuezrE6EALJnHS7Q+u+uAh6rqD/ri/f24/wC4/+hlRyXJi5O8dHKc3sm+++nt482t2Wbg5tFkOKMjfiEt5v18lOn27S7ginY10AXAM5NdRaOW5CLgd4G3VNWzffEV6b0HhCSvBNYBj44myyPN8H3YBWxK8oIka+nl/PWFzm8WbwIerqp9k4E57etRnNk+DmfKL6F3Rc03gfeOOp8Z8vwleoeS9wH3tL9LgE8A32jxXcCZo861L+dX0rsi4l7ggcn9C7wCuBXY24anjjrXo/J+EfBd4OV9sUW3n+kVqAPAX9P75bllun1Lr2viI+17/g1gbBHlPE6v33zye/2x1vYfte/NvcBdwK8vopyn/T4A7237+RHg4sX0/WjxjwP/8qi2Q+9rHwUhSR11InQBSZLmwAIgSR1lAZCkjrIASFJHWQAkqaMsAJLUURYA6Ri0Rwq/e9R5SHNhAZCkjlrQl8JLi02S3wMup3cX63eAO4FngK303i8xDvxG9T3eYIZ1/Qy9O3VXAM8C/7yqHk7yceB7wBjwt4Dfqaqb5n9rpOF4BKDOSjJG7/b519F7sNZYm/W5qvq7VfVaeu9s2DLgKrcDv1VVvwC8G/ho37wz6T0K5M3AB+chfemYeQSgLvsl4Oaq+kuAJP+zxX8+yQeAZcBLgFtmW1F7
xPcvAp/tPfMP6L1RatL/qN5TJx9McsY85S8dEwuAumyq575D70Fbl1bVvUn+KfArA6zrecDTVXXONPOfG+BzpQVlF5C67GvAryd5YfsF/2st/lLgQHt3w+WDrKh6L/Z5LMll8KMXuL/2eCQtzRcLgDqrqu6g9xjge4HPAXvonQD+PXqv6twNPDzEKi8HtiSZfHT2onw3tTTJx0Gr05K8pKq+n+RFwFeBrVV116jzkhaC5wDUdduTnA28ENjpf/7qEo8ApAG0l51fdlT4s1V19SjykeaDBUCSOsqTwJLUURYASeooC4AkdZQFQJI66v8Dgm2LuIGBaowAAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
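<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The code that produced the plot above is not shown. Below is a dependency-free sketch of the underlying computation, assuming the plot shows the number of whitespace-delimited words per sentence. The function name <code>word_count_histogram</code>, the bin width and the sample sentences are all illustrative.</p>

```python
from collections import Counter

BIN_WIDTH = 10  # illustrative bin size, not taken from the original plot


def word_count_histogram(sentences, bin_width=BIN_WIDTH):
    """Count sentences per word-count bin; keys are each bin's lower edge."""
    bins = Counter()
    for sentence in sentences:
        n_words = len(sentence.split())
        bins[(n_words // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Stand-in sentences with 2, 3, 12 and 25 words.
sample = ["Iúil 2009", "7ú téarma parlaiminteach", "a " * 12, "b " * 25]
for lower, count in word_count_histogram(sample).items():
    print(f"{lower}-{lower + BIN_WIDTH - 1} words: {count}")
```

<p>The resulting dictionary can be passed to any plotting library to draw the bars.</p>
</div>
</div>
</div>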
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Code-to-Extract-Files-to-Pandas-DataFrame">Code to Extract Files to Pandas DataFrame<a class="anchor-link" href="#Code-to-Extract-Files-to-Pandas-DataFrame"> </a></h3><p>GA-EN specific instructions are below; for more information, see the official <a href="https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-extract-README.html">extraction instructions page</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<ol>
<li>Download and extract language files</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-GA-pub.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-EN-pub.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>tar jxf DCEP-sentence-GA-pub.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>tar jxf DCEP-sentence-EN-pub.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<ol start="2">
<li>Download and extract language pair info</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -q http://optima.jrc.it/Resources/DCEP-2013/langpairs/DCEP-EN-GA.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>tar jxf DCEP-EN-GA.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<ol start="3">
<li>Download and extract alignment scripts</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -q http://optima.jrc.it/Resources/DCEP-2013/DCEP-extract-scripts.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>tar jxvf DCEP-extract-scripts.tar.bz2
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<ol start="4">
<li>Create aligned file</li>
</ol>
<blockquote><p>The <code>--numbering-filter</code> is a crude but useful heuristic that attempts to drop numberings
and short titles from the output. It works simply by matching sentences on both sides
against a Unicode regex that looks for two alphabetic characters with space between them.</p>
<p>The <code>--length-filter-level=LENGTH_FILTER_LEVEL</code> argument is used to throw away as suspicious
all bisentences where the ratio of the shorter and the longer sentence (in character length)
is less than LENGTH_FILTER_LEVEL percent.</p>
</blockquote>
</div>
</div>
</div>
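<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the length filter described above concrete, here is a minimal sketch of the rule as quoted: a pair is dropped when the shorter sentence is less than <code>LENGTH_FILTER_LEVEL</code> percent of the longer one, measured in characters. This is not the DCEP script's own code. The function name <code>passes_length_filter</code> is hypothetical, lengths are counted with Python's <code>len</code>, and the treatment of empty pairs is an assumption.</p>

```python
def passes_length_filter(source, target, level=40):
    """Keep a pair unless shorter/longer character length is below `level` percent."""
    shorter, longer = sorted((len(source), len(target)))
    if longer == 0:
        return False  # assumption: a pair of empty strings is never useful
    return 100 * shorter / longer >= level


pairs = [
    ("July 2009", "Iúil 2009"),                        # 9 vs 9 characters: kept
    ("RULES OF PROCEDURE", "RIALACHA NÓS IMEACHTA"),   # 18 vs 21 characters: kept
    ("Yes", "Tá léirmhínithe ar na Rialacha"),         # 3 vs 30 characters: dropped
]
kept = [pair for pair in pairs if passes_length_filter(*pair)]
print(len(kept), "of", len(pairs), "pairs kept")
```

<p>A higher level is stricter: at 40, the shorter sentence must be at least 40% of the length of the longer one.</p>
</div>
</div>
</div>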
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">cd</span> dcep <span class="o">&&</span> ./src/languagepair.py --numbering-filter --length-filter-level<span class="o">=</span><span class="m">40</span> EN-GA > EN-GA-bisentences.txt
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<ol start="5">
<li>Open as a DataFrame</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'dcep/EN-GA-bisentences.txt'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'en'</span><span class="p">,</span> <span class="s1">'ga'</span><span class="p">]</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'dcep_en-ga_bisentences.csv'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>46147
</pre>
</div>
</div>
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>en</th>
<th>ga</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>RULES OF PROCEDURE</td>
<td>RIALACHA NÓS IMEACHTA</td>
</tr>
<tr>
<th>1</th>
<td>7th parliamentary term</td>
<td>7ú téarma parlaiminteach</td>
</tr>
<tr>
<th>2</th>
<td>July 2009</td>
<td>Iúil 2009</td>
</tr>
<tr>
<th>3</th>
<td>Interpretations of the Rules (pursuant to Rule...</td>
<td>Tá léirmhínithe ar na Rialacha (de bhun Riail ...</td>
</tr>
<tr>
<th>4</th>
<td>MEMBERS, PARLIAMENT BODIES AND POLITICAL GROUPS</td>
<td>FEISIRÍ, COMHLACHTAÍ PARLAIMINTE AGUS GRÚPAÍ P...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
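<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One caveat with the loading step above: by default <code>pd.read_csv</code> treats <code>"</code> as a quote character, so a field that begins with an unmatched double quote may swallow the tab or newline that follows it. Whether this affects the DCEP file has not been checked here. Below is a standard-library sketch that reads tab-separated pairs with quoting disabled. The function name <code>read_bisentences</code> and the in-memory sample are illustrative.</p>

```python
import csv
import io


def read_bisentences(handle):
    """Read tab-separated EN/GA pairs, treating quote characters as plain text."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows without exactly two fields are skipped rather than guessed at.
    return [tuple(row) for row in reader if len(row) == 2]


# Stand-in for open('dcep/EN-GA-bisentences.txt', encoding='utf-8', newline='').
sample = io.StringIO('July 2009\tIúil 2009\n"Yes, he said\t"Sea, a dúirt sé\n')
pairs = read_bisentences(sample)
print(len(pairs))
```

<p>The pandas equivalent is to pass <code>quoting=csv.QUOTE_NONE</code> to <code>pd.read_csv</code>.</p>
</div>
</div>
</div>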
</div>
Morgan McGuire