Scottish Gaelic Text-to-Speech

I'm currently working on a speech synthesizer for Scottish Gaelic, based on the Festival/Festvox system. When it's done it will read Scottish Gaelic text. Right now it only says a few words:
gainne thraigh
thu thubhairt
thuirt tim
tighinn tilg
tilleadh timcheall
Try it out here

Constructing a corpus from a Wikipedia dump

The HMM-based approach needs a list of phonetically balanced utterances, and to select those you first need a corpus. The text of the Scottish Gaelic Wikipedia can be downloaded and turned into one.
  1. Download the Wikipedia dump from here.
  2. Run the following command to decompress the dump into plain XML:
    $ bzcat gdwiki-latest-pages-articles.xml.bz2 > gdwiki.xml
  3. Then use this Python script (from here) to extract the text:
    $ python gdwiki.xml
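  4. Then clean this up with a chain of grep filters, which drops lines containing markup or punctuation, keeps only long lines, and removes adjacent duplicates: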
    $ cat gdwiki.txt | grep -v "''" | grep -v http | grep -v ';' | grep -v ':' | grep -v '(' | grep -v ')' | grep -v No | grep -vi 'pg.' | grep -v '-'| grep -v '+' | grep -v = | grep -e '................................................' | uniq > gdcorpus.txt
  5. After cleaning up some stray English sentences, you get something like this:
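The grep chain in step 4 can also be written as a small Python filter, which is easier to tweak than a long pipeline. This is only a sketch of the same idea, not the script used above: the function names are made up for illustration, and the default minimum length of 48 is my reading of the row of dots in the grep pattern, so treat it as an assumption and adjust it.

```python
import re
from itertools import groupby

# Substrings that disqualify a line (case-sensitive, mirroring the grep -v calls).
BANNED = ("''", "http", ";", ":", "(", ")", "No", "-", "+", "=")

# Equivalent of grep -vi 'pg.': "pg" followed by any character, ignoring case.
PAGE_REF = re.compile(r"pg.", re.IGNORECASE)


def keep_line(line, min_length=48):
    """Return True if a line survives every filter.

    min_length is an assumed value standing in for the run of dots
    in the original grep pattern.
    """
    if len(line) < min_length:
        return False
    if any(token in line for token in BANNED):
        return False
    return PAGE_REF.search(line) is None


def clean_lines(lines, min_length=48):
    """Filter lines, then collapse adjacent duplicates the way uniq does."""
    stripped = (ln.rstrip("\n") for ln in lines)
    kept = (ln for ln in stripped if keep_line(ln, min_length))
    return [key for key, _ in groupby(kept)]


def write_corpus(src_path, dst_path, min_length=48):
    """Read the extracted text and write the cleaned corpus."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = clean_lines(src, min_length)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in cleaned:
            dst.write(line + "\n")
```

Calling `write_corpus("gdwiki.txt", "gdcorpus.txt")` should then play the role of the pipeline in step 4. Like `uniq`, it only removes duplicates that sit next to each other, so the original line order is kept.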

Jeff Berry 2008-11-2