I'm currently working on a speech synthesizer for Scottish Gaelic based on the Festival/Festvox system. When it's finished it will read arbitrary Scottish Gaelic text; right now it only says a few words:
Try it out here
- Here is a presentation I put together for the Student Showcase on March 7, 2008. Audio is available here, and on iTunes U under University of Arizona > Social & Behavioral Sciences > Linguistics Lectures.
- Here is a presentation from the Arizona Linguistics Circle Conference, held on November 1, 2008. This presentation focuses more on the HMM-based approach, which I have found to be easier than the diphone approach.
Constructing a corpus from a Wikipedia dump
For the HMM-based approach, you need a list of phonetically balanced utterances, which in turn requires a corpus. The text of Wikipedia can be downloaded and turned into one.
- Download the Wikipedia dump from here.
- Run the following command to extract the XML:
$ bzcat gdwiki-latest-pages-articles.xml.bz2 > gdwiki.xml
- Then use this Python script (from here) to extract the text:
$ python nowiki-xml2txt.py gdwiki.xml
- Then clean it up with a chain of grep filters, which drops lines containing wiki markup, URLs, stray punctuation, and anything shorter than the dot pattern, and removes adjacent duplicates with uniq:
$ cat gdwiki.txt | grep -v "''" | grep -v http | grep -v ';' | grep -v ':' \
    | grep -v '(' | grep -v ')' | grep -v No | grep -vi 'pg.' | grep -v '-' \
    | grep -v '+' | grep -v = | grep -e '................................................' \
    | uniq > gdcorpus.txt
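The same cleanup can be sketched in Python if you want something easier to tweak than a long pipeline. This is a hypothetical equivalent of the grep chain above; the 48-character minimum (my count of the dots in the length pattern) and the exact combined regexes are my assumptions, not part of the original workflow:

```python
import re

# Patterns that disqualify a line, mirroring each `grep -v` above:
# wiki markup (''), URLs, punctuation, 'No', '-', '+', '='
BAD = re.compile(r"''|http|[;:()=+\-]|No")
# Mirrors `grep -vi 'pg.'` (case-insensitive; '.' matches any character)
BAD_CI = re.compile(r"pg.", re.IGNORECASE)
MIN_LEN = 48  # assumed length of the dot pattern in the grep chain

def clean(lines):
    """Yield corpus-worthy lines, dropping adjacent duplicates (like uniq)."""
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        if BAD.search(line) or BAD_CI.search(line):
            continue  # line contains a disqualifying pattern
        if len(line) < MIN_LEN:
            continue  # too short to be a useful utterance
        if line != prev:
            yield line
        prev = line
```

Note that, like `uniq`, this only removes duplicates that are adjacent in the file; sort first if you need global deduplication.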
- After cleaning up some stray English sentences, you get something like this: