Identifying accents in Italian text: a preprocessing step in TTS
11 September 2002
Diacritic marks are often missing in informal communications such as emails, and even in well-formatted corpora they are not consistently present. A text-to-speech (TTS) synthesis system therefore needs a preprocessor that restores the missing accents in order to produce the correct pronunciation of each word. We present an algorithm for automatically identifying accents in Italian text. We treat accent identification as a classification problem and use supervised learning to automatically induce classification rules for disambiguating accents. The overall accuracy is 99.6 % when tested on over 2000 ambiguous words in a 420 MB corpus. For the most ambiguous words, the program achieves 91.4 % accuracy, compared with a 71.3 % baseline. This accent identification system can serve as a preprocessor for a TTS system, invoked only when the input text contains accent-ambiguous words.
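To make the classification framing concrete, the following is a minimal Python sketch, not the system evaluated above. It assumes that rules are induced from correctly accented training text, that the only features are the immediately preceding and following words, and that the fallback is the most frequent form of each ambiguous word. The abstract does not specify the rule form, the feature set, or how the baseline is defined, so these choices and all names (strip_accents, context_features, train, restore) are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_accents(word):
    """Remove combining diacritics, e.g. 'è' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def context_features(tokens, i):
    """Assumed feature set: the unaccented previous and next token."""
    prev_tok = strip_accents(tokens[i - 1]) if i > 0 else "<s>"
    next_tok = strip_accents(tokens[i + 1]) if i + 1 < len(tokens) else "</s>"
    return [("prev", prev_tok), ("next", next_tok)]


def train(sentences):
    """Count accented forms per unaccented key, overall and per context feature."""
    form_counts = defaultdict(Counter)  # key -> counts of surface forms
    rule_counts = defaultdict(Counter)  # (key, feature) -> counts of surface forms
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            key = strip_accents(tok)
            form_counts[key][tok] += 1
            for feat in context_features(tokens, i):
                rule_counts[(key, feat)][tok] += 1
    return form_counts, rule_counts


def restore(tokens, form_counts, rule_counts):
    """Restore accents, consulting the rules only for accent-ambiguous words."""
    out = []
    for i, tok in enumerate(tokens):
        key = strip_accents(tok)
        forms = form_counts.get(key)
        if not forms or len(forms) < 2:
            out.append(tok)  # unseen or unambiguous: leave untouched
            continue
        # Fallback: most frequent form of this word in the training data.
        best_form = forms.most_common(1)[0][0]
        best_rank = (0.0, 0)
        # Apply the single most reliable matching rule (decision-list style).
        for feat in context_features(tokens, i):
            counts = rule_counts.get((key, feat))
            if not counts:
                continue
            form, n = counts.most_common(1)[0]
            rank = (n / sum(counts.values()), n)
            if rank > best_rank:
                best_form, best_rank = form, rank
        out.append(best_form)
    return out


# Toy training data (invented for illustration, not from the paper's corpus).
training = [
    "la casa è bella".split(),
    "il cane e il gatto".split(),
    "pane e vino".split(),
    "la pizza è buona".split(),
    "sale e pepe".split(),
]
form_counts, rule_counts = train(training)
print(" ".join(restore("la casa e grande".split(), form_counts, rule_counts)))
```

In this sketch only the pair e/è is ambiguous in the toy data, so every other token passes through unchanged, mirroring the idea that the preprocessor is invoked only for accent-ambiguous words. A realistic system would need far more training text and richer context features than the two used here.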