Persian Machine Transliteration
Between Farsi and Tajiki
Although they speak mutually intelligible varieties of the same language, speakers of Tajik Persian (a.k.a. Tajiki), which is written in a modified Cyrillic alphabet, cannot read Iranian and Afghan texts written in the Perso-Arabic script. Because the two dialects are overwhelmingly similar, machine transliteration may be more appropriate than translation. Previous work built a model using statistical methods but lacked the parallel corpora needed to evaluate it accurately. We aim to demonstrate that transliteration is an effective way to “translate” between the two dialects, using a neural approach to grapheme-to-phoneme conversion. Our current work focuses on one direction: from the Perso-Arabic script to Cyrillic.
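To illustrate why the Perso-Arabic-to-Cyrillic direction calls for a learned model rather than a lookup table, here is a minimal sketch. It is not the project's model. The four-letter table `NAIVE_MAP` and the function `naive_transliterate` are hypothetical and exist only for this example. Perso-Arabic spelling usually leaves short vowels unwritten, while Tajik Cyrillic writes every vowel, so a one-to-one letter mapping drops them.

```python
# Hypothetical one-to-one letter table, for illustration only (far from complete).
# Keys are Perso-Arabic letters given as escapes to avoid bidirectional-text issues.
NAIVE_MAP = {
    "\u06a9": "к",  # kaf
    "\u062a": "т",  # te
    "\u0627": "о",  # alef (long vowel, written "о" in Tajik)
    "\u0628": "б",  # be
}


def naive_transliterate(word: str) -> str:
    """Map each letter independently; unknown characters pass through unchanged."""
    return "".join(NAIVE_MAP.get(ch, ch) for ch in word)


perso_arabic = "\u06a9\u062a\u0627\u0628"  # the word for "book"
expected_tajik = "китоб"                   # Tajik spelling of the same word

naive = naive_transliterate(perso_arabic)
print(naive)                    # the short vowel "и" is missing
print(naive == expected_tajik)  # False
```

The missing vowel has to be inferred from the word as a whole, which is the kind of decision a grapheme-to-phoneme model is trained to make.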
This project has been presented at the LSA Annual Meeting in Denver, Colorado (presentation available here) and as a talk with Q&A at the 3rd North American Conference in Iranian Linguistics in Los Angeles, California, both held this year.