DeASCIIfication approach to handle diacritics in Turkish information retrieval
Abstract
The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity; mishandling diacritics therefore reduces retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words whose meanings differ completely from those of the original words. These invalid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect valid Turkish words as input. By contrast, synthetic words are not a problem when no stemmer, or only a simple first-n-characters stemmer, is used in the text analysis pipeline. This difference motivates the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation showed that the diacritics restoration approach yielded more effective and robust results than normalizing tokens to remove diacritics.
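For illustration, the ASCIIfication step described in the abstract can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the mapping table and the asciify name are assumptions, and a production pipeline would likely also handle circumflexed vowels such as â.

```python
# Minimal sketch of ASCIIfication: map Turkish diacritic letters to their
# ASCII counterparts. Mapping and function names are illustrative only.
ASCII_MAP = str.maketrans({
    "ç": "c", "Ç": "C",
    "ğ": "g", "Ğ": "G",
    "ı": "i", "İ": "I",  # dotless lowercase i and dotted uppercase I
    "ö": "o", "Ö": "O",
    "ş": "s", "Ş": "S",
    "ü": "u", "Ü": "U",
})

def asciify(text: str) -> str:
    """Replace Turkish diacritic characters with their ASCII counterparts."""
    return text.translate(ASCII_MAP)

# Homographic ambiguity: two distinct words collapse to one ASCII form.
print(asciify("öldü"))  # -> "oldu": "öldü" (he/she died) collapses to "oldu" (it happened)
```

The final line shows the ambiguity the abstract refers to: after ASCIIfication, the surface form "oldu" can no longer be distinguished from the original "öldü", and deASCIIfication must recover the intended diacritics from context or a lexicon.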
Source
Information Processing & Management
Volume: 52
Issue: 2
Collections
- Article Collection [100]
- Scopus Indexed Publications Collection [8325]
- WoS Indexed Publications Collection [7605]