DeASCIIfication approach to handle diacritics in Turkish information retrieval

Arslan, Ahmet

Access

info:eu-repo/semantics/closedAccess

Date

2016

Abstract

The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words whose meanings are completely different from those of the original words. These invalid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters stemmer is used in the text analysis pipeline. This difference highlights the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results than normalizing tokens to remove diacritics.
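The ASCIIfication step described in the abstract can be illustrated with a minimal Python sketch. This is an illustration only, not the paper's implementation: the function names `asciify` and `first_n_stem`, the prefix length of 5, and the sample words are assumptions chosen for the example. It shows how replacing Turkish letters with ASCII counterparts can yield a different legitimate word ("öldü", "died", becomes "oldu", "happened") or collapse two distinct words into one synthetic token ("açı", "angle", and "acı", "pain", both become "aci"), and why a first-n-characters stemmer is indifferent to whether its input is a valid word.

```python
# Turkish-specific letters and their ASCII counterparts (illustrative mapping).
ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(token: str) -> str:
    """Replace Turkish diacritic letters with their ASCII counterparts."""
    return token.translate(ASCII_MAP)


def first_n_stem(token: str, n: int = 5) -> str:
    """Truncate a token to its first n characters (n=5 is an assumed value)."""
    return token[:n]


if __name__ == "__main__":
    # A legitimate word turns into a different legitimate word.
    print(asciify("öldü"))
    # Two distinct words collapse into one synthetic, non-Turkish token.
    print(asciify("açı"), asciify("acı"))
    # Truncation works on any string, valid word or not.
    print(first_n_stem(asciify("kitaplarımız")))
```

A dictionary-based stemmer or lemmatizer would have no entry for a synthetic token such as "aci", whereas truncation needs no lexicon; this is the contrast the abstract draws when it speaks of the diacritic sensitivity of stemmers.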

Source

Information Processing & Management

URI

https://dx.doi.org/10.1016/j.ipm.2015.08.004
https://hdl.handle.net/11421/19829
