DeASCIIfication approach to handle diacritics in Turkish information retrieval

Arslan, Ahmet

dc.contributor.author	Arslan, Ahmet
dc.date.accessioned	2019-10-21T19:44:12Z
dc.date.available	2019-10-21T19:44:12Z
dc.date.issued	2016
dc.identifier.issn	0306-4573
dc.identifier.issn	1873-5371
dc.identifier.uri	https://dx.doi.org/10.1016/j.ipm.2015.08.004
dc.identifier.uri	https://hdl.handle.net/11421/19829
dc.description	WOS: 000371939800009	en_US
dc.description.abstract	The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces the retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These non-valid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters-stemmer is used in the text analysis pipeline. This difference emphasizes the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results compared with normalizing tokens to remove diacritics.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier Sci LTD	en_US
dc.relation.isversionof	10.1016/j.ipm.2015.08.004	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	Accents	en_US
dc.subject	Deasciifier	en_US
dc.subject	Diacritics Restoration	en_US
dc.subject	Risk-Sensitive Evaluation	en_US
dc.subject	Stemming	en_US
dc.subject	Turkish Information Retrieval	en_US
dc.title	DeASCIIfication approach to handle diacritics in Turkish information retrieval	en_US
dc.type	article	en_US
dc.relation.journal	Information Processing & Management	en_US
dc.contributor.department	Anadolu Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü	en_US
dc.identifier.volume	52	en_US
dc.identifier.issue	2	en_US
dc.identifier.startpage	326	en_US
dc.identifier.endpage	339	en_US
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
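
The ASCIIfication step described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function name `asciify` and the character table are assumptions made for illustration, and the example words were chosen here to show the two failure modes the abstract names (a legitimate word with a different meaning, and a synthetic non-word).

```python
# Illustrative sketch only; `asciify` and ASCII_MAP are hypothetical names,
# not taken from the article.

# Turkish letters with diacritics (plus dotless ı / dotted İ) and their
# ASCII counterparts, position by position.
ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(token: str) -> str:
    """Replace Turkish diacritic characters with their ASCII counterparts."""
    return token.translate(ASCII_MAP)


if __name__ == "__main__":
    # "öldü" (died) collapses onto "oldu" (became): a legitimate word
    # with a completely different meaning.
    print(asciify("öldü"))
    # "ağrı" (pain) becomes "agri": a synthetic token that is not a
    # valid Turkish word, so a morphological stemmer cannot analyze it.
    print(asciify("ağrı"))
    # "açı" (angle) and "acı" (bitter) become indistinguishable.
    print(asciify("açı") == asciify("acı"))
```

The mapping is many-to-one, which is why the reverse direction (deASCIIfication) cannot be done with a character table alone and needs context to choose among candidate restorations.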

Files in this item:

Name: 19829.pdf
Size: 413.1Kb
Format: PDF
Description: Full Text

This item appears in the following collection(s).

Article Collection [100]
Scopus Indexed Publications Collection [8325]
WoS Indexed Publications Collection [7605]
