dc.contributor.author: Arslan, Ahmet
dc.date.accessioned: 2019-10-21T19:44:12Z
dc.date.available: 2019-10-21T19:44:12Z
dc.date.issued: 2016
dc.identifier.issn: 0306-4573
dc.identifier.issn: 1873-5371
dc.identifier.uri: https://dx.doi.org/10.1016/j.ipm.2015.08.004
dc.identifier.uri: https://hdl.handle.net/11421/19829
dc.description: WOS: 000371939800009 [en_US]
dc.description.abstract: The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces the retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These non-valid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters-stemmer is used in the text analysis pipeline. This difference emphasizes the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results compared with normalizing tokens to remove diacritics. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Elsevier Sci LTD [en_US]
dc.relation.isversionof: 10.1016/j.ipm.2015.08.004 [en_US]
dc.rights: info:eu-repo/semantics/closedAccess [en_US]
dc.subject: Accents [en_US]
dc.subject: Deasciifier [en_US]
dc.subject: Diacritics Restoration [en_US]
dc.subject: Risk-Sensitive Evaluation [en_US]
dc.subject: Stemming [en_US]
dc.subject: Turkish Information Retrieval [en_US]
dc.title: DeASCIIfication approach to handle diacritics in Turkish information retrieval [en_US]
dc.type: article [en_US]
dc.relation.journal: Information Processing & Management [en_US]
dc.contributor.department: Anadolu Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü [en_US]
dc.identifier.volume: 52 [en_US]
dc.identifier.issue: 2 [en_US]
dc.identifier.startpage: 326 [en_US]
dc.identifier.endpage: 339 [en_US]
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member [en_US]
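
The ASCIIfication step described in the abstract can be illustrated with a minimal Python sketch. This is not the article's implementation: the character mapping, the function names `asciify` and `first_n_stem`, the prefix length of 5, and the example words are all assumptions chosen for illustration. The sketch shows only the normalization direction (removing diacritics) and a simple first-n-characters stemmer. Restoring diacritics (deASCIIfication) requires context-dependent disambiguation and is not attempted here.

```python
# Illustrative sketch only: the mapping and helper names below are
# assumptions for this example, not taken from the article.
_TURKISH_TO_ASCII = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
})


def asciify(token: str) -> str:
    """Replace Turkish diacritic letters with their ASCII counterparts."""
    return token.translate(_TURKISH_TO_ASCII)


def first_n_stem(token: str, n: int = 5) -> str:
    """First-n-characters stemmer: keep at most the first n characters.

    The default n=5 is an arbitrary choice for this sketch.
    """
    return token[:n]
```

For example, `asciify("öldü")` returns `"oldu"`, which is itself a legitimate Turkish word with a different meaning, while `asciify("ışık")` returns `"isik"`, a synthetic form. A prefix stemmer such as `first_n_stem` operates on either form without needing the input to be a valid word, which is the contrast with morphological stemmers that the abstract draws.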

