dc.contributor.author: Pak, Muhammet Yasin
dc.contributor.author: Günal, Serkan
dc.date.accessioned: 2019-10-21T20:11:00Z
dc.date.available: 2019-10-21T20:11:00Z
dc.date.issued: 2017
dc.identifier.issn: 1302-3160
dc.identifier.uri: http://www.trdizin.gov.tr/publication/paper/detail/TWpRMk9ESTNOdz09
dc.identifier.uri: https://hdl.handle.net/11421/20031
dc.description.abstract: Author identification, one of the popular topics in text classification and natural language processing, basically aims to determine the author of a given text through various analyses. In the literature, different text representation approaches and the use of preprocessing steps are considered for the author identification problem. This paper aims to comprehensively examine the impact of text representation and preprocessing steps on author identification specifically for the Turkish language. For this purpose, the contributions of all possible combinations of different text representation approaches, namely unigram and bigram, together with the preprocessing tasks, including stemming and stop-word removal, to the performance of author identification are investigated. For the experimental evaluation, a brand new dataset is constructed. Also, two different classification algorithms, namely Multinomial Naive Bayes and Sequential Minimal Optimization, are employed. The results of the experimental analysis reveal that using bigram features alone should be avoided. Besides, it is shown that stop-words should be kept inside the text, while stemming can be preferred depending on the classification algorithm, so that higher performance can be achieved for author identification. [en_US]
dc.language.iso: eng [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Common Disciplines (Ortak Disiplinler) [en_US]
dc.title: The Impact of Text Representation and Preprocessing on Author Identification [en_US]
dc.type: article [en_US]
dc.relation.journal: Anadolu Üniversitesi Bilim ve Teknoloji Dergisi: A-Uygulamalı Bilimler ve Mühendislik [en_US]
dc.contributor.department: Anadolu Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü [en_US]
dc.identifier.volume: 18 [en_US]
dc.identifier.issue: 1 [en_US]
dc.identifier.startpage: 218 [en_US]
dc.identifier.endpage: 224 [en_US]
dc.relation.publicationcategory: Article - National Peer-Reviewed Journal - Institutional Faculty Member [en_US]
dc.contributor.institutionauthor: Günal, Serkan

