The Impact of Text Representation and Preprocessing on Author Identification

Pak, Muhammet Yasin; Günal, Serkan

dc.contributor.author	Pak, Muhammet Yasin
dc.contributor.author	Günal, Serkan
dc.date.accessioned	2019-10-21T20:11:00Z
dc.date.available	2019-10-21T20:11:00Z
dc.date.issued	2017
dc.identifier.issn	1302-3160
dc.identifier.uri	http://www.trdizin.gov.tr/publication/paper/detail/TWpRMk9ESTNOdz09
dc.identifier.uri	https://hdl.handle.net/11421/20031
dc.description.abstract	Author identification, a popular topic in text classification and natural language processing, aims to determine the author of a given text through various analyses. In the literature, different text representation approaches and preprocessing steps have been considered for the author identification problem. This paper comprehensively examines the impact of text representation and preprocessing steps on author identification, specifically for the Turkish language. To this end, it investigates how all possible combinations of text representation approaches (unigram and bigram) and preprocessing tasks (stemming and stop-word removal) contribute to author identification performance. A new dataset was constructed for the experimental evaluation, and two classification algorithms, Multinomial Naive Bayes and Sequential Minimal Optimization, were employed. The experimental results show that bigram features should not be used alone. They also show that stop-words should be kept in the text, while stemming may be preferable depending on the classification algorithm, in order to achieve higher author identification performance.	en_US
dc.language.iso	eng	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Common Disciplines	en_US
dc.title	The Impact of Text Representation and Preprocessing on Author Identification	en_US
dc.type	article	en_US
dc.relation.journal	Anadolu Üniversitesi Bilim ve Teknoloji Dergisi :A-Uygulamalı Bilimler ve Mühendislik	en_US
dc.contributor.department	Anadolu University, Faculty of Engineering, Department of Computer Engineering	en_US
dc.identifier.volume	18	en_US
dc.identifier.issue	1	en_US
dc.identifier.startpage	218	en_US
dc.identifier.endpage	224	en_US
dc.relation.publicationcategory	Article - National Peer-Reviewed Journal - Institutional Faculty Member	en_US
dc.contributor.institutionauthor	Günal, Serkan

Files in this item:

Name: 20031.pdf
Size: 597.4 KB
Format: PDF
Description: Full Text


This item appears in the following collection(s):

Article Collection [100]
TR-Dizin Indexed Publications Collection [3512]
