Efficiency of SVM classifier with Word2Vec and Doc2Vec models
Open Access | Feb 2020

Abstract

The Support Vector Machine has been one of the most intensively used text classifiers since its development. However, its performance depends not only on its features but also on data preprocessing and model tuning. The main purpose of this paper is to compare the efficiency of several Support Vector Machine models using both the TF-IDF approach and the Word2Vec and Doc2Vec neural networks for text data representation. Besides the data vectorization process, I try to enhance the models’ efficiency by identifying which kernel best fits the data, or whether it is simply better to opt for the linear case. My results show that for the “Reuters 21578” dataset, the nonlinear Support Vector Machine is more efficient when the conversion of text data into numerical attributes is carried out with Word2Vec models rather than TF-IDF or Doc2Vec representations. When the data are assumed to meet the linear separability requirement, the TF-IDF representation outperforms all other options. Surprisingly, Doc2Vec models have the lowest performance, providing satisfactory results only in terms of computational cost. This paper demonstrates that while Word2Vec models are truly efficient for text data representation, Doc2Vec neural networks fail to exceed even the TF-IDF representation. This evidence contradicts the common belief that Doc2Vec models should provide better insight into the training data domain than Word2Vec models, and certainly than the TF-IDF index.
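The comparison described above can be sketched in code. The following is a minimal, illustrative pipeline assuming gensim and scikit-learn, with a six-document toy corpus standing in for Reuters-21578; the averaged word vectors, the hyperparameters, and the cross-validation setup are assumptions made for demonstration, not the paper's exact configuration.

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for the Reuters-21578 corpus: (tokenized text, topic) pairs.
corpus = [
    ("oil prices rise on opec supply fears".split(), "crude"),
    ("crude futures climb as oil demand grows".split(), "crude"),
    ("opec cuts crude output to lift oil prices".split(), "crude"),
    ("wheat exports fall after severe drought".split(), "grain"),
    ("corn harvest hits record as grain prices drop".split(), "grain"),
    ("drought hurts wheat and corn grain supplies".split(), "grain"),
]
tokens = [doc for doc, _ in corpus]
labels = [label for _, label in corpus]

# 1) TF-IDF representation: one sparse vector per document.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(" ".join(doc) for doc in tokens)

# 2) Word2Vec: average the word vectors of each document
#    (one plausible aggregation; the paper's may differ).
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, epochs=100)
X_w2v = np.array([np.mean([w2v.wv[w] for w in doc], axis=0) for doc in tokens])

# 3) Doc2Vec: learn and infer one vector per document directly.
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)
X_d2v = np.array([d2v.infer_vector(doc) for doc in tokens])

# Compare a linear kernel against a nonlinear (RBF) kernel
# on each of the three representations.
for name, X in [("tf-idf", X_tfidf), ("word2vec", X_w2v), ("doc2vec", X_d2v)]:
    for kernel in ("linear", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, labels, cv=3)
        print(f"{name:9s} {kernel:6s} accuracy: {scores.mean():.3f}")
```

Note the asymmetry between the two embedding models: Word2Vec yields word-level vectors that must be aggregated into a document vector (here by simple averaging), whereas Doc2Vec produces a document-level vector natively, which is exactly why it is often expected to outperform Word2Vec on document classification.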

Language: English
Page range: 496 - 503
Published on: Feb 13, 2020
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2020 Maria Mihaela Truşcă, published by Grupul de Econometrie Aplicata
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.