Classification of Literary Works by Genre Using Machine Learning Methods

Vardanyan Hayarpi,

Ohanyan Heghine

Summary

Key words: TF-IDF vectorization method, classifier, SGDClassifier algorithm, metric, confusion matrix, digital libraries

This article examines the problem of automated genre classification of literary works using machine learning methods. The relevance of the study is determined by the expansion of digital libraries and educational platforms, where manual processing of large text collections is inefficient. The proposed approach relies on TF-IDF vectorization with adapted stop-words for a multilingual corpus and a comparative evaluation of seven classification algorithms. The SGDClassifier demonstrated the best performance (accuracy 0.67), surpassing both Bayesian methods and ensemble models.
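The approach described above (TF-IDF features with a custom stop-word list, followed by a comparison of classifiers) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, the stop-word list, the helper `build_pipeline`, and the choice of only two candidate classifiers are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the article's dataset is not reproduced here.
texts = [
    "the dragon and the wizard guarded the enchanted kingdom",
    "an elf queen forged a magic sword in the ancient forest",
    "the market strategy increased company profit and revenue",
    "investors reviewed the startup budget and sales forecast",
    "the prayer and the scripture guided the faithful pilgrims",
    "the monk studied sacred texts in the quiet monastery",
]
labels = ["fantasy", "fantasy", "business", "business", "religion", "religion"]

# Hypothetical stop-word list standing in for the adapted multilingual list.
custom_stop_words = ["the", "and", "an", "in", "of", "to"]


def build_pipeline(classifier):
    """Chain TF-IDF vectorization with the given classifier."""
    vectorizer = TfidfVectorizer(stop_words=custom_stop_words, lowercase=True)
    return make_pipeline(vectorizer, classifier)


# Two of the algorithm families mentioned in the summary (linear SGD, naive Bayes).
candidates = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

# Fit every candidate on the same data so their predictions are comparable.
fitted = {name: build_pipeline(clf).fit(texts, labels) for name, clf in candidates.items()}

new_texts = ["the wizard cast a spell on the dragon"]
predictions = {name: list(model.predict(new_texts)) for name, model in fitted.items()}
print(predictions)
```

In a real comparison each candidate would be scored on a held-out test split or with cross-validation rather than on a single hand-written sentence.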

An accuracy of 0.67 for the best-performing model indicates only a moderate level of effectiveness. This result is attributed to the semantic similarity of the genres and to the limitations of classical models.

Error analysis revealed difficulties in distinguishing semantically close genres (“horror/mysticism” and “fantasy”), whereas specialized genres (“religion,” “business,” “novel”) achieved high classification accuracy. The implementation was carried out in Python using pandas, scikit-learn, and NLTK, enabling a complete data processing pipeline from preprocessing to model evaluation.
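The kind of error analysis described above can be illustrated with a confusion matrix. The sketch below uses made-up true and predicted labels (an assumption for the example, not the article's results) to show how a frequently confused genre pair and per-genre recall can be read off the matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

genres = ["fantasy", "horror/mysticism", "religion"]

# Made-up labels for illustration only; they do not come from the study.
y_true = ["fantasy"] * 3 + ["horror/mysticism"] * 3 + ["religion"] * 3
y_pred = [
    "fantasy", "fantasy", "fantasy",
    "fantasy", "horror/mysticism", "fantasy",
    "religion", "religion", "religion",
]

# Rows are true genres, columns are predicted genres, in the order of `genres`.
cm = confusion_matrix(y_true, y_pred, labels=genres)
accuracy = accuracy_score(y_true, y_pred)

# Recall per genre: correctly classified items divided by all items of that genre.
per_genre_recall = cm.diagonal() / cm.sum(axis=1)

# Largest off-diagonal cell gives the most frequent (true, predicted) confusion.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
i, j = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
most_confused = (genres[i], genres[j])

print(cm)
print(dict(zip(genres, per_genre_recall.round(2))))
print("most confused (true, predicted):", most_confused)
```

Here the off-diagonal mass concentrates in the "horror/mysticism" row, which mirrors the pattern the summary reports for semantically close genres.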

The study highlights the effectiveness of classical algorithms when applied to well-prepared corpora and identifies promising directions for future research, including the use of transformer-based deep learning architectures such as BERT, ensemble methods, and model localization for different linguistic subsystems. The proposed methodology offers practical value for digital libraries, online book marketplaces, and educational platforms.

DOI: https://doi.org/10.58726/27382923-2025.2-84