Philology: scientific researches
Reference:

Pelevin vs Sorokin: an Attempt of Stylometric Comparison

Zenkov Andrei Viacheslavovich

ORCID: 0000-0002-1233-9082

PhD in Physics and Mathematics

Associate Professor at the Department of Modeling of Controllable Systems, Ural Federal University

620002, Russia, Sverdlovsk region, Yekaterinburg, Mira st., 19, office 434

zenkow@mail.ru
Zenkov Miroslav Andreevich

Master's degree; Institute of Radio Engineering and Information Technology; Ural Federal University

620100, Russia, Sverdlovsk region, Yekaterinburg, Kuibyshev str., 88

zenkow@mail.ru
Zenkov Nikolai Andreevich

Bachelor's degree; Institute of Digital Technologies of Management and Information Security; Ural State University of Economics

620100, Russia, Sverdlovsk region, Yekaterinburg, Kuibyshev str., 88

zenkow@mail.ru

DOI: 10.7256/2454-0749.2024.7.71169

EDN: OLBCPG

Received: 30-06-2024

Published: 01-08-2024


Abstract: This study belongs to quantitative linguistics and applies a new method for analyzing authorial style in literary texts. The method relies on computer analysis of the numerals found in a text, both cardinal and ordinal, whether written in digits or in words. The program we used automatically removes phraseological units and fixed expressions that contain numerals only incidentally. Before analysis, the text must be manually cleaned of numbers that are not part of the author's artistic intent, such as page and chapter numbers. The analysis shows that an author's use of numerals is individual and forms a characteristic feature that distinguishes texts by different authors. For the first time, a formal quantitative stylometric analysis is performed on the literary works of Victor Pelevin and Vladimir Sorokin, authors whose styles share many similarities when viewed through the traditional descriptive philological approach. To validate the methodology, we also included texts by four "impostor" authors in the analysis. Pelevin's and Sorokin's texts were found to differ significantly in their use of numerals. The data on the occurrence of numerals were subjected to hierarchical clustering, which accurately divided the texts into groups by authorship. Since clustering results can depend on the choice of both the metric and the clustering method, we tried various reasonable combinations of them; each time the dendrogram changed only slightly, so the clustering outcomes can be considered reliable. The proposed method, based on the analysis of numerals in literary texts, has the potential to solve stylometric problems, in particular those of text attribution.


Keywords: stylometry, quantitative linguistics, text attribution, text authorship, numerals in texts, Victor Pelevin, Vladimir Sorokin, hierarchical cluster analysis, dendrogram, Manhattan metric

This article is automatically translated.

1. Introduction

The annual appearance of a new novel by Viktor Pelevin, a constant for many years now (the most recent at the time of writing is Journey to Eleusis, 2023), draws the attention of the reading public and of literary critics [1-6] to this peculiar kind of socio-metaphysical fantasy, in which the comic and the parodic coexist with black humor and absurdist plot twists, and apt everyday observations coexist with elements of the occult and the surreal. Pelevin has been compared to such masters of the genre as Gogol, Kafka and Borges, and in recent decades many have come to regard him as a writer who has captured the spirit of the times and possesses a gift of foresight. Interest in Pelevin's personality is fueled by the almost complete seclusion of his private life, which recalls the "great recluses" J. D. Salinger and T. Pynchon. This has even given rise to rumors that the writer does not exist at all and that a group of authors works under the Pelevin brand; conversely, Pelevin's hidden authorship has been suspected in texts published under other names (see below).

The artistic features listed above are also largely characteristic of the work of Vladimir Sorokin, who, along with Pelevin, is considered one of the two stars of Russian postmodern literature, locked in a continuous unspoken rivalry [6-11]. The texts of these two authors are often considered together, not only by ordinary readers but also in literary criticism.

Without claiming to offer a literary-critical analysis of the works of Pelevin and Sorokin, in this paper we apply a formal quantitative approach to their texts, which, as far as we know, has not been done before.

Stylometry (and, more broadly, quantitative linguistics), the quantitative study of authorial features of texts, including for purposes of attribution, still lacks a fully satisfactory universal working method [12, 13]. The analyzed texts are typically compared by the frequencies of content words and function words (prepositions, conjunctions), by average word length, and by the most frequent words or even letter combinations (oddly enough, the last approach often gives good results). Unfortunately, different methods often lead to contradictory conclusions, so it is more reliable to use several methods together.

Promising results have been obtained with neural networks, and artificial intelligence will apparently soon be able to solve problems of quantitative linguistics successfully [14]; however, meaningful interpretation of the results is difficult with this approach, since the method itself is a "black box".

We have developed an original stylometric method based on the authors' use of numerals in their texts [15, 16]. Among the content parts of speech, numerals are by their nature the easiest to quantify. For a literary (not strictly factual) text produced by free imagination, it is natural to assume that the use of numerals is linked to the author's psychological characteristics, which shape the result of creative work without the author noticing. The manner of using numerals is therefore an authorial feature (a "fingerprint") that, under certain circumstances, makes it possible to resolve questions of authorship.

Note also that, unlike all the methods listed above, the statistics of numeral usage are invariant under translation of a text into another language. If the original text is unavailable, an available translation can therefore be used instead, and texts by authors who worked in several languages (A. Strindberg, S. Beckett, V. V. Nabokov, ...) can be compared quantitatively.

Analysis of works by several dozen authors writing in Russian, Czech, and English has revealed tangible authorial features in the use of numerals, as well as the influence of genre, style, and literary movement on them [17-22]. The results of the analysis thus admit a meaningful philological interpretation.

In this paper we analyze, from the standpoint of numeral usage, the principal literary works of V. O. Pelevin and V. G. Sorokin, together with some other texts brought in to make the results more reliable.

2. Method and objects of research

We used a computer program that searches a Russian-language text for numerals expressed either in digits or in words, in their various word forms [22]. The search compares the words of the text with a dictionary base derived from M. Hagen's dictionary "Complete Paradigm. Morphology. Frequency Dictionary. Combined Dictionary" (http://speakrus.ru/dict2/#morph-paradigm). The program automatically removed from the text phraseological units and fixed expressions that contain numerals incidentally, without authorial intent (for example, the Russian idioms rendered in English as "like the back of your hand" and "behind seven locks", both of which contain a numeral in Russian).

Beforehand, page numbers, chapter numbers, and enumeration markers such as 1), 2), 3), ... were manually deleted from the text.

We analyzed some of the most voluminous works of Pelevin and Sorokin, listed in Table 1. The choice of texts was determined by their availability for free download on the Internet and by their not appearing (at the time this work was prepared) on proscription lists.
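The search step described above can be illustrated with a minimal sketch. It is not the authors' program: the tiny lexicon `NUMERAL_FORMS`, the list `FIXED_EXPRESSIONS`, and the function `extract_numerals` are hypothetical stand-ins for the full morphological dictionary and the idiom filter, and the sample sentence is invented.

```python
import re

# Hypothetical mini-lexicon: Russian word form -> numeric value.
# The real program uses a full morphological dictionary; this is illustration only.
NUMERAL_FORMS = {
    "один": 1, "одна": 1, "первый": 1,
    "два": 2, "две": 2, "двух": 2, "второй": 2,
    "три": 3, "трёх": 3, "третий": 3,
    "пять": 5, "пяти": 5,
    "семь": 7, "семью": 7,
}

# Hypothetical list of fixed expressions whose numerals must not be counted
# ("like the back of your hand", "behind seven locks").
FIXED_EXPRESSIONS = ["как свои пять пальцев", "за семью замками"]


def extract_numerals(text):
    """Return the values of numerals written in digits or in words."""
    lowered = text.lower()
    # Drop fixed expressions first so their numerals are never seen.
    for phrase in FIXED_EXPRESSIONS:
        lowered = lowered.replace(phrase, " ")
    values = []
    # Tokens are either runs of digits or runs of letters.
    for token in re.findall(r"\d+|[^\W\d_]+", lowered):
        if token.isdigit():
            values.append(int(token))
        elif token in NUMERAL_FORMS:
            values.append(NUMERAL_FORMS[token])
    return values


sample = "Он знал город как свои пять пальцев и купил 3 билета на две недели."
print(extract_numerals(sample))  # [3, 2]
```

The idiom is removed before tokenization, so its "five" does not enter the counts, while the digit 3 and the word form for 2 do.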

3. Results

The inverse density of numerals is calculated for each text by dividing the volume of the text (in bytes) by the number of numerals found in it. The lower the inverse density, the more often numerals occur in the text.
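As a small worked example of this definition, the sketch below recomputes the inverse density for two rows of Table 1. The function name `inverse_density` is ours, and rounding to the nearest integer is an assumption about how the tabulated values were obtained.

```python
def inverse_density(volume_bytes, numeral_count):
    """Text volume in bytes divided by the number of numerals found."""
    if numeral_count <= 0:
        raise ValueError("the text must contain at least one numeral")
    return volume_bytes / numeral_count


# Volumes and numeral counts taken from Table 1.
blue_lantern = inverse_density(1_245_806, 1152)
chapaev = inverse_density(1_075_941, 665)

print(round(blue_lantern))  # 1081
print(round(chapaev))       # 1618
```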

A comparison of the inverse densities alone already reveals a significant difference between the works of Pelevin (Nos. 1-15 in Table 1) and Sorokin (Nos. 16-22): the average inverse densities (1322 and 973, respectively) differ by about a third, with numerals occurring more often in Sorokin's texts. At the same time, judging by the spread of the inverse density across the analyzed texts (the ratio of the maximum to the minimum inverse density is 1.6 for Pelevin and 2.2 for Sorokin), Pelevin's use of numerals is the more uniform.

An even clearer difference in the two authors' use of numerals emerges from hierarchical cluster analysis [23], which groups objects (here, texts) into clusters by similarity; in our case, by the similarity of the absolute frequencies of the numerals 1, 2, 3, ..., 10 (these numerals occur in every analyzed text without exception). Since the texts vary considerably in volume (see Table 1), we made the frequencies comparable by introducing correction coefficients, taking Pelevin's S.N.U.F.F. as the reference text. For example, the frequencies for Generation P were multiplied by 1 285 434 / 832 755 = 1.54, and those for The Day of the Oprichnik by 1 285 434 / 414 628 = 3.10.
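The figures quoted in this paragraph can be checked against Table 1 with the sketch below. One assumption should be flagged: the text does not say how the author averages were formed, and here they are computed as a pooled ratio (total volume over total number of numerals), which happens to reproduce the tabulated 1322 and 973. The function names are ours.

```python
# (volume in bytes, number of numerals) for each text, copied from Table 1.
PELEVIN = [
    (1245806, 1152), (1075941, 665), (832755, 753), (675550, 582),
    (230693, 218), (1110851, 706), (795699, 583), (1285434, 893),
    (1059056, 698), (1002303, 762), (1007765, 909), (986624, 844),
    (670911, 477), (1217515, 887), (909633, 541),
]
SOROKIN = [
    (448680, 681), (932711, 1242), (697517, 818), (414628, 381),
    (430967, 296), (868261, 829), (1295466, 981),
]


def pooled_inverse_density(texts):
    """Total volume over total numeral count (an assumed averaging rule)."""
    return sum(v for v, _ in texts) / sum(n for _, n in texts)


def spread(texts):
    """Ratio of the largest to the smallest per-text inverse density."""
    densities = [v / n for v, n in texts]
    return max(densities) / min(densities)


for name, texts in (("Pelevin", PELEVIN), ("Sorokin", SOROKIN)):
    print(name, round(pooled_inverse_density(texts)), round(spread(texts), 1))
# Pelevin 1322 1.6
# Sorokin 973 2.2
```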
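The volume correction can be sketched as follows. The coefficients use the volumes from Table 1; the raw frequency vector `raw_counts` is a hypothetical placeholder, since the per-numeral counts are not given in the text, and the function names are ours.

```python
REFERENCE_VOLUME = 1_285_434  # S.N.U.F.F., bytes (Table 1)


def correction_coefficient(volume_bytes, reference=REFERENCE_VOLUME):
    """Factor that rescales a text's frequencies to the reference volume."""
    return reference / volume_bytes


def corrected_frequencies(raw_counts, volume_bytes):
    """Multiply every absolute frequency by the text's correction coefficient."""
    k = correction_coefficient(volume_bytes)
    return [count * k for count in raw_counts]


print(f"{correction_coefficient(832_755):.2f}")  # 1.54
print(f"{correction_coefficient(414_628):.2f}")  # 3.10

# Hypothetical raw frequencies of the numerals 1..10 in some text of 832 755 bytes.
raw_counts = [200, 90, 60, 30, 25, 12, 10, 8, 6, 20]
scaled = corrected_frequencies(raw_counts, 832_755)
```

After this scaling, every text is treated as if it had the volume of the reference text, so absolute frequencies become comparable.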

As is well known, the measure of similarity in cluster analysis is a metric ρ (a "distance"): the smaller the "distance" between objects, the greater their similarity. We applied the Manhattan metric

$$\rho(x, y) = \sum_{i=1}^{n} |x_i - y_i|, \tag{1}$$

where x and y are n-dimensional vectors whose components are the corrected absolute frequencies of the first n natural numbers in the two analyzed texts (here n = 10).
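For concreteness, the Manhattan metric is the sum of the absolute component-wise differences, as in the short sketch below. The function name `manhattan` and the sample vectors are illustrative and are not data from the study.

```python
def manhattan(x, y):
    """Sum of absolute component-wise differences between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy three-component vectors (the study uses n = 10).
print(manhattan([3, 1, 4], [1, 1, 9]))  # 7
```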

Clustering was performed with the farthest-neighbor (complete linkage) method [24], which tends to produce compact, well-separated clusters.

The texts studied were distributed among clusters perfectly according to authorship (Fig. 1). The superclusters of Pelevin's and Sorokin's texts merge only at a great height (that is, at a large distance), which again confirms the marked differences between the two authors' texts. Note that this casts doubt on the marginal view that a group of authors writes under the "Pelevin" brand.

Modern stylometry holds that a comparison of the texts of two specific authors has evidentiary force regarding their similarity or difference only if the texts studied are "diluted" with texts by other authors, the so-called impostors [25]. Following this idea, we added further literary texts (see Table 2) and repeated the clustering (Fig. 2).
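A minimal pure-Python sketch of agglomerative clustering with the far-neighbor rule is given below. It is an illustration under stated assumptions, not the software used in the study: the function `complete_linkage` is ours, and the toy vectors are hypothetical (three components instead of ten).

```python
from itertools import combinations


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def complete_linkage(vectors):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Far-neighbor rule: cluster distance is the LARGEST pairwise distance.
            d = max(manhattan(vectors[i], vectors[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical toy frequency vectors, NOT data from the study.
texts = [
    [10, 6, 4],  # toy text 0, "author A"
    [11, 5, 4],  # toy text 1, "author A"
    [20, 2, 9],  # toy text 2, "author B"
    [21, 3, 8],  # toy text 3, "author B"
]
for a, b, height in complete_linkage(texts):
    print(a, b, height)
# (0,) (1,) 2
# (2,) (3,) 3
# (0, 1) (2, 3) 19
```

In this toy case the two "authors" are joined only at the last step and at a much greater height than the within-author merges, which is the kind of pattern a dendrogram displays.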

Several conclusions follow from Table 2 and Fig. 2:

· The additional texts were also clustered according to authorship;

· A work written jointly by two authors (O. Robski and K. Sobchak, No. 4 in Table 2) differs from the texts written by one of them alone (O. Robski, Nos. 2 and 3 in Table 2) and is clustered separately, which is a further argument for treating numeral usage as an authorial invariant;

· The texts of Pelevin and Sorokin still never fall into the same low-level cluster, which supports the conclusion above about the significant differences between the two authors' texts.

The novel Okolonolya, a literary hoax published in 2009 under the pseudonym "Nathan Dubovitsky", requires separate consideration. In the disputes over its authorship, V. Sorokin and V. Pelevin were named among the possible authors. Russian and foreign media suggested that the novel was written by the Russian politician Vladislav Surkov, whose own statements on the matter have been contradictory. By now his authorship is generally accepted [26].

What does our analysis of numeral statistics show? The inverse density of numerals for this text lies midway between the average values for Pelevin's and Sorokin's texts (Tables 1, 2), and on the dendrograms (Figs. 2, 3) Okolonolya does not join a low-level cluster with any work by these authors. The hypotheses that Pelevin or Sorokin wrote it are therefore rejected. Of course, this does not prove Surkov's authorship, but we have no additional literary text of his with which to examine the question.

As is well known, the choice of metric and clustering method cannot be rigorously justified, yet it can significantly affect the clustering results. We therefore clustered the same set of texts as in Fig. 2 using, instead of the farthest-neighbor method, the group average (between-groups linkage) method [24], again with the Manhattan metric (Fig. 3). In our case the results proved quite stable, and all the conclusions remain valid. Other reasonable combinations of metric and clustering method likewise change the dendrogram only slightly.
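A stability check of this kind can be sketched by running the same agglomerative procedure under two linkage rules and comparing the resulting partitions. Everything here is illustrative: the function `cluster`, its `linkage` parameter, and the toy vectors are hypothetical and do not reproduce the study's data or software.

```python
from itertools import combinations


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cluster(vectors, k, linkage):
    """Merge clusters until k remain; linkage is 'complete' or 'average'."""
    def dist(a, b):
        pairwise = [manhattan(vectors[i], vectors[j]) for i in a for j in b]
        if linkage == "complete":
            return max(pairwise)              # far-neighbor rule
        return sum(pairwise) / len(pairwise)  # group average rule

    clusters = [frozenset([i]) for i in range(len(vectors))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Hypothetical toy frequency vectors, NOT data from the study.
texts = [
    [10, 6, 4], [11, 5, 4],               # toy "author A"
    [20, 2, 9], [21, 3, 8], [19, 2, 10],  # toy "author B"
]
by_complete = cluster(texts, 2, "complete")
by_average = cluster(texts, 2, "average")
print(by_complete == by_average)  # True
```

When the groups are well separated, as in this toy example, both linkage rules return the same two-cluster partition; with real data the comparison would be made on the full dendrograms.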

Table 1

The occurrence of numerals in the studied works

No. | Author, text, year of publication | Volume (bytes, UTF encoding) | Number of numerals | Inverse density of numerals
1 | Pelevin, Blue Lantern (short stories), 1991 | 1 245 806 | 1152 | 1081
2 | Pelevin, Chapaev and the Void (novel), 1996 | 1 075 941 | 665 | 1618
3 | Pelevin, Generation P (novel), 1999 | 832 755 | 753 | 1106
4 | Pelevin, Macedonian Criticism of French Thought (novel), 2007 | 675 550 | 582 | 1161
5 | Pelevin, The Hall of Singing Caryatids (works), 2008 | 230 693 | 218 | 1058
6 | Pelevin, t (novel), 2009 | 1 110 851 | 706 | 1573
7 | Pelevin, Pineapple Water for a Beautiful Lady (essays), 2010 | 795 699 | 583 | 1365
8 | Pelevin, S.N.U.F.F. (novel), 2011 | 1 285 434 | 893 | 1439
9 | Pelevin, Love for the Three Zuckerbrins (novel), 2014 | 1 059 056 | 698 | 1517
10 | Pelevin, The Lamp of Methuselah, or the Extreme Battle of the Chekists with the Masons (novel), 2016 | 1 002 303 | 762 | 1315
11 | Pelevin, iPhuck 10 (novel), 2017 | 1 007 765 | 909 | 1109
12 | Pelevin, Secret Views of Mount Fuji (novel), 2018 | 986 624 | 844 | 1169
13 | Pelevin, The Invincible Sun. Book I (novel), 2020 | 670 911 | 477 | 1407
14 | Pelevin, Transhumanism Inc. (novel), 2021 | 1 217 515 | 887 | 1373
15 | Pelevin, Journey to Eleusis (novel), 2023 | 909 633 | 541 | 1681
The average inverse density of numerals in the fifteen Pelevin texts: 1322
16 | Sorokin, The Hearts of Four (novel), 1991 | 448 680 | 681 | 659
17 | Sorokin, Pir (collection of short stories), 2000 | 932 711 | 1242 | 751
18 | Sorokin, Ice (novel), 2002 | 697 517 | 818 | 852
19 | Sorokin, The Day of the Oprichnik (novel), 2006 | 414 628 | 381 | 1088
20 | Sorokin, Blizzard (novel), 2010 | 430 967 | 296 | 1456
21 | Sorokin, Telluria (novel), 2013 | 868 261 | 829 | 1047
22 | Sorokin, Dr. Garin (novel), 2021 | 1 295 466 | 981 | 1320
The average inverse density of numerals in the seven Sorokin texts: 973

Table 2

The occurrence of numerals in the texts of the impostor authors

No. | Author, text | Volume (bytes, UTF encoding) | Number of numerals | Inverse density of numerals
1 | "Nathan Dubovitsky", Okolonolya | 573 376 | 505 | 1135
2 | O. Robski, Casual | 649 060 | 608 | 1068
3 | O. Robski, Pro lyuboff/on | 502 168 | 490 | 1025
4 | O. Robski, K. Sobchak, Marrying a Millionaire, or a Marriage of the Highest Grade | 381 724 | 342 | 1116
5 | E. Verkin, Cloud Regiment | 741 749 | 584 | 1270
6 | E. Verkin, April's Friend | 758 194 | 715 | 1060

4. Conclusions

The new approach to stylometric problems that we are developing, based on the statistics of numerals in texts, proves highly effective and sensitive for all its simplicity. The texts of V. O. Pelevin and V. G. Sorokin, which had previously been compared only within the traditional descriptive philological approach, were subjected to formal quantitative analysis for the first time, and the analysis assigned the texts correctly according to authorship. Significant authorial differences in the manner of using numerals were found. Including texts by third-party authors (impostors) in the analysis strengthens the result and supports its non-random nature. The method is suitable for the attribution of texts.

Figure 1. Result of applying hierarchical cluster analysis to the texts of V. O. Pelevin and V. G. Sorokin (farthest-neighbor method, Manhattan metric). The horizontal axis shows the "distance" in arbitrary units.


Figure 2. Result of applying hierarchical cluster analysis to the texts of V. O. Pelevin and V. G. Sorokin with the addition of texts by other authors (farthest-neighbor method, Manhattan metric). The horizontal axis shows the "distance" in arbitrary units.


Figure 3. Result of applying hierarchical cluster analysis to the texts of V. O. Pelevin and V. G. Sorokin with the addition of texts by other authors (group average method, Manhattan metric). The horizontal axis shows the "distance" in arbitrary units.

References
1. Bogdanova, O.V., Kibalnik, S.A., & Safronova, L.V. (2008). Литературные стратегии Виктора Пелевина [Literary strategies of Victor Pelevin]. Saint Petersburg: Petropolis.
2. Polotovski, S.A. & Kozak, R.V. (2012). Пелевин и поколение пустоты [Pelevin and the generation of emptiness]. Moscow: Mann, Ivanov and Ferber.
3. Shilova, N.L. (2011). Визионерские мотивы в постмодернистской прозе 1960–1990-х годов (Вен. Ерофеев, А. Битов, Т. Толстая, В. Пелевин) [Visionary motives in postmodern prose of the 1960–1990s (Ven. Erofeev, A. Bitov, T. Tolstaya, V. Pelevin)]. Petrozavodsk: The Publ. House of the Karelian State Pedagogical Academy.
4. Khagi, S. (2018). Alternative Historical Imagination in Viktor Pelevin. Slavic and Eastern European Journal, 62(3), 483–502.
5. Khagi, S. (2023). Пелевин и несвобода: Поэтика, политика, метафизика [Pelevin and Unfreedom: Poetics, Politics, Metaphysics]. Moscow: Novoe literaturnoie obozrenie.
6. Lanin, B.A. (2015). Новая старая литературократия: Сорокин и Пелевин в борьбе с традициями [The new old literaturocracy: Sorokin and Pelevin's fight against tradition]. Cennosti i smysly, 40(6), 110–123.
7. Bogdanova, O.V. (2005). Концептуалист, писатель и художник Владимир Сорокин [Conceptualist, writer and artist Vladimir Sorokin]. Saint Petersburg: Saint Petersburg State University.
8. Andreeva, N.N., & Bibergan, E.S. (2012). Игры и тексты Владимира Сорокина [Games and texts of Vladimir Sorokin]. Saint Petersburg: Petropolis.
9. Marusenkov, M.P. (2012). Абсурдопедия русской жизни Владимира Сорокина: Заумь, гротеск и абсурд [The Absurdopedia: Vladimir Sorokin's Russian Life in Abstraction, Grotesque, and Absurdity]. Saint Petersburg: Aleteia.
10. Bibergan, E.S. (2014). Рыцарь без страха и упрёка: Художественное своеобразие прозы Владимира Сорокина [A Knight without Fear and Reproach: The Artistic Originality of Vladimir Sorokin's Prose]. Saint Petersburg: Petropolis.
11. Kalinin, I.A., Lipovetski, M.N., Dobrenko, E.A. et al. (2018). «Это просто буквы на бумаге…». Владимир Сорокин: после литературы ["These are just letters on paper...". Vladimir Sorokin: After Literature]. Moscow: Novoe literaturnoie obozrenie.
12. Stamatatos, E. (2009). A survey of modern authorship attribution methods. J. Amer. Soc. for Information Science and Technology, 60(3), 538–556.
13. Neal, T., Sundararajan, K., Fatima, A., Yan, Y., Xiang, Y., & Woodard, D. (2017). Surveying Stylometry Techniques and Applications. ACM Comput. Surv., 50(6), Article 86.
14. La Inteligencia Artificial ayuda a descubrir una obra desconocida de Lope de Vega en los fondos de la BNE, Biblioteca Nacional de España [Artificial Intelligence helps to discover an unknown work by Lope de Vega in the collections of the BNE, National Library of Spain], https://www.bne.es/es/noticias/inteligencia-artificial-ayuda-descubrir-obra-desconocida-lope-vega-fondos-bne
15. Zenkov, A.V. (2017). Новый метод стилеметрии на основе статистики числительных [A new method of stylometry based on numerals statistics]. Kompiuternye issledovaniia i modelirovanie, 9(5), 837–850.
16. Zenkov, A.V. (2018). A Method of Text Attribution Based on the Statistics of Numerals. J. of Quantitative Linguistics, 25(3), 256–270.
17. Zenkov, A.V., & Místecký, M. (2019). The Romantic Clash: Influence of Karel Sabina over Macha’s Cikani from the Perspective of the Numerals Usage Statistics. Glottometrics, 46, 12–28.
18. Zenkov, A.V. (2021). Stylometry and Numerals Usage: Benford’s Law and Beyond. Stats, 4, 1051–1068.
19. Zenkov, A., & Místecký, M. (2022). Young Vladimír Vašek? – A Numerals Analysis Contribution to the Bezruč−Hrzánský Identity Issue. Naše řeč, 105(3), 151–161.
20. Zenkov, A.V. (2023). Литературные мистификации и авторское использование числительных [Literary hoaxes and the use of numerals by authors]. Filologicheskie nauki. Voprosy teorii i praktiki, 16(11), 3696–3709. https://doi.org/10.30853/phil20230568
21. Zenkov, A.V. (2023). Under a False Flag: Literary Hoaxes and the Use of Numerals. Litera, 10, 86–109. Retrieved from https://doi.org/10.25136/2409-8698.2023.10.68743
22. Zenkov, A.V., & Ermakov, N.E. (2023). Числительные в текстах как характерная особенность авторского стиля [The use of numerals in texts is a distinctive feature of the author's writing style]. Russian Linguistic Bulletin, 45(9). Retrieved from https://doi.org/10.18454/RULB.2023.45.28
23. Moisl, H. (2015). Cluster Analysis for Corpus Linguistics. De Gruyter Mouton.
24. Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics.
25. Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. J. of the Association for Information Science and Technology, 65(1), 178–187.
26. Plekhanova, I.I. (2013). Внутрилитературная полемика начала XXI века: мотивы и содержание («Околоноля» Н. Дубовицкого и «S.N.U.F.F.» В. Пелевина) [The intra-literary debate of the early 21st century: themes and content (N. Dubovitsky's "Okolonolia" and V. Pelevin's "S.N.U.F.F.")]. Filologicheski klass, 33(3), 26–32.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The material submitted for publication takes a formal approach to the study of literary texts. The author himself emphasizes this, and the data obtained bear it out. In my opinion, the present format is acceptable as an alternative; the work, or rather its algorithm (program), can probably be used to study not only the texts of Viktor Pelevin and Vladimir Sorokin but those of other authors as well. The judgments made in the course of the work are objective, verified, and systematic: for example, "the listed artistic features are largely characteristic of the work of Vladimir Sorokin, who, along with Pelevin, is considered one of the two stars of Russian postmodern literature who are in continuous unspoken confrontation [6-11]. Not only at the grassroots reader level, but also in literary criticism, the texts of these two authors are often considered together," or "for each text, the inverse density of numerals is calculated as a result of dividing the volume of the text by the number of numerals found in it. The lower the inverse density, the more often numerals occur in the text. Already a comparison of the inverse densities of numerals reveals a significant difference between Pelevin's works (No. 1-15 in Table. 1) and Sorokin (No. 16-22): the average inverse densities differ by a third; in Sorokin's texts, numerals are used more often (more detail). At the same time, according to the magnitude of fluctuations in the inverse density in the analyzed texts (the ratio of maximum and minimum density: 1.6 and 2.2 times in the texts of Pelevin and Sorokin, respectively), the manner of using numerals is more uniform in Pelevin," etc. The text is divided into semantic blocks, which is quite appropriate, and the results are summarized in tables and diagrams. On the whole the author's concept is presented convincingly, but some points could have been spelled out more precisely.
For example, the connection between an author's use of numerals and that author's psychological characteristics is not entirely clear: "in relation to an artistic (not rigidly factual) text generated by free imagination, it is natural to assume that the use of numerals is associated with the psychological characteristics of the author, imperceptibly influencing the result of creativity for himself. Therefore, the manner of using numerals is an author's feature (fingerprint), which allows, under certain circumstances, to solve the problem of authorship of the text." The literary basis for comparison is well chosen: I think this approach does indeed establish the "authorship" of a particular text, although, in my opinion, the problem should also be addressed in other research areas. As a result, the author comes to the following conclusion: "the new approach we are developing to the problems of stylometry, based on the analysis of statistics of numerals in texts, for all its simplicity, demonstrates high efficiency and sensitivity. The texts of V. O. Pelevin and V. G. Sorokin, the comparative analysis of which has been carried out so far only within the framework of the traditional descriptive philological approach, were for the first time subjected to formal quantitative analysis, which correctly distributed the texts according to authorship. Significant authorial differences in the manner of using numerals were found. The involvement of third-party authors (impostors) for the analysis of texts enhances the significance of the result obtained and confirms its non-random nature. The method is suitable for attribution of texts." I think the list of sources could be adjusted: it is desirable to eliminate some formal shortcomings, for example in "Hagi S. Pelevin and unfreedom: Poetics, politics, metaphysics. M.: New Literary Review, 2023. – 392 p. ISBN: 978-5-4448-1967-8", etc.
On the whole, the topic of the work has been covered, the goal has been achieved, and a result has been obtained. I recommend the article "Pelevin vs Sorokin: an Attempt of Stylometric Comparison" for publication in the journal "Philology: Scientific Research".