Quantitative linguistics methods in the study of the original and translation of a literary text

Safina Z.M.

doi:10.7256/2454-0749.2025.10.76298

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Philology: scientific researches

Reference:

Safina, Z.M. (2025). Quantitative linguistics methods in the study of the original and translation of a literary text. Philology: scientific researches, 10, 56–64. https://doi.org/10.7256/2454-0749.2025.10.76298

Safina Zarema Miniaminovna

ORCID: 0009-0009-3486-7757

PhD in Philology

Associate Professor; Department of Linguodidactics and Translation Studies; Ufa University of Science and Technology

22 Kommunisticheskaya str., Ufa, Republic of Bashkortostan, 450076, Russia

safinazarem@yandex.ru

DOI:

10.7256/2454-0749.2025.10.76298

EDN:

NGJXWK

Received:

10/13/2025

Published:

10/20/2025

Abstract: This article examines two English translations of literary texts by Anton Chekhov. The purpose of the work is to investigate the original and the translations of a literary text using the methods of quantitative linguistics in combination with traditional translation analysis and the descriptive method. The relevance of the work stems from researchers' growing interest in applying the methods of computational linguistics to the quantitative component of texts. Frequency lists are compiled for the original and translated texts, a part-of-speech analysis of the text corpora is performed, and lexical diversity and lexical density are calculated. A comparative analysis of the translations of twenty verbs of sensory perception and mental activity is carried out. The research is based on the originals of Chekhov's short stories and two versions of their translation into English, made by the British translator C. Garnett and the American translators R. Pevear and L. Volokhonsky. The study was conducted with the NLTK library for the Python programming language. The results obtained using statistical methods may differ depending on the size of the studied text corpora, as well as on the structural features of the Russian and English languages. It is concluded that the translations of R. Pevear and L. Volokhonsky surpass the translations of C. Garnett in quantitative terms. An in-depth linguistic and translation analysis revealed that C. Garnett uses much more diverse translation techniques and strategies, which makes it possible to convey the shades of meaning of the original text more accurately in another language. The research results can be used to further develop corpus and statistical research on originals and translations of literary texts.

Keywords:

literary text, translation, statistical method, quantitative analysis, text corpus, A. Chekhov, lexical diversity, frequency lists, lexical density, lemma

This article is automatically translated.

Introduction

In modern linguistics, text data processing is a task of paramount importance because extensive information has to be analyzed promptly and effectively. Such research is impossible without computer technologies and the statistical analysis methods implemented within quantitative linguistics, a special branch of mathematical linguistics. Quantitative research methods are aimed at identifying patterns from which laws about particular linguistic phenomena are formulated. The main task of quantitative linguistics, according to the founder of this field, the German researcher R. Köhler, is "the desire to establish a hierarchy of explanations that lead to the emergence of more and more general theories and cover an increasing number of phenomena" ^{[1, p. 8]}.

Empirical data obtained in the framework of quantitative linguistics are based on the results of statistical research conducted in corpus and computational linguistics. The use of statistical approaches significantly increases the productivity of linguistic research and provides a structured presentation of the collected information. In translation theory, "the use of statistical research methods contributes to a more accurate understanding of the structure of the original and, consequently, to the creation of a more correct version of the translation" ^{[2, p. 137]}. Unlike classical methods, quantitative techniques can provide a higher degree of reliability and impartiality in the conclusions. This gives them a number of advantages over qualitative methods: the objective nature of the research; accuracy and unambiguity of measurement results; and the ability to process a large amount of data without significant time expenditure. These advantages significantly expand the scope of quantitative research methods, but the methods also have drawbacks, which is why many linguists prefer qualitative research methods. Firstly, not all linguistic phenomena can be measured using quantitative methods. Secondly, these methods can be used only within certain limits, beyond which the researcher has to resort to other methods of text analysis ^{[1, p. 7]}. The complexity of data collection and the subjectivity that arises when the obtained data are interpreted are also significant disadvantages of the methods of quantitative linguistics.

Corpus linguistics occupies a special place in the quantitative research of linguistic phenomena. Currently, this field of linguistics greatly facilitates the process of finding the necessary information. As a tool for linguistic research, a corpus should consist of a large array of natural texts in electronic form and should be equipped with dedicated software or a search engine ^[3]. Among the numerous tasks a corpus serves is the compilation of frequency lists of words, which are built from the relevant texts or their parts according to the research objectives and reflect the frequency of the detected elements. The general rule is that the most frequently occurring elements play a major role in the text and are significant, while elements with a small number of mentions are rarely used in speech. In general, "frequency lists make it possible to identify the core and periphery of vocabulary" ^{[4, p. 741]}.

The main part

In this work, three text corpora were studied using the methods of quantitative linguistics: the originals of thirty short stories by Anton Chekhov ^[5] and two versions of their translation into English, one by the British translator C. Garnett ^[6] and one by the American translators R. Pevear and L. Volokhonsky ^[7]. The study was conducted with one of the natural language processing toolkits, the NLTK library package for the Python programming language. Python is a general-purpose programming language, also known as a scripting language. It uses a dynamic type system and automatic memory management and has an extensive library ^{[8, p. 1856]}. The NLTK library makes it possible to pre-process data, "allowing machine algorithms to work with text data and perform text analysis" ^{[9, p. 177]}.

Before starting work, the studied texts were converted to txt format with UTF-8 encoding. Usually, the first step in working with text is to normalize it. This stage includes reducing the text to a single case and removing punctuation marks and unnecessary whitespace characters. Normalization is necessary to ensure uniformity in text processing. The next stage, tokenization, is the process of splitting text into separate units such as words, numbers, or punctuation marks. Splitting the analyzed text into tokens gives the following (the first 10 tokens are given, glossed in English): ['They said', ',', 'what', 'on', 'embankment', 'appeared', 'new', 'face', ':', 'lady']. After removing punctuation marks, we get the following set of tokens: ['They said', 'what', 'on', 'embankment', 'appeared', 'new', 'face', 'lady', 'with', 'dog'].

At the next stage, it is necessary to reduce all words to lowercase, i.e. to get rid of uppercase letters, so that the same word written in lowercase and in uppercase letters is not perceived by the system as two different tokens. After executing the command, we get a set of tokens without capital letters: ['they said', 'what', 'on', 'embankment', 'appeared', 'new', 'face', 'lady', 'with', 'dog'].

Thus, the text is prepared for further analysis.
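The preparation steps described above (whitespace normalization, tokenization, punctuation removal, lowercasing) can be sketched roughly as follows. This is a minimal illustration and not the author's code: it uses a standard-library regular expression as a stand-in for the NLTK tokenizer, the function name `preprocess` is hypothetical, and the sample sentence is assumed to be the Russian original behind the glossed tokens shown above.

```python
import re


def preprocess(text: str) -> list[str]:
    """Normalize whitespace, tokenize, drop punctuation, and lowercase."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenize: a run of word characters, or a single punctuation mark.
    # In Python 3, \w is Unicode-aware, so Cyrillic letters are matched.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep only tokens that contain at least one letter or digit.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    # Lowercase so that differently cased forms count as one token.
    return [w.lower() for w in words]


# Assumed sample: the sentence behind the glossed tokens in the text.
sample = "Говорили, что на набережной  появилось новое лицо: дама с собачкой."
tokens = preprocess(sample)
print(tokens)
```

A real pipeline would more likely call a dedicated tokenizer, which handles abbreviations, hyphenated words, and similar cases better than a single regular expression.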

The key point in creating frequency dictionaries is to identify the linguistic unit that will be used to measure the occurrence of elements in a text. There are two ways to compile frequency lists: either a word form or a lemma is taken as the dictionary unit. When a lemma is used as the basic unit of a text corpus, the text must be lemmatized before statistical analysis methods are applied. A lemma is a grammatical form used to represent a lexeme ^{[10, p. 611]}. In other words, a lemma is the base (dictionary) form of a word. Lemmatization is a rather complicated process, since lemmatizers make errors that have to be corrected manually. In this study, the Russian text was lemmatized using the morphological analyzer pymorphy3, and the English-language texts were lemmatized using the WordNetLemmatizer tool. The result of lemmatization is as follows: ['say', 'what', 'on', 'embankment', 'appear', 'new', 'face', 'lady', 'with', 'dog'].

After lemmatization, all the forms of the same word found in the text were combined and counted as a single lemma. This gave a more accurate idea of how frequently a particular word is used in the analyzed text. Next, linguistic units that have no independent meaning and can distort the occurrence statistics of content words were removed from the text array. These elements included all function words, proper names, and a number of other non-content words. At the final stage, frequency lists of lemmas were compiled for the three text corpora. The first ten lemmas from each list are shown in Table 1.

A.P. Chekhov		C. Garnett		R. Pevear and L. Volokhonsky	
Lemma	Frequency	Lemma	Frequency	Lemma	Frequency
to be	1882	be	6782	be	4865
speak	734	have	2520	have	1790
to say	536	go	1155	say	1092
human	528	say	1116	go	1057
could	357	there	877	there	767
now	350	do	808	do	695
become	322	look	629	come	675
life	307	come	624	life	614
day	283	life	594	look	539
eye	276	think	518	think	534

Table 1. Frequency lists of lemmas (the Russian lemmas of the Chekhov corpus are given in English translation)

According to these data, the verb to be (be in the English corpora) is the most frequent lemma in each of the three corpora. However, this verb is much more common in the English-language texts. This is explained by the analytical structure of the English language, where be plays a key role in the formation of various grammatical constructions. An analysis of the frequency of the verbs be, have, and do in the two translation corpora shows that they occur far more often in the corpus of C. Garnett than in the corpus of R. Pevear and L. Volokhonsky. This is probably because American English, in which R. Pevear and L. Volokhonsky wrote, tends to omit auxiliary verbs such as be, have, and do, which are actively used in British English to construct tense forms of the verb.
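The procedure for building a lemma frequency list (lemmatize, drop function words, count) can be illustrated with a short sketch. It is not the author's implementation: the lookup table `LEMMAS` and the set `FUNCTION_WORDS` are tiny hypothetical stand-ins for a real lemmatizer and a real function-word list, and the sample sentence is invented for the example.

```python
from collections import Counter

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {
    "is": "be", "was": "be", "were": "be",
    "said": "say", "says": "say",
    "has": "have", "had": "have",
    "went": "go",
}

# Hypothetical function-word list; a real study would use a fuller one.
FUNCTION_WORDS = {"the", "a", "and", "he", "she", "it", "that", "to"}


def frequency_list(tokens, top_n=10):
    """Return the top_n (lemma, count) pairs, ignoring function words."""
    # Map each token to its lemma; unknown tokens are kept unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # Remove words that carry no independent lexical meaning.
    content = [lemma for lemma in lemmas if lemma not in FUNCTION_WORDS]
    return Counter(content).most_common(top_n)


# Invented sample sentence, already lowercased and free of punctuation.
tokens = "he said that she was happy and she said it was late".split()
freq = frequency_list(tokens, top_n=2)
print(freq)
```

Merging "said" into "say" and "was" into "be" before counting is what lets all forms of a word contribute to a single entry in the list.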

Another indicator calculated in the framework of quantitative linguistics is the coefficient of lexical diversity of texts. It is calculated as the ratio of the number of non-repeating words to the total number of word uses in the text, expressed by the formula k = 100 × n / N, where N is the total number of word uses in the corpus, n is the number of unique words in the corpus (each word is counted only once, with repetitions excluded), and k is the coefficient of lexical diversity ^{[11, p. 558]}. Table 2 shows the results of calculating the degree of lexical diversity of the original texts and translations.

Corpus	N	n	k
A.P. Chekhov	137112	35062	25.57
C. Garnett	184284	32453	17.61
R. Pevear and L. Volokhonsky	170319	31771	18.65

Table 2. Coefficient of lexical diversity

A direct relationship was found between the value of the coefficient k and the richness of vocabulary in the text. An analysis of the data revealed that the translations by C. Garnett show a slightly lower lexical diversity than those of R. Pevear and L. Volokhonsky. At the same time, the total number of non-repeating words (indicator n) in C. Garnett's texts slightly exceeds the corresponding value for the other translators. The results of our study indicate that as the volume of the text increases (as observed in C. Garnett), the level of lexical diversity decreases, because lemmas are repeated more often in longer works. It is worth noting that the greatest lexical diversity is observed in the corpus of Anton Chekhov's texts. This result is natural, given that the Russian language belongs typologically to the synthetic type, despite the presence of elements of analyticism. This property shows in the fact that the grammatical characteristics of words in Russian are encoded by morphemes and not by separate lexical units, unlike in English. The result can also be explained by the fact that translations, no matter how well they are executed, are always somewhat inferior to the original in the volume and composition of information. Translation losses are inevitable, and therefore the translated text is always less complete. This may explain why several different translations of the same literary text exist.
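As a rough check, the coefficient k = 100 × n / N can be recomputed from the N and n values reported in Table 2. The sketch below is illustrative only: the function names are hypothetical, and the token-level helper assumes that "unique words" means distinct word types.

```python
def lexical_diversity(n_unique: int, n_total: int) -> float:
    """Coefficient of lexical diversity: k = 100 * n / N."""
    if n_total <= 0:
        raise ValueError("the total number of word uses must be positive")
    return 100 * n_unique / n_total


def diversity_from_tokens(tokens) -> float:
    """Same coefficient, with n taken as the number of distinct tokens."""
    return lexical_diversity(len(set(tokens)), len(tokens))


# (N, n) pairs as reported in Table 2.
corpora = {
    "Chekhov": (137112, 35062),
    "Garnett": (184284, 32453),
    "Pevear and Volokhonsky": (170319, 31771),
}

for name, (n_total, n_unique) in corpora.items():
    print(f"{name}: k = {lexical_diversity(n_unique, n_total):.2f}")

# Toy illustration of the length effect: repeating the same words
# doubles N but leaves n unchanged, so k is halved.
short_text = ["a", "lady", "with", "dog"]
long_text = short_text * 2
print(diversity_from_tokens(short_text), diversity_from_tokens(long_text))
```

The toy comparison at the end mirrors the observation that a longer text with more repetition yields a lower coefficient even when the vocabulary itself is no smaller.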

At the next stage, the number of words belonging to each part of speech in the text corpora under consideration was calculated. The quantitative characteristics of the main classes of words in the original text and its two translated versions are presented in Table 3.

Corpus	Noun	Verb	Adjective	Adverb
A.P. Chekhov	27821	25729	10374	6370
C. Garnett	39363	35106	12176	11718
R. Pevear and L. Volokhonsky	33886	33294	12079	12281

Table 3. Quantitative indicators of parts of speech

The analysis shows that the difference in the frequency distribution of parts of speech between the original text and its two translations is small. In all three text corpora, nouns are the most frequent part of speech and verbs the second most frequent. Of particular interest is the fact that C. Garnett has 4257 more nouns than verbs, while R. Pevear and L. Volokhonsky have only 592 more. As is well known, in addition to the nominative function, nouns in a work of fiction also perform descriptive and emotional functions. They help convey deep meanings and the most subtle shades of feeling. Almost no metaphor or comparison is complete without this part of speech. On this basis, it can be assumed that the translation by C. Garnett is characterized by greater descriptiveness and variability than the translation by R. Pevear and L. Volokhonsky. The least frequent parts of speech in all corpora are adjectives and adverbs. It is worth noting that in the two translations adverbs and adjectives practically coincide in the number of uses, while in the original adjectives are used about 1.6 times as often as adverbs.
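The comparisons drawn from Table 3 are simple arithmetic and can be recomputed directly from the table. The sketch below only restates the published counts; the dictionary layout and the helper names `noun_verb_gap` and `adjective_adverb_ratio` are hypothetical.

```python
# Part-of-speech counts as reported in Table 3.
pos_counts = {
    "Chekhov": {"noun": 27821, "verb": 25729, "adjective": 10374, "adverb": 6370},
    "Garnett": {"noun": 39363, "verb": 35106, "adjective": 12176, "adverb": 11718},
    "Pevear and Volokhonsky": {"noun": 33886, "verb": 33294, "adjective": 12079, "adverb": 12281},
}


def noun_verb_gap(counts: dict) -> int:
    """How many more nouns than verbs a corpus contains."""
    return counts["noun"] - counts["verb"]


def adjective_adverb_ratio(counts: dict) -> float:
    """Adjectives per adverb; values near 1 mean the two nearly coincide."""
    return counts["adjective"] / counts["adverb"]


for name, counts in pos_counts.items():
    gap = noun_verb_gap(counts)
    ratio = adjective_adverb_ratio(counts)
    print(f"{name}: noun-verb gap = {gap}, adjective/adverb = {ratio:.2f}")
```

Because the corpora differ in size, relative measures such as the adjective-to-adverb ratio are easier to compare across corpora than raw differences.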

Another value studied in the framework of quantitative linguistics is the lexical density coefficient, which is one of the tools of discourse analysis. This characteristic makes it possible to assess how informative a text is and how difficult it is to understand from a linguistic point of view. The complexity of a text's vocabulary can be determined by comparing the frequency of grammatical elements with the number of content words, that is, lexical units. There are a number of approaches to measuring the lexical saturation of a text. It can be measured as the ratio of lexical elements to the total number of words, or as the ratio of lexical elements to the number of higher structural elements (for example, sentences) ^{[12, p. 64]}. In this study, the lexical density is calculated as Ld = Nlex / N × 100, where Ld is the lexical density of the analyzed text, Nlex is the number of lexical elements (nouns, adjectives, verbs, adverbs) in the analyzed text, and N is the total number of words in the analyzed text.

To determine the number of content words, it is necessary to exclude all function words from the total number of words. In other words, the number of content words is calculated by subtracting the function words from the total volume of the text. The data obtained for calculating the lexical density of the texts are shown in Table 4.

Corpus	N	Nlex	Ld
A.P. Chekhov	143330	84913	59.24%
C. Garnett	184310	77424	42.01%
R. Pevear and L. Volokhonsky	169974	71958	42.33%

Table 4. Lexical density coefficient

The data obtained allow us to conclude that a decrease in both the total number of words and the proportion of function words in the text array entails an increase in lexical saturation. This trend is typical for all three analyzed text corpora. The highest lexical density is recorded in the corpus of source texts. The translated versions show a decrease of about 17 percentage points compared to the originals. C. Garnett's translations have a slightly lower lexical saturation than those of R. Pevear and L. Volokhonsky: the texts translated by the American specialists are denser by 0.32 percentage points. However, in our opinion, higher density is not always optimal, since excessive concentration of vocabulary can make the text difficult to understand. It should be noted that C. Garnett uses significantly more different words in her translations than her American colleagues.
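The lexical density formula Ld = Nlex / N × 100 can be checked against the values in Table 4 with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, and the token-level helper takes an illustrative function-word set, since the list used in the study is not given.

```python
def lexical_density(n_lex: int, n_total: int) -> float:
    """Lexical density in percent: Ld = Nlex / N * 100."""
    if n_total <= 0:
        raise ValueError("the total number of words must be positive")
    return n_lex / n_total * 100


def density_from_tokens(tokens, function_words) -> float:
    """Count content words by excluding function words from the total."""
    n_lex = sum(1 for token in tokens if token not in function_words)
    return lexical_density(n_lex, len(tokens))


# (N, Nlex) pairs as reported in Table 4.
corpora = {
    "Chekhov": (143330, 84913),
    "Garnett": (184310, 77424),
    "Pevear and Volokhonsky": (169974, 71958),
}

for name, (n_total, n_lex) in corpora.items():
    print(f"{name}: Ld = {lexical_density(n_lex, n_total):.2f}%")

# Toy example with an assumed function-word set.
sample = ["the", "lady", "walked", "with", "a", "dog"]
print(density_from_tokens(sample, {"the", "with", "a"}))
```

In the toy example three of the six tokens are function words, so the density is 50 percent, which shows how a higher share of articles, prepositions, and auxiliaries lowers the value.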

At the last stage of our research, a comparative analysis of translations was carried out in order to confirm or refute the results of the statistical calculations. Twenty verbs of sensory perception and mental activity with different frequencies of use were selected from the Russian-language corpus of texts. Among the verbs of mental activity, the most frequent verb turned out to be to think, for which C. Garnett used 22 translation options, while R. Pevear and L. Volokhonsky used 14. The most frequent correspondences in both translations are think, ponder, reflect, and brood. For the verbs to believe, to imagine, and to love, the translators offer the same number of English correspondences.

Let us consider some cases in which the translators made different decisions when translating the same verb: You work, you try, you suffer, you don't sleep at night, you keep thinking about how best. – a) One works and does one's utmost, one wears oneself out, getting no sleep at night, and racks one's brain what to do for the best ^[6]. b) You work, you do your utmost, you suffer, you don’t sleep, thinking how to do your best ^[7]. The correspondences used, rack one's brain and think, differ in that the first expression conveys the complexity and agony of the speaker's thought process. Thus, the translator manages to accurately convey the stylistic coloring of the sentence describing the emotional and mental state of the hero.

If I know that I am mentally ill, can I trust myself? – a) "If I know I am mentally affected, can I trust myself?" ^[6] b) “If I know that I am mentally ill, then can I believe myself?” ^[7]. In this example, both English verbs, trust and believe, denote faith in something or someone, but there is some difference between them: the first verb implies that the speaker trusts someone who is incapable of doing harm, while the second is used more to denote faith in the correctness of some cause. Thus, C. Garnett emphasizes that the hero is sincerely interested in whether he can trust himself and whether he will harm himself, whereas R. Pevear and L. Volokhonsky focus on the hero's thoughts about the correctness of his intentions.

He loves Liza very much and, apparently, she likes him... – a) He is very much in love with Liza, and she seems to like him… ^[6]. b) He loves Liza very much, and she apparently likes him… ^[7]. The correspondences be in love and love used by the translators both describe strong emotional attachment. However, there is a slight difference in the degree of expression of this feeling: the first correspondence describes the initial insane adoration of a person, while the second describes an already formed deep feeling of love.

In general, the analyzed examples show that C. Garnett's translations are distinguished by a wide variety of lexical units which, in addition to the main meaning, carry an additional component of meaning with emotional and emotive coloring. Thus, the translator provides readers with additional information about the emotions of the characters and about what is happening, which fully reveals the intention of the original author. The translations of R. Pevear and L. Volokhonsky are more literal and tend to use a single correspondence for a lexical unit across different contexts, which sometimes leads to an inaccurate interpretation of the meanings intended by the author.

Conclusion

At first glance, the difference in the parameters considered in this study may seem insignificant, but it should be borne in mind that we are dealing with two translations of the same corpus of texts into the same language, English. We believe that the territorial and temporal differences between the two translations led to the lexical and grammatical differences observed in this study. Even the slightest differences become important, because "it is precisely these small differences: the presence of differentiation in one case and narrowing of the spectrum, 'flattening' in the other, greater or lesser accuracy in choosing an equivalent, observance or non-observance of prohibitions on excessively frequent repetitions, etc., collectively create the impression that the reader takes out of the translation" ^{[13, p. 14]}. The linguistic analysis using statistical data and quantitative methods allowed us to conclude that the translations of R. Pevear and L. Volokhonsky surpass the translations of C. Garnett in quantitative parameters, which is reflected in the higher coefficient of lexical diversity of the American translators' corpus compared with the corpus of C. Garnett. Nevertheless, an in-depth linguistic and translation analysis revealed that C. Garnett uses much more diverse translation techniques and strategies, which makes it possible to convey the shades of meaning of the original text more accurately. This apparent contradiction is explained by the inverse relationship between lexical diversity and the volume of the analyzed text. C. Garnett's corpus of texts, being more voluminous, contains a greater number of repetitions, which reduces the coefficient of lexical diversity. A longer text inevitably repeats the same lemmas more often.

Moreover, the analysis demonstrated the effect of the grammatical properties of the language on the value of the coefficient of lexical diversity. Thus, the corpus of texts by R. Pevear and L. Volokhonsky is characterized by a lower frequency of function words. In turn, C. Garnett's corpus includes a larger number of non-repeating words, which indicates the breadth and diversity of her vocabulary.

In general, quantitative research is a valuable tool for studying the translation of a literary text, since it yields objective data and reveals objective patterns. However, it should be considered a complement to qualitative research methods and not a substitute for them, since it is the combination of both approaches that makes it possible to fully understand the deep meanings intended by the author of a literary text and to convey them adequately in translation.

The article is published in the version approved by the reviewers (after receiving a positive review recommending the manuscript for publication) with corrections made by the author (after receiving the editor’s comments, if any).

References

1. Köhler, R. (2005). Gegenstand und Arbeitsweise der Quantitativen Linguistik. In R. Köhler, G. Altmann, & R. G. Piotrowski (Eds.), Quantitative linguistics: An international handbook (pp. 1-16). de Gruyter.
2. Morozkina, E. A., Vorobyov, V. V., & Safina, Z. M. (2023). Statistical methods of research in literary translation. Reports of Bashkir University, 8(3), 130-137. https://doi.org/10.33184/dokbsu-2023.3.15
3. Zakharov, V. P., & Bogdanova, S. Yu. (2020). Corpus linguistics. St. Petersburg University Press.
4. Safina, Z. M., Kornilova, A. D., & Smakova, A. L. (2022). Quantitative and statistical analysis of lexical units in literary translation. Bulletin of Bashkir University, 27(3), 741-746. https://doi.org/10.33184/bulletin-bsu-2022.3.42
5. Chekhov, A. P. (n.d.). Stories. Retrieved October 1, 2025, from https://traumlibrary.ru/page/chehov_p.html
6. Chekhov, A. (2000). The essential tales of Chekhov (C. Garnett, Trans.). Ecco.
7. Chekhov, A. (2004). The complete short novels (L. Volokhonsky & R. Pevear, Trans.). Everyman's Library.
8. Rana, Y. (2019). Python: Simple though an important programming language. International Research Journal of Engineering and Technology (IRJET), 6(2), 1856–1858.
9. Safina, Z. M. (2024). Translation analysis of a literary text in Python. Global Scientific Potential, 11(164), 177-180.
10. Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
11. Gonçalves, L. L., & Gonçalves, L. B. (2006). Fractal power law in literary English. Physica A: Statistical Mechanics and its Applications, 360(2), 557-575. https://doi.org/10.1016/j.physa.2005.06.049
12. Castello, E. (2008). Text complexity and reading comprehension tests. Peter Lang.
13. Buzadzhy, D. M., & Lanchikov, V. K. (2011). Literalism and linguistic diversity: On the use of one method of corpus linguistics in translation studies. Mosty, 4, 12-27.
