Philology: scientific researches
Reference:
Zenkov A.V. The numbers reveal the author: a stylometric comparison of German-language modernist texts // Philology: scientific researches. 2024. № 11. P. 50-62. DOI: 10.7256/2454-0749.2024.11.72167 EDN: PDWIOX URL: https://en.nbpublish.com/library_read_article.php?id=72167
The numbers reveal the author: a stylometric comparison of German-language modernist texts
DOI: 10.7256/2454-0749.2024.11.72167
EDN: PDWIOX
Received: 01-11-2024
Published: 07-12-2024

Abstract: The present study pertains to stylometry (and, more broadly, to quantitative linguistics). A novel quantitative method for studying authorial style in literary texts, based on the statistics of the numerals they contain, is applied to literary texts in German. A computer program has been developed that searches the text for cardinal and ordinal numerals expressed both in digits and in words (in various word forms). The program automatically removes from the text idioms and fixed expressions that contain numerals incidentally (without the author's intention). Beforehand, the text is manually cleared of auxiliary numerals such as pagination, chapter numbers, etc. It is shown that the numerals used in a literary text are individual for each author; taken together, they form a characteristic feature (an authorial invariant, or "fingerprint") that distinguishes texts written by different authors. A comparative stylometric analysis is performed of a number of literary works by Thomas Mann, Hermann Broch, Robert Musil, and Elias Canetti, representatives of German-language literary modernism of the 20th century. Substantial authorial differences in the manner of using numerals were discovered. The results of the analysis were subjected to hierarchical clustering (the Manhattan metric; Complete linkage and Between-groups linkage methods). The cluster analysis correctly distributed the texts according to their authorship. The use of several clustering methods enhances the significance of the results obtained and confirms their non-random nature. This demonstrates that the novel stylometric method is able to attribute literary texts to their correct authors.

Keywords: stylometry, stylometric, quantitative linguistics, attribution of texts, authorship of texts, numerals in texts, T. Mann, H. Broch, R. Musil, E. Canetti

Introduction
The present study has two main objectives: firstly, to provide new examples in support of our approach to the problems of stylometry [1–8], and secondly, on the basis of this approach, to conduct a quantitative analysis of works by T. Mann, H. Broch, R. Musil, and E. Canetti, classics of 20th-century German-language modernist literature.

Stylometry (and, more broadly, quantitative linguistics) still does not have a completely satisfactory universal working method [9, 10]. Some studies consider the frequencies of content and function words (prepositions, conjunctions) or average word and sentence lengths; others compare, in a pair of texts, the most frequently used words common to both (the well-known "Burrows's Delta" [11]) or even letter combinations (oddly enough, the latter approach often gives good results). Unfortunately, different methods often lead to conflicting conclusions, so it is more reliable to use several methods together. Promising results have been obtained using neural networks, and it seems that artificial intelligence will soon be able to solve problems in quantitative linguistics successfully [12]. Nevertheless, meaningful interpretation of the results within this approach remains problematic, since the method itself is a black box.

The study of apocrypha (starting with the biblical [13] and Shakespearean [14]), cases of dubious authorship (M. Ageyev [2, 15], B. Traven [16]), fictitious authorship (Émile Ajar [17]), and forged memoirs (Misha Defonseca [18]) are examples of tasks in which stylometric methods can be useful.

We have developed an original stylometric method for analyzing texts based on the authors' use of numerals [1–8]. Among the content words, numerals are by their nature the easiest to quantify.
With regard to literary texts, whose content is not rigidly tied to real-life events but is generated by free imagination, it is natural to assume that the use of numerals is associated with the author's psychological traits, which imperceptibly influence the result of the creative work. Consequently, the manner of using numerals is an author-specific feature (a fingerprint), which, under certain circumstances, makes it possible to solve the problem of authorship attribution.

Note that, unlike all the methods listed above, the analysis of numerals is almost independent of the translation of the text into another language (the structure of the language may have a slight effect on the statistics of numerals: in the English phrase tenth anniversary a numeral will be found, while in its German equivalent zehnjähriges Jubiläum it will not). This makes it possible, when the original text is unavailable, to use an accessible translation, as well as to compare quantitatively the texts of authors who wrote in several languages (A. Strindberg, S. Beckett, V.V. Nabokov, ...).

The study of the works of several dozen authors in Russian, Czech, and English revealed tangible authorial features in the use of numerals, as well as the influence of genre, style, and artistic movement on them [1–8]. Thus, the results of the analysis allow for a meaningful philological interpretation.

We have now developed a computer program that identifies numerals in German-language texts, and in this work German-language literary texts are the object of study for the first time. We will analyze some works by Thomas Mann (1875–1955), Hermann Broch (1886–1951), Robert Musil (1880–1942), and Elias Canetti (1905–1994) from the point of view of the use of numerals.

T. Mann is recognized as one of the most prominent representatives of German literary modernism (with all the vagueness of this concept) [19–24].
In Austria, the same could be said of Musil [24–33] (less well known to the general public and less prolific as a writer, but comparable to Mann in the artistic merits of his works) and Broch [24, 33–37] (author of prose, poetry, and philosophical and political essays). Their younger contemporary Canetti, who is considered more of a postmodernist, was distinguished by the versatility of his work: from a novel, plays, and a literary autobiography to the extensive compilatory treatise “Crowds and Power”, which claims to be scientific [38–43].

Hardly anything can be added to the existing literary-critical examination of the works of these writers. In this paper, we apply to their texts a formal quantitative approach, which, to our knowledge, has not been done before.

Method and objects of research
Our computer program scans the German-language text for cardinal and ordinal numerals, expressed both in digits and in words in various forms. The program automatically eliminates idiomatic expressions (im siebenten Himmel) and fixed phrases (die fünfte Kolonne) that contain numerals only incidentally. Numerals not related to the authors’ creative ideas, such as page and chapter numbering or itemizations 1), 2), 3), etc., were manually deleted from the text beforehand.

The following texts were analyzed:

T. Mann:
· Königliche Hoheit («Royal Highness»), 1909 – novel;
· Bekenntnisse des Hochstaplers Felix Krull («Confessions of Felix Krull»), 1922–54 – novel;
· Der Zauberberg («The Magic Mountain»), 1924 – novel;
· Lotte in Weimar, 1939 – novel;
· Doktor Faustus («Doctor Faustus»), 1947 – novel;
· Erzählungen – a collection of short stories [44] including Herr und Hund, Der Knabe Henoch (Fragment), Die vertauschten Köpfe, Die Betrogene, Fiorenza, Gesang vom Kindchen.

H. Broch:
· Die Schlafwandler («The Sleepwalkers»), 1932 – novel;
· Die Entsühnung («The Atonement»), 1933 – play;
· Die Verzauberung («The Spell»), published 1976 – novel;
· Gedichte – the complete collection of poems [45].

R. Musil:
· Die Verwirrungen des Zöglings Törless («The Confusions of Young Törless»), 1906 – novel;
· Der Mann ohne Eigenschaften («The Man without Qualities»), 1932 – novel.

E. Canetti:
· Masse und Macht («Crowds and Power»), 1962 – non-fiction prose;
· Die gerettete Zunge («The Tongue Set Free»), 1977 – fictionalized autobiography.
Some numerical characteristics of the texts are presented in Table 1. The choice of texts for analysis was influenced by their availability for free download on the Internet. Unfortunately, some important works were not available, even though the copyright protection period for most of them had long expired.
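The search procedure described in this section can be illustrated with a minimal Python sketch. It is not the program used in the study: the lexicons, the list of endings, and the function names are assumptions made for illustration, and only the two fixed phrases quoted above are filtered out. Compound numerals (einundzwanzig) and many word forms are deliberately left out.

```python
import re
from collections import Counter

# ASSUMPTION: tiny illustrative lexicons; a real program needs complete lists.
CARDINALS = {"ein": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
             "sieben": 7, "zehn": 10}
ORDINAL_STEMS = {"erst": 1, "zweit": 2, "dritt": 3, "viert": 4, "fünft": 5,
                 "siebent": 7, "siebt": 7, "zehnt": 10}
ENDINGS = ("e", "en", "em", "er", "es")  # adjectival inflection endings
# Only the two fixed phrases quoted in the text; the full stop-list is unknown.
IDIOMS = ("im siebenten himmel", "die fünfte kolonne")


def numeral_value(token):
    """Return the numeric value of a lower-case token, or None."""
    if token.isdigit():
        return int(token)
    if token in CARDINALS:
        return CARDINALS[token]
    for ending in ENDINGS:
        if token.endswith(ending):
            stem = token[: -len(ending)]
            if stem == "ein":            # eine, einen, einem, einer, eines
                return 1
            if stem in ORDINAL_STEMS:    # erste, zweiten, dritter, ...
                return ORDINAL_STEMS[stem]
    return None


def count_numerals(text):
    """Count numerals in a text after removing the listed fixed phrases."""
    lowered = text.lower()
    for idiom in IDIOMS:
        lowered = lowered.replace(idiom, " ")
    tokens = re.findall(r"[0-9]+|[a-zäöüß]+", lowered)
    values = (numeral_value(t) for t in tokens)
    return Counter(v for v in values if v is not None)


sample = ("Im siebenten Himmel fand er zwei Briefe, drei Bücher "
          "und am 5. Mai die fünfte Kolonne.")
print(count_numerals(sample))
```

Like the program described here, the sketch counts every form of ein as the numeral 1, so it cannot separate the numeral from the indefinite article.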
Results
For a primary assessment of the similarities and differences in the authors' use of numerals, we calculated the inverse density of numerals for each text (the volume of the text divided by the number of numerals it contains). The lower the inverse density, the more often numerals appear in the text.

Noteworthy is the significantly lower inverse density of Musil’s texts (which turned out to coincide to within tenths in the two works analyzed!) compared with the texts of the other authors: Musil resorts to numerals more often. As for Canetti's texts, with their very different densities of numerals, this result provides a preliminary answer to a question discussed in literary criticism: whether Crowds and Power should be classed as fiction or as a text modeled on scientific ones. It is indeed a text that claims to be scientific. We will return to this issue later. The texts by Mann and Broch differ only slightly in the inverse density of numerals. Poems (by Broch), quite expectedly, have the highest inverse density: in poetry, numerals are less common than in prose.

After examining the overall use of numerals in the texts, we proceeded to a separate account of each numeral. The differences in the authors' use of numerals become apparent when we apply hierarchical cluster analysis [46, 47], which groups objects (here: texts) into clusters based on their similarity; in our case, this is the similarity of the absolute frequencies of the numerals 1, 2, 3, 4, 5 in the texts (these numerals are present in all analyzed texts without exception, whereas larger numerals occur with gaps). Since the texts differ significantly in volume (see Table 1), correction factors had to be introduced to make the frequencies comparable. Mann's Der Zauberberg served as the reference text for comparison.
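In symbols, the two normalisations used here can be written as follows. The notation is introduced only for this sketch and is not the article's own: $V_T$ is the volume of text $T$ (in the units of Table 1), $N_T$ the number of numerals found in it, $f_T(m)$ the absolute frequency of the numeral $m$, and $R$ the reference text.

```latex
\[
d_T = \frac{V_T}{N_T}, \qquad
k_T = \frac{V_R}{V_T}, \qquad
\tilde f_T(m) = k_T \, f_T(m), \quad m = 1, \dots, 5 .
\]
```

Here $d_T$ is the inverse density, $k_T$ the correction factor, and $\tilde f_T(m)$ the corrected frequency that enters the cluster analysis.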
Therefore, for example, for Königliche Hoheit the frequencies were multiplied by 2,075,077 / 751,961 = 2.76, and for Der Mann ohne Eigenschaften by Musil by 2,075,077 / 4,437,225 = 0.47.

The measure of similarity in cluster analysis is the metric ρ ("distance"): the smaller the "distance" between objects, the more similar they are. We applied the Manhattan metric

$$\rho(x, y) = \sum_{i=1}^{n} |x_i - y_i|,$$

where x and y are n-dimensional vectors whose components are the corrected absolute frequencies of the first n natural numbers occurring in the two analyzed texts (here n = 5). In the clustering process, we used the far neighbor method, also known as the Complete linkage method [48], which leads to the formation of compact, isolated clusters.

In the initial phase, we grouped together only the literary works by Mann, Broch, and Musil. They were reasonably distributed into clusters according to authorship (Figure 1). Conclusions:
1) The uniqueness of Musil's writings is confirmed. It now becomes clear, however, which numeral is responsible for the high frequency of numerals. It is ein ("one") in its various word forms; unfortunately, in German it cannot be distinguished formally or semantically from the indefinite article. Our program has counted all instances of ein appearing in the text.
2) In general, the texts by Mann and Broch do not differ greatly in the use of specific numerals.
3) Our approach to the problems of stylometry is based on the assumption that each writer has an individual manner of using numerals; the alternation of Mann and Broch microclusters would seem to contradict this. However, firstly, there is no universal stylometric method that perfectly distributes texts according to authorship; secondly, these microclusters merge into an intermediate cluster at a considerable height (10 in Figure 1), which is still 2.5 times lower than the height at which the final supercluster (including Musil's texts) is formed.
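The rescaling and the Manhattan distance used for Figure 1 can be sketched in Python as follows. The two volumes are those quoted in the text; the raw counts and the function names are invented for illustration and are not taken from Table 1.

```python
def correction_factor(reference_volume, text_volume):
    """Factor that rescales the counts of a text to the reference volume."""
    return reference_volume / text_volume


def manhattan(x, y):
    """Manhattan (city-block) distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Volumes quoted in the text.
ZAUBERBERG = 2_075_077           # reference text
KOENIGLICHE_HOHEIT = 751_961

# ASSUMPTION: made-up counts of the numerals 1..5, for illustration only.
raw_reference = [9000, 700, 400, 150, 100]
raw_other = [3300, 240, 150, 60, 30]

k = correction_factor(ZAUBERBERG, KOENIGLICHE_HOHEIT)
corrected_other = [k * count for count in raw_other]
distance = manhattan(raw_reference, corrected_other)
print(f"k = {k:.2f}, distance = {distance:.1f}")
```

The reference text itself has a correction factor of 1, so its counts enter the distance unchanged.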
How stable is the dendrogram structure with respect to the addition of new texts by other authors? We now add the texts of the fourth author, E. Canetti, and re-run the clustering (Figure 2). Based on Figure 2, we can make several observations:
1) The general appearance of the dendrogram has not changed much (the program has only reordered the low-level clusters).
2) The two texts by Canetti were clustered not just separately, but in branches of the dendrogram that merge only at the maximum height. This indicates a fundamental distinction between the texts: while Die gerettete Zunge adheres to the conventions of fiction, Masse und Macht does not. The abundance of factual numbers places this text in a shared cluster with Musil's texts, albeit at a considerable height (a further remark on Canetti's works follows below).
3) Adding Canetti’s texts literally loosens the dendrogram: the heights of merging increase (note that the maximum height is always normalized to 25).

Additional information about the authors' use of numerals can be derived from Figure 3, which shows a fragment of the frequency distribution of numerals from the range [1; 30] in some works of the authors under consideration:
1) The frequency of numerals decreases rapidly as the numerals increase.
2) Local maxima are observed at the round numbers 10, 20, 30, and so on. They can be explained by the well-known psychological preference for "round" numbers.
3) The differences between the texts by Mann and Broch become noticeable: Mann's texts show a greater variety and frequency of numerals (with the exception of the numeral ein (one), which, however, can also be an article, as noted above).

In our opinion, the most significant indicator of an author's desire for subjective "accuracy" of the narrative is the inclusion of specific dates in the text. By this indicator, Canetti's works lead all the analyzed texts.
Although they have little in common in their overall use of numerals, they are very close in the abundance of dates found in them.

The choice of metric and clustering method cannot be definitively justified, yet it can have a substantial impact on the outcome of clustering. We therefore clustered the texts of the same authors as in Figure 1, this time using the group average method (Between-groups linkage) [47] instead of the far neighbor method, still with the Manhattan metric (Figure 4). In our case, the results were quite consistent, and all the conclusions remained valid. Even when we used other combinations of metrics and clustering techniques, the dendrogram changed only slightly.
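To make the comparison of the two linkage rules concrete, here is a small pure-Python sketch of agglomerative clustering with the Manhattan metric. It is an illustration under stated assumptions, not the software used in the study: the function names are ours, the four toy vectors are invented, and the merge heights are not rescaled to 25.

```python
from itertools import combinations


def manhattan(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cluster_distance(a, b, vectors, linkage):
    """Distance between clusters a and b (tuples of item names)."""
    pairwise = [manhattan(vectors[i], vectors[j]) for i in a for j in b]
    if linkage == "complete":    # far neighbor: largest pairwise distance
        return max(pairwise)
    if linkage == "average":     # between-groups: mean pairwise distance
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(vectors, linkage="complete"):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        height, a, b = min(
            ((cluster_distance(a, b, vectors, linkage), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# ASSUMPTION: invented two-dimensional vectors standing in for texts.
toy = {"A1": [10, 5], "A2": [11, 5], "B1": [30, 20], "B2": [31, 22]}
for method in ("complete", "average"):
    for a, b, height in agglomerate(toy, method):
        print(method, a, b, height)
```

On this toy data both rules produce the same sequence of merges and differ only in the heights at which clusters join, which is the kind of agreement reported for Figures 1 and 4.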
Table 1 Occurrence of numerals in the studied texts
Conclusions
The approach to stylometric problems that we are developing, based on the analysis of numeral statistics in texts, demonstrates high efficiency and sensitivity despite its simplicity. Texts by T. Mann, H. Broch, R. Musil, and E. Canetti, which had so far been analyzed only by traditional descriptive philological methods, were subjected to a formal stylometric analysis for the first time. The analysis correctly distributed the texts by author and revealed some features of literary style. Appreciable authorial differences in the manner of using numerals were discovered. The use of several clustering methods enhances the significance of the results obtained and confirms their non-random nature. The method is suitable for text attribution.
Figure 1 – The result of applying hierarchical cluster analysis to the texts by T. Mann, H. Broch, and R. Musil (clustering uses the far neighbor method and the Manhattan metric). The horizontal axis indicates the "distance" in arbitrary units
Figure 2 – The result of applying hierarchical cluster analysis to the texts by T. Mann, H. Broch, R. Musil, and E. Canetti (clustering uses the far neighbor method and the Manhattan metric – the same as in Figure 1). The horizontal axis indicates the "distance" in arbitrary units
Figure 3 – A fragment of the frequency distribution of numerals from the range [1; 30] in some works by T. Mann, H. Broch, R. Musil, and E. Canetti. The vertical axis shows the frequency of numerals after introducing correction factors to account for the different sizes of texts. The axis is broken to save space
Figure 4 – The result of applying hierarchical cluster analysis to the texts by T. Mann, H. Broch, and R. Musil (unlike Figure 1, the clustering uses the group average method, but still the Manhattan metric). The horizontal axis indicates the "distance" in arbitrary units
References
1. Zenkov, A. V. (2017). The new stylometry method based on the statistics of numerals. Computer research and modelling, 5, 837–850.
2. Zenkov, A. V. (2018). A Method of Text Attribution Based on the Statistics of Numerals. J. of Quantitative Linguistics, 25(3), 256–270.
3. Zenkov, A. V., & Místecký, M. (2019). The Romantic Clash: Influence of Karel Sabina over Mácha’s Cikáni from the Perspective of the Numerals Usage Statistics. Glottometrics, 46, 12–28.
4. Zenkov, A. V. (2021). Stylometry and Numerals Usage: Benford’s Law and Beyond. Stats, 4, 1051–1068.
5. Zenkov, A., & Místecký, M. (2022). Young Vladimír Vašek? – A Numerals Analysis Contribution to the Bezruč−Hrzánský Identity Issue. Naše řeč, 105(3), 151–161.
6. Zenkov, A. V. (2023). Literary mystifications and the author's use of numerals. Philological sciences. Theoretical and practical issues, 16(11), 3696–3709. Retrieved from https://doi.org/10.30853/phil20230568
7. Zenkov, A. V. (2023). Under a False Flag: Literary Hoaxes and the Use of Numerals. Litera, 10, 86–109. doi:10.25136/2409-8698.2023.10.68743 Retrieved from http://en.e-notabene.ru/fil/article_68743.html
8. Zenkov, A. V., & Ermakov, N. E. (2023). Numerals in texts as a characteristic peculiarity of the author's style. Russian Linguistic Bulletin, 45(9). Retrieved from https://doi.org/10.18454/RULB.2023.45.28
9. Stamatatos, E. (2009). A survey of modern authorship attribution methods. J. Amer. Soc. for Information Science and Technology, 60(3), 538–556.
10. Neal, T., Sundararajan, K., Fatima, A., Yan, Y., Xiang, Y., & Woodard, D. (2017). Surveying Stylometry Techniques and Applications. ACM Comput. Surv., 50(6), Article 86, 36 pages.
11. Burrows, J. (2002). Delta: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, 17(3), 267–287.
12. La Inteligencia Artificial ayuda a descubrir una obra desconocida de Lope de Vega en los fondos de la BNE, Biblioteca Nacional de España. Retrieved from https://www.bne.es/es/noticias/inteligencia-artificial-ayuda-descubrir-obra-desconocida-lope-vega-fondos-bne
13. Schröter, J. (2020). Die apokryphen Evangelien: Jesusüberlieferungen außerhalb der Bibel. Munich: C. H. Beck.
14. Vickers, B. (2002). 'Counterfeiting' Shakespeare: Evidence, Authorship and John Ford's Funerall Elegye. Cambridge: Cambridge University Press.
15. Sorokina, M. Yu., & Superfin, G. G. (1994). ‘There was a writer Ageyev’ ...: a version of the fate or about the benefits of naive biographism. In: The past: Historical almanac, vol. 16. Moscow, St. Petersburg: Phoenix-Athenaeum, pp. 265–289.
16. Dammann, G. (ed.) (2012). B. Traven, Autor – Werk – Werkgeschichte. Würzburg: Königshausen & Neumann.
17. Bellos, D. (2010). Romain Gary: A Tall Story. London: Harvill Secker.
18. Hupertz, H. (2021). Wie eine Frau sich als Holocaust-Überlebende ausgab. Frankfurter Allgemeine, 23 November. Retrieved from https://www.faz.net/aktuell/feuilleton/medien/frau-gab-sich-als-holocaust-ueberlebende-aus-dokumentation-bei-arte-17646920.html
19. Arnold, H. L. (1976). Thomas Mann. München: Edition Text u. Kritik.
20. Travers, M. (1992). Thomas Mann. London: Macmillan Education.
21. Thomas Mann-Handbuch: Leben – Werk – Wirkung, A. Blödorn, F. Marx (Eds.). Verlag J.B. Metzler Stuttgart, Springer-Verlag Berlin Heidelberg, 2015. Retrieved from https://doi.org/10.1007/978-3-476-05341-1
22. Thomas Mann: neue kulturwissenschaftliche Lektüren, S. Börnchen, G. Mein, G. Schmidt (Eds.). Wilhelm Fink Verlag, 2012.
23. C. Grawe, Sprache im Prosawerk. Beispiele von Goethe, Fontane, Thomas Mann, Bergengruen, Kleist und Johnson. Bonn: Bouvier Verlag Herbert Grundmann, 1987.
24. Dowden, S. D. (1986). Sympathy for the abyss: a study in the novel of German modernism: Kafka, Broch, Musil, and Thomas Mann. Tübingen: Niemeyer.
25. Nübel, B. (2006). Robert Musil – Essayismus als Selbstreflexion der Moderne. Berlin, New York: De Gruyter. Retrieved from https://doi.org/10.1515/9783110201857
26. Nübel, B. and Wolf, N. Ch. Robert-Musil-Handbuch. Berlin, Boston: De Gruyter, 2016. Retrieved from https://doi.org/10.1515/9783110255577
27. Boelderl, A. R. and Neymeyr, B. Robert Musil im Spannungsfeld zwischen Psychologie und Phänomenologie. Berlin, Boston: De Gruyter, 2024. Retrieved from https://doi.org/10.1515/9783110988352
28. H. Bloom, Robert Musil's The Man Without Qualities. Chelsea House Publishers, 2005. ISBN: 9780791081228. 211 p.
29. J. Bouveresse, La Voix de l'âme et les Chemins de l'esprit. Dix études sur Robert Musil. Éditions du Seuil, 2001. ISBN: 9782020362894. 462 p.
30. A Companion to the Works of Robert Musil, P. Payne, G. Bartram, and G. Tihanov (Eds.). Camden House, Rochester, New York, 2007. ISBN: 978–1–57113–110–2. 472 p.
31. P. Payne, Robert Musil’s ‘The Man Without Qualities’: A Critical Study. Cambridge University Press, 1988. ISBN: 978-0-521-11060-0. 271 p.
32. Th. Sebastian, The Intersection of Science and Literature in Musil's The Man Without Qualities. Camden House, an imprint of Boydell & Brewer Inc., Rochester, 2005. ISBN: 1–57113–116–7. 159 p.
33. F. Schwarzwälder, Der Weltanschauungsroman 2. Ordnung: Probleme literarischer Modellbildung bei Hermann Broch und Robert Musil. transcript Verlag, Bielefeld, 2019. ISBN: 978-3-8376-4996-3. 372 S.
34. A Companion to the Works of Hermann Broch, G. Bartram, S. McGaughey and G. Tihanov (Eds.). Camden House, an imprint of Boydell & Brewer Inc., Rochester, 2019. ISBN: 9781571135414. 290 p.
35. Hermann-Broch-Handbuch: Zeit – Werk – Forschung, M. Kessler, P. M. Lützeler (Eds.). De Gruyter, 2015. ISBN: 978-3110200713. 685 S.
36. Wohlleben, D. & Lützeler, P. M. (Eds.). Hermann Broch und die Romantik. Berlin, Boston: De Gruyter, 2014. Retrieved from https://doi.org/10.1515/9783110351958
37. Hermann Broch, Visionary in Exile, The 2001 Yale Symposium, P. M. Lützeler, M. Konzett and W. Riemer (Eds.). Camden House, an imprint of Boydell & Brewer Inc., Rochester. ISBN: 9781571132727. 280 p.
38. W. C. Donahue, The End of Modernism: Elias Canetti’s Auto-da-Fé. The University of North Carolina Press, 2001. ISBN: 978-1-4696-5742-4. 302 p.
39. J. P. Arnason and D. Roberts, Elias Canetti's Counter-Image of Society: Crowds, Power, Transformation. Camden House, an imprint of Boydell & Brewer Inc., Rochester, 2004. ISBN: 9781571131607. 174 p.
40. A Companion to the Works of Elias Canetti, D. C. G. Lorenz (Ed.). Camden House, an imprint of Boydell & Brewer Inc., Rochester, 2004. ISBN: 9781571134080. 364 p.
41. J. S. McClelland, The Crowd and the Mob: From Plato to Canetti. Unwin Hyman Ltd, 2011. ISBN: 9780415602495. 356 p.
42. B. Neumann, G. Wimmer, Elias Canetti in seiner Zeit: Kulturelle, wissenschaftliche und politische Deskriptionen. J.B. Metzler, ein Imprint des Springer-Verlages, 2020. ISBN: 978-3-476-05649-8. 264 S.
43. Radaelli, G. (2011). Literarische Mehrsprachigkeit: Sprachwechsel bei Elias Canetti und Ingeborg Bachmann. Berlin: Akademie Verlag. Retrieved from https://doi.org/10.1524/9783050053592
44. Th. Mann, Die Erzählungen, Zweiter Band. Fischer Taschenbuch Verlag GmbH, Frankfurt am Main, 1979.
45. H. Broch, Gedichte. Kommentierte Werkausgabe, Band 8. Suhrkamp, Frankfurt, 1980. ISBN: 978-3518370728. 244 S.
46. Moisl, H. (2015). Cluster Analysis for Corpus Linguistics. De Gruyter Mouton. ISBN: 9783110350258.
47. Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics. doi:10.1137/1.9780898718348