Historical informatics
Reference:
Sokolov Y.V.
Discussions about the Russian Revolution of 1917 on the Web: methodological approaches to the study of web forums as a historical source
// Historical informatics.
2023. No. 2.
P. 19-33.
DOI: 10.7256/2585-7797.2023.2.40601 EDN: AFQXIL URL: https://en.nbpublish.com/library_read_article.php?id=40601
Discussions about the Russian Revolution of 1917 on the Web: methodological approaches to the study of web forums as a historical source
DOI: 10.7256/2585-7797.2023.2.40601
EDN: AFQXIL
Received: 27-04-2023
Published: 08-06-2023

Abstract: The subject of the study is a methodology for analyzing the electronic content of social networks (forums) as a historical source. Discussions of the 1917 revolution during its centenary served as the material for analysis. The aim of the study was to test approaches to working with large arrays of online texts and the possible combination of two such approaches: quantitative analysis tools (distant reading) and traditional methods of working with a historical text (slow reading). As part of the distant reading, topic modeling is performed with the LDA (latent Dirichlet allocation) and LSA (latent semantic analysis) algorithms in the R programming environment, in RStudio (version 4.2.1). During the slow reading, the entire text is analyzed directly. The novelty of the research lies in applying topic modeling in the R programming environment to sources in conjunction with classical methods of analyzing historical texts. The study tests a methodology for analyzing the content of social networks (forums) that is aimed at substantial arrays of text that are physically impossible to read in full, or even in significant part, using exclusively traditional means of working with a corpus of sources. A step-by-step research algorithm is proposed. The researcher first analyzes the text by distant reading methods, identifying topics that consist of terms (words). Then, using these keywords, the researcher finds the text fragments in which the identified topic was discussed most actively and analyzes those fragments in more detail using traditional methods of working with a text source.
A possible way to improve how well the LDA algorithm identifies the topics a researcher needs in social networks and forums is also proposed: splitting a large text beforehand and then analyzing the fragments with LDA as separate documents.

Keywords: digital sources, online forums, distant reading, latent semantic analysis, latent Dirichlet allocation, topic modeling, historical information science, public history, web forum, online text

This article is automatically translated.

Scientific relevance. The increasing role of digitalization and the development of network technologies have led to the emergence of a completely new type of historical source, the digital source, which forces us to look for new methodological solutions when working with online texts [1, p. 62]. Approaches to working with sources formed in network communication remain debatable. On the one hand, online communication is still a classic text, which implies reading it and drawing conclusions using methods traditional for historians. On the other hand, the scale of online discussions and the opportunities created by digital processing of materials make it necessary, in our opinion, to use "distant reading" methods, given the huge volumes of online texts. By "distant reading" in this work we mean the analysis of texts using, among other things, quantitative methods and information technologies, as opposed to "slow reading", which consists in direct acquaintance with the text. The terms "slow reading" and "distant reading" used in our article are explained in the work of F. Moretti [2]. The problem of network source studies was considered in the article by M. S.
Kornev [1], according to whom "the modern theory and practice of domestic source studies does not give a complete and sufficiently clear picture of working with sources in modern conditions of development of information and communication technologies (ICT), digitalization and new media" [1, pp. 60-61]. There is thus a clear problem of synthesizing the various source-study approaches to text, both traditional ones and those that appeared together with digital text, and of finding optimal methodological solutions for working with online sources.

The purpose of the article is to test approaches to working with a large array of online texts and the possible combination of two such approaches, "slow reading" and "distant reading". As material for analysis, we took online discussions of the 1917 revolution during the centenary of this historical event.

Sources of research. Internet sources today have a significant impact on the formation of public opinion about historical events, even greater than that of professional publications [3, p. 57]. Online content is the source, or one of the sources, in a number of historical studies [4-22]. Although researchers do work with digital texts that exist online, their main tool is still "slow reading". "Distant reading" is the use of information technology to work with text. When researchers of Internet content use it at all, they do so mainly as a quantitative analysis of word frequencies in posts and messages, not as topic modeling in a programming environment, which in our opinion is promising and can identify the topics of discussion more accurately than a simple word-frequency analysis.

One of the most promising such sources is the web forum. Two important advantages are noted:

1.
The specifics of the data on everyday historical knowledge about the past, which is formed without interference from the researcher [13, p. 142].
2. The source reflects many areas of people's everyday interaction, from discussing news to interpersonal relationships. Historical knowledge thus appears here in the context of cumulative everyday knowledge, which makes it possible to see which aspects of social life it is associated with. Such an approach can give an understanding of the specifics of everyday historical knowledge [13, p. 143].

As material for testing the methodology, we chose discussions of the 1917 revolution in the year of its centenary. In 2017 there was an expected surge of interest in this topic [23] (see Figure 1).

Fig. 1. Dynamics of the query "revolution of 1917" in 2013-2023 according to Google Trends

Empirical data and limitations. For the study, we selected four forums that appeared in the results of the Yandex search engine for the query "revolution of 1917" AND Forum and that contain a discussion, held in 2017, of the revolutionary events of 1917, either in a dedicated thread created on the forum (two forums) or within a large group of messages (two more forums). The search for Internet sources for research is an independent and certainly very important problem; one promising solution may be the use of neural networks (such as ChatGPT). However, since this article is concerned with methods of processing sources, that problem is not considered here.

An important stage in the study of Internet sources is assessing the representativeness of the data used. It is determined first of all by the methods used to select Internet content, but it is also related to the methods used to process the empirical data. To what extent do quantitative approaches dominate among research approaches?
The problem of the representativeness of qualitative research (in our case, the qualitative research is a "slow reading" of the material) cannot be solved unambiguously using only quantitative estimates. Such an approach is neither universal nor exhaustive, and the representativeness of a sample can be verified not only mathematically but also empirically [24, pp. 24-25].

In some cases the thread on the 1917 revolution was created in the centenary year, 2017; often, however, the thread had been created much earlier and became active again in 2017. The texts were downloaded from the forums using the rvest package.

1. "Strategium.ru" is the largest Russian-language resource about strategy computer games whose plots are based on historical events (for example, the Hearts of Iron, Victoria and Europa Universalis series). The forum has a special section for discussing historical issues, which contains a thread on the 1917 revolution [25] consisting of about 400 messages written by more than 20 participants.
2. "History forum" is a forum on historical topics; one of its threads is devoted to the 1917 revolution [26]. The discussion was conducted mainly between two users and contains about 50 messages.
3. "Playground" is a forum about computer games. It has no specialized historical section; historical issues (including the 1917 revolution) are discussed in the "Chatterbox" section [27]. The group of messages about the revolution contains about 100 messages from 20 participants.
4. "Igromania" is also a forum about computer games; the 1917 revolution is discussed in the "Gazebo" section [28]. About 150 messages from about 10 participants were devoted to the topic.

Research methodology. The messages about the 1917 revolution written in 2017 are examined first by the methods of "distant reading" and then by "slow reading". The phrase "distant reading" was introduced by F.
Moretti by analogy with the concept of "slow" or "close reading", an approach to the study of literature used by the American "New Critics" in the 1940s-1970s. Although the New Criticism in its pure form has long lost its relevance, its main ideas are firmly rooted in English literary studies (and, for various reasons, are also characteristic of literary studies in many other countries): the literary critic's main attention should go to "great" works, whose important meanings should be extracted through careful "peering" into individual texts or even fragments of texts. All the articles in "Distant Reading" are in one way or another directed against these ideas: according to Moretti, literature should be studied not by looking into details but by looking from a "distance", which means covering not one or several works but hundreds and thousands of texts [2, p. 11]. This discussion about reading methods took place mainly in literary studies. In our opinion, however, the binary opposition between "close" reading of small passages and the reading of huge volumes of text using various methodological techniques is also relevant to the study of textual sources in the humanities.

In our article, the methods of "distant reading" are used for preliminary identification of the topics of the texts and to search for possible similarities between discussions on different forums. Then, using "slow reading", we finally determine the topics of the discussions and analyze them in more detail. According to our hypothesis, preliminary identification of topics during the "distant reading" makes it possible to work more effectively with the text at the "slow reading" stage by paying attention to keywords and topics.

The "distant reading" in this paper comprised the LDA (latent Dirichlet allocation) and LSA (latent semantic analysis) algorithms implemented in the R programming environment, in RStudio (version 4.2.1).
In historical research, the LSA algorithm was used in the work of A. V. Kuznetsov [29]. The researcher also published on GitHub a methodology for applying the LSA and LDA algorithms in the R programming environment, using his articles as examples [30]. A methodology of topic modeling with the LDA algorithm is also described on the online resource tidytextmining.com [31].

Before the "distant reading", stop words were removed from the texts with the tm package. Stop words are words that carry no semantic load in the text, such as prepositions or pronouns, as well as words that carry no information in the context of this study ("event", "century", etc.). We extended the list of stop words in the course of the analysis. The texts were also lemmatized and converted to lowercase, and numbers and extra spaces were removed.

"Distant reading". We analyze the texts by "distant reading" methods, using latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).

Latent Dirichlet allocation (LDA) is one of the methods of topic modeling. The method is based on the idea that a text consists of topics and topics consist of words. One topic differs from another in the probabilities with which particular words occur in it, and the LDA algorithm captures this difference. Running the algorithm yields a set of topics, each of which is a list of words ranked by the probability that the word belongs to the topic. The input to LDA is a term-document matrix whose columns correspond to documents and whose rows correspond to terms (words). A fragment of the matrix is shown in Table 1.

Table 1. Term-document matrix (fragment)
The values in the matrix indicate how often each term occurs in each document. From this matrix the algorithm identifies topics through a series of mathematical operations. We ran the LDA algorithm on the matrix to determine, for each forum, the keywords characterizing its topic. We therefore searched for four topics, matching the number of documents in the collection (one document corresponds to one forum). The resulting set of topics is shown in Fig. 2.
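The preprocessing and the term-document matrix described above can be sketched in a few lines. The study itself worked in R with the tm package; the Python below is only an illustrative stand-in that uses the standard library, and the stop-word list, function names and sample texts are hypothetical. Lemmatization is omitted because the standard library has no lemmatizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list: function words plus study-specific words
# such as "event" and "century" that carry no information here.
STOP_WORDS = {"the", "of", "and", "in", "a", "was", "event", "century"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop numbers and punctuation, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)          # remove numbers
    tokens = re.findall(r"[^\W\d_]+", text)   # keep runs of letters only
    return [t for t in tokens if t not in stop_words]


def term_document_matrix(docs):
    """Rows are terms, columns are documents, values are term counts."""
    counts = {name: Counter(preprocess(text)) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    matrix = {term: [counts[name][term] for name in docs] for term in terms}
    return terms, matrix


# Hypothetical sample texts, one document per forum.
sample_forums = {
    "forum_a": "The revolution of 1917 and the abdication of Nicholas",
    "forum_b": "Lenin and the revolution: the debt in 1917",
}
terms, matrix = term_document_matrix(sample_forums)
```

Each row of `matrix` holds one term's counts in the order in which the documents were supplied, which mirrors the layout of Table 1.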
Fig. 2. Topics identified by the LDA algorithm (2 topics)

To correlate each topic with a particular document (forum), we calculated the indicator "Probability of a topic in a document", shown in Table 2.

Table 2. Probability of a topic in a document
This calculation shows that the text of the "Igromania" forum consists 100% of topic 1 (for this topic and topics 2-4, see Fig. 2). The discussions on the "History" and "Playground" forums consist 100% of topic 2, and the discussion on the Strategium forum consists 86% of topic 3 and 14% of topic 4. Topic 4 thus barely appears in the documents, so it makes sense to re-run the LDA algorithm and identify 3 topics (Fig. 3).
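As a rough illustration of how LDA yields both the ranked word lists and the "Probability of a topic in a document" indicator, here is a minimal collapsed Gibbs sampler. It is a sketch under stated assumptions, not the implementation used in the study (the article does not say which R package or inference method was applied). The hyperparameters `alpha` and `beta`, the iteration count, the function name and the toy documents are all hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (gamma, top_words): gamma[d][k] estimates the probability of
    topic k in document d; top_words[k] lists the highest-count words.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(docs):                         # random initial assignment
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = word_id[w], assign[d][i]
                doc_topic[d][k] -= 1                       # remove current token
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                           # resample its topic
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    gamma = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(n_vocab), key=lambda i: -topic_word[t][i])[:5]]
        for t in range(n_topics)
    ]
    return gamma, top_words


# Hypothetical toy corpus: each document is a list of preprocessed tokens.
toy_docs = [
    ["abdication", "february", "nicholas", "abdication", "revolution"],
    ["debt", "ruble", "lenin", "revolution", "debt"],
    ["lenin", "ruble", "debt", "communism", "revolution"],
]
gamma, top_words = lda_gibbs(toy_docs, n_topics=2)
```

Each row of `gamma` sums to 1, so a row such as 0.86 and 0.14 would be read the same way as the Strategium row in Table 2. Topic labels are arbitrary: a different seed can swap which topic is called 1 or 2.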
Fig. 3. Topics identified by the LDA algorithm (3 topics)

We re-check the indicator "Probability of a topic in a document" (Table 3).

Table 3. Probability of a topic in a document (3 topics)
After this operation, each document belongs to one specific topic, and the materials of the "History" and "Playground" forums both consist of a single topic, topic 2, so there is no longer an "extra" topic. Since the LDA algorithm assigned the materials of the "History" and "Playground" forums to the same topic, we checked the semantic proximity of these two forums with the LSA algorithm by constructing the semantic space of all four forums under study (Fig. 4).

Fig. 4. Semantic space of the documents (LSA)

In the semantic space, the "History" and "Playground" forums really are "side by side" compared to the other two forums.

Let us analyze the results of the "distant reading" (see Fig. 2). In all three topics, the first and most important term is "revolution"; that is, users of these forums mostly perceive the events of 1917 as a "revolution" rather than a "coup".

Topic 1 corresponds to the "Igromania" forum. The terms "Abdication", "February" and "Nicholas" suggest that the February Revolution and the abdication of Nicholas II dominated the discussion on this forum.

The algorithm attributed the discussions on the "History" and "Playground" forums to a single topic, the second in order (see Figure 2). The terms "Debt" (probably) and "Ruble" point to economic topics, and the terms "Lenin", "Communism" and "USSR" point to discussion of the October Revolution.

Topic 3, in our opinion, consists of unrelated terms from a wide range of subjects, so either the algorithm did not identify a topic here or a very wide range of subjects was discussed on the Strategium forum. The Strategium forum, where the LDA algorithm did not identify a topic, differs from the other forums in its large number of posts (20 pages versus 3-7 on the other forums) and its large number of participants in the debate. In our opinion, the varied writing styles of the authors could also have complicated the work of the LDA algorithm.
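The semantic-space check with LSA mentioned above can be illustrated with a small self-contained sketch. LSA places documents in a low-dimensional space obtained from a singular value decomposition of the term-document matrix. To stay within the standard library, this sketch obtains the leading components by power iteration on the document-by-document Gram matrix, which is an assumption of the example and not the routine used in the study; the function names and the toy matrix are hypothetical.

```python
import math
import random


def gram_matrix(tdm_rows):
    """Document-by-document inner products of a terms-by-documents matrix."""
    n_docs = len(tdm_rows[0])
    return [[sum(row[i] * row[j] for row in tdm_rows) for j in range(n_docs)]
            for i in range(n_docs)]


def top_eigenpairs(sym, k, n_iter=500, seed=0):
    """Leading k eigenpairs of a symmetric positive semidefinite matrix."""
    rng = random.Random(seed)
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:          # nothing left to extract
                break
            v = [x / norm for x in w]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(n) for j in range(n))
        pairs.append((max(eigenvalue, 0.0), v))
        for i in range(n):            # deflate the component just found
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def lsa_document_coordinates(tdm_rows, k=2):
    """Coordinates of each document in a k-dimensional semantic space."""
    pairs = top_eigenpairs(gram_matrix(tdm_rows), k)
    n_docs = len(tdm_rows[0])
    return [[math.sqrt(lam) * vec[d] for lam, vec in pairs]
            for d in range(n_docs)]


# Hypothetical toy matrix: 4 terms (rows) by 3 documents (columns).
# Documents 0 and 1 use identical vocabulary; document 2 differs.
toy_tdm = [
    [2, 2, 0],
    [1, 1, 1],
    [0, 0, 3],
    [0, 0, 1],
]
coords = lsa_document_coordinates(toy_tdm, k=2)
```

Documents whose coordinates lie close together, such as documents 0 and 1 here, are the analogue of forums that appear "side by side" in a semantic space plot.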
To test this hypothesis, we divided the text into 2 and into 4 equal parts in chronological order. Dividing it into 2 parts still did not allow us to identify topics, whereas dividing it into 4 parts allowed us to identify more specific topics. We asked the algorithm to identify 4 topics for this forum, matching the number of parts into which we had divided the discussion. The result of the analysis is shown in Fig. 5.
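The splitting step can be sketched as follows. The article does not state whether the parts were equal by number of messages or by text length; this example assumes a chronologically ordered list of messages divided into contiguous parts of near-equal message count, and the function name is hypothetical.

```python
def split_into_parts(messages, n_parts):
    """Split an ordered list into n_parts contiguous, near-equal parts."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(messages), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional message each.
        end = start + base + (1 if i < extra else 0)
        parts.append(messages[start:end])
        start = end
    return parts


# Ten placeholder messages in chronological order, split into four parts.
example_parts = split_into_parts(list(range(10)), 4)
```

Each part can then be treated as a separate document when building the term-document matrix for a new LDA run.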
Fig. 5. Topics identified by the LDA algorithm on the Strategium forum after splitting the text

The first topic can be described as a "food" topic ("hunger", "grain", "export"). The second and third topics are ideological ("democracy", "socialism", "communism", "anarchist", "fascism", "party", "Socialist Revolutionary"). The fourth topic is both "food" and "military" ("army", "war", "bread", "peace", "productivity"). In our opinion, it makes no sense to check the indicator "Probability of a topic in a document" here, since the division into documents in this case is chronological and conditional, and one document cannot be cleanly separated from another.

The semantic space shows the semantic proximity of the materials of the "History" and "Playground" forums and their isolation from the other two forums, which the LDA analysis also showed.

"Slow reading". After studying the texts by the methods of "distant reading", we have an approximate idea of which topics were discussed on the forums. We use terms from these topics to search for the relevant messages on the forums in order to study the discussion of a particular topic in more detail: in what context it was held and what provoked it. The search is performed either with the forum's own search service or with the Ctrl+F command. Because the texts were lemmatized, the search should be made by the root of the word.

As a result of the analysis, we concluded that the discussions on the forums were provoked by the 100th anniversary of the revolution. The threads were either created in 2017 or, if created earlier, received a new impulse during the anniversary period. A characteristic feature of the discussions was the evidence that users "entered" them with already formed views on the events of the 1917 revolution, its prerequisites and its consequences.
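The keyword search used at the slow-reading stage was done in the study with the forums' own search or Ctrl+F. When the messages are already available as text, the same search by word root can be approximated in code. The sketch below is only an illustration: the function name, the roots and the sample messages are hypothetical.

```python
import re


def find_messages_by_root(messages, roots):
    """Return (message index, hit count) pairs, most hits first."""
    patterns = [
        re.compile(r"\b" + re.escape(root) + r"\w*", re.IGNORECASE)
        for root in roots
    ]
    hits = []
    for index, text in enumerate(messages):
        count = sum(len(p.findall(text)) for p in patterns)
        if count:
            hits.append((index, count))
    # Stable sort: messages with equal counts keep chronological order.
    return sorted(hits, key=lambda hit: -hit[1])


sample_messages = [
    "Nicholas signed the abdication in pencil.",
    "The army wanted peace.",
    "Was the abdication legal? Abdications are rare.",
]
ranked = find_messages_by_root(sample_messages, ["abdicat"])
```

Ranking by hit count points the reader to the messages where a topic's keywords are densest, which is where close reading would start.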
The discussion on the History forum largely revolved around Russia's financial situation on the eve of and during the revolution, and it was on this issue that disputes arose in the thread. According to one user, the main catalyst for both revolutions of 1917 was the First World War, which Russia in turn was forced to enter because of its debts to French banks. When the Bolshevik government refused to pay the tsar's debts, the Entente countries began a military intervention and unleashed the Civil War in Russia (quote: "Besides, it was precisely because of the squandered debts of Nicholas II to French bankers that he was forced to get into the First World War... When Lenin stopped the war and announced that Soviet Russia would not pay the tsar's debts, the Entente troops entered the country...the "whites" were armed and the civil war began"). Another user blamed the Bolsheviks for what he saw as Russia's unfavorable exit from the First World War and for the Civil War ("Lenin ... unleashed a Civil War. Maybe you can explain why the October Revolution was needed if it was not ordered (to remove Russia, while possible, from the war). All members of the Entente received their dividends, all except Russia") [26].

Users on the Playground forum are generally "on the side" of the Bolsheviks in the historical discussion. As a result, the debates on this forum were mostly about particulars. According to one user, the Bolsheviks came to power on the support of the impoverished peasantry ("85% of the population of Russia were peasants. These people have reached the edge. The most terrible thing is that they were afraid to starve to death. Lenin saw this. And the power came to him by itself. Without any blood and violence. Terror was unleashed by "noble" people of the upper class in order to regain their wealth and human slaves. The people said, "No!"").
There was also another opinion, uncharacteristic of this forum, according to which the February Revolution was necessary and natural while the October Revolution harmed the country ("Only the February revolution was needed because Nicholas II was a weak monarch and his abdication was a boon for Russia in 1917. But after Kerensky's unsuccessful actions, the Communists came to power and began the collapse of the country. Under the communist regime, we lost the almost won World War I, signed the shameful Brest peace, ruined the economy") [27].

On the Strategium forum, one of the central topics was the famine of 1932-1933 in the USSR and its causes ("The fact that grain was exported is beyond doubt. But whether this led to mass starvation, or whether the famine had other causes to a greater extent, is a big subject of debate"). In another group of messages, users discussed the definitions and essence of the main ideologies. The term "war" in the forum's topic pointed to mentions of the Civil War and the First World War ("In 1917, the management of the army was lost. This loss is irrevocable, because it is impossible to do this without a strong government. There was no strong power. This is the requirement of the current moment. There was a revolution. Coming to power could be possible only after the suppression of internal enemies. With the continuation of the war with Germany, a successful fight against the internal enemy is impossible. War for Russia with Germany is an impossible thing. Peace was required") [25].

On the Igromania forum, the most extensive discussion concerned the abdication of Nicholas II. Some users considered the abdication legal; others believed that it was either illegal or an outright "fake" ("the abdication itself is typewritten and signed with a pencil.
The text of the abdication contradicted the legal norms of that time, of which the Emperor could not have been unaware. But the conspirators who created the fake could well have been. Nikolai himself was under arrest and any statements could be made on his behalf") [28].

On all the forums, the most interesting comments are the unemotional ones with references to scientific literature, statistics or the direct speech of participants in the events of 1917. As a rule, such comments are longer and set the direction of the discussion.

The forums differ in the volume and topics of discussion. Three of the four communities are devoted to computer games, in particular strategy games with a historical plot, which explains their interest in history, including the events of 1917 in Russia. In our opinion, such forums also contain material for studying discussions of other historical events.

Conclusions. The proposed method is intended for a researcher working with substantial arrays of text that are physically impossible to read in full, or even in significant part, using exclusively traditional means of working with a corpus of sources. When working with such complexes of digital sources, the first stage should be a "distant reading" using topic modeling, which can be performed with the LDA algorithm. Its result should be a set of topics of the text. If, in the researcher's subjective opinion, a topic has not been identified, the text should be divided into parts, and each part should be analyzed as a separate document with a topic identified for it. Then, in the course of "slow reading", the researcher uses the resulting keywords to find the pages and messages in which the topic of interest was discussed most actively and analyzes them in more detail using traditional methods of working with a text source.

References
1. Kornev, M. S. (2018). Source science 2.0: new approaches to working with sources in a digital network environment. Vestnik RGGU. Series History. Philology. Culturology. Oriental studies. 11(44), 59-66.
2. Moretti, F. (2016). Distant reading [Monograph]. Publishing House of the Gaidar Institute. 352 p.
3. Borodkin, L. I. (2015). «Digital turn» in discussions at the XXII International Congress of Historical Sciences (China, 2015). Historical informatics. Information technologies and mathematical methods in historical research and education. 3-4 (13-14), 56-67.
4. Trubina, E. (2010). Past Wars in the Russian Blogosphere: On the Emergence of Cosmopolitan Memory. War, Conflict and Commemoration in the Age of Digital Reproduction, 63-85.
5. Zvereva, V. V. (2011). Discussions about the Soviet past in the communities of the VKontakte network. Bulletin of Public Opinion. 4 (110), 97-112.
6. Pfanzelter, E. (2015). At the crossroads with public history: mediating the Holocaust on the Internet. Holocaust Studies. 21 (4), 250-271.
7. Commane, G., & Potton, R. (2018). Instagram and Auschwitz: a critical assessment of the impact social media has on Holocaust representation. Holocaust Studies: A Journal of Culture and History. 1-24.
8. Manca, S. (2019). Holocaust memorialization and social media. Investigating how memorials of former concentration camps use Facebook and Twitter. Paper presented at the 6th European Conference on Social Media-ECSM 2019, Brighton. 189-198.
9. Mugueta, I. (2018). History popularised and tweeted: emotions and social representations around the conquest of Navarre in 1512. Imago Temporis. Medium Aevum. 12, 57-90.
10. Keith, S. (2012). Forgetting the Last Big War: Collective Memory and Liberation Images in an Off-Year Anniversary. American Behavioral Scientist. 56(2), 204-222.
11. Belykh, O. L. (2017). Internet periodicals as a source for the study of Russian-American relations (Candidate's thesis).
12. Heimo, A. (2014). The 1918 Finnish Civil War Revisited: The Digital Aftermath. Folklore (Estonia). 57, 141-168.
13. Makhov, A. S. (2015). Everyday knowledge about the past in discussions on a web forum. New and recent history. 1, 141-154.
14. Bubnov, A. Yu. (2020). Memory of the Civil War in Russia as a part of public controversy in the digital space. History: Journal of Education and Science. 9 (95), 1-14.
15. Promyslov, N. V. (2021). The Patriotic War of 1812 in the Russian-speaking segment of the Internet. History, an electronic scientific and educational journal. 5(38), 5.
16. Bubnov, A. Yu. (2019). «Civil War of Memory»: constructing narratives about the Civil War in Russia in online discussions. Moscow University Bulletin. 6, 29-43.
17. Griban, N. V. (2018). Molotov-Ribbentrop Pact in modern media political discourse. Political Linguistics. 1(67), 131-138.
18. Ermolin, D. S., & Mikhailova, A. A. (2021). Remembering Pristina: network communities and the practice of studying ethnosocial processes. Ethnography. 3(13), 146-170.
19. Kiryukhin, D. V. (2019). Functioning of Internet resources and communities in the social network «VKontakte» dedicated to the Great Patriotic War. Historical Bulletin. 3, 5-13.
20. Clavert, F. (2021). History in the Era of Massive Data. Geschichte und Gesellschaft. 47 (1), 175-194.
21. Eiroa, M. (2019). Primary sources for a digital-born history: the Hispanic blogosphere on the Spanish Civil War and Franco's regime. Culture & History Digital Journal. 7(2): 016, 1-57.
22. Marcinkevicius, A. (2018). Constructing Historical Justice Discourse in Lithuanian and Russian Press in Lithuania: The Case of Holocaust. Filosofija. Sociologija. 29 (4), 246-252.
23. Google Trends (2023). Google LLC. Retrieved from: https://trends.google.com/home.
24. Belanovsky, S. A. (2019). In-depth interviews and focus groups [Monograph]. 372 p.
25. Strategium.ru. (2017). Internet forum. Retrieved from: https://www.strategium.ru/forum/topic/85647-stoletie-oktyabrskoy-revolyucii/.
26. History-forum.ru. (2017). Internet forum. Retrieved from: https://history-forum.ru/viewtopic.php?t=1308.
27. Playground. (2017). Internet forum. Retrieved from: https://forums.playground.ru/talk/society/nuzhna_li_byla_revolyutsiya_1917_go_goda-562484/.
28. Igromania. (2017). Internet forum. Retrieved from: http://forum.igromania.ru/showthread.php?t=78610&page=117.
29. Kuznetsov, A. V. (2020). Computer analysis of texts in Latin: latent semantic analysis of Isidor of Seville's «History of the Goths, Vandals and Suebi». Historical informatics. 2, 178-190.
30. Kuznetsov, A. V. (2023). Computer analysis of medieval Latin texts. Github. Retrieved from: https://alexeyvkuznetsov.github.io/.
31. Silge, J., & Robinson, D. Text Mining with R: A Tidy Approach. Retrieved from: https://www.tidytextmining.com/topicmodeling.html.