Historical informatics

The experience of classifying the social position of victims of political repression in the USSR using the support vector machine

Lyagushkina Liudmila

ORCID: 0000-0003-4647-5215

PhD in History

Researcher, Institute of Soviet and Post-Soviet History, HSE

105066, Russia, Moscow, Staraya Basmannaya St., 24/1, bldg. 1

luskom@gmail.com

DOI: 10.7256/2585-7797.2022.1.37719

Received: 20-03-2022

Published: 11-05-2022


Abstract: The article describes various approaches to the classification of occupations in historical research, using the example of the database "Victims of Political Terror in the USSR". A brief overview is given of the methods by which this problem has previously been solved: from the manual assignment of particular occupations and professions of the repressed to the social groups that existed in the USSR in the 1930s, to automatic clustering. A new method is then proposed: to apply supervised machine learning to the classification task, using records already divided into groups in the author's previous studies to train the algorithm and label the rest automatically. The best of the tested methods turned out to be the support vector machine, which showed an accuracy of 95% on the test sample. The advantages and limitations of such a classification are considered; the main limitation is that some social groups are systematically classified more poorly than others. Nevertheless, the application of this technique made it possible to label 350 thousand new records from the database extremely quickly. Labeling based on "training" data processed by a historian appears to be a promising direction for historical data science.


Keywords: historical databases, data markup, historical data science, Russian history, USSR, machine learning, Stalinism, support vector machine, political terror, classification


Machine learning is increasingly used in the humanities, but there are still few examples of its application in historical research. Most often such methods appear in interdisciplinary articles on economic history, demography, and jurisprudence. For example, machine learning has helped to identify entrepreneurs with high accuracy in the materials of UK population censuses [1], to match the same people or family members across the primary materials of the US censuses of 1900, 1910 and 1920 [2], and even to extract key topics from British court decisions predating the Industrial Revolution [3]. Nevertheless, historians sometimes face problems that cannot be solved without modern methods of data analysis. In particular, classification is one of the common applied tasks for which machine learning can be extremely useful.

 

Classification of the occupations of the repressed in historical research

The question of how to classify the occupations of those repressed in the USSR has been raised repeatedly in the literature on Russian material. Information on occupations is included in the database "Victims of Political Terror in the USSR" [4], created by the international historical and educational society "Memorial"* (included by the Ministry of Justice of the Russian Federation in the register of foreign agents). This rich source, which contains more than three million records of people subjected to repression in the USSR in 1917-1991, has repeatedly become a subject of study.

To describe the process of repression, it is very important to understand who the victims of terror were. However, in the Memorial database the occupations of the repressed are recorded in very diverse ways and with varying degrees of detail. Some records give only a profession or a social group ("collective farmer", "worker"); others also indicate the place of work ("collective farmer of the kolkhoz '12th Anniversary of October'"); sometimes the position and place of work are described in great detail ("digger of 'Soyuzvodstroy' at GOGRES"). Such differences in the format of occupation descriptions are explained both by the specifics of how the database was created (it is based on the "Books of Memory" of victims of repression, compiled in different regions according to different principles [see the introductory article on the database website: 4]) and by the huge volume of records (even within a single "Book of Memory" the set of recorded information could vary).

It should also be borne in mind that in some cases the database may contain not entirely reliable information about the occupations of the repressed, owing to the complexity of the source on which it is based. For example, during the years of terror there were cases when, shortly before arrest, a person was dismissed from a post and took a far less significant job, and it was the latter that ended up in the "Book of Memory". Nevertheless, studies in which such information was checked and compared did not reveal a significant distortion of the picture [5, p. 160; 6, pp. 34-36]. Accordingly, while the "Books of Memory" may not be the best source of biographical information about the life of any particular person, on the whole they should reflect the overall picture correctly.

Researchers have taken different approaches to the question of how to generalize the occupations of victims of repression. One strand of historiography tried to reconstruct the social status of the repressed in accordance with the classification of the All-Union Population Census of 1939. In her analysis of the Leningrad Martyrology, M. Ilic relied on the census classification of social groups (workers, collective farmers, employees, cooperative and non-cooperative artisans, individual peasants, others). At the same time, she subdivided certain groups further – for example, by adding a category of "ministers of religious worship", who according to the census should have been classified as "other", or by creating additional categories for different kinds of employees [7, pp. 323-324]. E. M. Mishina used an adapted classification of occupations from the 1939 census organized by branch of the economy: for example, collective farmers and individual peasants are included in a single category of "agricultural occupations" [6, pp. 216-224]. The author of this article, in her dissertation research, relied on the classification used in the NKVD's internal statistics of arrests [8, pp. 158-159]. This classification largely repeats the social groups from the census, but it also singles out a number of groups that were important to the state security agencies during the operations of the Great Terror (1937-1938) – for example, the same "ministers of religious worship". This classification is discussed in more detail below.

The approach of the above authors has a number of obvious advantages, the most important being that the social status of the repressed can be compared with the composition of the population according to the census. As a result, one can conclude which social strata were affected by repression to a greater degree. At the same time, as far as the author knows, the division of the repressed into groups according to these classifications was in all cases carried out manually or semi-automatically (searching for and automatically replacing certain keywords). This method is quite time-consuming and therefore practically rules out the analysis of hundreds of thousands, let alone millions, of records.

The second possible approach to classifying occupations is to divide them automatically. For example, the political scientists Y. Zhukov and R. Talibova used the entire Memorial database (2.6 million people at the time their article was written) to examine the long-term effects of repression on the political behavior of Russians and Ukrainians, namely their turnout in the elections of 2000-2010. The authors tried to divide the repressed into the following categories: agriculture, forestry, industry, management, pensioners, services, unemployed. Unfortunately, the article and its supplementary materials (including the research code in R) do not show exactly how the division was carried out, but, apparently, the method was not very successful: in 73% of cases the data were either missing or could not be assigned to any of the above categories [9, Online appendix, A.3].

In addition, the author is aware of attempts (in particular, in the as yet unpublished work of J. R. Chua, C. Becker, and K. Gilman of Duke University) to process part of the Memorial database using the international occupation classifiers ISCO (International Standard Classification of Occupations) and ESCO (European Skills, Competences, Qualifications and Occupations). This is a common tool in the social sciences which, moreover, is available as packages for R, which again makes classifying professions fast and feasible for a large array of data [10]. However, this method also has serious limitations: first, Russian is not supported (so all occupations must be translated into English, with inevitable semantic losses); second, the classifier is designed for modern rather than historical professions (the nature of many of them may have changed since the 1930s) and does not take Soviet specifics into account.

Finally, some approaches to the automatic classification of occupations were tried by participants of the memo.data hackathon organized by the Memorial Society in 2017. Members of the team "Rubricating the professions of the repressed" estimated that the database contains 500 thousand different spellings of professions. Through preprocessing, normalization, and lemmatization (reducing words to their normal dictionary form), the list was cut down to 30 thousand unique occupations. The 200 most frequent professions covered 80% of the records. The names of the professions were then converted into vector representations using the Word2vec model, which is based on a neural network, and clustered. The project participants considered a split into 10 clusters the most successful, although they themselves admitted that the clusters contained many errors [9]. The result of this clustering and classification can be viewed in a file on GitHub [11]. In general, this method seems effective for a large array of data, but it is not without obvious drawbacks: first, one has to decide what to do with the 20% of records that cover the remaining 29 thousand occupation variants; second, one has to determine precisely how the clusters were formed and how they relate to the "social groups" of the 1930s.

Thus, the literature has not yet produced an optimal way to classify the occupations of the repressed for large amounts of data. This article shows how the first of these approaches can be combined with the technical implementation of the second.

 

Classification "with a teacher" using the method of support vectors

By the time of her dissertation and subsequent research, the author of this article had built up a normalized database of 65 thousand people repressed during the "Great Terror" (1937-1938) in five regions of the RSFSR. In it, all victims of terror whose occupations were recorded had been assigned to categories of social status in the manual or "semi-automatic" way described above (keyword search). Later, in order to study the gender aspect of repression, the author decided to expand the chronological scope of the work (to 1941-1945) and its geographical scope (by roughly 20 more regions), which made it necessary to classify the occupations of about 320 thousand more people. To solve this problem, supervised machine learning was chosen: having trained a model on the existing data, one can apply it to the "new" records.

Of course, from the point of view of source studies it would be much more correct to "teach" the model not on a sample prepared by a person (which does not rule out subjectivity), but on a ready-made classifier of social positions drawn from a historical source of the late 1930s. Similar classifiers of occupations in the USSR existed for later periods, but for the end of the 1930s the author could not find anything of the kind. In some cases the social status of the repressed, along with their occupation, can be found in investigative case files – such a field was included in the questionnaires of the arrested. However, a closer study of this source calls into question whether the social-status field, filled in from the words of the repressed, was recorded correctly. It is possible that the arrested themselves did not fully understand what a "social group" was [5, p. 163]. The statisticians who prepared the All-Union Population Census of 1937 worried that, in response to a direct question, people would name the wrong social groups [12, p. 20]. Thus, the creation of such an "ideal" training database can still be considered a promising task.

To train the model, a dataset with two main variables was used: first, a person's occupation as it was recorded in the Memorial database, and second, their social status. The classification system was somewhat refined, compared with the dissertation, in the course of the author's work on another project jointly with A. M. Markevich. The repressed were divided into 13 groups (classes), namely: collective farmers; workers; employees; individual peasants; cooperative artisans; non-cooperative artisans; housewives, dependents, and pensioners; persons without definite occupations and other declassed elements; ministers of religious worship; Red Army rank-and-file and junior command staff; Red Army command staff; NKVD employees; status undefined. The last category was introduced for cases when it is difficult to determine from a person's occupation which group they would have belonged to according to the census. For example, prisoners fell into this group, since during the 1939 census they were counted either by their occupations in camps and colonies (thus joining the ranks of "workers"), by their occupations before arrest (in the case of those under investigation), or as "prisoners" (those held in prison) [12, p. 119].

Next, this dataset was loaded into Python and processed using the scikit-learn library. Various settings and methods of preprocessing the source text were tested. Of the models tested, the best results were shown by a linear classifier – a support vector machine (SVM) trained with stochastic gradient descent (SGD). Classifiers of this kind are often used, for example, in spam filters. The general principle is as follows: the classifier tries to find the equation of a hyperplane that separates two classes in the feature space optimally, i.e. with the greatest margin. Objects falling on one side of the hyperplane belong to the first class, those on the other side to the second [13, pp. 199-200].
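To illustrate this principle only (the weights below are made-up numbers, not the fitted model), the decision rule of a linear SVM reduces to checking which side of the hyperplane an object falls on:

```python
import numpy as np

# Toy illustration of the linear SVM decision rule on two features
w = np.array([1.5, -0.8])   # normal vector of the separating hyperplane
b = -0.2                    # intercept

def predict(x: np.ndarray) -> int:
    """Objects on one side of the hyperplane w.x + b = 0 get class +1, the other side -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([1.0, 0.5])))   # +1: falls on the positive side
print(predict(np.array([-1.0, 1.0])))  # -1: falls on the negative side
```

During training, SGD adjusts w and b so that this boundary separates the classes with the widest possible margin.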

First, the source texts describing people's occupations were preprocessed using regular expressions, a language that helps to find and replace various parts of text variables. In particular, standard abbreviations (for example, "k-z") were replaced with full names ("collective farm"); the names of specific enterprises were removed as far as possible (for example, an accountant of the artel "Red Loader" could otherwise be classified not as an employee but as a worker because of the artel's name); various special symbols were removed; and all letters were reduced to a single case. Fig. 1 shows what the dataset looked like after the initial preprocessing.

 

Fig. 1. Random examples of dataset entries before training (screenshot from the Jupyter Notebook environment). The work variable shows the original value from the database, work_clean the preprocessed text, and social_position_updated the social position assigned to the person.
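As a minimal sketch of what the regular-expression preprocessing described above might look like (the abbreviation dictionary, file name, and column names here are assumptions for illustration, not the author's actual code):

```python
import re
import pandas as pd

# Hypothetical dictionary of standard abbreviations and their full forms
ABBREVIATIONS = {
    r"\bk-z\b": "collective farm",
    r"\bs-z\b": "state farm",
}

def clean_occupation(text: str) -> str:
    """Normalize a raw occupation string from the database."""
    text = text.lower()
    for pattern, replacement in ABBREVIATIONS.items():
        text = re.sub(pattern, replacement, text)
    # Drop quoted enterprise names such as the artel "Red Loader"
    text = re.sub(r'"[^"]*"', " ", text)
    # Remove special symbols, keeping only letters and spaces
    text = re.sub(r"[^a-zа-яё ]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("repressed_occupations.csv")   # hypothetical source file
df["work_clean"] = df["work"].fillna("").map(clean_occupation)
```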

The dataset was then divided into a training sample (90%) and a test (validation) sample (10%). Using the TfidfVectorizer tool, the text descriptions in the work_clean variable were converted into numeric vectors reflecting the importance of each word in a description relative to the whole collection of texts. The model (SGDClassifier from the scikit-learn package) was then trained on the training sample and applied to the test sample.
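A compact sketch of this step, continuing from the preprocessing above; the hyperparameters are illustrative assumptions, and the settings actually used by the author may differ:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# 90/10 split into training and test (validation) samples
X_train, X_test, y_train, y_test = train_test_split(
    df["work_clean"], df["social_position_updated"],
    test_size=0.10, random_state=42,
)

# TF-IDF features + linear SVM (hinge loss) trained with stochastic gradient descent
svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SGDClassifier(loss="hinge", random_state=42)),
])
svm_pipeline.fit(X_train, y_train)
y_pred = svm_pipeline.predict(X_test)
```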

 

 

Fig. 2. Records from the test sample (variable test) that were assigned to their final class (result) with the lowest probability (probability) (screenshot from the Jupyter Notebook environment).

The model assigns some value (class) to every record. However, there are clearly many ambiguous and complex cases, and in practical terms it would be important to be able to flag them automatically for manual verification. Since the support vector machine itself does not involve calculating the probabilities of assignment to a particular class, an additional tool was used for this purpose – CalibratedClassifierCV from the scikit-learn package. It makes it possible to estimate the probability of assigning each record to each of the 13 classes. Figure 2 shows the most "controversial" cases, in which a record was assigned to its final class with the lowest probability.
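A sketch of how such per-class probabilities could be obtained on top of the same features (continuing the sketch above; whether the author wrapped the calibration in a pipeline in exactly this way is an assumption):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Calibrate the linear SVM so that per-class probabilities can be estimated
calibrated = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", CalibratedClassifierCV(SGDClassifier(loss="hinge", random_state=42))),
])
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)   # one probability per record per class (13 columns)
best_proba = proba.max(axis=1)             # confidence of the chosen class

# List the least confident predictions first, for manual review (cf. Fig. 2)
for idx in np.argsort(best_proba)[:10]:
    print(X_test.iloc[idx], calibrated.classes_[proba[idx].argmax()], round(best_proba[idx], 3))
```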

 

Fig. 3. Evaluation of the classification quality of the model on the test sample (screenshot from the Jupyter Notebook environment).

The results of the model on the test sample are shown in Fig. 3. On the whole, the model "predicted" the classes of records quite successfully. The average precision – that is, the proportion of objects assigned by the classifier to a class that actually belong to it – is 0.87, and the weighted average precision (weighting each class by its number of observations) is 0.95. In other words, in 95% of cases the classes in the test sample were determined correctly.
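For reference, the kind of per-class table shown in Fig. 3, together with macro-averaged and support-weighted precision, can be produced with scikit-learn's standard metrics (continuing the sketch above):

```python
from sklearn.metrics import classification_report, precision_score

# Per-class precision, recall, F1 and support, as in Fig. 3
print(classification_report(y_test, y_pred))

# Macro precision treats every class equally; weighted precision
# weights each class by its number of observations (support)
print(precision_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="weighted"))
```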

However, the main problem is that some social-status classes are systematically determined worse than others. The worst indicators were for the categories "Red Army rank-and-file and junior command staff", "Red Army command staff", "non-cooperative artisans", and "NKVD employees". Because of their small numbers (the support column in Fig. 3 shows the number of observations in the test sample for each category), their overall "contribution" to the result is not large, but the underestimation of individual categories may in some cases affect the qualitative conclusions of a study.

Manual verification made it possible to identify a number of typical causes of errors. First, some categories of occupations were determined poorly because of the peculiarities of how they were written, above all the military groups "Red Army rank-and-file and junior command staff" and "Red Army command staff". Such records contain many specific abbreviations and numbers (for example, "198 SP, 66 SD, kom. otdeleniya" – 198th rifle regiment, 66th rifle division, squad commander), and the combination of words matters ("v/ch 5499, kom. roty" – military unit 5499, company commander). All this makes classification difficult, at least with the amount of data used for training (this group of the repressed was relatively small in our dataset – about 1%, or 740 observations). Second, the classification worked poorly in difficult cases where a choice had to be made between several classes. For example, in the study it was decided to label prisoners by their previous occupations if these were indicated, and as "status undefined" if they were not. As a result, the record "prisoner, servant of the cult – priest" was labeled "status undefined", although according to the methodology it should have been classified under "ministers of religious worship".

Thus, despite the conventionality of such labeling of the repressed and the inevitability that some records would be classified incorrectly, the application of this technique made it possible to label 320 thousand records of people repressed in the USSR very quickly. Manual verification of random samples (30 records each) drawn from the newly labeled data showed that the social group had to be corrected in 6.6 to 10% of records, i.e. it can be assumed that, on average, about 92% of records were classified correctly automatically. Since these records were used, as part of a study of the gender aspect of repression, to compare different parts of the same sample – two waves of terror (1937-1938 and 1941-1945) and the situation of men and women – it seems that some inevitable percentage of errors in this labeling should not significantly affect the qualitative conclusions of the study.
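A spot check of this kind can be drawn very simply; the sketch below assumes a hypothetical file of newly labeled records and illustrative column names:

```python
import pandas as pd

# Hypothetical file with the newly auto-labeled records
new_records = pd.read_csv("new_records_labeled.csv")

# One random sample of 30 records exported for manual verification
spot_check = new_records.sample(n=30, random_state=1)[["work", "predicted_class"]]
spot_check.to_csv("spot_check_sample.csv", index=False)
```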

There are probably ways to improve the quality of the classification. First, it would make sense to add to the training sample more examples of the classes (social-status groups) that are poorly represented in it – for example, more examples of military personnel, of whom there are far more in the database for the war years. Second, it may be worth revising the classification itself, merging the military (currently two categories: command staff and rank-and-file) and the artisans (divided into "non-cooperative" and "cooperative") into two combined groups. Third, certain categories, or the observations with the lowest "probability" of assignment to a class (as discussed above in connection with Fig. 2), might be better double-checked manually if the volume of new labeling is relatively small.

Finally, it should be noted that the author did not resort to more complex natural language processing (NLP) tools. For example, it might make sense to draw on distributional models built from a corpus of the Russian language [14] and, like the participants of the hackathon project mentioned above, to use the word2vec tool. In that case we would work not only with the words that occur in the dataset itself, but would also draw on lexical vectors of words from a corpus compiled from a huge number of Russian texts. This would expand the model's ability to "understand" the closeness of certain words to each other and to take synonymy into account. On the other hand, it is possible that, given the specificity and narrow focus of the texts (we classify only descriptions of occupations, positions, and places of work), this would not fundamentally improve the classification.
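Purely as an illustration of this idea, pre-trained distributional vectors (for example, from RusVectōrēs [14]) could be plugged in with the gensim library roughly as follows; the model file name and the simple averaging scheme are assumptions, and real RusVectōrēs models may require additional token preprocessing (such as part-of-speech tags):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical pre-trained word2vec model downloaded from RusVectores
wv = KeyedVectors.load_word2vec_format("ruscorpora_word2vec.bin", binary=True)

def text_to_vector(text: str) -> np.ndarray:
    """Average the corpus vectors of the words in a cleaned occupation string."""
    vectors = [wv[word] for word in text.split() if word in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

# These dense vectors could then replace the TF-IDF features fed to the classifier
X_train_dense = np.vstack([text_to_vector(t) for t in X_train])
```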

The classification method presented in this article is one of the most common algorithms in machine learning. The author, not being an expert in this field, does not claim that it is the only or optimal one for the case in question. Perhaps, by testing a number of the other methods and settings mentioned above, the accuracy of class prediction could be raised by a few more percentage points.

The purpose of this article is primarily to show that machine learning methods can, in principle, be used in historical research. And this is far from the only possible application, even for the same Memorial data. For example, using the surnames and first names of the repressed, together with data on their nationality, one could "train" a model to determine, approximately, the nationality of the repressed from their surnames in cases where it was not recorded. Obviously, such a method would contain a high proportion of errors, but with a large amount of data some idea of the national composition could be obtained. The authors of another study based on the Memorial and "Memory of the People" databases used the same support vector method to determine the nationality of military personnel by surname (distinguishing only whether a person belonged to the titular Russian nationality or not) and report a prediction accuracy of 96.5% [15].

Over the years of work in historical informatics and the digital humanities, researchers have created a large number of databases. They are often the fruit of long and painstaking work by their authors to transfer material from a complex source into electronic form. Such databases are frequently created to study specific questions, only their authors work with them, and after the end of a project they remain closed to outside access. In the future, it seems quite possible and promising to use such "author's" databases as training material for solving similar methodological problems in various interdisciplinary studies, as demonstrated here with the example of the database of the repressed.

References
1. Montebruno, P., Bennett, R. J., Smith, H., & Lieshout, C. van. (2020). Machine learning classification of entrepreneurs in British historical census data. Information Processing & Management, 57(3), 102210. https://doi.org/10.1016/j.ipm.2020.102210
2. Price, J., Buckles, K., Van Leeuwen, J., & Riley, I. (2019). Combining Family History and Machine Learning to Link Historical Records (No. w26227; p. w26227). National Bureau of Economic Research. https://doi.org/10.3386/w26227
3. Grajzl, P., & Murrell, P. (2021). A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution I: Generating and interpreting the estimates. Journal of Institutional Economics, 17(1), 1–19. https://doi.org/10.1017/S1744137420000326
4. Victims of political terror in the USSR. (2017). International Society "Memorial". https://base.memo.ru/
5. Lyagushkina, L. A. (2014). On assessing the information potential of the "Books of Memory" in comparison with the investigative files of victims of the "Great Terror". Historical Journal: Research Studies, 2, 157–166.
6. Mishina, E. M. (2021). The time of "quiet terror": Political repression in Altai in 1935 – the first half of 1937. Moscow: Political Encyclopedia.
7. Ilic, M. (2013). The Great Terror in Leningrad: Evidence from the Leningradskii martirolog. In J. Harris (Ed.), The Anatomy of Terror (pp. 306–325). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199655663.003.0017
8. The tragedy of the Soviet village: Collectivization and dekulakization, 1927–1939. Documents and materials. Vol. 5, Book 2. (2006), pp. 158–159. Moscow: Russian Political Encyclopedia (ROSSPEN).
9. Zhukov, Y. M., & Talibova, R. (2018). Stalin's terror and the long-term political effects of mass repression. Journal of Peace Research, 55(2), 267–283. https://doi.org/10.1177/0022343317751261
10. Occupations classification. ESCO-ISCO relationship. (n.d.). Retrieved March 18, 2022, from https://cran.r-project.org/web/packages/labourR/vignettes/occupations_retrieval.html
11. Hackathon project archive memo.data on GitHub [Python]. (2020). https://github.com/fatayri/memodata/blob/9b9d82b12f382547b6fea4671c3c49373423e194/final_clusters.csv (Original work published 2017)
12. Zhyromskaya, V. B., Kiselev, I. N., & Polyakov, Yu. A. (1996). Half a century classified as "secret": All-Union population census of 1937 (pp. 20, 119). Moscow: Nauka.
13. Géron, A. (2018). Applied Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques for Building Intelligent Systems (pp. 199–200). St. Petersburg: Dialektika.
14. RusVectōrēs: Models. (n.d.). Retrieved March 5, 2022, from https://rusvectores.org/ru/models/
15. Rozenas, A., Talibova, R., & Zhukov, Y. M. (2021). Fighting for Tyranny: State Repression and Combat Motivation. https://www.royatalibova.com/_files/ugd/c3f304_38dc519b11794aa180a7527cf79cd406.pd

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

Review of the article "The experience of classifying the social status of the repressed in the USSR using the support vector method". The reviewed article is one of the first historical studies in which artificial intelligence methods are used effectively, specifically supervised machine learning. In this paper the author uses the support vector method. The problem of classifying the occupations of those repressed in the 1930s and 1940s is solved on the basis of the database "Victims of political terror in the USSR". The occupations of the repressed are an important characteristic of their social profile, and in some cases the database used may contain not entirely reliable information on this point; often there is no such information at all. The author of the article notes that attempts to study and classify the occupations of the repressed have been made over the past decades, but they cannot be considered fully successful. The database used by the author includes information about more than 380 thousand people. The idea of the supervised machine learning method used is that, having trained a model on a well-studied dataset, one applies the resulting model to the main data array. A dataset with two main variables was used to train the model. As the author notes, these are, "first, the occupation of a person as it was indicated in the Memorial database, and second, their social status. The repressed were divided into 13 groups (classes), namely: collective farmers; workers; employees" and others. In addition, the category "status undefined" is introduced. A rule of correspondence between occupations and social groups was formed on the training sample; on the basis of this rule, the social status was then "recognized" for the entire set of records in the database. An advantage and undoubted novelty of the work is the use of the Python environment, in which various settings and data-preprocessing methods were tested. The method allows one to estimate the probability of assigning each record (person) in the database to each of the 13 classes. The author provides screenshots of the results of evaluating the success of this classification. Overall, the model produced high "prediction" scores for the occupation classes: the weighted average accuracy was 95% on the test sample. It is logical that at the same time some social-status classes were determined worse; this applies to small social groups (for example, "non-cooperative artisans" and the Red Army command staff). The use of this technique allowed the author to label 320 thousand records of people repressed in the USSR very quickly and to use the results in a study of the gender aspect of repression. This is a rare example of a historian working with a huge amount of data. In the final part of the article the author examines possible ways to improve the quality of classification, for example adding more examples of social groups to the training sample and using more sophisticated natural language processing (NLP) tools. An important result of the article is the demonstration that machine learning methods can be used in historical research, although it should be noted that this requires large training samples. The article requires additional proofreading: there are minor typographical issues, such as a missing space after the reference to Fig. 3 and inconsistent spelling of "status undefined" in several places.
Taking this revision into account, the article certainly deserves publication in the journal Historical Informatics, since it has scientific novelty and originality. The article is written in good language, relies on detailed historiography, and will be of interest to a wide range of the journal's readers.