
Historical informatics
Reference:

Beyond Topic Modeling: Analyzing Historical Text with Large Language Models

Kuznetsov Alexey Valer'evich

ORCID: 0000-0003-4755-250X

PhD in History

Researcher; Institute of World History of the Russian Academy of Sciences

119334, Russia, Moscow, Leninsky ave., 32 a, office 1426

kuznetsovaleks@rambler.ru

DOI:

10.7256/2585-7797.2024.4.72560

EDN:

UOIKPJ

Received:

03-12-2024


Published:

10-12-2024


Abstract: The article explores the potential of large language models in thematic analysis of historical texts, exemplified by the 1849 diary of Vologda gymnasium student Kirill Antonovich Berezkin. This rich source illuminates the everyday life, worldview, and social interactions of a young individual in mid-19th century provincial Russia. The diary offers a multifaceted narrative, capturing cultural events, political contexts, and personal introspections. By meticulously analyzing this text, researchers can reconstruct not just an individual's experiences, but also gain profound insights into the social, cultural, and educational landscape of the era. Employing the Gemini 1.5 Pro model, renowned for processing extensive textual data, the study conducted a comprehensive analysis. The research methodology involved examining the diary both holistically and through monthly segmentation, enabling the identification of nuanced content aspects. The novelty of the approach lies in applying modern large language models to a Russian historical document. The results demonstrated the model's remarkable capability to identify key themes, successfully isolating eight major thematic areas that reflect the gymnasium student's life. Utilizing parallel prompting with a monthly text breakdown revealed specific themes and subtleties that a comprehensive review might have overlooked. The study ultimately validates the effectiveness of large language models in historical source analysis, presenting promising opportunities for automating topic modeling and uncovering hidden patterns in extensive textual datasets. However, the inherently stochastic nature of these models necessitates multiple analyses, careful result interpretation, and critical comparison with traditional historical research methodologies.


Keywords:

large language models, topic modeling, diary, historical source, text analysis, 19th century, digital history, prompt engineering, artificial intelligence, Kirill A. Berezkin

This article is automatically translated.

Introduction

Topic modeling is a common method of computational text analysis that reveals hidden topics in large collections of documents. Over the years, many approaches to building topic models have been developed [6, p. 63],[12, p. 410]. Probabilistic models such as Latent Dirichlet Allocation (LDA) [10] have been widely used in the humanities, including historical research [3]. LDA and similar algorithms make it possible to identify key topics and track their dynamics in large volumes of text.

One of the first examples of topic modeling in historical research was a project analyzing the Pennsylvania Gazette for 1728-1800, the results of which were published in 2006 [30]. Since the 2010s, the method has been used to analyze diaries [11], letters [27], periodicals [41], collections of articles [18], literary corpora [21], etc.

Despite their effectiveness, traditional topic modeling methods face a number of methodological difficulties [8],[39]. The results depend heavily on the quality of text preprocessing [14] and on the choice of hyperparameters, such as the number of topics [7],[9],[31, p. 429-433]. In addition, results presented as lists of keywords are often hard to interpret and demand considerable effort from the researcher [5],[17],[31]. Another major limitation is the low effectiveness of traditional methods on short texts and small datasets: with few co-occurring words, topics are identified less accurately. These problems highlight the need for alternative approaches that can overcome the limitations of traditional topic modeling.

A new stage in the development of computational text analysis, including topic modeling, began with the advent of large language models. Because they take context, semantic connections and linguistic nuances into account, they can overcome the limitations of traditional methods. It is already clear that their use opens up new horizons for historical scholarship and the humanities in general [16],[19],[22],[24]. However, the potential of large language models for historical research is still being studied. This article explores the possibilities of general-purpose large language models for the analysis of historical texts, using the diary of a Vologda gymnasium student of the mid-19th century as an example. We assume that such models will yield interpretable and detailed results in the thematic analysis of texts. The main objectives of the study are to evaluate the effectiveness of large language models, to identify their limitations in extracting key topics from Russian-language historical documents, and to develop practical recommendations for their application in the humanities.

Literature review

Large language models open up new possibilities for topic modeling, providing both an alternative and a complement to traditional approaches. They are successfully used to interpret [13],[25] and to evaluate the quality of topic modeling results [38].

Recently, researchers have focused on experiments that extract topics using large language models alone, without traditional algorithms. In such approaches, the model receives prompts (queries) that formulate the task of identifying key topics in the text. This approach relies on the ability of models to summarize and generalize information effectively [35]. In effect, the model is asked to "compress" the text to the level of the topics discussed in it. Even simple queries, for example, "Select the main topic from the text: [text]", produce high-quality results [37, p. 239]. Prompts are adapted to specific tasks: the model is assigned a particular role or response format, which makes it possible to systematize and refine the results [23],[29],[40].

Experimental data confirm that large language models outperform traditional methods on several criteria. They produce topics that are more coherent and more interpretable, and they can work with short texts, where traditional algorithms usually struggle [29],[40].

However, the use of large language models for topic modeling involves a number of challenges that must be taken into account in research practice. First, the amount of input data that a model can process is limited. For example, ChatGPT-4 can analyze up to 128,000 tokens, Claude up to 200,000 tokens, and Gemini 1.5 Pro up to 2 million tokens. Two approaches are used to work with large texts: parallel and sequential prompting [15, p. 24]. In the parallel approach, the text is divided into parts, each of which is processed independently, and the results are then summarized. The sequential approach involves a step-by-step analysis of the text, in which the model accumulates information and refines the topics at each step. Second, models generate topics at different levels of detail, from overly general to overly specific [28]. To address this, either an iterative approach is used, in which the list of topics is refined with additional queries [29], or example topics (seed topics) are added to the query to control the level of detail [15]. The third problem is hallucination, when models produce plausible but incorrect information, including non-existent topics [20],[42]. Finally, models suffer from the so-called "lost in the middle" effect [26]: when analyzing long texts, they may miss information from the middle of the context, which reduces the accuracy and completeness of the identified topics. This limitation calls for analyzing the text in smaller parts in order to improve the coverage and detail of the thematic structure.
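The difference between parallel and sequential prompting can be illustrated with a minimal sketch. It is not taken from the cited works: the `LLM` type, the prompt wording and the `fake_llm` stand-in are assumptions made only so that the example runs without a real model.

```python
from typing import Callable, List

# Stand-in type for any "text in, text out" model call (an assumption of this sketch).
LLM = Callable[[str], str]


def parallel_prompting(parts: List[str], llm: LLM) -> str:
    """Analyze each part independently, then ask the model to merge the results."""
    partial = [llm(f"Identify the main topics in this text:\n{part}") for part in parts]
    joined = "\n\n".join(partial)
    return llm(f"Summarize these topic lists into one list:\n{joined}")


def sequential_prompting(parts: List[str], llm: LLM) -> str:
    """Walk through the parts in order, carrying the running topic list forward."""
    topics = ""
    for part in parts:
        topics = llm(
            "Topics found so far:\n"
            f"{topics}\n"
            "Update this list using the next part of the text:\n"
            f"{part}"
        )
    return topics


def fake_llm(prompt: str) -> str:
    """Toy model used only to make the sketch runnable; a real API call would go here."""
    return f"[topics derived from a prompt of {len(prompt)} characters]"


if __name__ == "__main__":
    months = ["January entries...", "February entries...", "March entries..."]
    print(parallel_prompting(months, fake_llm))
    print(sequential_prompting(months, fake_llm))
```

With n parts, the parallel variant makes n independent calls plus one summarizing call, while the sequential variant makes n calls that each depend on the previous answer.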

Despite these limitations, large language models demonstrate significant progress in topic modeling. Their prospects include not only the automation of analysis but also a deeper interpretation of texts, which makes them a powerful tool for humanities research.

Research material

The diary of Kirill Antonovich Berezkin (1830-1908), covering the period from January 1 to September 12, 1849, was chosen as the material for the study. The diary was kept during his final year at the Vologda Provincial Men's Gymnasium. The source provides valuable material for studying the daily life, worldview and social interactions of a young person in provincial Russia in the mid-19th century. The author, a hereditary nobleman, shares his thoughts and observations on various aspects of life: studies at the gymnasium, relationships with family and friends, social entertainment, cultural events, the political situation and his inner feelings. The diary was first published in 2014 [4] and reissued in 2020 with additions [2]. In our study we used the electronic version of the diary available on the website of the "Prozhito" ("Lived") project [1].

Research methodology

To analyze the topics of K. Berezkin's diary, we used the Gemini 1.5 Pro model from Google Corporation (it is recognized as an extremist organization in the Russian Federation). This model stands out for its ability to process input of up to 2 million tokens, roughly twice the length of L.N. Tolstoy's novel "War and Peace". The model works well with Russian and ranks near the top of language model leaderboards (see Fig. 1), where it competes with the latest models from OpenAI and Anthropic. A significant advantage for us is that access to it is free of charge, with minor restrictions on the number of requests per day. The analysis was carried out on the Google AI Studio platform (https://aistudio.google.com/), where the full text of the diary was uploaded in .txt format. The model tokenized the 90,708-word text into 160,477 tokens.
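As a quick arithmetic check of these figures, the sketch below computes the token-to-word ratio and the share of the context window that the diary occupies. The three constants come from the text above; the function names are illustrative.

```python
WORDS = 90_708             # diary length in words, as reported above
TOKENS = 160_477           # token count reported for the diary
CONTEXT_LIMIT = 2_000_000  # context window cited for Gemini 1.5 Pro


def tokens_per_word(tokens: int, words: int) -> float:
    """Average number of tokens the tokenizer spends on one word."""
    return tokens / words


def context_share(tokens: int, limit: int) -> float:
    """Fraction of the context window taken up by the text."""
    return tokens / limit


ratio = tokens_per_word(TOKENS, WORDS)
share = context_share(TOKENS, CONTEXT_LIMIT)
print(f"{ratio:.2f} tokens per word, {share:.1%} of the context window")
```

The diary thus takes about 1.77 tokens per word and roughly 8% of the available context, so the whole text fits into a single request with a wide margin.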

Figure 1. Gemini 1.5 Pro model in the ranking on the LLM Arena platform (https://llmarena.ru/) as of 05.12.2024.

The effectiveness of language models in topic modeling tasks largely depends on how prompts are designed [29],[32]. The analysis of historical documents requires taking into account not only the text but also the cultural, historical and social context of the period. When writing prompts, we relied on current ideas about their structure and functionality [33],[36] and on the experience of prompt design in historical research [24],[34].

The prompt we propose consists of two parts: a system prompt and a user prompt. The system prompt sets the role of the model, defines the context and research plan, and provides general methodological recommendations:

You are an expert on the history of the Russian Empire of the XIX century and a specialist in the analysis of personal diaries. Your task is to analyze the text of the diary and identify key topics taking into account the cultural, historical and social context of the era.

Context:

- The author of the diary: Kirill Antonovich Berezkin, a gymnasium student at the Vologda Provincial Gymnasium, a hereditary nobleman.

- The period of keeping a diary: January 1849 – September 1849.

- Location: Vologda city, Vologda province.

- Historical significance: the diary reflects the perception of important historical events in the Russian province.

Action plan:

1. Read the text of the diary.

2. Identify key topics based on their frequency, importance to the author, and emotional coloring.

3. Describe the topics, taking into account their connection with the context of the XIX century.

4. Give examples in the form of quotations indicating the date of recording.

6. Pay attention to the use of archaisms, historicisms and cultural references in the text.

Methodological recommendations:

- Approach the analysis from an interdisciplinary point of view, linking topics with history, sociology and pedagogy.

- Choose quotes that most representatively and vividly illustrate the topic.

The user prompt gives the model specific instructions, defines the response format, and includes a sample response:

Analyze the topics of the diary of a gymnasium student from Vologda in the middle of the XIX century. Each individual entry in the diary begins with the characters ###. Identify and describe the main topics discussed in the diary.

Present the results in a list format according to the following scheme:

- Topic title: [short topic title]

- Description: [a detailed description of the topic, including its significance and possible reflection of the cultural and social realities of the time]

- Example: [2-3 quotes from the text of the diary with reference to the date of entry, which illustrate this topic and emphasize its importance]

Sample response:

1. Topic title: Education and school days

- Description: The theme reflects the daily life of a high school student, including classes, academic achievements and difficulties he faces. This topic helps to understand how the education system in the 19th century shaped the personality and worldview of young people.

- Example: "Today we wrote an exam in Latin, and despite all efforts, received only "satisfactory"." (March 25, 1849)
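Because the response format is fixed, the model's answers can be turned into structured records for later comparison. The following is a minimal sketch, not the script used in the study: it assumes the English labels "Topic title:", "Description:" and "Example:", and the `Topic` class and `parse_topics` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    title: str
    description: str = ""
    examples: List[str] = field(default_factory=list)


def parse_topics(answer: str) -> List[Topic]:
    """Parse 'Topic title / Description / Example' blocks from a model answer."""
    topics: List[Topic] = []
    in_examples = False
    for raw in answer.splitlines():
        line = raw.strip().lstrip("-").strip()   # drop list dashes
        line = re.sub(r"^\d+\.\s*", "", line)    # drop numbering such as "1. "
        if not line:
            continue
        if line.startswith("Topic title:"):
            topics.append(Topic(title=line.split(":", 1)[1].strip()))
            in_examples = False
        elif not topics:
            continue  # ignore any preamble before the first topic
        elif line.startswith("Description:"):
            topics[-1].description = line.split(":", 1)[1].strip()
            in_examples = False
        elif line.startswith("Example:"):
            in_examples = True
            rest = line.split(":", 1)[1].strip()
            if rest:
                topics[-1].examples.append(rest)
        elif in_examples:
            topics[-1].examples.append(line)  # quotation on its own line
    return topics
```

The parser accepts both layouts seen in practice: an example on the same line as the label, or several quotations on separate lines below it.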

Given the "lost in the middle" problem of language models [26], we also decided to study how the amount of text provided affects the depth and detail of the identified topics. To do this, the diary was divided by month into nine parts. We then analyzed the topics of each part automatically, using a Python script we developed and the PROXYAPI API (https://proxyapi.ru/). This analysis was carried out on the Google Colab platform (https://colab.research.google.com/) with the same Gemini 1.5 Pro model. Here the "parallel prompting" approach was used [15]: each part was analyzed separately, and the model was then asked to summarize the results.
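The monthly split can be sketched as follows. This is an illustration rather than the script from the repository: the text only states that each entry begins with "###", so the header layout "### YYYY-MM-DD" and the function name `split_by_month` are assumptions.

```python
import re
from typing import Dict, List

# Assumed header layout: "### YYYY-MM-DD" on its own line (hypothetical date format).
ENTRY_HEADER = re.compile(r"^###\s*(\d{4})-(\d{2})-(\d{2})\s*$", re.MULTILINE)


def split_by_month(diary: str) -> Dict[str, str]:
    """Group diary entries into one text chunk per calendar month."""
    chunks: Dict[str, List[str]] = {}
    headers = list(ENTRY_HEADER.finditer(diary))
    for i, match in enumerate(headers):
        # An entry runs from its header to the next header (or to the end of the text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(diary)
        month_key = f"{match.group(1)}-{match.group(2)}"
        chunks.setdefault(month_key, []).append(diary[match.start():end].strip())
    return {month: "\n\n".join(entries) for month, entries in chunks.items()}
```

Each resulting chunk can then be sent to the model as a separate request, which is the first step of parallel prompting.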

In both cases, to reduce the risk of model hallucinations, the temperature parameter was set to 0.2. The temperature parameter in language models regulates the degree of randomness, or "creativity", of the generated text. When the value is close to 0, the model selects the most likely continuations, which makes the text more predictable and less diverse. This mode suits tasks that require high accuracy, including text summarization.
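The effect of temperature can be shown with the standard softmax-with-temperature formula. This is a generic illustration, not a description of Gemini's internals; the three scores below are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_low])
```

At temperature 1.0 the top candidate receives about 63% of the probability mass, while at 0.2 it receives more than 99%, which is why low temperatures make the output more predictable.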

We also tested the ability to control the level of detail of the identified topics by adding example topics to the prompt, as suggested in [15]. At the final stage of the study, we checked whether the model can track changes in the topics of the diary over time. To do this, we gave the model the following request:

Keep track of how the topics highlighted in the diary have changed over time.

Break down the text of the diary into time periods by individual months.

For each period, determine which topics were predominant, how did the interests and mood of the author change?

Explain what events in the author's life are associated with such changes in the subject?

To answer, use the following scheme:

Period 1:

Prevailing topics: [Provide a list of prevailing topics]

Description: [Describe how the topics in the diary change, and also explain the reasons for the changes: what events in the author's life are related to]

All materials, scripts, and model responses are available in the GitHub repository: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm

The results of the study

Large language models are stochastic in nature, which means that their responses are based on probability distributions rather than hard-coded rules. This makes them flexible and creative, allowing them to take into account a wide context, but also leads to variability in results even with identical queries. As a result, the model may formulate the same topic in different ways, mix similar concepts, or generate plausible but incorrect information (hallucinate). This feature requires repeated verification and refinement of queries to obtain stable and interpretable results. Therefore, in the course of the study, we conducted five thematic analyses of the full text of K. A. Berezkin's diary. The results are summarized in Table 1 (columns Option 1 – Option 5).

| Topic | Option 1 | Option 2 | Option 3 | Option 4 | Option 5 | Monthly analysis |
| --- | --- | --- | --- | --- | --- | --- |
| Education and school days | Education and school days | Education and school days | Education and school days | Education and school days | Education and school days | Education and gymnasium life |
| Social life and entertainment | Social life and social entertainment | Social life and entertainment | Social life and entertainment | Social life and entertainment | Social life and entertainment | Social life and leisure |
| Personal experiences, introspection | Personal experiences, reflections and introspection | Personal experiences and introspection | Personal experiences and introspection | Introspection and reflection | Introspection and reflection | Personal development and introspection |
| Relationship with the family | Relationships with family and loved ones | Relationships with others | Relationships with others | Family and home life | Family and home life | Family relations |
| Romantic hobbies | Romantic hobbies and relationships with girls | Romantic hobbies and love | Love and romantic hobbies | Romantic interests and relationships with girls | Love and romantic interests | Romantic interests and relationships with girls |
| Religion and beliefs | Religious views and church rituals | Religion and beliefs | Religion and popular beliefs | Religion and popular beliefs | Religion and the Church | Religion and spirituality |
| Reading and lit. interests | Reading and literary interests | Literature and art | Literature and art | Reading and literary interests | Literature and reading | Literary interests and creativity |
| Political developments | Political events and public sentiment | Political events and public life | Political events and public sentiment | Political events and public life | Historical events and political situation | Historical context |

Additional topics that appear only in some columns (formulations listed in their original order):

Hobbies and hobbies: Hobbies and hobbies; Independent creativity (drawing, writing); Art and creativity (drawing); Art and creativity (drawing)

Everyday life and daily life: Everyday life in Vologda; Everyday life and daily life

Nature: Nature observations and the changing of the seasons; Observations of the surrounding world and nature; Observing the world around us and thinking about life

Future plans: Plans for the future and finding your place in life

Childhood memories: Childhood memories

Table 1. Topics highlighted in K. Berezkin's diary by the Gemini 1.5 Pro model.

Repeated analysis made it possible to identify stable thematic clusters and assess the degree of variability in the selected topics. In all five runs of the analysis, the model consistently identified eight main topics, each of which reflects significant aspects of Kirill Berezkin's life:

1. "Education and school days": This topic covers descriptions of lessons, exams, and relationships with teachers and classmates. In all cases, the model emphasizes that this topic dominates the diary and reflects the daily life of a student at the Vologda Provincial Gymnasium.

2. "Social life and entertainment": This topic contains entries describing participation in balls, masquerades, evening gatherings, fairs and other forms of leisure. These events illustrate the cultural and social life of a provincial town, in which the author actively participated.

3. "Personal experiences and introspection": This topic reveals the inner world of K. Berezkin: his reflections on character, feelings, aspirations and shortcomings, as well as his reflections on his future career and finding his place in life. It emphasizes the author's penchant for reflection and critical self-reflection.

4. "Relationship with family": This topic reflects the author's relationship with his closest relatives: mother, stepfather, sisters and grandmother. Childhood memories are also included, which help to understand family traditions and their influence on the formation of the author's personality. Interestingly, when talking about his stepfather, Kirill uses an archaic form of the word, "votchim". This does not prevent the model from correctly determining their family relationship.

5. "Romantic hobbies": The topic includes entries about the gymnasium student's romantic attachments, his perception of the appearance and character of girls, and the feelings associated with communicating with them. These entries show how the author's romantic ideals were formed.

6. "Religion and beliefs": This topic contains descriptions of church visits, participation in church rituals, as well as reflections on folk traditions and religious holidays. The theme emphasizes the importance of religion in the life of K. Berezkin.

7. "Reading and literary interests": The theme combines the author's impressions of reading the works of Russian and foreign writers, as well as his own literary experiences, including writing poetry and prose. Reading for K. Berezkin is one of the main forms of leisure activities.

8. "Political events": Here are the entries where the author mentions key events of his time, such as revolutions in Europe, the war in Hungary and the cholera epidemic in St. Petersburg. The author conveys the public sentiment associated with these events, as well as his own perception of the political situation.

It is noteworthy that the topic name "Education and school days", which we gave in the prompt as a sample, is repeated verbatim in all the answers. The names of the other topics vary in their wording to different degrees.

In addition to these eight main topics, in some runs the model highlighted additional, less pronounced topics such as "Hobbies and hobbies" (drawing, breeding birds), "Everyday life and daily life", "Weather", "Childhood memories" and "Plans for the future". The presence of these additional themes varied from one run to the next, which highlights the stochastic nature of the model.

The analysis of the diary broken down by month tends to produce more detailed and specific topics than the analysis of the whole text. This shows primarily in the variety of topic names. Instead of the generalized topic "Social life and entertainment", the model gives a range of formulations: "Social life and entertainment", "Social life and relationships", "Social life and leisure", "Social life and urban environment", "Social life and society", "Social life and social events" (for the full answer, see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/data/intermediate_themes_gemini.txt). More importantly, the topic descriptions, which emphasize different aspects of social life, differ significantly. New topics also appear: "Social inequality and corruption" (the author's complaints about social inequality and corruption in the gymnasium), "Nature and rural life" (impressions of the time spent in July 1849 in the village of Kalinkino), "Childhood memories", "Cholera and the fear of disease". When analyzing smaller text fragments, the model focuses on more local contexts and can gain a more complete understanding of the thematic diversity of the text.
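When many such formulations accumulate, a rough first pass at grouping them can be automated. The sketch below groups labels by the words before the first "and" (or comma). This is a deliberately crude heuristic chosen for illustration, not the method used in the study, and the function name is hypothetical. The labels are taken from the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_head_term(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Group topic labels by the words before the first ' and ' (or comma)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label in labels:
        head = label.lower().replace(",", " and ").split(" and ")[0].strip()
        groups[head].append(label)
    return dict(groups)


monthly_labels = [
    "Social life and entertainment",
    "Social life and relationships",
    "Social life and leisure",
    "Social life and urban environment",
    "Social life and society",
    "Social life and social events",
    "Nature and rural life",
    "Cholera and the fear of disease",
]

groups = group_by_head_term(monthly_labels)
for head, members in groups.items():
    print(f"{head}: {len(members)} formulation(s)")
```

The heuristic fails when the shared term is not at the start of the label (for example, "Love and romantic interests" versus "Romantic interests and relationships with girls"), so the final grouping still requires the researcher's judgment.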

When summarizing the results of the monthly analysis, the model generates a list that actually coincides with the topics highlighted in the analysis of the whole text (Table 1, column Monthly analysis) (for the full answer, see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/data/summary_themes_gemini.txt).

The response format we set in the prompt makes it easy to interpret the topics identified by the model. For example, here is a description of a topic from one analysis of the full text (Table 1, Option 5; for the entire text of the answer, see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/gemini_05.txt):

Topic title: Historical events and political situation

Description: Kirill mentions important historical events taking place in Russia and the world, for example, the Hungarian Uprising, revolutionary movements in Europe, the death of Grand Duke Mikhail Pavlovich. These entries reflect his interest in politics and history, and also allow us to understand how these events were perceived in the Russian province.

Example:

- "I read St. Petersburg newspapers, foreign incidents. The Austrians are lucky, they are well managed in Hungary, but what else will come of it all?" (January 4)

- "Cholera has increased in St. Petersburg, won't it stop completely in Russia by spring?" (January 7)

- "They write that Russian troops, seven thousand, marched to Transylvania to help the Austrians against the Czechs." (March 1)

- "A republic has been established in Rome, what will come of all this?" (March 1)

- "They write that Russian troops, seven thousand, marched to Transylvania to help the Austrians against the Czechs [Here the author clearly confuses the Czechs with the Hungarians – A.K.]. A republic has been established in Rome, what will come of all this? It seems that everyone will fight, a battle will begin, a cataclysm, it seems." (March 1)

Let us compare this description with a similar topic from the monthly analysis of the diary for March 1849 (https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/data/intermediate_themes_gemini.txt):

Topic title: Historical events and public life

Description: Kirill mentions historical events taking place in Europe (revolutions of 1848-1849, the war of Austria with the Kingdom of Sardinia), reads newspapers, is interested in politics. These records show how European events were reflected in the Russian province, and what sources of information were available.

Example:

"March 1st. ...They write that Russian troops, seven thousand, marched to Transylvania to help the Austrians against the Czechs. A republic has been established in Rome, what will come of all this?"

"March 3rd. ...In the evening, now I read the newspapers, and there's the same story: riots in the West, it's obvious that it won't do without a big fight."

"March 29th. ...I read the newspapers today, there is nothing special, Austria's war with Karl Albert has begun..."

In both variants, the model gave meaningful and accurate descriptions and supplied relevant examples that make it possible to assess the quality of the identified topic. Note that in the first description, the quotations about the participation of the Russian Empire in the suppression of the Hungarian uprising ("They write that Russian troops, seven thousand, marched to Transylvania to help the Austrians against the Czechs") and the establishment of a republic in Rome ("A republic has been established in Rome, what will come of all this?") are reproduced twice: first separately, and then as part of a longer quotation that begins with the Russian intervention in Hungary and ends with the author's general impression of events in Europe ("It seems that everyone will fight, a battle will begin, it seems"). In our opinion, this indicates that the model can identify individual historical events even from a brief mention within a single sentence. The second example confirms this: from the single phrase "March 29th. ...I read the newspapers today, there is nothing special, Austria's war with Karl Albert has begun..." the model identified the war between Austria and the Kingdom of Sardinia of 1848-1849.
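Since the examples serve to assess each topic, and models can produce plausible but incorrect text, it can be useful to check mechanically that every cited quotation actually occurs in the diary. The sketch below is one possible way to do this and was not part of the study. The function names are hypothetical, and matching is a simple substring test after whitespace and case normalization.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase the text for tolerant matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unverified_quotes(quotes: List[str], diary: str) -> List[str]:
    """Return the quotations that cannot be found verbatim in the diary text."""
    haystack = normalize(diary)
    missing = []
    for quote in quotes:
        # Ellipses mark omissions, so every fragment between them must be present.
        fragments = [normalize(f) for f in re.split(r"\.{2,}|…", quote)]
        fragments = [f for f in fragments if f]
        if not fragments or not all(f in haystack for f in fragments):
            missing.append(quote)
    return missing
```

Quotations returned by `find_unverified_quotes` are not necessarily hallucinated, since the model may have modernized spelling or punctuation, but they are the ones worth checking by hand.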

When the model was asked to analyze the entire text at once, a pattern emerged: as examples, the model chose the first suitable quotations it encountered in the text. As a result, the overwhelming majority of examples come from January 1849.
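This kind of skew can be measured by counting the dates attached to the examples. The sketch below assumes date tags of the form "(January 4)" or "(July 8, 1849)", as in the answers quoted in this article. The sample answer is made up for illustration and is not a result of the study.

```python
import re
from collections import Counter

MONTHS = "January|February|March|April|May|June|July|August|September"
DATE_TAG = re.compile(rf"\(({MONTHS})\s+\d{{1,2}}(?:,\s*\d{{4}})?\)")


def examples_per_month(answer: str) -> Counter:
    """Count how many dated examples in a model answer fall in each month."""
    return Counter(DATE_TAG.findall(answer))


# Hypothetical answer fragment with the quotations elided.
sample_answer = (
    '"..." (January 4)\n'
    '"..." (January 7, 1849)\n'
    '"..." (January 11)\n'
    '"..." (July 8, 1849)\n'
)

counts = examples_per_month(sample_answer)
print(counts.most_common())
```

Comparing such counts between the full-text analysis and the monthly analysis shows how strongly the choice of examples depends on the position of an entry in the text.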

The level of detail of the identified topics can be effectively adjusted by refining the requests to the model. For example, if a query specifies a rarer topic, such as "A Trip to the Countryside", or focuses on a specific aspect of a broad category, for example, replacing "Political Events" with "Political Events in Europe in 1849", the model is able to identify these topics accurately and describe them in detail. This approach makes it possible not only to adapt the results of the analysis to the research tasks, but also to examine the aspects of the text that are of interest in greater depth (for the entire dialogue with the model, see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/themes_change.ipynb):

Topic title: Trip to the village

Description: The theme tells about a high school student's trip to the village of Kalinkino and Slabpikha. Kirill admires the beauty of nature, observes the changing seasons, describes rural holidays and customs. This theme reflects his sensitivity to the world around him and his characteristic interest in nature as a source of inspiration and harmony. His descriptions of a trip to Kalinkino and the surrounding area give an idea of the life of a Russian village in the middle of the XIX century, including social interactions, everyday life and holidays.

Example:

"... the location of this village is the most unfavorable, there are no beauties of nature, no stream, no river, not even a forest..." (July 8, 1849)

"A green meadow spreads wide, geese and colorful ducks walk slowly and smoothly on it..." (July 22, 1849)

"Everyone here is familiar, every peasant from whatever village he is, seeing that the gentry are coming, takes off his hat, the women with their hands pressed to their hearts bow." (July 20, 1849)

Topic title: Political events in Europe in 1849

Description: Kirill is interested in political events in Europe (revolutions of 1848-1849, the war in Hungary), follows the news in newspapers ("Petersburg Vedomosti"). This topic shows how global events were perceived in the Russian province and what influence they had on the formation of the worldview of a young nobleman. He is thinking about the possibility of military service in connection with these events.

Example:

"In Hungary, the war is in full swing, Derbechin is taken without firing a shot, the Russians are moving forward..." (July 12, 1849)

"They write that Russian troops, seven thousand, marched to Transylvania to help the Austrians against the Czechs. A republic has been established in Rome, what will come of all this?" (January 11, 1849)

"Should I join the military? The proverb says the truth, that some people have seven Fridays in the week, that's the way it is with me everywhere: I want to go to both such a service and another..." (March 3, 1849)

When asked to trace changes in the topics of the diary over time, the model gave a detailed and high-quality response, which confirmed its ability to interpret complex texts with regard to their historical and cultural context. The model's analysis was well structured, reflecting not only thematic changes but also their connection with key events in the author's life (for the entire dialogue with the model, see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/themes_change.ipynb).

For example, in its description of January 1849, the model connects the predominance of the themes of social life, romantic interests and religious reflections with the peculiarities of the winter period, including the celebration of Epiphany and the season of balls and masquerades. Kirill analyzes his emotions and reflects on his relationships with others, which is typical of the festive atmosphere and of his age. At the same time, classes resume, which leads to mentions of education. This approach shows the model's ability to take into account the layering of topics in the diary.

In February, the model identified the topic of education as the dominant one, which is related to exam preparation. The increase in entries devoted to school activities, teachers and classmates reflects the increased tension and importance of the educational process in the author's life. At the same time, the model correctly noted the influence of Lent, which strengthened the religious theme and Kirill's reflections on the meaning of life. This month, the elements of social activity associated with the Carnival festivities are also preserved, and interest in literature is developing.

The model also adequately conveyed the change in the thematic structure in April and May, when "Education and schooldays" and "Introspection and reflections on the future" come to the fore and the theme "Political events in Europe in 1849" appears. These topics are related to exam preparation, thoughts about the future and the outbreak of war in Hungary. For June, the model indicates a growing role of family and social life in connection with leaving the gymnasium and discussing future service with his parents.

Special attention should be paid to the change of themes in July, when Kirill goes to the countryside. The model correctly identifies dominant themes such as "A trip to the countryside", "Literary interests and creativity" and "Introspection and reflections on the future", linking them with descriptions of rural landscapes, observations of village life and folk customs, intensive reading and reflections on his future.

In August and September, after returning from the village, the theme "Introspection and reflections on the future" prevails, related to the author's thoughts about choosing a profession.

Thus, the model not only accurately identified the prevailing themes of each month, but also linked their changes to specific events in the life of a high school student, including his family relations, educational activities and participation in the cultural life of Vologda.

Discussion of the results

The results of the study confirmed the significant potential of large language models for the thematic analysis of historical texts. Using the example of K. A. Berezkin's diary, the Gemini 1.5 Pro model successfully identified eight key topics covering various aspects of a high school student's life in the Russian Empire of the mid-19th century. The stability of these topics has been confirmed by repeated analysis of the text, which indicates the reliability of the results obtained. The analysis revealed important features of Kirill Berezkin's worldview. His writings testify to a negative attitude towards social injustice and the ability to critically comprehend the surrounding reality. The diary reveals the inner world of the author, his complex emotional experiences related to studies, exams and various aspects of social life. At the same time, the author appears to us as a young man who is passionate about literature, fine art, observing folk traditions, trying his hand at writing.

The identified topics can become the basis for further research into the social, cultural and educational realities of the era. For example, the topic "Political events" goes beyond the simple recording of historical facts, demonstrating the author's reflections on the significance of global processes such as the European revolutions and the cholera epidemic, as well as his civic position. Similarly, the theme "Education and school life" allows us to reconstruct the educational environment of the gymnasium and its influence on the formation of a young person's personality, emphasizing how the system of instruction and interpersonal interactions shaped the young man's views.

A comparison of the results of analyzing the full text of the diary and its monthly fragments showed that splitting the text into parts makes it possible to identify more specific topics that were not obvious when the full text was analyzed. This can be explained by the fact that processing smaller text fragments allows the model to focus on local contexts, minimizing the loss of information from the middle of long texts.
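
The monthly segmentation can be sketched in Python as follows. This is only an illustrative sketch: it assumes the diary is already available as a list of (date, text) pairs, and the function name `split_by_month` and the shortened sample entries are hypothetical and are not taken from the study's repository.

```python
from collections import defaultdict
from datetime import date


def split_by_month(entries):
    """Group (date, text) diary entries into one text fragment per month."""
    by_month = defaultdict(list)
    # Sort by date so that months, and entries within a month, stay chronological.
    for entry_date, text in sorted(entries, key=lambda e: e[0]):
        by_month[(entry_date.year, entry_date.month)].append(text)
    # Each fragment could then be sent to the model as a separate prompt.
    return {key: "\n\n".join(texts) for key, texts in by_month.items()}


# Hypothetical sample entries (shortened quotations from the diary).
sample_entries = [
    (date(1849, 7, 8), "... the location of this village is the most unfavorable ..."),
    (date(1849, 7, 22), "A green meadow spreads wide ..."),
    (date(1849, 3, 3), "Should I join the military? ..."),
]

monthly_fragments = split_by_month(sample_entries)
for (year, month), fragment in monthly_fragments.items():
    print(year, month, len(fragment))
```

Each monthly fragment is much shorter than the full diary, which is the property the paragraph above relies on.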

The observed variability in the wording and detail of topics is due to the stochastic nature of large language models. Topic labels may differ in specificity: from general ones, for example "Literature and Art" (Table 1, Option 2; for the full text of the response see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/gemini_02.txt ), to narrower ones, such as "Literature and reading" or "Art and creativity (drawing)" (Table 1, Option 5; for the full text of the response see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/gemini_05.txt ). In another case, the topic "Reading and literary interests" is supplemented by a separate topic "Pastimes and hobbies", which, according to its description, covers such activities of the author as drawing, collecting a herbarium and breeding birds (Table 1, Option 1; for the full text of the response see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/gemini_01.txt ). At the same time, topics less significant for the narrative, such as "Domestic and everyday life", "Weather", "Pastimes and hobbies" and "Plans for the future", appear only in some of the model's responses.

However, this variability does not call into question the correctness of the identified topics; rather, it indicates the need for repeated analysis and careful interpretation of the results. Moreover, variability can be interpreted as an additional source of information, pointing to the "blurred" boundaries between topics and to their interrelation. In confirmation of this, at the end of one of the analyses the model sums up as follows: "These are the main topics that can be traced in Kirill Berezkin's diary. It is important to note that many of them are closely intertwined and reflect the versatility of the author's personality, as well as the complex and interesting life of the Russian province in the middle of the 19th century" (Table 1, Option 1; for the full text of the response see: https://github.com/alexeyvkuznetsov/topic_modeling_with_llm/blob/main/gemini_01.txt ).

In some cases, the model made mistakes, citing irrelevant quotations or giving incorrect dates. For example, the quotation about the entry of Russian troops into Transylvania, taken from the entry for March 1, 1849, was several times mistakenly dated January 11, 1849, as in the example above. Such inaccuracies underline the importance of a critical approach to the responses of a language model and the need to verify them using traditional methods of historical analysis.
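
Part of this verification can be automated. The sketch below is an illustration rather than the procedure used in the study: it assumes the diary is held as a dictionary mapping entry dates to entry texts, and the helper names `normalize` and `check_quote`, as well as the placeholder entry texts, are hypothetical.

```python
from datetime import date


def normalize(text):
    """Collapse whitespace and lowercase so minor formatting differences are ignored."""
    return " ".join(text.split()).lower()


def check_quote(quote, cited_date, diary):
    """Classify a model-cited quotation as 'ok', 'misdated' or 'not found'."""
    needle = normalize(quote)
    # Dates of all entries that actually contain the quoted passage.
    found_dates = sorted(d for d, text in diary.items() if needle in normalize(text))
    if cited_date in found_dates:
        return "ok", found_dates
    if found_dates:
        return "misdated", found_dates
    return "not found", found_dates


# Hypothetical stand-ins for two diary entries (not the actual source text).
diary = {
    date(1849, 1, 11): "Placeholder text standing in for the entry of January 11.",
    date(1849, 3, 1): "They write that Russian troops marched to Transylvania to help the Austrians.",
}

status, dates = check_quote("Russian troops marched to Transylvania", date(1849, 1, 11), diary)
print(status, dates)
```

Exact substring matching will miss quotations that the model has paraphrased or shortened with an ellipsis, so a "not found" result is a prompt for manual checking against the source, not proof of fabrication.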

Our experience of thematic analysis of the Vologda gymnasium student's diary allows us to formulate a number of recommendations on the use of language models for thematic analysis of texts. To increase the reliability of the results, it is recommended to analyze the same text several times. Comparing the results makes it possible to determine a stable thematic core and to assess the degree of variability in the identification of less explicit topics. For a more detailed analysis, it is useful to analyze individual parts of a large text. By changing the prompt, one can direct the model's attention to various aspects of the text and adjust the level of detail of the topic analysis.
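
The comparison of repeated runs can be sketched as a simple frequency count. The code below is a minimal illustration under two assumptions: the topic labels from each run have already been harmonized by the researcher to common names (the model's wording varies between runs), and a topic belongs to the stable core if it occurs in at least 80% of runs. The function name `stable_core`, the threshold and the sample runs are hypothetical and do not reproduce Table 1.

```python
from collections import Counter


def stable_core(runs, min_share=0.8):
    """Split topics into a stable core and a periphery by how many runs contain them."""
    # Count each topic at most once per run.
    counts = Counter(topic for run in runs for topic in set(run))
    threshold = min_share * len(runs)
    core = {t: n for t, n in counts.items() if n >= threshold}
    peripheral = {t: n for t, n in counts.items() if n < threshold}
    return core, peripheral


# Illustrative data only: five hypothetical runs with harmonized topic labels.
runs = [
    {"Education and school life", "Political events", "Weather"},
    {"Education and school life", "Political events"},
    {"Education and school life", "Political events", "Plans for the future"},
    {"Education and school life", "Political events"},
    {"Education and school life", "Political events", "Weather"},
]

core, peripheral = stable_core(runs)
print("core:", core)
print("peripheral:", peripheral)
```

Lowering `min_share` widens the core; the harmonization of labels remains an interpretative step for the historian.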

Limitations and prospects of the study

The present study, despite the results obtained, has a number of limitations. The analysis was based on a single source, the diary of K. A. Berezkin. To generalize the conclusions, similar studies are needed on a more diverse set of historical texts of various genres, periods and authors. We used only one large language model, Gemini 1.5 Pro. A comparative study of different models would give a better understanding of their capabilities and limitations. It is important to understand that language models are evolving rapidly, and corporations are competing to expand their capabilities and improve the quality of their responses. The trend of 2024 is the emergence of multimodal generative models that work with various types of content: text, images, video and music. Open-source models are also developing, lagging only slightly behind proprietary ones. The next generation of Gemini models is expected to appear in late 2024 or early 2025; their capabilities are not yet known.

The analysis of the themes of K. A. Berezkin's diary is only the first step in the study of this source. In the future, we plan to focus on two areas: in-depth analysis of specific topics identified in the diary (in particular, the perception of European events in the Russian province), and further development of methods for analyzing historical texts using large language models. The creation of specialized agents integrated with historical knowledge bases seems to be a promising direction. This approach should improve the quality of analysis and the accuracy and interpretability of the results, opening up new opportunities for historical research.

Conclusion

The research has demonstrated the significant potential of large language models as an effective tool for thematic analysis of historical sources. The analysis of Kirill Antonovich Berezkin's diary made it possible not only to identify a number of significant topics, but also to reveal the versatility of the author's personality, to reconstruct the social context of provincial Russia in the middle of the 19th century, and to gain a deeper understanding of his worldview and perception of the surrounding reality.

A comparison of the analysis of the full text of the diary and its fragments showed that splitting the text makes it possible to identify additional nuances and to avoid the information loss typical of processing long texts. The stochastic nature of language models requires careful interpretation of their responses and comparison of the results of several runs of the analysis, which makes it possible to detect a stable thematic core. A critical approach and comparison of the results obtained using large language models with those of traditional methods of analyzing historical sources are the key to the reliability and scientific value of the study.

In conclusion, it should be noted that large language models open up new perspectives for historical research, automating time-consuming stages of thematic analysis and helping to identify hidden patterns and trends in large bodies of texts. This not only speeds up the research process, but also allows us to formulate new hypotheses and form a deeper and more comprehensive understanding of the past, opening the way to a qualitatively new level of analysis of historical sources.

References
1. Berezkin, K. A. (n.d.). Diary (A. Zavina & A. Mikerin, Eds.). Electronic corpus "Prozhito". Retrieved from https://corpus.prozhito.org/person/925
2. Berezkin, K. A. (2020). Dnevnik: K 190-letiiu K. A. Berezkina [Diary: For the 190th anniversary of K. A. Berezkin] (2nd ed., supplemented). A. V. Kiselev.
3. Vorontsov, K. V. (2017). Obzor veroiatnostnykh tematicheskikh modelei [Review of probabilistic topic models]. In E. I. Bolshakova, K. V. Vorontsov, N. E. Efremova, E. S. Klyshinskii, N. V. Lukashevich, & A. S. Sapin (Eds.), Avtomaticheskaia obrabotka tekstov na estestvennom iazyke i analiz dannykh: Ucheb. posobie [Automatic processing of natural language texts and data analysis: Tutorial] (pp. 195–268). HSE Publishing House.
4. Berezkin, K. A. (2014). Dnevnik Kirilla Antonovicha Berezkina [Diary of Kirill Antonovich Berezkin]. (G. L. Shalagina, Ed.).
5. Kuznetsov, A. V. (2023). Problemy interpretatsii rezul'tatov tematicheskogo modelirovaniia v istoricheskikh issledovaniiakh [Problems of interpreting the results of topic modeling in historical research]. In Aktual'nye problemy gumanitarnykh i obshchestvennykh nauk: Sbornik statei IX Vserossiiskoi nauchno-prakticheskoi konferentsii [Actual problems of the humanities and social sciences: Collection of articles of the IX All-Russian Scientific and Practical Conference] (pp. 90–91).
6. Milkova, M. A. (2019). Tematicheskie modeli kak instrument «Dal'nego chteniia» [Topic models as a tool for "distant reading"]. Tsifrovaia ekonomika, 1(5), 57–70.
7. Agrawal, A., Fu, W., & Menzies, T. (2018). What is wrong with topic modeling? And how to fix it using search-based software engineering. Information and Software Technology, 98, 74–88.
8. Althage, M. (2022). Potenziale und Grenzen der Topic-Modellierung mit Latent Dirichlet Allocation für die Digital History. Digital History: Konzepte, Methoden und Kritiken Digitaler Geschichtswissenschaft, 6(24), 255–277.
9. Binkley, D., Heinz, D., Lawrie, D., & Overfelt, J. (2014). Understanding LDA in source code analysis. Proceedings of the 22nd international conference on program comprehension (pp. 26–36). ACM.
10. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
11. Blevins, C. (2011). Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard. Digital Humanities Quarterly, 5(1). Retrieved from http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html
12. Bodrunova, S. S. (2021). Topic Modeling in Russia: Current Approaches and Issues in Methodology. The Palgrave Handbook of Digital Russia Studies (pp. 409–426). Palgrave Macmillan.
13. de Melo, T., & Merialdo, P. (2024). Beyond Topic Modeling: Comparative Evaluation of Topic Interpretation by Large Language Models. Intelligent Systems Conference (pp. 215–230). Springer Nature Switzerland.
14. Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis, 26(2), 168–189.
15. Doi, T., Isonuma, M., & Yanaka, H. (2024). Topic Modeling for Short Texts with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop) (pp. 21–33).
16. Garcia, G. G., & Weilbach, C. (2023). If the Sources Could Talk: Evaluating Large Language Models for Research Assistance in History. arXiv preprint arXiv:2310.10808. Retrieved from https://arxiv.org/abs/2310.10808
17. Gillings, M., & Hardie, A. (2023). The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice. Digital Scholarship in the Humanities, 38(2), 530–543.
18. Goldstone, A., & Underwood, T. (2014). The quiet transformations of literary studies: What thirteen thousand scholars could tell us. New Literary History, 45(3), 359–384.
19. Götter, C. (2024). Künstliche Intelligenz und Geschichtswissenschaft. Mehr als ein neues Werkzeug für die Digital History? Historische Zeitschrift, 319(2), 299–330.
20. Huang, L., et al. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232. Retrieved from https://arxiv.org/abs/2311.05232
21. Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750–769.
22. Kansteiner, W. (2022). Digital doping for historians: Can history, memory, and historical theory be rendered artificially intelligent? History and Theory, 61(4), 119–133.
23. Kirilenko, A. P., & Stepchenkova, S. (2025). Facilitating topic modeling in tourism research: Comprehensive comparison of new AI technologies. Tourism Management, 106, Article 105007. Retrieved from https://doi.org/10.1016/j.tourman.2024.105007
24. König, M. (2024). ChatGPT und Co. in den Geschichtswissenschaften – Grundlagen, Prompts und Praxisbeispiele. Digital Humanities am DHIP. Retrieved from https://dhdhi.hypotheses.org/9197
25. Li, D., Zhang, B., & Zhou, Y. (2023). Can large language models (LLM) label topics from a topic model? SocArXiv. Retrieved from https://doi.org/10.31235/osf.io/23x4m
26. Liu, N. F., et al. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173.
27. McGillivray, B., Buning, B., & Hengchen, S. (2019). Topic Modelling: Hartlib's Correspondence before and after 1650. In Reassembling the Republic of Letters in the Digital Age (pp. 426–428). Vandenhoeck & Ruprecht.
28. Mu, Y., et al. (2024). Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling. arXiv preprint arXiv:2405.00611. Retrieved from https://arxiv.org/abs/2405.00611
29. Mu, Y., et al. (2024). Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 10160–10171.
30. Newman, D. J., & Block, S. (2006). Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper. Journal of the American Society for Information Science and Technology, 57(6), 753–767.
31. Oiva, M. (2021). Topic Modeling Russian History. The Palgrave Handbook of Digital Russia Studies (pp. 427–442). Palgrave Macmillan.
32. Pham, C., et al. (2024). TopicGPT: A Prompt-based Topic Modeling Framework. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2956–2984.
33. Phoenix, J., & Taylor, M. (2024). Prompt Engineering for Generative AI: Future-Proof Inputs for Reliable AI Outputs at Scale. O’Reilly Media, Inc.
34. Pollin, Ch. (n.d.). Workshopreihe «Angewandte Generative KI in den (digitalen) Geisteswissenschaften» (v1.1.0). Zenodo. Retrieved from https://doi.org/10.5281/zenodo.10647754
35. Pu, X., Gao, M., & Wan, X. (2023). Summarization is (almost) dead. arXiv preprint arXiv:2309.09558. Retrieved from https://arxiv.org/abs/2309.09558
36. Schulhoff, S., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv preprint arXiv:2406.06608. Retrieved from https://arxiv.org/abs/2406.06608
37. Solanki, S. R., & Khublani, D. K. (2024). Generative Artificial Intelligence: Exploring the Power and Potential of Generative AI. Apress.
38. Stammbach, D., et al. (2023). Revisiting automated topic model evaluation with large language models. arXiv preprint arXiv:2305.12152. Retrieved from https://arxiv.org/abs/2305.12152
39. Villamor Martin, M., Kirsch, D. A., & Prieto-Nañez, F. (2023). The promise of machine-learning-driven text analysis techniques for historical research: Topic modeling and word embedding. Management & Organizational History, 18(1), 81–96. Retrieved from https://doi.org/10.1177/17449359221148596
40. Wang, H., et al. (2023). Prompting large language models for topic modeling. 2023 IEEE International Conference on Big Data (BigData), 1236–1241. Retrieved from https://doi.org/10.1109/BigData55660.2023.10131865
41. Wehrheim, L. (2019). Economic History Goes Digital: Topic Modeling the Journal of Economic History. Cliometrica, 13(1), 83–125. Retrieved from https://doi.org/10.1007/s11698-018-0166-3
42. Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817. Retrieved from https://arxiv.org/abs/2401.11817

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The article is devoted to an advanced method of text analysis: topic modeling. Unlike content analysis, which is familiar to historians, topic modeling is not yet common in historical research, which makes experiments with this kind of computerized text analysis all the more interesting. The article convincingly shows that over the past decade topic modeling has been tested on very different historical sources, from newspaper materials to diaries, letters, periodicals, collections of articles and literary corpora. As the author rightly notes, topic modeling requires a lengthy text, preferably containing a large amount of data, which should also preferably be pre-processed by the researcher. The historical source for the methodological experiment was the diary of the nobleman Kirill Antonovich Berezkin, which describes most of 1849 and is available both as a publication edited by G. L. Shalagina (St. Petersburg; Vologda, 2020) and in the electronic corpus of diaries "Prozhito" (https://corpus.prozhito.org/person/925 ). The author of the article emphasizes that large language models, which are able to capture a very detailed context of statements in historical sources, can significantly enhance topic modeling. The article rightly notes that two approaches are presented in the literature: parallel and sequential prompting. Considering current approaches to using large language models for research tasks in history, it makes sense to pay attention to options for building one's own models, which is already possible without large computing resources (as shown, for example, in the book Raschka S. Build a Large Language Model. Manning, 2024. 368 p.).
The study used the Gemini 1.5 Pro model, which can process up to 2 million tokens of input, so that in practice the entire text of the source under study can be loaded at once. The article takes into account important features of working with historical sources; indeed, "the analysis of historical documents requires taking into account not only the text, but also the cultural, historical and social context of the epoch." The work also applies a resourceful methodological solution, dividing the prompt into a system (role-setting) part and a user (research) part. The author has chosen a methodically balanced approach, taking into account the frequent loss of content in the middle of long texts, reducing the model's probabilistic "hallucinations" by lowering the "temperature" parameter, and testing different levels of detail in the model's responses. The study showed that identifying stable thematic groups requires repeated calls to the model, because the model works probabilistically rather than algorithmically. In five iterations of the analysis, the model consistently identified eight main topics, each of which reflects essential aspects of Berezkin's diary entries. An important advantage of this work is that all key experimental results are available as files in the GitHub repository. The main question that remains after reading this interesting and useful experiment in topic modeling enhanced by a large language model (Gemini 1.5 Pro) is whether it is possible to go beyond automating analytical procedures that the researcher already understands and to automate interpretative research tasks as well. For example, an obvious task that arises when using language models is whether detailed and reliable biographical information can be compiled from diary entries.
But this is a question for the author's future works. The experiment proposed for publication was quite successful. The article can certainly be recommended for publication in the journal "Historical Informatics". The conclusions are independent, well-founded and pioneering, and the article is sure to arouse great interest among readers.