Galushko I.N. —
The use of topic modeling to optimize the process of searching for relevant historical documents (on the example of the stock exchange press of the early 20th century)
// Historical informatics. – 2023. – № 2.
– P. 129–144.
DOI: 10.7256/2585-7797.2023.2.43466
URL: https://en.e-notabene.ru/istinf/article_43466.html
Abstract: The key task of this article is to test how topic modeling can be used to assess the information potential of a collection of historical sources. Some modern collections of digitized historical materials number tens of thousands of documents, and an individual researcher can hardly survey the available holdings. Following a number of researchers, we suggest that topic modeling can serve as a convenient tool for a preliminary assessment of the content of a collection of historical documents and for selecting only those documents that contain information relevant to the research tasks. In our case, the newspaper Birzhevye Vedomosti was chosen as the main collection of historical documents. At this stage, we can confirm that the use of topic modeling proved a productive solution for optimizing the search for historical documents in a large collection of digitized materials. At the same time, it should be emphasized that topic modeling was used exclusively as an applied tool for a primary assessment of the information potential of a document collection through the analysis of the extracted topics. Our experience has shown that, at least for Birzhevye Vedomosti, LDA topic modeling does not support conclusions within our content analysis methodology: the model output is too fragmentary and can only be used for an initial assessment of the topics that describe the information contained in the source.
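The workflow the abstract describes (fit an LDA model, inspect the topics, use them to judge what a collection contains) can be illustrated with a minimal sketch. The corpus below is a toy stand-in, not the Birzhevye Vedomosti data, and the article presumably uses standard LDA tooling rather than this hand-rolled collapsed Gibbs sampler; the sketch only shows the mechanics of the technique.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy corpus: a stand-in for tokenized newspaper issues (hypothetical tokens).
docs = [
    "shares bank credit loan shares market".split(),
    "grain wheat harvest export grain price".split(),
    "bank credit deposit loan interest bank".split(),
    "wheat grain export harvest railway grain".split(),
]

K = 2                     # number of topics
alpha, beta = 0.1, 0.01   # symmetric Dirichlet priors
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# z[d][i] is the topic of word i in document d; the counters below
# are the sufficient statistics for the collapsed Gibbs sampler.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]                 # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]    # topic-word counts
nk = [0] * K                                  # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(200):                          # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this token, resample its topic, and add it back.
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                       for t in range(K)]
            r = random.random() * sum(weights)
            for t, wt in enumerate(weights):
                r -= wt
                if r <= 0:
                    k = t
                    break
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# The "preliminary assessment" step: read the top words of each topic
# to decide which documents are worth a close reading.
for t in range(K):
    top = sorted(vocab, key=lambda w: -nkw[t][w])[:3]
    print(f"topic {t}: {top}")
```

In practice one would use a library implementation (e.g. gensim's LdaModel) on the full collection, then rank issues by the weight of the topic of interest to shortlist documents for manual study.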
Galushko I.N. —
Correcting OCR Recognition of the Historical Sources Texts Using Fuzzy Sets (on the Example of an Early 20th Century Newspaper)
// Historical informatics. – 2023. – № 1.
– P. 102–113.
DOI: 10.7256/2585-7797.2023.1.40387
URL: https://en.e-notabene.ru/istinf/article_40387.html
Abstract: This article presents an attempt to apply NLP methods to optimize text recognition for historical sources. Any researcher who decides to use scanned-text recognition tools will face a number of limitations on the accuracy of the pipeline (the sequence of recognition operations). Even the most carefully trained models can produce significant errors owing to the unsatisfactory state of the surviving source: cuts, folds, blots, and erased letters all interfere with high-quality recognition. Our approach is to use a predetermined set of words marking the presence of the study topic, together with a fuzzy-matching ("fuzzy sets") module for spaCy, to restore words that were recognized with mistakes. To check the quality of the text restoration procedure on a sample of 50 newspaper issues, we estimated the number of words that would have been excluded from semantic analysis because of incorrect recognition. All metrics were also calculated using fuzzy-set patterns. It turned out that, on average, 119.6 words per issue (mean over 50 issues) contain misprints caused by incorrect recognition. Using fuzzy-matching algorithms, we managed to restore these words and include them in the semantic analysis.
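The core idea (match OCR-garbled tokens against a predefined marker lexicon and restore the ones that fall within a similarity threshold) can be sketched without the spaCy-specific tooling the article uses. The sketch below substitutes the standard library's difflib for the spaCy fuzzy-matching module, and the marker words and OCR tokens are invented examples, not the article's actual lexicon.

```python
import difflib

# Hypothetical marker lexicon for the study topic (stand-in terms only).
markers = ["shares", "exchange", "dividend", "bank", "credit"]

def restore(token: str, lexicon: list[str], cutoff: float = 0.8) -> str:
    """Map an OCR-garbled token to its closest lexicon word, if any
    candidate clears the similarity cutoff; otherwise keep the token."""
    match = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

# Tokens as an OCR pipeline might emit them: three garbled marker
# words and one out-of-lexicon word that should pass through unchanged.
ocr_tokens = ["sharcs", "exchangc", "divident", "railway"]
restored = [restore(t, markers) for t in ocr_tokens]
print(restored)  # e.g. ['shares', 'exchange', 'dividend', 'railway']
```

The cutoff trades recall against false restorations: too low, and unrelated words get overwritten by marker terms; too high, and badly damaged tokens stay excluded from the semantic analysis.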