
Historical informatics
Reference:

Artificial intelligence technologies in the formation of the archival environment: problems and prospects

Mashchenko Natalia Evgen'evna

ORCID: 0000-0003-0126-545X

PhD in Economics

Associate Professor in the scientific specialty 'Documentary, Documentation and Archival Science'; Associate Professor; Department of Information Management Systems; Donetsk State University

198A Chelyuskintsev str., Donetsk People's Republic, 283015, Russia

maschenko_n@mail.ru
Gaidar Elena Valentinovna

ORCID: 0009-0008-3353-8831

PhD in Economics

Associate Professor; Department of Information Management Systems; Donetsk State University

198A Chelyuskintsev str., Donetsk People's Republic, 283015, Russia

e.gaydar.dongu@mail.ru

DOI:

10.7256/2585-7797.2025.1.73393

EDN:

QEIGBR

Received:

17-02-2025


Published:

17-04-2025


Abstract: The authors studied the prospects of using artificial intelligence (AI) technologies to create and develop a digital archival environment, as well as their impact on the optimization and automation of archived data management processes. The main purpose of the work is to analyze modern digital solutions aimed at improving the processes of storing, searching and processing archival documents (including handwritten, damaged, multilingual). The paper explores key technologies used in digital archives, including intelligent scanning, natural language processing (NLP), computer vision, machine learning, and intelligent search methods. Special attention is paid to the problems of loss of archival materials, the need to restore them, ensure data security and accessibility, which is especially important in an unstable political situation and limited resources for new territories. The research is based on a systematic analysis of modern information technologies and their application in the archival business. The work uses methods of comparative analysis, classification and forecasting, which allows us to identify key areas of AI implementation in the archival field. The novelty of the work lies in an integrated approach to analyzing the use of AI in the archival field, identifying problematic aspects of archive digitalization, and proposing automation of the processes of storing, processing, and searching archival data. It is concluded that artificial intelligence technologies can significantly improve the efficiency of archives, providing accelerated document processing, intelligent classification, data protection and convenient access to information. In addition, the need to develop new algorithms based on machine learning is emphasized, which will improve the recognition of handwritten texts, the processing of corrupted documents and multilingual archival materials. 
The introduction of such technologies is becoming an important part of the digital transformation strategy of archival affairs and plays a key role in preserving historical heritage.


Keywords:

archives, digital archival environment, digital transformation, artificial intelligence, machine learning, computer vision, natural language processing, data security, intelligent scanning, predictive intelligence

This article is automatically translated.

Introduction. Archival systems are faced with increasing amounts of data, a variety of formats, and demands for reliability and accessibility of information. In these circumstances, the introduction of artificial intelligence is becoming a strategically important step to create an effective archival environment capable of meeting the challenges of the digital age.

The archival environment is a complex system that includes archival institutions and archival units, as well as a set of archival materials that form an archival space and ensure the preservation, processing and use of archival data through various methods and technologies under the influence of certain factors [1].

The main part.

The current stage of social development is defined by digital transformation: organizations handle huge volumes of information, have access to global innovation processes, and use a widening range of information technologies. In a digital environment, developing and running a business effectively is impossible without modern information systems and technologies [2].

The formation of a digital archival environment involves a comprehensive process of transition of archival institutions and organizations to digital technologies, aimed at ensuring the safety, accessibility and ease of use of archival data. This is an important stage for the archival field, which includes both technical and organizational changes that contribute to improving the work with documents in digital format.

The digitalization of archives is beginning to actively use artificial intelligence tools.

Artificial intelligence makes it possible to automate, optimize, and improve processes related to the creation, storage, retrieval, and analysis of archived data. The main tasks solved with the help of AI are: accelerating the processing of large amounts of data; providing intelligent search and archive management; improving information security and protection; automating the processes of document classification and indexing [3,4].

These capabilities make AI a key tool for building a modern archive environment.

The formation of a digital archival environment in the DPR is an important part of the strategy for modernizing state and municipal institutions, as well as ensuring the safety, accessibility and security of archival data. Archives, as custodians of historical, legal and cultural information, play a key role in the socio-economic development of the region. In an unstable political situation and limited resources, the transition to digital technologies helps to solve the problems of preserving and simplifying access to archival materials, increasing the efficiency of their management.

Digitalization of archives in the DPR is becoming not only a way to improve archival work, but also an important component of national security and state policy for the preservation of historical heritage. In this context, creating a digital archive environment involves several key steps, from creating digital replicas of documents to protecting them using modern technologies.

In the context of military operations on the territory of the Donetsk People's Republic, as well as in other liberated territories, the problem of damage and loss of archives and archival documents remains extremely urgent. The loss of documents leads to serious difficulties for citizens in the process of ensuring their civil rights and freedoms.

For example, in Mariupol, Avdiivka and a number of other cities, restoring identity documents involves lengthy and complex procedures, and in some cases employment history and property rights cannot be restored at all. The archives of organizations were often destroyed by fires and explosions, along with employment record books, personal files and other documents, which makes it impossible for citizens to confirm their employment history, level of education and other important biographical facts.

In addition, during evacuation from settlements, organizations left documents and archives behind in unsuitable conditions, for example scattered in bags or basements, so that they now require sorting, processing and transfer to archives.

Owing to a lack of funding and other organizational factors, the archival institutions of Donetsk were already in poor condition under the previous Ukrainian leadership. The archive buildings are physically deteriorated: roofs leak, temperature and humidity requirements are not met, and other conditions necessary for preserving documents are violated. This has led to damage and to partial or complete loss of archival materials: documents were flooded, crumbled, and so on.

Citizens, in turn, often do not know where to turn to find the information needed to restore documents. This leads to serious social consequences: people are forced to live without documents, which limits their access to social benefits, housing and other rights, exacerbating social disadvantage and provoking conflicts.

The escalation of the military-political situation in the region indicates a high probability of a repeat of such scenarios in the future. Thus, to solve these problems, it is necessary to develop and implement effective algorithms and tools based on artificial intelligence technologies, which will significantly speed up and simplify the recovery of archived data.

To prevent such scenarios in the future, large-scale digitization of archival documents is necessary. This involves introducing modern archival information systems, using specialized scanning devices, and applying artificial intelligence tools to automate processing, classification, search, analytics and security.

In the context of digital transformation, the possibility of using artificial intelligence-based solutions is one of the main goals and a priority area of scientific and technological development of the Russian Federation for the next 10-15 years according to Decree of the President of the Russian Federation dated December 1, 2016 No. 642 "On the Strategy of Scientific and Technological Development of the Russian Federation" [5].

According to the U.S. Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence of October 30, 2023, an "AI model" is a component of an information system that implements AI technology and uses computational, statistical, or machine learning techniques to produce outputs from a given set of inputs.

Let's highlight the main AI tools that can be used in such situations.

1. One of the most promising areas is intelligent scanning and recognition of archived documents. Here AI saves employees significant time and reduces the influence of the so-called human factor, that is, the number of unexpected errors at the data processing stage [6].

Intelligent scanning is the process by which documents are converted to digital format using image analysis and processing technologies. Recognition of archived documents involves the use of artificial intelligence (AI) and machine learning algorithms to extract textual and structural information from digital images.

These technologies combine tools such as:

optical character recognition (OCR): widely used open-source engines include PyLaia, Kraken, Calamari, and Tesseract, whose source code and documentation are publicly available on GitHub [7,8];

natural language processing (NLP): text analysis to extract key information; this includes cloud services for AI-based document analysis and processing (Amazon Textract, Google Cloud Vision) and GPT-type models from OpenAI, which are used to process and analyze text, including extracting meaning and creating annotations;

machine vision: recognition of shapes, seals, and other visual elements.

Intelligent recognition technologies are particularly innovative in handwritten text recognition (HTR), especially for the complex or non-standard handwriting typical of old documents, where high accuracy requires additional processing and model training. M. Terras [9] describes the adoption of the Transkribus platform for transcribing historical manuscripts.

Special attention should be paid to the processing of damaged materials, where the process is complicated by the presence of stains, tears, faded or burnt text. In such cases, special algorithms are required that can adapt to partial loss of information.
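As a minimal illustration of adapting to partial loss of information, the sketch below restores words with unreadable characters by matching them against a known vocabulary. It assumes that an earlier recognition step marks every unreadable character with `?`. The function name `restore_candidates` and the small vocabulary are hypothetical, and a production system would more likely rely on a trained language model than on pattern matching.

```python
import re

def restore_candidates(damaged_word, vocabulary):
    """Return vocabulary words consistent with a partially unreadable word.

    Assumption: each unreadable character was marked with '?' upstream.
    """
    # Every '?' becomes a single-character wildcard; other characters are escaped.
    pattern = "".join("." if ch == "?" else re.escape(ch) for ch in damaged_word)
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(word for word in vocabulary if regex.fullmatch(word))

# Hypothetical vocabulary built from documents that were already transcribed.
VOCABULARY = {"archive", "archival", "certificate", "employment", "employer"}

print(restore_candidates("arch??e", VOCABULARY))
print(restore_candidates("employ??", VOCABULARY))
```

When several candidates fit, the surrounding context would have to decide between them, which is where statistical or neural language models become useful.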

Multilingualism (working with documents presented in different languages, for example, in Ukrainian, which is relevant for new territories) is a separate problem for intelligent recognition, which involves the use of adapted language models, lexical and grammatical signs.
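One simple lexical sign for routing a document to the right language model is the presence of letters that occur in only one of the two alphabets. The sketch below is a rough heuristic under that assumption, not a full language identifier. The function name `guess_language` is hypothetical, and short texts without any distinctive letters remain undecided.

```python
# Letters found in the Ukrainian alphabet but not the Russian one: і, ї, є, ґ.
UKRAINIAN_ONLY = set("\u0456\u0457\u0454\u0491")
# Letters found in the Russian alphabet but not the Ukrainian one: ы, э, ъ, ё.
RUSSIAN_ONLY = set("\u044b\u044d\u044a\u0451")

def guess_language(text):
    """Guess 'uk' or 'ru' from alphabet-specific letters, else 'unknown'."""
    lowered = text.lower()
    uk_hits = sum(ch in UKRAINIAN_ONLY for ch in lowered)
    ru_hits = sum(ch in RUSSIAN_ONLY for ch in lowered)
    if uk_hits > ru_hits:
        return "uk"
    if ru_hits > uk_hits:
        return "ru"
    return "unknown"

# "dovidka" (Ukrainian for "certificate") and "vypiska" (Russian for "extract").
print(guess_language("\u0434\u043e\u0432\u0456\u0434\u043a\u0430"))
print(guess_language("\u0432\u044b\u043f\u0438\u0441\u043a\u0430"))
```

Such a check could run before recognition post-processing so that each page is corrected against the dictionary of its own language.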

Finally, the large amount of data in archives, which often includes millions of documents, necessitates the use of high-performance computing resources for efficient processing, storage, and analysis of information arrays.

2. Intelligent archive classification is the process of automatically dividing archival materials into categories using advanced AI technologies: machine learning (ML), natural language processing (NLP) and computer vision. This process differs from traditional classification in that AI analyzes not only the textual and visual elements of documents, but also their context, meaning, and interrelationships. This approach significantly improves the organization and accessibility of archived data.

Machine learning algorithms play a key role in classifying archived data. The models are trained using examples of already sorted data, which allows the system to learn how to automatically classify new documents. The following approaches are used for this:

supervised learning: models are trained on pre-labeled data, where each document already has a corresponding label (for example, "financial report", "scientific article", "personal letter");

unsupervised learning: used for clustering materials when it is not known in advance which category a document belongs to; the system automatically groups materials according to similar features;

semi-supervised learning: the system is first trained on a small set of labeled data and then uses unlabeled new data to refine the classification.

For example, E. Shang et al. [10] presented an effective method for classifying archives using XGBoost and Spark computing.
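The supervised approach can be sketched with a small multinomial naive Bayes classifier. This is a deliberately simple stand-in chosen so the example runs without external libraries; it is not the XGBoost and Spark method of [10]. The training texts and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training examples with the labels mentioned in the text.
TRAINING = [
    ("annual balance sheet and income statement", "financial report"),
    ("quarterly revenue and expenses summary", "financial report"),
    ("dear friend thank you for your letter", "personal letter"),
    ("my dear mother I miss you and the family", "personal letter"),
]
model = train(TRAINING)
print(classify("statement of revenue and expenses", model))
print(classify("dear brother thank you", model))
```

With real archival data the same interface would be backed by a stronger model and a much larger labeled set, but the division into a training step and a classification step stays the same.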

Natural language processing technologies help AI "understand" the text content of documents and classify them by meaning, not just by keywords. For example, AI can distinguish between legal documents, scientific articles, and letters based on their content. The main NLP methods used for classification include:

topic modeling: dividing documents into categories related to their topic (for example, politics, economics, art) [11];

sentiment analysis: assessment of the emotional tone of a document, which helps classify documents by type (for example, official, personal, business);

keyword and key-phrase extraction: helps create indexes and tags that make documents easier to search and sort.
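Keyword extraction, the last method in the list, can be illustrated with TF-IDF weighting, one common way to score how characteristic a word is for a document. The sketch assumes whitespace tokenization and an invented three-document collection; the function name `top_keywords` is hypothetical.

```python
import math
from collections import Counter

def top_keywords(documents, doc_index, k=3):
    """Rank the words of one document by TF-IDF against the whole collection."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: the number of documents in which each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    term_freq = Counter(tokenized[doc_index])
    n_docs = len(documents)
    scores = {
        word: (count / len(tokenized[doc_index])) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

DOCS = [
    "the court ruling on the property dispute",
    "the factory payroll and the payroll ledger",
    "the letter about the harvest",
]
print(top_keywords(DOCS, 1, k=1))
```

Words that appear in every document, such as "the" here, receive a score of zero and therefore never surface as tags.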

In archives containing visual materials (photographs, maps, drawings, etc.), AI is used to classify images. With the help of computer vision, the system is able to recognize objects in images and classify them into categories. For example, photos can be automatically divided into groups (for example, "portraits", "landscapes", "events"). This is possible through the use of Deep Learning algorithms: Neural networks (for example, convolutional neural networks, CNNs) can recognize objects and scenes in images, and recurrent neural networks (RNNs) can work with sequential text data, such as documents with a large amount of text or historical manuscripts.

K. Carter et al. [12] used a two-step supervised machine learning approach that relies on tools such as Google AutoML Vision to process the page table of contents. A. Männistö et al. [13] introduced the Automatic Image Content Extraction (AICE) platform.

Intelligent classification also relies on text recognition in images: OCR extracts text from documents such as letters and books so that they can be classified by content. In archives containing audio and video materials, speech recognition converts audio to text, and video stream analysis is used to classify the scenes, objects, or events that appear in the video.

The process of intelligent classification of archival materials includes several stages.

At the stage of data preprocessing, the system cleans and prepares archived data for analysis. In the case of text data, this can be noise removal (unnecessary characters, spaces), error correction and text formatting, for images – image quality improvement, alignment or restoration of damaged materials; for multimedia – sound cleaning, data synchronization.

At the stage of feature extraction, the system analyzes the content of documents and extracts the features that are significant for classification: for texts, keywords, topics, phrases, addresses, names, etc.; for images, recognized objects, photographs, faces, text, etc.; for multimedia, speech converted into text and context analysis.

At the model training stage, the system learns from examples to recognize the structure of the data and builds a model for classifying archival materials based on machine learning algorithms. Training includes building classification models such as neural networks, decision trees, and clustering algorithms; testing the model on new data to verify its effectiveness and accuracy; and gradual refinement as new data becomes available.

Classification and categorization – after completing the training, the system applies the acquired knowledge to analyze new archival materials. Based on the knowledge gained, the system classifies new data into specified categories. This can be done by tags such as "financial documents", "historical records", "scientific articles", etc. The system can integrate with archive systems, automatically adding categories and helping employees navigate data faster.

Automatic indexing is the final stage of classification, where metadata is assigned to classified materials to simplify search and management. All classified documents or materials are automatically indexed and metadata is obtained, which simplifies their further search. Classified and indexed materials become searchable using simple queries or complex filters.
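The final indexing stage can be sketched as an inverted index that maps each word to the documents containing it. The record layout (a `text` field plus an assigned `category`) and the document identifiers are assumptions made for this illustration, not a description of any particular archival system.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, record in documents.items():
        # Index the text and the assigned category so that both are searchable.
        for word in (record["text"] + " " + record["category"]).lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical classified records produced by the earlier stages.
DOCUMENTS = {
    "doc-1": {"text": "payroll ledger of the machine plant", "category": "financial"},
    "doc-2": {"text": "personnel order on hiring", "category": "personnel"},
    "doc-3": {"text": "payroll order for december", "category": "financial"},
}
index = build_index(DOCUMENTS)
print(search(index, "payroll"))
print(search(index, "financial order"))
```

Because the category label is indexed together with the text, a query can combine a content word with a category, which corresponds to the "simple queries or complex filters" mentioned above.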

3. Intelligent search of archived documents is the process of extracting information from archived data using AI, NLP, ML technologies and other advanced approaches. It involves not only keyword search, but also a deeper understanding of the context, subject matter, and relationship between documents and their contents. The user can ask questions in free form, and the AI selects relevant documents based on the context. This approach significantly improves the efficiency of working with large amounts of data and improves the accessibility of archives for users.

NLP technologies allow analyzing and understanding the text content of archived documents. NLP helps you identify keywords, topics, and entities; analyze the syntax and grammar of a text; extract meaning and context from complex phrases; and apply synonyms and contextual terms to improve search results.

A. Alothman and A. Sait [14] implemented a ranking algorithm for efficient document retrieval. M. Modiba [15] examines several advantages of efficient information retrieval, with a particular focus on records management.

ML algorithms are trained on historical data and can improve their results over time by adapting to user requests.

Semantic search, for example, takes into account not only exact word matches but also the meaning of words in context. The system understands that "financial report" and "balance sheet" are different phrases that nevertheless belong to the same subject area.

Semantic models include:

models based on vector representations of words (Word2Vec, GloVe): these models represent words as vectors, which makes it possible to identify semantic links between them;

transformers (BERT, GPT): modern transformer models allow for contextual text analysis, which significantly improves search quality.
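The idea behind vector-based semantic search can be shown with cosine similarity. The three-dimensional vectors below are hand-made purely for illustration; real embeddings are produced by a trained model such as those listed above and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings, invented so that related phrases point the same way.
EMBEDDINGS = {
    "financial report": [0.9, 0.1, 0.0],
    "balance sheet": [0.8, 0.2, 0.1],
    "personal letter": [0.1, 0.9, 0.2],
}

def most_similar(query, embeddings):
    """Return the phrase whose vector is closest to the query's vector."""
    q = embeddings[query]
    scored = [(cosine(q, vec), name) for name, vec in embeddings.items() if name != query]
    return max(scored)[1]

print(most_similar("financial report", EMBEDDINGS))
```

The phrases share no words, yet the vectors place "financial report" next to "balance sheet", which is the effect described in the semantic search example.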

Archives can use the entity recognition method, which is used to extract specialized terms, names, dates, and other important aspects that can be used to improve the search. For example, AI can recognize the names of authors, organizations, geographical objects and events, which helps to classify documents more accurately.
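A rule-based sketch of entity extraction is shown below. It is a simplified stand-in for a trained named entity recognition model and rests on two assumptions about the documents: dates are written as DD.MM.YYYY, and persons appear as a surname followed by two initials. The sample sentence is invented.

```python
import re

# Assumed formats: "03.09.1987" for dates and "Petrov A. V." for persons.
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z]\.\s?[A-Z]\.")

def extract_entities(text):
    """Extract dates (normalized to YYYY-MM-DD) and personal names from text."""
    dates = [f"{year}-{month}-{day}" for day, month, year in DATE_RE.findall(text)]
    names = NAME_RE.findall(text)
    return {"dates": dates, "names": names}

sample = "Order No. 15 of 03.09.1987: Petrov A. V. is transferred to the assembly shop."
print(extract_entities(sample))
```

Normalizing dates to a single format at extraction time makes later filtering by period straightforward, whatever notation the original document used.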

Using identification and categorization methods by document type, AI can automatically recognize the document type (for example, a report, letter, contract, scientific article) and apply appropriate filters and categories to simplify the search.

AI systems can also extract and analyze metadata (author, date, keywords, subject), which allows you to speed up the search and improve the accuracy of the results.

Archives often contain not only text documents, but also images, audio, video, and other types of data. Intelligent search can integrate data from different sources and analyze multimedia materials:

image and video analysis: computer vision can be used to recognize text in images (OCR) and to analyze the content of images and video for search purposes;

speech recognition: video and audio files can be processed using speech recognition technology, which makes it possible to extract textual information from audiovisual data.

Y. Yang [16] presented an improved text-to-video retrieval model designed specifically for audiovisual archives, demonstrating progress in multimedia information management.

4. Intelligent analytics and forecasting. Using machine learning allows you to identify patterns in archived data, predict user needs, or optimize storage processes.

Using data mining technologies, archival systems can automatically analyze the structure and content of archives, classify documents, identify key topics, and identify relationships between different materials. This simplifies access to information and increases the accuracy of the search.

Using machine learning algorithms, archives can predict the condition of documents at risk of damage, for example, due to non-compliance with storage conditions. This allows timely measures to be taken to preserve them.

AI systems are able to analyze user requests, identifying the most sought-after documents or topics. This data is used to optimize the work of archives, for example, to prioritize the digitization of certain materials.
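As a minimal sketch of this prioritization, the code below counts how often each collection was requested and ranks the collections that are not yet digitized. The query log, the catalog structure and the function name `digitization_priorities` are hypothetical.

```python
from collections import Counter

def digitization_priorities(query_log, catalog, top_n=2):
    """Rank not-yet-digitized collections by how often users requested them."""
    demand = Counter(query_log)
    pending = [
        (demand[name], name) for name, digitized in catalog.items() if not digitized
    ]
    # Most requested first; ties are broken alphabetically.
    pending.sort(key=lambda item: (-item[0], item[1]))
    return [name for count, name in pending[:top_n] if count > 0]

# Hypothetical catalog: collection name -> already digitized?
CATALOG = {
    "employment records": False,
    "land registry": False,
    "city council minutes": True,
    "school records": False,
}
QUERY_LOG = [
    "employment records", "land registry", "employment records",
    "city council minutes", "employment records", "land registry",
]
print(digitization_priorities(QUERY_LOG, CATALOG))
```

A learned model could replace the plain count, for example by weighting recent requests more heavily, without changing how the result is used.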

AI systems can monitor and manage archive collections. Based on data about storage conditions such as temperature, humidity, or light levels, intelligent systems can offer recommendations for improving conditions and preventing document destruction.
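The recommendation step can be sketched as a rule-based baseline that a learned model could later replace. The temperature and humidity limits below are illustrative assumptions only; real limits would come from the storage rules that apply to the archive in question.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    room: str
    temperature_c: float
    humidity_pct: float

# Illustrative limits (assumptions), not normative values.
TEMP_RANGE = (17.0, 19.0)
HUMIDITY_RANGE = (50.0, 55.0)

def recommendations(reading):
    """Return a list of corrective actions for one sensor reading."""
    advice = []
    if reading.temperature_c < TEMP_RANGE[0]:
        advice.append("raise temperature")
    elif reading.temperature_c > TEMP_RANGE[1]:
        advice.append("lower temperature")
    if reading.humidity_pct < HUMIDITY_RANGE[0]:
        advice.append("raise humidity")
    elif reading.humidity_pct > HUMIDITY_RANGE[1]:
        advice.append("lower humidity")
    return advice

print(recommendations(Reading("vault-2", 23.5, 71.0)))
```

An empty list means the reading is within the configured limits, so the same function can drive both alerts and routine reports.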

Using historical data, archives can predict trends in user requests, that is, which topics or materials will become in demand in the future, for example, in connection with social or political events.

AI-based analytical systems can offer solutions for optimal resource allocation, such as digitization, restoration, or organization of new exhibitions.

AI is able to ensure data security, detect suspicious activity (such as unauthorized access), and prevent data leakage. It can also provide automatic encryption and backups.
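Detection of suspicious activity can be illustrated with a simple statistical rule: flag a user whose daily number of downloads is far above that user's own history. The z-score threshold of 3 and the sample counts are assumptions for this sketch; a deployed system would typically use a trained anomaly detection model and more signals than a single counter.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's count if it lies more than `threshold` standard
    deviations above the mean of the historical counts."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold

# Hypothetical number of documents one user downloaded on previous days.
daily_downloads = [12, 9, 11, 10, 13, 8, 11]
print(is_anomalous(daily_downloads, 12))
print(is_anomalous(daily_downloads, 250))
```

Only unusually high values are flagged, since a drop in activity is not a sign of data leakage.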

5. The main directions of application of artificial intelligence for ensuring data security in archives:

AI-based systems are capable of analyzing huge amounts of data in real time, identifying potential threats such as unauthorized access attempts, abnormal user activity, or malicious software.

AI can optimize and adapt encryption methods, ensuring data protection both during storage and transmission. Machine learning technologies allow you to create dynamically changing encryption algorithms that make them difficult to crack.

Artificial intelligence is able to analyze user behavior and create customized access models. For example, if suspicious activity is detected, the system automatically restricts or blocks access.

AI algorithms are also used to predict and prevent cyber attacks, including phishing, SQL injection, and DDoS attacks. Machine learning systems analyze previous incidents, identify weaknesses, and propose measures to eliminate them.

AI is able to track the movement of data within archival systems, detecting unauthorized copies, transfers, or modifications of documents.

AI-based systems can automatically analyze event and audit logs, identifying suspicious activity and ensuring transparency of all operations.

The synergy of blockchain technologies and AI makes it possible to record all changes in archived data and prevent their falsification. In addition, AI helps optimize verification procedures, which simplifies the management of digital records.
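The tamper-evidence property that blockchain brings to archival records can be shown with a simplified hash chain. This is only the core idea, assuming a single trusted writer; it has no distributed consensus and is not a blockchain implementation. The field names and document identifiers are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def add_entry(chain, document_id, action):
    """Append a change record linked to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {"document_id": document_id, "action": action, "prev_hash": prev_hash}
    chain.append({**body, "hash": _digest(body)})

def verify(chain):
    """Return True only if no record was altered, removed or reordered."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        body = {key: entry[key] for key in ("document_id", "action", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True

log = []
add_entry(log, "fund-12/file-7", "digitized")
add_entry(log, "fund-12/file-7", "metadata updated")
print(verify(log))           # chain is intact
log[0]["action"] = "deleted"  # simulate falsification of an old record
print(verify(log))           # the altered record no longer matches its hash
```

Because each record stores the hash of the one before it, changing any earlier record invalidates the chain from that point on, which is what makes falsification detectable.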

Conclusion. Artificial intelligence opens up new horizons for the archive environment, providing speed, convenience and security of working with information. Its implementation allows not only to improve current processes, but also to adapt to the challenges of the future, creating more intelligent and efficient archive systems. Archives integrating AI are becoming not just data repositories, but dynamic knowledge centers, ready for use at any moment.

Intelligent scanning and recognition of archived documents is not just a trend, but a necessity in the context of digital transformation. These technologies open up new possibilities for automating and simplifying work with archives, ensuring the safety, accessibility and usability of data. Investing in the development and implementation of such solutions will be a key success factor for archival institutions in the 21st century.

References
1. Mashenko, N. E. (2023). Formation of the archival environment as an element of the socio-cultural space. In Donetsk Readings 2023: Education, Science, Innovations, Culture and Challenges of Modernity: Proceedings of the VIII International Scientific Conference (pp. 91-93). Donetsk.
2. Gaidar, E. V. (2022). Modern information systems and technologies in the context of digital transformation of business. Economics: Collection of Scientific Works of the State Educational Institution of Higher Professional Education 'DONAUIHGS', 25, 47-57.
3. Belov, I. I. (2022). The role of artificial intelligence technologies in the digital transformation of document management and archival affairs. Scientific Bulletin of Crimea, 4(39), 1-6.
4. Lobanov, S. L. (2021). The place of artificial intelligence in the training of specialists in document science and archival studies. Bulletin of the Law Institute of MIIT, 2(34), 135-142.
5. Ilina, K. B. (2024). Artificial intelligence in archives: The experience of application in the Russian Federation, problems, and prospects. In Archives and Electronic Documents: Challenges of Time: Reports and Presentations of the International Scientific and Practical Conference (pp. 144-152). Moscow: VNIIDAD.
6. Shalkov, D. Y. (2024). Artificial intelligence in document science: Ergonomics of professional activity. In Information and Documentation Management in the Digital Environment: Collection of Scientific Articles from the III All-Russian Scientific and Practical Conference (pp. 124-132). Donetsk: DonGU.
7. Kiselev, I. N. (2024). On the application of artificial intelligence in text recognition. Bulletin of VNIIDAD, 1, 84-95.
8. Davletov, A. R. (2023). Modern methods of machine learning and OCR technology for automating document processing. Bulletin of Science, 5(10), 676-698. https://doi.org/10.24412/2712-8849-2023-1067-676-698
9. Terras, M. (2022). Inviting AI into the archives: The reception of handwritten recognition technology into historical manuscript transcription. Archives, Access and Artificial Intelligence, December, 179-204. https://doi.org/10.1515/9783839455845-008
10. Shang, E., Liu, X., Wang, H., Rong, Y., & Liu, Y. (2020). Research on the application of artificial intelligence and distributed parallel computing in archives classification. In 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) (pp. 1267-1271). https://doi.org/10.1109/IAEAC47372.2019.8997992
11. Haffenden, C., Fano, E., Malmsten, M., & Börjeson, L. (2023). Making and using AI in the library: Creating a BERT model at the National Library of Sweden. College & Research Libraries, 84(1). https://doi.org/10.5860/crl.84.1.30
12. Carter, K., Gondek, A., Underwood, W., Randby, T., & Marciano, R. (2022). Using AI and ML to optimize information discovery in under-utilized, Holocaust-related records. AI & Society, 37, 837-858. https://doi.org/10.1007/s00146-021-01368-w
13. Männistö, A., Seker, M., Iosifidis, A., & Raitoharju, J. (2022). Automatic image content extraction: Operationalizing machine learning in humanistic photographic studies of large visual archives. arXiv. https://doi.org/10.48550/arXiv.2204.02149
14. Alothman, A., & Sait, A. (2022). Managing and retrieving bilingual documents using an artificial intelligence-based ontological framework. Computational Intelligence and Neuroscience, pp. 1-15. https://doi.org/10.1155/2022/4636931
15. Modiba, M. (2023). User perception on the utilization of artificial intelligence for the management of records at the Council for Scientific and Industrial Research. Collection and Curation, 42(3), 81-87. https://doi.org/10.1108/CC-11-2021-0033
16. Yang, Y. (2023). Write what you want: Applying text-to-video retrieval to audiovisual archives. arXiv. https://doi.org/10.48550/arXiv.2310.05825

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

Recent decades have been marked by the rapid growth of information and communication technologies, which are radically changing our daily lives. Such changes often have both positive and negative sides: consider the disputes surrounding the use of artificial intelligence, including in scientific research. At the same time, the possibilities of artificial intelligence are multifaceted, and it is therefore important to study its use in various fields of historical informatics. These circumstances determine the relevance of the article submitted for review, the subject of which is artificial intelligence technologies in the formation of the archival environment. The author aims to consider the formation of a digital archival environment using the example of the DPR, to analyze such areas as intelligent scanning and recognition of archived documents and intelligent analytics and forecasting, and to show the application of artificial intelligence to ensure data security in archives. The work is based on the principles of analysis and synthesis, reliability and objectivity; the methodological basis of the research is a systematic approach that treats the object as an integral complex of interrelated elements. The scientific novelty of the article lies in the very formulation of the topic: the author seeks to characterize the problems and prospects of artificial intelligence technologies in the formation of an archival environment.

The bibliography of the article is a positive point: it is large and varied, comprising 16 sources and studies. An undoubted advantage of the reviewed article is its use of foreign English-language literature, which the formulation of the topic calls for. Among the works used by the author, we note those of I.I. Belov, S.L. Lobanov, M. Terras and others, which focus on various aspects of the use of artificial intelligence in archives. The bibliography is important from both a scientific and an educational point of view: after reading the article, readers can turn to other materials on its topic. In general, in our opinion, the integrated use of various sources and studies contributed to solving the tasks facing the author. The style of the article can be described as scientific, yet it is understandable not only to specialists but also to a wide readership, to anyone interested in artificial intelligence in general and in its capabilities in archival work in particular. The engagement with opponents is presented at the level of the information the author collected while working on the topic.

The structure of the work is logical and consistent; it comprises an introduction, a main part, and a conclusion. At the beginning, the author defines the relevance of the topic and shows that in modern conditions "the introduction of artificial intelligence is becoming a strategically important step to create an effective archival environment capable of coping with the challenges of the digital age." The work highlights such promising areas as intelligent scanning and recognition of archival documents, which "significantly saves employees time and eliminates the role of the so–called human factor - reduces the number of unforeseen errors at the data processing stage," and which matters especially "when it comes to complex or non-standard handwriting typical of old documents." The author draws attention to the fact that "the synergy of blockchain technologies and AI makes it possible to record all changes in archived data and prevent their falsification." Ultimately, as the author of the reviewed article rightly notes, "archives integrating AI become not just data warehouses, but dynamic knowledge centers ready for use at any moment." The main conclusion of the article is that "artificial intelligence opens up new horizons for the archive environment, providing speed, convenience and security of working with information." The article submitted for review is devoted to a relevant topic, will arouse reader interest, and summarizes domestic and foreign experience; its materials can be used both in training courses and in the work of archival institutions. In general, in our opinion, the article can be recommended for publication in the journal Historical Informatics.