A prototype of the system for searching and detecting extremist messages on the VKontakte social network

Baumtrog V.E., Es'kov A.V., Smirnov Y.A.

doi:10.7256/2454-0692.2024.5.71460

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Police activity

Reference:

Baumtrog, V.E., Es'kov, A.V., Smirnov, Y.A. (2024). A prototype of the system for searching and detecting extremist messages on the VKontakte social network. Police activity, 5, 98–109. https://doi.org/10.7256/2454-0692.2024.5.71460

Baumtrog Viktor Etmontovich

PhD in Physics and Mathematics

Professor at the Department of Computer Studies and Special Equipment of Barnaul Juridical Institute of the Ministry of Internal Affairs of Russia

656052, Russia, Altai Krai, Barnaul, 49 Chkalova str., room 432

barnaul@list.ru

Es'kov Aleksandr Vasil'evich

Doctor of Technical Science

Head of the Department of Information Security, Krasnodar University of the Ministry of Internal Affairs of Russia

350005, Russia, Krasnodar Territory, Krasnodar, Yaroslavskaya str., 128

alesc72@mail.ru

Smirnov Yurii Aleksandrovich

Research Associate of the Data Transmission Department of the Center for Communication Facilities and Systems; Scientific Research Institute of Special Equipment of the Federal State Budgetary Institution NPO STIs of the Ministry of Internal Affairs of Russia

111024, Russia, Moscow, Prudsky Klyuchiki str., 2

yra.smirnov01@mail.ru

DOI:

10.7256/2454-0692.2024.5.71460

EDN:

FLLZDX

Received:

12-08-2024

Published:

07-11-2024

Abstract: The object of research is neural networks, the VKontakte platform, the Telegram messenger, the Python programming language and its libraries, and a block diagram of a computer system model. The subject of the study is a computer technology for detecting extremist content in text form, and the specific groups containing it, on the VKontakte social network. The authors consider in detail the structural scheme of the computer system model and the functional modules included in it, and illustrate their interaction. The paper uses a pre-trained model designed for processing the Russian language and describes the conditions for achieving high accuracy in recognizing illegal content without signs of overfitting. The paper presents the results of checks on test data that confirm the operability of the computer system. The proposed prototype is integrated with the Telegram messenger, which increases usability and facilitates the process of generating queries and reports. The novelty of the research lies in the creation of a prototype of a computer system for searching and detecting extremist messages on the VKontakte social network using the Python programming language and the VKontakte API (VK API). The basis of the prototype is a neural network that works with the Transformers and Torch libraries. A special feature of the computer system is the ability to analyze messages on a social network and subject them to binary classification according to whether or not they contain illegal information. The main conclusions of the study show the efficiency of the system, the simplicity and convenience of its use, and the possibility of detecting illegal text content. A distinctive feature of the prototype is the ability to detect illegal content expressed in slang.

Keywords:

VKontakte, extremist content, prototype of a computer system, neural network, library, illegal, binary text classification, deep learning, bot, platform
This article is automatically translated.

Introduction

The social network VKontakte (VK) is widely used and especially popular among Russian-speaking users around the world (The VKontakte social network and its audience. URL: https://www.seowizard.ru/blog/faq/wiki/с/socialnaya-set-v-kontakte-i-ee-auditoriya ). Individual publications on this network may entail legal liability for the owners of the platforms and resources on which they are posted (Legal framework for countering Extremism and Terrorism. URL: https://252.56.xn--b1aew.xn--p1ai/news/item/45436942 ). In this regard, there is a need to detect and control illegal content. Effective monitoring and management of this content requires sophisticated tools and methods, and the existing ones, despite all efforts ^{[2,3,4,5,6,7,8,9]}, turn out to be insufficient. To address the problems of monitoring and managing content on the VKontakte social network, the authors of this work have created and tested a prototype computer system that is able to identify illegal information and prevent its dissemination.

Selection of tools

Python was chosen as the programming language for the prototype computer system because it is universal, suitable for solving problems on various platforms, including iOS and Android as well as server operating systems, and allows the use of a neural network. In addition, Python already has a library for working with the programming interface of the VKontakte social network (VKontakte API), as well as the Transformers library with the Torch framework (Introduction to the Transformers library and the Hugging Face platform. [electronic resource]. URL: https://habr.com/ru/articles/704592/). The VKontakte API is a powerful tool for developers and researchers that provides access to an extensive amount of data on this social network. It is an interface that allows information to be extracted from the vk.com database via HTTP requests to a specialized server ^{[5, 8, 10]}. The query syntax and the type of data returned are strictly defined by the service. The next tool used is the Hugging Face platform, which hosts a collection of ready-made modern pre-trained deep learning models. Deep learning is a type of machine learning based on data analysis through multi-layered networks loosely modeled on the human brain. The essence of deep learning is that computers find solutions on their own: they learn from their own mistakes and make increasingly accurate predictions (Deep learning: what it is, how it works and where it is applied. URL: https://getcompass.ru/blog/posts/deep-learning). The Transformers library provides tools and interfaces for easily downloading and using these models.
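To make the idea of an HTTP request to the VKontakte API concrete, here is a minimal sketch that only builds the URL of a wall.get call; it does not send anything. The function name build_wall_get_url, the group name example_group, the placeholder token and the API version string are assumptions made for illustration and are not taken from the authors' code.

```python
from urllib.parse import urlencode

API_BASE = "https://api.vk.com/method"  # public VK API endpoint


def build_wall_get_url(domain, count, access_token, version="5.131"):
    """Build the URL of a wall.get request for a community short name.

    The version string is an assumed value; use the one your application targets.
    """
    params = {
        "domain": domain,              # short name of the community
        "count": count,                # how many wall posts to request
        "access_token": access_token,  # token issued for the application
        "v": version,                  # API version
    }
    return f"{API_BASE}/wall.get?{urlencode(params)}"


# Hypothetical values, for illustration only
url = build_wall_get_url("example_group", 10, "YOUR_TOKEN")
print(url)
```

Sending the request and handling errors are left out on purpose; the point is only that the method name and its parameters fully determine the query.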

The generalized scheme of the model and the order of interaction of its modules

The model of the system is shown in Fig. 1. The Telegram messenger was chosen as the user interface because of its convenient and simple tools for creating bots, as well as its availability on personal computers and smartphones ^{[10, 11]}. In addition, the Telegram platform already has a built-in user registration and verification system, which facilitates the authentication process. When a user sends a message containing a link to a VKontakte group, the link is passed to the server. Using the VKontakte API, the server of the developed system collects the required number of messages from the specified VKontakte group. On the server, these messages are processed by a neural network, and a report is generated, which is then returned to the user in Telegram.
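The first step of this flow, turning the link sent by the user into a group identifier, might look roughly like the sketch below. The function name extract_group_name and the list of accepted hosts are assumptions; the article does not describe how the authors parse links.

```python
from urllib.parse import urlparse

# Hosts accepted as VKontakte links (an assumption for this sketch)
VK_HOSTS = {"vk.com", "www.vk.com", "m.vk.com"}


def extract_group_name(link):
    """Return the community short name from a vk.com link, or None if it is not recognized."""
    link = link.strip()
    if "://" not in link:
        link = "https://" + link  # allow links pasted without a scheme
    parsed = urlparse(link)
    if parsed.netloc.lower() not in VK_HOSTS:
        return None
    name = parsed.path.strip("/")
    # Reject empty paths and nested paths such as vk.com/a/b
    if not name or "/" in name:
        return None
    return name


print(extract_group_name("https://vk.com/example_group"))
```

Returning None for unrecognized input lets the bot reply with a usage hint instead of sending a malformed request to the server.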

Fig. 1. Generalized diagram of the computer system model.

The overall process of acquiring and processing data can be divided into separate functional modules, each of which performs its own task; their interaction is shown in Fig. 2. The modules are named bot, creating_network, messagedumper, processing and reportgen.

Fig. 2. The scheme of interaction of the system modules.

At the initial stage, an initial database (a reference file) must be created. For this purpose the creating_network module was developed; it accepts a CSV file from the user as input. This reference file, created by the developers, contains messages (phrases, words) divided into two groups with the corresponding labels: 1 if the message contains illegal information and 0 if it does not. The data of the reference file allows the neural network to perform binary text classification ^{[13, 14, 15, 16, 17, 18, 19]}.

The work uses a pre-trained DeepPavlov/rubert-base-cased-sentence model designed for processing the Russian language ^{[15, 18]}. The data of the reference file is divided by the user into three groups in the ratio of 60/20/20:

train — training data;

val — validation data;

test — test data.

The training data is used to train the model. To avoid overfitting, validation data is set aside. This data is gradually included in the training set during the training of the model until an optimal ratio is reached at which the model demonstrates high accuracy in recognizing illegal content without signs of overfitting. Test data play a key role in evaluating the effectiveness and reliability of the model on new, previously unseen data ^{[19, 20]}.
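A 60/20/20 split of the reference data can be sketched as follows. The function name split_dataset, the fixed random seed and the toy rows are assumptions; the article does not say how the split is performed in practice.

```python
import random


def split_dataset(rows, seed=42):
    """Shuffle rows and split them into train/val/test in a 60/20/20 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = len(rows) * 60 // 100
    n_val = len(rows) * 20 // 100
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


# Toy (message, label) pairs, for illustration only
rows = [(f"message {i}", i % 2) for i in range(100)]
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the boundaries so that rounding never drops or duplicates a row.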

The next step involves dividing all the data into two main parts: messages and headers. Headers are used to set labels (0 or 1) that characterize the context or type of the message. It is also important to tokenize the text, which means splitting it into separate words, phrases, or other significant elements in order to prepare the data for subsequent processing and analysis.
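Separating the reference file into messages and labels might be sketched like this. The column names text and label are hypothetical, since the article does not give the layout of the CSV file, and tokenization itself (which in practice would be done by the tokenizer that ships with the pre-trained model) is not shown.

```python
import csv
import io


def load_reference(csv_text):
    """Split reference CSV content into parallel lists of messages and 0/1 labels."""
    messages, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        messages.append(row["text"])
        labels.append(label)
    return messages, labels


# Tiny in-memory sample with assumed column names
sample = "text,label\nhello world,0\nsome flagged phrase,1\n"
messages, labels = load_reference(sample)
print(messages, labels)
```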

The resulting text is converted into a tensor (in the context of neural networks, a tensor is a multidimensional array of numbers used for storing and processing data) and loaded into a DataLoader, which feeds the data in batches for training and validation of the model. The user is not required to train the model, but can add their own data for text classification.

In the process of training the model, 20 generations of weights are produced, from which the most successful are selected. Based on the results of this process, the file saved_weights.pt is created, containing the weights that are optimal for binary text classification. These weights are then checked against the test data to evaluate their effectiveness.
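The idea of producing 20 generations of weights and keeping the best one can be illustrated with a framework-free sketch. It assumes that "most successful" means the lowest validation loss, which the article does not state; train_and_keep_best and the toy functions are hypothetical stand-ins for real training and evaluation code.

```python
import copy


def train_and_keep_best(train_one_epoch, evaluate, state, epochs=20):
    """Run several training passes and keep the weights with the lowest validation loss."""
    best_loss = float("inf")
    best_state = copy.deepcopy(state)
    for _ in range(epochs):
        train_one_epoch(state)   # updates the weights in place
        loss = evaluate(state)   # validation loss after this pass
        if loss < best_loss:
            best_loss = loss
            best_state = copy.deepcopy(state)  # snapshot of the best weights so far
    return best_state, best_loss


# Toy stand-ins: a single "weight" that overshoots its optimum at 3.0
def toy_step(state):
    state["w"] += 0.5


def toy_val_loss(state):
    return (state["w"] - 3.0) ** 2


best_state, best_loss = train_and_keep_best(toy_step, toy_val_loss, {"w": 0.0})
print(best_state, best_loss)
```

In a Torch-based implementation the snapshot would typically be the model's state dictionary written to a file such as saved_weights.pt, but that part is not reproduced here.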

An illustration of the work of a prototype computer system

The DeepPavlov/rubert-base-cased-sentence module outputs its results as floating-point numbers in the range from 0 to 1. The threshold that determines the recognition result was set empirically at 0.92: values up to 0.92 are considered negative, and values above it positive. This threshold is used from here on as the criterion for classifying the text.
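The thresholding rule can be written down in a few lines. The sketch assumes that a confidence of exactly 0.92 counts as negative, which is one reading of "values up to 0.92"; the function name classify is hypothetical.

```python
THRESHOLD = 0.92  # empirical threshold reported in the article


def classify(confidence, threshold=THRESHOLD):
    """Map a model confidence in [0, 1] to a binary label (1 = illegal content)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Assumption: the boundary value itself is treated as negative
    return 1 if confidence > threshold else 0


print([classify(c) for c in (0.10, 0.92, 0.97)])
```

A high threshold of this kind trades missed detections for fewer false alarms, which fits a tool whose reports are reviewed by a person.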

A small part of the final results of the prototype computer system is shown in Fig. 3, which presents the following data:

messages – the messages (shown in the figure in a modified encoding);

labels – the labels initially assigned to the corresponding messages;

confidence – the exact value of the model's confidence in recognition;

pred – the final recognition result.

These components help to understand how the model interprets and classifies input data.

Fig. 3. The results of processing the test data.

Most often, the system interprets the data in the same way as the user, but there are also inconsistent results. For example, in line 894, the user defined the data as not containing illegal information (Label=0), and the system interpreted this data as containing such information.
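Disagreements of this kind can be located by comparing the labels column with the pred column. The sketch below uses made-up values rather than the data from Fig. 3, and the function name compare is hypothetical.

```python
def compare(labels, preds):
    """Return the share of matching rows and the indices where label and prediction differ."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    mismatches = [i for i, (y, p) in enumerate(zip(labels, preds)) if y != p]
    accuracy = (len(labels) - len(mismatches)) / len(labels)
    return accuracy, mismatches


# Made-up values, for illustration only
labels = [0, 1, 1, 0, 0]
preds = [0, 1, 1, 0, 1]
accuracy, mismatches = compare(labels, preds)
print(accuracy, mismatches)
```

Listing the mismatching rows is useful for manual review, since a disagreement may point either to a model error or to a questionable label in the reference file.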

The module responsible for the operation of the bot is called bot. It explains to the user how to work with it, receives the user's messages and passes the link to the VKontakte group on for further processing (Kruglik R. I. Creating a chatbot in Telegram // Postulate. 2019. No. 9) ^{[11, 12]}. In addition, if extremist or terrorist information has been recognized, the bot module sends the report produced by reportgen to the user who monitors the content of the social network.

Next, Messagedumper collects the last 10 non-empty community messages and sends them to the server for verification (Fig. 4). The number of messages can be increased, but this will also increase the processing time in proportion to the increase in the number of messages.
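Since the code in Fig. 4 is not reproduced in the text, the following is only a sketch of the filtering step under stated assumptions: the payload has the shape of a wall.get response with posts listed newest first, and the function name last_non_empty_texts is hypothetical.

```python
def last_non_empty_texts(payload, limit=10):
    """Return up to `limit` non-empty post texts from a wall.get-style response."""
    items = payload.get("response", {}).get("items", [])
    texts = (item.get("text", "").strip() for item in items)
    # Posts without text (for example, image-only posts) are skipped
    return [t for t in texts if t][:limit]


# Made-up payload in the shape of a wall.get response
payload = {
    "response": {
        "count": 3,
        "items": [
            {"id": 3, "text": "third post"},
            {"id": 2, "text": "   "},
            {"id": 1, "text": "first post"},
        ],
    }
}
print(last_non_empty_texts(payload))
```

Because empty posts are dropped after the response arrives, a real collector may need to request more than 10 posts to end up with 10 usable texts.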

Fig. 4. Illustration of the program code of the process of collecting messages from the VKontakte social network.

The processing module handles the messages received from messagedumper using the weights produced by creating_network. The accuracy of illegal information recognition reaches 95% if the messages are long and similar in subject matter to the reference database. The results are passed to the reportgen module, which generates a report and sends it to the bot ^{[21]}. If no illegal information was detected during processing, report generation is skipped, and the bot issues a message stating that the group meets the established standards.
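The branching between a report and the "no violations" reply could be sketched as below. The function name make_reply and the wording of both replies are assumptions; the article does not show the format of the actual report.

```python
def make_reply(group, messages, preds):
    """Build the bot's reply: a report if anything was flagged, a short notice otherwise."""
    flagged = [m for m, p in zip(messages, preds) if p == 1]
    if not flagged:
        # Report generation is skipped when nothing was detected
        return f"Group {group}: the group meets the established standards."
    lines = [f"Group {group}: {len(flagged)} of {len(messages)} messages flagged."]
    lines += [f"{i}. {m}" for i, m in enumerate(flagged, 1)]
    return "\n".join(lines)


# Made-up inputs, for illustration only
print(make_reply("example_group", ["first text", "second text"], [0, 1]))
```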

Conclusion

The proposed prototype of the computer system has significant potential for further development and improvement. Technically, it is possible to check not only the posts of VKontakte groups but also the comments under them, and to save user statistics and form an overall user rating, which could help in identifying terrorist or extremist activities. At present, however, the platform's Rules prohibit such use of user data, so these features cannot be implemented. The restrictions set out in section 2 "Working with data" of the Platform Rules for using the VKontakte public API are given below (Rules for using the VKontakte API. Revision dated 03/01/2024. URL: https://dev.vk.com/ru/rules ).

Applications are prohibited from:

2.1. Collect and store user data, including the User ID, for purposes unrelated to the operation of the Application. The requested data should only be used in the context of the application.

For example, you can cache the IDs of a user's friends to display the list faster on a mobile device. You cannot transfer the IDs of all users to your server in order to store them in your own database, just in case.

2.2. Transfer any user data obtained automatically through the API (including User ID) to third-party services (for example, advertising) both directly and through intermediaries.

2.3. Use user data in any advertisements. For example, to address a user by name from an advertising banner.

2.4. Data obtained through the API, including the methods of newsfeed.search, wall.get, wall.search, including user IDs, cannot be used for transfer or resale, creation of analytical reports, scoring, etc. directly or through intermediaries without the express consent of the Site Administration. Such consent, for example, may be an agreement with an advertising agency for the use of data on ad impressions in reports to clients.

The training of the developed computer system was carried out offline. Real-time self-learning, which would increase the share of illegal text content that is recognized, requires sufficiently powerful hardware and constant administration.

The modular structure of the system allows you to constantly upgrade its individual components. The scheme of building a prototype of a computer system can also be applied to platforms of other social networks.

It is possible to reconfigure the model for binary classification of text that is not related to terrorism and extremism. The model works with the source data coming into its input database. If you need to search for other information, changing keywords will allow you to quickly reconfigure the model in accordance with the necessary requirements.

Figures 3 and 4 illustrate the testing of the prototype computer system. The testing established:

1) The operability of the system.

2) Ease of use, including on smartphones.

3) Sufficiently high accuracy in the search for illegal content.

4) The ability of the system to analyze content not only by keywords, but also by the semantic meaning of the text.

Thus, the developed prototype of a computer system based on neural network technologies greatly simplifies the process of detecting extremist (illegal) information on the VKontakte social network. The system automates content monitoring, ensuring timely identification of and response to publications that are potentially dangerous or violate the rules of the social network. This increases security and helps ensure that the content on the platform meets regulatory requirements. An important advantage of the prototype is that it is built from freely distributed software.

The article will be of interest to researchers in the field of information security, developers of automated content monitoring systems, as well as specialists in social network analysis. It is important to note that the practical application of the proposed system may be of interest to government agencies and organizations involved in the fight against extremism.

References

1. Salakhutdinov, A. A. (2014). Social networks as an information channel of extremist material. Young Scientist, 17(76), 561-564. Retrieved from https://moluch.ru/archive/76/13119/
2. Martyshkin, A. I., Markin, E. I., Zuparova, V. V. (2021). Research and development of a prototype module for automatic tracking of social network content. XXI century: results of the past and problems of the present, Vol. 10 (2), 96-100. doi: 10.46548/21vek-2021-1054-0017
3. Titov, N. G. et al. (2019). Methods of monitoring social networks, their development and application in the context of ensuring their information security. Information and security, Vol. 22 (3), 305-324. Retrieved from https://www.elibrary.ru/download/elibrary_41595797_74850594.pdf
4. Golosnoy, K. S., Yanaeva, M. V. (2022). Analysis of potentially dangerous content on the VKontakte social network. Science, society, personality: problems and prospects of interaction in the modern world, 103-107. Retrieved from https://www.elibrary.ru/download/elibrary_49490534_58906686.pdf
5. Ostapenko, A. G. et al. (2018). Organization of monitoring of posts on the VKontakte social network using the vkapi interface. Information and security, Vol. 21(3), 408-415. Retrieved from https://www.elibrary.ru/item.asp?id=36716826
6. Vikhlyaev, D. R., Glagolev V. A. (2021). Data parsing of the VKontakte community using VK API. Postulate,10. Retrieved from https://e-postulat.ru/index.php/Postulat/article/view/3792
7. Zhdanov, A.V., Tyutyakin, A. A. (2017). Search for common groups and communities of social network users using web services on the example of VKontakte. Information and education: boundaries of communications, 9, 74-76. Retrieved from https://www.elibrary.ru/download/elibrary_29901598_37616555.pdf
8. Lekhov, K. A., Speranskiy, D. D. et al. (2021). A system for extracting and analyzing text data from social networks for an educational institution. Modeli, sistemy, seti v ekonomike, tekhnike, prirode i obshchestve, 1, 128–136. doi:10.21685/2227-8486-2021-1-11
9. Cordial, A. L. et al. (2020). A cartographic approach to the study of the processes of distribution of destructive content in communities of a single topic of the Vkontakte social network. Information and Security, Vol. 23(2), 203-214. Retrieved from https://www.elibrary.ru/item.asp?edn=dnfmqh
10. Bikov D. I. (2020). Methods of processing requests for a chatbot using VK API tools. Priority areas of innovation in industry, 35-36. Retrieved from https://www.elibrary.ru/download/elibrary_44460384_57688577.pdf
11. Kozlov, A. A., Batishchev, A.V. (2017). Telegram bot as a simple and convenient way to get information. The territory of science, 5, 55-64. Retrieved from https://www.elibrary.ru/download/elibrary_32399239_57151620.pdf
12. Shvedov, N. D. (2023). Creating a simple Telegram bot: step-by-step instructions. Academic Journalism, 3-1, 7-14. Retrieved from https://aeterna-ufa.ru/sbornik/AP-2023-03-1.pdf
13. Rabchevsky, A.N. (2023). Review of methods and systems for generation of synthetic training data. Applied Mathematics and Control Sciences, 4, 6–45. doi: 10.15593/2499-9873/2023.4.01
14. Abdullah A. L. I., Solovyova E. B. Binary classification of texts using a separable convolutional neural network (BTC_SCNN). A computer program. Certificate No. 2022613069 dated 03/01/2022. Retrieved from https://etu.ru/ru/nauchnaya-i-innovacionnaya-deyatelnost/obekty-intellektualnoy-sobstvennosti/patenty-i-svidetelstva-2022-goda
15. Galchenko, Yu. V., Nesterov, S. A. (2023). Classification of texts by tonality by machine learning methods. System analysis in design and management, vol. 26 (3), 369-378. doi 10.18720/SPBPU/2/id23-501
16. Khaykin S. (2016). Neural networks: a complete course, 2nd ed./ Translated from English. Moscow: I.D. Williams LLC.
17. Zhuravlev, D. V., Smolin, V.S. (2023). The neural network revolution of artificial intelligence and it's development options. In Designing the future. Problems of digital Reality. Retrieved from: https://keldysh.ru/future/2023/16.pdf
18. Kulikov, A. A., Mailyan, E. K. (2021). Comparison of architectures of recurrent neural networks in the problem of binary classification of texts. Innovative development of technology and technologies in industry (INTEX-2021). Part 3. Moscow: Kosygin Russian State University, 223-226. Retrieved from https://www.elibrary.ru/download/elibrary_46298428_94237344.pdf
19. Legotin D. L., Zrybnaya E.A. (2019). Implementation of a recurrent artificial neural network for text classification. Actual problems of teaching information and natural science disciplines. Kostroma. KSU, 197-202. Retrieved from https://www.elibrary.ru/download/elibrary_38247279_55764596.pdf
20. Batura, T. V. (2017). Methods of automatic text classification. Software products and systems, Vol. 30(1), 85-89. Doi 10.15827/0236-235X.030.1.085-099
21. Alekseeva, V. A. (2014). The use of intellectual analysis methods in binary classification problems. Proceedings of the Samara Scientific Center of the Russian Academy of Sciences, Vol. 16 (6-2), 354-356. Retrieved from http://www.ssc.smr.ru/media/journals/izvestia/2014/2014_6_354_356.pdf

First Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The article is devoted to the development and testing of a prototype computer system for detecting extremist messages on the VKontakte social network. This system is based on the use of neural network technologies and the use of a pre-trained model for classifying text content.

The research methodology is described in detail and includes the use of the Python scripting language, Transformers and Torch libraries, as well as the Hugging Face platform to implement a deep learning model. The stages of creating a database, preparing data for training the model, as well as the algorithm of the prototype system are described. The process of text tokenization and binary classification methods are also disclosed in detail, which gives the study scientific rigor and methodological depth.

The relevance of the topic is obvious in the context of modern challenges related to the need to monitor and control content on social networks to prevent the spread of extremist information. Due to the increasing number of such threats, the development of automated systems to identify illegal content is an important step in ensuring user safety and compliance with legal norms.

The scientific novelty of the work lies in the creation of a prototype system using modern neural network text analysis technologies for automatic monitoring of content on the VKontakte social network. The authors propose a comprehensive approach to detecting extremist information, which differs from existing solutions focused on the use of keywords or simple filtering algorithms.

The article is written in a scientific style in compliance with academic terminology. The structure of the article is logically consistent and includes an introduction, a description of the methodology, an illustration of the prototype's work, as well as a conclusion with conclusions. The text of the article in some places requires improvement in terms of clarity of presentation, especially in sections related to the technical aspects of the system, where the description of processes can be simplified for a better audience perception.

The conclusions of the article correspond to the tasks set and confirm the operability of the proposed system. Nevertheless, in the conclusions section, it would be possible to focus more on the prospects of using the developed system in other social networks and potential areas for its improvement.

The article will be of interest to researchers in the field of information security, developers of automated content monitoring systems, as well as specialists in social network analysis. It is important to note that the practical application of the proposed system may be of interest to government agencies and organizations involved in the fight against extremism.

Recommendations for improvement:

1. It is recommended to simplify and make more understandable some technical sections of the article for a wide scientific audience.
2. It would be useful to analyze in more detail the possible limitations of the developed system and suggest ways to overcome them in further research.
3. The list of references should be expanded to include modern research in the field of using neural network technologies for analyzing text content.

The article is a significant contribution to the field of monitoring social media content. The proposed system demonstrates high results in the task of detecting extremist messages. I recommend accepting the article for publication after making the above improvements.

Second Peer Review

The subject of the peer-reviewed study is technologies for detecting extremist messages on social networks. The VKontakte network was chosen as a case study. The author rightly associates the high degree of relevance of the chosen topic with the need to develop technologies for detecting and controlling illegal content in order to counteract the spread of this content. Accordingly, the reviewed work is also of great practical importance, related to potential problems that may result from the publication of extremist messages for both the owners of the relevant communication channels and their authors.

Computer modeling and neural network methods were used as basic methodological tools. With the help of the Python scripting programming language, the Transformers library and the Hugging Face platform, a trainable model of the search and detection system for extremist messages on the VKontakte social network using the API of this network was developed and tested (although testing was carried out offline due to legal restrictions). Actually, the development and testing of a prototype of this model may well claim scientific novelty and practical usefulness. Unlike other tools, the model developed by the author allows you to analyze posts on the specified network, posts and comments under them, save user statistics and form their rating, which can become the basis for further work to identify illegal content and its authors. It can be assumed that the author of the reviewed article plans to scale this model to other social networks – Odnoklassniki, Telegram (especially since the author has already worked with this network), Moy Mir, Yandex.Zen, TikTok, etc.

Structurally, the article also makes a positive impression: its logic is consistent and reflects the main aspects of the research. The following sections are highlighted in the text:

- "Introduction", where a scientific problem is posed and its relevance and practical significance are argued;
- "Choice of tools", where methodological and software tools for developing a computer model are disclosed in sufficient detail, as well as their choice is argued;
- "Generalized scheme of the model and the order of interaction of its modules", which describes the basic principles of the model, as well as the modules of which it consists;
- "Illustration of the prototype of a computer system", which reveals the results of testing the prototype model;
- "Conclusion", which describes the legal difficulties faced by the author, summarized the results of the conducted research, conclusions are drawn and prospects for further research are outlined.

The style of the reviewed article is scientific and analytical, with a strong bias towards technical details. There are a small number of stylistic and grammatical errors in the text, but in general it is written quite competently, in good Russian, with the correct use of scientific terminology. The bibliography contains 21 titles and adequately reflects the state of research on the subject of the article. Although it could be strengthened by including sources in foreign languages. There is no appeal to opponents, but due to the scientific and technical nature of the article, it is not a mandatory requirement. The advantages of the article also include the use of illustrative material (four drawings), which significantly simplifies the perception of the author's arguments.

GENERAL CONCLUSION: the article proposed for review can be qualified as a scientific work that meets the basic requirements for works of this kind. The results obtained by the author will be of interest to specialists in the field of information security, in the field of media and PR, civil servants, as well as students of the listed specialties.
The presented material corresponds to the subject of the magazine "Police activity". According to the results of the review, the article is recommended for publication.
