Police activity
Reference:
Baumtrog V.E., Es'kov A.V., Smirnov Y.A.
A prototype of the system for searching and detecting extremist messages on the VKontakte social network
// Police activity.
2024. No. 5.
P. 98-109.
DOI: 10.7256/2454-0692.2024.5.71460 EDN: FLLZDX URL: https://en.nbpublish.com/library_read_article.php?id=71460
A prototype of the system for searching and detecting extremist messages on the VKontakte social network
DOI: 10.7256/2454-0692.2024.5.71460
EDN: FLLZDX
Received: 12-08-2024
Published: 07-11-2024
Abstract: The object of research is neural networks, the VKontakte platform, the Telegram messenger, the Python programming language and its libraries, and a block diagram of a computer system model. The subject of the study is a computer technology for detecting extremist content in text form, and the specific groups containing it, on the VKontakte social network. The authors consider in detail the structural scheme of the computer system model and the functional modules included in it, and illustrate their interaction. The paper uses a pre-trained model designed for processing the Russian language and describes the conditions that ensure high accuracy in recognizing illegal content without signs of overfitting. The paper presents the results of a check on test data that confirm the operability of the computer system. The proposed prototype is integrated with the Telegram messenger, which improves usability and simplifies the generation of queries and reports. The novelty of the research lies in the creation of a prototype computer system for searching and detecting extremist messages on the VKontakte social network using the Python programming language and the VKontakte application programming interface (VK API). The basis of the prototype is a neural network that works with the Transformers and Torch libraries. A special feature of the computer system is its ability to analyze messages on a social network and subject them to binary classification according to whether or not they contain illegal information. The main conclusions of the study show the efficiency of the system, the simplicity and convenience of its use, and the possibility of detecting illegal text content. A distinctive feature of the prototype is the ability to detect illegal content expressed in slang.
Keywords: VKontakte, extremist content, prototype of a computer system, neural network, library, illegal, binary text classification, deep learning, bot, platform
This article is automatically translated.
Introduction
The social network VKontakte (VK) is widely used and is especially popular among Russian-speaking users around the world (The social network "VKontakte" and its audience. URL: https://www.seowizard.ru/blog/faq/wiki/с/socialnaya-set-v-kontakte-i-ee-auditoriya ). Individual publications on this network may entail legal liability for the owners of the platforms and resources on which they are posted (Legal framework for countering Extremism and Terrorism. URL: https://252.56.xn--b1aew.xn--p1ai/news/item/45436942 ), which creates a need to detect and control illegal content. Effective monitoring and management of such content requires sophisticated tools and methods, and the existing ones, despite all efforts [2,3,4,5,6,7,8,9], remain insufficient. To address the problems of monitoring and managing content on the VKontakte social network, the authors of this work have created and tested a prototype computer system that is able to identify illegal information and help prevent its dissemination.
Selection of tools
Python was chosen as the programming language for the prototype computer system because it is general-purpose, is suitable for solving problems on various platforms, including iOS and Android as well as server operating systems, and allows the use of a neural network. In addition, it already has libraries for working with the application programming interface of the VKontakte social network (VKontakte API), as well as the Transformers library, which runs on the Torch framework (Introduction to the Transformers library and the Hugging Face platform. [electronic resource]. URL: https://habr.com/ru/articles/704592/). The VKontakte API is a powerful tool for developers and researchers that provides access to an extensive amount of data on this social network. It is an interface for extracting information from the vk.com database via HTTP requests to a specialized server [5, 8, 10].
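As an illustration of such an HTTP request, the sketch below builds the URL for the wall.get method (one of the methods named in the platform rules quoted in the Conclusion) and extracts post texts from a decoded response. It is a minimal sketch rather than the authors' code: the function names, the API version value 5.131, and the sample payload are assumptions made for illustration, and no network call is performed.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

API_ROOT = "https://api.vk.com/method"  # base address of VK API methods


def build_wall_get_url(domain: str, count: int, access_token: str,
                       version: str = "5.131") -> str:
    """Build the GET URL for wall.get (the function name is hypothetical)."""
    params = {
        "domain": domain,              # short address of the group
        "count": count,                # number of posts requested
        "access_token": access_token,  # token issued to the application
        "v": version,                  # API version (assumed value)
    }
    return f"{API_ROOT}/wall.get?{urlencode(params)}"


def extract_texts(payload: Dict[str, Any]) -> List[str]:
    """Return the non-empty post texts from a decoded wall.get response."""
    items = payload.get("response", {}).get("items", [])
    return [item["text"] for item in items if item.get("text")]


# Hand-written sample in the shape of a wall.get JSON answer (not real data).
sample = {"response": {"count": 2, "items": [{"text": "hello"}, {"text": ""}]}}
print(build_wall_get_url("example_group", 10, "TOKEN"))
print(extract_texts(sample))
```

Keeping URL construction and response parsing in separate functions lets both be checked without contacting the server.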
The query syntax and the type of data returned are strictly defined by the service. The next tool used is the Hugging Face platform, which hosts a collection of ready-made, modern pre-trained deep learning models. Deep learning is a type of machine learning based on analyzing data with multi-layered networks loosely modeled on the human brain. The essence of deep learning is that computers find solutions on their own: they learn from their own mistakes and make increasingly accurate predictions (Deep learning: what it is, how it works and where it is applied. URL: https://getcompass.ru/blog/posts/deep-learning). The Transformers library provides tools and interfaces for easily downloading and using these models.
The generalized scheme of the model and the order of interaction of its modules
The model of the system is shown in Fig. 1. The Telegram messenger was chosen as the user interface because it offers simple and convenient means of creating bots and is available on personal computers and smartphones [10, 11]. In addition, the Telegram platform already has a built-in user registration and verification system, which simplifies authentication. When a user sends a message containing a link to a VKontakte group, the link is passed to the server. Using the VKontakte API, the server of the developed system collects the required number of messages from the specified VKontakte group. On the server, these messages are processed by a neural network, and a report is generated and returned to the user in the Telegram messenger.
Fig. 1. Generalized diagram of the computer system model.
The whole process of receiving and processing data can be divided into separate functional modules, each of which performs its own task; their interaction is shown in Fig. 2. The modules are named bot, creating_network, messagedumper, processing, and reportgen.
Fig. 2. The scheme of interaction of the system modules.
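The request flow described above (link, message collection, classification, report) can be sketched as follows. This is an illustrative outline under stated assumptions, not the prototype's code: the function names merely reuse the module names, the message collector and the classifier are passed in as plain callables, and the stubs in the usage example stand in for the VKontakte API and the neural network.

```python
from typing import Callable, List, Optional
from urllib.parse import urlparse


def group_name_from_link(link: str) -> str:
    """Extract the short group name from a link such as https://vk.com/name."""
    if "://" not in link:
        link = "https://" + link
    return urlparse(link).path.strip("/")


def reportgen(group: str, flagged: List[str]) -> Optional[str]:
    """Build a plain-text report, or return None when nothing was flagged."""
    if not flagged:
        return None
    lines = [f"Group {group}: {len(flagged)} suspicious message(s)"]
    lines += [f"- {text}" for text in flagged]
    return "\n".join(lines)


def handle_link(link: str,
                messagedumper: Callable[[str, int], List[str]],
                processing: Callable[[str], bool],
                count: int = 10) -> str:
    """Run one request through the pipeline and return the bot's reply."""
    group = group_name_from_link(link)
    messages = messagedumper(group, count)            # collect messages
    flagged = [m for m in messages if processing(m)]  # classify each one
    report = reportgen(group, flagged)
    if report is None:
        return f"Group {group}: no illegal content detected"
    return report


# Stubs standing in for the VKontakte API and the neural network.
def fake_dumper(group: str, count: int) -> List[str]:
    return ["good morning", "bad example"][:count]


def fake_classifier(text: str) -> bool:
    return "bad" in text


print(handle_link("https://vk.com/demo_group", fake_dumper, fake_classifier))
```

Passing the collector and the classifier as arguments mirrors the modular structure: each stage can be replaced or tested separately.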
At the initial stage, an initial database (a reference file) must be created. For this purpose, the creating_network module was developed; it takes a CSV file from the user as input. This reference file, created by the developers, contains messages (phrases, words) divided into two groups with the corresponding labels: 1 means that the message contains illegal information, and 0 means that it does not. The data in the reference file allow the neural network to perform binary text classification [13, 14, 15, 16, 17, 18, 19]. The work uses the pre-trained DeepPavlov/rubert-base-cased-sentence model, designed for processing the Russian language [15, 18]. The user divides the data of the reference file into three groups in a 60/20/20 ratio: train (training data), val (validation data), and test (test data). The training data are used to train the model. To avoid overfitting, validation data are set aside. These data are gradually included in the training set during training until an optimal ratio is reached at which the model recognizes illegal content with high accuracy and shows no signs of overfitting. The test data play a key role in evaluating the effectiveness and reliability of the model on new, previously unseen data [19, 20].
The next step is to divide all the data into two main parts: messages and headers. The headers are used to set the labels (0 or 1) that characterize the type of the message. The text must also be tokenized, that is, split into separate words, phrases, or other meaningful elements, to prepare the data for subsequent processing and analysis. The resulting text is converted into a tensor (in the context of neural networks, a multidimensional array of numbers used for storing and processing data) and loaded into a DataLoader, which feeds the data in batches for training and validation of the model.
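The 60/20/20 split and the batch-wise feeding of data can be illustrated with the standard library alone. The sketch below is a simplified stand-in, not the prototype's code: the function names, the fixed random seed, and the toy samples are assumptions, and the plain generator only imitates what a Torch DataLoader does with tokenized tensors.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, int]  # (message text, label 0 or 1)


def split_60_20_20(samples: Sequence[Sample], seed: int = 0
                   ) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Shuffle the samples and split them into train, val and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test part
    return train, val, test


def batches(samples: Sequence[Sample], batch_size: int
            ) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (texts, labels) pairs of at most batch_size items each."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield [text for text, _ in chunk], [label for _, label in chunk]


# Toy reference data (invented): ten messages with alternating labels.
data = [(f"message {i}", i % 2) for i in range(10)]
train, val, test = split_60_20_20(data)
print(len(train), len(val), len(test))
for texts, labels in batches(train, batch_size=4):
    print(len(texts), labels)
```

Integer arithmetic is used for the part sizes so that rounding never loses or duplicates a sample.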
The user does not have to train the model, but can add their own data for text classification. During training, 20 generations of weights are produced, and the most successful ones are selected. Based on the results of this process, the file saved_weights.pt is created, containing the weights that are optimal for binary text classification. These weights are then checked against the test data to evaluate their effectiveness.
An illustration of the work of a prototype computer system
The results produced by the module with DeepPavlov/rubert-base-cased-sentence are floating-point numbers in the range from 0 to 1. The threshold that determines the recognition result was set empirically at 0.92: values up to 0.92 are considered negative, and values above it positive. This threshold is subsequently used as the criterion for classifying text. A small part of the final results of the prototype computer system is shown in Fig. 3, which presents the following data: messages, the messages themselves (shown in the figure in a modified encoding); labels, the labels initially assigned to the corresponding messages; confidence, the exact value of the model's confidence in recognition; pred, the final recognition result. These components help to understand how the model interprets and classifies input data.
Fig. 3. The results of processing the test data.
Most often, the system interprets the data in the same way as the user, but there are also inconsistent results. For example, in line 894, the user marked the data as not containing illegal information (Label=0), while the system interpreted them as containing such information. The module responsible for the operation of the bot is called bot. It explains to the user how to work with it, receives the user's messages, and passes the link to the VKontakte group on for further processing (Kruglik R. I. Creating a chatbot in Telegram // Postulate. 2019. No. 9) [11, 12].
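Returning to the classification step, the threshold rule and the columns of Fig. 3 (labels, confidence, pred) can be expressed in a few lines. This is a minimal sketch under stated assumptions: only the 0.92 threshold comes from the text, while the function names and the sample scores are invented for illustration.

```python
from typing import List, Sequence

THRESHOLD = 0.92  # empirically chosen value reported in the text


def predict(confidence: float, threshold: float = THRESHOLD) -> int:
    """Return 1 (illegal content) only for scores above the threshold."""
    return 1 if confidence > threshold else 0


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Share of messages whose prediction matches the assigned label."""
    correct = sum(1 for label, pred in zip(labels, preds) if label == pred)
    return correct / len(labels)


def mismatches(labels: Sequence[int], preds: Sequence[int]) -> List[int]:
    """Indices of rows where the system and the user disagree."""
    return [i for i, (label, pred) in enumerate(zip(labels, preds))
            if label != pred]


# Invented scores, laid out like the labels/confidence/pred columns.
labels = [1, 0, 0, 1]
confidence = [0.99, 0.10, 0.95, 0.50]
preds = [predict(c) for c in confidence]
print(preds, accuracy(labels, preds), mismatches(labels, preds))
```

A high threshold such as 0.92 trades missed detections for fewer false alarms; rows listed by mismatches correspond to cases like line 894, where the label and the prediction differ.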
In addition, if extremist or terrorist information has been recognized, the bot module sends the result of reportgen's work to the user who monitors the available content of the social network. Next, the messagedumper module collects the last 10 non-empty community messages and sends them to the server for verification (Fig. 4). The number of messages can be increased, but the processing time will grow in proportion.
Fig. 4. Illustration of the program code of the process of collecting messages from the VKontakte social network.
The processing module processes the messages received from messagedumper using the weights produced by creating_network. The accuracy of illegal information recognition reaches 95% if the messages are long and similar in subject matter to the database. The results are passed to the reportgen module, which generates a report and sends it to the bot [21]. If no illegal information was detected during processing, report generation is skipped, and the bot issues a message stating that the group meets the established standards.
Conclusion
The proposed prototype of the computer system has significant potential for further development and improvement. Technically, it is possible to check not only the posts of VKontakte groups but also the comments under them, and to save user statistics and form an overall rating of users, which could help in identifying terrorist or extremist activities. At present, however, the platform's Rules prohibit such use of user data, so these features cannot be implemented. The restrictions are set out in section 2 "Working with data" of the Platform Rules for using the VKontakte public API (Rules for using the VKontakte API. Revision dated 03/01/2024. URL: https://dev.vk.com/ru/rules ). Applications are prohibited from the following:
2.1. Collecting and storing user data, including the User ID, for purposes unrelated to the operation of the Application. The requested data should only be used in the context of the application. For example, you can cache the IDs of a user's friends to display the list faster on a mobile device. You cannot transfer the IDs of all users to your server in order to store them in your own database, just in case.
2.2. Transferring any user data obtained automatically through the API (including User ID) to third-party services (for example, advertising) both directly and through intermediaries.
2.3. Using user data in any advertisements. For example, addressing a user by name from an advertising banner.
2.4. Using data obtained through the API, including through the methods newsfeed.search, wall.get, wall.search, and including user IDs, for transfer or resale, creation of analytical reports, scoring, etc., directly or through intermediaries without the express consent of the Site Administration. Such consent, for example, may be an agreement with an advertising agency for the use of data on ad impressions in reports to clients.
The training of the developed computer system was carried out offline. Real-time self-learning would require sufficiently powerful equipment and constant administration, but would increase the percentage of illegal text content that is recognized. The modular structure of the system allows its individual components to be upgraded continually. The scheme used to build the prototype can also be applied to the platforms of other social networks. The model can be reconfigured for binary classification of text that is not related to terrorism and extremism. The model works with the source data supplied in its input database. If other information needs to be searched for, changing the keywords makes it possible to reconfigure the model quickly in accordance with the new requirements.
Figures 3 and 4 illustrate the testing of the prototype computer system. The testing established the following:
1) The operability of the system.
2) Ease of use, including on smartphones.
3) Sufficiently high accuracy in the search for illegal content.
4) The ability of the system to analyze content not only by keywords, but also by the semantic meaning of the text.
Thus, the developed prototype of a computer system based on neural network technologies greatly simplifies the detection of extremist (illegal) information on the VKontakte social network. The system automates content monitoring and ensures timely identification of, and response to, publications that are potentially dangerous or violate the rules of the social network, which increases security and helps ensure that the content on the platform meets regulatory requirements. An important advantage of the prototype is that it is built from freely distributed software. The article will be of interest to researchers in the field of information security, developers of automated content monitoring systems, and specialists in social network analysis. The practical application of the proposed system may also be of interest to government agencies and organizations involved in the fight against extremism.
References
1. Salakhutdinov, A. A. (2014). Social networks as an information channel of extremist material. Young Scientist, 17(76), 561-564. Retrieved from https://moluch.ru/archive/76/13119/
2. Martyshkin, A. I., Markin, E. I., Zuparova, V. V. (2021). Research and development of a prototype module for automatic tracking of social network content. XXI century: results of the past and problems of the present, Vol. 10 (2), 96-100. doi: 10.46548/21vek-2021-1054-0017
3. Titov, N. G. et al. (2019). Methods of monitoring social networks, their development and application in the context of ensuring their information security. Information and security, Vol. 22 (3), 305-324. Retrieved from https://www.elibrary.ru/download/elibrary_41595797_74850594.pdf
4. Golosnoy, K. S., Yanaeva, M. V. (2022). Analysis of potentially dangerous content on the VKontakte social network. Science, society, personality: problems and prospects of interaction in the modern world, 103-107. Retrieved from https://www.elibrary.ru/download/elibrary_49490534_58906686.pdf
5. Ostapenko, A. G. et al. (2018). Organization of monitoring of posts on the VKontakte social network using the vkapi interface. Information and security, Vol. 21(3), 408-415. Retrieved from https://www.elibrary.ru/item.asp?id=36716826
6. Vikhlyaev, D. R., Glagolev, V. A. (2021). Data parsing of the VKontakte community using VK API. Postulate, 10. Retrieved from https://e-postulat.ru/index.php/Postulat/article/view/3792
7. Zhdanov, A. V., Tyutyakin, A. A. (2017). Search for common groups and communities of social network users using web services on the example of VKontakte. Information and education: boundaries of communications, 9, 74-76. Retrieved from https://www.elibrary.ru/download/elibrary_29901598_37616555.pdf
8. Lekhov, K. A., Speranskiy, D. D. et al. (2021). Sistema izvlecheniya i analiza tekstovykh dannykh iz sotsial'nykh setey dlya obrazovatel'nogo uchrezhdeniya. Modeli, sistemy, seti v ekonomike, tekhnike, prirode i obshchestve, 1, 128–136. doi: 10.21685/2227-8486-2021-1-11
9. Cordial, A. L. et al. (2020). A cartographic approach to the study of the processes of distribution of destructive content in communities of a single topic of the Vkontakte social network. Information and Security, Vol. 23(2), 203-214. Retrieved from https://www.elibrary.ru/item.asp?edn=dnfmqh
10. Bikov, D. I. (2020). Methods of processing requests for a chatbot using VK API tools. Priority areas of innovation in industry, 35-36. Retrieved from https://www.elibrary.ru/download/elibrary_44460384_57688577.pdf
11. Kozlov, A. A., Batishchev, A. V. (2017). Telegram bot as a simple and convenient way to get information. The territory of science, 5, 55-64. Retrieved from https://www.elibrary.ru/download/elibrary_32399239_57151620.pdf
12. Shvedov, N. D. (2023). Creating a simple Telegram bot: step-by-step instructions. Academic Journalism, 3-1, 7-14. Retrieved from https://aeterna-ufa.ru/sbornik/AP-2023-03-1.pdf
13. Rabchevsky, A. N. (2023). Review of methods and systems for generation of synthetic training data. Applied Mathematics and Control Sciences, 4, 6–45. doi: 10.15593/2499-9873/2023.4.01
14. Abdullah, A. L. I., Solovyova, E. B. (2022). Binary classification of texts using a separable convolutional neural network (BTC_SCNN). A computer program. Certificate No. 2022613069 dated 03/01/2022. Retrieved from https://etu.ru/ru/nauchnaya-i-innovacionnaya-deyatelnost/obekty-intellektualnoy-sobstvennosti/patenty-i-svidetelstva-2022-goda
15. Galchenko, Yu. V., Nesterov, S. A. (2023). Classification of texts by tonality by machine learning methods. System analysis in design and management, Vol. 26 (3), 369-378. doi: 10.18720/SPBPU/2/id23-501
16. Khaykin, S. (2016). Neural networks: a complete course, 2nd ed. Translated from English. Moscow: I.D. Williams LLC.
17. Zhuravlev, D. V., Smolin, V. S. (2023). The neural network revolution of artificial intelligence and its development options. In Designing the future. Problems of digital Reality. Retrieved from https://keldysh.ru/future/2023/16.pdf
18. Kulikov, A. A., Mailyan, E. K. (2021). Comparison of architectures of recurrent neural networks in the problem of binary classification of texts. Innovative development of technology and technologies in industry (INTEX-2021). Part 3. Moscow: Kosygin Russian State University, 223-226. Retrieved from https://www.elibrary.ru/download/elibrary_46298428_94237344.pdf
19. Legotin, D. L., Zrybnaya, E. A. (2019). Implementation of a recurrent artificial neural network for text classification. Actual problems of teaching information and natural science disciplines. Kostroma: KSU, 197-202. Retrieved from https://www.elibrary.ru/download/elibrary_38247279_55764596.pdf
20. Batura, T. V. (2017). Methods of automatic text classification. Software products and systems, Vol. 30(1), 85-89. doi: 10.15827/0236-235X.030.1.085-099
21. Alekseeva, V. A. (2014). The use of intellectual analysis methods in binary classification problems. Proceedings of the Samara Scientific Center of the Russian Academy of Sciences, Vol. 16 (6-2), 354-356. Retrieved from http://www.ssc.smr.ru/media/journals/izvestia/2014/2014_6_354_356.pdf
First Peer Review
Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.