Reference:
Alpatov A.N., Terloev E.Z., Matchin V.T. Architecture of a three-dimensional convolutional neural network for detecting the fact of falsification of a video sequence // Software systems and computational methods. 2024. No. 3. P. 1-11. DOI: 10.7256/2454-0714.2024.3.70849 EDN: MNOVWB URL: https://en.nbpublish.com/library_read_article.php?id=70849
Architecture of a three-dimensional convolutional neural network for detecting the fact of falsification of a video sequence
DOI: 10.7256/2454-0714.2024.3.70849
EDN: MNOVWB
Received: 26-05-2024
Published: 10-06-2024

Abstract: The article describes the use of neural network technologies to detect falsification of the contents of video sequences. In the modern world, new technologies have become an integral part of the multimedia environment, but their proliferation has also created a new threat: the possibility of misuse to falsify the contents of video sequences. This leads to serious problems, such as the spread of fake news and the misinformation of society. The article examines this problem and establishes the need to use neural networks to solve it. Compared with other existing models and approaches, neural networks achieve high efficiency and accuracy in detecting falsified video data due to their ability to extract complex features and to learn from large amounts of source data, which is especially important when the resolution of the analyzed video sequence is reduced. This work presents a mathematical model for identifying falsification of the audio and video sequences in video recordings, as well as a model based on a three-dimensional convolutional neural network that detects falsification of a video sequence by analyzing the contents of individual frames. The problem of identifying falsifications in video recordings is treated as the joint solution of two problems, identifying falsification of the audio sequence and of the video sequence, and is thereby transformed into a classical classification problem. Any video recording can be assigned to one of the four groups described in the work; only videos belonging to the first group are considered authentic, and all others are fabricated. To increase the flexibility of the model, probabilistic classifiers have been added, which makes it possible to take into account the degree of confidence in the predictions. A distinctive feature of the resulting solution is the ability to adjust the threshold values, which allows the model to be adapted to different levels of strictness depending on the task. An architecture of a three-dimensional convolutional neural network, comprising a preprocessing layer and a neural network layer, is proposed for detecting fabricated photo sequences. The resulting model attains sufficient accuracy in detecting falsified video sequences even with a significant decrease in frame resolution. Testing of the model showed a proportion of correct detections of video sequence falsification above 70%, which is noticeably better than guessing. Despite this sufficient accuracy, the model can be refined to further increase the proportion of correct predictions.

Keywords: machine learning, neural networks, convolutional neural networks, video falsification, deepfakes, deepfake detection, audio falsification, data preprocessing, anomaly detection, batch normalization

This article is automatically translated.

Introduction
Neural networks whose main purpose is to generate images and voice recordings have become very popular, and their high degree of accessibility to the ordinary user makes them more popular still. The best-known image services are DALL-E from OpenAI, Midjourney, Stable Diffusion, FaceApp, FaceSwap, and the like [1], [2]. Popular services such as ElevenLabs, Microsoft Custom Neural Voice, and Speechify are used to generate voice recordings [3]. In most cases these tools are used for harmless purposes: to show the resulting images to friends and acquaintances, to post on a social network page, to speed up a design workflow, or to speed up the creation of audiobooks. It is not difficult to imagine a significant simplification of work processes in the artistic fields, including the film industry. In addition, it is possible to "resurrect" deceased actors using voice generation tools as well as face transfer tools [4]. On the other hand, these technologies call into question the need for actors and artists in cinema, and their greater accessibility makes them more attractive tools for malicious actors [5]. Among the possible misuse scenarios for neural network photo and audio generation tools is the creation of a video recording in which a popular political or media figure makes a controversial statement that can cause great reputational damage. Identity theft and further criminal acts are also possible [6]. An example of identity theft using neural network technologies occurred in the spring of 2022, when broadcasts began to appear on the YouTube video hosting service featuring a neural network copy of Elon Musk, who offered viewers to transfer their cryptocurrency investments to him in order to receive them back with interest [7]. An example of such broadcasts is shown in Figure 1.

Figure 1 – Broadcasts featuring a neural network copy of Elon Musk on the YouTube platform [7]
On the one hand, the fraudulent scheme is quite obvious. On the other hand, an uninformed user may not attach importance to this, since the broadcast is conducted by a well-known personality, which increases the user's confidence in the information received.
Description of the model for identifying the fact of falsification of video recordings

Within this work, the task of identifying falsification of the audio and video sequences is reduced to a classical classification problem. Any video recording is assigned to one of four groups:
1. Both the photo sequence and the audio sequence of the recording are authentic;
2. The photo sequence of the recording is fabricated, the audio sequence is authentic;
3. The photo sequence of the recording is authentic, the audio sequence is fabricated;
4. Both the photo sequence and the audio sequence of the recording are fabricated.

Videos belonging to the first group are considered authentic; all others are considered fabricated. Let us denote the photo sequence of the video recording by $X$ and the audio sequence by $A$, and introduce two binary classifiers. The first, $f_X(X) \in \{0, 1\}$, determines the authenticity of the photo sequence, and the second, $f_A(A) \in \{0, 1\}$, determines the authenticity of the audio sequence. Then $f_X(X) = 1$ if the photo sequence is authentic and $f_X(X) = 0$ if it is fabricated; likewise, $f_A(A) = 1$ if the audio sequence is authentic and $f_A(A) = 0$ otherwise. The authenticity of the video recording can then be defined as $V = f_X(X) \cdot f_A(A)$.

The procedure for determining the group of a video recording can thus be generalized. First, the values $f_X(X)$ and $f_A(A)$ are computed by the classifiers; then the pair of values is matched against the possible combinations. If $(f_X, f_A) = (1, 1)$, the video belongs to the first group and is authentic. If $(f_X, f_A) = (0, 1)$, the video is not authentic, since the photo sequence is fabricated (second group). If $(f_X, f_A) = (1, 0)$, the video is not authentic, since the audio sequence is fabricated (third group). Otherwise, if $(f_X, f_A) = (0, 0)$, the video is not authentic, since both the photo and audio sequences are fabricated (fourth group).

However, such a "hard" classification threshold requires confidence that the model's predictions are correct. To increase flexibility, we modify the proposed model by adding probabilistic classifiers. Let $p_X$ denote the probability that the photo sequence is authentic, and $p_A$ the probability that the audio sequence is authentic. Following the Bayesian approach, and treating the two streams as independent, the joint probability that the photo and audio sequences are authentic is $p_X \cdot p_A$. Let $\tau_X$ and $\tau_A$ be the threshold values for the photo and audio sequences, respectively. The authenticity decisions based on the thresholds are then $f_X = 1$ if $p_X \geq \tau_X$, otherwise $f_X = 0$; similarly, $f_A = 1$ if $p_A \geq \tau_A$, otherwise $f_A = 0$. The authenticity of the video recording is again determined as $V = f_X \cdot f_A$.
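To make the decision rule concrete, below is a minimal Python sketch of the thresholded group assignment (the function name is illustrative, and the probabilities $p_X$ and $p_A$ are assumed to come from already-trained classifiers):

```python
def classify_video(p_photo: float, p_audio: float,
                   tau_photo: float = 0.5, tau_audio: float = 0.5) -> tuple[int, bool]:
    """Assign a video to one of the four groups described above.

    p_photo, p_audio -- predicted probabilities that the photo and audio
    sequences are authentic (from two probabilistic classifiers).
    tau_photo, tau_audio -- adjustable decision thresholds.
    Returns (group number 1..4, is_authentic).
    """
    f_photo = 1 if p_photo >= tau_photo else 0
    f_audio = 1 if p_audio >= tau_audio else 0
    group = {(1, 1): 1, (0, 1): 2, (1, 0): 3, (0, 0): 4}[(f_photo, f_audio)]
    return group, group == 1
```

For example, `classify_video(0.9, 0.4, tau_audio=0.6)` returns `(3, False)`: the photo sequence passes its threshold, the audio sequence does not, so the recording falls into the third group and is treated as fabricated.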
The ability to adjust the thresholds makes it possible to adapt the model to different levels of strictness depending on the task. For example, in critical cases high thresholds can be set to minimize false positives, which makes the proposed model more customizable. This is especially useful in situations where the data may be ambiguous or noisy. In turn, this approach improves the reliability of the system, since decisions are made on the basis of a probability distribution rather than a single deterministic result. Technically, fabricated video sequences can be identified by analyzing frames for anomalies, by analyzing the audio track for anomalies, or by a combined analysis. In this paper, only the analysis of photo sequences is considered. A three-dimensional convolutional neural network can be used to identify falsified photo sequences. Convolutional layers in a neural network reduce the dimensionality of the input, thereby speeding up the learning process. A three-dimensional convolutional kernel has dimension N x M x K, where N is the number of frames along the time axis, and M and K are the spatial dimensions (height and width of the frame).
In this case, a single three-dimensional layer is decomposed into a layer with kernel dimension 1 x M x K, called a spatial convolution, and a layer with kernel dimension N x 1 x 1, called a temporal convolution. This reduces the number of trainable parameters compared with a conventional three-dimensional layer of dimension N x M x K, and also yields better results in recognizing actions in video [9]. Let us denote the input video data as $X \in \mathbb{R}^{T \times H \times W \times C}$, where $T$ is the number of frames, $H$ and $W$ are the frame height and width, and $C$ is the number of channels (for example, 3 for RGB video). The spatial convolution is used to process the spatial features of each frame; that is, two-dimensional convolutions are applied to each frame independently. Let $S(\cdot)$ be the spatial convolution operation with a kernel $K_s \in \mathbb{R}^{1 \times m \times k}$.
The new height and width of each frame after the convolution depend on the kernel size, the stride, and the padding. With stride = 1 and padding = 0 they can be calculated as $H' = H - m + 1$ and $W' = W - k + 1$.
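As a small illustration, the same output-size rule generalized to arbitrary stride and padding (a hypothetical helper, not from the paper):

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension after convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# With stride = 1 and padding = 0, as in the text: H' = H - m + 1,
# e.g. conv_output_size(224, 7) == 218.
```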
The temporal convolution is used to process temporal features; that is, one-dimensional convolutions are applied along the time axis. Let $T(\cdot)$ be the temporal convolution operation with a kernel $K_t \in \mathbb{R}^{n \times 1 \times 1}$. The visualization of such a convolution is shown in Figure 2.

Figure 2 – Visualization of a three-dimensional convolution decomposed into spatial and temporal convolutions
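A hedged sketch of this factorization using Keras Conv3D layers, in the spirit of the (2+1)D decomposition of [8] (the filter count and "same" padding are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def factorized_conv3d(x, filters, n=3, m=7, k=7):
    """Replace one n x m x k Conv3D with a spatial 1 x m x k convolution
    followed by a temporal n x 1 x 1 convolution."""
    x = layers.Conv3D(filters, kernel_size=(1, m, k), padding="same")(x)  # spatial
    x = layers.Conv3D(filters, kernel_size=(n, 1, 1), padding="same")(x)  # temporal
    return x

# Input: a batch of videos, 10 frames of 224 x 224 RGB.
inp = tf.keras.Input(shape=(10, 224, 224, 3))
out = factorized_conv3d(inp, filters=16)
```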
Artificial Neural Network Architecture

The ZF DeepFake Dataset was used as the dataset [10]. This set consists of short videos, of which 199 are falsified and 176 are authentic (at the time of writing). The presented technological solution consists of a video preprocessing layer and a neural network layer. In the preprocessing layer, the video is split into 10 frames, and each frame is resized (down or up) to 224 by 224 pixels. The input tensor for each video recording thus has dimension 10 x 224 x 224 x 3, where the last dimension holds the color channels: red, green, and blue. After preprocessing, the tensors are passed to the neural network model. The machine learning model consists of:
- an input layer;
- a three-dimensional convolutional layer with the convolution decomposed into spatial and temporal parts, padding to preserve the output size, 16 filters, and a kernel size of 3 x 7 x 7;
- a batch normalization layer and a rectified linear unit (ReLU) activation layer [11];
- a layer reducing the frame dimension to 112 x 112;
- a residual block with 32 filters and a kernel size of 3 x 3 x 3;
- a layer reducing the frame dimension to 64 x 64;
- a residual block with 64 filters and a kernel size of 3 x 3 x 3;
- a three-dimensional average pooling layer [12];
- a flatten layer;
- a fully connected layer with 10 outputs.

The model's loss function is categorical cross-entropy, optimized with Adam at a learning rate of 0.0001. The dataset for training the model consists of 100 videos, of which 50 are falsified and 50 are authentic. The test and validation sets each consist of 40 recordings: 20 authentic and 20 falsified. No video recording is used in more than one of the sets. Training is conducted over 10 epochs. The structure of the neural network is shown in Figure 3. A minimal code sketch of the architecture follows below.

Figure 3 – The structure of the neural network as a flowchart
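The following Keras sketch approximates the described architecture. It is an interpretation, not the authors' code: the internal structure of the residual blocks, the pooling type, and the exact downsampling sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def residual_block(x, filters):
    """Assumed residual block: two 3 x 3 x 3 convolutions with a projection shortcut."""
    shortcut = layers.Conv3D(filters, 1, padding="same")(x)
    y = layers.Conv3D(filters, (3, 3, 3), padding="same", activation="relu")(x)
    y = layers.Conv3D(filters, (3, 3, 3), padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

inp = layers.Input(shape=(10, 224, 224, 3))        # 10 frames, 224 x 224 RGB
# Factorized 3D convolution (16 filters, kernel 3 x 7 x 7): spatial, then temporal.
x = layers.Conv3D(16, (1, 7, 7), padding="same")(inp)
x = layers.Conv3D(16, (3, 1, 1), padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
# Spatial downsampling; the paper states 224 -> 112 -> 64, approximated here
# by factor-of-2 pooling (pooling type and the second target size are assumptions).
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)    # 224 -> 112
x = residual_block(x, 32)
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)    # 112 -> 56 (paper: 64)
x = residual_block(x, 64)
x = layers.AveragePooling3D(pool_size=(2, 2, 2))(x)
x = layers.Flatten()(x)
out = layers.Dense(10, activation="softmax")(x)    # 10 outputs, as stated in the text

model = models.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```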
Testing the solution
The proportion of correct predictions (accuracy) on the training set in the last training epoch was 75%. The change in the value of the loss function over the course of training is shown in Figure 4; the change in accuracy is shown in Figure 5.

Figure 4 – Loss function value over the course of training for the training and validation sets

Figure 5 – Accuracy value over the course of training for the training and validation sets
Despite a noticeable increase in prediction accuracy on the training set, only a slight improvement is visible on the validation set. The confusion matrix for the training set is shown in Figure 6.

Figure 6 – Confusion matrix for the training set
The confusion matrix shows that the model more often classifies a video as authentic, which produces many false negative predictions. The confusion matrix for the test set is shown in Figure 7.

Figure 7 – Confusion matrix for the test set
The values of precision, recall, and F1-score for the possible classes are presented in Table 1.
Table 1 – Precision, recall, and F1-score for the predicted classes
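For reference, per-class metrics of this kind can be computed as in the following sketch (assuming scikit-learn; the label arrays are illustrative placeholders, not the paper's data):

```python
from sklearn.metrics import classification_report, confusion_matrix

# 0 = authentic, 1 = falsified (illustrative placeholders).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["authentic", "falsified"]))
```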
Conclusion
In this paper, a neural network for detecting falsification of video sequences with a significant proportion of correct predictions is presented. Nevertheless, the model can be substantially improved by enlarging the training dataset and thereby increasing the proportion of correct predictions; by reducing the working space through isolating and then analyzing specific regions where falsification is likely; and by changing the structure of the neural network. Further work on the problem may also be aimed at developing a method for detecting falsification without machine learning models, in order to reduce the risk of overfitting and the risk of a drop in the proportion of correct predictions if the technology for falsifying video sequences with neural networks changes.

References
1. Beyan, E. V. P., & Rossy, A. G. C. (2023). A review of AI image generator: influences, challenges, and future prospects for architectural field. Journal of Artificial Intelligence in Architecture, 2(1), 53-65.
2. Huang, Y., Lv, S., Tseng, K. K., Tseng, P. J., Xie, X., & Lin, R. F. Y. (2023). Recent advances in artificial intelligence for video production system. Enterprise Information Systems, 17(11), 2246188.
3. Albert, V. D., & Schmidt, H. J. (2023). AI-based B-to-B brand redesign: A case study. transfer, 47.
4. Aliev, E. V. (2023). Problems of using digital technologies in the film industry. European Journal of Arts, 1, 33-37.
5. Chow, P. S. (2020). Ghost in the (Hollywood) machine: Emergent applications of artificial intelligence in the film industry. NECSUS_European Journal of Media Studies, 9(1), 193-214.
6. Lemaykina, S. V. (2023). Problems of counteracting the use of deepfakes for criminal purposes. Jurist-Pravoveden, 2(105), 143-148.
7. Vakilinia, I. (2022, October). Cryptocurrency giveaway scam with YouTube live stream. In 2022 IEEE 13th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) (pp. 0195-0200). IEEE.
8. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6450-6459).
9. Naik, K. J., & Soni, A. (2021). Video classification using 3D convolutional neural network. In Advancements in Security and Privacy Initiatives for Multimedia Images (pp. 1-18). IGI Global.
10. ZF DeepFake Dataset [Electronic resource]. Retrieved from https://www.kaggle.com/datasets/zfturbo/zf-deepfake-dataset
11. Garbin, C., Zhu, X., & Marques, O. (2020). Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimedia Tools and Applications, 79(19), 12777-12815.
12. Zhou, D. X. (2020). Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124, 319-327.