
Software systems and computational methods
Reference:

Architecture of a three-dimensional convolutional neural network for detecting the fact of falsification of a video sequence

Alpatov Aleksey Nikolaevich

ORCID: 0000-0001-8624-1662

Associate Professor; IiPPO Department; MIREA - Russian Technological University

78 Vernadsky Ave., Moscow, 119454, Russia

aleksej01-91@mail.ru
Terloev Emil' Ziyaudinovich

Postgraduate student; Department of Instrumental and Applied Software; MIREA — Russian Technological University

78 Vernadsky Ave., Moscow, 119454, Russia

emil199@yandex.ru
Matchin Vasilii Timofeevich

Senior Lecturer; Institute of Information Technology; MIREA — Russian Technological University

78 Vernadsky Ave., Moscow, 119454, Russia

matchin@mirea.ru

DOI:

10.7256/2454-0714.2024.3.70849

EDN:

MNOVWB

Received:

26-05-2024


Published:

10-06-2024


Abstract: The article examines the use of neural network technologies to detect falsification of video content. New technologies have become an integral part of the modern multimedia environment, but their spread has also created a new threat: the possibility of misuse to falsify video content. This leads to serious problems, such as the spread of fake news and the misinformation of society. The article examines this problem and argues for the use of neural networks to solve it. Compared with other existing models and approaches, neural networks achieve high efficiency and accuracy in detecting falsified video data owing to their ability to extract complex features and learn from large volumes of source data, which is especially important when the resolution of the analyzed video sequence is reduced. The paper presents a mathematical model for identifying falsification of the audio and video tracks of a recording, as well as a model based on a three-dimensional convolutional neural network that detects falsification of a video sequence by analyzing the contents of individual frames. The problem of identifying falsifications in video recordings is treated as the joint solution of two problems, identifying falsification of the audio track and of the video track, and is thereby reduced to a classical classification problem. Any video recording can be assigned to one of the four groups described in the paper. Only videos belonging to the first group are considered authentic; all others are considered fabricated. To increase the flexibility of the model, probabilistic classifiers were added, which makes it possible to take the degree of confidence in the predictions into account.
The distinctive feature of the resulting solution is the ability to adjust threshold values, which allows the model to be adapted to different levels of strictness depending on the task. An architecture of a three-dimensional convolutional neural network, including a preprocessing layer and a neural network layer, is proposed for detecting fabricated photo sequences. The resulting model achieves a sufficient degree of accuracy in detecting falsified video sequences even with a significant reduction in frame resolution. Testing of the model showed a proportion of correctly detected video falsifications above 70% on the training dataset, noticeably better than chance. Despite this accuracy, the model can be refined to further increase the proportion of correct predictions.


Keywords:

machine learning, neural networks, convolutional neural networks, video falsification, deepfakes, deepfake detection, audio falsification, data preprocessing, anomaly detection, batch normalization

This article is automatically translated.

Introduction

 

Neural networks whose main purpose is to generate images and voice recordings have recently become very popular, and their high degree of accessibility to ordinary users makes them more popular still. The best-known image services are DALL-E from OpenAI, Midjourney, Stable Diffusion, FaceApp, FaceSwap and the like [1] [2]. Popular services such as ElevenLabs, Microsoft Custom Neural Voice and Speechify are used to generate voice recordings [3].

In most cases these tools are used for harmless purposes: to show the resulting images to friends and acquaintances, to post them on a social network page, to speed up a design workflow or the creation of audio books. It is not difficult to imagine a significant simplification of work processes in artistic fields, including the film industry. In addition, it is possible to "resurrect" deceased actors using voice generation tools as well as face-transfer tools [4].

On the other hand, these technologies call into question the need for actors and artists in cinema, and their growing accessibility makes them attractive tools for malicious actors [5]. One scenario for misusing neural network image and audio generation tools is creating a video recording in which a popular political or media figure makes a controversial statement that can cause great reputational damage. Identity theft and further criminal acts are also possible [6].

An example of identity theft using neural network technologies occurred in the spring of 2022, when broadcasts began to appear on the YouTube video hosting service featuring a neural network copy of Elon Musk, who offered viewers to transfer their cryptocurrency investments to him in order to receive them back with interest. An example of such a broadcast is shown in Figure 1 [7].


Figure 1 – Broadcasts featuring a neural network copy of Elon Musk on the YouTube platform[7]

 

On the one hand, the fraudulent scheme is quite obvious. On the other hand, an uninformed user may not attach importance to this, since the broadcast appears to be conducted by a well-known personality, which increases the user's confidence in the information received.

 

Description of the model for identifying the fact of falsification of video recordings

The task of identifying falsification of the audio and video tracks is, within the framework of this work, reduced to a classical classification problem.

Any video recording is assigned to one of four groups:

1. Both the photo (frame) sequence and the audio sequence of the recording are authentic;

2. The photo sequence of the recording is fabricated, the audio sequence is authentic;

3. The photo sequence of the recording is authentic, the audio sequence is fabricated;

4. Both the photo sequence and the audio sequence of the recording are fabricated.

Videos belonging to the first group are considered authentic; all others are considered fabricated.

Let us denote the photo sequence of the video recording by X and the audio sequence by A, and introduce two classifiers. The first classifier, C_X, determines the authenticity of the photo sequence (authentic or fabricated), and the second, C_A, determines the authenticity of the audio sequence. Then C_X(X) = 1 if the photo sequence is authentic, otherwise C_X(X) = 0, that is, the photo sequence is fabricated. Similarly, C_A(A) = 1 if the audio sequence is authentic, otherwise C_A(A) = 0, that is, the audio sequence is fabricated. The authenticity of the video recording can then be defined as V = C_X(X) · C_A(A).

The procedure for determining a video recording's group can then be generalized. First, the values C_X(X) and C_A(A) are computed by the classifiers. Next, the results are compared with the possible combinations to assign a group. If C_X(X) = 1 and C_A(A) = 1, the video belongs to the first group and is therefore authentic. If C_X(X) = 0 and C_A(A) = 1, the video is not authentic, since the video sequence is fabricated (second group). If C_X(X) = 1 and C_A(A) = 0, the recording is not authentic, since the audio track is fabricated (third group). Otherwise, if C_X(X) = 0 and C_A(A) = 0, the recording is not authentic, since both the audio track and the video sequence are fabricated (fourth group). However, such a "hard" classification threshold requires confidence that the model is correct in its predictions. To increase flexibility, we modify the proposed model by adding probabilistic classifiers.
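The group-assignment rule described above can be sketched as a small function. This is a minimal illustration; the function name and the binary classifier outputs are notation introduced here, not part of the authors' implementation.

```python
def classify_video(cx: int, ca: int) -> tuple[int, bool]:
    """Map hard classifier outputs to the four groups described above.

    cx: 1 if the photo sequence is judged authentic, 0 if fabricated.
    ca: 1 if the audio sequence is judged authentic, 0 if fabricated.
    Returns (group number 1..4, overall authenticity of the recording).
    """
    group = {(1, 1): 1, (0, 1): 2, (1, 0): 3, (0, 0): 4}[(cx, ca)]
    authentic = (cx * ca == 1)  # V = C_X(X) * C_A(A)
    return group, authentic
```

Only the combination (1, 1) yields an authentic verdict; every other combination marks the recording as fabricated.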

Let P_X denote the probability that the photo sequence is authentic, and P_A the probability that the audio sequence is authentic. Next, we use a Bayesian approach to determine the joint probability that both the photo sequence and the audio sequence are authentic. Let τ_X and τ_A be the threshold values for determining the authenticity of the photo sequence and the audio sequence, respectively. Then the authenticity of the photo sequence based on the threshold is determined by C_X(X) = 1 if P_X ≥ τ_X, otherwise 0. Similarly, C_A(A) = 1 if P_A ≥ τ_A, otherwise 0. The authenticity of the video recording can then be determined as V = C_X(X) · C_A(A).

 

The ability to adjust the thresholds makes it possible to adapt the model to different levels of strictness depending on the task. For example, in critical applications high thresholds can be set to minimize false positives, which makes the proposed model more configurable. This is especially useful when the data may be ambiguous or noisy. In turn, this approach improves the reliability of the system, since decisions are made on the basis of a probability distribution rather than a single deterministic result.
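The thresholded probabilistic rule can be sketched in the same way. The default threshold values 0.7 and 0.6 below are arbitrary illustrations, not values taken from the paper.

```python
def threshold_classify(p_x: float, p_a: float,
                       tau_x: float = 0.7, tau_a: float = 0.6) -> bool:
    """Binarize probabilistic classifier outputs with per-track thresholds.

    p_x, p_a: predicted probabilities that the photo and audio sequences
    are authentic. tau_x, tau_a: decision thresholds (illustrative values).
    """
    cx = 1 if p_x >= tau_x else 0
    ca = 1 if p_a >= tau_a else 0
    return cx * ca == 1  # the recording is authentic only if both tracks pass
```

Raising either threshold makes the rule stricter: fewer recordings are accepted as authentic, reducing false positives at the cost of more false rejections.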

Technically, fabricated video sequences can be identified by analyzing frames for anomalies, by analyzing the audio track for anomalies, or by combined analysis. In this paper we consider only the analysis of the photo (frame) sequence.

A three-dimensional convolutional neural network can be used to identify falsified photo sequences. Convolutional layers in a neural network reduce the dimensionality of the input, thereby speeding up the learning process. A three-dimensional convolution has kernel dimension N x M x K,

where:

N is the number of frames in the time axis,

M and K are spatial dimensions (height and width of the frame).

 

In this case, a single three-dimensional layer is decomposed into a layer with dimension 1 x M x K, called a spatial convolution, and a layer with dimension N x 1 x 1, called a temporal convolution. This reduces the number of trainable parameters compared with a conventional three-dimensional layer of dimension N x M x K, and also yields better results in recognizing actions in video [9].
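The parameter saving from the decomposition is easy to check by counting kernel weights. This is a rough per-kernel count that ignores input/output channels and biases; the 3 x 7 x 7 kernel size used in the example is the one from the architecture described later in the paper.

```python
def full_3d_params(n: int, m: int, k: int) -> int:
    """Weights in a single N x M x K three-dimensional kernel."""
    return n * m * k

def decomposed_params(n: int, m: int, k: int) -> int:
    """Weights in a 1 x M x K spatial kernel plus an N x 1 x 1 temporal kernel."""
    return m * k + n

# Example with a 3 x 7 x 7 kernel:
# full 3D kernel: 3 * 7 * 7 = 147 weights;
# decomposed:     7 * 7 + 3 =  52 weights.
```

The saving grows with kernel size, since the product n*m*k is replaced by the sum m*k + n.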

Let us denote the input video data as X ∈ R^(C x N x M x K), where C is the number of channels (for example, 3 for RGB video). The spatial convolution processes the spatial features of each frame, that is, two-dimensional convolutions are applied to each frame independently. Let S be a spatial convolution operation with a kernel of size 1 x m x k.

 

The new height and width of each frame after convolution depend on the kernel size, the stride and the padding. With stride = 1 and padding = 0 they are M' = M − m + 1 and K' = K − k + 1.
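The general output-size relation, including stride and padding, can be expressed as a helper function (the formula is the standard one for convolutions; the function name is introduced here for illustration):

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output extent of a convolution along one axis:
    floor((size + 2*padding - kernel) / stride) + 1.
    With stride = 1 and padding = 0 this reduces to size - kernel + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# For example, a 224-pixel axis convolved with a 7-wide kernel
# (stride 1, no padding) yields an output extent of 218.
```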

 

The temporal convolution processes temporal features, that is, one-dimensional convolutions are applied along the time axis. Let T be a temporal convolution operation with a kernel of size n x 1 x 1.

The visualization of such a convolution is shown in Figure 2.

Figure 2 – Visualization of a three-dimensional convolution with decomposition into spatial and temporal

 

         

Artificial Neural Network Architecture

The ZF DeepFake Dataset was used as the dataset [10]. It consists of short videos, of which 199 are falsified and 176 are authentic (at the time of writing).

The presented solution consists of a video preprocessing layer and a neural network layer. In the preprocessing layer the video is split into 10 frames, and each frame is downscaled or upscaled to 224 by 224 pixels. The input tensor for each video recording thus has dimension 10 x 224 x 224 x 3, where the last dimension corresponds to the color channels: red, green and blue.
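The preprocessing step can be sketched with NumPy alone, using uniform frame sampling and nearest-neighbor resizing. A real pipeline would typically use a video decoder and a proper resizing routine, so this is only an illustration of the tensor shapes involved.

```python
import numpy as np

def preprocess(video: np.ndarray, n_frames: int = 10, size: int = 224) -> np.ndarray:
    """Reduce a (T, H, W, 3) video to a (n_frames, size, size, 3) tensor.

    Frames are sampled uniformly along the time axis; each frame is
    rescaled by nearest-neighbor indexing.
    """
    t, h, w, _ = video.shape
    frame_idx = np.linspace(0, t - 1, n_frames).round().astype(int)
    row_idx = np.linspace(0, h - 1, size).round().astype(int)
    col_idx = np.linspace(0, w - 1, size).round().astype(int)
    sampled = video[frame_idx]                    # (n_frames, H, W, 3)
    resized = sampled[:, row_idx][:, :, col_idx]  # (n_frames, size, size, 3)
    return resized
```

A 30-frame 480x640 clip, for example, becomes a 10 x 224 x 224 x 3 tensor, matching the input dimension stated above.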

After preprocessing, the tensors are passed to the neural network model. The machine learning model consists of:

1. an input layer;

2. a three-dimensional convolution layer decomposed into spatial and temporal convolutions, with padding preserving the output size, 16 filters and a kernel size of 3 x 7 x 7;

3. a batch normalization layer and a rectified linear unit (ReLU) activation layer [11];

4. a layer reducing the frame dimension to 112 x 112;

5. a residual layer with 32 filters and a kernel size of 3 x 3 x 3;

6. a layer reducing the frame dimension to 64 x 64;

7. a residual layer with 64 filters and a kernel size of 3 x 3 x 3;

8. a three-dimensional average pooling layer [12];

9. a flatten layer and a fully connected layer with 10 outputs.

The loss function of the model is categorical cross-entropy, optimized with Adam at a learning rate of 0.0001.
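The tensor shapes implied by this sequence of layers can be traced with plain arithmetic. This is bookkeeping only: it does not build the network, and the treatment of the pooling stage as global averaging over frames and positions is an assumption made here for illustration.

```python
def trace_shapes() -> list[tuple[str, tuple]]:
    """Follow the (frames, height, width, channels) shape through the
    layers described above, assuming size-preserving padding in the
    convolutional and residual layers."""
    shapes = [("input", (10, 224, 224, 3))]            # preprocessed video
    shapes.append(("(2+1)D conv, 16 filters", (10, 224, 224, 16)))
    shapes.append(("resize to 112 x 112", (10, 112, 112, 16)))
    shapes.append(("residual, 32 filters", (10, 112, 112, 32)))
    shapes.append(("resize to 64 x 64", (10, 64, 64, 32)))
    shapes.append(("residual, 64 filters", (10, 64, 64, 64)))
    shapes.append(("average pooling + flatten", (64,)))  # assumed global
    shapes.append(("fully connected", (10,)))            # 10 outputs
    return shapes
```

Tracing shapes this way is a quick sanity check that the resizing steps and filter counts are mutually consistent before any training code is written.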

The training set consists of 100 videos, of which 50 are falsified and 50 are authentic. The test and validation sets each consist of 40 recordings: 20 authentic and 20 falsified. No video recording is used in more than one of the sets. Training is conducted over 10 epochs. The structure of the neural network is shown in Figure 3.
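The disjoint split described above can be sketched as follows. The function name is hypothetical, and per-class balancing (50/50 and 20/20) is omitted for brevity; the key property illustrated is that the three sets share no recording.

```python
import random

def split_dataset(videos: list[str], seed: int = 0):
    """Shuffle and split into train (100), validation (40) and test (40)
    sets with no overlap, mirroring the split described above."""
    rng = random.Random(seed)
    shuffled = videos[:]
    rng.shuffle(shuffled)
    train = shuffled[:100]
    val = shuffled[100:140]
    test = shuffled[140:180]
    # No recording may appear in more than one set.
    assert not (set(train) & set(val) | set(train) & set(test) | set(val) & set(test))
    return train, val, test
```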

Figure 3 – The structure of a neural network in the form of a flowchart

 

Testing the solution

 

The proportion of correct predictions (accuracy) on the training set in the final training epoch was 75%. The change in the value of the error function over the course of training is shown in Figure 4. The change in accuracy over the course of training is shown in Figure 5.

Figure 4 – Graph of the change in the value of the error function over the course of training for the training and validation data set

Figure 5 – Graph of the change in the accuracy value over the course of training for the training and validation data set

 

Despite a noticeable increase in prediction accuracy on the training set, only a slight improvement is visible on the validation set. The confusion matrix for the training dataset is shown in Figure 6.

Figure 6 – Confusion matrix for the training set

 

The confusion matrix shows that the model more often labels a video as authentic, which is why there are many false negative predictions. The confusion matrix for the test dataset is shown in Figure 7.

Figure 7 – Confusion matrix for the test set

 

The precision, recall, and F1-score values for the two classes are presented in Table 1.

 

Table 1 – Precision, recall, and F1-score for the predicted classes

Class / metric     Authentic    Falsified
Precision          0.552        0.6364
Recall             0.8          0.35
F1-score           0.653        0.451
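The relationship between a confusion matrix and these metrics can be checked directly. The counts below are not reported in the paper; they are illustrative values for a 40-recording test set (20 per class) chosen so as to be consistent with the reported precision and recall.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard per-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Authentic class: assume 16 of 20 authentic videos were detected and
# 13 falsified videos were mislabeled as authentic.
p, r, f1 = precision_recall_f1(tp=16, fp=13, fn=4)
# p ≈ 0.552, r = 0.8, f1 ≈ 0.653, matching the "Authentic" column.
```

The falsified class with tp=7, fp=4, fn=13 reproduces the second column in the same way, which also illustrates the observation above: most errors are falsified videos accepted as authentic.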

 

 

Conclusion

 

This paper presents a neural network for detecting the falsification of video sequences with a significant proportion of correct detections. Nevertheless, the model can be substantially improved by expanding the training dataset and increasing the share of training data; by reducing the working area through selecting and analyzing specific regions of possible falsification; and by changing the structure of the neural network.

Further work on the problem may also be aimed at developing a method for detecting falsification without machine learning models, in order to reduce the risk of overfitting and of a drop in the proportion of correct predictions if the technology for falsifying video sequences with neural networks changes.

References
1. Beyan, E. V. P., & Rossy, A. G. C. (2023). A review of AI image generator: influences, challenges, and future prospects for architectural field. Journal of Artificial Intelligence in Architecture, 2(1), 53-65.
2. Huang, Y., Lv, S., Tseng, K. K., Tseng, P. J., Xie, X., & Lin, R. F. Y. (2023). Recent advances in artificial intelligence for video production system. Enterprise Information Systems, 17(11), 2246188.
3. Albert, V. D., & Schmidt, H. J. (2023). Al-based B-to-B brand redesign: A case study. transfer, 47.
4. Aliev, E. V. (2023). Problems of using digital technologies in the film industry. European Journal of Arts, 1, 33-37.
5. Chow, P. S. (2020). Ghost in the (Hollywood) machine: Emergent applications of artificial intelligence in the film industry. NECSUS_European Journal of Media Studies, 9(1), 193-214.
6. Lemaykina, S. V. (2023). Problems of counteracting the use of deepfakes for criminal purposes. Jurist-Pravoveden, 2(105), 143-148.
7. Vakilinia, I. (2022, October). Cryptocurrency giveaway scam with youtube live stream. In 2022 IEEE 13th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) (pp. 0195-0200). IEEE.
8. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6450-6459).
9. Naik, K. J., & Soni, A. (2021). Video classification using 3D convolutional neural network. In Advancements in Security and Privacy Initiatives for Multimedia Images (pp. 1-18). IGI Global.
10. ZF DeepFake Dataset [Electronic resource]. Retrieved from https://www.kaggle.com/datasets/zfturbo/zf-deepfake-dataset.
11. Garbin, C., Zhu, X., & Marques, O. (2020). Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimedia tools and applications, 79(19), 12777-12815.
12. Zhou, D. X. (2020). Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124, 319-327.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The article discusses the development and testing of a model of a three-dimensional convolutional neural network (3D CNN) to detect the fact of falsification of a video sequence. The aim of the research is to create a system capable of effectively recognizing authentic and fabricated video files. The methodology includes the use of 3D CNN, where convolutional layers are decomposed into spatial and temporal, which allows you to reduce the number of trained parameters and improve the results of video analysis. The ZF DeepFake dataset was used for training and testing the model, which ensures sufficient reliability of the results. The model was trained and tested on various datasets, including authentic and falsified video recordings. With the development of neural network technologies and their accessibility to the general public, the risk of using these technologies for fraudulent purposes, such as creating deepfakes, increases. The relevance of the study is emphasized by the need to develop reliable methods for detecting fraud, which can help prevent crimes and preserve the reputation of public figures. The scientific novelty of the work lies in the proposal of an improved 3D CNN architecture for detecting video sequence falsifications, as well as in using a probabilistic approach to improve classification accuracy. The proposed model allows flexible adjustment of threshold values for various tasks, which makes it universal and more accurate. The article is written in a scientific style with a clear structure and logical presentation of the material. The introduction describes in detail the current problems and objectives of the study. The description of the methodology and architecture of the model is given in detail, which allows you to understand the key aspects of the work. Testing of the model and discussion of the results are carried out using graphs and tables, which makes the conclusions transparent and understandable. 
In conclusion, the authors emphasize the effectiveness of the proposed model and the need for further improvement. It is indicated that the model can be improved by increasing the amount of data for training and changing the architecture of the neural network. Further research may also be aimed at developing methods for detecting fraud without using machine learning, which may reduce the risk of overfitting. The article will be of interest to researchers in the field of artificial intelligence, computer vision and information security. The presented results can be applied in various fields, including the media industry, the legal sphere and cybersecurity. To further develop the work, I propose to increase the amount of data for training. This includes extending the dataset by using more data for training and testing the model. It is important to consider the use of various data sources, including public datasets and proprietary video collections. It is also necessary to diversify the data to include various types of falsifications, which will allow for a more complete presentation of all possible scenarios. The article is an important contribution to the field of detecting video sequence falsifications and is recommended for publication. The presented results demonstrate the high potential of the proposed model and its applicability in real conditions. A small note: in the sentence "The values of accuracy, completeness, and F1-measures for possible classes are presented ..." before "and" a comma is not needed.