Reference: Gorokhova R.I., Nikitin P.V. Detecting anomalies in images of critical information infrastructure // National Security. 2024. No. 6. P. 32-46. DOI: 10.7256/2454-0668.2024.6.71885. EDN: SMNUAJ. URL: https://en.nbpublish.com/library_read_article.php?id=71885
Detecting anomalies in images of critical information infrastructure
DOI: 10.7256/2454-0668.2024.6.71885
EDN: SMNUAJ
Received: 04-10-2024
Published: 04-12-2024

Abstract: The authors present a method for detecting anomalies in the visual content of critical information infrastructure based on comparing hash strings derived from visual data in order to detect potential deviations or duplicate content. The subject of the study is the detection of anomalies in critical information infrastructure (CII) images using hashing technology. The relevance of the study stems from the growing threat of copyright infringement and the distribution of illegal content. Critical information infrastructures, including public administration, science, economics and energy systems, depend directly on the protection of their information; identifying anomalies in images and responding to them promptly therefore play a key role in maintaining the integrity and confidentiality of data. The goal is to develop an algorithm for identifying duplicate content and to create an effective tool for monitoring images. The work integrates computer vision methods and machine learning algorithms; the development includes the use of hash strings for precise comparison of images. The scientific novelty of this study lies in the development and implementation of a new approach to detecting anomalies in CII images using hashing technology. The technology provides unique identifiers for visual data and allows efficient comparison and analysis of images. This approach significantly increases the speed of data processing and the accuracy of detecting duplicate content and anomalies in images. Hash-based image classification provides a higher degree of sensitivity to anomalies and allows false positives to be filtered out, which is critical for organizations with a high level of information security. The results show the high efficiency of the proposed method: a significant degree of accuracy in detecting anomalies has been achieved, which is confirmed by experiments on real data. The presented algorithms have demonstrated improvement over existing solutions.

Keywords: critical information infrastructure, visual content, anomalies, threats, computer vision, similar images, duplicate content, hash strings, perceptual hashing, machine learning

Introduction
Various types of anomalies may occur in the visual content of critical information infrastructure (CII) that pose potential threats to information security. Anomalies can take several forms: malicious images that contain malicious code or scripts (for example, XSS attacks or embedded exploits) or are intentionally distorted to bypass filtering and detection systems; falsified images, that is, fake or fabricated images used for disinformation and manipulation, or manipulated to conceal or distort information; indicator-of-compromise images that contain signs of intrusion, targeted attacks or malicious actions; and atypical or suspicious images that do not correspond to the "normal" visual content characteristic of a given CII, exhibit abnormal characteristics deviating from established patterns or security policies, or are associated with atypical events in the functioning of the CII. It should be noted that security issues are currently being investigated very actively, and information security is an important component of a wide variety of fields. The specifics of threats are considered in the study by Romashkova O. N. and Kapterev A. I.
[1]. The identification and classification of threats that may arise for various objects, including both information systems and physical assets, is addressed in the work of Turyanskaya K. A. [2]. Current threats related to the use of cloud computing are covered in the article by Boyarchuk D. A., Frolov K. A. and Sklyaruk V. L. [3]. Key aspects of information security in the context of digitalization, including the vulnerabilities of new technologies, ways of responding to incidents and the importance of developing risk management strategies, are studied in the work of Mustafaev A. G., Kobzarenko D. N. and Buchaev A. Ya. [4]. Emelianov A. A. analyzes the risks and vulnerabilities that may arise when using hypervisors, and their impact on the protection of virtualized environments [5]. Various approaches to threat modeling are proposed, developing scenarios to identify vulnerabilities and predict possible attacks on information resources [6]. An integrated approach to risk management and the need for constant threat monitoring are proposed in the work of Grin' V. S. [7]; the author presents recommendations for preventing information leaks, including access control methods, system audits and staff training. The article by A. Bekmukhan and O. Usatova [8] explores approaches to improving the security of multi-server web applications and systems, discussing risk management methods that include threat identification, vulnerability assessment and the use of data protection technologies. The issues of information verification under modern information flows are considered in detail in the work of Shesterkina L. P., Krasavina A. V. and Khakimova E. M. [9]; the authors cover various aspects of the fact-checking process, including methods for assessing the reliability of information sources, the use of effective tools and technologies, and the signs of unreliable data. Recently, a growing number of researchers have turned to image processing. The article by Alpatov A. N. [10] analyzes modern approaches and methods of video stream analysis aimed at identifying anomalous events of deep genesis. The author emphasizes the importance of timely anomaly detection for improving the safety and effectiveness of systems such as video surveillance, security and monitoring, substantiates the main characteristics of deep-genesis anomalies, and analyzes the algorithms and technologies used to identify them; a key aspect of the research is the application of machine learning and artificial intelligence methods to automate the analysis and improve detection accuracy. In [11], the use of neural networks for studying visual content from an information security perspective was investigated, and the article [12] shows the possibilities of various AI technologies, such as machine learning, big data analysis and neural networks, and their impact on the diagnosis, treatment and prevention of diseases based on the analysis of available images. A software and hardware solution comprising machine learning algorithms and data analysis methods that effectively identify potentially dangerous multimedia objects is presented in the article by Pilkevich S. V. et al. [13]; the authors also stress the importance of such research for improving cybersecurity and protecting users from malicious content.
The mechanisms by which attackers can manipulate images, creating distorted data that can lead to errors in interpretation, are considered in [14]. The study examines modern threats related to the impact of malicious perturbations on computer vision systems: the authors analyze various types of attacks, their implementation scenarios and their consequences for the reliability and security of image processing systems. As a response to these threats, they discuss existing protection methods and developments aimed at increasing the resilience of systems to malicious influences, including approaches based on filtering and regularization algorithms, as well as training methods that improve the generalization ability of models. An innovative approach to malware detection is proposed in [15]: the binary code of programs is converted into images, which makes it possible to apply existing machine learning algorithms to threat analysis and recognition. The article describes the process of converting binary files into two-dimensional images and justifies this technique as a means of combating malware. In [16], convolutional neural networks are used for the static analysis of applications represented as byte sequences that are then translated into image format for malware detection. It should be noted that many works have focused on converting binary executable files into images. For example, in [17] the authors group the binary sequences of executable files into 8-bit vectors, convert them to black-and-white images, and then apply the random forest algorithm directly to classify malware, using pixel values as features. In [18, 19], the authors extract visual features using classical computer vision feature extractors. The detection of malicious images and the related cybersecurity problem are presented in [20, 21]; the authors emphasize that image files can be used to spread malware while bypassing traditional filtering mechanisms. As noted in [22], the following criteria were used to classify legitimate and malicious files: file size, maximum marker size, number of markers, and number of bytes after the end of the file. To classify legitimate and malicious JPEG files, decision trees and ensembles of decision trees (random forest and stochastic gradient boosting) were used [23]. In [24], the authors explain that perceptual hashing makes it possible to create image hashes that take visual characteristics into account rather than just bit representations; this provides more stable identification even if the images have undergone changes such as resizing, compression or minor color shifts. The authors of [25] emphasize that effective image comparison is of key importance for computer vision applications such as multimedia content management, digital photography and monitoring systems. In [26], various distance metrics for the analysis and comparison of raster images are considered; the authors also discuss the impact of metric choice on the quality of image recognition and matching, and give examples of scenarios in which different metrics show significant differences in performance.
Identifying such anomalies in the visual content of a critical information infrastructure is an important task for ensuring its information security and resilience to cyber threats, and deep learning methods make it possible to create effective systems for detecting and responding to these types of anomalies. The relevance of the work stems from the need to manage a large volume of digital content, identify duplicate content, protect copyright and combat fraud. Models should be able to detect anomalies in visual content that may indicate the presence of malware, phishing attacks or other threats to information security. The task is to develop an approach based on transfer learning and active learning methods for adapting deep learning models to the features of the visual content of a critical information infrastructure. The goal is to increase the accuracy and efficiency of detecting anomalies in images uploaded to the CII, and to strengthen control over the use of images, by developing an algorithm for searching for identical and similar images in a database using perceptual hash strings.
Research methods
Searching for identical and similar images in a database is an important task in computer vision, image processing and data management systems, and various approaches make it possible to solve it effectively. Image hashing is based on creating a unique representation (hash) for each image. There are several hashing methods:
– perceptual hashing (pHash), which uses transformations (for example, the discrete cosine transform) to create a compact representation of an image; the more similar two images are, the closer their hashes are in value;
– classification hashing (dHash and aHash), which uses simple algorithms such as computing the difference between neighboring pixels (dHash) or comparing against an average value (aHash); these also produce unique hashes that allow similarities to be identified quickly.
The computed hashes are then compared to one another (a brief code sketch follows this list). The most commonly used comparison methods are:
– the Hamming distance, which allows fast comparison of the bit sequences of two hashes by counting the number of positions in which they differ;
– cosine similarity, which measures the angle between the vectors representing the images and identifies similarity by the direction of those vectors.
Another approach is feature extraction, which represents an image through high-level characteristics for a more accurate search:
– convolutional neural networks (CNNs) extract deep image features such as textures, shapes and colors; the resulting vectors can be compared using search methods such as kNN (k-Nearest Neighbors);
– key-point extraction methods, such as SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features), extract key points and image descriptors.
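To make the hashing and comparison steps concrete, the following minimal sketch (not the authors' implementation) computes aHash and dHash with the Pillow library and compares them by Hamming distance; the file names are placeholders.

# Minimal aHash/dHash sketch (illustrative only; not the authors' code).
# Requires Pillow: pip install Pillow
from PIL import Image

def average_hash(path, size=8):
    # aHash: compare each pixel of a downscaled grayscale image to the mean.
    img = Image.open(path).convert("L").resize((size, size), Image.BICUBIC)
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p >= mean)

def difference_hash(path, size=8):
    # dHash: compare each pixel to its right-hand neighbor.
    img = Image.open(path).convert("L").resize((size + 1, size), Image.BICUBIC)
    px = list(img.getdata())
    bits = [px[r * (size + 1) + c] < px[r * (size + 1) + c + 1]
            for r in range(size) for c in range(size)]
    return sum(1 << i for i, b in enumerate(bits) if b)

def hamming(h1, h2):
    # Hamming distance: number of differing bits between two integer hashes.
    return bin(h1 ^ h2).count("1")

# Usage with hypothetical file names:
# print(hamming(average_hash("a.jpg"), average_hash("b.jpg")))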
An analysis of the capabilities of several existing similar-image search tools, such as Yandex Images, Google Images, Duplicate Photo Finder and VisiPics, their applications and the algorithms on which they are based is presented in Table 1. The existing Google Images and Yandex Images solutions are limited by their APIs, their speed and the constant growth of the image database, while Duplicate Photo Finder and VisiPics have low resistance to modifications. Thus, there is no open solution suitable for the task at hand.

Table 1 – Comparison of existing solutions
The analysis leads to the conclusion that developing a method for searching for similar images is relevant, since existing solutions do not fully meet the selected criteria. Services that search for images in local files are not resistant to image modifications and can detect only exact duplicates. They also do not support searching for a specific image: only a full directory crawl with identification of all duplicate pairs is available. There are several approaches to identifying similar images, the main ones being:
– pixel comparison, a sequential comparison of pixel values at the same positions in two images;
– image histograms, an analysis of the distribution of brightness or color characteristics of an image and the formation of the corresponding histograms;
– convolutional neural networks;
– perceptual hashes, based on computing an image hash that takes its content into account.
A comparison of the methods is presented in Table 2.
Table 2 – Comparison of existing methods
The method based on perceptual hash functions is easy to implement, yet shows high speed and accuracy in detecting duplicate and similar images and does not require large computational resources. A comparative analysis of existing perceptual hash algorithms is presented in Table 3.
Table 3 – Comparative analysis of perceptual hash algorithms
As a result of the comparison, a hashing algorithm based on the discrete cosine transform (DCT) was chosen, since it is the most accurate and the most resistant to image modifications. The algorithm proceeds as follows: the DCT is applied to the preprocessed image, decomposing it into frequency components; only the low-frequency components are used further, while high-frequency noise and fine details are ignored; binarization is then performed, producing a chain of bits from which the hash is built. Binarization is performed in accordance with formula (1): each DCT coefficient is compared with the mean value of the entire coefficient matrix, and if the value is greater than or equal to the mean, a 1 is written to the bit chain, otherwise a 0. The output is a 64-bit hash.
$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \qquad b_k = \begin{cases} 1, & X_k \ge \bar{X}, \\ 0, & X_k < \bar{X}, \end{cases} \qquad (1)$$

where $x_n$ is the sequence of signal points, $N$ is the image size, $X_k$ is the DCT coefficient, and $\bar{X}$ is the mean of the coefficient matrix.
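The DCT-based hash described above can be sketched as follows. The choice of the top-left 8x8 block as the low-frequency region is a common pHash convention assumed here, since the article does not fix the block size.

# Sketch of the described DCT-based 64-bit hash (the 8x8 low-frequency block
# is an assumed convention). Requires: pip install numpy scipy Pillow
import numpy as np
from scipy.fft import dct
from PIL import Image

def dct_hash(path):
    # Grayscale and reduce to 32x32, as in the preprocessing stage.
    img = Image.open(path).convert("L").resize((32, 32), Image.BICUBIC)
    a = np.asarray(img, dtype=np.float64)
    # 2D DCT-II: apply the transform along rows, then along columns.
    coeffs = dct(dct(a, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = coeffs[:8, :8]                  # keep only low-frequency components
    bits = (low >= low.mean()).flatten()  # binarize against the mean, formula (1)
    return sum(1 << i for i, b in enumerate(bits) if b)  # 64-bit hash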
To assess the reliability and discriminative power of image hashing, distance or similarity metrics are required to quantify the differences between two similar media objects: the Hamming distance, the normalized Hamming distance and the bit error rate. From these values it can be concluded whether two images are identical or completely different; that is, the two hashes should reflect the level of their "visual difference". An analysis of similarity metrics against the criteria of calculation speed, computational complexity, memory footprint, accuracy and sensitivity to small image changes showed the Hamming distance to be the most effective. The Hamming distance was chosen as the similarity metric for perceptual hashes because it is simple and fast to compute, is suitable for hashes of equal length and gives a direct measure of the number of differences. It is calculated as the number of positions in which the corresponding characters of two bit strings differ; d is computed using formula (2):
$$d(x, y) = \sum_{i=1}^{L} |x_i - y_i| \qquad (2)$$

where $x_i$ and $y_i$ are the bit values of the hashes x and y, and $L$ is the hash length.
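Formula (2) translates directly into code; the small sketch below also applies the duplicate threshold of a Hamming distance of at most three used later in the paper.

def hamming_distance(x: str, y: str) -> int:
    # Formula (2): count the positions where two equal-length bit strings differ.
    assert len(x) == len(y), "hashes must have the same length L"
    return sum(xi != yi for xi, yi in zip(x, y))

d = hamming_distance("10110100", "10110110")  # d == 1
print("duplicate" if d <= 3 else "different")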
The main stages of the proposed method include (Figure 1):
– pre-processing of images;
– database preparation;
– image hash generation;
– search for similar images based on the similarity score.
Figure 1 – The main stages of the method

Image preprocessing consists of the following operations, performed in sequence.
Resizing: bicubic interpolation reduces the amount of data to process and eliminates high frequencies and fine image detail; the image is reduced to a size of 32x32. Bicubic interpolation is calculated using formula (3):
$$P(x, y) = \sum_{i=-1}^{2} \sum_{j=-1}^{2} p_{ij}\, W(i - \Delta x)\, W(j - \Delta y) \qquad (3)$$

where $P(x, y)$ is the new pixel value at $(x, y)$, $p_{ij}$ are the values of the nearest pixels, $\Delta x$ and $\Delta y$ are the fractional parts of the coordinates, respectively, and $W$ is the cubic interpolation kernel.
Color normalization: histogram equalization is used to increase resistance to changes in brightness and color gamut. The histogram is equalized according to formula (4):
$$I'(x, y) = \operatorname{round}\left(\frac{L - 1}{W \cdot H} \sum_{k=0}^{I(x, y)} h(k)\right) \qquad (4)$$

where $I'(x, y)$ is the new pixel value at $(x, y)$, $L$ is the number of brightness levels (256), $W$ and $H$ are the width and height of the image, and $h(k)$ is the number of pixels with brightness $k$.
Blur: a Gaussian filter is applied to reduce noise, smooth texture and reduce image detail. The Gaussian filter is calculated using formula (5):
$$G(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (5)$$

where $G(x, y)$ is the value of the Gaussian function at point $(x, y)$ and $\sigma$ is the standard deviation.
Color reduction: the image is converted to grayscale by averaging the RGB channels, which reduces the amount of data. The average value of the channels is determined by formula (6):
$$I'(x, y) = \frac{R(x, y) + G(x, y) + B(x, y)}{3} \qquad (6)$$

where $I'(x, y)$ is the new pixel value and $R$, $G$ and $B$ are the values of the red, green and blue channels. After all the preprocessing operations have been performed, a small, blurred grayscale version of the image is obtained; a sketch of the pipeline is given below.
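The preprocessing chain can be sketched with Pillow and numpy as follows; the Gaussian blur radius is an illustrative choice, since the article does not specify it.

# Preprocessing sketch: bicubic resize (3), histogram equalization (4),
# Gaussian blur (5), grayscale by channel mean (6). The blur radius is assumed.
# Requires: pip install numpy Pillow
import numpy as np
from PIL import Image, ImageFilter, ImageOps

def preprocess(path, size=32, blur_radius=1.0):
    img = Image.open(path).convert("RGB")
    img = img.resize((size, size), Image.BICUBIC)            # formula (3)
    img = ImageOps.equalize(img)                             # formula (4)
    img = img.filter(ImageFilter.GaussianBlur(blur_radius))  # formula (5)
    rgb = np.asarray(img, dtype=np.float64)
    return rgb.mean(axis=2)  # formula (6): small, blurred grayscale image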
With the standard implementation of the algorithm, the search for similar images is carried out by calculating the Hamming distance between the hash of the target image and the hash of every image in the database, which is resource-intensive. For optimization, the study considered a threshold Hamming distance together with a segmentation factor, according to which the hash can be divided into 3 substrings, each stored in a separate table. The segmentation factor rests on the pigeonhole principle: if a hash is split into $m$ substrings and the total Hamming distance between two hashes does not exceed $d$, then at least one pair of corresponding substrings differs by no more than $\lfloor d/m \rfloor$ bits, where $m$ is the segmentation factor and $d$ is the threshold value of the Hamming distance. Hence, for similar images (with $d = 3$ and $m = 3$), at least one of the substrings will either match completely or differ by no more than one bit. Database preparation is performed according to the following algorithm:
– preprocessing of the resulting set of images;
– generation of a hash in the form of a bit string for each processed image;
– division of each bit string into substrings, giving a set of image substrings;
– creation of a table for each substring;
– creation of indexes and a Bloom filter for the substring tables.
Thus, the image search is performed according to the algorithm in Figure 2: the hash of the target image is divided into 3 substrings, and to identify similar images it is enough to generate the set of substring variants that differ by at most one bit and check their presence in the database (a sketch follows the figure).
Figure 2 – Algorithm for searching for similar images in the database
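A sketch of the search scheme in Figure 2 is shown below; a Python dict of sets stands in for the per-substring database tables (with their indexes and Bloom filter), and the candidates it returns would still be verified by a full Hamming-distance check.

# Sketch of the substring-based search from Figure 2 (the lookup structures
# are simplified stand-ins for the database tables described in the paper).
def split_hash(bits: str, parts: int = 3):
    # Split a bit string into `parts` roughly equal substrings.
    k, r = divmod(len(bits), parts)
    out, pos = [], 0
    for i in range(parts):
        step = k + (1 if i < r else 0)
        out.append(bits[pos:pos + step])
        pos += step
    return out

def within_one_bit(sub: str):
    # The substring itself plus every variant differing in exactly one bit.
    yield sub
    for i in range(len(sub)):
        yield sub[:i] + ("1" if sub[i] == "0" else "0") + sub[i + 1:]

def find_candidates(query_bits: str, tables: list) -> set:
    # tables[i] maps an i-th substring value to the set of matching image ids.
    ids = set()
    for i, sub in enumerate(split_hash(query_bits)):
        for variant in within_one_bit(sub):
            ids |= tables[i].get(variant, set())
    return ids  # candidate ids, to be confirmed by a full hash comparison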
To increase the search speed, B-tree indexes and a Bloom filter are used. The Bloom filter is a probabilistic data structure that can definitively determine that an element is absent from a dataset, thus reducing the number of substrings that have to be looked up in the tables; a minimal sketch is given below.
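The following minimal Bloom filter sketch illustrates this "definitely absent" property; the bit-array size m and the number of hash functions k are illustrative values, not parameters from the article.

# Minimal Bloom filter sketch: a False from might_contain() is definitive,
# so absent substrings can be rejected without querying the tables.
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1 << 20, k: int = 4):
        self.m, self.k = m, k          # bit-array size and number of hashes
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))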
To estimate the running time of the algorithm, a dataset of 1200 different images with an average size of 3.2 MB was used; each image was searched for 5 times, the search time was measured and the average was calculated. The comparison covered the standard search algorithm; an algorithm with hash-string division and server tuning; the same algorithm with indexes and a Bloom filter added; and the proposed method combining hash-string division, server tuning, indexes and the Bloom filter. The results showed that the developed algorithm with the hash divided into substrings, server tuning, indexes and a Bloom filter runs 3 times faster than the standard implementation. The resilience of the presented algorithm to image modifications was also evaluated. The results of testing the resilience of different hash algorithms to image modifications are shown in Table 4, where Ah is the average-based hash, Dh is the difference-based hash and Ph is the DCT-based hash.

Table 4 – Stability of the algorithm to image modifications
The comparison of hash algorithms demonstrated that the DCT-based algorithm provided the highest resistance to various types of modification, remaining stable in 15 out of 20 cases. The average-based hash was stable in only 6 cases, and the difference-based hash in 7 cases. These results indicate that the DCT-based hash algorithm is the most effective tool for duplicate image search due to its high resistance to modifications. Thus, the proposed approach to searching for identical and similar images is effective.
Conclusion
In the course of the study, the effectiveness of image search algorithms using perceptual hashing was analyzed. The evaluation of the algorithms' running time confirmed the high performance of the developed optimized method, which includes splitting the hash into substrings, tuning server parameters, and using indexes and a Bloom filter. This approach demonstrated a three-fold acceleration of the search process compared to the traditional algorithm based on calculating the Hamming distance. In addition, the developed algorithm showed good resistance to image changes, making it possible to reliably identify duplicates by comparing hashes at a Hamming distance of no more than three. It is important to note that the hypothesis of applying this method to similarity search has also been tested successfully on audio signals: the algorithm proved effective in detecting duplicate audio recordings and resistant to a variety of data modifications. Thus, the results of the study confirm the applicability of the proposed algorithms to tasks involving both image processing and audio signal analysis, which opens up prospects for wider use of perceptual hashing technologies in various fields.

References
1. Romashkova, O. N., & Kapterev, A. I. (2023). Analysis of threats and risks of information security in the university. Bulletin of the Moscow City Pedagogical University. Series: Computer science and informatization of education, 1(63), 37-47.
2. Turyanskaya, K. A. (2024). Methods, models and tools for detecting, identifying and classifying threats to the information security of objects of various types and classes. International Journal of Humanities and Natural Sciences, 2-2(89), 151-155.
3. Boyarchuk, D. A., Frolov, K. A., & Sklyaruk, V. L. (2022). Threats to information security of cloud technologies. Modern problems of radio electronics and telecommunications, 5, 207.
4. Mustafaev, A. G., Kobzarenko, D. N., & Buchaev, A. Ya. (2021). Digital transformation of the economy: threats to information security. Beneficium, 2(39), 21-26.
5. Emelianov, A. A. (2022). Ensuring information security when using virtualization tools based on hypervisors. Regional informatics (RI-2022). Anniversary XVIII St. Petersburg.
6. Barybina, A. Z. (2022). Modeling information security threats using a scenario approach. Research in Natural Sciences and Humanities, 42(4), 35-44.
7. Grin', V. S. (2021). Analysis of information security threats and information leakage channels. StudNet, 4(7), 1616-1620.
8. Bekmukhan, A., & Usatova, O. (2024). Security optimization in multi-server web systems: effective risk management. Bulletin of KazATK, 133(4), 296-307.
9. Shesterkina, L. P., Krasavina, A. V., & Khakimova, E. M. (2021). Fact-checking and verification: a tutorial. Chelyabinsk: Publishing center of SUSU.
10. Alpatov, A. N. (2023). Features of detecting deep genesis anomalies in a video stream. Systemic transformation is the basis for sustainable innovative development.
11. Pavlov, E. M., Ryzhov, A. V., Balanev, K. S., & Krepkov, I. M. (2023). Application of neural networks for pattern recognition. Bulletin of Science and Practice, 12, 52-58. doi:10.33619/2414-2948/97/06
12. Lazarev, E. A. (2023). Application of computer vision and image processing using neural networks. Bulletin of Science, 12-1(69), 412-415.
13. Pilkevich, S. V., et al. (2023). Demonstrator of a software and hardware tool for automated recognition of malicious multimedia objects on the Internet (research results). Bulletin of the Russian New University. Series: Complex Systems: Models, Analysis and Management, 2, 157-175.
14. Esipov, D. A., et al. (2023). Attacks based on malicious perturbations on image processing systems and methods of protection against them. Scientific and Technical Bulletin of Information Technologies, Mechanics and Optics, 23(4), 720-733.
15. Panchekhin, N. I., Desyatov, A. G., & Sidorkin, A. D. (2023). Malware recognition system based on representation of a binary file as an image using machine learning. Polytechnic Youth Journal, 04, 1-10. doi:10.18698/2541-8009-2023-4-886
16. Basarab, M. A., & Konnova, N. S. (2017). Intelligent technologies based on artificial neural networks. Moscow: Bauman Moscow State Technical University.
17. Random forest for malware classification. (2023). Retrieved from https://arxiv.org/abs/1609.07770
18. Xu, L., Zhang, D., Jayasena, N., & Cavazos, J. (2018). HADM: Hybrid analysis for detection of malware. Proceedings of SAI Intelligent Systems Conference. Retrieved from http://doi.org/10.1007/978-3-319-56991-8_51
19. Towards building an intelligent anti-malware system: A deep learning approach using support vector machine (SVM) for malware classification. (2023). Retrieved from https://arxiv.org/abs/1801.00318
20. Machine learning-based solution for the detection of malicious JPEG images. (2020). Retrieved from https://ieeexplore.ieee.org/document/8967109/metrics#metrics
21. Petiforova, D. E., & Shtepa, K. A. (2021). Analysis of the use of perceptual hashing in the process of image identification. Information and telecommunication technologies and mathematical modeling of high-tech systems, 274-277.
22. Myagkikh, P. A., & Yaduta, A. Z. (2023). Comparison of images using perceptual hashes. Fundamental and applied research in science and education, 72-76.
23. Valishin, A. A., Zaprivoda, A. V., & Tsukhlo, S. S. (2024). Modeling and comparative analysis of the efficiency of perceptual hash functions for searching for segmented images. Mathematical Modeling and Numerical Methods, 2(42), 46-67.
24. Nikiforov, M. B., & Tarasova, V. Yu. (2022). Algorithm for detecting visual similarity of images. Digital Signal Processing, 3, 53-57.
25. Detkov, A. A., et al. (2024). Comparative analysis of vector distance metrics of raster images. Bulletin of Cybernetics, 23(3), 22-30.
26. Trefilov, P. A. (2020). Storage and search of similar images in temporal databases using perceptual hash strings. Proceedings of the International Symposium "Reliability and Quality", 1, 192-196.