Software systems and computational methods
Reference:

Emotion Recognition by Audio Signals as one of the Ways to Combat Phone Fraud

Nikitin Petr Vladimirovich

ORCID: 0000-0001-8866-5610

PhD in Pedagogy

Associate Professor, Department of Data Analysis and Machine Learning

125993, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

pvnikitin@fa.ru
Osipov Aleksei Viktorovich

PhD in Physics and Mathematics

Associate Professor, Department of Information Security, Federal State Educational Budgetary Institution of Higher Education "Financial University under the Government of the Russian Federation"

125167, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

avosipov@fa.ru
Pleshakova Ekaterina Sergeevna

PhD in Technical Science

Associate Professor, Department of Information Security, Federal State Educational Budgetary Institution of Higher Education "Financial University under the Government of the Russian Federation"

125167, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

espleshakova@fa.ru
Korchagin Sergei Alekseevich

PhD in Physics and Mathematics

Deputy Dean of the Faculty of Information Technology, Federal State Educational Budgetary Institution of Higher Education "Financial University under the Government of the Russian Federation"

125167, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

sakorchagin@fa.ru
Gorokhova Rimma Ivanovna

PhD in Pedagogy

Associate Professor, Department of Data Analysis and Machine Learning

125167, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

rigorokhova@fa.ru
Gataullin Sergei Timurovich

PhD in Economics

Deputy Dean of the Faculty of Information Technology, Federal State Educational Budgetary Institution of Higher Education "Financial University under the Government of the Russian Federation"

125167, Russia, Moscow, 4th veshnyakovsky str., 4, office building 2

stgataullin@fa.ru

DOI:

10.7256/2454-0714.2022.3.38674

EDN:

ZBVOCN

Received:

22-08-2022


Published:

29-08-2022


Abstract: The relevance of the study stems from the current state of telephone fraud. According to research by Kaspersky Lab, 71% of users encountered unwanted spam calls in the spring of 2022. The subject of the research is machine learning and deep learning technologies for determining emotions from the timbre of the voice. The authors consider in detail the creation of a labeled dataset; the conversion of WAV audio into a numerical form convenient for fast processing; machine learning methods for multiclass classification; and the construction and optimization of a neural network architecture for determining emotions in real time. A particular contribution of the study is a fast method of converting audio into numerical coefficients, which significantly increased the speed of data processing with practically no loss of informativeness. As a result, the machine learning models were trained quickly and efficiently. A convolutional neural network architecture was designed that achieved accuracy of up to 98%. The model turned out to be lightweight and was taken as the basis for determining emotions in real time. The results of the model's real-time operation were comparable with those obtained during training. The developed algorithms can be deployed by mobile operators or banks in the fight against telephone fraud. The article was prepared as part of the state assignment of the Government of the Russian Federation to the Financial University for 2022 on the topic "Models and methods of text recognition in anti-telephone fraud systems" (VTK-GZ-PI-30-2022).


Keywords:

fraud, phone fraud, artificial intelligence, machine learning, neural network training, classification, convolutional neural networks, mel-cepstral coefficients, information security, emotions

This article is automatically translated.

 


Introduction. The development of mobile communications is directly tied to the development of information technology. The opportunities that means of communication give a person help improve the quality of life. The use of mobile communication expands every year and has become an integral part of almost everyone's life. This has led to a rise in illegal acts committed using means of communication, and telephone fraud is especially widespread. Article 159 of the Criminal Code of the Russian Federation defines fraud as "theft of someone else's property or acquisition of the right to someone else's property by deception or abuse of trust." The type of fraud in question thus aims to use information and telecommunication technologies to obtain citizens' personal funds or bank card data. Here the increasingly widespread electronic payment facilities become a way to obtain other people's money. According to Article 159.3 of the Criminal Code of the Russian Federation, "electronic means of payment is a means and (or) a method that allows the client of a money transfer operator to make, certify and transmit orders for the purpose of transferring funds within the framework of the forms of non-cash payments used using information and communication technologies, electronic media, including payment cards, and also other technical devices". It follows that fraudsters need only obtain confidential information from the victim, after which the crime becomes irreversible. That is why the prevention of telephone fraud receives so much attention in various studies.

Modern research pays special attention to the psychological portrait of people exposed to telephone fraud. N. V. Meshkova, V. T. Kudryavtsev, and S. N. Enikolopov [1] examine potential victims of telephone fraudsters from the perspective of their psychological portrait. Telephone fraud begins with a call and then develops depending on the victim's behavior: how far the victim can be manipulated and have rules of behavior imposed on them. The authors identified the traits that scammers exploit: "a high level of claims, dissatisfaction with their social status, an inflated standard of consumption and lack of communication, excessive credulity, not critical, superstitious behavior" [1].

A. A. Pudovkin's dissertation analyzes the personal characteristics of fraudsters and of fraud victims. The modern telephone fraudster is characterized as an "intellectual criminal" who is highly educated, competent in professional fields related to telecommunications technologies and psychology, and has acting skills, the ability to inspire confidence in strangers, and a readiness for risky actions. The study showed the importance of examining the traits of a person who becomes a victim of crime and is drawn into criminals' complex schemes. According to A. A. Pudovkin, the most susceptible are, on the one hand, people inclined to gambling and adventurism and, on the other, people who lack critical thinking and are therefore gullible and naive. Another characteristic is a thirst for easy money and pathological greed (Pudovkin, A. A. Criminal law and criminological features of fraud: Cand. Sci. (Law) dissertation: 12.00.08. St. Petersburg, 2007. 144 p. RGB OD, 61:07-12/949).

The traits of the victim personality most vulnerable to violence are also considered in the work of O. A. Klachkova. The study defines a system of psychological properties that characterize such personalities and describes their qualities. The author points first of all to a lack of attention and the desire to be noticed and to surprise others. Carelessness holds a special place among the qualities of people susceptible to fraud [2].

The analysis of these studies shows that telephone fraud has two opposite sides: scammers with their own characteristic features and the individuals they target. For a fraud to succeed, both components must coincide; that is, there must be a person ready to believe a specific fraudulent story and to perform the action imposed on them, thereby becoming a victim. I. G. Moiseeva considered telephone fraud from the perspective of a psychological analysis of fraudulent actions carried out through electronic means of payment. The author identifies the types of fraud committed using various information and telecommunication technologies. New technologies and their active adoption offer enormous scope for use in crime, from mobile communications to electronic payments and plastic bank cards [3].

A technology that identifies a fraudster by recognizing emotions, in accordance with the general model of fraudulent actions, would help prevent the victim from carrying out the actions imposed on them.

A. A. Romanov and V. A. Mashlyakevich examined how mobile communication tools are used to commit the most unthinkable crimes aimed at obtaining funds during purchase and sale transactions. The authors defined the algorithms by which frauds are committed using mobile communications [4]. Since fraudsters' actions under different schemes vary considerably and depend to some extent on the specific victim and situation, the speaker's emotions can help determine whether the event taking place is a stage in the commission of a crime.

Phone fraud concerns not only users but also banks. It evolves continually along with information technology and causes more and more problems in the financial sector. E. V. Barasheva and D. A. Stepanenko conducted a historical study of crimes involving information and communication technologies in the banking sector [5]. The study showed that fraud has its own history of development and is becoming more sophisticated. A. A. Ivanova and V. V. Mishchenko also consider problems directly related to financial activities and methods of combating fraud in the financial sphere. In their view, reducing financial crime requires, first of all, raising citizens' financial literacy, as well as establishing a regulatory framework for combating it [6]. I. V. Sukhorukova pays special attention to problems directly related to the financial activities of Sberbank. The study found that the bank is under constant cyber attack and highlighted the most commonly used methods of fraud and theft from bank cards [7].

Emotion prediction is a subject of research in a variety of fields. The prediction of emotion, age, and origin from vocal data was considered by A. Anuchitanukul and L. Specia [8]. Their adversarial multi-task approach, Burst2Vec, uses pre-trained speech representations to capture acoustic information from raw signals and removes model bias through adversarial training. The authors report a 30% improvement in model performance. The approach shows the potential of multi-task learning (MTL) for recognizing various characteristics of a caller, including emotions, in the fight against telephone fraud.

The research of A. I. Ivanov and I. A. Kubasov [9] addresses the need to prevent telephone fraud by identifying the voice characteristics of fraudsters. The study is based on creating a database of voices and moving to automated procedures for detecting telephone fraudsters through analysis of their voice portraits. The authors propose using artificial intelligence and build the following algorithm for this purpose: automatic segmentation of a voice message into frames; automated formation of evidence by documenting the voice, which is done by keywords and makes it possible to determine the fraudster's area of activity; analysis of the tonality of sounds; and the use of various speech recognition methods [9]. The proposed method can serve as a basis for training artificial neural networks to determine the key elements of a fraudster's portrait from the voice and its timbre. The study thus lays the groundwork for using artificial intelligence to recognize fraudsters.

M. A. Maslova and V. A. Kostikov consider the use of a voice identification system as additional user protection. The article highlights the advantages and disadvantages of using voice biometrics to recognize calls from intruders. The authors propose building a voice identification system based on an acoustic voice model, a linguistic language model, and a semantic model [10].

Thus, research on combating telephone fraud is ongoing in several directions. Emotion recognition based on the received voice signal can serve as one way to combat intruders. Wider use of modern methods makes it possible to stop criminal actions at the very beginning. Among such methods, predictive analytics and machine learning are the most relevant.

Data and methods. Let us consider algorithms for recognizing emotions from the timbre of the voice by means of machine learning. Evidence of the effectiveness of machine learning and deep learning in recognition tasks of this kind was considered in studies [11-16].

The first step is to find suitable datasets. Few datasets of voice recordings labeled with emotions are currently publicly available. The authors found three large datasets that could be used in the study, but this research is limited to two of them.

The first is the TESS dataset (https://www.kaggle.com/ejlok1/toronto-emotional-speech-set-tess). It contains 2800 WAV audio tracks. Note that this dataset is voiced only by female voices and is labeled with 7 emotions: anger, disgust, fear, happiness, sadness, surprise, neutral emotion.

The second is the SAVEE dataset (https://www.kaggle.com/barelydedicated/savee-database). It is voiced by male voices and labeled with the same 7 emotions. It contains 3360 WAV audio tracks.

Since both men and women may be among the scammers (and among the victims), the authors combined the TESS and SAVEE datasets into one to increase the efficiency of the system being developed. The final dataset thus consists of 6160 WAV audio tracks, voiced by both male and female voices and labeled with 7 emotions: anger, disgust, fear, happiness, sadness, surprise, neutral emotion. Figure 1 shows the distribution of classes in the final dataset.

Figure 1. The distribution of classes in the dataset

The class sizes are approximately equal; the neutral emotion slightly predominates, but the imbalance is within acceptable limits.
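A minimal sketch of how the two corpora might be merged and the class balance checked. The file-naming conventions assumed here (TESS names ending in an emotion word such as `_angry`, SAVEE names using letter codes such as `sa03`) and all helper names are illustrative assumptions, not taken from the article.

```python
from collections import Counter
from pathlib import Path

# Assumed TESS naming: <speaker>_<word>_<emotion>.wav, e.g. "OAF_back_angry.wav"
TESS_SUFFIXES = {
    "angry": "anger", "disgust": "disgust", "fear": "fear",
    "happy": "happiness", "sad": "sadness", "ps": "surprise",
    "neutral": "neutral",
}

# Assumed SAVEE naming: <speaker>_<code><number>.wav, e.g. "DC_sa03.wav"
SAVEE_CODES = {
    "a": "anger", "d": "disgust", "f": "fear", "h": "happiness",
    "sa": "sadness", "su": "surprise", "n": "neutral",
}


def label_from_filename(name):
    """Map a TESS- or SAVEE-style file name to one of the 7 emotion labels."""
    tail = Path(name).stem.split("_")[-1].lower()
    if tail in TESS_SUFFIXES:
        return TESS_SUFFIXES[tail]
    code = tail.rstrip("0123456789")  # strip the utterance number
    if code in SAVEE_CODES:
        return SAVEE_CODES[code]
    raise ValueError(f"Unrecognised file name: {name}")


def class_distribution(filenames):
    """Count how many files fall into each emotion class."""
    return Counter(label_from_filename(n) for n in filenames)


if __name__ == "__main__":
    merged = ["OAF_back_angry.wav", "YAF_dog_ps.wav", "DC_sa03.wav", "JE_n12.wav"]
    for emotion, count in sorted(class_distribution(merged).items()):
        print(emotion, count)
```

In practice the file list would come from walking the two dataset directories; counting labels before training is a cheap way to confirm that no class dominates.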

The main stage in solving the subsequent classification problem is converting the audio files into a numeric format. The transformation should be fast and, at the same time, should not lose informativeness. The authors concluded that the best option is to convert audio files to mel-frequency cepstral coefficients (MFCC).

The mathematical transformation of sound into MFCC proceeds as follows (using the word "one" as an example):

1. Apply the Fourier transform to obtain the spectrum of the audio signal (Fig. 2).

Figure 2. Time-domain representation of the word "one" and its spectrum after the Fourier transform.

 

2. Using windows (weight functions) spaced uniformly along the mel axis, project the spectrum obtained in the previous step onto the mel scale and map the resulting graph back to the frequency scale (Fig. 3).

Figure 3. Projection of the window functions onto the frequency scale. The windows are concentrated at low frequencies, which gives higher resolution where it is most needed to tell sounds apart and where the most information must be extracted from the audio signal.

 

 

3. Find the amount of signal energy that falls in each window by multiplying the signal spectrum vector by the window function (formula 1).

4. Square the results obtained, then take the logarithm and apply the discrete cosine transform (formula 2).
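The last two steps can be written out as follows. This is a sketch in common MFCC notation, not necessarily the article's own formulas 1 and 2: the symbols $X[k]$ (spectrum of an $N$-point Fourier transform), $H_m[k]$ (the $m$-th window), $M$ (number of windows), $S_m$ (energy in window $m$) and $c_n$ (the $n$-th coefficient) are assumptions introduced here. The mel scale is taken in its widely used form.

```latex
\operatorname{mel}(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)

S_m = \sum_{k=0}^{N/2} \lvert X[k] \rvert \, H_m[k], \qquad m = 1, \dots, M

c_n = \sum_{m=1}^{M} \ln\!\left(S_m^{2}\right)
      \cos\!\left(\frac{\pi n \left(m - \tfrac{1}{2}\right)}{M}\right),
      \qquad n = 0, 1, \dots, M - 1
```

Many implementations square $\lvert X[k] \rvert$ before applying the windows rather than after; the order shown here follows the steps as listed in the text.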

 

 

As a result, we obtain the MFCC coefficients we need (Fig. 4).

Figure 4. MFCC coefficients

All transformations were performed using the Python programming language.
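The transformation chain described above (Fourier transform, mel-spaced windows, energy per window, squaring, logarithm, discrete cosine transform) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the frame length, the number of windows and coefficients, the Hamming taper, and the averaging over frames are all choices made here for the example.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular windows whose centres are spaced uniformly on the mel axis."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCC of a single frame, following the four steps in the text."""
    n_fft = len(frame)
    tapered = frame * np.hamming(n_fft)    # assumption: taper not in the article
    spectrum = np.abs(np.fft.rfft(tapered))                  # step 1
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)    # step 2
    energies = fbank @ spectrum                              # step 3
    log_energies = np.log(energies ** 2 + 1e-10)             # step 4: square, log
    m = np.arange(n_filters)
    n = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * n * (m + 0.5) / n_filters)    # step 4: DCT-II
    return dct_basis @ log_energies


def mfcc_features(signal, sample_rate, frame_len=512, hop=256):
    """Average the per-frame MFCCs into one short vector per recording."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = [mfcc_frame(signal[s:s + frame_len], sample_rate) for s in starts]
    return np.mean(frames, axis=0)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)   # one second of a synthetic 440 Hz tone
    print(mfcc_features(tone, sr).shape)
```

A one-second recording at 16 kHz holds 16000 samples, while the feature vector holds 13 numbers, which is the kind of reduction the text describes. Libraries such as librosa provide ready-made MFCC routines; the from-scratch version is used here only to make each step visible.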

 

Figure 5 shows the transformation of the dataset into mel-cepstral coefficients.

 

Figure 5. Conversion of audio data

As a result, we obtain a small set of values that can stand in for the thousands of speech-signal samples or spectrogram values otherwise needed to describe the speech in full.

 

This significantly increases the speed of data processing with practically no loss of informativeness. The resulting datasets are available at the following link: https://drive.google.com/drive/folders/1WQflU_1ZYsO4EuCJx9SRB5lUGKDtgEOL?usp=sharing.

In the second stage, we use machine learning and deep learning methods to solve the classification problem, which in our case is the task of determining emotions.

The following machine learning methods were used to solve the classification problem: logistic regression, random forest, and gradient boosting [17-19]. ROC-AUC was taken as the quality metric.
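For a seven-class problem, ROC-AUC has to be extended beyond the binary case. The article does not say how this was done; the sketch below assumes a one-vs-rest scheme with a macro average, which is one common choice. The function names are hypothetical, and the AUC is computed directly from its rank interpretation rather than with a library call.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(prob_rows, y_true, n_classes):
    """Average the one-vs-rest AUC over all classes.

    prob_rows[i][c] is the predicted probability of class c for sample i.
    """
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in prob_rows]
        labels = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes


if __name__ == "__main__":
    # Toy predictions for three samples and three classes.
    probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
    truth = [0, 1, 2]
    print(macro_ovr_auc(probs, truth, n_classes=3))
```

Looking at the per-class values before averaging shows which emotions a model separates poorly, which a single averaged number would hide.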

The random forest model performed best. The results of its training are shown in Figure 6.

Figure 6. The quality of the model's training

Figure 6 shows that the model recognizes emotions 4 and 5 worst of all. These emotions can be very important in cases of telephone fraud.

 

 

Therefore, the results of the model need to be improved. Neural networks were considered for this purpose.

The authors tested several neural network architectures: multilayer neural networks, convolutional neural networks (CNN), and recurrent neural networks (LSTM) [20-22]. The best results were achieved with a CNN whose architecture is shown in Figure 7.

 

Figure 7. CNN architecture for recognizing emotions by voice timbre

The neural network achieved the best results at a learning rate of lr = 0.01 after 400 training epochs.

 

The accuracy reached 98% for each of the emotions (Fig. 8).

Figure 8. Results of neural network training

To use this model, a script was implemented that receives an audio stream from an input device, calculates MFCC for fixed-length fragments of this stream, and predicts one of the seven emotions using the developed model.

 

Note that the model is lightweight enough to perform classification in real time. The script that applies the resulting model to a live audio stream from the input device uses the PyAudio library. The real-time results were comparable with those obtained during training.
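The streaming logic of such a script can be sketched independently of the audio library. In the example below the stream is any iterable of sample blocks, and `extract_features` and `predict` are placeholders for the MFCC computation and the trained model; these names, and the decision to drop an incomplete trailing fragment, are assumptions made for illustration rather than details from the article.

```python
def fixed_fragments(sample_blocks, fragment_len):
    """Regroup an incoming stream of sample blocks into fixed-length fragments.

    Blocks may have any length; samples left over at the end of the stream
    that do not fill a whole fragment are dropped.
    """
    buffer = []
    for block in sample_blocks:
        buffer.extend(block)
        while len(buffer) >= fragment_len:
            yield buffer[:fragment_len]
            buffer = buffer[fragment_len:]


def classify_stream(sample_blocks, fragment_len, extract_features, predict):
    """Yield one predicted label per fixed-length fragment of the stream."""
    for fragment in fixed_fragments(sample_blocks, fragment_len):
        yield predict(extract_features(fragment))


if __name__ == "__main__":
    # Stand-ins: a fake stream, a trivial feature, and a threshold "model".
    stream = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    labels = classify_stream(
        stream,
        fragment_len=4,
        extract_features=sum,
        predict=lambda s: "high" if s > 20 else "low",
    )
    print(list(labels))
```

With a real input device, the blocks would come from repeated reads of the audio stream, and the fragment length would be chosen to match the duration of the clips the model was trained on.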

Nevertheless, it should be noted that, due to technical limitations and the lack of real data, the system was not tested under realistic conditions (for example, in a telephone conversation). Moreover, the model was trained on recordings of English speech, which may give unexpected results with other languages. These remarks do not, however, diminish the quality of the study.

Conclusions. The authors have developed and implemented a neural network model for determining emotions from the timbre of the voice in real time. The results showed that machine learning and deep learning technologies can be used as a way to combat telephone fraud. Mobile operators and banks hold large volumes of data containing fraudulent conversations and conversations of victims. Labeled datasets can be built from this data and used with the described technologies to fight fraud. It is especially interesting to consider this study in conjunction with technologies already used to detect fraudulent conversations, for example, by stop words. Such a multimodal technology would be more effective and would help prevent fraudulent actions to a greater extent.

 

   

References
1. Meshkova N.V., Kudryavtsev V.T., Enikolopov S.N. On the psychological portrait of victims of telephone fraud // Bulletin of the Moscow University. Series 14. Psychology. 2022. No. 1. pp. 138-157. doi: 10.11621/vsp.2022.01.06.
2. Klachkova O. A. Psychological features of victim personality // Izvestiya RSPU named after A. I. Herzen. 2008. No.58. URL: https://cyberleninka.ru/article/n/psihologicheskie-osobennosti-viktimnoy-lichnosti (date of address: 02.08.2022).
3. Moiseeva I.G. Psychological aspects of countering telephone fraud // Kaluga Economic Bulletin. 2022. No. 1. pp. 70-74.
4. Romanov A.A., Mashlyakevich V.A. About modern methods of fraud committed using mobile communication means // Eurasian Legal Journal. 2021. No. 10 (161). pp. 254-255.
5. Barasheva E. V., Stepanenko D. A. Historical and legal aspects of cybercrime in the banking sector // Humanities, socio-economic and social sciences. 2022. No. 6 pp. 75-77. – DOI 10.23672/y5463-0677-0213-l.
6. Ivanova A. A., Mishchenko V. V. Actual problems of fraudulent activity in the financial sphere // Internauka. 2022. No. 18-5(241). pp. 52-53.
7. Sukhorukova I. V. Cyberbullying as the main problem of carrying out operations with plastic cards in Sberbank PJSC // Spirit Time. 2021. No. 11(47). pp. 14-16.
8. Anuchitanukul A., Specia L. 2022. Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts. [Submitted on 24 Jun 2022]. doi: 10.48550/arXiv.2206.12469.
9. Ivanov A. I., Kubasov I. A. The prospect of strengthening the policy of accounting for voice features of biometric data of telephone fraudsters // Bulletin of the Voronezh Institute of the Federal Penitentiary Service of Russia. 2021. No. 1. pp. 89-96.
10. Maslova M. A., Kostikov V. A. Using the voice identification system as an additional user protection // Modern problems of radio electronics and telecommunications. 2021. No. 4. p. 223.
11. Vanneste, P., Oramas, J., Verelst, T., Tuytelaars, T., Raes, A., Depaepe, F., and Van den Noortgate, W. 2021. Computer vision and human behaviour, emotion and cognition detection: a use case on student engagement. Mathematics 9: 287. DOI: 10.3390/math9030287.
12. Zhang, H., Feng, L., Li, N., Jin, Z., and Cao, L. 2020. Video-based stress detection using deep learning. Sensors 20: 5552. DOI: 10.3390/s20195552.
13. Dogadina, E.P., Smirnov, M.V., Osipov, A.V., and Suvorov, S.V. 2021. Evaluation of the forms of education of high school students using a hybrid model based on various optimization methods and a neural network. Informatics 8(3): 46.
14. Heo, T. S., Kim, Y. S., Choi, J. M., Jeong, Y. S., Seo, S. Y., Lee, J. H., Kim, C. 2020. Prediction of stroke outcome using natural language processing-based machine learning of radiology report of brain MRI. Journal of personalized medicine, 10(4), 286
15. Prasetio, B.H., Tamura, H., and Tanno, K. 2018. Facial stress recognition based on multi-histogram features and convolutional neural network. IEEE Int. Conference on Systems, Man and Cybernetics (SMC): 881-887. DOI: 10.1109/SMC.2018.00157
16. Lischer S., Safi N., Dickson C. Remote learning and students' mental health during the Covid-19 pandemic: A mixed-method enquiry. PROSPECTS. 2021. p. 1-11. (In Eng.). DOI: 10.1007/s11125-020-09530-w
17. Pranckevičius T., Marcinkevičius V. Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. 2017. Baltic Journal of Modern Computing. Vol. 5, No. 2. p. 221.
18. Shah, K., Patel, H., Sanghvi, D., & Shah, M. (2020). A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augmented Human Research. 5(1). pp. 1-16.
19. Tatarintsev, M.; Korchagin, S.; Nikitin, P.; Gorokhova, R.; Bystrenina, I.; Serdechnyy, D. 2021. Analysis of the forecast price as a factor of sustainable development of agriculture. Agronomy, 11, 1235. https://doi.org/10.3390/agronomy11061235.
20. Durstewitz D., Koppe G., Meyer-Lindenberg A. Deep neural networks in psychiatry. Molecular Psychiatry. 2019; 24:1583-1598. (In Eng.). DOI: 10.1038/s41380-019-0365-9
21. Janssen R.J., Mourão-Miranda J., Schnack H.G. 2018. Making Individual Prognoses in Psychiatry Using Neuroimaging and Machine Learning. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging. 3(9):798-808. DOI: 10.1016/j.bpsc.2018.04.004
22. Erickson B.J., Korfiatis P., Akkus Z., Kline T.L. 2017. Machine Learning for Medical Imaging. RadioGraphics. 37(2):505-515. DOI: 10.1148/rg.2017160130

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The reviewed article studies the possibility of combating telephone fraud by developing machine learning models that recognize emotions from audio signals. The research methodology is based on a review of the literature on the topic, the use of predictive analytics methods, and algorithms for recognizing emotions by voice timbre using machine learning tools. The authors rightly attribute the relevance of the study to the fact that emotion recognition based on the received voice signal can serve as a way to combat intruders and can stop criminal actions at the very beginning of the offense. The scientific novelty of the research lies in the development and implementation of a neural network model for determining emotions by voice timbre in real time, which can be used as a way to combat telephone fraud. The article is structured into the following sections: Introduction, Data and methods, Conclusions, Bibliography. The introduction substantiates the relevance of the work and provides an overview of recent publications related in one way or another to the problem addressed in the article. It is followed by a description of publicly available datasets of voice recordings labeled with emotions that are suitable for use in the research. The final dataset used for machine learning consists of 6160 audio tracks voiced by both male and female voices and labeled with 7 emotions: anger, disgust, fear, happiness, sadness, surprise, neutral emotion. To convert the audio files to a numeric format, they were converted to spectral coefficients; the audio signal spectrum was obtained with the Fourier transform, implemented in a Python program. Widely known machine learning methods were used to solve the classification problem: logistic regression, random forest, gradient boosting, multilayer neural networks, convolutional neural networks, and recurrent neural networks.
The ROC curve is used as a measure of the classifier's ability to distinguish classes. The authors draw conclusions about the methods that provided the best training for the model, noting that the training results gave 98% recognition accuracy for each of the emotions. The bibliography includes 22 sources, scientific publications and Internet resources in Russian and English, which are cited in the text, indicating engagement with the work of other researchers. The authors rightly point out not only the strengths of the research but also its shortcomings: the lack of testing of the model in a real telephone conversation and training on recordings of English speech. Such self-criticism ensures that the reader will not be misled about the practical possibilities of the proposed development. The topic of the article is relevant, the material corresponds to the scope of the journal "Software Systems and Computational Methods" and may arouse interest among readers, and the article is recommended for publication.