Software systems and computational methods

Reference: Trofimova V.S., Karshieva P.K., Rakhmanenko I.A. Fine-tuning neural networks for the features of a dataset in the speaker verification task using transfer learning // Software systems and computational methods. 2024. № 3. P. 26-36. DOI: 10.7256/2454-0714.2024.3.71630 EDN: XHZCTS URL: https://en.nbpublish.com/library_read_article.php?id=71630
Fine-tuning neural networks for the features of a dataset in the speaker verification task using transfer learning
DOI: 10.7256/2454-0714.2024.3.71630
EDN: XHZCTS
Received: 30-08-2024
Published: 06-09-2024

Abstract: The subject of this study is neural networks trained using transfer learning methods tailored to the specific characteristics of the dataset. The object of the study is machine learning methods used for solving speaker verification tasks. The aim of the research is to improve the efficiency of neural networks in the task of speaker verification. In this work, three datasets in different languages were prepared for the fine-tuning process: English, Russian, and Chinese. Additionally, an experimental study was conducted using modern pre-trained models ResNetSE34L and ResNetSE34V2, aimed at enhancing the efficiency of neural networks in text-independent speaker verification. The research methodology includes assessing the effectiveness of fine-tuning neural networks to the characteristics of the dataset in the speaker verification task, based on the equal error rate (EER) of Type I and Type II errors. A series of experiments were also conducted, during which parameters were varied and layer freezing techniques were applied. The maximum reduction in EER when using the English dataset was achieved by adjusting the number of epochs and the learning rate, reducing the error by 50%. Similar parameter adjustments with the Russian dataset reduced the error by 63.64%. When fine-tuning with the Chinese dataset, the lowest error rate was achieved in the experiment that involved freezing the fully connected layer, modifying the learning rate, and changing the optimizer, resulting in a 16.04% error reduction. The obtained results can be used in the design and development of speaker verification systems and for educational purposes. It was also concluded that transfer learning is effective for fine-tuning neural networks to the specific characteristics of a dataset, as a significant reduction in EER was achieved in the majority of experiments, indicating improved speaker recognition accuracy.

Keywords: transfer learning, fine-tuning, dataset, speaker verification, speaker recognition, feature extraction, speech processing, neural networks, deep learning, pattern recognition

1 Introduction
Technologies that verify users by various biometric parameters are now in wide use. One such parameter is the voice [1]. Speaker verification is a form of speaker recognition in which a decision is made about whether a voice sample belongs to the individual whose identity has been claimed [2]. This procedure provides users with a high level of security and convenience when accessing information. Voice biometrics is becoming increasingly relevant in many fields, including banking, healthcare, security systems, and everyday life [3]. Fine-tuning neural networks to the features of a particular dataset plays a key role in improving the accuracy of voice-based speaker verification systems [4]. Each dataset has unique attributes; during fine-tuning, pre-trained models adapt to these new features, which significantly improves the efficiency of the system. The effectiveness of the neural networks performing voice-based speaker verification directly affects the security and usability of various services: an accurate system prevents unauthorized access to confidential information and protects user accounts from intruders.
2 Experimental procedure
2.1 Preliminary data
The pre-trained English-language models ResNetSE34L and ResNetSE34V2 were selected for fine-tuning. Both models are based on the ResNet34 architecture and are used to identify a speaker from an arbitrary phrase and then verify that speaker. The main difference between ResNetSE34L and ResNetSE34V2 lies in how they extract and adapt audio features; in addition, ResNetSE34V2 offers more efficient data processing, which yields higher training accuracy.

The VoxCeleb2 speech corpus was used as the training dataset for the ResNetSE34L model [5], producing the weights "baseline_lite_ap.model". This corpus is one of the largest datasets used to evaluate automatic speech recognition systems. The TIMIT speech corpus [6] was also used to train the ResNetSE34L model, producing the weights "model000000100.model".

Three speech corpora were then prepared for fine-tuning: one each in English, Russian, and Chinese. The first was the TIMIT speech corpus [6], containing audio recordings in English. It was divided into a training set of 4,620 recordings and a test set of 1,680 recordings. This dataset was used for fine-tuning with the pre-trained weights "baseline_lite_ap.model" and "baseline_v2_ap.model". It was not used with "model000000100.model", since those weights had been obtained by training the ResNetSE34L model on the TIMIT corpus itself.

The Russian speech corpus includes audio recordings made by 50 native speakers of Russian, with 50 recordings per speaker. It was divided into a training set containing recordings of 30 speakers and a test set consisting of recordings of the other 20 speakers [7]. The study also used a Chinese speech corpus called HI-MIA, whose data were collected in home conditions using microphone arrays and a Hi-Fi microphone. This dataset was divided into a training set with recordings of 42 speakers and a test set with recordings of 40 speakers [8]. The Russian and Chinese corpora were used in experiments with all of the pre-trained weights listed above, since none of those weights had previously been trained on them. These datasets were selected so that fine-tuning would improve the network's effectiveness: the variety of dialects, voices, and languages in the recordings gives broader coverage of variability in the pronunciation of words and phrases.

For the verification itself, a test scenario is used that includes authentication attempts by a legitimate user and by an attacker (a voice mismatch). In the text file with the test scenario, a legitimate attempt is marked with the label 1 and an attack on the system with the label 0. Since the goal was to improve the efficiency of the neural network in voice verification, the main attention was paid to reducing the equal error rate (EER) of Type I and Type II errors [9]: more accurate speaker recognition is essential for high system reliability. To reduce both error types, the transfer learning method was applied using pre-trained weights of a model that verifies a speaker from an arbitrary phrase.
The method used is thus a fine-tuning technique: a model pre-trained on a large dataset is additionally trained on a narrower dataset to perform the speaker verification task.
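As an illustration of the trial format described above, the sketch below scores each labeled pair of recordings with the cosine similarity of their speaker embeddings. This is a minimal sketch, not the authors' code: the `embed` helper is a hypothetical placeholder for running an utterance through a model such as ResNetSE34L to obtain an embedding, and the whitespace-separated trial-file layout is an assumption.

```python
# Minimal sketch of scoring a verification trial list: each line holds a
# label (1 = legitimate attempt, 0 = attack) and a pair of audio files.
import torch
import torch.nn.functional as F


def embed(wav_path: str) -> torch.Tensor:
    """Hypothetical placeholder: return the speaker embedding of one recording."""
    raise NotImplementedError


def score_trials(trial_file: str) -> tuple[list[int], list[float]]:
    labels, scores = [], []
    with open(trial_file) as f:
        for line in f:
            label, enroll_wav, test_wav = line.split()
            # Cosine similarity between the two embeddings is a common
            # verification score; higher means "same speaker".
            s = F.cosine_similarity(embed(enroll_wav), embed(test_wav), dim=0)
            labels.append(int(label))
            scores.append(s.item())
    return labels, scores
```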
2.2 Fine-tuning parameters

In the process of fine-tuning the neural network, various strategies were used to adapt the model to new data. Special attention was paid to the choice of training parameters, including the learning rate, the choice of optimizer, the number of training epochs, and the structure of the neural network. The batch size was also taken into account: when the model was fine-tuned on the TIMIT corpus the batch size was 50, with the Russian corpus it was 30, and with the Chinese corpus it was 40. The parameters used to train the neural network are described below; a sketch of how they fit together appears at the end of this subsection.

1) The learning rate affects the convergence of the model and helps prevent overfitting. Weight updates should be moderate so as not to destroy previously learned features.
2) The number of epochs during fine-tuning affects overfitting: too many epochs can cause the model to overfit, while too few can lead to insufficient adaptation.
3) A small batch size during fine-tuning strikes a balance between computational efficiency, robustness to change, and the model's ability to generalize to new data.
4) The optimizer adjusts the model's weights during training to minimize the loss function.
5) The structure of the neural network describes the model's architecture, including the number of layers and their interconnections. The fine-tuning process determines how new data is integrated into the existing architecture; changing the network structure may include freezing layers, which lets the model learn new data features while retaining its initial training.

During the research, the audio recordings were divided into two text files, "train" and "test". This systematized the data and provided the structure needed for subsequent analysis.
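The sketch below shows, under stated assumptions, how these parameters might be combined in PyTorch. It is illustrative rather than the authors' exact configuration: the attribute name `fc` for the fully connected layer, the choice of the Adam optimizer, the cross-entropy speaker-classification objective, and the default learning rate are all assumptions made for the example.

```python
# Illustrative PyTorch fine-tuning sketch combining the parameters above:
# learning rate, optimizer, number of epochs, batch size, and freezing of
# the fully connected layer. Names marked as assumptions are not from the paper.
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


def fine_tune(model: nn.Module, train_dataset: Dataset, *,
              fc_name: str = "fc", lr: float = 1e-4,
              epochs: int = 10, batch_size: int = 50) -> nn.Module:
    # Freeze the fully connected layer so its pre-trained weights survive.
    for p in getattr(model, fc_name).parameters():
        p.requires_grad = False

    # A moderate learning rate avoids destroying previously learned features;
    # Adam is one possible optimizer choice (an assumption here).
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.CrossEntropyLoss()  # speaker-classification objective

    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for features, speaker_ids in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), speaker_ids)
            loss.backward()
            optimizer.step()
    return model
```

With the batch sizes reported above, a call might look like `fine_tune(model, timit_train, batch_size=50)` for TIMIT, with 30 and 40 for the Russian and Chinese corpora respectively.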
3 Results
3.1 Fine-tuning results on the TIMIT speech corpus
To determine the best fine-tuning strategy, a series of experiments was conducted on the TIMIT dataset with the pre-trained weights "baseline_lite_ap.model" and "baseline_v2_ap.model". The EER (equal error rate) value obtained during testing after each training run was used to judge how much more effectively the neural network recognizes speakers [9]. To have a baseline against which to compare the values obtained after fine-tuning, the models were first evaluated without any additional training: EER values of 0.012 and 0.013 were obtained for "baseline_lite_ap.model" and "baseline_v2_ap.model", respectively. The models were then fine-tuned with varying parameters. After twenty experiments, a summary table was compiled presenting the EER of Type I and Type II errors for comparison of fine-tuning effectiveness; it lists the models used, the pre-trained weights, and the experiments performed. Each experiment was an attempt to fine-tune the models on the TIMIT speech corpus [6].
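As a reference for how EER figures of this kind may be computed, here is one standard approach using scikit-learn: find the operating point where the false acceptance rate equals the false rejection rate. This is a generic sketch, not the authors' evaluation code; it consumes the labels and scores from a trial list such as the one described in section 2.1.

```python
# Generic EER computation: the error rate at the threshold where the
# false acceptance rate (FPR) equals the false rejection rate (FNR).
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels: list[int], scores: list[float]) -> float:
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr  # false rejection rate
    # Pick the point where the two error curves cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)
```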
Table 1 — Experimental results for the TIMIT speech corpus
When fine-tuning from the "baseline_lite_ap.model" weights, the best result was obtained with a slight change in the learning rate: the initial EER of 0.012 decreased to 0.006 after fine-tuning, a 50% reduction. With the pre-trained weights "baseline_v2_ap.model", the best result was achieved by changing the optimizer and significantly reducing the learning rate: the EER of Type I and Type II errors fell from 0.013 to 0.007, a 46.15% reduction. In addition to the fine-tuning experiments, all models were also trained on the TIMIT dataset [6] without transfer of pre-trained weights; the resulting EER values [9] were compared with those obtained by fine-tuning to demonstrate the reduction in EER that fine-tuning provides over purely basic training.
3.2 Fine-tuning results on the Russian speech corpus
To optimize the model and improve speaker recognition efficiency, a series of experiments was conducted on the Russian-language dataset using the pre-trained weights "model000000100.model", "baseline_lite_ap.model", and "baseline_v2_ap.model". The plan of the experiments with the Russian speech corpus [7] did not differ from the experimental study on the TIMIT corpus [6]. To assess recognition effectiveness before fine-tuning, the models were first evaluated without additional training, i.e., before any adjustment of their parameters. Because the models were originally trained on English data and the fine-tuning was conducted on Russian data, the change of language environment affected the models' accuracy: accuracy did not rise above 60% over ten epochs. The transition required adapting the models to the new language and its features; English and Russian differ considerably in grammar and syntax, and the morphology of each language can also affect fine-tuning. This procedure extends the model's functionality to different languages and cultural contexts. The experiments with the Russian dataset are summarized in Table 2, which presents the EER of Type I and Type II errors obtained in all experiments using the pre-trained weights "baseline_v2_ap.model", "baseline_lite_ap.model", and "model000000100.model".
Table 2 — Experimental results for the Russian speech corpus
When fine-tuning from the "model000000100.model" weights, a significant improvement was obtained with a slight modification of the learning rate: the initial EER of 0.066 decreased to 0.024, a 63.64% reduction. With the pre-trained "baseline_lite_ap.model" weights, the best result was achieved by freezing the fully connected layer and changing the learning rate and the optimizer: the initial EER of 0.012 decreased to 0.007, a 41.67% reduction. With the pre-trained weights "baseline_v2_ap.model", the best result was obtained by freezing the fully connected layer: the EER fell from 0.078 to 0.062, a 20.51% reduction.
3.3 Fine-tuning results on the HI-MIA speech corpus
The pre-trained weights were also fine-tuned on the speech corpus of Chinese-language recordings [8]. In most cases, work with this dataset did not lead to successful fine-tuning. The experiments previously conducted with the other datasets were repeated, supplemented by experiments with other learning-rate values and with freezing of other layers. In the end, successful fine-tuning was achieved only with the pre-trained weights "baseline_lite_ap.model", in the experiment that combined freezing the fully connected layer with changes to the learning rate and the optimizer. In the fine-tuning runs that used the weights "model000000100.model", the initial EER was 0.066 [9]. The new dataset contained a variety of background noises, which is typical of a real-world interaction environment, and the model parameters were adapted accordingly. Nevertheless, after fine-tuning on the Chinese dataset, the EER increased to 0.214. All results of the experiments with the Chinese dataset are presented in Table 3.
Table 3 — Experimental results for the HI-MIA speech corpus
Analysis showed that the increase in EER may stem both from linguistic and phonetic differences between English and Chinese and from the presence of noise masking key acoustic features of speech. Fine-tuning on the noisy Chinese dataset led to overfitting to the specific features of that dataset, which reduced the model's ability to generalize. It is also possible that standard fine-tuning methods are insufficient under strong language variability combined with noise.
4 Applicability of the results and novelty
The approach examined in this study avoids the additional cost of training models from scratch and, when these models are used, increases the efficiency of voice-based speaker verification systems. Voice-based speaker verification is widespread in the banking sector, since when a client contacts a call center, the voice is the only available biometric parameter. The work used well-known pre-trained models and the transfer learning method within the given subject area. The main scientific contribution is the evaluation of the effectiveness and applicability of specific methods of fine-tuning neural networks to the features of datasets in the task of text-independent speaker verification.
5 Conclusions
This work studied the transfer learning process of fine-tuning neural networks to the characteristics of a dataset, which yielded more accurate models after fine-tuning. The observations obtained underscore the importance of correct parameter settings when fine-tuning models: the fine adjustments made during fine-tuning have a beneficial effect on model effectiveness. With the pre-trained weights "baseline_lite_ap.model" and the TIMIT speech corpus, the maximum reduction in the EER of Type I and Type II errors was obtained: it decreased by 50%. On the Russian-language dataset, the EER decreased most with the pre-trained weights "model000000100.model": the reduction was 63.64%. With the HI-MIA dataset, successful fine-tuning was achieved only in the experiment that combined freezing the fully connected layer with changes to the learning rate and the optimizer; as a result, the EER of Type I and Type II errors decreased by 16.04%.

References
1. Gassiev, D. O., Sakharov, V. A., & Ermolaeva, V. V. (2019). Voice authentication. Trends in science and education, 56(2), 22-24.
2. GOST R 58668.11-2019 (ISO/IEC 19794-13:2018). Information Technology. Biometrics. Biometric data exchange formats. Section 11. Voice data. (2019). Moscow: Standard-Inform.
3. Devjatkov, V. V., & Fedorov, I. B. (2001). Artificial Intelligence Systems. BMSTU.
4. Galushkin, A. I. (2012). Neural networks. Fundamentals of theory. Hotline – Telecom.
5. Suzuki, K. (2013). Artificial Neural Networks: Architectures and Applications. InTech.
6. Evsyukov, M. V., Putyato, M. M., & Makaryan, A. S. (2020). Protection methods in modern voice authentication systems. Caspian journal: Control and High Technologies, 3(59), 84-92.
7. Nagrani, A., Chung, J. S., & Zisserman, A. (2018). VoxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612. Retrieved from https://arxiv.org/pdf/1706.08612
8. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Retrieved from https://arxiv.org/pdf/1207.0580
9. Konev, A. A. (2007). Model and algorithms of speech signal analysis and segmentation: Thesis abstract for the degree of Candidate of Technical Sciences. Tomsk.
10. Qin, X., Bu, H., & Li, M. (2020). HI-MIA: A Far-field Text-Dependent Speaker Verification Database and the Baselines. IEEE International Conference on Acoustics, Speech, and Signal Processing, 7609-7613.
11. Rakhmanenko, I. A., Shelupanov, A. A., & Kostyuchenko, E. Y. (2020). Automatic text-independent speaker verification using convolutional deep belief network. Computer Optics, 44(4), 596-605.