Software systems and computational methods

Reference: Trofimova V.S., Karshieva P.K., Rakhmanenko I.A. Fine-tuning neural networks for the features of a dataset in the speaker verification task using transfer learning // Software systems and computational methods. 2024. № 3. P. 26-36. DOI: 10.7256/2454-0714.2024.3.71630 EDN: XHZCTS URL: https://en.nbpublish.com/library_read_article.php?id=71630
Fine-tuning neural networks for the features of a dataset in the speaker verification task using transfer learning
DOI: 10.7256/2454-0714.2024.3.71630
EDN: XHZCTS
Received: 30-08-2024
Published: 06-09-2024

Abstract: The subject of this study is neural networks trained using transfer learning methods tailored to the specific characteristics of the dataset. The object of the study is machine learning methods used for solving speaker verification tasks. The aim of the research is to improve the efficiency of neural networks in the task of speaker verification. In this work, three datasets in different languages were prepared for the fine-tuning process: English, Russian, and Chinese. Additionally, an experimental study was conducted using modern pre-trained models ResNetSE34L and ResNetSE34V2, aimed at enhancing the efficiency of neural networks in text-independent speaker verification. The research methodology includes assessing the effectiveness of fine-tuning neural networks to the characteristics of the dataset in the speaker verification task, based on the equal error rate (EER) of Type I and Type II errors. A series of experiments were also conducted, during which parameters were varied and layer freezing techniques were applied. The maximum reduction in EER when using the English dataset was achieved by adjusting the number of epochs and the learning rate, reducing the error by 50%. Similar parameter adjustments with the Russian dataset reduced the error by 63.64%. When fine-tuning with the Chinese dataset, the lowest error rate was achieved in the experiment that involved freezing the fully connected layer, modifying the learning rate, and changing the optimizer, resulting in a 16.04% error reduction. The obtained results can be used in the design and development of speaker verification systems and for educational purposes. It was also concluded that transfer learning is effective for fine-tuning neural networks to the specific characteristics of a dataset, as a significant reduction in EER was achieved in the majority of experiments, indicating improved speaker recognition accuracy.

Keywords: transfer learning, fine-tuning, dataset, speaker verification, speaker recognition, feature extraction, speech processing, neural networks, deep learning, pattern recognition

1 Introduction
Technologies that verify users by various biometric parameters are now in wide use. One such parameter is the voice [1]. Speaker verification is a form of speaker recognition in which a decision is made about whether a voice sample belongs to the individual whose identity has been claimed [2]. This procedure provides users with a high level of security and convenience when accessing information. Voice biometrics is becoming increasingly relevant in many fields, including banking, healthcare, security systems, and everyday life [3]. Fine-tuning neural networks to the features of a particular dataset plays a key role in improving the accuracy of voice-based speaker verification systems [4]. Each dataset has unique attributes; during fine-tuning, pre-trained models adapt to these new features, which significantly improves the efficiency of the system. The effectiveness of the neural networks performing voice-based speaker verification directly affects the security and usability of various services: an accurate system prevents unauthorized access to confidential information and protects user accounts from intruders.
2 Experimental procedure
2.1 Preliminary data
The pre-trained English-language models ResNetSE34L and ResNetSE34V2 were selected for fine-tuning. Both models are based on the ResNet34 architecture and are used to identify a speaker from an arbitrary phrase and then verify that speaker. The main difference between ResNetSE34L and ResNetSE34V2 lies in how they extract and adapt audio features; in addition, ResNetSE34V2 offers more efficient data processing, which yields higher training accuracy.

The VoxCeleb2 speech corpus was used as the training dataset for the ResNetSE34L model [5], producing the weights "baseline_lite_ap.model". This corpus is one of the largest datasets used to evaluate automatic speech recognition systems. The TIMIT speech corpus [6] was also used to train the ResNetSE34L model, producing the weights "model000000100.model".

Three speech corpora were then prepared for fine-tuning: one each in English, Russian, and Chinese. The first was the TIMIT speech corpus [6], containing audio recordings in English. It was divided into a training set of 4,620 recordings and a test set of 1,680 recordings. This dataset was used for fine-tuning with the pre-trained weights "baseline_lite_ap.model" and "baseline_v2_ap.model". It was not used with "model000000100.model", since those weights had been obtained by training the ResNetSE34L model on the TIMIT corpus itself.

The Russian speech corpus includes audio recordings made by 50 native speakers of Russian, with 50 recordings per speaker. It was divided into a training set containing recordings of 30 speakers and a test set consisting of recordings of the other 20 speakers [7]. The study also used a Chinese speech corpus called HI-MIA, whose data were collected in home conditions using microphone arrays and a Hi-Fi microphone. This dataset was divided into a training set with recordings of 42 speakers and a test set with recordings of 40 speakers [8]. The Russian and Chinese corpora were used in experiments with all of the pre-trained weights listed above, since none of those weights had previously been trained on them. These datasets were selected so that fine-tuning would improve the network's effectiveness: the variety of dialects, voices, and languages in the recordings gives broader coverage of variability in the pronunciation of words and phrases.

For the verification itself, a test scenario is used that includes authentication attempts by a legitimate user and by an attacker (a voice mismatch). In the text file with the test scenario, a legitimate attempt is marked with the label 1 and an attack on the system with the label 0. Since the goal was to improve the efficiency of the neural network in voice verification, the main attention was paid to reducing the equal error rate (EER) of Type I and Type II errors [9]: more accurate speaker recognition is essential for high system reliability. To reduce both error types, the transfer learning method was applied using pre-trained weights of a model that verifies a speaker from an arbitrary phrase.
The method used is thus a fine-tuning technique: a model pre-trained on a large dataset is additionally trained on a narrower dataset to perform the speaker verification task.
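As an illustration of the trial format described above, the sketch below scores each labeled pair of recordings with the cosine similarity of their speaker embeddings. This is a minimal sketch, not the authors' code: the `embed` helper is a hypothetical placeholder for running an utterance through a model such as ResNetSE34L to obtain an embedding, and the whitespace-separated trial-file layout is an assumption.

```python
# Minimal sketch of scoring a verification trial list: each line holds a
# label (1 = legitimate attempt, 0 = attack) and a pair of audio files.
import torch
import torch.nn.functional as F


def embed(wav_path: str) -> torch.Tensor:
    """Hypothetical placeholder: return the speaker embedding of one recording."""
    raise NotImplementedError


def score_trials(trial_file: str) -> tuple[list[int], list[float]]:
    labels, scores = [], []
    with open(trial_file) as f:
        for line in f:
            label, enroll_wav, test_wav = line.split()
            # Cosine similarity between the two embeddings is a common
            # verification score; higher means "same speaker".
            s = F.cosine_similarity(embed(enroll_wav), embed(test_wav), dim=0)
            labels.append(int(label))
            scores.append(s.item())
    return labels, scores
```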
2.2 Fine-tuning parameters

In the process of fine-tuning the neural network, various strategies were used to adapt the model to new data. Special attention was paid to the choice of training parameters, including the learning rate, the choice of optimizer, the number of training epochs, and the structure of the neural network. The batch size was also taken into account: when the model was fine-tuned on the TIMIT corpus the batch size was 50, with the Russian corpus it was 30, and with the Chinese corpus it was 40. The parameters used to train the neural network are described below; a sketch of how they fit together appears at the end of this subsection.

1) The learning rate affects the convergence of the model and helps prevent overfitting. Weight updates should be moderate so as not to destroy previously learned features.
2) The number of epochs during fine-tuning affects overfitting: too many epochs can cause the model to overfit, while too few can lead to insufficient adaptation.
3) A small batch size during fine-tuning strikes a balance between computational efficiency, robustness to change, and the model's ability to generalize to new data.
4) The optimizer adjusts the model's weights during training to minimize the loss function.
5) The structure of the neural network describes the model's architecture, including the number of layers and their interconnections. The fine-tuning process determines how new data is integrated into the existing architecture; changing the network structure may include freezing layers, which lets the model learn new data features while retaining its initial training.

During the research, the audio recordings were divided into two text files, "train" and "test". This systematized the data and provided the structure needed for subsequent analysis.
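The sketch below shows, under stated assumptions, how these parameters might be combined in PyTorch. It is illustrative rather than the authors' exact configuration: the attribute name `fc` for the fully connected layer, the choice of the Adam optimizer, the cross-entropy speaker-classification objective, and the default learning rate are all assumptions made for the example.

```python
# Illustrative PyTorch fine-tuning sketch combining the parameters above:
# learning rate, optimizer, number of epochs, batch size, and freezing of
# the fully connected layer. Names marked as assumptions are not from the paper.
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


def fine_tune(model: nn.Module, train_dataset: Dataset, *,
              fc_name: str = "fc", lr: float = 1e-4,
              epochs: int = 10, batch_size: int = 50) -> nn.Module:
    # Freeze the fully connected layer so its pre-trained weights survive.
    for p in getattr(model, fc_name).parameters():
        p.requires_grad = False

    # A moderate learning rate avoids destroying previously learned features;
    # Adam is one possible optimizer choice (an assumption here).
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.CrossEntropyLoss()  # speaker-classification objective

    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for features, speaker_ids in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), speaker_ids)
            loss.backward()
            optimizer.step()
    return model
```

With the batch sizes reported above, a call might look like `fine_tune(model, timit_train, batch_size=50)` for TIMIT, with 30 and 40 for the Russian and Chinese corpora respectively.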
3 Results
3.1 Fine-tuning results on the TIMIT speech corpus
To determine the best fine-tuning strategy, a series of experiments was conducted on the TIMIT dataset with the pre-trained weights "baseline_lite_ap.model" and "baseline_v2_ap.model". The EER (equal error rate) value obtained during testing after each training run was used to judge how much more effectively the neural network recognizes speakers [9]. To have a baseline against which to compare the values obtained after fine-tuning, the models were first evaluated without any additional training: EER values of 0.012 and 0.013 were obtained for "baseline_lite_ap.model" and "baseline_v2_ap.model", respectively. The models were then fine-tuned with varying parameters. After twenty experiments, a summary table was compiled presenting the EER of Type I and Type II errors for comparison of fine-tuning effectiveness; it lists the models used, the pre-trained weights, and the experiments performed. Each experiment was an attempt to fine-tune the models on the TIMIT speech corpus [6].
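As a reference for how EER figures of this kind may be computed, here is one standard approach using scikit-learn: find the operating point where the false acceptance rate equals the false rejection rate. This is a generic sketch, not the authors' evaluation code; it consumes the labels and scores from a trial list such as the one described in section 2.1.

```python
# Generic EER computation: the error rate at the threshold where the
# false acceptance rate (FPR) equals the false rejection rate (FNR).
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels: list[int], scores: list[float]) -> float:
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr  # false rejection rate
    # Pick the point where the two error curves cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)
```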
Table 1 — Experimental results for the TIMIT speech corpus
When fine-tuning from the "baseline_lite_ap.model" weights, the best result was obtained with a slight change in the learning rate: the initial EER of 0.012 decreased to 0.006 after fine-tuning, a 50% reduction. With the pre-trained weights "baseline_v2_ap.model", the best result was achieved by changing the optimizer and significantly reducing the learning rate: the EER of Type I and Type II errors fell from 0.013 to 0.007, a 46.15% reduction. In addition to the fine-tuning experiments, all models were also trained on the TIMIT dataset [6] without transfer of pre-trained weights; the resulting EER values [9] were compared with those obtained by fine-tuning to demonstrate the reduction in EER that fine-tuning provides over purely basic training.
3.2 Fine-tuning results on the Russian speech corpus
To optimize the model and improve speaker recognition efficiency, a series of experiments was conducted on the Russian-language dataset using the pre-trained weights "model000000100.model", "baseline_lite_ap.model", and "baseline_v2_ap.model". The plan of the experiments with the Russian speech corpus [7] did not differ from the experimental study on the TIMIT corpus [6]. To assess recognition effectiveness before fine-tuning, the models were first evaluated without additional training, i.e., before any adjustment of their parameters. Because the models were originally trained on English data and the fine-tuning was conducted on Russian data, the change of language environment affected the models' accuracy: accuracy did not rise above 60% over ten epochs. The transition required adapting the models to the new language and its features; English and Russian differ considerably in grammar and syntax, and the morphology of each language can also affect fine-tuning. This procedure extends the model's functionality to different languages and cultural contexts. The experiments with the Russian dataset are summarized in Table 2, which presents the EER of Type I and Type II errors obtained in all experiments using the pre-trained weights "baseline_v2_ap.model", "baseline_lite_ap.model", and "model000000100.model".
Table 2 — Experimental results for the Russian speech corpus
When fine-tuning from the "model000000100.model" weights, a significant improvement was obtained with a slight modification of the learning rate: the initial EER of 0.066 decreased to 0.024, a 63.64% reduction. With the pre-trained "baseline_lite_ap.model" weights, the best result was achieved by freezing the fully connected layer and changing the learning rate and the optimizer: the initial EER of 0.012 decreased to 0.007, a 41.67% reduction. With the pre-trained weights "baseline_v2_ap.model", the best result was obtained by freezing the fully connected layer: the EER fell from 0.078 to 0.062, a 20.51% reduction.
3.3 Fine-tuning results on the HI-MIA speech corpus
The pre-trained weights were also fine-tuned on the speech corpus of Chinese-language recordings [8]. In most cases, work with this dataset did not lead to successful fine-tuning. The experiments previously conducted with the other datasets were repeated, supplemented by experiments with other learning-rate values and with freezing of other layers. In the end, successful fine-tuning was achieved only with the pre-trained weights "baseline_lite_ap.model", in the experiment that combined freezing the fully connected layer with changes to the learning rate and the optimizer. In the fine-tuning runs that used the weights "model000000100.model", the initial EER was 0.066 [9]. The new dataset contained a variety of background noises, which is typical of a real-world interaction environment, and the model parameters were adapted accordingly. Nevertheless, after fine-tuning on the Chinese dataset, the EER increased to 0.214. All results of the experiments with the Chinese dataset are presented in Table 3.
Table 3 — Experimental results for the HI-MIA speech corpus
Analysis showed that the increase in EER may stem both from linguistic and phonetic differences between English and Chinese and from the presence of noise masking key acoustic features of speech. Fine-tuning on the noisy Chinese dataset led to overfitting to the specific features of that dataset, which reduced the model's ability to generalize. It is also possible that standard fine-tuning methods are insufficient under strong language variability combined with noise.
4 Applicability of the results and novelty
The approach examined in this study avoids the additional cost of training models from scratch and, when these models are used, increases the efficiency of voice-based speaker verification systems. Voice-based speaker verification is widespread in the banking sector, since when a client contacts a call center, the voice is the only available biometric parameter. The work used well-known pre-trained models and the transfer learning method within the given subject area. The main scientific contribution is the evaluation of the effectiveness and applicability of specific methods of fine-tuning neural networks to the features of datasets in the task of text-independent speaker verification.
5 Conclusions
This work studied the transfer learning process of fine-tuning neural networks to the characteristics of a dataset, which yielded more accurate models after fine-tuning. The observations obtained underscore the importance of correct parameter settings when fine-tuning models: the fine adjustments made during fine-tuning have a beneficial effect on model effectiveness. With the pre-trained weights "baseline_lite_ap.model" and the TIMIT speech corpus, the maximum reduction in the EER of Type I and Type II errors was obtained: it decreased by 50%. On the Russian-language dataset, the EER decreased most with the pre-trained weights "model000000100.model": the reduction was 63.64%. With the HI-MIA dataset, successful fine-tuning was achieved only in the experiment that combined freezing the fully connected layer with changes to the learning rate and the optimizer; as a result, the EER of Type I and Type II errors decreased by 16.04%.

References
1. Gassiev, D. O., Sakharov, V. A., & Ermolaeva, V. V. (2019). Voice authentication. Trends in science and education, 56(2), 22-24.
2. GOST R 58668.11-2019 (ISO/IEC 19794-13:2018). Information Technology. Biometrics. Biometric data exchange formats. Section 11. Voice data. (2019). Moscow: Standard-Inform.
3. Devjatkov, V. V., & Fedorov, I. B. (2001). Artificial Intelligence Systems. BMSTU.
4. Galushkin, A. I. (2012). Neural networks. Fundamentals of theory. Hotline – Telecom.
5. Suzuki, K. (2013). Artificial Neural Networks: Architectures and Applications. InTech.
6. Evsyukov, M. V., Putyato, M. M., & Makaryan, A. S. (2020). Protection methods in modern voice authentication systems. Caspian journal: Control and High Technologies, 3(59), 84-92.
7. Nagrani, A., Chung, J. S., & Zisserman, A. (2018). VoxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612. Retrieved from https://arxiv.org/pdf/1706.08612
8. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Retrieved from https://arxiv.org/pdf/1207.0580
9. Konev, A. A. (2007). Model and algorithms of speech signal analysis and segmentation: Thesis abstract for the degree of Candidate of Technical Sciences. Tomsk.
10. Qin, X., Bu, H., & Li, M. (2020). HI-MIA: A Far-field Text-Dependent Speaker Verification Database and the Baselines. IEEE International Conference on Acoustics, Speech, and Signal Processing, 7609-7613.
11. Rakhmanenko, I. A., Shelupanov, A. A., & Kostyuchenko, E. Y. (2020). Automatic text-independent speaker verification using convolutional deep belief network. Computer Optics, 44(4), 596-605.