Software systems and computational methods
Reference:
Dagaev, A.E., Popov, D.I. (2024). Comparison of automatic summarization of texts in Russian. Software systems and computational methods, 4, 13–22. https://doi.org/10.7256/2454-0714.2024.4.69474
Comparison of automatic summarization of texts in Russian
DOI: 10.7256/2454-0714.2024.4.69474
EDN: CSFMFC
Received: 29-12-2023
Published: 07-11-2024

Abstract: The subject of this research is the summarization of Russian-language texts using artificial intelligence models. The authors compare the popular models GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat in a comparative study of their performance on Russian texts. Russian-language datasets, namely Gazeta, XL-Sum and WikiLingua, serve as the source material for summarization, and additional English-language datasets, CNN/DailyMail and XSum, were taken to compare summarization effectiveness across languages. The quality of the generated summaries is assessed with the ROUGE, BLEU, BERTScore, METEOR and BLEURT metrics. The research method is a comparative analysis of the data obtained from automatic summarization by the artificial intelligence models. The scientific novelty of the work lies in a comparative analysis of the quality of automatic summarization of Russian and English texts by different neural network models for natural language processing. The authors focus on the recent models GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat and analyze their effectiveness for the summarization task. The results for Russian show that YouChat achieves the highest scores across the full set of metrics, underlining the model's effectiveness in processing and generating text and its accurate reproduction of key elements of the content. In contrast, the Bard model showed the worst results, proving the least capable of generating coherent and relevant text. The data obtained in this comparison contribute to a deeper understanding of the models considered and can guide the choice of artificial intelligence for text summarization tasks as a basis for future development.

Keywords: natural language processing, text summarization, GigaChat, YaGPT2, ChatGPT-3, ChatGPT-4, Bard, Bing AI, YouChat, text compression

This article is an automatic translation.

Introduction

Text summarization is an important area of natural language processing with considerable significance for a wide range of tasks. To summarize a text, an artificial intelligence model must be able to produce coherent and relevant content while compressing the main information into a shorter form, regardless of subject area. Text summarization can therefore be used to assess and compare the quality of neural networks. This article presents a comparative study of popular artificial intelligence models on Russian-language texts.

Related works

The quality of language models has been studied actively in recent years, but most research targets English as the international language. For Russian as the primary language, no comparison of GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat was found. In [6], recursive summarization with GPT-3.5 was investigated, along with methods for selecting the essential content to summarize. In [7], it is reported that GPT models have difficulty identifying important information and are more prone to errors when summarizing long texts.
A study of generation quality with GPT models showed that quality is higher for high-resource languages than for low-resource ones [8]. In [9], weak GPT performance on Russian within a multilingual dataset was noted. Recent studies [1][2] show that the quality of news summarization by large language models is comparable to that of human-written summaries.

Data sets

The following datasets are used for Russian:

Gazeta [3]. The dataset contains 63,435 news items published on gazeta.ru.
XL-Sum [12]. The set contains 1.35 million annotated pairs of BBC articles in various languages, including 77,803 in Russian.
WikiLingua [16]. A multilingual dataset created for evaluating summarization. It includes wikiHow articles in 18 languages, with 52,928 articles in Russian.

To compare summarization effectiveness, additional English-language datasets were taken:

CNN/DailyMail [11]. The set includes CNN news articles from April 2007 to April 2015 and Daily Mail articles from June 2010 to April 2015, 311,672 articles in total.
XSum [10]. The set consists of 226,711 BBC articles from 2010 to 2017.

From each of the listed sets, 100 source texts were randomly selected and unified to a length of 1024 tokens.
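The preprocessing step can be sketched as follows. This is an illustrative example rather than the authors' code: the Hugging Face dataset identifier (IlyaGusev/gazeta) and the multilingual tokenizer are assumptions, since the article does not name the exact tools used.

# Illustrative preprocessing sketch: sample 100 texts and truncate to 1024 tokens.
# The dataset identifier and tokenizer are assumptions, not the authors' setup.
import random

from datasets import load_dataset
from transformers import AutoTokenizer

random.seed(42)  # fixed seed for reproducible sampling (assumed)

dataset = load_dataset("IlyaGusev/gazeta", split="test")  # Gazeta news corpus [3]
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Randomly select 100 source texts, as described above.
sample = random.sample(list(dataset["text"]), 100)

def truncate(text: str, max_tokens: int = 1024) -> str:
    """Cut a text down to a uniform length of at most 1024 tokens."""
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

sample_1024 = [truncate(text) for text in sample]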
Evaluation metrics

This work uses the text-quality metrics ROUGE [4], BLEU [5], BERTScore [13], METEOR [14] and BLEURT [15].

ROUGE evaluates the quality of machine-generated text by analyzing its overlap with a reference text, capturing both the precision and the completeness of the conveyed information. It can be applied to various tasks, including summarization and machine translation. Of its variants, the following were used:

ROUGE-1 computes the overlap of unigrams (individual words) between the generated and the reference text, which helps assess the accuracy of translation or summarization.
ROUGE-2 works with bigrams, that is, pairs of consecutive words.
ROUGE-L measures the length of the longest common subsequence, so it compares the generated and reference texts with word order taken into account.

Each of these ROUGE variants evaluates machine-generated text from a different angle: ROUGE-1 and ROUGE-2 focus on word- and bigram-level overlap, while ROUGE-L reflects the structure and order of words.

BLEU [5] measures the quality of machine-generated text, in particular in the context of machine translation and summarization. Originally created for evaluating machine translation, it is now used for many other NLP tasks. BLEU evaluates the similarity between machine-generated text and one or more human-written reference texts by comparing their n-grams.

BERTScore evaluates the similarity between two texts using vector representations obtained with the BERT model. BERTScore correlates well with human judgments, is more effective for model selection than earlier metrics, and is more robust on difficult examples [13]. This article uses the BERTScore F1, the harmonic mean of precision and recall, which provides a balanced measure accounting for both false positives and false negatives.

METEOR [14] is often used to evaluate machine translation and provides more detailed information than BLEU: it takes into account not only exact word matches but also word stems, synonyms and word order, giving a more holistic assessment of quality.

BLEURT [15] is a trained metric based on BERT and RemBERT. It takes a pair of texts as input, a candidate and a reference, and outputs a score showing how fluent the candidate is and how well it conveys the main meaning of the reference.
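A minimal sketch of scoring one generated summary against a reference with these metrics is shown below. It assumes the rouge-score, sacrebleu, bert-score and nltk packages and is an illustration rather than the authors' evaluation code; BLEURT is omitted because it additionally requires a separately downloaded checkpoint.

# Illustrative scoring of one generated summary against a reference summary.
# Assumes the rouge-score, sacrebleu, bert-score and nltk packages.
import nltk
import sacrebleu
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet data

reference = "The cabinet approved the draft budget for the next three years."
candidate = "The government approved the draft budget for the coming three years."

# ROUGE-1, ROUGE-2 and ROUGE-L: unigram, bigram and longest-common-subsequence
# overlap. Note: the default tokenizer keeps only Latin letters and digits, so
# Russian texts need a tokenizer that preserves Cyrillic characters.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# BLEU: n-gram overlap of the candidate against one reference.
bleu = sacrebleu.corpus_bleu([candidate], [[reference]]).score

# BERTScore: similarity of BERT vector representations; the F1 value is reported.
_, _, f1 = bert_score([candidate], [reference], lang="en")

# METEOR: also rewards stem and synonym matches; expects pre-tokenized input.
meteor = meteor_score([reference.split()], candidate.split())

print(rouge["rouge1"].fmeasure, rouge["rougeL"].fmeasure, bleu, f1.mean().item(), meteor)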
Results

The results of the above metrics for the Gazeta dataset are presented in Table 1.

Table 1 – Results on the Gazeta set

The results for the XL-Sum set are shown in Table 2.

Table 2 – Results on the XL-Sum set
For the WikiLingua set, the results are shown in Table 3.

Table 3 – Results on the WikiLingua set
To calculate the overall score, each metric was assigned an individual weight. The weights were distributed according to the specifics of the text summarization task, with greater weight given to semantic similarity, to the combination of semantic and structural similarity, and to the degree of compression. Calculation formula:
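In general form this is a normalized weighted sum; the weights $w_i$ below are placeholders rather than the article's published values, which are not reproduced in this text:

$$\mathrm{Score_{overall}} = \sum_{i} w_i \, m_i, \qquad \sum_{i} w_i = 1,$$

where $m_i$ are the individual metric values (ROUGE, BLEU, BERTScore, METEOR, BLEURT and the compression ratio) and $w_i$ are the weights assigned to them.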
Figure 1 shows the overall scores for all the Russian-language datasets used.

Figure 1 – Overall scores for the Russian-language datasets

YouChat showed the highest results on all datasets across the set of metrics. This highlights the model's effectiveness in processing and generating text, as well as its ability to reproduce key elements of the content accurately. Bard, unlike the other models, generated the least coherent and contextually relevant text, which leads to unsatisfactory results when evaluating similarity, summarization and other natural language processing tasks. The lower scores may also indicate difficulties in perceiving the subtleties of the language, leading to problems such as irrelevant information, a lack of semantic consistency and inaccuracies in reproducing the main context.

GigaChat is better suited to the summarization task than ChatGPT-3.5, although overall their results are at a comparable level of quality. GigaChat reproduced context more accurately than YaGPT2 and generated more meaningful text, with a higher overall ability to summarize. Bard obtained the lowest final score on the Russian-language sets.

The compression between the input and output texts is shown in Figure 2.

Figure 2 – Compression on the Russian-language datasets

The highest compression was shown by YaGPT2 (80.58%), followed by GigaChat (67.14%), Bing AI (60.22%), ChatGPT-3.5 (57.83%), ChatGPT-4 (54.41%) and YouChat (44.89%); the lowest compression was obtained by Bard (41.89%).
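Compression here is the relative reduction in length from the source text to the generated summary. The sketch below shows this calculation; counting whitespace-separated words is an assumption, since the article does not state whether characters, words or tokens were used as the unit of length.

# Illustrative compression calculation; word counting is an assumed unit of length.
def compression(source: str, summary: str) -> float:
    """Percentage by which the summary is shorter than the source text."""
    source_len = len(source.split())
    summary_len = len(summary.split())
    return (1 - summary_len / source_len) * 100

# Example: a 200-word summary produced from a 1024-word source.
print(round(compression(" ".join(["word"] * 1024), " ".join(["word"] * 200)), 2))  # 80.47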
For summarization in English, the scores recorded on the CNN/DailyMail dataset are shown in Table 4. It should be noted that YaGPT2 does not work with English-language texts, so it was excluded from the models considered in the subsequent analysis.

Table 4 – Results on the CNN/DailyMail set

The results for the XSum set are shown in Table 5.

Table 5 – Results on the XSum set
The overall scores for the English-language data are shown in Figure 3. The results show that the quality of summarization depends on the source language.

Figure 3 – Overall scores for the English-language datasets

GigaChat obtained the highest score among the models considered. On the English-language sets, ChatGPT-4 shows the lowest quality and also compresses text less than the other models. The compression between the input and output texts is shown in Figure 4.

Figure 4 – Compression on the English-language datasets

The highest compression on the English-language datasets was shown by Bard (77.29%), followed by ChatGPT-3.5 (71.98%), GigaChat (71.18%), Bing AI (68.65%) and YouChat (59.37%); ChatGPT-4 had the lowest compression (34.88%).

In general, all of the presented models showed acceptable results on the selected metrics, which indicates that they can be used for text summarization tasks. However, given the models' different capabilities, their behaviour may differ in particular use cases depending on model size, task type and language features, which requires additional research.

Conclusion

In this paper, we compared the quality of automatic text summarization by several natural language processing neural networks: GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat. For this purpose, datasets of Russian-language texts, and of English-language texts for comparison, were selected and preprocessed. The same list of texts was then summarized by each model. The results were scored with ROUGE [4], BLEU [5], BERTScore [13], METEOR [14] and BLEURT [15], comparing the source texts with the automatically generated summaries. An overall score across all metrics was also computed, with each metric weighted according to its importance for the summarization task. The data obtained in this comparison contribute to a deeper understanding of the models considered and can help in choosing an artificial intelligence model for text summarization tasks as a basis for future development. In the future, it is planned to study text processing across models with different settings.

References
1. Goyal, T., Li, J. J., & Durrett, G. (2022). News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.
2. Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2023). Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848.
3. Gusev, I. (2020). Dataset for automatic summarization of Russian news. In Artificial Intelligence and Natural Language: 9th Conference, AINL 2020, Helsinki, Finland, October 7–9, 2020, Proceedings 9 (pp. 122-134). Springer International Publishing.
4. Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81).
5. Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
6. Bhaskar, A., Fabbri, A., & Durrett, G. (2023, July). Prompted opinion summarization with GPT-3.5. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 9282-9300).
7. Tang, L., Sun, Z., Idnay, B., Nestor, J. G., Soroush, A., Elias, P. A., ... & Peng, Y. (2023). Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1), 158.
8. Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., ... & Awadalla, H. H. (2023). How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210.
9. Jiao, W., Wang, W., Huang, J. T., Wang, X., & Tu, Z. (2023). Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.
10. Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. arXiv preprint arXiv:1808.08745.
11. Nallapati, R., Zhou, B., Gulcehre, C., & Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
12. Hasan, T., Bhattacharjee, A., Islam, M. S., Samin, K., Li, Y. F., Kang, Y. B., ... & Shahriyar, R. (2021). XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. arXiv preprint arXiv:2106.13822.
13. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
14. Banerjee, S., & Lavie, A. (2005, June). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72).
15. Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
16. Ladhak, F., Durmus, E., Cardie, C., & McKeown, K. (2020). WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093.