
Software systems and computational methods
Comparison of automatic summarization of texts in Russian

Dagaev Alexander Evgenevich

Postgraduate student, Department of Informatics and Information Technologies, Moscow Polytechnic University

107023, Moscow, st. Bolshaya Semyonovskaya, 38

alejaandro@bk.ru
Popov Dmitry Ivanovich

Doctor of Technical Sciences

Professor, Department of Information Technologies and Mathematics, Sochi State University

354000, Krasnodar region, Sochi, st. Plastunskaya, 94

damitry.popov@gmail.com

DOI:

10.7256/2454-0714.2024.4.69474

EDN:

CSFMFC

Received:

29-12-2023


Published:

07-11-2024


Abstract: The subject of this research is the summarization of Russian-language texts using artificial intelligence models. The authors compare the popular models GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat in a comparative study of their performance on Russian texts. Russian-language datasets (Gazeta, XL-Sum and WikiLingua) serve as the source material for summarization, and additional English-language datasets (CNN/Daily Mail and XSum) were taken to compare summarization quality across languages. Summary quality is assessed with the ROUGE, BLEU score, BERTScore, METEOR and BLEURT metrics. The research method is a comparative analysis of the outputs produced by the artificial intelligence models during automatic summarization. The scientific novelty of the study lies in a comparative analysis of the quality of automatic summarization of Russian and English texts by different neural network models for natural language processing. The authors focus on the recent models GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat, analyzing their effectiveness on the summarization task. On Russian-language texts, YouChat achieves the best results across the set of metrics, underscoring the model's effectiveness in processing and generating text and in reproducing key elements of the content accurately. In contrast, Bard showed the worst results, proving the least able to generate coherent and relevant text. The data obtained in this comparison contribute to a deeper understanding of the models under consideration and should help practitioners choose artificial intelligence models for text summarization tasks, serving as a basis for future work.


Keywords:

natural language processing, text summarization, GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI, YouChat, text compression


Introduction

Text summarization is an important area of natural language processing with considerable significance for a wide range of tasks. To summarize a text, an artificial intelligence model must create coherent and relevant content while compressing the key information into a shorter form, regardless of the subject area. Summarization quality can also serve as a measure for comparing neural network models. This article presents a comparative study of popular artificial intelligence models on Russian-language texts.

Related work

The quality of language models has been studied actively in recent years, but most of this research targets English as the international language. For Russian as the primary language, no published comparison of GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat was found.

In [6], recursive summarization with GPT-3.5 was investigated, along with methods for selecting the essential content to summarize. In [7], it is shown that GPT models have difficulty identifying important information and are more error-prone when summarizing long texts. A study of generation quality with GPT models showed that quality is higher for high-resource languages than for low-resource ones [8], and [9] noted the weak performance of GPT on Russian within a multilingual dataset. Recent studies [1][2] show that the quality of news summarization by large language models is comparable to that of human-written summaries.

Data sets

The following data sets are used for the Russian language:

Gazeta [3]. The dataset contains 63,435 news articles published on gazeta.ru.

XL-Sum [12]. The set contains 1.35 million annotated article-summary pairs from the BBC in various languages, including 77,803 in Russian.

WikiLingua [16]. A multilingual dataset created to evaluate summarization. It comprises wikiHow articles in 18 languages, including 52,928 articles in Russian.

To compare summarization quality across languages, additional English-language datasets were taken:

CNN/Daily Mail [11]. The set includes CNN news articles from April 2007 to April 2015 and Daily Mail articles from June 2010 to April 2015, 311,672 articles in total.

XSum [10]. The set consists of 226,711 BBC articles from 2010 to 2017.

From the listed sets, 100 original texts were randomly selected and truncated to a uniform length of 1,024 tokens.
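As an illustration of this preprocessing step, below is a minimal sketch of sampling and truncating source texts. The paper does not name its tooling, so the Hugging Face datasets library, the community Hub ID "IlyaGusev/gazeta", the "text" field name and the multilingual BERT tokenizer are all assumptions made here for concreteness.

```python
# Minimal preprocessing sketch: sample 100 texts and truncate each to 1,024 tokens.
# Assumptions (not stated in the paper): Hugging Face `datasets`, the Hub ID
# "IlyaGusev/gazeta", the "text" field name, and an mBERT tokenizer.
import random

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def sample_and_truncate(dataset_id, text_field, n=100, max_tokens=1024, seed=42):
    """Randomly select n source texts and cut each to at most max_tokens tokens."""
    ds = load_dataset(dataset_id, split="test")
    random.seed(seed)
    picked = random.sample(range(len(ds)), n)
    texts = []
    for i in picked:
        ids = tokenizer.encode(ds[i][text_field], truncation=True, max_length=max_tokens)
        texts.append(tokenizer.decode(ids, skip_special_tokens=True))
    return texts

gazeta_texts = sample_and_truncate("IlyaGusev/gazeta", "text")
```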

Evaluation metrics

This work uses the text quality evaluation metrics ROUGE [4], BLEU score [5], BERTScore [13], METEOR [14] and BLEURT [15]. ROUGE evaluates the quality of machine-generated texts by analyzing their similarity to reference texts; it captures both the precision and the completeness of the conveyed information and can be applied to various tasks, including text summarization and machine translation.

Among the ROUGE variants, the following were used in this work:

ROUGE-1 calculates the overlap of unigrams (individual words) between the machine-generated text and the reference text, which helps to evaluate the accuracy of machine translation or summarization.

ROUGE-2 works with bigrams, that is, with pairs of words.

ROUGE-L analyzes longer phrases: it measures the similarity between the generated and reference texts with word order taken into account, by estimating the length of the longest common subsequence.

Each of these ROUGE variants evaluates the quality of machine-generated text in a different way: ROUGE-1 and ROUGE-2 focus on overlap at the word and bigram level, while ROUGE-L reflects the structure and order of words in the texts.
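For concreteness, here is a minimal sketch of computing these three variants with the rouge-score package (the paper does not name its ROUGE implementation). The package's default tokenizer drops non-Latin characters, so a simple Unicode word tokenizer is supplied for Russian; the two sentences are illustrative.

```python
# ROUGE-1/2/L with the rouge-score package; a sketch, not the authors' exact setup.
import re

from rouge_score import rouge_scorer

class UnicodeTokenizer:
    """Word tokenizer that keeps Cyrillic (rouge-score's default drops it)."""
    def tokenize(self, text):
        return re.findall(r"\w+", text.lower())

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  tokenizer=UnicodeTokenizer())
reference = "Курс рубля вырос после решения Центрального банка."
generated = "Рубль укрепился после решения Центрального банка."
scores = scorer.score(reference, generated)
for name in ("rouge1", "rouge2", "rougeL"):
    print(name, round(scores[name].fmeasure, 2))
```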

BLEU score [5] is a metric for measuring the quality of machine-generated text, particularly in machine translation and text summarization. Originally created to evaluate machine translation, it is now used for many other NLP tasks. BLEU evaluates the similarity between machine-generated text and one or more human-written reference texts by comparing the n-grams of the machine text with those of the references.
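Reference [5] is the sacreBLEU paper, so BLEU can plausibly be computed as in the corpus-level sketch below; the sentence pair is illustrative.

```python
# Corpus-level BLEU with sacreBLEU; scores are on the 0-100 scale used in the tables.
import sacrebleu

hypotheses = ["Рубль укрепился после решения Центрального банка."]
references = [["Курс рубля вырос после решения Центрального банка."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))
```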

BERTScore is a metric that evaluates the similarity between two texts using vector representations obtained from the BERT model. BERTScore correlates well with human judgments, is more effective for model selection than earlier indicators, and is more robust on difficult examples than existing metrics [13]. This article uses its F1 score, the harmonic mean of precision and recall, which provides a balanced indicator accounting for both false positives and false negatives.
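A minimal sketch of computing the F1 BERTScore with the bert-score package; lang="ru" selects the package's default multilingual model, and the sentence pair is illustrative.

```python
# F1 BERTScore for a Russian candidate/reference pair.
from bert_score import score

cands = ["Рубль укрепился после решения Центрального банка."]
refs = ["Курс рубля вырос после решения Центрального банка."]

P, R, F1 = score(cands, refs, lang="ru")  # precision, recall, F1 tensors
print(round(float(F1.mean()), 2))  # F1 = harmonic mean of precision and recall
```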

METEOR [14] is often used to evaluate machine translation quality and provides more detailed information than BLEU: it accounts not only for the precision and recall of individual words but also for word stems, synonyms and word order, giving a more holistic quality assessment.
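A minimal METEOR sketch via NLTK. One caveat worth flagging: NLTK's METEOR relies on WordNet for synonym matching, which covers only English, so on Russian it effectively reduces to exact and stem matches.

```python
# METEOR with NLTK; inputs must be pre-tokenized in recent NLTK versions.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for (English-only) synonym matching

reference = "Курс рубля вырос после решения Центрального банка.".split()
hypothesis = "Рубль укрепился после решения Центрального банка.".split()

print(round(meteor_score([reference], hypothesis), 2))
```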

BLEURT [15] is a learned metric based on BERT and RemBERT. It takes a pair of texts as input, a candidate and a reference, and outputs a score indicating how fluent the candidate is and how well it conveys the meaning of the reference.
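A minimal BLEURT sketch using the reference implementation from the google-research/bleurt repository; the BLEURT-20 checkpoint is an assumption, since the paper does not say which checkpoint was used.

```python
# BLEURT scoring; "BLEURT-20" must be a locally downloaded checkpoint directory.
from bleurt import score as bleurt_score

scorer = bleurt_score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["Курс рубля вырос после решения Центрального банка."],
    candidates=["Рубль укрепился после решения Центрального банка."],
)
print(round(scores[0], 2))  # higher means more fluent and closer in meaning
```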

Results

The results of the above metrics on the Gazeta dataset are presented in Table 1.

Table 1 – Results on the Gazeta set

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | METEOR | BLEURT |
|---|---|---|---|---|---|---|---|
| GigaChat | 0.17 | 0.09 | 0.17 | 6.71 | 0.71 | 0.16 | 0.14 |
| YaGPT2 | 0.11 | 0.04 | 0.11 | 6.05 | 0.70 | 0.09 | 0.00 |
| ChatGPT-3.5 | 0.29 | 0.12 | 0.28 | 5.88 | 0.72 | 0.19 | 0.14 |
| ChatGPT-4 | 0.27 | 0.10 | 0.25 | 6.12 | 0.71 | 0.16 | 0.04 |
| Bard | 0.33 | 0.18 | 0.32 | 4.43 | 0.71 | 0.26 | -0.06 |
| Bing AI | 0.33 | 0.15 | 0.31 | 5.09 | 0.72 | 0.23 | 0.03 |
| YouChat | 0.33 | 0.19 | 0.32 | 9.60 | 0.72 | 0.24 | 0.22 |

The results for the XL-Sum set are shown in Table 2.

Table 2 – Results on the XL-Sum set

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | METEOR | BLEURT |
|---|---|---|---|---|---|---|---|
| GigaChat | 0.15 | 0.05 | 0.14 | 5.69 | 0.68 | 0.09 | -0.03 |
| YaGPT2 | 0.10 | 0.02 | 0.10 | 5.98 | 0.66 | 0.04 | -0.07 |
| ChatGPT-3.5 | 0.24 | 0.10 | 0.23 | 6.60 | 0.70 | 0.15 | 0.20 |
| ChatGPT-4 | 0.24 | 0.10 | 0.24 | 4.36 | 0.69 | 0.20 | -0.09 |
| Bard | 0.24 | 0.10 | 0.23 | 4.48 | 0.69 | 0.20 | -0.13 |
| Bing AI | 0.32 | 0.17 | 0.30 | 4.41 | 0.71 | 0.21 | -0.09 |
| YouChat | 0.38 | 0.23 | 0.36 | 5.94 | 0.73 | 0.26 | 0.12 |

For the WikiLingua set, the results are shown in Table 3.

Table 3 – Results on the WikiLingua set

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | METEOR | BLEURT |
|---|---|---|---|---|---|---|---|
| GigaChat | 0.33 | 0.16 | 0.32 | 5.30 | 0.73 | 0.23 | 0.16 |
| YaGPT2 | 0.20 | 0.05 | 0.19 | 5.07 | 0.72 | 0.09 | 0.12 |
| ChatGPT-3.5 | 0.27 | 0.09 | 0.26 | 5.11 | 0.71 | 0.17 | 0.00 |
| ChatGPT-4 | 0.23 | 0.05 | 0.21 | 4.54 | 0.70 | 0.14 | 0.04 |
| Bard | 0.34 | 0.17 | 0.33 | 5.04 | 0.75 | 0.24 | 0.08 |
| Bing AI | 0.41 | 0.24 | 0.39 | 4.51 | 0.75 | 0.29 | 0.09 |
| YouChat | 0.56 | 0.36 | 0.54 | 4.70 | 0.83 | 0.47 | 0.14 |

To compute an overall score, each metric was assigned an individual weight. The weights were distributed according to the specifics of the text summarization task, with more weight given to semantic similarity, to combined semantic and structural similarity, and to the degree of compression:

$$S = w_1 R_1 + w_2 R_2 + w_3 R_L + w_4 B + w_5 B_E + w_6 M + w_7 B_T + w_8 C,$$

where $S$ is the overall score, $R_1$ is ROUGE-1, $R_2$ is ROUGE-2, $R_L$ is ROUGE-L, $B$ is BLEU, $B_E$ is BERTScore, $M$ is METEOR, $B_T$ is BLEURT, $C$ is the degree of text compression (%), and $w_1, \dots, w_8$ are the metric weights.
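The exact weight values are not published in the article, so the sketch below uses illustrative weights that merely follow the stated priorities (more weight on semantic similarity, combined semantic/structural similarity, and compression); only the weighted-sum mechanics mirror the formula above.

```python
# Weighted overall score; the weight values are illustrative assumptions,
# not the authors' published weights.
WEIGHTS = {
    "rouge1": 0.05, "rouge2": 0.05, "rougeL": 0.15,   # structural overlap
    "bleu": 0.05, "bertscore": 0.25, "meteor": 0.10,  # semantic similarity
    "bleurt": 0.15, "compression": 0.20,              # meaning transfer + brevity
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def overall_score(metrics):
    """Weighted sum; BLEU and compression (0-100 scales) are rescaled to 0-1."""
    m = dict(metrics)
    m["bleu"] /= 100.0
    m["compression"] /= 100.0
    return sum(w * m[name] for name, w in WEIGHTS.items())

# Example input: YouChat's Gazeta row from Table 1, with its mean compression.
print(overall_score({"rouge1": 0.33, "rouge2": 0.19, "rougeL": 0.32,
                     "bleu": 9.60, "bertscore": 0.72, "meteor": 0.24,
                     "bleurt": 0.22, "compression": 44.89}))
```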

Figure 1 shows a chart of the overall scores for all the Russian-language datasets used.

Figure 1 – Overall scores on the Russian-language datasets

YouChat showed the best results across the set of metrics on all datasets, which highlights the model's effectiveness at processing and generating text and its ability to reproduce key elements of the content accurately.

Bard, by contrast, generated the least coherent and contextually relevant text, which led to unsatisfactory results on similarity, summarization and other natural language processing evaluations. Its lower scores may also indicate difficulty with the subtleties of the language, manifesting as irrelevant information, poor semantic consistency and inaccuracies in reproducing the main context.

GigaChat is somewhat better suited to the summarization task than ChatGPT-3.5, although overall their results are at a comparable quality level.

GigaChat reproduced context more accurately than YaGPT2 and generated more meaningful text, showing a higher summarization ability overall.

Bard received the lowest final score on the Russian-language sets.

The compression ratios between input and output texts are shown in Figure 2.

Figure 2 – Compression on the Russian-language datasets

The highest compression was achieved by YaGPT2 (80.58%), followed by GigaChat (67.14%), Bing AI (60.22%), ChatGPT-3.5 (57.83%), ChatGPT-4 (54.41%) and YouChat (44.89%); Bard compressed the least (41.89%).
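How the compression percentage is computed is not spelled out in the article; a plausible reading, sketched below, is the share of the source length removed by the summary, with length measured in tokens (an assumption).

```python
# Compression as the share of the source removed by the summary.
def compression_pct(source_len: int, summary_len: int) -> float:
    return (1 - summary_len / source_len) * 100

# E.g., a 199-token summary of a 1,024-token source gives ~80.6%,
# close to YaGPT2's reported 80.58%.
print(round(compression_pct(1024, 199), 2))
```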

The scores obtained after summarizing in English on the CNN/Daily Mail dataset are shown in Table 4. Note that YaGPT2 does not work with English-language texts and was therefore excluded from the models in the subsequent analysis.

Table 4 – Results on the CNN/Daily Mail set

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | METEOR | BLEURT |
|---|---|---|---|---|---|---|---|
| GigaChat | 0.32 | 0.16 | 0.30 | 6.27 | 0.72 | 0.16 | -0.29 |
| ChatGPT-3.5 | 0.28 | 0.11 | 0.25 | 5.35 | 0.71 | 0.13 | -0.34 |
| ChatGPT-4 | 0.30 | 0.09 | 0.27 | 5.02 | 0.71 | 0.16 | -0.37 |
| Bard | 0.32 | 0.17 | 0.30 | 7.04 | 0.72 | 0.16 | -0.55 |
| Bing AI | 0.33 | 0.15 | 0.31 | 5.63 | 0.72 | 0.16 | -0.38 |
| YouChat | 0.39 | 0.19 | 0.36 | 5.48 | 0.73 | 0.22 | -0.28 |

The results for the XSum set are shown in Table 5.

Table 5 – Results on the XSum set

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | METEOR | BLEURT |
|---|---|---|---|---|---|---|---|
| GigaChat | 0.38 | 0.22 | 0.36 | 6.01 | 0.74 | 0.19 | -0.27 |
| ChatGPT-3.5 | 0.34 | 0.14 | 0.31 | 5.09 | 0.72 | 0.18 | -0.28 |
| ChatGPT-4 | 0.34 | 0.11 | 0.31 | 4.58 | 0.71 | 0.18 | -0.36 |
| Bard | 0.36 | 0.20 | 0.35 | 4.31 | 0.73 | 0.15 | -0.36 |
| Bing AI | 0.42 | 0.22 | 0.40 | 4.45 | 0.74 | 0.25 | -0.36 |
| YouChat | 0.42 | 0.21 | 0.39 | 5.47 | 0.73 | 0.25 | -0.17 |

The overall scores on the English-language data are shown in Figure 3. The results show that summarization quality depends on the source language.

Figure 3 – Overall scores on the English-language datasets

GigaChat achieved the highest overall score among the models reviewed.

On the English-language sets, ChatGPT-4 showed the lowest quality and also compressed text less than the other models.

The compression ratios between input and output texts are shown in Figure 4.

Figure 4 – Compression on the English-language datasets

The highest compression on the English-language datasets was achieved by Bard (77.29%), followed by ChatGPT-3.5 (71.98%), GigaChat (71.18%), Bing AI (68.65%) and YouChat (59.37%); ChatGPT-4 compressed the least (34.88%).

Overall, all of the models showed acceptable results on the selected metrics, which indicates that they can be used for text summarization tasks. However, given the models' differing capabilities, results may vary across use cases depending on model size, task type and language features, which calls for further research.

Conclusion

In this paper, we compared the quality of automatic text summarization by several natural language processing neural networks: GigaChat, YaGPT2, ChatGPT-3.5, ChatGPT-4, Bard, Bing AI and YouChat. To this end, datasets of Russian texts, with English texts for comparison, were selected and preprocessed, and the same list of texts was summarized by each model. The results were then scored with the ROUGE [4], BLEU score [5], BERTScore [13], METEOR [14] and BLEURT [15] metrics, which compare the original texts with those generated during automatic summarization. An overall score across all metrics was also computed, with each metric weighted according to its importance for the summarization task.

The data obtained in this comparison contribute to a deeper understanding of the models under consideration and should help practitioners choose artificial intelligence models for text summarization tasks, serving as a basis for future work.

In the future, we plan to investigate how models with different configurations handle text processing.

References
1. Goyal, T., Li, J. J., & Durrett, G. (2022). News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.
2. Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2023). Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848.
3. Gusev, I. (2020). Dataset for automatic summarization of Russian news. In Artificial Intelligence and Natural Language: 9th Conference, AINL 2020, Helsinki, Finland, October 7–9, 2020, Proceedings 9 (pp. 122-134). Springer International Publishing.
4. Lin, C. Y. (2004, July). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81).
5. Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
6. Bhaskar, A., Fabbri, A., & Durrett, G. (2023, July). Prompted opinion summarization with GPT-3.5. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 9282-9300).
7. Tang, L., Sun, Z., Idnay, B., Nestor, J. G., Soroush, A., Elias, P. A., ... & Peng, Y. (2023). Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1), 158.
8. Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., ... & Awadalla, H. H. (2023). How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210.
9. Jiao, W., Wang, W., Huang, J. T., Wang, X., & Tu, Z. (2023). Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.
10. Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. arXiv preprint arXiv:1808.08745.
11. Nallapati, R., Zhou, B., Gulcehre, C., & Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
12. Hasan, T., Bhattacharjee, A., Islam, M. S., Samin, K., Li, Y. F., Kang, Y. B., ... & Shahriyar, R. (2021). XL-sum: Large-scale multilingual abstractive summarization for 44 languages. arXiv preprint arXiv:2106.13822.
13. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
14. Banerjee, S., & Lavie, A. (2005, June). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72).
15. Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
16. Ladhak, F., Durmus, E., Cardie, C., & McKeown, K. (2020). WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093.

Peer Review


The subject of the research in this article is a comparison of the quality of automatic summarization of Russian-language texts by various artificial intelligence models. The research methodology includes the selection and preprocessing of text datasets in Russian and English, the generation of summaries of these texts by different AI models, and the assessment of summary quality with the standard metrics ROUGE, BLEU, BERTScore, METEOR and BLEURT.

The metrics used in the work form an integrated approach to assessing the quality of automatic summarization, covering different aspects: the accuracy of individual words and phrases (ROUGE, BLEU), semantic similarity and word order (BERTScore, METEOR), and the overall transfer of the source text's meaning (BLEURT). Each metric has its advantages and disadvantages; together they allow the most objective assessment and a comparison of the effectiveness of different automatic summarization models, although individual metrics may differ somewhat because they account for different linguistic factors.

The topic is relevant because automatic text summarization is actively researched in natural language processing and has many practical applications, and the effectiveness of different approaches had not previously been compared for Russian. The scientific novelty of the work lies in the first comparative study of the quality of Russian-language text summarization across a number of popular artificial intelligence models. The presentation style is scientific, the text is structured, and the main sections follow the logic of the study. The content fully covers the stated topic. The bibliography is relevant and covers recent work in this subject area.

The results are of interest to specialists in computational linguistics and natural language processing and can be used to select optimal AI models for automatic summarization tasks. The article is therefore relevant, has scientific novelty, and can be recommended for publication.

Recommendations for further research:
1. Expanding the list of compared summarization models to include the most advanced and popular architectures.
2. A more detailed analysis of how model configuration (size, volume of training data, etc.) affects summarization quality.
3. Studying how the models behave on texts from different subject areas and in different languages.
4. Developing combined approaches that use several models at different stages of the summarization process.
5. Comparing the outputs with expert-written summaries to identify the shortcomings of existing algorithms.

Such research will provide a better understanding of the capabilities and limitations of modern automatic summarization models.