
Taxes and Taxation
Reference:

Statistical analysis of the differentiation of regions of the Russian Federation by tax revenues of consolidated budgets using the R language

Apal'kova Tamara Gennadievna

ORCID: 0000-0001-8094-1588

PhD in Economics

Associate Professor, Department of Mathematics, Financial University under the Government of the Russian Federation

49 Leningradsky Prospekt str., Moscow, 125993, Russia

apalkova.t.g@yandex.ru
Levchenko Kirill Gennadievich

ORCID: 0009-0008-7380-3388

PhD in Physics and Mathematics

Associate Professor, Department of Mathematics, Financial University under the Government of the Russian Federation

49 Leningradsky Prospekt str., Moscow, 125167, Russia

kglevchenko@fa.ru

DOI:

10.7256/2454-065X.2024.3.70590

EDN:

RUQJWX

Received:

27-04-2024


Published:

08-05-2024


Abstract: This article applies descriptive statistics and multivariate classification methods to describe regional features in the formation of the tax components of consolidated budget revenues in the Russian Federation. The aim of the work is to demonstrate the simplicity and effectiveness of mathematical and statistical methods, and of the functionality of the open source language R, for the structural analysis of tax revenues, the identification of regional specifics, and the comparative analysis of regions in terms of tax revenues. The apparatus of mathematical statistics implemented in R opens up wide opportunities for classifying tax subjects, including multivariate classification, and significantly facilitates analysis, ranking and planning. The functionality described in the article can be used in forming and adjusting tax policy at different levels. The possibilities of mathematical statistics in combination with the tools of the R language are illustrated by a classification analysis of the regions of the Russian Federation, with the absolute and relative values of tax revenues in regional budget revenues selected as classification features. Two classifications are considered: by federal district and by the "natural" stratification obtained through cluster analysis. Mathematical statistics and, especially, the tools of the R language are used unreasonably rarely in research of this kind, despite their ease of use and the absence of any need for special training; these circumstances determine the relevance of this article.
Aggregation by federal districts identified the Ural Federal District as the leader in terms of the average regional share of tax revenues in budget revenue, and the North Caucasus Federal District as the district with the lowest average regional contributions of tax payments to regional budgets. The analysis of the natural stratification of the regions by their relative tax contributions to consolidated budgets identified four groups: the most typical regions, subsidized regions, donor regions, and regions in which the most expensive assets of enterprises of the Russian Federation are concentrated.


Keywords:

R language, mathematical statistics, regional tax revenues, multivariate classification, cluster analysis, descriptive statistics, budget tax revenues, tax collection statistics, subsidized regions, tax analysis methods

This article is automatically translated.

Introduction

A study of publications on the features, structure and differentiation of tax payments and fees [1-5] has shown that mathematical and statistical analysis is rather rare in these works and in most cases reduces to the analysis of dynamics [2, 3]. However, in combination with modern computing tools, methods of mathematical statistics are a powerful instrument that makes it possible to examine the processes of collecting, calculating and paying taxes from different angles and thus to identify both the most problematic and the positive aspects of these processes. This article aims to demonstrate the capabilities, effectiveness and simplicity of the visualization, descriptive statistics, aggregation and clustering tools of the open source programming language R in tax analysis.

Data for modeling.

The initial data is information on the execution of the consolidated budgets of the subjects of the Russian Federation for 2020 (Ministry of Finance of the Russian Federation. Data on the execution of consolidated budgets of the subjects of the Russian Federation), provided by the source in the .xlsx file format. Data in this format can be imported into the R environment using the readxl::read_excel() function. Here and below, the entry before the double colon is the name of the package to which the function named after the double colon belongs. The file contains information on the revenues and expenditures of the consolidated budgets; revenues are divided into tax and non-tax, and the three most "significant" taxes for each region are reported separately: corporate income tax, corporate property tax and personal income tax. Figure 1 shows a fragment of the source file with slight changes: a "District" column was added after the "Region" column, and the rows with intermediate and final totals (by districts and for the country as a whole) were deleted. Data organized in this way is called "rectangular": each row contains information about one object, and each column corresponds to a specific feature. In addition to the externally obtained data, several synthetic features were added to the data set: the relative contributions of tax revenues (total and for each tax) to the revenue side of the budgets.
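The synthetic share features described above can be sketched as follows. This is only an illustration under stated assumptions: the column names (Region, District, Total, Tax, CIT, Property, PIT) and all numbers are hypothetical, and a small data frame stands in for the table that would in practice be imported with readxl::read_excel().

```r
# Toy stand-in for the imported budget table. In practice the data frame
# would come from readxl::read_excel() applied to the source .xlsx file.
# All column names and numbers below are hypothetical illustrations.
budget <- data.frame(
  Region   = c("Region A", "Region B", "Region C"),
  District = c("District 1", "District 1", "District 2"),
  Total    = c(200, 100, 400),   # total budget revenue
  Tax      = c(120,  30, 300),   # total tax revenues
  CIT      = c( 40,   5, 120),   # corporate income tax
  Property = c( 10,   3,  40),   # corporate property tax
  PIT      = c( 50,  15, 100)    # personal income tax
)

# Synthetic features: share of each tax component in total budget revenue
tax_cols <- c("Tax", "CIT", "Property", "PIT")
shares <- budget[tax_cols] / budget$Total
names(shares) <- paste0(tax_cols, "_share")
budget <- cbind(budget, shares)

print(budget)
```

Keeping the shares in the same rectangular table as the absolute values means that both sets of features can be passed to the aggregation and clustering functions without further reshaping.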

Figure 1 – A set of source data (fragment)

Data aggregation. If the researcher is interested in how tax revenues to the consolidated budgets differ across federal districts, average values can be computed for each district. The base functions aggregate() and tapply() [6] are suitable for this purpose: aggregate() when averages are needed for several features at once, and tapply() when averages are calculated for a single feature. Let us consider an example of using the aggregate() function [7] to calculate the average values of total tax revenues by federal district (Table 1), as well as the average shares of tax revenues (total and by type) in budget revenue (Table 2).

In general, the syntax of the aggregate() function looks like this:

aggregate(x = <the complete data set or a subset of its columns>,
          by = list(<one or more grouping features>),
          FUN = mean)
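A minimal sketch of both functions on toy data may make the difference clearer. The data frame, its column names and its values are hypothetical and are not taken from the source file.

```r
# Toy data: hypothetical regions, districts and tax shares
regions <- data.frame(
  District  = c("D1", "D1", "D2", "D2", "D2"),
  Tax_share = c(0.60, 0.80, 0.30, 0.50, 0.70),
  PIT_share = c(0.20, 0.30, 0.10, 0.20, 0.30)
)

# aggregate(): district means for several features at once
district_means <- aggregate(
  x   = regions[c("Tax_share", "PIT_share")],
  by  = list(District = regions$District),
  FUN = mean
)
print(district_means)

# tapply(): district means for a single feature
tax_by_district <- tapply(regions$Tax_share, regions$District, mean)
print(tax_by_district)
```

aggregate() returns a data frame with one row per group, which is convenient for building tables, while tapply() returns a named vector.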

Table 1 – Regional average values of consolidated budget revenues by type of income and federal districts, million rubles

| District | Total income | Tax and non-tax income | Tax revenues | Corporate income tax | Corporate property tax | Personal income tax |
|---|---|---|---|---|---|---|
| Far Eastern Federal District | 119511257 | 73507483 | 67380456 | 20812532 | 7739126 | 26092894 |
| Volga Federal District | 149479227 | 99578519 | 93573818 | 22474054 | 7927675 | 37393626 |
| Northwestern Federal District | 149216638 | 116133709 | 109882505 | 31548774 | 10284799 | 48888251 |
| North Caucasus Federal District | 92213275 | 29249889 | 27493643 | 3830534 | 2475400 | 12663845 |
| Siberian Federal District | 150846949 | 101698591 | 95735467 | 29090442 | 7086183 | 36871352 |
| Ural Federal District | 232046876 | 192116831 | 182897714 | 62714623 | 30552349 | 65740254 |
| Central Federal District | 283274516 | 234648498 | 215238958 | 66308098 | 14350978 | 99215141 |
| Southern Federal District | 150454181 | 89772487 | 84085745 | 18571173 | 8863769 | 33242862 |

Table 2 – Regional averages of relative contributions of tax revenues to consolidated budgets by type of income and federal districts

| District | Corporate income tax | Corporate property tax | Personal income tax | Tax revenues |
|---|---|---|---|---|
| Far Eastern Federal District | 0.15 | 0.06 | 0.21 | 0.53 |
| Volga Federal District | 0.13 | 0.05 | 0.24 | 0.59 |
| Northwestern Federal District | 0.16 | 0.09 | 0.25 | 0.63 |
| North Caucasus Federal District | 0.03 | 0.03 | 0.13 | 0.27 |
| Siberian Federal District | 0.14 | 0.04 | 0.22 | 0.54 |
| Ural Federal District | 0.26 | 0.12 | 0.27 | 0.75 |
| Central Federal District | 0.16 | 0.05 | 0.26 | 0.64 |
| Southern Federal District | 0.10 | 0.05 | 0.19 | 0.48 |

Graphs make the comparison of district averages more visual; in this case, bar charts are the most appropriate (Figures 1 and 2). Figures 1 and 2 show the average values of tax revenues for each district, using the aggregated data from Tables 1 and 2, respectively. The ggplot2::ggplot() function can be used as the visualization tool in combination with one of the geom functions, geom_bar() or geom_col() [8-10]. The variables responsible for the x and y coordinates should be specified as aesthetic mappings (through aes()): the federal district and, respectively, the average tax revenues of the budget for the district (for Fig. 1) or the average share of tax revenues in the budget for the district (for Fig. 2). In addition, geom_bar() in this situation requires the argument stat='identity', so that the y values rather than frequencies are plotted along the ordinate axis. The geom_col() function does not require this additional argument.

Figure 1 – Regional average values of tax revenues by federal districts, million rubles

Figure 2 – Regional average values of relative contributions of tax revenues to the consolidated budgets of regions by federal districts
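A bar chart of this kind might be produced roughly as follows. The sketch assumes that the ggplot2 package is installed. The shares are the district averages of total tax revenues from Table 2, while the object names and axis labels are illustrative choices, not the authors' code.

```r
library(ggplot2)

# Average regional share of tax revenues by federal district (from Table 2)
district_shares <- data.frame(
  District = c("Far Eastern", "Volga", "Northwestern", "North Caucasus",
               "Siberian", "Ural", "Central", "Southern"),
  Share    = c(0.53, 0.59, 0.63, 0.27, 0.54, 0.75, 0.64, 0.48)
)

# geom_col() maps the y variable directly to bar height
p_col <- ggplot(district_shares, aes(x = District, y = Share)) +
  geom_col() +
  labs(x = "Federal district", y = "Average share of tax revenues")

# geom_bar() counts rows by default, so stat = "identity" is needed
p_bar <- ggplot(district_shares, aes(x = District, y = Share)) +
  geom_bar(stat = "identity")

print(p_col)
```

Both objects draw the same bars; geom_col() is simply the shorter form when the heights are already computed.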

A comparison of the diagrams shows that calculating relative indicators levels out the scale effect. For several districts (Far Eastern, Volga, Northwestern, Siberian, Central), the average regional shares of tax revenues in budgets differ little. The Central District, the leader in absolute average contribution, is inferior to the Ural District in terms of share (0.64 and 0.75 of all budget revenues, respectively). The regions of the North Caucasus Federal District are characterized on average by both the lowest amounts of taxes collected and the minimum share of tax revenues in the revenue side of the budget (0.27). The latter conclusion is quite natural: of the six subsidized regions of the Russian Federation that, as of the end of 2019, had received subsidies of at least 40% of their own income for two out of three years, three are located in the North Caucasus Federal District (Order of the Ministry of Finance "On approval of the list of Subjects of the Russian Federation in accordance with the provisions of paragraph 5 of Article 130 of the Budget Code of the Russian Federation").

Clustering. Next, we consider the possibility of identifying the natural stratification of regions by a set of features characterizing the share of tax payments in the revenue side of budgets. The features taken into account are the shares of corporate income tax, corporate property tax, personal income tax and total tax revenues. The most appropriate method in this case is cluster analysis, which provides unsupervised classification, that is, classification in which the rule of division into clusters is unknown a priori [11, 12].

The procedure of agglomerative cluster analysis in the R language can be implemented using fairly simple code. We will present it in stages with proper comments.

1) The distances between objects are calculated by the command

d <- dist(scale(PT)), where PT is the name of the data set containing the values of the classification features (the shares of tax payments in the budget), and scale() standardizes each feature before the distances are computed. The default metric is the ordinary Euclidean distance:

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad (1)$$

where d is the distance, $x_i$ is coordinate i of the first object (the value of feature i), $y_i$ is coordinate i of the second object, i = 1, ..., n, and n is the number of classification features.
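As a quick check, the Euclidean distance in formula (1) can be computed by hand and compared with the value returned by dist(). The two feature vectors below are hypothetical.

```r
# Two hypothetical objects described by n = 4 features
x <- c(0.60, 0.15, 0.05, 0.25)
y <- c(0.30, 0.05, 0.03, 0.15)

# Formula (1) written out directly
d_manual <- sqrt(sum((x - y)^2))

# The same distance as computed by dist() with its default method
d_builtin <- as.numeric(dist(rbind(x, y)))

print(c(manual = d_manual, builtin = d_builtin))
```

The two values coincide, which confirms that dist() applies the Euclidean metric unless another method is requested.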

2) Clustering in this example is carried out using Ward's method, in which the pair of clusters merged at each step is the one that yields the minimal increase in the within-group sum of squares. To do this, run the command: hcw <- hclust(d, method = <Ward's method>)

3) Next, the dendrogram is built with the command plot(hcw, cex = 0.5), and the clusters are outlined on it; their number is set by the researcher (4 in the example under consideration). The rect.hclust(hcw, k = 4) function is used for this. The result of this stage is shown in Fig. 3:

Figure 3 – The first way to visualize the results of cluster analysis (a fragment of the dendrogram: clusters 2 and 4)
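Steps 1 to 3 can be put together in one short sketch. The PT matrix here is random toy data standing in for the authors' data set, the choice of method = "ward.D2" is an assumption about which Ward variant is meant, and the final cutree() call is one possible way to extract cluster membership.

```r
set.seed(1)  # reproducible toy data

# Hypothetical stand-in for PT: 20 "regions" with 4 share features
PT <- matrix(runif(80), nrow = 20,
             dimnames = list(paste0("Region", 1:20),
                             c("Tax", "CIT", "Property", "PIT")))

d   <- dist(scale(PT))                 # Euclidean distances on standardized features
hcw <- hclust(d, method = "ward.D2")   # assumption: Ward linkage via "ward.D2"

plot(hcw, cex = 0.5)                   # dendrogram
rect.hclust(hcw, k = 4)                # outline 4 clusters on the dendrogram

groups4 <- cutree(hcw, k = 4)          # cluster number for each object
print(table(groups4))
```

The vector groups4 assigns each object to one of the four clusters and can be used as a grouping feature in later calculations.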

Note that the final number of clusters is determined empirically. There are procedures that evaluate the quality of a partition and make it possible to compare different options, but their consideration is beyond the scope of this work; the example demonstrates one of the possible partitions (4 clusters).

The functions discussed above belong to the standard R distribution and represent a minimally sufficient set of tools for clustering. However, R also provides techniques that make clustering results easier to read. For example, the following code:

library("ape") # load the ape package
colors <- c("red", "blue", "green", "black") # one color per cluster
groups4 <- cutree(hcw, k = 4) # cluster number of each object (4 clusters)
plot(as.phylo(hcw), type = "fan", tip.color = colors[groups4],
     label.offset = 0.5, cex = 0.7, no.margin = TRUE) # draw the dendrogram

is an alternative to step 3: it draws the dendrogram as a circle and shows the objects of each cluster in one color (Fig. 4).

Figure 4 – The second way to visualize the results of cluster analysis, using the functionality of the ape library (a fragment of the dendrogram: clusters 1 and 2 are shown partially, cluster 3 is shown completely)

4) At the last stage, once the composition of the clusters is known, the resulting clusters must be characterized and the key principles of the partition identified. In the described case, the composition of the clusters was determined as follows:

The first cluster: Belgorod Region, Vladimir Region, Voronezh Region, Kaluga Region, Kursk Region, Lipetsk Region, Moscow Region, Ryazan Region, Smolensk Region, Tver Region, Tula Region, Yaroslavl Region, Arkhangelsk Region, Vologda Region, Novgorod Region, Krasnodar Territory, Astrakhan Region, Volgograd Region, Rostov Region, Republic of Bashkortostan, Republic of Tatarstan, Udmurt Republic, Perm Region, Nizhny Novgorod Region, Orenburg Region, Samara Region, Saratov Region, Ulyanovsk Region, Sverdlovsk Region, Chelyabinsk Region, Irkutsk Region, Kemerovo Region, Novosibirsk Region, Omsk Region, Tomsk Region, Primorsky Territory, Khabarovsk Territory, Amur Region, Magadan Region.

The second cluster: Bryansk Region, Ivanovo Region, Kostroma Region, Oryol Region, Tambov Region, Republic of Karelia, Kaliningrad Region, Pskov Region, Republic of Adygea, Republic of Kalmykia, Republic of Crimea, Sevastopol, Republic of Dagestan, Republic of Ingushetia, Kabardino-Balkarian Republic, Karachay-Cherkess Republic, Republic of North Ossetia, Chechen Republic, Stavropol Territory, Republic of Mari El, Republic of Mordovia, Chuvash Republic, Kirov Region, Penza Region, Kurgan Region, Republic of Altai, Republic of Tyva, Republic of Khakassia, Altai Territory, Republic of Buryatia, Republic of Sakha (Yakutia), Trans-Baikal Territory, Kamchatka Territory, Jewish Autonomous Region, Chukotka Autonomous Okrug.

The third cluster: Moscow, Leningrad Region, Murmansk Region, St. Petersburg, Tyumen Region, Krasnoyarsk Territory, Sakhalin Region.

The fourth cluster: Republic of Komi, Nenets Autonomous Okrug, Khanty-Mansiysk Autonomous Okrug, Yamalo-Nenets Autonomous Okrug.

Clusters can be described, for example, by calculating the average values of features for each group. To do this, you can use the aggregate() function with the following set of arguments:

x is a data frame consisting of the features for which the average values are calculated; in the example under consideration, these are the shares of tax revenues in the revenue part of the budget;

by is the grouping feature; in this example, it is the cluster number;

FUN is the statistic to be calculated; in the example under consideration, it is the sample mean (mean).

In addition to describing the clusters themselves, it is useful for interpreting the clustering results to calculate the average of each feature over all objects (regions). This can be done with the base colMeans() function: it is enough to pass it a data frame consisting of the features for which the average values should be calculated.

Thus, you can get the following result:

Table 3 – Regional averages of relative contributions of tax revenues to consolidated budgets by income types and clusters

| Cluster | Tax revenues, total | Corporate income tax | Corporate property tax | Personal income tax |
|---|---|---|---|---|
| 1 | 0.66 | 0.17 | 0.06 | 0.26 |
| 2 | 0.39 | 0.07 | 0.03 | 0.17 |
| 3 | 0.82 | 0.36 | 0.06 | 0.30 |
| 4 | 0.73 | 0.19 | 0.25 | 0.23 |
| All regions | 0.57 | 0.14 | 0.06 | 0.23 |
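The cluster averages and the overall averages can be obtained along the following lines. The data frame and the cluster labels below are hypothetical toy values, so the output does not reproduce Table 3.

```r
# Hypothetical share features and cluster labels for six regions
features <- data.frame(
  Tax_share = c(0.70, 0.60, 0.40, 0.30, 0.80, 0.90),
  PIT_share = c(0.25, 0.27, 0.15, 0.17, 0.30, 0.32)
)
cluster <- c(1, 1, 2, 2, 3, 3)

# Mean of every feature within each cluster
cluster_means <- aggregate(x = features, by = list(Cluster = cluster), FUN = mean)

# Mean of every feature over all objects
overall_means <- colMeans(features)

print(cluster_means)
print(overall_means)
```

Comparing each row of cluster_means with overall_means shows which clusters lie above or below the average for all objects.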

The clustering results can be commented on as follows.

The characteristics of cluster No. 1 are the closest to the general regional averages.

Cluster No. 2 is characterized by the minimum regional average shares of tax payments in the revenue side of the budget. This applies both to total tax revenues and to the revenue from each of the three taxes considered separately. On average, the share of tax revenues for the regions of this cluster is 39% of budget revenue; this group can conditionally be called "Subsidized regions".

Cluster No. 3 has the highest budget contributions from total tax payments, corporate income tax and personal income tax among all four clusters, and these figures exceed the regional averages. On average, more than 80% of the budget revenue of the regions of the third cluster comes from tax payments. In addition, the total tax payments of these seven regions account for 37% of all tax payments of the regions of the Russian Federation. This group can conditionally be called "Donor regions".

The regions of cluster No. 4 stand out for the maximum share of corporate property tax payments in regional budgets. It can be concluded that these are the regions where the most expensive assets of organizations in the territory of the Russian Federation are concentrated.

Conclusion

The considered example demonstrates the effectiveness of mathematical and statistical methods in describing the characteristics of taxpayer regions. Two methods of classification are considered: aggregation by districts and identification of natural stratification. The federal districts with the minimum and maximum regional average relative contributions of tax revenues to the revenue side of the budget are identified. Cluster analysis made it possible to identify a group of donor regions and a group of subsidized regions. It should be emphasized that all the necessary calculations (visualization, calculation of group averages, clustering) were carried out using a small set of R functions with simple syntax, the use of which does not require deep programming knowledge. The advantages of using the R language in calculations are its high functionality and financial accessibility, which distinguish it favorably from specialized software and MS Excel, as well as its command syntax, the simplicity of which, in our opinion, distinguishes this language from Python.

References
1. Zhilyakov D.I., Novosel'skij S.O., Plahutina Yu.V. & Petrushina O.V. (2023). Retrospective analysis of tax revenues of the federal budget. Economic Sciences, 2(219), 173-181. Retrieved from https://ecsn.ru/wp-content/uploads/202302_173.pdf
2. Vasil'chenko, A.D. (2019). Tax revenues to the Russian budget system: statistical assessment and mobilization measures. Taxes and taxation, 5, 45-57. doi:10.7256/2454-065X.2019.5.30101
3. Kostina, A.A. (2017). Statisticheskij analiz struktury i dinamiki nalogovyh postuplenij Rossijskoj Federacii [Statistical analysis of the structure and dynamics of tax revenues of the Russian Federation]. Master's Bulletin, 6-1(69), 15-18.
4. Dedeneva, D.B. (2022). Analysis of tax revenues to the Russian budget system. Vektor of economy, 4. Retrieved from http://www.vectoreconomy.ru/images/publications/2022/4/taxes/Dedeneva.pdf
5. Selyukov, M.V. (2023). Analysis of tax revenues to the Russian budget system. Siberian Financial School, 1, 35-43. doi:10.34020/1993-4386-2023-1-35-43
6. Apal'kova, T.G., Glebov, V.I., Zadadaev, S.A., Krivolapov, S.Ya. & Levchenko, K.G. (2023). Matematicheskaya statistika. Praktikum: uchebnoe posobie [Math statistics. Training manual]. Moscow: INFRA-M, 2023. doi:10.12737/1896790
7. Wickham, H., Çetinkaya-Rundel, M. & Grolemund, G. (2023). R for Data Science, 2nd Edition. Beijing, Boston, Farnham, Sebastopol, Tokyo: O'Reilly Media, Inc. Retrieved from R for Data Science (2e), Preface to the second edition (hadley.nz)
8. Markova, S. V. (2023). Analiz dannyh na yazyke R.: uchebnik i praktikum [Data analysis in R language: textbook and workshop]. Moscow: KnoRus.
9. Mastickij, S.E. (2017). Vizualizaciya dannyh s pomoshch'yu ggplot2 [Visualizing data using ggplot2]. Moscow: DMK Press.
10. Platonov, V.V. (2020). Vizualizatsiya bolshikh dannyh v ekonomicheskikh naukakh v usloviyakh informatsionnogo obschestva [Big data visualization in economic sciences in the information society]. Russian journal of innovation economics, 10(4), 1831-1848. doi:10.18334/vinec.10.4.111373
11. Shipunov A. B., Korobejnikov, A. I. & Baldin, E. M. (2012). Analiz dannyh s R (II) [Data Analysis with R (II)]. Retrieved from https://inp.nsk.su/~baldin/DataAnalysis/R/R-07-datamining.pdf?ysclid=lvic8543su725025982
12. Dubrov, A.M., Mhitaryan, V.S. & Troshin, L.I. (2011). Mnogomernye statisticheskie metody: uchebnik [Multivariate statistical methods: textbook]. Moscow: Finance and Statistics.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The reviewed article discusses the issues of statistical analysis of the differentiation of regions of the Russian Federation by tax revenues of consolidated budgets using the R programming language. The research methodology is based on the application of mathematical and statistical modeling methods, cluster analysis and visualization methods. The authors rightly attribute the relevance of the work to the fact that the methods of mathematical statistics in combination with modern technical means are a powerful tool that allows you to highlight the processes of collection, accrual, payment of taxes and identify problematic and positive aspects of these processes. The scientific novelty of the work, according to the reviewer, consists in substantiating the possibilities and effectiveness of using visualization tools, descriptive statistics, aggregation and clustering of the open source programming language R in tax analysis. The article highlights the federal districts, which are characterized by minimum and maximum regional average relative contributions of tax revenues to the revenue side of the budget, using cluster analysis methods, groups of donor regions and subsidized regions are identified. Structurally, the following sections are highlighted in the article: Introduction, Data for modeling, Data Aggregation, Clustering, Conclusion, Bibliography. The publication contains a fragment of a set of initial data, reflects the regional average values of consolidated budget revenues by type of income and federal districts, as well as the regional average values of relative contributions of tax revenues to consolidated budgets by type of income and federal districts. The text of the article is accompanied by illustrations made using visualization tools of the R programming language. The authors consider the possibility of identifying the natural stratification of regions by a set of features characterizing the share of tax payments in the revenue side of budgets. 
The procedure of agglomerative cluster analysis using the R language code is reflected, the features of calculating distances between objects, clustering using the Ward method, constructing a dendrogram and allocating clusters on it, as well as creating a dendrogram in the form of a circle are shown. As a result of the study, four clusters were identified, for each of which regional averages of relative contributions of tax revenues to consolidated budgets by type of income were obtained, and characteristics of each cluster were given. According to the authors of the article, the advantages of using the R language in calculations are its high functionality and financial accessibility, which favorably distinguish it from specialized software and MS Excel, as well as in the command syntax, the simplicity of which distinguishes this language from Python. The bibliographic list includes 12 sources – scientific publications of domestic and foreign authors on the topic in Russian and English. The text of the publication contains targeted references to the list of references confirming the existence of an appeal to opponents. The topic of the article is relevant, the material reflects the results of the research conducted by the authors, contains elements of increment of scientific knowledge, corresponds to the topic of the journal "Taxes and Taxation", may arouse interest among readers and is recommended for publication.