Reference: Alpatov A.N., Bogatireva A.A. Data storage format for analytical systems based on metadata and dependency graphs between CSV and JSON // Software systems and computational methods. 2024. № 2. P. 1-14. DOI: 10.7256/2454-0714.2024.2.70229 EDN: TVEPRE URL: https://en.nbpublish.com/library_read_article.php?id=70229
Data storage format for analytical systems based on metadata and dependency graphs between CSV and JSON
DOI: 10.7256/2454-0714.2024.2.70229
EDN: TVEPRE
Received: 25-03-2024
Published: 01-04-2024

Abstract: In the modern information society the volume of data is constantly growing, and its effective processing is becoming key for enterprises; the transmission and storage of these data also play a critical role. Big data used in analytical systems is most often transmitted in one of two popular formats: CSV for structured data and JSON for unstructured data. However, existing file formats may not be efficient or flexible enough for certain data analysis tasks: they may not support complex data structures or provide sufficient control over metadata, while analytical tasks may require additional information about the data, such as metadata or a data schema. The subject of this study is therefore a data format based on the combined use of CSV and JSON for processing and analyzing large amounts of information. A way of using these two formats together to implement a new data format is proposed. To this end, notation is introduced for a data structure that includes CSV files, JSON files, metadata and a dependency graph; various types of functions over it (aggregating, transforming, filtering, etc.) are described, and examples of applying these functions to data are given. The proposed approach is a technique that can significantly simplify the analysis and processing of information: it rests on a formalized description that establishes clear rules and procedures for working with data, which contributes to more efficient processing. Another aspect of the approach is a criterion for choosing the most appropriate storage format, based on the mathematical principles of information theory and entropy. Introducing an entropy-based criterion for choosing a data format makes it possible to evaluate the information content and compactness of the data: entropy is calculated for the candidate formats together with weights reflecting the importance of each data value, and comparing the entropies determines the preferred transmission format. This criterion takes into account not only the compactness of the data but also the context of their use, as well as the possibility of including additional meta-information in the files themselves and supporting analysis-ready data.

Keywords: Data storage formats, JSON, CSV, Analysis Ready Data, Metadata, Data processing, Data analysis, Integration of data formats, Apache Parquet, Big Data
Introduction

The task of integrating data is extremely important in many scientific fields. The development of big data has led to a large number of disparate tools used in research and written with different technology stacks; most importantly, the data formats they use vary from tool to tool, which creates a need for data reformatting that can be difficult at big-data scale. One way to address this problem is to use programming interfaces (API, Application Programming Interface) [1]. This approach has a number of undoubted advantages, but one significant disadvantage is the high cost of data serialization and deserialization, which can become a serious bottleneck for applications. In addition, there is no standardization of data representation, so developers end up building custom integrations on top of existing or bespoke systems.

An alternative method of data exchange is to use common file formats optimized for high performance in analytical systems. Converting between data formats is in fact a fairly common way to integrate data [2]. Most often, when no highly specialized platform is involved, data is accumulated in CSV and JSON formats; in systems such as Hadoop a format such as Avro can be used.

CSV (Comma-Separated Values) is a text format designed to represent tabular data [3]. A table row corresponds to a line of text that contains one or more fields separated by commas. The first row may contain the names of the data columns, but this is not mandatory. The format does not provide data typing: all values are treated as strings. CSV is the format most commonly used for storing structured data exported from a relational database.

JSON (JavaScript Object Notation) is a text data exchange format based on JavaScript, although the format is independent of JS and can be used in any programming language [4]. It is designed for unstructured data and has gained popularity with the spread of REST and of NoSQL approaches to data storage.

Avro is a compact, fast, serializable data format developed by the Apache Software Foundation [5]. It carries a data schema in JSON format, which makes it self-describing. Avro supports various data types and data compression and is efficient for transferring data between different systems; it is widely used in Big Data and distributed systems for data exchange and storage.

The main difficulty in using such formats for storing and presenting big data is guaranteeing the availability of metadata describing the datasets so that they can be used later. Metadata greatly simplifies the integration of disparate data by giving users additional information about how a dataset was collected or generated, which allows them to evaluate its quality. Metadata may also describe the methods used in the research that produced the dataset. Unfortunately, providing metadata is not yet a universally accepted practice, so metadata is often missing or incomplete, which complicates further use of the dataset [6]. The analysis also showed that the formats considered do not perform particularly well [7], although their advantage is full platform independence.
For example, they may not support complex data structures or provide sufficient control over metadata, and they can be suboptimal for processing large amounts of data because of their structure or because of performance limitations when reading and writing. The development and use of big data file formats aimed at optimizing the storage of big data is therefore an important and urgent task.
Format development

Let us define a special case of data for which neither CSV nor JSON is the preferred transfer format. This case is illustrated below by a fragment of a class diagram (Figure 1).

Figure 1 – A fragment of the class diagram

The diagram shows two classes: Service (the list of products/services) and ServiceCost (the price history of a product). Service contains the name of a product, its purchase price, a description of its characteristics and the remaining stock. ServiceCost contains the name of the product, its price and the date on which that price was assigned. For Service the optimal transfer format is a CSV table, whereas ServiceCost is more convenient to transfer as JSON, since adding these data to the same CSV file would duplicate all product data for each of its prices. For now, it is proposed to transfer the data as two archived files; listings 1 and 2 show the structure of these files.

Listing 1 – Transfer of Service in the service.csv file
Listing 2 – Passing ServiceCost in the service_cost.json file
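The contents of the original listings are not reproduced in this text; the fragment below is only a plausible illustration of the two files, assuming the fields named in the class diagram (product name, purchase price, description and remaining stock for Service; product name, price and date for ServiceCost) and invented example values.

A possible service.csv (no header row, as headers are optional in CSV):

```csv
Laptop X1,55000,14-inch ultrabook,12
Monitor M24,8000,24-inch IPS display,40
```

A possible service_cost.json:

```json
[
  {"name": "Laptop X1", "cost": 69990, "date": "2024-01-10"},
  {"name": "Laptop X1", "cost": 64990, "date": "2024-02-15"},
  {"name": "Monitor M24", "cost": 9990, "date": "2024-01-10"}
]
```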
This format still has extra redundancy, since it contains many extra characters, and at the same time it is not informative enough, since the files carry no additional information about data types. Let us introduce a criterion for choosing an effective transfer method for the Service and ServiceCost classes, using information theory and entropy. Entropy is a measure of the information uncertainty of data [8]. It is calculated as the sum, over all data values, of the probability of each value multiplied by its base-2 logarithm, taken with a negative sign (Shannon's formula) [9]. Let $H_{CSV}$ and $H_{JSON}$ be the entropies of the data in the files service.csv and service_cost.json. We introduce additional weighting factors for each probability of occurrence of a data value: let $w_i$ and $v_j$ be the weighting factors for the probabilities of data values appearing in CSV and JSON, respectively. These coefficients may reflect the importance of each value or the context in which it is used. Entropy can then be used to evaluate information content and as a criterion for choosing a data format, as shown in (1) and (2):

$H_{CSV} = -\sum_{i=1}^{n} w_i \, p_i \log_2 p_i$   (1)

$H_{JSON} = -\sum_{j=1}^{m} v_j \, q_j \log_2 q_j$   (2)
where $p_i$ is the probability of occurrence of the i-th data value in the CSV records, and n is the number of distinct values in the CSV file;
where $q_j$ is the probability of occurrence of the j-th data value in the JSON records, and m is the number of distinct values in the JSON file.
Let us compare the entropies of the data in the CSV and JSON representations to determine the optimal transfer format. Lower entropy may indicate a more compact representation of the data: if $H_{CSV} < H_{JSON}$, CSV may be the more suitable transmission format; otherwise, if $H_{JSON} < H_{CSV}$, JSON may be preferable. Alternatively, to choose between the two formats, a slightly different approach can be used that weights the entropies with additional coefficients, as shown in (3):

$K = \alpha H_{CSV} - \beta H_{JSON}$   (3)
where $\alpha$ and $\beta$ are coefficients reflecting the importance of the entropy of each format.
If $K < 0$, the preferred format is CSV; if $K > 0$, the preferred format is JSON; if $K = 0$, the two storage formats are equally preferable. This approach allows us to take into account not only the compactness of the data but also the possibility of including key information in the files themselves. For the task of developing a new format, such key information may be additional meta-information or, for example, the presence of noise in the data: if the data contain a lot of noise or random values, high entropy may indicate the uncertainty and complexity of analyzing these data. The proposed criterion thus accounts for the context in which the data are used, their value, the degree of complexity, and the possible consequences of analysis errors caused by high entropy.

Based on the above, it is also possible to formulate the basic requirements on a data format for the big data domain:

1. Standardization and prevalence. Standardization means establishing common, unified standards for representing and processing data. Standardized data formats include, for example, XML (eXtensible Markup Language), which is used for structuring and exchanging data in various fields, including web services. In the context of big data storage formats, standardization helps create generally accepted, unified ways of representing information; this facilitates data exchange between systems and applications, simplifies software development, and ensures compatibility between tools. Prevalence refers to how widely a format is used in industry and across applications: the more widespread a format is, the more likely it is to be supported by many tools and systems, which matters greatly in big data analysis, where a developed ecosystem of programming languages, analysis tools and storage systems is expected.

2. Compactness. The format should take up a minimum amount of storage space. This is important for efficient use of storage and for transmitting data over the network, especially with large volumes of information: compact formats save storage resources, speed up data transfer and reduce network load.

3. Compression capability. Support for data compression mechanisms can significantly reduce the volume of stored data and speed up its transmission over the network.

4. Metadata storage. Storing metadata within the format plays an important role in ensuring that the stored information is interpreted fully and correctly. Metadata is data about data: it describes the structure, types, format and other characteristics of the data.

5. Support for analysis-ready data. Analysis Ready Data (ARD) is a concept in big data analysis that involves preprocessing and preparing data before analysis; it is intended to simplify and speed up the analysis process by providing ready-to-use datasets. The concept is currently being actively developed in satellite image analysis [10].

The proposed storage format is based on a synthesis of two common storage formats, CSV and JSON. This solution has a number of advantages, because CSV stores data in a tabular structure with explicitly defined columns, while JSON has a more flexible structure.
Combining them provides convenient storage of tabular data while preserving hierarchy and nested objects. Using JSON also provides data typing, which makes the data easier to understand and work with, and an additional metafile can hold type information for the CSV part, which ensures complete typing.
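Returning to the selection criterion (1)–(3), the sketch below is a minimal Python illustration of it: it computes weighted Shannon entropies over the value tokens of the CSV and JSON serializations of the same records and compares them. The function names weighted_entropy and choose_format, the tokenization of the serializations, and the default weights and coefficients (all equal to 1) are assumptions made for the example, not part of the proposed format.

```python
import math
from collections import Counter

def weighted_entropy(values, weights=None):
    """Weighted Shannon entropy (base 2) of a sequence of values.
    `weights` maps a value to its importance; missing values default to 1.0."""
    weights = weights or {}
    counts = Counter(values)
    total = sum(counts.values())
    h = 0.0
    for value, count in counts.items():
        p = count / total            # probability of the value, formula (1)/(2)
        w = weights.get(value, 1.0)  # importance weight w_i or v_j
        h -= w * p * math.log2(p)
    return h

def choose_format(csv_values, json_values, alpha=1.0, beta=1.0):
    """Return the preferred format according to K = alpha*H_CSV - beta*H_JSON, formula (3)."""
    k = alpha * weighted_entropy(csv_values) - beta * weighted_entropy(json_values)
    if k < 0:
        return "CSV", k
    if k > 0:
        return "JSON", k
    return "either", k

# Toy example: tokens taken from the two serializations of the same records.
csv_tokens = ["Laptop X1", "55000", "12", "Monitor M24", "8000", "40"]
json_tokens = ["name", "Laptop X1", "cost", "69990", "date", "2024-01-10",
               "name", "Monitor M24", "cost", "9990", "date", "2024-01-10"]
print(choose_format(csv_tokens, json_tokens))
```

How exactly the serialized data are split into values and how the weights and coefficients are chosen is a modelling decision left to the analyst; the sketch only shows the mechanics of the comparison.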
Modification of the data storage format with metadata

Column names can be added to the CSV file and, since CSV cannot store information about data types, type information can be placed in a separate metafile. The modified CSV file and the metafile with type information are presented in listings 3 and 4.

Listing 3 – Modification of service.csv
Listing 4 – The service.meta metafile
The JSON format provides data typing when deserialized in various programming languages, but to keep the format uniform the data types are also stored in a separate metafile (listing 5).

Listing 5 – The service_cost.meta metafile
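Listings 3–5 are not reproduced in this text; the fragment below is only an assumed illustration of what the modified CSV file and the two metafiles might contain, under the assumption that a metafile is a JSON document mapping column (or field) names to type names. The field names continue the running example and are not taken from the original listings.

A possible modified service.csv with a header row:

```csv
name,purchase_price,description,remainder
Laptop X1,55000,14-inch ultrabook,12
Monitor M24,8000,24-inch IPS display,40
```

A possible service.meta:

```json
{"name": "str", "purchase_price": "float", "description": "str", "remainder": "int"}
```

A possible service_cost.meta:

```json
{"name": "str", "cost": "float", "date": "datetime"}
```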
Thus, the developed format is an archive consisting of the four files described above. If necessary, the format can be extended with additional data in CSV or JSON format together with the corresponding metafiles. The name proposed for the format is "PandasKeepFormat"; the corresponding file extension is PDKF.
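A minimal sketch of assembling such an archive with Python's standard zipfile module is shown below. The file names and the .pdkf extension follow the description above; the helper name pack_pdkf is illustrative and not defined by the format.

```python
import zipfile

def pack_pdkf(archive_path, files):
    """Pack data files and their metafiles into a PDKF (ZIP) archive.
    `files` maps an archive member name to the path of a file on disk."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for member_name, source_path in files.items():
            zf.write(source_path, arcname=member_name)

pack_pdkf("service.pdkf", {
    "service.csv": "service.csv",
    "service.meta": "service.meta",
    "service_cost.json": "service_cost.json",
    "service_cost.meta": "service_cost.meta",
})
```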
Mathematical description of the format and applicable functions over the data structure

Let us now consider a mathematical description of the data format proposed in this paper, which combines CSV and JSON, and thereby define the set of valid operations on the data. Let:

• C be the set of CSV files, representing sets of rows and columns;
• J be the set of JSON files containing structured data in JSON format;
• M be the set of metadata describing the data structure in CSV or JSON, such as data types, column names and other characteristics.
Then the new data format can be defined as shown in formula (4):

$D = C \cup J, \quad A = \{\,(d, m) \mid d \in D,\; m \in M\,\}$   (4)

In this equality D is the set combining the data from CSV and JSON; its elements may be either tabular (CSV) or hierarchical (JSON) data. A is the set of pairs whose first element is a data item from D and whose second element is the corresponding metadata from M, so that each data entry is accompanied by its own metadata. This approach allows data and metadata to be combined, but it does not capture more complex dependencies in the data. Some data may depend on other data; for some tasks it may also be important to understand the context and the order in which the data are used; and some analytical scenarios require complex operations over related data elements, for example calculations that combine data from multiple sources and therefore need information about the relationships between them.

A peculiarity of the storage format proposed above is that it is relatively easy to extend, thanks to the possibility of adding additional data, as mentioned earlier: functions defined in an implementation of the format are added to a function dictionary and can be applied to the data stored in the structure. To extend the model, elements of graph theory and category theory are introduced. We additionally introduce into the model a directed graph G = (V, E), whose vertices V represent elements of D and whose edges E represent connections or dependencies between these elements.

To work with data in the proposed format, a set of applicable functions over the data must also be defined. Here, applicable functions, in the context of data structures, are functions or operations that can be correctly applied to data of the proposed structure; that is, functions that match the type or structure of the data and can be used to process or modify it. Accordingly, let F be the set of functions that can be applied to the data from D, taking into account their structure and metadata, and assume that each function has a certain character, such as aggregation, filtering, transformation, combination, etc.

Let us define some of these functions. Let D be the data structure, C the set of CSV files, J the set of JSON files, M the set of metadata, and f a function from the set of applicable functions F, where the functions of F can perform operations such as filtering, sorting, aggregation, etc. Denote a data element of D by $d_{ij}$, where i and j are the coordinates of the element in the grid; in other words $D = \{d_{ij}\}$, $1 \le i \le n$, $1 \le j \le m$, where n is the number of rows and m the number of columns. An aggregating function maps a set of elements to a summary value, for example $f_{agg}(D) = \sum_{i,j} d_{ij}$ for a numeric column. A transforming function converts data from one structure to a new structure. A column-extraction function extracts a specific column from CSV or JSON data (if the data are organized appropriately): let $d$ be an element of D and $m_d$ the corresponding element of M; then the function is $f_{col}(d, m_d, \text{name})$. For example, if d is a CSV file, $m_d$ its metadata, and the requested name is "age", then $f_{col}$ returns the age column. A value-filtering function keeps only those data rows in which a certain column satisfies a given condition.
Similarly to the column-extraction function, the filtering function can be expressed as $f_{filter}(d, m_d, \text{condition})$, where condition determines which rows of data should be kept after filtering; the condition is checked for a specific column, and only the rows that satisfy it remain in the resulting dataset. For example, if d is a JSON file, $m_d$ its metadata, and the condition is "price > 100", then the function keeps only the records with a price above $100. Taking all of the above into account, the new data format can finally be defined as shown in formula (5):

$\mathrm{PDKF} = (A, G, F)$   (5)

where A is the set of data–metadata pairs from (4), G is the dependency graph between the data elements, and F is the set of applicable functions.
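A minimal Python sketch of this model is given below: the data–metadata pairs A are represented as a dictionary of pandas DataFrames plus a metadata dictionary, the dependency graph G as a networkx DiGraph, and two of the applicable functions (column extraction and value filtering) as ordinary Python functions. The names, the example values and the use of networkx are illustrative assumptions, not part of the format specification.

```python
import pandas as pd
import networkx as nx

# A: data-metadata pairs (two datasets from the running example)
data = {
    "service": pd.DataFrame({"name": ["Laptop X1", "Monitor M24"],
                             "purchase_price": [55000.0, 8000.0],
                             "remainder": [12, 40]}),
    "service_cost": pd.DataFrame({"name": ["Laptop X1", "Laptop X1"],
                                  "cost": [69990.0, 64990.0],
                                  "date": ["2024-01-10", "2024-02-15"]}),
}
metadata = {
    "service": {"purchase_price": "float", "remainder": "int"},
    "service_cost": {"cost": "float", "date": "datetime"},
}

# G: dependency graph between the data elements
G = nx.DiGraph()
G.add_edge("service_cost", "service")  # the price history depends on the product list

# F: two applicable functions - column extraction and value filtering
def f_col(d, m, name):
    """Extract a named column from a dataset, guided by its metadata."""
    return d[name]

def f_filter(d, m, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    return d[d[column].apply(predicate)]

print(f_col(data["service"], metadata["service"], "purchase_price"))
print(f_filter(data["service_cost"], metadata["service_cost"], "cost", lambda x: x > 100))
```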
Implementation of the format handler

This section presents the implementation of the PDKF format handler in the Python programming language, which converts the data from a file into a list of pandas DataFrames. A fragment of the handler's source code is presented in listing 6.

Listing 6 – Implementation of the format handler
The following dependencies are required to implement the handler:
• pandas – a popular library for big data analytics that provides the DataFrame data type;
• zipfile – a built-in library for working with ZIP archives.
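Since listing 6 is not reproduced in this text, the sketch below shows how such a handler might look, under the assumptions already used above: the archive pairs each .csv/.json data file with a .meta file containing a JSON map of column names to type names. The function name read_pdkf and the returned dictionary are illustrative and need not match the published module's actual interface.

```python
import io
import json
import zipfile

import pandas as pd

def read_pdkf(path):
    """Read a PDKF (ZIP) archive and return a dict of pandas DataFrames keyed by file stem."""
    frames = {}
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        for name in names:
            if "." not in name:
                continue
            stem, ext = name.rsplit(".", 1)
            if ext not in ("csv", "json"):
                continue  # metafiles are loaded together with their data file below
            raw = zf.read(name)
            if ext == "csv":
                df = pd.read_csv(io.BytesIO(raw))
            else:
                df = pd.read_json(io.BytesIO(raw))
            meta_name = stem + ".meta"
            if meta_name in names:  # apply the column types declared in the metafile
                types = json.loads(zf.read(meta_name))
                for column, dtype in types.items():
                    if dtype == "datetime":
                        df[column] = pd.to_datetime(df[column])
                    else:
                        df[column] = df[column].astype(dtype)
            frames[stem] = df
    return frames
```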
The code reads CSV and JSON files from a ZIP archive with the structure described above and creates the corresponding pandas DataFrames for further processing; it also processes the metadata and converts the DataFrame column types according to the types specified in the metafiles. An example of using the module and some manipulations with the resulting data are presented in listing 7.

Listing 7 – An example of using the module
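Listing 7 itself is not reproduced here; a usage example in the spirit described in the text, built on the read_pdkf sketch above (the real pdkf module's interface may differ), might look as follows:

```python
frames = read_pdkf("service.pdkf")

# Print summary information about every DataFrame recovered from the archive,
# then run a simple manipulation on the data: the latest known price per product.
for name, df in frames.items():
    print(f"--- {name} ---")
    df.info()

latest_prices = (frames["service_cost"]
                 .sort_values("date")
                 .groupby("name")
                 .tail(1))
print(latest_prices)
```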
In this case, summary information about each DataFrame obtained from the file is displayed. The data can then be manipulated further with the tools available in pandas, for example for analytics or for forecasting based on these data. The developed module is published in the Python package catalog PyPI [https://pypi.org/project/pdkf/] under the MIT license, which allows it to be installed with the console command "pip install pdkf". Figure 2 shows a screenshot of the developed solution in PyPI, the official software repository for Python and an analogue of CRAN for the R language.
Figure 2 – Placement of the pdkf format handler module in the PyPI repository
Conclusion

The proposed solution uses a combination of the CSV and JSON formats for optimal data storage and transmission. For convenient interpretation of the data and complete typing, it additionally uses metafiles that contain type information for the corresponding CSV and JSON files. The modification of the CSV format consists of adding column names and storing data-type information in a separate metafile; JSON already provides data typing during deserialization, but type information is stored in a metafile as well to keep the format uniform. The proposed solution is thus an archive consisting of CSV and JSON files together with the corresponding metafiles, providing structured data storage that preserves types and metadata. If necessary, the format can be supplemented with additional data in CSV or JSON format along with the corresponding metafiles.

References
1. Malcolm, R., Morrison, C., Grandison, T., Thorpe, S., Christie, K., Wallace, A., & Campbell, A. (2014). Increasing the accessibility to big data systems via a common services API. In 2014 IEEE International Conference on Big Data (Big Data), 883-892.
2. Wu, T. (2009). System of teaching quality analyzing and evaluating based on Data Warehouse. Computer Engineering and Design, 30(6), 1545-1547.
3. Vitagliano, G., Hameed, M., Jiang, L., Reisener, L., Wu, E., & Naumann, F. (2023). Pollock: A Data Loading Benchmark. Proceedings of the VLDB Endowment, 16(8), 1870-1882.
4. Xiaojuan, L., & Yu, Z. (2023). A data integration tool for the integrated modeling and analysis for EAST. Fusion Engineering and Design, 195, 113933.
5. Lemzin, A. (2023). Streaming Data Processing. Asian Journal of Research in Computer Science, 15(1), 11-21.
6. Hughes, L. D., Tsueng, G., DiGiovanna, J., Horvath, T. D., Rasmussen, L. V., Savidge, T. C., & NIAID Systems Biology Data Dissemination Working Group. (2023). Addressing barriers in FAIR data practices for biomedical data. Scientific Data, 10(1), 98.
7. Gohil, A., Shroff, A., Garg, A., & Kumar, S. (2022). A compendious research on big data file formats. In 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), 905-913.
8. Elsukov, P. Yu. (2017). Information asymmetry and information uncertainty. ITNOU: Information Technologies in Science, Education and Management, 4(4), 69-76.
9. Bromiley, P. A., Thacker, N. A., & Bouhova-Thacker, E. (2004). Shannon entropy, Renyi entropy, and information. Statistics and Inf. Series (2004-004), 9(2004), 2-8.
10. Dwyer, J. L., Roy, D. P., Sauer, B., Jenkerson, C. B., Zhang, H. K., & Lymburner, L. (2018). Analysis ready data: enabling analysis of the Landsat archive. Remote Sensing, 10(9), 1363.